How to store mismatching rows from two files to a new file - linux

I have two input files as follows, and I need to write the rows of the second file that do not appear in the first file to a new file. The columns in each file are separated by tabs.
Input 1
1 94564350 . C A
1 94564350 . C T
Input 2
1 94564351 . A T
1 94564351 . A C
1 94564350 . C A
The expected output is
1 94564351 . A T
1 94564351 . A C
I have tried this command
awk -F"\t" 'NR==FNR{a[$0];next}($2 in a)&& $1>=3' fileB fileA >fileC
but it does not work.
awk 'NR == FNR{a[$0];next} !($0 in a)' fileA fileB
The command above also takes too much time on big files. Are there any other options to do the same?

Try this, taken from Idiomatic awk:
awk 'NR == FNR{a[$0];next} !($0 in a)' fileA fileB
You don't need to pass -F"\t": the command compares whole lines ($0), so the field separator plays no part.
Test
$ awk 'NR == FNR{a[$0];next} !($0 in a)' fileA fileB
1 94564351 . A T
1 94564351 . A C
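For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two sample inputs as tab-separated files in a scratch directory and runs the same filter. The grep line is an alternative I am adding, not part of the answer above; whether it is any faster on big files depends on the grep implementation and the data, so treat it as something to benchmark rather than a guaranteed win.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample data from the question, tab-separated.
printf '1\t94564350\t.\tC\tA\n1\t94564350\t.\tC\tT\n' > fileA
printf '1\t94564351\t.\tA\tT\n1\t94564351\t.\tA\tC\n1\t94564350\t.\tC\tA\n' > fileB

# First file (NR==FNR): remember every line of fileA.
# Second file: print the lines of fileB that were never seen in fileA.
awk 'NR == FNR { seen[$0]; next } !($0 in seen)' fileA fileB > fileC

# Alternative (an assumption to benchmark, not from the answer above):
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from fileA.
grep -vxFf fileA fileB > fileC_grep

cat fileC
```

Both output files should contain the two 94564351 rows and nothing else.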

Related

AWK to filter to files if their columns match

I basically am working with two files (file1 and file2). The goal is to write a script that pulls rows from file1 if columns 1, 2 and 3 match between file1 and file2. Here's the code I have been playing with:
awk -F'|' 'NR==FNR{c[$1$2$3]++;next};c[$1$2$3] > 0' file1 file2 > filtered.txt
file1 and file2 both look like this (but have many more columns):
name1 0 c
name1 1 c
name1 2 x
name2 3 x
name2 4 c
name2 5 c
The awk code I provided isn't producing any output. Any help would be appreciated!
Your delimiter isn't a pipe, so with -F'|' every line is a single field. Try this:
$ awk 'NR==FNR {c[$1,$2,$3]++; next} c[$1,$2,$3]' file1 file2 > filtered.txt
or
$ awk 'NR==FNR {c[$0]++; next} c[$0]' file1 file2 > filtered.txt
However, if you're matching the whole line, it is perhaps easier with grep:
$ grep -xFf file1 file2 > filtered.txt
awk '{key=$1 FS $2 FS $3} NR==FNR{file2[key];next} key in file2' file2 file1
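A side note on why the answers above build the key as $1,$2,$3 or $1 FS $2 FS $3 instead of $1$2$3: plain concatenation can make different rows collide. The sketch below uses made-up rows (hypothetical data, not from the question) chosen so that the collision shows up.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical data: the rows differ, but "ab"+"c"+"x" and "a"+"bc"+"x"
# both concatenate to the same string "abcx".
printf 'ab c x\n' > file2
printf 'a bc x\nab c x\n' > file1

# Concatenated key: both rows of file1 match (the first is a false positive).
concat=$(awk 'NR==FNR { k[$1 $2 $3]; next } ($1 $2 $3) in k' file2 file1)

# Comma key: awk joins the parts with SUBSEP, so only the true match is kept.
subsep=$(awk 'NR==FNR { k[$1,$2,$3]; next } ($1,$2,$3) in k' file2 file1)

printf 'concatenated key:\n%s\n' "$concat"
printf 'SUBSEP key:\n%s\n' "$subsep"
```

With the concatenated key both rows of file1 are printed; with the comma key only "ab c x" is.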

Using awk to calculate word counts in multiple columns

Input file (tab separated)
1 . Hello World . 51.4 . This is a text . 200
2 . Another line . 16.4 . Some more words . 600
Output desired (tab separated)
Hello World . 2 . This is a text . 4
Another line . 2 . Some more words . 3
The output is columns 2 and 4, and their word counts
I've gotten to
awk '{print $2, "\t", NF}' > output.tsv
but don't know how to do this for multiple columns in a single command
awk to the rescue!
awk 'BEGIN {FS=OFS="\t"}
{print $2,split($2,x," +"),$4,split($4,x," +")}' file
Hello World 2 This is a text 4
Another line 2 Some more words 3
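Here is a self-contained sketch of the same idea, assuming the five columns really are separated by single tabs. The file names input.tsv and output.tsv and the array name w are placeholders of mine.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample rows from the question, with real tabs between the five columns.
printf '1\tHello World\t51.4\tThis is a text\t200\n' > input.tsv
printf '2\tAnother line\t16.4\tSome more words\t600\n' >> input.tsv

# split() returns the number of pieces, which here is the word count.
# The array w is only scratch space and is overwritten by every call.
awk 'BEGIN { FS = OFS = "\t" }
     { print $2, split($2, w, " +"), $4, split($4, w, " +") }' input.tsv > output.tsv

cat output.tsv
```

One caveat: with " +" as the separator, a column that starts or ends with a space yields an empty piece and inflates the count. Passing a single space, split($2, w, " "), makes awk split the way it splits fields by default, which ignores leading and trailing blanks.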

Count duplicates from several files

I have five files which contain some duplicate strings.
file1:
a
file2:
b
file3:
a
b
file4:
b
file5:
c
So I used awk 'NR==FNR{A[$0];next}$0 in A' file1 file2 file3 file4 file5
It prints only a, but as you can see, the string b is repeated 3 times across the other files.
So how can I get all the repeated strings (a and b) by comparing every file with every other, using a one-line command? Also, how do I get the number of repeats for each element?
I suggest with GNU sort and uniq:
sort file[1-5] | uniq -dc
Output:
2 a
3 b
From man uniq:
-d: only print duplicate lines
-c: prefix lines by the number of occurrences
You can use one of these:
awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
or
awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
You could also test for specific counts, here a=2 and b=3, and print the count next to each line:
awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line, count[line]} }' file1 file2 file3 file4 file5
test:
$ awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
a
b
$ awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
a
b
$ awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line, count[line]} }' file1 file2 file3 file4 file5
a 2
b 3
In awk:
$ awk '{ a[$1]++ } END { for(i in a) if(a[i]>1) print i,a[i] }' file[1-5]
a 2
b 3
It counts the occurrences of each record (a single character in this case) and prints the ones with a count greater than one.
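To try the counting approach end to end, here is a minimal sketch that recreates the five files from the question in a scratch directory. The sort at the end is my addition: awk visits array keys in an unspecified order, so sorting keeps the output stable.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# The five files from the question.
printf 'a\n' > file1
printf 'b\n' > file2
printf 'a\nb\n' > file3
printf 'b\n' > file4
printf 'c\n' > file5

# Count every line across all files, then report those seen more than once.
# "for (line in count)" visits keys in an unspecified order, hence the sort.
dups=$(awk '{ count[$0]++ }
            END { for (line in count) if (count[line] > 1) print line, count[line] }' \
       file1 file2 file3 file4 file5 | sort)

printf '%s\n' "$dups"
```

Note that this counts total occurrences, so a string repeated inside a single file would also be reported as a duplicate.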

Combine Two Files With Common Column

I have two files that look like the following
First File:
FileA
FileB
FileC
Second File:
FileA 2
FileC 2
I want the third file to look like the following:
FileA FileA 2
FileB
FileC FileC 2
Basically I'm doing a selective paste. I'm open to any awk or sed solution in order to achieve the desired results.
It's a job for join:
join -a1 -o 1.1 2.1 2.2 file1 file2
Using awk you can do:
awk 'FNR == NR{a[$1]=$0; next} {print $0, a[$1]}' file2 file1
FileA FileA 2
FileB
FileC FileC 2
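A reproducible sketch of the awk variant follows. The one change I made is a membership test, so that rows without a partner (FileB here) are printed without a trailing space; the array name row is a placeholder. If you go with join instead, keep in mind that it expects both inputs to be sorted on the join field, which the sample data already is.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# The two files from the question.
printf 'FileA\nFileB\nFileC\n' > file1
printf 'FileA 2\nFileC 2\n' > file2

# Read file2 first and index its rows by column 1, then walk file1 and
# append the stored row only when the key exists in file2.
merged=$(awk 'NR == FNR { row[$1] = $0; next }
              { print (($1 in row) ? $0 OFS row[$1] : $0) }' file2 file1)

printf '%s\n' "$merged"
```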

linux command to get the last appearance of a string in a text file

I want to find the last appearance of a string in a text file with linux commands. For example
1 a 1
2 a 2
3 a 3
1 b 1
2 b 2
3 b 3
1 c 1
2 c 2
3 c 3
In such a text file, I want to find the line number of the last appearance of b, which is 6.
I can find the first appearance with
awk '/ b / {print NR;exit}' textFile.txt
but I have no idea how to do it for the last occurrence.
cat -n textfile.txt | grep " b " | tail -1 | cut -f 1
cat -n prints the file to STDOUT prepending line numbers.
grep keeps all lines containing " b " (you can use grep -E for more advanced patterns or grep -F for faster matching of fixed strings; egrep and fgrep are the older names for these)
tail -1 prints last line of those lines containing "b"
cut -f 1 prints first column, which is line # from cat -n
Or you can use Perl if you wish (It's very similar to what you'd do in awk, but frankly, I personally don't ever use awk if I have Perl handy - Perl supports 100% of what awk can do, by design, as 1-liners - YMMV):
perl -ne '{$n=$. if / b /} END {print "$n\n"}' textfile.txt
This can work:
$ awk '{if ($2~"b") a=NR} END{print a}' your_file
We check whether the second field matches "b" and record the line number. The variable is overwritten on every match, so by the time we finish reading the file it holds the last one.
Test:
$ awk '{if ($2~"b") a=NR} END{print a}' your_file
6
Update based on sudo_O's advice:
$ awk '{if ($2=="b") a=NR} END{print a}' your_file
to avoid matching something like abc in the 2nd field.
This shorter one is also valid (I keep the one above because it is the one I thought of first :D):
$ awk '$2=="b" {a=NR} END{print a}' your_file
Another approach if the "b" rows are always grouped together (it may be more efficient than waiting until the end of the file, but it prints nothing when the group runs to the last line):
awk 'NR==1||$2=="b",$2=="b"{next} {print NR-1; exit}' file
or
awk '$2=="b"{f=1} f==1 && $2!="b" {print NR-1; exit}' file
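To check these against the sample data, here is a small sketch that recreates the file and finds the last match two ways. The grep -n pipeline is a variation of the cat -n answer that I am adding; the variable names are placeholders.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample file from the question, one row per argument.
printf '%s\n' '1 a 1' '2 a 2' '3 a 3' '1 b 1' '2 b 2' '3 b 3' \
              '1 c 1' '2 c 2' '3 c 3' > textFile.txt

# Overwrite n on every match; after the last line it holds the last match.
last_awk=$(awk '$2 == "b" { n = NR } END { print n }' textFile.txt)

# Same idea with grep: -n prefixes each hit with "LINE:", tail keeps the last hit.
last_grep=$(grep -n ' b ' textFile.txt | tail -n 1 | cut -d: -f1)

echo "$last_awk $last_grep"
```

Both variables should hold 6 for this input.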

Resources