extracting unique values between 2 sets/files - linux

Working in a Linux/shell environment, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.
How do I do this from the command line?
many thanks!

$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7
Explanation of how the code works:
If we're working on file1, track each line of text we see.
If we're working on file2, and have not seen the line text, then print it.
Explanation of details:
FNR is the current file's record number
NR is the current overall record number from all input files
FNR==NR is true only when we are reading file1
$0 is the current line of text
a[$0] is a hash with the key set to the current line of text
a[$0]++ tracks that we've seen the current line of text
!($0 in a) is true only when we have not seen the line text
Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given.
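For the opposite direction (entries in file1 that are not in file2), the same one-liner works with the file arguments swapped; on the sample data it prints the lone 5:
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file2 file1
5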

Using some lesser-known utilities:
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted
This will output duplicates: if file1 contains one '3' but file2 contains two, comm will still output one '3'. If this is not what you want, pipe the output from sort through uniq before writing it to a file:
sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
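If your shell supports process substitution (bash, zsh, ksh), the temporary files can be skipped entirely:
comm -1 -3 <(sort file1) <(sort file2)
or, with duplicates removed as above:
comm -1 -3 <(sort -u file1) <(sort -u file2)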
There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.

I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2
Results of my benchmarks in short:
Do not use grep -Fxf, it's much slower (2-4 times in my tests).
comm is slightly faster than join.
If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (The awk solutions, of course, do not require sorted input.)
awk1 + awk2 supposedly use more RAM and less CPU. Real run times are lower for comm, probably because the sort processes in the process substitutions run in parallel with it. CPU times are lower for awk1 + awk2.
For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
Typical results of fastest runs
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
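For anyone repeating the measurements, a minimal sketch of the kind of timing loop used (assuming files named file1 and file2 as above; take the fastest of a few runs):
$ for i in 1 2 3; do time awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2 > /dev/null; done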

How about:
diff file_1 file_2 | grep '^>' | cut -c 3-
This would print the entries in file_2 which are not in file_1. For the opposite result, just replace '>' with '<'. 'cut' removes the first two characters added by 'diff', which are not part of the original content.
The files don't even need to be sorted.
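For example, the reverse direction on the same files:
diff file_1 file_2 | grep '^<' | cut -c 3-
This prints the entries in file_1 which are not in file_2.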

with grep:
grep -F -x -v -f file_1 file_2
(-F treats the patterns as fixed strings, -x matches whole lines only, -v inverts the match, and -f file_1 reads the patterns from file_1.)

here's another awk solution
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7

$ cat file1 file1 file2 | sort | uniq -u
6
7
uniq -- report or filter out repeated lines in a file
... Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.
-u Only output lines that are not repeated in the input.
Print file1 twice to make sure all entries from file1 are skipped by uniq -u.
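The cat can also be dropped, since sort accepts multiple file arguments; a shorter equivalent:
$ sort file1 file1 file2 | uniq -u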

cat file1 file2 | sort -u > unique
(Note that this produces the union of all distinct lines from both files, not just the lines missing from file1.)

If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that.
However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program, here sketched as runnable Python, would be:
def find_unique_values(file1, file2):
    contents1 = [line.strip() for line in open(file1)]  # values from file1
    contents2 = [line.strip() for line in open(file2)]  # values from file2
    for value2 in contents2:
        found = False
        for value1 in contents1:
            if value2 == value1:
                found = True
        if not found:
            print(value2)
This isn't the most elegant way of doing it, since it has O(n^2) time complexity (loading contents1 into a set or hash, as the awk answers above do, would bring it down to O(n)), but it will do the job.

Related

How do I compare two files in unix based on their columns

I am fairly new to unix commands, but I have two .csv files where I would like to compare the first column, either with diff or comm. Every line is different if I were to compare the whole line; that's why I want to compare the first column in each file and then have the difference printed out as a number, where a country code should not be counted more than once. The first file also has a header that I want to skip when it compares.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same country code in the two files. Duplicates of a country code should only be counted one time. That is why the expected output should be 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
With the awk I get the output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall I want the occurrences that are similar in both file1 and file2, column 1.
From the condition c[$1] == 0 in the awk script from the question, I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is now clarified that you want to count the codes that are present in both files, see the end of this answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
The option -F, specifies comma (,) as the field separator.
The condition if(NR!=1)c[$1]++; skips the header line in file1.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(The first line with code ABW.)
Alternative solution as requested in a comment.
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -f code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -f code1 code2 print all lines from code2 that contain any of the strings from code1
occurrences that are similar in both file1 and file2 column 1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
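If only the count of codes present in both files is wanted, with each code counted once, a sketch extending the same idea:
$ awk -F, 'NR==FNR{a[$1];next} $1 in a && !seen[$1]++ {n++} END {print n+0}' file1 file2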

Linux - Delete lines from file 1 in file 2 BIG DATA

I have two files:
file1:
a
b
c
d
file2:
a
b
f
c
d
e
output file (file2) should be:
f
e
I want the lines of file1 to be deleted directly in file2. The output should not be a new file; the lines should be deleted directly in file2. Of course, a temp file can be created along the way.
The real file2 contains more than 300,000 lines. That is the reason why a solution such as:
comm -13 file1 file2
doesn't work.
comm needs the input files to be sorted. You can use process substitution for that:
#!/bin/bash
comm -13 <(sort file1) <(sort file2) > tmp_file
mv tmp_file file2
Output:
e
f
Alternatively, if you have enough memory, you can use the following awk command which does not need the input to be sorted:
awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
Output (preserved sort order):
f
e
Keep in mind that the size of the array a directly depends on the size of file1.
PS: grep -vFf file1 file2 can also be used and the memory requirements are the same as for the awk solution. Given that, I would probably just use grep.
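To make the change "in place" in file2 as asked, the same temp-file-and-move pattern works with the grep variant too (do not redirect straight to file2, or the shell will truncate it before grep reads it):
grep -vFf file1 file2 > tmp_file && mv tmp_file file2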

Get a line from a file and add as a column into another file

I have two files.
File A:
Unix
File B:
A,B
C,D
E,f
.,.
.,.
N,N
Expected Output:
A,B,Unix
C,D,Unix
E,f,Unix
.,.,Unix
.,.,Unix
N,N,Unix
How is this possible within a Linux shell script?
➜ cat a
A,B
C,D
E,f
.,.
.,.
N,N
➜ cat f2
Unix
➜ awk 'BEGIN{getline f2<"f2"}; {printf("%s,%s\n",$0,f2);}' a
A,B,Unix
C,D,Unix
E,f,Unix
.,.,Unix
.,.,Unix
N,N,Unix
Assuming fileA contains only 1 word, it's better to pass it to awk as a parameter.
awk -v v="Unix" 'BEGIN{FS=OFS=","}{$(NF+1)=v}1' fileB
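If the value should be read from fileA rather than typed in, command substitution fills the awk variable; a sketch assuming fileA holds a single line:
awk -v v="$(cat fileA)" 'BEGIN{FS=OFS=","}{$(NF+1)=v}1' fileB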
If this fileA contains more words, and assuming there is only 1 per line, you could also use this:
awk 'BEGIN{FS=OFS=","}NR==FNR{a[++i]=$1;next} {for(j=1; j<=i; j++) $(NF+1)=a[j]}1' fileA fileB
And then there is good old paste as an option:
$ cat file1
UNIX
$ cat file2
A,B
C,D
E,F
$ paste -d',' file2 <(yes `cat file1` | head -n $(cat file2 | wc -l))
A,B,UNIX
C,D,UNIX
E,F,UNIX
The tricky part here is that the number of rows differs between file1 and file2, so we need to repeat the UNIX row of file1 as often as necessary (i.e., as many times as there are rows in file2) to be able to use paste.
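Under the same single-word assumption, sed offers another short route, provided the word contains no characters special to sed (such as / or &):
$ sed "s/$/,$(cat file1)/" file2
A,B,UNIX
C,D,UNIX
E,F,UNIX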

why does grep produce this behavior with these files

I have two files in this format; although they have the same number of columns, the actual files have significantly more rows.
file1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
file2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Both files have the same number of columns and rows, and are formatted exactly the same, to my knowledge. Neither file has an extension, so no .csv or .txt; it's simply the file name. I wanted to compare each row in the files and output the count of the matching rows. So look at row 1 in file1 and then row 1 in file2; if they match, that counts as one, and so forth.
To do this, I've done grep -c -f file1 file2 in the past; however, this time when I try it, grep never stops running and takes up an entire core. When I do grep -w -c -f file1 file2 I do get back a count of the matching lines in the files. What is strange, though, is that if I change the order, so grep -w -c -f file2 file1, I get back a different number.
The actual files each have 2983 rows and 61 columns. For those files, if I do grep -w -c -f file1 file2 I get back 2021, but if I do grep -w -c -f file2 file1 I get back 1950.
Also, if I do diff file1 file2 I get \ No newline at end of file at the end of the other output, but if I do diff -w file1 file2 the no-newline message is no longer returned.
If you want to find the common lines, use comm or grep -Fx
comm -12 file1 file2
grep -Fx -f file1 file2
You comment: "it worked the way I want it to. by adding a column with numbers that correspond to the line number"
So, you only want to match if line number N in file1 is the same as line number N in file2:
awk 'NR==FNR {line[FNR] = $0; next} $0 == line[FNR]' file1 file2
or with bash (this will likely be slower, but consume less memory):
while IFS= read -u3 -r line1 && IFS= read -u4 -r line2; do
    [ "$line1" = "$line2" ] && echo "$line1"
done 3<file1 4<file2
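Since the original goal was a count of matching rows rather than the rows themselves, the awk version above extends naturally; a sketch:
awk 'NR==FNR {line[FNR] = $0; next} $0 == line[FNR] {n++} END {print n+0}' file1 file2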
There is an excellent tool called diff designed for exactly this purpose:
diff -c file1 file2 #Difference
comm -1 -2 file1 file2 #Common (note that comm expects sorted input)
This will tell you matching lines, and also exactly specify where lines differ.

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it is only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And the equivalent with cut:
cut -f 1,2
The awk variable NF gives you the number of fields, so an expression like this should work for you (shown with comma-separated data; for your tab-separated file, use -F'\t' and print with "\t"):
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
Or, in a simpler way in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
