How to remove the lines which appear on file 1 from another file 2 KEEPING empty lines? - linux

I know that in other questions you solve the first part of "How to remove the lines which appear on file 1 from another file 2" with:
comm -23 file1 file2 and grep -Fvxf file1 file2
But in my case I have empty lines separating data sets that I need to keep, for example:
File 1:
A
B
C
D
E
F
G
H
I
File 2
A
D
F
I
what I want as a result:
B
C
E
G
H
The solution can be in bash or csh.
Thanks

With awk, please try the following.
awk 'FNR==NR{arr[$0];next} !NF || !($0 in arr)' file2 file1
Explanation: Adding detailed explanation for above code.
awk ' ##Mentioning awk program from here.
FNR==NR{ ##Checking if FNR==NR which will be TRUE when file2 is being read.
arr[$0] ##Creating array arr with the whole line ($0) as index.
next ##next will skip all further statements from here.
}
(!NF || !($0 in arr)) ##If line is empty OR not in arr then print it.
' file2 file1 ##Mentioning Input_file names here.
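The sample in the question lost its blank separator lines, so here is a small runnable sketch of the same one-liner on made-up data that does contain them. The scratch directory and the file contents are assumptions for illustration only.

```shell
# Hypothetical sample data in a scratch directory (not the real input files).
dir=$(mktemp -d)
printf 'A\nB\n\nC\nD\n\nE\n' > "$dir/file1"   # two blank separator lines
printf 'A\nD\n' > "$dir/file2"                # lines to remove from file1

# Pass 1 (file2): remember every line that must be removed.
# Pass 2 (file1): print blank lines (!NF) and lines that were not remembered.
result=$(awk 'FNR==NR { arr[$0]; next } !NF || !($0 in arr)' "$dir/file2" "$dir/file1")
printf '%s\n' "$result"
rm -r "$dir"
```

This prints B, a blank line, C, a blank line and E. One caveat: if file2 is empty, FNR==NR is also true while file1 is read, so nothing is printed at all.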

Related

Merging two txt files based on a common column with different row numbers

I would like to merge two whitespace-delimited files without sorting them first based on the "phenotype" column. File 1 contains the same phenotype several times, while file 2 has each phenotype only once. I need to match "phenotype" from file 1 to "category" in file 2.
File 1:
chr pos pval_EAS phenotype FDR
1 1902906 0.234 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 1475898 0.221 biomarkers-30600-both_sexes-irnt.tsv.gz 1
2 568899 0.433 continuous-4566-both_sexes-irnt.tsv.gz 1
2 2435478 0.113 continuous-4566-both_sexes-irnt.tsv.gz 1
4 1223446 0.112 phecode-554-both_sexes-irnt.tsv.gz 0.345
4 3456573 0.0003 phecode-554-both_sexes-irnt.tsv.gz 0.989
File 2:
phenotype Category
biomarkers-30600-both_sexes-irnt.tsv.bgz Metabolic
continuous-4566-both_sexes-irnt.tsv.gz Neoplasms
phecode-554-both_sexes-irnt.tsv.gz Immunological
I tried the following, but I don't get the desired output:
awk -F' ' 'FNR==NR{a[$1]=$4; next} {print $0 a[$6]}' file2 file1 > file3
With your shown samples, please try following.
awk 'FNR==NR{arr[$1]=$2;next} ($4 in arr){print $0,arr[$4]}' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file2 is being read.
arr[$1]=$2 ##Creating array arr with index of $1 and value is $2.
next ##next will skip all further statements from here.
}
($4 in arr){ ##Checking condition if 4th field is in arr then do following.
print $0,arr[$4] ##Printing current line along with value of arr with 4th field as index number.
}
' file2 file1 ##Mentioning Input_file names here.
Bonus solution: in case you also want to print the lines of file1 that have no match in file2, with N/A in place of the category, do the following.
awk 'FNR==NR{arr[$1]=$2;next} {print $0,(($4 in arr)?arr[$4]:"N/A")}' file2 file1
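A minimal sketch of the N/A variant on shortened, made-up data (the phenotype names pheA, pheB and pheC are placeholders, not values from the question):

```shell
# Hypothetical, shortened sample files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'phenotype Category' 'pheA Metabolic' 'pheB Neoplasms' > "$dir/file2"
printf '%s\n' 'chr pos pval phenotype' '1 100 0.2 pheA' '2 200 0.4 pheC' '3 300 0.1 pheB' > "$dir/file1"

# file2 is read first and builds the lookup table; file1 is then printed
# line by line with the looked-up category (or N/A) appended.
merged=$(awk 'FNR==NR { arr[$1] = $2; next }
              { print $0, (($4 in arr) ? arr[$4] : "N/A") }' "$dir/file2" "$dir/file1")
printf '%s\n' "$merged"
rm -r "$dir"
```

The header line gets "Category" appended because the header of file2 is stored in the table like any other row, and the pheC row gets N/A. Keys must match exactly: in the samples shown in the question, file2 lists the biomarkers phenotype with a .tsv.bgz suffix while file1 uses .tsv.gz, so unless that is a transcription slip those rows would not match.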

How do I compare two files in unix based on their columns

I am fairly new to Unix commands. I have two .csv files and would like to compare their first columns, with either diff or comm. Comparing whole lines is no use because every line is different, so I want to compare only the first column of each file and have the result printed as a number, where each country code is counted at most once. The first file also has a header that I want to skip in the comparison.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same country code in the two files. Duplicates of a country code should only be counted once, which is why the expected output is 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
with the awk i get output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall, I want the occurrences that are the same in column 1 of both file1 and file2.
From the condition c[$1] == 0 in the awk script from the question I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is clarified now, that you want to count the codes that are present in both files, see below at the end of the answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F , specifies comma (,) as field separator.
The condition if(NR!=1)c[$1]++; skips the header line in file1.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(Only the first line for each of the codes ABW and AFG is printed.)
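If only the number of distinct codes present in both files is needed, the counting can be done inside awk as well. The following is a sketch on made-up data, not on the files from the question; the array names seen and counted are arbitrary.

```shell
# Hypothetical sample files; file1 has a header line, file2 has none.
dir=$(mktemp -d)
printf '%s\n' 'iso_code,continent' 'ABW,x' 'AFG,x' 'AZE,x' > "$dir/file1"
printf '%s\n' 'AND,y' 'ABW,y' 'ABW,y' 'ALB,y' 'AFG,y' > "$dir/file2"

count=$(awk -F, '
    NR==FNR { if (FNR > 1) seen[$1]; next }   # file1: record codes, skip header
    ($1 in seen) && !counted[$1]++ { n++ }    # file2: first hit of a shared code
    END { print n + 0 }                       # n + 0 prints 0 if nothing matched
' "$dir/file1" "$dir/file2")
echo "$count"
rm -r "$dir"
```

With this data the shared codes are ABW and AFG, so the script prints 2 even though ABW occurs twice in file2.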
Alternative solution as requested in a comment.
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -f code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -f code1 code2 print all lines from code2 that contain any of the strings from code1
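Because fgrep -f matches the strings anywhere in a line, a stricter variant of the same pipeline may be preferable. This sketch assumes exact whole-line matching is wanted and uses grep -F in place of fgrep; the data is made up.

```shell
# Hypothetical sample files; file1 has a header line, file2 has none.
dir=$(mktemp -d)
printf '%s\n' 'iso_code,continent' 'ABW,x' 'AFG,x' 'AZE,x' > "$dir/file1"
printf '%s\n' 'AND,y' 'ABW,y' 'ABW,y' 'ALB,y' 'AFG,y' > "$dir/file2"

tail -n +2 "$dir/file1" | cut -d, -f1 | sort -u > "$dir/code1"
cut -d, -f1 "$dir/file2" | sort -u > "$dir/code2"

# -F: fixed strings, -x: the whole line must match, -c: count matching lines
shared=$(grep -c -Fx -f "$dir/code1" "$dir/code2")
echo "$shared"
rm -r "$dir"
```

Here grep -c replaces the trailing wc -l and prints 2 for the sample data.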
occurrences that are similar in both file1 and file2 column 1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0

AWK: Comparing substrings from two files and write to third file

I'm trying to compare two different files, let's say "file1" and "file2", in this way: if the 5-character substring at positions 8 to 12 matches in both files, remove the matching row from file1. Finally, write the output to file3 (the output contains the remaining rows of file1 that do not match file2).
My output is the non matching rows of file1.
Output (file3) = File1 - File2
File1
-----
aqcdfdf**45555**78782121
axcdfdf**45555**75782321
aecdfdf**75555**78782221
aqcdfdf**95555**78782121
File2
-----
aqcdfdf**45555**78782121
axcdfdf**25555**75782321
File3
-----
aecdfdf**75555**78782221
aqcdfdf**95555**78782121
I tried awk, but I need something that looks at a substring in the two files, since there are no delimiters in my files.
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2 > file3
Could you please try the following, written and tested with the shown samples in GNU awk. Once you are happy with the results on the terminal, append > file3 to the command to redirect its output.
awk '{str=substr($0,8,5)} FNR==NR{a[str];next} !(str in a)' file2 file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
str=substr($0,8,5) ##Creating str which has sub-string of current line from 8th to 12th character.
}
FNR==NR{ ##Checking condition FNR==NR which will run when Input_file2 is being read.
a[str] ##Creating array a with index of str here.
next ##next will skip all further statements from here.
}
!(str in a) ##Checking condition if str is NOT present in a then print that line from Input_file1.
' file2 file1 ##Mentioning Input_file names here.
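A runnable sketch of the same approach, assuming the ** markers in the question are only emphasis and not part of the data (the variable is renamed to key here purely for readability):

```shell
# Sample rows from the question with the ** emphasis markers removed.
dir=$(mktemp -d)
printf '%s\n' 'aqcdfdf4555578782121' 'axcdfdf4555575782321' \
              'aecdfdf7555578782221' 'aqcdfdf9555578782121' > "$dir/file1"
printf '%s\n' 'aqcdfdf4555578782121' 'axcdfdf2555575782321' > "$dir/file2"

# key = characters 8..12 of every line; file2 fills the lookup table,
# file1 rows are kept only when their key is absent from it.
awk '{ key = substr($0, 8, 5) }
     FNR==NR { a[key]; next }
     !(key in a)' "$dir/file2" "$dir/file1" > "$dir/file3"
kept=$(cat "$dir/file3")
printf '%s\n' "$kept"
rm -r "$dir"
```

Both rows with 45555 at positions 8 to 12 are dropped, which leaves the two rows shown as File3 in the question.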

Find duplicate lines based on column and print both lines and their numbers with awk

I have a following file:
userID PWD_HASH
test 1234
admin 1234
user 6789
abcd 5555
efgh 6666
root 1234
Using AWK,
I need to find both original lines and their duplicates with row numbers,
so that get the output like:
NR $0
1 test 1234
2 admin 1234
6 root 1234
I have tried the following, but it does not print the correct row number with NR :
awk 'n=x[$2]{print NR" "n;print NR" "$0;} {x[$2]=$0;}' file.txt
Any help would be appreciated!
$ awk '
($2 in a) { # look for duplicates in $2
if(a[$2]) { # if found
print a[$2] # output the first, stored one
a[$2]="" # mark it as already output
}
print NR,$0 # print the duplicated one
next # skip the storing part that follows
}
{
a[$2]=NR OFS $0 # store the first of each with NR and full record
}' file
Output (with the header in file):
2 test 1234
3 admin 1234
7 root 1234
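To get the numbering asked for in the question (1, 2, 6), the header can be skipped and the row number counted without it. A sketch of that variation, with first and n as arbitrary names:

```shell
# The sample file from the question, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'userID PWD_HASH' 'test 1234' 'admin 1234' 'user 6789' \
              'abcd 5555' 'efgh 6666' 'root 1234' > "$dir/file"

dups=$(awk '
    NR == 1 { next }                  # skip the header line
    { n = NR - 1 }                    # row number counted without the header
    ($2 in first) {
        if (first[$2] != "") {        # first occurrence not printed yet
            print first[$2]
            first[$2] = ""
        }
        print n, $0
        next
    }
    { first[$2] = n OFS $0 }          # remember first occurrence of each hash
' "$dir/file")
printf '%s\n' "$dups"
rm -r "$dir"
```

For the sample file this prints "1 test 1234", "2 admin 1234" and "6 root 1234".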
Using GAWK, you can do this with the construct below:
awk '
NR>1 {
a[$2][NR-1 " " $0];
}
END {
for (i in a)
if(length(a[i]) > 1)
for (j in a[i])
print j;
}
' Input_File.txt
Create a 2-dimensional array.
In first dimension, store PWD_HASH and in second dimension, store line number(NR-1) concatenated with whole line($0).
To display only the duplicated ones, you can use the length(a[i]) > 1 condition.
Could you please try following.
awk '
FNR==NR{
a[$2]++
b[$2,FNR]=FNR==1?FNR:(FNR-1) OFS $0
next
}
a[$2]>1{
print b[$2,FNR]
}
' Input_file Input_file
Output will be as follows.
1 test 1234
2 admin 1234
6 root 1234
Explanation: Following is the explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition here FNR==NR which will be TRUE when first time Input_file is being read.
a[$2]++ ##Creating an array named a whose index is $2 and incrementing its value by 1 each time it sees same index.
b[$2,FNR]=FNR==1?FNR:(FNR-1) OFS $0 ##Creating array b whose index is $2,FNR and whose value is the row number (FNR-1, not counting the header) followed by the whole line.
next ##Using next for skipping all further statements from here.
}
a[$2]>1{ ##Checking condition where value of a[$2] is greater than 1, this will be executed when 2nd time Input_file read.
print b[$2,FNR] ##Printing value of array b whose index is $2,FNR here.
}
' Input_file Input_file ##Mentioning Input_file(s) names here 2 times.
Without using awk, but with GNU coreutils tools:
tail -n+2 file | nl | sort -k3n | uniq -D -f2
tail removes the first line.
nl adds line numbers.
sort sorts on the 3rd field.
uniq prints only the duplicated lines, comparing them after skipping the first 2 fields.

Extract lines from File2 already found File1

Using the Linux command line, I need to output the lines from text file2 that are already found in file1.
File1:
C
A
G
E
B
D
H
F
File2:
N
I
H
J
K
M
D
L
A
Output:
A
D
H
Thanks!
You are looking for the tool grep.
Check this out.
Let's say you have the inputs in the files file1 and file2:
grep -f file1 file2
will return you
H
D
A
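Plain grep -f treats every line of file1 as a regular expression that may match anywhere in a line of file2. That is harmless for single letters, but for real data the -F (fixed strings) and -x (whole line) options are usually safer. A sketch with the sample data from the question:

```shell
# The two sample files from the question, written to a scratch directory.
dir=$(mktemp -d)
printf '%s\n' C A G E B D H F > "$dir/file1"
printf '%s\n' N I H J K M D L A > "$dir/file2"

# Print the lines of file2 that are exactly equal to some line of file1.
common=$(grep -Fxf "$dir/file1" "$dir/file2")
printf '%s\n' "$common"
rm -r "$dir"
```

The lines come out in file2 order (H, D, A); pipe the result through sort if the alphabetical order from the question is required.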
A more flexible tool to use would be awk
awk 'NR==FNR{lines[$0]++; next} $0 in lines'
Example
$ awk 'NR==FNR{lines[$0]++; next} $0 in lines' file1 file2
H
D
A
What it does?
NR==FNR{lines[$0]++; next}
NR==FNR checks if the file number of records is equal to the overall number of records. This is true only for the first file, file1
lines[$0]++ Here we create an associative array with the line, $0 in file 1 as index.
$0 in lines This line works only for the second file because of the next in previous action. This checks if the line in file 2 is there in the saved array lines, if yes the default action of printing the entire line is taken
Awk is more flexible than grep, as you can compare any column in file1 with any column in file2 and choose to print any column rather than the entire line.
This is what the comm utility does, but you have to sort the files first: To get the lines in common between the 2 files:
comm -12 <(sort File1) <(sort File2)
