How to use awk to delete lines of file1 whose column 1 values exist in file2 in Ubuntu?

Say we have file1.csv like this:
"agvsad",314
"gregerg",413
"dfwer",53214
"fewf",344
and file2.csv like this:
"dfwer"
"fewf"
how can I use awk to delete the lines whose column 1 values exist in file2, producing a file3 that looks like:
"agvsad",314
"gregerg",413
By the way, I am dealing with millions of lines.

awk 'NR==FNR{seen[$0]++; next} !seen[$1]' file2.csv FS=, file1.csv should do what you want, but it will require enough memory to store an entry for each line of file2.csv.
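For readability, here is the same one-liner expanded with comments (note that FS=, placed between the file arguments sets the comma separator only while file1.csv is being read):

awk '
    NR==FNR { seen[$0]++; next }   # first file (file2.csv): record each whole line
    !seen[$1]                      # second file (file1.csv): print only if column 1 was never recorded
' file2.csv FS=, file1.csv > file3.csv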

As an alternative, using grep:
$ grep -vf file2.csv file1.csv
"agvsad",314
"gregerg",413

Related

How to compare the columns of file1 to the columns of file2, select matching values, and output to new file using grep or unix commands

I have two files, file1 and file2, where target_id is the first column of both.
I want to compare file1 to file2, and only keep the rows of file1 which match the target_id in file2.
file2:
target_id
ENSMUST00000128641.2
ENSMUST00000185334.7
ENSMUST00000170213.2
ENSMUST00000232944.2
Any help would be appreciated.
% grep -x -f file1 file2 resulted in no output in my terminal
Here is sample data that actually shows overlaps between the files.
file1.csv:
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000178862.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000179664.2,0,0
ENSMUST00000177564.2,0,0
file2.csv:
target_id
ENSMUST00000178537.2
ENSMUST00000196221.2
ENSMUST00000177564.2
Your grep command, but swapped:
$ grep -F -f file2.csv file1.csv
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000177564.2,0,0
Edit: we can add the -F argument since this is a fixed-string search. It also protects against the . being interpreted as a regex metacharacter and matching something else. Thanks to @Sundeep for the recommendation.
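grep -F still matches the strings anywhere in a line. If you want to restrict the match to the first column, a small awk sketch in the spirit of the other answers here (the header is printed too, since target_id appears in both files):

awk -F, 'NR==FNR { ids[$1]; next } $1 in ids' file2.csv file1.csv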

Extracting lines from a file containing keywords that are present in another file [duplicate]

File1 contains the keywords used for picking (in the third field, after the second comma), e.g. GOLD, BRO, ...
File2 is the file to extract lines from.
File1:
ABC,123,GOLD,20171201,GOLDFUTURE
ABC,467,SILVER,20171201,SILVERFUTURE
ABC,987,BRO,20171201,BROFUTURE
File2:
XYZ,32,RUBY,20171201,RUBY
XYZ,33,GOLD,20171201,GOLD
XYZ,34,CEMENT,20171201,CEMENT
XYZ,35,PILLAR,20171201,pillar
XYZ,36,CNBC,20171201,CNBC
XYZ,37,CBX,20171201,CBX
XYZ,38,BRO,20171201,BRO
I want a Linux command (awk, sed, cat, grep, etc.) that produces this output file:
XYZ,33,GOLD,20171201,GOLD
XYZ,38,BRO,20171201,BRO
I have found these commands online:
1. grep -F -f File1 File2
2. awk 'FNR==NR {a[$0];next} ($NF in a)' File1 File2
3. awk 'FNR==NR {a[$0];next} ($0 in a)' File1 File2
4. diff File1 File2
In point 3 I am picking up whole lines from File1 for comparison; is there any way to pick up just the keyword after the comma? Or is there any way to insert a field separator into the awk command of point 2?
Could you please try the following and let me know if it helps:
awk -F, 'FNR==NR{a[$3];next} ($3 in a)' File1 File2
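For readability, the same one-liner expanded with comments:

awk -F, '
    FNR==NR { a[$3]; next }   # first file (File1): remember each keyword in field 3
    ($3 in a)                 # second file (File2): print lines whose field 3 was remembered
' File1 File2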

shell script to compare two files and write the difference to third file

I want to compare two files and redirect the difference between the two files to third one.
file1:
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
If either file has a # before an entry, e.g. #/opt/c/c.sql, that commented entry should be skipped.
file2:
/opt/c/c.sql
/opt/a/a.sql
I want to get the difference between the two files. In this case, /opt/b/b.sql should be stored in a different file. Can anyone help me achieve this?
file1
$ cat file1 # both file1 and file2 may contain blank lines, which are ignored
/opt/a/a.sql
/opt/b/b.sql
/opt/c/c.sql
/opt/h/m.sql
file2
$ cat file2
/opt/c/c.sql
/opt/a/a.sql
Do:
awk 'NR==FNR { line[$1]; next }                      # load file2 lines into an array
     { if (!($1 in line)) { if ($0 != "") print } }  # print non-empty file1 lines missing from file2
' file2 file1 > file3
file3
$ cat file3
/opt/b/b.sql
/opt/h/m.sql
Notes:
The order of the files passed to awk is important here: pass the file to check (file2 here) first, followed by the master file (file1).
Check the awk documentation to understand what is done here.
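The question also asks to skip entries commented out with #. Under one reading of that requirement (ignore commented-out lines entirely), a possible sketch:

awk 'NR==FNR { line[$1]; next }   # load file2 lines into an array
     /^[ \t]*#/ { next }          # ignore commented-out lines
     $0 != "" && !($1 in line)    # print non-empty file1 lines missing from file2
' file2 file1 > file3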
You can use some tools like cat, sed, sort and uniq.
The main observation is this: if a line is in both files, then it is not unique in cat file1 file2.
Furthermore, in cat file1 file2 | sort, all duplicates end up adjacent. Using uniq -u we keep only the unique lines and get this pipe:
cat file1 file2 | sort | uniq -u
Using sed to remove leading whitespace, empty lines, and comment lines, we get this final pipe:
cat file1 file2 | sed -r 's/^[ \t]+//; /^#/ d; /^$/ d;' | sort | uniq -u > file3
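Two caveats with uniq -u: it assumes no line is repeated within a single file, and it yields the symmetric difference (lines unique to either file), not only "in file1 but not in file2". If you specifically want the latter and are using bash, comm over sorted input is an alternative:

comm -23 <(sort file1) <(sort file2) > file3

Here -2 suppresses lines found only in file2 and -3 suppresses lines common to both, leaving the lines found only in file1.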

Using file1 as an Index to search file2 when file1 contains extra informations

As you can read in the title, I'm dealing with two files. Here is an example of what they look like.
file1:
Name (additional info separated by a tab from the name)
Peter Schwarzer<tab>Best friend of mine
file2:
Name (followed by a float separated by a tab from the name)
Peter Schwarzer<tab>1456
So what I want to do is use file1 as an index for searching file2. If the names match, the line should be written to file3, which should contain the name, followed by the float from file2, followed by the additional info from file1.
So file3 should look like:
Peter Schwarzer<tab>1456<tab>Best friend of mine
(everything separated by tab)
I tried grep -f to read patterns from a file, and it works without the additional information. So is there any way to get the desired result with grep, or is awk the answer?
Thanks in advance,
Julian
Give this line a try; I didn't test it, but it should work:
awk -F'\t' -v OFS="\t" 'NR==FNR{n[$1]=$2;next}$1 in n{print $0,n[$1]}' file1 file2 > file3
Try this awk one-liner!
awk -v FS="\t" -v OFS="\t" 'FNR==NR{ A[$1]=$2; next}$1 in A{print $0,A[$1];}' file1.txt file2.txt > file3.txt
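Both one-liners use the same idiom; expanded with comments:

awk -F'\t' -v OFS='\t' '
    NR==FNR { n[$1] = $2; next }   # first file (file1): map name -> additional info
    $1 in n { print $0, n[$1] }    # second file (file2): append the stored info to matching lines
' file1 file2 > file3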
To me this looks like a job for join:
join -t '\t' file1 file2
This assumes file1 and file2 are sorted. If not, sort them first:
sort -o file1 file1
sort -o file2 file2
join -t '\t' file1 file2
If you can't modify file1 and file2 (if you need to leave them in their original, unsorted state), use a temporary file:
tmpfile=/tmp/tf$$
sort file1 > $tmpfile
sort file2 | join -t '\t' $tmpfile -
If join says "illegal tab character specification", you'll have to use join -t ' ' where you type an actual tab between the single quotes (and depending on your shell, you may have to press Ctrl-V before that tab); in bash, join -t $'\t' also works.
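One detail worth checking: join prints the join field, then the remaining fields of the first file, then those of the second, so join file1 file2 would yield name, info, float rather than the requested name, float, info. Swapping the operands (or using -o to pick the output fields) gives the desired order; in bash, for example:

join -t $'\t' file2 file1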

Probably grep, but I still do not get how to read a row in file1 and paste it as a column in file2

I have a CSV file like this:
(A),4.999165E-08,5.99986E-08,7.000066E-08,8.000618E-08, etc.
All I want to do is produce output (without the commas) in the form of a column in another CSV file, so it should look like:
(A)
4.999165E-08
5.99986E-08
7.000066E-08
I still don't get the basics of grep. Or is this not possible with grep, and should I use an awk command instead?
If the file looks like what you mentioned, then you can just do:
tr ',' '\n'
An example of what it will do:
echo "(A),4.999165E-08,5.99986E-08,7.000066E-08,8.000618E-08, " | tr ',' '\n'
(A)
4.999165E-08
5.99986E-08
7.000066E-08
8.000618E-08
If you'd like to use awk:
awk -F, '$1=$1' OFS="\n"
(A)
4.999165E-08
5.99986E-08
7.000066E-08
8.000618E-08
etc.
This one is somewhat more robust: the first version uses the assignment $1=$1 itself as the condition, so it would skip a line whose first field is 0 or empty.
awk -F, '{$1=$1}1' OFS="\n"
Another awk:
awk 'gsub(/,/,"\n")'
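As a side note, gsub() returns the number of substitutions made, so this last one-liner prints a line (with its commas already replaced by newlines) only when at least one comma was found; unlike the tr version, it drops comma-free lines.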
