Iterate over two files in linux || column comparison - linux

We have two files, File1 and File2.
File1 columns:
Name Age
abc 12
bcd 14
File2 columns:
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop and then check whether they are the same.
Note: the number of rows in both files is the same, and I am using a .sh shell script.

First make a temporary file from file1 that should be the same as file2. The name field might contain spaces, so remove everything up to the last space on each line. Once you have done this you can compare the files:
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
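If you specifically want a single shell loop rather than diff, here is a minimal sketch that reads both files in lockstep from two file descriptors; the filenames and the two-column layout of file1 are taken from the question, everything else is assumption:
#!/bin/sh
# Open file1 on fd 3 and file2 on fd 4, then read one line from each per iteration.
exec 3<file1 4<file2
while read -r name age <&3 && read -r age2 <&4; do
    if [ "$age" = "$age2" ]; then
        echo "match: $age"
    else
        echo "mismatch: $age vs $age2"
    fi
done
exec 3<&- 4<&-
Note that the header lines (Name Age / Age) are compared too; since both files carry the Age header, they still match.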

Related

Match columns 1, 2 and 5 of file1 with columns 1, 2 and 3 of file2 respectively; the output should contain the matched rows from file2. The second file is a zipped .gz file.

file1
3 1234581 A C rs123456
file2 (zipped .gz file)
1 1256781 rs987656 T C
3 1234581 rs123456 A C
22 1792471 rs928376 G T
output
3 1234581 rs123456 A C
I tried
zcat file2.gz | awk 'NR==FNR{a[$1,$2,$5]++;next} a[$1,$2,$3]' file1.txt - > output.txt
but it is not working
Please try the following awk code for your shown samples. Use zcat to read your .gz file and pass its output as the second input to the awk program, so it is read after file1.
zcat your_file.gz | awk 'FNR==NR{arr[$1,$2,$5];next} (($1,$2,$3) in arr)' file1 -
Fixes in OP's attempt:
You don't need to increment the array values while creating the array from file1; the mere existence of the indexes is enough.
While reading file2 (passed in by the zcat command), just check whether the respective fields are present in the array; if they are, print that line. A quick demonstration with the shown samples follows.
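To sanity-check the one-liner, you can recreate the sample inputs from the question and run it (filenames here are illustrative):
printf '3 1234581 A C rs123456\n' > file1
printf '1 1256781 rs987656 T C\n3 1234581 rs123456 A C\n22 1792471 rs928376 G T\n' | gzip > file2.gz
zcat file2.gz | awk 'FNR==NR{arr[$1,$2,$5];next} (($1,$2,$3) in arr)' file1 -
This should print only the matching row: 3 1234581 rs123456 A C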

Linux - putting lines that contain a string at a specific column in a new file

I want to pull all rows from a text file in Linux which contain a specific number (in this case 9913) in a specific column (column 4). This is a tab-delimited file, so I am calling this a column, though I am not sure it is one.
In some cases there is only one number in column 4, but on other lines there are multiple numbers in this column (e.g. 9913; 4444; 5555). I would like to get any rows for which the number 9913 appears in the 4th column, whether it is the only number or part of a list. How do I take all lines which contain the number 9913 in column 4 and put them in their own file?
Here is an example of what I have tried:
cat file.txt | grep 9913 > newFile.txt
The result is a mixture of the following:
CDR1as CDR1as ENST00000003100.8 9913 AAA-GGCAGCAAGGGACUAAAA (lines that I want)
CDR1as CDR1as ENST00000399139.1 9606 GUCCCCA................(example of a line I don't want)
I do not get any results when targeting a specific column. As shown below, I think the code is not recognizing the columns, and I get blank files when using awk:
awk '$4 == "9913"' file.txt > newfile.txt
gives me no transfer of data to a new file.
Thanks
This is one way of doing it:
awk '$4 == "9913" {print $0}' file.txt > newfile.txt
or just
awk '$4 == "9913"' file.txt > newfile.txt
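If column 4 can hold a semicolon-separated list such as 9913; 4444; 5555, two things matter: awk must split on tabs (otherwise the spaces inside the list break the field numbering, which would explain the blank output), and the test must match 9913 anywhere inside the list. A hedged sketch, assuming the entries are separated by "; " as in the example:
awk -F'\t' '$4 ~ /(^|; )9913(;|$)/' file.txt > newfile.txt
The regex accepts 9913 at the start, middle, or end of the field, but rejects longer numbers like 49913 or 99131.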

Bash: grep selected text from a file

I have two files, file1 :
abc/def/ghi/ss/sfrere/sfs
xyz/pqr/sef/ert/wwqwq/bh
file2:
ind abc def
bcf pqr sss
I wish to grep lines from file1 such that two or more words from any single line of file2 all appear on one line of file1. In this case the answer would be the first line of file1, since abc and def from the first line of file2 are both present in it.
This should do the trick,
awk 'FNR==NR{a[$1];next}{for(i in a){c=0;for(j=1;j<=NF;j++){if(index(i,$j)>0)c++}if(c>=2)print i}}' file1.txt file2.txt
Explanation
FNR==NR{a[$1];next} iterates through the first file (file1.txt) and stores its lines in a.
for(i in a) loops through the stored lines.
c=0 resets a counter used to keep track of the number of columns matched.
for(j=1;j<=NF;j++) loops through the columns in each line of file2.txt.
if(index(i,$j)>0)c++ increments the counter if one of the columns of file2.txt occurs in the current line of file1.txt.
if(c>=2)print i applies your given condition that at least 2 columns should match; if so, we print the line from file1.txt.
This is the most straightforward way that I could think of; I'm sure there are crazier ways to do this.
On a huge file:
sed 's/\([^ ]*\) \([^ ]*\) \([^ ]*\)/(\1.*\2)|(\2.*\1)|(\1.*\3)|(\3.*\1)|(\2.*\3)|(\3.*\2)/' file2 >/tmp/file2.egrep
egrep -f /tmp/file2.egrep file1
rm /tmp/file2.egrep
This creates a temporary pattern file for egrep based on file2's content.
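To make the sed step concrete: for the sample file2 above it expands each three-word line into an alternation over every ordered pair of its words, so /tmp/file2.egrep would contain:
(ind.*abc)|(abc.*ind)|(ind.*def)|(def.*ind)|(abc.*def)|(def.*abc)
(bcf.*pqr)|(pqr.*bcf)|(bcf.*sss)|(sss.*bcf)|(pqr.*sss)|(sss.*pqr)
Any line of file1 containing two of the words from one line of file2, in either order, then matches, which is exactly the at-least-2-words condition.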

Get lines of file1 which are not in file2

I have two long, but sorted, files. How can I get all lines of the first file which are not in the second file?
file1
0000_aaa_b
0001_bccc_b
0002_bcc <------ file2 does not have this line
0003_aaa_d
0006_xxx
...
file2
0000_aaa_b
0001_bccc_b
0003_aaa_d
0006_xxx
...
This is what the comm command is for:
$ comm -3 file1 file2
0002_bcc
From man comm:
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
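Since the question only asks for lines of file1 that are missing from file2, you can suppress columns 2 and 3 together to get exactly those lines and nothing else:
comm -23 file1 file2
This also avoids the leading tab that comm would otherwise use to indent any lines unique to file2.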
Just run a diff on them:
diff -c file1 file2
The -c (for "context") flag displays only the lines that differ, with three lines of context around each change.
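If you want just the missing lines themselves rather than a context listing, a common idiom is to filter diff's default output for lines that appear only in file1 (marked with a leading <):
diff file1 file2 | grep '^<' | sed 's/^< //'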

Best way to print rows not common to two large files in Unix

I have two files which are of following format.
File1 - It contains 4 columns. The first field is an ID in text format and the rest of the columns are also text values.
id1 val12 val13 val14
id2 val22 val23 val24
id3 val32 val33 val34
File2 - In file two I only have IDs.
id1
id2
Output
id3 val32 val33 val34
My question is: how do I find the rows from the first file whose ID (first field) does not appear in the second file? Both files are pretty large: file1 contains 42 million rows (8 GB) and file2 contains 33 million IDs. The order of IDs in the two files might not be the same.
Assuming the two files are sorted by id, then something like
join -t ' ' -j 1 -v 1 file1 file2
should do it.
You could do like this with awk:
awk 'FNR == NR { h[$1] = 1; next } !h[$1]' file2 file1
The first block gathers the IDs from file2 into the h hash. The last part (!h[$1]) triggers the default action ({ print $0 }) for lines of file1 whose ID wasn't present in file2.
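With the sample data from the question this produces exactly the expected output:
$ awk 'FNR == NR { h[$1] = 1; next } !h[$1]' File2 File1
id3 val32 val33 val34
Note that the hash holds all of file2 in memory; with 33 million short IDs that is usually feasible, but it is worth keeping in mind at this scale.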
I don't claim that this is the "best" way to do it because best can include a number of trade-off criteria, but here's one way:
You can do this with the -f option, specifying File2 as the file containing the search patterns for grep:
grep -v -f File2 File1 > output
And as @glennjackman suggests, one way to force the ID to match at the beginning of the line:
grep -vf <(sed 's/^/^/' File2) File1
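One caveat at this scale: grep -f loads every pattern into memory, and 33 million patterns can make it very slow or memory-hungry. The join approach above should scale better once both files are sorted; a minimal sketch, assuming there is disk space for the sorted copies:
# Sort both files on the ID column (join requires sorted input);
# LC_ALL=C keeps sort's collation consistent with join's comparisons.
LC_ALL=C sort -k1,1 file1 > file1.sorted
LC_ALL=C sort file2 > file2.sorted
LC_ALL=C join -j 1 -v 1 file1.sorted file2.sorted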
