Bash: grep selected text from a file - text

I have two files, file1 :
abc/def/ghi/ss/sfrere/sfs
xyz/pqr/sef/ert/wwqwq/bh
file2:
ind abc def
bcf pqr sss
I want to grep lines from file1 such that at least two words from any single line of file2 appear on the same line of file1. In this case the answer is the first line, because abc and def are both present in the first line of file1.

This should do the trick,
awk 'FNR==NR{a[$1];next}{for(i in a){c=0;for(j=1;j<=NF;j++){if(index(i,$j)>0)c++}if(c>=2)print i}}' file1.txt file2.txt
Explanation
FNR==NR{a[$1];next} reads file1.txt first and stores each line as a key of the array a (the sample lines contain no spaces, so $1 is the whole line).
for(i in a) loops over the stored lines.
c=0 resets the counter that tracks how many columns matched.
for(j=1;j<=NF;j++) loops over the columns of the current line of file2.txt.
if(index(i,$j)>0)c++ increments the counter if a column of file2.txt occurs as a substring of the line from file1.txt.
if(c>=2)print i applies your condition: if at least 2 columns match, the line from file1.txt is printed.
This is the most straightforward way that I could think of; I'm sure there are crazier ways to do this.
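The command above counts substring matches via index(), so a word such as ss would also match inside sfs-like text. A variant sketch, assuming instead that a word must equal a whole /-separated path component; the file names follow the question, while the array names and the scratch directory are illustrative only:

```shell
# Sample data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'abc/def/ghi/ss/sfrere/sfs' 'xyz/pqr/sef/ert/wwqwq/bh' > "$tmp/file1"
printf '%s\n' 'ind abc def' 'bcf pqr sss' > "$tmp/file2"

# Pass 1 (file2): remember every line of words.
# Pass 2 (file1): split the path on "/" and, for each remembered line,
# count the words that equal a path component; print the path once
# as soon as one line of file2 reaches 2 matches.
result=$(awk '
    FNR == NR { words[FNR] = $0; n = FNR; next }
    {
        split($0, parts, "/")
        for (k = 1; k <= n; k++) {
            m = split(words[k], w, " ")
            c = 0
            for (j = 1; j <= m; j++)
                for (p in parts)
                    if (parts[p] == w[j]) { c++; break }
            if (c >= 2) { print; break }
        }
    }
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that the argument order is reversed here (file2 first), because this sketch stores the words and streams the paths.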

For a huge file:
sed 's/\([^ ]*\) \([^ ]*\) \([^ ]*\)/(\1.*\2)|(\2.*\1)|(\1.*\3)|(\3.*\1)|(\2.*\3)|(\3.*\2)/' file2 >/tmp/file2.egrep
egrep -f /tmp/file2.egrep file1
rm /tmp/file2.egrep
This creates a temporary pattern file for egrep based on the content of file2. The sed expression assumes three space-separated words per line of file2.
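To make the generated patterns visible, here is a small sketch of the same idea on the sample data from the question. It uses grep -E, the non-deprecated spelling of egrep, and a scratch directory instead of /tmp/file2.egrep; both are choices of this sketch, not of the answer above:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'abc/def/ghi/ss/sfrere/sfs' 'xyz/pqr/sef/ert/wwqwq/bh' > "$tmp/file1"
printf '%s\n' 'ind abc def' 'bcf pqr sss' > "$tmp/file2"

# Turn each three-word line into an alternation of all ordered word pairs.
sed 's/\([^ ]*\) \([^ ]*\) \([^ ]*\)/(\1.*\2)|(\2.*\1)|(\1.*\3)|(\3.*\1)|(\2.*\3)|(\3.*\2)/' \
    "$tmp/file2" > "$tmp/patterns"

# Show the pattern generated for the first line of file2.
first_pattern=$(head -n 1 "$tmp/patterns")
printf '%s\n' "$first_pattern"

# A line of file1 is printed if it matches any generated pattern.
result=$(grep -E -f "$tmp/patterns" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the awk answer, this matches substrings, not whole path components.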

Related

Match columns 1, 2, 5 of file1 with columns 1, 2, 3 of file2 respectively and output the matched rows from file2. The second file is a gzipped .gz file

file1
3 1234581 A C rs123456
file2 zipped file .gz
1 1256781 rs987656 T C
3 1234581 rs123456 A C
22 1792471 rs928376 G T
output
3 1234581 rs123456 A C
I tried
zcat file2.gz | awk 'NR==FNR{a[$1,$2,$5]++;next} a[$1,$2,$3]' file1.txt - > output.txt
but it is not working
Please try the following awk code for your shown samples. Use zcat to decompress your .gz file and pipe it to awk as the second input (-), which awk reads after it has finished reading file1.
zcat your_file.gz | awk 'FNR==NR{arr[$1,$2,$5];next} (($1,$2,$3) in arr)' file1 -
Fixes in OP's attempt:
You need not increment the array value while building the array from file1; the mere existence of the index is enough.
While reading file2 (supplied by the zcat command), just check whether the respective fields are present in the array; if so, the line is printed.
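A self-contained sketch that reproduces the sample from the question. It builds the .gz file itself and uses gzip -dc, which is the portable equivalent of zcat; the scratch directory is an assumption of the sketch:

```shell
tmp=$(mktemp -d)
printf '%s\n' '3 1234581 A C rs123456' > "$tmp/file1"
printf '%s\n' '1 1256781 rs987656 T C' \
              '3 1234581 rs123456 A C' \
              '22 1792471 rs928376 G T' | gzip > "$tmp/file2.gz"

# file1: store the key built from columns 1, 2, 5.
# stdin (-): print rows whose columns 1, 2, 3 form a stored key.
result=$(gzip -dc "$tmp/file2.gz" |
    awk 'FNR==NR { arr[$1,$2,$5]; next } (($1,$2,$3) in arr)' "$tmp/file1" -)
printf '%s\n' "$result"
rm -rf "$tmp"
```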

How do I compare two files in unix based on their columns

I am fairly new to Unix commands. I have two .csv files and would like to compare their first columns with either diff or comm. Comparing whole lines is no use because every line differs, which is why I want to compare only the first column of each file and have the result printed as a count, where each land code should not be counted more than once. The first file also has a header that I want to skip in the comparison.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two land codes that occur in both files. Duplicates of a country code should be counted only once, which is why the expected output is 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
with the awk i get output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall, I want the codes that occur in column 1 of both file1 and file2.
From the condition c[$1] == 0 in the awk script from the question I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is clarified now, that you want to count the codes that are present in both files, see below at the end of the answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F , specifies comma (,) as field separator.
The condition if(NR!=1)c[$1]++; skips the header line in file1.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(The first line with code ABW.)
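Since the clarified goal is a count of the codes present in both files, the last script can be combined with the counting idea. A sketch under assumptions: the rows below are made-up sample data (not the question's files), and the array name seen is illustrative:

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
iso_code,continent,location,date,total_cases
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
AZE,Asia,Azerbaijan,2020-03-30,29.0
EOF
cat > "$tmp/file2" <<'EOF'
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
AFG,Asia,Afghanistan,2020-09-06,38324.0
EOF

count=$(awk -F, '
    NR == FNR { if (FNR > 1) seen[$1] = 1; next }  # file1: remember codes, skip header
    seen[$1] == 1 { seen[$1] = 2; count++ }        # file2: count each shared code once
    END { print count + 0 }                        # + 0 prints 0 when nothing matched
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$count"
rm -rf "$tmp"
```

Marking a code with 2 after its first hit is what keeps the duplicate ABW row from being counted twice.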
Alternative solution as requested in a comment.
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -f code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -f code1 code2 print all lines from code2 that contain any of the strings from code1
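One caveat: fgrep -f matches substrings, so a shorter code would also match inside a longer one. A hedged sketch of a stricter variant using grep -F -x (whole-line fixed-string match); the codes AB, ABW and AFG are made-up sample values:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'AB' 'AFG' > "$tmp/code1"
printf '%s\n' 'ABW' 'AFG' > "$tmp/code2"

# Substring match: ABW is counted because it contains AB.
loose=$(grep -c -F -f "$tmp/code1" "$tmp/code2")

# Whole-line match: only the identical code AFG is counted.
strict=$(grep -c -F -x -f "$tmp/code1" "$tmp/code2")

printf 'loose=%s strict=%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

grep -c counts the matching lines directly, so no separate wc -l is needed in this form.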
To print the lines of file2 whose column 1 also occurs in column 1 of file1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0

Iterate over two files in linux || column comparison

We have two files, File1 and File2.
File 1 columns
Name Age
abc 12
bcd 14
File2 Columns
Age
12
14
I want to iterate over the second column of File1 and the first column of File2 in a single loop and check whether they are the same.
Note: the number of rows in both files is the same, and I am using a .sh shell script.
First make a temporary file from file1 that should be the same as file2.
The field name might have spaces, so remove everything until the last space.
When you have done this you can compare the files.
sed 's/.* //' file1 > file1.tmp
diff file1.tmp file2
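If a single loop is really wanted, both files can be read in lockstep through two file descriptors. This is a sketch, not the answer's method; the function name count_mismatches and the descriptor numbers 3 and 4 are arbitrary choices:

```shell
# Print the number of rows where the last space-separated field of
# file $1 differs from the whole line of file $2.
count_mismatches() {
    n=0
    while IFS= read -r left <&3 && IFS= read -r right <&4; do
        # ${left##* } strips everything up to the last space.
        [ "${left##* }" = "$right" ] || n=$((n + 1))
    done 3< "$1" 4< "$2"
    printf '%s\n' "$n"
}

tmp=$(mktemp -d)
printf '%s\n' 'Name Age' 'abc 12' 'bcd 14' > "$tmp/file1"
printf '%s\n' 'Age' '12' '14' > "$tmp/file2"
mismatches=$(count_mismatches "$tmp/file1" "$tmp/file2")
printf 'mismatching rows: %s\n' "$mismatches"
rm -rf "$tmp"
```

The loop stops at the end of the shorter file, which is fine under the stated assumption that both files have the same number of rows.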

Linux split a file in two columns

I have the following file that contains 2 columns :
A:B:IP:80 apples
C:D:IP2:82 oranges
E:F:IP3:84 grapes
How is it possible to split the file into 2 other files, with each column in its own file, like this:
File1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
File2
apples
oranges
grapes
Try:
awk '{print $1>"file1"; print $2>"file2"}' file
After running that command, we can verify that the desired files have been created:
$ cat file1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
And:
$ cat file2
apples
oranges
grapes
How it works
print $1>"file1"
This tells awk to write the first column to file1.
print $2>"file2"
This tells awk to write the second column to file2.
Perl 1-liner using (abusing) the fact that print goes to STDOUT, i.e. file descriptor 1, and warn goes to STDERR, i.e. file descriptor 2:
# perl -n means loop over the lines of input automatically
# perl -e means execute the following code
# chomp means remove the trailing newline from the expression
perl -ne 'chomp(my @cols = split /\s+/); # Split each line on whitespace
print $cols[0] . "\n";
warn $cols[1] . "\n"' <input 1>col1 2>col2
You could, of course, just use cut -b with the appropriate columns, but then you would need to read the file twice.
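A sketch of the cut variant on the sample from the question. It uses cut -f with a space delimiter rather than the byte positions of cut -b, which assumes the columns are separated by exactly one space; the output names col1 and col2 are illustrative:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'A:B:IP:80 apples' 'C:D:IP2:82 oranges' 'E:F:IP3:84 grapes' > "$tmp/input"

# One pass over the input per output column.
cut -d ' ' -f 1 "$tmp/input" > "$tmp/col1"
cut -d ' ' -f 2 "$tmp/input" > "$tmp/col2"

col1=$(cat "$tmp/col1")
col2=$(cat "$tmp/col2")
printf '%s\n' "$col2"
rm -rf "$tmp"
```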
Here's an awk solution that'll work with any number of columns:
awk '{for(n=1;n<=NF;n++)print $n>"File"n}' input.txt
This steps through each field on the line and prints the field to a different output file based on the column number.
Note that blank fields -- or rather, lines with fewer fields than other lines, will cause line numbers to mismatch. That is, if your input is:
A 1
B
C 3
Then File2 will contain:
1
3
If this is a concern, mention it in an update to your question.
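If the line numbers must stay aligned, one possible workaround is to read the input twice: first to find the widest line, then to write an empty line for every missing field. This is a sketch under that assumption; the variables dir and max are illustrative, and the output expression is parenthesized because an unparenthesized concatenation after > is not portable across awk implementations:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'A 1' 'B' 'C 3' > "$tmp/input.txt"

awk -v dir="$tmp" '
    NR == FNR { if (NF > max) max = NF; next }   # pass 1: widest line
    {                                            # pass 2: pad short lines
        for (n = 1; n <= max; n++)
            print $n > (dir "/File" n)           # a missing field prints as an empty line
    }
' "$tmp/input.txt" "$tmp/input.txt"

file2_content=$(cat "$tmp/File2")
printf '%s\n' "$file2_content"
rm -rf "$tmp"
```

With this input, File2 keeps an empty second line, so its third line still corresponds to the third input line.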
You could of course do this in bash alone, in a number of ways. Here's one:
while read -r line; do
a=($line)
for m in "${!a[@]}"; do
printf '%s\n' "${a[$m]}" >> File$((m+1))
done
done < input.txt
This reads each line of input into $line, then word-splits $line into values in the $a[] array. It then steps through that array, printing each item to the appropriate file, named for the index of the array (plus one, since bash arrays start at zero).

Extract lines from File2 already found File1

Using the Linux command line, I need to output the lines from text file2 that are already found in file1.
File1:
C
A
G
E
B
D
H
F
File2:
N
I
H
J
K
M
D
L
A
Output:
A
D
H
Thanks!
You are looking for the tool grep.
Check this out.
Let's say you have the inputs in the files file1 and file2:
grep -f file1 file2
will return you
H
D
A
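Plain grep -f treats each line of file1 as a regular expression and matches it anywhere in a line. For single-letter lines that is harmless, but a stricter sketch, assuming whole lines should be compared literally, adds -F (fixed strings) and -x (whole-line match):

```shell
tmp=$(mktemp -d)
printf '%s\n' C A G E B D H F > "$tmp/file1"
printf '%s\n' N I H J K M D L A > "$tmp/file2"

# Print the lines of file2 that are identical to some line of file1.
result=$(grep -F -x -f "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The output follows the order of file2, not the sorted order shown in the question.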
A more flexible tool to use would be awk
awk 'NR==FNR{lines[$0]++; next} $1 in lines'
Example
$ awk 'NR==FNR{lines[$0]++; next} $1 in lines' file1 file2
H
D
A
What does it do?
NR==FNR{lines[$0]++; next}
NR==FNR checks whether the per-file record number (FNR) equals the overall record number (NR). This is true only while reading the first file, file1.
lines[$0]++ creates an entry in the associative array lines, indexed by the whole line $0 of file1.
$1 in lines is evaluated only for the second file, because of the next in the previous action. It checks whether the first field of the line in file2 is a key of the saved array lines (for single-word lines like these, $1 equals $0); if so, the default action of printing the entire line is taken.
Awk is more flexible than grep because you can compare any column of file1 with any column of file2 and choose to print any column rather than the entire line.
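To illustrate that flexibility, here is a sketch that compares column 2 of one file with column 1 of another and prints only column 2 of the matching rows. The file names and rows are made up for the example:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'x 101' 'y 103' > "$tmp/ids.txt"
printf '%s\n' '101 alice admin' '102 bob user' '103 carol user' > "$tmp/users.txt"

# ids.txt: remember column 2.
# users.txt: when column 1 was remembered, print only column 2.
result=$(awk 'NR==FNR { ids[$2]; next } $1 in ids { print $2 }' \
    "$tmp/ids.txt" "$tmp/users.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```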
This is what the comm utility does, but you have to sort the files first: To get the lines in common between the 2 files:
comm -12 <(sort File1) <(sort File2)
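The <( ) process substitution requires bash or a similar shell. A sketch of the same comm approach for a plain POSIX sh, assuming temporary files are acceptable; LC_ALL=C is set so that sort and comm agree on the collation order:

```shell
tmp=$(mktemp -d)
printf '%s\n' C A G E B D H F > "$tmp/File1"
printf '%s\n' N I H J K M D L A > "$tmp/File2"

# comm needs sorted input.
LC_ALL=C sort "$tmp/File1" > "$tmp/sorted1"
LC_ALL=C sort "$tmp/File2" > "$tmp/sorted2"

# -12 suppresses columns 1 and 2, leaving only the lines common to both.
result=$(LC_ALL=C comm -12 "$tmp/sorted1" "$tmp/sorted2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because of the sort, the result comes out in the sorted order that the question shows as its expected output.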

Resources