AWK to filter to files if their columns match

I am basically working with two files (file1 and file2). The goal is to write a script that pulls rows from file1 if columns 1, 2, and 3 match between file1 and file2. Here's the code I have been playing with:
awk -F'|' 'NR==FNR{c[$1$2$3]++;next};c[$1$2$3] > 0' file1 file2 > filtered.txt
file1 and file2 both look like this (but have many more columns):
name1 0 c
name1 1 c
name1 2 x
name2 3 x
name2 4 c
name2 5 c
The awk code I provided isn't producing any output. Any help would be appreciated!

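Your delimiter isn't a pipe. Try this (it prints the matching rows of file2; swap the two file arguments to print rows of file1 instead):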
$ awk 'NR==FNR {c[$1,$2,$3]++; next} c[$1,$2,$3]' file1 file2 > filtered.txt
or
$ awk 'NR==FNR {c[$0]++; next} c[$0]' file1 file2 > filtered.txt
However, if you're matching the whole line, it is perhaps easier with grep:
$ grep -xFf file1 file2 > filtered.txt

awk '{key=$1 FS $2 FS $3} NR==FNR{file2[key];next} key in file2' file2 file1
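
To see the key-based filter on data where the remaining columns differ, here is a minimal sketch. The sample rows, the fourth column, and the temporary directory are made up for illustration; it assumes whitespace-separated fields and prints the rows of file1 whose first three columns also occur in file2.

```shell
# Hypothetical sample data: columns 1-3 form the key, column 4 differs between files.
tmp=$(mktemp -d)
printf '%s\n' 'name1 0 c extraA' 'name1 1 c extraB' 'name2 3 x extraC' > "$tmp/file1"
printf '%s\n' 'name1 0 c other1' 'name2 3 x other2' 'name2 5 c other3' > "$tmp/file2"

# Pass 1 (file2): remember each (col1, col2, col3) key.
# Pass 2 (file1): print the rows whose key was seen in file2.
filtered=$(awk 'NR==FNR { seen[$1,$2,$3]; next } ($1,$2,$3) in seen' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$filtered"
rm -rf "$tmp"
```

The comma form seen[$1,$2,$3] joins the fields with awk's SUBSEP, so keys such as "ab" "c" and "a" "bc" stay distinct, which plain concatenation $1$2$3 does not guarantee.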

Related

I want to find some strings/words from column 1 and 2 in file1 that match column 1 in file2 and replace with column 2 strings/words in file2

I'm still learning to code on the Linux platform. I have searched for problems similar to mine, but the ones I found were either too specific or focused only on changing the entire column 1.
Here are examples of my files:
File 1
abc Gamma 3.44
bcd abc 5.77
abc Alpha 1.99
beta abc 0.88
bcd Alpha 5.66
File 2
Gamma Bacteria
Alpha Bacteria
Beta Bacteria
Output file3
abc Bacteria 3.44
bcd abc 5.77
abc Bacteria 1.99
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried:
awk:
$ awk 'FNR==NR{a[$1]=$2;next} {if ($1,$2 in a){$1,$2=a[$1,$2]}; print $0}' file2 file1
$ awk 'NR==FNR {a[FNR]=$0; next} /$1|$2/ {$1 $2=a[FNR]} 1' file2 file1
They gave me:
abc Gamma 3.44
abc 5.77
abc Alpha 1.99
Bacteria abc 0.88
bcd Alpha 5.66
It only changed $1, and it removed the strings in column 1 that are not found in file2.
And this one:
$ awk -F'\t' -v OFS='\t' 'FNR==1 { next }FNR == NR { file2[$1,$2] = $1 FS $2 } FNR != NR { file1[$1,$2,] = $1 FS $2} END { print "Match:"; for (k in file1) if (k in file1) print file2[k] # Or file1[k]}' file2 file1
That didn't work.
Then I tried sed:
$ sed = file2 | sed -r 'N;s/(.*)\n(.*)/\1s|\&$|\2|/' | sed -f - file1
This gave me an error complaining that sed -e was not being called properly.
After that, I want to keep only the line with the smallest $3 whenever $1 and $2 (in either order) form the same pair:
file 4
bcd abc 5.77
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried this code:
$ awk 'NR == $1&$2 || $3 < min {line = $0; min = $3}END{print line}' file3
$ awk '/^$1/{if(h){print h RS m}min=""; h=$0; next}min=="" || $3 < min{min=$3; m=$0}END{print h RS m}' file3
$ awk -F'\t' '$3 != "NF==min"' OFS='\t' file3
$ awk -v a=NODE '{c=a*$3+(1-a)} !($1 in min) || c<min[$1]{min[$1]=c; minLine[$1]=$0} END{for(k in minLine) print minLine[k]}' file3 | column -t
None of them worked. I tried to research what each line means and changed it to fit my problem, but they all failed.

This might work for you (GNU sed):
sed -E 's#(.*) (.*)#/^\1 /Is/\\S+/\2/;/^\\S+ \1 /Is/\\S+/\2/2#' file2 |
sed -Ef - file1
Generate a sed script from file2 which is run against file1 to produce the required format.
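
An awk-only alternative can be sketched as well. This is an illustration, not the method of the answer above: it assumes whitespace-separated fields, treats names case-insensitively (so that beta matches Beta, as in the expected output), and writes to a temporary directory chosen here. The second step is one possible reading of the "smallest $3" requirement: for each unordered pair of names in columns 1 and 2, keep the line with the smallest third column.

```shell
# Sample data copied from the question.
tmp=$(mktemp -d)
printf '%s\n' 'abc Gamma 3.44' 'bcd abc 5.77' 'abc Alpha 1.99' 'beta abc 0.88' 'bcd Alpha 5.66' > "$tmp/file1"
printf '%s\n' 'Gamma Bacteria' 'Alpha Bacteria' 'Beta Bacteria' > "$tmp/file2"

# Step 1: load file2 as a lookup table, then replace columns 1 and 2 of file1.
awk 'NR==FNR { map[tolower($1)] = $2; next }
     {
       for (i = 1; i <= 2; i++) {
         k = tolower($i)
         if (k in map) $i = map[k]
       }
       print
     }' "$tmp/file2" "$tmp/file1" > "$tmp/file3"

# Step 2: read file3 twice. The first pass records the minimum $3 per unordered
# pair; the second pass prints the first line that carries that minimum.
file4=$(awk 'function key(a, b) { return (a < b) ? a SUBSEP b : b SUBSEP a }
     NR==FNR { k = key($1, $2); if (!(k in min) || $3 + 0 < min[k]) min[k] = $3 + 0; next }
     { k = key($1, $2); if ($3 + 0 == min[k] && !done[k]++) print }' "$tmp/file3" "$tmp/file3")

file3=$(cat "$tmp/file3")
printf 'file3:\n%s\nfile4:\n%s\n' "$file3" "$file4"
rm -rf "$tmp"
```

Reading file3 twice keeps the surviving lines in their original order, at the cost of a second pass over the file.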

How to print only words that don't match between two files? [duplicate]

FILE1:
cat
dog
house
tree
FILE2:
dog
cat
tree
I need only this to be printed:
house

$ cat file1
cat
dog
house
tree
$ cat file2
dog
cat
tree
$ grep -vF -f file2 file1
house
The -v flag shows only non-matching lines, -f names a file whose lines are used as patterns, and -F treats those patterns as fixed strings rather than regular expressions (so no time is spent on pattern matching).
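
One caveat worth checking: with -F alone, a pattern matches anywhere in a line, so a word that merely contains a word from file2 is filtered out too. Adding -x restricts matches to whole lines. A small sketch, with a made-up extra word (category) added to file1 and a temporary directory chosen for illustration:

```shell
tmp=$(mktemp -d)
printf '%s\n' cat dog house tree category > "$tmp/file1"
printf '%s\n' dog cat tree > "$tmp/file2"

# Substring matching: "category" contains "cat", so it is dropped as well.
loose=$(grep -vF -f "$tmp/file2" "$tmp/file1")
# Whole-line matching: only lines identical to a line of file2 are dropped.
strict=$(grep -vxF -f "$tmp/file2" "$tmp/file1")

printf 'without -x:\n%s\nwith -x:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```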

Using awk
awk 'FNR==NR{arr[$0]=1; next} !($0 in arr)' FILE2 FILE1
First build an associative array with the words from FILE2, then loop over FILE1 and print only the lines that are not in the array.
Using comm
comm -2 -3 <(sort FILE1) <(sort FILE2)
-2 suppresses lines unique to FILE2 and -3 suppresses lines found in both.

If you want just the words, you can sort the files, diff them, then use sed to filter out diff's symbols:
diff <(sort file1) <(sort file2) | sed -n '/^</s/^< //p'

Awk is an option here:
awk 'NR==FNR { arr[$1]="1" } NR != FNR { if (arr[$1] == "") { print $0 } } ' file2 file1
Create an array called arr, using the contents of file2 as indexes. Then with file1, look at each entry and check to see if an entry in the array arr exists. If it doesn't, print.

Count duplicates from several files

I have five files which contain some duplicate strings.
file1:
a
file2:
b
file3:
a
b
file4:
b
file5:
c
So I used awk 'NR==FNR{A[$0];next}$0 in A' file1 file2 file3 file4 file5
It prints only a, but as you can see the string b is repeated 3 times across the other files.
So how do I get all repeated strings (a and b) by comparing every file with each other, using a one-line command? Also, how do I get the number of repeats for each element?

I suggest with GNU sort and uniq:
sort file[1-5] | uniq -dc
Output:
2 a
3 b
From man uniq:
-d: only print duplicate lines
-c: prefix lines by the number of occurrences

You can use one of these:
awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
or
awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
To check for specific counts (here a appearing 2 times and b appearing 3 times):
awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line, count[line]} }' file1 file2 file3 file4 file5
Test:
$ awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
a
b
$ awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
a
b
$ awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line, count[line]} }' file1 file2 file3 file4 file5
a 2
b 3

In awk:
$ awk '{ a[$1]++ } END { for(i in a) if(a[i]>1) print i,a[i] }' file[1-5]
a 2
b 3
It counts the occurrences of each record (a single character in this case) and prints the ones with a count greater than one.
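
The counters above tally every occurrence, so a string that appears twice inside a single file is also reported. If the intent is to count the number of files that contain a string, a variant can be sketched as follows. This assumes that reading of the question; the doubled c in file5 and the temporary directory are made up for illustration.

```shell
tmp=$(mktemp -d)
printf 'a\n'    > "$tmp/file1"
printf 'b\n'    > "$tmp/file2"
printf 'a\nb\n' > "$tmp/file3"
printf 'b\n'    > "$tmp/file4"
printf 'c\nc\n' > "$tmp/file5"   # c occurs twice, but only in one file

# Count each string once per file, then report strings found in more than one file.
# The order of "for (s in files)" is unspecified, so the result is sorted.
per_file=$(awk '!seen[FILENAME, $0]++ { files[$0]++ }
     END { for (s in files) if (files[s] > 1) print s, files[s] }' "$tmp"/file[1-5] | sort)
printf '%s\n' "$per_file"
rm -rf "$tmp"
```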

How to append a column for the result set in shell script

I need a script for the scenario below. I am very new to shell scripting.
wc file1 file2
The above command gives the following result:
40 149 947 file1
2294 16638 97724 file2
Now I need the 1st, 3rd, and 4th columns of that result, plus a new column with default values:
40 947 file1 DF.tx1
2294 97724 file2 DF.rb2
The values in the last column are always known, i.e. DF.tx1 for file1 and DF.rb2 for file2.
If the filenames are given in a different order, the default values should not change.
Please help me write this script. Thanks in advance!

You can use awk:
wc file1 file2 |
awk '$4 != "total"{if ($4 ~ /file1/) f="DF.tx1"; else if ($4 ~ /file2/) f="DF.rb2";
else if ($4 ~ /file3/) f="foo.bar"; print $1, $3, $4, f}'
1 12 file1 DF.tx1
9 105 file2 DF.rb2
5 15 file3 foo.bar
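
The regex tests in the answer above ($4 ~ /file1/) also match names such as file10. A lookup table keyed on the exact filename avoids that. A minimal sketch, assuming the two filenames and tags from the question; the file contents and the temporary directory are made up for illustration:

```shell
tmp=$(mktemp -d)
printf 'a b\n'  > "$tmp/file1"   # 1 line, 4 bytes
printf 'x\ny\n' > "$tmp/file2"   # 2 lines, 4 bytes

# The files are passed in reverse order to show that each tag follows its filename.
# The "total" line is skipped because "total" is not a key in the table.
report=$(cd "$tmp" && wc file2 file1 | awk '
  BEGIN { tag["file1"] = "DF.tx1"; tag["file2"] = "DF.rb2" }
  $4 in tag { print $1, $3, $4, tag[$4] }')
printf '%s\n' "$report"
rm -rf "$tmp"
```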

script to join 2 separate text files and also add specified text

Hi there, I want to create a bash script that does the following:
I have 2 text files: one contains adf.ly links and the other recipe names.
I want to create a batch script that takes each line from each text file and produces the following:
<li>**Recipe name line 1 of txt file**</li>
<li>**Recipe name line 2 of txt file**</li>
etc., and saves all the results to another text file called LINKS.txt.
Could someone please help or point me in the direction of a Linux bash script?

This awk one-liner will do the job:
awk 'BEGIN{l="<li><a href=\"%s\">%s</a></li>\n"}NR==FNR{a[NR]=$0;next}{printf l, a[FNR],$0}' file1 file2
A more readable version (same script):
awk 'BEGIN{l="<li><a href=\"%s\">%s</a></li>\n"}
NR==FNR{a[NR]=$0;next}
{printf l, a[FNR],$0}' file1 file2
Example:
kent$ seq -f"%g from file1" 7 >file1
kent$ seq -f"%g from file2" 7 >file2
kent$ head file1 file2
==> file1 <==
1 from file1
2 from file1
3 from file1
4 from file1
5 from file1
6 from file1
7 from file1
==> file2 <==
1 from file2
2 from file2
3 from file2
4 from file2
5 from file2
6 from file2
7 from file2
kent$ awk 'BEGIN{l="<li><a href=\"%s\">%s</a></li>\n"};NR==FNR{a[NR]=$0;next}{printf l, a[FNR],$0}' file1 file2
<li><a href="1 from file1">1 from file2</a></li>
<li><a href="2 from file1">2 from file2</a></li>
<li><a href="3 from file1">3 from file2</a></li>
<li><a href="4 from file1">4 from file2</a></li>
<li><a href="5 from file1">5 from file2</a></li>
<li><a href="6 from file1">6 from file2</a></li>
<li><a href="7 from file1">7 from file2</a></li>
EDIT for the comment of OP:
if you have only one file: (the foo here is just dummy text)
awk 'BEGIN{l="<li>foo</li>\n"}{printf l,$0}' file1
output from same file1 example:
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
if you want to save the output to a file:
awk 'BEGIN{l="<li>foo</li>\n"}{printf l,$0}' file1 > newfile

Try doing this :
$ cat file1
aaaa
bbb
ccc
$ cat file2
111
222
333
$ paste file1 file2 | while read -r a b; do
printf '<li><a href="%s">%s</a></li>\n' "$a" "$b"
done | tee newfile
Output
<li><a href="aaaa">111</a></li>
<li><a href="bbb">222</a></li>
<li><a href="ccc">333</a></li>
