I want to find some strings/words from column 1 and 2 in file1 that match column 1 in file2 and replace with column 2 strings/words in file2 - linux

I'm still learning coding using Linux platform. I have search for problems similar to mine but the once I found they were either specific or focusing only on changing the entire column 1.
Here are example of my files:
File 1
abc Gamma 3.44
bcd abc 5.77
abc Alpha 1.99
beta abc 0.88
bcd Alpha 5.66
File 2
Gamma Bacteria
Alpha Bacteria
Beta Bacteria
Output file3
abc Bacteria 3.44
bcd abc 5.77
abc Bacteria 1.99
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried:
awk:
$ awk 'FNR==NR{a[$1]=$2;next} {if ($1,$2 in a){$1,$2=a[$1,$2]}; print $0}' file2 file1
$ awk 'NR==FNR {a[FNR]=$0; next} /$1|$2/ {$1 $2=a[FNR]} 1' file2 file1
They gave me:
abc Gamma 3.44
abc 5.77
abc Alpha 1.99
Bacteria abc 0.88
bcd Alpha 5.66
Only changing the $1 and remove the other text strings in column 1 which are not found in file2 $2
And this one:
$ awk -F'\t' -v OFS='\t' 'FNR==1 { next }FNR == NR { file2[$1,$2] = $1 FS $2 } FNR != NR { file1[$1,$2,] = $1 FS $2} END { print "Match:"; for (k in file1) if (k in file1) print file2[k] # Or file1[k]}' file2 file1
Didn't work
Then after i tried sed:
$ sed = file2 | sed -r 'N;s/(.*)\n(.*)/\1s|\&$|\2|/' | sed -f - file1
This gave me an error and complained about
sed -e not being called properly.
Then after take only the smallest $3 if $1 and $2 or $2 and $1 are similar
file 4
bcd abc 5.77
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried this code:
$ awk 'NR == $1&$2 || $3 < min {line = $0; min = $3}END{print line}' file3
$ awk '/^$1/{if(h){print h RS m}min=""; h=$0; next}min=="" || $3 < min{min=$3; m=$0}END{print h RS m}' file3
$ awk -F'\t' '$3 != "NF==min"' OFS='\t' file3
$ awk -v a=NODE '{c=a*$3+(1-a)} !($1 in min) || c<min[$1]{min[$1]=c; minLine[$1]=$0} END{for(k in minLine) print minLine[k]}' file3 | column -t
All didn't work and i tried to research what what does each line means and changed it to fit my problem. But they all failed

This might work for you (GNU sed):
sed -E 's#(.*) (.*)#/^\1 /Is/\\S+/\2/;/^\\S+ \1 /Is/\\S+/\2/2#' file2 |
sed -Ef - file1
Generate a sed script from file2 which is run against file1 to produce the required format.

Related

AWK to filter to files if their columns match

I basically am working with two files (file1 and file2). The goal is to write a script that pulls rows from file1, if columns 1,2,3 match between files1 and files2. Here's the code I have been playing with:
awk -F'|' 'NR==FNR{c[$1$2$3]++;next};c[$1$2$3] > 0' file1 file2 > filtered.txt
ile1 and file2 both look like this (but has many more columns):
name1 0 c
name1 1 c
name1 2 x
name2 3 x
name2 4 c
name2 5 c
The awk code I provided isn't producing any output. Any help would be appreciated!
your delimiter isn't pipe, try this
$ awk 'NR==FNR {c[$1,$2,$3]++; next} c[$1,$2,$3]' file1 file2 > filtered.txt
or
$ awk 'NR==FNR {c[$0]++; next} c[$0]' file1 file2 > filtered.txt
however, if you're matching the whole line perhaps easier with grep
$ grep -xFf file1 file2 > filtered.txt
awk '{key=$1 FS $2 FS $3} NR==FNR{file2[key];next} key in file2' file2 file1

how to use awk/sed to deal with these two files to get a result that I want

I want to use awk/sed to deal with two files(a.txt and b.txt) below and get the result
cat a.txt
a UK
b Japan
c China
d Korea
e US
And cat b.txt results
c Russia
e Canada
The result that I want is as below:
a UK
b Japan
c Russia
d Korea
e Canada
With awk:
First fill aray/hash a with complete row ($0) and use first column ($1) from this row as index. Finally, print all elements of array/hash a with a loop.
awk '{a[$1]=$0} END{for(i in a) print a[i]}' file1 file2
Output:
a UK
b Japan
c Russia
d Korea
e Canada
try:
awk 'FNR==NR{A[$1]=$NF;next} {printf("%s %s\n",$1,$1 in A?A[$1]:$NF)}' b.txt a.txt
Checking here condition FNR==NR which will be TRUE only when first file(b.txt) is being read. Then creating an array named A whose index is $1 and have the value last column. Then using printf for printing 2 strings where first string is $1 and another is if $1 of a.txt is present in array A then print array A's value whose index is $1 else print last column of a.tzt itself.
EDIT: as OP had carriage characters into Input_files so please remove them by following too.
tr -d '\r' < b.txt > temp_b.txt && mv temp_b.txt b.txt
You can use the below one-liner:
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt ) | sed 's/ --$//' | awk -F ' -- ' '{print $NF}'
We use awk to prefix each line in b.txt with a key and -- to give us a split point later:
<( awk '{print $1, "--", $0, "--"}' < b.txt )
Use the join command to join the files on common keys. The -a 1 option tells the command to
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt )
Use sed to remove the -- parts that are on some end of lines:
sed 's/ --$//'
Use awk to print the last item on each line:
awk -F ' -- ' '{print $NF}'
$ awk 'NR==FNR{b[$1]=$2;next} {print $1, ($1 in b ? b[$1] : $2)}' b.txt a.txt
a UK
b Japan
c Russia
d Korea
e Canada

Count duplicates from several files

I have five files which contain some duplicate strings.
file1:
a
file2:
b
file3:
a
b
file4:
b
file5:
c
So i used awk 'NR==FNR{A[$0];next}$0 in A' file1 file2 file3 file4 file5
And it prints $ a, but as you see there is b string 3 times repeated in other files, but print only a.
So how to get all repeated string (a b) from analysing/comparing every file with each other using one line command? Also how do I get the number of repeats for each element.
I suggest with GNU sort and uniq:
sort file[1-5] | uniq -dc
Output:
2 a
3 b
From man uniq:
-d: only print duplicate lines
-c: prefix lines by the number of occurrences
you can use one of these;
awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
or
awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
you could test this for a=3 and b=4.
awk '{count[$0]++} END {for (line in count) if ( count[line] == 3 && line == "a" || count[line] == 4 && line == "b" ) {print line} }' file1 file2 file3 file4 file5
test:
$ awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
a
b
$ awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
a
b
$ awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line, count[line]} }' 1 2 3 4 5
a 2
b 3
In awk:
$ awk '{ a[$1]++ } END { for(i in a) if(a[i]>1) print i,a[i] }' file[1-5]
a 2
b 3
It counts the occurrances of each record (character in this case) and prints out the ones with count more than one.

Difference between column sum in file 1 and column sum in file 2

I have this input
file 1 file 2
A 10 222 77.11 11
B 20 2222 1.215 22
C 30 22222 12.021 33
D 40 222222 145.00 44
The output I need is (11+22+33+44)- (10+20+30+40) = 110-100=10
thank you in advance for your help
Here you go:
paste file1.txt file2.txt | awk '{s1+=$5; s2+=$2} END {print s1-s2}'
Or better yet (clever of #falsetru's answer to do the summing with a single variable):
paste file1.txt file2.txt | awk '{sum+=$5-$2} END {print sum}'
If you want to work with column N in file1 and column M in file2, this might be "easier", but less efficient:
paste <(awk '{print $N}' file1) <(awk '{print $M}' file2.txt) | awk '{sum+=$2-$1} END {print sum}'
It's easier in the sense that you don't have to count the right position of the column in the second file, but less efficient because of the added extra awk sub-processes.
Using awk:
$ cat test.txt
file 1 file 2
A 10 222 77.11 11
B 20 2222 1.215 22
C 30 22222 12.021 33
D 40 222222 145.00 44
$ tail -n +2 test.txt | awk '{s += $5 - $2} END {print s}'
10

deleting lines from a text file with bash

I have two sets of text files. First set is in AA folder. Second set is in BB folder. The content of ff.txt file from first set(AA folder) is shown below.
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
I would like to print the second column(number) from this file if marks>60. The output will be 3,4. Next, read the ff.txt file in BB folder and delete the lines containing numbers 3,4. How can I do this with bash?
files in BB folder looks like this. second column is the number.
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
awk 'FNR == NR && $3 > 60 {array[$2] = 1; next} {if ($2 in array) next; print}' AA/ff.txt BB/filename
This works, but is not efficient (is that matter?)
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do gawk -v number=$number '$2 != number' BB/ff.txt > /tmp/ff.txt; mv /tmp/ff.txt BB/ff.txt; done
Of course, the second awk can be replaced with sed -i
For multi files:
ls -1 AA/*.txt | while read file
do
bn=`basename $file`
gawk 'BEGIN {getline} $3>60{print $2}' AA/$bn | while read number
do
gawk -v number=$number '$2 != number' BB/$bn > /tmp/$bn
mv /tmp/$bn BB/$bn
done
done
I didn't test it, so if there is a problem, please comment.

Resources