Count duplicates from several files - linux

I have five files which contain some duplicate strings.
file1:
a
file2:
b
file3:
a
b
file4:
b
file5:
c
So I used awk 'NR==FNR{A[$0];next}$0 in A' file1 file2 file3 file4 file5
and it prints a. But as you can see, the string b is repeated 3 times across the other files, yet only a is printed.
So how do I get every repeated string (a and b) by comparing all of the files against each other with a one-line command? And how do I get the number of repeats for each element?

I suggest GNU sort and uniq:
sort file[1-5] | uniq -dc
Output:
2 a
3 b
From man uniq:
-d: only print duplicate lines
-c: prefix lines by the number of occurrences
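If you also want the most frequent duplicates listed first, the count that -c prefixes can be sorted numerically in a second pass; a small extension of the same pipeline:
sort file[1-5] | uniq -dc | sort -rn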

You can use one of these:
awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
or
awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
You could also test for specific counts; with these files a occurs 2 times and b occurs 3 times:
awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line} }' file1 file2 file3 file4 file5
test:
$ awk '{count[$0]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file1 file2 file3 file4 file5
a
b
$ awk 'seen[$0]++ == 1' file1 file2 file3 file4 file5
a
b
$ awk '{count[$0]++} END {for (line in count) if ( count[line] == 2 && line == "a" || count[line] == 3 && line == "b" ) {print line, count[line]} }' file1 file2 file3 file4 file5
a 2
b 3

In awk:
$ awk '{ a[$1]++ } END { for(i in a) if(a[i]>1) print i,a[i] }' file[1-5]
a 2
b 3
It counts the occurrences of each record (a single character in this case) and prints the ones whose count is greater than one.
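A possible variant, in case duplicates inside a single file should only count once: the sketch below counts the number of distinct files each string appears in rather than the total number of occurrences (this reading of "comparing every file with each other" is an assumption, not something the question states):
awk '!seen[FILENAME, $0]++ { files[$0]++ }
     END { for (s in files) if (files[s] > 1) print s, files[s] }' file[1-5]
For the sample files this also prints a 2 and b 3, since no file repeats a string internally.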

Related

I want to find strings/words in columns 1 and 2 of file1 that match column 1 of file2, and replace them with the column 2 strings/words from file2

I'm still learning to code on the Linux platform. I have searched for problems similar to mine, but the ones I found were either too specific or focused only on changing the entire column 1.
Here are example of my files:
File 1
abc Gamma 3.44
bcd abc 5.77
abc Alpha 1.99
beta abc 0.88
bcd Alpha 5.66
File 2
Gamma Bacteria
Alpha Bacteria
Beta Bacteria
Output file3
abc Bacteria 3.44
bcd abc 5.77
abc Bacteria 1.99
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried:
awk:
$ awk 'FNR==NR{a[$1]=$2;next} {if ($1,$2 in a){$1,$2=a[$1,$2]}; print $0}' file2 file1
$ awk 'NR==FNR {a[FNR]=$0; next} /$1|$2/ {$1 $2=a[FNR]} 1' file2 file1
They gave me:
abc Gamma 3.44
abc 5.77
abc Alpha 1.99
Bacteria abc 0.88
bcd Alpha 5.66
They only changed $1 and removed the other strings in column 1 that are not found in $2 of file2.
And this one:
$ awk -F'\t' -v OFS='\t' 'FNR==1 { next }FNR == NR { file2[$1,$2] = $1 FS $2 } FNR != NR { file1[$1,$2,] = $1 FS $2} END { print "Match:"; for (k in file1) if (k in file1) print file2[k] # Or file1[k]}' file2 file1
Didn't work
Then I tried sed:
$ sed = file2 | sed -r 'N;s/(.*)\n(.*)/\1s|\&$|\2|/' | sed -f - file1
This gave me an error complaining that sed -e was not called properly.
Then, as a next step, I want to keep only the row with the smallest $3 when $1 and $2 (or $2 and $1) are similar:
file 4
bcd abc 5.77
Bacteria abc 0.88
bcd Bacteria 5.66
I have tried this code:
$ awk 'NR == $1&$2 || $3 < min {line = $0; min = $3}END{print line}' file3
$ awk '/^$1/{if(h){print h RS m}min=""; h=$0; next}min=="" || $3 < min{min=$3; m=$0}END{print h RS m}' file3
$ awk -F'\t' '$3 != "NF==min"' OFS='\t' file3
$ awk -v a=NODE '{c=a*$3+(1-a)} !($1 in min) || c<min[$1]{min[$1]=c; minLine[$1]=$0} END{for(k in minLine) print minLine[k]}' file3 | column -t
None of these worked. I tried to research what each line means and changed it to fit my problem, but they all failed.
This might work for you (GNU sed):
sed -E 's#(.*) (.*)#/^\1 /Is/\\S+/\2/;/^\\S+ \1 /Is/\\S+/\2/2#' file2 |
sed -Ef - file1
This generates a sed script from file2, which is then run against file1 to produce the required output.
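For comparison, a plain awk sketch of the same replacement; it assumes whitespace-separated fields and matches the names from file2 case-insensitively (needed for the lowercase beta in file1):
awk 'NR==FNR { map[tolower($1)] = $2; next }
     { if (tolower($1) in map) $1 = map[tolower($1)]
       if (tolower($2) in map) $2 = map[tolower($2)]
       print }' file2 file1 > file3
For the follow-up step (keep only the row with the smallest $3 when $1 and $2 match in either order), one possible sketch builds an order-independent key; note that the output order of the END loop is not guaranteed:
awk '{ key = ($1 < $2) ? $1 SUBSEP $2 : $2 SUBSEP $1 }
     !(key in min) || $3+0 < min[key]+0 { min[key] = $3; line[key] = $0 }
     END { for (k in line) print line[k] }' file3 > file4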

AWK to filter two files if their columns match

I am basically working with two files (file1 and file2). The goal is to write a script that pulls rows from file1 if columns 1, 2, and 3 match between file1 and file2. Here's the code I have been playing with:
awk -F'|' 'NR==FNR{c[$1$2$3]++;next};c[$1$2$3] > 0' file1 file2 > filtered.txt
file1 and file2 both look like this (but have many more columns):
name1 0 c
name1 1 c
name1 2 x
name2 3 x
name2 4 c
name2 5 c
The awk code I provided isn't producing any output. Any help would be appreciated!
Your delimiter isn't a pipe; try this:
$ awk 'NR==FNR {c[$1,$2,$3]++; next} c[$1,$2,$3]' file1 file2 > filtered.txt
or
$ awk 'NR==FNR {c[$0]++; next} c[$0]' file1 file2 > filtered.txt
However, if you're matching the whole line, it is perhaps easier with grep:
$ grep -xFf file1 file2 > filtered.txt
awk '{key=$1 FS $2 FS $3} NR==FNR{file2[key];next} key in file2' file2 file1
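As a side note on why the comma form is safer even once the delimiter is fixed: c[$1,$2,$3] joins the fields with awk's SUBSEP character, whereas plain concatenation like the original c[$1$2$3] can make distinct rows collide. A small illustration with hypothetical values:
# hypothetical rows "name1 23 x" and "name12 3 x" both concatenate to "name123x",
# so c[$1$2$3] would treat them as the same key; the SUBSEP form keeps them apart
awk 'NR==FNR {c[$1,$2,$3]; next} ($1,$2,$3) in c' file1 file2 > filtered.txt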

How to append a column for the result set in shell script

I need a script for the scenario below. I am very new to shell scripting.
wc file1 file2
The above command gives the following output:
40 149 947 file1
2294 16638 97724 file2
Now I need the result as follows: the 1st column, the 3rd column, and the 4th column of the above output, plus a new column with default values:
40 947 file1 DF.tx1
2294 97724 file2 DF.rb2
Here the last column values are always known, i.e. DF.tx1 for file1 and DF.rb2 for file2.
If the filenames are given in any order, the default values should not change.
Please help me write this script. Thanks in advance!
You can use awk:
wc file1 file2 |
awk '$4 != "total"{if ($4 ~ /file1/) f="DF.tx1"; else if ($4 ~ /file2/) f="DF.rb2";
else if ($4 ~ /file3/) f="foo.bar"; print $1, $3, $4, f}'
1 12 file1 DF.tx1
9 105 file2 DF.rb2
5 15 file3 foo.bar
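A slightly more table-driven sketch of the same idea, keeping the filename-to-default mapping in one place (the DF.tx1/DF.rb2 values are the ones given in the question; add more entries to the BEGIN block as needed):
wc file1 file2 |
awk 'BEGIN { def["file1"] = "DF.tx1"; def["file2"] = "DF.rb2" }
     $4 != "total" { print $1, $3, $4, def[$4] }'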

How to store mismatching rows from two files in a new file

I have two input files as follows, and I need to write the mismatching rows from the second file to a new file. Each column in the files is separated by a tab.
Input 1
1 94564350 . C A
1 94564350 . C T
Input 2
1 94564351 . A T
1 94564351 . A C
1 94564350 . C A
and the Output is
1 94564351 . A T
1 94564351 . A C
I have tried this command
awk -F"\t" 'NR==FNR{a[$0];next}($2 in a)&& $1>=3' fileB fileA >fileC
but it is not working.
awk 'NR == FNR{a[$0];next} !($0 in a)' fileA fileB
The above command is also taking too much time for big files; are there any other options to do the same?
Try this taken from Idiomatic awk:
awk 'NR == FNR{a[$0];next} !($0 in a)' fileA fileB
You don't need -F"\t" here: since whole lines ($0) are compared, the field separator does not matter.
Test
$ awk 'NR == FNR{a[$0];next} !($0 in a)' fileA fileB
1 94564351 . A T
1 94564351 . A C
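Since the follow-up mentions that big files are slow, a fixed-string grep is also worth trying; this sketch assumes whole-line comparison, as above:
grep -vxFf fileA fileB > fileC
-F treats the patterns from fileA as fixed strings, -x matches whole lines only, and -v keeps the lines of fileB that match none of them.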

script to join 2 separate text files and also add specified text

Hi there, I want to create a bash script that does the following:
I have 2 text files, one with adf.ly links and the other with recipe names.
I want to create a batch script that takes each line from each text file and does the following:
<li>**Recipe name line 1 of txt file**</li>
<li>**Recipe name line 2 of txt file**</li>
etc., and saves all the results to another text file called LINKS.txt.
Someone please help or point me in the direction of a Linux bash script.
This awk one-liner will do the job:
awk 'BEGIN{l="<li>%s</li>\n<li>%s</li>\n"}NR==FNR{a[NR]=$0;next}{printf l, a[FNR],$0}' file1 file2
A clearer version (same script):
awk 'BEGIN{l="<li>%s</li>\n<li>%s</li>\n"}
NR==FNR{a[NR]=$0;next}
{printf l, a[FNR],$0}' file1 file2
example:
kent$ seq -f"%g from file1" 7 >file1
kent$ seq -f"%g from file2" 7 >file2
kent$ head file1 file2
==> file1 <==
1 from file1
2 from file1
3 from file1
4 from file1
5 from file1
6 from file1
7 from file1
==> file2 <==
1 from file2
2 from file2
3 from file2
4 from file2
5 from file2
6 from file2
7 from file2
kent$ awk 'BEGIN{l="<li>%s</li>\n<li>%s</li>\n"};NR==FNR{a[NR]=$0;next}{printf l, a[FNR],$0}' file1 file2
<li>1 from file1</li>
<li>1 from file2</li>
<li>2 from file1</li>
<li>2 from file2</li>
<li>3 from file1</li>
<li>3 from file2</li>
<li>4 from file1</li>
<li>4 from file2</li>
<li>5 from file1</li>
<li>5 from file2</li>
<li>6 from file1</li>
<li>6 from file2</li>
<li>7 from file1</li>
<li>7 from file2</li>
EDIT, following the OP's comment:
If you have only one file (the foo here is just dummy text):
awk 'BEGIN{l="<li>foo</li>\n"}{printf l,$0}' file1
output from same file1 example:
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
<li>foo</li>
if you want to save the output to a file:
awk 'BEGIN{l="<li>foo</li>\n"}{printf l,$0}' file1 > newfile
Try doing this:
$ cat file1
aaaa
bbb
ccc
$ cat file2
111
222
333
$ paste file1 file2 | while read a b; do
printf '<li>%s</li>\n' "$a" "$b"
done | tee newfile
Output
<li>aaaa</li>
<li>111</li>
<li>bbb</li>
<li>222</li>
<li>ccc</li>
<li>333</li>
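A paste-free bash sketch of the same loop, reading the two files in lock-step on separate file descriptors and writing straight to LINKS.txt as the question asked:
# read one line from each file per iteration; stops at the shorter file
while IFS= read -r a <&3 && IFS= read -r b <&4; do
    printf '<li>%s</li>\n' "$a" "$b"
done 3<file1 4<file2 > LINKS.txt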
