linux bash - compare two files and remove duplicate lines having same ending

I have two files containing paths to files.
File 1
/home/anybody/proj1/hello.h
/home/anybody/proj1/engine.h
/home/anybody/proj1/car.h
/home/anybody/proj1/tree.h
/home/anybody/proj1/sun.h
File 2
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/sun.h
I need a command, probably using grep, that would compare the two files and output a combination of the two, but in case of duplicate file names, keep the path from File 2.
Expected output:
/home/anybody/proj1/hello.h
/home/anybody/proj1/car.h
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/sun.h
This is so I can generate a list of include files for my project's tag database, but some files are duplicated by the build, and I don't want to have two copies of the same file in my database.

This awk command should do the job:
awk -F/ 'NR == FNR{a[$NF]=$0; next} !($NF in a); END{for (i in a) print a[i]}' file2 file1
/home/anybody/proj1/hello.h
/home/anybody/proj1/car.h
/home/anybody/proj2/module/include/map.h
/home/anybody/proj2/module/include/cat.h
/home/anybody/proj2/module/include/engine.h
/home/anybody/proj2/module/include/tree.h
/home/anybody/proj2/module/include/sun.h
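For reference, the same command run end to end against the question's sample paths (the temp directory and the `out` variable are only scaffolding for this sketch):

```shell
dir=$(mktemp -d)

printf '%s\n' \
    /home/anybody/proj1/hello.h \
    /home/anybody/proj1/engine.h \
    /home/anybody/proj1/car.h \
    /home/anybody/proj1/tree.h \
    /home/anybody/proj1/sun.h > "$dir/file1"

printf '%s\n' \
    /home/anybody/proj2/module/include/cat.h \
    /home/anybody/proj2/module/include/engine.h \
    /home/anybody/proj2/module/include/tree.h \
    /home/anybody/proj2/module/include/map.h \
    /home/anybody/proj2/module/include/sun.h > "$dir/file2"

# Pass 1 (file2): index every line by its basename ($NF with -F/).
# Pass 2 (file1): print only lines whose basename file2 did not claim.
# END: print all file2 lines, so they win every duplicate basename.
out=$(awk -F/ 'NR == FNR{a[$NF]=$0; next}
               !($NF in a)
               END{for (i in a) print a[i]}' "$dir/file2" "$dir/file1")
echo "$out"

rm -rf "$dir"
```

Note that the file2 lines printed from the `END` block come out in arbitrary (hash) order; pipe through `sort` if a stable order matters.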

This should do it as well; because file2 is read first, the first line seen for each basename wins, and the result comes out sorted:
cat file2 file1 | awk -F '/' '
{ if (a[$NF] == "") a[$NF] = $0 }
END { for (k in a) print a[k] }' | sort

Related

AWK - compare two 1-column files for matching strings then write a file with updated info

I have a problem comparing two files with different numbers of lines, and using info from one file to update the other. I have tried various examples that I found online, but none seems to work.
Hopefully you could help me with this one.
I have two files:
$ cat 1.txt
>01234
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>13920
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
(>1000 lines)
$ cat 2.txt
hbcd/efgh/z-01234/2000
hbcd/efgh/zw-11000/2000
hbcd/efgh/t-13290/2000
...
...
(<1000 lines)
My intention is to produce an updated 1.txt, with matched header lines replaced but unmatched lines kept, saved as a new file as follows:
$ cat 3.txt
>hbcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>hbcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
...
...
I have tried something like this:
awk 'NR==FNR{a[$0]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 1.txt 2.txt > 3.txt
or like this (to search based on the substring in 1.txt):
awk 'NR==FNR{a[substr($0,2,5)]=$0;next}{a[$1]=$0}END{for (i in a) print a[i]}' 2.txt 1.txt > 3.txt
but then I get mixed-up lines in the output file, something like this (or even without the lines from 2.txt, respectively):
01234
17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
abcd/efgh/t-13290/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
I haven't used awk for a very long time and I'm not sure how the arrays and keys work.
Update:
I have tried to write an awk script to do the above. The matching condition works, but somehow I still have a problem writing the lines from 1.txt that don't match the ones from 2.txt.
BEGIN{
    i = 0;
    j = 0;
    k = 0;
    maxi = 0;
    maxj = 0;
    maxk = 0;
    FS = "/";
}
FILENAME == ARGV[1]{
    header1 = substr($0,1,1);
    if(header1 == ">"){
        ++maxi;
        seqcode1[maxi] = substr($0,2,5);
        # printf("%s\n",seqcode1[maxi]);
    }
    else if(header1 != ">"){
        ++maxk;
        seqFASTA[maxk] = $0;
        # print seqFASTA[maxk];
    }
}
FILENAME == ARGV[2]{
    header2 = substr($0,1,1);
    if(header2 == "h"){
        ++maxj;
        wholename[maxj] = $0;
        seqcode2[maxj] = substr($3,4,5);
        # printf("%s\n",seqcode2[maxj]);
    }
}
END{
    for(i=1; i<=maxi=maxk; i++){
        for(j=1; j<=maxj; j++){
            if(seqcode1[i] == seqcode2[j]) {
                printf("%s %s %s\n",seqcode1[i],seqcode2[j],wholename[j]);
            }
            else
                print seqcode1[i];
            print seqFASTA[k];
        }
    }
}
I think the problem may be with declaring seqFASTA but I'm not sure where.
Thank you very much!
M.
I'm assuming 13920 should be 13290 in 1.txt.
$ awk 'NR==FNR{split($0, a, "/"); sub(/^[^-]+-/, "", a[3]); map[a[3]]=$0; next}
(k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt
>hbcd/efgh/z-01234/2000
NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA
>17321
SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA
>hbcd/efgh/t-13290/2000
ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN
Here are some alternate solutions:
# with GNU awk
awk 'NR==FNR{match($0, /-([0-9]+)/, a); map[a[1]]=$0; next}
(k=substr($0, 2)) in map{$0 = ">" map[k]} 1' 2.txt 1.txt
# assuming '/' and '-' will always be similar to given sample
awk -F'[/-]' 'NR==FNR{map[$4]=">"$0; next}
$2 in map{$0 = map[$2]} 1' 2.txt FS='>' 1.txt
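To check the logic end to end, here is the first command run against the sample data (with 13920 corrected to 13290 per the assumption above; the temp directory is just for the sketch):

```shell
dir=$(mktemp -d)

# The question's sample files, with 13920 corrected to 13290 as assumed.
printf '%s\n' \
    '>01234' \
    'NNNNNNNNAAAAAANNNNNNAAAAANNNNAAAAANNNNAA' \
    '>17321' \
    'SSSSSSKKKKKKLLLLLIIIIIIMMMMMMNNNNNNAAAA' \
    '>13290' \
    'ZZZZZZZYYYYYYYAAAAAAABBBBBBBCCCCNNNNNNNNNN' > "$dir/1.txt"
printf '%s\n' \
    'hbcd/efgh/z-01234/2000' \
    'hbcd/efgh/zw-11000/2000' \
    'hbcd/efgh/t-13290/2000' > "$dir/2.txt"

# Pass 1 (2.txt): the third '/'-field is e.g. "z-01234"; stripping the
# leading "letters-" prefix leaves the bare id, mapped to the full line.
# Pass 2 (1.txt): when the text after ">" is a known id, replace the
# whole line with ">" plus the mapped path; the trailing 1 prints all.
out=$(awk 'NR==FNR{split($0, a, "/"); sub(/^[^-]+-/, "", a[3]); map[a[3]]=$0; next}
           (k=substr($0, 2)) in map{$0 = ">" map[k]} 1' "$dir/2.txt" "$dir/1.txt")
echo "$out"
rm -rf "$dir"
```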

How do I grep a string on multiple files only if the string is present in all of the files?

I have around 20 files. The first column of each file contains ids (ID0001, ID0056, ID0165 etc). I have a list file that contains all possible ids. I want to find the ids from that list that are present in all the files. Is there a way to use grep for this? So far, if I use the command:
grep "id_name" file*.txt
it prints the id even if it is present in only one file.
There is a simple grep pipeline that you can do, but it is a bit cumbersome to write down:
cut -f1 file1 | grep -Ff - file2 | grep -Ff - file3 | grep -Ff - file4 ...
Another way is using awk:
awk '{a[$1]++}END{for(i in a) if (a[i]==ARGC-1) print i}' file1 file2 file3 ...
The latter assumes that the ids are unique within each file.
If they are not unique, it is a bit more tricky:
awk '(FNR==1){delete b}!($1 in b){a[$1]++;b[$1]}END{for(i in a) if (a[i]==ARGC-1) print i }' file1 file2 file3 ...
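A quick sketch of the duplicate-safe version with made-up ids (the file names and ids below are illustrative only); note that `delete b` on a whole array is not strictly POSIX, but gawk, mawk and busybox awk all accept it:

```shell
dir=$(mktemp -d)

# Invented ids; f2 deliberately repeats id2, and id4 is missing from f1/f3.
printf '%s\n' 'id1 x' 'id2 x' 'id3 x'         > "$dir/f1"
printf '%s\n' 'id2 y' 'id2 y' 'id3 y' 'id4 y' > "$dir/f2"
printf '%s\n' 'id3 z' 'id2 z'                 > "$dir/f3"

# b marks ids already seen in the current file (cleared on each file's
# first record), so a[$1] counts files rather than lines; ARGC-1 is the
# number of file arguments, i.e. only ids present in every file print.
out=$(awk '(FNR==1){delete b}
           !($1 in b){a[$1]++; b[$1]}
           END{for(i in a) if (a[i]==ARGC-1) print i}' \
          "$dir/f1" "$dir/f2" "$dir/f3" | sort)
echo "$out"
rm -rf "$dir"
```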
Say you have a list of all the ids in a file ids_list.txt, with each id on its own line, like
id001
id101
id201
...
And all the files you want to search are in the folder data. In this scenario, this little script should be able to help you:
#!/bin/bash
all_ids="";
for i in `cat ids_list.txt`; do
all_ids="$all_ids|$i"
done
all_ids=`echo $all_ids|sed -e 's/^|//'`
grep -Pir "^($all_ids)[\s,]+" data
Its output would be like:
data/f1:id001, ssd
data/f3:id201, some data
...
This may be what you're trying to do but without sample input/output it's an untested guess:
awk '
!seen[FILENAME,$1]++ {
cnt[$1]++
}
END {
for (id in cnt) {
if ( cnt[id] == (ARGC-1) ) {
print id
}
}
}
' list file*
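Filling in the missing sample input for this guess (the `list` file and the data files below are invented), note that `list` itself counts as one of the ARGC-1 files, so an id must appear in the list and in every data file:

```shell
dir=$(mktemp -d)

# Invented sample data: 'list' holds every possible id, one per line.
printf '%s\n' 'ID0001' 'ID0056' 'ID0165' 'ID0200' > "$dir/list"
printf '%s\n' 'ID0001 a' 'ID0056 a' 'ID0165 a'    > "$dir/file1.txt"
printf '%s\n' 'ID0056 b' 'ID0165 b' 'ID0165 b'    > "$dir/file2.txt"
printf '%s\n' 'ID0165 c' 'ID0056 c' 'ID0200 c'    > "$dir/file3.txt"

# seen[FILENAME,$1] de-duplicates within one file; cnt[$1] then counts
# in how many of the ARGC-1 input files (list included) each id occurs.
out=$(awk '
    !seen[FILENAME,$1]++ {
        cnt[$1]++
    }
    END {
        for (id in cnt) {
            if ( cnt[id] == (ARGC-1) ) {
                print id
            }
        }
    }
' "$dir/list" "$dir/file1.txt" "$dir/file2.txt" "$dir/file3.txt" | sort)
echo "$out"
rm -rf "$dir"
```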

Replacing a substring in certain lines of file A by string in file B

I have 2 files at the moment, file A and file B. Certain lines in file A are substrings of some line in file B. I would like to replace those lines with the corresponding full string from file B.
Example of file A:
#Name_1
foobar
info_for_foobar
evenmoreinfo_for_foobar
#Name_2
foobar2
info_for_foobar2
evenmoreinfo_for_foobar2
Example of file B:
#Name_1_Date_Place
#Name_2_Date_Place
The desired output I would like:
#Name_1_Date_Place
foobar
info_for_foobar
evenmoreinfo_for_foobar
#Name_2_Date_Place
foobar2
info_for_foobar2
evenmoreinfo_for_foobar2
What I have so far:
I was able to get the order of the names in file B corresponding to those in file A, so I was thinking of using a while loop that goes through every line of file B and then finds and replaces the corresponding substring in file A with that line, but I'm not sure how to put this into a bash script.
The code I have so far, which is not giving the desired output:
grep '#' fileA.txt > fileAname.txt
while read line
do
replace="$(grep '$line' fileB.txt)"
sed -i 's/'"$line"'/'"$replace"'/g' fileA.txt
done < fileAname.txt
Anybody has an idea?
Thanks in advance!
You can do it with this script:
awk 'NR==FNR {str[i]=$0;i++;next;} $1~"^#" {for (i in str) {if(match(str[i],$1)){print str[i];next;}}} {print $0}' B.txt A.txt
This awk should work for you:
awk 'FNR==NR{a[$0]; next} {for (i in a) if (index(i, $0)) $0=i} 1' fileB fileA
#Name_1_Date_Place
foobar
info_for_foobar
evenmoreinfo_for_foobar
#Name_2_Date_Place
foobar2
info_for_foobar2
evenmoreinfo_for_foobar2
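Run against the question's two sample files, the one-liner above reproduces the desired output; here is a self-contained sketch (the temp directory is only scaffolding):

```shell
dir=$(mktemp -d)

printf '%s\n' '#Name_1' 'foobar' 'info_for_foobar' 'evenmoreinfo_for_foobar' \
              '#Name_2' 'foobar2' 'info_for_foobar2' 'evenmoreinfo_for_foobar2' > "$dir/fileA"
printf '%s\n' '#Name_1_Date_Place' '#Name_2_Date_Place' > "$dir/fileB"

# Pass 1 (fileB): store each full line as an array key.
# Pass 2 (fileA): if the current line is a substring (index() != 0) of
# any fileB line, replace it with that full line; 1 prints every line.
out=$(awk 'FNR==NR{a[$0]; next}
           {for (i in a) if (index(i, $0)) $0=i} 1' "$dir/fileB" "$dir/fileA")
echo "$out"
rm -rf "$dir"
```

One caveat: because the test is plain substring containment, a short line such as `#Name_1` would also match a hypothetical `#Name_11_Date_Place`; with the sample data that ambiguity does not arise.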

Comparing two files and updating second file using bash and awk and sorting the second file

I have two files with two columns each, and I want to compare the 1st column of both files. If a value in the 1st column of the first file does not exist in the second file, I want to append that value to the second file, e.g.
firstFile.log
1457935407,998181
1457964225,998191
1457969802,997896
secondFile.log
1457966024,1
1457967635,1
1457969802,5
1457975246,2
Afterwards, secondFile.log should look like:
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
Note: Second file should be sorted by the first column after being updated.
Using awk and sort:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]; next} {delete a[$1]; print} END{
for (i in a) print i, "null"}' firstFile.log secondFile.log |
sort -t, -k1 > $$.temp && mv $$.temp secondFile.log
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
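The same pipeline sketched end to end with the question's data, printing to stdout instead of overwriting secondFile.log (the temp directory is only for the demo):

```shell
dir=$(mktemp -d)

printf '%s\n' '1457935407,998181' '1457964225,998191' '1457969802,997896' > "$dir/firstFile.log"
printf '%s\n' '1457966024,1' '1457967635,1' '1457969802,5' '1457975246,2' > "$dir/secondFile.log"

# Pass 1: remember every key from firstFile.log.
# Pass 2: print secondFile.log as-is, deleting keys it already covers.
# END: whatever keys remain were missing, so emit them with "null".
out=$(awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]; next} {delete a[$1]; print} END{
       for (i in a) print i, "null"}' "$dir/firstFile.log" "$dir/secondFile.log" |
      sort -t, -k1)
echo "$out"
rm -rf "$dir"
```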
Using non-awk tools:
$ sort -t, -uk1,1 file2 <(sed 's/,.*/,null/' file1)
1457935407,null
1457964225,null
1457966024,1
1457967635,1
1457969802,5
1457975246,2
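Worth noting for the non-awk version: which of two lines with an equal key survives `-u` is implementation-defined; GNU sort keeps the first one in input order, which is why listing file2 before the null-padded file1 makes file2's values win. A sketch of the same idea without process substitution (so it also runs under plain sh):

```shell
dir=$(mktemp -d)

printf '%s\n' '1457935407,998181' '1457964225,998191' '1457969802,997896' > "$dir/file1"
printf '%s\n' '1457966024,1' '1457967635,1' '1457969802,5' '1457975246,2' > "$dir/file2"

# Pad file1 down to "key,null" lines first (a temp file stands in for
# the <(...) process substitution, which needs bash).
sed 's/,.*/,null/' "$dir/file1" > "$dir/nulls"

# -u keeps one line per key field; with GNU sort that is the first
# occurrence in input order, so file2's real values beat the nulls.
out=$(sort -t, -uk1,1 "$dir/file2" "$dir/nulls")
echo "$out"
rm -rf "$dir"
```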

Removing Fields from file keeping delimiter intact

We have a requirement to remove certain fields inside a delimited file using a shell script, but we do not want to lose the delimiter, i.e.
$cat file1
col1,col2,col3,col4,col5
1234,4567,7890,9876,6754
abcd,efgh,ijkl,rtyt,cvbn
now we need to generate a different file (say file2) out of file1, with the second and fourth columns removed but the delimiter (,) intact,
i.e
$cat file2
col1,,col3,,col5
1234,,7890,,6754
abcd,,ijkl,,cvbn
Please suggest what would be the easiest and most efficient way of achieving this; also, as the file has around 300 fields/columns, awk is not working because of its limitation on the number of fields.
awk '{$2 = $4 = ""}1' FS=, OFS=, file1 > file2
awk 'BEGIN { FS = OFS = ","; } { $2 = $4 = ""; print }' file1 > file2
Which, simply put:
sets input & output field separator to , at start,
empties fields 2 & 4 for each line, and then prints the line back.
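For instance, with the question's file1 the blanked fields keep their commas in place (the temp directory is only for the demo):

```shell
dir=$(mktemp -d)

printf '%s\n' 'col1,col2,col3,col4,col5' \
              '1234,4567,7890,9876,6754' \
              'abcd,efgh,ijkl,rtyt,cvbn' > "$dir/file1"

# Emptying a field leaves its separators behind, which is exactly the
# "delimiter intact" requirement.
out=$(awk 'BEGIN { FS = OFS = ","; } { $2 = $4 = ""; print }' "$dir/file1")
echo "$out"
rm -rf "$dir"
```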
