Two text file comparison with grep

I have two files (a.txt, b.txt).
a.txt is a list of English words (one word in every row).
b.txt contains in every row: a number, a space character, and a 5-65 character long string
(for example, b.txt can contain: 1234 dsafaaraehawada).
I would like to know which rows in b.txt contain words from a.txt, and how many of them each row contains.
Example input:
a.txt
green
apple
bar
b.txt
1212 greensdsdappleded
12124 dfsfsd
123 bardws
output:
2 1212 greensdsdappleded
1 123 bardws
The first row contains 'green' and 'apple' (2).
The second row contains nothing.
The third row contains 'bar' (1).
That's all I would like to know.
The code (by Mr. Barmar):
grep -F -o -f a.txt b.txt | sort | uniq -c | sort -nr
But it needs to be modified.
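For reference, on the sample input that pipeline counts each matched word across the whole file rather than per row, printing something like:
$ grep -F -o -f a.txt b.txt | sort | uniq -c | sort -nr
1 green
1 bar
1 apple
so it has to be reworked into a per-row count, as the answers below do.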

Try something like this:
awk 'NR==FNR{A[$1]; next} {t=0; for (i in A) t+=gsub(i,"&",$2)} t{print t, $0}' a.txt b.txt
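Note that gsub() counts every occurrence, so a row containing the same word twice adds 2 to the total, whereas the index() approach in the next answer counts each word at most once per row. Each word from a.txt is also treated as a regular expression here, which makes no difference for plain English words.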

Try something like this:
awk '
NR==FNR { list[$1]++; next }
{
    cnt=0
    for(word in list) {
        if(index($2,word) > 0)
            cnt++
    }
    if(cnt>0)
        print cnt,$0
}' a.txt b.txt
Test:
$ cat a.txt
green
apple
bar
$ cat b.txt
1212 greensdsdappleded
12124 dfsfsd
123 bardws
$ awk '
NR==FNR { list[$1]++; next }
{
    cnt=0
    for(word in list) {
        if(index($2,word) > 0)
            cnt++
    }
    if(cnt>0)
        print cnt,$0
}' a.txt b.txt
2 1212 greensdsdappleded
1 123 bardws

How to print only words that don't match between two files? [duplicate]

FILE1:
cat
dog
house
tree
FILE2:
dog
cat
tree
I need only this to be printed:
house
$ cat file1
cat
dog
house
tree
$ cat file2
dog
cat
tree
$ grep -vF -f file2 file1
house
The -v flag shows only non-matches, -f names a file of patterns to use as a filter, and -F treats those patterns as fixed strings (it doesn't slow things down with any regex matching).
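One caveat: even with -F, grep matches substrings anywhere in a line, so cat in file2 would also filter out a line like caterpillar from file1. Since both files hold one word per line, adding -x to require whole-line matches is a safe refinement:
grep -vFx -f file2 file1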
Using awk
awk 'FNR==NR{arr[$0]=1; next} !($0 in arr)' FILE2 FILE1
First build an associative array with the words from FILE2, then loop over FILE1 and print only the lines that are not in the array.
Using comm
comm -2 -3 <(sort FILE1) <(sort FILE2)
-2 suppresses lines unique to FILE2 and -3 suppresses lines found in both.
If you want just the words, you can sort the files, diff them, then use sed to filter out diff's symbols:
diff <(sort file1) <(sort file2) | sed -n '/^</s/^< //p'
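On the sample files this prints exactly the word unique to file1:
$ diff <(sort file1) <(sort file2) | sed -n '/^</s/^< //p'
house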
Awk is an option here:
awk 'NR==FNR { arr[$1]="1" } NR != FNR { if (arr[$1] == "") { print $0 } } ' file2 file1
Create an array called arr, using the contents of file2 as indexes. Then, for each entry in file1, check whether it exists in arr; if it doesn't, print it.

How to use awk/sed to deal with these two files to get the result I want

I want to use awk/sed to process the two files (a.txt and b.txt) below and get the result that follows.
cat a.txt
a UK
b Japan
c China
d Korea
e US
And cat b.txt shows:
c Russia
e Canada
The result that I want is as below:
a UK
b Japan
c Russia
d Korea
e Canada
With awk:
First fill array/hash a with the complete row ($0), using the first column ($1) of the row as the index. Since file2 is read after file1, rows from file2 overwrite earlier entries with the same key. Finally, print all elements of array/hash a with a loop. (Note that for (i in a) does not guarantee any output order.)
awk '{a[$1]=$0} END{for(i in a) print a[i]}' file1 file2
Output:
a UK
b Japan
c Russia
d Korea
e Canada
Try:
awk 'FNR==NR{A[$1]=$NF;next} {printf("%s %s\n",$1,$1 in A?A[$1]:$NF)}' b.txt a.txt
The condition FNR==NR is TRUE only while the first file (b.txt) is being read; during that pass we create an array named A whose index is $1 and whose value is the last column. Then printf prints two strings: the first is $1, and the second is A[$1] if $1 of a.txt is present in array A, else the last column of a.txt itself.
EDIT: since the OP had carriage-return characters in the input files, remove them first as follows:
tr -d '\r' < b.txt > temp_b.txt && mv temp_b.txt b.txt
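Equivalently, GNU sed can strip the carriage returns in place:
sed -i 's/\r$//' b.txt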
You can use the below one-liner:
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt ) | sed 's/ --$//' | awk -F ' -- ' '{print $NF}'
We use awk to prefix each line in b.txt with a key and -- to give us a split point later:
<( awk '{print $1, "--", $0, "--"}' < b.txt )
Use the join command to join the files on common keys. The -a 1 and -a 2 options tell join to also print unpairable lines from file 1 and file 2 respectively:
join -a 1 -a 2 a.txt <( awk '{print $1, "--", $0, "--"}' < b.txt )
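On the sample files the join output at this stage looks like this (unpairable lines pass through unchanged):
a UK
b Japan
c China -- c Russia --
d Korea
e US -- e Canada --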
Use sed to remove the -- parts that are on some end of lines:
sed 's/ --$//'
Use awk to print the last item on each line:
awk -F ' -- ' '{print $NF}'
$ awk 'NR==FNR{b[$1]=$2;next} {print $1, ($1 in b ? b[$1] : $2)}' b.txt a.txt
a UK
b Japan
c Russia
d Korea
e Canada

Compare if two columns of a file are identical in Linux

I would like to check whether two columns (mid) in a file are identical to each other. I am not sure how to do it, since the original file I am working on is rather huge (in GB).
file 1 (check whether column 1 and column 4, the mid columns, are identical)
mid A1 A2 mid A3 A4 A5 A6
18 we gf 18 32 23 45 89
19 ew fg 19 33 24 46 90
21 ew fg 21 35 26 48 92
Thanks
M
If you just need to find the differing rows, awk will do:
awk '$1!=$4{print $1,$4}' data
You can also check using diff on the two columns extracted with awk:
diff <(awk '{print $1}' data) <(awk '{print $4}' data)
The status code ($?) of this command will tell you whether they are the same (zero) or different (non-zero).
You can use that in a shell conditional like this too:
if diff <(awk '{print $1}' data) <(awk '{print $4}' data) >& /dev/null;
then
echo same;
else
echo different;
fi
Something like this:
awk '{ if ($1 == $4) { print "same"; } else { print "different"; } }' < foo.txt
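Since the original file is in the GB range, here is a sketch that stops reading at the first mismatch instead of printing a verdict per row (note that exit in awk still triggers the END block, hence the flag):
awk '$1 != $4 { diff=1; exit } END { print (diff ? "different" : "same") }' foo.txt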
Adding a little to Shiplu Mokaddim's answer: if you have another delimiter, for example in a CSV file, you can use:
awk -F';' '$1!=$4{print $1,$4}' data.csv | sed 's/ /;/g'
In this sample the delimiter is ";" (note it must be quoted so the shell does not treat it as a command separator). The sed command at the end replaces the delimiter back to the original one. Be sure you don't have other spaces in your data, e.g. a date and time.
Question: compare two column values in the same file.
Answer:
cut -d, -f1 a.txt > b.txt ; cut -d, -f3 a.txt > c.txt ; cmp b.txt c.txt && echo "Column values are same"; rm -f b.txt c.txt
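A variant of the same idea that avoids the temporary files, assuming a shell with process substitution (bash, zsh):
cmp -s <(cut -d, -f1 a.txt) <(cut -d, -f3 a.txt) && echo "Column values are same"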

Linux command to get the last appearance of a string in a text file

I want to find the last appearance of a string in a text file with Linux commands. For example:
1 a 1
2 a 2
3 a 3
1 b 1
2 b 2
3 b 3
1 c 1
2 c 2
3 c 3
In such a text file, I want to find the line number of the last appearance of b, which is 6.
I can find the first appearance with
awk '/ b / {print NR;exit}' textFile.txt
but I have no idea how to do it for the last occurrence.
cat -n textfile.txt | grep " b " | tail -1 | cut -f 1
cat -n prints the file to STDOUT prepending line numbers.
grep greps out all lines containing "b" (you can use egrep for more advanced patterns or fgrep for faster grep of fixed strings)
tail -1 prints last line of those lines containing "b"
cut -f 1 prints first column, which is line # from cat -n
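A shorter variant of the same idea, letting grep do the numbering itself:
grep -n " b " textfile.txt | tail -1 | cut -d: -f1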
Or you can use Perl if you wish (It's very similar to what you'd do in awk, but frankly, I personally don't ever use awk if I have Perl handy - Perl supports 100% of what awk can do, by design, as 1-liners - YMMV):
perl -ne '{$n=$. if / b /} END {print "$n\n"}' textfile.txt
This can work:
$ awk '{if ($2~"b") a=NR} END{print a}' your_file
We check whether the second field contains "b" and record the line number. The value is overwritten on every match, so by the time we finish reading the file it holds the last one.
Test:
$ awk '{if ($2~"b") a=NR} END{print a}' your_file
6
Update based on sudo_O's advice:
$ awk '{if ($2=="b") a=NR} END{print a}' your_file
to avoid matching a 2nd field like abc that merely contains b.
This one is also valid (shorter; I keep the one above because it is the one I thought of first :D):
$ awk '$2=="b" {a=NR} END{print a}' your_file
Another approach if $2 is always grouped (may be more efficient than waiting until the end):
awk 'NR==1||$2=="b",$2=="b"{next} {print NR-1; exit}' file
or
awk '$2=="b"{f=1} f==1 && $2!="b" {print NR-1; exit}' file

Deleting lines from a text file with bash

I have two sets of text files. The first set is in the AA folder, the second set in the BB folder. The content of the ff.txt file from the first set (AA folder) is shown below.
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
I would like to print the second column (number) from this file if marks > 60; the output will be 3 and 4. Next, read the ff.txt file in the BB folder and delete the lines containing numbers 3 and 4. How can I do this with bash?
Files in the BB folder look like this; the second column is the number.
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
awk 'FNR == NR { if ($3 > 60) array[$2] = 1; next } { if ($2 in array) next; print }' AA/ff.txt BB/filename
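Here FNR==NR is true only while AA/ff.txt is being read: the first block collects the numbers whose marks exceed 60, and the second block drops every line of BB/filename whose second field is in that set. If the Name number marks header line gets in the way, a hedged variant that skips it explicitly:
awk 'FNR == NR { if (FNR > 1 && $3 > 60) array[$2] = 1; next } !($2 in array)' AA/ff.txt BB/filename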
This works, but is not efficient (does that matter?):
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do gawk -v number=$number '$2 != number' BB/ff.txt > /tmp/ff.txt; mv /tmp/ff.txt BB/ff.txt; done
Of course, the second awk can be replaced with sed -i.
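For example, assuming GNU sed and single-space-separated fields as shown, the inner rewrite could become:
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do sed -i -E "/^[^ ]+ $number /d" BB/ff.txt; done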
For multiple files:
for file in AA/*.txt
do
    bn=$(basename "$file")
    gawk 'BEGIN {getline} $3>60{print $2}' "AA/$bn" | while read number
    do
        gawk -v number="$number" '$2 != number' "BB/$bn" > "/tmp/$bn"
        mv "/tmp/$bn" "BB/$bn"
    done
done
I didn't test it, so if there is a problem, please comment.
