grep reverse with exact matching - linux

I have a list file, which has an id and a number, and I am trying to get those lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
So far I have been trying this, but it is not working as expected.
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help

Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master

Maybe this helps:
awk 'NR == FNR {id[$1]=1; next}
{
    if (id[$1] == "") {
        print $0
    }
}' listfile masterfile
We pass two files as input above: the first one is listfile, the second is masterfile.
NR == FNR is true while awk is reading listfile. In the associative array id[], every id from listfile is stored as a key with the value 1.
When awk then reads masterfile, it only prints a line if $1, i.e. the id, is not a key in the array id[].
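For a quick demonstration with a tiny made-up pair of files (not the OP's real data), both awk solutions above behave like this:
$ printf 'nw_66 17296\nnw_67 21414\n' > list
$ printf 'nw_1 5896\nnw_66 17296\nnw_2 52814\n' > master
$ awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
nw_1 5896
nw_2 52814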

The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This line does not work because, for every entry $i, it prints all entries in master.txt that are not equal to "$i". As a consequence, you end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads the file master.txt multiple times. This is very inefficient.
The unix tool grep allows one to check multiple expressions stored in a file in a single go. This is done using the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this would match against the full line rather than only the first column.
The awk solution presented by Kent is more flexible and allows the OP to define a more tuned match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP clearly states: I want to match column 1 of list against column 1 of master, and I don't care about spaces or whatever is in column 2. The grep solution could still match entries in column 2.
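If the OP still prefers grep, one way to keep the match restricted to column 1 is to anchor each pattern to the start of the line. This is a sketch of mine, assuming the ids contain no regex metacharacters:
$ grep -vwf <(awk '{print "^"$1}' list.txt) master.txt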

Related

How to show number counts on terminal in linux

Please how can I count the list of numbers in a file
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}'
Using the command above I can display the numbers in that range on the terminal, but if there are many of them it is difficult to count them. How can I show the count of each number on the terminal? For example:
1-20,
2-22,
3-23,
4-24,
etc
I know I can use wc but I don't know how to infuse it into the command above
awk '
{ for(i=1;i<=NF;i++) if (0<=$i && $i<=25) cnts[$i]++ }
END { for (n in cnts) print n, cnts[n] }
' file
Pipe the output to sort -n and uniq -c
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}' filename | sort -n | uniq -c
You need to sort first because uniq requires all the same elements to be consecutive.
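A quick illustration of why the sort matters, using a toy input of my own:
$ printf '3\n1\n3\n' | uniq -c
1 3
1 1
1 3
$ printf '3\n1\n3\n' | sort -n | uniq -c
1 1
2 3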
While I'm personally an awk fan, you might be glad to learn about grep -o functionality. I'm using grep -o to match all numbers in the file, and then awk can be used to pick all the numbers between 0 and 25 (inclusive). Last, we can use sort and uniq to count the results.
grep -o "[0-9][0-9]*" file | awk ' $1 >= 0 && $1 <= 25 ' | sort -n | uniq -c
Of course, you could do the counting in awk with an associative array as Ed Morton suggests:
egrep -o "\d+" file | awk ' $1 >= 0 && $1 <= 25 ' | awk '{cnt[$1]++} END { for (i in cnt) printf("%s-%s\n", i,cnt[i] ) } '
I modified Ed's code (typically not a good idea - I've been reading his code for years now) to show a modular approach - an awk script for filtering numbers in the range of 0 and 25 and another awk script for counting a list (of anything).
I also provided another subtle difference from my first script with egrep instead of grep.
To be honest, the second awk script generates some unexpected output, but I wanted to share an example of a more general approach. EDIT: I applied Ed's suggestion to correct the unexpected output - it's fine now.

How should I count the duplicate lines in each file?

I have tried this:
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
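Putting the two steps together (my combination of the commands shown above):
$ sort * | uniq -c | awk '$1>1'
2 this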
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$#")
for dir in "${dirs[#]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
           else { @buff = ($_) }
           $n = $F[0];
           END { print for @buff }'
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$#"; do
if [ -s "$file" ]; then
awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
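If the memory note above is a concern for very large files, here is a sort-based sketch of mine (not part of the original answer) that computes the same per-file count as total lines minus distinct lines, letting sort spill to disk instead of holding every line in an awk array:
#!/bin/sh
for file in "$@"; do
    if [ -s "$file" ]; then
        total=$(wc -l < "$file")                  # all lines
        distinct=$(sort "$file" | uniq | wc -l)   # distinct lines
        printf '%s: %d\n' "$file" $((total - distinct))
    fi
done
For the inp.txt example above this also prints inp.txt: 2.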

Count specific numbers from a column from an input file linux

I was trying to read a file, count a specific number at a specific place, and show how many times it appears. For example:
The 1st field is a number, the 2nd the brand name, the 3rd the group they belong to; the 4th and 5th are not important.
1:audi:2:1990:5
2:bmw:2:1987:4
3:bugatti:3:1988:19
4.buick:4:2000:12
5:dodge:2:1999:4
6:ferrari:2:2000:4
As output, I want to search by column 3, group together the 2's (by brand name), and count how many of them I have.
The output i am looking for should look like this:
1:audi:2:1990:5
2:bmw:2:1987:4
5:dodge:2:1999:4
6:ferrari:2:2000:4
4 -> showing how many lines there are.
I have tried taking this approach but can't figure it out:
file="cars.txt"; sort -t ":" -k3 $file #sorting by the 3rd field
grep -c '2' cars.txt # this counts all the 2's in the file including number 2.
I hope you understand, and thank you in advance.
I am not sure exactly what you mean by "group together by brand name", but the following will get you the output that you describe.
awk -F':' '$3 == 2' Input.txt
If you want a line count, you can pipe that to wc -l.
awk -F':' '$3 == 2' Input.txt | wc -l
I guess line 4 is 4:buick and not 4.buick. Then I suggest this
$ awk 'BEGIN{FS=":"} $3~2{total++;print} END{print "TOTAL --- "total}' Input.txt
Plain bash solution:
#!/bin/bash
count=0
while IFS=":" read -ra line; do
    if (( ${line[2]} == 2 )); then
        IFS=":" && echo "${line[*]}"
        (( count++ ))
    fi
done < file
echo "Count = $count"
Output:
1:audi:2:1990:5
2:bmw:2:1987:4
5:dodge:2:1999:4
6:ferrari:2:2000:4
Count = 4

bash print first to nth column in a line iteratively

I am trying to get the column names of a file and print them iteratively. I guess the problem is with the print $i but I don't know how to correct it. The code I tried is:
#! /bin/bash
for i in {2..5}
do
set snp = head -n 1 smaller.txt | awk '{print $i}'
echo $snp
done
Example input file:
ID Name Age Sex State Ext
1 A 12 M UT 811
2 B 12 F UT 818
Desired output:
Name
Age
Sex
State
Ext
But the output I get is a blank screen.
You'd better just read the first line of your file and store the result as an array:
read -a header < smaller.txt
and then printf the relevant fields:
printf "%s\n" "${header[#]:1}"
Moreover, this uses bash only, and involves no unnecessary loops.
Edit. To also answer your comment, you'll be able to loop through the header fields thus:
read -a header < smaller.txt
for snp in "${header[#]:1}"; do
echo "$snp"
done
Edit 2. Your original method had many many mistakes. Here's a corrected version of it (although what I wrote before is a much preferable way of solving your problem):
for i in {2..5}; do
    snp=$(head -n 1 smaller.txt | awk "{print \$$i}")
    echo "$snp"
done
set probably doesn't do what you think it does.
Because of the single quotes in awk '{print $i}', the $i never gets expanded by bash.
This algorithm is not good since you're calling head and awk 4 times, whereas you don't need a single external process.
Hope this helps!
You can print it using awk itself:
awk 'NR==1{for (i=2; i<=5; i++) print $i}' smaller.txt
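If the number of header columns is not fixed at five, a small variant of mine (not part of the original answer) prints everything from the second field onward and stops after the header line:
awk 'NR==1{for (i=2; i<=NF; i++) print $i; exit}' smaller.txt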
The main problem with your code is that your assignment syntax is wrong. Change this:
set snp = head -n 1 smaller.txt | awk '{print $i}'
to this:
snp=$(head -n 1 smaller.txt | awk '{print $i}')
That is:
Do not use set. set is for setting shell options, numbered parameters, and so on, not for assigning arbitrary variables.
Remove the spaces around =.
To run a command and capture its output as a string, use $(...) (or `...`, but $(...) is less error-prone).
That said, I agree with gniourf_gniourf's approach.
Here's another alternative; not necessarily better or worse than any of the others:
for n in $(head -n 1 smaller.txt)
do
    echo "${n}"
done
Something like:
for x1 in $(head -n1 smaller.txt); do
    echo "$x1"
done

How can I compare two text files which have multiple fields in unix

I have two text files:
file 1
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
file 2
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
I want to compare those two text files. If the four columns of file 2 are present in file 1 and equal, I want output like this:
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt.. this works good for comparing two single column in two files. i want to compare multiple column. any one have suggestion?
EDIT: From the OP's comments:
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt
.. this works well for comparing a single column across two files. I want to compare multiple columns. Do you have any suggestion?
This awk one-liner works for multi-column on unsorted files:
awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
In order for this to work, it is imperative that the first file used for input (file1.txt in my example) be the file that only has 4 fields like so:
file1.txt
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
file2.txt
7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
Output
$ awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
Alternatively, you could also use the following syntax, which more closely matches the one in your question but is not very readable IMHO:
awk -F, 'NR==FNR{a[$1,$2,$3,$4];next} ($1SUBSEP$3SUBSEP$6SUBSEP$7 in a)' file1.txt file2.txt
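If the real files still carry their header lines (as in the question), a small guard of mine (not part of the original answer) skips the first line of each input file before comparing:
awk -F, 'FNR==1{next} NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt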
TxtSushi looks like what you want. It allows you to work with CSV files using SQL.
It's not an elegant one-liner, but you could do it with perl.
#!/usr/bin/perl
open A, $ARGV[0];
while (<A>) {
    my @f = split /,/;
    $k{$f[0]} = [@f];
}
close A;
open B, $ARGV[1];
while (<B>) {
    my @f = split /,/;
    print join(',', @{$k{$f[0]}}) if
        defined($k{$f[0]}) &&
        $k{$f[0]}->[2] == $f[1] &&
        $k{$f[0]}->[5] == $f[2] &&
        $k{$f[0]}->[6] == $f[3];
}
close B;
Quick answer: Use cut to split out the fields you need and diff to compare the results.
Not really well tested, but this might work:
join -t, file1 file2 | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$5,$6,$7}'
(Of course, this assumes the input files are sorted).
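If you would rather sort on the fly, process substitution can feed join pre-sorted copies. This is a sketch of mine; it assumes bash and that the header lines are stripped first with tail -n +2:
join -t, <(tail -n +2 file1 | sort -t, -k1,1) <(tail -n +2 file2 | sort -t, -k1,1) | \
    awk -F, 'BEGIN{OFS=","} $3==$8 && $6==$9 && $7==$10 {print $1,$2,$3,$4,$5,$6,$7}'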
This is neither efficient nor pretty, but it will get the job done. It is not the most efficient implementation, as it parses file1 multiple times; however, it does not read the entire file into RAM either, so it has some benefits over the simple scripting approaches.
sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
sort | join file2 - |awk '{print $2}'
This works as follows
sed -n '2,$p' file1 sends file1 to STDOUT without the header line
The first awk command prints the 4 "key fields" from file1 in the same format as they appear in file2, followed by a space, followed by the original line from file1
The sort command ensures that file1 is in the same order as file2
The join command joins file2 and the piped standard input, only writing records that have a matching record in file2
The final awk command prints just the original part of file1
In order for this to work you must ensure that file2 is sorted before running the command.
Running this against your example data gave the following result
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
EDIT
I note from your comments that you are getting a sorting error. If this error is occurring when sorting file2 before running the pipeline command, then you could split the file, sort each part and then cat them back together again.
Something like this would do that for you
mv file2 file2.orig
for i in 0 1 2 3 4 5 6 7 8 9
do
grep "^${i}" file2.orig |sort > file2.$i
done
cat file2.[0-9] >file2
rm file2.[0-9] file2.orig
You may need to modify the values passed to the for loop if your file is not distributed evenly across the full range of leading digits.
The statistical package R handles processing multiple csv tables really easily.
See An Intro. to R or R for Beginners.
