grep reverse with exact matching - linux
I have a list file, which has an id and a number, and I am trying to get those lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
Master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
So far I am trying this, but it is not working as expected.
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help
Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Maybe this helps:
awk 'NR == FNR {id[$1]=1;next}
     {
         if (id[$1] == "") {
             print $0
         }
     }' listfile masterfile
We accept two files as input above; the first one is listfile, the second is masterfile.
NR == FNR is true while awk is going through listfile. In the associative array id[], every id in listfile becomes a key with value 1.
When awk goes through masterfile, it only prints a line if $1, i.e. the id, is not a key in the array id[].
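If you want to see this in action, here is a small self-contained sketch; the file names list.demo and master.demo and their handful of lines are invented for illustration only:
# Two throwaway files (names and contents made up for this sketch).
printf 'nw_66 17296\nnw_67 21414\n' > list.demo
printf 'nw_1 5896\nnw_66 17296\nnw_2 52814\n' > master.demo

# First pass (NR==FNR) fills id[]; second pass prints only master lines
# whose first field never appeared as a key, here the nw_1 and nw_2 lines.
awk 'NR == FNR {id[$1]=1; next}
     {
         if (id[$1] == "") {
             print $0
         }
     }' list.demo master.demo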
The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This will not work: for every entry $i, it prints all lines in master.txt that do not match "$i". As a consequence, you end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads master.txt once per id in list.txt, which is very inefficient.
The Unix tool grep allows one to check multiple expressions stored in a file in a single pass. This is done with the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this would match anywhere in the line, not only in the first column.
The awk solution presented by Kent is more flexible and allows the OP to define a more tuned match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP clearly states: match column 1 of list against column 1 of master, ignoring whatever is in column 2. The grep solution could still match entries in column 2.
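To make the difference concrete, here is a hedged sketch with invented data (list.demo and master.demo are throwaway names): an id from the list that also shows up as a value in column 2 of master is wrongly filtered out by grep, but kept by awk.
# Invented data: the list id "500" also appears as a number in column 2
# of the master line "abc 500".
printf '500 1\n' > list.demo
printf '500 99\nabc 500\n' > master.demo

# grep matches the word 500 anywhere in the line, so both master lines
# are suppressed and nothing is printed.
grep -vwf <(awk '{print $1}' list.demo) master.demo

# awk compares only column 1, so "abc 500" is (correctly) printed.
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list.demo master.demo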
Related
How to show number counts on terminal in linux
How can I count the list of numbers in a file?
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}'
Using the command above I can display the range of numbers on the terminal, but if there are many of them it is difficult to count them. How can I show the count of each number on the terminal, for example 1-20, 2-22, 3-23, 4-24, etc.? I know I can use wc but I don't know how to combine it with the command above.
awk '{ for (i=1; i<=NF; i++) if (0<=$i && $i<=25) cnts[$i]++ }
     END { for (n in cnts) print n, cnts[n] }' file
Pipe the output to sort -n and uniq -c:
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}' filename | sort -n | uniq -c
You need to sort first because uniq requires all identical elements to be consecutive.
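A tiny illustration of that point, with invented numbers: uniq only collapses adjacent duplicates, so without the sort the same value can be counted more than once.
# Without sort the two 2's are not adjacent and are counted separately:
printf '2\n5\n2\n' | uniq -c            # reports 2, 5 and 2, each with count 1
# With sort the duplicates become adjacent and are counted together:
printf '2\n5\n2\n' | sort -n | uniq -c  # reports 2 with count 2 and 5 with count 1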
While I'm personally an awk fan, you might be glad to learn about grep's -o functionality. I'm using grep -o to match all numbers in the file, and then awk can be used to pick all the numbers between 0 and 25 (inclusive). Last, we can use sort and uniq to count the results.
grep -o "[0-9][0-9]*" file | awk '$1 >= 0 && $1 <= 25' | sort -n | uniq -c
Of course, you could do the counting in awk with an associative array as Ed Morton suggests:
egrep -o "\d+" file | awk '$1 >= 0 && $1 <= 25' | awk '{cnt[$1]++} END { for (i in cnt) printf("%s-%s\n", i, cnt[i]) }'
I modified Ed's code (typically not a good idea - I've been reading his code for years now) to show a modular approach: one awk script for filtering numbers in the range 0 to 25, and another awk script for counting a list (of anything). I also made another subtle change from my first script by using egrep instead of grep. To be honest, the second awk script generated some unexpected output, but I wanted to share an example of a more general approach. EDIT: I applied Ed's suggestion to correct the unexpected output - it's fine now.
How should I count the duplicate lines in each file?
I have tried this:
dirs=$1
for dir in $dirs
do
    ls -R $dir
done
Like this?
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
      1 neither
      1 nope
      2 this
and weed out the ones with just 1s:
... | awk '$1>1'
      2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$@")
for dir in "${dirs[@]}" ; do
    cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line.
sort -n will sort the lines by the number of occurrences.
tail -n1 will only output the last line, i.e. the maximum.
If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ } else { @buff = $_ } $n = $F[0]; END { print for @buff }'
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$@"; do
    if [ -s "$file" ]; then
        awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
    fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, it prints the total. Note that this might have problems on very large files, since every distinct line needs to be kept in memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates. To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
Here, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
Count specific numbers from a column from an input file linux
I was trying to read a file, count a specific number at a specific place and show how many times it appears. For example: the 1st field is a number, the 2nd field the brand name, the 3rd field a group they belong to, the 4th and 5th are not important.
1:audi:2:1990:5
2:bmw:2:1987:4
3:bugatti:3:1988:19
4.buick:4:2000:12
5:dodge:2:1999:4
6:ferrari:2:2000:4
As output, I want to search by column 3, group the 2's together (by brand name) and count how many of them I have. The output I am looking for should look like this:
1:audi:2:1990:5
2:bmw:2:1987:4
5:dodge:2:1999:4
6:ferrari:2:2000:4
4 -> showing how many lines there are.
I have tried this approach but can't figure it out:
file="cars.txt";
sort -t ":" -k3 $file  # sorting by the 3rd field
grep -c '2' cars.txt   # this counts all the 2's in the file, including the number 2
I hope you understand, and thank you in advance.
I am not sure exactly what you mean by "group together by brand name", but the following will get you the output that you describe.
awk -F':' '$3 == 2' Input.txt
If you want a line count, you can pipe that to wc -l.
awk -F':' '$3 == 2' Input.txt | wc -l
I guess line 4 is 4:buick and not 4.buick. Then I suggest this:
$ awk 'BEGIN{FS=":"} $3~2{total++;print} END{print "TOTAL --- "total}' Input.txt
Plain bash solution:
#!/bin/bash
while IFS=":" read -ra line; do
    if (( ${line[2]} == 2 )); then
        IFS=":" && echo "${line[*]}"
        (( count++ ))
    fi
done < file
echo "Count = $count"
Output:
1:audi:2:1990:5
2:bmw:2:1987:4
5:dodge:2:1999:4
6:ferrari:2:2000:4
Count = 4
bash print first to nth column in a line iteratively
I am trying to get the column names of a file and print them iteratively. I guess the problem is with print $i but I don't know how to correct it. The code I tried is:
#! /bin/bash
for i in {2..5}
do
    set snp = head -n 1 smaller.txt | awk '{print $i}'
    echo $snp
done
Example input file:
ID Name Age Sex State Ext
1 A 12 M UT 811
2 B 12 F UT 818
Desired output:
Name Age Sex State Ext
But the output I get is a blank screen.
You'd better just read the first line of your file and store the result as an array:
read -a header < smaller.txt
and then printf the relevant fields:
printf "%s\n" "${header[@]:1}"
Moreover, this uses bash only, and involves no unnecessary loops.
Edit. To also answer your comment, you'll be able to loop through the header fields thus:
read -a header < smaller.txt
for snp in "${header[@]:1}"; do
    echo "$snp"
done
Edit 2. Your original method had many, many mistakes. Here's a corrected version of it (although what I wrote before is a much preferable way of solving your problem):
for i in {2..5}; do
    snp=$(head -n 1 smaller.txt | awk "{print \$$i}")
    echo "$snp"
done
set probably doesn't do what you think it does. Because of the single quotes in awk '{print $i}', the $i never gets expanded by bash. This algorithm is also not good since you're calling head and awk 4 times, whereas you don't need a single external process. Hope this helps!
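As a side note that is not part of the answer above: another way to avoid escaping the dollar sign is to pass the loop index into awk with -v, sketched here under the same smaller.txt assumption as the question:
# Hedged sketch: -v hands the shell variable i to awk as col, so the awk
# program can stay inside single quotes and still pick the i-th field.
for i in {2..5}; do
    snp=$(head -n 1 smaller.txt | awk -v col="$i" '{print $col}')
    echo "$snp"
done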
You can print it using awk itself:
awk 'NR==1{for (i=2; i<=5; i++) print $i}' smaller.txt
The main problem with your code is that your assignment syntax is wrong. Change this:
set snp = head -n 1 smaller.txt | awk '{print $i}'
to this:
snp=$(head -n 1 smaller.txt | awk '{print $i}')
That is:
Do not use set. set is for setting shell options, numbered parameters, and so on, not for assigning arbitrary variables.
Remove the spaces around =.
To run a command and capture its output as a string, use $(...) (or `...`, but $(...) is less error-prone).
That said, I agree with gniourf_gniourf's approach.
Here's another alternative; not necessarily better or worse than any of the others:
for n in $(head smaller.txt)
do
    echo ${n}
done
Something like:
for x1 in $(head -n1 smaller.txt); do
    echo $x1
done
How can I compare two text files which have multiple fields in unix
I have two text files.
File 1:
number,name,account id,vv,sfee,dac acc,TDID
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
File 2:
number,account id,dac acc,TDID
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
I want to compare those two text files. If the four columns of file 2 are present in file 1 and equal, I want output like this:
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt
This works well for comparing a single column across two files; I want to compare multiple columns. Does anyone have a suggestion?
EDIT: From the OP's comments:
nawk -F"," 'NR==FNR {a[$1];next} ($1 in a)' file2.txt file1.txt
this works good for comparing two single column in two files. i want to compare multiple column. you have any suggestion?
This awk one-liner works for multi-column matching on unsorted files:
awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
In order for this to work, it is imperative that the first file used for input (file1.txt in my example) be the file that only has 4 fields, like so:
file1.txt
7000,2,1,6
7001,2,1,7
7002,2,1,6
7003,1,2,1
file2.txt
7000,john,2,0,0,1,6
7000,john,2,0,0,1,7
7000,john,2,0,0,1,8
7000,john,2,0,0,1,9
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
7003,mike,1,0,0,2,2
7003,mike,1,0,0,2,3
7003,mike,1,0,0,2,4
8001,nike,1,2,4,1,8
8002,paul,2,0,0,2,7
Output
$ awk -F, 'NR==FNR{a[$1,$2,$3,$4]++;next} (a[$1,$3,$6,$7])' file1.txt file2.txt
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
Alternatively, you could also use the following syntax, which more closely matches the one in your question but is not very readable IMHO:
awk -F, 'NR==FNR{a[$1,$2,$3,$4];next} ($1SUBSEP$3SUBSEP$6SUBSEP$7 in a)' file1.txt file2.txt
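For readers who have not met SUBSEP before, here is a minimal sketch (the key values are invented) of why the two forms above are equivalent: a comma inside an awk array subscript joins the parts with the SUBSEP character.
# a["7000","2"] is stored under the single string "7000" SUBSEP "2",
# so both membership tests below find the same entry.
awk 'BEGIN {
    a["7000", "2"] = 1
    if (("7000", "2") in a)      print "comma form found"
    if ("7000" SUBSEP "2" in a)  print "explicit SUBSEP form found"
}'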
TxtSushi looks like what you want. It allows you to work with CSV files using SQL.
It's not an elegant one-liner, but you could do it with perl.
#!/usr/bin/perl
open A, $ARGV[0];
while(split/,/,<A>) {
    $k{$_[0]} = [@_];
}
close A;
open B, $ARGV[1];
while(split/,/,<B>) {
    print join(',',@{$k{$_[0]}}) if defined($k{$_[0]})
        && $k{$_[0]}->[2] == $_[1]
        && $k{$_[0]}->[5] == $_[2]
        && $k{$_[0]}->[6] == $_[3];
}
close B;
Quick answer: Use cut to split out the fields you need and diff to compare the results.
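That suggestion is terse, so here is one hedged reading of it (not necessarily what the answerer intended): cut pulls the four key columns out of file1 and diff compares them, sorted, against file2. Note that this reports rows that differ between the two files rather than printing the matching file1 lines the question asks for.
# Sketch: fields 1,3,6,7 of file1 are number, account id, dac acc, TDID,
# i.e. the same columns file2 contains; diff then shows one-sided rows.
diff <(cut -d, -f1,3,6,7 file1 | sort) <(sort file2)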
Not really well tested, but this might work:
join -t, file1 file2 | awk -F, 'BEGIN{OFS=","} {if ($3==$8 && $6==$9 && $7==$10) print $1,$2,$3,$4,$6,$7}'
(Of course, this assumes the input files are sorted.)
This is neither efficient nor pretty; it will, however, get the job done. It is not the most efficient implementation as it parses file1 multiple times, but it does not read the entire file into RAM either, so it has some benefits over the simple scripting approaches.
sed -n '2,$p' file1 | awk -F, '{print $1 "," $3 "," $6 "," $7 " " $0 }' | \
    sort | join file2 - | awk '{print $2}'
This works as follows:
sed -n '2,$p' file1 sends file1 to STDOUT without the header line.
The first awk command prints the 4 "key fields" from file1 in the same format as they are in file2, followed by a space, followed by the full contents of the file1 line.
The sort command ensures that file1 is in the same order as file2.
The join command joins file2 and STDOUT, only writing records that have a matching record in file2.
The final awk command prints just the original part of file1.
In order for this to work you must ensure that file2 is sorted before running the command.
Running this against your example data gave the following result:
7000,john,2,0,0,1,6
7001,elen,2,0,0,1,7
7002,sami,2,0,0,1,6
7003,mike,1,0,0,2,1
EDIT
I note from your comments that you are getting a sorting error. If this error occurs when sorting file2 before running the pipeline command, you could split the file, sort each part, and then cat them back together again. Something like this would do that for you:
mv file2 file2.orig
for i in 0 1 2 3 4 5 6 7 8 9
do
    grep "^${i}" file2.orig | sort > file2.$i
done
cat file2.[0-9] > file2
rm file2.[0-9] file2.orig
You may need to modify the values passed to for if your file is not distributed evenly across the full range of leading digits.
The statistical package R handles processing multiple csv tables really easily. See An Intro. to R or R for Beginners.