How to pipe sorted results to grep? - linux

$ grep HxH 20170213.csv | awk -F',' '{print $13}' | cut -b 25-27 | sort -u
868
881
896
904
913
914
918
919
920
Question> How can I pipe the sorted results into grep?
Right now I have to run the following commands manually:
grep 868 /tmp/aaa/*.csv
grep 881 /tmp/aaa/*.csv
...
grep 920 /tmp/aaa/*.csv

Since your output is numeric (output lines do not contain spaces), you can use a for loop with command substitution:
for id in $(grep HxH 20170213.csv | awk -F',' '{print $13}' \
            | cut -b 25-27 | sort -u); do
    grep "$id" /tmp/aaa/*.csv
done
Another option is to use xargs:
grep HxH 20170213.csv | awk -F',' '{print $13}' | cut -b 25-27 | sort -u \
| xargs -n1 grep /tmp/aaa/*.csv -e
The xargs variant requires jumping through a couple of hoops to get right:
by default xargs would stick more than one pattern onto the same grep invocation, which is prevented with -n1;
xargs places the stdin contents at the end of the command line, which is a problem because grep expects the pattern before the file names. Fortunately, grep PATTERN FILES... can be spelled as grep FILES... -e PATTERN, which is why grep must be followed by -e.
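If your grep supports reading patterns from a file with -f and your shell has process substitution (bash, ksh, zsh), a single grep invocation can also search for all the patterns at once. A minimal sketch, with the caveat that it merges all matches into one output stream instead of one grep run per pattern:
grep -f <(grep HxH 20170213.csv | awk -F',' '{print $13}' \
          | cut -b 25-27 | sort -u) /tmp/aaa/*.csv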

Grep for string containing several metacharacters and extract 3 lines after match

I'd like to grep for 1:N:0:CGATGT within a file and extract the line containing 1:N:0:CGATGT plus the 3 lines after it (4 lines total for each match). I've tried to grep numerous ways, all unsuccessful:
[ssabri@login2 data]$ history | tail -n 8
1028 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 "1[[:]][[N]][[:]]0[[:]]CGATGT" | wc -l
1029 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 "1[[:]][[N]][[:]]0[[:]]CGATGT$" | wc -l
1030 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 "1[[:]][[N]][[:]][[0]][[:]]CGATGT$" | wc -l
1031 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 -w "1[[:]][[N]][[:]][[0]][[:]]CGATGT$" | wc -l
1032 zcat A1_S1_L008_R1_001.fastq.gz | egrep -A4 -w "1[[:]][[N]][[:]][[0]][[:]]CGATGT$" | wc -l
1033 zcat A1_S1_L008_R1_001.fastq.gz | grep -x -A4 -w "1:N:0:CGATGT" | wc -l
1034 zcat A1_S1_L008_R1_001.fastq.gz | grep -E -A4 -w "1:N:0:CGATGT" | wc -l
1035 zcat A1_S1_L008_R1_001.fastq.gz | grep -A4 -w "1\:N\:0\:CGATGT$" | wc -l
EDIT: The input file looks something like this:
[ssabri@login2 data]$ zcat A1_S1_L008_R1_001.fastq.gz | head -n 12
@J00153:28:H7LNWBBXX:8:1101:28625:1191 1:N:0:CGAGGT
ACNTGCTCCATCCATAGCACCTAGAACAGAGCCTGGNACAGAANAAGNGC
+
A-#<-<<FJJAJFFFF-FJJJJJAJFJJJFF-A-FA#JJJJFJ#JJA#FJ
@J00153:28:H7LNWBBXX:8:1101:29457:1191 1:N:0:CGATGT
GTNGTGGTAGATCTGGACGCGGCTGAAGGCCTGGGGNCCCGTGNCAGN
+
-<#<FJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJ#JJJJJJ#JJJ#
@J00153:28:H7LNWBBXX:8:1101:31000:1191 1:N:0:CCATGT
TCNAATTATCACCATTACAGGAGGGTCAGTAGAACANGCGTTCTGGTNGG
+
<A#<AFFJJJFJJJFJJJJJJFFFJ7A<<JJFJJJJ#JJJAFJJJJF#-A
grep -A3 "1:N:0:CGATGT" file
@J00153:28:H7LNWBBXX:8:1101:29457:1191 1:N:0:CGATGT
GTNGTGGTAGATCTGGACGCGGCTGAAGGCCTGGGGNCCCGTGNCAGN
+
-<#<FJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJ#JJJJJJ#JJJ#
Sometimes simpler thinking is better: here you don't need any regex extensions, since you're matching a plain string with no special regex characters that would need escaping. The A(fter) context should be 3, since you want 3 trailing lines (4 lines total including the matching line).
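Since the pattern is a fixed string, you can also tell grep so explicitly with -F (fixed-string matching) and run it directly against the gzipped input from the question; a minimal sketch:
zcat A1_S1_L008_R1_001.fastq.gz | grep -F -A3 "1:N:0:CGATGT"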
I understand that you are looking for a grep solution. However, grep is not the only option for text processing. If you use awk, then this might be a solution:
awk 'BEGIN {line=4}
     /1:N:0:CGATGT/ {line=0; print $0; next}
     {if (line<3) {print $0; line=line+1}}' your-file
Given the problem you seem to be having with using grep and pulling out a fixed 4 lines, try this:
$ awk 'NF>1{f=0} $NF=="1:N:0:CGATGT"{f=1} f' file
@J00153:28:H7LNWBBXX:8:1101:29457:1191 1:N:0:CGATGT
GTNGTGGTAGATCTGGACGCGGCTGAAGGCCTGGGGNCCCGTGNCAGN
+
-<#<FJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJ#JJJJJJ#JJJ#
Rather than printing a fixed number of lines after a match, it will print from the first line where the last field is your target string to just before the next line that COULD contain your target string.
To identify any blocks that have some number other than 4 lines including the target, use this:
$ awk 'f && NF>1{ if (f!=5) print NR, f-1 | "cat>&2"; f=0} $NF=="1:N:0:CGATGT"{f=1} f{print; ++f}' file
It will output to stderr the input file line number and the count of the number of lines in the unexpected block.

wc -m in linux returns 1 character too many

md5sum file.png | awk '{print $1}' | wc -m
I get: 33
I expected it to return 32, the length of an MD5 hash. After reading the man page and googling, I still haven't found out why.
TL;DR
Use awk's length() function:
md5sum file.png | awk '{print length($1)}'
32
It's because awk will add a line feed character to the output. You can check:
md5sum file.png | awk '{print $1}' | xxd
You can tell awk not to do that by setting the ORS (output record separator) variable:
md5sum file.png | awk '{print $1}' ORS='' | wc -m
32
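Another option, if you would rather leave the awk command alone, is to strip the trailing newline before counting; a minimal sketch:
md5sum file.png | awk '{print $1}' | tr -d '\n' | wc -m
32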

how to use user defined variables inside egrep

Hi, I want to know how to use user-defined variables inside egrep with an OR condition.
If I use the command below, I get the required result:
grep -w "Connect" audit.log.13766735635311490 | egrep 'USER=\"root\"|DB=\"CUST_PRESTIGEWW_DB\"' | wc -l
24416
Now I have done the below and executed the same command, but I get a different result:
DATABASE=CUST_PRESTIGEWW_DB
grep -w "Connect" audit.log.13766735635311490 | egrep 'USER=\"root\"|DB=\"$DATABASE\"' | wc -l
40
grep -w "Connect" audit.log.13766735635311490 | egrep 'USER=\"root\"|DB=\""$DATABASE"\"' | wc -l
40
echo $DATABASE
CUST_PRESTIGEWW_DB
Then I tried changing the variable's value, but that didn't work either:
DATABASE=`echo "\"CUST_PRESTIGEWW_DB\""`
echo $DATABASE
"CUST_PRESTIGEWW_DB"
grep -w "Connect" audit.log.13766735635311490 | egrep 'USER=\"root\"|DB=$DATABASE' | wc -l
grep -w "Connect" audit.log.13766735635311490 | egrep 'USER=\"root\"|DB='"$DATABASE" | wc -l
119956
grep -w "Connect" audit.log.13766735635311490 | egrep 'USER=\"root\"|DB="'$DATABASE'"' | wc -l
114
Can anyone help me with how to do this?
Use:
... egrep 'USER="root"|DB='"$DATABASE" ...
You need to close the single-quoted section to make bash expand $DATABASE. Further note that you don't need to escape double quotes inside a single-quoted string.
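For completeness, a hedged sketch of the full pipeline with the variable spliced in, assuming the log stores the value in double quotes as in the original pattern:
DATABASE=CUST_PRESTIGEWW_DB
grep -w "Connect" audit.log.13766735635311490 \
  | egrep 'USER="root"|DB="'"$DATABASE"'"' | wc -l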

Need to remove the count from the output when using "uniq -c" command

I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most repeated date in a log file; then I use the uniq -c option and sort in descending order, something like this:
uniq -c | sort -nr
This produces output like this:
809 23/Dec/2008:19:20
The first field, which is actually the count, is the problem for me. I want to get only the date from the above output, but I'm not able to. I tried the cut command like this:
uniq -c | sort -nr | cut -d' ' -f2
but this just prints blank space. Please can someone help me get only the date and chop off the count? I want only:
23/Dec/2008:19:20
Thanks
The count from uniq is preceded by spaces unless there are more than 7 digits in the count, so you need to do something like:
uniq -c | sort -nr | cut -c 9-
to get columns (character positions) 9 upwards. Or you can use sed:
uniq -c | sort -nr | sed 's/^.\{8\}//'
or:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
This second option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. And there are undoubtedly other options available too.
Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:
printf("%d %s", repeat_count, line);
which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
Instead of cut -d' ' -f2, try
awk '{$1="";print}'
You may need to remove one more blank at the beginning:
awk '{$1="";print}' | sed 's/^.//'
or completely with sed, preserving the original whitespace:
sed -r 's/^[^0-9]*[0-9]+//'
The following awk may help you here:
awk '{a[$0]++} END{for(i in a){print a[i],i | "sort -k2"}}' Input_file
Second solution: in case you want the output in the same order as the input rather than sorted:
awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}' Input_file
An alternative solution is this:
uniq -c | sort -nr | awk '{print $1, $2}'
You may also easily print just a single field (since you used -f2 with cut in your question):
sort file | uniq -c | awk '{ print $2; }'
If you want to work with the count field downstream, the following command will reformat it to a 'pipe-friendly' tab-delimited format without the left padding:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'
For the original task it is a bit of overkill, but after reformatting, cut can be used to remove the field, as the OP intended:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-
Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:
uniq -c | tr -s ' ' | cut -d ' ' -f3
tr is very useful in some obscure places. Unfortunately, it doesn't get rid of the first leading space, hence the -f3.
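Alternatively, stripping the leading spaces first keeps the field numbering intuitive; a small sketch:
uniq -c | sed 's/^ *//' | cut -d ' ' -f2-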
You could make use of sed to strip both the leading spaces and the numbers printed by uniq -c
sort file | uniq -c | sed 's/^ *[0-9]* //'
I would illustrate this with an example. Consider a file
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
The command
sort file | uniq -c | sed 's/^ *[0-9]* //'
would return
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
first solution
Just use sort when repetitions in the input do not need to be counted; sort has the unique option -u:
sort -u file
sort -u < file
Ex.:
$ cat > file
a
b
c
a
a
g
d
d
$ sort -u file
a
b
c
d
g
second solution
If sorting based on the number of repetitions is important:
sort txt | uniq -c | sort -k1 -nr | sed 's/^ \+[0-9]\+ //g'
sort txt | uniq -c | sort -k1 -nr | perl -lpe 's/^ +[\d]+ +//g'
which has this output:
a
d
g
c
b

Bash script to find the frequency of every letter in a file

I am trying to find out the frequency of appearance of every letter in the english alphabet in an input file. How can I do this in a bash script?
My solution, using grep, sort and uniq:
grep -o . file | sort | uniq -c
Ignore case:
grep -o . file | sort -f | uniq -ic
Just one awk command:
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file
If you want it case-insensitive, add tolower():
awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file
And if you want only letters:
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file
And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.
If you do not want multibyte/Unicode handling, do export LC_ALL=C first.
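If you also want the letter counts ordered by frequency, the awk output can be piped through sort; a small sketch based on the letters-only variant above:
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file | sort -k2 -nr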
A solution with sed, sort and uniq:
sed 's/\(.\)/\1\n/g' file | sort | uniq -c
This counts all characters, not only letters. You can restrict it to letters with:
sed 's/\(.\)/\1\n/g' file | grep '[A-Za-z]' | sort | uniq -c
If you want to consider uppercase and lowercase as the same, just add a translation:
sed 's/\(.\)/\1\n/g' file | tr '[:upper:]' '[:lower:]' | grep '[a-z]' | sort | uniq -c
Here is a suggestion:
while read -n 1 c
do
    echo "$c"
done < "$INPUT_FILE" | grep '[[:alpha:]]' | sort | uniq -c | sort -nr
Similar to mouviciel's answer above, but more portable for the Bourne and Korn shells used on BSD systems: when you don't have GNU sed, which supports \n in a replacement, you can backslash-escape a literal newline:
sed -e's/./&\
/g' file | sort | uniq -c | sort -nr
or, to avoid the visual split on the screen, insert a literal newline by typing CTRL+V CTRL+J:
sed -e's/./&\^J/g' file | sort | uniq -c | sort -nr
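If fold is available (it is specified by POSIX), it offers another way to split a file into one character per line, sidestepping the sed newline issue entirely; a sketch, which counts all characters, so you may still want the grep '[A-Za-z]' filter:
fold -w1 file | sort | uniq -c | sort -nr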
