Piping awk output into grep - linux

So I'm writing a bash script to alphabetically list names from a text file, but only names with the same frequency (defined in the second column)
grep -wi '$1' /usr/local/linuxgym-data/census/femalenames.txt |
awk '{ print ($2) }' |
grep '$1' /usr/local/linuxgym-data/census/femalenames.txt |
sort |
awk '{ print ($1) }'
Since I'm doing this for class, I've been given the example of inputting 'ANA', and should return
ANA
RENEE
And the document has about 4500 lines in it
but the two fields I'm looking at have
ANA 0.120 55.989 181
RENEE 0.120 56.109 182
And so I want to find all names with the second column the same as ANA (0.120). The second column is the frequency of the name... This is just dummy data given to me by my school, so I don't know what that means.
But if there was another name with the same frequency as ANA (0.120) it would also be listed in the output.
When I run the commands on their own, they work fine, but it seems to have trouble with the 3rd line with using the awk output as $1 in the grep below it.
I am pretty new to this, so I'm most likely doing it in the most roundabout way.

You could probably do this in one line, but that's pushing it a bit. Splitting it into two steps makes it easier to write and read. For example:
name=$1
src=/usr/local/linuxgym-data/census/femalenames.txt
# get the frequency you're after
freq=$(awk -v name="$name" '$1==name {print $2}' "$src")
# get the names with that frequency
awk -v freq="$freq" '$2==freq {print $1}' "$src"
The tradeoff between this and RomanPerekhrest's solution: theirs scans the file once but holds every name in memory, while this one scans the file twice but needs almost no memory.
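If you want the two scans but only one awk invocation, one option is to pass the same file twice and use the NR==FNR idiom to tell the passes apart. This is only a sketch, assuming whitespace-separated columns as in the sample data; same_freq is a made-up helper name, not something from the question.

```shell
#!/bin/sh
# same_freq NAME FILE (hypothetical helper)
# Prints, sorted, every name whose second column equals NAME's second column.
same_freq() {
    awk -v name="$1" '
        # First pass (NR == FNR holds only while reading the first file argument):
        # remember the frequency of the requested name.
        NR == FNR { if ($1 == name) freq = $2; next }
        # Second pass: print every name that shares that frequency.
        freq != "" && $2 == freq { print $1 }
    ' "$2" "$2" | sort
}
```

Called as same_freq ANA /usr/local/linuxgym-data/census/femalenames.txt, it should list ANA and RENEE for the sample rows shown in the question. The match on the name is case-sensitive, unlike the grep -i in the original attempt.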

With single awk:
inp="ANA"
awk -v inp="$inp" '{ a[$1]=$2 } END { if(inp in a){ v=a[inp];
for(i in a){ if(a[i]==v) print i }}
}' /usr/local/linuxgym-data/census/femalenames.txt | sort
The output:
ANA
RENEE
a[$1]=$2 - accumulating frequency value for each name
if(inp in a){ v=a[inp]; - if the input name inp is in array - get its frequency value
for(i in a){ if(a[i]==v) print i - print all names that have the same frequency value as for input name

This should probably do it...
f="/usr/local/linuxgym-data/census/femalenames.txt"
grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }'
Test...
echo 'ANA 0.120 55.989 181
RENEE 0.120 56.109 182' > fem
foo() { grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }' ; }
f=fem ; foo ANA
Output:
ANA
RENEE

Related

Find number of unique values in a column

I would like to know the count of unique values in column using linux commands. The column has values like below (data is edited from previous ones). I need to ignore .M, .Q and .A at the end and just count the unique number of plants
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56841-WND-WT.Q"
"series_id":"ELEC.CONS_TOT.COW-GA-2.M"
"series_id":"ELEC.CONS_TOT.COW-GA-94.M"
I've tried this code but I'm not able to avoid those suffix
cat ELEC.txt | grep 'series_id' | cut -d, -f1 | wc -l
For above sample, expected count should be 6 but I get 8
This should do the job:
grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c
You first grep for the "ELEC.PLANT." part
remove the .Q,A,M
remove duplicates and count using sort | uniq -c
EDIT:
for the new data it should be only necessary to do the following:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c
When you have to do some counting, you can easily do it with awk. Awk is an extremely versatile tool and I strongly recommend having a look at it. Maybe start with Awk one-liners explained.
That said, you can easily do some conditional counting here:
What you want, is to count all unique lines which have series_id in it.
awk '/series_id/ && !($0 in a) { c++; a[$0] } END {print c}'
This essentially states: if my line contains "series_id" and I did not store the line in my array a, then it means I did not encounter my line yet and increase the counter c with 1. At the END of the program, I print the count c.
Now you want to clean things up a bit. Your lines of interest essentially look like
"something":"something else"
So we are interested in something else which is in the 4th field if " is a field separator, and we are only interested in that if something is series_id located in field 2.
awk -F'"' '($2=="series_id") && !($4 in a) { c++; a[$4] } END {print c}'
Finally, you don't care about the last letter of the fourth field, so we need to make a small substitution:
awk -F'"' '($2=="series_id") { str=$4; sub(/.$/,"",str); if (!(str in a)) {c++; a[str] } } END {print c}'
You could also rewrite this differently as:
awk -F'"' '($2 != "series_id" ) { next }
{ str=$4; gsub(/.$/,"",str) }
( str in a ) { next }
{ c++; a[str] }
END { print c }'
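A more compact way to write the same idea is the common !seen[x]++ idiom, which is true only the first time a value shows up. The sketch below assumes every line of interest looks like "series_id":"VALUE" and that the suffix to ignore is a dot followed by one capital letter; count_unique_series is a hypothetical name used for illustration.

```shell
#!/bin/sh
# count_unique_series FILE... (hypothetical helper)
# Counts distinct series_id values after dropping a trailing .M, .Q, .A, etc.
count_unique_series() {
    awk -F'"' '
        $2 == "series_id" {
            id = $4
            sub(/\.[A-Z]$/, "", id)     # strip the one-letter suffix
            if (!seen[id]++) n++        # count each id only on first sight
        }
        END { print n + 0 }             # n + 0 prints 0 instead of an empty line
    ' "$@"
}
```

For the eight sample lines in the question this gives 6, which is the count the asker expected.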
My standard way to count unique values is to make sure I have the list of values (using grep and cut in your case), and then add the following commands after a pipe:
| sort | uniq -c
The sort puts identical entries next to each other, which uniq needs in order to collapse them (the -c stands for "count"). A plain sort is enough for string values like these; -n only matters when the values are numeric.
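Do this: cat ELEC.txt | grep 'series_id' | cut -f1-4 -d. | sort -u | wc -l
-f1-4 keeps only the first four dot-separated fields, which drops the trailing .M, .Q or .A from the ELEC.PLANT lines. sort -u then removes duplicates even when they are not adjacent (plain uniq only collapses adjacent repeats).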
Here is a possible solution using awk:
awk 'BEGIN{FS="[:.\"]+"} /^"series_id":/{print $6}' \
ELEC.txt |sort -n |uniq -c
The output for the sample you posted will be something like this:
1 56841-WND-WT
2 56855-ALL-ALL
1 56855-WND-ALL
2 56868-LFG-ALL
If you need the entire string, you can print the other fields as well:
awk 'BEGIN{FS="[:.\"]+"; OFS="."} /^"series_id":/{print $3,$4,$5,$6}' \
ELEC.txt |sort -n | uniq -c
And the output will be something like this:
1 ELEC.PLANT.CONS_EG_BTU.56841-WND-WT
2 ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL
1 ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL
2 ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL

How should I count the duplicate lines in each file?

I have tried this :
dirs=$1
for dir in $dirs
do
ls -R $dir
done
Like this?:
$ cat > foo
this
nope
$ cat > bar
neither
this
$ sort *|uniq -c
1 neither
1 nope
2 this
and weed out the ones with just 1s:
... | awk '$1>1'
2 this
Use sort with uniq to find the duplicate lines.
#!/bin/bash
dirs=("$@")
for dir in "${dirs[@]}" ; do
cat "$dir"/*
done | sort | uniq -c | sort -n | tail -n1
uniq -c will prepend the number of occurrences to each line
sort -n will sort the lines by the number of occurrences
tail -n1 will only output the last line, i.e. the maximum. If you want to see all the lines with the same number of duplicates, add the following instead of tail:
perl -ane 'if ($F[0] == $n) { push @buff, $_ }
else { @buff = $_ }
$n = $F[0];
END { print for @buff }'
You could use awk. If you just want to "count the duplicate lines", we could infer that you're after "all lines which have appeared earlier in the same file". The following would produce these counts:
#!/bin/sh
for file in "$@"; do
if [ -s "$file" ]; then
awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' "$file"
fi
done
The awk script first checks whether the current line is already stored in the array a, and if it is, increments a counter. Then it adds the line to the array. At the end of the file, we print the total.
Note that this might have problems on very large files, since the entire input file needs to be read into memory in the array.
Example:
$ printf 'foo\nbar\nthis\nbar\nthat\nbar\n' > inp.txt
$ awk '$0 in a {c++} {a[$0]} END {printf "%s: %d\n", FILENAME, c}' inp.txt
inp.txt: 2
The word 'bar' exists three times in the file, thus there are two duplicates.
To aggregate multiple files, you can just feed multiple files to awk:
$ printf 'foo\nbar\nthis\nbar\n' > inp1.txt
$ printf 'red\nblue\ngreen\nbar\n' > inp2.txt
$ awk '$0 in a {c++} {a[$0]} END {print c}' inp1.txt inp2.txt
2
For this, the word 'bar' appears twice in the first file and once in the second file -- a total of three times, thus we still have two duplicates.
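If you want one awk process but still a separate count for each file, you can reset the state whenever a new file starts (FNR == 1). The following is a sketch under that assumption, not a drop-in replacement: count_dups_per_file is a made-up name, and empty files are silently skipped because awk never sees a first line for them.

```shell
#!/bin/sh
# count_dups_per_file FILE... (hypothetical helper)
# For each non-empty file, prints "FILE: N" where N is the number of lines
# that already appeared earlier in that same file.
count_dups_per_file() {
    awk '
        # A new file begins: report the previous one and start over.
        FNR == 1 && NR > 1 {
            printf "%s: %d\n", prev, c
            c = 0
            split("", seen)             # portable way to empty an array
        }
        {
            if ($0 in seen) c++
            seen[$0]                    # referencing the element creates it
            prev = FILENAME
        }
        END { if (NR > 0) printf "%s: %d\n", prev, c }
    ' "$@"
}
```

With the inp1.txt and inp2.txt from the example above, this reports 1 for the first file and 0 for the second, because 'bar' is only repeated within inp1.txt.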

awk - send sum to global variable

I have a line in a bash script that calculates the sum of unique IP requests to a certain page.
grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}' | sort | uniq -c | awk '{sum += 1; print } END { print " ", sum, "total"}'
I am trying to get the value of sum to a variable outside the awk statement so I can compare pages to each other. So far I have tried various combinations of something like this:
unique_sum=0
grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}' | sort | uniq -c | awk '{sum += 1; print ; $unique_sum=sum} END { print " ", sum, "total"}'
echo "${unique_sum}"
This results in an echo of "0". I've tried placing $unique_sum=sum in the END, various combinations of initializing the variable (awk -v unique_sum=0 ...) and placing the variable assignment outside of the quoted sections.
So far, my Google-fu is failing horribly as most people just send the whole of the output to a variable. In this example, many lines are printed (one for each IP) in addition to the total. Failing a way to capture the 'sum' variable, is there a way to capture that last line of output?
This is probably one of the most sophisticated things I've tried in awk so my confidence that I've done anything useful is pretty low. Any help will be greatly appreciated!
You can't assign a shell variable inside an awk program. In general, no child process can alter the environment of its parent. You have to have the awk program print the calculated value, and then the shell can capture that value and assign it to a variable:
output=$( grep "$YESTERDAY" "$ACCESSLOG" | grep "$1" | awk -F" - " '{print $1}' | sort | uniq -c | awk '{sum += 1; print } END {print sum}' )
unique_sum=$( sed -n '$p' <<< "$output" ) # grab the last line of the output
sed '$d' <<< "$output" # print the output except for the last line
echo " $unique_sum total"
That pipeline can be simplified quite a lot: awk can do what grep can do, so first
grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}'
can be written as (longer, but only one process):
awk -F" - " -v date="$YESTERDAY" -v patt="$1" '$0 ~ date && $0 ~ patt {print $1}' "$ACCESSLOG"
And the last awk program just counts lines, so it can be replaced with wc -l.
All together:
unique_output=$(
awk -F" - " -v date="$YESTERDAY" -v patt="$1" '
$0 ~ date && $0 ~ patt {print $1}
' "$ACCESSLOG" | sort | uniq -c
)
echo "$unique_output"
unique_sum=$( wc -l <<< "$unique_output" )
echo " $unique_sum total"
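If all you need in the shell is the number itself, awk can also do the de-duplication and print only the total, which command substitution then captures. This is a sketch that assumes the IP is everything before the first " - " on each line, as in the question; unique_ip_count is a hypothetical name, and the date and page are matched as plain substrings with index() rather than as regular expressions.

```shell
#!/bin/sh
# unique_ip_count DATE PATTERN LOGFILE (hypothetical helper)
# Prints how many distinct IPs requested PATTERN on DATE.
unique_ip_count() {
    awk -F' - ' -v date="$1" -v patt="$2" '
        # Substring match on the whole line; count each IP only once.
        index($0, date) && index($0, patt) && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$3"
}
```

Usage would then look like unique_sum=$(unique_ip_count "$YESTERDAY" "$1" "$ACCESSLOG"), after which $unique_sum is an ordinary shell variable you can compare between pages.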

How do I count number of instances of an output in awk?

TL;DR
The idea is :
awk '{
IP[$1]++;
}
END {
for(var in IP)
print IP[var]
}
}' getline < sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID'
I want to count the number of instance of every block in the output from ->
sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID'
Which looks like:
ubuntu-geoip-pr#2382
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
rhythmbox#4759
rhythmbox#4759
rhythmbox#4759
Finally, I want to get the output as:
1
8
3
This corresponds to the number of occurrences of each of the items in the previous output.
Problem in full:
The sockstat command outputs some networking stats for the localhost. I first print a single key built from the second and third columns of the output (PROCESS and PID, respectively), in the form PROCESS#PID. Then I want to calculate the frequency of each unique key in that output. One way to do this is to use the awk getline construct, but that seems to work only for files, and I have not been able to make it pull input directly from the above command.
I do not want to use temporary files, as that takes away the elegance of the solution.
sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID' | sort | uniq -c | awk '{print $1}'
How about this?
sockstat | grep -v PROCESS | awk '{key=$2"#"$3; count[key]++} END {for ( key in count ) { print key" "count[key]; } }'
You could simplify your command:
sockstat | awk 'NR>1 { a[$2 "#" $3]++ } END { for (i in a) print a[i], i }'
If you just want the counts, simply edit the print statement:
sockstat | awk 'NR>1 { a[$2 "#" $3]++ } END { for (i in a) print a[i] }'
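One caveat with for (i in a): awk does not guarantee any particular iteration order, so the counts may not come out in the 1, 8, 3 order shown in the question. If order of first appearance matters, you can record it in a second array. A minimal sketch, reading the already-built keys from standard input; count_in_order is a made-up name.

```shell
#!/bin/sh
# count_in_order (hypothetical helper)
# Reads one key per line on stdin and prints each key's count,
# in the order the keys were first seen.
count_in_order() {
    awk '
        !($0 in count) { order[++n] = $0 }   # remember first appearance
        { count[$0]++ }
        END { for (i = 1; i <= n; i++) print count[order[i]] }
    '
}
```

It would sit at the end of the original pipeline, for example sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID' | count_in_order.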

unix - breakdown of how many lines with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character (sorted descending by the number of occurrences).
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested input (for the 'p' character)
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that works with a 2-character string such as 'gd', not just a single character such as p?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
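Two things to keep in mind with the -F approach: a multi-character separator is treated as a regular expression, so strings containing characters like . or * will not be matched literally, and an empty line gives NF-1 = -1. If you need literal matching, a sketch along these lines counts non-overlapping occurrences with index() instead. occ_breakdown is a hypothetical name, and backslashes in the search string are still subject to awk's -v escape processing.

```shell
#!/bin/sh
# occ_breakdown STRING FILE (hypothetical helper)
# Prints a tab-separated "occs count" table: how many lines contain STRING
# exactly N times (literal, non-overlapping), sorted by N descending.
occ_breakdown() {
    [ -n "$1" ] || { echo "usage: occ_breakdown STRING FILE" >&2; return 1; }
    printf 'occs\tcount\n'
    awk -v s="$1" '
        {
            n = 0
            rest = $0
            # Repeatedly find the next literal match and skip past it.
            while ((p = index(rest, s)) > 0) {
                n++
                rest = substr(rest, p + length(s))
            }
            print n
        }
    ' "$2" | sort -nr | uniq -c | awk -v OFS='\t' '{ print $2, $1 }'
}
```

For the sample file, occ_breakdown p samplefile produces the same 4/1, 3/2, 0/2 table as above, and occ_breakdown gd samplefile reports two lines with one occurrence and three lines with none.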
#!/usr/bin/env perl
use strict; use warnings;
my $seq = shift @ARGV;
die unless defined $seq;
my %freq;
while ( my $line = <> ) {
last unless $line =~ /\S/;
my $occurances = () = $line =~ /(\Q$seq\E)/g;
$freq{ $occurances } += 1;
}
for my $occurances ( sort { $b <=> $a} keys %freq ) {
print "$occurances:\t$freq{$occurances}\n";
}
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
declare -a count
while read -r ; do
cnt=${REPLY//[^p]/} # remove non-p characters
((count[${#cnt}]++)) # use length as array index
done < "$infile"
for idx in ${!count[*]} # iterate over existing indices
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
You can do it in one gawk process (well, with a sort coprocess):
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }
END {
print "occs", "count"
coproc = "sort -rn"
for (n in count)
print n, count[n] |& coproc
close(coproc, "to")
while ((coproc |& getline) > 0)
print
close(coproc)
}
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'

Resources