How do I count number of instances of an output in awk? - linux

TL;DR
The idea is :
awk '{
IP[$1]++;
}
END {
for(var in IP)
print IP[var]
}
}' getline < sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID'
I want to count the number of instances of every block in the output of:
sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID'
Which looks like:
ubuntu-geoip-pr#2382
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
chrome#2453
rhythmbox#4759
rhythmbox#4759
rhythmbox#4759
Finally, I want to get the output as:
1
8
3
This corresponds to the number of occurrences of each of the items in the previous output.
Problem in full:
The sockstat command outputs some networking stats for the localhost. I first print a single key built from the second and third columns of that output (PROCESS and PID, respectively), in the form PROCESS#PID. Then I want to calculate the frequency of each unique key in that output. One way to do this is to use the awk getline construct, but that seems to work only for files, and I have not been able to make it pull input directly from the above command.
I do not want to use temporary files, as that takes away the elegance of the solution.

sockstat | awk '{print $2 "#" $3}' | grep -v '^PROCESS#PID' | sort | uniq -c | awk '{print $1}'

How about this?
sockstat | grep -v PROCESS | awk '{key=$2"#"$3; count[key]++} END {for ( key in count ) { print key" "count[key]; } }'

You could simplify your command:
sockstat | awk 'NR>1 { a[$2 "#" $3]++ } END { for (i in a) print a[i], i }'
If you just want the counts, simply edit the print statement:
sockstat | awk 'NR>1 { a[$2 "#" $3]++ } END { for (i in a) print a[i] }'
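Note that awk does not guarantee any particular order for a for (i in a) loop, so the counts may not come out in the order the keys first appeared (1, 8, 3 in the question). A minimal sketch that remembers first-seen order, assuming the keys arrive one per line on standard input; the function name count_in_order is hypothetical:

```shell
count_in_order() {
    awk '!($0 in count) { order[++n] = $0 }   # remember each key the first time it appears
         { count[$0]++ }                      # tally every occurrence
         END { for (i = 1; i <= n; i++) print count[order[i]] }'
}

# Small stand-in for the sockstat-derived keys from the question.
printf '%s\n' 'ubuntu-geoip-pr#2382' 'chrome#2453' 'chrome#2453' 'rhythmbox#4759' |
    count_in_order
# prints 1, 2, 1 (one count per line, in first-seen order)
```

With the real data this would be fed as sockstat | awk 'NR>1 { print $2 "#" $3 }' | count_in_order.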

Related

Find number of unique values in a column

I would like to know the count of unique values in column using linux commands. The column has values like below (data is edited from previous ones). I need to ignore .M, .Q and .A at the end and just count the unique number of plants
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56841-WND-WT.Q"
"series_id":"ELEC.CONS_TOT.COW-GA-2.M"
"series_id":"ELEC.CONS_TOT.COW-GA-94.M"
I've tried this code but I'm not able to avoid those suffix
cat ELEC.txt | grep 'series_id' | cut -d, -f1 | wc -l
For above sample, expected count should be 6 but I get 8
This should do the job:
grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c
You first grep for the "ELEC.PLANT." part
remove the .Q,A,M
remove duplicates and count using sort | uniq -c
EDIT:
for the new data it should be only necessary to do the following:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c
When you have to do some counting, you can easily do it with awk. Awk is an extremely versatile tool and I strongly recommend you to have a look at it. Maybe start with Awk one-liners explained.
That said, you can easily do some conditional counting here:
What you want is to count all unique lines that contain series_id.
awk '/series_id/ && !($0 in a) { c++; a[$0] } END {print c}'
This essentially states: if my line contains "series_id" and I did not store the line in my array a, then it means I did not encounter my line yet and increase the counter c with 1. At the END of the program, I print the count c.
Now you want to clean things up a bit. Your lines of interest essentially look like
"something":"something else"
So we are interested in something else which is in the 4th field if " is a field separator, and we are only interested in that if something is series_id located in field 2.
awk -F'"' '($2=="series_id") && !($4 in a) { c++; a[$4] } END {print c}'
Finally, you don't care about the last letter of the fourth field, so we need to make a small substitution:
awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (!(str in a)) {c++; a[str] } } END {print c}'
You could also rewrite this differently as:
awk -F'"' '($2 != "series_id" ) { next }
{ str=$4; gsub(/.$/,"",str) }
( str in a ) { next }
{ c++; a[str] }
END { print c }'
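As a rough check of the approach above, here is a sketch that wraps the same idea in a shell function and runs it on a few of the sample lines. It assumes the suffix is always a dot followed by one capital letter, and the function name count_unique_series is made up for illustration:

```shell
count_unique_series() {
    awk -F'"' '$2 == "series_id" {
        id = $4
        sub(/\.[A-Z]$/, "", id)               # drop a trailing .M, .Q or .A style suffix
        if (!(id in seen)) { seen[id]; n++ }  # count each id only once
    }
    END { print n + 0 }' "$@"
}

printf '%s\n' \
    '"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"' \
    '"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"' \
    '"series_id":"ELEC.CONS_TOT.COW-GA-2.M"' |
    count_unique_series
# prints 2
```

Note the parentheses in !(id in seen): in awk, ! binds more tightly than in, so ! id in seen would test whether the negated value is a key of the array.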
My standard way to count unique values is making sure I have the list of values (using grep and cut in your case), and add the following commands behind a pipe:
| sort -n | uniq -c
The sort does the sorting, based on number sorting, while the uniq gets the unique entries (the -c stands for "count").
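Do this: grep 'series_id' ELEC.txt | cut -d. -f1-4 | sort -u | wc -l
-f1-4 keeps the first four dot-separated fields, which drops everything after the fourth dot (the .M, .Q or .A suffix on the ELEC.PLANT lines). sort -u is used instead of a bare uniq because uniq only collapses duplicates that are adjacent.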
Here is a possible solution using awk:
awk 'BEGIN{FS="[:.\"]+"} /^"series_id":/{print $6}' \
ELEC.txt |sort -n |uniq -c
The output for the sample you posted will be something like this:
1 56841-WND-WT
2 56855-ALL-ALL
1 56855-WND-ALL
2 56868-LFG-ALL
If you need the entire string, you can print the other fields as well:
awk 'BEGIN{FS="[:.\"]+"; OFS="."} /^"series_id":/{print $3,$4,$5,$6}' \
ELEC.txt |sort -n | uniq -c
And the output will be something like this:
1 ELEC.PLANT.CONS_EG_BTU.56841-WND-WT
2 ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL
1 ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL
2 ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL

Find all unique columns from a huge(having millions of records and columns)OFS file(without fixed header row) unix

Input
119764469|14100733//1,k1=v1,k2=v2,STREET:1:1=NY
119764469|14100733//1,k1=v1,k2=v2,k3=v3
119764469|14100733//1,k1=v1,k4=v4,abc.xyz:1:1=nmb,abc,po.foo:1:1=yu
k1 could be any name with alphanumeric with . & : special chars like abc.nm.1:1
Expected output(all unique columns), sorting not required/necessary , it should be super fast
k1,k2,STREET:1:1,k3,k4,abc.xyz:1:1
My current approach/solution is
awk -F',' '{for (i=0; i<=NR; i++) {for(j=1; j<=NF; j++){split($j,a,"="); print a[1];}}}' file.txt | awk '!x[$1]++' | grep -v '|' | sed -e :a -e '$!N; s/\n/ | /; ta'
It works fine but it is too slow for huge size of file(which could be in MBs or in GBs in size)
NOTE: This is required in data migration, should use basic unix shell commands as production may not allow to have 3rd party utilities.
not sure about the speed but give it a try
$ cut -d, -f2- file | # select the key/value pairs
tr ',' '\n' | # split each k=v to its own line
cut -d= -f1 | # select only keys
sort -u | # filter uniques
paste -sd, # serialize back to single csv line
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
I expect it to be faster than grep since no regex is involved.
Use grep -o to grep only the parts you need:
grep -o -e '[^=,]\+=[^,]\+' file.txt |awk -F'=' '{print $1}' |sort |uniq |tr '\n' ',' |sed 's/,$/\n/'
>>> abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
(sort is needed here because otherwise uniq doesn't work)
If you don't really need the output all on one line:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u
abc.xyz:1:1
k1
k2
k3
k4
STREET:1:1
If you do:
$ awk -F'[,=]' '{for (i=2;i<=NF;i+=2) print $i}' file | sort -u |
awk -v ORS= '{print sep $0; sep=","} END{print RS}'
abc.xyz:1:1,k1,k2,k3,k4,STREET:1:1
You could do it all in one awk script but I'm not sure it'll be as efficient as the above or might run into memory issues if/when the array grows to millions of values:
$ cat tst.awk
BEGIN { FS="[,=]"; ORS="" }
{
for (i=2; i<=NF; i+=2) {
vals[$i]
}
}
END {
for (val in vals) {
print sep val
sep = ","
}
print RS
}
$ awk -f tst.awk file
k1,abc.xyz:1:1,k2,k3,k4,STREET:1:1

Piping awk output into grep

So I'm writing a bash script to alphabetically list names from a text file, but only names with the same frequency (defined in the second column)
grep -wi '$1' /usr/local/linuxgym-data/census/femalenames.txt |
awk '{ print ($2) }' |
grep '$1' /usr/local/linuxgym-data/census/femalenames.txt |
sort |
awk '{ print ($1) }'
Since I'm doing this for class, I've been given the example of inputting 'ANA', and should return
ANA
RENEE
And the document has about 4500 lines in it
but the two fields I'm looking at have
ANA 0.120 55.989 181
RENEE 0.120 56.109 182
And so I want to find all names with the second column the same as ANA (0.120). The second column is the frequency of the name... This is just dummy data given to me by my school, so I don't know what that means.
But if there was another name with the same frequency as ANA (0.120) it would also be listed in the output.
When I run the commands on their own, they work fine, but it seems to have trouble with the 3rd line with using the awk output as $1 in the grep below it.
I am pretty new to this, so I'm most likely doing it in the most roundabout way.
You could probably do this in one line, but that's pushing it a bit. Split it into two pieces to make it easier to write and read. For example:
name=$1
src=/usr/local/linuxgym-data/census/femalenames.txt
# get the frequency you're after
freq=$(awk -v name="$name" '$1==name {print $2}' "$src")
# get the names with that frequency
awk -v freq="$freq" '$2==freq {print $1}' "$src"
Tradeoff between this and RomanPerekhrest's solution is that their solution will do one scan, but index everything in memory. This one will scan the file twice, but save you the memory.
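For reference, the two-pass version can be wrapped in a function and tried on a small stand-in file. This is only a sketch: the function name same_freq_names, the case-insensitive match and the temporary demo file are assumptions, not part of the original exercise. It also sidesteps the problem in the question's pipeline, where '$1' in single quotes is passed to grep literally instead of being expanded by the shell.

```shell
same_freq_names() {
    # $1: name to look up, $2: data file with "NAME FREQ ..." on each line
    freq=$(awk -v name="$1" 'toupper($1) == toupper(name) { print $2; exit }' "$2")
    [ -n "$freq" ] || return 1          # name not found
    awk -v freq="$freq" '$2 == freq { print $1 }' "$2" | sort
}

demo=$(mktemp)
printf '%s\n' 'MARY 2.629 2.629 1' 'ANA 0.120 55.989 181' 'RENEE 0.120 56.109 182' > "$demo"
same_freq_names ANA "$demo"    # prints ANA then RENEE
rm -f "$demo"
```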
With single awk:
inp="ANA"
awk -v inp="$inp" '{ a[$1]=$2 } END { if(inp in a){ v=a[inp];
for(i in a){ if(a[i]==v) print i }}
}' /usr/local/linuxgym-data/census/femalenames.txt | sort
The output:
ANA
RENEE
a[$1]=$2 - accumulating frequency value for each name
if(inp in a){ v=a[inp]; - if the input name inp is in array - get its frequency value
for(i in a){ if(a[i]==v) print i - print all names that have the same frequency value as for input name
This should probably do it...
f="/usr/local/linuxgym-data/census/femalenames.txt"
grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }'
Test...
echo 'ANA 0.120 55.989 181
RENEE 0.120 56.109 182' > fem
foo() { grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }' ; }
f=fem ; foo ANA
Output:
ANA
RENEE

Trying to join output from ps and pwdx linux commands

I am trying to join output from ps and pwdx command. Can anyone point out the mistake in my command.
ps -eo %p,%c,%u,%a --no-headers | awk -F',' '{ for(i=1;i<=NF;i++) {printf $i", "} ; printf pwdx $1; printf "\n" }'
I expect the last column in each row to be the process directory. But it just shows the value of $1 instead of the command output pwdx $1
This is my output sample (1 row):
163957, processA , userA , /bin/processA -args, 163957
I expected
163957, processA , userA , /bin/processA -args, /app/processA
Can anyone point out what I may be missing
Try this:
ps -eo %p,%c,%u,%a --no-headers | awk -F',' '{ printf "%s,", $0; cmd = "pwdx " $1; cmd | getline; close(cmd); print gensub("^[0-9]*: *","","1",$0);}'
Explanation:
awk '{print pwdx $1}' will concatenate the awk variable pwdx (which is empty) and $1 (pid). So, effectively, you were getting only the pid at the output.
In order to run a command and get its output, you need to use this awk construct:
awk '{"some command" | getline; do_something_with $0}'
# After getline, the output will be present in $0.
#For multiline output, use this:
awk '{while ("some command" | getline){do_something_with $0}}'
# Each individual line will be present in subsequent run of the while loop.
Simplifying your example to focus on how to execute the pwdx command within awk and capture the result of this command into an awk variable as this is where you were having issues:
ps -eo %p,%c,%u,%a --no-headers | awk -F',' '{ system("pwdx "$1) | getline vpwdx; printf vpwdx $1}'
produces:
15651665: /
16651690: /
16901691: /home/fpm
169134248: /home/fpm
3424834254: /home/fpm/tmp
3425440181: /home/fpm/UDK2015
...
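One caveat when capturing command output in awk: system() runs the command and returns its exit status, while the command's own output goes straight to standard output, so it cannot feed getline. The "command" | getline var form does capture the output. A hedged sketch of that form follows; the function name append_cmd_output and its lookup parameter are invented for illustration, and echo stands in for pwdx so the demo runs anywhere:

```shell
append_cmd_output() {
    # $1: command to run once per line with the first field as argument (default: pwdx)
    awk -F',' -v lookup="${1:-pwdx}" '{
        cmd = lookup " " $1
        out = ""
        if ((cmd | getline out) > 0)
            sub(/^ *[0-9]+: */, "", out)   # strip the "PID: " prefix that pwdx prints
        close(cmd)                         # release the pipe before the next line
        print $0 "," out
    }'
}

# Intended use: ps -eo %p,%c,%u,%a --no-headers | append_cmd_output pwdx
printf '%s\n' '42,procA,userA,/bin/procA -x' | append_cmd_output echo
# prints: 42,procA,userA,/bin/procA -x,42
```

Calling close(cmd) after each line matters when there are many processes, since every distinct command string otherwise keeps a pipe open until awk exits.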

awk - send sum to global variable

I have a line in a bash script that calculates the sum of unique IP requests to a certain page.
grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}' | sort | uniq -c | awk '{sum += 1; print } END { print " ", sum, "total"}'
I am trying to get the value of sum to a variable outside the awk statement so I can compare pages to each other. So far I have tried various combinations of something like this:
unique_sum=0
grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}' | sort | uniq -c | awk '{sum += 1; print ; $unique_sum=sum} END { print " ", sum, "total"}'
echo "${unique_sum}"
This results in an echo of "0". I've tried placing $unique_sum=sum in the END block, various combinations of initializing the variable (awk -v unique_sum=0 ...), and placing the variable assignment outside of the quoted sections.
So far, my Google-fu is failing horribly as most people just send the whole of the output to a variable. In this example, many lines are printed (one for each IP) in addition to the total. Failing a way to capture the 'sum' variable, is there a way to capture that last line of output?
This is probably one of the most sophisticated things I've tried in awk so my confidence that I've done anything useful is pretty low. Any help will be greatly appreciated!
You can't assign a shell variable inside an awk program. In general, no child process can alter the environment of its parent. You have to have the awk program print out the calculated value, and then shell can grab that value and assign it to a variable:
output=$( grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}' | sort | uniq -c | awk '{sum += 1; print } END {print sum}' )
unique_sum=$( sed -n '$p' <<< "$output" ) # grab the last line of the output
sed '$d' <<< "$output" # print the output except for the last line
echo " $unique_sum total"
That pipeline can be simplified quite a lot: awk can do what grep can do, so first
grep $YESTERDAY $ACCESSLOG | grep "$1" | awk -F" - " '{print $1}'
is (longer, but only one process)
awk -F" - " -v date="$YESTERDAY" -v patt="$1" '$0 ~ date && $0 ~ patt {print $1}' "$ACCESSLOG"
And the last awk program just counts how many lines and can be replaced with wc -l
All together:
unique_output=$(
awk -F" - " -v date="$YESTERDAY" -v patt="$1" '
$0 ~ date && $0 ~ patt {print $1}
' "$ACCESSLOG" | sort | uniq -c
)
echo "$unique_output"
unique_sum=$( wc -l <<< "$unique_output" )
echo " $unique_sum total"
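If only the total is needed for comparing pages, the value can be captured directly with command substitution. A minimal sketch under the assumption that each log line starts with the client IP followed by " - "; the function name count_unique_ips and the sample lines are made up:

```shell
count_unique_ips() {
    # Print the number of distinct first fields; reads the named files or stdin.
    awk -F' - ' '{ print $1 }' "$@" | sort -u | wc -l | tr -d ' '
}

unique_sum=$(printf '%s\n' '10.0.0.1 - GET /a' '10.0.0.2 - GET /a' '10.0.0.1 - GET /a' |
    count_unique_ips)
echo " $unique_sum total"    # prints " 2 total"
```

The tr -d ' ' step is there because some wc implementations pad the count with leading spaces.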
