How to hash particular column in csv file | linux | - linux

I have a scenario
where i want to hash some columns of csv file
how to do that with below data
ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22
The Column name NAME | AGE should get hashed with random values
like below output
ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940
When i tested this code below code gives me same value for the column 'NAME' it should give randomized values
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample.csv
output :
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940

You may use it like this:
awk 'function hash(s, cmd, hex, line) {
cmd = "openssl md5 <<< \"" s "\""
if ( (cmd | getline line) > 0)
hex = line
close(cmd)
return hex
}
BEGIN {
FS = OFS = "|"
}
NR == 1 {
print
next
}
{
print $1, hash($2), $3, hash($4)
}' file
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6

You have to specify | as input and output field separators. Otherwise $2 is not what you expect, but an empty string.
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' sample.csv
prints
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22

Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:
$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9
Note that the MD5 hashes are different in this answer than (At the time of writing) the ones in the others; that's because they use approaches that add a trailing newline to the strings being hashed, producing incorrect results if you want the exact hash:
$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7 -
$ echo -n AB1 | md5sum
1109867462b2f0f0470df8386036243c -

You might consider using a language that has support for md5 included, or at least cache the md5 results (I assume that the city and age have a limited domain, which is smaller than the number of lines).
Perl has support for md5 out of the box:
perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) {
$F[$_] = md5_hex($F[$_]) for (1,3);
print join "|",#F
} else { print }'
online demo: https://ideone.com/xg6cxZ (to my surprise ideone has perl available in bash)
Digest::MD5 is a core module, any perl installation should have it
-M'Digest::MD5 qw(md5_hex)' - this loads the md5_hex function
-l handle line endings
-F'\|' - autosplit fields on | (this implies -a and -n)
2..eof - range operator (or flip-flop as some want to call it) - true between line 2 and end of the file
$F[$_] = md5_hex($F[$_]) - replace field $_ with it's md5 sum
for (1,3) - statement modifier runs the statement for 1 and 3 aliasing $_ to them
print join "|",#F - print the modified fields
else { print } - this hanldes the header
Note about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that does 5000 lines in ~1 minute 14 seconds (i wasn't patient enough to wait for 100,000 lines)
time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3);print join "|",#F } else { print }' <sample2.txt > out4.txt
real 0m0.121s
user 0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt
real 1m14.205s
user 0m50.405s
sys 0m35.340s
md5.awk defines the md5 function as such:
$ cat md5.awk
function md5(str, cmd, l, hex) {
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,0,32)
close(cmd)
return hex
}
I'm using /bin/echo because there are some variants of shell where echo doesn't have -n
I'm using -n mostly because I want to be able to compare the results with the perl results
substr(l,0,32) - on my machine openssl md5 doesn't return just the sum, it has also the file name - see: https://ideone.com/KGMWPe - substr gets only the relevant part
I'm using a separate file because it seems much cleaner, and because I can switch between function implementations fairly easy
As I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl tool.
$ cat md5memo.awk
function md5(str, cmd, l, hex) {
if (cache[str])
return cache[str]
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,0,32)
close(cmd)
cache[str] = hex
return hex
}
With the above caching, the results improve dramatically:
$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt
real 0m0.192s
user 0m0.141s
sys 0m0.085s
[savuso#localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt
real 0m0.281s
user 0m0.222s
sys 0m0.088s
however your mileage my vary: sample2.txt has 100000 lines, with 5 different values for $2 and 40 different values for $4. Real life data may vary!
Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers

Related

Piping awk output into grep

So I'm writing a bash script to alphabetically list names from a text file, but only names with the same frequency (defined in the second column)
grep -wi '$1' /usr/local/linuxgym-data/census/femalenames.txt |
awk '{ print ($2) }' |
grep '$1' /usr/local/linuxgym-data/census/femalenames.txt |
sort |
awk '{ print ($1) }'
Since I'm doing this for class, I've been given the example of inputting 'ANA', and should return
ANA
RENEE
And the document has about 4500 lines in it
but the two fields I'm looking at have
ANA 0.120 55.989 181
RENEE 0.120 56.109 182
And so I want to find all names with the second column the same as ANA (0.120). The second column is the frequency of the name... This is just dummy data given to me by my school, so I don't know what that means.
But if there was another name with the same frequency as ANA (0.120) it would also be listed in the output.
When I run the commands on their own, they work fine, but it seems to have trouble with the 3rd line with using the awk output as $1 in the grep below it.
I am pretty new to this, so I'm most likely doing it in the most roundabout way.
You could probably do this in one line, but that's a pushing it a bit. Split it into two pieces to make it easier to write/read. For example:
name=$1
src=/usr/local/linuxgym-data/census/femalenames.txt
# get the frequency you're after
freq=$(awk -v name="$name" '$1==name {print $2}' "$src")
# get the names with that frequency
awk -v freq="$freq" '$2==freq {print $1}' "$src"
Tradeoff between this and RomanPerekhrest's solution is that their solution will do one scan, but index everything in memory. This one will scan the file twice, but save you the memory.
With single awk:
inp="ANA"
awk -v inp=$inp '{ a[$1]=$2 } END { if(inp in a){ v=a[inp];
for(i in a){ if(a[i]==v) print i }}
}' /usr/local/linuxgym-data/census/femalenames.txt | sort
The output:
ANA
RENEE
a[$1]=$2 - accumulating frequency value for each name
if(inp in a){ v=a[inp]; - if the input name inp is in array - get its frequency value
for(i in a){ if(a[i]==v) print i - print all names that have the same frequency value as for input name
This should probably do it...
f="/usr/local/linuxgym-data/census/femalenames.txt"
grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }'
Test...
echo 'ANA 0.120 55.989 181
RENEE 0.120 56.109 182' > fem
foo() { grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }' ; }
f=fem ; foo ANA
Output:
ANA
RENEE

Linux - parsing data, what language to use

I am looking to parse data out of a 'column' based format. I am running into issues where I feel I am 'hacking' bash/awk commands to pull the strings and numbers. If the numbers/text come in different formats then the script might fail unexpectedly and I will have errors.
Data:
RSSI (dBm): -86 Tx Power: 0
RSRP (dBm): -114 TAC: 4r5t (12341)
RSRQ (dB): -10 Cell ID: efefwg (4261431)
SINR (dB): 2.2
My method:
Using bash and awk
#!/bin/bash
DATA_OUTPUT=$(get_data)
RSSI=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSSI" {print $3}')
RSRP=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSRP" {print $3}')
RSRQ=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSRQ" {print $3}')
SINR=$(echo "${DATA_OUTPUT}" | awk '$1 == "SINR" {print $3}')
TX_POWER=$(echo "${DATA_OUTPUT}" | awk '$4 == "Tx" {print $6}')
echo "$SINR"
echo ">$SINR<"
However the output of the above comes out very strange.
2.2 # thats fine!
<2.2 # what??? expecting >4.6<
Little things like this make me question using awk and bash to parse the data. Should I use C++ or some other language? Or is there a better way of doing this?
Thank you
This should be your starting point (the match() can be simplified or removed if your input data is tab-separated or fixed width fields):
$ cat file
RSSI (dBm): -86 Tx Power: 0
RSRP (dBm): -114 TAC: 4r5t (12341)
RSRQ (dB): -10 Cell ID: efefwg (4261431)
SINR (dB): 2.2
.
$ cat tst.awk
{
tail = $0
while ( match(tail,/[^:]+:[[:space:]]+[^[:space:]]+[[:space:]]*([^[:space:]]*$)?/) )
{
nvPair = substr(tail,RSTART,RLENGTH)
sub(/ \([^)]+\):/,":",nvPair) # remove (dB) or (dBm)
sub(/:[[:space:]]+/,":",nvPair) # remove spaces after :
sub(/[[:space:]]+$/,"",nvPair) # remove trailing spaces
split(nvPair,tmp,/:/)
name2value[tmp[1]] = tmp[2] # name2value["RSSI"] = "-86"
tail = substr(tail,RSTART+RLENGTH)
}
}
END {
for (name in name2value) {
value = name2value[name]
printf "%s=\"%s\"\n", name, value
}
}
.
$ awk -f tst.awk file
Tx Power="0"
RSSI="-86"
TAC="4r5t (12341)"
Cell ID="efefwg (4261431)"
RSRP="-114"
RSRQ="-10"
SINR="2.2"
Hopefully it's clear that in the above script after the match() loop you can simply say things like print name2value["Tx Power"] to print the value of that key phrase.
If your data was created in DOS, run dos2unix or tr -d '^M' on it first, where ^M means a literal control-M character.
Your data contains DOS-style \r\n line endings. When you do this
echo ">$SINR<"
The actual output is actually
>4.6\r<
The carriage return sends the cursor back to the start of the line.
You can do this:
DATA_OUTPUT=$(get_data | sed 's/\r$//')
But instead of parsing the output over and over, I'd rewrite like this:
while read -ra fields; do
case ${fields[0]} in
RSSI) rssi=${fields[2]};;
RSRP) rsrp=${fields[2]};;
RSPQ) rspq=${fields[2]};;
SINR) sinr=${fields[2]};;
esac
if [[ ${fields[3]} == "Tx" ]]; then tx_power=${fields[5]}; fi
done < <(get_data | sed 's/\r$//' )

wput speed result as pass or fail

I'm using the following to output the result of an upload speed test
wput 10MB.zip ftp://user:pass#host 2>&1 | grep '\([0-9.]\+[KM]/s\)'
which returns
18:14:38 (10MB.zip) - '10.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 10.23M/s
I'd like to have the result 10.23M/s (i.e. the speed) echoed, and a comparison result:
if speed=>5 MB/s then echo "pass" else echo "fail"
So, the final output would be:
PASS 7 M/s
23/01/2013
ideally i'd like it all done on a single line so far i've got
wput 100M.bin ftp://test:test#0.0.0.0 2>&1 | grep -o '\([0-9.]\+[KM]/s\)$' | awk ' { if (($1 > 5) && ($2 == "M/s")) { printf("FAST %s\n ", $0); }}'
however it doesn't output anything if I remove
&& ($2 == "M/s"))
it works but I obviously want to it output above 5M/s and as it is it would still echo fast if it was over 1K/s. Can someone tell me what i've missed.
Using awk:
# Over 5M/s
$ cat pass
18:14:38 (10MB.zip) - '10.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 10.23M/s
$ awk 'END{f="FAIL "$NF;p="PASS "$NF;if($NF~/K\/s/){print f;exit};gsub(/M\/s/,"");print(int($NF)>5?p:f)}' pass
PASS 10.23M/s
# Under 5M/s
$ cat fail
18:14:38 (10MB.zip) - '3.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 3.23M/s
$ awk 'END{f="FAIL "$NF;p="PASS "$NF;if($NF~/K\/s/){print f;exit};gsub(/M\/s/,"");print(int($NF)>5?p:f)}' fail
FAIL 3.23M/s
# Also Handle K/s
$ cat slow
18:14:38 (10MB.zip) - '3.49M/s' [10485760]
Transfered 10,485,760 bytes in 1 file at 8.23K/s
$ awk 'END{f="FAIL "$NF;p="PASS "$NF;if($NF~/K\/s/){print f;exit};gsub(/M\/s/,"");print(int($NF)>5?p:f)}' slow
FAIL 8.23K/s
Not sure where you get 7 M/s from?
According to #Rubens, you can use grep -o with your regex to show the speed, just append $ for end of line
wput 10MB.zip ftp://user:pass#host 2>&1 | grep -o '\([0-9.]\+[KM]/s\)$'
With perl you can easily do the remaining stuff
use strict;
use warnings;
while (<>) {
if (m!\s+((\d+\.\d+)([KM])/s)$!) {
if ($2 > 5 && $3 eq 'M') {
print "PASS $1\n";
} else {
print "FAIL $1\n";
}
}
}
and then call it
wput 10MB.zip ftp://user:pass#host 2>&1 | perl script.pl
This is an answer to the question update.
With the awk program, you haven't split the speed into numeric and unit value. It is just one string.
Because fast speed is greater than 5 M/s, you can ignore K/s and extract the speed by splitting at the character M. Then you have the speed in $1 and can compare it
wput 100M.bin ftp://test:test#0.0.0.0 2>&1 | grep -o '[0-9.]\+M/s$' | awk -F '/M/' '{ if ($1 > 5) { printf("FAST %s\n ", $0); }}'

GROUP BY/SUM from shell

I have a large file containing data like this:
a 23
b 8
a 22
b 1
I want to be able to get this:
a 45
b 9
I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?
Edit: The modern (GNU/Linux) solution, as mentioned in comments years ago ;-) .
awk '{
arr[$1]+=$2
}
END {
for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort -k1,1
The originally posted solution, based on old Unix sort options:
awk '{
arr[$1]+=$2
}
END {
for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort +0n -1
I hope this helps.
No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:
#!/bin/bash
declare -A values
while read key value; do
values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[#]}"; do
printf "%s %s\n" "$key" "${values[$key]}"
done
...or, if you sort the file first (which will be more memory-efficient; GNU sort is able to do tricks to sort files larger than memory, which a naive script -- whether in awk, python or shell -- typically won't), you can do this in a way which will work in older versions (I expect the following to work through bash 2.0):
#!/bin/bash
read cur_key cur_value
while read key value; do
if [[ $key = "$cur_key" ]] ; then
cur_value=$(( cur_value + value ))
else
printf "%s %s\n" "$cur_key" "$cur_value"
cur_key="$key"
cur_value="$value"
fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
This Perl one-liner seems to do the job:
perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile
This can be easily achieved with the following single-liner:
cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Or.
termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Here a Python package, termsql, is used, which is a wrapper around SQLite. Note, that currently it's not upload to PyPI, and also can only be installed system-wide (setup.py is a little broken), like:
pip install --user https://github.com/tobimensch/termsql/archive/master.zip
Update
In 2020 version 1.0 was finally uploaded to PyPI, so pip install --user termsql can be used.
One way using perl:
perl -ane '
next unless #F == 2;
$h{ $F[0] } += $F[1];
END {
printf qq[%s %d\n], $_, $h{ $_ } for sort keys %h;
}
' infile
Content of infile:
a 23
b 8
a 22
b 1
Output:
a 45
b 9
With GNU awk (versions less than 4):
WHINY_USERS= awk 'END {
for (E in a)
print E, a[E]
}
{ a[$1] += $2 }' infile
With GNU awk >= 4:
awk 'END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (E in a)
print E, a[E]
}
{ a[$1] += $2 }' infile
With sort + awk combination one could try following, without creating array.
sort -k1 Input_file |
awk '
prev!=$1 && prev{
print prev,(prevSum?prevSum:"N/A")
prev=prevSum=""
}
{
prev=$1
prevSum+=$2
}
END{
if(prev){
print prev,(prevSum?prevSum:"N/A")
}
}'
Explanation: Adding detailed explanation for above.
sort -k1 file1 | ##Using sort command to sort Input_file by 1st field and sending output to awk as an input.
awk ' ##Starting awk program from here.
prev!=$1 && prev{ ##Checking condition prev is NOT equal to first field and prev is NOT NULL.
print prev,(prevSum?prevSum:"N/A") ##Printing prev and prevSum(if its NULL then print N/A).
prev=prevSum="" ##Nullify prev and prevSum here.
}
{
prev=$1 ##Assigning 1st field to prev here.
prevSum+=$2 ##Adding 2nd field to prevSum.
}
END{ ##Starting END block of this awk program from here.
if(prev){ ##Checking condition if prev is NOT NULL then do following.
print prev,(prevSum?prevSum:"N/A") ##Printing prev and prevSum(if its NULL then print N/A).
}
}'

unix - breakdown of how many lines with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specfic character. (sorted descending by the number of occurrences)
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested input (for the 'p' character)
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that worked off a 2 character string such as 'gd' not a just a specific character such as p?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
#!/usr/bin/env perl
use strict; use warnings;
my $seq = shift #ARGV;
die unless defined $seq;
my %freq;
while ( my $line = <> ) {
last unless $line =~ /\S/;
my $occurances = () = $line =~ /(\Q$seq\E)/g;
$freq{ $occurances } += 1;
}
for my $occurances ( sort { $b <=> $a} keys %freq ) {
print "$occurances:\t$freq{$occurances}\n";
}
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
declare -a count
while read ; do
cnt=${REPLY//[^p]/} # remove non-p characters
((count[${#cnt}]++)) # use length as array index
done < "$infile"
for idx in ${!count[*]} # iterate over existing indices
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
Can to it in one gawk process (well, with a sort coprocess)
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }
END {
print "occs", "count"
coproc = "sort -rn"
for (n in count)
print n, count[n] |& coproc
close(coproc, "to")
while ((coproc |& getline) > 0)
print
close(coproc)
}
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'

Resources