I would like to know the count of unique values in a column using Linux commands. The column has values like those below (the data is edited from previous ones). I need to ignore the .M, .Q and .A at the end and just count the number of unique plants:
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56841-WND-WT.Q"
"series_id":"ELEC.CONS_TOT.COW-GA-2.M"
"series_id":"ELEC.CONS_TOT.COW-GA-94.M"
I've tried this command, but I'm not able to strip those suffixes:
cat ELEC.txt | grep 'series_id' | cut -d, -f1 | wc -l
For the above sample, the expected count is 6, but I get 8.
This should do the job:
grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c
You first grep for the "ELEC.PLANT." part, then the cut removes the trailing .Q, .A or .M, and sort | uniq -c removes duplicates and counts them.
EDIT:
For the new data it should only be necessary to do the following:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c
When you have to do some counting, you can easily do it with awk. Awk is an extremely versatile tool and I strongly recommend you have a look at it. Maybe start with Awk one-liners explained.
That said, you can easily do some conditional counting here: what you want is to count all unique lines which contain series_id.
awk '/series_id/ && !($0 in a) { c++; a[$0] } END {print c}'
This essentially states: if my line contains "series_id" and the line is not yet stored in my array a, then I have not encountered this line before, so I increase the counter c by 1. At the END of the program, I print the count c. (Note the parentheses around $0 in a: in awk, ! binds more tightly than the in operator.)
Now you want to clean things up a bit. Your lines of interest essentially look like
"something":"something else"
So we are interested in something else, which is the 4th field when " is the field separator, and we only care about it when something (the 2nd field) is series_id.
awk -F'"' '($2=="series_id") && (! $4 in a ) { c++; a[$4] } END {print c}'
Finally, you don't care about the last letter of the fourth field, so we need to make a small substitution:
awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (! str in a) {c++; a[str] } } END {print c}'
You could also rewrite this differently as:
awk -F'"' '($2 != "series_id" ) { next }
{ str=$4; gsub(/.$/,"",str) }
( str in a ) { next }
{ c++; a[str] }
END { print c }'
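As a quick sanity check (this is just the one-liner above run with the sample file as its argument), it should print 6:
awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (!(str in a)) { c++; a[str] } } END { print c }' ELEC.txt
6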
My standard way to count unique values is to make sure I have the list of values (using grep and cut in your case), and then to add the following commands behind a pipe:
| sort -n | uniq -c
The sort does the sorting, using numeric ordering, while uniq collapses the duplicates into unique entries (the -c stands for "count").
Do this: cat ELEC.txt | grep 'series_id' | cut -f1-4 -d. | uniq | wc -l
-f1-4 keeps only the first four dot-separated fields, i.e. it drops everything after the fourth . on each line. Note that uniq only collapses adjacent duplicates; your sample is already grouped, but otherwise add a sort, as in the variant below.
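If identical plants might not sit on adjacent lines, a safer variant of the same idea lets sort -u do the de-duplication; it also prints 6 for your sample:
cat ELEC.txt | grep 'series_id' | cut -f1-4 -d. | sort -u | wc -l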
Here is a possible solution using awk:
awk 'BEGIN{FS="[:.\"]+"} /^"series_id":/{print $6}' \
ELEC.txt |sort -n |uniq -c
The output for the sample you posted will be something like this:
1 56841-WND-WT
2 56855-ALL-ALL
1 56855-WND-ALL
2 56868-LFG-ALL
If you need the entire string, you can print the other fields as well:
awk 'BEGIN{FS="[:.\"]+"; OFS="."} /^"series_id":/{print $3,$4,$5,$6}' \
ELEC.txt |sort -n | uniq -c
And the output will be something like this:
1 ELEC.PLANT.CONS_EG_BTU.56841-WND-WT
2 ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL
1 ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL
2 ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL
Related
Please, how can I count the list of numbers in a file?
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}'
Using the command above I can display the numbers in that range on the terminal, but if there are many of them it is difficult to count them. How can I show the counts of the numbers on the terminal, for example:
1-20,
2-22,
3-23,
4-24,
etc
I know I can use wc, but I don't know how to work it into the command above.
awk '
{ for(i=1;i<=NF;i++) if (0<=$i && $i<=25) cnts[$i]++ }
END { for (n in cnts) print n, cnts[n] }
' file
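For instance, with a hypothetical numbers.txt containing the lines "3 17 26 5", "99 3 17" and "17", piping the result through sort -n (the order of awk's for (n in cnts) loop is unspecified) would give:
awk '
{ for(i=1;i<=NF;i++) if (0<=$i && $i<=25) cnts[$i]++ }
END { for (n in cnts) print n, cnts[n] }
' numbers.txt | sort -n
3 2
5 1
17 3
(26 and 99 fall outside the 0-25 range and are ignored.)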
Pipe the output to sort -n and uniq -c
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}' filename | sort -n | uniq -c
You need to sort first because uniq requires all the same elements to be consecutive.
While I'm personally an awk fan, you might be glad to learn about grep -o functionality. I'm using grep -o to match all numbers in the file, and then awk can be used to pick all the numbers between 0 and 25 (inclusive). Last, we can use sort and uniq to count the results.
grep -o "[0-9][0-9]*" file | awk ' $1 >= 0 && $1 <= 25 ' | sort -n | uniq -c
Of course, you could do the counting in awk with an associative array as Ed Morton suggests:
egrep -o "\d+" file | awk ' $1 >= 0 && $1 <= 25 ' | awk '{cnt[$1]++} END { for (i in cnt) printf("%s-%s\n", i,cnt[i] ) } '
I modified Ed's code (typically not a good idea - I've been reading his code for years now) to show a modular approach - one awk script for filtering numbers in the range 0 to 25 and another awk script for counting a list (of anything).
I also introduced another subtle difference from my first script by using egrep instead of grep.
To be honest, the second awk script originally generated some unexpected output, but I wanted to share an example of a more general approach. EDIT: I applied Ed's suggestion to correct the unexpected output - it's fine now.
So I'm writing a bash script to alphabetically list names from a text file, but only names with the same frequency (defined in the second column):
grep -wi '$1' /usr/local/linuxgym-data/census/femalenames.txt |
awk '{ print ($2) }' |
grep '$1' /usr/local/linuxgym-data/census/femalenames.txt |
sort |
awk '{ print ($1) }'
Since I'm doing this for class, I've been given the example of inputting 'ANA', and it should return:
ANA
RENEE
And the document has about 4500 lines in it, but the two entries I'm looking at are:
ANA 0.120 55.989 181
RENEE 0.120 56.109 182
And so I want to find all names with the second column the same as ANA (0.120). The second column is the frequency of the name... This is just dummy data given to me by my school, so I don't know what that means.
But if there was another name with the same frequency as ANA (0.120) it would also be listed in the output.
When I run the commands on their own, they work fine, but the script seems to have trouble at the 3rd line, where I try to use the awk output as $1 in the grep below it.
I am pretty new to this, so I'm most likely doing it in the most roundabout way.
You could probably do this in one line, but that's pushing it a bit. Split it into two pieces to make it easier to write/read. For example:
name=$1
src=/usr/local/linuxgym-data/census/femalenames.txt
# get the frequency you're after
freq=$(awk -v name="$name" '$1==name {print $2}' "$src")
# get the names with that frequency
awk -v freq="$freq" '$2==freq {print $1}' "$src"
The tradeoff between this and RomanPerekhrest's solution is that theirs does one scan but indexes everything in memory, while this one scans the file twice but saves you the memory.
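For example, saved as a small script (say same_freq.sh, wrapping the two snippets above; the name is just an assumption) and assuming the census file contains the two sample rows shown in the question, it would behave like this:
bash same_freq.sh ANA
ANA
RENEE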
With single awk:
inp="ANA"
awk -v inp=$inp '{ a[$1]=$2 } END { if(inp in a){ v=a[inp];
for(i in a){ if(a[i]==v) print i }}
}' /usr/local/linuxgym-data/census/femalenames.txt | sort
The output:
ANA
RENEE
a[$1]=$2 - store the frequency value for each name
if(inp in a){ v=a[inp]; - if the input name inp is in the array, get its frequency value
for(i in a){ if(a[i]==v) print i - print all names that have the same frequency value as the input name
This should probably do it...
f="/usr/local/linuxgym-data/census/femalenames.txt"
grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }'
Test...
echo 'ANA 0.120 55.989 181
RENEE 0.120 56.109 182' > fem
foo() { grep $(grep -wi -m 1 "$1" $f | awk '{ print ($2) }') $f | \
sort | awk '{ print ($1) }' ; }
f=fem ; foo ANA
Output:
ANA
RENEE
I have something I need help with and would appreciate your help.
Let's take an example. I have file 1 with this data:
"eno", "ename", "salary"
"1","john","50000"
"2","steve","30000"
"3","aku","20000"
and I have file 2 with this data:
"eno", "ename", "incentives"
"1","john","2000"
"2","steve","5000"
"4","akshi","200"
And the expected output I want in a third file is:
"eno", "ename", "t_salary"
"1","john","52000"
"2","steve","35000"
This is the expected result; I should be using eno and ename as the primary key, and the output should look like the above.
If your files are sorted and the first field is the key, you can join the files and work on the combined fields.
That is,
$ join -t, file1 file2
"eno", "ename", "salary", "ename", "incentives"
"1","john","50000","john","2000"
"2","steve","30000","steve","5000"
and your awk can be
... | awk -F, -v OFS=, 'NR==1{print ...}
NR>1{gsub(/"/,"",$3);
gsub(/"/,"",$5);
print $1,$2,$3+$5}'
Printing the header and quoting the total field are left as an exercise; one possible completion is sketched below.
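One possible way to finish that exercise (just a sketch building on the join output above; file1 and file2 are the files from the question) is:
join -t, file1 file2 | awk -F, -v OFS=, '
    NR==1 { print "\"eno\",\"ename\",\"t_salary\""; next }   # print the new header
    {
        gsub(/"/, "", $3)                                     # strip quotes from salary
        gsub(/"/, "", $5)                                     # strip quotes from incentives
        print $1, $2, "\"" $3+$5 "\""                         # re-quote the summed value
    }'
which should produce the three expected output lines.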
$ cat tst.awk
BEGIN { FS="\"[[:space:]]*,[[:space:]]*\""; OFS="\",\"" }
{ key = $1 FS $2 }
NR==FNR { sal[key] = $NF; next }
key in sal { $3 = (FNR>1 ? $3+sal[key] : "t_salary") "\""; print }
$ awk -f tst.awk file1 file2
"eno","ename","t_salary"
"1","john","52000"
"2","steve","35000"
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Abbreviating the input files to f1 & f2, and breaking out the Swiss Army knife utils (plus a bashism):
head -n 1 f1 | sed 's/sal/t_&/' ; \
grep -h -f <(tail -qn +2 f1 f2 | tr ',' '\t' | sort -k1,2 | \
rev | uniq -d -f1 | rev | \
cut -f 2) \
f1 f2 | \
tr -s ',"' '\t' | datamash -s -g2,3 sum 4 | sed 's/[^\t]*/"&"/g;s/\t/,/g'
Output:
"eno", "ename", "t_salary"
"1","john","52000"
"2","steve","35000"
The main job is fairly simple:
grep searches for only those lines with duplicate (and therefore add-able) fields #1 & #2, and this is piped to...
datamash which does the adding.
The rest of the code is reformatting needed to please the various text utils which all seem to have ugly but minor format inconsistencies.
Those revs are only needed because uniq lacks most of sort's field functions.
The trs are because uniq also lacks a field separator switch, and datamash can't sum quoted numbers. The sed at the end is to undo all that tr-ing.
Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character (sorted descending by the number of occurrences).
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested input (for the 'p' character)
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that works off a 2-character string such as 'gd', not just a single character such as 'p'?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
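For example, counting the two-character string 'gd' in the sample file would look like this (the counts follow from the sample lines: two lines contain 'gd' once, three contain it zero times):
awk -F 'gd' '{ print NF-1 }' samplefile |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
occs  count
1     2
0     3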
#!/usr/bin/env perl
use strict; use warnings;
my $seq = shift @ARGV;
die unless defined $seq;
my %freq;
while ( my $line = <> ) {
last unless $line =~ /\S/;
my $occurances = () = $line =~ /(\Q$seq\E)/g;
$freq{ $occurances } += 1;
}
for my $occurances ( sort { $b <=> $a} keys %freq ) {
print "$occurances:\t$freq{$occurances}\n";
}
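Saved as, say, count_occs.pl (the name is just an assumption) and run against the sample file, it prints the occurrence counts in descending order:
perl count_occs.pl p samplefile
4:      1
3:      2
0:      2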
If you want it short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
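# note: "$infile" below is assumed to already hold the input file name, e.g.:
infile="samplefile"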
declare -a count
while read ; do
cnt=${REPLY//[^p]/} # remove non-p characters
((count[${#cnt}]++)) # use length as array index
done < "$infile"
for idx in ${!count[*]} # iterate over existing indices
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
You can do it in one gawk process (well, with a sort coprocess):
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }
END {
print "occs", "count"
coproc = "sort -rn"
for (n in count)
print n, count[n] |& coproc
close(coproc, "to")
while ((coproc |& getline) > 0)
print
close(coproc)
}
'
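With the sample file passed as the final argument (right after the closing quote), this prints the same table as the other answers:
occs  count
4     1
3     2
0     2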
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'
I have a small file (100 lines) of web requests (Apache standard format); there are multiple requests from clients. I want to end up with ONLY a list of requests (lines) from my file such that each UNIQUE IP appears once, with its latest entry kept.
I have this so far:
/home/$: cat all.txt | awk '{ print $1}' | sort -u | "{print the whole line ??}"
The above gives me the IPs (about 30, which is right); now I need to have the rest of the line (the request) as well.
Use an associative array to keep track of which IPs you've found already:
awk '{
if (!found[$1]) {
print;
found[$1]=1;
}
}' all.txt
This will print the first line for each IP. If you want the last one then:
awk '
{ found[$1] = $0 }
END {
for (ip in found)
print found[ip]
}
' all.txt
I hate that uniq doesn't come with the same options as sort, or that sort cannot do what it says. I reckon this should work[1],
tac access.log | sort -fb -k1V -u
but alas, it doesn't.
Therefore, it seems we're stuck with doing something silly like
cat all.txt | awk '{ print $1}' | sort -u | while read ip
do
tac all.txt | grep "^$ip" -h | head -1
done
Which is really inefficient, but 'works' (I haven't tested it, so modulo typos).
[1] according to the man-page
The following should work:
tac access.log | sort -f -k1,1 -us
This takes the file in reverse order and does a stable sort using the first field, keeping only unique items.