Alternate way to print and sort output - linux

I have a list of names and scores (First__Last__Score).
I'm trying to print out ONLY the last names, ordered by how often each one occurs, in DESCENDING NUMERICAL order.
Here is an example list.
inisha__Ohler__1
Loralee__Hippe__5
Boyd__Leslie__8
Donnette__Cosentino__5
Viva__Bedsole__4
Jann__Banfield__3
Alan__Dionne__2
Sandee__Verdun__2
Raeann__Sweetman__3
Judson__Goers__2
Mandie__Salcedo__8
Yesenia__Bibeau__1
Doug__Petteway__9
Alejandra__Winter__9
Marquitta__Sang__7
Rusty__Rodrigue__2
Rickie__Devin__1
Marie__Elem__3
Faustina__Haltom__4
Dorthea__Ervin__4
Yesenia__Bibeau__5
Doug__Petteway__8
Alejandra__Winter__1
Marquitta__Sang__9
Rusty__Rodrigue__4
Yesenia__Bibeau__2
Doug__Petteway__4
Alejandra__Winter__3
Marquitta__Sang__6
Rusty__Rodrigue__6
Rickie__Devin__7
Marie__Elem__1
Faustina__Haltom__2
Dorthea__Ervin__4
I want to spit the output out using a single pipe ("|") or fewer.
cut -d "_" -f 3 scores | sort -r | uniq -c | sort -nr
This already works, but I am looking for something less expensive.

I believe the least expensive way to achieve the same is to use awk with sort, as follows:
awk -F"__" '{ count[$2]++ } END {for (word in count) print count[word], word}' < scores | sort -nr
and in case you also want leading spaces just like uniq -c provides:
awk -F"__" '{ count[$2]++ } END {for (word in count) print " ", count[word], word}' < scores | sort -nr

GNU awk-specific:
$ gawk -F__ '{ names[$2]++ }
END { PROCINFO["sorted_in"] = "#val_num_desc";
for (n in names) { print n }
}' input.txt
Sang
etc.

Using this perl one-liner:
perl -aF/__/ -ne '$h{$F[1]}++; END{ print"$$_[0]\t$$_[1]\n" for sort {$$b[0]<=>$$a[0]} map {[$h{$_},$_]} keys %h }' <scores
or to show only the name that occurs most often
perl -MList::Util=max -aF/__/ -ne '$h{$F[1]}++; END{ $max=max(values%h); print "$h{$_}\t$_\n" for grep {$h{$_}==$max} keys%h }' <scores

Related

Print columns 1 and 3 with the lowest prices using awk

I have this file that contains the car name, colour and price:
Toyota#Red#4500
Sedan#Blue#2600
Hyunda#Black#5000
Dudge#White#3900
Lymozeen#Black#2400
The output should display the car name and price for cars priced under 5000, sorted by ascending price:
Lymozeen#2400
Sedan#2600
Dudge#3900
Toyota#4500
I have tried the following code:
awk '{if($3 <= 5000)print $1,$3}' myfile
I'd suggest breaking this up: first sort the content of the file on the value of the third column, then select the lines of interest with your condition. Here's how:
awk -F'#' '{ print $NF, $0 }' myfile | sort -n | awk -F'[# ]' '{ if ($1 < 5000) print $2 "#" $4 }'
One in GNU awk:
$ gawk '
BEGIN {
FS=OFS="#" # set field separators
}
$3<5000 { # if less than 5k
a[NR]=$3 # on NR hash price
b[NR]=$1 # on NR hash brand
}
END { # in the end
PROCINFO["sorted_in"]="#val_num_asc" # set for traverse order
for(i in a) # loop in ascending price order
print b[i],a[i] # output
}' file
Output:
Lymozeen#2400
Sedan#2600
Dudge#3900
Toyota#4500
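If GNU awk is not available, a portable sketch does the filtering in plain awk and leaves the ordering to sort, keyed on the second #-delimited field:
awk -F'#' '$3 < 5000 { print $1 FS $3 }' myfile | sort -t'#' -k2,2n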

Filling empty spaces in a CSV file

I have a CSV file where some columns are empty such as
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty columns with the word "empty"?
I have tried using awk (which is a command I am learning to use).
I want to have
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
this left me with
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
this resulted in
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly I tried
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
this left me with
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty
With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because given contiguous commas like ,,, one regexp match would use up the first 2 ,s and so you'd be left with ,empty,,.
The above would change a completely empty line into empty, let us know if that's an issue.
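To see why two passes are needed, trace them on a line of three commas (four empty fields):
$ echo ',,,' | sed -E 's/(^|,)(,|$)/\1empty\2/g'
empty,,empty,
$ echo ',,,' | sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g'
empty,empty,empty,empty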
Here is the awk command:
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN procedure to FS=OFS="," since awk allows chained assignment (which I did not know; thank you, @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
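With the chained assignment, the one-liner shortens to:
awk 'BEGIN { FS = OFS = "," } { for (i = 1; i <= NF; i++) if ($i == "") $i = "empty"; print }' yourfile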
You can also put the script in a nicer-looking form:
#!/usr/bin/awk -f
BEGIN {
FS = ","
OFS = ","
}
{
for (i = 1; i <= NF; ++i)
if ($i == "")
$i = "empty"
print $0
}
and use it as a standalone program (you have to chmod +x it first), even though awk shebang scripts are known to have some portability drawbacks:
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
You are, of course, still able to feed the above script to awk this way:
awk -f the_script_above file1 file2
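If the shebang drawbacks matter to you, one common workaround is a thin POSIX shell wrapper that execs awk; a sketch (the script name is hypothetical):
#!/bin/sh
# fill_empty.sh (hypothetical name): replace empty CSV fields with "empty"
exec awk 'BEGIN { FS = OFS = "," }
{
    for (i = 1; i <= NF; ++i)
        if ($i == "")
            $i = "empty"
    print $0
}' "$@"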

compare occurrence of a set of words

I have a text file with random words in it. I want to find out which words have the maximum occurrence as a pair ('hi,hello' or 'good,bye').
Simple.txt
hi there. hello this a dummy file. hello world. you did good job. bye for now.
I have written this command to get the count for each word (hi, hello, good, bye):
cat simple.txt | tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c | grep -E -i "\<hi\>|\<hello\>|\<good\>|\<bye\>"
This gives me the occurrence of each word with a count (the number of times it occurs) in the file, but how do I refine this to get a direct output such as "hi/hello is the pair with maximum occurrence"?
To make it more interesting, let's consider this test file:
$ cat >file.txt
You say hello. I say good bye. good bye. good bye.
To get a count of all pairs of words:
$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt
3 good,bye
1 say,good
2 bye,good
1 I,say
1 You,say
1 hello,I
1 say,hello
To get the single pair with the highest count, we need to sort:
$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt | sort -nr | head -1
3 good,bye
How it works
-v RS='[[:space:][:punct:]]+'
This tells awk to use any combination of white space or punctuation as a record separator. This means that each word becomes a record.
NR>1{a[last","$0]++}
For every word after the first, increment the count in associative array a for the combination of the previous and current word.
last=$0
Save the current word in the variable last.
END{for (pair in a) print a[pair], pair}
After we have finished reading the input, print out the results for each pair.
sort -nr
Sort the output numerically in reverse (highest number first) order.
head -1
Select the first line (giving us the pair with the highest count).
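As a quick sanity check that each word really becomes its own record (this relies on a regex RS, a gawk/mawk extension; POSIX awk only guarantees a single-character RS):
$ awk -v RS='[[:space:][:punct:]]+' '{ print NR": "$0 }' file.txt | head -4
1: You
2: say
3: hello
4: I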
Multiline version
For those who prefer their code spread out over multiple lines:
awk -v RS='[[:space:][:punct:]]+' '
NR>1 {
a[last","$0]++
}
{
last=$0
}
END {
for (pair in a)
print a[pair], pair
}' file.txt | sort -nr | head -1
Some terse perl:
perl -MList::Util=max,sum0 -slne '
for $word (m/(\w+)/g) {$count{$word}++}
} END {
$pair{$_} = sum0 @count{+split} for ($a, $b);
$max = max values %pair;
print "$max => ", {reverse %pair}->{$max};
' -- -a="hi hello" -b="good bye" simple.txt
3 => hi hello

How to Compare CSV Column using awk?

I receive a CSV like this:
column$1,column$2,column$3
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and results, but column$2 needs to show the sum of the T values separately from the P, A, C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original CSV:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the csv:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to sum and display the sum by column$1 and $2, but I don't know how to group the different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream-oriented: it processes input and outputs what you change; it does not make in-place file changes.
You just need to add a corresponding print:
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than the "MATCHED" marker, you need to add the other fields to the print,
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0, as a comment above mentions.
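Putting those fragments together into something runnable, a minimal sketch (assuming the comma-delimited input shown in the question and pipe-delimited output):
awk -F, '$2 == "T" { print $1 "|" $2 "|" $3 "|" "MATCHED" }' input.csv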
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
$2=="T" { # match all records that are "T"
t[$1]+=$3 # add the value for this record to an array of totals
n[$1] # record this name in our authoritative name list
}
$2!="T" { # match all records that are NOT "T"
s[$1]+=$3 # add the value for this record to an array of sums
n[$1] # record this name too
}
END { # Now that we've collected data, analyse the results
for (i in n) { # step through our authoritative list of names
print i,"PCA",s[i]
print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
}
}
' inputfile
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort using:
awk ... | sort
which by default sorts on the whole line, beginning with the first field.
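To be explicit about the delimiter and key, a sketch (with ... standing for the script above):
awk -F, -vOFS=, '...' inputfile | sort -t, -k1,1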

awk - how to "re-awk" the output?

I need to take a file and count the number of occurrences of $7 - I've done this with awk (because I need to run this through more awk)
What I want to do is combine this into one script - so far I have
#! /usr/bin/awk -f
# get the filename, count the number of occurs
# <no occurs> <filename>
{ print $7 | "grep /datasheets/ | sort | uniq -c"}
How do I grab that output and run it through more awk commands, in the same file?
Eventually, I need to be able to run
./process.awk <filename>
so it can be a drop-in replacement for a previous setup which would take too much time/effort to change.
If you want to forward the output of an awk script to another awk script, just pipe it to awk:
awk 'foobar...' file|awk 'new awkcmd'
Your current awk|grep|sort|uniq could also be done with awk itself, saving three processes. You want the repeated counts, don't you?
awk '$7 ~ /datasheets/ {a[$7]++} END {for (x in a) print x": "a[x]}' file
should work.
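If you also want the counts ordered the way uniq -c | sort -nr would give them, one extra pipe suffices; a sketch:
awk '$7 ~ /datasheets/ {a[$7]++} END {for (x in a) print a[x], x}' file | sort -nr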
If you use Gawk, you could use the 2-way communications to push the data to the external command then read it back:
#!/usr/bin/gawk -f
BEGIN {
COMMAND = "sort | uniq -c"
SEEN = 0
PROCINFO[ COMMAND, "pty" ] = 1   # use a pseudo-tty so the command's output is not fully buffered
}
/datasheets/ {
print $7 |& COMMAND
SEEN = 1
}
END {
# Don't read sort output if no input was provided
if ( SEEN == 1 ) {
# Tell sort no more input data is available
close( COMMAND, "to" )
# Read the sorted data
while( ( COMMAND |& getline SORTED ) > 0 ) {
# Do whatever you want on the sorted data
print SORTED
}
close( COMMAND, "from" )
}
}
See https://www.gnu.org/software/gawk/manual/gawk.html#Two_002dway-I_002fO
