I have a CSV file where some columns are empty, such as:
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty columns with the word "empty"?
I have tried using awk (which is a command I am learning to use).
I want to have
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track:
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
This left me with:
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried:
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
This resulted in:
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly, I tried:
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
This left me with:
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty
With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because, given contiguous commas like ,,, one regexp match would use up the first two commas, and so after a single pass you'd be left with ,empty,,.
The above would change a completely empty line into empty; let us know if that's an issue.
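To see why the second pass is needed, here is the result of a single pass on the sample input; the middle empty field of each all-empty run survives because the commas on both sides of it were consumed by the neighbouring matches:
$ sed -E 's/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,,empty,0
oski16,Unknow,empty,,empty,0
oski17,Unknow,empty,,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56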
This is the awk command:
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN procedure to FS=OFS="," as awk allows chained assignment (which I did not know, thank you @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
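With the chained assignment applied, the one-liner shrinks to:
awk 'BEGIN { FS=OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile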
You can also lay the script out in a nicer-looking form:
#!/usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ","
}
{
    for (i = 1; i <= NF; ++i)
        if ($i == "")
            $i = "empty"
    print $0
}
and use it as a standalone program (you have to chmod +x it), though this is known to have some drawbacks (consult the comments on this question as well as this answer):
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
Of course, you can still feed the above script to awk this way:
awk -f the_script_above file1 file2
I have a text file with random words in it. I want to find out which words have the maximum occurrence as a pair ('hi,hello' or 'good,bye').
Simple.txt
hi there. hello this a dummy file. hello world. you did good job. bye for now.
I have written this command to get the count for each word (hi, hello, good, bye).
cat simple.txt| tr -cs '[:alnum:]' '[\n*]' | sort | uniq -c|grep -E -i "\<hi\>|\<hello\>|\<good\>|\<bye\>"
This gives me the occurrence of each word with a count (number of times it occurs) in the file, but how do I refine this to get a direct output such as "hi/hello is the pair with maximum occurrence"?
To make it more interesting, let's consider this test file:
$ cat >file.txt
You say hello. I say good bye. good bye. good bye.
To get a count of all pairs of words:
$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt
3 good,bye
1 say,good
2 bye,good
1 I,say
1 You,say
1 hello,I
1 say,hello
To get the single pair with the highest count, we need to sort:
$ awk -v RS='[[:space:][:punct:]]+' 'NR>1{a[last","$0]++} {last=$0} END{for (pair in a) print a[pair], pair}' file.txt | sort -nr | head -1
3 good,bye
How it works
-v RS='[[:space:][:punct:]]+'
This tells awk to use any combination of white space or punctuation as a record separator. This means that each word becomes a record.
NR>1{a[last","$0]++}
For every word after the first, increment the count in associative array a for the combination of the previous and current word.
last=$0
Save the current word in the variable last.
END{for (pair in a) print a[pair], pair}
After we have finished reading the input, print out the results for each pair.
sort -nr
Sort the output numerically in reverse (highest number first) order.
head -1
Select the first line (giving us the pair with the highest count).
Multiline version
For those who prefer their code spread out over multiple lines:
awk -v RS='[[:space:][:punct:]]+' '
NR>1 {
    a[last","$0]++
}
{
    last=$0
}
END {
    for (pair in a)
        print a[pair], pair
}' file.txt | sort -nr | head -1
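If you only care about the two specific pairs named in the question, a minimal variation of the same script (a sketch, assuming the pairs are adjacent words and that case-insensitive matching is wanted) can look up just those two keys:
awk -v RS='[[:space:][:punct:]]+' '
NR>1 {
    a[tolower(last)","tolower($0)]++
}
{
    last=$0
}
END {
    hi = a["hi,hello"] + 0
    gb = a["good,bye"] + 0
    if (hi >= gb)
        print "hi/hello is the pair with maximum occurrence (" hi ")"
    else
        print "good/bye is the pair with maximum occurrence (" gb ")"
}' file.txt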
Some terse perl:
perl -MList::Util=max,sum0 -slne '
for $word (m/(\w+)/g) {$count{$word}++}
} END {
$pair{$_} = sum0 @count{+split} for ($a, $b);
$max = max values %pair;
print "$max => ", {reverse %pair}->{$max};
' -- -a="hi hello" -b="good bye" simple.txt
3 => hi hello
I receive a CSV like this:
column$1,column$2,column$3
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and results, but column$2 needs to show the sum of the T values separately from the P, A, and C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original CSV:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the CSV:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to count and display occurrences by column$1 and $2, but I don't know how to group the different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream-oriented. It processes input and outputs what you change. It does not do in-place file changes.
You just need to add a corresponding print:
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than "MATCHED", you need to add the other fields to the print,
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0, as a comment above mentions.
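Putting those pieces together, a minimal sketch in the same style (tagging the T rows while passing all rows through) might be:
awk -F, '{ if ($2 == "T") print $1 "|" $2 "|" $3 "|" "MATCHED"; else print $1 "|" $2 "|" $3 }' input.csv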
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
$2=="T" {        # match all records that are "T"
    t[$1]+=$3    # add the value for this record to an array of totals
    n[$1]        # record this name in our authoritative name list
}
$2!="T" {        # match all records that are NOT "T"
    s[$1]+=$3    # add the value for this record to an array of sums
    n[$1]        # record this name too
}
END {            # Now that we've collected data, analyse the results
    for (i in n) {    # step through our authoritative list of names
        print i,"PCA",s[i]
        print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
    }
}
' inputfile
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort using:
awk ... | sort
which defaults to -k1.
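For reference, on the sample input from the question, the script piped through sort produces exactly the rows the question asks for (minus the header):
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT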
I need to take a file and count the number of occurrences of $7. I've done this with awk (because I need to run this through more awk).
What I want to do is combine this into one script - so far I have
#! /usr/bin/awk -f
# get the filename, count the number of occurs
# <no occurs> <filename>
{ print $7 | "grep /datasheets/ | sort | uniq -c"}
How do I grab that output and run it through more awk commands, in the same file?
Eventually, I need to be able to run
./process.awk <filename>
so it can be a drop-in replacement for a previous setup which would take too much time/effort to change.
If you want to forward the output of an awk script to another awk script, just pipe it to awk:
awk 'foobar...' file|awk 'new awkcmd'
And your current awk|grep|sort|uniq pipeline could be done within awk itself, saving you 3 processes. You want to get the repeated counts, don't you?
awk '$7~/datasheets/{a[$7]++} END{for(x in a)print x": "a[x]}' file
should work.
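As an example of the pipe-to-awk approach, a second awk could then pick out the most frequent entry (a sketch; swap in whatever downstream processing you actually need):
awk '$7 ~ /datasheets/ {a[$7]++} END {for (x in a) print a[x], x}' file |
awk '$1 > max {max = $1; best = $2} END {print max, best}'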
If you use Gawk, you can use its two-way communication to push the data to an external command and then read the results back:
#!/usr/bin/gawk -f
BEGIN {
    COMMAND = "sort | uniq -c"
    SEEN = 0
    # Use a pseudo-tty so the coprocess output is not block-buffered
    PROCINFO[ COMMAND, "pty" ] = 1
}
$7 ~ /datasheets/ {
    print $7 |& COMMAND
    SEEN = 1
}
END {
    # Don't read sort output if no input was provided
    if ( SEEN == 1 ) {
        # Tell sort no more input data is available
        close( COMMAND, "to" )
        # Read the sorted data
        while( ( COMMAND |& getline SORTED ) > 0 ) {
            # Do whatever you want on the sorted data
            print SORTED
        }
        close( COMMAND, "from" )
    }
}
See https://www.gnu.org/software/gawk/manual/gawk.html#Two_002dway-I_002fO
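Saved as process.awk and made executable (chmod +x process.awk), this can be invoked just like the drop-in replacement the question asks for:
./process.awk filename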