I have a CSV file in which some fields are empty, such as:
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty fields with the word "empty"?
I have tried using awk (a command I am still learning to use).
I want to end up with:
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
this left me with
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
this resulted in
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly I tried
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
this left me with
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty
With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because given contiguous commas like ,,, one regexp match would use up the first 2 ,s and so you'd be left with ,empty,,.
The above would change a completely empty line into empty; let us know if that's an issue.
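As an alternative to writing the substitution twice, a sed loop can repeat a single (non-global) substitution until nothing is left to replace. This is only a sketch: the function name fill_empty and the label name again are made up for illustration, and it assumes the same -E support as above.

```shell
# fill_empty: hypothetical helper; reads CSV on stdin, writes the result to stdout.
fill_empty() {
    # ':again' defines a label; 't again' jumps back to it whenever the
    # substitution on the current line succeeded, so runs of commas such as
    # ,,, are filled one empty field per iteration.
    sed -E -e ':again' \
           -e 's/(^|,)(,|$)/\1empty\2/' \
           -e 't again'
}

printf 'oski15,Unknow,,,,0\n' | fill_empty
# prints: oski15,Unknow,empty,empty,empty,0
```

Like the two-pass version, this turns a completely empty line into empty.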
This is the awk command
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN block to FS=OFS="," because awk allows chained assignment (which I did not know; thank you, @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
You can also put the script in a nicer-looking form:
#!/usr/bin/awk -f
BEGIN {
    FS = ","
    OFS = ","
}
{
    for (i = 1; i <= NF; ++i)
        if ($i == "")
            $i = "empty"
    print $0
}
and use it as a standalone program (you have to chmod +x it), even if this is known to have some drawbacks (consult the comments to this question as well as this answer):
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
Clearly you are still able to feed the above script to awk this way:
awk -f the_script_above file1 file2
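A side note on why the first attempt in the question left the file unchanged: with a whitespace field separator and no whitespace in the data, awk sees each line as one single field, so $2 is empty and the condition never fires. A quick way to check this is to print NF for each separator. The helper name count_fields below is hypothetical, used only for this sketch.

```shell
# count_fields SEP: hypothetical helper; prints the number of fields awk sees
# on each input line when SEP is used as the field separator.
count_fields() {
    awk -F "$1" '{ print NF }'
}

printf 'oski15,Unknow,,,,0\n' | count_fields ','            # prints 6
printf 'oski15,Unknow,,,,0\n' | count_fields '[[:space:]]'  # prints 1
```

Empty fields between commas still count as fields, which is why the loop over 1..NF in the answer above can reach and fill them.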
I have a bash script that reads a text file and takes two parameters (the numbers of two lines), then swaps those two lines in the text. Here is the code:
#!/bin/bash
awk -v var="$1" -v var1="$2" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print s1 s
next
}1' Ham > newHam_changed.txt
It works fine for any two lines that are not consecutive. For consecutive lines (for example lines 5 and 6) it swaps them but creates a blank line between them. How can I fix that?
I think your actual script is not what you posted in the question. I think the line with all the prints contains:
print s1 "\n" s
The problem is that when the lines are consecutive, s1 will be empty (the for loop is skipped), but it will still print a newline before s, producing a blank line.
So you need to make that newline conditional.
awk -v var="4" -v var1="6" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print (s1 ? s1 "\n" : "") s
next
}1' Ham > newHam_changed.txt
Using getline always makes awk scripts a bit complicated. It is better to avoid getline and rely on awk's pattern { action } syntax, which gives perfectly readable scripts. In any other language you would just loop and fetch the next line, but in awk I think it is best to make good use of this feature.
awk -v var="$1" -v var1="$2" '
NR==var {s=$0; collect=1; next;}
NR==var1 {collect=0; print; printf "%s", inbetween; print s; next;}  # next: do not print this line again
collect {inbetween=inbetween""$0"\n"; next;}
1' Ham
Here I capture the first line in s when I find it and set the collect flag. This triggers the collect block on the following records, which collects all lines in between. When the second line is found, collect is set back to zero and the script prints first the current line, then the in-between lines, and then s. If the lines are consecutive, inbetween is empty and printf then does nothing.
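For reference, the same pattern-action idea can be wrapped in a small shell function that reads from standard input. This is a sketch under assumptions: the name swap_lines and the variable names are invented here, the first line number must be smaller than the second, and both lines must exist in the input. The second rule ends with next so that the line is not printed again by the final 1, and printf "%s" is used so that a % in the collected text is not treated as a format specifier.

```shell
# swap_lines FIRST SECOND: hypothetical helper; swaps two lines of stdin.
# Assumes FIRST < SECOND and that both line numbers exist in the input.
swap_lines() {
    awk -v first="$1" -v second="$2" '
        NR == first  { saved = $0; collect = 1; next }   # hold back the first line
        NR == second { collect = 0; print                # second line goes out first
                       printf "%s", between              # then everything in between
                       print saved; next }               # then the held-back line
        collect      { between = between $0 "\n"; next } # gather in-between lines
        1'                                               # all other lines unchanged
}

printf '1\n2\n3\n4\n5\n' | swap_lines 2 5
# prints 1, 5, 3, 4, 2 (one per line)
```

With consecutive line numbers, between stays empty, so no blank line appears.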
Too complex for my taste, here is something quite simple that achieves the same task:
#!/bin/bash
ORIGFILE='original.txt' # original text file
PROCFILE='processed.txt' # copy of the original file to be processed
CHGL1=$(sed "$1q;d" "$ORIGFILE") # get original $1 line
CHGL2=$(sed "$2q;d" "$ORIGFILE") # get original $2 line
cp "$ORIGFILE" "$PROCFILE" # work on a copy
sed -i "$2s/^.*/$CHGL1/" "$PROCFILE" # replace (breaks if the line contains / or &)
sed -i "$1s/^.*/$CHGL2/" "$PROCFILE" # replace
More code doesn't mean more useful; keep it simple. This code does not use a for loop and instead goes directly to the specific lines.
EDIT:
A simple way on one line to do this task:
printf '%s\n' 14m26 26-m14- w q | ed -s file
Found in this answer.
I have a simple awk command that converts a date from MM/DD/YYYY to YYYY/MM/DD. However, the file I'm using has \r\n at the end of the lines, and sometimes the date is at the end of the line.
awk '
BEGIN { FS = OFS = "|" }
{
split($27, date, /\//)
$27 = date[3] "/" date[1] "/" date[2]
print $0
}
' file.txt
In this case, if the date is MM/DD/YYYY\r\n then I end up with this in the output:
YYYY
/MM/DD
What is the best way to get around this? Keep in mind, sometimes the input is simply \r\n in which case the output SHOULD be // but instead ends up as
/
/
Given that the \r isn't always at the end of field $27, the simplest approach is to remove the \r from the entire line.
With GNU Awk or Mawk (one of which is typically the default awk on Linux platforms), you can simply define your input record separator, RS, accordingly:
awk -v RS='\r\n' ...
Or, if you want \r\n-terminated output lines too, set the output record separator, ORS, to the same value:
awk 'BEGIN { RS=ORS="\r\n"; ...
Optional reading: an aside for BSD/macOS Awk users:
BSD/macOS awk doesn't support multi-character RS values (in line with the POSIX Awk spec: "If RS contains more than one character, the results are unspecified").
Therefore, a sub call inside the Awk script is necessary to trim the \r instance from the end of each input line:
awk '{ sub("\r$", ""); ...
To also output \r\n-terminated lines, option -v ORS='\r\n' (or ORS="\r\n" inside the script's BEGIN block) will work fine, as with GNU Awk and Mawk.
If you're on a system where \n by itself is the newline, you should remove the \r from the record. You could do it like:
$ awk '{sub(/\r/,"",$NF); ...}'
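Putting the pieces together, here is a minimal sketch of the date conversion with the carriage return stripped from the whole record first, which should also work with awks that lack multi-character RS. The function name convert_date and the col variable are assumptions made for this example; the question itself uses field 27.

```shell
# convert_date COL: hypothetical helper; reads |-separated records on stdin
# and rewrites field COL from MM/DD/YYYY to YYYY/MM/DD.
convert_date() {
    awk -v col="$1" '
        BEGIN { FS = OFS = "|" }
        {
            sub(/\r$/, "")                 # drop a trailing CR; awk re-splits the fields
            split($col, date, "/")
            $col = date[3] "/" date[1] "/" date[2]
            print
        }'
}

printf 'a|12/31/2020\r\n' | convert_date 2
# prints: a|2020/12/31
```

An empty date field comes out as //, as the question requires. Output lines end in \n only; setting ORS="\r\n" in the BEGIN block would restore the original line endings.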
This is the content of file.txt:
hello bro
my nam§
is Jhon Does
The file could also contain non-printable characters (for example \x00 or \x02) and, as you can see, the lines do not all have the same length.
I want to read it 5 characters at a time, without counting line breaks. I thought of something like this using awk:
awk -v RS='' '{
    s=s $0;
}END{
    n=length(s);
    for(x=1; x<=n; x=x+5){   # x<=n so a final one-character chunk is not dropped
        # Here I will put some calcs and stuff
        i++;
        print "line " i ": #" substr(s,x,5) "#"
    }
}' file.txt
The output is the following:
line 1: #hello#
line 2: # bro
#
line 3: #my na#
line 4: #m§
is#
line 5: # Jhon#
line 6: # Does#
It works perfectly, but the input file will be very large, so the performance is important.
In short, I'm looking for something like this:
awk -v RS='.{5}' '{ # Here I will put some calcs and stuff }'
But it doesn't work.
Another alternative that works ok:
xxd -ps mifile.txt | tr -d '\n' | fold -w 10 | awk '{print "23" $0 "230a"}' | xxd -ps -r
Do you have any idea or alternative? Thank you.
I'm not sure I understand what you want but this outputs the same as the script in your question that you say works perfectly so hopefully this is it:
$ awk -v RS='.{5}' 'RT!=""{ print "line", NR ": #" RT "#" }' file
line 1: #hello#
line 2: # bro
#
line 3: #my na#
line 4: #m§
is#
line 5: # Jhon#
line 6: # Does#
The above uses GNU awk for multi-char RS and RT.
If you are okay with Python, you may try this:
# Python 3; reads the file 5 characters at a time
with open('filename', 'r') as f:
    w = f.read(5)
    while w != '':
        print(w)
        w = f.read(5)
You can use perl and binmode assuming you are using normal characters.
use strict;
use warnings;
open my $fh, '<', 'test';    # open the file
binmode $fh;                 # set to binary mode
$/ = \5;                     # read a record as 5 bytes
while (<$fh>) {              # read records
    print "$_#";             # do whatever calculations you want here
}
For extended character sets you can use UTF8 and read every 5 characters instead of bytes.
use strict;
use warnings;
open my $fh, '<:utf8', 'test';             # open file in utf8
binmode(STDOUT, ":utf8");                  # set stdout to utf8 as well
while ((read($fh, my $data, 5)) != 0) {    # read 5 characters into variable data
    print "$data#";                        # do whatever you want with data here
}
So you asked How to read a file each n characters instead of each line using awk.
Solution:
If you have a modern gawk implementation, use FPAT. From the gawk manual:
Normally, when using FS, gawk defines the fields as the parts of the
record that occur in between each field separator. In other words, FS
defines what a field is not, instead of what a field is. However,
there are times when you really want to define the fields by what they
are, and not by what they are not.
Code:
gawk 'BEGIN { FS = "\n"; RS = ""; FPAT = ".{,5}" }
{
    for (i = 1; i <= NF; i++)
        printf("$%d = <%s>\n", i, $i)
}' file
Check the demo
I am trying to sum the traffic of different ports in the log files from "IPCop", so I wrote a command for my shell, but I think it can be optimized.
First, a line from my log file:
01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0
With the following command I grep the lines that contain port 1433 and sum all their lengths:
grep 1433 log.dat|awk '{for(i=1;i<=10;i++)if($i ~ /LEN/)print $i};'|sed 's/LEN=//g;'|awk '{sum+=$1}END{print sum}'
I need the for loop because the LEN column is not always in the same position.
Any suggestions for optimizing this command?
Regards
Rene
Since I don't have the rep to add a comment to Noufal Ibrahim's answer, here is a more natural solution using Perl.
perl -ne '$sum += $1 if /LEN=(\d+)/; END { print $sum; }' log.dat
@Noufal, you can make perl do all the hard work ;).
If it really needs optimization, as in it runs unbearably slowly, you should probably rewrite it in a more general-purpose language. Even AWK could do, but I'd suggest something closer to Perl or Java for a long-running extractor.
One change you could make is, rather than using an unnecessary SED and second AWK call, move the END into the first AWK call, and use split() to extract the number from LEN=num; and add it to the accumulator. Something like split($i, x, "="); sum += x[2].
The main problem is you can't write awk '/LEN=(...)/ { sum += var matching the ... }'.
Any time you have grep/sed/awk combinations in a pipeline, you can simplify into a single awk or perl command. Here's an awk solution:
gawk -v dpt=1433 '
    $0 ~ dpt {
        for (i=1; i<=NF; i++) {
            if ($i ~ /^LEN=[[:digit:]]+/) {
                split($i, ary, /=/)
                sum += ary[2]
                next
            }
        }
    }
    END {print sum}
' log.dat
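One caveat with matching the port number against the whole line is that it also matches lines where 1433 appears elsewhere, for example as the source port. A possible variant, sketched here with plain POSIX awk, compares the DPT= field exactly and takes the first LEN= field on the line. The function name sum_len is made up for this example, and the assumption is that each log line has the KEY=value layout shown in the question.

```shell
# sum_len PORT: hypothetical helper; reads log lines on stdin and prints the
# sum of the first LEN= value of every line whose DPT= value equals PORT.
sum_len() {
    awk -v port="$1" '
        {
            dpt = ""; len = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^DPT=/)
                    dpt = substr($i, 5)                 # text after "DPT="
                else if (len == "" && $i ~ /^LEN=[0-9]+$/)
                    len = substr($i, 5)                 # first LEN= on the line
            }
            if (dpt == port)
                sum += len
        }
        END { print sum + 0 }'                          # print 0 if nothing matched
}

printf 'kernel INPUT IN=eth1 LEN=40 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384\n' | sum_len 1433
# prints: 40
```

Usage with the file from the question would be sum_len 1433 < log.dat.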
If you are using gawk, you can use \< to avoid the need for the for-loop, the match(-) function to find the substring "\<LEN=.*\>", i.e., projecting out the field you want, and substr to project out the argument of LEN. You can then use just the single awk invocation to do everything.
Postscript
The regexp I gave above doesn't work, because the = character is not part of a word. The following awk script does work:
/1433/ { f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v; }
END { print "sum=" s; }
If each entry is on a single line, you can use perl to extract the LEN numbers and sum them.
perl -e '$f = 0; while (<>) { $f += $1 if /LEN=([0-9]+)/; } print "$f\n";' input.log
I apologise for the bad Perl. I'm not a Perl guy at all.