Filling empty spaces in a CSV file - Linux
I have a CSV file where some columns are empty, such as:
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
How do I replace all the empty columns with the word "empty"?
I have tried using awk (which is a command I am learning to use).
I want to have:
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
I tried to replace just the 3rd column to see if I was on the right track:
awk -F '[[:space:]]' '$2 && !$3{$3="empty"}1' file
This left me with:
oski14,safe,0,13,53,4
oski15,Unknow,,,,0
oski16,Unknow,,,,0
oski17,Unknow,,,,0
oski18,unsafe,0.55,,1,2
oski19,unsafe,0.12,4,,56
I have also tried:
nawk -F, '{$3="\ "?"empty":$3;print}' OFS="," file
This resulted in:
oski14,safe,empty,13,53,4
oski15,Unknow,empty,,,0
oski16,Unknow,empty,,,0
oski17,Unknow,empty,,,0
oski18,unsafe,empty,,1,2
oski19,unsafe,empty,4,,56
Lastly, I tried:
awk '{if (!$3) {print $1,$2,"empty"} else {print $1,$2,$3}}' file
This left me with:
oski14,safe,empty,13,53,4 empty
oski15,Unknow,empty,,,0 empty
oski16,Unknow,empty,,,0 empty
oski17,Unknow,empty,,,0 empty
oski18,unsafe,empty,,1,2 empty
oski19,unsafe,empty,4,,56 empty
With a sed that supports EREs with a -E argument (e.g. GNU sed or OSX/BSD sed):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g; s/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,empty,empty,0
oski16,Unknow,empty,empty,empty,0
oski17,Unknow,empty,empty,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56
You need to do the substitution twice because, given contiguous commas like ,,,, one regexp match would use up the first two commas, so a single pass would leave you with ,empty,,.
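To see why, here is what a single pass would produce on the sample above (the rows with consecutive empty fields still have one left):
$ sed -E 's/(^|,)(,|$)/\1empty\2/g' file
oski14,safe,0,13,53,4
oski15,Unknow,empty,,empty,0
oski16,Unknow,empty,,empty,0
oski17,Unknow,empty,,empty,0
oski18,unsafe,0.55,empty,1,2
oski19,unsafe,0.12,4,empty,56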
The full command above would also change a completely empty line into empty; let us know if that's an issue.
Here is the awk command:
awk 'BEGIN { FS=","; OFS="," }; { for (i=1;i<=NF;i++) { if ($i == "") { $i = "empty" }}; print $0 }' yourfile
As suggested in the comments, you can shorten the BEGIN procedure to FS=OFS="," since awk allows chained assignment (which I did not know; thank you @EdMorton).
I've set FS="," in the BEGIN procedure instead of using the -F, option just for uniformity with setting OFS=",".
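For reference, the same logic can be written with command-line options instead of a BEGIN block; this is purely a stylistic variant of the command above:
awk -F, -v OFS=, '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "empty" } 1' yourfile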
Of course, you can put the script in a nicer-looking form:
#!/usr/bin/awk -f
BEGIN {
FS = ","
OFS = ","
}
{
for (i = 1; i <= NF; ++i)
if ($i == "")
$i = "empty"
print $0
}
and use it as a standalone program (you have to chmod +x it), even if this is known to have some drawbacks (consult the comments to this question as well as this answer):
./the_script_above your_file
or
down_the_pipe | ./the_script_above | further_processing
Of course, you are still able to feed the above script to awk this way:
awk -f the_script_above file1 file2
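If the portability drawbacks of the awk shebang are a concern, a thin POSIX sh wrapper sidesteps them entirely. Here is a minimal sketch (the file name fill_empty.sh is just an example, not from the original answer):
#!/bin/sh
# fill_empty.sh - run the same awk program on any files given as arguments, or on stdin
exec awk 'BEGIN { FS = OFS = "," } { for (i = 1; i <= NF; ++i) if ($i == "") $i = "empty" } 1' "$@"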
Related
Match lines based on patterns and reformat file - Bash/Linux
I am looking preferably for a bash/Linux method for the problem below. I have a text file (input.txt) that looks like so (and many many more lines):

TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW

I want to produce a file that looks like:

CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)

I thought that I could do something along the lines of this:

cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??

But I only know how to grep one pattern at a time. Is there a way to find all the matching lines at once and output them in this format? Thank you! (Happy Easter/long weekend to all!)
With your shown samples, please try following.

awk '
FNR==NR{
  arr[$2]=(arr[$2]?arr[$2]",":"")$1
  next
}
($2 in arr){
  print $2"("arr[$2]")"
  delete arr[$2]
}
' Input_file Input_file

2nd solution: Within a single read of Input_file try following.

awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file

Explanation (1st solution): Adding detailed explanation for 1st solution here.

awk '                                 ##Starting awk program from here.
FNR==NR{                              ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
  arr[$2]=(arr[$2]?arr[$2]",":"")$1   ##Creating array with index of 2nd field and keep adding its value with comma here.
  next                                ##next will skip all further statements from here.
}
($2 in arr){                          ##Checking condition if 2nd field is present in arr then do following.
  print $2"("arr[$2]")"               ##Printing 2nd field ( arr[$2] ) here.
  delete arr[$2]                      ##Deleting arr value with 2nd field index here.
}
' Input_file Input_file               ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, then just run sort -k2,2 on your input first), using 1 pass, only storing one token at a time in memory, and producing the output in the same order of $2s as the input:

$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
    printf "%s%s(", ORS, $2
    ORS = ")\n"
    sep = ""
    prev = $2
}
{
    printf "%s%s", sep, $1
    sep = ","
}
END { print "" }

$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
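If the input isn't already grouped, the sort mentioned above can simply be piped in front (the groups will then come out in sorted rather than original order):

$ sort -k2,2 input.txt | awk -f tst.awk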
This might work for you (GNU sed):

sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file

Append each manipulated line to the hold space. Before moving on to the next line, accumulate like keys into a single line. Delete every line except the last. Replace the last line by the contents of the hold space. Remove the first character (a newline artefact introduced by the H command) and print the result.

N.B. The final solution is unsorted and in the original order.
How to modify a text file so that every line has the same number of columns?
I've got a text file which includes several lines. Every line has words which are separated with a comma. The number of words per line is not the same. I would like, with the help of the awk command, to make every line have the same number of columns. For example, if the text file is as follows:

word1, text, help, test
number, begin
last, line, line

I would like the output to be the following, in which every line has the same number of columns, padded with an extra null word:

word1, text, help, test
number, begin, null, null
last, line, line, null

I tried the following code:

awk '{print $0,Null}' file.txt
$ awk 'BEGIN {OFS=FS=", "} NR==FNR {max=max<NF?NF:max; next} {for(i=NF+1;i<=max;i++) $i="null"}1' file{,}

First scan to find the max number of columns, then fill the missing entries in the second round. If the first line contains all the columns (a header, perhaps), you can change it to:

$ awk 'BEGIN {OFS=FS=", "} NR==1 {max=NF} {for(i=NF+1;i<=max;i++) $i="null"}1' file

file{,} is expanded by bash to file file, a neat trick to avoid repeating the filename (and it eliminates possible typos).
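You can verify what the brace expansion produces with echo:

$ echo file{,}
file file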
Passing twice through the input file, using getline on the first pass:

awk '
BEGIN {
    OFS=FS=", "
    while (getline < ARGV[1]) {
        if (NF > max) { max = NF }
    }
    close(ARGV[1])
}
{
    for (i=NF+1; i<=max; i++) $i="null"
}
1
' file.txt

Alternatively, keeping it simple by running awk twice...

#!/bin/bash
infile="file.txt"
maxfields=$(awk 'BEGIN {FS=", "} {if (NF > max) {max = NF}} END {print max}' "$infile")
awk -v max="$maxfields" 'BEGIN {OFS=FS=", "} {for(i=NF+1;i<=max;i++) $i="null"} 1' "$infile"
Use these Perl one-liners. The first one goes through the file and finds the max number of fields to use. The second one goes through the file and prints the input fields, padded at the end by the "null" strings:

export num_fields=$( perl -F'/,\s+/' -lane 'print scalar @F;' in_file | sort -nr | head -n1 )
perl -F'/,\s+/' -lane 'print join ", ", map { defined $F[$_] ? $F[$_] : "null" } 0..( $ENV{num_fields} - 1 );' in_file > out_file

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/,\s+/' : Split into @F on comma with whitespace.

SEE ALSO: perldoc perlrun: how to execute the Perl interpreter: command line switches
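For instance, assuming in_file holds the question's sample, the first one-liner prints the per-line field counts, and sort -nr | head -n1 then picks the maximum (4):

$ perl -F'/,\s+/' -lane 'print scalar @F;' in_file
4
2
3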
How to remove double quotes in a specific column by using sub() in AWK
My sample data is:

cat > myfile
"a12","b112122","c12,d12"
a13,887988,c13,d13
a14,b14121,c79,d13

When I try to remove " from column 2 with:

awk -F, 'BEGIN { OFS = FS } $2 ~ /"/ { sub(/"/, "", $2) }1' myfile
"a12",b112122","c12,d12"
a13,887988,c13,d13
a14,b14121,c79,d13

it removes only one quote; instead of b112122 I am getting b112122". How do I remove all " in the 2nd column?
From the documentation:

Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp. [...] Return the number of substitutions made (zero or one).

It is quite clear that the function sub makes at most one single replacement and does not replace all occurrences. Instead, use gsub:

Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. The 'g' in gsub() stands for "global," which means replace everywhere.

So you can add a 'g' to your line and it works fine:

awk -F, 'BEGIN { OFS = FS } $2 ~ /"/ { gsub(/"/, "", $2) }1' myfile
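A quick demonstration of the difference: sub removes only the first quote, gsub removes them all.

$ echo '"a""b"' | awk '{sub(/"/, ""); print}'
a""b"
$ echo '"a""b"' | awk '{gsub(/"/, ""); print}'
ab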
When you are dealing with a CSV file without using FPAT, it will break sooner or later. Here is a GNU awk solution that does the job:

awk -v OFS="," -v FPAT="([^,]+)|(\"[^\"]+\")" '{gsub(/"/,"",$2)}1' file
"a12",b112122,"c12,d12"
a13,887988,c13,d13
a14,b14121,c79,d13

It will work fine on any column, number 3 as well. Example of removing " from column 3 while at the same time changing the separator to |:

awk -v OFS="|" -v FPAT="([^,]+)|(\"[^\"]+\")" '{gsub(/"/,"",$3);$1=$1}1' file
"a12"|"b112122"|c12,d12
a13|887988|c13|d13
a14|b14121|c79|d13
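To see why a plain -F, breaks on quoted fields, watch how the first record of the sample splits; the quoted "c12,d12" is torn in half:

$ echo '"a12","b112122","c12,d12"' | awk -F, '{print NF, $3}'
4 "c12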
Search for a string and print the line in a different order using Linux
I need to write a shell script that does the following, which I am showing below with an example. Suppose I have a file cars.txt which depicts a table like this:

Person|Car|Country

The '|' is the separator. So the first two lines go like this:

Michael|Ford|USA
Rahul|Maruti|India

I have to write a shell script which will find the lines in the cars.txt file that have the country as USA and will print them like:

USA|Ford|Michael

I am not very adept with Unix so I need some help here.
Will this do?

while read -r i; do
    NAME="$(cut -d'|' -f1 <<<"$i")"
    MAKE="$(cut -d'|' -f2 <<<"$i")"
    COUNTRY="$(cut -d'|' -f3 <<<"$i")"
    echo "$COUNTRY|$MAKE|$NAME"
done < <(grep "USA$" cars.txt)
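A leaner variant of the same idea lets read split the fields itself, avoiding the three cut subshells per line; a minimal sketch:

while IFS='|' read -r name make country; do
    echo "$country|$make|$name"
done < <(grep 'USA$' cars.txt)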
Updated To Locate USA Not 1st Line As Provided in Your Question

Using awk you can do what you are attempting in a very simple manner, e.g.

$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS}' cars.txt
USA|Ford|Michael

Explanation

awk -F'|' - read the file using '|' as the Field-Separator, specified as -F'|' at the beginning of the call, or as FS within the command itself,
/USA/ - locate only lines containing "USA",
for (i = NF; i >= 1; i--) - loop over the fields in reverse order,
printf "%s%s", $i, i==1 ? RS : FS - output the field followed by a '|' (FS) if i is not equal to 1, or by the Record-Separator (RS, which is "\n" by default) if i is equal to 1. The form test ? true_val : false_val is just the ternary operator that tests whether i == 1 and, if so, provides RS for output; otherwise it provides FS.

It will be orders of magnitude faster than spawning 8 subshells using command substitutions, grep and cut (plus the pipes).

Printing Only The 1st Occurrence of a Line Containing "USA"

To print only the first line with "USA", all you need to do is exit after processing it, e.g.

$ awk -F'|' '/USA/ {for (i = NF; i >= 1; i--) printf "%s%s", $i, i==1 ? RS : FS; exit}' cars.txt
USA|Ford|Michael

Explanation: simply adding exit to the end of the command will cause awk to stop processing records after the first match.

While both awk and sed take a little time to make friends with, together they provide the Unix Swiss Army knife for text processing. Well worth the time to learn both. It only takes a couple of hours to get a good base by going through one of the tutorials. Good luck with your scripting.
AWK: confirming matching columns of consecutive rows
Hello and thank you for taking the time to read this question. For the last day I have been trying to solve a problem and haven't come any closer to a solution. I have a sample file of data that contains the following:

Fighter#Trainer
Bobby#SamBonen
Billy#BobBrown
Sammy#DJacobson
James#DJacobson
Donny#SonnyG
Ben#JasonS
Dave#JuanO
Derrek#KMcLaughlin
Dillon#LGarmati
Orson#LGarmati
Jeff#RodgerU
Brad#VCastillo

The goal is to identify "Trainers" that have more than one fighter. My gut feeling is that the "getline" and variable declaration directives in AWK are going to be needed. I have tried different combinations of:

awk -F# 'NR>1{a=$2; getline; if($2 = a) {print $0,"Yes"} else {print $0,"NO"}}' sample.txt

Yet the output is nowhere near the desired results. In fact, it doesn't even output all the rows in the sample file! My desired results are:

Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO

I am completely lost as to where to go from here. I have been searching and trying to find a solution to no avail, and I'm looking for some input. Thank you!
You don't need getline. You could just process the input normally, building up counts per trainer, and print the result in an END block:

awk -F# '
{
    lines[NR] = $0;
    trainers[NR] = $2;
    counts[$2]++;
}
END {
    print lines[1];
    for (i = 2; i <= length(lines); i++) {
        print lines[i] "#" (counts[trainers[i]] > 1 ? "YES" : "NO");
    }
}' sample.txt
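Note that calling length() on an array works in gawk but isn't portable to every awk; if that matters, a sketch of the same logic using NR in the END block instead (assumed equivalent here, since one array entry is stored per input line):

awk -F# '
{
    lines[NR] = $0
    trainers[NR] = $2
    counts[$2]++
}
END {
    print lines[1]
    for (i = 2; i <= NR; i++)
        print lines[i] "#" (counts[trainers[i]] > 1 ? "YES" : "NO")
}' sample.txt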
Another option is to make two passes:

$ cat p.awk
BEGIN {FS=OFS="#"}
NR==1 {print;next};
NR==FNR {++trainers[$2]; next}
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}

$ awk -f p.awk p.txt p.txt
Fighter#Trainer
Bobby#SamBonen#NO
Billy#BobBrown#NO
Sammy#DJacobson#YES
James#DJacobson#YES
Donny#SonnyG#NO
Ben#JasonS#NO
Dave#JuanO#NO
Derrek#KMcLaughlin#NO
Dillon#LGarmati#YES
Orson#LGarmati#YES
Jeff#RodgerU#NO
Brad#VCastillo#NO

Explained:

Set the input and output field separators:
BEGIN {FS=OFS="#"}
Print the header:
NR==1 {print;next};
First pass, count occurrences of each trainer:
NR==FNR {++trainers[$2]; next}
Second pass, set YES or NO according to the trainer count, and print the result:
FNR>1 {$3=(trainers[$2]>1)?"YES":"NO"; print}