How to add a header to awk output? - linux

I have a CSV file that looks like the one below:
"10.8.70.67","wireless",,"UTY_07_ISD",,26579
"10.8.70.69","wireless",,"RGB_34_FTR",,19780
I want to retrieve the first, second and fourth column values (without quotes) and write them into another CSV in the format below.
IP DEVICETYPE DEVICENAME
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
I have used the awk command below
awk -F ',|,,' '{gsub(/"/,"",$1); gsub(/"/,"",$2); gsub(/"/,"",$3); print $1, $2, $3}' file.csv
and got the output below
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
Please help me assign headings to each column.

Assuming you don't have commas or double quotes inside the quoted strings (a big assumption!), it can be as simple as:
$ awk -F, 'NR==1 {print "IP","DEVICETYPE","DEVICENAME"}
{gsub(/"/,"");
print $1,$2,$4}' file | column -t
IP DEVICETYPE DEVICENAME
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
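If the new file should stay comma-separated rather than column-aligned, the same idea can write CSV directly; a sketch, with output.csv as a placeholder name:
$ awk -F, 'NR==1{print "IP,DEVICETYPE,DEVICENAME"} {gsub(/"/,""); print $1","$2","$4}' file > output.csv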

With your shown samples, could you please try the following. Written and tested in GNU awk.
awk -v FPAT='([^,]*)|("[^"]+")' '
BEGIN{
  print "IP DEVICETYPE DEVICENAME"
}
function remove(fields){
  num=split(fields,arr,",")
  for(i=1;i<=num;i++){
    gsub(/^"|"$/,"",$arr[i])
  }
}
{
  remove("1,2,4")
  print $1,$2,$4
}
' Input_file
Explanation: Adding a detailed explanation for the above.
awk -v FPAT='([^,]*)|("[^"]+")' ' ##Setting FPAT so a field is either a bare value ([^,]*) or a quoted string ("[^"]+"), as per the shown samples.
BEGIN{ ##Starting the BEGIN section of this program from here.
  print "IP DEVICETYPE DEVICENAME" ##Printing the header here.
}
function remove(fields){ ##Creating a function named remove, which receives the numbers of the fields from which the quotes should be removed.
  num=split(fields,arr,",") ##Splitting fields into the array arr here.
  for(i=1;i<=num;i++){ ##Traversing through all items of arr here.
    gsub(/^"|"$/,"",$arr[i]) ##Substituting the leading and trailing " of each mentioned field with an empty string.
  }
}
{
  remove("1,2,4") ##Calling remove with field numbers 1, 2 and 4, which we need as per the output.
  print $1,$2,$4 ##Printing the 1st, 2nd and 4th fields here.
}
' Input_file ##Mentioning the Input_file name here.
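The advantage of FPAT over a plain comma separator shows up when a quoted field itself contains a comma. A quick sketch with a made-up row (not from the OP's samples):
$ echo '"10.8.70.70","wire,less",,"LAB_01_XYZ",,100' |
awk -v FPAT='([^,]*)|("[^"]+")' '{gsub(/^"|"$/,"",$2); print $2}'
wire,less
With -F, the same row would split wire,less into two separate fields; FPAT keeps the quoted string intact as one field.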

A simple one-liner would be:
awk -F ',|,,' 'BEGIN {format = "%-20s %-20s %-20s\n"; printf format, "IP", "DEVICETYPE", "DEVICENAME"} {gsub(/"/,"",$1); gsub(/"/,"",$2); gsub(/"/,"",$3); printf format, $1, $2, $3}' abc.csv
Here I have used the BEGIN/END special patterns, which are used to perform startup or cleanup actions, to add the headings. For more details please refer to the documentation Using BEGIN/END.
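For instance, a minimal illustration of both special patterns (BEGIN runs once before any input is read; END runs once after the last input line):
$ printf 'a\nb\n' | awk 'BEGIN{print "header"} {print} END{print "footer"}'
header
a
b
footer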

I got the expected output with the command below
awk -F ',|,,' 'BEGIN {print "IP,DEVICETYPE,DEVICENAME"} {gsub(/"/, "", $1); gsub(/"/, "", $2); gsub(/"/, "", $3); print $1","$2","$3}' input.csv > output.csv
I found that I was missing the BEGIN part. Thanks all for your responses.

Related

How to split on several delimiters but keep those in between square brackets?

I am trying to split the following text string on dash, square-bracket and colon delimiters, but keep the colons that are inside the square brackets.
Input:
10:100 - [10/09/21:12:23:22]
Desired output:
100, 10/09/21:12:23:22
My current code:
awk -F '[- ":]' '{print $1, $2, $3, $4, $5}'
1st solution: With GNU awk you could try the following code.
awk '
match($0,/:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/,arr){
print arr[1],arr[2]
}
' Input_file
2nd solution: Using sed's s (substitution) command along with its capturing-group capability, try the following:
sed -E 's/^[^:]*:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/\1 \2/' Input_file
3rd solution: Using any awk you could use the following code, applying its sub and gsub operations to the 1st and last fields.
awk '{sub(/.*:/,"",$1);gsub(/^\[|\]$/,"",$NF);print $1,$NF}' Input_file
4th solution: With a Perl one-liner using a lazy match (.*?), one could try the following substitution:
perl -pe 's/^.*?:([^[:space:]]+)[[:space:]]+-[[:space:]]+\[([^]]*)\]/\1 \2/' Input_file
If you have multiple of these patterns in the string, and the order does not matter, you can make use of awk to match the patterns you are interested in and then remove the surrounding delimiters.
In this case, you can match
\[[^][]+]|:[0-9]+
The pattern matches:
\[[^][]+] Match a [...] part
| Or
:[0-9]+ Match : and 1+ digits
The gsub pattern ^[:\[]|\]$ matches either : or [ at the start of the string, or ] at the end of the string, and replaces it with an empty string.
awk '
{
  while(match($0,/\[[^][]+]|:[0-9]+/)){
    v = substr($0,RSTART,RLENGTH)
    gsub(/^[:\[]|\]$/, "", v)
    print v
    $0=substr($0,RSTART+RLENGTH)
  }
}
' file
Output
100
10/09/21:12:23:22
Assuming no empty lines within the input data:
echo '10:100 - [10/09/21:12:23:22]' |
nawk 'sub("^[^:]*:",_, $!--NF)' FS='[ -]*[][]' OFS=', '
or
gawk 'NF -= sub("^[^:]*:",_)' FS='[ -]*[][]' OFS=', '
or
mawk 'NF -= sub("^[^:]*:",_)' FS='[][ -]+' OFS=', '
100, 10/09/21:12:23:22
In each variant the field separator swallows the spaces, dashes and brackets, decrementing NF drops the now-empty trailing field and rebuilds $0 with the ", " OFS, and sub() deletes everything up to and including the first colon.

Match lines based on patterns and reformat file Bash/Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
  arr[$2]=(arr[$2]?arr[$2]",":"")$1
  next
}
($2 in arr){
  print $2"("arr[$2]")"
  delete arr[$2]
}
' Input_file Input_file
2nd solution: Within a single read of the Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation (1st solution): Adding a detailed explanation for the 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR, which will be TRUE the first time Input_file is being read.
  arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating an array indexed by the 2nd field and appending each $1 to its value, comma-separated.
  next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking if the 2nd field is present in arr; if so, do the following.
  print $2"("arr[$2]")" ##Printing the 2nd field followed by ( arr[$2] ) here.
  delete arr[$2] ##Deleting the arr entry indexed by the 2nd field here.
}
' Input_file Input_file ##Mentioning Input_file twice so it is read two times.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, just run sort -k2,2 on your input first), this makes one pass, stores only one token at a time in memory, and produces the output in the same order of $2 values as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line with the contents of the hold space.
Remove the first character (a newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.

Combining rows of data

File.csv
1234,1
6789,1
1234,5
6789,5
1234,No
6789,No
1234,4
6789,4
1234,1
6789,1
1234,Not Applicable
6789,Not Applicable
1234,2
6789,2
1234,5
6789,5
1234,6
6789,6
1234,8
6789,8
1234,6
6789,6
1234,1
6789,1
1234,3
6789,3
I'm trying to combine all rows that share the same first-column key into one row per key, so the file above becomes:
1234,1,5,No,4,1,Not Applicable,2,5,6,8,6,1,3
6789,1,5,No,4,1,Not Applicable,2,5,6,8,6,1,3
Looking to merge the rows using an array or a loop.
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
BEGIN{
  FS=OFS=","
}
{
  sub(/ +$/,"")
  first=$1
  sub(/^[^,]*,/,"")
  arr[first]=(arr[first]?arr[first] OFS:"")$0
}
END{
  for(i in arr){
    print i,arr[i]
  }
}' Input_file
Explanation: Adding a detailed explanation for the above solution:
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
  FS=OFS="," ##Setting the field separator and output field separator to , here.
}
{
  sub(/ +$/,"") ##Substituting trailing spaces at the end of the line with an empty string, since the OP's samples have them.
  first=$1 ##Saving the $1 value in the variable first here.
  sub(/^[^,]*,/,"") ##Substituting everything up to the first , with an empty string here.
  arr[first]=(arr[first]?arr[first] OFS:"")$0 ##Creating array arr indexed by first and appending the rest of the line to its value.
}
END{ ##Starting the END block of this awk program from here.
  for(i in arr){ ##Traversing through all elements of arr here.
    print i,arr[i] ##Printing i and the value of arr indexed by i here.
  }
}' Input_file ##Mentioning the Input_file name here.
One way, using a perl one-liner:
$ perl -F, -lanE '
push @{$g{$F[0]}}, @F[1..$#F];
END { print join(",", $_, $g{$_}->@*) for (sort { $a <=> $b } keys %g) }
' input.csv
1234,1,5,No,4,1,Not Applicable,2,5,6,8,6,1,3
6789,1,5,No,4,1,Not Applicable,2,5,6,8,6,1,3
It splits each line on commas, adds all the fields to an array stored in a hash table keyed by the first element, and then prints the combined lines in sorted key order.

How can you compare entries between two columns in linux?

I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its one-letter code is also G.
On the other hand, Arginine begins with A but its one-letter code is R.
I am trying to print out, as a result, the amino acids whose one-letter code matches their starting letter.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare it with the letter in the second column.
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try the following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: Adding an explanation for the above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk here.
  FS="," ##Setting FS as comma here, the field separator.
} ##Closing the BEGIN block here.
substr($1,1,1) == $2 ##Using awk's substr function, whose form is substr(string, start, length), to get the 1st letter of $1 and compare it with $2 of the current line; if the comparison is TRUE, the line is printed.
' Input_file ##Mentioning the Input_file name here.
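A quick check of the substr(string, start, length) arguments, using Glycine from the question:
$ awk 'BEGIN{print substr("Glycine",1,1), substr("Glycine",2,3)}'
G lyc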
Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Here (.) captures the first character of the line, [^,]* skips the rest of the first field, and the backreference \1 requires the character right after the comma to match the captured one.
Same as RavinderSingh's expression, but the field separator is specified differently.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG

How to Compare CSV Column using awk?

I receive a CSV like this:
column$1,column$2,column$3
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and results, but column$2 needs to show the sum of the T values separately from the P, A and C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original CSV:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the CSV:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to sum and display the sum by column$1 and $2, but I don't know how to group the different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream-oriented: it processes input and outputs what you change; it does not make in-place changes to the file.
You just need to add a corresponding print:
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than the "MATCHED" marker, you need to add it to the print,
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0, as the comment above mentions.
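Put together as a runnable sketch, assuming the comma-separated sample from the question (output.csv is just a placeholder name):
awk -F, '$2 == "T" {print $0 ",MATCHED"}' input.csv > output.csv
Redirecting to a new file leaves the input untouched, in line with the stream-oriented point above.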
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
$2=="T" { # match all records that are "T"
t[$1]+=$3 # add the value for this record to an array of totals
n[$1] # record this name in our authoritative name list
}
$2!="T" { # match all records that are NOT "T"
s[$1]+=$3 # add the value for this record to an array of sums
n[$1] # record this name too
}
END { # Now that we've collected data, analyse the results
for (i in n) { # step through our authoritative list of names
print i,"PCA",s[i]
print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
}
}
' inputfile
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort the output using:
awk ... | sort
which by default sorts on the entire line (equivalent to -k1).
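If the output order matters and you have GNU awk, an alternative to an external sort is to make the END loop iterate in sorted key order via PROCINFO["sorted_in"]; a sketch of the script above with that one addition (a gawk-only feature):
awk -F, -vOFS=, '
$2=="T" {t[$1]+=$3; n[$1]}
$2!="T" {s[$1]+=$3; n[$1]}
END {
  PROCINFO["sorted_in"] = "@ind_str_asc"  # GNU awk only: make for-in visit indices in ascending string order
  for (i in n) {
    print i, "PCA", s[i]
    print i, "T", t[i], (t[i]==s[i] ? "CORRECT" : "INCORRECT")
  }
}
' inputfile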
