File.csv
1234,1
6789,1
I'm trying to transform the file above into the output below:
1234,1
6789,1
I'm looking to merge rows that share the same first field, using an array or a loop.
Could you please try the following, written and tested with your shown samples in GNU awk.
awk '
BEGIN{
FS=OFS=","
}
{
sub(/ +$/,"")
first=$1
sub(/^[^,]*,/,"")
arr[first]=(arr[first]?arr[first] OFS:"")$0
}
END{
for(i in arr){
print i,arr[i]
}
}' Input_file
Explanation: a detailed explanation of the above solution:
awk ' ##Start the awk program.
BEGIN{ ##Start the BEGIN section.
FS=OFS="," ##Set the field separator and output field separator to a comma.
}
{
sub(/ +$/,"") ##Remove trailing spaces from the line, since the OP's samples have them.
first=$1 ##Save the value of $1 in the variable first.
sub(/^[^,]*,/,"") ##Remove everything up to and including the first comma.
arr[first]=(arr[first]?arr[first] OFS:"")$0 ##Append the rest of the line to arr[first], separating accumulated values with OFS.
}
END{ ##Start the END block.
for(i in arr){ ##Traverse all elements of arr.
print i,arr[i] ##Print each index and its accumulated value.
}
}' Input_file ##Mention the Input_file name.
One way, using a perl one-liner:
$ perl -F, -lanE '
push @{$g{$F[0]}}, @F[1..$#F];
END { print join(",", $_, $g{$_}->@*) for (sort { $a <=> $b } keys %g) }
' input.csv
1234,1,5,No,4,1,Not Applicable,2,5,6,8,6,1,3
6789,1,5,No,4,1,Not Applicable,2,5,6,8,6,1,3
This splits each line on commas, pushes the remaining fields onto an array stored in a hash keyed by the first field, and then prints the combined lines in numerically sorted key order.
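If perl isn't available, here is a rough awk sketch of the same hash-keyed accumulation; note that the sorted traversal via PROCINFO is a GNU awk extension, so a gawk that supports "sorted_in" (4.0+) is assumed:

awk '
BEGIN { FS = OFS = "," }
{
    key = $1
    sub(/^[^,]*,/, "")                        # drop the key field, keep the rest of the line
    grp[key] = (key in grp ? grp[key] OFS : "") $0
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"    # GNU awk only: iterate keys in ascending numeric order
    for (k in grp) print k, grp[k]
}' input.csv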
I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (plus many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: within a single read of Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation (1st solution): a detailed explanation of the 1st solution:
awk ' ##Start the awk program.
FNR==NR{ ##This condition is TRUE only while the first copy of Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Append $1 to arr indexed by the 2nd field, comma-separating accumulated values.
next ##Skip the remaining statements for this record.
}
($2 in arr){ ##During the second read, check whether the 2nd field is present in arr.
print $2"("arr[$2]")" ##Print the 2nd field with its accumulated values in parentheses.
delete arr[$2] ##Delete the entry so each group is printed only once.
}
' Input_file Input_file ##Mention Input_file twice so it is read two times.
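Note that the 2nd solution's for(i in arr) loop does not guarantee any particular output order. If you want a single pass that also preserves the order in which each group first appears, a sketch along these lines should work (assuming the group names in $2 are never empty):

awk '
!($2 in arr){ order[++n]=$2 } ##Remember each group name the first time it is seen.
{ arr[$2]=(arr[$2]?arr[$2]",":"")$1 } ##Append $1 to its group, comma-separated.
END{ for(i=1;i<=n;i++) print order[i] "(" arr[order[i]] ")" }
' Input_file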
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, just run sort -k2,2 on your input first), this uses one pass, stores only one token at a time in memory, and produces the output in the same $2 order as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
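Since the sample input above is not fully grouped by $2 (the CC_LlanR and EN_DavaW lines are interleaved), those groups come out split. Sorting first, as mentioned above, merges them into one line per $2 value:

$ sort -k2,2 input.txt | awk -f tst.awk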
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (a newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.
I have a csv file that looks like below
"10.8.70.67","wireless",,"UTY_07_ISD",,26579
"10.8.70.69","wireless",,"RGB_34_FTR",,19780
I want to retrieve the first, second and fourth column values (without quotes) and populate another csv in the below format.
IP DEVICETYPE DEVICENAME
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
I have used the below awk command
awk -F ',|,,' '{gsub(/"/,"",$1); gsub(/"/,"",$2); gsub(/"/,"",$3); print $1, $2, $3}' file.csv
and got the below output
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
Please help in assigning headings to each column.
Assuming you don't have commas or double quotes inside the quoted strings (a big assumption!), it can be as simple as
$ awk -F, 'NR==1 {print "IP","DEVICETYPE","DEVICENAME"}
{gsub(/"/,"");
print $1,$2,$4}' file | column -t
IP DEVICETYPE DEVICENAME
10.8.70.67 wireless UTY_07_ISD
10.8.70.69 wireless RGB_34_FTR
With your shown samples, could you please try the following, written and tested in GNU awk.
awk -v FPAT='([^,]*)|("[^"]+")' '
BEGIN{
OFS=","
print "IP DEVICETYPE DEVICENAME"
}
function remove(fields){
num=split(fields,arr,",")
for(i=1;i<=num;i++){
gsub(/^"|"$/,"",$arr[i])
}
}
{
remove("1,2,4")
print $1,$2,$4
}
' Input_file
Explanation: a detailed explanation of the above.
awk -v FPAT='([^,]*)|("[^"]+")' ' ##Set FPAT so a field is either a run of non-commas or a quoted string, as per the samples.
BEGIN{ ##Start the BEGIN section of this program.
OFS="," ##Set the output field separator to a comma.
print "IP","DEVICETYPE","DEVICENAME" ##Print the header line; with OFS="," it comes out comma-separated like the data rows.
}
function remove(fields){ ##Create a function named remove, which takes the numbers of the fields from which " should be removed.
num=split(fields,arr,",") ##Split the field-number list into arr.
for(i=1;i<=num;i++){ ##Traverse all items of arr.
gsub(/^"|"$/,"",$arr[i]) ##Substitute the leading and trailing " of each mentioned field with NULL.
}
}
{
remove("1,2,4") ##Call remove with field numbers 1, 2 and 4, which we need as per the output.
print $1,$2,$4 ##Print the 1st, 2nd and 4th fields.
}
' Input_file ##Mention the Input_file name.
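With the shown samples, and assuming none of the quoted fields contains an embedded comma, this should print:

IP,DEVICETYPE,DEVICENAME
10.8.70.67,wireless,UTY_07_ISD
10.8.70.69,wireless,RGB_34_FTR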
A simple one-liner:
awk -F ',|,,' 'BEGIN {format = "%-20s %-20s %-20s\n"; printf format, "IP", "DEVICETYPE", "DEVICENAME"} {gsub(/"/,"",$1); gsub(/"/,"",$2); gsub(/"/,"",$3); printf format, $1, $2, $3}' abc.csv
Here I have used the BEGIN/END special patterns, which perform startup or cleanup actions, to add the headings. For more details please refer to the documentation: Using BEGIN/END
I got the expected output with the below command
awk -F ',|,,' 'BEGIN {print "IP,DEVICETYPE,DEVICENAME"} {gsub(/"/, "", $1); gsub(/"/, "", $2); gsub(/"/, "", $3); print $1","$2","$3}' input.csv > output.csv
I found that I was missing the BEGIN part. Thanks, all, for your responses.
I'm trying to sum the values in column 2 where column 1 has duplicate values; however, my Google searches only turn up how to add columns or sum whole rows, not how to sum where a value matches.
Can someone confirm or link me to where I can find out how to do this? I've got to the point of organising the data but the final step eludes my search engine.
Current code
cat example.csv | sort | ##pipe of the thing that sums > output.csv
example.csv
platform1,24257022
platform2,44959636
platform_3,62
platform2,2
platform_3,20
platform1,572475
platform_3,75
desired output.csv
platform1,24829497
platform2,44959638
platform_3,157
Apologies that this is such a basic question...
Could you please try the following.
awk 'BEGIN{FS=OFS=","}{a[$1]+=$2}END{for(i in a){print i,a[i]}}' Input_file
Explanation: a detailed explanation of the above code:
awk ' ##Start the awk program.
BEGIN{ ##Start the BEGIN section.
FS=OFS="," ##Set FS and OFS to a comma.
} ##Close the BEGIN block.
{ ##Start the main block.
a[$1]+=$2 ##Add $2 to a[$1], accumulating a running total per key.
} ##Close the main block.
END{ ##Start the END block.
for(i in a){ ##Traverse all elements of array a.
print i,a[i] ##Print each index and its summed value.
} ##Close the for loop.
} ##Close the END block.
' Input_file ##Mention the Input_file name.
2nd solution: in case you want the output in the same order in which each 1st-field value first occurs, try the following, since the above solution does not preserve that sequence.
awk '
BEGIN{
FS=OFS=","
}
!a[$1]++{
b[++count]=$1
}
{
c[$1]+=$2
}
END{
for(i=1;i<=count;i++){
print b[i],c[b[i]]
}
}
' Input_file
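With the shown example.csv, this 2nd solution should reproduce the desired output exactly, since it prints the keys in order of first appearance:

$ awk 'BEGIN{FS=OFS=","} !a[$1]++{b[++count]=$1} {c[$1]+=$2} END{for(i=1;i<=count;i++){print b[i],c[b[i]]}}' example.csv
platform1,24829497
platform2,44959638
platform_3,157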
Another way, using the always-useful GNU datamash:
$ datamash -t, -s -g1 sum 2 < example.csv
platform1,24829497
platform2,44959638
platform_3,157
(Using comma as the field delimiter (-t,), sorting the input (-s, needed for unsorted input like yours), grouping by the first column (-g1), and summing the second column of each group.)
I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids that have the same letter code and starting alphabet.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare it with the letter in the second column.
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try the following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: an explanation of the above code.
awk ' ##Start the awk program.
BEGIN{ ##Start the BEGIN section.
FS="," ##Set FS, the field separator, to a comma.
} ##Close the BEGIN block.
substr($1,1,1) == $2 ##Use awk's substr function, whose form is substr(string, start, length), to take the 1st letter of $1 and compare it with $2; if they match, the current line is printed.
' Input_file ##Mention the Input_file name.
A simpler way, using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Same as RavinderSingh's expression, but the field separator is specified differently.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
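If you ever need the comparison to be case-insensitive (not required by the shown samples, where both letters are uppercase), a small variant using awk's tolower should do it; adding NR > 1 to skip the header line is also an option:

awk -F, 'NR > 1 && tolower(substr($1,1,1)) == tolower($2)' InFile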
I have a sample file with '||o||' as the field separator.
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
ngascash||o||||o||
ms-bronze.com.br||o||||o||
I want to move the lines with only 1 field into 1.txt and those having more than 1 field into not_1.txt. I am using the following command:
sed 's/\(||o||\)\+$//g' sample.txt | awk -F '[|][|]o[|][|]' '{if (NF == 1) print > "1.txt"; else print > "not_1.txt" }'
The problem is that it moves the replaced lines, not the original ones.
The output I am getting is (not_1.txt):
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=
1.txt:
ngascash
ms-bronze.com.br
As you can see, the original lines have been modified. I don't want to modify the lines.
Any help would be highly appreciated.
Awk solution:
awk -F '[|][|]o[|][|]' \
'{
c = 0;
for (i=1; i<=NF; i++) if ($i != "") c++;
print > ((c == 1 ? "1" : "not_1") ".txt")
}' sample.txt
Results:
$ head 1.txt not_1.txt
==> 1.txt <==
ngascash||o||||o||
ms-bronze.com.br||o||||o||
==> not_1.txt <==
www.google.org||o||srScSG2C5tg=||o||bngwq
farhansingla.it||o||4sQVj09gpls=||o||
The following awk may help you with the same.
awk -F'\\|\\|o\\|\\|' '{for(i=1;i<=NF;i++){count=$i?++count:count};if(count==1){print > "1_field_only"};if(count>1){print > "not_1_field"};count=""}' Input_file
Adding a non-one-liner form of the solution too now.
awk -F'\\|\\|o\\|\\|' '
{
for(i=1;i<=NF;i++){ count=$i?++count:count };
if(count==1) { print > "1_field_only" };
if(count>1) { print > "not_1_field" };
count=""
}
' Input_file
Explanation: an explanation of the above code.
awk -F'\\|\\|o\\|\\|' ' ##Set the field separator to ||o||, escaping each | so it is taken as a literal character.
{
for(i=1;i<=NF;i++){ count=$i?++count:count }; ##Traverse all the fields, incrementing count for each field that is NOT null.
if(count==1) { print > "1_field_only" }; ##If count is 1, the line has only one non-empty field, so print it into the file named 1_field_only.
if(count>1) { print > "not_1_field" }; ##If count is more than 1, print the current line into the file named not_1_field.
count="" ##Reset count before the next line.
}
' Input_file ##Mention the Input_file name.