Split file using awk at pattern - linux

Here is an example of the data that I have in a row in example.tsv:
somedata1:data1#||#somedata2:data2#||#somedata1:data3#||#somedata2:data4
I wanted to do two things:
Split the data from the pattern '#||#' and write it to other file. The number of columns after splitting is not fixed.
I have tried the awk command:
awk -F"#\|\|#" '{print;}' example.tsv > splitted.tsv
Output of the first file should be:
column 1
somedata1:data1
somedata2:data2
somedata1:data3
somedata2:data4
Next I want to split the data in splitted.tsv on the ':'.
somedata1
data1
data3
And write it to a file.
Is there a way we could do this in a single awk command?

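You need to escape the | correctly: the -F value is read as a string first and only then as a regex, so each backslash has to be doubled. Then use split():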
awk -F'#\\|\\|#' '{split($2,a,":");print a[2]}' file
data2
To print all data out in a table:
awk -F'#\\|\\|#' '{for (i=1;i<=NF;i++) print $i}' file
somedata:data1
somedata:data2
somedata:data3
somedata:data1
To split the data even more:
awk -F'#\\|\\|#' '{for (i=1;i<=NF;i++) {split($i,a,":");print a[1],a[2]}}' file
somedata data1
somedata data2
somedata data3
somedata data1

For the first split, you could try
$ awk 'BEGIN{print "column1"}{gsub(/#\|\|#/,"\n"); print }' file
column1
somedata:data1
somedata:data2
somedata:data3
somedata:data1
To then split on :, you could do:
$ awk -F: 'BEGIN{print "column1","column2"}
{gsub(/#\|\|#/,"\n"); gsub(/:/," ");print }' file
column1 column2
somedata data1
somedata data2
somedata data3
somedata data1
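Both steps can also be done in one awk invocation by redirecting print to two files. The following is a minimal sketch, not a definitive answer: the output file names (splitted.tsv, parts.tsv) and the variable names pairs and parts are assumptions, and the separator is written as a bracket expression so that no backslashes are needed.

```shell
# Sample row taken from the question; the temp directory and file names are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' 'somedata1:data1#||#somedata2:data2#||#somedata1:data3#||#somedata2:data4' > "$tmp/example.tsv"

awk -F'#[|][|]#' -v pairs="$tmp/splitted.tsv" -v parts="$tmp/parts.tsv" '
{
    for (i = 1; i <= NF; i++) {
        print $i > pairs              # first file: one key:value pair per line
        split($i, kv, ":")            # kv[1] is the key, kv[2] is the value
        print kv[1], kv[2] > parts    # second file: key and value separated by OFS
    }
}' "$tmp/example.tsv"

first=$(cat "$tmp/splitted.tsv")
second=$(cat "$tmp/parts.tsv")
printf '%s\n' "$first" "$second"
rm -rf "$tmp"
```

Inside awk, `>` truncates each file once when it is first opened and then keeps appending for the rest of the run, so every field ends up in the files.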


Matching two files and print all columns

I have two files I want to match according to column 1 in file 1 and column 2 in file 2.
File 1:
1000019 -0.013936 0.0069218 -0.0048443 -0.0053688
1000054 0.013993 0.0044969 -0.0050022 -0.0043233
File 2:
5131885 1000019
1281471 1000054
I would like to print all columns after matching.
Expected output (file 3):
5131885 1000019 -0.013936 0.0069218 -0.0048443 -0.0053688
1281471 1000054 0.013993 0.0044969 -0.0050022 -0.0043233
I tried the following:
awk 'FNR==NR{arr[$1]=$2;next} ($2 in arr){print $0,arr[$2]}' file1 file2 > file3
join file1 file2 > file3 #after sorting
This awk should work (file2 is read first to build the lookup, then file1):
awk 'NR==FNR {r[$2]=$1; next}{print r[$1], $0}' file2 file1
Output
5131885 1000019 -0.013936 0.0069218 -0.0048443 -0.0053688
1281471 1000054 0.013993 0.0044969 -0.0050022 -0.0043233
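If file1 can contain keys that never appear in file2, the command above still prints those rows, with an empty first column. A hedged variant that prints only matched rows is sketched below; the rows are shortened from the question, and the 1000099 row is a made-up unmatched key added for illustration.

```shell
# Shortened sample data; the 1000099 row is hypothetical and has no partner in file2.
tmp=$(mktemp -d)
printf '%s\n' '1000019 -0.013936 0.0069218' '1000054 0.013993 0.0044969' '1000099 0.5 0.25' > "$tmp/file1"
printf '%s\n' '5131885 1000019' '1281471 1000054' > "$tmp/file2"

# First pass (file2): map the key in column 2 to the ID in column 1.
# Second pass (file1): print only the rows whose key was seen in file2.
joined=$(awk 'NR==FNR { r[$2] = $1; next } $1 in r { print r[$1], $0 }' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The `$1 in r` test checks for the key without creating an empty array entry, which a plain `r[$1]` reference would do.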

How to read lines from file and split it based on comma delimited and assign variable names

This is what I have as an input line:
Table_name,1~2~3,U,2018-09-26 05:07:31.000000886,2018-09-26 06:49:03.003,DEPT_NBR|DIV_NBR,FARGO | |916244244 |
||||||,FARGO||916244244|
awk -F, '{for(i=1;i<=NF;i++) print $i}' file
and below is what it gives
Table_name
1~2~3
U
2018-09-26 05:07:31.000000886
2018-09-26 06:49:03.003
DEPT_NBR|DIV_NBR
FARGO | |916244244
||||||FARGO||916244244|
Below is the expected output that I'm looking for:
Tablename:Table_name
Row_key:1~2~3
CDC_FLAG:U
INGESTION_TS:2018-09-26 05:07:31.000000886
CDC_TS:2018-09-26 06:49:03.003
FIELD_NAME:DEPT_NBR|DIV_NBR
OLD_VALUES:FARGO | |916244244
NEW_VALUES:||||||FARGO||916244244|
I also want to be able to compare OLD_VALUES and NEW_VALUES and write out the differences.
Input the names in your BEGIN block.
awk -F, 'BEGIN { split("Tablename,Row_key,CDC_FLAG,INGESTION_TS,CDC_TS,FIELD_NAME,OLD_VALUES,NEW_VALUES", name, ",") }
{ for (i=1; i<=NF; i++) print name[i] ":" $i }' file
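For the comparison of OLD_VALUES and NEW_VALUES, one possible approach is to split both fields on | and walk them in parallel. This is only a sketch under loud assumptions: the record below is hypothetical, it assumes fields 6, 7 and 8 hold the field names, old values and new values, and it assumes all three lists have the same length and order, which the sample line in the question does not clearly show.

```shell
# Hypothetical record: field 6 = names, field 7 = old values, field 8 = new values.
line='Table_name,1~2~3,U,ts1,ts2,DEPT_NBR|DIV_NBR,FARGO|12,FARGO|15'

diffs=$(printf '%s\n' "$line" | awk -F, '
{
    nn = split($6, names, "|")    # field names
    split($7, oldv, "|")          # old values, assumed to be in the same order
    split($8, newv, "|")          # new values, assumed to be in the same order
    for (i = 1; i <= nn; i++)
        if (oldv[i] "" != newv[i] "")    # appending "" forces a string comparison
            print names[i] ": " oldv[i] " -> " newv[i]
}')
printf '%s\n' "$diffs"
```

The string comparison is a deliberate choice: without it awk would treat numeric-looking values such as 1 and 1.0 as equal.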

bash text file transpose, add new column and make one big two-column again

I have a large text file:
modularity_class;keys;columna1;columna2;columna3;
1;Antimalarial;Borneo;Cytotoxicity;Indonesia
0;Africa;malaria;morbidity;mortality
6;Anopheles albimanus;compression sprayer;house?spraying;;
12;;;;Tanzania;;
The final result should be:
Antimalarial;1
Borneo;1
Cytotoxicity;1
Indonesia;1
Africa;0
malaria;0
morbidity;0
mortality;0
Anopheles albimanus;6
compression sprayer;6
house?spraying;6
Tanzania;12
As you can see I need to:
1st: remove first row (should be trivial)
transpose each row (one by one)
add first value in original row to every element transposed as a second column
skip every null/blank value between semicolon delimiters
I've read about awk, sed, tr and so on... but I cannot figure out how to get it in an efficient way.
Note: every row may have different length or elements.
Simple awk should do the trick:
awk -F';' 'NR>1 {
for(i=2; i<=NF; i++) {
if($i!="")
print $i FS $1
}
}' file
One-liner:
awk -F';' 'NR>1 { for(i=2; i<=NF; i++) { if($i!="") print $i FS $1 } }' file

linux grep pattern in an unknown number of column

I have a text file with many rows and columns and I want to grep a column by the 'column name'.
M121 M125 M123 M124 M131 M126 M211 N
0.41463252 1.00296561 -0.1713496 0.15923644 -1.49682602 -1.9478695 1.45223392 …
-0.46775802 0.14591103 1.122446 0.83648981 -0.3038532 -1.1841548 2.18074729 …
0.67736835 2.12969375 -0.8187298 0.13582824 -1.49290987 -0.6798428 1.04353114 …
0.08673344 -0.40437672 1.8441559 -0.63679375 0.47998832 0.1702844 0.54029264 …
-0.32606297 -0.95551833 0.6157599 0.02819133 1.44818627 -0.9528659 0.09207864 …
-0.51781121 0.88806507 -0.2913757 -0.00463802 0.05037374 0.953773 0.01244763 …
-0.25724472 0.05119051 0.2109025 -0.26083822 -0.52094072 -0.938595 -0.01275275 …
1.94348766 -1.83607523 1.2010512 -0.54109756 -0.88323831 -0.6263788 -0.96973544 …
0.1900408 -0.61025656 0.4586306 -0.69181051 -0.90713834 0.3589271 0.6870383 …
0.54866057 -0.03861159 -1.505861 0.54871682 -0.24602601 -0.3941754 0.85673905 …
For example, I want to grep the M211 column, but I don't know its column number. I tried:
awk '$i == "M211"' filename or awk '$0 == "M211"' filename
awk: illegal field $(), name "i"
input record number 1, filename
source line number 1
Is there any solution ? Thank you.
An awk solution: iterate over the column names on the first line of the input file and save the column number that matches the desired name, then print that column. There is no output if no match is found.
$ awk 'NR==1{ for(i=1;i<=NF;i++){if($i=="M125")c=i;} if(c==0)exit; }
{print $c}' ip.txt
M125
1.00296561
0.14591103
2.12969375
-0.40437672
-0.95551833
0.88806507
0.05119051
-1.83607523
-0.61025656
-0.03861159
Similar solution with perl
$ perl -lane '@i = grep {$F[$_] eq "M123"} 0..$#F if $.==1; exit if !@i;
print @F[@i]' ip.txt
M123
-0.1713496
1.122446
-0.8187298
1.8441559
0.6157599
-0.2913757
0.2109025
1.2010512
0.4586306
-1.505861
@i = grep {$F[$_] eq "M123"} 0..$#F if $.==1 # for the header line, get the index of the column whose value matches the string M123
exit if !@i # exit if no match is found
print @F[@i] # print the matched column
This assumes there will be only one matching column.
For multiple matches, use
perl -lane '@i = grep {$F[$_] =~ /^(M121|M126)$/} 0..$#F if $.==1; exit if !@i;
print join " ", @F[@i]' ip.txt
Another in awk:
$ awk 'NR==1 {for(i=NF;i>0;i--) if($i=="M125") break; if(!i) exit} {print $i}' file
M125
1.00296561
0.14591103
2.12969375
-0.40437672
-0.95551833
0.88806507
0.05119051
-1.83607523
-0.61025656
-0.03861159
Explained:
NR==1 { # for the first record
for(i=NF;i>0;i--) # iterate over the fields backwards, for a change
if($i=="M125") break # until desired column, remember i
if (!i) exit # if column not found, exit
}
{print $i} # print value from ith field
If you are more familiar with Python (shown here for Python 3):
import csv

column_name = "M125"
# newline="" is what the csv module expects for files it reads
with open("file", newline="") as f:
    data_dict = csv.DictReader(f, delimiter=" ")
    print(column_name)
    for item in data_dict:
        print(item[column_name])
To do anything with columns ("fields" in awk) by name rather than number you should first create an array that maps the field name to number and from then on just access the fields using that array indexed by the field name(s) rather than accessing them directly by field number(s):
$ awk 'NR==1{for (i=1;i<=NF;i++) f[$i]=i} {print $(f["M124"])}' file
M124
0.15923644
0.83648981
0.13582824
-0.63679375
0.02819133
-0.00463802
-0.26083822
-0.54109756
-0.69181051
0.54871682
or if you don't want to hard-code the column name:
$ awk -v c=M124 'NR==1{for (i=1;i<=NF;i++) f[$i]=i} {print $(f[c])}' file
M124
0.15923644
0.83648981
0.13582824
-0.63679375
0.02819133
-0.00463802
-0.26083822
-0.54109756
-0.69181051
0.54871682
and to print any number of columns in the order you choose:
$ awk -v cols='M211 M124' 'NR==1{for (i=1;i<=NF;i++) f[$i]=i; n=split(cols,c)} {for (i=1;i<=n;i++) printf "%s%s", $(f[c[i]]), (i<n ? OFS : ORS)}' file
M211 M124
1.45223392 0.15923644
2.18074729 0.83648981
1.04353114 0.13582824
0.54029264 -0.63679375
0.09207864 0.02819133
0.01244763 -0.00463802
-0.01275275 -0.26083822
-0.96973544 -0.54109756
0.6870383 -0.69181051
0.85673905 0.54871682
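One caveat with the name-to-number map: if the requested name is not in the header, f[c] is empty, $(f[c]) becomes $0, and awk prints whole lines instead of failing. A minimal guard is sketched below; pick_col is a hypothetical helper name, and the three-column sample is a shortened, rounded stand-in for the data in the question.

```shell
# Shortened stand-in for the data in the question.
tmp=$(mktemp -d)
printf '%s\n' 'M121 M125 M123' '0.41 1.00 -0.17' '-0.46 0.14 1.12' > "$tmp/ip.txt"

# pick_col (hypothetical helper): print the column named $1 from file $2,
# or exit with status 1 and a message on stderr if the header lacks that name.
pick_col() {
    awk -v c="$1" '
    NR==1 {
        for (i = 1; i <= NF; i++) f[$i] = i
        if (!(c in f)) { print "no column named " c > "/dev/stderr"; exit 1 }
    }
    { print $(f[c]) }' "$2"
}

found=$(pick_col M125 "$tmp/ip.txt")
missing=$(pick_col M999 "$tmp/ip.txt" 2>/dev/null) || missing_status=$?
printf '%s\n' "$found"
rm -rf "$tmp"
```

The non-zero exit status lets a calling script tell "column not found" apart from an empty column.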

How to Compare CSV Column using awk?

I receive a CSV like this:
column$1,column$2,column$
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and the results, but column$2 needs to show the sum of the T values separately from the sum of the P, A and C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original csv:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the csv:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to sum and display the sum by column$1 and column$2, but I don't know how to group different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream-oriented: it reads input and writes the (possibly modified) records to its output. It does not edit the file in place.
You just need to add a corresponding print
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than the "matched" you need to add it to the print
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0 as comment mentions above.
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
$2=="T" { # match all records that are "T"
t[$1]+=$3 # add the value for this record to an array of totals
n[$1] # record this name in our authoritative name list
}
$2!="T" { # match all records that are NOT "T"
s[$1]+=$3 # add the value for this record to an array of sums
n[$1] # record this name too
}
END { # Now that we've collected data, analyse the results
for (i in n) { # step through our authoritative list of names
print i,"PCA",s[i]
print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
}
}
' inputfile
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort using:
awk ... | sort
which defaults to -k1.
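As written, the script also counts the header line as a record (with a "name" of column$1). A hedged variant that skips the header and pipes the result through sort for a stable order is sketched below; the sample rows are the ones from the question, LC_ALL=C is an added assumption to make the sort order locale-independent, and the `+ 0` forces a 0 to be printed for a name with no rows of one kind.

```shell
# Sample data copied from the question; the temp directory is hypothetical.
tmp=$(mktemp -d)
printf '%s\n' 'column$1,column$2,column$' \
    'john,P,10' 'john,P,10' 'john,A,20' 'john,T,30' 'john,T,10' \
    'marc,P,10' 'marc,C,10' 'marc,C,20' 'marc,T,30' 'marc,A,10' > "$tmp/input.csv"

report=$(awk -F, -v OFS=, '
NR == 1   { next }                   # skip the header line
          { n[$1] }                  # remember every name
$2 == "T" { t[$1] += $3; next }      # running total of the T rows
          { s[$1] += $3 }            # running total of all other rows
END {
    for (i in n) {
        print i, "PCA", s[i] + 0
        print i, "T", t[i] + 0, (t[i] == s[i] ? "CORRECT" : "INCORRECT")
    }
}' "$tmp/input.csv" | LC_ALL=C sort)
printf '%s\n' "$report"
rm -rf "$tmp"
```

Sorting whole lines works here because PCA sorts before T within each name; for other category labels a sort key would have to be chosen explicitly.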
