AWK process data until next match - linux

I am trying to process a file using awk.
sample data:
233;20180514;1;00;456..;m
233;1111;2;5647;6754;..;n
233;1111;2;5647;2342;..;n
233;1111;2;5647;p234;..;n
233;20180211;1;00;780..;m
233;1111;2;5647;3434;..;n
233;1111;2;5647;4545;..;n
233;1111;2;5647;3453;..;n
The problem statement is: I need to copy the second column of a record matching "1;00;" onto the following records, until the next "1;00;" match, and then copy that record's second column onward in the same way. The match pattern "1;00;" could change as well.
It could be, say, "2;20;". In that case I need to copy the second column until there is either a "1;00;" or a "2;20;" match.
I can do this using a while loop, but I really need to do this using awk or sed, as the file is huge and a while loop may take a lot of time.
Expected output:
233;20180514;1;00;456..;m
233;20180514;1111;2;5647;6754;..;n+1
233;20180514;1111;2;5647;2342;..;n+1
233;20180514;1111;2;5647;p234;..;n+1
233;20180211;1;00;780..;m
233;20180211;1111;2;5647;3434;..;n+1
233;20180211;1111;2;5647;4545;..;n+1
233;20180211;1111;2;5647;3453;..;n+1
Thanks in advance.
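One way to handle several marker patterns at once (a minimal sketch, not taken from the answers below; treating "1;00;" and "2;20;" in fields 3 and 4 as the markers is an assumption based on the question):

```shell
printf '%s\n' \
  '233;20180514;1;00;456..;m' \
  '233;1111;2;5647;6754;..;n' \
  '233;1111;2;5647;2342;..;n' |
awk -F';' -v OFS=';' '
  ($3 == "1" && $4 == "00") || ($3 == "2" && $4 == "20") {  # marker line: remember its 2nd field
    val = $2; print; next
  }
  { $2 = val OFS $2; $NF = $NF "+1"; print }                # detail line: inject the remembered value
'
# prints:
# 233;20180514;1;00;456..;m
# 233;20180514;1111;2;5647;6754;..;n+1
# 233;20180514;1111;2;5647;2342;..;n+1
```

Adding more markers is just another `||` term in the first condition.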

EDIT: Since the OP has changed the sample Input_file in the question, here is code adjusted for the new sample.
awk -F";" '
length($2)==8 && !($3=="1" && $4=="00"){
flag=""}
($3=="1" && $4=="00"){
val=$2;
$2="";
sub(/;;/,";");
flag=1;
print;
next
}
flag{
$2=val OFS $2;
$NF=$NF"+1"
}
1
' OFS=";" Input_file
Basically, this checks whether the 2nd field has a length of 8 and whether the 3rd and 4th fields are NOT "1" and "00", rather than matching the literal text ;1;00;.
If your actual Input_file is the same as the shown samples, then the following may help you.
awk -F";" 'NF==5 || !/pay;$/{flag=""} /1;00;$/{val=$2;$2="";sub(/;;/,";");flag=1} flag{$2=val OFS $2} 1' OFS=";" Input_file
Explanation:
awk -F";" ' ##Setting field separator as semicolon for all the lines here.
NF==5 || !/pay;$/{ ##Checking condition: if the number of fields on a line is 5 OR the line does NOT end with pay; then do the following.
flag=""} ##Setting variable flag to NULL here.
/1;00;$/{ ##Searching for the string 1;00; at the end of a line; if it is found, do the following:
val=$2; ##Creating a variable named val whose value is $2 (2nd field of the current line).
$2=""; ##Nullifying the 2nd column of the current line.
sub(/;;/,";"); ##Substituting two consecutive semicolons with a single semicolon, to remove the 2nd column's now-empty value.
flag=1} ##Setting the value of variable flag to 1 here.
flag{ ##Checking condition: if variable flag has a value, do the following.
$2=val OFS $2} ##Re-creating $2 as val OFS $2, basically prepending the 2nd-column value remembered from the matching line.
1 ##awk works on condition-then-action; 1 is an always-true condition with no action given, so the default action (printing the line) happens.
' OFS=";" Input_file ##Setting OFS to semicolon and mentioning the Input_file name here.

Related

Match lines based on patterns and reformat file Bash/ Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time. Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples, please try the following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: within a single read of the Input_file, try the following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
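Note that `for(i in arr)` iterates in an unspecified order, so if you want deterministic output from the 2nd solution you can pipe it through sort (a small check, using a subset of the sample input.txt shown above):

```shell
printf '%s\n' \
  'TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR' \
  'TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW' \
  'TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR' |
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' |
LC_ALL=C sort
# prints:
# CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
# EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
```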
Explanation (1st solution): Adding a detailed explanation for the 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR, which is TRUE the first time the Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating an array indexed by the 2nd field, appending each $1 value separated by commas.
next ##next skips all further statements from here.
}
($2 in arr){ ##Checking condition: if the 2nd field is present in arr, then do the following.
print $2"("arr[$2]")" ##Printing the 2nd field followed by ( arr[$2] ) here.
delete arr[$2] ##Deleting the arr value indexed by the 2nd field here.
}
' Input_file Input_file ##Mentioning the Input_file name twice here.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't, just run sort -k2,2 on your input first), this uses one pass, stores only one token at a time in memory, and produces the output in the same order of $2s as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (a newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.

How to edit output rows from awk with defined position?

Is there a way how to solve this?
I have a bash script, which creates .dat and .log file from source files.
I'm using awk with print and the positions I need to print. The problem is with the last position, ID2 (below). It should be just \*[0-9]{3}\*#, but in some cases there is a preceding string matching [0-9]{12}\[00]\>.
Then row looks for example like this:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is to remove that preceding string, so the line in the file becomes:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I will be happy for any advice, thanks.
awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here we keep only the last 6 characters of the last field, with semicolon as the field separator. If there is nothing else in front of that *ID*#, we keep all of it.
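For example, on the sample row:

```shell
printf '2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#\n' |
awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1'
# prints: 2020-01-11 01:01:01;test;test123;*123*#
```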
Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#
Could you please try the following, tested and written with the shown samples in GNU awk.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##Using the match function with a regex that matches 12 digits, then [, then one or more digits, then ] and >; also checking that the line contains *, 3 digits, *, #.
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If the above condition is TRUE, printing the substring from the 1st character to RSTART-1, then the substring from RSTART+RLENGTH to the end of the line.
}
' Input_file ##Mentioning Input_file name here.

How can you compare entries between two columns in linux?

I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids that have the same letter code and starting alphabet.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare it with the letter in the second column.
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try the following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk here.
FS="," ##Setting FS as comma here, field separator.
} ##Closing BLOCK for BEGIN here.
substr($1,1,1) == $2 ##Using awk's substr function to get a substring: substr(string, start, length). Taking the 1st letter of $1 and comparing it with $2 of the current line; if TRUE, the line is printed.
' Input_file ##Mentioning Input_file name here.
Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Same as RavinderSingh's expression, but the field separator is specified differently.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG

Find a pattern and replace

This is the input to my file.
Number : 123
PID : IIT/123/Dakota
The expected output is :
Number : 111
PID : IIT/111/Dakota
I want to replace 123 with 111. To solve this I have tried the following:
awk '/Number/{$NF=111} 1' log.txt
awk -F '[/]' '/PID/{$2="123"} 1' log.txt
Use sed for something this simple?
Print the change to the screen (test with this) :
sed -e 's:123:111:g' f2.txt
Update the file (with this) :
sed -i 's:123:111:g' f2.txt
Example:
$ sed -i 's:123:111:g' f2.txt
$ cat f2.txt
Number : 111
PID : IIT/111/Dakota
EDIT2: Or, if you want to substitute each line's 123 with 111 without checking any condition (as you tried in your awk), then simply do:
awk '{sub(/123/,"111")} 1' Input_file
Change sub to gsub in case of multiple occurrences of 123 in a single line.
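For example, with more than one 123 on a line (a hypothetical input, not from the question):

```shell
# sub replaces only the first match; gsub replaces every match on the line
printf 'PID : IIT/123/Dakota-123\n' | awk '{gsub(/123/,"111")} 1'
# prints: PID : IIT/111/Dakota-111
```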
Explanation of above code:
awk -v new_value="111" ' ##Creating an awk variable named new_value holding the new value the OP needs in the line.
/^Number/ { $NF=new_value } ##Checking if a line starts with the string Number; if so, setting the last field to new_value.
/^PID/ { num=split($NF,array,"/"); ##Checking if a line starts with PID; if so, splitting the last field on / into an array named array, with num pieces.
array[2]=new_value; ##Setting the second item of array to new_value here.
for(i=1;i<=num;i++){ val=val?val "/" array[i]:array[i] }; ##Looping from 1 to num, re-creating the last field of the current line in variable val.
$NF=val; ##Setting the last field to variable val here.
val="" ##Nullifying variable val here.
}
1' Input_file ##Mentioning 1 to print the line and mentioning Input_file name here too.
EDIT: In case you need to keep the / in your output too, then use the following awk.
awk -v new_value="111" '
/^Number/ { $NF=new_value }
/^PID/ { num=split($NF,array,"/");
array[2]=new_value;
for(i=1;i<=num;i++){ val=val?val "/" array[i]:array[i] };
$NF=val;
val=""
}
1' Input_file
The following awk may help you here. (It seems the sample input changed a bit after I applied code tags to your samples, so I am editing my code accordingly now.)
awk -F"[ /]" -v new_value="111" '/^Number/{$NF=new_value} /^PID/{$(NF-1)=new_value}1' Input_file
In case you want to save the changes into the Input_file itself, append > temp_file && mv temp_file Input_file to the above code.
Explanation:
awk -F"[ /]" -v new_value="111" ' ##Setting the field separators as space and / for each line, and creating awk variable new_value holding the new value the OP wants.
/^Number/{ $NF=new_value } ##Checking condition if a line is starting with string Number then change its last field to new_value value.
/^PID/ { $(NF-1)=new_value } ##Checking condition if a line starts from string PID then setting second last field to variable new_value.
1 ##awk works on a condition-then-action model; 1 makes the condition TRUE, and with no action mentioned, the default action (printing the current line) happens.
' Input_file ##Mentioning Input_file name here.

Bash addition by the first column, if data in second column is the same

I have a list with delimiters |
40|192.168.1.2|user4
42|192.168.1.25|user2
58|192.168.1.55|user3
118|192.168.1.3|user11
67|192.168.1.25|user2
As you can see, I have the same ip in the line 42|192.168.1.25|user2 and in the line 67|192.168.1.25|user2. How can I merge these lines, adding the values in the first column? Can you give me a solution using awk, with some examples?
I need in a result something like this:
40|192.168.1.2|user4
58|192.168.1.55|user3
109|192.168.1.25|user2
118|192.168.1.3|user11
As you can see, the numbers from the first column have been summed.
If you need the output in the same order as the Input_file, then the following awk may help.
awk -F"|" '!c[$2,$3]++{val++;v[val]=$2$3} {a[$2,$3]+=$1;b[$2,$3]=$2 FS $3;} END{for(j=1;j<=val;j++){print a[v[j]] FS b[v[j]]}}' SUBSEP="" Input_file
Adding a non-one-liner form of the solution too now.
awk -F"|" ' ##Making the field separator pipe (|) for all lines of the Input_file.
!c[$2,$3]++{ ##Checking if index $2,$3 is occurring for the first time in array c; if so, do the following.
val++; ##Incrementing variable val by 1 each time the cursor comes here.
v[val]=$2$3 ##Creating an array named v whose index is val and whose value is $2$3 (2nd field and 3rd field).
} ##Closing this block here now.
{
a[$2,$3]+=$1; ##Creating an array named a indexed by $2,$3, adding each 1st-field value into the same index to get the SUM.
b[$2,$3]=$2 FS $3; ##Creating array b indexed by $2,$3 with value $2 FS $3, where FS is the field separator.
} ##Closing this block here.
END{ ##Starting the awk END block here.
for(j=1;j<=val;j++){ ##Starting a for loop from 1 to the value of variable val here.
print a[v[j]] FS b[v[j]] ##Printing the value of array a indexed by v[j], then array b indexed by v[j], separated by FS.
}}
' SUBSEP="" Input_file ##Setting SUBSEP to NULL here and mentioning the Input_file name here.
Short GNU datamash + awk solution:
datamash -st'|' -g2,3 sum 1 <file | awk -F'|' '{print $3,$1,$2}' OFS='|'
g2,3 - group by the 2nd and 3rd field (i.e. by IP address and user id)
sum 1 - sum the 1st field values within grouped records
The output:
40|192.168.1.2|user4
109|192.168.1.25|user2
118|192.168.1.3|user11
58|192.168.1.55|user3
Modifying the sample data to include different users for ip address 192.168.1.25:
$ cat ipfile
40|192.168.1.2|user4
42|192.168.1.25|user1 <=== same ip, different user
58|192.168.1.55|user3
118|192.168.1.3|user11
67|192.168.1.25|user9 <=== same ip, different user
And a simple awk script:
$ awk '
BEGIN { FS="|" ; OFS="|" }
{ sum[$2]+=$1 ; if (user[$2]=="") { user[$2]=$3 } }
END { for (idx in sum) { print sum[idx],idx,user[idx] } }
' ipfile
58|192.168.1.55|user3
40|192.168.1.2|user4
118|192.168.1.3|user11
109|192.168.1.25|user1 <=== captured first user id
BEGIN { FS="|" ; OFS="|" } : define input and output field separators; executed once at beginning
sum[$2]+=$1 : store/add field #1 to array (indexed by ip address == field #2); executed once for each row in data file
if .... : if a user hasn't already been stored for a given ip address, then store it now; this has the effect of saving the first user id we find for a given ip address; executed once for each row in data file
END { for .... / print ...} : loop through array indexes, printing our sum, ip address and (first) user id; executed once at the end
NOTE: No sorting requirement was provided in the original question; sorting could be added as needed ...
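For instance, a numeric sort on the summed first column could be added like this (one possible choice; the sample data is piped in instead of read from ipfile):

```shell
printf '%s\n' '40|192.168.1.2|user4' '42|192.168.1.25|user1' \
              '58|192.168.1.55|user3' '67|192.168.1.25|user9' |
awk 'BEGIN { FS="|" ; OFS="|" }
     { sum[$2]+=$1 ; if (user[$2]=="") { user[$2]=$3 } }
     END { for (idx in sum) { print sum[idx],idx,user[idx] } }' |
sort -t'|' -k1,1n
# prints:
# 40|192.168.1.2|user4
# 58|192.168.1.55|user3
# 109|192.168.1.25|user1
```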
awk to the rescue!
$ awk 'BEGIN {FS=OFS="|"}
{a[$2 FS $3]+=$1}
END {for(k in a) print a[k],k}' file | sort -n
40|192.168.1.2|user4
58|192.168.1.55|user3
109|192.168.1.25|user2
118|192.168.1.3|user11
if user* is not part of the key and you want to capture the first value
$ awk 'BEGIN {FS=OFS="|"}
{c[$2]+=$1;
if(!($2 in u)) u[$2]=$3} # capture first user
END {for(k in c) print c[k],k,u[k]}' file | sort -n
which ends up almost the same as #markp's answer.
Another idea on the same path but allows for different users:
awk -F'|' '{c[$2] += $1}u[$2] !~ $3{u[$2] = (u[$2]?u[$2]",":"")$3}END{for(i in c)print c[i],i,u[i]}' OFS='|' input_file
If there are multiple users, they will be separated by commas.
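A quick check with two different users on the same IP (same logic, but with the separators set in BEGIN so it can read a pipe, and sort added only because for(i in c) output order is unspecified):

```shell
printf '%s\n' '40|192.168.1.2|user4' '42|192.168.1.25|user1' '67|192.168.1.25|user9' |
awk 'BEGIN{FS=OFS="|"}
     {c[$2] += $1}
     u[$2] !~ $3 {u[$2] = (u[$2]?u[$2]",":"")$3}   # collect each new user once
     END{for(i in c) print c[i],i,u[i]}' |
sort -t'|' -k1,1n
# prints:
# 40|192.168.1.2|user4
# 109|192.168.1.25|user1,user9
```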