How can you compare entries between two columns in linux?

I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids whose one-letter code is the same as their first letter.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare it with the letter in the second column.
awk '{ if ($1 == $2) { print $1; } }' < foo.txt

Could you please try following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk here.
FS="," ##Setting FS as comma here, field separator.
} ##Closing BLOCK for BEGIN here.
substr($1,1,1) == $2 ##Using the substr function of awk to get a substring; its usage is substr(string, start position, length). This takes the 1st letter of $1 and compares it with $2 of the current line; if TRUE, awk prints the current line.
' Input_file ##Mentioning Input_file name here.

Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG

Same as RavinderSingh's expression, but the field separator is set with the -F option instead of in a BEGIN block.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
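The question asks for the amino acid names rather than whole rows, so here is a minimal variation on the awk answers above. It is only a sketch: it assumes the first line is always a header (skipped with NR > 1) and that only the name should be printed. The variable names csv and names are made up for the illustration, and the rows are a subset of the sample data.

```shell
# A few sample rows copied from the question (header plus four records).
csv='Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG'

# NR > 1 skips the header line.
# substr($1, 1, 1) means: start at character 1 and take 1 character.
names=$(printf '%s\n' "$csv" |
    awk -F, 'NR > 1 && substr($1, 1, 1) == $2 { print $1 }')

printf '%s\n' "$names"
```

Printing $1 instead of the whole record is the only functional difference from the one-liners above.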

Related

Match lines based on patterns and reformat file Bash/ Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples please try following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
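2nd solution: With a single read of Input_file, try the following. Note that the order in which for (i in arr) visits the keys is unspecified, so the groups may not come out in input order.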
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation(1st solution): Adding detailed explanation for 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing 2nd field ( arr[$2] ) here.
delete arr[$2] ##Deleting the arr element indexed by the 2nd field, so each group is printed only once.
}
' Input_file Input_file ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value (if it isn't, just run sort -k2,2 on it first), this uses one pass, stores only one token at a time in memory, and produces the output in the same order of $2s as the input. The sample input has the CC_LlanR and EN_DavaW lines interleaved, so on that unsorted input each of those keys appears twice in the output below:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumulate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (a newline artefact introduced by the H command) and print the result.
N.B. The final solution is unsorted and in the original order.
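If the input is not grouped and the output should still follow the order in which each key first appears, one option is to record that order in a second array. The following is a sketch only, not taken from any of the answers above: the array names members and order are made up for the illustration, and the input is a four-line subset of the sample.

```shell
# Interleaved subset of the sample input from the question.
input='TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
index_08_barcode_37_PA-21-YC-15 21-YC'

grouped=$(printf '%s\n' "$input" | awk '
{
    if ($2 in members) {
        members[$2] = members[$2] "," $1   # append to an existing group
    } else {
        members[$2] = $1                   # first member of a new group
        order[++n] = $2                    # remember the first-seen order of keys
    }
}
END {
    # Walk the keys by index rather than with for-in, whose order is unspecified.
    for (i = 1; i <= n; i++)
        printf "%s(%s)\n", order[i], members[order[i]]
}')

printf '%s\n' "$grouped"
```

This reads the file once and keeps every group in memory, so it trades the low memory use of the sorted-input approach for not needing a sort.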

Convert floating point numbers to user defined output using AWK

I am trying to convert floating point numbers (columns) from a text file to the user defined output using awk, e-01 -> $\exp 10^{-01}$
Test input:
1.2e-01
1.8e-02
1.12e-03
1.222e+04
1.23e+05
441.2e+05
221.2e+06
Expect results
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
I have used the following command "awk '{printf "%.4e\n", $1}'", which does not solve this problem.
Any help would be really appreciated.
You may use this simple sed substitution with a capturing group and a back-reference:
sed -E 's/e([+-][0-9]+)/$\\exp 10^{\1}$/' file
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
Could you please try following, written and tested with shown samples only in GNU awk.
awk '{sub(/ +$/,"");sub(/e/,"$\\exp ");sub(/[-+]/,"10^{&");$0=$0"}$"} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
sub(/ +$/,"") ##Removing trailing spaces from the current line.
sub(/e/,"$\\exp ") ##Substituting the first e with $\exp followed by a space.
sub(/[-+]/,"10^{&") ##Substituting the first - or + with 10^{ followed by the matched sign (&).
$0=$0"}$" ##Appending }$ to the current line.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
I would treat the input as text and do two successive replacements (note that gensub is a GNU awk extension), namely:
awk '{$0=gensub("e", "$\\\\exp 10^", 1); $0=gensub("(-|+)([0-9]+)[[:blank:]]+", "{\\1\\2}$", 1); print}' file.txt
Let file.txt be:
1.2e-01
1.8e-02
1.12e-03
1.222e+04
1.23e+05
441.2e+05
221.2e+06
then output is:
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
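Explanation: I alter the whole line ($0). First I replace e with $\exp 10^ (the \ needs to be escaped). Second, I search for a sign (- or +) followed by one or more digits followed by one or more spaces or tabs, which I replace with {signdigits}$. Finally I print the altered line. Note that the second pattern requires at least one trailing space or tab, so a line with no trailing whitespace only gets the first replacement.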
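For awks without gensub, the same conversion can be sketched with split, treating each line as mantissa, the letter e, and a signed exponent. This is an illustration under assumptions rather than a drop-in answer: it assumes one number per line, a lowercase e, and optional trailing blanks. The names part and converted are made up for the example.

```shell
# Two sample values; the second has trailing blanks on purpose.
converted=$(printf '%s\n' '1.2e-01' '441.2e+05  ' | awk '
{
    sub(/[ \t]+$/, "")            # drop trailing blanks, if any
    n = split($0, part, "e")      # part[1] = mantissa, part[2] = signed exponent
    if (n == 2)
        print part[1] "$\\exp 10^{" part[2] "}$"
    else
        print                     # leave lines without exactly one e untouched
}')

printf '%s\n' "$converted"
```

Because the exponent is copied as text, leading zeros and the explicit + sign are preserved.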

How to edit output rows from awk with defined position?

Is there a way to solve this?
I have a bash script which creates .dat and .log files from source files.
I'm using awk with print and the positions of the fields I need to print. The problem is with the last position, ID2 (lower). It should be just \*[0-9]{3}\*#, but in some cases it is preceded by a string matching [0-9]{12}\[00\]>.
Then row looks for example like this:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is remove the string before in a file:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I will be happy for any advice, thanks.
awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here we keep only last 6 characters of the last field, while semicolon is the field separator. If there is nothing else in front of that *ID*#, then we keep all of it.
Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#
Could you please try the following, written and tested with the shown samples in GNU awk. Note that it prints only the lines that contain the prefix; lines without it are skipped.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##Using the match function with a regex that matches 12 digits, then [, one or more digits, ] and >. Also checking that the line contains * followed by 3 digits and *# (this second regex is not anchored to the end of the line).
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If above condition is TRUE then printing sub-string from 1st character to RSTART-1 and then sub-string from RSTART+RLENGTH value to till last of line.
}
' Input_file ##Mentioning Input_file name here.
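To strip only the unwanted prefix and still print the rows that never had one, the substitution can be anchored to the start of the last field. This is a sketch: it assumes the prefix always has the shape digits[digits]> and deliberately avoids the {12} interval so that it also runs on awks lacking interval support. The second input row is invented to show the unchanged case.

```shell
cleaned=$(printf '%s\n' \
    '2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#' \
    '2020-01-12 02:02:02;test;test456;*456*#' |
awk 'BEGIN { FS = OFS = ";" }
     # Remove a leading digits[digits]> from the last field only.
     { sub(/^[0-9]+\[[0-9]+\]>/, "", $NF) }
     1')

printf '%s\n' "$cleaned"
```

Unlike the fixed-width substr approach, this does not depend on ID2 being exactly six characters long.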

gsub in awk with variable

I want to replace the ">" with the variable name, so that each header starts with ">" followed by the name and a ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta..
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk knows the name of the file it is currently operating on through the built-in variable FILENAME; I strip the fasta extension (keeping the dot) using gensub and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" followed by the content of stub. After that I print the line.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.
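For completeness, the reason the original loop printed a literal $nam is that the shell does not expand variables inside single quotes, so awk never sees the value. A minimal sketch of keeping the loop and passing the value in with -v follows; the awk variable name prefix and the temporary directory are only there to make the example self-contained.

```shell
# Build a throwaway copy of the sample file from the question.
tmpdir=$(mktemp -d)
printf '%s\n' '>textofDNA' 'ATCCCCGGG' '>textofDNA2' 'ATCCCCGGGTTTT' \
    > "$tmpdir/sample01.fasta"

renamed=$(
    for f in "$tmpdir"/*.fasta; do
        nam=$(basename "$f" .fasta)
        # -v copies the shell value into the awk variable named prefix.
        awk -v prefix="$nam" '/^>/ { $0 = ">" prefix "." substr($0, 2) } 1' "$f"
    done
)

printf '%s\n' "$renamed"
rm -rf "$tmpdir"
```

One caveat: awk interprets backslash escapes in values passed with -v, so a file name containing backslashes would need different handling.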

AWK process data until next match

I am trying to process a file using awk.
sample data:
233;20180514;1;00;456..;m
233;1111;2;5647;6754;..;n
233;1111;2;5647;2342;..;n
233;1111;2;5647;p234;..;n
233;20180211;1;00;780..;m
233;1111;2;5647;3434;..;n
233;1111;2;5647;4545;..;n
233;1111;2;5647;3453;..;n
The problem statement: I need to copy the second column of a record matching "1;00;" into the following records until the next "1;00;" match, then copy the second column of that record onward until the next "1;00;" match, and so on. The match pattern "1;00;" could change as well.
It could be, say, "2;20;". In that case I need to copy the second column until there is either a "1;00;" or a "2;20;" match.
I can do this using a while loop but I really need to do this using awk or sed as the file is huge and while may take a lot of time.
Expected output:
233;20180514;1;00;456..;m
233;20180514;1111;2;5647;6754;..;n+1
233;20180514;1111;2;5647;2342;..;n+1
233;20180514;1111;2;5647;p234;..;n+1
233;20180211;1;00;780..;m
233;20180211;1111;2;5647;3434;..;n+1
233;20180211;1111;2;5647;4545;..;n+1
233;20180211;1111;2;5647;3453;..;n+1
Thanks in advance.
EDIT: Since the OP has changed the sample Input_file in the question, adding code as per the new sample now.
awk -F";" '
length($2)==8 && !($3=="1" && $4=="00"){
flag=""}
($3=="1" && $4=="00"){
val=$2;
$2="";
sub(/;;/,";");
flag=1;
print;
next
}
flag{
$2=val OFS $2;
$NF=$NF"+1"
}
1
' OFS=";" Input_file
Basically this checks whether the 2nd field is 8 characters long and whether the 3rd and 4th fields are 1 and 00, rather than matching the string ;1;00; directly.
If your actual Input_file is same as shown samples then following may help you.
awk -F";" 'NF==5 || !/pay;$/{flag=""} /1;00;$/{val=$2;$2="";sub(/;;/,";");flag=1} flag{$2=val OFS $2} 1' OFS=";" Input_file
Explanation:
awk -F";" ' ##Setting field separator as semi colon for all the lines here.
NF==5 || !/pay;$/{ ##Checking condition if number of fields are 5 on a line OR line is NOT ending with pay; if yes then do following.
flag=""} ##Setting variable flag value as NULL here.
/1;00;$/{ ##Searching string /1;00; at last of a line if it is found then do following:
val=$2; ##Creating variable named val whose value is $2 (2nd field of current line).
$2=""; ##Nullifying 2nd column now for current line.
sub(/;;/,";"); ##Substituting 2 continuous semi colons with single semi colon to remove 2nd columns NULL value.
flag=1} ##Setting value of variable flag as 1 here.
flag{ ##Checking condition if variable flag is having values then do following.
$2=val OFS $2} ##Re-creating value of $2 as val OFS $2, basically adding value of 2nd column of pay; line here.
1 ##awk works on concept of condition then action so mentioning 1 means making condition TRUE and no action mentioned so print will happen of line.
' OFS=";" Input_file ##Setting OFS as semi colon here and mentioning Input_file name here.
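As a compact sketch that reproduces the expected output shown in the question, the header records can be printed unchanged while their 2nd field is remembered and inserted into the following records. It assumes a header is identified by field 3 being 1 and field 4 being 00, as in the new sample; supporting another pattern such as 2;20 would mean extending that condition. The variable names val and carried are made up for the example.

```shell
input='233;20180514;1;00;456..;m
233;1111;2;5647;6754;..;n
233;20180211;1;00;780..;m
233;1111;2;5647;3434;..;n'

carried=$(printf '%s\n' "$input" | awk 'BEGIN { FS = OFS = ";" }
    # Header record: remember its 2nd field and print the line unchanged.
    $3 == "1" && $4 == "00" { val = $2; print; next }
    # Following records: insert the remembered value after field 1
    # and append +1 to the last field.
    val != "" { $1 = $1 OFS val; $NF = $NF "+1" }
    1')

printf '%s\n' "$carried"
```

The file is processed in a single pass with one remembered value, which should suit the large input mentioned in the question.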
