gsub in awk with variable - linux

I want to replace the ">" with variable names staring with ">" and ends with ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT

$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta..

Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk has knowledge of the filename it currently operates on through the built-in variable FILENAME; I strip the .fasta extension using gensub, and store it in the variable stub. The I invoke gsub to replace ">" with ">" and the content of my variable stub. After that I print it.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.

Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.

Related

Match lines based on patterns and reformat file Bash/ Linux

I am looking preferably for a bash/Linux method for the problem below.
I have a text file (input.txt) that looks like so (and many many more lines):
TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34 CC_LlanR
GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22 CC_LlanR
TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11 EN_DavaW
TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23 CC_LlanR
CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06 EN_DavaW
index_07_barcode_04_PA-17-ACW-04 17-ACW
index_09_barcode_05_PA-17-ACW-05 17-ACW
index_08_barcode_37_PA-21-YC-15 21-YC
index_09_barcode_04_PA-22-GB-10 22-GB
index_10_barcode_37_PA-28-CC-17 28-CC
index_11_barcode_29_PA-32-MW-07 32-MW
index_11_barcode_20_PA-32-MW-08 32-MW
I want to produce a file that looks like
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22,TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11,CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
I thought that I could do something along the lines of this.
cat input.txt | awk '{print $1}' | grep -e "CC_LlanR" | paste -sd',' > intermediate_file
cat input.txt | awk '{print $2"("}' something something??
But I only know how to grep one pattern at a time? Is there a way to find all the matching lines at once and output them in this format?
Thank you!
(Happy Easter/ long weekend to all!)
With your shown samples please try following.
awk '
FNR==NR{
arr[$2]=(arr[$2]?arr[$2]",":"")$1
next
}
($2 in arr){
print $2"("arr[$2]")"
delete arr[$2]
}
' Input_file Input_file
2nd solution: Within a single read of Input_file try following.
awk '{arr[$2]=(arr[$2]?arr[$2]",":"")$1} END{for(i in arr){print i"("arr[i]")"}}' Input_file
Explanation(1st solution): Adding detailed explanation for 1st solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
arr[$2]=(arr[$2]?arr[$2]",":"")$1 ##Creating array with index of 2nd field and keep adding its value with comma here.
next ##next will skip all further statements from here.
}
($2 in arr){ ##Checking condition if 2nd field is present in arr then do following.
print $2"("arr[$2]")" ##Printing 2nd field ( arr[$2] ) here.
delete arr[$2] ##Deleteing arr value with 2nd field index here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Assuming your input is grouped by the $2 value as shown in your example (if it isn't then just run sort -k2,2 on your input first) using 1 pass and only storing one token at a time in memory and producing the output in the same order of $2s as the input:
$ cat tst.awk
BEGIN { ORS="" }
$2 != prev {
printf "%s%s(", ORS, $2
ORS = ")\n"
sep = ""
prev = $2
}
{
printf "%s%s", sep, $1
sep = ","
}
END { print "" }
$ awk -f tst.awk input.txt
CC_LlanR(TCCTCCGC+TAGTTAGG_Vel_24_CC_LlanR_34,GGAGTATG+TCTATTCG_Vel_24_CC_LlanR_22)
EN_DavaW(TTGACTAG+TGGAGTAC_Vel_02_EN_DavaW_11)
CC_LlanR(TCGAATAA+TGGTAATT_Vel_24_CC_LlanR_23)
EN_DavaW(CTGCTGAA+CGTTGCGG_Vel_02_EN_DavaW_06)
17-ACW(index_07_barcode_04_PA-17-ACW-04,index_09_barcode_05_PA-17-ACW-05)
21-YC(index_08_barcode_37_PA-21-YC-15)
22-GB(index_09_barcode_04_PA-22-GB-10)
28-CC(index_10_barcode_37_PA-28-CC-17)
32-MW(index_11_barcode_29_PA-32-MW-07,index_11_barcode_20_PA-32-MW-08)
This might work for you (GNU sed):
sed -E 's/^(\S+)\s+(\S+)/\2(\1)/;H
x;s/(\n\S+)\((\S+)\)(.*)\1\((\S+)\)/\1(\2,\4)\3/;x;$!d;x;s/.//' file
Append each manipulated line to the hold space.
Before moving on to the next line, accumlate like keys into a single line.
Delete every line except the last.
Replace the last line by the contents of the hold space.
Remove the first character (newline artefact introduced by H comand) and print the result.
N.B. The final solution is unsorted and in the original order.

How to edit output rows from awk with defined position?

Is there a way how to solve this?
I have a bash script, which creates .dat and .log file from source files.
I'm using awk with print and position what I need to print. The problem is with the last position - ID2 (lower). It should be just \*[0-9]{3}\*#, but in some cases there is a string before [0-9]{12}\[00]\>.
Then row looks for example like this:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is remove the string before in a file:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I will be happy for any advice, thanks.
awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here we keep only last 6 characters of the last field, while semicolon is the field separator. If there is nothing else in front of that *ID*#, then we keep all of it.
Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#
Could you please try following tested and written with shown samples in GNU awk.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##Using match function to match regex in it, what regex does is: It matches digits(12 in number) then [ then digits(continuously coming) and ] Also checking condition if line ends with *3 digits *
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If above condition is TRUE then printing sub-string from 1st character to RSTART-1 and then sub-string from RSTART+RLENGTH value to till last of line.
}
' Input_file ##Mentioning Input_file name here.

How can you compare entries between two columns in linux?

I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids that have the same letter code and starting alphabet.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare that with the alphabet in the second column
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk here.
FS="," ##Setting FS as comma here, field separator.
} ##Closing BLOCK for BEGIN here.
substr($1,1,1) == $2 ##Using substr function of awk to get sub string from line, substr(line/variable/field, starting point, ending point) is method for using it. Getting 1st letter of $1 and comparing it with $2 of current line, if TRUE then it will print current line.
' Input_file ##Mentioning Input_file name here.
Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csvĀ 
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Same as RavinderSingh's expression, but field selector attribute is different.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG

How to search a specific expression in multiple files with Awk

I have like 500 text documents. In every of them the expression "Numero de expediente" appears at least once. I want to locate every file where there is at least twice. Every file has its own name, I'm not sure if that's a problem (I don't know if *.txt works as in cmd with Windows). So yeah, I would like to know which document contain that expression at least twice and I don't know which command is more useful for that, if grep or cat.
Thanks.
I would add another way with grep and awk. grep is responsible for matching. awk filters out the files with matched counter>=2:
grep -o -m2 'YOUR_PATTERN' *.txt
|awk -F: '{a[$1]++}END{for(x in a)if(a[x]>1)print x}'
Note:
-o works with multiple occurrences in same line case
-m2 will improve the performance: after hits two matches, stop processing the file.
awk line just builds up a hashtable, and output the filenames with match count > 1
EDIT: As per #kent and #tripleee sir's comments I am taking care of multiple instances in a single line sum of string's occurences + if someone awk is NOT supporting nextfile I am creating a flag kind of no_processing which will simply skip lines if it is TRUE(after seeing 2 instances of string in any file).
awk 'FNR==1{count=0;no_processing=""} no_processing{next} {count+=gsub("Numero de expediente","")} count==2{print FILENAME;no_processing=1}' *.txt
OR(non-one liner form of solution)
awk '
FNR==1{
count=0
no_processing=""
}
no_processing{
next
}
{
count+=gsub("Numero de expediente","")
}
count==2{
print FILENAME
no_processing=1
}
' *.txt
Could you please try following, should work with GNU awk.
awk 'FNR==1{count=0} /Numero de expediente/{count++} count==2{print FILENAME " has at least 2 instances of searched string in it.";nextfile}' *.txt
Above will print eg--> test.txt has at least 2 instances of string in it. In case you want to simply print file names then try following.
awk 'FNR==1{count=0} /Numero de expediente/{count++} count==2{print FILENAME;nextfile}' *.txt
Explanation: Adding expplanation for above code now.
awk ' ##Starting awk program here.
FNR==1{ ##Checking condition FNR==1 which will check if this is a 1st line for any new Input_file(since we are reading multiple Input_files from awk in this code).
count=0 ##Setting value of variable count as ZERO here.
} ##Closing BLOCK for FNR condition here.
/Numero de expediente/{ ##Checking condition here if a line contains string Numero de expediente in it then do following.
count++ ##Incrementing variable named count value with 1 here.
} ##Closing BLOCK for string checking condition here.
count==2{ ##Checking condition if variable count value is 2 then do following.
print FILENAME ##Printing Input_file name here, where FILENAME is out of the box awk variable contains current Input_file name in it.
nextfile ##nextfile will skip current Input_file, since we got 2 instances so need NOT to read this Input_file as per OP requirement and SAVE some time here.
} ##Closing BLOCK for count condition here.
' *.txt ##Mentioning *.txt which will pass all .txt extension files to it.
You can try with Perl as well
perl -lne ' $x++ for(/Numero de expediente/g); if($x>=2) { print $ARGV;close(ARGV);$x=0 } ' *.txt
The $x will be 0 and for every pattern match (Numero de expediente) it will be incremented, even if the pattern is appearing twice in the same line. When you have atleast 2 matches, the file handle is closed using close(ARGV) and the nextfile is read.

Convert a combination of string and list to a csv file

I have a comma separated file which has a combination of strings and list and It looks like this
24.33,-30.59,42542,['2.4', '3.6', '4.4', '5.8', '6.6', '7.2','9.4', '9.4', '9.8']
24.33,-30.60,25512,['1.4', '2.6', '2.8', '3.8', '4.6', '5.2','7.4', '8', '9']
I want something like this
24.33,-30.59,42542,2.4,3.6,4.4,5.8,6.6,7.2,9.4,9.4,9.8
24.33,-30.60,25512,1.4,2.6,2.8,3.8,4.6,5.2,7.4,8,9
What can be the best and easiest way to do this
Simple substitutions on individual lines is the job sed exists to do:
$ sed 's/[^0-9,.-]//g' file
24.33,-30.59,42542,2.4,3.6,4.4,5.8,6.6,7.2,9.4,9.4,9.8
24.33,-30.60,25512,1.4,2.6,2.8,3.8,4.6,5.2,7.4,8,9
but you could use tr for this too:
$ tr -cd '0-9,.\n-' < file
24.33,-30.59,42542,2.4,3.6,4.4,5.8,6.6,7.2,9.4,9.4,9.8
24.33,-30.60,25512,1.4,2.6,2.8,3.8,4.6,5.2,7.4,8,9
and if you insist on awk it'd be:
$ awk '{gsub(/[^0-9,.-]/,"")}1' file
24.33,-30.59,42542,2.4,3.6,4.4,5.8,6.6,7.2,9.4,9.4,9.8
24.33,-30.60,25512,1.4,2.6,2.8,3.8,4.6,5.2,7.4,8,9
Could you please try following.
awk '{gsub(/\[|\]/,"");gsub(/\047/,"");gsub(/, /,",")} 1' Input_file
OR more precisely(reducing one more gsub from above code):
awk '{gsub(/\[|\]|\047/,"");gsub(/, /,",")} 1' Input_file
Explanation: Adding explanation too here.
awk '
{
gsub(/\[|\]|\047/,"") ##globally substituting [, ] and single dash with NULL in current line.
gsub(/, /,",") ##globally substituting comma with space with only comma in current line.
}
1 ##Mentioning 1 here to print the current line
' Input_file ##Mentioning Input_file name here.

Resources