How to edit output rows from awk with defined position? - string

Is there a way to solve this?
I have a bash script that creates .dat and .log files from source files.
I'm using awk's print with the positions I need to print. The problem is the last position, ID2 (lower). It should be just \*[0-9]{3}\*#, but in some cases it is preceded by a string matching [0-9]{12}\[00\]>.
A row then looks like this, for example:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is to remove that leading string in the file:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I would be grateful for any advice, thanks.

awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here the semicolon is the field separator, and we keep only the last 6 characters of the last field. If there is nothing in front of that *ID*#, the field is exactly 6 characters long, so all of it is kept.
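To make the index arithmetic concrete, here is a minimal sketch that prints the field length, the computed start position, and the resulting substring for two made-up values of the last field (one with the prefix, one without). The sample values are assumptions modelled on the row in the question.

```shell
# Two made-up last-field values: one with the unwanted prefix, one without.
result=$(printf '%s\n' '123456789123[00]>*123*#' '*123*#' |
  awk '{ print length($0), length($0) - 5, substr($0, length($0) - 5) }')

# Each output line shows: field length, start index, kept substring.
printf '%s\n' "$result"
```

For the 23-character value the start index is 18, and for the 6-character value it is 1, so both lines end in *123*#.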

Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##Using the match function with a regex that matches 12 digits, then [, then one or more digits, then ] and >. Also checking that the line contains *, 3 digits, * and #. Only lines that satisfy both conditions are printed.
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If above condition is TRUE then printing sub-string from 1st character to RSTART-1 and then sub-string from RSTART+RLENGTH value to till last of line.
}
' Input_file ##Mentioning Input_file name here.
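Since the program above prints only the rows that contain the prefix, rows that are already clean would be dropped from the output. The following is a sketch of a variant that prints every row and strips the prefix from the last field only. It assumes the prefix always has the shape digits, [digits], > at the start of the last field, and it avoids the {12} interval so it does not depend on GNU awk; the two sample rows are made up.

```shell
# Sample rows (made up to mirror the question): one with the prefix, one without.
input='2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
2020-01-11 01:01:02;test;test124;*456*#'

# Strip a leading "digits[digits]>" from the last field only; print every row.
result=$(printf '%s\n' "$input" |
  awk 'BEGIN { FS = OFS = ";" } { sub(/^[0-9]+\[[0-9]+\]>/, "", $NF) } 1')

printf '%s\n' "$result"
```

Rows without the prefix pass through unchanged because sub() does nothing when the pattern does not match.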

Related

Convert floating point numbers to user defined output using AWK

I am trying to convert floating point numbers (columns) from a text file to the user defined output using awk, e-01 -> $\exp 10^{-01}$
Test input:
1.2e-01
1.8e-02
1.12e-03
1.222e+04
1.23e+05
441.2e+05
221.2e+06
Expect results
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
I have used the following command "awk '{printf "%.4e\n", $1}'", which does not solve this problem.
Any help would be really appreciated.
You may use this simple sed substitution with a capturing group and a back-reference:
sed -E 's/e([+-][0-9]+)/$\\exp 10^{\1}$/' file
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
Could you please try following, written and tested with shown samples only in GNU awk.
awk '{sub(/ +$/,"");sub(/e/,"$\\exp ");sub(/[-+]/,"10^{&");$0=$0"}$"} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
sub(/ +$/,"") ##Removing trailing spaces from the current line.
sub(/e/,"$\\exp ") ##Substituting the first e with $\exp followed by a space in the current line.
sub(/[-+]/,"10^{&") ##Substituting the first - or + with 10^{ followed by the matched sign.
$0=$0"}$" ##Appending }$ to the current line.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
I would treat input as text and do two subsequent replacements, namely:
awk '{$0=gensub("e", "$\\\\exp 10^", 1); $0=gensub("(-|+)([0-9]+)[[:blank:]]+", "{\\1\\2}$", 1); print}' file.txt
Let file.txt be:
1.2e-01
1.8e-02
1.12e-03
1.222e+04
1.23e+05
441.2e+05
221.2e+06
then output is:
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
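Explanation: I alter the whole line ($0). First I replace e with $\exp 10^ (the \ needs to be escaped). Second, I search for a sign (- or +) followed by one or more digits followed by one or more spaces or tabs, and replace that with {signdigits}$. Note that the second pattern requires trailing whitespace on each line, so it will not match lines that end right after the digits. Finally I print the altered line.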
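For awks without gensub, the same conversion can be sketched with match and substr, which are available in any POSIX awk. This version treats trailing whitespace as optional; the sample values are taken from the question, with trailing spaces added to one of them as an assumption about how the real input may look.

```shell
# Sample values from the question; the second one has trailing spaces on purpose.
result=$(printf '%s\n' '1.2e-01' '1.222e+04  ' '441.2e+05' |
  awk '{
    sub(/[ \t]+$/, "")                      # drop optional trailing blanks
    if (match($0, /e[-+][0-9]+$/)) {        # exponent at the end of the line
      # Text before "e", then the LaTeX wrapper around the signed exponent.
      $0 = substr($0, 1, RSTART - 1) "$\\exp 10^{" substr($0, RSTART + 1) "}$"
    }
    print
  }')

printf '%s\n' "$result"
```

Because the pieces are joined by plain string concatenation rather than passed as a sub() replacement, only one level of backslash escaping is needed.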

Append multiple lines from one file to another

Hello Guys,
I have two files that have the same number of lines (503 exactly) and I want to append the contents of one file to the other, line by line, directly after each line without any space or tab. Let us consider the input file contents are:
One.txt
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
C:\Users\Desktop.VG\New-folder\?filename=
Two.txt
one
val_ilu_girl
pacmanhall
four_stars
squares3
Now I want like this:
C:\Users\Desktop.VG\New-folder\?filename=one
C:\Users\Desktop.VG\New-folder\?filename=val_ilu_girl
C:\Users\Desktop.VG\New-folder\?filename=pacmanhall
C:\Users\Desktop.VG\New-folder\?filename=four_stars
C:\Users\Desktop.VG\New-folder\?filename=squares3
Is there a way to do this using anything from sed, grep, awk to Excel, Notepad++ etc.?
Thanks in advance...!
Could you please try following.
awk 'FNR==NR{a[FNR]=$0;next} {print $0 a[FNR]}' two.txt one.txt
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file is being read.
a[FNR]=$0 ##Creating an array named a whose index is FNR and value is current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
{ ##These statements will be executed when 2nd Input_file is being read.
print $0 a[FNR] ##Printing current line along with array a value with index of FNR.
}
' two.txt one.txt ##Mentioning Input_file names here.
That's the job paste was invented to do:
$ paste -d '' one.txt two.txt
C:\Users\Desktop.VG\New-folder\?filename=one
C:\Users\Desktop.VG\New-folder\?filename=val_ilu_girl
C:\Users\Desktop.VG\New-folder\?filename=pacmanhall
C:\Users\Desktop.VG\New-folder\?filename=four_stars
C:\Users\Desktop.VG\New-folder\?filename=squares3
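An empty -d argument is accepted by GNU paste, but POSIX specifies '\0' as the way to ask for an empty delimiter, so the following sketch uses that spelling. The temporary directory, the shortened path, and the two-line files are assumptions made only to keep the example self-contained.

```shell
tmp=$(mktemp -d)

# Shortened stand-ins for One.txt and Two.txt from the question.
printf '%s\n' 'C:\dir\?filename=' 'C:\dir\?filename=' > "$tmp/one.txt"
printf '%s\n' 'one' 'squares3' > "$tmp/two.txt"

# '\0' is the POSIX spelling of "no delimiter" for paste -d.
result=$(paste -d '\0' "$tmp/one.txt" "$tmp/two.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

If the files were saved on Windows, carriage returns at the line ends would need to be removed first, otherwise they end up in the middle of each joined line.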

gsub in awk with variable

I want to replace each ">" with ">" followed by the value of a variable and a ".". But the following code is not printing the variable's value.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta..
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk knows the name of the file it is currently operating on through the built-in variable FILENAME. I strip the fasta extension using gensub and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" followed by the content of my variable stub. After that I print the line.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.
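For completeness, the loop from the question fails because the shell does not expand $nam inside single quotes, so awk never sees the value. A minimal sketch of the usual fix is to hand the value to awk with -v. The temporary directory and file are assumptions that mirror the sample in the question; note that -v interprets backslash escapes in the value, so names containing backslashes would need a different route.

```shell
tmp=$(mktemp -d)
printf '%s\n' '>textofDNA' 'ATCCCCGGG' '>textofDNA2' 'ATCCCCGGGTTTT' > "$tmp/sample01.fasta"

result=$(
  for f in "$tmp"/*.fasta; do
    nam=$(basename "$f" .fasta)
    # -v copies the shell value into the awk variable nam.
    awk -v nam="$nam" '/^>/ { $0 = ">" nam "." substr($0, 2) } 1' "$f"
  done
)

printf '%s\n' "$result"
rm -rf "$tmp"
```

The single-awk answers above remain the better choice for many files, since they avoid starting basename and awk once per file.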

How can you compare entries between two columns in linux?

I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids that have the same letter code and starting alphabet.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare that with the alphabet in the second column
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk here.
FS="," ##Setting FS as comma here, field separator.
} ##Closing BLOCK for BEGIN here.
substr($1,1,1) == $2 ##Using the substr function of awk to get a substring; its usage is substr(string, start, length). Getting the 1st letter of $1 and comparing it with $2 of the current line; if TRUE, the current line is printed.
' Input_file ##Mentioning Input_file name here.
Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Same as RavinderSingh's expression, but the field separator is set with the -F option instead of in a BEGIN block.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
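If only the amino acid names are wanted, as the question describes, the same comparison can print just the first field. The sketch below also skips the header row and compares case-insensitively; both of those choices, and the shortened three-column sample, are assumptions rather than requirements stated in the question.

```shell
# Shortened sample: header plus three rows, first three columns only.
csv='Name,One letter code,Three letter code
Arginine,R,Arg
Serine,S,Ser
Threonine,T,Thr'

# NR > 1 skips the header; toupper() makes the comparison case-insensitive.
result=$(printf '%s\n' "$csv" |
  awk -F, 'NR > 1 && toupper(substr($1, 1, 1)) == toupper($2) { print $1 }')

printf '%s\n' "$result"
```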

How to search a specific expression in multiple files with Awk

I have about 500 text documents. In each of them the expression "Numero de expediente" appears at least once. I want to locate every file where it appears at least twice. Every file has its own name, and I'm not sure if that's a problem (I don't know if *.txt works the way it does in cmd on Windows). So I would like to know which documents contain that expression at least twice, and I don't know which command is more useful for that, grep or cat.
Thanks.
I would add another way with grep and awk. grep is responsible for matching. awk filters out the files with matched counter>=2:
grep -o -m2 'YOUR_PATTERN' *.txt |
awk -F: '{a[$1]++}END{for(x in a)if(a[x]>1)print x}'
Note:
-o handles the case of multiple occurrences on the same line
-m2 improves performance: grep stops reading a file after two matching lines
the awk line just builds a hash table and outputs the file names with a match count > 1
EDIT: As per @kent's and @tripleee's comments, I now count multiple instances of the string on a single line by summing the occurrences. In case your awk does NOT support nextfile, I use a flag named no_processing, which simply skips the remaining lines of a file once it is TRUE (after seeing 2 instances of the string in that file).
awk 'FNR==1{count=0;no_processing=""} no_processing{next} {count+=gsub("Numero de expediente","")} count>=2{print FILENAME;no_processing=1}' *.txt
OR(non-one liner form of solution)
awk '
FNR==1{
count=0
no_processing=""
}
no_processing{
next
}
{
count+=gsub("Numero de expediente","")
}
count>=2{
print FILENAME
no_processing=1
}
' *.txt
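The counting idea can be checked on a few throwaway files. In this sketch the threshold is written as count >= 2 rather than an exact comparison, because gsub returns the number of replacements on the line and a single line with three occurrences would jump straight from 0 to 3. The file names, their contents, and the flag name seen are all made up for the test.

```shell
tmp=$(mktemp -d)

# a.txt: three occurrences on one line; b.txt: one; c.txt: two on separate lines.
printf '%s\n' 'Numero de expediente 1, Numero de expediente 2, Numero de expediente 3' > "$tmp/a.txt"
printf '%s\n' 'Numero de expediente' 'nothing here' > "$tmp/b.txt"
printf '%s\n' 'Numero de expediente' 'again Numero de expediente' > "$tmp/c.txt"

result=$(cd "$tmp" && awk '
  FNR == 1 { count = 0; seen = 0 }                   # reset for each new file
  seen     { next }                                  # file already reported
  { count += gsub(/Numero de expediente/, "&") }     # gsub returns the match count
  count >= 2 { print FILENAME; seen = 1 }
' *.txt)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only a.txt and c.txt are listed. The replacement "&" puts each match back unchanged, so the line content is not altered while counting.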
Could you please try following, should work with GNU awk.
awk 'FNR==1{count=0} /Numero de expediente/{count++} count==2{print FILENAME " has at least 2 instances of searched string in it.";nextfile}' *.txt
The above prints, for example, test.txt has at least 2 instances of searched string in it. In case you want to simply print the file names, then try the following.
awk 'FNR==1{count=0} /Numero de expediente/{count++} count==2{print FILENAME;nextfile}' *.txt
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
FNR==1{ ##Checking condition FNR==1 which will check if this is a 1st line for any new Input_file(since we are reading multiple Input_files from awk in this code).
count=0 ##Setting value of variable count as ZERO here.
} ##Closing BLOCK for FNR condition here.
/Numero de expediente/{ ##Checking condition here if a line contains string Numero de expediente in it then do following.
count++ ##Incrementing variable named count value with 1 here.
} ##Closing BLOCK for string checking condition here.
count==2{ ##Checking condition if variable count value is 2 then do following.
print FILENAME ##Printing Input_file name here, where FILENAME is out of the box awk variable contains current Input_file name in it.
nextfile ##nextfile will skip current Input_file, since we got 2 instances so need NOT to read this Input_file as per OP requirement and SAVE some time here.
} ##Closing BLOCK for count condition here.
' *.txt ##Mentioning *.txt which will pass all .txt extension files to it.
You can try with Perl as well
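perl -lne ' $x++ for(/Numero de expediente/g); if($x>=2) { print $ARGV;$x=0;close(ARGV);next } $x=0 if eof ' *.txt
The $x starts at 0 for each file, and for every pattern match (Numero de expediente) it is incremented, even if the pattern appears twice on the same line. When there are at least 2 matches, the file name is printed, the file handle is closed using close(ARGV), and the next file is read. The $x=0 if eof part resets the counter at the end of a file that had fewer than 2 matches, so its count does not carry over into the next file.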
