Convert floating point numbers to user defined output using AWK - linux

I am trying to convert floating point numbers (columns) from a text file to the user defined output using awk, e-01 -> $\exp 10^{-01}$
Test input:
1.2e-01
1.8e-02
1.12e-03
1.222e+04
1.23e+05
441.2e+05
221.2e+06
Expected results:
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
I have used the following command "awk '{printf "%.4e\n", $1}'", which does not solve this problem.
Any help would be really appreciated.

You may use this simple sed substitution with a capturing group and a back-reference:
sed -E 's/e([+-][0-9]+)/$\\exp 10^{\1}$/' file
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$

Could you please try following, written and tested with shown samples only in GNU awk.
awk '{sub(/ +$/,"");sub(/e/,"$\\exp ");sub(/[-+]/,"10^{&");$0=$0"}$"} 1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
sub(/ +$/,"") ##Removing any trailing spaces from the current line.
sub(/e/,"$\\exp ") ##Replacing the first e with $\exp followed by a space.
sub(/[-+]/,"10^{&") ##Replacing the first - or + with 10^{ followed by the matched sign (&).
$0=$0"}$" ##Appending }$ to the current line.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
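The sub() chain above relies on the first e and the first sign in the line belonging to the exponent. A minimal sketch of a variant that locates the whole exponent with match() and leaves everything else untouched is shown here; it uses only POSIX awk features, and the function name to_tex is made up for illustration:

```shell
# Hypothetical helper: rewrites the first e[+-]digits on each line as $\exp 10^{...}$.
to_tex() {
  awk '{
    if (match($0, /e[-+][0-9]+/)) {
      mant = substr($0, 1, RSTART - 1)            # text before the "e"
      expo = substr($0, RSTART + 1, RLENGTH - 1)  # sign and digits, without the "e"
      rest = substr($0, RSTART + RLENGTH)         # anything after the exponent
      $0 = mant "$\\exp 10^{" expo "}$" rest
    }
    print
  }'
}

tex_out=$(printf '1.2e-01\n441.2e+05\n' | to_tex)
printf '%s\n' "$tex_out"
```

Lines without an exponent are printed unchanged.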

I would treat input as text and do two subsequent replacements, namely:
awk '{$0=gensub("e", "$\\\\exp 10^", 1); $0=gensub("(-|+)([0-9]+)[[:blank:]]+", "{\\1\\2}$", 1); print}' file.txt
Let file.txt be:
1.2e-01
1.8e-02
1.12e-03
1.222e+04
1.23e+05
441.2e+05
221.2e+06
then output is:
1.2$\exp 10^{-01}$
1.8$\exp 10^{-02}$
1.12$\exp 10^{-03}$
1.222$\exp 10^{+04}$
1.23$\exp 10^{+05}$
441.2$\exp 10^{+05}$
221.2$\exp 10^{+06}$
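Explanation: I alter the whole line ($0). First I replace e with $\exp 10^ (the \ needs to be escaped). Second, I search for a sign (- or +) followed by one or more digits followed by one or more spaces or tabs, and replace that with {signdigits}$. Finally I print the altered line. Note that gensub is specific to GNU awk, and that the second pattern requires at least one trailing blank on each line; if your lines have no trailing whitespace, use [[:blank:]]* instead of [[:blank:]]+.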

Related

How to edit output rows from awk with defined position?

Is there a way how to solve this?
I have a bash script, which creates .dat and .log file from source files.
I'm using awk with print and position what I need to print. The problem is with the last position - ID2 (lower). It should be just \*[0-9]{3}\*#, but in some cases there is a string before [0-9]{12}\[00]\>.
Then row looks for example like this:
2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#
What I need is to remove that leading string, so that the row in the file becomes:
2020-01-11 01:01:01;test;test123;*123*#
File structure:
YYYY-DD-MM HH:MM:SS;string;ID1;ID2
I will be happy for any advice, thanks.
awk 'BEGIN{FS=OFS=";"} {$NF=substr($NF,length($NF)-5)}1' file
Here we keep only the last 6 characters of the last field, with the semicolon as the field separator. If there is nothing else in front of that *ID*#, then we keep all of it.
Delete everything before the first *:
$ awk 'BEGIN{FS=OFS=";"}{sub(/^[^*]*/,"",$NF)}1' file
Output:
2020-01-11 01:01:01;test;test123;*123*#
Could you please try following tested and written with shown samples in GNU awk.
awk '
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{12}\[[0-9]+\]>/) && /\*[0-9]{3}\*#/{ ##match() looks for 12 digits, then [, one or more digits, ] and >, setting RSTART and RLENGTH; the second regex checks that the line also contains * followed by 3 digits and *#.
print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH) ##If both conditions are TRUE, printing the text before the match followed by the text after it. Lines that do not match are not printed.
}
' Input_file ##Mentioning Input_file name here.
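Note that the match() version above prints only the rows that contain the prefix. If every row should be kept, a minimal sketch along these lines may help; it assumes ID2 is always the last field and always starts at the first *, and the function name strip_id2_prefix is made up for illustration:

```shell
# Hypothetical helper: drops whatever precedes the first "*" in the last field.
strip_id2_prefix() {
  awk 'BEGIN { FS = OFS = ";" }
  {
    p = index($NF, "*")              # position of the first "*", 0 if there is none
    if (p > 1) $NF = substr($NF, p)  # cut the prefix only when something precedes "*"
    print
  }'
}

id2_out=$(printf '%s\n' \
  '2020-01-11 01:01:01;test;test123;123456789123[00]>*123*#' \
  '2020-01-11 01:01:01;test;test123;*456*#' | strip_id2_prefix)
printf '%s\n' "$id2_out"
```

Rows that already have a clean ID2 pass through unchanged.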

gsub in awk with variable

I want to replace the ">" with variable names starting with ">" and ending with ".". But the following code does not print the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta..
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk has knowledge of the filename it currently operates on through the built-in variable FILENAME; I strip the .fasta extension using gensub, and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" and the content of my variable stub. After that I print it.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
Could you please try following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines(except line that are starting from >).
' sample01.fasta ##Mentioning Input_file name here.
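For reference, the loop in the question failed because the awk program is in single quotes, so the shell never expands $nam and awk sees it as its own (empty) variable. If you want to keep the loop, a minimal sketch is to pass the shell variable in with -v. The function name prefix_headers and the temporary file are only for illustration, and -v interprets backslash escapes, so this assumes file names without backslashes:

```shell
# Hypothetical helper: prefixes every ">" header with the file name minus .fasta.
prefix_headers() {
  nam=$(basename "$1" .fasta)
  # -v copies the shell value into the awk variable nam
  awk -v nam="$nam" '/^>/ { $0 = ">" nam "." substr($0, 2) } { print }' "$1"
}

# Demo on a throwaway copy of the sample file
tmpdir=$(mktemp -d)
printf '>textofDNA\nATCCCCGGG\n>textofDNA2\nATCCCCGGGTTTT\n' > "$tmpdir/sample01.fasta"
fasta_out=$(prefix_headers "$tmpdir/sample01.fasta")
printf '%s\n' "$fasta_out"
rm -rf "$tmpdir"
```

In the original loop this would be called as prefix_headers "$f" for each file.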

How can you compare entries between two columns in linux?

I am trying to figure out whether the first letter of an amino acid is the same as its letter code.
For example, Glycine begins with G and its letter code is also (G)
On the other hand, Arginine begins with A but its letter code is (R)
I am trying to print out, as a result, the amino acids that have the same letter code and starting alphabet.
I have a CSV datafile in which the columns are delimited by ','
Name,One letter code,Three letter code,Hydropathy,Charge,Abundance,DNA codon(s)
Arginine,R,Arg,hydrophilic,+,0.0514,CGT-CGC-CGA-CGG-AGA-AGG
Asparagine,N,Asn,hydrophilic,N,0.0447,AAT-AAC
Aspartate,D,Asp,hydrophilic,-,0.0528,GAT-GAC
Glutamate,E,Glu,hydrophilic,-,0.0635,GAA-GAG
Glutamine,Q,Gln,hydrophilic,N,0.0399,CAA-CAG
Lysine,K,Lys,hydrophilic,+,0.0593,AAA-AAG
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
I believe the code below is one way to compare columns, but I am wondering how I can extract the first letter from the first column and compare that with the alphabet in the second column
awk '{ if ($1 == $2) { print $1; } }' < foo.txt
Could you please try following.
awk 'BEGIN{FS=","} substr($1,1,1) == $2' Input_file
Output will be as follows.
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Explanation: Adding explanation for above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk here.
FS="," ##Setting FS as comma here, field separator.
} ##Closing BLOCK for BEGIN here.
substr($1,1,1) == $2 ##Using the substr function of awk to get a substring; its form is substr(string, start position, length). Getting the 1st letter of $1 and comparing it with $2 of the current line; if TRUE, the current line is printed.
' Input_file ##Mentioning Input_file name here.
Simpler way using grep:
$ grep -E '^(.)[^,]*,\1' input.csv
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
Same as RavinderSingh's expression, but the field separator is set with the -F option instead of in a BEGIN block.
awk -F "," 'substr($1,1,1) == $2' InFile
Serine,S,Ser,hydrophilic,N,0.0715,TCT-TCC-TCA-TCG-AGT-AGC
Threonine,T,Thr,hydrophilic,N,0.0569,ACT-ACC-ACA-ACG
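If only the amino acid names are wanted rather than the whole rows, a small variation on the same comparison might look like this. It is only a sketch: it assumes the first line is always a header, and the function name same_initial is made up for illustration:

```shell
# Hypothetical helper: prints the name when its first letter equals the one-letter code.
same_initial() {
  # NR > 1 skips the header row; substr($1, 1, 1) is a substring of length 1
  awk -F, 'NR > 1 && substr($1, 1, 1) == $2 { print $1 }'
}

amino_out=$(printf '%s\n' \
  'Name,One letter code,Three letter code' \
  'Arginine,R,Arg' \
  'Serine,S,Ser' \
  'Threonine,T,Thr' | same_initial)
printf '%s\n' "$amino_out"
```

With the sample rows this prints Serine and Threonine, one per line.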

BASH - Extract Data from String

I have a log that returns thousands of lines of data, and I want to extract a few values from it.
In the log there is only one line containing the unique unit reference, so I can grep for that using:
grep "unit=Central-C152" logfile.txt
That produces a line of output similar to the following:
a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"
The format of the line may change in that the order of the values won't always be in the same position.
I'm trying to work out how to get the value of 2 and 7 in to separate variables.
I had thought about cut on , or = but as the values aren't in a set order I couldn't work out that best way to do it.
I'm trying to get:
var state=value of 2 without quotes
var mode=value of 7 without quotes
Can anyone advise on the best way to do this ?
Thanks
Could you please try following to create variable's values.
state=$(awk '/unit=Central-C152/ && match($0,/2=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
mode=$(awk '/unit=Central-C152/ && match($0,/7=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
You could print them too by doing following.
echo "$state"
echo "$mode"
Explanation: Adding explanation of command too now.
awk ' ##Starting awk program here.
/unit=Central-C152/ && match($0,/2=\"[^"]*/){ ##Checking that the line contains unit=Central-C152 and using match() to find 2=" followed by everything up to the next ".
print substr($0,RSTART+3,RLENGTH-3) ##Printing RLENGTH-3 characters starting at RSTART+3, which skips the leading 2=" and leaves only the value.
}
' Input_file ##Mentioning Input_file name here.
You are probably better off doing all of the processing in Awk.
awk -F, '/unit=Central-C152/ {
for(i=1;i<=NF;++i)
if($i ~ /^[27]="/) {
b[++k] = $i
sub(/^[27]="/, "", b[k])
sub(/"$/, "", b[k])
gsub(/\\/, "", b[k])
}
print "state " b[1] ", mode " b[2]
}' logfile.txt
This presupposes that the fields always occur in the same order (2 before 7). Maybe you need to change or disable the gsub to remove backslashes in the values.
If you want to do more than print the values, refactoring whatever Bash code you have into Awk is often a better approach than doing this processing in Bash.
Assuming you already have the line in a variable such as with:
line="$(grep 'unit=Central-C152' logfile.txt | head -1)"
You can then simply use the built-in parameter substitution features of bash:
f2=${line#*2=\"} ; f2=${f2%%\"*} ; echo ${f2}
f7=${line#*7=\"} ; f7=${f7%%\"*} ; echo ${f7}
The first command on each line strips off the first part of the line up to and including the <field-number>=". The second command then strips off everything from the first remaining quote onwards. The third, of course, simply echoes the value.
When I run those commands against your input line, I see:
Qual
reset
which is, from what I can see, what you were after.
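One caveat with matching on 2=" or 7=" as plain text is that a longer key ending in the same digit (12=", for example, or 11=" when looking for 1) would match first. A minimal sketch that compares the whole key instead is shown below. It assumes values never contain commas, and the function name get_field is made up for illustration:

```shell
# Hypothetical helper: get_field KEY LINE prints the unquoted value of KEY.
get_field() {
  printf '%s\n' "$2" | awk -F, -v key="$1" '{
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")                       # split each field at its first "="
      if (eq && substr($i, 1, eq - 1) == key) { # the whole key must be equal
        val = substr($i, eq + 1)
        gsub(/^"|"$/, "", val)                  # drop the surrounding quotes
        print val
        exit
      }
    }
  }'
}

line='a3cd23e,85d58f5,unit=Central-C152,11="School",1="Mr Green",2="Qual",7="reset",6="velum"'
state=$(get_field 2 "$line")
mode=$(get_field 7 "$line")
printf '%s\n%s\n' "$state" "$mode"
```

This works regardless of the order of the fields, since each key is looked up separately.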

How To Sed Search Replace Entire Word With String Match In File

I have modified the code found here: sed whole word search and replace
I have been trying to use the proper syntax \< and \> for the sed to match multiple terms in a file.
echo "Here Is My Example Testing Code" | sed -e "$(sed 's:\<.*\>:s/&//ig:' file.txt)"
However, I think, because it's looking into the file, it doesn't match the full word (only exact match) leaving some split words and single characters.
Does anyone know the proper syntax?
Example:
Input:
Here Is My Example Testing Code
File.txt:
example
test
Desired output:
Here Is My Code
Modifying your sed command as follows should produce what you want:
sed -e "$(sed 's:\<.*\>:s/&\\w*\\s//ig:' file.txt)"
Brief explanation:
\b matches the position between a word character and a non-word character, so a whole-word pattern 'test' from file.txt would not match 'Testing'.
Appending \w* to the searched pattern makes it consume the rest of the word; \w matches [a-zA-Z0-9_].
And don't forget to remove the space after each matched word, which is why \s is added.
The following awk could help you with the same.
awk 'FNR==NR{a[$0]=$0;next} {for(i=1;i<=NF;i++){for(j in a){if(tolower($i)~ a[j]){$i=""}}}} 1' file.txt input
***OR***
awk '
FNR==NR{
a[$0]=$0;
next
}
{
for(i=1;i<=NF;i++){
for(j in a){
if(tolower($i)~ a[j]){
$i=""}
}}}
1
' file.txt input
Output will be as follows.
Here Is My   Code
Also, if your Input_file is always delimited by single spaces and you don't want the extra spaces shown in the above output, then you could use the following.
awk 'FNR==NR{a[$0]=$0;next} {for(i=1;i<=NF;i++){for(j in a){if(tolower($i)~ a[j]){$i=""}}};gsub(/ +/," ")} 1' file.txt input
***OR***
awk '
FNR==NR{
a[$0]=$0;
next
}
{
for(i=1;i<=NF;i++){
for(j in a){
if(tolower($i)~ a[j]){
$i=""}
}};
gsub(/ +/," ")
}
1
' file.txt input
Output will be as follows.
Here Is My Code
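Note that tolower($i) ~ a[j] is a regex match anywhere in the word, so a stem like test would also remove a word such as Contest. If only words that begin with a listed stem should go, a sketch using a plain prefix comparison could look like this. It assumes the stems in file.txt are lowercase, and the function name remove_words is made up for illustration:

```shell
# Hypothetical helper: remove_words STEMFILE reads text on stdin and drops
# every word that starts (case-insensitively) with one of the stems.
remove_words() {
  awk 'NR == FNR { if ($0 != "") stems[$0]; next }   # first file: collect the stems
  {
    out = ""
    for (i = 1; i <= NF; i++) {
      drop = 0
      for (s in stems)
        if (index(tolower($i), s) == 1) { drop = 1; break }  # stem is a prefix
      if (!drop) out = out (out == "" ? "" : " ") $i         # rejoin with single spaces
    }
    print out
  }' "$1" -
}

# Demo with a throwaway stem file
stemfile=$(mktemp)
printf 'example\ntest\n' > "$stemfile"
words_out=$(echo 'Here Is My Example Testing Code' | remove_words "$stemfile")
printf '%s\n' "$words_out"
rm -f "$stemfile"
```

Because the kept words are rejoined one by one, no extra spaces are left behind and no gsub cleanup is needed.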
