I have a task where I want to convert the below text to single quote text
The data in the file is:
(A,1)
(DC,2)
(EFG,3)
The output should be like:
('A',1)
('DC',2)
('EFG',3)
I used awk -F print '{$2}' > file.txt
Could you please try the following.
awk 'BEGIN{s1="\047";FS=OFS=","} {sub(/^\(/,"&" s1 );$1=$1 s1} 1' Input_file
Why OP's attempt didn't work: since OP has not defined FS (the field separator), the whole line is treated as one field, so trying to print the 2nd column yields an empty string.
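A quick way to see the effect (a minimal sketch using the sample line from the question):

```shell
# Without -F, the default separator is whitespace, so "(A,1)" is a
# single field and $2 is empty:
printf '(A,1)\n' | awk '{print $2}'
# prints an empty line
# With the separator set to a comma, the fields split as intended:
printf '(A,1)\n' | awk -F, '{print $2}'
# prints "1)"
```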
I have a sequence file that has a repeated pattern that looks like this:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
and so on.
I want to extract the text between and including each >g## and create a new file titled protein_g##.faa
In the above example it would create a file called "protein_g34.faa" and it would be:
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
I was trying to use sed but I am not very experienced using it. My guess was something like this:
$ sed -n '/^>g*/s///p; y/ /\n/' file > "g##"
but I can clearly tell that that is wrong... maybe the right thing is using awk?
Thanks!
Yeah, I would use awk for that. I don't think sed can write to more than one different output stream.
Here's how I would write that:
< input.txt awk '/^\$>/{fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} {print $0 > fname}'
Breaking it down into details:
< input.txt This part reads in the input file.
awk Runs awk.
/^\$>/ On lines which start with the literal string $>, run the piece of code in brackets.
(If the previous step matched) {fname = "protein_" substr($1, 3) ".faa"; print "sending to " fname} Take the first field of the matched line. Remove the first two characters of that field. Surround that with protein_ and .faa. Save it as the variable fname. Print a message about switching files.
This next block has no condition before it. Implicitly, that means that it matches every line.
{print $0 > fname} Take the entire line, and send it to the filename held by fname. If no file is selected, this will cause an error.
Hope that helps!
If awk is an option:
awk '/\|/ {split($1,a,">"); fname="protein_"a[2]".faa"} {print $0 >> fname}' src.dat
awk is better than sed for this problem. You can implement it in sed with
sed -rz 's/(\$>)(g[^ ]*)([^\n]*\n[^\n]*)\n/echo '\''\1\2\3'\'' > protein_\2.faa/ge' file
This solution is nice for showing some sed tricks:
-z for parsing fragments that span several lines
(..) for remembering strings
\$ matching a literal $
[^\n]* matching until end of line
'\'' for a single quote: it ends the single-quoted string, inserts an escaped single quote, and starts a new single-quoted string
\2 for recalling the second remembered string
Write a bash command in the replacement string
e execute result of replacement
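The -z trick is easy to demonstrate on its own (a minimal sketch; the flag is a GNU sed extension, not POSIX):

```shell
# -z splits input on NUL bytes instead of newlines, so the whole
# stream becomes one record and embedded newlines can be edited:
printf 'a\nb\n' | sed -z 's/\n/,/g'
# prints "a,b,"
```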
awk procedure
awk allows records separated by blank lines to be extracted by setting the record separator to an empty string, RS="" (paragraph mode).
Thus the records intended for each file can be got automatically.
The id to be used in the filename can be extracted from field 1 $1 by splitting the (default white-space-separated) field at the ">" mark, and using element 2 of the split array (named id in this example).
Each file is closed right after it is written from awk, to prevent "too many open files" errors if you have many records to process.
The awk procedure
The example data was saved in a file named all.seq and the following procedure used to process it:
awk 'BEGIN{RS="";} {split($1,id,">"); fn="protein_"id[2]".faa"; print $0 > fn; close(fn)}' all.seq
Test results
(terminal listings/outputs)
$ ls
all.seq protein_g104.faa protein_g115.faa protein_g34.faa
$ cat protein_g104.faa
$>g104 | effector probability: 0.65
GIFSSLICATTAVTTGIICHGTVTLATGGTCALATLPAPTTSIAQTRTTTDTSEH
$ cat protein_g115.faa
$>g115 | effector probability: 0.99
IAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
$ cat protein_g34.faa
$>g34 | effector probability: 0.6
GPCKPRTSASNTLTTTLTTAEPTPTTIATETTIATSDSSKTTTIDNITTTTSEAESNTKTESSTIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTSIAQTRTTTDTSEHESTTASSVSSQPTTTEGITTTS
Tested using GNU Awk 5.1.0
I have a tab separated text file. In column 1 and 2 there are family and individual ids that start with a character followed by number as follow:
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
NA1008 NA1008
NA1009 NA1009
I would like to replace NA with HG in both the columns. I am very new to linux and tried the following code and some others:
awk '{sub("NA","HG",$2)';print}' input file > output file
Any help is highly appreciated.
Converting my comment to an answer now: use gsub instead of sub here, because it will globally substitute NA with HG.
awk 'BEGIN{FS=OFS="\t"} {gsub("NA","HG");print}' inputfile > outputfile
Or use the following in case you have several fields and want to perform the substitution only in the 1st and 2nd fields.
awk 'BEGIN{FS=OFS="\t"} {sub("NA","HG",$1);sub("NA","HG",$2);print}' inputfile > outputfile
Change sub to gsub in the 2nd command in case multiple occurrences of NA need to be changed within a field.
The $2 in your call to sub only replaces the first occurrence of NA, and only in the second field; the first field is left untouched.
Note that while sed is more typical for such scenarios:
sed 's/NA/HG/g' inputfile > outputfile
you can still use awk:
awk '{gsub("NA","HG")}1' inputfile > outputfile
See the online demo.
Since no target variable is passed to gsub (which performs multiple search-and-replace operations), the default $0 is used, i.e. the whole record, the current line, so the code above is equivalent to awk '{gsub("NA","HG",$0)}1' inputfile > outputfile.
The 1 at the end triggers printing the current record, it is a shorter variant of print.
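Put together, a minimal sketch of this pattern on a sample line assumed from the question:

```shell
# gsub with no third argument edits $0; the trailing 1 prints the
# (possibly modified) record:
printf 'NA1008\tNA1008\n' | awk '{gsub("NA","HG")}1'
# prints "HG1008	HG1008"
```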
Notice that /^NA/ anchors the match at the beginning of a field:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
HG1008 HG1008
HG1009 HG1009
and save it:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
If you have a tab as separator:
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
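The anchor matters when NA can also occur mid-field; a hypothetical id like DNA1005 contains NA but does not start with it (sketch with assumed data):

```shell
# /^NA/ only matches at the start of a field, so DNA1005 is left
# alone while NA1005 is rewritten:
printf 'DNA1005\tNA1005\n' |
  awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$i)} 1'
# prints "DNA1005	HG1005"
```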
I want to replace the ">" with variable names starting with ">" and ending with ".". But the following code is not printing the variable names.
for f in *.fasta;
do
nam=$(basename $f .fasta);
awk '{print $f}' $f | awk '{gsub(">", ">$nam."); print $0}'; done
Input of first file sample01.fasta:
cat sample01.fasta:
>textofDNA
ATCCCCGGG
>textofDNA2
ATCCCCGGGTTTT
Output expected:
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
$ awk 'FNR==1{fname=FILENAME; sub(/[^.]+$/,"",fname)} sub(/^>/,""){$0=">" fname $0} 1' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Compared to the other answers you've got so far, the above will work in any awk, only does the file-name calculation once per input file rather than once per line or once per >-line, won't fail if the file name contains other .s, won't fail if the file name contains &, and won't fail if the file name doesn't contain the string fasta.
Or like this? You don't really need the looping and basename or two awk invocations.
awk '{stub=gensub( /^([^.]+\.)fasta.*/ , "\\1", "1",FILENAME ) ; gsub( />/, ">"stub); print}' *.fasta
>sample01.textofDNA
ATCCCCGGG
>sample01.textofDNA2
ATCCCCGGGTTTT
Explanation: awk has knowledge of the file it currently operates on through the built-in variable FILENAME; I strip the .fasta extension using gensub and store the result in the variable stub. Then I invoke gsub to replace ">" with ">" followed by the content of my variable stub. After that I print the line.
As Ed points out in the comments: gensub is a GNU extension and won't work on other awk implementations.
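If you need to stay portable, the same idea can be sketched without gensub, building the stub once per file with sub(), which any POSIX awk supports (assumed equivalent for file names that end in .fasta and contain no &):

```shell
# Create the sample input from the question, then strip the
# extension with POSIX sub() instead of GNU gensub:
printf '>textofDNA\nATCCCCGGG\n' > sample01.fasta
awk 'FNR==1{stub=FILENAME; sub(/\.fasta$/,".",stub)}
     {gsub(/>/, ">" stub)} 1' sample01.fasta
# prints:
# >sample01.textofDNA
# ATCCCCGGG
```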
Could you please try the following too.
awk '/^>/{split(FILENAME,array,".");print substr($0,1,1) array[1]"." substr($0,2);next} 1' Input_file
Explanation: Adding explanation for above code here.
awk '
/^>/{ ##Checking condition if a line starts from > then do following.
split(FILENAME,array,".") ##Using split function of awk to split Input_file name here which is stored in awk variable FILENAME.
print substr($0,1,1) array[1]"." substr($0,2) ##Printing substring to print 1st char then array 1st element and then substring from 2nd char to till last of line.
next ##next will skip all further statements from here.
}
1 ##1 will print all lines (except the lines starting with >, which were handled above).
' sample01.fasta ##Mentioning Input_file name here.
Case example:
$ cat data.txt
foo,bar,moo
I can obtain the field data by using cut, assuming , as separator, but only if I know which position it has. Example to obtain value bar (second field):
$ cat data.txt | cut -d "," -f 2
bar
How can I obtain that same bar (or its field number, 2) if I only know it contains the letter a?
Something like:
$ cat data.txt | reversecut -d "," --string "a"
[results could be both "2" or "bar"]
In other words: how can I find out which field of a delimited text file contains a given substring, using linux shell commands/tools?
Of course, programming is allowed, but do I really need looping and conditional structures? Isn't there a command that solves this?
Case of specific shell, I would prefer Bash solutions.
A close solution here, but not exactly the same.
More scenarios based on the same example (upon request):
For a search pattern of m or mo, the results could be both 3 or moo.
For a search pattern of f or fo, the results could be both 1 or foo.
The following simple awk may also help you here.
awk -F, '$2~/a/{print $2}' data.txt
Output will be bar in this case.
Explanation:
-F,: Sets the field separator to a comma, to identify the fields easily.
$2~/a/: Checks whether the 2nd field contains the letter a; if yes, prints that 2nd field.
EDIT: Adding a solution as per OP's comment and the edited question.
Let's say we have the following Input_file:
cat data.txt
foo,bar,moo
mo,too,far
foo,test,test1
fo,test2,test3
Then following is the code for same:
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /fo/){print $i}}}' data.txt
foo
foo
fo
OR
awk -F, '{for(i=1;i<=NF;i++){if($i ~ /mo/){print $i}}}' data.txt
moo
mo
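And if you want the field number instead of the value (the question accepts either), the loop index i already holds it; a minimal sketch:

```shell
# Print the index of the first comma-separated field containing "a":
printf 'foo,bar,moo\n' |
  awk -F, '{for(i=1;i<=NF;i++) if($i ~ /a/){print i; exit}}'
# prints "2"
```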
I work on unix server.
I have many csv files containing, among other info, date fields.
I have to replace some of this date fields with another value, for example 20110915 to 20110815. Their position is variable from a file to another.
The problem is that the substitution is specific to the field position. For example, if my file has a row like this:
blablabla;12;0.2121;20110915;20110915;19951231;popopo;other text;321;20101010
I have to replace only the first date field and not the others, transforming the row into:
blablabla;12;0.2121;20110815;20110915;19951231;popopo;other text;321;20101010
Is there a way to restrict the replacement in the file, using some constraints?
Thanks
You can try awk:
awk 'BEGIN {FS=";";OFS=";"} {if($4=="20110915")$4="20110815"; print}' input.csv
How it works:
FS and OFS define the input and output field separators. It compares the fourth field ($4) against 20110915. If it matches, it is changed to 20110815. The line is then printed.
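Running it on the sample row from the question shows only the 4th field changing:

```shell
# Only $4 is compared and rewritten; the identical date in $5 stays:
printf 'blablabla;12;0.2121;20110915;20110915;19951231;popopo;other text;321;20101010\n' |
  awk 'BEGIN {FS=";";OFS=";"} {if($4=="20110915")$4="20110815"; print}'
# prints:
# blablabla;12;0.2121;20110815;20110915;19951231;popopo;other text;321;20101010
```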
Here is an alternative using gsub in awk:
awk 'BEGIN {FS=";";OFS=";"} {gsub(/20110915/,"20110815",$4); print}' input.csv
Here is a method if you have to substitute in a range of fields/columns (e.g. fields 4 through 4; widen the loop bounds for a larger range):
awk 'BEGIN {FS=";";OFS=";"} {for(i=4;i<=4;i++){gsub(/20110915/,"20110815",$i)}; print}' input.csv