FASTA file - line issues - Linux

I have a FASTA file test.fasta which has the following information:
>QWE2J2_DEFR00000200123 DEFR00000560077.11 DEFR00000100333.7 3:444563-33443(-
)
acccaaagggagggagagagggctattatcatggaaaactaatttttcccagagaatttcctttcaaacctcccagtatc
tatgatcactcccaacgggaggtttaagtgcaacaccaggctgtgtctttctatcacggatttccacccggacacgtgga
acccggcctggtctgtctccaccatcctgactgggctcctgagcttcatggtggagaagggccccaccctgggcagtata
gagacgtcggacttcacgaaaagacaactggcagtgcagagaaaaggggggggggggggggataaagtcttttgtgaatt
atttcctgaagtcgtggaggagattaaacaaaaacagaaagcacaagacgaactcagtagcagaccccagactctcccct
tgccagacgtggttccagaaaaaaaaaaaaacctcgtccagaacgggattcagctgctcaacgggcatgcgccgggggcc
gtcccaaacctcgcagggctccagcaggccaaccggcaccacggactcctgggtggcgccctggcgaacttgtttgtgat
agttgggtttgcagcctttgcttacacggtcaagtaggggggggggggggcgcaggagtg
I need to convert it to CSV in the following format:
>QWE2J2_DEFR00000200123,DEFR00000560077.11,DEFR00000100333.7,3:444563-33443(-),acccaaagggagggagagagggctattatcatggaaaactaatttttcccagagaatttcctttcaaacctcccagtatctatgatcactcccaacgggaggtttaagtgcaacaccaggctgtgtctttctatcacggatttccacccggacacgtggaacccggcctggtctgtctccaccatcctgactgggctcctgagcttcatggtggagaagggccccaccctgggcagtatagagacgtcggacttcacgaaaagacaactggcagtgcagagaaaaggggggggggggggggataaagtcttttgtgaattatttcctgaagtcgtggaggagattaaacaaaaacagaaagcacaagacgaactcagtagcagaccccagactctccccttgccagacgtggttccagaaaaaaaaaaaaacctcgtccagaacgggattcagctgctcaacgggcatgcgccgggggccgtcccaaacctcgcagggctccagcaggccaaccggcaccacggactcctgggtggcgccctggcgaacttgtttgtgatagttgggtttgcagcctttgcttacacggtcaagtaggggggggggggggcgcaggagtg
I have tried in Linux terminal:
input_file=test.fasta; vim -c '0,$s/>\(.*\)\n/>\1,/' -c '0,$s/\(.*\)\n\([^>]\)/\1\2/' -c 'w! my-tmp.fasta.csv' -c 'q!' $input_file; mv my-tmp.fasta.csv $input_file.csv
However, it gives me wrong output:
>QWE2J2_DEFR00000200123 DEFR00000560077.11 DEFR00000100333.7 3:444563-33443(-,)acccaaagggagggagagagggctattatcatggaaaactaatttttcccagagaatttcctttcaaacctcccagtatctatgatcactcccaacgggaggtttaagtgcaacaccaggctgtgtctttctatcacggatttccacccggacacgtggaacccggcctggtctgtctccaccatcctgactgggctcctgagcttcatggtggagaagggccccaccctgggcagtatagagacgtcggacttcacgaaaagacaactggcagtgcagagaaaaggggggggggggggggataaagtcttttgtgaattatttcctgaagtcgtggaggagattaaacaaaaacagaaagcacaagacgaactcagtagcagaccccagactctccccttgccagacgtggttccagaaaaaaaaaaaaacctcgtccagaacgggattcagctgctcaacgggcatgcgccgggggccgtcccaaacctcgcagggctccagcaggccaaccggcaccacggactcctgggtggcgccctggcgaacttgtttgtgatagttgggtttgcagcctttgcttacacggtcaagtaggggggggggggggcgcaggagtg
How can I create this CSV file?

Using awk with RS set to > keeps it simple:
awk -vRS='>' 'NR>1{
gsub(/ /, ",")    # spaces in the header become commas
sub(/\)\n/, "),") # add a comma between header and sequence
gsub("\n", "")    # join the wrapped sequence lines
print RS $0       # put the leading > back
}' file
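As a quick sanity check, the same program run over a tiny made-up two-record stream (not the question's file) yields one CSV line per record:

```shell
# Two hypothetical records; the headers end with ")" just like the real
# data, and each sequence is wrapped over several lines.
out=$(printf '>h1 a b(-)\nACGT\nTTTT\n>h2 c d(-)\nGGGG\n' |
  awk -vRS='>' 'NR>1{
    gsub(/ /, ",")      # spaces in the header become commas
    sub(/\)\n/, "),")   # comma between header and sequence
    gsub("\n", "")      # join the wrapped sequence lines
    print RS $0         # restore the leading >
  }')
echo "$out"
```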
GNU sed with -z looks simple too:
sed -z '
# spaces become commas (only the headers contain spaces)
s/ /,/g
# comma between the ")" ending a header and its sequence
s/)\n/),/g
# join everything into one line
s/\n//g
# put each ">" record back on its own line
s/>/\n>/g
# drop the leading blank line
s/^\n//
' file
The following sed script should also work:
sed -n '
# if line does not start with >
/^>/!{
# append the line to hold space
H
# if it's not the end of the file, start over
$!b
}
# switch pattern space with hold space
x
# add a comma after )
s/)/),/
# remove all the newlines
s/\n//g
# print it all, if hold space not empty
/^$/!p
# switch pattern space with hold space
x
# replace spaces with comma
s/ /,/g
# hold the line
h
' file
Scripts written and tested on repl:
>QWE2J2_DEFR00000200123,DEFR00000560077.11,DEFR00000100333.7,3:444563-33443(-),acccaaagggagggagagagggctattatcatggaaaactaatttttcccagagaatttcctttcaaacctcccagtatctatgatcactcccaacgggaggtttaagtgcaacaccaggctgtgtctttctatcacggatttccacccggacacgtggaacccggcctggtctgtctccaccatcctgactgggctcctgagcttcatggtggagaagggccccaccctgggcagtatagagacgtcggacttcacgaaaagacaactggcagtgcagagaaaaggggggggggggggggataaagtcttttgtgaattatttcctgaagtcgtggaggagattaaacaaaaacagaaagcacaagacgaactcagtagcagaccccagactctccccttgccagacgtggttccagaaaaaaaaaaaaacctcgtccagaacgggattcagctgctcaacgggcatgcgccgggggccgtcccaaacctcgcagggctccagcaggccaaccggcaccacggactcctgggtggcgccctggcgaacttgtttgtgatagttgggtttgcagcctttgcttacacggtcaagtaggggggggggggggcgcaggagtg
Prefer sed over vim for scripted edits like this.

Related

Using awk to make changes to nth character in nth line in a file

I have written an awk command
awk 'NR==5 {sub(substr($1,14,1),(substr($1,14,1) + 1)); print "test.py"}' > test.py
This is trying to change the 14th character on the 5th line of a python file. For some reason this doesn't stop executing and I have to break it. It also deletes the contents of the file.
Sample input:
import tools
tools.setup(
name='test',
tagvisc='0.0.8',
packages=tools.ges(),
line xyz
)
Output:
import tools
tools.setup(
name='test',
tagvisc='0.0.9',
packages=tools.ges(),
line xyz
)
If I understand the nuances of what you need to do now: you will need to split the first field of the 5th record into an array using "." as the field separator, remove the "\"," from the end of the 3rd element of the array (optional), then increment the number and put the field back together. You can do so with:
awk '{split($1,a,"."); sub(/["],/,"",a[3]); $1=a[1]"."a[2]"."(a[3]+1)"\","}1'
(NR==5 omitted for example)
Example Use/Output
$ echo 'tagvisc="3.4.30"', |
awk '{split($1,a,"."); sub(/["],/,"",a[3]); $1=a[1]"."a[2]"."(a[3]+1)"\","}1'
tagvisc="3.4.31",
I'll leave redirecting to a temp file and then back to the original to you. Let me know if this isn't what you need.
Adding NR == 5 you would have
awk 'NR==5 {split($1,a,"."); sub(/["],/,"",a[3]); $1=a[1]"."a[2]"."(a[3]+1)"\","}1' test.py > tmp; mv -f tmp test.py
Get away from the fixed line number (NR==5) and fixed character position (14) and instead look at dynamically finding what you want to change/increment, eg:
$ cat test.py
import tools
tools.setup(
name='test',
tagvisc='0.0.10',
packages=tools.ges(),
line xyz
)
One awk idea to increment the 10 (the 3rd numeric string in the tagvisc line):
awk '
/tagvisc=/ { split($0,arr,".") # split line on periods
sub("." arr[3]+0 "\047","." arr[3]+1 "\047") # replace .<oldvalue>\047 with .<newvalue>\047; \047 == single quote
}
1
' test.py
NOTES:
arr[3] = 10',; with arr[3]+0, awk takes the leftmost all-numeric content, strips off everything else, and then adds 0, leaving arr[3]+0 = 10; the same logic applies to arr[3]+1 (= 11). Basically it is a trick for discarding any suffix that is not numeric.
if there are multiple lines in the file with the string tagvisc='x.y.z' then this will change z in all of the lines; we can get around this by adding some more logic to only change the first occurrence, but I'll leave that out for now assuming it's not an issue
This generates:
import tools
tools.setup(
name='test',
tagvisc='0.0.11',
packages=tools.ges(),
line xyz
)
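The numeric-prefix behaviour behind arr[3]+0 can be checked in isolation:

```shell
# awk converts a string to a number by taking its leftmost numeric prefix,
# so "10'," + 0 is 10 and "10'," + 1 is 11.
out=$(echo "10'," | awk '{print $1+0, $1+1}')
echo "$out"
```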
If the objective is to overwrite the original file with the new values you have a couple options:
# use temporary file:
awk '...' test.py > tmp ; mv tmp test.py
# if using GNU awk, and once accuracy of script has been verified:
awk -i inplace '...' test.py
Using awk to make changes to nth character in [mth] line in a file:
$ awk 'BEGIN{FS=OFS=""}NR==5{$18=9}1' file # > tmp && mv tmp file
Outputs:
import tools
tools.setup(
name='test',
tagvisc='0.0.9', <----- the arrow is not part of the output; it marks the changed line
packages=tools.ges(),
line xyz
)
Explained:
$ awk '
BEGIN {
FS=OFS="" # set the field separators to empty and you can reference
} # each char in record by a number
NR==5 { # 5th record
$18=9 # and 18th char is replaced with a 9
}1' file # > tmp && mv tmp file # output to a tmp file and replace
Notice: some awks (probably all but GNU awk) will fail if you try to replace a multibyte character with a single-byte one (for example, replacing a UTF-8 ä (0xc3 0xa4) with an a (0x61) will result in 0x61 0xa4). Naturally, an ä before the position you'd like to replace will throw your calculation off by 1.
Oh yeah, you can replace one char with multiple chars but not vice versa.
something like this...
$ awk 'function join(a,k,s,sep) {for(k in a) {s=s sep a[k]; sep="."} return s}
BEGIN {FS=OFS="\""}
/^tagvisc=/{v[split($2,v,".")]++; $2=join(v)}1' file > newfile
Using GNU awk for the 3rd arg to match() and "inplace" editing:
$ awk -i inplace '
match($0,/^([[:space:]]*tagvisc=\047)([^\047]+)(.*)/,a) {
split(a[2],ver,".")
$0 = a[1] ver[1] "." ver[2] "." ver[3]+1 a[3]
}
{ print }
' test.py
$ cat test.py
import tools
tools.setup(
name='test',
tagvisc='0.0.9',
packages=tools.ges(),
line xyz
)

linux append missing quotes to csv fields/header

I have the following csv file:
id,"path",score,"file"
1,"/tmp/file 1.csv",5,"file 1.csv"
2,"/tmp/file2.csv",15,"file2.csv"
I want to convert it to:
"id","path","score","file"
"1","/tmp/file 1.csv","5","file 1.csv"
"2","/tmp/file2.csv","15","file2.csv"
How can I do it using sed/awk or any another linux tool?
Assuming that you want to quote all entries, that the comma is the separator, and that there are no white spaces between separator and entry (that case can be handled as well, but for brevity I didn't include it).
$ sed -e 's/^/"/' -e 's/$/"/' -e 's/,/","/g' -e 's/""/"/g' csv1 > csv2
It inserts a quote at the beginning (^) and the end ($) of each line, replaces every comma with ",", and at the end removes the duplicated quotes around fields that were already quoted.
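A quick run on a hypothetical two-line sample (same commands) shows both the quoting and the de-duplication at work:

```shell
# One unquoted header field, one already-quoted field with a space in it.
out=$(printf 'id,"path"\n1,"/tmp/f 1.csv"\n' |
  sed -e 's/^/"/' -e 's/$/"/' -e 's/,/","/g' -e 's/""/"/g')
echo "$out"
```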
Using Miller, if you run
mlr --csv --quote-all cat input.csv >output.csv
you will have
"id","path","score","file"
"1","/tmp/file 1.csv","5","file 1.csv"
"2","/tmp/file2.csv","15","file2.csv"

Using sed to add text after a pattern, but the added text comes from a list file

How can I use sed to locate a string, and add text from another file after the string?
File 1:
stage ('Clone Repo31') {
steps {
git credentialsId: '', url: '/stash/scm/'
}
}
stage ('Zip Repo31') {
steps {
sh"""
tar --exclude='*.tar' -cvf .tar *
"""
}
}
steps {
git credentialsId: '', url: '/stash/scm/'
}
}
stage ('Zip Repo32') {
steps {
sh"""
tar --exclude='*.tar' -cvf .tar *
"""
}
}
File 2:
randomRepo.git
differentRandomRepo.git
I want to be able to use sed to read the second file and add the contents of each of its lines after each occurrence of stash/scm/.
Desired output:
stage ('Clone Repo31') {
steps {
git credentialsId: '', url: '/stash/scm/randomRepo.git'
}
}
stage ('Zip Repo31') {
steps {
sh"""
tar --exclude='*.tar' -cvf .tar *
"""
}
}
steps {
git credentialsId: '', url: '/stash/scm/differentRandomRepo.git'
}
}
stage ('Zip Repo32') {
steps {
sh"""
tar --exclude='*.tar' -cvf .tar *
"""
}
}
Can this be done with sed? I'm having issues reading it from a list file and it's confusing since it has a lot of slashes in it. I've been able to use normal sed substitution but I don't know how to do substitution by reading another file.
In the following I present an almost pure sed solution.
sed has an r command to read files, so you could in principle use that to read the file2. However, no subsequent command will affect the lines read from the file, so I cannot think of any way of using the r command effectively to do what you ask.
However, a solution is possible if file1 and file2 are both given in input to sed.
In the following, in order to distinguish the two files, I put a marker line (-----) between them; I take for granted that it does not occur in file2 (it could be anywhere in file1 without creating any problems, however).
cat file2 <(echo '-----') file1 | sed -f script.sed
where script.sed is the following:
1{ # only on line 1
:a # begin while
/-----/!{ # while the line does not contain the marker
N # append the following line
ba # end while
} # here the pattern space is a multiline containing the list
s/\n-----// # remove the last newline and the marker
h # put the multiline in the hold space
d # delete, as we don't want to print anything so far
} # that's it, lines from 1 to the marker are processed
/stash\/scm\//{ # for lines matching this pattern
G # we append the full hold space
s/'\n\([^\n]*\)/\1'/ # and position the first entry in the list appropriately
x # then we swap pattern and hold space
s/[^\n]*\n// # remove the first element of the list
x # and swap again
} # now the hold space has one item less
This is a bash script that uses sed and reads File_2 (The file containing the replacements) line by line, thus reading one replacement at a time. I then replaced the lines in File_1 with a sed script.
while IFS= read -r line; do
sed -i "0,/\/stash\/scm\/'/{s|/stash/scm/'|/stash/scm/${line}'|}" File_1.txt
done < File_2.txt
Some tricks used to do this:
sed '0,/Apple/{s/Apple/Banana/}' input_filename Replace only the first occurrence in filename of the string Apple with the string Banana
Using double quotes for the sed script to allow for variable expansion ${line}
Making sure the search string to replace was being changed each iteration. This was done by including the ending single quote char ' for the search argument in the sed script s|/stash/scm/'|
Reading a file line by line in a bash script
while IFS= read -r line; do
echo $line
done < File_2.txt
Read File line by line in bash
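The first-occurrence trick is easy to see on a toy input (the 0,/re/ address form is a GNU sed extension):

```shell
# Only the first Apple falls inside the 0,/Apple/ range, so only it changes.
out=$(printf 'Apple\nApple\n' | sed '0,/Apple/{s/Apple/Banana/}')
echo "$out"
```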
You want to have lines like
sed 's#'/stash/scm/'#&something_from_file2#' file1
You can make these lines with
# Note:
# / is not a delimiter, but part of the path
# % is the delimiter in the current sed-command
# # is the delimiter in the generated command.
sed 's%.*%s#/stash/scm/#\&&#%' file2
You can generate these commands on the fly and execute them on file1.
sed -f <(sed 's%.*%s#/stash/scm/#\&&#%' file2) file1
One problem is left: both commands will substitute all matches.
To fix that, I use the single quote that comes right after the match.
Once something has been inserted before that quote, the text no longer matches the string /stash/scm/' (quote included), so each occurrence is substituted only once.
You want to generate lines like
s#(/stash/scm/)(')#\1randomRepo.git\2#
s#(/stash/scm/)(')#\1differentRandomRepo.git\2#
Each substitution should be done only once, so we consider file2 as one long line using the option -z:
sed -rzf <(sed 's%.*%s#(/stash/scm/)('\'')#\\1&\\2#%' file2) file1
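To see what the inner sed generates, feed it a hypothetical one-line file2 (here on stdin):

```shell
# Each input line becomes one s### command for the outer sed;
# \1 and \2 survive as literal backreferences in the generated script.
out=$(printf 'randomRepo.git\n' |
  sed 's%.*%s#(/stash/scm/)('\'')#\\1&\\2#%')
echo "$out"
```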

Remove new line character by checking the expression, using sed

Have to write a script which updates the file in this way.
raw file:
<?blah blah blah?>
<pen>
<?pineapple?>
<apple>
<pen>
Final file:
<?blah blah blah?><pen>
<?pineapple?><apple><pen>
Wherever the newline character in the file is not followed by
<?
we have to remove the newline, in order to append the line to the end of the previous one.
Also it will be really helpful if you explain how your sed works.
Perl solution:
perl -pe 'chomp; substr $_, 0, 0, "\n" if $. > 1 && /^<\?/'
-p reads the input line by line, printing each line after changes
chomp removes the final newline
substr with 4 arguments modifies the input string, here it prepends newline if it's not the first line ($. is the input line number) and the line starts with <?.
Sed solution:
sed ':a;N;$!ba;s/\n\(<[^?]\)/\1/g' file > newfile
The basic idea is to replace every
\n followed by < not followed by ?
with what you matched except the \n.
When you are happy with a solution that puts every <? at the start of a line, you can combine tr with sed.
tr -d '\n' < inputfile| sed 's/<?/\n&/g;$s/$/\n/'
Explanation:
I use tr ... < inputfile and not cat inputfile | tr ..., avoiding an extra cat process.
The sed command has 2 parts.
In s/<?/\n&/g it will insert a newline and with & it will insert the matched string (in this case always <?, so it will only save one character).
With $s/$/\n/ a newline is appended at the end of the last line.
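For example (GNU sed is assumed, for \n in the replacement). Note that in this sketch the output starts with a blank line when the very first line already begins with <?:

```shell
# All newlines are stripped first, then one is re-inserted before every <?.
out=$(printf '<?a?>\n<p>\n<?b?>\n' | tr -d '\n' | sed 's/<?/\n&/g;$s/$/\n/')
echo "$out"
```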
EDIT: when you only want a newline before a <? that already started a line (not before every <? inside a line),
you can use awk:
awk '$1 ~ /^<\?/ && NR > 1 {print ""} {printf("%s",$0)} END {print ""}'
Explanation:
Consider the newline as the start of a line, not the end. Then the question becomes: write a newline when the line starts with <?. You must escape the ? and use ^ for the start of the line:
awk '$1 ~ /^<\?/ && NR > 1 {print ""}'
Next, print each line you read without its newline character: printf("%s", $0).
And you want one final newline at the end: END {print ""}.
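Put together, with a guard so the first line does not get a leading blank line, the idea can be checked against the sample:

```shell
# Emit a newline *before* each <? line (except the first), print every line
# without its newline, and add one final newline at the end.
out=$(printf '<?blah blah blah?>\n<pen>\n<?pineapple?>\n<apple>\n<pen>\n' |
  awk '$1 ~ /^<\?/ && NR > 1 {print ""} {printf "%s", $0} END {print ""}')
echo "$out"
```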

replace string in a file with a string from within the same file

I have a file like this (tens of variables):
PLAY="play"
APPS="/opt/play/apps"
LD_FILER="/data/mysql"
DATA_LOG="/data/log"
I need a script that will output the variables into another file like this (with space between them):
PLAY=${PLAY} APPS=${APPS} LD_FILER=${LD_FILER}
Is it possible ?
I would say:
$ awk -F= '{printf "%s=${%s} ", $1,$1} END {print ""}' file
PLAY=${PLAY} APPS=${APPS} LD_FILER=${LD_FILER} DATA_LOG=${DATA_LOG}
This loops through the file and prints the content before = in a format var=${var} together with a space. At the end, it prints a new line.
Note this leaves a trailing space at the end of the line. If this matters, we can check how to improve it.
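If the trailing space matters, one possible fix (a sketch: print the separator before every entry except the first) is:

```shell
# Print " " before each var=${var} pair except the first, then one newline.
out=$(printf 'PLAY="play"\nAPPS="/opt/play/apps"\n' |
  awk -F= '{printf "%s%s=${%s}", (NR > 1 ? " " : ""), $1, $1} END {print ""}')
echo "$out"
```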
< input sed 's/\(.*\)=.*/\1=${\1}/' | tr '\n' ' '; echo
sed 's/\(.*\)=.*/\1=${\1}/;H;$!d
x;y/\n/ /;s/.//' YourFile
Your sample output excludes the last line, so if that is important:
sed '/DATA_LOG=/ d
s/\(.*\)=.*/\1=${\1}/;H;$!d
x;y/\n/ /;s/.//' YourFile

Resources