adding an adapter sequence to the end of a fastq file - linux

I have a large fastq file and I want to add the sequence "TTAAGG" to the end of each sequence in my file (the 2nd line then every 4th line after), while still maintaining the fastq file format. For example:
this is the first line I start with:
#HWI-D00449:41:C2H8BACXX:5:1101:1219:2053 1:N:0:
GCAATATCCTTCAACTA
+
FFFHFHGFHAGGIIIII
and I want it to print out:
#HWI-D00449:41:C2H8BACXX:5:1101:1219:2053 1:N:0:
GCAATATCCTTCAACTATTAAGG
+
FFFHFHGFHAGGIIIII
I imagine sed or awk would be good for this, but I haven't been able to find a solution that allows me to keep the fastq format.
I tried:
awk 'NR%4==2 { print $0 "TTAAGG"}' < file_in.fastq > fileout_fastq
which added the TTAAGG to the second line and then every fourth line, but it also deleted the other three lines.
Does anyone have any suggestions for a command line I can use? If you know of a package currently available that can do this, please let me know!

Try this with GNU sed:
sed '2~4s/$/TTAAGG/' file
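The 2~4 address (start at line 2, step 4) is a GNU sed extension. If GNU sed is not available, the awk attempt from the question can be repaired by modifying the matching line and then printing every line. A minimal sketch, where add_adapter is just a hypothetical helper name that reads FASTQ on stdin:

```shell
# add_adapter ADAPTER: hypothetical helper, reads FASTQ on stdin.
# Appends ADAPTER to line 2 of every 4-line record and prints all lines.
add_adapter() {
    awk -v adapter="$1" 'NR % 4 == 2 { $0 = $0 adapter } { print }'
}

# Demo on the record from the question.
printf '%s\n' \
    '#HWI-D00449:41:C2H8BACXX:5:1101:1219:2053 1:N:0:' \
    'GCAATATCCTTCAACTA' \
    '+' \
    'FFFHFHGFHAGGIIIII' | add_adapter TTAAGG
```

For a real file this would be called as add_adapter TTAAGG < file_in.fastq > file_out.fastq. Note that the quality line is left unchanged, as in the expected output above, so the sequence and quality strings no longer have the same length.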

Related

Simple way to remove multi-line string using sed

Using sed, is there a way to remove multiple lines from a text file based on some starting and ending expressions?
I have known markers in the file and want to remove everything between (markers inclusive). I have seen some really complicated solutions and I would like to do this without resorting to micro commands.
My file looks something like this:
cat /tmp/foobar.txt
this is line 1
this is line 3
tomcat.util.scan.StandardJarScanFilter.jarsToSkip=\
annotations-api.jar,\
ant-junit*.jar,\
ant-launcher.jar,\
ant.jar,\
asm-*.jar,\
aspectj*.jar,\
bootstrap.jar,\
catalina-ant.jar,\
catalina-ha.jar,\
catalina-ssi.jar,\
catalina-storeconfig.jar
the end leave me
and me
I want to remove everything starting at tomcat.util all the way to the last .jar
TL;DR
I think this is the simplest way, and there is no need for the assembly-like micro commands:
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
which produces
this is line 1
this is line 3
the end leave me
and me
If you want to remove the lines from the file itself rather than write the result to stdout, use the in-place flag -i:
sed -i '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
So... how does this work?
sed commands, like vi commands, operate on an address. Normally we don't specify an address, which simply applies the command to all lines of the file. For example, to replace "the" with "that" throughout a file we'd normally do
sed -i 's/the/that/g' /tmp/foobar.txt
i.e. this applies the substitute (s) command to every line in the file.
In this case you want to delete some lines, so we can use the delete (d) command. But we need to tell it where to delete, so we need to give it an address.
The format of a sed command is
[addr][!]command[options]
(see the docs)
If no address is specified, the command is applied to all lines; if the ! is specified, it is applied to all lines that don't match the address. So far so good.
The trick here is that addr can be a single address or a range of addresses. An address can be a line number or a regex pattern. You use a , between two addresses to specify a range.
So to delete lines 5 to 8 inclusive you could do
sed -i '5,8d' /tmp/foobar.txt
In this case, rather than knowing the line numbers, we know some "markers" and can use regexes instead. The first marker, a line starting with tomcat.util, is found by the regex
/^tomcat\.util.*$/
The second marker is a bit trickier, but if we look we can see that the final line to remove is the first one that does not end with a \, so we can match a line that consists of "anything, but does not end with \"
/^.*[^\]$/
On its own, the second regex would match a whole bunch of lines. Inside a range, though, the second address only ends the range: it is tested against the lines after the one where the range started, and the range stops at the first of them that matches.
Putting that all together, we want to delete (d) all lines in the range that starts at a line beginning with tomcat.util and ends at the first following line that does not end in \, i.e.
sed '/^tomcat\.util.*$/,/^.*[^\]$/d' /tmp/foobar.txt
hope that helps ;-)
Cheers
Karl
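The address forms described above (regex range, line-number range, and ! negation) can be tried on throwaway input without touching a real file. This is only a sketch: delete_block is a hypothetical helper name, and the regexes passed to it must not contain an unescaped /.

```shell
# delete_block START_REGEX END_REGEX: hypothetical helper, reads stdin.
# Deletes from a line matching START_REGEX through the next line
# (after the start line) that matches END_REGEX.
delete_block() {
    sed "/$1/,/$2/d"
}

# Regex range: drop the continuation block, keep the rest.
printf '%s\n' 'keep 1' 'start=\' 'a.jar,\' 'b.jar' 'keep 2' |
    delete_block '^start' '[^\]$'

# Line-number range with negation: delete everything except lines 2 to 3.
printf '%s\n' 'a' 'b' 'c' 'd' | sed '2,3!d'
```

The first command prints only "keep 1" and "keep 2"; the second prints only "b" and "c".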
Awk is generally more useful than sed for anything spanning lines. Using any awk in any shell on every Unix box:
$ awk '!/\.jar/{f=0} /tomcat\.util/{f=1} !f' file
this is line 1
this is line 3
the end leave me
and me
This might work for you (GNU sed):
sed -n '/tomcat\.util/{:a;n;/\.jar/ba};p' file
Turn off implicit printing using the -n option.
Match on a line containing tomcat.util.
Keep fetching the next line for as long as it contains .jar.
Print all other lines.
Alternative:
sed -E '/tomcat\.util/{:a;$!N;/\.jar(,\\)?$/s/\n//;ta;D}' file
Gather up lines beginning tomcat.util and ending either .jar,\ or .jar, removing newlines until the end-of-file or a mis-match and then delete the collection.

How to count number of lines with only 1 character?

I'm trying to print just the count of lines that contain only 1 character.
I have a file with 200k lines; some of the lines have only one character (any type of character).
Since I have no experience, I have googled a lot, scraped documentation and come up with this mixed solution from different sources:
awk -F^\w$ '{print NF-1}' myfile.log
I was expecting that to filter lines with a single character, and this pattern seems to work:
^\w$
However, I'm not getting the number of lines containing a single character. Instead I get something like this:
If a non-awk solution is OK:
grep -c '^.$' myfile.log
You could try the following:
awk '/^.$/{++c}END{print c+0}' file
The variable c is incremented for every line containing only 1 character (any character).
When the parsing of the file is finished, the variable is printed.
In awk, rules like your {print NF-1} are executed for each line. To print only one thing for the whole file you have to use END { print ... }. There you can print a counter which you increment each time you see a line with one character.
However, I'd use grep instead because it is easier to write and faster to execute:
grep -xc . yourFile
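To see the awk and grep approaches side by side, both can be run on a few throwaway lines. A small sketch; count_single_char_lines is a hypothetical wrapper name, and the c + 0 is there so that an input with no matching lines prints 0 rather than an empty line:

```shell
# count_single_char_lines [FILE...]: hypothetical wrapper around the awk counter.
# Reads stdin when no file is given.
count_single_char_lines() {
    awk '/^.$/ { c++ } END { print c + 0 }' "$@"
}

# Five lines, two of which ("a" and "#") consist of exactly one character.
printf '%s\n' 'a' 'bc' '' '#' 'xyz' | count_single_char_lines
printf '%s\n' 'a' 'bc' '' '#' 'xyz' | grep -c '^.$'
printf '%s\n' 'a' 'bc' '' '#' 'xyz' | grep -xc .
```

All three commands print 2 for this input.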

Need to create a single record from 3 consecutive lines of text

I can easily write a little parse program to do this task, but I just know that some linux command line tool guru can teach me something new here. I've pulled apart a bunch of files to gather some data within such that I can create a table with it. The data is presently in the format:
.serviceNum=6360,
.transportId=1518,
.suid=6360,
.serviceNum=6361,
.transportId=1518,
.suid=6361,
.serviceNum=6362,
.transportId=1518,
.suid=6362,
.serviceNum=6359,
.transportId=1518,
.suid=6359,
.serviceNum=203,
.transportId=117,
.suid=20203,
.serviceNum=9436,
.transportId=919,
.suid=16294,
.serviceNum=9524,
.transportId=906,
.suid=17613,
.serviceNum=9439,
.transportId=917,
.suid=9439,
What I would like is this:
.serviceNum=6360,.transportId=1518,.suid=6360,
.serviceNum=6361,.transportId=1518,.suid=6361,
.serviceNum=6362,.transportId=1518,.suid=6362,
.serviceNum=6359,.transportId=1518,.suid=6359,
.serviceNum=203,.transportId=117,.suid=20203,
.serviceNum=9436,.transportId=919,.suid=16294,
.serviceNum=9524,.transportId=906,.suid=17613,
.serviceNum=9439,.transportId=917,.suid=9439,
So, the question is, is there a linux command line tool that will somehow read through the file and auto remove the EOL/CR on the end of every 2nd and 3rd line? I've seen old school linux gurus do incredible things on the command line and this is one of those instances where I think it's worth my time to inquire. :)
TIA
O
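Use paste and see the magic. By default paste joins the lines with tabs; -d '\0' tells it to use an empty delimiter instead, which matches the output you asked for:
paste -d '\0' - - - < inputfile.txt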
Perl to the rescue:
perl -pe 'chomp if $. % 3' < input
-p processes the input line by line printing each line;
chomp removes the final newline;
$. contains the input line number;
% is the modulo operator.
perl -alne '$a.=$_; if($.%3==0){print $a; $a=""}' filename
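The same modulo idea works in awk if Perl is not at hand. A minimal sketch, with join_every as a hypothetical helper name: it prints each line without its newline and emits a newline after every Nth line.

```shell
# join_every N: hypothetical helper, reads stdin.
# Concatenates every N consecutive input lines into one output line.
join_every() {
    awk -v n="$1" '
        { printf "%s", $0 }             # print the line without a newline
        NR % n == 0 { print "" }        # end the record after every n-th line
        END { if (NR % n) print "" }    # terminate a trailing partial record
    '
}

printf '%s\n' \
    '.serviceNum=6360,' '.transportId=1518,' '.suid=6360,' \
    '.serviceNum=6361,' '.transportId=1518,' '.suid=6361,' | join_every 3
```

For the input file from the question this would be run as join_every 3 < input.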

Copy a section within two keywords into a target file

I have thousands of files in a directory, and each file contains a number of defined variables, each starting with the keyword DEFINE and ending with a semicolon (;). I want to copy all occurrences of the data between these two markers (inclusive) into a target file.
Example: Below is the content of the text file:
/* This code is for lookup */
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
END.
Now, from the above content, I just want to copy the section starting with DEFINE and ending with ; into a target file, i.e. the output should be:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
This needs to be done for thousands of scripts and multiple occurrences. Please help out.
Thanks a lot, the provided code works, but only to a limited extent: it works when the whole statement is on a single line, but the data is not always on one line. It can be spread over multiple lines like below:
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
The code also appears in the above fashion. I need to capture all the data between DEFINE and the semicolon (;); after every DEFINE there will be an ending semicolon. This is the pattern.
It sounds like you want grep(1):
grep '^DEFINE.*;$' input > output
Try using grep. Let's say you have files with the extension .txt in the current directory:
grep -ho 'DEFINE.*;' *.txt > outfile
Output:
DEFINE variable as a1 expr= extract (n123f1 using brach, code);
Short description
-o prints only the matching string rather than the whole line, which is useful if the line also contains something else that you want to omit.
-h suppresses the file names before each match.
Read the man page of grep by typing man grep in your terminal.
EDIT
If you need to match across multiple lines, you can use pcregrep with the -M option:
pcregrep -M 'DEFINE.*?(\n|.)*?;' *.txt > outfile
Works fine on my system. Check man pcregrep for more details
Reference : SO Question
One can make a simple solution using sed with version :
sed -n -e '/^DEFINE/{:a p;/;$/!{n;ba}}' your-file
Option -n prevents sed from printing every line; then each time a line begins with DEFINE, print the line (command p) then enter a loop: until you find a line ending with ;, grab the next line and loop to the print command. When exiting the loop, you do nothing.
It looks a bit dirty; it seems that the version sed15 has a shorter (and more straightforward) way to achieve this in one line:
sed -n -e '/^DEFINE/,/;$/p' your-file
Indeed, in my experience only that version of sed handles both patterns on one line; with other versions of sed, like mine under Cygwin, the range patterns had to be on separate lines to work properly.
One last thing to remember: it does not handle nested ranges, i.e. it stops printing after the first end pattern it encounters, even if several start patterns have been matched. Prefer something with awk if this is a feature you are looking for.
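For completeness, here is how the same extraction might look in awk, using a flag that is switched on at a line starting with DEFINE and off after a line ending in a semicolon. This is a sketch under assumptions: extract_defines is a hypothetical helper name, DEFINE is assumed to start at the beginning of a line, and the closing ; is assumed to be the last non-blank character on its line.

```shell
# extract_defines [FILE...]: hypothetical helper, reads stdin when no file is given.
# Prints every block from a line starting with DEFINE through the next
# line that ends with a semicolon, for single-line and multi-line blocks.
extract_defines() {
    awk '
        FNR == 1                 { in_block = 0 }   # do not carry an open block into the next file
        /^DEFINE/                { in_block = 1 }   # start marker
        in_block                 { print }
        in_block && /;[ \t]*$/   { in_block = 0 }   # end marker, printed above
    ' "$@"
}

extract_defines <<'EOF'
/* This code is for lookup */
DEFINE variable as a1 expr= if branchno > 55
then
extract (n123f1 using brach, code)
else
branchno = null
;
END.
EOF
```

Across a whole directory it could be run as extract_defines *.txt > outfile; with thousands of files a find ... -exec variant may be needed to stay under the argument-length limit.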

Combine matching lines using sed or awk?

I have a file like the following:
1,
cake:01351
12,
bun:1063
scone:13581
biscuit:1931
14,
jelly:1385
I need to convert it so that when a number is read at the start of a line it is combined with the line beneath it, but if there is no number at the start the line is left as is. This would be the output that I need:
1,cake:01351
12,bun:1063
scone:13581
biscuit:1931
14,jelly:1385
I'm having a lot of trouble achieving this with sed; it seems it may not be the best tool for what I think should be quite simple.
Any suggestions greatly appreciated.
A very basic sed implementation:
sed -e '/^[0-9]/{N;s/\n//;}'
This relies on the first character on only the 'number' lines being a number (as you specified).
It
matches lines starting with a number, ^[0-9]
brings in the next line, N
deletes the embedded newline, s/\n//
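The three steps above can be checked against the sample from the question. A small sketch, with join_numbered as a hypothetical wrapper name around the same sed script:

```shell
# join_numbered [FILE...]: hypothetical wrapper, reads stdin when no file is given.
# On a line starting with a digit, append the next line (N) and
# remove the newline between them (s/\n//).
join_numbered() {
    sed -e '/^[0-9]/{N;s/\n//;}' "$@"
}

printf '%s\n' '1,' 'cake:01351' '12,' 'bun:1063' 'scone:13581' \
    'biscuit:1931' '14,' 'jelly:1385' | join_numbered
```

Because N consumes the following line, two number lines in a row would be glued to each other rather than each to its own data line; the sample data does not contain that case.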
This is a file on my intranet. I can't recall where I found the handy sed one-liner. You might find something if you search for 'sed one-liner'
Have you ever needed to combine lines of text, but found it too tedious to do by hand?
For example, imagine that we have a text file with hundreds of lines which look like this:
14/04/2003,10:27:47,0
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
14/04/2003,10:30:51,600
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.010,0.975,0.005
14/04/2003,10:34:02,600
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.011,0.975,0.005
IdVg,3.000,-1.000,0.050,0.006
GmMax,0.010,0.975,0.005
Each date (14/04/2003) is the start of a data record, and it continues on the next four lines.
We would like to input this to Excel as a 'comma separated value' file, and see each record in its own row.
In our example, we need to append any line starting with a G or I to the preceding line, and insert a comma, so as to produce the following:
14/04/2003,10:27:47,0,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
14/04/2003,10:30:51,600,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
14/04/2003,10:34:02,600,IdVg,3.000,-1.000,0.050,0.006,GmMax,0.011,0.975,0.005,IdVg,3.000,...
This is a classic application of a 'regular expression' and, once again, sed comes to the rescue.
The editing can be done with a single sed command:
sed -e :a -e '$!N;s/\n\([GI]\)/,\1/;ta' -e 'P;D' filename >newfilename
I didn't say it would be obvious, or easy, did I?
This is the kind of command you write down somewhere for the rare occasions when you need it.
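If the sed loop is too cryptic to keep around, the same merge can be written in awk by accumulating a record and flushing it whenever a new record starts. A sketch only; merge_continuations is a hypothetical helper name, and it assumes, as in the example, that continuation lines are exactly those starting with G or I:

```shell
# merge_continuations [FILE...]: hypothetical helper, reads stdin when no file is given.
# Appends every line starting with G or I to the preceding record, comma-separated.
merge_continuations() {
    awk '
        /^[GI]/ { rec = rec "," $0; next }   # continuation line: append with a comma
        NR > 1  { print rec }                # new record begins: flush the previous one
                { rec = $0 }                 # start accumulating the new record
        END     { if (NR) print rec }        # flush the last record
    ' "$@"
}

printf '%s\n' \
    '14/04/2003,10:27:47,0' 'IdVg,3.000,-1.000,0.050,0.006' 'GmMax,0.011,0.975,0.005' \
    '14/04/2003,10:30:51,600' 'IdVg,3.000,-1.000,0.050,0.006' | merge_continuations
```

With a real file this would be merge_continuations filename > newfilename.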
Try a regular expression, such as:
sed '/[0-9]\+,/{N;s/\n//}'
That checks the line for a number (0-9) followed by a comma, appends the next line with N, then replaces the embedded newline with nothing, removing it. Note that \+ is a GNU sed extension.
Another awk solution, less cryptic than some other answers:
awk '/^[0-9]/ {n = $0; getline; print n $0; next} 1'
$ awk 'ORS= /^[0-9]+,$/?" ":"\n"' file
1, cake:01351
12, bun:1063
scone:13581
biscuit:1931
14, jelly:1385
