How to extract a specific text from gz file? - linux

I need to extract characters 5 to 11 from my fastq.gz data. The data is too large to process in R, so I was wondering whether I can do it directly on the Linux command line.
The fastq file looks like this:
#NB501399:67:HFKTCBGX5:1:11101:13202:1044 1:N:0:CTTGTA
GAGGTNACGGAGTGGGTGTGTGCAGGGCCTGGTGGGAATGGGGAGACCCGTGGACAGAGCTTGTTAGAGTGTCCTAGAGCCAGGGGGAACTCCAGGCAGGGCAAATTGGGCCCTGGATGTTGAGAAGCTGGGTAACAAGTACTGAGAGAAC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE6
#NB501399:67:HFKTCBGX5:1:11101:1109:1044 1:N:0:CTTGTA
TAGGCNACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGACTCAAGCGATCCTCCAGCCTCAGCCTCCCGAGTAGCTGGGACTACAG
+
I only want to extract characters 5 to 11 of each sequence line (TNACGG for the first record, CNACCT for the second) and write them to a new text file. Can I do that?

You can use GNU sed with zcat:
zcat fastq.gz | sed -n '2~5{s/.\{4\}\(.\{6\}\).*/\1/;p}'
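-n means lines are not printed by default
2~5 means start with line 2 and then match every fifth line (a GNU extension); this assumes each record occupies five lines, so standard four-line FASTQ records would need 2~4
when the "address" matches, the substitution skips the first four characters, captures the fifth to tenth characters in \1, and replaces the whole line with them; p then prints the result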

Another using zgrep and positive lookbehind:
$ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
TNACGG
CNACCT
Explained:
zgrep : man zgrep: search possibly compressed files for a regular expression
-o Print only the matched (non-empty) parts of a matching line
-P Interpret the pattern as a Perl-compatible regular expression (PCRE).
(?<=^[ACTGN]{4}) positive lookbehind
[ACTGN]{6} match 6 of the named characters that are preceded by the above
foo.gz my test file

$ zcat fastq.gz | awk '(NR%5)==2{print substr($0,5,6)}'
TNACGG
CNACCT
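The commands above can be tried on a small sample before running them on the real data. The sketch below assumes standard FASTQ with four lines per record (hence NR % 4 rather than the 5 used above), builds a made-up two-record file, and writes the extracted characters to a text file. The read names, file names, and temporary directory are all hypothetical.

```shell
# Build a tiny gzipped FASTQ sample (made-up reads, four lines per record).
tmp=$(mktemp -d)
printf '%s\n' \
  '@read1' 'GAGGTNACGGAGTG' '+' 'AAAAA#EEEEEEEE' \
  '@read2' 'TAGGCNACCTGGTG' '+' 'AAAAA#EEEEEEEE' | gzip > "$tmp/sample.fastq.gz"

# Line 2 of every 4-line record is the sequence; keep characters 5-10.
gzip -dc "$tmp/sample.fastq.gz" \
  | awk 'NR % 4 == 2 { print substr($0, 5, 6) }' > "$tmp/extracted.txt"

cat "$tmp/extracted.txt"
```

Because the data is streamed through the pipe, the decompressed file never has to fit in memory or on disk.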

Related

How to count exact match of certain patterns in a text file using linux shell command?

I want to find the count of certain pattern in a text file which contains lot of mixed patterns also using linux shell command.
I have a text file which contains below patterns,
[--------------]
[+--------------+]
[+----------+------------+--------------------+]
[+---------------------+---------------------+]
How to find exact count of only first pattern [--------------]?
Note: Don't include square bracket as a pattern. Only special character inside square bracket is a pattern.
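cat ./file | sed -e 's/\]/\]\n/g' | grep -c "\[--------------\]"
cat reads the file
sed replaces every ] with ]\n, so each bracketed pattern ends up on its own line (the g flag is needed when a line holds more than one pattern)
grep searches every line for your expression and, with -c, prints the number of matching lines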
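An alternative that counts occurrences rather than lines is to let grep print each match separately and count those. This is a minimal sketch, assuming a grep that supports -o (GNU and BSD grep do) and that the brackets are part of the pattern, as in the answer above; the sample file is made up.

```shell
# Made-up sample: one line holds the pattern twice, one holds a near miss.
tmp=$(mktemp -d)
printf '%s\n' \
  '[--------------]' \
  '[+--------------+]' \
  '[--------------] text [--------------]' > "$tmp/file"

# -o prints every match on its own line; -F treats the pattern as a literal
# string, so the brackets and dashes need no escaping.
count=$(grep -oF '[--------------]' "$tmp/file" | wc -l | tr -d ' ')
echo "$count"
```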

How to use sed to replace a string that contains the slash?

I have a text file that contains a lot of messy text.
I used grep to get all the lines that contain the string prod, like this:
cat textfile | grep "<host>prod*"
The result
<host>prod-reverse-proxy01</host>
<host>prod-reverse-proxy01</host>
<host>prod-reverse-proxy01</host>
Next, I used sed with the intention of removing all the "host" parts:
cat textfile | grep "<host>prod*" | sed "s/<host>//g"; "s/</host>//g"
But only the first "host" was removed.
prod-reverse-proxy01</host>
prod-reverse-proxy01</host>
prod-reverse-proxy01</host>
How can I remove the other "/host" part?
sed -n -e "s/^<host>\(.*\)<\/host>/\1/p" textfile
sed can process your file directly. No need for grep or cat.
-n suppresses the automatic printing of every line, so lines that do not match produce no output. The final 'p' in the script prints each line on which the substitution succeeded.
Script dissection:
s/.../.../...
is the search/replace form. The bit between the first and the second '/' is what you search for. The bit between the second and third is what you replace it with. The last part holds any flags you want to apply to the substitution.
Search:
^<host>\(.*\)<\/host>
finds all lines beginning with <host> followed by any text (.*) followed by </host>. Any text between <host> and </host> is stored into internal variable '1' using '(' and ')'. Note that (, ) and / (in </host>) have to be escaped.
Replace:
\1
Replace the found text with the contents of variable 1 (the 1 has to be escaped; otherwise everything is replaced by the character '1').
Commands:
p
Print resulting line (after replacement).
Note: Your search involves removing two similar but not identical strings (<host> and </host>).
I think this sed is enough
sed 's/<[/]*host>//g' infile
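Since the question is about a string that contains a slash, it may also help that the s command accepts a delimiter other than /, which removes the need to escape the slash at all. A small sketch using the sample line from the question (the variable names are only for illustration):

```shell
input='<host>prod-reverse-proxy01</host>'

# With | as the s-command delimiter, the / in </host> needs no escaping.
# Two -e expressions remove the opening and the closing tag in one sed call.
result=$(printf '%s\n' "$input" | sed -e 's|<host>||g' -e 's|</host>||g')
echo "$result"
```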

Replace first six commas for each line in a text file

I want to replace the first six , for each line in a text file using sed or something similar in linux.
There are more than six , on each line, but only the first six should be replaced by |.
sed doesn't really support the notion of "the first n occurrences", only "the n-th occurrence"; GNU sed adds an extension for "replace all matches from the n-th on", which is not what you want in this case. To get the first six commas replaced, you have to call the s command six times:
sed 's/,/|/;s/,/|/;s/,/|/;s/,/|/;s/,/|/;s/,/|/' infile
If, however, you know that there are no | in the file and you have GNU sed, you can do this:
sed 's/,/|/g;s/|/,/7g' infile
This replaces all commas with pipes, then turns the pipes from the 7th on back to commas.
If you do have pipes beforehand, you can turn them into something that you know isn't in the string first:
sed 's/|/~~/g;s/,/|/g;s/|/,/7g;s/~~/|/g' infile
This makes all | into ~~ first, then all , into |, then the | from the 7th on back into ,, and finally the ~~ back into |.
Testing on this input file:
,,,,,,X,,,,,,
,,,|,,,|,,,|,,,|
the first and third command result in
||||||X,,,,,,
||||||||,,,|,,,|
The second one would fail on the second line because there are already pipe characters.
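If GNU sed is not available, the same idea of "substitute once, six times" can be sketched with awk. This is an alternative, not part of the answer above; the function name replace_first_six is hypothetical, and the input is the test file shown above.

```shell
replace_first_six() {
  # sub() replaces only the first remaining comma on the line,
  # so calling it six times handles the first six commas.
  awk '{ for (i = 0; i < 6; i++) sub(/,/, "|"); print }'
}

out=$(printf '%s\n' ',,,,,,X,,,,,,' ',,,|,,,|,,,|,,,|' | replace_first_six)
echo "$out"
```

Existing pipes are left alone, and lines with fewer than six commas simply have all of their commas replaced.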
This might work for you (GNU sed):
sed 'y/,/\n/;s/\n/,/7g;y/\n/|/' file
Translate all ,'s to \n's, then replace from the seventh \n to the end of line by ,'s, then replace the remaining \n's by |'s.
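Use the following pattern in sed: sed 's/old/new/<number>'
Where <number> selects which occurrence on the line is replaced: only the <number>-th match is changed, not the first <number> matches, so a single command of this form is not enough here.
You can replace <number> with g to apply the pattern to all occurrences, or, in GNU sed, use <number>g to replace the <number>-th match and every one after it.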
You can try this sed,
sed -r ':loop; s/^([^,]*),/\1|/g; /^([^|]*\|){6}/t; b loop' file
(OR)
sed ':loop; s/^\([^,]*\),/\1|/g; /^\([^|]*|\)\{6\}/t; b loop' file
Test:
$ cat file
a,b,c,d,e,f,g,h,i,j,k
$ sed -r ':loop; s/^([^,]*),/\1|/g; /^([^|]*\|){6}/t; b loop' file
a|b|c|d|e|f|g,h,i,j,k
Note: This works only if the line contains no pipe (|) before the substitution and has at least six commas; with fewer commas the loop never terminates.

Filter out only matched values from a text file in each line

I have a file "test.txt" with the lines below, plus a lot of extra stuff after the "version":
soainfra_metrics{metric_group="sca_composite",partition="test",is_active="true",state="on",is_default="true",composite="test123"} map:stats version:1.0
soainfra_metrics{metric_group="sca_composite",partition="gello",is_active="true",state="on",is_default="true",composite="test234"} map:stats version:1.8
soainfra_metrics{metric_group="sca_composite",partition="bolo",is_active="true",state="on",is_default="true",composite="3415"} map:stats version:3.1
soainfra_metrics{metric_group="sca_composite",partition="solo",is_active="true",state="on",is_default="true",composite="hji"} map:stats version:1.1
I tried:
egrep -r 'partition|is_active|state|is_default|composite' test.txt
It displays every line in full, but I need only the fields mentioned, as shown below, ignoring the rest of the data.
In a nutshell, I want to display only these fields from each line, not the rest:
partition="test",is_active="true",state="on",is_default="true",composite="test123"
partition="gello",is_active="true",state="on",is_default="true",composite="test234"
partition="bolo",is_active="true",state="on",is_default="true",composite="3415"
partition="solo",is_active="true",state="on",is_default="true",composite="hji"
If your version of grep supports Perl-style regular expressions, then I'd use this:
grep -oP '.*?,\K[^}]+' file
It skips everything up to and including the first comma (\K drops what has been matched so far from the reported match) and prints everything up to, but not including, the }.
Alternatively, using awk:
awk -F'}' '{ sub(/[^,]+,/, ""); print $1 }' file
This sets the field separator to } so the part you're interested in is the first field. It then uses sub to remove the part up to the first comma.
For completeness, you could also use sed:
sed 's/[^,]*,\([^}]*\).*/\1/' file
This captures the part after the first , up to the } and replaces the content of the line with it.
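As a quick check, the sed variant can be run on one of the sample lines. Note the assumption it relies on: the wanted fields start right after the first comma, which holds for the lines in the question but not necessarily for other input. The variable names are only for illustration.

```shell
line='soainfra_metrics{metric_group="sca_composite",partition="test",is_active="true",state="on",is_default="true",composite="test123"} map:stats version:1.0'

# Drop everything through the first comma, keep the text up to
# (but not including) the closing brace, and discard the rest.
fields=$(printf '%s\n' "$line" | sed 's/[^,]*,\([^}]*\).*/\1/')
echo "$fields"
```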
After the grep to pick out the lines you want, use sed to edit the lines:
sed 's/.*\(partition[^}]*\)} map.*/\1/'
This means: whenever you see anything (.*), followed by partition and
any number of non-} characters, then } map and anything else, capture the part
from partition up to but not including the brace as group 1 (\(...\)).
The replacement text is just group 1 (\1).
Use a pipe | to connect the output of egrep to the input of sed:
egrep ... | sed ...
As far as I understand, your file might have more lines that you don't want to see, so I would use:
sed -n 's/.*\(partition.*\)}.*/\1/p' file
We use -n together with p to show only the lines where a substitution was made. The substitution replaces the whole line with just the captured part that you need.
This might work for you (GNU sed):
sed -r 's/(partition|is_active|state|is_default|composite)="[^"]*"/\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1,/g;s/,$//' file
Treat the problem as if it were a "decomposed club sandwich". Identify the fillings, remove the bread and tidy up.

how to print last few lines when a pattern is matched using sed?

I want to print the last few lines when a pattern is matched in a file using sed.
if file has following entries:
This is the first line.
This is the second line.
This is the third line.
This is the forth line.
This is the Last line.
So, search for the pattern "Last" and print the last few lines.
Use sed to print the file up to and including the first line that matches 'Last' and then quit, and pipe that to tail, which prints the last n lines of its input. -n specifies the number of lines to read from the end; here it is the last 2 lines.
sed '/Last/q' yourfile.txt | tail -n 2
For more info on tail, see man tail.
The | symbol here is a pipe (an unnamed pipe), a form of inter-process communication. In simple words, sed feeds its output to the tail command through the pipe.
I assume you mean "find the pattern and also print the previous few lines". grep is your friend: to print the previous 3 lines:
$ grep -B 3 "Last" file
This is the second line.
This is the third line.
This is the forth line.
This is the Last line.
-B n for "before". There's also -A n ("after"), and -C n ("context", both before and after).
This might work for you (GNU sed):
sed ':a;$!{N;s/\n/&/2;Ta};/Last/P;D' file
This will print the line containing Last and the two previous lines.
N.B. This will only print the lines before the match once. Also, more lines can be shown by changing the 2 to however many lines you want.
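Where neither grep -B nor GNU sed is available, a similar effect can be sketched with awk by remembering the previous n lines. This is an illustrative alternative under the assumption that n is at least 1; the temporary file simply reproduces the sample from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'This is the first line.' 'This is the second line.' \
  'This is the third line.' 'This is the forth line.' \
  'This is the Last line.' > "$tmp/file"

# Keep the previous n lines in a ring buffer; on a match, print the
# buffered lines in order, followed by the matching line itself.
out=$(awk -v n=2 '
  /Last/ { for (i = NR - n; i < NR; i++) if (i > 0) print buf[i % n]; print }
         { buf[NR % n] = $0 }
' "$tmp/file")
echo "$out"
```

Unlike the sed version, overlapping matches would print shared context lines more than once.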

Resources