Using the grep command to search a man page - Linux

Although the man page contains a lot of information, I usually need only a small part of it at a time. For example, I run objdump -f kernel.o but have forgotten what the '-f' flag does.
I tried this, but it failed:
man objdump | grep -e '*.(\-)f.*'
The error message is as follows:
<standard input>:161: warning [p 1, 5.5i]: can't break line
<standard input>:594: warning [p 6, 6.5i, div `an-div', 0.0i]: can't break line
How do I search the man page using grep?

$ man objdump | grep -e "-f"
<standard input>:161: warning [p 1, 5.5i]: can't break line
<standard input>:590: warning [p 6, 6.5i, div `an-div', 0.0i]: can't break line
[-f|--file-headers]
[-F|--file-offsets]
[--file-start-context]
[-s|--full-contents]
-a,-d,-D,-e,-f,-g,-G,-h,-H,-p,-P,-r,-R,-s,-S,-t,-T,-V,-x must be given.
-f
--file-headers
--file-offsets
--file-start-context
--full-contents
all of -a -f -h -p -r -t.
Is that what you want?

Related

Extract lines containing two patterns

I have a file which contains several lines as follows:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
I want to extract all lines containing both patterns, together with their header lines.
I have tried using grep, but it extracts only the sequence lines, not the header lines.
grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta
How can I extract the lines containing both patterns, along with their headers, in Linux? The patterns can appear anywhere in a line, not only at the start or end.
Expected output:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
$ grep -A 1 header[12] file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
man grep:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
grep -B 1 pattern_[12] could also work, but you have several pattern_1s in the sample data, so not this time.
You can easily do that with awk like this:
awk '/^>/{h=$0;next}
/<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta
And here is a sed solution which yields the desired output as well:
sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta
If it is likely that multiline records exist, you can use this:
awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p
function chk( tmp){
tmp=gensub(/\n/,"","g",r)
return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
You might be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta
If you want seq1 at the beginning and seq2 at the end, you can change it into:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta
This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq contains the full sequence.
If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:
read a single record at a time (this assumes no > in the header, except the first character)
extract the header from the record and store it in name (not really needed)
merge the full sequence in a single string of characters, removing all newlines and spaces. This ensures that searching for pattern1 or pattern2 will not fail if the pattern is split over multiple lines.
if a match is found, print the record.
The following awk script does this:
awk -v seq1="pattern1" -v seq2="pattern2" \
'BEGIN{RS=">"; ORS=""; FS="\n"}
{ seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
(seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta
If the record header contains other > characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)
awk -v seq1="pattern1" -v seq2="pattern2" \
'/^>/ && (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
/^>/ { seq=""; delete aseq; n=0; name=$0; next }
{ aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
END { if (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
}' file.fasta
Note: we use sub here in case unexpected characters are introduced in the FASTA file (e.g. spaces, tabs or CR (\r)).
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
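The record-at-a-time idea described above (collect one record, join its sequence lines, then test both patterns) can also be sketched in Python. This is only an illustration: the names filter_fasta and has_both are made up for this example, and the patterns are treated as regular expressions.

```python
import re


def has_both(seq_lines, pat1, pat2):
    """Join the sequence lines so a pattern split over two lines is still found."""
    seq = "".join(part.strip() for part in seq_lines)
    return bool(re.search(pat1, seq) and re.search(pat2, seq))


def filter_fasta(lines, pat1, pat2):
    """Yield (header, sequence_lines) for records whose sequence matches both patterns."""
    header, seq_lines = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record: test it now.
            if header is not None and has_both(seq_lines, pat1, pat2):
                yield header, seq_lines
            header, seq_lines = line, []
        elif header is not None:
            seq_lines.append(line)
    # The last record has no following header, so test it here.
    if header is not None and has_both(seq_lines, pat1, pat2):
        yield header, seq_lines


if __name__ == "__main__":
    demo = [">r1", "AAACC", "GGTTT", ">r2", "AAAC", ">r3", "TTTAAA"]
    for name, seq in filter_fasta(demo, "CCGG", "TTT"):
        print(name)
        print("\n".join(seq))
```

In the demo, CCGG is split across two lines of r1 and is still found because the lines are joined before matching.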
If you want grep to print lines around the match, use the -B flag for lines before, the -A for lines after, and -C for both before and after the match.
In your case, grep -B 1 seems like it would do the job.
If your input file is exactly as described in your post then you can use:
grep -B1 '^<pattern_1>.*<pattern_2>$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
Here -B1 prints the line immediately before each matching line. The regex assumes that your two patterns appear in this exact order, at the beginning and at the end of the line. If that is not the case, use '.*<pattern_1>.*<pattern_2>.*'. Last but not least, if the order of the two patterns is not always the same, you can use: '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$'
On the following input file:
cat input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
output:
grep -B1 '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
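For comparison, the same "either order" alternation plus one line of leading context can be sketched in Python. The function name lines_with_both is hypothetical, the patterns are treated as literal strings, and, like the grep regex, the alternation assumes the two patterns do not overlap.

```python
import re


def lines_with_both(lines, pat1, pat2):
    """Return lines containing both patterns, each preceded by the line before it."""
    p1, p2 = re.escape(pat1), re.escape(pat2)
    # Same idea as the grep alternation: pat1 ... pat2 OR pat2 ... pat1.
    either_order = re.compile(f"{p1}.*{p2}|{p2}.*{p1}")
    out = []
    last = -1  # index of the last line already emitted
    for i, line in enumerate(lines):
        if either_order.search(line):
            # Emit one line of leading context unless it was already printed.
            start = max(i - 1, last + 1, 0)
            out.extend(lines[start:i + 1])
            last = i
    return out


if __name__ == "__main__":
    data = [">h1", "<p1>ACGT<p2>", ">h2b", "<p2>ACGT<p1>", ">h3", "<p1>ACGT"]
    print("\n".join(lines_with_both(data, "<p1>", "<p2>")))
```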

sort fasta by sequence size

I currently want to sort a huge FASTA file (more than 10**8 lines and sequences) by sequence size. FASTA is a well-defined format used in biology to store sequences (genetic or protein):
>id1
sequence 1 # could be on several line
>id2
sequence 2
...
I have run a tool that gives me, in TSV format, the identifier, the length, and the position in bytes of the identifier.
For now, I sort this file by the length column, then parse it and use seek to retrieve the corresponding sequence, which I append to a new file.
# This function gets the sequence using seek.
def get_seq(file, bites):
    with open(file) as f_:
        f_.seek(bites, 0)  # go to the line of interest
        line = f_.readline().strip()  # this line is the beginning of the sequence
        to_return = ""  # the string which will contain the sequence
        # Read until the next identifier or the end of the file
        # (readline() returns "" at end of file).
        while line and not line.startswith('>'):
            to_return += line
            line = f_.readline().strip()
    return to_return

# Simply append the id and the sequence to a file.
def write_seq(out_file, id_, sequence):
    with open(out_file, 'a') as out_file:
        out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))

# The main loop parses the index file and calls the functions defined above.
with open(args.fai) as ref:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)
My problem is that this is really slow (it takes several days). Is that normal? Is there another way to do it? I am not a computer scientist by training, so I may be missing something, but I believed that indexing the file and using seek was the fastest way to achieve this. Am I wrong?
Opening two files for each sequence is probably contributing a lot to the run time. You could pass file handles to your get/write functions rather than file names, but I would suggest using an established FASTA parser/indexer like Biopython or samtools. Here's an (untested) solution with samtools:
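import subprocess

# Build the index once.
subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref, open(args.out, "a") as out:
    for line in ref:
        spt = line.split()
        id_ = spt[0]
        # Send the record to the output file through stdout=;
        # ">>" in an argument list is not interpreted as a redirection.
        subprocess.call(["samtools", "faidx", args.i, id_], stdout=out)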
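The first suggestion above, passing open file handles instead of file names, could look roughly like the sketch below. The names read_sequence and write_sorted are made up for illustration, and the sketch assumes that the third index column is the byte offset of the first sequence line; adjust it if your index points at the header line instead.

```python
def read_sequence(handle, offset):
    """Read one sequence starting at byte `offset` from an already open binary handle."""
    handle.seek(offset)
    parts = []
    for raw in iter(handle.readline, b""):  # stops at end of file
        line = raw.strip()
        if line.startswith(b">"):  # the next record begins here
            break
        parts.append(line)
    return b"".join(parts).decode("ascii")


def write_sorted(fasta_path, index_rows, out_path):
    """Write records ordered by length.

    index_rows is an iterable of (id, length, offset) tuples.
    Both files are opened once, not once per sequence.
    """
    with open(fasta_path, "rb") as src, open(out_path, "w") as dst:
        for id_, _length, offset in sorted(index_rows, key=lambda row: row[1]):
            dst.write(">{}\n{}\n".format(id_, read_sequence(src, offset)))
```

The input is opened in binary mode so that seek works with plain byte offsets.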
What about bash and some basic Unix commands (csplit is the clue)? I wrote this simple script, but you can customize/improve it. It's not highly optimized and doesn't use an index file, but it may nevertheless run faster.
# Split the input into one temporary file per record.
csplit -z -f tmp_fasta_file_ "$1" '/>/' '{*}'
for file in tmp_fasta_file_*
do
    # Count the lines of each record (a proxy for sequence length
    # when sequence lines are wrapped at a fixed width).
    TMP_FASTA_WC=$(wc -l < "$file" | tr -d ' ')
    FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done
# Concatenate the records, largest line count first.
for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
    cat "$filename" >> "$2"
done
rm tmp_fasta_file*
First positional argument is a filepath to your fasta file, second one is a filepath for output, i.e. ./script.sh input.fasta output.fasta
Using a modified version of fastq-sort (currently available at https://github.com/blaiseli/fastq-tools), we can convert the file to fastq format using bioawk, sort with the -L option I added, and convert back to fasta:
cat test.fasta \
| tee >(wc -l > nb_lines_fasta.txt) \
| bioawk -c fastx '{l = length($seq); printf "#"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
| tee >(wc -l > nb_lines_fastq.txt) \
| fastq-sort -L \
| tee >(wc -l > nb_lines_fastq_sorted.txt) \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
| tee >(wc -l > nb_lines_fasta_sorted.txt) \
> test_sorted.fasta
The fasta -> fastq conversion step is quite ugly. We need to generate dummy fastq qualities with the same length as the sequence. I found no better way to do it with (bio)awk than this hack based on the "dynamic width" thing mentioned at the end of https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.
The IIIII... string should be longer than the longest of the input sequences, otherwise, invalid fastq will be obtained, and when converting back to fasta, bioawk seems to silently skip such invalid reads.
In the above example, I added steps to count the lines. If the line counts are not consistent, it may be because the IIIII... string was too short.
The resulting fasta file will have the shorter sequences first.
To get the longest sequences at the top of the file, add the -r option to fastq-sort.
Note that fastq-sort writes intermediate files in /tmp. If for some reason it is interrupted before erasing them, you may want to clean your /tmp manually and not wait for the next reboot.
Edit
I actually found a better way to generate dummy qualities of the same length as the sequence: simply using the sequence itself:
cat test.fasta \
| bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
| fastq-sort -L \
| bioawk -c fastx '{print ">"$name"\n"$seq}' \
> test_sorted.fasta
This solution is cleaner (and slightly faster), but I keep my original version above because the "dynamic width" feature of printf and the usage of tee to check intermediate data length may be interesting to know about.
You can also do it very conveniently with awk, check the code below:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
awk -F '\t' '{printf("%d\t%s\n",length($2),$0);}' |\
sort -k1,1n | cut -f 2- | tr "\t" "\n"
This and other methods have been posted on Biostars (e.g. using BBMap's sortbyname.sh script), and I strongly recommend this community for questions such as this one.
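For files that fit in memory, the linearize, measure and sort idea of the awk pipeline can be sketched in Python as follows. The function names read_fasta and sort_by_length are invented for this example, and holding 10**8 records in memory may not be realistic, so treat it as an illustration of the logic, not as a replacement for the streaming tools above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def sort_by_length(lines, longest_first=False):
    """Return all records sorted by sequence length (shortest first by default)."""
    return sorted(read_fasta(lines), key=lambda rec: len(rec[1]), reverse=longest_first)


if __name__ == "__main__":
    demo = [">a", "ACGT", "AC", ">b", "A", ">c", "ACG"]
    for name, seq in sort_by_length(demo):
        print(name)
        print(seq)
```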

Find HEX value in file and grep the following value

I have a 2GB file in raw format. I want to search for all occurrences of a specific hex value, "355A3C2F74696D653E", and collect the 28 characters that follow each one.
Example: 355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135
In this case I want the output: "323031312D30342D32365431343A34373A30322D31343A34373A3135" or better: 2011-04-26T14:47:02-14:47:15
I have tried with
xxd -u InputFile | grep '355A3C2F74696D653E' | cut -c 1-28 > OutputFile.txt
and
xxd -u -ps -c 4000000 InputFile | grep '355A3C2F74696D653E' | cut -b 1-28 > OutputFile.txt
But I can't get it working.
Can anybody give me a hint?
As you are using xxd it seems to me that you want to search the file as if it were binary data. I'd recommend using a more powerful programming language for this; the Unix shell tools assume there are line endings and that the text is mostly 7-bit ASCII. Consider using Python:
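#!/usr/bin/env python3
import mmap

# The hex value 355A3C2F74696D653E is the ASCII text "5Z</time>".
needle = b"\x35\x5A\x3C\x2F\x74\x69\x6D\x65\x3E"

with open("file_to_search", "rb") as fd:
    haystack = mmap.mmap(fd.fileno(), length=0, access=mmap.ACCESS_READ)
    i = haystack.find(needle)
    while i >= 0:
        i += len(needle)
        # Print the 28 bytes that follow each occurrence.
        print(haystack[i:i + 28].decode("ascii", errors="replace"))
        i = haystack.find(needle, i)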
If your grep supports the -P option, you could simply use the command below.
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{28}'
323031312D30342D32365431343A
For 56 chars,
$ echo '355A3C2F74696D653E323031312D30342D32365431343A34373A30322D31343A34373A3135' | grep -oP '355A3C2F74696D653E\K.{56}'
323031312D30342D32365431343A34373A30322D31343A34373A3135
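To get the readable form the question asks for, the 56 hex digits can be converted back to text. A minimal Python sketch, assuming the bytes are plain ASCII (decode_hex is a name chosen for this example):

```python
def decode_hex(hex_string):
    """Turn a string of hex digits into the ASCII text it encodes."""
    return bytes.fromhex(hex_string).decode("ascii")


if __name__ == "__main__":
    sample = "323031312D30342D32365431343A34373A30322D31343A34373A3135"
    print(decode_hex(sample))  # 2011-04-26T14:47:02-14:47:15
```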
Why convert to hex first? See if this awk script works for you. It looks for the string you want to match on, then prints the next 28 characters. The slash is escaped with a backslash in the pattern; < and > are left unescaped because GNU awk treats \< and \> as word-boundary operators.
Adapted from this post: Grep characters before and after match?
I added some blank lines for readability.
VirtualBox:~$ cat data.dat
Thisis a test of somerandom characters before thestringI want5Z</time>2011-04-26T14:47:02-14:47:15plus somemoredata
VirtualBox:~$ cat test.sh
awk '/5Z<\/time>/ {
    match($0, /5Z<\/time>/); print substr($0, RSTART + 9, 28);
}' data.dat
VirtualBox:~$ ./test.sh
2011-04-26T14:47:02-14:47:15
VirtualBox:~$
EDIT: I just realized something. The regular expression will need to be tweaked to be non-greedy, etc and between that and awk need to be tweaked to handle multiple occurrences as you need them. Perhaps some of the folks more up on awk can chime in with improvements as I am real rusty. An approach to consider anyway.

Linux head/tail with offset

Is there a way in Linux to ask for the head or tail of a file, but with an additional offset of records to skip?
For example if the file example.lst contains the following:
row01
row02
row03
row04
row05
And I use head -n3 example.lst I can get rows 1 - 3 but what if I want it to skip the first row and get rows 2 - 4?
I ask because some commands have a header which may not be desirable within the search results. For example du -h ~ --max-depth 1 | sort -rh will return the directory size of all folders within the home directory sorted in descending order but will append the current directory to the top of the result set (i.e. ~).
The Head and Tail man pages don't seem to have any offset parameter so maybe there is some kind of range command where the required lines can be specified: e.g. range 2-10 or something?
From man tail:
-n, --lines=K
output the last K lines, instead of the last 10;
or use -n +K to output lines starting with the Kth
You can therefore use ... | tail -n +2 | head -n 3 to get 3 lines starting from line 2.
Non-head/tail methods include sed -n "2,4p" and awk "NR >= 2 && NR <= 4".
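If the data is being processed in Python anyway, the same line range can be taken with itertools.islice. A small sketch, where line_range is a hypothetical helper name:

```python
from itertools import islice


def line_range(lines, first, last):
    """Return lines first..last (1-based, inclusive), like sed -n 'first,lastp'."""
    return list(islice(lines, first - 1, last))


if __name__ == "__main__":
    rows = ["row01", "row02", "row03", "row04", "row05"]
    print(line_range(rows, 2, 4))  # ['row02', 'row03', 'row04']
```

Because islice works on any iterable, an open file object can be passed directly without reading the whole file first.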
To get the rows between 2 and 4 (both inclusive), you can use:
head -n4 example.lst | tail -n+2
or
head -n4 example.lst | tail -n3
It took me a lot of time to end up with this solution, which seems to be the only one that covers all use cases (so far):
command | tee full.log | stdbuf -i0 -o0 -e0 awk -v offset=${MAX_LINES:-200} \
'{
if (NR <= offset) print;
else {
a[NR] = $0;
delete a[NR-offset];
printf "." > "/dev/stderr"
}
}
END {
print "" > "/dev/stderr";
for(i=NR-offset+1 > offset ? NR-offset+1: offset+1 ;i<=NR;i++)
{ print a[i]}
}'
Feature list:
live output for head (obviously that for tail is not possible)
no use of external files
progressbar on stderr, one dot for each line after the MAX_LINES, very useful for long running tasks.
avoids possible incorrect logging order due to buffering (stdbuf)
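The core of the awk script above, emitting the first N lines immediately and keeping only the last N of the remainder, can be sketched in Python with a bounded deque. This leaves out the tee logging and the progress dots, and head_and_tail is a name invented for the example.

```python
from collections import deque


def head_and_tail(lines, limit):
    """Yield the first `limit` lines as they arrive, then the last `limit` of the rest."""
    tail = deque(maxlen=limit)  # keeps only the newest `limit` lines
    for number, line in enumerate(lines, start=1):
        if number <= limit:
            yield line  # live output for the head
        else:
            tail.append(line)
    # The tail can only be known once the input has ended.
    yield from tail


if __name__ == "__main__":
    print(list(head_and_tail(range(1, 11), 3)))  # [1, 2, 3, 8, 9, 10]
```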
sed -n 2,4p somefile.txt

grep: show lines surrounding each match

How do I grep and show the preceding and following 5 lines surrounding each matched line?
For BSD or GNU grep you can use -B num to set how many lines before the match and -A num for the number of lines after the match.
grep -B 3 -A 2 foo README.txt
If you want the same number of lines before and after you can use -C num.
grep -C 3 foo README.txt
This will show 3 lines before and 3 lines after.
-A and -B will work, as will -C n (for n lines of context), or just -NUM (for example, -5 for 5 lines of context).
ack works with similar arguments as grep, and accepts -C. But it's usually better for searching through code.
grep astring myfile -A 5 -B 5
That will grep "myfile" for "astring", and show 5 lines before and after each match
ripgrep
If you care about the performance, use ripgrep which has similar syntax to grep, e.g.
rg -C5 "pattern" .
-C, --context NUM - Show NUM lines before and after each match.
There are also parameters such as -A/--after-context and -B/--before-context.
The tool is built on top of Rust's regex engine, which makes it very efficient on large data.
I normally use
grep searchstring file -C n # n for number of lines of context up and down
Many tools like grep also have really great man pages. I find myself referring to grep's man page a lot because there is so much you can do with it.
man grep
Many GNU tools also have an info page that may have more useful information in addition to the man page.
info grep
Use grep
$ grep --help | grep -i context
Context control:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
If you search code often, ag (The Silver Searcher) is much more efficient (i.e. faster) than grep.
You show context lines by using the -C option.
Eg:
ag -C 3 "foo" myFile
line 1
line 2
line 3
line that has "foo"
line 5
line 6
line 7
Search for "17655" in /some/file.txt showing 10 lines context before and after (using Awk), output preceded with line number followed by a colon. Use this on Solaris when grep does not support the -[ACB] options.
awk '
/17655/ {
for (i = (b + 1) % 10; i != b; i = (i + 1) % 10) {
print before[i]
}
print (NR ":" ($0))
a = 10
}
a-- > 0 {
print (NR ":" ($0))
}
{
before[b] = (NR ":" ($0))
b = (b + 1) % 10
}' /some/file.txt;
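The same buffering idea (remember the most recent lines for "before" context and count down for "after" context) can be sketched in Python. This is an illustration, not a port of the awk script: grep_context is a made-up name, it matches a plain substring instead of a regex, and it prints no "--" group separators.

```python
from collections import deque


def grep_context(lines, needle, before=2, after=2):
    """Return 'N:line' strings for each match plus its context, without duplicates."""
    out = []
    pending = deque(maxlen=before)  # most recent lines not yet emitted
    remaining_after = 0
    for number, line in enumerate(lines, start=1):
        entry = f"{number}:{line}"
        if needle in line:
            out.extend(pending)  # flush the before-context
            pending.clear()
            out.append(entry)
            remaining_after = after
        elif remaining_after > 0:
            out.append(entry)  # still inside the after-context
            remaining_after -= 1
        else:
            pending.append(entry)
    return out


if __name__ == "__main__":
    text = ["one", "two", "three 17655", "four", "five", "six"]
    print("\n".join(grep_context(text, "17655", before=1, after=1)))
```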
Let's understand using an example.
We can use grep with options:
-A 5 # this will give you 5 lines after searched string.
-B 5 # this will give you 5 lines before searched string.
-C 5 # this will give you 5 lines before & after searched string
Example.
File.txt contains 6 lines and following are the operations.
[abc@xyz]~/% cat file.txt # print all file data
this is first line
this is 2nd line
this is 3rd line
this is 4th line
this is 5th line
this is 6th line
[abc@xyz]~% grep "3rd" file.txt # we are searching for keyword '3rd' in the file
this is 3rd line
[abc@xyz]~% grep -A 2 "3rd" file.txt # print 2 lines after finding the searched string
this is 3rd line
this is 4th line
this is 5th line
[abc@xyz]~% grep -B 2 "3rd" file.txt # Print 2 lines before the search string.
this is first line
this is 2nd line
this is 3rd line
[abc@xyz]~% grep -C 2 "3rd" file.txt # print 2 line before and 2 line after the searched string
this is first line
this is 2nd line
this is 3rd line
this is 4th line
this is 5th line
Trick to remember options:
-A  → A means "after"
-B  → B means "before"
-C  → C means "in between"
I do it the compact way:
grep -5 string file
That is the equivalent of:
grep -A 5 -B 5 string file
Here is @Ygor's solution in awk:
awk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=3 a=3 s="pattern" myfile
Note: Replace the b and a variables with the number of lines before and after.
It's especially useful for systems which don't support grep's -A, -B and -C parameters.
Grep has a group of options called Context Line Control; you can use --context (-C) from it, simply:
| grep -C 5
or
| grep -5
Should do the trick
$ grep thestring thefile -5
-5 gets you 5 lines above and below the match 'thestring'; it is equivalent to -C 5 or -A 5 -B 5.

Resources