Extract lines containing two patterns - Linux

I have a file which contains several lines as follows:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
I want to extract all lines containing both patterns, together with their header lines.
I have tried using grep, but it only extracts the sequence lines, not the header lines:
grep '<pattern_1>' input.fasta | grep '<pattern_2>' > output.fasta
How do I extract the lines containing both patterns, plus their headers, in Linux? The patterns can appear anywhere in a line, not only at the start or end.
Expected output:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>

$ grep -A 1 'header[12]' file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
man grep:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
grep -B 1 'pattern_[12]' could also work, but you have several pattern_1s in the sample data so... not this time.

You can easily do that with awk like this:
awk '/^>/{h=$0;next}
/<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta
And here is a sed solution which yields the desired output as well:
sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta
If it is likely that multiline records exist, you can use this (GNU awk, because of gensub):
awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p
function chk( tmp){
tmp=gensub(/\n/,"","g",r)
return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
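If your awk lacks GNU's gensub, a minimal tweak (a sketch) is to have chk() flatten a copy of the record with POSIX gsub instead:
function chk( tmp){
tmp=r
gsub(/\n/,"",tmp)   # strip newlines from the copy; gsub is POSIX awk
return (tmp~pat1&&tmp~pat2)
}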

You might be interested in BioAwk, an adapted version of awk tuned to process FASTA files:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ seq1) && ($seq ~ seq2) { print ">"$name; print $seq }' file.fasta
If you want seq1 at the beginning and seq2 at the end, you can change it into:
bioawk -c fastx -v seq1="pattern1" -v seq2="pattern2" \
'($seq ~ "^"seq1) && ($seq ~ seq2"$") { print ">"$name; print $seq }' file.fasta
This is really practical for processing fasta files, as often the sequence is spread over multiple lines. The above code handles this very easily as the variable $seq contains the full sequence.
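For illustration, a wrapped record like this one (made-up data) would defeat the purely line-based grep/awk solutions above, but bioawk still matches it because $seq holds the concatenated sequence:
>header6
pattern1CGGCGGGCAGATGGCCACCAACAAC
CAGAGCTCCCTGGCCGGGCCTCTTTT
CCTGACGGCCGCCCCCACTpattern2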
If you do not want to install BioAwk, you can use the following method to process your FASTA file. It will allow multi-line sequences and does the following:
read a single record at a time (this assumes no > in the header, except the first character)
extract the header from the record and store it in name (not really needed)
merge the full sequence in a single string of characters, removing all newlines and spaces. This ensures that searching for pattern1 or pattern2 will not fail if the pattern is split over multiple lines.
if a match is found, print the record.
The following awk does the requested:
awk -v seq1="pattern1" -v seq2="pattern2" \
'BEGIN{RS=">"; ORS=""; FS="\n"}
{ seq="";for(i=2;i<=NF;++i) seq=seq""$i; gsub(/[^a-zA-Z0-9]/,"",seq) }
(seq ~ seq1 && seq ~ seq2){print ">" $0}' file.fasta
If the record header contains other > characters which are not at the beginning of the line, you have to take a slightly different approach (unless you use GNU awk)
awk -v seq1="pattern1" -v seq2="pattern2" \
'/^>/ && (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
/^>/ { seq=""; delete aseq; n=0; name=$0; next }
{ aseq[n++] = $0; seq=seq""$0; sub(/[^a-zA-Z0-9]*$/,"",seq) }
END { if (seq ~ seq1 && seq ~ seq2) {
print name
for(i=0;i<n;i++) print aseq[i]
}
}' file.fasta
note: we make use of sub here in case unexpected characters are introduced in the fasta file (eg. spaces/tabs or CR (\r))
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

If you want grep to print lines around the match, use the -B flag for lines before, the -A for lines after, and -C for both before and after the match.
In your case, grep -B 1 seems like it would do the job.

If your input file is exactly as described in your post then you can use:
grep -B1 '^<pattern_1>.*<pattern_2>$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
The -B1 option prints, above each matching line, the line just before it. The regex used assumes your two patterns appear in that exact order, at the beginning and at the end of the line. If this is not the case, use '.*<pattern_1>.*<pattern_2>.*'. Last but not least, if the order of the two patterns is not always respected, you can use: '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$'
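Equivalently, with extended regular expressions (grep -E) the escaped alternation \| becomes a plain |, which some find easier to read:
grep -E -B1 '<pattern_1>.*<pattern_2>|<pattern_2>.*<pattern_1>' input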
On the following input file:
cat input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
output:
grep -B1 '^.*<pattern_1>.*<pattern_2>.*$\|^.*<pattern_2>.*<pattern_1>.*$' input
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header2b
<pattern_2>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_1>

Related

How to parse a specific column or data without losing content from other columns/rows after parsing?

I have the following input file and need to grep a value, in this case "225". The value actually comes from a variable $pd, so it changes depending on user input. It can be an integer or an alphanumeric string, and it must be a case-insensitive exact match: if the variable's value is "225", then "0225" or "11225" is not a valid match when reading the file.
Input File:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.223.10|2000-H1|1/1/3|Undiscoverable|Unkwn
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
10.20.223.10|2000-H1|1/1/8|DeviceY_01225_|Kenmore
10.20.225.10|2000-H1|1/1/8|DeviceY_2250_|Kenmore
Desired Output File:
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
If the user input is "lg", it should still output the matching lines, even though the input file has "LG" in uppercase. (This part is already fixed in the script.)
Desired Output:
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG
$ awk -F'|' -v n='225' '$4 ~ n' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or if you don't want a partial match (e.g. against 1225) then one way is:
$ awk -F'|' -v n='225' '$4 ~ ("(^|[^0-9])" n "([^0-9]|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
or:
$ awk -F'|' -v n='225' '$4 ~ ("(^|_)" n "(_|$)")' file
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
There are other possibilities too. The right solution depends on the requirements you haven't told us about, and may pass or fail on input other than what you've shown us so far.
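For the case-insensitive "lg" match mentioned in the question, a minimal sketch (assuming the brand is the fifth |-separated field) is to lowercase both sides before comparing:
$ awk -F'|' -v n='lg' 'tolower($5) == tolower(n)' file
10.20.223.10|2000-H1|1/1/2|DeviceX_4021|LG
10.20.225.10|2000-H1|1/1/5|DeviceZ_2050|LG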
awk
awk -F"|" -v var="[A-Za-z].225_" '$4 ~ var{print}'
sed
sed -n '/[A-Za-z].225./p'
grep
grep '[A-Za-z].225.'
Output
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
Using sed:
sed -n '/^\([^|]*\|\)\{3\}[^|]*225/p' < input
Explanation:
the -n option disables automatic output at the end of each sed cycle
the pattern matches arbitrary contents of the first three (\{3\}) columns of data via the \(parenthesized\) pattern [^|]*\| -- any number of non-delimiter characters followed by the column delimiter
it matches additional input at the beginning of the fourth column, but not spanning columns, with a similar subexpression: [^|]*
then comes the literal text you want to match
the p command after the pattern causes the line to be printed to sed's output in the event that it matches the pattern
There's almost certainly an awk solution too, but in Perl it's this:
$ perl -aF'\|' -ne '$F[3] =~ 225 and print' < input
10.20.223.10|2000-H1|1/1/8|DeviceY_225_|Kenmore
-a: Autosplit the input into array @F
-F'\|': Set the autosplit delimiter to |
-n: Run code for each line in the input file
-e: Here's the code to run
$F[3]: The 4th element of the autosplit array @F
=~: Regex match
and print: Print the input line if the regex matches
Update: You can get the string you're interested in from a command line parameter by assigning it in a BEGIN block.
$ perl -aF'\|' -ne 'BEGIN { $x = shift } $F[3] =~ $x and print' 225 < input

How to loop through a string for patterns from the Linux shell?

I have a script that looks through files in a directory for strings like :tagName: which works fine for single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags.
My current script does:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \
sort -u
printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n'
The first line is generating an output like this:
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
And the objective is to get that into a list of single :tag:'s.
Again, the problem is that if a line has multiple tags, the line does not appear in the output at all (as opposed to the problem merely being that only the first tag of the line gets displayed). Obviously the | sed... | there is problematic.
I want :tagOne:tagTwo:etcTag: to be turned into:
:tagOne:
:tagTwo:
:etcTag:
and so forth with :politics:violence: etc.
Colons aren't necessary; tagOne is just as good as :tagOne: (maybe better, but this is trivial).
So I should replace the sed with something better...
I've tried:
A smarter sed:
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u
...which works (for a limited number of tags) except that it produces weird results like:
:toxicity:p:
:somewhat:y:
:people:n:
...placing weird random letters at the end of some tags, where :p: is the final letter of the :leadership: tag and "leadership" itself no longer appears in the list. Same for :y: and :n:.
I've also tried using loops in a couple ways...
grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
sort -u | grep lead
...which has the same problem of :leadership: tags being lost etc.
And like...
for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do
for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do
printf "$t\n";
done
done | sort -u
...which doesn't separate the tags at all, just prints stuff like:
:truama:leadership:business:toxicity
Should I be taking some other approach? Using a different utility (perhaps cut inside a loop)? Maybe doing this in python (I have a few python scripts but don't know the language well, but maybe this would be easy to do that way)? Every time I see awk I think "EEK!" so I'd prefer a non-awk solution please, preferring to stick to paradigms I've used in order to learn them better.
Using PCRE in grep (where available) and positive lookbehind:
$ echo :tagOne:tagTwo:tagThree: | grep -Po "(?<=:)[^:]+:"
tagOne:
tagTwo:
tagThree:
You will lose the leading : but get the tags nevertheless.
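If you do want the leading colon back, one simple way is to re-add it afterwards:
$ echo :tagOne:tagTwo:tagThree: | grep -Po "(?<=:)[^:]+:" | sed 's/^/:/'
:tagOne:
:tagTwo:
:tagThree: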
Edit: Did someone mention awk?:
$ awk '{
while(match($0,/:[^:]+:/)) {   # find the next :tag: in the line
a[substr($0,RSTART,RLENGTH)]   # store it as an array key (deduplicates)
$0=substr($0,RSTART+1)         # resume just after the leading colon, so the trailing
}                              # colon can start the next match
}
END {
for(i in a)
print i
}' file
Another idea using awk ...
Sample data generated by the OP's initial grep:
$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:
One awk idea:
awk '
{ split($0,tmp,":") # split input on colon;
# NOTE: fields #1 and #NF are the empty string - see END block
for ( x in tmp ) # loop through tmp[] indices
{ arr[tmp[x]] } # store tmp[] values as arr[] indices; this eliminates duplicates
}
END { delete arr[""] # remove the empty string from arr[]
for ( i in arr ) # loop through arr[] indices
{ printf ":%s:\n", i } # print each tag on separate line leading/trailing colons
}
' tags.raw | sort # sort final output
NOTE: GNU awk can sort array traversal internally (thus eliminating the external sort call); see the sketch after the output below. The version above stays portable by piping to an external sort.
The above also generates:
:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
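In GNU awk specifically, the traversal order of for (i in arr) can be set internally, which drops the external sort (a sketch, GNU awk only):
awk '
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }  # GNU awk: iterate indices in ascending string order
{ split($0,tmp,":"); for (x in tmp) arr[tmp[x]] }
END { delete arr[""]; for (i in arr) printf ":%s:\n", i }
' tags.raw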
A pipe through tr can split those strings out to separate lines:
grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading :). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
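For example, a full pipeline that drops the empty line and restores the colons (a sketch):
grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd |
tr -s ':' '\n' | sed '/^$/d; s/.*/:&:/' | sort -u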
An approach with sed:
sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd
This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline>). sed could be combined with tr:
sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
Using awk to work with the : separated fields, removing duplicates:
awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd
Using the same tags.raw sample data shown earlier (generated by the OP's initial grep):
One while/for/printf idea based on associative arrays:
unset arr
typeset -A arr # declare array named 'arr' as associative
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
arr[${word}]=1 # create/overwrite arr[$word] with value 1;
# objective is to make sure we have a single entry in arr[] for $word;
# this eliminates duplicates
done
done < tags.raw
printf ":%s:\n" "${!arr[#]}" | sort # pass array indices (ie, our unique list of words) to printf;
# per OPs desired output we'll bracket each word with a pair of ':';
# then sort
Per the OP's comment/question about removing the array, here is a twist on the above that eliminates the array in favor of printing from the inner loop and then piping everything to sort -u:
while read -r line # for each line from tags.raw ...
do
for word in ${line//:/ } # replace ":" with space and process each 'word' separately
do
printf ":%s:\n" "${word}" # print ${word} to stdout
done
done < tags.raw | sort -u # pipe all output (ie, the list of ${word}s) to sort, removing dups
Both of the above generate the same sorted list of tags shown earlier.

Find unique sequences within dna strings

I have a file which contains a bunch of sequences. The strings have a prefix of AAGCTT and a suffix of GCGGCCGC.
Between these two pattern lies unique sequences. I want to find these sequences and count their occurrence.
Example below
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
String CTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCG is present 1000 times.
I'd divide the problem into these subproblems:
Extract all sequences between AAGCTT and GCGGCCGC:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)'
-P is a GNU extension. If your implementation of grep does not support it use pcregrep.
Assumption: The sequences to be extracted never contain AAGCTT/GCGGCCGC except at the beginning/end.
Count the found sequences:
sort | uniq -c
Putting everything together, we end up with:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)' yourInputFile | sort | uniq -c
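To list the most frequent sequences first, a common refinement is a numeric reverse sort at the end:
grep -Po 'AAGCTT\K.*?(?=GCGGCCGC)' yourInputFile | sort | uniq -c | sort -rn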
It's hard (impossible?) to assess whether this will work for you, given the sample size. My one-liner assumes one sequence per line, lines defined by unix line endings, and GNU awk (for gensub):
echo "AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC" |
awk '{ a[gensub(/AAGCTT(.*)GCGGCCGC/,"\\1",1,$0)]++ }
END{ for(i in a) print i" is present "a[i]" times" }'
CTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCG is present 1 times
I believe this will do what you want:
awk '/^AAGCTT/ && /GCGGCCGC$/ {arr[$0]++} END {for (i in arr) {print i "\t" arr[i]}}' file
Explanation: find lines beginning with the first adapter and ending with the last adapter, then load these into an array and print the unique lines followed by the count for each line
With this test data:
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGCAACT
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCccccccccc
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCccccccccc
The output is:
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAAAGATGATGAGTTCGGCGGCCGC 5
AAGCTTCTGCCCACACACCGAAACATGAATCGATCACATACTAGAATCAGGCAGTCAGAGATATCAcggtcaaaaaaaaCGGCGGCCGC 4
If you just want the count, you can use print arr[i] instead of print i "\t" arr[i], or if you want the count before the read you can use print arr[i] "\t" i
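To echo the question's phrasing ("is present N times"), the END block can be adjusted accordingly (a sketch):
awk '/^AAGCTT/ && /GCGGCCGC$/ {arr[$0]++} END {for (i in arr) print i " is present " arr[i] " times"}' file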
Assuming you have some file dna.txt you could simply:
Separate your original continuous DNA string into multiple lines, using your PREFIX as a line delimiter, then remove each SUFFIX and any irrelevant DNA following it
Then use sort -u to iterate through all lines in your new file with no repeats (All the unique patterns).
Then simply use grep -o and wc -l to count the occurrences!
PREFIX='AAGCTT'
SUFFIX='GCGGCCGC'
find_traits() {
# Step 1
sed "s/${PREFIX}/\n/g" dna.txt > /tmp/dna_lines.txt
sed -i "s/${SUFFIX}.*//" /tmp/dna_lines.txt
# Step 2
for pattern in $(sort -u /tmp/dna_lines.txt)
do
# Step 3
printf "
PATTERN [$(grep -o "$pattern" dna.txt | wc -l)] : |${PREFIX}|${pattern}|${SUFFIX}|
"
done
}
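Hypothetical usage, assuming dna.txt sits in the current directory:
find_traits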

awk Print Line Issue

I'm experiencing some issues with a awk command right now. The original script was developed using awk on MacOS and was then ported to Linux. There awk shows a different behavior.
What I want to do is to count the occurrences of single strings provided via /tmp/test.uniq.txt in the file /tmp/test.txt.
awk '{print $1, system("cat /tmp/test.txt | grep -o -c " $1)}' /tmp/test.uniq.txt
Mac delivers an expected output like:
test1 2
test2 1
The output is on one line: the string and the number of occurrences, separated by a whitespace.
Linux delivers an output like:
2
test1 1
test2
The output is not on one line, and the output of the system command is printed first.
Sample input:
test.txt looks like:
test1 test test
test1 test test
test2 test test
test.uniq.txt looks like:
test1
test2
As the comments suggested, calling cat and grep etc. via the system function is not recommended, since awk is a complete language that can perform most of these tasks on its own. (The Mac/Linux difference is likely a buffering artifact: system() writes straight to stdout while awk's own print may be buffered, so the two streams can interleave differently.)
You can use the following awk command to replace your cat | grep functionality:
awk 'FNR == NR {a[$1]=0; next}                     # first file: remember each string to count
{for (i=1; i<=NF; i++) if ($i in a) a[$i]++}       # second file: count occurrences per field
END { for (i in a) print i, a[i] }' uniq.txt test.txt
test1 2
test2 1
Note that this output doesn't match the count of 5 stated in your question, as your sample data is probably different.
References:
Effective AWK Programming
Awk Tutorial
It looks to me as if you're trying to count the number of lines containing each unique string in the uniq file. But the way you're doing it is... awkward, and as you've demonstrated, inconsistent between versions of awk.
The following might work a little better:
$ awk '
NR==FNR {
a[$1]
next
}
{
for (i in a) {
if ($1~i) {
a[i]++
}
}
}
END {
for (i in a)
printf "%6d\t%s\n",a[i],i
}
' test.uniq.txt test.txt
2 test1
1 test2
This loads your uniq file into an array, then for every line in your text file, steps through the array to count the matches.
Note that these are being compared as regular expressions, without word boundaries, so test1 will also be counted as part of test12.
Another way might be to use grep+sort+uniq:
grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
It's a pipeline but a short one
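With the sample data shown in the question (here saved as uniq.txt and test.txt), this would print something like:
$ grep -o -w -F -f uniq.txt test.txt | sort | uniq -c
      2 test1
      1 test2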
From man grep:
-F, --fixed-strings, --fixed-regexp
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.

find a pattern and print line based on finding the first pattern sed, awk grep

I have a rather large file. What is common to all is the hostname to break each section example :
HOSTNAME:host1
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 2
HOSTNAME:host-2
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 1
As you can see above, within each section there are subsections broken down by keywords, or lines that have specific values.
I'd like to use a one-liner to print the hostname for each section and then print whichever lines I want to extract under each hostname section.
Can you please help? I am currently using grep -C 10 HOSTNAME | grep pattern, but this assumes that there are 10 lines in each section, which is not an optimal way to do it; can someone show a better way? I also need to be able to print more than one line under each pattern I find. So if I find data 1 and there are additional lines under it, I'd like to grab and print them too.
So the output of commands like these would be:
grep -C 10 HOSTNAME | grep 'data 1'
grep -C 10 HOSTNAME | grep -A 2 'data 1'
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Besides grep, I use this sed command to print my output:
sed -r '/HOSTNAME|shared/!d' filename
The only problem with this sed command is that it only prints the lines that match the patterns shared and HOSTNAME. I also need to specify the number of lines to print under the line that matched shared. So I'd like to print HOSTNAME and give the number of lines to print under the second search pattern, shared.
Thanks
awk to the rescue!
$ awk -v lines=2 '/HOSTNAME/{c=lines} NF&&c&&c--' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
This prints lines number of lines, including the pattern match, and skips empty lines.
If you want to specify a secondary keyword instead of a number of lines:
$ awk -v key='data 1' '/HOSTNAME/{h=1; print} h&&$0~key{print; h=0}' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Here is a sed twoliner:
sed -n -r '/HOSTNAME/ { p }
/^\s+data 1/ {p }' hostnames.txt
It prints (p)
when the line contains a HOSTNAME
when the line starts with some whitespace (\s+) followed by your search criterion (data 1)
non-matching lines are not printed (due to the sed -n option)
Edit: Some remarks:
this was tested with GNU sed 4.2.2 under Linux
if your sed version does not support -r, you don't need it: replace the second pattern with /^.*data 1/
we can squash everything in one line with ;
Putting it all together, here is a revised version in one line, without the need for the extended regex ( i.e without -r):
sed -n '/HOSTNAME/ { p } ; /^.*data 1/ {p }' hostnames.txt
The OP requirements seem to be very unclear, but the following is consistent with one interpretation of what has been requested, and more importantly, the program has no special requirements, and the code can easily be modified to meet a variety of requirements. In particular, both search patterns (the HOSTNAME pattern and the "data 1" pattern) can easily be parameterized.
The main idea is to print all lines in a specified subsection, or at least a certain number up to some limit.
If there is a limit on how many lines in a subsection should be printed, specify a value for limit, otherwise set it to 0.
awk -v limit=0 '
/^HOSTNAME:/ { subheader=0; hostname=1; print; next}
/^ *data 1/ { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }'
Given the lines provided in the question, the output would be:
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
(Yes, I know the variable 'hostname' in the awk program is currently unused, but I included it to make it easy to add a test to satisfy certain obvious requirements regarding the preconditions for identifying a subheader.)
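To illustrate the parameterization mentioned above, a minimal sketch (the variable names host_re and key_re are my own) that passes both patterns in with -v:
awk -v host_re='^HOSTNAME:' -v key_re='^ *data 1' -v limit=0 '
$0 ~ host_re { subheader=0; print; next }
$0 ~ key_re { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }' file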
sed -n -e '/HOSTNAME/p' -e '/data 1/,+1p' file
The simplest way is to combine two sed commands. Note that GNU sed's addr,+N range form needs a line count (a bare ,+p is not valid syntax); here +1 prints one extra line after each data 1 match.
