Multiline search pattern in Linux

I am trying to use grep to perform a multiline search in Linux but am having problems with it.
Basically I want to extract all the lines that are followed by the Sequences string in the example below.
Query= BRNW_157
Sequences producing significant alignments: (Bits) Value
Query= BRNW_428
Query= BRNW_503
Sequences producing significant alignments: (Bits) Value
Query= BRNW_601
Query= BRNW_617
Sequences producing significant alignments: (Bits) Value
I tried awk, but it doesn't work:
awk '/Query=*/,/Sequences*/'
Then I used grep, and it doesn't work either:
grep -PZo 'Query=*\n.*sequences'
Is there a way around this problem?

Are you saying you want to find the word Sequences and print that line plus the line before it?
That'd just be:
awk '/Sequences/{print prev ORS $0} {prev=$0}' file
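Applied to the sample data from the question, this can be sketched as follows (the file name blast.txt is an assumption):

```shell
# Recreate the sample input from the question (file name assumed)
cat > blast.txt <<'EOF'
Query= BRNW_157
Sequences producing significant alignments: (Bits) Value
Query= BRNW_428
Query= BRNW_503
Sequences producing significant alignments: (Bits) Value
Query= BRNW_601
Query= BRNW_617
Sequences producing significant alignments: (Bits) Value
EOF

# On every line matching Sequences, print the previously saved line (prev),
# the output record separator (ORS, a newline), and the current line;
# every line then updates prev for the next iteration
awk '/Sequences/{print prev ORS $0} {prev=$0}' blast.txt
```

This prints three Query/Sequences pairs, skipping the Query= lines (BRNW_428, BRNW_601) that are not directly followed by a Sequences line.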

You are probably looking for
grep -oPz '(?ms)Query=(?:(?!Query).)*?Sequences.*?$'
This enables the PCRE MULTILINE and DOTALL flags via (?ms) and picks out each segment from a Query line to the next Sequences line.
Additionally, the -z flag tells grep to treat NUL as the line separator, making the contents of the file appear to it as a single string.
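As a quick sketch of what this produces on the question's sample data (GNU grep with PCRE support is assumed; the NUL separators emitted by -z are converted back to newlines for display):

```shell
# Recreate the sample input from the question (file name assumed)
cat > blast.txt <<'EOF'
Query= BRNW_157
Sequences producing significant alignments: (Bits) Value
Query= BRNW_428
Query= BRNW_503
Sequences producing significant alignments: (Bits) Value
Query= BRNW_601
Query= BRNW_617
Sequences producing significant alignments: (Bits) Value
EOF

# Each match spans from a Query line to the next Sequences line;
# tr turns the NUL match separators produced by -z back into newlines
grep -oPz '(?ms)Query=(?:(?!Query).)*?Sequences.*?$' blast.txt | tr '\0' '\n'
```

Only the Query= lines directly followed by a Sequences line (BRNW_157, BRNW_503, BRNW_617) appear in the output.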

Related

How to count number of lines with only 1 character?

I'm trying to print just the counted number of lines which have only one character.
I have a file with 200k lines, and some of the lines have only one character (any type of character).
Since I have no experience, I have googled a lot, scraped documentation, and come up with this mixed solution from different sources:
awk -F^\w$ '{print NF-1}' myfile.log
I was expecting that it would filter lines with a single character, and the pattern itself seems to work:
^\w$
However, I'm not getting the number of lines containing a single character; instead I get something else.
If a non-awk solution is OK:
grep -c '^.$'
You could try the following:
awk '/^.$/{++c}END{print c}' file
The variable c is incremented for every line containing only 1 character (any character).
When the parsing of the file is finished, the variable is printed.
In awk, rules like your {print NF-1} are executed for each line. To print only one thing for the whole file you have to use END { print ... }. There you can print a counter which you increment each time you see a line with one character.
However, I'd use grep instead because it is easier to write and faster to execute:
grep -xc . yourFile
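Both counters agree; a minimal check (file name assumed):

```shell
# Sample file with three single-character lines (a, x, &)
cat > sample.txt <<'EOF'
a
hello
x
no
&
EOF

awk '/^.$/{++c} END{print c}' sample.txt   # prints 3
grep -xc . sample.txt                      # prints 3 as well
```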

How can I escape all non-alphanumeric characters in AWK?

I inherited a very large AWK script that matches against .csv files, and I've found it does not match some non-alphanumeric characters, especially + ( ).
While I realize this would be easy in sed:
sed 's/\([^A-z0-9]\)/\\\1/g'
I can't seem to find a way to call on the matched character the same way in AWK.
For instance a sample input is:
select.awk 'Patient data +(B/U)'
I would like to escape the non-alphanumeric characters, and turn the line into:
Patient\ data\ \+\(B\/U\)
I have seen some people pass very obscure non-alphanumeric characters as well, which I would like to escape.
Use gsub, with & standing for the matched text in the replacement:
gsub(/[^[:alnum:]]/, "\\\\&", arg)
The GNU variant, gensub(), has more features:
awk '{n=gensub(/[^[:alnum:]]/,"\\\\&","g"); print n}' d.csv
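A self-contained check using the POSIX gsub form, which needs no GNU extensions (the sample string comes from the question):

```shell
# In the replacement, "\\\\&" becomes \& at the string level: a literal
# backslash followed by &, where & stands for the matched character
awk 'BEGIN{s="Patient data +(B/U)"; gsub(/[^[:alnum:]]/,"\\\\&",s); print s}'
# prints: Patient\ data\ \+\(B\/U\)
```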

extract first instance per line (maybe grep?)

I want to extract the first instance of a string per line in Linux. I am currently trying grep, but it yields all the instances per line. Below, I want the strings (numbers and letters) after "tn="... but only the first set per line. The actual characters could be any combination of numbers or letters, and there is a space after them. There is also a space before the tn=.
Given the following file:
hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge
Desired output:
12g3
k38493
Here's one way you can do it if you have GNU grep, which (mostly) supports Perl Compatible Regular Expressions with -P. Also, the non-standard switch -o is used to only print the part matching the pattern, rather than the whole line:
grep -Po '^.*?tn=\K\S+' file
The pattern matches the start of the line ^, followed by any characters .*?, where the ? makes the match non-greedy. After the first match of tn=, \K "kills" the previous part so you're only left with the bit you're interested in: one or more non-space characters \S+.
As in Ed's answer, you may wish to add a space before tn to avoid accidentally matching something like footn=.... You might also prefer to use something like \w to match "word" characters (equivalent to [[:alnum:]_]).
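Run against the question's sample lines (GNU grep assumed; file name assumed):

```shell
# Recreate the sample input from the question (file name assumed)
cat > data.txt <<'EOF'
hello my name is dog tn=12g3 fun 23k3 hello tn=1d3i9 cheese 234kd dks2 tn=6k4k ksk
1263 chairs are good tn=k38493kd cars run vroom it95958 tn=k22djd fair gold tn=293838 tounge
EOF

# ^.*? lazily matches up to the first tn=, \K then discards that prefix,
# so only the following run of non-space characters is printed
grep -Po '^.*?tn=\K\S+' data.txt
# prints:
# 12g3
# k38493kd
```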
Just split the input on tn= separators and pick the second field. Then split again to get everything up to the first space:
$ awk -F"tn=" '{split($2,a, " "); print a[1]}' file
12g3
k38493kd
$ awk 'match($0,/ tn=[[:alnum:]]+/) {print substr($0,RSTART+4,RLENGTH-4)}' file
12g3
k38493kd

which command is fast to search consecutive patterns in a line

Which command is fastest to search for consecutive patterns in a line in Unix? The word "=" follows the word "Model".
Input File
Model = It supports 10 Modular Controllers
Support Config Page Model = Yes
Model files are here
Output:
Extract the lines where the word "=" comes after the word "Model" and "Model" appears as the first word.
Here the first line of the input file satisfies the criteria: "Model = It supports 10 Modular Controllers".
I have used sed and awk commands but want to know which one is better.
sed -n '/^Model/ s/=/&/p'
sed -n 's/^Model.*=/&/p'
sed -n '/^Model/ {/=/p ;}'
awk '/^Model.*=/'
Can someone please tell me which one is faster and better?
As Ed Morton says, avoiding regex is faster. My proposed solution is
awk '{a=index($0, "Model"); b=index($0, "=")} a==1 && a<b'
First I get the position of each substring, then I compare them to avoid a double search for Model. Another solution could be:
awk 'index($0, "Model")!=1{next} index($0, "=")>1'
In both scripts I'm assuming that Model must be the first word (you are using "^" in your regexps). The second script checks that Model is present as the first word of the string; once that is validated, it only checks that the "=" comes after it (its position is greater than 1).
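A check against the question's input file (file name assumed):

```shell
# Recreate the input from the question (file name assumed)
cat > models.txt <<'EOF'
Model = It supports 10 Modular Controllers
Support Config Page Model = Yes
Model files are here
EOF

# a==1 means the line starts with Model; a<b means an = occurs after it
# (index returns 0 when = is absent, so lines without = never pass a<b)
awk '{a=index($0, "Model"); b=index($0, "=")} a==1 && a<b' models.txt
# prints only: Model = It supports 10 Modular Controllers
```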

How can I remove lines that contain more than N words

Is there a good one-liner in bash to remove lines containing more than N words from a file?
example input:
I want this, not that, but thank you it is very nice of you to offer.
The very long sentence finding form ordering system always and redundantly requires an initial, albeit annoying and sometimes nonsensical use of commas, completion of the form A-1 followed, after this has been processed by the finance department and is legal, by a positive approval that allows for the form B-1 to be completed after the affirmative response to the form A-1 is received.
example output:
I want this, not that, but thank you it is very nice of you to offer.
In Python I would code something like this:
if len(line.split()) < 40:
    print line
To only show lines containing less than 40 words, you can use awk:
awk 'NF < 40' file
Using the default field separator, each word is treated as a field. Lines with less than 40 fields are printed.
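On the question's sample input (file name assumed), only the 16-word line survives:

```shell
# Recreate the sample input from the question (file name assumed)
cat > text.txt <<'EOF'
I want this, not that, but thank you it is very nice of you to offer.
The very long sentence finding form ordering system always and redundantly requires an initial, albeit annoying and sometimes nonsensical use of commas, completion of the form A-1 followed, after this has been processed by the finance department and is legal, by a positive approval that allows for the form B-1 to be completed after the affirmative response to the form A-1 is received.
EOF

# NF is the number of whitespace-separated fields (words) on the current line
awk 'NF < 40' text.txt
# prints: I want this, not that, but thank you it is very nice of you to offer.
```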
Note this answer addresses a different reading of the question: how to print lines shorter than a given number of characters.
Use awk with length():
awk 'length($0)<40' file
You can even give the length as a parameter:
awk -v maxsize=40 'length($0) < maxsize' file
A test with 10 characters:
$ cat a
hello
how are you
i am fine but
i would like
to do other
things
$ awk 'length($0)<10' a
hello
things
If you feel like using sed for this, you can say:
sed -rn '/^.{,39}$/p' file
This checks if the line contains less than 40 characters. If so, it prints it.
