How can I search for a multiline pattern in a file? - linux

I needed to find all the files that contained a specific string pattern. The first solution that comes to mind is using find piped with xargs grep:
find . -iname '*.py' | xargs grep -e 'YOUR_PATTERN'
But if I need to find a pattern that spans more than one line, I'm stuck, because vanilla grep can't match multiline patterns.

Why don't you go for awk:
awk '/Start pattern/,/End pattern/' filename
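A minimal sketch of how the range pattern behaves, assuming a small sample file (its name and contents are made up for illustration). awk prints every line from a line matching the first regex through the next line matching the second:

```shell
# Hypothetical sample file, created only for this demo.
tmp=$(mktemp -d)
printf '%s\n' 'header' 'Start pattern' 'body line' 'End pattern' 'footer' > "$tmp/sample.txt"

# Print from the line matching the start regex through the line matching the end regex.
awk_out=$(awk '/Start pattern/,/End pattern/' "$tmp/sample.txt")
printf '%s\n' "$awk_out"

rm -rf "$tmp"
```

The header and footer lines are skipped; only the three lines of the block are printed.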

Here is the example using GNU grep:
grep -Pzo '_name.*\n.*_description'
-z/--null-data Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline.
Which has the effect of treating the whole file as one large line.
See -z description on grep's manual and also common question no 14 on grep's manual usage page
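To tie this back to the original question (which files contain the pattern?), here is a sketch that combines -z with -l. It assumes GNU grep built with PCRE support; the two .py files are hypothetical samples:

```shell
# Hypothetical sample files: only a.py has _description on the line right after _name.
tmp=$(mktemp -d)
printf '%s\n' 'class Foo:' "    _name = 'foo'" "    _description = 'Foo model'" > "$tmp/a.py"
printf '%s\n' "_name = 'bar'" 'x = 1' "_description = 'Bar model'" > "$tmp/b.py"

# -z: treat each file as one NUL-terminated record, so \n can appear inside a match.
# -l: print only the names of the files that contain a match.
matches=$(find "$tmp" -iname '*.py' -exec grep -Pzl '_name.*\n.*_description' {} +)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

With -o instead of -l, each match is printed followed by a NUL byte, so piping the output through tr '\0' '\n' makes it easier to read.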

So I discovered pcregrep which stands for Perl Compatible Regular Expressions GREP.
the -M option makes it possible to search for patterns that span line boundaries.
For example, you need to find files where the '_name' variable is followed on the next line by the '_description' variable:
find . -iname '*.py' | xargs pcregrep -M '_name.*\n.*_description'
Tip: you need to include the line break character in your pattern. Depending on your platform, it could be '\n', '\r', '\r\n', ...

grep -P also uses libpcre, but is much more widely installed. Because grep reads its input line by line, add -z so that the whole file is treated as a single record. To find the complete title section of an HTML document, even if it spans multiple lines, you can use this:
grep -Pzo '(?s)<title>.*</title>' example.html
Since the PCRE project follows Perl's regular expression syntax, use the Perl documentation for reference:
http://perldoc.perl.org/perlre.html#Modifiers
http://perldoc.perl.org/perlre.html#Extended-Patterns

Here is a more useful example:
pcregrep -Mi "<title>(.*\n){0,5}</title>" afile.html
It finds the title tag in an HTML file even if it spans up to 5 lines.
Here is an example of unlimited lines:
pcregrep -Mi "(?s)<title>.*</title>" example.html

With silver searcher:
ag 'abc.*(\n|.)*efg'
Speed optimizations of silver searcher could possibly shine here.

@Marcin:
A non-greedy awk example, which stops after the first End pattern:
awk '{if ($0 ~ /Start pattern/) {triggered=1;}if (triggered) {print; if ($0 ~ /End pattern/) { exit;}}}' filename
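A sketch contrasting the two awk approaches on a hypothetical file that contains two blocks. The range pattern prints both blocks, while the exit-based variant (the one-liner above, spread over several lines) stops after the first:

```shell
# Hypothetical sample file with two Start/End blocks.
tmp=$(mktemp -d)
printf '%s\n' 'Start pattern' 'first block' 'End pattern' \
              'Start pattern' 'second block' 'End pattern' > "$tmp/sample.txt"

# Range pattern: prints every Start..End block in the file.
all_blocks=$(awk '/Start pattern/,/End pattern/' "$tmp/sample.txt")

# Exit-based variant: start printing at the first Start, quit at the first End.
first_block=$(awk '
  /Start pattern/ { triggered = 1 }
  triggered       { print; if ($0 ~ /End pattern/) exit }
' "$tmp/sample.txt")

printf 'all blocks:\n%s\n\nfirst block only:\n%s\n' "$all_blocks" "$first_block"
rm -rf "$tmp"
```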

You can use the grep alternative sift here (disclaimer: I am the author).
It supports multiline matching and limiting the search to specific file types out of the box:
sift -m --files '*.py' 'YOUR_PATTERN'
(search all *.py files for the specified multiline regex pattern)
It is available for all major operating systems. Take a look at the samples page to see how it can be used to extract multiline values from an XML file.

This answer might be useful:
Regex (grep) for multi-line search needed
To find recursively you can use flags -R (recursive) and --include (GLOB pattern). See:
Use grep --exclude/--include syntax to not grep through certain files

perl -ne 'print if (/begin pattern/../end pattern/)' filename

Using ex/vi editor and globstar option (syntax similar to awk and sed):
ex +"/aaa/,/bbb/p" -R -scq! file.txt
where aaa is the starting text and bbb is the ending text.
To search recursively, try:
ex +"/aaa/,/bbb/p" -scq! **/*.py
Note: To enable the ** syntax in Bash 4 or later, run shopt -s globstar; zsh supports ** by default.

Related

grep and cut a specific pattern [duplicate]

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like (bold is by me);
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?
Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using older versions of grep (like 2.4.2) which do not include the -o option, then use the above. Else use the simpler to maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh prints only the text matched by the regular expression (without the filename), just as you would expect a regular expression to behave in vim and similar tools. Which word or regular expression to search for is up to you, as long as you stay with POSIX syntax rather than Perl syntax (see below).
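A sketch that reproduces the question's example with the POSIX character class version. The three files are recreated in a temporary directory; the sort -u step is an optional extra that is not part of the answer:

```shell
# Recreate the three sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'the cat sat on the mat\n' > "$tmp/some-text-file"
printf 'the quick brown fox\n' > "$tmp/some-other-text-file"
printf 'i hope this explains it thoroughly\n' > "$tmp/yet-another-text-file"

# -o prints each match on its own line; -h suppresses the filename prefix.
words=$(cd "$tmp" && grep -oh "[[:alpha:]]*th[[:alpha:]]*" *)
printf '%s\n' "$words"

# Optional: collapse repeated words into a unique, sorted list.
unique_words=$(printf '%s\n' "$words" | sort -u)

rm -rf "$tmp"
```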
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
The usage of \w varies from platform to platform, as it's an extended "perl" syntax. As such, those grep installations that are limited to work with POSIX character classes use [[:alpha:]] and not its perl equivalent of \w. See the Wikipedia page on regular expression for more
Ultimately, the POSIX answer above will be a lot more reliable regardless of platform (being the original) for grep
As for support of grep without -o option, the first grep outputs the relevant lines, the tr splits the spaces to new lines, the final grep filters only for the respective lines.
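The three stages of that workaround can be sketched as follows, using a made-up one-line sample file. Without -o, the bracket expression selects the same lines as a plain th, so the sketch uses the shorter pattern:

```shell
# Hypothetical sample file for the demo.
tmp=$(mktemp -d)
printf 'i hope this explains it thoroughly\n' > "$tmp/sample.txt"

# 1) keep lines containing "th", 2) put each word on its own line,
# 3) keep only the words that contain "th".
split_words=$(grep -h 'th' "$tmp/sample.txt" | tr ' ' '\n' | grep 'th')
printf '%s\n' "$split_words"

rm -rf "$tmp"
```

Note that this splits on spaces only, so words separated by tabs or punctuation would stay glued together.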
(PS: I know most platforms by now would have been patched for \w.... but there are always those that lag behind)
Credit for the "-o" workaround goes to @AdamRosenfield's answer
It's more simple than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: Grep will work with extended regular expression.
w : Matches only word/words instead of substring.
o : Display only matched pattern instead of whole line.
i : If you want to ignore case sensitivity.
You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th
Just awk; there is no need for a combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly
grep command for only matching and perl
grep -o -P 'th.*? ' filename
I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010
cat *-text-file | grep -Eio "th[a-z]+"
You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
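The listing above can be reproduced with the sketch below. The -x comparison is an addition for contrast and not part of the Wikipedia example: -w accepts a match bordered by non-word characters such as "-", while -x requires the whole line to match.

```shell
# Recreate fruitlist.txt from the example.
tmp=$(mktemp -d)
printf '%s\n' apple apples pineapple apple- apple-fruit fruit-apple > "$tmp/fruitlist.txt"

# -w: "apple" must not be directly preceded or followed by a letter, digit or underscore.
whole_words=$(grep -w apple "$tmp/fruitlist.txt")

# -x: the entire line must be exactly "apple".
whole_lines=$(grep -x apple "$tmp/fruitlist.txt")

printf -- '-w:\n%s\n\n-x:\n%s\n' "$whole_words" "$whole_lines"
rm -rf "$tmp"
```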
I had a similar problem: I wanted grep to output only the text matched by the regex.
In the end I used egrep with the -o option (the same regex with grep -e or -G didn't give me the same result as egrep).
So I think it could be something like this (I'm NOT a regex master):
egrep -o "the*|this{1}|thoroughly{1}" filename
To search for all the words that start with "icon-", the following command works perfectly. I am using Ack here, which is similar to grep but with better options and nice formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq
You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'
grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal
$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.
ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It matches all words containing th.

line return in grep search?

I have some files with the text:
xxxxx
xxxxx
<cert>
</cert>
some other stuff
How can I search with grep and ignore the line returns?
I have many files in the same folder.
I have tried this but it does not seem to stop running:
tr '\n' ' ' | grep '<cert></cert>' *
That is searching for a multi-line pattern, which the usual grep does not appear to support. There are alternative tools, e.g.,
How can I search for a multiline pattern in a file?, which suggests pcregrep, or custom awk, perl scripts.
How can I “grep” patterns across multiple lines?, again suggesting pcregrep (as well as sed scripts).
However, GNU grep is said to support this as well:
How do I grep for multiple patterns on multiple lines? gives as an example
grep -Pzo "^begin\$(.|\n)*^end$" file
to use a newline in a pattern. The options used however include the "experimental" -P which may make it less suitable than pcregrep:
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is highly
experimental and grep -P may warn of unimplemented features.
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline. Like the
-Z or --null option, this option can be used with commands like
sort -z to process arbitrary file names.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
Some experimental options are useful, others less so. This one was noted as the source of problems in Searching for non-ascii characters.
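The attempt in the question appears to hang because tr is given no input and waits on standard input, while grep reads the files named by * directly. A portable sketch that avoids -P: join the lines of each file separately, then search the joined text. The file names and contents are hypothetical:

```shell
# Hypothetical sample files: one with an empty <cert> element, one with content.
tmp=$(mktemp -d)
printf '%s\n' 'xxxxx' 'xxxxx' '<cert>' '</cert>' 'some other stuff' > "$tmp/empty-cert.txt"
printf '%s\n' '<cert>' 'certificate data' '</cert>' > "$tmp/full-cert.txt"

cert_files=$(
  for f in "$tmp"/*.txt; do
    # Delete the newlines of this one file, then search the joined text.
    if tr -d '\n' < "$f" | grep -q '<cert></cert>'; then
      printf '%s\n' "$f"
    fi
  done
)
printf '%s\n' "$cert_files"

rm -rf "$tmp"
```

Only the file whose opening and closing tags sit on adjacent lines is listed.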

Some help needed on grep

I am trying to find alphanumeric strings, which may also include the two characters "/" and "+", that are at least 30 characters long.
I have written this code,
grep "[a-zA-Z0-9\/\+]{30,}" tmp.txt
cat tmp.txt
> array('rWmyiJgKT8sFXCmMr639U4nWxcSvVFEur9hNOOvQwF/tpYRqTk9yWV2xPFBAZwAPRVs/s
ddd73ZEjfy+airfy8DtqIqKI9+dd 6hdd7soJ9iG0sGs/ld5f2GHzockoYHfh
+pAzx/t17Crf0T/2+8+reo+MU39lqCr02sAkcC1k/LzyBvSDEtu9N/9NHicr jA3SvDqg5s44DFlaNZ/8BW37fGEf2rk13S/q68OVVyzac7IT7yE7PIL9XZ/6LsmrY
KEsAmN4i/+ym8be3wwn KWGYaIB908+7W98pI6qao3iaZB
3mh7Y/nZm52hyLa37978f+PyOCqUh0Wfx2PL3vglofi0l
QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt
This does not work. Mainly, I want the string to have a minimum length of 30.
In the syntax of grep, the repetition braces need to be backslashed.
grep -o '[a-zA-Z0-9/+]\{30,\}' file
If you want to constrain the match to lines containing only matches to this pattern, add line-start and line-ending anchors:
grep '^[a-zA-Z0-9/+]\{30,\}$' file
The -o option in the first command line causes grep to only print the matching part, not the entire matching line.
The repetition operator is not directly supported in Basic Regular Expression syntax. Use grep -E to enable Extended Regular Expression syntax, or backslash the braces.
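A sketch showing that the two spellings are equivalent. The sample file is made up; the long token is copied from the data in the question:

```shell
# Hypothetical sample file: one short token and one 64-character token.
tmp=$(mktemp -d)
printf '%s\n' 'short+token/abc' \
  'key = QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt' > "$tmp/tmp.txt"

# Basic regular expressions: the braces must be backslashed.
bre=$(grep -o '[a-zA-Z0-9/+]\{30,\}' "$tmp/tmp.txt")

# Extended regular expressions (-E): plain braces.
ere=$(grep -Eo '[a-zA-Z0-9/+]{30,}' "$tmp/tmp.txt")

printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
rm -rf "$tmp"
```

Inside a bracket expression neither / nor + needs a backslash, which is why the corrected patterns drop them.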
You can use
grep -e "^[a-zA-Z0-9/+]\{30,\}" tmp.txt
grep -e "^[a-zA-Z0-9/+]\{30,\}" tmp.txt
+pAzx/t17Crf0T/2+8+reo+MU39lqCr02sAkcC1k/LzyBvSDEtu9N/9NHicr jA3SvDqg5s44DFlaNZ/8BW37fGEf2rk13S/q68OVVyzac7IT7yE7PIL9XZ/6LsmrY
3mh7Y/nZm52hyLa37978f+PyOCqUh0Wfx2PL3vglofi0l
QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt
man grep
Read up about the difference between regular and extended patterns. You need the -E option.

Linux Prompt Change Content Within File based on File Name

I know how to do a search and replace amongst group of files:
perl -pi -w -e 's/search/replace/g;' *.php
So I can use that to search for a keyword or phrase and change it. But I have a more complicated task I don't know how to do.
I want to do a search and replace among all my php files to search for a specific Keyword and replace it with the File Name minus the extension.
Example: Search the file Mountains.php for the keyword Trees and everywhere you see Trees, replace it with Mountains
Of course I want to be able to do that in batch, for a few hundred php files all with different names, however, all containing the search term Trees.
If someone is looking for an extra challenge, haha, it would be even better if I could do a more complex scenario such as....
Example: Search the file MountainTowns.php for the keyword Trees and everywhere you see Trees, replace it with "Mountain Towns" (note the extra space; capital letters would indicate where spaces go)
Thanks for your time and considering my question.
Well, the filename is in $ARGV, so there is not much more work needed.
perl -i -pe '($x=$ARGV)=~s{\.php$}{};s{Trees}{$x}g' BlueMountains.php RedMountains.php
Add in
$x=~s{(.)([A-Z])}{$1 $2}g;
to add a space before uppercase letters, for a complete line of
perl -i -pe '($x=$ARGV)=~s{\.php$}{};$x=~s{(.)([A-Z])}{$1 $2}g;s{Trees}{$x}g' BlueRedMountains.php
This might work for you:
printf "%s\n" *.php |perl -pwe 's|(.*)\.php|perl -pi -we "s/Trees/$1/g;" $&|' | bash
This uses perl to write a script to do your bidding.
Other little languages could be employed, like awk or:
printf "%s\n" *.php |sed 'h;s/\.php//;s/\B[A-Z]/ &/;G;s|\(.*\)\n\(.*\)|sed -i "s/Trees/\1/g" \2|' | bash
This uses sed to provide a solution for the second request.
You want a separate replacement for each file, so run a separate search and replace for each:
for file in *.php; do sed -i "s/foo/${file%.*}/g" "$file"; done
And your second request is a bit harder, it at least requires a subshell.
for file in *; do sed -i "s/bar/$(echo ${file%.*} | sed 's/\(.\)\([A-Z]\)/\1 \2/')/g" "$file"; done
It's a bit more readable if you put it in a script:
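#!/bin/bash
# Replace "bar" in each file with the file's base name, adding a space before an inner capital letter.
for file in "$@"; do
    replacement=$(echo "${file%.*}" | sed 's/\(.\)\([A-Z]\)/\1 \2/')
    sed -i "s/bar/$replacement/g" "$file"
done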
This will work over all the arguments passed it, so call with ./script.sh *.php.

Replacing a line in a csv file?

I have a set of 10 CSV files, which normally have entries of this kind
a,b,c,d
d,e,f,g
Now, due to some error, entries in these files have become like this
a,b,c,d
d,e,f,g
,,,
h,i,j,k
Now I want to remove the line with only commas in all the files. These files are on a Linux filesystem.
Can you recommend a command that replaces the erroneous lines in all the files?
It depends on what you mean by replace. If you mean 'remove', then a trivial variant on @wnoise's solution is:
grep -v '^,,,$' old-file.csv > new-file.csv
Note that this deletes just those lines with exactly three commas. If you want to delete mal-formed lines with any number of commas (including zero) - and no other characters on the line, then:
grep -v '^,*$' ...
There are endless other variations on the regex that would deal with other scenarios. Dealing with full CSV data with commas inside quotes starts to need something other than a regex machine. It can be done, within broad limits, especially in more complex regex systems such as PCRE or Perl. But it requires more work.
Check out Mastering Regular Expressions.
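Since the question covers a whole set of files, here is a sketch that applies the removal to every CSV file in a directory. The file names and the .tmp suffix are placeholders for this demo, and the pattern '^,,*$' removes lines made up of one or more commas:

```shell
# Hypothetical sample files, each containing a comma-only line.
tmp=$(mktemp -d)
printf '%s\n' 'a,b,c,d' 'd,e,f,g' ',,,' 'h,i,j,k' > "$tmp/one.csv"
printf '%s\n' ',,,' 'x,y,z,w' > "$tmp/two.csv"

for f in "$tmp"/*.csv; do
  # Keep every line that is not made up solely of commas, then replace the original.
  grep -v '^,,*$' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

one=$(cat "$tmp/one.csv")
two=$(cat "$tmp/two.csv")
printf '%s\n---\n%s\n' "$one" "$two"
rm -rf "$tmp"
```

One caveat: grep -v exits with a non-zero status when no line survives, so a file consisting only of comma lines is left untouched (and its .tmp file is left behind) rather than being emptied.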
sed 's/,,,/replacement/' < old-file.csv > new-file.csv
optionally followed by
mv new-file.csv old-file.csv
Replace or remove, your post is not clear... For replacement see wnoise's answer. For removing, you could use
awk '$0 !~ /,,,/ {print}' <old-file.csv > new-file.csv
What about trying to keep only lines which are matching the desired format instead of handling one exception ?
If the provided input is what you really want to match:
grep -E '[a-z],[a-z],[a-z],[a-z]' < oldfile.csv > newfile.csv
If the input is different, provide it, the regular expression should not be too hard to write.
Do you want to replace them with something, or delete them entirely? Either way, it can be done with sed. To delete:
sed -i -e '/^,\+$/ D' yourfile1.csv yourfile2.csv ...
To replace: well, see wnoise's answer, or if you don't want to create new files with the output,
sed -i -e '/^,\+$/ s//replacement/' yourfile1.csv yourfile2.csv ...
or
sed -i -e '/^,\+$/ c\
replacement' yourfile1.csv yourfile2.csv ...
(that should be entered exactly as is, including the line break). Of course, you can also do this with awk or perl or, if you're only deleting lines, even grep:
egrep -v '^,+$' < oldfile.csv > newfile.csv
I tested these to make sure they work, but I'd advise you to do the same before using them (just in case). You can omit the -i option from sed, in which case it'll print out the results (rather than writing them back to the file), or omit the output redirection >newfile.csv from grep.
EDIT: It was pointed out in a comment that some features of these sed commands only work on GNU sed. As far as I can tell, these are the -i option (which can be replaced with shell redirection, sed ... <infile >outfile ) and the \+ modifier (which can be replaced with \{1,\} ).
Most simply:
$ grep -v ',,,' oldfile > newfile
$ mv newfile oldfile
Yes, awk and grep are very good options if you are working on a Linux platform. On other platforms you can use Perl regular expressions together with the join and split functions.
