Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like:
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?
Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
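As a quick check of how the two flags combine, here is a minimal sketch that recreates the three sample lines from the question in a scratch directory and runs the command above. The directory and the variable names demo_dir and matches are made up for illustration, and it assumes a grep that understands -o and \w, such as GNU grep.

```shell
# Recreate the question's three files in a throwaway directory.
demo_dir=$(mktemp -d)
printf 'the cat sat on the mat\n' > "$demo_dir/some-text-file"
printf 'the quick brown fox\n' > "$demo_dir/some-other-text-file"
printf 'i hope this explains it thoroughly\n' > "$demo_dir/yet-another-text-file"

# -o prints each match on its own line; -h drops the "filename:" prefix.
matches=$(cd "$demo_dir" && grep -oh '\w*th\w*' *)
printf '%s\n' "$matches"

rm -rf "$demo_dir"
```

With these sample lines it should print the, the, the, this and thoroughly, one per line.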
Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using older versions of grep (like 2.4.2) which do not include the -o option, then use the above. Else use the simpler to maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh prints only the parts of the file content that match the regular expression (without the filename), just as you would expect a regular expression to work in vim and similar tools. Which word or regular expression you search for is up to you, as long as you stick with POSIX rather than Perl syntax (see below).
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
Support for \w varies from platform to platform, as it is an extended, Perl-style syntax. grep installations that are limited to POSIX character classes need [[:alpha:]] rather than its rough Perl equivalent, \w. See the Wikipedia page on regular expressions for more.
Ultimately, the POSIX answer above will be a lot more reliable regardless of platform (being the original) for grep
As for support of grep without -o option, the first grep outputs the relevant lines, the tr splits the spaces to new lines, the final grep filters only for the respective lines.
(PS: I know most platforms by now would have been patched for \w.... but there are always those that lag behind)
Credit for the "-o" workaround goes to @AdamRosenfield's answer
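The tr-based pipeline for greps without -o can be wrapped up roughly as follows. This is only a sketch: words_containing is a hypothetical helper name, and it splits on any whitespace rather than only on spaces.

```shell
# Hypothetical helper: print every whitespace-separated token on stdin
# that contains the substring given as $1, without relying on grep -o.
words_containing() {
    # Put each token on its own line, then keep the lines that match.
    tr -s '[:space:]' '\n' | grep "[[:alpha:]]*$1[[:alpha:]]*"
}

printf 'i hope this explains it thoroughly\n' | words_containing th
```

Note that tokens keep any punctuation attached to them, because the splitting is done on whitespace only.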
It's simpler than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: grep with extended regular expressions (same as grep -E).
w : Match only whole words instead of substrings.
o : Display only the matched pattern instead of the whole line.
i : Ignore case.
You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th
Just awk; no need for a combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly
grep command for only matching and perl
grep -o -P 'th.*? ' filename
I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010
cat *-text-file | grep -Eio "th[a-z]+"
You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
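The Wikipedia listing above can be reproduced with a short sketch. It also tries -x (match the whole line), which is one possible way to get only the bare apple line; the temporary file and variable names are invented for this illustration.

```shell
# Build the sample list in a temporary file.
fruitlist=$(mktemp)
printf '%s\n' apple apples pineapple apple- apple-fruit fruit-apple > "$fruitlist"

# -w: "apple" must be delimited by non-word characters, and "-" is one.
word_hits=$(grep -cw apple "$fruitlist")

# -x: the pattern has to match the entire line.
line_hits=$(grep -cx apple "$fruitlist")

printf 'lines matched with -w: %s, with -x: %s\n' "$word_hits" "$line_hits"
rm -f "$fruitlist"
```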
I had a similar problem, looking for grep/pattern regex and the "matched pattern found" as output.
At the end I used egrep (same regex on grep -e or -G didn't give me the same result of egrep) with the option -o
so, I think that could be something similar to (I'm NOT a regex Master) :
egrep -o "the*|this{1}|thoroughly{1}" filename
To search for all the words that start with "icon-", the following command works perfectly. I am using Ack here, which is similar to grep but with better options and nice formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq
You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'
grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal
$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.
ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It'll match all words containing th.
Related
I am trying to find alphanumeric string including these two characters "/+" with at least 30 characters in length.
I have written this code,
grep "[a-zA-Z0-9\/\+]{30,}" tmp.txt
cat tmp.txt
> array('rWmyiJgKT8sFXCmMr639U4nWxcSvVFEur9hNOOvQwF/tpYRqTk9yWV2xPFBAZwAPRVs/s
ddd73ZEjfy+airfy8DtqIqKI9+dd 6hdd7soJ9iG0sGs/ld5f2GHzockoYHfh
+pAzx/t17Crf0T/2+8+reo+MU39lqCr02sAkcC1k/LzyBvSDEtu9N/9NHicr jA3SvDqg5s44DFlaNZ/8BW37fGEf2rk13S/q68OVVyzac7IT7yE7PIL9XZ/6LsmrY
KEsAmN4i/+ym8be3wwn KWGYaIB908+7W98pI6qao3iaZB
3mh7Y/nZm52hyLa37978f+PyOCqUh0Wfx2PL3vglofi0l
QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt
this does not work, Mainly I wanted to have minimum length of the string to be 30
In the syntax of grep, the repetition braces need to be backslashed.
grep -o '[a-zA-Z0-9/+]\{30,\}' file
If you want to constrain the match to lines containing only matches to this pattern, add line-start and line-ending anchors:
grep '^[a-zA-Z0-9/+]\{30,\}$' file
The -o option in the first command line causes grep to only print the matching part, not the entire matching line.
The repetition operator is not directly supported in Basic Regular Expression syntax. Use grep -E to enable Extended Regular Expression syntax, or backslash the braces.
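To see the difference between the three spellings, here is a small sketch that counts matching lines for each one. The two sample strings are made up (one is longer than 30 characters, one is short), and the variable names are only for illustration.

```shell
# One candidate longer than 30 characters and one that is too short.
sample=$(printf '%s\n' 'QVrOM1pg+mFLEIC88B706UzL4Pss7ouE' 'abc/+123')

# BRE with bare braces: they are ordinary characters, so nothing matches.
# (grep -c exits non-zero on zero matches, hence the "|| true".)
bre_plain=$(printf '%s\n' "$sample" | grep -c '[a-zA-Z0-9/+]{30,}' || true)

# BRE with backslashed braces: a real interval expression.
bre_escaped=$(printf '%s\n' "$sample" | grep -c '[a-zA-Z0-9/+]\{30,\}' || true)

# ERE (-E): braces are interval operators without backslashes.
ere=$(printf '%s\n' "$sample" | grep -Ec '[a-zA-Z0-9/+]{30,}' || true)

printf 'plain BRE: %s, escaped BRE: %s, ERE: %s\n' "$bre_plain" "$bre_escaped" "$ere"
```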
You can use
grep -e "^[a-zA-Z0-9/+]\{30,\}" tmp.txt
grep -e "^[a-zA-Z0-9/+]\{30,\}" tmp.txt
+pAzx/t17Crf0T/2+8+reo+MU39lqCr02sAkcC1k/LzyBvSDEtu9N/9NHicr jA3SvDqg5s44DFlaNZ/8BW37fGEf2rk13S/q68OVVyzac7IT7yE7PIL9XZ/6LsmrY
3mh7Y/nZm52hyLa37978f+PyOCqUh0Wfx2PL3vglofi0l
QVrOM1pg+mFLEIC88B706UzL4Pss7ouEo+EsrES+/qJq9Y1e/UGvwefOWSL2TJdt
man grep
Read up about the difference between regular and extended patterns. You need the -E option.
I want to search Exact word pattern in Unix.
Example: Log.txt file contains following text:
aaa
bbb
cccaaa ---> this should not be counted in grep output looking for aaa
I am using following code:
count=$?
count=$(grep -c aaa $EAT_Setup_BJ3/Log.txt)
Here output should be ==> 1 not 2, using above code I am getting 2 as output.
Something is missing, so can anyone help me with this, please?
Use whole word option:
grep -c -w aaa $EAT_Setup_BJ3/Log.txt
From the grep manual:
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must
either be at the beginning of the line, or preceded by a non-word constituent character.
As noted in the comment -w is a GNU extension. With a non GNU grep you can use the word boundaries:
grep -c "\<aaa\>" $EAT_Setup_BJ3/Log.txt
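A minimal sketch of the difference, using a temporary file as a stand-in for Log.txt (the file and variable names are assumptions made for this example, and \< \> are assumed to be supported, as they are in GNU grep):

```shell
# Stand-in for Log.txt with the three lines from the question.
log=$(mktemp)
printf '%s\n' aaa bbb cccaaa > "$log"

plain_count=$(grep -c aaa "$log")            # substring match, counts cccaaa too
word_count=$(grep -cw aaa "$log")            # whole-word match
boundary_count=$(grep -c '\<aaa\>' "$log")   # explicit word boundaries

printf 'plain: %s, -w: %s, boundaries: %s\n' "$plain_count" "$word_count" "$boundary_count"
rm -f "$log"
```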
Word boundary matching is an extension to the standard POSIX grep utility. It might be available or not. If you want to search for words portably, I suggest you look into perl instead, where you would use
perl -ne 'print if /\baaa\b/' $EAT_Setup_BJ3/Log.txt
You can use a word boundary (\b) in regex to match an exact word. To enable extended regex, use the -E flag with grep.
Solution:
grep -E "\baaa\b" $EAT_Setup_BJ3/Log.txt
I'm trying to come up with a way to find a specific flag in a man-page. Usually, I type '/'
to search for something, followed by something like '-Werror' to find a specific flag.
The thing is though that there are man-pages (gcc is the one motivating me right now) that
have a LOT of references to flags in their text, so there are a lot of occurrences.
It's not that big of a deal, but maybe it can be done a bit better. I thought of looking for
something like '-O\n' but it didn't work (probably because the man program doesn't use C escapes?)
Then I've tried something like man gcc | grep $'-O\n', since I kind of recall that a
single-quoted string preceded by a dollar sign makes bash interpret common C escapes...
It didn't work; grep echoed the whole man-page.
That's what has brought me here: why? or rather, can this be done?
rici's helpful answer explains the problem with the original approach well.
However, there's another thing worth mentioning:
man's output contains formatting control characters, which interfere with text searches.
If you pipe to col -b before searching, these control characters are removed - note the side effect that the search results will be plain-text too.
However, grep is not the right tool for this job; I suggest using awk as follows to obtain the description of -O:
man gcc | col -b | awk -v RS= '/^\s+-O\n/'
RS= (an empty input-record separator) is an awk idiom that breaks the input into blocks of non-empty lines, so matching the option at the start of such a block ensures that all lines comprising the description of the option are returned.
If you have a POSIX-features-only awk such as BSD/OSX awk, use this version:
man gcc | col -b | awk -v RS= '/^[[:blank:]]+-O\n/'
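The paragraph-mode idiom can be tried without a real man page. The sketch below feeds awk a few invented lines that mimic col -b output; the text in page is made up, and [ \t] is used in place of [[:blank:]] only so that very old awk builds cope as well.

```shell
# Plain-text stand-in for `man gcc | col -b` output (invented for this sketch).
page=$(printf '%s\n' \
    '    -O' \
    '        Same as -O1.' \
    '' \
    '    -O2 Optimize even more.' \
    '' \
    '    -Werror' \
    '        Make all warnings into errors.')

# RS= switches awk to paragraph mode: each blank-line-separated block is
# one record, so a match prints the option together with its description.
option_block=$(printf '%s\n' "$page" | awk -v RS= '/^[ \t]+-O\n/')
printf '%s\n' "$option_block"
```

Only the first block is printed, because the pattern requires a line break directly after -O.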
Obviously, such a command is somewhat cumbersome to type, so find generic bash function manopt below, which returns the description of the specified option for the specified command from its man page. (There can be false positives and negatives, but overall it works pretty well.)
Examples:
manopt gcc O # search `man gcc` for description of `-O`
manopt grep regexp # search `man grep` for description of `--regexp`
manopt find '-exec.*' # search `man find` for all actions _starting with_ '-exec'
bash function manopt() - place in ~/.bashrc, for instance:
# SYNOPSIS
# manopt command opt
#
# DESCRIPTION
# Returns the portion of COMMAND's man page describing option OPT.
# Note: Result is plain text - formatting is lost.
#
# OPT may be a short option (e.g., -F) or long option (e.g., --fixed-strings);
# specifying the preceding '-' or '--' is OPTIONAL - UNLESS with long option
# names preceded only by *1* '-', such as the actions for the `find` command.
#
# Matching is exact by default; to turn on prefix matching for long options,
# quote the prefix and append '.*', e.g.: `manopt find '-exec.*'` finds
# both '-exec' and '-execdir'.
#
# EXAMPLES
# manopt ls l # same as: manopt ls -l
# manopt sort reverse # same as: manopt sort --reverse
# manopt find -print # MUST prefix with '-' here.
# manopt find '-exec.*' # find options *starting* with '-exec'
manopt() {
local cmd=$1 opt=$2
[[ $opt == -* ]] || { (( ${#opt} == 1 )) && opt="-$opt" || opt="--$opt"; }
man "$cmd" | col -b | awk -v opt="$opt" -v RS= '$0 ~ "(^|,)[[:blank:]]+" opt "([[:punct:][:space:]]|$)"'
}
fish implementation of manopt():
Contributed by Ivan Aracki.
function manopt
set -l cmd $argv[1]
set -l opt $argv[2]
if not echo $opt | grep '^-' >/dev/null
if [ (string length $opt) = 1 ]
set opt "-$opt"
else
set opt "--$opt"
end
end
man "$cmd" | col -b | awk -v opt="$opt" -v RS= '$0 ~ "(^|,)[[:blank:]]+" opt "([[:punct:][:space:]]|$)"'
end
I suspect you didn't actually use grep $'-O\n', but rather some flag recognized by grep.
From grep's point of view, you are simply passing an argument, and that argument starts with a - so it's going to be interpreted as an option. You need to do something like grep -- -O$ to explicitly flag the end of the list of options, or grep -e -O$ to explicitly flag the pattern as a pattern. In any event, you cannot include a newline in a pattern because grep patterns are actually lists of patterns separated by newline characters, so the argument $'foo\n' is actually two patterns, foo and the empty string, and the empty string will match every line.
Perhaps you searched for the flag -e since that takes a pattern as an argument, and giving it a newline as an argument will cause grep to find every line in the whole file.
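The points above can be checked with a few counts. This is a sketch on three made-up input lines; the variable names are only for illustration.

```shell
# Three made-up lines that mention option-like text.
flags=$(printf '%s\n' '-O' '-O2 extra' 'see -Os')

# "--" ends option parsing, so -O$ is read as the pattern.
with_dashes=$(printf '%s\n' "$flags" | grep -c -- '-O$')

# -e marks the next argument as a pattern, even with a leading dash.
with_e=$(printf '%s\n' "$flags" | grep -c -e '-O$')

# A newline inside the argument separates two patterns (here O2 and Os).
two_patterns=$(printf '%s\n' "$flags" | grep -c -e "$(printf 'O2\nOs')")

# An empty pattern matches every line, which is why a pattern list that
# contains an empty entry echoes the whole input.
with_empty=$(printf '%s\n' "$flags" | grep -c -e '')

printf '%s %s %s %s\n' "$with_dashes" "$with_e" "$two_patterns" "$with_empty"
```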
For most GNU programs, such as gcc, you might find the info interface easier to navigate in, since it includes reference links, tables of contents, and even indices. The info gcc document includes an index of options, which is very useful. In some linux distributions, and somewhat surprisingly since they call themselves GNU/linux distributions, it's necessary to separately install info packages although man files are distributed with the base software. The debian/ubuntu package containing the gcc info files is called gcc-doc, for example. (The use of the -doc suffix to the package name is quite common.)
In the case of gcc you can rapidly find an option using a command like:
info gcc "option index" O
or
info gcc --index-search=funroll-loops
For programs with fewer options, it's usually good enough to use info's -O option:
info -O gawk
The thing is that 'man' uses a pager, commonly 'less', whose man-page states:
/pattern
Search forward in the file for the N-th line containing the pattern.
N defaults to 1. The pattern is a regular expression, as recognized by the
regular expression library supplied by your system. The search starts at the
first line displayed (but see the -a and -j options, which change this).
So one could try and look for '-O$' in a man-page to find a flag that lives alone in its
own line. Although, it is common for a flag to be followed by text in the very same line,
so this is not guaranteed to work.
The issue with grep and $'-O\n' is still a mystery though.
man gcc | grep "\-"
This works pretty well, as it displays all flags and usually not much more.
Edit: I notice I didn't completely answer your question, but I hope my suggestion can be considered as a nice alternative.
I use the following:
man some_command | col -b | grep -A5 -- 'your_request'
Examples:
man man | col -b | grep -A5 -- '-K'
man grep | col -b | grep -A5 -- '-e patt'
You can make alias for it.
The manly Python utility is very convenient for getting a quick explanation of all options used in a given command.
Note that it only outputs the first paragraph of the option descriptions.
pip install manly
$ manly blkid /dev/sda -o value -p
blkid - locate/print block device attributes
============================================
-o, --output format
Use the specified output format. Note that the order of vari‐
ables and devices is not fixed. See also option -s. The format
parameter may be:
-p, --probe
Switch to low-level superblock probing mode (bypassing the
cache).
A double dash (--) is used in most bash built-in commands and many other commands to signify the end of command options:
https://unix.stackexchange.com/a/11382/204245
Without the double-dash, grep is trying to use whatever flag you are looking for:
$ man curl | grep -c # Looks for this c flag, but can't find one so throws the error below.
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
[-e pattern] [-f file] [--binary-files=value] [--color=when]
[--context[=num]] [--directories=action] [--label] [--line-buffered]
[--null] [pattern] [file ...]
If you use a double dash to signify the end of grep's options, it works a bit better, but you still end up with every occurrence of the match:
$ man curl | grep -- -c
--cacert <file>
certs file named 'curl-ca-bundle.crt', either in the same direc-
--capath <dir>
curl to make SSL-connections much more efficiently than using
--cacert if the --cacert file contains many CA certificates.
--cert-status
--cert-type <type>
-E, --cert <certificate[:password]>
--ciphers <list of ciphers>
# ...many more matches......
So simply wrap the flag in quotes and throw a space before it to only match the -c flag:
$ man curl | grep -- " -c"
-c, --cookie-jar <filename>
This has driven me insane for years. Hope this helps.
man chooses its pager from an environment variable (MANPAGER or PAGER, not EDITOR). You can change it from the default (typically less or more) to another program that can display text from standard input, and then search and browse there as you like.
When using grep, it will highlight any text in a line with a match to your regular expression.
What if I want this behaviour, but have grep print out all lines as well? I came up empty after a quick look through the grep man page.
Use ack and check out its --passthru option. It has the added benefit of allowing full Perl regular expressions.
$ ack --passthru 'pattern1' file_name
$ command_here | ack --passthru 'pattern1'
You can also do it using grep like this:
$ grep --color -E '^|pattern1|pattern2' file_name
$ command_here | grep --color -E '^|pattern1|pattern2'
This will match all lines and highlight the patterns. The ^ matches every start of line, but won't get printed/highlighted since it's not a character.
(Note that most of the setups will use --color by default. You may not need that flag).
You can make every line match while leaving nothing to highlight on the lines that would otherwise not match:
egrep --color 'apple|' test.txt
Notes:
egrep may be spelled also grep -E
--color is usually default in most distributions
some variants of grep will "optimize" the empty match, so you might want to use "apple|$" instead (see: https://stackoverflow.com/a/13979036/939457)
EDIT:
This works with OS X Mountain Lion's grep:
grep --color -E 'pattern1|pattern2|$'
This is better than '^|pattern1|pattern2' because the ^ part of the alternation matches at the beginning of the line whereas the $ matches at the end of the line. Some regular expression engines won't highlight pattern1 or pattern2 because ^ already matched and the engine is eager.
Something similar happens for 'pattern1|pattern2|' because the regex engine notices the empty alternation at the end of the pattern string matches the beginning of the subject string.
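Wrapped in a function, the 'pattern|$' form might look like the sketch below. highlight_all is a hypothetical helper name, and --color=always is used so the colouring survives a pipe; this assumes a grep with --color support, such as GNU grep.

```shell
# Hypothetical helper: print all of stdin, colouring matches of $1.
# The trailing |$ matches the (empty) end of every line, so no line is dropped.
highlight_all() {
    grep --color=always -E "$1|\$"
}

printf '%s\n' 'apple pie' 'banana split' 'pineapple' | highlight_all apple
```

All three lines are printed, and only the occurrences of apple carry colour escape sequences.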
[1]: http://www.regular-expressions.info/engine.html
FIRST EDIT:
I ended up using perl:
perl -pe 's:pattern:\033[31;1m$&\033[30;0m:g'
This assumes you have an ANSI-compatible terminal.
ORIGINAL ANSWER:
If you're stuck with a strange grep, this might work:
grep -E --color=always -A500 -B500 'pattern1|pattern2' | grep -v '^--'
Adjust the numbers to get all the lines you want.
The second grep just removes extraneous -- lines inserted by the BSD-style grep on Mac OS X Mountain Lion, even when the context of consecutive matches overlap.
I thought GNU grep omitted the -- lines when context overlaps, but it's been awhile so maybe I remember wrong.
You can use my highlight script from https://github.com/kepkin/dev-shell-essentials
It's better than grep because you can highlight each match with its own color.
$ command_here | highlight green "input" | highlight red "output"
Since you want matches highlighted, this is probably for human consumption (as opposed to piping to another program for instance), so a nice solution would be to use:
less -p <your-pattern> <your-file>
And if you don't care about case sensitivity:
less -i -p <your-pattern> <your-file>
This also has the advantage of having pages, which is nice when having to go through a long output
You can do it using only grep by:
reading the file line by line
matching a pattern in each line and highlighting pattern by grep
if there is no match, echo the line as is
which gives you the following:
while IFS= read -r line ; do printf '%s\n' "$line" | grep PATTERN || printf '%s\n' "$line" ; done < inputfile
If you want to print "all" lines, there is a simple working solution:
grep "test" -A 9999999 -B 9999999
A => After
B => Before
If you are doing this because you want more context in your search, you can do this:
cat BIG_FILE.txt | less
Doing a search in less should highlight your search terms.
Or pipe the output to your favorite editor. One example:
cat BIG_FILE.txt | vim -
Then search/highlight/replace.
If you are looking for a pattern in file names recursively, you can either first save the listing to a file:
ls -1R ./ > list-of-files.txt
and then grep that, or pipe the listing straight into grep:
ls -1R | grep --color -E '[A-Z]|'
This will look like a listing of all files, but with the uppercase letters coloured. If you remove the last | you will only see the matches.
I use this to find images named badly with upper case for example, but normal grep does not show the path for each file just once per directory so this way I can see context.
Maybe this is an XY problem, and what you are really trying to do is to highlight occurrences of words as they appear in your shell. If so, you may be able to use your terminal emulator for this. For instance, in Konsole, start Find (ctrl+shift+F) and type your word. The word will then be highlighted whenever it occurs in new or existing output until you cancel the function.
I needed to find all the files that contained a specific string pattern. The first solution that comes to mind is using find piped with xargs grep:
find . -iname '*.py' | xargs grep -e 'YOUR_PATTERN'
But if I need to find patterns that span more than one line, I'm stuck because vanilla grep can't find multiline patterns.
Why don't you go for awk:
awk '/Start pattern/,/End pattern/' filename
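A short sketch of what the range pattern selects, using five made-up input lines (the variable names are only for illustration):

```shell
# Made-up input with one start/end pair surrounded by other lines.
input=$(printf '%s\n' 'header' 'Start pattern' 'body line' 'End pattern' 'footer')

# A range pattern is true from a line matching the first regex up to and
# including the next line matching the second one.
section=$(printf '%s\n' "$input" | awk '/Start pattern/,/End pattern/')
printf '%s\n' "$section"
```

The range opens again at every later line that matches the start pattern, so a file with several such sections prints all of them.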
Here is the example using GNU grep:
grep -Pzo '_name.*\n.*_description'
-z/--null-data Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline.
Which has the effect of treating the whole file as one large line.
See -z description on grep's manual and also common question no 14 on grep's manual usage page
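Here is a minimal sketch of the -Pzo form on three made-up lines. It assumes GNU grep built with PCRE support; the sample text and variable names are invented for this example.

```shell
# Made-up Python-like source: a _name line directly followed by a _description line.
source_text=$(printf '%s\n' 'x_name = "a"' 'x_description = "b"' 'unrelated = 1')

# -z makes NUL the record separator, so the whole input is one record and
# \n can appear inside the pattern; -o prints each match followed by a NUL,
# which tr turns back into a newline for display.
multi=$(printf '%s\n' "$source_text" | grep -Pzo '_name.*\n.*_description' | tr '\0' '\n')
printf '%s\n' "$multi"
```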
So I discovered pcregrep which stands for Perl Compatible Regular Expressions GREP.
the -M option makes it possible to search for patterns that span line boundaries.
For example, you need to find files where the '_name' variable is followed on the next line by the '_description' variable:
find . -iname '*.py' | xargs pcregrep -M '_name.*\n.*_description'
Tip: you need to include the line break character in your pattern. Depending on your platform, it could be '\n', '\r', '\r\n', ...
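grep -P also uses libpcre, but is much more widely installed. Because grep reads its input line by line, add -z (GNU grep) so that the whole file is treated as one record, and -o so that only the match is printed. To find a complete title section of an HTML document, even if it spans multiple lines, you can use this:
grep -Pzo '(?s)<title>.*?</title>' example.html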
Since the PCRE project implements to the perl standard, use the perl documentation for reference:
http://perldoc.perl.org/perlre.html#Modifiers
http://perldoc.perl.org/perlre.html#Extended-Patterns
Here is a more useful example:
pcregrep -Mi "<title>(.*\n){0,5}</title>" afile.html
It searches the title tag in a html file even if it spans up to 5 lines.
Here is an example of unlimited lines:
pcregrep -Mi "(?s)<title>.*</title>" example.html
With silver searcher:
ag 'abc.*(\n|.)*efg'
Speed optimizations of silver searcher could possibly shine here.
@Marcin:
awk example non-greedy:
awk '{if ($0 ~ /Start pattern/) {triggered=1;}if (triggered) {print; if ($0 ~ /End pattern/) { exit;}}}' filename
You can use the grep alternative sift here (disclaimer: I am the author).
It supports multiline matching and limiting the search to specific file types out of the box:
sift -m --files '*.py' 'YOUR_PATTERN'
(search all *.py files for the specified multiline regex pattern)
It is available for all major operating systems. Take a look at the samples page to see how it can be used to extract multiline values from an XML file.
This answer might be useful:
Regex (grep) for multi-line search needed
To find recursively you can use flags -R (recursive) and --include (GLOB pattern). See:
Use grep --exclude/--include syntax to not grep through certain files
perl -ne 'print if (/begin pattern/../end pattern/)' filename
Using the ex/vi editor and the globstar option (syntax similar to awk and sed):
ex +"/string1/,/string3/p" -R -scq! file.txt
where string1 is your starting text and string3 is your ending text.
To search recursively, try:
ex +"/aaa/,/bbb/p" -scq! **/*.py
Note: To enable ** syntax, run shopt -s globstar (Bash 4 or zsh).