Using grep to get 12 letter alphabet only lines - linux

Using grep
How many 12 letter - alphabet only lines are in testing.txt?
excerpt of testing.txt
tyler1
Tanktop_Paedo
xyz2#geocities.com
milt#uole.com
justincrump
cranges10
namer#uole.com
soulfunkbrotha
timetolearnz
hotbooby#geocities.com
Fire_Crazy
helloworldad
dingbat#geocities.com
from this excerpt, I want to get a result of 2. (helloworldad, and timetolearnz)
I want to check every line and grep only those that have 12 characters in each line. I can't think of a way to do this with grep though.
For the alphabet only, I think I can use
grep [A-Za-z] testing.txt
However, how do I make it so only the characters [A-Za-z] show up in those 12 characters?

You can do it with extended regex -E and by specifying that the match is exactly {12} characters from start ^ to finish $
$ grep -E "^[A-Za-z]{12}$" testing.txt
timetolearnz
helloworldad
Or if you want to get the count -c of the lines you can use
$ grep -cE "^[A-Za-z]{12}$" testing.txt
2

grep supports whole-line match and counting, e.g.:
grep -xc '[[:alpha:]]\{12\}' testing.txt
Output:
2
In the C/POSIX locale, the [[:alpha:]] character class is another way of saying [A-Za-z]; in other locales it may also match accented letters. See section 3.2 of the info pages: info grep 'Regular Expressions' 'Character Classes and Bracket Expressions' for more on this subject. Or look it up in the PDF manual online.
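As a quick check, the two answers above can be run side by side. This is a minimal sketch: the sample lines are a hypothetical subset of the excerpt, fed through standard input instead of testing.txt, and LC_ALL=C is an added assumption to keep the letter classes restricted to ASCII.

```shell
# Hypothetical sample input: a few lines taken from the excerpt above.
sample='tyler1
justincrump
timetolearnz
soulfunkbrotha
helloworldad
Fire_Crazy'

# ERE form: ^ and $ anchor the match, {12} requires exactly 12 letters.
count_ere=$(printf '%s\n' "$sample" | LC_ALL=C grep -cE '^[A-Za-z]{12}$')

# BRE form: -x anchors the whole line, so ^ and $ are not needed.
count_bre=$(printf '%s\n' "$sample" | LC_ALL=C grep -xc '[[:alpha:]]\{12\}')

echo "ERE count: $count_ere"
echo "BRE count: $count_bre"
```

Both commands should report the same count, since they express the same whole-line condition in two regex flavours.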

Related

grep and cut a specific pattern [duplicate]

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like:
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?
Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
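A minimal sketch of -o on the three lines from the question. It assumes a grep that supports -o (GNU and BSD grep do) and reads the hypothetical sample from standard input, so -h is not needed here; [[:alpha:]] is used in place of \w for portability.

```shell
# The three sample lines from the question, used as hypothetical input.
sample='the cat sat on the mat
the quick brown fox
i hope this explains it thoroughly'

# -o prints each match on its own line instead of the whole matching line.
words=$(printf '%s\n' "$sample" | grep -o '[[:alpha:]]*th[[:alpha:]]*')

printf '%s\n' "$words"
```

When several files are searched at once, adding -h keeps the filename prefix out of the output.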
Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using older versions of grep (like 2.4.2) which do not include the -o option, then use the above. Else use the simpler to maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh prints only the text matched by the regular expression, without the filename, much as you would expect a regex search to behave in vim and similar tools. Which word or regular expression you search for is up to you, as long as you stay with POSIX syntax rather than Perl syntax (see below).
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
The usage of \w varies from platform to platform, as it's an extended "perl" syntax. As such, those grep installations that are limited to work with POSIX character classes use [[:alpha:]] and not its perl equivalent of \w. See the Wikipedia page on regular expression for more
Ultimately, the POSIX answer above will be a lot more reliable regardless of platform (being the original) for grep
As for support of grep without -o option, the first grep outputs the relevant lines, the tr splits the spaces to new lines, the final grep filters only for the respective lines.
(PS: I know most platforms by now would have been patched for \w.... but there are always those that lag behind)
Credit for the "-o" workaround from #AdamRosenfield answer
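The fallback pipeline for a grep without -o can be sketched as follows. The sample text is hypothetical, and the second line is there to show one limitation: tr only splits on spaces, so punctuation stays attached to the word.

```shell
# Hypothetical sample input; the second line contains punctuation on purpose.
sample='the cat sat on the mat
well then, that is it'

# 1) keep only lines containing "th"
# 2) put every space-separated word on its own line
# 3) keep only the words containing "th"
fallback=$(printf '%s\n' "$sample" | grep 'th' | tr ' ' '\n' | grep 'th')

printf '%s\n' "$fallback"
```

With this input the word "then," keeps its trailing comma, which the -o variant would not print.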
It's simpler than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: Grep will work with extended regular expression.
w : Matches only word/words instead of substring.
o : Display only matched pattern instead of whole line.
i : Ignore case when matching.
You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th
Just awk, no need combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly
grep command for only matching and perl
grep -o -P 'th.*? ' filename
I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010
cat *-text-file | grep -Eio "th[a-z]+"
You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
I had a similar problem, looking for grep/pattern regex and the "matched pattern found" as output.
In the end I used egrep with the -o option (the same regex with grep -e or -G did not give me the same result as egrep).
So I think the command could be something like this (I'm NOT a regex master):
egrep -o "the*|this{1}|thoroughly{1}" filename
To find all the words that start with "icon-", the following command works perfectly. I am using ack here, which is similar to grep but has better options and nicer formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq
You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'
grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal
$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.
ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It'll match all words containing th.

How can I find the number of 8 letter words that do not contain the letter "e", using the grep command?

I want to find the number of 8 letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding in quantifiers and how to exclude characters.
I'm quite new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"
I need to include the cat command because it otherwise includes the names of the text files in the pipe. The second pipe is to have all the words in a list, and it works, but the last pipe was meant to find all the words that do not have the letter "e" in them, but doesn't seem to work. (I thought "." for no or any number of any character, followed by a character that is not an "e", and followed by another "." for no or any number of any character.)
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"
This command works to find the words that contain 8 characters, but it is quite ineffective, because I have to repeat "[a-z]" 8 times. I thought it could also be "[a-z]{8}", but that doesn't seem to work.
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"
So finally, this would be my best guess, however, the third pipe is ineffective and the last pipe doesn't work.
You may use this grep:
grep -hEiwo '[a-df-z]{8}' *.txt
Here:
[a-df-z]{8}: Matches exactly 8 letters, none of which is e
-h: Don't print filename in output
-i: Ignore case search
-o: Print matches only
-w: Match complete words
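Since the question asks for the number of such words, the matches can be piped to wc -l. This is a minimal sketch on hypothetical text read from standard input (so -h is dropped); LC_ALL=C is an added assumption to keep the bracket ranges restricted to ASCII letters.

```shell
# Hypothetical sample text with several 8-letter words.
sample='An elephant ate my birthday sandwich
Building a mountain takes patience'

# -o prints one matching word per line; wc -l counts those lines.
# tr -d ' ' strips the padding that some wc implementations add.
count=$(printf '%s\n' "$sample" \
  | LC_ALL=C grep -Eiwo '[a-df-z]{8}' \
  | wc -l | tr -d ' ')

echo "8-letter words without e: $count"
```

Here birthday, sandwich, Building and mountain are counted, while elephant and patience are skipped because they contain an e.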
If you are OK with GNU awk, and assuming that you want to print only the exact words (there may be several matches per line), you could try the following.
awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt
OR without the use of IGNORECASE one could try:
awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt
NOTE: This assumes that you want exact matches of 8 letters only. 8-letter words followed by a punctuation mark will be excluded.
Here is a crazy thought with GNU awk:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file
Or if you want to make it work only on a select set of characters:
awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file
What this does is, it defines the fields, to be a set of 8 characters (\w as a word-constituent or [a-df-z] as a selected set) which is enclosed by word-boundaries (\< and \>). This is done with FPAT (note the Gory details about escaping).
Sometimes you might also have words that contain diacritics, so the character set has to be expanded. Then this might be the best solution:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' file

Removing number of dots with grep using regular expression

How can I remove lines that contain more than 5 or fewer than 5 dots (simply put: keep only lines with exactly 5 dots)?
How can I write a regex that will detect it in bash using grep?
INPUT:
yGEtfWYBCBKtvxTbHxMK,126.221.42.321.0.147.30,10,Bad stuff is happening,http://mystuff.com/file.json
yGEtfWYBCBKtvxTbHxwK,126.221.42.21,10,Bad stuff is happening,http://mystuff.com/file.json
EXPECTED OUTPUT:
yGEtfWYBCBKtvxTbHxwK,126.221.42.21,10,Bad stuff is happening,http://mystuff.com/file.json
Tried:
grep -P '[.]{5}' stuff.txt
grep -P '[\.]{5}' stuff.txt
grep -P '([\.]{5})' stuff.txt
grep -P '\.{5}' stuff.txt
grep -E '([\.]{5}' stuff.txt
You can display only the lines that contain exactly 5 dots as follows:
grep '^[^.]*\.[^.]*\.[^.]*\.[^.]*\.[^.]*\.[^.]*$' stuff.txt
or if you want to factor it :
grep -E '^([^.]*\.){5}[^.]*$' stuff.txt
Using -E (extended regular expressions, ERE) in the second command avoids having to escape the grouping and interval operators as \(\) and \{\}; in the first command grep's default BRE flavour is sufficient.
^ and $ are anchors representing respectively the start and end of the line that make sure we match the whole line and not just a part of it that contains 5 dots.
[^.] is a negated character class that will match anything but a dot.
They are quantified with * so that any number of non-dot characters can happen between each dot (you might want to change that to + if consecutive dots shouldn't be matched).
\. matches a literal dot (rather than any character, which the meta-character . outside of a character class would).
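The pieces described above can be tried on a small input. This sketch uses short hypothetical lines with 5, 8 and 4 dots instead of the real data, and adds -v to show how the rejected lines could be listed.

```shell
# Hypothetical sample lines: a has 5 dots, b has 8, c has 4.
sample='a,1.2.3.4,x.y/z.w
b,1.2.3.4.5.6.7,x.y/z.w
c,1.2.3,x.y/z.w'

# Keep only lines with exactly five dots.
good=$(printf '%s\n' "$sample" | grep -E '^([^.]*\.){5}[^.]*$')

# -v inverts the match and lists the lines that would be removed.
bad=$(printf '%s\n' "$sample" | grep -vE '^([^.]*\.){5}[^.]*$')

printf 'kept:\n%s\n' "$good"
printf 'removed:\n%s\n' "$bad"
```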
To detect specifically the bad IP address
Can you be certain that the IP address is always surrounded by commas and does not contain spaces - i.e. is never the first or last field?
Then, you might get away with:
grep -E ',\w+((\.\w+){2,3}|(\.\w+){5,}),'
If not, it is quite difficult to distinguish between a broken IP form with spaces and an ordinary sentence, so you might have to specify the column.
Using Perl one-liner to print only if number of "." exceeds 5
> cat five_dots.txt
yGEtfWYBCBKtvxTbHxMK,126.221.42.321.0.147.30,10,Bad stuff is happening,http://mystuff.com/file.json
yGEtfWYBCBKtvxTbHxwK,126.221.42.21,10,Bad stuff is happening,http://mystuff.com/file.json
> perl -ne '{ while(/\./g){$count++} print if $count > 5; $count=0 } ' five_dots.txt
yGEtfWYBCBKtvxTbHxMK,126.221.42.321.0.147.30,10,Bad stuff is happening,http://mystuff.com/file.json
>

How can I get a whole word with grep? [duplicate]

I want to get the lines that contain a defined word with grep.
Edit: The solution was this one.
I know that I can use the -w option, but it doesn't seem to do the trick.
For example: every word that contains my defined word separated by punctuation signs is included. If I look for dogs, it will show me lines that contain not only dogs word but also cats.dogs, cats-dogs, etc.
# cat file.txt
some alphadogs dance
some cats-dogs play
none dogs dance
few dog sing
all cats.dogs shout
And with grep:
# cat file.txt | grep -w "dogs"
some cats-dogs play
none dogs dance
all cats.dogs shout
Desired output:
# cat file.txt | grep -w "dogs"
none dogs dance
Do you know any workaround that allows you to get the whole word? I've tested it with \b or \< with negative results.
Thanks,
Eudald
Use the line anchors ^ and $, which work in any version of grep you have installed. Note that this matches only lines that consist of dogs and nothing else:
grep '^dogs$' file.txt
An excerpt from this regular-expressions page,
Anchors
[..] Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string.[..]
Try with -x parameter:
grep -x dogs file.txt
From grep manual:
-x, --line-regexp
Select only those matches that exactly match the whole line. (-x is specified by POSIX.)
NOTE: cat is useless when you pipe its output to grep
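Neither -w nor -x gives the desired output for the sample file, because -w treats punctuation as a word boundary and -x requires the whole line to be dogs. One possible workaround, assuming "whole word" here means delimited by whitespace or by the start or end of the line, is to spell out those delimiters. This is a sketch under that assumption, not necessarily the solution the asker ended up with.

```shell
# The sample file from the question, used as hypothetical standard input.
sample='some alphadogs dance
some cats-dogs play
none dogs dance
few dog sing
all cats.dogs shout'

# dogs must be preceded by line start or whitespace,
# and followed by whitespace or line end.
matches=$(printf '%s\n' "$sample" \
  | grep -E '(^|[[:space:]])dogs([[:space:]]|$)')

printf '%s\n' "$matches"
```

With this pattern cats-dogs and cats.dogs are rejected, because the character before dogs is neither whitespace nor the start of the line.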

Highlight text similar to grep, but don't filter out text [duplicate]

When using grep, it will highlight any text in a line with a match to your regular expression.
What if I want this behaviour, but have grep print out all lines as well? I came up empty after a quick look through the grep man page.
Use ack and check out its --passthru option. It has the added benefit of allowing full Perl regular expressions.
$ ack --passthru 'pattern1' file_name
$ command_here | ack --passthru 'pattern1'
You can also do it using grep like this:
$ grep --color -E '^|pattern1|pattern2' file_name
$ command_here | grep --color -E '^|pattern1|pattern2'
This will match all lines and highlight the patterns. The ^ matches every start of line, but won't get printed/highlighted since it's not a character.
(Note that most of the setups will use --color by default. You may not need that flag).
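A minimal sketch of the '^|pattern' trick on hypothetical input. It assumes a grep that accepts --color=always (GNU and BSD grep do), which forces the colour escape codes even though the output is captured instead of shown on a terminal. The check at the end only confirms that no line was filtered out.

```shell
# Hypothetical sample input: two lines contain "apple", one does not.
sample='apple pie
banana split
pineapple juice'

# ^ matches every line, so nothing is dropped; only "apple" has
# actual characters to highlight.
out=$(printf '%s\n' "$sample" | grep --color=always -E '^|apple')

# Count output lines to confirm that every input line passed through.
line_count=$(printf '%s\n' "$out" | wc -l | tr -d ' ')

printf '%s\n' "$out"
echo "lines printed: $line_count"
```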
You can make sure that all lines match but there is nothing to highlight on irrelevant matches
egrep --color 'apple|' test.txt
Notes:
egrep may be spelled also grep -E
--color is usually default in most distributions
some variants of grep will "optimize" the empty match, so you might want to use "apple|$" instead (see: https://stackoverflow.com/a/13979036/939457)
EDIT:
This works with OS X Mountain Lion's grep:
grep --color -E 'pattern1|pattern2|$'
This is better than '^|pattern1|pattern2' because the ^ part of the alternation matches at the beginning of the line whereas the $ matches at the end of the line. Some regular expression engines won't highlight pattern1 or pattern2 because ^ already matched and the engine is eager.
Something similar happens for 'pattern1|pattern2|' because the regex engine notices the empty alternation at the end of the pattern string matches the beginning of the subject string.
[1]: http://www.regular-expressions.info/engine.html
FIRST EDIT:
I ended up using perl:
perl -pe 's:pattern:\033[31;1m$&\033[30;0m:g'
This assumes you have an ANSI-compatible terminal.
ORIGINAL ANSWER:
If you're stuck with a strange grep, this might work:
grep -E --color=always -A500 -B500 'pattern1|pattern2' | grep -v '^--'
Adjust the numbers to get all the lines you want.
The second grep just removes extraneous -- lines inserted by the BSD-style grep on Mac OS X Mountain Lion, even when the context of consecutive matches overlap.
I thought GNU grep omitted the -- lines when context overlaps, but it's been awhile so maybe I remember wrong.
You can use my highlight script from https://github.com/kepkin/dev-shell-essentials
It's better than grep because you can highlight each match with its own color.
$ command_here | highlight green "input" | highlight red "output"
Since you want matches highlighted, this is probably for human consumption (as opposed to piping to another program for instance), so a nice solution would be to use:
less -p <your-pattern> <your-file>
And if you don't care about case sensitivity:
less -i -p <your-pattern> <your-file>
This also has the advantage of having pages, which is nice when having to go through a long output
You can do it using only grep by:
reading the file line by line
matching a pattern in each line and highlighting pattern by grep
if there is no match, echo the line as is
which gives you the following:
while IFS= read -r line ; do (printf '%s\n' "$line" | grep PATTERN) || printf '%s\n' "$line" ; done < inputfile
If you want to print "all" lines, there is a simple working solution:
grep "test" -A 9999999 -B 9999999
A => After
B => Before
If you are doing this because you want more context in your search, you can do this:
cat BIG_FILE.txt | less
Doing a search in less should highlight your search terms.
Or pipe the output to your favorite editor. One example:
cat BIG_FILE.txt | vim -
Then search/highlight/replace.
If you are looking for a pattern in a directory recursively, you can either first save it to file.
ls -1R ./ > list-of-files.txt
And then grep that, or pipe it to the grep search
ls -1R | grep --color -E '[A-Z]|'
This will look like a listing of all files, but with the uppercase letters coloured. If you remove the last | you will only see the matches.
I use this to find images named badly with upper case for example, but normal grep does not show the path for each file just once per directory so this way I can see context.
Maybe this is an XY problem, and what you are really trying to do is to highlight occurrences of words as they appear in your shell. If so, you may be able to use your terminal emulator for this. For instance, in Konsole, start Find (ctrl+shift+F) and type your word. The word will then be highlighted whenever it occurs in new or existing output until you cancel the function.
