Grep substrings in string/word - string

Is there a way on grep or any other unix tool to search for a sequence of substrings in a string?
To clarify:
$ grep "substring1.*substring2"
substring1_mySubstring2 # OK substrings forming a single string
substring1 substring2 # WRONG substrings are separated

You can tell grep to look for substring1 + some characters + substring2:
grep -iE 'substring1\w+substring2' file
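Note the use of -i to ignore case and -E to enable extended regular expressions (without -E, the same can be done with \w\+ instead). \w matches only letters, digits and underscores, so whitespace between the two substrings prevents a match.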
Test
$ cat a
substring1_mySubstring2
substring1 substring2
substring1_and_other_things12345substring2 blabla
Let's see how this matches only when there are no spaces in between:
$ grep -iE 'substring1\w+substring2' a
substring1_mySubstring2
substring1_and_other_things12345substring2 blabla
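A minimal sketch of the same idea with POSIX character classes only, in case your grep lacks the \w extension. The last sample line is an added, hypothetical case: [^[:space:]]+ is looser than \w+ and also accepts punctuation such as "-" between the substrings.

```shell
# Sample lines from the test above; the last line is a hypothetical extra case.
sample='substring1_mySubstring2
substring1 substring2
substring1_and_other_things12345substring2 blabla
substring1-x-substring2'

# \w+ : only letters, digits and underscores may sit between the substrings.
word_matches=$(printf '%s\n' "$sample" | grep -iE 'substring1\w+substring2')

# [^[:space:]]+ : any run of non-whitespace may sit between the substrings.
nospace_matches=$(printf '%s\n' "$sample" | grep -iE 'substring1[^[:space:]]+substring2')

printf 'with \\w+:\n%s\n' "$word_matches"
printf 'with [^[:space:]]+:\n%s\n' "$nospace_matches"
```

Pick whichever class matches your definition of "a single string".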

Related

grep character not followed by character

I'm trying to print lines that have b not followed by e in a file. I tried using negative look-ahead, but it's not working.
grep 'b(?!e)' filename
grep '(?!e)b)' filename
egrep 'b(?!e)' f3.txt
When I run these commands, nothing shows up, no error or anything else. I checked other people's similar posts as well but was unable to run it.
grep -E 'b([^e]|$)' filename
That should match 'b' followed by a character which is not 'e', or 'b' at end-of-line.
If your grep supports Perl regular expressions with -P, look-arounds work:
$ grep -P 'b(?!e)' <<< 'be' # Gets no output
$ grep -P 'b(?!e)' <<< 'bb'
bb
$ grep -P 'b(?!e)' <<< 'b'
b
The only difference from the grep -E version (in this case) is that you don't have to handle the end-of-line case separately (see pilcrow's answer).
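A small sketch that checks the -E workaround against a few made-up lines (the sample input is an assumption, not from the question) and, where grep was built with -P support, confirms that the look-ahead selects the same lines.

```shell
# Hypothetical sample input, one case per line.
sample='be
bb
b
webbed
bebe'

# ERE workaround: b followed by a character other than e, or b at end of line.
ere_matches=$(printf '%s\n' "$sample" | grep -E 'b([^e]|$)')
printf '%s\n' "$ere_matches"

# Optional cross-check, only if this grep supports -P (PCRE).
if printf 'b\n' | grep -qP 'b(?!e)' 2>/dev/null; then
  pcre_matches=$(printf '%s\n' "$sample" | grep -P 'b(?!e)')
  if [ "$pcre_matches" = "$ere_matches" ]; then
    echo "ERE and PCRE select the same lines"
  fi
fi
```

Lines such as "be" and "bebe" are skipped because every b in them is followed by e.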

grep - print all lines containing 'cat' as the second word

OK, so consider that I have a file containing the following text:
lknsglkn cat lknrhlkn lsrhkn
cat lknerylnk lknaselk cat
awiooiyt lkndrhlk dhlknl
blabla cat cat bla bla
I need to use grep to print only the lines containing 'cat' as the second word on the line, namely lines 1 and 4. I've tried multiple grep -e 'regex' <file> commands but can't seem to get the right one. I don't know how to match the N'th word of a line.
This may work for you:
grep -E '^\w+\s+cat\s' file
If the first "word" can contain some non-word characters, e.g. "#", "(", "[", you could also try:
grep -E '^\S+\s+cat\s' file
with your example input:
kent$ echo "lknsglkn cat lknrhlkn lsrhkn
cat lknerylnk lknaselk cat
awiooiyt lkndrhlk dhlknl
blabla cat cat bla bla"|grep -E '^\S+\s+cat\s'
lknsglkn cat lknrhlkn lsrhkn
blabla cat cat bla bla
What constitutes a word?
grep '^[a-z][a-z]* *cat '
This will work if there is at least a blank after cat. If that's not guaranteed, then:
grep -E '^[a-z]+ +cat( |$)'
which looks for cat followed by a blank or end of line.
If you want a more extensive definition of 'first word' (upper case, digits, punctuation), change the character class. If you want to allow for blanks or tabs, use a bracket expression such as [[:blank:]] in place of the literal space. If you can have leading blanks, add ' *' after the caret. Variations as required.
These variations will work with any version of grep that supports the -E option. POSIX does not mandate notations such as \S to mean 'non-white-space', though GNU grep does support that as an extension. The grep -E version will work with regular egrep if grep -E does not work but egrep exists (don't use the -E option with egrep).
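As a sketch of those variations combined, the pattern below uses only POSIX bracket expressions: any non-whitespace counts as the first word, blanks or tabs may separate words, and leading whitespace is allowed. The last sample line is a hypothetical addition to the question's input.

```shell
# Input from the question plus one made-up line with leading blanks and a tab.
sample=$(
  printf '%s\n' \
    'lknsglkn cat lknrhlkn lsrhkn' \
    'cat lknerylnk lknaselk cat' \
    'awiooiyt lkndrhlk dhlknl' \
    'blabla cat cat bla bla'
  printf '  x42!\tcat\n'
)

# optional leading whitespace, first word, whitespace, then cat as a whole word
second_word_cat=$(printf '%s\n' "$sample" |
  grep -E '^[[:space:]]*[^[:space:]]+[[:space:]]+cat([[:space:]]|$)')

printf '%s\n' "$second_word_cat"
```

This should print lines 1 and 4 of the original input plus the added fifth line.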
The following should work:
grep -e '^\S\+\scat\s'
The line should start with a run of at least one non-whitespace character, followed by a single whitespace character and the word "cat", followed by another whitespace character.
Will be slower, but perhaps more readable:
awk '$2 == "cat"' file
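The awk form generalises easily to "the N'th word", which the question asked about. A rough sketch, where the function name nth_word_lines and the awk variables n and word are illustrative names, not part of the answer above:

```shell
# Print the lines from stdin whose n-th whitespace-separated field equals a word.
# $1 = field position (1-based), $2 = word to look for.
nth_word_lines() {
  awk -v n="$1" -v word="$2" '$n == word'
}

sample='lknsglkn cat lknrhlkn lsrhkn
cat lknerylnk lknaselk cat
awiooiyt lkndrhlk dhlknl
blabla cat cat bla bla'

second=$(printf '%s\n' "$sample" | nth_word_lines 2 cat)
first=$(printf '%s\n' "$sample" | nth_word_lines 1 cat)

printf 'cat as 2nd word:\n%s\n' "$second"
printf 'cat as 1st word:\n%s\n' "$first"
```

Because awk splits on runs of blanks and tabs by default, leading whitespace and multiple separators need no special handling.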

Grep Syntax with Capitals

I'm trying to write a script with a file as an argument that greps the text file to find any word that starts with a capital and has 8 letters following it. I'm bad with syntax so I'll show you my code, I'm sure it's an easy fix.
grep -o '[A-Z][^ ]*' $1
I'm not sure how to specify that:
a) it starts with a capital letter, and
b) that it's a 9 letter word.
Cheers
EDIT:
As an edit I'd like to add my new code:
while read p
do
echo $p | grep -Eo '^[A-Z][[:alpha:]]{8}'
done < $1
I still can't get it to work, any help on my new code?
'[A-Z][^ ]*' will match one character between A and Z, followed by zero or more non-space characters. So it would match any A-Z character on its own.
Use \b to indicate a word boundary, and a quantifier inside braces, for example:
grep '\b[A-Z][a-z]\{8\}\b'
If you just did grep '[A-Z][a-z]\{8\}' that would match (for example) "aaaaHellosailor".
I use \{8\}, the braces need to be escaped unless you use grep -E, also known as egrep, which uses Extended Regular Expressions. Vanilla grep, that you are using, uses Basic Regular Expressions. Also note that \b is not part of the standard, but commonly supported.
If you use ^ at the beginning and $ at the end then it will not find "Wiltshire" in "A Wiltshire pig makes great sausages"; it will only find lines that consist of a single 9-character capitalized word and nothing else.
This works for me:
$ echo "one-Abcdefgh.foo" | grep -o -E '[A-Z][[:alpha:]]{8}'
$ echo "one-Abcdefghi.foo" | grep -o -E '[A-Z][[:alpha:]]{8}'
Abcdefghi
$
Note that this doesn't handle extensions or prefixes. If you want to FORCE the input to be a 9-letter capitalized word, we need to be more explicit:
$ echo "one-Abcdefghij.foo" | grep -o -E '\b[A-Z][[:alpha:]]{8}\b'
$ echo "Abcdefghij" | grep -o -E '\b[A-Z][[:alpha:]]{8}\b'
$ echo "Abcdefghi" | grep -o -E '\b[A-Z][[:alpha:]]{8}\b'
Abcdefghi
$
I have a test file named 'testfile' with the following content:
Aabcdefgh
Babcdefgh
cabcdefgh
eabcd
Now you can use the following command to grep in this file:
grep -Eo '^[A-Z][[:alpha:]]{8}' testfile
The command above is equivalent to:
cat testfile | grep -Eo '^[A-Z][[:alpha:]]{8}'
This matches
Aabcdefgh
Babcdefgh
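To see the effect of the word boundaries discussed above side by side, here is a small sketch. The sample text reuses strings mentioned in the answers; LC_ALL=C is an added assumption so that [A-Z] means exactly the ASCII capitals.

```shell
text='A Wiltshire pig makes great sausages
aaaaHellosailor
one-Abcdefghi.foo
Abcdefghij'

# No boundaries: also picks up 9-letter runs buried inside longer words.
unanchored=$(printf '%s\n' "$text" | LC_ALL=C grep -oE '[A-Z][[:alpha:]]{8}')

# With \b (commonly supported, not POSIX): only whole 9-letter capitalized words.
whole_words=$(printf '%s\n' "$text" | LC_ALL=C grep -oE '\b[A-Z][[:alpha:]]{8}\b')

printf 'unanchored:\n%s\n' "$unanchored"
printf 'whole words:\n%s\n' "$whole_words"
```

The unanchored pattern reports "Hellosail" and the first nine letters of "Abcdefghij"; the bounded one keeps only "Wiltshire" and "Abcdefghi".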

Grep not as a regular expression

I need to search for a PHP variable $someVar. However, Grep thinks that I am trying to run a regex and is complaining:
$ grep -ir "Something Here" * | grep $someVar
Usage: grep [OPTION]... PATTERN [FILE]...
Try `grep --help' for more information.
$ grep -ir "Something Here" * | grep "$someVar"
<<Here it returns all rows with "someVar", not only those with "$someVar">>
I don't see an option for telling grep not to interpret the string as a regex, but to include the $ as just another string character.
Use fgrep (deprecated), grep -F or grep --fixed-strings, to make it treat the pattern as a list of fixed strings, instead of a regex.
For reference, the documentation mentions (excerpts):
-F --fixed-strings Interpret the pattern as a list of fixed
strings (instead of regular expressions), separated by newlines, any
of which is to be matched. (-F is specified by POSIX.)
fgrep is the same as grep -F. Direct invocation as fgrep is
deprecated, but is provided to allow historical applications that rely
on them to run unmodified.
For the complete reference, check:
https://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html
grep -F is a standard way to tell grep to interpret argument as a fixed string, not a pattern.
You have to tell grep you use a fixed-string, instead of a pattern, using '-F' :
grep -ir "Something Here" * | grep -F \$someVar
In this question, the main issue is not about grep interpreting $ as a regex. It's about the shell substituting $someVar with the value of the environment variable someVar, likely the empty string.
So in the first example, it's like calling grep without any argument, and that's why it gives you a usage output. The second example should not return all rows containing someVar but all lines, because the empty string is in all lines.
To tell the shell to not substitute, you have to use '$someVar' or \$someVar. Then you'll have to deal with the grep interpretation of the $ character, hence the grep -F option given in many other answers.
So one valid answer would be:
grep -ir "Something Here" * | grep '$someVar'
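A short sketch of the two effects described above, shell expansion first and grep's pattern interpretation second. The three input lines are made up to stand in for the output of the first grep.

```shell
# Hypothetical lines standing in for the output of the first grep.
lines='echo $someVar;
echo "someVar";
$other = 1;'

someVar=''   # the shell variable is empty, as it most likely was in the question

# Double quotes: the shell expands $someVar to "", and the empty pattern matches every line.
expanded=$(printf '%s\n' "$lines" | grep "$someVar")

# Single quotes stop the expansion; -F makes grep take the $ literally.
literal=$(printf '%s\n' "$lines" | grep -F '$someVar')

printf 'double-quoted pattern:\n%s\n' "$expanded"
printf 'single-quoted with -F:\n%s\n' "$literal"
```

Only the single-quoted form narrows the output to the line that really contains $someVar.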
+1 for the -F option, it shall be the accepted answer.
Also, I saw "strange" behaviour while searching for the -I.. pattern in my files, because -I was treated as a grep option. To avoid this kind of error, explicitly mark the end of the command's options with --.
Example:
grep -HnrF -- <pattern> <files>
Hope that'll help someone.
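For illustration, a sketch of searching for a pattern that begins with a dash. The two compiler command lines are invented sample data; -e is shown as an alternative way to mark the next argument as the pattern.

```shell
# Hypothetical text to search through.
flags='gcc -I.. -c main.c
gcc -O2 -c util.c'

# -- ends option parsing, so "-I.." is taken as the pattern, not as options.
with_dashes=$(printf '%s\n' "$flags" | grep -F -- '-I..')

# -e explicitly introduces the pattern and has the same effect here.
with_e=$(printf '%s\n' "$flags" | grep -F -e '-I..')

printf '%s\n' "$with_dashes"
printf '%s\n' "$with_e"
```

With -F the dots are literal as well, so only the line containing -I.. is printed.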
Escape the $ by putting a \ in front of it.

How to trim specific text with grep

I need to trim some text with grep. I have tried various other methods and haven't had much luck. For example:
C:\Users\Admin\Documents\report2011.docx: My Report 2011
C:\Users\Admin\Documents\newposter.docx: Dinner Party Poster 08
How would it be possible to trim the text file so that the ":" after the file name and all characters after it are removed?
E.g. so the output would be like:
C:\Users\Admin\Documents\report2011.docx
C:\Users\Admin\Documents\newposter.docx
use awk?
awk -F: '{print $1 ":" $2}' inputFile > outFile
you can use grep
(note that -o returns only the matching text)
grep -oe "^C:[^:]*" inputFile > outFile
That is pretty simple to do with grep -o:
$ grep -o '^C:[^:]*' input
C:\Users\Admin\Documents\report2011.docx
C:\Users\Admin\Documents\newposter.docx
If you can have other drives just replace C by .:
$ grep -o '^.:[^:]*' input
If a line can start with something other than a drive name, you can handle both the case where a drive name occurs at the beginning of the line and the case where there is no drive name:
$ grep -o '^\(.:\|\)[^:]*' input
cat inputFile | cut -f1,2 -d":"
The -d specifies your delimiter, in this case ":". The -f1,2 means you want the first and second fields.
The first part doesn't necessarily have to be cat inputFile, it's just whatever it takes to get the text that you referred to. The key part being cut -f1,2 -d":"
Your text looks like output of grep. If what you're asking is how to print filenames matching a pattern, use GNU grep option --files-with-matches
You can use this as well for your example
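grep -E -o "^C\S+" | sed 's/:$//'
egrep -o "^C\S+" | sed 's/:$//'
\S here matches any non-space character. The sed step strips only the trailing colon; tr -d ":" would also delete the colon after the drive letter.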
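As a quick check, the grep -o and cut approaches from the answers above can be run side by side on the question's sample lines. Both assume that the path itself contains no colon other than the one after the drive letter.

```shell
sample='C:\Users\Admin\Documents\report2011.docx: My Report 2011
C:\Users\Admin\Documents\newposter.docx: Dinner Party Poster 08'

# grep -o prints only the match: the drive prefix plus everything up to the next colon.
with_grep=$(printf '%s\n' "$sample" | grep -o '^C:[^:]*')

# cut keeps the first two colon-separated fields (drive letter and path).
with_cut=$(printf '%s\n' "$sample" | cut -d: -f1,2)

printf '%s\n' "$with_grep"
[ "$with_grep" = "$with_cut" ] && echo "grep -o and cut agree"
```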

Resources