Using grep to identify a pattern


I have several documents hosted on a cloud instance. I want to extract all words conforming to a specific pattern into a .txt file. This is the pattern:
ABC123A
ABC123B
ABC765A
and so on. Essentially, the words start with a specific character string 'ABC', have a fixed number of numerals, and end with a letter. This is my code:
grep -oh ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
When I execute the query, it runs for several hours without generating any output. I have over 1100 documents. However, when I run this query:
grep -r ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
the list of files with the strings is generated in a matter of seconds.
What do I need to correct in my query? Also, what is causing the delay?
UPDATE 1
Based on the answers, it's evident that the command is missing the file name on which it needs to be executed. I want to run the code on multiple document files (>1000)
The documents I want searched are in multiple sub-directories within a directory. What is a good way to search through them? Doing
grep -roh ABC[0-9].*[a-zA-Z]$ > /home/user/abcLetterMatches.txt
only returns the file names.
UPDATE 2
If I use the updated code from the answer below:
find . -exec grep -oh "ABC[0-9].*[a-zA-Z]$" >> ~/abcLetterMatches.txt {} \;
I get a no file or directory error
UPDATE 3
The pattern can be anywhere in the line.

You can use this regexp:
grep -E "^ABC[0-9]{3}[A-Z]$" docs > filename
ABC123A
ABC123B
ABC765A

There is no delay; grep is just waiting for input you didn't give it (with no file argument, it reads standard input by default). You can correct your command by supplying a filename argument:
grep -oh "ABC[0-9].*[a-zA-Z]$" file.txt > /home/user/abcLetterMatches.txt
Source (man grep):
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
To run the same grep on several files recursively, combine it with the find command:
find . -exec grep -oh "ABC[0-9].*[a-zA-Z]$" >> ~/abcLetterMatches.txt {} \;

This does what you ask for:
grep -hr '^ABC[0-9]\{3\}[A-Za-z]$'
-h to not get the filenames.
-r to search recursively. If no directory is given (as above) the current one is used. Otherwise just specify one as the last argument.
Quotes around the pattern to avoid accidental globbing, etc.
^ at the beginning of the pattern, together with $ at the end, to match only whole lines. (Not sure if this was a requirement, but the sample data suggests it.)
\{3\} to specify that there should be three digits.
No .* as that would match a whole lot of other things.
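
Since UPDATE 3 says the pattern can be anywhere in the line, here is a minimal sketch of a variant without the ^ and $ anchors. It assumes a grep that supports -r, -o and -w (GNU and BSD grep do), and the sample directory and file names are made up for the demonstration:

```shell
# Hypothetical sample tree standing in for the real documents.
docs=$(mktemp -d)
mkdir -p "$docs/sub"
printf 'order ABC123A shipped\nnothing here\n' > "$docs/a.txt"
printf 'ABC765A and ABC123B, but not ABC12 or XABC999Z\n' > "$docs/sub/b.txt"

# -r recurse, -h hide file names, -o print only the matched text,
# -w accept a match only when it is a whole word.
matches=$(grep -rhow 'ABC[0-9]\{3\}[A-Za-z]' "$docs" | sort)
printf '%s\n' "$matches"

rm -rf "$docs"
```

When redirecting the result to a file, it is probably safest to write it outside the directory being searched, so grep does not end up reading its own output.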

Related

grep search for pipe term Argument list too long

I have something like
grep ... | grep -f - *orders*
where the first grep ... gives a list of order numbers like
1393
3435
5656
4566
7887
6656
and I want to find those orders in multiple files (a_orders_1, b_orders_3 etc.), these files look something like
1001|strawberry|sam
1002|banana|john
...
However, when the first grep... returns too many order numbers I get the error "Argument list too long".
I also tried to give the grep command one order number at a time using a while loop but that's just way too slow. I did
grep ... | while read order; do grep $order *orders*; done
I'm very new to Unix clearly, explanations would be greatly appreciated, thanks!

The problem is the expansion of *orders* in grep ... | grep -f - *orders*. Your shell expands the pattern to the full list of files before passing that list to grep.
So we need to pass fewer "orders" files to each grep invocation. The find program is one way to do that, because it accepts wildcards and expands them internally:
find . -name '*orders*' # note this searches subdirectories too
Now that you know how to generate the list of filenames without running into the command line length limit, you can tell find to execute your second grep:
grep ... | find . -name '*orders*' -exec grep -f - {} +
The {} is where find places the filenames, and the + terminates the command and lets find know you're OK with passing multiple arguments to each invocation of grep -f, while still respecting the command line length limit by invoking grep -f more than once if the list of files exceeds the allowed length of a single command.
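
One caveat with reading the patterns from standard input: if find has to split the file list over several grep invocations, only the first one can read the piped list, because it consumes the input up to end of file. A possible workaround, sketched here under the assumption that a temporary file is acceptable, is to save the order numbers first so that every invocation sees them. The file names and order data below are hypothetical:

```shell
# Hypothetical order files standing in for a_orders_1, b_orders_3, etc.
work=$(mktemp -d)
printf '1001|strawberry|sam\n1002|banana|john\n' > "$work/a_orders_1"
printf '1393|kiwi|ann\n5656|plum|bob\n' > "$work/b_orders_3"

# Stands in for the output of the first "grep ..." in the question.
patterns=$(mktemp)
printf '1393\n1002\n' > "$patterns"

# -F treats each pattern as a fixed string, -f reads the patterns from the
# file, -h hides file names; "+" lets find batch the file names safely.
found=$(find "$work" -name '*orders*' -type f -exec grep -hFf "$patterns" {} + | sort)
printf '%s\n' "$found"

rm -rf "$work" "$patterns"
```

Note that -F matches the number anywhere in the line, so an order number could also match inside another field.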

Find files whose content match a line from text file

I have a text file - accessions.txt (below is a subset of this file):
KRO94967.1
KRO95967.1
KRO96427.1
KRO94221.1
KRO94121.1
KRO94145.1
WP_088442850.1
WP_088252850.1
WP_088643726.1
WP_088739685.1
WP_088283155.1
WP_088939404.1
And I have a directory with multiple files (*.align).
I want to find the filenames (*.align) whose content matches any line within my accessions.txt text file.
I know that find . -exec grep -H 'STRING' {} + works to find specific strings (e.g. replacing STRING with WP_088939404.1 returns every filename where the string WP_088939404.1 is present).
Is there a way to replace STRING with "all strings inside my text file" ?
Or
Is there another (better) way to do this?
I was trying to avoid writing a loop that reads the content of all my files as there are too many of them.
Many thanks!

You're looking for grep's -f option.
find . -name '*.align' -exec grep -Fxqf accessions.txt {} \; -print

grep can take a list of patterns to match with -f.
grep -lFf accessions.txt directory/*.align
-F tells grep to interpret the lines as fixed strings, not regex patterns.
Sometimes, -w is also needed to prevent matching inside words, e.g.
abcd
might match not only abcd, but also xabcd or abcdy. Sometimes, preprocessing the input list is needed to prevent unwanted matching if the rules are more complex.
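
To illustrate the difference -w makes, here is a small sketch with made-up .align files (the sequence content is invented; only the accession format follows the question). It assumes a grep where -w can be combined with -F and -f, as in GNU grep:

```shell
# Hypothetical directory with a pattern list and three alignment files.
dir=$(mktemp -d)
printf 'KRO94967.1\nWP_088939404.1\n' > "$dir/accessions.txt"
printf '>KRO94967.1 sample\nACGT\n' > "$dir/one.align"
printf '>KRO94967.10 other\nACGT\n' > "$dir/two.align"
printf '>XP_000000001.1 none\nACGT\n' > "$dir/three.align"

# Without -w, KRO94967.1 also matches inside KRO94967.10.
loose=$(cd "$dir" && grep -lFf accessions.txt ./*.align | sort)
# With -w, a match must be delimited by non-word characters.
strict=$(cd "$dir" && grep -lwFf accessions.txt ./*.align | sort)

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$dir"
```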

Finding multiple strings in directory using linux commands

If I have two strings, for example "class" and "btn", what is the linux command that would allow me to search for these two strings in an entire directory?
To be more specific, let's say I have a directory that contains a few folders with a bunch of .php files. My goal is to search through those .php files and print only the files that contain "class" and "btn" in one line. Hopefully this clarifies things.
Thanks,

I normally use the following to search for strings inside my source code. It searches for a string and shows the exact line number where that text appears, which is very helpful when searching source files. You can always pipe the output to another grep to filter it further.
grep -rn "text_to_search" directory_name/
example:
$ grep -rn "angular" menuapp
$ grep -rn "angular" menuapp | grep some_other_string
output would be:
menuapp/public/javascripts/angular.min.js:251://# sourceMappingURL=angular.min.js.map
menuapp/public/javascripts/app.js:1:var app = angular.module("menuApp", []);

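grep -rE 'class|btn' /path/to/directory
grep is used to search for a string in files. With the -r flag, it searches all files in a directory recursively. The pattern comes before the path, and -E enables the | alternation, so this matches lines that contain either string.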
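
The question asks for lines that contain both strings. One way to sketch that, assuming a grep with -r and --include (GNU and BSD grep have both), is a single extended pattern covering the two possible orders. The directory layout and file contents here are hypothetical:

```shell
# Hypothetical project tree with a few .php files.
src=$(mktemp -d)
mkdir -p "$src/views"
printf '<a class="btn btn-primary">Go</a>\n<p class="note">hi</p>\n' > "$src/views/a.php"
printf '<div id="btn-row"></div>\n' > "$src/b.php"
printf 'class and btn in a non-php file\n' > "$src/notes.txt"

# Match lines where "class" and "btn" both occur, in either order,
# looking only at files whose names end in .php.
both=$(cd "$src" && grep -rnE --include='*.php' 'class.*btn|btn.*class' .)
printf '%s\n' "$both"

rm -rf "$src"
```

Replacing -n with -l would list only the names of the matching files instead of the matching lines.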

Or, alternatively using the find command to "identify" the files to be searched instead of using grep in recursive mode:
find /path/to/your/directory -type f -exec grep "text_to_search" {} +

Listing entries in a directory using grep

I'm trying to list all entries in a directory whose names contain ONLY upper-case letters. Directories need "/" appended.
#!/bin/bash
cd ~/testfiles/
ls | grep -r *.*
Since grep by default looks for upper-case letters only (right?), I'm just recursively searching through the directories under testfiles for all names that contain only upper-case letters.
Unfortunately this doesn't work.
As for appending directories, I'm not sure why I need to do this. Does anyone know where I can start with some detailed explanations on what I can do with grep? Furthermore how to tackle my problem?

No, grep does not only consider uppercase letters.
Your question is a bit unclear, for example:
from your usage of the -r option, it seems you want to search recursively, however you don't say so. For simplicity I assume you don't need to; consider looking into @twm's answer if you need recursion.
you want to look for uppercase (letters) only. Does that mean you don't want to accept any other (non-letter) characters that are still valid in file names (like digits, dashes, dots, etc.)?
since you don't say that it is not permissible to have only one file per line, I am assuming it is OK (thus using ls -1).
The naive solution would be:
ls -1 | grep "^[[:upper:]]\+$"
That is, print all lines containing only uppercase letters. In my TEMP directory that prints, for example:
ALLBIG
LCFEM
WPDNSE
This however would exclude files like README.TXT or FILE001, which depending on your requirements (see above) should most likely be included.
Thus, a better solution would be:
ls -1 | grep -v "[[:lower:]]\+"
That is, print all lines not containing a lowercase letter. In my TEMP directory that prints, for example:
ALLBIG
ALLBIG-01.TXT
ALLBIG005.TXT
CRX_75DAF8CB7768
LCFEM
WPDNSE
~DFA0214428CD719AF6.TMP
Finally, to "properly mark" directories with a trailing '/', you could use the -F (or --classify) option.
ls -1F | grep -v "[[:lower:]]\+"
Again, example output:
ALLBIG
ALLBIG-01.TXT
ALLBIG005.TXT
CRX_75DAF8CB7768
LCFEM/
WPDNSE/
~DFA0214428CD719AF6.TMP
Note that a different option would be to use find (e.g. find ! -regex ".*[a-z].*"), if you can live with its different output format.

The exact regular expression depends on the output format of your ls command. Assuming that you do not use an alias for ls, you can try this:
ls -R | grep -o -w "[A-Z]*"
Note that with -R, ls recursively lists directories and files under the current directory. The grep option -o tells grep to print only the matched part of the text. The -w option tells grep to accept only whole-word matches. The "[A-Z]*" is a regexp that matches only upper-case words.
Note that because of -o, this prints the uppercase words found inside names rather than the names themselves: TEST.txt yields TEST, and TEXT.TXT yields TEXT and TXT. In other words, it only considers the parts of names that are formed by uppercase letters.

It's ls which lists the files, not grep, so that is where you need to specify that you want "/" appended to directories. Use ls --classify to append "/" to directories.
grep is used to process the results from ls (or some other source, generally speaking) and only show lines that match the pattern you specify. It is not limited to uppercase characters. You can limit it to just upper case characters and "/" with grep -E '^[A-Z/]*$' or, if you also want numbers, periods, etc., you could instead filter out lines that contain lowercase characters with grep -v -E '[a-z]'.
As grep is not the program which lists the files, it is not where you want to perform the recursion. ls can list paths recursively if you use ls -R. However, you're just going to get the last component of the file paths that way.
You might want to consider using find to handle the recursion. This works for me:
find . -exec ls -d --classify {} \; | egrep -v '[a-z][^/]*/?$'
I should note, using ls --classify to append "/" to the end of directories may also append some other characters to other types of paths that it can classify. For instance, it may append "*" to the end of executable files. If that's not OK, but you're OK with listing directories and other paths separately, this could be worked around by running find twice - once for the directories and then again for other paths. This works for me:
find . -type d | egrep -v '[a-z][^/]*$' | sed -e 's#$#/#'
find . -not -type d | egrep -v '[a-z][^/]*$'
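
For the non-recursive case, a plain shell loop is another option that avoids parsing ls output. This is only a sketch: it follows the "no lowercase letters" interpretation used above, and the test directory and its entries are made up:

```shell
# Hypothetical test directory standing in for ~/testfiles.
dir=$(mktemp -d)
mkdir "$dir/LCFEM" "$dir/lower"
touch "$dir/ALLBIG" "$dir/README.TXT" "$dir/notes.txt"

listing=$(
  cd "$dir" || exit 1
  for entry in *; do
    case $entry in
      *[[:lower:]]*) continue ;;   # skip names containing a lowercase letter
    esac
    if [ -d "$entry" ]; then
      printf '%s/\n' "$entry"      # append "/" to directories
    else
      printf '%s\n' "$entry"
    fi
  done
)
printf '%s\n' "$listing"

rm -rf "$dir"
```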

Looking for tool to search text in files on command line

Hello
I'm looking for a script or program that searches files (e.g. php, html, etc.) for keywords or a pattern and shows where the matching file is.
I use the command cat /home/* | grep "keyword"
but I have too many folders and files, and this command causes high load :/
I need this script to find fake websites (paypal, ebay, etc)

find /home -exec grep -s "keyword" {} \; -print

You don't really say what OS (and shell) you are using. You might want to retag your question to help us out.
Because you mention cat | ... , I am assuming you are using a Unix/Linux variant, so here are some pointers for looking at files. (bmargulies' solution is good too.)
I'm looking for a script or program that searches files for keywords or a pattern
grep is the basic program for searching files for text strings. Its usage is
grep [-options] 'search target' file1 file2 .... filen
(Note that 'search target' contains a space. If you don't surround spaces in your search target with double or single quotes, you will have a minor error to debug.)
(Also note that 'search target' can use a wide range of wild-card characters, like ., ?, +, *, and many more; that is beyond the scope of your question.) ... anyway ...
As I guess you have discovered, you can only cram so many files at a time into the command line, even when using wild-card filename expansion. Unix/Linux almost always has a utility that can help with that,
startDir=/home
find ${startDir} -print | xargs grep -l 'Search Target'
This, as one person will be happy to remind you, will require further enhancements if your filenames contain whitespace characters or newlines.
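
One such enhancement, sketched here on the assumption that your find and xargs support -print0 and -0 (the GNU and BSD versions do), is to separate file names with NUL bytes so that spaces and newlines in names are harmless. The directory and file names below are hypothetical:

```shell
# Hypothetical tree with whitespace in directory and file names.
startDir=$(mktemp -d)
mkdir -p "$startDir/site one"
printf 'login at paypal.fake.com\n' > "$startDir/site one/index page.html"
printf 'harmless\n' > "$startDir/ok.html"

# -print0 ends each name with a NUL byte and xargs -0 splits on NUL,
# so names containing whitespace reach grep intact.
# -F treats the target as a fixed string, -l lists matching file names only.
hits=$(cd "$startDir" && find . -type f -print0 | xargs -0 grep -lF 'paypal.fake.com')
printf '%s\n' "$hits"

rm -rf "$startDir"
```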
The options available for grep can vary wildly based on which OS you are using. If you're lucky, you type the following to get the man page for your local grep.
man grep
If you don't have your page buffer set up for a large size, you might need to do
man grep | page
so you can see the top of the 'document'. Press any key to advance to the next page and when you are at the end of the document, the last key press returns you to the command prompt.
Some options that most greps have that might be useful to you are
-i (ignore case)
-l (list only the filenames where the text is found)
There is also fgrep (equivalent to grep -F), which treats search targets as fixed strings rather than regular expressions.
Like grep, it accepts -f, so you can give it a file of search targets to scan for, and it is used like
fgrep [-other_options] -f srchTargetsFile file1 file2 ... filen
I need this script to find fake websites (paypal, ebay, etc)
Final solution
you can make a srchFile like
paypal.fake.com
ebay.fake.com
etc.fake.com
and then combined with above, run the following
startDir=/home
find ${startDir} -print | xargs fgrep -il -f srchFile
Some greps require that the -fsrchFile be run together.
Now you are finding all files starting at /home, searching with fgrep for paypal, ebay, etc. in all files. The -l says it will ONLY print the filename where a match is found. You can remove the -l and then you will see the output of what is found, prepended with the filename.
IHTH.
