How to use grep efficiently? - linux

I have a large number of small files to search. I have been looking for a de facto multi-threaded version of grep, but could not find anything. How can I improve my usage of grep? As of now I am doing this:
grep -R "string" >> Strings

In case anyone is interested: if you have xargs installed on a multi-core machine, you can benefit from the following.
Environment:
Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB
Tests:
1. Find the necessary files, pipe them to xargs and tell it to execute 8 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8
real 3m24.358s
user 1m27.654s
sys 9m40.316s
2. Find the necessary files, pipe them to xargs and tell it to execute 4 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings
real 16m3.051s
user 0m56.012s
sys 8m42.540s
3. Suggested by Stephen: find the necessary files and use + instead of xargs
time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings
real 53m45.438s
user 0m5.829s
sys 0m40.778s
4. Regular recursive grep.
grep -R "string" >> Strings
real 235m12.823s
user 38m57.763s
sys 38m8.301s
For my purposes, the first command worked just fine.

Wondering why -n1 is used below. Won't it be faster to use a higher value (say -n8, or to leave it out so xargs will do the right thing)?
xargs -0 -n1 -P8 grep -H "string"
It seems more efficient to give each forked grep more than one file to process (I assume -n1 puts only one file name in argv for each grep). As I see it, we should be able to give the highest n possible on the system (based on the argc/argv maximum length limitation), so that the setup cost of bringing up a new grep process is incurred less often.
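For instance, a variant worth timing (an untested sketch): keep -P8 but drop -n1 in favour of a larger batch size, so each grep processes many files per fork. Note that without some -n or -L limit, GNU xargs may pack everything into a few huge invocations and starve the parallelism, so an explicit batch size is a reasonable middle ground:
find ./ -name "*.ext" -print0 | xargs -0 -n1000 -P8 grep -H "string" >> Strings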

Related

How to accelerate substitution when using GNU sed with GNU find?

I have the results of a numerical simulation that consist of hundreds of directories; each directory contains millions of text files.
I need to substitute the string "wavelength;" with "wavelength_bc;", so I have tried both of the following:
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} \;
and
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} +
Unfortunately, the commands above take a very long time to finish (more than 1 hour).
I wonder how can I take advantage of the number of cores on my machine (8) to accelerate the command above?
I am thinking of using xargs with the -P flag, but I'm scared that it will corrupt the files; I have no idea whether that is safe or not.
In summary:
How can I accelerate sed substitutions when using with find?
Is it safe to use xargs -P to run it in parallel?
Thank you
xargs -P should be safe to use; however, you will need to use find's -print0 option and pipe to xargs -0 to handle filenames with spaces or wildcards:
find . -type f -print0 |
xargs -0 -I {} -P 0 sed -i 's/wavelength;/wavelength_bc;/g' {}
The -P 0 option makes xargs run in parallel mode: it will run as many processes as possible at a time.
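Note that -I {} makes xargs run one sed per file name, which wastes the batching xargs is good at. A sketch (assuming GNU xargs) that hands each sed a batch of 100 files to amortize process start-up cost:
find . -type f -print0 | xargs -0 -n 100 -P 8 sed -i 's/wavelength;/wavelength_bc;/g'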
This might work for you (GNU sed & parallel):
find . -type f | parallel -q sed -i 's/wavelength;/wavelength_bc;/g' {}
GNU parallel will run as many jobs as there are cores on the machine in parallel.
More sophisticated uses can involve remote servers and file transfer; see the GNU parallel documentation and its cheat sheet.
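If you want to cap the job count rather than use one job per core, the -j option does that (a sketch, assuming GNU parallel and GNU sed):
find . -type f -print0 | parallel -0 -j 8 -q sed -i 's/wavelength;/wavelength_bc;/g' {}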

How to grep through many files of same file type

I wish to grep through many (20,000) text files, each with about 1,000,000 lines, so the faster the better.
I have tried the code below and it just doesn't seem to do anything; it doesn't find any matches even after an hour (it should have finished by now).
for i in $(find . -name "*.txt"); do grep -Ff firstpart.txt $1; done
Ofir's answer is good. Another option:
find . -name "*.txt" -exec grep -fnFH firstpart.txt {} \;
I like to add the -n for line numbers and -H to get the filename. -H is particularly useful in this case as you could have a lot of matches.
Instead of iterating through the files in a loop, you can just give the file names to grep using xargs and let grep go over all the files.
find . -name "*.txt" | xargs grep $1
I'm not quite sure whether it will actually increase the performance, but it's probably worth a try.
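If the file names may contain spaces or newlines, a safer variant of the same idea (a hedged sketch) uses NUL delimiters:
find . -name "*.txt" -print0 | xargs -0 grep "$1"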
ripgrep is the most amazing tool. You should get that and use it.
To search *.txt files in all directories recursively, do this:
rg -t txt -f patterns.txt
Ripgrep uses one of the fastest regular expression engines out there. It uses multiple threads. It searches directories and files, and filters them to the interesting ones in the fastest way.
It is simply great.
For anyone stuck using grep for whatever reason:
find -name '*.txt' -type f -print0 | xargs -0 -P 8 -n 8 grep -Ff patterns.txt
That tells xargs to use 8 arguments per command (-n 8) and to run 8 copies in parallel (-P 8). The downside is that the output from the parallel greps may become interleaved and corrupted.
Instead of xargs you could use parallel which does a fancier job and keeps output in order:
find -name '*.txt' -type f -print0 | parallel -0 grep --with-filename -Ff patterns.txt

Delete files 100 at a time and count total files

I have written a bash script to delete 100 files at a time from a directory, because I was getting an "argument list too long" error, but now I want to count the total number of files deleted from the directory.
Here is the script
echo /example-dir/* | xargs -n 100 rm -rf
What I want is to write the total number of deleted files from each directory into a file, along with the path, for example: Deleted <count> files from <path>
How can i achieve this with my current setup?
You can do this by enabling verbose output from rm and then counting the output lines using wc -l.
If you have whitespace or special characters in the file names, using echo to pass the list of files to xargs will not work.
Better to use find with -print0, which uses a NUL character as the delimiter between the individual files:
find /example-dir -type f -print0 | xargs --null -n 100 rm -vrf | wc -l
You can avoid xargs and do this in a simple while loop and use a counter:
destdir='/example-dir/'
count=0
while IFS= read -r -d '' file; do
    rm -rf "$file"
    ((count++))
done < <(find "$destdir" -type f -print0)
echo "Deleted $count files from $destdir"
Note use of -print0 to take care of file names with whitespaces/newlines/glob etc.
By the way, if you really have lots of files and you do this often, it might be useful to look at some other options:
Use find's built-in -delete
time find . -name \*.txt -print -delete | wc -l
30000
real 0m1.244s
user 0m0.055s
sys 0m1.037s
Use find's ability to build up maximal length argument list
time find . -name \*.txt -exec rm -v {} + | wc -l
30000
real 0m0.979s
user 0m0.043s
sys 0m0.920s
Use GNU Parallel's ability to build long argument lists
time find . -name \*.txt -print0 | parallel -0 -X rm -v | wc -l
30000
real 0m1.076s
user 0m1.090s
sys 0m1.223s
Use a single Perl process to read filenames and delete whilst counting
time find . -name \*.txt -print0 | perl -0ne 'unlink;$i++;END{print $i}'
30000
real 0m1.049s
user 0m0.057s
sys 0m1.006s
For testing, you can create 30,000 files really fast with GNU Parallel, whose -X option also builds up long argument lists. For example, I can create 30,000 files in 8 seconds on my Mac with:
seq -w 0 29999 | parallel -X touch file{}.txt

How do I recursively grep all directories and subdirectories?

How do I recursively grep all directories and subdirectories?
find . | xargs grep "texthere" *
grep -r "texthere" .
The first parameter represents the regular expression to search for, while the second one represents the directory that should be searched. In this case, . means the current directory.
Note: This works for GNU grep, and on some platforms like Solaris you must specifically use GNU grep as opposed to legacy implementation. For Solaris this is the ggrep command.
If you know the extension or pattern of the files you are after, another method is to use the --include option:
grep -r --include "*.txt" texthere .
You can also mention files to exclude with --exclude.
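For example (GNU grep), combining both:
grep -r --include "*.txt" --exclude "*.bak" texthere .
There is also --exclude-dir to skip whole directories, e.g. --exclude-dir=.git.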
Ag
If you frequently search through code, Ag (The Silver Searcher) is a much faster alternative to grep, that's customized for searching code. For instance, it's recursive by default and automatically ignores files and directories listed in .gitignore, so you don't have to keep passing the same cumbersome exclude options to grep or find.
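A minimal sketch, assuming ag is installed:
ag "texthere"                       # recursive from the current directory, honours .gitignore
ag --ignore-dir vendor "texthere"   # explicitly skip one more directory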
I now always use (even on Windows with GoW -- Gnu on Windows):
grep --include="*.xxx" -nRHI "my Text to grep" *
(As noted by kronen in the comments, you can add 2>/dev/null to discard "permission denied" output.)
That includes the following options:
--include=PATTERN
Recurse in directories only searching file matching PATTERN.
-n, --line-number
Prefix each line of output with the line number within its input file.
(Note: phuclv adds in the comments that -n decreases performance a lot, so you might want to skip that option.)
-R, -r, --recursive
Read all files under each directory, recursively; this is equivalent to the -d recurse option.
-H, --with-filename
Print the filename for each match.
-I
Process a binary file as if it did not contain matching data;
this is equivalent to the --binary-files=without-match option.
And I can add 'i' (-nRHIi), if I want case-insensitive results.
I can get:
/home/vonc/gitpoc/passenger/gitlist/github #grep --include="*.php" -nRHI "hidden" *
src/GitList/Application.php:43: 'git.hidden' => $config->get('git', 'hidden') ? $config->get('git', 'hidden') : array(),
src/GitList/Provider/GitServiceProvider.php:21: $options['hidden'] = $app['git.hidden'];
tests/InterfaceTest.php:32: $options['hidden'] = array(self::$tmpdir . '/hiddenrepo');
vendor/klaussilveira/gitter/lib/Gitter/Client.php:20: protected $hidden;
vendor/klaussilveira/gitter/lib/Gitter/Client.php:170: * Get hidden repository list
vendor/klaussilveira/gitter/lib/Gitter/Client.php:176: return $this->hidden;
...
Also:
find ./ -type f -print0 | xargs -0 grep "foo"
but grep -r is a better answer.
globbing **
Using grep -r works, but it may be overkill, especially in large folders.
For more practical usage, here is the syntax which uses globbing (**):
grep "texthere" **/*.txt
which greps only the files matching the selected pattern. It works in supporting shells such as Bash 4+ or zsh.
To activate this feature, run: shopt -s globstar.
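Putting the two steps together in Bash (a quick sketch):
shopt -s globstar      # Bash 4+: make ** match any depth of directories
grep "texthere" **/*.txt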
See also: How do I find all files containing specific text on Linux?
git grep
For projects under Git version control, use:
git grep "pattern"
which is much quicker.
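You can also restrict the search to particular files with a pathspec, for example:
git grep "pattern" -- "*.py"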
ripgrep
For larger projects, the quickest grepping tool is ripgrep which greps files recursively by default:
rg "pattern" .
It's built on top of Rust's regex engine which uses finite automata, SIMD and aggressive literal optimizations to make searching very fast. Check the detailed analysis here.
On POSIX systems, grep has no -r parameter, so your grep -rn "stuff" . won't run, but if you use the find command it will:
find . -type f -exec grep -n "stuff" {} \; -print
This works on Solaris and HP-UX.
If you only want to follow actual directories, and not symbolic links,
grep -r "thingToBeFound" directory
If you want to follow symbolic links as well as actual directories (be careful of infinite recursion),
grep -R "thing to be found" directory
Since you're trying to grep recursively, the following options may also be useful to you:
-H: outputs the filename with the line
-n: outputs the line number in the file
So if you want to find all files containing Darth Vader in the current directory or any subdirectories and capture the filename and line number, but do not want the recursion to follow symbolic links, the command would be
grep -rnH "Darth Vader" .
If you want to find all mentions of the word cat in the directory
/home/adam/Desktop/TomAndJerry
and you're currently in the directory
/home/adam/Desktop/WorldDominationPlot
and you want to capture the filename but not the line number of any instance of the string "cats", and you want the recursion to follow symbolic links if it finds them, you could run either of the following
grep -RH "cats" ../TomAndJerry #relative directory
grep -RH "cats" /home/adam/Desktop/TomAndJerry #absolute directory
Source:
running "grep --help"
A short introduction to symbolic links, for anyone reading this answer and confused by my reference to them:
https://www.nixtutor.com/freebsd/understanding-symbolic-links/
To find the names of files (with their paths) that recursively contain a particular string, use the commands below
for UNIX:
find . | xargs grep "searched-string"
for Linux:
grep -r "searched-string" .
find a file on UNIX server
find . -type f -name file_name
find a file on LINUX server
find . -name file_name
Just the filenames can be useful too:
grep -r -l "foo" .
Another syntax to grep a string recursively in all files on a Linux system:
grep -irn "string"
-r indicates a recursive search, looking for the specified string in the given directory and its subdirectories
-i ignores case
-n prints the line number where the string was found
NB: this prints massive results to the console, so you might need to filter the output by piping it and removing the less interesting bits of info; it also searches binary files, so you might want to filter some of those results out too
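For example, adding -I to skip binary files and paging through the matches (a quick sketch):
grep -irnI "string" . | less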
ag is my favorite way to do this now: github.com/ggreer/the_silver_searcher. It's basically the same thing as ack but with a few more optimizations.
Here's a short benchmark. I clear the cache before each test (cf https://askubuntu.com/questions/155768/how-do-i-clean-or-disable-the-memory-cache )
ryan@3G08$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
3
ryan@3G08$ time grep -r "hey ya" .
real 0m9.458s
user 0m0.368s
sys 0m3.788s
ryan@3G08$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
3
ryan@3G08$ time ack-grep "hey ya" .
real 0m6.296s
user 0m0.716s
sys 0m1.056s
ryan@3G08$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
3
ryan@3G08$ time ag "hey ya" .
real 0m5.641s
user 0m0.356s
sys 0m3.444s
ryan@3G08$ time ag "hey ya" . # test without first clearing cache
real 0m0.154s
user 0m0.224s
sys 0m0.172s
This should work:
grep -R "texthere" *
If you are looking for specific content in all files of a directory structure, you may use find, since it makes clearer what you are doing:
find -type f -exec grep -l "texthere" {} +
Note that -l (a lowercase L) shows the name of the file that contains the text. Remove it if you instead want to print the match itself, or use -H to get the file name together with the match. Altogether, other alternatives are:
find -type f -exec grep -Hn "texthere" {} +
Where -n prints the line number.
This is the one that worked for my case on my current machine (git bash on windows 7):
find ./ -type f -iname "*.cs" -print0 | xargs -0 grep "content pattern"
I always forget the -print0 and -0 for paths with spaces.
EDIT: My preferred tool is now instead ripgrep: https://github.com/BurntSushi/ripgrep/releases . It's really fast and has better defaults (like recursive by default). Same example as my original answer but using ripgrep: rg -g "*.cs" "content pattern"
grep -r "texthere" . (notice period at the end)
(^credit: https://stackoverflow.com/a/1987928/1438029)
Clarification:
grep -r "texthere" / (recursively grep all directories and subdirectories)
grep -r "texthere" . (recursively grep these directories and subdirectories)
grep recursive
grep [options] PATTERN [FILE...]
[options]
-R, -r, --recursive
Read all files under each directory, recursively.
This is equivalent to the -d recurse or --directories=recurse option.
http://linuxcommand.org/man_pages/grep1.html
grep help
$ grep --help
$ grep --help | grep recursive
-r, --recursive like --directories=recurse
-R, --dereference-recursive
Alternatives
ack (http://beyondgrep.com/)
ag (http://github.com/ggreer/the_silver_searcher)
Throwing my two cents in here. As others have already mentioned, grep -r doesn't work on every platform. This may sound silly, but I always use git.
git grep "texthere"
Even if the directory is not staged, I just stage it and use git grep.
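If you'd rather not stage anything, git grep can search untracked files too, and even works outside a repository (check that your Git version supports these flags):
git grep --untracked "texthere"   # tracked plus untracked files in a repo
git grep --no-index "texthere"    # grep the working directory, no repo needed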
Below are the commands to search for a string recursively on Unix and Linux environments.
The UNIX command is:
find . -type f -exec grep "string to be searched" {} \;
for Linux command is:
grep -r "string to be searched" .
In 2018, you want to use ripgrep or the-silver-searcher because they are way faster than the alternatives.
Here is a directory with 336 first-level subdirectories:
% find . -maxdepth 1 -type d | wc -l
336
% time rg -w aggs -g '*.py'
...
rg -w aggs -g '*.py' 1.24s user 2.23s system 283% cpu 1.222 total
% time ag -w aggs -G '.*py$'
...
ag -w aggs -G '.*py$' 2.71s user 1.55s system 116% cpu 3.651 total
% time find ./ -type f -name '*.py' | xargs grep -w aggs
...
find ./ -type f -name '*.py' 1.34s user 5.68s system 32% cpu 21.329 total
xargs grep -w aggs 6.65s user 0.49s system 32% cpu 22.164 total
On OSX, this installs ripgrep: brew install ripgrep. This installs silver-searcher: brew install the_silver_searcher.
On my IBM AIX server (OS version: AIX 5.2), I use:
find ./ -type f -print -exec grep -n -i "stringYouWannaFind" {} \;
This will print out the path/file name and the relative line number in the file, like:
./inc/xxxx_x.h
2865: /** Description : stringYouWannaFind */
Anyway, it works for me :)
For a list of available flags:
grep --help
Returns all matches for the regexp texthere in the current directory, with the corresponding line number:
grep -rn "texthere" .
Returns all matches for texthere, starting at the root directory, with the corresponding line number and ignoring case:
grep -rni "texthere" /
flags used here:
-r recursive
-n print line number with output
-i ignore case
Note that grep whatever $(find .) sorts of solutions will run into "Argument list too long" errors when there are too many files matched by find, because the whole list lands on a single command line; find . -type f | xargs grep whatever avoids that by splitting the list over several grep invocations.
The best bet is grep -r, but if that isn't available, use find . -type f -exec grep -H whatever {} \; instead.
I guess this is what you're trying to write
grep myText $(find .)
and this may be helpful if you want to find the files grep hit
grep myText $(find .) | cut -d : -f 1 | sort | uniq
For .gz files, recursively scan all files and directories (change the file type, or put * to match everything):
find . -name \*.gz -print0 | xargs -0 zgrep "STRING"
Just for fun, a quick and dirty search of *.txt files, if christangrant's answer is too much to type :-)
grep -r texthere . | grep .txt
Here's a recursive function (tested lightly with bash and sh) that traverses all subfolders of a given folder ($1) and uses grep to search for a given string ($3) in given files ($2):
$ cat script.sh
#!/bin/sh
cd "$1"
loop () {
    for i in *
    do
        if [ -d "$i" ]
        then
            # echo entering "$i"
            cd "$i"
            loop "$1" "$2"
        fi
    done
    if [ -f "$1" ]
    then
        grep -l "$2" "$PWD/$1"
    fi
    cd ..
}
loop "$2" "$3"
Running it, with example output:
$ sh script.sh start_folder filename search_string
/home/james/start_folder/dir2/filename
Get the files matched by a first grep and feed them to a second grep as its input files. The first command below prints the SECONDwORD matches found inside files that contain FIRSTWORD; the second (with -L) lists the files that contain FIRSTWORD but not SECONDwORD:
grep -l -r --include "*.js" "FIRSTWORD" * | xargs grep "SECONDwORD"
grep -l -r --include "*.js" "FIRSTWORD" * | xargs grep -L "SECONDwORD"
A more involved variant: take the base names of the *.js files that contain SEARCHWORD, then grep for each of those names (as a whole word) in *.html and *.js files, to see where the matching files are referenced:
grep -l -r --include "*.js" "SEARCHWORD" * | awk -F'/' '{print $NF}' | xargs -I{} sh -c 'echo {}; grep -l -r --include "*.html" --include "*.js" -w -e {} *; echo ""'

find string inside a gzipped file in a folder

My current problem is that I have around 10 folders, each of which contains gzipped files (around 5 each on average). This makes it 50 files to open and look at.
Is there a simpler method to find out if a gzipped file inside a folder has a particular pattern or not?
zcat ABC/myzippedfile1.txt.gz | grep "pattern match"
zcat ABC/myzippedfile2.txt.gz | grep "pattern match"
Instead of writing a script, can I do the same in a single line, for all the folders and sub folders?
for f in `ls *.gz`; do echo $f; zcat $f | grep <pattern>; done;
zgrep will look in gzipped files, has a -R recursive option, and a -H show me the filename option:
zgrep -R --include=*.gz -H "pattern match" .
OS-specific commands, as not all arguments work across the board:
Mac 10.5+: zgrep -R --include=\*.gz -H "pattern match" .
Ubuntu 16+: zgrep -i -H "pattern match" *.gz
You don't need zcat here because there is zgrep and zegrep.
If you want to run a command over a directory hierarchy, you use find:
find . -name "*.gz" -exec zgrep ⟨pattern⟩ \{\} \;
Also, ls *.gz is useless in a for loop; you should just use *.gz in the future.
Note: not every zgrep supports -R.
I think the solution from "Nietzche-jou" could be a better answer, but I would add the option -H to show the file name, something like this:
find . -name "*.gz" -exec zgrep -H 'PATTERN' \{\} \;
Use the find command:
find . -name "*.gz" -exec zcat "{}" + | grep "test"
or try using the recursive option (-r) of zcat
Coming in a bit late on this; I had a similar problem and was able to resolve it using:
zcat -r /some/dir/here | grep "blah"
As detailed here:
http://manpages.ubuntu.com/manpages/quantal/man1/gzip.1.html
However, this does not show the original file that the result matched from, instead showing "(standard input)", as it's coming in from a pipe. zcat does not seem to support outputting a name either.
In terms of performance, this is what we got:
$ alias dropcache="sync && echo 3 > /proc/sys/vm/drop_caches"
$ find 09/01 | wc -l
4208
$ du -chs 09/01
24M
$ dropcache; time zcat -r 09/01 > /dev/null
real 0m3.561s
$ dropcache; time find 09/01 -iname '*.txt.gz' -exec zcat '{}' \; > /dev/null
real 0m38.041s
As you can see, using the find|zcat method is significantly slower than using zcat -r when dealing with even a small volume of files. I was also unable to make zcat output the file name (using -v will apparently output the filename, but not on every single line). It would appear that there isn't currently a tool that will provide both speed and name consistency with grep (i.e. the -H option).
If you need to identify the name of the file that the result belongs to, then you'll need to either write your own tool (could be done in 50 lines of Python code) or use the slower method. If you do not need to identify the name, then use zcat -r.
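A middle ground worth trying (a sketch, assuming GNU xargs and a zgrep that forwards options through to grep): batch the files into relatively few zgrep calls, which keeps the file names via -H while avoiding one process per file:
find /some/dir/here -iname '*.txt.gz' -print0 | xargs -0 -n 100 zgrep -H "blah"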
Hope this helps
find . -name "*.gz"|xargs zcat | grep "pattern" should do.
zgrep "string" ./*/*
You can use the above command to search for a string in the .gz files of the dir directory, where dir has the following sub-directory structure:
/dir
    /childDir1
        /file1.gz
        /file2.gz
    /childDir2
        /file3.gz
        /file4.gz
    /childDir3
        /file5.gz
        /file6.gz
You can use this command:
zgrep "foo" $(find . -name "*.gz")
