Fastest way to grep through thousands of gz files? - linux

I have thousands of .gz files all in one directory. I need to grep through them for the string Mouse::Handler. Is the following the fastest (and most accurate) way to do this?
find . -name "*.gz" -exec zgrep -H 'Mouse::Handler' {} \;
Ideally I would also like to print out the line that I find this string on.
I'm running on a RHEL linux box.

You can search in parallel using
find . -name "*.gz" | xargs -n 1 -P NUM zgrep -H 'Mouse::Handler'
where NUM is around the number of cores you have.
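A minimal variant of the same idea, assuming GNU findutils and coreutils are available on the RHEL box: -print0/-0 keeps filenames with spaces intact, and nproc picks NUM for you.
# one zgrep per file, up to one worker per core
find . -name "*.gz" -print0 | xargs -0 -n 1 -P "$(nproc)" zgrep -H 'Mouse::Handler'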

Related

How to accelerate substitution when using GNU sed with GNU find?

I have the results of a numerical simulation that consist of hundreds of directories; each directory contains millions of text files.
I need to substitute the string "wavelength;" with "wavelength_bc;", so I have tried both of the following:
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} \;
and
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} +
Unfortunately, the commands above take a very long time to finish (more than 1 hour).
I wonder how I can take advantage of the number of cores on my machine (8) to speed up the command above.
I am thinking of using xargs with the -P flag, but I'm worried that it will corrupt the files; I have no idea whether that is safe or not.
In summary:
How can I accelerate sed substitutions when using them with find?
Is it safe to use xargs -P to run them in parallel?
Thank you
xargs -P should be safe to use; however, you will need to use the -print0 option of find and pipe into xargs -0 to handle filenames with spaces or wildcards:
find . -type f -print0 |
xargs -0 -I {} -P 0 sed -i 's/wavelength;/wavelength_bc;/g' {}
The -P 0 option tells xargs to run in parallel mode: it will run as many processes as possible for your CPU.
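A sketch of a batched variant, assuming GNU findutils/xargs/sed and coreutils (for nproc): dropping -I {} lets xargs hand many files to each sed process, which avoids one fork/exec per file, while -n keeps batches small enough that the work actually spreads across the workers.
# up to one sed per core, ~500 files per sed invocation; each file is touched by exactly one process
find . -type f -print0 |
xargs -0 -n 500 -P "$(nproc)" sed -i 's/wavelength;/wavelength_bc;/g'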
This might work for you (GNU sed & parallel):
find . -type f | parallel -q sed -i 's/wavelength;/wavelength_bc;/g' {}
GNU parallel will run as many jobs as there are cores on the machine in parallel.
More sophisticated uses can involve remote servers and file transfer; see the GNU parallel documentation and its cheatsheet.
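One more hedged optimisation, assuming GNU grep, sed and xargs: pre-filter with grep -l so only files that actually contain the string are rewritten. On millions of files this avoids rewriting (and re-timestamping) everything without a match; -Z and -0 keep the file list null-delimited throughout.
# stage 1: list matching files; stage 2: rewrite only those, in parallel
find . -type f -print0 |
xargs -0 grep -lZ 'wavelength;' |
xargs -0 -r -n 500 -P "$(nproc)" sed -i 's/wavelength;/wavelength_bc;/g'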

How to grep through many files of same file type

I wish to grep through many (20,000) text files, each with about 1,000,000 lines, so the faster the better.
I have tried the code below and it just doesn't seem to do anything; it hasn't found any matches even after an hour (it should have by now).
for i in $(find . -name "*.txt"); do grep -Ff firstpart.txt $1; done
Ofir's answer is good. Another option:
find . -name "*.txt" -exec grep -nFH -f firstpart.txt {} \;
I like to add the -n for line numbers and -H to get the filename. -H is particularly useful in this case as you could have a lot of matches.
Instead of iterating through the files in a loop, you can just give the file names to grep using xargs and let grep go over all the files.
find . -name "*.txt" | xargs grep $1
I'm not quite sure whether it will actually increase the performance, but it's probably worth a try.
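Incidentally, the loop in the question greps "$1" (the enclosing script's first positional parameter, which is probably empty, so grep just sits there waiting on stdin) instead of the loop variable. A minimal fix, kept close to the original for comparison (it still breaks on filenames with spaces):
for i in $(find . -name "*.txt"); do grep -FHn -f firstpart.txt "$i"; done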
ripgrep is the most amazing tool. You should get that and use it.
To search *.txt files in all directories recursively, do this:
rg -t txt -f patterns.txt
Ripgrep uses one of the fastest regular expression engines out there. It uses multiple threads. It searches directories and files, and filters them to the interesting ones in the fastest way.
It is simply great.
For anyone stuck using grep for whatever reason:
find -name '*.txt' -type f -print0 | xargs -0 -P 8 -n 8 grep -Ff patterns.txt
That tells xargs to use 8 arguments per command (-n 8) and to run 8 copies in parallel (-P 8). It has the downside that the output might become interleaved and corrupted.
Instead of xargs you could use parallel which does a fancier job and keeps output in order:
find -name '*.txt' -type f -print0 | parallel -0 grep --with-filename -Ff patterns.txt
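A hedged variant of that last line: GNU parallel's -k/--keep-order flag prints results in the same order as the input list (at the cost of some buffering), and grep's -n adds line numbers alongside the filenames:
find -name '*.txt' -type f -print0 | parallel -0 -k grep -HnF -f patterns.txt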

Optimal string replacing in files for AIX

I need to remove about 40 emails from several files in a distribution list.
One address might appear in several files and needs to be removed from all of them.
I am working in a directory with several .sh files which also have several lines.
I have done something like this in a couple of test files:
find . -type f -exec grep -li ADDRESS_TO_FIND {} 2>/dev/null \; | xargs sed -i 's/ADDRESS_TO_REMOVE/ /g' *
It works fine, but once I try it on the real files it takes a long time and just sits there. I need to run this on several servers, which is the main reason I want to optimize it.
I have tried to run something like this:
find . -type f -name '*sh' 2>/dev/null | xargs grep ADDRESS_TO_FIND
but that will return:
./FileContainingAddress.sh:ADDRESS_TO_FIND
How do I add something like this:
awk '{print substr($0,1,10)}'
but have it return everything before the ":"?
I can do the rest from there, but I haven't found out how to trim that part.
You can use -exec as a predicate in find, as long as you don't use the multiple file + version, which means that you can provide several -exec clauses each of which will be dependent on the success of the previous one. This style will avoid the construction of lists of filenames, which makes it much more robust in the face of files with odd characters in their names.
For example:
find . -type f -name '*sh' \
-exec grep -qi ADDRESS_TO_FIND {} \; \
-exec sed -i 's/ADDRESS_TO_FIND/ /g' {} \;
You probably want to provide the address as a parameter rather than having to type it twice, unless you really meant for the two instances to be different (ADDRESS_TO_FIND vs. ADDRESS_TO_REMOVE):
clean() {
find . -type f -name '*sh' \
-exec grep -qi "$1" {} \; \
-exec sed -i "s/$1/ /g" {} \;
}
(Watch out for / in the argument to clean. I'll leave making the sed more robust as an exercise.)
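As a sketch of that exercise, one option is to use a different sed delimiter so a '/' in the argument cannot terminate the expression early (this assumes '|' never appears in the addresses, which holds for ordinary email addresses, and assumes GNU sed for -i, as above):
clean() {
find . -type f -name '*sh' \
-exec grep -qi "$1" {} \; \
-exec sed -i "s|$1| |g" {} \;
}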
After looking back at your question, I noticed something that's potentially quite important:
find -type f -exec grep -li ADDRESS {} \; | xargs sed -i 's/ADDRESS/ /g' *
# here! -----------------------------------------------------------------^
The asterisk is being expanded, so the sed line is operating on every file in the directory.
Assuming that this wasn't a typo in your question, I believe that this is the source of your poor performance. You should remove it!
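For reference, a hedged rewrite of that pipeline with the stray asterisk removed; -Z/-0 keep filenames with spaces intact, and -r skips sed when nothing matched. These are GNU extensions, so this assumes GNU grep, sed and xargs are installed on the AIX box (the original already relied on GNU sed for -i):
find . -type f -name '*sh' -exec grep -liZ ADDRESS_TO_FIND {} + |
xargs -0 -r sed -i 's/ADDRESS_TO_FIND/ /g'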

linux + find word in file under directory but quickly

I have the following command
find /var -type f -exec grep "param1" {} \; -print
With this command I can find the param1 string in any file under /var
but the time it takes is very long.
I need another way to find a string in a file, one that is much faster than my example.
THX
yael
grep -r "string"
The find is not necessary.
Also, I think this belongs on superuser.com.
Take a look at the -l option to the grep command for a speed boost. To speed up the find command use:
find ... -exec sh -c '...' arg0 '{}' +
# grep ... -l: print files with matches, but stop scanning the file on the first match
grep -lsr "param1" /var
find /var -type f -exec sh -c 'grep -ls "param1" "$#"' arg0 '{}' +
find /var -type f -exec sh -c 'grep -ls "$0" "$#"' "param1" '{}' +
find /var -type f | xargs grep "param1"
would be slightly faster (no process spawning for each file)
grep -r "param1" /var
would be slightly more so I think.
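Combining those ideas, as a sketch: -F treats param1 as a fixed string (no regex engine), -l stops reading each file at its first match, and -s hides the permission errors that are common under /var.
grep -rlsF "param1" /var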
Try also using ack, which is "better than grep" in most cases. Among its features are the ability to ignore typical garbage files by default (such as .svn or .git directories, core dumps, backup files), the ability to use a large set of predefined file classes, and nice output formatting.
You can use locate's index (as long as you don't depend on files that have been added or removed since the index was last updated):
grep "param1" $(locate -r '^/var')
Some of these command optimizations are helpful, but the biggest jump in speed I got when grepping 2 million files came from using an SSD. The same queries took 1/5 of the time.

find string inside a gzipped file in a folder

My current problem is that I have around 10 folders, which contain gzipped files (around 5 each on average). This makes it 50 files to open and look at.
Is there a simpler method to find out if a gzipped file inside a folder has a particular pattern or not?
zcat ABC/myzippedfile1.txt.gz | grep "pattern match"
zcat ABC/myzippedfile2.txt.gz | grep "pattern match"
Instead of writing a script, can I do the same in a single line, for all the folders and sub folders?
for f in `ls *.gz`; do echo $f; zcat $f | grep <pattern>; done;
zgrep will look in gzipped files, has a -R recursive option, and a -H show me the filename option:
zgrep -R --include=*.gz -H "pattern match" .
OS-specific commands, as not all arguments work across the board:
Mac 10.5+: zgrep -R --include=\*.gz -H "pattern match" .
Ubuntu 16+: zgrep -i -H "pattern match" *.gz
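If your zgrep turns out not to support -R (see the comment further down), a hedged fallback that stays recursive without reaching for find is bash's globstar option (bash 4+); note that a very large tree could exceed the argument-list limit:
shopt -s globstar
zgrep -H "pattern match" ./**/*.gz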
You don't need zcat here because there is zgrep and zegrep.
If you want to run a command over a directory hierarchy, you use find:
find . -name "*.gz" -exec zgrep ⟨pattern⟩ \{\} \;
And also, “ls *.gz” is useless in for; you should just use “*.gz” in the future.
How? zgrep doesn't support -R on my system.
I think the solution of "Nietzche-jou" could be a better answer, but I would add the -H option to show the file name, something like this:
find . -name "*.gz" -exec zgrep -H 'PATTERN' \{\} \;
use the find command
find . -name "*.gz" -exec zcat "{}" + | grep "test"
or try using the recursive option (-r) of zcat
Coming in a bit late on this; I had a similar problem and was able to resolve it using:
zcat -r /some/dir/here | grep "blah"
As detailed here:
http://manpages.ubuntu.com/manpages/quantal/man1/gzip.1.html
However, this does not show the original file that the result matched from, instead showing "(standard input)" as it's coming in from a pipe. zcat does not seem to support outputting a name either.
In terms of performance, this is what we got:
$ alias dropcache="sync && echo 3 > /proc/sys/vm/drop_caches"
$ find 09/01 | wc -l
4208
$ du -chs 09/01
24M
$ dropcache; time zcat -r 09/01 > /dev/null
real 0m3.561s
$ dropcache; time find 09/01 -iname '*.txt.gz' -exec zcat '{}' \; > /dev/null
real 0m38.041s
As you can see, using the find|zcat method is significantly slower than using zcat -r when dealing with even a small volume of files. I was also unable to make zcat output the file name (using -v will apparently output the filename, but not on every single line). It would appear that there isn't currently a tool that will provide both speed and name consistency with grep (i.e. the -H option).
If you need to identify the name of the file that the result belongs to, then you'll need to either write your own tool (could be done in 50 lines of Python code) or use the slower method. If you do not need to identify the name, then use zcat -r.
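A hedged middle ground if you do need the filenames: batching many files into each zgrep call with -exec ... + keeps the -H output while spawning far fewer processes than the one-zcat-per-file run benchmarked above. zgrep still decompresses each file in its own gzip process, so expect a timing somewhere between the two:
find 09/01 -iname '*.txt.gz' -exec zgrep -H "blah" {} +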
Hope this helps
find . -name "*.gz" | xargs zcat | grep "pattern" should do.
zgrep "string" ./*/*
You can use the above command to search for a string in the .gz files of a directory dir that has the following sub-directory structure:
/dir
    /childDir1
        /file1.gz
        /file2.gz
    /childDir2
        /file3.gz
        /file4.gz
    /childDir3
        /file5.gz
        /file6.gz
You can use this command -
zgrep "foo" $(find . -name "*.gz")
