md5/sha1 hashing large files - linux

I have over half a million files to hash, spread across multiple folders.
An md5/crc hash is taking too long; some files are 1 GB to 11 GB in size.
I'm thinking of just hashing part of each file, using head.
So the following works when it comes to finding and hashing everything:
find . -type f -exec sha1sum {} \;
I'm just not sure how to take this a step further and hash only the first, say, 256 kB of each file, e.g.
find . -type f -exec head -c 256kB | sha1sum
Not sure if head is okay to use in this instance, or would dd be better?
The above command doesn't work, so I'm looking for ideas on how I can do this.
I would like the output to be the same as what native md5sum produces, i.e. in the below format (going to a text file):
<Hash> <file name>
I'm not sure if the above is possible in a single line, or whether a for/do loop will need to be used. Performance is key; this is bash on RHEL6.
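For reference, a minimal single-pipeline sketch of what is being asked for, assuming GNU coreutils (head accepts -c 256k) and a hypothetical output file partial-hashes.txt:
# Hash only the first 256 kB of each file and print "<hash>  <filename>",
# mimicking the native sha1sum/md5sum output format.
find . -type f -exec sh -c '
    for f in "$@"; do
        h=$(head -c 256k -- "$f" | sha1sum)
        printf "%s  %s\n" "${h%% *}" "$f"
    done
' sh {} + > /tmp/partial-hashes.txt   # written outside the tree so find does not hash it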

It is unclear where your limitation is. Do you have a slow disk or a slow CPU?
If your disk is not the limitation, you are probably limited by using a single core. GNU Parallel can help with that:
find . -type f | parallel -X sha256sum
If the limitation is disk I/O, then your idea of head makes perfect sense:
sha() {
    # Hash the last 1 MB of the file; the perl one-liner replaces the "-"
    # that sha256sum prints for stdin with the real file name.
    tail -c 1M "$1" | sha256sum | perl -pe 'BEGIN{$a=shift} s/-/$a/' "$1";
}
export -f sha
find . -type f -print0 | parallel -0 -j10 --tag sha
The best setting for -j10 depends on your disk system, so try adjusting it until you find the optimal value (which can be as low as -j1).
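If you want the first 256 kB, as asked, rather than the last 1 MB, a sketch of the same helper under the same assumptions (partial-hashes.txt is a hypothetical output file):
sha() {
    # Hash only the first 256 kB; the perl one-liner swaps the "-" that
    # sha256sum prints for stdin with the real file name, giving the
    # usual "<hash>  <file>" layout.
    head -c 256k "$1" | sha256sum | perl -pe 'BEGIN{$a=shift} s/-/$a/' "$1"
}
export -f sha
find . -type f -print0 | parallel -0 -j10 sha > partial-hashes.txt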

Related

How to accelerate substitution when using GNU sed with GNU find?

I have the results of a numerical simulation that consist of hundreds of directories; each directory contains millions of text files.
I need to substitute the string "wavelength;" with "wavelength_bc;", so I have tried both of the following:
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} \;
and
find . -type f -exec sed -i 's/wavelength;/wavelength_bc;/g' {} +
Unfortunately, the commands above take a very long time to finish (more than 1 hour).
I wonder how can I take advantage of the number of cores on my machine (8) to accelerate the command above?
I am thinking of using xargs with the -P flag, but I'm scared that it will corrupt the files, so I have no idea whether that is safe or not.
In summary:
How can I accelerate sed substitutions when using with find?
Is it safe to use xargs -P to run that in parallel?
Thank you
xargs -P should be safe to use; however, you will need to use the -print0 option of find and pipe to xargs -0 in order to handle filenames with spaces or wildcards:
find . -type f -print0 |
xargs -0 -I {} -P 0 sed -i 's/wavelength;/wavelength_bc;/g' {}
The -P 0 option makes xargs run in parallel mode: it will start as many processes at a time as possible.
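Since sed -i accepts several files per invocation, a variant that batches files instead of starting one sed per file might look like this (a sketch, assuming GNU find/xargs/sed; tune -n and -P to your machine):
# Each sed invocation edits up to 100 files; 8 invocations run at a time.
find . -type f -print0 |
xargs -0 -n 100 -P 8 sed -i 's/wavelength;/wavelength_bc;/g'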
This might work for you (GNU sed & parallel):
find . -type f | parallel -q sed -i 's/wavelength;/wavelength_bc;/g' {}
GNU parallel will run as many jobs as there are cores on the machine in parallel.
More sophisticated uses can involve remote servers and file transfer; see the GNU parallel documentation and its cheat sheet.
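GNU parallel can batch in the same way with -X, which fills each job's command line with as many file names as fit (a sketch under the same assumptions):
# -X appends as many file names as possible to each sed command line.
find . -type f -print0 | parallel -0 -X sed -i 's/wavelength;/wavelength_bc;/g'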

How to grep through many files of same file type

I wish to grep through many (20,000) text files, each with about 1,000,000 lines, so the faster the better.
I have tried the code below and it just doesn't seem to want to do anything; it doesn't find any matches even after an hour (it should have found some by now).
for i in $(find . -name "*.txt"); do grep -Ff firstpart.txt $1; done
Ofir's answer is good. Another option:
find . -name "*.txt" -exec grep -fnFH firstpart.txt {} \;
I like to add the -n for line numbers and -H to get the filename. -H is particularly useful in this case as you could have a lot of matches.
Instead of iterating through the files in a loop, you can just give the file names to grep using xargs and let grep go over all the files.
find . -name "*.txt" | xargs grep $1
I'm not quite sure whether it will actually increase the performance, but it's probably worth a try.
ripgrep is the most amazing tool. You should get that and use it.
To search *.txt files in all directories recursively, do this:
rg -t txt -f patterns.txt
Ripgrep uses one of the fastest regular expression engines out there. It uses multiple threads. It searches directories and files, and filters them to the interesting ones in the fastest way.
It is simply great.
For anyone stuck using grep for whatever reason:
find -name '*.txt' -type f -print0 | xargs -0 -P 8 -n 8 grep -Ff patterns.txt
The -n 8 tells xargs to use 8 arguments per command, and -P 8 runs 8 copies in parallel. It has the downside that the output might become interleaved and corrupted.
Instead of xargs you could use parallel which does a fancier job and keeps output in order:
find -name '*.txt' -type f -print0 | parallel -0 grep --with-filename -Ff patterns.txt

Multithreaded Bash in while loop

I have the following Bash one-liner which should iterate through all the files named *.xml in the folder, check whether they contain the string below, and if not, rename them to <filename>.empty:
find -name '*.xml' | xargs -I{} grep -LZ "state=\"open\"" {} | while IFS= read -rd '' x; do mv "$x" "$x".empty ; done
This process is very slow, and when running this script in folders with over 100k files, it takes well over 15 minutes to complete.
I couldn't find a way to make this process run multithreaded.
Note that with a for loop I'm hitting the "too many arguments" error, due to the large number of files.
Can anyone think of a solution ?
Thanks !
Roy
The biggest bottleneck in your code is that you are running a separate mv process (which is just a wrapper around a system call) to rename each file. Let's say you have 100,000 files, and 20,000 of them need to be renamed. Your original code will need 120,000 processes, one grep per file and one mv per rename. (Ignoring the 2 calls to find and xargs.)
A better approach would be to use a language that can access the system call directly. Here is a simple Perl example:
find -name '*.xml' | xargs -I{} grep -LZ "state=\"open\"" {} |
perl -n0e 'rename("$_", "$_.empty")'
This replaces 20,000 calls to mv with a single call to perl.
The other bottleneck is running a single grep process for each file. Instead, you'd like to pass as many files as possible to grep each time. There is no need for xargs here; use the -exec primary of find instead.
find -name '*.xml' -exec grep -LZ "state=\"open\"" {} + |
perl -n0e 'rename("$_", "$_.empty")'
The too many arguments error you were receiving is based on total argument length. Suppose the limit is 4096, and your XML files have an average name length of 20 characters. This means you should be able to pass 200+ files to each call to grep. The -exec ... + primary takes care of passing as many files as possible to each call to grep, so this code at most will require 100,000 / 200 = 500 calls to grep, a vast improvement.
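If you would rather measure than guess, the limits can be inspected directly (a sketch, assuming getconf and GNU xargs):
# Maximum combined size of arguments plus environment, in bytes.
getconf ARG_MAX
# GNU xargs reports the command-line limits it will actually use.
xargs --show-limits --no-run-if-empty < /dev/null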
Depending on the size of the files, it might be faster to read each file in the Perl process to check for the string to match. However, grep is very well optimized, and the code to do so, while not terribly complicated, is still more than you can comfortably write in a one-liner. This should be a good balance between speed and simplicity.

Fastest way to grep through thousands of gz files?

I have thousands of .gz files all in one directory. I need to grep through them for the string Mouse::Handler. Is the following the fastest (and most accurate) way to do this?
find . -name "*.gz" -exec zgrep -H 'Mouse::Handler' {} \;
Ideally I would also like to print out the line that I find this string on.
I'm running on a RHEL linux box.
You can search in parallel using
find . -name "*.gz" | xargs -n 1 -P NUM zgrep -H 'Mouse::Handler'
where NUM is around the number of cores you have.
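For example, with nproc from GNU coreutils supplying NUM (a sketch; -print0/-0 is added in case of odd file names):
# One zgrep per .gz file, with as many parallel workers as there are cores.
find . -name "*.gz" -print0 | xargs -0 -n 1 -P "$(nproc)" zgrep -H 'Mouse::Handler'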

How to use grep efficiently?

I have a large number of small files to be searched. I have been looking for a good de-facto multi-threaded version of grep but could not find anything. How can I improve my usage of grep? As of now I am doing this:
grep -R "string" >> Strings
If you have xargs installed and a multi-core processor, you can benefit from the following, in case anyone is interested.
Environment:
Processor: Dual Quad-core 2.4GHz
Memory: 32 GB
Number of files: 584450
Total Size: ~ 35 GB
Tests:
1. Find the necessary files, pipe them to xargs and tell it to execute 8 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P8 grep -H "string" >> Strings_find8
real 3m24.358s
user 1m27.654s
sys 9m40.316s
2. Find the necessary files, pipe them to xargs and tell it to execute 4 instances.
time find ./ -name "*.ext" -print0 | xargs -0 -n1 -P4 grep -H "string" >> Strings
real 16m3.051s
user 0m56.012s
sys 8m42.540s
3. Suggested by @Stephen: Find the necessary files and use + instead of xargs.
time find ./ -name "*.ext" -exec grep -H "string" {} \+ >> Strings
real 53m45.438s
user 0m5.829s
sys 0m40.778s
4. Regular recursive grep.
grep -R "string" >> Strings
real 235m12.823s
user 38m57.763s
sys 38m8.301s
For my purposes, the first command worked just fine.
Wondering why -n1 is used below; won't it be faster to use a higher value (say -n8, or leave it out so xargs will do the right thing)?
xargs -0 -n1 -P8 grep -H "string"
It seems it would be more efficient to give each forked grep more than one file to process (I assume -n1 will put only one file name in argv for each grep). As I see it, we should be able to use the highest n the system allows (based on the argc/argv maximum length limitation), so that the setup cost of starting a new grep process is incurred less often.
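A sketch of that suggestion, reusing the benchmark's layout (the -n and -P values here are illustrative, not measured):
# Hand each grep a batch of 64 files, with 8 greps running at a time.
find ./ -name "*.ext" -print0 | xargs -0 -n 64 -P 8 grep -H "string" >> Strings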
