How to parallelize my bash script for use with `find` without facing race conditions? - linux

I am trying to execute a command like this:
find ./ -name "*.gz" -print -exec ./extract.sh {} \;
The gz files themselves are small. Currently my extract.sh contains the following:
# Start delimiter
echo "#####" $1 >> Info
zcat $1 > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info
Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?
I have 80K gz files on a machine with massive horse power of 32 cores.

Assume (just for simplicity and clarity) that all your files start with a-z.
So you could use 26 cores in parallel by launching a find sequence like the above for each letter. Each "find" needs to generate its own aggregate file:
find ./ -name "a*.gz" -print -exec ./extract.sh a {} \; &
find ./ -name "b*.gz" -print -exec ./extract.sh b {} \; &
..
find ./ -name "z*.gz" -print -exec ./extract.sh z {} \;
(extract.sh needs to take the first parameter to pick a separate "Info" destination file)
When you want one big aggregate file, just join all the per-letter aggregates.
However, I am not convinced this approach gains performance. In the end, all file content is serialized anyway.
Probably the hard disk head movement, not the unzip (CPU) work, will be the limit.
But let's try
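For illustration, here is a minimal sketch of such a per-letter extract.sh, assuming the letter is passed as the first argument and the .gz path as the second (the grep patterns are the placeholders from the question, and the Info.$1 naming is just one possible scheme):
#!/bin/sh
# $1 = letter bucket, $2 = .gz file
out="Info.$1"                      # one aggregate per letter, so no sharing
tmp=$(mktemp)                      # per-process scratch file (avoids the shared "temp" name)
echo "#####" "$2" >> "$out"
zcat "$2" > "$tmp"
grep -o -P "..." "$tmp" >> "$out"  # your series of greps
grep -o -P "..." "$tmp" >> "$out"
rm "$tmp"
echo "####" >> "$out"
A final cat Info.? > Info then joins the 26 per-letter aggregates.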

A quick check through the findutils source reveals that find starts a child process for each -exec, but it waits for that child to finish before moving on, so -exec ./extract.sh {} \; runs your extractions one at a time rather than in parallel. (Through the magic of virtual memory, repeated invocations of the same executable will still mostly share the same memory pages.)
The problem you will run into once you do run several instances at once (for example via xargs, as in the other answers) is data mixing in the Info file. Each child appends to the same file with separate commands, so their output can interleave like spaghetti.
To solve this, have each script write to its own temporary file (created with mktemp or tempfile), then cat that temp file into the Info file in one go, and delete the temp file afterwards. The block for each individual input file then stays together, although the order of the blocks in Info is not guaranteed.
If the temp files are in RAM (see tmpfs), you will avoid being IO bound except when writing to your final file and when running the find search.
Tmpfs is a special file system that uses your ram as "disk space". It will take up to the amount of ram you allow, not use more than it needs from that amount, and swap to disk as needed if it does fill up.
To use:
Create a mount point ( I like /mnt/ramdisk or /media/ramdisk )
Edit /etc/fstab as root
Add tmpfs /mnt/ramdisk tmpfs size=1G 0 0
Run mount /mnt/ramdisk as root to mount your new ramdisk. It will also be mounted at boot.
See the wikipedia entry on fstab for all the options available.
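For the extract side, here is a minimal sketch of such a per-file script, assuming the ramdisk above is mounted at /mnt/ramdisk and reusing the placeholder grep patterns from the question (GNU mktemp's -p picks the directory):
#!/bin/sh
tmp=$(mktemp -p /mnt/ramdisk info.XXXXXX)     # per-file report block
data=$(mktemp -p /mnt/ramdisk data.XXXXXX)    # decompressed payload
echo "#####" "$1" >> "$tmp"
zcat "$1" > "$data"
grep -o -P "..." "$data" >> "$tmp"            # your series of greps
grep -o -P "..." "$data" >> "$tmp"
echo "####" >> "$tmp"
cat "$tmp" >> Info                            # one append per input file keeps each block together
rm "$data" "$tmp"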

You can use xargs to run the extraction in parallel. --max-procs limits the number of processes run at a time (the default is 1):
find ./ -name "*.gz" -print | xargs --max-args 1 --max-procs 32 ./extract.sh
In the ./extract.sh you can use mktemp to write data from each .gz to a temporary file, all of which may be later combined:
# Start delimiter
tmp=$(mktemp -t Info.XXXXXX)
src=$1
echo "#####" "$1" >> "$tmp"
zcat "$1" > "$tmp.unzip"
src="$tmp.unzip"
# Series of greps to extract some useful information
grep -o -P "..." "$src" >> "$tmp"
grep -o -P "..." "$src" >> "$tmp"
rm "$src"
echo "####" >> "$tmp"
If you have massive horse power you can use zgrep directly, without unzipping first. But it may be faster to zcat first if you have many greps later.
Anyway, later combine everything into a single file:
cat /tmp/Info.* > Info
rm /tmp/Info.*
If you care about the order of the .gz files, number them with nl and pass two arguments to ./extract.sh:
find files/ -name "*.gz" | nl -n rz | sed -e 's/\t/\n/' | xargs --max-args 2 ...
And in ./extract.sh:
tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
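Because nl -n rz produces zero-padded numbers, the temp file names sort in the original order, so the combine step above still works and now also preserves that order (assuming the temp files land in /tmp):
cat /tmp/Info.* > Info
rm /tmp/Info.*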

I would create a temporary directory, then create an output file for each grep (based on the name of the file it processed). On many systems /tmp is a RAM-backed tmpfs, so the temporary files will not thrash your hard drive with lots of writes.
You can then either cat it all together at the end, or get each grep to signal another process when it has finished, and that process can begin catting files immediately (and removing them when done).
Example:
working_dir="`pwd`"
temp_dir="`mktemp -d`"
cd "$temp_dir"
find "$working_dir" -name "*.gz" | xargs -P 32 -n 1 extract.sh
cat *.output > "$working_dir/Info"
rm -rf "$temp_dir"
extract.sh
filename=$(basename "$1")
output="$filename.output"
extracted="$filename.extracted"
zcat "$1" > "$extracted"
echo "#####" "$filename" > "$output"
# Series of greps to extract some useful information
grep -o -P "..." "$extracted" >> "$output"
grep -o -P "..." "$extracted" >> "$output"
rm "$extracted"
echo "####" >> "$output"

The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once, then print a summary in the order you want. As an added benefit, each file's report can then be written as a single block, although that does not completely rule out interleaved output. Still, here's my attempt.
#!/bin/sh
for f; do
    zcat "$f" |
    perl -ne '
        /(pattern1)/ && push @pat1, $1;
        /(pattern2)/ && push @pat2, $1;
        # ...
        END { print "##### '"$f"'\n";
              print join ("\n", @pat1), "\n";
              print join ("\n", @pat2), "\n";
              # ...
              print "#### '"$f"'\n"; }'
done
Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.
The script accepts multiple .gz files as input, so you can use find -exec extract.sh {} \+ or xargs to launch a number of parallel processes. With xargs you can try to find a balance between sequential jobs and parallel jobs by feeding each new process, say, 100 to 500 files in one batch. You save on the number of new processes, but lose in parallelization. Some experimentation should reveal what the balance should be, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
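For example, a hedged sketch of such a batched invocation (the batch size of 200 is just a starting point for that experimentation; -print0/-0 guards against odd file names):
find ./ -name "*.gz" -print0 | xargs -0 --max-args 200 --max-procs 32 ./extract.sh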
Granted, if your input files are small enough, the multiple grep invocations will run from the disk cache, and may turn out to be faster than the overhead of starting up Perl.

Related

List file using ls to find meet the condition

I am writing a batch program to delete all files in a directory whose filenames meet a condition.
The directory holds a large number of text files (hundreds of thousands) with filenames fixed as "abc" + date:
abc_20180820.txt
abc_20180821.txt
abc_20180822.txt
abc_20180823.txt
abc_20180824.txt
The program greps all the files, compares each filename's date to a fixed date, and deletes the file if the filename's date < fixed date.
But the problem is that it takes very long to handle that many files (~1 hour to delete 300k files).
My question: is there a way to apply the date comparison while running the ls command? That is, not collect all files into a list and then compare and delete, but list only the files that already meet the condition and delete those. I think that would perform better.
My code is
TARGET_DATE = "5-12"
DEL_DATE = "20180823"
ls -t | grep "[0-9]\{8\}".txt\$ > ${LIST}
for EACH_FILE in `cat ${LIST}` ;
do
DATE=`echo ${EACH_FILE} | cut -c${TARGET_DATE }`
COMPARE=`expr "${DATE}" \< "${DEL_DATE}"`
if [ $COMPARE -eq 1 ] ;
then
rm -f ${EACH_FILE}
fi
done
I found a similar problem but I don't know how to get it done:
List file using ls with a condition and process/grep files that only whitespaces
Here is a refactoring which gets rid of the pesky ls. Looping over a large directory is still going to be somewhat slow.
# Use lowercase for private variables
# to avoid clobbering a reserved system variable
# You can't have spaces around the equals sign
del_date="20180823"
# No need for ls here
# No need for a temporary file
for filename in *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt
do
# Avoid external process; use the shell's parameter substitution
date=${filename%.txt}
# This could fail if the file name contains literal shell metacharacters!
date=${date#${date%????????}}
# Avoid expr
if [ "$date" -lt "$del_date" ]; then
# Just print the file name, null-terminated for xargs
printf '%s\0' "$filename"
fi
done |
# For efficiency, do batch delete
xargs -r0 rm
The wildcard expansion will still take a fair amount of time because the shell will sort the list of filenames. A better solution is probably to refactor this into a find command which avoids the sorting.
find . -maxdepth 1 -type f \( \
-name '*1[89][0-9][0-9][0-9][0-9][0-9][0-9].txt' \
-o -name '*201[0-7][0-9][0-9][0-9][0-9].txt' \
-o -name '*20180[1-7][0-9][0-9].txt' \
-o -name '*201808[01][0-9].txt' \
-o -name '*2018082[0-2].txt' \
\) -delete
You could do something like:
rm 201[0-7]*.txt # remove all files from 2010-2017
rm 20180[1-4]*.txt # remove all files from Jan-Apr 2018
# And so on
...
to remove a large number of files. Then your code would run faster.
Yes, it takes a lot of time if you have that many files in one folder.
It is a bad idea to keep so many files in one folder. Even a simple ls or find will hammer the storage, and if you have scripts that iterate over the files, you are definitely hammering it.
So after you wait for one hour to clean it up, take the time to build a better folder structure. It is a good idea to sort files by year/month/day ... possibly hours,
e.g.
somefolder/2018/08/24/...files here
Then you can easily delete, move, or compress a whole month or year.
I found a solution in this thread.
https://unix.stackexchange.com/questions/199554/get-files-with-a-name-containing-a-date-value-less-than-or-equal-to-a-given-inpu
The awk command is so powerful that it only takes me ~1 minute to deal with hundreds of thousands of files (about a tenth of the time of the loop).
ls | awk -v date="$DEL_DATE" '$0 <= date' | xargs rm -vrf
I can even count, copy, or move files with that same command pattern; it's the fastest answer I've ever seen.
COUNT="$(ls | awk -v date="${DEL_DATE}" '$0 <= date' | xargs rm -vrf | wc -l)"
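Note that $0 <= date compares the whole filename as a string against the date, which only behaves as intended if DEL_DATE is given in the same form as the filenames (for example abc_20180823). A hedged sketch that compares just the trailing 8-digit date instead (strictly older than DEL_DATE, as in the original requirement) and assumes GNU xargs for -d:
ls | awk -v date="$DEL_DATE" \
    'match($0, /[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\.txt$/) && substr($0, RSTART, 8) < date' \
    | xargs -d '\n' rm -f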

shell - faster alternative to "find"

I'm writing a shell script which should output the oldest file in a directory.
This directory is on a remote server and has (worst case) between 1000 and 1500 (temporary) files in it. I have no access to the server and no influence on how the files are stored. The server is connected through a stable but not very fast line.
The result of my script is passed to a monitoring system which in turn alerts the staff if there are too many (= unprocessed) files in the directory.
Unfortunately the monitoring system only allows a maximum execution time of 30 seconds for my script before a timeout occurs.
This wasn't a problem when testing with small directories, but with the target directory on the remote mount (approx. 1000 files) it is.
So I'm looking for the fastest way to get things like "the oldest / newest / largest / smallest" file in a directory (not recursive) without using 'find' or sorting the output of 'ls'.
Currently I'm using this statement in my sh script:
old)
# return oldest file (age in seconds)
oldest=`find $2 -maxdepth 1 -type f | xargs ls -tr | head -1`
timestamp=`stat -f %B $oldest`
curdate=`date +%s`
echo `expr $(($curdate-$timestamp))`
;;
and I tried this one:
gfind /livedrive/669/iwt.save -type f -printf "%T@ %P\n" | sort -nr | tail -1 | cut -d' ' -f 2-
which are two of the many variants one can find using Google.
Additional information:
I'm writing this on a FreeBSD box with sh and bash installed. I have full access to the box and can install programs if needed. For reference: gfind is the GNU "find" utility as known from Linux, since FreeBSD ships a different "find" by default.
any help is appreciated
with kind regards,
dura-zell
For the oldest/newest file issue, you can use the -t option of ls, which sorts the output by modification time.
-t Sort by descending time modified (most recently modified first).
If two files have the same modification timestamp, sort their
names in ascending lexicographical order. The -r option reverses
both of these sort orders.
For the size issue, you can use -S to sort file by size.
-S Sort by size (largest file first) before sorting the operands in
lexicographical order.
Notice that for both cases, -r will reverse the order of the output.
-r Reverse the order of the sort.
Those options are available on FreeBSD and Linux; and must be pretty common in most implementations of ls.
Let us know if it's fast enough.
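As a rough sketch of that for the question's "old)" case (keeping in mind the next answer's caveat about parsing ls, and assuming the directory holds only plain files with tame names):
old)
    oldest="$2/$(ls -tr "$2" | head -1)"   # -tr: oldest modification time first
    timestamp=$(stat -f %B "$oldest")      # %B as in the question's script
    echo $(( $(date +%s) - timestamp ))
    ;;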
In general, you shouldn't be parsing the output of ls. In this case, it's just acting as a wrapper around stat anyway, so you may as well just call stat on each file, and use sort to get the oldest.
old) now=$(date +%s)
read name timestamp < <(stat -f "%N %B" "$2"/* | sort -k2,2n)
echo $(( $now - $timestamp ))
The above is concise, but doesn't distinguish between regular files and directories in the glob. If that is necessary, stick with find, but use a different form of -exec to minimize the number of calls to stat:
old ) now=$(date +%s)
read name timestamp < <(find "$2" -maxdepth 1 -type f -exec stat -f "%N %B" '{}' + | sort -k2,2n)
echo $(( $now - $timestamp ))
(Neither approach works if a filename contains a newline, although since you aren't using the filename in your example anyway, you can avoid that problem by dropping %N from the format and just sorting the timestamps numerically. For example:
read timestamp < <(stat -f %B "$2"/* | sort -n)
# or
read timestamp < <(find "$2" -maxdepth 1 -type f -exec stat -f %B '{}' + | sort -n)
)
You could try creating a shell script that resides on the remote host and, when executed, provides the required output. Then from your local machine just use ssh or something like that to run it. That way the script runs locally on the server. Just a thought :-)

listing file in unix and saving the output in a variable(Oldest File fetching for a particular extension)

This might be a very simple thing for a shell scripting programmer but am pretty new to it. I was trying to execute the below command in a shell script and save the output into a variable
inputfile=$(ls -ltr *.{PDF,pdf} | head -1 | awk '{print $9}')
The command works fine when I run it from the terminal but fails when executed through a shell script (sh). Why does the command fail? Does it mean the shell doesn't support it, or am I doing it wrong? Also, how do I know whether a command will work in a given shell or not?
Just to give you a glimpse of my requirement, I was trying to get the oldest file from a particular directory (and I also want to make sure upper-case and lower-case extensions are handled). Is there any other way to do this?
The above command will work correctly only if BOTH *.pdf and *.PDF files are present in the directory you are currently in.
If you would like to execute it in a directory where only one of those patterns matches, consider using e.g.:
inputfiles=$(find . -maxdepth 1 -type f \( -name "*.pdf" -or -name "*.PDF" \) | xargs ls -1tr | head -1 )
NOTE: The above command doesn't work with file names containing newlines, or with a very long list of found files.
Parsing ls is always a bad idea. You need another strategy.
How about making a function that gives you the oldest file among the ones given as arguments? The following works in Bash (adapt to your needs):
get_oldest_file() {
# get oldest file among files given as parameters
# return is in variable get_oldest_file_ret
local oldest f
for f do
[[ -e $f ]] && [[ ! $oldest || $f -ot $oldest ]] && oldest=$f
done
get_oldest_file_ret=$oldest
}
Then just call as:
get_oldest_file *.{PDF,pdf}
echo "oldest file is: $get_oldest_file_ret"
Now, you probably don't want to use brace expansions like this at all. In fact, you very likely want to use the shell options nocaseglob and nullglob:
shopt -s nocaseglob nullglob
get_oldest_file *.pdf
echo "oldest file is: $get_oldest_file_ret"
If you're using a POSIX shell, it's going to be a bit trickier to have the equivalent of nullglob and nocaseglob.
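A rough sketch of what that could look like in a POSIX-ish shell (this only covers the two exact spellings, and the -ot test is a widespread extension rather than strict POSIX):
oldest=
for f in *.pdf *.PDF; do
    [ -e "$f" ] || continue                 # skip a pattern that matched nothing
    if [ -z "$oldest" ] || [ "$f" -ot "$oldest" ]; then
        oldest=$f
    fi
done
echo "oldest file is: $oldest"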
Is perl an option? It's ubiquitous on Unix.
I would suggest:
perl -e 'print ((sort { -M $b <=> -M $a } glob ( "*.{pdf,PDF}" ))[0]);';
Which:
uses glob to fetch all files matching the pattern.
sorts, using -M, which is the file's modification age in days (relative to script start).
fetches the first element ([0]) off the sort.
Prints that.
As @gniourf_gniourf says, parsing ls is a bad idea, as is leaving globs unquoted and generally not accounting for funny characters in file names.
find is your friend:
#!/bin/sh
get_oldest_pdf() {
#
# echo path of oldest *.pdf (case-insensitive) file in current directory
#
find . -maxdepth 1 -mindepth 1 -iname "*.pdf" -printf '%T@ %p\n' \
| sort -n \
| head -1 \
| cut -d' ' -f2-
}
whatever=$(get_oldest_pdf)
Notes:
find has numerous ways of formatting the output, including
things like access time and/or write time. I used '%T@ %p\n',
where %T@ is the last write time in UNIX time format, including the fractional part.
This will never contain a space, so it is safe to use a space as the separator.
The numeric sort orders by that time (oldest first) and head -1 picks the oldest item;
cut removes the time from the output.
I used pipe notation (IMO much easier to read and maintain), with the help of \ line continuations.
the code should run on any POSIX shell,
You could easily adjust the function to parametrize the pattern,
time used (access/write), control the search depth or starting dir.

Script for renaming files with logical

Someone has very kindly helped get me started on a mass rename script for renaming PDF files.
As you can see, I need to add a bit of logic to stop the below from happening - so something like adding a unique number to a duplicate file name?
rename 's/^(.{5}).*(\..*)$/$1$2/' *
rename -n 's/^(.{5}).*(\..*)$/$1$2/' *
Annexes 123114345234525.pdf renamed as Annex.pdf
Annexes 123114432452352.pdf renamed as Annex.pdf
Hope this makes sense?
Thanks
for i in *
do
    x=''                    # counter
    e="${i##*.}"            # extension
    j="${i%.*}"             # name without extension
    j="${j:0:2}"            # truncated new name
    while [ -e "$j$x.$e" ]  # try to find a free name
    do
        ((x++))             # increment counter
    done
    mv "$i" "$j$x.$e"       # rename
done
before
$ ls
he.pdf hejjj.pdf hello.pdf wo.pdf workd.pdf world.pdf
after
$ ls
he.pdf he1.pdf he2.pdf wo.pdf wo1.pdf wo2.pdf
This should check whether there will be any duplicates:
rename -n [...] | grep -o ' renamed as .*' | sort | uniq -d
If you get any output of the form renamed as [...], then you have a collision.
Of course, this won't work in a couple corner cases - If your files contain newlines or the literal string renamed as, for example.
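With the expression from the question filled in, that check would look like this (the ' renamed as ' text matches the -n output format shown above):
rename -n 's/^(.{5}).*(\..*)$/$1$2/' * | grep -o ' renamed as .*' | sort | uniq -d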
As noted in my answer on your previous question:
for f in *.pdf; do
tmp=$(echo "$f" | sed -r 's/^(.{5}).*(\..*)$/$1$2/')
mv -b ./"$f" ./"$tmp"
done
That will make backups of deleted or overwritten files. A better alternative would be this script:
#!/bin/bash
for f in "$@"; do
    tar -rvf /tmp/backup.tar "$f"
    tmp=$(echo "$f" | sed -r 's/^(.{5}).*(\..*)$/$1$2/')
    i=1
    while [ -e "$tmp" ]; do
        tmp=$(echo "$tmp" | sed "s/\./-$i./")
        i=$((i + 1))
    done
    mv -b ./"$f" ./"$tmp"
done
Run the script like this:
find . -exec thescript '{}' \;
The find command gives you lots of options for specifying which files to run on, works recursively, and passes the filenames in to the script. The script backs each file up with tar (uncompressed) and then renames it.
This isn't the best script, since it still needs the manual loop to probe for a free, non-colliding file name.

Linux: compute a single hash for a given folder & contents?

Surely there must be a way to do this easily!
I've tried the Linux command-line apps such as sha1sum and md5sum but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.
I need to generate a single hash for the entire contents of a folder (not just the filenames).
I'd like to do something like
sha1sum /folder/of/stuff > singlehashvalue
Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.
One possible way would be:
sha1sum path/to/folder/* | sha1sum
If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be
find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
And, finally, if you also need to take account of permissions and empty directories:
(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
xargs -0 stat -c '%n %a') \
| sha1sum
The arguments to stat will cause it to print the name of the file, followed by its octal permissions. The two finds run one after the other, causing double the amount of disk IO: the first finds all file names and checksums the contents, the second finds all file and directory names and prints name and mode. The list of "file names and checksums", followed by "names and directories, with permissions", is then checksummed into a single checksum.
Use a file system intrusion detection tool like aide.
hash a tar ball of the directory:
tar cvf - /path/to/folder | sha1sum
Code something yourself, like vatine's oneliner:
find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
If you just want to check if something in the folder changed, I'd recommend this one:
ls -alR --full-time /folder/of/stuff | sha1sum
It will just give you a hash of the ls output, that contains folders, sub-folders, their files, their timestamp, size and permissions. Pretty much everything that you would need to determine if something has changed.
Please note that this command will not generate a hash for each file, but that is why it should be faster than using find.
You can do tar -c /path/to/folder | sha1sum
So far the fastest way to do it is still with tar. And with several additional parameters we can also get rid of the difference caused by metadata.
To use GNU tar to hash the dir, you need to make sure the paths are sorted during tar, otherwise the result is always different.
tar -C <root-dir> -cf - --sort=name <dir> | sha256sum
ignore time
If you do not care about access or modification time, also use something like --mtime='UTC 2019-01-01' to make sure all timestamps are the same.
ignore ownership
Usually we need to add --group=0 --owner=0 --numeric-owner to unify the owner metadata.
ignore some files
use --exclude=PATTERN
Note that some tar implementations do not have --sort; be sure you have GNU tar.
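Putting those options together, a hedged sketch of a metadata-insensitive directory hash (the --exclude pattern is only an illustration):
tar -C <root-dir> -cf - --sort=name --mtime='UTC 2019-01-01' \
    --group=0 --owner=0 --numeric-owner --exclude='*.log' <dir> | sha256sum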
A robust and clean approach
First things first, don't hog the available memory! Hash a file in chunks rather than feeding the entire file.
Different approaches for different needs/purpose (all of the below or pick what ever applies):
Hash only the entry name of all entries in the directory tree
Hash the file contents of all entries (leaving the meta like, inode number, ctime, atime, mtime, size, etc., you get the idea)
For a symbolic link, its content is the referent name. Hash it or choose to skip
Follow or not to follow(resolved name) the symlink while hashing the contents of the entry
If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changes but the rest of the contents remain the same and they are all fairly large files
Handle large files well(again, mind the RAM)
Handle very deep directory trees (mind the open file descriptors)
Handle non standard file names
How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must hash them as well?
Don't update the access time of any entry while traversing, because that side effect is counter-productive (and counter-intuitive) for certain use cases.
This is what I have off the top of my head; anyone who has spent some time working on this practically will have caught other gotchas and corner cases.
Here's a tool, very light on memory, which addresses most cases. It might be a bit rough around the edges but has been quite helpful.
An example usage and output of dtreetrawl.
Usage:
dtreetrawl [OPTION...] "/trawl/me" [path2,...]
Help Options:
-h, --help Show help options
Application Options:
-t, --terse Produce a terse output; parsable.
-j, --json Output as JSON
-d, --delim=: Character or string delimiter/separator for terse output(default ':')
-l, --max-level=N Do not traverse tree beyond N level(s)
--hash Enable hashing(default is MD5).
-c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
-R, --only-root-hash Output only the root hash. Blank line if --hash is not set
-N, --no-name-hash Exclude path name while calculating the root checksum
-F, --no-content-hash Do not hash the contents of the file
-s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
-e, --hash-dirent Include hash of directory entries while calculating root checksum
A snippet of human friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
Base name : CREDITS
Level : 1
Type : regular file
Referent name :
File size : 98443 bytes
I-node number : 290850
No. directory entries : 0
Permission (octal) : 0644
Link count : 1
Ownership : UID=0, GID=0
Preferred I/O block size : 4096 bytes
Blocks allocated : 200
Last status change : Tue, 21 Nov 17 21:28:18 +0530
Last file access : Thu, 28 Dec 17 00:53:27 +0530
Last file modification : Tue, 21 Nov 17 21:28:18 +0530
Hash : 9f0312d130016d103aa5fc9d16a2437e
Stats for /home/lab/linux-4.14-rc8:
Elapsed time : 1.305767 s
Start time : Sun, 07 Jan 18 03:42:39 +0530
Root hash : 434e93111ad6f9335bb4954bc8f4eca4
Hash type : md5
Depth : 8
Total,
size : 66850916 bytes
entries : 12484
directories : 763
regular files : 11715
symlinks : 6
block devices : 0
char devices : 0
sockets : 0
FIFOs/pipes : 0
If this is a git repo and you want to ignore any files in .gitignore, you might want to use this:
git ls-files <your_directory> | xargs sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1
This is working well for me.
If you just want to hash the contents of the files, ignoring the filenames then you can use
cat $FILES | md5sum
Make sure you have the files in the same order when computing the hash:
cat $(echo $FILES | sort) | md5sum
But you can't have directories in your list of files.
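If the list should be built on the fly (and directories skipped), a variant along these lines may help; it still assumes file names without spaces or newlines:
find /folder/of/stuff -type f | sort | xargs cat | md5sum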
Another tool to achieve this:
http://md5deep.sourceforge.net/
As is sounds: like md5sum but also recursive, plus other features.
md5deep -r {directory}
There is a python script for that:
http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/
If you change the name of a file without changing its alphabetical order, the hash script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.
I had to check a whole directory for file changes, while excluding timestamps and directory ownership.
The goal is to get a sum that is identical anywhere as long as the files are identical, including when the directory is hosted on other machines, regardless of anything but the files themselves or a change to them.
md5sum * | md5sum | cut -d' ' -f1
It generates a list of hashes, one per file, then hashes that list into a single value.
This is way faster than the tar method.
For a stronger hash, we can use sha512sum in the same recipe:
sha512sum * | sha512sum | cut -d' ' -f1
The hashes are also identical everywhere using sha512sum, and there is no known practical way to reverse it.
Here's a simple, short variant in Python 3 that works fine for small-sized files (e.g. a source tree or something, where every file individually can fit into RAM easily), ignoring empty directories, based on the ideas from the other solutions:
import os, hashlib
def hash_for_directory(path, hashfunc=hashlib.sha1):
    filenames = sorted(os.path.join(dp, fn) for dp, _, fns in os.walk(path) for fn in fns)
    index = '\n'.join('{}={}'.format(os.path.relpath(fn, path), hashfunc(open(fn, 'rb').read()).hexdigest()) for fn in filenames)
    return hashfunc(index.encode('utf-8')).hexdigest()
It works like this:
Find all files in the directory recursively and sort them by name
Calculate the hash (default: SHA-1) of every file (reads whole file into memory)
Make a textual index with "filename=hash" lines
Encode that index back into a UTF-8 byte string and hash that
You can pass in a different hash function as second parameter if SHA-1 is not your cup of tea.
Adding multiprocessing and a progress bar to kvantour's answer:
Around 30x faster (depending on CPU)
100%|██████████████████████████████████| 31378/31378 [03:03<00:00, 171.43file/s]
# to hash without permissions
find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | tqdm --unit file --total $(find . -type f | wc -l) | sort | awk '{ print $1 }' | sha1sum
# to hash permissions
(find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | sort | awk '{ print $1 }';
find . \( -type f -o -type d \) -print0 | sort -z | xargs -P $(nproc --all) -0 stat -c '%n %a') | \
sort | sha1sum | awk '{ print $1 }'
Make sure tqdm is installed (pip install tqdm) or check its documentation.
awk removes the file path, so a different parent directory or path does not affect the hash.
You can try hashdir which is an open source command line tool written for this purpose.
hashdir /folder/of/stuff
It has several useful flags to allow you to specify the hashing algorithm, print the hashes of all children, as well as save and verify a hash.
hashdir:
A command-line utility to checksum directories and files.
Usage:
hashdir [options] [<item>...] [command]
Arguments:
<item> Directory or file to hash/check
Options:
-t, --tree Print directory tree
-s, --save Save the checksum to a file
-i, --include-hidden-files Include hidden files
-e, --skip-empty-dir Skip empty directories
-a, --algorithm <md5|sha1|sha256|sha384|sha512> The hash function to use [default: sha1]
--version Show version information
-?, -h, --help Show help and usage information
Commands:
check <item> Verify that the specified hash file is valid.
Try to make it in two steps:
create a file with hashes for all files in a folder
hash this file
Like so:
# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum $FILE >> hashes; done
# sha1sum hashes
Or do it all at once:
# cat `find /folder/of/stuff -type f | sort` | sha1sum
I would pipe the results for the individual files through sort (to prevent a mere reordering of the files from changing the hash) into md5sum or sha1sum, whichever you choose.
I've written a Groovy script to do this:
import java.security.MessageDigest
public static String generateDigest(File file, String digest, int paddedLength){
MessageDigest md = MessageDigest.getInstance(digest)
md.reset()
def files = []
def directories = []
if(file.isDirectory()){
file.eachFileRecurse(){sf ->
if(sf.isFile()){
files.add(sf)
}
else{
directories.add(file.toURI().relativize(sf.toURI()).toString())
}
}
}
else if(file.isFile()){
files.add(file)
}
files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
directories.sort()
files.each(){f ->
println file.toURI().relativize(f.toURI()).toString()
f.withInputStream(){is ->
byte[] buffer = new byte[8192]
int read = 0
while((read = is.read(buffer)) > 0){
md.update(buffer, 0, read)
}
}
}
directories.each(){d ->
println d
md.update(d.getBytes())
}
byte[] digestBytes = md.digest()
BigInteger bigInt = new BigInteger(1, digestBytes)
return bigInt.toString(16).padLeft(paddedLength, '0')
}
println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"
You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/
gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/
79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758
Quick summary: how to hash the contents of an entire folder, or compare two folders for equality
# 1. How to get a sha256 hash over all file contents in a folder, including
# hashing over the relative file paths within that folder to check the
# filenames themselves (get this bash function below).
sha256sum_dir "path/to/folder"
# 2. How to quickly compare two folders (get the `diff_dir` bash function below)
diff_dir "path/to/folder1" "path/to/folder2"
# OR:
diff -r -q "path/to/folder1" "path/to/folder2"
The "one liners"
Do this instead of the main answer, to get a single hash for all non-directory file contents within an entire folder, no matter where the folder is located:
This is a "1-line" command. Copy and paste the whole thing to run it all at once:
# This one works, but don't use it, because its hash output does NOT
# match that of my `sha256sum_dir` function. I recommend you use
# the "1-liner" just below, therefore, instead.
time ( \
starting_dir="$(pwd)" \
&& target_dir="path/to/folder" \
&& cd "$target_dir" \
&& find . -not -type d -print0 | sort -zV \
| xargs -0 sha256sum | sha256sum; \
cd "$starting_dir"
)
However, that produces a slightly different hash than the one produced by my sha256sum_dir bash function, which I present below. So, to get the output hash to exactly match the output from my sha256sum_dir function, do this instead:
# Use this one, as its output matches that of my `sha256sum_dir`
# function exactly.
all_hashes_str="$( \
starting_dir="$(pwd)" \
&& target_dir="path/to/folder" \
&& cd "$target_dir" \
&& find . -not -type d -print0 | sort -zV | xargs -0 sha256sum \
)"; \
cd "$starting_dir"; \
printf "%s" "$all_hashes_str" | sha256sum
For more on why the main answer doesn't produce identical hashes for identical folders in different locations, see further below.
[My preferred method] Here are some bash functions I wrote: sha256sum_dir and diff_dir
Place the following functions in your ~/.bashrc file or in your ~/.bash_aliases file, assuming your ~/.bashrc file sources the ~/.bash_aliases file like this:
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
You can find both of the functions below in my personal ~/.bash_aliases file in my eRCaGuy_dotfiles repo.
Here is the sha256sum_dir function, which obtains a total "directory" hash of all files in the directory:
# Take the sha256sum of all files in an entire dir, and then sha256sum that
# entire output to obtain a _single_ sha256sum which represents the _entire_
# dir.
# See:
# 1. [my answer] https://stackoverflow.com/a/72070772/4561887
sha256sum_dir() {
return_code="$RETURN_CODE_SUCCESS"
if [ "$#" -eq 0 ]; then
echo "ERROR: too few arguments."
return_code="$RETURN_CODE_ERROR"
fi
# Print help string if requested
if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
# Help string
echo "Obtain a sha256sum of all files in a directory."
echo "Usage: ${FUNCNAME[0]} [-h|--help] <dir>"
return "$return_code"
fi
starting_dir="$(pwd)"
target_dir="$1"
cd "$target_dir"
# See my answer: https://stackoverflow.com/a/72070772/4561887
filenames="$(find . -not -type d | sort -V)"
IFS=$'\n' read -r -d '' -a filenames_array <<< "$filenames"
time all_hashes_str="$(sha256sum "${filenames_array[@]}")"
cd "$starting_dir"
echo ""
echo "Note: you may now call:"
echo "1. 'printf \"%s\n\" \"\$all_hashes_str\"' to view the individual" \
"hashes of each file in the dir. Or:"
echo "2. 'printf \"%s\" \"\$all_hashes_str\" | sha256sum' to see that" \
"the hash of that output is what we are using as the final hash" \
"for the entire dir."
echo ""
printf "%s" "$all_hashes_str" | sha256sum | awk '{ print $1 }'
return "$?"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_sha256sum_dir="sha256sum_dir"
Assuming you just want to compare two directories for equality, you can use diff -r -q "dir1" "dir2" instead, which I wrapped in this diff_dir command. I learned about the diff command to compare entire folders here: how do I check that two folders are the same in linux.
# Compare dir1 against dir2 to see if they are equal or if they differ.
# See:
# 1. How to `diff` two dirs: https://stackoverflow.com/a/16404554/4561887
diff_dir() {
return_code="$RETURN_CODE_SUCCESS"
if [ "$#" -eq 0 ]; then
echo "ERROR: too few arguments."
return_code="$RETURN_CODE_ERROR"
fi
# Print help string if requested
if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
echo "Compare (diff) two directories to see if dir1 contains the same" \
"content as dir2."
echo "NB: the output will be **empty** if both directories match!"
echo "Usage: ${FUNCNAME[0]} [-h|--help] <dir1> <dir2>"
return "$return_code"
fi
dir1="$1"
dir2="$2"
time diff -r -q "$dir1" "$dir2"
return_code="$?"
if [ "$return_code" -eq 0 ]; then
echo -e "\nDirectories match!"
fi
# echo "$return_code"
return "$return_code"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_diff_dir="diff_dir"
Here is the output of my sha256sum_dir command on my ~/temp2 dir (which dir I describe just below so you can reproduce it and test this yourself). You can see the total folder hash is b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3 in this case:
$ gs_sha256sum_dir ~/temp2
real 0m0.007s
user 0m0.000s
sys 0m0.007s
Note: you may now call:
1. 'printf "%s\n" "$all_hashes_str"' to view the individual hashes of each
file in the dir. Or:
2. 'printf "%s" "$all_hashes_str" | sha256sum' to see that the hash of that
output is what we are using as the final hash for the entire dir.
b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3
Here is the cmd and output of diff_dir to compare two dirs for equality. This is checking that copying an entire directory to my SD card just now worked correctly. I made the output indicate Directories match! whenever that is the case!:
$ gs_diff_dir "path/to/sd/card/tempdir" "/home/gabriel/tempdir"
real 0m0.113s
user 0m0.037s
sys 0m0.077s
Directories match!
Why the main answer doesn't produce identical hashes for identical folders in different locations
I tried the most-upvoted answer here, and it doesn't work quite right as-is. It needs a little tweaking. It doesn't work quite right because the hash changes based on the folder-of-interest's base path! That means that an identical copy of some folder will have a different hash than the folder it was copied from even if the two folders are perfect matches and contain exactly the same content! That kind of defeats the purpose of taking a hash of the folder if the hashes of two identical folders differ! Let me explain:
Assume I have a folder named temp2 at ~/temp2. It contains file1.txt, file2.txt, and file3.txt. file1.txt contains the letter a followed by a return, file2.txt contains a letter b followed by a return, and file3.txt contains a letter c followed by a return.
If I run find /home/gabriel/temp2, I get:
$ find /home/gabriel/temp2
/home/gabriel/temp2
/home/gabriel/temp2/file3.txt
/home/gabriel/temp2/file1.txt
/home/gabriel/temp2/file2.txt
If I forward that to sha256sum (in place of sha1sum) in the same pattern as the main answer states, I get this. Notice it has the full path after each hash, which is not what we want:
$ find /home/gabriel/temp2 -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7 /home/gabriel/temp2/file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f /home/gabriel/temp2/file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478 /home/gabriel/temp2/file3.txt
If you then pipe that output string above to sha256sum again, it hashes the file hashes with their full file paths, which is not what we want! The file hashes may match in a folder and in a copy of that folder exactly, but the absolute paths do NOT match exactly, so they will produce different final hashes since we are hashing over the full file paths as part of our single, final hash!
Instead, what we want is the relative file path next to each hash. To do that, you must first cd into the folder of interest, and then run the hash command over all files therein, like this:
cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
Now, I get this. Notice the file paths are all relative now, which is what I want!:
$ cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7 ./file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f ./file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478 ./file3.txt
Good. Now, if I hash that entire output string, since the file paths are all relative in it, the final hash will match exactly for a folder and its copy! In this way, we are hashing over the file contents and the file names within the directory of interest, to get a different hash for a given folder if either the file contents are different or the filenames are different, or both.
You could sha1sum to generate the list of hash values and then sha1sum that list again, it depends on what exactly it is you want to accomplish.
How to hash all files in an entire directory, including the filenames as well as their contents
Assuming you are trying to compare a folder and all its contents to ensure it was copied correctly from one computer to another, for instance, you can do it as follows. Let's assume the folder is named mydir and is at path /home/gabriel/mydir on computer 1, and at /home/gabriel/dev/repos/mydir on computer 2.
# 1. First, cd to the dir in which the dir of interest is found. This is
# important! If you don't do this, then the paths output by find will differ
# between the two computers since the absolute paths to `mydir` differ. We are
# going to hash the paths too, not just the file contents, so this matters.
cd /home/gabriel # on computer 1
cd /home/gabriel/dev/repos # on computer 2
# 2. hash all files inside `mydir`, then hash the list of all hashes and their
# respective file paths. This obtains one single final hash. Sorting is
# necessary by piping to `sort` to ensure we get a consistent file order in
# order to ensure a consistent final hash result.
find mydir -type f -exec sha256sum {} + | sort | sha256sum
# Optionally pipe that output to awk to filter in on just the hash (first field
# in the output)
find mydir -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
That's it!
To see the intermediary list of file hashes, for learning's sake, just run this:
find mydir -type f -exec sha256sum {} + | sort
Note that the above commands ignore empty directories, file permissions, timestamps of when files were last edited, etc. For most cases though that's ok.
Example
Here is a real run and actual output. I wanted to ensure my eclipse-workspace folder was properly copied from one computer to another. As you can see, the time command tells me it took 11.790 seconds:
$ time find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4 -
real 0m11.790s
user 0m11.372s
sys 0m0.432s
The hash I care about is: 8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
If piping to awk and excluding time, I get:
$ find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
Be sure you check find for errors in the printed stderr output, as a hash will be produced even in the event find fails.
Hashing my whole eclipse-workspace dir in only 12 seconds is impressive considering it contains 6480 files, as shown by this:
find eclipse-workspace -type f | wc -l
...and is 3.6 GB in size, as shown by this:
du -sh eclipse-workspace
See also
My other answer here, where I use the above info.: how do I check that two folders are the same in linux
Other credit: I had a chat with ChatGPT to learn some of the pieces above. All work and text above, however, was written by me, tested by me, and verified by me.
