Compare 2 Folders and Find Files with Differing Byte Counts

Compare 2 Folders and Find Files with Differing Byte Counts - linux

Using Gnome in Linux Mint 12, I copied a Folder of about 9.7 GB (containing a complex tree of subfolders) from one NTFS Flash Drive to another NTFS Flash Drive. According to Gnome the file counts match, but according to du (and other programs) the byte counts don't match. (I've had the same problem copying folders in other Linux distros and Windows XP.)
I only want to know which files don't have matching byte counts. (I don't want to compare the contents of each file, because that would take way too long.) What's the best, easiest and fastest way to find the byte-count-mismatched files?

I would adapt the answer by #user1464130 as it has trouble handling spaces in file names.
cd dir1
find . -type f -printf "%p %s\n" | sort > ~/dir1.txt
cd dir2
find . -type f -printf "%p %s\n" | sort > ~/dir2.txt
diff ~/dir1.txt ~/dir2.txt
If you want to launch a command on each file and use the result in the report, you can use the while Bash construct. This example uses md5sum to compute a checksum for each file.
find . -maxdepth 1 -type f -printf "%p %s\n" | while read path size; do echo "$path - $(md5sum $path | tr -s " " | cut -f 1 -d " ") - $size" ; done
Each $() is executed separately and allows us to compute the checksum for each file. The use of tr squeezes every consecutive spaces into a single space and cut extracts the word in the n-th position, here in the first position. If we don't do that, we get the name of the file two times because md5sum give it back on stdout.
Here is an example without using the comparison (no diff). Note that I've used a dash - to emphasize the three datas we output about each file but it could be a problem if you want to feed it to another program.
$ find . -maxdepth 1 -name "*.c" -type f -printf "%p %s\n" | while read path size; do echo "$path - $(md5sum $path | tr -s " " | cut -f 1 -d " ") - $size" ; done
./thread.c - 5f2b7b12c7cd12fcb9e9796078e5d15b - 584
./utils.c - d61bc1dbc72768e622a04f03e3b8f7a2 - 3413
EDIT : And to handle spaces in filenames and still get the checksum and the size, you can use the following code.
$ find . -maxdepth 1 -name "*.c" -type f -print0 | xargs -0 -n 1 md5sum | while read checksum path; do echo $path $(stat --printf="%s" "$path") $checksum ; done
./ini tia li za tion.c 84 31626123e9056bac2e96b472bd62f309

Did you check if both partitions have the same attributes? (block size, size, reserved space for deletions or bad blocks, etc.)
For your specific case, I would recommend rsync with option -n (or --dry-run). It will tell you which files are different. That is:
$ rsync -I -n /source/ /target/
The option -I is to ignore times. You can use the same command to make both directories equivalent (timestamp, permissions, etc.).
Check the manual of rsync or try the option --help to get more options and examples on how to use it. It is very powerful.

Assuming you need to compare dir1 and dir 2, here are the console commands:
cd dir1
find . -type f|sort|xargs ls -l| awk '{print $5,$8}' > ~/dir1.txt
cd dir2
find . -type f|sort|xargs ls -l| awk '{print $5,$8}' > ~/dir2.txt
diff ~/dir1.txt ~/dir2.txt
You may need to edit awk parameters to make it print file length and path properly.

Related

LINUX Copy the name of the newest folder and paste it in a command [duplicate]

I would like to find the newest sub directory in a directory and save the result to variable in bash.
Something like this:
ls -t /backups | head -1 > $BACKUPDIR
Can anyone help?

BACKUPDIR=$(ls -td /backups/*/ | head -1)
$(...) evaluates the statement in a subshell and returns the output.

There is a simple solution to this using only ls:
BACKUPDIR=$(ls -td /backups/*/ | head -1)
-t orders by time (latest first)
-d only lists items from this folder
*/ only lists directories
head -1 returns the first item
I didn't know about */ until I found Listing only directories using ls in bash: An examination.

This ia a pure Bash solution:
topdir=/backups
BACKUPDIR=
# Handle subdirectories beginning with '.', and empty $topdir
shopt -s dotglob nullglob
for file in "$topdir"/* ; do
[[ -L $file || ! -d $file ]] && continue
[[ -z $BACKUPDIR || $file -nt $BACKUPDIR ]] && BACKUPDIR=$file
done
printf 'BACKUPDIR=%q\n' "$BACKUPDIR"
It skips symlinks, including symlinks to directories, which may or may not be the right thing to do. It skips other non-directories. It handles directories whose names contain any characters, including newlines and leading dots.

Well, I think this solution is the most efficient:
path="/my/dir/structure/*"
backupdir=$(find $path -type d -prune | tail -n 1)
Explanation why this is a little better:
We do not need sub-shells (aside from the one for getting the result into the bash variable).
We do not need a useless -exec ls -d at the end of the find command, it already prints the directory listing.
We can easily alter this, e.g. to exclude certain patterns. For example, if you want the second newest directory, because backup files are first written to a tmp dir in the same path:
backupdir=$(find $path -type -d -prune -not -name "*temp_dir" | tail -n 1)

The above solution doesn't take into account things like files being written and removed from the directory resulting in the upper directory being returned instead of the newest subdirectory.
The other issue is that this solution assumes that the directory only contains other directories and not files being written.
Let's say I create a file called "test.txt" and then run this command again:
echo "test" > test.txt
ls -t /backups | head -1
test.txt
The result is test.txt showing up instead of the last modified directory.
The proposed solution "works" but only in the best case scenario.
Assuming you have a maximum of 1 directory depth, a better solution is to use:
find /backups/* -type d -prune -exec ls -d {} \; |tail -1
Just swap the "/backups/" portion for your actual path.
If you want to avoid showing an absolute path in a bash script, you could always use something like this:
LOCALPATH=/backups
DIRECTORY=$(cd $LOCALPATH; find * -type d -prune -exec ls -d {} \; |tail -1)

With GNU find you can get list of directories with modification timestamps, sort that list and output the newest:
find . -mindepth 1 -maxdepth 1 -type d -printf "%T#\t%p\0" | sort -z -n | cut -z -f2- | tail -z -n1
or newline separated
find . -mindepth 1 -maxdepth 1 -type d -printf "%T#\t%p\n" | sort -n | cut -f2- | tail -n1
With POSIX find (that does not have -printf) you may, if you have it, run stat to get file modification timestamp:
find . -mindepth 1 -maxdepth 1 -type d -exec stat -c '%Y %n' {} \; | sort -n | cut -d' ' -f2- | tail -n1
Without stat a pure shell solution may be used by replacing [[ bash extension with [ as in this answer.

Your "something like this" was almost a hit:
BACKUPDIR=$(ls -t ./backups | head -1)
Combining what you wrote with what I have learned solved my problem too. Thank you for rising this question.
Note: I run the line above from GitBash within Windows environment in file called ./something.bash.

How do I classify files in Linux server by their names?

How can use the ls command and options to list the repetitious filenames that are in different directories?

You can't use a single, basic ls command to do this. You'd have to use a combination of other POSIX/Unix/GNU utilities. For example, to find the duplicate filenames first:
find . -type f -exec basename "\{}" \; | sort | uniq -d > dupes
This means find all the files (-type f) through the entire directory hierarchy in the current directory (.), and execute (-exec) the command basename (which strips the directory portion) on the found file (\{}), end of command (\;). These files then sort and print out duplicate lines (uniq -d). The result goes in the file dupes. Now you have the filenames that are duplicated, but you don't know what directory they are in. Use find again to find them. Using bash as your shell:
while read filename; do find . -name "$filename" -print; done < dupes
This means loop through (while) all contents of file dupes and read into the variable filename each line. For each line, execute find again and search for the specific -name of the $filename and print it out (-print, but it's implicit so this is redundant).
Truth be told you can combine these without using an intermediate file:
find . -type f -exec basename "\{}" \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
If you're not familiar with it, the | operator means, execute the following command using the output of the previous command as the input to the following command. Example:
eje#EEWANCO-PC:~$ mkdir test
eje#EEWANCO-PC:~$ cd test
eje#EEWANCO-PC:~/test$ mkdir 1 2 3 4 5
eje#EEWANCO-PC:~/test$ mkdir 1/2 2/3
eje#EEWANCO-PC:~/test$ touch 1/0000 2/1111 3/2222 4/2222 5/0000 1/2/1111 2/3/4444
eje#EEWANCO-PC:~/test$ find . -type f -exec basename "\{}" \; | sort | uniq -d | while read filename; do find . -name "$filename" -print; done
./1/0000
./5/0000
./1/2/1111
./2/1111
./3/2222
./4/2222
Disclaimer: The requirement stated that the filenames were all numbers. While I have tried to design the code to handle filenames with spaces (and in tests on my system, it works), the code may break when it encounters special characters, newlines, nuls, or other unusual situations. Please note that the -exec parameter has special security considerations and should not be used by root over arbitrary user files. The simplified example provided is intended for illustrative and didactic purposes only. Please consult your man pages and relevant CERT advisories for full security implications.

I have a function in my bash profile (bash 4.4) for duplicate files.
It is true that find is the correct tool.
I use find combined with -print0 options which separates the find results with null char instead of new lines (default find action). Now i can catch all files under current directory and subdirectories.
This will ensure that results will be correct no matter if filenames contain special chars like spaces or new lines (in some very rare cases). Instead of double running find against find, you can built an array and just locate the duplicate files in this array. Then you grep the whole array using the "duplicates" as pattern.
So something like this works ok for my function:
$ IFS= readarray -t -d '' fn< <(find . -name 'file*' -print0)
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[#]##*/}") |uniq -d)
$ grep -e "$dupes" <(printf '%s\n' "${fn[#]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
This is a test:
$ IFS= readarray -t -d '' fn< <(find . -name 'file*' -print0)
# find all files and load them in an array using null delimiter
$ printf '%s\n' "${fn[#]}" #print the array
./tmp/file7
./tmp/file14
./tmp/file11
./tmp/file8
./tmp/file9
./tmp/tmp2/file09 99
./tmp/tmp2/file14.txt
./tmp/tmp2/file15.txt
./tmp/tmp2/file$100
./tmp/tmp2/file14.txt.bak
./tmp/tmp2/file15.txt.bak
./tmp/file1
./tmp/file4
./file09 99
./file14
./file$100
./file1
$ dupes=$(LC_ALL=C sort <(printf '\<%s\>$\n' "${fn[#]##*/}") |uniq -d)
#Locate duplicate files
$ echo "$dupes"
\<file$100\>$ #Mind this one with special char $ in filename
\<file09 99\>$ #Mind also this one with spaces
\<file14\>$
\<file1\>$
#I have on purpose enclose the results between \<...\> to force grep later to capture full words and avoid file1 to match file1.txt or file11
$ grep -e "$dupes" <(printf '%s\n' "${fn[#]}") |awk -F/ '{print $NF,"==>",$0}' |LC_ALL=C sort
file$100 ==> ./file$100 #File with special char correctly captured
file$100 ==> ./tmp/tmp2/file$100
file09 99 ==> ./file09 99 #File with spaces in name also correctly captured
file09 99 ==> ./tmp/tmp2/file09 99
file1 ==> ./file1
file1 ==> ./tmp/file1
file14 ==> ./file14 #other files named file14 like file14.txt and file14.txt.bak not captured since they are not duplicates.
file14 ==> ./tmp/file14
Tips:
This one <(printf '\<%s\>$\n' "${fn[#]##*/}") uses process substitution on the basename of the find results using bash built in parameter expansion techniques.
LC_ALL=C is required on sorting in order filenames to be sorted correctly.
In bash versions before 4.4 , the readarray does not accept -d option (delimiter). In this case you can transform find results to an array with
while IFS= read -r -d '' res;do fn+=( "$res" );done < <(find.... -print0)

How to count number of files in each directory?

I am able to list all the directories by
find ./ -type d
I attempted to list the contents of each directory and count the number of files in each directory by using the following command
find ./ -type d | xargs ls -l | wc -l
But this summed the total number of lines returned by
find ./ -type d | xargs ls -l
Is there a way I can count the number of files in each directory?

This prints the file count per directory for the current directory level:
du -a | cut -d/ -f2 | sort | uniq -c | sort -nr

Assuming you have GNU find, let it find the directories and let bash do the rest:
find . -type d -print0 | while read -d '' -r dir; do
files=("$dir"/*)
printf "%5d files in directory %s\n" "${#files[#]}" "$dir"
done

find . -type f | cut -d/ -f2 | sort | uniq -c
find . -type f to find all items of the type file, in current folder and subfolders
cut -d/ -f2 to cut out their specific folder
sort to sort the list of foldernames
uniq -c to return the number of times each foldername has been counted

You could arrange to find all the files, remove the file names, leaving you a line containing just the directory name for each file, and then count the number of times each directory appears:
find . -type f |
sed 's%/[^/]*$%%' |
sort |
uniq -c
The only gotcha in this is if you have any file names or directory names containing a newline character, which is fairly unlikely. If you really have to worry about newlines in file names or directory names, I suggest you find them, and fix them so they don't contain newlines (and quietly persuade the guilty party of the error of their ways).
If you're interested in the count of the files in each sub-directory of the current directory, counting any files in any sub-directories along with the files in the immediate sub-directory, then I'd adapt the sed command to print only the top-level directory:
find . -type f |
sed -e 's%^\(\./[^/]*/\).*$%\1%' -e 's%^\.\/[^/]*$%./%' |
sort |
uniq -c
The first pattern captures the start of the name, the dot, the slash, the name up to the next slash and the slash, and replaces the line with just the first part, so:
./dir1/dir2/file1
is replaced by
./dir1/
The second replace captures the files directly in the current directory; they don't have a slash at the end, and those are replace by ./. The sort and count then works on just the number of names.

Here's one way to do it, but probably not the most efficient.
find -type d -print0 | xargs -0 -n1 bash -c 'echo -n "$1:"; ls -1 "$1" | wc -l' --
Gives output like this, with directory name followed by count of entries in that directory. Note that the output count will also include directory entries which may not be what you want.
./c/fa/l:0
./a:4
./a/c:0
./a/a:1
./a/a/b:0

Slightly modified version of Sebastian's answer using find instead of du (to exclude file-size-related overhead that du has to perform and that is never used):
find ./ -mindepth 2 -type f | cut -d/ -f2 | sort | uniq -c | sort -nr
-mindepth 2 parameter is used to exclude files in current directory. If you remove it, you'll see a bunch of lines like the following:
234 dir1
123 dir2
1 file1
1 file2
1 file3
...
1 fileN
(much like the du-based variant does)
If you do need to count the files in current directory as well, use this enhanced version:
{ find ./ -mindepth 2 -type f | cut -d/ -f2 | sort && find ./ -maxdepth 1 -type f | cut -d/ -f1; } | uniq -c | sort -nr
The output will be like the following:
234 dir1
123 dir2
42 .

Everyone else's solution has one drawback or another.
find -type d -readable -exec sh -c 'printf "%s " "$1"; ls -1UA "$1" | wc -l' sh {} ';'
Explanation:
-type d: we're interested in directories.
-readable: We only want them if it's possible to list the files in them. Note that find will still emit an error when it tries to search for more directories in them, but this prevents calling -exec for them.
-exec sh -c BLAH sh {} ';': for each directory, run this script fragment, with $0 set to sh and $1 set to the filename.
printf "%s " "$1": portably and minimally print the directory name, followed by only a space, not a newline.
ls -1UA: list the files, one per line, in directory order (to avoid stalling the pipe), excluding only the special directories . and ..
wc -l: count the lines

This can also be done with looping over ls instead of find
for f in */; do echo "$f -> $(ls $f | wc -l)"; done
Explanation:
for f in */; - loop over all directories
do echo "$f -> - print out each directory name
$(ls $f | wc -l) - call ls for this directory and count lines

This should return the directory name followed by the number of files in the directory.
findfiles() {
echo "$1" $(find "$1" -maxdepth 1 -type f | wc -l)
}
export -f findfiles
find ./ -type d -exec bash -c 'findfiles "$0"' {} \;
Example output:
./ 6
./foo 1
./foo/bar 2
./foo/bar/bazzz 0
./foo/bar/baz 4
./src 4
The export -f is required because the -exec argument of find does not allow executing a bash function unless you invoke bash explicitly, and you need to export the function defined in the current scope to the new shell explicitly.

My answer is a little different, due to the options of find, you can actually be much more flexible. Just try:
find . -type f -printf "%h\n" | sort | uniq -c
With the "%h" option to "-printf", find prints only the directory of the files it found. Then sort and count with "uniq -c". This prints the number of search result entries with the same directory, per directory.
Using further options on find, you can be much more flexible. For example, to get an overview how many files in which directory have been modified at a certain date, use:
find . -newermt "2022-01-01 00:00:00" -type f -printf "%TY-%Tm-%Td %h\n" | sort | uniq -c
This finds all files that have been modified since 1. January 2022, prints (with "-printf") the modification date and the directory, then sorts and counts them. In this example, each line in the result has the number of files, the date of modification (without time), and the directory.
Note that "-printf" may not be available in all versions of find I think.

I combined #glenn jackman's answer and #pcarvalho's answer(in comment list, there is something wrong with pcarvalho's answer because the extra style control function of character '`'(backtick)).
My script can accept path as an augument and sort the directory list as ls -l, also it can handles the problem of "space in file name".
#!/bin/bash
OLD_IFS="$IFS"
IFS=$'\n'
for dir in $(find $1 -maxdepth 1 -type d | sort);
do
files=("$dir"/*)
printf "%5d,%s\n" "${#files[#]}" "$dir"
done
FS="$OLD_IFS"
My first answer in stackoverflow, and I hope it can help someone ^_^

THis could be another way to browse through the directory structures and provide depth results.
find . -type d | awk '{print "echo -n \""$0" \";ls -l "$0" | grep -v total | wc -l" }' | sh

find . -type f -printf '%h\n' | sort | uniq -c
gives for example:
5 .
4 ./aln
5 ./aln/iq
4 ./bs
4 ./ft
6 ./hot

I tried with some of the others here but ended up with subfolders included in the file count when I only wanted the files. This prints ./folder/path<tab>nnn with the number of files, not including subfolders, for each subfolder in the current folder.
for d in `find . -type d -print`
do
echo -e "$d\t$(find $d -maxdepth 1 -type f -print | wc -l)"
done

This will give the overall count.
for file in */; do echo "$file -> $(ls $file | wc -l)"; done | cut -d ' ' -f 3| py --ji -l 'numpy.sum(l)'

A super fast miracle command, which recursively traverses files to count the number of images in a directory and organize the output by image extension:
find . -type f | sed -e 's/.*\.//' | sort | uniq -c | sort -n | grep -Ei '(tiff|bmp|jpeg|jpg|png|gif)$'
Credits: https://unix.stackexchange.com/a/386135/354980

I edited the script in order to exclude all node_modules directories inside the analyzed one.
This can be used to check if the project number of files is exceeding the maximum number that the file watcher can handle.
find . -type d ! -path "*node_modules*" -print0 | while read -d '' -r dir; do
files=("$dir"/*)
printf "%5d files in directory %s\n" "${#files[#]}" "$dir"
done
To check the maximum files that your system can watch:
cat /proc/sys/fs/inotify/max_user_watches
node_modules folder should be added to your IDE/editor excluded paths in slow systems, and the other files count shouldn't ideally exceed the maximum (which can be changed though).

Easy Method:
find ./|grep "Search_file.txt" |cut -d"/" -f2|sort |uniq -c

In my case I needed the count at subfolder level, so I did:
du -a | cut -d/ -f3 | sort | uniq -c | sort -nr

Easy way to recursively find files of a given type. In this case, .jpg files for all folders in current directory:
find . -name *.jpg -print | wc -l

omg why the complex commands. just use something like
find whatever_folder | wc -l

Measure disk space of certain file types in aggregate

I have some files across several folders:
/home/d/folder1/a.txt
/home/d/folder1/b.txt
/home/d/folder1/c.mov
/home/d/folder2/a.txt
/home/d/folder2/d.mov
/home/d/folder2/folder3/f.txt
How can I measure the grand total amount of disk space taken up by all the .txt files in /home/d/?
I know du will give me the total space of a given folder, and ls -l will give me the total space of individual files, but what if I want to add up all the txt files and just look at the space taken by all .txt files in one giant total for all .txt in /home/d/ including both folder1 and folder2 and their subfolders like folder3?

find folder1 folder2 -iname '*.txt' -print0 | du --files0-from - -c -s | tail -1

This will report disk space usage in bytes by extension:
find . -type f -printf "%f %s\n" |
awk '{
PARTSCOUNT=split( $1, FILEPARTS, "." );
EXTENSION=PARTSCOUNT == 1 ? "NULL" : FILEPARTS[PARTSCOUNT];
FILETYPE_MAP[EXTENSION]+=$2
}
END {
for( FILETYPE in FILETYPE_MAP ) {
print FILETYPE_MAP[FILETYPE], FILETYPE;
}
}' | sort -n
Output:
3250 png
30334451 mov
57725092729 m4a
69460813270 3gp
79456825676 mp3
131208301755 mp4

Simple:
du -ch *.txt
If you just want the total space taken to show up, then:
du -ch *.txt | tail -1

Here's a way to do it (in Linux, using GNU coreutils du and Bash syntax), avoiding bad practice:
total=0
while read -r line
do
size=($line)
(( total+=size ))
done < <( find . -iname "*.txt" -exec du -b {} + )
echo "$total"
If you want to exclude the current directory, use -mindepth 2 with find.
Another version that doesn't require Bash syntax:
find . -iname "*.txt" -exec du -b {} + | awk '{total += $1} END {print total}'
Note that these won't work properly with file names which include newlines (but those with spaces will work).

macOS
use the tool du and the parameter -I to exclude all other files
Linux
-X, --exclude-from=FILE
exclude files that match any pattern in FILE
--exclude=PATTERN
exclude files that match PATTERN

This will do it:
total=0
for file in *.txt
do
space=$(ls -l "$file" | awk '{print $5}')
let total+=space
done
echo $total

GNU find,
find /home/d -type f -name "*.txt" -printf "%s\n" | awk '{s+=$0}END{print "total: "s" bytes"}'

Building on ennuikiller's, this will handle spaces in names. I needed to do this and get a little report:
find -type f -name "*.wav" | grep export | ./calc_space
#!/bin/bash
# calc_space
echo SPACE USED IN MEGABYTES
echo
total=0
while read FILE
do
du -m "$FILE"
space=$(du -m "$FILE"| awk '{print $1}')
let total+=space
done
echo $total

A one liner for those with GNU tools on bash:
for i in $(find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u); do echo "$i"": ""$(du -hac **/*."$i" | tail -n1 | awk '{print $1;}')"; done | sort -h -k 2 -r
You must have extglob enabled:
shopt -s extglob
If you want dot files to work, you must run
shopt -s dotglob
Sample output:
d: 3.0G
swp: 1.3G
mp4: 626M
txt: 263M
pdf: 238M
ogv: 115M
i: 76M
pkl: 65M
pptx: 56M
mat: 50M
png: 29M
eps: 25M
etc

my solution to get a total size of all text files in a given path and subdirectories (using perl oneliner)
find /path -iname '*.txt' | perl -lane '$sum += -s $_; END {print $sum}'

I like to use find in combination with xargs:
find . -name "*.txt" -print0 |xargs -0 du -ch
Add tail if you only want to see the grand total
find . -name "*.txt" -print0 |xargs -0 du -ch | tail -n1

For anyone wanting to do this with macOS at the command line, you need a variation based on the -print0 argument instead of printf. Some of the above answers address that but this will do it comprehensively by extension:
find . -type f -print0 | xargs -0 stat -f "%N %i" |
awk '{
PARTSCOUNT=split( $1, FILEPARTS, "." );
EXTENSION=PARTSCOUNT == 1 ? "NULL" : FILEPARTS[PARTSCOUNT];
FILETYPE_MAP[EXTENSION]+=$2
}
END {
for( FILETYPE in FILETYPE_MAP ) {
print FILETYPE_MAP[FILETYPE], FILETYPE;
}
}' | sort -n

There are several potential problems with the accepted answer:
it does not descend into subdirectories (without relying on non-standard shell features like globstar)
in general, as pointed out by Dennis Williamson below, you should avoid parsing the output of ls
namely, if the user or group (columns 3 and 4) have spaces in them, column 5 will not be the file size
if you have a million such files, this will spawn two million subshells, and it'll be sloooow
As proposed by ghostdog74, you can use the GNU-specific -printf option to find to achieve a more robust solution, avoiding all the excessive pipes, subshells, Perl, and weird du options:
# the '%s' format string means "the file's size"
find . -name "*.txt" -printf "%s\n" \
| awk '{sum += $1} END{print sum " bytes"}'
Yes, yes, solutions using paste or bc are also possible, but not any more straightforward.
On macOS, you would need to use Homebrew or MacPorts to install findutils, and call gfind instead. (I see the "linux" tag on this question, but it's also tagged "unix".)
Without GNU find, you can still fall back to using du:
find . -name "*.txt" -exec du -k {} + \
| awk '{kbytes+=$1} END{print kbytes " Kbytes"}'
…but you have to be mindful of the fact that du's default output is in 512-byte blocks for historical reasons (see the "RATIONALE" section of the man page), and some versions of du (notably, macOS's) will not even have an option to print sizes in bytes.
Many other fine solutions here (see Barn's answer in particular), but most suffer the drawback of being unnecessarily complex or depending too heavily on GNU-only features—and maybe in your environment, that's OK!

How can I calculate an MD5 checksum of a directory?

I need to calculate a summary MD5 checksum for all files of a particular type (*.py for example) placed under a directory and all sub-directories.
What is the best way to do that?
The proposed solutions are very nice, but this is not exactly what I need. I'm looking for a solution to get a single summary checksum which will uniquely identify the directory as a whole - including content of all its subdirectories.

Create a tar archive file on the fly and pipe that to md5sum:
tar c dir | md5sum
This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.

find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum
The find command lists all the files that end in .py.
The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique).
The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.
I've tested this by copying a test directory:
rsync -a ~/pybin/ ~/pybin2/
I renamed some of the files in ~/pybin2.
The find...md5sum command returns the same output for both directories.
2bcf49a4d19ef9abd284311108d626f1 -
To take into account the file layout (paths), so the checksum changes if a file is renamed or moved, the command can be simplified:
find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | md5sum
On macOS with md5:
find /path/to/dir/ -type f -name "*.py" -exec md5 {} + | md5

ire_and_curses's suggestion of using tar c <dir> has some issues:
tar processes directory entries in the order which they are stored in the filesystem, and there is no way to change this order. This effectively can yield completely different results if you have the "same" directory on different places, and I know no way to fix this (tar cannot "sort" its input files in a particular order).
I usually care about whether groupid and ownerid numbers are the same, not necessarily whether the string representation of group/owner are the same. This is in line with what for example rsync -a --delete does: it synchronizes virtually everything (minus xattrs and acls), but it will sync owner and group based on their ID, not on string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar
tar will include the filename of the directory you're checking itself, just something to be aware of.
As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if you the checksumming should keep in mind empty directories.
Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.
This is the solution I came up with:
dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum
Notes about this solution:
The LC_ALL=C is to ensure reliable sorting order across systems
This doesn't differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but the chance of that occurring seems very unlikely. One usually fixes this with a -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.
PS: one of my systems uses a limited busybox find which does not support -exec nor -print0 flags, and also it appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:
dir=<mydir>; (find "$dir" -type f | while read f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum
Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.

If you only care about files and not empty directories, this works nicely:
find /path -type f | sort -u | xargs cat | md5sum

A solution which worked best for me:
find "$path" -type f -print0 | sort -z | xargs -r0 md5sum | md5sum
Reason why it worked best for me:
handles file names containing spaces
Ignores filesystem meta-data
Detects if file has been renamed
Issues with other answers:
Filesystem meta-data is not ignored for:
tar c - "$path" | md5sum
Does not handle file names containing spaces nor detects if file has been renamed:
find /path -type f | sort -u | xargs cat | md5sum

For the sake of completeness, there's md5deep(1); it's not directly applicable due to *.py filter requirement but should do fine together with find(1).

If you want one MD5 hash value spanning the whole directory, I would do something like
cat *.py | md5sum

Checksum all files, including both content and their filenames
grep -ar -e . /your/dir | md5sum | cut -c-32
Same as above, but only including *.py files
grep -ar -e . --include="*.py" /your/dir | md5sum | cut -c-32
You can also follow symlinks if you want
grep -aR -e . /your/dir | md5sum | cut -c-32
Other options you could consider using with grep
-s, --no-messages suppress error messages
-D, --devices=ACTION how to handle devices, FIFOs and sockets;
-Z, --null print 0 byte after FILE name
-U, --binary do not strip CR characters at EOL (MSDOS/Windows)

GNU find
find /path -type f -name "*.py" -exec md5sum "{}" +;

Technically you only need to run ls -lR *.py | md5sum. Unless you are worried about someone modifying the files and touching them back to their original dates and never changing the files' sizes, the output from ls should tell you if the file has changed. My unix-foo is weak so you might need some more command line parameters to get the create time and modification time to print. ls will also tell you if permissions on the files have changed (and I'm sure there are switches to turn that off if you don't care about that).

Using md5deep:
md5deep -r FOLDER | awk '{print $1}' | sort | md5sum

I want to add that if you are trying to do this for files/directories in a Git repository to track if they have changed, then this is the best approach:
git log -1 --format=format:%H --full-diff <file_or_dir_name>
And if it's not a Git directory/repository, then the answer by ire_and_curses is probably the best bet:
tar c <dir_name> | md5sum
However, please note that tar command will change the output hash if you run it in a different OS and stuff. If you want to be immune to that, this is the best approach, even though it doesn't look very elegant on first sight:
find <dir_name> -type f -print0 | sort -z | xargs -0 md5sum | md5sum | awk '{ print $1 }'

md5sum worked fine for me, but I had issues with sort and sorting file names. So instead I sorted by md5sum result. I also needed to exclude some files in order to create comparable results.
find . -type f -print0 \
| xargs -r0 md5sum \
| grep -v ".env" \
| grep -v "vendor/autoload.php" \
| grep -v "vendor/composer/" \
| sort -d \
| md5sum

I had the same problem so I came up with this script that just lists the MD5 hash values of the files in the directory and if it finds a subdirectory it runs again from there, for this to happen the script has to be able to run through the current directory or from a subdirectory if said argument is passed in $1
#!/bin/bash
if [ -z "$1" ] ; then
# loop in current dir
ls | while read line; do
ecriv=`pwd`"/"$line
if [ -f $ecriv ] ; then
md5sum "$ecriv"
elif [ -d $ecriv ] ; then
sh myScript "$line" # call this script again
fi
done
else # if a directory is specified in argument $1
ls "$1" | while read line; do
ecriv=`pwd`"/$1/"$line
if [ -f $ecriv ] ; then
md5sum "$ecriv"
elif [ -d $ecriv ] ; then
sh myScript "$line"
fi
done
fi

If you want really independence from the file system attributes and from the bit-level differences of some tar versions, you could use cpio:
cpio -i -e theDirname | md5sum

There are two more solutions:
Create:
du -csxb /path | md5sum > file
ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum > /tmp/file
Check:
du -csxb /path | md5sum -c file
ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum -c /tmp/file

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Compare 2 Folders and Find Files with Differing Byte Counts - linux

Related

LINUX Copy the name of the newest folder and paste it in a command [duplicate]

How do I classify files in Linux server by their names?

How to count number of files in each directory?

Measure disk space of certain file types in aggregate

How can I calculate an MD5 checksum of a directory?

Categories

Resources