Linux: compute a single hash for a given folder & contents?

Surely there must be a way to do this easily!
I've tried the Linux command-line apps such as sha1sum and md5sum but they seem only to be able to compute hashes of individual files and output a list of hash values, one for each file.
I need to generate a single hash for the entire contents of a folder (not just the filenames).
I'd like to do something like
sha1sum /folder/of/stuff > singlehashvalue
Edit: to clarify, my files are at multiple levels in a directory tree, they're not all sitting in the same root folder.

One possible way would be:
sha1sum path/to/folder/* | sha1sum
If there is a whole directory tree, you're probably better off using find and xargs. One possible command would be
find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum
And, finally, if you also need to take account of permissions and empty directories:
(find path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum;
find path/to/folder \( -type f -o -type d \) -print0 | sort -z | \
xargs -0 stat -c '%n %a') \
| sha1sum
The arguments to stat will cause it to print the name of the file, followed by its octal permissions. The two finds will run one after the other, causing double the amount of disk IO, the first finding all file names and checksumming the contents, the second finding all file and directory names, printing name and mode. The list of "file names and checksums", followed by "names and directories, with permissions" will then be checksummed, for a smaller checksum.
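For comparison, here is a minimal Python sketch of the same idea in a single traversal, which avoids the doubled disk IO. It is not equivalent to the shell pipeline and will not produce the same digest; the function names hash_tree_with_modes and _file_sha1 and the record layout are assumptions made for illustration.

```python
import hashlib
import os
import stat


def _file_sha1(path):
    """SHA-1 of one file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree_with_modes(root):
    """One SHA-1 over relative names, octal modes and file contents."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            # Only regular files have their contents hashed; directories
            # (including empty ones) and other entry types get a "-".
            content = _file_sha1(path) if stat.S_ISREG(info.st_mode) else "-"
            mode = stat.S_IMODE(info.st_mode)
            rel = os.path.relpath(path, root)
            lines.append(f"{rel}\0{mode:o}\0{content}\n")
    combined = hashlib.sha1()
    for line in sorted(lines):  # fixed order, independent of traversal order
        combined.update(line.encode("utf-8", "surrogateescape"))
    return combined.hexdigest()
```

Calling hash_tree_with_modes("path/to/folder") returns a 40-character hex string that changes when a file's contents, a name, a permission bit, or the set of (possibly empty) directories changes.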

Use a file system intrusion detection tool like aide.
Hash a tarball of the directory:
tar cvf - /path/to/folder | sha1sum
Code something yourself, like vatine's oneliner:
find /path/to/folder -type f -print0 | sort -z | xargs -0 sha1sum | sha1sum

If you just want to check if something in the folder changed, I'd recommend this one:
ls -alR --full-time /folder/of/stuff | sha1sum
It will just give you a hash of the ls output, that contains folders, sub-folders, their files, their timestamp, size and permissions. Pretty much everything that you would need to determine if something has changed.
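Please note that this command does not hash the contents of each file, which is why it should be faster than the find-based approaches. The flip side is that a change which preserves size and timestamp will go unnoticed.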
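The same metadata-only idea can be sketched in Python without parsing ls output. This is a rough illustration, not a replacement for the command above: the name metadata_fingerprint and the choice of fields (size, mode, modification time in nanoseconds) are assumptions.

```python
import hashlib
import os


def metadata_fingerprint(root):
    """Hash names, sizes, modes and mtimes only; file contents are never read."""
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat: do not follow symlinks
            rel = os.path.relpath(path, root)
            records.append(
                f"{rel}\0{info.st_size}\0{info.st_mode:o}\0{info.st_mtime_ns}\n"
            )
    digest = hashlib.sha1()
    for record in sorted(records):  # stable order regardless of traversal
        digest.update(record.encode("utf-8", "surrogateescape"))
    return digest.hexdigest()
```

Because timestamps are part of the fingerprint, a copy of the folder made without preserving mtimes will produce a different value even if the contents are identical.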

You can do tar -c /path/to/folder | sha1sum

So far the fastest way to do it is still with tar. With a few additional parameters we can also get rid of the differences caused by metadata.
To hash a directory with GNU tar, make sure the paths are sorted while archiving; otherwise the entry order, and therefore the hash, can differ between runs.
tar -C <root-dir> -cf - --sort=name <dir> | sha256sum
ignore time
If you do not care about access or modification times, also pass something like --mtime='UTC 2019-01-01' so that all timestamps are the same.
ignore ownership
Usually we need to add --group=0 --owner=0 --numeric-owner to unify the owner metadata.
ignore some files
use --exclude=PATTERN
Some tar implementations do not support --sort, so be sure you are using GNU tar (version 1.28 or later).
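If GNU tar is not available, the same normalisation (sorted names, fixed mtime, zeroed ownership) can be approximated with Python's tarfile module. This is only a sketch: the names normalized_tar_hash and _HashWriter are made up for illustration, and the digest will not match the output of the tar command above.

```python
import hashlib
import os
import tarfile


class _HashWriter:
    """Write-only sink that feeds every archive byte into a hash object."""

    def __init__(self, digest):
        self.digest = digest

    def write(self, data):
        self.digest.update(data)
        return len(data)


def normalized_tar_hash(root):
    """SHA-256 of a tar stream with sorted names and neutral metadata."""
    digest = hashlib.sha256()

    def normalize(member):
        member.mtime = 0                      # like --mtime
        member.uid = member.gid = 0           # like --owner=0 --group=0
        member.uname = member.gname = ""      # like --numeric-owner
        return member

    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.append(os.path.join(dirpath, name))

    with tarfile.open(fileobj=_HashWriter(digest), mode="w|",
                      format=tarfile.GNU_FORMAT) as archive:
        for path in sorted(paths):            # like --sort=name
            archive.add(path, arcname=os.path.relpath(path, root),
                        recursive=False, filter=normalize)
    return digest.hexdigest()
```

The archive is streamed straight into the hash, so nothing is written to disk and memory use stays small. File modes are still part of the result, as they are with tar.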

A robust and clean approach
First things first, don't hog the available memory! Hash a file in chunks rather than feeding the entire file.
Different approaches for different needs/purpose (all of the below or pick what ever applies):
Hash only the entry name of all entries in the directory tree
Hash the file contents of all entries (leaving the meta like, inode number, ctime, atime, mtime, size, etc., you get the idea)
For a symbolic link, its content is the referent name. Hash it or choose to skip
Follow or not to follow(resolved name) the symlink while hashing the contents of the entry
If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changes but the rest of the contents remain the same and they are all fairly large files
Handle large files well(again, mind the RAM)
Handle very deep directory trees (mind the open file descriptors)
Handle non standard file names
How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must hash them as well?
Don't update the access time of any entry while traversing because this will be a side effect and counter-productive(intuitive?) for certain use cases.
This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.
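A few of the points above (hashing in chunks, treating a symlink's referent name as its content, skipping special files) can be sketched in Python as follows. The function name hash_entry, the 1 MiB chunk size, and the decision to return None for anything that is neither a regular file nor a symlink are assumptions for illustration, not how dtreetrawl works.

```python
import hashlib
import os
import stat

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def hash_entry(path, algorithm="sha256"):
    """Hash one directory entry without loading it into memory at once.

    Regular files are read in chunks and symlinks contribute their referent
    name. Directories, sockets, FIFOs and device nodes return None.
    """
    info = os.lstat(path)  # lstat so that symlinks are not followed
    digest = hashlib.new(algorithm)
    if stat.S_ISLNK(info.st_mode):
        digest.update(os.fsencode(os.readlink(path)))
    elif stat.S_ISREG(info.st_mode):
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest.update(chunk)
    else:
        return None
    return digest.hexdigest()
```

Memory use is bounded by CHUNK_SIZE no matter how large the file is. Reading a file may still update its access time, depending on the mount options.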
Here's a tool, very light on memory, which addresses most cases, might be a bit rough around the edges but has been quite helpful.
An example usage and output of dtreetrawl.
Usage:
dtreetrawl [OPTION...] "/trawl/me" [path2,...]
Help Options:
-h, --help Show help options
Application Options:
-t, --terse Produce a terse output; parsable.
-j, --json Output as JSON
-d, --delim=: Character or string delimiter/separator for terse output(default ':')
-l, --max-level=N Do not traverse tree beyond N level(s)
--hash Enable hashing(default is MD5).
-c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
-R, --only-root-hash Output only the root hash. Blank line if --hash is not set
-N, --no-name-hash Exclude path name while calculating the root checksum
-F, --no-content-hash Do not hash the contents of the file
-s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
-e, --hash-dirent Include hash of directory entries while calculating root checksum
A snippet of human friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
Base name : CREDITS
Level : 1
Type : regular file
Referent name :
File size : 98443 bytes
I-node number : 290850
No. directory entries : 0
Permission (octal) : 0644
Link count : 1
Ownership : UID=0, GID=0
Preferred I/O block size : 4096 bytes
Blocks allocated : 200
Last status change : Tue, 21 Nov 17 21:28:18 +0530
Last file access : Thu, 28 Dec 17 00:53:27 +0530
Last file modification : Tue, 21 Nov 17 21:28:18 +0530
Hash : 9f0312d130016d103aa5fc9d16a2437e
Stats for /home/lab/linux-4.14-rc8:
Elapsed time : 1.305767 s
Start time : Sun, 07 Jan 18 03:42:39 +0530
Root hash : 434e93111ad6f9335bb4954bc8f4eca4
Hash type : md5
Depth : 8
Total,
size : 66850916 bytes
entries : 12484
directories : 763
regular files : 11715
symlinks : 6
block devices : 0
char devices : 0
sockets : 0
FIFOs/pipes : 0

If this is a git repo and you want to ignore any files in .gitignore, you might want to use this:
git ls-files <your_directory> | xargs sha256sum | cut -d" " -f1 | sha256sum | cut -d" " -f1
This is working well for me.

If you just want to hash the contents of the files, ignoring the filenames then you can use
cat $FILES | md5sum
Make sure you have the files in the same order when computing the hash (echo prints all names on one line, so split them onto separate lines before sorting):
cat $(printf '%s\n' $FILES | sort) | md5sum
But you can't have directories in your list of files.

Another tool to achieve this:
http://md5deep.sourceforge.net/
As it sounds: like md5sum but also recursive, plus other features.
md5deep -r {directory}

There is a python script for that:
http://code.activestate.com/recipes/576973-getting-the-sha-1-or-md5-hash-of-a-directory/
If you rename files without changing their alphabetical order, the script will not detect it. But if you change the order of the files or the contents of any file, running the script will give you a different hash than before.

I had to check a whole directory for file changes,
but excluding timestamps and directory ownership.
The goal is to get a sum that is identical anywhere, as long as the files are identical,
including on other machines, regardless of anything but the files themselves or a change to them.
md5sum * | md5sum | cut -d' ' -f1
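It generates a list of hashes, one per file, then hashes that list into a single value. Note that * only covers the files directly in the current directory, not those in subdirectories.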
This is way faster than the tar method.
For a stronger hash, we can use sha512sum in the same recipe.
sha512sum * | sha512sum | cut -d' ' -f1
The hashes are also identical anywhere using sha512sum and, unlike MD5, SHA-512 has no known practical collision attacks.

Here's a simple, short variant in Python 3 that works fine for small-sized files (e.g. a source tree or something, where every file individually can fit into RAM easily), ignoring empty directories, based on the ideas from the other solutions:
import os, hashlib
def hash_for_directory(path, hashfunc=hashlib.sha1):
    filenames = sorted(os.path.join(dp, fn) for dp, _, fns in os.walk(path) for fn in fns)
    index = '\n'.join('{}={}'.format(os.path.relpath(fn, path), hashfunc(open(fn, 'rb').read()).hexdigest()) for fn in filenames)
    return hashfunc(index.encode('utf-8')).hexdigest()
It works like this:
Find all files in the directory recursively and sort them by name
Calculate the hash (default: SHA-1) of every file (reads whole file into memory)
Make a textual index with "filename=hash" lines
Encode that index back into a UTF-8 byte string and hash that
You can pass in a different hash function as second parameter if SHA-1 is not your cup of tea.

Adding multiprocessing and a progress bar to kvantour's answer:
Around 30x faster (depending on CPU)
100%|██████████████████████████████████| 31378/31378 [03:03<00:00, 171.43file/s]
# to hash without permissions
find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | tqdm --unit file --total $(find . -type f | wc -l) | sort | awk '{ print $1 }' | sha1sum
# to hash permissions
(find . -type f -print0 | sort -z | xargs -P $(nproc --all) -0 sha1sum | sort | awk '{ print $1 }';
find . \( -type f -o -type d \) -print0 | sort -z | xargs -P $(nproc --all) -0 stat -c '%n %a') | \
sort | sha1sum | awk '{ print $1 }'
Make sure tqdm is installed (pip install tqdm, or check its documentation).
awk removes the file path, so that a different parent directory or path does not affect the hash.
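The parallel part can also be sketched in Python with a thread pool; hashlib releases the GIL while hashing larger buffers, so threads can overlap work here. The names parallel_tree_hash and _sha1_of_file are assumptions, and unlike the awk step above this sketch keeps the relative path in the final hash, so the result differs from the shell commands.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def _sha1_of_file(path):
    """SHA-1 of one file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parallel_tree_hash(root, workers=None):
    """Hash all files under root concurrently, then hash the sorted list."""
    files = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, filenames in os.walk(root)
        for name in filenames
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so no re-sorting is needed.
        file_hashes = list(pool.map(_sha1_of_file, files))
    combined = hashlib.sha1()
    for path, file_hash in zip(files, file_hashes):
        line = f"{file_hash}  {os.path.relpath(path, root)}\n"
        combined.update(line.encode("utf-8", "surrogateescape"))
    return combined.hexdigest()
```

The result does not depend on the number of workers, because the per-file hashes are combined in sorted path order.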

You can try hashdir which is an open source command line tool written for this purpose.
hashdir /folder/of/stuff
It has several useful flags to allow you to specify the hashing algorithm, print the hashes of all children, as well as save and verify a hash.
hashdir:
A command-line utility to checksum directories and files.
Usage:
hashdir [options] [<item>...] [command]
Arguments:
<item> Directory or file to hash/check
Options:
-t, --tree Print directory tree
-s, --save Save the checksum to a file
-i, --include-hidden-files Include hidden files
-e, --skip-empty-dir Skip empty directories
-a, --algorithm <md5|sha1|sha256|sha384|sha512> The hash function to use [default: sha1]
--version Show version information
-?, -h, --help Show help and usage information
Commands:
check <item> Verify that the specified hash file is valid.

Try to make it in two steps:
create a file with hashes for all files in a folder
hash this file
Like so:
# for FILE in `find /folder/of/stuff -type f | sort`; do sha1sum $FILE >> hashes; done
# sha1sum hashes
Or do it all at once:
# cat `find /folder/of/stuff -type f | sort` | sha1sum

I would pipe the results for individual files through sort (to prevent a mere reordering of files to change the hash) into md5sum or sha1sum, whichever you choose.

I've written a Groovy script to do this:
import java.security.MessageDigest
public static String generateDigest(File file, String digest, int paddedLength){
    MessageDigest md = MessageDigest.getInstance(digest)
    md.reset()
    def files = []
    def directories = []
    if(file.isDirectory()){
        file.eachFileRecurse(){sf ->
            if(sf.isFile()){
                files.add(sf)
            }
            else{
                directories.add(file.toURI().relativize(sf.toURI()).toString())
            }
        }
    }
    else if(file.isFile()){
        files.add(file)
    }
    files.sort({a, b -> return a.getAbsolutePath() <=> b.getAbsolutePath()})
    directories.sort()
    files.each(){f ->
        println file.toURI().relativize(f.toURI()).toString()
        f.withInputStream(){is ->
            byte[] buffer = new byte[8192]
            int read = 0
            while((read = is.read(buffer)) > 0){
                md.update(buffer, 0, read)
            }
        }
    }
    directories.each(){d ->
        println d
        md.update(d.getBytes())
    }
    byte[] digestBytes = md.digest()
    BigInteger bigInt = new BigInteger(1, digestBytes)
    return bigInt.toString(16).padLeft(paddedLength, '0')
}
println "\n${generateDigest(new File(args[0]), 'SHA-256', 64)}"
You can customize the usage to avoid printing each file, change the message digest, take out directory hashing, etc. I've tested it against the NIST test data and it works as expected. http://www.nsrl.nist.gov/testdata/
gary-macbook:Scripts garypaduana$ groovy dirHash.groovy /Users/garypaduana/.config
.DS_Store
configstore/bower-github.yml
configstore/insight-bower.json
configstore/update-notifier-bower.json
filezilla/filezilla.xml
filezilla/layout.xml
filezilla/lockfile
filezilla/queue.sqlite3
filezilla/recentservers.xml
filezilla/sitemanager.xml
gtk-2.0/gtkfilechooser.ini
a/
configstore/
filezilla/
gtk-2.0/
lftp/
menus/
menus/applications-merged/
79de5e583734ca40ff651a3d9a54d106b52e94f1f8c2cd7133ca3bbddc0c6758

Quick summary: how to hash the contents of an entire folder, or compare two folders for equality
# 1. How to get a sha256 hash over all file contents in a folder, including
# hashing over the relative file paths within that folder to check the
# filenames themselves (get this bash function below).
sha256sum_dir "path/to/folder"
# 2. How to quickly compare two folders (get the `diff_dir` bash function below)
diff_dir "path/to/folder1" "path/to/folder2"
# OR:
diff -r -q "path/to/folder1" "path/to/folder2"
The "one liners"
Do this instead of the main answer, to get a single hash for all non-directory file contents within an entire folder, no matter where the folder is located:
This is a "1-line" command. Copy and paste the whole thing to run it all at once:
# This one works, but don't use it, because its hash output does NOT
# match that of my `sha256sum_dir` function. I recommend you use
# the "1-liner" just below, therefore, instead.
time ( \
starting_dir="$(pwd)" \
&& target_dir="path/to/folder" \
&& cd "$target_dir" \
&& find . -not -type d -print0 | sort -zV \
| xargs -0 sha256sum | sha256sum; \
cd "$starting_dir"
)
However, that produces a slightly different hash than my sha256sum_dir bash function, which I present below, produces. So, to get the output hash to exactly match the output from my sha256sum_dir function, do this instead:
# Use this one, as its output matches that of my `sha256sum_dir`
# function exactly.
all_hashes_str="$( \
starting_dir="$(pwd)" \
&& target_dir="path/to/folder" \
&& cd "$target_dir" \
&& find . -not -type d -print0 | sort -zV | xargs -0 sha256sum \
)"; \
cd "$starting_dir"; \
printf "%s" "$all_hashes_str" | sha256sum
For more on why the main answer doesn't produce identical hashes for identical folders in different locations, see further below.
[My preferred method] Here are some bash functions I wrote: sha256sum_dir and diff_dir
Place the following functions in your ~/.bashrc file or in your ~/.bash_aliases file, assuming your ~/.bashrc file sources the ~/.bash_aliases file like this:
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
You can find both of the functions below in my personal ~/.bash_aliases file in my eRCaGuy_dotfiles repo.
Here is the sha256sum_dir function, which obtains a total "directory" hash of all files in the directory:
# Take the sha256sum of all files in an entire dir, and then sha256sum that
# entire output to obtain a _single_ sha256sum which represents the _entire_
# dir.
# See:
# 1. [my answer] https://stackoverflow.com/a/72070772/4561887
sha256sum_dir() {
    return_code="$RETURN_CODE_SUCCESS"
    if [ "$#" -eq 0 ]; then
        echo "ERROR: too few arguments."
        return_code="$RETURN_CODE_ERROR"
    fi
    # Print help string if requested
    if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
        # Help string
        echo "Obtain a sha256sum of all files in a directory."
        echo "Usage: ${FUNCNAME[0]} [-h|--help] <dir>"
        return "$return_code"
    fi
    starting_dir="$(pwd)"
    target_dir="$1"
    cd "$target_dir"
    # See my answer: https://stackoverflow.com/a/72070772/4561887
    filenames="$(find . -not -type d | sort -V)"
    IFS=$'\n' read -r -d '' -a filenames_array <<< "$filenames"
    # Expand every array element as a separate, quoted argument
    time all_hashes_str="$(sha256sum "${filenames_array[@]}")"
    cd "$starting_dir"
    echo ""
    echo "Note: you may now call:"
    echo "1. 'printf \"%s\n\" \"\$all_hashes_str\"' to view the individual" \
         "hashes of each file in the dir. Or:"
    echo "2. 'printf \"%s\" \"\$all_hashes_str\" | sha256sum' to see that" \
         "the hash of that output is what we are using as the final hash" \
         "for the entire dir."
    echo ""
    printf "%s" "$all_hashes_str" | sha256sum | awk '{ print $1 }'
    return "$?"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_sha256sum_dir="sha256sum_dir"
Assuming you just want to compare two directories for equality, you can use diff -r -q "dir1" "dir2" instead, which I wrapped in this diff_dir command. I learned about the diff command to compare entire folders here: how do I check that two folders are the same in linux.
# Compare dir1 against dir2 to see if they are equal or if they differ.
# See:
# 1. How to `diff` two dirs: https://stackoverflow.com/a/16404554/4561887
diff_dir() {
    return_code="$RETURN_CODE_SUCCESS"
    if [ "$#" -eq 0 ]; then
        echo "ERROR: too few arguments."
        return_code="$RETURN_CODE_ERROR"
    fi
    # Print help string if requested
    if [ "$#" -eq 0 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
        echo "Compare (diff) two directories to see if dir1 contains the same" \
             "content as dir2."
        echo "NB: the output will be **empty** if both directories match!"
        echo "Usage: ${FUNCNAME[0]} [-h|--help] <dir1> <dir2>"
        return "$return_code"
    fi
    dir1="$1"
    dir2="$2"
    time diff -r -q "$dir1" "$dir2"
    return_code="$?"
    if [ "$return_code" -eq 0 ]; then
        echo -e "\nDirectories match!"
    fi
    # echo "$return_code"
    return "$return_code"
}
# Note: I prefix this with my initials to find my custom functions easier
alias gs_diff_dir="diff_dir"
Here is the output of my sha256sum_dir command on my ~/temp2 dir (which dir I describe just below so you can reproduce it and test this yourself). You can see the total folder hash is b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3 in this case:
$ gs_sha256sum_dir ~/temp2
real 0m0.007s
user 0m0.000s
sys 0m0.007s
Note: you may now call:
1. 'printf "%s\n" "$all_hashes_str"' to view the individual hashes of each
file in the dir. Or:
2. 'printf "%s" "$all_hashes_str" | sha256sum' to see that the hash of that
output is what we are using as the final hash for the entire dir.
b86c66bcf2b033f65451e8c225425f315e618be961351992b7c7681c3822f6a3
Here is the cmd and output of diff_dir to compare two dirs for equality. This is checking that copying an entire directory to my SD card just now worked correctly. I made the output indicate Directories match! whenever that is the case!:
$ gs_diff_dir "path/to/sd/card/tempdir" "/home/gabriel/tempdir"
real 0m0.113s
user 0m0.037s
sys 0m0.077s
Directories match!
Why the main answer doesn't produce identical hashes for identical folders in different locations
I tried the most-upvoted answer here, and it doesn't work quite right as-is. It needs a little tweaking. It doesn't work quite right because the hash changes based on the folder-of-interest's base path! That means that an identical copy of some folder will have a different hash than the folder it was copied from even if the two folders are perfect matches and contain exactly the same content! That kind of defeats the purpose of taking a hash of the folder if the hashes of two identical folders differ! Let me explain:
Assume I have a folder named temp2 at ~/temp2. It contains file1.txt, file2.txt, and file3.txt. file1.txt contains the letter a followed by a return, file2.txt contains a letter b followed by a return, and file3.txt contains a letter c followed by a return.
If I run find /home/gabriel/temp2, I get:
$ find /home/gabriel/temp2
/home/gabriel/temp2
/home/gabriel/temp2/file3.txt
/home/gabriel/temp2/file1.txt
/home/gabriel/temp2/file2.txt
If I forward that to sha256sum (in place of sha1sum) in the same pattern as the main answer states, I get this. Notice it has the full path after each hash, which is not what we want:
$ find /home/gabriel/temp2 -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7 /home/gabriel/temp2/file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f /home/gabriel/temp2/file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478 /home/gabriel/temp2/file3.txt
If you then pipe that output string above to sha256sum again, it hashes the file hashes with their full file paths, which is not what we want! The file hashes may match in a folder and in a copy of that folder exactly, but the absolute paths do NOT match exactly, so they will produce different final hashes since we are hashing over the full file paths as part of our single, final hash!
Instead, what we want is the relative file path next to each hash. To do that, you must first cd into the folder of interest, and then run the hash command over all files therein, like this:
cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
Now, I get this. Notice the file paths are all relative now, which is what I want!:
$ cd "/home/gabriel/temp2" && find . -type f -print0 | sort -z | xargs -0 sha256sum
87428fc522803d31065e7bce3cf03fe475096631e5e07bbd7a0fde60c4cf25c7 ./file1.txt
0263829989b6fd954f72baaf2fc64bc2e2f01d692d4de72986ea808f6e99813f ./file2.txt
a3a5e715f0cc574a73c3f9bebb6bc24f32ffd5b67b387244c2c909da779a1478 ./file3.txt
Good. Now, if I hash that entire output string, since the file paths are all relative in it, the final hash will match exactly for a folder and its copy! In this way, we are hashing over the file contents and the file names within the directory of interest, to get a different hash for a given folder if either the file contents are different or the filenames are different, or both.
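To make the relative-versus-absolute point concrete, here is a small Python sketch. The helper names sha256_listing and folder_hash are hypothetical, each file is read whole (so this is for small files only), and the relative paths are written without the leading ./, so the digest will not equal the one from the shell commands above.

```python
import hashlib
import os


def sha256_listing(root, relative=True):
    """Return sha256sum-style lines for every regular file under root."""
    lines = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                file_hash = hashlib.sha256(handle.read()).hexdigest()
            shown = os.path.relpath(path, root) if relative else os.path.abspath(path)
            lines.append(f"{file_hash}  {shown}")
    # Sort by the path column so the order is stable.
    return sorted(lines, key=lambda line: line.split("  ", 1)[1])


def folder_hash(root, relative=True):
    """Hash the whole listing; with relative=True the location is irrelevant."""
    text = "\n".join(sha256_listing(root, relative)) + "\n"
    return hashlib.sha256(text.encode("utf-8", "surrogateescape")).hexdigest()
```

With relative=True a folder and its copy in another location produce the same folder_hash, while relative=False reproduces the problem described above: identical contents, different final hash.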

You could use sha1sum to generate the list of hash values and then run sha1sum on that list again; it depends on what exactly you want to accomplish.

How to hash all files in an entire directory, including the filenames as well as their contents
Assuming you are trying to compare a folder and all its contents to ensure it was copied correctly from one computer to another, for instance, you can do it as follows. Let's assume the folder is named mydir and is at path /home/gabriel/mydir on computer 1, and at /home/gabriel/dev/repos/mydir on computer 2.
# 1. First, cd to the dir in which the dir of interest is found. This is
# important! If you don't do this, then the paths output by find will differ
# between the two computers since the absolute paths to `mydir` differ. We are
# going to hash the paths too, not just the file contents, so this matters.
cd /home/gabriel # on computer 1
cd /home/gabriel/dev/repos # on computer 2
# 2. hash all files inside `mydir`, then hash the list of all hashes and their
# respective file paths. This obtains one single final hash. Sorting is
# necessary by piping to `sort` to ensure we get a consistent file order in
# order to ensure a consistent final hash result.
find mydir -type f -exec sha256sum {} + | sort | sha256sum
# Optionally pipe that output to awk to filter in on just the hash (first field
# in the output)
find mydir -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
That's it!
To see the intermediary list of file hashes, for learning's sake, just run this:
find mydir -type f -exec sha256sum {} + | sort
Note that the above commands ignore empty directories, file permissions, timestamps of when files were last edited, etc. For most cases though that's ok.
Example
Here is a real run and actual output. I wanted to ensure my eclipse-workspace folder was properly copied from one computer to another. As you can see, the time command tells me it took 11.790 seconds:
$ time find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4 -
real 0m11.790s
user 0m11.372s
sys 0m0.432s
The hash I care about is: 8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
If piping to awk and excluding time, I get:
$ find eclipse-workspace -type f -exec sha256sum {} + | sort | sha256sum | awk '{print $1}'
8f493478e7bb77f1d025cba31068c1f1c8e1eab436f8a3cf79d6e60abe2cd2e4
Be sure you check find for errors in the printed stderr output, as a hash will be produced even in the event find fails.
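In a script you can turn such silent failures into hard errors. Below is a minimal Python sketch of that idea, with the assumed name strict_tree_hash; like the pipeline above it sorts the "hash, path" lines before hashing them, but it uses relative paths without the leading directory name, so the digest is not the same as the shell result.

```python
import hashlib
import os


def strict_tree_hash(root):
    """Hash all files under root, aborting on the first error."""

    def fail(error):
        # os.walk ignores directory-listing errors unless onerror is given.
        raise error

    lines = []
    for dirpath, _, filenames in os.walk(root, onerror=fail):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:  # raises OSError if unreadable
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            lines.append(f"{digest.hexdigest()}  {os.path.relpath(path, root)}\n")
    listing = "".join(sorted(lines))
    return hashlib.sha256(listing.encode("utf-8", "surrogateescape")).hexdigest()
```

A missing or unreadable directory now raises OSError instead of quietly yielding the hash of an incomplete listing.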
Hashing my whole eclipse-workspace dir in only 12 seconds is impressive considering it contains 6480 files, as shown by this:
find eclipse-workspace -type f | wc -l
...and is 3.6 GB in size, as shown by this:
du -sh eclipse-workspace
See also
My other answer here, where I use the above info.: how do I check that two folders are the same in linux
Other credit: I had a chat with ChatGPT to learn some of the pieces above. All work and text above, however, was written by me, tested by me, and verified by me.

Related

List file using ls to find meet the condition

I am writing a batch program to delete all files in a directory whose filenames meet a condition.
The directory contains a large number of text files (hundreds of thousands) whose names follow the fixed pattern "abc_" + date:
abc_20180820.txt
abc_20180821.txt
abc_20180822.txt
abc_20180823.txt
abc_20180824.txt
The program greps the full file list, compares each file's date to a fixed date, and deletes the file if the date in its name is earlier than the fixed date.
The problem is that it takes very long to handle that many files (about 1 hour to delete 300k files).
My question: is there a way to compare the date while running the ls command? Rather than getting all files in a list and then comparing and deleting, I would like to list only the files that already meet the condition and then delete them. I think that would perform better.
My code is
TARGET_DATE = "5-12"
DEL_DATE = "20180823"
ls -t | grep "[0-9]\{8\}".txt\$ > ${LIST}
for EACH_FILE in `cat ${LIST}` ;
do
DATE=`echo ${EACH_FILE} | cut -c${TARGET_DATE }`
COMPARE=`expr "${DATE}" \< "${DEL_DATE}"`
if [ $COMPARE -eq 1 ] ;
then
rm -f ${EACH_FILE}
fi
done
I found a similar question, but I don't know how to apply it to my case:
List file using ls with a condition and process/grep files that only whitespaces

Here is a refactoring which gets rid of the pesky ls. Looping over a large directory is still going to be somewhat slow.
# Use lowercase for private variables
# to avoid clobbering a reserved system variable
# You can't have spaces around the equals sign
del_date="20180823"
# No need for ls here
# No need for a temporary file
for filename in *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt
do
# Avoid external process; use the shell's parameter substitution
date=${filename%.txt}
# This could fail if the file name contains literal shell metacharacters!
date=${date#${date%????????}}
# Avoid expr
if [ "$date" -lt "$del_date" ]; then
# Just print the file name, null-terminated for xargs
printf '%s\0' "$filename"
fi
done |
# For efficiency, do batch delete
xargs -r0 rm
The wildcard expansion will still take a fair amount of time because the shell will sort the list of filenames. A better solution is probably to refactor this into a find command which avoids the sorting.
find . -maxdepth 1 -type f \( \
-name '*1[89][0-9][0-9][0-9][0-9][0-9][0-9].txt' \
-o -name '*201[0-7][0-9][0-9][0-9][0-9].txt' \
-o -name '*20180[1-7][0-9][0-9].txt' \
-o -name '*201808[01][0-9].txt' \
-o -name '*2018082[0-2].txt' \
\) -delete
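If the date patterns become unwieldy, the same unsorted traversal can be sketched in Python with os.scandir, comparing the date embedded in the name directly. The function name delete_older_than, the regular expression, and the dry_run switch are assumptions made for this illustration.

```python
import os
import re

# Eight digits between an underscore and ".txt", as in abc_20180820.txt
DATE_IN_NAME = re.compile(r"_(\d{8})\.txt$")


def delete_older_than(directory, cutoff, dry_run=True):
    """Remove files whose embedded YYYYMMDD date sorts before cutoff.

    Returns the matching names; nothing is deleted while dry_run is True.
    """
    matched = []
    with os.scandir(directory) as entries:  # streams entries, no sorting
        for entry in entries:
            found = DATE_IN_NAME.search(entry.name)
            # Fixed-width digit strings compare like numbers.
            if found and found.group(1) < cutoff and entry.is_file(follow_symlinks=False):
                matched.append(entry.name)
    if not dry_run:
        for name in matched:
            os.unlink(os.path.join(directory, name))
    return matched
```

Running delete_older_than(".", "20180823") first shows what would be removed; passing dry_run=False performs the deletion.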

You could do something like:
rm abc_201[0-7]*.txt # remove all files from 2010-2017
rm abc_20180[1-4]*.txt # remove all files from Jan-Apr 2018
# And so on
...
to remove a large number of files. Then your code would run faster.

Yes, it takes a lot of time when you have that many files in one folder.
It is a bad idea to keep so many files in one folder. Even a simple ls or find puts a heavy load on the storage, and scripts that iterate over the files make it worse.
So after you have waited an hour for the cleanup, take the time to create a better folder structure. It is a good idea to sort the files by year/month/day, possibly down to hours,
e.g.
somefolder/2018/08/24/...files here
Then you can easily delete, move, or compress a whole month or year.
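A one-off migration into such a layout could look roughly like the Python sketch below. The helper names dated_destination and file_into_dated_dirs are hypothetical, the filename pattern is taken from the question (abc_YYYYMMDD.txt), and os.replace assumes source and destination are on the same filesystem.

```python
import os
import re

DATE_IN_NAME = re.compile(r"_(\d{4})(\d{2})(\d{2})\.txt$")


def dated_destination(base_dir, filename):
    """Map abc_20180824.txt to base_dir/2018/08/24/abc_20180824.txt."""
    found = DATE_IN_NAME.search(filename)
    if found is None:
        return None  # name carries no date; leave the file alone
    year, month, day = found.groups()
    return os.path.join(base_dir, year, month, day, filename)


def file_into_dated_dirs(source_dir, base_dir):
    """Move every dated file from source_dir into the year/month/day tree."""
    with os.scandir(source_dir) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]
    moved = 0
    for name in names:
        target = dated_destination(base_dir, name)
        if target is None:
            continue
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.replace(os.path.join(source_dir, name), target)
        moved += 1
    return moved
```

After the move, removing a whole month is a single recursive delete of one directory instead of a scan over hundreds of thousands of names.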

I found a solution in this thread.
https://unix.stackexchange.com/questions/199554/get-files-with-a-name-containing-a-date-value-less-than-or-equal-to-a-given-inpu
The awk command is so powerful that it only takes me about 1 minute to deal with hundreds of thousands of files (1/10 compared to the loop).
ls | awk -v date="$DEL_DATE" '$0 <= date' | xargs rm -vrf
I can even count, copy, or move files with that command; it is the fastest answer I've seen.
COUNT="$(ls | awk -v date="${DEL_DATE}" '$0 <= date' | xargs rm -vrf | wc -l)"

shell - faster alternative to "find"

I'm writing a shell script which should output the oldest file in a directory.
This directory is on a remote server and has (worst case) between 1000 and 1500 (temporary) files in it. I have no access to the server and I have no influence on how the files are stored. The server is connected through a stable but not very fast line.
The result of my script is passed to a monitoring system which in turn alerts the staff if there are too many (= unprocessed) files in the directory.
Unfortunately the monitoring system only allows a maximum execution time of 30 seconds for my script before a timeout occurs.
This wasn't a problem when testing with small directories. Testing with the target directory over the remote mount (approx. 1000 files), it is.
So I'm looking for the fastest way to get things like "the oldest / newest / largest / smallest" file in a directory (not recursive) without using 'find' or sorting the output of 'ls'.
Currently I'm using this statement in my sh script:
old)
# return oldest file (age in seconds)
oldest=`find $2 -maxdepth 1 -type f | xargs ls -tr | head -1`
timestamp=`stat -f %B $oldest`
curdate=`date +%s`
echo `expr $(($curdate-$timestamp))`
;;
and I tried this one:
gfind /livedrive/669/iwt.save -type f -printf "%T@ %P\n" | sort -nr | tail -1 | cut -d' ' -f 2-
which are two of the many variants of statements one can find using Google.
Additional information:
I'm writing this on a FreeBSD box with sh and bash installed. I have full access to the box and can install programs if needed. For reference: gfind is the GNU "find" utility as known from Linux, as FreeBSD has a different "find" installed by default.
Any help is appreciated.
with kind regards,
dura-zell

For the oldest/newest file issue, you can use -t option to ls which sorts the output using the time modified.
-t Sort by descending time modified (most recently modified first).
If two files have the same modification timestamp, sort their
names in ascending lexicographical order. The -r option reverses
both of these sort orders.
For the size issue, you can use -S to sort file by size.
-S Sort by size (largest file first) before sorting the operands in
lexicographical order.
Notice that for both cases, -r will reverse the order of the output.
-r Reverse the order of the sort.
Those options are available on FreeBSD and Linux; and must be pretty common in most implementations of ls.
Let us know if it's fast enough.

In general, you shouldn't be parsing the output of ls. In this case, it's just acting as a wrapper around stat anyway, so you may as well just call stat on each file, and use sort to get the oldest.
old) now=$(date +%s)
read name timestamp < <(stat -f "%N %B" "$2"/* | sort -k2,2n)
echo $(( $now - $timestamp ))
The above is concise, but doesn't distinguish between regular files and directories in the glob. If that is necessary, stick with find, but use a different form of -exec to minimize the number of calls to stat:
old ) now=$(date +%s)
read name timestamp < <(find "$2" -maxdepth 1 -type f -exec stat -f "%N %B" '{}' + | sort -k2,2n)
echo $(( $now - $timestamp ))
(Neither approach works if a filename contains a newline, although since you aren't using the filename in your example anyway, you can avoid that problem by dropping %N from the format and just sorting the timestamps numerically. For example:
read timestamp < <(stat -f %B "$2"/* | sort -n)
# or
read timestamp < <(find "$2" -maxdepth 1 -type f -exec stat -f %B '{}' + | sort -n)
)
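The stat -f format above is the BSD syntax. On a system with GNU coreutils, a rough equivalent might look like the sketch below. It is a variant under stated assumptions, not the code from this answer: oldest_age_seconds is a made-up name, it uses GNU stat's -c %Y (modification time) instead of %B (birth time), and it assumes a find that supports -maxdepth.

```shell
# Hypothetical GNU/Linux variant of the snippet above.
# Prints the age, in seconds, of the least recently modified regular file
# directly inside directory $1. Returns 1 if there is no such file.
oldest_age_seconds() {
    now=$(date +%s)
    # %Y is GNU stat's modification time in seconds since the epoch.
    oldest=$(find "$1" -maxdepth 1 -type f -exec stat -c %Y {} + | sort -n | head -n 1)
    [ -n "$oldest" ] || return 1
    echo $(( now - oldest ))
}
```

Only timestamps are printed and sorted, so unusual file names do not affect the result.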

Can you try creating a shell script that resides on the remote host and, when executed, provides the required output? Then from your local machine just use ssh or something similar to run it. That way the script runs locally on the remote host. Just a thought :-)

Rm and Egrep -v combo

I want to remove all the logs except the current log and the one before it.
A new log file is created every 20 minutes, so the file names look like
abc_23_19_10_3341.log
abc_23_19_30_3342.log
abc_23_19_50_3241.log
abc_23_20_10_3421.log
where 23 is today's date (yesterday's date might also appear),
19 is the hour (7 o'clock), and 10, 30, 50, 10 are the minutes.
In this case I want to keep abc_23_20_10_3421.log, which is the current log (the one currently being written), and abc_23_19_50_3241.log (the previous one),
and remove the rest.
I got it to work by creating a folder, putting the first files in that folder, removing the files, and then deleting it. But that's too long...
I also tried this
files_nodelete=`ls -t | head -n 2 | tr '\n' '|'`
rm *.txt | egrep -v "$files_nodelete"
but it didn't work. If I put ls instead of rm, though, it works.
I am an amateur in Linux, so please suggest a simple idea or logic. I tried xargs rm but it didn't work.
I also read about mtime, but it seems a bit complicated since I am new to Linux.
I'm working on a Solaris system.

Try the logadm tool in Solaris, it might be the simplest way to rotate logs. If you just want to get things done, it will do it.
http://docs.oracle.com/cd/E23823_01/html/816-5166/logadm-1m.html

If you want a solution similar to your attempt (but working), try this:
ls abc*.log | sort | head -n-2 | xargs rm
ls abc*.log: lists all files matching the pattern abc*.log
sort: sorts this list lexicographically (by name), from oldest to newest log file
head -n-2: returns all but the last two entries in the list (GNU head accepts a negative count for -n)
xargs rm: composes the rm command with the entries from stdin
If there are two or fewer files in the directory, this command will return an error like
rm: missing operand
and will not delete any files.
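Negative counts for head -n are a GNU extension, so on a system whose head lacks them (the question mentions Solaris) the same idea can be expressed by counting the files first. A minimal sketch, assuming the log names contain no whitespace or newlines and sort correctly by name; old_logs is a made-up helper name:

```shell
# Hypothetical helper: print every abc*.log in directory $1 except the last
# $2 (default 2) in name order. Prints nothing if there are too few files.
old_logs() {
    (
        cd "$1" || exit 1
        keep=${2:-2}
        total=$(ls abc*.log 2>/dev/null | wc -l)
        extra=$(( total - keep ))
        # Only emit names when there is something beyond the files to keep.
        [ "$extra" -gt 0 ] || exit 0
        ls abc*.log | sort | head -n "$extra"
    )
}
```

The function only prints names. Once the list looks right, it can be run from inside the log directory as old_logs . | xargs rm.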

It is usually not a good idea to use ls to point to files. Some files may cause havoc (files which have a newline or a weird character in their name are the usual examples).
Using shell globs: here is an interesting way: we count the files newer than the one we are about to remove!
pattern='abc*.log'
for i in $pattern ; do
[ -f "$i" ] || break ;
#determine if this is the most recent file, in the current directory
# [I add -maxdepth 1 to limit the find to only that directory, no subdirs]
if [ $(find . -maxdepth 1 -name "$pattern" -type f -newer "$i" -print0 | tr -cd '\000' | tr '\000' '+' | wc -c) -gt 1 ];
then
#there are 2 files more recent than $i that match the pattern
#we can delete $i
echo rm "$i" # remove the echo only when you are 100% sure that you want to delete all those files !
else
echo "$i is one of the 2 most recent files matching '${pattern}', I keep it"
fi
done
I only use the globbing mechanism to feed file names to find, and just use the terminating NUL of -print0 to count the file names it outputs (thus I have no problems with any special characters in those file names; I just need to know how many files were output).
tr -cd '\000' keeps only the \000, i.e. the terminating NUL character output by -print0. Then I translate each \000 to a single + character and count them with wc -c. If I see 0, "$i" was the most recent file. If I see 1, "$i" was the second most recent (so find sees only the most recent one). And if I see more than 1, the 2 files (matching the pattern) that we want to keep are both newer than "$i", so we can delete "$i".
I'm sure someone will step in with a better one, but the idea could be reused, I guess...

Thanks, guys, for all the answers.
I found my answer:
files=`ls -t *.txt | head -n 2 | tr '\n' '|' | rev |cut -c 2- |rev`
rm `ls -t | egrep -v "$files"`
Thank you for the help

Operating on multiple results from find command in bash

Hi, I'm a novice Linux user. I'm trying to use the find command in bash to search through a given directory tree, which contains multiple files of the same name but with varying content, to find a maximum value within the files.
Initially I wasn't taking the directory as input and knew the file wouldn't be less than 2 directories deep so I was using nested loops as follows:
prev_value=0
for i in <directory_name> ; do
if [ -d "$i" ]; then
cd $i
for j in "$i"/* ; do
if [ -d "$j" ]; then
cd $j
curr_value=`grep "<keyword>" <filename>.txt | cut -c32-33` #gets value I'm comparing
if [ $curr_value -lt $prev_value ]; then
curr_value=$prev_value
else
prev_value=$curr_value
fi
fi
done
fi
done
echo $prev_value
Obviously that's not going to cut it now. I've looked into the -exec option of find but since find is producing a vast amount of results I'm just not sure how to handle the variable assignment and comparisons. Any help would be appreciated, thanks.

find "${DIRECTORY}" -name "${FILENAME}.txt" -print0 | xargs -0 -L 1 grep "${KEYWORD}" | cut -c32-33 | sort -nr | head -n1
We find the filenames that are named FILENAME.txt (FILENAME is a bash variable) that exist under DIRECTORY.
We print them all out, separated by nulls (this avoids any problems with certain characters in directory or file names).
Then we read them all in again using xargs, and pass the null-separated (-0) values as arguments to grep, launching one grep for each filename (-L 1 - let's be POSIX-compliant here). (I do that to avoid grep printing the filenames, which would screw up cut).
Then we sort all the results, numerically (-n), in descending order (-r).
Finally, we take the first line (head -n1) of the sorted numbers - which will be the maximum.
P.S. If you have 4 CPU cores you can try adding the -P 4 option to xargs to try to make the grep part of it run faster.
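If your grep supports -h (GNU and BSD grep both do; it suppresses the file name prefix), the one-grep-per-file step can be avoided by letting find batch the files itself. A sketch under that assumption, with max_value as a made-up function name and the same column positions as above:

```shell
# Hypothetical wrapper: $1 = directory, $2 = file name, $3 = keyword.
# Prints the largest number found in columns 32-33 of the matching lines.
max_value() {
    # -exec ... {} + passes many files to each grep; -h omits file names.
    find "$1" -type f -name "$2" -exec grep -h -- "$3" {} + |
        cut -c32-33 |
        sort -nr |
        head -n 1
}
```

Called as max_value "$DIRECTORY" "$FILENAME.txt" "$KEYWORD", it handles unusual file names because find hands them to grep directly rather than through a pipe.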

Clearing archive files with linux bash script

Here is my problem,
I have a folder in which multiple files are stored, named in a specific format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There could be multiple files with the same name but not the same time of course.
What I want is a script that keeps the 3 newest versions of each file.
So, I found one example there:
Deleting oldest files with shell
However, I don't want to delete a fixed number of files; I want to keep a certain number of the newest ones. Is there a way to have that find command parse the Name_of_file and keep the 3 newest?
Here is the code I've tried so far, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for the help!
So I decided to add my final solution in case anyone would like to have it. It's a combination of the 2 solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficient. If anything about the date or name pattern changes, just change the grep -P pattern to match it. This way you can be sure that only the files fitting this pattern get deleted.

Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
The ls -t lists the files by modification time, newest first.
The awk 'NR > 3' prints the list of files except for the first three lines, which are the three newest.
The xargs rm will remove the files that are older than the first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed 's/MM-DD-YYYY-HH:MM*$//' | sort -u | while read file
do
ls -t $file* | awk 'NR > 3' | xargs rm
done
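The ls will list all of the files in the directory. The sed 's/MM-DD-YYYY-HH:MM*$//' will remove the date-time stamp from the file names (MM-DD-YYYY-HH:MM here stands in for a regular expression matching your actual timestamp format). The sort -u will make sure you only have the unique file names. Thus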
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are passed through the loop, and ls -t $file* lists all of the files that start with that file name and suffix, newest first. That list is piped to awk, which strips out the newest three, and the remainder is piped to xargs rm, so everything but the newest three is deleted.
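To make the loop above concrete, here is a sketch in which the MM-DD-YYYY-HH:MM placeholder is spelled out as a regular expression. It only prints the files that would be removed. The function name stale_versions is made up, and the same caveat applies as before: it parses ls output, so it assumes file names without whitespace or newlines.

```shell
# Hypothetical helper: $1 = directory, $2 = versions to keep (default 3).
# Prints, per base name, every file beyond the $2 most recently modified.
stale_versions() {
    (
        cd "$1" || exit 1
        keep=${2:-3}
        # Strip a trailing MM-DD-YYYY-HH:MM stamp; -n with the p flag skips
        # any file that does not carry such a stamp.
        ls |
            sed -n 's/[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9]:[0-9][0-9]$//p' |
            sort -u |
            while IFS= read -r base; do
                # Newest first; awk drops the first $keep lines.
                ls -t -- "$base"[0-9]* | awk -v keep="$keep" 'NR > keep'
            done
    )
}
```

Because it sorts with ls -t, "newest" here means newest by modification time, not by the date embedded in the name.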

Assuming we're using the date in the filename to date the archive file, and that it is possible to change the date format to YYYY-MM-DD-HH:MM (as established in comments above), here's a quick and dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3 # number of versions to keep
while read FNAME; do
NODATE=${FNAME:0:-16} # get filename without the date (remove last 16 chars)
if [ "$NODATE" != "$LASTSEEN" ]; then # new file found
FOUND=1; LASTSEEN="$NODATE"
else # same file, different date
let FOUND="FOUND + 1"
if [ $FOUND -gt $KEEP ]; then
echo "- Deleting older file: $FNAME"
rm "$FNAME"
fi
fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me@home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me@home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me@home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, by changing the dates in the file names to the form YYYY-MM-DD-HH:MM, a normal string sort (such as that done by ls) will automatically group similar files together, sorted by date-time.
The ls -r on the last line simply lists all files within the current working directory and prints the results in reverse order, so newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped through (see the while loop) and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).
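The same counting logic can also be written as a filter, which avoids the ${FNAME:0:-16} expansion (negative lengths need a reasonably recent bash). This is only a sketch: stale_by_name is a made-up name, it prints the candidates instead of deleting them, and it still assumes the YYYY-MM-DD-HH:MM suffix and file names without newlines.

```shell
# Hypothetical filter: reads file names on stdin, prints those that are not
# among the $1 (default 3) newest versions of their base name.
stale_by_name() {
    keep=${1:-3}
    # Reverse name order puts the newest timestamp of each base name first.
    LC_ALL=C sort -r |
        awk -v keep="$keep" '
            # Only consider names ending in YYYY-MM-DD-HH:MM (16 characters).
            /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]:[0-9][0-9]$/ {
                base = substr($0, 1, length($0) - 16)
                if (++seen[base] > keep) print
            }'
}
```

Run as ls | stale_by_name 3 on the example listing above, it prints some_file.exe2012-01-11-23:11 and some_file.exe2011-12-12-12:11, the same two files the script deletes.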

This pipeline will get you the 3 newest files (by modification time) in the current directory:
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-
