How to combine `stat` and `md5sum` output line by line? - linux

stat part:
$ find * -depth -exec stat --format '%n %U %G' {} + | sort -d > acl_file
$ cat acl_file
xfce4/desktop/icons screen0-3824x1033.rc john john
Code/CachedData/f30a9b73e8ffc278e71575118b6bf568f04587c8/index-ec362010a4d520491a88088c200c853d.code john john
VirtualBox/selectorwindow.log.6 john john
md5sum part:
$ find * -depth -exec md5sum {} + | sort -d > md5_file
$ cat md5_file
3da180c2d9d1104a17db0749d527aa4b xfce4/desktop/icons screen0-3824x1033.rc
3de44d64a6ce81c63f9072c0517ed3b9 Code/CachedData/f30a9b73e8ffc278e71575118b6bf568f04587c8/index-ec362010a4d520491a88088c200c853d.code
3f85bb5b59bcd13b4fc63d5947e51294 VirtualBox/selectorwindow.log.6
How can I combine stat --format '%n %U %G' and md5sum and write the output to a file line by line, such as:
3da180c2d9d1104a17db0749d527aa4b xfce4/desktop/icons screen0-3824x1033.rc john john
3de44d64a6ce81c63f9072c0517ed3b9 Code/CachedData/f30a9b73e8ffc278e71575118b6bf568f04587c8/index-ec362010a4d520491a88088c200c853d.code john john
3f85bb5b59bcd13b4fc63d5947e51294 VirtualBox/selectorwindow.log.6 john john

This is really just a minor variation on @Zilog80's solution. In my timing tests it was a few seconds faster by skipping reads, on a smallish dataset of a few hundred files, running on a Windows laptop under Git Bash. YMMV.
mapfile -t lst < <( find . -type f -exec md5sum "{}" \; -exec stat --format '%U %G' "{}" \; )
for ((i=0; i < ${#lst[@]}; i++)); do if (( i%2 )); then echo "${lst[i]}"; else printf "%s " "${lst[i]}"; fi; done | sort -d
Edit:
My original solution was pretty broken. It was skipping files in hidden subdirectories, and the printf botched filenames with spaces. If you don't have hidden directories to deal with, or if you want to skip those (e.g., you're working in a git repo and would rather skip the .git tree...), here's a rework.
shopt -s dotglob   # include hidden files in the glob
shopt -s globstar  # process at arbitrary depth
for f in **/*; do  # this properly handles odd names
    [[ -f "$f" ]] && echo "$(md5sum "$f") $(stat --format "%U %G" "$f")"
done | sort -d

The quickest way should be:
find * -type f -exec stat --format '%n %U %G' "{}" \; -exec md5sum "{}" \; |
    { while read -r line1 && read -r line2; do printf "%s %s\n" "${line2/ */}" "${line1}"; done; } |
    sort -d
We use two -exec clauses to apply stat and md5sum file by file, then we read both output lines and use printf to format one output line per file containing the output of both stat and md5sum. We finally pipe the whole output to sort.
Warning: as we pipe the whole output to sort, you may have to wait until all the stat/md5sum calls have finished before seeing any output on the console.
And if only md5sum, and not stat, fails on a file (or vice versa), the output will be garbled.
Edit: a slightly safer way to produce the output:
find * -type f -exec md5sum "{}" \; -exec stat --format '%n %U %G' "{}" \; |
    { while read -r line; do
          mdsum="${line/[0-9a-f]* /}";
          [ "${mdsum}" != "${line}" ] &&
              { mdsumdisp="${line% ${mdsum}}"; mdsumfile="${mdsum}"; } ||
              { [ "${line#${mdsumfile}}" != "${line}" ] &&
                    printf "%s %s\n" "${mdsumdisp}" "${line}"; };
      done; } | sort -d
Here, at least, we check that a line contains something that looks like an md5sum, and that the stat line which follows refers to the same file, before printing the pair.
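Alternatively, here's a hedged sketch (not taken from the answers above) that pairs the two commands per file inside a single -exec, so the hash and the ownership can never get out of step. It assumes GNU md5sum/stat and file names without newlines:
find . -type f -exec bash -c '
    for f; do
        sum=$(md5sum "$f") || continue                  # "hash  filename"
        own=$(stat --format "%U %G" "$f") || continue   # "owner group"
        printf "%s %s\n" "$sum" "$own"                  # hash  filename owner group
    done
' _ {} + | sort -d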

Related

Bash command to recursively find directories with the newest file older than 3 days

I was wondering if there was a single command which would recursively find the directories in which the newest file is older than 3 days. Other solutions seem to only print the newest file in all subdirectories; I was wondering if there was a way to do it recursively and print all the subdirectories. I tried find -newermt "aug 27, 2022" -ls but this only gets me directories that have files newer than the date specified, not the newest file for each directory.
A long one-liner to sort files by date, get unique directory names, and list each directory by modification time, keeping the first entry:
find ~/.config -type f -newermt "aug 29, 2022" -print0 | xargs -r0 ls -l --time-style=+%s | sort -r -k 6 | gawk '{ print $7}' | xargs -I {} dirname {} | sort | uniq | xargs -I {} bash -c "ls -lt --time-style=full-iso {} | head -n2" | grep -v 'total '
With comments
find ~/.config -type f -newermt "aug 29, 2022" -print0 |
xargs -r0 ls -l --time-style=+%s | sort -r -k 6 |                    # newer files sorted by reverse date
gawk '{ print $7}' | xargs -I {} dirname {} |                        # get directory names
sort | uniq |                                                        # get unique directory names
xargs -I {} bash -c "ls -lt --time-style=full-iso {} | head -n2" |   # list each directory by time, keep first
grep -v 'total '
If I'm understanding the requirements correctly, would you please try:
#!/bin/bash
find dir -type d -print0 | while IFS= read -r -d "" d; do           # traverse "dir" recursively for subdirectories, assigning "$d" to each directory name
    if [[ -n $(find "$d" -maxdepth 1 -type f) \
       && -z $(find "$d" -maxdepth 1 -type f -mtime -3) ]]; then     # if "$d" contains file(s) and does not contain files newer than 3 days
        echo "$d"                                                    # then print the directory name "$d"
    fi
done
A one-liner version:
find dir -type d -print0 | while IFS= read -r -d "" d; do if [[ -n $(find "$d" -maxdepth 1 -type f) && -z $(find "$d" -maxdepth 1 -type f -mtime -3) ]]; then echo "$d"; fi; done
Please modify the top directory name dir according to your file location.
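For comparison, here is a hedged single-pass sketch under the assumption of GNU find/awk and directory names without tabs or newlines: record the newest regular-file mtime per directory, then print the directories whose newest file is more than 3 days old.
now=$(date +%s)
find dir -type f -printf '%T@\t%h\n' |
    awk -F'\t' -v now="$now" '
        $1 > newest[$2] { newest[$2] = $1 }        # keep the newest mtime per directory
        END { for (d in newest)
                  if (now - newest[d] > 3*24*3600) print d }
    '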

Linux: what's a fast way to find all duplicate files in a directory?

I have a directory with many subdirs and about 7000+ files in total. What I need is to find all duplicates of all files. For any given file, its duplicates might be scattered around various subdirs and may or may not have the same file name. A duplicate is any file for which the diff command returns exit code 0.
The simplest thing to do is to run a double loop over all the files in the directory tree. But that's 7000^2 sequential diffs and not very efficient:
for f in `find /path/to/root/folder -type f`
do
    for g in `find /path/to/root/folder -type f`
    do
        if [ "$f" = "$g" ]
        then
            continue
        fi
        diff "$f" "$g" > /dev/null
        if [ $? -eq 0 ]
        then
            echo "$f" MATCHES "$g"
        fi
    done
done
Is there a more efficient way to do it?
On Debian 11:
% mkdir files; (cd files; echo "one" > 1; echo "two" > 2a; cp 2a 2b)
% find files/ -type f -print0 | xargs -0 md5sum | tee listing.txt | \
awk '{print $1}' | sort | uniq -c | awk '$1>1 {print $2}' > dups.txt
% grep -f dups.txt listing.txt
c193497a1a06b2c72230e6146ff47080 files/2a
c193497a1a06b2c72230e6146ff47080 files/2b
Find and print all files null terminated (-print0).
Use xargs to md5sum them.
Save a copy of the sums and filenames in "listing.txt" file.
Grab the sums with awk and pass them through sort and uniq -c to count occurrences.
Use the second awk to keep only sums that appear more than once, saving them into the "dups.txt" file; then grep listing.txt for those sums to print each duplicate's sum and filename.
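A hedged refinement, if hashing everything becomes slow: since a file with a unique size cannot have a duplicate, you can hash only files whose sizes collide (assumes GNU find/awk/xargs/uniq and file names without newlines):
find /path/to/root/folder -type f -printf '%s %p\n' |
    awk '{ size[$1]++; line[NR] = $0 }
         END { for (i = 1; i <= NR; i++) {
                   split(line[i], a, " ")
                   if (size[a[1]] > 1) {              # size seen more than once
                       sub(/^[0-9]+ /, "", line[i])   # keep just the path
                       print line[i]
                   }
               } }' |
    xargs -rd '\n' md5sum |
    sort | uniq -w32 --all-repeated=separate           # group files sharing a hash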

Copy files with date/time range in filename

I have a bash script, which contains the following lines:
for ((iTime=starttime;iTime<=endtime;iTime++))
do
find . -name "*${iTime}*" -exec cp --parents \{\} ${dst} \;
done
I have a structure with a few folders, including subfolders, and many files at the bottom of the tree. These files are labeled with date and time info in the filename, like "filename_2021063015300000_suffix". The time is in the format yyyymmddhhmmss plus two digits for 1/10 and 1/100 seconds. I have a lot of files, which means that my approach is very slow. The files have a time distance of a few minutes, so only a couple of files (e.g. 10 per subfolder out of >10000) should be copied.
How can I find all the files in the time range and get them all in one find and copy command? Maybe get a list of all the files to copy with one find command and then copy that list of file paths? But how can I do this?
If your time span is reasonably limited, just inline the acceptable file names into the single find command.
find . \( -false $(for ((iTime=starttime;iTime<=endtime;iTime++)); do printf ' %s' -o -name "*$iTime*"; done) \) -exec cp --parents \{\} ${dst} \;
The initial -false predicate inside the parentheses is just to simplify the following predicates so that they can all start with -o -name.
This could end up with an "argument list too long" error if your list of times is long, though. Perhaps a more robust solution is to pass the time resolution into the command.
find . -type f -exec bash -c '
    for f; do
        for ((iTime=starttime; iTime<=endtime; iTime++)); do   # starttime and endtime must be exported (or inlined) to be visible in this child shell
            if [[ $f == *"$iTime"* ]]; then
                cp --parents "$f" "$0"
                break
            fi
        done
    done' "$dst" {} +
The script inside -exec could probably be more elegant; if your file names have a reasonably regular format, maybe just extract the timestamp and compare it numerically to check whether it's in range (a sketch follows below). Perhaps also notice how we abuse the $0 parameter after bash -c '...' to pass in the value of $dst.
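For illustration, a hedged sketch of that numeric-comparison idea. It assumes the 16-digit stamp sits between underscores, and that starttime, endtime and dst (the question's own variables) are exported so the child bash can see them:
export starttime endtime dst
find . -type f -exec bash -c '
    for f; do
        [[ $f =~ _([0-9]{16})_ ]] || continue                   # skip names without a stamp
        ts=${BASH_REMATCH[1]}
        (( ts >= starttime && ts <= endtime )) && cp --parents "$f" "$dst"
    done
' _ {} +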
Lose the find. I created -
filename_2020063015300000_suffix
filename_2021053015300000_suffix
filename_2021063015300000_suffix
filename_2022063015300000_suffix
foo/filename_2021053015312345_suffix
bar/baz/filename_2021053015310101_suffix
So if I execute
starttime=2021000000000000
endtime=2022000000000000
shopt -s globstar
for f in **/*_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_*; do # for all these
    ts=${f//[^0-9]/}                        # trim to date
    (( ts >= starttime )) || continue       # skip too old
    (( ts <= endtime )) || continue         # skip too new
    echo "$f"                               # list matches
done | xargs -I{} echo cp {} /new/dir/      # pass to xargs
I get
cp bar/baz/filename_2021053015310101_suffix /new/dir/
cp filename_2021053015300000_suffix /new/dir/
cp filename_2021063015300000_suffix /new/dir/
cp foo/filename_2021053015312345_suffix /new/dir/
There are ways to simplify that glob. If you use extglob you can make it shorter, and check more carefully with a regex - for example,
shopt -s globstar extglob
for f in **/*_+([0-9])_*; do
    [[ "$f" =~ _[0-9]{16}_ ]] || continue;
It starts looking complicated and hard to maintain for the next guy, though.
Try these; replace dst, starttime, and endtime for your case. Both work for me on Ubuntu 16.04.
find . -type f -regextype sed -regex "[^_]*_[0-9]\{16\}_[^_]*" -exec bash -c 'dt=$(echo "$0" | grep -oP "\d{16}"); [ "$dt" -gt "$2" ] && [ "$dt" -lt "$3" ] && cp -p "$0" "$1"' {} 'dst/' 'starttime' 'endtime' \;
$0 is the filename which contains the datetime, $1 is the dst directory path, $2 is starttime, $3 is endtime.
Or
find . -type f -regextype sed -regex "[^_]*_[0-9]\{16\}_[^_]*" | awk -v dst='/tmp/test_find/' '{if (0 == system("[ $(echo \"" $0 "\"" " | grep -oP \"" "(?<=_)\\d+(?=_)\") -gt starttime ] && [ $(echo \"" $0 "\"" " | grep -oP \"" "(?<=_)\\d+(?=_)\") -lt endtime ]")) {system("cp -p " $0 " " dst)}}'
Both of them first use find with a sed regex to match file names which have a pattern like _2021063015300000_ (maybe this has 16 digits, though you say the pattern format yyyymmddhhmmss only has 14 digits).
The first then uses -exec bash -c to get the datetime from the filename, compare it against the times, and execute the cp action.
The second uses awk to get the datetime and compare it with the start and end time via the system command, and also executes the cp to the dst directory at the end, again via system.
PS: this pattern depends on the filename having the datetime as the only text between two _.

assigning files in a directory to sub-directories

I have thousands of files in a directory and I want to be able to divide them into sub-directories, with each sub-directory containing a specific number of files. I don't care which files go into which directories, just as long as each contains a specific number. All the file names have a common ending (e.g. .txt) but what goes before varies.
Does anyone know an easy way to do this?
Assuming you only have files ending in *.txt, no hidden files and no directories:
#!/bin/bash
shopt -s nullglob
maxf=42
files=( *.txt )
for ((i=0; maxf*i<${#files[@]}; ++i)); do
    s=subdir$i
    mkdir -p "$s"
    mv -t "$s" -- "${files[@]:i*maxf:maxf}"
done
This will create directories subdirX with X an integer starting from 0, and will put 42 files in each directory.
You can tweak the thing to have padded zeroes for X:
#!/bin/bash
shopt -s nullglob
files=( *.txt )
maxf=42
((l=${#files[@]}/maxf))
p=${#l}
for ((i=0; maxf*i<${#files[@]}; ++i)); do
    printf -v s "subdir%0${p}d" "$i"
    mkdir -p "$s"
    mv -t "$s" -- "${files[@]:i*maxf:maxf}"
done
max_per_subdir=1000
start=1
while [ -e $(printf %03d $start) ]; do
    start=$((start + 1))
done
find -maxdepth 1 -type f ! -name '.*' -name '*.txt' -print0 \
    | xargs -0 -n $max_per_subdir echo \
    | while read -a files; do
        subdir=$(printf %03d $start)
        mkdir $subdir || exit 1
        mv "${files[@]}" $subdir/ || exit 1
        start=$((start + 1))
    done
How about
find *.txt -print0 | xargs -0 -n 100 | xargs -I {} echo cp {} '$(md5sum <<< "{}")' | sh
This will create several directories each containing 100 files. The name of each created directory is a md5 hash of the filenames it contains.
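Note that, as written, the cp target directory is never created first. A hedged sketch of the same idea with the mkdir step made explicit (assumes GNU xargs and file names without newlines; the 100-per-batch and hash-of-names naming follow the answer above):
printf '%s\0' *.txt |
    xargs -0 -n 100 bash -c '
        dir=$(printf "%s\n" "$@" | md5sum | cut -d" " -f1)   # hash of the batch of names
        mkdir -p "$dir"
        cp -- "$@" "$dir"/
    ' _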

Finding the number of files in a directory for all directories in pwd

I am trying to list all directories and place its number of files next to it.
I can find the total number of files with ls -lR | grep .*.mp3 | wc -l. But how can I get an output like this:
dir1 34
dir2 15
dir3 2
...
I don't mind writing to a text file or CSV to get this information if it's not possible to get it on screen.
Thank you all for any help on this.
This seems to work assuming you are in a directory where some subdirectories may contain mp3 files. It omits the top level directory. It will list the directories in order by largest number of contained mp3 files.
find . -mindepth 2 -name \*.mp3 -print0| xargs -0 -n 1 dirname | sort | uniq -c | sort -r | awk '{print $2 "," $1}'
I updated this with print0 to handle filenames with spaces and other tricky characters and to print output suitable for CSV.
find . -type f -iname '*.mp3' -printf "%h\n" | sort | uniq -c
Or, if the order (dir -> count instead of count -> dir) is really important to you:
find . -type f -iname '*.mp3' -printf "%h\n" | sort | uniq -c | awk '{print $2" "$1}'
There are probably much better ways, but this seems to work.
Put this in a shell script:
#!/bin/sh
for f in *
do
    if [ -d "$f" ]
    then
        cd "$f"
        c=`ls -l *.mp3 2>/dev/null | wc -l`
        if test $c -gt 0
        then
            echo "$f $c"
        fi
        cd ..
    fi
done
With Perl:
perl -MFile::Find -le'
    find {
        wanted => sub {
            return unless /\.mp3$/i;
            ++$_{$File::Find::dir};
        }
    }, ".";
    print "$_,$_{$_}" for
        sort {
            $_{$b} <=> $_{$a}
        } keys %_;
'
Here's yet another way that even handles file names containing unusual (but legal) characters, such as newlines:
# count .mp3 files (using GNU find)
find . -xdev -type f -iname "*.mp3" -print0 | tr -dc '\0' | wc -c
# list directories with number of .mp3 files
find "$(pwd -P)" -xdev -depth -type d -exec bash -c '
for ((i=1; i<=$#; i++ )); do
d="${#:i:1}"
mp3s="$(find "${d}" -xdev -type f -iname "*.mp3" -print0 | tr -dc "${0}" | wc -c )"
[[ $mp3s -gt 0 ]] && printf "%s\n" "${d}, ${mp3s// /}"
done
' "'\\0'" '{}' +
