Strange results using Linux find

I am trying to set up a backup shell script that will run once a week on my server and keep the weekly backups for ten weeks. It all works well, except for one thing...
I have a folder that contains many rather large files, so the ten weekly backups of that folder take up quite a lot of disk space, and many of the larger files in it rarely change. So I thought I would split the backup of that folder in two: one archive for the smaller files, included in the 'normal' weekly backup (and kept for ten weeks), and one archive for the larger files that is simply updated every week, without the older weekly versions being kept.
I have used the following command for the larger files:
/usr/bin/find /other/projects -size +100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_LARGE.tar
That works as expected. The tar -v option is there for debugging. However, when archiving the smaller files, I use a similar command:
/usr/bin/find /other/projects -size -100M -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
Where $FILE_END is the weekly number. The line above does not work. I ran the script the other day and it took hours and produced a file of 70 GB, though the expected output size is about 14 GB (there are a lot of files). There seems to be some duplication of files in the archive, though I have not been able to check that fully. Yesterday I ran the command above for the smaller files from the command line, and I could see that files I know to be larger than 100 MB were included.
However, just now I ran find /other/projects -size -100M from the command line and that produced the expected list of files.
So, if anyone has any ideas about what I am doing wrong, I would really appreciate tips or pointers. The file names include spaces and all sorts of characters, e.g. single quotes, in case that has something to do with it.
The only thing I can think of is that I am not using xargs properly, and admittedly I am not very familiar with it, but I still think the problem lies in my use of find, since it is find that provides the input to xargs.

First of all, I do not know whether it is considered bad form to answer your own question, but I am doing it anyway, since I realised my error and I want to close this and hopefully help someone who runs into the same problem.
Now that I realise what I did wrong, I am frankly a bit embarrassed that I did not see it earlier, but here it is:
I did some experimental runs from the command line, and after a while I realised that the output not only listed all the files, it also listed the directories themselves. Directories are files too, of course, and they are smaller than 100M, so they were (most likely) included; and when a directory is passed to tar, everything inside it is archived too, regardless of size. This would also explain why the output file was five times larger than expected.
So, in order to overcome this I added -type f to the find command, so that it matches only regular files, and lo and behold, it worked!
To recap, the adjusted command I use for the smaller files is now:
/usr/bin/find /other/projects -size -100M -type f -print0 | /usr/bin/xargs -0 /bin/tar -rvPf /backup/PRJ-files_$FILE_END.tar
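As an aside, if the tar in use is GNU tar, the xargs step can be skipped entirely by letting tar read the NUL-separated file list from standard input; a single tar invocation then writes the whole archive, so there is no risk of the list being split across several tar runs. A sketch, assuming GNU find and GNU tar:
/usr/bin/find /other/projects -size -100M -type f -print0 | /bin/tar -cvPf /backup/PRJ-files_$FILE_END.tar --null -T -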

Related

Compare and sync two (huge) directories - consider only filenames

I want to do a one-way sync in Linux between two directories. One contains files and the other contains processed files with the same directory structure and the same filenames, but some files might be missing.
Right now I am doing:
cd $SOURCE
find * -type f | while read fname; do
    if [ ! -e "$TARGET$fname" ]
    then
        # process the file and copy it to the target. Create directories if needed.
    fi
done
which works, but is painfully slow.
Is there a better way to do this?
There are roughly 50,000,000 files, spread among directories and sub-directories. No single directory contains more than 255 files/subdirectories.
I looked at
rsync: it seems to always do a size or timestamp comparison. This will result in every file being flagged as different, since the processing takes some time and changes the file contents.
diff -qr: could not figure out how to make it ignore file sizes and contents
Edit
Valid assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (eg, size, owner, permissions, date/time last modified, etc)
we don't care about files that may reside in the target directory but without a matching file in the source directory. This is only partially true: deletions from the source are rare and happen in bulk, so I will handle that as a special case.
Assumptions:
comparisons are being made solely on directory/file names
we don't care about file metadata and/or attributes (eg, size, owner, permissions, date/time last modified, etc)
we don't care about files that may reside in the target directory but without a matching file in the source directory
I don't see a way around comparing 2x lists of ~50 million entries but we can try to eliminate the entry-by-entry approach of a bash looping solution ...
One idea:
# obtain sorted list of all $SOURCE files
srcfiles=$(mktemp)
cd "${SOURCE}"
find * -type f | sort > "${srcfiles}"
# obtain sorted list of all $TARGET files
tgtfiles=$(mktemp)
cd "${TARGET}"
find * -type f | sort > "${tgtfiles}"
# 'comm -23' => extract list of items that only exist in the first file - ${srcfiles}
missingfiles=$(mktemp)
comm -23 "${srcfiles}" "${tgtfiles}" > "${missingfiles}"
# process list of ${SOURCE}-only files
while read -r missingfile
do
    process_and_copy "${missingfile}"
done < "${missingfiles}"
'rm' -rf "${srcfiles}" "${tgtfiles}" "${missingfiles}"
This solution is (still) serial in nature so if there are a 'lot' of missing files the overall time to process said missing files could be appreciable.
With enough system resources (cpu, memory, disk throughput) a 'faster' solution would look at methods of parallelizing the work, eg:
running parallel find/sort/comm/process threads on different $SOURCE/$TARGET subdirectories (may work well if the number of missing files is evenly distributed across the different subdirectories) or ...
stick with the serial find/sort/comm but split ${missingfiles} into chunks and then spawn separate OS processes to process_and_copy the different chunks
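A rough sketch of that second idea, assuming GNU xargs, bash, and that process_and_copy is a shell function that can be exported to child shells (file names must not contain newlines, since the list is newline-delimited):
# fan the ${SOURCE}-only list out to 8 parallel workers,
# each handling the files in batches of 100
export -f process_and_copy
xargs -d '\n' -n 100 -P 8 bash -c 'for f in "$@"; do process_and_copy "$f"; done' _ < "${missingfiles}"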

How can I use the parallel command to exploit multi-core parallelism on my MacBook?

I often use the find command on Linux and macOS. I just discovered the parallel command, and I would like to combine it with find if possible, because find takes a long time when searching for a specific file in large directories.
I have searched for this information but the results are not accurate enough. There appear to be a lot of possible syntaxes, but I can't tell which one is relevant.
How do I combine the parallel command with the find command (or any other command) in order to benefit from all 16 cores that I have on my MacBook?
Update
From @OleTange's answer, I think I have found the kind of command that interests me.
So, to know more about these commands, I would like to understand the purpose of the characters {} and ::: in the following command:
parallel -j8 find {} ::: *
1) Are these characters mandatory?
2) How can I insert classical find options such as -type f or -name '*.txt'?
3) For the moment I have defined the following function in my .zshrc:
ff () {
    find $1 -type f -iname $2 2> /dev/null
}
How could I do the equivalent with a fixed number of jobs (I could also set it as a shell argument)?
Parallel processing makes sense when your work is CPU bound (the CPU does the work, and the peripherals are mostly idle) but here, you are trying to improve the performance of a task which is I/O bound (the CPU is mostly idle, waiting for a busy peripheral). In this situation, adding parallelism will only add congestion, as multiple tasks will be fighting over the already-starved I/O bandwidth between them.
On macOS, the system already indexes all your data anyway (including the contents of word-processing documents, PDFs, email messages, etc); there's a friendly magnifying glass on the menu bar at the upper right where you can access a much faster and more versatile search, called Spotlight. (Though I agree that some of the more sophisticated controls of find are missing; and the "user friendly" design gets in the way for me when it guesses what I want, and guesses wrong.)
Some Linux distros offer a similar facility; I would expect that to be the norm for anything with a GUI these days, though the details will differ between systems.
A more traditional solution on any Unix-like system is the locate command, which performs a similar but more limited task; it will create a (very snappy) index on file names, so you can say
locate fnord
to very quickly obtain every file whose name matches fnord. The index is simply a copy of the results of a find run from last night (or however you schedule the backend to run). The command is already installed on macOS, though you have to enable the back end if you want to use it. (Just run locate locate to get further instructions.)
You could build something similar yourself if you find yourself often looking for files with a particular set of permissions and a particular owner, for example (these are not features which locate records); just run a nightly (or hourly etc) find which collects these features into a database -- or even just a text file -- which you can then search nearly instantly.
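A minimal sketch of that idea, assuming GNU find's -printf (on macOS you could install GNU findutils or build the lines with stat instead); the index file name is arbitrary:
# rebuild the index (run nightly from cron, launchd, or similar)
find "$HOME" -printf '%m %u %p\n' > "$HOME/.file-index" 2>/dev/null
# later: near-instant lookups against the index,
# e.g. all of your files with mode 0600
grep '^600 ' "$HOME/.file-index"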
For running jobs in parallel, you don't really need GNU parallel, though it does offer a number of conveniences and enhancements for many use cases; you already have xargs -P. (The xargs on macOS which originates from BSD is more limited than GNU xargs which is what you'll find on many Linuxes; but it does have the -P option.)
For example, here's how to run eight parallel find instances with xargs -P:
printf '%s\n' */ | xargs -I {} -P 8 find {} -name '*.ogg'
(This assumes the wildcard doesn't match directories which contain single quotes or newlines or other shenanigans; GNU xargs has the -0 option to fix a large number of corner cases like that; then you'd use '%s\0' as the format string for printf.)
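Concretely, the NUL-separated variant sketched in that parenthesis would look like this (GNU xargs assumed):
printf '%s\0' */ | xargs -0 -I {} -P 8 find {} -name '*.ogg'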
As the parallel documentation readily explains, its general syntax is
parallel -options command ...
where {} will be replaced with the current input line (if it is missing, it will be implicitly added at the end of command ...) and the (obviously optional) ::: special token allows you to specify an input source on the command line instead of as standard input.
Anything outside of those special tokens is passed on verbatim, so you can add find options to your heart's content just by specifying them literally.
parallel -j8 find {} -type f -name '*.ogg' ::: */
I don't speak zsh but refactored for regular POSIX sh your function could be something like
ff () {
    parallel -j8 find {} -type f -iname "$2" ::: "$1"
}
though I would perhaps switch the arguments so you can specify a name pattern and a list of files to search, à la grep.
ff () {
    # "local" is not POSIX but works in many sh versions
    local pat=$1
    shift
    parallel -j8 find {} -type f -iname "$pat" ::: "$@"
}
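Usage would then look something like this (the paths are just placeholders):
# search ~/Music and ~/Videos for *.ogg files, case-insensitively, 8 jobs at a time
ff '*.ogg' ~/Music ~/Videos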
But again, spinning your disk to find things which are already indexed is probably something you should stop doing, rather than facilitate.
Just run the find jobs in the background, one per first-level path.
The example below spawns 12 background find jobs, one per first-level subdirectory:
$ for i in [A-Z]*/ ; do find "$i" -name "*.ogg" >> logfile & done
[1] 16945
[2] 16946
[3] 16947
# many lines
[1] Done find "$i" -name "*.ogg" >> logfile
[2] Done find "$i" -name "*.ogg" >> logfile
# many lines
[11] Done find "$i" -name "*.ogg" >> logfile
[12] Done find "$i" -name "*.ogg" >> logfile
$
Doing so creates many find processes, which the system will dispatch on different cores like any other processes.
Note 1: it is a somewhat crude way to do it, but it just works.
Note 2: the find command itself puts very little load on CPUs/cores; in 99% of use cases parallelizing it is simply pointless, because each find process spends its time waiting for I/O from the disks. In that situation using parallel or similar commands won't help.
As others have written find is I/O heavy and most likely not limited by your CPUs.
But depending on your disks it can be better to run the jobs in parallel.
NVMe disks are known for performing best if there are 4-8 accesses running in parallel. Some network file systems also work faster with multiple processes.
So some level of parallelization can make sense, but you really have to measure to be sure.
To parallelize find with 8 jobs running in parallel:
parallel -j8 find {} ::: *
This works best if you are in a dir that has many subdirs: Each subdir will then be searched in parallel. Otherwise this may work better:
parallel -j8 find {} ::: */*
Basically the same idea, but now using subdirs of dirs.
If you want the results printed as soon as they are found (and not after the find is finished) use --line-buffer (or --lb):
parallel --lb -j8 find {} ::: */*
To learn about GNU Parallel spend 20 minutes reading chapter 1+2 of https://doi.org/10.5281/zenodo.1146014 and print the cheat sheet: https://www.gnu.org/software/parallel/parallel_cheat.pdf
Your command line will thank you for it.
You appear to want to be able to locate files quickly in large directories under macOS. I think the correct tool for that job is mdfind.
I made a hierarchy with 10,000,000 files under my home directory, all with unique names that resemble UUIDs, e.g. 80104d18-74c9-4803-af51-9162856bf90d. I then tried to find one with:
mdfind -onlyin ~ -name 80104d18-74c9-4803-af51-9162856bf90d
The result was instantaneous, too fast to measure, so I did 100 lookups; they took under 20 s in total, so on average a lookup takes about 0.2 s.
If you actually wanted to locate 100 files, you can group them into a single search like this:
mdfind -onlyin ~ 'kMDItemDisplayName==ffff4bbd-897d-4768-99c9-d8434d873bd8 || kMDItemDisplayName==800e8b37-1f22-4c7b-ba5c-f1d1040ac736 || kMDItemDisplayName==800e8b37-1f22-4c7b-ba5c-f1d1040ac736'
and it executes even faster.
If you only know a partial filename, you can use:
mdfind -onlyin ~ "kMDItemDisplayName = '*cdd90b5ef351*'"
/Users/mark/StackOverflow/MassiveDirectory/800f0058-4021-4f2d-8f5c-cdd90b5ef351
You can also use creation dates, file types, author, video duration, or tags in your search. For example, you can find all PNG images whose name contains "25DD954D73AF" like this:
mdfind -onlyin ~ "kMDItemKind = 'PNG image' && kMDItemDisplayName = '*25DD954D73AF*'"
/Users/mark/StackOverflow/MassiveDirectory/9A91A1C4-C8BF-467E-954E-25DD954D73AF.png
If you want to know what fields you can search on, take a file of the type you want to be able to look for, and run mdls on it and you will see all the fields that macOS knows about:
mdls SomeMusic.m4a
mdls SomeVideo.avi
mdls SomeMS-WordDocument.doc
Also, unlike with locate, there is no need to update a database frequently.

gsutil copy with multithread doesn't finish copying all files

We have around 650 GB of data on google compute engine.
We need to move it to a Coldline bucket in Cloud Storage, and the best option we could find is to copy the files with gsutil in parallel mode.
The files range from a few kilobytes up to 10 MB, and there are a few million of them.
The command we used is
gsutil -m cp -r userFiles/ gs://removed-websites/
On the first run it copied around 200 GB and stopped with an error:
| [972.2k/972.2k files][207.9 GiB/207.9 GiB] 100% Done 29.4 MiB/s ETA 00:00:00
Operation completed over 972.2k objects/207.9 GiB.
CommandException: 1 file/object could not be transferred.
On the second run it stopped at almost the same place.
How can we copy these files successfully?
Also, the buckets that contain the partial data are not removed after we delete them. The console just says "preparing to delete" and nothing happens; we waited more than 4 hours. Is there any way to remove those buckets?
Answering your first question, I can propose several options, all based on splitting the data and uploading it in small portions.
You can try a distributed upload from several machines:
https://cloud.google.com/storage/docs/gsutil/commands/cp#copying-tofrom-subdirectories-distributing-transfers-across-machines
In this case you split the data into safe chunks, say 50 GB each, and upload them from several machines in parallel. But it requires extra machines, which you may not actually have.
You can still try such a split upload on a single machine, but you then need some splitting mechanism that uploads the files chunk by chunk rather than all at once. Then, if something fails, you only need to re-upload that chunk. You will also have a better view of progress and be able to localize the point of failure if something goes wrong.
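For example, a rough sketch of such a chunked, single-machine upload, iterating over the top-level subdirectories of userFiles/ (the directory layout and the retry count are assumptions):
# upload one top-level subdirectory at a time, retrying a failed chunk
# up to 3 times before moving on
for dir in userFiles/*/ ; do
    for attempt in 1 2 3 ; do
        # note: per-subdirectory copies may land under a different object
        # prefix than the original single cp -r; adjust the destination
        # if the exact layout matters
        gsutil -m cp -r "$dir" gs://removed-websites/ && break
        echo "chunk $dir failed (attempt $attempt), retrying" >&2
    done
done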
As for how you can delete them: use the same technique as for the upload. Divide the data into chunks and delete them chunk by chunk. Alternatively, you can try removing the whole project, if that is suitable for your situation.
Update 1
So, I checked the gsutil interface and it supports glob syntax. You can match, for example, 200 folders with a glob pattern and launch the command 150 times (this will upload 200 x 150 = 30,000 folders).
You can use this approach and combine it with the -m option, so it is partly what your script did, but it might work faster. It works for folder names and file names alike.
If you provide examples of the folder and file names, it would be easier to propose an appropriate glob pattern.
It could be that you are affected by gsutil issue 464. This happens when you are running multiple gsutil instances concurrently with the -m option. Apparently these instances share a state directory, which causes weird behavior.
One of the workarounds is to add parameters: -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24.
E.g.:
gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24 -m cp -r gs://my-bucket .
I've just run into the same issue, and it turns out that it's caused by the cp command running into an uncopyable file (in my case, a broken symlink) and aborting.
Problem is, if you're running a massively parallel copy with -m, the broken file may not be immediately obvious. To figure out which one it is, try a dry run rsync -n instead:
gsutil -m rsync -n -r userFiles/ gs://removed-websites/
This will clearly flag the broken file and abort, and you can fix or delete it and try again. Alternatively, if you're not interested in symlinks, just use the -e option and they'll be ignored entirely.
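For example, the symlink-skipping variant would simply be:
gsutil -m rsync -e -r userFiles/ gs://removed-websites/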

Find if any files recently changed, as fast as possible

I have a fairly large directory structure with thousands of files. I want to figure out if any have changed since a particular time. Now, I can use
find <dir> -mmin -30 -type f
..to find any files that changed in the last 30 minutes. However, this takes a few seconds to run, and I'm really not interested in (1) finding all the files that changed, or even (2) finding which files have changed. I'm only looking for a yes/no answer to "did any files change?".
I can make (1) better by using -print -quit to stop after the first file was found. However, for the case where no files have changed, the total search still takes a little while.
I was wondering if there was a quicker way to check this? Directory time stamps, maybe? I'm using ext4, if it matters.
For GNU find you can use the -quit option to stop searching after the first match.
So if you want to find out, if there is at least one file changed in the past 30 minutes, then you can run:
find . -mmin -30 -type f -print -quit
That will print out name of the first matched file and quit.
Also, if you have control over the software that uses this bunch of files, and performance is not an issue, you may add a feature that touches a timestamp file every time any file is changed or added, and then check only that timestamp file's stats.
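For a pure yes/no answer you can simply test whether that command printed anything, for example (directory and threshold are placeholders):
if [ -n "$(find /some/dir -mmin -30 -type f -print -quit)" ]; then
    echo "something changed in the last 30 minutes"
else
    echo "no changes"
fi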

How to find the timestamp of the latest modified file in a directory (recursively)?

I'm working on a process that needs to be restarted upon any change to any file in a specified directory, recursively.
I want to avoid using anything heavy, like inotify. I don't need to know which files were updated, but rather only whether or not files were updated at all. Moreover, I don't need to be notified of every change, but rather only to know if any changes have happened at a specific interval, dynamically determined by the process.
There has to be a way to do this with a fairly simple bash command. I don't mind having to execute the command multiple times; performance is not my primary concern for this use case. However, it would be preferable for the command to be as fast as possible.
The only output I need is the timestamp of the last change, so I can compare it to the timestamp that I have stored in memory.
I'm also open to better solutions.
I actually found a good answer from another closely related question.
I've only modified the command a little to adapt it to my needs:
find . -type f -printf '%T@\n' | sort -n | tail -1
%T@ returns the modification time as a unix timestamp, which is just what I need.
sort -n sorts the timestamps numerically.
tail -1 only keeps the last/highest timestamp.
It runs fairly quickly; ~400ms on my entire home directory, and ~30ms on the intended directory (measured using time [command]).
I just thought of an even better solution than the previous one, which also allows me to know about deleted files.
The idea is to use a checksum, but not a checksum of all files; rather, we can only do a checksum of the timestamps. If anything changes at all (new files, deleted files, modified files), then the checksum will change also!
find . -type f -printf '%T@,' | cksum
'%T@,' prints the modification time of each file as a unix timestamp, all on one line.
cksum calculates the checksum of the timestamps.
????
Profit!!!!
It's actually even faster than the previous solution (by ~20%), because we don't need to sort (which is one of the slowest operations). The checksum itself is also very fast on such a small amount of data (about 22 bytes per timestamp), compared with checksumming the contents of every file.
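A minimal sketch of how the checksum can drive the restart decision (variable and function names are placeholders; assumes this runs inside a polling loop where old_sum persists between iterations):
new_sum=$(find . -type f -printf '%T@,' | cksum)
if [ "$new_sum" != "$old_sum" ]; then
    # something was added, removed or modified since the last check
    restart_the_process   # placeholder for whatever action is needed
    old_sum=$new_sum
fi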
Instead of remembering the timestamp of the last change, you could remember the last file that changed and find newer files using
find . -type f -newer "$lastfilethatchanged"
This does not work, however, if the same file changes again. Thus, you might need to create a temporary file with touch first:
touch --date="$previoustimestamp" "$tempfile"
find . -type f -newer "$tempfile"
where "$tempfile" could be, for example, in the memory at /dev/shm/.
$ find ./ -name "*.sqlite" -ls
You can use a command like this to get information, including timestamps, for your files; adjust the -name filter and add other filters as needed.
