Find a bunch of randomly sorted images on disk and copy to target dir - linux

For testing purposes I need a bunch of random images from disk, copied to a specific directory. So, in pseudocode:
find [] -iname "*.jpg"
and then sort -R
and then head -n [number wanted]
and then copy to destination
Is it possible to combine the above commands into a single bash command? E.g.:
for i in `find ./images/ -iname "*.jpg" | sort -R | head -n243`; do cp "$i" ./target/; done;
But that doesn't quite work. I feel I'll need an 'xargs' somewhere in there, but I'm afraid I don't understand xargs very well... would I need to pass a 'print0' (or equivalent) to all separate commands?
[edit]
I left out the final step: I'd like to copy the images to a certain directory under a new (sequential) name. So the first image becomes 1.jpg, the second 2.jpg etc. For this, the command I posted does not work as intended.

The command that you specified should also work without any issues; it works well for me. Can you point out the exact error you are facing?
Meanwhile,
This will just do the trick for you:
find ./images/ -iname "*.jpg" | sort -R | head -n <no. of files> | xargs -I {} cp {} target/

Simply use shuf -n.
Example:
find ./images/ -iname "*.jpg" | shuf -n 10 | xargs cp -t ./target/
It would copy 10 random images to ./target/. If you need 243 just use shuf -n 243.

According to your edit, this should do :
for i in `find ./images/ -iname "*.jpg" | sort -R | head -n2`; do cp "$i" ./target/$((1 + compt++)).jpg; done
Here, you add a counter to keep track of the number of files you already copied.
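If the file names may contain spaces or other odd characters, a null-delimited variant along these lines should be more robust (a sketch, assuming GNU shuf, which supports the -z and -n options):
n=0
find ./images/ -iname "*.jpg" -print0 | shuf -z -n 243 | while IFS= read -r -d '' f
do
    n=$((n + 1))              # sequential name: 1.jpg, 2.jpg, ...
    cp "$f" "./target/$n.jpg"
done
Because the file names travel as NUL-delimited strings and are always quoted, spaces and newlines in them are not a problem.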


How to grep through many files of same file type

I wish to grep through many (20,000) text files, each with about 1,000,000 lines, so the faster the better.
I have tried the code below and it just doesn't seem to do anything; it doesn't find any matches even after an hour (it should have by now).
for i in $(find . -name "*.txt"); do grep -Ff firstpart.txt $1; done
Ofir's answer is good. Another option:
find . -name "*.txt" -exec grep -fnFH firstpart.txt {} \;
I like to add the -n for line numbers and -H to get the filename. -H is particularly useful in this case as you could have a lot of matches.
Instead of iterating through the files in a loop, you can just give the file names to grep using xargs and let grep go over all the files.
find . -name "*.txt" | xargs grep $1
I'm not quite sure whether it will actually increase the performance, but it's probably worth a try.
ripgrep is the most amazing tool. You should get that and use it.
To search *.txt files in all directories recursively, do this:
rg -t txt -f patterns.txt
Ripgrep uses one of the fastest regular expression engines out there. It uses multiple threads. It searches directories and files, and filters them to the interesting ones in the fastest way.
It is simply great.
For anyone stuck using grep for whatever reason:
find -name '*.txt' -type f -print0 | xargs -0 -P 8 -n 8 grep -Ff patterns.txt
That tells xargs to use 8 arguments per command (-n 8) and to run 8 copies in parallel (-P 8). It has the downside that the output might become interleaved and corrupted.
Instead of xargs you could use parallel which does a fancier job and keeps output in order:
$ find -name '*.txt' -type f -print0 | parallel -0 grep --with-filename -Ff patterns.txt

Is it possible to pipe the results of FIND to a COPY command CP?

Is it possible to pipe the results of find to a COPY command cp?
Like this:
find . -iname "*.SomeExt" | cp Destination Directory
Searching around, I always find this kind of formula, such as in this post:
find . -name "*.pdf" -type f -exec cp {} ./pdfsfolder \;
This raises some questions:
Why can't you just use the | pipe? Isn't that what it's for?
Why does everyone recommend -exec?
How do I know when to use -exec over a pipe |?
There's a little-used option for cp: -t destination -- see the man page:
find . -iname "*.SomeExt" | xargs cp -t Directory
Good question!
Why can't you just use the | pipe? Isn't that what it's for?
You can pipe, of course; xargs is made for exactly these cases:
find . -iname "*.SomeExt" | xargs cp Destination_Directory/
Why does everyone recommend -exec?
The -exec is good because it provides more control of exactly what you are executing. Whenever you pipe there may be problems with corner cases: file names containing spaces or new lines, etc.
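For instance, a -exec version of the copy could look like this (a sketch; cp -t is a GNU coreutils option, and the + terminator lets find batch many files per cp call):
find . -iname "*.SomeExt" -exec cp -t Destination_Directory/ {} +
Since the file names never pass through a shell here, spaces and newlines in them are handled safely; cp -t puts the destination first so find can append the source files at the end.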
How do I know when to use -exec over a pipe |?
It is really up to you and there can be many cases. I would use -exec whenever the action to perform is simple. I am not a very good friend of xargs, I tend to prefer an approach in which the find output is provided to a while loop, such as:
while IFS= read -r result
do
# do things with "$result"
done < <(find ...)
You can use | like below:
find . -iname "*.SomeExt" | while read line
do
cp $line DestDir/
done
Answering your questions:
| can be used to solve this issue. But as seen above, it involves a lot of code. Moreover, | will create two processes: one for find and another for cp.
Using -exec inside find instead solves the problem within a single find invocation.
Try this:
find . -iname "*.SomeExt" -print0 | xargs -0 cp -t Directory
# ........................^^^^^^^..........^^
In case there is whitespace in filenames.
I like the spirit of the response from #fedorqui-so-stop-harming, but it needed a tweak to work in my bash terminal.
In this version...
find . -iname "*.SomeExt" | xargs cp Destination_Directory/
The cp command incorrectly takes Destination_Directory/ as the first argument. I needed to add a replacement string in order to get xargs to insert the argument in the right position for cp. I used a percent symbol for the replacement string, but you can use anything that doesn't conflict with the input from the pipe. This version works for me.
find . -iname "*.SomeExt" | xargs -I % cp % Destination_Directory/
This SOLVED my problem.
find . -type f | grep '\.pdf' | while read -r line
do
cp "$line" REPLACE_WITH_TARGET_DIRECTORY
done
If there are spaces in the filenames, try:
find . -iname "*.ext" > list.txt
cat list.txt | awk 'BEGIN {a="'"'"'"}{print "cp "a$0a" Directory"}' > script.sh
sh script.sh
You can inspect list.txt and script.sh before running sh script.sh. Remember to delete list.txt and script.sh afterwards.
I had some files with parentheses and wanted a progress bar, so I replaced the cat line with:
cat list.txt | awk -v X='"' '{print "rsync -Pa "X$0X" /Volumes/Untitled/"}' > script.sh

Delete everything other than file + linked file across multiple servers (NET::SSH::MULTI)

I've got a couple of thousand images that are saved as logs that need to be deleted.
To avoid rm's argument-list limit and to do this across multiple servers, I used the following code:
Net::SSH::Multi.start(:on_error => :ignore) do |session|
# define servers in groups for more granular access
session.group :app do
session.use 'example@example', :password => 'example'
end
# execute commands on a subset of servers
session.with(:app).exec "find /tmp/motion -maxdepth 1 -not -name 'lastsnap.jpg' -print0 | sudo xargs -0 rm"
end
An ls -l lastsnap.jpg shows that lastsnap.jpg is linked to another file, like so
30 Jun 3 08:18 lastsnap.jpg -> 81-20140603081840-snap.jpg
This other file changes constantly due to the logging scenario I mentioned above.
Reiterating the question: how do I delete every logged file that is NOT lastsnap.jpg or its linked file?
Thanks for the help :)
cd /tmp/motion
ls -1 | grep -v -E '$(basename `find . -lname lastsnap.jpg`)|lastsnap.jpg' | while read n ; do rm -rvf $n ; done
EDIT as per the comment
cd /tmp/motion; rm -rvf $(ls -1 | grep -v -E "$(basename `find . -lname lastsnap.jpg`)|lastsnap.jpg")
Note: Make sure that your file names don't have spaces in them. Otherwise this method will not work and needs modification to accommodate spaces in the file names.
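If spaces could ever show up in the names, a null-delimited variant built on readlink might be safer (a sketch only; inspect the find output before piping it to rm):
cd /tmp/motion
keep=$(readlink lastsnap.jpg)   # name of the file lastsnap.jpg points to
find . -maxdepth 1 -type f ! -name lastsnap.jpg ! -name "$keep" -print0 | xargs -0 rm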
I wrote some logic using the find command. Check whether it's useful to you.
My directory contains following files
pyramid-stone.jpg
tallest_water_slide.jpg
SAOLA.JPG
testnap.jpg
silicon_valley_talent.jpg
The_Organic_Battery_From_Japan.jpg
Out of which testnap.jpg is a link
testnap.jpg -> pyramid-stone.jpg
So I wrote a small awk snippet to get the link name and where it's pointing to:
IG1=`ls -l | grep ^l | awk '{printf $(NF-2);}'`
IG2=`ls -l | grep ^l | awk '{printf $(NF);}'`
Then I used the find command to print all the JPGs, excluding the link and its target:
find . -type f \( -iname "*.jpg" ! -iname "$IG1" ! -iname "$IG2" \)
The output is:
./SAOLA.JPG
./silicon_valley_talent.jpg
./tallest_water_slide.jpg
./The_Organic_Battery_From_Japan.jpg
NOTE: You have to add rm to actually remove the files once the find output looks right, as shown below.
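A deletion step bolted onto the same find might look like this (a sketch; double-check the listed files before running it):
find . -type f \( -iname "*.jpg" ! -iname "$IG1" ! -iname "$IG2" \) -exec rm -v {} +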

Linux: Find a List of Files in a Directory recursively

I have a text file with one filename per row:
Interpret 1 - Song 1.mp3
Interpret 2 - Song 2.mp3
...
(About 200 Filenames)
Now I want to search a folder recursively for these filenames to get the full path of each filename in Filenames.txt.
How to do this? :)
(Purpose: I copied files to my MP3 player, but some of them are broken, and I want to recopy them all without spending hours picking them out of my music folder by hand.)
The easiest way may be the following:
cat orig_filenames.txt | while read file ; do find /dest/directory -name "$file" ; done > output_file_with_paths
A much faster way is to run the find command only once and use fgrep:
find . -type f -print0 | fgrep -zFf ./file_with_filenames.txt | xargs -0 -J % cp % /path/to/destdir
You can use a while read loop along with find:
filecopy.sh
#!/bin/bash
while read line
do
find . -iname "$line" -exec cp '{}' /where/to/put/your/files \;
done < list_of_files.txt
Where list_of_files.txt is the list of files line by line, and /where/to/put/your/files is the location you want to copy to. You can just run it like so in the directory:
$ bash filecopy.sh
+1 for @jm666's answer, but the -J option doesn't work for my flavor of xargs, so I changed it to:
find . -type f -print0 | fgrep -zFf ./file_with_filenames.txt | xargs -0 -I{} cp "{}" /path/to/destdir/

How can I calculate an MD5 checksum of a directory?

I need to calculate a summary MD5 checksum for all files of a particular type (*.py for example) placed under a directory and all sub-directories.
What is the best way to do that?
The proposed solutions are very nice, but this is not exactly what I need. I'm looking for a solution to get a single summary checksum which will uniquely identify the directory as a whole - including content of all its subdirectories.
Create a tar archive file on the fly and pipe that to md5sum:
tar c dir | md5sum
This produces a single MD5 hash value that should be unique to your file and sub-directory setup. No files are created on disk.
find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum
The find command lists all the files that end in .py.
The MD5 hash value is computed for each .py file. AWK is used to pick off the MD5 hash values (ignoring the filenames, which may not be unique).
The MD5 hash values are sorted. The MD5 hash value of this sorted list is then returned.
I've tested this by copying a test directory:
rsync -a ~/pybin/ ~/pybin2/
I renamed some of the files in ~/pybin2.
The find...md5sum command returns the same output for both directories.
2bcf49a4d19ef9abd284311108d626f1 -
To take into account the file layout (paths), so the checksum changes if a file is renamed or moved, the command can be simplified:
find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | md5sum
On macOS with md5:
find /path/to/dir/ -type f -name "*.py" -exec md5 {} + | md5
ire_and_curses's suggestion of using tar c <dir> has some issues:
tar processes directory entries in the order in which they are stored in the filesystem, and there is no way to change this order. This can effectively yield completely different results if you have the "same" directory in different places, and I know no way to fix this (tar cannot "sort" its input files in a particular order).
I usually care about whether groupid and ownerid numbers are the same, not necessarily whether the string representation of group/owner are the same. This is in line with what for example rsync -a --delete does: it synchronizes virtually everything (minus xattrs and acls), but it will sync owner and group based on their ID, not on string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar
tar will include the filename of the directory you're checking itself, just something to be aware of.
As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
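For what it's worth, recent GNU tar (1.28 and later) does offer a --sort=name option, which should make the archive order deterministic; a hedged sketch combining it with --numeric-owner:
tar --sort=name --numeric-owner -cf - dir | md5sum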
The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should also take empty directories into account.
Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.
This is the solution I came up with:
dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum
Notes about this solution:
The LC_ALL=C is to ensure reliable sorting order across systems
This doesn't differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but the chance of that occurring seems very unlikely. One usually fixes this with a -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.
PS: one of my systems uses a limited busybox find which supports neither the -exec nor the -print0 flag, and it also appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:
dir=<mydir>; (find "$dir" -type f | while read f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum
Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.
If you only care about files and not empty directories, this works nicely:
find /path -type f | sort -u | xargs cat | md5sum
A solution which worked best for me:
find "$path" -type f -print0 | sort -z | xargs -r0 md5sum | md5sum
Reason why it worked best for me:
handles file names containing spaces
Ignores filesystem meta-data
Detects if file has been renamed
Issues with other answers:
Filesystem meta-data is not ignored for:
tar c - "$path" | md5sum
Does not handle file names containing spaces nor detects if file has been renamed:
find /path -type f | sort -u | xargs cat | md5sum
For the sake of completeness, there's md5deep(1); it's not directly applicable due to the *.py filter requirement, but it should do fine together with find(1), as sketched below.
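A hedged sketch of that combination: md5deep accepts file arguments, so find can do the *.py filtering, and sorting keeps the final hash stable:
find /path/to/dir -type f -name '*.py' -exec md5deep {} + | sort | md5sum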
If you want one MD5 hash value spanning the whole directory, I would do something like
cat *.py | md5sum
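Note that a bare *.py glob only covers the top-level directory; with bash's globstar option (bash 4 and later) a recursive variant of the same idea might be:
shopt -s globstar
cat **/*.py | md5sum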
Checksum all files, including both content and their filenames
grep -ar -e . /your/dir | md5sum | cut -c-32
Same as above, but only including *.py files
grep -ar -e . --include="*.py" /your/dir | md5sum | cut -c-32
You can also follow symlinks if you want
grep -aR -e . /your/dir | md5sum | cut -c-32
Other options you could consider using with grep
-s, --no-messages suppress error messages
-D, --devices=ACTION how to handle devices, FIFOs and sockets;
-Z, --null print 0 byte after FILE name
-U, --binary do not strip CR characters at EOL (MSDOS/Windows)
GNU find
find /path -type f -name "*.py" -exec md5sum {} +
Technically you only need to run ls -lR *.py | md5sum. Unless you are worried about someone modifying the files and touching them back to their original dates and never changing the files' sizes, the output from ls should tell you if the file has changed. My unix-foo is weak so you might need some more command line parameters to get the create time and modification time to print. ls will also tell you if permissions on the files have changed (and I'm sure there are switches to turn that off if you don't care about that).
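As a sketch of that idea with full timestamps included (ls --full-time is a GNU option; the sort keeps the listing order stable):
find /path/to/dir -name '*.py' -exec ls -l --full-time {} + | sort | md5sum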
Using md5deep:
md5deep -r FOLDER | awk '{print $1}' | sort | md5sum
I want to add that if you are trying to do this for files/directories in a Git repository to track if they have changed, then this is the best approach:
git log -1 --format=format:%H --full-diff <file_or_dir_name>
And if it's not a Git directory/repository, then the answer by ire_and_curses is probably the best bet:
tar c <dir_name> | md5sum
However, please note that the tar command may produce a different hash if you run it on a different OS or with a different tar version. If you want to be immune to that, this is the best approach, even though it doesn't look very elegant at first sight:
find <dir_name> -type f -print0 | sort -z | xargs -0 md5sum | md5sum | awk '{ print $1 }'
md5sum worked fine for me, but I had issues with sort and sorting file names. So instead I sorted by md5sum result. I also needed to exclude some files in order to create comparable results.
find . -type f -print0 \
| xargs -r0 md5sum \
| grep -v ".env" \
| grep -v "vendor/autoload.php" \
| grep -v "vendor/composer/" \
| sort -d \
| md5sum
I had the same problem, so I came up with this script. It just lists the MD5 hash values of the files in the directory, and if it finds a subdirectory it runs again from there. For this to work, the script has to be able to run either from the current directory or from a subdirectory passed as argument $1.
#!/bin/bash
if [ -z "$1" ] ; then
    # loop over the current directory
    ls | while read -r line; do
        ecriv="$(pwd)/$line"
        if [ -f "$ecriv" ] ; then
            md5sum "$ecriv"
        elif [ -d "$ecriv" ] ; then
            bash "$0" "$line"   # call this script again on the subdirectory
        fi
    done
else # if a directory is specified in argument $1
    ls "$1" | while read -r line; do
        ecriv="$(pwd)/$1/$line"
        if [ -f "$ecriv" ] ; then
            md5sum "$ecriv"
        elif [ -d "$ecriv" ] ; then
            bash "$0" "$1/$line"
        fi
    done
fi
If you really want independence from the file system attributes and from the bit-level differences of some tar versions, you could use cpio:
find theDirname | cpio -o | md5sum
There are two more solutions:
Create:
du -csxb /path | md5sum > file
ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum > /tmp/file
Check:
du -csxb /path | md5sum -c file
ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum -c /tmp/file
