Copy a directory with GNU Parallel

I have a basic job to do: I want to copy the contents of one directory to another location, preserving all the structure inside it (subfolders and files). The directory is very large, and I would like to perform the copy in parallel using GNU Parallel. However, I cannot manage to find the proper command to do so.
find . -print0 | parallel -0 cp -r dirToCopy/ newDirLocation/
doesn't seem to do anything, while
find . -print0 | parallel -0 cp {} newDirLocation/
copies only the files inside my original directory, without preserving the structure and the hierarchy where files are placed (basically it copies files without subfolders).
What is the correct way to copy this directory while preserving its structure?

You need to do that in 2 stages. Create directories:
find . -type d -print0 | parallel -0 mkdir -p newDirLocation/{}
Create files:
find . -type f -print0 | parallel -0 cp {} newDirLocation/{}
Be aware that if your disk is a single-spindle hard disk, it is most likely slower to do the copying in parallel. The only way to know for sure is to try it and measure it.
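Putting the two stages together, a minimal sketch, assuming GNU Parallel is installed and using /path/to placeholders; the find commands are run from inside the directory being copied so that {} holds paths relative to it:
cd /path/to/dirToCopy
find . -type d -print0 | parallel -0 mkdir -p /path/to/newDirLocation/{}
find . -type f -print0 | parallel -0 cp {} /path/to/newDirLocation/{}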

Related

List base files in a folder with numerous date-stamped versions of a file

I've got a folder with numerous versions of files (thousands of them), each with a unique date/time stamp as the file extension. For example:
./one.20190422
./one.20190421
./one.20190420
./folder/two.txt.20190420
./folder/two.txt.20190421
./folder/folder/three.mkv.20190301
./folder/folder/three.mkv.20190201
./folder/folder/three.mkv.20190101
./folder/four.doc.20190401
./folder/four.doc.20190329
./folder/four.doc.20190301
I need to get a unique list of the base files. For example, for the above example, this would be the expected output:
./one
./folder/two.txt
./folder/folder/three.mkv
./folder/four.doc
I've come up with the below code, but am wondering if there is a better, more efficient way.
# find all directories
find ./ -type d | while read folder ; do
    # go into that directory
    # then find all the files in that directory, excluding sub-directories
    # remove the extension (date/time stamp)
    # sort and remove duplicates
    # then loop through each base file
    cd "$folder" && find . -maxdepth 1 -type f -exec bash -c 'printf "%s\n" "${@%.*}"' _ {} + | sort -u | while read file ; do
        # and find all the versions of that file
        ls "$file".* | customFunctionToProcessFiles
    done
done
If it matters, the end goal is to find all the versions of a specific file, grouped by base file, and process them. So my plan was to get the base files, then loop through the list and find all the version files. Using the above example again, I'd process all the one.* files first, then the two.* files, etc...
Is there a better, faster, and/or more efficient way to accomplish this?
Some notes:
There are potentially thousands of files. I know I could just search for all files from the root folder, remove the date/time extension, sort and get unique, but since there may be thousands of files I thought it might be more efficient to loop through the directories.
The date/time stamp extension of the file is not in my control and it may not always be just numbers. The only thing I can guarantee is it is on the end after a period. And, whatever format the date/time is in, all the files will share it -- there won't be some files with one format and other files with another format.
You can use find ./ -type f -regex to match the versioned files directly:
find ./ -type f -regex '.*\.[0-9]+'
which would print something like:
./some_dir/asd.mvk.20190422
./two.txt.20190420
Also, you can pipe the result to your function through xargs, without needing while loops:
re='(.*)(\.[0-9]{8,8})'
find ./ -type f -regextype posix-egrep -regex "$re" | \
  sed -re "s/$re/\1/" | \
  sort -u | \
  xargs -r -d '\n' customFunctionToProcessFiles
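If any of the paths contain spaces, a null-delimited variant of the same pipeline is safer; a sketch, assuming GNU find, sed, sort and xargs, and keeping the question's customFunctionToProcessFiles as a placeholder:
re='(.*)(\.[0-9]{8,8})'
find ./ -type f -regextype posix-egrep -regex "$re" -print0 | \
  sed -zre "s/$re/\1/" | \
  sort -zu | \
  xargs -r0 customFunctionToProcessFiles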

How to pipe find results to unzip?

I have a lot of folders with a zip file in each. Most of the zip files in the folders have been opened already. I just want to unzip those which have not been opened, which I know all have the same date.
I'm trying to use the following, but I'm just getting hit back with unzip's usage message. The first part finds all the files I need, but piping the results to unzip, as I have done, isn't enough.
find *2019-01-05* | unzip
You can try using xargs to take the results from find and pass them to unzip, one archive at a time:
find *2019-01-05* | xargs -n1 unzip
That's:
find -type f -name \*2019-01-05\*.zip -exec unzip {} \;
-type f for good measure, in case there are similarly named directories.
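Since the zip files sit in many different folders and unzip extracts into the current directory by default, you may want each archive unpacked next to itself; a sketch under that assumption, using unzip's -n option so files that already exist are never overwritten:
find . -type f -name '*2019-01-05*.zip' -exec sh -c '
    cd "$(dirname "$1")" && unzip -n "$(basename "$1")"
' _ {} \;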

Create a bash script to delete folders which do not contain a certain filetype

I have recently run into a problem.
I used a utility to move all my music files into directories based on tags. This left a LOT of almost empty folders. The folders, in general, contain a thumbs.db file or some sort of image for album art. The mp3s have the correct album art in their new directories, so the old ones are okay to delete.
Basically, I need to find any directories within D:/Music/ that:
-Do not have any subdirectories
-Do not contain any mp3 files
And then delete them.
I figured this would be easier to do with a shell/bash script in the Linux/Unix world than in Windows 8.1 (HAHA).
Any suggestions? I'm not very experienced writing scripts like this.
This should get you started:
find /music -mindepth 1 -type d |
while read -r dt
do
    find "$dt" -mindepth 1 -type d | read && continue
    find "$dt" -iname '*.mp3' -type f | read && continue
    echo DELETE "$dt"
done
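Once the output looks right, the same loop can do the deletion; a sketch, assuming you really do want the flagged directories removed along with any leftover thumbs.db or cover-art files inside them (-depth makes find list children before parents, so directories that only become empty during the run are caught in the same pass):
find /music -mindepth 1 -depth -type d |
while IFS= read -r dt
do
    find "$dt" -mindepth 1 -type d | read && continue
    find "$dt" -iname '*.mp3' -type f | read && continue
    rm -rf "$dt"
done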
Here's the short story...
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
find . -type d -print | sort | uniq > all-dirs.tmp
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
less dirs-to-be-deleted.tmp
cat dirs-to-be-deleted.tmp | xargs -d '\n' rm -rf
Note that you might have to run all the commands a few times (depending on your repository's directory depth) before you're done deleting all the recursively emptied directories; see the loop sketched at the end of this answer...
And the long story goes...
You can approach this problem from two basic perspectives: either you find all directories, iterate over each of them, check whether it contains any mp3 file or any subdirectory, and if not, mark that directory for deletion. That will work, but on very large repositories you can expect a significant run time.
Another approach, which in my view is much more interesting, is to build a list of directories NOT to be deleted, and subtract that list from the list of all directories. Let's work through the second strategy, one step at a time...
First of all, to find the path of all directories that contain mp3 files, you can simply do:
find . -name '*.mp3' -printf '%h\n' | sort | uniq
This means "find any file ending with .mp3, then print the path to it's parent directory".
Now, I could certainly name at least ten different approaches to find directories that contain at least one subdirectory, but keeping the same strategy as above, we can easily get...
find . -type d -printf '%h\n' | sort | uniq
What this means is: "Find any directory, then print the path to its parent."
Both of these queries can be combined in a single invocation, producing a single list containing the paths of all directories NOT to be deleted. Let's redirect that list to a temporary file.
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
Let's similarly produce a file containing the paths of all directories, no matter if they are empty or not.
find . -type d -print | sort | uniq > all-dirs.tmp
So there, we have, on one side, the complete list of all directories, and on the other, the list of directories not to be deleted. What now? There are tons of strategies, but here's a very simple one:
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
Once you have that, well, review it, and if you are satisfied, then pipe it through xargs to rm to actually delete the directories.
cat dirs-to-be-deleted.tmp | xargs -d '\n' rm -rf
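If you would rather not re-run the steps by hand, the whole cycle can be wrapped in a loop that stops once nothing is left to delete; a sketch, assuming the directory names contain no newlines and that you have already reviewed a first pass:
while true; do
    find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
    find . -type d -print | sort | uniq > all-dirs.tmp
    comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
    [ -s dirs-to-be-deleted.tmp ] || break   # nothing left to delete
    xargs -d '\n' rm -rf < dirs-to-be-deleted.tmp
done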

BASH: Checking if files are duplicates within a directory?

I am writing a house-keeping script and have files within a directory that I want to clean up.
I want to move files from a source directory to another; there are many sub-directories, so there could be files that are the same. What I want to do is either use the cmp command or md5sum on each file: if they are not duplicates, move them; if they are the same, only move one.
So I have the move part working correctly as follows:
find /path/to/source -name "IMAGE_*.JPG" -exec mv '{}' /path/to/destination \;
I am assuming that I will have to loop through my directory, so I am thinking:
for files in /path/to/source
do
if -name "IMAGE_*.JPG"
then
md5sum (or cmp) $files
...stuck here (I am worried about how this method will be able to compare all the files against each other and how I would filter them out)...
then just do the mv to finish.
Thanks in advance.
find . -type f -exec md5sum {} \; | sort | uniq -w32 -d
That'll spit out one line for each md5 hash that occurs more than once. Then it's just a matter of figuring out which file(s) produced those duplicate hashes.
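To see which files actually share a hash, one rough follow-up sketch (assuming GNU coreutils and 32-character md5 hashes; /tmp/sums.txt is just a scratch file):
find . -type f -exec md5sum {} + | sort > /tmp/sums.txt
cut -c1-32 /tmp/sums.txt | uniq -d | while IFS= read -r hash; do
    grep "^$hash" /tmp/sums.txt     # every file sharing this hash
done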
There's a tool designed for exactly this purpose, fdupes:
fdupes -r dir/
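If you also want fdupes to take care of the clean-up rather than just report it, its -d option prompts for which copy in each duplicate set to keep (a usage note added here, not from the original answer):
fdupes -r -d dir/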
dupmerge is another such tool...

Bash script to recursively step through folders and delete files

Can anyone give me a bash script or one-line command I can run on Linux to recursively go through each folder from the current folder and delete all files or directories starting with '._'?
Change directory to the root directory you want (or change . to the directory) and execute:
find . -name "._*" -print0 | xargs -0 rm -rf
xargs allows you to pass several parameters to a single command, so it will be faster than the find -exec ... \; syntax. Also, you can run just the find part first to view the files that will be deleted and make sure it is safe.
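For example, the preview run would be just:
find . -name "._*"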
find . -name '._*' -exec rm -Rf {} \;
I had a similar problem a while ago (I assume you are trying to clean up a drive that was connected to a Mac, which saves a lot of these files), so I wrote a simple python script which deletes these and other useless files; maybe it will be useful to you:
http://github.com/houbysoft/short/blob/master/tidy
find /path -name "._*" -exec rm -fr "{}" +
Instead of deleting the AppleDouble files, you could merge them with the corresponding files. You can use dot_clean.
dot_clean -- Merge ._* files with corresponding native files.
For each dir, dot_clean recursively merges all ._* files with their corresponding native files according to the rules specified with the given arguments. By default, if there is an attribute on the native file that is also present in the ._ file, the most recent attribute will be used.
If no operands are given, a usage message is output. If more than one directory is given, directories are merged in the order in which they are specified.
Because dot_clean works recursively by default, use:
dot_clean <directory>
If you want to turn off the recursive merge, use -f for a flat merge.
dot_clean -f <directory>
find . -name '._*' -delete
A bit shorter, and it performs better in the case of an extremely long list of files.
