How can I quickly duplicate images located in a folder?
I usually use this command:
cp -R -p path/to/folder path/to/another/folder
But because of the large number of images in path/to/folder, the operation takes too much time.
How can I do this task faster? Is there an alternative solution?
Put them on ZFS and use snapshots/clones instead of copying?
I am using scp to copy the files in parallel using GNU parallel with my below shell script and it is working fine.
I am not sure how I can use rsync in place of scp in my shell script below. I am trying to see whether rsync will perform better than scp in terms of transfer speed.
Below is my problem description -
I am copying the files from machineB and machineC into machineA as I am running my below shell script on machineA.
If a file is not on machineB then it is sure to be on machineC, so I try copying each file from machineB first and, if it is not there, I try copying the same file from machineC.
I am copying the files in parallel using GNU Parallel library and it is working fine. Currently I am copying five files in parallel both for PRIMARY and SECONDARY.
Below is my shell script which I have -
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
el=$1
PRIMSEC=$2
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way of replacing my above scp command with rsync but I still want to copy 5 files in parallel both for PRIMARY and SECONDARY simultaneously?
rsync is designed to efficiently synchronise two hierarchies of folders and files.
While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of rsync in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, and so parallelizing the tasks won't help in the typical case whether you're using rsync or scp, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case, there may be some value to running an instance of rsync for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to discover whether you were actually getting any benefit.
In short: just run a single instance of rsync; any performance increase you're going to get from another approach is unlikely to be worth it.
You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to be associated with the overhead of setting up connections and authenticating them. Doing that separately for every single file is expensive, particularly over an ssh-based protocol. You'd be better off breaking your file list into batches, and passing those batches to your copying mechanism. Whether you use rsync for that is likely to be of lesser importance, but if you first construct a list of files for an rsync process to handle, then you can pass it to rsync with the --files-from option.
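The batching idea could be sketched roughly as follows. This is only a minimal sketch, not the asker's script: the helper names build_file_list and copy_batch are hypothetical, the user, hosts and paths are taken from the question, and it assumes every file in a batch is on the same host (the fallback to machineC would still need separate handling).

```shell
#!/bin/bash
# Hypothetical helper: print one file name per partition number,
# following the naming pattern used in the question.
build_file_list() {
    local el
    for el in "$@"; do
        printf 'new_weekly_2014_%s_200003_5.data\n' "$el"
    done
}

# Hypothetical helper: fetch a whole batch over a single ssh connection.
# $1 = remote host, $2 = remote directory, $3 = local directory, $4 = list file.
# Names in the list file are interpreted relative to the remote directory.
copy_batch() {
    rsync -a --files-from="$4" "david@$1:$2/" "$3/"
}

list_file=$(mktemp)
build_file_list 550 274 2 546 278 > "$list_file"

# Example call (not executed here, since it needs the real hosts):
# copy_batch machineB /testing/snapshot/20140103 /test01/primary "$list_file"
```

With this shape, connection setup and authentication are paid once per batch instead of once per file; whether that helps in practice would have to be measured.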
You want to work out what the limiting factor is in your sync speed. Presumably it's one of network bandwidth, network latency, file I/O, or perhaps CPU (checksumming or compression, but probably only if you have low-end hardware).
It's likely also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (eg database files), or only get appended (eg log files)? Can you safely count on metadata like file modification times and sizes to identify what's changed, or do you need to checksum the entire content?
Is your file content compressible? Eg if you're copying plain text, you probably want to use compression options in scp or rsync, but if you have already-compressed image or video files, then compressing again would only slow you down. rsync is mostly helpful if you have files where just part of the file changes.
You can download single files with rsync just as you would with scp. Just make sure not to use the rsync:// or hostname::path formats that call the daemon.
Running transfers in parallel can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks or happen to be in cache, parallelizing them even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.
I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
with
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(note that the command name is not the only change: the destination now includes the file name)
Perhaps you can get additional speed because rsync will use the delta-transfer algorithm, compared to scp, which will blindly copy.
I have a bash script (Scientific Linux).
The script has to operate on a file. Let's say "file.dat" (around 1 GB of size)
After some time the script is restarted and executes the following:
if [ -f file.dat ]; then
cp file.dat file.previous.dat
fi
to have a backup of the file.
Then a process starts and overwrites "file.dat"
To be on the safe side (a power failure or anything else unexpected), what would be the best option: cp or mv?
Thanks.
I would use a combination:
mv file.dat file.dat.previous
cp file.dat.previous file.dat
That way file.dat.previous will always be complete as mv is atomic.
The Right Answer to the Wrong Question
If you want a quick, atomic move, then mv is the thing to do since man 2 rename says:
If newpath already exists it will be atomically replaced (subject to a few conditions; see ERRORS below), so that there is no point at which another process attempting to access newpath will find it missing.
Perhaps more importantly, mv is largely a directory entry operation, so it's very quick compared to a file copy in any normal circumstance.
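One way to combine the two commands, so that the backup name never refers to a half-written file, is to copy to a temporary name first and then rename it into place. The following is a minimal sketch under assumptions: the helper name backup_atomically and the temporary-name suffix are hypothetical, and the temporary file must be on the same filesystem as the final name so that mv is a plain rename.

```shell
#!/bin/bash
# Hypothetical helper: back up "$1" to "$2" without ever leaving
# a half-written file under the final backup name.
backup_atomically() {
    local src=$1 dst=$2
    local tmp="$dst.tmp.$$"             # same directory, so same filesystem
    [ -f "$src" ] || return 0           # nothing to back up yet
    cp -p -- "$src" "$tmp" || { rm -f -- "$tmp"; return 1; }
    sync                                # ask the kernel to flush the copy to disk
    mv -f -- "$tmp" "$dst"              # rename(2): replaces dst in one step
}

# Demonstration in a scratch directory.
workdir=$(mktemp -d)
printf 'run 1\n' > "$workdir/file.dat"
backup_atomically "$workdir/file.dat" "$workdir/file.previous.dat"
```

If the machine dies during the cp, the previous backup is still intact under its final name and only the temporary file is incomplete.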
The Right Answer to the Right Question
If you're worried about power outages or unexpected system shutdowns, then:
Attach an uninterruptible power supply. Really. Solve for the threat model.
Make sure you're using a battery-backed RAID controller.
Make critical writes synchronous.
Use a journaling filesystem that journals data, and not just metadata.
The mv command should be faster, but robustness in the face of catastrophic failures is a hardware or filesystem issue.
Probably not too helpful here, but rsync is the tool for this kind of job. If the transfer gets interrupted it can restart from where it needs to go.
I'm not sure exactly what category to put this in.
I have tried to do the following with a file that is 7.7GB on my system (CentOS 5.5):
time cp original copy
and
time cp copy copy2
The copy of the copy is about half the time of the copy of the original.
I thought maybe the OS was caching or something, so I went to another directory and copied a few small files and stuff, and went back to make the copy of the copy again, and it was still way faster.
Any ideas what's going on here? Is the OS caching the file or something?
What made me notice this problem is that I have some code that processes this file. I wanted to test it on two files, so I just made a copy. I then noticed that the original file takes the longest to process. What kind of diagnostics can I run on this?
The OS doesn't cache the file so much as it caches the disk blocks it read.
There are a couple of ways to try to account for caching when running timing tests. You could try to flush the OS disk buffers by allocating a huge amount of memory (I usually run something like perl -e '"\0"x1024x1024x1024' to do this); free before and after should give you an idea of how much data the OS has cached (under the buffers and cached columns).
Or when you time your run, ignore the system time - that will be primarily I/O - and just watch the user time. Of course, different runs may be very well dealing with different amounts of data so you would expect there to be different amounts of I/O.
The most reliable way is to run the test several times and use the fastest time as the value to compare.
sync && echo 3 > /proc/sys/vm/drop_caches
time cp original copy
sync && echo 3 > /proc/sys/vm/drop_caches
time cp copy copy2
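The suggestion of running the test several times and keeping the fastest time could be sketched like this. It is only an illustration under assumptions: the helper name best_of is hypothetical, it relies on GNU date for nanosecond timestamps (%N), and it uses a small scratch file rather than the 7.7GB original.

```shell
#!/bin/bash
# Hypothetical helper: run a command N times and print the fastest
# wall-clock time in milliseconds. Assumes GNU date (for %N).
best_of() {
    local runs=$1; shift
    local best= start end elapsed i
    for ((i = 0; i < runs; i++)); do
        start=$(date +%s%N)
        "$@" > /dev/null 2>&1
        end=$(date +%s%N)
        elapsed=$(( (end - start) / 1000000 ))
        if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
            best=$elapsed
        fi
    done
    printf '%s\n' "$best"
}

# Demonstration with a 1 MiB scratch file.
workdir=$(mktemp -d)
head -c 1048576 /dev/zero > "$workdir/original"
fastest_ms=$(best_of 3 cp "$workdir/original" "$workdir/copy")
echo "fastest of 3 runs: ${fastest_ms} ms"
```

Comparing the fastest run of each command means both measurements are taken in a similar (warm) cache state, which is the point of repeating the test.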
I have 100'000 1kb files. And a program that reads them - it is really slow.
My best idea for improving performance is to put them on ramdisk.
But this is a fragile solution: the ramdisk needs to be set up again after every restart.
(and file copying is slow as well)
My second best idea is to concatenate the files and work with that. But it is not trivial.
Is there a better solution?
Note: I need to avoid dependencies in the program, even Boost.
You can optimize by storing the files contiguously on disk.
On a disk with ample free room, the easiest way would be to read a tar archive instead.
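As a rough illustration of the archive idea, the sketch below packs a few small files into one tar file and then streams all of their contents back in a single sequential read. The directory and file names are made up for the demonstration, and a real program would parse the archive itself rather than shell out to tar.

```shell
#!/bin/bash
# Scratch directory with a few small files standing in for the 100,000 real ones.
workdir=$(mktemp -d)
mkdir "$workdir/small"
for i in 1 2 3; do
    printf 'record %s\n' "$i" > "$workdir/small/$i.txt"
done

# Pack everything into one archive, so the data ends up in a single file.
tar -cf "$workdir/bundle.tar" -C "$workdir/small" .

# Stream the contents of every member to stdout in one pass.
# -O writes member data to stdout instead of creating files.
contents=$(tar -xOf "$workdir/bundle.tar")
echo "$contents"
```

The order of the records follows the order in which tar stored the members, which is not necessarily sorted.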
Other than that, there is (or used to be) a Debian package for 'readahead'.
You can use that tool to
profile a normal run of your software
edit the list of files accessed (detected by readahead)
You can then call readahead with that file list (it will order the files in disk order so the throughput will be maximized and the seek times minimized)
Unfortunately, it has been a while since I used these, so I hope you can google the respective packages
This is what I seem to have found now:
sudo apt-get install readahead-fedora
Good luck
If your files are static, I agree: just tar them up and then place that in a RAM disk. It would probably be faster to read directly out of the tar file, but you can test that.
Edit: instead of tar, you could also try creating a squashfs volume.
If you don't want to do that, or still need more performance then:
put your data on an SSD.
start running some file system performance tests, beginning with ext4, XFS, etc.
If there are around 1,000,000 individual files (mostly 100k in size) in a single flat directory (no subdirectories), are there going to be any compromises in efficiency, or disadvantages in any other way?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to fail with "Argument list too long". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via ftp, or web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the total size of the arguments (plus the environment) that can be passed to a program's entry point. So, let's take 'rm', and the example "rm -rf *" - the shell is going to turn '*' into a list of file names, which in turn become the arguments to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
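If the directory cannot be split up right away, a common workaround is to let find do the matching, so the shell never has to build the huge argument list. The sketch below demonstrates this on a scratch directory; the file names and counts are made up, and -maxdepth is an extension found in GNU and BSD find rather than something POSIX guarantees.

```shell
#!/bin/bash
# Show the limit on this system (total bytes of arguments plus environment).
getconf ARG_MAX

# Scratch directory standing in for the huge one.
workdir=$(mktemp -d)
for i in $(seq 1 200); do
    : > "$workdir/foo$i"
done
: > "$workdir/keep.txt"

# Instead of "ls foo*": find matches the names itself, one at a time.
matches=$(find "$workdir" -maxdepth 1 -type f -name 'foo*' | wc -l)

# Instead of "rm foo*": find batches the names into as many rm
# invocations as needed to stay under the limit.
find "$workdir" -maxdepth 1 -type f -name 'foo*' -exec rm -f -- {} +
remaining=$(find "$workdir" -maxdepth 1 -type f | wc -l)
```

The pattern is quoted so that the shell passes it to find unexpanded.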
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have the dir_index feature enabled by default; in others you have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
aaavnj78t93ufjw4390
aaavoj78trewrwrwrwenjk983
aaaz84390842092njk423
...
abc/
abckhr89032423
abcnjjkth29085242nw
...
...
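A layout like the one in the listing above could be produced with something along these lines. This is a minimal sketch, not a hardened tool: the helper name shard_directory is hypothetical, and it assumes every file name is longer than three characters (otherwise the file and its target subdirectory would have the same name).

```shell
#!/bin/bash
# Hypothetical helper: move every regular file in directory "$1" into a
# subdirectory named after the first three characters of its file name.
shard_directory() {
    local dir=$1 path name prefix
    find "$dir" -maxdepth 1 -type f -print0 |
    while IFS= read -r -d '' path; do
        name=${path##*/}                # strip the directory part
        prefix=${name:0:3}              # first three characters
        mkdir -p -- "$dir/$prefix"
        mv -- "$path" "$dir/$prefix/$name"
    done
}

# Demonstration with file names from the listing above.
workdir=$(mktemp -d)
for name in aaavnj78t93ufjw4390 aaaz84390842092njk423 abckhr89032423; do
    : > "$workdir/$name"
done
shard_directory "$workdir"
```

Using find with NUL-separated names avoids both the argument-list limit and problems with unusual characters in file names.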
The obvious answer is that the folder will become extremely difficult for humans to use long before any technical limit is reached (the time taken to read the output from ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into subfolders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to
ls
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list so I needed to do two passes. Firstly construct a file list with find (wholename in case you want to do partial matches)
find /path/to_dir/ -wholename '*.ldb'| tee filenames.txt
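then secondly read the file containing the filenames and read all the files (with limited parallelism)
while read -r line; do
    # Keep at most 10 background jobs running.
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # Do something with 10x fanout; reading the file is an assumed example.
        cat -- "$line" > /dev/null
    } &
done < filenames.txt
wait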
Posting here in case anyone finds the specific work-around useful when working with too many files.