How to use rsync instead of scp in my shell script below to copy the files?

I am using scp to copy files in parallel with GNU parallel in the shell script below, and it is working fine.
I am not sure how I can use rsync in place of scp in this script. I am trying to see whether rsync will give better transfer speed than scp.
Below is my problem description -
I am copying the files from machineB and machineC into machineA, and I am running the shell script below on machineA.
If a file is not on machineB then it should definitely be on machineC, so I will try copying it from machineB first; if it is not on machineB then I will try copying the same file from machineC.
I am copying the files in parallel using GNU Parallel and it is working fine. Currently I am copying five files in parallel for both PRIMARY and SECONDARY.
Below is my shell script which I have -
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
    el=$1
    PRIMSEC=$2
    scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way to replace the scp command above with rsync while still copying 5 files in parallel for both PRIMARY and SECONDARY simultaneously?

rsync is designed to efficiently synchronise two hierarchies of folders and files.
While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of rsync in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, and so parallelizing the tasks won't help in the typical case whether you're using rsync or scp, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case, there may be some value to running an instance of rsync for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to discover whether you were actually getting any benefit.
In short: just run a single instance of rsync; any performance increase you're going to get from another approach is unlikely to be worth it.
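For what it's worth, a single-instance version for this layout could be as simple as the line below. It copies the whole snapshot directory rather than the specific file numbers from the arrays, so treat it as a sketch of the approach, not a drop-in replacement:
rsync -av david@machineB:/testing/snapshot/20140103/ /test01/primary/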

You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to be associated with the overhead of setting up connections and authenticating them. Doing that separately for every single file is expensive, particularly over an ssh-based protocol. You'd be better off breaking your file list into batches and passing those batches to your copying mechanism. Whether you use rsync for that is likely to be of lesser importance, but if you first construct a list of files for an rsync process to handle, you can pass it to rsync with the --files-from option.
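A rough sketch of that batching idea, reusing the names from the question above (the list construction and the single destination are assumptions, not part of the original script):
# Build a list of wanted file names, relative to the remote source directory
printf 'new_weekly_2014_%s_200003_5.data\n' "${PRIMARY_PARTITION[@]}" > primary_files.txt
# One rsync connection then copies the whole batch
rsync -av --files-from=primary_files.txt david@machineB:/testing/snapshot/20140103/ /test01/primary/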
You want to work out what the limiting factor is in your sync speed. Presumably it's one of network bandwidth, network latency, file I/O, or perhaps CPU (checksumming or compression, but probably only if you have low-end hardware).
It's likely also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (eg database files), or only get appended (eg log files)? Can you safely count on metadata like file modification times and sizes to identify what's changed, or do you need to checksum the entire content?
Is your file content compressible? For example, if you're copying plain text you probably want to use compression options in scp or rsync, but if you have already-compressed image or video files, compressing again would only slow you down. rsync is mostly helpful if you have files where only part of the file changes.
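For instance, compression can be switched on with a single flag in either tool (the host and paths here are just placeholders):
rsync -avz david@machineB:/testing/snapshot/20140103/ /test01/primary/    # -z compresses data in transit
scp -C david@machineB:/testing/snapshot/20140103/some_file.data /test01/primary/    # -C enables compression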

You can download single files with rsync just as you would with scp. Just make sure not to use the rsync:// or hostname::path formats that call the daemon.
Parallelizing can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks or happen to be in cache, parallelizing even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.

I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
by
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(note that the change is not only to the command name but also to the destination, which now names the target file explicitly)
Perhaps you can get some additional speed because rsync will use the delta-transfer algorithm, whereas scp will blindly copy.
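Dropped into the question's do_Copy function, that substitution might look like this (same user, hosts and paths as in the question):
do_Copy() {
    el=$1
    PRIMSEC=$2
    rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
}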

Force rsync to compare local files byte by byte instead of checksum

I have written a Bash script to backup a folder. At the core of the script is an rsync instruction
rsync -abh --checksum /path/to/source /path/to/target
I am using --checksum because I neither want to rely on file size nor modification time to determine if the file in the source path needs to be backed up. However, most -- if not all -- of the time I run this script locally, i.e., with an external USB drive attached which contains the backup destination folder; no backup over network. Thus, there is no need for a delta transfer since both files will be read and processed entirely by the same machine. Calculating the checksums even introduces a slowdown in this case. It would be better if rsync would just diff the files if they are both stored locally.
After reading the manpage I stumbled upon the --whole-file option which seems to avoid the costly checksum calculation. The manpage also states that this is the default if source and destination are local paths.
So I am thinking to change my rsync statement to
rsync -abh /path/to/source /path/to/target
Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up? I definitely do not want to rely on file size or modification times to decide if a backup should take place.
UPDATE
Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced. So blindly rsync'ing all files in the source folder, e.g., by supplying --ignore-times as suggested in the comments, is not an option. It would create too many duplicate files and waste storage space. Keep also in mind that I am trying to reduce backup time and workload on a local machine. Just backing up everything would defeat that purpose.
So my question could be rephrased as, is rsync capable of doing a file comparison on a byte by byte basis?
Question: is rsync capable of doing a file comparison on a byte by byte basis?
Strictly speaking, Yes:
It's a block by block comparison, but you can change the block size.
You could use --block-size=1, but that would be unreasonably inefficient and inappropriate for basically every use case.
The block based rolling checksum is the default behavior over a network.
Use the --no-whole-file option to force this behavior locally. (see below)
Statement 1. Calculating the checksums even introduces a speed down in this case.
This is why it's off by default for local transfers.
Using the --checksum option forces an entire file read, as opposed to the default block-by-block delta-transfer checksum checking
Statement 2. Will rsync now check local source and target files byte by byte or will it use modification time and/or size to determine if the source file needs to be backed up?
By default it will use size & modification time.
You can use a combination of --size-only, --(no-)ignore-times, --ignore-existing and --checksum to modify this behavior.
Statement 3. I definitely do not want to rely on file size or modification times to decide if a backup should take place.
Then you need to use --ignore-times and/or --checksum
Statement 4. supplying --ignore-times as suggested in the comments, is not an option
Perhaps using --no-whole-file and --ignore-times is what you want then ? This forces the use of the delta-transfer algorithm, but for every file regardless of timestamp or size.
You would (in my opinion) only ever use this combination of options if it was critical to avoid meaningless writes (and it has to be specifically the meaningless writes you are trying to avoid, not overall efficiency, since a delta-transfer is not actually more efficient for local files), and if you had reason to believe that files with identical modification stamps and byte sizes could indeed be different.
I fail to see how modification stamp and size in bytes is anything but a logical first step in identifying changed files.
If you compared the following two files:
File 1 (local) : File.bin - 79776451 bytes and modified on the 15 May 07:51
File 2 (remote): File.bin - 79776451 bytes and modified on the 15 May 07:51
The default behaviour is to skip these files. If you're not satisfied that the files should be skipped, and want them compared, you can force a block-by-block comparison and differential update of these files using --no-whole-file and --ignore-times
So the summary on this point is:
Use the default method for the most efficient backup and archive
Use --ignore-times and --no-whole-file to force delta-change (block by block checksum, transferring only differential data) if for some reason this is necessary
Use --checksum and --ignore-times to be completely paranoid and wasteful.
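As a rough illustration of those three modes, using the flags from the question (source and target paths are placeholders):
# 1. Default: skip files with matching size and mtime, copy the rest whole
rsync -abh /path/to/source /path/to/target
# 2. Force a block-by-block delta-transfer on every file, regardless of timestamp or size
rsync -abh --no-whole-file --ignore-times /path/to/source /path/to/target
# 3. Full-file checksums on every file: the thorough-but-wasteful option
rsync -abh --checksum --ignore-times /path/to/source /path/to/target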
Statement 5. Notice the -b option in the rsync instruction. It means that destination files will be backed up before they are replaced
Yes, but this can work however you want it to; it doesn't necessarily mean a full backup every time a file is updated, and it certainly doesn't mean that a full transfer will take place at all.
You can configure rsync to:
Keep 1 or more versions of a file
Use a --backup-dir to build a full incremental backup system.
Doing it this way doesn't waste space beyond what is required to retain the differential data. I can verify that in practice, as there would not be nearly enough space on my backup drives for all of my previous versions to be full copies.
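A minimal sketch of that kind of setup, assuming a dated backup directory (the paths and date format are not from the question):
# Changed or deleted destination files are moved into a per-run backup directory
rsync -ab --backup-dir=/backups/$(date +%Y-%m-%d) /path/to/source /path/to/target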
Some Supplementary Information
Why is Delta-transfer not more efficient than copying the whole file locally?
Because you're not tracking the changes to each of your files. If you actually have a delta file, you can merge just the changed bytes, but you need to know what those changed bytes are first, and the only way to know that is by reading the entire file.
For example:
I modify the first byte of a 10MB file.
I use rsync with delta-transfer to sync this file
rsync immediately sees that the first byte (or a byte within the first block) has changed, and proceeds to update just that block (with --inplace it modifies the block in place; by default it builds the updated copy in a temporary file)
However, rsync doesn't know that only the first byte changed. It will keep checksumming until the whole file has been read
For all intents and purposes:
Consider rsync a tool that conditionally performs a --checksum based on whether or not the file timestamp or size has changed. Overriding this to --checksum is essentially equivalent to --no-whole-file and --ignore-times, since both will:
Operate on every file, regardless of time and size
Read every block of the file to determine which blocks to sync.
What's the benefit then?
The whole thing is a tradeoff between transfer bandwidth, and speed / overhead.
--checksum is a good way to only ever send differences over a network
--checksum while ignoring files with the same timestamp and size is a good way to both only send differences over a network, and also maximize the speed of the entire backup operation
Interestingly, it's probably much more efficient to use --checksum as a blanket option than it would be to force a delta-transfer for every file.
There is no way to do a byte-by-byte comparison of files instead of checksums in the way you are expecting.
The way rsync works is to create two processes, a sender and a receiver, that exchange a list of files and their metadata to decide between them which files need to be updated. This is done even in the case of local files, but then the processes communicate over a pipe rather than a network socket. After the list of changed files has been decided, the changes are sent as deltas or as whole files.
Theoretically, one side could send whole files in the file list so the other could diff them, but in practice this would be rather inefficient in many cases. The receiver would need to keep these files in memory in case it detects the need to update them, or else the changes would have to be re-sent. None of the possible solutions here sounds very efficient.
There is a good overview about (theoretical) mechanics of rsync: https://rsync.samba.org/how-rsync-works.html

What is the fastest and safest way to move an SVN repository from one host to another?

I have two SVN repositories about 1.5 GB each that I need to move from one CentOS 5.4 Linux machine to another. They are up into the three to four thousand revision range.
I could simply scp -r them over. I did try starting that process and it was clear that it was going to take several hours, maybe all night, so I stopped it to reconsider.
I could use svnadmin dump with or without the --deltas option, then compress and scp the dump file.
Is there some better approach?
Yep rsync.
Specifically:
rsync -hxDPavilyzH source/ user@remote:/target/ --stats
svnsync is designed for this, and should be safe provided nothing else writes to the target repo until the copy is complete.
However, rsync should also be safe, and it allows interruption (svnsync may as well, I'm not sure).
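If you go down the svnadmin dump route mentioned in the question, a minimal sketch might look like this (the repository paths are placeholders):
# On the old host: dump and compress the repository
svnadmin dump /var/svn/myrepo | gzip > myrepo.dump.gz
# Transfer myrepo.dump.gz (rsync or scp), then on the new host:
svnadmin create /var/svn/myrepo
gunzip -c myrepo.dump.gz | svnadmin load /var/svn/myrepo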

Alternative to creating multipart .tar.gz files?

I have a folder with >20GB of images on a linux server, I need to make a backup and download it, so I was thinking about using "split" to create 1GB files. My question is: instead of splitting a .tar.gz and then having to join it again on my computer, is there a way I could create 20 x 1GB valid .tar.gz files, so I can then view/extract them separately?
Edit: I forgot to add that I need to do it without ssh access. I'm using mostly PHP.
You could try rsnapshot to back up using rsync/hardlinks instead. It not only solves the file-size issue but also gives you high storage and bandwidth efficiency when existing images aren't changed often.
Why not just use rsync?
FYI, rsync is a command-line tool that synchronises directories between two machines across the network. If you have Linux at both ends and ssh access properly configured, it's as simple as rsync -av server:/path/to/images/ images/ (make sure the trailing slashes are there). It also optimises subsequent synchronisations so that only changes are transmitted. You can even tell it to compress data in transit, but that usually doesn't help with images.
First I would give rsnapshot a miss if you don't have SSH access. (Though I do and love it)
I would assume you're likely backing up JPEGs, and they are already compressed. Zipping them up doesn't make them much smaller, plus you don't need exactly 1 GB files; it sounds like they can be a bit bigger or smaller.
So you could just write a script which bundles JPEGs into a .tar.gz (or whatever) until it has put about 1 GB worth in, and then starts a new archive.
You could do all this in PHP easily enough.
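Though the answer suggests PHP, here is a rough shell sketch of that batching idea (the directory names and the 1 GB threshold are assumptions):
#!/bin/bash
# Bundle images into ~1 GB .tar.gz archives that can each be extracted on their own.
src=/path/to/images      # assumed source directory
out=/path/to/backups     # assumed output directory
limit=$((1024*1024*1024))    # ~1 GB of uncompressed data per archive
batch=(); size=0; n=1
for f in "$src"/*; do
    fsize=$(stat -c %s "$f")
    if (( ${#batch[@]} > 0 && size + fsize > limit )); then
        tar -czf "$out/images_part$n.tar.gz" -- "${batch[@]}"
        batch=(); size=0; n=$((n+1))
    fi
    batch+=("$f")
    size=$((size + fsize))
done
# Flush the final batch
if (( ${#batch[@]} > 0 )); then
    tar -czf "$out/images_part$n.tar.gz" -- "${batch[@]}"
fi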

Redirecting multiple stdouts to single file

I have a program running on multiple machines with NFS and I'd like to log all their outputs into a single file. Can I just run ./my_program >> filename on every machine or is there an issue with concurrency I should be aware of? Since I'm only appending, I don't think there would be a problem, but I'm just trying to make sure.
That could work, but yes, you will have concurrency issues with it, and the log file will be basically indecipherable.
What I would recommend is that there be a log file for each machine and then on some periodical basis (say nightly), concatenate the files together with the machine name as the file name:
for i in /path/to/logfiles/*; do
    echo "Machine: $i"
    cat "$i"
done > filename.log
That should give you some ideas, I think.
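On each machine, the per-host log file could then be produced with something like this (the path is an assumption):
./my_program >> "/path/to/logfiles/$(hostname).log"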
The NFS protocol does not support atomic append writes, so append writes are never atomic on NFS for any platform. Files WILL end up corrupt if you try.
When appending to files from multiple threads or processes, writes to a file are atomic on the condition that the file was opened in append mode, the data written does not exceed the filesystem block size, and the filesystem is local. With NFS that last condition does not hold.
There is a workaround, although I would not know how to do it from a shell script. The technique is called close-to-open cache consistency.

What happens if there are too many files under a single directory in Linux?

If there are around 1,000,000 individual files (mostly around 100 KB in size) in a single directory, flatly (no other directories or files in them), are there going to be any compromises in efficiency or disadvantages in any other way?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to say "too many arguments". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on how much argument data can be presented to a program's entry point. So, let's take 'rm' and the example "rm -rf *": the shell is going to turn '*' into a space-delimited list of files which in turn becomes the arguments to 'rm'.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
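As an aside, the usual way around the ARG_MAX limit is to have find build the list instead of relying on a shell glob, for example:
# Delete matching files without passing them all on one command line
find /path/to/dir -maxdepth 1 -name 'foo*' -exec rm -f {} +
# Or feed them to another tool in safely sized chunks
find /path/to/dir -maxdepth 1 -name 'foo*' -print0 | xargs -0 ls -ld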
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use Ext3 by default, which can use b-tree indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
aaavnj78t93ufjw4390
aaavoj78trewrwrwrwenjk983
aaaz84390842092njk423
...
abc/
abckhr89032423
abcnjjkth29085242nw
...
...
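A rough shell sketch of sharding an existing flat directory into that layout (it assumes file names are longer than three characters, so a file name never collides with a prefix directory):
cd /path/to/flat_dir || exit 1
find . -maxdepth 1 -type f -print0 |
while IFS= read -r -d '' f; do
    name=${f#./}            # strip the leading ./ that find adds
    prefix=${name:0:3}      # first three characters become the subdirectory
    mkdir -p "$prefix"
    mv -- "$name" "$prefix/"
done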
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into sub-folders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to run ls.
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present till first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read from the file containing the filenames and process each file (with limited parallelism):
while read -r line; do
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with "$line" here, with up to 10x fan-out
        :
    } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.
