rsync hangs in the middle of a transfer at a fixed position - linux

I am trying to use rsync to transfer some big files between servers.
For some reason, when the file is big enough (2GB - 4GB), rsync would hang in the middle at exactly the same position, i.e., the progress at which it hung was always the same place even when I retried.
If I remove the file from the destination server first, then rsync works fine.
This is the command I used:
/usr/bin/rsync --delete -avz --progress --exclude-from=excludes.txt /path/to/src user@server:/path/to/dest
I have tried adding --delete-during and --delete-delay, but no luck.
The rsync version is 3.1.0, protocol version 31.
Any advice please? Thanks!

Eventually I solved the problem by removing the compression option (-z).
I still don't know why that is.

I had the same problem (trying to rsync multiple files of up to 500GiB each between my home NAS and a remote server).
In my case the solution (mentioned here) was to add to "/etc/ssh/sshd_config" (on the server to which I was connecting) the following:
ClientAliveInterval 15
ClientAliveCountMax 240
"ClientAliveInterval X" will send a kind of message/"probe" every X seconds to check if the client is still alive (default is not to do anything).
"ClientAliveCountMax Y" will terminate the connection if after Y-probes there has been no reply.
I guess that the root cause of the problem is that in some cases the compression (and/or block diff) that is performed locally on the server takes so much time that while that's being done the SSH-connection (created by the rsync-program) is automatically dropped while that's still ongoing.
Another workaround (e.g. if "sshd_config" cannot be changed) might be to use with rsync the option "--new-compress" and/or a lower compression level (e.g. "rsync --new-compress --compress-level=1" etc...): in my case the new compression (and diff) algorithm is a lot faster than the old/classical one, therefore the ssh-timeout might not occur than when using its default settings.
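For example, the command from the original question could be adapted along these lines (just a sketch, assuming the installed rsync is new enough to support --new-compress; the paths and host are the question's placeholders):
# compress with the newer, faster algorithm at a low level so the server-side
# work finishes before the ssh connection times out
/usr/bin/rsync --delete -av --new-compress --compress-level=1 --progress --exclude-from=excludes.txt /path/to/src user@server:/path/to/dest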

The problem for me was that I thought I had plenty of disk space on a drive, but the partition was not using the whole disk and was half the size I expected.
So check the available space with lsblk and df -h and make sure the disk you are writing to reports all the space available on the device.
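For reference, comparing the output of these two commands (the destination path is a placeholder) shows whether the mounted filesystem really covers the whole device:
# sizes of the physical disks and their partitions
lsblk
# size and free space of the filesystem you are writing to
df -h /path/to/dest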

Related

How can I use rsync instead of scp in my shell script below to copy the files?

I am using scp to copy files in parallel using GNU Parallel with the shell script below, and it is working fine.
I am not sure how I can use rsync in place of scp in this script. I want to see whether rsync gives better performance than scp in terms of transfer speed.
Below is my problem description -
I am copying files from machineB and machineC into machineA, and I am running the shell script below on machineA.
If a file is not on machineB then it should definitely be on machineC, so I try copying the files from machineB first; if they are not there, I try copying the same files from machineC.
I am copying the files in parallel using the GNU Parallel library, and it is working fine. Currently I am copying five files in parallel for both PRIMARY and SECONDARY.
Below is the shell script which I have -
#!/bin/bash
export PRIMARY=/test01/primary
export SECONDARY=/test02/secondary
readonly FILERS_LOCATION=(machineB machineC)
export FILERS_LOCATION_1=${FILERS_LOCATION[0]}
export FILERS_LOCATION_2=${FILERS_LOCATION[1]}
PRIMARY_PARTITION=(550 274 2 546 278) # this will have more file numbers
SECONDARY_PARTITION=(1643 1103 1372 1096 1369 1568) # this will have more file numbers
export dir3=/testing/snapshot/20140103
do_Copy() {
    el=$1
    PRIMSEC=$2
    scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
}
export -f do_Copy
parallel --retries 10 -j 5 do_Copy {} $PRIMARY ::: "${PRIMARY_PARTITION[@]}" &
parallel --retries 10 -j 5 do_Copy {} $SECONDARY ::: "${SECONDARY_PARTITION[@]}" &
wait
echo "All files copied."
Is there any way of replacing the scp command above with rsync, while still copying 5 files in parallel for both PRIMARY and SECONDARY simultaneously?
rsync is designed to efficiently synchronise two hierarchies of folders and files.
While it can be used to transfer individual files, it won't help you very much used like that, unless you already have a version of the file at each end with small differences between them. Running multiple instances of rsync in parallel on individual files within a hierarchy defeats the purpose of the tool.
While triplee is right that your task is I/O-bound rather than CPU-bound, and so parallelizing the tasks won't help in the typical case whether you're using rsync or scp, there is one circumstance in which parallelizing network transfers can help: if the sender is throttling requests. In that case, there may be some value to running an instance of rsync for each of a number of different folders, but it would complicate your code, and you'd have to profile both solutions to discover whether you were actually getting any benefit.
In short: just run a single instance of rsync; any performance increase you're going to get from another approach is unlikely to be worth it.
You haven't really given us enough information to know if you are on a sensible path or not, but I suspect you should be looking at lsyncd or possibly even GlusterFS. These are different from what you are doing in that they are continuous sync tools rather than periodically run, though I suspect that you could run lsyncd periodically if that's what you really want. I haven't tried out lsyncd 2.x yet, but I see that they've added parallel synchronisation processes. If your actual scenario involves more than just the three machines you've described, it might even make sense to look at some of the peer-to-peer file sharing protocols.
In your current approach, unless your files are very large, most of the delay is likely to be associated with the overhead of setting up connections and authenticating them. Doing that separately for every single file is expensive, particularly over an ssh based protocol. You'd be better off breaking your file list into batches and passing those batches to your copying mechanism. Whether you use rsync for that is likely to be of lesser importance, but if you first construct a list of files for an rsync process to handle, you can pass it to rsync with the --files-from option.
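As a sketch of that idea, reusing the variable names from the question (it ignores the machineC fallback and assumes the wanted files all live under $dir3 on machineB):
# build one list of the wanted filenames...
printf 'new_weekly_2014_%s_200003_5.data\n' "${PRIMARY_PARTITION[@]}" > primary_list.txt
# ...and let a single rsync (one connection, one authentication) fetch them all
rsync -av --files-from=primary_list.txt david@machineB:"$dir3" "$PRIMARY"/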
You want to work out what the limiting factor is in your sync speed. Presumably it's one of network bandwidth, network latency, file I/O, or perhaps CPU (checksumming or compression, but probably only if you have low-end hardware).
It's likely also important to know something about the pattern of changes in files from one synchronisation run to another. Are there many unchanged files from the previous run? Do existing files change? Do those changes leave a significant number of blocks unchanged (e.g. database files), or do they only get appended to (e.g. log files)? Can you safely count on metadata like file modification times and sizes to identify what's changed, or do you need to checksum the entire content?
Is your file content compressible? E.g. if you're copying plain text, you probably want to use compression options in scp or rsync, but if you have already-compressed image or video files, then compressing again would only slow you down. rsync is mostly helpful if you have files where just part of the file changes.
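As a rough illustration (the paths are hypothetical):
# text-heavy data: compressing in transit usually pays off
rsync -avz server:/var/log/myapp/ logs/
# already-compressed images/video: leave -z off, it only costs CPU
rsync -av server:/data/photos/ photos/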
You can download single files with rsync just as you would with scp. Just make sure not to use the rsync:// or hostname::path formats that call the daemon.
It can at the very least make the two remote hosts work at the same time. Additionally, if the files are on different physical disks or happen to be in the cache, parallelizing the transfers even on a single host can help. That's why I disagree with the other answer saying a single instance is necessarily the way to go.
I think you can just replace
scp david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/. || scp david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/.
by
rsync david@$FILERS_LOCATION_1:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data || rsync david@$FILERS_LOCATION_2:$dir3/new_weekly_2014_"$el"_200003_5.data $PRIMSEC/new_weekly_2014_"$el"_200003_5.data
(note that the change is not only the command)
Perhaps you can get additional speed because rsync will use the delta-transfer algorithm, whereas scp will blindly copy.

CentOS free space on disk not updating

I am new to Linux and working with a CentOS system.
Running the command df -H shows the disk is 82% full, i.e. only 15GB is free.
I wanted some extra space, so using WinSCP I shift-deleted a 15GB file,
then executed df -H again, but it still shows only 15 GB free. Where did the space
from the deleted file go?
Please help me find a solution to this.
In most Unix filesystems, if a file is open, the OS will delete the file right away but will not release the space until the file is closed. Why? Because the file is still in use by the process that opened it.
On the other side, Windows used to complain that it can't delete a file because it is in use; it seems that in later incarnations Explorer will pretend to delete the file.
Some applications are famous for bad behaviour related to this fact. For example, I have to deal with some versions of MySQL that will not properly close some files; over time I can find several GB of space wasted in /tmp.
You can use the lsof command to list open files (man lsof). If the problem is related to open files and you can afford a reboot, that is most likely the easiest way to fix it.
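For example, to spot space that is still held by deleted-but-open files:
# open files whose link count is below 1, i.e. deleted but still held open
lsof +L1
# or, more loosely, grep for the "(deleted)" marker in lsof's output
lsof | grep -i deleted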

Clearing Large Apache Domain Logs

I am having an issue where Apache logs are growing out of proportion on several servers (Linux CentOS 5)... I will eventually disable logging completely but for now I need a quick fix to reclaim the hard disk space.
I have tried using echo " " > /path/to/log.log and * > /path/to/log.log, but they take too long and almost crash the server, as the logs are as large as 100GB.
Deleting the files works fast, but my question is: will it cause a problem when I restart Apache? My servers are live and full of users, so I can't crash them.
Your help is appreciated.
Use the truncate command
truncate -s 0 /path/to/log.log
In the longer term you should use logrotate to keep the logs from getting out of hand.
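A minimal logrotate rule might look something like this (a sketch only; the /var/log/httpd path and the reload command are assumptions for a CentOS layout, adjust to yours). It would go in /etc/logrotate.d/httpd:
/var/log/httpd/*log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        /sbin/service httpd reload > /dev/null 2>&1 || true
    endscript
}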
Try this:
cat /dev/null > /path/to/log.log
mv /path/to/log.log /path/to/log.log.1
Do this for your access and error logs and, if you are really doing it on prod, your rewrite logs.
This doesn't affect Apache on *nix, since the file stays open. Then restart Apache. Yes, I know I said restart, but this usually takes a second or so, so I doubt that anyone will notice -- or they'll blame it on the network. The restarted Apache will be running with a new set of log files.
In terms of your current logs, IMO you need to keep at least the last 3 months of error logs and 1 month of access logs, but look at your volumetrics to decide your rough per-week volumes for error and access logs. Don't truncate the old files. If necessary, do a nice tail piped to gzip -c of these to create archives. If you want to split them, use a loop doing a tail|head|gzip with the --bytes=nnG option. OK, you'll split across the odd line, but that's better than deleting the lot as you suggest.
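For instance, something along these lines (paths are placeholders) keeps roughly the last 1 GiB of a rotated 100GB access log as a compressed archive before you discard the rest:
# archive the tail end of the rotated log at low CPU priority
nice tail --bytes=1G /path/to/log.log.1 | gzip -c > /path/to/archive/access-recent.log.gz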
Of course, you could just delete the lot as you and others propose, but what are you going to do if you've realised that the site has been hacked recently? "Sorry: too late; I've deleted the evidence!"
Then for goodness sake implement a logrotate regime.

Alternative to creating multipart .tar.gz files?

I have a folder with >20GB of images on a linux server, I need to make a backup and download it, so I was thinking about using "split" to create 1GB files. My question is: instead of splitting a .tar.gz and then having to join it again on my computer, is there a way I could create 20 x 1GB valid .tar.gz files, so I can then view/extract them separately?
Edit: I forgot to add that I need to do it without ssh access. I'm using mostly PHP.
You could try rsnapshot to back up using rsync/hardlinks instead. It not only solves the file-size issue but also gives you high storage and bandwidth efficiency when existing images aren't changed often.
Why not just use rsync?
FYI, rsync is a command-line tool that synchronises directories between two machines across the network. If you have Linux at both ends and ssh access properly configured, it's as simple as rsync -av server:/path/to/images/ images/ (make sure the trailing slashes are there). It also optimises subsequent synchronisations so that only changes are transmitted. You can even tell it to compress data in transit, but that usually doesn't help with images.
First I would give rsnapshot a miss if you don't have SSH access. (Though I do and love it)
I would assume you're likely backing up JPEGs, and they are already compressed. Zipping them up doesn't make them much smaller, plus you don't need exactly 1GB files; it sounds like they can be a bit bigger or smaller.
So you could just write a script which bundles images into a .tar.gz (or whatever) until it has put about 1GB worth in and then starts a new archive.
You could do all this in PHP easily enough.
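A rough shell sketch of that idea (the question mentions PHP, so treat this as pseudocode for the same approach; the image path and archive names are placeholders):
#!/bin/bash
# bundle images into roughly 1GB tar.gz chunks, each one a valid standalone archive
chunk=0; size=0; list=()
for f in /path/to/images/*; do
    s=$(stat -c%s "$f")                      # file size in bytes
    if (( ${#list[@]} > 0 && size + s > 1024*1024*1024 )); then
        tar -czf "images-part$chunk.tar.gz" "${list[@]}"
        chunk=$((chunk + 1)); size=0; list=()
    fi
    list+=("$f"); size=$((size + s))
done
# flush whatever is left over into a final archive
(( ${#list[@]} > 0 )) && tar -czf "images-part$chunk.tar.gz" "${list[@]}"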

What happens if there are too many files under a single directory in Linux?

If there are, say, 1,000,000 individual files (mostly around 100k in size) in a single directory, flat (no subdirectories inside it), are there going to be any compromises in efficiency or disadvantages in any other way?
ARG_MAX is going to take issue with that... for instance, rm -rf * (while in the directory) is going to say "too many arguments". Utilities that want to do some kind of globbing (or a shell) will have some functionality break.
If that directory is available to the public (let's say via FTP or a web server) you may encounter additional problems.
The effect on any given file system depends entirely on that file system. How frequently are these files accessed, what is the file system? Remember, Linux (by default) prefers keeping recently accessed files in memory while putting processes into swap, depending on your settings. Is this directory served via http? Is Google going to see and crawl it? If so, you might need to adjust VFS cache pressure and swappiness.
Edit:
ARG_MAX is a system-wide limit on the combined size of the arguments (and environment) that can be passed to a program. So, let's take 'rm' and the example "rm -rf *": the shell expands '*' into a list of filenames, which in turn becomes the argument list for 'rm', and with enough files that list exceeds the limit.
The same thing is going to happen with ls, and several other tools. For instance, ls foo* might break if too many files start with 'foo'.
I'd advise (no matter what fs is in use) to break it up into smaller directory chunks, just for that reason alone.
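For example, the usual workaround is to check the limit and then let find/xargs batch the arguments instead of relying on one huge glob expansion (a sketch; adjust the patterns to your case):
# show the per-exec argument-size limit
getconf ARG_MAX
# delete every regular file in the directory without building a giant command line
find . -maxdepth 1 -type f -print0 | xargs -0 rm -f
# list files matching a prefix the same way
find . -maxdepth 1 -name 'foo*' -print0 | xargs -0 ls -ld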
My experience with large directories on ext3 and dir_index enabled:
If you know the name of the file you want to access, there is almost no penalty
If you want to do operations that need to read in the whole directory entry (like a simple ls on that directory) it will take several minutes for the first time. Then the directory will stay in the kernel cache and there will be no penalty anymore
If the number of files gets too high, you run into ARG_MAX et al problems. That basically means that wildcarding (*) does not always work as expected anymore. This is only if you really want to perform an operation on all the files at once
Without dir_index however, you are really screwed :-D
Most distros use ext3 by default, which can use b-tree indexing for large directories.
Some distros have this dir_index feature enabled by default; in others you'd have to enable it yourself. If you enable it, there's no slowdown even for millions of files.
To see if dir_index feature is activated do (as root):
tune2fs -l /dev/sdaX | grep features
To activate dir_index feature (as root):
tune2fs -O dir_index /dev/sdaX
e2fsck -D /dev/sdaX
Replace /dev/sdaX with partition for which you want to activate it.
When you accidentally execute "ls" in that directory, or use tab completion, or want to execute "rm *", you'll be in big trouble. In addition, there may be performance issues depending on your file system.
It's considered good practice to group your files into directories which are named by the first 2 or 3 characters of the filenames, e.g.
aaa/
    aaavnj78t93ufjw4390
    aaavoj78trewrwrwrwenjk983
    aaaz84390842092njk423
    ...
abc/
    abckhr89032423
    abcnjjkth29085242nw
    ...
...
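A sketch of how existing files could be sharded that way (run inside the big directory; assumes bash):
for f in *; do
    [ -f "$f" ] || continue        # skip anything that isn't a regular file
    d=${f:0:3}                     # first 3 characters become the bucket name
    mkdir -p "$d" && mv -- "$f" "$d/"
done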
The obvious answer is that the folder will be extremely difficult for humans to use long before any technical limit is reached (the time taken to read the output of ls, for one; there are dozens of other reasons). Is there a good reason why you can't split it into sub-folders?
Not every filesystem supports that many files.
On some of them (ext2, ext3, ext4) it's very easy to hit the inode limit.
I've got a host with 10M files in a directory. (don't ask)
The filesystem is ext4.
It takes about 5 minutes to
ls
One limitation I've found is that my shell script to read the files (because AWS snapshot restore is a lie and files aren't present until first read) wasn't able to handle the argument list, so I needed to do two passes. First, construct a file list with find (-wholename in case you want to do partial matches):
find /path/to_dir/ -wholename '*.ldb' | tee filenames.txt
Then read from the file containing the filenames and process all the files (with limited parallelism):
while read -r line; do
    # keep at most 10 jobs in flight (wait -n requires bash 4.3+)
    if test "$(jobs | wc -l)" -ge 10; then
        wait -n
    fi
    {
        # do something with "$line" here, with 10x fanout
        :
    } &
done < filenames.txt
Posting here in case anyone finds the specific work-around useful when working with too many files.
