Using `rsync` and `ssh` in Perl causes unexplained timeout? - linux

I'm running a two-stage process on a large number of files.
CODE:
$server_sleep = 1;
$ssh_check = 'ssh '.$destination_user."@".$destination_hostname.' "test -e '.$destination_path.$file_filename.'.txt && echo 1 || echo 0"';
while (`$ssh_check` ne "1\n") { # for some reason, the backticks return the 1 with a newline
    $upload_command = "/usr/bin/rsync -qogt --timeout=".$server_sleep." --partial --partial-dir=".$destination_path."partials ".$file_path."/".$file_filename.".txt ".$destination_user."@".$destination_hostname.":".$destination_path;
    sleep $server_sleep; # to avoid hammering the server (for the rsync)
    $upload_result = `$upload_command 2>&1`;
    $file_errorReturn = "FAIL" if $?;
    if (defined($file_errorReturn)) {
        # log an error. there is code to do this, but I have omitted it.
    }
    sleep $server_sleep; # to avoid hammering the server (for the ssh check)
    $server_sleep++; # increase the timeout if failures continue
}
BEHAVIOUR:
For the first few files this works fine (which should take care of your first few questions about keys, access, permissions, typos, etc.), but at some point I get this error back:
ssh: connect to host remote_server.com port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
I get this regardless of whether I have specified -e ssh in the command, so I assume there is a default to ssh somewhere (which is fine). I have also tried the rsync section with scp, and it resulted in a similar connection-timed-out error:
ssh: connect to host remote_server.com port 22: Connection timed out
lost connection
QUESTIONS YOU MAY HAVE
1) Because the first few files work, the path is clear for this to work (i.e. there should be no problems with typos, permissions, etc.). My debug code outputs the actual command that is tried, and that command works fine on the command line (even for the files that fail in the script).
2) I have tried adding -vvvvv to both ssh and rsync, but I don't know how to get them to output more error information from WITHIN my script (see the logging sketch below for one way to capture it). All I ever get is the errors above, and when I run the command on the command line I get no errors at all (even after adding "2>&1" and ">> log.txt" to the end of both commands). It is certainly possible that I'm not collecting all the logs I should be, so your help there would be appreciated as well.
3) I am just a regular user on both the local and remote machines.
local: rsync version 3.0.6 protocol version 30
remote: rsync version 3.0.9 protocol version 30
path to ssh and rsync is the same on both.
4) In response to the excellent question (from qwrrty) in the comment (thanks!):
It is not terribly consistent. The files are numbered and they were being run in the following order: 4, 5, 3, 2, 1. It WAS failing on 1. Then I removed 3. It still failed on 1. When I put 3 back in, it started to fail on 2.
The files are all small (5 MB max), so the transfer is basically instant (the machines are close to each other both physically and network-wise).
Please let me know if you need more detail. Thanks in advance for any advice you can offer.
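For question 2, one way to collect more output from inside the loop is a logging wrapper along these lines (a sketch only; the log path is an arbitrary choice, and to see ssh's own -vv messages you would also drop -q from the rsync options and pass -e 'ssh -vv' to rsync):
# Sketch: inside the while loop, log each attempt's command, exit status and combined output.
# The log file path is an arbitrary choice for illustration.
open(my $log, '>>', '/tmp/upload_debug.log') or die "cannot open log: $!";
my $output = `$upload_command 2>&1`;   # stderr merged, so ssh/rsync errors are captured too
my $status = $? >> 8;
print $log "COMMAND: $upload_command\nEXIT: $status\nOUTPUT:\n$output\n";
close $log;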

Can you try to minimize extra ssh sessions and subprocesses by doing it all in a single rsync? Something like this?
open (RSYNC, "| /usr/bin/rsync -qogt "
    . "--files-from=- "
    . "--ignore-existing "
    . "--timeout=${server_sleep} "
    . "--partial --partial-dir=${destination_path}partials "
    . "${destination_user}\@${destination_hostname}:${destination_path}")
    or die "could not start rsync: $!";
for $f (@big_list_of_files) {
    print RSYNC $f, "\n";
}
close RSYNC;
rsync has quite a lot of smarts built into it around how to transfer and synchronize large numbers of files at a time, and in my experience it usually works best to let it do as much of the work as possible.
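For comparison, the same idea as a single shell command might look roughly like this (a sketch with placeholder paths; with --files-from, the listed names are read relative to the source directory argument):
# file_list.txt holds one filename per line, relative to /local/source/dir
rsync -qogt --files-from=file_list.txt --ignore-existing \
    --partial --partial-dir=/destination/path/partials \
    /local/source/dir user@remote_server.com:/destination/path/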

Related

Rsync system() execution in Perl fails randomly with protocol error

We commonly call rsync from (Mod)Perl (and PHP) and haven't run into too many issues, but we are hitting an occasional protocol error on a command that, when run again on a subsequent request, goes through fine. The funny thing is that even if I retry in code within the same HTTP request, it will fail every time, but if you make another HTTP request it will likely succeed.
Code looks like this:
$cmd = sprintf('rsync -auvvv --rsync-path="rsync --log-file=/tmp/ui-rsync.log" %s %s', "$fromDir/$fromBase/", "$path/$toBase/");
$exitCode = system($cmd);
The --rsync-path argument was added in there later for debugging. It was unhelpful. It fails either way.
The errors look like this:
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
or like this:
unexpected tag 93 [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(1134) [receiver=3.0.6]
rsync: connection unexpectedly closed (9 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
I print the actual generated commands for debugging, and I can run them by hand just fine.
The http user can run the commands fine.
Again, a programmatic retry never works, but a manual retry (hitting the same http endpoint that triggers it) almost always works.
Appreciate any help as this has been driving us crazy for...a long time, with many fixes tried.
If this is a real heisenbug, you could retry rsync maybe three times with some sleep between:
for my $n (1..3) {
    my $exitCode = system($cmd);
    if ($exitCode == 0) {
        my_log_sub("SUCCESS: rsync succeeded on try $n");
        last;
    }
    my_log_sub("ERROR: rsync $n of 3 failed: $cmd $! $?");
    sleep(1) if $n < 3;
}
Have you checked your local and remote logs? Try sudo ls -rtl /var/log/ or sudo ls -rtl /var/log/httpd/ right after a fail and tail -f /var/log/one_of_the_newest_logs while retrying.
Have you checked whether the remote disk is full or whether the target directory exists? A firewall issue? Are the remote and local rsync or ssh versions (very) different? (Although I would guess that would show a clearer error message.)
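Those checks can be scripted if that helps; a sketch (user, host and path are placeholders):
remote="user@remote_server.com"     # placeholder
dest="/path/to/destination"         # placeholder

ssh "$remote" "df -h $dest"                                  # is the remote disk full?
ssh "$remote" "test -d $dest && echo 'exists' || echo 'MISSING'"
rsync --version | head -n1                                   # local rsync version
ssh "$remote" "rsync --version | head -n1"                   # remote rsync version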
The solution was to change system() to backticks. Seriously. I don't know why it works.
The change is literally this:
# BAD:
$exitCode = system($cmd);
# GOOD:
`$cmd`;
If I had to guess I'd say there's some subtle difference with how the shell is being initialized, maybe some environment variables or memory locations not being cleaned properly. I really don't know, though.
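If you keep the backticks but still want the exit status and the output for logging, a minimal sketch (using the same $cmd as above) is:
my $output   = `$cmd 2>&1`;     # capture stdout and stderr together
my $exitCode = $? >> 8;         # high byte of $? holds the child's exit status
warn "rsync failed ($exitCode): $output" if $exitCode != 0;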

broken pipe with remote rsync between two servers

I am trying to transfer a large dataset (768 GB) from one remote machine to another using bash on Ubuntu 16.04. The problem is that rsync transfers for a few hours and then quits when the connection inevitably gets interrupted. So suppose I'm on machine A and the remote servers are machines B and C (all machines running Ubuntu 16.04). I ssh to machine B and use this command:
nohup rsync -P -r -e ssh /path/to/files/on/machine_B user@machine_C:directory &
Note that I have authorized keys set up, so no password is required between machines B and C.
A few hours later I get the following in the nohup file:
sending incremental filelist
file_1.bam
90,310,583,648 100% 36.44MB/s 0:39:23 (xfr#4, to-chk=5/10)
file_2.bam
79,976,321,885 100% 93.25MB/s 0:13:37 (xfr#3, to-chk=6/10)
file_3.bam
88,958,959,616 88% 12.50MB/s 0:15:28 rsync error: unexplained error (code 129) at rsync.c(632) [sender=3.1.1]
rsync: [sender] write error: Broken pipe (32)
I used nohup because I thought it would keep running even if there was a hangup. I have not tried sh -c, and I have not tried running the command from machine A, because at this point whatever I try would be guesswork; ideas would be appreciated.
For those who are interested, I also tried running the following script under nohup on machine B.
script:
chomp( my @files = `ls /path/to/files/on/machineB/*` );
foreach ( @files ) { system("scp $_ user\@machineC:destination/"); }
I still got truncated files.
At the moment the following command appears to be working:
nohup rsync -P --append -r -e ssh /path/to/files/on/machine_B user@machine_C:directory &
You just have to check the nohup file for a broken pipe error and re-enter the command if necessary.
I had the same problem and solved it in multiple steps:
First I made sure that I ran all commands on tmux terminals. This adds a layer of safety on top of nohup, as it keeps connections alive: https://en.wikipedia.org/wiki/Tmux
I combined the rsync command with a while loop so that the copy is reattempted indefinitely even if the pipe breaks:
while ! rsync <your_source> <your_destination>; do echo "Rsync failed. Retrying ..."; done
This approach is brute force, and it works as long as rsync manages to copy at least a few files on each attempt. Eventually, even with wasteful repeats and multiple failures, all the files will be copied and the command above will exit gracefully.
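Putting those pieces together with resumable transfers and ssh keepalives, a sketch might look like this (source, destination and the timing values are placeholders; --append-verify assumes rsync 3.x on both ends):
SRC=/path/to/files/on/machine_B      # placeholder
DST=user@machine_C:directory         # placeholder

until rsync -P -r --append-verify --timeout=300 \
        -e 'ssh -o ServerAliveInterval=15 -o ServerAliveCountMax=4' \
        "$SRC" "$DST"; do
    echo "rsync failed ($?), retrying in 60s ..." >&2
    sleep 60
done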

scp hangs in script sometimes

I need some help to try and figure out a problem we are experiencing. We have the following bash shell script running on devices on two separate networks (network1 and network2). Both networks go to the same destination server.
while true
do
    # do something ...
    scp *.zip "$username@$server_ip:$destination_directory"
    # do something ...
    sleep 30
done
The script worked fine until a recent change to network2, after which the scp command in the script above sometimes hangs for hours before resetting. The same script is still working fine on network1, which did not change. We are not able to identify what the issue is with network2; everything seems to work except scp. The hang does not happen on every try, but when it does hang it hangs for hours.
So I changed the scp command as follows, and it now resets within minutes; the data delay is bearable but not desirable.
scp -o BatchMode=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -o \
ConnectTimeout=60 *.zip "$username@$server_ip:$destination_directory"
I also tried sftp as follows:
sftp -o ConnectTimeout=60 -b "batchfiles.txt" "$username#$server_ip"
The ConnectTimeout does not seem to work well in sftp because it still hangs for hours sometimes. So I am back to using scp.
I even included the -o IdentityFile=path_to_key/id_rsa option in both scp and sftp, thinking it might be an authentication issue. That did not work either.
What is really strange is that it always works when I issue the same commands from a terminal. The shell script runs as a background task. I am running Linux 3.8.0-26-generic #38-Ubuntu and OpenSSH_6.1p1 Debian-4. I don't think it is a local script permissions issue because: 1) it worked before network2 changed, and 2) it works some of the time.
I did a network packet capture. I can see that each time the scp command hangs, it is accompanied by [TCP Retransmission] and [RST, ACK] within seconds of the start of the scp conversation.
I am very confused as to whether the issue is network-related or script-related. Based on the sequence of events, I am thinking it is likely due to the recent change in network2. But why does the same command work from a terminal every time I try it?
Can someone kindly tell me what my issue is, or tell me how to go about troubleshooting it?
Thank you for reading and helping.
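One way to stop a wedged scp from stalling the loop for hours is to wrap it in a hard time limit and retry (a sketch using GNU coreutils timeout; the 5-minute limit and 3 attempts are arbitrary assumptions):
for attempt in 1 2 3; do
    if timeout 300 scp -o BatchMode=yes -o ConnectTimeout=60 \
            -o ServerAliveInterval=10 -o ServerAliveCountMax=3 \
            *.zip "$username@$server_ip:$destination_directory"; then
        break
    fi
    echo "scp attempt $attempt failed or timed out" >&2
    sleep 30
done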

How to debug bash expansion?

I am trying to figure out why bash autocompletion on the filesystem is slow on my PC. My Linux machine is connected to an AD domain through PAM, and I suspect bash is trying to query a network mount (which is slow, since it queries PAM) every time I press TAB to autocomplete.
I have tried set -x, and when I autocomplete on /var the slowest operation is the following line:
[[ /var == ~* ]]
Also, the following line takes a few seconds to execute in bash when I am connected to the network, whereas it returns immediately when I am not:
TEMP=~*
I would like to know what bash is trying to expand ~* to or find a workaround.
Try running it with strace, for example:
strace echo $FOO
If the system is accessing your mount, you will know.
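To aim strace at the expansion itself rather than at an arbitrary command, something like this should show whether a user-database lookup is going over the network (the trace file path is arbitrary):
# Trace only network-related syscalls while bash performs the slow expansion.
strace -f -e trace=network -o /tmp/tilde.trace bash -c 'TEMP=~*'

# connect() calls here (to an sssd/nscd socket or an LDAP server) indicate the
# user-database lookup triggered when bash treats "*" as a login name.
grep connect /tmp/tilde.trace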

how to re-run the "curl" command automatically when the error occurs

Sometimes when I execute a bash script with the curl command to upload some files to my ftp server, it will return some error like:
56 response reading failed
and I have to find the failed line and re-run it manually, and then it is OK.
I'm wondering if it could be re-run automatically when the error occurs.
My script is like this:
#there are some files(A,B,C,D,E) in my to_upload directory,
# which I'm trying to upload to my ftp server with curl command
for files in `ls` ;
do curl -T $files ftp.myserver.com --user ID:pw ;
done
But sometimes A, B, C would be uploaded successfully and only D would be left with an "error 56", so I have to rerun the curl command manually. Besides, as Will Bickford said, I would prefer that no confirmation be required, because I'm always asleep at the time the script is running. :)
Here's a bash snippet I use to perform exponential back-off:
# Retries a command a configurable number of times with backoff.
#
# The retry count is given by ATTEMPTS (default 5), the initial backoff
# timeout is given by TIMEOUT in seconds (default 1.)
#
# Successive backoffs double the timeout.
function with_backoff {
    local max_attempts=${ATTEMPTS-5}
    local timeout=${TIMEOUT-1}
    local attempt=1
    local exitCode=0

    while (( $attempt < $max_attempts ))
    do
        if "$@"
        then
            return 0
        else
            exitCode=$?
        fi

        echo "Failure! Retrying in $timeout.." 1>&2
        sleep $timeout
        attempt=$(( attempt + 1 ))
        timeout=$(( timeout * 2 ))
    done

    if [[ $exitCode != 0 ]]
    then
        echo "You've failed me for the last time! ($@)" 1>&2
    fi

    return $exitCode
}
Then use it in conjunction with any command that properly sets a failing exit code:
with_backoff curl 'http://monkeyfeathers.example.com/'
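Applied to the upload loop from the question (same curl arguments; the ATTEMPTS/TIMEOUT values are just examples), it might look like this:
# with_backoff retries each upload with exponential backoff;
# ATTEMPTS and TIMEOUT here override the defaults for illustration.
for f in ./to_upload/*; do
    ATTEMPTS=6 TIMEOUT=2 with_backoff curl -T "$f" ftp.myserver.com --user ID:pw
done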
Perhaps this will help. It will try the command, and if it fails, it will tell you and pause, giving you a chance to fix run-my-script.sh.
COMMAND=./run-my-script.sh
until $COMMAND; do
read -p "command failed, fix and hit enter to try again."
done
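If the command needs arguments, storing it in a bash array avoids word-splitting and quoting surprises; a sketch reusing the curl call from the question (the file name D is just the example that failed above):
# A bash array keeps a command and its arguments intact when expanded.
COMMAND=(curl -T ./to_upload/D ftp.myserver.com --user ID:pw)
until "${COMMAND[@]}"; do
    read -p "command failed, fix and hit enter to try again."
done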
I have faced a similar problem, where I needed to contact servers with curl that were still in the process of starting up, or services that were temporarily unavailable for whatever reason. The scripting was getting out of hand, so I made a dedicated retry tool that retries a command until it succeeds:
#there are some files(A,B,C,D,E) in my to_upload directory,
# which I'm trying to upload to my ftp server with curl command
for files in `ls` ;
do retry curl -f -T $files ftp.myserver.com --user ID:pw ;
done
The curl command has the -f option, which makes curl return exit code 22 when the server reports a failure.
By default, the retry tool runs the curl command over and over until it returns status zero, backing off for 10 seconds between retries. In addition, retry reads from stdin once and only once, writes to stdout once and only once, and writes all stdout to stderr if the command fails.
Retry is available from here: https://github.com/minfrin/retry
