Rsync system() execution in Perl fails randomly with protocol error - linux

We commonly call rsync from (Mod)Perl (and PHP) and haven't run into many issues, but we are hitting an occasional protocol error on a command that goes through fine on a subsequent request. The odd thing is that even if I retry in code within the same HTTP request, it fails every time, but if I make another HTTP request it will likely succeed.
Code looks like this:
$cmd = sprintf('rsync -auvvv --rsync-path="rsync --log-file=/tmp/ui-rsync.log" %s %s', "$fromDir/$fromBase/", "$path/$toBase/");
$exitCode = system($cmd);
The --rsync-path argument was added in there later for debugging. It was unhelpful. It fails either way.
The errors look like this:
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
or like this:
unexpected tag 93 [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(1134) [receiver=3.0.6]
rsync: connection unexpectedly closed (9 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]
I log the actual generated commands for debugging and can run them by hand fine.
The http user can run the commands fine.
Again, a programmatic retry never works, but a manual retry (hitting the same HTTP endpoint that triggers it) almost always works.
Appreciate any help as this has been driving us crazy for...a long time, with many fixes tried.

If this is a real heisenbug, you could retry rsync maybe three times with some sleep between:
for my $n (1..3) {
    my $exitCode = system($cmd);
    if ($exitCode == 0) {
        my_log_sub("SUCCESS: rsync succeeded on try $n");
        last;
    }
    my_log_sub("ERROR: rsync try $n of 3 failed: $cmd $! $?");
    sleep(1) if $n < 3;
}
Have you checked your local and remote logs? Try sudo ls -rtl /var/log/ or sudo ls -rtl /var/log/httpd/ right after a failure, and tail -f /var/log/one_of_the_newest_logs while retrying.
Have you checked whether the remote disk is full or whether the directory exists? A firewall issue? Are the remote and local rsync or ssh versions (very) different? (Although I'd guess that would show a clearer error message.)

The solution was to change system() to backticks. Seriously. I don't know why it works.
The change is literally this:
# BAD:
$exitCode = system($cmd);
# GOOD:
`$cmd`;
If I had to guess, I'd say there's some subtle difference in how the shell is initialized, maybe some environment variables or memory not being cleaned up properly. I really don't know, though.
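One caveat with the backtick version: system() returned the exit status directly, while backticks return the command's output and leave the status in $?. If you still want to check for failures, a minimal sketch (reusing $cmd and the my_log_sub logger from the retry answer above):
my $output   = `$cmd 2>&1`;   # capture stdout and stderr together
my $exitCode = $? >> 8;       # rsync's exit code lives in the high byte of $?
my_log_sub("ERROR: rsync exited $exitCode: $output") if $exitCode != 0;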

Related

Linux script for probing ssh connection in a loop and start log command after connect

I have a host machine that gets rebooted or reconnected quite a few times.
I want to have a script running on my dev machine that continuously tries to log into that machine and if successful runs a specific command (tailing the log data).
Edit: To clarify, the connection needs to stay open. The log command keeps tailing until I stop it manually.
What I have so far
#!/bin/bash
IP=192.168.178.1
if (("$#" >= 1))
then
IP=$1
fi
LOOP=1
trap 'echo "stopping"; LOOP=0' INT
while (( $LOOP==1 ))
do
if ping -c1 $IP
then
echo "Host $IP reached"
sshpass -p 'password' ssh -o ConnectTimeout=10 -q user@$IP '<command would go here>'
else
echo "Host $IP unreachable"
fi
sleep 1
done
The LOOP flag is not really used. The script is ended via CTRL-C.
Now this works if I do NOT add a command to be executed after the ssh and instead start the log output manually. On a disconnect the script keeps probing the connection and logs back in once the host is available again.
Also when I disconnect from the host (CTRL-D) the script will log right back into the host if CTRL-C is not pressed fast enough.
When I add a command to be executed after the ssh, the loop is broken: pressing CTRL-C does not only stop the log but also disconnects and ends the script on the dev machine.
I guess I have to spawn another shell somewhere or something like that?
1) I want the script to keep probing, log in and run a command completely automatically and fall back to probing when the connection breaks.
2) I want to be able to stop the log on the host (CTRL-C) and thereby fall back to a logged in ssh connection to use it manually.
How do I fix this?
Maybe the best approach to "fixing" this would be to fix the requirements.
The problematic part is number "2)".
The problem stems from how SIGINT works.
When triggered, it is sent to the foreground process group of your terminal. Mostly this is the shell and any process started from there. With more modern shells (you seem to use bash), the shell manages process groups such that programs started in the background are disconnected (they are assigned a different process group).
In your case the ssh is started in the foreground (from a script executed in the foreground), so it receives the interrupt, forwards it to the remote side, and terminates as soon as the remote end terminates. Since by that time the script's shell has also processed its own signal handler (specified by trap), it exits the loop and terminates itself.
So, as you can see, you have overloaded CTRL-C to mean two things:
terminate the monitoring script
terminate the remote command and continue with whatever is specified for the remote side.
You might get closer to what you want if you drop the first effect (or at least make it more explicit). Then calling a script on the remote side that terminates only the tail command, not itself, would be the next step. In that case you will likely need the -t switch on ssh to get a terminal allocated, so that normal shell operation is possible afterwards.
This will not allow terminating the remote side with just CTRL-C, though; you will always need to exit the remote shell that gets started.
The essence of such a remote script might look like this (paths are placeholders):
tail -f /path/to/logfile   # the tail command; runs until interrupted
bash                       # followed by an interactive shell
Of course you would need to add whatever parts are necessary for your shell or coding style.
An alternative approach would be to keep the current remote command terminating on interrupt, and to add another ssh call, for the interrupted case, that spawns the shell for interactive use. But in that case CTRL-C will likewise not be available for terminating the monitoring altogether.
To achieve this you might try changing the active interrupt handler in your monitoring script so that it triggers termination as soon as the remote side returns. However, this creates a race condition between the user recognizing that the remote command has terminated (and control has returned to the local script) and the proper interrupt handler being in place. You might be able to lower that risk sufficiently by first activating the new trap handler, then echoing the fact, and maybe adding a sleep to allow the user to react.
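A rough bash sketch of that trap-swapping idea (key-based auth is assumed instead of sshpass, and the remote command and log path are placeholders; the race described above still applies):
while :
do
    if ping -c1 "$IP" > /dev/null
    then
        # while ssh -t runs, CTRL-C is passed through to the remote tail, not handled here
        trap '' INT
        ssh -t "user@$IP" 'tail -f /path/to/logfile'
        # remote side ended; re-arm the local trap so the next CTRL-C stops the monitor
        trap 'echo "stopping"; exit 0' INT
        echo "remote command ended; press CTRL-C within 2s to stop monitoring"
        sleep 2
    fi
    sleep 1
done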
Not really sure what you are saying.
Also, you should disable PasswordAuthentication in /etc/ssh/sshd_config and log in by adding the public key of your home computer to ~/.ssh/authorized_keys on the host.
#!/bin/sh
while true
do
    RESPONSE=`ssh -i /home/user/.ssh/id_host user@$IP 'tail /home/user/log.txt'`
    echo "$RESPONSE"
    sleep 10
done
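For the key setup mentioned above, a one-time step like this should be enough (assuming OpenSSH's ssh-copy-id is available and the id_host key pair already exists):
# install the public half of the key used by the script into the host's authorized_keys
ssh-copy-id -i /home/user/.ssh/id_host.pub user@$IP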

broken pipe with remote rsync between two servers

I am trying to transfer a large dataset (768 GB) from one remote machine to another using bash on Ubuntu 16.04. The problem appears to be that I use rsync and the machine will transfer for a few hours and then quit when the connection inevitably gets interrupted. So suppose I'm on machine A and the remote servers are machines B and C (all machines running Ubuntu 16.04). I ssh to machine B and use this command:
nohup rsync -P -r -e ssh /path/to/files/on/machine_B user@machine_C:directory &
Note that I have authorized keys set up, so no password is required between machines B and C.
A few hours later I get the following in the nohup file:
sending incremental filelist
file_1.bam
90,310,583,648 100% 36.44MB/s 0:39:23 (xfr#4, to-chk=5/10)
file_2.bam
79,976,321,885 100% 93.25MB/s 0:13:37 (xfr#3, to-chk=6/10)
file_3.bam
88,958,959,616 88% 12.50MB/s 0:15:28 rsync error: unexplained error (code 129) at rsync.c(632) [sender=3.1.1]
rsync: [sender] write error: Broken pipe (32)
I used nohup because I thought it would keep running even if there was a hangup. I have not tried sh -c, and I have not tried running the command from machine A, because at this point whatever I try would be guesswork; ideas would be appreciated.
For those who are interested, I also tried running the following script with the nohup command on machine B.
script:
chomp( my @files = `ls /path/to/files/on/machineB/*` );
foreach ( @files ) { system("scp $_ user\@machineC:destination/"); }
I still got truncated files.
At the moment the following command appears to be working:
nohup rsync -P --append -r -e ssh /path/to/files/on/machine_B user@machine_C:directory &
You just have to check the nohup file for a broken pipe error and re-enter the command if necessary.
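As a hedged side note on --append: it assumes the data already on the receiver is correct and only appends the missing tail. If both ends run rsync 3.0.0 or newer (the sender above reports 3.1.1), --append-verify behaves the same but verifies the existing part with a checksum first:
nohup rsync -P --append-verify -r -e ssh /path/to/files/on/machine_B user@machine_C:directory &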
I had the same problem and solved it in multiple steps:
First I made sure that I ran all commands in tmux sessions. This adds a layer of safety on top of nohup, as it keeps the session alive across disconnects: https://en.wikipedia.org/wiki/Tmux
Then I wrapped the rsync command in a while loop so that the copy is retried indefinitely even if the pipe breaks:
while ! rsync <your_source> <your_destination>; do echo "Rsync failed. Retrying ..."; done
This approach is brute force, and it works as long as each attempt manages to copy at least a few more files. Eventually, even with wasteful repeats and multiple failures, all the files will be copied and the command above will exit gracefully.
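Putting the two answers together, a sketch of the retry loop with a pause between attempts (paths are the same placeholders as above; -P implies --partial --progress, so partially transferred files are kept and used as the basis for the next attempt):
while ! rsync -P -r -e ssh /path/to/files/on/machine_B user@machine_C:directory
do
    echo "Rsync failed. Retrying in 30s ..."
    sleep 30
done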

scp hangs in script sometimes

I need some help to try and figure out a problem we are experiencing. We had the following bash shell script running on devices on two separate networks (network1 and network2). Both networks go to the same destination server.
while true
do
# do something ...
scp *.zip "$username@$server_ip:$destination_directory"
# do something ...
sleep 30
done
The script worked fine until a recent change to network2, after which the scp command in the script above sometimes hangs for hours before resetting. The same script is still working fine on network1, which did not change. We are not able to identify what the issue is with network2; everything seems to work except scp. The hang does not happen on every try, but when it does hang it hangs for hours.
So I changed the scp command as follows, and it now resets within minutes; the data delay is bearable but not desirable.
scp -o BatchMode=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -o \
ConnectTimeout=60 *.zip "$username@$server_ip:$destination_directory"
I also tried sftp as follows:
sftp -o ConnectTimeout=60 -b "batchfiles.txt" "$username@$server_ip"
The ConnectTimeout does not seem to work well in sftp because it still hangs for hours sometimes. So I am back to using scp.
I even included the -o IdentityFile=path_to_key/id_rsa option in both scp and sftp, thinking it might be an authentication issue. That did not work either.
What is really strange is that it always works when I issue the same commands from a terminal. The shell script runs as a background task. I am running Linux 3.8.0-26-generic #38-Ubuntu and OpenSSH_6.1p1 Debian-4. I don't think it is a local script permissions issue because: 1) it worked before network2 changed, and 2) it works some of the time.
I did a network packet capture. I can see that each time the scp command hangs, it is accompanied by [TCP Retransmission] and [RST, ACK] within seconds of the start of an scp conversation.
I am very confused as to whether the issue is network or script related. Based on the sequence of events I think it is likely due to the recent change in network2. But why does the same command work every time I try it from a terminal?
Can someone kindly tell me what my issue is, or how to go about troubleshooting it?
Thank you for reading and helping.
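Not a definite answer, but one way to narrow it down: run the exact scp from the background script once with -v and keep the verbose output, then compare it with a -v run from an interactive terminal to see where the two connections diverge (the log path below is just an example):
scp -v -o BatchMode=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 \
    -o ConnectTimeout=60 *.zip "$username@$server_ip:$destination_directory" 2>> /tmp/scp-debug.log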

Using `rsync` and `ssh` in perl causes unexplained timeout?

I'm running a two-stage process on a large number of files.
CODE:
$server_sleep = 1;
$ssh_check = 'ssh '.$destination_user.'@'.$destination_hostname.' "test -e '.$destination_path.$file_filename.'.txt && echo 1 || echo 0"';
while (`$ssh_check` ne "1\n") { # for some reason, the backticks return the 1 with a newline
$upload_command = "/usr/bin/rsync -qogt --timeout=".$server_sleep." --partial --partial-dir=".$destination_path."partials ".$file_path."/".$file_filename.".txt ". $destination_user.'@'.$destination_hostname.":".$destination_path;
sleep $server_sleep; # to avoid hammering the server (for the rsync)
$upload_result = `$upload_command 2>&1`;
$file_errorReturn = "FAIL" if $?;
if (defined($file_errorReturn)) {
#log an error. there is code to do this, but I have omitted it.
}
sleep $server_sleep; # to avoid hammering the server (for the ssh check)
$server_sleep++; # increase the timeout if failures continue
}
BEHAVIOUR:
For the first few files this works fine (which should take care of your first few questions about keys, access, permissions, typos, etc.), and at some point, I get this error back:
ssh: connect to host remote_server.com port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
I get this regardless of whether I have specified -e ssh in the command, so I assume there is a default somewhere to use ssh (which is fine). I have also tried the rsync section with scp, and it resulted in a similar connection timed out error:
ssh: connect to host remote_server.com port 22: Connection timed out
lost connection
QUESTIONS YOU MAY HAVE
1) Because the first few files work, the path is clear for this to work (i.e. there should be no problems with typos, permissions, etc.). My debug code outputs the actual command that is tried, and this works fine on the command line (even for the files that fail in the script).
2) I have tried adding -vvvvv to both ssh and rsync, but I don't know how to get more error information out of them WITHIN my script. All I ever get is the above errors, and when I run it on the command line I get no errors (even after I added "2>&1" and ">> log.txt" to the end of both commands). It is certainly possible that I'm not collecting all the logs I should be, so your help there would be appreciated as well.
3) I am just a regular user on both the local and remote machines.
local: rsync version 3.0.6 protocol version 30
remote: rsync version 3.0.9 protocol version 30
path to ssh and rsync is the same on both.
4) In response to the excellent question (from qwrrty) in the comment (thanks!):
It is not terribly consistent. The files are numbered and they were being run in the following order: 4, 5, 3, 2, 1. It WAS failing on 1. Then I removed 3. It still failed on 1. When I put 3 back in, it started to fail on 2.
The files are all small (5 MB max), so the transfer is basically instant (as the machines are not far from each other physically or network-wise).
Please let me know if you need more detail. Thanks in advance for any advice you can offer.
Can you try to minimize extra ssh sessions and subprocesses by doing it all in a single rsync? Something like this?
my $rsync_cmd = join ' ',
    '/usr/bin/rsync -qogt',
    '--files-from=-',
    '--ignore-existing',
    "--timeout=${server_sleep}",
    '--partial', "--partial-dir=${destination_path}partials",
    "${destination_user}\@${destination_hostname}:${destination_path}";
open (RSYNC, "| $rsync_cmd") or die "cannot start rsync: $!";
for $f (@big_list_of_files) {
    print RSYNC $f, "\n";
}
close RSYNC;
rsync has quite a lot of smarts built into it around how to transfer and synchronize large numbers of files at a time, and in my experience it usually works best to let it do as much of the work as possible.
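As a follow-up note, the exit status of the piped rsync lands in $? when the handle is closed, so the plain close above could become something like:
# close() on a pipe waits for the child; it returns false if rsync exited non-zero
close RSYNC
    or warn "rsync exited with status ", $? >> 8, "\n";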

ssh from crontab returning 'tcgetattr: Invalid argument'

I'm defining something like this in my crontab:
* * * * * ssh -tt otherhost whoami
And I'm getting the following output:
tcgetattr: Invalid argument
me
Running ssh with fewer -t options leads to other errors besides tcgetattr.
The solution posted in "why is the `tcgetattr` error seen when ssh is used for dumping the backup file on another server?" doesn't really work well, because in this case I'm using several ssh connections to run monitoring scripts on different hosts and I need to capture output sent to stderr and email it.
Any ideas on how to workaround this?
You could use something like this:
ssh -tt otherhost "your_monitoring_script 2>&1" 2> /dev/null
That way the errors from ssh go in the bucket, but the errors from your script are shown on stdout. For that to work you should mark errors from your script with "ERROR:" so that you can find them again if your script produces a lot of output.
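For example, since cron mails a job's output to the MAILTO address by default, you could let only the marked lines through; this sketch assumes your_monitoring_script prefixes its failures with "ERROR:" and that you@example.com is replaced with a real address:
MAILTO=you@example.com
* * * * * ssh -tt otherhost "your_monitoring_script 2>&1" 2>/dev/null | grep '^ERROR:'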
