How can I print/log the checksum calculated by rsync? - linux

I have to transfer millions of files of very different sizes, adding up to almost 100 TB, between two Linux servers. It's easy to do the first transfer with rsync, and quite safe, because the data can be checksummed.
However, I need to keep a list of the files and their checksums so I can run checks regularly in the future.
Is there a way to tell rsync to print/log the checksum of the file?
And in case this is not feasible: Which tool/command would you recommend considering that performance is very important?
Thanks in advance!

Since rsync 3.1.0 (released on 28 Sep 2013) it has been possible to include the transfer MD5 checksum in the logging:
Added the "%C" escape to the log-output handling, which will output the
MD5 checksum of any transferred file, or all files if --checksum was
specified (when protocol 30 or above is in effect).
For example, the log format %i %f B:%l md5:%C will log each transfer as something like:
>f+++++++++ 00/64235/0664eccc-364e-11e2-af18-57a6d04fd4d5 B:16035388 md5:8ab769aa5224514a41cee0e3e2fe3aad
Take note that this is the md5 sum calculated to verify transfer integrity - it is available even for transfers without the --checksum flag.
This change also makes it possible to log the checksum when just one side of the transfer is 3.1.0 or newer. For example, you can have a newer rsync daemon on the target machine do the checksum logging while sending with an older rsync client, as long as MD5 is used (3.0.0 or newer).
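If you run rsync directly rather than through a daemon's config file, the same format can be given on the command line. A minimal sketch, assuming rsync 3.1.0 or newer and placeholder paths:
rsync -a --checksum \
      --out-format='%i %f B:%l md5:%C' \
      --log-file=/var/log/rsync-checksums.log \
      --log-file-format='%i %f B:%l md5:%C' \
      /data/source/ remote:/data/dest/
The resulting log then doubles as the list of files and checksums asked for in the question (at the cost of --checksum reading every file).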

Related

Simplest format to archive a directory

I have a script that needs to work on multiple platforms and machines. Some of those machines don't have any available archiving software (e.g. zip, tar). I can't download any software onto these machines.
The script creates a directory containing output files. I need to package all those files into a single file so I can download it easily.
What is the simplest possible archive format to implement, so I can easily roll my own implementation in the script? It doesn't have to support compression.
I could make up something ad-hoc, e.g.
file1 base64EncodedContents
dir1/file1 base64EncodedContents
etc.
However, if a suitable format already exists, it will save me from having to roll my own packing and unpacking - only packing - which would be nice. Bonus points if it's zip compatible, so that I can try zipping it with compression where possible, implement my own without compression otherwise, and not have to worry about which it is on the other side.
The tar archive format is extremely simple - simple enough that I was able to implement a tar archiver in PowerShell in a couple of hours.
It consists of a sequence of file header, file data, file header, file data, and so on.
The header is pure ASCII, so it doesn't require any bit manipulation - you can literally append strings. Once you've written the header, you append the file bytes and pad them with NUL characters until the total is a multiple of 512 bytes. You then repeat for the next file.
Wikipedia has more details on the exact format: https://en.wikipedia.org/wiki/Tar_(computing).
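For illustration, here is a minimal sketch of the same idea in plain shell. This is a hedged example rather than a full implementation: it assumes a single regular file with a short ASCII name, relies on head -c and od being available, and omits all error handling; the field widths follow the classic tar header layout.
tar_append() {
  f="$1"; out="$2"
  size=$(( $(wc -c < "$f") ))
  mtime=$(date +%s)
  nul() { head -c "$1" /dev/zero; }                # emit N NUL bytes
  # Header fields; the 8-byte checksum is left as spaces for the first pass.
  {
    printf '%s' "$f";        nul $((100 - ${#f}))  # name     (100 bytes)
    printf '0000644';        nul 1                 # mode     (8)
    printf '0000000';        nul 1                 # uid      (8)
    printf '0000000';        nul 1                 # gid      (8)
    printf '%011o' "$size";  nul 1                 # size     (12)
    printf '%011o' "$mtime"; nul 1                 # mtime    (12)
    printf '        '                              # checksum placeholder (8)
    printf '0'                                     # typeflag: regular file (1)
    nul 100                                        # linkname (100)
    nul 255                                        # pad header to 512 bytes
  } > header.tmp
  # Checksum = unsigned sum of all 512 header bytes, stored as 6 octal digits, NUL, space.
  sum=$(od -An -tu1 header.tmp | awk '{for (i = 1; i <= NF; i++) s += $i} END {print s}')
  { head -c 148 header.tmp; printf '%06o' "$sum"; nul 1; printf ' '; tail -c +157 header.tmp; } >> "$out"
  # File data, NUL-padded to the next 512-byte boundary.
  cat "$f" >> "$out"
  pad=$(( (512 - size % 512) % 512 ))
  [ "$pad" -gt 0 ] && nul "$pad" >> "$out"
  rm -f header.tmp
}
# Usage: tar_append somefile archive.tar (once per file), then finish the
# archive with two empty 512-byte blocks: head -c 1024 /dev/zero >> archive.tar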

Does ActiveStorage use the checksum for anything?

In a very real sense, my question is actually 'can I skip generating a checksum', but answering that question rests on the above question.
To give you some background, I'm (finally) converting from Paperclip to ActiveStorage, and one of the pains of my particular conversion is that I'm storing a decent-sized number of fairly large files -- in addition to normal-sized thumbnail images, I'm also storing large multimedia files, some in excess of 10 GB (I'm currently poking at a 15 GB file).
The basic conversion process has me downloading the file to generate a checksum, plus a few other minor details that could be done with a HEAD request instead of downloading the full file. We also copy the file from its old 'home' to its new 'home', but that is done as an S3-to-S3 copy and doesn't take as long as downloading and uploading.
I'd love to skip the download & generate checksum process -- or at least, put it off for another day, as a cleanup step that isn't important to what we're actually doing.
So the question is: does the checksum actually do anything in ActiveStorage, or is it just a 'nice-to-have' feature that would allow me to, for example, publish the checksum if someone wanted to verify their version?
Found in the Rails code:
Prior to uploading, we compute the checksum, which is sent to the
service for transit integrity validation. If the checksum does not
match what the service receives, an exception will be raised.
You can create your own checksum without downloading the file:
Found in the Rails code:
def compute_checksum_in_chunks(io)
  OpenSSL::Digest::MD5.new.tap do |checksum|
    while chunk = io.read(5.megabytes)
      checksum << chunk
    end
    io.rewind
  end.base64digest
end
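If you still have the original files somewhere local (for example on the server that produced them), the same base64-encoded MD5 that ActiveStorage stores can be computed from the shell without downloading anything. A sketch, with somefile as a placeholder:
openssl dgst -md5 -binary somefile | base64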

How can I convince z/OS scp to transfer binary files?

We have SSH-based file transfer scripts currently set up for Linux-to-Linux, and we're porting them to z/OS to go z/OS-to-Linux. Note that this is with USS, the UNIX System Services within z/OS (otherwise known as OMVS), which uses EBCDIC under the covers - not zLinux, which uses ASCII.
We've set up all the SSH key files and what-not, and the transfer itself is working fine.
However z/OS, in its infinite wisdom, insists on converting the files from EBCDIC to ASCII even though they're binary files - this corrupts the content of the destination files.
The scp manpage on z/OS states:
scp assumes that files are text. Files copied between EBCDIC and ASCII platforms
are converted.
and I can find nothing useful in the manuals that indicates how to get around this.
It seems a bizarre limitation for anyone wanting to transfer binary files between the two platforms. Does anyone know of a way, using SSH-standard keyfiles (we need this for security, no naked FTP allowed), to effect a binary transfer without translation?
You can use one of the other SSH-based tools such as sftp.
Whereas scp will let you transfer a file (with automatic authentication set up) with something like:
scp -i ident_file zos_file linux_user@linux_box:linux_file
you can do a similar thing with the secure FTP:
sftp -oIdentityFile=ident_file -b - linux_user@linux_box <<EOF
binary
put zos_file linux_file
EOF

Comparing a big file on two servers

I have two servers and I want to move a 50 GB backup tar.bz file from one to the other.
I used axel to download the file from the source server, but now when I try to extract it, I get an "unexpected EOF" error. The sizes are the same, so the problem seems to be in the content.
Is there a program/app/script that can compare these two files and correct only the damaged parts? Or do I need to split the file by hand and compare each part's hash?
The problem is that the source server has limited bandwidth and a low transfer speed, so I can't transfer it again from scratch.
You can use a checksum utility, such as md5 or sha, to see whether the files are the same on both ends, e.g.
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
By running such a command on both ends and comparing the results, you can get some certainty as to whether the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.
OK, I found rdiff to be the best way to solve this problem. Just do the following.
On the destination server:
rdiff signature destFile.tar.bz destFile.sig
Then transfer destFile.sig to the source server and run rdiff there:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Then transfer delta.rdiff back to the destination server and run rdiff once more, again on the destination server:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz
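To confirm the repair worked, you can hash the reconstructed file and compare it against the original on the source server (a sketch, reusing the file names from the example above):
md5sum fixedFile.tar.bz    # on the destination server
md5sum srcFile.tar.bz      # on the source server; the two hashes should match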
This process really doesn't need a separate program; you can do it with a couple of simple commands. Split the file into pieces and md5sum each piece on both servers, as shown below. If any of the md5sums don't match, copy over the mismatched piece(s) and concatenate the pieces back together. To make comparing the md5sums easier, run a diff between the two servers' outputs (or md5sum the outputs themselves to see whether there is any difference at all, without having to copy the full listing across).
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
  md5sum "$i"
done
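For example (a sketch; local.md5 and remote.md5 are placeholder names for the two servers' checksum lists):
md5sum bigfile.* > local.md5    # run this on each server
diff local.md5 remote.md5       # after copying one server's list to the other
cat bigfile.* > bigfile         # reassemble once any mismatched pieces are re-copied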

CIFS/SMB Write Optimization

I am looking at making a write optimization for CIFS/SMB such that the writing of duplicate blocks is suppressed. For example, I read a file from the remote share and modify a portion near the end of the file. When I save the file, I only want to send write requests back to the remote side for the portions of the file that have actually changed. So basically, suppress all writes up until the point at which a non-duplicate write is encountered; at that point the suppression is disabled and the writes are allowed as usual. The problem is I can't find any documentation (MS-SMB/MS-SMB2/MS-CIFS or otherwise) that indicates whether or not this is a valid thing to do. Does anyone know if this would be valid?
Dig deep into the Linux kernel sources; there is documentation on CIFS, both in the source and as text, e.g. http://www.mjmwired.net/kernel/Documentation/filesystems/cifs.txt
If you want to study the behaviour of the CIFS protocol, you can test it with the Unix command dd. Mount any remote file system via CIFS, e.g. into /media/remote, and change into this folder:
cd /media/remote
Now create a file with some random stuff (e.g. from the kernel's random pool):
dd if=/dev/urandom of=test.bin bs=4M count=5
In this example, you should see some 20 MB of traffic. Then create another, smaller file somewhere on your machine, say in your home folder:
dd if=/dev/urandom of=~/test_chunk.bin bs=4M count=1
The interesting thing is what happens if you attempt to write the chunk into a specific position of the remote test file:
dd if=~/test_chunk.bin of=test.bin bs=4M count=1 seek=3 conv=notrunc
This should only change block #4 out of 5 in the target file.
I guess you can adjust the block size; I did this with 4 MB blocks. But it should help you understand what happens on the network.
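To actually see how much data crosses the wire for each dd, you can capture the SMB traffic while the commands run (a sketch; the interface name and server address are placeholders):
sudo tcpdump -i eth0 -w smb_trace.pcap host 192.0.2.10 and port 445
Opening smb_trace.pcap in Wireshark then shows the individual SMB write requests and their sizes.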
The CIFS protocol does allow applications to write back specific portions of the file. This is controlled by the parameters DataOffset and DataLength in the SMB WriteAndX packet.
Documentation for the same can be found here:
http://msdn.microsoft.com/en-us/library/ee441954.aspx
The client can use these fields to write a specific length of data to specific offsets within the file.
Similar support exists in more recent versions of the protocol as well ...
The SMB protocol does have such a write optimization: it is used for the CIFS append operation, where the protocol reads the file's EOF and then writes the new data with the offset set to that EOF value and the length set to the number of appended bytes.
