Comparing a big file on two servers

Comparing a big file on two servers - linux

I have two servers and I want to move a backup tar.bz file(50G) from one to other one.
I used AXEL to download file from source server. But now when I want to extract it, it gives me error unexpected EOF. The size of them are same and it seems like there is a problem in content.
I want to know if there is a program/app/script that can compare these two files and correct only damaged parts?! Or do I need to split it by hand and compare each part's hash code?
Problem is here that source server has limited bandwidth and low transfers speed so I cant transfer it again from zero.

You can use a checksum utility, such as md5 or sha, to see if the files are the same on either end. e.g.
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
by running such a command on both ends and comparing the result, you can get some certainty as to if the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.

Ok, I found "rdiff" the best way to solve this problem. Just doing:
On Destination Server:
rdiff signature destFile.tar.bz destFile.sig
Then transferring destFile.sig to source server and execute rdiff there on Source Server again:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Then transferring delta.rdiff to destination server and execute rdiff once again on Destination Server:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz

This process really doesn't need a separate program. You can simply do it by using a couple of simple commands. If any of the md5sums don't add up, copy over the mismatched one(s) and concatenate them back together. To make comparing the md5sums easier, just run a diff between the output of the two files (or do an md5sum of the outputs to see if there is a difference at all without having to copy over the output).
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
md5sum $i
done

Related

How can I print/log the checksum calculated by rsync?

I have to transfer millons of files of very different size summing up almost 100 TB between two Linux servers. It's easy to do it the first time with rsync, and quite safe, because data can be checksum'ed.
However, I need to keep a list of files and their checksum to do some checks regularly in the future.
Is there a way to tell rsync to print/log the checksum of the file?
And in case this is not feasible: Which tool/command would you recommend considering that performance is very important?
Thanks in advance!

It is possible to include the transfer md5 checksum in logging since rsync 3.1.0 (released on 28 Sep 2013):
Added the "%C" escape to the log-output handling, which will output the
MD5 checksum of any transferred file, or all files if --checksum was
specified (when protocol 30 or above is in effect).
For example, the log format %i %f B:%l md5:%C will log each transfer similar to
>f+++++++++ 00/64235/0664eccc-364e-11e2-af18-57a6d04fd4d5 B:16035388 md5:8ab769aa5224514a41cee0e3e2fe3aad
Take note that this is the md5 sum calculated to verify transfer integrity - it is available even for transfers without the --checksum flag.
This change also allows to log the checksum if just one side of the transfer is 3.1.0 or newer. For example, you can have a newer rsync daemon on the target machine do the checksum logging, but send with an older rsync client as long as md5 is used (3.0.0 or newer).

md5sum relationship between splite files and combined large file [duplicate]

I have a situation where I have one VERY large file that I'm using the linux "split" command to break into smaller parts. Later I use the linux "cat" command to bring the parts all back together again.
In the interim, however, I'm curious...
If I get an MD5 fingerprint on the large file before splitting it, then later get the MD5 fingerprints on all the independent file parts that result from the split command, is there a way to take the independent fingerprints and somehow deduce that the sum or average (or whatever you like to all it) of their parts is equal to the fingerprint of the single large file?
By (very) loose example...
bigoldfile.txt MD5 = 737da789
smallfile1.txt MD5 = 23489a89
smallfile2.txt MD5 = 1238g89d
smallfile3.txt MD5 = 01234cd7
someoperator(23489a89,1238g89d,01234cd7) = 737da789 (the fingerprint of the original file)

You likely can't do that - MD5 is complex enough inside and depends on actual data as well as the "initial" hash value.
You could instead generate "incremental" hashes - hash of first part, hash of first plus second part, etc.

Not exactly but the next best thing would be to do this:
cat filepart1 filepart2 | md5sum
or
cat filepart* | md5sum
Be sure to cat them back together in the correct order.
by piping the output of cat you don't have to worry about creating a combined file that is too large.

How to copy multiple files simultaneously using scp

I would like to copy multiple files simultaneously to speed up my process I currently used the follow
scp -r root#xxx.xxx.xx.xx:/var/www/example/example.example.com .
but it only copies one file at a time. I have a 100 Mbps fibre so I have the bandwidth available to really copy a lot at the same time, please help.

You can use background task with wait command.
Wait command ensures that all the background tasks are completed before processing next line. i.e echo will be executed after scp for all three nodes are completed.
#!/bin/bash
scp -i anuruddha.pem myfile1.tar centos#192.168.30.79:/tmp &
scp -i anuruddha.pem myfile2.tar centos#192.168.30.80:/tmp &
scp -i anuruddha.pem myfile.tar centos#192.168.30.81:/tmp &
wait
echo "SCP completed"

SSH is able to do so-called "multiplexing" - more connections over one (to one server). It can be one way to afford what you want. Look up keywords like "ControlMaster"
Second way is using more connections, then send every job at background:
for file in file1 file2 file3 ; do
scp $file server:/tmp/ &
done
But, this is answer to your question - "How to copy multiple files simultaneously". For speed up, you can use weaker encryption (rc4 etc) and also don't forget, that the bottleneck can be your hard drive - because SCP don't implicitly limit transfer speed.
Last thing is using rsync - in some cases, it can be lot faster than scp...

I am not sure if this helps you, but I generally archive (compression is not required. just archiving is sufficient) file at the source, download it, extract them. This will speed up the process significantly.
Before archiving it took > 8 hours to download 1GB
After archiving it took < 8 minutes to do the same

You can use parallel-scp (AKA pscp): http://manpages.ubuntu.com/manpages/natty/man1/parallel-scp.1.html
With this tool, you can copy a file (or files) to multiple hosts simultaneously.
Regards,

100mbit Ethernet is pretty slow, actually. You can expect 8 MiB/s in theory. In practice, you usually get between 4-6 MiB/s at best.
That said, you won't see a speed increase if you run multiple sessions in parallel. You can try it yourself, simply run two parallel SCP sessions copying two large files. My guess is that you won't see a noticeable speedup. The reasons for this are:
The slowest component on the network path between the two computers determines the max. speed.
Other people might be accessing example.com at the same time, reducing the bandwidth that it can give you
100mbit Ethernet requires pretty big gaps between two consecutive network packets. GBit Ethernet is much better in this regard.
Solutions:
Compress the data before sending it over the wire
Use a tool like rsync (which uses SSH under the hood) to copy on the files which have changed since the last time you ran the command.
Creating a lot of small files takes a lot of time. Try to create an archive of all the files on the remote side and send that as a single archive.
The last suggestion can be done like this:
ssh root#xxx "cd /var/www/example ; tar cf - example.example.com" > example.com.tar
or with compression:
ssh root#xxx "cd /var/www/example ; tar czf - example.example.com" > example.com.tar.gz
Note: bzip2 compresses better but slower. That's why I use gzip (z) for tasks like this.

If you specify multiple files scp will download them sequentially:
scp -r root#xxx.xxx.xx.xx:/var/www/example/file1 root#xxx.xxx.xx.xx:/var/www/example/file2 .
Alternatively, if you want the files to be downloaded in parallel, then use multiple invocations of scp, putting each in the background.
#! /usr/bin/env bash
scp root#xxx.xxx.xx.xx:/var/www/example/file1 . &
scp root#xxx.xxx.xx.xx:/var/www/example/file2 . &

Upload output of a program directly to a remote file by ftp

I have some program that generates a lot of data, to be specific encrypting tarballs. I want to upload result on a remote ftp server.
Files are quite big (about 60GB), so I don't want to waste hdd space for tmp dir and time.
Is it possible? I checked ncftput util, but there is not option to read from a standard input.

curl can upload while reading from stdin:
-T, --upload-file
[...]
Use the file name "-" (a single dash) to use stdin instead of a given
file. Alternately, the file name "." (a single period) may be
specified instead of "-" to use stdin in non-blocking mode to allow
reading server output while stdin is being uploaded.
[...]

I guess you could do that with any upload program using named pipe, but I foresee problems if some part of the upload goes wrong and you have to restart your upload: the data is gone and you cannot start back your upload, even if you only lost 1 byte. This also applied to a read from stdin strategy.
My strategy would be the following:
Create a named pipe using mkfifo.
Start the encryption process writing to that named pipe in the background. Soon, the pipe buffer will be full and the encryption process will be blocked trying to write data to the pipe. It should unblock when we will read data from the pipe later.
Read a certain amount of data from the named pipe (let say 1 GB) and put this in a file. The utility dd could be used for that.
Upload that file though ftp doing it the standard way. You then can deal with retries and network errors. Once the upload is completed, delete the file.
Go back to step 3 until you get a EOF from the pipe. This will mean that the encryption process is done writing to the pipe.
On the server, append the files in order to an empty file, deleting the files one by one once it has been appended. Using touch next_file; for f in ordered_list_of_files; do cat $f >> next_file; rm $f; done or some variant should do it.
You can of course prepare the next file while you upload the previous file to use concurrency at its maximum. The bottleneck will be either your encryption algorithm (CPU), you network bandwidth, or your disk bandwidth.
This method will waste you 2 GB of disk space on the client side (or less or more depending the size of the files), and 1 GB of disk space on the server side. But you can be sure that you will not have to do it again if your upload hang near the end.
If you want to be double sure about the result of the transfer, you could compute hash of you files while writing them to disk on the client side, and only delete the client file once you have verify the hash on the server side. The hash can be computed on the client side at the same time you are writing the file to disk using dd ... | tee local_file | sha1sum. On the server side, you would have to compute the hash before doing the cat, and avoid doing the cat if the hash is not good, so I cannot see how to do it without reading the file twice (once for the hash, and once for the cat).

You can write to a remote file using ssh:
program | ssh -l userid host 'cd /some/remote/directory && cat - > filename'

This is a sample of uploading to ftp site by curl
wget -O- http://www.example.com/test.zip | curl -T - ftp://user:password#ftp.example.com:2021/upload/test.zip

How can I compare two zip format(.tar,.gz,.Z) files in Unix

I have two gz files. I want to compare those files without extracting. for example:
first file is number.txt.gz - inside that file:
1111,589,3698,
2222,598,4589,
3333,478,2695,
4444,258,3694,
second file - xxx.txt.gz:
1111,589,3698,
2222,598,4589,
I want to compare any column between those files. If column1 in first file is equal to the 1st column of second file means I want output like this:
1111,589,3698,
2222,598,4589,

You can't do this.
You can compare all content from archive by comparing archives but not part of data in compressed files.
You can compare selected files in archive too without unpacking because archive has metadata with CRC32 control sum and you must compare this sum to know this without unpacking.

If you need to check and compare your data after it's written to those huge files, and you have time and space constraints preventing you from doing this, then you're using the wrong storage format. If your data storage format doesn't support your process then that's what you need to change.
My suggestion would be to throw your data into a database rather than writing it to compressed files. With sensible keys, comparison of subsets of that data can be accomplished with a simple query, and deleting no longer needed data becomes similarly simple.
Transactionality and strict SQL compliance are probably not priorities here, so I'd go with MySQL (with the MyISAM driver) as a simple, fast DB.
EDIT: Alternatively, Blorgbeard's suggestion is perfectly reasonable and feasible. In any programming language that has access to (de)compression libraries, you can read your way sequentially through the compressed file without writing the expanded text to disk; and if you do this side-by-side for two input files, you can implement your comparison with no space problem at all.
As for the time problem, you will find that reading and uncompressing the file (but not writing it to disk) is much faster than writing to disk. I recently wrote a similar program that takes a .ZIPped file as input and creates a .ZIPped file as output without ever writing uncompressed data to file; and it runs much more quickly than an earlier version that unpacked, processed and re-packed the data.

You cannot compare the files while they remain compressed using different techniques.
You must first decompress the files, and then find the difference between the results.
Decompression can be done with gunzip, tar, and uncompress (or zcat).
Finding the difference can be done with the diff command.

I'm not 100% sure whether it's meant match columns/fields or entire rows, but in the case of rows, something along these lines should work:
comm -12 <(zcat number.txt.gz) <(zcat xxx.txt.gz)
or if the shell doesn't support that, perhaps:
zcat number.txt.gz | { zcat xxx.txt.gz | comm -12 /dev/fd/3 - ; } 3<&0

exact answer i want is this only
nawk -F"," 'NR==FNR {a[$1];next} ($3 in a)' <(gzcat file1.txt.gz) <(gzcat file2.txt.gz)
. instead of awk, nawk works perfectly and it's gzip file so use gzcat

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string