How to transfer large file from local to remote box with auto-resume and transfer only what has changed?

How to transfer large file from local to remote box with auto-resume and transfer only what has changed? - linux

I try the following command
rsync -av --progress --inplace --rsh='ssh' /home/tom/workspace/myapp.war root#172.241.181.124:/home/rtom/uploads
But it seems it transfers the whole file again each time I execute the command when I make a small change in app that regenerates the myapp.war.
I want also the connection to automatically resume if connection is lost. I think this part is working.
The transfer should occur over ssh.
The connection speed is very slow and can break too so it is important that it transfers only what has changed. Of course it must also ensure that the file was correctly transfered.

rsync does handle relatively small changes and partial uploads in a file efficiently. There has been significant effort in the rsync algorithm towards this direction.
The problem is that WAR files are "extended" JAR files, which are essentially ZIP arhives and therefore compressed.
A small change in an uncompressed file will change the whole compressed segment where that file belongs and - most importantly - it can also change its size significantly. That can overcome the ability of rsync to detect and handle changes in the final compressed file.
On ZIP archives each uncompressed file has its own compressed segment. Therefore the order in which files are placed in the archive is also important with regard to achieving a degree of similarity to a previous version. Depending on how the WAR file is created, just adding a new file or renaming one can cause segments to move, essentially making the WAR file unrecognisable. In other words:
A small change in your application normally means a rather large change in your WAR file.
rsync is not designed to handle changes in compressed files. However, it can handle changes in your application. One solution would be to use it to upload your application files and then create the WAR file on the remote host.
A slightly different approach - that does not need any development tools on the remote host - would be to unpack (i.e. unzip) the WAR file locally, upload its contents and then pack (i.e. zip) it again on the remote host. This solution only requires a zip or jar implementation on the remote host.

Related

OverlayFS on a single large file

I would like to solve the following set of constraints:
I want to be able to mount a copy of a large (16gb) remote file
if a part of the file is written to by the application, it is written to the local copy and not synced over the network
if a part of the file is read, if it was previously written to by the application, it will read the local copy. if it was never written to, it will first copy from the remote to local, then read from local
parts of the file that are never read before being written to should never be transmitted over the network (this is the most important constraint)
the file will always be the same size, so there is never ambiguity about what should happen when we read a specific byte from the file.
The reason for these constraints is that the vast majority of a single file will never be read, there are many such files (at least a small portion of each file is read), and network bandwidth is extremely limited.
OverlayFS comes very close to what I want. If I was able to apply overlayfs at the file level instead of the directory level, I would use the (perhaps nfs-mounted) remote file as the lower_file and an empty, sparse file as the upper_file.
Is there something that would allow me to do the above?

Rsync with .dd.xz files

I am trying different ways to update/write an image on a linux device and using rsync for this.
For file system synchronization rsync checks and only transfers missing /changed files reducing the bandwidth.
In similar way I created a binary file of 10MB(original.bin) and modified this file by adding few changes (modified.bin)and tried to rsync the original.bin file.First time it transfers the whole file as there is no copy on the device.Next modified.bin file is renamed to original.bin and did rsync. It only transferred changes in the modified.bin I want to know if this is the same with .dd.xz files as well. I have 2 .dd.xz files (image1.dd.xz and image2.dd.xz which has few dlls and mono packgaes added) and when these files are extracted to .dd files and rsync transfers only changes.
But when i rsync the files as .dd.xz it transfers the whole file again. Can some one help me to understand if this is expected behaviour or rsync behaves same on .dd files as with any other text files?

xz is the extension used by the xz compress tool. Compressed files don't work with rsync for obvious reasons.
Consider whether you're better off using dd images without compressing them. You can (de)compress them faster using the pixz command which does its job in parallel using all available processors.

Break a zip file into INDIVIDUAL pieces

What I am trying to do is this;
I get these zip files from clients which are 1.5gb in general. They all include pictures only. I need to make them into 100mb files to actually upload it to my server. Problem is that, If I break my 1.5gb zip file, I need to re-attach all of them if I need to use one.
When I break the 1.5gb zip file into a 100mb zip file, I need the 100mb one to act as a separate new file so the server will unzip it and upload the pictures into the database. I have looked for this problem but most of the threads are about how to split a zip file. This is partially what I want to do and I can do it now but I also need those smaller pieces to be able to unzip on its own. Is it possible to break a zip file into smaller pieces that will act as a new, stand alone zip files?
Thanks.

I have the same question. I think unzip in the Linux shell cannot handle a zip file larger than 1 GB, and I need to unzip them unattended in a headless NAS. What I do for now is unzip everything in the desktop HD, select files until they almost reach 1 GB, archive and delete them, then select the next set of files until I reach 1 GB.

Your answer is not clear, but I will try to answer it based upon my understanding of your dilemma.
Questions
Why does the file size need to be limited?
Is it the transfer to the server that is the constraining factor?
Is application (on the server) unable to process files over a certain size?
Can the process be altered so that image file fragments can be recombined on the server before processing?
What operating systems are in use on the client and the server?
Do you have shell access to the server?
A few options
Use imagemagick to reduce the files so they fit within the file size constraints
On Linux/Mac, this is relatively straightforward to do:
split -b 1m my_large_image.jpg (you need the b parameter for it to work on binary files)
Compress each file into its own zip
Upload to the server
Unzip
Concatenate the fragments back into an image file:
cat xaa xab xac xad (etc) > my_large_image.jpg

Alternative to creating multipart .tar.gz files?

I have a folder with >20GB of images on a linux server, I need to make a backup and download it, so I was thinking about using "split" to create 1GB files. My question is: instead of splitting a .tar.gz and then having to join it again on my computer, is there a way I could create 20 x 1GB valid .tar.gz files, so I can then view/extract them separately?
Edit: I forgot to add that I need to do it without ssh access. I'm using mostly PHP.

You could try rsnapshot to backup using rsync/hardlinks instead. It not only solves the filesize issue but also gives you high storage and bandwidth efficiency when existing images aren't changed often.

Why not just use rsync?
FYI, rsync is a command-line tool that synchronises directories between two machines across the network. If you have Linux at both ends and ssh access properly configured, it's as simple as rsync -av server:/path/to/images/ images/ (make sure the trailing slashes are there). It also optimises subsequent synchronisations so that only changes are transmitted. You can even tell it to compress data in transit, but that usually doesn't help with images.

First I would give rsnapshot a miss if you don't have SSH access. (Though I do and love it)
I would assume you're likely backing up jpeg's and they are already compressed. Zipping them up doesn't make them much smaller, plus you don't need exactly 1GB files. It sounds like they can be a bit bigger or smaller.
So you could just write a script which bundles jpegs into a gz(or whatever) until it has put about 1gb worth in and then starts a new archive.
You could do all this in PHP easy enough.

Uploading & extracting archive (zip, rar, targz, tarbz) automatically - security issue?

I'd like to create following functionality for my web-based application:
user uploads an archive file (zip/rar/tar.gz/tar.bz etc) (content - several image files)
archive is automatically extracted after upload
images are shown in the HTML list (whatever)
Are there any security issues involved with extraction process? E.g. possibility of malicious code execution contained within uploaded files (or well-prepared archive file), or else?

Aside the possibility of exploiting the system with things like buffer overflows if it's not implemented carefully, there can be issues if you blindly extract a well crafted compressed file with a large file with redundant patterns inside (a zip bomb). The compressed version is very small but when you extract, it'll take up the whole disk causing denial of service and possibly crashing the system.
Also, if you are not careful enough, the client might hand a zip file with server-side executable contents (.php, .asp, .aspx, ...) inside and request the file over HTTP, which, if not configured properly can result in arbitrary code execution on the server.

In addition to Medrdad's answer: Hosting user supplied content is a bit tricky. If you are hosting a zip file, then that can be used to store Java class files (also used for other formats) and therefore the "same origin policy" can be broken. (There was the GIFAR attack where a zip was attached to the end of another file, but that no longer works with the Java PlugIn/WebStart.) Image files should at the very least be checked that they actually are image files. Obviously there is a problem with web browsers having buffer overflow vulnerabilities, that now your site could be used to attack your visitors (this may make you unpopular). You may find some client side software using, say, regexs to pass data, so data in the middle of the image file can be executed. Zip files may have naughty file names (for instance, directory traversal with ../ and strange characters).
What to do (not necessarily an exhaustive list):
Host user supplied files on a completely different domain.
The domain with user files should use different IP addresses.
If possible decode and re-encode the data.
There's another stackoverflow question on zip bombs - I suggest decompressing using ZipInputStream and stopping if it gets too big.
Where native code touches user data, do it in a chroot gaol.
White list characters or entirely replace file names.
Potentially you could use an IDS of some description to scan for suspicious data (I really don't know how much this gets done - make sure your IDS isn't written in C!).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string