Alternative to creating multipart .tar.gz files? - linux

I have a folder with >20GB of images on a linux server, I need to make a backup and download it, so I was thinking about using "split" to create 1GB files. My question is: instead of splitting a .tar.gz and then having to join it again on my computer, is there a way I could create 20 x 1GB valid .tar.gz files, so I can then view/extract them separately?
Edit: I forgot to add that I need to do it without ssh access. I'm using mostly PHP.

You could try rsnapshot to back up using rsync and hard links instead. It not only solves the file-size issue but also gives you high storage and bandwidth efficiency when existing images don't change often.

Why not just use rsync?
FYI, rsync is a command-line tool that synchronises directories between two machines across the network. If you have Linux at both ends and ssh access properly configured, it's as simple as rsync -av server:/path/to/images/ images/ (make sure the trailing slashes are there). It also optimises subsequent synchronisations so that only changes are transmitted. You can even tell it to compress data in transit, but that usually doesn't help with images.

First, I would give rsnapshot a miss if you don't have SSH access (though I use it myself and love it).
I assume you're likely backing up JPEGs, which are already compressed, so zipping them up won't make them much smaller. Plus, you don't need exactly 1GB files; it sounds like they can be a bit bigger or smaller.
So you could just write a script that bundles JPEGs into a .tar.gz (or whatever) until it has put about 1GB worth in, then starts a new archive.
You could do all this easily enough in PHP.
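A minimal sketch of that idea as a shell loop (the source and destination paths and the 1GB threshold are placeholders; the same logic translates readily to PHP):

limit=$((1024*1024*1024))   # ~1GB per archive, in bytes (placeholder threshold)
size=0; part=1; files=()
cd /var/www/images          # placeholder source directory
for f in *; do
    [ -f "$f" ] || continue
    files+=("$f")
    size=$((size + $(stat -c%s "$f")))
    if [ "$size" -ge "$limit" ]; then
        tar czf "/backups/images-part$part.tar.gz" "${files[@]}"   # placeholder destination
        part=$((part + 1)); size=0; files=()
    fi
done
# flush whatever is left into a final archive
[ "${#files[@]}" -gt 0 ] && tar czf "/backups/images-part$part.tar.gz" "${files[@]}"

Each resulting archive is independently valid, so you can list or extract any one of them on its own.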

Related

Strategy for compressing and navigating large compressed directories

I manage a computer cluster. It is a multi-user system. I have a large directory filled with files (terabytes in size). I'd like to compress it so the user who owns it can save space and still be able to extract files from it.
Challenges with possible solutions :
tar : The directory's size makes it challenging to extract files from the resulting tarball, due to tar's poor random-access reads. I'm referring to the canonical way of compressing, i.e. tar cvzf mytarball.tar.gz mybigdir
squashfs : It appears that this would be a great solution, except in order to mount it, it requires root access. I don't really want to be involved in mounting their squashfs file every time they want to access a file.
Compress then tar : I could compress the files first and then use tar to create the archive. This would have the disadvantage that I wouldn't save as much space with compression and I wouldn't get back any inodes.
Similar questions (here) have been asked before, but the solutions are not appropriate in this case.
QUESTION:
Is there a convenient way to compress a large directory such that it is quick and easy to navigate and doesn't require root permissions?
You added zip in the tags but don't mention it in the question. For me, zip is the simplest way to manage big archives with many files: each file is compressed individually, so you can list and extract single entries without unpacking the whole archive. Moreover, tar+gzip is really a two-step operation that needs extra work to speed up, and zip is available on a lot of platforms, so you win in that direction too.
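For example (archive and directory names follow the tar example above and are otherwise placeholders):

zip -r mybigdir.zip mybigdir/                    # create; each file gets its own compressed entry
unzip -l mybigdir.zip                            # list the contents without extracting anything
unzip mybigdir.zip "mybigdir/path/to/one/file"   # pull out a single file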

rsync hangs in the middle of a transfer at a fixed position

I am trying to use rsync to transfer some big files between servers.
For some reason, when the file is big enough (2GB - 4GB), rsync hangs in the middle at exactly the same position, i.e., the progress at which it hangs always sticks to the same place even when I retry.
If I remove the file from the destination server first, then rsync works fine.
This is the command I used:
/usr/bin/rsync --delete -avz --progress --exclude-from=excludes.txt /path/to/src user@server:/path/to/dest
I have tried adding --delete-during and --delete-delay, with no luck.
The rsync version is rsync version 3.1.0 protocol version 31
Any advice please? Thanks!
Eventually I solved the problem by removing the compression option (-z).
I still don't know why that is.
I had the same problem (trying to rsync multiple files of up to 500GiB each between my home NAS and a remote server).
In my case the solution (mentioned here) was to add to "/etc/ssh/sshd_config" (on the server to which I was connecting) the following:
ClientAliveInterval 15
ClientAliveCountMax 240
"ClientAliveInterval X" will send a kind of message/"probe" every X seconds to check if the client is still alive (default is not to do anything).
"ClientAliveCountMax Y" will terminate the connection if after Y-probes there has been no reply.
I guess the root cause of the problem is that in some cases the compression (and/or block diff) performed locally on the server takes so long that the SSH connection (created by rsync) is automatically dropped while that work is still ongoing.
Another workaround (e.g. if "sshd_config" cannot be changed) might be to use rsync's "--new-compress" option and/or a lower compression level (e.g. "rsync --new-compress --compress-level=1"): in my case the new compression (and diff) algorithm is a lot faster than the old/classical one, so the ssh timeout might not occur as it does with the default settings.
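Roughly, applying that workaround to the command from the question would look something like this (paths and host as in the original; --new-compress needs a reasonably recent rsync on both ends):

/usr/bin/rsync --delete -avz --new-compress --compress-level=1 --progress --exclude-from=excludes.txt /path/to/src user@server:/path/to/dest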
The problem for me was that I thought I had plenty of disk space on the drive, but the partition was not using the whole disk and was half the size I expected.
So check the available space with lsblk and df -h and make sure the partition you are writing to actually uses all the space available on the device.
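For example (the destination mount point is a placeholder):

df -h /path/to/dest   # free space on the filesystem you are actually writing to
lsblk                 # partition sizes vs. the size of the underlying disks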

Checking the integrity of a copied folder

I am copying a big folder (300GB) to an external hard drive. I want to make sure the copy is complete and not corrupt before deleting the original. How can I do that in Ubuntu?
You could use rsync --checksum to check the files, or simply use sha256sum or similar to check them manually. Using rsync is in my opinion more comfortable because it automatically checks recursively, but that largely depends on your use case.
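For example (the source and destination paths are placeholders), either a checksum-based dry run with rsync or an explicit sha256sum pass:

# dry run: lists files whose checksums differ, transfers nothing
rsync -rcn --itemize-changes /path/to/original/ /media/external/copy/

# or: record checksums of the originals, then verify the copy against them
(cd /path/to/original && find . -type f -exec sha256sum {} + > /tmp/checksums.txt)
(cd /media/external/copy && sha256sum -c --quiet /tmp/checksums.txt)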
If you really require absolute integrity, you should consider using an error-correcting code. Hard drives don't preserve data integrity forever; a bit might flip from time to time.

How to speed up reading of a fixed set of small files on linux?

I have 100,000 1KB files, and a program that reads them - it is really slow.
My best idea for improving performance is to put them on ramdisk.
But this is a fragile solution; every restart requires setting up the ramdisk again.
(and file copying is slow as well)
My second best idea is to concatenate the files and work with that. But it is not trivial.
Is there a better solution?
Note: I need to avoid dependencies in the program, even Boost.
You can optimize by storing the files contiguously on disk.
On a disk with ample free room, the easiest way would be to read from a tar archive instead.
Other than that, there is (or used to be) a Debian package for 'readahead'.
You can use that tool to:
profile a normal run of your software
edit the list of files accessed (detected by readahead)
You can then call readahead with that file list (it will order the files in disk order so throughput is maximized and seek times are minimized).
Unfortunately, it has been a while since I used these, so I hope you can google the respective packages.
This is what I seem to have found now:
sudo apt-get install readahead-fedora
Good luck
If your files are static, I agree: just tar them up and then place that in a RAM disk. It will probably be faster to read directly out of the tar file, but you can test that.
Edit: instead of tar, you could also try creating a squashfs volume.
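A minimal sketch of both ideas (paths and the tmpfs size are placeholders; mounting the tmpfs needs root):

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=256m tmpfs /mnt/ramdisk
tar cf /mnt/ramdisk/smallfiles.tar -C /path/to/smallfiles .   # one big sequential file to read from

mksquashfs /path/to/smallfiles smallfiles.squashfs            # alternative: a compressed, read-only image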
If you don't want to do that, or still need more performance then:
put your data on an SSD.
start investigating filesystem performance, starting with ext4, XFS, etc.

How to transfer large file from local to remote box with auto-resume and transfer only what has changed?

I tried the following command:
rsync -av --progress --inplace --rsh='ssh' /home/tom/workspace/myapp.war root@172.241.181.124:/home/rtom/uploads
But it seems to transfer the whole file again each time I execute the command after I make a small change in the app that regenerates myapp.war.
I want also the connection to automatically resume if connection is lost. I think this part is working.
The transfer should occur over ssh.
The connection is very slow and can break too, so it is important that only what has changed is transferred. Of course it must also ensure that the file was correctly transferred.
rsync does handle relatively small changes and partial uploads in a file efficiently; there has been significant effort in the rsync algorithm in this direction.
The problem is that WAR files are "extended" JAR files, which are essentially ZIP archives and are therefore compressed.
A small change in an uncompressed file will change the whole compressed segment where that file belongs and - most importantly - it can also change its size significantly. That can overcome the ability of rsync to detect and handle changes in the final compressed file.
On ZIP archives each uncompressed file has its own compressed segment. Therefore the order in which files are placed in the archive is also important with regard to achieving a degree of similarity to a previous version. Depending on how the WAR file is created, just adding a new file or renaming one can cause segments to move, essentially making the WAR file unrecognisable. In other words:
A small change in your application normally means a rather large change in your WAR file.
rsync is not designed to handle changes in compressed files. However, it can handle changes in your application. One solution would be to use it to upload your application files and then create the WAR file on the remote host.
A slightly different approach - that does not need any development tools on the remote host - would be to unpack (i.e. unzip) the WAR file locally, upload its contents and then pack (i.e. zip) it again on the remote host. This solution only requires a zip or jar implementation on the remote host.
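A rough sketch of that second approach (the host and paths are taken from the question, the rest are placeholders; it assumes unzip/zip are available where they are used):

# locally: unpack the WAR and sync its (mostly unchanged) contents
unzip -o -q myapp.war -d myapp-exploded/
rsync -av --partial --delete -e ssh myapp-exploded/ root@172.241.181.124:/home/rtom/uploads/myapp-exploded/

# remotely: rebuild the WAR from the synced directory
ssh root@172.241.181.124 'cd /home/rtom/uploads/myapp-exploded && zip -q -r ../myapp.war .'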
