How to limit memory usage during tar - linux

I need to tar (or otherwise archive) more than 2.7 million files (150 GB in total).
However, with this many files the tar command uses far too much memory and my system crashes. What can I do?
tar -cf /path/filename.tar /path_to_file/
I've tried doing it in batches (multiple tar files would be fine) based on file creation date using find, but find uses even more memory.

Not sure if this is an answer exactly, as it doesn't say how to explicitly lower tar's memory usage, but...
You can have tar write to stdout and pipe it through pigz (parallel gzip), limiting the number of threads pigz uses to help manage memory. Maybe something like:
tar cvf - paths-to-archive | pigz -p 4 > archive.tar.gz
where the argument to -p (4 here) is the number of threads/cores pigz should use.
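If you still want to batch, one way to keep memory bounded is to stream the file names from find into a list, split the list, and feed each piece to tar with -T (--files-from). This is only a sketch, assuming GNU tar and coreutils split; the batch size and paths are illustrative, and it assumes file names contain no newlines:

find /path_to_file -type f > /tmp/filelist.txt
split -l 500000 /tmp/filelist.txt /tmp/batch.
for list in /tmp/batch.*; do
    tar -cf "/path/archive.$(basename "$list").tar" -T "$list"
done

Because find only streams names here (no sorting), its memory use stays flat regardless of how many files there are.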

Related

Use of temporary files and memory when using tar to back up very large files with compression

When backing up one or more very large files using tar with compression (-j or -z), how does GNU tar manage its use of temporary files and memory?
Does it backup and compress the files block by block, file by file, or some other way?
Is there a difference between the way the following two commands use temporary files and memory?
tar -czf data.tar.gz ./data/*
tar -cf - ./data/* | gzip > data.tar.gz
Thank you.
No temporary files are used by either command. tar works completely in a streaming fashion. Packaging and compression are entirely separate steps, and the data is piped between them even when you use the -z or -j option (or similar).
For each file tar puts into an archive, it first writes a header record which contains the file's path, its owner, permissions, etc., and also its size. The size needs to be known up front (which is why tarring the output of a stream isn't easy without a temporary file). After this header, the plain contents of the file follow. Since the size is known and is already part of the header, the end of the file's data is unambiguous, so the next file in the archive can follow directly. No temporary files are needed anywhere in this process.
This stream of bytes is handed to whichever of the supported compression programs is in use, and those do not create temporary files either. Here I'm going out on a limb a bit because I don't know every compression tool by heart, but all the ones I have come across work as pure stream filters.
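A quick way to see this header-then-content layout for yourself, assuming the common ustar format (where the file name sits at offset 0 of the 512-byte header and the size, as an octal string, at offset 124):

printf 'hello\n' > demo.txt     # demo.txt is just an illustrative file
tar -cf demo.tar demo.txt
dd if=demo.tar bs=1 skip=124 count=12 2>/dev/null   # prints the octal size field, here 00000000006

Since the size (6 bytes) is right there in the header, a reader knows exactly where demo.txt's data ends and the next header begins.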

Fast directory conversion to file

I have a directory D that contains multiple files and folders and consumes a very large amount of disk space. I do not care much about disk space consumption, but I want to convert D into a single file as fast as possible. The first approach that came to mind was a compression tool, but it takes too long to finish.
Is there a faster way?
Thank you for your help.
You can use the tar command with no compression.
With tar -cf you can convert your folder to a single file, with no compression step:
tar -cf your_big_folder.tar /path/to/your/big/folder
and you can convert it back to a folder with
tar -xf your_big_folder.tar
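If you can spare a little CPU but want to keep most of the speed, piping tar through a fast streaming compressor is a common middle ground. A sketch, assuming zstd is installed (lz4 works the same way):

tar -cf - /path/to/your/big/folder | zstd -1 > your_big_folder.tar.zst

and restore with

zstd -dc your_big_folder.tar.zst | tar -xf -

At its fastest level, zstd is usually close to I/O-bound, so the cost over a plain tar is small.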

How to find sizes of files in bzip2 archive without unpacking?

I have a 1.4 GB .bz2 archive and when I try to unpack it, I get this error:
bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: No space left on device
But I have 13 GB of free space on the current partition.
How can I get the size of the unzipped file without unzipping it?
You'll still have to "spend" the time of running the decompression algorithm, since the bzip2 format (unlike gzip's) does not record the decompressed size anywhere, but you can redirect the output stream to the wc -c (count bytes) utility, i.e.
bunzip2 -c bigFile.bz2 | wc -c
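A variant of the same idea that prints a human-readable size, assuming numfmt (GNU coreutils) is available:

bunzip2 -c bigFile.bz2 | wc -c | numfmt --to=iec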
IHTH
You can use the command below to get details about the .bz2 file itself (compressed with the bzip2 compressor). Note that this reports only the compressed size, not the decompressed one:
ls -ahl filename.[extension].bz2

Extracting certain file from large tar ball taking forever

I have a VERY large tar.gz archive (~50 GB) and I'm wondering if it's possible for tar to skip over N bytes or something of that sort, or at least to begin reading the file from the end instead of from the beginning.
Of course I can do tar xvf archive.tar.gz path/to/file, but this command takes forever, so I was hoping to find a quicker way to go about it.
Thanks.
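There is no real way to seek in a plain tar.gz: gzip is a sequential stream, so tar has to decompress everything up to the member you want. One thing that can help, assuming GNU tar, is --occurrence=1, which makes tar stop reading as soon as the first match has been extracted, so a file near the start of the archive comes out quickly:

tar -xzf archive.tar.gz --occurrence=1 path/to/file

For a file near the end, though, you are still stuck decompressing almost the whole stream.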

How to update tar (NOT append)

I want to update an existing tar file with newer files.
In the GNU tar manual, I read:
4.2.3 Updating an Archive
In the previous section, you learned how to use ‘--append’ to add a
file to an existing archive. A related operation is ‘--update’ (‘-u’).
The ‘--update’ operation updates a tar archive by comparing the date
of the specified archive members against the date of the file with the
same name. If the file has been modified more recently than the
archive member, then the newer version of the file is added to the
archive (as with ‘--append’).
However,
When I run my tar update command, the files are appended even though their modification dates are exactly the same. I want to append ONLY where the modification dates of the files to be tarred are newer than those already in the tar...
tar -uf ./tarfile.tar /localdirectory/ >/dev/null 2>&1
Currently, every time I update, the tar doubles in size...
The update you describe implies that the file within the archive is replaced. If the new copy is smaller than what's in the archive, it could be rewritten in place. If the new copy is larger, however, tar would have to zero out the existing archive entry and append. Such updates would leave runs of '\0's or other unused bytes, so any normal user would want those sections removed, which would mean "moving up" the bytes of the archive contents towards the start of the file (think of C's memmove).
Such an in-place move operation, which would involve seek-read-seek-write cycles, is costly, especially in the context of tapes (which tar was originally designed for), i.e. devices whose seek performance is not comparable to hard disks. You'd wear out a tape rather quickly with such move operations. And of course, WORM devices don't support this kind of move either.
If you do not want to use the -P switch, tar -u works correctly when the current directory is the parent of the directory you are going to update and the path given to tar is not an absolute path.
For example:
We want to update the directory /home/blabla/Dir. We do it like this:
cd /home/blabla
tar -u -f tarfile.tar Dir
In general, the update must be made from the same place as the creation, so that the paths agree.
It is also possible:
cd /home/blabla/Dir
tar -u -f /path/to/tarfile.tar .
You may simply create (instead of update) the archive each time:
tar -cvpf tarfile.tar *
This will solve the problem of your archive doubling in size each time, but of course it regenerates the whole archive on every run.
By default tar strips the leading / from member names, but it does this after deciding what needs to be updated.
Therefore if you are archiving an absolute path, you either need to cd / and use relative paths, or add the -P/--absolute-names option.
cd /
tar -uf "$OLDPWD/tarfile.tar" localdirectory/ >/dev/null 2>&1
tar -cPf tarfile.tar /localdirectory/ >/dev/null 2>&1
tar -uPf tarfile.tar /localdirectory/ >/dev/null 2>&1
However, the updated items will still be appended. A tar (tape archive) file cannot be modified except by appending.
Warning! When speaking about "dates" this means any date, and that includes the access time.
Should your files have been accessed in any such way (a simple ls -l is enough), then tar is right to do what it does!
You need to find another way to do what you want. Probably use a sentinel file and check whether its modification date is older than that of the files you wish to append.
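A sketch of that sentinel approach, assuming GNU find and GNU tar (the sentinel path and directory names are illustrative):

# one-time setup: create the sentinel
touch /localdirectory/.last-backup
# each backup run: append files modified since the sentinel, then refresh it
find /localdirectory -type f -newer /localdirectory/.last-backup -print0 |
    tar --null -rf ./tarfile.tar --files-from=-
touch /localdirectory/.last-backup

Note that find -newer compares modification times only, which sidesteps the access-time trap mentioned above.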
