BusyBox tar: append workaround given limited disk space?

I'm on a Linux system with limited resources and BusyBox -- this version of tar does not support --append, -r. Is there a workaround that will allow me to [1] append files from directory B to an existing tar of files from directory A after [2] making the B-files appear to have come from directory A? (Later, when someone extracts the files, they should all end up in the same directory A.)
Situation: I have a list of files that I want to tar, but I must process some of these files first. The files might be used by other processes so I don't want to edit them in-place. I want to be conservative when using disk space so my script only copies those files which it needs to change (vs copying them all and then processing some and finally archiving them all with tar -- if I copied them all I might run into disk space issues).
This means the files I want to archive end up in two separate locations. But I want the resulting tar file to appear as if they were all in the same location. Near the end of my script, I end up with two text files listing the A and B files by name.
I think this is straightforward with a full-blown version of tar, but I have to work with the BusyBox version (usage below). Thanks in advance for any ideas!
Usage: tar -[cxtzjaZmvO] [-X FILE] [-f TARFILE] [-C DIR] [FILE]...
Create, extract, or list files from a tar file
Operation:
c Create
x Extract
t List
Options:
f Name of TARFILE ('-' for stdin/out)
C Change to DIR before operation
v Verbose
z (De)compress using gzip
j (De)compress using bzip2
a (De)compress using lzma
Z (De)compress using compress
O Extract to stdout
h Follow symlinks
m Don't restore mtime
exclude File to exclude
X File with names to exclude
T File with names to include

In principle, you just need to append a tar archive containing the additional files to the end of the existing tar file. It is only slightly more difficult than that.
A tar file consists of any number of repetitions of header + file. The header is always a single 512-byte block, and the file is padded to a multiple of 512 bytes, so you can think of these units as being a variable number of 512-byte blocks. Each unit is independent; its header starts with the full pathname of the file. So there is no requirement that files in a directory be tarred together.
There is one complication. At the end of the tar file, there are at least two 512-byte blocks completely filled with 0s. When tar is reading a tar file, it will ignore a single zero-filled header, but the second one will cause it to stop reading the file. If it hits EOF instead, it will complain, so the terminating empty headers are required.
There might be more than two such blocks, because tar actually writes in records which are a multiple of 512 bytes. GNU tar, for example, by default writes in multiples of 20 512-byte blocks, so the smallest tar file is normally 10240 bytes.
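If Python happens to be available on a machine where you can inspect the archive, the layout described above can be checked with a small sketch. It assumes the standard tarfile module; make_tar and trailing_zero_blocks are hypothetical helper names, not part of any tool mentioned here.

```python
import io
import tarfile

BLOCK = 512  # tar header and padding unit, in bytes


def make_tar(files):
    """Build an uncompressed ustar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def trailing_zero_blocks(archive):
    """Count the all-zero 512-byte blocks at the end of an archive."""
    count = 0
    end = len(archive)
    while end >= BLOCK and archive[end - BLOCK:end] == bytes(BLOCK):
        count += 1
        end -= BLOCK
    return count


archive = make_tar({"a/file1": b"hello\n"})
# The first header block begins with the member's path name.
print(archive[:7])
# Total size is a whole number of blocks, with at least two zero blocks at the end.
print(len(archive) % BLOCK, trailing_zero_blocks(archive))
```

Python's tarfile also pads to a larger record size, like GNU tar, so the count of trailing zero blocks is normally well above two.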
In order to append new data, you need to first truncate the existing file to eliminate the empty blocks.
I believe that if the tar file was produced by busybox, there will only be two empty blocks, but I haven't inspected the code. That would be easy; you only need to truncate the last 1024 bytes of the file before appending the additional files.
For general tar files, it is trickier. If you knew that the files themselves didn't have NUL bytes in them (i.e. they were all simple text files), you could remove empty headers until you found a block with a non-0 byte in it, which wouldn't be too difficult.
What I would do is:
1. Truncate the last 1024 bytes of the tar file.
2. Remember the current size of the tar file.
3. Append a test tar file consisting of the tar of a file with a simple short message.
4. Verify that tar tf correctly shows the test file.
5. Truncate the file back to the remembered length.
6. If the tar tf found the test file's name, succeed.
7. If the last 512 bytes of the tar file are all 0s, truncate the last 512 bytes of the file, and return to step 2.
8. Otherwise fail.
If the above procedure succeeds, you can proceed to append the tar archive with the new files.
I don't know if you have a truncate command. If not, you can use dd to copy a file over the top of an old file at a specified offset (see the seek= option). dd will truncate the file automatically at the end of the copy. You can also use dd to read a 512-byte block (see the skip= and count= options).
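The idea of dropping the end-of-archive blocks and then appending a second archive can be sketched in Python, assuming it is available and the archives fit in memory. This is a simplified variant, not the exact stepwise procedure above: it strips every trailing zero block at once and then verifies the listing. The helper names are hypothetical. If a member's data happens to end in an all-zero block, too much is stripped and the check fails (either the ValueError below or a tarfile.ReadError).

```python
import io
import tarfile

BLOCK = 512


def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(archive):
    """Return member names, the equivalent of `tar tf`."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return tar.getnames()


def append_archive(first, second):
    """Strip the trailing zero blocks from `first`, then append `second`."""
    end = len(first)
    while end >= BLOCK and first[end - BLOCK:end] == bytes(BLOCK):
        end -= BLOCK
    combined = first[:end] + second
    # Verify, as the procedure above suggests, that the listing is what we expect.
    if list_names(combined) != list_names(first) + list_names(second):
        raise ValueError("combined archive does not list the expected members")
    return combined


a = make_tar({"a/file1": b"one\n"})
b = make_tar({"a/file2": b"two\n"})  # stored under a/ although it came from elsewhere
combined = append_archive(a, b)
print(list_names(combined))
```

Because the member name is chosen when the second archive is built, the files from B can be recorded under A's path, which is the second half of the question.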

The best solution is to cut the last 1024 bytes and concatenate a new tar after it. In order to append a tar to an existing tar file, they must be uncompressed.
For files like:
$ find a b
a
a/file1
b
b/file2
You can:
$ tar -C a -czvf a.tar.gz .
$ gunzip -c a.tar.gz | { head -c -$((512*2)); tar -C b -c .; } | gzip > a+b.tar.gz
With the result:
$ tar -tzvf a+b.tar.gz
drwxr-xr-x 0/0 0 2018-04-20 16:11:00 ./
-rw-r--r-- 0/0 0 2018-04-20 16:11:00 ./file1
drwxr-xr-x 0/0 0 2018-04-20 16:11:07 ./
-rw-r--r-- 0/0 0 2018-04-20 16:11:07 ./file2
Or you can create both tars in the same command:
$ tar -C a -c . | { head -c -$((512*2)); tar -C b -c .; } | gzip > a+b.tar.gz
Note that this works for archives generated by BusyBox tar. As mentioned in the previous answer, GNU tar pads the archive to a multiple of 20 blocks. You need to force the blocking factor to 1 (--blocking-factor=1) in order to know in advance how many blocks to cut:
$ tar --blocking-factor=1 -C a -c . | { head -c -$((512*2)); tar -C b -c .; } | gzip | tar --blocking-factor=1 -tzv
Anyway, GNU tar does have --append. The last --blocking-factor=1 is only needed if you intend to append to the resulting tar again.

Related

Combining and compressing using "tar czf" and "tar + gzip". The resultant file in both cases is packname.tar.gz but why sizes are different?

There are three text files. test1, test2 and test3 with file sizes as:
test1 - 121 B
test2 - 4 B
test3 - 26 B
I am trying to combine and compress these files using different methods.
Method-A
Combine the files using tar and then compress it using gzip.
$ tar cf testpack1.tar test1 test2 test3
$ gzip testpack1.tar
Output is testpack1.tar.gz with size 276 B
Method-B
Combine and compress the files using tar.
$ tar czf testpack2.tar.gz test1 test2 test3
Output is testpack2.tar.gz with size 262 B
Why are the sizes of the two files different?
B means bytes.

If you un-gzip the archive created by your Method-B, I bet it will be 10240 bytes. The reason for the difference in size is that tar aligns the compressed archive to the block size (padding with zero bytes), but it does not align the uncompressed archive. Here is an excerpt from the GNU tar documentation:
-b blocks
--blocking-factor=blocks
Set record size to blocks * 512 bytes.
This option is used to specify a blocking factor for the archive. When
reading or writing the archive, tar, will do reads and writes of the
archive in records of block*512 bytes. This is true even when the
archive is compressed. Some devices requires that all write operations
be a multiple of a certain size, and so, tar pads the archive out to
the next record boundary. The default blocking factor is set when tar
is compiled, and is typically 20. Blocking factors larger than 20
cannot be read by very old versions of tar, or by some newer versions
of tar running on old machines with small address spaces. With a
magnetic tape, larger records give faster throughput and fit more data
on a tape (because there are fewer inter-record gaps). If the archive
is in a disk file or a pipe, you may want to specify a smaller
blocking factor, since a large one will result in a large number of
null bytes at the end of the archive.
You can create the same compressed tar archive like this:
tar -b 20 -cf test.tar test1 test2 test3
gzip test.tar
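One effect that can be checked independently of block padding: when gzip compresses a named file it records the original file name in the gzip header, while data arriving through a pipe (as with tar czf) has no name to record. The sketch below uses Python's gzip module to compare the two cases. The payload is made up, and it is offered as something to test on your own files rather than as the confirmed cause of the gap in the question.

```python
import gzip
import io


def gzip_bytes(data, filename):
    """Compress `data`, storing `filename` in the gzip header (none if empty)."""
    buf = io.BytesIO()
    # mtime=0 keeps the two outputs identical apart from the name field.
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"example tar payload\n" * 20  # stand-in for the tar data
named = gzip_bytes(payload, "testpack1.tar")
unnamed = gzip_bytes(payload, "")
# The named variant is longer by the name plus its terminating NUL byte.
print(len(named) - len(unnamed), len("testpack1.tar") + 1)
```

Running gzip -n on the .tar file (which omits the name and timestamp) is a quick way to see whether this accounts for the difference on a given system.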

Difference in .tar.gz and first gz and then tar

I made two compressed copies of my folder, the first by using the command tar czf dir.tar.gz dir
This gives me an archive of size ~16 KB. Then I tried another method: first I gzipped all the files inside the dir and then ran tar, using
gzip ./dir/*
tar cf dir.tar dir/*.gz
but the second method gave me a dir.tar of size ~30 KB (almost double). Why is there so much difference in size?

Because compression in general is more efficient on a big sample than on small files. Say you have gzipped 100 files of 1 KB each. Each file gets a certain amount of compression, plus the overhead of the gzip format.
file1 -> file1.gz (assume 30 bytes of headers/footers)
file2 -> file2.gz (assume 30 bytes of headers/footers)
...
file100 -> file100.gz (assume 30 bytes of headers/footers)
------------------------------
30*100 = 3 KB of overhead.
But if you compress a tar file of 100 KB (which contains your 100 files), the overhead of the gzip format is added only once (instead of 100 times), and the compression can be better.
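A quick way to see the effect is to compress many small, similar inputs one by one and then all together. The sketch below uses Python's gzip module on made-up data; the exact numbers depend on the content, so none are quoted here.

```python
import gzip

# 100 small "files" with similar content (invented sample data).
files = [
    (f"record {i}: the quick brown fox jumps over the lazy dog\n" * 10).encode()
    for i in range(100)
]

# Each file compressed on its own: every output pays the gzip header/trailer
# and starts with an empty compression window.
separate = sum(len(gzip.compress(data)) for data in files)

# All files compressed as a single stream: one header/trailer, shared window.
together = len(gzip.compress(b"".join(files)))

print(separate, together)
```

The tar headers add some size back in the tar-then-gzip case, but they are highly repetitive and compress well.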

The overhead comes from the per-file metadata and from suboptimal compression: when gzip processes files individually it never sees the data as a whole, so it compresses with a suboptimal dictionary (which is reset after each file).

tar cf creates an uncompressed archive, which means the archive should be almost the same size as your directory, maybe even larger.
tar czf additionally runs gzip compression on it.
This can be checked with man tar at a Linux shell prompt:
-z, --gzip, --gunzip, --ungzip
filter the archive through gzip

Fast Concatenation of Multiple GZip Files

I have list of gzip files:
file1.gz
file2.gz
file3.gz
Is there a way to concatenate or gzip these files into one gzip file
without having to decompress them?
In practice we will use this in a web database (CGI). Where the web will receive
a query from user and list out all the files based on the query and present them
in a batch file back to the user.

With gzip files, you can simply concatenate the files together, like so:
cat file1.gz file2.gz file3.gz > allfiles.gz
Per the gzip RFC,
A gzip file consists of a series of "members" (compressed data sets). [...] The members simply appear one after another in the file, with no additional information before, between, or after them.
Note that this is not exactly the same as building a single gzip file of the concatenated data; among other things, all of the original filenames are preserved. However, gunzip seems to handle it as equivalent to a concatenation.
Since existing tools generally ignore the filename headers for the additional members, it's not easily possible to extract individual files from the result. If you want this to be possible, build a ZIP file instead. ZIP and GZIP both use the DEFLATE algorithm for the actual compression (ZIP supports some other compression algorithms as well as an option - method 8 is the one that corresponds to GZIP's compression); the difference is in the metadata format. Since the metadata is uncompressed, it's simple enough to strip off the gzip headers and tack on ZIP file headers and a central directory record instead. Refer to the gzip format specification and the ZIP format specification.
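The concatenation behaviour described above can be checked with Python's gzip module, whose decompress function also reads multi-member data. A minimal sketch with invented contents:

```python
import gzip

# Three independently compressed members, as if they were file1.gz..file3.gz.
parts = [
    gzip.compress(b"first\n"),
    gzip.compress(b"second\n"),
    gzip.compress(b"third\n"),
]

# Equivalent of: cat file1.gz file2.gz file3.gz > allfiles.gz
combined = b"".join(parts)

# Decompressing yields the concatenation of the original contents.
restored = gzip.decompress(combined)
print(restored)
```

As noted, the member boundaries are not visible in the decompressed output, so this suits data that is meant to be read as one stream.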

Here is what man 1 gzip says about your requirement.
Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. For example:
gzip -c file1 > foo.gz
gzip -c file2 >> foo.gz
Then
gunzip -c foo
is equivalent to
cat file1 file2
Needless to say, file1 can be replaced by file1.gz.
Note this, however:
gunzip will extract all members at once
So to get the members individually, you will have to use an additional tool or write one yourself.
However, this is also addressed in the man page.
If you wish to create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.

Just use cat. It is very fast (0.2 seconds for 500 MB for me)
cat *gz > final
mv final final.gz
You can then read the output with zcat to make sure it's pretty:
zcat final.gz
I tried the other answer's 'gzip -c' but I ended up with garbage when using already gzipped files as input (I guess it double compressed them).
PV:
Better yet, if you have it, 'pv' instead of cat:
pv *gz > final
mv final final.gz
This gives you a progress bar as it works, but does the same thing as cat.

You can create a tar file of these files and then gzip the tar file to create the new gzip file
tar -cvf newcombined.tar file1.gz file2.gz file3.gz
gzip newcombined.tar

Extracting split .tar files on linux, that separately need to be decrypted

I am trying to extract and decrypt 23 .tar files named as per below:
dev_flash_000.tar.aa.2010_07_29_170013
There are 23 of them, and each needs to be decrypted with an app called dePKG before it is extracted.
I tried this bash script:
for i in `ls dev_flash*`; do ./depkg $i $i.tar ; tar -xvf ./$i.tar ; rm $i.tar; done
and get this error for all 23 files:
read 0x800 bytes of pkg
pkg data # 340 with size 3ec
not inflated, writing 1004 bytes
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
I just want to save time :D

You should not use ls in a ` ` context — see http://porkmail.org/era/unix/award.html#ls . FWIW:
for i in dev_flash*; do
    ./depkg "$i" -
done | tar -xv
Check with your depkg manual pages on how to make it output to stdout, or if it does not, use /dev/stdout as a file. Not only does that save you the temporaries, but running a single tar command on the concatenation of the decrypted contents also works properly when the original archive has been split at arbitrary positions.
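The point about arbitrary split positions can be illustrated in Python with an in-memory archive: a piece cut from the middle is not a valid tar on its own, while the concatenation of all pieces is. The member name and the 700-byte chunk size are invented for the example, and decryption is left out entirely.

```python
import io
import tarfile


def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(data):
    """Return member names, raising tarfile.ReadError for non-tar data."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        return tar.getnames()


archive = make_tar({"dev_flash/data.bin": b"x" * 2000})

# Split at positions unrelated to the 512-byte block structure.
chunks = [archive[i:i + 700] for i in range(0, len(archive), 700)]

try:
    list_names(chunks[1])  # starts in the middle of the file data
    middle_chunk_ok = True
except tarfile.ReadError:
    middle_chunk_ok = False

# Concatenating the pieces in order restores a readable archive.
names = list_names(b"".join(chunks))
print(middle_chunk_ok, names)
```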

How to compare two tarball's content

I want to tell whether two tarball files contain identical files, in terms of file name and file content, not including meta-data like date, user, group.
However, there are some restrictions:
First, I have no control over whether the metadata is included when making the tar file; in fact, the tar file always contains metadata, so directly diffing the two tar files doesn't work.
Second, some tar files are so large that I cannot afford to untar them into a temp directory and diff the contained files one by one. (I know that if I can untar file1.tar into file1/, I can compare them by invoking 'tar -dvf file2.tar' in file1/. But usually I cannot afford to untar even one of them.)
Any idea how I can compare the two tar files? It would be better if it could be accomplished within shell scripts. Alternatively, is there any way to get each sub-file's checksum without actually untarring a tarball?
Thanks,

Try also pkgdiff to visualize differences between packages (it detects added/removed/renamed files and changed content, and exits with a zero code if unchanged):
pkgdiff PKG-0.tgz PKG-1.tgz

Are you controlling the creation of these tar files?
If so, the best trick would be to create an MD5 checksum and store it in a file within the archive itself. Then, when you want to compare two archives, you just extract these checksum files and compare them.
If you can afford to extract just one tar file, you can use the --diff option of tar to look for differences with the contents of other tar file.
There is one more crude trick, if you are fine with just a comparison of the filenames and their sizes.
Remember, this does not guarantee that the file contents are the same!
Execute a tar tvf to list the contents of each archive and store the outputs in two different files. Then slice out everything besides the filename and size columns. Preferably sort the two files too. Then just do a file diff between the two lists.
Just remember that this last scheme does not really do a checksum.
Sample tar and output (all files are zero size in this example).
$ tar tvfj pack1.tar.bz2
drwxr-xr-x user/group 0 2009-06-23 10:29:51 dir1/
-rw-r--r-- user/group 0 2009-06-23 10:29:50 dir1/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:51 dir1/file2
drwxr-xr-x user/group 0 2009-06-23 10:29:59 dir2/
-rw-r--r-- user/group 0 2009-06-23 10:29:57 dir2/file1
-rw-r--r-- user/group 0 2009-06-23 10:29:59 dir2/file3
drwxr-xr-x user/group 0 2009-06-23 10:29:45 dir3/
Command to generate sorted name/size list
$ tar tvfj pack1.tar.bz2 | awk '{printf "%10s %s\n",$3,$6}' | sort -k 2
0 dir1/
0 dir1/file1
0 dir1/file2
0 dir2/
0 dir2/file1
0 dir2/file3
0 dir3/
You can take two such sorted lists and diff them.
You can also use the date and time columns if that works for you.
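The same name/size comparison can be sketched with Python's tarfile module, which reads only the headers and does not extract anything. The archives here are built in memory with invented contents, and make_tar and name_size_list are hypothetical helpers. As with the shell version, equal lists do not prove equal contents.

```python
import io
import tarfile


def make_tar(files, mtime):
    """Build a tar archive in memory; every member gets the given mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def name_size_list(archive):
    """Return a sorted list of (name, size) pairs, ignoring other metadata."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        return sorted((m.name, m.size) for m in tar.getmembers())


old = make_tar({"dir1/file1": b"abc", "dir2/file3": b"hello"}, mtime=1000)
same = make_tar({"dir2/file3": b"hello", "dir1/file1": b"abc"}, mtime=2000)
changed = make_tar({"dir1/file1": b"abcd", "dir2/file3": b"hello"}, mtime=1000)

# Order and timestamps differ, names and sizes match.
print(name_size_list(old) == name_size_list(same))
# One file has a different size.
print(name_size_list(old) == name_size_list(changed))
```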

tarsum is almost what you need. Take its output, run it through sort to get the ordering identical on each, and then compare the two with diff. That should get you a basic implementation going, and it would be easy enough to pull those steps into the main program by modifying the Python code to do the whole job.

Here is my variant, it is checking the unix permission too:
Works only if the filenames are shorter than 200 char.
diff <(tar -tvf 1.tar | awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2) <(tar -tvf 2.tar|awk '{printf "%10s %200s %10s\n",$3,$6,$1}'|sort -k2)

EDIT: See the comment by @StéphaneGourichon
I realise that this is a late reply, but I came across the thread whilst attempting to achieve the same thing. The solution that I've implemented outputs the tar to stdout, and pipes it to whichever hash you choose:
tar -xOzf archive.tar.gz | sort | sha1sum
Note that the order of the arguments is important; particularly O which signals to use stdout.

Is tardiff what you're looking for? It's "a simple perl script" that "compares the contents of two tarballs and reports on any differences found between them."

There is also diffoscope, which is more generic and lets you compare things recursively (including various formats).
pip install diffoscope

I propose gtarsum, that I have written in Go, which means it will be an autonomous executable (no Python or other execution environment needed).
go get github.com/VonC/gtarsum
It will read a tar file, and:
sort the list of files alphabetically,
compute a SHA256 for each file content,
concatenate those hashes into one giant string
compute the SHA256 of that string
The result is a "global hash" for a tar file, based on the list of files and their content.
It can compare multiple tar files, and return 0 if they are identical, 1 if they are not.
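For readers who want to see the idea without installing anything, here is a rough Python sketch of a "global hash". It is not gtarsum's actual code: it streams each member through tarfile without writing to disk, and, as an assumption of this sketch, it mixes the member name into the final hash alongside the content digest.

```python
import hashlib
import io
import tarfile


def make_tar(files, mtime):
    """Build a tar archive in memory from a list of (name, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def global_hash(archive):
    """Hash of the sorted (name, SHA-256 of content) pairs of regular files."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                content = tar.extractfile(member).read()
                entries.append((member.name, hashlib.sha256(content).hexdigest()))
    entries.sort()
    joined = "".join(f"{name}:{digest}\n" for name, digest in entries)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


a = make_tar([("x", b"1"), ("y", b"2")], mtime=1)
b = make_tar([("y", b"2"), ("x", b"1")], mtime=99)  # other order, other dates
c = make_tar([("x", b"1"), ("y", b"3")], mtime=1)   # one file changed

print(global_hash(a) == global_hash(b), global_hash(a) == global_hash(c))
```

For very large members, reading in fixed-size chunks from the object returned by extractfile would keep memory use flat.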

Just throwing this out there since none of the above solutions worked for what I needed.
This function gets the md5 hash of the md5 hashes of all the file-paths matching a given path. If the hashes are the same, the file hierarchy and file lists are the same.
I know it's not as performant as others, but it provides the certainty I needed.
PATH_TO_CHECK="some/path"
# For each tarball, hash the md5 sums of the members whose path matches.
for template in $(find build/ -name '*.tar'); do
    tar -xvf "$template" --to-command=md5sum |
        grep "$PATH_TO_CHECK" -A 1 |
        grep -v "$PATH_TO_CHECK" |
        awk '{print $1}' |
        md5sum |
        awk "{print \"$template\",\$1}"
done
*note: An invalid path simply returns nothing.

If you don't need to extract the archives or see the actual differences, try diff's -q option:
diff -q 1.tar 2.tar
This quiet result will be "1.tar 2.tar differ" or nothing, if no differences.

There is a tool called archdiff. It is basically a perl script that can look into the archives.
Takes two archives, or an archive and a directory and shows a summary of the
differences between them.

I had a similar question and solved it with Python; here is the code.
PS: although this code compares the contents of two zipballs, the approach is similar for tarballs. Hope it helps.
import hashlib
import os
import shutil
import zipfile
def decompressZip(zipName, dirName):
    # Extract every member of the zip into dirName and return the member names.
    with zipfile.ZipFile(zipName, "r") as zipFile:
        fileNames = zipFile.namelist()
        for file in fileNames:
            zipFile.extract(file, dirName)
    return fileNames
def md5sum(filename):
    # Return the upper-case hex MD5 digest of the file's content.
    with open(filename, "rb") as f:
        md5obj = hashlib.md5()
        md5obj.update(f.read())
    return md5obj.hexdigest().upper()
if __name__ == "__main__":
    oldFileList = decompressZip("./old.zip", "./oldDir")
    newFileList = decompressZip("./new.zip", "./newDir")
    oldDict = dict()
    newDict = dict()
    # Map each regular file in the old archive to its MD5.
    for oldFile in oldFileList:
        tmpOldFile = "./oldDir/" + oldFile
        if not os.path.isdir(tmpOldFile):
            oldFileMD5 = md5sum(tmpOldFile)
            oldDict[oldFile] = oldFileMD5
    # Same for the new archive.
    for newFile in newFileList:
        tmpNewFile = "./newDir/" + newFile
        if not os.path.isdir(tmpNewFile):
            newFileMD5 = md5sum(tmpNewFile)
            newDict[newFile] = newFileMD5
    additionList = list()
    modifyList = list()
    # Files only in the new archive are additions; differing hashes are modifications.
    for key in newDict:
        if key not in oldDict:
            additionList.append(key)
        else:
            newMD5 = newDict[key]
            oldMD5 = oldDict[key]
            if not newMD5 == oldMD5:
                modifyList.append(key)
    print("new file list:%s" % additionList)
    print("modified file list:%s" % modifyList)
    # Clean up the temporary extraction directories.
    shutil.rmtree("./oldDir")
    shutil.rmtree("./newDir")
