Remove trailing null characters produced by tar - linux

I'm trying to tar up some files and pass them along to the user through the PHP passthru command.
The problem is that even though the tar file should only be about 2k, it is always 10240 bytes. Funny number, right?
So I have broken it down to:
-sh-4.1# tar czf - test | wc -c
10240
VS:
-sh-4.1# tar czf test.tar.gz test && wc -c test.tar.gz
2052 test.tar.gz
So tar is clearly padding out the file with NULs.
How can I make tar stop doing that? Alternatively, how can I strip the trailing NULs?
I'm running tar (GNU tar) 1.15.1 and cannot reproduce the problem on my workstation, which has tar (GNU tar) 1.23. Since this is an embedded project, upgrading is not the answer I'm looking for (yet).
Edit: I am hoping for a workaround that does not need to write to the file system, maybe a way to stop tar from padding, or to pipe the output through sed or something to strip out the padding.

You can attenuate the padding by using a smaller block size; try passing -b1 to tar.

You can minimise the padding by setting the block size to the minimum possible value - on my system this is 512.
$ cat test
a few bytes
$ tar -c test | wc -c
10240
$ tar -b 1 -c test | wc -c
2048
$ tar --record-size=512 -c test | wc -c
2048
$
This keeps the padding to at most 511 bytes. Short of piping through a program to remove the padding, rewrite the block header, and recreate the end-of-archive signature, I think this is the best you can do. At that point you might consider using a scripting language and its native tar implementation directly, e.g.:
PHP's PharData (http://php.net/manual/en/class.phardata.php)
Perl's Archive::Tar (https://perldoc.perl.org/Archive/Tar.html)
Python's tarfile (https://docs.python.org/2/library/tarfile.html)
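Whichever route you take, it's worth sanity-checking that the minimally padded stream is still a readable archive, especially on the old 1.15.1 build, where behaviour may differ. A quick sketch using the test file from the question:
tar -b 1 -czf - test | tar -tzvf -   # list the members back out of the stream to confirm it is intact
tar -b 1 -czf - test | wc -c         # and confirm the size stays close to the real data size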

Related

How to redirect xz's normal stdout when doing tar | xz?

I need to use a compressor like xz to compress huge tar archives.
I am fully aware of previous questions like
Create a tar.xz in one command
and
Utilizing multi core for tar+gzip/bzip compression/decompression
From them, I have found that this command line mostly works:
tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz
I use the pipe solution because I absolutely must be able to pass options to xz. In particular, xz is very CPU intensive, so I must use -T0 to use all available cores. This is why I am not using other possibilities, like tar's --use-compress-program, or -J options.
Unfortunately, I really want to capture all of tar's and xz's log output (i.e. non-archive output) into a log file. In the example above, log output is always generated by those -v options.
With the command line above, that log output is now printed on my terminal.
So, the problem is that when you use pipes to connect tar and xz as above, you cannot end the command line with something like
>Log_File 2>&1
because of that earlier
> OUTPUT_FILE.tar.xz
Is there a solution?
I tried wrapping in a subshell like this
(tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz) >Log_File 2>&1
but that did not work.
The normal stdout of tar is the tarball, and the normal stdout of xz is the compressed file. Neither of these is a log you should want to capture. All logging other than the output files themselves is written exclusively to stderr for both processes.
Consequently, you need only redirect stderr, and must not redirect stdout unless you want your output file mixed up with your logging.
{ tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz; } 2>Log_File
By the way, if you're curious about why xz -v prints more content when its output goes to the TTY, the answer is in xz's message.c: the progress_automatic flag (which tells xz to set a timer that fires SIGALRM every second, a signal xz treats as an indication that status should be printed) is only set when isatty(STDERR_FILENO) is true. Thus, after stderr has been redirected to a file, xz no longer prints this output at all; the problem is not that it isn't correctly redirected, but that it no longer exists.
You can, however, send SIGALRM to xz every second from your own code, if you're really so inclined:
{
    xz -1 -T0 -v > OUTPUT_FILE.tar.xz < <(tar -cvf - paths_to_archive) & xz_pid=$!
    while sleep 1; do
        kill -ALRM "$xz_pid" || break
    done
    wait "$xz_pid"
} 2>Log_File
(Code that avoids rounding up the time needed for xz to execute to the nearest second is possible, but left as an exercise to the reader).
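If you do want to avoid that rounding, one possible approach (a sketch under the same assumptions as the snippet above, not a tested drop-in) is to run the signalling loop in the background and wait on xz directly:
{
    xz -1 -T0 -v > OUTPUT_FILE.tar.xz < <(tar -cvf - paths_to_archive) & xz_pid=$!
    # signal from a background loop so the foreground shell can wait on xz itself
    ( while sleep 1; do kill -ALRM "$xz_pid" 2>/dev/null || exit; done ) &
    ticker_pid=$!
    wait "$xz_pid"                       # returns as soon as xz finishes, no rounding up
    kill "$ticker_pid" 2>/dev/null || true
} 2>Log_File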
First, -cvf - can be replaced by cv.
But the normal stdout output of tar cvf - is the tar file, which is piped into xz. I'm not sure I completely understand; maybe one of these:
tar cv paths | xz -1 -T0 > OUTPUT.tar.xz 2> LOG.stderr
or
tar cv paths 2> LOG.stderr | xz -1 -T0 > OUTPUT.tar.xz
or
tar cv paths 2> LOG.tar.stderr | xz -1 -T0 > OUTPUT.tar.xz 2> LOG.xz.stderr
I'm not sure whether -T0 is implemented yet; which version of xz do you use? (Maybe https://github.com/vasi/pixz is worth a closer look.) The pv program, installed with sudo apt-get install pv on some systems, is better at showing progress for pipes than xz -v. It will tell you the progress as a percentage with an ETA:
size=$(du -bc paths | tail -1 | awk '{print $1}')
tar c paths 2> LOG.stderr | pv -s$size | xz -1 -T0 > OUTPUT.tar.xz

How to pass each file that has completed tar xzf decompression through to a bash loop?

In Linux bash, I would like to be able to decompress a large tar.gz (100G-1T, hundreds of similarly sized files), so that after each file has finished decompressing, I can pass it through a bash loop for further processing. See the example below with --desired_flag:
tar xzf --desired_flag large.tar.gz \
| xargs -n1 -P8 -I % do_something_to_decompressed_file %
EDIT: the immediate use case I am thinking about is a network operation, where as soon as the contents of the files being decompressed are available, they can be uploaded somewhere on the next step. Given that the tar step could be either CPU-bound or IO-bound depending on the Linux instance, I would like to be able to efficiently pass the files to the next step, which I presume will be bound by network speed.
Given the following function definition:
buffer_lines() {
    local last_name file_name
    # hold back the most recently read name until the next one arrives,
    # so a filename is only emitted once tar has moved on to the next file
    read -r last_name || return
    while read -r file_name; do
        printf '%s\n' "$last_name"
        last_name=$file_name
    done
    printf '%s\n' "$last_name"   # flush the final name at end of input
}
...one can then run the following, whether one's tar implementation prints names at the beginning or end of their processing:
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 do_something_to_file
Note the v flag, telling tar to print filenames on stdout (in the GNU implementation, in this particular usage mode). Also note the lack of the -I argument.
If you want to insert a buffer (to allow tar to run ahead of the xargs process), consider pv:
tar xvzf large.tar.gz \
| pv -B 1M \
| buffer_lines \
| xargs -d $'\n' -n 1 -P8 do_something_to_file
...will buffer up to 1MB of unpacked names should the processing components run behind.
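For completeness, do_something_to_file above stands in for any executable that takes one extracted path as its argument. A hypothetical sketch matching the uploading use case from the question's edit (the endpoint URL and the use of curl are illustrative assumptions, not part of the original answer):
cat > do_something_to_file <<'EOF'
#!/usr/bin/env bash
# hypothetical post-processing step: upload the freshly extracted file, then delete the local copy
set -euo pipefail
curl -fsS -T "$1" "https://example.invalid/upload/"
rm -- "$1"
EOF
chmod +x do_something_to_file
# invoke it as ./do_something_to_file so xargs can find it without touching PATH
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 ./do_something_to_file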

download using rsync and extract using gunzip, and put all together into a pipe

I have "gz" files that I am downloading using "rsync". Then, as these files are compressed, I need to extract them using gunzip (I am open to any other alternative for gunzip). I want to put all these commands together into a pipe to have something like that rsync file | gunzip
My original command is the following:
awk -F "\t" '$5~/^(reference genome|representative genome)$/ {sub("ftp", "rsync", $20); b=$20"/*genomic.fna.gz"; print b" viral/." }' assembly_summary_viral.txt | xargs -l1 rsync --copy-links --times --recursive --verbose --exclude="*rna*" --exclude="*cds*"
It looks a little bit complicated, but it's downloading the files that I need, and there is no problem with it. I added | gunzip; however, the extraction of the compressed files is not working, and it's only downloading them.
Any suggestion?
A pipe takes the stdout of the left command and sends it to the stdin of the right command. Here we have to take the stdout of rsync and pipe to the stdin of gunzip.
rsync doesn't really output much without the -v flag so you'll have to add that. It will now spit out to stdout something like the following:
$ rsync -rv ./ ../viral
sending incremental file list
file1
file2
file3
test1_2/
test1_2/file1
test1_2/file2
sent 393 bytes received 123 bytes 1,032.00 bytes/sec
total size is 0 speedup is 0.00
We can pipe that to awk first to grab only the file path/name and prepend ../viral/ to the front of it, so that it gunzips the files that you just rsync'd TO (instead of the ones FROM which you rsync'd):
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}'
Now we have rsync and awk spitting out a list of filenames that are being sent to the TO directory. Now we need to get gunzip to process that list. Unfortunately, gunzip can't take in a list of files. If you send gunzip something on its stdin, it will assume that the stream is a gzipped stream and will attempt to gunzip it.
Instead, we'll employ the xargs method you have above to take the stdin and feed it into gunzip as the parameter (filename) that it needs:
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}' | xargs -l1 gunzip
Most likely you will have to tweak this a bit to ensure you are gunzipping the right files (either your FROM location files or your TO location files). This gets trickier if you are rsyncing to a remote computer over SSH, obviously. Not sure if that can just be piped.
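For example, one possible tweak (a sketch, assuming every file you care about ends in .gz and lands under ../viral) is to filter the listing before it reaches gunzip, which also drops the bare directory lines:
rsync -rv ./ ../viral \
  | awk '!NF{endFileList=1} NR>1 && endFileList!=1 && /\.gz$/ {print "../viral/"$0}' \
  | xargs -r -l1 gunzip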

Split a .gz file into multiple 1GB compressed(.gz) files

I have a 250GB gzipped file on Linux and I want to split it in 250 1GB files and compress the generated part files on the fly (as soon as one file is generated, it should be compressed).
I tried using this -
zcat file.gz | split -b 1G - file.gz.part
But this is generating uncompressed files, and rightly so. I modified it to look like this, but got an error:
zcat file.gz | split -b 1G - file.gz.part | gzip
gzip: compressed data not written to a terminal. Use -f to force compression.
For help, type: gzip -h
I also tried this, and it did not throw any error, but it did not compress the part files as soon as they were generated. I assume that this will compress each file when the whole split is done (or it may pack all part files and create a single gz file once the split has completed; I am not sure).
zcat file.gz | split -b 1G - file.gz.part && gzip
I read that split has a filter option, but my version of split is (GNU coreutils) 8.4, hence the filter option is not supported.
$ split --version
split (GNU coreutils) 8.4
Please advise a suitable way to achieve this, preferably as a one-liner (if possible); a shell (bash/ksh) script will also work.
split (in newer coreutils releases) supports filter commands. Use this:
zcat file.gz | split -b 1G --filter='gzip > $FILE.gz' - file.part.
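With the default suffixes this should produce parts named file.part.aa.gz, file.part.ab.gz, and so on. A quick sanity check afterwards (a sketch, relying on the shell glob sorting the parts in order):
cat file.part.*.gz | zcat | cmp - <(zcat file.gz) && echo "parts reassemble to the original data"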
It's definitely suboptimal, but I tried to write it in bash just for fun (I haven't actually tested it, so there may be some minor mistakes).
GB_IN_BLOCKS=`expr 2048 \* 1024`        # 1 GB expressed in 512-byte blocks
GB=`expr $GB_IN_BLOCKS \* 512`          # 1 GB in bytes
COMPLETE_SIZE=`zcat asdf.gz | wc -c`    # uncompressed size of the whole stream
PARTS=`expr $COMPLETE_SIZE \/ $GB`
for i in `seq 0 $PARTS`
do
    zcat asdf.gz | dd bs=512 skip=`expr $i \* $GB_IN_BLOCKS` count=$GB_IN_BLOCKS | gzip > asdf.gz.part$i
done

how to write a bash script that would get minor and major device numbers of /dev/random

I am trying to run a program in a chrooted environment, and it needs /dev/random as a resource.
Manually I can do ls -l on it and then create the file again with mknod c xx yy, but I need to make it automatic, and I don't think these device numbers are constant from one Linux version to another, so that is why I have the following question:
How could I write a bash script that would extract the minor and major numbers of /dev/random and use them with mknod? I can use ls -l but I don't know how to extract a substring of it...
The exact output of ls -l /dev/random is:
crw-rw-rw- 1 root root MAJOR, MINOR Mar 30 19:15 /dev/random
and the two numbers I want to extract are MAJOR and MINOR. However, if there is an easier way to create the node without ls and mknod, I would appreciate it.
You can get the major and minor device numbers with stat:
MINOR=`stat -c %T /dev/random`   # %T = minor device number
MAJOR=`stat -c %t /dev/random`   # %t = major device number
You can then create a device node with:
mknod mydevice c "$MAJOR" "$MINOR"
Another approach (which doesn't require the parsing of device numbers) is to use tar to create an archive with the details of the device files in:
cd /dev
tar cf /somewhere/devicefiles.tar random null [any other needed devices]
then
cd /somewhere/chroot-location
tar xf /somewhere/devicefiles.tar
This latter method has the advantage that it doesn't rely on the -c option to stat, which is a GNU extension.
A minor improvement to efficiency would be to do only one call (and to use lower-case variable names, as is conventional for all variables other than builtins and environment variables in shell):
read minor major < <(stat -c '%T %t' /dev/random)
On a GNU system, by the way, I'd suggest using cp -a to copy your explicitly whitelisted device files into the chroot during setup:
cp -a /dev/random /your/chroot/dev/random
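Putting the stat approach together for the chroot use case, a minimal sketch (the /your/chroot path is a placeholder; the hex conversion is there because stat's %t/%T print the numbers in hexadecimal, which merely happens to look identical to decimal for /dev/random's 1,8):
read minor major < <(stat -c '%T %t' /dev/random)   # %T/%t are reported in hex
mkdir -p /your/chroot/dev
mknod /your/chroot/dev/random c "$((16#$major))" "$((16#$minor))"   # convert hex to decimal for mknod
chmod 666 /your/chroot/dev/random                   # match the usual crw-rw-rw- permissions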
Try this (note the command substitution, and that awk field 5 is the major number with a trailing comma that needs stripping):
MAJOR=$(ls -l /dev/random | awk '{print $5}' | tr -d ',')
MINOR=$(ls -l /dev/random | awk '{print $6}')
