How to redirect xz's normal stdout when doing tar | xz? - linux

I need to use a compressor like xz to compress huge tar archives.
I am fully aware of previous questions like
Create a tar.xz in one command
and
Utilizing multi core for tar+gzip/bzip compression/decompression
From them, I have found that this command line mostly works:
tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz
I use the pipe solution because I absolutely must be able to pass options to xz. In particular, xz is very CPU intensive, so I must use -T0 to use all available cores. This is why I am not using other possibilities, like tar's --use-compress-program or -J options.
Unfortunately, I also need to capture all of tar's and xz's log output (i.e., non-archive output) into a log file. In the example above, that log output is generated by the -v options.
With the command line above, the log output is printed on my terminal instead.
So, the problem is that when you use pipes to connect tar and xz as above, you cannot end the command line with something like
>Log_File 2>&1
because of that earlier
> OUTPUT_FILE.tar.xz
Is there a solution?
I tried wrapping in a subshell like this
(tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz) >Log_File 2>&1
but that did not work.

The normal stdout of tar is the tarball, and the normal stdout of xz is the compressed file. Neither of these is log output that you should want to capture: for both processes, everything other than the output data itself is written exclusively to stderr.
Consequently, you need only redirect stderr, and must not redirect stdout unless you want your output file mixed in with your logging.
{ tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz; } 2>Log_File
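A quick way to check that the streams ended up where intended (a small sketch reusing the file names above; xz -t and xz -l operate on the finished archive):
xz -t OUTPUT_FILE.tar.xz   # test the integrity of the compressed archive
xz -l OUTPUT_FILE.tar.xz   # show compressed/uncompressed sizes
wc -l Log_File             # tar's -v file list (and any xz messages) land here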
By the way -- if you're curious about why xz -v prints more content when its output goes to the TTY, the answer is in this line of message.c: The progress_automatic flag (telling xz to set a timer to trigger a SIGALRM -- which it treats as an indication that status should be printed -- every second) is only set when isatty(STDERR_FILENO) is true. Thus, after stderr has been redirected to a file, xz no longer prints this output at all; the problem is not that it isn't correctly redirected, but that it no longer exists.
You can, however, send SIGALRM to xz every second from your own code, if you're really so inclined:
{
    xz -1 -T0 -v > OUTPUT_FILE.tar.xz < <(tar -cvf - paths_to_archive) & xz_pid=$!
    while sleep 1; do
        kill -ALRM "$xz_pid" || break
    done
    wait "$xz_pid"
} 2>Log_File
(Code that avoids rounding xz's runtime up to the next whole second is possible, but is left as an exercise for the reader; one approach is sketched below.)
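One way to avoid that final partial second is to run the SIGALRM ticker in the background and stop it as soon as wait returns. This is only a sketch built on the snippet above, with the same assumed file and path names:
{
    xz -1 -T0 -v > OUTPUT_FILE.tar.xz < <(tar -cvf - paths_to_archive) & xz_pid=$!
    # Send SIGALRM once a second from a background loop instead of the foreground.
    ( while sleep 1; do kill -ALRM "$xz_pid" 2>/dev/null || break; done ) & ticker_pid=$!
    wait "$xz_pid"                    # returns the moment xz finishes
    kill "$ticker_pid" 2>/dev/null    # stop the ticker; no trailing whole-second wait
} 2>Log_File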

First, -cvf - can be replaced by cv.
But the normal stdout of tar cvf - is the tar archive, which is piped into xz. I am not sure I completely understand, but maybe one of these:
tar cv paths | xz -1 -T0 > OUTPUT.tar.xz 2> LOG.stderr
or
tar cv paths 2> LOG.stderr | xz -1 -T0 > OUTPUT.tar.xz
or
tar cv paths 2> LOG.tar.stderr | xz -1 -T0 > OUTPUT.tar.xz 2> LOG.xz.stderr
Not sure whether -T0 is implemented yet; which version of xz do you use? (Maybe https://github.com/vasi/pixz is worth a closer look.) The pv program, installed with sudo apt-get install pv on some systems, is better at showing progress for pipes than xz -v: it reports progress as a percentage with an ETA:
size=$(du -bc path1 path2 | tail -1 | awk '{print$1}')
tar c paths 2> LOG.stderr | pv -s$size | xz -1 -T0 > OUTPUT.tar.xz
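To also capture that progress in a log file, tying this back to the original question, pv needs -f to keep writing once stderr is no longer a terminal. A sketch, assuming the same placeholder paths and names:
size=$(du -bc path1 path2 | tail -1 | awk '{print $1}')
{ tar cv paths | pv -f -s "$size" | xz -1 -T0 > OUTPUT.tar.xz; } 2> LOG.stderr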

Related

How to pass through each file that's completed tar xzf decompression to a bash loop?

In Linux bash, I would like to be able to decompress a large tar.gz (100G-1T, hundreds of similarly sized files), so that after each file has been decompressed, I can pass it through a bash loop for further processing. See the example below with --desired_flag:
tar xzf --desired_flag large.tar.gz \
| xargs -n1 -P8 -I % do_something_to_decompressed_file %
EDIT: the immediate use case I am thinking about is a network operation, where as soon as the contents of the files being decompressed are available, they can be uploaded somewhere on the next step. Given that the tar step could be either CPU-bound or IO-bound depending on the Linux instance, I would like to be able to efficiently pass the files to the next step, which I presume will be bound by network speed.
Given the following function definition:
buffer_lines() {
    local last_name file_name
    read -r last_name || return
    while read -r file_name; do
        printf '%s\n' "$last_name"
        last_name=$file_name
    done
    printf '%s\n' "$last_name"
}
...one can then run the following, whether one's tar implementation prints names at the beginning or the end of processing each file:
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 do_something_to_file
Note the v flag, telling tar to print filenames on stdout (in the GNU implementation, in this particular usage mode). Also note the lack of the -I argument.
If you want to insert a buffer (to allow tar to run ahead of the xargs process), consider pv:
tar xvzf large.tar.gz \
| pv -B 1M \
| buffer_lines \
| xargs -d $'\n' -n 1 -P8 do_something_to_file
...will buffer up to 1MB of unpacked names should the processing components run behind.
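Since the question mentions uploading each file as soon as it is available, here is a hypothetical do_something_to_file; the curl upload and the destination URL are assumptions for illustration, not part of the original post:
# Hypothetical handler: upload one extracted file, logging failures to stderr.
do_something_to_file() {
    local f=$1
    curl -fsS -T "$f" "https://uploads.example.invalid/$(basename "$f")" \
        || echo "upload failed: $f" >&2
}
export -f do_something_to_file
# xargs runs commands, not shell functions, so invoke the function via bash -c:
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 bash -c 'do_something_to_file "$1"' _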

download using rsync and extract using gunzip, and put all together into a pipe

I have "gz" files that I am downloading using "rsync". Then, as these files are compressed, I need to extract them using gunzip (I am open to any alternative to gunzip). I want to put all these commands together into a pipe, to have something like rsync file | gunzip
My original command is the following:
awk -F "\t" '$5~/^(reference genome|representative genome)$/ {sub("ftp", "rsync", $20); b=$20"/*genomic.fna.gz"; print b" viral/." }' assembly_summary_viral.txt | xargs -l1 rsync --copy-links --times --recursive --verbose --exclude="*rna*" --exclude="*cds*"
It looks a little bit complicated, but it downloads the files that I need, and there is no problem with it. I added | gunzip, but the extraction of the compressed files is not working; it only downloads them.
Any suggestion?
A pipe takes the stdout of the left command and sends it to the stdin of the right command. Here we have to take the stdout of rsync and pipe to the stdin of gunzip.
rsync doesn't really output much without the -v flag, so you'll have to add that. It will then print to stdout something like the following:
$ rsync -rv ./ ../viral
sending incremental file list
file1
file2
file3
test1_2/
test1_2/file1
test1_2/file2
sent 393 bytes received 123 bytes 1,032.00 bytes/sec
total size is 0 speedup is 0.00
We can pipe that to awk first to grab only the file path/name and prepend viral/ to the front of it so that it gunzips the files that you just rsync'd TO (instead of the ones FROM which you rsync'd):
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}'
Now rsync and awk spit out a list of the filenames being sent to the TO directory, and we need gunzip to process that list. Unfortunately, gunzip can't take a list of files: if you send something to its stdin, it assumes the stream is a gzipped stream and attempts to gunzip it.
Instead, we'll employ the xargs method you used above to take stdin and feed it to gunzip as the filename parameter it needs:
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}' | xargs -l1 gunzip
Most likely you will have to tweak this a bit to ensure you are gunzipping the right files (either your FROM location files or your TO location files). This gets trickier if you are rsyncing to a remote computer over SSH, obviously; I am not sure whether that can just be piped.
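An alternative sketch that sidesteps parsing rsync's output entirely: once the transfer has finished, decompress whatever .gz files landed in the destination. This swaps in find and assumes the ../viral destination used in the example above:
rsync -rv ./ ../viral
find ../viral -name '*.gz' -exec gunzip {} +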

Grep files in between wget recursive downloads

I am trying to recursively download several files using wget -m, and I intend to grep all of the downloaded files to find specific text. Currently, I can wait for wget to fully complete and then run grep. However, the wget process is time-consuming as there are many files, so I would instead like to show progress by grep-ing each file as it downloads and printing to stdout, all before the next file downloads.
Example:
download file1
grep file1 >> output.txt
download file2
grep file2 >> output.txt
...
Thanks for any advice on how this could be achieved.
As c4f4t0r pointed out
wget -m -O - <websites> | grep --color 'pattern'
Using grep's color option to highlight the matches can be helpful, especially when dealing with bulky output on the terminal.
EDIT:
Below is a command line you can use. It creates a file called file and saves wget's output messages to it; afterwards, it tails that message file.
awk finds any line containing "saved" and extracts the filename, and grep then searches that file for the pattern.
wget -m websites &> file & tail -f -n1 file|awk -F "\'|\`" '/saved/{system( ("grep --colour pattern ") $2)}'
Based on Xorg's solution I was able to achieve my desired effect with some minor adjustments:
wget -m -O file.txt http://google.com 2> /dev/null & sleep 1 && tail -f -n1 file.txt | grep pattern
This will print out all lines that contain pattern to stdout, and wget itself will produce no output visible from the terminal. The sleep is included because otherwise file.txt would not be created by the time the tail command executed.
As a note, this command will miss any results that wget downloads within the first second.
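If losing that first second matters, one sketch (assuming GNU tail) is to let tail keep retrying until the file appears and to read it from the first line, instead of sleeping:
wget -m -O file.txt http://google.com 2> /dev/null &
tail -F -n +1 file.txt 2> /dev/null | grep pattern
Here -F makes tail retry until file.txt exists, and -n +1 starts from the beginning of the file, so nothing wget writes early is missed.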

Remove trailing null characters produced by tar

I'm trying to tar up some files and pass them along to the user through PHP's passthru command.
The problem is that even though the tar file should only be about 2k, it is always 10240 bytes. Funny number, right?
So I have broken it down to:
-sh-4.1# tar czf - test | wc -c
10240
VS:
-sh-4.1# tar czf test.tar.gz test && wc -c test.tar.gz
2052 test.tar.gz
So tar is clearly padding out the file with NUL bytes.
How can I make tar stop doing that? Alternatively, how can I strip the trailing NULs?
I'm running on tar (GNU tar) 1.15.1 and cannot reproduce on my workstation which is tar (GNU tar) 1.23, and since this is an embedded project upgrading is not the answer I'm looking for (yet).
Edit: I am hoping for a workaround that does not need to write to the file system. Maybe a way to stop it from padding, or to pipe it through sed or something to strip out the padding.
You can attenuate the padding effect by using a smaller block size; try passing -b1 to tar.
You can minimise the padding by setting the block size to the minimum possible value - on my system this is 512.
$ cat test
a few bytes
$ tar -c test | wc -c
10240
$ tar -b 1 -c test | wc -c
2048
$ tar --record-size=512 -c test | wc -c
2048
$
This keeps the padding to at most 511 bytes. Short of piping through a program to remove the padding, rewrite the block header, and recreate the end-of-archive signature, I think this is the best you can do. At that point you might consider using a scripting language and its native tar implementation directly, e.g.:
PHP's PharData (http://php.net/manual/en/class.phardata.php)
Perl's Archive::Tar (https://perldoc.perl.org/Archive/Tar.html)
Python's tarfile (https://docs.python.org/2/library/tarfile.html)
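Another workaround worth sketching (not tested on tar 1.15.1, so treat this as an assumption about that version): have tar write a plain archive to stdout and compress with an external gzip. The tar format still pads the uncompressed archive with NULs up to the record size, but gzip squeezes those trailing zeros down to a few bytes, so no padding is appended after compression:
tar -cf - test | gzip | wc -c    # padding is compressed away rather than appended afterwards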

How to grep download speed from wget output?

I need to download several files with wget and measure download speed.
e.g. I download with
wget -O /dev/null http://ftp.bit.nl/pub/OpenBSD/4.7/i386/floppy47.fs http://ftp.bit.nl/pub/OpenBSD/4.7/i386/floppyB47.fs
and the output is
--2010-10-11 18:56:00-- http://ftp.bit.nl/pub/OpenBSD/4.7/i386/floppy47.fs
Resolving ftp.bit.nl... 213.136.12.213, 2001:7b8:3:37:20e:cff:fe4d:69ac
Connecting to ftp.bit.nl|213.136.12.213|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1474560 (1.4M) [text/plain]
Saving to: `/dev/null'
100%[==============================================================>] 1,474,560 481K/s in 3.0s
2010-10-11 18:56:03 (481 KB/s) - `/dev/null' saved [1474560/1474560]
--2010-10-11 18:56:03-- http://ftp.bit.nl/pub/OpenBSD/4.7/i386/floppyB47.fs
Reusing existing connection to ftp.bit.nl:80.
HTTP request sent, awaiting response... 200 OK
Length: 1474560 (1.4M) [text/plain]
Saving to: `/dev/null'
100%[==============================================================>] 1,474,560 499K/s in 2.9s
2010-10-11 18:56:06 (499 KB/s) - `/dev/null' saved [1474560/1474560]
FINISHED --2010-10-11 18:56:06--
Downloaded: 2 files, 2.8M in 5.9s (490 KB/s)
I need to grep the total download speed, that is, the string 490 KB/s.
How do I do this?
P.S. We may need to account for the case where only one file is actually downloaded, so there won't be a final line starting with FINISHED.
Update, a grep-style version using sed:
wget ... 2>&1 | sed -n '$,$s/.*(\(.*\)).*/\1/p'
Old version:
I thought it would be easier to divide the file size by the download time after the download. ;-)
(/usr/bin/time -p wget ... 2>&1 >/dev/null; ls -l newfile) | \
awk '
NR==1 {t=$2};
NR==4 {printf("rate=%f bytes/second\n", $5/t)}
'
The first awk line stores the elapsed real time from the "real xx.xx" line in variable t. The second awk line divides the file size (column 5 of ls -l) by the time and outputs this as the rate.
This worked for me, using your wget -O /dev/null <resource>
The regex I used was \([0-9.]\+ [KM]B/s\)
But note I had to redirect stderr onto stdout so the command was:
wget -O /dev/null http://example.com/index.html 2>&1 | grep '\([0-9.]\+ [KM]B/s\)'
This allows things like 923 KB/s and 1.4 MB/s
grep just finds matches. To get the value(s) you can use sed instead:
wget -O /dev/null http://example.com/index.html 2>&1 |
sed -e 's|^.*(\([0-9.]\+ [KM]B/s\)).*$|\1|'
This works when only 1 file is being downloaded.
I started using sed to get the speed from wget, but I found it irritating so I switched to grep.
This is my command:
wget ... 2>&1 | grep -o "[0-9.]\+ [KM]*B/s"
The -o option means it only returns the matching part. The pattern matches one or more digits or dots, then a space, then optionally K or M before the B/s.
That will return 423 KB/s (for example).
To grep for just the units, use grep -o "[KM]*B/s", and for just the number use grep -o "[0-9.]\+".
For example, to get the speed in Mbit per second, add --report-speed=bits for wget and make a small change to the grep pattern:
wget -O /dev/null --report-speed=bits http://www.ovh.net/files/10Mb.dat 2>&1 | grep -o "[0-9.,]\+ [KM]*[Bb]/s"
output:
1,51 Mb/s
Why can't you just do this:
perl -ne '/^Downloaded.*?\((.*?)\)/ and print $1'
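Since wget writes its progress to stderr, the one-liner only sees anything if stderr is redirected into the pipe, for example (a sketch with placeholder URLs):
wget -O /dev/null http://example.com/file1 http://example.com/file2 2>&1 |
    perl -ne '/^Downloaded.*?\((.*?)\)/ and print "$1\n"'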
Here's a suggestion: you can make use of wget's --limit-rate=amount option. For example, --limit-rate=400k will limit the retrieval rate to 400KB/s. Then it's easier for you to calculate the total speed, saving you time and the mental anguish of trying to regex it.
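For instance, a sketch of that approach (the URL is just a placeholder):
wget --limit-rate=400k -O /dev/null http://example.com/bigfile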
