How to pass each file to a bash loop as soon as tar xzf has finished decompressing it? - linux

In Linux bash, I would like to be able to decompress a large tar.gz (100G-1T, hundreds of similarly sized files), so that after each file has been decompressed, I can pass it to a bash loop for further processing. See the example below with the hypothetical --desired_flag:
tar xzf --desired_flag large.tar.gz \
| xargs -n1 -P8 -I % do_something_to_decompressed_file %
EDIT: the immediate use case I have in mind is a network operation: as soon as the contents of each decompressed file are available, they can be uploaded somewhere in the next step. Given that the tar step could be either CPU-bound or IO-bound depending on the Linux instance, I would like to pass the files to the next step efficiently, which I presume will be bound by network speed.

Given the following function definition:
buffer_lines() {
  local last_name file_name
  read -r last_name || return
  while read -r file_name; do
    printf '%s\n' "$last_name"
    last_name=$file_name
  done
  printf '%s\n' "$last_name"
}
...one can then run the following, whether one's tar implementation prints names at the beginning or end of its processing (buffer_lines delays each name by one line, so a filename is only passed on once tar has moved on to the next entry):
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 do_something_to_file
Note the v flag, telling tar to print filenames on stdout (in the GNU implementation, in this particular usage mode). Also note the lack of the -I argument.
If you want to insert a buffer (to allow tar to run ahead of the xargs process), consider pv:
tar xvzf large.tar.gz \
| pv -B 1M \
| buffer_lines \
| xargs -d $'\n' -n 1 -P8 do_something_to_file
...will buffer up to 1MB of unpacked names should the processing components run behind.
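For the upload use case from the question's edit, do_something_to_file could be a small standalone script along these lines (a sketch; the curl upload and the https://example.com/upload endpoint are assumptions, not part of the original answer):

#!/usr/bin/env bash
# do_something_to_file -- hypothetical per-file handler invoked by xargs
set -euo pipefail
file=$1
# upload the extracted file to a placeholder endpoint, then free local disk
curl -fsS -T "$file" "https://example.com/upload/$(basename "$file")"
rm -- "$file"

Since xargs runs external commands rather than shell functions, this needs to be an executable somewhere on the PATH (or a function exported with export -f and invoked via bash -c).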

Related

How to pipe the output of `ls` to `mplayer`

I want to run mplayer on all files in a folder, sorted by size.
I tried the following commands
ls -1S folder | mplayer
ls -1S folder | xargs mplayer
ls -1S folder | xargs -print0 mplayer
but none of these are working.
How to do it right?
Don’t parse the output of ls.
Instead, use e.g. for to loop over the files and call stat to get the file sizes. To avoid issues with spaces or newlines in filenames, use zero-terminated strings to sort etc.:
for file in folder/*; do
  printf "%s %s\0" "$(stat -c %s "$file")" "$file"
done \
  | sort -z -n -k1 -t ' ' \
  | cut -z -f2- -d ' ' \
  | xargs -0 mplayer
To call mplayer individually for each file (rather than only once, passing all files as arguments), you'll need to use a while loop and pipe the above into it. Unfortunately, | doesn't work with while here (at least I don't know how), so you need to use process substitution instead:
while IFS= read -r -d '' file; do
  mplayer "$file"
done < <(
  for file in folder/*; do
    printf "%s %s\0" "$(stat -c %s "$file")" "$file"
  done \
    | sort -z -n -k1 -t ' ' \
    | cut -z -f2- -d ' '
)
Note that the above is Bash code and uses GNU extensions; it works on Linux but won't work without changes on e.g. macOS (BSD cut has no -z flag, and stat -c %s needs to be changed to stat -f %z).
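If you are in GNU userland anyway, a variant that avoids one stat call per file is to let find print the size itself (still GNU-specific, so the same portability caveat applies; folder is the directory from the question):

# -printf '%s\t%p\0' emits "<size><TAB><path>", NUL-terminated
find folder -maxdepth 1 -type f -printf '%s\t%p\0' \
  | sort -z -n -k1,1 \
  | cut -z -f2- \
  | xargs -0 mplayer

cut's default delimiter is a tab, so no -d is needed here.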
I created a Python script to build an executable file that does just what I want. Here is the complete Python code:
import os
import shlex
import sys
import glob

dir_name = sys.argv[1]

# Get a list of files (file paths) in the given directory
list_of_files = filter(os.path.isfile, glob.glob(dir_name + '/*'))

# Sort list of files in directory by size
list_of_files = sorted(list_of_files, key=lambda x: os.stat(x).st_size)

# Iterate over the sorted list of files, biggest first, and print
# an mplayer command for each one along with its size
for elem in list_of_files[::-1]:
    file_size = os.stat(elem).st_size
    print(f"mplayer {shlex.quote(elem)}  # {file_size} bytes")
Redirect the output to a file and execute it, and voilà: mplayer plays the files in order from biggest to smallest.

How to redirect xz's normal stdout when do tar | xz?

I need to use a compressor like xz to compress huge tar archives.
I am fully aware of previous questions like
Create a tar.xz in one command
and
Utilizing multi core for tar+gzip/bzip compression/decompression
From them, I have found that this command line mostly works:
tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz
I use the pipe solution because I absolutely must be able to pass options to xz. In particular, xz is very CPU intensive, so I must use -T0 to use all available cores. This is why I am not using other possibilities, like tar's --use-compress-program, or -J options.
Unfortunately, I really want to capture all of tar and xz's log output (i.e. non-archive output) into a log file. In the example above, log output is always generated by those -v options.
With the command line above, that log output is now printed on my terminal.
So, the problem is that when you use pipes to connect tar and xz as above, you cannot end the command line with something like
>Log_File 2>&1
because of that earlier
> OUTPUT_FILE.tar.xz
Is there a solution?
I tried wrapping in a subshell like this
(tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz) >Log_File 2>&1
but that did not work.
The normal stdout of tar is the tarball, and the normal stdout of xz is the compressed file. Neither of those is a log you should want to capture: for both processes, all logging other than the output data itself is written exclusively to stderr.
Consequently, you need only redirect stderr, and must not redirect stdout unless you want your output file mixed up with your logging.
{ tar -cvf - paths_to_archive | xz -1 -T0 -v > OUTPUT_FILE.tar.xz; } 2>Log_File
By the way -- if you're curious about why xz -v prints more content when its output goes to the TTY, the answer is in message.c in the xz sources: the progress_automatic flag (which tells xz to set a timer that triggers a SIGALRM every second, which xz treats as an indication that status should be printed) is only set when isatty(STDERR_FILENO) is true. Thus, after stderr has been redirected to a file, xz no longer prints this output at all; the problem is not that it isn't correctly redirected, but that it no longer exists.
You can, however, send SIGALRM to xz every second from your own code, if you're really so inclined:
{
  xz -1 -T0 -v > OUTPUT_FILE.tar.xz < <(tar -cvf - paths_to_archive) & xz_pid=$!
  while sleep 1; do
    kill -ALRM "$xz_pid" || break
  done
  wait "$xz_pid"
} 2>Log_File
(Code that avoids rounding the time needed for xz to finish up to the nearest second is possible, but is left as an exercise to the reader.)
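One way to sketch that exercise (my own variant, not from the original answer; ticker_pid is a name introduced here) is to run the SIGALRM ticker in the background and have the script wait on xz directly:

{
  xz -1 -T0 -v > OUTPUT_FILE.tar.xz < <(tar -cvf - paths_to_archive) & xz_pid=$!
  # background ticker: keeps signalling xz until it goes away
  ( while sleep 1; do kill -ALRM "$xz_pid" 2>/dev/null || break; done ) & ticker_pid=$!
  wait "$xz_pid"              # returns as soon as xz exits, with no extra whole second
  kill "$ticker_pid" 2>/dev/null
} 2>Log_File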
First, -cvf - can be replaced by cv.
But the normal stdout output of tar cvf - is the tar file, which is piped into xz. I'm not sure I completely understand, but maybe one of these:
tar cv paths | xz -1 -T0 > OUTPUT.tar.xz 2> LOG.stderr
or
tar cv paths 2> LOG.stderr | xz -1 -T0 > OUTPUT.tar.xz
or
tar cv paths 2> LOG.tar.stderr | xz -1 -T0 > OUTPUT.tar.xz 2> LOG.xz.stderr
I am not sure whether -T0 is implemented yet; which version of xz do you use? (Maybe https://github.com/vasi/pixz is worth a closer look.) The pv program, installed with sudo apt-get install pv on some systems, is better at showing progress for pipes than xz -v: it shows the progress as a percentage with an ETA:
size=$(du -bc paths | tail -1 | awk '{print $1}')
tar c paths 2> LOG.stderr | pv -s "$size" | xz -1 -T0 > OUTPUT.tar.xz
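If you also want to keep pv's progress display on the terminal while still capturing the tar and xz diagnostics, one possible combination (my own sketch, building on the per-process redirections above) is:

size=$(du -bc paths | tail -1 | awk '{print $1}')
# pv writes its progress to stderr, which here stays attached to the terminal,
# while tar's and xz's stderr go to separate log files
tar cv paths 2> LOG.tar.stderr | pv -s "$size" | xz -1 -T0 2> LOG.xz.stderr > OUTPUT.tar.xz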

Linux: Reverse Sort files in directory and get second file

I am trying to get the second file when the files are sorted in reverse (descending) order, and copy it to my local directory using scp.
Here's what I got:
scp -r uname@host:./backups/dir1/$(ls -r | head -2 | tail -1) /tmp/data_sync/dir1/
I still seem to copy all the files when I run this script. What am I missing? TIA.
The $(...) is being interpreted locally. If you want the commands to run on the remote, you'll need to use ssh and have the remote side use scp to copy files to your local system.
Since parsing ls's output has a number of problems, I'll use find to accomplish the same thing as ls, telling it to use NUL between each filename rather than newline. sort sorts that list of filenames, and sed -n 2p prints the second element of the sorted list of filenames. xargs runs the scp command, inserting the filename as the first argument.
ssh uname@host "find ./backups/dir1/ -mindepth 1 -maxdepth 1 -name '[^.]*' -print0 | \
    sort -r -z | sed -z -n 2p | \
    xargs -0 -I {} scp {} yourlocalhost:/tmp/data_sync/dir1/"
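If the remote machine cannot reach your workstation by name (which the scp back to yourlocalhost requires), a variation is to run only the filename selection remotely and do the copy from the local side. This sketch is newline-delimited, so it assumes the backup filenames contain no newlines:

# pick the second filename (reverse-sorted) on the remote, then pull it locally
second=$(ssh uname@host "find ./backups/dir1/ -mindepth 1 -maxdepth 1 -name '[^.]*' -printf '%f\n' | sort -r | sed -n 2p")
scp "uname@host:./backups/dir1/$second" /tmp/data_sync/dir1/   # add extra quoting if the name contains spaces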
If I got your question right, your command is OK with just one correction:
you ran scp -r, which recursively copies your files, so the whole directory is transferred instead of just the second file from the reverse-sorted listing.
Try without -r:
scp uname@host:./backups/dir1/$(ls -r | head -2 | tail -1) /tmp/data_sync/dir1/
The basic syntax for scp is:
scp username@source:/location/to/file username@destination:/where/to/put
Don't forget that -r recursively copies entire directories. Also note that scp follows symbolic links encountered in the tree traversal.

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files into disk, merging them, and compressing again, as I don't have sufficient disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
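As a quick illustration of what the shell actually hands to the command (the exact descriptor number varies by system):

$ echo <(true)
/dev/fd/63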
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
#!/bin/bash
FILES=file*.gz         # list of your 40 gzip files (e.g. file1.gz ... file40.gz)
WORK1="merged.gz"      # first temp file and the final file
WORK2="tempfile.gz"    # second temp file

> "$WORK1"             # create empty final file
> "$WORK2"             # create empty temp file
gzip -qc "$WORK2" > "$WORK1"   # compress the (empty) second file into the first

for I in $FILES; do
  echo current file: "$I"
  sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
  mv "$WORK2" "$WORK1"
done
The easiest way to fill $FILES is with bash globbing (file*.gz) or with an explicit list of the 40 filenames (separated by spaces). Your files in $FILES stay unchanged.
When the loop finishes, the 80 GB of data is compressed in $WORK1. While this script runs, no uncompressed data is written to disk.
Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them, and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
It is true that there are zgrep and other common utilities that work with compressed files, but in this case you need to sort/merge uncompressed data and then compress the result.

Remove trailing null characters produced by tar

I'm trying to tar up some files and pass them along to the user through the PHP passthru command.
The problem is that even though the tar file should only be about 2k, it is always 10240 bytes. Funny number, right?
So I have broken it down to:
-sh-4.1# tar czf - test | wc -c
10240
VS:
-sh-4.1# tar czf test.tar.gz test && wc -c test.tar.gz
2052 test.tar.gz
So tar is clearly padding out the file with NULs.
So how can I make tar stop doing that? Alternatively, how can I strip the trailing NULs?
I'm running on tar (GNU tar) 1.15.1 and cannot reproduce on my workstation which is tar (GNU tar) 1.23, and since this is an embedded project upgrading is not the answer I'm looking for (yet).
Edit: I am hoping for a workaround that does not need to write to the file system. Maybe a way to stop it from padding, or to pipe it through sed or something to strip out the padding.
You can attenuate the padding effect by using a smaller block size; try passing -b1 to tar.
You can minimise the padding by setting the block size to the minimum possible value - on my system this is 512.
$ cat test
a few bytes
$ tar -c test | wc -c
10240
$ tar -b 1 -c test | wc -c
2048
$ tar --record-size=512 -c test | wc -c
2048
$
This keeps the padding to at most 511 bytes. Short of piping through a program to remove the padding, rewrite the block header, and recreate the end-of-archive signature, I think this is the best you can do. At that point you might consider using a scripting language and its native tar implementation directly, e.g.:
PHP's PharData (http://php.net/manual/en/class.phardata.php)
Perl's Archive::Tar (https://perldoc.perl.org/Archive/Tar.html)
Python's tarfile (https://docs.python.org/2/library/tarfile.html)
