Download using rsync and extract using gunzip, putting it all together in a pipe (Linux)

I have "gz" files that I am downloading using "rsync". Then, as these files are compressed, I need to extract them using gunzip (I am open to any other alternative for gunzip). I want to put all these commands together into a pipe to have something like that rsync file | gunzip
My original command is the following:
awk -F "\t" '$5~/^(reference genome|representative genome)$/ {sub("ftp", "rsync", $20); b=$20"/*genomic.fna.gz"; print b" viral/." }' assembly_summary_viral.txt | xargs -l1 rsync --copy-links --times --recursive --verbose --exclude="*rna*" --exclude="*cds*"
It looks a little complicated, but it downloads the files I need and works fine. I added | gunzip to the end, but the compressed files are not being extracted; they are only downloaded.
Any suggestion?

A pipe takes the stdout of the left command and sends it to the stdin of the right command. Here we have to take the stdout of rsync and pipe it to the stdin of gunzip.
rsync doesn't print much without the -v flag, so you'll have to add it. It will then write something like the following to stdout:
>rsync -rv ./ ../viral
sending incremental file list
file1
file2
file3
test1_2/
test1_2/file1
test1_2/file2
sent 393 bytes received 123 bytes 1,032.00 bytes/sec
total size is 0 speedup is 0.00
We can pipe that to awk first to grab only the file path/name and prepend ../viral/ to the front of it, so that we gunzip the files that you just rsync'd TO (instead of the ones FROM which you rsync'd):
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}'
Now we have rsync and awk spitting out a list of filenames that are being sent to the TO directory. Next we need to get gunzip to process that list. Unfortunately, gunzip can't take a list of files: if you send something to its stdin, it assumes the stream is a gzipped stream and attempts to gunzip it.
Instead, we'll employ the xargs method you have above to take the stdin and feed it to gunzip as the parameter (filename) that it needs:
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}' | xargs -l1 gunzip
Most likely you will have to tweak this a bit to ensure you are gunzipping the right files (either your FROM location files or your TO location files). This gets trickier if you are rsyncing to a remote computer over SSH, obviously. Not sure if that can just be piped.
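Since in the question's command every file ends up under viral/ anyway, a pragmatic alternative to building the pipe is simply to decompress whatever arrived once the xargs rsync step has finished. This is an untested sketch, not the rsync -v | awk | xargs gunzip pipeline above, and it assumes the downloads keep their *.gz names somewhere under viral/:
# run after the awk | xargs rsync command from the question completes
find viral -name '*genomic.fna.gz' -exec gunzip {} +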

Related

How to use zcat without the warning when using a pipe

I'm trying to silence the zcat warning with the -q option or with 2>/dev/null; so far nothing is working. I keep getting the same warning when a file name is missing.
I'm looping through hundreds of compressed files to extract specific data. The idea is that if zcat encounters a bad or missing file name, it should just stay quiet and wait for the next cycle, but currently this is what I get with both options:
zcat -q $ram | head -n1
zcat $ram | head -n1 2>/dev/null
gzip: compressed data not read from a terminal. Use -f to force decompression.
For help, type: gzip -h
Any idea how to solve this, or a faster way to read a .gz file that can actually be silenced?
Thanks
At present, you're redirecting only stderr from head; you're not redirecting from zcat at all. If you want to redirect stderr from zcat, then you need to put the redirection before the pipe symbol, like so:
zcat $ram 2>/dev/null | head -n1
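Inside the loop from the question, that would look roughly like the sketch below (the glob and the existence check are my assumptions about how $ram is populated; they are not from the answer):
for ram in /path/to/archives/*.gz; do    # hypothetical location of the compressed files
    [ -e "$ram" ] || continue            # skip silently if the glob matched nothing
    zcat "$ram" 2>/dev/null | head -n1   # zcat's stderr is now discarded; head's is untouched
done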

How to pass each file that has completed tar xzf decompression to a bash loop?

In Linux bash, I would like to decompress a large tar.gz (100G-1T, containing hundreds of similarly sized files) so that, after each file has finished decompressing, I can pass it to a bash loop for further processing. See the example below with --desired_flag:
tar xzf --desired_flag large.tar.gz \
| xargs -n1 -P8 -I % do_something_to_decompressed_file %
EDIT: the immediate use case I am thinking about is a network operation, where as soon as the contents of a decompressed file are available they can be uploaded somewhere in the next step. Given that the tar step could be either CPU-bound or IO-bound depending on the Linux instance, I would like to pass the files efficiently to the next step, which I presume will be bound by network speed.
Given the following function definition:
buffer_lines() {
    local last_name file_name
    read -r last_name || return
    while read -r file_name; do
        printf '%s\n' "$last_name"
        last_name=$file_name
    done
    printf '%s\n' "$last_name"
}
...one can then run the following, whether one's tar implementation prints each name at the beginning or at the end of processing that file:
tar xvzf large.tar.gz | buffer_lines | xargs -d $'\n' -n 1 -P8 do_something_to_file
Note the v flag, telling tar to print filenames on stdout (in the GNU implementation, in this particular usage mode). Also note the lack of the -I argument.
If you want to insert a buffer (to allow tar to run ahead of the xargs process), consider pv:
tar xvzf large.tar.gz \
| pv -B 1M \
| buffer_lines \
| xargs -d $'\n' -n 1 -P8 do_something_to_file
...will buffer up to 1MB of unpacked names should the processing components run behind.
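To see what buffer_lines does on its own, feed it a fixed list of hypothetical names: each name is only released once the following one has been read, so the file tar is still working on is held back until tar has moved on to the next one.
printf '%s\n' one.txt two.txt three.txt | buffer_lines
# one.txt is printed once two.txt has been read, two.txt once three.txt has been read,
# and three.txt when the input ends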

Split a .gz file into multiple 1GB compressed(.gz) files

I have a 250GB gzipped file on Linux and I want to split it into 250 1GB files and compress the generated part files on the fly (as soon as a part file is generated, it should be compressed).
I tried using this -
zcat file.gz | split -b 1G - file.gz.part
But this is generating uncompressed files, and rightly so. I modified it to look like this, but got an error:
zcat file.gz | split -b 1G - file.gz.part | gzip
gzip: compressed data not written to a terminal. Use -f to force compression.
For help, type: gzip -h
I also tried this; it did not throw any error, but it did not compress the part files as soon as they were generated. I assume it would compress each file once the whole split is done (or it might pack all part files into a single gz file once the split completes, I am not sure).
zcat file.gz | split -b 1G - file.gz.part && gzip
I read here that there is a --filter option, but my version of split is (GNU coreutils) 8.4, so --filter is not supported.
$ split --version
split (GNU coreutils) 8.4
Please advise a suitable way to achieve this, preferably a one-liner if possible; a shell (bash/ksh) script will also work.
Newer versions of GNU split support filter commands. Use this:
zcat file.gz | split - -b 1G --filter='gzip > $FILE.gz' file.part.
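With --filter, split feeds each chunk to the command's stdin and exports the chunk's output name as $FILE, which is why the single quotes matter: they keep your shell from expanding $FILE before split runs the filter. With the default two-letter suffixes the outputs would be named file.part.aa.gz, file.part.ab.gz, and so on; a quick spot-check might look like this (the part name here is hypothetical):
zcat file.part.aa.gz | head -n 3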
It's definitely suboptimal, but I tried to write it in bash just for fun (I haven't actually tested it, so there may be some minor mistakes):
GB_IN_BLOCKS=`expr 2048 \* 1024`        # 1 GiB expressed in 512-byte dd blocks
GB=`expr $GB_IN_BLOCKS \* 512`          # 1 GiB in bytes
COMPLETE_SIZE=`zcat asdf.gz | wc -c`    # uncompressed size in bytes
PARTS=`expr $COMPLETE_SIZE \/ $GB`      # number of 1 GiB parts (rounded down)
for i in `seq 0 $PARTS`
do
    # note: the whole archive is decompressed again on every iteration
    zcat asdf.gz | dd skip=`expr $i \* $GB_IN_BLOCKS` count=$GB_IN_BLOCKS | gzip > asdf.gz.part$i
done

Grep files in between wget recursive downloads

I am trying to recursively download several files using wget -m, and I intend to grep all of the downloaded files to find specific text. Currently, I can wait for wget to fully complete and then run grep. However, the wget process is time-consuming as there are many files, so instead I would like to show progress by grep-ing each file as it downloads and printing to stdout, before the next file downloads.
Example:
download file1
grep file1 >> output.txt
download file2
grep file2 >> output.txt
...
Thanks for any advice on how this could be achieved.
As c4f4t0r pointed out,
wget -m -O - <websites> | grep --color 'pattern'
using grep's color option to highlight the pattern can be helpful, especially when dealing with bulky output in the terminal.
EDIT:
Below is a command line you can use. It creates a file called file and saves the output messages from wget; afterwards it tails the message file.
awk finds any line containing "saved" and extracts the filename, and grep then searches that file for the pattern.
wget -m websites &> file & tail -f -n1 file|awk -F "\'|\`" '/saved/{system( ("grep --colour pattern ") $2)}'
Based on Xorg's solution I was able to achieve my desired effect with some minor adjustments:
wget -m -O file.txt http://google.com 2> /dev/null & sleep 1 && tail -f -n1 file.txt | grep pattern
This prints all lines that contain pattern to stdout, and wget itself produces no output visible in the terminal. The sleep is included because otherwise file.txt would not exist yet when the tail command executes.
As a note, this command will miss any results that wget downloads within the first second.
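A small variation (my own assumption, not part of the answer above) avoids both the sleep and the one-second blind spot by creating the output file before wget starts and reading it from the first line:
touch file.txt                                              # make sure the file exists before tail starts
tail -f -n +1 file.txt | grep --line-buffered pattern &     # follow the downloaded content from the beginning
wget -m -O file.txt http://google.com 2> /dev/null
# tail keeps following after wget finishes, just as in the original command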

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files into disk, merging them, and compressing again, as I don't have sufficient disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
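A quick way to see this substitution directly (the exact descriptor number varies by system):
$ echo <(true)
/dev/fd/63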
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
#!/bin/bash
FILES=file*.gz        # list of your 40 gzip files (e.g. file1.gz ... file40.gz)
WORK1="merged.gz"     # first temp file and the final file
WORK2="tempfile.gz"   # second temp file

> "$WORK1"            # create empty final file
> "$WORK2"            # create empty temp file
gzip -qc "$WORK2" > "$WORK1"   # compress content of the empty second file into the first temp file

for I in $FILES; do
    echo current file: "$I"
    sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
    mv "$WORK2" "$WORK1"
done
The easiest way to fill $FILES is with a bash glob (file*.gz) or with a list of 40 filenames separated by blanks. The files in $FILES stay unchanged.
At the end, the full 80 GB of data are compressed in $WORK1. While this script runs, no uncompressed data are written to disk.
Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them, and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
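The middle stages only build a sort -m command line as text and hand it to bash. For two hypothetical inputs a.lz4 and b.lz4 under $OUT/uniques (with $OUT already expanded by find), the generated line would look roughly like:
sort -m -k3b -k2 <( <$OUT/uniques/a.lz4 lz4cat ) <( <$OUT/uniques/b.lz4 lz4cat )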
It is true there are zgrep and other common utilities that play with compressed files, but in this case you need to sort/merge uncompressed data and compress the result.
