Split a .gz file into multiple 1GB compressed (.gz) files - Linux

I have a 250GB gzipped file on Linux and I want to split it into 250 1GB files, compressing the generated part files on the fly (as soon as one part is generated, it should be compressed).
I tried using this -
zcat file.gz | split -b 1G - file.gz.part
But this generates uncompressed files, and rightly so. I modified it to look like this, but got an error:
zcat file.gz | split -b 1G - file.gz.part | gzip
gzip: compressed data not written to a terminal. Use -f to force compression.
For help, type: gzip -h
I also tried this; it did not throw any error, but it did not compress the part files as soon as they were generated. I assume it will compress each file once the whole split is done (or it may pack all part files into a single gz file after the split completes, I am not sure).
zcat file.gz | split -b 1G - file.gz.part && gzip
I read here that there is a filter option, but my version of split is (GNU coreutils) 8.4, so --filter is not supported.
$ split --version
split (GNU coreutils) 8.4
Please advise a suitable way to achieve this, preferably as a one-liner (if possible); a shell (bash/ksh) script will also work.

split supports filter commands (the --filter option needs a newer coreutils than your 8.4). Use this:
zcat file.gz | split -b 1G --filter='gzip > $FILE.gz' - file.part.
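For reference, a quick sketch of what that produces and how you might sanity-check it afterwards (the .aa/.ab suffixes are just split's default naming; the cksum comparison is only a verification idea and re-reads all the data):
# after the split finishes you should see parts like
#   file.part.aa.gz  file.part.ab.gz  file.part.ac.gz  ...
# concatenated gzip members decompress back to the original stream,
# so the two checksums below should match:
cat file.part.*.gz | zcat | cksum
zcat file.gz | cksum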

It's definitely suboptimal, but I tried to write it in bash just for fun (I haven't actually tested it, so there may be some minor mistakes):
# 1 GB expressed as 512-byte dd blocks, and as bytes
GB_IN_BLOCKS=`expr 2048 \* 1024`
GB=`expr $GB_IN_BLOCKS \* 512`

# uncompressed size of the whole file, and the number of 1 GB parts it needs
COMPLETE_SIZE=`zcat asdf.gz | wc -c`
PARTS=`expr $COMPLETE_SIZE \/ $GB`

for i in `seq 0 $PARTS`
do
  # re-read the stream, skip the blocks already handled, take 1 GB, recompress
  zcat asdf.gz | dd skip=`expr $i \* $GB_IN_BLOCKS` count=$GB_IN_BLOCKS | gzip > asdf.gz.part$i
done

Related

Get size of image in bash

I want to get the size of an image. The image is in a folder and is named encodedImage.jpc
a="$(ls -s encodedImage.jpc | cut -d " " -f 1)"
temp="$(( $a*1024 * 8))"
echo "$a"
The output is not correct. How do I get the size? Thank you
Rather than parsing ls output, the proper way is to use the stat command, like this:
stat -c '%s' file
Check
man stat | less +/'total size, in bytes'
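For completeness, a small sketch of using that in the asker's script (GNU stat syntax; on BSD/macOS the equivalent would be stat -f '%z' encodedImage.jpc, and the *8 is only there in case the original *1024*8 was meant to convert to bits):
bytes="$(stat -c '%s' encodedImage.jpc)"   # apparent file size in bytes
bits=$(( bytes * 8 ))                      # size in bits, if that is what was wanted
echo "$bytes bytes ($bits bits)"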
If by size you mean bytes (or human-readable "pretty" bytes), you can just use
ls -lh
-h When used with the -l option, use unit suffixes: Byte, Kilobyte, Megabyte, Gigabyte, Terabyte and Petabyte in order to reduce the number of digits to three or less using base 2 for sizes.
I guess the more complete answer, if you're just trying to tear off the file size alone, is this (I added the file name as well; you can remove ,$9 to drop it):
ls -lh | awk '{print $5,$9}'
You can use this command:
du -sh your_file
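Note that du reports disk usage (allocated blocks), which can differ from the file's byte size for sparse or very small files. A quick comparison, assuming GNU du:
stat -c '%s' your_file   # apparent size in bytes
du -b your_file          # GNU du: apparent size in bytes (--apparent-size --block-size=1)
du -sh your_file         # disk usage, human-readable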

Ubuntu terminal - using gnu parallel to read lines in all files in folder

I am trying to count the lines in all the files in a very large folder under Ubuntu.
The files are .gz files and I use
zcat * | wc -l
to count all the lines in all the files, and it's slow!
I want to use multi-core computing for this task, and found this about GNU Parallel.
I tried to use this bash command:
parallel zcat * | parallel --pipe wc -l
and the cores are not all working.
I found that starting the jobs might cause major overhead, so I tried batching with
parallel -X zcat * | parallel --pipe -X wc -l
but saw no improvement.
How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows (there is no need to keep them uncompressed afterwards)?
Thanks!
If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:
find . -maxdepth 1 -name \*gz -print0 | parallel -0 ...
If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:
find ... | parallel -0 'echo {} $(zcat {} | wc -l)'
Next, we come to efficiency and it will depend on what your disks are capable of. Maybe try with parallel -j2 then parallel -j4 and see what works on your system.
As Ole helpfully points out in the comments, you can avoid having to echo the filename yourself by using GNU Parallel's --tag option, which tags each output line with its argument, so this is even more efficient:
find ... | parallel -0 --tag 'zcat {} | wc -l'
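If what you ultimately want is the single grand total that zcat * | wc -l prints, one sketch (dropping --tag so the output is just numbers) is to let awk sum the per-file counts:
find . -maxdepth 1 -name \*gz -print0 |
  parallel -0 'zcat {} | wc -l' |
  awk '{ total += $1 } END { print total }'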
Basically the command you are looking for is:
ls *gz | parallel 'zcat {} | wc -l'
What it does is:
ls *gz lists all the gz files on stdout
Pipe it to parallel
Spawn subshells with parallel
Run in said subshells the command inside quotes 'zcat {} | wc -l'
About the '{}', according to the manual:
This replacement string will be replaced by a full line read from the input source
So each line piped to parallel gets fed to zcat.
Of course this is basic; I assume it could be tuned, and the documentation and examples might help.
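As a sketch of that tuning (the -j values are just guesses to try, echoing the -j2/-j4 suggestion above; GNU parallel's ::: is simply an alternative to piping ls into it):
time parallel -j2 'zcat {} | wc -l' ::: *.gz > /dev/null
time parallel -j4 'zcat {} | wc -l' ::: *.gz > /dev/null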

download using rsync and extract using gunzip, and put all together into a pipe

I have "gz" files that I am downloading using "rsync". Then, as these files are compressed, I need to extract them using gunzip (I am open to any other alternative for gunzip). I want to put all these commands together into a pipe to have something like that rsync file | gunzip
My original command is the following:
awk -F "\t" '$5~/^(reference genome|representative genome)$/ {sub("ftp", "rsync", $20); b=$20"/*genomic.fna.gz"; print b" viral/." }' assembly_summary_viral.txt | xargs -l1 rsync --copy-links --times --recursive --verbose --exclude="*rna*" --exclude="*cds*"
It looks a little bit complicated, but it's downloading the files that I need and there is no problem with it. I added | gunzip to the end; however, the extraction of the compressed files is not working, and it only downloads them.
Any suggestion?
A pipe takes the stdout of the left command and sends it to the stdin of the right command. Here we have to take the stdout of rsync and pipe to the stdin of gunzip.
rsync doesn't really output much without the -v flag so you'll have to add that. It will now spit out to stdout something like the following:
>rsync -rv ./ ../viral
sending incremental file list
file1
file2
file3
test1_2/
test1_2/file1
test1_2/file2
sent 393 bytes received 123 bytes 1,032.00 bytes/sec
total size is 0 speedup is 0.00
We can pipe that to awk first to grab only the file path/name and prepend viral/ to the front of it so that it gunzips the files that you just rsync'd TO (instead of the ones FROM which you rsync'd):
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}'
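Applied to the sample listing above, that awk stage keeps only the transferred names and prefixes them, so the next stage receives something like this (note the bare directory entry slips through too; gunzip will just report it as a directory and skip it):
../viral/file1
../viral/file2
../viral/file3
../viral/test1_2/
../viral/test1_2/file1
../viral/test1_2/file2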
Now we have rsync and awk spitting out a list of filenames that are being sent to the TO directory. Now we need to get gunzip to process that list. Unfortunately, gunzip can't take in a list of files; if you send gunzip something on its stdin, it will assume that the stream is a gzipped stream and will attempt to gunzip it.
Instead we'll employ the xargs method you have above to take the stdin and feed it into gunzip as the parameter (filename) that it needs:
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}' | xargs -l1 gunzip
Most likely you will have to tweak this a bit to ensure you are gunzipping the right files (either your FROM location files or your TO location files). This gets trickier if you are rsyncing to a remote computer over SSH, obviously. Not sure if that can just be piped.
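If streaming the names through the pipe turns out to be too fragile, a simpler fallback sketch (assuming the viral/ destination from the original command, and that decompressing after the download finishes is acceptable instead of on the fly) is to let find hand the downloaded files to gunzip:
# run after the rsync/xargs command from the question has finished
find viral -name '*genomic.fna.gz' -exec gunzip {} +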

Total size of a bunch of gzip files in a folder

I am trying to figure out the total size of a bunch of gz files inside of a folder.
I know that we can use gzip -l to get the uncompressed size, but if you pipe that through awk '{print $2}', it prints the word "uncompressed" from the header line as well.
If I have 10 gz files inside a folder, what would be the best way to get the uncompressed total?
If you just want to remove the word "uncompressed" from the output, there are a few ways to do it.
Most simply, make the AWK command only process lines that aren't the first.
gzip -l *.gz | awk 'NR > 1 { print $2; }'
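To get the single total the question asks for, one sketch is to sum the per-file values with awk, skipping both the header and the "(totals)" line that gzip adds when given several files (keep in mind gzip -l reads the size from the 32-bit ISIZE field, so it wraps around for members larger than 4GB):
gzip -l *.gz | awk 'NR > 1 && $4 != "(totals)" { sum += $2 } END { print sum }'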

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files into disk, merging them, and compressing again, as I don't have sufficient disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
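As a quick illustration of that last point, you can echo the command line to see what sort would actually be handed (the exact /dev/fd numbers vary):
$ echo sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz)
sort -m -k1 /dev/fd/63 /dev/fd/62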
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
#!/bin/bash
FILES=file*.gz # list of your 40 gzip files
# (e.g. file1.gz ... file40.gz)
WORK1="merged.gz" # first temp file and the final file
WORK2="tempfile.gz" # second temp file
> "$WORK1" # create empty final file
> "$WORK2" # create empty temp file
gzip -qc "$WORK2" > "$WORK1" # compress content of empty second
# file to first temp file
for I in $FILES; do
echo current file: "$I"
sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
mv "$WORK2" "$WORK1"
done
Fill $FILES the easiest way with the list of your files, either with bash globbing (file*.gz) or with a list of the 40 filenames (separated by blanks). Your files in $FILES stay unchanged.
Finally, the 80 GB of data are compressed in $WORK1. While this script runs, no uncompressed data is written to disk.
Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
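The awk/tr/echo stages only build one long command string, which the bash stage then executes and lz4 compresses; what gets run inside that bash is roughly the following (a.lz4 and b.lz4 stand in for whatever find returns under $OUT/uniques):
sort -m -k3b -k2 <( <$OUT/uniques/a.lz4 lz4cat ) <( <$OUT/uniques/b.lz4 lz4cat ) ...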
It is true there are zgrep and other common utilities that play with compressed files, but in this case you need to sort/merge uncompressed data and compress the result.
