Total size of a bunch of gzip files in a folder - Linux

I am trying to figure out the total uncompressed size of a bunch of gz files inside a folder.
I know that we can use gzip -l to get the uncompressed size, but if you pipe that through
awk '{print $2}' it returns the word "uncompressed" (from the header line) as well.
If I have 10 gz files inside a folder, what would be the best way to get the total uncompressed size?

If you just want to remove the word "uncompressed" from the output, there are a few ways to do it.
Most simply, make the AWK command only process lines that aren't the first.
gzip -l *.gz | awk 'NR > 1 { print $2; }'
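Since the question asks for the total, a minimal extension of the same idea (assuming GNU gzip, which appends a "(totals)" line when given more than one file) is to let awk do the arithmetic:
gzip -l *.gz | awk 'END { print $2 }'
which prints the second field of the final "(totals)" line. Alternatively, sum the per-file values yourself while skipping the header and the totals line:
gzip -l *.gz | awk 'NR > 1 && $NF != "(totals)" { sum += $2 } END { print sum }'
One caveat: gzip -l reads the size from the 32-bit field in the gzip trailer, so the reported uncompressed size is only correct modulo 4 GiB for very large members.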


How to create large file (require long compress time) on Linux

I am making a parallel job, so I'm trying to create a dummy file and compress it in the background, like this:
create dummy file
for each file
do
    compress that file &
done
wait
I need to create the dummy data, so I tried
fallocate -l 1g test.txt
and
tar cfv test.txt
but this compression job finishes in just 5 seconds.
How can I create dummy data that is big and requires a long compression time (3 to 5 minutes)?
There are two things going on here. The first is that tar won't compress anything unless you pass it a z flag along with what you already have to trigger gzip compression:
tar czvf test.txt.tar.gz test.txt
For a very similar effect, you can invoke gzip directly:
gzip test.txt
The second issue is that with most compression schemes, a gigantic string of zeros, which is likely what you generate, is very easy to compress. You can fix that by supplying random data. On a Unix-like system you can use the pseudo-file /dev/urandom. This answer gives three options in decreasing order of preference, depending on what works:
head that understands suffixes like G for Gibibyte:
head -c 1G < /dev/urandom > test.txt
head that needs it spelled out:
head -c 1073741824 < /dev/urandom > test.txt
No head at all, so use dd, where file size is block size (bs) times count (1073741824 = 1024 * 1048576):
dd bs=1024 count=1048576 < /dev/urandom > test.txt
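As a quick end-to-end check (the file name and size are just the ones from the question), you could generate the random file and time the compression:
head -c 1G < /dev/urandom > test.txt    # ~1 GiB of essentially incompressible data
time gzip test.txt                      # produces test.txt.gz; this should take far longer than compressing all zeros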
Something like this may work; it uses some bash-specific operators.
#!/bin/bash

function createCompressDelete()
{
    _rdmfile="$1"
    cat /dev/urandom > "$_rdmfile" &   # this writes to the file in the background
    pidcat=$!                          # save the backgrounded pid for later use
    echo "createCompressDelete::$_rdmfile::pid[$pidcat]"
    sleep 2
    while [ -f "$_rdmfile" ]
    do
        fsize=$(du "$_rdmfile" | awk '{print $1}')
        if (( $fsize < (1024*1024) )); then   # check the size for 1G
            sleep 10
            echo -n "...$fsize"
        else
            kill "$pidcat"                                # kill the backgrounded pid
            tar czvf "${_rdmfile}".tar.gz "$_rdmfile"     # compress
            rm -f "${_rdmfile}"                           # delete the created file
            rm -f "${_rdmfile}".tar.gz                    # delete the tarball
        fi
    done
}

# Run for any number of files
for i in file1 file2 file3 file4
do
    createCompressDelete "$i" &> "$i".log &   # run it in the background
done
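If the script should also block until all four background jobs have finished, as the wait in the question's own sketch does, a wait after the loop above will do it:
wait    # returns only after every backgrounded createCompressDelete has exited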

Download using rsync and extract using gunzip, putting it all together into a pipe

I have "gz" files that I am downloading using "rsync". Then, as these files are compressed, I need to extract them using gunzip (I am open to any other alternative for gunzip). I want to put all these commands together into a pipe to have something like that rsync file | gunzip
My original command is the following:
awk -F "\t" '$5~/^(reference genome|representative genome)$/ {sub("ftp", "rsync", $20); b=$20"/*genomic.fna.gz"; print b" viral/." }' assembly_summary_viral.txt | xargs -l1 rsync --copy-links --times --recursive --verbose --exclude="*rna*" --exclude="*cds*"
It looks a little bit complicated, but it downloads the files that I need, and there is no problem with it. I added | gunzip at the end; however, the extraction of the compressed files is not working, and it only downloads them.
Any suggestion?
A pipe takes the stdout of the left command and sends it to the stdin of the right command. Here we have to take the stdout of rsync and pipe it to the stdin of gunzip.
rsync doesn't really output much without the -v flag so you'll have to add that. It will now spit out to stdout something like the following:
$ rsync -rv ./ ../viral
sending incremental file list
file1
file2
file3
test1_2/
test1_2/file1
test1_2/file2
sent 393 bytes received 123 bytes 1,032.00 bytes/sec
total size is 0 speedup is 0.00
We can pipe that to awk first to grab only the file path/name and prepend viral/ to the front of it so that it gunzips the files that you just rsync'd TO (instead of the ones FROM which you rsync'd):
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}'
Now we have rsync and awk spitting out a list of filenames that are being sent to the TO directory. Now we need to get gunzip to process that list. Unfortunately, gunzip can't take in a list of files. If you send something to its stdin, it will assume the stream is gzipped data and attempt to gunzip it.
Instead, we'll employ the xargs method you have above to take the stdin and feed it into gunzip as the parameter (filename) that it needs:
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}' | xargs -l1 gunzip
Most likely you will have to tweak this a bit to ensure you are gunzipping the right files (either your FROM location files or your TO location files). This gets trickier if you are rsyncing to a remote computer over SSH, obviously. Not sure if that can just be piped.
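As one possible tweak along those lines (a sketch, not tested against your data: it assumes you only want to decompress entries ending in .gz, that directory lines should be skipped, and it uses GNU xargs' -r to do nothing when the list is empty):
rsync -rv ./ ../viral |
awk '/\.gz$/ && !/\/$/ {print "../viral/" $0}' |
xargs -r -n1 gunzip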

How to take a count of some data from a file which is inside an archived folder?

I have an archived folder which contains some files. From one of those files I want to take a count based on the 31st pipe-delimited field. How can I get the count without unzipping the folder?
Archived folder name = mug.tar, file name = APR_17
Below is how I take the count:
| awk -F "|" '{print $31}' | grep "40411" | sort -n | uniq -c | wc -l
Untar the wanted file from the archive file to stdout and pipe it to your awk:
$ tar -xOf mug.tar APR_17 | awk ...
man tar:
-x, --extract, --get
extract files from an archive
-O, --to-stdout
extract files to standard output
-f, --file ARCHIVE
use archive file or device ARCHIVE
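Putting the two together with the pipeline from the question (field, pattern and names as given above):
tar -xOf mug.tar APR_17 | awk -F "|" '{print $31}' | grep "40411" | sort -n | uniq -c | wc -l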

Split a .gz file into multiple 1GB compressed(.gz) files

I have a 250 GB gzipped file on Linux and I want to split it into 250 1 GB files, compressing the generated part files on the fly (as soon as one part is generated, it should be compressed).
I tried using this -
zcat file.gz | split -b 1G - file.gz.part
But this generates uncompressed files, and rightly so. I modified it to look like this, but got an error:
zcat file.gz | split -b 1G - file.gz.part | gzip
gzip: compressed data not written to a terminal. Use -f to force compression.
For help, type: gzip -h
I also tried this, and it did not throw any error, but it did not compress the part files as soon as they were generated. I assume it will compress each file once the whole split is done (or it may pack all the part files into a single gz file once the split completes, I am not sure).
zcat file.gz | split -b 1G - file.gz.part && gzip
I read here that there is a filter option, but my version of split is (GNU coreutils) 8.4, hence the filter is not supported.
$ split --version
split (GNU coreutils) 8.4
Please advise a suitable way to achieve this, preferably using a one liner code (if possible) or a shell (bash/ksh) script will also work.
split supports filter commands (in newer coreutils versions than the 8.4 you have). Use this:
zcat file.gz | split - -b 1G --filter='gzip > $FILE.gz' file.part.
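The single quotes matter: split sets FILE in the environment of each filter invocation, so the outer shell must not expand it. If gzip itself becomes the bottleneck, the same idea works with a parallel compressor (assuming pigz, a drop-in parallel replacement for gzip, is installed):
zcat file.gz | split - -b 1G --filter='pigz > $FILE.gz' file.part.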
It's definitely suboptimal, but I tried to write it in bash just for fun (I haven't actually tested it, so there may be some minor mistakes):
GB_IN_BLOCKS=`expr 2048 \* 1024`    # number of 512-byte blocks in 1 GiB
GB=`expr $GB_IN_BLOCKS \* 512`      # 1 GiB in bytes
COMPLETE_SIZE=`zcat asdf.gz | wc -c`
PARTS=`expr $COMPLETE_SIZE \/ $GB`

for i in `seq 0 $PARTS`
do
    zcat asdf.gz | dd skip=`expr $i \* $GB_IN_BLOCKS` count=$GB_IN_BLOCKS | gzip > asdf.gz.part$i
done

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files into disk, merging them, and compressing again, as I don't have sufficient disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
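For illustration (the exact /dev/fd path is system-dependent), you can see the substituted name directly:
$ echo <(true)
/dev/fd/63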
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
#!/bin/bash

FILES=file*.gz        # list of your 40 gzip files
                      # (e.g. file1.gz ... file40.gz)
WORK1="merged.gz"     # first temp file and the final file
WORK2="tempfile.gz"   # second temp file

> "$WORK1"                     # create empty final file
> "$WORK2"                     # create empty temp file
gzip -qc "$WORK2" > "$WORK1"   # compress content of the empty second
                               # file into the first temp file

for I in $FILES; do
    echo current file: "$I"
    sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
    mv "$WORK2" "$WORK1"
done
Fill $FILES with the list of your files, most easily with bash globbing (file*.gz) or with a list of the 40 filenames separated by whitespace. Your files in $FILES stay unchanged.
Finally, the 80 GB of data are compressed in $WORK1. While this script runs, no uncompressed data are written to disk.
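To sanity-check the result (a small verification sketch; it assumes the same sort key as above and that merged.gz is the finished file):
gunzip -c merged.gz | sort -c -k 1 && echo "merged output is sorted"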
Adding a differently flavoured multi-file merge within a single pipeline - it takes all (pre-sorted) files in $OUT/uniques, sort-merges them and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
It is true there are zgrep and other common utilities that play with compressed files, but in this case you need to sort/merge uncompressed data and compress the result.
