I am trying to simply compress a directory using
nohup tar cvzf /home/ali/42.tar.gz /home/hadoop/ &
It creates the archive, but it is the exact same size as the original directory.
I checked the size of the original directory (it contains millions of text files), and it is exactly the same as the archive that is supposed to be compressed; it is as if the compression switch is not working.
I am also using the gunzip command to get more details about the compression achieved:
[hadoop@node3 ali]$ gunzip -lv 42.tar.gz
method crc date time compressed uncompressed ratio uncompressed_name
defla 7716afb7 Feb 28 10:25 7437323730944 1927010989 -385851.3% 42.tar
Update:
Total capacity of the server = 14T
Size of the /home/hadoop directory = 7.2T
Size of /home/ali/42.tar.gz = 6.8T
Note: in the middle of the compression process the server's disk filled up, which is why the archive is only 6.8T.
What am I doing wrong?
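One quick way to sanity-check what is actually on disk here (a sketch, using the paths from the question):
file /home/ali/42.tar.gz     # should report "gzip compressed data" if the z switch took effect
du -sh /home/hadoop          # size of the source directory
ls -lh /home/ali/42.tar.gz   # size of the (possibly truncated) archive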
Related
I am trying to quickly estimate the number of lines in gzipped files. I do this by checking the uncompressed size of the file, sampling lines from the beginning of the file with zcat filename | head -n 100 (for instance), and dividing the uncompressed size by the average line size of this sample of 100 lines.
The problem is that the data I'm getting from gzip -l is invalid. Mostly the uncompressed size seems too small, in some cases even producing negative compression ratios. For example, in one case the compressed file is 1.8 GB and the uncompressed size is listed as 0.7 GB by gzip -l, when it is actually 9 GB when decompressed. I tried to decompress and recompress, but I still get the same uncompressed size.
gzip 1.6 on Ubuntu 18.04.3
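For reference, the estimation workflow described above might look like the sketch below (data.gz is a hypothetical file name; as the answer that follows explains, the size reported by gzip -l is only trustworthy for files under 4 GiB uncompressed):
# 1. Uncompressed size as reported by gzip -l (second column of the data row):
size=$(gzip -l data.gz | awk 'NR==2 {print $2}')
# 2. Average line length (including the newline) over the first 100 lines:
avg=$(zcat data.gz | head -n 100 | awk '{ bytes += length($0) + 1 } END { print bytes / NR }')
# 3. Estimated line count = uncompressed size / average line length:
echo "$size $avg" | awk '{ printf "%.0f\n", $1 / $2 }'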
Below is the part of the gzip spec (RFC 1952) where it defines how the uncompressed size is stored in the gzip file.
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
You are working with a gzip archive whose uncompressed size is greater than 2^32 bytes (4 GiB), so the uncompressed size reported by gzip -l is always going to be incorrect.
Note that this design limitation in the gzip file format doesn't cause any problems when uncompressing the archive. The only impact is on gzip -l or gunzip -l.
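If you need the real uncompressed size, the only reliable way is to decompress and count the bytes; a sketch using the archive name from the first question:
gzip -dc 42.tar.gz | wc -c    # streams the whole archive, so it is slow, but the count is exact
# gzip -l only stores the low 32 bits: real_size mod 2^32 equals the value it reports.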
There are three text files, test1, test2 and test3, with sizes:
test1 - 121 B
test2 - 4 B
test3 - 26 B
I am trying to combine and compress these files using different methods.
Method-A
Combine the files using tar and then compress it using gzip.
$tar cf testpack1.tar test1 test2 test3
$gzip testpack1.tar
Output is testpack1.tar.gz with size 276 B
Method-B
Combine and compress the files using tar.
$tar czf testpack2.tar.gz test1 test2 test3
Output is testpack2.tar.gz with size 262 B
Why are the sizes of the two files different?
(B means bytes.)
If you un-gzip the archive created by your Method B, I bet it will be 10240 bytes. The reason for the difference in size is that tar aligns the archive it compresses to the block size (padding it with zero bytes), but it does not align the uncompressed archive the same way. Here is an excerpt from the GNU tar documentation:
-b blocks
--blocking-factor=blocks
Set record size to blocks * 512 bytes.
This option is used to specify a blocking factor for the archive. When
reading or writing the archive, tar, will do reads and writes of the
archive in records of block*512 bytes. This is true even when the
archive is compressed. Some devices requires that all write operations
be a multiple of a certain size, and so, tar pads the archive out to
the next record boundary. The default blocking factor is set when tar
is compiled, and is typically 20. Blocking factors larger than 20
cannot be read by very old versions of tar, or by some newer versions
of tar running on old machines with small address spaces. With a
magnetic tape, larger records give faster throughput and fit more data
on a tape (because there are fewer inter-record gaps). If the archive
is in a disk file or a pipe, you may want to specify a smaller
blocking factor, since a large one will result in a large number of
null bytes at the end of the archive.
You can create the same compressed tar archive like this:
tar -b 20 -cf test.tar test1 test2 test3
gzip test.tar
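You could check this yourself by counting the bytes in each decompressed stream (a sketch using the file names from the question):
gzip -dc testpack1.tar.gz | wc -c    # size of the tar stream produced by Method A
gzip -dc testpack2.tar.gz | wc -c    # size of the tar stream produced by Method B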
I have two Linux machines, machine1 and machine2 (an FTP server).
I am going to download a file from machine2 to machine1. The file is located
on machine2 at /root/vdo.mp4,
and I need the total size of this file before the download begins.
Is there any way to achieve this?
UPDATE:
This is how I got the total file size. I ran the command below:
du -hs FILE_NAME
FTP supports getting the total size of a file before downloading it:
SIZE filename
The SIZE command (Return the size of a file) is defined in RFC 3659.
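For example (a sketch; machine2 and the path are taken from the question, and the path the FTP server exposes may differ from the filesystem path):
# Interactively, the stock ftp client can send the raw command with "quote";
# per RFC 3659 the server answers "213 <size-in-bytes>":
#   ftp machine2
#   ftp> quote SIZE /root/vdo.mp4
# Non-interactively, curl can fetch only the headers, which include the length:
curl -sI ftp://machine2/root/vdo.mp4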
I am using lz4mt, the multi-threaded version of lz4. In my workflow I send thousands of large files (about 620 MB each) from a client to a server. When a file arrives on the server, my rule triggers, compresses the file with lz4mt, and then removes the uncompressed file. The problem is that sometimes, after removing the uncompressed file, I do not get a compressed file of the right size, because lz4mt returns immediately, before its output has been written to disk.
So, is there any way for lz4mt to remove the uncompressed file itself after compressing, as bzip2 does?
Input: bzip2 uncompressed_file
Output: compressed file only
whereas
Input: lz4mt uncompressed_file
Output: uncompressed + compressed files
I also think the sync command in the script below is not working properly.
The script that executes when my rule triggers is:
script.sh
/bin/lz4mt uncompressed_file output_file
/bin/sync
/bin/rm uncompressed_file
Please tell me how to solve the above issue.
Thanks a lot.
Author here. You could try the following methods (see the sketch after this list):
Concatenate the commands with && or ;.
Add the lz4mt command-line options -q (suppress prompt) and -f (force overwrite).
Try it with the original lz4.
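Combining those suggestions, the trigger script might be reduced to something like this sketch (the file names are the placeholders from the question); the && ensures the uncompressed file is only removed if the compression step exited successfully:
# script.sh (sketch): remove the input only when lz4mt has exited successfully
/bin/lz4mt -q -f uncompressed_file output_file && /bin/rm uncompressed_file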
I made two compressed copies of my folder, first by using the command tar czf dir.tar.gz dir
This gives me an archive of about 16 kB. Then I tried another method: first I gzipped all the files inside the dir and then used
gzip ./dir/*
tar cf dir.tar dir/*.gz
but the second method gave me a dir.tar of about 30 kB (almost double). Why is there such a difference in size?
Because compression is, in general, more efficient on one big sample than on many small files. Say you have gzipped 100 files of 1 kB each. Each file will compress somewhat, but each one also carries the overhead of the gzip format.
file1 -> file1.gz (assume ~30 bytes of gzip header/trailer overhead)
file2 -> file2.gz (assume ~30 bytes of gzip header/trailer overhead)
...
file100 -> file100.gz (assume ~30 bytes of gzip header/trailer overhead)
------------------------------
30 * 100 = 3 kB of overhead.
But if you compress a single tar file of 100 kB (which contains your 100 files), the overhead of the gzip format is added only once (instead of 100 times), and the compression itself can be better.
The difference comes from the per-file gzip overhead and from the suboptimal compression gzip achieves when it processes files individually: it never sees the data as a whole, so each file is compressed with a dictionary that is reset at the start of every file.
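The effect described above is easy to reproduce with a quick experiment (a sketch; the file names and contents are made up for illustration):
mkdir demo && cd demo
for i in $(seq 1 100); do printf 'small sample file number %s\n' "$i" > "f$i.txt"; done
tar czf together.tar.gz f*.txt    # one gzip stream over the whole tar
gzip f*.txt                       # one gzip stream (and header) per file
tar cf separate.tar f*.gz
ls -l together.tar.gz separate.tar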
tar cf creates an uncompressed archive, which means the archive should be almost the same size as your directory, or even slightly larger.
tar czf filters the archive through gzip compression.
This can be checked by running man tar at a shell prompt on Linux:
-z, --gzip, --gunzip, --ungzip
filter the archive through gzip
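A quick way to see the difference between the two invocations (a sketch; somedir is a hypothetical directory):
tar cf plain.tar somedir/          # no compression: roughly the size of somedir, or a bit more
tar czf packed.tar.gz somedir/     # the same archive filtered through gzip
file plain.tar packed.tar.gz       # file(1) reports the type of each archive
ls -lh plain.tar packed.tar.gz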