How to compress a txt file with repeating letters into 1 KB? - zip

This file:
https://drive.google.com/file/d/1L5cx8VLOsCsCY85qrbf3W6VLSMZLveHj/view?usp=sharing
The zip format only compresses it down to 8 kilobytes, but I want to compress it into 1 KB or less,
using a notation like a1024 or a1024**2.
How can I do this?
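The a1024 notation in the question is essentially run-length encoding: store each letter once together with its repeat count. Below is a minimal sketch in Python, assuming the file really consists of long runs of identical letters (the linked file was not inspected, and the helper names rle_encode and rle_decode are made up for illustration).

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

# Assumed input: one letter repeated 1024**2 times, as hinted in the question
data = "a" * 1024**2
encoded = rle_encode(data)
print(encoded)  # [('a', 1048576)]
```

This only pays off when runs are long; text without repeats would grow rather than shrink under this scheme.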

Related

Decompress LZO indexed files

I need to decompress some lzo indexed files.
First, I tried to decompress the first lzo file with lzop, but the first line of the file contains some extra bytes.
How can I decompress the file correctly?

gzip -l returning incorrect values for uncompressed file size

I am trying to quickly estimate the number of lines in gzipped files. I do this by checking the uncompressed size of the file, sampling lines from the beginning of the file with zcat filename | head -n 100 (for instance), and dividing the uncompressed size by the average line size of this sample of 100 lines.
The problem is that the data I'm receiving from gzip -l is invalid. Mostly it seems the uncompressed size is too small, in some cases producing negative compression values. For example, in one case the compressed file is 1.8 GB, and the uncompressed size is listed as 0.7 GB by gzip -l, when it is actually 9 GB when decompressed. I tried to decompress and recompress but still get the same uncompressed size.
gzip 1.6 on Ubuntu 18.04.3
Below is the part of the gzip spec (RFC 1952) where it defines how the uncompressed size is stored in the gzip file.
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
You are working with a gzip archive where the uncompressed size is larger than 2^32 bytes, so the uncompressed size reported by gzip -l is always going to be incorrect.
Note that this design limitation in the gzip file format doesn't cause any problems when uncompressing the archive. The only impact is on gzip -l or gunzip -l.
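To see the wraparound concretely, here is a small Python sketch. It reads the ISIZE field (the last four bytes of a single-member gzip stream, stored little-endian per RFC 1952) and computes the value that would be stored for a 9 GiB input. The helper names are illustrative, and 9 GiB is only a round figure close to the size mentioned in the question.

```python
import gzip
import struct

def stored_isize(gz_bytes):
    """Read ISIZE: the final 4 bytes of a gzip member, little-endian."""
    return struct.unpack("<I", gz_bytes[-4:])[0]

def isize_for(true_size):
    """Value the format can store for a given uncompressed size."""
    return true_size % 2**32

# For small inputs ISIZE matches the real size
small = gzip.compress(b"a" * 1000)
print(stored_isize(small))  # 1000

# For a 9 GiB input the stored value wraps around to 1 GiB
print(isize_for(9 * 2**30))  # 1073741824
```

For files that may exceed 4 GiB, counting the decompressed bytes (for example with zcat filename | wc -c) is the reliable way to get the real size, at the cost of a full decompression pass.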

Combining and compressing using "tar czf" and "tar + gzip". The resulting file in both cases is packname.tar.gz, but why are the sizes different?

There are three text files, test1, test2 and test3, with these sizes:
test1 - 121 B
test2 - 4 B
test3 - 26 B
I am trying to combine and compress these files using different methods.
Method-A
Combine the files using tar and then compress it using gzip.
$ tar cf testpack1.tar test1 test2 test3
$ gzip testpack1.tar
Output is testpack1.tar.gz with size 276 B
Method-B
Combine and compress the files using tar.
$ tar czf testpack2.tar.gz test1 test2 test3
Output is testpack2.tar.gz with size 262 B
Why are the sizes of the two files different?
B means bytes.
If you un-gzip the archive created by your Method-B, I bet it will be 10240 bytes. The reason for the difference in size is that tar pads the compressed archive to the block size (with zero bytes), but it does not pad the uncompressed archive. Here is an excerpt from the GNU tar documentation:
-b blocks
--blocking-factor=blocks
Set record size to blocks * 512 bytes.
This option is used to specify a blocking factor for the archive. When
reading or writing the archive, tar, will do reads and writes of the
archive in records of block*512 bytes. This is true even when the
archive is compressed. Some devices requires that all write operations
be a multiple of a certain size, and so, tar pads the archive out to
the next record boundary. The default blocking factor is set when tar
is compiled, and is typically 20. Blocking factors larger than 20
cannot be read by very old versions of tar, or by some newer versions
of tar running on old machines with small address spaces. With a
magnetic tape, larger records give faster throughput and fit more data
on a tape (because there are fewer inter-record gaps). If the archive
is in a disk file or a pipe, you may want to specify a smaller
blocking factor, since a large one will result in a large number of
null bytes at the end of the archive.
You can create the same compressed tar archive like this:
tar -b 20 -cf test.tar test1 test2 test3
gzip test.tar
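Two things in this thread can be checked with a short Python sketch. First, an archive padded to a record of 20 * 512 bytes comes out at 10240 bytes for three tiny files (Python's tarfile module uses the same record size). Second, and separately from blocking, gzip run on a named file stores that name in its header, while a stream piped through gzip has no name to store; the 14-byte gap in the question (276 - 262) equals the length of "testpack1.tar" plus a terminating NUL byte. The file contents below are placeholders of the right sizes, and the helper names are illustrative, so this does not reproduce the exact 276 and 262 byte figures.

```python
import gzip
import io
import tarfile

def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def gzip_bytes(data, stored_name):
    """Gzip data, storing stored_name in the header (empty string: no name)."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buf, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()

# Placeholder contents with the sizes from the question: 121 B, 4 B, 26 B
files = {"test1": b"x" * 121, "test2": b"y" * 4, "test3": b"z" * 26}

tar_bytes = make_tar(files)
print(len(tar_bytes))  # 10240, one full record of 20 * 512 bytes

named = gzip_bytes(tar_bytes, "testpack1.tar")  # like: gzip testpack1.tar
unnamed = gzip_bytes(tar_bytes, "")             # like: tar cf - ... | gzip
print(len(named) - len(unnamed))  # 14
```

On the command line, gzip -n (--no-name) omits the stored name and timestamp, which gives a way to test whether the header accounts for the difference on a particular system.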

Difference between .tar.gz and gzipping first, then tarring

I made two compressed copies of my folder, the first by using the command tar czf dir.tar.gz dir
This gives me an archive of size ~16 KB. Then I tried another method: first I gzipped all the files inside dir and then tarred them, using
gzip ./dir/*
tar cf dir.tar dir/*.gz
but the second method gave me a dir.tar of size ~30 KB (almost double). Why is there so much difference in size?
Because compression is generally more efficient on one big input than on many small files. Say you have gzipped 100 files of 1 KB each. Each file gets a certain amount of compression, plus the overhead of the gzip format.
file1 -> file1.gz (assume 30 bytes of headers/footers)
file2 -> file2.gz (assume 30 bytes of headers/footers)
...
file100 -> file100.gz (assume 30 bytes of headers/footers)
------------------------------
30 * 100 = 3000 bytes, about 3 KB of overhead.
But if you compress one tar file of 100 KB (which contains your 100 files), the overhead of the gzip format is added only once (instead of 100 times) and the compression can be better.
The difference comes from per-file metadata overhead and from suboptimal compression when gzip processes files individually: it never sees the data as a whole, so it compresses each file with a dictionary that is reset for every file.
tar cf creates an uncompressed archive, which means the archive should be about the same size as your directory, or slightly larger.
tar czf additionally filters the archive through gzip.
You can check this by running man tar at a shell prompt on Linux:
-z, --gzip, --gunzip, --ungzip
filter the archive through gzip
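The effect described in these answers can be reproduced with a rough Python sketch. It builds 100 made-up files that share most of their content (an assumption chosen to make the effect visible; real directories will differ), then compares gzipping each file on its own against gzipping one tar archive of all of them.

```python
import gzip
import io
import random
import tarfile

# 100 hypothetical files: a 3-byte index followed by the same 1000 bytes.
# The shared block is pseudo-random, so a single file barely compresses.
rng = random.Random(0)
shared = bytes(rng.getrandbits(8) for _ in range(1000))
files = {f"dir/file{i:03d}": f"{i:03d}".encode() + shared for i in range(100)}

# Method 1: gzip every file separately and add up the sizes.
# (The tar headers that "tar cf dir.tar dir/*.gz" would add are not counted.)
individual_total = sum(len(gzip.compress(data)) for data in files.values())

# Method 2: tar all files first, then gzip the single archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
combined = len(gzip.compress(buf.getvalue()))

print(individual_total, combined)
```

With this input the per-file total comes out larger than the combined archive: each separate gzip stream pays its own header and trailer and cannot reuse data seen in the other files, while the single stream can refer back to earlier files within its window.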

How to validate equivalency between two zip packages

I'm trying to validate if two zip packages are equivalent. I can not rely on md5sum. When I extract the two packages, and do a md5sum diff between all the files in the packages, there is no difference, and all files have equivalent md5sums. But the zip packages themselves have different md5sum values. My question is: How can I validate that two zip packages are equivalent?
When you list the archive's content with
unzip -v archive.zip
you get a list of files with these column headings
Length Method Size Cmpr Date Time CRC-32 Name
Depending on what you consider equivalent (e.g. Size, CRC, Name), you can extract the relevant columns for both archives, sort them and do a diff over the output.
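The same column-comparison idea can be sketched in Python with the standard zipfile module. This example treats two archives as equivalent when their sorted (name, uncompressed size, CRC-32) entries match; that definition, the sample data, and the helper names make_zip and content_signature are assumptions for illustration.

```python
import io
import zipfile

def make_zip(files, date_time, compression):
    """Build a zip archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            info = zipfile.ZipInfo(name, date_time=date_time)
            zf.writestr(info, data, compress_type=compression)
    return buf.getvalue()

def content_signature(zip_bytes):
    """Sorted (name, uncompressed size, CRC-32) for every archive entry."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return sorted((i.filename, i.file_size, i.CRC) for i in zf.infolist())

files = {"b.txt": b"hello\n" * 50, "a.txt": b"world\n" * 50}
reordered = dict(reversed(list(files.items())))

# Same contents, but different timestamps, member order and compression method
zip_a = make_zip(files, (2020, 1, 1, 0, 0, 0), zipfile.ZIP_STORED)
zip_b = make_zip(reordered, (2021, 6, 1, 12, 0, 0), zipfile.ZIP_DEFLATED)

print(zip_a == zip_b)                                        # False
print(content_signature(zip_a) == content_signature(zip_b))  # True
```

Timestamps, member order and compression settings all change the bytes of the archive without changing the extracted files, which is why the md5sum of the zip itself is not a useful equivalence check.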
Without unzipping the file, you can use zipinfo,
e.g.:
zipinfo 5.zip
Archive: 5.zip 158 bytes 1 file
drwxr-xr-x 3.0 unx 0 bx stor 18-Nov-13 07:23 501/
1 file, 0 bytes uncompressed, 0 bytes compressed: 0.0%
