Is there any way to get the compressed file size in squashfs?
I need to get the compressed size of every single file in the squashfs image.
Related
Is there a way to split a given file, whether a text file or an image file, into smaller chunks of equal size and reassemble them again in Python?
I have a large GZIP-ed file. I want to read a few bytes from a specific offset of uncompressed data.
For example, I have a file whose original size is 10 GB. Gzipped, it is 1 GB. I want to read a few bytes at the 5 GB offset of the uncompressed data from that 1 GB gzipped file.
You will need to decompress all of the first 5 GB of uncompressed data in order to get just those bytes.
If you are frequently accessing just a few bytes from the same large gzip file, then it can be indexed for more rapid random access. You would read the entire file once to build the index. See zran.h and zran.c in the examples directory of the zlib source distribution.
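To illustrate the unindexed case described above, here is a minimal Python sketch using the standard `gzip` module. The function name `read_at_offset` is made up for this example. Note that `seek()` here does not jump within the compressed file; it decompresses and discards everything before the requested offset, which is exactly the cost the answer describes.

```python
import gzip

def read_at_offset(path, offset, length):
    """Return `length` bytes starting at `offset` of the uncompressed data."""
    with gzip.open(path, "rb") as f:
        # For a gzip file opened for reading, seek() inflates and throws away
        # all data before `offset`; nothing is skipped in the compressed file.
        f.seek(offset)
        return f.read(length)
```

This is fine for an occasional lookup. For repeated lookups in the same file, an index as in zran.c avoids re-reading from the start each time.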
Say I have a really large zip file (80GB) containing one massive CSV file (> 200GB).
Is it possible to fetch a subsection of the 80GB file data, modify the central directory, and extract just that bit of data?
Background on my problem:
I have a cyclic process that does a summing on a certain column of a large zipped CSV file stashed in the cloud.
What I do today is stream the file to my disk, extract it, and then read the extracted file line by line. This makes it a very disk-bound operation. Disk IS the bottleneck for sure.
Sure, I can leverage other cloud services to get what I need faster but that is not free.
I'm curious whether I can see speed gains by just taking 1 GB subsections of the zip until there's nothing left to read.
What I know:
The Zip file is stored using the deflate compression algorithm (always)
In the API I use to get the file from the cloud, I can specify a byte range to filter to. This means I can seek through the bytes of a file without hitting disk!
According to the zip file specs there are three major parts to a zip file, in order:
1: A header describing the file and its attributes
2: The raw file data in deflated format
3: The central directory listing where files start and stop, and at what bytes
What I don't know:
How the deflate algorithm works exactly. Does it jumble the file up or does it just compress things in order of the original file? If it does jumble, this approach may not be possible.
Has anyone built a tool like this already?
You can always decompress starting from the beginning, going as far as you like, keeping only the last, say, 1 GB, once you get to where you want. You cannot just start decompressing somewhere in the middle. At least not with a normal .zip file that has not been very specially prepared somehow for random access.
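A rough Python sketch of "decompress from the beginning, keep only the part you want", using `zlib` on the raw deflate data of a zip entry. The function name `inflate_window` and the idea that `chunks` comes from successive byte-range requests are assumptions for illustration; the chunks must start at the first byte of the entry's compressed data (after its local header) and be fed in order.

```python
import zlib

def inflate_window(chunks, start, length):
    """Inflate raw deflate data from its beginning and return only the
    uncompressed bytes in [start, start + length)."""
    d = zlib.decompressobj(-15)  # negative wbits: raw deflate, no zlib/gzip wrapper
    out = bytearray()
    pos = 0  # uncompressed offset of the next byte the decompressor will produce
    end = start + length
    for chunk in chunks:
        # Keep chunks small: each one is inflated fully in memory.
        piece = d.decompress(chunk)
        lo = max(start - pos, 0)
        hi = min(end - pos, len(piece))
        if lo < hi:
            out += piece[lo:hi]  # keep only the overlap with the wanted window
        pos += len(piece)
        if pos >= end or d.eof:
            break  # window complete, or the deflate stream has ended
    return bytes(out)
```

Nothing touches the disk here, but every compressed byte before the window still has to be downloaded and inflated, so later windows cost progressively more.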
The central directory has nothing to do with random access of a single entry. All it can do is tell you where an entry starts and how long it is (both compressed and uncompressed).
I would recommend that you reprocess the .zip file into a .zip file with many (~200) entries, each on the order of 1 GB uncompressed. The resulting .zip file will be very close to the same size, but you can then use the central directory to pick one of the 200 entries, randomly access it, and decompress just that one.
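The reprocessing suggested above could look roughly like this in Python with the standard `zipfile` module. The names `split_into_entries` and `read_entry`, the `part-NNNNN.csv` naming, and the `chunk_bytes` default are all made up for the example. It splits on line boundaries, so only the first entry contains the CSV header row.

```python
import zipfile

def split_into_entries(lines, out_zip, chunk_bytes=1 << 30):
    """Write an iterable of byte lines into out_zip as several deflated
    entries of roughly chunk_bytes uncompressed each; return entry names."""
    names = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        entry, written = None, 0
        for line in lines:
            if entry is None or written >= chunk_bytes:
                if entry is not None:
                    entry.close()  # finish the previous entry
                name = f"part-{len(names):05d}.csv"
                names.append(name)
                # force_zip64 allows an entry to grow past 4 GiB if needed
                entry = zf.open(name, "w", force_zip64=True)
                written = 0
            entry.write(line)
            written += len(line)
        if entry is not None:
            entry.close()
    return names

def read_entry(zip_path, name):
    """Decompress just one entry, located via the central directory."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(name)
```

The `lines` iterable could be the single big entry of the original archive opened with `ZipFile.open()`, which yields byte lines when iterated, so the one-time conversion can itself be streamed.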
I've got a lot of files which are being produced direct to gzip format. I'd like to re-package these into zip archives grouping many files together.
Given the size and volume of these files (0.5GB uncompressed per file and thousands of files) I really want to avoid decompressing these only to re-compress them if the only technical difference is the file headers.
Given that both file formats use / support the Deflate algorithm, is it possible for me to simply strip the file header off a gzip file and patch the compressed data into a zip file without re-encoding (but adding an appropriate header entry)? Or is there a difference in the binary encoding of the compressed data?
Yes, you can use the compressed data from gzip files directly to construct a zip file.
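As a sketch of how that can work in Python: the gzip trailer already carries the CRC-32 and uncompressed length that the zip headers need, so only the headers have to be rewritten. The function names below are made up, and the sketch assumes each gzip file has a single member, is under 4 GiB uncompressed, and fits in memory; it writes no ZIP64 records and uses a fixed 1980-01-01 timestamp.

```python
import struct

def gzip_payload(gz):
    """Return (raw_deflate, crc32, uncompressed_size) of a single-member gzip."""
    if gz[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a deflate-compressed gzip stream")
    flags, pos = gz[3], 10           # the fixed gzip header is 10 bytes
    if flags & 0x04:                 # FEXTRA: 2-byte length plus payload
        pos += 2 + struct.unpack_from("<H", gz, pos)[0]
    for bit in (0x08, 0x10):         # FNAME, FCOMMENT: zero-terminated strings
        if flags & bit:
            pos = gz.index(b"\x00", pos) + 1
    if flags & 0x02:                 # FHCRC: 2-byte header checksum
        pos += 2
    crc, size = struct.unpack("<II", gz[-8:])  # trailer: CRC-32, then ISIZE
    return gz[pos:-8], crc, size

def gzips_to_zip(members, out):
    """members: iterable of (name, gzip_bytes); out: writable binary file."""
    central, offset = [], 0
    for name, gz in members:
        data, crc, size = gzip_payload(gz)
        raw_name = name.encode("utf-8")
        # Fields shared by local and central headers: version needed 2.0,
        # flag 0x0800 (UTF-8 name), method 8 (deflate), time 0, date 1980-01-01.
        common = struct.pack("<HHHHHIIIHH", 20, 0x0800, 8, 0, 0x21,
                             crc, len(data), size, len(raw_name), 0)
        out.write(b"PK\x03\x04" + common + raw_name + data)
        central.append(b"PK\x01\x02" + struct.pack("<H", 20) + common
                       + struct.pack("<HHHII", 0, 0, 0, 0, offset) + raw_name)
        offset += 30 + len(raw_name) + len(data)
    cd = b"".join(central)
    out.write(cd)
    # End of central directory record; `offset` is now where the directory starts.
    out.write(b"PK\x05\x06" + struct.pack("<HHHHIIH", 0, 0, len(central),
                                          len(central), len(cd), offset, 0))
```

Since nothing is re-encoded, the cost is essentially a file copy. For multi-gigabyte inputs the same layout could be written by streaming the payload instead of holding it in memory.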
If a file is not compressed (i.e. stored) in a ZIP file, would its corresponding Central File Header entry have the same compressed and uncompressed sizes? Or is it possible that one of these will be missing?
Yes, it should have both sizes, and the compressed size is greater than or equal to the uncompressed size.
It can be greater when encryption is used.
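A quick way to check this on an actual archive is to read both sizes from the central directory with Python's `zipfile` module. The helper name `entry_sizes` is made up for the example. This only demonstrates the unencrypted case, since `zipfile` cannot create encrypted archives.

```python
import zipfile

def entry_sizes(zip_file):
    """Map each entry name to (method, compressed_size, uncompressed_size)
    as recorded in the central directory."""
    with zipfile.ZipFile(zip_file) as zf:
        return {info.filename: (info.compress_type, info.compress_size, info.file_size)
                for info in zf.infolist()}
```

For an unencrypted entry with method `zipfile.ZIP_STORED`, the two sizes should come back equal.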