Reading Gzip File Footer with CURL or WGET - linux

I have a gzip file on a web server. I want to download the file only if there is enough disk space to decompress it. Is it possible to know the decompressed size before downloading the file?
The decompressed size is encoded in the footer of the gzip file[1]. We can extract it with the following command:
gzip -l
But that requires the file to be downloaded first. I want to avoid downloading the file if I can determine the decompressed size in advance.

You can hack your way around this with the HTTP Range header. It takes a couple of extra HTTP requests, and your server needs to support range requests.
Send a first request with the HEAD method to get the total file size from the Content-Length header.
Send a second request with a Range header asking for the last 4 bytes of the file (with curl: curl -r -4 <url>). Decode those 4 bytes as a little-endian 32-bit integer: that is the gzip ISIZE field, the uncompressed size modulo 2^32 (so it is only reliable for files under 4 GiB).
If you have enough space available on the disk (file size + uncompressed size), download the full file, as in the sketch below.
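A minimal Node.js sketch of this approach, assuming Node 18+ (for the global fetch API) and a server that honors HEAD and Range requests; the URL is a placeholder:

// Read the gzip ISIZE field (last 4 bytes) without downloading the file.
async function gzipSizes(url) {
  // 1. HEAD request: total compressed size from Content-Length.
  const head = await fetch(url, { method: "HEAD" });
  const compressed = Number(head.headers.get("content-length"));

  // 2. Range request for the trailing 4 bytes; a 206 status confirms
  // the server actually honored the Range header.
  const tail = await fetch(url, { headers: { Range: "bytes=-4" } });
  if (tail.status !== 206) throw new Error("server ignored Range header");
  const buf = Buffer.from(await tail.arrayBuffer());

  // ISIZE is a little-endian uint32: uncompressed size modulo 2^32.
  const uncompressed = buf.readUInt32LE(0);
  return { compressed, uncompressed };
}

Only start the full download when compressed + uncompressed fits in the available disk space.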

Related

How to split an image file into chunks of n bytes to be sent to an api in Node.js?

I'm trying to upload an image to an API that requires it to be sent in chunks of n bytes at a time (the chunk size is dynamic; I receive it beforehand). The parameters for the request are the chunk index and the image payload. So, given the file, how would I go about splitting it into n-byte chunks to send in an axios request? Thanks!
You can use the split-file npm package to do that.
There are lots of options to get a chunk of a file. Here are some of those options:
const fs = require('fs');
let stream = fs.createReadStream(filename, { start: firstByte, end: lastByte });
Then you can pipe that stream to a response, or attach your own data listener and do something with the bytes as they come in. The start and end options (both inclusive) automatically limit the stream to that particular chunk of the file.
You could also open the file and then read a specific chunk with fs.read(), passing the position and length arguments.
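For instance, here is a rough sketch of that second option using the promise-based fs API; the endpoint URL and the chunkIndex/payload field names are hypothetical stand-ins for your API's actual contract:

const fs = require('fs/promises');
const axios = require('axios');

// Walk the file in n-byte chunks and POST each one with its index.
async function uploadInChunks(path, chunkSize, url) {
  const handle = await fs.open(path, 'r');
  try {
    const { size } = await handle.stat();
    for (let index = 0, pos = 0; pos < size; index++, pos += chunkSize) {
      const length = Math.min(chunkSize, size - pos);
      const buffer = Buffer.alloc(length);
      await handle.read(buffer, 0, length, pos); // read one chunk at offset pos
      await axios.post(url, { chunkIndex: index, payload: buffer.toString('base64') });
    }
  } finally {
    await handle.close();
  }
}

The last chunk simply comes out shorter than n bytes, which is what most chunked-upload APIs expect.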

How to determine HTTP range request start & end bytes (nodejs + mongodb)

I wonder if it is possible to determine the start and end bytes of HTTP range requests, or to somehow let the browser know where to start and have it use some user-defined chunk size.
I have a file in my database and it is split into multiple chunks, each chunk is 2 MB.
e.g. a 20 MB file => 10 chunks
When the browser starts downloading the file (a video file; I have studied Chrome), it first requests Range: bytes=0-, and if the server successfully responds with the 'right' bytes and a 206 status, it then sends another request for the end bytes of the file, e.g. Range: bytes=1900000-.
It just checks whether your server responds correctly to partial requests.
On the server side I have coded my app so that it sends 2 MB partials if you ask it nicely :)
What I want the browser to do:
Range: bytes=0-
Range: bytes=2000000-4000000
Range: bytes=4000000-6000000
But if you ask for a partial that doesn't fit in a 2 MB chunk, it will give an error, or it will just not play an audio/video file from the right position. For example:
Range: bytes=2500000-4000000
Range: bytes=0-1000000
These will give an error, because I cannot start sending from part of a chunk. Otherwise I would have to slice my chunks and do some buffer operations, but I want to keep it clean.
If this is possible please let me know.
I am assuming that you are streaming an mp4 file? Different parts (boxes) of an mp4 serve different purposes, and it's not possible to jump to a random position in the file and start playing without first identifying the location of each frame by preloading the index (the moov box). The moov can be at the beginning or the end of a file, so the browser MAY need the end of the file first. It can determine this by starting from the beginning and looking for the moov; if it is not at the start, each box header gives that box's size, i.e. a pointer to the location of the next box, so the browser can leapfrog through the file until it finds the index.
Once the moov header is downloaded, the browser knows the EXACT byte offset and size of every single frame in the video, and can jump around the file as you seek. This is all possible because the browser knows how to parse mp4 natively.
TL;DR: No, your solution will not work.
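To make the leapfrogging concrete, here is a rough Node.js sketch that walks the top-level boxes of a local mp4 file; a browser does the moral equivalent over HTTP with Range requests. This is my illustration, not code the browser actually runs:

const fs = require('fs/promises');

// Each mp4 box starts with a 4-byte big-endian size and a 4-byte type.
// A size of 1 means a 64-bit size follows (common for large mdat boxes);
// a size of 0 means the box extends to the end of the file.
async function listTopLevelBoxes(path) {
  const handle = await fs.open(path, 'r');
  try {
    const { size: fileSize } = await handle.stat();
    for (let pos = 0; pos + 8 <= fileSize; ) {
      const header = Buffer.alloc(16);
      await handle.read(header, 0, 16, pos);
      let boxSize = header.readUInt32BE(0);
      const type = header.toString('ascii', 4, 8);
      if (boxSize === 1) boxSize = Number(header.readBigUInt64BE(8));
      console.log(`${type} at byte ${pos}, ${boxSize} bytes`);
      if (boxSize === 0) break;
      pos += boxSize; // leapfrog to the next box
    }
  } finally {
    await handle.close();
  }
}

Running this on a typical mp4 prints a handful of boxes (ftyp, moov, mdat, ...); where moov sits tells you whether the file can start playing before it is fully downloaded.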

Parse bytes of a zip file?

I am requesting a zip file from an API, trying to retrieve it by byte range (setting a Range header) and then parse each of the parts individually. After reading a bit about gzip and zip compression, I'm having a hard time figuring out:
Can I parse a portion of a zip file?
I know that a gzip file usually compresses a single file, so you can decompress and parse it in parts, but what about zip files?
I am using Node.js and have tried several libraries like adm-zip and zlib, but they don't seem to support this.
Zip files have a catalog at the end of the file (in addition to the same basic information before each item), which lists the file names and the location in the zip file of each item. Generally each item is compressed using deflate, which is the same algorithm that gzip uses (but gzip has its own header before the deflate stream).
So yes, it's entirely feasible to extract the compressed byte stream for one item in a zip file and prepend a fabricated gzip header (10 bytes is the minimum size of this header) so that you can decompress just that file by passing it to gunzip. gunzip will also want the 8-byte trailer (CRC-32 and uncompressed size, both of which are available in the zip entry's header) appended after the deflate data.
If you want to write code to inflate the deflate stream yourself, I recommend you make a different plan. I've done it, and it's really not fun. Use zlib if you must do it; don't try to reimplement the decompression.
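In Node.js specifically you can skip the fabricated gzip header entirely, because zlib can inflate a raw deflate stream directly. A minimal sketch, assuming you have already located the entry's compressed bytes via the central directory:

const zlib = require('zlib');

// compressedBytes: the raw deflate data of one zip entry, with no headers.
// Zip entries carry no zlib/gzip wrapper, hence inflateRaw rather than
// inflate or gunzip.
function inflateZipEntry(compressedBytes) {
  return zlib.inflateRawSync(compressedBytes);
}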

md5 different after decompressing the file and compressing it again

Host: Ubuntu 14.04
Command: md5sum
File size before/after decompressing: 77.8 M / 323.9 M
I downloaded the file (device.tar.xz) from the official Ubuntu website.
Before decompressing the file, I used md5sum to generate the md5 sum of the compressed file. Then I decompressed it, without modifying any of the content inside, and re-compressed it (device2.tar.xz).
Comparing the two md5 sums, they are different. I suspect my decompression may have changed something.
Is there any way to ensure that the content is exactly the same after re-compressing?
Thanks
You're hashing two different compressed representations of the same uncompressed data.
The xz file format includes some metadata, which you can see with xz -l foo.xz. So even if you used the same version of the same compression program with the same settings, you could get output files that aren't byte-for-byte identical. If you want to verify that the content survived the round trip, compare checksums of the uncompressed data instead, as sketched below.
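A quick way to run that check (here shelling out from Node, though you can just type the piped command in a shell; assumes xz and md5sum are installed, and uses the file names from the question):

const { execSync } = require('child_process');

// Hash the *uncompressed* content of each archive; if these two sums match,
// the data survived the round trip and only the xz wrapper differs.
for (const file of ['device.tar.xz', 'device2.tar.xz']) {
  const sum = execSync(`xz -dc ${file} | md5sum`).toString().trim();
  console.log(file, '=>', sum);
}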

Should all file structures in a ZIP file be consecutive?

While reading a ZIP file, can we safely assume that all file structures (by that I mean Local File Header + file data (compressed or stored) + Data Descriptor) are exactly consecutive? Can there be any irrelevant data in between?
The PKWARE APPNOTE says:
"Immediately following the local header for a file is the compressed or stored data for the file. The series of [local file header][file data][data descriptor] repeats for each file in the .ZIP archive."
So there should be no gaps between them.
However, I would recommend parsing the central directory rather than walking through the local file headers (unless you need streamed processing).
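As a starting point for reading the central directory, here is a rough Node.js sketch that locates the End Of Central Directory record by scanning backwards from the end of the file (plain EOCD only; real code would also handle ZIP64):

const fs = require('fs/promises');

// The EOCD record (signature 0x06054b50) is 22 bytes plus an optional
// comment of up to 65535 bytes, so it sits within the last ~64 KB.
async function findCentralDirectory(path) {
  const handle = await fs.open(path, 'r');
  try {
    const { size } = await handle.stat();
    const tailLen = Math.min(size, 22 + 65535);
    const tail = Buffer.alloc(tailLen);
    await handle.read(tail, 0, tailLen, size - tailLen);
    for (let i = tailLen - 22; i >= 0; i--) {
      if (tail.readUInt32LE(i) === 0x06054b50) {
        return {
          entryCount: tail.readUInt16LE(i + 10), // total number of entries
          cdSize: tail.readUInt32LE(i + 12),     // central directory size
          cdOffset: tail.readUInt32LE(i + 16),   // where the directory starts
        };
      }
    }
    throw new Error('EOCD record not found');
  } finally {
    await handle.close();
  }
}

From cdOffset you can read the central directory entries in order; each one gives the exact offset of its local file header.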
