I am requesting a zip file from an API, retrieving it in byte ranges (by setting a Range header) and then parsing each of the parts individually. After reading a bit about gzip and zip compression, I'm having a hard time figuring out:
Can I parse a portion of a zip file?
I know that a gzip file usually wraps a single compressed file, so you can decompress and parse it in parts, but what about zip files?
I am using Node.js and have tried several libraries like adm-zip and zlib, but they don't seem to support this.
Zip files have a central directory at the end of the file (in addition to a local header before each item), which lists the file names and the location of each item within the zip file. Each item is generally compressed with deflate, which is the same algorithm gzip uses (gzip just wraps the deflate stream in its own header and trailer).
So yes, it's entirely feasible to extract the compressed byte stream for one item in a zip file and prepend a fabricated gzip header (the minimum gzip header is 10 bytes) so that you can decompress just that item by passing it to gunzip. Note that gunzip will also expect the 8-byte gzip trailer (CRC-32 and uncompressed length), so decompressing the raw deflate stream directly is often simpler.
If you want to write code to inflate the deflated stream yourself, I recommend you make a different plan. I've done it, and it's really not fun. Use zlib if you must do it; don't try to reimplement the decompression.
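For illustration, here is a minimal, self-contained Python sketch (Node's zlib exposes the same primitive, e.g. zlib.inflateRawSync). It builds a small zip in memory, pulls out the raw deflate bytes of one member using the offset from the central directory, and inflates them with a raw window (wbits = -15), which avoids fabricating a gzip header altogether:
import io
import struct
import zipfile
import zlib

# Build a tiny zip in memory so the sketch is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('hello.txt', b'hello world ' * 20)
data = buf.getvalue()

# Read the central directory to find the member's local header offset.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    info = zf.getinfo('hello.txt')

# The local file header has a 30-byte fixed part, followed by the file
# name and the extra field; the compressed bytes come right after that.
name_len, extra_len = struct.unpack(
    '<HH', data[info.header_offset + 26:info.header_offset + 30])
start = info.header_offset + 30 + name_len + extra_len
raw = data[start:start + info.compress_size]

# wbits = -15 tells zlib the stream has no zlib/gzip wrapper at all.
print(zlib.decompress(raw, -15))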
Related
I have a compressed file that's about 200 MB, in the form of a tar.gz file. I understand that I can extract the xml files in it. It contains several small xml files and one 5 GB xml file. I'm trying to remove certain characters from the xml files.
So my very basic question is: is it even possible to accomplish this without ever extracting the content of the compressed file?
I'm trying to speed up the process of reading through xml files looking for characters to remove.
You will have to decompress, change, and then recompress the files. There's no way around that.
However, this does not necessarily mean writing the file to storage. You might be able to make the changes you want in a streaming fashion, i.e. everything is done in memory without the complete decompressed file ever existing anywhere. Unix uses pipes for such tasks.
Here is an example on how to do it:
Create two random files:
echo "hello world" > a
echo "hello world" > b
Create a compressed archive containing both:
tar -c -z -f x.tgz a b
Pipe the contents of the archive through a filter that makes the change. Unfortunately I haven't found a pure shell way to do this, but you also tagged your question with Python, and with the tarfile module you can achieve it.
Here is the file tar.py:
#!/usr/bin/env python3
import sys
import tarfile

tar_in = tarfile.open(fileobj=sys.stdin.buffer, mode='r:gz')
tar_out = tarfile.open(fileobj=sys.stdout.buffer, mode='w:gz')
for tar_info in tar_in:
    reader = tar_in.extractfile(tar_info)
    if tar_info.path == 'a':  # my example file names are "a" and "b"
        # now comes the code which makes our change:
        # we just skip the first two bytes in each file:
        reader.read(2)      # skip two bytes
        tar_info.size -= 2  # reduce size in info object as well
    # add the (maybe changed) file to the output:
    tar_out.addfile(tar_info, reader)
tar_out.close()
tar_in.close()
This can be called like this:
./tar.py < x.tgz > y.tgz
y.tgz will contain both files again, but the first two bytes of a will have been skipped (so its contents will be llo world).
You will have noticed that you need to know the resulting size of your change beforehand. tar is designed to handle files, so it needs to write the size of each entry into the tar header block that precedes every entry in the resulting file; I see no way around this. With compressed output it also isn't possible to seek back after writing everything and adjust the size field.
But as you phrased your question, this might be possible in your case.
All you will have to do is provide a file-like object (it could be a Popen object's output stream) like reader in my simple example.
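If you cannot know the new size up front, one workaround is to buffer a transformed member in memory first. Here is a sketch (the helper name and the transform argument are made up for illustration) that trades memory for not having to predict the size:
import io

def add_transformed(tar_out, tar_info, reader, transform):
    # Buffer the transformed bytes so the final size is known
    # before the tar header for this entry is written.
    data = transform(reader.read())
    tar_info.size = len(data)
    tar_out.addfile(tar_info, io.BytesIO(data))
Note that this holds a whole member in memory, so it is no help for the 5 GB file from the question.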
I have binary data which is a GZIP-compressed string. Both the header and the footer are absent, but the stream is otherwise correct. I verified this by using Node.js zlib.gzip() to compress the same string and then comparing the two binary outputs.
Is it possible to use the zlib library to uncompress data without the header and footer?
I think you want zlib.inflateRaw() and friends.
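For illustration, the equivalent in Python's zlib is a negative wbits value, which is what inflateRaw does under the hood. A minimal sketch (the test data is fabricated by stripping the 2-byte header and 4-byte trailer from a zlib-format stream):
import zlib

payload = b'hello raw deflate'
# zlib format = 2-byte header + raw deflate bytes + 4-byte Adler-32
# trailer, so slicing both off leaves a bare deflate stream.
raw = zlib.compress(payload)[2:-4]

# wbits = -15 means: expect a headerless (raw) deflate stream.
assert zlib.decompress(raw, -15) == payload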
I have a gzip file on a web server. I want to download the file only if there is enough disk space to decompress it. Is it possible to know the decompressed size before downloading the file?
The decompressed size is encoded in the footer of the gzip file[1]. We can extract the decompressed size with the following command:
gzip -l
But the file needs to be downloaded first. I want to avoid downloading it if I can determine the decompressed size in advance.
You can hack your way around this with the HTTP Range header; it takes a couple of extra HTTP requests, and your server needs to accept the Range header.
Send a first request with the HEAD method to figure out the total file size from the Content-Length header.
Send a second request with a Range header to get the last 4 bytes of the file. Decode these bytes to get the uncompressed size (the gzip ISIZE field, which holds the size modulo 2^32).
If you have enough space available on disk (compressed size + uncompressed size), download the full file.
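As a sketch of those three steps in Python (the URL is hypothetical, and the server must honor Range; because ISIZE is stored modulo 2^32, it under-reports files of 4 GB and more):
import struct
import urllib.request

url = 'https://example.com/file.gz'  # hypothetical URL

# Step 1: HEAD request for the compressed size.
head = urllib.request.Request(url, method='HEAD')
with urllib.request.urlopen(head) as resp:
    compressed_size = int(resp.headers['Content-Length'])

# Step 2: Range request for the last 4 bytes (the gzip ISIZE field,
# a little-endian uint32).
tail = urllib.request.Request(url, headers={'Range': 'bytes=-4'})
with urllib.request.urlopen(tail) as resp:
    uncompressed_size = struct.unpack('<I', resp.read())[0]

# Step 3: download only if compressed + uncompressed fits on disk.
print(compressed_size, uncompressed_size)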
Will U-SQL support compressing and decompressing files?
I would like to decompress a compressed file to perform some validations and, once they pass, compress the data into a new file.
According to the main EXTRACT article, the U-SQL EXTRACT expression automatically recognises the GZip format, so you don't need to do anything special:
Extraction from compressed data
In general, the files are passed as is to the UDO. One exception is that EXTRACT will recognize GZip compressed files with the file extension .gz and automatically decompress them as part of the extraction process. The actual UDO will see the uncompressed data. For any other compression scheme, users will have to write their own custom extractor. Note that U-SQL has an upper limit of 4GB on a GZip compressed file. If you apply your EXTRACT expression to a file larger than this limit, the error E_RUNTIME_USER_MAXCOMPRESSEDFILESIZE is raised during the compilation of the job.
It looks like this feature has been available for a while, but was updated in Nov 2016 to introduce the 4GB limit. See here.
In addition, automatic compression on OUTPUT is on the roadmap; please add your vote to https://feedback.azure.com/forums/327234-data-lake/suggestions/13418367-support-gzip-on-output-as-well
Here is a simple example which converts a gzipped, comma-separated file to pipe-separated:
DECLARE @file1 string = @"/input/input.csv.gz";

@file =
    EXTRACT col1 string,
            col2 string,
            col3 string
    FROM @file1
    USING Extractors.Csv(silent : true);

@output =
    SELECT *
    FROM @file;

OUTPUT @output
TO "/output/output.txt"
ORDER BY col1
//FETCH 500 ROWS
USING Outputters.Text(quoting : false, delimiter : '|');
I am using the jazzlib package in a J2ME application to compress an xml file into zip format using ZipOutputStream, and then sending the compressed stream to the server as a string. I am able to unzip it on the mobile using ZipInputStream, but on the server I am not able to unzip it; I get an EOF exception. When I copy the compressed stream from the console and put it into a browser, the empty spaces show up as special characters like [] in the compressed stream. I don't understand what happened. Please help.
You send the compressed stream as a String? That's your problem(s) right there:
compressed data is binary data (i.e. byte[]).
String is designed to handle textual (Unicode) data and not arbitrary binary data
converting arbitrary binary data to String is bound to lead to problems
So if you want to handle (send/receive/...) binary data, make sure you never use a String/Reader/Writer to handle the data anywhere in the process. Stay with byte[]/InputStream/OutputStream.
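A quick Python illustration of the pitfall (Java's String behaves the same way): compressed output contains byte values that are not valid text, so a text round trip silently corrupts them. If a textual channel is truly unavoidable, Base64-encoding the bytes is the usual escape hatch:
import base64
import zlib

data = zlib.compress(b'some payload')

# Lossy: bytes that are not valid UTF-8 get replaced on decode,
# so the round trip no longer matches the original.
assert data.decode('utf-8', errors='replace').encode('utf-8') != data

# Safe: Base64 turns the bytes into plain ASCII and back, unchanged.
assert base64.b64decode(base64.b64encode(data)) == data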