Will gzip'd files use multiple AUs in Data Lake Analytics?

In the EXTRACT documentation there's the (awesome) auto-magic support for gzipped files (which we are using).
But should I assume it won't use more than one AU? If I understand correctly, files need to be "splittable" to be spread across AUs.
Or will it split across AUs once extracted on the fly, and/or do gzipped files have an index to indicate where they can be split?
Or perhaps I'm muddling the vertex concept with AUs?

This is a good question :).
In general, if the file format is splittable (e.g., basically row-oriented with rows smaller than the row size limit, which is currently 4MB), then large files will be split into 1GB per vertex.
However, GZip itself is not a splittable format. Thus we cannot split a GZip file during decompression, and we end up not splitting the processing of the decompressed file either (the current framework does not provide this). As a consequence, we limit the size of a GZip file to 4GB. If you want to scale out with GZip files, we recommend splitting the data into several GZip files and then using file sets to scale out processing.
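As a rough illustration of the "split the data into several GZip files" step, here is a minimal Python sketch (not part of U-SQL or any Azure tooling). It assumes the data is line-oriented text; the function name `split_gzip`, the `part-NNNNN.gz` naming, and the 512MB default are made up for the example.

```python
import gzip
import os


def split_gzip(src_path, out_dir, max_bytes=512 * 1024 * 1024):
    """Split one gzip'd text file into several gzip files on line boundaries.

    max_bytes is the approximate *uncompressed* size of each part
    (a hypothetical default, tune it to your job).
    Returns the list of part paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    written = 0
    with gzip.open(src_path, "rb") as src:
        for line in src:
            # Start a new part when none is open yet or the current one is full.
            if out is None or written >= max_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part-%05d.gz" % len(parts))
                out = gzip.open(path, "wb")
                parts.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

The source file is still read sequentially from the beginning (that part cannot be avoided with gzip), but the resulting parts can then be picked up together with a file set.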


Decompressing lots of small files and compressing them again for efficiency and to avoid S3 API costs

I have 1B+ gzip files (avg. 50 KB each) and I want to upload them to S3. As I need to pay for each write operation, transferring them to S3 becomes a huge cost problem. Also, those files are very similar, and I want to compress them together into larger files so that compression efficiency increases too.
I'm a newbie when it comes to writing shell scripts, but I'm looking for a way to:
Find all .gz files,
Decompress first 1K,
Compress in a single folder,
Delete this 1K batch,
Iterate to next 1K file,
I'd appreciate it if you could help me think more creatively about this. The only way I can think of is decompressing all of them and then compressing them in 1K chunks, but that is not possible as I don't have the disk space to decompress them all.
Test with a few files how much additional space is used when decompressing them. Try to free up more space (for example, move 90% of the files to another host).
When the files are similar, the compression ratio for 10% of the files will already be high.
I guess that 10 chunks will fit, but it will be tight every time you want to decompress one. So I would go for 100 chunks.
But first think what you want to do with the data in the future.
Never use it? Delete it.
Perhaps 1 time in the far future? Glacier.
Often? Use smaller chunks so you can find the right file easier.
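If disk space is the main constraint, one option is to never write the decompressed files to disk at all. The following is a minimal Python sketch of that idea, not a tested production script: it decompresses each small .gz in memory and appends it to a .tar.gz archive in batches of 1K. The names `pack_batch` and `pack_tree` and the `batch-NNNNNNN.tar.gz` naming are made up for the example, and it assumes every input file is small enough to hold in memory.

```python
import gzip
import io
import os
import tarfile


def pack_batch(root, gz_paths, archive_path, delete_originals=False):
    """Decompress each small .gz in memory and add it to one .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in gz_paths:
            with gzip.open(path, "rb") as f:
                data = f.read()  # assumes small files (~50 KB each)
            # Store the member under its path relative to root, minus ".gz".
            info = tarfile.TarInfo(name=os.path.relpath(path, root)[:-3])
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # Delete the originals only after the archive has been closed.
    if delete_originals:
        for path in gz_paths:
            os.remove(path)


def pack_tree(root, out_dir, batch_size=1000, delete_originals=False):
    """Walk root and pack every batch_size .gz files into one archive."""
    os.makedirs(out_dir, exist_ok=True)
    batch, archives = [], []

    def flush():
        archive = os.path.join(out_dir, "batch-%07d.tar.gz" % len(archives))
        pack_batch(root, batch, archive, delete_originals)
        archives.append(archive)
        batch.clear()

    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            # Skip .tar.gz so our own archives are never picked up as input.
            if name.endswith(".gz") and not name.endswith(".tar.gz"):
                batch.append(os.path.join(dirpath, name))
                if len(batch) == batch_size:
                    flush()
    if batch:
        flush()
    return archives
```

Because similar files end up in the same compressed stream, each archive should compress better than the individual .gz files did, and each archive is a single S3 write. Try it with `delete_originals=False` on a small directory first.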

Azure Synapse loading: Split large compressed files into smaller compressed files

I'm receiving this recommendation from Azure Synapse.
Recommendation details
We have detected that you can increase load throughput by splitting your compressed files that are staged in your storage account. A good rule of thumb is to split compressed files into 60 or more to maximize the parallelism of your load.
Looking at Azure's docs, this is the recommendation.
Preparing data in Azure Storage
To minimize latency, colocate your storage layer and your SQL pool.
When exporting data into an ORC File Format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns.
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8 and UTF-16 performance is minimal.
Split large compressed files into smaller compressed files.
What I'm trying to understand is: how can I split a large compressed file into smaller compressed files? Is there an option for that? Thanks!
You may want to check out this article: How to maximize COPY load throughput with file splits.
It’s recommended to load multiple files at once for parallel processing and maximizing bulk loading performance with SQL pools using the COPY statement.
File-splitting guidance is outlined in the following documentation and this blog covers how to easily split CSV files residing in your data lake using Azure Data Factory Mapping data flows within your data pipeline.
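If you would rather split the files yourself before staging them, the idea can be sketched in a few lines of Python. This is only an illustration, not the Data Factory approach from the blog: `split_csv_gz` and the `part-NNN.csv.gz` naming are made up, UTF-8 is assumed, and rows are re-serialized by the `csv` module, so quoting may differ slightly from the input. Whether each part should repeat the header depends on how your load is configured.

```python
import csv
import gzip
import os


def split_csv_gz(src_path, out_dir, parts=60, has_header=True):
    """Distribute the rows of one gzip'd CSV round-robin over `parts` gzip'd CSVs."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, "part-%03d.csv.gz" % i) for i in range(parts)]
    files = [gzip.open(p, "wt", newline="", encoding="utf-8") for p in paths]
    try:
        writers = [csv.writer(f) for f in files]
        with gzip.open(src_path, "rt", newline="", encoding="utf-8") as src:
            # csv.reader keeps quoted fields with embedded newlines in one row.
            reader = csv.reader(src)
            if has_header:
                header = next(reader, None)
                if header is not None:
                    for w in writers:
                        w.writerow(header)
            for i, row in enumerate(reader):
                writers[i % parts].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths
```

Round-robin keeps the parts roughly the same size without knowing the row count up front, at the cost of not preserving the original row order within a single file.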

Zip File - is it possible to paginate through the file data?

Say I have a really large zip file (80GB) containing one massive CSV file (> 200GB).
Is it possible to fetch a subsection of the 80GB file data, modify the central directory, and extract just that bit of data?
Background on my problem:
I have a cyclic process that does a summing on a certain column of a large zipped CSV file stashed in the cloud.
What I do today is stream the file to my disk, extract it, and then read the file line by line. This makes it a very disk-bound operation. Disk IS the bottleneck for sure.
Sure, I can leverage other cloud services to get what I need faster but that is not free.
I'm curious whether I can see speed gains by just taking 1GB subsections of the zip until there's nothing left to read.
What I know:
The Zip file is stored using the deflate compression algorithm (always)
In the API I use to get the file from the cloud, I can specify a byte range to filter to. This means I can seek through the bytes of a file without hitting disk!
According to the zip file specs, there are three major parts to a zip file, in order:
1: A header describing the file and its attributes
2: The raw file data in deflated format
3: The central directory listing the files and the bytes at which they start and stop
What I don't know:
How the deflate algorithm works exactly. Does it jumble the file up or does it just compress things in order of the original file? If it does jumble, this approach may not be possible.
Has anyone built a tool like this already?
You can always decompress starting from the beginning, going as far as you like, keeping only the last, say, 1 GB, once you get to where you want. You cannot just start decompressing somewhere in the middle. At least not with a normal .zip file that has not been very specially prepared somehow for random access.
The central directory has nothing to do with random access of a single entry. All it can do is tell you where an entry starts and how long it is (both compressed and uncompressed).
I would recommend that you reprocess the .zip file into a .zip file with many (~200) entries, each on the order of 1 GB uncompressed. The resulting .zip file will be very close to the same size, but you can then use the central directory to pick one of the 200 entries, randomly access it, and decompress just that one.
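A minimal sketch of that reprocessing step in Python, using only the standard `zipfile` module. The function name `rechunk_zip`, the `part-NNNN.csv` entry names, and the 1 GiB default are made up for the example, and it assumes the archive holds a single line-oriented entry. The source entry is streamed, so nothing is extracted to disk.

```python
import zipfile


def rechunk_zip(src_zip, dst_zip, entry_bytes=1 << 30):
    """Copy the first entry of src_zip into dst_zip as many smaller entries.

    Entries are cut only at line boundaries, each holding roughly
    entry_bytes of uncompressed data. Returns the new entry names.
    """
    names = []
    with zipfile.ZipFile(src_zip) as zin, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as zout:
        member = zin.infolist()[0]
        with zin.open(member) as src:
            out = None
            written = 0
            for line in src:
                # Open the next entry when none is open or the current one is full.
                if out is None or written >= entry_bytes:
                    if out is not None:
                        out.close()
                    name = "part-%04d.csv" % len(names)
                    # force_zip64 allows entries whose final size is not known up front.
                    out = zout.open(name, "w", force_zip64=True)
                    names.append(name)
                    written = 0
                out.write(line)
                written += len(line)
            if out is not None:
                out.close()
    return names
```

Afterwards a single entry can be read on its own, e.g. `zipfile.ZipFile(dst_zip).open("part-0042.csv")`, without decompressing anything that comes before it. Note that a CSV header row, if there is one, ends up only in the first entry.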

Impact of compression codec in Azure Data Lake

It's clear and well documented that the ability to split zip files has a big impact on the performance and parallelisation of jobs within Hadoop.
However Azure is built upon Hadoop and there is no mention of this impact anywhere that I can find in the Microsoft documentation.
Is this not an issue for ADL?
Is, for example, GZipping large files an acceptable approach now or am I going to run into the same issues of inability to parallelise my jobs due to choice of compression codec?
Please note that Azure Data Lake Analytics is not based on Hadoop.
RojoSam is correct that GZip is a bad compression format to parallelize over.
U-SQL does recognize .gz files automatically and does decompress them. However, there is a 4GB limit on the size of the compressed file (since we cannot split and parallelize processing it), and we recommend that you use files in the range of a few hundred MB to 1GB.
We are working on adding Parquet support. If you need other compression formats such as BZip: please file a request at http://aka.ms/adlfeedback.
It is not possible to start reading a GZip file from a random position. You always have to start reading from the beginning.
So if you have a big GZip file (or a file in another non-splittable compression format), you cannot read or process blocks of it in parallel, and you end up processing the whole file sequentially on a single machine.
The main idea of Hadoop (and other big data alternatives) relies on processing data in parallel on different machines. A big GZip file doesn't fit this approach.
There are some data formats, like Parquet, that compress data pages using GZip while keeping the file splittable (the pages can be processed on different machines, but each GZip block still has to be processed on a single machine).

Is the compression used for gzip and zip binary compatible?

I've got a lot of files which are being produced direct to gzip format. I'd like to re-package these into zip archives grouping many files together.
Given the size and volume of these files (0.5GB uncompressed per file and thousands of files) I really want to avoid decompressing these only to re-compress them if the only technical difference is the file headers.
Given that both file formats use / support the Deflate algorithm, is it possible for me to simply strip the file header off a gzip file and patch the compressed data into a zip file without re-encoding (but adding an appropriate header entry)? Or is there a difference in the binary encoding of the compressed data?
Yes, you can use the compressed data from gzip files directly to construct a zip file.
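Both formats store the same raw Deflate stream, and the gzip trailer already carries the CRC-32 and the uncompressed size that a zip entry needs. Here is a minimal Python sketch of the repackaging, under several assumptions: each gzip file has a single member, everything stays below the 4GB / 65535-entry limits (no Zip64), and each gzip file is read fully into memory. The function names `gzip_payload` and `gzips_to_zip` are made up for the example, and the timestamp is fixed at 1980-01-01.

```python
import struct


def gzip_payload(gz_bytes):
    """Return (raw_deflate, crc32, uncompressed_size) of a single-member gzip file."""
    if gz_bytes[:2] != b"\x1f\x8b" or gz_bytes[2] != 8:
        raise ValueError("not a deflate-compressed gzip file")
    flags = gz_bytes[3]
    pos = 10  # fixed header: magic(2) method(1) flags(1) mtime(4) xfl(1) os(1)
    if flags & 0x04:  # FEXTRA: 2-byte length followed by that many bytes
        (xlen,) = struct.unpack_from("<H", gz_bytes, pos)
        pos += 2 + xlen
    if flags & 0x08:  # FNAME: zero-terminated original file name
        pos = gz_bytes.index(b"\x00", pos) + 1
    if flags & 0x10:  # FCOMMENT: zero-terminated comment
        pos = gz_bytes.index(b"\x00", pos) + 1
    if flags & 0x02:  # FHCRC: 2-byte header checksum
        pos += 2
    crc, isize = struct.unpack("<II", gz_bytes[-8:])  # trailer: CRC-32, ISIZE
    return gz_bytes[pos:-8], crc, isize


def gzips_to_zip(entries, zip_path):
    """Write a zip from (name_in_zip, gzip_file_bytes) pairs without recompressing."""
    central = []
    with open(zip_path, "wb") as out:
        for name, gz_bytes in entries:
            data, crc, size = gzip_payload(gz_bytes)
            raw_name = name.encode("utf-8")
            offset = out.tell()
            # Local file header: method 8 = deflate, flag 0x0800 = UTF-8 name.
            out.write(struct.pack("<IHHHHHIIIHH", 0x04034B50, 20, 0x0800, 8,
                                  0, 0x21, crc, len(data), size, len(raw_name), 0))
            out.write(raw_name)
            out.write(data)
            # Matching central directory record, pointing back at the local header.
            central.append(struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50, 20, 20,
                                       0x0800, 8, 0, 0x21, crc, len(data), size,
                                       len(raw_name), 0, 0, 0, 0, 0, offset)
                           + raw_name)
        cd_offset = out.tell()
        cd = b"".join(central)
        out.write(cd)
        # End of central directory record.
        out.write(struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(central),
                              len(central), len(cd), cd_offset, 0))
```

For files of the size mentioned in the question you would want to stream the payload instead of slicing it in memory, and check the result with a regular zip tool before deleting any of the gzip files.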
