Impact of compression codec in Azure Data Lake - azure

It's clear and well documented that the ability to split zip files has a big impact on the performance and parallelisation of jobs within Hadoop.
However, Azure is built upon Hadoop, and I can find no mention of this impact anywhere in the Microsoft documentation.
Is this not an issue for ADL?
Is, for example, GZipping large files an acceptable approach now or am I going to run into the same issues of inability to parallelise my jobs due to choice of compression codec?
Thanks

Please note that Azure Data Lake Analytics is not based on Hadoop.
RojoSam is correct that GZip is a bad compression format to parallelize over.
U-SQL recognizes .gz files automatically and decompresses them. However, the compressed file is limited to 4 GB (since we cannot split it and parallelize its processing), and we recommend files in the range of a few hundred MB to 1 GB.
We are working on adding Parquet support. If you need other compression formats such as BZip, please file a request at http://aka.ms/adlfeedback.

It is not possible to start reading a GZip file from an arbitrary position; you always have to start reading from the beginning.
So if you have a big GZip file (or a file in another non-splittable compression format), you cannot read or process blocks of it in parallel, and you end up processing the whole file sequentially on a single machine.
The main idea of Hadoop (and other big data alternatives) relies on processing data in parallel on different machines. A big GZip file doesn't fit this approach.
Some data formats, such as Parquet, compress individual data pages with GZip and keep the file splittable: each page can be processed on a different machine, although each GZip block still has to be processed on a single machine.
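The point about reading from the beginning can be checked with Python's standard library. The following is a minimal sketch, not a benchmark: the sample data and the helper name `can_decompress_from` are made up for illustration. It shows that a gzip reader accepts the stream from offset 0 but rejects it when started part-way through.

```python
import gzip
import zlib

# Hypothetical sample data: many distinct lines so the stream is non-trivial.
original = b"".join(b"row %d,value %d\n" % (i, i * i) for i in range(50_000))
compressed = gzip.compress(original)


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip reader can start decoding at the given offset."""
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # The reader found no valid gzip header or stream at this position.
        return False


# Pick an offset near the middle that does not happen to look like a gzip header.
mid_offset = len(compressed) // 2
while compressed[mid_offset:mid_offset + 2] == b"\x1f\x8b":
    mid_offset += 1

print(can_decompress_from(compressed, 0))
print(can_decompress_from(compressed, mid_offset))
```

A worker that is handed only the second half of the file therefore has nothing it can decode, which is why one large GZip file ends up on a single machine.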

Related

is there a limit for pyspark read csv files?

I am relatively new to Spark/PySpark, so any help is much appreciated.
Currently we have files being delivered to Azure Data Lake hourly into a file directory, for example:
hour1.csv
hour2.csv
hour3.csv
I am using Databricks to read the files in the file directory using the code below:
sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)
Each of the CSV files is about 5 KB and all have the same schema.
What I am unsure about is how scalable "spark.read" is. Currently we are processing about 2000 of these small files, and I am worried that there is a limit on the number of files that can be processed. Is there a limit, such as a maximum of 5000 files, beyond which my code above breaks?
From what I have read online, I believe data size is not an issue with the method above: Spark can read petabytes of data (by comparison, our total data size is still very small). But I can find no mention of the number of files it is able to process. Correct me if I am wrong.
Any explanation is very much appreciated.
thank you
The limit is your driver's memory.
When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing to executors, but it collects the results either way).
After having the list of files, it creates tasks for the executors to run.
With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
You can always increase the driver's memory to handle it, or add a preprocessing step that merges the files (GCS has a gsutil compose command which can merge files without downloading them).
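As a rough illustration of such a merge step, here is a minimal sketch in plain Python. It assumes the small files are reachable as local paths and share one header row; the function name `merge_csv_files` and the paths in the usage note are hypothetical, and this is not a Spark or Azure API.

```python
import csv
from pathlib import Path
from typing import Iterable


def merge_csv_files(paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate CSV files that share a header; return the data rows written."""
    rows_written = 0
    header_written = False
    with out_path.open("w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in sorted(paths):
            with path.open(newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    # Keep the header from the first non-empty file only.
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, `merge_csv_files(Path("landing").glob("hour*.csv"), Path("merged/day.csv"))` would turn a day's worth of hourly files into one file, so the driver has one path to list instead of many.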

use of df.coalesce(1) in csv vs delta table

When saving to a Delta table we avoid 'df.coalesce(1)', but when saving to CSV or Parquet we (my team) add 'df.coalesce(1)'. Is it a common practice? Why? Is it mandatory?
In most cases where I have seen df.coalesce(1), it was done to generate a single file, for example to import a CSV file into Excel or to load a Parquet file into a Pandas-based program. But with .coalesce(1) the write happens in a single task, which becomes the performance bottleneck because that task has to pull the data from the other executors and write it.
If the data is consumed by Spark or another distributed system, having multiple files is beneficial for performance because you can write and read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As @pltc noticed, this may generate a large number of files, which is often undesirable because accessing them adds overhead. So we need a balance between the number of files and their size. For Parquet and Delta (which is based on Parquet), bigger files bring several performance advantages: you read fewer files, you get better compression for the data inside each file, and so on.
For Delta specifically, .coalesce(1) has the same problem as for other file formats: you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view, since each node writes its data in parallel, but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as @Kafels correctly pointed out, there are optimizations that let you drop that .coalesce(N): "Optimized Writes" tunes the write automatically to achieve the best throughput, and "Auto compaction" creates bigger files. Both should be used carefully.
Overall, optimal file size for Delta is an interesting topic. With big files (the OPTIMIZE command targets 1 GB by default) you get better read throughput, but if you rewrite them with MERGE/UPDATE/DELETE, big files are bad from a performance standpoint, and it's better to have smaller files (16, 64, or 128 MB) so that you rewrite less data.
TL;DR: it's not mandatory; it depends on the size of your dataframe.
Long answer:
If your dataframe is 10 MB and you have, for example, 1000 partitions, each file would be about 10 KB. Having so many small files reduces Spark performance dramatically, and with too many files you will eventually hit the OS limit on the number of files. Anyhow, when your dataset is small enough, you should merge it into a couple of files with coalesce.
However, if your dataframe is 100 GB, you technically can still use coalesce(1) and save to a single file, but later on you will have to deal with less parallelism when reading from it.
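The sizing argument in these answers comes down to simple arithmetic. The sketch below is one way to turn it into a partition count. The helper name `partitions_for_target_size` and the target sizes are illustrative assumptions, not Spark defaults, and the total dataframe size has to be estimated separately.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def partitions_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many output files keep each file at or below the target size."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    # Integer ceiling division, with a floor of one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# A 10 MB dataframe fits in one file at a 128 MB target.
small = partitions_for_target_size(10 * MB, 128 * MB)
# A 100 GB dataframe at a 1 GB target is split into many files.
large = partitions_for_target_size(100 * GB, 1 * GB)
print(small, large)
```

The result can then be passed to `df.coalesce(n)` before the write. Note that `coalesce` only reduces the number of partitions; to increase it, `df.repartition(n)` is needed instead.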

Azure Synapse loading: Split large compress files to smaller compressed files

I'm receiving this recommendation from Azure Synapse.
Recommendation details
We have detected that you can increase load throughput by splitting your compressed files that are staged in your storage account. A good rule of thumb is to split compressed files into 60 or more to maximize the parallelism of your load.
Looking at Azure's docs, this is the recommendation.
Preparing data in Azure Storage
To minimize latency, colocate your storage layer and your SQL pool.
When exporting data into an ORC File Format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns.
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8 and UTF-16 performance is minimal.
Split large compressed files into smaller compressed files.
What I'm trying to understand is how I can split a large compressed file into smaller compressed files. Is there an option for that? Thanks!
You may check out this article: "How to maximize COPY load throughput with file splits".
It’s recommended to load multiple files at once for parallel processing and maximizing bulk loading performance with SQL pools using the COPY statement.
File-splitting guidance is outlined in the following documentation and this blog covers how to easily split CSV files residing in your data lake using Azure Data Factory Mapping data flows within your data pipeline.
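Outside Data Factory, the same splitting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is gzip-compressed, newline-delimited text with no header row and no quoted fields that contain line breaks. The function name `split_gzip_stream` and the file names in the usage note are hypothetical.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def split_gzip_stream(src: BinaryIO, lines_per_part: int) -> Iterator[bytes]:
    """Yield independent gzip blobs, each holding up to lines_per_part lines."""
    if lines_per_part <= 0:
        raise ValueError("lines_per_part must be positive")
    lines = []
    with gzip.open(src, "rb") as reader:
        for line in reader:
            lines.append(line)
            if len(lines) == lines_per_part:
                # Each part is compressed on its own, so it can be loaded alone.
                yield gzip.compress(b"".join(lines))
                lines = []
    if lines:
        yield gzip.compress(b"".join(lines))


# Small in-memory demonstration: 10 lines split into parts of at most 4 lines.
data = b"".join(b"%d,x\n" % i for i in range(10))
parts = list(split_gzip_stream(io.BytesIO(gzip.compress(data)), 4))
print(len(parts))
```

With a real file, open it in binary mode, pass the handle as `src`, and write each yielded blob to its own file (for example `part-0001.csv.gz`, `part-0002.csv.gz`, and so on) before staging the parts for the load.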

Correct Parquet file size when storing in S3?

I've been reading a few questions on this topic, as well as several forums, and all of them seem to say that each .parquet file produced by Spark should be either 64 MB or 1 GB in size. I still can't work out which scenarios call for which of those sizes, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // n is 4 or 48; reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5 GB to 3 GB of daily data, which will be split and saved into daily buckets per year. 'n' is 4 or 48 purely for testing purposes: since I know the size of my test set in advance, I try to get each file as close to 64 MB or 1 GB as I can. I haven't implemented code to buffer the data until I reach the exact size I need prior to saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
And also, which should be the optimal size for daily datasets of around 10GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead scheduling work, starting processing, committing tasks
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on s3 is very slow
impossible to ask for larger block sizes than the file length
easier to save if your s3 client doesn't do incremental writes in blocks. (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload true.)
Personally (this is opinion, partly benchmark-driven, but not with your queries):
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
Hadoop 3.1, use the zero-rename committers. Otherwise, switch to v2
if your FS connector supports this, make sure random IO is turned on (hadoop-2.8 + spark.hadoop.fs.s3a.experimental.input.fadvise random)
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
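The settings mentioned in this answer can be collected in one place. The sketch below builds them as plain strings. The helper name `s3a_spark_conf` and the 64 MB default are assumptions for illustration, and the property names are written as I understand them from the Hadoop S3A and MapReduce documentation, so check them against the Hadoop version you actually run.

```python
def s3a_spark_conf(block_size_mb: int = 64) -> dict:
    """Build Spark settings that pass S3A options through to Hadoop."""
    if block_size_mb <= 0:
        raise ValueError("block_size_mb must be positive")
    return {
        # Split size reported by the s3a connector, in bytes.
        "spark.hadoop.fs.s3a.block.size": str(block_size_mb * 1024 * 1024),
        # Random IO policy, which suits columnar formats such as Parquet.
        "spark.hadoop.fs.s3a.experimental.input.fadvise": "random",
        # The "v2" commit algorithm, for setups without zero-rename committers.
        "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    }


conf = s3a_spark_conf()
for key, value in sorted(conf.items()):
    print(key, value)
```

Each pair can be applied when the session is created, for example by calling `builder.config(key, value)` on `SparkSession.builder` for every entry, or with `--conf key=value` on `spark-submit`.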

Will gzip'd files use multiple AU's in DataLake Analytics?

In the EXTRACT documentation there's the (awesome) auto-magic support for gzipped files (which we are using).
But should I assume it won't use more than one AU? If I understand correctly, the files need to be "splittable" to spread across AUs.
Or will it split across AUs once extracted on the fly, and/or do gzipped files have an index to indicate where they can be split somehow?
Or perhaps I'm muddling the vertex concept with AUs?
This is a good question :).
In general, if the file format is splittable (e.g., basically row-oriented with rows being smaller than the row size limit, which currently is 4 MB), then large files will be split into 1 GB per vertex.
However, GZip itself is not a splittable format. Thus we cannot split a GZip file during decompression, and we end up not splitting the processing of the decompressed file either (the current framework does not provide this). As a consequence, we limit the size of a GZip file to 4 GB. If you want to scale out with GZip files, we recommend splitting the data into several GZip files and then using file sets to scale out processing.
