Azure Synapse loading: Split large compressed files into smaller compressed files

I'm receiving this recommendation from Azure Synapse.
Recommendation details
We have detected that you can increase load throughput by splitting your compressed files that are staged in your storage account. A good rule of thumb is to split compressed files into 60 or more files to maximize the parallelism of your load.
Looking at Azure's docs, this is the recommendation.
Preparing data in Azure Storage
To minimize latency, colocate your storage layer and your SQL pool.
When exporting data into an ORC File Format, you might get Java out-of-memory errors when there are large text columns. To work around this limitation, export only a subset of the columns.
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8 and UTF-16 performance is minimal.
Split large compressed files into smaller compressed files.
What I'm trying to understand is how I can split a large compressed file into smaller compressed files. Is there an option for that? Thanks!

You may check out this article: How to maximize COPY load throughput with file splits.
It’s recommended to load multiple files at once for parallel processing and to maximize bulk loading performance with SQL pools using the COPY statement.
File-splitting guidance is outlined in the documentation, and this blog covers how to easily split CSV files residing in your data lake using Azure Data Factory Mapping Data Flows within your data pipeline.
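One practical way to do the split, if you already have Spark available (e.g. in Synapse or Databricks), is to read the large compressed file, repartition it, and write it back out compressed. A minimal sketch, with illustrative account/container/file names and the target of 60 output files taken from the recommendation above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SplitCompressedStagingFiles").getOrCreate()

// Reading a single gzip file cannot be parallelized, so this step runs as one task.
val df = spark.read
  .option("header", "true")
  .csv("abfss://staging@myaccount.dfs.core.windows.net/input/large_file.csv.gz")

// Repartition to ~60 partitions so the output is written as ~60 smaller
// compressed files that the COPY statement can then load in parallel.
df.repartition(60)
  .write
  .option("header", "true")
  .option("compression", "gzip")
  .csv("abfss://staging@myaccount.dfs.core.windows.net/input/split/")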

Related

How to handle CSV files in the Bronze layer without the extra layer

If my raw data is in CSV format and I would like to store it in the Bronze layer as Delta tables then I would end up with four layers like Raw+Bronze+Silver+Gold. Which approach should I consider?
A bit of an open question; however, with respect to retaining the "raw" data in CSV, I would normally recommend it, as storing these data is usually cheap relative to the utility of being able to re-process if there are problems, or for purposes of data audit/traceability.
I would normally take the approach of compressing the raw files after processing, perhaps tar-balling them, and in addition moving them to colder/cheaper storage.

use of df.coalesce(1) in csv vs delta table

When saving to a delta table we avoid 'df.coalesce(1)', but when saving to csv or parquet we (my team) add 'df.coalesce(1)'. Is it a common practice? Why? Is it mandatory?
In most cases when I have seen df.coalesce(1), it was done to generate only one file, for example, to import a CSV file into Excel or a Parquet file into a Pandas-based program. But if you're doing .coalesce(1), then the write happens via a single task, and it becomes the performance bottleneck because you need to pull data from the other executors and write it.
If you're consuming data from Spark or another distributed system, having multiple files is beneficial for performance because you can write & read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As @pltc noticed, this may generate a big number of files, which is often not desirable because you'll get performance overhead from accessing them. So we need a balance between the number of files and their size - for Parquet and Delta (which is based on Parquet), having bigger files brings several performance advantages: you read fewer files, you can get better compression for the data inside the file, etc.
For Delta specifically, .coalesce(1) has the same problem as for other file formats - you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view - each node writes its data in parallel - but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as was correctly pointed out by @Kafels, there are some optimizations that allow you to remove that .coalesce(N) and do automatic tuning to achieve the best throughput (so-called "Optimized Writes"), and to create bigger files ("Auto Compaction") - but they should be used carefully.
Overall, the optimal file size for Delta is an interesting topic - if you have big files (1 GB is used by default by the OPTIMIZE command), you can get better read throughput, but if you're rewriting them with MERGE/UPDATE/DELETE, then big files are bad from a performance standpoint, and it's better to have smaller (16-64-128 MB) files, so you can rewrite less data.
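For reference, the Databricks optimizations mentioned above are usually enabled as Delta table properties; a minimal sketch, assuming a Databricks runtime that supports them and a SparkSession named spark (the table name is illustrative):

// Turn on Optimized Writes and Auto Compaction for one table
// (Databricks-specific Delta table properties).
spark.sql("""
  ALTER TABLE my_delta_table SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")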
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10 MB and you have 1000 partitions, for example, each file would be about 10 KB. Having so many small files would reduce Spark performance dramatically, not to mention that when you have too many files, you'll eventually hit the OS limit on the number of files. Anyhow, when your dataset is small enough, you should merge it into a couple of files by coalesce.
However, if your dataframe is 100 GB, technically you can still use coalesce(1) and save it to a single file, but later on you will have to deal with less parallelism when reading from it.
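To make that concrete, a small sketch along the same lines (the dataframes, sizes and paths are illustrative):

// ~10 MB dataframe: one or a few output files is convenient and harmless.
smallDf
  .coalesce(1)
  .write
  .option("header", "true")
  .csv("/output/small_report")

// ~100 GB dataframe: keep the write parallel and aim for reasonably sized
// files instead of funnelling everything through a single task.
largeDf
  .repartition(200)
  .write
  .parquet("/output/large_dataset")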

Optimal Data Lake File Partition Sizes

The Small File Problem gets referenced a lot when discussing performance issues with Delta Lake queries. Many sources recommend file sizes of 1GB for optimal query performance.
I know Snowflake is different than Delta Lake, but I think it's interesting that Snowflake's strategy contradicts the conventional wisdom. They rely on micro-partitions, which aim to be between 50MB and 500MB before compression.
Snowflake and Delta Lake have similar features:
File Pruning - Snowflake vs Delta Lake
Metadata about contents of file - Snowflake vs Delta Lake
Can anyone explain why Snowflake thrives on smaller files while conventional wisdom suggests that Delta Lake struggles?
Disclaimer: I'm not very familiar with Snowflake, so I can only speak based on the documentation & my experience with Delta Lake.
The small files problem usually arises when you're storing streaming data, or something similar, in formats like Parquet that rely solely on the file listing provided by the storage provider. With a lot of small files, listing them is very expensive, and it is often where most of the time is spent.
Delta Lake solves this problem by tracking the file names in the manifest files and then reaching the objects by file name, instead of listing all files and extracting the names from there. On Databricks, Delta has more optimizations for data skipping, etc., that can be achieved by using the metadata stored in the manifest files. As I see from the documentation, Snowflake has something similar under the hood.
Regarding file size: on Delta, the default target size is ~1 GB, but in practice it can be much lower, depending on the type of data stored and on whether the data needs to be updated with new data or not - when updating/deleting data you have to rewrite whole files, and if you have big files, you're rewriting more.
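On the Delta side, compaction into bigger files is normally done with the OPTIMIZE command rather than by hand; a minimal sketch, assuming Delta Lake 2.0+ or Databricks and a SparkSession named spark (the table and column names are illustrative):

// Compact small files into larger ones (the ~1 GB target mentioned above
// is the engine default; it can be tuned).
spark.sql("OPTIMIZE events")

// Optionally co-locate data on a commonly filtered column to improve
// file pruning, loosely comparable to Snowflake's clustering.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")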

Correct Parquet file size when storing in S3?

I've been reading a few questions on this topic, as well as several forums, and in all of them they seem to mention that each of the resulting .parquet files coming out of Spark should be either 64 MB or 1 GB in size, but I still can't make up my mind about which scenarios belong to each of those file sizes and the reasons behind them, apart from HDFS splitting them into 64 MB blocks.
My current testing scenario is the following.
dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)
I'm currently handling a total of 2.5 GB to 3 GB of daily data that will be split and saved into daily buckets per year. The reason behind 'n' being 4 or 48 is just testing: as I know the size of my testing set in advance, I try to get a number of files as close to 64 MB or 1 GB each as I can. I haven't implemented code to buffer the needed data until I reach the exact size I need prior to saving.
So my question here is...
Should I take the size that much into account if I'm not planning to use HDFS and merely store and retrieve data from S3?
And also, what would be the optimal size for daily datasets of around 10 GB maximum if I'm planning to use HDFS to store my resulting .parquet files?
Any other optimization tip would be really appreciated!
You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
Smaller split size
More workers can work on a file simultaneously. Speedup if you have idle workers.
More startup overhead: scheduling work, starting processing, committing tasks.
Creates more files from the output, unless you repartition.
Small files vs large files
Small files:
you get that small split whether or not you want it.
even if you use unsplittable compression.
takes longer to list files. Listing directory trees on s3 is very slow
impossible to ask for larger block sizes than the file length
easier to save if your s3 client doesn't do incremental writes in blocks (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true).
Personally, and this is opinion, and somewhat benchmark-driven - but not with your queries:
Writing
save to larger files.
with snappy.
shallower+wider directory trees over deep and narrow
Reading
play with different block sizes; treat 32-64 MB as a minimum
Hadoop 3.1: use the zero-rename committers. Otherwise, switch to v2.
if your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: spark.hadoop.fs.s3a.experimental.fadvise random).
save to larger files via .repartition().
Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
see also Improving Spark Performance with S3/ADLS/WASB
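Pulling a few of the settings from this answer together, a hedged sketch of how they might be passed to Spark (the values are illustrative; double-check the exact s3a property names against your Hadoop version's documentation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("S3AParquetTuning")
  // Split/block size used when reading from s3a (128 MB here).
  .config("spark.hadoop.fs.s3a.block.size", 128L * 1024 * 1024)
  // Incremental block uploads (Hadoop 2.8+).
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // Random IO for columnar formats (Hadoop 2.8+).
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()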

Impact of compression codec in Azure Data Lake

It's clear and well documented that the ability to split zip files has a big impact on the performance and parallelisation of jobs within Hadoop.
However Azure is built upon Hadoop and there is no mention of this impact anywhere that I can find in the Microsoft documentation.
Is this not an issue for ADL?
Is, for example, GZipping large files an acceptable approach now or am I going to run into the same issues of inability to parallelise my jobs due to choice of compression codec?
Thanks
Please note that Azure Data Lake Analytics is not based on Hadoop.
RojoSam is correct that GZip is a bad compression format to parallelize over.
U-SQL does recognize .gz files automatically and does decompress them. However, there is a 4 GB limit on the size of the compressed file (since we cannot split and parallelize processing of it), and we recommend that you use files in the range of a few hundred MB to 1 GB.
We are working on adding Parquet support. If you need other compression formats such as BZip: please file a request at http://aka.ms/adlfeedback.
It is not possible to start reading a GZip file from a random position; you always have to start reading from the beginning.
So, if you have a big GZip file (or another non-splittable compression format), you cannot read/process blocks of it in parallel, and you end up processing the whole file sequentially on a single machine.
The main idea of Hadoop (and other big data alternatives) relies on processing data in parallel on different machines. A big GZip file doesn't fit this approach.
Some data formats, such as Parquet, allow compressing data pages with GZip while keeping the file splittable (each page can be processed on a different machine, although each GZip block still has to be processed on a single machine).
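To illustrate that last point from Spark, writing Parquet with a GZip codec compresses the pages inside the file while the file itself stays splittable; a minimal sketch, assuming df is an existing DataFrame and the ADLS path is illustrative:

// GZip is applied inside the Parquet structure (per page/row group),
// so downstream jobs can still split the file across tasks.
df.write
  .option("compression", "gzip")
  .parquet("adl://myadls.azuredatalakestore.net/curated/events")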
