I have nested JSON as the source, in gzip format. In the Synapse pipeline I am using a data flow activity, and I have set the compression type to gzip on the source dataset. The pipeline was executing fine for small files under 10 MB, but when I tried to run it for a large gzip file of about 89 MB, the data flow activity failed with the below error:
Error1 {"message":"Job failed due to reason: Cluster ran into out of memory issue during execution,
please retry using an integration runtime with bigger core count and/or memory optimized compute type.
Details:null","failureType":"UserError","target":"df_flatten_inf_provider_references_gz","errorCode":"DF-Executor-OutOfMemoryError"}
Requesting your help and guidance.
To resolve Error1, I tried an Azure integration runtime with a bigger core count (128 + 16 cores) and the memory optimized compute type, but I still get the same error.
I thought it might be too intensive to read the JSON directly from gzip, so I tried a basic Copy data activity to decompress the gzip file first, but it still fails with the same error.
For your scenario, I would recommend not pulling all of the data from one large JSON file, but instead pulling from several smaller JSON files. First partition your big JSON file into a few parts with a data flow using the round robin partitioning technique, and store these files in a folder in blob storage.
Round robin distributes the data evenly across partitions. Use it when you don't have good key candidates but still want a sensible partitioning scheme. The number of physical partitions is configurable.
You need to evaluate the size of the input data and its current partition count, then set a reasonable partition number under "Optimize". For example, if the cluster used in the data flow execution has 8 cores with 20 GB of memory per core, but the input data is 1000 GB in 10 partitions, running the data flow directly will hit the OOM issue because 1000 GB / 10 > 20 GB; it is better to set the repartition number to 100 (1000 GB / 100 < 20 GB).
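As a rough sanity check, the arithmetic above can be expressed as a small helper (a minimal sketch with hypothetical numbers, assuming the data is spread roughly evenly across partitions):

import math

def min_partitions(input_gb: float, mem_per_core_gb: float) -> int:
    """Smallest partition count for which a single partition fits in one core's memory.
    Assumes data is distributed roughly evenly across partitions."""
    return math.ceil(input_gb / mem_per_core_gb)

# Example from above: 1000 GB of input, 20 GB of memory per core.
print(min_partitions(1000, 20))  # 50 -> choosing ~100 partitions leaves comfortable headroom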
After the above process, use a ForEach activity to run your data flow operations over each partitioned file, and at the end merge the outputs into a single file.
Reference: Partition in dataflow.
Related
We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored as many small files, each 800 KB - 1.5 MB in size. Because of this the job is split into a very large number of tasks and takes a long time to complete.
We have tried Spark tuning configurations such as broadcast joins, changing the partition size, changing max records per file, etc., but there is no performance improvement with these methods and the issue is not fixed. Using coalesce makes the job get stuck at that stage with no progress.
Please view this link for Spark UI metrics screenshot, https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You get one file for every Spark partition, and you have 33,479 partitions in the final stage where you're writing the output. 33k was probably the right number of partitions for your join, but not the right number for your write.
You need to add another stage to your job after the join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that produces 32 MB - ~128 MB files).
Something like a coalesce, or repartition. Maybe even a sort :(
You want to target ~350 partitions.
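A minimal PySpark sketch of that extra stage, assuming the join output is a DataFrame called joined_df and the ~350-partition target above (paths and names are placeholders, not your actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# joined_df stands in for the output of the large join described above.
joined_df = spark.read.parquet("/tmp/joined")  # placeholder input

# Reduce the ~33k join partitions down to ~350 before writing so each
# output file lands in the 32 MB - ~128 MB range. coalesce avoids a full
# shuffle; use repartition(350) instead if the join output is heavily skewed.
joined_df.coalesce(350).write.mode("overwrite").parquet("/tmp/joined_compacted")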
This diagram shows what you want to do, either manually or automatically (with Spark on Databricks).
If you're using Databricks then it's easy: with Delta Lake you can turn on Auto Optimize.
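If the sink is a Delta table, turning that on looks roughly like this (the table name is hypothetical, and the property names are the Auto Optimize settings as I recall them from the Databricks docs, so double-check them for your runtime):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Auto Optimize on an existing Delta table (table name is a placeholder).
spark.sql("""
    ALTER TABLE my_db.join_output
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")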
We have a use case where we receive a large volume of data (80 GB split across 300 files, arriving every 5 minutes) in ADLS Gen2, and we use the Spark connector to write from ADLS Gen2 to a Kusto table.
During the write stage, we noticed that multiple cores are used to batch the entire dataset but only one core is used to write to the Kusto table, i.e., 80 GB is written with only one core while the remaining cores sit idle.
This process takes a good 20-25 minutes, and we have a tight SLA of 10 minutes.
Azure Databricks cluster: 5 nodes, each with 28 GB RAM and 8 CPU cores.
Each file is ~260 MB uncompressed, in Parquet format. I have also seen a best practices document which says file sizes should be between 100 MB and 1 GB uncompressed.
We are using the writeStream API in Databricks to write the data.
What is the ideal approach to write the data from ADLS to ADX in a distributed way using the Spark connector?
First, the most efficient flow from ADLS storage to ADX is Event Grid, because writing through the Spark connector means the data is translated into Spark's internal format and then into CSV, which is sent to ADX. From the conversation with you it was clear you are using Spark to transform the data before ingestion; in that case the Spark connector is a good choice.
From version 3.1.0 the connector flow is split by default into three jobs (unless writeMode.Queued is used). The first translates the data into CSV, writes it to storage, and queues an ingestion for ADX; this is done in a distributed fashion. The second stage polls on these ingestions until they all finish successfully to ensure transactionality; this uses one core, as the operation is really cheap (a call to table storage) and there's no need to hold more than one worker for it. The third stage seals the transaction (a metadata operation in ADX), and therefore also needs only one core.
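For reference, a hedged PySpark sketch of a connector write in this mode; the cluster, database, table, and credential values are placeholders, and the option keys follow the azure-kusto-spark documentation as I recall it, so verify them against the connector version you run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/input/")  # placeholder source

# Distributed write through the Kusto Spark connector (>= 3.1.0 flow described above).
(df.write
   .format("com.microsoft.kusto.spark.datasource")
   .option("kustoCluster", "<cluster>.<region>")
   .option("kustoDatabase", "<database>")
   .option("kustoTable", "<table>")
   .option("kustoAadAppId", "<app-id>")
   .option("kustoAadAppSecret", "<app-secret>")
   .option("kustoAadAuthorityID", "<tenant-id>")
   # Default is the transactional three-job flow; "Queued" skips the polling and
   # sealing stages (assumption -- check the connector docs for your version).
   # .option("writeMode", "Queued")
   .mode("Append")
   .save())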
I have JSON exported from RavenDB which is not valid JSON because it contains duplicate columns.
So my first step is to clean the JSON and, where there are duplicates, produce a separate JSON file for each.
I was able to do this for a sample file and it ran successfully. Then I tried a 12 MB file and it also worked.
But when I try it on a full DB backup file, which is 10 GB in size, it gives an error.
This 10 GB file generates 3 separate JSON files because it contains the DOCS column 3 times.
The first file is 9.6 GB and the other 2 files are small, around 120 MB and 10 KB.
When I try to load the first file into Synapse DWH, I get the below error.
Job failed due to reason: Cluster ran into out of memory issue during execution. Also, Please note that the dataflow has one or more custom partitioning schemes. The transformation(s) using custom partition schemes: Json,Select1,FlattenDocsCS,Flatten2,Filter1,ChangeDataTypesDateColumns,CstomsShipment. 1. Please retry using an integration runtime with bigger core count and/or memory optimized compute type. 2. Please retry using different partitioning schemes and/or number of partitions.
I published the pipeline so that I am not running in debug mode on a small cluster.
I changed the cluster size to 32 cores and tried every possible partitioning scheme in the Optimize tab.
But I am still getting the error.
Kindly help.
Note: As mentioned in the error message:
Please retry using an integration runtime with bigger core count and/or memory optimized compute type.
Successful execution of data flows depends on many factors, including the compute size/type, the number of sources/sinks to process, the partition specification, the transformations involved, the sizes of the datasets, the data skewness and so on.
Increasing the cluster size:
Data flows distribute the data processing over different nodes in a Spark cluster to perform operations in parallel. A Spark cluster with more cores increases the number of nodes in the compute environment. More nodes increase the processing power of the data flow. Increasing the size of the cluster is often an easy way to reduce the processing time.
MSFT Doc- Integration Runtime Performance | Cluster Size - here
Please retry using different partitioning schemes and/or number of partitions.
Note: Manually setting the partitioning scheme reshuffles the data and can offset the benefits of the Spark optimizer. A best practice is to not manually set the partitioning unless you need to.
By default, Use current partitioning is selected, which instructs the service to keep the current output partitioning of the transformation. As repartitioning data takes time, Use current partitioning is recommended in most scenarios. Scenarios where you may want to repartition your data include after aggregates and joins that significantly skew your data, or when using Source partitioning on a SQL DB.
MSFT Doc- Data Flow Tuning Performance - here.
This should help take your performance tuning to the next level; the error message itself already describes what to address.
I am working on a project where I have to read S3 files (each about 3MB zipped) using boto3. I have a small pyspark script that runs every hour to process the file and generate 2 types of output data which is written back to S3. The pyspark script uses 'xmltodict' python library to read some static data into a dictionary object needed for file processing. I have a small Amazon EMR cluster v5.28 running with 1 Master and 1 Core. This might be excessive but is not my main concern right now.
Questions:
1. How do I know IF I should partition the data? I have read articles on how many partitions to create, etc., but couldn't find anything on IF and WHEN. What criteria drive partitioning - the number of rows, columns, data types, the actions taken in the script, etc. in the source data file? I read the source file into an RDD, convert it to a DataFrame, and perform various operations: adding columns, grouping data, counting data, etc. How does Spark handle partitioning behind the scenes?
2. Currently, I manually execute the pyspark script as follows:
spark-submit --master spark://x.x.x.x:7077 --deploy-mode client test.py
on the master node, as I have decided to stick with the standalone cluster manager. The 'xmltodict' library is installed on this node, but not on the Core node. It doesn't seem like it needs to be installed there (or even Python 3 configured), since I am not seeing any errors. Is that correct, and can somebody shed some light on this? I tried to install the Python libraries via a shell file as a bootstrap action when I created the cluster, but it failed, and quite frankly, after trying it a few times, I gave up.
3. Regarding partitioning, I think I am slightly confused about whether to use coalesce() or collect(). Again, the question is when to use them and when not to?
Sorry for so many questions. Now that I have the pyspark script written, I am trying to improve its efficiency.
Thanks
Partitioning is the mechanism by which data is divided into optimally sized chunks, and multiple tasks are run, each processing one piece of the data. As you can see, this is the core of parallelism; without it there is no significant benefit from Spark (or any big data processing framework). Most file formats are splittable, and some remain splittable when compressed, such as Avro, Parquet, ORC, etc. Other formats are not splittable when compressed, such as ZIP, gzip, etc. Based on the size of the files being processed and whether they can be split, Spark automatically creates multiple partitions and processes the data in parallel. In your case, since the data is zipped, one file will be one partition, and no more than one CPU can work on it at a time. If the zip is small that's fine, but if it is big, its processing will be slow.
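To make that concrete, here is a minimal PySpark sketch (paths are placeholders): reading a single gzipped file yields one partition, and a repartition right after the read spreads the downstream work across cores.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# gzip is not splittable, so this DataFrame starts life as a single partition.
df = spark.read.json("s3://my-bucket/input/data.json.gz")  # placeholder path
print(df.rdd.getNumPartitions())  # typically 1 for a single .gz file

# Repartition once after the read so subsequent transformations run in parallel.
df = df.repartition(8)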
I currently have a spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning it into k (the number of nodes) segments and saving them back to blob as Parquet files. This way, I could load the different parts of the dataset in parallel and then union them... However, I am unsure whether all the data is just loaded on the head node and only distributed to the other machines when computation occurs. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.
Generally, you will want smaller file sizes on blob storage so that you can transfer data from blob storage to compute in parallel and get faster transfer rates. A good rule of thumb is a file size between 64 MB and 256 MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your call to read the file and then save it back as Parquet (with the default snappy compression codec) is a good idea. Parquet is natively used by Spark and is often faster to query against. The only tweak would be to partition by target file size rather than by the number of worker nodes. The data is loaded onto the worker nodes, and partitioning is helpful because more tasks are created to read more files.
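A hedged sketch of that flow (container, account, and paths are placeholders): read the CSV once, repartition by target file size rather than node count, and write Parquet with the default snappy compression.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ~1.5 GB CSV split into ~128 MB output files => roughly 12 partitions.
df = spark.read.csv("wasbs://container@account.blob.core.windows.net/data.csv",
                    header=True, inferSchema=True)

num_files = 12  # ~1.5 GB / ~128 MB per file
(df.repartition(num_files)
   .write.mode("overwrite")
   .parquet("wasbs://container@account.blob.core.windows.net/data_parquet/"))

# Later reads pick up all 12 files in parallel across the worker nodes.
parquet_df = spark.read.parquet("wasbs://container@account.blob.core.windows.net/data_parquet/")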