DF-Executor-OutOfMemoryError in Synapse pipeline - Azure

I have a JSON export from RavenDB that is not valid JSON because it contains duplicate keys.
My first step is to clean the JSON and, where there are duplicates, write a separate JSON file for each one.
I was able to do this for a sample file and it ran successfully.
I then tried a 12 MB file and it also worked.
But when I try it on a full DB backup file, which is 10 GB in size, it fails.
The 10 GB file generates 3 separate JSON files because it contains the DOCS key 3 times.
The first file is 9.6 GB and the other 2 are small, around 120 MB and 10 KB.
When I try to load the first file into Synapse DWH, I get the error below.
Job failed due to reason: Cluster ran into out of memory issue during execution. Also, Please note that the dataflow has one or more custom partitioning schemes. The transformation(s) using custom partition schemes: Json,Select1,FlattenDocsCS,Flatten2,Filter1,ChangeDataTypesDateColumns,CstomsShipment. 1. Please retry using an integration runtime with bigger core count and/or memory optimized compute type. 2. Please retry using different partitioning schemes and/or number of partitions.
I published the pipeline so that I am not running in debug mode on a small cluster.
I changed the cluster size to 32 cores and tried every possible partition scheme in the Optimize tab.
But I am still getting the error.
Kindly help.

Note: As mentioned in the error message:
Please retry using an integration runtime with bigger core count and/or memory optimized compute type.
Successful execution of data flows depends on many factors, including the compute size/type, the number of sources/sinks to process, the partition specification, the transformations involved, the sizes of the datasets, the data skewness, and so on.
Increasing the cluster size:
Data flows distribute the data processing over different nodes in a Spark cluster to perform operations in parallel. A Spark cluster with more cores increases the number of nodes in the compute environment. More nodes increase the processing power of the data flow. Increasing the size of the cluster is often an easy way to reduce the processing time.
MSFT Doc: Integration Runtime Performance | Cluster Size - here.
Please retry using different partitioning schemes and/or number of partitions.
Note: Manually setting the partitioning scheme reshuffles the data and can offset the benefits of the Spark optimizer. A best practice is to not manually set the partitioning unless you need to.
By default, Use current partitioning is selected, which instructs the service to keep the current output partitioning of the transformation. As repartitioning data takes time, Use current partitioning is recommended in most scenarios. Scenarios where you may want to repartition your data include after aggregates and joins that significantly skew your data, or when using Source partitioning on a SQL DB.
MSFT Doc: Data Flow Tuning Performance - here.
These pointers should take your performance tuning to the next level; the error message itself already describes them well.
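If you want to see what the repartitioning advice above means in plain Spark terms (for example in a Synapse Spark notebook rather than the data flow Optimize tab), a minimal sketch might look like the following; the storage paths, column names and partition count are hypothetical:

```python
# Minimal sketch (hypothetical paths/columns) of the guidance above: keep the default
# partitioning most of the time, and only repartition explicitly after an operation
# that badly skews the data, such as an aggregate or join.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

docs = spark.read.json("abfss://data@mystorage.dfs.core.windows.net/cleaned/docs.json")  # hypothetical
ref = spark.read.parquet("abfss://data@mystorage.dfs.core.windows.net/ref/")             # hypothetical

joined = docs.join(ref, "customer_id")   # a join like this can leave very uneven partitions
balanced = joined.repartition(200)       # explicit round-robin reshuffle, only because of the skew

balanced.write.mode("overwrite").parquet("abfss://data@mystorage.dfs.core.windows.net/sink/")
```

The point is the same as in the quoted guidance: stay on the default partitioning unless a skewing operation forces your hand.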

Related

Synapse Pipeline : DF-Executor-OutOfMemoryError

I have nested JSON as the source, in gzip format. In the Synapse pipeline I am using a dataflow activity, and I have set the compression type to gzip in the source dataset. The pipeline executed fine for small files under 10 MB, but when I tried to execute it for a large gzip file of about 89 MB, the dataflow activity failed with the error below:
Error1 {"message":"Job failed due to reason: Cluster ran into out of memory issue during execution,
please retry using an integration runtime with bigger core count and/or memory optimized compute type.
Details:null","failureType":"UserError","target":"df_flatten_inf_provider_references_gz","errorCode":"DF-Executor-OutOfMemoryError"}
Requesting your help and guidance.
To resolve Error1, I tried an Azure integration runtime with a bigger core count (128+16 cores) and the memory optimized compute type, but I still get the same error.
I thought it might be too intensive to read the JSON directly from gzip, so I tried a basic Copy Data activity to decompress the gzip file first, but it still fails with the same error.
For your scenario I would recommend that, instead of pulling all the data from one big JSON file, you pull it from smaller JSON files. First partition your big JSON file into a few parts with a dataflow using the round-robin partitioning technique, and store these files in a folder in blob storage.
Round robin distributes the data evenly across partitions. Use it when you don't have good key candidates and still want a decent partitioning scheme in place. The number of physical partitions is configurable.
You need to evaluate the data size or the partition count of the input data, then set a reasonable partition number under Optimize. For example, if the cluster used for the data flow execution has 8 cores with 20 GB of memory per core, but the input data is 1000 GB split into 10 partitions, running the data flow directly will hit the OOM issue because 1000 GB / 10 > 20 GB; it is better to set the repartition number to 100 (1000 GB / 100 < 20 GB).
After the above process, use these partitioned files to perform the dataflow operations with a ForEach activity, and finally merge them back into a single file.
Reference: Partition in dataflow.
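If you prefer to do the splitting in code rather than in a data flow, here is a minimal sketch of the same round-robin idea, assuming a Synapse Spark notebook; the ADLS paths and the size numbers are hypothetical and reuse the 1000 GB / 20 GB-per-core example above:

```python
# Minimal sketch (hypothetical paths and sizes): split one large JSON file into evenly
# sized round-robin parts so each part stays well below the per-core memory of the cluster.
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_path = "abfss://data@mystorage.dfs.core.windows.net/raw/big_backup.json"   # hypothetical
target_path = "abfss://data@mystorage.dfs.core.windows.net/staged/parts/"         # hypothetical

df = spark.read.json(source_path)

# Size the parts from the input volume, e.g. 1000 GB of input and ~10 GB per part.
input_size_gb = 1000            # replace with the real file size
target_part_size_gb = 10
num_parts = math.ceil(input_size_gb / target_part_size_gb)

# repartition(n) without a key column does a round-robin shuffle, so rows spread evenly.
df.repartition(num_parts).write.mode("overwrite").json(target_path)
```

repartition(n) without a key column performs a round-robin shuffle, so the resulting part files come out roughly equal in size.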

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in many small files, each 800 KB-1.5 MB in size. Because of this, the job is split into a huge number of tasks and takes a long time to complete.
We have tried Spark tuning options such as broadcast joins, changing the partition size, changing max records per file, etc., but there is no performance improvement with these methods and the issue is not fixed. Using coalesce makes the job get stuck at that stage with no progress.
Please view this link for a Spark UI metrics screenshot: https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You get one file for every Spark partition, and you have 33,479 in the final stage where you write the output. 33k partitions was probably the right number for your join, but not the right number for your write.
You need to add another stage to your job that comes after the join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that outputs 32 MB - ~128 MB files).
Something like a coalesce, or a repartition. Maybe even a sort :(
You want to target ~350 partitions.
This is what you want to do, either manually or automatically (with Spark on Databricks).
If you're using Databricks, it's easy: with Delta Lake you can turn on Auto Optimize.
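A minimal sketch of that extra stage, with hypothetical table names and output path (coalesce shown first, repartition noted as the fallback since coalesce was getting stuck for you):

```python
# Minimal sketch (hypothetical tables/path): shrink ~33k shuffle partitions down to ~350
# just before the write, so each output file lands in the 32-128 MB range instead of ~1 MB.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The join keeps whatever parallelism it needs (e.g. your ~33k shuffle partitions).
joined = (
    spark.table("big_table")                      # hypothetical ~500 GB table
         .join(spark.table("dim_table"), "id")    # hypothetical join key
)

# Extra stage just for the write: coalesce(350) merges partitions without a full shuffle;
# if it makes the upstream stage lose parallelism and stall (as you saw), use
# repartition(350) instead, which pays for a shuffle but keeps the join fully parallel.
joined.coalesce(350).write.mode("overwrite").parquet("/output/joined/")
```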

Shuffle Stage Failing Due To Executor Loss

I get the following error when my spark jobs fails **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead."**
Overview of my Spark job:
input size is ~35 GB
I have broadcast-joined all the smaller tables with the mother table into, say, dataframe1, and then I salted each big table and dataframe1 before joining it with dataframe1 (left table).
profile used:
```python
#configure(profile=[
    'EXECUTOR_MEMORY_LARGE',
    'NUM_EXECUTORS_32',
    'DRIVER_MEMORY_LARGE',
    'SHUFFLE_PARTITIONS_LARGE'
])
```
Using the above approach and profiles I was able to get the runtime down by 50%, but I still get "Shuffle Stage Failing Due To Executor Loss" issues.
Is there a way I can fix this?
There are multiple things you can try:
Broadcast joins: If you have used broadcast hints to join multiple smaller tables, then the resulting table (built from many smaller tables) might be too large to fit in each executor's memory, so you need to look at the total size of dataframe1.
35 GB is really not huge. Also try the profile "EXECUTOR_CORES_MEDIUM", which really increases the parallelism of the data computation. Use dynamic allocation (16 executors should be fine for 35 GB) rather than static allocation; if 32 executors are not available at once, the build doesn't start. "DRIVER_MEMORY_MEDIUM" should be enough.
Spark 3.0 handles skew joins by itself with Adaptive Query Execution (AQE), so you do not need the salting technique. There is a profile called "ADAPTIVE_ENABLED" in Foundry that you can use. Other AQE settings you will have to set manually through the "ctx" Spark context object readily available in Foundry (see the sketch after the references below).
Some references for AQE:
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe
https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
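A minimal sketch of turning AQE and its skew-join handling on by hand; it assumes plain PySpark (in Foundry you would get the session from ctx instead), and the table and key names are hypothetical:

```python
# Minimal sketch (hypothetical tables/key): enable AQE and automatic skew-join handling
# instead of salting the large tables by hand (Spark 3.0+).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # in Foundry: spark = ctx.spark_session

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Optional knobs for what counts as a "skewed" partition:
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

big = spark.table("big_table")          # hypothetical large table
mother = spark.table("mother_table")    # hypothetical smaller table
# Broadcast the small side explicitly; let AQE split any skewed partitions on the big side.
result = big.join(mother.hint("broadcast"), "key")
```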

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle you need to consider here. You can run into memory problems if the data is not properly distributed. That means you need to distribute your data as evenly as possible across tasks, so that you reduce shuffling as much as possible and each task manages its own data. If you need to perform a join and the data is distributed randomly, every task (and therefore every executor) will have to:
See what data it has
Send data to the other executors (and tasks) that need the same keys
Request the data it needs from the others
All that data exchange may cause network bottlenecks if you have a large dataset, and it also makes every task hold its own data in memory plus whatever has been sent, plus temporary objects. All of that can blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean: if you are loading from a DB, try Spark stride reads as defined here; refer to the partitionColumn, lowerBound and upperBound options. That way you create a number of partitions on the dataframe that places the data on different tasks based on the criteria you need. If you are going to join two dataframes, try a similar approach on both so that the partitions are similar (not to say the same), which will prevent shuffling over the network (see the sketch after this list).
When you define partitions, try to make the values as evenly distributed among tasks as possible.
The size of each partition should fit in memory. Spilling to disk is possible, but it slows down performance.
If you don't have a column that makes the data evenly distributed, try to create one that has n different values, where n depends on the number of tasks you have.
If you are reading from a CSV, it is harder to create partitions, but still possible. You can either split the CSV into multiple files and create multiple dataframes (performing a union after they are loaded), or you can read the big CSV and apply a repartition on the column you need. That creates shuffling as well, but it is done only once if you cache the repartitioned dataframe.
Reading from Parquet, you may have multiple files, but if they are not evenly distributed (because the previous process that generated them didn't do it well) you may end up with OOM errors. To prevent that, you can load and repartition the dataframe too.
Another trick, valid for CSV, Parquet, ORC, etc., is to create a Hive table on top of the data and run a query from Spark with a DISTRIBUTE BY clause, so that Hive redistributes the data instead of Spark.
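As referenced in the first item of the list, here is a minimal sketch of a strided JDBC read; the connection URL, table and column names, bounds and partition counts are all hypothetical:

```python
# Minimal sketch (hypothetical connection, table, columns, bounds): load a table already
# split into evenly sized partitions using Spark's JDBC stride options, so each task
# reads and holds only its own slice of the data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")  # hypothetical
    .option("dbtable", "public.orders")                    # hypothetical
    .option("user", "reader")
    .option("password", "********")
    .option("partitionColumn", "order_id")   # numeric, date or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "100")          # ~100 roughly equal ranges -> 100 parallel reads
    .load()
)

# Cache after an explicit repartition if you need a different distribution for a join,
# so the shuffle is paid only once.
df = df.repartition(100, "customer_id").cache()
```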
To your question about Hive and Spark, I think you are right up to a point. Depending on the execution engine that Hive uses in your case (MapReduce, Tez, Hive on Spark, LLAP) you can get different behaviours. With MapReduce, as the operations are mostly on disk, the chance of an OOM is much lower than in Spark; from a memory point of view, MapReduce is not much affected by a skewed data distribution. But (IMHO) your goal should always be to find the best data distribution for the Spark job you are running, and that will prevent the problem.
Another consideration is whether you are testing in a dev environment that doesn't have the same data as the prod environment. I suppose the data distribution should be similar, although the volumes may differ a lot (I am talking from experience ;)). In that case, the Spark tuning parameters you pass on the spark-submit command may need to differ in prod, so you need to invest some time in finding the best approach on dev and then fine-tune in prod.
The huge majority of OOMs in Spark happen on the driver, not the executors. This is usually the result of running .collect or a similar action on a dataset that won't fit in the driver's memory.
Spark does a lot of work under the hood to parallelize the work; when using the structured APIs (in contrast to RDDs), the chances of causing an OOM on an executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that impacts performance and triggers lots of garbage collection, which you then need to address, but Spark should be able to handle low memory without an explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using the structured APIs, although it may need intervention if you see heavy garbage collection and a performance impact.
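A small sketch of the driver-side point: the dangerous pattern is collect() on a large dataset, and the safer patterns keep the data distributed or bound what reaches the driver (the DataFrame and path here are stand-ins):

```python
# Minimal sketch (stand-in DataFrame and output path): avoid pulling a whole dataset
# onto the driver; keep the work distributed or cap the rows that reach the driver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000_000)   # stand-in for a large DataFrame

# rows = df.collect()                # risky: materializes every row on the driver
preview = df.limit(20).collect()     # bounded: at most 20 rows reach the driver
df.write.mode("overwrite").parquet("/tmp/large_output/")  # distributed write, nothing on the driver
```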

how spark reads data when we are using a filter in where

I'm reading a key from a table which is huge in size (900 GB).
It's just one WHERE condition, but Spark has launched many jobs with a huge number of tasks.
I'm using an 11-node cluster (128 GB memory and 16 cores per node).
I know that we may need more tasks, but why that many jobs, and why can't it process in a single stage?
Can someone please explain what happens internally when we use a WHERE condition?
Appreciate your response. Please check this image.
Spark is for bulk processing, not a single-key lookup as your image shows - unlike, say, an Oracle database with an index. For a JOIN over many rows, such lookups are fine, of course.
Spark does not know what you are doing (semantically), so it follows its distributed model and processes in parallel - meaning many tasks - over many partitions.
The image is not a proper use case for Spark.
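A minimal sketch of what actually happens with a single-key WHERE; the table path, column and key value are hypothetical. explain() shows the predicate being pushed down to the source, but Spark still schedules one task per input partition of the 900 GB scan:

```python
# Minimal sketch (hypothetical path, column, key): even a single-key WHERE runs as a
# distributed scan; the filter is pushed down, but every input partition still gets a task.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/warehouse/huge_table/")   # hypothetical ~900 GB table
hit = df.where(F.col("key") == "K-12345")           # hypothetical key value

hit.explain()   # for Parquet, look for something like "PushedFilters: [IsNotNull(key), EqualTo(key,K-12345)]"
hit.show()
```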
