In my Spark job, the results I am sending to the driver are barely a few KB. I still got the exception below despite spark.driver.maxResultSize being set to 4 GB:
ERROR TaskSetManager: Total size of serialized results of 3021102 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
Do Spark accumulators or anything else contribute to the memory usage accounted for by spark.driver.maxResultSize? Is there official documentation or code I can refer to to learn more about this?
More details about the code/execution:
There are 3 million tasks
Each task reads 50 files from S3 and re-writes them back to S3 post-transformation
Tasks return the prefix of the S3 files along with some metadata, which is collected at the driver for saving to a file. This data is < 50 MB
This issue has been fixed here: the cause is that when Spark calculates the result size it also counts the metadata (like task metrics) in the task binary result sent back to the driver. Therefore, if you have a huge number of tasks but collect almost nothing (the real data), you can still hit the error.
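To see how little per-task overhead it takes to hit the limit, here is a rough back-of-the-envelope check using only the numbers from the error message above (a sketch, not Spark's exact accounting):

tasks = 3_021_102
limit_bytes = 4 * 1024**3          # spark.driver.maxResultSize = 4 GB
per_task = limit_bytes / tasks     # bytes of serialized result + metrics per task
print(round(per_task))             # ~1422 bytes, so per-task metadata alone can exhaust the limit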
Related question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked a CSV with: size on disk 12.3 GB, size in memory deserialized 3.6 GB, size in memory serialized 1964.9 MB. I got the in-memory sizes by caching the data both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check whether Spark can handle partitions larger than the executor size, I created a cluster with just one executor and spark.executor.memory set to 500m. I also set executor cores (spark.executor.cores) to 2, increased spark.sql.files.maxPartitionBytes to 13 GB, and switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", False) \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.memory", "500m") \
    .config("spark.sql.adaptive.enabled", False) \
    .config("spark.sql.files.maxPartitionBytes", "13g") \
    .getOrCreate()
I read the CSV and checked the number of partitions it was read into with df.rdd.getNumPartitions(). Output = 2. This was confirmed later by the number of tasks as well.
Then I ran df.persist(StorageLevel.DISK_ONLY); df.count()
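For reference, a minimal sketch of those two steps (the file path is a placeholder, not the actual file used):

from pyspark import StorageLevel

df = spark.read.csv("big_file.csv")       # placeholder path
print(df.rdd.getNumPartitions())          # -> 2 with maxPartitionBytes = 13g
df.persist(StorageLevel.DISK_ONLY)
df.count()                                # action that materializes the cache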
Following are the observations I made:
No caching happens until the data for one batch of tasks (equal to the number of CPU cores, if you have set 1 CPU core per task) has been read in completely. I conclude this because no entry shows up in the Storage tab of the web UI until then.
Each partition here ends up being around 6 GB on disk, which should, at a minimum, be around 1964.9 MB / 2 (= size in memory serialized / 2) in memory, i.e. around 980 MB. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data had been read in. You can see that Input is almost 11 GB, and at this time there was nothing in the Storage tab.
Questions:
Since the memory per executor is 300 MB (execution + storage) + 200 MB (user memory), how is Spark able to manage ~980 MB partitions, and two of them in parallel (one per core)?
The data that is read in does not show up in the Storage tab, is not (and cannot be) held in executor memory, and there is no spill either. Where exactly is that data?
Attaching a screenshot of the web UI after job completion in case it might be useful.
Attaching a screenshot of the Executors tab in case it might be useful:
I want to process several independent CSV files of similar sizes (100 MB) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
from pyspark.sql import functions as f, Window

logData = spark.read.csv("TypeA.csv", header=False, schema=schema)

# Compute moving averages; 35 different window lengths were used.
for i in range(1, 36):
    w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long"))
         .rangeBetween(-24 * 7 * 3600 * i, 0))
    logData = logData.withColumn(f"moving_avg_{i}", f.avg("price").over(w))

# Some other simple operations... No agg, no sort.
logData.write.parquet("res.pr")
This works great. However, I had two issues with scaling this job:
When I increased the number of window functions to 50, the job OOMed. I'm not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I ran the job on 2 CSV files, it also OOMed. It is also not clear why it didn't spill to disk, since the window functions are basically partitioned by CSV file, so they are independent.
The question is: why doesn't PySpark spill to disk in these two cases to prevent the OOM, and how can I hint Spark to do it?
If your machine cannot run all of these at once, you can process them in sequence and write out the data of each batch of files before loading the next batch.
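A minimal sketch of that sequential approach; the file location, batch size, and output mode are assumptions, and the window logic is elided:

import glob

files = sorted(glob.glob("data/*.csv"))        # assumed location of the input files
batch_size = 10                                # assumed batch size
for start in range(0, len(files), batch_size):
    batch = files[start:start + batch_size]
    df = spark.read.csv(batch, header=False, schema=schema)
    # ... apply the window functions to df here ...
    df.write.mode("append").parquet("res.pr")  # write this batch before loading the next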
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to disk instead of keeping it all in RAM with:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
Please update if it helps.
In theory, you could process all these 600 files on one single machine; Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle operation. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of a file, then writes shuffle output to files. The reduce phase then needs to fetch all of this shuffle output from all the map tasks. It's obvious that in your case you can't keep all map tasks running at once.
So it's highly likely that the OOM happened in the map phase. If so, it means the memory per core can't hold one single partition of a file. Be aware that Spark does a rough estimation of memory usage and spills when it thinks it should; since the estimation is not accurate, OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, 128 MB of input needs about 2 GB of heap, i.e. roughly 4 GB of total executor memory, since

executor JVM heap execution memory (≈0.5 of total executor memory)
  = (total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.fraction (0.6)
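For example, a minimal sketch of lowering the per-task input size at runtime (the 32 MB value is just an assumed example, not a recommendation):

# spark.sql.files.maxPartitionBytes is a runtime SQL conf, so it can be changed per session
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))  # 32 MB per input split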
You can post all your configs from the Spark UI for further investigation.
I am reading a CSV with 600 records using Spark 2.4.2. The last 100 records contain large data.
I am running into this problem:
ERROR Job aborted due to stage failure:
Task 1 in stage 0.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 0.0 (TID 5, 10.244.5.133, executor 3):
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 47094.
To avoid this, increase spark.kryoserializer.buffer.max value.
I have increased spark.kryoserializer.buffer.max to 2g (the maximum allowed setting) and the Spark driver memory to 1g, and was able to process a few more records, but I still cannot process all the records in the CSV.
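For reference, a sketch of how that serializer setting is typically applied when building the session (it must be set before the SparkContext is created; the value is the one mentioned above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.kryoserializer.buffer.max", "2g")  # 2g is the hard cap
         .getOrCreate())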
I have tried paging the 600 records, e.g. with 6 partitions I can process 100 records per partition, but since the last 100 records are huge, the buffer overflow occurs.
In this case the last 100 records are large, but it could be the first 100, or the records between 300 and 400. Unless I sample the data beforehand to get an idea of the skew, I cannot optimize the processing approach.
Is there a reason why spark.kryoserializer.buffer.max is not allowed to go beyond 2g?
Maybe I can increase the partitions and decrease the records read per partition? Is it possible to use compression?
Appreciate any thoughts.
Kryo buffers are backed by byte arrays, and primitive arrays can only be up to 2 GB in size.
Please refer to the below link for further details.
https://github.com/apache/spark/commit/49d2ec63eccec8a3a78b15b583c36f84310fc6f0
Please increase the partition number since you cannot optimize the processing approach.
What do you have in those records that a single one blows the Kryo buffer?
In general, leaving the partitions at the default of 200 should be a good starting point. Don't reduce it to 6.
It looks like a single record (line) blows the limit.
There are a number of options for reading in the CSV data; you can try the csv options.
If there is a single line that translates into a 2 GB buffer overflow, I would think about parsing the file differently.
The csv reader also ignores/skips some text in the files (no serialization) if you give it a schema. If you remove the columns that are so huge from the schema, it may read in the data easily.
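A minimal sketch of that suggestion, using hypothetical column names (the real file's columns are not given in the question); the schema deliberately leaves out the huge column so the reader never has to materialize it:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# hypothetical reduced schema: only the small columns are declared
schema = StructType([
    StructField("id", StringType()),
    StructField("price", DoubleType()),
])
df = spark.read.csv("records.csv", schema=schema)  # "records.csv" is a placeholder path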
My (Py)Spark 2.1.1 app consists of two executors with 5 cores and 30 GB of heap (spark.executor.memory) each. I have 3.2 GB of data persisted in memory (deserialized), spread over a dozen partitions and shared between my two executors (1.9 GB + 1.3 GB). I then want to repartition this data by calling repartition('myCol') on my persisted dataframe, with myCol having only three keys with a 60-20-20 distribution. I then want to write the repartitioned data to (3) .parquet files. As expected, this transformation triggers a full shuffle of the data:
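For reference, a minimal sketch of the step described above (df and the output path are assumptions):

df = df.persist()                   # ~3.2 GB cached deserialized across ~12 partitions
shuffled = df.repartition("myCol")  # myCol has only 3 keys (60-20-20 distribution)
shuffled.write.parquet("out_path")  # "out_path" is a placeholder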
First question: in the Spark UI, Shuffle Write amounts to 5.9 GB. Why is this amount much higher than the size of the persisted data? Is it the format Spark uses to write shuffle files on disk (text strings?)? Replication?
Second question: my executors keep dying with error messages such as org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle or ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 32.0 GB of 32 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. spark.yarn.executor.memoryOverhead is already set to 2g, but I must confess I don't really get how this parameter should help in that context. But the main question is: how can shuffling 3 GB of data OOM a 30 GB executor?
I changed a few parameters based on my understanding of Spark (with limited success, obviously): I set spark.memory.fraction to 0.9 and spark.memory.storageFraction to 0.0.
Many thanks in advance for any help, this situation is so frustrating.
PS: Maybe once the issue is solved I can redesign my app with less memory per executor. It currently feels like a terrible waste of resources to me.
I launched a spark job with these settings (among others):
spark.driver.maxResultSize 11GB
spark.driver.memory 12GB
I was debugging my pyspark job, and it kept giving me the error:
serialized results of 16 tasks (17.4 GB) is bigger than spark.driver.maxResultSize (11 GB)
So, I increased spark.driver.maxResultSize to 18 GB in the configuration settings. And it worked!
Now, this is interesting because in both cases the spark.driver.memory was SMALLER than the serialized results returned.
Why is this allowed? I would have assumed this to be impossible, because the serialized results were 17.4 GB while I was debugging, which is more than the size of the driver (12 GB, as shown above).
How is this possible?
It is possible because spark.driver.memory configures the JVM driver process, not the Python interpreter. Data between them is transferred over sockets, and the driver process does not have to keep all the data in memory (it does not convert it to a local structure).
My understanding is that when we ask Spark to perform an action, the results from all the partitions are serialized, but these results need not be sent to the driver, unless some operation such as a collect() is performed.
spark.driver.maxResultSize defines a limit on the total size of the serialized results of all partitions and is independent of the actual spark.driver.memory. Therefore, your spark.driver.memory can be smaller than your spark.driver.maxResultSize and your code will still work.
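A hypothetical illustration of that distinction (the dataset and path are made up, purely for illustration):

df = spark.range(10**9)   # a large dataset, just for illustration
df.write.parquet("out")   # action whose data stays on executors; only tiny per-task statuses return to the driver
rows = df.collect()       # pulls every partition's result to the driver; counted against spark.driver.maxResultSize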
We could probably get a better idea if you tell us the transformations and actions used in this process, or share your code snippet.