Question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked a CSV with: size on disk - 12.3 GB, size in memory deserialized - 3.6 GB, size in memory serialized - 1964.9 MB. I got the in-memory sizes by caching the data both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check whether Spark can handle partitions larger than the executor size, I created a cluster with just one executor, with spark.executor.memory set to 500 MB. I also set executor cores (spark.executor.cores) to 2 and increased spark.sql.files.maxPartitionBytes to 13 GB. I switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
config("spark.dynamicAllocation.enabled",False).\
config("spark.executor.cores","2").\
config("spark.executor.instances","1").\
config("spark.executor.memory","500m").\
config("spark.sql.adaptive.enabled", False).\
config("spark.sql.files.maxPartitionBytes","13g").\
getOrCreate()
I read the CSV and checked the number of partitions it is read into with df.rdd.getNumPartitions(). Output = 2. This was confirmed later on by the number of tasks.
Then I ran df.persist(storagelevel.StorageLevel.DISK_ONLY); df.count()
Following are the observations I made:
No caching happens till the data for one batch of tasks (equal to number of cpu cores in case you have set 1 cpu core per task) is read in completely. I conclude this since there is no entry that shows up in the storage tab of the web UI.
Each partition here ends up being around 6 GB on disk, which should be, at a minimum, around 1964.9 MB / 2 (= size in memory serialized / 2), or roughly 980 MB, in memory. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data had been read in. You can see that Input was almost 11 GB, and at that time there was nothing in the Storage tab.
Questions:
The memory per executor is 300 MB (Execution + Storage) + 200 MB (User memory). How is Spark able to manage ~980 MB partitions, and two of them in parallel (one per core) at that?
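For reference, the memory split can be sketched with a little arithmetic. This is only a rough sketch, assuming Spark's documented defaults (300 MB of reserved memory and spark.memory.fraction = 0.6) and treating the JVM heap as exactly spark.executor.memory; the function name is made up for illustration. Under those assumptions a 500 MB executor would have roughly 120 MB of unified (execution + storage) memory and 80 MB of user memory, which makes the question above even sharper.

```python
RESERVED_MB = 300       # assumption: Spark's fixed reserved memory
MEMORY_FRACTION = 0.6   # assumption: default spark.memory.fraction


def memory_regions_mb(executor_memory_mb, fraction=MEMORY_FRACTION,
                      reserved_mb=RESERVED_MB):
    """Split an executor heap into (unified, user) memory, in MB."""
    usable = executor_memory_mb - reserved_mb  # heap left after the reserved part
    unified = usable * fraction                # execution + storage
    user = usable - unified                    # everything else on the heap
    return unified, user


unified, user = memory_regions_mb(500)
print(f"unified: {unified:.0f} MB, user: {user:.0f} MB")
```

The heap the JVM actually reports is usually a bit smaller than the configured value, so the real figures may be lower still.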
The data read in does not show up in Storage, is not (and cannot be) in the executor, and there is no spill either. Where exactly is the data that has been read in?
Attaching a screenshot of the web UI after the job completed, in case it is useful.
Attaching a screenshot of the Executors tab, in case it is useful:
Related
My Spark task uses image data for prediction. I am working on a Spark standalone cluster, but I cannot use all of the available memory. The available memory is 2.7 GB (from an executor memory setting of 5 GB * 0.6 * 0.9 = 2.7 GB, which is fine), but usage never goes beyond 342 MB; past that value my Spark session crashes, and I do not know why it is this specific value.
I tested my application both locally and in standalone cluster mode. Whatever executor memory I configure, the limit for execution memory is 342 MB. As shown, a data size of 290691 KB crashed my Spark session, and it works fine if I decrease the number of images.
Screenshot of the issue:
This error output appeared with a data size of 290691 KB.
Here, the Storage Memory in my Spark UI did not exceed 342 MB.
So, is there any advice, or what is the correct Spark configuration?
It's a warning, initially.
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. You can find many such issues out there on the Internet.
I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 GB memory. (Total 9 executors, 27 cores and 45 GB memory.) What will happen if:
I have 30 data partitions. Each partition is 6 GB in size. Optimally, the number of partitions must be equal to the number of cores, since each core executes one partition/task (one task per partition). Now in this case, how will each executor core process its partition, since the partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(); I'm simply applying some narrow transformations like map() and filter() on my RDD.
Will Spark automatically try to store the partitions on disk? (I'm not calling cache() or persist(); only transformations are happening after an action is called.)
Since I have more partitions (30) than available cores (27), my cluster can process at most 27 partitions at once. What will happen to the remaining 3 partitions? Will they wait for the occupied cores to be freed?
If I'm calling persist() with the storage level set to MEMORY_AND_DISK, then if the partition size is greater than memory, will it spill data to disk? On which disk will this data be stored? The worker node's external HDD?
I will answer each part as best I know, possibly disregarding a few of your assertions:
I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 GB memory. (Total 9 executors, 27 cores and 45 GB memory.) What will happen if:
>>> I would use 1 Executor, 1 Core. That is the generally accepted paradigm afaik.
I have 30 data partitions. Each partition is 6 GB in size. Optimally, the number of partitions must be equal to the number of cores, since each core executes one partition/task (one task per partition). Now in this case, how will each executor core process its partition, since the partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(); I'm simply applying some narrow transformations like map() and filter() on my RDD. >>> It is not true that the number of partitions must equal the number of cores. You can service 1000 partitions with 10 cores, each core processing one partition at a time. What if you have 100K partitions and are on-prem? It is unlikely you will get 100K Executors. >>> Moving on, and leaving Driver-side collect issues to one side: you may not have enough memory for a given operation on an Executor; Spark can spill to files on disk at the expense of processing speed. However, a partition should not exceed the maximum partition size, a limit that was raised some time ago. With multi-core Executors, failures such as OOMs can occur, sometimes as a result of GC issues, which is a difficult topic.
Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called) >>> Not if it can avoid it, but when memory is tight, eviction / spilling to disk can and will occur, and in some cases re-computation from source or last checkpoint will occur.
Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed? >>> They will be serviced by a free Executor at a point in time.
If I'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD? >>> Yes, and it will be spilled to the local file system. I think you can configure for HDFS via a setting, but local disks are faster.
This is an insightful blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d
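To make the narrow-transformation case above more concrete, here is a plain-Python analogy (not Spark code, and not Spark's actual implementation). It assumes the task only chains map()/filter()-style steps and does not cache, sort, or aggregate, so records can flow through one at a time and the whole partition never has to sit in memory at once. The function names are invented for the illustration.

```python
def read_partition(num_records):
    """Yield records lazily, the way a task reads its input split."""
    for i in range(num_records):
        yield i


def run_task(records):
    """Apply filter() then map() record by record, then count the results."""
    filtered = (r for r in records if r % 2 == 0)  # narrow step: filter
    mapped = (r * 10 for r in filtered)            # narrow step: map
    return sum(1 for _ in mapped)                  # action: count


# Only one record is held at a time, however large the "partition" is.
count = run_task(read_partition(1_000_000))
print(count)
```

Once a step needs the whole partition at once (a sort, a hash aggregation, a cached block), this streaming picture no longer holds and memory pressure or spilling comes into play.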
Your data partition size looks bigger than your per-core memory. Your per-core memory is ~1.6 GB (5 GB / 3 cores). This will be a problem, as a partition cannot be processed within that much memory. To resolve this, you can try:
increasing the number of partitions such that each partition is < Core memory ~1.6 GB. So increase them to something like 150 partitions.
If you keep the partitions the same, you should try increasing your Executor memory and maybe also reducing number of Cores in your Executors.
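The first option can be checked with quick arithmetic. This is a sketch that simply reproduces the heuristic above, assuming 30 partitions of 6 GB (180 GB in total) and a per-core budget of about 1.6 GB; Spark does not actually reserve a fixed slice of executor memory per core, so treat the budget as a rule of thumb. The helper name is hypothetical.

```python
import math


def partitions_needed(total_gb, per_core_budget_gb):
    """Smallest partition count that keeps each partition within the budget."""
    return math.ceil(total_gb / per_core_budget_gb)


total_gb = 30 * 6  # 30 partitions of 6 GB each
needed = partitions_needed(total_gb, 1.6)
size_at_150 = total_gb / 150  # partition size if 150 partitions are used

print(needed, round(size_at_150, 2))
```

So at least 113 partitions would be needed under this budget, and 150 partitions leave some headroom at about 1.2 GB each.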
If everything goes well, it will not need to store partitions on disk. However, if it cannot find enough memory, it will fall back to disk. If you want to store your data on disk and persist it for some reason, you need to call persist(DISK_ONLY).
They will wait until one of the Cores is available.
Yes, it will spill on Disk. Where will depend on your cluster configuration I believe.
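The waiting behaviour for the extra partitions can be approximated as "waves" of tasks. This is a simplified sketch: it assumes one CPU per task and equally long tasks, whereas Spark really hands out tasks one by one as slots free up. The function name is made up for the example.

```python
import math


def task_waves(num_partitions, total_cores, cpus_per_task=1):
    """Return (number of waves, tasks in the last wave)."""
    slots = total_cores // cpus_per_task        # tasks that can run at once
    waves = math.ceil(num_partitions / slots)
    last_wave = num_partitions - (waves - 1) * slots
    return waves, last_wave


waves, last_wave = task_waves(30, 27)
print(waves, last_wave)
```

With 30 partitions and 27 cores this gives two waves, the second running only 3 tasks while the other 24 cores sit idle.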
Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
I think of it this way; please correct me if I am wrong.
Suppose there are 2 Data Nodes to process the dataset, and these nodes collectively have 32 GB of memory (16 GB per Data Node). The dataset size is 100 GB, and let us suppose this data, when read by Spark, is partitioned into 10 partitions of 10 GB each. Obviously the 100 GB file cannot fit into 32 GB of RAM at once, so the partitions have to be loaded into memory and processed iteratively. So I assume the following.
first iteration, 2 partitions, 10GB each are loaded into memory on each data node.
second iteration, 2 partitions, 10GB each are loaded into memory on each data node.
....
....
Fifth iteration, 2 partitions, 10GB each are loaded into memory on each data node.
If this is how Spark processes the data, only 2 partitions are loaded into memory during each iteration. Does that mean the other partitions, which could not be accommodated in memory, were read but spilled to disk and are waiting for memory to be freed? Or are those partitions not read at all, and will be read only when resources are available? Which is true?
During processing, if there is a need to groupBy/reduceBy/join, a shuffle is required. So if one of the shuffle partitions is larger than the RAM size, the job will fail with an OOM error. For example, 10 partitions were processed and shuffled. Now the shuffled data sits in only 4 partitions of 25 GB each.
(The default number of shuffle partitions is 200, but only 4 partitions hold all the data; the rest are empty.) Since the shuffle partition size is greater than the 16 GB of RAM, will the Spark job fail? Is my understanding correct?
I understand that your data does not really need to fit in memory; Spark processes the data partition by partition. But my question is: what if the partition itself does not fit in memory? Would Spark still spill the data to disk and start processing, or would it fail with an OOM error?
My second question is: suppose another Spark job (job2) is triggered while the above Spark job (job1) is executing, and it also has a 100 GB file to process, with 10 partitions of 10 GB each. While job1's first iteration is executing, only a 6 GB free slot is available in memory. Job2's 10 GB partition cannot be loaded into memory for processing. So will job2 wait until memory is freed up, or will it also fail with an OOM error?
The explanation (bold) is correct.
On your comments:
Unless you explicitly repartition, your partitions will be related to the HDFS block size: 128 MB each, and as many as it takes to make up that file.
Then you have number of executors, say 2, per Worker / Data Node. Then max 4 tasks / partitions will be active at any given time.
What would be the point of loading all partitions to memory if you can service at most 4? You would be clogging up the system to the detriment of other Spark Apps. This is all like a normal OS then.
Of course it is a bit more complicated, e.g. if you have 10 Data Nodes and allocate only 2 Executors, there is traffic to move data about. Just keeping it simple.
OOM errors only occur if a partition exceeds max partition size. For the rest disk space is needed for spilling.
My (Py)Spark 2.1.1 app consists of two executors with 5 cores and a 30 GB heap (spark.executor.memory) each. I have 3.2 GB of data persisted in memory (deserialized), spread over a dozen partitions and shared between my two executors (1.9 GB + 1.3 GB). I then want to repartition this data by calling repartition('myCol') on my persisted dataframe, with myCol having only three keys with a 60-20-20 distribution. I then want to write the repartitioned data to (3) .parquet files. As expected, this transformation triggers a full shuffle of the data:
First question: In the Spark UI, Shuffle Write amounts to 5.9 GB. Why is this amount much higher than the size of the persisted data? Is it the format Spark uses to write shuffle files on disk (text strings?)? Replication?
Second question: My executors keep dying with error messages such as "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle" or "ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 32.0 GB of 32 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead." spark.yarn.executor.memoryOverhead is already set to 2g, but I must confess I don't really get how this parameter should help in that context. But the main question is: how can shuffling 3 GB of data OOM a 30 GB executor?
I changed a few parameters based on my understanding of Spark (with limited success, obviously): I set spark.memory.fraction to 0.9 and spark.memory.storageFraction to 0.0.
Many thanks in advance for any help, this situation is so frustrating.
PS: Maybe once the issue is solved I can redesign my app with less memory per executor. It currently feels like a terrible waste of resources to me.
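One way to read the 32 GB figure in the error message: on YARN, the container limit is the executor heap plus the memory overhead. The sketch below just adds the two values from the question (30 GB heap, 2g overhead); the function name is made up for illustration, and the idea that off-heap usage is what hits the limit here is an assumption, not something the question confirms.

```python
def yarn_container_limit_gb(executor_memory_gb, memory_overhead_gb):
    """Physical memory YARN allows the executor container to use."""
    return executor_memory_gb + memory_overhead_gb


limit_gb = yarn_container_limit_gb(30, 2)
print(limit_gb)
```

Anything the process uses outside the JVM heap (for a PySpark app, that includes the Python worker processes) has to fit in the overhead part, so the container can be killed even when the heap itself is far from full.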
I am reading text files totalling 8.2 GB (all files in a folder) with the wholeTextFiles method.
The job that reads the files got 3 executors, each with 4 cores and 4 GB memory, as shown in the picture.
Though the job page shows 3 executors, only 2 executors are really working on the data (I can tell from the stderr logs, which print the files being read). The 3rd executor shows no trace of processing files.
There are 2 partitions from the wholeTextFiles API.
The 2 executors have 4 GB each, 8 GB of memory in total, but my files are 8.2 GB.
Can anyone explain how 2 executors with 8 GB of RAM in total can hold 8.2 GB of files?
My job completed successfully.
From the Spark doc for the function wholeTextFiles:
Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
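So an RDD record is the entire content of one file, which means a single file is never split across partitions. The number of partitions follows the minPartitions hint (several files can share a partition), not the number of files.
To have files split into multiple partitions, you can use the function textFile.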
Every executor also has a memory overhead (10% of the allocated memory, with a minimum of 384 MB).
You can see the actual allocated memory from YARN Running Jobs.
Also, there is something called Container memory [min and max limit] allocation.
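The overhead rule above can be written out as a small calculation. This is a sketch, assuming the defaults stated in the answer (10% of executor memory, minimum 384 MB) and a 4 GB executor as in the question; it does not model YARN rounding the request up to its own minimum allocation size. The function names are hypothetical.

```python
OVERHEAD_FACTOR = 0.10   # assumption: default overhead fraction
OVERHEAD_MIN_MB = 384    # assumption: default minimum overhead


def memory_overhead_mb(executor_memory_mb):
    """Default overhead: 10% of executor memory, but at least 384 MB."""
    return max(int(executor_memory_mb * OVERHEAD_FACTOR), OVERHEAD_MIN_MB)


def container_request_mb(executor_memory_mb):
    """Memory requested from YARN per executor (heap + overhead)."""
    return executor_memory_mb + memory_overhead_mb(executor_memory_mb)


overhead = memory_overhead_mb(4096)
request = container_request_mb(4096)
print(overhead, request)
```

For a 4 GB executor this comes to roughly 4.4 GB per container, which is the figure to compare against what YARN reports for the running job.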