Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller.
I think it this way, please correct me if I am wrong.
Suppose there are 2 Data Nodes to process the Dataset and both these nodes collectively has a memory of 32GB(16 GB per Data Node). The data set size is 100 GB and let us suppose this data, when read by spark, is partitioned into 10 partitions of 10GB each. It is obvious that the 100GB file cannot be fit into 32 GB RAM at a time. so the partitions have to be loaded into memory and processed in a iterative manner. so I assume as below.
first iteration, 2 partitions, 10GB each are loaded into memory on each data node.
second iteration, 2 partitions, 10GB each are loaded into memory on each data node.
....
....
Fifth iteration, 2 partitions, 10GB each are loaded into memory on each data node.
If this is how the spark is processing, during every iteration, only 2 partitions are loaded into memory. Does that mean, the other partitions which were unable to be accommodated in memory, were read but spilled to disk and they are waiting for the memory to be freed? or those partitions are not read at all and they will be read only when the resources are available. which is true?
During processing if there is a need to groupby/reduceby/join, then it mandates a shuffle. so if one of the shuffle partition is greater than RAM size then the job will fail with OOM error. Example, 10 partitions were processed and shuffled. Now the shuffle partitions are only 4 partitions with 25GB each.
(Default shuffle partitions are 200, but only 4 partitions have the total data remaining are empty.) since the shuffle partition size is greater than 16MB RAM, will the spark job fail? Is my understanding correct?
I understand that, you do not really need that your data fit in memory. Spark processes the data on partition basis. But My question is what if the partition itself is not fitting in memory. Would it still spill the data to disk and start processing or it will fail with OOM error?
The second question I have is, If another spark job(Job2) is triggered during the above spark job(job1) is under execution, and suppose this is also having 100GB file to process with 10 partitions of 10GB each. so when job1 Iteration1 is under execution, there is only 6 MB free slot available in the memory. The job2's partition of 10GB cannot be loaded into memory for processing job2. so will the Job2 wait till the memory is freed up? or will this job also fail with OOM error?
The explanation (bold) is correct.
On your comments:
Unless you explicitly repartition, your partitions will be HDFS block size related, the 128MB size and as many that make up that file.
Then you have number of executors, say 2, per Worker / Data Node. Then max 4 tasks / partitions will be active at any given time.
What would be the point of loading all partitions to memory if you can service at most 4? You would be clogging up the system to the detriment of other Spark Apps. This is all like a normal OS then.
Of course it is a bit more complicated, e.g. if you have 10 Data Nodes and allocation only 2 Executors, there is traffic to move stuff about. Just keeping it simple.
OOM errors only occur if a partition exceeds max partition size. For the rest disk space is needed for spilling.
Related
I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (Total 6 executors, 27 cores and 15gb memory). What will happen if:
I have 30 data partitions. Each partition is of size 6 gb. Optimally, the number of partitions must be equal to number of cores, since each core executes one partition/task (One task per partition). Now in this case, how will each executor-core will process the partition since partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(), it's simply that i'm applying some narrow transformations like map() and filter() on my rdd.
Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called)
Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed?
If i'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD?
I answer as I know things on each part, possibly disregarding a few of your assertions:
I have four questions. Suppose in spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 gb memory. (Total 6 executors, 27 cores and 15gb memory). What will happen if:
>>> I would use 1 Executor, 1 Core. That is the generally accepted paradigm afaik.
I have 30 data partitions. Each partition is of size 6 gb. Optimally, the number of partitions must be equal to number of cores, since each core executes one partition/task (One task per partition). Now in this case, how will each executor-core will process the partition since partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(), it's simply that I'm applying some narrow transformations like map() and filter() on my rdd. >>> The number of partitions being the same of number of cores is not true. You can service 1000 partitions with 10 cores, processing one at a time. What if you have 100K partition and on-prem? Unlikely you will get 100K Executors. >>> Moving on and leaving Driver-side collect issues to one side: You may not have enough memory for a given operation on an Executor; Spark can spill to files to disk at the expense of speed of processing. However, the partition size should not exceed a maximum size, was beefed up some time ago. Using multi-core Executors failure can occur, i.e. OOM's, also a result of GC-issues, a difficult topic.
Will spark automatically try to store the partitions on disk? (I'm not calling cache() or persist() but merely just transformations are happening after an action is called) >>> Not if it can avoid it, but when memory is tight, eviction / spilling to disk can and will occur, and in some cases re-computation from source or last checkpoint will occur.
Since I have partitions (30) greater than the number of available cores (27) so at max, my cluster can process 27 partitions, what will happen to the remaining 3 partitions? Will they wait for the occupied cores to get freed? >>> They will be serviced by a free Executor at a point in time.
If I'm calling persist() whose storage level is set to MEMORY_AND_DISK, then if partition size is greater than memory, it will spill data to the disk? On which disk this data will be stored? The worker node's external HDD? >>> Yes, and it will be spilled to the local file system. I think you can configure for HDFS via a setting, but local disks are faster.
This an insightful blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d
Your data partition size looks bigger than your Core memory. Your Core memory is ~1.6 GB (5GB/3 Core). This will be a problem as your partition will not be able to process in the Core. To resolve this, you can try:
increasing the number of partitions such that each partition is < Core memory ~1.6 GB. So increase them to something like 150 partitions.
If you keep the partitions the same, you should try increasing your Executor memory and maybe also reducing number of Cores in your Executors.
If everything goes well it will not need to store partitions on disk. However, if it is not able to find enough memory, it will find disk as a backup. If you want to store your data on Disk and persist it for some reason, you need to call persist(DISK_ONLY).
They will wait until one of the Cores is available.
Yes, it will spill on Disk. Where will depend on your cluster configuration I believe.
I tried looking through the various posts but did not get an answer. Lets say my spark job has 1000 input partitions but I only have 8 executor cores. The job has 2 stages. Can someone help me understand exactly how spark processes this. If you can help answer the below questions, I'd really appreciate it
As there are only 8 executor cores, will spark process the Stage 1 of my job 8 partitions at a time?
If the above is true, after the first set of 8 partitions are processed where is this data stored when spark is running the second set of 8 partitions?
If I dont have any wide transformations, will this cause a spill to disk?
For a spark job, what is the optimal file size. I mean spark better with processing 1 MB files and 1000 spark partitions or say a 10MB file with 100 spark partitions?
Sorry, if these questions are vague. This is not a real use case but as I am learning about spark I am trying to understand the internal details of how the different partitions get processed.
Thank You!
Spark will run all jobs for the first stage before starting the second. This does not mean that it will start 8 partitions, wait for them all to complete, and then start another 8 partitions. Instead, this means that each time an executor finishes a partition, it will start another partition from the first stage until all partions from the first stage is started, then spark will wait until all stages in the first stage are complete before starting the second stage.
The data is stored in memory, or if not enough memory is available, spilled to disk on the executor memory. Whether a spill happens will depend on exactly how much memory is available, and how much intermediate data results.
The optimal file size is varies, and is best measured, but some key factors to consider:
The total number of files limits total parallelism, so should be greater than the number of cores.
The amount of memory used processing a partition should be less than the amount available to the executor. (~4GB for AWS glue)
There is overhead per file read, so you don't want too many small files.
I would be inclined towards 10MB files or larger if you only have 8 cores.
If i have cluster of 5 nodes, each node having 1gb ram, now if my data file is 10gb distributed in all 5 nodes, let say 2gb in each node, now if i trigger
val rdd = sc.textFile("filepath")
rdd.collect
will spark load data into the ram and how spark will deal with this scenario
will it straight away deny or will it process it.
Lets understand the question first #intellect_dp you are asking, you have a cluster of 5 nodes (here the term "node" I am assuming machine which generally includes hard disk,RAM, 4 core cpu etc.), now each node having 1 GB of RAM and you have 10 GB of data file which is distributed in a manner, that 2GB of data is residing in the hard disk of each node. Here lets assume that you are using HDFS and now your block size at each node is 2GB.
now lets break this :
each block size = 2GB
RAM size of each node = 1GB
Due to lazy evaluation in spark, only when "Action API" get triggered, then only it will load your data into the RAM and execute it further.
here you are saying that you are using "collect" as an action api. Now the problem here is that RAM size is less than your block size, and if you process it with all default configuration (1 block = 1 partition ) of spark and considering that no further node will going to add up, then in that case it will give you out of memory exception.
now the question - is there any way spark can handle this kind of large data with the given kind of hardware provisioning?
Ans - yes, first you need to set default minimum partition :
val rdd = sc.textFile("filepath",n)
here n will be my default minimum partition of block, now as we have only 1gb of RAM, so we need to keep it less than 1gb, so let say we take n = 4,
now as your block size is 2gb and minimum partition of block is 4 :
each partition size will be = 2GB/4 = 500mb;
now spark will process this 500mb first and will convert it into RDD, when next chunk of 500mb will come, the first rdd will get spill to hard disk (given that you have set the storage level "MEMORY_AND_DISK_ONLY").
In this way it will process your whole 10 GB of data file with the given cluster hardware configuration.
Now I personally will not recommend the given hardware provisioning for such case,
as it will definitely process the data, but there are few disadvantages :
firstly it will involve multiple I/O operation making whole process very slow.
secondly if any lag occurs in reading or writing to the hard disk, your whole job will get discarded, you will go frustrated with such hardware configuration. In addition to it you will never be sure that will spark process your data and will be able to give result when data will increase.
So try to keep very less I/O operation, and
Utilize in memory computation power of spark with an adition of few more resources for faster performance.
When you use collect all the data send is collected as array only in driver node.
From this point distribution spark and other nodes does't play part. You can think of it as a pure java application on a single machine.
You can determine driver's memory with spark.driver.memory and ask for 10G.
From this moment if you will not have enough memory for the array you will probably get OutOfMemory exception.
In the otherhand if we do so, Performance will be impacted, we will not get the speed we want.
Also Spark store only results in RDD, so I can say result would not be complete data, any worst case if we are doing select * from tablename, it will give data in chunks , what it can affroad....
I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.
I am running a Spark streaming application with 2 workers.
Application has a join and an union operations.
All the batches are completing successfully but noticed that shuffle spill metrics are not consistent with input data size or output data size (spill memory is more than 20 times).
Please find the spark stage details in the below image:
After researching on this, found that
Shuffle spill happens when there is not sufficient memory for shuffle data.
Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling
shuffle spill (disk) - size of the serialized form of the data on disk after spilling
Since deserialized data occupies more space than serialized data. So, Shuffle spill (memory) is more.
Noticed that this spill memory size is incredibly large with big input data.
My queries are:
Does this spilling impacts the performance considerably?
How to optimize this spilling both memory and disk?
Are there any Spark Properties that can reduce/ control this huge spilling?
Learning to performance-tune Spark requires quite a bit of investigation and learning. There are a few good resources including this video. Spark 1.4 has some better diagnostics and visualisation in the interface which can help you.
In summary, you spill when the size of the RDD partitions at the end of the stage exceed the amount of memory available for the shuffle buffer.
You can:
Manually repartition() your prior stage so that you have smaller partitions from input.
Increase the shuffle buffer by increasing the memory in your executor processes (spark.executor.memory)
Increase the shuffle buffer by increasing the fraction of executor memory allocated to it (spark.shuffle.memoryFraction) from the default of 0.2. You need to give back spark.storage.memoryFraction.
Increase the shuffle buffer per thread by reducing the ratio of worker threads (SPARK_WORKER_CORES) to executor memory
If there is an expert listening, I would love to know more about how the memoryFraction settings interact and their reasonable range.
To add to the above answer, you may also consider increasing the default number (spark.sql.shuffle.partitions) of partitions from 200 (when shuffle occurs) to a number that will result in partitions of size close to the hdfs block size (i.e. 128mb to 256mb)
If your data is skewed, try tricks like salting the keys to increase parallelism.
Read this to understand spark memory management:
https://0x0fff.com/spark-memory-management/
https://www.tutorialdocs.com/article/spark-memory-management.html