Apache Spark running out of memory with smaller amount of partitions - apache-spark

I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?

The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.

Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms

Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.

Related

Spark configuration based on my data size

I know there's a way to configure a Spark Application based in your cluster resources ("Executor memory" and "number of Executor" and "executor cores") I'm wondering if exist a way to do it considering the data input size?
What would happen if data input size does not fit into all partitions?
Example:
Data input size = 200GB
Number of partitions in cluster = 100
Size of partitions = 128MB
Total size that partitions could handle = 100 * 128MB = 128GB
What about the rest of the data (72GB)?
I guess Spark will wait to have free the resources free due to is designed to process batches of data Is this a correct assumption?
Thank in advance
I recommend for best performance, don't set spark.executor.cores. You want one executor per worker. Also, use ~70% of the executor memory in spark.executor.memory. Finally- if you want real-time application statistics to influence the number of partitions, use Spark 3, since it will come with Adaptive Query Execution (AQE). With AQE, Spark will dynamically coalesce shuffle partitions. SO you set it to an arbitrarily-large number of partitions, such as:
spark.sql.shuffle.partitions=<number of cores * 50>
Then just let AQE do its thing. You can read more about it here:
https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
There are 2 aspects to your question. The first is regarding storage of this data, & the second is regarding data execution.
With regards to storage, when you say Size of partitions = 128MB, I assume you use HDFS to store this data & 128M is your default block size. HDFS itself internally decides how to split this 200GB file & store in chunks not exceeding 128M. And your HDFS cluster should have more than 200GB * replication factor of combined storage to persist this data.
Coming to the Spark execution part of the question, once you define spark.default.parallelism=100, it means that Spark will use this value as the default level of parallelism while performing certain operations (like join etc). Please note that the amount of data being processed by each executor is not affected by the block size (128M) in any way. Which means each executor task will work on 200G/100 = 2G of data (provided the executor memory is sufficient for the required operation being performed). In case there isn't enough capacity in the spark cluster to run 100 executors in parallel, then it will launch as many executors it can in batches as and when resources are available.

Repartitioning of large dataset in spark

I have 20TB file and I want to repartition it in spark with each partition = 128MB.
But after calculating n=20TB/128mb= 156250 partitions.
I believe 156250 is a very big number for
df.repartition(156250)
how should I approach repartitiong in this?
or should I increase the block size from 128mb to let's say 128gb.
but 128 gb per task will explode executor.
Please help me with this.
Divide and conquer it. You don’t need to load all the dataset in one place cause it would cost you huge amount resources and also network pressure because of shuffle exchanging.
The block size that you are referring to here is an HDFS concept related to storing the data by breaking it into chunks (say 128M default) & replicating thereafter for fault tolerance. In case you are storing your 20TB file on HDFS, it will automatically be broken into 20TB/128mb=156250 chunks for storage.
Coming to the Spark dataframe repartition, firstly it is a tranformation rather than an action (more information on the differences between the two: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-operations). Which means merely calling this function on the dataframe does nothing unless the dataframe is eventually used in some action.
Further, the repartition value allows you to define the parallelism level of your operation involving the dataframe & should mostly be though upon in those terms rather than the amount of data being processed per executor. The aim should be to maximize parallelism as per the available resources rather than trying to process certain amount of data per executor. The only exception to this rule should be in cases where the executor either needs to persist all this data in memory or collect some information from this data which is proportional to the data size being processed. And the same applies to any executor task running on 128GB of data.

Is spark partition size is equal to HDFS block size or depends on the number of cores available on all executors?

I am looking through spark partitioning and I see different answers for the question.
Is spark partition size is equal to HDFS block size or depends on the number of cores available on all executors?, and Does the performance improves by repartitioning the data in skewed data case? (I assume the data related to the same join key is again shuffled back to a single executor during the join). Please help me understand this. Thanks!
It really depends on your data where from you are reading. If you are reading from HDFS, then one block will be one partition. But if you are reading a parquet file, then one parquet file is one partition as it is not splittable, so depending on the block in case of HDFS and files count in case of parquet, it creates partitions.
Regarding the skewed data, the more data one partition has, the more time it takes to finish the execution. The other tasks will finished quickly as they have less data so the resources are not being utilized properly. Therefore, it is always better to repartition the skewed data properly, so all executors can evenly do the execution.
You can look here for all the available RDDs, and how they are creating partitions:
https://github.com/apache/spark/tree/master/core/src/main/scala/org/apache/spark/rdd

setting tuning parameters of a spark job

I'm relatively new to spark and I have a few questions related to the tuning optimizations with respect to the spark submit command.
I have followed : How to tune spark executor number, cores and executor memory?
and I understand how to utilise maximum resources out of my spark cluster.
However, I was recently asked how to define the number of cores, memory and cores when I have a relatively smaller operation to do as if I give maximum resources, it is going to be underutilised .
For instance,
if I have to just do a merge job (read files from hdfs and write one single huge file back to hdfs using coalesce) for about 60-70 GB (assume each file is of 128 mb in size which is the block size of HDFS) of data(in avro format without compression), what would be the ideal memory, no of executor and cores required for this?
Assume I have the configurations of my nodes same as the one mentioned in the link above.
I can't understand the concept of how much memory will be used up by the entire job provided there are no joins, aggregations etc.
The amount of memory you will need depends on what you run before the write operation. If all you're doing is reading data combining it and writing it out, then you will need very little memory per cpu because the dataset is never fully materialized before writing it out. If you're doing joins/group-by/other aggregate operations all of those will require much ore memory. The exception to this rule is that spark isn't really tuned for large files and generally is much more performant when dealing with sets of reasonably sized files. Ultimately the best way to get your answers is to run your job with the default parameters and see what blows up.

how does hashpartitioner work in spark?

Say I have lots data in a couple of s3 files, about 5 GB each, which I read in using sc.textFile
I need to join the data from the two files, therefore, I opt to use the HashPartitioner technique, and I set a partition count of 20. The submitted job to 8 worker nodes fails without any meaningful messages. Now I am thinking maybe I need to pick a proper number of partitions.
Obviously, the idea for spark to partition up all the data based on a chosen key. In order to load them up into 20 partitions, I imagine spark will have to read thru every line of data, compute its hash, and load into the memory of the matching partition, which resides in one of the 8 worker nodes. If there is enough collective memory in the worker nodes, I assume this goes smoothly. At the end of the read, all the data is in the proper partition, in the right node's memory. Am I right so far?
However, if the total memory can not fit all the data, I imagine Spark will work on certain partitions first. And after processing these first partitions, it flushes the original partitions and reads from the source files again, loading remaining data into new partitions. This would mean reading the same file as many time as necessary to process all partitions using available memory. Is this also correct?
Should I should calculate the number of partitions so that at least one full partition would fit into a single node's memory. Are there other guidelines to follow?

Resources