Max limit on number of RDDs in Spark - apache-spark

I am new to Spark and is currently learning related concepts. One basic question I have is what is the limit on number of RDDs in Spark ?

Up to my knowledge, there is no limit on number of RDDs in Spark.
Only limitation is with memory and disk swap space Spark can expand/use.
I have been running spark over data of 40 tb in production, and yet to reach any kind of limit.

Related

Optimal value of spark.sql.shuffle.partitions for a Spark batch Job reading from Kafka

I have a Spark batch job that consumes data from a Kafka topic with 300 partitions. As part of my job, there are various transformations like group by and join which require shuffling.
I want to know if I should go with the default value of spark.sql.shuffle.partitions which are 200 or set it to 300 which is the same as the number of input partitions in Kafka and hence the number of parallel tasks spawn to read it.
Thanks
In the chapter on Optimizing and Tuning Spark Applications of the book "Learning Spark, 2nd edition" (O'Reilly) it is written that the default value
"is too high for smaller or streaming workloads; you may want to reduce it to a lower value such as the number of cores on the executors or less.
There is no magic formula for the number of shuffle partitions to set for the shuffle stage; the number may vary depending on your use case, data set, number of cores, and the amount of executors memory available - it's a trial-and-error-approach."
Your goal should be to reduce the amount of small partitions being sent accross the network to executor's task.
There is a recording of a talk on Tuning Apache Spark for Large Scale Workloads which also talks about this configuration.
However, when you are using Spark 3.x, you do not think about that much as the Adaptive Query Execution (AQE) framework will dynamically coalesce shuffle partitions based on the shuffle file statistics. More details on the AQE framework are given in this blog.

How can I optimize my spark application to join two rdd having size greater than cluster memory?

I want to join two RDD each taking 10 GB of memory. But the cluster memory I am having is just 15 GB. Is it possible to optimize the code somehow so that we can join these RDD?
I thought of persisting the RDD in DISK but it seems to be not working.
IS there any optimization technique that we can use to encounter such problem?
It is not a necessary condition that the cluster should have more memory than the dataset. However, that helps to increase the performance.
The persist to DISK_ONLY won't help if you have a single join. In case you are trying to multiple joins you would need to persist and count to force the DAG evaluation.
Anyway, the best way is to increase the Dataset partitions and shuflle partition (200 is default).
spark.sql.shuffle.partitions=5000
and then join.

Is my understanding of spark partitioning correct?

I'd like to know If my understanding of the partitioning in Spark is correct.
I always thought about the number of partitions and their size and never about the worker they were processed by.
Yesterday, as I was playing a bit with partitioning, I found out that I was able to track the cached partitions' location using the WEB UI (Storage -> Cached RDD -> Data Distribution) and it surprised me.
I have a cluster of 30 cores (3 cores * 10 executors) and I had a RDD of like 10 partitions. I tried to expand it to 100 partitions to increase the parallelism just to find out that almost 90% of the partitions were on the same worker node and thus my parallelism was not limited by the total number of cpu in my cluster but by the number of cpu of the node containing 90% of the partitions.
I tried to find answers on stackoverflow and the only answer I could come by was about data locality. Spark detected that most of my files were on this node so it decided to keep most of the partitions on this node.
Is my understanding correct?
And if it is, is there a way to tell Spark to really shuffle the data?
So far this "data locality" lead to heavy underutilization of my cluster....

Apache Spark running out of memory with smaller amount of partitions

I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.

Are the more partitions the better in Spark?

I am new in Spark and have a question
Are the more partitions the better in Spark ? If I have OOM issue, more partitions helps ?
Partitions determine the degree of parallelism.
Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
In case of very few partitions, all the cores in the cluster would not be utilized.
If there are too many partitions and data is small, then too many small tasks gets scheduled.
If your getting the out of memory issue, you would have to increase the executor memory. It should be a minimum of 8GB.

Resources