Spark creating a number of partitions in an RDD greater than the data size - apache-spark

I am a noob and learning PySpark now. My question about RDDs is: what happens when we try to create more partitions than there are elements in the data? E.g.,
data = sc.parallelize(range(5), numSlices=8)
I understand the intention of partitions is to use the CPU cores of a cluster effectively, and that creating too many small partitions adds scheduling overhead rather than any benefit from distributed computing. What I am curious about is: does Spark still create 8 partitions here, or does it optimize it to the number of cores? If it creates 8 partitions, is the data replicated in each partition?

My question about RDD is what happens when we try to create more partitions than the data size
You can easily see how many partitions a given RDD has by using data.getNumPartitions. I tried creating the RDD you mentioned and running this command, and it shows me there are 8 partitions: four partitions had one number each and the remaining four were empty.
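For example, a quick check in the Scala shell (a minimal sketch; I use 4 elements here so it matches the distribution described above):
val data = sc.parallelize(1 to 4, 8)
println(data.getNumPartitions) // prints 8 — Spark keeps all requested partitions, it does not shrink to the core count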
If it's creating 8 partitions then there is data replication in each partition?
You can try the following code and check the executor output to see how many records are in each partition. Note the first print statement in the code below. I have to return something, as required by the API, so I return each element multiplied by 2. (The partition iterator has to be materialized before taking its length; otherwise the subsequent map would see an exhausted iterator.)
data.mapPartitionsWithIndex((x, y) => { val elems = y.toList; println(s"partition $x has ${elems.length} records"); elems.map(a => a * 2).iterator }).collect.foreach(println)
I got the following output for the above code:
partition 0 has 0 records
partition 1 has 1 records
partition 2 has 0 records
partition 3 has 1 records
partition 4 has 0 records
partition 5 has 1 records
partition 6 has 0 records
partition 7 has 1 records
What I am curious about is does Spark still create 8 partitions here or optimize it to the number of cores?
The number of partitions defines how much data you want Spark to process in one task. If there are 8 partitions and 4 virtual cores, Spark will start running 4 tasks (corresponding to 4 partitions) at once. As these tasks finish, it will schedule the remaining ones on those cores.
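As a hedged sketch (Scala shell), glom() gathers each partition's contents into an array, which makes the empty and non-empty partitions visible without consuming the partition iterators:
val data = sc.parallelize(1 to 4, 8)
data.glom().collect().zipWithIndex.foreach { case (elems, i) =>
  println(s"partition $i: ${elems.mkString(", ")}") // empty partitions print nothing after the colon
}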

Related

Is it possible to do repartition after using partitionBy in a spark DF?

I am asking this question because if I specify repartition as 5, then all my data (>200 GB) is moved to 5 different executors and 98% of the resources are unused. Then the partitionBy happens, which again creates a lot of shuffle. Is there a way for the partitionBy to happen first and then for the repartition to run on the data?
Although the question is not entirely easy to follow, the following aligns with the other answer, and this approach should avoid the unnecessary shuffling mentioned:
val n = [... some calculation for number of partitions / executors based on cluster config and volume of data to process ...]
df.repartition(n, $"field_1", $"field_2", ...)
.sortWithinPartitions("field_x", "field_y")
.write.partitionBy("field_1", "field_2", ...)
.format("parquet") // e.g.
.save("location")
whereby [field_1, field_2, ...] are the same set of fields for repartition and partitionBy.
You can use repartition(5, col("$colName")).
Thus, when you then call partitionBy("$colName"), you skip the shuffle for '$colName' since it has already been repartitioned by it.
Also consider having as many partitions as the number of executors multiplied by the number of cores in use, multiplied by 3 (though this factor may vary between 2 and 4).
As we know, Spark can only run 1 concurrent task for every partition of an RDD. Assuming you have 8 cores per executor and 5 executors:
You need to have: 8 * 5 * 3 = 120 partitions
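As a hedged sketch, the same rule of thumb in code (the executor and core counts are the assumed values from the example above; the field names are illustrative):
import org.apache.spark.sql.functions.col

val executors = 5
val coresPerExecutor = 8
val factor = 3 // rule of thumb above; may vary between 2 and 4
val n = executors * coresPerExecutor * factor // 120 partitions
df.repartition(n, col("field_1"), col("field_2"))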

How to handle Spark Executors when the number of partitions does not match the number of Executors?

Let's say I have 3 executors and 4 partitions, and we assume these numbers cannot be changed.
This is not an efficient setup, because we have to read in 2 passes: in the first pass, we read 3 partitions; and in the second pass, we read 1 partition.
Is there a way in Spark that we can improve the efficiency without changing the number of executors and partitions?
In your scenario you need to update the number of cores.
In Spark, each partition is taken up for execution by one task. Since you have 3 executors and 4 partitions, if you assume you have 3 cores in total (i.e., one core per executor), then 3 partitions of data will run in parallel, and the fourth partition will be picked up once a core on one of the executors is free. To remove this latency we need to increase spark.executor.cores to 2, i.e., each executor can run 2 threads, and hence 2 tasks, at a time.
Then all your partitions will be executed in parallel, but there is no guarantee of the distribution: either 1 executor runs 2 tasks and the other 2 executors run one task each, or 2 executors run 2 tasks each on individual partitions while one executor stays idle.
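A minimal sketch of setting this when building the session (assuming your cluster manager honors these properties at session creation; on YARN they are typically passed at submit time, e.g. --executor-cores 2):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-demo") // app name is illustrative
  .config("spark.executor.instances", "3") // 3 executors, as in the scenario
  .config("spark.executor.cores", "2")     // 2 tasks per executor => all 4 partitions run in parallel
  .getOrCreate()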

When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions regarding this that I cannot completely understand after reading different resources.
Spark will use a sensible default (1 partition per block, which was 64 MB in early versions and is now 128 MB) when generating an RDD. But I also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10 GB .parquet file, 3 executors with 2 cores and 3 GB memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (i.e., .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example :
val rdd = sc.textFile("file.txt", 5)
The above line of code will create an RDD named rdd with 5 partitions.
Suppose you have a cluster with 4 cores, and assume that each partition takes 5 minutes to process. In the case of the above RDD with 5 partitions, 4 partition processes will run in parallel, as there are 4 cores, and the 5th partition will be processed after 5 minutes, when one of the 4 cores is free.
The entire processing will be completed in 10 minutes, and while the 5th partition is being processed, the resources (the remaining 3 cores) will be idle.
The best way to decide on the number of partitions in an RDD is to make the number of partitions equal to the number of cores in the cluster, so that all the partitions process in parallel and the resources are utilized in an optimal way.
Question: Are all data types (i.e., .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
There is a default number of partitions for every RDD.
To check it, you can use rdd.partitions.length right after the RDD is created.
To use the existing cluster resources optimally and to speed things up, we have to consider repartitioning, to ensure that all cores are utilized and all partitions have enough records, uniformly distributed.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note: there is no fixed formula for this. The general convention most people follow is:
(number of executors * number of cores) * replication factor (which may be 2 to 3 times or more)
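As a rough sketch for the 10 GB Parquet example above (3 executors * 2 cores, with an assumed factor of 2; the path is illustrative):
val df = spark.read.parquet("/path/to/10gb-file.parquet")
println(df.rdd.partitions.length) // the default number of partitions Spark chose
val tuned = df.repartition(3 * 2 * 2) // executors * cores * factor = 12 partitions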

Kafka topic partitions to Spark streaming

I have some use cases that I would like to be more clarified, about Kafka topic partitioning -> spark streaming resource utilization.
I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic -> the RDD will have the same number of partitions as Kafka when I use the spark-kafka direct stream integration.
So if I have 1 partition in the topic, and 1 executor core, that core will sequentially read from Kafka.
What happens if I have:
2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second one, so there will be no benefit in partitioning the topic?
2 partitions in the topic and 2 cores? Will 1 executor core then read from 1 partition, and the second core from the second partition?
1 kafka partition and 2 executor cores?
Thank you.
The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it's less than the number of partitions, Spark will have threads read from one partition then the other. So:
2 partitions, 1 executor: reads from one partition, then the other. (I am not sure how Spark decides how much to read from each before switching.)
2p, 2c: parallel execution
1p, 2c: one thread is idle
For case #1, note that having more partitions than executors is OK, since it allows you to scale out later without having to re-partition. The trick is to make sure that your partition count is evenly divisible by the number of executors. Spark has to process all the partitions before passing data on to the next step in the pipeline, so if you have 'remainder' partitions, this can slow down processing. For example, with 5 partitions and 4 threads, processing takes the time of 2 partitions: 4 run at once, then one thread runs the 5th partition by itself.
Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
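A minimal sketch of that last point (the pair RDD here is a stand-in for a 2-partition Kafka batch; names are illustrative):
val pairs = sc.parallelize(Seq(("a", 1L), ("b", 1L), ("a", 1L)), 2) // stand-in for a 2-partition batch
val counts = pairs.reduceByKey(_ + _, numPartitions = 2) // keep the shuffle output at the same partition count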

spark creating too many partitions

I have a 3-node Cassandra cluster with 1 seed node, and 1 Spark master and 3 slave nodes with 8 GB RAM and 2 cores each. Here is the input to my Spark jobs:
spark.cassandra.input.split.size_in_mb 67108864
When I run with this configuration set, I see that around 768 partitions are created, with around 89.1 MB of data, roughly 1706765 records. I am not able to understand why so many partitions are created. I am using Cassandra Spark connector version 1.4, so the bug regarding input split size is also fixed.
There are only 11 unique partition keys. My partition key is an appname, which is always "test", plus a random number, which is always from 0-10, so there are only 11 different unique partitions.
Why are so many partitions created, and how does Spark decide how many partitions to create?
The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) for an estimate of how many MB of data are in the given table. This amount is read and divided by the input split size to determine the number of splits to make.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size
If you are on C* < 2.1.5 you will need to manually set the partitioning via a ReadConf.
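A hedged sketch of that manual route (I am going from memory of the connector's ReadConf fields; treat the exact signature as an assumption and check the docs for your connector version):
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

val rdd = sc.cassandraTable("my_keyspace", "my_table") // keyspace/table names are illustrative
  .withReadConf(ReadConf(splitCount = Some(24)))       // force 24 splits instead of relying on the size estimate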
