why are there so many partitions in a single-element RDD - apache-spark

The following code returns 16 partitions. How is it possible to have 16 partitions for an array with a single element?
rdd = sc.parallelize([""])
rdd.getNumPartitions()

The number of partitions in an RDD created by sc.parallelize depends on the scheduler implementation used.
The SchedulerBackend trait has this method:
def defaultParallelism(): Int
The CoarseGrainedSchedulerBackend (which is used by YARN) has this implementation:
override def defaultParallelism(): Int = {
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
LocalSchedulerBackend has the following implementation:
override def defaultParallelism(): Int =
scheduler.conf.getInt("spark.default.parallelism", totalCores)
That's why your RDD has 16 partitions: defaultParallelism resolves to 16 in your environment (spark.default.parallelism if it is set, otherwise the total core count).
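A quick way to see which value is being picked up (a small sketch; the numbers are only examples):
sc.defaultParallelism                         // 16 in your environment: spark.default.parallelism, or else the core count
sc.parallelize(Seq("")).getNumPartitions      // parallelize falls back to defaultParallelism
// Overriding the config changes the default, e.g.: spark-shell --conf spark.default.parallelism=4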

In the case of the parallelize API it depends on the cluster manager:
In local mode it is the total number of cores of your machine.
In Mesos fine-grained mode it is 8.
In YARN it is the total number of cores on all executor nodes, or 2, whichever is higher.
These are the defaults if you don't provide the number of partitions explicitly.
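For completeness, a minimal sketch of overriding the default by passing the slice count explicitly (shown in Scala to match the other snippets on this page; PySpark's sc.parallelize takes the same optional argument):
val byDefault = sc.parallelize(Seq(""))       // uses defaultParallelism, e.g. 16
byDefault.getNumPartitions
val explicit = sc.parallelize(Seq(""), 1)     // numSlices given explicitly
explicit.getNumPartitions                     // 1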

Yes, your RDD will have 16 partitions, but 15 of them will be empty. You can check this e.g. with rdd.mapPartitions (see Apache Spark: Get number of records per partition). The number 16 comes from spark.default.parallelism in your case and depends on your environment, not on the size of your data.
In general, empty partitions do not hurt; they finish very fast. You could also repartition or coalesce to 1 partition if you don't like empty partitions (see e.g. Dropping empty DataFrame partitions in Apache Spark), but I would not recommend that.
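A minimal sketch of that check (Scala; the PySpark equivalent is analogous), counting records per partition with mapPartitionsWithIndex:
val rdd = sc.parallelize(Seq(""))
// One (partitionIndex, recordCount) pair per partition.
val counts = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
counts.count(_._2 == 0)   // number of empty partitions: 15 out of 16 here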

Related

What happens with Spark partitions when using Spark-Cassandra-Connector

So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions of 96. I am using the Spark-Cassandra Connector 3.0.0 to do a repartitionByCassandraReplica followed by a joinWithCassandraTable, and then some SparkML analysis takes place. My question is: what eventually happens with the Spark partitions?
1st scenario
The partitionsPerHost parameter of repartitionByCassandraReplica is the number of selected Cassandra partition keys, which means that if I choose 4 partition keys I get 4 partitions per host. This gives me 64 Spark partitions because I have 16 hosts.
2nd scenario
But, according to the Spark Cassandra Connector documentation, information from the system.size_estimates table should be used to calculate the Spark partitions. For example, from my system.size_estimates:
estimated_table_size = mean_partition_size x number_of_partitions
= (24416287.87/1000000) MB x 332
= 8106.2 MB
spark_partitions = estimated_table_size / input.split.size_in_mb
= 8106.2 MB / 64 MB
= 126.6593 partitions
So, when does the 1st scenario take place and when the 2nd? Am I calculating something wrong? Are there specific cases where the 1st scenario happens and others where the 2nd does?
Those are two completely different paths by which the number of Spark partitions is calculated.
If you're calling repartitionByCassandraReplica(), the number of Spark partitions is determined by both partitionsPerHost and the number of Cassandra nodes in the local DC.
Otherwise, the connector will use input.split.size_in_mb to determine the number of Spark partitions based on the estimated table size. Cheers!
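For what it's worth, here is a rough sketch of the two paths using the connector's Scala RDD API; the keyspace, table, and key names are made up, and the exact option names may vary between connector versions:
import com.datastax.spark.connector._

case class Key(id: Int)                      // must mirror the table's partition key column
val keys = sc.parallelize(Seq(Key(1), Key(2), Key(3), Key(4)))

// 1st scenario: partitions = partitionsPerHost x number of Cassandra nodes in the local DC
val joined = keys
  .repartitionByCassandraReplica("my_ks", "my_table", partitionsPerHost = 4)
  .joinWithCassandraTable("my_ks", "my_table")

// 2nd scenario: a full table scan, split by estimated table size / spark.cassandra.input.split.sizeInMB
val scanned = sc.cassandraTable("my_ks", "my_table")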

Is it possible to do repartition after using partitionBy in a spark DF?

I am asking this question because if I specify repartition as 5, then all my data (>200 GB) is moved to 5 different executors and 98% of the resources are unused. Then the partitionBy happens, which again creates a lot of shuffle. Is there a way that the partitionBy happens first and then the repartition runs on the data?
Although the question is not entirely easy to follow, the following aligns with the other answer, and this approach should avoid the issues mentioned around unnecessary shuffling:
val n = [... some calculation for number of partitions / executors based on cluster config and volume of data to process ...]
df.repartition(n, $"field_1", $"field_2", ...)
  .sortWithinPartitions("field_x", "field_y")
  .write
  .partitionBy("field_1", "field_2", ...)
  .save("location")
where [field_1, field_2, ...] is the same set of fields for both repartition and partitionBy.
You can use repartition(5, col("colName")).
Then, when you do partitionBy("colName"), you skip the shuffle for colName since it has already been repartitioned by that column.
Also consider having as many partitions as the number of executors times the number of cores per executor times 3 (this factor may vary between 2 and 4).
So as we know, Spark can only run 1 concurrent task for every partition of an RDD. Assuming you have 8 cores per executor and 5 executors:
You need to have: 8 * 5 * 3 = 120 partitions
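As a rough sketch of that sizing heuristic (the executor and core counts are the example numbers above; the DataFrame, column name, and output path are purely illustrative):
import org.apache.spark.sql.functions.col

val executors = 5                        // example numbers from this answer
val coresPerExecutor = 8
val factor = 3                           // commonly between 2 and 4
val numPartitions = executors * coresPerExecutor * factor   // 120

val df = spark.range(0, 1000).withColumn("colName", col("id") % 10)  // toy DataFrame
df.repartition(numPartitions, col("colName"))
  .write
  .partitionBy("colName")
  .save("/tmp/partitioned_output")       // illustrative output path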

When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions regarding this that I cannot completely understand after reading different resources.
Spark will use a sensible default (1 partition per block, which is 64 MB in early versions and 128 MB now) when generating an RDD. But I also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10 GB .parquet file, and 3 executors with 2 cores and 3 GB memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example :
val rdd = sc.textFile("file.txt", 5)
The above line of code creates an RDD named rdd with 5 partitions.
Suppose you have a cluster with 4 cores, and assume that each partition takes 5 minutes to process. For the above RDD with 5 partitions, 4 partition processes will run in parallel, as there are 4 cores, and the 5th partition will be processed after 5 minutes, when one of the 4 cores is free.
The entire processing will complete in 10 minutes, and while the 5th partition is being processed, the resources (the remaining 3 cores) will be idle.
The best way to decide on the number of partitions in an RDD is to make the number of partitions equal to the number of cores in the cluster, so that all the partitions are processed in parallel and the resources are utilized in an optimal way.
Question: Are all data types (i.e. .txt, .parquet, etc.) repartitioned by default if no partitioning is provided?
There will be a default number of partitions for every RDD.
To check it, you can use rdd.partitions.length right after the RDD is created.
To use the existing cluster resources optimally and to speed things up, we have to consider repartitioning so that all cores are utilized and all partitions have enough records, uniformly distributed.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note: there is no fixed formula for this. The general convention most people follow is
(number of executors * number of cores) * factor, where the factor is usually 2 or 3.
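A small sketch of checking the default and then repartitioning to a multiple of the core count (the file path is illustrative):
val rdd = sc.textFile("hdfs:///data/events.txt")
println(rdd.partitions.length)                 // default: roughly one partition per block

val target = sc.defaultParallelism * 3         // 2-3x the cores, per the convention above
val rebalanced = rdd.repartition(target)
println(rebalanced.partitions.length)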

SPARK - assign multiple cores to one task in RDD.map in pyspark

I am new to Spark and I'm trying to use RDD.map in pyspark to parallelize runs of a method named function in the Spark framework (72 cores in total in a standalone Spark cluster: one driver with 100 GB RAM and 3 workers, each with 24 cores and 100 GB RAM).
My goal is to run function 200 times and average over the results. The output of function is a numpy.array of size 12 by num_of_samples (which is huge in terms of memory).
My first attempt was to create an RDD of size 200, then use RDD.map and reduce at the end:
sum_data = sc.parallelize(range(0,200)).map(function).reduce(lambda x,y:x+y)
Despite the fact that I set the Spark driver memory to the maximum, it runs out of memory at the reduce step (I guess due to the huge numpy.array output of the function). I figured the maximum number of elements that I can put into my RDD in order to avoid the memory error is about 40:
sum_data = sc.parallelize(range(0,40)).map(function).reduce(lambda x,y:x+y)
Now when I try this, I see that Spark creates 40 tasks and assigns exactly one core to each of them (using only 40 of the 72 available cores in the cluster). So the other 32 cores are idle and unused, resulting in a much slower run time. I was wondering whether this approach is correct and how I can make RDD.map consume all the available cores instead of using one core for each mapping.
I think this can be achieved by specifying the number of partitions that Spark should divide your RDD into.
The simplest way to do this is to add the optional numSlices parameter to the parallelize call; this ensures that Spark splits your data into numSlices partitions, and I think it will then use all the cores.
Please refer to the official documentation for more information.
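A sketch of that suggestion, in Scala to match the other snippets on this page (PySpark's sc.parallelize accepts the same optional numSlices argument; simulate stands in for your function):
def simulate(i: Int): Double = i.toDouble      // placeholder for the user's workload

val numSlices = 72                             // total cores in the cluster
val sumData = sc
  .parallelize(0 until 200, numSlices)         // 200 runs spread across 72 partitions
  .map(simulate)
  .reduce(_ + _)
val average = sumData / 200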

Spark Streaming Dynamic Allocation ExecutorAllocationManager

We have a Spark 2.1 streaming application with a mapWithState, with spark.streaming.dynamicAllocation.enabled=true. The pipeline is as follows:
var rdd_out = ssc.textFileStream()
.map(convertToEvent(_))
.combineByKey(...., new HashPartitioner(partitions))
.mapWithState(stateSpec)
.map(s => sessionAnalysis(s))
.foreachRDD( rdd => rdd.toDF().....save(output));
The streaming app starts with 2 executors and after some time creates new executors as the load increases, as expected. The problem is that the load is not shared by those executors.
The number of partitions is big enough to spill over to the new executors, and the key is equally distributed. I set it up with 40+ partitions, but I can see only 8 partitions (2 executors x 4 cores each) in the mapWithState storage. I expected that when new executors are allocated, those 8 partitions would get split and assigned to the new ones, but this never happens.
Please advise.
Thanks,
Apparently the answer was staring me in the face all along :). RDDs, as per the documentation below, should inherit the upstream partitioning.
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
The state inside mapWithState, however, does not have an upstream RDD, so it is set to the default parallelism unless you specify the number of partitions directly on the StateSpec, as in the example below.
val stateSpec = StateSpec.function(crediting.formSession _)
.timeout(timeout)
.numPartitions(partitions) // <----------
var rdd_out = ssc.textFileStream()
.map(convertToEvent(_))
.combineByKey(...., new HashPartitioner(partitions))
.mapWithState(stateSpec)
.map(s => sessionAnalysis(s))
.foreachRDD( rdd => rdd.toDF().....save(output));
I still need to figure out how to make the number of partitions dynamic, since with dynamic allocation this should change at runtime.
