Default Partitioning Scheme in Spark - apache-spark

When I execute below command:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22
scala> rdd.partitions.size
res9: Int = 10
scala> rdd.partitioner.isDefined
res10: Boolean = true
scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner#a
It says that there are 10 partitions and partitioning is done using HashPartitioner. But When I execute below command:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false
It says that there are 4 partitions and partitioner is not defined. So, What is default Partitioning Scheme in Spark ? / How data is partitioned in second case?

You have to distinguish between two different things:
partitioning as distributing data between partitions depending on a value of the key which is limited only to the PairwiseRDDs (RDD[(T, U)]). This creates a relationship between partition and the set of keys which can be found on a given partition.
partitioning as splitting input into multiple partitions where data is simply divided into chunks containing consecutive records to enable distributed computation. Exact logic depends on a specific source but it is either number of records or size of a chunk.
In case of parallelize data is evenly distributed between partitions using indices. In case of HadoopInputFormats (like textFile) it depends on properties like mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
So default partitioning scheme is simply none because partitioning is not applicable to all RDDs. For operations which require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey etc.) default method is use hash partitioning.

Related

How does HashPartitioner distribute data in Spark? [duplicate]

When I execute below command:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22
scala> rdd.partitions.size
res9: Int = 10
scala> rdd.partitioner.isDefined
res10: Boolean = true
scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner#a
It says that there are 10 partitions and partitioning is done using HashPartitioner. But When I execute below command:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false
It says that there are 4 partitions and partitioner is not defined. So, What is default Partitioning Scheme in Spark ? / How data is partitioned in second case?
You have to distinguish between two different things:
partitioning as distributing data between partitions depending on a value of the key which is limited only to the PairwiseRDDs (RDD[(T, U)]). This creates a relationship between partition and the set of keys which can be found on a given partition.
partitioning as splitting input into multiple partitions where data is simply divided into chunks containing consecutive records to enable distributed computation. Exact logic depends on a specific source but it is either number of records or size of a chunk.
In case of parallelize data is evenly distributed between partitions using indices. In case of HadoopInputFormats (like textFile) it depends on properties like mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
So default partitioning scheme is simply none because partitioning is not applicable to all RDDs. For operations which require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey etc.) default method is use hash partitioning.

Apache Spark DAG behaviour cogrouped operation

I would like some clarifications about the DAG behaviour, and how exactly has been handle the following job:
val rdd = sc.parallelize(List(1 to 10).flatMap(x=>x).zipWithIndex,3)
.partitionBy(new HashPartitioner(4))
val rdd1 = sc.parallelize(List(1 to 10).flatMap(x=>x).zipWithIndex,2)
.partitionBy(new HashPartitioner(3))
val rdd2 = rdd.join(rdd1)
rdd2.collect()
This is the related rdd2.toDebugString:
(4) MapPartitionsRDD[6] at join at IntegrationStatusJob.scala:92 []
| MapPartitionsRDD[5] at join at IntegrationStatusJob.scala:92 []
| CoGroupedRDD[4] at join at IntegrationStatusJob.scala:92 []
| ShuffledRDD[1] at partitionBy at IntegrationStatusJob.scala:90 []
+-(3) ParallelCollectionRDD[0] at parallelize at IntegrationStatusJob.scala:90 []
+-(3) ShuffledRDD[3] at partitionBy at IntegrationStatusJob.scala:91 []
+-(2) ParallelCollectionRDD[2] at parallelize at IntegrationStatusJob.scala:91 []
This is the spark UI image:
Looking at the toDebugString and at the spark UI, if I understood well, in order to perform the join, the DAG looks at what partitioner should be used and because both rdds are HashPartitioned,it choose the partitioner with the greater number of partitions, so rdd partitioner.
Now from the spark UI, it seems that rdd partitionBy and join being performed in the same stage, so under this conditions, the shuffle needed for to perform the join, will be done just from one side? From one side, I mean that just the rdd1 will be shuffled and no both.
Is my assumption correct?
You right. If both RDDs are partitioned using different partitioner Spark will pick one as a reference and reparation / shuffle only the second one.
If both have the same partitioner there is no need for a shuffle.

How to effectively join large tables in SparkSql?

I am trying to improve performance on a join involving two large tables using SparkSql. From various sources, I figured that the RDDs need to be partitioned.
Source: https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by
However, when you load a file directly from a parquet file as given below, I am not sure how it can be created as a paired RDD!
With Spark 2.0.1, using “cluster by” has no effect.
val rawDf1 = spark.read.parquet(“file in hdfs”)
rawDf1 .createOrReplaceTempView(“rawdf1”)
val rawDf2 = spark.read.parquet(“file in hdfs”)
rawDf2 .createOrReplaceTempView(“rawdf2”)
val rawDf3 = spark.read.parquet(“file in hdfs”)
rawDf3 .createOrReplaceTempView(“rawdf3”)
val df1 = spark.sql(“select * from rawdf1 cluster by key)
df1 .createOrReplaceTempView(“df1”)
val df2 = spark.sql(“select * from rawdf2 cluster by key)
df2 .createOrReplaceTempView(“df2”)
val df3 = spark.sql(“select * from rawdf3 cluster by key)
df3 .createOrReplaceTempView(“df3”)
val resultDf = spark.sql(“select * from df1 a inner join df2 b on a.key = b.key inner join df3 c on a.key =c.key”)
Whether I use "cluster by" key or not, I still see the same query plan being generated by Spark. How can I create a rdd pair in spark sql so that joins can use tables that can be partitioned?
Without proper partitioning, a lot of shuffles are happening resulting in long delays.
Our configuration ( 5 worker nodes with 1 executor (5 cores per executor) each having 32 cores and 128 GB of RAM):
spark.cores.max 25
spark.default.parallelism 75
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.memory 60G
spark.rdd.compress True
spark.driver.maxResultSize 4g
spark.driver.memory 8g
spark.executor.cores 5
spark.executor.extraJavaOptions -Djdk.nio.maxCachedBufferSize=262144
spark.memory.storageFraction 0.2
To add more info: I am joining more than one table in the same select using the same key across all tables. So it is not possible to create a dataframe first to call repartitionby. I understand I can do this using dataframe api. But my question is how to accomplish this using plain sparksql.

How to remove empty partition in a dataframe?

I need to remove the empty partitions from a Dataframe
We are having two Dataframes, both are created using sqlContext. And the dataframes are constructed and combined as below
import org.apache.spark.sql.{SQLContext}
val sqlContext = new SQLContext(sc)
// Loading Dataframe 1
val csv1 = "s3n://xxxxx:xxxxxx#xxxx/xxx.csv"
val csv1DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)
// Loading Dataframe 2
val csv2 = "s3n://xxxxx:xxxxxx#xxxx/xxx.csv"
val csv2DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)
// Combining dataframes
val combinedDF = csv1.
join(csv2 csv1("column_1") === csv2("column_2"))
Now the number of partition for combinedDF is 200.
From here it is found that the default number of partition is 200 when we use joins.
In some cases the dataframe/csv is not big and getting many empty partition which causes issues later part of the code.
So how can I remove these empty partition created?
The repartition method can be used to create an RDD without any empty partitions.
This thread discusses the optimal number of partitions for a given cluster. Here is good rule of thumb for estimating the optimal number of partitions.
number_of_partitions = number_of_cores * 4
If you have a cluster of 8 r3.xlarge AWS nodes, you should use 128 partitions (8 nodes * 4 CPUs per node * 4 partitions per CPU).

Performing operations only on subset of a RDD

I would like to perform some transformations only on a subset of a RDD (to make experimenting in REPL faster).
Is it possible?
RDD has take(num: Int): Array[T] method, I think I'd need something similar, but returning RDD[T]
You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
...
res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.
RDDs are distributed collections which are materialized on actions only. It is not possible to truncate your RDD to a fixed size, and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect)
I you want to get similarly sized RDDs regardless of the input size, you can truncate items in each of your partitions - this way you can better control the absolute number of items in resulting RDD. Size of the resulting RDD will depend on spark parallelism.
An example from spark-shell:
import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = rdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))
millionRdd.count // 1000000
millionRddTruncated.count // 10000 = 10 item * 1000 partitions
billionRddTruncated.count // 10000 = 10 item * 1000 partitions
Apparently it's possible to create RDD subset by first using its take method and then passing returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism) which returns new RDD.
This approach seems dodgy to me though. Is there a nicer way?
I always use parallelize function of SparkContext to distribute from Array[T] but it seems makeRDD do the same. It's correct way both of them.

Resources