My /accounts/* directory has 7 files, each smaller than the HDFS block size.
I want to know how Spark computes the number of partitions. The second argument to the "textFile" method is a hint to Spark for the number of partitions, but is there any logic by which it decides the actual number of partitions?
For 10 as input, it gives 15 partitions; for 20 as input, it gives 23 partitions.
How is this being computed?
Regards!
scala> var accounts= sc.textFile("/accounts/*",3)
scala> accounts.toDebugString
15/10/12 02:41:45 INFO mapred.FileInputFormat: Total input paths to process : 7
res0: String =
(7) /accounts/* MapPartitionsRDD[1] at textFile at <console>:21 []
| /accounts/* HadoopRDD[0] at textFile at <console>:21 []
scala> var accounts= sc.textFile("/accounts/*",10)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String =
(15) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
| /accounts/* HadoopRDD[2] at textFile at <console>:21 []
scala> var accounts= sc.textFile("/accounts/*",20)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String =
(23) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
| /accounts/* HadoopRDD[2] at textFile at <console>:21 []
Spark does not compute the number of partitions. It just passes the hint on to the Hadoop library. What does Hadoop do with it? It depends. Look at the documentation (or more likely the code) of the specific InputFormat's getSplits method.
For example, for TextInputFormat the code is in FileInputFormat.getSplits. It is fairly involved and depends on several configuration parameters.
Generally, when reading from HDFS, Spark will create an RDD with one partition per HDFS block.
More details on how the partitions flow through the pipeline, and what you can tune, are here.
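To make the arithmetic concrete, here is a rough, simplified sketch of the split-size computation in the old-API org.apache.hadoop.mapred.FileInputFormat.getSplits; the file sizes and block size below are made-up assumptions, and the real code also handles unsplittable files, per-file minimum sizes, and data locality:

object SplitSketch {
  // numSplitsHint is the second argument you pass to sc.textFile.
  def estimateSplits(fileSizes: Seq[Long], blockSize: Long,
                     minSplitSize: Long, numSplitsHint: Int): Int = {
    val totalSize  = fileSizes.sum
    val goalSize   = totalSize / math.max(numSplitsHint, 1)
    // Each split aims for goalSize but is clamped to [minSplitSize, blockSize].
    val splitSize  = math.max(minSplitSize, math.min(goalSize, blockSize))
    val SPLIT_SLOP = 1.1 // the last chunk of a file may be up to 10% larger

    fileSizes.map { len =>
      var remaining = len
      var splits    = 0
      while (remaining.toDouble / splitSize > SPLIT_SLOP) {
        splits += 1
        remaining -= splitSize
      }
      if (remaining > 0) splits += 1
      splits
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // 7 hypothetical files, each smaller than a 128 MB block.
    val mb    = 1024L * 1024
    val files = Seq(60L, 55L, 70L, 40L, 65L, 50L, 45L).map(_ * mb)
    println(estimateSplits(files, 128 * mb, minSplitSize = 1, numSplitsHint = 10))
    println(estimateSplits(files, 128 * mb, minSplitSize = 1, numSplitsHint = 20))
  }
}

Because splits never cross file boundaries, each file contributes at least one split, and a smaller goalSize makes each file contribute several, which is why a hint of 10 or 20 can end up as more partitions than you asked for.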
Related
I'm learning Spark's data lineage. I wrote a simple JDBC read transformation as below and used RDD.toDebugString to get the data lineage.
val paramDf = spark.read.jdbc(url, "(select * from tb limit 500) t", connectionProperties)
System.out.println(paramDf.rdd.toDebugString)
I found that the resulting DataFrame's dependencies contain 5 RDDs, as below:
(1) MapPartitionsRDD[4] at rdd at SparkApp.scala:25 []
| SQLExecutionRDD[3] at rdd at SparkApp.scala:25 []
| MapPartitionsRDD[2] at rdd at SparkApp.scala:25 []
| MapPartitionsRDD[1] at rdd at SparkApp.scala:25 []
| JDBCRDD[0] at rdd at SparkApp.scala:25 []
Why does Spark generate so many MapPartitionsRDDs for a simple JDBC transformation?
I am learning Apache Spark and trying to get the lineage graph of the RDDs.
But I could not find out when a particular lineage is created.
Also, where can I find the lineage of an RDD?
RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD.
Note the word "logical", not "physical": the physical plan only comes into play after you've executed an action.
Quoting Mastering Apache Spark 2 gitbook:
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
An RDD lineage graph is hence a graph of the transformations that need to be executed after an action has been called.
Every RDD has an RDD lineage, even if that lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be the result of a series of transformations (and applying no transformations is a "zero-effect" transformation :)).
You can check out the RDD lineage of an RDD using RDD.toDebugString:
toDebugString: String A description of this RDD and its recursive dependencies for debugging.
val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []
val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
+-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
| MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
When I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22
scala> rdd.partitions.size
res9: Int = 10
scala> rdd.partitioner.isDefined
res10: Boolean = true
scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a
It says that there are 10 partitions and partitioning is done using a HashPartitioner. But when I execute the command below:
scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false
It says that there are 4 partitions and no partitioner is defined. So, what is the default partitioning scheme in Spark? How is the data partitioned in the second case?
You have to distinguish between two different things:
partitioning as distributing data between partitions depending on the value of the key, which is limited to PairwiseRDDs (RDD[(T, U)]). This creates a relationship between a partition and the set of keys that can be found on that partition.
partitioning as splitting the input into multiple partitions, where data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based either on the number of records or on the size of a chunk.
In the case of parallelize, data is evenly distributed between partitions using indices. In the case of HadoopInputFormats (like textFile), it depends on properties like mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
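For the parallelize case, the even distribution boils down to cutting the index range 0 until n into contiguous, roughly equal slices. A minimal sketch of that idea (simplified; not the actual ParallelCollectionRDD code):

// Cut n elements into numSlices contiguous half-open index ranges [start, end).
def slicePositions(n: Int, numSlices: Int): Seq[(Int, Int)] =
  (0 until numSlices).map { i =>
    val start = (i * n.toLong / numSlices).toInt
    val end   = ((i + 1) * n.toLong / numSlices).toInt
    (start, end)
  }

// 3 elements into 4 slices: one slice comes out empty, which matches
// partitions.size = 4 for the second RDD in the question.
println(slicePositions(3, 4)) // Vector((0,0), (0,1), (1,2), (2,3))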
So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations that require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.), the default method is hash partitioning.
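As a small sketch of what hash partitioning means for a pair RDD (this mirrors the non-negative-modulo idea behind HashPartitioner, not the actual Spark source):

// Assign a key to one of numPartitions buckets using a non-negative modulo.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0)
}

// The keys from the question, spread over 10 partitions:
Seq(1, 3, 3).foreach(k => println(s"key $k -> partition ${hashPartition(k, 10)}"))
// key 1 -> partition 1
// key 3 -> partition 3
// key 3 -> partition 3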
I am trying to understand how jobs and stages are defined in Spark, and for that I am now using the code that I found here together with the Spark UI. To see anything in the Spark UI, I had to copy and paste the text in the files several times so the job takes longer to process.
Here is the output of the Spark UI:
Now, I understand that there are three jobs because there are three actions, and also that the stages are generated because of shuffle operations. But what I don't understand is why stages 4, 5 and 6 of Job 1 are the same as stages 0, 1 and 2 of Job 0, and why the same happens for Job 2.
How can I know which stages will appear in more than one job just by looking at the Java code (before executing anything)? Also, why are stages 4 and 9 skipped, and how can I know that this will happen before executing?
I understand that there are three jobs because there are three actions
I'd even say that there could have been more Spark jobs but the minimum number is 3. It all depends on the implementation of transformations and the action used.
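For example (a hypothetical snippet, not the code from the question), some operations launch jobs of their own, so the total number of jobs can exceed the number of actions you call explicitly:

val pairs  = sc.parallelize(1 to 1000).map(i => (i % 7, i))
val sorted = pairs.sortByKey() // a transformation, but it runs a sampling job
                               // to compute the range-partitioner boundaries
sorted.take(5)                 // take() may launch one or more extra jobs
                               // as it scans partitions incrementally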
What I don't understand is why stages 4, 5 and 6 of Job 1 are the same as stages 0, 1 and 2 of Job 0, and the same happens for Job 2.
Job 1 is the result of some action that ran on an RDD, finalRdd. That RDD was created using (going backwards): join, distinct, map, and textFile.
val people = sc.textFile("people.csv").map { line =>
  val tokens = line.split(",")
  val key = tokens(2)
  (key, (tokens(0), tokens(1)))
}.distinct

val cities = sc.textFile("cities.csv").map { line =>
  val tokens = line.split(",")
  (tokens(0), tokens(1))
}
val finalRdd = people.join(cities)
Run the above and you'll see the same DAG.
Now, when you execute actions on the leftOuterJoin or rightOuterJoin results, you'll get the other two DAGs. You're reusing the previously-used RDDs to run new Spark jobs, so you'll see the same stages.
why are stages 4 and 9 skipped
Spark will often skip execution of some stages. The grayed-out stages were already computed, so Spark reuses their shuffle output instead of recomputing them, which improves performance.
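A minimal illustration of that reuse (a hypothetical snippet, not the code from the question): run a second action on an already-shuffled RDD and its map-side stage shows up as skipped, because the shuffle files written by the first job are still available:

val grouped = sc.parallelize(1 to 100).map(i => (i % 10, i)).groupByKey()
grouped.count() // job 0: both the map-side stage and the result stage run
grouped.count() // job 1: the map-side stage is skipped; its shuffle output is reused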
How can I know which stages will appear in more than one job just by looking at the Java code (before executing anything)?
That's what RDD lineage (graph) offers.
scala> people.leftOuterJoin(cities).toDebugString
res15: String =
(3) MapPartitionsRDD[99] at leftOuterJoin at <console>:28 []
| MapPartitionsRDD[98] at leftOuterJoin at <console>:28 []
| CoGroupedRDD[97] at leftOuterJoin at <console>:28 []
+-(2) MapPartitionsRDD[81] at distinct at <console>:27 []
| | ShuffledRDD[80] at distinct at <console>:27 []
| +-(2) MapPartitionsRDD[79] at distinct at <console>:27 []
| | MapPartitionsRDD[78] at map at <console>:24 []
| | people.csv MapPartitionsRDD[77] at textFile at <console>:24 []
| | people.csv HadoopRDD[76] at textFile at <console>:24 []
+-(3) MapPartitionsRDD[84] at map at <console>:29 []
| cities.csv MapPartitionsRDD[83] at textFile at <console>:29 []
| cities.csv HadoopRDD[82] at textFile at <console>:29 []
As you can see yourself, you will end up with 4 stages since there are 3 shuffle dependencies (the edges with the numbers of partitions).
The numbers in round brackets are the numbers of partitions that the DAGScheduler will eventually use to create task sets with exactly that many tasks, one TaskSet per stage.
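If you want a rough way to predict the stage count from code alone, you can walk the lineage yourself and count the shuffle dependencies. The sketch below assumes the people and cities RDDs defined earlier; note that shared lineage could be counted more than once:

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Count the shuffle dependencies reachable from an RDD's lineage.
def countShuffles(rdd: RDD[_]): Int =
  rdd.dependencies.map { dep =>
    val here = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) 1 else 0
    here + countShuffles(dep.rdd)
  }.sum

// stages = shuffle dependencies + 1 (the final result stage)
println(countShuffles(people.join(cities)) + 1) // 4 for the lineage above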