Why Spark generate many MapPartitionsRDDs within a simple JDBC transformation? - apache-spark

I'm learning Spark's data linegae, I wrote a simple JDBC read transformation as bellow, and use RDD.toDebugString to get the data lineage.
val paramDf = spark.read.jdbc(url, "(select * from tb limit 500) t", connectionProperties)
System.out.println(paramDf.rdd.toDebugString)
I found that, the result DataFrame's dependencies have 5 RDDs, as bellow
(1) MapPartitionsRDD[4] at rdd at SparkApp.scala:25 []
| SQLExecutionRDD[3] at rdd at SparkApp.scala:25 []
| MapPartitionsRDD[2] at rdd at SparkApp.scala:25 []
| MapPartitionsRDD[1] at rdd at SparkApp.scala:25 []
| JDBCRDD[0] at rdd at SparkApp.scala:25 []
Why Spark generate so many MapPartitionsRDDs within a simple JDBC transformation?

Related

When does a RDD lineage is created? How to find lineage graph?

I am learning Apache Spark and trying to get the lineage graph of the RDDs.
But i could not find when does a particular lineage is created?
Also, where to find the lineage of an RDD?
RDD Lineage is the logical execution plan of a distributed computation that is created and expanded every time you apply a transformation on any RDD.
Note the part "logical" not "physical" that happens after you've executed an action.
Quoting Mastering Apache Spark 2 gitbook:
RDD Lineage (aka RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan.
A RDD lineage graph is hence a graph of what transformations need to be executed after an action has been called.
Any RDD has a RDD lineage even if that means that the RDD lineage is just a single node, i.e. the RDD itself. That's because an RDD may or may not be a result of a series of transformations (and no transformations is a "zero-effect" transformation :))
You can check out the RDD lineage of an RDD using RDD.toDebugString:
toDebugString: String A description of this RDD and its recursive dependencies for debugging.
val nums = sc.parallelize(0 to 9)
scala> nums.toDebugString
res0: String = (8) ParallelCollectionRDD[0] at parallelize at <console>:24 []
val doubles = nums.map(_ * 2)
scala> doubles.toDebugString
res1: String =
(8) MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
val groups = doubles.groupBy(_ < 10)
scala> groups.toDebugString
res2: String =
(8) ShuffledRDD[3] at groupBy at <console>:25 []
+-(8) MapPartitionsRDD[2] at groupBy at <console>:25 []
| MapPartitionsRDD[1] at map at <console>:25 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []

Apache Spark DAG behaviour cogrouped operation

I would like some clarifications about the DAG behaviour, and how exactly has been handle the following job:
val rdd = sc.parallelize(List(1 to 10).flatMap(x=>x).zipWithIndex,3)
.partitionBy(new HashPartitioner(4))
val rdd1 = sc.parallelize(List(1 to 10).flatMap(x=>x).zipWithIndex,2)
.partitionBy(new HashPartitioner(3))
val rdd2 = rdd.join(rdd1)
rdd2.collect()
This is the related rdd2.toDebugString:
(4) MapPartitionsRDD[6] at join at IntegrationStatusJob.scala:92 []
| MapPartitionsRDD[5] at join at IntegrationStatusJob.scala:92 []
| CoGroupedRDD[4] at join at IntegrationStatusJob.scala:92 []
| ShuffledRDD[1] at partitionBy at IntegrationStatusJob.scala:90 []
+-(3) ParallelCollectionRDD[0] at parallelize at IntegrationStatusJob.scala:90 []
+-(3) ShuffledRDD[3] at partitionBy at IntegrationStatusJob.scala:91 []
+-(2) ParallelCollectionRDD[2] at parallelize at IntegrationStatusJob.scala:91 []
This is the spark UI image:
Looking at the toDebugString and at the spark UI, if I understood well, in order to perform the join, the DAG looks at what partitioner should be used and because both rdds are HashPartitioned,it choose the partitioner with the greater number of partitions, so rdd partitioner.
Now from the spark UI, it seems that rdd partitionBy and join being performed in the same stage, so under this conditions, the shuffle needed for to perform the join, will be done just from one side? From one side, I mean that just the rdd1 will be shuffled and no both.
Is my assumption correct?
You right. If both RDDs are partitioned using different partitioner Spark will pick one as a reference and reparation / shuffle only the second one.
If both have the same partitioner there is no need for a shuffle.

When does DAG gets created in Spark?

My Code:
scala> val records = List( "CHN|2", "CHN|3" , "BNG|2","BNG|65")
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65)
scala> val recordsRDD = sc.parallelize(records)
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23
scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25
scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1)))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27
scala> keyvalueRDD.count
res12: Long = 5
As you can see above there are 3 RDD's created.
My question is When does DAG gets created and What a DAG contains ?
Does it get created when we create a RDD using any transformation?
or
Does it created when we call a Action on existing RDD and then spark automatically launch that DAG?
Basically I want to know what happens internally when a RDD gets created?
DAG is created when job is executed (when you call an action) and it contains all required dependencies to distributed tasks.
DAG is not executed. Based on DAG Spark determines tasks which are distributed to the workers and executed.
RDD alone defines lineage by traversing recursively dependencies.

How does Spark compute number of partitions?

My /accounts/* directory has 7 files with size of each file less than block size.
I want to know how Spark computes Partition. 2nd Argument to "textFile" method is hint to Spark for number of Partitions but is there any logic based on which it decides number of Partitions.
For 10 as input, it gives 15 partition, For 20 as input, it gives 25 partitions
How this is being computed?
Regards!
scala> var accounts= sc.textFile("/accounts/*",3)
scala> accounts.toDebugString
15/10/12 02:41:45 INFO mapred.FileInputFormat: Total input paths to process : 7
res0: String =
(7) /accounts/* MapPartitionsRDD[1] at textFile at <console>:21 []
| /accounts/* HadoopRDD[0] at textFile at <console>:21 []
scala> var accounts= sc.textFile("/accounts/*",10)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String =
(15) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
| /accounts/* HadoopRDD[2] at textFile at <console>:21 []
scala> var accounts= sc.textFile("/accounts/*",20)
scala> accounts.toDebugString
15/10/12 02:42:01 INFO mapred.FileInputFormat: Total input paths to process : 7
res1: String =
(23) /accounts/* MapPartitionsRDD[3] at textFile at <console>:21 []
| /accounts/* HadoopRDD[2] at textFile at <console>:21 []
Spark does not compute the number of partitions. It just passes the hint on to the Hadoop library. What does Hadoop do with it? It depends. Look at the documentation (or more likely the code) of the specific InputFormat's getSplits method.
For example for TextInputFormat the code is in FileInputFormat.getSplits. It's very complicated and depends on several configuration parameters.
Generally, when reading from HDFS spark will create an RDD with a partition per HDFS block.
More details on how the partitions flow through the pipeline and what you can tune here.

How does lineage get passed down in RDDs in Apache Spark

Do each RDD point to the same lineage graph
or
when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well so both the parent and child have different graphs. In this case isn't it memory intensive?
Each RDD maintains a pointer to one or more parent along with the metadata about what type of relationship it has with the parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference (and never copies) to its parent a, that's a lineage.
And when the driver submits the job, the RDD graph is serialized to the worker nodes so that each of the worker nodes apply the series of transformations (like, map filter and etc..) on different partitions. Also, this RDD lineage will be used to recompute the data if some failure occurs.
To display the lineage of an RDD, Spark provides a debug method toDebugString() method.
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
.map(words => (words(0), 1))
.reduceByKey{(a,b) => a + b}
Executing toDebugString() on splitedLines RDD, will output the following,
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
For more information about how Spark works internally, please read my another post
When a transformation(map or filter etc) is called, it is not executed by Spark immediately, instead a lineage is created for each transformation.
A lineage will keep track of what all transformations has to be applied on that RDD,
including the location from where it has to read the data.
For example, consider the following example
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() do not get executed immediately,
it will be executed only when an Action is called on the RDD - here filteredRdd.count().
An Action is used to either save result to some location or to display it.
RDD lineage information can also be printed by using the command filteredRdd.toDebugString(filteredRdd is the RDD here).
Also, DAG Visualization shows the complete graph in a very intuitive manner as follows:

Resources