Wide Transformation in Spark RDD - apache-spark

Why does Spark create multiple stages for a wide transformation even if the data is present in one partition?
val a = sc.parallelize(Array("This","is","a","This","is","file"),1)
val b = a.map(x => (x,1))
val c = b.reduceByKey(_+_)
c.collect
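Note that reduceByKey is a wide transformation: it declares a shuffle dependency on its parent, and Spark cuts a new stage at every shuffle boundary regardless of how many partitions are involved, so even this single-partition RDD produces two stages. A small sketch (not from the original question) to see the boundary is to print the lineage of the shuffled RDD:
// indented blocks in toDebugString mark the stage boundary introduced by the shuffle
println(c.toDebugString)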

Related

Why is Spark running each task more than once

In my Spark application, I see the same task getting executed in multiple stages, even though the corresponding statements are defined only once in the code. Moreover, the same tasks take different times to execute in different stages. I understand that if an RDD is lost, its lineage is used to recompute it. How can I find out whether that is the case here, given that the same phenomenon was seen in every run of this application? Can someone please explain what is happening and under what conditions a task can get scheduled in multiple stages.
The code very much looks like the following:
val events = getEventsDF()
events.cache()
metricCounter.inc("scec", events.count())
val scEvents = events.filter(_.totalChunks == 1)
  .repartition(NUM_PARTITIONS, lit(col("eventId")))
val sortedEvents = events.filter(e => e.totalChunks > 1 && e.totalChunks <= maxNumberOfChunks)
  .map(PartitionUtil.createKeyValueTuple)
  .rdd
  .repartitionAndSortWithinPartitions(new EventDataPartitioner(NUM_PARTITIONS))
val largeEvents = events.filter(_.totalChunks > maxNumberOfChunks).count()
val mcEvents = sortedEvents.mapPartitionsWithIndex[CFEventLog](
  (index: Int, iter: Iterator[Tuple2]) => doSomething())
val mcEventsDF = session.sqlContext.createDataset[CFEventLog](mcEvents)
metricCounter.inc("mcec", mcEventsDF.count())
val currentDf = scEvents.unionByName(mcEventsDF)
val distinctDateHour = currentDf.select(col("eventDate"), col("eventHour"))
  .distinct
  .collect
val prevEventsDF = getAnotherDF(distinctDateHour)
val finalDf = currentDf.unionByName(prevEventsDF).dropDuplicates(Seq("eventId"))
finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")
val finalEventsCount = finalDf.count()
Does every count() action result in re-execution of the RDD transformations that precede the action?
Thanks,
Devj
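One way to avoid recomputation across the two actions on finalDf (a sketch, not from the original post; it assumes the result fits in the cache) is to persist it once, so saveAsTable materialises the cache and the later count is served from it instead of re-running the lineage:
finalDf.cache() // hypothetical addition to the snippet above
finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")
val finalEventsCount = finalDf.count() // reuses the cached data if it fit in memory
finalDf.unpersist()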

Perform join in Spark only on one coordinate of pair key?

I have 3 RDDs:
The 1st one is of the form ((a,b),c).
The 2nd one is of the form (b,d).
The 3rd one is of the form (a,e).
How can I perform a join in Scala over these RDDs such that my final output is of the form ((a,b),c,d,e)?
You can do something like this:
val rdd1: RDD[((A, B), C)]
val rdd2: RDD[(B, D)]
val rdd3: RDD[(A, E)]
// re-key the first RDD by a so it can be joined with rdd3
val tmp1 = rdd1.map { case ((a, b), c) => (a, (b, c)) }
// join on a, then re-key by b for the join with rdd2
val tmp2 = tmp1.join(rdd3).map { case (a, ((b, c), e)) => (b, (a, c, e)) }
// join on b and reshape to the requested ((a,b), c, d, e) form
val res = tmp2.join(rdd2).map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) }
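For reference, a concrete run of the same recipe (a sketch: it reuses the names above with the sample tuples from the question) would be:
val rdd1 = sc.parallelize(Seq((("a", "b"), "c")))
val rdd2 = sc.parallelize(Seq(("b", "d")))
val rdd3 = sc.parallelize(Seq(("a", "e")))
val tmp1 = rdd1.map { case ((a, b), c) => (a, (b, c)) }
val tmp2 = tmp1.join(rdd3).map { case (a, ((b, c), e)) => (b, (a, c, e)) }
val res = tmp2.join(rdd2).map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) }
res.collect() // Array(((a,b),c,d,e))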
With the current implementations of the join APIs for paired RDDs, it is not possible to use join conditions, and you would need conditions to get the desired result.
But you can use DataFrames/Datasets for the joins, where conditions are supported. If you want the result of the join as a DataFrame, you can proceed with that; if you want the results as RDDs, then .rdd can be used to convert the DataFrames/Datasets to RDD[Row].
Below is sample code showing how it can be done in Scala:
// creating three RDDs
val first = sc.parallelize(Seq((("a", "b"), "c")))
val second = sc.parallelize(Seq(("b", "d")))
val third = sc.parallelize(Seq(("a", "e")))
// converting RDDs to DataFrames (outside spark-shell, import spark.implicits._ first)
val firstdf = first.toDF("key1", "value1")
val seconddf = second.toDF("key2", "value2")
val thirddf = third.toDF("key3", "value3")
// UDF for the join condition: does the struct key contain the other key?
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def joinCondition = udf((strct: Row, key: String) => strct.toSeq.contains(key))
//joins with conditions
firstdf
.join(seconddf, joinCondition(firstdf("key1"), seconddf("key2"))) //joining first with second
.join(thirddf, joinCondition(firstdf("key1"), thirddf("key3"))) //joining first with third
.drop("key2", "key3") //dropping unnecessary columns
.rdd //converting dataframe to rdd
You should get output like:
[[a,b],c,d,e]
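If you need the result back as ((a,b),c,d,e) tuples rather than Row objects, one possible final mapping (a sketch; resultRdd stands for the RDD[Row] produced by the chain above, with columns key1, value1, value2, value3) is:
import org.apache.spark.sql.Row
// unpack each Row ([[a,b],c,d,e]) into the requested tuple shape
val tuples = resultRdd.map { case Row(key: Row, c: String, d: String, e: String) =>
  ((key.getString(0), key.getString(1)), c, d, e)
}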

When does the DAG get created in Spark?

My Code:
scala> val records = List( "CHN|2", "CHN|3" , "BNG|2","BNG|65")
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65)
scala> val recordsRDD = sc.parallelize(records)
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23
scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25
scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1)))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27
scala> keyvalueRDD.count
res12: Long = 4
As you can see above, three RDDs are created.
My question is: when does the DAG get created, and what does a DAG contain?
Does it get created when we create an RDD using any transformation?
Or
Does it get created when we call an action on an existing RDD, and Spark then automatically launches that DAG?
Basically, I want to know what happens internally when an RDD gets created.
The DAG is created when a job is executed (i.e., when you call an action), and it contains all the dependencies required to build the distributed tasks.
The DAG itself is not executed; based on the DAG, Spark determines the tasks, which are distributed to the workers and executed.
An RDD alone only defines its lineage, obtained by recursively traversing its dependencies.
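A small way to observe this with the RDDs above (a sketch, not from the original answer): the transformation returns immediately and only records lineage, while the action is what triggers DAG construction, stage splitting and task execution.
// nothing runs here: only the lineage (this RDD and its dependencies) is recorded
val lazyRDD = recordsRDD.map { elem => println(s"processing $elem"); elem.split("\\|") }
// the action builds the DAG for the job, splits it into stages and tasks, and runs them;
// only now do the "processing ..." lines appear (in local mode, in the driver console)
lazyRDD.count()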

Spark: Perform SQL function on RDD subsets

I need to perform a SQL function on RDD subsets of increasing size. For this I have to take subsets from the input RDD with the take function:
def main(args: Array[String]) {
  // set up environment
  val conf = new SparkConf()
    .setMaster("local[5]")
    .setAppName("Test")
    .set("spark.executor.memory", "4g")
  val sc = new SparkContext(conf)
  val cntPairsRdd = cntsRdd.map(n => {
    val sample = data0.take(n)
    val dataRDD = sc.parallelize(sample)
    val df = dataRDD.toDF()
    val result = df.select( ...)
    val xCnt = result.count
    (n, xCnt)
  })
}
cntsRdd is a set of increasing integers. The take function returns an array, not an RDD, so to make my SQL work I first need to convert the sample to an RDD and then to a DataFrame. Unfortunately, Spark does not allow creating another RDD inside a map function; in other words, in Spark you cannot create an RDD inside another RDD. For the same reason, Spark does not support serialization of the SparkContext, and I get a serialization exception when trying to call sc.parallelize(sample).
Please advise a workaround to perform a SQL function on RDD subsets, as defined in this scenario.
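One possible workaround (a sketch, not an answer from the original thread; data0, the counts and the select are stand-ins for the elided pieces of the question) is to run the loop on the driver rather than inside an RDD transformation: take() already brings the sample back to the driver, so the DataFrame can be built there without nesting RDDs or serializing the SparkContext.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[5]").appName("Test").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

val data0 = sc.parallelize(1 to 100) // stand-in for the question's data0
val cnts = Seq(10, 20, 50)           // stand-in for cntsRdd, kept on the driver

// driver-side loop: no RDD is created inside another RDD
val cntPairs = cnts.map { n =>
  val sample = data0.take(n)                       // Array, materialised on the driver
  val df = sc.parallelize(sample).toDF("value")
  val xCnt = df.filter($"value" % 2 === 0).count() // placeholder for the question's select(...)
  (n, xCnt)
}
cntPairs.foreach(println)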

How to build a lookup map in Spark Streaming?

What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
use a "union" operation to append to the lookup RDD and persist it after each union;
save the state in a file or database and load it at the start of each batch.
My question is: from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)
import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
  println(s">>> batchTime = $batchTime")
  println(s">>> key = $key")
  println(s">>> value = $value")
  println(s">>> state = $state")
  val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
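For completeness, the snippet above only wires up the pipeline; to actually run it (a usage sketch) the streaming context still has to be started:
ssc.start()
ssc.awaitTerminationOrTimeout(30 * 1000) // let the demo run for ~30 seconds
ssc.stop(stopSparkContext = false)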
