Why is Spark running each task more than once - apache-spark

In my Spark application, I see the same task getting executed in multiple stages, even though these statements are defined only once in the code. Moreover, the same tasks take different times to execute in different stages. I understand that when an RDD is lost, its lineage is used to recompute it. How can I find out whether that is the case here, given that the same phenomenon was seen in every run of this application? Can someone please explain what is happening, and under what conditions a task can get scheduled in multiple stages?
The code looks very much like the following:
val events = getEventsDF()
events.cache()
metricCounter.inc("scec", events.count())   // action 1: materializes events into the cache
val scEvents = events.filter(_.totalChunks == 1)
  .repartition(NUM_PARTITIONS, col("eventId"))
val sortedEvents = events.filter(e => e.totalChunks > 1 && e.totalChunks <= maxNumberOfChunks)
  .map(PartitionUtil.createKeyValueTuple)
  .rdd
  .repartitionAndSortWithinPartitions(new EventDataPartitioner(NUM_PARTITIONS))
val largeEvents = events.filter(_.totalChunks > maxNumberOfChunks).count()   // action 2
val mcEvents = sortedEvents.mapPartitionsWithIndex[CFEventLog](
  (index, iter) => doSomething())   // doSomething() stands in for the real per-partition logic
val mcEventsDF = session.sqlContext.createDataset[CFEventLog](mcEvents)
metricCounter.inc("mcec", mcEventsDF.count())   // action 3
val currentDf = scEvents.unionByName(mcEventsDF)
val distinctDateHour = currentDf.select(col("eventDate"), col("eventHour"))
  .distinct
  .collect   // action 4
val prevEventsDF = getAnotherDF(distinctDateHour)
val finalDf = currentDf.unionByName(prevEventsDF).dropDuplicates(Seq("eventId"))
finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")   // action 5
val finalEventsCount = finalDf.count()   // action 6
Is every count() action resulting in re-execution of the RDD transformations that precede it?
Thanks,
Devj
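
For what it's worth: each action (count, collect, saveAsTable) launches its own Spark job, and any DataFrame or RDD that is not cached is recomputed from its lineage by every action that uses it, which is why the same work shows up in multiple stages. A minimal sketch, reusing the names from the question, of how caching finalDf would avoid recomputing it for the final count:
val finalDf = currentDf.unionByName(prevEventsDF)
  .dropDuplicates(Seq("eventId"))
  .cache()                                // lazy: materialized by the first action, reused by later ones
finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")                   // first action: computes finalDf and populates the cache
val finalEventsCount = finalDf.count()    // second action: served from the cache, no recomputation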

Related

In Spark, does caching a DataFrame influence the execution time of previous stages?

I am running a Spark (2.0.1) job with multiple stages. I noticed that when I insert a cache() in one of the later stages, it changes the execution time of the earlier stages. Why? I've never encountered such a case in the literature when reading about caching.
Here is my DAG with cache():
And here is my DAG without cache(). All remaining code is the same.
I have a cache() after a sort merge join in Stage 10. If the cache() is used in Stage 10, then Stage 8 takes nearly twice as long (20 min vs 11 min) as it does without the cache() in Stage 10. Why?
Stage 8 contains two broadcast joins with small DataFrames and a shuffle on a large DataFrame in preparation for the merge join. Stages 8 and 9 are independent and operate on two different DataFrames.
Let me know if you need more details to answer this question.
UPDATE 8/2/2018
Here are the details of my Spark script:
I am running my job on a cluster via spark-submit. Here is my Spark session:
val spark = SparkSession.builder
  .appName("myJob")
  .config("spark.executor.cores", 5)
  .config("spark.driver.memory", "300g")
  .config("spark.executor.memory", "15g")
  .getOrCreate()
This creates a job with 21 executors with 5 cores each.
Load 4 DataFrames from parquet files:
val dfT = spark.read.format("parquet").load(filePath1) // 3 TB in 3185 partitions
val dfO = spark.read.format("parquet").load(filePath2) // ~700 MB
val dfF = spark.read.format("parquet").load(filePath3) // ~800 MB
val dfP = spark.read.format("parquet").load(filePath4) // 38 GB
Preprocessing on each of the DataFrames consists of column selection, dropDuplicates, and possibly a filter, like this:
val dfT1 = dfT.filter(...)
val dfO1 = dfO.select(columnsToSelect2).dropDuplicates(Array("someColumn2"))
val dfF1 = dfF.select(columnsToSelect3).dropDuplicates(Array("someColumn3"))
val dfP1 = dfP.select(columnsToSelect4).dropDuplicates(Array("someColumn4"))
Then I left-broadcast-join the first three DataFrames together:
val dfTO = dfT1.join(broadcast(dfO1), Seq("someColumn5"), "left_outer")
val dfTOF = dfTO.join(broadcast(dfF1), Seq("someColumn6"), "left_outer")
Since dfP1 is large I need to do a merge join, but I can't afford to do it yet; I need to limit the size of dfTOF first. To do that I add a new timestamp column via withColumn with a UDF that transforms a string into a timestamp:
val dfTOF1 = dfTOF.withColumn("TransactionTimestamp", myStringToTimestampUDF(col("rawTimestampString"))) // the UDF must be applied to the source string column; column name illustrative
Next I filter on a new timestamp column:
val dfTrain = dfTOF1.filter(dfTOF1("TransactionTimestamp").between("2016-01-01 00:00:00+000", "2016-05-30 00:00:00+000"))
Now I am joining the last DataFrame:
val dfTrain2 = dfTrain.join(dfP1, Seq("someColumn7"), "left_outer")
And lastly, the column selection with the cache() that is puzzling me:
val dfTrain3 = dfTrain2.select(columnsToSelect5).cache()
dfTrain3.agg(sum(col("someColumn7"))).show()
It looks like the cache() is useless here, but there will be some further processing and modelling of the DataFrame, and there the cache() will be necessary.
Should I give more details? Would you like to see the execution plan for dfTrain3?
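
One way to investigate (a sketch, not from the original post): compare the physical plans with and without the cache(). The cached variant should contain an InMemoryRelation node, and any change in filter pushdown or in the exchanges feeding the earlier stages becomes visible in the diff:
dfTrain2.select(columnsToSelect5).explain() // plan without caching
dfTrain3.explain()                          // plan with caching: includes InMemoryRelation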

Splitting a pipeline in Spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .map(transform1)
  .distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .map(transform2)
  .distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .cache
val foos1 = cached.map(transform1)
  .distinct().collect().toSet
val foos2 = cached.map(transform2)
  .distinct().collect().toSet
Second option - use the RDD API and make a single pass:
val foos = spark_session.read(foo_file)
  .flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .rdd
  .flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
  .distinct
  .collect
  .groupBy(_._1)
  .mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
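In that case, one hedged alternative is to tag the two branches with Either instead of string labels, so the compiler tracks both result types (A and B are hypothetical stand-ins for the return types of transform1 and transform2):
val tagged = spark_session.read(foo_file)
  .flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .rdd
  .flatMap(x => Seq[Either[A, B]](Left(transform1(x)), Right(transform2(x))))
  .distinct
  .collect
val foos1 = tagged.collect { case Left(v)  => v }.toSet  // results of transform1
val foos2 = tagged.collect { case Right(v) => v }.toSet  // results of transform2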

When does the DAG get created in Spark?

My Code:
scala> val records = List( "CHN|2", "CHN|3" , "BNG|2","BNG|65")
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65)
scala> val recordsRDD = sc.parallelize(records)
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23
scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|"))
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25
scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1)))
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27
scala> keyvalueRDD.count
res12: Long = 4
As you can see above, there are 3 RDDs created.
My question is: when does the DAG get created, and what does a DAG contain?
Does it get created when we create an RDD using any transformation?
or
Does it get created when we call an action on an existing RDD, and does Spark then automatically launch that DAG?
Basically, I want to know what happens internally when an RDD gets created.
The DAG is created when a job is executed (when you call an action), and it contains all the dependencies required to build the distributed tasks.
The DAG itself is not executed. Based on the DAG, Spark determines the tasks, which are distributed to the workers and executed.
An RDD alone defines its lineage, by recursively traversing its dependencies.
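You can see that lineage before any action runs by calling toDebugString on the RDD (output shape illustrative, continuing the REPL session above):
scala> keyvalueRDD.toDebugString
res13: String =
(8) MapPartitionsRDD[121] at map at <console>:27 []
 |  MapPartitionsRDD[120] at map at <console>:25 []
 |  ParallelCollectionRDD[119] at parallelize at <console>:23 []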

How to submit branching Spark tasks in parallel (not sequentially)

I have the following process in Spark: I create a DataFrame through a process, then I sample the DataFrame into 2 pieces. I then perform a few more operations on these halves and save both. Right now, I run the two save processes sequentially. When the first half is nearly finished, the next task does not start processing, so I have downtime in my executors. Is there a way I can submit both tasks asynchronously so that they operate in parallel rather than sequentially?
val split = 0.99
val frac1 = 0.7
val frac2 = (1 - frac1)
val frac3 = split
val validated = sqlContext.sql("Select * from Validated")
val splitDF = validated.selectExpr("srch_id").distinct.cache.randomSplit(Array(
  frac1 * frac3, frac2 * frac3, (1 - frac1 * frac3 - frac2 * frac3)
), 1122334455)
val traindf = validated.as("val").join(splitDF(0).as("train"), $"val.srch_id" === $"train.srch_id", "inner").selectExpr(sel.toSeq: _*)
val scoredf = validated.as("val").join(splitDF(1).as("score"), $"val.srch_id" === $"score.srch_id", "inner").selectExpr(sel.toSeq: _*)
traindf.write.mode(SaveMode.Overwrite).saveAsTable("train") //Holds until finished saving
scoredf.write.mode(SaveMode.Overwrite).saveAsTable("score")
I am considering using threads, but am unsure whether DataFrames are thread-safe. It seems that with the following approach, sometimes both tables do not get saved.
import scala.concurrent._
val pool = new forkjoin.ForkJoinPool(8)
val ectx = ExecutionContext.fromExecutorService(pool)
sqlContext.sql("use testpx")
ectx.execute(
  new Runnable {
    def run {
      traindf.write.mode(SaveMode.Overwrite).saveAsTable("train")
    }
  }
)
ectx.execute(
  new Runnable {
    def run {
      scoredf.write.mode(SaveMode.Overwrite).saveAsTable("test")
    }
  }
)
ectx.shutdown
ectx.awaitTermination(Long.MaxValue, java.util.concurrent.TimeUnit.DAYS)
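
Spark's scheduler is thread-safe, so submitting the two saves from separate threads is supported; one likely reason for the missing tables is that an exception thrown inside a bare Runnable can go unnoticed. A hedged sketch using Futures so that failures are rethrown on the calling thread (names reused from above):
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import java.util.concurrent.Executors

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))
val saves = Future.sequence(Seq(
  Future(traindf.write.mode(SaveMode.Overwrite).saveAsTable("train")),
  Future(scoredf.write.mode(SaveMode.Overwrite).saveAsTable("score"))
))
Await.result(saves, Duration.Inf) // blocks until both saves finish, rethrowing the first failure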

How to build a lookup map in Spark Streaming?

What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
use a "union" operation to append to the lookup RDD and persist it after each union.
save the state in a file or database and load it at the start of each batch.
My question is: from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// checkpointing is mandatory for mapWithState
ssc.checkpoint("_checkpoints")
val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)
import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
  println(s">>> batchTime = $batchTime")
  println(s">>> key = $key")
  println(s">>> value = $value")
  println(s">>> state = $state")
  val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
ssc.start() // nothing runs until the streaming context is started
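With the ConstantInputDStream above, every 5-second batch replays the same ten (key, value) pairs, so each key's state grows by the length of its value ("0" or "1", i.e. by 1) on every batch, and print() shows the mapped (key, value, sum) triples as the sums climb.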
