Apache Spark: stepwise execution - apache-spark

For a performance measurement I want to execute my Scala program written for Spark stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? And does Spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the Spark result folder?
And should the temporary result after each operation be cached with cache()?
What is the best way to separate the execution?

Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things that Spark does that make your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages, and stages in the scheduling graph can only be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDDs are, for all intents and purposes, immutable, they should be stored in vals
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action, but it would require a change in your functions since you are operating over an iterator on your partition (see the sketch after these notes)
Caching only makes sense if you either create multiple RDDs from a parent RDD (branching out) or run iterated computation over the same RDD (perhaps in a loop)
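A minimal sketch of that foreachPartition() variant, reusing text_file from the snippet above (the empty function ignores each partition's iterator, so the RDD is fully computed but nothing is sent back to the driver):
val words = text_file.flatMap(line => line.split(" "))
words.foreachPartition(_ => ()) // action: forces computation of every partition, discards the results
val wordCount = words.map(word => (word, 1))
wordCount.foreachPartition(_ => ())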

You can also simply invoke RDD.persist() or RDD.cache() after every transformation, but ensure that you have the right StorageLevel defined.
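For example, a minimal sketch (MEMORY_AND_DISK is just one possible StorageLevel; pick whichever suits your memory budget):
import org.apache.spark.storage.StorageLevel
val words = text_file.flatMap(line => line.split(" ")).persist(StorageLevel.MEMORY_AND_DISK)
words.count() // first action materializes and stores the partitions
words.count() // later actions reuse the persisted data instead of recomputing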

Related

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform1)
.distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform2)
.distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.cache
val foos1 = cached.map(transform1)
.distinct().collect().toSet
val foos2 = cached.map(transform2)
.distinct().collect().toSet
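Once both result sets have been collected, the cached intermediate result can optionally be released:
cached.unpersist()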
Second option - use an RDD and make a single pass:
val foos = spark_session.read(foo_file)
.flatMap(toFooRecord)
.map(someComplicatedProcessing)
.rdd
.flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
.distinct
.collect
.groupBy(_._1)
.mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
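For illustration, a minimal sketch of one way to handle that with Either, following the simplified read call from the question and assuming, purely for the example, that transform1 returns a String and transform2 returns an Int:
val mixed = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.rdd
.flatMap(x => Seq[Either[String, Int]](Left(transform1(x)), Right(transform2(x))))
.distinct()
.collect()
val foos1 = mixed.collect { case Left(v) => v }.toSet  // results of transform1
val foos2 = mixed.collect { case Right(v) => v }.toSet // results of transform2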

How to run a repetitive task in parallel in Spark instead of looping sequentially?

I have just started with Spark, and I know that the non-functional way of sequential looping should be avoided for Spark to give me maximum performance.
I have a function makeData. I need to create a dataframe with the return value of this function by calling this function n times. Currently, my code looks like this:
var myNewDF = sqlContext.createDataFrame(sc.emptyRDD[Row], minority_set.schema)
for (i <- 1 to n) {
  myNewDF = myNewDF.unionAll(sqlContext.createDataFrame(sc.parallelize(makeData()), minority_set.schema))
}
Is there a way of doing this where each iteration happens in parallel?
Here the problem is that n can be large, and this loop is taking a lot of time. Is there a more scala/spark friendly way of achieving the same thing?
Since all your data is in memory already (guessing from sc.parallelize(makeData())), there's no point in using Spark SQL's unionAll to do the unioning, which is also local (yet partitioned!).
I'd use Scala alone, and only once all the records are merged would I build a Dataset from it.
With that, I'd do something as follows:
val dataset = (1 to n).par.map { _ => makeData() }.reduce(_ ++ _)
val df = sqlContext.createDataFrame(sc.parallelize(dataset), minority_set.schema)
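For completeness, a minimal self-contained sketch of that idea; makeData and the single-column schema here are hypothetical stand-ins for the question's setup:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
// hypothetical stand-in for the question's makeData()
def makeData(): Seq[Row] = Seq(Row(1), Row(2), Row(3))
val schema = StructType(Seq(StructField("value", IntegerType)))
val n = 100
// run the n makeData() calls in parallel on the driver, merge locally,
// then ship the merged rows to the cluster once
val rows = (1 to n).par.map(_ => makeData()).reduce(_ ++ _)
val df = sqlContext.createDataFrame(sc.parallelize(rows), schema)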

What happens to previous RDD when the next RDD is materialized?

In Spark, I would like to know what happens to the previous RDD when the next RDD is materialized.
Let's say I have the below Scala code:
val lines = sc.textFile("/user/cloudera/data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
Here lines is the base RDD, and similarly lineLengths is a derived RDD.
I know that these two RDDs get materialized when the reduce action is invoked.
My question is: while the data is flowing through these two RDDs, what happens to lines when lineLengths gets materialized?
Once lineLengths gets materialized, does the data inside lines get removed?
Let's say in a production Spark job there might be 100 RDDs, with a single action called against the 100th RDD.
What happens to the data in the 1st RDD when the 99th RDD gets materialized?
Does the data in all RDDs get deleted only once the final action has returned its output?
Or
Does the data in each RDD get removed automatically once that RDD passes its data to the next RDD as per the DAG?
Actually both lines and lineLengths will still refer to their RDDs after the reduce. You can think of an RDD as a DAG of transformations, as you mentioned. So if later you would like to perform some other transformations on lines or lineLengths, you can. Even though they materialize during the reduce, unless you cache them directly, they will run through their transformations again when another action is invoked on a DAG they belong to.
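For example, a minimal sketch of the caching case, building on the question's code:
val lines = sc.textFile("/user/cloudera/data.txt")
val lineLengths = lines.map(s => s.length).cache() // marked for caching; nothing is computed yet
val totalLength = lineLengths.reduce((a, b) => a + b) // first action: computes and caches lineLengths
val maxLength = lineLengths.max() // second action: reuses the cached partitions instead of re-reading the file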

Cartesian of DStream

I use Spark's cartesian function to generate a list of N pairs of values.
I then map over these values to generate a distance metric between each of the users:
val cartesianUsers: org.apache.spark.rdd.RDD[(distance.classes.User, distance.classes.User)] = users.cartesian(users)
cartesianUsers.map(m => manDistance(m._1, m._2))
This works as expected.
Using the Spark Streaming library I create a DStream and then map over it:
val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream....
customReceiverStream.foreachRDD(m => {
  println("size is " + m)
})
I could use the cartesian function within customReceiverStream.foreachRDD, but according to the doc http://spark.apache.org/docs/1.2.0/streaming-programming-guide.htm this is not its intended use:
foreachRDD(func) The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
How to compute the cartesian of a DStream ? Perhaps I'm misunderstanding the use of DStreams ?
I wasn't aware of the transform method:
customReceiverStream.transform(rdd => rdd.cartesian(rdd))
Nice talk which also mentions the transform function at approximately 17:00: https://www.youtube.com/watch?v=g171ndOHgJ0
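A minimal sketch of how that fits together; parseUser is a hypothetical function turning each received String into a User, and manDistance is the metric from the question:
val userStream = customReceiverStream.map(parseUser)
val distances = userStream
  .transform(rdd => rdd.cartesian(rdd)) // pairwise combinations within each batch RDD
  .map { case (u1, u2) => manDistance(u1, u2) }
distances.print()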

Spark splitting a DStream into several RDDs

The same question also applies to splitting an RDD into several new RDDs.
A DStream or RDD contains instances of several different case classes, and I need to turn them into separate RDDs based on the case class type.
I'm aware of
val newRDD = rdd.filter { a => a.getClass.getSimpleName == "CaseClass1" }
or
val newRDD = rdd.filter {
  a => a match {
    case _: CC1 => true
    case _ => false
  }
}
But this requires many runs through the original RDD, one per case class type.
There must be a more concise way to do the above matching filter?
Is there a way to split an rdd into several by the element type with one parallel pass?
1) A more concise way of filtering for a given type is to use rdd.collect(PartialFunction[T,U])
The equivalent of
val newRDD = rdd.filter { a => a.getClass.getSimpleName == "CaseClass1" }
would be:
val newRDD = rdd.collect{case c:CaseClass1 => c}
It could even be combined with additional filtering and transformation:
val budgetRDD = rdd.collect{case c:CaseClass1 if (c.customer == "important") => c.getBudget}
rdd.collect(p:PartialFunction[T,U]) should not be confused with rdd.collect() which delivers data back to the driver.
2) To split an RDD (or a DStream for that matter), filter is the way to go. One must remember that an RDD is a distributed collection. Filter will let you apply a function to a subset of that distributed collection, in parallel, over the cluster.
A structural creation of 2 or more RDDs from an original RDD would incur a 1-to-many shuffle stage, which will be substantially more expensive.
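Putting both points together, a minimal sketch; CC1 and CC2 are hypothetical case classes standing in for the question's types:
import org.apache.spark.rdd.RDD
sealed trait Record
case class CC1(a: Int) extends Record
case class CC2(b: String) extends Record
val mixed: RDD[Record] = sc.parallelize(Seq[Record](CC1(1), CC2("x"), CC1(2)))
mixed.cache() // avoid recomputing the parent RDD for every pass
val cc1s: RDD[CC1] = mixed.collect { case c: CC1 => c }
val cc2s: RDD[CC2] = mixed.collect { case c: CC2 => c }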
Looks like with rdd.filter I was on the right track with the long form. A slightly more concise version is:
val newRDD = rdd.filter { case _: CC1 => true ; case _ => false }
You can't leave out the case _ => false, or the match on the class type is not exhaustive and you'll get errors. I couldn't get the collect to work correctly.
#maasg gets credit for the right answer about doing separate filter passes rather than hacking a way to split input in one pass.
