Splitting a pipeline in spark? - apache-spark

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform1)
.distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.map(transform2)
.distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?

First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
.map(someComplicatedProcessing)
.cache
val foos1 = cached.map(transform1)
.distinct().collect().toSet
val foos2 = cached.map(transform2)
.distinct().collect().toSet
Second option - use RDD and make single pass:
val foos = spark_session.read(foo_file)
.flatMap(toFooRecord)
.map(someComplicatedProcessing)
.rdd
.flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
.distinct
.collect
.groupBy(_._1)
.mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.

Related

Spark 1.6.2's RDD caching seems do to weird things with filters in some cases

I have an RDD:
avroRecord: org.apache.spark.rdd.RDD[com.rr.eventdata.ViewRecord] = MapPartitionsRDD[75]
I then filter the RDD for a single matching value:
val siteFiltered = avroRecord.filter(_.getSiteId == 1200)
I now count how many distinct values I get for SiteId. Given the filter it should be "1". Here's two ways I do it without cache and with cache:
val basic = siteFiltered.map(_.getSiteId).distinct.count
val cached = siteFiltered.cache.map(_.getSiteId).distinct.count
The result indicates that the cached version isn't filtered at all:
basic: Long = 1
cached: Long = 93
"93" isn't even the expected value if the filter was ignored completely (that answer is "522"). It also isn't a problem with "distinct" as the values are real ones.
It seems like the cached RDD has some odd partial version of the filter.
Anyone know what's going on here?
I supposed the problem is that you have to cache the result of your RDD before doing any action on it.
Spark build a DAG that represents the execution of your program. Each node is a transformation or an action on your RDD. Without cacheing the RDD, each action forces Spark to execute the whole DAG from the begining (or from the last cache invocation).
So, your code should work if you do the following changes:
val siteFiltered =
avroRecord.filter(_.getSiteId == 1200)
.map(_.getSiteId).cache
val basic = siteFiltered.distinct.count
// Yes, I know, in this way the second count has no sense at all
val cached = siteFiltered.distinct.count
There is no issue with your code. It should work fine.
I tried out the same at my local it is working fine without any discrepancies with multiple runs.
I have following data with me:
Event1,11.4
Event2,82.0
Event3,53.8
Event4,31.0
Event5,22.6
Event6,43.1
Event7,11.0
Event8,22.1
Event8,22.1
Event8,22.1
Event8,22.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event11,3.22
Event12,13.11
And I tried the same thing as you did, following is my code that is working fine:
scala> var textrdd = sc.textFile("file:///data/pocs/blogs/eventrecords");
textrdd: org.apache.spark.rdd.RDD[String] = file:///data/pocs/blogs/eventrecords MapPartitionsRDD[123] at textFile at <console>:27
scala> var filteredRdd = textrdd.filter(_.split(",")(1).toDouble > 1)
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[124] at filter at <console>:29
scala> filteredRdd.map(x => x.split(",")(1)).distinct.count
res36: Long = 12
scala> filteredRdd.cache.map(x => x.split(",")(1)).distinct.count
res37: Long = 12

Structuring multiple transformations into a single step

An RDD needs to be transformed, and there are several steps in the transformation.
One option is to put all the steps into one function:
rdd.map {x =>
x.field1 = // some logic
x.field2 = // some logic
x.field3 = // some logic
x
}
The issues with the above are:
each of the aforementioned logic steps could be quite large, so
structuring the code is more challenging.
some steps may potentially be dependent on transformations to the RDD in previous steps.
An alternative is as follows:
val transformedRdd = rdd.map(function1).map(function2).map(function3)
This both solves the previous issues. However, is it efficient? Is it any different to:
val rdd1 = rdd.map(function1)
val rdd2 = rdd1.map(function2)
val rdd3 = rdd2.map(function)
Thanks

How to pass output of one function to another in Spark

I am sending output of one function which is dataframe to another function.
val df1 = fun1
val df11 = df1.collect
val df2 = df11.map(x =fun2( x,df3))
Above 2 lines are wriiten in main function. Df1 is very large so if i do collect on driver it gives outof memory or gc issue.
What r ways to send output of one function to another in spark?
Spark can run the data processing for you. You don't need the intermediate collect step. You should just chain all of the transformations together and then add an action at the end to save the resulting data out to disk.
Calling collect() is only useful for debugging very small results.
For example, you could do something like this:
rdd.map(x => fun1(x))
.map(y => fun2(y))
.saveAsObjectFile();
This article might be helpful to explain more about this:
http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/

Apache Spark: stepwise execution

Due to a performance measurement I want to execute my scala programm written for spark stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? And does spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the spark result folder?
And does the temporary result after each operation should be cached with cache()?
What is the best way to separate the execution?
Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things that Spark does that makes your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages and stages in the scheduling graph must be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDD's for all intents and purposes are immutable, they should be stored in val's
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action but it would require a change in your functions since your are operating over an iterator on your partition
Caching only makes since if you either create multiple RDD's from a parent RDD (branching out) or run iterated computation over the same RDD (perhaps in a loop)
You can also simple invoke RDD.persist() or RDD.cache() after every transformation. but ensure that you have right level of StorageLevel defined.

Performing operations only on subset of a RDD

I would like to perform some transformations only on a subset of a RDD (to make experimenting in REPL faster).
Is it possible?
RDD has take(num: Int): Array[T] method, I think I'd need something similar, but returning RDD[T]
You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
...
res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.
RDDs are distributed collections which are materialized on actions only. It is not possible to truncate your RDD to a fixed size, and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect)
I you want to get similarly sized RDDs regardless of the input size, you can truncate items in each of your partitions - this way you can better control the absolute number of items in resulting RDD. Size of the resulting RDD will depend on spark parallelism.
An example from spark-shell:
import org.apache.spark.rdd.RDD
val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = rdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))
millionRdd.count // 1000000
millionRddTruncated.count // 10000 = 10 item * 1000 partitions
billionRddTruncated.count // 10000 = 10 item * 1000 partitions
Apparently it's possible to create RDD subset by first using its take method and then passing returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism) which returns new RDD.
This approach seems dodgy to me though. Is there a nicer way?
I always use parallelize function of SparkContext to distribute from Array[T] but it seems makeRDD do the same. It's correct way both of them.

Resources