Structuring multiple transformations into a single step - apache-spark

An RDD needs to be transformed, and there are several steps in the transformation.
One option is to put all the steps into one function:
rdd.map { x =>
  x.field1 = // some logic
  x.field2 = // some logic
  x.field3 = // some logic
  x
}
The issues with the above are:
Each of these logic steps could be quite large, so structuring the code becomes more challenging.
Some steps may depend on transformations applied to the RDD in previous steps.
An alternative is as follows:
val transformedRdd = rdd.map(function1).map(function2).map(function3)
This solves both of the previous issues. However, is it efficient? Is it any different from:
val rdd1 = rdd.map(function1)
val rdd2 = rdd1.map(function2)
val rdd3 = rdd2.map(function3)
Thanks
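For what it's worth, map is a lazy, narrow transformation, so chaining three maps and naming each intermediate RDD with a val build the same lineage. Nothing runs until an action is called, and Spark then pipelines the three functions over each element within a single stage. A Spark-free sketch of the same idea on a plain Scala List, with a hypothetical Record type and hypothetical function1 to function3 standing in for the real steps, shows that chained maps and one map over the composed function give the same result:

```scala
object ComposedMaps {
  // Hypothetical record type and steps; not taken from the question.
  case class Record(field1: Int, field2: Int, field3: Int)

  val function1: Record => Record = r => r.copy(field1 = r.field1 + 1)
  // function2 depends on the value produced by function1.
  val function2: Record => Record = r => r.copy(field2 = r.field1 * 2)
  val function3: Record => Record = r => r.copy(field3 = r.field1 + r.field2)

  val input = List(Record(1, 0, 0), Record(2, 0, 0))

  // Three chained maps, shaped like rdd.map(function1).map(function2).map(function3).
  val chained = input.map(function1).map(function2).map(function3)

  // One map over the composed function.
  val composed = input.map(function1 andThen function2 andThen function3)
}
```

So the choice between the two styles is mostly about readability; keeping each step in its own named function keeps the code structured without costing an extra pass over the data.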

Related

Why is Spark running each task more than once

In my Spark application, I see the same task being executed in multiple stages, even though these statements are defined only once in the code. Moreover, the same tasks take different times to execute in different stages. I understand that if an RDD is lost, its lineage is used to recompute it. How can I find out whether this is the case? The same phenomenon was seen in all the runs of this application. Can someone please explain what is happening here and under what conditions a task can get scheduled in multiple stages?
The code looks very much like the following:
val events = getEventsDF()
events.cache()
metricCounter.inc("scec", events.count())
val scEvents = events.filter(_.totalChunks == 1)
  .repartition(NUM_PARTITIONS, lit(col("eventId")))
val sortedEvents = events.filter(e => e.totalChunks > 1 && e.totalChunks <= maxNumberOfChunks)
  .map(PartitionUtil.createKeyValueTuple)
  .rdd
  .repartitionAndSortWithinPartitions(new EventDataPartitioner(NUM_PARTITIONS))
val largeEvents = events.filter(_.totalChunks > maxNumberOfChunks).count()
val mcEvents = sortedEvents.mapPartitionsWithIndex[CFEventLog](
  (index: Int, iter: Iterator[Tuple2]) => doSomething())
val mcEventsDF = session.sqlContext.createDataset[CFEventLog](mcEvents)
metricCounter.inc("mcec", mcEventsDF.count())
val currentDf = scEvents.unionByName(mcEventsDF)
val distinctDateHour = currentDf.select(col("eventDate"), col("eventHour"))
  .distinct
  .collect
val prevEventsDF = getAnotherDF(distinctDateHour)
val finalDf = currentDf.unionByName(prevEventsDF).dropDuplicates(Seq("eventId"))
finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")
val finalEventsCount = finalDf.count()
Is every count() action resulting in re-execution of the RDD transformation before the action?
Thanks,
Devj
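In general, yes: every action (count(), collect, a write) starts a new job, and any part of the lineage that is not cached is recomputed for it, although Spark can skip stages whose shuffle output is still available. In the code above only events is cached, so each action on mcEventsDF, currentDf or finalDf would re-run the transformations behind it. As a Spark-free analogy (an illustration only, not Spark code), a plain Scala view is also lazy and not memoized, and a call counter on a hypothetical expensive step makes the recomputation visible:

```scala
object LazyRecompute {
  var calls = 0 // counts how often the "expensive" step runs

  // Hypothetical expensive transformation.
  def expensive(x: Int): Int = { calls += 1; x * 2 }

  val source = List(1, 2, 3)

  // Lazy and not memoized: every traversal re-runs expensive(), much like
  // an uncached RDD or DataFrame is recomputed for every action.
  val lazyPipeline = source.view.map(expensive)
  val total1 = lazyPipeline.sum // first "action"
  val total2 = lazyPipeline.sum // second "action" runs expensive() again
  val callsWithoutCache = calls

  // Materializing once plays the role of cache(): later traversals reuse it.
  calls = 0
  val materialized = source.map(expensive)
  val total3 = materialized.sum
  val total4 = materialized.sum
  val callsWithCache = calls
}
```

In the Spark code, the counterpart would be calling cache() (or persist) on a DataFrame that feeds several actions, such as currentDf or finalDf, before the first action that uses it.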

Splitting a pipeline in spark?

Assume that I have a Spark pipeline like this (formatted to emphasize the important steps):
val foos1 = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .map(transform1)
  .distinct().collect().toSet
I'm adding a similar pipeline:
val foos2 = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .map(transform2)
  .distinct().collect().toSet
Then I do something with both results.
I'd like to avoid doing someComplicatedProcessing twice (not parsing the file twice is nice, too).
Is there a way to take the stream after the .map(someComplicatedProcessing) step and create two parallel streams feeding off it?
I know that I can store the intermediate result on disk and thus save the CPU time at the cost of more I/O. Is there a better way? What words do I web-search for?
First option - cache intermediate results:
val cached = spark_session.read(foo_file).flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .cache
val foos1 = cached.map(transform1)
  .distinct().collect().toSet
val foos2 = cached.map(transform2)
  .distinct().collect().toSet
Second option - use RDD and make single pass:
val foos = spark_session.read(foo_file)
  .flatMap(toFooRecord)
  .map(someComplicatedProcessing)
  .rdd
  .flatMap(x => Seq(("t1", transform1(x)), ("t2", transform2(x))))
  .distinct
  .collect
  .groupBy(_._1)
  .mapValues(_.map(_._2))
val foos1 = foos("t1")
val foos2 = foos("t2")
The second option may require some type wrangling if transform1 and transform2 have incompatible return types.
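One way to handle that type wrangling is to tag the two outputs with Either instead of string keys. A minimal sketch on a plain Scala collection, assuming hypothetical transform1 and transform2 that return Int and String; the same shape should carry over to the RDD version, but that is untested here:

```scala
object SinglePassEither {
  // Hypothetical transforms with incompatible return types.
  def transform1(x: Int): Int = x % 3
  def transform2(x: Int): String = if (x % 2 == 0) "even" else "odd"

  // Stands in for the shared result of someComplicatedProcessing.
  val processed = List(1, 2, 3, 4, 5, 6)

  // Single pass: emit both transforms per record, tagged Left or Right.
  val tagged: List[Either[Int, String]] =
    processed.flatMap { x =>
      List[Either[Int, String]](Left(transform1(x)), Right(transform2(x)))
    }

  val distinctTagged = tagged.distinct

  // Split the tagged values back into two typed sets.
  val foos1: Set[Int] = distinctTagged.collect { case Left(a) => a }.toSet
  val foos2: Set[String] = distinctTagged.collect { case Right(b) => b }.toSet
}
```

Compared with string keys and a common supertype, Either keeps both result types intact, so no casts are needed after the split.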

How to run a repetitive task in parallel in Spark instead of looping sequentially?

I have just started with Spark, and I know that the non-functional style of sequential looping should be avoided if Spark is to give me maximum performance.
I have a function makeData. I need to create a dataframe with the return value of this function by calling this function n times. Currently, my code looks like this:
var myNewDF = sqlContext.createDataFrame(sc.emptyRDD[Row], minority_set.schema)
for (i <- 1 to n) {
  myNewDF = myNewDF.unionAll(sqlContext.createDataFrame(sc.parallelize(makeData()), minority_set.schema))
}
Is there a way of doing this where each iteration happens in parallel?
The problem here is that n can be large, and this loop takes a lot of time. Is there a more Scala/Spark-friendly way of achieving the same thing?
Since your whole dataset is already in memory (guessing from sc.parallelize(makeData())), there's no point in using Spark SQL's unionAll for the union, which is also local (yet partitioned!).
I'd use Scala alone, and only once all the records are merged would I build a Dataset from them.
With that, I'd do something like the following:
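val dataset = (1 to n).par.map { _ => makeData() }.reduce(_ ++ _)
// parallelize takes the data (and optionally a slice count); the schema goes to createDataFrame
val df = sqlContext.createDataFrame(sc.parallelize(dataset), minority_set.schema)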
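Note that from Scala 2.13 onwards .par is no longer in the standard library; it lives in the separate scala-parallel-collections module. A sketch of the same local fan-out using only standard-library Futures, with a hypothetical makeData that returns a small Seq[Int] (the real one would return rows) and a hypothetical helper collectAll:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelMakeData {
  // Hypothetical stand-in for makeData(); the real one would return Seq[Row].
  def makeData(): Seq[Int] = Seq(1, 2, 3)

  // Runs makeData() n times on the global thread pool and merges the results.
  def collectAll(n: Int): Seq[Int] = {
    // Start n independent calls; Future.sequence keeps them in call order.
    val batches = Future.sequence((1 to n).map(_ => Future(makeData())))
    // Wait for all of them and flatten into one local collection.
    Await.result(batches, 1.minute).flatten
  }
}
```

The merged collection can then be turned into a DataFrame once, rather than unioning n small DataFrames.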

Spark 1.6.2's RDD caching seems to do weird things with filters in some cases

I have an RDD:
avroRecord: org.apache.spark.rdd.RDD[com.rr.eventdata.ViewRecord] = MapPartitionsRDD[75]
I then filter the RDD for a single matching value:
val siteFiltered = avroRecord.filter(_.getSiteId == 1200)
I now count how many distinct values I get for SiteId. Given the filter, it should be "1". Here are two ways I do it, without cache and with cache:
val basic = siteFiltered.map(_.getSiteId).distinct.count
val cached = siteFiltered.cache.map(_.getSiteId).distinct.count
The result indicates that the cached version isn't filtered at all:
basic: Long = 1
cached: Long = 93
"93" isn't even the expected value if the filter was ignored completely (that answer is "522"). It also isn't a problem with "distinct" as the values are real ones.
It seems like the cached RDD has some odd partial version of the filter.
Anyone know what's going on here?
I suppose the problem is that you have to cache your RDD before running any action on it.
Spark builds a DAG that represents the execution of your program. Each node is a transformation or an action on your RDD. Without caching the RDD, each action forces Spark to execute the whole DAG from the beginning (or from the last cache invocation).
So, your code should work if you make the following changes:
val siteFiltered =
  avroRecord.filter(_.getSiteId == 1200)
    .map(_.getSiteId).cache
val basic = siteFiltered.distinct.count
// Yes, I know, done this way the second count makes little sense
val cached = siteFiltered.distinct.count
There is no issue with your code. It should work fine.
I tried the same thing locally, and it works fine, without any discrepancies across multiple runs.
I have the following data:
Event1,11.4
Event2,82.0
Event3,53.8
Event4,31.0
Event5,22.6
Event6,43.1
Event7,11.0
Event8,22.1
Event8,22.1
Event8,22.1
Event8,22.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event9,3.2
Event10,13.1
Event11,3.22
Event12,13.11
I did the same thing as you; the following code works fine:
scala> var textrdd = sc.textFile("file:///data/pocs/blogs/eventrecords");
textrdd: org.apache.spark.rdd.RDD[String] = file:///data/pocs/blogs/eventrecords MapPartitionsRDD[123] at textFile at <console>:27
scala> var filteredRdd = textrdd.filter(_.split(",")(1).toDouble > 1)
filteredRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[124] at filter at <console>:29
scala> filteredRdd.map(x => x.split(",")(1)).distinct.count
res36: Long = 12
scala> filteredRdd.cache.map(x => x.split(",")(1)).distinct.count
res37: Long = 12

Linear Regression on Apache Spark

We have a situation where we have to run linear regression on millions of small datasets and store the weights and intercept for each of them. I wrote the Scala code below to do so: I feed each dataset in as a row of an RDD and then try to run the regression on each one (data is the RDD that stores (label, features) in each row; in this case we have one feature per label):
val x = data.flatMap { line => line.split(' ') }.map { line =>
  val parts = line.split(',')
  val parsedData1 = LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
  val model = LinearRegressionWithSGD.train(sc.parallelize(List(parsedData1)), 100) // using parallelize to convert data to type RDD
  (model.intercept, model.weights)
}
The problem here is that LinearRegressionWithSGD expects an RDD as input, and nested RDDs are not supported in Spark. I chose this approach because all these datasets can be processed independently of each other, so I wanted to distribute them (hence, I ruled out looping).
Can you please suggest whether I can use other types (Arrays, Lists, etc.) as input to LinearRegressionWithSGD, or a better approach that still distributes such computations in Spark?
val modelList = for { item <- dataSet } yield {
  val data = MLUtils.loadLibSVMFile(context, item).cache()
  val model = LinearRegressionWithSGD.train(data)
  model
}
Maybe you can split your input data into several files and store them in HDFS.
Using the directory of those files as input, you can get a list of models.
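Another option, since each dataset is small and has a single feature: skip LinearRegressionWithSGD altogether and fit each dataset with the closed-form least-squares formulas inside an ordinary map, which avoids nested RDDs. This swaps SGD for ordinary least squares, so it is a different technique from the one in the question. A minimal sketch in plain Scala, where fitLine and the sample points are hypothetical:

```scala
object SimpleOls {
  // Fits y = intercept + slope * x by ordinary least squares.
  // Returns (intercept, slope); assumes at least two distinct x values.
  def fitLine(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    val covXY = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val slope = covXY / varX
    (meanY - slope * meanX, slope)
  }

  // Each inner Seq stands in for one small dataset, i.e. one row of the RDD,
  // given as (feature, label) pairs.
  val datasets = Seq(
    Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0)),
    Seq((0.0, 1.0), (2.0, 0.0))
  )
  val models = datasets.map(fitLine)
}
```

With an RDD whose rows each hold one parsed dataset, the same function could be applied with an ordinary map over the RDD, so the fits are distributed across rows without creating an RDD inside an RDD.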
