Is foreachRDD executed on the Driver? - apache-spark

I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting the XML as a DStream, I convert it to DataFrames so I can join it with some of my static data that is already loaded as DataFrames.
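Roughly, the pipeline I have in mind looks like the sketch below (simplified; xmlDStream, Record, parseXml, staticDF and outputPath are placeholders, and I assume a SparkSession named spark):
case class Record(key: String, payload: String)     // simplified schema for illustration

xmlDStream.foreachRDD { rdd =>
  import spark.implicits._                           // assumes a SparkSession named `spark`
  val batchDF = rdd.map(parseXml).toDF()             // parseXml: String => Record (placeholder)
  val joined  = batchDF.join(staticDF, Seq("key"))   // staticDF: the preloaded static DataFrame
  joined.write.mode("append").parquet(outputPath)    // the action that triggers distributed work
}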
But as per the API documentation for the foreachRDD method on DStream, it gets executed on the driver. Does that mean all processing logic will only run on the driver and not get distributed to workers/executors?
API Documentation
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

so does that mean all processing logic will only run on Driver and not get distributed to workers/executors.
No, the function itself runs on the driver, but don't forget that it operates on an RDD. The inner functions that you'll use on the RDD, such as foreachPartition, map, filter etc will still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver, unless you call methods like collect, which do.

To make this clear, if you run the following, you will see "monkey" on the driver's stdout:
myDStream.foreachRDD { rdd =>
  println("monkey")
}
If you run the following, you will see "monkey" on the driver's stdout, and the filter work will be done on whatever executors the rdd is distributed across:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
}
Let's add the simplification that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions that we'll call PartitionSetA that exist on MachineSetB where ExecutorSetC are running. If you run the following, you will see "monkey" on the driver's stdout, you will see "turtle" on the stdouts of all executors in ExecutorSetC ("turtle" will appear once for each partition -- many partitions could be on the machine where an executor is running), and the work of both the filter and addition operations will be done across ExecutorSetC:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
  }
}
One more thing to note is that in the following code, y would end up being sent across the network from the driver to all of ExecutorSetC for each rdd:
val y = 2
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + y
  }
}
To avoid this overhead, you can use broadcast variables, which send the value from the driver to the executors just once. For example:
val y = 2
val broadcastY = sc.broadcast(y)
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + broadcastY.value
  }
}
For sending more complex things over as broadcast variables, such as objects that aren't easily serializable once instantiated, you can see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html
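As a rough illustration of the idea behind that post (my own simplified sketch, not the blog's code): wrap the non-serializable object in a serializable holder that instantiates it lazily on the executor, and broadcast the holder. SomeNonSerializableClient, config and send are hypothetical names:
// Serializable holder; the wrapped object is only created when first accessed on an executor.
class LazyWrapper[T](createFn: () => T) extends Serializable {
  lazy val get: T = createFn()
}

// Hypothetical usage with some non-serializable client:
val wrappedClient = sc.broadcast(new LazyWrapper(() => new SomeNonSerializableClient(config)))

myDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val client = wrappedClient.value.get          // instantiated once per executor JVM
    partition.foreach(element => client.send(element))
  }
}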

Related

In Apache Spark, how to make a task to always execute on the same machine?

In its simplest form, an RDD is merely a placeholder for chained computations that can be arbitrarily scheduled to be executed on any machine:
val src = sc.parallelize(0 to 1000)
val rdd = src.mapPartitions { itr =>
  Iterator(SparkEnv.get.executorId)
}
for (i <- 1 to 3) {
  val vs = rdd.collect()
  println(vs.mkString)
}
/* yielding:
1230123012301230
0321032103210321
2130213021302130
*/
This behaviour can obviously be overridden by persisting any of the upstream RDDs, so that the Spark scheduler will minimise redundant computation:
val src = sc.parallelize(0 to 1000)
src.persist()
val rdd = src.mapPartitions { itr =>
  Iterator(SparkEnv.get.executorId)
}
for (i <- 1 to 3) {
  val vs = rdd.collect()
  println(vs.mkString)
}
/* yield:
2013201320132013
2013201320132013
2013201320132013
each partition has a fixed executorID
*/
Now my problem is:
I don't like the vanilla caching mechanism (see this post: In Apache Spark, can I incrementally cache an RDD partition?) and have written my own caching mechanism (by implementing a new RDD). Since the new caching mechanism is only capable of reading existing values from local disk/memory, if there are multiple executors, my cache for each partition will frequently be missed whenever the partition is executed in a task on another machine.
So my question is:
How do I mimic the Spark RDD persistence implementation to ask the DAG scheduler to enforce/suggest locality-aware task scheduling, without actually calling the .persist() method, since that is unnecessary in my case?
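(For reference, the hook Spark exposes to custom RDDs for suggesting locality is the getPreferredLocations method. A minimal, non-authoritative sketch of a wrapper RDD using it, where hostForPartition stands in for whatever bookkeeping the custom cache keeps:)
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Sketch: pin each partition to the host(s) where its locally cached data lives.
class LocalityAwareRDD[T: ClassTag](
    prev: RDD[T],
    hostForPartition: Int => Seq[String]
) extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)

  // The DAG scheduler treats these as locality preferences, not hard guarantees.
  override def getPreferredLocations(split: Partition): Seq[String] =
    hostForPartition(split.index)
}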

Spark in Foreach

val a = sc.textFile("/user/cts367689/datagen.txt")
val b = a.map(x => (x.split(",")(0), x.split(",")(2), x.split(",")(4)))
val c = b.filter(x => (x._3.toInt > 500))
c.foreach(x => println(x))
or
c.foreach { x => println(x) }
I am not getting the expected output when I use the foreach statement. I want the output to be printed one element per line, but I am not sure what is wrong in my code.
I think this has been answered a couple of times before, but here we go again, quoting from the official Programming Guide:
Printing elements of an RDD
One common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD’s elements.
scala> val rdd = sc.parallelize(Seq((1,2,3),(2,3,4)))
// rdd: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd.foreach(println)
// (1,2,3)
// (2,3,4)
However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead, not the one on the driver, so stdout on the driver won’t show these!
To print all elements on the driver, one needs to collect() the data back to the driver node first:
scala> rdd.collect().foreach(println)
// (1,2,3)
// (2,3,4)
And here is the limitation: because collect() fetches the entire RDD to a single machine, if your data doesn't fit on the driver, the driver will run out of memory.
If you only need to print a few elements of the RDD, a safer approach is to use take():
scala> val rdd = sc.parallelize(Range(1, 1000000000))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:27
scala> rdd.take(100).foreach(println)
// 1
// 2
// 3
// 4
// 5
// 6
// 7
// 8
// 9
// 10
// [...]
PS: A small note concerning the foreach method. foreach runs a function on each element of the dataset. It is usually used for side effects such as updating an Accumulator or interacting with external storage systems.
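For instance, a small sketch of the Accumulator case (using the Spark 2.x longAccumulator API; the rdd and the "ERROR" marker are made up for illustration):
val errorCount = sc.longAccumulator("errorCount")

rdd.foreach { record =>
  // pure side effect: nothing is returned to the driver except the accumulator update
  if (record.toString.contains("ERROR")) errorCount.add(1)
}

println(s"errors seen: ${errorCount.value}")   // read back on the driver after the action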
I hope that this answers your question.

Spark Streaming - how to use reduceByKey within a partition on the Iterator

I am trying to consume a Kafka DirectStream, process the RDDs for each partition, and write the processed values to a DB. When I try to perform reduceByKey (per partition, that is, without the shuffle), I get the following error. Usually on the driver node we can use sc.parallelize(Iterator) to solve this issue, but I would like to solve it within Spark Streaming.
value reduceByKey is not a member of Iterator[((String, String), (Int, Int))]
Is there a way to perform transformations on Iterator within the partition?
myKafkaDS
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val commonIter = rdd.mapPartitionsWithIndex((i, iter) => {
      val offset = offsetRanges(i)
      val records = iter.filter(item => {
        (some_filter_condition)
      }).map(r1 => {
        // Some processing
        ((field2, field2), (field3, field4))
      })
      records.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // Getting reduceByKey() is not a member of Iterator
      // Code to write to DB
      Iterator.empty // I just want to store the processed records in DB. So returning empty iterator
    })
  }
Is there a more elegant way to do this (process Kafka RDDs for each partition and store them in a DB)?
So... we cannot use Spark transformations within mapPartitionsWithIndex. However, using Scala transform and reduce methods such as groupBy helped me solve this issue.
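Roughly like this (a sketch only; field1..field4 mirror the placeholder fields from the question):
val commonIter = rdd.mapPartitionsWithIndex { (i, iter) =>
  iter
    .map { r1 =>
      // Some processing, same ((key1, key2), (v1, v2)) shape as in the question
      ((field1, field2), (field3, field4))
    }
    .toList
    .groupBy(_._1)                                   // plain Scala groupBy, no shuffle
    .map { case (key, grouped) =>
      val summed = grouped.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      (key, summed)
    }
    .iterator                                        // hand an Iterator back to Spark
}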
Your records value is an Iterator and not an RDD; hence you are unable to invoke reduceByKey on it.
Syntax issues:
1) The reduceByKey logic looks OK; remove the val before the statement (if it isn't a typo) and attach reduceByKey() after the map:
.map(r1 => {
  // Some processing
  ((field2, field2), (field3, field4))
}).reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
2) Add iter.next at the end of each iteration.
3) Iterator.empty is wrongly placed; put it after coming out of mapPartitionsWithIndex().
4) Add an iterator condition for safety:
val commonIter = rdd.mapPartitionsWithIndex((i, iter) => if (i == 0 && iter.hasNext) {
  ....
} else iter, true)

In Spark Streaming, how to reload a lookup (non-stream) RDD after n batches

Suppose I have a streaming context which does a lot of steps, and at the end the micro-batch looks up or joins against a preloaded RDD. I have to refresh that preloaded RDD every 12 hours. How can I do this? To my understanding, anything I do that does not relate to the streaming context is not replayed, so how do I get this called from one of the streaming RDDs? I need to make only one call, no matter how many partitions the streaming DStream has.
This is possible by re-creating the external RDD at the time it needs to be reloaded. It requires defining a mutable variable to hold the RDD reference that's active at a given moment in time. Within the dstream.foreachRDD we can then check for the moment when the RDD reference needs to be refreshed.
This is an example of how that would look:
val stream: DStream[Int] = ??? // let's say that we have some DStream of Ints

// Some external data as an RDD of (x,x)
def externalData(): RDD[(Int, Int)] = sparkContext.textFile(dataFile)
  .flatMap { line => try { Some((line.toInt, line.toInt)) } catch { case ex: Throwable => None } }
  .cache()

// this mutable var will hold the reference to the external data RDD
var cache: RDD[(Int, Int)] = externalData()

// force materialization - useful for experimenting, not needed in reality
cache.count()

// a var to count iterations -- used to trigger the reload in this example
var tick = 1

// reload frequency
val ReloadFrequency = 5

stream.foreachRDD { rdd =>
  if (tick == 0) { // will reload the RDD every 5 iterations
    // unpersist the previous RDD, otherwise it will linger in memory, taking up resources
    cache.unpersist(false)
    // generate a new RDD
    cache = externalData()
  }
  // join the DStream RDD with our reference data, do something with it...
  val matches = rdd.keyBy(identity).join(cache).count()
  updateData(dataFile, (matches + 1).toInt) // so I'm adding data to the static file in order to see when the new records become alive
  tick = (tick + 1) % ReloadFrequency
}
streaming.start
Before coming up with this solution, I studied the possibility of playing with the persist flag on the RDD, but it didn't work as expected. It looks like unpersist() does not force re-materialization of the RDD when it's used again.

Microbatch operation in foreachRDD block of Spark Streaming

How can we micro-batch operations in a foreachRDD block? For example, I read logs from HDFS and perform an operation in foreachRDD:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
  val newRDD = rdd.map(line =>
    ScalaObject.process(line)
  )
}
The code will call ScalaObject.process for each line in the logs. Is it possible to call ScalaObject.process for a batch of lines?
Thanks
In the context of your example, what you're asking to do will not make use of Spark's parallelism. If you really need to do this for some reason, you can run the processing only once per partition. The parallelism you get with this will only be N, where N is the number of partitions.
Use rdd.foreachPartition:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
  rdd.foreachPartition(lines =>
    ScalaObject.process(lines)
  )
}
The parameter lines will be of type Iterator[String].
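If ScalaObject.process really needs smaller fixed-size batches rather than one whole partition, a hedged variant (the batch size of 1000, and a process overload that accepts a Seq[String], are assumptions) is to group the partition's iterator:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partitionLines =>
    // Iterator.grouped lazily chunks the partition into batches of up to 1000 lines
    partitionLines.grouped(1000).foreach { batch =>
      ScalaObject.process(batch) // batch: Seq[String]
    }
  }
}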
