microbatch operation in foreachRDD block of spark streaming - apache-spark

How can we microbatch operations in a foreachRDD block? For example, I read logs from HDFS and perform an operation in foreachRDD:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
  val newRDD = rdd.map(line =>
    ScalaObject.process(line)
  )
}
The code will call ScalaObject.process for each line in the logs. Is it possible to call ScalaObject.process for a batch of lines?
Thanks

In the context of your example, what you're asking to do will not make use of Spark's parallelism. If you really need to do this for some reason, you can run the processing only once per partition. The parallelism you get with this will only be N, where N is the number of partitions.
Use rdd.foreachPartition:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
  rdd.foreachPartition(lines =>
    ScalaObject.process(lines)
  )
}
The parameter lines will be of type Iterator[String].
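If ScalaObject.process expects batches of a fixed size rather than a whole partition, you can chunk the partition iterator before calling it. This is only a minimal sketch: the batch size of 100 is arbitrary, and it assumes a hypothetical variant of ScalaObject.process that accepts a Seq[String]:
val lines = ssc.textFileStream(hadoopPath)
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // grouped() splits the Iterator[String] into fixed-size chunks (the last chunk may be smaller)
    partition.grouped(100).foreach { batch =>
      ScalaObject.process(batch) // assumes a variant of process that takes a Seq[String]
    }
  }
}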

Related

Spark shuffle when double repartition

I am trying to join some datasets in Spark, and I am trying to do that without a shuffle.
Unfortunately, after running some tests I saw something odd.
Let's say I have dataset A in S3 as plain text (JSON strings).
First, I read dataset A and repartition it by a specific field. Here is some dummy code as an example:
val rddA = sc.textFile(s3InputPath)
  .map(x => objectMapper.readValue(x, classOf[APojo]))
  .map(x => (x.getId(), x)) // "id" field is type of String
  .partitionBy(new HashPartitioner(10))
  .map(x => objectMapper.writeValueAsString(x._2))
rddA.saveAsTextFile(s3OutputPath)
Then I try to read the previous dataset and run the exact same repartition:
val rddAClone = sc.textFile(s3OutputPath)
  .map(x => objectMapper.readValue(x, classOf[APojo]))
  .map(x => (x.getId(), x))
  .partitionBy(new HashPartitioner(10))
  .map(x => objectMapper.writeValueAsString(x._2))
rddAClone.collect // any action to force Spark to process the RDD
My expectation was that, because I had previously repartitioned my dataset by "id", repartitioning it again would finish without any shuffle. Unfortunately, I still see some shuffle in the Spark Application Monitor (<5% of the input dataset shows up as shuffle).
Have any of you run into the same problem? Is this behaviour intended, or is something wrong with my assumption/expectation?
Thanks!

Why does my Spark Streaming application not print the number of records from Kafka (using count operator)?

I am working on a Spark application which needs to read data from Kafka. I created a Kafka topic where the producer was posting messages. I verified from the console consumer that messages were successfully posted.
I wrote a short Spark application to read data from Kafka, but it is not getting any data.
Following is the code I used:
def main(args: Array[String]): Unit = {
  val Array(zkQuorum, group, topics, numThreads) = args
  val sparkConf = new SparkConf().setAppName("SparkConsumer").setMaster("local[2]")
  val ssc = new StreamingContext(sparkConf, Seconds(2))
  val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
  val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
  process(lines) // prints the number of records in Kafka topic
  ssc.start()
  ssc.awaitTermination()
}
private def process(lines: DStream[String]) {
  val z = lines.count()
  println("count of lines is " + z)
  // edit
  lines.foreachRDD(rdd => rdd.map(println)
    // <-- Why does this **not** print?
  )
}
Any suggestions on how to resolve this issue?
**EDIT**
I have used
lines.foreachRDD(rdd => rdd.map(println))
as well in the actual code, but that is also not working. I set the retention period as mentioned in the post Kafka spark directStream can not get data, but the problem still exists.
Your process is a continuation of a DStream pipeline, but it has no output operator, which is what gets a pipeline executed every batch interval.
You can "see" it by reading the signature of the count operator:
count(): DStream[Long]
Quoting the count's scaladoc:
Returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream.
So, you have a DStream of Kafka records that you transform into a DStream of single values (the result of count), but there is nothing yet to have it output (to the console or any other sink).
You have to end the pipeline using an output operator as described in the official documentation Output Operations on DStreams:
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
(Low-level) Output operators register input dstreams as output dstreams so the execution can start. Spark Streaming's DStream by design has no notion of being an output dstream; it is the DStreamGraph that knows about, and can differentiate between, input and output dstreams.
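For example, here is a minimal sketch of how the process method from the question could be ended with an output operator (print() is one such operator; foreachRDD combined with an RDD action such as foreach is another):
private def process(lines: DStream[String]): Unit = {
  // print() is an output operator: it triggers execution every batch interval
  // and writes the per-batch count to the driver's stdout.
  lines.count().print()
  // Inside foreachRDD, use an RDD action (foreach) rather than the lazy map to force work.
  // Note these println calls run on the executors; with local[2] they share the driver's
  // console, but on a cluster they end up in the executor logs.
  lines.foreachRDD { rdd =>
    rdd.foreach(record => println(record))
  }
}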

Spark Streaming - how to use reduceByKey within a partition on the Iterator

I am trying to consume a Kafka DirectStream, process the RDDs for each partition, and write the processed values to a DB. When I try to perform reduceByKey (per partition, that is, without the shuffle), I get the following error. Usually, on the driver node, we can use sc.parallelize(Iterator) to solve this issue, but I would like to solve it in Spark Streaming.
value reduceByKey is not a member of Iterator[((String, String), (Int, Int))]
Is there a way to perform transformations on Iterator within the partition?
myKafkaDS
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val commonIter = rdd.mapPartitionsWithIndex((i, iter) => {
      val offset = offsetRanges(i)
      val records = iter.filter(item => {
        (some_filter_condition)
      }).map(r1 => {
        // Some processing
        ((field2, field2), (field3, field4))
      })
      val records.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // Getting reduceByKey() is not a member of Iterator
      // Code to write to DB
      Iterator.empty // I just want to store the processed records in DB. So returning empty iterator
    })
  }
Is there a more elegant way to do this (process Kafka RDDs for each partition and store them in a DB)?
So... we cannot use Spark transformations within mapPartitionsWithIndex. However, using Scala collection methods such as groupBy and reduce helped me solve this issue.
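Here is a minimal sketch of that approach, keeping the question's placeholder filter and field names; only the reduceByKey step is replaced by Scala's groupBy plus reduce on the materialized partition (the DB write is left as a comment):
val commonIter = rdd.mapPartitionsWithIndex { (i, iter) =>
  val records = iter.filter(item => {
    (some_filter_condition)
  }).map(r1 => {
    // Some processing
    ((field2, field2), (field3, field4))
  })
  // Emulate reduceByKey inside the partition with plain Scala collections:
  // materialize the iterator, group by the (String, String) key, and sum the (Int, Int) values.
  val reduced: Map[(String, String), (Int, Int)] =
    records.toList
      .groupBy(_._1)
      .map { case (key, grouped) =>
        key -> grouped.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      }
  // Code to write `reduced` to the DB goes here
  Iterator.empty
}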
Your records value is an Iterator and not an RDD, hence you are unable to invoke reduceByKey on records.
Syntax issues:
1) The reduceByKey logic looks OK; please remove the val before the statement (if it is not a typo) and attach reduceByKey() after the map:
.map(r1 => {
  // Some processing
  ((field2, field2), (field3, field4))
}).reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
2) Add iter.next after the end of each iteration.
3) Iterator.empty is misplaced; put it after coming out of mapPartitionsWithIndex().
4) Add an iterator condition for safety:
val commonIter = rdd.mapPartitionsWithIndex((i, iter) =>
  if (i == 0 && iter.hasNext) {
    ....
  } else iter, true)

Is foreachRDD executed on the Driver?

I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting the XML as a DStream, I convert it to DataFrames so I can join it with some of my static data already loaded as DataFrames.
But as per the API documentation for the foreachRDD method on DStream:
it gets executed on the Driver, so does that mean all processing logic will only run on the Driver and not get distributed to workers/executors?
API Documentation
foreachRDD(func)
The most generic output operator that applies a
function, func, to each RDD generated from the stream. This function
should push the data in each RDD to an external system, such as saving
the RDD to files, or writing it over the network to a database. Note
that the function func is executed in the driver process running the
streaming application, and will usually have RDD actions in it that
will force the computation of the streaming RDDs.
so does that mean all processing logic will only run on the Driver and not get distributed to workers/executors?
No, the function itself runs on the driver, but don't forget that it operates on an RDD. The inner functions that you'll use on the RDD, such as foreachPartition, map, filter, etc., will still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver unless you call methods like collect, which do.
To make this clear, if you run the following, you will see "monkey" on the driver's stdout:
myDStream.foreachRDD { rdd =>
  println("monkey")
}
If you run the following, you will see "monkey" on the driver's stdout, and the filter work will be done on whatever executors the rdd is distributed across:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
}
Let's add the simplification that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions that we'll call PartitionSetA that exist on MachineSetB where ExecutorSetC are running. If you run the following, you will see "monkey" on the driver's stdout, you will see "turtle" on the stdouts of all executors in ExecutorSetC ("turtle" will appear once for each partition -- many partitions could be on the machine where an executor is running), and the work of both the filter and addition operations will be done across ExecutorSetC:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
  }
}
One more thing to note is that in the following code, y would end up being sent across the network from the driver to all of ExecutorSetC for each rdd:
val y = 2
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + y
  }
}
To avoid this overhead, you can use broadcast variables, which send the value from the driver to the executors just once. For example:
val y = 2
val broadcastY = sc.broadcast(y)
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + broadcastY.value
  }
}
For sending more complex things over as broadcast variables, such as objects that aren't easily serializable once instantiated, you can see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html
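As a rough illustration of the idea from that post (not its exact code), the usual pattern is to broadcast a small serializable factory and create the heavyweight object lazily on each executor. The Lazily wrapper and the java.util.Random stand-in below are hypothetical:
// Hypothetical wrapper: broadcast the factory, not the heavyweight object itself.
class Lazily[T](factory: () => T) extends Serializable {
  // @transient + lazy: never serialized, created on first use in each executor JVM
  @transient lazy val get: T = factory()
}

val heavyClient = sc.broadcast(new Lazily(() => new java.util.Random())) // stand-in for a real client
myDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val client = heavyClient.value.get // instantiated lazily on the executor
    partition.foreach(record => println(s"$record -> ${client.nextInt(10)}"))
  }
}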

adding new elements to batch RDD from DStream RDD

The only way to join / union / cogroup a DStream RDD with a batch RDD is via the transform method, which returns another DStream RDD and hence gets discarded at the end of the micro-batch.
Is there any way to, e.g., union a DStream RDD with a batch RDD so that it produces a new batch RDD containing the elements of both the DStream RDD and the batch RDD?
And once such a batch RDD is created in the above way, can it be used by other DStream RDDs to, e.g., join with, since this time the result can be another DStream RDD?
Effectively, the functionality described above would result in periodic updates (additions) of elements to a batch RDD; the additional elements would keep coming from DStream RDDs which keep streaming in with every micro-batch.
Also, newly arriving DStream RDDs would be able to join with the thus previously updated batch RDD and produce a result DStream RDD.
Something almost like that can be achieved with updateStateByKey, but is there a way to do it as described here?
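For reference, a minimal sketch of that updateStateByKey idea (the stream, keying, and state type here are hypothetical): it keeps accumulating, per key, every value seen so far across micro-batches.
ssc.checkpoint("/tmp/spark-checkpoint") // stateful operations need a checkpoint directory (path is hypothetical)
val keyed: DStream[(String, String)] =
  lines.map(line => (line.takeWhile(_ != ','), line)) // hypothetical keying of a DStream[String] called `lines`
val accumulated: DStream[(String, Seq[String])] =
  keyed.updateStateByKey[Seq[String]] { (newValues: Seq[String], state: Option[Seq[String]]) =>
    Some(state.getOrElse(Seq.empty) ++ newValues)
  }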
Another approach would be to transform the batch input into a DStream and union it with your streaming input. Then you write it out using foreachRDD, and that output is now your batch input for other jobs.
val batch = sc.textFile(...)
val ssc = new StreamingContext(sc, Seconds(30))
val stream = ssc.textFileStream(...)
import scala.collection.mutable
val batchStream = ssc.queueStream(mutable.Queue.empty[RDD[String]], oneAtATime = false, defaultRDD = batch)
val union = ssc.union(Seq(stream, batchStream))
union.print()
union.foreachRDD { rdd =>
  // Delete previous output, or use SchemaRDD with .insertInto(, overwrite = true)
  rdd.saveAsTextFile(...)
}
ssc.start()
ssc.awaitTermination()
