Spark streaming join Kafka topics comparison - apache-spark

We needed to implement a join on Kafka topics that takes late data and "not in join" data into account, meaning records that arrive late on the stream or never find a match are not dropped/lost but are marked with a timeout;
the result of the join is produced to an output Kafka topic (with a timeout field when that happens).
(Spark 2.1.1 in standalone deployment, Kafka 0.10)
Input Kafka topics: X, Y, ... The output topic records will look like:
{
"keyJoinFiled": 123456,
"xTopicData": {},
"yTopicData": {},
"isTimeOutFlag": true
}
I found three solutions and wrote them up here. Solutions 1 and 2 are from the Spark Streaming official documentation but are not relevant for us (data that is not in the joined DStream, i.e. arrives "business time" late, is dropped/lost); I wrote them up anyway for comparison.
From what we saw there are not many examples of joining Kafka topics with a stateful operation, so I am adding some code here for review:
1) According to the Spark Streaming documentation,
https://spark.apache.org/docs/2.1.1/streaming-programming-guide.html:
val stream1: DStream[(String, String)] = ...
val stream2: DStream[(String, String)] = ...
val joinedStream = stream1.join(stream2)
This joins the data of both streams within each batch duration, but data that arrives "business time" late / has no match in the join is dropped/lost.
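For concreteness, a minimal sketch of what option 1 looks like against two Kafka direct streams (rawLeft/rawRight are hypothetical names for the un-windowed streams; it reuses the xTopic/yTopic case classes defined at the end of the post):
// Key both streams by session id so the pair-DStream join can match them.
val left = rawLeft.map { r =>
  val msg = gson.fromJson(r.value(), classOf[xTopic])
  (msg.sessionId, r.value())
}
val right = rawRight.map { r =>
  val msg = gson.fromJson(r.value(), classOf[yTopic])
  (msg.sessionId.toString, r.value())
}
// Inner join per batch: records whose counterpart is not in the same batch are dropped.
val joinedStream = left.join(right)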
2) Window join:
val leftWindowDF = kafkaStreamLeft.window(Minutes(input_parameter_time))
val rightWindowDF = kafkaStreamRight.window(Minutes(input_parameter_time))
leftWindowDF.join(rightWindowDF).foreachRDD...
2.1) In our case we need to use a tumbling window, sized with the Spark Streaming batch interval in mind.
2.2) Need to keep a lot of data in memory/on disk, for example a 30-60 min window.
2.3) And again, data that arrives late / is not in the window / is not in the join is dropped/lost.
* Since Spark 2.3.1, Structured Streaming stream-to-stream join is supported, but we encountered a bug where the HDFS state store was not cleaned up; as a result the job was failing every few hours with OOM. This is resolved in 2.4,
https://issues.apache.org/jira/browse/SPARK-23682
(use RocksDB, or a CustomStateStoreProvider HDFS state store). A minimal sketch of such a join follows below.
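For reference, a minimal sketch of a watermarked stream-to-stream join in Structured Streaming (Spark 2.3+; outer joins require watermarks and a time-range condition; the DataFrames and column names here are hypothetical, assuming both Kafka sources are already parsed):
import org.apache.spark.sql.functions.expr

// xDf and yDf are assumed to be parsed streaming DataFrames with columns
// (xSessionId, xEventTime, ...) and (ySessionId, yEventTime, ...).
val xWm = xDf.withWatermark("xEventTime", "30 minutes")
val yWm = yDf.withWatermark("yEventTime", "30 minutes")

val joined = xWm.join(
  yWm,
  expr("""xSessionId = ySessionId AND
          yEventTime BETWEEN xEventTime - INTERVAL 30 MINUTES
                         AND xEventTime + INTERVAL 30 MINUTES"""),
  "leftOuter")
// Unmatched left-side rows are emitted with nulls on the right once the watermark passes,
// which plays the role of the isTimeOutFlag in the output above.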
3) Using the stateful operation mapWithState to join the Kafka topic DStreams, with a tumbling window and a 30 min timeout for late data; everything produced to the output topic contains the joined messages from all topics if a join occurred, or only part of the topic data if no join occurred within 30 min (marked with the is_time_out flag).
3.1) Create 1..n DStreams per topic, convert them to key/value Unioned records with the join field as the key and a tumbling window, and create a catch-all schema.
3.2) Union all the streams.
3.3) Run mapWithState on the union stream with a function that actually does the join / marks the timeout.
A great example of a stateful join from Databricks (Spark 2.2.0):
https://www.youtube.com/watch?time_continue=1858&v=JAb4FIheP28
Below is sample code that is running/being tested.
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> groupId,
"session.timeout.ms" -> "30000"
)
//Kafka xTopic DStream
val kafkaStreamLeft = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](leftTopic.split(",").toSet, kafkaParams)
).map(record => {
  val msg: xTopic = gson.fromJson(record.value(), classOf[xTopic])
  Unioned(Some(msg), None, if (msg.sessionId != null) msg.sessionId.toString else "")
}).window(Minutes(leftWindow), Minutes(leftWindow))
//Kafka yTopic DStream
val kafkaStreamRight = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](rightTopic.split(",").toSet, kafkaParams)
).map(record => {
  val msg: yTopic = gson.fromJson(record.value(), classOf[yTopic])
  Unioned(None, Some(msg), if (msg.sessionId != null) msg.sessionId.toString else "")
}).window(Minutes(rightWindow), Minutes(rightWindow))
//Convert the streams to (key, value) pairs and filter out empty session ids.
val unionStream = kafkaStreamLeft.union(kafkaStreamRight)
  .map(record => (record.sessionId, record))
  .filter(record => !record._1.toString.isEmpty)
val stateSpec = StateSpec.function(stateUpdateF).timeout(Minutes(timeout.toInt))
unionStream.mapWithState(stateSpec).foreachRDD(rdd => {
  try {
    if (!rdd.isEmpty()) rdd.foreachPartition(partition => {
      // The producer is created inside foreachPartition so it is instantiated on the
      // executor (KafkaProducer is not serializable).
      val props = new util.HashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      //Send the result JSON to Kafka; skip the empty/None placeholders emitted by the state function.
      partition.foreach(record => {
        if (record != null && !"".equals(record) && !"()".equals(record.toString) && !"None".equals(record.toString)) {
          producer.send(new ProducerRecord[String, String](outTopic, null, gson.toJson(record)))
        }
      })
      producer.close()
    })
  } catch {
    case e: Exception =>
      logger.error(s"error joining topics ${leftTopic} ${rightTopic} to out topic ${outTopic}", e)
  }
})
//mapWithState function, called for each key occurrence with the new values (if any) and the existing state (if it exists).
def stateUpdateF = (keySessionId: String, newItemValues: Option[Unioned], state: State[Unioned]) => {
  val currentState = state.getOption().getOrElse(Unioned(None, None, keySessionId))
  val newVal: Unioned = newItemValues match {
    case Some(newItemValue) =>
      if (newItemValue.yTopic.isDefined)
        Unioned(if (newItemValue.xTopic.isDefined) newItemValue.xTopic else currentState.xTopic, newItemValue.yTopic, keySessionId)
      else if (newItemValue.xTopic.isDefined)
        Unioned(newItemValue.xTopic, if (currentState.yTopic.isDefined) currentState.yTopic else newItemValue.yTopic, keySessionId)
      else newItemValue
    case _ => currentState //None means a timeout => keep the current state
  }
  val processTs = LocalDateTime.now()
  val processDate = dtf.format(processTs)
  if (newVal.xTopic.isDefined && newVal.yTopic.isDefined) {
    //We have a join: remove the key from the state store and emit the joined record.
    state.remove()
    JoinState(newVal.sessionId, newVal.xTopic, newVal.yTopic, false, processTs.toInstant(ZoneOffset.UTC).toEpochMilli, processDate)
  } else if (state.isTimingOut()) {
    //Timed out: do not try to remove the state manually, it is removed automatically.
    JoinState(newVal.sessionId, newVal.xTopic, newVal.yTopic, true, processTs.toInstant(ZoneOffset.UTC).toEpochMilli, processDate)
  } else {
    //No join yet: keep waiting for the other side (this branch emits Unit, filtered out above).
    state.update(newVal)
  }
}
//Case classes for the Kafka topic data (x, y topics); the join is done on the session id field.
case class xTopic(sessionId: String, param1: String, param2: String, sessionCreationDate: String)
case class yTopic(sessionId: Long, clientTimestamp: String)
//Catch-all schema: an object that contains both Kafka input topic fields and the key value for the join.
case class Unioned(xTopic: Option[xTopic], yTopic: Option[yTopic], sessionId: String)
//Class for the output result of the stateful join function.
case class JoinState(sessionId: String, xTopic: Option[xTopic], yTopic: Option[yTopic], isTimeOut: Boolean, processTs: Long, processDate: String)
I would be happy to get some review.
Sorry for the long post.

I was under the impression this use case is solved by the Sessionization API:
StructuredSessionization.scala
and Stateful Operations in Structured Streaming.
Or am I missing something?
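If that direction fits, here is a minimal sketch (names hypothetical, reusing the xTopic/yTopic case classes from the post) of the same join-or-timeout behaviour with flatMapGroupsWithState and a processing-time timeout in Structured Streaming:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical event/output types mirroring Unioned/JoinState above.
case class Event(sessionId: String, xTopic: Option[xTopic], yTopic: Option[yTopic])
case class Output(sessionId: String, xTopic: Option[xTopic], yTopic: Option[yTopic], isTimeOut: Boolean)

def join(sessionId: String, events: Iterator[Event], state: GroupState[Event]): Iterator[Output] = {
  if (state.hasTimedOut) {
    // No match arrived within the timeout: emit what we have, flagged as timed out.
    val cur = state.get
    state.remove()
    Iterator(Output(sessionId, cur.xTopic, cur.yTopic, isTimeOut = true))
  } else {
    // Merge the new events into the existing partial state.
    val merged = events.foldLeft(state.getOption.getOrElse(Event(sessionId, None, None))) { (acc, e) =>
      Event(sessionId, e.xTopic.orElse(acc.xTopic), e.yTopic.orElse(acc.yTopic))
    }
    if (merged.xTopic.isDefined && merged.yTopic.isDefined) {
      state.remove()
      Iterator(Output(sessionId, merged.xTopic, merged.yTopic, isTimeOut = false))
    } else {
      state.update(merged)
      state.setTimeoutDuration("30 minutes")
      Iterator.empty
    }
  }
}

// events: Dataset[Event] built by parsing and unioning both Kafka sources
// (requires import spark.implicits._ for the encoders).
val results = events
  .groupByKey(_.sessionId)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout)(join)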

Related

How to repartition Spark DStream Kafka ConsumerRecord RDD

I am getting uneven sizes across my Kafka topics, and we want to repartition the input RDD based on some logic.
But when I try to apply the repartition I get an "object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord)" error.
I found the following workaround:
Job aborted due to stage failure: Task not serializable
Call rdd.forEachPartition and create the NotSerializable object in there like this:
rdd.forEachPartition(iter -> {
NotSerializable notSerializable = new NotSerializable();
// ...Now process iter
});
The above logic is applied here; not sure if I missed anything:
val stream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParam))
stream.foreachRDD { rdd =>
  val repartitionRDD = flow.repartitionRDD(rdd, 1)
  println("&&&&&&&&&&&&&& repartitionRDD " + repartitionRDD.count())
  val modifiedRDD = rdd.mapPartitions { iter =>
    // Collect the partition's ConsumerRecords into a buffer and hand them back as an iterator.
    val customerRecords = scala.collection.mutable.ListBuffer[ConsumerRecord[String, String]]()
    while (iter.hasNext) {
      customerRecords += iter.next()
    }
    customerRecords.iterator
  }
  val r = modifiedRDD.repartition(1)
  println("************* after repartition " + r.count())
}
But I am still getting the same object-not-serializable error. Any help is greatly appreciated.
I tried to make the stream transient, but that did not resolve the issue either.
I made the test class Serializable, but that did not fix the issue.
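One hedged workaround sketch (not from the original thread): ConsumerRecord is not serializable, so extract the fields you need into a serializable shape before any operation that shuffles data, and only then repartition:
stream.foreachRDD { rdd =>
  // Keep the offset ranges before transforming, if you need to commit them later.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Map each ConsumerRecord to a plain (key, value) tuple, which is serializable,
  // so the shuffle caused by repartition never has to serialize ConsumerRecord itself.
  val keyValueRdd = rdd.map(record => (record.key(), record.value()))
  val repartitioned = keyValueRdd.repartition(1)
  println("after repartition " + repartitioned.count())
}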

How to write every record to multiple kafka topics in Spark Streaming 2.3.1?

How do I write every record to multiple Kafka topics in Spark Streaming 2.3.1? In other words, say I have 5 records and two output Kafka topics: I want all 5 records in both output topics.
The question linked here doesn't cover the Structured Streaming case; I am looking specifically for Structured Streaming.
Not sure if you are using Java or Scala. Below is code that produces a message to two different topics; you'll have to call foreachPartition on your dataset:
dataset.foreachPartition(partitionRows => {
  val props = new util.HashMap[String, Object]()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServer)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  partitionRows.foreach(row => {
    val offerId = row.get(0).toString.replace("[", "").replace("]", "")
    // Send the same payload to both output topics.
    producer.send(new ProducerRecord[String, String]("topic1", offerId))
    producer.send(new ProducerRecord[String, String]("topic2", offerId))
  })
  producer.close()
})
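Since the question asks specifically about Structured Streaming, here is a minimal sketch (column names and checkpoint path are hypothetical) using the built-in Kafka sink: the sink routes each row to the topic named in its topic column, so duplicating every row once per output topic writes it to both:
import org.apache.spark.sql.functions.{array, explode, lit}

// df is the streaming DataFrame to publish; key/value columns are assumed to exist.
val out = df
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .withColumn("topic", explode(array(lit("topic1"), lit("topic2")))) // one copy of each row per topic

out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootStrapServer)
  .option("checkpointLocation", "/tmp/checkpoint-multi-topic") // hypothetical path
  .start()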

Unable to Iterate over the list of keys retrieved from coverting Dstream to List while using spark streaming with kafka

Below is my code for Spark Streaming with Kafka.
Here I am trying to get the keys of the batch as a DStream and then convert it to a list, in order to iterate over it and put the data pertaining to each key into an HDFS folder named after the key.
The key is basically Schema.Table_name.
val ssc = new StreamingContext(sparkConf, Seconds(args{7}.toLong)) // configured to run for every 60 seconds
val warehouseLocation="Spark-warehouse"
val spark = SparkSession.builder.config(sparkConf).getOrCreate()
import spark.implicits._
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString("kafka.brokers"),
"zookeeper.connect" -> conf.getString("kafka.zookeeper"),
"group.id" -> conf.getString("kafka.consumergroups"),
"auto.offset.reset" -> args { 1 },
"enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"security.protocol" -> "SASL_PLAINTEXT",
"session.timeout.ms" -> args { 2 },
"max.poll.records" -> args { 3 },
"request.timeout.ms" -> args { 4 },
"fetch.max.wait.ms" -> args { 5 })
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
Extracting the keys, but it is of type DStream[String]:
val keys = messages.map(x=>(x.key()))
var final_list_of_keys = List[String]()
Converting it into a list and updating var final_list_of_keys:
keys.foreachRDD( rdd => {
val df_keys = spark.read.json(rdd).distinct().toDF().persist(StorageLevel.MEMORY_ONLY)
df_keys.show()
val comma_separated_keys= df_keys.distinct().collect().mkString("").replace("[","").replace("]",",")
final_list_of_keys= comma_separated_keys.split(",").toList
Now trying to iterate over the list:
for ( i <- final_list_of_keys)
{
println(i)
val message1 = messages.filter(x => x.key().toString().equals(i)).map(x=>x.value()).persist(StorageLevel.MEMORY_ONLY) //.toString())
message1.foreachRDD((rdd, batchTime) => {
if (!rdd.isEmpty())
{
val df1 = spark.read.json(rdd).persist(StorageLevel.MEMORY_ONLY) //.withColumn("pharmacy_location",lit(args{6}))
val df2=df1.withColumn("message",struct( struct($"message.data.*",lit(args{6}).as("pharmacy_location")).alias("data"), struct($"message.headers.*").as("headers"))).persist(StorageLevel.MEMORY_ONLY)
val df3= df2.drop("headers").drop("messageSchema").drop("messageSchemaId").persist(StorageLevel.MEMORY_ONLY)
df3.coalesce(1).write.json(conf.getString("hdfs.streamoutpath1")+ PATH_SEPERATOR + i + PATH_SEPERATOR + args{6}+ PATH_SEPERATOR+ date_today.format(System.currentTimeMillis())
+ PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis()) + PATH_SEPERATOR + System.currentTimeMillis())
df1.unpersist
df2.unpersist()
df3.unpersist()
}
})
try
{
messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) // push it back
}
}
catch
{
case e: BlockMissingException => e.printStackTrace()
case e: IOException => e.printStackTrace()
case e:Throwable => e.printStackTrace()
}
}
ssc.start()
ssc.awaitTermination()
But I get the error: Adding new inputs, transformations, and output operations after starting a context is not supported.
When I tried to keep the for loop over the list outside keys.foreachRDD, the list does not get updated and remains empty.
Can someone please advise how I can rework this code so that the keys end up in a list and I can then go over them to put the data in the correct directory?
From my research I saw a similar post but was unable to gather any solution from it.
Also, as I am using map/filter inside a foreachRDD and then another foreachRDD inside it, that can cause a problem; refer to the post with similar code.
Below is the code for the problem:
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)).persist(StorageLevel.MEMORY_ONLY)

messages.foreachRDD((rdd, batchTime) => {
  // Kafka sends data as key/value pairs where the key is the table name;
  // first collect the distinct keys of this batch (e.g. this batch had 5 tables).
  val table_list = rdd.map(x => x.key()).distinct().collect()
  // For each table name, filter the rdd down to the records belonging to that table.
  val rddList = table_list.map(x => (x, rdd.filter(y => y.key().equals(x))))
  // Go over the tables one by one (not in parallel); each tuple is (tableName, records).
  rddList.foreach(tuple => {
    val tableName = tuple._1.toString() // tuple._1 is the table name
    val tableRdd = tuple._2.map(x => x.value()).persist(StorageLevel.MEMORY_ONLY) // tuple._2 holds the records to write to HDFS
    println(tableName)
    /* Your logic */
  })
})
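If the offsets also need to be committed (as attempted in the original code), a minimal sketch of doing it inside the same foreachRDD, on the same RDD, once the per-table writes of that batch have finished:
messages.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... per-table processing as above ...
  // Commit the offsets of this very RDD only after its processing has finished.
  messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}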

Trying to understand spark streaming flow

I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it is, foreachRDD is happening at the driver level? So basically all that block of code:
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current dataframe to yield a new dataframe, and then calls another method (in another jar) which stores the dataframe to Cassandra.
So if this is all happening at the driver level, we are not utilizing the Spark executors. How can I make sure the executors are used to process each partition of the RDD in parallel?
My spark streaming service run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. However, since we're operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we can't tell you about the locality of its execution.
foreachRDD may run locally, but that is just the setup. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
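The same guide also shows the recommended way to push records out without creating a connection per record: create the connection inside foreachPartition, on the worker (sketch; createNewConnection is a placeholder):
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // executed at the worker: one connection per partition instead of per record
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}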
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection then this is done on the driver
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
//This will run locally, creating a distributed record
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
//This will redistribute the work
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
//again, setting this up locally, to be run distributed
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you probably can rewrite this to not need the collect and keep it all distributed, but that is for you, not StackOverflow.

reduceByKey doesn't work in spark streaming

I have the following code snippet in which the reduceByKey doesn't seem to work.
val myKafkaMessageStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topicsSet, kafkaParams)
)
myKafkaMessageStream
.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
val myIter = rdd.mapPartitionsWithIndex { (i, iter) =>
val offset = offsetRanges(i)
iter.map(item => {
(offset.fromOffset, offset.untilOffset, offset.topic, offset.partition, item)
})
}
val myRDD = myIter.filter( (<filter_condition>) ).map(row => {
//Process row
((field1, field2, field3) , (field4, field5))
})
val result = myRDD.reduceByKey((a,b) => (a._1+b._1, a._2+b._2))
result.foreachPartition { partitionOfRecords =>
//I don't get the reduced result here
val connection = createNewConnection()
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
}
}
Am I missing something?
In a streaming situation, it makes more sense to me to use reduceByKeyAndWindow, which does what you're looking for, but over a specific time frame.
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
"When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks."
http://spark.apache.org/docs/latest/streaming-programming-guide.html
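Applied to the snippet above, a minimal sketch (parseToPair and the (Long, Long) value type are assumptions, and the offset-tagging step is omitted). Note that reduceByKey inside foreachRDD does run, but it only aggregates within that single batch; a windowed reduction on the DStream itself spans batches:
// Build the pair DStream up front instead of inside foreachRDD, so the
// aggregation can span the batches inside the window.
val pairs = myKafkaMessageStream.map { record =>
  // parse record.value() into the same shape used in the question:
  // ((field1, field2, field3), (field4, field5))
  parseToPair(record.value()) // hypothetical parser
}

// Reduce the last 30 seconds of data every 10 seconds.
val reduced = pairs.reduceByKeyAndWindow(
  (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2),
  Seconds(30), Seconds(10))

reduced.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection() // placeholder, as in the question
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}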
