I have the following code snippet in which the reduceByKey doesn't seem to work.
val myKafkaMessageStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topicsSet, kafkaParams)
)
myKafkaMessageStream
  .foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val myIter = rdd.mapPartitionsWithIndex { (i, iter) =>
      val offset = offsetRanges(i)
      iter.map(item => {
        (offset.fromOffset, offset.untilOffset, offset.topic, offset.partition, item)
      })
    }
    val myRDD = myIter.filter( (<filter_condition>) ).map(row => {
      //Process row
      ((field1, field2, field3), (field4, field5))
    })
    val result = myRDD.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    result.foreachPartition { partitionOfRecords =>
      //I don't get the reduced result here
      val connection = createNewConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      connection.close()
    }
  }
Am I missing something?
In a streaming situation, it makes more sense to me to use reduceByKeyAndWindow which does what you're looking for, but over a specific time frame.
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
"When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks."
http://spark.apache.org/docs/latest/streaming-programming-guide.html
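For example, a minimal sketch of how that could look against the pair stream from the question, keeping the filter and field names as placeholders and assuming the value fields are numeric:

// Build the keyed pairs with DStream transformations instead of inside foreachRDD,
// then reduce the last 30 seconds of data every 10 seconds.
val pairs = myKafkaMessageStream
  .filter(record => /* <filter_condition> */ true)
  .map(record => {
    // Process record ...
    ((field1, field2, field3), (field4, field5)) // placeholders from the question
  })

val windowedResult = pairs.reduceByKeyAndWindow(
  (a, b) => (a._1 + b._1, a._2 + b._2), // same reduce function as the question
  Seconds(30),  // window length
  Seconds(10))  // slide interval

windowedResult.foreachRDD(rdd => rdd.foreachPartition { partitionOfRecords =>
  val connection = createNewConnection()
  partitionOfRecords.foreach(record => connection.send(record))
  connection.close()
})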
We needed to implement a join on Kafka topics that accounts for late data, or "not in join": data that arrives late on the stream or never finds a match is not dropped/lost, but is marked as a timeout.
The result of the join is produced to an output Kafka topic (with a timeout field if one occurred).
(Spark 2.1.1 in standalone deployment, Kafka 0.10)
Kafka input topics: X, Y, ... The result on the output topic will look like:
{
"keyJoinFiled": 123456,
"xTopicData": {},
"yTopicData": {},
"isTimeOutFlag": true
}
I found three solutions and wrote them up here. Options 1 and 2 are from the official Spark Streaming documentation but are not relevant to us (data that is not in the join DStream, or arrives "business time" late, is dropped/lost); I include them for comparison.
From what we saw, there are not many examples of joining Kafka topics with a stateful operation, so I am adding some code here for review:
1) According to the Spark Streaming documentation,
https://spark.apache.org/docs/2.1.1/streaming-programming-guide.html:
val stream1: DStream[String, String] = ...
val stream2: DStream[String, String] = ...
val joinedStream = stream1.join(stream2)
This joins data from both streams within the batch duration, but data that arrives "business time" late or is not in the join is dropped/lost.
2) Window join:
val leftWindowDF = kafkaStreamLeft.window(Minutes(input_parameter_time))
val rightWindowDF = kafkaStreamRight.window(Minutes(input_parameter_time))
leftWindowDF.join(rightWindowDF).foreachRDD...
2.1) In our case we need to use a tumbling window, taking the Spark Streaming batch interval into consideration.
2.2) We need to keep a lot of data in memory/on disk, for example a 30-60 minute window.
2.3) And again, data that arrives late / is not in the window / is not in the join is dropped/lost.
* Since Spark 2.3.1, Structured Streaming stream-to-stream join is supported, but we encountered a bug where the HDFS state store was not cleaned up; as a result the job was failing every few hours with OOM. This was resolved in 2.4: https://issues.apache.org/jira/browse/SPARK-23682 (use RocksDB, or a CustomStateStoreProvider HDFS state store).
3) Using the stateful operation mapWithState to join the Kafka topic DStreams, with a tumbling window and a 30-minute timeout for late data. All data produced to the output topic contains the joined messages from all topics if a join occurred, or only part of the topic data if no join occurred within 30 minutes (marked with an is_time_out flag).
3.1) Create 1..n DStreams per topic and convert them to key/value Unioned records, with the join field as the key and a tumbling window, creating a catch-all schema.
3.2) Union all the streams.
3.3) Run mapWithState on the union stream with a function that actually does the join / marks the timeout.
A great example of a stateful join from Databricks (Spark 2.2.0):
https://www.youtube.com/watch?time_continue=1858&v=JAb4FIheP28
Here is sample code that we are running/testing.
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> groupId,
"session.timeout.ms" -> "30000"
)
//Kafka xTopic DStream
val kafkaStreamLeft = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](leftTopic.split(",").toSet, kafkaParams)
).map(record => {
val msg:xTopic = gson.fromJson(record.value(),classOf[xTopic])
Unioned(Some(msg),None,if (msg.sessionId!= null) msg.sessionId.toString else "")
}).window(Minutes(leftWindow),Minutes(leftWindow))
//Kafka yTopic DStream
val kafkaStreamRight = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](rightTopic.split(",").toSet, kafkaParams)
).map(record => {
val msg:yTopic = gson.fromJson(record.value(),classOf[yTopic])
Unioned(None,Some(msg),if (msg.sessionId!= null) msg.sessionId.toString else "")
}).window(Minutes(rightWindow),Minutes(rightWindow))
//convert stream to key, value pair and filter empty session id.
val unionStream = kafkaStreamLeft.union(kafkaStreamRight).map(record =>(record.sessionId,record))
.filter(record => !record._1.toString.isEmpty)
val stateSpec = StateSpec.function(stateUpdateF).timeout(Minutes(timeout.toInt))
unionStream.mapWithState(stateSpec).foreachRDD(rdd => {
try{
if(!rdd.isEmpty()) rdd.foreachPartition(partition =>{
val props = new util.HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
//send to kafka result JSON.
partition.foreach(record => {
if(record!=null && !"".equals(record) && !"()".equals(record.toString) && !"None".equals(record.toString) ){
producer.send(new ProducerRecord[String, String](outTopic, null, gson.toJson(record)))
}
})
producer.close()
})
}catch {
case e: Exception => {
  logger.error(s"error joining topics ${leftTopic} ${rightTopic} to out topic ${outTopic}", e)
}
}})
//mapWithState function that will be called for each key occurrence, with the new items in newItemValues and the state item if it exists.
def stateUpdateF = (keySessionId:String,newItemValues:Option[Unioned],state:State[Unioned])=> {
val currentState = state.getOption().getOrElse(Unioned(None,None,keySessionId))
val newVal:Unioned = newItemValues match {
case Some(newItemValue) => {
if (newItemValue.yTopic.isDefined)
Unioned(if(newItemValue.xTopic.isDefined) newItemValue.xTopic else currentState.xTopic,newItemValue.yTopic,keySessionId)
else if (newItemValue.xTopic.isDefined)
Unioned(newItemValue.xTopic, if(currentState.yTopic.isDefined)currentState.yTopic else newItemValue.yTopic,keySessionId)
else newItemValue
}
case _ => currentState //if None = timeout => currentState
}
val processTs = LocalDateTime.now()
val processDate = dtf.format(processTs)
if(newVal.xTopic.isDefined && newVal.yTopic.isDefined){//if we have a join remove from state
state.remove()
JoinState(newVal.sessionId,newVal.xTopic,newVal.yTopic,false,processTs.toInstant(ZoneOffset.UTC).toEpochMilli,processDate)
}else if(state.isTimingOut()){ //on timeout, do not try to remove the state manually; it is removed automatically.
JoinState(newVal.sessionId, newVal.xTopic, newVal.yTopic,true,processTs.toInstant(ZoneOffset.UTC).toEpochMilli,processDate)
}else{
state.update(newVal)
}
}
//case class for the Kafka topic data (x and y topics); the join will be on the session id field.
case class xTopic(sessionId:String,param1:String,param2:String,sessionCreationDate:String)
case class yTopic(sessionId:Long,clientTimestamp:String)
//catch-all schema: object that contains the fields of both Kafka input topics and the key value for the join.
case class Unioned(xTopic:Option[xTopic],yTopic:Option[yTopic],sessionId:String)
//case class for the output result of the stateful join function.
case class JoinState(sessionId:String, xTopic:Option[xTopic],yTopic:Option[yTopic],isTimeOut:Boolean,processTs:Long,processDate:String)
I would be happy to get some review.
Sorry for the long post.
I was under the impression this use case was solved by the Sessionization API:
StructuredSessionization.scala
And Stateful Operations in Structured Streaming
Or am I missing something?
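For reference, a minimal sketch of what that could look like with a Structured Streaming stream-stream join and watermarks (Spark 2.3+). This is not the original poster's code: the topic names, the sessionId extraction, the 30-minute bound, and the SparkSession named spark are all assumptions:

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`

// Read the x topic and pull the join key out of the JSON payload.
val xStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", brokers) // `brokers` is a placeholder
  .option("subscribe", "xTopic")
  .load()
  .select(
    get_json_object($"value".cast("string"), "$.sessionId").as("xSessionId"),
    $"value".cast("string").as("xJson"),
    $"timestamp".as("xTs"))
  .withWatermark("xTs", "30 minutes")

// Same for the y topic.
val yStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  .option("subscribe", "yTopic")
  .load()
  .select(
    get_json_object($"value".cast("string"), "$.sessionId").as("ySessionId"),
    $"value".cast("string").as("yJson"),
    $"timestamp".as("yTs"))
  .withWatermark("yTs", "30 minutes")

// Inner stream-stream join bounded by the watermarks. With Spark 2.3+ an outer
// join would additionally emit unmatched rows once the watermark passes, which
// roughly corresponds to the "timeout" case described in the long answer above.
val joined = xStream.join(
  yStream,
  expr("""xSessionId = ySessionId AND
          yTs BETWEEN xTs - interval 30 minutes AND xTs + interval 30 minutes"""))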
Below is my code for Spark Streaming with Kafka.
Here I am trying to get the keys for the batch as a DStream and then convert it to a list, in order to iterate over it and put the data pertaining to each key into an HDFS folder named after the key.
The key is basically Schema.Table_name.
val ssc = new StreamingContext(sparkConf, Seconds(args{7}.toLong)) // configured to run for every 60 seconds
val warehouseLocation="Spark-warehouse"
val spark = SparkSession.builder.config(sparkConf).getOrCreate()
import spark.implicits._
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString("kafka.brokers"),
"zookeeper.connect" -> conf.getString("kafka.zookeeper"),
"group.id" -> conf.getString("kafka.consumergroups"),
"auto.offset.reset" -> args { 1 },
"enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"security.protocol" -> "SASL_PLAINTEXT",
"session.timeout.ms" -> args { 2 },
"max.poll.records" -> args { 3 },
"request.timeout.ms" -> args { 4 },
"fetch.max.wait.ms" -> args { 5 })
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.
Subscribe[String, String](topicsSet, kafkaParams))
Extracting the keys, but it is of type DStream[String]:
val keys = messages.map(x=>(x.key()))
var final_list_of_keys = List[String]()
Converting it into a list and updating the var final_list_of_keys:
keys.foreachRDD( rdd => {
val df_keys = spark.read.json(rdd).distinct().toDF().persist(StorageLevel.MEMORY_ONLY)
df_keys.show()
val comma_separated_keys= df_keys.distinct().collect().mkString("").replace("[","").replace("]",",")
final_list_of_keys= comma_separated_keys.split(",").toList
Now trying to iterate over the list.
for ( i <- final_list_of_keys)
{
println(i)
val message1 = messages.filter(x => x.key().toString().equals(i)).map(x=>x.value()).persist(StorageLevel.MEMORY_ONLY) //.toString())
message1.foreachRDD((rdd, batchTime) => {
if (!rdd.isEmpty())
{
val df1 = spark.read.json(rdd).persist(StorageLevel.MEMORY_ONLY) //.withColumn("pharmacy_location",lit(args{6}))
val df2=df1.withColumn("message",struct( struct($"message.data.*",lit(args{6}).as("pharmacy_location")).alias("data"), struct($"message.headers.*").as("headers"))).persist(StorageLevel.MEMORY_ONLY)
val df3= df2.drop("headers").drop("messageSchema").drop("messageSchemaId").persist(StorageLevel.MEMORY_ONLY)
df3.coalesce(1).write.json(conf.getString("hdfs.streamoutpath1")+ PATH_SEPERATOR + i + PATH_SEPERATOR + args{6}+ PATH_SEPERATOR+ date_today.format(System.currentTimeMillis())
+ PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis()) + PATH_SEPERATOR + System.currentTimeMillis())
df1.unpersist
df2.unpersist()
df3.unpersist()
}
})
try
{
messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges) // push it back
}
}
catch
{
case e: BlockMissingException => e.printStackTrace()
case e: IOException => e.printStackTrace()
case e:Throwable => e.printStackTrace()
}
}
ssc.start()
ssc.awaitTermination()
But I get the error: "Adding new inputs, transformations, and output operations after starting a context is not supported".
When I tried to keep the for loop over the list outside keys.foreachRDD, the list does not get updated and remains empty.
Can someone please advise how I can redo this code to get the keys into a list and then go over them to put the data into the correct directory?
From my research I saw this post:
Similar post, but I was unable to gather any solution from it.
Also, since I am using map and filter inside foreachRDD, and then another foreachRDD inside that, it can cause a problem.
Refer to this post: Post with similar code
Below is the code for the problem -
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.
Subscribe[String, String](topicsSet, kafkaParams)).persist(StorageLevel.MEMORY_ONLY)
messages.foreachRDD((rdd, batchTime) => // foreachRDD gives us each batch's RDD along with the batch time
{
  // Kafka sends data as key/value pairs and the key is the table name, so first
  // collect all the distinct keys (tables) that appear in this batch (e.g. 5 tables).
  val table_list = rdd.map(x => x.key()).distinct().collect()
  // For each table name x, filter the rdd down to the records whose key equals x.
  // rddList now pairs each key (table) with the records corresponding to it.
  val rddList = table_list.map(x => (x, rdd.filter(y => y.key().equals(x))))
  // Plain foreach here, not parallel: we go one table at a time; each tuple is (tableName, records).
  rddList.foreach(tuple =>
  {
    val tableName = tuple._1.toString() // tuple._1 is the table name
    // tuple._2 holds the full key/value records; we keep only the values to write to HDFS.
    val tableRdd = tuple._2.map(x => x.value()).persist(StorageLevel.MEMORY_ONLY)
    // val tableRdd = messages.filter(x => x.key().toString().equals(tableName)).map(x => x.value()).persist(StorageLevel.MEMORY_ONLY)
    println(tableName)
    /* Your logic */
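To close out the loop, one possible completion is sketched below, mirroring the write path and offset commit from the question (spark, conf, and PATH_SEPERATOR are assumed to be the same values defined there):

    // write this table's values to its own HDFS folder (path pieces are placeholders)
    if (!tableRdd.isEmpty()) {
      val df = spark.read.json(tableRdd)
      df.coalesce(1).write.mode("append")
        .json(conf.getString("hdfs.streamoutpath1") + PATH_SEPERATOR + tableName
          + PATH_SEPERATOR + System.currentTimeMillis())
    }
    tableRdd.unpersist()
  })
  // commit this batch's offsets only after every table has been written
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
})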
I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it, foreachRDD happens at the driver level? So basically all of this block of code:
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current DataFrame to yield a new DataFrame, and then calls another method (in another jar) which stores the DataFrame to Cassandra.
So if this is all happening at the driver level, we are not utilizing the Spark executors. How can I make it so that the executors are used to process each partition of the RDD in parallel?
My spark streaming service run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. But since we're operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we can't tell you about the locality of its execution.
The foreachRDD closure may run locally, but that is just the setup. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection then this is done on the driver
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
//This will run locally, creating a distributed record
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
//This will redistribute the work
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
//again, setting this up locally, to be run distributed
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you can probably rewrite this to not need the collect and keep it all distributed, but that is for you, not StackOverflow.
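That said, for illustration only, here is a rough sketch of one way to keep it distributed. The column names and recordDelimiter come from the snippet above; the headers-in-first-record handling is skipped and the Cassandra write options are placeholders:

import org.apache.spark.sql.functions._

// Explode each message into its records on the executors instead of collecting
// the rows to the driver, then split every record into its "\u0001" fields.
// (Assumes recordDelimiter is safe to use as a split pattern.)
val records = df
  .select(col("customer"), col("tableName"), col("messageContent"))
  .withColumn("record", explode(split(col("messageContent"), recordDelimiter)))
  .withColumn("fields", split(col("record"), "\u0001"))

// From here the per-record fields can be reshaped and written out (for example
// with the spark-cassandra-connector) without pulling anything back to the driver.
records.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table")) // placeholder names
  .mode("append")
  .save()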
I have this spark application:
val conf = new SparkConf().setMaster("local[*]")
.setAppName("StreamingSample")
.set("com.couchbase.bucket.test", "")
.set("com.couchbase.nodes", "test-machine")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.couchbaseStream(from = FromNow, to = ToInfinity)
.filter(!_.isInstanceOf[Snapshot]) // Don't print snapshots, just mutations and deletions
.checkpoint(Seconds(2))
.foreachRDD(rdd => {
val om: Broadcast[ObjectMapper] = ScalaObjectMapper.getInstance(rdd.sparkContext)
rdd.foreach {
case m: Mutation =>
val content: Map[String, Object] = om.value.readValue(m.content, classOf[Map[String, Object]])
content("objectType") match {
case "o" => println("o")
case "c" => println("c")
case "s" => println("s")
case unsupportedType => println("unsupported")
}
case m: Deletion => println("delete")
}
})
When Spark fails and recovers, how can I resume from the last position?
Unfortunately, the current connector version (1.2.1) can only stream either from the beginning or from the current position (end of the stream). So in your example, you have no choice but to change FromNow to FromBeginning and then skip (in code) past all the messages you've already seen until you catch up.
The client team is currently working on a new implementation that will be able to remember state, so you'll be able to restore from a specific point in the stream.
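A minimal sketch of that workaround; alreadyProcessed is a hypothetical predicate you would implement yourself (for example against a timestamp or a store of document IDs you have already handled):

ssc.couchbaseStream(from = FromBeginning, to = ToInfinity)
  .filter(!_.isInstanceOf[Snapshot])         // as before: mutations and deletions only
  .filter {
    case m: Mutation => !alreadyProcessed(m) // skip mutations seen before the failure
    case _           => true                 // keep everything else (e.g. deletions)
  }
  .foreachRDD(rdd => {
    // same processing as in the question
  })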
I am using Spark Streaming along with RabbitMQ. The streaming job fetches data from RabbitMQ and applies some transformations and actions. I want to know how to apply multiple actions (i.e. calculate two different feature sets) on the same stream. Is it possible? If yes, how do I pass the streaming object to multiple classes, as mentioned in the code?
val config = ConfigFactory.parseFile(new File("SparkStreaming.conf"))
val conf = new SparkConf(true).setAppName(config.getString("AppName"))
conf.set("spark.cleaner.ttl", "120000")
val sparkConf = new SparkContext(conf)
val ssc = new StreamingContext(sparkConf, Seconds(config.getLong("SparkBatchInterval")))
val rabbitParams = Map("storageLevel" -> "MEMORY_AND_DISK_SER_2","queueName" -> config.getString("RealTimeQueueName"),"host" -> config.getString("QueueHost"), "exchangeName" -> config.getString("QueueExchangeName"), "routingKeys" -> config.getString("QueueRoutingKey"))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
How do I process the stream from here:
val objProcessFeatureSet1 = new ProcessFeatureSet1(Some_Streaming_Object)
val objProcessFeatureSet2 = new ProcessFeatureSet2(Some_Streaming_Object)
ssc.start()
ssc.awaitTermination()
You can run multiple actions on the same DStream, as shown below:
import net.minidev.json.JSONValue
import net.minidev.json.JSONObject
val config = ConfigFactory.parseFile(new File("SparkStreaming.conf"))
val conf = new SparkConf(true).setAppName(config.getString("AppName"))
conf.set("spark.cleaner.ttl", "120000")
val sparkConf = new SparkContext(conf)
val ssc = new StreamingContext(sparkConf, Seconds(config.getLong("SparkBatchInterval")))
val rabbitParams = Map("storageLevel" -> "MEMORY_AND_DISK_SER_2","queueName" -> config.getString("RealTimeQueueName"),"host" -> config.getString("QueueHost"), "exchangeName" -> config.getString("QueueExchangeName"), "routingKeys" -> config.getString("QueueRoutingKey"))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
val jsonStream = receiverStream.map(byteData => {
JSONValue.parse(byteData)
})
jsonStream.filter(json => {
var customerType = json.get("customerType")
if(customerType.equals("consumer"))
true
else
false
}).foreachRDD(rdd => {
rdd.foreach(json => {
println("json " + json)
})
})
jsonStream.filter(json => {
var customerType = json.get("customerType")
if(customerType.equals("non-consumer"))
true
else
false
}).foreachRDD(rdd => {
rdd.foreach(json => {
println("json " + json)
})
})
ssc.start()
ssc.awaitTermination()
In the snippet above, I first create jsonStream from the received stream, then create two different streams from it based on the customer type, and then apply actions (foreachRDD) on them to print the results.
In a similar way, you can pass the same DStream to two different classes and apply the transformations and actions inside them to calculate the different feature sets; a sketch of that pattern is shown below.
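The class and method names here mirror the question and are hypothetical:

import org.apache.spark.streaming.dstream.DStream

// Each class registers its own transformations/actions on the shared DStream;
// both run for every batch once ssc.start() is called.
class ProcessFeatureSet1 {
  def process[T](stream: DStream[T]): Unit =
    stream.foreachRDD(rdd => rdd.foreach(record => println("feature set 1: " + record)))
}

class ProcessFeatureSet2 {
  def process[T](stream: DStream[T]): Unit =
    stream.foreachRDD(rdd => rdd.foreach(record => println("feature set 2: " + record)))
}

val objProcessFeatureSet1 = new ProcessFeatureSet1
val objProcessFeatureSet2 = new ProcessFeatureSet2
objProcessFeatureSet1.process(jsonStream)
objProcessFeatureSet2.process(jsonStream)

Since the same parsed stream feeds several actions, it may also be worth persisting it (jsonStream.cache()) so each batch is not recomputed for every action.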
I hope the above explanation helps you resolve the issue.
Thanks,
Hokam