I have this spark application:
val conf = new SparkConf().setMaster("local[*]")
  .setAppName("StreamingSample")
  .set("com.couchbase.bucket.test", "")
  .set("com.couchbase.nodes", "test-machine")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.couchbaseStream(from = FromNow, to = ToInfinity)
  .filter(!_.isInstanceOf[Snapshot]) // Don't print snapshots, just mutations and deletions
  .checkpoint(Seconds(2))
  .foreachRDD(rdd => {
    val om: Broadcast[ObjectMapper] = ScalaObjectMapper.getInstance(rdd.sparkContext)
    rdd.foreach {
      case m: Mutation =>
        val content: Map[String, Object] = om.value.readValue(m.content, classOf[Map[String, Object]])
        content("objectType") match {
          case "o" => println("o")
          case "c" => println("c")
          case "s" => println("s")
          case unsupportedType => println("unsupported")
        }
      case m: Deletion => println("delete")
    }
  })
When Spark fails and recovers, how can I resume streaming from the last position?
Unfortunately, the current connector version (1.2.1) can only stream either from the beginning or from the current position (end of the stream). So in your example, you have no choice but to change FromNow to FromBeginning and then skip (in code) past all the messages you've already seen until you catch up.
The client team is currently working on a new implementation that will be able to remember state, so you'll be able to restore from a specific point in the stream.
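For example (a minimal sketch, not a connector feature: alreadySeen is a hypothetical predicate you would implement yourself, e.g. against a timestamp or document id you persist as you process):
// Restart from the beginning and skip everything processed before the failure.
ssc.couchbaseStream(from = FromBeginning, to = ToInfinity)
  .filter(!_.isInstanceOf[Snapshot])
  .filter {
    case m: Mutation => !alreadySeen(m) // hypothetical check against your own persisted progress marker
    case _ => true                      // keep deletions (or apply the same check)
  }
  .foreachRDD(rdd => {
    // same processing as in the code above
  })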
Related
We needed to implement a join on Kafka topics that takes late data or "not in join" data into account, meaning data that arrives late on the stream or does not join is not dropped/lost but is marked as a timeout;
the result of the join is produced to an output Kafka topic (with a timeout field if one occurred).
(Spark 2.1.1 in standalone deployment, Kafka 0.10)
Kafka input topics: X, Y, ...; the output topic result will look like:
{
  "keyJoinField": 123456,
  "xTopicData": {},
  "yTopicData": {},
  "isTimeOutFlag": true
}
I found three solutions and wrote them up here. Solutions 1 and 2 are from the official Spark Streaming documentation, but they are not relevant for us (data that is not in the join DStream, or arrives "business time" late, is dropped/lost); I include them for comparison.
From what we saw, there are not too many examples of joining Kafka topics with a stateful operation, so I am adding some code here for review:
1) According to the Spark Streaming documentation,
https://spark.apache.org/docs/2.1.1/streaming-programming-guide.html:
val stream1: DStream[String, String] = ...
val stream2: DStream[String, String] = ...
val joinedStream = stream1.join(stream2)
This will join data from both streams within the same batch duration, but data that arrives "business time" late or does not join will be dropped/lost.
2) Window join:
val leftWindowDF = kafkaStreamLeft.window(Minutes(input_parameter_time))
val rightWindowDF = kafkaStreamRight.window(Minutes(input_parameter_time))
leftWindowDF.join(rightWindowDF).foreachRDD...
2.1) In our case we would need a tumbling window that matches the Spark Streaming batch interval.
2.2) We would need to keep a lot of data in memory/on disk, for example a 30-60 minute window.
2.3) And again, data that arrives late / falls outside the window / does not join is dropped/lost.
* Since Spark 2.3.1, Structured Streaming supports stream-to-stream joins, but we hit a bug where the HDFS state store was not cleaned up; as a result the job was failing every few hours with an OOM. It is resolved in 2.4 (https://issues.apache.org/jira/browse/SPARK-23682); the workaround is to use RocksDB or a CustomStateStoreProvider instead of the HDFS state store.
3) Using the stateful operation mapWithState to join the Kafka topic DStreams, with a tumbling window and a 30-minute timeout for late data. Everything produced to the output topic contains the joined messages from all topics if a join occurred, or only part of the topic data if no join occurred within 30 minutes (marked with an is_time_out flag).
3.1) Create 1..n DStreams per topic, convert them to key/value (Unioned) records with the join field as the key, apply a tumbling window, and create a catch-all schema.
3.2) Union all the streams.
3.3) Run mapWithState on the unioned stream with a function that actually does the join and marks the timeout.
There is a great example of a stateful join from Databricks (Spark 2.2.0):
https://www.youtube.com/watch?time_continue=1858&v=JAb4FIheP28
Below is sample code that is running/tested.
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> brokers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> groupId,
"session.timeout.ms" -> "30000"
)
//Kafka xTopic DStream
val kafkaStreamLeft = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](leftTopic.split(",").toSet, kafkaParams)
).map(record => {
val msg: xTopic = gson.fromJson(record.value(), classOf[xTopic])
Unioned(Some(msg), None, if (msg.sessionId != null) msg.sessionId.toString else "")
}).window(Minutes(leftWindow),Minutes(leftWindow))
//Kafka yTopic DStream
val kafkaStreamRight = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](rightTopic.split(",").toSet, kafkaParams)
).map(record => {
val msg: yTopic = gson.fromJson(record.value(), classOf[yTopic])
Unioned(None, Some(msg), if (msg.sessionId != null) msg.sessionId.toString else "")
}).window(Minutes(rightWindow),Minutes(rightWindow))
//convert stream to key, value pair and filter empty session id.
val unionStream = kafkaStreamLeft.union(kafkaStreamRight).map(record =>(record.sessionId,record))
.filter(record => !record._1.toString.isEmpty)
val stateSpec = StateSpec.function(stateUpdateF).timeout(Minutes(timeout.toInt))
unionStream.mapWithState(stateSpec).foreachRDD(rdd => {
try{
if(!rdd.isEmpty()) rdd.foreachPartition(partition =>{
val props = new util.HashMap[String, Object]()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
//send to kafka result JSON.
partition.foreach(record => {
if(record!=null && !"".equals(record) && !"()".equals(record.toString) && !"None".equals(record.toString) ){
producer.send(new ProducerRecord[String, String](outTopic, null, gson.toJson(record)))
}
})
producer.close()
})
}catch {
case e: Exception => {
logger.error(s"error joining topics ${leftTopic} ${rightTopic} to out topic ${outTopic}", e)
}
}})
//mapWithState function that is called for each key occurrence, with the new items in newItemValues and the state item if it exists.
def stateUpdateF = (keySessionId:String,newItemValues:Option[Unioned],state:State[Unioned])=> {
val currentState = state.getOption().getOrElse(Unioned(None,None,keySessionId))
val newVal:Unioned = newItemValues match {
case Some(newItemValue) => {
if (newItemValue.yTopic.isDefined)
Unioned(if(newItemValue.xTopic.isDefined) newItemValue.xTopic else currentState.xTopic,newItemValue.yTopic,keySessionId)
else if (newItemValue.xTopic.isDefined)
Unioned(newItemValue.xTopic, if(currentState.yTopic.isDefined)currentState.yTopic else newItemValue.yTopic,keySessionId)
else newItemValue
}
case _ => currentState //if None = timeout => currentState
}
val processTs = LocalDateTime.now()
val processDate = dtf.format(processTs)
if(newVal.xTopic.isDefined && newVal.yTopic.isDefined){//if we have a join remove from state
state.remove()
JoinState(newVal.sessionId,newVal.xTopic,newVal.yTopic,false,processTs.toInstant(ZoneOffset.UTC).toEpochMilli,processDate)
}else if(state.isTimingOut()){//timed out: do not try to remove the state manually, it is removed automatically.
JoinState(newVal.sessionId, newVal.xTopic, newVal.yTopic,true,processTs.toInstant(ZoneOffset.UTC).toEpochMilli,processDate)
}else{
state.update(newVal)
}
}
//case classes for the Kafka topic data (x, y topics); the join is on the session id field.
case class xTopic(sessionId:String,param1:String,param2:String,sessionCreationDate:String)
case class yTopic(sessionId:Long,clientTimestamp:String)
//catch-all schema: object that contains the fields of both Kafka input topics plus the key value for the join.
case class Unioned(xTopic:Option[xTopic],yTopic:Option[yTopic],sessionId:String)
//case class for the output of the stateful join function.
case class JoinState(sessionId:String, xTopic:Option[xTopic],yTopic:Option[yTopic],isTimeOut:Boolean,processTs:Long,processDate:String)
I would be happy to get some review.
Sorry for the long post.
I was under the impression this use case is solved by the Sessionization API:
StructuredSessionization.scala
And Stateful Operations in Structured Streaming
Or am I missing something?
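For reference, here is a rough, untested sketch of what this could look like with flatMapGroupsWithState in Structured Streaming (Spark 2.2+, requires the spark-sql-kafka source). It reuses the Unioned and JoinState case classes from the post above; spark (a SparkSession), brokers, leftTopic, rightTopic and the parseX/parseY parsers are assumed to exist:
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

// Read one Kafka topic as a Dataset[String] of message values.
def readTopic(topic: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
    .as[String]

// parseX/parseY are hypothetical parsers from the JSON value to Unioned records.
val xDs = readTopic(leftTopic).map(parseX)   // Unioned(Some(x), None, sessionId)
val yDs = readTopic(rightTopic).map(parseY)  // Unioned(None, Some(y), sessionId)

def toJoinState(u: Unioned, timedOut: Boolean): JoinState = {
  val now = LocalDateTime.now()
  JoinState(u.sessionId, u.xTopic, u.yTopic, timedOut,
    now.toInstant(ZoneOffset.UTC).toEpochMilli, now.toString)
}

// Called per session id; emits a JoinState when both sides have arrived, or on timeout.
def joinOrTimeout(sessionId: String,
                  events: Iterator[Unioned],
                  state: GroupState[Unioned]): Iterator[JoinState] = {
  if (state.hasTimedOut) {
    val result = toJoinState(state.get, timedOut = true)
    state.remove()
    Iterator(result)
  } else {
    val merged = events.foldLeft(state.getOption.getOrElse(Unioned(None, None, sessionId))) {
      (acc, e) => Unioned(e.xTopic.orElse(acc.xTopic), e.yTopic.orElse(acc.yTopic), sessionId)
    }
    if (merged.xTopic.isDefined && merged.yTopic.isDefined) {
      state.remove()
      Iterator(toJoinState(merged, timedOut = false))
    } else {
      state.update(merged)
      state.setTimeoutDuration("30 minutes")
      Iterator.empty
    }
  }
}

val joined = xDs.union(yDs)
  .groupByKey(_.sessionId)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout)(joinOrTimeout)
// joined is a streaming Dataset[JoinState]; it can be serialized to JSON and written
// back to Kafka with writeStream, analogous to the producer code in the post.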
I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it, foreachRDD happens at the driver level? So basically this whole block of code:
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current dataframe to yield a new dataframe, and then calls another method (in another jar) which stores the dataframe to Cassandra.
If this is all happening at the driver level, then we are not utilizing the Spark executors. How can I make sure the executors are used, so that each partition of the RDD is processed in parallel?
My spark streaming service run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. But, since we're operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we can't tell you about the locality of its execution.
The foreachRDD body may run locally, but that is just the setup. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect creates a local collection, this foreach runs on the driver
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
//parallelize is invoked on the driver, but it creates a distributed collection
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
//This will redistribute the work
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
//again, setting this up locally, to be run distributed
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you can probably rewrite this so that it does not need the collect and keeps everything distributed, but that is for you to do, not StackOverflow.
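For what it's worth, one possible direction is sketched below (untested, and not a drop-in replacement, since processBasedOnOpType expects a DataFrame per row): do the record parsing inside RDD operations so the string splitting runs on the executors, and only group by the metadata columns afterwards.
// Parse each messageContent on the executors instead of collecting the rows to the driver.
val parsed = df
  .select("customer", "tableName", "messageContent", "initialLoadRunning")
  .rdd
  .flatMap(rowD => {
    val records = rowD.getString(2).split(recordDelimiter)
    val headers = records.head.split("\u0001")
    records.tail.map(rec =>
      // key by the metadata columns, value is a header -> field map for one record
      ((rowD.getString(0), rowD.getString(1)), headers.zip(rec.split("\u0001")).toMap))
  })
From there you would still have to decide, per (customer, tableName) group, how to feed the records into the Cassandra write path; if that path really needs a DataFrame per table, some driver-side coordination will remain.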
I am using Spark to read and analyse a data file; the file contains data like the following:
1,unit1,category1_1,100
2,unit1,category1_2,150
3,unit2,category2_1,200
4,unit3,category3_1,200
5,unit3,category3_2,300
The file contains around 20 million records. If the user inputs a unit or a category, Spark needs to filter the data by inputUnit or inputCategory.
Solution 1:
sc.textFile(file).map(line => {
  val Array(id, unit, category, amount) = line.split(",")
  if ((StringUtils.isNotBlank(inputUnit) && unit != inputUnit) ||
      (StringUtils.isNotBlank(inputCategory) && category != inputCategory)) {
    null
  } else {
    val obj = new MyObj(id, unit, category, amount)
    (id, obj)
  }
}).filter(_ != null).collectAsMap()
Solution 2:
var rdd = sc.textFile(file).map(line => {
  val Array(id, unit, category, amount) = line.split(",")
  (id, unit, category, amount)
})
if (StringUtils.isNotBlank(inputUnit)) {
  rdd = rdd.filter(_._2 == inputUnit)
}
if (StringUtils.isNotBlank(inputCategory)) {
  rdd = rdd.filter(_._3 == inputCategory)
}
rdd.map(e => {
  val obj = new MyObj(e._1, e._2, e._3, e._4)
  (e._1, obj)
}).collectAsMap
I want to understand which solution is better, or whether both of them are poor. If both are poor, how do I write a good one? Personally, I think the second one is better, but I am not quite sure whether it is good practice to declare an RDD as a var. (I am new to Spark; I am using Spark 1.5.0 and Scala 2.10.4 to write the code. This is my first time asking a question on StackOverflow, feel free to edit if it is not well formatted.) Thanks.
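(For comparison, here is an untested sketch of a variant that avoids both the nulls of solution 1 and the var of solution 2, by folding the optional conditions into a single filter:)
val result = sc.textFile(file)
  .map(line => {
    val Array(id, unit, category, amount) = line.split(",")
    (id, unit, category, amount)
  })
  .filter { case (_, unit, category, _) =>
    // each optional condition only rejects a record when its input is non-blank and does not match
    (StringUtils.isBlank(inputUnit) || unit == inputUnit) &&
    (StringUtils.isBlank(inputCategory) || category == inputCategory)
  }
  .map { case (id, unit, category, amount) => (id, new MyObj(id, unit, category, amount)) }
  .collectAsMap()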
I have the following code snippet in which the reduceByKey doesn't seem to work.
val myKafkaMessageStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topicsSet, kafkaParams)
)
myKafkaMessageStream
.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
val myIter = rdd.mapPartitionsWithIndex { (i, iter) =>
val offset = offsetRanges(i)
iter.map(item => {
(offset.fromOffset, offset.untilOffset, offset.topic, offset.partition, item)
})
}
val myRDD = myIter.filter( (<filter_condition>) ).map(row => {
//Process row
((field1, field2, field3) , (field4, field5))
})
val result = myRDD.reduceByKey((a,b) => (a._1+b._1, a._2+b._2))
result.foreachPartition { partitionOfRecords =>
//I don't get the reduced result here
val connection = createNewConnection()
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
}
}
Am I missing something?
In a streaming situation, it makes more sense to me to use reduceByKeyAndWindow which does what you're looking for, but over a specific time frame.
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
"When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks."
http://spark.apache.org/docs/latest/streaming-programming-guide.html
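To sketch how that could fit the code in the question (untested; <filter_condition> and field1..field5 are the placeholders from the question, and the reduce below assumes field4 and field5 are Longs), the keying is moved out of foreachRDD onto the DStream itself, and the window reduce happens before the records are written out:
// Capture the offsets inside transform, where the underlying RDD is still the Kafka RDD,
// then key the records and reduce them over a sliding window at the DStream level.
val pairs = myKafkaMessageStream
  .transform { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.mapPartitionsWithIndex { (i, iter) =>
      val offset = offsetRanges(i)
      iter.map(item => (offset.fromOffset, offset.untilOffset, offset.topic, offset.partition, item))
    }
  }
  .filter(<filter_condition>)
  .map(row => ((field1, field2, field3), (field4, field5)))

// Reduce the last 30 seconds of data every 10 seconds (adjust both durations to your needs).
val reduced = pairs.reduceByKeyAndWindow(
  (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2),
  Seconds(30), Seconds(10))

reduced.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}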
I am using Spark Streaming along with RabbitMQ, so the streaming job fetches data from RabbitMQ and applies some transformations and actions. I want to know how to apply multiple actions (i.e. calculate two different feature sets) on the same stream. Is that possible? If yes, how do I pass the streaming object to multiple classes, as shown in the code below?
val config = ConfigFactory.parseFile(new File("SparkStreaming.conf"))
val conf = new SparkConf(true).setAppName(config.getString("AppName"))
conf.set("spark.cleaner.ttl", "120000")
val sparkConf = new SparkContext(conf)
val ssc = new StreamingContext(sparkConf, Seconds(config.getLong("SparkBatchInterval")))
val rabbitParams = Map("storageLevel" -> "MEMORY_AND_DISK_SER_2","queueName" -> config.getString("RealTimeQueueName"),"host" -> config.getString("QueueHost"), "exchangeName" -> config.getString("QueueExchangeName"), "routingKeys" -> config.getString("QueueRoutingKey"))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
receiverStream.start()
How do I process the stream from here:
val objProcessFeatureSet1 = new ProcessFeatureSet1(Some_Streaming_Object)
val objProcessFeatureSet2 = new ProcessFeatureSet2(Some_Streaming_Object)
ssc.start()
ssc.awaitTermination()
You can run multiple actions on the same DStream, as shown below:
import net.minidev.json.JSONValue
import net.minidev.json.JSONObject
val config = ConfigFactory.parseFile(new File("SparkStreaming.conf"))
val conf = new SparkConf(true).setAppName(config.getString("AppName"))
conf.set("spark.cleaner.ttl", "120000")
val sparkConf = new SparkContext(conf)
val ssc = new StreamingContext(sparkConf, Seconds(config.getLong("SparkBatchInterval")))
val rabbitParams = Map("storageLevel" -> "MEMORY_AND_DISK_SER_2","queueName" -> config.getString("RealTimeQueueName"),"host" -> config.getString("QueueHost"), "exchangeName" -> config.getString("QueueExchangeName"), "routingKeys" -> config.getString("QueueRoutingKey"))
val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
val jsonStream = receiverStream.map(byteData => {
  JSONValue.parse(byteData).asInstanceOf[JSONObject]
})
jsonStream.filter(json => {
  val customerType = json.get("customerType")
  customerType.equals("consumer")
}).foreachRDD(rdd => {
  rdd.foreach(json => {
    println("json " + json)
  })
})

jsonStream.filter(json => {
  val customerType = json.get("customerType")
  customerType.equals("non-consumer")
}).foreachRDD(rdd => {
  rdd.foreach(json => {
    println("json " + json)
  })
})
ssc.start()
ssc.awaitTermination()
In the code snippet above, I first create jsonStream from the received stream, then create two different streams from it based on the customer type, and then apply (foreachRDD) actions on them to print the results.
In a similar way, you can pass the same DStream to two different classes and apply the transformations and actions inside them to calculate the different feature sets.
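For example, passing the stream into the two classes could look like this (a minimal sketch; the ProcessFeatureSet1/ProcessFeatureSet2 shapes are hypothetical, they just register their own transformations and actions on the DStream they receive before ssc.start() is called):
import org.apache.spark.streaming.dstream.DStream

// Hypothetical classes: each one wires its own feature-set computation onto the stream.
class ProcessFeatureSet1(stream: DStream[JSONObject]) {
  stream.filter(json => "consumer".equals(json.get("customerType")))
    .foreachRDD(rdd => rdd.foreach(json => println("feature set 1: " + json)))
}

class ProcessFeatureSet2(stream: DStream[JSONObject]) {
  stream.filter(json => "non-consumer".equals(json.get("customerType")))
    .foreachRDD(rdd => rdd.foreach(json => println("feature set 2: " + json)))
}

val objProcessFeatureSet1 = new ProcessFeatureSet1(jsonStream)
val objProcessFeatureSet2 = new ProcessFeatureSet2(jsonStream)

ssc.start()
ssc.awaitTermination()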
I hope the above explanation helps you resolve the issue.
Thanks,
Hokam