Can we share Spark Streaming state between two DStreams?
Basically, I want to create/update state using the first stream and enrich the second stream using that state.
Example: I have modified the StatefulNetworkWordCount example. I create state using the first stream and enrich the second stream with the counts from the first stream.
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))
val mappingFuncForFirstStream = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
val output = (word, sum)
state.update(sum)
Some(output)
}
val mappingFuncForSecondStream = (batchTime: Time, word: String, one: Option[Int], state: State[Int]) => {
val sum = state.getOption.getOrElse(0)
val output = (word, sum)
Some(output)
}
// first stream
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
.flatMap(r=>r._2.split(" "))
.map(x => (x, 1))
.mapWithState(StateSpec.function(mappingFuncForFirstStream).initialState(initialRDD).timeout(Minutes(10)))
.print(1)
// second stream
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams2, mergeTopicSet)
.flatMap(r=>r._2.split(" "))
.map(x => (x, 1))
.mapWithState(StateSpec.function(mappingFuncForSecondStream).initialState(initialRDD).timeout(Minutes(10)))
.print(50)
In the checkpointing directory, I can see two different state RDDs.
I am using Spark 1.6.1 and Kafka 0.8.2.1.
It's possible to access the underlying state of the DStream resulting from applying the mapWithState operation by using stateMappedDStream.stateSnapshots()
So, inspired by your example:
val firstDStream = ???
val secondDStream = ???
val firstDStreamSMapped = firstDStream.mapWithState(...)
val firstStreamState = firstDStreamSMapped.stateSnapshots()
// we want to use the state of Stream 1 to enrich Stream 2. The keys of both streams are required to match.
val enrichedStream = secondDStream.join(firstStreamState)
... do stuff with enrichedStream ...
This method may be helpful for you:
ssc.union(Seq[DStream[T]])
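For what it's worth, a rough sketch of how that union idea could give both inputs one shared state (firstPairStream, secondPairStream and combinedMappingFunc are assumed names, not from the question):
// tag each record with its source so a single mapping function can both update and read the state
val tagged1 = firstPairStream.map { case (word, count) => (word, ("first", count)) }
val tagged2 = secondPairStream.map { case (word, count) => (word, ("second", count)) }
// merge both inputs into one DStream and run a single mapWithState over it
val merged = ssc.union(Seq(tagged1, tagged2))
val enriched = merged.mapWithState(
  StateSpec.function(combinedMappingFunc).initialState(initialRDD).timeout(Minutes(10)))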
I have a data enrichment job: I enrich my data from a data source (one Kafka topic) and then publish it to another sink (a Kafka topic) after processing.
The job itself is I/O bound, and the processing speed does not increase linearly when I add more CPU/memory.
Should I scale the job horizontally with two or three instances and let them consume the same Kafka topic with the same groupId?
The difficulty is that my code currently maintains the processing offsets in the Kafka broker. So I would have to add an extra parameter to the code below to define which partitions to load into the RDD, and let the higher level decide which offsets it needs to read (see the sketch after the code). And this does not look that good.
private def loadRdd[T:ClassTag](maxMessages: Long = 0, messageFormatter: ((String, String)) => T)
(implicit inputConfig: Config): (RDD[T], Unit => Unit, Boolean) = {
val brokersConnectionString = Try(inputConfig.getString("brokersConnectionString")).getOrElse(throw new RuntimeException("Fail to retrieve the broker connection string."))
val topic = inputConfig.getString("topic")
val groupId = inputConfig.getString("groupId")
val retriesAttempts = Try(inputConfig.getInt("retries.attempts")).getOrElse(SparkKafkaProviderUtilsFunctions.DEFAULT_RETRY_ATTEMPTS)
val retriesDelay = Try(inputConfig.getInt("retries.delay")).getOrElse(SparkKafkaProviderUtilsFunctions.DEFAULT_RETRY_DELAY) * 1000
val topicOffsetRanges = KafkaClusterUtils.getTopicOffsetRanges(inputConfig, topic, SparkKafkaProviderUtilsFunctions.getDebugLogger(inputConfig)).toList
.map { case (partitionId, (minOffset, maxOffset)) => OffsetRange(topic, partitionId, minOffset, maxOffset) }.toArray
val (offsetRanges, readAllAvailableMessages) = restrictOffsetRanges(topicOffsetRanges, maxMessages)
val rdd: RDD[ConsumerRecord[String, String]] = RetryUtils.retryOrDie(retriesAttempts, retryDelay = retriesDelay, loopFn = {SparkLogger.warn("Failed to create Spark RDD, retrying...")},
failureFn = { SparkLogger.warn("Failed to create Spark RDD, giving up...")})(
KafkaUtils.createRDD(sc, KafkaClusterUtils.getKafkaConsumerParameters(brokersConnectionString, groupId), offsetRanges, LocationStrategies.PreferConsistent))
(rdd.map(pair => messageFormatter(pair.key(), pair.value())), Unit => commitOffsets(offsetRanges, inputConfig), readAllAvailableMessages)
}
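For illustration, a minimal sketch of that extra partition parameter (assignedPartitions is a hypothetical parameter, not part of the current code) might look like this:
// hypothetical: keep only the offset ranges of the partitions assigned to this job instance
def restrictToAssignedPartitions(ranges: Array[OffsetRange], assignedPartitions: Set[Int]): Array[OffsetRange] =
  ranges.filter(range => assignedPartitions.contains(range.partition))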
Should I proceed in this direction and let the higher level decide which partitions belong to which job instance? Or should I simply scale up by adding more CPU and memory?
Thanks!
I have 3 RDDs:
The 1st one is of the form ((a,b),c).
The 2nd one is of the form (b,d).
The 3rd one is of the form (a,e).
How can I perform a join in Scala over these RDDs such that my final output is of the form ((a,b),c,d,e)?
You can do something like this:
val rdd1: RDD[((A,B),C)]
val rdd2: RDD[(B,D)]
val rdd3: RDD[(A,E)]
// re-key rdd1 by `a` so it can be joined with rdd3
val tmp1 = rdd1.map { case ((a, b), c) => (a, (b, c)) }
// join on `a`, then re-key by `b` for the join with rdd2
val tmp2 = tmp1.join(rdd3).map { case (a, ((b, c), e)) => (b, (a, c, e)) }
// join on `b` and reshape into the final ((a,b),c,d,e) form
val res = tmp2.join(rdd2).map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) }
With the current implementations of the join APIs for paired RDDs, it's not possible to use conditions, and you would need conditions in the joins to get the desired result.
But you can use DataFrames/Datasets for the joins, where you can use conditions. If you want the result of the join as a DataFrame, you can proceed with that. If you want the result as RDDs, you can use .rdd to convert the DataFrame/Dataset to an RDD[Row].
Below is sample code showing how it can be done in Scala.
//creating three rdds
val first = sc.parallelize(Seq((("a", "b"), "c")))
val second = sc.parallelize(Seq(("b", "d")))
val third = sc.parallelize(Seq(("a", "e")))
//converting rdds to dataframes (toDF needs the implicits from an active SparkSession, e.g. in spark-shell)
import spark.implicits._
val firstdf = first.toDF("key1", "value1")
val seconddf = second.toDF("key2", "value2")
val thirddf = third.toDF("key3", "value3")
//udf function for the join condition
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def joinCondition = udf((strct: Row, key: String) => strct.toSeq.contains(key))
//joins with conditions
firstdf
.join(seconddf, joinCondition(firstdf("key1"), seconddf("key2"))) //joining first with second
.join(thirddf, joinCondition(firstdf("key1"), thirddf("key3"))) //joining first with third
.drop("key2", "key3") //dropping unnecessary columns
.rdd //converting dataframe to rdd
You should have output as
[[a,b],c,d,e]
I am using Spark Kafka Integration 0.10 and I need two levels of aggregations on the stream:
The first one is on a per-minute interval.
The other is a sum over a 15-minute interval.
The preference is also to accumulate the one-minute values and then reset them when the 15 minutes are over, because the 15-minute values should be persisted.
Having two reduceByKeyAndWindow calls on different sliding windows does not work, as it gives a ConcurrentModificationException from the Kafka consumer.
tl;dr It seems to work. Please provide an example that fails.
I am using Spark 2.0.2 (that was released today).
My example is as follows (with some code removed for brevity):
val ssc = new StreamingContext(sc, Seconds(10))
import org.apache.spark.streaming.kafka010._
val dstream = KafkaUtils.createDirectStream[String, String](
ssc,
preferredHosts,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))
def reduceFunc(v1: String, v2: String) = s"$v1 + $v2"
dstream.map { r =>
println(s"value: ${r.value}")
val Array(key, value) = r.value.split("\\s+")
println(s">>> key = $key")
println(s">>> value = $value")
(key, value)
}.reduceByKeyAndWindow(
reduceFunc, windowDuration = Seconds(30), slideDuration = Seconds(10))
.print()
dstream.foreachRDD { rdd =>
// Get the offset ranges in the RDD
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
for (o <- offsetRanges) {
println(s"${o.topic} ${o.partition} offsets: ${o.fromOffset} to ${o.untilOffset}")
}
}
ssc.start
What would you change to see the exception(s) you're experiencing?
The entire project is available as spark-streaming-kafka-direct.
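For the two-level aggregation described in the question, the setup I would expect to work looks roughly like this (a sketch reusing the mapped (key, value) pairs and reduceFunc from above; the 15-minute branch could then be persisted from its own foreachRDD):
import org.apache.spark.streaming.Minutes
val pairs = dstream.map { r => val Array(key, value) = r.value.split("\\s+"); (key, value) }
// per-minute aggregation
pairs.reduceByKeyAndWindow(reduceFunc, windowDuration = Minutes(1), slideDuration = Minutes(1)).print()
// independent 15-minute aggregation over the same stream
pairs.reduceByKeyAndWindow(reduceFunc, windowDuration = Minutes(15), slideDuration = Minutes(15)).print()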
I use Spark Streaming to receive data from Kafka like this:
val conf = new SparkConf()
conf.setMaster("local[*]").setAppName("KafkaStreamExample")
.setSparkHome("/home/kufu/spark/spark-1.5.2-bin-hadoop2.6")
.setExecutorEnv("spark.executor.extraClassPath","target/scala-2.11/sparkstreamexamples_2.11-1.0.jar")
val threadNum = 3
val ssc = new StreamingContext(conf, Seconds(2))
val topicMap = Map(consumeTopic -> 1)
val dataRDDs:IndexedSeq[InputDStream[(String, String)]] = approachType match {
case KafkaStreamJob.ReceiverBasedApproach =>
(1 to threadNum).map(_=>
KafkaUtils.createStream(ssc, zkOrBrokers, "testKafkaGroupId", topicMap))
case KafkaStreamJob.DirectApproach =>
(1 to threadNum).map(_=>
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, Map[String, String]("metadata.broker.list" -> zkOrBrokers),
Set[String](consumeTopic)))
}
//dataRDDs.foreach(_.foreachRDD(genProcessing(approachType)))
val dataRDD = ssc.union(dataRDDs)
dataRDD.foreachRDD(genProcessing(approachType))
ssc.start()
ssc.awaitTermination()
genProcessing generates a function to process the data, which takes 5s (it sleeps for 5s). The code is like this:
def eachRDDProcessing(rdd:RDD[(String, String)]):Unit = {
if(count>max) throw new Exception("Stop here")
println("--------- num: "+count+" ---------")
val batchNum = count
val curTime = System.currentTimeMillis()
Thread.sleep(5000)
val family = approachType match{
case KafkaStreamJob.DirectApproach => KafkaStreamJob.DirectFamily
case KafkaStreamJob.ReceiverBasedApproach => KafkaStreamJob.NormalFamily
}
val families = KafkaStreamJob.DirectFamily :: KafkaStreamJob.NormalFamily :: Nil
val time = System.currentTimeMillis().toString
val messageCount = rdd.count()
rdd.foreach(tuple => {
val hBaseConn = new HBaseConnection(KafkaStreamJob.rawDataTable,
KafkaStreamJob.zookeeper, families)
hBaseConn.openOrCreateTable()
val puts = new java.util.ArrayList[Put]()
val strs = tuple._2.split(":")
val row = strs(1) + ":" + strs(0) + ":" + time
val put = new Put(Bytes.toBytes(row))
put.add(Bytes.toBytes(family), Bytes.toBytes(KafkaStreamJob.tableQualifier),
Bytes.toBytes("batch " + batchNum.toString + ":" + strs(1)))
puts.add(put)
hBaseConn.puts(puts)
hBaseConn.close()
})
count+=1
println("--------- add "+messageCount+" messages ---------")
}
eachRDDProcessing
But Spark Streaming doesn't start multiple threads. Tasks were processed one by one, and each task took around 5s. My machine has 8 cores, and Spark runs on one node.
I don't think Spark Streaming will start threads, especially on the driver. The point is that if you have multiple nodes, your genProcessing will run on different nodes.
Further, if you call rdd.foreachPartition(...), it should get better parallelism, as shown in the sketch below.
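A sketch of that foreachPartition variant, reusing the helpers from the question's code (HBaseConnection, Put, Bytes, families, family, time and batchNum all come from the snippet above):
rdd.foreachPartition { records =>
  // one HBase connection per partition instead of one per record
  val hBaseConn = new HBaseConnection(KafkaStreamJob.rawDataTable, KafkaStreamJob.zookeeper, families)
  hBaseConn.openOrCreateTable()
  val puts = new java.util.ArrayList[Put]()
  records.foreach { tuple =>
    val strs = tuple._2.split(":")
    val row = strs(1) + ":" + strs(0) + ":" + time
    val put = new Put(Bytes.toBytes(row))
    put.add(Bytes.toBytes(family), Bytes.toBytes(KafkaStreamJob.tableQualifier),
      Bytes.toBytes("batch " + batchNum.toString + ":" + strs(1)))
    puts.add(put)
  }
  // write the whole partition in one batch, then close the connection
  hBaseConn.puts(puts)
  hBaseConn.close()
}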
What is the best way to maintain application state in a spark streaming application?
I know of two ways:
Use the union operation to append to the lookup RDD and persist it after each union.
Save the state in a file or database and load it at the start of each batch.
My question is: from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)
import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
println(s">>> batchTime = $batchTime")
println(s">>> key = $key")
println(s">>> value = $value")
println(s">>> state = $state")
val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
state.update(sum)
Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
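To actually run the example, start the streaming context (a minimal completion of the snippet above):
ssc.start()
ssc.awaitTermination()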