What is the best way to maintain application state in a Spark Streaming application?
I know of two ways:
use a "union" operation to append to the lookup RDD and persist it after each union;
save the state in a file or database and load it at the start of each batch.
From a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
val rdd = sc.parallelize(0 to 9).map(n => (n, (n % 2).toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)
import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
println(s">>> batchTime = $batchTime")
println(s">>> key = $key")
println(s">>> value = $value")
println(s">>> state = $state")
val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
state.update(sum)
Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
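To actually see any output, the streaming context still has to be started; a minimal sketch (the timeout value is arbitrary):
ssc.start()
ssc.awaitTerminationOrTimeout(30 * 1000) // let a few batches run
ssc.stop(stopSparkContext = false)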
This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models
and
streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enrich each row with the prediction, and write to Kafka.
An obvious approach would be .withColumn, using a UDF to predict, and then writing with the Kafka sink; a rough sketch of that attempt follows.
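This is only a reconstruction of the failing approach; the column names beyond id1 and id2 and the key layout are guesses based on the sink code below:
import org.apache.spark.sql.functions.{col, concat_ws, udf}
// hypothetical UDF: look up the row's model and evaluate it
val predictUdf = udf { (key: String, t: Double) =>
  modelsMap(key).estimate(Array(t))(0) // estimate() runs an RDD aggregate, hence SPARK-5063 inside a task
}
val withPrediction = streamingData.withColumn(
  "prediction",
  predictUdf(concat_ws("/", col("id1"), col("id2"), col("id3")), col("time_0")))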
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I get the same error with a custom sink that implements ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
class customSink(topic: String, servers: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")

  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    producer = new KafkaProducer(kafkaProperties)
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    var prediction = Double.NaN // stays NaN when no model matches the key
    try {
      val id1 = value(0)
      val id2 = value(3)
      val id3 = value(5)
      val time_0 = value(6).asInstanceOf[Double]
      val key = f"$id1/$id2/$id3"
      println(s"Looking up key: $key")
      val model = modelsMap(key)
      prediction = model.estimate(Array[Double](time_0))(0)
      println(prediction)
    } catch {
      case e: NoSuchElementException =>
        println(prediction)
    }
    producer.send(new ProducerRecord(topic, value.mkString(",") + "," + prediction.toString))
  }

  def close(errorOrNull: Throwable): Unit = {
    producer.close()
  }
}
val writer = new customSink("<topic>", "<broker>")
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.Trigger

val query = streamingData
  .writeStream
  .foreach(writer)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around it.
What changes do I make? (I could collect each batch and execute a for loop using driver, but then I'm not using spark the way it's intended)
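Alternatively, since estimate only aggregates over the model's sample, I could collect each sample to the driver up front and evaluate a Gaussian KDE locally; a sketch (localKdeEstimate is a hypothetical helper, not part of mllib):
// local evaluation of a Gaussian kernel density estimate, given a sample
// already collected to the driver and a fixed bandwidth
def localKdeEstimate(sample: Array[Double], bandwidth: Double)(x: Double): Double = {
  val norm = 1.0 / (sample.length * bandwidth * math.sqrt(2 * math.Pi))
  sample.map(s => math.exp(-0.5 * math.pow((x - s) / bandwidth, 2))).sum * norm
}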
References:
[1] https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose (slide 11 mentions the limitations)
[2] https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
[3] https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
[4] https://issues.apache.org/jira/browse/SPARK-16454
[5] https://issues.apache.org/jira/browse/SPARK-16407
I am using Spark Kafka Integration 0.10 and I need two levels of aggregation on the stream:
The first one is per one-minute interval.
The other is to sum over a 15-minute interval.
The preference is also to accumulate the one-minute values and then reset them when the 15 minutes are over, because the 15-minute values should be persisted.
Having two reduceByKeyAndWindow calls on different sliding windows does not work, as it throws a ConcurrentModificationException from the Kafka consumer. A sketch of what I'm after follows.
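This assumes a DStream of (String, Long) pairs named counts; the names and the print calls are illustrative only:
import org.apache.spark.streaming.Seconds
// level 1: per-minute totals, emitted once a minute
val perMinute = counts.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(60))
// level 2: 15-minute totals over the same stream, emitted every 15 minutes
val perQuarter = counts.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(900), Seconds(900))
perMinute.print()
perQuarter.print()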
tl;dr It seems to work. Please provide an example that fails.
I am using Spark 2.0.2 (which was released today).
My example is as follows (with some code removed for brevity):
val ssc = new StreamingContext(sc, Seconds(10))
import org.apache.spark.streaming.kafka010._
val dstream = KafkaUtils.createDirectStream[String, String](
ssc,
preferredHosts,
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))
def reduceFunc(v1: String, v2: String) = s"$v1 + $v2"
dstream.map { r =>
println(s"value: ${r.value}")
val Array(key, value) = r.value.split("\\s+")
println(s">>> key = $key")
println(s">>> value = $value")
(key, value)
}.reduceByKeyAndWindow(
reduceFunc, windowDuration = Seconds(30), slideDuration = Seconds(10))
.print()
dstream.foreachRDD { rdd =>
// Get the offset ranges in the RDD
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
for (o <- offsetRanges) {
println(s"${o.topic} ${o.partition} offsets: ${o.fromOffset} to ${o.untilOffset}")
}
}
ssc.start()
What would you change to see the exception(s) you're experiencing?
The entire project is available as spark-streaming-kafka-direct.
I have a question regarding the mechanics of Spark's DataFrameReader. I would appreciate it if anybody could help me. Let me explain the scenario here.
I am creating a DataFrame from a DStream like this (this is the input data):
var config = new HashMap[String, String]()
config += ("zookeeper.connect" -> zookeeper)
config += ("partition.assignment.strategy" -> "roundrobin")
config += ("bootstrap.servers" -> broker)
config += ("serializer.class" -> "kafka.serializer.DefaultEncoder")
config += ("group.id" -> "default")
val lines = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, config.toMap, Set(topic)).map(_._2)
lines.foreachRDD { rdd =>
if(!rdd.isEmpty()){
val rddJson = rdd.map { x => MyFunctions.mapToJson(x) }
val sqlContext = SQLContextSingleton.getInstance(ssc.sparkContext)
val rddDF = sqlContext.read.json(rddJson)
rddDF.registerTempTable("inputData")
val dbDF = ReadDataFrameHelper.readDataFrameHelperFromDB(sqlContext, jdbcUrl, "ABCD","A",numOfPartiton,lowerBound,upperBound)
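// ... (still inside foreachRDD; the join shown below happens here)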
Here is the code of ReadDataFrameHelper:
def readDataFrameHelperFromDB(sqlContext: HiveContext, jdbcUrl: String, dbTableOrQuery: String,
    columnToPartition: String, numOfPartiton: Int, lowerBound: Int, highBound: Int): DataFrame = {
val jdbcDF = sqlContext.read.jdbc(url = jdbcUrl, table = dbTableOrQuery,
columnName = columnToPartition,
lowerBound = lowerBound,
upperBound = highBound,
numPartitions = numOfPartiton,
connectionProperties = new java.util.Properties()
)
jdbcDF
}
Lastly, I am doing a join like this:
val joinedData = rddDF.join(dbDF,rddDF("ID") === dbDF("ID")
&& rddDF("CODE") === dbDF("CODE"),"left_outer")
.drop(dbDF("code"))
.drop(dbDF("id"))
.drop(dbDF("number"))
.drop(dbDF("key"))
.drop(dbDF("loaddate"))
.drop(dbDF("fid"))
joinedData.show()
My input DStream will have 1000 rows, and the database table contains millions of rows. So when I do this join, will Spark load all the rows from the database, or will it read only the specific rows from the DB whose CODE and ID appear in the input DStream?
As zero323 specified, I have also confirmed that the data is read in full from the table. I checked the DB session logs and saw that the whole dataset gets loaded.
Thanks, zero323.
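For anyone hitting the same behaviour: one way to avoid loading the whole table is to push the filter into the JDBC source by passing a subquery as the table argument. A sketch; the WHERE clause and the alias t are assumptions and would have to be built from each batch's keys:
val keysInClause = "'C1','C2'" // hypothetical: derived from the batch's CODE values
val pushdownQuery = s"(SELECT * FROM ABCD WHERE CODE IN ($keysInClause)) AS t"
val dbDF = sqlContext.read.jdbc(url = jdbcUrl, table = pushdownQuery, properties = new java.util.Properties())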
I use Spark Streaming to receive data from Kafka like this:
val conf = new SparkConf()
conf.setMaster("local[*]").setAppName("KafkaStreamExample")
.setSparkHome("/home/kufu/spark/spark-1.5.2-bin-hadoop2.6")
.setExecutorEnv("spark.executor.extraClassPath","target/scala-2.11/sparkstreamexamples_2.11-1.0.jar")
val threadNum = 3
val ssc = new StreamingContext(conf, Seconds(2))
val topicMap = Map(consumeTopic -> 1)
val dataRDDs: IndexedSeq[InputDStream[(String, String)]] = approachType match {
case KafkaStreamJob.ReceiverBasedApproach =>
(1 to threadNum).map(_=>
KafkaUtils.createStream(ssc, zkOrBrokers, "testKafkaGroupId", topicMap))
case KafkaStreamJob.DirectApproach =>
(1 to threadNum).map(_=>
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, Map[String, String]("metadata.broker.list" -> zkOrBrokers),
Set[String](consumeTopic)))
}
//dataRDDs.foreach(_.foreachRDD(genProcessing(approachType)))
val dataRDD = ssc.union(dataRDDs)
dataRDD.foreachRDD(genProcessing(approachType))
ssc.start()
ssc.awaitTermination()
genProcessing returns a function to process the data; each invocation takes about 5 s (it sleeps for 5 s). The code looks like this:
def eachRDDProcessing(rdd: RDD[(String, String)]): Unit = {
if(count>max) throw new Exception("Stop here")
println("--------- num: "+count+" ---------")
val batchNum = count
val curTime = System.currentTimeMillis()
Thread.sleep(5000)
val family = approachType match{
case KafkaStreamJob.DirectApproach => KafkaStreamJob.DirectFamily
case KafkaStreamJob.ReceiverBasedApproach => KafkaStreamJob.NormalFamily
}
val families = KafkaStreamJob.DirectFamily :: KafkaStreamJob.NormalFamily :: Nil
val time = System.currentTimeMillis().toString
val messageCount = rdd.count()
rdd.foreach(tuple => {
val hBaseConn = new HBaseConnection(KafkaStreamJob.rawDataTable,
KafkaStreamJob.zookeeper, families)
hBaseConn.openOrCreateTable()
val puts = new java.util.ArrayList[Put]()
val strs = tuple._2.split(":")
val row = strs(1) + ":" + strs(0) + ":" + time
val put = new Put(Bytes.toBytes(row))
put.add(Bytes.toBytes(family), Bytes.toBytes(KafkaStreamJob.tableQualifier),
Bytes.toBytes("batch " + batchNum.toString + ":" + strs(1)))
puts.add(put)
hBaseConn.puts(puts)
hBaseConn.close()
})
count+=1
println("--------- add "+messageCount+" messages ---------")
}
eachRDDProcessing // genProcessing returns this function
But Spark Streaming doesn't process the batches in parallel: tasks were processed one by one, and each task took around 5 s. My machine has 8 cores, and Spark runs on one node.
I don't think Spark Streaming will start threads, especially on the driver. The point is that if you have multiple nodes, your genProcessing will run on different nodes.
Further, if you call rdd.foreachPartition(...), I suppose you should get better parallelism; a sketch follows.
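This reuses the names from the question and is untested; it opens one HBase connection per partition and batches the Puts instead of writing record by record:
rdd.foreachPartition { iter =>
  val hBaseConn = new HBaseConnection(KafkaStreamJob.rawDataTable,
    KafkaStreamJob.zookeeper, families)
  hBaseConn.openOrCreateTable()
  val puts = new java.util.ArrayList[Put]()
  iter.foreach { tuple =>
    val strs = tuple._2.split(":")
    val row = strs(1) + ":" + strs(0) + ":" + time
    val put = new Put(Bytes.toBytes(row))
    put.add(Bytes.toBytes(family), Bytes.toBytes(KafkaStreamJob.tableQualifier),
      Bytes.toBytes("batch " + batchNum.toString + ":" + strs(1)))
    puts.add(put)
  }
  hBaseConn.puts(puts) // one batched write per partition
  hBaseConn.close()
}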
I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
val sc = new SparkContext(conf)
val st = sc.parallelize(stList).map(st => ((st.productId,st.routeNo),st)).groupByKey()
val ic = sc.parallelize(icList).map(ic => ((ic.productId,ic.routeNo),ic)).groupByKey()
//TODO
//val result = st.join(ic).mapValues( )
sc.stop()
}
Here is what I want to do:
List[ST] -> map -> (Key, st) -> groupByKey -> (Key, List[st])
List[IC] -> map -> (Key, ic) -> groupByKey -> (Key, List[ic])
STRDD join ICRDD to get (Key, (List[st], List[ic]))
I have a function that compares listST and listIC and produces a List[result], where result contains information from both SubStTime and IcTemp:
def calcIcSt(st: List[SubStTime], ic: List[IcTemp]): List[result]
I don't know how to use mapValues or some other way to get my result.
Thanks
val result = st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }
Note that after groupByKey the values are Iterables, not Lists, hence the .toList conversions to match calcIcSt's signature.