Spark Streaming Kafka offset management - apache-spark

I have been writing Spark Streaming jobs that consume and produce data through Kafka. I use a direct DStream, so I have to manage the offsets myself; we adopted Redis to write and read them. Now there is one problem: when I launch my client, it needs to get the offsets from Redis, not the offsets that exist in Kafka itself. How should I write my code? So far I have written the code below:
kafka_stream = KafkaUtils.createDirectStream(
    ssc,
    topics=[config.CONSUME_TOPIC, ],
    kafkaParams={"bootstrap.servers": config.CONSUME_BROKERS,
                 "auto.offset.reset": "largest"},
    fromOffsets=read_offset_range(config.OFFSET_KEY))
But I think fromOffsets only takes the value (from Redis) at the moment the Spark Streaming client launches, not while it is running. Thank you for helping.
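For illustration, a minimal sketch of what a read_offset_range helper could look like, assuming the offsets are kept in Redis under one key per partition; the key layout, the Redis client, and the partition count are assumptions, not part of the original code:

import redis
from pyspark.streaming.kafka import TopicAndPartition

def read_offset_range(offset_key, topic=config.CONSUME_TOPIC, num_partitions=1):
    # Hypothetical key layout: "<offset_key>:<partition>" -> last saved offset
    r = redis.StrictRedis(host="localhost", port=6379)
    from_offsets = {}
    for partition in range(num_partitions):
        stored = r.get("{0}:{1}".format(offset_key, partition))
        # Fall back to offset 0 on the very first run, when Redis has no entry yet
        from_offsets[TopicAndPartition(topic, partition)] = int(stored) if stored else 0
    return from_offsets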

If I understand you correctly, you need to set your offsets manually. This is how I do it:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming.kafka import TopicAndPartition
stream = StreamingContext(sc, 120) # 120 second batch interval
kafkaParams = {"metadata.broker.list": "1:6667,2:6667,3:6667"}
kafkaParams["auto.offset.reset"] = "smallest"
kafkaParams["enable.auto.commit"] = "false"
topic = "xyz"
topicPartion = TopicAndPartition(topic, 0)
fromOffset = {topicPartion: long(PUT NUMERIC OFFSET HERE)}
kafka_stream = KafkaUtils.createDirectStream(stream, [topic], kafkaParams, fromOffsets = fromOffset)
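The snippet above only covers the initial offsets. To keep them current while the job runs (the question's concern), the end offset of each batch can also be written back, mirroring the ZooKeeper-based save_offsets in the related answer further down. A sketch, using the question's config.OFFSET_KEY and the same hypothetical Redis client and key layout as above:

import redis

r = redis.StrictRedis(host="localhost", port=6379)

def save_offsets(rdd):
    # offsetRanges() is only available on the RDD that comes straight from the direct stream
    for o in rdd.offsetRanges():
        r.set("{0}:{1}".format(config.OFFSET_KEY, o.partition), o.untilOffset)

kafka_stream.foreachRDD(save_offsets)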

Related

PySpark Kafka streaming offset

I got the code below from the link referenced below, which relates to Kafka topic offset streaming in PySpark:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming.kafka import TopicAndPartition
stream = StreamingContext(sc, 120) # 120 second batch interval
kafkaParams = {"metadata.broker.list": "1:6667,2:6667,3:6667"}
kafkaParams["auto.offset.reset"] = "smallest"
kafkaParams["enable.auto.commit"] = "false"
topic = "xyz"
topicPartion = TopicAndPartition(topic, 0)
fromOffset = {topicPartion: long(PUT NUMERIC OFFSET HERE)}
kafka_stream = KafkaUtils.createDirectStream(stream, [topic], kafkaParams,
                                             fromOffsets=fromOffset)
Reference link: Spark Streaming Kafka offset management
I do not understand what to provide below in case I have to read the last 15 minutes of data from Kafka for each window/batch:
fromOffset = {topicPartion: long(PUT NUMERIC OFFSET HERE)}
This field helps us manage something like a checkpoint. Managing offsets is most beneficial for achieving data continuity over the lifecycle of the stream process: upon shutting down the streaming application, or on an unexpected failure, offset ranges will be lost unless they are persisted in a non-volatile data store. Further, without the offsets of the partitions being read, the Spark Streaming job will not be able to continue processing data from where it last left off. The offsets can be handled in several ways.
One way is to store the offset values in ZooKeeper and read them back while creating the DStream.
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

ZOOKEEPER_SERVERS = "127.0.0.1:2181"

def get_zookeeper_instance():
    # Reuse a single Kazoo client across the whole driver process
    from kazoo.client import KazooClient
    if 'KazooSingletonInstance' not in globals():
        globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
        globals()['KazooSingletonInstance'].start()
    return globals()['KazooSingletonInstance']

def save_offsets(rdd):
    # Persist the end offset of the batch under /consumers/<topic>
    zk = get_zookeeper_instance()
    for offset in rdd.offsetRanges():
        path = f"/consumers/{var_topic_src_name}"
        print(path)
        zk.ensure_path(path)
        zk.set(path, str(offset.untilOffset).encode())

var_offset_path = f'/consumers/{var_topic_src_name}'

try:
    var_offset = int(zk.get(var_offset_path)[0])
except:
    print("The spark streaming started First Time and Offset value should be Zero")
    var_offset = 0

var_partition = 0
topicpartion = TopicAndPartition(var_topic_src_name, var_partition)
fromoffset = {topicpartion: var_offset}
print(fromoffset)

kvs = KafkaUtils.createDirectStream(ssc,
                                    [var_topic_src_name],
                                    var_kafka_parms_src,
                                    valueDecoder=serializer.decode_message,
                                    fromOffsets=fromoffset)

kvs.foreachRDD(handler)
kvs.foreachRDD(save_offsets)
Reference:
pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset

How to handle delayed events per group in spark

Spark's watermark feature comes in handy when it comes to delayed events, but I am not sure how to handle a scenario where the stream is generated by multiple devices in the field and some devices may be reporting their events a bit late. If we apply a watermark, the event-time watermark is maintained in Spark against all events, not against the groupBy fields, so Spark will drop all the events coming from devices that are running (syncing) late. What is the best way to handle such a scenario? I have modified the word count program from Spark Structured Streaming to demonstrate the issue.
import java.sql.Timestamp

import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

case class DeviceData(deviceId: String, value: Double, userId: String, timestamp: Timestamp)

object StructuredNetworkWordCountWindowed {

  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: StructuredNetworkWordCountWindowed <hostname> <port>" +
        " <window duration in seconds> [<slide duration in seconds>]")
      System.exit(1)
    }

    val host = args(0)
    val port = args(1).toInt
    val windowSize = args(2).toInt
    val slideSize = if (args.length == 3) windowSize else args(3).toInt
    if (slideSize > windowSize) {
      System.err.println("<slide duration> must be less than or equal to <window duration>")
    }
    val windowDuration = s"$windowSize seconds"
    val slideDuration = s"$slideSize seconds"

    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCountWindowed")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    val deviceDF: DataFrame = lines.as[String].map(_.split(","))
      .map(value => DeviceData(value(0), value(1).toDouble, value(2), new Timestamp(value(3).toLong)))
      .toDF()

    // Group the data by window and deviceId and compute the count of each group
    val windowedCounts = deviceDF
      .withWatermark("timestamp", "2 minutes")
      .groupBy(window($"timestamp", windowDuration, slideDuration), $"deviceId")
      .count()

    val query = windowedCounts.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
Here, if device1 is syncing in near real time while device2 lags by 5 minutes, the program will completely ignore the events from device2. Is there a way to apply the watermark per groupBy key rather than to the stream as a whole?

Spark streaming: batch interval vs window

I have a Spark Streaming application which consumes Kafka messages, and I want to process all messages that arrived in the last 10 minutes together.
It looks like there are two approaches to get the job done:
val ssc = new StreamingContext(new SparkConf(), Minutes(10))
val dstream = ....
and
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val dstream = ....
dstream.window(Minutes(10), Minutes(10))
and I just want to clarify whether there are any performance differences between them.

How to load history data when starting Spark Streaming process, and calculate running aggregations

I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, in order to have a current view of the user's total sales (in terms of revenue and products).
What's not really clear to me from the docs I read is how I can load the history data from ElasticSearch upon the start of the Spark application, and to calculate for example the overall revenue per user (based on the history, and the incoming sales from Kafka).
I have the following (working) code to connect to my Kafka instance and receive the JSON documents:
import kafka.serializer.StringDecoder

import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

object ReadFromKafka {
  def main(args: Array[String]) {

    val checkpointDirectory = "/tmp"
    val conf = new SparkConf().setAppName("Read Kafka JSONs").setMaster("local[2]")
    val topicsSet = Array("tracking").toSet

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Create direct kafka stream with brokers and topics
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Iterate
    messages.foreachRDD { rdd =>
      // If data is present, continue
      if (rdd.count() > 0) {
        // Create SQLContext and parse JSON
        val sqlContext = new SQLContext(sc)
        val trackingEvents = sqlContext.read.json(rdd.values)
        // Sample aggregation of incoming data
        trackingEvents.groupBy("type").count().show()
      }
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
I know that there's a plugin for ElasticSearch (https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-read), but it's not really clear to me how to integrate the initial read at startup with the streaming calculation process, so that the history data is aggregated together with the streaming data.
Help is much appreciated! Thanks in advance.
RDDs are immutable, so after they are created you cannot add data to them, for example updating the revenue with new events.
What you can do is union the existing data with the new events to create a new RDD, which you can then use as the current total. For example...
var currentTotal: RDD[(Key, Value)] = ... // read from ElasticSearch

messages.foreachRDD { rdd =>
  currentTotal = currentTotal.union(rdd)
}
In this case we make currentTotal a var since it will be replaced by the reference to the new RDD when it gets unioned with the incoming data.
After the union you may want to perform some further operations such as reducing the values which belong to the same Key, but you get the picture.
If you use this technique, note that the lineage of your RDDs will grow, since each newly created RDD references its parent. This can cause a stack-overflow-style lineage problem. To fix it, you can call checkpoint() on the RDD periodically.
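A hedged sketch of those two follow-ups (reducing per key, then periodic checkpointing), under the assumption that the totals are keyed by user ID with revenue as a Double; extractUserRevenue, the checkpoint interval, and the empty-RDD stand-in for the ElasticSearch read are all hypothetical:

import org.apache.spark.rdd.RDD

// Stand-in for the history loaded from ElasticSearch (that read is elided, as above)
var currentTotal: RDD[(String, Double)] = ssc.sparkContext.emptyRDD[(String, Double)]
var batchCount = 0L

messages.foreachRDD { rdd =>
  // extractUserRevenue is a hypothetical parser turning one Kafka JSON value
  // into a (userId, revenue) pair; the field names depend on your schema.
  val newRevenue: RDD[(String, Double)] = rdd.values.map(extractUserRevenue)

  // Union with the running total, then collapse to a single value per user.
  currentTotal = currentTotal.union(newRevenue).reduceByKey(_ + _)

  // Periodically truncate the growing lineage (requires sc.setCheckpointDir(...)).
  batchCount += 1
  if (batchCount % 10 == 0) {
    currentTotal.checkpoint()
    currentTotal.count() // force evaluation so the checkpoint is actually written
  }
}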

apache spark streaming - kafka - reading older messages

I am trying to read older messages from Kafka with Spark Streaming. However, I am only able to retrieve messages as they are sent in real time (i.e., if I produce new messages while my Spark program is running, then I get those messages).
I am changing my groupID and consumerID to make sure ZooKeeper isn't simply withholding messages it knows my program has seen before.
Assuming spark is seeing the offset in zookeeper as -1, shouldn't it read all the old messages in the queue? Am I just misunderstanding the way a kafka queue can be used? I'm very new to spark and kafka, so I can't rule out that I'm just misunderstanding something.
package com.kibblesandbits

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

import net.liftweb.json._

object KafkaStreamingTest {
  val cfg = new ConfigLoader().load
  val zookeeperHost = cfg.zookeeper.host
  val zookeeperPort = cfg.zookeeper.port
  val zookeeper_kafka_chroot = cfg.zookeeper.kafka_chroot

  implicit val formats = DefaultFormats

  def parser(json: String): String = {
    return json
  }

  def main(args: Array[String]) {
    val zkQuorum = "test-spark02:9092"
    val group = "myGroup99"
    val topic = Map("testtopic" -> 1)

    val sparkContext = new SparkContext("local[3]", "KafkaConsumer1_New")
    val ssc = new StreamingContext(sparkContext, Seconds(3))
    val json_stream = KafkaUtils.createStream(ssc, zkQuorum, group, topic)

    var gp = json_stream.map(_._2).map(parser)
    gp.saveAsTextFiles("/tmp/sparkstreaming/mytest", "json")

    ssc.start()
  }
}
When running this, I see the following message, so I am confident that it is not simply skipping the messages because an offset is already set.
14/12/05 13:34:08 INFO ConsumerFetcherManager:
[ConsumerFetcherManager-1417808045047] Added fetcher for partitions
ArrayBuffer([[testtopic,0], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
            [[testtopic,1], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
            [[testtopic,2], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
            [[testtopic,3], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092],
            [[testtopic,4], initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092])
Then, if I populate 1000 new messages -- I can see those 1000 messages saved in my temp directory. But I don't know how to read the existing messages, which should number in the (at this point) tens of thousands.
Use the alternative factory method on KafkaUtils that lets you provide a configuration to the Kafka consumer:
def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Map[String, Int],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[(K, V)]
Then build a map with your Kafka configuration and add the parameter 'auto.offset.reset' set to 'smallest':
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> groupId,
  "zookeeper.connection.timeout.ms" -> "10000",
  "auto.offset.reset" -> "smallest"
)
Provide that config to the factory method above. "auto.offset.reset" -> "smallest" tells the consumer to start from the smallest offset in your topic.
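A sketch of what the resulting call could look like for the question's stream, assuming String keys and values with StringDecoder and a MEMORY_AND_DISK_SER_2 storage level (the decoder types and storage level are assumptions):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// kafkaParams is the map built above; note that "zookeeper.connect" should point
// at the ZooKeeper quorum (usually port 2181), not a Kafka broker port.
val json_stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc,
  kafkaParams,
  Map("testtopic" -> 1),              // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER_2
)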
