PySpark Kafka streaming offset - apache-spark

I got the below from below link related to Kafka topic offset streaming in PySpark:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming.kafka import TopicAndPartition
stream = StreamingContext(sc, 120) # 120 second window
kafkaParams = {"metadata.broker.list":"1:667,2:6667,3:6667"}
kafkaParams["auto.offset.reset"] = "smallest"
kafkaParams["enable.auto.commit"] = "false"
topic = "xyz"
topicPartion = TopicAndPartition(topic, 0)
fromOffset = {topicPartion: long(PUT NUMERIC OFFSET HERE)}
kafka_stream = KafkaUtils.createDirectStream(stream, [topic], kafkaParams,
fromOffsets = fromOffset)
Reference link: Spark Streaming kafka offset manage
I am not understanding what to provide in below in case I have to read last 15 minutes data from kafka for each window/batch:
fromOffset = {topicPartion: long(PUT NUMERIC OFFSET HERE)}

This is field to help us to manage the checkpoint kind of thing. Managing offsets is most beneficial to achieve data continuity over the lifecycle of the stream process. For example, upon shutting down the stream application or an unexpected failure, offset ranges will be lost unless persisted in a non-volatile data store. Further, without offsets of the partitions being read, the Spark Streaming job will not be able to continue processing data from where it had last left off. So that we can handle the offset in multiple manner.
One of the way , we can stored the offset value in Zookeeper and read for the same while creating the DSstream.
from kazoo.client import KazooClient
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()
ZOOKEEPER_SERVERS = "127.0.0.1:2181"
def get_zookeeper_instance():
from kazoo.client import KazooClient
if 'KazooSingletonInstance' not in globals():
globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
globals()['KazooSingletonInstance'].start()
return globals()['KazooSingletonInstance']
def save_offsets(rdd):
zk = get_zookeeper_instance()
for offset in rdd.offsetRanges():
path = f"/consumers/{var_topic_src_name}"
print(path)
zk.ensure_path(path)
zk.set(path, str(offset.untilOffset).encode())
var_offset_path = f'/consumers/{var_topic_src_name}'
try:
var_offset = int(zk.get(var_offset_path)[0])
except:
print("The spark streaming started First Time and Offset value should be Zero")
var_offset = 0
var_partition = 0
enter code here
topicpartion = TopicAndPartition(var_topic_src_name, var_partition)
fromoffset = {topicpartion: var_offset}
print(fromoffset)
kvs = KafkaUtils.createDirectStream(ssc,\
[var_topic_src_name],\
var_kafka_parms_src,\
valueDecoder=serializer.decode_message,\
fromOffsets = fromoffset)
kvs.foreachRDD(handler)
kvs.foreachRDD(save_offsets)
Reference:
pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset

Related

StructuredStreaming with ForEachWriter creating duplicates

Hi I'm trying to create a neo4j sink using pyspark and kafka, but for some reason this sink is creating duplicates in neo4j and I'm not sure why this is happening. I am expecting to get only one node, but it looks like it's creating 4. If someone has an idea, please let me know.
Kafka producer code:
from kafka import KafkaProducer
import json
producer = KafkaProducer(bootstrap_servers='10.0.0.38:9092')
message = {
'test_1': 'test_1',
'test_2': 'test_2'
}
producer.send('test_topic', json.dumps(message).encode('utf-8'))
producer.close()
Kafka consumer code:
from kafka import KafkaConsumer
import findspark
from py2neo import Graph
import json
findspark.init()
from pyspark.sql import SparkSession
class ForeachWriter:
def open(self, partition_id, epoch_id):
neo4j_uri = '' # neo4j uri
neo4j_auth = ('', '') # neo4j user, password
self.graph = Graph(neo4j_uri, auth=neo4j_auth)
return True
def process(self, msg):
msg = json.loads(msg.value.decode('utf-8'))
self.graph.run("CREATE (n: MESSAGE_RECEIVED) SET n.key = '" + str(msg).replace("'", '"') + "'")
raise KeyError('received message: {}. finished creating node'.format(msg))
spark = SparkSession.builder.appName('test-consumer') \
.config('spark.executor.instances', 1) \
.getOrCreate()
ds1 = spark.readStream \
.format('kafka') \
.option('kafka.bootstrap.servers', '10.0.0.38:9092') \
.option('subscribe', 'test_topic') \
.load()
query = ds1.writeStream.foreach(ForeachWriter()).start()
query.awaitTermination()
neo4j graph after running code
After doing some searching, I found this snippet of text from Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming on chapter 11 p151 after describing open, process, and close for ForeachWriter:
This contract is part of the data delivery semantics because it allows us to remove duplicated partitions that might already have been sent to the sink but are reprocessed by Structured Streaming as part of a recovery scenario. For that mechanism to properly work, the sink must implement some persistent way to remember the partition/version combinations that it has already seen.
On another note from the spark website: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (see section on Foreach).
Note: Spark does not guarantee same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). e.g. source provides different number of partitions for some reasons, Spark optimization changes number of partitions, etc. See SPARK-28650 for more details. If you need deduplication on output, try out foreachBatch instead.
It seems like I need to implement a check for uniqueness because Structured Streaming automatically reprocesses partitions in case of a fail if I am to use ForeachWriter, otherwise I have to switch to foreachBatch instead.

Spark dataframe lose streaming capability after accessing Kafka source

I use Spark 2.4.3 and Kafka 2.3.0. I want to do Spark structured streaming with data coming from Kafka to Spark. In general it does work in the test mode but since I have to do some processing of the data (and do not know another way to do) the Spark data frames do not have the streaming capability anymore.
#!/usr/bin/env python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
# create schema for data
schema = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])
# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()
# create DataFrame representing the stream
dsraw = spark.readStream \
.format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "test")
print("dsraw.isStreaming: ", dsraw.isStreaming)
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
print("ds.isStreaming: ", ds.isStreaming)
# Do query on the converted data
dsQuery = ds.writeStream.queryName("ds_query").format("memory").start()
df1 = spark.sql("select * from ds_query")
print("df1.isStreaming: ", df1.isStreaming)
# convert json into spark dataframe cols
df2 = df1.withColumn("value", from_json("value", schema))
print("df2.isStreaming: ", df2.isStreaming)
The output is:
dsraw.isStreaming: True
ds.isStreaming: True
df1.isStreaming: False
df2.isStreaming: False
So I lose the streaming capability when I create the first dataframe. How can I avoid it? How do I get a streaming Spark dataframe out of a stream?
It is not recommend to use the memory sink for production applications as all the data will be stored in the driver.
There is also no reason to do this, except for debugging purposes, as you can process your streaming dataframes like the 'normal' dataframes. For example:
import pyspark.sql.functions as F
lines = spark.readStream.format("socket").option("host", "XXX.XXX.XXX.XXX").option("port", XXXXX).load()
words = lines.select(lines.value)
words = words.filter(words.value.startswith('h'))
wordCounts = words.groupBy("value").count()
wordCounts = wordCounts.withColumn('count', F.col('count') + 2)
query = wordCounts.writeStream.queryName("test").outputMode("complete").format("memory").start()
In case you still want to go with your approach: Even if df.isStreaming tells you it is not a streaming dataframe (which is correct), the underlying datasource is a stream and the dataframe will therefore grow with each processed batch.

Spark Streaming kafka offset manage

I had been doing spark streaming jobs which consumer and produce data through kafka. I used directDstream,so I had to manage offset by myself,we adopted redis to write and read offsets.Now there is one problem,when I launched my client,my client need to get the offset from redis,not offset which exists in kafka itself.how show I write my code?Now I had written my code below:
kafka_stream = KafkaUtils.createDirectStream(
ssc,
topics=[config.CONSUME_TOPIC, ],
kafkaParams={"bootstrap.servers": config.CONSUME_BROKERS,
"auto.offset.reset": "largest"},
fromOffsets=read_offset_range(config.OFFSET_KEY))
But I think the fromOffsets is the value(from redis) when the spark-streaming client lauched,not during its running.thank you for helpinp.
If I understand you correctly you need to set your offset manually. This is how I do it:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming.kafka import TopicAndPartition
stream = StreamingContext(sc, 120) # 120 second window
kafkaParams = {"metadata.broker.list":"1:667,2:6667,3:6667"}
kafkaParams["auto.offset.reset"] = "smallest"
kafkaParams["enable.auto.commit"] = "false"
topic = "xyz"
topicPartion = TopicAndPartition(topic, 0)
fromOffset = {topicPartion: long(PUT NUMERIC OFFSET HERE)}
kafka_stream = KafkaUtils.createDirectStream(stream, [topic], kafkaParams, fromOffsets = fromOffset)

How to evaluate spark Dstream objects with an spark data frame

I am writing an spark app ,where I need to evaluate the streaming data based on the historical data, which sits in a sql server database
Now the idea is , spark will fetch the historical data from the database and persist it in the memory and will evaluate the streaming data against it .
Now I am getting the streaming data as
import re
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext,functions as func,Row
sc = SparkContext("local[2]", "realtimeApp")
ssc = StreamingContext(sc,10)
files = ssc.textFileStream("hdfs://RealTimeInputFolder/")
########Lets get the data from the db which is relavant for streaming ###
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
dataurl = "jdbc:sqlserver://myserver:1433"
db = "mydb"
table = "stream_helper"
credential = "my_credentials"
########basic data for evaluation purpose ########
files_count = files.flatMap(lambda file: file.split( ))
pattern = '(TranAmount=Decimal.{2})(.[0-9]*.[0-9]*)(\\S+ )(TranDescription=u.)([a-zA-z\\s]+)([\\S\\s]+ )(dSc=u.)([A-Z]{2}.[0-9]+)'
tranfiles = "wasb://myserver.blob.core.windows.net/RealTimeInputFolder01/"
def getSqlContextInstance(sparkContext):
if ('sqlContextSingletonInstance' not in globals()):
globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
return globals()['sqlContextSingletonInstance']
def pre_parse(logline):
"""
to read files as rows of sql in pyspark streaming using the pattern . for use of logging
added 0,1 in case there is any failure in processing by this pattern
"""
match = re.search(pattern,logline)
if match is None:
return(line,0)
else:
return(
Row(
customer_id = match.group(8)
trantype = match.group(5)
amount = float(match.group(2))
),1)
def parse():
"""
actual processing is happening here
"""
parsed_tran = ssc.textFileStream(tranfiles).map(preparse)
success = parsed_tran.filter(lambda s: s[1] == 1).map(lambda x:x[0])
fail = parsed_tran.filter(lambda s:s[1] == 0).map(lambda x:x[0])
if fail.count() > 0:
print "no of non parsed file : %d", % fail.count()
return success,fail
success ,fail = parse()
Now I want to evaluate it by the data frame that I get from the historical data
base_data = sqlContext.read.format("jdbc").options(driver=driver,url=dataurl,database=db,user=credential,password=credential,dbtable=table).load()
Now since this being returned as a data frame how do I use this for my purpose .
The streaming programming guide here says
"You have to create a SQLContext using the SparkContext that the StreamingContext is using."
Now this makes me even more confused on how to use the existing dataframe with the streaming object . Any help is highly appreciated .
To manipulate DataFrames, you always need a SQLContext so you can instanciate it like :
sc = SparkContext("local[2]", "realtimeApp")
sqlc = SQLContext(sc)
ssc = StreamingContext(sc, 10)
These 2 contexts (SQLContext and StreamingContext) will coexist in the same job because they are associated with the same SparkContext.
But, keep in mind, you can't instanciate two different SparkContext in the same job.
Once you have created your DataFrame from your DStreams, you can join your historical DataFrame with the DataFrame created from your stream.
To do that, I would do something like :
yourDStream.foreachRDD(lambda rdd: sqlContext
.createDataFrame(rdd)
.join(historicalDF, ...)
...
)
Think about the amount of streamed data you need to use for your join when you manipulate streams, you may be interested by the windowed functions

apache spark streaming - kafka - reading older messages

I am trying to read older messages from Kafka with spark streaming. However, I am only able to retrieve messages as they are sent in real time (i.e., if I populate new messages, while my spark program is running - then I get those messages).
I am changing my groupID and consumerID to make sure zookeeper isn't just not giving messages it knows my program has seen before.
Assuming spark is seeing the offset in zookeeper as -1, shouldn't it read all the old messages in the queue? Am I just misunderstanding the way a kafka queue can be used? I'm very new to spark and kafka, so I can't rule out that I'm just misunderstanding something.
package com.kibblesandbits
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import net.liftweb.json._
object KafkaStreamingTest {
val cfg = new ConfigLoader().load
val zookeeperHost = cfg.zookeeper.host
val zookeeperPort = cfg.zookeeper.port
val zookeeper_kafka_chroot = cfg.zookeeper.kafka_chroot
implicit val formats = DefaultFormats
def parser(json: String): String = {
return json
}
def main(args : Array[String]) {
val zkQuorum = "test-spark02:9092"
val group = "myGroup99"
val topic = Map("testtopic" -> 1)
val sparkContext = new SparkContext("local[3]", "KafkaConsumer1_New")
val ssc = new StreamingContext(sparkContext, Seconds(3))
val json_stream = KafkaUtils.createStream(ssc, zkQuorum, group, topic)
var gp = json_stream.map(_._2).map(parser)
gp.saveAsTextFiles("/tmp/sparkstreaming/mytest", "json")
ssc.start()
}
When running this, I will see the following message. So I am confident that it's not just not seeing the messages because the offset is set.
14/12/05 13:34:08 INFO ConsumerFetcherManager:
[ConsumerFetcherManager-1417808045047] Added fetcher for partitions
ArrayBuffer([[testtopic,0], initOffset -1 to broker
id:1,host:test-spark02.vpc,port:9092] , [[testtopic,1],
initOffset -1 to broker i d:1,host:test-spark02.vpc,port:9092] ,
[[testtopic,2], initOffset -1 to broker
id:1,host:test-spark02.vpc,port:9092] , [[testtopic,3],
initOffset -1 to broker id:1,host:test-spark02.vpc,port:9092] ,
[[testtopic,4], initOffset -1 to broker
id:1,host:test-spark02.vpc,port:9092] )
Then, if I populate 1000 new messages -- I can see those 1000 messages saved in my temp directory. But I don't know how to read the existing messages, which should number in the (at this point) tens of thousands.
Use the alternative factory method on KafkaUtils that lets you provide a configuration to the Kafka consumer:
def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
ssc: StreamingContext,
kafkaParams: Map[String, String],
topics: Map[String, Int],
storageLevel: StorageLevel
): ReceiverInputDStream[(K, V)]
Then build a map with your kafka configuration and add the parameter 'kafka.auto.offset.reset' set to 'smallest':
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
"zookeeper.connection.timeout.ms" -> "10000",
"kafka.auto.offset.reset" -> "smallest"
)
Provide that config to the factory method above. "kafka.auto.offset.reset" -> "smallest" tells the consumer to starts from the smallest offset in your topic.

Resources