I want to extract streaming data from a Kafka cluster in one-hour batches, so I run a script every hour, with the writeStream trigger set to .trigger(once=True) and startingOffsets set to earliest, like this:
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config.get("kafka_servers")) \
    .option("subscribe", config.get("topic_list")) \
    .option("startingOffsets", "earliest") \
    .load()
df.writeStream \
    .format("parquet") \
    .option("checkpointLocation", config.get("checkpoint_path")) \
    .option("path", config.get("s3_path_raw")) \
    .trigger(once=True) \
    .partitionBy('date', 'hour') \
    .start()
But every time the script runs, it only writes to S3 the messages that are arriving from the Kafka cluster at that precise moment, instead of all the messages from the last hour as I expected.
What might be the problem?
Edit: I should mention that the Kafka cluster retention is set to 24 hours.
Option 1:
.trigger(once=True) is supposed to process only one batch of data.
Try replacing it with .trigger(availableNow=True) (available from Spark 3.3.0).
Option 2:
To keep a job up and running with a 1-hour processing interval, use .trigger(processingTime='60 minutes').
Also, you will need to set the following option when reading the stream:
.option("failOnDataLoss", "false")
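A minimal sketch of Option 1, assuming PySpark 3.3 or later and a plain dict named config with the same keys as in the question. The date and hour columns are derived here from the Kafka timestamp column, which is an assumption about where they come from, and kafka_source_options and run_hourly_batch are hypothetical helper names. Keep in mind that startingOffsets only applies the first time the query starts; on later runs Spark resumes from the offsets stored in the checkpoint.

```python
def kafka_source_options(servers, topics):
    """Build the Kafka reader options used by the hourly job."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topics,
        # Only used when the checkpoint is empty (first run).
        "startingOffsets": "earliest",
        # The option mentioned in the answer above.
        "failOnDataLoss": "false",
    }


def run_hourly_batch(spark, config):
    """Process everything that arrived since the last run, then stop."""
    # Imported lazily so the helper above can be used without Spark.
    from pyspark.sql.functions import col, hour, to_date

    df = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(config["kafka_servers"], config["topic_list"]))
        .load()
        # Assumption: partition columns come from the Kafka record timestamp.
        .withColumn("date", to_date(col("timestamp")))
        .withColumn("hour", hour(col("timestamp")))
    )
    query = (
        df.writeStream.format("parquet")
        .option("checkpointLocation", config["checkpoint_path"])
        .option("path", config["s3_path_raw"])
        .partitionBy("date", "hour")
        .trigger(availableNow=True)
        .start()
    )
    # Block until the backlog has been written; the script then exits.
    query.awaitTermination()
```

The scheduled script would call run_hourly_batch(spark, config) once per run.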
Related
I have a Spark streaming application that reads data from Kafka and writes to S3. The first time the query runs, it takes 4 minutes to load the data from Kafka into S3 because there is some historical data too, but after the first run there is not much data left to consume from Kafka. So once the first run is done, I want to change the trigger time to another value: say the first run uses 4 minutes, then from the second run onwards I want 1 minute. How can I do this dynamically in a Spark streaming query?
Code for reading data from Kafka:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<ipaddress:port>") \
    .option("subscribe", "sampletable") \
    .option("startingOffsets", "earliest") \
    .load() \
    .selectExpr("CAST(value AS STRING)", "CAST(offset AS STRING)", "CAST(timestamp AS STRING)")
Write stream:
streaming_query = df.writeStream \
    .foreachBatch(foreach_batch_sync) \
    .trigger(processingTime="5 minute") \
    .option("checkpointLocation", s3path) \
    .start()
# awaitTermination() returns None, so call it on the query instead of chaining it
streaming_query.awaitTermination()
I have written the foreach_batch_sync function, which is triggered according to trigger(processingTime="5 minute"). For the very first run I am fine with a processing time of 5 minutes, but once the first batch completes I want to dynamically change the processing time to 1 minute for the second batch onwards.
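As far as I know, the trigger is fixed when start() is called and cannot be changed on a running query, so one possible workaround is to run the query twice against the same checkpoint: first with a run-once style trigger to drain the historical data, then again with the shorter interval. A rough sketch, assuming Spark 3.3+ for availableNow (once=True would be the substitute on older versions); trigger_for_run and run_with_two_triggers are hypothetical names.

```python
def trigger_for_run(first_run, steady_interval="1 minute"):
    """Return the keyword arguments to pass to DataStreamWriter.trigger()."""
    if first_run:
        # Drain everything currently in Kafka, then stop.
        return {"availableNow": True}
    return {"processingTime": steady_interval}


def run_with_two_triggers(df, foreach_batch_fn, checkpoint_path):
    """Catch up on the backlog first, then poll at the steady interval."""
    for first_run in (True, False):
        query = (
            df.writeStream.foreachBatch(foreach_batch_fn)
            .trigger(**trigger_for_run(first_run))
            # Same checkpoint, so the second run continues where the first stopped.
            .option("checkpointLocation", checkpoint_path)
            .start()
        )
        # Returns after the catch-up run; blocks indefinitely for the second.
        query.awaitTermination()
```

With the code from the question this would be called as run_with_two_triggers(df, foreach_batch_sync, s3path).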
I have a streaming job that is supposed to read data from Kafka, transform it into the desired form using aggregations, and push it back to Kafka. With a Kafka source and a console sink, everything worked fine. The problems started when I switched to the Kafka sink instead of the console: messages are not being sent to the topic. When I push messages to the same topic using kafka-console-producer, everything works fine. Note that I am running this job in IntelliJ. I am using Spark 3.2.
Here is my kafka-reader code:
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "demo-topic")
  .option("startingOffsets", "earliest")
  .load()
And kafka-writer:
val queryX = df
  .select(to_json(struct("*")).as("value"))
  .withColumn("key", lit("key"))
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("update")
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "result-data")
  .option("checkpointLocation", "/temp/checkpoint")
  .start()
queryX.awaitTermination()
checkpointLocation is set to a directory in my local file system.
Is there anything that I am doing wrong here?
I am trying to set up a structured streaming job with a map() transformation that makes REST API calls. Here are the details:
(1)
df = spark.readStream.format('delta') \
    .option("maxFilesPerTrigger", 1000) \
    .load(f'{file_location}')
(2)
respData=df.select("resource", "payload").rdd.map(lambda row: put_resource(row[0], row[1])).collect()
respDf=spark.createDataFrame(respData, ["resource", "status_code", "reason"])
(3)
respDf.writeStream \
    .trigger(once=True) \
    .outputMode("append") \
    .format("delta") \
    .option("path", f'{file_location}/Response') \
    .option("checkpointLocation", f'{file_location}/Response/Checkpoints') \
    .start()
However, I got an error on step (2): Queries with streaming sources must be executed with writeStream.start().
Any help will be appreciated. Thank you.
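You also have to execute the stream on df, meaning df.writeStream.start(). In step (2), .rdd and .collect() are batch operations, and they cannot be run directly on a streaming DataFrame.
There is a similar thread here: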
Queries with streaming sources must be executed with writeStream.start();
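One way to apply this advice is foreachBatch: inside the callback each micro-batch is an ordinary DataFrame, so collect() is allowed there. A hedged sketch, not a definitive implementation: put_resource is the asker's function and is assumed to return a (resource, status_code, reason) tuple, while call_api and start_response_stream are hypothetical names. Collecting each batch to the driver is only reasonable for small batches.

```python
def call_api(rows, put_fn):
    """Call put_fn once per row and return the response tuples."""
    return [put_fn(row[0], row[1]) for row in rows]


def start_response_stream(spark, file_location, put_fn):
    """Read the Delta stream and write one response row per input row."""

    def process_batch(batch_df, batch_id):
        # batch_df is a static DataFrame, so collect() is permitted here.
        rows = batch_df.select("resource", "payload").collect()
        responses = call_api(rows, put_fn)
        if responses:  # createDataFrame cannot infer a schema from an empty list
            resp_df = spark.createDataFrame(
                responses, ["resource", "status_code", "reason"]
            )
            resp_df.write.format("delta").mode("append").save(
                f"{file_location}/Response"
            )

    df = (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", 1000)
        .load(file_location)
    )
    return (
        df.writeStream.foreachBatch(process_batch)
        .trigger(once=True)
        .option("checkpointLocation", f"{file_location}/Response/Checkpoints")
        .start()
    )
```

The call would look like start_response_stream(spark, file_location, put_resource).awaitTermination().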
I am trying to write a Spark Structured Streaming job that reads from multiple Kafka topics (potentially 100s) and writes the results to different locations on S3 depending on the topic name. I've developed this snippet of code that currently reads from multiple topics and outputs the results to the console (based on a loop) and it works as expected. However, I would like to understand what the performance implications are. Would this be the recommended approach? Is it not recommended to have multiple readStream and writeStream operations? If so, what is the recommended approach?
my_topics = ["topic_1", "topic_2"]
for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()
It's certainly reasonable to run a number of concurrent streams per driver node.
Each .start() consumes a certain amount of driver resources in Spark. Your limiting factor will be the load on the driver node and its available resources.
Hundreds of topics running continuously at a high rate would need to be spread across multiple driver nodes (in Databricks there is one driver per cluster). The advantage of Spark is, as you mention, multiple sinks and also unified batch and streaming APIs for transformations.
The other issue will be dealing with the small writes you may end up making to S3 and file consistency. Take a look at delta.io to handle consistent & reliable writes to S3.
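To make the resource point concrete, here is a sketch of one query per topic, each with its own output and checkpoint path. The bucket layout produced by paths_for_topic is purely illustrative, and both function names are hypothetical. Parquet is used only to keep the sketch free of extra dependencies; following the Delta suggestion would mean swapping the format.

```python
def paths_for_topic(bucket, topic):
    """Return (output_path, checkpoint_path) for one topic (illustrative layout)."""
    base = f"s3://{bucket}/{topic}"
    return f"{base}/data", f"{base}/_checkpoint"


def start_topic_streams(spark, bootstrap_servers, topics, bucket):
    """Start one streaming query per topic and return the started queries."""
    queries = []
    for topic in topics:
        output_path, checkpoint_path = paths_for_topic(bucket, topic)
        df = (
            spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", bootstrap_servers)
            .option("subscribe", topic)
            .load()
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        )
        query = (
            df.writeStream.format("parquet")
            .option("path", output_path)
            .option("checkpointLocation", checkpoint_path)
            .queryName(topic)  # makes each query easy to spot in the Spark UI
            .start()
        )
        queries.append(query)
    return queries
```

After start_topic_streams(...) the driver would typically block on spark.streams.awaitAnyTermination(). Every started query keeps consuming driver resources until it stops, which is the limit discussed above.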
Advantages of the approach below:
Generic.
Multiple threads; all threads work independently.
Code is easy to maintain and support when issues arise.
If one topic fails, there is no impact on the other topics in production. You only have to focus on the failed one.
If you want to pull all data for a specific topic, you just stop the job for that topic, update or change the config, and restart the same job.
Note: the code below is not completely generic; you may need to change or tune it.
import sys

topic = sys.argv[1]  # Kafka topic, taken from the input arguments
sink = sys.argv[2]   # checkpoint location, taken from the input arguments
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("subscribePattern", topic) \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
output_df = df \
    .writeStream \
    .format("console") \
    .option("truncate", False) \
    .outputMode("update") \
    .option("checkpointLocation", sink) \
    .start()
Problems with the approach below:
If one topic fails, it terminates the whole program.
Limited threads.
Code is difficult to maintain, debug, and support when issues arise.
If you want to pull all data for a specific topic from Kafka, it is not possible, because any config change applies to all topics, which makes it a very costly operation.
my_topics = ["topic_1", "topic_2"]
for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()
Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again. But this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.
So should checkpoints work in batch-writing mode at all? Or am I missing something?
Config:
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()
Spark checkpoints tend to cause duplicates. Storing and reading offsets from ZooKeeper may solve this issue. Here is a link with details:
http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
Also, in your case, are checkpoints not working at all, or are checkpoints causing duplicates? The URL above helps with the latter case.