Batch write from to Kafka does not observe checkpoints and writes duplicates - apache-spark

Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, some times there are some errors (mostly timeouts). Retrying kicks in and processing will start over again. But this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.
So should checkpoints work in batch-writing mode at all? Or I am missing something?
Config:
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
dfKafka \
.write \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("topic", "mytopic") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()

Spark checkpoints tend to cause duplicates . Storing and reading Offset from Zookeeper may solve this issue. Here is the link for details :
http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
Also, in your case , checkpoints are not working at all or checkpoints are causing duplicates ? Above URL help is for the later case.

Related

PySpark Structured Streaming: trigger once not working with Kafka

I want to extract streaming data from a Kafka cluster in batches of one hour, so I run a script every hour, having set writeStream to .trigger(once=True) and the startingOffsets set to earliest, like this:
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers",
config.get("kafka_servers")) \
.option("subscribe", config.get("topic_list")) \
.option("startingOffsets", "earliest") \
.load()
df.writeStream \
.format("parquet") \
.option("checkpointLocation", config.get("checkpoint_path")) \
.option("path", config.get("s3_path_raw")) \
.trigger(once=True)
.partitionBy('date', 'hour') \
.start()
But every time the script gets triggered it only writes to the S3 the messages that are coming from the Kafka cluster in that precise moment, instead of taking all the messages from the last hour as I was expecting it to do.
What might be the problem?
Edit: I should mention that the kafka cluster retention is set to 24 hours
Option 1:
.trigger(once=True) is supposed to process only one patch of data.
Please try to replace it with .trigger(availableNow=True)
Option 2:
To have an up and running job with a 1-hour processing interval; .trigger(processingTime='60 minutes')
Also, you will need to set the following option while reading the stream
.option("failOnDataLoss", "false")

How to insert processed spark stream into kafka

am trying to insert spark stream into kafka after being processed using the below snippet
query = ds1 \
.selectExpr("CAST(value AS STRING)")\
.writeStream\
.foreachBatch(do_something) \
.format("kafka") \
.option("topic","topic-name") \
.option("kafka.bootstrap.servers", "borkers-IPs") \
.option("checkpointLocation", "/home/location") \
.start()
but it seems it's inserting the original stream not the processed one.
Use of foreachBatch has no effect here as you can see. Spark will not generate an error, it will just be like into the void.
Quote from the manuals:
Structured Streaming APIs provide two ways to write the output of a
streaming query to data sources that do not have an existing streaming
sink: foreachBatch() and foreach().
This excellent read is what you are looking for.
https://aseigneurin.github.io/2018/08/14/kafka-tutorial-8-spark-structured-streaming.html

how to run map transformation in a structured streaming job in pyspark

I am trying to setup a structured streaming job with a map() transformation that make REST API calls. Here are the details:
(1)
df=spark.readStream.format('delta') \
.option("maxFilesPerTrigger", 1000) \
.load(f'{file_location}')
(2)
respData=df.select("resource", "payload").rdd.map(lambda row: put_resource(row[0], row[1])).collect()
respDf=spark.createDataFrame(respData, ["resource", "status_code", "reason"])
(3)
respDf.writeStream \
.trigger(once=True) \
.outputMode("append") \
.format("delta") \
.option("path", f'{file_location}/Response') \
.option("checkpointLocation", f'{file_location}/Response/Checkpoints') \
.start()
However, I got an error: Queries with streaming sources must be executed with writeStream.start() on step (2).
Any help will be appreciated. Thank you.
you have to execute your stream on df also
meaning df.writeStream.start()..
there is a similar thread here :
Queries with streaming sources must be executed with writeStream.start();

What is the optimal way to read from multiple Kafka topics and write to different sinks using Spark Structured Streaming?

I am trying to write a Spark Structured Streaming job that reads from multiple Kafka topics (potentially 100s) and writes the results to different locations on S3 depending on the topic name. I've developed this snippet of code that currently reads from multiple topics and outputs the results to the console (based on a loop) and it works as expected. However, I would like to understand what the performance implications are. Would this be the recommended approach? Is it not recommended to have multiple readStream and writeStream operations? If so, what is the recommended approach?
my_topics = ["topic_1", "topic_2"]
for i in my_topics:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("subscribePattern", i) \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
output_df = df \
.writeStream \
.format("console") \
.option("truncate", False) \
.outputMode("update") \
.option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
.start()
It's certainly reasonable to run a number # of concurrent streams per driver node.
Each .start() consumes a certain amount of driver resources in spark. Your limiting factor will be the load on the driver node and its available resources.
100's of topics running continuously at high rate would need to be spread across multiple driver nodes [In Databricks there is one driver per cluster]. The advantage of Spark is as you mention, multiple sinks and also a unified batch & streaming apis for transformations.
The other issue will be dealing with the small writes you may end up making to S3 and file consistency. Take a look at delta.io to handle consistent & reliable writes to S3.
Advantages of below approach.
Generic
Multiple Threads, All threads will work individual.
Easy to maintain code & support for any issues.
If one topic is failed, No impact on other topics in production. You just have to focus on failed one.
If you want to pull all data for specific topic, You just have to stop job for that topic, update or change the config & restart same job.
Note - Below code is not complete generic, You may need to change or tune below code.
topic="" // Get value from input arguments
sink="" // Get value from input arguments
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("subscribePattern", topic) \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
output_df = df \
.writeStream \
.format("console") \
.option("truncate", False) \
.outputMode("update") \
.option("checkpointLocation", sink) \
.start()
Problems with below approach.
If one topic is failed, It will terminate complete program.
Limited Threads.
Difficult to maintain code, debug & support for any issues.
If you want to pull all data for specific topic from kafka, It's not possible as any config change will apply for all topics, hence its too costliest operation.
my_topics = ["topic_1", "topic_2"]
for i in my_topics:
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("subscribePattern", i) \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
output_df = df \
.writeStream \
.format("console") \
.option("truncate", False) \
.outputMode("update") \
.option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
.start()

How to write streaming Dataset to Cassandra?

So I have a Python Stream-sourced DataFrame df that has all the data I want to place into a Cassandra table with the spark-cassandra-connector. I've tried doing this in two ways:
df.write \
.format("org.apache.spark.sql.cassandra") \
.mode('append') \
.options(table="myTable",keyspace="myKeySpace") \
.save()
query = df.writeStream \
.format("org.apache.spark.sql.cassandra") \
.outputMode('append') \
.options(table="myTable",keyspace="myKeySpace") \
.start()
query.awaitTermination()
However I keep on getting this errors, respectively:
pyspark.sql.utils.AnalysisException: "'write' can not be called on streaming Dataset/DataFrame;
and
java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.cassandra does not support streamed writing.
Is there anyway I can send my Streamed DataFrame into a my Cassandra Table?
There is currently no streaming Sink for Cassandra in the Spark Cassandra Connector. You will need to implement your own Sink or wait for it to become available.
If you were using Scala or Java you could use foreach operator and use a ForeachWriter as described in Using Foreach.
I know its an old post, updating it for future references.
You can process it as a batch from streaming data. like below
def writeToCassandra(writeDF, epochId):
writeDF.write \
.format("org.apache.spark.sql.cassandra") \
.options(table="table_name", keyspace="keyspacename")\
.mode("append") \
.save()
query = sdf3.writeStream \
.trigger(processingTime="10 seconds") \
.outputMode("update") \
.foreachBatch(writeToCassandra) \
.start()

Resources