How to insert processed spark stream into kafka - apache-spark

I am trying to insert a Spark stream into Kafka after it has been processed, using the snippet below:
query = ds1 \
    .selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .foreachBatch(do_something) \
    .format("kafka") \
    .option("topic", "topic-name") \
    .option("kafka.bootstrap.servers", "borkers-IPs") \
    .option("checkpointLocation", "/home/location") \
    .start()
but it seems to be inserting the original stream, not the processed one.

Using foreachBatch this way has no effect, as you can see. Spark will not raise an error; the call simply goes into the void.
Quote from the manuals:
Structured Streaming APIs provide two ways to write the output of a
streaming query to data sources that do not have an existing streaming
sink: foreachBatch() and foreach().
This excellent read is what you are looking for.
https://aseigneurin.github.io/2018/08/14/kafka-tutorial-8-spark-structured-streaming.html
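For reference, here is a minimal sketch of the foreachBatch approach in PySpark (Spark 2.4+). The do_something name, topic, broker list and checkpoint path come from the question; the processing inside the function is only a placeholder. The key point is that the micro-batch is a static DataFrame, so the processed result can be written to Kafka with a plain batch write:

def do_something(batch_df, batch_id):
    # placeholder processing; replace with the real transformation
    processed = batch_df.selectExpr("CAST(value AS STRING) AS value")
    # write the processed micro-batch to Kafka with a normal batch write
    processed.write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "borkers-IPs") \
        .option("topic", "topic-name") \
        .save()

query = ds1.writeStream \
    .foreachBatch(do_something) \
    .option("checkpointLocation", "/home/location") \
    .start()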

Related

Using two WriteStreams in same spark structured streaming job

I have a scenario where I would like to save the same streaming dataframe to two different streaming sinks.
I have created a streaming dataframe which I need to send to both a Kafka topic and Delta Lake.
I thought of using foreachBatch, but it looks like it doesn't support multiple streaming sinks.
I also tried spark.streams.awaitAnyTermination() with multiple write streams, but the second stream is not getting processed.
Is there a way to achieve this?
This is my code:
I am reading from a Kafka stream and creating a single streaming dataframe.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "ingestionTopic1")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
writing the above dataframe to a Kafka topic
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9082")
  .option("topic", "outputTopic1")
  .start()
writing the same streaming dataframe to delta lake
val ds2 = df.format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
  .start("/test/delta/events")
ds1.awaitTermination
ds2.awaitTermination
There are a few things you need to do to use one input stream for multiple output streams:
You need to make sure you have two different checkpointLocations, one in each output stream.
Furthermore, you need a writeStream call on your second output query as well; in the code above, ds2 calls format directly on the DataFrame instead of on df.writeStream.
Overall, it is important to start both queries before waiting for the termination of both queries (you are already doing this).
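A rough sketch of the corrected setup, shown in PySpark for consistency with the rest of this page (the Kafka-sink checkpoint path is an assumption; the other values come from the question):

kafka_query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "outputTopic1") \
    .option("checkpointLocation", "/test/kafka/_checkpoints/etlflow") \
    .start()

delta_query = df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow") \
    .start("/test/delta/events")

# block until any query terminates (or call awaitTermination() on each query)
spark.streams.awaitAnyTermination()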

How to run a map transformation in a structured streaming job in pyspark

I am trying to set up a structured streaming job with a map() transformation that makes REST API calls. Here are the details:
(1)
df = spark.readStream.format('delta') \
    .option("maxFilesPerTrigger", 1000) \
    .load(f'{file_location}')
(2)
respData=df.select("resource", "payload").rdd.map(lambda row: put_resource(row[0], row[1])).collect()
respDf=spark.createDataFrame(respData, ["resource", "status_code", "reason"])
(3)
respDf.writeStream \
    .trigger(once=True) \
    .outputMode("append") \
    .format("delta") \
    .option("path", f'{file_location}/Response') \
    .option("checkpointLocation", f'{file_location}/Response/Checkpoints') \
    .start()
However, on step (2) I got an error: Queries with streaming sources must be executed with writeStream.start().
Any help will be appreciated. Thank you.
You have to execute the stream on df as well, meaning df.writeStream...start().
There is a similar thread here:
Queries with streaming sources must be executed with writeStream.start();
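As the first answer on this page notes, foreachBatch and foreach are the usual way to run arbitrary per-batch work such as REST calls on a streaming DataFrame. A rough sketch under that approach, reusing the question's put_resource and file_location (both taken from the question); call_api is a hypothetical helper name:

def call_api(batch_df, batch_id):
    # inside foreachBatch the micro-batch is a static DataFrame,
    # so .rdd and .collect() are allowed here
    resp_data = batch_df.select("resource", "payload").rdd \
        .map(lambda row: put_resource(row[0], row[1])).collect()
    resp_df = spark.createDataFrame(resp_data, ["resource", "status_code", "reason"])
    resp_df.write \
        .format("delta") \
        .mode("append") \
        .save(f"{file_location}/Response")

df.writeStream \
    .trigger(once=True) \
    .option("checkpointLocation", f"{file_location}/Response/Checkpoints") \
    .foreachBatch(call_api) \
    .start()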

What is the optimal way to read from multiple Kafka topics and write to different sinks using Spark Structured Streaming?

I am trying to write a Spark Structured Streaming job that reads from multiple Kafka topics (potentially 100s) and writes the results to different locations on S3 depending on the topic name. I've developed this snippet of code that currently reads from multiple topics and outputs the results to the console (based on a loop) and it works as expected. However, I would like to understand what the performance implications are. Would this be the recommended approach? Is it not recommended to have multiple readStream and writeStream operations? If so, what is the recommended approach?
my_topics = ["topic_1", "topic_2"]

for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()
It's certainly reasonable to run a number of concurrent streams per driver node.
Each .start() consumes a certain amount of driver resources in Spark. Your limiting factor will be the load on the driver node and its available resources.
Hundreds of topics running continuously at a high rate would need to be spread across multiple driver nodes (in Databricks there is one driver per cluster). The advantage of Spark is, as you mention, multiple sinks and also a unified batch & streaming API for transformations.
The other issue will be dealing with the small writes you may end up making to S3, and with file consistency. Take a look at delta.io to handle consistent and reliable writes to S3.
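For example, a rough sketch of the same loop with a Delta sink on S3 instead of the console sink (the bucket paths are placeholders and the Delta Lake library is assumed to be available on the cluster):

my_topics = ["topic_1", "topic_2"]

for topic in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", topic) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # one query, one checkpoint, and one Delta path per topic
    df.writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/checkpoints/{}".format(topic)) \
        .start("s3://<MY_BUCKET>/delta/{}".format(topic))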
Advantages of the approach below:
It is generic.
Multiple threads; each thread works independently.
The code is easier to maintain and easier to support when issues arise.
If one topic fails, there is no impact on the other topics in production; you only have to focus on the failed one.
If you want to re-pull all data for a specific topic, you just have to stop the job for that topic, update or change the config, and restart the same job.
Note: the code below is not completely generic; you may need to change or tune it.
topic="" // Get value from input arguments
sink="" // Get value from input arguments
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("subscribePattern", topic) \
.load() \
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
output_df = df \
.writeStream \
.format("console") \
.option("truncate", False) \
.outputMode("update") \
.option("checkpointLocation", sink) \
.start()
Problems with the approach below:
If one topic fails, it terminates the complete program.
Limited threads.
The code is harder to maintain, debug, and support when issues arise.
If you want to re-pull all data for a specific topic from Kafka, it is not possible in isolation, since any config change applies to all topics, which makes it a very costly operation.
my_topics = ["topic_1", "topic_2"]

for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()

How to stream out or extract only inserts/adds from a Databricks delta file?

I have a scenario where I want to run a Spark Structured Streaming job that reads a Databricks Delta source file and extracts only the inserts to it, filtering out any updates/deletes.
I tried the following on a smaller file, but the code does not seem to do what I expect.
spark
  .readStream
  .format("delta")
  .option("latestFirst", "true")
  .option("ignoreDeletes", "true")
  .option("ignoreChanges", "true")
  .load("/mnt/data-lake/data/bronze/accounts")
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/mnt/data-lake/tmp/chkpnt_accounts_inserts")
  .option("path", "/mnt/data-lake/tmp/accounts_inserts")
  .start()

How to write streaming Dataset to Cassandra?

So I have a Python Stream-sourced DataFrame df that has all the data I want to place into a Cassandra table with the spark-cassandra-connector. I've tried doing this in two ways:
df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="myTable", keyspace="myKeySpace") \
    .save()

query = df.writeStream \
    .format("org.apache.spark.sql.cassandra") \
    .outputMode('append') \
    .options(table="myTable", keyspace="myKeySpace") \
    .start()

query.awaitTermination()
However, I keep getting these errors, respectively:
pyspark.sql.utils.AnalysisException: "'write' can not be called on streaming Dataset/DataFrame;
and
java.lang.UnsupportedOperationException: Data source org.apache.spark.sql.cassandra does not support streamed writing.
Is there any way I can send my streamed DataFrame into my Cassandra table?
There is currently no streaming sink for Cassandra in the Spark Cassandra Connector. You will need to implement your own Sink or wait for it to become available.
If you were using Scala or Java, you could use the foreach operator with a ForeachWriter, as described in Using Foreach.
I know it's an old post; updating it for future reference.
You can process the streaming data as batches with foreachBatch, like below:
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="table_name", keyspace="keyspacename") \
        .mode("append") \
        .save()

query = sdf3.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
