Using two WriteStreams in the same Spark Structured Streaming job - apache-spark

I have a scenario where I would like to save the same streaming dataframe to two different streaming sinks.
I have created a streaming dataframe which I need to send to both a Kafka topic and Delta Lake.
I thought of using foreachBatch, but it looks like it doesn't support multiple streaming sinks.
I also tried spark.streams.awaitAnyTermination() with multiple write streams, but the second stream is not being processed.
Is there a way to achieve this?
This is my code:
I am reading from a Kafka stream and creating a single streaming dataframe:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "ingestionTopic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
Writing the above dataframe to a Kafka topic:
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9082")
  .option("topic", "outputTopic1")
  .start()
Writing the same streaming dataframe to Delta Lake:
val ds2 = df.format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
  .start("/test/delta/events")
ds1.awaitTermination
ds2.awaitTermination

To use one input stream for multiple output streams, you need to get a few things right:
Each output stream needs its own checkpointLocation; the two queries must not share one. In your code, only the Delta query sets one.
Your second output query is missing the writeStream call: df.format("delta") should be df.writeStream.format("delta").
Finally, start both queries before waiting for either of them to terminate. (You are already doing this.)
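A minimal sketch of the points above, reusing the topics and paths from the question. The Kafka checkpoint path and the application name are assumptions, and the Delta sink assumes the Delta Lake library is on the classpath. Each query reads the source independently, so the input topic is consumed twice.

```scala
import org.apache.spark.sql.SparkSession

object TwoSinks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("two-sinks").getOrCreate()

    // One streaming dataframe shared by both output queries.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "ingestionTopic1")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Query 1: Kafka sink with its own checkpoint directory.
    // The checkpoint path is a placeholder; the broker address is the one from the question.
    val ds1 = df.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9082")
      .option("topic", "outputTopic1")
      .option("checkpointLocation", "/test/kafka/outputTopic1/_checkpoints/etlflow")
      .start()

    // Query 2: Delta sink with a different checkpoint directory.
    val ds2 = df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
      .start("/test/delta/events")

    // Both queries are running now; block until either one stops or fails.
    spark.streams.awaitAnyTermination()
  }
}
```

Calling ds1.awaitTermination followed by ds2.awaitTermination, as in the question, also works once both queries have been started; awaitAnyTermination simply returns as soon as either query stops.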

Related

Kafka Sink seems not to be working properly

I have a streaming job that is supposed to read data from Kafka, transform it into the desired form using aggregations, and push it back to Kafka. With a Kafka source and a console sink, everything worked fine. The problems started when I switched from the console sink to the Kafka sink: messages are not being sent to the topic. When I push messages to the same topic using kafka-console-producer, everything works fine. Note that I am running this job in IntelliJ. I am using Spark 3.2.
Here is my kafka-reader code:
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "demo-topic")
  .option("startingOffsets", "earliest")
  .load()
And kafka-writer:
val queryX = df
  .select(to_json(struct("*")).as("value"))
  .withColumn("key", lit("key"))
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("update")
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "result-data")
  .option("checkpointLocation", "/temp/checkpoint")
  .start()
queryX.awaitTermination()
checkpointLocation is set to a directory in my local file system.
Is there anything that I am doing wrong here?

Spark-Streaming hangs with kafka starting offset at earliest (Kafka 2, spark 2.4.3)

I'm having an issue with Spark Streaming and Kafka. While running a sample program that consumes from a Kafka topic and outputs micro-batched results to the terminal, my job seems to hang when I set the option:
df.option("startingOffsets", "earliest")
Starting the job from the latest offset works fine: results are printed to the terminal as each micro-batch streams through.
I thought this might be a resources issue, since I'm trying to read from a topic with quite a bit of data. However, I don't seem to have memory/CPU issues (running this job with a local[*] cluster). The job never really seems to start; it just hangs on the line:
19/09/17 15:21:37 INFO Metadata: Cluster ID: JFXVL24JQ3K4CEbE-VA58A
val sc = new SparkConf().setMaster("local[*]").setAppName("spark-test")
val streamContext = new StreamingContext(sc, Seconds(1))
val spark = SparkSession.builder().appName("spark-test")
  .getOrCreate()
val topic = "topic.with.alotta.data"
// subscribe to Kafka
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
// write
df.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
I'd expect to see results printed to the console, but the application just seems to hang, as I mentioned. Any thoughts? It feels like a Spark resource issue (because I'm running a local "cluster" against a topic that has a lot of data). Is there something about the nature of streaming dataframes that I'm missing?
Writing to the console sink causes all of the output to be collected in the driver's memory on every trigger. Since you're currently not limiting the size of your batches, the entire contents of the topic are being accumulated in the driver. See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#output-sinks
Setting a limit on your batch sizes should fix your issue.
Try adding the maxOffsetsPerTrigger setting when reading from Kafka:
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "127.0.0.1:9092")
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .load()
See https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html for details.

How to load all records from kafka topic using spark in batch mode

I want to load all records from a Kafka topic using Spark, but all the examples I have seen use Spark Streaming. How can I load the messages from Kafka exactly once?
Exact steps are listed in the official documentation, for example:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
However, "all records" is rather poorly defined if the source is a continuous stream, as the result depends on the point in time at which the query is executed.
Additionally, keep in mind that parallelism is limited by the number of partitions of the Kafka topic, so you have to be careful not to overwhelm the cluster.
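If the time dependence matters, one option is to pin the range explicitly. In batch queries, startingOffsets and endingOffsets also accept a JSON string of per-partition offsets (in that JSON, -2 means earliest and -1 means latest). The helper below is a hypothetical sketch that builds such a string; the object name, topic name, and offsets are made up for illustration. As far as I know, the JSON has to cover every partition of the subscribed topics.

```scala
object KafkaOffsets {
  // Builds the per-partition offsets JSON accepted by the Kafka source's
  // startingOffsets / endingOffsets options, e.g. {"events":{"0":23,"1":-1}}.
  // Partitions are sorted so the output is deterministic.
  def offsetsJson(topic: String, offsets: Map[Int, Long]): String = {
    val parts = offsets.toSeq
      .sortBy { case (partition, _) => partition }
      .map { case (partition, offset) => s""""$partition":$offset""" }
      .mkString(",")
    s"""{"$topic":{$parts}}"""
  }

  // Hypothetical usage with the batch reader shown above:
  //   .option("subscribe", "events")
  //   .option("startingOffsets", "earliest")
  //   .option("endingOffsets", offsetsJson("events", Map(0 -> 1500L, 1 -> 1720L)))
}
```

With fixed ending offsets, re-running the query reads the same range of records, as long as they have not yet been removed by the topic's retention policy.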

Kafka does not create consumer group when reading from single partition

I am reading from Kafka using Spark Structured Streaming. Below is my read stream:
val inputDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", brokers)
  //.option("subscribe", topic)
  .option("assign"," {\""+topic+"\":[0]}")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 60000)
  .load()
I'm trying to see my consumer group in Kafka using the command below. What I've noticed is that when I read from only a single partition, no consumer group is created, but when I read from the topic (in my case the topic has 3 partitions) without specifying partitions explicitly, consumer groups are created. Is this expected behavior? If yes, why?
./kafka-consumer-groups.sh --bootstrap-server kfk01.sboxdc.com:9092,kfk02.sboxdc.com:9092,kfk03.sboxdc.com:9092 --list

Spark Dataframe to Kafka

I am trying to stream a Spark dataframe to a Kafka consumer, but I am unable to do so. Can you please advise me?
I am able to pick up the data from the Kafka producer in Spark, and I have performed some manipulation. After manipulating the data, I would like to stream it back to Kafka (consumer).
Here is an example of producing to Kafka in streaming; the batch version is almost identical.
Streaming from a source to Kafka:
val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  // the Kafka sink needs a checkpoint location; this path is a placeholder
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
Writing a static dataframe (not streamed from a source) to Kafka:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()
Please keep in mind that:
Each row will be a message.
For the streaming version, the dataframe must be a streaming dataframe. If you have a static dataframe, use the static version.
Take a look at the basic documentation: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
It sounds like you have a static dataframe that isn't streaming from a source.

Resources