Given that the data from both topics is joined at one point and finally sent to a Kafka sink, which is the better way to read from multiple topics?
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t1,t2")
vs
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t1")
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t2")
Somewhere later I will do df1.join(df2) and send the result to the Kafka sink.
Which would be the better option here with respect to performance and resource usage?
Thanks in advance
PS: I see another similar question, Spark structured streaming app reading from multiple Kafka topics, but there the dataframes from the two topics do not seem to be used together.
In the first approach, you'd have to add a filter at some point and then proceed with the join. Unless you also want to process both streams together, the second approach is a bit more performant and easier to maintain.
I'd say approach 2 is the straightforward one and skips a filter stage, hence it is a little more efficient. It also gives both streams autonomy from an infrastructure point of view, for example if one of the topics were to move to a new Kafka cluster. You also don't have to account for unevenness between the two topics, for example in the number of partitions, which makes job tuning easier.
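To make the comparison concrete, here is a minimal sketch of the filter step the first approach needs before the join (the `topic` column is provided by the Kafka source; the join key below is a hypothetical placeholder, not something from the question):

import org.apache.spark.sql.functions.col

val both = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("subscribe", "t1,t2")
.load()

// Split the combined stream back out by the topic each record came from
val fromT1 = both.filter(col("topic") === "t1")
val fromT2 = both.filter(col("topic") === "t2")
// ...then fromT1.join(fromT2, Seq("key")) as in the question, where "key" is a hypothetical join column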
I have a scenario where I would like to save the same streaming dataframe to two different streaming sinks.
I have created a streaming dataframe which I need to send to both Kafka topic and delta lake.
I thought of using foreachBatch, but it looks like it doesn't support multiple streaming sinks.
Also, I tried using spark.streams.awaitAnyTermination() with multiple write streams, but the second stream is not getting processed!
Is there a way to achieve this?
This is my code:
I am reading from a Kafka stream and creating a single streaming dataframe.
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "ingestionTopic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
Writing the above dataframe to a Kafka topic:
val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9082")
.option("topic", "outputTopic1")
.start()
Writing the same streaming dataframe to Delta Lake:
val ds2 = df.format("delta")
.outputMode("append")
.option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
.start("/test/delta/events")
ds1.awaitTermination
ds2.awaitTermination
There are a few things you need to follow to use one input stream for multiple output streams:
You need to make sure to use two different checkpointLocations for the two output streams.
Furthermore, you need to ensure that your second output query also has the writeStream call.
Finally, it is important to start both queries before waiting for their termination (which you are already doing).
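A minimal sketch of how the two output queries might look after those fixes, reusing the paths and names from the question (the checkpoint path for the Kafka query is a hypothetical addition; any distinct path will do):

val ds1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9082")
.option("topic", "outputTopic1")
.option("checkpointLocation", "/test/kafka/outputTopic1/_checkpoints/etlflow") // distinct from the Delta checkpoint
.start()

val ds2 = df.writeStream // note the writeStream call on the second query
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/test/delta/events/_checkpoints/etlflow")
.start("/test/delta/events")

ds1.awaitTermination()
ds2.awaitTermination()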
I've set up a Spark structured streaming query that reads from a Kafka topic.
If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice and data on new partitions is not consumed.
Is there a way to tell Spark to check for new partitions in the same topic apart from stopping the query and restarting it?
EDIT:
I'm using Spark 2.4.4. I read from Kafka as follows:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaURL)
.option("startingOffsets", "earliest")
.option("subscribe", topic)
.option("failOnDataLoss", value = false)
.load()
After some processing, I write to a Delta Lake table on HDFS.
Answering my own question: Kafka consumers check for new partitions (and, in case of subscribing to topics with a pattern, new topics) every metadata.max.age.ms, whose default value is 300000 ms (5 minutes).
Since my test lasted far less than that, I didn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option on the DataStreamReader:
.option("kafka.metadata.max.age.ms", 100)
I want to load all records from a Kafka topic using Spark, but all the examples I have seen use Spark Streaming. How can I load messages from Kafka exactly once?
The exact steps are listed in the official documentation, for example:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
However "all records" is rather poorly defined if the source is continuous stream, as the result depends on the point in time, when query is executed.
Additionally, you should keep in mind that parallelism is limited by the number of partitions of the Kafka topic, so you have to be careful not to overwhelm the cluster.
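If you need "all records" to be well defined and reproducible, you can pin the exact range with per-partition offsets in JSON instead of the earliest/latest shorthands (a sketch; the topic name, partition numbers, and offset values are made-up examples):

// Batch read over an explicitly bounded offset range, so repeated runs see the same data
val fixedRange = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("startingOffsets", """{"topic1":{"0":0,"1":0}}""")
.option("endingOffsets", """{"topic1":{"0":5000,"1":5000}}""")
.load()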
I have started using Spark Structured Streaming.
I read a stream from a Kafka topic (startingOffsets: latest),
apply a watermark,
group by event time with a window duration,
and write to a Kafka topic.
My question is:
How can I handle the data that was written to the Kafka topic before the Spark Structured Streaming job started?
I tried to run with startingOffsets: earliest at first, but the data in the Kafka topic is too large, so the Spark streaming process does not start because of a YARN timeout (even though I increased the timeout value).
If I simply create a batch job and filter by a specific data range, the result is not reflected in the current state of the streaming job, so there seems to be a problem with the consistency and accuracy of the result.
I tried to reset the checkpoint directory, but it did not work.
How can I handle this old and large data?
Help me.
You can try the parameter maxOffsetsPerTrigger for Kafka + Structured Streaming to receive old data from Kafka. Set this parameter to the number of records you want to receive from Kafka per trigger (micro-batch).
Use:
sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test-name")
.option("startingOffsets", "earliest") // replaces the consumer's auto.offset.reset; Spark manages its own group id
.option("maxOffsetsPerTrigger", 1) // upper bound on records per micro-batch; use a realistic value in practice
.load()
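A possible write side to pair with this (a sketch only; the dataframe name, output topic, and checkpoint path are hypothetical, and the windowed aggregate is assumed to already be serialized into a value column):

// With a checkpoint, the throttled query works through the old data one bounded
// micro-batch at a time and resumes where it left off if the job is restarted.
val query = aggregatedDf.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "output-topic")
.option("checkpointLocation", "/tmp/checkpoints/backfill")
.outputMode("update")
.start()

query.awaitTermination()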
My ultimate goal is to check whether a Kafka topic is running and whether the data in it is good, and otherwise fail / throw an error.
If I could pull just 100 messages, or pull for just 60 seconds, I think I could accomplish what I want. But all the streaming examples / questions I have found online have no intention of shutting down the streaming connection.
Here is the best working code I have so far. It pulls data and displays it, but it keeps trying to pull more data, and if I try to access it on the next line, it hasn't had a chance to pull the data yet. I assume I need some sort of callback. Has anyone done something similar? Is this the best way of going about this?
I am using Databricks notebooks to run my code.
import org.apache.spark.sql.functions.{explode, split}
val kafka = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<kafka server>:9092")
.option("subscribe", "<topic>")
.option("startingOffsets", "earliest")
.load()
val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
display(df.select($"word"))
The trick is that you don't need streaming at all. The Kafka source supports batch queries if you replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
.read
.format("kafka")
... // Remaining options
.load()
You can find examples in the Kafka streaming documentation.
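A fuller sketch of such a batch query for the goal of checking just a sample of the topic (the server and topic placeholders come from the question, and the 100-record limit matches what was asked for; the rest is an assumption):

import org.apache.spark.sql.functions.{explode, split}

// A batch read runs once and returns; no streaming query is left running
val sample = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "<kafka server>:9092")
.option("subscribe", "<topic>")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.load()
.limit(100) // inspect only a handful of records

val words = sample.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
display(words) // completes immediately, so the next cell can use `words` right away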
For streaming queries you can use the once trigger, although it might not be the best choice in this case:
import org.apache.spark.sql.streaming.Trigger

df.writeStream
.trigger(Trigger.Once)
... // Handle the output, for example with foreach sink (?)
You could also use the standard Kafka client to fetch some data without starting a SparkSession.
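A rough sketch of that last option with the plain Kafka consumer API (the broker address and topic are placeholders, and the 100-record / 60-second bounds simply mirror what the question asked for):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.mutable.ArrayBuffer

val props = new Properties()
props.put("bootstrap.servers", "<kafka server>:9092")
props.put("group.id", "topic-health-check")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("<topic>"))

// Poll until 100 records have been collected or 60 seconds have elapsed, then stop
val deadline = System.currentTimeMillis() + 60000L
val values = ArrayBuffer.empty[String]
while (values.size < 100 && System.currentTimeMillis() < deadline) {
  val batch = consumer.poll(Duration.ofSeconds(5))
  batch.forEach(record => values += record.value())
}
consumer.close()

if (values.isEmpty) throw new RuntimeException("No data received from <topic>")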