We have a use case where we are trying to consume multiple Kafka topics (AVRO messages) integrated with the Schema Registry.
We are using Spark Structured Streaming (Spark version: 2.4.4) and Confluent Kafka (library version: 5.4.1) for this:
val kafkaDF: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<bootstrap.server>:9092")
  .option("subscribe", [List of Topics]) // multi-topic subscribe: a comma-separated list of topic names
  .load()
We are selecting the value from the above DataFrame and using the schema to de-serialize the AVRO message:
val tableDF = kafkaDF.select(from_avro(col("value"), jsonSchema).as("table")).select("*")
The roadblock here is that, since we are using multiple Kafka topics, we have collected all our JSON schemas into a Map with the key being the topic name and the value being the respective schema.
How can we do a lookup against that map in the select above? We tried a UDF, but it returns a Column, whereas the jsonSchema argument has to be a String. Also, the schema is different for each topic.
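For illustration, the map looks roughly like this (topic names and schemas below are just placeholders):

val schemaByTopic: Map[String, String] = Map(
  "topicA" -> """{"type":"record","name":"TableA","fields":[...]}""",
  "topicB" -> """{"type":"record","name":"TableB","fields":[...]}"""
)

from_avro expects the schema as a plain String when the query is built, which is why the per-row UDF lookup did not work for us.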
Couple of added questions:
Is the above approach correct for consuming multiple topics at once?
Should we use single consumer per topic?
If we have a larger number of topics, how can we achieve parallel topic processing, since sequential processing might take a substantial amount of time?
Without checking it all, you appear to be OK on the basics with from_avro, from_json, etc.
https://sparkbyexamples.com/spark/spark-streaming-consume-and-produce-kafka-messages-in-avro-format/ can guide you on the first part. This is also very good: https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read#filter_topic.
I would do table.*
Multiple topics, multiple schemas --> read the multiple schema versions from .avsc files or code them yourself.
For consuming multiple topics in a Spark Streaming app, the question is how many per app? There is no hard rule except the obvious ones, e.g. large vs. small consumption volumes and whether ordering is important. Executor resources can be relinquished.
Then you need to process all the topics separately, imho, like this: filter on topic - you can fill in the details as I am a little rushed - using the foreachBatch paradigm.
Not sure how you are writing the data out at rest, but the question is not about that.
Similar to this, then to process multiple Topics:
...
... // Need to get the topic column first
stream.toDS()
  .select($"topic", $"value")
  .writeStream
  .foreachBatch { (dataset, _) =>
    dataset.persist() // cache the micro-batch so each per-topic filter below does not re-read Kafka
    dataset.filter($"topic" === topicA)
      .select(from_avro(col("value"), jsonSchemaTA).as("tA"))
      .select("tA.*")
      .write.format(...).save(...)
    dataset.filter($"topic" === topicB)
      .select(from_avro(col("value"), jsonSchemaTB).as("tB"))
      .select("tB.*")
      .write.format(...).save(...)
    ...
    ...
    dataset.unpersist()
    ...
  }
  .start().awaitTermination()
but blend with this excellent answer: Integrating Spark Structured Streaming with the Confluent Schema Registry
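If you need to fetch the per-topic schemas rather than hard-code them, a rough sketch with the Confluent client could look like this (registry URL, cache size and topic names are placeholders, and the subject naming assumes the default "<topic>-value" strategy):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

// Look up the latest value schema for each subscribed topic, on the driver
val registryClient = new CachedSchemaRegistryClient("http://<schema-registry>:8081", 128)

val topics = Seq("topicA", "topicB") // placeholder topic names
val schemaByTopic: Map[String, String] = topics.map { topic =>
  topic -> registryClient.getLatestSchemaMetadata(s"$topic-value").getSchema
}.toMap

// schemaByTopic("topicA") can then be passed as the jsonSchemaTA string above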
Related
I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one sink in a streaming query (I'm repeating it to myself to remember it better, as I seem to have been caught out multiple times with Spark Structured Streaming, Kafka Streams and, recently, ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (the Kafka Consumer API requires a different group.id for consumers that are not supposed to share data), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so that the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
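To make that concrete, two knobs usually matter: the number of partitions on the topic, which caps the read parallelism of the single consumer group, and the Kafka source's minPartitions option (available since Spark 2.4, as far as I remember), which splits large Kafka partitions into smaller Spark partitions. A rough sketch with placeholder values:

// Sketch only: ask Spark to create at least 32 input partitions per micro-batch,
// even if the topic has fewer Kafka partitions (server, topic and count are placeholders)
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .option("minPartitions", "32")
  .load()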
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select('value.cast("string")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
Given a DataStreamReader configured to subscribe to multiple topics like this (see here):
// Subscribe to multiple topics
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2,topic3")
When I use foreachBatch on top of this, what will the batches contain?
Each batch will only contain messages from one topic?
Or can a batch contain messages coming from different topics?
In my use case, I'd like to have batches with messages coming from one topic only. Is it possible to configure this?
Quoting the official documentation in Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):
// Subscribe to multiple topics
...
.option("subscribe", "topic1,topic2")
The code above is what the underlying Kafka consumer (of the streaming query) subscribes to.
When I use foreachBatch on top of this, what will the batches contain?
Each batch will only contain messages from one topic?
Or can a batch contain messages coming from different topics?
The latter is the proper answer: a batch can contain messages coming from any of the subscribed topics.
I'd like to have batches with messages coming from one topic only. Is it possible to configure this?
That's also documented in Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):
Each row in the source has the following schema, which includes, among other columns, topic (string).
In other words, the input Dataset will have a topic column holding the name of the topic a given row (record) comes from.
In order to have "batches with messages coming from one topic only" you simply filter (or where) on that one topic, e.g.
val messages: DataFrame = ...
assert(messages.isStreaming)
messages
  .writeStream
  .foreachBatch { case (df, batchId) =>
    val topic1Only = df.where($"topic" === "topic1")
    val topic2Only = df.where($"topic" === "topic2")
    ...
  }
The batch will contain messages coming from all the topics (I'd say partitions, instead) that your consumer is subscribed to.
I have a use case in which I have to subscribe to multiple topics in Kafka with Spark Structured Streaming. I then have to parse each message and build a Delta Lake table from it. I have written the parser, and the messages (in the form of XML) are parsed correctly and the Delta Lake table is formed. However, I am only subscribing to a single topic as of now. I want to subscribe to multiple topics and, based on the topic, each message should go to the parser dedicated to that particular topic. So basically I want to identify the topic name for every message as it is processed, so that I can send it to the desired parser and process it further.
This is how I am accessing the messages from the different topics. However, I have no idea how to identify the source topic of the incoming messages while processing them.
val stream_dataframe = spark.readStream
  .format(ConfigSetting.getString("source"))
  .option("kafka.bootstrap.servers", ConfigSetting.getString("bootstrap_servers"))
  .option("kafka.ssl.truststore.location", ConfigSetting.getString("trustfile_location"))
  .option("kafka.ssl.truststore.password", ConfigSetting.getString("truststore_password"))
  .option("kafka.sasl.mechanism", ConfigSetting.getString("sasl_mechanism"))
  .option("kafka.security.protocol", ConfigSetting.getString("kafka_security_protocol"))
  .option("kafka.sasl.jaas.config", ConfigSetting.getString("jass_config"))
  .option("encoding", ConfigSetting.getString("encoding"))
  .option("startingOffsets", ConfigSetting.getString("starting_offset_duration"))
  .option("subscribe", ConfigSetting.getString("topics_name"))
  .option("failOnDataLoss", ConfigSetting.getString("fail_on_dataloss"))
  .load()
var cast_dataframe = stream_dataframe.select(col("value").cast(StringType))
cast_dataframe = cast_dataframe.withColumn("parsed_column", parser(col("value"))) // parser is the UDF written to parse the XML from the topic
How can I identify the topic name of the messages as they process in spark structured streaming ?
As per the official documentation (emphasis mine)
Each row in the source has the following schema:
Column      Type
key         binary
value       binary
topic       string
partition   int
...
As you can see, the input topic is part of the output schema and can be accessed without any special actions.
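For instance, a minimal sketch of routing by topic in the code above could look like this (parserA/parserB and the topic names are hypothetical stand-ins for the per-topic XML parser UDFs):

import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Keep the topic column alongside the value when casting
val cast_dataframe = stream_dataframe.select(col("topic"), col("value").cast(StringType).as("value"))

// Route each row to the parser dedicated to its source topic
val parsed_dataframe = cast_dataframe.withColumn(
  "parsed_column",
  when(col("topic") === "topic_a", parserA(col("value")))
    .when(col("topic") === "topic_b", parserB(col("value")))
)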
Current State:
Today I have built a Spark Structured Streaming application which consumes a single Kafka topic containing JSON messages. Embedded within the Kafka topic's value is some information about the source and the schema of the message field. A very simplified version of the message looks something like this:
{
  "source": "Application A",
  "schema": [{"col_name": "countryId", "col_type": "Integer"}, {"col_name": "name", "col_type": "String"}],
  "message": {"countryId": "21", "name": "Poland"}
}
There are a handful of Kafka topics in the system today, and I've deployed this Spark Structured Streaming application once per topic, using the subscribe option. The application applies the topic's unique schema (hacked together by batch-reading the first message in the Kafka topic and mapping the schema) and writes the data to HDFS in Parquet format.
Desired State:
My organization will soon start producing more and more topics, and I don't think this pattern of one Spark application per topic will scale well. Initially it seems that the subscribePattern option would work well for me, as these topics have a somewhat hierarchical form, but now I'm stuck on applying the schema and writing to distinct locations in HDFS.
In the future we will most likely have thousands of topics and hopefully only 25 or so Spark Applications.
Does anyone have advice on how to accomplish this?
When sending these events with your Kafka producer, you could also send a key as well as the value. If every event had its event type as the key, then when reading the stream from the topic(s) you could also get the key:
val kafkaKvPair = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
Then you could just filter on which events you want to process:
val events = kafkaKvPair
  .filter(f => f._1 == "MY_EVENT_TYPE")
In this way, if you are subscribed to multiple topics within one Spark app, you can process as many event types as you wish.
If you are running Kafka 0.11+, consider using the headers functionality. Headers come across as an array of key/value pairs (exposed by the Spark Kafka source via the includeHeaders option in Spark 3.0+), and you can then route messages based on their headers without having to parse the body first.
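A rough sketch of what that could look like, assuming Spark 3.0+ and its includeHeaders option (the "eventType" header key is just an example):

import org.apache.spark.sql.functions.{col, expr}

// Sketch only: read the headers column and pick out a routing header without touching the body
val withHeaders = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true") // headers arrive as an array of (key, value) structs
  .load()
  .withColumn("eventType", expr("filter(headers, h -> h.key = 'eventType')[0].value").cast("string"))

val myEvents = withHeaders.filter(col("eventType") === "MY_EVENT_TYPE")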
I am using Kafka and Spark 2.1 Structured Streaming. I have two topics with data in JSON format, e.g.:
topic 1:
{"id":"1","name":"tom"}
{"id":"2","name":"mark"}
topic 2:
{"name":"tom","age":"25"}
{"name":"mark","age:"35"}
I need to compare those two streams in Spark based on the name tag and, when the values are equal, execute some additional definition/function.
How can I use Spark Structured Streaming to do this?
Thanks
Following the current documentation (Spark 2.1.1)
Any kind of joins between two streaming Datasets are not yet
supported.
ref: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
At this moment, I think you need to rely on Spark Streaming as proposed by @igodfried's answer.
I hope you got your solution. In case not, you can try creating two KStreams from the two topics, joining those KStreams, and writing the joined data back to one topic. You can then read the joined data as one DataFrame using Spark Structured Streaming and apply any transformations you want on it. Since Structured Streaming doesn't support joining two streaming DataFrames (in the Spark version in question), you can follow this approach to get the task done.
I faced a similar requirement some time ago: I had two streams which had to be "joined" together based on some criteria. What I used was a function called mapGroupsWithState.
What this function does (in a few words; more details in the reference below) is take a stream in the form of (K,V) and accumulate its elements in a common state, based on the key of each pair. Then you have ways to tell Spark when the state is complete (according to your application), or even set a timeout for incomplete states.
Example based on your question:
Read Kafka topics into a Spark Stream:
val rawDataStream: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", "topic1,topic2") // Both topics on same stream!
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "true")
  .load()
  .selectExpr("CAST(value AS STRING) as jsonData") // Kafka sends bytes
Do some operations on your data (I prefer SQL, but you can use the DataFrame API) to transform each element into a key-value pair:
spark.sqlContext.udf.register("getKey", getKey) // You define this function; I'm assuming you will be using the name as the key in your example.
rawDataStream.createOrReplaceTempView("rawData") // register the stream so it can be queried with SQL
val keyPairsStream = spark
  .sql("select getKey(jsonData) as ID, jsonData from rawData")
  .as[(String, String)] // requires import spark.implicits._
  .groupByKey(_._1) // mapGroupsWithState needs a KeyValueGroupedDataset, hence groupByKey
Use the mapGroupsWithState function (I will show you the basic idea; you will have to define the myGrpFunct according to your needs):
keyPairsStream
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(myGrpFunct)
That's it! If you implement myGrpFunct correctly, you will have one stream of merged data, which you can transform further, like the following:
["tom",{"id":"1","name":"tom"},{"name":"tom","age":"25"}]
["mark",{"id":"2","name":"mark"},{"name":"mark","age:"35"}]
Hope this helps!
An excellent explanation with some code snippets: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming/
One method would be to transform both streams into (K,V) format; in your case this would probably take the form of (name, otherJSONData). See the Spark documentation for more information on joining streams, and an example located here. Then do a join on both streams and perform whatever function you need on the newly joined stream. If needed, you can use map to turn (K,(W,V)) into (K,V).
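As a rough sketch of that recipe with the DStream API (the input streams and the name-extraction step are placeholders you would fill in from your Kafka direct streams):

import org.apache.spark.streaming.dstream.DStream

// Key both streams by name, then join; the result pairs up the records that share a name
def joinByName(topic1Json: DStream[String],
               topic2Json: DStream[String],
               extractName: String => String): DStream[(String, (String, String))] = {
  val left = topic1Json.map(json => (extractName(json), json))
  val right = topic2Json.map(json => (extractName(json), json))
  left.join(right) // (name, (jsonFromTopic1, jsonFromTopic2))
}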