Current State:
Today I have built a Spark Structured Streaming application which consumes a single Kafka topic containing JSON messages. Embedded within each Kafka message's value is some information about the source and the schema of the message field. A very simplified version of the message looks something like this:
{
"source": "Application A",
"schema": [{"col_name": "countryId", "col_type": "Integer"}, {"col_name": "name", "col_type": "String"}],
"message": {"countryId": "21", "name": "Poland"}
}
There are a handful of Kafka topics in the system today, and I've deployed one instance of this Spark Structured Streaming application per topic, using the subscribe option. The application applies the topic's unique schema (hacked together by batch reading the first message in the Kafka topic and mapping the schema) and writes the result to HDFS in Parquet format.
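For reference, the schema-mapping hack boils down to something like the sketch below; the col_type-to-Spark-type mapping shown here is just an illustrative assumption, not the full implementation.

import org.apache.spark.sql.types._

// Build a Spark StructType from the embedded "schema" field of the first
// message read from a topic. Only the types used in the example are mapped.
def toStructType(cols: Seq[(String, String)]): StructType =
  StructType(cols.map { case (colName, colType) =>
    val sparkType = colType match {
      case "Integer" => IntegerType
      case "String"  => StringType
      case other     => throw new IllegalArgumentException(s"Unmapped col_type: $other")
    }
    StructField(colName, sparkType, nullable = true)
  })

// For the example message above:
// toStructType(Seq("countryId" -> "Integer", "name" -> "String"))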
Desired State:
My organization will soon start producing more and more topics, and I don't think this pattern of one Spark application per topic will scale well. Initially it seemed that the subscribePattern option would work well for me, as these topics have something of a hierarchy, but now I'm stuck on applying the per-topic schema and writing to distinct locations in HDFS.
In the future we will most likely have thousands of topics and hopefully only 25 or so Spark Applications.
Does anyone have advice on how to accomplish this?
When sending these events with your Kafka producer, you could also send a key along with the value. If every event had its event type as the key, then when reading the stream from the topic(s) you could also get the key:
import spark.implicits._ // needed for .as[(String, String)]

val kafkaKvPair = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
Then you could just filter on which events you want to process:
val events = kafkaKvPair
.filter(f => f._1 == "MY_EVENT_TYPE")
In this way if you are subscribed to multiple topics within one Spark app, you could process as many event types as you wish.
If you are running Kafka 0.11+ (and Spark 3.0+ on the consuming side), consider using the headers functionality. With the includeHeaders option enabled, headers come across as an array of key/value structs, and you can then route messages based on a header without having to parse the body first.
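A minimal sketch of what header-based routing could look like, assuming Spark 3.0+ (where the Kafka source supports includeHeaders) and a hypothetical eventType header set by the producer:

import org.apache.spark.sql.functions._

// Read the Kafka headers alongside the value (Spark 3.0+ only).
// "eventType" is a hypothetical header name set by the producer.
val withHeaders = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
  // headers arrive as an array of (key, value) structs; pull out one header
  .select(
    col("value").cast("string").as("value"),
    expr("filter(headers, h -> h.key = 'eventType')[0].value").cast("string").as("eventType")
  )

// Route without parsing the body first:
val myEvents = withHeaders.filter(col("eventType") === "MY_EVENT_TYPE")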
Related
I implemented a Spark job that reads a stream from a Kafka topic using foreachBatch in Structured Streaming.
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "mykafka.broker.io:6667")
.option("subscribe", "test-topic")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.ssl.truststore.location", "/home/hadoop/cacerts")
.option("kafka.ssl.truststore.password", tspass)
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.sasl.kerberos.service.name", "kafka")
.option("kafka.sasl.mechanism", "GSSAPI")
.option("groupIdPrefix","MY_GROUP_ID")
.load()
val streamservice = df.selectExpr("CAST(value AS STRING)")
.select(from_json(col("value"), schema).as("data"))
.select("data.*")
val stream_df = streamservice
  .selectExpr("cast(id as string) id", "cast(x as int) x")

val monitoring_stream = stream_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!batchDF.isEmpty) { }
  }
  .start()

monitoring_stream.awaitTermination()
I have the following questions.
If the Kafka topic does not have data for a long time, will stream_df.writeStream be terminated automatically? Is there some timeout control for this?
If the Kafka topic is deleted from the Kafka broker, will stream_df.writeStream be terminated?
I hope that the Spark job keeps monitoring the Kafka topic without terminating in the above two cases. Do I need some special settings for the Kafka connector and/or stream_df.writeStream?
If the Kafka topic does not have data for a long time, will stream_df.writeStream be terminated automatically? Is there some timeout control for this?
The termination of the query is independent of the data being processed. Even if no new messages are produced to your Kafka topic, the query will keep running, as it is running as a stream.
I guess that is what you have already figured out yourself while testing. We are using Structured Streaming queries to process data from Kafka, and they have no issues being idle for a longer time (for example over the weekend, outside of business hours).
If the Kafka topic is deleted from the Kafka broker, will stream_df.writeStream be terminated?
By default, if you delete the Kafka topic while your query is running, an exception is thrown:
ERROR MicroBatchExecution: Query [id = b1f84242-d72b-4097-97c9-ee603badc484, runId = 752b0fe4-2762-4fff-8912-f4cffdbd7bdc] terminated with error
java.lang.IllegalStateException: Partition test-0's offset was changed from 1 to 0, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
I mentioned "by default" because the query option failOnDataLoss default to true. As explained in the Exception message you could set this to false to let your streaming query running. This option is described in the Structured streaming + Kafka Integration Guide as:
"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected."
I would like to run two Spark Structured Streaming jobs in the same EMR cluster to consume the same Kafka topic. Both jobs are in the running status; however, only one job can get the Kafka data. My configuration for the Kafka part is as follows.
.format("kafka")
.option("kafka.bootstrap.servers", "xxx")
.option("subscribe", "sametopic")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.ssl.truststore.location", "./cacerts")
.option("kafka.ssl.truststore.password", "changeit")
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.sasl.kerberos.service.name", "kafka")
.option("kafka.sasl.mechanism", "GSSAPI")
.load()
I did not set the group.id. I guess the same group id being used in the two jobs causes this issue. However, when I set the group.id, it complains that "user-specified consumer groups are not used to track offsets." What is the correct way to solve this problem? Thanks!
You need to run Spark 3.x, which adds the kafka.group.id option.
From https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
kafka.group.id
The Kafka group id to use in Kafka consumer while reading from Kafka.
Use this with caution. By default, each query generates a unique group
id for reading data. This ensures that each Kafka source has its own
consumer group that does not face interference from any other
consumer, and therefore can read all of the partitions of its
subscribed topics. In some scenarios (for example, Kafka group-based
authorization), you may want to use a specific authorized group id to
read data. You can optionally set the group id. However, do this with
extreme caution as it can cause unexpected behavior. Concurrently
running queries (both, batch and streaming) or sources with the same
group id are likely interfere with each other causing each query to
read only part of the data. This may also occur when queries are
started/restarted in quick succession. To minimize such issues, set
the Kafka consumer session timeout (by setting option
"kafka.session.timeout.ms") to be very small. When this is set, option
"groupIdPrefix" will be ignored.
We have a use case where we are trying to consume multiple Kafka topics (Avro messages) integrated with Schema Registry.
We are using Spark Structured Streaming (Spark version: 2.4.4) and Confluent Kafka (library version: 5.4.1) for this:
val kafkaDF: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<bootstrap.server>:9092")
.option("subscribe", [List of Topics]) // multi topic subscribe
.load()
We select the value from the above DataFrame and use the schema to deserialize the Avro message:
val tableDF = kafkaDF.select(from_avro(col("value"), jsonSchema).as("table")).select("*")
The roadblock here is that, since we are using multiple Kafka topics, we have integrated all our JSON schemas into a Map, with the key being the topic name and the value being the respective schema.
How can we do a lookup into that same map in the select above? We tried a UDF, but it returns a Column type, whereas jsonSchema has to be a String. Also, the schema is different for different topics.
A couple of additional questions:
Is the above approach correct for consuming multiple topics at once?
Should we use single consumer per topic?
If we have a larger number of topics, how can we achieve parallel topic processing, since sequential processing might take a substantial amount of time?
Without checking it all, you appear to be OK on the basics with from_avro, from_json, etc.
https://sparkbyexamples.com/spark/spark-streaming-consume-and-produce-kafka-messages-in-avro-format/ can guide you on the first part. This is also very good: https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read#filter_topic.
I would do select("table.*").
For multiple topics with multiple schemas, read the various versions from .avsc files or code them yourself.
For consuming multiple topics in a single Spark Streaming app, the question is how many per app? There is no hard rule beyond the obvious ones, such as large vs. small consumption volumes and whether ordering matters; executor resources can be relinquished when not needed.
Then you need to process all the topics separately, imho, by filtering on topic using the foreachBatch paradigm, as in the sketch below; you can fill in the details, as I am a little rushed.
Not sure how you are writing the data out at rest, but the question is not about that.
Similar to this, then to process multiple Topics:
...
... // Need to get Topic first
stream.toDS()
  .select($"topic", $"value")
  .writeStream
  .foreachBatch((dataset, _) => {
    dataset.persist() // need this

    dataset.filter($"topic" === topicA)
      .select(from_avro(col("value"), jsonSchemaTA).as("tA"))
      .select("tA.*")
      .write.format(...).save(...)

    dataset.filter($"topic" === topicB)
      .select(from_avro(col("value"), jsonSchemaTB).as("tB"))
      .select("tB.*")
      .write.format(...).save(...)
    ...
    ...
    dataset.unpersist()
    ...
  })
  .start().awaitTermination()
but blend with this excellent answer: Integrating Spark Structured Streaming with the Confluent Schema Registry
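If the schemas already live in a Map keyed by topic name (as in the question), one hedged way to avoid hard-coding each branch is to drive the same foreachBatch pattern from that map. In the sketch below, schemaByTopic and the output path are assumed names, the Confluent wire-format detail covered by the linked Schema Registry answer is ignored, and kafkaDF is the multi-topic stream from the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.avro.from_avro // Spark 2.4; in 3.x it lives in org.apache.spark.sql.avro.functions
import org.apache.spark.sql.functions.col

// Assumed: topic name -> Avro schema (JSON string), built elsewhere.
val schemaByTopic: Map[String, String] = ???

// Split each micro-batch by topic and decode every slice with its own schema.
val processBatch: (DataFrame, Long) => Unit = (batch, _) => {
  batch.persist()
  schemaByTopic.foreach { case (topic, schemaJson) =>
    batch.filter(col("topic") === topic)
      .select(from_avro(col("value"), schemaJson).as("t"))
      .select("t.*")
      .write.format("parquet").save(s"/output/$topic") // hypothetical output location
  }
  batch.unpersist()
}

kafkaDF
  .selectExpr("topic", "value")
  .writeStream
  .foreachBatch(processBatch)
  .start()
  .awaitTermination()

Each micro-batch is split by topic and decoded with the schema looked up from the map; the parallelism question then mostly comes down to how many topics you pack into one query and how many executors you give it.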
Given a DataStreamReader configured to subscribe to multiple topics like this (see here):
// Subscribe to multiple topics
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1,topic2,topic3")
When I use foreachBatch on top of this, what will the batches contain?
Each batch will only contain messages from one topic?
Or can a batch contain messages coming from different topics?
In my use case, I'd like to have batches with messages coming from one topic only. Is it possible to configure this?
Quoting the official documentation in Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):
// Subscribe to multiple topics
...
.option("subscribe", "topic1,topic2")
The code above is what the underlying Kafka consumer (of the streaming query) subscribes to.
When I use foreachBatch on top of this, what will the batches contain?
Or can a batch contain messages coming from different topics?
That's the proper answer: a batch can contain messages coming from any of the subscribed topics.
I'd like to have batches with messages coming from one topic only. Is it possible to configure this?
That's also documented in Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher):
Each row in the source has the following schema:
...
topic
In other words, the input Dataset will have topic column with the name of the topic a given row (record) comes from.
In order to have "batches with messages coming from one topic only" you simply filter (or where) on the one topic, e.g.
val messages: DataFrame = ...
assert(messages.isStreaming)
messages
  .writeStream
  .foreachBatch { case (df, batchId) =>
    val topic1Only = df.where($"topic" === "topic1")
    val topic2Only = df.where($"topic" === "topic2")
    ...
  }
The batch will contain messages coming from all the topics (I'd say partitions, instead) that your consumer is subscribed to.
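A quick way to see this for yourself, assuming messages is the streaming DataFrame from the multi-topic subscription above (just a sketch):

messages
  .writeStream
  .foreachBatch { (df: DataFrame, batchId: Long) =>
    // Lists every topic present in this micro-batch; with several
    // subscribed topics you will typically see more than one.
    df.select("topic").distinct().show()
  }
  .start()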
I have a use case in which I have to subscribe to multiple Kafka topics in Spark Structured Streaming. I then have to parse each message and build a Delta Lake table out of it. I have written the parser, and the messages (in the form of XML) are parsed correctly and the Delta Lake table is formed. However, I am only subscribing to one topic as of now. I want to subscribe to multiple topics and, based on the topic, each message should go to the parser made specifically for that topic. So basically I want to identify the topic name of every message as it is processed, so that I can send it to the desired parser for further processing.
This is how I am accessing the messages from the different topics. However, I have no idea how to identify the source topic of the incoming messages while processing them.
val stream_dataframe = spark.readStream
.format(ConfigSetting.getString("source"))
.option("kafka.bootstrap.servers", ConfigSetting.getString("bootstrap_servers"))
.option("kafka.ssl.truststore.location", ConfigSetting.getString("trustfile_location"))
.option("kafka.ssl.truststore.password", ConfigSetting.getString("truststore_password"))
.option("kafka.sasl.mechanism", ConfigSetting.getString("sasl_mechanism"))
.option("kafka.security.protocol", ConfigSetting.getString("kafka_security_protocol"))
.option("kafka.sasl.jaas.config",ConfigSetting.getString("jass_config"))
.option("encoding",ConfigSetting.getString("encoding"))
.option("startingOffsets",ConfigSetting.getString("starting_offset_duration"))
.option("subscribe",ConfigSetting.getString("topics_name"))
.option("failOnDataLoss",ConfigSetting.getString("fail_on_dataloss"))
.load()
var cast_dataframe = stream_dataframe.select(col("value").cast(StringType))
cast_dataframe = cast_dataframe.withColumn("parsed_column",parser(col("value"))) // Parser is the udf, made to parse the xml from the topic.
How can I identify the topic name of the messages as they are processed in Spark Structured Streaming?
As per the official documentation (emphasis mine)
Each row in the source has the following schema:
Column Type
key binary
value binary
topic string
partition int
...
As you can see, the input topic is part of the source schema and can be accessed without any special actions.
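A minimal sketch of how the topic column could drive the routing in the question's code; parserA/parserB and the topic names here are assumed stand-ins for the dedicated parser UDFs:

import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Keep the topic next to the value so each message can be routed to the
// parser written for its topic. parserA and parserB are hypothetical UDFs.
val with_topic = stream_dataframe.select(
  col("topic"),
  col("value").cast(StringType).as("value")
)

val parsed_dataframe = with_topic.withColumn(
  "parsed_column",
  when(col("topic") === "topic-a", parserA(col("value")))
    .when(col("topic") === "topic-b", parserB(col("value")))
)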