I am using the following consumer code in Spark to read from a Kafka Topic:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("subscribe", topicName)
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
The code reads from the Topic as expected, but the contents of the Topic are not getting flushed out as a result of this read. Repeated execution results in the same set of messages getting returned over and over again.
What should I do to cause the messages to be removed from the Topic upon read?
As crikcet_007 mentioned, Kafka does not remove logs after consumption. You can manage log retention within Kafka using either a size-based policy or time-based settings.
log.retention.bytes - The maximum size of the log before deleting it
log.retention.hours - The number of hours to keep a log file before deleting it
log.retention.minutes - The number of minutes to keep a log file
log.retention.ms - The number of milliseconds to keep a log file
You can read more about these parameters here
On top of that, an additional mechanism for handling log retention is log compaction. You can manage log compaction by setting the following parameters:
log.cleanup.policy
log.cleaner.min.compaction.lag.ms
You can read more about that here
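These are broker-level defaults; the same behaviour can also be set per topic (retention.ms, retention.bytes, cleanup.policy, min.compaction.lag.ms). As a rough sketch, assuming a local broker and a hypothetical topic name, the per-topic overrides can be applied with Kafka's AdminClient:
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
import org.apache.kafka.common.config.ConfigResource

// Assumed broker address; adjust to your cluster.
val adminProps = new Properties()
adminProps.put("bootstrap.servers", "localhost:9092")
val admin = AdminClient.create(adminProps)

// Hypothetical topic name; retention.ms / cleanup.policy are the per-topic
// counterparts of the broker-level log.* settings listed above.
val topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic")
val ops: java.util.Collection[AlterConfigOp] = java.util.Arrays.asList(
  new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),  // keep ~7 days
  new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET)   // enable compaction
)
admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get()
admin.close()
Either way, note that retention and compaction run on the broker's own schedule; they are not triggered by a consumer reading the topic, so the read in the question will never remove messages by itself.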
Kafka doesn't remove topic messages when consumed
Your Spark code is part of a Kafka consumer group. A consumer would normally need to acknowledge that a message has been read and commit those offsets, which I believe Spark does on its own, periodically, by default. You can disable this by setting the option enable.auto.commit to false, which is highly recommended because you will want to control whether Spark has successfully processed a collection of records.
Checkpointing or committing offsets to a durable store are some ways to preserve your offsets in the event of a restart or failure of a task, so that the same data is not re-read.
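As a rough sketch of that, assuming the streaming variant of the query in the question and a hypothetical checkpoint path, enabling checkpointing on the write side is what lets Spark resume from the last processed offsets instead of re-reading the same messages:
val query = spark
  .readStream                                   // streaming read instead of the one-shot batch read above
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", topicName)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/my-query")  // assumed path; Spark stores its offsets here
  .start()

query.awaitTermination()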
Related
I would like to run 2 Spark Structured Streaming jobs in the same EMR cluster to consume the same Kafka topic. Both jobs are in the running status. However, only one job can get the Kafka data. My configuration for the Kafka part is as follows.
.format("kafka")
.option("kafka.bootstrap.servers", "xxx")
.option("subscribe", "sametopic")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.ssl.truststore.location", "./cacerts")
.option("kafka.ssl.truststore.password", "changeit")
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.sasl.kerberos.service.name", "kafka")
.option("kafka.sasl.mechanism", "GSSAPI")
.load()
I did not set the group.id. I guess the same group id being used in the two jobs causes this issue. However, when I set the group.id, it complains that "user-specified consumer groups are not used to track offsets." What is the correct way to solve this problem? Thanks!
You need to run Spark v3.
From https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
kafka.group.id
The Kafka group id to use in Kafka consumer while reading from Kafka.
Use this with caution. By default, each query generates a unique group
id for reading data. This ensures that each Kafka source has its own
consumer group that does not face interference from any other
consumer, and therefore can read all of the partitions of its
subscribed topics. In some scenarios (for example, Kafka group-based
authorization), you may want to use a specific authorized group id to
read data. You can optionally set the group id. However, do this with
extreme caution as it can cause unexpected behavior. Concurrently
running queries (both, batch and streaming) or sources with the same
group id are likely to interfere with each other, causing each query to
read only part of the data. This may also occur when queries are
started/restarted in quick succession. To minimize such issues, set
the Kafka consumer session timeout (by setting option
"kafka.session.timeout.ms") to be very small. When this is set, option
"groupIdPrefix" will be ignored.
I am running a pyspark job and the data streaming is from Kafka.
I am trying to replicate a scenario in my windows system to find out what happens when the consumer goes down while the data is continuously being fed into Kafka.
Here is what I expect.
producer is started and produces message 1, 2 and 3.
consumer is online and consumes messages 1, 2 and 3.
Now the consumer goes down for some reason while the producer produces messages 4, 5 and 6 and so on...
When the consumer comes back up, my expectation is that it should read from where it left off. So the consumer must be able to read from message 4, 5, 6 and so on...
My pyspark application is not able to achieve what I expect. Here is how I created the Spark session.
session.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "clickapijson") \
    .option("startingOffsets", "latest") \
    .load()
I googled and gathered quite a bit of information. It seems like the groupID is relevant here. Kafka keeps track of the offsets read by each consumer in a particular groupID. If a consumer subscribes to a topic with a groupID, say, G1, Kafka registers this groupID and consumerID and keeps track of them. If the consumer goes down for some reason and restarts with the same groupID, then Kafka will have the information about the already-read offsets, so the consumer will read the data from where it left off.
This is exactly what happens when I use the following command to invoke the consumer job in the CLI.
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic "clickapijson" --consumer-property group.id=test
Now when my producer produces messages 1, 2 and 3, the consumer is able to consume them. I killed the running consumer job (CLI .bat file) after the 3rd message was read. My producer then produces messages 4, 5 and 6 and so on...
Now I bring back my consumer job (CLI .bat file) and it is able to read the data from where it left off (from message 4). This is behaving as I expect.
I am unable to do the same thing in pyspark.
When I include the option("group.id", "test"), it throws an error saying "Kafka option group.id is not supported as user-specified consumer groups are not used to track offsets".
Upon observing the console output, each time my pyspark consumer job is kicked off, it is creating a new groupID. If my pyspark job has run previously with a groupID and failed, when it is restarted it is not picking up the same old groupID. It is randomly getting a new groupID. Kafka has the offset information of the previous groupID but not the current newly generated groupID. Hence my pyspark application is not able to read the data fed into Kafka while it was down.
If this is the case, then won't I lose data when the consumer job goes down due to some failure?
How can I give my own groupID to the pyspark application, or how can I restart my pyspark application with the same old groupID?
In the current Spark version (2.4.5) it is not possible to provide your own group.id, as it gets automatically created by Spark (as you already observed). The full details on offset management when Spark reads from Kafka are given here and summarised below:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
group.id: Kafka source will create a unique group id for each query automatically.
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off.
enable.auto.commit: Kafka source doesn’t commit any offset.
For Spark to be able to remember where it left off reading from Kafka, you need to have checkpointing enabled and provide a path location to store the checkpointing files. In Python this would look like:
aggDF \
.writeStream \
.outputMode("complete") \
.option("checkpointLocation", "path/to/HDFS/dir") \
.format("memory") \
.start()
More details on checkpointing are given in the Spark docs on Recovering from Failures with Checkpointing.
I am executing a batch job in pyspark, where Spark will read data from a Kafka topic every 5 minutes.
df = spark \
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1") \
.option("subscribePattern", "test") \
.option("startingOffsets", "earliest") \
.option("endingOffsets", "latest") \
.load()
Whenever Spark reads data from Kafka, it reads all the data, including previous batches.
I want to read only the data for the current batch, i.e. the latest records that have not been read before.
Please suggest! Thank you.
From https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
For batch queries, latest (either implicitly or by using -1 in json)
is not allowed.
Using earliest means all the data is obtained again.
You will need to define the offsets explicitly every time you run, e.g.:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
That implies you need to save the offsets processed per partition. I am looking into this in the near future myself for a project. Some items below may help (see the sketch after these references):
https://medium.com/datakaresolutions/structured-streaming-kafka-integration-6ab1b6a56dd1 stating what you observe:
Create a Kafka Batch Query
Spark also provides a feature to fetch the
data from Kafka in batch mode. In batch mode Spark will consume all
the messages at once. Kafka in batch mode requires two important
parameters Starting offsets and ending offsets, if not specified spark
will consider the default configuration which is,
startingOffsets — earliest
endingOffsets — latest
https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark alludes as well to what you should do, with the following:
And, finally, save these Kafka topic endOffsets to file system – local or HDFS (or commit them to ZooKeeper). This will be used for the
next run of starting the offset for a Kafka topic. Here we are making
sure the job's next run will read from the offset where the previous
run left off.
This blog https://dataengi.com/2019/06/06/spark-structured-streaming/ I think has the answer for saving offsets.
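As mentioned above, here is a sketch of that idea in Scala: after the batch read, record the highest offset seen per partition and build the startingOffsets JSON for the next run. The topic name "test" matches the question; treating the saved value as the next offset to read is an assumption you should verify against your Spark version's docs.
import org.apache.spark.sql.functions.max

// Highest offset read in this batch, per Kafka partition.
val lastOffsets = df
  .groupBy("partition")
  .agg(max("offset").as("lastOffset"))
  .collect()

// Next run should start one past the last offset that was read.
val perPartition = lastOffsets
  .map(row => s""""${row.getInt(0)}":${row.getLong(1) + 1}""")
  .mkString(",")
val nextStartingOffsets = s"""{"test":{$perPartition}}"""

// Persist nextStartingOffsets somewhere durable (HDFS, a small table, ZooKeeper)
// and pass it back in on the next run:
//   .option("startingOffsets", nextStartingOffsets)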
Did you use a checkpoint location while writing the stream data?
I had an issue with a Spark structured streaming (SSS) application, that had crashed due to a program bug and did not process over the weekend. When I restarted it, there were many messages on the topics to reprocess (about 250'000 messages each on 3 topics which need to be joined).
On restart, the application crashed again with an OutOfMemory exception. I learned from the docs that the maxOffsetsPerTrigger configuration on the read stream is supposed to help in exactly those cases. I changed the PySpark code (running on SSS 2.4.3, btw) to have something like the following for all 3 topics:
rawstream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topicName)
    .option("maxOffsetsPerTrigger", 10000)
    .option("startingOffsets", "earliest")
    .load())
My expectation would be that now the SSS query would load ~33'000 offsets from each of the topics and join them in the first batch. Then in the second batch it would clean the state records from the first batch which are subject to expiration due to the watermark (which would clean up most of the records from the first batch) and then read another ~33k from each topic. So after ~8 batches it should have processed the lag, with a "reasonable" amount of memory.
But the application still kept crashing with OOM, and when I checked the DAG in the application master UI, it reported that it again tried to read all 250'000 messages.
Is there something more that I need to configure? How can I check that this option is really used? (When I check the plan, it is unfortunately truncated and just shows (Options: [includeTimestamp=true,subscribe=IN2,inferSchema=true,failOnDataLoss=false,kafka.b...); I couldn't find out how to show the part after the dots.)
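One way to at least see whether the limit is being applied (a sketch, in Scala, assuming you keep the handle returned by start()): the per-micro-batch progress reports how many rows each trigger actually read, so numInputRows should stay bounded per trigger rather than covering the whole backlog.
// `rawStream` stands for one of the streaming DataFrames defined above (hypothetical handle).
val query = rawStream
  .writeStream
  .format("console")
  .start()

// Each StreamingQueryProgress entry shows how many rows that micro-batch read.
query.recentProgress.foreach { p =>
  println(s"batch=${p.batchId} numInputRows=${p.numInputRows}")
}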
I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code to check the readStream() with parameters to read the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started and not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
.option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
.option("failOnDataLoss", "false")
.option("startingOffsets","earliest")
.load()
.selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets by """{"topicA":{"0":0,"1":0}}"""
Following is my writeStream:
val query = kafkaStreamingDF
.writeStream
.format("console")
.start()
Every time I restart my job in the IntelliJ IDE, the logs show that the offset has been set to latest instead of 0 or earliest.
Is there a way I can clean my checkpoint? In that case, I don't know where the checkpoint directory is, because in the above case I don't specify any checkpointing.
Kafka relies on the property auto.offset.reset to take care of the Offset Management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question, you want to read the entire data from the topic, so setting "startingOffsets" to "earliest" should work. But also make sure that you are setting enable.auto.commit to false.
Setting enable.auto.commit to true means that offsets are committed automatically with a frequency controlled by the config auto.commit.interval.ms.
Setting this to true commits the offsets to Kafka automatically when messages are read from Kafka which doesn’t necessarily mean that Spark has finished processing those messages. To enable precise control for committing offsets, set Kafka parameter enable.auto.commit to false.
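For reference, in a plain Kafka consumer (outside Spark) the two settings this answer describes look like the following; the broker address and group id here are just placeholders:
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "localhost:9092")   // assumed broker
consumerProps.put("group.id", "my-consumer-group")         // assumed group id
consumerProps.put("auto.offset.reset", "earliest")         // start from the beginning when no committed offset exists
consumerProps.put("enable.auto.commit", "false")           // commit only after records are actually processed
consumerProps.put("key.deserializer", classOf[StringDeserializer].getName)
consumerProps.put("value.deserializer", classOf[StringDeserializer].getName)
val consumer = new KafkaConsumer[String, String](consumerProps)
Note that when reading through the Spark Kafka source these particular consumer properties cannot be set (see the excerpt further above); there, startingOffsets plus a checkpoint location play the equivalent role.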
Try setting .option("kafka.client.id", "XX") to use a different client.id.