I had an issue with a Spark Structured Streaming (SSS) application that crashed due to a program bug and did not process anything over the weekend. When I restarted it, there were many messages on the topics to reprocess (about 250'000 messages on each of 3 topics, which need to be joined).
On restart, the application crashed again with an OutOfMemory exception. I learned from the docs that the maxOffsetsPerTrigger option on the read stream is supposed to help in exactly those cases. I changed the PySpark code (running on SSS 2.4.3, by the way) to something like the following for all 3 topics:
rawstream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topicName)
    .option("maxOffsetsPerTrigger", 10000)
    .option("startingOffsets", "earliest")
    .load())
My expectation would be that the SSS query now loads ~33'000 offsets from each of the topics and joins them in the first batch. In the second batch it would then clean up the state records from the first batch which are subject to expiration due to the watermark (which would remove most of the records from the first batch) and read another ~33k from each topic. So after ~8 batches it should have processed the lag with a "reasonable" amount of memory.
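As a quick check of the arithmetic in that expectation, here is a minimal sketch that assumes maxOffsetsPerTrigger caps each read stream separately and that every batch reads the full limit. The function name batches_to_drain is made up for this illustration. Note that the 10000 in the snippet and the ~33'000 in the text give different batch counts.

```python
import math


def batches_to_drain(lag_per_topic: int, max_offsets_per_trigger: int) -> int:
    """Micro-batches needed to consume `lag_per_topic` offsets when each
    batch reads at most `max_offsets_per_trigger` offsets from that stream.

    Assumption: the limit is applied per read stream and is always reached.
    """
    if max_offsets_per_trigger <= 0:
        raise ValueError("max_offsets_per_trigger must be positive")
    return math.ceil(lag_per_topic / max_offsets_per_trigger)


# Limit as written in the snippet
print(batches_to_drain(250_000, 10_000))
# Limit implied by the "~33'000 per topic" estimate
print(batches_to_drain(250_000, 33_000))
```

If the limit were honoured, the lag would therefore drain in roughly 25 batches with a limit of 10000, or 8 batches with a limit of about 33'000.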
But the application still kept crashing with OOM, and when I checked the DAG in the application master UI, it reported that it again tried to read all 250'000 messages.
Is there something more that I need to configure? How can I check that this option is really used? (when I check the plan, unfortunately it is truncated and just shows (Options: [includeTimestamp=true,subscribe=IN2,inferSchema=true,failOnDataLoss=false,kafka.b...), I couldn't find out how to show the part after the dots)
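One way to check whether the limit is applied, without relying on the truncated plan, is to look at the query progress. In PySpark, query.lastProgress returns a dict that lists startOffset, endOffset and numInputRows per source. The sketch below is only an illustration: the helper name offsets_read_per_source and the sample dict are made up, and the exact source description string varies by Spark version.

```python
def offsets_read_per_source(progress: dict) -> dict:
    """Sum (endOffset - startOffset) over all topic partitions per source.

    `progress` is expected to have the shape of `query.lastProgress`.
    Partitions missing from startOffset (for example in the very first
    batch, where startOffset can be null) are counted from offset 0, so
    compare with `numInputRows` in that case.
    """
    result = {}
    for source in progress.get("sources", []):
        start = source.get("startOffset") or {}
        end = source.get("endOffset") or {}
        total = 0
        for topic, partitions in end.items():
            for partition, end_offset in partitions.items():
                start_offset = start.get(topic, {}).get(partition, 0)
                total += end_offset - start_offset
        result[source["description"]] = total
    return result


# Hypothetical progress entry for one topic with two partitions
sample_progress = {
    "sources": [
        {
            "description": "KafkaSource[Subscribe[IN2]]",
            "startOffset": {"IN2": {"0": 0, "1": 0}},
            "endOffset": {"IN2": {"0": 5000, "1": 5000}},
            "numInputRows": 10000,
        }
    ]
}

print(offsets_read_per_source(sample_progress))
```

With a running query you would pass query.lastProgress instead of the sample dict. A per-batch total at or below the configured limit suggests the option is in effect.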
Related
I want to consume a Kafka topic as a batch: I want to read the topic hourly and pick up only the latest hour's data.
val readStream = existingSparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "kafka.raw")
.load()
But this always reads the first 20 rows, and these rows start from the very beginning, so it never picks up the latest rows.
How can I read the latest rows on an hourly basis using Scala and Spark?
If you read Kafka messages in batch mode, you need to take care of the bookkeeping yourself, that is, which data is new and which is not. Remember that Spark will not commit any messages back to Kafka, so every time you restart the batch job it will read from the beginning (or from wherever the setting startingOffsets points, which defaults to earliest for batch queries).
For your scenario, where you want to run the job once every hour and only process the new data that arrived in Kafka during the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a blog post from Databricks that explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.
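A minimal sketch of that setup, written in PySpark rather than Scala (where Trigger.Once is spelled trigger(once=True)). The helper names build_hourly_options and run_once, the parquet sink and all paths are assumptions for illustration only. startingOffsets matters only on the first run; afterwards the query resumes from the checkpoint.

```python
def build_hourly_options(bootstrap_servers, topic, output_path, checkpoint_path):
    """Return (reader_options, writer_options) for a run-once streaming job."""
    reader_options = {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Used only when no checkpoint exists yet
        "startingOffsets": "earliest",
    }
    writer_options = {
        "path": output_path,
        # The checkpoint is what remembers which offsets were already processed
        "checkpointLocation": checkpoint_path,
    }
    return reader_options, writer_options


def run_once(spark, reader_options, writer_options):
    """Process everything that arrived since the last run, then stop.

    `spark` is an existing SparkSession with the Kafka source on its classpath.
    """
    df = spark.readStream.format("kafka").options(**reader_options).load()
    query = (
        df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
        .writeStream.format("parquet")
        .options(**writer_options)
        .trigger(once=True)  # single micro-batch, then the query terminates
        .start()
    )
    query.awaitTermination()


# Hypothetical values; replace with your own brokers, topic and paths
reader_opts, writer_opts = build_hourly_options(
    "localhost:9092", "kafka.raw", "/tmp/out/kafka_raw", "/tmp/chk/kafka_raw"
)
print(reader_opts)
print(writer_opts)
```

A cron entry that submits a script calling run_once(spark, reader_opts, writer_opts) once an hour would then pick up only the new messages on each run.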
I am using the following consumer code in Spark to read from a Kafka Topic:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("subscribe", topicName)
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
The code reads from the Topic as expected, but the contents of the Topic are not getting flushed out as a result of this read. Repeated execution results in the same set of messages getting returned over and over again.
What should I do to cause the messages to be removed from the Topic upon read?
As cricket_007 mentioned, Kafka does not remove logs after consumption. You can manage log retention within Kafka using either size-based or time-based settings.
log.retention.bytes - The maximum size of the log before deleting it
log.retention.hours - The number of hours to keep a log file before deleting it
log.retention.minutes - The number of minutes to keep a log file
log.retention.ms - The number of milliseconds to keep a log file
You can read more about these parameters here
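The three time-based settings overlap: as far as the Kafka broker documentation describes it, log.retention.ms takes precedence over log.retention.minutes, which takes precedence over log.retention.hours (default 168). The sketch below only illustrates that precedence; the function name effective_retention_ms is made up and this is not Kafka code.

```python
def effective_retention_ms(ms=None, minutes=None, hours=168):
    """Resolve the time-based retention the way the broker settings are
    documented to interact: ms wins over minutes, minutes win over hours."""
    if ms is not None:
        return ms
    if minutes is not None:
        return minutes * 60 * 1000
    return hours * 60 * 60 * 1000


# Only the default log.retention.hours is in effect (7 days)
print(effective_retention_ms())
# log.retention.minutes is set, so the hours value is ignored
print(effective_retention_ms(minutes=30))
# log.retention.ms is set, so both other values are ignored
print(effective_retention_ms(ms=60_000, minutes=30, hours=1))
```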
On top of that, an additional mechanism to handle log retention is log compaction. You can manage log compaction by setting the following parameters:
log.cleanup.policy
log.cleaner.min.compaction.lag.ms
You can read more about that here
Kafka doesn't remove topic messages when consumed
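What your Spark code reads is determined by offsets, not by deletion. A plain Kafka consumer (or the DStream-based Spark integration) is part of a consumer group and needs to acknowledge that a message has been read by committing offsets, which happens periodically by default. You can disable this by setting enable.auto.commit to false, which is highly recommended because you will want to control whether Spark has successfully processed a collection of records. Note that the kafka source used with spark.read / spark.readStream does not commit offsets to Kafka at all and does not accept enable.auto.commit: a streaming query tracks its offsets in the checkpoint, and a batch read starts from startingOffsets every time.
Checkpointing or committing offsets to a durable store are some ways to preserve your offsets in the event of a restart / failure of a task, so that you do not re-read the same data.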
I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code to check the readStream() with parameters to read the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started and not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
.option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
.option("failOnDataLoss", "false")
.option("startingOffsets","earliest")
.load()
.selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets by """{"topicA":{"0":0,"1":0}}"""
The following is my writeStream:
val query = kafkaStreamingDF
.writeStream
.format("console")
.start()
Every time I restart my job in IntelliJ IDE, logs show that the offset has been set to latest instead of 0 or earliest.
Is there a way I can clean my checkpoint? I don't know where the checkpoint directory is, because in the above case I don't specify any checkpointing.
Kafka relies on the property auto.offset.reset to take care of the Offset Management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question, you want to read all the data from the topic, so setting "startingOffsets" to "earliest" should work. If you are using a plain Kafka consumer or the DStream-based integration, also make sure that enable.auto.commit is set to false; the Structured Streaming kafka source never commits offsets to Kafka and does not accept this option.
Setting enable.auto.commit to true means that offsets are committed automatically, with a frequency controlled by the config auto.commit.interval.ms.
With auto-commit enabled, offsets are committed to Kafka when messages are read, which doesn’t necessarily mean that Spark has finished processing those messages. For precise control over committing offsets, set the Kafka parameter enable.auto.commit to false.
Try to set up .option("kafka.client.id", "XX"), to use a different client.id.
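For the per-partition form of startingOffsets that the question mentions, the value is a JSON string mapping each topic to its partitions and offsets, where -2 stands for earliest and -1 for latest. A small sketch of building that string follows; the helper name starting_offsets is made up, and the topic and partition numbers are the ones from the question. Keep in mind that startingOffsets only applies when a new query starts; a resumed query continues from its checkpoint.

```python
import json

EARLIEST = -2  # special offset value meaning "earliest" in the JSON form
LATEST = -1    # special offset value meaning "latest" in the JSON form


def starting_offsets(topic, partitions, offset=EARLIEST):
    """Build the JSON string for the startingOffsets option, using the
    same offset for every listed partition of one topic."""
    spec = {topic: {str(p): offset for p in partitions}}
    return json.dumps(spec, separators=(",", ":"))


# Same value as the literal tried in the question
print(starting_offsets("topicA", [0, 1], 0))
# Explicit "earliest" for both partitions
print(starting_offsets("topicA", [0, 1]))
```

The result would be passed as .option("startingOffsets", starting_offsets("topicA", [0, 1])).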
I have a simple Spark Structured Streaming app that reads from Kafka and writes to HDFS. Today the app has mysteriously stopped working, with no changes or modifications whatsoever (it had been working flawlessly for weeks).
So far, I have observed the following:
App has no active, failed or completed tasks
App UI shows no jobs and no stages
QueryProgress indicates 0 input rows every trigger
QueryProgress indicates offsets from Kafka were read and committed correctly (which means data is actually there)
Data is indeed available in the topic (writing to console shows the data)
Despite all of that, nothing is being written to HDFS anymore. Code snippet:
val inputData = spark
.readStream.format("kafka")
.option("kafka.bootstrap.servers", bootstrap_servers)
.option("subscribe", "topic-name-here")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false").load()
inputData.toDF()
.repartition(10)
.writeStream.format("parquet")
.option("checkpointLocation", "hdfs://...")
.option("path", "hdfs://...")
.outputMode(OutputMode.Append())
.trigger(Trigger.ProcessingTime("60 seconds"))
.start()
Any ideas why the UI shows no jobs/tasks?
For anyone facing the same issue: I found the culprit:
Somehow the data within _spark_metadata in the HDFS directory where I was saving the data got corrupted.
The solution was to erase that directory and restart the application, which re-created the directory. After that, data started flowing.
My ultimate goal is to check whether a Kafka topic is running and whether the data in it is good, and otherwise fail / throw an error.
If I could pull just 100 messages, or pull for just 60 seconds, I think I could accomplish what I wanted. But none of the streaming examples / questions I have found online intend to shut down the streaming connection.
Here is the best working code I have so far. It pulls data and displays it, but it keeps trying to pull more data, and if I try to access the data in the next line, it hasn't had a chance to pull it yet. I assume I need some sort of callback. Has anyone done something similar? Is this the best way of going about this?
I am using Databricks notebooks to run my code.
import org.apache.spark.sql.functions.{explode, split}
val kafka = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<kafka server>:9092")
.option("subscribe", "<topic>")
.option("startingOffsets", "earliest")
.load()
val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
display(df.select($"word"))
The trick is you don't need streaming at all. Kafka source supports batch queries, if you replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
.read
.format("kafka")
... // Remaining options
.load()
You can find examples in the Kafka streaming documentation.
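A minimal PySpark sketch of such a batch check, under a few assumptions: the helper names batch_options and check_topic are made up, the sample size of 100 comes from the question, and "good data" is reduced here to "at least one message can be read". Any real validation of the values is left to the caller.

```python
def batch_options(bootstrap_servers, topic):
    """Options for a bounded (batch) read of a whole topic."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
        "endingOffsets": "latest",  # bounded read: stop at the current end
    }


def check_topic(spark, bootstrap_servers, topic, sample_size=100):
    """Read up to `sample_size` messages as a batch and fail if none exist.

    `spark` is an existing SparkSession with the Kafka source available.
    """
    df = spark.read.format("kafka").options(
        **batch_options(bootstrap_servers, topic)
    ).load()
    rows = (
        df.selectExpr("CAST(value AS STRING) AS value")
        .limit(sample_size)
        .collect()
    )
    if not rows:
        raise RuntimeError(f"no messages found on topic {topic!r}")
    return [row["value"] for row in rows]


# Hypothetical broker and topic names
print(batch_options("localhost:9092", "my-topic"))
```

Because this is a plain batch read, check_topic returns once the sample has been collected, so the next line can work with the values directly.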
For streaming queries you can use once trigger, although it might not be the best choice in this case:
df.writeStream
.trigger(Trigger.Once)
... // Handle the output, for example with foreach sink (?)
You could also use standard Kafka client to fetch some data without starting SparkSession.