Read latest records from Kafka using pyspark batch job - apache-spark

I am executing a batch job in PySpark, where Spark reads data from a Kafka topic every 5 minutes.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1") \
    .option("subscribePattern", "test") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
Whenever Spark reads data from Kafka it reads all the data, including previous batches.
I want to read only the data for the current batch, i.e. the latest records that have not been read before.
Please suggest! Thank you.

From https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
For batch queries, latest (either implicitly or by using -1 in json) is not allowed.
Using earliest means all the data is obtained again.
You will need to define the offsets explicitly every time you run, e.g.:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
That implies you need to save the offsets processed per partition. I am looking into this myself in the near future for a project. Some items below to help:
https://medium.com/datakaresolutions/structured-streaming-kafka-integration-6ab1b6a56dd1 stating what you observe:
Create a Kafka Batch Query
Spark also provides a feature to fetch the data from Kafka in batch mode. In batch mode Spark will consume all the messages at once. Kafka in batch mode requires two important parameters, starting offsets and ending offsets; if not specified, Spark will use the default configuration, which is:
startingOffsets — earliest
endingOffsets — latest
https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark also alludes to what you should do, with the following:
And, finally, save these Kafka topic endOffsets to file system – local or HDFS (or commit them to ZooKeeper). This will be used as the starting offset for a Kafka topic in the next run. Here we are making sure the job's next run will read from the offset where the previous run left off.
This blog https://dataengi.com/2019/06/06/spark-structured-streaming/ I think has the answer for saving offsets.
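Putting those pieces together, here is a minimal PySpark sketch of that bookkeeping, assuming the offsets are kept in a local JSON file (the file path and topic name are placeholders; in practice you would keep the file on HDFS or S3, as the DZone article suggests):

import json
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-batch-offsets").getOrCreate()

# Hypothetical bookkeeping file and topic.
OFFSETS_FILE = "/tmp/kafka_batch_offsets.json"
TOPIC = "test"

# Use the offsets saved by the previous run, or start from the beginning on the first run.
if os.path.exists(OFFSETS_FILE):
    with open(OFFSETS_FILE) as f:
        saved = json.load(f)
    starting_offsets = json.dumps(saved)
else:
    saved = {}
    starting_offsets = "earliest"

df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1")
      .option("subscribe", TOPIC)
      .option("startingOffsets", starting_offsets)
      .option("endingOffsets", "latest")
      .load())

# ... process/write df here ...

# Record (last read offset + 1) per partition for the next run.
for row in (df.groupBy("topic", "partition")
              .agg(F.max("offset").alias("max_offset"))
              .collect()):
    saved.setdefault(row["topic"], {})[str(row["partition"])] = row["max_offset"] + 1

with open(OFFSETS_FILE, "w") as f:
    json.dump(saved, f)

Because startingOffsets is inclusive, the sketch stores last-read-offset + 1 per partition, so the next run begins just after where this one stopped.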

Did you use a checkpoint location while writing the stream data?

Related

Spark Structured Streaming Batch Read Checkpointing

I am fairly new to Spark and am still learning. One of the more difficult concepts I have come across is checkpointing and how Spark uses it to recover from failures. I am doing batch reads from Kafka using Structured Streaming and writing them to S3 as Parquet files as follows:
dataset
    .write()
    .mode(SaveMode.Append)
    .option("checkpointLocation", checkpointLocation)
    .partitionBy("date_hour")
    .parquet(getS3PathForTopic(topicName));
The checkpoint location is an S3 filesystem path. However, as the job runs, I see no checkpointing files. In subsequent runs, I see the following log:
21/10/14 12:20:51 INFO ConsumerCoordinator: [Consumer clientId=consumer-spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0-5, groupId=spark-kafka-relation-54f0cc87-e437-4582-b998-a33189e90bd7-driver-0] Found no committed offset for partition topic-1
This indicates that the previous run did not checkpoint any offsets for this run to pick them up from. So it keeps consuming from the earliest offset.
How can I make my job pick up new offsets? Note that this is a batch query as described here.
This is how I read:
sparkSession
    .read()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaProperties.bootstrapServers())
    .option("subscribe", topic)
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", sslConfig.truststoreLocation())
    .option("kafka.ssl.truststore.password", sslConfig.truststorePassword())
    .option("kafka.ssl.keystore.location", sslConfig.keystoreLocation())
    .option("kafka.ssl.keystore.password", sslConfig.keystorePassword())
    .option("kafka.ssl.endpoint.identification.algorithm", "")
    .option("failOnDataLoss", "true");
I am not sure why batch Spark Structured Streaming with Kafka still exists. If you wish to use it, then you must code your own offset management. See the guide, but it is not well explained.
I would say Trigger.Once is a better fit for your use case; offset management is then provided by Spark, since it is no longer batch mode.
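For illustration, a hedged PySpark sketch of that Trigger.Once alternative applied to this Kafka-to-S3 Parquet scenario (bootstrap servers, topic and S3 paths are placeholders; the Java API is analogous):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-s3-once").getOrCreate()

df = (spark.readStream                                   # streaming read instead of a batch read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1")  # placeholder
      .option("subscribe", "topic")                      # placeholder
      .option("startingOffsets", "earliest")             # only honoured on the very first run
      .option("failOnDataLoss", "true")
      .load())

out = (df.selectExpr("CAST(value AS STRING) AS value", "timestamp")
         .withColumn("date_hour", F.date_format("timestamp", "yyyy-MM-dd-HH")))

query = (out.writeStream
         .format("parquet")
         .option("path", "s3a://bucket/topic/")                            # placeholder sink
         .option("checkpointLocation", "s3a://bucket/checkpoints/topic/")  # offsets are tracked here
         .partitionBy("date_hour")
         .trigger(once=True)   # process everything available, then stop
         .start())

query.awaitTermination()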

Consume a Kafka Topic every hour using Spark

I want to consume a Kafka topic as a batch: read the Kafka topic hourly and process only the latest hourly data.
val readStream = existingSparkSession
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", hostAddress)
    .option("subscribe", "kafka.raw")
    .load()
But this always reads the first 20 data rows, and these rows start from the very beginning, so it never picks up the latest data rows.
How can I read the latest rows on an hourly basis using Scala and Spark?
If you read Kafka messages in batch mode you need to take care of the bookkeeping of which data is new and which is not yourself. Remember that Spark will not commit any messages back to Kafka, so every time you restart the batch job it will read from the beginning (or based on the setting startingOffsets, which defaults to earliest for batch queries).
For your scenario where you want to run the job once every hour and only process the new data that arrived to Kafka in the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a blog from Databricks that nicely explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.
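A minimal sketch of that setup in PySpark (topic, servers and paths are placeholders); a cron entry would then simply spark-submit this script once an hour, and each run only processes records that arrived since the previous run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-kafka-ingest").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   # placeholder
      .option("subscribe", "kafka.raw")
      .load())

(df.selectExpr("CAST(value AS STRING) AS value")
   .writeStream
   .format("parquet")
   .option("path", "/data/kafka_raw/")                       # placeholder output
   .option("checkpointLocation", "/checkpoints/kafka_raw/")  # remembers what was already read
   .trigger(once=True)                                       # drain whatever is new, then exit
   .start()
   .awaitTermination())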

Spark Structured Streaming Batch Query

I am new to Kafka and Spark Structured Streaming. I want to know how Spark in batch mode knows which offset to read from. If I specify "startingOffsets" as "earliest", I am only getting the latest records and not all the records in the partition. I ran the same code in 2 different clusters. Cluster A (local machine) fetched 6 records, Cluster B (TST cluster, very first run) fetched 1 record.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
I am planning to run my batch once a day; will I get all the records from yesterday's run up to the current run? Where do I see offsets and commits for batch queries?
According to the Structured Streaming + Kafka Integration Guide your offsets are stored in the provided checkpoint location that you set in the write part of your batch job.
If you do not delete the checkpoint files, the job will continue to read from Kafka where it left off. If you delete the checkpoint files or if you run the job for the very first time the job will consume messages based on the option startingOffsets.

Why does Spark Structured Streaming not allow changing the number of input sources?

I would like to build a Spark streaming pipeline that reads from multiple Kafka topics (that vary in number over time). I intended to stop the streaming job, add/remove the new topics, and start the job again whenever I required an update to the topics in the streaming job, using one of the two options outlined in the Spark Structured Streaming + Kafka Integration Guide:
# Subscribe to multiple topics
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1,topic2") \
    .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to a pattern
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribePattern", "topic.*") \
    .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Upon further investigation, I noticed the following point in the Spark Structured Streaming Programming Guide and am trying to understand why changing the number of input sources is "not allowed":
Changes in the number or type (i.e. different source) of input sources: This is not allowed.
Definition of "Not Allowed" (also from Spark Structured Streaming Programming Guide):
The term not allowed means you should not do the specified change as the restarted query is likely to fail with unpredictable errors. sdf represents a streaming DataFrame/Dataset generated with sparkSession.readStream.
My understanding is that Spark Structured Streaming implements its own checkpointing mechanism:
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.
Can someone please explain why changing the number of sources is "not allowed"? I assume that would be one of the benefits of the checkpointing mechanism.
Steps to add a new input source to an existing, running model streaming job:
Stop the currently running streaming job in which the model is running.
hdfs dfs -get output/checkpoints/<model_name>offsets <local_directory>/offsets
There will be 3 files in the directory (since the last 3 offsets are saved by Spark). Sample format below for a single file:
v1
{ "batchWatermarkMs":0,"batchTimestampMs":1578463128395,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{ "logOffset":0}
{ "logOffset":0}
each {"logOffset":batchId} represents single input source.
To add new input source add "-" at the end of each file in the directory.
Sample updated file
v1
{"batchWatermarkMs":0,"batchTimestampMs":1578463128395,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
{"logOffset":0}
If you want to add more than one input source, add "-" as many times as the number of new input sources.
hdfs dfs -put -f <local_directory>/offsets output/checkpoints/<model_name>offsets
The best way to do what you want is to run your readStreams in multiple threads.
I'm doing this, reading 40 tables at the same time. To do this I followed this article:
https://cm.engineering/multiple-spark-streaming-jobs-in-a-single-emr-cluster-ca86c28d1411.
I will give a quick overview of what I do after reading it and setting up my code structure with a main function, an executor, and a trait holding my Spark session, which is shared with all jobs.
1. Two lists of the topics that I want to read.
In Scala I create two lists. The first list contains the topics that I always want to read, and the second is a dynamic list to which I can add new topics when I stop my job.
2. Pattern matching to run the jobs.
I have two different kinds of jobs: one that I run for the tables I always process, and dynamic jobs that I run for specific topics. In other words, if I want to add a new topic and create a new job for it, I add that job to the pattern matching. In the code below, I want to run a specific job for the CARS and SHIPS tables, and all other tables that I put in the lists will run the same replication job:
var tables = specificTables ++ dynamicTables
tables.map(table => {
  table._1 match {
    case "CARS"  => new CarsJob
    case "SHIPS" => new ShipsReplicationJob
    case _       => new ReplicationJob
  }
})
After this I pass this pattern matching to a createJobs function that instantiates each of these jobs, and I pass the result to a startFutureTasks function that puts each of these jobs on a different thread:
startFutureTasks(createJobs(tables))
I hope I've helped. Thanks!
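For reference, a hedged PySpark analogue of this multi-stream setup: rather than managing threads explicitly, each topic gets its own streaming query on the shared SparkSession and the driver waits on all of them (topics, servers and paths are placeholders, and the per-topic job dispatch is only indicated by a comment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-topic-streams").getOrCreate()

# Hypothetical merged list of "always on" and dynamic topics.
topics = ["CARS", "SHIPS", "ORDERS"]

for topic in topics:
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "host:9092")   # placeholder
          .option("subscribe", topic)
          .load())

    # The per-topic dispatch from the answer's pattern matching would go here.
    (df.selectExpr("CAST(value AS STRING) AS value")
       .writeStream
       .format("parquet")
       .option("path", f"/data/{topic}/")                       # one sink per topic
       .option("checkpointLocation", f"/checkpoints/{topic}/")  # one checkpoint per topic
       .queryName(topic)
       .start())

# Block until any of the queries fails or all are stopped.
spark.streams.awaitAnyTermination()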

How to rewind Kafka Offsets in spark structured streaming readstream

I have a Spark Structured Streaming job which is configured to read data from Kafka. Please go through the code to check the readStream() with parameters to read the latest data from Kafka.
I understand that readStream() reads from the first offset when a new query is started and not on resume.
But I don't know how to start a new query every time I restart my job in IntelliJ.
val kafkaStreamingDF = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", AppProperties.getProp(AppConstants.PROPS_SERVICES_KAFKA_SERVERS))
    .option("subscribe", AppProperties.getProp(AppConstants.PROPS_SDV_KAFKA_TOPICS))
    .option("failOnDataLoss", "false")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(value as STRING)", "CAST(topic as STRING)")
I have also tried setting the offsets with """{"topicA":{"0":0,"1":0}}""".
The following is my writeStream:
val query = kafkaStreamingDF
    .writeStream
    .format("console")
    .start()
Every time I restart my job in IntelliJ IDE, logs show that the offset has been set to latest instead of 0 or earliest.
Is there a way I can clean my checkpoint? In that case, I don't know where the checkpoint directory is, because in the above example I don't specify any checkpointing.
Kafka relies on the property auto.offset.reset to take care of offset management.
The default is “latest,” which means that lacking a valid offset, the consumer will start reading from the newest records (records that were written after the consumer started running). The alternative is “earliest,” which means that lacking a valid offset, the consumer will read all the data in the partition, starting from the very beginning.
As per your question you want to read the entire data from the topic. So setting the "startingOffsets" to "earliest" should work. But, also make sure that you are setting the enable.auto.commit to false.
By setting enable.auto.commit to true means that offsets are committed automatically with a frequency controlled by the config auto.commit.interval.ms.
Setting this to true commits the offsets to Kafka automatically when messages are read from Kafka which doesn’t necessarily mean that Spark has finished processing those messages. To enable precise control for committing offsets, set Kafka parameter enable.auto.commit to false.
Try setting .option("kafka.client.id", "XX") to use a different client.id.
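A hedged PySpark sketch of one way to make the rewind reproducible (server, topic and checkpoint path are placeholders): give the query an explicit checkpointLocation so you know exactly which directory to delete when you want the startingOffsets JSON to be applied again:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewind-kafka").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")          # placeholder
      .option("subscribe", "topicA")
      # Only honoured when the query starts with a fresh (or deleted) checkpoint:
      .option("startingOffsets", """{"topicA":{"0":0,"1":0}}""")
      .option("failOnDataLoss", "false")
      .load())

query = (df.selectExpr("CAST(value AS STRING)", "CAST(topic AS STRING)")
         .writeStream
         .format("console")
         # Explicit location, so rewinding is simply deleting this directory before a run:
         .option("checkpointLocation", "/tmp/checkpoints/rewind-demo")
         .start())

query.awaitTermination()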
