Spark Structured Streaming Batch Query - apache-spark

I am new to kafka and spark structured streaming. I want to know how spark in batch mode knows which offset to read from? If I specify "startingOffsets" as "earliest", I am only getting the latest records and not all the records in the partition. I ran the same code in 2 different clusters. Cluster A ( local machine ) fetched 6 records, Cluster B ( TST cluster - very first run) fetched 1 record.
df = spark \
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", broker) \
.option("subscribe", topic) \
.option("startingOffsets", "earliest") \
.option("endingOffsets", "latest" ) \
.load()
I am planning to run my batch once a day, will I get all the records from the yesterday to current run? Where do i see offsets and commits for batch queries?

According to the Structured Streaming + Kafka Integration Guide your offsets are stored in the provided checkpoint location that you set in the write part of your batch job.
If you do not delete the checkpoint files, the job will continue to read from Kafka where it left off. If you delete the checkpoint files or if you run the job for the very first time the job will consume messages based on the option startingOffsets.

Related

Fixed interval micro-batch and once time micro-batch trigger mode don't work with Parquet file sink

I'm trying to consume data on Kafka topic and push consumed messages to HDFS with parquet format.
I'm using pyspark (2.4.5) to create Spark structed streaming process. The problem is my Spark job is endless and no data is pushed to HDFS.
process = (
# connect to kafka brokers
(
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "brokers_list")
.option("subscribe", "kafka_topic")
.option("startingOffset", "earliest")
.option("includeHeaders", "true")
.load()
.writeStream.format("parquet")
.trigger(once=True). # tried with processingTime argument and have same result
.option("path", f"hdfs://hadoop.local/draft")
.option("checkpointLocation", "hdfs://hadoop.local/draft_checkpoint")
.start()
)
)
My Spark session's UI is liked this:
More details on stage:
I check status on my notebook and got this:
{
'message': 'Processing new data',
'isDataAvailable': True,
'isTriggerActive': True
}
When I check my folder on HDFS, there is no data is loaded. Only a directory named _spark_metadata is created in the output_location folder.
I don't face this problem if I remove the line of triggerMode trigger(processingTime="1 minute"). When I use default trigger mode, spark create a lot of small parquet file in the output location, this is inconvenient.
Does 2 trigger mode processingTime and once support for parquet file sink?
If I have to use the default trigger mode, how can I handle the gigantic number of tiny files created in my HDFS system?

How we manage offsets in Spark Structured Streaming? (Issues with _spark_metadata )

Background:
I have written a simple spark structured steaming app to move data from Kafka to S3. Found that in order to support exactly-once guarantee spark creates _spark_metadata folder, which ends up growing too large, when the streaming app runs for a long time the metadata folder grows so big that we start getting OOM errors. I want to get rid of metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Structured Streaming. But want to know how to get the offsets and other metadata to manage checkpointing ourself using Spark Structured Streaming. Do you have any sample program that implements checkpointing?
How we managed offsets in Spark Structured Streaming??
Looking at this JIRA https://issues-test.apache.org/jira/browse/SPARK-18258. looks like offsets are not provided. How should we go about?
The issue is in 6 hours size of metadata increased to 45MB and it grows till it reaches nearly 13 GB. Driver memory allocated is 5GB. At that time system crashes with OOM. Wondering how to avoid making this meta data grow so large? How to make metadata not log so much information.
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
.option("subscribe", "topic1") \
.option("startingOffsets", "earliest") \
.load()
2. Use from_json API from Spark to extract your data for further transformation in a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
StreamingQuery query = flatDf.writeStream().format("parquet")
Dataset dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
.select("event.metadata", "event.data", "event.connection", "event.registration_event","event.version_event"
);
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
Dataset flatDf = sqlContext
.sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");
StreamingQuery query = flatDf
.writeStream()
.outputMode("append")
.option("compression", "snappy")
.format("parquet")
.option("checkpointLocation", checkpointLocation)
.option("path", outputPath)
.partitionBy("date", "time", "id")
.trigger(Trigger.ProcessingTime(triggerProcessingTime))
.start();
query.awaitTermination();
For non-batch Spark Structured Streaming KAFKA integration:
Quote:
Structured Streaming ignores the offsets commits in Apache Kafka.
Instead, it relies on its own offsets management on the driver side which is responsible for distributing offsets to executors and
for checkpointing them at the end of the processing round (epoch or
micro-batch).
You need not worry if you follow the Spark KAFKA integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
For batch the situation is different, you need to manage that yourself and store the offsets.
UPDATE
Based on the comments I suggest the question is slightly different and advise you look at Spark Structured Streaming Checkpoint Cleanup. In addition to your updated comments and the fact that there is no error, I suggest you consukt this on metadata for Spark Structured Streaming https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, different to my style, but cannot see any obvious error.

Read latest records from Kafka using pyspark batch job

I am executing a batch job in pyspark, where spark will read data from kafka topic for every 5 min.
df = spark \
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", "host1:port1") \
.option("subscribePattern", "test") \
.option("startingOffsets", "earliest") \
.option("endingOffsets", "latest") \
.load()
Whenever spark reads data from kafka it is reading all the data including previous batches.
I want to read data for the current batch or latest records which is not read before.
Please suggest !! Thank you.
From https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
For batch queries, latest (either implicitly or by using -1 in json)
is not allowed.
Using earliest means all the data again is obtained.
You will need to define the offset explicitly every time you run like, e.g.:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
That implies you need to save the offsets processed per partition. I am looking into this in the near future myself for a project. Some items hereunder items to help:
https://medium.com/datakaresolutions/structured-streaming-kafka-integration-6ab1b6a56dd1 stating what you observe:
Create a Kafka Batch Query
Spark also provides a feature to fetch the
data from Kafka in batch mode. In batch mode Spark will consume all
the messages at once. Kafka in batch mode requires two important
parameters Starting offsets and ending offsets, if not specified spark
will consider the default configuration which is,
startingOffsets — earliest
endingOffsets — latest
https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark alludes as well to what you should do, with the following:
And, finally, save these Kafka topic endOffsets to file system – local or HDFS (or commit them to ZooKeeper). This will be used for the
next run of starting the offset for a Kafka topic. Here we are making
sure the job's next run will read from the offset where the previous
run left off.
This blog https://dataengi.com/2019/06/06/spark-structured-streaming/ I think has the answer for saving offsets.
Did you use check point location while writing stream data

Spark Structured Streaming with Kafka source, change number of topic partitions while query is running

I've set up a Spark structured streaming query that reads from a Kafka topic.
If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice and data on new partitions is not consumed.
Is there a way to tell Spark to check for new partitions in the same topic apart from stopping the query an restarting it?
EDIT:
I'm using Spark 2.4.4. I read from kafka as follows:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaURL)
.option("startingOffsets", "earliest")
.option("subscribe", topic)
.option("failOnDataLoss", value = false)
.load()
after some processing, I write to HDFS on a Delta Lake table.
Answering my own question. Kafka consumers check for new partitions/topic (in case of subscribing to topics with a pattern) every metadata.max.age.ms, whose default value is 300000 (5 minutes).
Since my test was lasting far less than that, I wouldn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option of the DataStreamReader:
.option("kafka.metadata.max.age.ms", 100)

This query does not support recovering from checkpoint location. Delete checkpoint/testmemeory/offsets to start over

I have created In-Memory tables inside Spark and try to restart the Spark structured streaming job after failure. It gets "This query does not support recovering from checkpoint location. Delete checkpoint/TEST_IN_MEMORY/offsets to start over."
What is the Concept of Checkpoint In-Memory sink? Is there any way to rectify it? (can we delete the old and the new checkpoint dynamically?)
I am using Data Stax 5.1.6 clusters, so I don't have Choice I have to go with Spark 2.0.2 version only.
val kafkaDataFrame_inmemory = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "Localhost:9092")
.option("subscribe", "TEST_IN_MOEMORY")
.option("startingOffsets", "earliest")
.load()
val checkpoint = "C/Users/756661/Desktop/KT DOCS/spark/In_MEM_TABLE"+ UUID.randomUUID.toString
kafkaDataFrame_inmemory
.writeStream
.format("memory")
.option("truncate", false)
.queryName("IN_MEM_TABLE")
.outputMode("update")
.option("checkpointLocation",checkpoint)
.start()
You should simply remove the line .option("checkpointLocation",checkpoint) from your code and start over.
Per the error message, the memory data source does not support recovering from checkpoint location. Any attempt at (re)starting a streaming query that uses memory format with a directory that exists already will fail.
org.apache.spark.sql.AnalysisException: This query does not support recovering from checkpoint location. Delete xxx/offsets to start over.;
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:326)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:267)
... 49 elided
It's not to say that memory data source won't use a checkpoint directory. It will, but it is going to be of the name randomly generated.
can we delete the old and the new checkpoint dynamically?
Sure and that's the only way to start the streaming query.

Resources