Resuming Structured Streaming from latest offsets - apache-spark

I would like to create a Spark Structured Streaming job that reads messages from a Kafka source, writes to a Kafka sink, and after a failure resumes reading only the current, newest messages. For that reason I don't need to keep checkpoints for my job.
But it looks like there is no option to disable checkpointing when writing to a Kafka sink in Structured Streaming. To my understanding, even if I specify on the source:
.option("startingOffsets", "latest")
it will only be taken into account when the stream first runs; after a failure the stream will resume from the checkpoint. Is there some workaround? And is there a way to disable checkpointing?

A workaround for this is to delete the existing checkpoint location from your code, so that every run starts by fetching data from the latest offsets.
import org.apache.hadoop.fs.{FileSystem, Path}

val checkPointLocation = "/path/in/hdfs/location"

// Delete the checkpoint location if it exists.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(checkPointLocation), true)

val options = Map(
  "kafka.bootstrap.servers" -> "localhost:9092",
  "topic" -> "topic_name",
  "checkpointLocation" -> checkPointLocation,
  "startingOffsets" -> "latest"
)

df
  .writeStream
  .format("kafka")
  .outputMode("append")
  .options(options)
  .start()
  .awaitTermination()
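For completeness, here is a minimal sketch of the matching read side (broker and topic names are placeholders). Note that startingOffsets is a Kafka source option and is only honored when the query starts without existing checkpoint data, which is why the checkpoint location is deleted above:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic_name")
  .option("startingOffsets", "latest") // ignored once a checkpoint exists
  .load()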

Related

Fixed interval micro-batch and once time micro-batch trigger mode don't work with Parquet file sink

I'm trying to consume data from a Kafka topic and push the consumed messages to HDFS in Parquet format.
I'm using PySpark (2.4.5) to create the Spark Structured Streaming process. The problem is that my Spark job runs endlessly and no data is pushed to HDFS.
process = (
    # connect to kafka brokers
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "brokers_list")
    .option("subscribe", "kafka_topic")
    .option("startingOffsets", "earliest")
    .option("includeHeaders", "true")
    .load()
    .writeStream.format("parquet")
    .trigger(once=True)  # tried with processingTime argument and have same result
    .option("path", "hdfs://hadoop.local/draft")
    .option("checkpointLocation", "hdfs://hadoop.local/draft_checkpoint")
    .start()
)
My Spark session's UI looks like this:
More details on the stage:
I checked the status from my notebook and got this:
{
    'message': 'Processing new data',
    'isDataAvailable': True,
    'isTriggerActive': True
}
When I check my folder on HDFS, no data has been loaded. Only a directory named _spark_metadata is created in the output_location folder.
I don't face this problem if I remove the trigger line trigger(processingTime="1 minute"). When I use the default trigger mode, Spark creates a lot of small Parquet files in the output location, which is inconvenient.
Do the two trigger modes processingTime and once support the Parquet file sink?
If I have to use the default trigger mode, how can I handle the gigantic number of tiny files created in my HDFS system?
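Independent of the trigger mode, one common way to deal with many small files is a separate batch job that periodically compacts the sink output into fewer, larger files. A rough sketch (in Scala; the paths and target file count are placeholders, not part of the question):

// Read the streaming sink's output and rewrite it as a small number of larger files.
val input = "hdfs://hadoop.local/draft"
val compacted = "hdfs://hadoop.local/draft_compacted"

spark.read.parquet(input)
  .coalesce(8) // target number of output files, tune to your data volume
  .write
  .mode("overwrite")
  .parquet(compacted)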

Right way to read stream from Kafka topic using checkpointLocation offsets

I'm trying to develop a small Spark app (using Scala) to read messages from Kafka (Confluent) and write (insert) them into a Hive table. Everything works as expected, except for one important feature: managing offsets when the application is restarted (submitted). It confuses me.
A cut from my code:
def main(args: Array[String]): Unit = {
  val sparkSess = SparkSession
    .builder
    .appName("Kafka_to_Hive")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()

  sparkSess.sparkContext.setLogLevel("ERROR")

  // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
  sparkSess.udf.register("deserialize", (bytes: Array[Byte]) =>
    DeserializerWrapper.deserializer.deserialize(bytes)
  )

  val kafkaDataFrame = sparkSess
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("group.id", "kafka-to-hive-1")
    // ------> which Kafka options do I need to set here for starting from the last right offset to ensure completeness of data and "exactly once" writing? <--------
    .option("failOnDataLoss", (false: java.lang.Boolean))
    .option("subscribe", "some_topic")
    .load()

  import org.apache.spark.sql.functions._

  // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
  val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")
  val df = valueDataFrame.select(
      from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
    .select("parsed_value.*")

  df.writeStream
    .foreachBatch((batchDataFrame, batchId) => {
      batchDataFrame.createOrReplaceTempView("`some_view_name`")
      val sqlText = "SELECT * FROM `some_view_name` a where some_field='some value'"
      val batchDataFrame_view = batchDataFrame.sparkSession.sql(sqlText)
      batchDataFrame_view.write.insertInto("default.some_hive_table")
    })
    .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
    .start()
    .awaitTermination()
}
Questions (the questions are related to each other):
Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?
Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")
What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)
"Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?"
You would need to set startingOffsets=latest and clean up the checkpoint files.
"Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")"
Similar to the first question: if you set startingOffsets as a JSON string, you need to delete the checkpoint files first. Otherwise, the Spark application will always use the offsets stored in the checkpoint files and ignore the settings given in the startingOffsets option.
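A minimal sketch of that second approach, reusing the offsets JSON from the question (again, this is only honored when the query starts without existing checkpoint data; the bootstrap servers are a placeholder):

val startingOffsetsJson =
  """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}"""

val kafkaDataFrame = sparkSess
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "some_topic")
  .option("startingOffsets", startingOffsetsJson)
  .load()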
"What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)"
Asking about "the right way" might lead to opinion-based answers and is therefore off-topic on Stack Overflow. Anyway, using Spark Structured Streaming is already a mature and production-ready approach in my experience. However, it is always worth also looking into Kafka Connect.

Does Spark Structured Streaming have some timeout issue when reading streams from a Kafka topic?

I implemented a Spark job to read a stream from a Kafka topic using foreachBatch in Structured Streaming.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "mykafka.broker.io:6667")
  .option("subscribe", "test-topic")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.ssl.truststore.location", "/home/hadoop/cacerts")
  .option("kafka.ssl.truststore.password", tspass)
  .option("kafka.ssl.truststore.type", "JKS")
  .option("kafka.sasl.kerberos.service.name", "kafka")
  .option("kafka.sasl.mechanism", "GSSAPI")
  .option("groupIdPrefix", "MY_GROUP_ID")
  .load()

val streamservice = df.selectExpr("CAST(value AS STRING)")
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")

var stream_df = streamservice
  .selectExpr("cast(id as string) id", "cast(x as int) x")

val monitoring_stream = stream_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!batchDF.isEmpty) { }
  }
  .start()
  .awaitTermination()
I have the following questions.
If the Kafka topic does not have data for a long time, will stream_df.writeStream be terminated automatically? Is there some timeout control on this?
If the Kafka topic is deleted from the Kafka broker, will stream_df.writeStream be terminated?
I hope that the Spark job keeps monitoring the Kafka topic without terminating in the above two cases. Do I need some special settings for the Kafka connector and/or stream_df.writeStream?
If the Kafka topic does not have data for a long time, will stream_df.writeStream be terminated automatically? Is there some timeout control on this?
The termination of the query is independent of the data being processed. Even if no new messages are produced to your Kafka topic the query will keep running, as it is running as a stream.
I guess that is what you have already figured out yourself while testing. We are using Structured Streaming queries to process data from Kafka and they have no issues being idle for a longer time (for example over the weekend, outside of business hours).
If the Kafka topic is deleted from the Kafka broker, will stream_df.writeStream be terminated?
By default, if you delete the Kafka topic while your query is running, an exception is thrown:
ERROR MicroBatchExecution: Query [id = b1f84242-d72b-4097-97c9-ee603badc484, runId = 752b0fe4-2762-4fff-8912-f4cffdbd7bdc] terminated with error
java.lang.IllegalStateException: Partition test-0's offset was changed from 1 to 0, some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
I mentioned "by default" because the query option failOnDataLoss default to true. As explained in the Exception message you could set this to false to let your streaming query running. This option is described in the Structured streaming + Kafka Integration Guide as:
"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected."

How do we manage offsets in Spark Structured Streaming? (Issues with _spark_metadata)

Background:
I have written a simple Spark Structured Streaming app to move data from Kafka to S3. I found that in order to support the exactly-once guarantee, Spark creates a _spark_metadata folder, which ends up growing too large; when the streaming app runs for a long time, the metadata folder grows so big that we start getting OOM errors. I want to get rid of the metadata and checkpoint folders of Spark Structured Streaming and manage offsets myself.
How we managed offsets in Spark Streaming:
I have used val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges to get offsets in Spark Streaming. But I want to know how to get the offsets and other metadata to manage checkpointing ourselves in Spark Structured Streaming. Do you have any sample program that implements checkpointing?
How do we manage offsets in Spark Structured Streaming?
Looking at this JIRA https://issues-test.apache.org/jira/browse/SPARK-18258, it looks like offsets are not provided. How should we go about it?
The issue is that within 6 hours the metadata grows to 45 MB, and it keeps growing until it reaches nearly 13 GB. The allocated driver memory is 5 GB, and at that point the system crashes with an OOM. I'm wondering how to keep this metadata from growing so large, and how to make the metadata not log so much information.
Code:
1. Reading records from Kafka topic
Dataset<Row> inputDf = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .option("startingOffsets", "earliest")
    .load();
2. Use from_json API from Spark to extract your data for further transformation in a dataset.
Dataset<Row> dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
....withColumn("oem_id", col("metadata.oem_id"));
3. Construct a temp table of above dataset using SQLContext
SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");
4. Flatten events since Parquet does not support hierarchical data.
5. Store output in parquet format on S3
StreamingQuery query = flatDf.writeStream().format("parquet")

Dataset dataDf = inputDf.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("event"))
    .select("event.metadata", "event.data", "event.connection", "event.registration_event", "event.version_event");

SQLContext sqlContext = new SQLContext(sparkSession);
dataDf.createOrReplaceTempView("event");

Dataset flatDf = sqlContext
    .sql("select " + " date, time, id, " + flattenSchema(EVENT_SCHEMA, "event") + " from event");

StreamingQuery query = flatDf
    .writeStream()
    .outputMode("append")
    .option("compression", "snappy")
    .format("parquet")
    .option("checkpointLocation", checkpointLocation)
    .option("path", outputPath)
    .partitionBy("date", "time", "id")
    .trigger(Trigger.ProcessingTime(triggerProcessingTime))
    .start();

query.awaitTermination();
For non-batch Spark Structured Streaming KAFKA integration:
Quote:
Structured Streaming ignores the offsets commits in Apache Kafka. Instead, it relies on its own offsets management on the driver side, which is responsible for distributing offsets to executors and for checkpointing them at the end of the processing round (epoch or micro-batch).
You need not worry if you follow the Spark KAFKA integration guides.
Excellent reference: https://www.waitingforcode.com/apache-spark-structured-streaming/apache-spark-structured-streaming-apache-kafka-offsets-management/read
For batch the situation is different, you need to manage that yourself and store the offsets.
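If you want to see the offsets that Structured Streaming tracks for each micro-batch (for example to log them or store them somewhere yourself), one option is a StreamingQueryListener. A minimal Scala sketch, attached to your SparkSession (called sparkSession in the question's code):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

sparkSession.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Each SourceProgress carries the Kafka offsets of this micro-batch as a JSON
    // string, e.g. {"topic1":{"0":1234,"1":5678}}; persist them wherever you like.
    event.progress.sources.foreach { source =>
      println(s"batch=${event.progress.batchId} start=${source.startOffset} end=${source.endOffset}")
    }
  }
})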
UPDATE
Based on the comments I suggest the question is slightly different, and advise you to look at Spark Structured Streaming Checkpoint Cleanup. In addition to your updated comments and the fact that there is no error, I suggest you consult this on metadata for Spark Structured Streaming: https://www.waitingforcode.com/apache-spark-structured-streaming/checkpoint-storage-structured-streaming/read. Looking at the code, it is different to my style, but I cannot see any obvious error.

Configuring StreamingContext in Apache Zeppelin

My goal is to read streaming data from a stream (in my case AWS Kinesis) and then query the data. The problem is that I want to query the last 5 minutes of data on every batch interval. What I found is that it is possible to keep the data in a stream for a certain period (using the StreamingContext.remember(Duration duration) method). Zeppelin's Spark interpreter creates the SparkSession automatically and I don't know how to configure the StreamingContext. Here's what I do:
val df = spark
  .readStream
  .format("kinesis")
  .option("streams", "test")
  .option("endpointUrl", "kinesis.us-west-2.amazonaws.com")
  .option("initialPositionInStream", "latest")
  .option("format", "csv")
  .schema(/* schema definition */)
  .load
So far so good. Then, as far as I can see, the streaming context is started when the write stream is set up and started:
df.writeStream
  .format(/* output source */)
  .outputMode("complete")
  .start()
But having only the SparkSession, I don't know how to achieve a query over the last X minutes of data. Any suggestions?
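StreamingContext.remember belongs to the older DStream API. With a SparkSession and Structured Streaming, the usual way to express "the last 5 minutes" is an event-time window with a watermark. A rough sketch, assuming the schema contains a timestamp column called eventTime (an assumed name, adjust to your actual schema):

import org.apache.spark.sql.functions.{col, count, lit, window}

// Sliding 5-minute windows, advancing every minute, over the Kinesis stream.
val lastFiveMinutes = df
  .withWatermark("eventTime", "5 minutes") // allows Spark to drop state for old windows
  .groupBy(window(col("eventTime"), "5 minutes", "1 minute"))
  .agg(count(lit(1)).as("events"))

lastFiveMinutes.writeStream
  .outputMode("update") // emit only the windows that changed in this trigger
  .format("console")
  .start()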
