How to get committedOffsets and availableOffsets from Spark Streaming - apache-spark

22/11/09 11:08:40 INFO MicroBatchExecution: Resuming at batch 206 with committed offsets {KafkaV2[Subscribe[test]]:
{"test":{"0":3086,"1":3086,"2":3086,"3":3086,"4":3086,"5":3086,"6":3086,"7":3086,"8":3086,"9":3086,"10":3086,
"11":3086,"12":3086,"13":3086,"14":3086,"15":3086,"16":3086,"17":3086,"18":3086,"19":3086,"20":3086,"21":3086,"22":3086,"23":3086,
24":3086,"25":3086,"26":3086,"27":3086,"28":3086,"29":3086,"30":3086,"31":3886,"32":3086,"33":3086,"34":3086,"35":3086,"36":3086,
"37":3086,"38":3086,"39":3086,"40":3086,"41":3086,"42":3086,"43":3086,"44":3086,"45":3086,"46":3086,"47":3086,"48":3086,"49":3086}}}
and available offsets {KafkaV2[Subscribe[test]]: {"test":{"0":3105,"1":3105,"2":3105,"3":3105,"4":3105,
"5":3105,"6":3105,"7":3105,"8":3105,"9":3105,"10":3105,"11":3105,"12":3105,"13":3105,"14":3105,"15":3105,"16":3105,"17":3105,"18":3105,"19":3105,
"20":3105,"21":3105,"22":3105,"23":3105,"24":3105,"25":3105,"26":3105,"27":3105,"28":3105,"29":3105,"30":3105,"31":3910,"32":3105,"33":3105,"34":3105,
"35":3105,"36":3105,"37":3105,"38":3105,"39":3105,"40":3105,"41":3105,"42":3105,"43":3105,"44":3105,"45":3105,"46":3105,"47":3105,"48":3105,"49":3105}}}
This offset collection comes from the Spark Streaming checkpoint.
How can I get this offset collection in code?
Is there a property in Spark Streaming that exposes it?
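One way to observe these values from code is the streaming progress API, which reports each source's start and end offsets for every micro-batch (the same JSON shown in the log above). A minimal sketch in Scala, assuming spark is an active SparkSession and query is the running StreamingQuery:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Register a listener that prints per-source offset ranges after each micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { s =>
      // startOffset/endOffset are JSON strings like {"test":{"0":3086,...}}
      println(s"source=${s.description} start=${s.startOffset} end=${s.endOffset}")
    }
  }
})

// Alternatively, poll the most recent progress of a running query:
// Option(query.lastProgress).foreach(p => println(p.json))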

Related

What is the meaning of "OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions"?

I use Apache Spark 2.4.1 and the Kafka data source.
Dataset<Row> df = sparkSession
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "latest")
.option("auto.offset.reset", "earliest")
.load();
I have two sinks: the raw data is stored in an HDFS location, and after a few transformations the final data is stored in a Cassandra table. The checkpointLocation is an HDFS directory.
When the streaming query starts, it gives the warning below:
2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - Some data may be lost. Recovering from the earliest offset: 470021
2019-12-10 08:20:38,926 [Executor task launch worker for task 639] WARN org.apache.spark.sql.kafka010.InternalKafkaConsumer - The current available offset range is AvailableOffsetRange(470021,470021). Offset 62687 is out of range, and records in [62687, 62727) will be skipped (GroupId: spark-kafka-source-1fba9e33-165f-42b4-a220-6697072f7172-1781964857-executor, TopicPartition: INBOUND-19). Some data may have been lost because they are not available in Kafka any more; either the data was aged out by Kafka or the topic may have been deleted before all the data in the topic was processed. If you want your streaming query to fail on such cases, set the source option "failOnDataLoss" to "true".
I also used auto.offset.reset as latest and startingOffsets as latest.
2019-12-11 08:33:37,496 [Executor task launch worker for task 1059] WARN org.apache.spark.sql.kafka010.KafkaDataConsumer - KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey(spark-kafka-source-93ee3689-79f9-42e8-b1ee-e856570205ae-1923743483-executor,_INBOUND-19)
What is this telling me? How do I get rid of the warning (if possible)?
Some data may be lost. Recovering from the earliest offset: 470021
The above warning happens when your streaming query starts from checkpointed offsets that are no longer available in the topic, i.e. they fall before the earliest offsets Kafka still retains.
In other words, the streaming query uses a checkpointLocation whose state is no longer current, hence the warning (not an error).
That means your query is too slow compared to the topic's cleanup.policy (retention or compaction).
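If you would rather have the query fail than silently skip the aged-out records (as the warning itself suggests), set failOnDataLoss on the source. A minimal sketch in Scala, assuming spark is an active SparkSession; the servers and topic are placeholders, not values from the question:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "INBOUND")                   // placeholder topic
  .option("startingOffsets", "latest")
  // Fail the query when checkpointed offsets fall outside Kafka's retained range,
  // instead of logging the warning and skipping the lost records.
  .option("failOnDataLoss", "true")
  .load()

The underlying fix, though, is the one described above: increase the topic's retention or make the query keep up with it.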

Spark structured streaming from Kafka checkpoint and acknowledgement

In my Spark Structured Streaming application, I am reading messages from Kafka, filtering them, and finally persisting them to Cassandra. I am using Spark 2.4.1. From the Structured Streaming documentation:
Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
But I am not sure how Spark actually achieves this. In my case, if the Cassandra cluster is down and the write operation fails, will the Kafka checkpoint avoid recording those offsets?
Is the Kafka checkpoint offset based only on successful reads from Kafka, or is the entire operation, including the write, considered for each message?
Spark Structured Streaming does not commit offsets to Kafka the way a "normal" Kafka consumer would.
Spark manages the offsets internally with a checkpointing mechanism.
Have a look at the first response to the following question, which gives a good explanation of how the state is managed with checkpoints and the commits log: How to get Kafka offsets for structured query for manual and reliable offset management?
Spark uses multiple log files to ensure fault tolerance.
The ones relevant to your query are the offset log and the commit log.
From the StreamExecution class doc:
/**
* A write-ahead-log that records the offsets that are present in each batch. In order to ensure
* that a given batch will always consist of the same data, we write to this log *before* any
* processing is done. Thus, the Nth record in this log indicated data that is currently being
* processed and the N-1th entry indicates which offsets have been durably committed to the sink.
*/
val offsetLog = new OffsetSeqLog(sparkSession, checkpointFile("offsets"))
/**
* A log that records the batch ids that have completed. This is used to check if a batch was
* fully processed, and its output was committed to the sink, hence no need to process it again.
* This is used (for instance) during restart, to help identify which batch to run next.
*/
val commitLog = new CommitLog(sparkSession, checkpointFile("commits"))
So when Spark reads from Kafka it writes the offsets to the offsetLog, and only after processing the data and writing it to the sink (in your case Cassandra) does it record the completed batch id in the commitLog.
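This also relates to the earlier question about getting the checkpointed offsets in code: the offsets and commits directories under checkpointLocation contain short text files (a version header followed by JSON) named after the batch id. A minimal sketch in Scala, assuming spark is an active SparkSession; the checkpoint path is a placeholder:

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val checkpointDir = "hdfs:///checkpoints/my-query" // placeholder path
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Metadata log files are named after the batch id ("0", "1", ...); pick the newest one.
def latestBatchFile(subdir: String): Option[Path] =
  fs.listStatus(new Path(s"$checkpointDir/$subdir"))
    .map(_.getPath)
    .filter(_.getName.forall(_.isDigit)) // skip temporary files
    .sortBy(_.getName.toLong)
    .lastOption

// offsets/<n> holds the offsets being processed; commits/<n> marks batches committed to the sink.
latestBatchFile("offsets").foreach(p => println(Source.fromInputStream(fs.open(p)).mkString))
latestBatchFile("commits").foreach(p => println(Source.fromInputStream(fs.open(p)).mkString))

Note that these files are internal to Spark and their layout may change between versions; the progress/listener API is the more stable way to observe offsets.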

Spark Structured Streaming, Executor out-of-memory failure due to broadcast accumulation

Our ETL pipeline uses Spark Structured Streaming to enrich incoming data (joining it with static DataFrames) before storing it to Cassandra. Currently the lookup tables are CSV files (in HDFS) which get loaded as DataFrames and joined with each batch of data on every trigger.
It seems the lookup-table DataFrames are broadcast on every trigger and stored in the memory store. This is eating up the executor memory, and eventually the executor faces OOM and is killed by Mesos: Log of executor
As can be seen in the link above, the lookup-table DataFrames to be joined with are being stored as broadcast variables, and the executor is killed due to OOM.
The following is the driver log at the same time:
Driver Log
The following are the Spark configurations:
Spark Conf
Is there a better approach for joining with static datasets in Spark Structured Streaming? Or how can the executor OOM be avoided in the above case?
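For reference, a minimal sketch in Scala of the kind of stream-static join described above; streamingDf, the join key, and the CSV path are placeholders. Reading and caching the lookup DataFrame once outside the query avoids re-reading the CSV on every trigger, though Spark still plans the join (and any broadcast) per micro-batch:

import org.apache.spark.sql.functions.broadcast

// Load the static lookup table once and keep it cached across triggers.
val lookup = spark.read
  .option("header", "true")
  .csv("hdfs:///lookup/dim.csv") // hypothetical path
  .cache()

// Enrich each micro-batch by joining the stream with the static lookup table.
val enriched = streamingDf.join(broadcast(lookup), Seq("key"), "left")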

What is use of "spark.streaming.blockInterval" in Spark Streaming DirectAPI

I want to understand what role "spark.streaming.blockInterval" plays in the Spark Streaming Direct API. As per my understanding, "spark.streaming.blockInterval" is used for calculating partitions, i.e. #partitions = (receivers × batchInterval) / blockInterval, but in the Direct API the number of Spark Streaming partitions equals the number of Kafka partitions.
How is "spark.streaming.blockInterval" used in the Direct API?
spark.streaming.blockInterval :
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
And KafkaUtils.createDirectStream() does not use a receiver.
With directStream, Spark Streaming will create as many RDD partitions
as there are Kafka partitions to consume
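A minimal sketch in Scala of the direct approach, assuming ssc is a running StreamingContext; the bootstrap servers, group id, and topic are placeholders. Because no receiver is involved, spark.streaming.blockInterval has no effect here; each Kafka partition of the subscribed topic maps to one RDD partition per batch:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",   // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group", // placeholder
  "auto.offset.reset"  -> "latest"
)

// Direct stream: offsets are tracked by Spark, not by a receiver,
// so blockInterval-based partitioning does not apply.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("test"), kafkaParams))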

Spark Kafka streaming multiple partition window processing

I am using Spark Kafka streaming with the createDirectStream approach (Direct Approach).
My requirement is to create a window on the stream, but I have seen this in the Spark docs:
"However, be aware that the one-to-one mapping between RDD partition and Kafka partition does not remain after any methods that shuffle or repartition, e.g. reduceByKey() or window()."
How can I solve this, given that my data is getting corrupted?
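What the quoted note means in practice is that anything relying on the RDD-to-Kafka-partition mapping (most commonly reading offset ranges) has to happen on the direct stream before any shuffling operation such as window(). A minimal sketch in Scala, assuming stream is the DStream returned by createDirectStream; the window sizes are placeholders:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

// Capture offsets as the very first operation on the direct stream;
// the cast to HasOffsetRanges only succeeds here, before any shuffle.
val tracked = stream.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}

// Windowing afterwards is fine for the data itself; just don't expect
// the windowed RDDs to carry Kafka partition/offset information.
val windowed = tracked
  .map(record => (record.key, record.value))
  .window(Seconds(60), Seconds(20))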

Resources