Calculating Kafka lag in spark structured streaming application - apache-spark

I am trying to calculate Kafka lag on my spark structured streaming application.
I can get the current processed offset from my Kafka metadata which comes along with actual data.
Is there a way through which we can get the latest offsets of all partitions in a Kafka topic programmatically from spark interface ?
Can I use Apache Kafka admin classes or Kafka interfaces to get the latest offset information for each batch in my spark app ?

Related

Spark Offset Management in Kafka

I am using Spark Structured Streaming (Version 2.3.2). I need to read from Kafka Cluster and write into Kerberized Kafka.
Here I want to use Kafka as offset checkpointing after the record is written into Kerberized Kafka.
Questions:
Can we use Kafka for checkpointing to manage offset or do we need to use only HDFS/S3 only?
Please help.
Can we use Kafka for checkpointing to manage offset
No, you cannot commit offsets back to your source Kafka topic. This is described in detail here and of course in the official Spark Structured Streaming + Kafka Integration Guide.
or do we need to use only HDFS/S3 only?
Yes, this has to be something like HDFS or S3. This is explained in section Recovering from Failures with Checkpointing of the StructuredStreaming Programming Guide: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."

How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?

I have a long running spark structured streaming job which is ingesting kafka data. I have one concern as below. If the job is failed due to some reason and restart later, how to ensure kafka data will be ingested from the breaking point instead of always ingesting current and later data when the job is restarting. Do I need to specifiy explicitly something like consumer group and auto.offet.reset, etc? Are they supported in spark kafka ingestion? Thanks!
According to the Spark Structured Integration Guide, Spark itself is keeping track of the offsets and there are no offsets committed back to Kafka. That means if your Spark Streaming job fails and you restart it all necessary information on the offsets is stored in Spark's checkpointing files. That way your application will know where it left off and continue to process the remaining data.
I have written more details about setting group.id and Spark's checkpointing of offsets in another post
Here are the most important Kafka specific configurations for your Spark Structured Streaming jobs:
group.id: Kafka source will create a unique group id for each query automatically. According to the code the group.id will automatically be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it
enable.auto.commit: Kafka source doesn’t commit any offset.
Therefore, in Structured Streaming it is currently not possible to define your custom group.id for Kafka Consumer and Structured Streaming is managing the offsets internally and not committing back to Kafka (also not automatically).

Spark streaming multiple Kafka topic to multiple Database table with checkpoint

I'm building spark streaming from Apache Kafka to our columnar DataBase.
To ensure fault tolerance I'm using HDFS checkpoint and write ahead log.
Apache Kafka topic -> spark streaming -> HDFS checkpoint-> spark SQl ( for messages manipulation)-> spark jdbc for our Db.
When I'm using spark jobs for one topic and table everything is working file.
I'm trying to stream in one spark job multiple Kafka topics and to write for multiple tables in here started the problem with the checkpoint ( which is per one topic table )
The problem is with checkpoints :(
1) If I will use "KafkaUtils.createDirectStream" with a list of topics and "groupBy" topic name but the checkpoint folder is one and if, for example, I will need to increase resources during the ongoing streaming (change amount of cors due to Kafka Lag) this will be impossible because today it's possible only if I delete the checkpoint folder and restart the spark job.
2) Use multiple spark streamingContext this I will try today and see if it works.
3) Multiple sparkStreaming with high-level consumers ( offset saved in kafka10...)
Any other ideas/solutions that I'm missing
Does structure streams with multiple Kafka topics and checkpoints behave differently?
Thx

How to send data from kafka to spark

I want to send my data from kafka to Spark.
I have installed spark in my system and kafka is also working in my system in proper way.
You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark. In fact, Spark pull the data from Kafka.
Here the link from the documentation : https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

How to process XMLmessage in spark streaming and kafka?

I am new to spark. Consuming message from kafka as xml format in spark streaming. Can you tell me how to process this xml is spark streaming?
Spark Streaming and Kafka documentation is available upstream with examples:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
Here's the compatibility matrix for versions supported. Stick to the stable releases first since you're getting started with streaming:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
You could use this library to process XML records from Spark.
https://github.com/databricks/spark-xml

Resources