Spark Offset Management in Kafka - apache-spark

I am using Spark Structured Streaming (Version 2.3.2). I need to read from Kafka Cluster and write into Kerberized Kafka.
Here I want to use Kafka as offset checkpointing after the record is written into Kerberized Kafka.
Questions:
Can we use Kafka for checkpointing to manage offset or do we need to use only HDFS/S3 only?
Please help.

Can we use Kafka for checkpointing to manage offset
No, you cannot commit offsets back to your source Kafka topic. This is described in detail here and of course in the official Spark Structured Streaming + Kafka Integration Guide.
or do we need to use only HDFS/S3 only?
Yes, this has to be something like HDFS or S3. This is explained in section Recovering from Failures with Checkpointing of the StructuredStreaming Programming Guide: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."

Related

How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?

I have a long running spark structured streaming job which is ingesting kafka data. I have one concern as below. If the job is failed due to some reason and restart later, how to ensure kafka data will be ingested from the breaking point instead of always ingesting current and later data when the job is restarting. Do I need to specifiy explicitly something like consumer group and auto.offet.reset, etc? Are they supported in spark kafka ingestion? Thanks!
According to the Spark Structured Integration Guide, Spark itself is keeping track of the offsets and there are no offsets committed back to Kafka. That means if your Spark Streaming job fails and you restart it all necessary information on the offsets is stored in Spark's checkpointing files. That way your application will know where it left off and continue to process the remaining data.
I have written more details about setting group.id and Spark's checkpointing of offsets in another post
Here are the most important Kafka specific configurations for your Spark Structured Streaming jobs:
group.id: Kafka source will create a unique group id for each query automatically. According to the code the group.id will automatically be set to
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}
auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it
enable.auto.commit: Kafka source doesn’t commit any offset.
Therefore, in Structured Streaming it is currently not possible to define your custom group.id for Kafka Consumer and Structured Streaming is managing the offsets internally and not committing back to Kafka (also not automatically).

Is possible send data from Nifi to Spark Structured Streaming/Storm directly without loss data?

In my current scenario; Nifi collects data, then sends to Kafka. Then any streaming engine consumes data from kafka, and analysis it. In this situation; I dont want to use Kafka between Nifi and Streaming Engine. So, I want to send data from Nifi to streaming engine directly. But, I don't know some details here.
For example Spark Structured Streaming; Assumet that I send data from Nifi to Spark Structured Streaming directly, Spark was received this data but then spark's node is down. What happens to data in Spark node? (Do Spark Structured Streaming have any Nifi receiver?), Also, in this case, what is the data guarantee on Spark Structured Streaming?
For example Storm; Storm has Nifi Bolt. But, assume that Storm have received data from Nifi, but then node was down. What happens to the data? Also, in this case, what is the data guarantee on Storm?
In shortly, I want to send data from Nifi to SparkStructuredStreaming/Storm(I'm more likely to used Spark.) directly. But if any node is downs in streaming engine cluster, I dont want to lose data.
Is this possible for Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which is originally made for two NiFi instances to transfer data.
As far as I know there are currently integrations with Storm, Spark streaming, and Flink. I'm not familiar with Spark structured streaming, but I would imagine you could build this integration similar to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure the reason why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NifiReceiver for spark.
Comparing the implementation with the apache-spark documentatation this receiver is fault tolerant, as it should replay data not passed on.

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking if there is a way to load the streaming data from Kafka directly into HDFS using spark streaming and without using Flume.
I have tried it using Flume(Kafka source and HDFS sink) already.
Thanks in Advance!
There is HDFS connector for Kafka Connect. Confluent's documentation have more information.
This is a pretty basic function for Spark Streaming. Depending on what version of spark and Kafka you are using, you can look at the spark streaming kafka integration documentation for the versions you are using. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions

How to send data from kafka to spark

I want to send my data from kafka to Spark.
I have installed spark in my system and kafka is also working in my system in proper way.
You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark. In fact, Spark pull the data from Kafka.
Here the link from the documentation : https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

How to integrate kafka and spark streaming in Datastax Enterprise Edition?

I've integrated kafka and spark streaming after downloading from the apache website. However, I wanted to use Datastax for my Big Data solution and I saw you can easily integrate Cassandra and Spark.
But I can't see any kafka modules in the latest version of Datastax enterprise. How to integrate kafka with spark streaming here?
What I want to do is basically:
Start necessary brokers and servers
Start kafka producer
Start kafka consumer
Connect spark streaming to kafka broker and receive the messages from there
However after a quick google search, I can't see anywhere that kafka has been incorporated with datastax enterprise.
How can I achieve this? I'm really new to datastax and kafka and all so I need some advice. Language preference- Python.
Thanks!
Good question. DSE does not incorporate Kafka out of the box, you must set up kafka yourself and then set up your spark streaming job to read from kafka. Since DSE does bundle spark, use DSE Spark to run your spark streaming job.
You can use either the direct kafka API or kafka receivers, more details here on the tradeoffs. TL;DR direct api does not require WAL or zookeeper for HA.
Here is an example of how you can configure Kafka to work with DSE by Cary Bourgeois:
https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master

Resources