How to send data from kafka to spark - apache-spark

I want to send my data from kafka to Spark.
I have installed spark in my system and kafka is also working in my system in proper way.

You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark. In fact, Spark pull the data from Kafka.
Here the link from the documentation : https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

Related

Calculating Kafka lag in spark structured streaming application

I am trying to calculate Kafka lag on my spark structured streaming application.
I can get the current processed offset from my Kafka metadata which comes along with actual data.
Is there a way through which we can get the latest offsets of all partitions in a Kafka topic programmatically from spark interface ?
Can I use Apache Kafka admin classes or Kafka interfaces to get the latest offset information for each batch in my spark app ?

Spark Offset Management in Kafka

I am using Spark Structured Streaming (Version 2.3.2). I need to read from Kafka Cluster and write into Kerberized Kafka.
Here I want to use Kafka as offset checkpointing after the record is written into Kerberized Kafka.
Questions:
Can we use Kafka for checkpointing to manage offset or do we need to use only HDFS/S3 only?
Please help.
Can we use Kafka for checkpointing to manage offset
No, you cannot commit offsets back to your source Kafka topic. This is described in detail here and of course in the official Spark Structured Streaming + Kafka Integration Guide.
or do we need to use only HDFS/S3 only?
Yes, this has to be something like HDFS or S3. This is explained in section Recovering from Failures with Checkpointing of the StructuredStreaming Programming Guide: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."

Is possible send data from Nifi to Spark Structured Streaming/Storm directly without loss data?

In my current scenario; Nifi collects data, then sends to Kafka. Then any streaming engine consumes data from kafka, and analysis it. In this situation; I dont want to use Kafka between Nifi and Streaming Engine. So, I want to send data from Nifi to streaming engine directly. But, I don't know some details here.
For example Spark Structured Streaming; Assumet that I send data from Nifi to Spark Structured Streaming directly, Spark was received this data but then spark's node is down. What happens to data in Spark node? (Do Spark Structured Streaming have any Nifi receiver?), Also, in this case, what is the data guarantee on Spark Structured Streaming?
For example Storm; Storm has Nifi Bolt. But, assume that Storm have received data from Nifi, but then node was down. What happens to the data? Also, in this case, what is the data guarantee on Storm?
In shortly, I want to send data from Nifi to SparkStructuredStreaming/Storm(I'm more likely to used Spark.) directly. But if any node is downs in streaming engine cluster, I dont want to lose data.
Is this possible for Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which is originally made for two NiFi instances to transfer data.
As far as I know there are currently integrations with Storm, Spark streaming, and Flink. I'm not familiar with Spark structured streaming, but I would imagine you could build this integration similar to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure the reason why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NifiReceiver for spark.
Comparing the implementation with the apache-spark documentatation this receiver is fault tolerant, as it should replay data not passed on.

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking if there is a way to load the streaming data from Kafka directly into HDFS using spark streaming and without using Flume.
I have tried it using Flume(Kafka source and HDFS sink) already.
Thanks in Advance!
There is HDFS connector for Kafka Connect. Confluent's documentation have more information.
This is a pretty basic function for Spark Streaming. Depending on what version of spark and Kafka you are using, you can look at the spark streaming kafka integration documentation for the versions you are using. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions

How can I connect to Apache Spark Stream in Mule?

I need to connect to Apache Spark Stream where input will come from Kafka and processed data then go to Cassandra. I tried to find Spark connector but didn't get any result.
Is there any custom connector available ?
How can I use Apache Spark Stream in Mule ?
I need to connect to Apache Spark Stream where input will come from
Kafka and processed data then go to Cassandra.
So you need not a Spark connector, but Kafka connector: https://docs.mulesoft.com/mule-user-guide/v/3.8/kafka-connector

Resources