Save each Kafka message in HDFS using Spark Streaming - apache-spark

I am using Spark Streaming to do analysis. After the analysis I have to save the Kafka messages in HDFS. Each Kafka message is an XML file. I can't use rdd.saveAsTextFile because it saves the whole RDD, and each element of the RDD is one Kafka message (an XML file). How do I save each RDD element (file) in HDFS using Spark?

I would go about this a different way. Stream your transformed data back into Kafka, and then use the HDFS connector for Kafka Connect to stream the data to HDFS. Kafka Connect is part of Apache Kafka. The HDFS connector is open source and available standalone or as part of Confluent Platform.
Doing it this way, you decouple your processing from writing your data to HDFS, which makes it easier to manage, troubleshoot, and scale.
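That said, if you do want to write each message straight from Spark Streaming, one option is to write one HDFS file per record from inside foreachPartition. The sketch below assumes the spark-streaming-kafka-0-10 integration; the broker address, topic name, and output path are placeholders, not anything from the original question.

import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object XmlMessagesToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("xml-to-hdfs"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",            // placeholder broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "xml-to-hdfs")

    // Direct stream where each record value is one XML document
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("xml-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One FileSystem handle per partition, created on the executor
        val fs = FileSystem.get(new Configuration())
        records.foreach { record =>
          // One file per message; paths resolve against the cluster's default
          // filesystem (HDFS here), and the naming scheme is just an example
          val path = new Path(s"/data/xml/${UUID.randomUUID()}.xml")
          val out = fs.create(path)
          try out.write(record.value().getBytes("UTF-8")) finally out.close()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that writing one small file per message creates a lot of small files on HDFS, which is part of why the Kafka Connect route above is usually easier to operate.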

Related

Multiple Spark streams in a Spark application - logging per stream

I have a Spark application which runs multiple structured streams (Spark 2.3.2).
The problem is that they all log to the same file.
Is there a way to set a log file per Spark stream?

Spark checkpointing has a lot of tmp.crc files

I am using Spark Structured Streaming, where I read a stream from Kafka and, after some transformation, write the resulting stream back to Kafka.
I see a lot of hidden ..*tmp.crc files within my checkpoint directory. These files are not getting cleaned up and keep growing in number.
Am I missing some configuration?
I am not running Spark on Hadoop; I am using an EBS-based volume for checkpointing.

Is it possible to send data from NiFi to Spark Structured Streaming/Storm directly without losing data?

In my current scenario, NiFi collects data and sends it to Kafka, and a streaming engine then consumes the data from Kafka and analyzes it. In this situation I don't want to use Kafka between NiFi and the streaming engine, so I want to send data from NiFi to the streaming engine directly. But I don't know some details here.
For example, Spark Structured Streaming: assume I send data from NiFi to Spark Structured Streaming directly and Spark has received the data, but then the Spark node goes down. What happens to the data on the Spark node? (Does Spark Structured Streaming have a NiFi receiver?) Also, in this case, what is the data guarantee in Spark Structured Streaming?
For example, Storm: Storm has a NiFi Bolt. But assume Storm has received data from NiFi and then the node goes down. What happens to the data? Also, in this case, what is the data guarantee in Storm?
In short, I want to send data from NiFi to Spark Structured Streaming/Storm (I'm more likely to use Spark) directly, but if any node goes down in the streaming engine cluster, I don't want to lose data.
Is this possible with Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which was originally made for two NiFi instances to transfer data.
As far as I know there are currently integrations with Storm, Spark Streaming, and Flink. I'm not familiar with Spark Structured Streaming, but I would imagine you could build this integration similarly to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure the reason why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NiFiReceiver for Spark.
Comparing the implementation with the Apache Spark documentation, this receiver is fault tolerant, as it should replay data that was not passed on.
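For the Spark Streaming side, a minimal sketch of using that receiver could look like the following. It assumes the nifi-spark-receiver artifact is on the classpath, and the site-to-site URL and output port name are placeholders for your NiFi instance.

import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NiFiToSpark {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("nifi-to-spark"), Seconds(10))

    // Site-to-site client config; URL and output port name are placeholders
    val config = new SiteToSiteClient.Builder()
      .url("http://localhost:8080/nifi")
      .portName("Data For Spark")
      .buildConfig()

    // Each received NiFiDataPacket carries the FlowFile content plus its attributes
    val packets = ssc.receiverStream(new NiFiReceiver(config, StorageLevel.MEMORY_AND_DISK))
    packets.map(packet => new String(packet.getContent)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Whether this gives the delivery guarantee asked about still depends on how Spark persists received blocks (for example the receiver write-ahead log), since, as noted above, NiFi cannot replay the data once the site-to-site transaction is confirmed.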

How to send data from Kafka to Spark

I want to send my data from Kafka to Spark.
I have installed Spark on my system, and Kafka is also working properly on my system.
You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark; in fact, Spark pulls the data from Kafka.
Here is the link to the documentation: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
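If you are on Structured Streaming rather than DStreams, the same "Spark pulls from Kafka" idea looks roughly like the sketch below; the broker address and topic name are placeholders.

import org.apache.spark.sql.SparkSession

object KafkaToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-spark").getOrCreate()
    import spark.implicits._

    // Spark pulls records from Kafka; broker and topic are placeholders
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .load()
      .selectExpr("CAST(value AS STRING)")
      .as[String]

    // Echo the stream to the console just to verify data is flowing
    val query = messages.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}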

How can I connect to Apache Spark Stream in Mule?

I need to connect to an Apache Spark stream where the input will come from Kafka and the processed data then goes to Cassandra. I tried to find a Spark connector but didn't get any results.
Is there any custom connector available?
How can I use Apache Spark Streaming in Mule?
I need to connect to an Apache Spark stream where the input will come from Kafka and the processed data then goes to Cassandra.
So you don't need a Spark connector, but a Kafka connector: https://docs.mulesoft.com/mule-user-guide/v/3.8/kafka-connector
