I have a Spark Structured Streaming application that receives Kafka messages. For each message it retrieves initial data from a DB and performs calculations. I want to use GraphX (or GraphFrames) to build a graph for each message and perform calculations on it. I understand how to do this in a simple batch job, but how do I use GraphX with Spark Structured Streaming?
Related
I am trying to calculate Kafka lag in my Spark Structured Streaming application.
I can get the currently processed offset from the Kafka metadata that comes along with the actual data.
Is there a way to get the latest offsets of all partitions in a Kafka topic programmatically from the Spark interface?
Can I use the Apache Kafka admin classes or Kafka interfaces to get the latest offset information for each batch in my Spark app?
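One possible approach, sketched in Python under stated assumptions: the latest (log-end) offsets are fetched outside Spark with the kafka-python client (an assumption; the question does not name a client), and lag per partition is the end offset minus the next offset to be read. The helper names `fetch_end_offsets` and `compute_lag`, and the idea of keeping last-processed offsets in a plain dict, are hypothetical.

```python
def fetch_end_offsets(topic, bootstrap_servers):
    """Return {partition: log-end offset} for a topic.

    Assumes the third-party kafka-python package is installed; it is
    imported lazily so the lag arithmetic below works without it.
    """
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        partition_ids = consumer.partitions_for_topic(topic) or set()
        partitions = [TopicPartition(topic, p) for p in partition_ids]
        end_offsets = consumer.end_offsets(partitions)
        return {tp.partition: offset for tp, offset in end_offsets.items()}
    finally:
        consumer.close()


def compute_lag(end_offsets, last_processed):
    """Return {partition: number of records not yet processed}.

    end_offsets:    {partition: offset of the next record to be written}
    last_processed: {partition: offset of the last record processed}

    Partitions with no processed record are treated as unread from
    offset 0, which ignores retention (a simplifying assumption).
    """
    lag = {}
    for partition, end in end_offsets.items():
        last = last_processed.get(partition, -1)
        lag[partition] = max(end - (last + 1), 0)
    return lag
```

The `last_processed` dict could be built per micro-batch from the `partition` and `offset` columns that the Kafka source attaches to every record, for example by taking the maximum offset per partition.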
I have read a bit about Spark Streaming and would like to know if it's possible to stream data from a custom source, with RabbitMQ as the broker, and feed this data through a Spark stream where Spark's machine learning and graph processing algorithms are applied to it, then send the results to other filesystems, databases, dashboards, or customer receivers.
P.S. I code in Python and have no experience with Spark. Also, can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming. Spark Structured Streaming follows the concept below:
Source (read from RabbitMQ) -- Transformation (apply ML algorithm) -- Sink (write to database)
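The three stages can be sketched in PySpark as follows. This is only an illustration under assumptions: the built-in `rate` test source stands in for RabbitMQ, a trivial `selectExpr` stands in for the ML step, and the `console` sink stands in for the database. The function name `build_pipeline` is hypothetical.

```python
def build_pipeline(spark):
    """Wire source -> transformation -> sink and return the started query.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # Source: the built-in "rate" test source stands in for RabbitMQ.
    # It emits rows with a `timestamp` and a monotonically increasing `value`.
    source = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # Transformation: placeholder for the ML step.
    transformed = source.selectExpr(
        "timestamp", "value", "value % 2 = 0 AS is_even"
    )

    # Sink: the console sink stands in for a database writer.
    return (
        transformed.writeStream.format("console")
        .outputMode("append")
        .start()
    )
```

With PySpark installed, this could be run with a session created by `SparkSession.builder.getOrCreate()`, followed by `build_pipeline(spark).awaitTermination()`.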
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing this without any Spark knowledge might be tricky).
You can start this as a Spark job. You would have to create a wrapper service layer that triggers it as a Spark job (a Spark job launcher), or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
In my current scenario, NiFi collects data and sends it to Kafka. A streaming engine then consumes the data from Kafka and analyzes it. I don't want to use Kafka between NiFi and the streaming engine; I want to send data from NiFi to the streaming engine directly. But I don't know some of the details here.
Take Spark Structured Streaming, for example. Assume that I send data from NiFi directly to Spark Structured Streaming and Spark receives the data, but then the Spark node goes down. What happens to the data on that node? (Does Spark Structured Streaming have a NiFi receiver?) Also, in this case, what delivery guarantee does Spark Structured Streaming provide?
Take Storm, for example. Storm has a NiFi bolt. But assume that Storm has received data from NiFi and then the node goes down. What happens to the data? Also, in this case, what delivery guarantee does Storm provide?
In short, I want to send data from NiFi directly to Spark Structured Streaming or Storm (I'm more likely to use Spark). But if any node in the streaming engine cluster goes down, I don't want to lose data.
Is this possible for Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which was originally made for transferring data between two NiFi instances.
As far as I know, there are currently integrations with Storm, Spark Streaming, and Flink. I'm not familiar with Spark Structured Streaming, but I would imagine you could build this integration similarly to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NiFiReceiver for Spark.
Comparing the implementation with the Apache Spark documentation, this receiver is fault tolerant, as it should replay data that was not passed on.
Do Spark Streaming and Spark Structured Streaming use the same micro-batch scheduler engine? Does Spark Structured Streaming have lower latency than Spark Streaming?
Do Spark Streaming and Spark Structured Streaming use the same micro-batch scheduler engine?
Certainly not. They're different internally, but share the same high-level concepts of a stream and a record.
In Spark Structured Streaming you can get close to how things worked in Spark Streaming by using the DataStreamWriter.foreach or DataStreamWriter.foreachBatch methods.
The main difference is how you describe a streaming pipeline. In Spark Structured Streaming you use Spark SQL's Dataset API, while Spark Streaming builds on Spark Core's RDD API. Both end up as an RDD-based computation, but Spark SQL uses higher-level abstractions (e.g. the Dataset API).
Do they both use a "micro-batch scheduler engine"? Yes, but Spark Structured Streaming is trying to leverage some data sources that can be queried continuously (with no micro-batching).
Does Spark Structured Streaming have lower latency than Spark Streaming?
That'd be hard to answer. The creators of Spark Streaming decided to develop Spark Structured Streaming in the hope of improving query performance and expressiveness. Spark Streaming is no longer recommended.
Structured Streaming is mostly a higher-level abstraction that lets you define your streaming logic; it then uses the Spark SQL engine to execute that logic on the same micro-batch engine.
By default, Structured Streaming uses the micro-batch engine. However, if you are using Spark 2.3+, you can use the continuous mode, where latency can get down to 1 millisecond.
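The choice between the two modes is made with the trigger on the stream writer. Below is a minimal PySpark sketch, assuming `stream_df` is a streaming DataFrame whose source supports continuous processing (for example the `rate` or Kafka source); the function name `start_query` and the one-second intervals are illustrative choices, not recommendations.

```python
def start_query(stream_df, continuous=False):
    """Start a console query in micro-batch or continuous mode.

    `stream_df` is expected to be a streaming pyspark.sql.DataFrame.
    """
    writer = stream_df.writeStream.format("console")
    if continuous:
        # Continuous processing (Spark 2.3+, documented as experimental).
        # The interval is how often progress is checkpointed, not a batch size.
        writer = writer.trigger(continuous="1 second")
    else:
        # Default engine: one micro-batch is scheduled per interval.
        writer = writer.trigger(processingTime="1 second")
    return writer.start()
```

Continuous mode only supports map-like operations such as projections and filters, and only a limited set of sources and sinks, so an existing micro-batch query may not switch over unchanged.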
I would like to push data processed by Spark Structured Streaming to a REST API. Can someone share examples of this? I have found a few, but they all relate to Spark Streaming, not Structured Streaming.
I have not heard of a REST API sink for Spark Structured Streaming, but you could write one yourself. Start from org.apache.spark.sql.execution.streaming.Sink.
The easiest option, however, would be to use DataStreamWriter.foreach or foreachBatch (since 2.4).
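A minimal sketch of the foreachBatch route in PySpark, using only the Python standard library for HTTP. The endpoint URL, the payload shape, and the helper names `post_json` and `send_batch` are assumptions made for illustration. Collecting each micro-batch on the driver is only reasonable for small batches.

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8080/events"  # hypothetical REST endpoint


def post_json(url, payload, opener=urllib.request.urlopen):
    """POST `payload` as JSON and return the HTTP status code."""
    # default=str lets values such as timestamps serialize as strings.
    body = json.dumps(payload, default=str).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=10) as response:
        return response.status


def send_batch(batch_df, batch_id, url=ENDPOINT, opener=urllib.request.urlopen):
    """foreachBatch callback: send one micro-batch to the REST API.

    Spark calls this with the micro-batch DataFrame and its batch id.
    Returns the HTTP status, or None when the batch is empty.
    """
    rows = [row.asDict() for row in batch_df.collect()]
    if not rows:
        return None
    return post_json(url, {"batch_id": batch_id, "rows": rows}, opener)
```

Given a streaming DataFrame `df`, the callback would be attached with `df.writeStream.foreachBatch(send_batch).start()`. Because a micro-batch can be retried after a failure, the receiving API may see the same `batch_id` more than once and could use it to deduplicate.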