QueueStream for Structured Streaming possible?

With dStreams, from the official documentation:
Queue of RDDs as a Stream: For testing a Spark Streaming application
with test data, one can also create a DStream based on a queue of
RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed
into the queue will be treated as a batch of data in the DStream, and
processed like a stream.
So, for Structured Streaming, can I or can I not use QueueStream as input?
Not able to find anything in the Structured Streaming Guide for 2.3 or 2.4.
I do note MemoryStream. Is this the way to go? I think so, and if so, why would QueueStream no longer be an option?
I have converted QueueStreams to MemoryStream as input and it works fine, but is that what is required?

My understanding is that for Structured Streaming I cannot use QueueStream, as it produces a DStream.
Simulating streaming input with Structured Streaming does work with MemoryStream.
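MemoryStream itself is a Scala/Java API (org.apache.spark.sql.execution.streaming.MemoryStream) and is not exposed in PySpark, so for a Python-based test the built-in rate source is a common stand-in for simulating streaming input. A minimal sketch, with the rows-per-second value and console sink purely illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simulated-input").getOrCreate()

# The built-in "rate" source generates rows with (timestamp, value) columns
# at a configurable rate, which is handy for simulating streaming input in tests.
test_stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 5) \
    .load()

query = test_stream.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()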

Related

Trigger.Once Spark Structured Streaming with KAFKA possible?

Does Spark Structured Streaming using Trigger.Once allow for a direct connection to Kafka and use of the MERGE statement? Or must the data for this come from a Delta table?
This https://docs.databricks.com/_static/notebooks/merge-in-scd-type-2.html assumes tables as input. I cannot find an example with Kafka being used with Trigger.Once. OK, the weekend is coming and I will fire up this and that, but it is an interesting point that I would like to know in advance.
Yes, it's possible to use Trigger.Once (or better, the newer Trigger.AvailableNow) with Kafka, and then use foreachBatch to execute MERGE.
The only thing that you need to take into account is that the data shouldn't expire from Kafka (be removed by topic retention) between executions.
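As a rough illustration of that pattern, here is a minimal PySpark sketch, assuming the delta-spark package and the Kafka connector are available; the broker address, topic name, target table name, merge condition and checkpoint path are all placeholders, not taken from the question.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-merge").getOrCreate()

# Read from Kafka as a streaming DataFrame (broker/topic are placeholders).
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "events") \
    .load() \
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

def merge_batch(batch_df, batch_id):
    # Upsert each micro-batch into an existing Delta table named "target";
    # the join condition is purely illustrative.
    target = DeltaTable.forName(spark, "target")
    target.alias("t") \
        .merge(batch_df.alias("s"), "t.key = s.key") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

query = kafka_stream.writeStream \
    .foreachBatch(merge_batch) \
    .option("checkpointLocation", "/tmp/checkpoints/kafka_merge") \
    .trigger(availableNow=True) \
    .start()

query.awaitTermination()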

Why Spark Structured Streaming is ideal for real-time operations?

I want to construct a real-time application but I don't know if I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications but it is not clear why...
Can someone explain it?
Spark Streaming works on something we call a micro-batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinctions from Spark Streaming which make it more inclined towards real streaming.
For developers, the main thing to keep in mind is that in Spark Streaming you work with RDDs, but in Spark Structured Streaming you get DataFrames and Datasets.
If you want very low-level (i.e. per-record) operations, go for RDDs (i.e. Spark Streaming); but if your application can be built on DataFrames and query them like SQL in real time, then go for DataFrames (i.e. Spark Structured Streaming).
In any case, RDDs can be converted to DataFrames and vice versa.
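To make the DataFrame/SQL point concrete, here is a minimal Structured Streaming word count sketched against a socket source; the host/port and the console sink are placeholders for a real source and sink.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# A socket source is used only as a simple stand-in for a real input source.
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# The query is expressed with DataFrame/SQL operations, not per-record RDD code.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()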

What is the difference between Spark Structured Streaming and DStreams?

I have been trying to find materials online - both are micro-batch based - so what's the difference?
Brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame):
Spark Streaming is based on DStream. A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Spark Streaming has the following problems.
Difficult - it was not simple to build streaming pipelines supporting delivery policies: exactly-once guarantees, handling late data arrival, or fault tolerance. Sure, all of them were implementable, but they needed some extra work on the part of programmers.
Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for stream processing (DStream). Sure, nothing blocking to code against, but it's always simpler (especially in maintenance cost) to deal with as few abstractions as possible.
(Spark Streaming flow diagram)
Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.
More concretely, structured streaming brought some new concepts to Spark.
exactly-once guarantee - Structured Streaming focuses on that concept. It means that data is processed only once and the output doesn't contain duplicates.
event time - one of the observed problems with DStream streaming was processing order, i.e. the case when data generated earlier was processed after later-generated data. Structured Streaming handles this problem with a concept called event time that, under some conditions, allows late data to be aggregated correctly in processing pipelines.
Sink, Result Table, output mode and watermark are other features of Spark Structured Streaming.
(Spark Structured Streaming flow diagram)
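As a rough sketch of the event-time and watermark concepts mentioned above, the example below counts events per 1-minute event-time window and uses a 5-minute watermark to bound how long late data is waited for; the rate source and its timestamp column simply stand in for a real event-time column.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("event-time-demo").getOrCreate()

# The rate source's "timestamp" column plays the role of an event-time column here.
events = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 10) \
    .load()

# Count events per 1-minute event-time window; the watermark tells Spark how long
# to keep a window open for late-arriving data before finalizing it.
windowed_counts = events \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(window(col("timestamp"), "1 minute")) \
    .count()

query = windowed_counts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()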
Until Spark 2.2, DStream[T] was the abstract data type for streaming data, which can be viewed as a continuous sequence of RDD[T]. From Spark 2.2 onwards, the Dataset/DataFrame abstraction embodies both batch (cold) as well as streaming data.
From the docs
Discretized Streams (DStreams) Discretized Stream or DStream is the
basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from
source, or the processed data stream generated by transforming the
input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable,
distributed dataset (see Spark Programming Guide for more details).
Each RDD in a DStream contains data from a certain interval, as shown
in the following figure.
API using Datasets and DataFrames Since Spark 2.0, DataFrames and
Datasets can represent static, bounded data, as well as streaming,
unbounded data. Similar to static Datasets/DataFrames, you can use the
common entry point SparkSession (Scala/Java/Python/R docs) to create
streaming DataFrames/Datasets from streaming sources, and apply the
same operations on them as static DataFrames/Datasets. If you are not
familiar with Datasets/DataFrames, you are strongly advised to
familiarize yourself with them using the DataFrame/Dataset Programming
Guide.
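To illustrate the "same operations on static and streaming DataFrames" point from the docs excerpt above, here is a rough sketch; the JSON paths and column names are invented purely for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("same-api").getOrCreate()

# The same transformation is defined once and applied to both a bounded (batch)
# DataFrame and an unbounded (streaming) one. Paths and columns are placeholders.
def high_value_orders(df):
    return df.filter(col("amount") > 100).select("order_id", "amount")

static_df = spark.read.json("/data/orders/")               # bounded input
streaming_df = spark.readStream \
    .schema(static_df.schema) \
    .json("/data/orders_incoming/")                        # unbounded input, same schema

static_result = high_value_orders(static_df)               # ordinary batch query
streaming_result = high_value_orders(streaming_df)         # same code as a streaming query

query = streaming_result.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()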

Spark Stateful Streaming with DataFrame

Is it possible to use DataFrame as a State / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only key-value pair data structures (mapWithState, etc.).
My objective is to keep a fixed size FIFO buffer as a StateSpec that gets updated every time new data streams in. I'd like to implement the buffer in Spark DataFrame API, for compatibility with Spark ML.
I'm not entirely sure you can do this with Spark Streaming, but with the newer Dataframe-based Spark Structured Streaming you can express queries that get updated over time, given an incoming stream of data.
You can read more about Spark Structured Streaming in the official documentation.
If you are interested in interoperability with SparkML to deploy a trained model, you may also be interested in this article.
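Not a FIFO buffer, but as a rough sketch of a query whose result is updated as data streams in, here is a keyed running aggregate in Structured Streaming; Spark maintains the per-key state internally across micro-batches. The socket source and the choice of key are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("running-aggregate").getOrCreate()

# Placeholder source; each incoming line is treated as a key purely for illustration.
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# A running count per key; Spark keeps the aggregation state between micro-batches.
running_counts = lines.groupBy(col("value").alias("key")).count()

query = running_counts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()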

Apache Kafka and Spark Streaming

I'm reading through this blog post:
http://blog.jaceklaskowski.pl/2015/07/20/real-time-data-processing-using-apache-kafka-and-spark-streaming.html
It discusses using Spark Streaming and Apache Kafka to do some near real time processing. I completely understand the article. It does show how I could use Spark Streaming to read messages from a topic. I would like to know if there is a Spark Streaming API that I can use to write messages into a Kafka topic?
My use case is pretty simple. I have a set of data that I can read from a given source at a constant interval (say every second). I do this using reactive streams. I would like to do some analytics on this data using Spark. I want to have fault tolerance, so Kafka comes into play. So what I would essentially do is the following (please correct me if I'm wrong):
Using reactive streams get the data from external source at constant intervals
Pipe the result into Kafka topic
Using Spark Streaming, create the streaming context for the consumer
Perform analytics on the consumed data
One other question though: is the Streaming API in Spark an implementation of the reactive streams specification? Does it have back pressure handling (Spark Streaming v1.5)?
No, at the moment, none of Spark Streaming's built-in receiver APIs are an implementation of the Reactive Streams specification. But there is an issue for that which you will want to follow.
But Spark Streaming 1.5 has internal back-pressure-based dynamic throttling. There's some work to extend that beyond throttling in the pipeline. This throttling is compatible with the Kafka direct stream API.
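For reference, the dynamic throttling mentioned above is switched on via a Spark configuration flag; a minimal DStream-era sketch, with the application name and batch interval chosen arbitrarily:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Enable back-pressure-based dynamic rate control (available since Spark Streaming 1.5).
conf = SparkConf() \
    .setAppName("backpressure-demo") \
    .set("spark.streaming.backpressure.enabled", "true")

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)  # 1-second batch interval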
You can write to Kafka in a Spark Streaming application, here's one example.
(Full disclosure: I'm one of the implementers of some of the back-pressure work)
If you have to write the result stream to another Kafka topic, let's say 'topic_x', you must first have columns named 'key' and 'value' in the result stream that you are trying to write to topic_x.
# Kafka expects string (or binary) 'key' and 'value' columns in the output rows.
result_stream = result_stream.selectExpr('CAST(key AS STRING)', 'CAST(value AS STRING)')

# Start the streaming write to the target topic; the broker address is a placeholder.
kafkaOutput = result_stream \
    .writeStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '192.X.X.X:9092') \
    .option('topic', 'topic_x') \
    .option('checkpointLocation', './resultCheckpoint') \
    .start()

kafkaOutput.awaitTermination()
For more details check the documentation at https://spark.apache.org/docs/2.4.1/structured-streaming-kafka-integration.html
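For the consumer side of the pipeline (step 3 of the question), a Structured Streaming read from the same topic, rather than a DStream-based StreamingContext, might look like the sketch below; the broker address and topic mirror the placeholders used in the write example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-consumer").getOrCreate()

# Subscribe to the topic written above; broker and topic are placeholders.
input_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "192.X.X.X:9092") \
    .option("subscribe", "topic_x") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# From here, analytics can be expressed as ordinary DataFrame operations.
query = input_stream.writeStream \
    .format("console") \
    .start()

query.awaitTermination()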
