Spark Stateful Streaming with DataFrame - apache-spark

Is it possible to use DataFrame as a State / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only key-value pair data structure (mapWithState etc..).
My objective is to keep a fixed size FIFO buffer as a StateSpec that gets updated every time new data streams in. I'd like to implement the buffer in Spark DataFrame API, for compatibility with Spark ML.

I'm not entirely sure you can do this with Spark Streaming, but with the newer DataFrame-based Spark Structured Streaming you can express queries that get updated over time, given an incoming stream of data.
You can read more about Spark Structured Streaming in the official documentation.
If you are interested in interoperability with SparkML to deploy a trained model, you may also be interested in this article.
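For reference, here is a minimal sketch (not a definitive implementation) of keeping a fixed-size FIFO buffer per key with Structured Streaming's mapGroupsWithState. The socket source, the Event/Buffer case classes, the comma-separated input format and the buffer size of 100 are all illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Illustrative types -- not part of any Spark API.
case class Event(deviceId: String, value: Double)
case class Buffer(deviceId: String, values: Seq[Double])

val spark = SparkSession.builder.appName("fifo-state-sketch").getOrCreate()
import spark.implicits._

// Assumed source: one "deviceId,value" line per event over a socket.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(id, v) = line.split(",")
    Event(id, v.toDouble)
  }

// Keep only the last 100 values per device (FIFO semantics) as keyed state.
val buffers = events
  .groupByKey(_.deviceId)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (id: String, batch: Iterator[Event], state: GroupState[Seq[Double]]) =>
      val old     = state.getOption.getOrElse(Seq.empty[Double])
      val updated = (old ++ batch.map(_.value)).takeRight(100)
      state.update(updated)
      Buffer(id, updated)
  }

// The emitted buffers form a regular Dataset, so they can be handed to Spark ML code.
buffers.writeStream
  .outputMode("update")
  .format("console")
  .start()

The state itself is a plain Scala collection rather than a DataFrame, which matches how the Structured Streaming state APIs are designed.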

Related

Why Spark Structured Streaming is ideal for real-time operations?

I wanna construct a real-time application but I don't know if I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications but is not clear why...
Can someone explain it?
Spark Streaming works on something we call a micro-batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinctions from Spark Streaming which make it more inclined towards real streaming.
For developers, all they need to worry about is that with Spark Streaming you work with RDDs, but with Spark Structured Streaming you get DataFrames and Datasets.
If you want very low-level (i.e. per-record) operations, go for RDDs (i.e. Spark Streaming); but if your application can be built on DataFrames and query them like SQL in real time, then go for DataFrames (i.e. Spark Structured Streaming).
Eventually RDDs can be converted to DataFrames and vice versa.
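To illustrate that last point, a tiny sketch (names are arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rdd-df-conversion").getOrCreate()
import spark.implicits._

// RDD -> DataFrame
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
val df  = rdd.toDF("key", "value")

// DataFrame -> RDD[Row]
val backToRdd = df.rdd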

Zeppelin with Spark Structured Streaming Example

I am trying to visualize Spark structured streams in Zeppelin. I am able to achieve this using the memory sink, but it is not a reliable solution for high data volumes. What would be a better solution?
Example implementation or demo would be helpful.
Thanks,
Rilwan
Thanks for asking the question!! Having 2+ years of experience developing Spark monitoring tools, I think I will be able to resolve your doubt!!
There are two types of processing available when data is coming to spark as stream.
Discretized Stream or DStream: In this mode, Spark provides you data in RDD format and you have to write your own logic to handle the RDD.
Pros:
1. If you want to do some processing before saving the streaming data, RDD is the best way to handle it compared to DataFrame.
2. DStream provides you a nice Streaming UI where it graphically shows how much data has been processed. Check this link - https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html#monitoring-applications
Cons:
1. Handling raw RDDs is not so convenient and easy.
Structured Stream: In this mode, Spark provides you data in a DataFrame format, and you need to mention where to store/send the data.
Pros:
1. Structured Streaming comes with some predefined sources and sinks which are very common, and 95% of real-life scenarios can be resolved by plugging in these. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Cons:
1. There is no Streaming UI available with Structured Streaming :( Although you can get the metrics and create your own UI. Check this link - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
You can also store the metrics in some plaintext file, read the file in Zeppelin through spark.read.json, and plot your own graph.
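A rough sketch of that last idea, assuming a SparkSession named spark is already in scope and /tmp/query-progress.json is an arbitrary output path:

import java.io.{File, FileWriter}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Append each progress report of every streaming query as one JSON line.
val metricsFile = new File("/tmp/query-progress.json")

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val writer = new FileWriter(metricsFile, true)   // append mode
    try writer.write(event.progress.json + "\n")
    finally writer.close()
  }
})

// Later, in a Zeppelin paragraph:
// val metrics = spark.read.json("/tmp/query-progress.json")
// metrics.select("timestamp", "numInputRows", "inputRowsPerSecond").show()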

QueueStream for Structured Streaming possible?

With dStreams, from the official documentation:
Queue of RDDs as a Stream: For testing a Spark Streaming application
with test data, one can also create a DStream based on a queue of
RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed
into the queue will be treated as a batch of data in the DStream, and
processed like a stream.
So, for Structured Streaming, can I or can I not use QueueStream as input?
Not able to find anything in the Structured Streaming Guide 2.3 or 2.4.
I do note MemoryStream. Is this the way to go? I think so, and if so, why would QueueStream not be an option anymore?
I have converted QueueStreams to MemoryStream as input and it works fine, but is that what is required?
My understanding is that for Structured Streaming I cannot use QueueStream - as it is a DStream construct.
Simulating streaming input with Structured Streaming does work with MemoryStream.
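For completeness, a minimal MemoryStream sketch (the query name and test data are illustrative; note that MemoryStream lives in an internal package, org.apache.spark.sql.execution.streaming, so it is intended for testing):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder.appName("memory-stream-test").master("local[*]").getOrCreate()
import spark.implicits._
implicit val sqlCtx = spark.sqlContext   // MemoryStream needs an implicit SQLContext

val input  = MemoryStream[Int]
val counts = input.toDF().groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("counts")
  .start()

input.addData(1, 2, 2, 3)       // push a "batch" of test data
query.processAllAvailable()     // block until the batch has been processed
spark.sql("select * from counts").show()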

How to convert streaming Dataset to DStream?

Is it possible to convert a streaming o.a.s.sql.Dataset to DStream? If so, how?
I know how to convert it to RDD, but it is in a streaming context.
It is not possible. Structured Streaming and legacy Spark Streaming (DStreams) use completely different semantics and are not compatible with each other so:
DStream cannot be converted to Streaming Dataset.
Streaming Dataset cannot be converted to DStream.
It could be possible (in some use cases).
That question really begs another:
Why would anyone want to do that conversion? What's the problem to be solved?
I can only imagine that such a type conversion would only be required when mixing two different APIs in a single streaming application. I'd then say it does not make much sense, as you'd rather not do this and instead make the conversion at the Spark module level, i.e. migrate the streaming application from Spark Streaming to Spark Structured Streaming.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between streaming and batch Datasets is the isStreaming property of a Dataset).
It is possible to convert a DStream to a streaming Dataset so the latter behaves as the former (to keep the behaviour of the DStream and pretend to be a streaming Dataset).
Under the covers, the execution engines of Spark Streaming (DStream) and Spark Structured Streaming (streaming Dataset) are fairly similar. They both "generate" micro-batches of RDDs and Datasets, respectively. And RDDs are convertible to Datasets via the implicit conversions toDF or toDS.
So converting a DStream to a streaming Dataset would logically look as follows:
dstream.foreachRDD { rdd =>
  // requires `import spark.implicits._` for the toDF conversion
  val df = rdd.toDF()
  // this df is a plain batch DataFrame, not a streaming one,
  // but you don't really need that here
}

What is the difference between Spark Structured Streaming and DStreams?

I have been trying to find materials online - both are micro-batch based - so what's the difference?
Brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame):
Spark Streaming is based on DStream. A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Spark Streaming has the following problems.
Difficult - it was not simple to build streaming pipelines supporting delivery policies: exactly-once guarantee, handling late data arrival, or fault tolerance. Sure, all of them were implementable, but they needed some extra work on the part of programmers.
Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for streaming processing (DStream). Sure, nothing that blocks coding, but it's always simpler (especially in maintenance cost) to deal with as few abstractions as possible.
[Spark Streaming flow diagram]
Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.
More concretely, structured streaming brought some new concepts to Spark.
exactly-once guarantee - structured streaming focuses on that concept. It means that data is processed only once and output doesn't contain duplicates.
event time - one of the observed problems with DStream streaming was processing order, i.e. the case when data generated earlier was processed after later-generated data. Structured streaming handles this problem with a concept called event time that, under some conditions, allows late data to be correctly aggregated in processing pipelines.
Sink, result table, output mode and watermark are other features of Spark structured streaming (a small sketch of event time and watermarking is given after the diagram below).
[Spark Structured Streaming flow diagram]
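A minimal sketch of the event-time and watermark concepts, using the built-in rate test source (the window and watermark durations are arbitrary choices):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("event-time-sketch").getOrCreate()
import spark.implicits._

// The rate source emits (timestamp, value) rows, handy for demos.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Aggregate by event time and tolerate data arriving up to 30 seconds late.
val counts = events
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", "false")
  .start()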
Until Spark 2.2, DStream[T] was the abstract data type for streaming data, which can be viewed as a continuous sequence of RDD[T]. From Spark 2.2 onwards, the Dataset is an abstraction on DataFrame that embodies both batch (cold) as well as streaming data.
From the docs
Discretized Streams (DStreams) Discretized Stream or DStream is the
basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from
source, or the processed data stream generated by transforming the
input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable,
distributed dataset (see Spark Programming Guide for more details).
Each RDD in a DStream contains data from a certain interval, as shown
in the following figure.
API using Datasets and DataFrames Since Spark 2.0, DataFrames and
Datasets can represent static, bounded data, as well as streaming,
unbounded data. Similar to static Datasets/DataFrames, you can use the
common entry point SparkSession (Scala/Java/Python/R docs) to create
streaming DataFrames/Datasets from streaming sources, and apply the
same operations on them as static DataFrames/Datasets. If you are not
familiar with Datasets/DataFrames, you are strongly advised to
familiarize yourself with them using the DataFrame/Dataset Programming
Guide.
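To make the quoted point concrete, a small sketch of the unified API (the JSON path and the status column are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

// Bounded (batch) DataFrame
val batchDf = spark.read.json("/data/events")

// Unbounded (streaming) DataFrame over the same data layout;
// streaming file sources require an explicit schema.
val streamDf = spark.readStream.schema(batchDf.schema).json("/data/events")

// The same transformation works on both.
val batchCounts  = batchDf.groupBy("status").count()
val streamCounts = streamDf.groupBy("status").count()

streamCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()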
