How to convert streaming Dataset to DStream? - apache-spark

Is it possible to convert a streaming o.a.s.sql.Dataset to DStream? If so, how?
I know how to convert it to RDD, but it is in a streaming context.

It is not possible. Structured Streaming and legacy Spark Streaming (DStreams) use completely different semantics and are not compatible with each other so:
DStream cannot be converted to Streaming Dataset.
Streaming Dataset cannot be converted to DStream.

It could be possible (in some use cases).
That question really begs another:
Why would anyone want to do that conversion? What's the problem to be solved?
I can only imagine that such a type conversion would be required when mixing two different APIs in a single streaming application. I'd then say it does not make much sense, as you'd rather make the conversion at the Spark module level, i.e. migrate the streaming application from Spark Streaming to Spark Structured Streaming.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between streaming and batch Datasets is the isStreaming property of a Dataset).
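As a minimal illustration of that property (the format and path here are made up), the isStreaming flag is the only outward difference between the two:
val batch = spark.read.format("json").load("/tmp/events")        // hypothetical path
batch.isStreaming                                                 // false

val streaming = spark.readStream.format("json").schema(batch.schema).load("/tmp/events")
streaming.isStreaming                                             // true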
It is possible to make a DStream behave like a streaming Dataset, i.e. keep the behaviour of the DStream while pretending to be a streaming Dataset.
Under the covers, the execution engines of Spark Streaming (DStream) and Spark Structured Streaming (streaming Dataset) are fairly similar. They both "generate" micro-batches of RDDs and Datasets, respectively. And RDDs are convertible to Datasets via the implicit conversions toDF and toDS.
So converting a DStream to a streaming Dataset would logically look as follows:
import spark.implicits._   // needed for the rdd.toDF conversion

dstream.foreachRDD { rdd =>
  val df = rdd.toDF
  // this df is not streaming, but you don't really need that
}

Related

Convert Spark SQL DataFrames to Structured Streaming DataFrames

I'd like to convert Java Spark SQL DataFrames to Structured Streaming DataFrames, in such a way that every batch would be unioned to the Structured Streaming DataFrame. Therefore I could use the Spark Structured Streaming functionalities (such as a continuous job) on DataFrames that I've got from a batch source.
This has nothing to do with Java, and the title is a little off-beam.
As you state, this is not supported as a standard operation.
Look at the foreachBatch implementation in the docs (see https://spark.apache.org/docs/3.1.2/structured-streaming-programming-guide.html#foreachbatch) and do the UNION there, after having read the static DF in.
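A rough sketch of that foreachBatch approach (the paths and the shared schema are assumptions for illustration, not from the question):
import org.apache.spark.sql.DataFrame

val staticDF = spark.read.parquet("/path/to/static")      // read the static DF in once

val streamingDF = spark.readStream
  .schema(staticDF.schema)                                 // same schema so the union lines up
  .parquet("/path/to/streaming-input")

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Inside foreachBatch each micro-batch is a plain (non-streaming) DataFrame,
    // so it can be unioned with the static one and written with the batch API.
    batchDF.union(staticDF).write.mode("append").parquet("/path/to/output")
  }
  .start()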

LSHModel on spark structured streaming

Apparently, the LSHModel of MLLib from spark 2.4 supports Spark Structured Streaming (https://issues.apache.org/jira/browse/SPARK-24465).
However, it's not clear to me how. For instance, can an approxSimilarityJoin from a MinHashLSH transformation (https://spark.apache.org/docs/latest/ml-features#lsh-operations) be applied directly to a streaming dataframe?
I can't find more information about it online. Could someone help me?
You need to
Persist the trained model (e.g. modelFitted) somewhere accessible to your Streaming job. This is done outside of your streaming job.
modelFitted.write.overwrite().save("/path/to/model/location")
Then load this model within your Structured Streaming job
import org.apache.spark.ml._
val model = PipelineModel.read.load("/path/to/model/location")
Apply this model to your streaming Dataframe (e.g. df) with
model.transform(df)
// in your case you may work with two streaming Dataframes to apply `approxSimilarityJoin`.
You may also need to get the streaming DataFrame into the correct format before it can be used for model prediction.
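Putting those steps together, a hedged sketch (the input path, inputSchema, refDF, and the assumption that the LSH stage is the last pipeline stage are all illustrative):
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.MinHashLSHModel

// Load the model that was trained and persisted outside the streaming job.
val model = PipelineModel.read.load("/path/to/model/location")

// Hypothetical streaming input; columns must match what the model was trained on.
val df = spark.readStream
  .schema(inputSchema)                                  // assumed to be defined elsewhere
  .parquet("/path/to/streaming-input")

// transform() applies to a streaming DataFrame just like to a static one.
val transformed = model.transform(df)

// If the last pipeline stage is the MinHashLSHModel, it can be pulled out to run
// approxSimilarityJoin against a reference DataFrame (refDF, assumed here).
val lsh = model.stages.last.asInstanceOf[MinHashLSHModel]
val similar = lsh.approxSimilarityJoin(refDF, transformed, 0.6, "jaccardDistance")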

Why Spark Structured Streaming is ideal for real-time operations?

I want to build a real-time application but I don't know whether I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications, but it is not clear why...
Can someone explain it?
Spark Streaming works on something we call a micro batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinctions from Spark Streaming which make it more inclined towards real streaming.
For developers, all they need to worry about is that in Spark Streaming you work with RDDs, while in Spark Structured Streaming you get DataFrames and Datasets.
If you want very low-level (i.e. per-record) operations, go for RDDs (i.e. Spark Streaming); but if your application can be built on DataFrames and query them like SQL in real time, then go for DataFrames (i.e. Spark Structured Streaming).
Eventually, RDDs can be converted to DataFrames and vice versa.
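For completeness, a minimal sketch of that conversion (the column name is arbitrary):
import spark.implicits._

val rdd  = spark.sparkContext.parallelize(Seq(1, 2, 3))
val df   = rdd.toDF("value")   // RDD -> DataFrame
val back = df.rdd              // DataFrame -> RDD[Row]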

What is the difference between DStream and Seq[RDD]?

The definition of DStream from the documentation states,
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.
The question is: if it is represented as a series of RDDs, can we make a Stream of RDDs and expect it to work similarly to a DStream?
It would be great if someone can help me to understand this with a code sample.
The question is: if it is represented as a series of RDDs, can we make a Stream of RDDs and expect it to work similarly to a DStream?
You're right. A DStream is logically a series of RDDs.
Spark Streaming simply hides the process of creating that Seq[RDD], so it is the framework's job, not yours.
Moreover, Spark Streaming gives you a much nicer developer API, so you can think of Seq[RDD] as a DStream; but rather than rdds.map(rdd => your code goes here) you can simply write dstream.map(t => your code goes here), which is not that different except for the types of rdd and t. You're simply one level lower already when working with a DStream.
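A small sketch of those two shapes (the sources are made up just to show the types):
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One level below: a hand-built Seq[RDD], mapped once over the RDDs and once over records.
val rdds: Seq[RDD[String]] = Seq(spark.sparkContext.parallelize(Seq("a", "b")))
val upper = rdds.map(rdd => rdd.map(_.toUpperCase))

// With a DStream the framework builds and manages that series, so map works per record.
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val dstream = ssc.socketTextStream("localhost", 9999)     // hypothetical source
val upperStream = dstream.map(_.toUpperCase)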

What is the difference between Spark Structured Streaming and DStreams?

I have been trying to find materials online - both are micro-batch based - so what's the difference?
A brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame):
Spark Streaming is based on DStream. A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Spark Streaming has the following problems.
Difficult - it was not simple to build streaming pipelines supporting delivery policies: exactly-once guarantee, handling late data arrival, or fault tolerance. Sure, all of them were implementable, but they needed some extra work on the part of programmers.
Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for stream processing (DStream). Sure, nothing that blocks coding, but it's always simpler (especially for maintenance cost) to deal with as few abstractions as possible.
(Figure: Spark Streaming flow diagram)
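In place of the diagram, a minimal Spark Streaming (DStream) example, assuming a socket source on localhost:9999:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(10))   // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)

// Classic word count expressed over DStreams (each batch is an RDD under the covers).
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()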
Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.
More concretely, structured streaming brought some new concepts to Spark.
exactly-once guarantee - structured streaming focuses on that concept. It means that data is processed only once and output doesn't contain duplicates.
event time - one of the observed problems with DStream streaming was processing order, i.e. the case when data generated earlier is processed after data generated later. Structured Streaming handles this problem with a concept called event time that, under some conditions, allows late data to be correctly aggregated in processing pipelines.
Sink, Result Table, output mode, and watermark are other features of Spark Structured Streaming.
(Figure: Spark Structured Streaming flow diagram)
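And, as a Structured Streaming counterpart, a sketch illustrating the event-time window and watermark concepts above (the rate source and intervals are just for illustration):
import org.apache.spark.sql.functions.{col, window}

// The built-in "rate" source emits rows with `timestamp` and `value` columns.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// Event-time aggregation: counts per 5-minute window, tolerating 10 minutes of lateness.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

counts.writeStream
  .outputMode("update")        // the Result Table is updated as (late) data arrives
  .format("console")
  .start()
  .awaitTermination()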
Until Spark 2.2, DStream[T] was the abstract data type for streaming data, which can be viewed as a Seq[RDD[T]]. From Spark 2.2 onwards, the Dataset (and DataFrame) abstraction embodies both batch (cold) and streaming data.
From the docs
Discretized Streams (DStreams): Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset (see Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.
API using Datasets and DataFrames: Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point SparkSession (Scala/Java/Python/R docs) to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the DataFrame/Dataset Programming Guide.
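To make that "same operations" point concrete, a sketch with an assumed eventSchema and path; only read vs readStream differs:
import spark.implicits._

val static = spark.read.schema(eventSchema).json("/data/events")       // eventSchema is assumed
val stream = spark.readStream.schema(eventSchema).json("/data/events")

// The very same query works on both; on the streaming side it becomes an unbounded query.
def okByAction(df: org.apache.spark.sql.DataFrame) =
  df.filter($"status" === "ok").groupBy($"action").count()

val batchResult     = okByAction(static)
val streamingResult = okByAction(stream)
streamingResult.writeStream.outputMode("complete").format("console").start()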
