Why Spark Structured Streaming is ideal for real-time operations? - apache-spark

I wanna construct a real-time application but I don't know if I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications but is not clear why...
Can someone explain it?

Spark Streaming works on something we call a micro batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinction from the Spark Streaming which makes it more inclined towards real streaming.
For developers all they need to worry is that Spark streaming you will you RDDs but in Spark Structured Streaming you get Dataframes and DataSet.
If you want so very low level(i.e. per record) operations go for RDDs(i.e. Spark Streaming) and but your application can build on Dataframes and querying them like SQL in real time then go for DataFrames(i.e. Spark Structured Streaming)
Eventually RDDs can be converted to Dataframes and vice versa

Related

Does spark.sql.adaptive.enabled work for Spark Structured Streaming?

I work with Apache Spark Structured Streaming. Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Since It builds on the Spark SQL engine, does it mean spark.sql.adaptive.enabled works for Spark Structured Streaming?
It's disabled in Spark code - See in StreamExecution:
// Adaptive execution can change num shuffle partitions, disallow
sparkSessionForStream.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "false")
The reason for that is because it might cause issues when having state on the stream (more details in the ticket that added this restriction - SPARK-19873 ).
If you still want to enable it for the Spark Structured Streaming (e.g. if you are sure that it won't cause any harm in your use case), you can do that inside the foreachBatch method, by setting batchDF.sparkSession.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, "true") - this will override the Spark code which disabled it.
No. Stated in this doc https://docs.databricks.com/spark/latest/spark-sql/aqe.html. Non-streaming not supported, does not apply for AQE.
Think of statefulness, small datasets ideally. Many restrictions in Spark Structured Streaming.

How do spark streaming on hive tables using sql in NON-real time?

We have some data (millions) in hive tables which comes everyday. Next day, once the over-night ingestion is complete different applications query us for data (using sql)
We take this sql and make a call on spark
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This is causing too much memory usage on spark driver, can we use spark streaming (or structured streaming), to stream the results in a piped fashion rather than collecting everything on driver and then sending to clients ?
We don't want to send out the data as soon it comes ( in typical streaming apps), but want to send a streaming data to clients when they ask (PULL) for data.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting into batches of Milliseconds to Seconds.
You can look over streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) provides you a very good functionality for Spark to write
Streaming processed output Sink in micro-batch manner.
Nevertheless Spark structured streaming don't have a standard JDBC source defined to read from.
Work out for an option to directly store Hive underlying files in compressed and structured manner, transfer them directly rather than selecting through spark.sql if every client needs same/similar data or partition them based on where condition of spark.sql query and transfer needed files further.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.

QueueStream for Structured Streaming possible?

With dStreams, from the official documentation:
Queue of RDDs as a Stream: For testing a Spark Streaming application
with test data, one can also create a DStream based on a queue of
RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed
into the queue will be treated as a batch of data in the DStream, and
processed like a stream.
So, for Structured Streaming, can I or can I not use QueueStream as input?
Not able able to find anything in the Structured Streaming Guide 2.3 or 2.4.
I do note memoryStream. This is the way to go? I think so, and if so, why would QueueStream not be an option anymore?
I have converted QueueStreams to Memory Stream as input and it works fine, but is that what is required?
My understanding is that for Structured Streaming I cannot use QueueStream - as it is a dStream.
Simulating Streaming input with Structured Streaming does work with memoryStream.

What is the difference between Spark Structured Streaming and DStreams?

I have been trying to find materials online - both are micro-batch based - so what's the difference ?
Brief description about Spark Streaming(RDD/DStream) and Spark Structured Streaming(Dataset/DataFrame)
Spark Streaming is based on DStream. A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Spark Streaming has the following problems.
Difficult - it was not simple to built streaming pipelines supporting delivery policies: exactly once guarantee, handling data arrival in late or fault tolerance. Sure, all of them were implementable but they needed some extra work from the part of programmers.
Incosistent - API used to generate batch processing (RDD, Dataset) was different that the API of streaming processing (DStream). Sure, nothing blocker to code but it's always simpler (maintenance cost especially) to deal with at least abstractions as possible.
see the example
Spark Streaming flow diagram :-
Spark Structured Streaming be understood as an unbounded table, growing with new incoming data, i.e. can be thought as stream processing built on Spark SQL.
More concretely, structured streaming brought some new concepts to Spark.
exactly-once guarantee - structured streaming focuses on that concept. It means that data is processed only once and output doesn't contain duplicates.
event time - one of observed problems with DStream streaming was processing order, i.e the case when data generated earlier was processed after later generated data. Structured streaming handles this problem with a concept called event time that, under some conditions, allows to correctly aggregate late data in processing pipelines.
sink,Result Table,output mode and watermark are other features of spark structured streaming.
see the example
Spark Structured Streaming flow diagram :-
Until Spark 2.2, the DStream[T] was the abstract data type for streaming data which can be viewed as RDD[RDD[T]].From Spark 2.2 onwards, the DataSet is a abstraction on DataFrame that embodies both the batch (cold) as well as streaming data.
From the docs
Discretized Streams (DStreams) Discretized Stream or DStream is the
basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from
source, or the processed data stream generated by transforming the
input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable,
distributed dataset (see Spark Programming Guide for more details).
Each RDD in a DStream contains data from a certain interval, as shown
in the following figure.
API using Datasets and DataFrames Since Spark 2.0, DataFrames and
Datasets can represent static, bounded data, as well as streaming,
unbounded data. Similar to static Datasets/DataFrames, you can use the
common entry point SparkSession (Scala/Java/Python/R docs) to create
streaming DataFrames/Datasets from streaming sources, and apply the
same operations on them as static DataFrames/Datasets. If you are not
familiar with Datasets/DataFrames, you are strongly advised to
familiarize yourself with them using the DataFrame/Dataset Programming
Guide.

Spark Stateful Streaming with DataFrame

Is it possible to use DataFrame as a State / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only key-value pair data structure (mapWithState etc..).
My objective is to keep a fixed size FIFO buffer as a StateSpec that gets updated every time new data streams in. I'd like to implement the buffer in Spark DataFrame API, for compatibility with Spark ML.
I'm not entirely sure you can do this with Spark Streaming, but with the newer Dataframe-based Spark Structured Streaming you can express queries that get updated over time, given an incoming stream of data.
You can read more about Spark Structured Streaming in the official documentation.
If you are interested in interoperability with SparkML to deploy a trained model, you may also be interested in this article.

Resources