How to use Dataset-based transformation in Spark Streaming? - apache-spark

I have a Spark job for batch mode (using Datasets) which performs some transformations and ingests data into NoSQL.
I get data from another source which is similar in structure to what I receive in batch mode, albeit at a much higher frequency (minutes). Can I use the code I use for batch mode for Streaming?
I am trying to avoid maintaining two copies of code to deal with the same structure.

You can use the transform streaming operator (as described in the scaladoc):
transform[U](transformFunc: (RDD[T]) ⇒ RDD[U])(implicit arg0: ClassTag[U]): DStream[U]
Return a new DStream in which each RDD is generated by applying a function on each RDD of 'this' DStream.
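For example, a minimal sketch (assuming a hypothetical upperCaseNonEmpty function standing in for the existing Dataset-based batch logic) could wrap the batch code inside transform so each micro-batch RDD is lifted into a Dataset, transformed, and turned back into an RDD:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for the Dataset-based transformation shared with the batch job
def upperCaseNonEmpty(ds: Dataset[String])(implicit spark: SparkSession): Dataset[String] = {
  import spark.implicits._
  ds.filter(_.nonEmpty).map(_.toUpperCase)
}

def applyBatchLogic(lines: DStream[String])(implicit spark: SparkSession): DStream[String] =
  lines.transform { rdd: RDD[String] =>
    import spark.implicits._
    val ds = spark.createDataset(rdd)   // lift the micro-batch RDD into a Dataset
    upperCaseNonEmpty(ds).rdd           // reuse the batch logic, then go back to an RDD
  }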

Related

What is the difference between DStream and Seq[RDD]?

The definition of DStream from the documentation states,
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.
The question is: if it is represented as a series of RDDs, can we build a Stream of RDDs ourselves and expect it to work similarly to a DStream?
It would be great if someone can help me to understand this with a code sample.
The question is: if it is represented as a series of RDDs, can we build a Stream of RDDs ourselves and expect it to work similarly to a DStream?
You're right. A DStream is logically a series of RDDs.
Spark Streaming simply hides the process of creating that Seq[RDD], so it is the framework's job rather than yours.
Moreover, Spark Streaming gives you a much nicer developer API, so you can think of a Seq[RDD] as a DStream; but rather than rdds.map(rdd => /* your code goes here */) you simply write dstream.map(t => /* your code goes here */), which is not that different except for the types of rdd and t. When working with a DStream you are already one level below, operating on the elements rather than on the RDDs themselves.
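A minimal sketch of that difference (rdds here is a hypothetical hand-built Seq[RDD[String]] and dstream a DStream[String], both assumed to already exist):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hand-rolled Seq[RDD[String]]: you iterate over the RDDs yourself
val upperCased: Seq[RDD[String]] = rdds.map(rdd => rdd.map(_.toUpperCase))

// DStream[String]: the framework iterates over the underlying RDDs for you,
// so your function sees the elements directly
val upperCasedStream: DStream[String] = dstream.map(_.toUpperCase)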

QueueStream for Structured Streaming possible?

With DStreams, from the official documentation:
Queue of RDDs as a Stream: For testing a Spark Streaming application
with test data, one can also create a DStream based on a queue of
RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed
into the queue will be treated as a batch of data in the DStream, and
processed like a stream.
So, for Structured Streaming, can I or can I not use QueueStream as input?
I was not able to find anything in the Structured Streaming Guide for 2.3 or 2.4.
I do note MemoryStream. Is this the way to go? I think so, and if so, why would QueueStream no longer be an option?
I have converted my QueueStreams to a MemoryStream as input and it works fine, but is that what is required?
My understanding is that for Structured Streaming I cannot use QueueStream, as it is a DStream construct.
Simulating streaming input with Structured Streaming does work with MemoryStream.
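For illustration, a minimal sketch of simulating input with MemoryStream (note that MemoryStream lives in the org.apache.spark.sql.execution.streaming package, which is technically internal, and the values fed in below are made up):

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder.master("local[*]").appName("memory-stream-demo").getOrCreate()
import spark.implicits._
implicit val sqlCtx: SQLContext = spark.sqlContext   // MemoryStream needs an implicit SQLContext

val source = MemoryStream[String]
source.addData("hello", "world")                     // push a "batch" of test data

val query = source.toDS
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.processAllAvailable()                          // process everything added so far
query.stop()
spark.stop()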

How to convert streaming Dataset to DStream?

Is it possible to convert a streaming o.a.s.sql.Dataset to DStream? If so, how?
I know how to convert it to RDD, but it is in a streaming context.
It is not possible. Structured Streaming and legacy Spark Streaming (DStreams) use completely different semantics and are not compatible with each other, so:
A DStream cannot be converted to a streaming Dataset.
A streaming Dataset cannot be converted to a DStream.
It could be possible (in some use cases).
That question really raises another:
Why would anyone want to do that conversion? What's the problem to be solved?
I can only imagine that such a type conversion would be required when mixing two different APIs in a single streaming application. I'd then say it does not make much sense, as you'd rather not do this and instead make the conversion at the Spark module level, i.e. migrate the streaming application from Spark Streaming to Spark Structured Streaming.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between streaming and batch Datasets is the isStreaming property of a Dataset).
It is possible to convert a DStream to a streaming Dataset so the latter behaves as the former (to keep the behaviour of the DStream and pretend to be a streaming Dataset).
Under the covers, the execution engines of Spark Streaming (DStream) and Spark Structured Streaming (streaming Dataset) are fairly similar. They both "generate" micro-batches of RDDs and Datasets, respectively. And RDDs are convertible to Datasets via the implicit conversions toDF or toDS.
So converting a DStream to a streaming Dataset would logically look as follows:
dstream.foreachRDD { rdd =>
  import spark.implicits._   // required in scope for rdd.toDF (spark is the SparkSession)
  val df = rdd.toDF()
  // this df is not streaming, but you don't really need that
}

What is the difference between Spark Structured Streaming and DStreams?

I have been trying to find materials online - both are micro-batch based - so what's the difference?
A brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame):
Spark Streaming is based on DStream. A DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Spark Streaming has the following problems.
Difficult - it was not simple to build streaming pipelines supporting delivery policies: exactly-once guarantees, handling of late data arrivals, or fault tolerance. Sure, all of them were implementable, but they needed some extra work on the part of programmers.
Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for stream processing (DStream). Sure, nothing blocking, but it's always simpler (especially regarding maintenance cost) to deal with as few abstractions as possible.
[Spark Streaming flow diagram]
Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.
More concretely, Structured Streaming brought some new concepts to Spark.
Exactly-once guarantee - Structured Streaming focuses on that concept. It means that data is processed only once and the output doesn't contain duplicates.
Event time - one of the observed problems with DStream streaming was processing order, i.e. the case when data generated earlier was processed after data generated later. Structured Streaming handles this problem with a concept called event time that, under some conditions, allows late data to be correctly aggregated in processing pipelines.
Sink, result table, output mode, and watermark are other features of Spark Structured Streaming.
[Spark Structured Streaming flow diagram]
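To make the unbounded-table model concrete, here is a minimal sketch of a windowed word count with event time and a watermark, loosely following the StructuredNetworkWordCountWindowed example (host, port, window and watermark durations are assumptions):

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("structured-windowed-counts").getOrCreate()
import spark.implicits._

// Socket source that attaches a timestamp to every incoming line
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Split the lines into words, keeping the timestamp of each line
val words = lines.as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")            // tolerate data up to 10 minutes late
  .groupBy(window($"timestamp", "5 minutes"), $"word")
  .count()

windowedCounts.writeStream
  .outputMode("update")                                // emit only rows of the result table that changed
  .format("console")
  .start()
  .awaitTermination()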
Until Spark 2.2, DStream[T] was the abstract data type for streaming data, which can be viewed as RDD[RDD[T]]. From Spark 2.2 onwards, the Dataset is an abstraction on top of DataFrame that embodies both batch (cold) as well as streaming data.
From the docs
Discretized Streams (DStreams) Discretized Stream or DStream is the
basic abstraction provided by Spark Streaming. It represents a
continuous stream of data, either the input data stream received from
source, or the processed data stream generated by transforming the
input stream. Internally, a DStream is represented by a continuous
series of RDDs, which is Spark’s abstraction of an immutable,
distributed dataset (see Spark Programming Guide for more details).
Each RDD in a DStream contains data from a certain interval, as shown
in the following figure.
API using Datasets and DataFrames Since Spark 2.0, DataFrames and
Datasets can represent static, bounded data, as well as streaming,
unbounded data. Similar to static Datasets/DataFrames, you can use the
common entry point SparkSession (Scala/Java/Python/R docs) to create
streaming DataFrames/Datasets from streaming sources, and apply the
same operations on them as static DataFrames/Datasets. If you are not
familiar with Datasets/DataFrames, you are strongly advised to
familiarize yourself with them using the DataFrame/Dataset Programming
Guide.
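For contrast with the Dataset-based example above, a minimal sketch of the same socket word count written against the DStream API (host, port and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-word-count")
val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()

ssc.start()
ssc.awaitTermination()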

How to use RDD checkpointing to share datasets across Spark applications?

I have a Spark application that checkpoints an RDD in the code; a simple code snippet follows (it is very simple, just to illustrate my question):
@Test
def testCheckpoint1(): Unit = {
  val data = List("Hello", "World", "Hello", "One", "Two")
  val rdd = sc.parallelize(data)
  // sc is initialized in the setup
  sc.setCheckpointDir(Utils.getOutputDir())
  rdd.checkpoint()
  rdd.collect()
}
Once the RDD is checkpointed on the file system, I want to write another Spark application that picks up the data checkpointed by the code above
and uses it as an RDD as a starting point in this second application.
ReliableCheckpointRDD is exactly the RDD that does this work, but that RDD is private to Spark.
So, since ReliableCheckpointRDD is private, it looks like Spark doesn't recommend using ReliableCheckpointRDD outside of Spark.
I would like to ask if there is a way to do it.
Quoting the scaladoc of RDD.checkpoint (highlighting mine):
checkpoint(): Unit Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
So, RDD.checkpoint will cut the RDD lineage and trigger partial computation so you've got something already pre-computed in case your Spark application may fail and stop.
Note that RDD checkpointing is very similar to RDD caching but caching would make the partial datasets private to some Spark application.
Let's read Spark Streaming's Checkpointing (that in some way extends the concept of RDD checkpointing making it closer to your needs to share the results of computations between Spark applications):
Data checkpointing Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
So, yes, in a sense you could share the partial results of computations in the form of RDD checkpoints, but why would you even want to do that if you could save the partial results using an "official" interface such as JSON, Parquet, or CSV?
I doubt using this internal persistence interface could give you more features and flexibility than using the aforementioned formats. Yes, it is indeed technically possible to use RDD checkpointing to share datasets between Spark applications, but it's too much effort for not much gain.
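As a sketch of that recommendation, the first application could persist the partial result with the standard writer API and the second application could read it back (the path and names below are assumptions):

import org.apache.spark.sql.SparkSession

// Application 1: save the partial result in an official, stable format
val producer = SparkSession.builder.appName("producer").getOrCreate()
import producer.implicits._
val partial = producer.sparkContext
  .parallelize(List("Hello", "World", "Hello", "One", "Two"))
  .toDF("value")
partial.write.mode("overwrite").parquet("/tmp/shared/partial-result")

// Application 2: pick the dataset up as its starting point
val consumer = SparkSession.builder.appName("consumer").getOrCreate()
val restored = consumer.read.parquet("/tmp/shared/partial-result")
val restoredRdd = restored.rdd   // back to an RDD if the downstream code expects one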
