LSHModel on Spark Structured Streaming

Apparently, the LSHModel in MLlib from Spark 2.4 supports Spark Structured Streaming (https://issues.apache.org/jira/browse/SPARK-24465).
However, it's not clear to me how. For instance, can an approxSimilarityJoin from a MinHashLSH transformation (https://spark.apache.org/docs/latest/ml-features#lsh-operations) be applied directly to a streaming DataFrame?
I can't find any more information about it online. Could someone help me?

You need to
Persist the trained model (e.g. modelFitted) somewhere accessible to your streaming job. This happens outside of the streaming job itself:
modelFitted.write.overwrite().save("/path/to/model/location")
Then load this model within your Structured Streaming job:
import org.apache.spark.ml._
val model = PipelineModel.read.load("/path/to/model/location")
Apply this model to your streaming DataFrame (e.g. df) with:
model.transform(df)
// in your case you may work with two streaming DataFrames to apply `approxSimilarityJoin`
You might need to get the streaming DataFrame into the correct format before it can be used for model prediction.
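For example, here is a minimal sketch of what the streaming job could look like, assuming the fitted model is saved and loaded directly as a MinHashLSHModel (rather than as a full PipelineModel), that one side of the join is a static reference set, and that both sides already carry the feature vector column the model was trained on. All paths, the schema, and the distance threshold are placeholders:

import org.apache.spark.ml.feature.MinHashLSHModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("lsh-on-stream").getOrCreate()

// Load the previously fitted model (hypothetical path).
val model = MinHashLSHModel.load("/path/to/model/location")

// Static reference set with the same feature vector column.
val staticDf = spark.read.parquet("/path/to/reference/data")

// Streaming input, assumed to arrive in the same format.
val streamDf = spark.readStream
  .schema(staticDf.schema)
  .parquet("/path/to/stream/input")

// Stream-static similarity join within a Jaccard distance of 0.6
// (an arbitrary example threshold).
val joined = model.approxSimilarityJoin(streamDf, staticDf, 0.6, "jaccardDist")

joined.writeStream
  .format("console")
  .start()
  .awaitTermination()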

Related

Convert Spark SQL DataFrames to Structured Streaming DataFrames

I'd like to convert Java Spark SQL DataFrames to Structured Streaming DataFrames, in such a way that every batch would be unioned to the Structured Streaming DataFrame. Therefore I could use the Spark Structured Streaming functionalities (such as a continuous job) on DataFrames that I've got from a batch source.
Nothing to do with Java, and the title is a little off-beam.
It's an unsupported standard operation, as you state.
Look in the docs at the foreachBatch implementation (see https://spark.apache.org/docs/3.1.2/structured-streaming-programming-guide.html#foreachbatch) and do the UNION there, after having read the static DF in.
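A rough sketch of that pattern, assuming the static DataFrame and the streaming source share the same schema (all paths are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("union-in-foreachBatch").getOrCreate()

// Static DataFrame, read once outside the streaming query.
val staticDf = spark.read.parquet("/path/to/static/data")

// Streaming source with a matching schema (required for union).
val streamDf = spark.readStream
  .schema(staticDf.schema)
  .parquet("/path/to/stream/input")

streamDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Inside foreachBatch each micro-batch is a plain DataFrame,
    // so the UNION is an ordinary batch operation.
    batchDf.union(staticDf)
      .write.mode("append").parquet("/path/to/output")
  }
  .start()
  .awaitTermination()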

Why is Spark Structured Streaming ideal for real-time operations?

I want to build a real-time application, but I don't know whether I should use Spark Streaming or Spark Structured Streaming.
I read online that Structured Streaming is ideal for real-time applications, but it's not clear why...
Can someone explain it?
Spark Streaming works on something we call a micro-batch. ... Each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has some distinctions from Spark Streaming that make it more inclined towards real streaming.
For developers, all they need to worry about is that in Spark Streaming you work with RDDs, while in Spark Structured Streaming you get DataFrames and Datasets.
If you want very low-level (i.e. per-record) operations, go for RDDs (i.e. Spark Streaming); but if your application can be built on DataFrames and query them like SQL in real time, then go for DataFrames (i.e. Spark Structured Streaming).
Eventually, RDDs can be converted to DataFrames and vice versa.
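To illustrate the DataFrame-centric style, here is the classic streaming word count written as an ordinary DataFrame query (the socket source, host, and port are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("sql-on-stream").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Treat the unbounded stream as a table and express the logic as a
// normal DataFrame query; the engine re-runs it on every trigger.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()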

Not able to convert a Spark Dataset&lt;Row&gt; to H2OFrame via asH2OFrame if the dataset is a streaming dataset

I already have a deep learning model. I am trying to run scoring on streaming data. For this I am reading data from Kafka using the Spark Structured Streaming API. When I try to convert the received dataset to an H2OFrame, I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Code Sample
Dataset&lt;Row&gt; testData = sparkSession.readStream()
    .schema(testSchema)
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9042")
    .option("subscribe", "topicName")
    .load();
H2OFrame h2oTestFrame = h2oContext.asH2OFrame(testData.toDF(), "test_frame");
Is there any example that explains sparkling water using spark structured streaming with streaming source?
There isn't. General-purpose transformations, including conversions to RDDs and external formats, are not supported in Structured Streaming.
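One pattern worth noting, though: inside foreachBatch each micro-batch arrives as a plain, non-streaming DataFrame, so batch-only conversions become possible there. A rough sketch in Scala, which assumes the batch asH2OFrame call from the question applies unchanged to each micro-batch (this is an assumption, not verified Sparkling Water behaviour):

testData.writeStream
  .foreachBatch { (batchDf: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // Each micro-batch is a regular DataFrame here; the conversion
    // below assumes the batch Sparkling Water API works per batch.
    val frame = h2oContext.asH2OFrame(batchDf, "test_frame_" + batchId)
    // ... run scoring against `frame` ...
  }
  .start()
  .awaitTermination()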

How to convert streaming Dataset to DStream?

Is it possible to convert a streaming o.a.s.sql.Dataset to DStream? If so, how?
I know how to convert it to RDD, but it is in a streaming context.
It is not possible. Structured Streaming and legacy Spark Streaming (DStreams) use completely different semantics and are not compatible with each other so:
DStream cannot be converted to Streaming Dataset.
Streaming Dataset cannot be converted to DStream.
It could be possible (in some use cases).
That question really begs another:
Why would anyone want to do that conversion? What's the problem to be solved?
I can only imagine that such a type conversion would be required when mixing two different APIs in a single streaming application. I'd then say it does not make much sense, as you'd rather avoid it and make the conversion at the Spark module level instead, i.e. migrate the streaming application from Spark Streaming to Spark Structured Streaming.
A streaming Dataset is an "abstraction" of a series of Datasets (I use quotes since the difference between streaming and batch Datasets is the isStreaming property of a Dataset).
It is possible to convert a DStream to a streaming Dataset so the latter behaves as the former (to keep the behaviour of the DStream and pretend to be a streaming Dataset).
Under the covers, the execution engines of Spark Streaming (DStream) and Spark Structured Streaming (streaming Dataset) are fairly similar. They both "generate" micro-batches of RDDs and Datasets, respectively. And RDDs are convertible to Datasets via the implicit conversions toDF or toDS.
So converting a DStream to a streaming Dataset would logically look as follows:
import spark.implicits._ // required for rdd.toDF

dstream.foreachRDD { rdd =>
  val df = rdd.toDF()
  // this df is not streaming, but you don't really need that
}

Spark Stateful Streaming with DataFrame

Is it possible to use DataFrame as a State / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only key-value pair data structures (mapWithState, etc.).
My objective is to keep a fixed size FIFO buffer as a StateSpec that gets updated every time new data streams in. I'd like to implement the buffer in Spark DataFrame API, for compatibility with Spark ML.
I'm not entirely sure you can do this with Spark Streaming, but with the newer DataFrame-based Spark Structured Streaming you can express queries that get updated over time, given an incoming stream of data.
You can read more about Spark Structured Streaming in the official documentation.
If you are interested in interoperability with SparkML to deploy a trained model, you may also be interested in this article.
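For the fixed-size FIFO buffer specifically, here is a hedged sketch of what this could look like in Structured Streaming, keeping the last N values per key with mapGroupsWithState (the event type, source, and buffer size are all illustrative assumptions, not the only way to model the state):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Double)

val spark = SparkSession.builder.appName("fifo-state").getOrCreate()
import spark.implicits._

// Hypothetical source emitting "key,value" lines.
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(k, v) = line.split(",")
    Event(k, v.toDouble)
  }

val bufferSize = 10 // fixed FIFO capacity, arbitrary example value

// For each key, append the new values and trim to the newest `bufferSize`.
val buffered = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (key: String, batch: Iterator[Event], state: GroupState[Seq[Double]]) =>
      val old = state.getOption.getOrElse(Seq.empty[Double])
      val updated = (old ++ batch.map(_.value)).takeRight(bufferSize)
      state.update(updated)
      (key, updated)
  }

buffered.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .start()
  .awaitTermination()

The per-key buffer emitted here could then be flattened back into a DataFrame downstream for use with Spark ML.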
