Spark Structured Streaming and Spark ML Regression - apache-spark

Is it possible to apply Spark ML regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with Structured Streaming sources.
How am I supposed to apply regressions on Structured Streaming sources?
(A little off-topic) If I cannot use the streaming API for regression, how can I commit offsets or the like back to the source when processing in a batch fashion? (Kafka sink)

Today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.
You can however:
Train iterative, non-distributed models using the foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this (see the sketch after this list):
Fetch the latest model when calling ForeachWriter.open and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
Compute the loss for each record in ForeachWriter.process and update the accumulator.
Push the losses to the external store when calling ForeachWriter.close.
This would leave the external storage in charge of computing the gradient and updating the model, with the implementation dependent on the store.
Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau)
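Below is a minimal sketch of the foreach-sink approach, assuming a hypothetical ModelStore object standing in for the external state storage (its fetchWeights/pushUpdate methods are inventions for illustration). As one concrete choice, the writer accumulates squared-loss values plus per-feature gradient contributions, so the store only has to average and apply them:

```scala
import org.apache.spark.sql.ForeachWriter

// Hypothetical external model store; fetchWeights and pushUpdate are
// assumptions, to be backed by e.g. a key-value store.
object ModelStore {
  def fetchWeights(): Array[Double] = ???
  def pushUpdate(gradient: Array[Double], loss: Double, n: Long): Unit = ???
}

// Records are assumed to be (features, label) pairs.
class SGDWriter extends ForeachWriter[(Array[Double], Double)] {
  private var weights: Array[Double] = _
  private var gradient: Array[Double] = _
  private var loss: Double = _
  private var count: Long = _

  override def open(partitionId: Long, version: Long): Boolean = {
    weights = ModelStore.fetchWeights()          // 1. fetch the latest model
    gradient = Array.fill(weights.length)(0.0)   // local accumulators for
    loss = 0.0                                   //    this partition only
    count = 0L
    true
  }

  override def process(record: (Array[Double], Double)): Unit = {
    val (features, label) = record               // 2. per-record update
    val prediction = features.zip(weights).map { case (x, w) => x * w }.sum
    val error = prediction - label
    loss += 0.5 * error * error                  // squared loss
    for (i <- gradient.indices) gradient(i) += error * features(i)
    count += 1
  }

  override def close(errorOrNull: Throwable): Unit = {
    // 3. push accumulated statistics; the store updates the model.
    if (errorOrNull == null) ModelStore.pushUpdate(gradient, loss, count)
  }
}
```

The writer would then be attached with something like dataset.writeStream.foreach(new SGDWriter).start().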

Related

Spark Model serving: How to compute the state just once, then keep the state to answer all real-time requests? (recommendation engine)

I am trying to implement a recommendation engine that uses Kafka to collect real-time click data and then processes it with Spark Structured Streaming. My problem is how to serve predictions in near real time from this streaming dataframe.
What works fine: collecting click data, sending it to Kafka, subscribing from Spark, and using Structured Streaming to compute a dataframe that describes the 'state of a visitor'. Given this streaming dataframe, only a few lines of code (business logic) are needed to tell which recommendation is best.
Now my problem is how to put this into production. I could create an mlflow.pyfunc model, but it would not contain the 'state of a visitor' dataframe. Looking at model-serving frameworks, I understand that every inference request would create an independent runtime, which would have to run the whole data pipeline again.
My idea would be to have one Spark instance which would:
1. create this streaming dataframe
2. wait for incoming requests and answer them by using the dataframe from (1.)
Is this a reasonable approach? If yes: how do I set this up? If no: what is the preferred way to do real-time recommendations?
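One way to prototype exactly this setup (not an authoritative recipe; the memory sink is documented for debugging and small result tables, so treat it as a starting point): keep the streaming query's result in an in-memory table and let request handlers query it via SQL. The broker address, topic name, and state computation below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("reco-serving").getOrCreate()
import spark.implicits._

// 1. the streaming 'state of a visitor' dataframe
val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
  .option("subscribe", "clicks")                    // assumed topic
  .load()

val visitorState = clicks
  .selectExpr("CAST(key AS STRING) AS visitorId", "CAST(value AS STRING) AS event")
  .groupBy($"visitorId")
  .count()                            // stand-in for the real state logic

// 2. expose the state as an in-memory table, continuously updated
visitorState.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("visitor_state")
  .start()

// per incoming request, the same Spark instance answers from the table:
val state = spark.sql("SELECT * FROM visitor_state WHERE visitorId = 'abc123'")
// ...apply the few lines of recommendation business logic to `state`...
```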

How to use Flink and Spark together, with Spark just for transformation?

Let's say there is a collection "goods" in MongoDB like this:
{"name":"A","attr":{"location":"us"},"eventTime":"2018-01-01"}
{"name":"B","attr":{"brand":"nike"},"eventTime":"2018-01-01"}
In the past, I used Spark to flatten it and save it to Hive:
goodsDF.select($"name", explode($"attr"))
But now we need to handle incremental data. For example, on the next day a new good arrives as the third line:
{"name":"A","attr":{"location":"us"},"eventTime":"2018-01-01"}
{"name":"B","attr":{"brand":"nike"},"eventTime":"2018-01-01"}
{"name":"C","attr":{"location":"uk"},"eventTime":"2018-02-01"}
Some of our team think Flink is better at streaming, because Flink offers event-driven applications, streaming pipelines, and batch, whereas Spark is just micro-batch.
So we are switching to Flink, but a lot of code has already been written in Spark, for example the "explode" above. So my questions are:
Is it possible to use Flink to fetch the source and save to the sink, but use Spark to transform the dataset in the middle?
If that is not possible, how about saving to a temporary sink, say some JSON files, and then having Spark read the files, transform them, and save to Hive? But I am afraid this makes no sense, because for Spark it is also incremental data. Using Flink and then Spark would be the same as using Spark Structured Streaming directly.
No. Apache Spark code cannot be used in Flink without changes, as the two are different processing frameworks with different APIs and syntax. The choice of framework should be driven by the use case rather than by generic statements like "Flink is better than Spark": a framework may work great for one use case and perform poorly in another. By the way, Spark is not just micro-batch; it has batch, streaming, graph, ML, and other components. Since the complete use case is not described in the question, it is hard to say which one fits this scenario better, but if your use case can live with the second-or-so latency of micro-batching, I would not waste my time moving to another framework.
Also, if things are dynamic and it is anticipated that the processing framework may change in the future, it would be better to use something like Apache Beam, which provides an abstraction over most processing engines. Using the Apache Beam processing APIs gives you the flexibility to change the underlying engine at any time. Here is the link to read more about Beam: https://beam.apache.org/.
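On the asker's last observation: as a hedged sketch (paths, schema, and sink location are assumptions), the existing explode logic can indeed run unchanged as a Structured Streaming job that picks up incremental files, with no Flink hop in between:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("goods-stream").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("name", StringType)
  .add("attr", MapType(StringType, StringType))
  .add("eventTime", StringType)

// Read new JSON files incrementally as they land in the directory.
val goodsStream = spark.readStream.schema(schema).json("/data/goods/")

// The batch transformation carries over as-is.
val flattened = goodsStream.select($"name", explode($"attr"))

flattened.writeStream
  .format("parquet")                               // or a Hive-backed path
  .option("path", "/warehouse/goods_flat")
  .option("checkpointLocation", "/chk/goods_flat")
  .start()
```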

How to avoid Code Redundancy in Lambda Architecture?

We have an existing batch processing setup which works as follows:
Hive SQL is used for daily batch processing.
Data is ingested either from files or from an RDBMS.
Data is ingested Raw --> Staging --> Mart, where staging-to-mart applies all the business transformations and raw-to-staging is just cleansing and formatting of the data.
Now, as part of getting real-time or near-real-time data, I am evaluating the Lambda Architecture, and this is the plan:
All the source systems are going to land on Kafka.
The same batch processing system will consume the Kafka topics.
A new Spark application will consume the Kafka topics for streaming.
The serving layer will create views which combine the aggregate data from both streaming and batch for real (near-real) time processing.
The problem is that the logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid or minimize this?
You can build your processing stages using Spark SQL and Spark Structured Streaming: https://spark.apache.org/docs/2.2.0/structured-streaming-programming-guide.html. Depending on your needs there can be some incompatibilities, but I'd try to build the Spark aggregations + transformations against the Dataset[_] API and then spawn them both ways, batch and streaming (see the sketch below).
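A minimal sketch of that idea, with table names, topic, and columns invented for illustration: the business logic is written once against DataFrame and then spawned in both batch and streaming mode:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("shared-logic").getOrCreate()

// The business transformation lives in exactly one place...
def dailyTotals(df: DataFrame): DataFrame =
  df.filter(col("amount") > 0)
    .withColumn("day", to_date(col("ts")))
    .groupBy(col("day"))
    .agg(sum(col("amount")).as("total"))

// ...spawned once in batch mode:
dailyTotals(spark.read.table("staging.orders"))
  .write.mode("overwrite").saveAsTable("mart.daily_totals")

// ...and once in streaming mode, on the same schema arriving via Kafka:
val orders = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // assumed broker
  .option("subscribe", "orders")                     // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), spark.table("staging.orders").schema).as("o"))
  .select("o.*")

dailyTotals(orders).writeStream
  .outputMode("complete")
  .format("memory")          // illustration only; use a real sink in practice
  .queryName("daily_totals_rt")
  .start()
```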
The problem of a duplicated code base is inherent in the lambda architecture; it gets a mention in the 'Criticism' section of the Wikipedia page.
Another issue is that the data between batch and stream are not in sync, which can lead to unexpected results when bringing the data together, for example joining across stream and batch when keys do not yet exist on the batch side.
I believe the lambda architecture comes from a belief that streaming is complex and expensive, so keep batch as much as possible and add streaming only for the elements that require near-real time: we already have batch, let's add a few streaming things.
An alternative architecture is to use streaming for everything. This is based on the realization that batch is a special case of streaming, so you do your batch and stream processing on a single streaming platform.
use spark structured streaming for batch
lambda architecture issues and how only using streaming solves them
questioning the lambda architecture

What is the difference between Spark Structured Streaming and DStreams?

I have been trying to find materials online - both are micro-batch based - so what's the difference?
A brief description of Spark Streaming (RDD/DStream) and Spark Structured Streaming (Dataset/DataFrame):
Spark Streaming is based on DStreams. A DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset. Spark Streaming has the following problems:
Difficult - it was not simple to build streaming pipelines supporting delivery policies (exactly-once guarantees, handling late data arrival, fault tolerance). Sure, all of them were implementable, but they needed some extra work from the programmer.
Inconsistent - the API used for batch processing (RDD, Dataset) was different from the API for stream processing (DStream). Sure, nothing that blocks coding, but it is always simpler (especially for maintenance cost) to deal with as few abstractions as possible.
See the example below (Spark Streaming flow diagram image omitted).
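For a concrete feel, here is a hedged sketch of a classic DStream job (the usual socket word count; host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-wordcount")
val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

// Each micro-batch arrives as an RDD; the code below is pure RDD API.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                              // counts per batch only

counts.print()
ssc.start()
ssc.awaitTermination()
```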
Spark Structured Streaming can be understood as an unbounded table, growing with new incoming data, i.e. it can be thought of as stream processing built on Spark SQL.
More concretely, Structured Streaming brought some new concepts to Spark:
exactly-once guarantee - Structured Streaming focuses on this concept. It means that data is processed only once and the output contains no duplicates.
event time - one of the observed problems with DStream streaming was processing order, i.e. the case where data generated earlier is processed after data generated later. Structured Streaming handles this problem with a concept called event time that, under some conditions, allows late data to be aggregated correctly in processing pipelines.
Sinks, the Result Table, output modes, and watermarks are other features of Spark Structured Streaming.
See the example below (Spark Structured Streaming flow diagram image omitted).
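And the equivalent word count as a hedged Structured Streaming sketch (same placeholder host and port); note it is the same Dataset API as batch, and the result behaves like a continuously updated table:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Running counts over the whole stream, not per micro-batch.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")   // emit the full updated Result Table each trigger
  .format("console")
  .start()
  .awaitTermination()
```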
Until Spark 2.2, DStream[T] was the abstract data type for streaming data, which can be viewed as a continuous sequence of RDD[T]. From Spark 2.2 onwards, Dataset (of which DataFrame is a specialization) is the abstraction that embodies both batch (cold) and streaming data.
From the docs
Discretized Streams (DStreams): Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark's abstraction of an immutable, distributed dataset (see the Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval, as shown in the following figure.
API using Datasets and DataFrames: Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point SparkSession (Scala/Java/Python/R docs) to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the DataFrame/Dataset Programming Guide.

Spark Stateful Streaming with DataFrame

Is it possible to use a DataFrame as a state / StateSpec for Spark Streaming? The current StateSpec implementation seems to allow only a key-value pair data structure (mapWithState, etc.).
My objective is to keep a fixed-size FIFO buffer as a StateSpec that gets updated every time new data streams in. I'd like to implement the buffer in the Spark DataFrame API, for compatibility with Spark ML.
I'm not entirely sure you can do this with Spark Streaming, but with the newer DataFrame-based Spark Structured Streaming you can express queries that get updated over time, given an incoming stream of data.
You can read more about Spark Structured Streaming in the official documentation.
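While a DataFrame itself cannot serve as the state, a fixed-size FIFO buffer per key can be kept with arbitrary stateful processing (mapGroupsWithState, Spark 2.2+), and its output is a Dataset that Spark ML can consume. A hedged sketch, where the Event type, the key field, and the buffer size are assumptions:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.GroupState

case class Event(sensorId: String, reading: Double)
case class FifoWindow(sensorId: String, lastReadings: Seq[Double])

val bufferSize = 100  // fixed FIFO capacity

def updateBuffer(
    key: String,
    events: Iterator[Event],
    state: GroupState[Seq[Double]]): FifoWindow = {
  val old = state.getOption.getOrElse(Seq.empty[Double])
  // Append new readings, then keep only the newest `bufferSize` values.
  val updated = (old ++ events.map(_.reading)).takeRight(bufferSize)
  state.update(updated)
  FifoWindow(key, updated)
}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val events: Dataset[Event] = ???  // your parsed streaming source

val windows = events
  .groupByKey(_.sensorId)
  .mapGroupsWithState(updateBuffer _)

// Write `windows` with outputMode("update"); each row carries the current
// FIFO buffer for one key, ready to feed into Spark ML as a DataFrame.
```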
If you are interested in interoperability with SparkML to deploy a trained model, you may also be interested in this article.
