Saving Spark ML pipeline to a database - apache-spark

Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?

It is not possible (or at least no supported) at this moment. ML writer is not extendable and depends on Parquet files and directory structure to represent models.
Technically speaking you can extract individual components and use internal private API to recreate models from scratch, but it is likely the only option.

Spark 2.0.0+
At first glance all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience some issues with model saving I would suggest switching version.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method. Because almost every model implements the MLWritable interface. For example LogisticRegressionModel, has it, and therefore it's possible to save your model to the desired path using it.
Spark < 1.6
Some operations on a DataFrames can be optimized and it translates to improved performance compared to plain RDDs. DataFramesprovide efficient caching and SQLish API is arguably easier to comprehend than RDD API.
ML Pipelinesare extremely useful and tools like cross-validator or differentevaluators are simply must-have in any machine pipeline and even if none of the above is particularly hard do implement on top of low level MLlib API it is much better to have ready to use, universal and relatively well tested solution.
I believe that at the end of the day what you get by using ML over MLLibis quite elegant, high level API. One thing you can do is to combine both to create a custom multi-step pipeline:
use ML to load, clean and transform data,
extract required data (see for example [extractLabeledPoints ]4 method) and pass to MLLib algorithm,
add custom cross-validation / evaluation
save MLLib model using a method of your choice (Spark model or PMML)
In Jira also there is temporary solution provided . Temporary Solution

Related

Export spark feature transformation pipeline to a file

PMML, Mleap, PFA currently only support row based transformations. None of them support frame based transformations like aggregates or groupby or join. What is the recommended way to export a spark pipeline consisting of these operations.
I see 2 options wrt Mleap:
1) implement dataframe based transformers and the SQLTransformer-Mleap equivalent. This solution seems to be conceptually the best (since you can always encapsule such transformations in a pipeline element) but also alot of work tbh. See https://github.com/combust/mleap/issues/126
2) extend the DefaultMleapFrame with the respective operations, you want to perform and then actually apply the required actions to the data handed to the restserver within a modified MleapServing subproject.
I actually went with 2) and added implode, explode and join as methods to the DefaultMleapFrame and also a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just part of the workflow) then you need to look for other/combined standards. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (eg. a SQL database will be much better at it than any PMML or PFA runtime).

How does a Spark dependency less Model export works?

Could anyone please explain in simple language how does a Spark model
export works which is NOT dependent on the Spark cluster during
predictions?
I mean, if we are using Spark functions like ml.feature.stopwordremover in training in ML pipeline and export it in say, PMML format, how does this function gets regenerated when deployed in production where I don't have a Spark installation. May be when we use JPMML. I went through the PMML wiki page here but it simply explains the structure of PMML. However no functional description is provided there.
Any good links to articles are welcome.
Please experiment with the JPMML-SparkML library (or its PySpark2PMML or Sparklyr2PMML frontends) to see how exactly are different Apache Spark transformers and models mapped to the PMML standard.
For example, the PMML standard does not provide a specialized "remove stopwords" element. Instead, all low-level text manipulation is handled using generic TextIndex and TextIndexNormalization elements. The removal of stopwords is expressed/implemented as a regex transformation where they are simply replaced with empty strings. To evaluate such PMML documents, your runtime must only provide basic regex capabilities - there is absolutely no need for Apache Spark runtime or its transformer and model algorithms/classes.
The translation from Apache Spark ML to PMML works surprisingly well (eg. much better coverage than with other translation approaches such as MLeap).

Using Spark ML Pipelines just for Transformations

I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal about debate whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DF lineage tracking). The argument for this side is that Spark's ML pipelines are not intended just ETL jobs, and should always be implemented with goal of producing a column which can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work that mirrors already existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
For me, seems like a great idea, especially if you can compose the different Pipelines generated into new ones since a Pipeline can itself be made of different pipelines since a Pipeline extends from PipelineStage up the tree (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
But keep in mind that you will probably being doing the same thing under the hood as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
If you have decided for other approach or found a better way, please, comment. It's nice to share knowledge about Spark.

Run ML algorithm inside map function in Spark

So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question but referencing Spark's ML algorithms gives me the following error:
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
Obviously I cannot reference SparkContext inside the apply_classifier function.
My code is similar to what was suggested in the previous question I asked but still haven't found a solution to what I am looking for:
def apply_classifier(clf):
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
if clf == 0:
clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
elif clf == 1:
clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)
classifiers = [0, 1]
sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()
I have tried using flatMap instead of map but I get NoneType object is not iterable.
I would also like to pass a broadcasted dataset (which is a DataFrame) as parameter inside the apply_classifier function.
Finally, is it possible to do what I am trying to do? What are the alternatives?
is it possible to do what I am trying to do?
It is not. Apache Spark doesn't support any form of nesting and distributed operations can be initialized only by the driver. This includes access to distributed data structures, like Spark DataFrame.
What are the alternatives?
This depends on many factors like the size of the data, amount of available resources, and choice of algorithms. In general you have three options:
Use Spark only as task management tool to train local, non-distributed models. It looks like you explored this path to some extent already. For more advanced implementation of this approach you can check spark-sklearn.
In general this approach is particularly useful when data is relatively small. Its advantage is that there is no competition between multiple jobs.
Use standard multithreading tools to submit multiple independent jobs from a single context. You can use for example threading or joblib.
While this approach is possible I wouldn't recommend it in practice. Not all Spark components are thread-safe and you have to pretty careful to avoid unexpected behaviors. It also gives you very little control over resource allocation.
Parametrize your Spark application and use external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.
While this approach has some drawbacks (it will require saving data to a persistent storage) it is also the most universal and robust and gives a lot of control over resource allocation.

Is Spark still advantageous for non-iterative analytics?

Spark uses in memory computing and caching to decrease latency on complex analytics, however this is mainly for "iterative algorythms",
If I needed to perform a more basic analytic, say perhaps each element was a group of numbers and I wanted to look for elements with a standard deviation less than 'x' would Spark still decrease latency compared to regular cluster computing (without in memory computing)? Assuming I used that same commodity hardware in each case.
It tied for the top sorting framework using none of those extra mechanisms, so I would argue that is reason enough. But, you can also run streaming, graphing, or machine learning without having to switch gears. Then, you add in that you should use DataFrames wherever possible and you get query optimizations beyond any other framework that I know of. So, yes, Spark is the clear choice in almost every instance.
One good thing about spark is its Datasource API combining it with SparkSQL gives you ability to query and join different data sources together. SparkSQL now includes decent optimizer - catalyst. As mentioned in one of the answer along with core (RDD) in spark you can also include streaming data, apply machine learning models and graph algorithms. So yes.

Resources