Export models as PMML using PySpark - apache-spark

Is it possible to export models as PMML using PySpark? I know this is possible using Spark, but I did not find any reference in the PySpark docs. So does this mean that if I want to do this, I need to write custom code using some third-party Python PMML library?

It is possible to export Apache Spark pipelines to PMML using the JPMML-SparkML library. Furthermore, this library is made available for end users in the form of a "Spark Package" by the JPMML-SparkML-Package project.
Example PySpark code:
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, df, pipelineModel)
print(pmmlBytes)

Related

How does a Spark-independent model export work?

Could anyone please explain in simple language how a Spark model export works when it is NOT dependent on the Spark cluster during predictions?
I mean, if we use Spark functions like ml.feature.StopWordsRemover during training in an ML pipeline and export it in, say, PMML format, how does this function get regenerated when deployed in production where I don't have a Spark installation? Maybe when we use JPMML? I went through the PMML wiki page here, but it simply explains the structure of PMML; no functional description is provided there.
Any good links to articles are welcome.
Please experiment with the JPMML-SparkML library (or its PySpark2PMML or Sparklyr2PMML frontends) to see exactly how different Apache Spark transformers and models are mapped to the PMML standard.
For example, the PMML standard does not provide a specialized "remove stopwords" element. Instead, all low-level text manipulation is handled using generic TextIndex and TextIndexNormalization elements. The removal of stopwords is expressed as a regex transformation in which they are simply replaced with empty strings. To evaluate such PMML documents, your runtime only needs to provide basic regex capabilities; there is absolutely no need for the Apache Spark runtime or its transformer and model algorithms/classes.
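As a rough illustration, here is a minimal PySpark sketch of exporting a pipeline that includes StopWordsRemover through the PySpark2PMML frontend. The input DataFrame df and all column names are hypothetical, and the JPMML-SparkML jar is assumed to be available to the Spark session:
# Hypothetical sketch: export a text pipeline (with stopword removal) to PMML.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark2pmml import PMMLBuilder

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="terms"),
    CountVectorizer(inputCol="terms", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])
pipeline_model = pipeline.fit(df)

# The stopword removal step is translated into TextIndex/TextIndexNormalization
# elements inside the generated PMML document.
PMMLBuilder(sc, df, pipeline_model).buildFile("pipeline.pmml")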
The translation from Apache Spark ML to PMML works surprisingly well (e.g. much better coverage than other translation approaches such as MLeap).

Displaying rules of decision tree modelled in pyspark ml library

I am new to Spark. I have modeled a decision tree using the DataFrame-based API, i.e. pyspark.ml. I want to display the rules of the decision tree, similar to what we get from the RDD-based API (spark.mllib) using toDebugString.
I have read the documentation and could not find how to display the rules. Is there any other way?
Thank you.
As of Spark 2.0 both DecisionTreeClassificationModel and DecisionTreeRegressionModel provide toDebugString methods.
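For example, a minimal sketch (the training DataFrame and column names are hypothetical); in PySpark the rules are exposed through the fitted model's toDebugString property:
from pyspark.ml.classification import DecisionTreeClassifier

# Fit a decision tree with the DataFrame-based API (pyspark.ml).
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
model = dt.fit(train_df)

# Print the learned split rules, much like spark.mllib's toDebugString output.
print(model.toDebugString)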

Saving Spark ML pipeline to a database

Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at the moment. The ML writer is not extensible and depends on Parquet files and a directory structure to represent models.
Technically speaking, you can extract the individual components and use internal, private APIs to recreate the model from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching versions.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LogisticRegressionModel has it, so you can save your model to the desired path.
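As a rough illustration (the path is hypothetical; shown with the Spark 2.x PySpark API):
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr_model = LogisticRegression(maxIter=10).fit(train_df)

# save() writes Parquet files plus metadata under the given directory...
lr_model.save("/models/lr")

# ...which is why the target has to be a filesystem path rather than a database.
same_model = LogisticRegressionModel.load("/models/lr")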
Spark < 1.6
Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-like API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
I believe that, at the end of the day, what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline (sketched below):
use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation,
save the MLlib model using a method of your choice (Spark model or PMML).
There is also a temporary solution provided in JIRA.
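A rough sketch of such a hybrid pipeline, as it would look on those older (pre-2.0) Spark versions where pyspark.ml still used the mllib vector classes; the input DataFrame df, the column names and the output path are hypothetical:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# 1. Use ML to load, clean and transform the data.
prep = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="categoryIndex"),
    VectorAssembler(inputCols=["categoryIndex", "x1", "x2"], outputCol="features"),
])
prepared = prep.fit(df).transform(df)

# 2. Extract the required data and hand it to an MLlib algorithm.
points = prepared.select("label", "features").rdd \
    .map(lambda row: LabeledPoint(row.label, row.features))
mllib_model = LogisticRegressionWithLBFGS.train(points)

# 3. Save the MLlib model using a supported mechanism.
mllib_model.save(sc, "/models/mllib_lr")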

Databricks display() function equivalent or alternative to Jupyter

I'm in the process of migrating current Databricks Spark notebooks to Jupyter notebooks. Databricks provides the convenient and beautiful display(data_frame) function to visualize Spark DataFrames and RDDs, but there's no direct equivalent in Jupyter (I'm not sure, but I think it's a Databricks-specific function). I tried:
dataframe.show()
But it's a text version of it, and when you have many columns it breaks, so I'm trying to find an alternative to display() that can render Spark DataFrames better than show(). Is there any equivalent or alternative to this?
When you use Jupyter, instead of using df.show(), use myDF.limit(10).toPandas().head(). However, since we sometimes work with many columns, pandas truncates the view.
So just set your pandas column display config to the max:
# Alternative to the Databricks display function.
import pandas as pd
pd.set_option('display.max_columns', None)  # show all columns instead of truncating
myDF.limit(10).toPandas().head()
First recommendation: when you use Jupyter, don't use df.show(); instead use df.limit(10).toPandas().head(), which gives a clean display, arguably even better than the Databricks display().
Second recommendation: Zeppelin Notebook. Just use z.show(df.limit(10)).
Additionally, in Zeppelin:
register your DataFrame as a SQL table with df.createOrReplaceTempView('tableName'),
then insert a new paragraph beginning with %sql and query your table with a rich display.
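For example (the table name is hypothetical):
# In a %pyspark paragraph:
z.show(df.limit(10))                    # Zeppelin's built-in table/chart view
df.createOrReplaceTempView("my_table")

# Then, in a new %sql paragraph:
#   SELECT * FROM my_table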
In recent IPython you can just use display(df) if df is a pandas DataFrame; it will just work. On older versions you might need to do from IPython.display import display. It will also display automatically if the result of the last expression of a cell is a DataFrame (as in this notebook, for example). Of course the representation depends on the library you use to build your DataFrame. If you are using PySpark and it does not define a nice representation by default, then you'll need to teach IPython how to display the Spark DataFrame. For example, here is a project that teaches IPython how to display Spark contexts and Spark sessions.
Without converting to a pandas DataFrame, use the following; it will render the DataFrame in a proper grid:
from IPython.core.display import HTML

# Stop the notebook from wrapping show()'s preformatted output,
# so wide tables keep their columns aligned.
display(HTML("<style>pre { white-space: pre !important; }</style>"))
df.show()
You can set the configuration spark.conf.set('spark.sql.repl.eagerEval.enabled', True).
This will display a native PySpark DataFrame without explicitly calling df.show(), and there is no need to convert the DataFrame to pandas either; all you need to do is evaluate df.
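A small sketch (requires Spark 2.4+; the DataFrame here is just a toy example):
# With eager evaluation enabled, evaluating a DataFrame in a notebook cell
# renders it as an HTML table - no show() or toPandas() required.
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
df = spark.range(5).toDF("n")
df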
Try Apache Zeppelin (https://zeppelin.apache.org/). There are some nice standard visualizations of DataFrames, especially if you use the SQL interpreter, and there is support for other useful interpreters as well.

Is it possible to invoke Mahout's spark item-similarity job from Java/Python?

I am new to recommender systems and trying to decide between Apache Mahout and Spark ALS as the algorithmic core for my recommender engine.
Does Mahout's spark item-similarity job only have a CLI?
The only related documentation that I have come across is this: http://apache.github.io/mahout/0.10.1/docs/mahout-spark/index.html#org.apache.mahout.drivers.ItemSimilarityDriver$ which pertains to the CLI.
Also, for the CLI, I see that the input format is limited to text files. Does that mean I will have to transform all my data, stored in, say, Cassandra, into a text file format to use spark item-similarity?
I have already referred to the introductory documentation on usage of spark item-similarity here - https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html.
Any help and pointers to relevant documentation would be much appreciated.
