Could anyone please explain in simple language how a Spark model export works that is NOT dependent on the Spark cluster during predictions?
I mean, if we use Spark functions like ml.feature.StopWordsRemover during training in an ML pipeline and export it in, say, PMML format, how does this function get regenerated when deployed in production, where I don't have a Spark installation? Maybe when we use JPMML? I went through the PMML wiki page here, but it simply explains the structure of PMML; no functional description is provided there.
Any good links to articles are welcome.
Please experiment with the JPMML-SparkML library (or its PySpark2PMML or Sparklyr2PMML frontends) to see exactly how different Apache Spark transformers and models are mapped to the PMML standard.
For example, the PMML standard does not provide a specialized "remove stopwords" element. Instead, all low-level text manipulation is handled using generic TextIndex and TextIndexNormalization elements. The removal of stopwords is expressed/implemented as a regex transformation in which they are simply replaced with empty strings. To evaluate such PMML documents, your runtime only needs to provide basic regex capabilities; there is absolutely no need for the Apache Spark runtime or its transformer and model classes.
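As a rough illustration (the column names, input file and pipeline stages below are my own assumptions, and the PMMLBuilder entry point reflects recent JPMML-SparkML versions), exporting a fitted pipeline that contains a StopWordsRemover might look like this:

    import java.io.File

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
    import org.jpmml.sparkml.PMMLBuilder

    // Assumes a SparkSession `spark` and a training DataFrame with
    // "text" (String) and "label" (Double) columns.
    val training = spark.read.parquet("training_data.parquet")

    val tokenizer  = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
    val remover    = new StopWordsRemover().setInputCol("tokens").setOutputCol("terms")
    val vectorizer = new CountVectorizer().setInputCol("terms").setOutputCol("features")
    val classifier = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val pipelineModel = new Pipeline()
      .setStages(Array(tokenizer, remover, vectorizer, classifier))
      .fit(training)

    // The fitted Spark pipeline is translated into a standalone PMML document;
    // the stopword list ends up inside TextIndex/TextIndexNormalization elements.
    new PMMLBuilder(training.schema, pipelineModel).buildFile(new File("pipeline.pmml"))

The resulting pipeline.pmml file can then be scored by any PMML runtime, with no Spark dependency at prediction time.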
The translation from Apache Spark ML to PMML works surprisingly well (e.g. much better coverage than other translation approaches such as MLeap).
Related
I am trying to write a Spark evaluator in StreamSets. I have to deal with complex SQL queries and hence would want to use DataFrames or Datasets here. But the sample code which StreamSets provides deals with JavaRDD only. Can I get some insight into DataFrames to get a head start here?
You are almost certainly better off looking at using StreamSets Transformer. Transformer has a much deeper Spark integration and will allow you to work with native Spark structures.
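If you do end up staying on the classic Spark Evaluator, the generic Spark part of going from an RDD to a DataFrame looks roughly like the sketch below. Note that mapping the incoming StreamSets records to a plain case class (Reading here) is an assumption and is not shown, since it is StreamSets-specific:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical flat representation of the fields carried by your records.
    case class Reading(id: String, value: Double)

    // Assumes the incoming records have already been mapped to Reading instances.
    def toDataFrame(spark: SparkSession, readings: RDD[Reading]): DataFrame = {
      import spark.implicits._
      readings.toDF()
    }

    // Once you have a DataFrame you can register a view and run complex SQL:
    // toDataFrame(spark, readings).createOrReplaceTempView("readings")
    // spark.sql("SELECT id, avg(value) AS avg_value FROM readings GROUP BY id")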
PMML, MLeap, and PFA currently only support row-based transformations. None of them support frame-based transformations like aggregates, groupby, or join. What is the recommended way to export a Spark pipeline consisting of these operations?
I see two options with regard to MLeap:
1) Implement DataFrame-based transformers and an MLeap equivalent of SQLTransformer. This solution seems conceptually the best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work, to be honest. See https://github.com/combust/mleap/issues/126
2) Extend the DefaultMleapFrame with the respective operations you want to perform, and then actually apply the required actions to the data handed to the REST server within a modified MleapServing subproject.
I actually went with 2) and added implode, explode and join as methods to the DefaultMleapFrame and also a HashIndexedMleapFrame that allows for fast joins. I did not implement groupby and agg, but in Scala this is relatively easy to accomplish.
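For anyone curious what option 2) amounts to, the sketch below illustrates the hash-indexed join idea over plain Scala collections. SimpleFrame is a toy stand-in, not a real MLeap class, and the actual DefaultMleapFrame API differs:

    // Toy stand-in for a leap frame: rows are sequences of values plus a column index.
    // None of these types come from MLeap itself.
    case class SimpleFrame(columns: Seq[String], rows: Seq[Seq[Any]]) {

      // Build a hash index on the join key once, then probe it per left row -
      // essentially what a HashIndexedMleapFrame-style structure buys you.
      def hashJoin(other: SimpleFrame, key: String): SimpleFrame = {
        val leftKey  = columns.indexOf(key)
        val rightKey = other.columns.indexOf(key)

        val index: Map[Any, Seq[Seq[Any]]] = other.rows.groupBy(_(rightKey))

        val joined = for {
          left  <- rows
          right <- index.getOrElse(left(leftKey), Nil)
        } yield left ++ right.patch(rightKey, Nil, 1) // drop the duplicated key column

        SimpleFrame(columns ++ other.columns.filterNot(_ == key), joined)
      }
    }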
PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just part of the workflow), then you need to look for other/combined standards. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (e.g. a SQL database will be much better at it than any PMML or PFA runtime).
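As a hedged sketch of that split (the column names and input path below are made up), the frame-level aggregation can live in Spark SQL, and only the resulting per-record rows are handed to the PMML/PFA scoring runtime:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark  = SparkSession.builder().appName("aggregate-then-score").getOrCreate()
    val events = spark.read.parquet("events.parquet") // hypothetical event-level input

    // The "frame based" work (group-by, aggregates, joins) happens here, outside the model.
    val perCustomer = events
      .groupBy("customer_id")
      .agg(
        count("*").as("n_events"),
        avg("amount").as("avg_amount"),
        max("amount").as("max_amount")
      )

    // Each row of perCustomer is now a single self-contained record - exactly the shape
    // a PMML or PFA model expects - and can be scored row by row (e.g. with JPMML-Evaluator)
    // without the scoring runtime ever needing group-by or join semantics.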
I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DataFrame lineage tracking). The argument for this side is that Spark's ML Pipelines are not intended for plain ETL jobs and should always be implemented with the goal of producing a column which can be fed to a Spark ML Evaluator. The argument against it is that it requires a lot of work that mirrors already existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
For me, it seems like a great idea, especially if you can compose the different Pipelines you generate into new ones, since a Pipeline can itself be made up of other pipelines (a Pipeline extends PipelineStage; source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
But keep in mind that you will probably be doing the same thing under the hood as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
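To make the ETL-only usage concrete, here is a minimal sketch of such a custom Transformer (Spark 2.x API; the column names and the conversion rate are purely illustrative):

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.{DefaultParamsWritable, Identifiable}
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    // A pure-ETL stage: adds an "amount_usd" column derived from "amount_eur".
    class EurToUsd(override val uid: String) extends Transformer with DefaultParamsWritable {
      def this() = this(Identifiable.randomUID("eurToUsd"))

      private val rate = 1.1 // illustrative constant

      override def transform(dataset: Dataset[_]): DataFrame =
        dataset.withColumn("amount_usd", col("amount_eur") * rate)

      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField("amount_usd", DoubleType, nullable = true))

      override def copy(extra: ParamMap): EurToUsd = defaultCopy(extra)
    }

    // Composable like any other stage:
    // new Pipeline().setStages(Array(new EurToUsd(), otherEtlStage)).fit(df).transform(df)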
If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.
Is it possible to save a Spark ML Pipeline to a database (Cassandra, for example)? From the documentation I can only see the save-to-path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at the moment. The ML writer is not extensible and depends on Parquet files and a directory structure to represent models.

Technically speaking, you can extract the individual components and use internal private APIs to recreate the models from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance, all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching versions.
Spark >= 1.6
Since Spark 1.6 it has been possible to save your models using the save method, because almost every model implements the MLWritable interface. LogisticRegressionModel, for example, has it, and therefore it is possible to save a model to the desired path.
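For example (the paths are illustrative and a training DataFrame with "label" and "features" columns is assumed):

    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

    // `training` is an assumed DataFrame with "label" and "features" columns.
    val model: LogisticRegressionModel = new LogisticRegression().fit(training)

    model.save("/models/lr-model")                                  // MLWritable.save
    val reloaded = LogisticRegressionModel.load("/models/lr-model") // MLReadable.load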
Spark < 1.6
Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply a must-have in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
I believe that, at the end of the day, what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline:
use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation
save the MLlib model using a method of your choice (Spark model or PMML); see the sketch below
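A rough sketch of that combined approach (assuming a pre-2.0 Spark where ml and MLlib share the same Vector type; the input DataFrame raw and its columns are made up, and SVMModel is just one of the PMML-exportable MLlib models):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // 1. ML side: index the label and assemble the feature vector.
    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("label")
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val prepared  = new Pipeline().setStages(Array(indexer, assembler)).fit(raw).transform(raw)

    // 2. Hand off to MLlib as RDD[LabeledPoint] (the manual equivalent of extractLabeledPoints).
    val labeledPoints = prepared.select("label", "features").rdd.map { row =>
      LabeledPoint(row.getDouble(0), row.getAs[org.apache.spark.mllib.linalg.Vector](1))
    }

    // 3. Train an MLlib algorithm and export it to PMML.
    val svmModel = SVMWithSGD.train(labeledPoints, 100)
    svmModel.toPMML("svm-model.pmml")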
A temporary solution is also provided in JIRA: Temporary Solution.
I am new to recommender systems and trying to decide between Apache Mahout and Spark ALS as the algorithmic core for my recommender engine.
Does Mahout's spark-itemsimilarity job only have a CLI?
The only related documentation that I have come across is this: http://apache.github.io/mahout/0.10.1/docs/mahout-spark/index.html#org.apache.mahout.drivers.ItemSimilarityDriver$ which pertains to the CLI.
Also, for the CLI, I see that the input format is limited to text files. Does that mean I will have to transform all my data, stored in, say, Cassandra, into a text file format to use spark-itemsimilarity?
I have already referred to the introductory documentation on the usage of spark-itemsimilarity here: https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html.
Any help and pointers to relevant documentation would be much appreciated.