Export spark feature transformation pipeline to a file - apache-spark

PMML, MLeap and PFA currently only support row-based transformations. None of them support frame-based transformations such as aggregates, groupBy or join. What is the recommended way to export a Spark pipeline consisting of these operations?
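For concreteness, this is the kind of frame-based stage in question, sketched with Spark's SQLTransformer (the column names and the aggregation are made up; __THIS__ is SQLTransformer's placeholder for the input DataFrame). It runs fine inside Spark, but it is exactly the kind of stage the row-based export formats cannot represent:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.SQLTransformer

// Frame-based step: a groupBy/aggregate expressed as a pipeline stage
val aggregate = new SQLTransformer().setStatement(
  "SELECT userId, SUM(amount) AS totalAmount, COUNT(*) AS txnCount FROM __THIS__ GROUP BY userId")

val pipeline = new Pipeline().setStages(Array(aggregate))
// pipeline.fit(df).transform(df) works inside Spark, but this stage has no
// row-based equivalent in PMML, PFA or MLeap.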

I see two options with respect to MLeap:
1) Implement DataFrame-based transformers and an MLeap equivalent of SQLTransformer. This solution seems conceptually the best (since you can always encapsulate such transformations in a pipeline element), but it is also a lot of work. See https://github.com/combust/mleap/issues/126
2) Extend DefaultMleapFrame with the respective operations you want to perform, and then actually apply the required actions to the data handed to the REST server within a modified MleapServing subproject.
I actually went with 2) and added implode, explode and join as methods on DefaultMleapFrame, as well as a HashIndexedMleapFrame that allows for fast joins. I did not implement groupBy and agg, but in Scala this is relatively easy to accomplish.

PMML and PFA are standards for representing machine learning models, not data processing pipelines. A machine learning model takes in a data record, performs some computation on it, and emits an output data record. So by definition, you are working with a single isolated data record, not a collection/frame/matrix of data records.
If you need to represent complete data processing pipelines (where the ML model is just part of the workflow) then you need to look for other/combined standards. Perhaps SQL paired with PMML would be a good choice. The idea is that you want to perform data aggregation outside of the ML model, not inside it (e.g. a SQL database will be much better at it than any PMML or PFA runtime).

Related

Disadvantages of Spark Dataset over DataFrame

I know the advantages of Datasets (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to make use of Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDDs, DataFrames and Datasets and how they evolved. This is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
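For example (a minimal sketch in the spark-shell; "value" is the default column name of a primitive Dataset), comparing the physical plans makes the difference visible:

import org.apache.spark.sql.functions.sqrt
import spark.implicits._  // an existing SparkSession named spark is assumed

val ds = Seq(1.0, 4.0, 9.0).toDS()

// Built-in function: Catalyst sees SQRT(value) and can optimize and codegen it
ds.withColumn("rootX", sqrt($"value")).explain()

// Typed lambda: Catalyst only sees an opaque map over deserialized objects
ds.map(x => math.sqrt(x)).explain()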
There are also many untyped Dataframe functions (like statistical functions) that are implemented for Dataframes but not typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe because the functions work by creating new columns, modifying the schema of your dataset.
In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._  // for toDS() and the $"..." column syntax (spark is the SparkSession)

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+

Using Spark ML Pipelines just for Transformations

I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or on Spark's internal DataFrame lineage tracking). The argument for this side is that Spark's ML Pipelines are not intended for plain ETL jobs and should always be implemented with the goal of producing a column which can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work that mirrors already-existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
To me this seems like a great idea, especially since you can compose the different Pipelines you generate into new ones: a Pipeline can itself be made up of other Pipelines, because Pipeline extends PipelineStage (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
But keep in mind that you will probably be doing the same thing under the hood as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
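For illustration, a minimal sketch of a Transformer used purely as an ETL step (the class, column names and logic are made up); it fits into a Pipeline without any Estimator or Evaluator:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructType}

class UpperCaseName(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCaseName"))

  // Adds a column derived from an existing one; pure ETL, no model fitting involved
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("name_upper", upper(dataset("name")))

  override def transformSchema(schema: StructType): StructType =
    schema.add("name_upper", StringType, nullable = true)

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}

// new Pipeline().setStages(Array(new UpperCaseName())).fit(df).transform(df)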
If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.

Saving Spark ML pipeline to a database

Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at the moment. The ML writer is not extendable and depends on Parquet files and a directory structure to represent models.
Technically speaking you can extract the individual components and use internal / private APIs to recreate the models from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience some issues with model saving I would suggest switching version.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. LogisticRegressionModel, for example, has it, and therefore it's possible to save your model to the desired path using it.
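For example (a minimal sketch; the path and the training DataFrame are assumed):

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// `training` is assumed to be a DataFrame with "label" and "features" columns
val model = new LogisticRegression().fit(training)

// Writes metadata plus Parquet data under the given directory
model.save("/tmp/logreg-model")

// Reads it back later (or from another application)
val restored = LogisticRegressionModel.load("/tmp/logreg-model")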
Spark < 1.6
Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline:
use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation
save the MLlib model using a method of your choice (Spark model or PMML).
There is also a temporary solution described in the related JIRA ticket.

Implementing a Spark SQL UserDefinedAggregateFunction that performs multiple passes over a column

I've been experimenting with the UserDefinedAggregateFunction class to write aggregate functions for use in Spark SQL.
It works well for implementing single pass operations like sum(), avg() etc., but is there a trick you can use to perform multiple passes over a column?
For example, calculating variance using the naive approach, i.e. with a first pass calculating the column mean and then a second pass that uses this value to calculate the variance. I know that there are single-pass algorithms for doing this that give good approximations (as in fact implemented by Spark). I was just using this as an example of a two-pass operation.
It would be nice to be able to do the following,
spark.sql("SELECT product, MultiPassAgg(price) FROM products GROUP BY product")
I appreciate that I can do this kind of thing using Dataset / DataFrame operations in stages etc., but I was just looking for a clean approach as illustrated in the SQL above.
Any ideas or suggestions?
This should be possible, though the following suggestion could potentially use a large amount of memory if a large number of rows are involved in any given partition.
In the implementation of your UserDefinedAggregateFunction, set up the bufferSchema having a StructField that includes a DataType that is a collection (such as ArrayType) to act as an internal collection of inputs provided via update.
Then, in update you append each input to your collection, and in merge you combine all of the collections into a single collection. This allows you to have the full partition available for use in evaluate.
Finally, during evaluate you can operate across the entire collection of rows in any way you see fit.
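A minimal sketch of that buffering approach (the class name MultiPassVariance and the two-pass variance logic are only for illustration; this uses the pre-Spark-3 UserDefinedAggregateFunction API from the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MultiPassVariance extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  // The buffer is a growing collection of every input value seen so far
  override def bufferSchema: StructType = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[Double]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getSeq[Double](0) :+ input.getDouble(0)

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0)

  // All rows of the group are available here, so any number of passes is possible
  override def evaluate(buffer: Row): Any = {
    val xs = buffer.getSeq[Double](0)
    if (xs.isEmpty) null
    else {
      val mean = xs.sum / xs.size                          // first pass
      xs.map(x => (x - mean) * (x - mean)).sum / xs.size   // second pass
    }
  }
}

// spark.udf.register("MultiPassAgg", new MultiPassVariance)
// spark.sql("SELECT product, MultiPassAgg(price) FROM products GROUP BY product")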

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
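A minimal sketch of those two stages (jdbcUrl and connProps stand in for the MySQL connection details from the question):

// Stage 1: only the plan is built; no rows have been read from MySQL yet
val df = spark.read.jdbc(jdbcUrl, "records", connProps)
val prepared = df.na.fill(0.0)   // still just a recorded transformation

prepared.explain()   // prints the plan without executing it

// Stage 2: an action kicks off the actual read and processing
prepared.count()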
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what it is you are trying to accomplish, but it is safe to say that this much migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF? (See the sketch below.)
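For instance, the NULL replacement from the question can stay entirely inside the DataFrame API (a minimal sketch; the column names and cleanup rule are made up), so nothing is ever copied into a java.util.List or java.util.Vector, and the result can be fed straight to VectorAssembler / PCA:

import org.apache.spark.sql.functions.{coalesce, col, lit, udf}

// Built-in functions cover simple NULL replacement
val noNulls = df.withColumn("amount", coalesce(col("amount"), lit(0.0)))

// A UDF covers custom per-record logic without leaving the DataFrame
val normalize = udf((s: String) => if (s == null) "unknown" else s.trim.toLowerCase)
val cleaned = noNulls.withColumn("category", normalize(col("category")))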
Hope this helps.
