I've set up a simple Spark ML app, where I have a pipeline of independent transformers that add columns to a DataFrame of raw data. Since the transformers don't look at each other's output, I was hoping I could run them in parallel in a non-linear (DAG) pipeline. All I could find about this feature is this paragraph from the Spark ML guide:
It is possible to create non-linear Pipelines as long as the data flow
graph forms a Directed Acyclic Graph (DAG). This graph is currently
specified implicitly based on the input and output column names of
each stage (generally specified as parameters). If the Pipeline forms
a DAG, then the stages must be specified in topological order.
My understanding of the paragraph is that if I set the inputCol(s) and outputCol parameters for each transformer and specify the stages in topological order when I create the pipeline, then the engine will use that information to build an execution DAG such that each stage of the DAG can run once its input is ready.
Some questions about that:
Is my understanding correct?
What happens if for one of the stages/transformers I don't specify an output column (e.g. the stage only filters some of the rows)? Will it assume that, for DAG creation purposes, the stage changes all columns, so that all subsequent stages should wait for it?
Likewise, what happens if for one of the stages I don't specify an inputCol(s)? Will the stage wait until all previous stages are complete?
It seems I can specify multiple input columns but only one output column. What happens if a transformer adds two columns to a dataframe (Spark itself has no problem with that)? Is there some way to let the DAG creation engine know about it?
Is my understanding correct?
Not exactly. Because stages are provided in topological order, all you have to do to traverse the graph in the correct order is to apply the PipelineStages from left to right. And this is exactly what happens when you call Pipeline.fit.
The sequence of stages is traversed twice:
once to validate the schema, using transformSchema, which is simply implemented as stages.foldLeft(schema)((cur, stage) => stage.transformSchema(cur)). This is where the actual schema validation is performed;
once to actually fit Estimators and transform the data using Transformers. This is just a simple loop which applies the stages sequentially, one by one.
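The two traversals can be sketched roughly like this (a simplified sketch, not Spark's actual implementation; only transformSchema, fit and transform are real API calls):

```scala
import org.apache.spark.ml.{Estimator, PipelineStage, Transformer}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Pass 1: schema validation only. No data is touched here.
def validate(stages: Array[PipelineStage], schema: StructType): StructType =
  stages.foldLeft(schema)((cur, stage) => stage.transformSchema(cur))

// Pass 2: strictly sequential, left-to-right fit/transform.
def fitAll(stages: Array[PipelineStage], df: DataFrame): DataFrame =
  stages.foldLeft(df) { (cur, stage) =>
    stage match {
      case e: Estimator[_] => e.fit(cur).transform(cur)
      case t: Transformer  => t.transform(cur)
    }
  }
```

Note that nothing in either pass inspects column dependencies; the order of the stages array is the only ordering information used.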
Likewise, what happens if for one of the stages I don't specify an inputCol(s)?
Pretty much nothing interesting happens. Since stages are applied sequentially, and the only schema validation is performed by the given Transformer, using its transformSchema method, before the actual transformations begin, it will be processed like any other stage.
What happens if a transformer adds two columns to a dataframe
Same as above. As long as it generates a valid input schema for subsequent stages, it is no different from any other Transformer.
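For instance, a custom Transformer is free to declare two output columns in transformSchema and add both in transform (a sketch; the column names and the input column "text" are made up for illustration):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, length, upper}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class TwoColumnAdder(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("twoColAdder"))

  // Adds two columns at once; Spark itself has no problem with this.
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("text_len", length(col("text")))
      .withColumn("text_upper", upper(col("text")))

  // Declare both output columns so downstream schema validation passes.
  override def transformSchema(schema: StructType): StructType =
    schema
      .add(StructField("text_len", IntegerType))
      .add(StructField("text_upper", StringType))

  override def copy(extra: ParamMap): TwoColumnAdder = defaultCopy(extra)
}
```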
transformers don't look at the output of one another I was hoping I could run them in parallel
Theoretically you could try to build a custom composite transformer which encapsulates multiple different transformations, but the only part that can be performed independently and benefit from this type of operation is model fitting. At the end of the day you have to return a single transformed DataFrame which can be used by downstream stages, and the actual transformations are most likely scheduled as a single data scan anyway.
The question remains whether it is really worth the effort. While it is possible to execute multiple jobs at the same time, it provides an edge only if the amount of available resources is relatively high compared to the amount of work required to handle a single job. It usually requires some low-level management (number of partitions, number of shuffle partitions), which is not the strongest suit of Spark SQL.
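If you do want to try concurrent fitting, a minimal sketch looks like this (an assumption on my part, not something Pipeline does for you; each fit() still submits its own Spark jobs and they only overlap if the cluster has spare resources, ideally with a fair scheduler pool configured):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.ml.Estimator
import org.apache.spark.sql.DataFrame

// Fit several independent estimators concurrently from the driver.
// Job submission happens in parallel; actual cluster-side parallelism
// depends on available executors and the scheduler configuration.
def fitInParallel(estimators: Seq[Estimator[_]], df: DataFrame): Seq[Any] = {
  val futures = estimators.map(e => Future(e.fit(df)))
  futures.map(f => Await.result(f, Duration.Inf))
}
```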
Related
I am struggling with implementing a performant version of the SOM batch algorithm on Spark / PySpark for a huge dataset with more than 100 features.
I have the feeling that I can either use RDDs, where I can/have to specify the parallelization on my own, or use DataFrames, which should be more performant, but I see no way to use something like a local accumulation variable for each worker when using DataFrames.
Ideas:
Use accumulators. Parallelize the calculations by creating a UDF which takes the observations as input, calculates the impacts on the net, and sends the impacts to an accumulator in the driver. (I implemented this version already, but it seems rather slow; I think the accumulator updates take too long.)
Store the results in a new column of the DataFrame and then sum them together in the end. (I would have to store a whole neural net in each row, e.g. 20*20*130, though.) Would Spark's optimizer realize that it does not need to save each net but only sum them together?
Create a custom parallelized algorithm using RDDs, similar to this: https://machinelearningnepal.com/2018/01/22/apache-spark-implementation-of-som-batch-algorithm/ (but with more performant calculation algorithms). But I would have to use some kind of loop to iterate over each row and update the net, which sounds rather unperformant.
Any thoughts on the different options? Is there an even better option?
Or are all these ideas not that good, and should I just preselect a maximally diverse subset of my dataset and train a SOM locally on that?
Thanks!
This is exactly what I did last year, so I might be in a good position to give you an answer.
First, here is my Spark implementation of the batch SOM algorithm (it is written in Scala, but most things will be similar in PySpark).
I needed this algorithm for a project, and every implementation I found had at least one of these two problems or limitations:
they did not really implement the batch SOM algorithm, but used a map-averaging method that gave me strange results (abnormal symmetries in the output map);
they did not use the DataFrame API (pure RDD API) and were not in the Spark ML/MLlib spirit, i.e. with a simple fit()/transform() API operating over DataFrames.
So I went on to code it myself: the batch SOM algorithm, in Spark ML style. The first thing I did was look at how k-means was implemented in Spark ML, because, as you know, the batch SOM is very similar to the k-means algorithm. Actually, I could re-use a large portion of the Spark ML k-means code, but I had to modify the core algorithm and the hyperparameters.
I can summarize quickly how the model is built:
A SOMParams class, containing the SOM hyperparameters (size, training parameters, etc.)
A SOM class, which inherits from Spark's Estimator and contains the training algorithm. In particular, it contains a fit() method that operates on an input DataFrame, where the features are stored as a spark.ml.linalg.Vector in a single column. fit() will then select this column and unpack the DataFrame to obtain the underlying RDD[Vector] of features, and call the run() method on it. This is where all the computations happen, and as you guessed, it uses RDDs, accumulators and broadcast variables. Finally, the fit() method returns a SOMModel object.
SOMModel is the trained SOM model, and inherits from Spark's Transformer/Model. It contains the map prototypes (center vectors) and a transform() method that operates on DataFrames by taking an input feature column and adding a new column with the predictions (the projection on the map). This is done by a prediction UDF.
There is also a SOMTrainingSummary that collects things such as the objective function.
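The skeleton of such a design looks roughly like this (a sketch only; the run() and bestMatchingUnit() bodies are elided, the "features" and "prediction" column names are illustrative, and the real implementation linked above has the full parameter handling):

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.StructType

class SOM(override val uid: String) extends Estimator[SOMModel] {
  def this() = this(Identifiable.randomUID("som"))

  override def fit(ds: Dataset[_]): SOMModel = {
    // Unpack the single feature column to the underlying RDD[Vector];
    // run() (not shown) performs the batch SOM iterations on it.
    val data = ds.select(col("features")).rdd.map(_.getAs[Vector](0))
    new SOMModel(uid, run(data))
  }

  private def run(data: RDD[Vector]): Array[Vector] = ???

  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): SOM = defaultCopy(extra)
}

class SOMModel(override val uid: String, val prototypes: Array[Vector])
    extends Model[SOMModel] {
  override def transform(ds: Dataset[_]): DataFrame = {
    // Prediction UDF: project each point on its best matching unit.
    val predict = udf((v: Vector) => bestMatchingUnit(v))
    ds.withColumn("prediction", predict(col("features")))
  }
  private def bestMatchingUnit(v: Vector): Int = ???
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): SOMModel = new SOMModel(uid, prototypes)
}
```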
Here are the take-aways:
There is not really an opposition between RDDs and DataFrames (or rather Datasets, but the difference between those two is of no real importance here). They are just used in different contexts. In fact, a DataFrame can be seen as an RDD specialized for manipulating structured data organized in columns (such as relational tables), allowing SQL-like operations and an optimization of the execution plan (the Catalyst optimizer).
For structured data and select/filter/aggregation operations, DO USE DataFrames, always.
...but for more complex tasks, such as a machine learning algorithm, you NEED to come back to the RDD API and distribute your computations yourself, using map/mapPartitions/foreach/reduce/reduceByKey and so on. Look at how things are done in MLlib: it's just a nice wrapper around RDD manipulations!
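As an illustration of that pattern, here is roughly how one batch update can accumulate per-partition contributions and merge them with treeAggregate, similar in spirit to MLlib's k-means update step (a sketch; the bmu function and the array shapes are assumptions for illustration):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// One batch step: for each point, find its best matching unit (bmu) and
// accumulate (sum, count) per unit locally within each partition, then
// merge the partial results. Only the merged k x dim arrays reach the driver.
def batchStep(data: RDD[Vector], bmu: Vector => Int, k: Int, dim: Int)
    : (Array[Array[Double]], Array[Long]) =
  data.treeAggregate(
    (Array.fill(k)(new Array[Double](dim)), new Array[Long](k)))(
    seqOp = { case ((sums, counts), v) =>
      val j = bmu(v)
      var i = 0
      while (i < dim) { sums(j)(i) += v(i); i += 1 }
      counts(j) += 1
      (sums, counts)
    },
    combOp = { case ((s1, c1), (s2, c2)) =>
      for (j <- 0 until k; i <- 0 until dim) s1(j)(i) += s2(j)(i)
      for (j <- 0 until k) c1(j) += c2(j)
      (s1, c1)
    })
```

This avoids both per-update accumulator traffic (idea 1 in the question) and storing a whole net per row (idea 2): each partition keeps one local net and only the merged results are shipped.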
Hope this answers your question. Concerning performance: you asked for an efficient implementation, and while I have not run any benchmarks yet, I use it at work and it crunches 500k/1M-row datasets in a couple of minutes on the production cluster.
I created an ML pipeline with several transformers, including a StringIndexer which is used during training on the data labels.
I then store the resulting PipelineModel, which will later be used for data preparation and prediction on a dataset which doesn't have labels.
The issue is that the created pipeline model's transform function cannot be applied to the new DataFrame, since it expects data labels to be available.
What am I missing?
How should this be done?
Note: My goal is to have a single pipeline (i.e. I'd like to keep the various transformations and ML algorithm together)
Thanks!
You should paste your source code. Your test data format should be consistent with your training data, including the feature names, but you don't need the label column.
You can refer to the official site.
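One way to structure this (a sketch, assuming a label column named "label" and feature columns "f1"/"f2", all hypothetical) is to keep the label indexing outside the shared pipeline, so that the saved PipelineModel contains only stages that unlabeled data can satisfy:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Feature-only stages: these work on both labeled and unlabeled data.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// The label indexer is applied separately, on training data only.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("indexedLabel")

// Training (sketch):
//   val trainIndexed = labelIndexer.fit(train).transform(train)
//   val model = new Pipeline().setStages(Array(assembler, lr)).fit(trainIndexed)
// Prediction on unlabeled data then works, because no stage in the saved
// PipelineModel reads the "label" column:
//   model.transform(unlabeled)
```

This keeps the feature transformations and the ML algorithm together in a single pipeline, which was the stated goal; only the label indexing lives outside it.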
Consider the two scenarios:
A) If I have an RDD and various RDD transformations are called on it, and before any action is performed I create a Dataset from it.
B) I create a Dataset at the very beginning and call various Dataset methods on it.
Question: if the two scenarios produce the same outcome logically (one uses RDD transformations and converts to a Dataset right before an action, the other just uses a Dataset and its transformations), do both scenarios go through the same optimizations?
No they do not.
When you use RDDs and apply RDD transformations to them, no optimization is done. Only when you convert to a Dataset at the end is the conversion to the Tungsten-based representation (which takes less memory and doesn't need to go through garbage collection) performed.
When you use a Dataset from the beginning, it will use the Tungsten-based memory representation from the start. This means it will take less memory, shuffles will be smaller and faster, and no GC overhead will occur (although conversion from the internal representation to the case class and back occurs any time typed operations are used). If you use DataFrame operations on the Dataset, it may also take advantage of code generation and Catalyst optimizations.
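You can see the difference yourself by comparing the execution plans (a sketch, assuming a SparkSession named spark is in scope; the tuple fields are illustrative):

```scala
import spark.implicits._

val ds = Seq((1, "a"), (2, "b"), (3, "c")).toDS()

// Pure Dataset/DataFrame operations: Catalyst sees the whole plan
// and can optimize it (predicate pushdown, column pruning, codegen).
ds.filter($"_1" > 1).select($"_2").explain()

// Same logic via the RDD API: the lambdas are opaque to the optimizer,
// and the data only enters the Tungsten representation at the final toDS().
ds.rdd.filter(_._1 > 1).map(_._2).toDS().explain()
```

The second plan shows a bare scan over an existing RDD with no pushed-down filters, which is exactly the "no optimization" case described above.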
See also my answer in: Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?
They don't. The RDD API doesn't use any of the Tungsten / Catalyst optimizations, and logically equivalent code expressed with RDDs is not optimized.
I'd like to compose a pipeline for textual data. However, there is an extra step in the pipeline where "external" features are added. These features are stored in an external database and are accessed by document id (the row number in the input).
The custom pipeline stage comes after a TF-IDF step, meaning the input to the stage will be a sparse matrix. Is there a way for me to pass the indices in the input matrix as well? Or maybe a generic way to pass some metadata between pipeline stages?
Note that the input to the pipeline is selected by GridSearchCV.
I saw Feature Union with Heterogeneous Data Sources, but I fail to see how to apply it to my situation, since I can't compute the features from the input to the stage.
In Spark, it is possible to compose multiple RDDs into one using zip, union, join, etc.
Is it possible to decompose an RDD efficiently, namely without performing multiple passes over the original RDD? What I am looking for is something similar to:
val rdd: RDD[T] = ...
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...)
One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I have encountered, we need to run iterative algorithms on each of the groups separately.
The current possibilities I am aware of are:
groupBy: groupBy returns an RDD[(K, Iterable[T])], which does not give you the RDD benefits on the group itself (the Iterable).
Aggregations: reduceByKey, foldByKey, etc. perform only one "iteration" over the data and do not have the expressive power to implement iterative algorithms.
Creating separate RDDs using the filter method, with multiple passes over the data (where the number of passes is equal to the number of keys); this is not feasible when the number of keys is large.
Some of the use cases I am considering are, given a very large (tabular) dataset:
We wish to execute some iterative algorithm on each of the columns separately, for example some automated feature extraction. A natural way to do this would be to decompose the dataset so that each column is represented by a separate RDD.
We wish to decompose the dataset into disjoint datasets (for example, a dataset per day) and run some machine learning modeling on each of them.
I think the best option is to write out the data in a single pass to one file per key (see Write to multiple outputs by key Spark - one Spark job) and then load the per-key files into one RDD each.
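A sketch of that approach using the DataFrame writer's partitionBy (the "key" column name, the output path, and the input df are illustrative assumptions):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

def decomposeByKey(df: DataFrame, path: String): Map[String, DataFrame] = {
  // Single pass over the data: one directory per key value.
  df.write.partitionBy("key").parquet(path)

  // Later: load each group as its own DataFrame (use .rdd if you need
  // an RDD) and run the iterative algorithm on each one independently.
  val keys = df.select("key").distinct().as[String].collect()
  keys.map(k => k -> spark.read.parquet(s"$path/key=$k")).toMap
}
```

The write is the only full scan of the original data; afterwards each group can be cached and iterated over on its own, which is exactly what groupBy's Iterable cannot give you.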