CrossValidation/TrainValidationSplit with multiple pipelines in PySpark - apache-spark

I'm trying to evaluate multiple pipelines in PySpark. I can do it with a separate CV/TVS for each one, but I would like to do it in a single one so it gives me the best model directly, and I can't figure out how to make it work.
lr_assembler and assembler are two instances of VectorAssembler (with different feature selections).
pca, lr, rf and gbt are instances of PCA, LinearRegression, RandomForestRegressor and GBTRegressor.
Pipelines definition:
pipeline = Pipeline()
lr_stages = [lr_assembler, pca, lr]
rf_stages = [assembler, rf]
gbt_stages = [assembler, gbt]
lr_pipeline = Pipeline(stages=lr_stages)
rf_pipeline = Pipeline(stages=rf_stages)
gbt_pipeline = Pipeline(stages=gbt_stages)
paramMaps definition:
lr_grid = ParamGridBuilder().baseOn({pipeline.stages:lr_stages})\
.addGrid(pca.k, [2, 5, 7])\
.build()
rf_grid = ParamGridBuilder().baseOn({pipeline.stages:rf_stages})\
.addGrid(rf.maxDepth, [5, 10])\
.addGrid(rf.featureSubsetStrategy, ['3', '6'])\
.build()
gbt_grid = ParamGridBuilder().baseOn({pipeline.stages:gbt_stages})\
.addGrid(gbt.maxDepth, [5, 10])\
.addGrid(gbt.maxIter, [50, 100])\
.build()
grid = lr_grid + rf_grid + gbt_grid
TrainValidationSplit definition:
tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=grid, evaluator=rmse_evaluator, trainRatio=0.8, parallelism=3, seed=7)
Model training:
model = tvs.fit(train_val)
And after running that last line, this is the error I get (not sure if I should post the whole thing here):
KeyError: Param(parent='Pipeline_40f78ef0cee04a4ebc61', name='stages', doc='a list of pipeline stages')
Thanks for your time.

I had the same issue, which I resolved by initializing the Pipeline stages.
pipeline = Pipeline(stages=[]) # Must initialize with empty list!
There's a good example of this approach here:
https://github.com/dsharpc/dsharpc.github.io/blob/master/SparkMLFlights/README.md
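For completeness, once the outer Pipeline has its stages param initialized, the multi-pipeline grid from the question should work as intended. Below is a minimal sketch reusing the question's variable names (lr_assembler, assembler, pca, lr, rf, rmse_evaluator, train_val); it is not tested end-to-end:
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

pipeline = Pipeline(stages=[])  # the stages param must be initialized, even if only with []

lr_grid = ParamGridBuilder().baseOn({pipeline.stages: [lr_assembler, pca, lr]})\
    .addGrid(pca.k, [2, 5, 7])\
    .build()
rf_grid = ParamGridBuilder().baseOn({pipeline.stages: [assembler, rf]})\
    .addGrid(rf.maxDepth, [5, 10])\
    .build()

tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=lr_grid + rf_grid,
                           evaluator=rmse_evaluator,
                           trainRatio=0.8, seed=7)
model = tvs.fit(train_val)
best_pipeline_model = model.bestModel  # the winning PipelineModel across all grids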

Related

Cross validation using Pyspark

I'm trying to use cross-validation in Spark, but it throws an error:
gbtClassifier = GBTClassifier(featuresCol= "features", labelCol="is_goal")
lr = LogisticRegression(featuresCol= "features" ,labelCol="is_goal")
pipelineStages = stringIndexers + encoders + [featureAssembler]
pipeline = Pipeline(stages=pipelineStages)
param_grid_lr = ParamGridBuilder().addGrid(lr.regParam, [0.1,0.01]).addGrid(lr.elasticNetParam, [0,0.5,1]).build()
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr ,evaluator=BinaryClassificationEvaluator(), numFolds=3)
cross_model = crossval.fit(df_tr)
IllegalArgumentException: label does not exist. Available: event_type_str, event_team, shot_place_str, location_str, assist_method_str, situation_str, country_code, is_goal, event_type_str_idx, event_team_idx, shot_place_str_idx, location_str_idx, assist_method_str_idx, situation_str_idx, country_code_idx, event_type_str_vec, event_team_vec, shot_place_str_vec, location_str_vec, assist_method_str_vec, situation_str_vec, country_code_vec, features, CrossValidator_2fc516202d9d_rand, rawPrediction, probability, prediction
Here is how my features look (screenshot not included).
Your BinaryClassificationEvaluator expects by default that the label column is called label, as you can see from the docs: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator
You'll need to specify rawPredictionCol and labelCol according to the columns in your dataframe.
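For example, using the is_goal column from the error message (a sketch reusing lr and param_grid_lr from the question):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

# point the evaluator at the actual label column; rawPrediction is the default raw-prediction column
evaluator = BinaryClassificationEvaluator(labelCol="is_goal", rawPredictionCol="rawPrediction")
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr,
                          evaluator=evaluator, numFolds=3)
cross_model = crossval.fit(df_tr)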

Cross-Validation in apache-spark. How to create Parameter Grid?

I am trying to set up a ParamGrid for cross-validation later, but I could not find any explanation of the input arguments.
After creating a pipeline I am trying to create a parameter grid, but since I do not understand which entries are expected, I keep getting errors.
//creating my pipeline with indexer, oneHotEncoder, creating the feature vector and applying linear regression on it
val IndexedList = StringList.flatMap{ name =>
val indexer = new StringIndexer().setInputCol(name).setOutputCol(name + "Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(name+ "Index"))
.setOutputCols(Array(name + "vec"))
Array(indexer,encoder)
}
val features = new VectorAssembler().setInputCols(Array("Modellvec", "KM", "Hubraum", "Fuelvec","Farbevec","Typevec","F1","F2","F3","F4","F5","F6","F7","F8")).setOutputCol("Features2")
val linReg = new LinearRegression()//.setFeaturesCol(features2.getOutputCol).setLabelCol("Preis")
//creates the Array of stages
val IndexedList3 = (IndexedList :+ features :+ linReg).toArray[PipelineStage]
val pipeline2 = new Pipeline()
//This grid should be created in order to apply cross-validation
val pipeline_grid = new ParamGridBuilder()
.baseOn(pipeline2.stages -> IndexedList3)
.addGrid(linReg.regParam, Array(10,15,20,25,30,35,40,45,50,55,60,65,70,75) ).build()
The first part works just fine when I run it separately.
The problem is that I do not understand how the Array in "addGrid" should look (or how I should choose the values), and why it is a problem that linReg.regParam is of type DoubleParam, since addGrid IS defined on this type.
In most examples I have seen, this Array appears out of nowhere. So could someone explain to me where it comes from?
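In case it helps to see it spelled out: the Array does not come from anywhere special, it is simply the list of candidate values you choose for that hyperparameter, and the builder expands all addGrid calls into the Cartesian product of ParamMaps. A minimal sketch in PySpark syntax (in Scala the literals likely need to be Doubles, e.g. Array(10.0, 15.0), since regParam is a DoubleParam):
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder

lin_reg = LinearRegression()
grid = ParamGridBuilder()\
    .addGrid(lin_reg.regParam, [10.0, 15.0, 20.0])\
    .addGrid(lin_reg.elasticNetParam, [0.0, 0.5])\
    .build()
len(grid)  # 3 * 2 = 6 ParamMaps: one model is trained and evaluated per map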

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training & test data-set after saving. However, results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results are different despite my assumption to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (also when inspecting the extracted topics they are different as well as the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing a similar kind of problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.
I came up with the solution below after trying a multitude of things:
I saved cv_model after running it once and loaded it in the next iterations rather than re-fitting it.
This one is more related to my data set. The size of some of the documents in the corpus I was using was very small (around 3 words per document). I filtered these documents out and set a limit, such that only documents with a minimum of 15 words are included in the corpus (it may be higher in yours). I am not sure why this worked, maybe it is related to the underlying complexity of the model.
All in all, my results are now the same even after several iterations. Hope this helps.
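A rough sketch of both points, reusing the column names from the question (the save path and the 15-word cutoff are only illustrative):
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel
from pyspark.sql.functions import size

# 1) fit the CountVectorizer once, save it, and load it in later runs instead of re-fitting,
#    so the vocabulary (and feature indices) stay identical
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete",
                               outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
cv_model.save("cv_model_saved")
cv_model = CountVectorizerModel.load("cv_model_saved")

# 2) drop very short documents before building the corpus
filtered_df = tokenized_stopwords_sample_df.filter(
    size("requester_instruction_words_filtered_complete") >= 15)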

Reconstructing k-means using pre-computed cluster centres

I'm using k-means for clustering with the number of clusters set to 60. Since some of the clusters are coming out as meaningless, I've deleted those cluster centers from the cluster center array (count = 8) and saved the rest in clean_cluster_array.
This time I'm re-fitting the k-means model with init = clean_cluster_centers, n_clusters = 52 and max_iter = 1, because I want to avoid re-fitting as much as possible.
The basic idea is to recreate a new model with clean_cluster_centers. The problem here is that, since we are removing a large number of clusters, the model quickly converges to more stable centers even with max_iter = 1. Is there any way to recreate the k-means model?
If you've fitted a KMeans object, it has a cluster_centers_ attribute. You can directly update it by doing something like this:
cls.cluster_centers_ = new_cluster_centers
So if you want a new object with the clean cluster centers, just do something like the following:
from copy import deepcopy

cls = KMeans().fit(X)
cls2 = deepcopy(cls)  # scikit-learn estimators have no .copy() method, so deep-copy the fitted object
cls2.cluster_centers_ = new_cluster_centers
And now, since the predict function only checks that your object has a non-null attribute called cluster_centers_, you can use the predict function
def predict(self, X):
"""Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, `cluster_centers_` is called
the code book and each value returned by `predict` is the index of
the closest code in the code book.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
Returns
-------
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
"""
check_is_fitted(self, 'cluster_centers_')
X = self._check_test_data(X)
x_squared_norms = row_norms(X, squared=True)
return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
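As a quick usage sketch, assuming bad_cluster_ids is a (hypothetical) list of the 8 cluster indices being dropped:
import numpy as np
from copy import deepcopy
from sklearn.cluster import KMeans

cls = KMeans(n_clusters=60, random_state=0).fit(X)

# keep only the centers considered meaningful
new_cluster_centers = np.delete(cls.cluster_centers_, bad_cluster_ids, axis=0)

cls2 = deepcopy(cls)
cls2.cluster_centers_ = new_cluster_centers
labels = cls2.predict(X)  # labels now index into the rows of new_cluster_centers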

How can I access computed metrics for each fold in a CrossValidatorModel

How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics, but is it possible to get the raw results on each fold to look at, e.g., the variance of the results?
I am using Spark 2.0.0.
Studying the Spark code here:
For the folds, you can do the iteration yourself like this:
val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
//K-folding operation starting
//for each fold you have multiple models created cfm. the paramgrid
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
trainingDataset.unpersist()
var i = 0
while (i < numModels) {
val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
}
This is in Scala, but the ideas are very clearly outlined.
Take a look at this answer that outlines results per fold. Hope this helps.
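If you would rather stay in Python, MLUtils.kFold is not exposed there, but a hand-rolled loop with randomSplit gives you one metric per fold (a sketch assuming an estimator est, an evaluator and a DataFrame df; these splits will not match CrossValidator's internal folds exactly):
k = 3
folds = df.randomSplit([1.0 / k] * k, seed=42)

fold_metrics = []
for i in range(k):
    validation = folds[i]
    training = None
    for j, fold in enumerate(folds):
        if j != i:
            training = fold if training is None else training.union(fold)
    model = est.fit(training)
    fold_metrics.append(evaluator.evaluate(model.transform(validation)))

print(fold_metrics)  # one metric per fold, so you can inspect the variance yourself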
