Cross-validation using PySpark - apache-spark

I'm trying to use cross-validation with Spark, but it throws an error:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

gbtClassifier = GBTClassifier(featuresCol="features", labelCol="is_goal")
lr = LogisticRegression(featuresCol="features", labelCol="is_goal")
pipelineStages = stringIndexers + encoders + [featureAssembler]
pipeline = Pipeline(stages=pipelineStages)
param_grid_lr = ParamGridBuilder().addGrid(lr.regParam, [0.1,0.01]).addGrid(lr.elasticNetParam, [0,0.5,1]).build()
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr, evaluator=BinaryClassificationEvaluator(), numFolds=3)
cross_model = crossval.fit(df_tr)
IllegalArgumentException: label does not exist. Available: event_type_str, event_team, shot_place_str, location_str, assist_method_str, situation_str, country_code, is_goal, event_type_str_idx, event_team_idx, shot_place_str_idx, location_str_idx, assist_method_str_idx, situation_str_idx, country_code_idx, event_type_str_vec, event_team_vec, shot_place_str_vec, location_str_vec, assist_method_str_vec, situation_str_vec, country_code_vec, features, CrossValidator_2fc516202d9d_rand, rawPrediction, probability, prediction
[screenshot showing how my features look]

By default, your BinaryClassificationEvaluator expects the label column to be called label, as you can see from the docs: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator
You'll need to specify rawPredictionCol and labelCol according to the columns in your dataframe.
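For example, a minimal sketch using the column names from the error message (assuming is_goal is your label and rawPrediction is the classifier's default raw-prediction output):
evaluator = BinaryClassificationEvaluator(labelCol="is_goal", rawPredictionCol="rawPrediction")
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr, evaluator=evaluator, numFolds=3)
cross_model = crossval.fit(df_tr)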

Related

Cross-Validation in apache-spark. How to create Parameter Grid?

I am trying to set up a ParamGrid for cross-validation later on, but I could not find any explanation about the input arguments.
After creating a pipeline, I am trying to create a parameter grid, but since I do not understand which entries are expected, I keep getting errors.
//creating my pipeline with indexer, oneHotEncoder, creating the feature vector and applying linear regression on it
val IndexedList = StringList.flatMap{ name =>
val indexer = new StringIndexer().setInputCol(name).setOutputCol(name + "Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(name+ "Index"))
.setOutputCols(Array(name + "vec"))
Array(indexer,encoder)
}
val features = new VectorAssembler().setInputCols(Array("Modellvec", "KM", "Hubraum", "Fuelvec","Farbevec","Typevec","F1","F2","F3","F4","F5","F6","F7","F8")).setOutputCol("Features2")
val linReg = new LinearRegression()//.setFeaturesCol(features2.getOutputCol).setLabelCol("Preis")
//creates the Array of stages
val IndexedList3 = (IndexedList :+ features :+ linReg).toArray[PipelineStage]
val pipeline2 = new Pipeline()
//This grid should be created in order to apply cross-validation
val pipeline_grid = new ParamGridBuilder()
.baseOn(pipeline2.stages -> IndexedList3)
.addGrid(linReg.regParam, Array(10,15,20,25,30,35,40,45,50,55,60,65,70,75) ).build()
The first part works just fine when I run it separately.
The problem is that I do not understand what the Array in "addGrid" should look like (or how I should choose the values), and why it is a problem that linReg.regParam is of type DoubleParam, since addGrid IS defined for that type.
In most examples I have seen, this Array appears out of nowhere. So could someone explain to me where it comes from?
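The array passed to addGrid is just the list of candidate values you want the grid search to try for that parameter; the builder expands them into one ParamMap per combination. A minimal PySpark sketch of the same idea, reusing the column names from the question (Features2 as features, Preis as label):
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder

lin_reg = LinearRegression(featuresCol="Features2", labelCol="Preis")
param_grid = ParamGridBuilder() \
    .addGrid(lin_reg.regParam, [0.01, 0.1, 1.0]) \
    .addGrid(lin_reg.elasticNetParam, [0.0, 0.5]) \
    .build()
# param_grid is now a list of 3 * 2 = 6 ParamMaps, one per parameter combination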

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from the pyspark ml-clustering package and apply the model to the training and test dataset after saving. However, the results diverge despite setting a seed. My code is the following:
1) Import packages
from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id
2) Preparing the dataset
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus = result_tfidf.select("id", "features")
3) Training the LDA model
lda = LDA(k=number_of_topics, maxIter=100, docConcentration = [alpha], topicConcentration = beta, seed = 123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)
4) Replicate the model
#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")
#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)
The following results are different, although I expected them to be equal:
topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new (when inspecting the extracted topics they are different, as are the predicted classes on the dataset)
So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed in the model generation. Can someone with experience in replicating LDA models help?
Thank you :)
I was facing a similar kind of problem while implementing LDA in PySpark. Even though I was using a seed, every time I re-ran the code on the same data with the same parameters, the results were different.
I came up with the solution below after trying a multitude of things:
Saved cv_model after running it once and loaded it in subsequent runs rather than re-fitting it (see the sketch below).
This one is more related to my data set: the size of some of the documents in the corpus I was using was very small (around 3 words per document). I filtered these documents out and set a limit such that only documents with at least 15 words (the threshold may need to be higher in your case) are included in the corpus. I am not sure why this worked; maybe it is something related to the underlying complexity of the model.
All in all, my results are now the same even after several iterations. Hope this helps.
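A minimal sketch of that first point (persisting the fitted vectorizer and IDF model once and reloading them instead of re-fitting), reusing the variable names from the question:
from pyspark.ml.feature import CountVectorizerModel, IDFModel

# first run: persist the fitted feature models next to the LDA model
cv_model.save("cv_model_saved")
idfModel.save("idf_model_saved")

# later runs: load instead of re-fitting, so the feature pipeline is identical
cv_model = CountVectorizerModel.load("cv_model_saved")
idfModel = IDFModel.load("idf_model_saved")
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
result_tfidf = idfModel.transform(result_tf)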

SparkML: Pipeline predictions have fewer records than the input

How can I find out -- inside a pipeline -- which records are skipped or dropped from the transformation?
I have a pipeline which is like the following:
StringIndexer
OneHotEncoderEstimator
(repeat above for all categorical cols)
VectorAssembler (collecting all encoded and raw numeric cols)
LogisticRegression
Then:
model = pipeline.fit(train)
predicted = model.transform(test)
test.count()
8092
predicted.count()
8091
One record is missing and I'd like to find out which one.
Thanks.
The handleInvalid option of your StringIndexer is likely set to skip.
You can change this option to error, and the transform will then fail on labels it has never seen. As of Spark 2.2 you can also use the keep option to put rows with unknown labels into a separate bucket:
string_indexer = StringIndexer(inputCol="label", outputCol="indexed", handleInvalid='keep')
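If you also want to see which record was skipped, one option (a sketch that assumes you can add a unique id column, here a hypothetical row_id, to the test set before transforming) is an anti-join between the input and the predictions:
from pyspark.sql import functions as F

test_with_id = test.withColumn("row_id", F.monotonically_increasing_id())  # hypothetical id column
predicted = model.transform(test_with_id)

# rows present in the input but absent from the predictions
dropped = test_with_id.join(predicted.select("row_id"), on="row_id", how="left_anti")
dropped.show()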

CrossValidation/TrainValidationSplit with multiple pipelines in PySpark

I'm trying to evaluate multiple pipelines in PySpark. I'm able to do it with a separate CV/TVS for each one, but I would like to do it in just one, so that it gives me the best model directly, and I can't figure out how to make it work.
lr_assembler and assembler are two instances of VectorAssembler (with different feature selections).
pca, lr, rf and gbt are instances of PCA, LinearRegression, RandomForestRegressor and GBTRegressor.
Pipelines definition:
pipeline = Pipeline()
lr_stages = [lr_assembler, pca, lr]
rf_stages = [assembler, rf]
gbt_stages = [assembler, gbt]
lr_pipeline = Pipeline(stages=lr_stages)
rf_pipeline = Pipeline(stages=rf_stages)
gbt_pipeline = Pipeline(stages=gbt_stages)
paramMaps definition:
lr_grid = ParamGridBuilder().baseOn({pipeline.stages:lr_stages})\
.addGrid(pca.k, [2, 5, 7])\
.build()
rf_grid = ParamGridBuilder().baseOn({pipeline.stages:rf_stages})\
.addGrid(rf.maxDepth, [5, 10])\
.addGrid(rf.featureSubsetStrategy, ['3', '6'])\
.build()
gbt_grid = ParamGridBuilder().baseOn({pipeline.stages:gbt_stages})\
.addGrid(gbt.maxDepth, [5, 10])\
.addGrid(gbt.maxIter, [50, 100])\
.build()
grid = lr_grid + rf_grid + gbt_grid
TrainValidationSplit definition:
tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=grid, evaluator=rmse_evaluator, trainRatio=0.8, parallelism=3, seed=7)
Model training:
model = tvs.fit(train_val)
And after running that last line, this is the error I get (not sure if I should post the whole thing here):
KeyError: Param(parent='Pipeline_40f78ef0cee04a4ebc61', name='stages', doc='a list of pipeline stages')
Thanks for your time.
I had the same issue, which I resolved by initializing the Pipeline stages.
pipeline = Pipeline(stages=[]) # Must initialize with empty list!
There's a good example of this approach here:
https://github.com/dsharpc/dsharpc.github.io/blob/master/SparkMLFlights/README.md
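Applied to the code in the question, the fix is only in how the shared pipeline is created before the grids are built against it (a sketch reusing the names defined above):
pipeline = Pipeline(stages=[])  # initialize stages with an empty list; baseOn() below fills them in
lr_grid = ParamGridBuilder().baseOn({pipeline.stages: lr_stages})\
    .addGrid(pca.k, [2, 5, 7])\
    .build()
# build rf_grid and gbt_grid the same way against this pipeline, then:
grid = lr_grid + rf_grid + gbt_grid
tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=grid,
                           evaluator=rmse_evaluator, trainRatio=0.8, parallelism=3, seed=7)
model = tvs.fit(train_val)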

Spark: Dimensions mismatch error with RDD[LabeledPoint] union

I would ideally like to do the following:
In essence, for my dataset, which is an RDD[LabeledPoint], I want to control the ratio of positive and negative labels.
val training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "training_data.tsv")
This dataset has both cases and controls included in it. I want to control the ratio of cases to controls, because my dataset is skewed. So I want to do something like sampling training_data such that the ratio of cases to controls is 1:2 (instead of, say, 1:500).
I was not able to do that, so I separated the training data into cases and controls as below, and then tried to combine them later using the union operator, which gave me the dimensions mismatch error.
I have two datasets (both in Libsvm format):
val positives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positives.tsv")
val negatives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negatives.tsv")
I want to combine these two to form training data. Note both are in libsvm format.
training = positives.union(negatives)
When I use the above training dataset in model building (such as logistic regression), I get an error, since positives and negatives can have a different number of columns/dimensions. The error is: "Dimensions mismatch when merging with another summarizer". Any idea how to handle that?
In addition, I also want to do sampling, such as:
positives_subset = positives.sample()
I was able to solve this in the following way:
def create_subset(training: RDD[LabeledPoint], target_label: Double, sampling_ratio: Double): RDD[LabeledPoint] = {
  val training_filtered = training.filter { case LabeledPoint(label, features) => label == target_label }
  val training_subset = training_filtered.sample(false, sampling_ratio)
  training_subset
}
Then calling the above method as:
val positives = create_subset(training, 1.0, 1.0)
val negatives_sampled = create_subset(training, 0.0, sampling_ratio)
Then you can take the union as:
val training_subset_double = positives.union(negatives_sampled)
and then I was able to use the training_subset_double for model building.
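Worth noting: the "Dimensions mismatch" usually comes from loadLibSVMFile inferring the feature-vector size from each file separately, so positives and negatives end up with different dimensions. Loading both files with the same explicit numFeatures avoids this. A minimal PySpark sketch of that idea (the feature count is a placeholder you would replace with your real dimensionality):
from pyspark.mllib.util import MLUtils

num_features = 500  # hypothetical: the true size of your feature space
positives = MLUtils.loadLibSVMFile(sc, "positives.tsv", numFeatures=num_features)
negatives = MLUtils.loadLibSVMFile(sc, "negatives.tsv", numFeatures=num_features)
training = positives.union(negatives)  # both RDDs now contain vectors of the same size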
