Cross-Validation in Apache Spark: how to create a parameter grid?

I am trying to set up a ParamGrid to use for cross-validation later, but I could not find any explanation of the input arguments.
After creating a pipeline I try to create a parameter grid, but since I do not understand which entries are expected, I keep getting errors.
//creating my pipeline with indexer, oneHotEncoder, creating the feature vector and applying linear regression on it
val IndexedList = StringList.flatMap{ name =>
val indexer = new StringIndexer().setInputCol(name).setOutputCol(name + "Index")
val encoder = new OneHotEncoderEstimator()
.setInputCols(Array(name+ "Index"))
.setOutputCols(Array(name + "vec"))
Array(indexer,encoder)
}
val features = new VectorAssembler().setInputCols(Array("Modellvec", "KM", "Hubraum", "Fuelvec","Farbevec","Typevec","F1","F2","F3","F4","F5","F6","F7","F8")).setOutputCol("Features2")
val linReg = new LinearRegression()//.setFeaturesCol(features2.getOutputCol).setLabelCol("Preis")
//creates the Array of stages
val IndexedList3 = (IndexedList :+ features :+ linReg).toArray[PipelineStage]
val pipeline2 = new Pipeline()
//This grid should be created in order to apply cross-validation
val pipeline_grid = new ParamGridBuilder()
.baseOn(pipeline2.stages -> IndexedList3)
.addGrid(linReg.regParam, Array(10,15,20,25,30,35,40,45,50,55,60,65,70,75) ).build()
The first part works just fine when I run it separately.
The problem is that I do not understand what the Array in "addGrid" should look like (or how I should choose the values), or why it is a problem that linReg.regParam is of type DoubleParam, since addGrid IS defined for this type.
In most examples I have seen, this Array appears out of nowhere. Could someone explain to me where it comes from?
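For what it is worth, the array passed to addGrid is not derived from anything: it is just the list of candidate values you decide to try for that hyperparameter, and build() returns one parameter map per combination of candidates. The sketch below mimics that idea in plain Python rather than calling Spark; `build_param_grid` and the dictionary layout are hypothetical stand-ins, not part of the Spark API. The candidates are written as floats on purpose: regParam is a DoubleParam, so in Scala the values presumably need to be Doubles (for example Array(0.01, 0.1, 1.0)), whereas Array(10, 15, ...) is inferred as Array[Int].

```python
from itertools import product


def build_param_grid(candidates):
    """Return one dict per combination of candidate values (Cartesian product)."""
    names = list(candidates)
    value_lists = [candidates[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Candidate values are chosen by the user; nothing computes them automatically.
grid = build_param_grid({
    "regParam": [0.01, 0.1, 1.0],
    "elasticNetParam": [0.0, 0.5],
})

for params in grid:
    print(params)
```

Cross-validation then fits and scores one model per entry of the grid, so the grid size multiplies the training cost.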

Related

Cross validation using Pyspark

I'm trying to use cross-validation in Spark, but it throws an error:
gbtClassifier = GBTClassifier(featuresCol= "features", labelCol="is_goal")
lr = LogisticRegression(featuresCol= "features" ,labelCol="is_goal")
pipelineStages = stringIndexers + encoders + [featureAssembler]
pipeline = Pipeline(stages=pipelineStages)
param_grid_lr = ParamGridBuilder().addGrid(lr.regParam, [0.1,0.01]).addGrid(lr.elasticNetParam, [0,0.5,1]).build()
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr ,evaluator=BinaryClassificationEvaluator(), numFolds=3)
cross_model = crossval.fit(df_tr)
IllegalArgumentException: label does not exist. Available: event_type_str, event_team, shot_place_str, location_str, assist_method_str, situation_str, country_code, is_goal, event_type_str_idx, event_team_idx, shot_place_str_idx, location_str_idx, assist_method_str_idx, situation_str_idx, country_code_idx, event_type_str_vec, event_team_vec, shot_place_str_vec, location_str_vec, assist_method_str_vec, situation_str_vec, country_code_vec, features, CrossValidator_2fc516202d9d_rand, rawPrediction, probability, prediction
By default, your BinaryClassificationEvaluator expects the label column to be called label, as you can see from the docs: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator .
You'll need to set rawPredictionCol and labelCol to match the columns in your dataframe (here the label column is is_goal).

CrossValidation/TrainValidationSplit with multiple pipelines in PySpark

I'm trying to evaluate multiple pipelines in PySpark. I'm able to do it with a separate CV/TVS for each one, but I would like to do it in just one so that it gives me the best model directly, and I can't figure out how to make it work.
lr_assembler and assembler are two instances of VectorAssembler (different feature selections).
pca, lr, rf and gbt are instances of PCA, LinearRegression, RandomForestRegressor and GBTRegressor.
Pipelines definition:
pipeline = Pipeline()
lr_stages = [lr_assembler, pca, lr]
rf_stages = [assembler, rf]
gbt_stages = [assembler, gbt]
lr_pipeline = Pipeline(stages=lr_stages)
rf_pipeline = Pipeline(stages=rf_stages)
gbt_pipeline = Pipeline(stages=gbt_stages)
paramMaps definition:
lr_grid = ParamGridBuilder().baseOn({pipeline.stages:lr_stages})\
.addGrid(pca.k, [2, 5, 7])\
.build()
rf_grid = ParamGridBuilder().baseOn({pipeline.stages:rf_stages})\
.addGrid(rf.maxDepth, [5, 10])\
.addGrid(rf.featureSubsetStrategy, ['3', '6'])\
.build()
gbt_grid = ParamGridBuilder().baseOn({pipeline.stages:gbt_stages})\
.addGrid(gbt.maxDepth, [5, 10])\
.addGrid(gbt.maxIter, [50, 100])\
.build()
grid = lr_grid + rf_grid + gbt_grid
TrainValidationSplit definition:
tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=grid, evaluator=rmse_evaluator, trainRatio=0.8, parallelism=3, seed=7)
Model training:
model = tvs.fit(train_val)
And after running that last line, this is the error I get (not sure if I should post the whole thing here):
KeyError: Param(parent='Pipeline_40f78ef0cee04a4ebc61', name='stages', doc='a list of pipeline stages')
Thanks for your time.
I had the same issue, which I resolved by initializing the Pipeline stages.
pipeline = Pipeline(stages=[]) # Must initialize with empty list!
There's a good example of this approach here:
https://github.com/dsharpc/dsharpc.github.io/blob/master/SparkMLFlights/README.md

Spark MLlib Association Rules confidence is greater than 1.0

I was using Spark 2.0.2 to extract association rules from some data, and in the result I found some strange rules, such as the following:
【[MUJI,ROEM,西单科技广场] => Bauhaus ] 2.0
"2.0" is the confidence printed for the rule. Isn't confidence the probability of the consequent given the antecedent, so it should be at most 1.0?
KEYWORD: transactions != freqItemsets
SOLUTION: Use spark.mllib's FPGrowth instead; it accepts an RDD of transactions and calculates the frequent itemsets automatically.
Hello, I found it. This happens because my input FreqItemset data, freqItemsets, was wrong. Let's go into detail. Take three original transactions, ("a"), ("a","b","c") and ("a","b","d"), each occurring once.
At the beginning, I thought Spark would calculate the sub-itemset frequencies automatically, so that all I needed to do was create freqItemsets like this (as the official example shows):
val freqItemsets = sc.parallelize(Seq(
new FreqItemset(Array("a"), 1),
new FreqItemset(Array("a","b","d"), 1),
new FreqItemset(Array("a", "b","c"), 1)
))
This is where it goes wrong: AssociationRules takes FreqItemsets as input, not transactions, and I had confused the two.
Given the three transactions, the freqItemsets should be
new FreqItemset(Array("a"), 3),//because "a" appears three times in three transactions
new FreqItemset(Array("b"), 2),//"b" appears two times
new FreqItemset(Array("c"), 1),
new FreqItemset(Array("d"), 1),
new FreqItemset(Array("a","b"), 2),// "a" and "b" totally appears two times
new FreqItemset(Array("a","c"), 1),
new FreqItemset(Array("a","d"), 1),
new FreqItemset(Array("b","d"), 1),
new FreqItemset(Array("b","c"), 1),
new FreqItemset(Array("a","b","d"), 1),
new FreqItemset(Array("a", "b","c"), 1)
You can do this counting yourself with the following code:
val transactions = sc.parallelize(
Seq(
Array("a"),
Array("a","b","c"),
Array("a","b","d")
))
val freqItemsets = transactions
.map(arr => {
(for (i <- 1 to arr.length) yield {
arr.combinations(i).toArray
})
.toArray
.flatten
})
.flatMap(l => l)
.map(a => (Json.toJson(a.sorted).toString(), 1))
.reduceByKey(_ + _)
.map(m => new FreqItemset(Json.parse(m._1).as[Array[String]], m._2.toLong))
//then use freqItemsets like the example code
val ar = new AssociationRules()
.setMinConfidence(0.8)
val results = ar.run(freqItemsets)
//....
More simply, we can use FPGrowth instead of AssociationRules; it accepts an RDD of transactions.
val fpg = new FPGrowth()
.setMinSupport(0.2)
.setNumPartitions(10)
val model = fpg.run(transactions) //transactions is defined in the previous code
That's all.
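To make the arithmetic concrete, here is a small plain-Python sketch (not Spark code) of the same counting, using confidence(X => Y) = freq(X ∪ Y) / freq(X). The helper names `count_itemsets` and `confidence` are made up for this illustration. It also shows how inconsistent input counts, where a superset is claimed to be more frequent than its subset, produce a confidence above 1.0.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions):
    """Count, for every non-empty sub-itemset, how many transactions contain it."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, len(unique) + 1):
            for subset in combinations(unique, size):
                counts[frozenset(subset)] += 1
    return counts


def confidence(counts, antecedent, consequent):
    """confidence(X => Y) = freq(X union Y) / freq(X)."""
    x = frozenset(antecedent)
    return counts[x | frozenset(consequent)] / counts[x]


transactions = [["a"], ["a", "b", "c"], ["a", "b", "d"]]
counts = count_itemsets(transactions)

# 2 of the 3 transactions containing "a" also contain "b".
print(confidence(counts, ["a"], ["b"]))

# Inconsistent counts: {"a", "b"} is claimed to occur more often than {"a"}.
bad_counts = {frozenset(["a"]): 1, frozenset(["a", "b"]): 2}
print(confidence(bad_counts, ["a"], ["b"]))  # 2.0
```

With consistent counts a superset can never be more frequent than any of its subsets, so the ratio stays at or below 1.0.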

How can I access computed metrics for each fold in a CrossValidatorModel

How can I get the computed metrics for each fold from a CrossValidatorModel in spark.ml? I know I can get the average metrics using model.avgMetrics, but is it possible to get the raw results for each fold, e.g. to look at the variance of the results?
I am using Spark 2.0.0.
Studying the Spark source code for CrossValidator shows how the folds are evaluated. You can do the iteration over the folds yourself like this:
val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed))
// The k-fold loop starts here.
// For each fold, one model is fitted per entry of the param grid.
splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
val trainingDataset = sparkSession.createDataFrame(training, schema).cache()
val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
trainingDataset.unpersist()
var i = 0
while (i < numModels) {
val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
metrics(i) += metric
i += 1
}
validationDataset.unpersist()
}
This is in Scala, but the idea is clearly outlined: the per-fold metric is available inside the loop before it is added to the running total in metrics(i).
Take a look at this answer that outlines results per fold. Hope this helps.
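The same idea can be sketched without Spark: run the fold loop yourself and keep each fold's metric in a list instead of only accumulating a sum. The plain-Python example below is an illustration under simplifying assumptions, not the CrossValidator implementation: `k_fold_indices` and `per_fold_rmse` are hypothetical helpers, the folds are dealt out from a shuffled index list, and a mean predictor stands in for the fitted estimator.

```python
import random
import statistics


def k_fold_indices(n, num_folds, seed):
    """Shuffle indices 0..n-1 and deal them into num_folds validation folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::num_folds] for i in range(num_folds)]


def per_fold_rmse(labels, num_folds=3, seed=7):
    """Fit a mean predictor on each training split; return one RMSE per fold."""
    metrics = []
    for validation in k_fold_indices(len(labels), num_folds, seed):
        held_out = set(validation)
        training = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = statistics.fmean(training)  # stand-in for fitting a model
        squared_errors = [(labels[i] - prediction) ** 2 for i in validation]
        metrics.append(statistics.fmean(squared_errors) ** 0.5)
    return metrics


labels = [float(x) for x in range(12)]
fold_metrics = per_fold_rmse(labels)

print(fold_metrics)                       # one metric per fold
print(statistics.fmean(fold_metrics))     # what an average metric would report
print(statistics.variance(fold_metrics))  # spread across folds
```

Keeping the list rather than the sum is the only change needed to inspect variance across folds.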

Spark: Dimensions mismatch error with RDD[LabeledPoint] union

Ideally, I would like to do the following:
For my dataset, which is an RDD[LabeledPoint], I want to control the ratio of positive to negative labels.
val training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "training_data.tsv")
This dataset includes both cases and controls, and it is skewed. I want to sample training_data so that the ratio of cases to controls is 1:2 (instead of, say, 1:500).
I was not able to do that, so I separated the training data into cases and controls as shown below and then tried to combine them with the union operator, which gave me the "Dimensions mismatch" error.
I have two datasets (both in Libsvm format):
val positives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positives.tsv")
val negatives: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negatives.tsv")
I want to combine these two to form the training data. Note that both are in libsvm format.
training = positives.union(negatives)
When I use the above training dataset for model building (such as logistic regression), I get an error, since positives and negatives can have different numbers of columns/dimensions. The error is: "Dimensions mismatch when merging with another summarizer". Any idea how to handle that?
In addition, I also want to do sampling such as
positives_subset = positives.sample()
I was able to solve this in the following way: load everything from a single file, so that all rows share one feature dimension, then filter by label and sample:
// Keep only the rows with the given label, then sample them without replacement.
def create_subset(training: RDD[LabeledPoint], target_label: Double, sampling_ratio: Double): RDD[LabeledPoint] = {
val training_filtered = training.filter { case LabeledPoint(label, _) => label == target_label }
training_filtered.sample(withReplacement = false, fraction = sampling_ratio)
}
Then call the above method as:
val positives = create_subset(training_data, 1.0, 1.0)
val negatives_sampled = create_subset(training_data, 0.0, sampling_ratio)
Then you can take the union as:
val training_subset_double = positives.union(negatives_sampled)
and then I was able to use training_subset_double for model building.
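The answer leaves open how to pick sampling_ratio for a target such as 1:2. One way to reason about it, sketched here in plain Python rather than Spark: keep all positives and keep each negative with probability (negatives wanted per positive × number of positives) / number of negatives. The helper `sampling_ratio_for`, the toy data and the seed are assumptions made for this illustration. Like RDD.sample without replacement, this keeps each row independently, so the resulting ratio is approximate rather than exact.

```python
import random


def sampling_ratio_for(num_pos, num_neg, neg_per_pos):
    """Fraction of negatives to keep so negatives are about neg_per_pos * positives."""
    return min(1.0, neg_per_pos * num_pos / num_neg)


def create_subset(rows, target_label, ratio, seed=42):
    """Keep each row with the target label independently with probability `ratio`."""
    rng = random.Random(seed)
    return [row for row in rows if row[0] == target_label and rng.random() < ratio]


# Toy data: (label, features) pairs with a 1:500 skew.
rows = [(1.0, [1.0])] * 10 + [(0.0, [0.0])] * 5000

ratio = sampling_ratio_for(num_pos=10, num_neg=5000, neg_per_pos=2)
positives = create_subset(rows, 1.0, 1.0)
negatives_sampled = create_subset(rows, 0.0, ratio)
training_subset = positives + negatives_sampled

print(ratio, len(positives), len(negatives_sampled))
```

In Spark the two counts would come from something like a count() on each filtered RDD before choosing the fraction.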
