PYSPARK: how to get weights from CrossValidatorModel?

I have trained a logistic regression model with cross-validation, using the following code from https://spark.apache.org/docs/2.1.0/ml-tuning.html
Now I want to get the weights and the intercept, but I get this error:
AttributeError: 'CrossValidatorModel' object has no attribute 'weights'
How can I get these attributes?
(The same problem occurs with trainingSummary = cvModel.summary.)
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)

A LogisticRegression model has coefficients, not weights. Other than the naming, it can be done as below:
# cvModel.bestModel is the best PipelineModel found by CrossValidator;
# its last stage is the fitted LogisticRegressionModel.
cvModel.bestModel.stages[-1].coefficients
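The intercept the question asks about lives on the same fitted stage. A short hedged follow-up, assuming a PySpark version (2.0+) where LogisticRegressionModel exposes a training summary and CrossValidatorModel exposes avgMetrics:
lrModel = cvModel.bestModel.stages[-1]   # fitted LogisticRegressionModel
print(lrModel.coefficients)
print(lrModel.intercept)
print(lrModel.extractParamMap())         # hyperparameters the winning model used
print(cvModel.avgMetrics)                # mean CV metric for each ParamMap in the grid
trainingSummary = lrModel.summary        # works here, unlike cvModel.summary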

Related

Sklearn GridSearchCV on Pipeline to test multiple transforms and estimators

I'm trying to build a GridSearchCV using Pipeline, and I want to test both transformers and estimators.
Is there a more concise way of doing so?
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', KNeighborsClassifier())
])
parameters = [{
    'imputer': (SimpleImputer(),),
    'imputer__strategy': ('median', 'mean'),
    'pca__n_components': (10, 20),
    'clf': (LogisticRegression(),),
    'clf__C': (1, 10)
}, {
    'imputer': (SimpleImputer(),),
    'imputer__strategy': ('median', 'mean'),
    'pca__n_components': (10, 20),
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}, {
    'imputer': (KNNImputer(),),
    'imputer__n_neighbors': (5, 10),
    'pca__n_components': (10, 20),
    'clf': (LogisticRegression(),),
    'clf__C': (1, 10)
}, {
    'imputer': (KNNImputer(),),
    'imputer__n_neighbors': (5, 10),
    'pca__n_components': (10, 20),
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}]
grid_search = GridSearchCV(estimator=pipeline, param_grid=parameters)
Instead of having 4 blocks of parameters, I want to declare the 2 imputation methods I want to test (with their corresponding parameters) and the 2 classifiers, without declaring pca__n_components 4 times.
When you get hyperparameters that depend on each other a fair bit, the parameter grid approach gets to be cumbersome. There are a few ways to get what you need.
Nested grid searches
GridSearchCV(
    estimator=GridSearchCV(estimator=pipeline, param_grid=imputer_grid),
    param_grid=estimator_grid,
)
For each estimator candidate, this runs a grid search over the imputer candidates; the best imputer is used for the estimator, and then the estimators-with-best-imputers are compared.
The main drawback here is that the inner search gets cloned for each estimator candidate, and so you don't get access to the cv_results_ for the non-winning estimator's imputers.
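A minimal sketch of that nesting, assuming the pipeline from the question; imputer_grid and estimator_grid are hypothetical names splitting the original grid into its imputer-only and classifier-only parts (note the outer keys are prefixed with estimator__, because the outer search's estimator is the inner GridSearchCV):
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

imputer_grid = [
    {'imputer': (SimpleImputer(),), 'imputer__strategy': ('median', 'mean')},
    {'imputer': (KNNImputer(),), 'imputer__n_neighbors': (5, 10)},
]
estimator_grid = [
    {'estimator__clf': (LogisticRegression(),), 'estimator__clf__C': (1, 10)},
    {'estimator__clf': (KNeighborsClassifier(),), 'estimator__clf__n_neighbors': (10, 25)},
]
# pca__n_components can live in either grid, e.g. appended to each imputer_grid dict.
nested = GridSearchCV(
    estimator=GridSearchCV(estimator=pipeline, param_grid=imputer_grid),
    param_grid=estimator_grid,
)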
Pythonically generate (part of) the grid
ParameterGrid, used internally by GridSearchCV, is mostly a wrapper around itertools.product. So we can use itertools ourselves to create (chunks of) the grid. E.g. we can create the list you've written, but with less repeated code:
import itertools
imputers = [{
    'imputer': (SimpleImputer(),),
    'imputer__strategy': ('median', 'mean'),
}, {
    'imputer': (KNNImputer(),),
    'imputer__n_neighbors': (5, 10),
}]
models = [{
    'clf': (LogisticRegression(),),
    'clf__C': (1, 10),
}, {
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}]
pcas = [{'pca__n_components': (10, 20)}]
parameters = [
    {**imp, **pca, **model}  # in py3.9+ the slicker notation imp | pca | model works
    for imp, pca, model in itertools.product(imputers, pcas, models)
]  # this should give the same as your hard-coded list-of-dicts
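To close the loop, a hedged usage sketch matching the question's setup (X_train and y_train stand in for whatever training data you hold; they are not defined in the question):
grid_search = GridSearchCV(estimator=pipeline, param_grid=parameters)
grid_search.fit(X_train, y_train)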

Does SparkML Cross Validation Only Work With a "label" Column?

When I run the cross-validation example with a dataset whose label lives in a column not named "label", I get an IllegalArgumentException on Spark 3.1.1. Why?
The code below has been modified to rename the "label" column to "target", and labelCol has been set to "target" for the regression model. This code causes the exception, while leaving everything at "label" works fine.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "target"])  # try switching between "target" and "label"
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, labelCol="target")  # try switching between "target" and "label"
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)
cvModel = crossval.fit(training)
Is that in any way expected behaviour?
You need to provide the label column to BinaryClassificationEvaluator too. So if you replace the line
evaluator=BinaryClassificationEvaluator(),
with
evaluator=BinaryClassificationEvaluator(labelCol="target"),
it should work fine.
You can find the usage in the docs.
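For clarity, the corrected construction in full; a minimal sketch, assuming the rest of the question's code stays unchanged:
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(labelCol="target"),
                          numFolds=2)
cvModel = crossval.fit(training)  # no IllegalArgumentException now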

How should I adjust the hidden layers for my Neural Network while working on MNIST's digit recognition set?

I'm just starting to learn about ML and neural networks. I am working in Jupyter notebooks with scikit-learn, and I want to create a neural network for handwritten digit recognition.
This is what I have so far in my notebook:
import pandas as pd
import numpy as np
filepath_train = "practice/mnist_train.csv"
filepath_test = "practice/mnist_test.csv"
train_set = pd.read_csv(filepath_train)  # size is 60 000
test_set = pd.read_csv(filepath_test)    # size is 10 000
x_train = train_set.drop("label", axis=1)
y_train = train_set["label"]
x_test = test_set.drop("label", axis=1)
y_test = test_set["label"]
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(150, 100), activation='logistic', alpha=0.1,
                    solver='sgd', tol=1e-4, random_state=1,
                    learning_rate_init=.1, verbose=True)
mlp.fit(x_train, y_train)  # note: was clf.fit, but the classifier is named mlp
I have to admit, I chose both the number of hidden layers/neurons and the activation function rather randomly, through trial and error, partly based on what I saw in other notebooks. Either way, whatever I do, the loss keeps getting stuck around 0.7-0.8:
[plot: loss values per training iteration]
These layers gave me above 90% accuracy:
# imports assuming tf.keras (standalone Keras imports work the same way)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

CNN_model = Sequential()
CNN_model.add(Conv2D(filters=32, kernel_size=(5, 5), strides=(1, 1), activation='relu', input_shape=(28, 28, 1)))
CNN_model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
CNN_model.add(Conv2D(filters=64, kernel_size=(5, 5), activation='relu'))
CNN_model.add(MaxPooling2D(pool_size=(2, 2)))
CNN_model.add(Flatten())
CNN_model.add(Dense(units=1000, activation='relu'))
CNN_model.add(Dense(units=y_train.shape[1], activation='softmax'))  # assumes y_train is one-hot encoded
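The snippet above stops at the architecture. A hedged sketch of the remaining steps, assuming the flat CSV pixels are reshaped to (28, 28, 1) and the labels one-hot encoded (the Dense(units=y_train.shape[1]) line implies the latter):
from tensorflow.keras.utils import to_categorical

# assumed preprocessing, not shown in the answer:
x_train_img = x_train.to_numpy().reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train_oh = to_categorical(y_train, num_classes=10)

CNN_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
CNN_model.fit(x_train_img, y_train_oh, epochs=5, batch_size=128, validation_split=0.1)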

Using Pipeline with GridSearchCV

Suppose I have this Pipeline object:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('my_transform', my_transform()),
    ('estimator', SVC())
])
To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:
pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('rbf',)
}
Then, I could use GridSearchCV:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)
We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?
For example, In a simple GridSearch (without Pipeline) I could do:
param_grid = [
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100, 1000],
     'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'degree': [2, 3],
     'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)
Therefore, I need a working version of this sort of code:
pipe_parameters = {
    'bag_of_words__max_features': (None, 1500),
    'estimator__kernel': ('rbf',),
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('linear',),  # duplicate key: this single-dict form cannot work
    'estimator__C': (0.1, 1),
}
Meaning that I want to use as hyperparameters the following combinations:
kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1
You are almost there. Similar to how you created multiple dictionaries for the SVC model, create a list of dictionaries for the pipeline.
Try this example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
pipe = Pipeline([
    ('bag_of_words', CountVectorizer()),
    ('estimator', SVC())])
pipe_parameters = [
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1],
     'estimator__gamma': [0.0001, 1],
     'estimator__kernel': ['rbf']},
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1, 1],
     'estimator__kernel': ['linear']}
]
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters, cv=2)
grid.fit(data_train.data, data_train.target)
grid.best_params_
# {'bag_of_words__max_features': None,
#  'estimator__C': 0.1,
#  'estimator__kernel': 'linear'}
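As a short follow-up sketch (assuming the default refit=True), the grid also exposes the winning pipeline refitted on the full training data:
best_pipe = grid.best_estimator_  # Pipeline with the winning kernel and max_features
predictions = best_pipe.predict(data_train.data)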

Spark MLlib: building classifiers for each data group

I have labeled vectors (LabeledPoint-s) tagged with a group number. For every group I need to create a separate logistic regression classifier:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object Scratch {
  val train = Seq(
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((1, 1.5), (2, 4.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 2.0), (1, 1.0), (2, 3.5))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 3.0), (2, 7.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (1, 3.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.5), (2, 4.0)))))
  )
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("Scratch")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val trainRDD = sc.parallelize(train)
    val modelByGroup = trainRDD.groupByKey().map({ case (group, iter) =>
      (group, new LogisticRegressionWithLBFGS().run(iter)) })
  }
}
LogisticRegressionWithLBFGS().run(iter) does not compile because run expects an RDD, not the iterable that groupByKey returns.
Please advise how to build as many classifiers as there are groups (tags) in the input data.
Update - demonstrates that nested RDD iteration does not work:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vector, Vectors}
object Scratch {
  val train = Seq(
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((1, 1.5), (2, 4.0))))),
    (1, LabeledPoint(0, Vectors.sparse(3, Seq((0, 2.0), (1, 1.0), (2, 3.5))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 3.0), (2, 7.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.0), (1, 3.0))))),
    (8, LabeledPoint(0, Vectors.sparse(3, Seq((0, 1.5), (2, 4.0)))))
  )
  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("Scratch")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val trainRDD = sc.parallelize(train)
    val keys: RDD[Int] = trainRDD.map({ case (key, _) => key }).distinct
    for (key <- keys) {
      // key is Int here!
      // Get train data for the current group (key):
      val groupTrain = trainRDD.filter({ case (x, _) => x == key }).cache()
      /**
       * Which results in org.apache.spark.SparkException:
       * RDD transformations and actions can only be invoked by the driver,
       * not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid
       * because the values transformation and count action cannot be performed inside of the rdd1.map transformation.
       * For more information, see SPARK-5063. at org.apache.spark.rdd.RDD.sc(RDD.scala:87)
       */
    }
  }
}
Looks like there is no way to use transformations inside other transformations, correct?
If you're training a classifier on each group, you don't need MLlib. MLlib is designed for distributed datasets, and yours are not: you have a bunch of local datasets, one per group. You can just use a local machine learning library like Weka on each group in a map function; a sketch of that idea follows below.
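A minimal sketch of the map-function idea, swapped into PySpark with scikit-learn standing in for Weka (an assumption on my part; the question's code is Scala/MLlib, and rdd here is an assumed RDD of (group, (features, label)) pairs mirroring the question's layout):
from sklearn.linear_model import LogisticRegression
import numpy as np

def train_local(group_and_points):
    # Each group's points fit in memory on one worker, so a local library is fine.
    group, points = group_and_points
    X = np.array([features for features, label in points])
    y = np.array([label for features, label in points])
    return group, LogisticRegression().fit(X, y)

# One fitted local model per group, collected back to the driver.
models = rdd.groupByKey().map(train_local).collectAsMap()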
EDIT:
val keys = wholeRDD.map(_._1).distinct.collect()
var models = List.empty[LogisticRegressionModel]
for (key <- keys) {
  val valuesForKey = wholeRDD.filter(_._1 == key)
  // train model
  ...
  models = model :: models
}
