Sklearn GridSearchCV on Pipeline to test multiple transforms and estimators - scikit-learn

I'm trying to build a GridSearchCV using Pipeline, and I want to test both transformers and estimators.
Is there a more concise way of doing so?
pipeline = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler()),
('pca', PCA()),
('clf', KNeighborsClassifier())
])
parameters = [{
'imputer': (SimpleImputer(), ),
'imputer__strategy': ('median', 'mean'),
'pca__n_components': (10, 20),
'clf': (LogisticRegression(),),
'clf__C': (1,10)
}, {
'imputer': (SimpleImputer(), ),
'imputer__strategy': ('median', 'mean'),
'pca__n_components': (10, 20),
'clf': (KNeighborsClassifier(),),
'clf__n_neighbors': (10, 25),
}, {
'imputer': (KNNImputer(), ),
'imputer__n_neighbors': (5, 10),
'pca__n_components': (10, 20),
'clf': (LogisticRegression(),),
'clf__C': (1,10)
}, {
'imputer': (KNNImputer(), ),
'imputer__n_neighbors': (5, 10),
'pca__n_components': (10, 20),
'clf': (KNeighborsClassifier(),),
'clf__n_neighbors': (10, 25),
}]
grid_search = GridSearchCV(estimator=pipeline, param_grid=parameters)
Instead of having 4 blocks of parameters, I want to declare the 2 imputation methods I want to test with their corresponding parameters, and the 2 classifiers, without declaring pca__n_components 4 times.

When you get hyperparameters that depend on each other a fair bit, the parameter grid approach gets to be cumbersome. There are a few ways to get what you need.
Nested grid searches
GridSearchCV(
estimator=GridSearchCV(estimator=pipeline, param_grid=imputer_grid),
param_grid=estimator_grid,
)
For each estimator candidate, this runs a grid search over the imputer candidates; the best imputer is used for the estimator, and then the estimators-with-best-imputers are compared.
The main drawback here is that the inner search gets cloned for each estimator candidate, and so you don't get access to the cv_results_ for the non-winning estimator's imputers.
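A minimal sketch of the nested search, under stated assumptions: the toy data (iris with a few values blanked out), cv=3, the small PCA sizes, and putting pca__n_components in the outer grid are illustrative choices, not part of the original answer. The point it shows is the naming: because the outer search wraps the inner GridSearchCV, pipeline steps are addressed with an extra estimator__ prefix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): iris with about 5% of the values removed,
# so that the imputers have something to do.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", KNeighborsClassifier()),
])

# Inner search: imputer candidates, named relative to the pipeline.
imputer_grid = [
    {"imputer": [SimpleImputer()], "imputer__strategy": ["median", "mean"]},
    {"imputer": [KNNImputer()], "imputer__n_neighbors": [5, 10]},
]
inner = GridSearchCV(estimator=pipeline, param_grid=imputer_grid, cv=3)

# Outer search: its estimator is `inner`, so every pipeline parameter
# needs the additional `estimator__` prefix.
estimator_grid = [
    {"estimator__pca__n_components": [2, 3],
     "estimator__clf": [LogisticRegression(max_iter=1000)],
     "estimator__clf__C": [1, 10]},
    {"estimator__pca__n_components": [2, 3],
     "estimator__clf": [KNeighborsClassifier()],
     "estimator__clf__n_neighbors": [10, 25]},
]
outer = GridSearchCV(estimator=inner, param_grid=estimator_grid, cv=3)
outer.fit(X, y)

print(outer.best_params_)                  # winning PCA/classifier settings
print(outer.best_estimator_.best_params_)  # imputer chosen for that winner
```

Only the refitted winner is kept as outer.best_estimator_, which is why the inner cv_results_ of the losing candidates are not available afterwards.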
Pythonically generate (part of) the grid
ParameterGrid, used internally by GridSearchCV, is mostly a wrapper around itertools.product. So we can use itertools ourselves to create (chunks of) the grid. E.g. we can create the list you've written, but with less repeated code:
import itertools
imputers = [{
    'imputer': (SimpleImputer(), ),
    'imputer__strategy': ('median', 'mean'),
},
{
    'imputer': (KNNImputer(), ),
    'imputer__n_neighbors': (5, 10),
}]
models = [{
    'clf': (LogisticRegression(),),
    'clf__C': (1,10),
},
{
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}]
pcas = [{'pca__n_components': (10, 20),}]
parameters = [
    {**imp, **pca, **model}  # in Python 3.9+ the slicker notation imp | pca | model works
    for imp, pca, model in itertools.product(imputers, pcas, models)
]  # this should give the same as your hard-coded list-of-dicts
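As a quick sanity check of the idea, here is a dependency-free sketch that builds the same list and counts what it expands to. The placeholder strings standing in for the estimator instances and the count_candidates helper are assumptions made for illustration only.

```python
import itertools

# Placeholder strings stand in for the estimator instances (assumption),
# so this runs without scikit-learn installed.
imputers = [
    {"imputer": ("SimpleImputer",), "imputer__strategy": ("median", "mean")},
    {"imputer": ("KNNImputer",), "imputer__n_neighbors": (5, 10)},
]
models = [
    {"clf": ("LogisticRegression",), "clf__C": (1, 10)},
    {"clf": ("KNeighborsClassifier",), "clf__n_neighbors": (10, 25)},
]
pcas = [{"pca__n_components": (10, 20)}]

# One merged dict per (imputer, pca, model) combination.
parameters = [
    {**imp, **pca, **model}
    for imp, pca, model in itertools.product(imputers, pcas, models)
]

def count_candidates(grid):
    """Number of settings one dict expands to: the product of its value lengths."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total

sizes = [count_candidates(grid) for grid in parameters]
print(len(parameters), sizes, sum(sizes))  # 4 [8, 8, 8, 8] 32
```

Four dicts of eight settings each matches the hand-written list in the question.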

The added layer must be an instance of class Layer using TF2.10.0 and python 3.10.8

I'm using TF 2.10.0 with python 3.10.8 and running into
TypeError: The added layer must be an instance of class Layer. Received: layer=<class 'keras.layers.pooling.max_pooling2d.MaxPooling2D'> of type <class 'type'>.
I referred to this article (https://stackoverflow.com/questions/56089489/how-to-fix-the-added-layer-must-be-an-instance-of-class-layer-while-building-a) to try to fix it, but had no luck.
Here's my code. Please tell me what I'm doing wrong.
import tensorflow.keras.layers as layers
from tensorflow.keras.models import Sequential
import numpy as np
import PIL
model = Sequential([
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPool2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPool2D,
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPool2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, name='output')
])
Please help :)
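The error message is correct: the second pooling entry is written as layers.MaxPool2D, without parentheses, so Sequential receives the class itself rather than a layer instance. Change it to layers.MaxPool2D() like the other two. Beyond that fix, pooling can also be wrapped in a custom layer class, which is one place to handle side padding around convolution-style operations.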
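The class-versus-instance distinction behind this error can be reproduced without TensorFlow. In the sketch below, Layer, MaxPool2D and add_layer are hypothetical stand-ins written for illustration; they are not the Keras classes or Keras's actual implementation, only the same kind of isinstance check that the error message describes.

```python
class Layer:
    """Hypothetical stand-in for a layer base class."""


class MaxPool2D(Layer):
    """Hypothetical stand-in for a pooling layer."""


def add_layer(stack, layer):
    # Accept only instances of Layer, mirroring the wording of the error.
    if not isinstance(layer, Layer):
        raise TypeError(
            "The added layer must be an instance of class Layer. "
            f"Received: layer={layer} of type {type(layer)}."
        )
    stack.append(layer)


stack = []
add_layer(stack, MaxPool2D())    # instance (with parentheses): accepted

try:
    add_layer(stack, MaxPool2D)  # bare class (no parentheses): rejected
except TypeError as err:
    print(err)

print(type(MaxPool2D()), type(MaxPool2D))
```

The bare class is an object of type type, which is exactly what the traceback in the question reports.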
Sample: a custom layer built on MaxPool2D. Side padding can be set on the layer itself or applied with tf.pad().
import tensorflow as tf
class MyMaxPoolLayer( tf.keras.layers.MaxPool2D ):
    def __init__( self, units ):
        super(MyMaxPoolLayer, self).__init__( units )
        self.num_units = units
    def build(self, input_shape):
        self.kernel = self.add_weight("kernel",
                                      shape=[int(input_shape[-1]),
                                             self.num_units])
    def call(self, inputs):
        max_pool_2d = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid')
        temp = tf.matmul(inputs, self.kernel)
        temp = tf.reshape(temp, [1, 10, 10, 1])
        temp = max_pool_2d( temp )
        return temp
start = 3
limit = 33
delta = 3
sample = tf.range(start, limit, delta)
sample = tf.cast( sample, dtype=tf.float32 )
sample = tf.constant( sample, shape=( 10, 1, 1, 1 ) )
layer = MyMaxPoolLayer(10)
print( layer(sample) )
Output (truncated):
...
[[ 14.819944 ]
[ 10.318433 ]
[ 10.318433 ]
[ 2.3505163 ]
[-10.546914 ]
[-11.872433 ]
[ 13.375727 ]
[ 13.375727 ]
[ 6.088965 ]]
[[ 16.466604 ]
[ 11.464926 ]
[ 11.464926 ]
[ 2.6116848 ]
[-11.865278 ]
[-13.356487 ]
[ 14.861918 ]
[ 14.861918 ]
[ 6.7655163 ]]]], shape=(1, 9, 9, 1), dtype=float32)

Ensembling KNeighbours and Decision Tree Using Voting Classifier

I have a classification problem for which I am trying to build an ensemble of two classifiers, for example KNeighbors and a decision tree. In addition, I want to implement it using a Pipeline. This is my attempt at the problem:
steps = [('scaler', StandardScaler()),
('regressor', VotingClassifier(estimators=[
('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())],voting='soft'))]
pipeline = Pipeline(steps)
parameters = [{'knn__n_neighbors': np.arange(1, 50)}, {
'clf__n_estimators': [10, 20, 30],
'clf__criterion': ['gini', 'entropy'],
'clf__max_features': [5, 10, 15],
'clf__max_depth': ['auto', 'log2', 'sqrt', None]}]
X_train, X_test, y_train, y_test = train_test_split(X, y.values.ravel(),
test_size=0.3, random_state=65)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
Running this raises the following error:
Invalid parameter knn for estimator
Pipeline(steps=[('scaler', StandardScaler()),
('regressor', VotingClassifier(
estimators=[('knn', KNeighborsClassifier()),
('clf', RandomForestClassifier())
]
)
)
]
).
Check the list of available parameters with `estimator.get_params().keys()`.
I believe there is some error in how I have defined the parameter grid. Please help me out with this.
Since it's nested, you'll need to specify both prefixes, like this:
# Single grid: the original closed one dict and opened a second one after the
# knn entry, and you'd probably want it to be a single grid.
parameters = [{'regressor__knn__n_neighbors': np.arange(1, 5),
               'regressor__clf__n_estimators': [10, 20, 30],
               'regressor__clf__criterion': ['gini', 'entropy'],
               'regressor__clf__max_depth': [5, 10, 15],
               'regressor__clf__max_features': ['log2', 'sqrt', None]}]
Also, your max_depth and max_features values switched their supposed places somehow, fixed that. (And 'auto' does the same as 'sqrt', at least for the recent versions.)
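A small sketch of how to look the valid names up yourself, following the hint at the end of the error message. It rebuilds the pipeline from the question with default settings and lists the names get_params() exposes; nothing is fitted.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", VotingClassifier(
        estimators=[("knn", KNeighborsClassifier()),
                    ("clf", RandomForestClassifier())],
        voting="soft")),
])

# Every key here is a name that GridSearchCV will accept in param_grid.
valid_names = pipeline.get_params().keys()

# Names for the knn estimator, as seen from the top-level pipeline.
knn_names = sorted(n for n in valid_names if n.startswith("regressor__knn__"))
print(knn_names)

print("knn__n_neighbors" in valid_names)             # unprefixed name
print("regressor__knn__n_neighbors" in valid_names)  # fully prefixed name
```

Each level of nesting contributes one prefix: the pipeline step name first, then the name given inside the VotingClassifier.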

Using Pipeline with GridSearchCV

Suppose I have this Pipeline object:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('my_transform', my_transform()),
('estimator', SVC())
])
To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:
pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('rbf',)
}
Then, I could use GridSearchCV:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)
We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?
For example, In a simple GridSearch (without Pipeline) I could do:
param_grid = [
{'C': [ 0.1, 1, 10, 100, 1000],
'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
'kernel': ['rbf']},
{'C': [0.1, 1, 10, 100, 1000],
'kernel': ['linear']},
{'C': [0.1, 1, 10, 100, 1000],
'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
'degree': [2, 3],
'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)
Therefore, I need a working version of this sort of code:
pipe_parameters = {
'bag_of_words__max_features': (None, 1500),
'estimator__kernel': (rbf),
'estimator__gamma': (0.1, 1),
'estimator__kernel': (linear),
'estimator__C': (0.1, 1),
}
Meaning that I want to use as hyperparameters the following combinations:
kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1
You are almost there. Similar to how you created multiple dictionaries for SVC model, create a list of dictionaries for the pipeline.
Try this example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
remove=remove)
pipe = Pipeline([
('bag_of_words', CountVectorizer()),
('estimator', SVC())])
pipe_parameters = [
{'bag_of_words__max_features': (None, 1500),
'estimator__C': [ 0.1, ],
'estimator__gamma': [0.0001, 1],
'estimator__kernel': ['rbf']},
{'bag_of_words__max_features': (None, 1500),
'estimator__C': [0.1, 1],
'estimator__kernel': ['linear']}
]
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters, cv=2)
grid.fit(data_train.data, data_train.target)
grid.best_params_
# {'bag_of_words__max_features': None,
# 'estimator__C': 0.1,
# 'estimator__kernel': 'linear'}
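To see which combinations a list of dictionaries actually produces, without fitting anything, the grid can be expanded with ParameterGrid, the class GridSearchCV uses internally. This sketch reuses the pipe_parameters from the example above.

```python
from sklearn.model_selection import ParameterGrid

pipe_parameters = [
    {"bag_of_words__max_features": (None, 1500),
     "estimator__C": [0.1],
     "estimator__gamma": [0.0001, 1],
     "estimator__kernel": ["rbf"]},
    {"bag_of_words__max_features": (None, 1500),
     "estimator__C": [0.1, 1],
     "estimator__kernel": ["linear"]},
]

# Each dict is expanded on its own; the results are concatenated.
candidates = list(ParameterGrid(pipe_parameters))
for candidate in candidates:
    print(candidate)

print(len(candidates))  # 4 rbf settings + 4 linear settings = 8

# gamma never appears together with the linear kernel.
linear = [c for c in candidates if c["estimator__kernel"] == "linear"]
print(all("estimator__gamma" not in c for c in linear))  # True
```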

How to create a tensorflow operation that will be reused

I have some data that I want to process before feeding/training a model. For this example I want to do a max pool 2d. I wrote a short function to do that with tensorflow.
import tensorflow
import tensorflow.nn as nn
def _tfMaxPool(arr, pool=(4,4), sess=None):
    op = nn.max_pool(arr, (1, 1, pool[0], 1), (1, 1, pool[0], 1 ), padding="VALID")
    op = nn.max_pool(op, (1, 1, 1, pool[1]), (1, 1, 1, pool[1]), padding="VALID")
    if sess is None:
        sess = tensorflow.Session()
    return sess.run(op)
The problem is this can add nodes to my graph each time, which seems to clutter my session. One alternative way is to create a model.
import keras
seq = keras.Sequential([
    keras.layers.InputLayer((1, 512, 512)),
    keras.layers.MaxPool2D((4, 4), (4, 4), data_format="channels_first")
])
def _tfMaxPool2(arr, pool=(4,4), sess=None):
    swapped = arr.swapaxes(0,1)
    return seq.predict(swapped).swapaxes(0,1)
The model is nearly exactly like what I want, but I think I am missing something fundamental.
Why don't you reuse your graph with different inputs? In the following code, the pooling ops from _tfMaxPool are added to the graph only once, and each run just feeds new data.
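import tensorflow as tf  # TF 1.x API, as in the question
nn = tf.nn

def _tfMaxPool(arr, pool=(4,4)):
    op = nn.max_pool(arr, (1, 1, pool[0], 1), (1, 1, pool[0], 1 ), padding="VALID")
    op = nn.max_pool(op, (1, 1, 1, pool[1]), (1, 1, 1, pool[1]), padding="VALID")
    return op

# Build the graph once. A placeholder needs a dtype; float32 and a 4-D shape are assumed here.
input = tf.placeholder(tf.float32, shape=(None, None, None, None))
output = _tfMaxPool(input)
with tf.Session() as sess:
    # Both runs reuse the same ops; only the fed data changes.
    sess.run(output, feed_dict={input: arr1})
    sess.run(output, feed_dict={input: arr2})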
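If the pooling is only a preprocessing step, a swapped-in alternative is to skip TensorFlow for it altogether. The sketch below is a plain NumPy reshape-based max pool, not the TensorFlow approach above; the function name max_pool_last2 and the choice to pool over the last two axes (as in the channels_first Keras version from the question) are assumptions for illustration. Trailing rows and columns that do not fill a window are dropped, which corresponds to "VALID" padding with stride equal to the pool size.

```python
import numpy as np


def max_pool_last2(arr, pool=(4, 4)):
    """Non-overlapping max pool over the last two axes of `arr`."""
    ph, pw = pool
    *lead, h, w = arr.shape
    h_out, w_out = h // ph, w // pw
    # Drop the remainder that does not fill a complete window.
    trimmed = arr[..., :h_out * ph, :w_out * pw]
    # Split each pooled axis into (number of windows, window size).
    blocks = trimmed.reshape(*lead, h_out, ph, w_out, pw)
    # Reduce over the two window-size axes.
    return blocks.max(axis=(-3, -1))


arr = np.arange(16).reshape(1, 4, 4)
print(max_pool_last2(arr, pool=(2, 2)))
# [[[ 5  7]
#   [13 15]]]
```

Because nothing is added to a graph, the function can be called any number of times without accumulating state.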

PYSPARK: how to get weights from CrossValidatorModel?

I have trained a logistic regression model with cross-validation, using the following code from https://spark.apache.org/docs/2.1.0/ml-tuning.html
Now I want to get the weights and intercept, but I get this error:
AttributeError: 'CrossValidatorModel' object has no attribute 'weights'
How can I get these attributes?
I have the same problem with trainingSummary = cvModel.summary.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Prepare training documents, which are labeled.
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=2) # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
(4, "spark i j k"),
(5, "l m n"),
(6, "mapreduce spark"),
(7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
A LogisticRegression model has coefficients, not weights. Other than that, it can be done as below:
lr_model = (cvModel
    # The best model from CrossValidator
    .bestModel
    # The last stage in Pipeline
    .stages[-1])
print(lr_model.coefficients)
print(lr_model.intercept)
