sklearn GridSearchCV, SelectKBest, and SVM - scikit-learn

I am trying to make a classifier that uses feature selection via this functionI have written, golub, which returns two np arrays as SelectKBest requires. I want to link this to an SVM classifier with a linear and and optimize over the possible combinations of k and C. However, what I have tried so far has not succeeded and I am not sure why. The code is as follows:
import numpy as np
from sklearn import cross_validation
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.grid_search import GridSearchCV
from golub_mod import golub
class SVM_golub_linear:
def __init__(self,X,y):
self.X=X
self.y=y
def Golub_SVM(self):
X=self.X
y=self.y
kbest=SelectKBest(golub,k=1)
k_vals=np.linspace(100,1000,10,dtype=int)
k_vals=k_vals.tolist()
c_vals=[0.00001,0.0001,0.001,0.01,.1,1,10,100,1000]
clf=svm.LinearSVC(penalty='l2')
steps=[('feature_selection',kbest),('svm_linear',clf)]
pipeline=make_pipeline(steps)
params=dict(feature_selection__k=k_vals,
svm_linear__C=c_vals)
best_model=GridSearchCV(pipeline,param_grid=params)
self.model=best_model.fit(X,y)
print(best_mod.best_params_)
def svmpredict(self,X_n):
y_vals=self.model.predict(X_n)
return y_vals
when I try to run this:
model=SVM_golub_linear(X,y)
model.Golub_SVM()
I get the following error:
TypeError: Last step of chain should implement fit '[('feature_selection',
SelectKBest(k=1, score_func=<function golub at 0x105f2c398>)), ('svm_linear', LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0))]' (type <type 'list'>) doesn't)
I do not understand this because LinearSVC does have a fit method. Thanks

In the code above if you replace
pipeline=make_pipeline(steps)
to
pipeline=Pipeline(steps)
the code works as is.

Related

the right way to make prediction using Spacy word vectors

Im learning how to convert text into numbers for NLP problems and following a course Im learning about word vectors provided by Spacy package. the code works all fine from learning and evaluation but I have some problems regarding:
making prediction for new sentences, I cannot seems to make it work and most examples just fit the model then use X_test set for evaluation. ( Code below)
The person explaining stated that its bad( won't give good results) if I used
""
doc.vector over doc.vector.values
""
when trying both I don't see a difference, what is the difference between the two?
the example is to classify news title between fake and real
import spacy
import pandas as pd
df= pd.read_csv('Fake_Real_Data.csv')
print(df.head())
print(f"shape is: {df.shape}")
print("checking the impalance: \n ", df.label.value_counts())
df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())
nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors
df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)
x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D)
clf.fit(scaled_train_2d, y_train)
from sklearn.metrics import classification_report
y_pred=clf.predict(scaled_test_2d)
print(classification_report(y_test, y_pred))

Loading trained model to make prediction of single image

I have trained a ResNet50 model on intel image multiclass classification task. The task is trying to predict an image whether it is a building a street or glacier etc. The model is succesfully trained and able to make prediction. I have save the model and trying to use the saved model on new image.
Here is the code on training
import os
import torch
import tarfile
import torchvision
import torch.nn as nn
from PIL import Image
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision import transforms
from torchvision.utils import make_grid
from torch.utils.data import random_split
from torchvision.transforms import ToTensor
from torchvision.datasets import ImageFolder
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets.utils import download_url
import PIL
import PIL.Image
import numpy as np
transform_train=transforms.Compose([
transforms.Resize((150,150)),
transforms.RandomHorizontalFlip(),
transforms.RandomVerticalFlip(),
transforms.ToTensor(),
transforms.Normalize((.5,.5,.5),(.5,.5,.5))
])
transform_test=transforms.Compose([
transforms.Resize((150,150)),
transforms.ToTensor(),
transforms.Normalize((.5,.5,.5),(.5,.5,.5))
])
...
torch.save(model2.state_dict(),'/content/drive/MyDrive/saved_model/model_resnet.pth')
When I called the model in other files, I use similar image transformation, however it gives me an error, here is the code and the error
model = torch.load('/content/drive/MyDrive/saved_model/model_resnet.pth')
image=Image.open(Path('/content/drive/MyDrive/images/seg_pred/seg_pred/10004.jpg'))
transform_train=transforms.Compose([
transforms.Resize((150,150)),
transforms.RandomHorizontalFlip(),
transforms.RandomVerticalFlip(),
transforms.ToTensor(),
transforms.Normalize((.5,.5,.5),(.5,.5,.5))
])
input = transform_train(image)
#input = input.view(1, 3, 150,150)
output = model(input)
prediction = int(torch.max(output.data, 1)[1].numpy())
print(prediction)
The error that gives me is
TypeError: 'collections.OrderedDict' object is not callable
My pytorch version is
1.9.0+cu102
You need to create the structure of the model first, it's similar to create model2 on your training code, it can be like:
model = resnet()
Then load the saved state dict:
model.load_state_dict(torch.load('/content/drive/MyDrive/saved_model/model_resnet.pth'))
model.eval()
Ref:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
Based on your question it's clear that you want to prediction on a new image. But you are trying to augment and get transform the image using transform which is not a proper way to get the prediction.
So as the code link you provided having plenty of code you can use them as in your code.
I am sharing the fast.ai and simple `TensorFlow code by which you can predict a new image and then be able to see the result.
img = open_image('any_image.jpg')
print(learn.predict(img)[0])
OR you can try this function:
import matplotlib.pyplot as plt # visualization
import matplotlib.image as mpimg
import tensorflow as tf # Deep Learning Framework
import pathlib
def pred_plot(file, model, class_names=class_names, image_size=(150, 150)):
img = tf.io.read_file(file)
img = tf.io.decode_image(img, channels=3)
img = tf.image.resize(img, size=image_size)
pred_probs = model.predict(tf.expand_dims(img, axis=0))
pred_class = class_names[pred_probs.argmax()]
plt.imshow(img/225.)
plt.title(f'Pred: {pred_class}')
plt.axis(False);
pass any image and you will get the prediction with visilzation.
url ='dummy.jpg'
pred_plot(url, model=model_2, class_names=class_names)

SKLEARN // Combine GridsearchCV with column transform and pipeline

I am struggling with a machine learning project, in which I am trying to combine :
a sklearn column transform to apply different transformers to my numerical and categorical features
a pipeline to apply my different transformers and estimators
a GridSearchCV to search for the best parameters.
As long as I fill-in the parameters of my different transformers manually in my pipeline, the code is working perfectly.
But as soon as I try to pass lists of different values to compare in my gridsearch parameters, I am getting all kind of invalid parameter error messages.
Here is my code :
First I divide my features into numerical and categorical
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)
Then I create 2 different preprocessing pipelines for numerical and categorical features:
numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),OneHotEncoder(handle_unknown='ignore'))
I combined both into another pipeline, set my parameters, and run my GridSearchCV code
model=make_pipeline(preprocessor, LinearRegression() )
params={
'columntransformer__numerical_pipeline__knnimputer__n_neighbors':[1,2,3,4,5,6,7]
}
grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=10)
cv = KFold(n_splits=5)
all_accuracies = cross_val_score(grid, X, y, cv=cv,scoring='r2')
I tried different ways to declare the paramaters, but never found the proper one. I always get an "invalid parameter" error message.
Could you please help me understanding what went wrong?
Really a lot of thanks for your support, and take good care!
I am assuming that you might have defined preprocessor as the following,
preprocessor = Pipeline([('numerical_pipeline',numerical_pipeline),
('cat_pipeline', cat_pipeline)])
then you need to change your param name as following:
pipeline__numerical_pipeline__knnimputer__n_neighbors
but, there are couple of other problems with the code:
you don't have to call cross_val_score after performing GridSearchCV. Output of GridSearchCV itself would have the cross validation result for each combination of hyper parameters.
KNNImputer would not work when you data is having string data. You need to apply cat_pipeline before num_pipeline.
Complete example:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
import pandas as pd # doctest: +SKIP
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
'rating': [5, 3, 4, 5]}) # doctest: +SKIP
y = [1,0,1,1]
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
numerical_features=make_column_selector(dtype_include=np.number)
cat_features=make_column_selector(dtype_exclude=np.number)
numerical_pipeline= make_pipeline(KNNImputer())
cat_pipeline=make_pipeline(SimpleImputer(strategy='most_frequent'),
OneHotEncoder(handle_unknown='ignore', sparse=False))
preprocessor = Pipeline([('cat_pipeline', cat_pipeline),
('numerical_pipeline',numerical_pipeline)])
model=make_pipeline(preprocessor, LinearRegression() )
params={
'pipeline__numerical_pipeline__knnimputer__n_neighbors':[1,2]
}
grid=GridSearchCV(model, param_grid=params,scoring = 'r2',cv=2)
grid.fit(X, y)

Python 3 and Sklearn: Difficulty to use a NOT-sklearn model as a sklearn model

The code below is working. I have just a routine to run a cross validation scheme using a linear model previous defined in sklearn. I do not have a problem with this. My problem is that: if I replace the code model=linear_model.LinearRegression() by the model=RBF('multiquadric') (please see line 14 and 15 in the __main__, it does not work anymore. So my problem is actually in the class RBF where I try to mimic a sklearn model.
If I replace the code described above, I get the following error:
FitFailedWarning)
/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: All arrays must be equal length.
FitFailedWarning)
1) Should I define a score function in the Class RBF?
2) How to do that? I am lost. Since I am inherit BaseEstimator and RegressorMixin, I expected that this was internally solved.
3) Is there something else missing?
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin
class RBF(BaseEstimator, RegressorMixin):
def __init__(self,function):
self.function=function
def fit(self,x,y):
self.rbf = Rbf(x, y,function=self.function)
def predict(self,x):
return self.rbf(x)
if __name__ == "__main__":
# Load Data
targetName='HousePrice'
data=datasets.load_boston()
featuresNames=list(data.feature_names)
featuresData=data.data
targetData = data.target
df=pd.DataFrame(featuresData,columns=featuresNames)
df[targetName]=targetData
independent_variable_list=featuresNames
dependent_variable=targetName
X=df[independent_variable_list].values
y=np.squeeze(df[[dependent_variable]].values)
# Model Definition
model=linear_model.LinearRegression()
#model=RBF('multiquadric')
# Cross validation routine
number_splits=5
score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
scalar = StandardScaler()
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
for score in score_list:
print(score+':')
print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
Lets look at the documentation here
*args : arrays
x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes
So it takes variable length argument with the last argument being the value which is y in your case. Argument k is the kth coordinates of all the data point (same for all other argument z, y, z, ….
Following the documentation, your code should be
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin
class RBF(BaseEstimator, RegressorMixin):
def __init__(self,function):
self.function=function
def fit(self,X,y):
self.rbf = Rbf(*X.T, y,function=self.function)
def predict(self,X):
return self.rbf(*X.T)
# Load Data
data=datasets.load_boston()
X = data.data
y = data.target
number_splits=5
score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
scalar = StandardScaler()
model = RBF(function='multiquadric')
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
for score in score_list:
print(score+':')
print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
Output
neg_mean_squared_error:
Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
Test: Mean -23.007377210596463 Standard Error 4.254629143836107
neg_mean_absolute_error:
Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
r2:
Train: Mean 1.0 Standard Error 0.0
Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
Why *X.T : As we saw, each argument correspond to an axis of all the data points, so we transpose them and then use * operator to expand and pass each of the sub array as an argument to the variable length function.
Looks like the latest implementation has a mode parameter where we can pass the N-D array directly.

Plotting Grid Search CV grid parameter results on one graph

In sklearn 0.17.1 there was-->> grid_scores_ : list of named tuples (https://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV)
Now in sklearn 0.21.2 it is replaced with-->> cv_results_ : dict of numpy (masked) ndarrays (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
Previously with sklearn 0.17.1, I was able to plot all grid parameters on a single plot using grid_scores_ but now I am unable to aggregate the values obtained from cv_results_ as there is no "mean_validation_score" in newer version.
I have an existing code which plotted all the parameters score in sklearn 0.17.1 (https://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html#sklearn.grid_search.GridSearchCV) where grid_scores_ was used and it perfectly plotted all the values on one plot.
In newer version of slearn cv_results_ has been replaced with grid_scores_. I have tried to append all the values in want to plot all the parameters on one plot, currently I am unable to add the correct values to plot on the graph.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics.ranking import precision_recall_curve
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.metrics import accuracy_score
import sklearn
import itertools
from pandas.tools.plotting import scatter_matrix
import os
import datetime as dt
from operator import itemgetter
from itertools import chain
import graphviz
from sklearn.metrics import precision_recall_fscore_support
import scikitplot as skplt
X_train = np.random.randint(0,1, size=[500,5000])
y_train = np.random.randint(0,1, size=500)
print(X_train.shape, y_train.shape)
# (500, 5000) (500,)
#grid_search = GridSearchCV(clf, param_grid, cv=3) # 10 fold cross validation
### hyperparameter estimator
param_grid = {"criterion": ["gini", "entropy"],
"splitter": ["best", "random"],
"max_depth": np.arange(1,9,7),
"min_samples_split": np.arange(2,150,90),
"min_samples_leaf": np.arange(1,60,45),
"min_weight_fraction_leaf": np.arange(0.1,0.4, 0.3),
"max_features": [1000, 500, 5000],
"max_leaf_nodes": np.arange(2,60,45),
"min_impurity_decrease": [0.0, 0.5],
}
def evaluate_param(parameter, param_range, index):
grid_search = GridSearchCV(clf, param_grid = {parameter: param_range}, cv=3) # 3 fold cross validation
grid_search.fit(X_train, y_train) ### grid_search.fit(X_train[features], y_train)
df = {}
#for i, score in enumerate(grid_search.grid_scores_): # previously used methods
for i, score in enumerate(grid_search.cv_results_["params"]):
## How do we save the correct values here for plotting
df[parameter] = grid_search.cv_results_["params"][i][parameter]
#df[parameter].update(grid_search.cv_results_["params"][i][parameter])
#print("df : ", df)
#df[parameter].append(grid_search.cv_results_["params"][i][parameter])
#print("df : ", df) # the values are not appended to the keys
df = pd.DataFrame.from_dict(df, orient='index')
df.reset_index(level=0, inplace=True)
df = df.sort_values(by='index')
plt.subplot(5,2,index) # Change here according to the number of parameters
plt.xlabel(parameter, color = "red")
plt.ylabel("GridSearchCV Score", color= "blue")
plot = plt.plot(df['index'], df[0])
plt.title(parameter.capitalize(), color = "red")
plt.savefig('DT_GridSearchCV_Score_Hyperparameter.png')
return plot, df
clf = tree.DecisionTreeClassifier(random_state=99) # verbose=True, n_jobs=-1 :: Dt does not support it
### hyperparameter estimator
index = 1
plt.figure(figsize=(30,30))
for parameter, param_range in dict.items(param_grid):
evaluate_param(parameter, param_range, index) ## 120 features
index += 1
This image is not filled as there is no "mean_validation_score" which can be filled for each subplot now:
https://ibb.co/Z6jwnMr
## Keys() gives the list of keys that gridsearchcv has:
grid_search.cv_results_.keys()
# output
# dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_criterion', 'param_max_depth', 'param_max_features', 'param_max_leaf_nodes', 'param_min_impurity_decrease', 'param_min_samples_leaf', 'param_min_samples_split', 'param_min_weight_fraction_leaf', 'param_splitter', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'mean_train_score', 'std_train_score'])
grid_search.best_estimator_
# output
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
#max_features=1000, max_leaf_nodes=2, min_impurity_decrease=0.0,
#min_impurity_split=None, min_samples_leaf=1,
#min_samples_split=2, min_weight_fraction_leaf=0.1,
#presort=False, random_state=99, splitter='best')
Expected Result (should be filled): https://ibb.co/Z6jwnMr
However each subplot on the plot should have a curve depicting best value for the parameter. The keys do not have a "mean_validation_score" to plot the actual test score which was there in sklearn 0.17.1 but not in sklearn 0.20.2
Kindly let me know if there is still a way to plot all test scores on subplots of a single plot. Thanks in advance!!

Resources