I am implementing KNN in Python and it was working.
Now I get an error:
No module named 'sklearn.grid_search'
When I change the import to sklearn.model_selection, I get another error:
'GridSearchCV' object has no attribute 'grid_scores_'
Here is my code:
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# define the parameter values that should be searched
# for python 2, k_range = range(1, 31)
# instantiate model
knn = KNeighborsClassifier(n_jobs=-1)
k_range = list(range(1, 31))
print(k_range)
# create a parameter grid: map the parameter names to the values that should be searched
# simply a python dictionary
# key: parameter name
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
# fit the grid with data
grid.fit(X, y)
# view the complete results (list of named tuples)
grid.grid_scores_
# examine the first tuple
# we will slice the list and select its elements using dot notation and []
print('Parameters')
print(grid.grid_scores_[0].parameters)
# Array of 10 accuracy scores during 10-fold cv using the parameters
print('')
print('CV Validation Score')
print(grid.grid_scores_[0].cv_validation_scores)
# Mean of the 10 scores
print('')
print('Mean Validation Score')
print(grid.grid_scores_[0].mean_validation_score)
# create a list of the mean scores only
# list comprehension to loop through grid.grid_scores
grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]
print(grid_mean_scores)
# plot the results
# this is identical to the one we generated above
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
# examine the best model
# Single best score achieved across all params (k)
print(grid.best_score_)
# Dictionary containing the parameters (k) used to generate that score
print(grid.best_params_)
# Actual model object fit with those best parameters
# Shows default parameters that we did not specify:
print(grid.best_estimator_)
Try the below:
from sklearn.model_selection import GridSearchCV
Reference link:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
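The second error has the same root cause: grid_scores_ was removed together with the sklearn.grid_search module, and its contents now live in the cv_results_ dictionary. A minimal sketch of the equivalent calls, reusing grid, k_range and plt from the question:

# cv_results_ is a dict of arrays with one entry per parameter combination;
# 'mean_test_score' replaces the old mean_validation_score values
grid_mean_scores = grid.cv_results_['mean_test_score']
print(grid.cv_results_['params'][0])             # parameters of the first candidate
print(grid.cv_results_['split0_test_score'][0])  # its score on the first CV fold

plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')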
I am trying to understand the code behind TfidfTransformer(). From sklearn's documentation, I can get the term frequencies by setting use_idf=False. But when I check the code on GitHub, I noticed that TfidfTransformer() returns the same value as CountVectorizer() when not using normalization, which is just the count of each term.
Here is the code that is supposed to calculate the term frequencies:
def transform(self, X, copy=True):
    """Transform a count matrix to a tf or tf-idf representation.

    Parameters
    ----------
    X : sparse matrix of (n_samples, n_features)
        A matrix of term/token counts.

    copy : bool, default=True
        Whether to copy X and operate on the copy or perform in-place
        operations.

    Returns
    -------
    vectors : sparse matrix of shape (n_samples, n_features)
        Tf-idf-weighted document-term matrix.
    """
    X = self._validate_data(
        X, accept_sparse="csr", dtype=FLOAT_DTYPES, copy=copy, reset=False
    )
    if not sp.issparse(X):
        X = sp.csr_matrix(X, dtype=np.float64)

    if self.sublinear_tf:
        np.log(X.data, X.data)
        X.data += 1

    if self.use_idf:
        # idf being a property, the automatic attributes detection
        # does not work as usual and we need to specify the attribute
        # name:
        check_is_fitted(self, attributes=["idf_"], msg="idf vector is not fitted")

        # *= doesn't work
        X = X * self._idf_diag

    if self.norm is not None:
        X = normalize(X, norm=self.norm, copy=False)

    return X
To investigate further, I ran both classes and compared the outputs of CountVectorizer and TfidfTransformer using the following code; the outputs are equal.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'), subset='train',
                             categories=['sci.electronics', 'rec.autos', 'rec.sport.hockey'])
train_documents = dataset.data
vectorizer = CountVectorizer()
train_documents_mat = vectorizer.fit_transform(train_documents)
tf_vectorizer = TfidfTransformer(use_idf=False, norm=None)
train_documents_mat_2 = tf_vectorizer.fit_transform(train_documents_mat)
equal = np.array_equal(
train_documents_mat.toarray(),
train_documents_mat_2.toarray()
)
print(equal)
I am trying to get the term frequencies for my documents rather than just the counts. Any ideas why sklearn implements TF-IDF in this way?
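For what it's worth, scikit-learn defines the term frequency as the raw count itself, which is why the transform is an identity here. If relative frequencies (counts divided by document length) are what you are after, one sketch under that assumption is to keep use_idf=False but switch on L1 normalization:

from sklearn.feature_extraction.text import TfidfTransformer

# norm='l1' divides each row by the sum of its counts, turning raw counts
# into relative term frequencies per document
tf_vectorizer = TfidfTransformer(use_idf=False, norm='l1')
train_documents_tf = tf_vectorizer.fit_transform(train_documents_mat)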
Main question:
In a certain modeling scenario it seems more robust to me to judge candidates tested in a sklearn.model_selection.GridSearchCV by their median performance instead of their mean. Is there a way to do this?
Some more context:
Especially for small datasets, or when using a CV scheme with few samples in the test folds (e.g. LeaveOneOut), it may occur that certain folds achieve extremely low test scores while the bulk of the folds perform quite well. Selecting on the mean of all test scores may then prefer a different candidate, for instance one where all folds perform moderately poorly but none performs outrageously badly.
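A toy illustration of the effect (the numbers are invented for the example):

import numpy as np

a = np.array([0.82, 0.84, 0.86, 0.05])  # one catastrophic fold
b = np.array([0.70, 0.71, 0.72, 0.69])  # uniformly mediocre folds
print(a.mean(), b.mean())          # 0.6425 vs. 0.705 -> the mean prefers b
print(np.median(a), np.median(b))  # 0.83 vs. 0.705   -> the median prefers a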
My current workaround has some problems:
I can tell GridSearchCV to write the best_* attributes w.r.t. a custom callable passed as the refit argument, so I am using the function below to select the model which achieved the best median score among the CV folds:
def best_median_score(cv_results):
    """
    Find the best median score from a cross-validation result dictionary.

    :param cv_results: dictionary of cross-validation results
    :return: index of best median score
    """
    inner_test_scores = np.array([
        scores for key, scores in cv_results.items()
        if key.startswith('split') and f'test_{Config.refit_scorer}' in key
    ])
    median_inner_test_scores = np.median(inner_test_scores, axis=0)
    return median_inner_test_scores.argmax()
and pass it as:
grid = GridSearchCV(
    pipe,                    # pipeline object of model steps
    params,                  # parameter grid
    scoring=scorer,          # dict of multiple scorers
    refit=best_median_score,
    cv=10,
    verbose=1,
    n_jobs=-1
)
However, GridSearchCV still calculates the mean_test_score entries in grid.cv_results_, where I would prefer "median_test_score" instead. Also, this way I am losing the grid.best_score_ attribute and get an error when trying to score manually:
grid.score(model_X, model_y)
KeyError                                  Traceback (most recent call last)
File ~/.local/share/virtualenvs/my_env/lib/python3.9/site-packages/sklearn/model_selection/search.py:446, in BaseSearchCV.score(self, X, y)
    444 if isinstance(self.scorer_, dict):
    445     if self.multimetric_:
--> 446         scorer = self.scorer_[self.refit]
    447 else:
    448     scorer = self.scorer_

KeyError: <function best_median_score at 0x7f4b840beca0>
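For completeness: with a callable refit, GridSearchCV cannot know which entry of a multi-metric scoring dict to use in score(), so grid.best_score_ and grid.score() are unavailable by design. A hedged workaround, assuming the values of the scorer dict from above are scorer callables (e.g. built with make_scorer):

# best_estimator_ is still refitted when refit is a callable, so the
# desired scorer can be applied to it manually
manual_score = scorer[Config.refit_scorer](grid.best_estimator_, model_X, model_y)
print(manual_score)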
Median test performance can be computed outside of GridSearchCV, and the model can then be refitted with the hyper-parameter combination that achieved the best median score.
import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# refit=False: we only want the CV scores; the refit happens manually below
clf = GridSearchCV(svc, parameters, refit=False)
clf.fit(iris.data, iris.target)

# Compute the median over the per-fold test scores (the 'split*' columns)
results_df = pd.DataFrame(clf.cv_results_)
results_df['median_test_score'] = results_df.filter(regex='^split').median(axis=1)
results_df['rank_test_score'] = results_df['median_test_score'].rank(ascending=False).astype(int)

# Refit on the full data with the best-median parameter combination
svc.set_params(**results_df.query('rank_test_score == 1')['params'].values[0])
svc.fit(iris.data, iris.target)
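As a follow-up sanity check, the winning row can be read off directly from the results_df built above:

# Show the parameter combination with the best median test score
best_row = results_df.sort_values('median_test_score', ascending=False).iloc[0]
print(best_row['params'], best_row['median_test_score'])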
In Pytorch, is there any way of loading a specific single sample using the torch.utils.data.DataLoader class? I'd like to do some testing with it.
The tutorial uses
trainloader = torch.utils.data.DataLoader(...)
images, labels = next(iter(trainloader))
to fetch a random batch of samples. Is there a way, using DataLoader, to get a specific sample?
Cheers
Turn off the shuffle in DataLoader.
Use batch_size to calculate the batch in which the desired sample falls.
Iterate to the desired batch.
Code:
import torch
import numpy as np
import itertools

X = np.arange(100)
batch_size = 2
dataloader = torch.utils.data.DataLoader(X, batch_size=batch_size, shuffle=False)

sample_at = 5
# index of the batch that contains the desired sample
k = int(np.floor(sample_at / batch_size))
# skip the first k batches and take the next one
my_sample = next(itertools.islice(dataloader, k, None))
print(my_sample)
Output:
tensor([4, 5])
If you want to get a specific single sample from your dataset, you should check the Subset class (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset). Something like this:
from torch.utils.data import DataLoader, Subset

indices = [0, 1, 2]  # select your indices here as a list
subset = Subset(train_set, indices)
trainloader = DataLoader(subset, batch_size=16, shuffle=False)  # set shuffle to False
for image, label in trainloader:
    print(image.size(), '\t', label.size())
    print(image[0], '\t', label[0])  # index the specific sample
Here is a useful link if you want to learn more about the PyTorch data loading utility: https://pytorch.org/docs/stable/data.html
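If batching is not actually needed, note that the loader keeps a reference to its dataset, and a map-style dataset supports plain indexing, so a single sample can also be fetched directly (a minimal sketch, reusing the trainloader from the question):

# Indexing the underlying dataset returns one (sample, label) pair,
# bypassing batching and shuffling entirely
image, label = trainloader.dataset[5]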
I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
    '''
    Takes as arguments two arrays and a string for the y-array name.
    Uses Ordinary Least Squares to compute the statistical parameters for the
    array against log(z), and determines the equation for the line of best fit.
    Returns the results summary, residuals, statistical parameters in a list,
    and the best fit equation.
    '''
    # Mask NaN values in both axes
    mask = ~np.isnan(y) & ~np.isnan(x)
    # Compute model parameters
    model = sm.OLS(y, sm.add_constant(x), missing='drop')
    results = model.fit()
    residuals = results.resid

    # Compute fit parameters
    params = stats.linregress(x[mask], y[mask])
    fit = params[0] * x + params[1]
    fitEquation = r'$(%s)=(%.4g \pm %.4g) \times redshift+%.4g$' % (yName,
                                                                    params[0],  # slope
                                                                    params[4],  # stderr in slope
                                                                    params[1])  # y-intercept
    return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
Hi @jim421616. Since statsmodels drops a few missing values, you should use the model's exog attribute to plot the scatter, as shown:
plt.scatter(model.model.exog[:,1], model.resid)
For reference, a complete dummy example:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Generate data
x = np.random.rand(1000)
y = np.sin(x * 25) + 0.1 * np.random.rand(1000)

# Make some values NaN
y[np.random.choice(np.arange(1000), size=100)] = np.nan
x[np.random.choice(np.arange(1000), size=80)] = np.nan

# Fit the model; missing='drop' silently discards rows containing NaNs
model = sm.OLS(y, sm.add_constant(x), missing='drop').fit()
print(model.summary())

# Plot the residuals against the exog values that survived the drop
plt.scatter(model.model.exog[:, 1], model.resid)
plt.show()
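Alternatively, since the mask inside the question's computeStats excludes exactly the rows that missing='drop' discards (those where either x or y is NaN), the residuals can also be plotted against the masked x directly, reusing the names from that function:

# x[mask] and the residuals line up one-to-one, because missing='drop'
# removed the same rows that the mask excludes
plt.scatter(x[mask], residuals)
plt.show()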
Is it possible to compute feature importance (with Random Forest) in scikit-learn when features have been one-hot encoded?
Here's an example of how to combine feature names with their importances:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# some example data
X = pd.DataFrame({'feature': ['value1', 'value2', 'value2', 'value1', 'value2']})
y = [1, 0, 0, 1, 1]

# translate rows to dicts
def row_to_dict(X, y=None):
    return X.apply(dict, axis=1)

# define prediction model
ft = FunctionTransformer(row_to_dict, validate=False)
dv = DictVectorizer()
rf = RandomForestClassifier()

# glue steps together
model = make_pipeline(ft, dv, rf)

# train
model.fit(X, y)

# get feature importances
feature_importances = list(zip(dv.feature_names_, rf.feature_importances_))

# have a look
print(feature_importances)
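Note that DictVectorizer performs the one-hot encoding implicitly here: a string value such as 'value1' becomes its own binary column named 'feature=value1', so the zip pairs every encoded column with its importance.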
Assuming that you have a pipeline with
a 'pre' step where you implement OneHotEncoder,
a 'clf' step where you define the classifier,
and the key of the categorical transformer given as 'cat',
the following function will combine the feature importances of the categorical features.
import numpy as np
import pandas as pd
import imblearn

def compute_feature_importance(model):
    """
    Create feature importance using sklearn's ensemble models' feature_importances_ property.

    Parameters
    ----------
    model : estimator instance (either sklearn.Pipeline, imblearn.Pipeline or a classifier)
        PRE-FITTED classifier or a PRE-FITTED Pipeline in which the last estimator is a classifier.

    Returns
    -------
    fi_df : Pandas DataFrame with feature_names and feature_importance
    """
    if isinstance(model, imblearn.pipeline.Pipeline):
        # If the user passes a pipeline model, the feature importance
        # is calculated in this branch.
        ct = model.named_steps['pre']  # column transformer ('pre' step) of the pipeline
        classifier = model['clf']      # classifier of the pipeline
        # get_feature_names_out() returns the feature names after transformation.
        feature_names = ct.get_feature_names_out()
        feature_importance = np.array(classifier.feature_importances_)
        # Create a DataFrame using a dictionary
        data = {'feature_names': feature_names, 'feature_importance': feature_importance}
        fi_df = pd.DataFrame(data)
        # Sort the DataFrame in order of decreasing feature importance
        fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
        if 'cat' in ct.named_transformers_.keys() and hasattr(ct.named_transformers_['cat'], 'feature_names_in_'):
            # Sum up the feature importance values of the individual OneHotEncoder columns.
            # Original categorical features, i.e. before applying OneHotEncoder
            original_cat_features = ct.named_transformers_['cat'].feature_names_in_.tolist()
            # Categorical feature names after applying OneHotEncoder
            all_cat_list = ct.named_transformers_['cat'].get_feature_names_out(original_cat_features).tolist()
            # For each original categorical feature, find its one-hot encoded columns
            for original_cat_feature in original_cat_features:
                # One-hot encoded features belonging to this original categorical feature
                cat_list = [i for i in all_cat_list if i.startswith(original_cat_feature)]
                # The one-hot encoded columns must be renamed:
                # ct.named_transformers_['cat'].get_feature_names_out(original_cat_features)
                # returns column names missing the "cat__" prefix, so add it back.
                cat_list = ['cat__' + element for element in cat_list]
                # Slice fi_df to the rows of the associated one-hot encoded feature
                # names (cat_list) and sum their feature importance values
                cat_sum = fi_df[fi_df['feature_names'].isin(cat_list)]['feature_importance'].sum()
                # Keep only the rows that do not belong to this categorical feature
                fi_df = fi_df[~fi_df['feature_names'].isin(cat_list)]
                # Temporary row holding the original categorical feature and the
                # summed importance of its one-hot encoded columns
                temp_dict = {'feature_names': original_cat_feature, 'feature_importance': cat_sum}
                # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
                fi_df = pd.concat([fi_df, pd.DataFrame([temp_dict])], ignore_index=True)
            # Sort the DataFrame in order of decreasing feature importance
            fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
        # Get rid of the transformer prefixes in the feature names
        prefixes = ('num__', 'cat__', 'remainder__', 'scaler__')
        for prefix in prefixes:
            fi_df['feature_names'] = fi_df['feature_names'].apply(lambda x: str(x).replace(prefix, ""))
    else:
        # Plain pre-fitted classifier (branch added so the non-pipeline case
        # promised by the docstring also works; feature_names_in_ exists when
        # the classifier was fitted on a DataFrame)
        fi_df = pd.DataFrame({
            'feature_names': getattr(model, 'feature_names_in_',
                                     np.arange(len(model.feature_importances_))),
            'feature_importance': model.feature_importances_,
        }).sort_values(by=['feature_importance'], ascending=False)
    return fi_df
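A hedged usage sketch: the toy data below and the step names 'pre'/'clf' with the transformer key 'cat' are invented for illustration, matching the assumptions listed above.

import pandas as pd
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Toy data: one numeric and one categorical column
X = pd.DataFrame({'age': [25, 32, 47, 51], 'color': ['red', 'blue', 'red', 'green']})
y = [0, 1, 0, 1]

pre = ColumnTransformer([
    ('num', StandardScaler(), ['age']),
    ('cat', OneHotEncoder(), ['color']),
])
model = Pipeline([('pre', pre), ('clf', RandomForestClassifier(random_state=0))])
model.fit(X, y)

# The three one-hot columns of 'color' are collapsed into a single row
print(compute_feature_importance(model))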