How do I SelectKBest using mutual information from a mixture of discrete and continuous features? - scikit-learn

I am using scikit-learn to train a classification model. My training data has both discrete and continuous features. I want to do feature selection by maximizing mutual information. If I have feature vectors x and labels y, and the first three feature values are discrete, I can get the mutual information values like so:
mutual_info_classif(x, y, discrete_features=[0, 1, 2])
Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this:
SelectKBest(score_func=mutual_info_classif).fit(x, y)
but there's no way to pass the discrete features mask to SelectKBest. Is there some syntax to do this that I'm overlooking, or do I have to write my own score function wrapper?

Unfortunately, SelectKBest does not expose this functionality directly.
But what we can easily do is extend SelectKBest with a custom class that overrides the fit() method which gets called.
This is the current fit() method of SelectKBest (taken from the source on GitHub):
# No provision for extra parameters here
def fit(self, X, y):
    X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
    ....
    # Here only X and y are passed to the scoring function
    score_func_ret = self.score_func(X, y)
    ....
    self.scores_ = np.asarray(self.scores_)
    return self
Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (commented below):
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.utils import check_X_y

class SelectKBestCustom(SelectKBest):

    # Changed here: accept a discrete_features argument
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
        if not callable(self.score_func):
            raise TypeError("The score function should be a callable, %s (%s) "
                            "was passed."
                            % (self.score_func, type(self.score_func)))
        self._check_params(X, y)
        # Changed here also: forward discrete_features to the score function
        score_func_ret = self.score_func(X, y, discrete_features)
        if isinstance(score_func_ret, (list, tuple)):
            self.scores_, self.pvalues_ = score_func_ret
            self.pvalues_ = np.asarray(self.pvalues_)
        else:
            self.scores_ = score_func_ret
            self.pvalues_ = None
        self.scores_ = np.asarray(self.scores_)
        return self
This can be called simply like:
clf = SelectKBestCustom(mutual_info_classif, k=2)
clf.fit(X, y, discrete_features=[0, 1, 2])
Edit:
The above solution can also be used in pipelines, since the discrete_features parameter can be assigned a different value each time fit() is called, as the sketch below shows.
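A Pipeline routes fit parameters of the form stepname__param to the corresponding step's fit(), so discrete_features can be supplied per call. A minimal sketch, assuming some training data X, y and a hypothetical downstream classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('selector', SelectKBestCustom(mutual_info_classif, k=2)),
    ('clf', LogisticRegression()),
])
# 'selector__discrete_features' is forwarded to SelectKBestCustom.fit
pipe.fit(X, y, selector__discrete_features=[0, 1, 2])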
Another Solution (less preferable):
Still, if you just need SelectKBest to work with mutual_info_classif temporarily (just to analyse the results), you can also make a custom function which calls mutual_info_classif internally with hard-coded discrete_features. Something along the lines of:
def mutual_info_classif_custom(X, y):
    # To change discrete_features, you need to redefine the function each time,
    # because once the function is supplied to SelectKBest, it can't be changed
    discrete_features = [0, 1, 2]
    return mutual_info_classif(X, y, discrete_features)
Usage of the above function:
selector = SelectKBest(mutual_info_classif_custom).fit(X, y)

You could also use partials as follows:
from functools import partial

discrete_mutual_info_classif = partial(mutual_info_classif, discrete_features=[0, 1, 2])
SelectKBest(score_func=discrete_mutual_info_classif).fit(x, y)
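Because the partial bakes discrete_features in at construction time, this variant also drops straight into a pipeline with the stock SelectKBest, no subclassing or fit-parameter routing needed. A sketch, assuming the same x, y and a hypothetical classifier:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('select', SelectKBest(score_func=discrete_mutual_info_classif, k=2)),
    ('clf', LogisticRegression()),
])
pipe.fit(x, y)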


Sklearn Pipeline: One feature automatically missed out

I created a custom classifier (a dummy classifier). Below is its definition; I also added some print statements and global variables to capture values:
import numpy as np
from sklearn.base import ClassifierMixin

class FeaturePassThroughClassifier(ClassifierMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        global test_arr1
        self.classes_ = np.unique(y)
        test_arr1 = X
        print("1:", X.shape)
        return self

    def predict(self, X):
        global test_arr2
        test_arr2 = X
        print("2:", X.shape)
        return X

    def predict_proba(self, X):
        global test_arr3
        test_arr3 = X
        print("3:", X.shape)
        return X
Below is the StackingClassifier definition where the custom classifier defined above is one of the base classifiers. There are 3 more base classifiers (these are fitted estimators). The goal is to get the input training-set variables as-is (which will come out of the custom classifier), plus the predictions from base_classifier2, base_classifier3 and base_classifier4. These features will act as input to the meta classifier.
model = StackingClassifier(
    estimators=[
        ('select_features', Pipeline(steps=[
            ('model_feature_selector', ColumnTransformer([('feature_list', 'passthrough', X_train.columns)])),
            ('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
        ('base_classifier2', base_classifier2),
        ('base_classifier3', base_classifier3),
        ('base_classifier4', base_classifier4)
    ],
    final_estimator=Pipeline(
        memory=None,
        steps=[
            ('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)),
            ('final_model', RandomForestClassifier())
        ],
        verbose=True),
    passthrough=False,
    stack_method='predict_proba')
Below is the output on fitting the model. There are 230 variables.
Here is the problem: there are 230 variables, but the custom classifier's output shows only 229, which is strange. We can clearly see from the print statements above that 230 variables get passed through the custom classifier.
I need to use stack_method = "predict_proba". I am not sure what's going wrong here. The code works fine when stack_method = "predict".
Since this is a binary classification problem, predict_proba is expected to return two probability columns - one for the probability of class label "1" and another for "0".
Because the two columns are redundant, the stacker drops one of them; hence your 230 columns get reduced to 229. Add a dummy column to work around the problem.
In the Notes section of the documentation:
When predict_proba is used by each estimator (i.e. most of the time for stack_method='auto' or specifically for stack_method='predict_proba'), the first column predicted by each estimator will be dropped in the case of a binary classification problem.
The scikit-learn source contains the code that eliminates the first column.
You could add a sacrificial first column in your custom estimator's predict_proba, or switch to decision_function (which will cause differences depending on your real base estimators), or use the passthrough option instead of the custom estimator (doing the feature selection in the final_estimator object instead).
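For the first option, a minimal sketch (reusing the FeaturePassThroughClassifier from the question; the column values are arbitrary) would prepend one throwaway column so that the column the stacker drops is the dummy one:

import numpy as np
from sklearn.base import ClassifierMixin

class FeaturePassThroughClassifier(ClassifierMixin):
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return X

    def predict_proba(self, X):
        # Prepend one sacrificial column: for binary problems the stacker
        # drops the first predict_proba column, so X itself survives intact.
        dummy = np.zeros((X.shape[0], 1))
        return np.hstack([dummy, X])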
Both of the above solutions are on point. This is how I implemented the workaround with a dummy column:
Declare a custom transformer whose output is the column that gets dropped for the reasons explained above:
from sklearn.base import BaseEstimator, TransformerMixin

class add_dummy_column(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(type(X))
        return X[[self.key]]
Do a feature union where the above custom transformer and the column transformer are called to create the final dataframe. This duplicates the column that gets dropped. Below is the altered StackingClassifier definition with FeatureUnion:
model = StackingClassifier(
    estimators=[
        ('select_features', Pipeline(steps=[
            ('featureunion', FeatureUnion([
                ('add_dummy_column_to_input_dataframe', add_dummy_column(key='FEATURE_THAT_GETS_DROPPED')),
                ('model_feature_selector', ColumnTransformer([('feature_list', 'passthrough', X_train.columns)]))])),
            ('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
        ('base_classifier2', base_classifier2),
        ('base_classifier3', base_classifier3),
        ('base_classifier4', base_classifier4)
    ],
    final_estimator=Pipeline(
        memory=None,
        steps=[
            ('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)),
            ('final_model', RandomForestClassifier())
        ],
        verbose=True),
    passthrough=False,
    stack_method='predict_proba')

TypeError: this constructor takes no arguments. __init__() takes 1 positional argument but 4 were given

TypeError: this constructor takes no arguments
class CustomScaler(BaseEstimator, TransformerMixin):

    # init, i.e. what information we need to declare a CustomScaler object
    # and what is calculated/declared as we do
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # scaler is nothing but a StandardScaler object
        self.scaler = StandardScaler(copy, with_mean, with_std)
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    # the fit method, which, again, is based on StandardScaler
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    # the transform method, which does the actual scaling
    def transform(self, X, y=None, copy=None):
        # record the initial order of the columns
        init_col_order = X.columns
        # scale all features that you chose when creating the instance of the class
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        # a dataframe containing all the columns that were not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # return a dataframe which contains all scaled and all 'not scaled' features,
        # using the original column order (recorded at the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
unscaled_inputs.columns.values
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]
absenteeism_scaler = CustomScaler(columns_to_scale)
When I run the last line of code I get "__init__() takes 1 positional argument but 4 were given".
This may be a dumb question, but I am having a difficult time figuring out the error. I created a class called CustomScaler, but when I try running it, it gives me a TypeError. I tried changing init to have multiple underscores, but nothing works; I changed the class, the function, etc., and keep getting "TypeError: this constructor takes no arguments".
"TypeError: this constructor takes no arguments" is what Python 2 raises when a class defines no __init__ at all, which usually means the method name was misspelled (e.g. _init_ with single underscores instead of __init__). The "__init__() takes 1 positional argument but 4 were given" error most likely comes from the line StandardScaler(copy, with_mean, with_std): recent versions of scikit-learn make estimator parameters keyword-only, so they must be passed by name. Below is the full working class with both points addressed:
# import the libraries needed to create the Custom Scaler
# note that all of them are part of the sklearn package
# moreover, one of them is actually the StandardScaler module,
# so you can imagine that the Custom Scaler is built on it
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create the Custom Scaler class
class CustomScaler(BaseEstimator, TransformerMixin):

    # init, i.e. what information we need to declare a CustomScaler object
    # and what is calculated/declared as we do
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # scaler is nothing but a StandardScaler object;
        # pass its parameters by keyword (required by recent scikit-learn)
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    # the fit method, which, again, is based on StandardScaler
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    # the transform method, which does the actual scaling
    def transform(self, X, y=None, copy=None):
        # record the initial order of the columns
        init_col_order = X.columns
        # scale all features that you chose when creating the instance of the class
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        # a dataframe containing all the columns that were not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # return a dataframe which contains all scaled and all 'not scaled' features,
        # using the original column order (recorded at the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
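A quick usage sketch (with made-up data, not from the question; it reuses the imports and class above) to confirm the scaler leaves the omitted columns untouched:

df = pd.DataFrame({
    'Reason_1': [0, 1, 0],
    'Age': [33, 45, 29],
    'Transport Expense': [155, 235, 189],
})

scaler = CustomScaler(columns=['Age', 'Transport Expense'])
scaled = scaler.fit_transform(df)  # 'Reason_1' passes through unscaled
print(scaled)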

Curve fitting with known coefficients in Python

I tried using NumPy, SciPy and scikit-learn, but couldn't find what I need in any of them. Basically, I need to fit a curve to a dataset while restricting some of the coefficients to known values. I found how to do it in MATLAB, using fittype, but couldn't do it in Python.
In my case I have a dataset of X and Y, and I need to find the best-fitting curve. I know it's a polynomial of second degree (ax^2 + bx + c), and I know the values of b and c, so I just need to find the value of a.
The solution I found in MATLAB was https://www.mathworks.com/matlabcentral/answers/216688-constraining-polyfit-with-known-coefficients, which is the same problem as mine, except that their polynomial was of degree 5. How could I do something similar in Python?
To add some info: I need to fit a curve to a dataset, so things like scipy.optimize.curve_fit that expect a function won't work (at least as far as I tried).
The tools you have available usually expect functions that input only their parameters (a being the only unknown in your case), or their parameters plus some data (a, x and y in your case).
SciPy's curve_fit handles that use case just fine, as long as we hand it a function it understands. It expects x first and all your parameters as the remaining arguments:
from scipy.optimize import curve_fit
import numpy as np

b = 0
c = 0

def f(x, a):
    return c + x*(b + x*a)

x = np.linspace(-5, 5)
y = x**2

# params == [1.]
params, _ = curve_fit(f, x, y)
Alternatively you can reach for your favorite minimization routine. The difference here is that you manually construct the error function so that it only inputs the parameters you care about, and then you don't need to provide that data to scipy.
from scipy.optimize import minimize
import numpy as np

b = 0
c = 0
x = np.linspace(-5, 5)
y = x**2

def error(a):
    prediction = c + x*(b + x*a)
    return np.linalg.norm(prediction - y) / len(prediction)**.5

result = minimize(error, np.array([42.]))
assert result.success
# params == [1.]
params = result.x
I don't think scipy has a partially applied polynomial fit function built-in, but you could use either of the above ideas to easily build one yourself if you do that kind of thing a lot.
from scipy.optimize import curve_fit
import numpy as np

def polyfit(coefs, x, y):
    # build a mapping from null coefficient locations to locations in the
    # function coefficients we're passing to curve_fit
    #
    # idx[j]==i means that unknown_coefs[i] belongs in coefs[j]
    _tmp = [i for i, c in enumerate(coefs) if c is None]
    idx = {j: i for i, j in enumerate(_tmp)}

    def f(x, *unknown_coefs):
        # create the entire polynomial's coefficients by filling in the unknown
        # values in the right places, using the aforementioned mapping
        p = [(unknown_coefs[idx[i]] if c is None else c) for i, c in enumerate(coefs)]
        return np.polyval(p, x)

    # we're passing an initial value just so that scipy knows how many
    # parameters to use
    params, _ = curve_fit(f, x, y, np.zeros((sum(c is None for c in coefs),)))

    # return all the polynomial's coefficients, not just the few we just discovered
    return np.array([(params[idx[i]] if c is None else c) for i, c in enumerate(coefs)])

x = np.linspace(-5, 5)
y = x**2

# (unknown)x^2 + 0x + 0
# params == [1., 0., 0.]
params = polyfit([None, 0, 0], x, y)
Similar features exist in nearly every mainstream scientific library; you just might need to reshape your problem a bit to frame it in terms of the available primitives.
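As a side note: when the model is linear in its only unknown, as it is here (y - bx - c = a*x^2), you can also skip the iterative fit entirely and solve the least-squares problem directly. A minimal sketch, reusing the same b, c, x and y as above:

import numpy as np

b, c = 0, 0
x = np.linspace(-5, 5)
y = x**2

# y - (b*x + c) = a*x^2 is linear in a, so ordinary least squares gives a directly
residual = y - (b*x + c)
a, *_ = np.linalg.lstsq((x**2).reshape(-1, 1), residual, rcond=None)
print(a)  # ~[1.]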

Pytorch: Custom thresholding activation function - gradient

I created an activation function class Threshold that should operate on one-hot-encoded image tensors.
The function performs min-max feature scaling on each channel followed by thresholding.
import torch
import torch.nn as nn

class Threshold(nn.Module):
    def __init__(self, threshold=.5):
        super().__init__()
        if threshold < 0.0 or threshold > 1.0:
            raise ValueError("Threshold value must be in [0,1]")
        else:
            self.threshold = threshold

    def min_max_fscale(self, input):
        r"""
        applies min max feature scaling to input. Each channel is treated individually.
        input is assumed to be N x C x H x W (one-hot-encoded prediction)
        """
        for i in range(input.shape[0]):  # N
            for j in range(input.shape[1]):  # C
                min = torch.min(input[i][j])
                max = torch.max(input[i][j])
                input[i][j] = (input[i][j] - min) / (max - min)
        return input

    def forward(self, input):
        assert len(input.shape) == 4, f"input has wrong number of dims. Must have dim = 4 but has dim {input.shape}"
        input = self.min_max_fscale(input)
        return (input >= self.threshold) * 1.0
When I use the function I get the following error, since the gradients are not calculated automatically, I assume:
Variable._execution_engine.run_backward(RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
I already had a look at How to properly update the weights in PyTorch?, but could not figure out how to apply it to my case.
How is it possible to calculate the gradients for this function?
Thanks for your help.
The issue is that you are manipulating and overwriting elements in place; this type of operation can't be tracked by autograd. Instead, you should stick with built-in functions. Your example is not that tricky to tackle: you are looking to retrieve the minimum and maximum values over the last two dimensions for each of the input.shape[0] x input.shape[1] channels. Then you scale your whole tensor in one go, i.e. in vectorized form. No for loops involved!
One way to compute min/max along multiple axes is to flatten those:
>>> x_f = x.flatten(2)
Then, find the min-max on the flattened axis while retaining all shapes:
>>> x_min = x_f.min(axis=-1, keepdim=True).values
>>> x_max = x_f.max(axis=-1, keepdim=True).values
The resulting min_max_fscale function would look something like:
class Threshold(nn.Module):
    def min_max_fscale(self, x):
        r"""
        Applies min max feature scaling to input. Each channel is treated individually.
        Input is assumed to be N x C x H x W (one-hot-encoded prediction)
        """
        x_f = x.flatten(2)
        x_min, x_max = x_f.min(-1, True).values, x_f.max(-1, True).values
        x_f = (x_f - x_min) / (x_max - x_min)
        return x_f.reshape_as(x)
Important note:
You will notice that you can now backpropagate through min_max_fscale... but not through forward. This is because you are applying a boolean condition, which is not a differentiable operation.
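If you do need gradients end to end, one common workaround (a suggestion of mine, not something the layer above provides) is to replace the hard threshold with a steep sigmoid during training, which approximates the step while remaining differentiable:

import torch

def soft_threshold(x, threshold=0.5, temperature=50.0):
    # Smooth surrogate for (x >= threshold): the larger the temperature,
    # the closer the sigmoid gets to a hard step, while staying differentiable.
    return torch.sigmoid(temperature * (x - threshold))

x = torch.rand(2, 3, 4, 4, requires_grad=True)
y = soft_threshold(x)
y.sum().backward()  # gradients now flow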

How to pass parameter to fit function when using scipy.optimize.curve_fit

I am trying to fit some data that I have using scipy.optimize.curve_fit.
My fit function is:
def fitfun(x, a):
    return np.exp(a*(x - b))
What I want is to define a as the fitting parameter, and b as a parameter that changes depending on the data I want to fit. This means that for one set of data I would want to fit the function np.exp(a*(x - 10)), while for another set I would like to fit the function np.exp(a*(x - 20)). In principle, I would like the parameter b to be passed in as any value.
The way I am currently calling curve_fit is:
coeffs, coeffs_cov = curve_fit(fitfun, xdata, ydata)
But what I would like would be something like this:
b=10
coeffs, coeffs_cov = curve_fit(fitfun(b), xdata, ydata)
b=20
coeffs2, coeffs_cov2 = curve_fit(fitfun(b), xdata, ydata)
So that I get the coefficient a for both cases (b=10 and b=20).
I am new to Python, so I cannot make it work even though I have tried to read the documentation. Any help would be greatly appreciated.
I don't know if this is the "proper" way of doing things, but I usually wrap my function in a class, so that I can access parameters from self. Your example would then look like:
class fitClass:
    def __init__(self):
        pass

    def fitfun(self, x, a):
        return np.exp(a*(x - self.b))

inst = fitClass()
inst.b = 10
coeffs, coeffs_cov = curve_fit(inst.fitfun, xdata, ydata)
inst.b = 20
coeffs, coeffs_cov = curve_fit(inst.fitfun, xdata, ydata)
This approach avoids using global parameters, which are generally considered evil.
Let me also recommend lmfit (http://lmfit.github.io/lmfit-py/) and its Model class for this type of problem. Lmfit provides a higher-level abstraction for curve fitting and optimization problems.
With lmfit, each parameter in the model becomes an object that can be fixed, varied freely, or given upper and lower bounds without changing the fitting function. In addition, you can define multiple "independent variables" for any model.
That gives you two possible approaches. First, define parameters and fix b:
from lmfit import Model

def fitfun(x, a, b):
    return np.exp(a*(x - b))

# turn this model function into a Model:
mymodel = Model(fitfun)

# create parameters with initial values. Note that parameters are
# **named** according to the arguments of your model function:
params = mymodel.make_params(a=1, b=10)

# tell the 'b' parameter to not vary during the fit
params['b'].vary = False

# do fit
result = mymodel.fit(ydata, params, x=xdata)
print(result.fit_report())
params is not changed by the fit (the updated parameters are in result.params), so to fit another set of data you could just do:
params['b'].value = 20 # Note that vary is still False
result2 = mymodel.fit(ydata2, params, x=xdata2)
An alternative approach would be to define b as an independent variable:
mymodel = Model(fitfun, independent_vars=['x', 'b'])
params = mymodel.make_params(a=1)
result = mymodel.fit(ydata, params, x=xdata, b=10)
Lmfit has many other nice features for curve-fitting including composing complex models and evaluation of confidence intervals.
One really easy way to do this is to use the partial function from functools. All you have to do is the following. Note that b has to be given a fixed value in the partial; otherwise, I believe scipy.optimize.curve_fit would try to optimize b in addition to a:
from functools import partial

def fitfun(x, a, b):
    return np.exp(a*(x - b))

fitfun10 = partial(fitfun, b=10)
coeffs, coeffs_cov = curve_fit(fitfun10, xdata, ydata)

fitfun20 = partial(fitfun, b=20)
coeffs2, coeffs_cov2 = curve_fit(fitfun20, xdata, ydata)
You can define b as a global variable inside the fit function.
from scipy.optimize import curve_fit
import numpy as np

def fitfun(x, a):
    global b
    return np.exp(a*(x - b))

xdata = np.arange(10)

# first sample data set
ydata = np.exp(2 * (xdata - 10))
b = 10
coeffs, coeffs_cov = curve_fit(fitfun, xdata, ydata)
print(coeffs)

# second sample data set
ydata = np.exp(5 * (xdata - 20))
b = 20
coeffs, coeffs_cov = curve_fit(fitfun, xdata, ydata)
print(coeffs)
Output:
[2.]
[5.]
UPDATE:
Apologies for posting untested code. As pointed out by @mr-t, the code indeed throws an error. It seems the kwargs argument of curve_fit sets the keyword arguments of the underlying leastsq and least_squares functions, not the keyword arguments of the fit function itself.
In this case, in addition to the answers proposed by others, another possible solution is to redefine the fit function to return the error and to call leastsq directly, which allows extra arguments to be passed:
from scipy.optimize import leastsq

def fitfun(a, x, y, b):
    # leastsq expects the residuals, so return model minus data
    return np.exp(a*(x - b)) - y

b = 10
leastsq(fitfun, x0=1, args=(xdata, ydata, b))
