Sklearn Pipeline: One feature automatically missed out - scikit-learn

I created a custom classifier (a dummy classifier). Its definition is below. I also added some print statements and global variables to capture values.
class FeaturePassThroughClassifier(ClassifierMixin):
    def __init__(self):
        pass
    def fit(self, X, y):
        global test_arr1
        self.classes_ = np.unique(y)
        test_arr1 = X
        print("1:", X.shape)
        return self
    def predict(self, X):
        global test_arr2
        test_arr2 = X
        print("2:", X.shape)
        return X
    def predict_proba(self, X):
        global test_arr3
        test_arr3 = X
        print("3:", X.shape)
        return X
Below is the StackingClassifier definition, where the custom classifier defined above is one of the base classifiers. There are three more base classifiers (these are fitted estimators). The goal is to get the input training set variables as is (which will come out of the custom classifier) plus the predictions from base_classifier2, base_classifier3 and base_classifier4. These features will act as input to the meta classifier.
model = StackingClassifier(estimators=[
        ('select_features', Pipeline(steps=[
            ("model_feature_selector", ColumnTransformer([('feature_list', 'passthrough', X_train.columns)])),
            ('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
        ('base_classifier2', base_classifier2),
        ('base_classifier3', base_classifier3),
        ('base_classifier4', base_classifier4)
    ],
    final_estimator=Pipeline(memory=None,
        steps=[
            ('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)),
            ('final_model', RandomForestClassifier())
        ], verbose=True),
    passthrough=False, stack_method='predict_proba')
Below is the output on fitting the model. There are 230 variables:
Here is the problem: there are 230 variables, but the custom classifier's output shows only 229, which is strange. The print statements above clearly show that 230 variables get passed through the custom classifier.
I need to use stack_method = "predict_proba". I am not sure what is going wrong here. The code works fine when stack_method = "predict".

Since this is a binary classification problem, the stacking classifier expects predict_proba to return two probability columns, one for class label "0" and one for class label "1".
It drops one of them because the two are redundant, so your 230 columns are reduced to 229. Add a dummy column to solve your problem.

In the Notes section of the documentation:
When predict_proba is used by each estimator (i.e. most of the time for stack_method='auto' or specifically for stack_method='predict_proba'), The first column predicted by each estimator will be dropped in the case of a binary classification problem.
Here's the code that eliminates the first column.
You could add a sacrificial first column in your custom estimator's predict_proba, or switch to decision_function (which will cause differences depending on your real base estimators), or use the passthrough option instead of the custom estimator (doing feature selection in the final_estimator object instead).
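A minimal sketch of the first suggestion (the sacrificial column), using NumPy only. The class name `PaddedPassThrough` and the helper `drop_first_column` are made up for illustration; the helper merely imitates the documented behavior of dropping the first `predict_proba` column in the binary case and is not scikit-learn's actual code.

```python
import numpy as np


class PaddedPassThrough:
    """Hypothetical pass-through 'classifier' that pads its predict_proba output."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Prepend a column of zeros that the stacker is free to drop.
        padding = np.zeros((X.shape[0], 1))
        return np.hstack([padding, X])


def drop_first_column(proba):
    # Stand-in for what the stacker does for binary problems.
    return proba[:, 1:]


X = np.arange(12, dtype=float).reshape(4, 3)
y = np.array([0, 1, 0, 1])
proba = PaddedPassThrough().fit(X, y).predict_proba(X)
stacked = drop_first_column(proba)
print(proba.shape, stacked.shape)
```

With the padding in place, the column that gets dropped is the dummy one, so all original features should reach the final estimator.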

Both of the above solutions are on point. This is how I implemented the workaround with a dummy column:
Declare a custom transformer whose output is the column that gets dropped for the reasons explained above:
class add_dummy_column(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        print(type(X))
        return X[[self.key]]
Do a feature union in which the custom transformer above and the column transformer are both called to create the final dataframe. This duplicates the column that gets dropped. Below is the altered StackingClassifier definition with FeatureUnion:
model = StackingClassifier(estimators=[
        ('select_features', Pipeline(steps=[
            ('featureunion', FeatureUnion([
                ('add_dummy_column_to_input_dataframe', add_dummy_column(key='FEATURE_THAT_GETS_DROPPED')),
                ("model_feature_selector", ColumnTransformer([('feature_list', 'passthrough', X_train.columns)]))])),
            ('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
        ('base_classifier2', base_classifier2),
        ('base_classifier3', base_classifier3),
        ('base_classifier4', base_classifier4)
    ],
    final_estimator=Pipeline(memory=None,
        steps=[
            ('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)),
            ('final_model', RandomForestClassifier())
        ], verbose=True),
    passthrough=False, stack_method='predict_proba')

Related

TypeError: this constructor takes no arguments. __init__() takes 1 positional argument but 4 were given

TypeError: this constructor takes no arguments
unscaled_inputs.columns.values
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]
absenteeism_scaler = CustomScaler(columns_to_scale)
When I run the last line of code I get "__init__() takes 1 positional argument but 4 were given".
This may be a dumb question, but I am having a difficult time figuring out the error. I created a class called CustomScaler, but when I try running it, it gives me a TypeError. I tried changing __init__ to use a different number of underscores, but nothing works. I changed the class, the function, etc., and keep getting TypeError: this constructor takes no arguments.
# import the libraries needed to create the Custom Scaler
# note that the estimator classes are a part of the sklearn package
# moreover, one of them is actually the StandardScaler module,
# so you can imagine that the Custom Scaler is built on it
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
# create the Custom Scaler class
class CustomScaler(BaseEstimator,TransformerMixin):
    # init or what information we need to declare a CustomScaler object
    # and what is calculated/declared as we do
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        # scaler is nothing but a Standard Scaler object
        self.scaler = StandardScaler(copy,with_mean,with_std)
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None
    # the fit method, which, again, is based on StandardScaler
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self
    # the transform method which does the actual scaling
    def transform(self, X, y=None, copy=None):
        # record the initial order of the columns
        init_col_order = X.columns
        # scale all features that you chose when creating the instance of the class
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
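One likely cause, assuming a recent scikit-learn release: the parameters of `StandardScaler` are keyword-only there, so the positional call `StandardScaler(copy,with_mean,with_std)` inside `__init__` passes four positional arguments (counting `self`) to a constructor that accepts only one. The sketch below reproduces the mechanism with a stand-in class, `KeywordOnlyScaler`, which is hypothetical and only mimics the signature; it is not the real scaler.

```python
class KeywordOnlyScaler:
    """Stand-in with a keyword-only constructor (the bare * enforces it)."""

    def __init__(self, *, copy=True, with_mean=True, with_std=True):
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std


try:
    KeywordOnlyScaler(True, True, True)  # positional arguments are rejected
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the same values by name works.
scaler = KeywordOnlyScaler(copy=True, with_mean=True, with_std=True)
print(error_message)
```

If that is indeed the cause, writing `StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)` in `CustomScaler.__init__` should avoid the error.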

How to load data from multiple datasets in pytorch

I have two datasets of images - indoors and outdoors, they don't have the same number of examples.
Each dataset has images that contain a certain number of classes (minimum 1 maximum 4), these classes can appear in both datasets, and each class has 4 categories - red, blue, green, white.
Example:
Indoor - cats, dogs, horses
Outdoor - dogs, humans
I am trying to train a model where I tell it, "here is an image that contains a cat, tell me its color", regardless of where it was taken (indoors, outdoors, in a car, on the moon).
To do that, I need to present my model with examples so that every batch has only one category (cat, dog, horse or human), but I want to sample from all datasets (two in this case) that contain these objects and mix them. How can I do this?
It has to take into account that the number of examples in each dataset is different, and that some categories appear in only one dataset while others appear in more than one.
Each batch must contain only one category.
I would appreciate any help, I have been trying to solve this for a few days now.
Assuming the question is:
Combine 2+ data sets with potentially overlapping categories of objects (distinguishable by label)
Each object has 4 "subcategories" for each color (distinguishable by label)
Each batch should only contain a single object category
The first step will be to ensure consistency of the object labels from both data sets, if not already consistent. For example, if the dog class is label 0 in the first data set but label 2 in the second data set, then we need to make sure the two dog categories are correctly merged. We can do this "translation" with a simple data set wrapper:
class TranslatedDataset(Dataset):
    """
    Args:
        dataset: The original dataset.
        translate_label: A lambda (function) that maps the original
            dataset label to the label it should have in the combined data set
    """
    def __init__(self, dataset, translate_label):
        super().__init__()
        self._dataset = dataset
        self._translate_label = translate_label
    def __len__(self):
        return len(self._dataset)
    def __getitem__(self, idx):
        inputs, target = self._dataset[idx]
        return inputs, self._translate_label(target)
The next step is combining the translated data sets together, which can be done easily with a ConcatDataset:
first_original_dataset = ...
second_original_dataset = ...
first_translated = TranslatedDataset(
    first_original_dataset,
    lambda y: 0 if y == 2 else 2 if y == 0 else y, # or similar
)
second_translated = TranslatedDataset(
    second_original_dataset,
    lambda y: y, # or similar
)
combined = ConcatDataset([first_translated, second_translated])
Finally, we need to restrict batch sampling to the same class, which is possible with a custom Sampler when creating the data loader.
class SingleClassSampler(torch.utils.data.Sampler):
    def __init__(self, dataset, batch_size):
        super().__init__()
        # We need to create sequential groups
        # with batch_size elements from the same class
        indices_for_target = {} # dict to store a list of indices for each target
        for i, (_, target) in enumerate(dataset):
            # converting to string since Tensors hash by reference, not value
            str_targ = str(target)
            if str_targ not in indices_for_target:
                indices_for_target[str_targ] = []
            indices_for_target[str_targ] += [i]
        # make sure we have a whole number of batches for each class
        # (slice by the kept length: v[:-0] would return an empty list)
        trimmed = {
            k: v[:len(v) - len(v) % batch_size]
            for k, v in indices_for_target.items()
        }
        # concatenate the lists of indices for each class
        # (sum needs an explicit empty-list start value to add lists)
        self._indices = sum(trimmed.values(), [])
    def __len__(self):
        return len(self._indices)
    def __iter__(self):
        yield from self._indices
Then to use the sampler:
loader = DataLoader(
    combined,
    sampler=SingleClassSampler(combined, 64),
    batch_size=64,
    # no shuffle=True here: DataLoader rejects shuffle together with a custom sampler
)
I haven't run this code, so it might not be exactly right, but hopefully it will put you on the right track.
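As an alternative sketch, the grouping can be expressed as a list of index batches, which is the kind of iterable that `DataLoader` accepts through its `batch_sampler` argument. The function name `single_class_batches` is hypothetical, labels are assumed to be plain hashable values, and only the standard library is used, so this shows the batching logic rather than a drop-in PyTorch sampler.

```python
import random
from collections import defaultdict


def single_class_batches(labels, batch_size, seed=0):
    """Return shuffled batches of indices; each batch holds one label only."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    batches = []
    for indices in by_label.values():
        rng.shuffle(indices)  # mix samples coming from the different source datasets
        # Keep only full batches; the remainder of each class is dropped.
        full = len(indices) - len(indices) % batch_size
        for start in range(0, full, batch_size):
            batches.append(indices[start:start + batch_size])

    rng.shuffle(batches)  # vary the order in which classes are visited
    return batches


labels = ["cat"] * 5 + ["dog"] * 4 + ["human"] * 2
batches = single_class_batches(labels, batch_size=2)
for batch in batches:
    print(batch, {labels[i] for i in batch})
```

Shuffling inside each class and then shuffling the batches keeps every batch single-class while still randomizing the order, which a plain sequential sampler would not do.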
torch.utils.data Docs

Find wrongly categorized samples from validation step

I am using a keras neural net for identifying category in which the data belongs.
self.model.compile(loss='categorical_crossentropy',
                   optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
                   metrics=[categorical_accuracy])
Fit function
history = self.model.fit(self.X,
                         {'output': self.Y},
                         validation_split=0.3,
                         epochs=400,
                         batch_size=32
                         )
I am interested in finding out which labels are getting categorized wrongly in the validation step. Seems like a good way to understand what is happening under the hood.
You can use model.predict_classes(validation_data) to get the predicted classes for your validation data, and compare these predictions with the actual labels to find out where the model was wrong. Something like this:
predictions = model.predict_classes(validation_data)
wrong = np.where(predictions != Y_validation)
If you are interested in looking 'under the hood', I'd suggest to use
model.predict(validation_data_x)
to see the scores for each class, for each observation of the validation set.
This should shed some light on which categories the model is not so good at classifying. The way to predict the final class is
scores = model.predict(validation_data_x)
preds = np.argmax(scores, axis=1)
Be sure to use the proper axis for np.argmax (I'm assuming the classes are along axis 1, with one row per observation). Then compare preds with the real classes.
Also, if you want to see the overall accuracy on this dataset, use
model.evaluate(x=validation_data_x, y=validation_data_y)
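A small NumPy sketch of the comparison step, with made-up scores and one-hot labels standing in for the output of `model.predict` and the validation targets (both arrays are assumptions for illustration):

```python
import numpy as np

# Hypothetical scores for 4 validation samples and 3 classes.
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
])
# One-hot ground truth for the same samples.
y_true_onehot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

preds = np.argmax(scores, axis=1)          # predicted class per sample
truth = np.argmax(y_true_onehot, axis=1)   # actual class per sample
wrong_idx = np.flatnonzero(preds != truth)
for i in wrong_idx:
    print(f"sample {i}: predicted {preds[i]}, actual {truth[i]}")
```

With validation_split, Keras takes the validation samples from the end of the arrays passed to fit (before shuffling), so the matching tail of self.X and self.Y can be used as the validation data here.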
I ended up creating a metric which prints the "worst performing category id + score" on each iteration. Ideas from link
import tensorflow as tf
import numpy as np
class MaxIoU(object):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes
    def max_iou(self, y_true, y_pred):
        # Wraps np_max_iou method and uses it as a TensorFlow op.
        # Takes numpy arrays as its arguments and returns numpy arrays as
        # its outputs.
        return tf.py_func(self.np_max_iou, [y_true, y_pred], tf.float32)
    def np_max_iou(self, y_true, y_pred):
        # Compute the confusion matrix to get the number of true positives,
        # false positives, and false negatives
        # Convert predictions and target from categorical to integer format
        target = np.argmax(y_true, axis=-1).ravel()
        predicted = np.argmax(y_pred, axis=-1).ravel()
        # Trick from torchnet for bincounting 2 arrays together
        # https://github.com/pytorch/tnt/blob/master/torchnet/meter/confusionmeter.py
        x = predicted + self.num_classes * target
        bincount_2d = np.bincount(x.astype(np.int32), minlength=self.num_classes**2)
        assert bincount_2d.size == self.num_classes**2
        conf = bincount_2d.reshape((self.num_classes, self.num_classes))
        # Compute the per-class counts from the confusion matrix
        true_positive = np.diag(conf)
        false_positive = np.sum(conf, 0) - true_positive
        false_negative = np.sum(conf, 1) - true_positive
        # Just in case we get a division by 0, ignore/hide the error and set the value to 0
        # Note: the numerator is false_positive, so a higher value means a worse class
        # (the usual IoU would use true_positive instead)
        with np.errstate(divide='ignore', invalid='ignore'):
            iou = false_positive / (true_positive + false_positive + false_negative)
        iou[np.isnan(iou)] = 0
        # Worst score plus the id of the worst class, packed into one float
        return np.max(iou).astype(np.float32) + np.argmax(iou).astype(np.float32)
usage:
custom_metric = MaxIoU(len(catagories))
self.model.compile(loss='categorical_crossentropy',
                   optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
                   metrics=[categorical_accuracy, custom_metric.max_iou])
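The bincount trick used in np_max_iou can be checked on a tiny example. The labels below are made up; rows of the resulting matrix are the true classes and columns the predicted ones.

```python
import numpy as np

num_classes = 3
target = np.array([0, 0, 1, 2, 2, 2])      # true class ids
predicted = np.array([0, 1, 1, 2, 0, 2])   # predicted class ids

# Encode each (target, predicted) pair as one integer, then count occurrences.
pair_ids = predicted + num_classes * target
conf = np.bincount(pair_ids, minlength=num_classes ** 2).reshape(num_classes, num_classes)
print(conf)

# Off-diagonal entries in a row are the misclassified samples of that true class.
misclassified_per_class = conf.sum(axis=1) - np.diag(conf)
print(misclassified_per_class)
```

Reading the off-diagonal cells directly gives the wrongly categorized counts per class, which may be easier to interpret than a single packed metric value.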

How to combine features with different dimensions output using scikit-learn

I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to documents with different lengths. My goal is to compute the top tfidf for each document independently, but I keep getting this error message:
ValueError: blocks[0,:] has incompatible row dimensions. Got
blocks[0,1].shape[0] == 1, expected 2000.
2000 is the size of the training data.
This is the main code:
book_summary= Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])
book_contents= Pipeline([('selector3', book_content_count())])
ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced') ) # classifier with cross fold 5
])
I wrote two classes to handle each pipeline function. My problem is with the book_contents pipeline, which mainly deals with each sample and returns a TF-IDF matrix for each book independently.
class book_content_count():
    def count_contents2(self, bookid):
        book = open('C:/TheCorpus/'+str(int(bookid))+'_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1',error_bad_lines=False,dtype=str)
        # originally user_data, which is undefined here; book_data is assumed to be the intended frame
        corpus=(str([book_data['text']]).strip('[]'))
        return corpus
    def transform(self, data_dict, y=None):
        data_dict['bookid'] #from here take the name
        text=data_dict['bookid'].apply(self.count_contents2)
        vec_pipe= Pipeline([('vec', TfidfVectorizer(min_df = 1,lowercase = False, ngram_range = (1,1), use_idf = True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr
    def fit(self, x, y=None):
        return self
Sample of data (example):
title Summary bookid
The beauty and the beast is a traditional fairy tale... 10
ocean at the end of the lane is a 2013 novel by British 11
Then each id will refer to a text file with the actual contents of these books
I have tried toarray and reshape functions but with no luck. Any idea how to solve this issue.
Thanks
You can use Neuraxle's Feature Union with a custom joiner that you would need to code yourself. The joiner is a class passed to Neuraxle's FeatureUnion to merge results together in the way you expected.
1. Import Neuraxle's classes.
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
2. Define your custom class by inheriting from BaseStep:
class BookContentCount(BaseStep):
    def transform(self, data_dict, y=None):
        transformed = do_things(...) # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed
    def fit(self, x, y=None):
        return self
3. Create a joiner to join the results of the feature union the way you wish:
class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)
    # def fit: is inherited from `NonFittableMixin` and simply returns self.
    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
4. Finally create your pipeline by passing the joiner to the FeatureUnion:
book_summary= Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])
p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ],
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])
Note: if you want to turn your Neuraxle pipeline back into a scikit-learn pipeline, you can do p = p.tosklearn().
To learn more on Neuraxle:
https://github.com/Neuraxio/Neuraxle
More examples from the documentation:
https://www.neuraxle.org/stable/examples/index.html
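Independently of the library used, the error message says that one block has 1 row where 2000 were expected: stacking feature blocks side by side requires every transformer to return one row per sample. A minimal NumPy sketch of that constraint, with made-up array sizes and a hypothetical helper `row_counts` (scikit-learn's FeatureUnion uses its own stacking routine, so this only illustrates the shape rule):

```python
import numpy as np


def row_counts(blocks):
    """Return the number of rows of each feature block."""
    return [np.asarray(block).shape[0] for block in blocks]


n_samples = 5
summary_features = np.ones((n_samples, 4))   # one row per sample
content_features = np.ones((1, 3))           # a single row: mismatched

counts = row_counts([summary_features, content_features])
print(counts)

try:
    np.hstack([summary_features, content_features])
    stacked_ok = True
except ValueError:
    stacked_ok = False

# With one row per sample in every block, the blocks stack side by side.
fixed = np.hstack([summary_features, np.ones((n_samples, 3))])
print(stacked_ok, fixed.shape)
```

Printing the row count of each transformer's output before the union is a quick way to find which step collapses the samples into a single row.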

How do I SelectKBest using mutual information from a mixture of discrete and continuous features?

I am using scikit learn to train a classification model. I have both discrete and continuous features in my training data. I want to do feature selection using maximum mutual information. If I have vectors x and labels y and the first three feature values are discrete I can get the MMI values like so:
mutual_info_classif(x, y, discrete_features=[0, 1, 2])
Now I'd like to use the same mutual information selection in a pipeline. I'd like to do something like this
SelectKBest(score_func=mutual_info_classif).fit(x, y)
but there's no way to pass the discrete features mask to SelectKBest. Is there some syntax to do this that I'm overlooking, or do I have to write my own score function wrapper?
Unfortunately I could not find this functionality for the SelectKBest.
But what we can do easily is extend the SelectKBest as our custom class to override the fit() method which will be called.
This is the current fit() method of SelectKBest (taken from source at github)
# No provision for extra parameters here
def fit(self, X, y):
    X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
    ....
    ....
    # Here only the X, y are passed to scoring function
    score_func_ret = self.score_func(X, y)
    ....
    ....
    self.scores_ = np.asarray(self.scores_)
    return self
Now we will define our new class SelectKBestCustom with the changed fit(). I have copied everything from the above source, changing only two lines (commented about it):
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.utils import check_X_y
class SelectKBestCustom(SelectKBest):
    # Changed here
    def fit(self, X, y, discrete_features='auto'):
        X, y = check_X_y(X, y, ['csr', 'csc'], multi_output=True)
        if not callable(self.score_func):
            raise TypeError("The score function should be a callable, %s (%s) "
                            "was passed."
                            % (self.score_func, type(self.score_func)))
        self._check_params(X, y)
        # Changed here also (passed by keyword, because recent scikit-learn
        # versions make discrete_features a keyword-only parameter)
        score_func_ret = self.score_func(X, y, discrete_features=discrete_features)
        if isinstance(score_func_ret, (list, tuple)):
            self.scores_, self.pvalues_ = score_func_ret
            self.pvalues_ = np.asarray(self.pvalues_)
        else:
            self.scores_ = score_func_ret
            self.pvalues_ = None
        self.scores_ = np.asarray(self.scores_)
        return self
This can be called simply like:
clf = SelectKBestCustom(mutual_info_classif,k=2)
clf.fit(X, y, discrete_features=[0, 1, 2])
Edit:
The above solution can be useful in pipelines also, and the discrete_features parameter can be assigned different values when calling fit().
Another Solution (less preferable):
Still, if you just need to work SelectKBest with mutual_info_classif, temporarily (just analysing the results), we can also make a custom function which can call mutual_info_classif internally with hard coded discrete_features. Something along the lines of:
def mutual_info_classif_custom(X, y):
    # To change discrete_features,
    # you need to redefine the function each time
    # Because once the func def is supplied to selectKBest, it cant be changed
    discrete_features = [0, 1, 2]
    return mutual_info_classif(X, y, discrete_features=discrete_features)
Usage of the above function:
selector = SelectKBest(mutual_info_classif_custom).fit(X, y)
You could also use partials as follows:
from functools import partial
discrete_mutual_info_classif = partial(mutual_info_classif, discrete_features=[0, 1, 2])
SelectKBest(score_func=discrete_mutual_info_classif).fit(x, y)
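To see why the partial works, here is a standard-library sketch. `score_with_mask` is a hypothetical stand-in for a score function with a keyword-only `discrete_features` parameter, and `call_like_selector` imitates a selector that only ever calls `score_func(X, y)`; neither is scikit-learn code.

```python
from functools import partial


def score_with_mask(X, y, *, discrete_features="auto"):
    """Hypothetical score function; returns one dummy score per feature."""
    n_features = len(X[0])
    discrete = set() if discrete_features == "auto" else set(discrete_features)
    # Dummy scores: 1.0 for features flagged as discrete, 0.0 otherwise.
    return [1.0 if j in discrete else 0.0 for j in range(n_features)]


def call_like_selector(score_func, X, y):
    # The selector passes only X and y; extra arguments must be pre-bound.
    return score_func(X, y)


X = [[0, 1, 2, 0.5], [1, 0, 2, 0.7]]
y = [0, 1]
bound = partial(score_with_mask, discrete_features=[0, 1, 2])
scores = call_like_selector(bound, X, y)
print(scores)
```

Because the mask is bound before the function is handed over, the selector's two-argument call still reaches the score function with `discrete_features` set.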

Resources