I've been learning and practicing sklearn library on my own. When I participated Kaggle competitions, I noticed the provided sample code used BaseEstimator from sklearn.base.
I don't quite understand how/why is BaseEstimator used.
from sklearn.base import BaseEstimator
class FeatureMapper:
def __init__(self, features):
self.features = features #features contains feature_name, column_name, and extractor( which is CountVectorizer)
def fit(self, X, y=None):
for feature_name, column_name, extractor in self.features:
extractor.fit(X[column_name], y) #my question is: is X features? if yes, where is it assigned? or else how can X call column_name by X[column_name].
...
This is what I usually see on sklearn's tutorial page:
from sklearn import SomeClassifier
X = [[0, 0], [1, 1],[2, 2],[3, 3]]
Y = [0, 1, 2, 3]
clf = SomeClassifier()
clf = clf.fit(X, Y)
I couldn't find a good example or any documentations on sklearn's official page. Although I found the sklearn.base code on github, but I'd like some examples and explanation of how is it used.
UPDATE
Here is the link for the sample code: https://github.com/benhamner/JobSalaryPrediction/blob/master/features.py
Correction: I just realized BaseEstimator is used for the class SimpleTransform. I guess my first question is why is it needed? (because it's not used anywhere in the computation), the other question is when define fit, what is X, and how is assigned? Because usually I see:
def mymethod(self, X, y=None):
X=self.features
# then do something to X[Column_name]
BaseEstimator provides among other things a default implementation for the get_params and set_params methods, see [the source code]. This is useful to make the model grid search-able with GridSearchCV for automated parameters tuning and behave well with others when combined in a Pipeline.
Related
By default, scalers from Sklearn work column-wise. But i need my data to be scaled line-wise, so i did the following:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
# %% Generating sample data
x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])
#%% Train/Test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T # scaling line-wise
x_test = scaler.transform(x_test) <-------- Error here
But i am getting the following error:
ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.
I don't understand whats wrong here. Why it says it is expecting 2 features, when all my X (x, x_train and x_test) has 3 features? How can i fix this?
StandardScaler is stateful: when you fit it, it calculates and saves the columns' means and standard deviations; when transforming (train or test sets), it uses those saved statistics. Your transpose trick doesn't work with that: each row has saved statistics, and then your test set doesn't have the same rows, so transform cannot work correctly (throwing an error if different number of rows, or silently mis-scaling if the same number of rows).
What you want isn't stateful: test sets should be transformed completely independently of the training set. Indeed, every row should be transformed independently of each other. So you could just do this kind of transformation before splitting, or using fit_transform on the test set('s transpose).
For l2 normalization of rows, there's a builtin for this: Normalizer (docs). I don't think there's an analogue for min-max normalization, but I think you could write a FunctionTransformer to do it.
This is possible to do. I can think of a scenario where this would be useful. Normally, MinMaxScaler would scale each x, y, and z with respect to other observations of that feature. That's the "series" scaling. Now imagine that instead, you wanted to map each point constrained by x+y+z = 1. I think this is what OP is asking for. I have done this in the past, I will describe how I did it.
You need to treat your individual observations as a column multi-index and treat it like a higher-dimensional feature. Then, you need to build a pipeline within which the observations are transformed from column-wise to row wise, post which you do the min/max scaling. This gets you to x+y+z=1, but you still need to get back to the original shape of the data, for which you will need to track the index of each observation. Within the pipeline, you'll need to use something like a DataframeFunctionTransformer which I have seen on the interwebs, reproducing it below. This way you can use pandas functions to shape the data and merge back in with the indices.
class DataframeFunctionTransformer():
def __init__(self, func):
self.func = func
def transform(self, input_df, **transform_params):
return self.func(input_df)
def fit(self, X, y=None, **fit_params):
return self
Regarding the statefulness of MinMaxScaler, I think in a scenario such as this, the state of MinMaxScaler doesn't get used, it is purely acting as a transformer that maps these points to a different space meeting the constraint that x, y, and z are scaled such that they add up to 1.
#Murilo hope this gets you started with a solution. Must be an interesting problem.
I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
class custom_poly_features(TransformerMixin, BaseEstimator):
def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
self.degree_ = degree
self.poly_features_ = poly_features
def fit(self, X, y=None):
# Return the classifier
return self
def transform(self, X, y=None):
poly_feat = PolynomialFeatures(degree=self.degree_)
OneHot = OneHotEncoder(sparse=False)
not_poly_features = list(set(X.columns) - set(self.poly_features_))
poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
return poly
def get_params(self, deep=True):
return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])
#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine and the pipeline also works (although the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
mask=[False, False, False],
fill_value='?',
dtype=object),
'params': [{'custom_poly_features__degree': 3},
{'custom_poly_features__degree': 4},
{'custom_poly_features__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.
To debug searches in general, set error_score='raise', so that you get a full error traceback.
Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by #Sanjar Adylov not only highlights an important issue, but the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, and so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
So the fix there is also as Sanjar says: instantiate, store as attributes, and fit the two transformers and in your fit method, and use their transform methods in your transform method.
You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly, because in __init__ you've used mismatching names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similar to how you edited get_params, but it would be much easier to actually rely on the BaseEstimator versions of those and just match the parameter names to the attribute names.)
Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
class custom_poly_features(TransformerMixin, BaseEstimator):
def __init__(self, degree=2, poly_features=['year', 'odometer']):
self.degree = degree
self.poly_features = poly_features
def fit(self, X, y=None):
self.poly_feat = PolynomialFeatures(degree=self.degree)
self.onehot = OneHotEncoder(sparse=False)
self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))
self.poly_feat.fit(X[self.poly_features])
self.onehot.fit(X[self.not_poly_features_])
return self
def transform(self, X, y=None):
poly = self.poly_feat.transform(X[self.poly_features])
poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])
return poly
There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.
custom_poly = ColumnTransformer(
transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
remainder=OneHotEncoder(),
)
param_grid = {"cpf__poly__degree": [3, 4, 5]}
I am training a dynamic neural network, meaning that each epoch I tweak the architecture and get a different computational graph.
I want to plot the graph for each epoch using tensorboard, but when I use SummaryWriter.add_graph() at the end of each epoch it simply overwrites the previous one.
Any ideas how to plot several graphs using pytorch + tensorboard? It seems achievable as each graph has a “tag” but I found no option to change this tag to plot several of them.
Thanks,
Elad
If you still want to use SummaryWriter, there is the optionn of using the method "add_scalars":
Example:
summary.add_scalars(f'loss/check_info', {
'score': score[iteration],
'score_nf': score_nf[iteration],
}, iteration)
Instead of using the "tag" feature, you can use the "run" feature.
To do so, you have to open tensorboard from a directory within which you stored your summaries in distinct sub-directories.
In your example, you could save the summary of the first epoch at the directory "tensorboard_log_dir/epoch_1", and then save the summary of the second epoch at the directory "tensorboard_log_dir/epoch_2", etc.
That way, when using tensorboard --logdir=tensorboard_log_dir, you'll be able to switch from one computational graph to another via the "run" widget.
Here's a reproducible example:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter
dummy_input = (torch.zeros(1, 3),)
# Two different architectures (PyTorch)
class oneLinear(nn.Module):
def __init__(self):
super(oneLinear, self).__init__()
self.l1 = nn.Linear(3, 5)
def forward(self, x):
x = self.l1(x)
return x
class twoLinear(nn.Module):
def __init__(self):
super(twoLinear, self).__init__()
self.l1 = nn.Linear(3, 5)
self.l2 = nn.Linear(5, 5)
def forward(self, x):
x = self.l1(x)
x = F.relu(self.l2(x))
return x
# add graph into 2 distinct subdirectories
with SummaryWriter('./tensorboard_log_dir/oneLinear') as w:
w.add_graph(oneLinear(), dummy_input)
with SummaryWriter('./tensorboard_log_dir/twoLinear') as w:
w.add_graph(twoLinear(), dummy_input)
I am not able to build vocabulary and getting an error:
TypeError: 'int' object is not iterable
Here is my code that is based on medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I tried to provide pandas series, list to build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
tokens = []
for sent in nltk.sent_tokenize(text):
for word in nltk.word_tokenize(sent):
if len(word) < 2:
continue
tokens.append(word.lower())
return tokens
df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")
tags_index = {
"sci-fi": 1,
"action": 2,
"comedy": 3,
"fantasy": 4,
"animation": 5,
"romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]
mylist = list()
for i, q in df.iterrows():
mylist.append(
TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
)
df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
dm=1,
vector_size=300,
negative=5,
hs=0,
min_count=2,
sample=0,
workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.
Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as is fed to build_vocab() or train()).
When showing an error, you should also show the full stack that accompanied it, so that it is clear what line-of-code, and surrounding call-frames, are involved.
But, it's unclear if what you've fed into the dataframe, then out via dataframe-bracket-access, then through the train_test_split(), is actually that.
So I'd suggest assigning things to descriptive interim variables, and verifying that they contain the right sorts of things at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list-of-strings, and tags property a list-of-tags? (And, where each tag is probably a string, but could perhaps be a plain-int, counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded, and very error-prone, and in fact in that article results in severe learning-rate alpha mismanagement. (For example, deducting 0.002 from the starting-default alpha of 0.025 30 times results in a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.
Is there a way to set different class weights for xgboost classifier? For example in sklearn RandomForestClassifier this is done by the "class_weight" parameter.
For sklearn version < 0.19
Just assign each entry of your train data its class weight. First get the class weights with class_weight.compute_class_weight of sklearn then assign each row of the train data its appropriate weight.
I assume here that the train data has the column class containing the class number. I assumed also that there are nb_classes that are from 1 to nb_classes.
from sklearn.utils import class_weight
classes_weights = list(class_weight.compute_class_weight('balanced',
np.unique(train_df['class']),
train_df['class']))
weights = np.ones(y_train.shape[0], dtype = 'float')
for i, val in enumerate(y_train):
weights[i] = classes_weights[val-1]
xgb_classifier.fit(X, y, sample_weight=weights)
Update for sklearn version >= 0.19
There is simpler solution
from sklearn.utils import class_weight
classes_weights = class_weight.compute_sample_weight(
class_weight='balanced',
y=train_df['class']
)
xgb_classifier.fit(X, y, sample_weight=classes_weights)
when using the sklearn wrapper, there is a parameter for weight.
example:
import xgboost as xgb
exgb_classifier = xgboost.XGBClassifier()
exgb_classifier.fit(X, y, sample_weight=sample_weights_data)
where the parameter shld be array like, length N, equal to the target length
I recently ran into this problem, so thought will leave a solution I tried
from xgboost import XGBClassifier
# manually handling imbalance. Below is same as computing float(18501)/392318
on the trainig dataset.
# We are going to inversely assign the weights
weight_ratio = float(len(y_train[y_train == 0]))/float(len(y_train[y_train ==
1]))
w_array = np.array([1]*y_train.shape[0])
w_array[y_train==1] = weight_ratio
w_array[y_train==0] = 1- weight_ratio
xgc = XGBClassifier()
xgc.fit(x_df_i_p_filtered, y_train, sample_weight=w_array)
Not sure, why but the results were pretty disappointing. Hope this helps someone.
[Reference link] https://www.programcreek.com/python/example/99824/xgboost.XGBClassifier
from sklearn.utils.class_weight import compute_sample_weight
xgb_classifier.fit(X, y, sample_weight=compute_sample_weight("balanced", y))
The answers here are outdated. THe sample_weight parameter is no longer supported. Its replaced with scale_pos_weight
Rather just do scale_pos_weight = sum(negative instances) / sum(positive instances)
You can alternatively use the scale_pos_weight hyperparameter, as discussed in the XGBoost docs. The advantage of this approach is that you don't have to construct the sample weight vector, and don't have to pass in the sample weight vector at fit time.
Similar to #Firas Omrane and #Pramit answer, but I think it is slightly more pythonic
from sklearn.utils import class_weight
class_weights = dict(
zip(
[0,1],
class_weight.compute_class_weight(
'balanced', classes=np.unique(train['class']), y=train['class']
),
)
)
xgb_classifier.fit(X, train['class'], sample_weight=class_weights)