Outlier elimination in a imblearn pipeline affecting both X and y - scikit-learn

I aim to integrate outlier elimination into a machine learning pipeline with a continuous dependent variable. The challenge is to keep X and y at the same length, thus I have eliminate outliers in both datasets.
As this task turned out to be difficult or impossible using sklearn, I switched to imblearn and FunctionSampler. Inspired by the documentation, I tried the following code:
from imblearn import FunctionSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
def outlier_rejection(X, y):
model = IsolationForest(max_samples=100, contamination=0.4, random_state=rng)
model.fit(X)
y_pred = model.predict(X)
return X[y_pred == 1], y[y_pred == 1]
pipe = make_pipeline(
FunctionSampler(func = outlier_rejection),
LinearRegression()
)
pipe.fit(X_train, y_train) # y_train is a continuous variable!
However, when I tried to fit the pipeline I got the error
ValueError: Unknown label type: 'continuous'
which I think is because my dependent variable is continuous.
I suspect that imblearn is only compatible with nominal data. Is that true? If yes, is there another approach to solve my problem (e.g. with classic sklearn pipeline)? If not, where did I make a mistake in the code above?

Related

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn and I have performed the following steps. However, when it comes to predicting y_pred using the trained model I am getting a perfect r^2 = 1.0. Does anyone know why this is the case/what's going wrong with my code?
Also sorry I'm new to this site so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
'Pop_density_MOE',
'Age_median_MOE',
'Sex_ratio_MOE',
'Income_median_household_MOE',
'Pop_total_pov_status_determ_MOE',
'Pop_total_50percent_pov_MOE',
'Pop_total_125percent_pov_MOE',
'Poverty_percent_below_MOE',
'Total_labourforceMOE',
'Unemployed_total_MOE',
'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
how='any',
inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So turns out it was a stupid mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns haha, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
With regression, it's often a good idea to run a correlation matrix against all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and just leaving one in the model) is better for R^2 value (aka model fit). Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestCalssifier with the same dataset. For these, predict_proba yields the expected probability results (i.e, float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but not sure. Also, this issue only occurs if I standardize training set data. So far I've tried both StandardScaler and MinMaxScaler, same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
random_state=58, solver='saga', tol=0.0001, verbose=0,
warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd because the data is not a uniform scale across all features. For instance, some features are represented as percents, others are represented as dollar values, and others are dummy coded representations.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with a unit variance of 1.0 (X_scaled), the variance of the data it's being tested on (X_test) is much higher than expected. It's no surprise then that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.

scikit-learn pipelines: Normalising after PCA produces undesired random results

I am running a pipeline that normalises the inputs, runs PCA, normalises PCA factors before finally running a logistic regression.
However, I am getting variable results on the confusion matrix I produce.
I am finding that, if I remove the 3rd step ("normalise_pca" ), my results are constant.
I have set random_state=0 for all the pipeline steps I can. Any idea why I am getting variable results?
def exp2_classifier(X_train, y_train):
estimators = [('robust_scaler', RobustScaler()),
('reduce_dim', PCA(random_state=0)),
('normalise_pca', PowerTransformer()), #I applied this as the distribution of the PCA factors were skew
('clf', LogisticRegression(random_state=0, solver="liblinear"))]
#solver specified here to suppress warnings, it doesn't seem to effect gridSearch
pipe = Pipeline(estimators)
return pipe
exp2_eval = Evaluation().print_confusion_matrix
logit_grid = Experiment().run_experiment(asdp.data, "heavy_drinker", exp2_classifier, exp2_eval);
I am not able to reproduce your error. I have tried other sample dataset from sklearn but got consistent results for multiple runs. Hence, the variance may not be due to normalize_pca
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler,PowerTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
from sklearn.model_selection import train_test_split
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)
estimators = [('robust_scaler', RobustScaler()),
('reduce_dim', PCA(random_state=0)),
('normalise_pca', PowerTransformer()), #I applied this as the distribution of the PCA factors were skew
('clf', LogisticRegression(random_state=0, solver="liblinear"))]
#solver specified here to suppress warnings, it doesn't seem to effect gridSearch
pipe = Pipeline(estimators)
pipe.fit(X_train,y_train)
print('train data :')
print(confusion_matrix(y_train,pipe.predict(X_train)))
print('test data :')
print(confusion_matrix(y_eval,pipe.predict(X_eval)))
output:
train data :
[[166 3]
[ 4 282]]
test data :
[[40 3]
[ 3 68]]

sklearn auc ValueError: Only one class present in y_true

I searched Google, and saw a couple of StackOverflow posts about this error. They are not my cases.
I use keras to train a simple neural network and make some predictions on the splitted test dataset. But when use roc_auc_score to calculate AUC, I got the following error:
"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.".
I inspect the target label distribution, and they are highly imbalanced. Some labels(in the total 29 labels) have only 1 instance. So it's likely they will have no positive label instance in the test label. So the sklearn's roc_auc_score function reported the only one class problem. That's reasonable.
But I'm curious, as when I use sklearn's cross_val_score function, it can handle the AUC calculation without error.
my_metric = 'roc_auc'
scores = cross_validation.cross_val_score(myestimator, data,
labels, cv=5,scoring=my_metric)
I wonder what happened in the cross_val_score, is it because the cross_val_score use a stratified cross-validation data split?
UPDATE
I continued to make some digging, but still can't find the difference behind.I see that cross_val_score call check_scoring(estimator, scoring=None, allow_none=False) to return a scorer, and the check_scoring will call get_scorer(scoring) which will return scorer=SCORERS[scoring]
And the SCORERS['roc_auc'] is roc_auc_scorer;
the roc_auc_scorer is made by
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
needs_threshold=True)
So, it's still using the roc_auc_score function. I don't get why cross_val_score behave differently with directly calling roc_auc_score.
I think your hunch is correct. The AUC (area under ROC curve) needs a sufficient number of either classes in order to make sense.
By default, cross_val_score calculates the performance metric one each fold separately. Another option could be to do cross_val_predict and compute the AUC over all folds combined.
You could do something like:
from sklearn.metrics import roc_auc_score
from sklearn.cross_validation import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
class ProbaEstimator(LogisticRegression):
"""
This little hack needed, because `cross_val_predict`
uses `estimator.predict(X)` internally.
Replace `LogisticRegression` with whatever classifier you like.
"""
def predict(self, X):
return super(self.__class__, self).predict_proba(X)[:, 1]
# some example data
X, y = make_classification()
# define your estimator
estimator = ProbaEstimator()
# get predictions
pred = cross_val_predict(estimator, X, y, cv=5)
# compute AUC score
roc_auc_score(y, pred)

How to get best_estimator parameters from GridSearch using cross_val_score?

I want to know the result of the GridSearch when I'm using nested cross validation with cross_val_score for convenience.
When using cross_val_score, you get an array of scores. It would be useful to receive the fitted estimator back or a summary of the chosen parameters for that estimator.
I know you can do this yourself but just implementing cross-validation manually but it is much more convenient if it can be done in conjunction with cross_val_score.
Any way to do it or is this a feature to suggest?
The GridSearchCV class in scikit-learn already does cross validation internally. You can pass any CV iterator as the cv argument of the constructor of GridSearchCV.
The answer to your question is that it is a feature to suggest. Unfortunately, you can't get the best parameters of the models fitted with nested cross-validation using cross_val_score (as of now, scikit 0.14).
See this example:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
digits = datasets.load_digits()
X = digits.data
y = digits.target
hyperparams = [{'fit_intercept':[True, False]}]
algo = LinearRegression()
grid = GridSearchCV(algo, hyperparams, cv=5, scoring='mean_squared_error')
# Nested cross validation
cross_val_score(grid, X, y)
grid.best_score_
[Out]:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-4c4ac83c58fb> in <module>()
15 # Nested cross validation
16 cross_val_score(grid, X, y)
---> 17 grid.best_score_
AttributeError: 'GridSearchCV' object has no attribute 'best_score_'
(Note also that the scores you get from cross_val_score are not the ones defined in scoring, here the mean squared error. What you see is the score function of the best estimator. The bug of v0.14 is described here.)
In sklearn v0.20.0 (which will be released in late 2018), the trained estimators are exposed by the function cross_validate if requested.
See here the corresponding pull-request for the new feature. Something like this will work:
from sklearn.metrics.scorer import check_scoring
from sklearn.model_selection import cross_validate
scorer = check_scoring(estimator=gridSearch, scoring=scoring)
cvRet = cross_validate(estimator=gridSearch, X=X, y=y,
scoring={'score': scorer}, cv=cvOuter,
return_train_score=False,
return_estimator=True,
n_jobs=nJobs)
scores = cvRet['test_score'] # Equivalent to output of cross_val_score()
estimators = cvRet['estimator']
If return_estimator=True, the estimators can be retrieved from the returned dictionary as cvRet['estimator']. The list stored in cvRet['test_score'] is equivalent to the output of cross_val_score. See here how cross_val_score() is implemented by means of cross_validate().

Resources