Obtaining Same Result as Sklearn Pipeline without Using It - python-3.x

How would one correctly standardize the data without using a pipeline? I just want to make sure my code is correct and there is no data leakage.
So if I standardize the entire dataset once, right at the beginning of my project, and then go on to try different CV tests with different ML algorithms, will that be the same as creating a Sklearn Pipeline and performing the same standardization in conjunction with each ML algorithm?
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

y = df['y']
X = df.drop(columns=['y', 'Date'])
scaler = preprocessing.StandardScaler().fit(X)
X_transformed = scaler.transform(X)
clf1 = DecisionTreeClassifier()
clf1.fit(X_transformed, y)
clf2 = SVC()
clf2.fit(X_transformed, y)
####Is this the same as the below code?####
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

steps1 = []
steps1.append(('standardize', StandardScaler()))
steps1.append(('clf1', DecisionTreeClassifier()))
pipeline1 = Pipeline(steps1)
pipeline1.fit(X_transformed, y)

steps2 = []
steps2.append(('standardize', StandardScaler()))
steps2.append(('clf2', SVC()))
pipeline2 = Pipeline(steps2)
pipeline2.fit(X_transformed, y)
Why would anybody choose the latter other than personal preference?

They are the same. You may prefer one or the other from a maintainability standpoint, but the outcome of a test set prediction will be identical.
Edit: Note that this is only the case because the StandardScaler is idempotent. It is strange that you fit the pipeline on data that has already been scaled...
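The usual reason to reach for the Pipeline form, beyond maintainability, is cross-validation: when the scaler lives inside the pipeline it is re-fitted on each training fold only, so no statistics from the validation fold leak into the scaling. A minimal sketch of that pattern, assuming the raw (unscaled) X and y from the question:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# the scaler is fitted inside each CV fold, on the training part only
pipe = Pipeline([('standardize', StandardScaler()), ('clf', SVC())])
scores = cross_val_score(pipe, X, y, cv=5)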

Related

Pipeline with XGBoost - Imputer and Scaler prevent Model from learning

I'm trying to build a pipeline for data preprocessing for my XGBoost model. The data contains NaNs and needs to be scaled. This is the relevant code:
xgb_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', preprocessing.StandardScaler()),
    ('regressor', xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective="reg:squarederror"))])

xgb_pipe.fit(train_x.values, train_y.values,
             regressor__early_stopping_rounds=20,
             regressor__eval_metric="rmse",
             regressor__eval_set=[[train_x.values, train_y.values], [test_x.values, test_y.values]])
The loss immediately increases and the training stops after 20 iterations.
If I remove the imputer and the scaler from the pipeline, it works and trains for the full 100 iterations. If I manually preprocess the data it also works as intended, so I know that the problem is not the data.
What am I missing?
The problem is that the preprocessing doesn't get applied to your eval sets, and so the model performs quite badly on them, and early stopping kicks in very early.
I'm not sure there's a simple way to do this that would keep everything in one pipeline, unfortunately. You need to apply the preprocessing steps of the pipeline to the eval sets, so those need to be fitted before setting that parameter.
Separate preprocessing
As two objects it's no problem:
preproc = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', preprocessing.StandardScaler()),
])
reg = xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective="reg:squarederror")

train_x_preproc = preproc.fit_transform(train_x.values, train_y.values)
test_x_preproc = preproc.transform(test_x.values)

# the regressor is no longer inside a pipeline, so it is fitted on the
# preprocessed data and the fit parameters lose the "regressor__" prefix
reg.fit(train_x_preproc, train_y.values,
        early_stopping_rounds=20,
        eval_metric="rmse",
        eval_set=[[train_x_preproc, train_y.values], [test_x_preproc, test_y.values]],
)
After fitting you could put these now-fitted estimators together into a pipeline (pipelines don't clone their estimators) for prediction if you'd like.
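For example, a minimal sketch of that recombination (the full_model name is just for illustration; preproc and reg are assumed to be already fitted as above):
# pipelines don't clone their estimators, so the already-fitted steps are reused as-is
full_model = Pipeline(steps=[('preproc', preproc), ('regressor', reg)])
predictions = full_model.predict(test_x.values)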
Custom estimator
There are a lot of ways to go about this, but inheriting from Pipeline means you can initialize it the same way as your current setup: we just assume the last step is an xgboost model and that the remaining steps are preprocessing that needs to be applied to the eval sets as well as to the fitting and predicting sets. I think everything else can be left to the methods inherited from Pipeline.
class PreprocEarlyStoppingXGB(Pipeline):
    def fit(self, X, y, eval_set):
        # everything except the last step forms the preprocessing sub-pipeline
        preproc = Pipeline(self.steps[:-1])
        X_preproc = preproc.fit_transform(X, y)
        # apply the fitted preprocessing to every eval set as well
        eval_preproc = []
        for eval_x, eval_y in eval_set:
            eval_preproc.append([preproc.transform(eval_x), eval_y])
        # the last step is a (name, estimator) tuple; fit the estimator itself
        self.steps[-1][1].fit(X_preproc, y, eval_set=eval_preproc)
        return self
As for your use case from the comments: what happens when you cross-validate with this object? On each training fold, the preprocessing steps are fitted. They are then applied to the training fold, to all eval sets (the entire training set as well as the external test set), and finally when scoring the test fold. The xgboost model trains on the preprocessed training fold and watches the score on the entire training set and the external test set (both preprocessed), with the latter used for early stopping.
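A hedged usage sketch, assuming the class above, an xgboost version that accepts early_stopping_rounds and eval_metric in the constructor, and the same variable names as the question:
from sklearn.model_selection import cross_val_score

cv_pipe = PreprocEarlyStoppingXGB(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', preprocessing.StandardScaler()),
    ('regressor', xgboost.XGBRegressor(n_estimators=100, eta=0.1,
                                       objective="reg:squarederror",
                                       early_stopping_rounds=20,
                                       eval_metric="rmse")),
])

# eval_set is forwarded to the custom fit() on every training fold
scores = cross_val_score(cv_pipe, train_x.values, train_y.values, cv=5,
                         fit_params={'eval_set': [[train_x.values, train_y.values],
                                                  [test_x.values, test_y.values]]})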

Cross Validation in Sklearn using a Custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=custom_cv,
                           n_jobs=-1, verbose=0, scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X is the feature matrix and y is the vector of labels.
Example: Suppose that I only have one hyperparameter alpha that belongs to the set {1, 2, 3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3, and choose the alpha that minimizes the error.
EDIT: Actual answer to the question. Try passing the cv argument a generator of the indices:
def index_gen(listTrain, listTest):
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest),
                           n_jobs=-1, verbose=0, scoring=errorType)
Original answer (before the edits):
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           n_jobs=-1, verbose=0, scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain])
Now, your training set will be split into 5 folds (you can choose the number here); the model is trained on 4 of those folds with a specific set of hyperparameters and tested on the fold that was left out. This is repeated 5 times, until all of your training examples have been part of a left-out fold. The whole procedure is done for each hyperparameter setting you are testing (5x3 fits in this case).
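For the alpha example from the question, the param_grid passed above would simply enumerate the candidate values; a hypothetical sketch, assuming the model is, say, a Ridge regression:
from sklearn.linear_model import Ridge

model = Ridge()
param_grid = {'alpha': [1, 2, 3]}  # each value is evaluated on every one of the 5 folds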
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain], y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

Get feature importance PySpark Naive Bayes classifier

I have a Naive Bayes classifier that I wrote in Python using a Pandas data frame and now I need it in PySpark. My problem here is that I need the feature importance of each column. When looking through the PySpark ML documentation I couldn't find any info on it.
Does anyone know if I can get the feature importance with the Naive Bayes Spark MLlib?
The code using Python is the following. The feature importance is retrieved with .coef_
from sklearn.naive_bayes import BernoulliNB

df = df.fillna(0).toPandas()
X_df = df.drop(['NOT_OPEN', 'unique_id'], axis=1)
X = X_df.values
Y = df['NOT_OPEN'].values.reshape(-1, 1)

mnb = BernoulliNB(fit_prior=True)
estimator = mnb.fit(X, Y)
y_pred = estimator.predict(X)

# coef_: for a binary classification problem this is the log of the estimated
# probability of a feature given the positive class, i.e. higher values mean
# more important features for the positive class.
feature_names = X_df.columns
coefs_with_fns = sorted(zip(estimator.coef_[0], feature_names))
If you're interested in an equivalent of coef_, the property you're looking for is NaiveBayesModel.theta:
log of class conditional probabilities.
New in version 2.0.0.
i.e.
model = ... # type: NaiveBayesModel
model.theta.toArray() # type: numpy.ndarray
The resulting array is of size (number-of-classes, number-of-features), and rows correspond to consecutive labels.
It is probably better to evaluate the difference
log(P(feature_X|positive)) - log(P(feature_X|negative))
as a feature importance, because we are interested in the discriminative power of each feature_X (sure, NB is a generative model).
Extreme example: some feature_X1 has the same value across all + and - samples, so it has no discriminative power.
The probability of that feature value is high for both + and - samples, but the difference of the log probabilities is 0.
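A minimal sketch of that computation, assuming a binary problem where row 0 of theta corresponds to the negative label and row 1 to the positive label (check your label encoding before relying on this row order), and that feature_names is the list of input columns in the order they were assembled into the feature vector:
import numpy as np

theta = model.theta.toArray()   # shape: (number-of-classes, number-of-features)
# log(P(feature_X|positive)) - log(P(feature_X|negative)) as a feature importance
importance = theta[1] - theta[0]
ranked = sorted(zip(importance, feature_names), reverse=True)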

Sklearn MLP Feature Selection

Recursive Feature Elimination with Cross Validation (RFECV) does not work with the Multi Layer Perceptron estimator (along with several other classifiers).
I wish to use a feature selection across many classifiers that performs cross validation to verify its feature selection. Any suggestions?
There is a feature selection method for structured data that is independent of the model choice; it is called permutation importance. It is well explained here and elsewhere.
You should have a look at it. It is currently being implemented in sklearn.
There is no current implementation for MLP, but one could easily be written with something like this (from the article):
import numpy as np

def permutation_importances(rf, X_train, y_train, metric):
    baseline = metric(rf, X_train, y_train)
    imp = []
    for col in X_train.columns:
        # shuffle one column at a time and measure the drop in the metric
        save = X_train[col].copy()
        X_train[col] = np.random.permutation(X_train[col])
        m = metric(rf, X_train, y_train)
        X_train[col] = save
        imp.append(baseline - m)
    return np.array(imp)
Note that here the training set is used for computing the feature importances, but you could choose to use the test set, as discussed here.
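A hedged usage sketch with an MLP, assuming X_train is a pandas DataFrame and y_train the matching labels; the accuracy_metric helper is just for illustration:
from sklearn.neural_network import MLPClassifier

def accuracy_metric(model, X, y):
    return model.score(X, y)  # mean accuracy for classifiers

mlp = MLPClassifier(max_iter=500).fit(X_train, y_train)
importances = permutation_importances(mlp, X_train, y_train, accuracy_metric)
Newer scikit-learn versions also ship sklearn.inspection.permutation_importance, which implements the same idea for any fitted estimator.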

Scikit-learn TruncatedSVD documentation

I plan to use sklearn.decomposition.TruncatedSVD to perform LSA for a Kaggle competition. I know the math behind SVD and LSA, but I'm confused by scikit-learn's user guide, so I'm not sure how to actually apply TruncatedSVD.
In the doc, it states that:
After this operation,
U_k * transpose(S_k) is the transformed training set with k features (called n_components in the API)
Why is this? I thought that after SVD the rank-k approximation should be X_k = U_k * S_k * transpose(V_k)?
And then it says,
To also transform a test set X, we multiply it with V_k: X' = X * V_k
What does this mean?
I like the documentation here a bit better. Sklearn is pretty consistent in that you almost always use some kind of combination of the following code:
# import the desired sklearn class
from sklearn.decomposition import TruncatedSVD

trainData = ...  # some array
testData = ...   # some array

model = TruncatedSVD(n_components=5, random_state=42)
model.fit(trainData)  # fit the model on the underlying data
If you want to transform that data instead of just fitting it:
trainTransformed = model.fit_transform(trainData)  # fit and transform the underlying data
Similarly, to apply the fitted transformation to unseen data (the analogue of making a prediction), you would use something like:
predictions = model.transform(testData)
Hope that helps...
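To connect this back to the math in the question: after fitting, model.components_ holds transpose(V_k) and model.singular_values_ the diagonal of S_k, and transform computes X * V_k, which for the training data equals U_k * S_k. A minimal sketch that checks this, assuming a small random dense matrix:
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(20, 10)
svd = TruncatedSVD(n_components=5, random_state=42)
X_transformed = svd.fit_transform(X)   # this is U_k * S_k

# transform multiplies by V_k (components_ is transpose(V_k))
print(np.allclose(X_transformed, X @ svd.components_.T))   # True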
