How to cross-validate PCA in a sklearn pipeline without overfitting? - scikit-learn

My input is time series data. I want to decompose the dataset with PCA (I don't want to do PCA on the entire dataset first, because that would be overfitting) and then use feature selection on each component (fitted on a KNN regressor model).
This is my code so far:
tscv = TimeSeriesSplit(n_splits=10)
pca = PCA(n_components=.5,svd_solver='full').fit_transform()
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn,n_features_to_select='auto',tol=.001,scoring=custom_scorer,n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])
cv_score = cross_val_score(estimator=pipe,X=X,y=y,scoring=custom_scorer,cv=tscv,verbose=10)
print(np.average(cv_score),' +/- ',np.std(cv_score))
print(X.columns)
The problem is that I want to make sure PCA isn't looking at the entire dataset when it computes the component variances. I also want it to be fit-transformed within each training fold, but it doesn't work, failing with one of the following errors:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<bound method PCA.fit_transform of PCA(svd_solver='full')>' (type <class 'method'>) doesn't
or
TypeError: fit_transform() missing 1 required positional argument: 'X'

You should not use pca = PCA(...).fit_transform nor pca = PCA(...).fit_transform() when defining your pipeline.
Instead, you should use pca = PCA(...). The fit_transform method is called automatically within the pipeline during model fitting (inside cross_val_score), so PCA is refit on the training portion of each split and never sees the held-out data.
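For reference, a minimal sketch of the corrected pipeline (assuming X, y, and custom_scorer are defined as in the question):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

tscv = TimeSeriesSplit(n_splits=10)
# Pass the estimator itself, not a bound method; Pipeline calls
# fit_transform on every intermediate step for you.
pca = PCA(n_components=0.5, svd_solver='full')
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn, n_features_to_select='auto',
                                tol=0.001, scoring=custom_scorer, n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])

# PCA (and SFS) are now refit on the training fold of every split,
# so nothing from the held-out fold leaks into the decomposition.
cv_score = cross_val_score(estimator=pipe, X=X, y=y,
                           scoring=custom_scorer, cv=tscv, verbose=10)
print(np.average(cv_score), ' +/- ', np.std(cv_score))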

Related

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(), X=x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X=x_new, y=None, method='predict_proba', cv=10)
but this did not work; it complains about y having zero shape. Does that mean there is no way to apply the trained and cross-validated model from cross_val_predict to new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross-validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit the model on some data and then generate predictions for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
from sklearn import linear_model

logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
I have the same concern as user3490622. If we can only use cross_val_predict on training and testing sets, why is y (the target) None by default? (See the sklearn page.)
To partially achieve the desired result of multiple predicted probabilities, one could apply the fit-then-predict approach repeatedly to mimic cross-validation, as sketched below.
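A minimal sketch of that idea (assuming x_old, y_old, and x_new are NumPy arrays; the fold count mirrors the question's cv=10):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Train one model per fold on the training portion of x_old, then average
# the per-fold probabilities predicted for the new data x_new.
fold_probs = []
for train_idx, _ in KFold(n_splits=10).split(x_old):
    model = LogisticRegression().fit(x_old[train_idx], y_old[train_idx])
    fold_probs.append(model.predict_proba(x_new))
myprobs_test = np.mean(fold_probs, axis=0)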

How to use Principal Component Analysis while predicting?

Suppose my original dataset has 8 features and I apply PCA with n_components = 3 (I am using sklearn.decomposition.PCA). Then I train my model using those 3 PCA components (which are now my new features).
Do I need to apply PCA while predicting as well ?
Do I need to do that even if I am predicting only one data point?
What confuses me is that when I do prediction, every data point is a row in a 2D matrix (consisting of all the data points I want to predict). So if I apply PCA to just one data point, won't the corresponding row vector be converted to a zero vector?
If you fitted your model on the first three components of the PCA, you have to transform any new data appropriately. For example, consider this code taken from here:
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
In the code, they first fit PCA on the training set. Then they transform both the training and testing sets, and apply the model (in their case, an SVM) to the transformed data.
Even if your X_test consists of only one data point, you can still use PCA. Just reshape your data into a 2D matrix. For example, if your data point is [1,2,0,5], then X_test=[[1,2,0,5]]. That is, it is a 2D matrix with one row.
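As an illustration (a sketch assuming the pca object fitted above on a 4-feature X_train):
import numpy as np

x_single = np.array([[1, 2, 0, 5]])     # shape (1, 4): a 2D matrix with one row
x_single_pca = pca.transform(x_single)  # subtracts the mean learned on X_train,
                                        # so the point is not collapsed to zero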

GridsearchCV with RandomForest

So I am doing some parameter tuning with RandomForest and GridSearchCV. Here is my code.
#Import 'GridSearchCV', 'make_scorer', and 'f1_score'
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

#Create the parameters list you wish to tune
parameters = {'n_estimators': [5, 10, 15]}

#Initialize the classifier
clf = GridSearchCV(RandomForestClassifier(), parameters)

#Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(f1_score)

#Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer, cv=5)
print(clf.get_params().keys())

#Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train_100, y_train_100)
So the issue is the following error: "ValueError: Invalid parameter max_features for estimator GridSearchCV. Check the list of available parameters with estimator.get_params().keys()."
I followed the advice given by the error, and the output of print(clf.get_params().keys()) is below. However, even when I copy and paste these keys into my parameter dictionary, I still get an error. I've hunted around Stack Overflow, and most people are using parameter dictionaries really similar to mine. Anyone have any idea how to iron out this issue? Thanks again!
dict_keys(['pre_dispatch', 'cv', 'estimator__max_features', 'param_grid', 'refit', 'estimator__min_impurity_split', 'n_jobs', 'estimator__random_state', 'error_score', 'verbose', 'estimator__min_samples_split', 'estimator__n_jobs', 'fit_params', 'estimator__min_weight_fraction_leaf', 'scoring', 'estimator__warm_start', 'estimator__criterion', 'estimator__verbose', 'estimator__bootstrap', 'estimator__class_weight', 'estimator__oob_score', 'iid', 'estimator', 'estimator__max_depth', 'estimator__max_leaf_nodes', 'estimator__min_samples_leaf', 'estimator__n_estimators', 'return_train_score'])
I think the problem is with the two lines:
clf = GridSearchCV(RandomForestClassifier(), parameters)
grid_obj = GridSearchCV(clf, param_grid=parameters, scoring=f1_scorer,cv=5)
What this is essentially doing is creating an object with a structure like:
grid_obj = GridSearchCV(GridSearchCV(RandomForestClassifier()))
which is probably one more GridSearchCV than you want.
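A sketch of the un-nested version (assuming X_train_100 and y_train_100 from the question):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score

parameters = {'n_estimators': [5, 10, 15]}
f1_scorer = make_scorer(f1_score)

# A single GridSearchCV wrapping the classifier directly: the param_grid
# keys now match RandomForestClassifier's own parameter names, so no
# "Invalid parameter" error is raised.
grid_obj = GridSearchCV(RandomForestClassifier(), param_grid=parameters,
                        scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_100, y_train_100)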

How do I correctly manually recreate sklearn (Python) logistic regression predict_proba outcomes for multiclass classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can I manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results, I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package, I was able to recreate the same results I got from predict_proba().
It appears that, in its code, sklearn uses a special softmax() function that differs from the usual softmax function.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict_proba(X) or use the sklearn function mentioned above to calculate them manually, like this:
from sklearn.utils.extmath import softmax
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that:
The softmax function is calculated by np.exp(X) / np.sum(np.exp(X), axis=1). This will cause overflow when large values are exponentiated, hence the largest value in each row is subtracted from each data point to prevent this.
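To see what that stabilization amounts to, here is a hand-rolled version (an illustration of the described trick, not sklearn's exact code); it should agree with predict_proba:
def stable_softmax(scores):
    # Subtract each row's maximum before exponentiating to avoid overflow;
    # the shift cancels out in the ratio, so the result is unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

print(np.allclose(stable_softmax(scores), model.predict_proba(X)))  # True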
Replicating sklearn's calculations (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())  # raw linear scores
U = V + model.intercept_                         # add the per-class intercepts
A = np.exp(U)
P = A / (1 + A)                                  # per-class sigmoid (one-vs-rest style)
P /= P.sum(axis=1).reshape((-1, 1))              # normalize rows to sum to 1
This seems slightly different from the softmax calculation, or the UCLA stats example, but it works.

Scikit-learn TruncatedSVD documentation

I plan to use sklearn.decomposition.TruncatedSVD to perform LSA for a Kaggle competition. I know the math behind SVD and LSA, but I'm confused by scikit-learn's user guide, so I'm not sure how to actually apply TruncatedSVD.
In the doc, it states that:
After this operation, U_k * transpose(S_k) is the transformed training set with k features (called n_components in the API).
Why is this? I thought that after SVD, the rank-k approximation X_k of X should be U_k * S_k * transpose(V_k)?
And then it says,
To also transform a test set X, we multiply it with V_k: X' = X * V_k
What does this mean?
I like the documentation here a bit better. Sklearn is pretty consistent in that you almost always use some combination of the following code:
#import the desired sklearn class
from sklearn.decomposition import TruncatedSVD

trainData = #someArray
testData = #someArray

model = TruncatedSVD(n_components=5, random_state=42)
model.fit(trainData) #fit the model on the underlying data
If you want to transform that data instead of just fitting it:
model.fit_transform(trainData) #fit and transform the underlying data
Similarly, to reduce new data with the fitted model (TruncatedSVD has no predict method; transform plays that role), you would use something like:
predictions = model.transform(testData)
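To connect this back to the doc's formula X' = X * V_k: transform just multiplies by V_k, whose rows TruncatedSVD stores in components_. A quick check (a sketch assuming testData is a dense NumPy array):
import numpy as np

reduced = model.transform(testData)
manual = testData @ model.components_.T  # X * V_k, with V_k's rows in components_
print(np.allclose(reduced, manual))      # True: transform is exactly this product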
Hope that helps...
