Suppose my original dataset has 8 features and I apply PCA with n_components = 3 (I am using sklearn.decomposition.PCA). Then I train my model using those 3 PCA components (which are now my new features).
Do I need to apply PCA when predicting as well?
Do I need to do that even if I am predicting only one data point?
What confuses me is that when I predict, every data point is a row in a 2D matrix (consisting of all the data points that I want to predict). So if I apply PCA to just one data point, the corresponding row vector will be converted to a zero vector.

If you fitted your model on the first three PCA components, you have to transform any new data in the same way. For example, consider this code, taken from another example:
pca = PCA(n_components=n_components, svd_solver='randomized').fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
In the code, they first fit PCA on the training set. Then they transform both the training and the test set, and then they apply the model (in their case, an SVM) to the transformed data.
Even if your X_test consists of only 1 data point, you could still use PCA. Just transform your data into a 2D matrix. For example, if your data point is [1,2,0,5] then X_test=[[1,2,0,5]]. That is, it is a 2D matrix with 1 row.
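As a minimal sketch of the single-point case, assuming synthetic data with 8 features and a plain SVC (both are illustrative choices, not part of the question): PCA is fitted once on the training set, and the new point is only passed through transform, which reuses the training mean and axes, so the row does not collapse to zero.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))         # synthetic: 100 samples, 8 features
y_train = (X_train[:, 0] > 0).astype(int)   # synthetic binary labels

# Fit PCA on the training data only, then train on the 3 components.
pca = PCA(n_components=3).fit(X_train)
clf = SVC().fit(pca.transform(X_train), y_train)

# A single new data point: reshape it to a 2D array with one row and call
# transform (not fit_transform), so the fitted projection is reused.
x_new = rng.normal(size=8)
x_new_pca = pca.transform(x_new.reshape(1, -1))
y_new = clf.predict(x_new_pca)
print(x_new_pca.shape, y_new.shape)
```

Calling fit or fit_transform on the single row instead would re-centre the data on that one point, which is where the zero vector in the question comes from.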


How to cross-validate PCA in an sklearn pipeline without overfitting?

My input is time series data. I want to decompose the dataset with PCA (I don't want to run PCA on the entire dataset first, because that would be overfitting) and then use feature selection on each component (fitted on a KNN regressor model).
This is my code so far:
tscv = TimeSeriesSplit(n_splits=10)
pca = PCA(n_components=.5,svd_solver='full').fit_transform()
knn = KNeighborsRegressor(n_jobs=-1)
sfs = SequentialFeatureSelector(estimator=knn,n_features_to_select='auto',tol=.001,scoring=custom_scorer,n_jobs=-1)
pipe = Pipeline(steps=[("pca", pca), ("sfs", sfs), ("knn", knn)])
cv_score = cross_val_score(estimator=pipe,X=X,y=y,scoring=custom_scorer,cv=tscv,verbose=10)
print(np.average(cv_score),' +/- ',np.std(cv_score))
The problem is that I want to make sure PCA isn't looking over the entire dataset when it calculates the variance of the features. I also want it to be fit-transformed, but it doesn't work. I get the following errors:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '<bound method PCA.fit_transform of PCA(svd_solver='full')>' (type <class 'method'>) doesn't
TypeError: fit_transform() missing 1 required positional argument: 'X'
You should not use pca = PCA(...).fit_transform nor pca = PCA(...).fit_transform() when defining your pipeline.
Instead, you should use pca = PCA(...). The fit_transform method is automatically called within the pipeline during the model fitting (in cross_val_score).
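A minimal sketch of that setup, assuming synthetic data, the default R^2 scoring instead of custom_scorer, and leaving out the SequentialFeatureSelector step for brevity: the unfitted PCA object goes into the pipeline, and cross_val_score refits the whole pipeline on the training part of every split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # synthetic features
y = X[:, 0] + 0.1 * rng.normal(size=200)      # synthetic target

# Pass the estimator objects themselves; the pipeline calls
# fit_transform on PCA internally for each training fold.
pipe = Pipeline(steps=[
    ("pca", PCA(n_components=0.5, svd_solver="full")),
    ("knn", KNeighborsRegressor()),
])

tscv = TimeSeriesSplit(n_splits=5)
cv_score = cross_val_score(estimator=pipe, X=X, y=y, cv=tscv)
print(np.average(cv_score), "+/-", np.std(cv_score))
```

Because PCA is fitted inside each split, it never sees the held-out part of the series.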

How to set Keras TimeseriesGenerator to predict the second next value?

Currently I have the following code using TimeseriesGenerator from Keras:
TimeseriesGenerator(train, prediction, length=TIME_STEPS, batch_size=1)
Currently this shifts prediction one value backwards, so the train data for t will have the output of t+1. Which makes sense, but I want to predict t+2, thus train data for t will have the output of t+2.
Is there any way to do it using TimeseriesGenerator?
The quickest solution is to shift your predictions by 1, i.e.:
TimeseriesGenerator(train[:-1], prediction[1:], length=TIME_STEPS, batch_size=1)
Note that you have to trim the train set, so both datasets have equal lengths.
You can also use the timeseries_dataset_from_array function, where you can align the data and targets according to your needs, as described in the documentation:
data: Numpy array or eager tensor containing consecutive data points
(timesteps). Axis 0 is expected to be the time dimension.
targets: Targets corresponding to timesteps in data. It should have same length
as data. targets[i] should be the target corresponding to the window
that starts at index i (see example 2 below). Pass None if you don't
have target data (in this case the dataset will only yield the input
data).
So in your case it would be something like this:
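The alignment that the `targets[i]` rule implies can be sketched with NumPy alone. This is only an illustration on a toy series; `HORIZON` and `offset` are hypothetical names, and the call to timeseries_dataset_from_array itself is not shown.

```python
import numpy as np

TIME_STEPS = 3
HORIZON = 2                  # predict the second next value (t+2)
series = np.arange(10)       # toy series 0..9

# The window starting at index i covers series[i : i + TIME_STEPS], so its
# last element sits at i + TIME_STEPS - 1 and its t+2 target at
# i + TIME_STEPS + 1. Shifting the targets by that offset aligns them
# with the window start indices.
offset = TIME_STEPS + HORIZON - 1
targets = series[offset:]
windows = np.stack([series[i:i + TIME_STEPS] for i in range(len(targets))])

for window, target in zip(windows, targets):
    print(window, "->", target)
```

Each printed target is two steps past the last value of its window, which is the pairing the shifted `targets` array would have to provide.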

How to use cross_val_predict to predict probabilities for a new dataset?

I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work, it's complaining about y having zero shape. Does it mean there's no way to apply the trained and cross-validated model from cross_val_predict on new data? Or am I just using it wrong?
Thank you!
You are looking at the wrong method. Cross-validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate predictions for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
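Since the question asks for probabilities rather than class labels, the same structure can be sketched with predict_proba. The data below is synthetic and only stands in for x_old, y_old and x_new.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_old = rng.normal(size=(200, 4))                       # synthetic training features
y_old = (x_old[:, 0] + x_old[:, 1] > 0).astype(int)     # synthetic labels
x_new = rng.normal(size=(5, 4))                         # brand-new, unlabeled data

# Fit once on all of the old data, then score the new data.
logreg = LogisticRegression().fit(x_old, y_old)
myprobs_test = logreg.predict_proba(x_new)   # one row per new sample, one column per class
print(myprobs_test.shape)
```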
I have the same concern as @user3490622. If cross_val_predict can only be used on training and testing sets, why is None the default value of y (the target)? (See the sklearn page.)
To partially achieve the desired result of multiple predicted probabilities, one could repeat the fit-then-predict approach to mimic cross-validation.
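One way this repeated fit-then-predict idea could look, assuming synthetic data, a plain KFold split and simple averaging of the per-fold probabilities (all three are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
x_old = rng.normal(size=(200, 4))                       # synthetic training features
y_old = (x_old[:, 0] + x_old[:, 1] > 0).astype(int)     # synthetic labels
x_new = rng.normal(size=(5, 4))                         # new data to score

# Fit one model per training fold and let each of them score x_new.
fold_probs = []
for train_idx, _ in KFold(n_splits=10).split(x_old):
    model = LogisticRegression().fit(x_old[train_idx], y_old[train_idx])
    fold_probs.append(model.predict_proba(x_new))

fold_probs = np.stack(fold_probs)        # shape: (n_folds, n_new, n_classes)
mean_probs = fold_probs.mean(axis=0)     # averaged probability per new sample
print(fold_probs.shape, mean_probs.shape)
```

The spread of `fold_probs` across the first axis gives a rough idea of how sensitive each prediction is to the training subset.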

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file, and reuse it later in a script. However, there are certain high-cardinality categorical fields that are translated to a one-hot matrix representation, which creates around 5000 features. The input that I get for predict will have only one of these features set, and all the rest will be zeros. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
So, let us say my input contains 3 fields:
rate: 10, // numeric
flagged: 0, //binary
host: '' // keeping this categorical
host can have around 5000 different values. Now I loaded the file to a pandas dataframe, used the get_dummies function to transform the host field to around 5000 new fields which are binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
rate: 5,
flagged: 1
host: ''
I know that the fields expected by the predict should be:
rate: 5,
flagged: 1
new_host: 1
But if I translate it to vector format, I don't know which index to place the new_host field. Also, I don't know in advance what other hosts are (unless I store it somewhere during the training phase)
I hope I am making some sense. Let me know if I am doing it the wrong way.
I don't know which index to place the new_host field
A good approach that has worked for me is to build a pipeline which you then use for both training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
# note: LabelBinarizer is written for label arrays and raises a TypeError
# inside a Pipeline; OneHotEncoder is the feature-side equivalent
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)  # mf is an open binary file handle
# in production
model = pickle.load(mf)
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input is the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
    ('host', sklearn.preprocessing.LabelBinarizer()),
    # StandardScaler expects 2D input, so the column is selected as a list
    (['rate'], sklearn.preprocessing.StandardScaler()),
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
one_hot_data], axis=1)
clf = SGDClassifier().fit(X_train, y)
# f is an open binary file handle; the object to store comes first
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
one_hot_data], axis=1)
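The same idea can also be packed into a single pickled object. The sketch below swaps in ColumnTransformer and OneHotEncoder, which are not used in the answers above; the host names, labels and column choices are made up for illustration. With handle_unknown="ignore", a host that was not seen during training is encoded as all zeros instead of raising an error.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data; host names and labels are made up.
train = pd.DataFrame({
    "rate": [10, 5, 7, 3],
    "flagged": [0, 1, 0, 1],
    "host": ["a.example", "b.example", "a.example", "c.example"],
})
y = [0, 1, 0, 1]

# The encoder remembers the host categories and their column order,
# so nothing has to be stored separately at training time.
preprocess = ColumnTransformer(
    [("host", OneHotEncoder(handle_unknown="ignore"), ["host"]),
     ("rate", StandardScaler(), ["rate"])],
    remainder="passthrough",   # keeps "flagged" unchanged
)
model = Pipeline([("prep", preprocess),
                  ("clf", SGDClassifier(random_state=0))])
model.fit(train, y)

# Round-trip through pickle, as one would between training and production.
restored = pickle.loads(pickle.dumps(model))

# At prediction time only the three raw fields are needed.
new_row = pd.DataFrame([{"rate": 5, "flagged": 1, "host": "new.example"}])
pred = restored.predict(new_row)
print(pred)
```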

Scikit-Learn Linear Regression how to get coefficient's respective features?

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:
Estimated coefficients for the linear regression problem. If multiple
targets are passed during the fit (y 2D), this is a 2D array of
shape (n_targets, n_features), while if only one target is passed,
this is a 1D array of length n_features.
I am passing into my regression.fit(A, B), where A is a 2-D array with the tfidf value for each feature in a document. Example format:
"feature1" "feature2"
"Doc1" .44 .22
"Doc2" .11 .6
"Doc3" .22 .2
B are my target values for the data, which are just numbers 1-100 associated with each document:
"Doc1" 50
"Doc2" 11
"Doc3" 99
Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.
What I found to work was:
# X is the DataFrame holding your independent variables
coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)
The assumption you stated, that the order of regression.coef_ is the same as the column order of the training set, holds true in my experience. (It works with the underlying data and also checks out against the correlations between X and y.)
You can do that by creating a data frame:
cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
# ravel flattens the (1, n_features) coef_ array of a binary model to 1D
coefficients = pd.DataFrame({"Feature": X.columns, "Coefficients": np.ravel(logistic.coef_)})
I suppose you are working on a feature selection task. Using regression.coef_ does give the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.
I would in turn recommend the tree models in sklearn, which can also be used for feature selection.
Coefficients and features in zip
Coefficients and features in DataFrame
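Both variants can be sketched on the toy table from the question. The LinearRegression model and the sorting by absolute value are illustrative choices here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data taken from the example table in the question.
X = pd.DataFrame({"feature1": [0.44, 0.11, 0.22],
                  "feature2": [0.22, 0.60, 0.20]},
                 index=["Doc1", "Doc2", "Doc3"])
y = [50, 11, 99]

regression = LinearRegression().fit(X, y)

# zip: a list of (feature name, coefficient) pairs, in column order
pairs = list(zip(X.columns, regression.coef_))
print(pairs)

# DataFrame: one row per feature, largest absolute coefficient first
coef_df = pd.DataFrame({"Feature": X.columns,
                        "Coefficient": regression.coef_})
coef_df = coef_df.sort_values("Coefficient", key=lambda s: s.abs(),
                              ascending=False)
print(coef_df)
```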
This is the easiest and most intuitive way:
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)
or the same but transposing index and columns
pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
Suppose your training data X is the DataFrame 'df_X'; then you can map the column names to the coefficients in a dictionary and feed it into a pandas DataFrame to get the mapping:
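A minimal sketch of such a mapping, assuming made-up data and a LinearRegression model named `model` (both are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data; only the column names matter for the mapping.
df_X = pd.DataFrame({"feature1": [0.44, 0.11, 0.22, 0.35],
                     "feature2": [0.22, 0.60, 0.20, 0.41]})
y = [50, 11, 99, 42]

model = LinearRegression().fit(df_X, y)

# Dictionary from column name to coefficient, then a two-column table.
coef_map = dict(zip(df_X.columns, model.coef_))
coef_table = pd.DataFrame(list(coef_map.items()),
                          columns=["Feature", "Coefficient"])
print(coef_table)
```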
Try putting them in a series with the data columns names as index:
coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
