cross_validation for time series in scikit learn machine learning - python-3.x

I wasn't able to find information I am looking for so I will post my question here.
I am just venturing into machine learning. I did my first multiple regression for a time series using scikit learn library. My code is as shown below
X = df[feature_cols]
y = df[['scheduled_amount']]
index= y.reset_index().drop('scheduled_amount', axis=1)
linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
li=[]
for train_index, test_index in tscv.split(X):
train = index.iloc[train_index]
train_start, train_end = train.iloc[0,0], train.iloc[-1,0]
test = index.iloc[test_index]
test_start, test_end = test.iloc[0,0], test.iloc[-1,0]
X_train, X_test = X[train_start:train_end], X[test_start:test_end]
y_train, y_test = y[train_start:train_end], y[test_start:test_end]
linreg.fit(X_train, y_train)
y_predict = linreg.predict(X_test)
print('RSS:' + str(linreg.score(X_test, y_test)))
y_test['predictec_amount'] = y_predict
y_test.plot()
Not that my data is a time series data and I want to keep the datetime index in my Dataframe when I'm fitting my model.
I am using the TimeSeriesSplit for cross-validation. I still don't really understand the cross validation thing.
First is there a need for a cross-validation in a time series dataset. Second should I use the last linear_coeff_ or should I get the average of all of them to use for my future prediction.

Yes there is a need for cross-validation in a timeseries dataset. Basically you need to ensure your model does not overfit your current test and is able to capture past seasonal changes so you can have some confidence in the model doing the same in the future. This method is also used to choose model hyperparameters (i.e. alpha in a Ridge regression).
In order to make future predictions, you should refit your regressor with the whole data and the best hyperparameters or, as #Marcus V. mentioned in the coments, maybe is best to train it only with the most recent data.

Related

k-NN GridSearchCV taking extremely long time to execute

I am attempting to use sklearn to train a KNN model on the MNIST classification task. When I try to tune my parameters using either sklearn's GridSearchCV or RandomisedSearchCV classes, my code is taking an extremely long time to execute.
As an experiment, I created a KNN model using KNeighborsClassifier() with the default parameters and passed these same parameters to GridSearchCV. Afaik, this should mean GridSearchCV only has single set of parameters and so should effectively not perform a "search". I then called the .fit() methods of both on the training data and timed their execution (see code below). The KNN model's .fit() method took about 11 seconds to run, whereas the GridSearchCV model took over 20 minutes.
I understand that GridSearchCV should take slightly longer as it is performing 5-fold cross validation, but the difference in execution time seems too large for it to be explained by that.
Am I doing something with my GridSearchCV call that it causing it to take such a long time to execute? And is there anything that I can do to accelerate it?
import sklearn
import time
# importing models
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Importing data
from sklearn.datasets import fetch_openml
mnist = fetch_openml(name='mnist_784')
print("data loaded")
# splitting the data into stratified train & test sets
X, y = mnist.data, mnist.target # mnist mj.data.shape is (n_samples, n_features)
sss = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 0)
for train_index, test_index in sss.split(X,y):
X_train, y_train = X[train_index], y[train_index]
X_test, y_test = X[test_index], y[test_index]
print("data split")
# Data has no missing values and is preprocessed, so no cleaing needed.
# using a KNN model, as recommended
knn = KNeighborsClassifier()
print("model created")
print("training model")
start = time.time()
knn.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
# Parameter tuning.
# starting by performing a broad-range search on n_neighbours to work out the
# rough scale the parameter should be on
print("beginning param tuning")
params = {'n_neighbors':[5],
'weights':['uniform'],
'leaf_size':[30]
}
paramSearch = GridSearchCV(
estimator = knn,
param_grid = params,
cv=5,
n_jobs = -1)
start = time.time()
paramSearch.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
With vanilla KNN, the costly procedure is predicting, not fitting: fitting just saves a copy of the data, and then predicting has to do the work of finding nearest neighbors. So since your search involves scoring on each test fold, that's going to take a lot more time than just fitting. A better comparison would have you predict on the training set in the no-search section.
However, sklearn does have different options for the algorithm parameter, which aim to trade away some of the prediction complexity for added training time, by building a search structure so that fewer comparisons are needed at prediction time. With the default algorithm='auto', you're probably building a ball tree, and so the effect of the first paragraph won't be so profound. I suspect this is still the issue though: now the training time will be non-neglibible, but the scoring portion in the search is what takes most of the time.

Viewing model coefficients for a single prediction

I have a logistic regression model housed in a scikit-learn pipeline using the following:
pipeline = make_pipeline(
StandardScaler(),
LogisticRegressionCV(
solver='lbfgs',
cv=10,
scoring='roc_auc',
class_weight='balanced'
)
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
I can view the model's coefficients for predictions as a whole with this code ...
# Look at model's coefficients to see what features are most important
plt.rcParams['figure.dpi'] = 50
model = pipeline.named_steps['logisticregressioncv']
coefficients = pd.Series(model.coef_[0], X_train.columns)
plt.figure(figsize=(10,12))
coefficients.sort_values().plot.barh(color='grey');
Which returns a bar plot of the features and their coefficients.
What I'm trying to do is be able to see how different input values for a single observation impact its prediction. The idea is to be able to run predictions on a sample population and examine the group with "low" predictions ... for example if I run predictions for 10 observations, I'd like to see how different input values impacted each of those 10 predictions, individually.
Recalled that I can achieve this via Shap Values using something along the following (but using LinearExplainer instead of TreeExplainer):
# Instantiate model and encoder outside of pipeline for
# use with shap
model = RandomForestClassifier( random_state=25)
# Fit on train, score on val
model.fit(X_train_encoded, y_train2)
y_pred_shap = model.predict(X_val_encoded)
# Get an individual observation to explain.
row = X_test_encoded.iloc[[-3]]
# Why did the model predict this?
# Look at a Shapley Values Force Plot
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)
shap.initjs()
shap.force_plot(
base_value=explainer.expected_value[1],
shap_values=shap_values[1],
features=row
)```

GridSearchCv Pipeline MultiOutputClassifier with XGBoostClassifier - how to pass early_stopping_rounds and eval_set?

I want to do multioutput prediction of labels and continuous data. My data consists of time series, one 10 time-points series of 30 observables per sample. I want to predict 10 labels that are binary, and 5 that are continuous, based on this data.
For the sake of simplicity I have flattened the time series data - ending up with one row per sample.
Since there are many labels to predict about the same system, and since there exists relationships between these, I want to use MutliOutputPrediction to do so. My idea is to divide the task into two parts; one for MultiOutputClassification, another for MultiOutputRegression.
I generally like XGBoost and wish to use it for this task, but of course I want to prevent overfitting when doing so. So I have a piece of code as follows, and I wish to pass the early_stopping_rounds to the fit method of the XGBClassifier, but don't know how to.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
pipeline = Pipeline([
('imputer', SimpleImputer()), # XGBoost can deal with NaNs, but MultiOutputClassifier cannot
('classifier', MultiOutputClassifier(XGBClassifier()))
])
param_grid = dict(
classifier__estimator__n_estimators=[100], # this works
# classifier__estimator__early_stopping_rounds=[30], # needs to be passed to .fit
# classifier__estimator__scale_pos_weight=[scale_pos_weight], # XGBoostError: Invalid Parameter format for scale_pos_weight expect float
)
clf = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', refit='roc_auc', cv=5, n_jobs=-1)
clf.fit(X_train, y_train[CLASSIFICATION_LABELS])
y_hat_proba = np.array(clf.predict_proba(X_test))
y_hat = pd.DataFrame(np.array([y_hat_proba[:, i, 0] for i in range(y_hat_proba.shape[1])]), columns=CLASSIFICATION_LABELS)
auc_roc_scores = np.array([roc_auc_score(y_test[label], (y_hat[label] > 0.5).astype(int)) for label in y_hat.columns])
print(f'average ROC AUC score: {np.mean(auc_roc_scores).round(3)}+/-{np.std(auc_roc_scores).round(3)}')
>>> average ROC AUC score: 0.499+/-0.002
I tried passing it to fit as follows:
classifier__estimator__early_stopping_rounds=30
classifier__early_stopping_rounds=30
I get AUC ROC scores of 0.5 on the labels, which means this clearly isn't working and hence why I want to pass the early_stopping_rounds parameter and the eval_set. I suppose that being able to pass scale_pos_weight could also be useful, but probably doesn't work for MultiOutput prediction. At the moment I get the feeling that this is not the way to go to solve this, and in case you agree I would appreciate alternative suggestions.

How to get a RMSE value

I already fit the equation. Now I want the RMSE value
q3_1=data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']]
q3_2=data1[['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors','zipcode','condition','grade','waterfront','view','sqft_above','sqft_basement','yr_built','yr_renovated',
'lat', 'long','sqft_living15','sqft_lot15']]
reg = LinearRegression()
reg.fit(q3_1,data1.price)
reg.fit(q3_2,data1.price)
I am not able to proceed from here. I need the RMSE value in both the cases.
As I can understand, you are using TensorFlow on Google Colab.
I don't know exactly what is your LinearRegression object, but IĀ suppose that it is a Keras model with a single node.
Hence, I have a question, how do you train the same model (your reg instance) with datasets with different schema -- one with 6 columns, the other with 16?
By the way, during training/fitting, keras is able to give you the MSE of your epoch, as well as a validation MSE if you provide a validation dataset. Finally, you can use the evaluate method which:
Returns the loss value & metrics values for the model [...]
Just use the "mean_squared_error" metric.
Edit
As you are using scikit-learn you have to take care of the metric yourself.
You can use the predict method to get the predictions from your trained model against a dataset.
Then, there is the mean_squared_error metric which is straighforward to use.
train_x, train_y = data1.features[:-100], data1.price[:-100]
test_x, test_y = data1.features[-100:], data1.price[-100:]
reg = LinearRegression()
reg.fit(train_x, train_y)
predictions = reg.predict(test_x)
mse = sklearn.metrics.mean_squared_error(test_y, predictions)
print("RMSE: %s" % math.sqrt(mse))

Scaling of stock data

I am trying to apply machine learning on stock prediction, and I run into problem regarding scaling on future unseen (much higher) stock close value.
Lets say I use random forrest regression on predicting stock price. I break the data into train set and test set.
For the train set, I use standardscaler, and do fit and transform
And then I use regressor to fit
For the test set, I use standardscaler, and do transform
And then I use regressor to predict, and compare to test label
If I plot predict and test label on a graph, predict seems to max out or ceiling. The problem is that standardscaler fit on train set, test set (later in the timeline) have much higher value, the algorithm does not know what to do with these extreme data
def test(X, y):
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)
# preprocess the data
pipeline = Pipeline([
('std_scaler', StandardScaler()),
])
# model = LinearRegression()
model = RandomForestRegressor(n_estimators=20, random_state=0)
# preprocessing fit transform on train data
X_train = pipeline.fit_transform(X_train)
# fit model on train data with train label
model.fit(X_train, y_train)
# transform on test data
X_test = pipeline.transform(X_test)
# predict on test data
y_pred = model.predict(X_test)
# print(np.sqrt(mean_squared_error(y_test, y_pred)))
d = {'actual': y_test, 'predict': y_pred}
plot_data = pd.DataFrame.from_dict(d)
sns.lineplot(data=plot_data)
plt.show()
What should be done with the scaling?
This is what I got for plotting prediction, actual close price vs time
The problem mainly comes from the model you are using. RandomForest regressor is created upon Decision Trees. It is learning to map an input to an output for every examples in the training set. Consequently RandomForest regressor will work for middle values but for extreme values that it hasn't seen during training it will of course perform has your picture is showing.
What you want, is to learn a function directly using linear/polynomial regression or more advanced algorithms like ARIMA.

Resources