Out of Sample Forecasting using Neural Network in Keras (Python)

I am doing a time series forecasting exercise using the window method, but I am struggling to understand how to do the out-of-sample forecast.
Here is the code:
import tensorflow as tf

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    dataset = tf.data.Dataset.from_tensor_slices(series)
    # Each window holds window_size features plus one label.
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    # Shuffle the windows, then split each into (features, label).
    dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
    dataset = dataset.batch(batch_size).prefetch(1)
    return dataset

dataset = windowed_dataset(x_train, window_size, batch_size, shuffle_buffer_size)
The function windowed_dataset splits the univariate time series into windows. Imagine we have a dataset as follows:
dataset = tf.data.Dataset.range(10)
for val in dataset:
    print(val.numpy())
0
1
2
3
4
5
6
7
8
9
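The windowed_dataset function converts the series into windows (here with window_size = 4), with the x features on the left and the y label on the right. The rows appear out of order because the dataset is shuffled: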
[2 3 4 5] [6]
[4 5 6 7] [8]
[3 4 5 6] [7]
[1 2 3 4] [5]
[5 6 7 8] [9]
[0 1 2 3] [4]
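To make the windowing concrete without TensorFlow, here is a minimal NumPy sketch that builds the same (features, label) pairs, kept in their original time order. The helper name make_windows is hypothetical and is not part of the question's code.

```python
import numpy as np

def make_windows(series, window_size):
    """Return (X, y): each row of X holds window_size consecutive values,
    and y holds the value that immediately follows each window."""
    series = np.asarray(series)
    n_windows = len(series) - window_size
    X = np.stack([series[i:i + window_size] for i in range(n_windows)])
    y = series[window_size:]
    return X, y

X, y = make_windows(np.arange(10), window_size=4)
for features, label in zip(X, y):
    print(features, label)
```

With the series 0..9 and a window of 4 this yields six rows, the same six pairs listed above but unshuffled.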
In the next step, we implement the neural network model on the training dataset as follows:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, input_shape=[window_size], activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1)
])
# Recent Keras versions use learning_rate; the older lr argument is deprecated.
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-6, momentum=0.9))
model.fit(dataset, epochs=100, verbose=0)
Up to here, I am fine with the code. However, I am struggling to understand the out-of-sample forecasting shown below:
forecast = []
for time in range(len(series) - window_size):
    forecast.append(model.predict(series[time:time + window_size][np.newaxis]))
forecast = forecast[split_time-window_size:]
Can someone please explain why we are using the loop for time in range(len(series) - window_size) here? Why not simply call model.predict(dataset_validation) for the validation part and model.predict(dataset) for the training part?
I don't understand the need for the for loop, because this is not a rolling forecast: we are not re-training the model.
While I understand why the data science community structures the dataset this way, I personally find it a lot clearer to split X and y, fit with model.fit(X, y, epochs=100, verbose=0), and predict with model.predict(X).

The for loop returns the predictions in time order, whereas model.predict(dataset_validation) returns them in shuffled order (assuming you shuffled the dataset).
As for the point of using datasets - it can just help with code organization. There is no need to ever use one if you don't want to.
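To illustrate the ordering point, here is a small sketch showing that the loop is equivalent to one batched prediction over the same windows, as long as those windows are stacked in time order and not shuffled. The function toy_predict is a hypothetical stand-in for model.predict (it just averages each window) so that the snippet runs without a trained model.

```python
import numpy as np

def toy_predict(batch):
    """Hypothetical stand-in for model.predict: (n, window_size) -> (n, 1)."""
    return batch.mean(axis=1, keepdims=True)

series = np.arange(20, dtype=float)
window_size = 4
n_windows = len(series) - window_size

# Loop version, as in the question: one window per call, in time order.
loop_forecast = np.array([
    toy_predict(series[t:t + window_size][np.newaxis])
    for t in range(n_windows)
])[:, 0, 0]

# Batched version: stack the same windows, unshuffled, and predict once.
windows = np.stack([series[t:t + window_size] for t in range(n_windows)])
batch_forecast = toy_predict(windows)[:, 0]

print(np.allclose(loop_forecast, batch_forecast))
```

With a real Keras model, the unshuffled windows array could be passed to model.predict in a single call in the same way, which is usually faster than predicting one window at a time.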

Related

What is Coef_.T? Purpose of using T [duplicate]

coef_ holds the coefficients of a fitted linear model in scikit-learn. In the code below, .T is appended to coef_, and I could not find out what it does. What is .T here?
for C, marker in zip([0.001, 1, 100], ['o', '^', 'v']):
    lr_l1 = LogisticRegression(C=C, penalty="l1").fit(X_train, y_train)
    print("Training accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
        C, lr_l1.score(X_train, y_train)))
    print("Test accuracy of l1 logreg with C={:.3f}: {:.2f}".format(
        C, lr_l1.score(X_test, y_test)))
    plt.plot(lr_l1.coef_.T, marker, label="C={:.3f}".format(C))
".T" method means Transpose which switches rows & columns
If you have a matrix m:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Then m.T would be:
[[1 4 7]
 [2 5 8]
 [3 6 9]]
It looks like it's used in this line:
plt.plot(lr_l1.coef_.T, ...)
to make sure the coefficients are plotted in the expected way: plt.plot draws one series per column of a 2D array, so transposing gives one series per class with one point per feature. If the model was built with sklearn's LogisticRegression, you can review its documentation.
coef_ has shape (n_classes, n_features), or (1, n_features) for a binary problem, so
coef_.T has shape (n_features, n_classes)
Here is a notebook that shows how this works
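As a quick check of the shapes, here is a minimal NumPy sketch. The array coef is a made-up stand-in for a fitted model's coef_ with 3 classes and 4 features, not real coefficients.

```python
import numpy as np

# Hypothetical coefficient matrix: 3 classes (rows), 4 features (columns).
coef = np.arange(12).reshape(3, 4)

print(coef.shape)    # (3, 4)
print(coef.T.shape)  # (4, 3)

# Element (i, j) of the original is element (j, i) of the transpose.
print(coef[1, 2] == coef.T[2, 1])  # True
```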

How can I perform GridSearchCV but cross validate using multiple validation sets?

I have a training set, training_set, of m observations and n features, and three different validation sets, val_a, val_b, and val_c, which don't leak information to one another.
I would like to perform hyperparameter tuning via HalvingGridSearchCV, where I fit models on training_set, and validate on all three validation sets separately, and then take the score to be the average score for all three (or the lowest score).
The reason is that the three validation sets are observations of the samples at three distinct time points (A, B, C), while the training set contains observations from time point A only. Thus, a model trained on training_set and evaluated on val_a would not necessarily be best for val_b and val_c.
Also, concatenating all of the sets via training_set = pd.concat([training_set, val_a, val_b, val_c]), and then performing a variant of GroupShuffleSplit is non-ideal, as this results in leaking information from different time points to the model.
Thus far here's what I've tried:
import pandas as pd
from sklearn.model_selection import PredefinedSplit
# Assume each dataset has 4 observations.
tf = [-1] * len(training_set)
training_set = pd.concat([training_set, val_a, val_b, val_c])
tf += [0] * len(val_a) + [1] * len(val_b) + [2] * len(val_c)
print("Test fold:", tf)
pds = PredefinedSplit(test_fold=tf)
# gs = HalvingGridSearchCV(estimator = LGBMRegressor(), param_grid = param_grid, cv = pds, scoring = 'r2', refit = False, min_resources = 'exhaust')
for train_index, test_index in pds.split():
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
Test fold: [-1, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
TRAIN: [ 0 1 2 3 8 9 10 11 12 13 14 15] TEST: [4 5 6 7]
TRAIN: [ 0 1 2 3 4 5 6 7 12 13 14 15] TEST: [ 8 9 10 11]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11] TEST: [12 13 14 15]
As you can see, this generates a 3-fold cross-validation where each validation set is held out once and included in the training set the other times. I know that -1 keeps observations out of every test set, but there is no value that keeps observations out of every training set.
Thank you!
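One possible way to get the behaviour described above is to build the splits by hand instead of using PredefinedSplit, so that every fold trains on the original training rows only. The sketch below assumes the four sets are concatenated in the order [training_set, val_a, val_b, val_c]; the sizes and the name custom_cv are hypothetical.

```python
import numpy as np

# Hypothetical sizes: 4 observations per set, as in the example above.
n_train, n_a, n_b, n_c = 4, 4, 4, 4

# Row positions of each set after concatenation.
train_idx = np.arange(n_train)
offsets = np.cumsum([n_train, n_a, n_b, n_c])
val_indices = [np.arange(start, stop)
               for start, stop in zip(offsets[:-1], offsets[1:])]

# Every split trains on training_set only and tests on one validation set.
custom_cv = [(train_idx, val_idx) for val_idx in val_indices]

for tr, te in custom_cv:
    print("TRAIN:", tr, "TEST:", te)
```

scikit-learn's search estimators document cv as also accepting an iterable of (train, test) index arrays, so a list like custom_cv should be usable as the cv argument. The reported mean test score would then be the average over the three validation sets; taking the lowest score instead would need the per-split scores from cv_results_.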

How to generate cross validation over different dataframes for supervised classification?

Imagine I have 4 dataframes with different length of rows but same number of columns like this: df1(200 rows, 4 columns), df2(100, 4), df3(300, 4) and df4(250, 4).
I would like to make a supervised classification between these dataframes (always using 3 for training and 1 for test/validation) and discover which combination gives me the better accuracy score. This is an example of a bigger volume of data and I would like to automate it by making a cross validation.
I thought that I could try to create a new column for each dataframe with their specific name and then concat all of them. And then, maybe, create a mask that would differentiate the training and test sets by these new columns. But I still do not know how to do this cross validation between them.
The dataframes would be like this:
concatenated_dfs:
feat1 feat2 feat3 feat4 name
0 4 6 57 78 df1
1 1 2 50 78 df1
2 1 1 57 78 df1
. . . . . .
. . . . . .
. . . . . .
849 3 10 50 80 df4
Could anyone show me how to do that with some code? You can use any scikit-learn classification algorithm if you want. Thanks!
You can use scikit-learn's cross_val_score with a custom iterable that generates the indices for the training-test splits in your data. Here is an example:
# Setup - creating fake data to match your description
import numpy as np
import pandas as pd
counts = [200, 100, 300, 250]  # number of rows in df1..df4, taken from the question
df = pd.DataFrame(data={"name": [x for l in [[f"df{i}"]*c for i, c in enumerate(counts, 1)] for x in l]})
for i in range(1, 5):
    df[f"feat{i}"] = np.random.randn(len(df))
X = df[[c for c in df.columns if c != "name"]]
y = np.random.randint(0, 2, len(df))
# Iterable to generate the training-test splits:
# each dataframe is used once as the test set, the other three as training data.
indices = list()
for name in df["name"].unique():
    train = df.loc[df["name"] != name].index
    test = df.loc[df["name"] == name].index
    indices.append((train, test))
# Example model - logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# Using cross-val score with the custom indices:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=indices)

Behaviour of train_test_split() from Scikit-learn

I am curious how the train_test_split() method of Scikit-learn will behave in the following scenario:
An imaginary dataset:
id, count, size
1, 4, 8
2, 5, 9
3, 6, 0
say I would divide it into two separate sets like this (keeping 'id' in both):
id, count | id, size
1, 4 | 1, 8
2, 5 | 2, 9
3, 6 | 3, 0
And split them both with train_test_split() with the same random_state of 0. Would the order of both be the same with 'id' as reference? (since you are shuffling the same dataset but with different parts left out)
I am curious as to how this works because I have two models. The first one gets trained with the dataset and adds its results to the dataset, part of which is then used to train the second model.
When doing this it's important that when testing the generalization of the second model, no data points are used which were also used to train the first model. This is because the data was 'seen before' and the model will know what to do with it, so then you are not testing the generalization to new data anymore.
It would be great if train_test_split() would shuffle it the same since then one would not need to keep track of what data was used to train the first algorithm to prevent contamination of the test results.
They should have the same resulting indices if you use the same random_state in each call, since both frames have the same number of rows.
However, you could also just reverse the order of operations: call train_test_split on the parent dataset, then create the two subsets from each of the resulting train and test sets.
Example:
print(df)
id count size
0 1 4 8
1 2 5 9
2 3 6 0
from sklearn.model_selection import train_test_split
dfa = df[['id', 'count']].copy()
dfb = df[['id', 'size']].copy()
rstate = 123
traina, testa = train_test_split(dfa, random_state=rstate)
trainb, testb = train_test_split(dfb, random_state=rstate)
assert traina.index.equals(trainb.index)
# The assertion passes: both splits select the same rows.
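The reversed order of operations mentioned above (split the parent first, then take the column subsets) could look like the following sketch. It rebuilds the question's three-row dataset; the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"id": [1, 2, 3], "count": [4, 5, 6], "size": [8, 9, 0]})

# Split the parent frame once, then derive both column subsets from each part.
train, test = train_test_split(df, random_state=123)

train_a, train_b = train[["id", "count"]], train[["id", "size"]]
test_a, test_b = test[["id", "count"]], test[["id", "size"]]

# The ids line up by construction, and no id appears in both train and test.
print(train_a["id"].tolist() == train_b["id"].tolist())
print(set(train["id"]).isdisjoint(test["id"]))
```

Because there is only one split, there is nothing to keep in sync between the two models: the second model's test rows can never overlap with the first model's training rows.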

ValueError: Unknown label type: while implementing MLPClassifier

I have a dataframe with columns year, month, day, hour, minute, second, and Daily_KWH_System. I need to predict the daily KWH using a neural network. Please let me know how to go about it.
Daily_KWH_System year month day hour minute second
0 4136.900384 2016 9 7 0 0 0
1 3061.657187 2016 9 8 0 0 0
2 4099.614033 2016 9 9 0 0 0
3 3922.490275 2016 9 10 0 0 0
4 3957.128982 2016 9 11 0 0 0
I'm getting a ValueError when I fit the model.
code so far:
X = df[['year','month','day','hour','minute','second']]
y = df['Daily_KWH_System']
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit only to the training data
scaler.fit(X_train)
#y_train.shape
#X_train.shape
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
#y_train = np.asarray(df['Daily_KWH_System'], dtype="|S6")
mlp.fit(X_train,y_train)
Error:
ValueError: Unknown label type: (array([ 2.27016856e+02, 3.02173014e+03, 4.29404190e+03,
2.41273427e+02, 1.76714247e+02, 4.23374425e+03,
First of all, this is a regression problem and not a classification problem, as the values in the Daily_KWH_System column do not form a set of labels. Instead, they seem to be (at least based on the provided example) real numbers.
If you want to approach it as a classification problem regardless, then according to sklearn documentation:
"When doing classification in scikit-learn, y is a vector of integers or strings."
In your case, y is a vector of floats, and therefore you get the error. Thus, instead of the line
y = df['Daily_KWH_System']
write the line
y = np.asarray(df['Daily_KWH_System'], dtype="|S6")
and this will resolve the issue. (You can read more about this approach here: Python RandomForest - Unknown label Error)
Yet, as regression is more appropriate in this case, then instead of the above change, replace the lines
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(30,30,30))
with
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
The code will run without throwing an error (but there certainly isn't enough data to check whether the model that we get performs well).
With that being said, I don't think that this is the right approach for choosing features for this problem.
In this problem we deal with a sequence of real numbers that form a time series. One reasonable feature is the number of seconds (or minutes, hours, days, etc.) that have passed since the starting point. Since this particular data contains only days, months and years (the other values are always 0), we could use the number of days that have passed since the beginning as a feature. Then your data frame will look like:
Daily_KWH_System days_passed
0 4136.900384 0
1 3061.657187 1
2 4099.614033 2
3 3922.490275 3
4 3957.128982 4
You could take the values in the column days_passed as features and the values in Daily_KWH_System as targets. You may also add some indicator features. For example, if you think that the end of the year may affect the target, you can add an indicator feature that indicates whether the month is December or not.
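A minimal sketch of how these two features could be derived, using only the standard library. The rows are the five sample dates from the question; the names days_passed and is_december follow the description above, and the rest is illustrative.

```python
from datetime import date

# (year, month, day) rows copied from the question's sample.
rows = [(2016, 9, 7), (2016, 9, 8), (2016, 9, 9), (2016, 9, 10), (2016, 9, 11)]

dates = [date(y, m, d) for y, m, d in rows]
start = min(dates)

# Number of days since the first observation.
days_passed = [(d - start).days for d in dates]
# Indicator feature: 1 if the observation falls in December, else 0.
is_december = [int(d.month == 12) for d in dates]

print(days_passed)   # [0, 1, 2, 3, 4]
print(is_december)   # [0, 0, 0, 0, 0]
```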
If the data is indeed daily (at least in this example you have one data point per day) and you want to tackle this problem with neural networks, then another reasonable approach would be to handle it as a time series and try to fit a recurrent neural network. Here are a couple of great blog posts that describe this approach:
http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
http://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/
The fit() function expects y to be a 1D array-like. Selecting a single column with df['Daily_KWH_System'] already returns a 1D Series, but selecting with a list of columns, as in df[['Daily_KWH_System']], returns a 2D DataFrame. If your target ended up 2D, convert it into an actual 1D list, as expected by fit:
y = list(df['Daily_KWH_System'])
Use a regressor instead. This resolves the error caused by the float-valued targets.
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(solver='lbfgs',alpha=0.001,hidden_layer_sizes=(10,10))
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
Instead of
mlp.fit(X_train,y_train)
use this
mlp.fit(X_train,y_train.values)

Resources