How to set Keras TimeseriesGenerator to predict the second next value? - keras

Currently I have the following code using TimeseriesGenerator from Keras:
TimeseriesGenerator(train, prediction, length=TIME_STEPS, batch_size=1)
By default this pairs each input window ending at t with the target at t+1. That makes sense, but I want to predict t+2, so the window ending at t should be paired with the target at t+2.
Is there any way to do it using TimeseriesGenerator?

The quickest solution is to just shift your predictions by 1, i.e.:
TimeseriesGenerator(train[:-1], prediction[1:], length=TIME_STEPS, batch_size=1)
Note that you have to trim the train set, so both datasets have equal lengths.
You can also use the timeseries_dataset_from_array function, which lets you align the data and targets according to your needs. From the documentation:
data: Numpy array or eager tensor containing consecutive data points
(timesteps). Axis 0 is expected to be the time dimension.
targets:
Targets corresponding to timesteps in data. It should have same length
as data. targets[i] should be the target corresponding to the window
that starts at index i (see example 2 below). Pass None if you don't
have target data (in this case the dataset will only yield the input
data).
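So in your case it would be something like this (a window starting at index i ends at i + TIME_STEPS - 1, so its t+2 target sits at index i + TIME_STEPS + 1):
tf.keras.preprocessing.timeseries_dataset_from_array(
    train[:-(TIME_STEPS + 1)],
    prediction[TIME_STEPS + 1:],
    sequence_length=TIME_STEPS,
    batch_size=1
)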
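To check the index arithmetic without TensorFlow, here is a minimal NumPy-only sketch. The helper name windows_with_horizon and the toy series are made up for illustration; it only mimics the window/target alignment and is not how either Keras utility is implemented.

```python
import numpy as np

def windows_with_horizon(series, length, horizon):
    """Pair each window of `length` values with the value that lies
    `horizon` steps after the window's last element."""
    series = np.asarray(series)
    n = len(series) - length - horizon + 1  # number of complete (window, target) pairs
    X = np.stack([series[i:i + length] for i in range(n)])
    first_target = length + horizon - 1     # index of the target for the first window
    y = series[first_target:first_target + n]
    return X, y

series = np.arange(10)  # toy data: the value equals its own index
X, y = windows_with_horizon(series, length=3, horizon=2)

print(X[0], y[0])    # first window and its t+2 target
print(X[-1], y[-1])  # last window and its t+2 target
```

With horizon=1 the same helper reproduces the default "next value" alignment, which makes it easy to compare the two cases side by side.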

Related

Pytorch model gradients are printed correctly but copied wrongly

I want to copy the gradients of loss, with respect to weight, for different data samples using pytorch. In the code below, I am iterating one sample each time from the data loader (batch size = 1) and collecting gradients for 1st fully connected (fc1) layer. Gradients should be different for different samples. The print function shows correct gradients, which are different for different samples. But when I store them in a list, I get the same gradients repeatedly. Any suggestions would be much appreciated. Thanks in advance!
grad_list = []
for data in test_loader:
    inputs, labels = data[0], data[1]
    inputs = torch.autograd.Variable(inputs)
    labels = torch.autograd.Variable(labels)
    # zero the parameter gradients
    optimizer.zero_grad()
    # forward + backward
    output = target_model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    grad_list.append(target_model.fc1.weight.grad.data)
    print(target_model.fc1.weight.grad.data)
Try using clone and detach instead:
grad_list.append(target_model.fc1.weight.grad.clone().detach())
The data attribute you are appending does not copy anything: it returns a tensor that shares storage with the parameter's gradient. If that gradient buffer is zeroed and refilled in place on later iterations (as in the loop above), every entry in your list ends up showing the latest values. What you need to do is create a copy of the gradient tensor (with clone) and detach it from the computational graph (with detach), so that the stored values are independent of later backward passes.
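The aliasing can be reproduced in a few lines. This is a minimal sketch, not the asker's model: the tensor w stands in for fc1.weight, and the explicit w.grad.zero_() stands in for an in-place zeroing of the gradients. Depending on the PyTorch version, optimizer.zero_grad() may instead reset gradients to None, in which case the symptom may not appear.

```python
import torch

w = torch.ones(2, requires_grad=True)  # stand-in for fc1.weight

aliased, copied = [], []
for scale in (1.0, 2.0):
    if w.grad is not None:
        w.grad.zero_()                 # zero the gradient buffer in place
    loss = (w * scale).sum()           # d(loss)/dw equals `scale` for every element
    loss.backward()
    aliased.append(w.grad.data)             # shares storage with w.grad
    copied.append(w.grad.clone().detach())  # independent snapshot

# Both "aliased" entries point at the same memory, so they show the last gradient.
same_storage = aliased[0].data_ptr() == aliased[1].data_ptr()
print(same_storage)
print(aliased)
print(copied)
```

The copied list keeps one distinct gradient per iteration, at the cost of one extra tensor allocation each time.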

Why do item IDs have to be sequential, starting at zero, before embedding?

I am learning collaborative filtering from this blog post, Deep Learning With Keras: Recommender Systems.
The tutorial is good, and the code works well. Here is my code.
One thing confuses me. The author said:
The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero to use for modeling (you'll see why later).
user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()
But he didn't seem to mention the reason, and I don't know why this is needed. Can someone explain it to me?
Embedding layers expect their inputs to be integer indices in the range 0 to input dimension - 1.
The first argument of Embedding is the input dimension, i.e. the number of rows in the lookup table.
So, if an input value is greater than or equal to the input dimension, there is no row to look up for it: depending on the backend and device, you get an error or a meaningless (e.g. all-zero) vector.
Embedding assumes that the maximum value in the input is input dimension - 1 (indices start from 0).
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding?hl=ja
As an example, the following code has valid embeddings only for the input [4,3]; the input [7,8] is out of range because the input dimension is 5.
I think it is clearer to explain it with TensorFlow:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
model = Sequential()
# lookup table with 5 rows (valid indices 0..4), each row a 1-dimensional vector
model.add(Embedding(5, 1, input_length=2))
input_array = np.array([[4,3], [7,8]])
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
You can increase the input dimension to 9, and then you will get embeddings for both inputs.
You could likewise set the input dimension to the maximum ID + 1 in the original data set, but this is inefficient: the embedding table would allocate a row for every unused ID in between.
It is similar to one-hot encoding, where sequential IDs save a great amount of memory.
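The memory argument can be illustrated with plain NumPy, treating the embedding as a row lookup in a matrix. This is a sketch under assumptions: the raw IDs are invented, np.unique plays the role of LabelEncoder, and the random table stands in for trained embedding weights.

```python
import numpy as np

raw_ids = np.array([1001, 42, 999999, 42, 1001])  # sparse, non-sequential IDs (made up)

# Map each raw ID to its position among the sorted unique IDs: 0, 1, 2, ...
unique_ids, encoded = np.unique(raw_ids, return_inverse=True)

n_items = len(unique_ids)  # rows needed after encoding
embedding_dim = 4
rng = np.random.default_rng(0)
table = rng.normal(size=(n_items, embedding_dim))  # stand-in for learned weights

vectors = table[encoded]   # the "embedding lookup": one row per input ID

# Without encoding, the table would need a row for every integer up to the largest ID.
table_rows_without_encoding = int(raw_ids.max()) + 1
print(encoded, vectors.shape, n_items, table_rows_without_encoding)
```

Here three rows are enough after encoding, whereas indexing by the raw IDs would require a table with a million rows, almost all of them never used.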

How to use Principal Component Analysis while predicting?

Suppose my original dataset has 8 features and I apply PCA with n_components = 3 (I am using sklearn.decomposition.PCA). Then I train my model using those 3 PCA components (which are now my new features).
Do I need to apply PCA while predicting as well?
Do I need to do that even if I am predicting only one data point?
What confuses me is that when I do prediction, every data point is a row in a 2D matrix (consists of all data points that I want to predict). So if I apply PCA on just one data point, then the corresponding row vector will be converted to a zero vector.
If you fitted your model on the first three components of the PCA, you have to transform appropriately any new data. For example, consider this code taken from here:
pca = PCA(n_components=n_components, svd_solver='randomized',
whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)
In the code, they first fit PCA on the training set. Then they transform both the training and the test set with that same fitted PCA, and then they apply the model (in their case, an SVM) to the transformed data.
Even if your X_test consists of only 1 data point, you could still use PCA. Just transform your data into a 2D matrix. For example, if your data point is [1,2,0,5] then X_test=[[1,2,0,5]]. That is, it is a 2D matrix with 1 row.
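To see why a single point does not collapse to a zero vector, it helps to write the projection out by hand. The following is a NumPy-only sketch with random made-up data, not sklearn's implementation: the mean and components are learned once from the training set, and a new point is only centered and projected with those stored values. Re-fitting on the single point is what would produce zeros, because the point would then be centered on itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))  # 50 training rows, 8 features (made-up data)

# "Fit": learn the training mean and the top 3 principal directions.
mean = X_train.mean(axis=0)
_, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = vt[:3]                 # shape (3, 8)

def transform(X):
    """Project rows of X using the mean and components learned from X_train."""
    return (np.atleast_2d(X) - mean) @ components.T

x_new = rng.normal(size=8)          # one new data point
z_new = transform(x_new)            # shape (1, 3), generally non-zero

# Centering the single point on its own mean is what yields a zero vector.
row = np.atleast_2d(x_new)
refit_centered = row - row.mean(axis=0)
print(z_new, refit_centered)
```

With sklearn the equivalent is to keep the fitted PCA object and call its transform method on the new row, never fit or fit_transform.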

Can't get gridSearchCV to work for hmmlearn estimator

I've got an HMM which I can train by passing the fit function a list 'merged' of all training sequences concatenated one after another, and a list 'all_lengths' of the individual sequence lengths:
model = hmm.MultinomialHMM(n_components=3).fit(np.atleast_2d(merged).T, all_lengths)
This works, but I want to determine the optimal n_components using sklearn's GridSearchCV, which keeps giving me errors if I try the following:
tuned_parameters = [{'n_components': [1,2,3]}]
test = GridSearchCV(hmm.MultinomialHMM(), tuned_parameters, cv=5,)
test.fit(np.atleast_2d(merged).T, all_lengths)
This outputs:
ValueError: Found input variables with inconsistent numbers of samples: [515031, 28923]
Here 515031 is the length of merged, and 28923 is the length of all_lengths.

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file and reuse it later in a script. However, there are certain high-cardinality categorical fields that are translated to a one-hot matrix representation, which creates around 5000 features. The input that I get for predict will have only one of these features set, and all the rest will be zeros. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
rate: 10, // numeric
flagged: 0, //binary
host: 'somehost.com' // keeping this categorical
}
host can have around 5000 different values. Now I loaded the file to a pandas dataframe, used the get_dummies function to transform the host field to around 5000 new fields which are binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
rate: 5,
flagged: 1
host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
rate: 5,
flagged: 1
new_host: 1
}
But if I translate it to vector format, I don't know at which index to place the new_host field. Also, I don't know in advance what the other hosts are (unless I store them somewhere during the training phase).
I hope I am making some sense. Let me know if I am doing it the wrong way.
"I don't know which index to place the new_host field"
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)  # mf: a file object opened in binary write mode
# in production
model = pickle.load(mf)  # mf: the same file, reopened in binary read mode
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input is the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
('host', sklearn.preprocessing.LabelBinarizer()),
('rate', sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
# pickle.dump takes the object first, then the open binary file
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
one_hot_data], axis=1)
clf.predict(X_test)
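The same idea can be sketched end to end with OneHotEncoder, which is swapped in here for LabelBinarizer because it is the scikit-learn transformer intended for feature columns and can be told to ignore categories it has not seen. This is a minimal sketch under assumptions: the tiny data set, the host names and the labels are invented, and pickle.dumps/loads stands in for writing to and reading from a file.

```python
import pickle
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Made-up training data: numeric columns (rate, flagged) and a categorical host column.
numeric_train = np.array([[10.0, 0], [5.0, 1], [7.0, 0], [3.0, 1]])
hosts_train = np.array([["a.com"], ["b.com"], ["a.com"], ["c.com"]])
y_train = np.array([0, 1, 0, 1])

# The fitted encoder remembers the host -> column mapping learned in training.
encoder = OneHotEncoder(handle_unknown="ignore")
host_onehot = encoder.fit_transform(hosts_train).toarray()
X_train = np.hstack([numeric_train, host_onehot])
clf = SGDClassifier(random_state=0).fit(X_train, y_train)

# Persist the classifier together with the encoder.
blob = pickle.dumps({"clf": clf, "encoder": encoder})

# At prediction time: restore both and reuse the stored mapping.
restored = pickle.loads(blob)
new_numeric = np.array([[5.0, 1]])
new_host = np.array([["new_host.com"]])  # a host never seen during training
new_onehot = restored["encoder"].transform(new_host).toarray()
X_new = np.hstack([new_numeric, new_onehot])
prediction = restored["clf"].predict(X_new)
print(X_new.shape, prediction)
```

Because the encoder is stored alongside the classifier, the prediction script never needs to know the column order itself; an unseen host simply produces an all-zero one-hot block.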
