GRU/LSTM in Keras with input sequence of varying length - keras

I'm working on a smaller project to better understand RNN, in particualr LSTM and GRU. I'm not at all an expert, so please bear that in mind.
The problem I'm facing is given as data in the form of:
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame([[1, 2, 3],[1, 2, 1], [1, 3, 2],[2, 3, 1],[3, 1, 1],[3, 3, 2],[4, 3, 3]], columns=['person', 'interaction', 'group'])
person interaction group
0 1 2 3
1 1 2 1
2 1 3 2
3 2 3 1
4 3 1 1
5 3 3 2
6 4 3 3
this is just for explanation. We have different person interacting with different groups in different ways. I've already encoded the various features. The last interaction of a user is always a 3, which means selecting a certain group. In the short example above person 1 chooses group 2, person 2 chooses group 1 and so on.
My whole data set is much bigger but I would like to understand first the conceptual part before throwing models at it. The task I would like to learn is given a sequence of interaction, which group is chosen by the person. A bit more concrete, I would like to have an output a list with all groups (there are 3 groups, 1, 2, 3) sorted by the most likely choice, followed by the second and third likest group. The loss function is therefore a mean reciprocal rank.
I know that in Keras Grus/LSTM can handle various length input. So my three questions are.
The input is of the format:
(samples, timesteps, features)
writing high level code:
import keras.layers as L
import keras.models as M
model_input = L.Input(shape=(?, None, 2))
timestep=None should imply the varying size and 2 is for the feature interaction and group. But what about the samples? How do I define the batches?
For the output I'm a bit puzzled how this should look like in this example? I think for each last interaction of a person I would like to have a list of length 3. Assuming I've set up the output
model_output = L.LSTM(3, return_sequences=False)
I then want to compile it. Is there a way of using the mean reciprocal rank?
model.compile('adam', '?')
I know the questions are fairly high level, but I would like to understand first the big picture and start to play around. Any help would therefore be appreciated.

The concept you've drawn in your question is a pretty good start already. I'll add a few things to make it work, as well as a code example below:
You can specify LSTM(n_hidden, input_shape=(None, 2)) directly, instead of inserting an extra Input layer; the batch dimension is to be omitted for the definition.
Since your model is going to perform some kind of classification (based on time series data) the final layer is what we'd expect from "normal" classification as well, a Dense(num_classes, action='softmax'). Chaining the LSTM and the Dense layer together will first pass the time series input through the LSTM layer and then feed its output (determined by the number of hidden units) into the Dense layer. activation='softmax' allows to compute a class score for each class (we're going to use one-hot-encoding in a data preprocessing step, see code example below). This means class scores are not ordered, but you can always do so via np.argsort or np.argmax.
Categorical crossentropy loss is suited for comparing the classification score, so we'll use that one: model.compile(loss='categorical_crossentropy', optimizer='adam').
Since the number of interactions. i.e. the length of model input, varies from sample to sample we'll use a batch size of 1 and feed in one sample at a time.
The following is a sample implementation w.r.t to the above considerations. Note that I modified your sample data a bit, in order to provide more "reasoning" behind group choices. Also each person needs to perform at least one interaction before choosing a group (i.e. the input sequence cannot be empty); if this is not the case for your data, then introducing an additional no-op interaction (e.g. 0) can help.
import pandas as pd
import tensorflow as tf
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(10, input_shape=(None, 2))) # LSTM for arbitrary length series.
model.add(tf.keras.layers.Dense(3, activation='softmax')) # Softmax for class probabilities.
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Example interactions:
# * 1: Likes the group,
# * 2: Dislikes the group,
# * 3: Chooses the group.
df = pd.DataFrame([
[1, 1, 3],
[1, 1, 3],
[1, 2, 2],
[1, 3, 3],
[2, 2, 1],
[2, 2, 3],
[2, 1, 2],
[2, 3, 2],
[3, 1, 1],
[3, 1, 1],
[3, 1, 1],
[3, 2, 3],
[3, 2, 2],
[3, 3, 1]],
columns=['person', 'interaction', 'group']
)
data = [person[1][['interaction', 'group']].values for person in df.groupby('person')]
x_train = [x[:-1] for x in data]
y_train = tf.keras.utils.to_categorical([x[-1, 1]-1 for x in data]) # Expects class labels from 0 to n (-> subtract 1).
print(x_train)
print(y_train)
class TrainGenerator(tf.keras.utils.Sequence):
def __init__(self, x, y):
self.x = x
self.y = y
def __len__(self):
return len(self.x)
def __getitem__(self, index):
# Need to expand arrays to have batch size 1.
return self.x[index][None, :, :], self.y[index][None, :]
model.fit_generator(TrainGenerator(x_train, y_train), epochs=1000)
pred = [model.predict(x[None, :, :]).ravel() for x in x_train]
for p, y in zip(pred, y_train):
print(p, y)
And the corresponding sample output:
[...]
Epoch 1000/1000
3/3 [==============================] - 0s 40ms/step - loss: 0.0037
[0.00213619 0.00241093 0.9954529 ] [0. 0. 1.]
[0.00123938 0.99718493 0.00157572] [0. 1. 0.]
[9.9632275e-01 7.5039308e-04 2.9268670e-03] [1. 0. 0.]
Using custom generator expressions: According to the documentation we can use any generator to yield the data. The generator is expected to yield batches of the data and loop over the whole data set indefinitely. When using tf.keras.utils.Sequence we do not need to specify the parameter steps_per_epoch as this will default to len(train_generator). Hence, when using a custom generator, we shall provide this parameter as well:
import itertools as it
model.fit_generator(((x_train[i % len(x_train)][None, :, :],
y_train[i % len(y_train)][None, :]) for i in it.count()),
epochs=1000,
steps_per_epoch=len(x_train))

Related

What is the recommended way to do embeddings in jax?

So I mean something where you have a categorical feature $X$ (suppose you have turned it into ints already) and say you want to embed that in some dimension using the features $A$ where $A$ is arity x n_embed.
What is the usual way to do this? Is using a for loop and vmap correct? I do not want something like jax.nn, something more efficient like
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
For example consider high arity and low embedding dim.
Is it jnp.take as in the flax.linen implementation here? https://github.com/google/flax/blob/main/flax/linen/linear.py#L624
Indeed the typical way to do this in pure jax is with jnp.take. Given array A of embeddings of shape (num_embeddings, num_features) and categorical feature x of integers shaped (n,) then the following gives you the embedding lookup.
jnp.take(A, x, axis=0) # shape: (n, num_features)
If using Flax then the recommended way would be to use the flax.linen.Embed module and would achieve the same effect:
import flax.linen as nn
class Model(nn.Module):
#nn.compact
def __call__(self, x):
emb = nn.Embed(num_embeddings, num_features)(x) # shape
Suppose that A is the embedding table and x is any shape of indices.
A[x], which is like jnp.take(A, x, axis=0) but simpler.
vmap-ed A[x], which parallelizes along axis 0 of x.
nested vmap-ed A[x], which parallelizes along all axes of x.
Here are the source code for your reference.
import jax
import jax.numpy as jnp
embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)
print("\ntake\n", jnp.take(embs, x, axis=0))
print("\nuse []\n", embs[x])
print(
"\nvmap\n",
jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0)(embs, x),
)
print(
"\nnested vmap\n",
jax.vmap(
jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0),
in_axes=[None, 0],
out_axes=0,
)(embs, x),
)
BTW, I learned the nested-vmap trick from the IREE GPT2 model code by James Bradbury.

Multilabel classification of a sequence, how to do it?

I am quite new to the deep learning field especially Keras. Here I have a simple problem of classification and I don't know how to solve it. What I don't understand is how the general process of the classification, like converting the input data into tensors, the labels, etc.
Let's say we have three classes, 1, 2, 3.
There is a sequence of classes that need to be classified as one of those classes. The dataset is for example
Sequence 1, 1, 1, 2 is labeled 2
Sequence 2, 1, 3, 3 is labeled 1
Sequence 3, 1, 2, 1 is labeled 3
and so on.
This means the input dataset will be
[[1, 1, 1, 2],
[2, 1, 3, 3],
[3, 1, 2, 1]]
and the label will be
[[2],
[1],
[3]]
Now one thing that I do understand is to one-hot encode the class. Because we have three classes, every 1 will be converted into [1, 0, 0], 2 will be [0, 1, 0] and 3 will be [0, 0, 1]. Converting the example above will give a dataset of 3 x 4 x 3, and a label of 3 x 1 x 3.
Another thing that I understand is that the last layer should be a softmax layer. This way if a test data like (e.g. [1, 2, 3, 4]) comes out, it will be softmaxed and the probabilities of this sequence belonging to class 1 or 2 or 3 will be calculated.
Am I right? If so, can you give me an explanation/example of the process of classifying these sequences?
Thank you in advance.
Here are a few clarifications that you seem to be asking about.
This point was confusing so I deleted it.
If your input data has the shape (4), then your input tensor will have the shape (batch_size, 4).
Softmax is the correct activation for your prediction (last) layer
given your desired output, because you have a classification problem
with multiple classes. This will yield output of shape (batch_size,
3). These will be the probabilities of each potential classification, summing to one across all classes. For example, if the classification is class 0, then a single prediction might look something like [0.9714,0.01127,0.01733].
Batch size isn't hard-coded to the network, hence it is represented in model.summary() as None. E.g. the network's last-layer output shape can be written (None, 3).
Unless you have an applicable alternative, a softmax prediction layer requires a categorical_crossentropy loss function.
The architecture of a network remains up to you, but you'll at least need a way in and a way out. In Keras (as you've tagged), there are a few ways to do this. Here are some examples:
Example with Keras Sequential
model = Sequential()
model.add(InputLayer(input_shape=(4,))) # sequence of length four
model.add(Dense(3, activation='softmax')) # three possible classes
Example with Keras Functional
input_tensor = Input(shape=(4,))
x = Dense(3, activation='softmax')(input_tensor)
model = Model(input_tensor, x)
Example including input tensor shape in first functional layer (either Sequential or Functional):
model = Sequential()
model.add(Dense(666, activation='relu', input_shape=(4,)))
model.add(Dense(3, activation='softmax'))
Hope that helps!

Scikit-learn R2 always zero

I'm trying to test my Scikit-learn machine learning algorithm with a simple R^2 score, but for some reason it always returns zero.
import numpy
from sklearn.metrics import r2_score
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294]).reshape(1, -1)
training = numpy.array([0, 3, 1, 0]).reshape(1, -1)
r2 = r2_score(training, prediction, multioutput="raw_values")
print r2
[ 0. 0. 0. 0.]
This is a single four-part value, not four separate values. How do I get proper R^2 scores?
If you are trying to calculate the r2 value between two vectors you should just pass two one dimensional arrays. See the documentation
In the example you provided, the first item is compared to the first item, but note you only have one list in each the prediction and training, so it is calculating R2 for 0.1567 to 0, which is 0, then it calculates it for 4.7528 to 3 which is also 0 and so on... It sounds like you want the R2 for the two vectors like the following:
prediction = numpy.array([0.1567, 4.7528, 1.1260, 0.2294])
training = numpy.array([0, 3, 1, 0])
print(r2_score(training, prediction))
0.472439485
If you have multi-dimensional arrays you can use the multioutput flag to determine what the output should look like:
#modified from the scikit-learn example
y_true = [[0.5, 1], [-1, 1], [7, -6]]
y_pred = [[0, 2], [-1, 2], [8, -5]]
print(r2_score(y_true, y_pred, multioutput='raw_values'))
array([ 0.96543779, 0.90816327])
Here the output is where the first item of each list in y_true is compared to the first item in each list of y_pred, the second item to the second and so on

How does sklean LinearSVC classify the test cases which are on the boundary?

I just did an experiment. I provided only two training cases [0, 1] and [1, 0]. They belong to two different categories. The test cases is [0, 0], which is on the decision boundary. The classifier assigns it to class 0. Is it because class 0 is the first class? Does it really make sense?
>>> X=numpy.array([[0,1],[1,0]])
>>> y=numpy.array([0,1])
>>> clf.fit_transform(X,y)
array([[0, 1],
[1, 0]])
>>> clf.predict(numpy.array([[0,0]]))
array([0])
>>> clf.decision_function(numpy.array([[0,0]]))
array([ 0.])
>>> clf.coef_
array([[ 0.66666667, -0.66666667]])
>>> clf.predict(numpy.array([[0,1]]))
array([0])
>>> clf.decision_function(numpy.array([[0,1]]))
array([-0.66666667])
>>> clf.intercept_
array([ 0.])
>>> clf.intercept_ > 0
array([False], dtype=bool)
Personally, I would take your experiment as the answer to the question.
Points sitting on the decision boundary are ambiguous. What should the behavior be? Should it predict one of the two classes? Error out? Predict NaN?
By your experiment, scikit-learn predicts 0. I would take that to mean that in the general case, it chooses the first (in lexicographical order) for the boundary cases.
If the boundary case matters for your application, you will have to write special code that checks the decision function for exact 0, and does something different. Like this:
scores = clf.decision_function( X )
predictions = scores > 0
preidctions[ scores==0 ] = np.nan

Classification with restrictions

How should I best use scikit-learn for the following supervised classification problem (simplified), with binary features:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
train_data = np.array([[0, 0, 1, 0],
[1, 0, 1, 1],
[0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier()
c.fit(train_data, train_targets)
p = c.predict(np.array([1, 1, 1, 1], dtype=bool))
print(p)
# -> [1]
That works fine. However, suppose now that I know a priori that the presence of feature 0 excludes class 1. Can additional information of this kind be easily included in the classification process?
Currently, I'm just doing some (problem-specific and heuristic) postprocessing to adjust the resulting class. I could perhaps also manually preprocess and split the dataset into two according to the feature, and train two classifiers separately (but with K such features, this ends up in 2^K splitting).
Can additional information of this kind be easily included in the classification process?
Domain-specific hacks are left to the user. The easiest way to do this is to predict probabilities...
>>> prob = c.predict_proba(X)
and then rig the probabilities to get the right class out.
>>> invalid = (prob[:, 1] == 1) & (X[:, 0] == 1)
>>> prob[invalid, 1] = -np.inf
>>> pred = c.classes_[np.argmax(prob, axis=1)]
That's -np.inf instead of 0 so the 1 label doesn't come up as a result of tie-breaking vs. other zero-probability classes.

Resources