What is the recommended way to do embeddings in jax? - jax

I mean the case where you have a categorical feature $X$ (suppose you have turned it into ints already) and you want to embed it in some dimension using an embedding matrix $A$, where $A$ has shape arity x n_embed.
What is the usual way to do this? Is using a for loop and vmap correct? I do not want something like jax.nn, but something more efficient, like
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
For example consider high arity and low embedding dim.
Is it jnp.take as in the flax.linen implementation here? https://github.com/google/flax/blob/main/flax/linen/linear.py#L624

Indeed the typical way to do this in pure jax is with jnp.take. Given array A of embeddings of shape (num_embeddings, num_features) and categorical feature x of integers shaped (n,) then the following gives you the embedding lookup.
jnp.take(A, x, axis=0) # shape: (n, num_features)
If using Flax, the recommended way is the flax.linen.Embed module, which achieves the same effect:
import flax.linen as nn
class Model(nn.Module):
    num_embeddings: int  # number of categories (arity)
    num_features: int  # embedding dimension
    @nn.compact
    def __call__(self, x):
        emb = nn.Embed(self.num_embeddings, self.num_features)(x)  # shape: x.shape + (num_features,)
        return emb
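As a rough illustration of why the lookup is preferred over a one-hot formulation, the sketch below checks that jnp.take gives the same result as multiplying a one-hot matrix by the table, without ever building that (n, num_embeddings) matrix. The table values, the indices and the toy loss function are made up for the example.

```python
import jax
import jax.numpy as jnp

num_embeddings, num_features = 5, 3
# Toy embedding table: row i holds the embedding of category i.
A = jnp.arange(num_embeddings * num_features, dtype=jnp.float32).reshape(
    num_embeddings, num_features
)
x = jnp.array([4, 0, 0, 2], dtype=jnp.int32)  # categorical feature as ints

# Direct lookup: gathers rows of A, shape (4, 3).
lookup = jnp.take(A, x, axis=0)

# Equivalent one-hot formulation: materializes a (4, 5) matrix first.
one_hot = jax.nn.one_hot(x, num_embeddings)
via_matmul = one_hot @ A

print(bool(jnp.allclose(lookup, via_matmul)))

# Gradients flow back only into the rows that were looked up.
def loss(table):
    return jnp.take(table, x, axis=0).sum()

grad = jax.grad(loss)(A)
print(grad[:, 0])  # per-row count of how often each category appears in x
```

With high arity and a low embedding dimension, the one-hot matrix is by far the largest array involved, which is why the gather-style lookup is usually the better choice.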

Suppose that A is the embedding table and x is an integer index array of any shape. Any of the following performs the lookup:
A[x], which is like jnp.take(A, x, axis=0) but simpler.
vmap-ed A[x], which parallelizes along axis 0 of x.
nested vmap-ed A[x], which parallelizes along all axes of x (here x is 2-D, so two levels of vmap).
Here is the source code for your reference.
import jax
import jax.numpy as jnp
embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)
print("\ntake\n", jnp.take(embs, x, axis=0))
print("\nuse []\n", embs[x])
print(
    "\nvmap\n",
    jax.vmap(lambda embs, x: embs[x], in_axes=(None, 0), out_axes=0)(embs, x),
)
print(
    "\nnested vmap\n",
    jax.vmap(
        jax.vmap(lambda embs, x: embs[x], in_axes=(None, 0), out_axes=0),
        in_axes=(None, 0),
        out_axes=0,
    )(embs, x),
)
BTW, I learned the nested-vmap trick from the IREE GPT2 model code by James Bradbury.
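A quick sanity check, reusing the same toy table and indices as above, that all four variants return the same array of shape x.shape + (embedding_dim,). The helper name lookup and the dictionary keys are only for this sketch.

```python
import jax
import jax.numpy as jnp

embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)

def lookup(table, idx):
    # Plain integer-array indexing into the embedding table.
    return table[idx]

results = {
    "take": jnp.take(embs, x, axis=0),
    "index": embs[x],
    "vmap": jax.vmap(lookup, in_axes=(None, 0))(embs, x),
    "nested_vmap": jax.vmap(
        jax.vmap(lookup, in_axes=(None, 0)), in_axes=(None, 0)
    )(embs, x),
}

reference = results["index"]
all_equal = all(bool(jnp.array_equal(reference, r)) for r in results.values())
print(reference.shape, all_equal)
```

Since the results agree, the choice between them is mostly about readability; plain indexing is the shortest to write.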

Related

LDA covariance matrix not match calculated covariance matrix

I'm looking to better understand the covariance_ attribute returned by scikit-learn's LDA object.
I'm sure I'm missing something, but I expect it to be the covariance matrix associated with the input data. However, when I compare .covariance_ against the covariance matrix returned by numpy.cov(), I get different results.
Can anyone help me understand what I am missing? Thanks and happy to provide any additional information.
Please find a simple example illustrating the discrepancy below.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Sample Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])
# Covariance matrix via np.cov
print(np.cov(X.T))
# Covariance matrix via LDA
clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(clf.covariance_)
In sklearn.discriminant_analysis.LinearDiscriminantAnalysis, the covariance is computed as follows:
In [1]: import numpy as np
   ...: cov = np.zeros(shape=(X.shape[1], X.shape[1]))
   ...: for c in np.unique(y):
   ...:     Xg = X[y == c, :]
   ...:     cov += np.count_nonzero(y == c) / len(y) * np.cov(Xg.T, bias=1)
   ...: print(cov)
array([[0.66666667, 0.33333333],
       [0.33333333, 0.22222222]])
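So it is the sum of the (biased) covariance matrices of the individual classes, each multiplied by a prior, which by default is the class frequency. Note that this prior is a parameter of LDA (priors). It is a within-class covariance, which is why it differs from np.cov(X.T), the covariance of the whole data set around its overall mean.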
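To make the gap between the two matrices concrete, here is a NumPy-only sketch using the sample data from the question. It recomputes the prior-weighted within-class covariance with the formula above and checks that adding the between-class part recovers the biased covariance of the whole data set. The names within, between and total are just for this example.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])

overall_mean = X.mean(axis=0)
within = np.zeros((X.shape[1], X.shape[1]))
between = np.zeros_like(within)
for c in np.unique(y):
    Xc = X[y == c]
    prior = len(Xc) / len(X)  # class frequency used as the prior
    # Biased covariance of this class around its own mean.
    within += prior * np.cov(Xc.T, bias=True)
    # Spread of the class mean around the overall mean.
    d = (Xc.mean(axis=0) - overall_mean)[:, None]
    between += prior * (d @ d.T)

total = np.cov(X.T, bias=True)  # biased covariance of all samples together
print(within)
print(np.allclose(total, within + between))
```

Note that np.cov(X.T) in the question additionally uses the unbiased normalization (dividing by n - 1), so it differs from total by a factor of n / (n - 1).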

Train/fit a Linear Regression in sklearn with only one feature/variable

I am trying to understand lasso regression, and I don't understand why it needs two input values to predict another value when it's just a 2-dimensional regression.
It says in the documentation that
clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
which I don't understand. Why is it [0,0] or [1,1] and not just [0] or [1]?
[[0,0], [1, 1], [2, 2]]
means that you have 3 samples/observations and each is characterised by 2 features/variables (2-dimensional). The input to fit must always be 2-D, with shape (n_samples, n_features).
Indeed, you could have these 3 samples with only 1 feature/variable and still be able to fit a model.
Example using 1 feature:
from sklearn import datasets
from sklearn import linear_model
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :1] # keep only the first feature; the slice :1 keeps X 2-D with shape (150, 1)
y = iris.target
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X,y)
print(clf.coef_)
print(clf.intercept_)
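For the data from the question, a minimal sketch of the single-feature case could look like this. The only extra step is reshaping the 1-D array of values into a column, since scikit-learn estimators expect X with shape (n_samples, n_features). The variable names are made up for the example.

```python
import numpy as np
from sklearn import linear_model

x = np.array([0.0, 1.0, 2.0])  # one value per sample
y = np.array([0.0, 1.0, 2.0])

# (3,) -> (3, 1): three samples, one feature each.
X = x.reshape(-1, 1)

clf = linear_model.Lasso(alpha=0.1)
clf.fit(X, y)

print(X.shape)
print(clf.coef_.shape)  # one coefficient per feature
print(clf.coef_, clf.intercept_)
```

This is equivalent to calling clf.fit([[0], [1], [2]], [0, 1, 2]) directly.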

GRU/LSTM in Keras with input sequence of varying length

I'm working on a smaller project to better understand RNNs, in particular LSTM and GRU. I'm not at all an expert, so please bear that in mind.
The problem I'm facing is given as data in the form of:
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame([[1, 2, 3],[1, 2, 1], [1, 3, 2],[2, 3, 1],[3, 1, 1],[3, 3, 2],[4, 3, 3]], columns=['person', 'interaction', 'group'])
   person  interaction  group
0       1            2      3
1       1            2      1
2       1            3      2
3       2            3      1
4       3            1      1
5       3            3      2
6       4            3      3
This is just for explanation. We have different persons interacting with different groups in different ways. I've already encoded the various features. The last interaction of a user is always a 3, which means selecting a certain group. In the short example above, person 1 chooses group 2, person 2 chooses group 1, and so on.
My whole data set is much bigger, but I would like to understand the conceptual part first before throwing models at it. The task I would like to learn is: given a sequence of interactions, which group is chosen by the person? More concretely, I would like the output to be a list of all groups (there are 3 groups: 1, 2, 3) sorted from the most likely choice to the second and third likeliest group. The loss function is therefore a mean reciprocal rank.
I know that in Keras, GRUs/LSTMs can handle input of varying length. So my three questions are:
The input is of the format:
(samples, timesteps, features)
writing high level code:
import keras.layers as L
import keras.models as M
model_input = L.Input(shape=(?, None, 2))
timestep=None should imply the varying size and 2 is for the feature interaction and group. But what about the samples? How do I define the batches?
For the output I'm a bit puzzled how this should look like in this example? I think for each last interaction of a person I would like to have a list of length 3. Assuming I've set up the output
model_output = L.LSTM(3, return_sequences=False)
I then want to compile it. Is there a way of using the mean reciprocal rank?
model.compile('adam', '?')
I know the questions are fairly high level, but I would like to understand first the big picture and start to play around. Any help would therefore be appreciated.
The concept you've drawn in your question is a pretty good start already. I'll add a few things to make it work, as well as a code example below:
You can specify LSTM(n_hidden, input_shape=(None, 2)) directly, instead of inserting an extra Input layer; the batch dimension is to be omitted for the definition.
Since your model is going to perform some kind of classification (based on time series data), the final layer is what we'd expect from "normal" classification as well: a Dense(num_classes, activation='softmax'). Chaining the LSTM and the Dense layer together will first pass the time series input through the LSTM layer and then feed its output (whose size is determined by the number of hidden units) into the Dense layer. activation='softmax' lets us compute a score for each class (we're going to use one-hot encoding in a data preprocessing step, see the code example below). The class scores are not returned in sorted order, but you can always rank them via np.argsort or pick the top one via np.argmax.
Categorical crossentropy loss is suited for comparing the classification score, so we'll use that one: model.compile(loss='categorical_crossentropy', optimizer='adam').
Since the number of interactions, i.e. the length of the model input, varies from sample to sample, we'll use a batch size of 1 and feed in one sample at a time.
The following is a sample implementation based on the above considerations. Note that I modified your sample data a bit, in order to provide more "reasoning" behind group choices. Also, each person needs to perform at least one interaction before choosing a group (i.e. the input sequence cannot be empty); if this is not the case for your data, then introducing an additional no-op interaction (e.g. 0) can help.
import pandas as pd
import tensorflow as tf
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(10, input_shape=(None, 2))) # LSTM for arbitrary length series.
model.add(tf.keras.layers.Dense(3, activation='softmax')) # Softmax for class probabilities.
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Example interactions:
# * 1: Likes the group,
# * 2: Dislikes the group,
# * 3: Chooses the group.
df = pd.DataFrame([
    [1, 1, 3],
    [1, 1, 3],
    [1, 2, 2],
    [1, 3, 3],
    [2, 2, 1],
    [2, 2, 3],
    [2, 1, 2],
    [2, 3, 2],
    [3, 1, 1],
    [3, 1, 1],
    [3, 1, 1],
    [3, 2, 3],
    [3, 2, 2],
    [3, 3, 1]],
    columns=['person', 'interaction', 'group']
)
data = [person[1][['interaction', 'group']].values for person in df.groupby('person')]
x_train = [x[:-1] for x in data]
y_train = tf.keras.utils.to_categorical([x[-1, 1]-1 for x in data]) # Expects class labels from 0 to n (-> subtract 1).
print(x_train)
print(y_train)
class TrainGenerator(tf.keras.utils.Sequence):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, index):
        # Need to expand arrays to have batch size 1.
        return self.x[index][None, :, :], self.y[index][None, :]
# model.fit accepts a Sequence directly; fit_generator is deprecated in TF 2.x.
model.fit(TrainGenerator(x_train, y_train), epochs=1000)
pred = [model.predict(x[None, :, :]).ravel() for x in x_train]
for p, y in zip(pred, y_train):
    print(p, y)
And the corresponding sample output:
[...]
Epoch 1000/1000
3/3 [==============================] - 0s 40ms/step - loss: 0.0037
[0.00213619 0.00241093 0.9954529 ] [0. 0. 1.]
[0.00123938 0.99718493 0.00157572] [0. 1. 0.]
[9.9632275e-01 7.5039308e-04 2.9268670e-03] [1. 0. 0.]
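The question also asked for a ranked list of groups and for the mean reciprocal rank. As a sketch of how that could be computed as an evaluation metric on top of softmax scores like the ones above (it is not used as the training loss here), the helper below is NumPy only. The function name mean_reciprocal_rank and the example scores are made up for illustration.

```python
import numpy as np

def mean_reciprocal_rank(scores, true_idx):
    """Average of 1 / rank of the true class, where rank 1 is the top score."""
    scores = np.asarray(scores, dtype=float)
    true_idx = np.asarray(true_idx)
    # Class indices sorted from highest to lowest score, per sample.
    order = np.argsort(-scores, axis=1)
    # Position of the true class in that ordering (1-based).
    ranks = np.argmax(order == true_idx[:, None], axis=1) + 1
    return float(np.mean(1.0 / ranks))

# Hypothetical class scores for three persons (columns = groups 1, 2, 3).
scores = np.array([[0.1, 0.2, 0.7],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.5, 0.3]])
true_idx = np.array([2, 1, 1])  # zero-based index of the chosen group

# Groups sorted by likelihood, converted back to labels 1..3.
ranking = np.argsort(-scores, axis=1) + 1
mrr = mean_reciprocal_rank(scores, true_idx)
print(ranking)
print(mrr)
```

Here the true group is ranked first, second and first respectively, so the mean reciprocal rank is (1 + 1/2 + 1) / 3.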
Using custom generator expressions: According to the documentation we can use any generator to yield the data. The generator is expected to yield batches of the data and loop over the whole data set indefinitely. When using tf.keras.utils.Sequence we do not need to specify the parameter steps_per_epoch as this will default to len(train_generator). Hence, when using a custom generator, we shall provide this parameter as well:
import itertools as it
model.fit(((x_train[i % len(x_train)][None, :, :],
            y_train[i % len(y_train)][None, :]) for i in it.count()),
          epochs=1000,
          steps_per_epoch=len(x_train))

scikit learn: train_test_split, can I ensure same splits on different datasets

I understand that the train_test_split method splits a dataset into random train and test subsets. And using random_state=int can ensure we have the same splits on this dataset for each time the method is called.
My problem is slightly different.
I have two datasets, A and B. They contain identical sets of examples, and the order in which these examples appear in each dataset is also identical. The key difference is that the examples in each dataset use a different set of features.
I would like to test whether the features used in A lead to better performance than the features used in B. So I would like to ensure that when I call train_test_split on A and B, I get the same splits on both datasets so that the comparison is meaningful.
Is this possible? Do I simply need to ensure the random_state in both method calls for both datasets are the same?
Thanks
Yes, random state is enough.
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X,X))
>>> X_train, X_test, _, _ = train_test_split(X,y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2,y, test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])
Looking at the code for the train_test_split function, it sets up the random number generator from random_state inside the function at every call. So the same random_state and the same number of samples will result in the same split every time. We can check that this works pretty simply:
import numpy as np
from sklearn import model_selection
X1 = np.random.random((200, 5))
X2 = np.random.random((200, 5))
y = np.arange(200)
X1_train, X1_test, y1_train, y1_test = model_selection.train_test_split(
    X1, y, test_size=0.1, random_state=42)
X2_train, X2_test, y2_train, y2_test = model_selection.train_test_split(
    X2, y, test_size=0.1, random_state=42)
print(np.all(y1_train == y2_train))
print(np.all(y1_test == y2_test))
Which outputs:
True
True
Which is good! Another way to approach this problem is to create one training and test split on all your features and then split your features up before training. However, if you're in a weird situation where you need to do both at once (sometimes with similarity matrices you don't want test features in your training set), then you can use the StratifiedShuffleSplit class to return the indices of the data that belong to each set. For example:
n_splits = 1
sss = model_selection.StratifiedShuffleSplit(n_splits=n_splits,
                                             test_size=0.1,
                                             random_state=42)
train_idx, test_idx = list(sss.split(X, y))[0]
Since sklearn.model_selection.train_test_split(*arrays, **options) accepts a variable number of arrays, you can simply pass both datasets in a single call:
A_train, A_test, B_train, B_test, _, _ = train_test_split(A, B, y,
                                                          test_size=0.33,
                                                          random_state=42)
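A variant of the same idea is to split an array of row indices once and use it to slice both datasets, which also keeps the indices around for later inspection. The sketch below uses made-up random data for A and B and also checks that this matches calling train_test_split on A alone with the same random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20
A = rng.normal(size=(n, 3))  # same examples, feature set A
B = rng.normal(size=(n, 5))  # same examples, feature set B

# Split the row indices once, then index both datasets with them.
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.25, random_state=42)
A_train, A_test = A[idx_train], A[idx_test]
B_train, B_test = B[idx_train], B[idx_test]

# A separate call with the same random_state selects the same rows.
A_train2, A_test2 = train_test_split(A, test_size=0.25, random_state=42)
print(np.array_equal(A_train, A_train2), np.array_equal(A_test, A_test2))
```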
As mentioned above, you can use the random_state parameter.
But if you want reproducible results globally, i.e. for all future calls that do not pass random_state explicitly, you can seed NumPy's global random number generator with any integer:
np.random.seed(42)

Classification with restrictions

How should I best use scikit-learn for the following supervised classification problem (simplified), with binary features:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
train_data = np.array([[0, 0, 1, 0],
                       [1, 0, 1, 1],
                       [0, 1, 1, 1]], dtype=bool)
train_targets = np.array([0, 1, 2])
c = DecisionTreeClassifier()
c.fit(train_data, train_targets)
p = c.predict(np.array([[1, 1, 1, 1]], dtype=bool))  # predict expects a 2-D array (n_samples, n_features)
print(p)
# -> [1]
That works fine. However, suppose now that I know a priori that the presence of feature 0 excludes class 1. Can additional information of this kind be easily included in the classification process?
Currently, I'm just doing some (problem-specific and heuristic) postprocessing to adjust the resulting class. I could perhaps also manually preprocess and split the dataset into two according to the feature, and train two classifiers separately (but with K such features, this ends up in 2^K splitting).
Can additional information of this kind be easily included in the classification process?
Domain-specific hacks are left to the user. The easiest way to do this is to predict probabilities...
>>> prob = c.predict_proba(X)
and then rig the probabilities to get the right class out.
>>> invalid = (prob[:, 1] == 1) & (X[:, 0] == 1)
>>> prob[invalid, 1] = -np.inf
>>> pred = c.classes_[np.argmax(prob, axis=1)]
That's -np.inf instead of 0 so the 1 label doesn't come up as a result of tie-breaking vs. other zero-probability classes.
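A slightly more general version of this post-processing, as a sketch: mask an excluded class whenever the corresponding feature is present, regardless of the predicted probability, and support several (feature, class) rules at once. The function name predict_with_exclusions and the hard-coded prob array are hypothetical; in practice prob would come from c.predict_proba(X) and classes from c.classes_.

```python
import numpy as np

def predict_with_exclusions(prob, X, classes, exclusions):
    """Pick the most probable class that is not ruled out.

    exclusions is a list of (feature_index, class_label) pairs meaning
    "if this feature is present, this class is impossible".
    """
    prob = np.asarray(prob, dtype=float).copy()
    X = np.asarray(X)
    for feature_index, class_label in exclusions:
        col = np.flatnonzero(classes == class_label)[0]
        # -inf (not 0) so the excluded class can never win a tie.
        prob[X[:, feature_index] == 1, col] = -np.inf
    return classes[np.argmax(prob, axis=1)]

classes = np.array([0, 1, 2])
X = np.array([[1, 1, 1, 1],
              [0, 1, 1, 1]])
# Stand-in for predict_proba output: both samples look like class 1.
prob = np.array([[0.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0]])

pred = predict_with_exclusions(prob, X, classes, [(0, 1)])
print(pred)
```

Only the first sample has feature 0 set, so only its prediction is moved away from class 1; among the remaining zero-probability classes, argmax picks the first one.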
