PyTorch: How exactly does a DataLoader get a batch from a dataset?

I am trying to use pytorch to implement self-supervised contrastive learning. There is a phenomenon that I can't understand.
Here is my code of transformation to get two augmented views from original data:
from torchvision import transforms
from torchvision.datasets import CIFAR10

class ContrastiveTransformations:
    def __init__(self, base_transforms, n_views=2):
        self.base_transforms = base_transforms
        self.n_views = n_views

    def __call__(self, x):
        return [self.base_transforms(x) for i in range(self.n_views)]

contrast_transforms = transforms.Compose(
    [
        transforms.RandomResizedCrop(size=96),
        transforms.ToTensor(),
    ]
)

data_set = CIFAR10(
    root='/home1/data',
    download=True,
    transform=ContrastiveTransformations(contrast_transforms, n_views=2),
)
Given the definition of ContrastiveTransformations, each item in my dataset is a list containing two tensors, [x_1, x_2]. In my understanding, a batch from the dataloader should have the form [data_batch, label_batch], where each item in data_batch is [x_1, x_2]. However, the batch actually has the form [[batch_x1, batch_x2], label_batch], which is much more convenient for calculating the InfoNCE loss. I wonder how DataLoader implements this fetching of the batch.
I have checked the code of DataLoader in PyTorch; it seems that the dataloader fetches the data this way:
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def __init__(self, dataset, auto_collation, collate_fn, drop_last):
        super(_MapDatasetFetcher, self).__init__(dataset, auto_collation, collate_fn, drop_last)

    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]
        return self.collate_fn(data)
However, I still can't figure out how the dataloader produces the batches of x1 and x2 separately.
I would be very thankful if someone could give me an explanation.

In order to convert the separate dataset elements into an assembled batch, PyTorch's data loaders use a collate function, which defines how the dataloader should assemble the different elements together to form a minibatch.
You can define your own collate function and pass it to your data.DataLoader with the collate_fn argument. By default, the collate function used by dataloaders is default_collate defined in torch/utils/data/_utils/collate.py.
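For illustration, here is a minimal sketch of a custom collate function; my_collate is a hypothetical name, and it simply reproduces the default behaviour for a batch of (sample, label) pairs where each sample is a tensor of the same shape:
import torch
from torch.utils.data import DataLoader

# Hypothetical custom collate function: receives a list of (sample, label)
# pairs (one entry per index in the batch) and assembles the minibatch.
def my_collate(batch):
    samples = torch.stack([sample for sample, _ in batch])  # (B, ...)
    labels = torch.tensor([label for _, label in batch])    # (B,)
    return samples, labels

# your_dataset is a placeholder for any map-style dataset with tensor samples:
# loader = DataLoader(your_dataset, batch_size=32, collate_fn=my_collate)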
This is the behaviour of the default collate function as described in the header of the function:
# Example with a batch of `int`s:
>>> default_collate([0, 1, 2, 3])
tensor([0, 1, 2, 3])
# Example with a batch of `str`s:
>>> default_collate(['a', 'b', 'c'])
['a', 'b', 'c']
# Example with `Map` inside the batch:
>>> default_collate([{'A': 0, 'B': 1}, {'A': 100, 'B': 100}])
{'A': tensor([ 0, 100]), 'B': tensor([ 1, 100])}
# Example with `NamedTuple` inside the batch:
>>> Point = namedtuple('Point', ['x', 'y'])
>>> default_collate([Point(0, 0), Point(1, 1)])
Point(x=tensor([0, 1]), y=tensor([0, 1]))
# Example with `Tuple` inside the batch:
>>> default_collate([(0, 1), (2, 3)])
[tensor([0, 2]), tensor([1, 3])]
# Example with `List` inside the batch:
>>> default_collate([[0, 1], [2, 3]])
[tensor([0, 2]), tensor([1, 3])]
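The Tuple and List cases above are exactly what happens in your setup: each dataset item is ([x_1, x_2], label), so default_collate first splits the tuple into a batch of lists and a batch of labels, then recurses into the lists and stacks all first views together and all second views together, giving [[batch_x1, batch_x2], label_batch]. A small sketch to confirm this (assuming a PyTorch version that exports default_collate from torch.utils.data; older versions keep it in torch.utils.data._utils.collate):
import torch
from torch.utils.data import default_collate

# Two fake dataset items of the form ([x_1, x_2], label).
items = [
    ([torch.zeros(3, 96, 96), torch.ones(3, 96, 96)], 0),
    ([torch.zeros(3, 96, 96), torch.ones(3, 96, 96)], 1),
]
(batch_x1, batch_x2), label_batch = default_collate(items)
print(batch_x1.shape)  # torch.Size([2, 3, 96, 96])
print(batch_x2.shape)  # torch.Size([2, 3, 96, 96])
print(label_batch)     # tensor([0, 1])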

Related

What is the recommended way to do embeddings in jax?

So I mean something where you have a categorical feature $X$ (suppose you have already turned it into ints) and you want to embed it in some dimension using an embedding matrix $A$, where $A$ is arity x n_embed.
What is the usual way to do this? Is using a for loop and vmap correct? I do not want something like jax.nn; I want something more efficient, like
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
For example consider high arity and low embedding dim.
Is it jnp.take as in the flax.linen implementation here? https://github.com/google/flax/blob/main/flax/linen/linear.py#L624
Indeed, the typical way to do this in pure JAX is with jnp.take. Given an array A of embeddings with shape (num_embeddings, num_features) and a categorical feature x of integers with shape (n,), the following gives you the embedding lookup:
jnp.take(A, x, axis=0) # shape: (n, num_features)
If using Flax, the recommended way would be the flax.linen.Embed module, which achieves the same effect:
import flax.linen as nn

class Model(nn.Module):
    num_embeddings: int
    num_features: int

    @nn.compact
    def __call__(self, x):
        # shape: (n, num_features)
        return nn.Embed(self.num_embeddings, self.num_features)(x)
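A hedged usage sketch of the module above (the sizes and the PRNG key are made up for illustration):
import jax
import jax.numpy as jnp

model = Model(num_embeddings=4, num_features=3)
x = jnp.array([3, 1, 2], dtype=jnp.int32)
params = model.init(jax.random.PRNGKey(0), x)
emb = model.apply(params, x)  # shape: (3, 3)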
Suppose that A is the embedding table and x is an array of indices of any shape. There are a few equivalent options:
A[x], which is like jnp.take(A, x, axis=0) but simpler.
vmap-ed A[x], which parallelizes along axis 0 of x.
nested vmap-ed A[x], which parallelizes along all axes of x.
Here is the source code for your reference:
import jax
import jax.numpy as jnp

embs = jnp.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=jnp.float32)
x = jnp.array([[3, 1], [2, 0]], dtype=jnp.int32)

print("\ntake\n", jnp.take(embs, x, axis=0))
print("\nuse []\n", embs[x])
print(
    "\nvmap\n",
    jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0)(embs, x),
)
print(
    "\nnested vmap\n",
    jax.vmap(
        jax.vmap(lambda embs, x: embs[x], in_axes=[None, 0], out_axes=0),
        in_axes=[None, 0],
        out_axes=0,
    )(embs, x),
)
BTW, I learned the nested-vmap trick from the IREE GPT2 model code by James Bradbury.

Customizing the batch with specific elements

I am a fresh starter with PyTorch. Strangely I cannot find anything related to this, although it seems rather simple.
I want to structure my batch with specific examples, like all examples per batch having the same label, or just fill the batch with examples of just 2 classes.
How would I do that? To me it seems that the right place for this is the data loader rather than the dataset, since the data loader is responsible for the batches and the dataset is not. Is that correct?
Is there a simple minimal example?
TLDR;
Default DataLoader only uses a sampler, not a batch sampler.
You can define a sampler, plus a batch sampler; a batch sampler will override the sampler.
The sampler only yields the sequence of dataset elements, not the actual batches (this is handled by the data loader, depending on batch_size).
To answer your initial question: working with a sampler on an iterable-style dataset doesn't seem to be possible, cf. this GitHub issue (still open). Also, read the related note in pytorch/dataloader.py.
Samplers (for map-style datasets):
That aside, if you are switching to a map-style dataset, here are some details on how samplers and batch samplers work. You have access to a dataset's underlying data using indices, just like you would with a list (since torch.utils.data.Dataset implements __getitem__). In other words, your dataset elements are all dataset[i], for i in [0, len(dataset) - 1].
Here is a toy dataset:
class DS(Dataset):
    def __getitem__(self, index):
        return index * 10

    def __len__(self):
        return 10
In a general use case you would just give torch.utils.data.DataLoader the arguments batch_size and shuffle. By default, shuffle is set to False, which means it will use torch.utils.data.SequentialSampler. Otherwise (if shuffle=True) torch.utils.data.RandomSampler will be used. The sampler defines how the data loader accesses the dataset (in which order it does so).
The above dataset (DS) has 10 elements. The indices are 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. They map to elements 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90. So with a batch size of 2:
SequentialSampler: DataLoader(ds, batch_size=2) (implicitly shuffle=False), identical to DataLoader(ds, batch_size=2, sampler=SequentialSampler(ds)). The dataloader will deliver tensor([0, 10]), tensor([20, 30]), tensor([40, 50]), tensor([60, 70]), and tensor([80, 90]).
RandomSampler: DataLoader(ds, batch_size=2, shuffle=True), identical to DataLoader(ds, batch_size=2, sampler=RandomSampler(ds)). The dataloader will sample randomly each time you iterate through it. For instance: tensor([50, 40]), tensor([90, 80]), tensor([0, 60]), tensor([10, 20]), and tensor([30, 70]). But the sequence will be different if you iterate through the dataloader a second time!
Batch sampler
Providing batch_sampler will override batch_size, shuffle, sampler, and drop_last altogether. It is meant to define exactly the batch elements and their content. For instance:
>>> DataLoader(ds, batch_sampler=[[1,2,3], [6,5,4], [7,8], [0,9]])
Will yield tensor([10, 20, 30]), tensor([60, 50, 40]), tensor([70, 80]), and tensor([ 0, 90]).
Batch sampling by class
Let's say you just want to have 2 elements (different or not) of each class in your batch, and want to exclude any additional examples of that class, ensuring that no 3 examples of the same class end up in the same batch.
Let's say you have a dataset with four classes. Here is how I would do it. First, keep track of dataset indices for each class.
class DS(Dataset):
    def __init__(self, data):
        super(DS, self).__init__()
        self.data = data
        self.indices = [[] for _ in range(4)]
        for i, x in enumerate(data):
            if x > 0 and x % 2: self.indices[0].append(i)
            if x > 0 and not x % 2: self.indices[1].append(i)
            if x < 0 and x % 2: self.indices[2].append(i)
            if x < 0 and not x % 2: self.indices[3].append(i)

    def classes(self):
        return self.indices

    def __getitem__(self, index):
        return self.data[index]
For example:
>>> ds = DS([1, 6, 7, -5, 10, -6, 8, 6, 1, -3, 9, -21, -13, 11, -2, -4, -21, 4])
Will give:
>>> ds.classes()
[[0, 2, 8, 10, 13], [1, 4, 6, 7, 17], [3, 9, 11, 12, 16], [5, 14, 15]]
Then, for the batch sampler, the easiest way is to create a list of class indices with one entry per dataset element, i.e. each class index repeated as many times as that class has elements.
In the dataset defined above, we have 5 items from class 0, 5 from class 1, 5 from class 2, and 3 from class 3. Therefore we want to construct [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]. We will shuffle it. Then, from this list and the dataset classes content (ds.classes()) we will be able to construct the batches.
import copy
import random

def flatten(lst):
    return [item for sublist in lst for item in sublist]

class Sampler():
    def __init__(self, classes):
        self.classes = classes

    def __iter__(self):
        classes = copy.deepcopy(self.classes)
        # One entry per dataset element: the index of the class it belongs to.
        indices = flatten([[i for _ in range(len(klass))] for i, klass in enumerate(classes)])
        random.shuffle(indices)
        # Group the shuffled class indices into pairs, one pair per batch.
        grouped = zip(*[iter(indices)]*2)

        res = []
        for a, b in grouped:
            res.append((classes[a].pop(), classes[b].pop()))
        return iter(res)
Note - deep copying the list is required since we're popping elements from it.
A possible output of this sampler would be:
[(15, 14), (16, 17), (7, 12), (11, 6), (13, 10), (5, 4), (9, 8), (2, 0), (3, 1)]
At this point we can simply use torch.utils.data.DataLoader:
>>> dl = DataLoader(ds, batch_sampler=Sampler(ds.classes()))
Which could yield something like:
[tensor([ 4, -4]), tensor([-21, 11]), tensor([-13, 6]), tensor([9, 1]), tensor([ 8, -21]), tensor([-3, 10]), tensor([ 6, -2]), tensor([-5, 7]), tensor([-6, 1])]
An easier approach
Here is another, easier, approach that does not guarantee to return all elements from the dataset, though on average it will.
Each time the sampler is iterated, it first samples class_per_batch classes, then fills each batch with batch_size elements drawn from those selected classes (by first sampling a class from that subset, then sampling a data point from that class).
class Sampler():
    def __init__(self, classes, class_per_batch, batch_size):
        self.classes = classes
        self.n_batches = sum([len(x) for x in classes]) // batch_size
        self.class_per_batch = class_per_batch
        self.batch_size = batch_size

    def __iter__(self):
        classes = random.sample(range(len(self.classes)), self.class_per_batch)

        batches = []
        for _ in range(self.n_batches):
            batch = []
            for i in range(self.batch_size):
                klass = random.choice(classes)
                batch.append(random.choice(self.classes[klass]))
            batches.append(batch)
        return iter(batches)
You can try it this way:
>>> s = Sampler(ds.classes(), class_per_batch=2, batch_size=4)
>>> list(s)
[[16, 0, 0, 9], [10, 8, 11, 2], [16, 9, 16, 8], [2, 9, 2, 3]]
>>> dl = DataLoader(ds, batch_sampler=s)
>>> list(iter(dl))
[tensor([ -5, -6, -21, -13]), tensor([ -4, -4, -13, -13]), tensor([ -3, -21, -2, -5]), tensor([-3, -5, -4, -6])]

GRU/LSTM in Keras with input sequence of varying length

I'm working on a small project to better understand RNNs, in particular LSTM and GRU. I'm not at all an expert, so please bear that in mind.
The data for the problem I'm facing is given in the form of:
>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame([[1, 2, 3],[1, 2, 1], [1, 3, 2],[2, 3, 1],[3, 1, 1],[3, 3, 2],[4, 3, 3]], columns=['person', 'interaction', 'group'])
   person  interaction  group
0       1            2      3
1       1            2      1
2       1            3      2
3       2            3      1
4       3            1      1
5       3            3      2
6       4            3      3
This is just for explanation. We have different persons interacting with different groups in different ways. I've already encoded the various features. The last interaction of a person is always a 3, which means selecting a certain group. In the short example above, person 1 chooses group 2, person 2 chooses group 1, and so on.
My whole data set is much bigger, but I would like to understand the conceptual part first before throwing models at it. The task I would like to learn is: given a sequence of interactions, which group is chosen by the person? A bit more concretely, I would like the output to be a list of all groups (there are 3 groups: 1, 2, 3) sorted by the most likely choice, followed by the second and third most likely group. The loss function would therefore be mean reciprocal rank.
I know that in Keras GRUs/LSTMs can handle variable-length input. So my three questions are:
The input is of the format:
(samples, timesteps, features)
Writing high-level code:
import keras.layers as L
import keras.models as M
model_input = L.Input(shape=(?, None, 2))
timesteps=None should imply the varying length, and 2 is for the two features, interaction and group. But what about the samples? How do I define the batches?
For the output I'm a bit puzzled about how this should look in this example. I think for the last interaction of each person I would like to have a list of length 3. Assuming I've set up the output as
model_output = L.LSTM(3, return_sequences=False)
I then want to compile it. Is there a way of using the mean reciprocal rank?
model.compile('adam', '?')
I know the questions are fairly high level, but I would like to understand first the big picture and start to play around. Any help would therefore be appreciated.
The concept you've drawn in your question is a pretty good start already. I'll add a few things to make it work, as well as a code example below:
You can specify LSTM(n_hidden, input_shape=(None, 2)) directly, instead of inserting an extra Input layer; the batch dimension is to be omitted for the definition.
Since your model is going to perform some kind of classification (based on time series data), the final layer is what we'd expect from "normal" classification as well: a Dense(num_classes, activation='softmax'). Chaining the LSTM and the Dense layer together will first pass the time series input through the LSTM layer and then feed its output (determined by the number of hidden units) into the Dense layer. activation='softmax' computes a class score for each class (we're going to use one-hot encoding in a data preprocessing step; see the code example below). The class scores are not ordered, but you can always order them via np.argsort or np.argmax.
Categorical crossentropy loss is suited for comparing the classification score, so we'll use that one: model.compile(loss='categorical_crossentropy', optimizer='adam').
Since the number of interactions, i.e. the length of the model input, varies from sample to sample, we'll use a batch size of 1 and feed in one sample at a time.
The following is a sample implementation w.r.t. the above considerations. Note that I modified your sample data a bit in order to provide more "reasoning" behind the group choices. Also, each person needs to perform at least one interaction before choosing a group (i.e. the input sequence cannot be empty); if this is not the case for your data, then introducing an additional no-op interaction (e.g. 0) can help.
import pandas as pd
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.LSTM(10, input_shape=(None, 2)))  # LSTM for arbitrary length series.
model.add(tf.keras.layers.Dense(3, activation='softmax'))   # Softmax for class probabilities.
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Example interactions:
# * 1: Likes the group,
# * 2: Dislikes the group,
# * 3: Chooses the group.
df = pd.DataFrame([
    [1, 1, 3],
    [1, 1, 3],
    [1, 2, 2],
    [1, 3, 3],
    [2, 2, 1],
    [2, 2, 3],
    [2, 1, 2],
    [2, 3, 2],
    [3, 1, 1],
    [3, 1, 1],
    [3, 1, 1],
    [3, 2, 3],
    [3, 2, 2],
    [3, 3, 1]],
    columns=['person', 'interaction', 'group']
)

data = [person[1][['interaction', 'group']].values for person in df.groupby('person')]
x_train = [x[:-1] for x in data]
y_train = tf.keras.utils.to_categorical([x[-1, 1] - 1 for x in data])  # Expects class labels from 0 to n (-> subtract 1).
print(x_train)
print(y_train)

class TrainGenerator(tf.keras.utils.Sequence):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        # Need to expand arrays to have batch size 1.
        return self.x[index][None, :, :], self.y[index][None, :]

model.fit_generator(TrainGenerator(x_train, y_train), epochs=1000)

pred = [model.predict(x[None, :, :]).ravel() for x in x_train]
for p, y in zip(pred, y_train):
    print(p, y)
And the corresponding sample output:
[...]
Epoch 1000/1000
3/3 [==============================] - 0s 40ms/step - loss: 0.0037
[0.00213619 0.00241093 0.9954529 ] [0. 0. 1.]
[0.00123938 0.99718493 0.00157572] [0. 1. 0.]
[9.9632275e-01 7.5039308e-04 2.9268670e-03] [1. 0. 0.]
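Regarding the ordered list of groups and the mean reciprocal rank from the question: Keras does not ship an MRR loss, but since the model outputs one probability per class you can rank the classes with np.argsort and compute MRR as an evaluation metric on the predictions. A rough sketch (an evaluation metric rather than a differentiable loss; mean_reciprocal_rank is a helper introduced here for illustration):
import numpy as np

def mean_reciprocal_rank(pred_probs, true_onehot):
    # For each sample: 1 / (rank of the true class among the sorted predictions).
    reciprocal_ranks = []
    for p, y in zip(pred_probs, true_onehot):
        order = np.argsort(p)[::-1]     # class indices, most likely first
        true_class = np.argmax(y)
        rank = int(np.where(order == true_class)[0][0]) + 1
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

print(mean_reciprocal_rank(np.array(pred), y_train))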
Using custom generator expressions: According to the documentation we can use any generator to yield the data. The generator is expected to yield batches of the data and loop over the whole data set indefinitely. When using tf.keras.utils.Sequence we do not need to specify the parameter steps_per_epoch as this will default to len(train_generator). Hence, when using a custom generator, we shall provide this parameter as well:
import itertools as it

model.fit_generator(((x_train[i % len(x_train)][None, :, :],
                      y_train[i % len(y_train)][None, :]) for i in it.count()),
                    epochs=1000,
                    steps_per_epoch=len(x_train))

Access document-term matrix without calling .fit_transform() each time

If I've already called vectorizer.fit_transform(corpus), is the only way to later print the document-term matrix to call vectorizer.fit_transform(corpus) again?
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(corpus) # Returns the document-term matrix
My understanding is that by doing the above, I've now saved the terms into the vectorizer object. I assume this because I can now call vectorizer.vocabulary_ without passing in the corpus again.
So I wonder why there isn't a method like .document_term_matrix?
It seems weird that I have to pass in the corpus again if the data is already stored in the vectorizer object. But per the docs, only .fit, .transform, and .fit_transform return the matrix.
Docs: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit
Other Info:
I'm using Anaconda and Jupyter Notebook.
You can simply assign the result of the fit to a variable dtm and, since it is a SciPy sparse matrix, use the toarray method to print it:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['the', 'quick','brown','fox']
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(corpus)
# vectorizer object is still fit:
vectorizer.vocabulary_
# {'brown': 0, 'fox': 1, 'quick': 2}
dtm.toarray()
# array([[0, 0, 0],
# [0, 0, 1],
# [1, 0, 0],
# [0, 1, 0]], dtype=int64)
although I guess for any realistic document-term matrix this will be really impractical... You could use the nonzero method instead:
dtm.nonzero()
# (array([1, 2, 3], dtype=int32), array([2, 0, 1], dtype=int32))
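If you mainly want a readable view, one option (a sketch, assuming scikit-learn >= 1.0 for get_feature_names_out; older versions use get_feature_names) is to wrap the dense array in a pandas DataFrame with the learned terms as column names:
import pandas as pd

# Dense view with one column per term (only sensible for small matrices).
df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
#    brown  fox  quick
# 0      0    0      0
# 1      0    0      1
# 2      1    0      0
# 3      0    1      0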

array-like shape (n_samples,) vs [n_samples] in sklearn documents

For sample_weight, the shape requirement is sometimes given as array-like, shape (n_samples,) and sometimes as array-like, shape [n_samples]. Does (n_samples,) mean a 1d array, and [n_samples] a list? Or are they equivalent to each other?
Both forms can be seen here: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
You can use a simple example to test this:
import numpy as np
from sklearn.naive_bayes import GaussianNB
#create some data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
#create the model and fit it
clf = GaussianNB()
clf.fit(X, Y)
#check the type of some attributes
type(clf.class_prior_)
type(clf.class_count_)
#check the shapes of these attributes
clf.class_prior_.shape
clf.class_count_
Or, checking more explicitly:
#verify that it is a numpy nd array and NOT a list
isinstance(clf.class_prior_, np.ndarray)
isinstance(clf.class_prior_, list)
Similarly, you can check all the attributes.
Results
numpy.ndarray
numpy.ndarray
(2,)
array([ 3., 3.])
True
False
The results indicate that these attributes are NumPy ndarrays.
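To answer the original question directly: in the scikit-learn docs, array-like, shape (n_samples,) and array-like, shape [n_samples] mean the same thing, a one-dimensional array-like with one entry per sample, so both a plain Python list and a 1-D NumPy array are accepted. A quick sketch on the same toy data (the weight values are arbitrary):
# Both forms are accepted for sample_weight; scikit-learn converts
# array-likes to NumPy arrays internally.
clf.fit(X, Y, sample_weight=[1, 1, 1, 2, 2, 2])            # plain Python list
clf.fit(X, Y, sample_weight=np.array([1, 1, 1, 2, 2, 2]))  # 1-D NumPy array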
