Customizing the batch with specific elements - pytorch

I am just getting started with PyTorch. Strangely, I cannot find anything related to this, although it seems rather simple.
I want to build my batches from specific examples, e.g. all examples in a batch having the same label, or a batch containing examples from only 2 classes.
How would I do that? To me, the data loader seems like the right place for this rather than the dataset, since the data loader is responsible for the batches and not the dataset, correct?
Is there a simple minimal example?

TLDR;
By default, DataLoader only uses a sampler, not a batch sampler.
You can define a sampler, or a batch sampler; a batch sampler will override the sampler.
The sampler only yields the sequence of dataset indices, not the actual batches (batching is handled by the data loader, depending on batch_size).
To answer your initial question: working with a sampler on an iterable-style dataset doesn't seem to be possible, cf. this GitHub issue (still open). Also, read the related note in pytorch/dataloader.py.
Samplers (for map-style datasets):
That aside, if you are switching to a map-style dataset, here are some details on how samplers and batch samplers work. You have access to a dataset's underlying data using indices, just like you would with a list (since torch.utils.data.Dataset implements __getitem__). In other words, your dataset elements are all dataset[i], for i in range(len(dataset)).
Here is a toy dataset:
from torch.utils.data import Dataset, DataLoader

class DS(Dataset):
    def __getitem__(self, index):
        return index * 10  # element i is simply 10*i, which makes the batches easy to read

    def __len__(self):
        return 10
In a general use case you would just give torch.utils.data.DataLoader the arguments batch_size and shuffle. By default, shuffle is False, which means torch.utils.data.SequentialSampler will be used. Otherwise (if shuffle=True), torch.utils.data.RandomSampler will be used. The sampler defines how the data loader accesses the dataset (the order in which it accesses it).
The above dataset (DS) has 10 elements. The indices are 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. They map to elements 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90. So with a batch size of 2:
SequentialSampler: DataLoader(ds, batch_size=2) (implicitly shuffle=False), identical to DataLoader(ds, batch_size=2, sampler=SequentialSampler(ds)). The dataloader will deliver tensor([0, 10]), tensor([20, 30]), tensor([40, 50]), tensor([60, 70]), and tensor([80, 90]).
RandomSampler: DataLoader(ds, batch_size=2, shuffle=True), identical to DataLoader(ds, batch_size=2, sampler=RandomSampler(ds)). The dataloader will sample randomly each time you iterate through it. For instance: tensor([50, 40]), tensor([90, 80]), tensor([0, 60]), tensor([10, 20]), and tensor([30, 70]). But the sequence will be different if you iterate through the dataloader a second time!
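For reference, a custom sampler only has to yield dataset indices; the data loader still groups them into batches of batch_size. A minimal sketch (the ReverseSampler name is made up here, and it assumes the DS defined above):
from torch.utils.data import Sampler

class ReverseSampler(Sampler):
    # hypothetical sampler: yields the dataset indices in reverse order
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)
DataLoader(ds, batch_size=2, sampler=ReverseSampler(ds)) would then deliver tensor([90, 80]), tensor([70, 60]), tensor([50, 40]), tensor([30, 20]), and tensor([10, 0]).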
Batch sampler
Providing batch_sampler will override batch_size, shuffle, sampler, and drop_last altogether. It is meant to define exactly the batch elements and their content. For instance:
>>> DataLoader(ds, batch_sampler=[[1,2,3], [6,5,4], [7,8], [0,9]])
Will yield tensor([10, 20, 30]), tensor([60, 50, 40]), tensor([70, 80]), and tensor([ 0, 90]).
Batch sampling by class
Let's say I just want to have 2 elements (different or not) of each class in my batch, and want to exclude further examples of those classes, i.e. ensure that no more than 2 examples of the same class end up in a batch.
Let's say you have a dataset with four classes. Here is how I would do it. First, keep track of dataset indices for each class.
class DS(Dataset):
    def __init__(self, data):
        super(DS, self).__init__()
        self.data = data
        self.indices = [[] for _ in range(4)]
        for i, x in enumerate(data):
            if x > 0 and x % 2: self.indices[0].append(i)      # positive odd
            if x > 0 and not x % 2: self.indices[1].append(i)  # positive even
            if x < 0 and x % 2: self.indices[2].append(i)      # negative odd
            if x < 0 and not x % 2: self.indices[3].append(i)  # negative even

    def classes(self):
        return self.indices

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)
For example:
>>> ds = DS([1, 6, 7, -5, 10, -6, 8, 6, 1, -3, 9, -21, -13, 11, -2, -4, -21, 4])
Will give:
>>> ds.classes()
[[0, 2, 8, 10, 13], [1, 4, 6, 7, 17], [3, 9, 11, 12, 16], [5, 14, 15]]
Then, for the batch sampler, the easiest way is to create a list of the available class indices, with as many occurrences of a class index as there are dataset elements in that class.
In the dataset defined above, we have 5 items from class 0, 5 from class 1, 5 from class 2, and 3 from class 3. Therefore we want to construct [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]. We will shuffle it. Then, from this list and the dataset classes content (ds.classes()) we will be able to construct the batches.
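In isolation, that list can be built like this (a small sketch using the ds.classes() output from above; shuffling it is then just random.shuffle(indices)):
classes = ds.classes()
indices = [i for i, klass in enumerate(classes) for _ in range(len(klass))]
print(indices)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
The Sampler below wraps these steps, plus the grouping into pairs, in its __iter__: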
import copy
import random

# helper to flatten a list of lists into a single list
flatten = lambda lst: [item for sublist in lst for item in sublist]

class Sampler():
    def __init__(self, classes):
        self.classes = classes

    def __iter__(self):
        classes = copy.deepcopy(self.classes)
        # one entry per dataset element: the index of the class it belongs to
        indices = flatten([[i for _ in range(len(klass))] for i, klass in enumerate(classes)])
        random.shuffle(indices)
        # group the shuffled class indices two by two
        grouped = zip(*[iter(indices)]*2)

        # for each pair of class indices, pop one dataset index from each class
        res = []
        for a, b in grouped:
            res.append((classes[a].pop(), classes[b].pop()))
        return iter(res)
Note - deep copying the list is required since we're popping elements from it.
A possible output of this sampler would be:
[(15, 14), (16, 17), (7, 12), (11, 6), (13, 10), (5, 4), (9, 8), (2, 0), (3, 1)]
At this point we can simply use torch.utils.data.DataLoader:
>>> dl = DataLoader(ds, batch_sampler=Sampler(ds.classes()))
Which could yield something like:
[tensor([ 4, -4]), tensor([-21, 11]), tensor([-13, 6]), tensor([9, 1]), tensor([ 8, -21]), tensor([-3, 10]), tensor([ 6, -2]), tensor([-5, 7]), tensor([-6, 1])]
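Putting it together (a sketch, assuming the DS and Sampler defined above):
ds = DS([1, 6, 7, -5, 10, -6, 8, 6, 1, -3, 9, -21, -13, 11, -2, -4, -21, 4])
dl = DataLoader(ds, batch_sampler=Sampler(ds.classes()))

for epoch in range(2):
    for batch in dl:
        # fresh batches are drawn on every pass, since the data loader
        # calls iter() on the batch sampler again for each epoch
        ...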
An easier approach
Here is another, easier, approach. It is not guaranteed to return all elements of the dataset, though on average it will come close.
On each pass through the sampler, first sample class_per_batch classes; then build each batch by sampling batch_size elements from those selected classes (first pick a class from the subset, then pick a data point from that class).
class Sampler():
    def __init__(self, classes, class_per_batch, batch_size):
        self.classes = classes
        self.n_batches = sum([len(x) for x in classes]) // batch_size
        self.class_per_batch = class_per_batch
        self.batch_size = batch_size

    def __iter__(self):
        # pick the classes used for this pass
        classes = random.sample(range(len(self.classes)), self.class_per_batch)

        batches = []
        for _ in range(self.n_batches):
            batch = []
            for i in range(self.batch_size):
                # pick a class, then a dataset index from that class (with replacement)
                klass = random.choice(classes)
                batch.append(random.choice(self.classes[klass]))
            batches.append(batch)
        return iter(batches)
You can try it this way:
>>> s = Sampler(ds.classes(), class_per_batch=2, batch_size=4)
>>> list(s)
[[16, 0, 0, 9], [10, 8, 11, 2], [16, 9, 16, 8], [2, 9, 2, 3]]
>>> dl = DataLoader(ds, batch_sampler=s)
>>> list(iter(dl))
[tensor([ -5, -6, -21, -13]), tensor([ -4, -4, -13, -13]), tensor([ -3, -21, -2, -5]), tensor([-3, -5, -4, -6])]

Related

How to reorder a tensor based on an indexes tensor of the same size

Say I have a tensor A and an indexes tensor: A = [1, 2, 3, 4], indexes = [1, 0, 3, 2].
I want to create a new tensor from these two with the following result: [2, 1, 4, 3].
Each element of the result is an element from A, and the order is defined by the indexes tensor.
Is there a way to do it with PyTorch tensor ops without loops?
My goal is to do it for a 2D tensor, but I don't think there is a way to do it without loops, so I thought I would project it to 1D, do the work, and project it back to 2D.
You can use scatter:
import torch

A = torch.tensor([1, 2, 3, 4])
indices = torch.tensor([1, 0, 3, 2])
result = torch.tensor([0, 0, 0, 0])
print(result.scatter_(0, indices, A))  # tensor([2, 1, 4, 3])
In 1D you can simply perform A[indexes].
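For instance, with the tensors from the question:
import torch

A = torch.tensor([1, 2, 3, 4])
indexes = torch.tensor([1, 0, 3, 2])
print(A[indexes])  # tensor([2, 1, 4, 3])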
In 2D it is still doable in this way:
A = torch.arange(5, 10).repeat(3, 1)                          # shape: (3, 5)
indexes = torch.stack([torch.randperm(5) for _ in range(3)])  # shape: (3, 5)
# advanced indexing: the (3, 1) row indices broadcast against the (3, 5) indexes
A_sort = A[torch.arange(3).unsqueeze(1), indexes]
print(A_sort)

Pytorch, retrieving values from a tensor using several indices. Most computationally efficient solution

If I have an example 2d tensor
a = [[4, 2, 1, 6],[1, 2, 3, 8], [92, 4, 23, 54]]
tensor_a = torch.tensor(a)
I can get 2 of the 1D tensors along the first dimension using
tensor_a[[0, 1]]
tensor([[4, 2, 1, 6],
        [1, 2, 3, 8]])
But how about using several indices?
So I have something like this
list_indices = [[0, 0], [0,2], [1, 2]]
I could do something like
combos = []
for indi in list_indices:
    combos.append(tensor_a[indi])
But since there's a for loop, I'm wondering if there's a more computationally efficient way to do this, perhaps also using pytorch.
It is more computationally efficient to use the predefined PyTorch function torch.index_select to select tensor elements using a list of indices:
import torch

a = [[4, 2, 1, 6], [1, 2, 3, 8], [92, 4, 23, 54]]
tensor_a = torch.tensor(a)
list_indices = [[0, 0], [0, 2], [1, 2]]
# convert list_indices to a tensor
indices = torch.tensor(list_indices)
# get elements from tensor_a using the flattened indices
tensor_a = torch.index_select(tensor_a, 0, indices.view(-1))
print(tensor_a)
If you want the result to be a list and not a tensor, you can convert tensor_a to a list:
tensor_a_list = tensor_a.tolist()
To test the computational efficiency I created 1000000 indices and compared the execution time. Using the loop takes more time than using my suggested pytorch approach:
import time
import torch

start_time = time.time()
a = [[4, 2, 1, 6], [1, 2, 3, 8], [92, 4, 23, 54]]
tensor_a = torch.tensor(a)
indices = torch.randint(0, 2, (1000000,)).tolist()
combos = []
for indi in indices:
    combos.append(tensor_a[indi])
print("--- %s seconds ---" % (time.time() - start_time))
--- 3.3966853618621826 seconds ---
start_time = time.time()
indices = torch.tensor(indices)
tensor_a = torch.index_select(tensor_a, 0, indices)
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.10641193389892578 seconds ---

Split and extract values from a PyTorch tensor according to given dimensions

I have a tensor A of size torch.Size([32, 32, 3, 3]) and I want to split it and extract a tensor B of size torch.Size([16, 16, 3, 3]) from it. The tensor can be 1d or 4d, and the split has to follow the given new tensor dimensions. I have been able to generate the target dimensions, but I'm unable to split and extract the values from the source tensor. I have tried torch.narrow, but it takes only 3 arguments and I need 4 in many cases. torch.split takes dim as an int, due to which the tensor is split along one dimension only. But I want to split it along multiple dimensions.
You have multiple options:
use .split multiple times
use .narrow multiple times
use slicing
e.g.:
t = torch.rand(32, 32, 3, 3)
t0, t1 = t.split((16, 16), 0)
print(t0.shape, t1.shape)
>>> torch.Size([16, 32, 3, 3]) torch.Size([16, 32, 3, 3])
t00, t01 = t0.split((16, 16), 1)
print(t00.shape, t01.shape)
>>> torch.Size([16, 16, 3, 3]) torch.Size([16, 16, 3, 3])
t00_alt, t01_alt = t[:16, :16, :, :], t[16:, 16:, :, :]
print(t00_alt.shape, t01_alt.shape)
>>> torch.Size([16, 16, 3, 3]) torch.Size([16, 16, 3, 3])
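For completeness, the same (16, 16, 3, 3) block can also be taken with .narrow applied once per dimension (a small sketch; torch.narrow takes the dimension, the start index, and the length):
t_nar = t.narrow(0, 0, 16).narrow(1, 0, 16)
print(t_nar.shape)
>>> torch.Size([16, 16, 3, 3])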

Filled with numpy.zeros but getting nan instead

I have a class that contains various methods, which include the following:
def _doc_mean(self, doc):
    doc_vector_values = []
    for w in doc:
        #print(w)
        if w.lower().strip() in self._E:
            Q = np.zeros((1, 200), dtype=np.float64)  # zero array for when a word doesn't have a vector representation in our pretrained embeddings
            doc_vector_values.append(self._E.get(w, Q))
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.mean(np.array(doc_vector_values, dtype=np.float64), axis=0)

def fit(self, X, y=None):
    return self

def transform(self, X):
    return np.array([self._doc_mean(doc) for doc in X])

def fit_transform(self, X, y=None):
    return self.fit(X).transform(X)
In _doc_mean, I compare w with the keys of a dictionary E_; if there is a match, the value of the key-value pair, which contains a 1*200 vector, is loaded into a list, and if there is no match, numpy.zeros((1, 200)) is loaded into the list instead. This list is then converted to an array and the mean is calculated.
When I instantiate the class and fit-transform my 'doc' data
mc = MeanClass()
X_ = mc.fit_transform(doc)
X_ is of dtype "object", and the places where there was a mismatch were filled with nan instead of numpy.zeros.
This leads to multiple other problems in my code that I can't fix. What am I doing wrong?
EDIT:
The E_ dictionary looks like this:
{'hello': array([ 5.84850e-02, 6.20640e-02, ..... -2.08990e-02])
'good': array([ -4.80050e-02, 2.80610e-02, ..... -5.04991e-02])
while doc looks like this:
['hello', 'bye', 'good']
['good', 'bye', 'night']
Since you haven't given a minimal reproducible example, I'll create something simple:
In [125]: E_ = {'foo':np.arange(5), 'bar':np.arange(1,6), 'baz':np.arange(5,10)}
In [126]: doc = ['foo','bar','sub','baz','foo']
Now do the dictionary lookup:
In [127]: alist = []
In [128]: for w in doc:
...: alist.append(E_.get(w,np.zeros((1,5),int)))
...:
In [129]: alist
Out[129]:
[array([0, 1, 2, 3, 4]),
array([1, 2, 3, 4, 5]),
array([[0, 0, 0, 0, 0]]),
array([5, 6, 7, 8, 9]),
array([0, 1, 2, 3, 4])]
In [130]: np.array(alist)
Out[130]:
array([array([0, 1, 2, 3, 4]), array([1, 2, 3, 4, 5]),
array([[0, 0, 0, 0, 0]]), array([5, 6, 7, 8, 9]),
array([0, 1, 2, 3, 4])], dtype=object)
The arrays in E_ are all shape (5,). The 'fill' array is (1,5). Due to the mismatch in shapes, the Out[130] array is 1d object.
I think you are trying to avoid the 'fill' case, but you test w.lower().strip() in self._E, and then use w in the get. So you might get the Q value sometimes. I got it with the 'sub' string.
If instead I make the 'fill' be (5,):
In [131]: alist = []
In [132]: for w in doc:
...: alist.append(E_.get(w,np.zeros((5,),int)))
...:
In [133]: alist
Out[133]:
[array([0, 1, 2, 3, 4]),
array([1, 2, 3, 4, 5]),
array([0, 0, 0, 0, 0]),
array([5, 6, 7, 8, 9]),
array([0, 1, 2, 3, 4])]
In [134]: np.array(alist)
Out[134]:
array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[0, 0, 0, 0, 0],
[5, 6, 7, 8, 9],
[0, 1, 2, 3, 4]])
The result is a (n,5) numeric array.
I can take two different means. One is the mean across all words, with a value for each 'attribute'. The other is the mean for each word, which I could just as well have gotten by taking the mean in E_.
In [135]: np.mean(_, axis=0)
Out[135]: array([1.2, 2. , 2.8, 3.6, 4.4])
In [137]: np.mean(__, axis=1)
Out[137]: array([2., 3., 0., 7., 2.]) # mean for each 'word'
mean of the object array in Out[130]:
In [138]: np.mean(_130, axis=0)
Out[138]: array([[1, 2, 2, 3, 4]])
The result is (1,5) and looks like Out[135] truncated, but I'd have to dig a bit further to be sure.
Hopefully this gives you an idea of what to watch out for, and an idea of the kind of 'minimal reproducible concrete example' that we find most useful.
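Applied to your class, a possible fix (only a sketch, assuming the vectors stored in self._E have shape (200,)) is to make the fill value match that shape and to use the same normalized key for both the membership test and the lookup:
def _doc_mean(self, doc):
    doc_vector_values = []
    for w in doc:
        key = w.lower().strip()
        # fill with a (200,) zero vector so every entry has the same shape,
        # and look up the same normalized key that was tested
        doc_vector_values.append(self._E.get(key, np.zeros(200, dtype=np.float64)))
    return np.mean(np.array(doc_vector_values, dtype=np.float64), axis=0)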

How to process different row in tensor based on the first column value in tensorflow

Let's say I have a 4 by 3 tensor:
sample = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
I would like to return another tensor of shape 4, [r1, r2, r3, r4], where r is equal to tf.reduce_sum(row) if row[0] is less than 5, or tf.reduce_mean(row) if row[0] is greater than or equal to 5.
output:
output = [16.67, 6, 18, 7.33]
I'm not adept at tensorflow; please assist me in achieving the above in Python 3 without a for loop.
Thank you.
UPDATES:
So I've tried to adapt the answer given by @Onyambu to include two samples in the functions, but it gave me an error in all instances.
Here is the adapted code for the first case:
def f(x):
    c = tf.constant(5, tf.float32)
    def fun1():
        return tf.reduce_sum(x)
    def fun2():
        return tf.reduce_mean(x)
    return tf.cond(tf.less(x[0], c), fun1, fun2)

a = tf.map_fn(f, tf.constant(sample, tf.float32))
The above works well.
For two samples:
sample1 = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
sample2 = [[0, 15, 25], [1, 2, 3], [0, 4, 10], [1, 9, 8]]
def f2(x1, x2):
    c = tf.constant(1, tf.float32)
    def fun1():
        return tf.reduce_sum(x1[:, 0] - x2[:, 0])
    def fun2():
        return tf.reduce_mean(x1 - x2)
    return tf.cond(tf.less(x2[0], c), fun1, fun2)

a = tf.map_fn(f2, tf.constant(sample1, tf.float32), tf.constant(sample2, tf.float32))
The adaptation does give errors, but the principle is simple:
calculate the tf.reduce_sum of sample1[:,0] - sample2[:,0] if row[0] is less than 1
calculate the tf.reduce_sum of sample1 - sample2 if row[0] is greater or equal to 1
Thank you for your assistance in advance!
import tensorflow as tf

def f(x):
    y = tf.constant(5, tf.float32)
    def fun1():
        return tf.reduce_sum(x)
    def fun2():
        return tf.reduce_mean(x)
    return tf.cond(tf.less(x[0], y), fun1, fun2)

a = tf.map_fn(f, tf.constant(sample, tf.float32))
with tf.Session() as sess: print(sess.run(a))
[16.666666 6. 18. 7.3333335]
If you want to shorten it:
y = tf.constant(5,tf.float32)
f=lambda x: tf.cond(tf.less(x[0], y), lambda: tf.reduce_sum(x),lambda: tf.reduce_mean(x))
a = tf.map_fn(f,tf.constant(sample,tf.float32))
with tf.Session() as sess: print(sess.run(a))
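For the two-sample update in the question (not part of the original answer, just a possible adaptation): tf.map_fn also accepts a tuple of tensors as elems, so both rows can be processed together; dtype must then be given because the output structure differs from the input. Reading the condition and the column-0 difference per row:
t1 = tf.constant(sample1, tf.float32)
t2 = tf.constant(sample2, tf.float32)

def f2(rows):
    x1, x2 = rows  # one row of sample1 and the matching row of sample2, shape (3,)
    c = tf.constant(1, tf.float32)
    return tf.cond(tf.less(x2[0], c),
                   lambda: x1[0] - x2[0],            # per-row reading of sample1[:,0] - sample2[:,0]
                   lambda: tf.reduce_mean(x1 - x2))  # otherwise, mean of the row difference

a = tf.map_fn(f2, (t1, t2), dtype=tf.float32)
with tf.Session() as sess: print(sess.run(a))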
