groupby aggregate mean in pytorch


I have a 2D tensor:
samples = torch.Tensor([
[0.1, 0.1], #-> group / class 1
[0.2, 0.2], #-> group / class 2
[0.4, 0.4], #-> group / class 2
[0.0, 0.0] #-> group / class 0
])
and a label for each sample corresponding to a class:
labels = torch.LongTensor([1, 2, 2, 0])
so len(samples) == len(labels). Now I want to calculate the mean for each class / label. Because there are 3 classes (0, 1 and 2), the final result should have shape [n_classes, samples.shape[1]]. So the expected solution should be:
result == torch.Tensor([
[0.1, 0.1],
[0.3, 0.3], # -> mean of [0.2, 0.2] and [0.4, 0.4]
[0.0, 0.0]
])
Question: How can this be done in pure pytorch (i.e. no numpy so that I can autograd) and ideally without for loops?

All you need to do is form an mxn matrix (m=num classes, n=num samples) which will select the appropriate weights, and scale the mean appropriately. Then you can perform a matrix multiplication between your newly formed matrix and the samples matrix.
Given your labels, your matrix should be (each row corresponds to a class, each column to a sample, and each entry is that sample's weight in the class mean):
[[0.0000, 0.0000, 0.0000, 1.0000],
[1.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.5000, 0.5000, 0.0000]]
Which you can form as follows:
M = torch.zeros(labels.max()+1, len(samples))
M[labels, torch.arange(len(samples))] = 1
M = torch.nn.functional.normalize(M, p=1, dim=1)
torch.mm(M, samples)
Output:
tensor([[0.0000, 0.0000],
[0.1000, 0.1000],
[0.3000, 0.3000]])
Note that the output means are correctly sorted in class order.
Why does M[labels, torch.arange(len(samples))] = 1 work?
This pairs up the two index tensors, labels and torch.arange(len(samples)), element by element (they are broadcast against each other). Essentially, we are generating a 2D index for every element in labels: the first specifies which of the m classes it belongs to, and the second simply specifies its index position (from 0 to N-1). Another way would be to explicitly generate all the 2D indices:
twoD_indices = []
for count, label in enumerate(labels):
    twoD_indices.append((label, count))
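The steps above can be put together as a small function. This is only a minimal sketch, assuming the labels are non-negative integers; the name group_mean_matmul is made up for illustration and is not part of PyTorch. It also checks that gradients reach the samples, which was the reason for avoiding NumPy.

```python
import torch


def group_mean_matmul(samples: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-class mean via a row-normalized selection matrix (hypothetical helper)."""
    n_classes = int(labels.max()) + 1
    n_samples = samples.shape[0]
    # M[c, i] = 1 if sample i belongs to class c, else 0.
    M = torch.zeros(n_classes, n_samples, dtype=samples.dtype, device=samples.device)
    M[labels, torch.arange(n_samples, device=samples.device)] = 1
    # L1-normalizing each row turns the row sum into a row mean.
    M = torch.nn.functional.normalize(M, p=1, dim=1)
    return M @ samples


samples = torch.tensor(
    [[0.1, 0.1], [0.2, 0.2], [0.4, 0.4], [0.0, 0.0]], requires_grad=True
)
labels = torch.tensor([1, 2, 2, 0])

means = group_mean_matmul(samples, labels)
print(means)

# Only differentiable operations touch `samples`, so backward() works.
means.sum().backward()
print(samples.grad)
```

Each sample's gradient equals its weight in M, so the two class-2 samples receive 0.5 and the others receive 1.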

Reposting here an answer from @ptrblck_de in the PyTorch forums:
labels = labels.view(labels.size(0), 1).expand(-1, samples.size(1))
unique_labels, labels_count = labels.unique(dim=0, return_counts=True)
res = torch.zeros_like(unique_labels, dtype=torch.float).scatter_add_(0, labels, samples)
res = res / labels_count.float().unsqueeze(1)

As previous solutions do not work for the case of sparse groups (e.g., not all the groups are in the data), I made one :)
def groupby_mean(value: torch.Tensor, labels: torch.LongTensor) -> (torch.Tensor, torch.LongTensor):
    """Group-wise average for (sparse) grouped tensors

    Args:
        value (torch.Tensor): values to average (# samples, latent dimension)
        labels (torch.LongTensor): labels for embedding parameters (# samples,)

    Returns:
        result (torch.Tensor): (# unique labels, latent dimension)
        new_labels (torch.LongTensor): (# unique labels,)

    Examples:
        >>> samples = torch.Tensor([
                [0.15, 0.15, 0.15],  #-> group / class 1
                [0.2, 0.2, 0.2],     #-> group / class 5
                [0.4, 0.4, 0.4],     #-> group / class 5
                [0.0, 0.0, 0.0]      #-> group / class 0
            ])
        >>> labels = torch.LongTensor([1, 5, 5, 0])
        >>> result, new_labels = groupby_mean(samples, labels)
        >>> result
        tensor([[0.0000, 0.0000, 0.0000],
                [0.1500, 0.1500, 0.1500],
                [0.3000, 0.3000, 0.3000]])
        >>> new_labels
        tensor([0, 1, 5])
    """
    # Map the (possibly sparse) labels onto contiguous indices 0..k-1.
    uniques = labels.unique().tolist()
    labels = labels.tolist()
    key_val = {key: val for key, val in zip(uniques, range(len(uniques)))}
    val_key = {val: key for key, val in zip(uniques, range(len(uniques)))}
    labels = torch.LongTensor(list(map(key_val.get, labels)))
    # Expand the labels to the shape of value so scatter_add_ can use them as an index.
    labels = labels.view(labels.size(0), 1).expand(-1, value.size(1))
    unique_labels, labels_count = labels.unique(dim=0, return_counts=True)
    result = torch.zeros_like(unique_labels, dtype=torch.float).scatter_add_(0, labels, value)
    result = result / labels_count.float().unsqueeze(1)
    # Map the contiguous indices back to the original label values.
    new_labels = torch.LongTensor(list(map(val_key.get, unique_labels[:, 0].tolist())))
    return result, new_labels
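The same remapping of sparse labels can probably be done without Python dictionaries. The following is a sketch of an alternative, not the answer's method: it relies on torch.unique with return_inverse=True to get contiguous group indices and on index_add_ to sum the rows per group. The function name groupby_mean_unique is an assumption made for illustration.

```python
import torch


def groupby_mean_unique(value: torch.Tensor, labels: torch.Tensor):
    """Group-wise mean for arbitrary integer labels (hypothetical helper)."""
    # inverse[i] is the position of labels[i] among the sorted unique labels.
    uniques, inverse, counts = torch.unique(
        labels, return_inverse=True, return_counts=True
    )
    sums = torch.zeros(
        uniques.numel(), value.size(1), dtype=value.dtype, device=value.device
    )
    # Add each sample's row into the row of its group.
    sums.index_add_(0, inverse, value)
    return sums / counts.unsqueeze(1).to(value.dtype), uniques


samples = torch.tensor([
    [0.15, 0.15, 0.15],  # group 1
    [0.2, 0.2, 0.2],     # group 5
    [0.4, 0.4, 0.4],     # group 5
    [0.0, 0.0, 0.0],     # group 0
])
labels = torch.tensor([1, 5, 5, 0])

result, new_labels = groupby_mean_unique(samples, labels)
print(result)
print(new_labels)
```

The rows of result follow the sorted unique labels returned in new_labels, so groups that never occur simply produce no row.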

For 3D Tensors:
For those who are interested: I expanded @yhenon's answer to the case where labels is a 2D tensor and samples is a 3D tensor. This might be useful if you want to execute this operation in batches (as I do). But it comes with a caveat (see the end).
M = torch.zeros(labels.shape[0], labels.max()+1, labels.shape[1])
M[torch.arange(len(labels))[:,None], labels, torch.arange(labels.size(1))] = 1
M = torch.nn.functional.normalize(M, p=1, dim=-1)
result = M @ samples
samples = torch.Tensor([[
[0.1, 0.1], #-> group / class 1
[0.2, 0.2], #-> group / class 2
[0.4, 0.4], #-> group / class 2
[0.0, 0.0] #-> group / class 0
], [
[0.5, 0.5], #-> group / class 0
[0.2, 0.2], #-> group / class 1
[0.4, 0.4], #-> group / class 2
[0.1, 0.1] #-> group / class 3
]])
labels = torch.LongTensor([[1, 2, 2, 0], [0, 1, 2, 3]])
Output:
>>> result
tensor([[[0.0000, 0.0000],
[0.1000, 0.1000],
[0.3000, 0.3000],
[0.0000, 0.0000]],
[[0.5000, 0.5000],
[0.2000, 0.2000],
[0.4000, 0.4000],
[0.1000, 0.1000]]])
Be careful: result[0] now has 4 rows (instead of 3 in @yhenon's answer), because labels[1] contains a 3. The last row of result[0] contains only 0s. If you don't expect genuine all-zero rows at the end of your resulting tensor, you can use this code and deal with the 0s later.
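One possible way to tell such padding rows apart from classes whose mean really is zero is to count the samples per class in each batch. This is a sketch under the same assumptions as above (non-negative integer labels); the names counts and present are illustrative only.

```python
import torch

labels = torch.tensor([[1, 2, 2, 0], [0, 1, 2, 3]])
n_classes = int(labels.max()) + 1

# counts[b, c] = number of samples in batch b that carry label c.
counts = torch.zeros(labels.shape[0], n_classes, dtype=torch.long)
counts.scatter_add_(1, labels, torch.ones_like(labels))

# True where a class actually occurs in that batch; False marks padding rows.
present = counts > 0
print(counts)
print(present)
```

The mask has the same leading shape as result, so result[present] would keep only the rows that come from real groups.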

Related

How to define custom function for scipy's binned_statistic_2d?

The documentation for scipy's binned_statistic_2d function gives an example for a 2D histogram:
from scipy import stats
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
Makes sense, but I'm now trying to implement a custom function. The custom function description is given as:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
I wasn't sure exactly how to implement this, so I thought I'd check my understanding by writing a custom function that reproduces the count option. I tried
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, custom_func, bins=[binx, biny])
but this generates an error like so:
556 # Make sure `values` match `sample`
557 if(statistic != 'count' and Vlen != Dlen):
558 raise AttributeError('The number of `values` elements must match the '
559 'length of each `sample` dimension.')
561 try:
562 M = len(bins)
AttributeError: The number of `values` elements must match the length of each `sample` dimension.
How is this custom function supposed to be defined?

The reason for this error is that when using a custom statistic function (or any non-count statistic), you have to pass some array or list of arrays to the values parameter (with the number of elements matching the number in x). You can't just leave it as None as in your example, even though it is irrelevant and does not get used when computing counts of data points in each bin.
So, to match the results, you can just pass the same x object to the values parameter:
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, x, custom_func, bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
The result matches that of the count statistic:
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))

How can I retrieve elements in a multidimensional pytorch tensor by a list of indices?

I have two tensors: scores and lists
scores is of shape (x, 8) and lists of (x, 8, 4). I want to filter the max values for each row in scores and filter the respective elements from lists.
Take the following as an example (shape dimension 8 was reduced to 2 for simplicity):
scores = torch.tensor([[0.5, 0.4], [0.3, 0.8], ...])
lists = torch.tensor([[[0.2, 0.3, 0.1, 0.5],
[0.4, 0.7, 0.8, 0.2]],
[[0.1, 0.2, 0.1, 0.3],
[0.4, 0.3, 0.2, 0.5]], ...])
Then I would like to filter these tensors to:
scores = torch.tensor([0.5, 0.8, ...])
lists = torch.tensor([[0.2, 0.3, 0.1, 0.5], [0.4, 0.3, 0.2, 0.5], ...])
NOTE:
So far I have tried to retrieve the indices from the original score vector and use them as an index vector to filter lists:
# PSEUDO-CODE
indices = scores.argmax(dim=1)
for list, idx in zip(lists, indices):
    list = list[idx]
That is also where the question name is coming from.

I imagine you tried something like
indices = scores.argmax(dim=1)
selection = lists[:, indices]
This does not work because the indices are selected for every element in dimension 0, so the final shape is (x, x, 4).
To perform the correct selection, you need to replace the slice with a range.
indices = scores.argmax(dim=1)
selection = lists[range(indices.size(0)), indices]
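As a runnable check on the question's two example rows, the sketch below pairs each row index with its argmax column. It uses torch.arange instead of range and takes the scores from torch.max in the same call; both are stylistic choices here, not requirements.

```python
import torch

scores = torch.tensor([[0.5, 0.4], [0.3, 0.8]])
lists = torch.tensor([
    [[0.2, 0.3, 0.1, 0.5], [0.4, 0.7, 0.8, 0.2]],
    [[0.1, 0.2, 0.1, 0.3], [0.4, 0.3, 0.2, 0.5]],
])

# max returns both the best score and its column index for each row.
best_scores, indices = scores.max(dim=1)
# Pair row i with column indices[i]; the result has shape (x, 4).
rows = torch.arange(scores.size(0))
selection = lists[rows, indices]

print(best_scores)
print(selection)
```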

Vector similarity with multiple dtypes (string, int, floats etc.)?

I have the following 2 rows in my dataframe:
[1, 1.1, -19, "kuku", "lulu"]
[2.8, 1.1, -20, "kuku", "lilu"]
I want to calculate their similarity by comparing each dimension (equal? 1, otherwise 0) and get the following vector: [0, 1, 0, 1, 0], is there any function that takes a vector and performs such "similarity" against all rows and calculates mean? In our case it would be 2/5 = 0.4.

I would just use a simple == on NumPy arrays, cast to int for the vector, and numpy.mean() for the mean of the vector:
import numpy as np
a = [1, 1.1, -19, "kuku", "lulu"]
b = [2.8, 1.1, -20, "kuku", "lilu"]
res = (np.array(a) == np.array(b)).astype(int)
print(res)
# [0 1 0 1 0]
v = res.mean()
print(v)
# 0.4
To compare every row against every other row, if you do not mind computing everything twice and you can afford the potentially large intermediate temporary objects:
import numpy as np
arr = np.array([
[1, 1.1, -19, "kuku", "lulu"],
[2.8, 1.1, -20, "kuku", "lilu"],
[2.8, 1.1, -20, "kuku", "lulu"]])
corr = arr[None, :, :] == arr[:, None, :]
score = corr.mean(-1)
print(score)
# [[1. 0.4 0.6]
# [0.4 1. 0.8]
# [0.6 0.8 1. ]]
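Note that building a NumPy array from mixed numbers and strings without an explicit dtype converts every element to a string, so values such as 1 and 1.0 would compare as different. If that matters, a possible variant is to keep the original Python objects with dtype=object; this is a sketch, not part of the answer above.

```python
import numpy as np

a = np.array([1, 1.1, -19, "kuku", "lulu"], dtype=object)
b = np.array([2.8, 1.1, -20, "kuku", "lilu"], dtype=object)

# With dtype=object, each pair is compared with Python's own ==.
matches = a == b
print(matches.astype(int))
print(matches.mean())
```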

Pytorch, sample given batch logits

Given logits like
# each row is a record of data
logits = np.array([ [0.1, 0.3, 0.5], [0.3, 0.1, 0.5], [0.1, 0.3, 0.0] ])
How can I use PyTorch to sample the index for the logits of each row? The current distribution APIs do not seem to support this.
What I want is, for example
distribution = Categorical(logits=logits)
labels = distribution.sample(dim=1)
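For what it is worth, torch.distributions.Categorical treats the leading dimensions of logits as batch dimensions, so a plain sample() call without a dim argument already draws one index per row. A minimal sketch, assuming the values are meant as unnormalized log-probabilities and are converted to a torch tensor first:

```python
import torch
from torch.distributions import Categorical

# Treated here as unnormalized log-probabilities, one row per record.
logits = torch.tensor([[0.1, 0.3, 0.5], [0.3, 0.1, 0.5], [0.1, 0.3, 0.0]])

distribution = Categorical(logits=logits)
# The leading dimension is the batch, so this draws one index per row.
labels = distribution.sample()
print(labels.shape)
```

The drawn indices are random, so only the shape (one label per row) is fixed from run to run.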

How to construct a numpy array with its each element be the minimum value of all possible values?

I want to construct a 1D NumPy array a, and I know each a[i] has several possible values. The number of possible values can differ from one element of a to another. I want to set each a[i] to the minimum of its possible values.
For example, I have two arrays:
idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
The array I want to construct is the following:
a = np.array([0.1, 0.5, 0.6, 0.1])
Is there a NumPy function that can do this?

Here's one approach -
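import numpy as np

def groupby_minimum(idx, val):
    # Sort so that equal group ids become contiguous.
    sidx = idx.argsort()
    sorted_idx = idx[sidx]
    # Start position of each group in the sorted order.
    cut_idx = np.r_[0, np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1]
    return np.minimum.reduceat(val[sidx], cut_idx)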
Sample run -
In [36]: idx = np.array([0, 1, 0, 2, 3, 3, 3])
...: val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
...:
In [37]: groupby_minimum(idx, val)
Out[37]: array([ 0.1, 0.5, 0.6, 0.1])
Here's another using pandas -
import pandas as pd
def pandas_groupby_minimum(idx, val):
    df = pd.DataFrame({'ID': idx, 'val': val})
    return df.groupby('ID')['val'].min().values
Sample run -
In [66]: pandas_groupby_minimum(idx, val)
Out[66]: array([ 0.1, 0.5, 0.6, 0.1])

You can also use binned_statistic:
import numpy as np
from scipy.stats import binned_statistic
# One bin per unique id; the extra right edge closes the last bin.
idx_list = np.append(np.unique(idx), np.max(idx) + 1)
stats = binned_statistic(idx, val, statistic='min', bins=idx_list)
a = stats.statistic
I think, in older scipy versions, statistic='min' was not implemented, but you can use statistic=np.min instead. Intervals are half open in binned_statistic, so this implementation is safe.
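Another option that stays within NumPy is the unbuffered ufunc method np.minimum.at. The sketch below assumes idx holds non-negative integers that can be used directly as positions in the output; group ids that never occur would be left at inf.

```python
import numpy as np

idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])

# Start every group at +inf so that any real value replaces it.
a = np.full(idx.max() + 1, np.inf)
# Unbuffered in-place minimum: repeated indices are all taken into account.
np.minimum.at(a, idx, val)
print(a)
```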
