The documentation for scipy's binned_statistic_2d function gives an example for a 2D histogram:
from scipy import stats
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
Makes sense, but I'm now trying to implement a custom function. The custom function description is given as:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
I wasn't sure exactly how to implement this, so I thought I'd check my understanding by writing a custom function that reproduces the count option. I tried
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, custom_func, bins=[binx, biny])
but this generates an error like so:
556 # Make sure `values` match `sample`
557 if(statistic != 'count' and Vlen != Dlen):
558 raise AttributeError('The number of `values` elements must match the '
559 'length of each `sample` dimension.')
561 try:
562 M = len(bins)
AttributeError: The number of `values` elements must match the length of each `sample` dimension.
How is this custom function supposed to be defined?
The reason for this error is that when using a custom statistic function (or any statistic other than 'count'), you have to pass an array, or list of arrays, to the values parameter, with the number of elements matching the length of x. You can't just leave it as None as in your example, even though the values are irrelevant and never used when counting the data points in each bin.
So, to match the results, you can just pass the same x object to the values parameter:
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, x, custom_func, bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
The result matches that of the count statistic:
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
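To see the values argument actually being used, here is a sketch with a different custom statistic; the function name, the use of np.ptp, and the vals array are illustrative choices of mine, not from the question:
import numpy as np
from scipy import stats

def range_func(values):
    # peak-to-peak range of the values falling in each bin;
    # guard against the empty-bin call function([])
    return np.ptp(values) if len(values) else np.nan

x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
vals = [10.0, 20.0, 30.0, 40.0]  # hypothetical per-point values
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, vals, range_func, bins=[binx, biny])
print(ret.statistic)
# [[20.  0.]
#  [ 0. nan]]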
I have two tensors: scores and lists
scores is of shape (x, 8) and lists is of shape (x, 8, 4). I want to select the maximum value of each row in scores and pick the corresponding elements from lists.
Take the following as an example (shape dimension 8 was reduced to 2 for simplicity):
scores = torch.tensor([[0.5, 0.4], [0.3, 0.8], ...])
lists = torch.tensor([[[0.2, 0.3, 0.1, 0.5],
[0.4, 0.7, 0.8, 0.2]],
[[0.1, 0.2, 0.1, 0.3],
[0.4, 0.3, 0.2, 0.5]], ...])
Then I would like to filter these tensors to:
scores = torch.tensor([0.5, 0.8, ...])
lists = torch.tensor([[0.2, 0.3, 0.1, 0.5], [0.4, 0.3, 0.2, 0.5], ...])
NOTE:
So far, I have tried to retrieve the indices from the original scores tensor and use them as an index vector to filter lists:
# PSEUDO-CODE
indices = scores.argmax(dim=1)
for list, idx in zip(lists, indices):
    list = list[idx]
That is also where the question title comes from.
I imagine you tried something like
indices = scores.argmax(dim=1)
selection = lists[:, indices]
This does not work because the indices are selected for every element in dimension 0, so the final shape is (x, x, 4).
To perform the correct selection, you need to replace the slice with a range:
indices = scores.argmax(dim=1)
selection = lists[range(indices.size(0)), indices]
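As a quick sanity check, here is a minimal runnable sketch using the (truncated) example tensors from the question:
import torch

scores = torch.tensor([[0.5, 0.4], [0.3, 0.8]])
lists = torch.tensor([[[0.2, 0.3, 0.1, 0.5],
                       [0.4, 0.7, 0.8, 0.2]],
                      [[0.1, 0.2, 0.1, 0.3],
                       [0.4, 0.3, 0.2, 0.5]]])

indices = scores.argmax(dim=1)                      # tensor([0, 1])
selection = lists[range(indices.size(0)), indices]  # shape (x, 4)
print(selection)
# tensor([[0.2000, 0.3000, 0.1000, 0.5000],
#         [0.4000, 0.3000, 0.2000, 0.5000]])
The same range trick recovers the filtered scores, scores[range(len(scores)), indices], which should equal scores.max(dim=1).values.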
I have some 2D data with x and y coordinates both within [0,1], plotted using pcolormesh.
Now I want to symmetrize the plot to [-0.5, 0.5] for both x and y coordinates. In Matlab I was able to achieve this by changing x and y from e.g. [0, 0.2, 0.4, 0.6, 0.8] to [0, 0.2, 0.4, -0.4, -0.2], without rearranging the data. However, with pcolormesh I cannot get the desired result.
A minimum example is shown below, with data represented simply by x+y:
import matplotlib.pyplot as plt
import numpy as np
x,y = np.mgrid[0:1:5j,0:1:5j]
fig,(ax1,ax2,ax3) = plt.subplots(1,3,figsize=(9,3.3),constrained_layout=1)
# original plot spanning [0,1]
img1 = ax1.pcolormesh(x,y,x+y,shading='auto')
# shift x and y from [0,1] to [-0.5,0.5]
x = x*(x<0.5)+(x-1)*(x>0.5)
y = y*(y<0.5)+(y-1)*(y>0.5)
img2 = ax2.pcolormesh(x,y,x+y,shading='auto') # similar code works in Matlab
# for this specific case, the following is close to the desired result, I can just rename x and y tick labels
# to [-0.5,0.5], but in general data is not simply x+y
img3 = ax3.pcolormesh(x+y,shading='auto')
fig.colorbar(img1,ax=[ax1,ax2,ax3],orientation='horizontal')
The corresponding figure is below; any suggestion on what I am missing would be appreciated!
Let's look at what you want to achieve in a 1D example.
You have x values between 0 and 1 and a dummy function f(x) = 20*x to produce some values.
# x  = [0, .2, .4, .6, .8] -> [0, .2, .4, -.4, -.2] -> [-.4, -.2, .0, .2, .4]
# fx = [0, 4, 8, 12, 16] -> [0, 4, 8, 12, 16] -> [ 12, 16, 0, 4, 8]
# ^ only flip and shift x not fx ^
You could use np.roll() to achieve the last operation.
I used n=14 to make the result more visible and to show that this approach works for arbitrary n (the inline comments below show the 5-element case from the 1D example above).
import numpy as np
import matplotlib.pyplot as plt
n = 14
x, y = np.meshgrid(np.linspace(0, 1, n, endpoint=False),
                   np.linspace(0, 1, n, endpoint=False))
z = x + y
x_sym = x*(x <= .5)+(x-1)*(x > .5)
# array([[ 0. , 0.2, 0.4, -0.4, -0.2], ...
x_sym = np.roll(x_sym, n//2, axis=(0, 1))
# array([[-0.4, -0.2, 0. , 0.2, 0.4], ...
y_sym = y*(y <= .5)+(y-1)*(y > .5)
y_sym = np.roll(y_sym, n//2, axis=(0, 1))
z_sym = np.roll(z, n//2, axis=(0, 1))
# array([[1.2, 1.4, 0.6, 0.8, 1. ],
# [1.4, 1.6, 0.8, 1. , 1.2],
# [0.6, 0.8, 0. , 0.2, 0.4],
# [0.8, 1. , 0.2, 0.4, 0.6],
# [1. , 1.2, 0.4, 0.6, 0.8]])
fig, (ax1, ax2) = plt.subplots(1, 2)
img1 = ax1.imshow(z, origin='lower', extent=(.0, 1., .0, 1.))
img2 = ax2.imshow(z_sym, origin='lower', extent=(-.5, .5, -.5, .5))
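If you would rather stay with pcolormesh as in the question, the rolled arrays can be passed straight to it; a minimal sketch, assuming a Matplotlib version recent enough for shading='auto' with same-shaped X, Y and C:
fig2, ax = plt.subplots()
img = ax.pcolormesh(x_sym, y_sym, z_sym, shading='auto')
fig2.colorbar(img, ax=ax)
plt.show()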
I am getting "Type Error: 0" on a dict when acquiring the length of the dict:
t = len(Motifs[0])
I reviewed the previous post on "Type Error: 0" and I tried casting:
t = int(len(Motifs[0]))
def Consensus(Motifs):
    k = len(Motifs[0])
    profile = ProfileWithPseudocounts(Motifs)
    consensus = ""
    for j in range(k):
        maximum = 0
        frequentSymbol = ""
        for symbol in "ACGT":
            if profile[symbol][j] > maximum:
                maximum = profile[symbol][j]
                frequentSymbol = symbol
        consensus += frequentSymbol
    return consensus

def ProfileWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    profile = {}
    count = CountWithPseudocounts(Motifs)
    for key, motif_lists in sorted(count.items()):
        profile[key] = motif_lists
        for motif_list, number in enumerate(motif_lists):
            motif_lists[motif_list] = number/(float(t+4))
    return profile

def CountWithPseudocounts(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    count = {}
    for symbol in "ACGT":
        count[symbol] = []
        for j in range(k):
            count[symbol].append(1)
    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    return count
Motifs = {'A': [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
'C': [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
'G': [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
'T': [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}
#print(type(Motifs))
print(Consensus(Motifs))
"Type Error: 0"
"t = len(Motifs)"
"k = len(Motifs[0])"
"symbol = Motifs[i][j]"
on lines(9, 24, 35, 44) when code executes!!! Traceback:
Traceback (most recent call last):
  File "myfile.py", line 47, in <module>
    print(Consensus(Motifs))
  File "myfile.py", line 2, in Consensus
    k = len(Motifs[0])
KeyError: 0
I expect to get the consensus result without errors.
You have a dictionary called Motifs with 4 keys:
>>> Motifs.keys()
dict_keys(['A', 'C', 'G', 'T'])
But you are trying to get the value for the key 0, which does not exist (see, for example, Motifs[0] on line 2 of Consensus).
You should use a valid key instead, for example Motifs['A'].
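For example, with the Motifs dictionary from the question, a minimal sketch of valid lookups:
t = len(Motifs)       # 4 -- the number of keys: 'A', 'C', 'G', 'T'
k = len(Motifs['A'])  # 6 -- the length of the list stored under 'A'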
You defined Motifs as a dictionary.
Motifs = {'A': [0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
'C': [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
'G': [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
'T': [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]}
Motifs[0] raises KeyError: 0 because the keys are ['T', 'G', 'A', 'C'].
It seems like you wanted to access the length of the first list, the one associated with key 'A'.
You can achieve this by taking len(Motifs['A']).
Note: insertion ordering of dictionary elements is only a guaranteed language feature starting from Python 3.7.
I have a 2D tensor:
samples = torch.Tensor([
[0.1, 0.1], #-> group / class 1
[0.2, 0.2], #-> group / class 2
[0.4, 0.4], #-> group / class 2
[0.0, 0.0] #-> group / class 0
])
and a label for each sample corresponding to a class:
labels = torch.LongTensor([1, 2, 2, 0])
so len(samples) == len(labels). Now I want to calculate the mean for each class / label. Because there are 3 classes (0, 1, and 2), the final vector should have dimension [n_classes, samples.shape[1]]. So the expected solution should be:
result == torch.Tensor([
[0.1, 0.1],
[0.3, 0.3], # -> mean of [0.2, 0.2] and [0.4, 0.4]
[0.0, 0.0]
])
Question: How can this be done in pure pytorch (i.e. no numpy so that I can autograd) and ideally without for loops?
All you need to do is form an m x n matrix (m = number of classes, n = number of samples) that selects the appropriate weights and scales the mean appropriately. Then you can perform a matrix multiplication between your newly formed matrix and the samples matrix.
Given your labels, your matrix should be (each row corresponds to a class, each column to a sample, and the entries are the weights):
[[0.0000, 0.0000, 0.0000, 1.0000],
[1.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.5000, 0.5000, 0.0000]]
Which you can form as follows:
M = torch.zeros(labels.max()+1, len(samples))
M[labels, torch.arange(len(samples))] = 1
M = torch.nn.functional.normalize(M, p=1, dim=1)
torch.mm(M, samples)
Output:
tensor([[0.0000, 0.0000],
[0.1000, 0.1000],
[0.3000, 0.3000]])
Note that the output means are correctly sorted in class order.
Why does M[labels, torch.arange(len(samples))] = 1 work?
This is performing a broadcast operation between the labels and the number of samples. Essentially, we are generating a 2D index for every element in labels: the first specifies which of the m classes it belongs to, and the second simply specifies its index position (from 0 to N-1). Another way would be to explicitly generate all the 2D indices:
twoD_indices = []
for count, label in enumerate(labels):
    twoD_indices.append((label, count))
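Filling a matrix from these explicit indices gives the same result as the broadcast assignment; a sketch:
M2 = torch.zeros(labels.max()+1, len(samples))
for label, count in twoD_indices:
    M2[label, count] = 1
# M2 now equals the M built with the broadcast assignment above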
Reposting here an answer from @ptrblck_de in the PyTorch forums:
labels = labels.view(labels.size(0), 1).expand(-1, samples.size(1))
unique_labels, labels_count = labels.unique(dim=0, return_counts=True)
res = torch.zeros_like(unique_labels, dtype=torch.float).scatter_add_(0, labels, samples)
res = res / labels_count.float().unsqueeze(1)
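With the samples and labels from the question, res comes out as tensor([[0.0000, 0.0000], [0.1000, 0.1000], [0.3000, 0.3000]]), i.e. the same class-sorted means as in the answer above.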
As the previous solutions do not work for the case of sparse groups (e.g., not all the groups are present in the data), I made one that does :)
def groupby_mean(value: torch.Tensor, labels: torch.LongTensor) -> (torch.Tensor, torch.LongTensor):
    """Group-wise average for (sparse) grouped tensors

    Args:
        value (torch.Tensor): values to average (# samples, latent dimension)
        labels (torch.LongTensor): labels for embedding parameters (# samples,)

    Returns:
        result (torch.Tensor): (# unique labels, latent dimension)
        new_labels (torch.LongTensor): (# unique labels,)

    Examples:
        >>> samples = torch.Tensor([
        ...     [0.15, 0.15, 0.15],  # -> group / class 1
        ...     [0.2, 0.2, 0.2],     # -> group / class 5
        ...     [0.4, 0.4, 0.4],     # -> group / class 5
        ...     [0.0, 0.0, 0.0]      # -> group / class 0
        ... ])
        >>> labels = torch.LongTensor([1, 5, 5, 0])
        >>> result, new_labels = groupby_mean(samples, labels)
        >>> result
        tensor([[0.0000, 0.0000, 0.0000],
                [0.1500, 0.1500, 0.1500],
                [0.3000, 0.3000, 0.3000]])
        >>> new_labels
        tensor([0, 1, 5])
    """
    # Remap the (possibly sparse) labels to a dense 0..n-1 range
    uniques = labels.unique().tolist()
    labels = labels.tolist()
    key_val = {key: val for key, val in zip(uniques, range(len(uniques)))}
    val_key = {val: key for key, val in zip(uniques, range(len(uniques)))}
    labels = torch.LongTensor(list(map(key_val.get, labels)))
    # Scatter-add the values per group and divide by the group sizes
    labels = labels.view(labels.size(0), 1).expand(-1, value.size(1))
    unique_labels, labels_count = labels.unique(dim=0, return_counts=True)
    result = torch.zeros_like(unique_labels, dtype=torch.float).scatter_add_(0, labels, value)
    result = result / labels_count.float().unsqueeze(1)
    # Map the dense group ids back to the original label values
    new_labels = torch.LongTensor(list(map(val_key.get, unique_labels[:, 0].tolist())))
    return result, new_labels
For 3D tensors:
For those who are interested: I expanded @yhenon's answer to the case where labels is a 2D tensor and samples is a 3D tensor. This might be useful if you want to execute this operation in batches (as I do). But it comes with a caveat (see at the end).
M = torch.zeros(labels.shape[0], labels.max()+1, labels.shape[1])
M[torch.arange(len(labels))[:,None], labels, torch.arange(labels.size(1))] = 1
M = torch.nn.functional.normalize(M, p=1, dim=-1)
result = M @ samples
samples = torch.Tensor([[
[0.1, 0.1], #-> group / class 1
[0.2, 0.2], #-> group / class 2
[0.4, 0.4], #-> group / class 2
[0.0, 0.0] #-> group / class 0
], [
[0.5, 0.5], #-> group / class 0
[0.2, 0.2], #-> group / class 1
[0.4, 0.4], #-> group / class 2
[0.1, 0.1] #-> group / class 3
]])
labels = torch.LongTensor([[1, 2, 2, 0], [0, 1, 2, 3]])
Output:
>>> result
tensor([[[0.0000, 0.0000],
[0.1000, 0.1000],
[0.3000, 0.3000],
[0.0000, 0.0000]],
[[0.5000, 0.5000],
[0.2000, 0.2000],
[0.4000, 0.4000],
[0.1000, 0.1000]]])
Be careful: now result[0] has a length of 4 (instead of 3 as in @yhenon's answer), because labels[1] contains a 3. The last row of result[0] contains only 0s. If you don't expect 0s in the last rows of your resulting tensor, you can use this code and deal with the 0s later.
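One way to deal with those padding rows afterwards is to mask out classes that never occur in a batch element; the mask construction below is my own suggestion, not part of the original answer:
# valid[b, c] is True iff class c occurs somewhere in labels[b]
valid = torch.zeros(labels.shape[0], labels.max()+1, dtype=torch.bool)
valid[torch.arange(labels.shape[0])[:, None], labels] = True
# e.g. result[0][valid[0]] drops the all-zero row for the missing class 3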
I want to construct a 1D numpy array a, where each a[i] has several possible values. Of course, the number of possible values can differ between elements of a. I want each a[i] set to the minimum of all its possible values.
For example, I have two arrays:
idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
The array I want to construct is following:
a = np.array([0.1, 0.5, 0.6, 0.1])
So, does there exist any function in numpy that can do this?
Here's one approach -
def groupby_minimum(idx, val):
    # Sort the values by group id, locate the start of each group,
    # then take the minimum over each group slice in one vectorized call
    sidx = idx.argsort()
    sorted_idx = idx[sidx]
    cut_idx = np.r_[0, np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1]
    return np.minimum.reduceat(val[sidx], cut_idx)
Sample run -
In [36]: idx = np.array([0, 1, 0, 2, 3, 3, 3])
...: val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
...:
In [37]: groupby_minimum(idx, val)
Out[37]: array([ 0.1, 0.5, 0.6, 0.1])
Here's another using pandas -
import pandas as pd
def pandas_groupby_minimum(idx, val):
    df = pd.DataFrame({'ID': idx, 'val': val})
    return df.groupby('ID')['val'].min().values
Sample run -
In [66]: pandas_groupby_minimum(idx, val)
Out[66]: array([ 0.1, 0.5, 0.6, 0.1])
You can also use binned_statistic:
from scipy.stats import binned_statistic
idx_list = np.append(np.unique(idx), np.max(idx)+1)
stats = binned_statistic(idx, val, statistic='min', bins=idx_list)
a = stats.statistic
I think in older scipy versions statistic='min' was not implemented, but you can use statistic=np.min instead. Intervals are half-open in binned_statistic, so this implementation is safe.
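Sample run with the idx and val arrays from the question -
print(a)
# [0.1 0.5 0.6 0.1]
which matches the reduceat and pandas approaches above.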