Given a tensor:
A = torch.tensor([2., 3., 4., 5., 6., 7.])
Then, give each element in A an id:
id = torch.arange(A.shape[0], dtype = torch.int) # tensor([0,1,2,3,4,5])
In other words, the id of 2. in A is 0 and the id of 3. in A is 1:
2. -> 0
3. -> 1
4. -> 2
5. -> 3
6. -> 4
7. -> 5
Then, I have a new tensor:
B = torch.tensor([3., 6., 6., 5., 4., 4., 4.])
Is there any way in PyTorch to map each element in B to its id?
In other words, I want to obtain tensor([1, 4, 4, 3, 2, 2, 2]), in which each element is the id of the corresponding element of B.
What you ask can be done by slowly iterating over the whole B tensor, checking each of its elements against all elements of A and then retrieving the index of each match:
In [*]: for x in B:
...: print(torch.where(x==A)[0][0])
...:
...:
tensor(1)
tensor(4)
tensor(4)
tensor(3)
tensor(2)
tensor(2)
tensor(2)
Here I used torch.where to find all the True entries in the matrix x == A, where x takes the value of each element of B. This is really slow, but it allows you to add handling for cases where some elements of B do not appear in A.
The quick and dirty method to get what you want with linear algebra operations is:
In [*]: (B.view(-1,1) == A).int().argmax(dim=1)
Out[*]: tensor([1, 4, 4, 3, 2, 2, 2])
This trick takes advantage of the fact that argmax returns the first 'max' index of each vector in dim=1.
Big warning here: if an element does not exist in A, no error will be raised and the result will silently be 0 for every such element.
In [*]: C = torch.tensor([100, 1000, 1, 3, 9999])
In [*]: (C.view(-1,1) == A).int().argmax(dim=1)
Out[*]: tensor([0, 0, 0, 1, 0])
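If you need to guard against that case, a small sketch building on the same trick (the variable names are mine) masks out elements that never match:

import torch

A = torch.tensor([2., 3., 4., 5., 6., 7.])
C = torch.tensor([100., 1000., 1., 3., 9999.])

eq = C.view(-1, 1) == A        # (len(C), len(A)) comparison matrix
found = eq.any(dim=1)          # True where the element exists somewhere in A
ids = eq.int().argmax(dim=1)   # first matching index, 0 when absent
ids[~found] = -1               # flag missing elements explicitly
print(ids)                     # tensor([-1, -1, -1,  1, -1])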
I don't think there is such a built-in function in PyTorch to map a tensor.
It seems quite unreasonable to solve this by comparing each value from B to every value from A.
Here are two possible solutions to this problem.
Using a dictionary as a map
You can use a dictionary. It is not so much of a pure-PyTorch solution, but it will most probably be the fastest and safest way...
Just create a dict to map each element to an id, then use it to map B:
>>> map = {x.item(): i for i, x in enumerate(A)}
>>> torch.tensor([map[x.item()] for x in B])
tensor([1, 4, 4, 3, 2, 2, 2])
Change of basis approach
An alternative only using torch.Tensors. This will require the values you want to map - the content of A - to be integers because they will be used to index a tensor.
Encode the content of A into one-hot encodings:
>>> A_enc = torch.zeros((int(A.max())+1,)*2)
>>> A_enc[A, torch.arange(A.shape[0])] = 1
>>> A_enc
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.]])
We'll use A_enc as our basis to map integers:
>>> v = torch.argmax(A_enc, dim=1)
>>> v
tensor([0, 0, 0, 1, 2, 3, 4, 5])
Now, given an integer, for instance x=3, we can encode it into a one-hot encoding: x_enc = [0, 0, 0, 1, 0, 0, 0, 0]. Then, use v to map it: a simple dot product gives the mapping of x_enc. Here ⟨v, x_enc⟩ gives 1, which is the desired result (the first element of the mapped B). But instead of feeding one x_enc at a time, we can compute the matrix multiplication between v and the encoded B directly. First encode B, then compute v@B_enc:
>>> B_enc = torch.zeros(A_enc.shape[0], B.shape[0])
>>> B_enc[B, torch.arange(B.shape[0])] = 1
>>> B_enc
tensor([[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 1., 1.],
[0., 0., 0., 1., 0., 0., 0.],
[0., 1., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.]])
>>> v @ B_enc.long()
tensor([1, 4, 4, 3, 2, 2, 2])
Note: you will have to define your tensors with the Long type, since their values are used as indices.
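For completeness, here is a condensed, runnable version of this approach with the casts in place (a sketch assuming A and B hold non-negative integers, using the tensors from the question):

import torch

A = torch.tensor([2., 3., 4., 5., 6., 7.]).long()
B = torch.tensor([3., 6., 6., 5., 4., 4., 4.]).long()

n = int(A.max()) + 1
A_enc = torch.zeros(n, n)
A_enc[A, torch.arange(A.shape[0])] = 1   # one-hot encode the values of A
v = torch.argmax(A_enc, dim=1)           # v[value] = id

B_enc = torch.zeros(n, B.shape[0])
B_enc[B, torch.arange(B.shape[0])] = 1   # one-hot encode the values of B
print(v @ B_enc.long())                  # tensor([1, 4, 4, 3, 2, 2, 2])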
There is a similar question for numpy, so my answer is heavily inspired by its solution. I will compare some of the mentioned methods using perfplot, and I will also generalize the problem to applying a mapping to a tensor (yours is just a specific case).
For the analysis, I will assume the mapping contains all the unique elements in the tensor and that the number of elements to map is small and constant.
import torch


def apply(a: torch.Tensor, ids: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    mapping = {k.item(): v.item() for k, v in zip(a, ids)}
    return b.clone().apply_(lambda x: mapping.__getitem__(x))


def bucketize(a: torch.Tensor, ids: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    mapping = {k.item(): v.item() for k, v in zip(a, ids)}
    # From `https://stackoverflow.com/questions/13572448`.
    palette, key = zip(*mapping.items())
    key = torch.tensor(key)
    palette = torch.tensor(palette)
    index = torch.bucketize(b.ravel(), palette)
    remapped = key[index].reshape(b.shape)
    return remapped


def iterate(a: torch.Tensor, ids: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    mapping = {k.item(): v.item() for k, v in zip(a, ids)}
    return torch.tensor([mapping[x.item()] for x in b])


def argmax(a: torch.Tensor, ids: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (b.view(-1, 1) == a).int().argmax(dim=1)


if __name__ == "__main__":
    import perfplot

    a = torch.arange(2, 8)
    ids = torch.arange(0, 6)
    perfplot.show(
        setup=lambda n: torch.randint(2, 8, (n,)),
        kernels=[
            lambda x: apply(a, ids, x),
            lambda x: bucketize(a, ids, x),
            lambda x: iterate(a, ids, x),
            lambda x: argmax(a, ids, x),
        ],
        labels=["apply", "bucketize", "iterate", "argmax"],
        n_range=[2 ** k for k in range(25)],
        xlabel="len(a)",
    )
Running this yields the following plot:
Hence, depending on the number of elements in your tensor, you can pick either the argmax method (with the caveats mentioned and the restriction that you have to map the values from 0 to N), apply, or bucketize.
Now, if we increase the number of elements to be mapped to, let's say, tens of thousands, i.e. a = torch.arange(2, 10002) and ids = torch.arange(0, 10000), we get the following results:
This shows that the speed advantage of bucketize only becomes visible for larger arrays, but it still outperforms the other methods (the argmax method was killed, so I had to remove it).
Lastly, if the mapping does not contain all the keys present in the tensor, we can simply seed a dictionary with identity entries for all unique keys and then update it:
mapping = {x.item(): x.item() for x in torch.unique(a)}
mapping.update({k.item(): v.item() for k, v in zip(a, ids)})
Now, if the number of unique elements you want to map is orders of magnitude larger than the array, computing this may shift the value of n at which bucketize becomes faster than apply (since for apply you can change mapping.__getitem__(x) to mapping.get(x, x)).
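As a quick sanity check, the bucketize helper defined above reproduces the expected mapping on the tensors from the question (torch.bucketize requires a sorted palette, which holds here because A is ascending):

A = torch.tensor([2., 3., 4., 5., 6., 7.])
ids = torch.arange(6)
B = torch.tensor([3., 6., 6., 5., 4., 4., 4.])
print(bucketize(A, ids, B))  # tensor([1, 4, 4, 3, 2, 2, 2])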
I guess there is an easier way: create an array as a mapper, cast your tensor back into an np.ndarray first, and then index into the mapper.
import numpy as np

a_array = A.numpy().astype(int)
b_array = B.numpy().astype(int)

# the mapper must be long enough to be indexed by the largest value in A
mapper = np.zeros(a_array.max() + 1)
for i, x in enumerate(a_array):
    mapper[x] = i
out = torch.Tensor(mapper[b_array])
Consider the array t below. When using the min_frequency kwarg in the OneHotEncoder class, I cannot understand why the category snake is still present when transforming a new array. There are 2/40 events of this label. Should the shape of e be (4, 3) instead?
sklearn.__version__ == '1.1.1'
t = np.array([['dog'] * 8 + ['cat'] * 20 + ['rabbit'] * 10 +
['snake'] * 2], dtype=object).T
enc = OneHotEncoder(min_frequency= 4/40,
sparse=False).fit(t)
print(enc.infrequent_categories_)
# [array(['snake'], dtype=object)]
e = enc.transform(np.array([['dog'], ['cat'], ['dog'], ['snake']]))
array([[0., 1., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.]]) # snake is present?
Check out enc.get_feature_names_out():
array(['x0_cat', 'x0_dog', 'x0_rabbit', 'x0_infrequent_sklearn'],
dtype=object)
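To see this concretely, here is a small sketch (assuming sklearn 1.1 as pinned in the question; 'lizard' is a hypothetical unseen label) sending both a rare and an unseen category through the encoder:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

t = np.array([['dog'] * 8 + ['cat'] * 20 + ['rabbit'] * 10 +
              ['snake'] * 2], dtype=object).T
enc = OneHotEncoder(min_frequency=4/40, sparse=False,
                    handle_unknown='infrequent_if_exist').fit(t)

# 'snake' (infrequent) and 'lizard' (unseen) both land in x0_infrequent_sklearn
print(enc.transform(np.array([['snake'], ['lizard']])))
# [[0. 0. 0. 1.]
#  [0. 0. 0. 1.]]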
"snake" isn't considered its own category anymore, but lumped into the infrequent category. If you added some other rare categories, they'd be assigned to the same, and if you additionally set handle_unknown="infrequent_if_exist", you would also encode unseen categories to the same.
Suppose I have two tensors S and T defined as:
S = torch.rand((3,2,1))
T = torch.ones((3,2,1))
We can think of these as containing batches of tensors with shapes (2, 1). In this case, the batch size is 3.
I want to concatenate all possible pairings between batches. A single concatenation of batches produces a tensor of shape (4, 1). And there are 3*3 combinations so ultimately, the resulting tensor C must have a shape of (3, 3, 4, 1).
One solution is to do the following:
C = torch.zeros(S.shape[0], T.shape[0], 4, 1)
for i in range(S.shape[0]):
    for j in range(T.shape[0]):
        C[i, j, :, :] = torch.cat((S[i, :, :], T[j, :, :]))
But the for loop doesn't scale well to large batch sizes. Is there a PyTorch command to do this?
I don't know of any out-of-the-box command that does such an operation. However, you can pull it off in a straightforward way using a single matrix multiplication.
The trick is to construct a tensor containing all pairs of batch elements, starting from the already-stacked S,T tensor, and then multiplying it with a properly chosen mask tensor. In this method, keeping track of shapes and dimension sizes is essential.
The stack is given by (notice the reshape, we essentially flatten the batch elements from S and T into a single batch axis on ST):
>>> ST = torch.stack((S, T)).reshape(6, 2)
>>> ST
tensor([[0.7792, 0.0095],
[0.1893, 0.8159],
[0.0680, 0.7194],
[1.0000, 1.0000],
[1.0000, 1.0000],
[1.0000, 1.0000]])
# ST.shape = (6, 2)
You can retrieve all (S[i], T[j]) pairs using range and itertools.product:
>>> from itertools import product
>>> indices = torch.tensor(list(product(range(0, 3), range(3, 6))))
>>> indices
tensor([[0, 3],
[0, 4],
[0, 5],
[1, 3],
[1, 4],
[1, 5],
[2, 3],
[2, 4],
[2, 5]])
# indices.shape = (9, 2)
From there, we construct one-hot-encodings of the indices using torch.nn.functional.one_hot:
>>> from torch.nn.functional import one_hot
>>> mask = one_hot(indices).float()
>>> mask
tensor([[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]]])
# mask.shape = (9, 2, 6)
Finally, we compute the matrix multiplication and reshape it to the final form:
>>> (mask @ ST).reshape(3, 3, 4, 1)
tensor([[[[0.7792],
[0.0095],
[1.0000],
[1.0000]],
[[0.7792],
[0.0095],
[1.0000],
[1.0000]],
[[0.7792],
[0.0095],
[1.0000],
[1.0000]]],
[[[0.1893],
[0.8159],
[1.0000],
[1.0000]],
[[0.1893],
[0.8159],
[1.0000],
[1.0000]],
[[0.1893],
[0.8159],
[1.0000],
[1.0000]]],
[[[0.0680],
[0.7194],
[1.0000],
[1.0000]],
[[0.0680],
[0.7194],
[1.0000],
[1.0000]],
[[0.0680],
[0.7194],
[1.0000],
[1.0000]]]])
I initially went with torch.einsum: torch.einsum('bf,pib->pif', ST, mask). But I later realized that bf,pib->pif reduces nicely to a simple torch.Tensor.matmul operation if we switch the two operands, i.e. pib,bf->pif (subscript b is reduced in the middle).
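As a side note (not part of the masking approach above), the same (3, 3, 4, 1) result can be sketched with plain broadcasting, using expand and a single cat:

# S, T: (3, 2, 1) each; C[i, j] == torch.cat((S[i], T[j]))
C = torch.cat((S.unsqueeze(1).expand(-1, T.shape[0], -1, -1),
               T.unsqueeze(0).expand(S.shape[0], -1, -1, -1)), dim=2)
# C.shape == (3, 3, 4, 1)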
In NumPy, np.meshgrid is used for this:
https://stackoverflow.com/a/35608701/3259896
So in pytorch, it would be
torch.stack(
torch.meshgrid(x, y)
).T.reshape(-1,2)
Where x and y are your two lists. You can use any number of them: x, y, z, etc.
Then you reshape the result to match the number of lists you use: for three lists, use .reshape(-1,3); for four, use .reshape(-1,4), etc.
So for 5 tensors, use
torch.stack(
torch.meshgrid(a, b, c, d, e)
).T.reshape(-1,5)
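Worth noting: PyTorch also ships torch.cartesian_prod, which produces the same kind of pairs directly (row ordering matches itertools.product):

import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5])
print(torch.cartesian_prod(x, y))
# tensor([[1, 4],
#         [1, 5],
#         [2, 4],
#         [2, 5],
#         [3, 4],
#         [3, 5]])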
I am working on a project that involves the use of NDCG (normalized discounted cumulative gain), and I understand the method's underlying calculations.
So I imported ndcg_score from sklearn.metrics, and then pass in a ground truth array and another array to the ndcg_score function to calculate their NDCG score. The ground truth array has the values [5, 4, 3, 2, 1] while the other array has the values [5, 4, 3, 2, 0], so only the last element is different in these 2 arrays.
from numpy import array
from sklearn.metrics import ndcg_score

user_ndcg = ndcg_score(array([[5, 4, 3, 2, 1]]), array([[5, 4, 3, 2, 0]]))
I was expecting the result to be around 0.96233 (9.88507/10.27192). However, user_ndcg actually returned 1.0, which surprised me. Initially I thought this is due to rounding, but this is not the case because when I did an experiment on another set of array: ndcg_score(array([[5, 4, 3, 2, 1]]), array([[5, 4, 0, 2, 0]])), it correctly returned 0.98898.
Does anyone know whether this could be a bug with the sklearn ndcg_score function, or whether I was doing something wrong with my code?
I am assuming you are trying to predict six different classes for this problem (0, 1, 2, 3, 4 and 5). If you want to evaluate the ndcg for five different observations, you have to pass the function two arrays of shape (5, 6) each.
That is, you have to transform your ground truth and predictions into arrays of five rows and six columns each.
import numpy as np
from sklearn.metrics import ndcg_score

# Current form of ground truth and predictions
y_true = [5, 4, 3, 2, 1]
y_pred = [5, 4, 3, 2, 0]
# Transform ground truth to ndarray
y_true_nd = np.zeros(shape=(5, 6))
y_true_nd[np.arange(5), y_true] = 1
# Transform predictions to ndarray
y_pred_nd = np.zeros(shape=(5, 6))
y_pred_nd[np.arange(5), y_pred] = 1
# Calculate ndcg score
ndcg_score(y_true_nd, y_pred_nd)
> 0.8921866522394966
Here's what y_true_nd and y_pred_nd look like:
y_true_nd
array([[0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0., 0.]])
y_pred_nd
array([[0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0., 0.]])
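For what it's worth, the 1.0 from the original call is also explainable without reshaping: ndcg_score uses its second argument only to rank the items and takes the gains from the first, and [5, 4, 3, 2, 0] induces exactly the same ordering as [5, 4, 3, 2, 1], so the ranking is perfect:

import numpy as np
from sklearn.metrics import ndcg_score

# identical ordering of items -> perfect ranking -> NDCG of 1.0
print(ndcg_score(np.array([[5, 4, 3, 2, 1]]),
                 np.array([[5, 4, 3, 2, 0]])))  # 1.0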
I have a numpy array ids = np.array([1,1,1,1,2,2,2,3,4,4])
and another array of equal length vals = np.array([1,2,3,4,5,6,7,8,9,10])
Note: the ids array is sorted by ascending order
I would like to insert 4 zeros before the beginning of each new id - i.e.
new array = np.array([0,0,0,0,1,2,3,4,0,0,0,0,5,6,7,0,0,0,0,8,0,0,0,0,9,10])
The only way I am able to produce this is by iterating through the array, which is very slow. I am not quite sure how to do this using insert, pad, or expand_dims...
Since your ids increment and are continuous, this isn't so tough, though it gets a bit messy calculating the offsets.
n = 4
# positions where a new id starts (excluding position 0)
m = np.flatnonzero(np.append([False], ids[:-1] != ids[1:]))
# total output length: all values plus n zeros per id group
shape = vals.shape[0] + (m.shape[0] + 1) * n
out = np.zeros(shape)
# start offset of each group of values in the output
d = np.append([0], m) + np.full(m.shape[0] + 1, n).cumsum()
# end offset of each group
df = np.append(np.diff(d).cumsum(), [out.shape[0]])
u = tuple([slice(i, j) for i, j in zip(d, df)])
out[np.r_[u]] = vals
array([ 0., 0., 0., 0., 1., 2., 3., 4., 0., 0., 0., 0., 5.,
6., 7., 0., 0., 0., 0., 8., 0., 0., 0., 0., 9., 10.])
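A more compact alternative (a sketch that relies on the stated guarantee that ids is sorted) uses np.insert with repeated insertion points:

import numpy as np

ids = np.array([1, 1, 1, 1, 2, 2, 2, 3, 4, 4])
vals = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# index of the first element of every id group, including the first group
starts = np.flatnonzero(np.append([True], ids[:-1] != ids[1:]))
# insert four zeros before each group start
out = np.insert(vals, np.repeat(starts, 4), 0)
# array([ 0,  0,  0,  0,  1,  2,  3,  4,  0,  0,  0,  0,  5,  6,  7,
#         0,  0,  0,  0,  8,  0,  0,  0,  0,  9, 10])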
You can use np.zeros and append it to your existing array, like:
newid = np.append(np.zeros((4,), dtype=int), ids)
Good Luck!
The data is stored in shared variables. I want to get the prediction result in CSV format; below is the code.
It throws an error. How do I fix this? Thank you for your help!
TypeError: ('Bad input argument to theano function with name "4.py:305" at index 0 (0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
test_model = theano.function(
    inputs=[index],
    outputs=classifier.errors(y),
    givens={
        x: test_set_x[index * batch_size:(index + 1) * batch_size],
        y: test_set_y[index * batch_size:(index + 1) * batch_size]
    }
)

def make_submission_csv(predict, is_list=False):
    if is_list:
        df = pd.DataFrame({'Id': range(1, 101), 'Label': predict})
        df.to_csv("submit.csv", index=False)
        return
    pred = []
    for i in range(100):
        pred.append(test_model(test.values[i]))
    df = pd.DataFrame({'Id': range(1, 101), 'Label': pred})
    df.to_csv("submit.csv", index=False)

make_submission_csv(np.argmax(test_model(test_set_x), axis=1), is_list=True)
Here is more information about index:
index = T.iscalar()
x = T.matrix('x')
y = T.ivector('y')
When I enter:
test_set_x.get_value(borrow=True)
The console shows:
array([[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
When I enter:
test_model(test_set_x.get_value())
It throws an error:
TypeError: ('Bad input argument to theano function with name "4.py:311" at index 0(0-based)', 'TensorType(int32, scalar) cannot store a value of dtype float32 without risking loss of precision.
Your test_model function has a single input value,
inputs=[index],
Your pasted code doesn't show the creation of the variable index, but my guess is that it's a Theano symbolic scalar with an integer type. If so, you need to call the compiled function with a single integer input, for example:
test_model(1)
You are trying to call test_model(test_set_x), which doesn't work because test_set_x is (again, probably) a shared variable, not the integer index the function is expecting.
Note that the tutorial code does this:
test_losses = [test_model(i) for i in xrange(n_test_batches)]
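To actually produce the CSV, one option is to compile a second function that returns predictions instead of errors. This is only a sketch reusing the question's variables: it assumes the classifier exposes a y_pred variable, as in the deeplearning.net LogisticRegression tutorial (adjust to your model), and that n_test_batches is defined as in the tutorial:

import numpy as np

predict_model = theano.function(
    inputs=[index],
    outputs=classifier.y_pred,  # assumed attribute, as in the tutorial
    givens={x: test_set_x[index * batch_size:(index + 1) * batch_size]}
)

# call it once per integer batch index, then flatten for the CSV
pred = np.concatenate([predict_model(i) for i in range(n_test_batches)])
make_submission_csv(pred, is_list=True)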