Suppose I have the following embeddings emb_user = torch.randn(64, 128, 256). From the second dimension (of length 128), I wish to pick out 16 at random for each instance. I was wondering if there is a more efficient way of doing the following:
idx = torch.multinomial(torch.ones(64, 128), 16)
sampled_emb_user = emb_user[torch.arange(len(emb_user)).unsqueeze(-1), idx]
What I also find curious is that the above multinomial does not work if the weight tensor (torch.ones(64, 128)) has more than 2 dimensions.
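As an aside, torch.multinomial only accepts 1-D or 2-D weight tensors, so a minimal workaround sketch for higher-dimensional weights is to flatten the leading dimensions into a single batch dimension and reshape the result back (the shapes below are hypothetical):

weights = torch.ones(4, 64, 128)                       # hypothetical 3-D weight tensor
idx = torch.multinomial(weights.reshape(-1, 128), 16)  # sample on the flattened batch
idx = idx.reshape(4, 64, 16)                           # restore the leading dimensions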
Since in your case you want a uniform distribution, you could speed it up with:
idx = torch.sort(torch.randint(
    0, 128 - 15, (64, 16), device=device
), axis=1).values + torch.arange(0, 16, device=device).reshape(1, -1)
sampled_emb_user = emb_user[torch.arange(len(emb_user)).unsqueeze(-1), idx]
Instead of
idx = torch.multinomial(torch.ones(64, 128, device=device), 16)
sampled_emb_user = emb_user[torch.arange(len(emb_user)).unsqueeze(-1), idx]
The runtimes on my machine are 427 µs and 784 µs with device='cpu', and 135 µs, 260 µs, and 469 µs with device='cuda'.
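A minimal timing sketch for reproducing such numbers (assumed setup, not the exact benchmark used above); torch.utils.benchmark.Timer takes care of CUDA synchronization:

import torch
import torch.utils.benchmark as benchmark

device = 'cpu'  # or 'cuda'
emb_user = torch.randn(64, 128, 256, device=device)

def sample_sorted_randint():
    # sorted randint + arange trick
    idx = torch.sort(torch.randint(0, 128 - 15, (64, 16), device=device), dim=1).values \
          + torch.arange(0, 16, device=device).reshape(1, -1)
    return emb_user[torch.arange(len(emb_user), device=device).unsqueeze(-1), idx]

def sample_multinomial():
    # uniform multinomial without replacement
    idx = torch.multinomial(torch.ones(64, 128, device=device), 16)
    return emb_user[torch.arange(len(emb_user), device=device).unsqueeze(-1), idx]

for fn in (sample_sorted_randint, sample_multinomial):
    timer = benchmark.Timer(stmt='fn()', globals={'fn': fn})
    print(fn.__name__, timer.timeit(1000))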
How does it work?
The sorted randint gives indices for a multinomial distribution with replacement, in non-decreasing order. Adding the arange term makes the sequence strictly increasing, which eliminates the repetitions.
Illustrating with a small case
idx = torch.sort(torch.randint(0, 7, (4,))).values
print('Indices with replacement in the range from 0 to 6: ', idx)
print('Indices without replacement in the slice: ', idx + torch.arange(4))
Indices with replacement in the range from 0 to 6: tensor([0, 5, 5, 6])
Indices without replacement in the slice: tensor([0, 6, 7, 9])
A possibly faster solution, but not from exactly the same distribution, is the following:
idx = torch.cumsum(torch.diff(
    torch.sort(torch.randint(
        0, 128 - 16, (64, 17), device=device
    ), axis=1).values, axis=1
) + 1, axis=1) - 1
sampled_emb_user = emb_user[torch.arange(len(emb_user)).unsqueeze(-1), idx]
One more way, which I expect to be closer to the exact method, though not very rigorously analyzed:
# 1 - rand() to include 1 and exclude zero.
d = torch.cumsum(1 - torch.rand(64, 17, device=device), axis=1)
# This produces a sorted tensor with values in the range [0, 128 - 16].
d = (((128 - 15) * d[:, :-1]) / d[:, -1:]).to(torch.long)
idx = d + torch.arange(0, 16, device=device).reshape(1, -1)
But in the end it tends to be slower than the method using sort.
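Whichever variant is used, a quick sanity check (a minimal sketch) is that every row of idx is strictly increasing, i.e. the 16 indices are distinct and stay in the valid range:

# Sanity check: no replacement (strictly increasing per row) and indices in [0, 127].
assert (torch.diff(idx, dim=1) > 0).all()
assert idx.min() >= 0 and idx.max() < 128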
I have two large arrays, one containing values, and one being a mask basically. The code below shows the function I want to implement.
from scipy.signal import convolve2d
import numpy as np
sample = np.array([[6, 4, 5, 5, 5],
[7, 1, 0, 8, 3],
[2, 5, 4, 8, 4],
[2, 0, 2, 6, 0],
[5, 7, 2, 3, 2]])
mask = np.array([[1, 0, 1, 1, 0],
[0, 0, 1, 0, 1],
[0, 1, 0, 0, 0],
[0, 0, 0, 1, 0],
[1, 1, 0, 0, 1]])
neighbors_sum = convolve2d(sample, np.ones((3,3), dtype=int), mode='same', boundary='wrap')
# neighbors_sum = np.array([[40, 37, 35, 33, 44],
# [37, 34, 40, 42, 48],
# [24, 23, 34, 35, 40],
# [27, 29, 37, 31, 32],
# [31, 33, 34, 30, 34]])
result = np.where(mask, neighbors_sum, 0)
print(result)
This code works, and gets me what I expect:
np.array([[40, 0, 35, 33, 0],
[ 0, 0, 40, 0, 48],
[ 0, 23, 0, 0, 0],
[ 0, 0, 0, 31, 0],
[31, 33, 0, 0, 34]])
So far, so good. However, I'm encountering a major issue when I increase the size of the arrays. In my case, instead of a 5x5 input and a 3x3 summing mask, I need a 50,000x20,000 input and a 100x100 summing mask. When I move to that, the convolve2d function gets into all kinds of trouble and the calculation takes extremely long.
Given that I only care about the masked result, and thus only care about the summation from convolve2d at those points, can anyone think of a smart approach to take here? Going to a for loop and selecting only the points of interest would lose the speed advantage of the vectorization so I'm not convinced this would be worth it.
Any suggestion welcome!
convolve2d is very inefficient in this case. Since the summing kernel is all ones, you can split the filter into two trivial ones thanks to separable filtering: one np.ones((100, 1)) filter and one np.ones((1, 100)) filter. Moreover, a rolling sum can be used to speed up the computation even more.
Here is a simple solution without a rolling sum:
# Simple faster implementation
tmp = convolve2d(sample, np.ones((1,100), dtype=int), mode='same', boundary='wrap')
neighbors_sum = convolve2d(tmp, np.ones((100,1), dtype=int), mode='same', boundary='wrap')
result = np.where(mask, neighbors_sum, 0)
You can compute the rolling sum efficiently using Numba. The strategy is to split the computation into 3 parts: the horizontal rolling sum, the vertical rolling sum, and the final masking. Each step can be fully parallelized using multiple threads (although parallelizing the vertical rolling sum is harder with Numba). Each part needs to work line by line so as to be cache friendly.
# Complex very-fast implementation
import numba as nb

# Numerical results may diverge if the input contains big
# values mixed with many small ones.
# Does not support inputs containing NaN values or +/- Inf ones.
@nb.njit('float64[:,::1](float64[:,::1], int_)', parallel=True, fastmath=True)
def horizontalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Horizontal rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    for i in nb.prange(n):
        s = 0.0
        lShift = fs//2
        rShift = (fs-1)//2
        # Initial window sum for column 0 (wrapping around the borders)
        for j in range(m-lShift, m):
            s += sample[i, j]
        for j in range(0, rShift+1):
            s += sample[i, j]
        tmp[i, 0] = s
        # Slide the window: add the entering value, subtract the leaving one
        for j in range(1, m):
            jLeft, jRight = (j-1-lShift)%m, (j+rShift)%m
            s += sample[i, jRight] - sample[i, jLeft]
            tmp[i, j] = s

    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_)', fastmath=True)
def verticalRollingSum(sample, filterSize):
    n, m = sample.shape
    fs = filterSize

    # Make the wrapping part of the rolling sum much simpler
    assert fs >= 1
    assert n >= fs and m >= fs

    # Vertical rolling sum.
    tmp = np.empty((n, m), dtype=np.float64)
    tShift = fs//2
    bShift = (fs-1)//2
    # Initial window sum for row 0 (wrapping around the borders)
    for j in range(m):
        tmp[0, j] = 0.0
    for i in range(n-tShift, n):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    for i in range(0, bShift+1):
        for j in range(m):
            tmp[0, j] += sample[i, j]
    # Slide the window row by row
    for i in range(1, n):
        iTop = (i-1-tShift)%n
        iBot = (i+bShift)%n
        for j in range(m):
            tmp[i, j] = tmp[i-1, j] + (sample[iBot, j] - sample[iTop, j])

    return tmp

@nb.njit('float64[:,::1](float64[:,::1], int_[:,::1], int_)', parallel=True, fastmath=True)
def compute(sample, mask, filterSize):
    n, m = sample.shape
    tmp = horizontalRollingSum(sample, filterSize)
    neighbors_sum = verticalRollingSum(tmp, filterSize)
    res = np.empty((n, m), dtype=np.float64)

    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = neighbors_sum[i, j] * mask[i, j]

    return res
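A hypothetical call for the question's case, a 100x100 summing window with wrap-around borders (assuming sample is a C-contiguous float64 array and mask an integer array matching the int_ signature, e.g. mask.astype(int) as in the benchmark below):

result = compute(sample, mask, 100)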
Benchmark & Notes
Here is the testing code:
n, m = 5000, 2000
sample = np.random.rand(n, m)
mask = (np.random.rand(n, m) < 0.05).astype(int)
Here are the results on my 6-core machine:
Initial solution: 174366 ms (x1)
With separate filters: 5710 ms (x31)
Final Numba solution: 40 ms (x4359)
Optimal theoretical time: 10 ms (optimistic)
Thus, the Numba implementation is 4359 times faster than the initial one.
That being said, be careful of possible numerical issues that this last implementation can have regarding the input array (see the comments in the code). It should be fine as long as np.std(sample) is relatively small and np.all(np.isfinite(sample)) is true.
Note that the code can be further optimized: the vertical rolling sum can be parallelized; modulus operations can be avoided in the horizontal rolling sum; the vertical rolling sum and the masking steps can be merged together (i.e. by computing res on the fly and not storing tmp); tiling can be used to compute all the steps simultaneously in a more cache-friendly way. However, these optimizations make the code more complex and some of them are very hard to perform (especially the last one with Numba).
Note that using a boolean mask (instead of an integer-based one) should make the algorithm faster since it takes less memory and processors can fetch values faster.
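A hedged sketch of what that boolean-mask variant of compute could look like (assuming the mask is passed as mask.astype(np.bool_), which Numba exposes as boolean in the signature string):

@nb.njit('float64[:,::1](float64[:,::1], boolean[:,::1], int_)', parallel=True, fastmath=True)
def compute_bool(sample, mask, filterSize):
    n, m = sample.shape
    tmp = horizontalRollingSum(sample, filterSize)
    neighbors_sum = verticalRollingSum(tmp, filterSize)
    res = np.empty((n, m), dtype=np.float64)

    for i in nb.prange(n):
        for j in range(m):
            # Branch on the 1-byte boolean instead of multiplying by an 8-byte integer
            res[i, j] = neighbors_sum[i, j] if mask[i, j] else 0.0

    return res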
I want to project a tensor of shape [197, 1, 768] to [197, 1, 128] in PyTorch using nn.Conv().
You could achieve this using a wide flat kernel combined with a specific stride. If you stick with a dilation of 1, then the input/output spatial dimension relation is given by:
out = [(2p + x - k)/s + 1]
Where p is the padding, k is the kernel size, and s is the stride. [·] denotes the integer (floor) part of the quantity.
Applied here you have:
128 = [(2p + 768 - k)/s + 1]
Solving for the kernel size (up to the rounding introduced by the floor), you get:
k = 2*p + 768 - (128 - 1)*s
If you impose p = 0 and s = 6, you find k = 6.
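Sanity check of the arithmetic: k = 2*0 + 768 - 127*6 = 768 - 762 = 6, and plugging back in, [(2*0 + 768 - 6)/6 + 1] = [128] = 128.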
>>> project = nn.Conv2d(197, 197, kernel_size=(1, 6), stride=6)
>>> project(torch.rand(1, 197, 1, 768)).shape
torch.Size([1, 197, 1, 128])
Alternatively, a more straightforward - but different - approach is to learn a mapping using a fully connected layer:
>>> project = nn.Linear(768, 128)
>>> project(torch.rand(1, 197, 1, 768)).shape
torch.Size([1, 197, 1, 128])
You could use a kernel size and stride of 6, as that’s the factor between the input and output temporal size:
x = torch.randn(197, 1, 768)
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=6, stride=6)
out = conv(x)
print(out.shape)
> torch.Size([197, 1, 128])
Suppose I have a 4D numpy array A with indexes i, j, k, l for the four dimensions, say 50 x 40 x 30 x 20. Also suppose I have some other list B.
How can I set all cells in A that satisfy some condition to 0? Is there a way to do it efficiently without loops (with vectorization?).
Example condition: all cells whose third-dimension index k satisfies B[k] == x.
For instance,
if we have the 2D matrix A = [[1,2],[3,4]] and B = [7,8]
Then, taking the index along the first dimension of A (i.e. the row index, call it i), I want to zero out all cells whose row index satisfies B[i] == 7. In this case, A will be converted to
A = [[0,0],[3,4]].
You can specify boolean arrays for specific axes:
import numpy as np
i, j, k, l = 50, 40, 30, 20
a = np.random.random((i, j, k, l))
b_k = np.random.random(k)
b_j = np.random.random(j)
# i, j, k, l
a[:, :, b_k < 0.5, :] = 0
# You can also combine conditions along different axes. Note that passing several
# boolean index arrays directly pairs their True positions element-wise, so use
# np.ix_ to zero out the full cross product of the selections:
# i, j, k, l
a[np.ix_(np.arange(i), b_j > 0.5, b_k < 0.5, np.arange(l))] = 0
# Or work with the index explicitly
condition_k = np.arange(k) % 3 == 0 # Is the index divisible by 3?
# i, j, k, l
a[:, :, condition_k, :] = 0
To work with the example you have given
a = np.array([[1, 2],
[3, 4]])
b = np.array([7, 8])
# i, j
a[b == 7, :] = 0
# array([[0, 0],
# [3, 4]])
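Applied to the original 4-D case (a small sketch, assuming B is a length-30 sequence aligned with the third axis and x is the value to match):

B = np.asarray(B)  # in case B is a plain Python list
# i, j, k, l
A[:, :, B == x, :] = 0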
Does the following help?
A = np.arange(16,dtype='float64').reshape(2,2,2,2)
A[A == 2] = 3.14
I'm replacing the entry equal to 2 with 3.14. You can set it to some other value.
I have made n-grams / doc-ids for document classification,
def create_dataset(tok_docs, vocab, n):
    n_grams = []
    document_ids = []
    for i, doc in enumerate(tok_docs):
        for n_gram in [doc[0][i:i+n] for i in range(len(doc[0]) - 1)]:
            n_grams.append(n_gram)
            document_ids.append(i)
    return n_grams, document_ids

def create_pytorch_datasets(n_grams, doc_ids):
    n_grams_tensor = torch.tensor(n_grams)
    doc_ids_tensor = torch.tensor(doc_ids)
    full_dataset = TensorDataset(n_grams_tensor, doc_ids_tensor)
    return full_dataset
create_dataset returns pair of (n-grams, document_ids) like below:
n_grams, doc_ids = create_dataset( ... )
train_data = create_pytorch_datasets(n_grams, doc_ids)
>>> train_data[0:100]
(tensor([[2076, 517, 54, 3647, 1182, 7086],
         [517, 54, 3647, 1182, 7086, 1149],
         ...
        ]),
 tensor([0, 0, 0, 0, 0, ..., 3, 3, 3]))
train_loader = DataLoader(train_data, batch_size = batch_size, shuffle = True)
The first tensor contains the n-grams and the second one the doc_ids.
But as you know, the amount of training data per label changes with the length of the documents.
If one document is very long, there will be very many pairs carrying its label in the training data.
I think this can cause overfitting, because the classification model will tend to classify inputs to the long documents.
So I want to draw input batches from a uniform distribution over the labels (doc_ids). How can I fix this in the code above?
P.S. If there is train_data like below, I want to extract batches with probabilities like this:
n-grams doc_ids
([1, 2, 3, 4], 1) ====> 0.33
([1, 3, 5, 7], 2) ====> 0.33
([2, 3, 4, 5], 3) ====> 0.33 * 0.25
([3, 5, 2, 5], 3) ====> 0.33 * 0.25
([6, 3, 4, 5], 3) ====> 0.33 * 0.25
([2, 3, 1, 5], 3) ====> 0.33 * 0.25
In PyTorch you can pass a sampler or a batch_sampler to the DataLoader to change how datapoints are sampled.
docs on the dataloader:
https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler
documentation on the sampler: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler
For instance, you can use the WeightedRandomSampler to specify a weight to every datapoint. The weighting can be the inverse length of the document for instance.
I would make the following modifications in the code:
def create_dataset(tok_docs, vocab, n):
    n_grams = []
    document_ids = []
    weights = []  # << list of weights for sampling
    for i, doc in enumerate(tok_docs):
        for n_gram in [doc[0][i:i+n] for i in range(len(doc[0]) - 1)]:
            n_grams.append(n_gram)
            document_ids.append(i)
            weights.append(1 / len(doc[0]))  # << n-grams of long documents are sampled less often
    return n_grams, document_ids, weights

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)  # << create the sampler (one epoch's worth of samples)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=False, sampler=sampler)  # << include the sampler in the dataloader
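As a rough check that these weights give the distribution sketched in the question's P.S. (a sketch using the lists returned by the modified create_dataset above): after normalization, each document's n-grams together receive roughly equal total probability, and the n-grams within one document are equally likely.

import numpy as np

n_grams, document_ids, weights = create_dataset(tok_docs, vocab, n)
w = np.asarray(weights, dtype=float)
ids = np.asarray(document_ids)
p = w / w.sum()  # sampling probability of each (n-gram, doc_id) pair
for d in np.unique(ids):
    # total probability mass per document: roughly 1 / number_of_documents
    print(d, round(p[ids == d].sum(), 3))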
I am trying to implement a function that takes each column (axis 0) of a numpy 2d array and returns a scalar result of a certain calculation. My current code looks like the following:
img = np.array([
[0, 5, 70, 0, 0, 0 ],
[10, 50, 4, 4, 2, 0 ],
[50, 10, 1, 42, 40, 1 ],
[10, 0, 0, 6, 85, 64],
[0, 0, 0, 1, 2, 90]]
)
def get_y(stride):
    stride_vals = stride[stride > 0]
    pix_thresh = stride_vals.max() - 1.5*stride_vals.std()
    return np.argwhere(stride > pix_thresh).mean()
np.apply_along_axis(get_y, 0, img)
>> array([ 2. , 1. , 0. , 2. , 2.5, 3.5])
It works as expected; however, performance isn't great, as in the real dataset there are ~2k rows and ~20-50 columns for each frame, arriving 60 times a second.
Is there a way to speed-up the process, perhaps by not using np.apply_along_axis function?
Here's one vectorized approach that sets the zeros to NaN, which lets us use np.nanmax and np.nanstd to compute the max and std values while avoiding the zeros, like so -
imgn = np.where(img==0, np.nan, img)
mx = np.nanmax(imgn,0) # np.max(img,0) if all are positive numbers
st = np.nanstd(imgn,0)
mask = img > mx - 1.5*st
out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
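The last line works because np.arange(mask.shape[0]).dot(mask) sums, for each column, the row indices where the mask is True, and mask.sum(0) counts them, so their ratio is the mean index per column, which is exactly what get_y returns. A quick equivalence check on the original, non-negative img (a sketch; note that get_y also drops negative values, whereas the NaN trick only drops exact zeros):

assert np.allclose(out, np.apply_along_axis(get_y, 0, img))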
Runtime test -
In [94]: img = np.random.randint(-100,100,(2000,50))
In [95]: %timeit np.apply_along_axis(get_y, 0, img)
100 loops, best of 3: 4.36 ms per loop
In [96]: %%timeit
...: imgn = np.where(img==0, np.nan, img)
...: mx = np.nanmax(imgn,0)
...: st = np.nanstd(imgn,0)
...: mask = img > mx - 1.5*st
...: out = np.arange(mask.shape[0]).dot(mask)/mask.sum(0)
1000 loops, best of 3: 1.33 ms per loop
Thus, we are seeing a 3x+ speedup.