I am trying to run the Stand-Alone-Self-Attention model.
Even with batch size = 1, it complains about CUDA out of memory because of out = key * query: https://github.com/leaderj1001/Stand-Alone-Self-Attention/blob/a983f0f643632b1f2b7b8b27693182f22e9e574c/attention.py#L48
The tensor of the key is [2,8,8,224,224,49]
The tensor of the query is [2,8,8,224,224,1]
The dim 1 is batch, dim 2 is self.groups, dim 3 is out_channels, dim 4 is height, dim 5 is width, and dim 6 = -1.
A very naive thought is to split these two matrices into several blocks, like the split3D used in self-attention implementations in Keras, but I am not sure how that would work in 6 dimensions.
Thanks!
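As an illustration of that splitting idea, here is a minimal sketch (not taken from the repository). It chunks the product along the height dimension so only a slice of the [2, 8, 8, 224, 224, 49] tensor is materialised at a time, and it assumes the rest of the forward pass follows the usual pattern in attention.py: a softmax over the last dimension followed by a weighted sum with a value tensor that has the same layout as the key.

import torch

def chunked_attention(q_out, k_out, v_out, chunk_size=32):
    # q_out: [B, G, C, H, W, 1]; k_out, v_out: [B, G, C, H, W, K*K]
    # Process the height dimension (dim 3) in chunks so the full
    # [B, G, C, H, W, K*K] product never exists in memory at once.
    outputs = []
    for h0 in range(0, q_out.size(3), chunk_size):
        h1 = h0 + chunk_size
        # elementwise product on a height slice only (broadcast over the last dim)
        scores = q_out[:, :, :, h0:h1] * k_out[:, :, :, h0:h1]
        weights = torch.softmax(scores, dim=-1)
        outputs.append((weights * v_out[:, :, :, h0:h1]).sum(dim=-1))
    return torch.cat(outputs, dim=3)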
I am trying to create the data for my CNN network. The desired input shape for my CNN is 36 x 36 x 2, which means I have two different 2D matrices, each of size 36 x 36.
Using these two matrices, I want to get an output of shape 36 x 36 x 2.
I have tried the code below.
arr1 = np.random.rand(36,36)
arr2 = np.random.rand(36,36)
res = np.stack((arr1, arr2), axis=2)
The output should look like the matrices in the image:
I want my input shaped as described in the picture: the first matrix should be arr1, the second matrix should be arr2, and the two matrices should be placed one after the other.
However, I am quite confused by the result I got from res. It shows the shape (36, 36, 2), but when I print res I am not able to see my first and second matrices properly; elements from my first matrix arr1 appear mixed in with the other matrix.
I am not sure whether this process gives me the correct output or whether I am doing something wrong.
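For reference, a small check (not part of the original post) confirms that np.stack(..., axis=2) keeps both matrices intact; they only look interleaved when printed because the last axis is the innermost one:

import numpy as np

arr1 = np.random.rand(36, 36)
arr2 = np.random.rand(36, 36)
res = np.stack((arr1, arr2), axis=2)   # shape (36, 36, 2)

# Slicing the channel axis recovers each original matrix unchanged.
print(np.array_equal(res[:, :, 0], arr1))  # True
print(np.array_equal(res[:, :, 1], arr2))  # True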
I'm working on a classification problem (100 classes) and my dataset has a huge class imbalance. To tackle this, I'm considering using torch's WeightedRandomSampler to oversample the minority classes. I took help from this post, which seemed pretty straightforward. My only concern is the nature of my dataset.
In my case, each sample (one point in a batch) contains 8 points, and each of these 8 points has one true class out of the 100 classes. So my output shape is (bs x 8), and hence the final weight variable has length total_dataset_length*8.
Here's my implementation:
import numpy as np
from sklearn.utils import class_weight
from torch.utils.data import WeightedRandomSampler

y_org = np.load('target.npy')  # 5000 x 8
samples_per_class = np.unique(y_org.ravel(), return_counts=True)[1]
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(y_org.ravel()),
                                                  y=y_org.ravel())
weights = class_weights[y_org.ravel()]
sampler = WeightedRandomSampler(weights, len(y_org.ravel()), replacement=True)
To count the number of occurrences of each class index, I have to unroll (ravel) the ground-truth array along the first dimension. Since the final weight variable has length total_dataset_length*8, it causes indexing errors during loading:
IndexError: list index out of range
How can I use WeightedRandomSampler in such cases?
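One possible workaround, sketched under the assumption that the sampler should draw whole samples (rows of the dataset) rather than individual points, is to collapse the 8 per-point weights into a single weight per sample, e.g. by averaging, so the weight vector has length len(dataset):

import numpy as np
import torch
from sklearn.utils import class_weight
from torch.utils.data import WeightedRandomSampler

y_org = np.load('target.npy')                        # shape (5000, 8)
classes = np.unique(y_org.ravel())                   # assumes labels are 0..99
class_weights = class_weight.compute_class_weight(
    class_weight='balanced', classes=classes, y=y_org.ravel())

per_point_weights = class_weights[y_org]             # shape (5000, 8)
per_sample_weights = per_point_weights.mean(axis=1)  # one weight per sample

sampler = WeightedRandomSampler(
    torch.as_tensor(per_sample_weights, dtype=torch.double),
    num_samples=len(y_org), replacement=True)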
I want to get surprisal values from logit outputs from PyTorch, using log base 2.
One way to do this, given a logits tensor, is:
probs = nn.functional.softmax(logits, dim = 2)
surprisals = -torch.log2(probs)
However, PyTorch provides a function that combines log and softmax, which is faster than the above:
surprisals = -nn.functional.log_softmax(logits, dim = 2)
But this seems to return values in base e, which I don't want. Is there a function like log_softmax, but which uses base 2? I have tried log2_softmax and log_softmax2, neither of which seems to work, and haven't had any luck finding documentation online.
How about just using the fact that the logarithm base can be changed with the identity log2(x) = ln(x) / ln(2)? The natural-log part, ln(softmax(x)), is exactly what F.log_softmax() is giving you. All you need to do is
surprisals = - (1 / torch.log(torch.tensor(2.))) * nn.functional.log_softmax(logits, dim = 2)
It's just a scalar multiplication, so it hardly has any performance penalty.
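A quick sanity check (a small sketch, not from the original answer) that the rescaled log_softmax matches the explicit base-2 computation:

import math
import torch
import torch.nn.functional as F

logits = torch.randn(2, 3, 5)

explicit = -torch.log2(F.softmax(logits, dim=2))
rescaled = -F.log_softmax(logits, dim=2) / math.log(2)

print(torch.allclose(explicit, rescaled, atol=1e-6))  # True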
Is there a time-efficient way to do so? Because the dimension is huge (4096), this snippet is taking a lot of time.
for j in multiple_img_features:
    img_feature = j
    t = np.zeros((1, 4096))
    for i in range(len(vectors)):
        t = t + np.dot(vectors[i], img_feature)
I need to compute the sum over all the alphas of dot(alpha_i, phi), where phi is the image feature vector and the alphas are the eigenvectors. The dimension of the alphas is (100, 4096), and for a single image phi has shape (4096,).
In the code, multiple_img_features corresponds to a set of image features of dimension (10, 4096); each row is the 4096-dimensional feature vector of a single image. vectors are the alphas, of shape (100, 4096).
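Assuming the goal is indeed the sum of the 100 dot products for each image, both Python loops can be replaced by a single matrix product (a sketch with random placeholder data):

import numpy as np

vectors = np.random.rand(100, 4096)                # the alphas (eigenvectors)
multiple_img_features = np.random.rand(10, 4096)   # one row per image

# dots[i, j] == np.dot(vectors[i], multiple_img_features[j])
dots = vectors @ multiple_img_features.T           # shape (100, 10)

# Sum of the 100 dot products for each image, replacing both loops.
per_image_sums = dots.sum(axis=0)                  # shape (10,)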
In the CNN literature, it is often illustrated that the kernel size is the same as the size of the longest word in one's vocabulary list as it sweeps across a sentence.
So if we use embeddings to represent the text, shouldn't the kernel size be the same as the embedding dimension, so that it gives the same effect as sweeping word by word?
Yet I see different kernel sizes used, regardless of the word length.
Well... these are 1D convolutions, for which the kernels are 3 dimensional.
It's true that one of these 3 dimensions must match the embedding size (otherwise it would be pointless to have this size)
These three dimensions are:
(length_or_size, input_channels, output_channels)
Where:
length_or_size (kernel_size): anything you want. In the picture, there are 6 different filters with sizes 4, 4, 3, 3, 2, 2, represented by the "vertical" dimension.
input_channels (automatically the embedding_size): the size of the embedding - this is somewhat mandatory (in Keras this is automatic and almost invisible), otherwise the multiplications wouldn't use the entire embedding, which is pointless. In the picture, the "horizontal" dimension of the filters is constantly 5 (the same as the embedding size - this is not a spatial dimension).
output_channels (filters): anything you want, but it seems the picture is talking about 1 channel only per filter, since it's totally ignored, and if represented would be something like "depth".
So, you're probably confusing which dimensions are which. When you define a conv layer, you do:
Conv1D(filters = output_channels, kernel_size=length_or_size)
While the input_channels come from the embedding (or the previous layer) automatically.
Creating this model in Keras
To create this model, it would be something like:
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Dense
from keras.models import Model

sentence_length = 7
embedding_size = 5
inputs = Input((sentence_length,))
out = Embedding(total_words_in_dic, embedding_size)(inputs)
Now, supposing these filters have 1 channel only (since the image doesn't seem to consider their depth...), we can join them in pairs of 2 channels:
size1 = 4
size2 = 3
size3 = 2
output_channels = 2
activation_function = 'relu'  # placeholder; any activation could be used here

out1 = Conv1D(output_channels, size1, activation=activation_function)(out)
out2 = Conv1D(output_channels, size2, activation=activation_function)(out)
out3 = Conv1D(output_channels, size3, activation=activation_function)(out)
Now, let's collapse the spatial dimensions and remain with the two channels:
out1 = GlobalMaxPooling1D()(out1)
out2 = GlobalMaxPooling1D()(out2)
out3 = GlobalMaxPooling1D()(out3)
And create the 6 channel output:
out = Concatenate()([out1,out2,out3])
Now there is a mystery jump from 6 channels to 2 channels which cannot be explained by the picture. Perhaps they're applying a Dense layer or something similar:
# ???? - not explained by the picture
out = Dense(2, activation='softmax')(out)
model = Model(inputs, out)
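To verify the shapes end to end (a sketch; assumes the snippets above are run in order with a concrete vocabulary size for total_words_in_dic):

model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
# Each GlobalMaxPooling1D output is (None, 2), the Concatenate output is
# (None, 6), and the final Dense gives (None, 2).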