Keras Not so Dense Layer

The previous layer is an embedding of size (V classes, K output dim). I want to introduce a trainable weights matrix of size K x T (the embeddings will stay trainable too). Together they generate a V x T matrix that will be used downstream.
1) How might I go about this?
2) Will this mess with the gradients?
It's basically vector x matrix.
Example: embedding vocab V = 10, dim K = 4, so for a particular member of the vocabulary, my embedding weights are a vector of size (1, 4) (think row vector).
Each row vector I want to multiply by a weight matrix of size 4 x 10, yielding a 1 x 10 vector (or layer). The weight matrix is common to all members of the vocabulary.
This 1 x 10 vector will be the input for the next layer.

What you want is a Dense layer, just without a bias. A Dense layer internally has a matrix that is common to all inputs; it does not vary with the input.
So this can be implemented as:
x = Dense(10, use_bias=False)(some_input_tensor)
No activation function is needed since you just want the matrix multiplication.
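A minimal sketch of the whole setup, using the dimensions from the example (the functional-API wiring and the variable names are mine):
import tensorflow as tf
from tensorflow.keras import layers, models

V, K, T = 10, 4, 10  # vocab size, embedding dim, output dim (from the example)
inp = layers.Input(shape=(1,))
emb = layers.Embedding(input_dim=V, output_dim=K)(inp)  # trainable (batch, 1, K) embeddings
out = layers.Dense(T, use_bias=False)(emb)              # multiplies by one trainable K x T matrix
model = models.Model(inp, out)
As for question 2: this is an ordinary differentiable matrix multiplication, so gradients flow through both the Dense kernel and the embeddings without any special handling.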

Related

Custom loss for single-label, multi-class problem

I have a single-label, multi-class classification problem, i.e., a given sample is in exactly one class (say, class 3), but for training purposes, predicting class 2 or 5 is still okay and should not penalise the model that heavily.
For example, the ground truth for 1 sample is [0,1,1,0,1] over 5 classes, instead of a one-hot vector. This implies that the model predicting any one (not necessarily all) of the above classes (2, 3 or 5) is fine.
For every batch, the predicted output dimension is of the shape bs x n x nc, where bs is the batch size, n is the number of samples per point and nc is the number of classes. The ground truth is also of the same shape as the predicted tensor.
For every batch, I'm expecting my loss function to compare n tensors across nc classes and then average it across n.
Eg: when dimensions are 32 x 8 x 5000, there are 32 batch points in a batch (bs = 32). Each batch point has 8 vector points, and each vector point has 5000 classes. For a given batch point, I wish to compute the loss across all 8 vector points, compute their average, and do so for the rest of the 32 batch points. The final loss would be the average over the losses from each batch point.
How can I approach designing such a loss function? Any help would be deeply appreciated
P.S.: Let me know if the question is ambiguous
One way to go about this is to use a sigmoid function on the network output, which removes the implicit interdependency between class scores that a softmax function has.
As for the loss function, you can then calculate the loss based on the highest prediction for any of your target classes and ignore all other class predictions. For your example:
# your model output
y_out = torch.tensor([[0.1, 0.2, 0.95, 0.1, 0.01]], requires_grad=True)
# class labels
y = torch.tensor([[0,1,1,0,1]])
Since we only care about the highest class probability, we set the score of every target class to the maximum score achieved among them:
class_mask = y == 1
max_class_score = torch.max(y_out[class_mask])
y_hat = torch.where(class_mask, max_class_score, y_out)
From there we can use a regular cross-entropy loss function:
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(y_hat, y.float())
loss.backward()
When inspecting the gradients, we see that this only updates the prediction that achieved the highest score, as well as all predictions outside of any of the target classes:
>>> y_out.grad
tensor([[ 0.3326, 0.0000, -0.6653, 0.3326, 0.0000]])
Predictions for other target classes do not receive a gradient update. Note that if you have a very high ratio of possible classes, this might slow down your convergence.
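Putting the pieces together for the bs x n x nc case, here is a sketch of a reusable loss function (the helper name is mine; CrossEntropyLoss with float probability-style targets requires PyTorch 1.10+, and each row is assumed to contain at least one target class):
import torch

def max_target_ce_loss(y_out, y):
    # y_out, y: (bs, n, nc); y is a multi-hot target tensor
    class_mask = y == 1
    # per (bs, n) row: highest score among that row's target classes
    max_class_score = y_out.masked_fill(~class_mask, float('-inf')).amax(dim=-1, keepdim=True)
    y_hat = torch.where(class_mask, max_class_score, y_out)
    # flatten bs and n so the loss averages over all bs * n rows
    nc = y_out.shape[-1]
    loss_fn = torch.nn.CrossEntropyLoss()
    return loss_fn(y_hat.reshape(-1, nc), y.float().reshape(-1, nc))
Unlike the single-sample example above, this takes the maximum per row rather than over the whole batch, which is the intended per-vector-point behaviour.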

How to code Pytorch to fit a different polynomial to every column/row in an image?

Fitting a single polynomial to a bunch of data is pretty easy in PyTorch using an nn.Linear layer; I've included a trivial example at the end of this post. But suppose I have tons of data split into groups, and I want to fit a different polynomial to each group. As an example: find the particular quadratic coefficients that fit each column of an image (figure omitted).
In other words, I want to simultaneously find the coefficients for N polynomials of order n, given m data points per set to be fit.
In the example image, there are m = 80 points per dataset and N = 100 sets to fit.
This perfectly lends itself to tensor manipulation, and PyTorch on a GPU should make this blindingly fast by fitting all N at once. Problem is, I'm having a terrible brain fart and haven't been able to wrap my head around the right layer configuration. Basically I need N nn.Linear layers, each operating on its own dataset. If this were convolution, I'd use a depthwise layer...
Example network to fit one polynomial, where X is the m x p abscissa data, y is the m ordinate data, and we want to find the p coefficients:
import torch

class polyfit(torch.nn.Module):
    def __init__(self, n=2):
        super(polyfit, self).__init__()
        # one linear map from the n abscissa powers to the ordinate
        self.poly = torch.nn.Linear(n, 1, bias=False)

    def forward(self, x):
        return self.poly(x)

model = polyfit(n)
loss = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for epoch in range(100):  # or however I want to run the loops
    output = model(X)
    mse = loss(output, y)
    optimizer.zero_grad()
    mse.backward()
    optimizer.step()
Figured it out after thinking about my depthwise convolution comment. A Conv1d with just 3 parameters times a tensor with values [1, x, x**2] is a quadratic, the same as with a Linear layer with n=3. So the layer needs to be:
self.poly = torch.nn.Conv1d(N,N,n+1,bias=False,groups=N)
Just have to make sure the X, y tensors have the right dimensions of [m, N, n+1] and [m, N, 1] respectively (the kernel size is n+1, so each dataset's length-(n+1) power vector lines up with its own group).
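A self-contained sketch of that approach (the variable names and the synthetic data are mine; order = 2 fits N quadratics at once):
import torch

m, N, order = 80, 100, 2
# power basis [1, x, x^2] for each of the m points: (m, order+1)
x = torch.linspace(-1, 1, m)
powers = torch.stack([x**k for k in range(order + 1)], dim=-1)
# replicate per dataset -> (m, N, order+1) = (batch, channels, length) for Conv1d
X = powers.unsqueeze(1).expand(m, N, order + 1).contiguous()

true_coeffs = torch.randn(N, order + 1)                       # one coefficient set per dataset
y = torch.einsum('mnk,nk->mn', X, true_coeffs).unsqueeze(-1)  # (m, N, 1)

# groups=N: each channel (dataset) gets its own length-(order+1) kernel
poly = torch.nn.Conv1d(N, N, order + 1, bias=False, groups=N)
optimizer = torch.optim.Adam(poly.parameters(), lr=1e-2)
for epoch in range(2000):
    optimizer.zero_grad()
    mse = torch.nn.functional.mse_loss(poly(X), y)
    mse.backward()
    optimizer.step()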

LSTM input shape through json file

I am working on an LSTM. After pre-processing the data, I get X as a list of 100 samples; each sample contains 3 feature lists, and each feature list holds a sequence of 50 points:
X = [list:100 [list:3 [list:50]]]
Y = [list:100]
Since it's a multivariate LSTM, I am not sure how to give all 3 sequences as input to the Keras LSTM. Do I need to convert it into a Pandas data frame?
model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32,
                                           input_shape=(?,?,?))))
You can do the following to convert the lists into NumPy arrays:
X = np.array(X)
Y = np.array(Y)
Calling the following after this conversion:
print(X.shape)
print(Y.shape)
should output: (100, 3, 50) and (100,), respectively. Finally, the input_shape of the LSTM layer can be (None, 50).
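An end-to-end sketch of that answer (the import path, the Dense head, and the compile settings are my own assumptions, added just to make it runnable):
import numpy as np
from tensorflow.keras import layers, models

X = np.array(X)  # (100, 3, 50)
Y = np.array(Y)  # (100,)

model = models.Sequential()
# (None, 50): variable number of timesteps, 50 features per step
model.add(layers.Bidirectional(layers.LSTM(units=32), input_shape=(None, 50)))
model.add(layers.Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, Y, epochs=10)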
LSTM Call arguments Doc:
inputs: A 3D tensor with shape [batch, timesteps, feature].
You would have to transform that list into a numpy array to work with Keras.
As per the shape of X you have provided, it should work in theory. However, you do have to figure out what the 3 dimensions of your array actually contain.
The 1st dimension should be your batch_size, i.e. how many batches of data you have.
The 2nd dimension is your timestep data.
Ex: words in a sentence, "cat sat on dog" -> 'cat' is timestep 1, 'sat' is timestep 2, 'on' is timestep 3, and so on.
The 3rd dimension represents the features of your data at each timestep. For the sentence earlier, we can vectorize each word, as in the sketch below.
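For instance, a toy encoding (the 2-dimensional vector per word is made up, just to show the axes):
import numpy as np

# "cat sat on dog": 1 sample, 4 timesteps, 2 features per word
sentence = np.array([[[0.1, 0.3],   # 'cat'
                      [0.7, 0.2],   # 'sat'
                      [0.5, 0.9],   # 'on'
                      [0.4, 0.6]]]) # 'dog'
print(sentence.shape)  # (1, 4, 2) -> [batch, timesteps, feature]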

Stacking of 2 convolutional layers

In a convolutional layer with n neurons, trained on inputs of dimension h x w x c (height x width x channels), c usually being 3 (RGB), one trains n x c kernels of size k x k (and n bias values). So for each neuron i in the layer and each channel j in the input, we have a weight matrix of size k x k, which we call weights_ij. The output of each neuron i = 1,...,n (for input X) is:
out_i = sigma(tmp_i + bias_i)
with tmp_i = sum_{j=1,...,c} conv(X_j, weights_ij), where X_j is channel j of the input.
The output is then h_new x w_new x n, so the depth of the output coincides with the number of neurons in the layer. h_new and w_new depend on the padding and stride of the convolution.
This makes sense to me, and I also checked it by coding the convolution and the summation myself and comparing the result with that of a Keras model consisting of only this one layer. Now my actual question: when we add a second convolutional layer, my understanding was that the output of the first layer is now a "picture" with n channels, and we do exactly the same as before but with c = n (and a new number n2 of neurons in our 2nd layer).
But I also coded that and compared it with the prediction of a keras model with 2 convolutional layers and now the result is not the same. So does anyone know how the 2nd convolutional layer treats the output of the first?
Ok, I solved my problem.
Actually the problem was already present for just one layer, and by stacking 2 layers the errors accumulated.
I thought that when using stride=2 in the convolutional layer, one applies the convolution to the sections [0:N_k, 0:N_k], [2:2+N_k, 2:2+N_k], [4:4+N_k, 4:4+N_k], ... of the input, but Keras actually applies the convolution to [1:1+N_k, 1:1+N_k], [3:3+N_k, 3:3+N_k], ...
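One quick way to check which windows Keras actually uses is to run a strided convolution with an all-ones kernel, so each output value is just the sum of its input window (a sketch, assuming a TensorFlow Keras import; the padding mode is an assumption, so try the one your model uses):
import numpy as np
import tensorflow as tf

x = np.arange(12, dtype=np.float32).reshape(1, 12, 1)
conv = tf.keras.layers.Conv1D(filters=1, kernel_size=3, strides=2,
                              padding='same', use_bias=False,
                              kernel_initializer='ones')
# compare these window sums against the sections you expect
print(conv(x).numpy().ravel())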

What does Dense do?

What is the meaning of the two Dense in this code?
self.model.add(Flatten())
self.model.add(Dense(512))
self.model.add(Activation('relu'))
self.model.add(Dropout(0.5))
self.model.add(Dense(10))
self.model.add(Activation('softmax'))
self.model.summary()
Dense is the only actual network layer in that model.
A Dense layer feeds all outputs from the previous layer to all its neurons, each neuron providing one output to the next layer.
It's the most basic layer in neural networks.
A Dense(10) has ten neurons. A Dense(512) has 512 neurons.
Furthermore, a dense layer applies a non-linear transform:
f(W.X + b)
As to the effect: when W is a 2D tensor and X a vector, W.X + b is a vector, and f is an element-wise non-linearity like tanh, so the result is just a vector whose size equals the number of neurons.
From the keras docs:
Dense implements the operation: output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
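A quick numeric check of that formula (a sketch; the input and layer sizes here are arbitrary, and the weights are whatever the layer initializes):
import numpy as np
from tensorflow.keras.layers import Dense

x = np.random.rand(1, 20).astype('float32')
layer = Dense(10, activation='relu')
y = layer(x)                                # builds kernel (20, 10) and bias (10,)
kernel, bias = layer.get_weights()
manual = np.maximum(x @ kernel + bias, 0)   # activation(dot(input, kernel) + bias)
print(np.allclose(y.numpy(), manual))       # True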
