I have a neural net model whose last layer is a fully connected layer with 9 output neurons.
To train my network I'm using softmax_cross_entropy_with_logits.
It trains fine, but when I evaluate the model I also want the probabilities.
So I take an evaluation sample and feed it to the network.
After applying softmax to the output I get
[[ 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
Here are the unnormalized probabilities (logits) as well:
[[ -2710.10620117 -2914.37866211 -5045.04443359 -4361.91601562
-459.57000732 8843.65820312 -1871.62756348 5447.12451172
-10947.22949219]]
I always get a probability of 1 for one class and zeros for the rest.
Could anyone please help me handle this issue?
EDIT:
Input images are of shape 64 x 160.
All activation functions are ReLU.
Max poolings are 2x2.
In conv_plus_max_pool_layer(x_image, 5, 1, 96), 5 is the kernel size.
Here is network layout:
hidden_block_1 = conv_plus_max_pool_layer(x_image, 5, 1, 96)
hidden_block_2 = conv_plus_max_pool_layer(hidden_block_1, 5, 96, 256)
hidden_block_3 = conv_plus_max_pool_layer(hidden_block_2, 3, 256, 384)
hidden_block_4 = conv_plus_max_pool_layer(hidden_block_3, 3, 384, 512)
fc1 = dropout_plus_fc(4 * 10 * 512, 512, hidden_block_4, keep_prob_drop1)
output = dropout_plus_fc(512, model_net10_train.class_num, fc1, keep_prob_drop2)
Looks like your network is pretty sure about the output ;)
In this case, I don't think we can do a lot for you without your network layout... Some gut feelings from my side: the layer leading up to your output layer has too many nodes (hence these huge numbers), and I suspect that you don't use nonlinearities such as ReLU or tanh. Other things you might want to check are the initial values of the weights (might be too big) and the learning rate you are using (might be too high).
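For what it's worth, here is a small standalone illustration (reusing the rounded logits from the question) of why a softmax over values this large always collapses to a one-hot vector; the gap between the largest logit and the runner-up only needs to exceed a few tens for exp() to underflow everything else:
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([-2710.1, -2914.4, -5045.0, -4361.9, -459.6,
                   8843.7, -1871.6, 5447.1, -10947.2])
print(softmax(logits))           # [0. 0. 0. 0. 0. 1. 0. 0. 0.] -- fully saturated
print(softmax(logits / 1000.0))  # scaled-down logits no longer give exact 0/1 probabilities
So the softmax itself behaves correctly; the real question is why the logits reaching it are so large, which brings it back to the weight initialization and learning rate mentioned above.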
I am trying to implement a classification head for the Reformer transformer. The classification head works fine, but when I try to change one of the config parameters, config.axial_pos_shape, i.e. the sequence-length parameter for the model, it throws an error:
size mismatch for reformer.embeddings.position_embeddings.weights.0: copying a param with shape torch.Size([512, 1, 64]) from checkpoint, the shape in current model is torch.Size([64, 1, 64]).
size mismatch for reformer.embeddings.position_embeddings.weights.1: copying a param with shape torch.Size([1, 1024, 192]) from checkpoint, the shape in current model is torch.Size([1, 128, 192]).
The config:
{
  "architectures": [
    "ReformerForSequenceClassification"
  ],
  "attention_head_size": 64,
  "attention_probs_dropout_prob": 0.1,
  "attn_layers": [
    "local",
    "lsh",
    "local",
    "lsh",
    "local",
    "lsh"
  ],
  "axial_norm_std": 1.0,
  "axial_pos_embds": true,
  "axial_pos_embds_dim": [
    64,
    192
  ],
  "axial_pos_shape": [
    64,
    256
  ],
  "chunk_size_feed_forward": 0,
  "chunk_size_lm_head": 0,
  "eos_token_id": 2,
  "feed_forward_size": 512,
  "hash_seed": null,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.05,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": true,
  "layer_norm_eps": 1e-12,
  "local_attention_probs_dropout_prob": 0.05,
  "local_attn_chunk_length": 64,
  "local_num_chunks_after": 0,
  "local_num_chunks_before": 1,
  "lsh_attention_probs_dropout_prob": 0.0,
  "lsh_attn_chunk_length": 64,
  "lsh_num_chunks_after": 0,
  "lsh_num_chunks_before": 1,
  "max_position_embeddings": 8192,
  "model_type": "reformer",
  "num_attention_heads": 2,
  "num_buckets": [
    64,
    128
  ],
  "num_chunks_after": 0,
  "num_chunks_before": 1,
  "num_hashes": 1,
  "num_hidden_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 100
    }
  },
  "vocab_size": 320
}
Python Code:
import torch
from transformers import ReformerConfig, ReformerForSequenceClassification

config = ReformerConfig()
config.max_position_embeddings = 8192
config.axial_pos_shape = [64, 128]
# config = ReformerConfig.from_pretrained('./cnp/config.json', output_attention=True)
model = ReformerForSequenceClassification(config)
model.load_state_dict(torch.load("./cnp/pytorch_model.bin"))
I ran into the same issue when trying to halve the default max sequence length of 65536 (128 * 512) used in Reformer pre-training.
As cronoik mentioned, you must:
1. load the pretrained Reformer
2. resize it to your need by dropping unnecessary weights
3. save this new model
4. load this new model to perform your desired tasks
Those unnecessary weights are the ones from the Position Embeddings layer. In the Reformer model, the Axial Position Encodings strategy is used to learn the position embeddings (rather than having fixed ones like BERT). Axial Position Encodings store the position embeddings in a memory-efficient manner, using two small tensors rather than one big one.
However, the idea of position embeddings remains exactly the same: obtaining a different embedding for each position.
That said, in theory (correct me if I am misunderstanding something), removing the last position embeddings to match your custom max sequence length should not hurt performance. You can refer to this post from HuggingFace for a more detailed description of Axial Position Encodings and to understand where to truncate your position embeddings tensor.
I have managed to resize and use Reformer with a custom max length of 32768 (128*256) with the following code:
import torch
from transformers import ReformerForSequenceClassification

# Load initial pretrained model
model = ReformerForSequenceClassification.from_pretrained('google/reformer-enwik8', num_labels=2)
# Reshape Axial Position Embeddings layer to match desired max seq length
model.reformer.embeddings.position_embeddings.weights[1] = torch.nn.Parameter(model.reformer.embeddings.position_embeddings.weights[1][0][:256])
# Update the config file to match custom max seq length
model.config.axial_pos_shape = 128, 256
model.config.max_position_embeddings = 128*256 # 32768
# Save model with custom max length
output_model_path = "path/to/model"
model.save_pretrained(output_model_path)
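As a follow-up sketch (the path is the placeholder used above), the resized checkpoint can then be reloaded like any other pretrained model for the downstream task:
from transformers import ReformerForSequenceClassification

# Reload the resized model saved above and check the updated config values
model = ReformerForSequenceClassification.from_pretrained("path/to/model")
print(model.config.axial_pos_shape, model.config.max_position_embeddings)  # e.g. [128, 256] 32768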
I am training a NN for a regression problem, so the output layer has a linear activation function. The NN output is supposed to be between -20 and 30. My NN performs well most of the time, but sometimes it gives an output greater than 30, which is not desirable for my system. Does anyone know of an activation function that can provide this kind of restriction on the output, or have any suggestions on modifying the linear activation for my application?
I am using Keras with the TensorFlow backend for this application.
What you can do is activate your last layer with a sigmoid (the result will be between 0 and 1) and then create a custom layer to map it into the desired range:
from keras import backend as K

def get_range(input, maxx, minn):
    # min-max rescale each sample's outputs into [minn, maxx]
    return (maxx - minn) * (input - K.min(input, axis=1, keepdims=True)) / (K.max(input, axis=1, keepdims=True) - K.min(input, axis=1, keepdims=True)) + minn
and then add this to your network:
out = layers.Lambda(get_range, arguments={'maxx': 30, 'minn': -20})(sigmoid_output)
The output will be normalized between 'minn' and 'maxx'.
UPDATE
If you just want to clip the outputs without normalizing them, do this instead:
def clip(input, maxx, minn):
return K.clip(input, minn, maxx)
out = layers.Lambda(clip, arguments={'maxx': 30, 'minn': -20})(sigmoid_output)
What you should do is normalize your target outputs to the range [-1, 1] or [0, 1], then use a tanh (for [-1, 1]) or sigmoid (for [0, 1]) activation at the output, and train the model on the normalized data.
Then you can denormalize the predictions to get values in your original ranges during inference.
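For concreteness, here is a minimal self-contained sketch of that normalize-then-denormalize approach (the toy data and layer sizes are placeholders, not taken from the question):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

y_min, y_max = -20.0, 30.0                             # known target range

# Toy data, only to make the sketch self-contained
x_train = np.random.rand(256, 4)
y_train = np.random.uniform(y_min, y_max, size=(256, 1))
y_train_scaled = (y_train - y_min) / (y_max - y_min)   # targets now in [0, 1]

model = Sequential([
    Dense(16, activation='relu', input_shape=(4,)),
    Dense(1, activation='sigmoid')                     # output bounded to (0, 1)
])
model.compile(loss='mse', optimizer='adam')
model.fit(x_train, y_train_scaled, epochs=5, verbose=0)

# Denormalize predictions back to the original range at inference time
preds = model.predict(x_train) * (y_max - y_min) + y_min   # always within [-20, 30]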
I have constructed an LSTM architecture using Keras, but I am not certain whether duplicating time steps is a good approach to deal with variable sequence lengths.
I have a multidimensional data set with multi-feature sequences and varying numbers of time steps. It is multivariate time series data with multiple examples to train the LSTM on, and Y is either 0 or 1. Currently, I am duplicating the last time step of each sequence to ensure timesteps = 3.
I would appreciate it if someone could answer the following questions or concerns:
1. Is creating additional time steps with feature values represented by zeros more suitable?
2. What is the right way to frame this problem: pad the sequences and mask them for evaluation?
3. I am duplicating the last time step in the Y variable as well for prediction; the value 1 in Y only appears at the last time step, if at all.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense
from keras.constraints import maxnorm

# The input sequences are
trainX = np.array([
    # Datapoint 1
    [
        # Input features at timestep 1
        [1, 2, 3],
        # Input features at timestep 2
        [5, 2, 3]  # <------ duplicate this to ensure compliance
    ],
    # Datapoint 2
    [
        # Features at timestep 1
        [1, 8, 9],
        # Features at timestep 2
        [9, 8, 9],
        # Features at timestep 3
        [7, 6, 1]
    ]
])

# The desired model outputs are as follows:
trainY = np.array([
    # Datapoint 1
    [
        # Target class at timestep 1
        [0],
        # Target class at timestep 2
        [1]  # <---------- duplicate this to ensure compliance
    ],
    # Datapoint 2
    [
        # Target class at timestep 1
        [0],
        # Target class at timestep 2
        [0],
        # Target class at timestep 3
        [0]
    ]
])

timesteps = 3
model = Sequential()
model.add(LSTM(3, kernel_initializer='uniform', return_sequences=True,
               batch_input_shape=(None, timesteps, trainX.shape[2]),
               kernel_constraint=maxnorm(3), name='LSTM'))
model.add(Dropout(0.2))
model.add(LSTM(3, return_sequences=True, kernel_constraint=maxnorm(3), name='LSTM-2'))
model.add(Flatten(name='Flatten'))
model.add(Dense(timesteps, activation='sigmoid', name='Dense'))
model.compile(loss="mse", optimizer="sgd", metrics=["mse"])
model.fit(trainX, trainY, epochs=2000, batch_size=2)
predY = model.predict(testX)
In my opinion there are two solutions to your problem (duplicating timesteps is not one of them):
1. Use pad_sequences in combination with a Masking layer. This is the common approach: thanks to padding, every sample has the same number of timesteps. The good thing about this method is that it's very easy to implement. Also, the Masking layer will give you a little performance boost. The downside of this approach: if you train on a GPU, CuDNNLSTM is the layer to go for, since it is highly optimized for GPUs and therefore a lot faster, but it does not work with a Masking layer, and if your dataset has a wide range of timesteps you lose performance. (A minimal sketch of this option is shown after this list.)
2. Set your timesteps shape to None and write a Keras generator that groups your batches by timesteps (I think you'll also have to use the functional API). Now you can use CuDNNLSTM, and every sample will be computed with only the relevant timesteps (instead of padded ones), which is much more efficient.
If you're new to Keras and performance is not so important, go with option 1. If you have a production environment where you often have to train the network and the cost matters, try option 2.
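Here is a minimal, self-contained sketch of option 1, reusing the toy data from the question (the layer sizes, loss and optimizer are illustrative only, not a recommendation for your actual model):
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense

# Ragged sequences (2 and 3 timesteps, 3 features each), zero-padded at the end
X = [np.array([[1, 2, 3], [5, 2, 3]], dtype='float32'),
     np.array([[1, 8, 9], [9, 8, 9], [7, 6, 1]], dtype='float32')]
Y = [np.array([[0], [1]], dtype='float32'),
     np.array([[0], [0], [0]], dtype='float32')]
X_pad = pad_sequences(X, maxlen=3, dtype='float32', padding='post', value=0.0)
Y_pad = pad_sequences(Y, maxlen=3, dtype='float32', padding='post', value=0.0)

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(3, 3)))      # padded timesteps are skipped
model.add(LSTM(8, return_sequences=True))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))  # one prediction per timestep
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X_pad, Y_pad, epochs=10, verbose=0)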
I am using Keras to make a CNN, and I want to visualize the model with plot_model().
When I look at the shape of the Conv2D layers, there is something I can't figure out.
Let's say my Conv2D layer has kernel size [8 x 8], the stride is [4 x 4], padding is 'same', and I want 16 feature maps.
The input shape to this layer is [None, 3, 160, 320] and the output is [None, 1, 40, 16].
'None' is the number of samples, but what are the 1 and 40? I guess 16 is the number of feature maps?
Since I used padding = 'same', shouldn't the output have the same width and height as the input, or isn't that what 'same' means?
Thanks!
Well, since you're using strides, you'll never keep the same shape.
Your convolutional filter (which can be seen as a sliding window) jumps four pixels at each step as it slides.
As a result, your spatial dimensions get divided by 4 (and rounded up):
3 / 4, rounded up = 1
160 / 4 = 40
16 is the number of feature maps, indeed.
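If you want to double-check the arithmetic, Keras will report the same shapes for a throwaway model built with the layer parameters from your question:
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(16, (8, 8), strides=(4, 4), padding='same',
                 input_shape=(3, 160, 320)))  # shapes as reported in the question
model.summary()  # Conv2D output shape: (None, 1, 40, 16), i.e. ceil(3/4), ceil(160/4), 16 filters
With padding='same' and a stride s, each spatial dimension becomes ceil(input / s), which is exactly the division-and-round-up described above.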
I am trying to classify handwritten digits, using the MNIST dataset to train my model. The model trained successfully and hit an accuracy of 98.9%. But when I try to input a custom image, it shows me the following error:
Error when checking : expected conv2d_4_input to have shape (None, 28, 28, 1) but got array with shape (1, 1, 28, 28)
This is the first convolutional layer, i.e. the input layer.
What can I do to resolve this issue?
This is my convolutional neural network:
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPool2D, Dropout, Flatten, Dense

# filters, kernel_size and act are defined earlier in the full source (see link below)
conv_model = Sequential()
conv_model.add(Conv2D(filters, kernel_size[0], input_shape=(28, 28, 1)))
conv_model.add(Activation(act))
conv_model.add(Conv2D(filters, kernel_size[0]))
conv_model.add(Activation(act))
conv_model.add(MaxPool2D(pool_size=(2,2)))
conv_model.add(Dropout(0.25))
conv_model.add(Flatten())
conv_model.add(Dense(128))
conv_model.add(Activation(act))
conv_model.add(Dropout(0.5))
conv_model.add(Dense(10))
conv_model.add(Activation('softmax'))
#conv_model.summary()
Compilation Details:
conv_model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
COMPLETE SOURCE CODE :
https://github.com/tanmay-edgelord/HandwrittenDigitRecognition
The image:
If any further details are required please comment.
The error message is pretty straightforward:
Your first layer is expecting data with shape (None, 28, 28, 1), where "None" can be any number (it's the batch size: how many examples you have).
Your data, on the other hand, has shape (1, 1, 28, 28).
The confusion seems to be a common one: Keras puts the channels in the last dimension, while your data has the channels in the first.
Solution:
Just reshape your data in the correct format: (1, 28, 28, 1).
But are you trying to give that entire image to the model? If so, it won't work very well; the model expects images of 28 x 28 pixels.
You will have to separate each digit into its own 28 x 28 image. You must also take into account the possibility of your image being inverted in terms of what is black and what is white: MNIST data usually has a black background (0 values) with a white digit (values near 1).
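As a hedged sketch of that preprocessing (the file name 'digit.png' and the inversion step are assumptions about your custom image, not part of the original post):
import numpy as np
from PIL import Image

img = Image.open('digit.png').convert('L').resize((28, 28))  # grayscale, 28 x 28
x = np.asarray(img, dtype='float32') / 255.0
x = 1.0 - x                        # invert if your digit is dark on a white background
x = x.reshape(1, 28, 28, 1)        # (batch, height, width, channels), as the model expects
digit = conv_model.predict(x).argmax(axis=1)
print(digit)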
The problem was solved by passing the array to the reshape function with the correct input shape:
roi2 = roi.reshape(1, 28, 28, 1)