I have developed a Convolutional Neural Network using TILDA image dataset which gives over 90% of accuracy with the following model. I used 4 batches and 100 epochs to the model.
model = keras.Sequential([
layers.Input((30,30,1)),
layers.Conv2D(8,2,padding='same', activation='relu',kernel_regularizer=regularizers.l2(0.01)),
layers.BatchNormalization(),
layers.Conv2D(16,2,padding='same', activation='relu',kernel_regularizer=regularizers.l2(0.01)),
layers.BatchNormalization(),
layers.Conv2D(32,2,padding='same', activation='sigmoid',kernel_regularizer=regularizers.l2(0.01)),
layers.BatchNormalization(),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dropout(0.5),
layers.Dense(5, activation = "softmax"),
])
Using the above model I could plot the following graphs for the training and validation accuracy.
Do you have any suggestions to increase the smoothness of these curves? What can be the possible reasons for getting such curves? I appreciate your recommendations to improve this model.
The following may help in getting a smoother curve:
NEVER use dropout before the final layer. MaxPool + Dropout in your model discards 87.5% of the data flowing into the final layer. Avoid pooling as well, unless you need global or adaptive pooling to get a fixed shape output. If you must pool, you need a much larger number of kernels to compensate for the loss in information.
Use a lower learning rate. From what the training curve tells, the model is directed to a minima, but with several bumps.
Are you using SGD without momentum? If yes, introduce, momentum. Also consider adaptive optimizers with inbuilt momentum, like Adam.
Why the sigmoid in between? Sigmoid reduces the gradient magnitude and makes learning slower.
If you only care about the curve and are not restricted by number of parameters, consider adding a few more layers and/or channels.
Related
I am building a multilabel image classification network. The dataset contains 70k images, total number of classes are 12. With respect to the entire dataset, 12 classes has more than 10% images. Out of 12 classes, 3 classes are above 70%. I am using VGG16 network without its associated classifier.
As the training results, I am getting max of 68% validation accuracy. I have tried changing the number of units per Dense layer (512,256,128 etc), increased the number of layers (5, 6 layers), added/removed Dropout layer (with 0.5), kernel_regularization (L1=0.1, L2=0.1).
As accuracy is not the appropriate metric for multilabel classification, I am trying to incorporate HammingLoss as the metric. But it is not working, here is the issue that I opened on the GitHub repo of HammingLoss.
What can be done to improve the accuracy?
What point I am missing in case of incorporating HammingLoss?
For classification, I am using the network as:
network.add(vggBase)
network.add(tf.keras.layers.Dense(256, activation='relu'))
network.add(tf.keras.layers.Dense(64, activation='relu'))
network.add(tf.keras.layers.Dense(12, activation='sigmoid'))
network.compile(optimizer=tf keras.optimizers.Adam(learning_rate=0.001), loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
I recommend you to use Keras Tuner for tuning.
If Hammingloss is not working for you, you could use a differnet metric as a workaround, like pr_auc for instance. The metric choice depends strongly on what you want to achieve with your model. Maybe towardsdatascience/evaluating-multi-label-classifiers can help you to find that out.
My smallest value in my training dataset is 0.1 and my highest is about 500. my dataset is made about 1500 row and 09 columns.
I'm not sur about that but, is it mandatory to rescale the input data into [0,1] (wiht minmaxscaler for exemple), or is it just to speed the training ?
and second question, is this scaling is du to the model used (LSTM, DENSE, etc.) or does it work for anyone ? For example, my système is :
model = Sequential()
model.add(LSTM(10, input_shape=(12,12),return_sequences=True, activation='tanh'))
model.add(LSTM(10,return_sequences=False,activation='tanh'))
model.add(Dense(5))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)
Scaling your data for ML is done for all types of applications. It's meant to help the model converge faster. You can check out this link for a detailed explanation as to the benefits of feature scaling.
There are different ways you can scale the data, such as min-max or standard scaling; both of which are applicable for your model. If you know you have a fixed min and max in your dataset (e.g. images), you can use min-max scaling to fix your input and/or output data to be between 0 and 1.
For other applications where you do not have fixed bounds, standard scaling is useful. This gives all of your features zero-mean and unit variance. Therefore, the distributions of inputs and/or outputs are the same, and the model can treat them as such. If there is no scaling performed, the model will essentially be forced to think certain features are more important than others, rather than being able to learn those things.
The scaling for your outputs is important in defining the activation function for the output layer. If you have min-max scaled outputs, you can use sigmoid, because it bounds the outputs to between 0 and 1. If you are using standard scaling for the outputs, you would want to be sure you use a linear activation function, because technically standard-scaled outputs are not bounded. The choice of output activation is important, and knowledge of how your outputs are scaled is important in determining which activation to use.
Note: even if you had min-max scaling for your outputs, that does not restrict the activations you can use for your hidden layers.
I was building a neural network model and my question is that by any chance the ordering of the dropout and batch normalization layers actually affect the model?
Will putting the dropout layer before batch-normalization layer (or vice-versa) actually make any difference to the output of the model if I am using ROC-AUC score as my metric of measurement.
I expect the output to have a large (ROC-AUC) score and want to know that will it be affected in any way by the ordering of the layers.
The order of the layers effects the convergence of your model and hence your results. Based on the Batch Normalization paper, the author suggests that the Batch Normalization should be implemented before the activation function. Since Dropout is applied after computing the activations. Then the right order of layers are:
Dense or Conv
Batch Normalization
Activation
Droptout.
In code using keras, here is how you write it sequentially:
model = Sequential()
model.add(Dense(n_neurons, input_shape=your_input_shape, use_bias=False)) # it is important to disable bias when using Batch Normalization
model.add(BatchNormalization())
model.add(Activation('relu')) # for example
model.add(Dropout(rate=0.25))
Batch Normalization helps to avoid Vanishing/Exploding Gradients when training your model. Therefore, it is specially important if you have many layers. You can read the provided paper for more details.
I have implemented an autoencoder using Keras. I understand that I can add accuracy performance metric as follows:
autoencoder.compile(optimizer='adam',
loss='mean_squared_error',
metrics=['accuracy'])
My question is:
Is the accuracy metric applied on the last layer of the decoder by default? If so, how can I set it so that it would get the representations from middle (hidden) layer to compute accuracy performance? Do I need to define a custom metric? How would that work?
It seems that what you really want is a multiple output network.
So on top of your middle layer that defines your embedding, add a layer (or more) to do your classification.
Then have a look at Multiple outputs in Keras to create your global cost.
You may also want to start by training the autoendoder only, then the classifier additional layers only to see the performance, you can also balance the accuracy of the encoder vs the accuracy of the classifier as a loss, training "both" networks at the same time.
I am training a model using Keras.
model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep,103), use_bias=True, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=536))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
while True:
history = model.fit_generator(
generator = data_generator(x_[train_indices],
y_[train_indices], batch = batch, timestep=timestep),
steps_per_epoch=(int)(train_indices.shape[0] / batch),
epochs=1,
verbose=1,
validation_steps=(int)(validation_indices.shape[0] / batch),
validation_data=data_generator(
x_[validation_indices],y_[validation_indices], batch=batch,timestep=timestep))
It is a multiouput classification accoriding to scikit-learn.org definition:
Multioutput regression assigns each sample a set of target values.This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.
Thus, it is a recurrent neural network I tried out different timestep sizes. But the result/problem is mostly the same.
After one epoch, my train loss is around 0.0X and my validation loss is around 0.6X. And this values keep stable for the next 10 epochs.
Dataset is around 680000 rows. Training data is 9/10 and validation data is 1/10.
I ask for intuition behind that..
Is my model already over fittet after just one epoch?
Is 0.6xx even a good value for a validation loss?
High level question:
Therefore it is a multioutput classification task (not multi class), I see the only way by using sigmoid an binary_crossentropy. Do you suggest an other approach?
I've experienced this issue and found that the learning rate and batch size have a huge impact on the learning process. In my case, I've done two things.
Reduce the learning rate (try 0.00005)
Reduce the batch size (8, 16, 32)
Moreover, you can try the basic steps for preventing overfitting.
Reduce the complexity of your model
Increase the training data and also balance each sample per class.
Add more regularization (Dropout, BatchNorm)