I am performing text classification using a deep neural network. My problem is that I get 98% accuracy on the training data, whereas my validation accuracy is only 49%.
I have tried the following:
Shuffled the data
Used an 80:20 train/validation split
Used 100-dimensional GloVe vectors
Any suggestions?
import tensorflow as tf

def get_Model():
    model = tf.keras.Sequential([
        # Pretrained (frozen) embedding layer
        tf.keras.layers.Embedding(vocab_size + 1, embedding_dim,
                                  input_length=max_length,
                                  weights=[embeddings_matrix],
                                  trainable=False),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Conv1D(64, 5, activation='relu'),
        tf.keras.layers.MaxPooling1D(pool_size=4),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['acc'])
    model.summary()
    return model
Your model is clearly overfitting. Standard tricks to prevent overfitting (see the sketch after this answer's points) are:
Adding dropout,
L2 regularization,
Trying a smaller model.
It is rather unusual to use convolutions and LSTMs at the same time (although it is perfectly fine). Perhaps keeping only one of them is the best way to make the network smaller.
My guess is that you are working with a rather small dataset. A bigger dataset also helps to prevent overfitting, but collecting more data is not always actionable advice.
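For illustration, here is a minimal sketch of a smaller variant along those lines (keeping only the convolutional part, with stronger dropout and L2 regularization; vocab_size, embedding_dim, max_length, and embeddings_matrix are reused from the question, and the exact sizes are guesses, not tuned values):

import tensorflow as tf

def get_smaller_model():
    # Sketch: conv branch only, stronger dropout, L2 on the conv kernel.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size + 1, embedding_dim,
                                  input_length=max_length,
                                  weights=[embeddings_matrix],
                                  trainable=False),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Conv1D(32, 5, activation='relu',
                               kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(5, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['acc'])
    return model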
Related
I trained my model using a BiLSTM for multi-class text classification, but my results are no different when I use 2 stacked BiLSTM layers (as below) compared to just 1 BiLSTM layer. Any idea why?
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

max_len = 409
max_words = 17666
emb_dim = 100

BB = Sequential()
BB.add(Embedding(max_words, emb_dim, weights=[embedding_matrix],
                 input_length=max_len))
BB.add(Bidirectional(LSTM(64, return_sequences=True,
                          dropout=0.4, recurrent_dropout=0.4)))
BB.add(Bidirectional(LSTM(34, return_sequences=False,
                          dropout=0.4, recurrent_dropout=0.4)))
BB.add(Dropout(0.5))
BB.add(Dense(3, activation='softmax'))
BB.compile(optimizer='adam',
           loss='categorical_crossentropy',
           metrics=['accuracy'])
BB.summary()
One thing that could be the issue is that your embedding dimension is quite low (100). You could try another embedding type as your text representation, such as word2vec, which has an embedding dimension of 300, if I am not mistaken.
On the modeling side, I would try experimenting with increasing the LSTM units first (try 100 or more), and then stacking LSTM layers on top of each other.
I also noticed that your model is very heavily regularized through Dropout, recurrent and otherwise. Did you notice overfitting while training your model? If not, I would suggest removing those, as they can hurt training when added unnecessarily. A sketch of both changes follows.
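As a rough sketch of what that could look like (reusing max_words, emb_dim, max_len, and embedding_matrix from your snippet; the unit count is just a starting point, not a tuned value):

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

# Sketch: one wider BiLSTM, dropout removed to first check for underfitting.
BB = Sequential()
BB.add(Embedding(max_words, emb_dim, weights=[embedding_matrix],
                 input_length=max_len))
BB.add(Bidirectional(LSTM(128)))
BB.add(Dense(3, activation='softmax'))
BB.compile(optimizer='adam', loss='categorical_crossentropy',
           metrics=['accuracy'])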
I am using an MLP for classification
Here is my model
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(X.shape[1], X.shape[2])),
    keras.layers.Dense(2048, activation='relu'),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    0.00015, decay_steps=1000, decay_rate=0.96, staircase=True)
optimiser = keras.optimizers.Adam(learning_rate=lr_schedule)
model.compile(optimizer=optimiser,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
I noticed in the training/validation loss and accuracy plots that the validation loss increases as the validation accuracy increases.
Isn't loss supposed to decrease as accuracy increases?
I would recommend checking out this post. Another aspect that is not mentioned there, but that I think is worth noting, is the class balance in your dataset. Since you are using softmax in your final layer, i.e., multi-class classification, the reason you observe validation loss increasing together with validation accuracy could be that your data is imbalanced with respect to how many observations of each label (class) exist. Your classifier becomes better (more accurate) at determining your most frequent classes, but worse (less accurate) with the rarer classes. You could say that you are "overfitting" your model to more often predict the common classes.
Because of this, accuracy increases, since your classifier is correct more often overall, but the loss increases as well, since the loss for the rare classes becomes large (the model predicts their probability to be near 0). A toy computation below illustrates this.
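With cross-entropy, the per-sample loss is -log of the probability assigned to the true class (toy numbers, purely for illustration):

import numpy as np

# Per-sample cross-entropy is -log(p_true).
common_correct = -np.log(0.90)  # ~0.105: confident, correct on a common class
rare_wrong = -np.log(0.01)      # ~4.605: near-zero probability on a rare class
print(common_correct, rare_wrong)
# Many cheap correct predictions push accuracy up, while a handful of
# near-zero probabilities on rare classes dominate the mean loss.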
To solve this, you could either upsample or downsample your data, or set class weights as described here; a sketch of the class-weight approach follows.
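As a sketch of the class-weight route in Keras (assuming integer labels in y_train; 'balanced' is one common heuristic, not the only option):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced',
                               classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))

# Keras scales each sample's loss by the weight of its class.
model.fit(X_train, y_train, epochs=10, class_weight=class_weights)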
I have developed a convolutional neural network on the TILDA image dataset which gives over 90% accuracy with the following model. I trained it with a batch size of 4 for 100 epochs.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(8, 2, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(16, 2, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 2, padding='same', activation='sigmoid',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax'),
])
Using the above model I could plot the following graphs for the training and validation accuracy.
Do you have any suggestions to increase the smoothness of these curves? What can be the possible reasons for getting such curves? I appreciate your recommendations to improve this model.
The following may help in getting a smoother curve:
NEVER use dropout before the final layer. MaxPool + Dropout in your model discards 87.5% of the data flowing into the final layer: the default 2x2 max-pooling keeps only 1 in 4 activations, and Dropout(0.5) then zeroes half of what remains (0.25 × 0.5 = 0.125). Avoid pooling as well, unless you need global or adaptive pooling to get a fixed-shape output. If you must pool, you need a much larger number of kernels to compensate for the loss of information.
Use a lower learning rate. From what the training curve shows, the model is heading towards a minimum, but with several bumps.
Are you using SGD without momentum? If yes, introduce momentum. Also consider adaptive optimizers with built-in momentum, like Adam.
Why the sigmoid in between? Sigmoid reduces the gradient magnitude and makes learning slower.
If you only care about the curve and are not restricted by the number of parameters, consider adding a few more layers and/or channels. A sketch that puts these points together follows.
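For example, a hypothetical revision along the lines above (ReLU throughout, no pooling or dropout before the final layer, more channels, Adam with a lower learning rate; all sizes here are guesses, not tuned values):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(16, 3, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 3, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(5, activation='softmax'),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])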
I was building a neural network model, and my question is: does the ordering of the dropout and batch-normalization layers actually affect the model?
Will putting the dropout layer before the batch-normalization layer (or vice versa) actually make any difference to the output of the model if I am using the ROC-AUC score as my evaluation metric?
I expect the output to have a large ROC-AUC score and want to know whether it will be affected in any way by the ordering of the layers.
The order of the layers affects the convergence of your model, and hence your results. Based on the Batch Normalization paper, the authors suggest that Batch Normalization should be applied before the activation function, and Dropout is applied after computing the activations. The right order of layers is therefore:
Dense or Conv
Batch Normalization
Activation
Dropout.
In code using keras, here is how you write it sequentially:
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Dropout

model = Sequential()
# It is important to disable the bias here: BatchNormalization adds its own shift (beta).
model.add(Dense(n_neurons, input_shape=your_input_shape, use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))  # for example
model.add(Dropout(rate=0.25))
Batch Normalization helps to avoid vanishing/exploding gradients when training your model. It is therefore especially important if you have many layers. You can read the provided paper for more details.
I am training a model using Keras.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep, 103), use_bias=True,
               dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=536))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])

while True:
    history = model.fit_generator(
        generator=data_generator(x_[train_indices], y_[train_indices],
                                 batch=batch, timestep=timestep),
        steps_per_epoch=int(train_indices.shape[0] / batch),
        epochs=1,
        verbose=1,
        validation_steps=int(validation_indices.shape[0] / batch),
        validation_data=data_generator(x_[validation_indices],
                                       y_[validation_indices],
                                       batch=batch, timestep=timestep))
It is a multioutput classification task according to the scikit-learn.org definition:
Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.
Since it is a recurrent neural network, I tried out different timestep sizes, but the result/problem is mostly the same.
After one epoch, my training loss is around 0.0X and my validation loss is around 0.6X, and these values stay stable for the next 10 epochs.
Dataset is around 680000 rows. Training data is 9/10 and validation data is 1/10.
I am asking for intuition behind this:
Is my model already overfitted after just one epoch?
Is 0.6X even a good value for a validation loss?
High level question:
Since it is a multioutput classification task (not multi-class), I see sigmoid with binary_crossentropy as the only option. Do you suggest another approach?
I've experienced this issue and found that the learning rate and batch size have a huge impact on the learning process. In my case, I did two things, sketched in code at the end of this answer.
Reduce the learning rate (try 0.00005)
Reduce the batch size (8, 16, 32)
Moreover, you can try the basic steps for preventing overfitting.
Reduce the complexity of your model
Increase the training data and balance the number of samples per class.
Add more regularization (Dropout, BatchNorm)
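As a sketch of the first two suggestions (model, x_train, y_train, x_val, and y_val are placeholders for your own objects):

from tensorflow import keras

# Lower learning rate and smaller batches, as suggested above.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=5e-5),
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=16,
                    validation_data=(x_val, y_val),
                    epochs=10)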