when training sample increases accuracy decreases - python-3.x

I am testing keras's imdb dataset. questions is, when I split to train and test for 2000 number of words I get close to 87% accuracy:
(X_train, train_labels), (X_test, test_labels) = imdb.load_data(num_words=2000)
but when I bump up the words to like 5000 or 10000, the model perform poorly:
(X_train, train_labels), (X_test, test_labels) = imdb.load_data(num_words=10000)
Here is my model:
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu' ))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
history =model.fit(X_train, y_train, epochs=10, batch_size=64,validation_data=(x_val, y_val))
Can any one explain why this is the case. I though with more sample (and less over fitting) I should get a very good model.
Thanks for any advice

Increasing num_words doesn't increase the amount of samples but the vocabulary, leading to more words per sample (statistically), going in the direction of the curse of dimensionality, which is harmful for the model.
From the docs:
num_words: integer or None. Top most frequent words to consider. Any less frequent word will appear as oov_char value in the sequence data.

Related

LSTM Autoencoder producing poor results in test data

I'm applying LSTM autoencoder for anomaly detection. Since anomaly data are very few as compared to normal data, only normal instances are used for the training. Testing data consists of both anomalies and normal instances. During the training, the model loss seems good. However, in the test the data the model produces poor accuracy. i.e. anomaly and normal points are not well separated.
The snippet of my code is below:
.............
.............
X_train = X_train.reshape(X_train.shape[0], lookback, n_features)
X_valid = X_valid.reshape(X_valid.shape[0], lookback, n_features)
X_test = X_test.reshape(X_test.shape[0], lookback, n_features)
.....................
......................
N = 1000
batch = 1000
lr = 0.0001
timesteps = 3
encoding_dim = int(n_features/2)
lstm_model = Sequential()
lstm_model.add(LSTM(N, activation='relu', input_shape=(timesteps, n_features), return_sequences=True))
lstm_model.add(LSTM(encoding_dim, activation='relu', return_sequences=False))
lstm_model.add(RepeatVector(timesteps))
# Decoder
lstm_model.add(LSTM(timesteps, activation='relu', return_sequences=True))
lstm_model.add(LSTM(encoding_dim, activation='relu', return_sequences=True))
lstm_model.add(TimeDistributed(Dense(n_features)))
lstm_model.summary()
adam = optimizers.Adam(lr)
lstm_model.compile(loss='mse', optimizer=adam)
cp = ModelCheckpoint(filepath="lstm_classifier.h5",
save_best_only=True,
verbose=0)
tb = TensorBoard(log_dir='./logs',
histogram_freq=0,
write_graph=True,
write_images=True)
lstm_model_history = lstm_model.fit(X_train, X_train,
epochs=epochs,
batch_size=batch,
shuffle=False,
verbose=1,
validation_data=(X_valid, X_valid),
callbacks=[cp, tb]).history
.........................
test_x_predictions = lstm_model.predict(X_test)
mse = np.mean(np.power(preprocess_data.flatten(X_test) - preprocess_data.flatten(test_x_predictions), 2), axis=1)
error_df = pd.DataFrame({'Reconstruction_error': mse,
'True_class': y_test})
# Confusion Matrix
pred_y = [1 if e > threshold else 0 for e in error_df.Reconstruction_error.values]
conf_matrix = confusion_matrix(error_df.True_class, pred_y)
plt.figure(figsize=(5, 5))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Please suggest what can be done in the model to improve the accuracy.
If your model is not performing good on the test set I would make sure to check certain things;
Training set is not contaminated with anomalies or any information from the test set. If you use scaling, make sure you did not fit the scaler to training and test set combined.
Based on my experience; if an autoencoder cannot discriminate well enough on the test data but has low training loss, provided your training set is pure, it means that the autoencoder did learn about the underlying details of the training set but not about the generalized idea.
Your threshold value might be off and you may need to come up with a better thresholding procedure. One example can be found here: https://dl.acm.org/citation.cfm?doid=3219819.3219845
If the problem is 2nd one, the solution is to increase generalization. With autoencoders, one of the most efficient generalization tool is the dimension of the bottleneck. Again based on my experience with anomaly detection in flight radar data; lowering the bottleneck dimension significantly increased my multi-class classification accuracy. I was using 14 features with an encoding_dim of 7, but encoding_dim of 4 provided even better results. The value of the training loss was not important in my case because I was only comparing reconstruction errors, but since you are making a classification with a threshold value of RE, a more robust thresholding may be used to improve accuracy, just as in the paper I've shared.

Why is the validation accuracy constant at 20%?

I am trying to implement a 5 class animal classifier using Keras. I am building the CNN from scratch and the weird thing is, the validation accuracy stays constant at 0.20 for all epochs. Any idea why this is happening? The dataset folder contains train, test and validation folders. And each of the folders contains 5 folders corresponding to the 5 classes. What am I doing wrong?
I have tried multiple optimizer but the problem persists. I have included the code sample below.
import warnings
warnings.filterwarnings("ignore")
#First convolution layer
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu',kernel_initializer='he_normal',input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
#Second convolution layer
model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu',kernel_initializer='he_normal',input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
#Flatten the outputs of the convolution layer into a 1D contigious array
model.add(Flatten())
#Add a fully connected layer containing 256 neurons
model.add(Dense(256, activation='relu',kernel_initializer='he_normal'))
model.add(BatchNormalization())
#Add another fully connected layer containing 256 neurons
model.add(Dense(256, activation='relu',kernel_initializer='he_normal'))
model.add(BatchNormalization())
#Add the ouput layer containing 5 neurons, because we have 5 categories
model.add(Dense(5, activation='softmax',kernel_initializer='glorot_uniform'))
optim=RMSprop(lr=1e-6)
model.compile(loss='categorical_crossentropy',optimizer=optim,metrics=['accuracy'])
model.summary()
#We will use the below code snippet for rescaling the images to 0-1 for all the train and test images
train_datagen = ImageDataGenerator(rescale=1./255)
#We won't augment the test data. We will just use ImageDataGenerator to rescale the images.
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(train_data_dir,
classes=['frog', 'giraffe', 'horse', 'tiger','dog'],
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='categorical',
shuffle=False)
validation_generator = test_datagen.flow_from_directory(validation_data_dir,
classes=['frog', 'giraffe', 'horse', 'tiger','dog'],
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='categorical',
shuffle=False)
hist=History()
model.fit_generator(train_generator,
steps_per_epoch=nb_train_samples // batch_size,
epochs=epochs,
validation_data=validation_generator,
validation_steps=nb_validation_samples // batch_size,
callbacks=[hist])
model.save('models/basic_cnn_from_scratch_model.h5') #Save the model weights #Load using: model = load_model('cnn_from_scratch_weights.h5') from keras.models import load_model
print("Time taken to train the baseline model from scratch: ",datetime.now()-global_start)
Check the following for your data:
Shuffle the training data well (I see shuffle=False everywhere)
Properly normalize all data (I see you are doing rescale=1./255, maybe okay)
Proper train/val split (you seem to be doing that too)
Suggestions for your model:
Use multiple Conv2D layers followed by a final Dense. That's what works best for image classification problems. You can also look at popular architectures that are tried and tested; e.g. AlexNet
Can change the optimizer to Adam and try with different learning rates
Have a look at your training and validation loss graphs and see if they look as expected
Also, I guess you corrected the shape of the 2nd Conv2D layer as mentioned in the comments.
It looks as if your output is always the same animal, thus you have a 20% accuracy. I highly recommend you to check your testing outputs to see if they are all the same.
Also you said that you were building a CNN but in the code snipet you posted I see only dense layers, it is going to be hard for a dense architecture to do this task, and it is very small. What is the size of your pictures?
Hope it helps!
The models seems to be working now. I have removed shuffle=False attribute. Corrected the input shape for the 2nd convolution layer. Changed the optimizer to adam. I have reached a validation accuracy of almost 94%. However, I have not yet tested the model on unseen data. There is a bit of overfitting in the model. I will have to use some aggressive dropouts to reduce them. Thanks!

Neural network only predicts one class from binary class

My task is to learn defected items in a factory. It means, I try to detect defected goods or fine goods. This led a problem where one class dominates the others (one class is 99.7% of the data) as the defected items were very rare. Training accuracy is 0.9971 and validation accuracy is 0.9970. It sounds amazing.
But the problem is, the model only predicts everything is 0 class which is fine goods. That means, it fails to classify any defected goods.
How can I solve this problem? I have checked other questions and tried out, but I still have the situation. the total data points are 122400 rows and 5 x features.
In the end, my confusion matrix of the test set is like this
array([[30508, 0],
[ 92, 0]], dtype=int64)
which does a terrible job.
My code is as below:
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(sparse=False)
y = y.reshape(-1,1)
y = ohe.fit_transform(y)
scaler = StandardScaler()
x = scaler.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.25, random_state = 777)
#DNN Modelling
epochs = 15
batch_size =128
Learning_rate_optimizer = 0.001
model = Sequential()
model.add(Dense(5,
kernel_initializer='glorot_uniform',
activation='relu',
input_shape=(5,)))
model.add(Dense(5,
kernel_initializer='glorot_uniform',
activation='relu'))
model.add(Dense(8,
kernel_initializer='glorot_uniform',
activation='relu'))
model.add(Dense(2,
kernel_initializer='glorot_uniform',
activation='softmax'))
model.compile(loss='binary_crossentropy',
optimizer=Adam(lr = Learning_rate_optimizer),
metrics=['accuracy'])
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test))
y_pred = model.predict(x_test)
confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
Thank you
it sounds like you have highly imbalanced dataset, the model is learning only how to classify fine goods.
you can try one of the approaches listed here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
The best attempt would be to firstly take almost equal portions of data of both classes, split them into train-test-val, train the classifier and do thorough testing on your complete dataset. You can also try and use data augmentation techniques to your other set to get more data from the same set. Keep on iterating and maybe even try and change your loss function to suit your condition.

Emotion detection on text

I am a newbie in ML and was experimenting with emotion detection on the text.
So I have an ISEAR dataset which contains tweets with their emotion labeled.
So my current accuracy is 63% and I want to increase to at least 70% or even more maybe.
Heres the code :
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
64,
input_length=MAX_LENGTH)(inputs)
# x = Flatten()(embedding_layer)
x = LSTM(32, input_shape=(32, 32))(embedding_layer)
x = Dense(10, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)
model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['acc'])
model.summary()
filepath="weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.1,
shuffle=True, epochs=10, callbacks=[checkpointer])
That's a pretty general question, optimizing the performance of a neural network may require tuning many factors.
For instance:
The optimizer chosen: in NLP tasks rmsprop is also a popular
optimizer
Tweaking the learning rate
Regularization - e.g dropout, recurrent_dropout, batch norm. This may help the model to generalize better
More units in the LSTM
More dimensions in the embedding
You can try grid search, e.g. using different optimizers and evaluate on a validation set.
The data may also need some tweaking, such as:
Text normalization - better representation of the tweets - remove unnecessary tokens (#, #)
Shuffle the data before the fit - keras validation_split creates a validation set using the last data records
There is no simple answer to your question.

Multiple Sequences RNN/LSTM in Keras

I have multiple sequences of varying length. Each has about 9 features. I want to predict the values of all the continuous features at time t+1. The data is in a list of length 2000 (so, 2000 total sequences). How could one do this in Keras?
model = Sequential()
model.add(LSTM(100, input_shape=(None,9)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y, epochs=1, batch_size=1, verbose=1)
This is all I really have, but I'm getting some size mismatches. Any suggestions?

Resources