My task is to detect defective items in a factory, i.e. to classify each item as defective or fine. This leads to a class-imbalance problem where one class dominates the other (99.7% of the data) because defective items are very rare. Training accuracy is 0.9971 and validation accuracy is 0.9970, which sounds amazing.
But the problem is that the model predicts everything as class 0, the fine goods. In other words, it fails to identify a single defective item.
How can I solve this? I have checked other questions and tried their suggestions, but I am still stuck in the same situation. The dataset has 122,400 rows and 5 features.
In the end, my confusion matrix on the test set looks like this:
array([[30508,     0],
       [   92,     0]], dtype=int64)
which shows the model does a terrible job on the defective class.
My code is as below:
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(sparse=False)
y = y.reshape(-1, 1)
y = ohe.fit_transform(y)

scaler = StandardScaler()
x = scaler.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=777)

# DNN Modelling
epochs = 15
batch_size = 128
Learning_rate_optimizer = 0.001

model = Sequential()
model.add(Dense(5,
                kernel_initializer='glorot_uniform',
                activation='relu',
                input_shape=(5,)))
model.add(Dense(5,
                kernel_initializer='glorot_uniform',
                activation='relu'))
model.add(Dense(8,
                kernel_initializer='glorot_uniform',
                activation='relu'))
model.add(Dense(2,
                kernel_initializer='glorot_uniform',
                activation='softmax'))

model.compile(loss='binary_crossentropy',
              optimizer=Adam(lr=Learning_rate_optimizer),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

y_pred = model.predict(x_test)
confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
Thank you
It sounds like you have a highly imbalanced dataset; the model is only learning how to classify the fine goods.
You can try one of the approaches listed here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
The best first attempt would be to take roughly equal portions of data from both classes, split them into train/validation/test sets, train the classifier, and then do thorough testing on your complete dataset. You can also apply data augmentation techniques to the minority set to get more data out of the same samples. Keep iterating, and maybe even change your loss function to suit your situation; a class-weighting sketch is shown below.
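A minimal sketch of the class-weighting idea, assuming the variable names from the question (x_train, y_train, model, etc.) and that class 1 is the rare defective class; the point is to make mistakes on the minority class cost far more in the loss:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Recover integer labels from the one-hot targets and compute balanced weights.
y_int = y.argmax(axis=1)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_int),
                               y=y_int)
class_weights = {i: w for i, w in enumerate(weights)}  # roughly {0: ~0.5, 1: ~167} at 99.7%/0.3%

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    class_weight=class_weights,
                    validation_data=(x_test, y_test))

With such extreme imbalance, also judge the model by precision/recall or the confusion matrix on the defective class rather than by overall accuracy.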
I'm using a pre-trained InceptionV3 in Keras and retraining it for binary image classification (data labeled with 0's and 1's).
I'm reaching about 65% accuracy on k-fold validation with never-seen data, but the problem is that the model starts overfitting too soon. I need to improve this average accuracy, and I guess it is related to the overfitting problem.
Here are the training and validation loss values over the epochs (plot not reproduced here).
Here is the code. The dataset and labels variables are NumPy arrays.
dataset = joblib.load(path_to_dataset)
labels = joblib.load(path_to_labels)

le = LabelEncoder()
labels = le.fit_transform(labels)
labels = to_categorical(labels, 2)

X_train, X_test, y_train, y_test = sk.train_test_split(dataset, labels, test_size=0.2)
X_train, X_val, y_train, y_val = sk.train_test_split(X_train, y_train, test_size=0.25)  # 0.25 x 0.8 = 0.2

X_train = np.array(X_train)
y_train = np.array(y_train)
X_val = np.array(X_val)
y_val = np.array(y_val)
X_test = np.array(X_test)
y_test = np.array(y_test)

aug = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest")

pre_trained_model = InceptionV3(input_shape=(299, 299, 3),
                                include_top=False,
                                weights='imagenet')

for layer in pre_trained_model.layers:
    layer.trainable = False

x = layers.Flatten()(pre_trained_model.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(2, activation='softmax')(x)  # already tried with sigmoid activation, same behavior

model = Model(pre_trained_model.input, x)
model.compile(optimizer=RMSprop(lr=0.0001),
              loss='binary_crossentropy',
              metrics=['accuracy'])  # already tried with the Adam optimizer, same behavior

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=100)
mc = ModelCheckpoint('best_model_inception_rmsprop.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)

history = model.fit(x=aug.flow(X_train, y_train, batch_size=32),
                    validation_data=(X_val, y_val),
                    epochs=100,
                    callbacks=[es, mc])
The training set has 2181 images and the validation set has 727 images.
Something is wrong, but I can't tell what...
Any thoughts on what can be done to improve it?
One way to avoid overfitting is to use a lot of data. The main reason overfitting happens is that you have a small dataset and try to learn from it: the algorithm has too much freedom on a small dataset and will fit all the data points exactly. With a large number of data points, the algorithm is forced to generalize and come up with a model that suits most of the points.
Suggestions:
Use a lot of data.
Use a less deep network if you have a small number of data samples.
If the 2nd point applies, don't use a huge number of epochs: training for many epochs effectively forces your model to memorize the training data, so it will learn it well but will not generalize.
From your loss graph, I see that the model generalizes well at an early epoch (where the train and validation curves intersect), so please try using the model saved at that epoch rather than the later epochs, which seem to overfit (a sketch of how to keep the best epoch follows below).
The second option you have is to use a lot more training samples.
If you only have a small number of training samples, use data augmentation.
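A minimal sketch of the first point, reusing the callbacks already in the question but with a much smaller patience and restore_best_weights, so training stops near the crossover point and keeps the best epoch's weights (the import path may be keras.callbacks instead of tensorflow.keras.callbacks depending on your setup):

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop when val_loss has not improved for 10 epochs and roll back to the best
# epoch, instead of patience=100, which effectively never triggers within 100 epochs.
es = EarlyStopping(monitor='val_loss', mode='min', patience=10,
                   restore_best_weights=True, verbose=1)
mc = ModelCheckpoint('best_model_inception_rmsprop.h5',
                     monitor='val_loss', mode='min',
                     save_best_only=True, verbose=1)

history = model.fit(x=aug.flow(X_train, y_train, batch_size=32),
                    validation_data=(X_val, y_val),
                    epochs=100,
                    callbacks=[es, mc])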
Have you tried the following?
Using a higher dropout value
Using a lower learning rate (lr=0.00001 or lr=0.000001 ...)
Using more data augmentation
It also seems to me that your amount of data is low. You could use a smaller ratio for the test and validation sets (10% each). A sketch combining these ideas is shown below.
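A minimal sketch of these three changes together, relying on the imports and names from the question (pre_trained_model, layers, Model, ImageDataGenerator, RMSprop); the exact values are only starting points to tune:

# Stronger augmentation, higher dropout, and a lower learning rate.
aug = ImageDataGenerator(rotation_range=30,
                         zoom_range=0.25,
                         width_shift_range=0.1,
                         height_shift_range=0.1,
                         horizontal_flip=True,
                         fill_mode="nearest")

x = layers.Flatten()(pre_trained_model.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.5)(x)                 # higher dropout than 0.2
x = layers.Dense(2, activation='softmax')(x)

model = Model(pre_trained_model.input, x)
model.compile(optimizer=RMSprop(lr=1e-5),  # lower learning rate
              loss='binary_crossentropy',
              metrics=['accuracy'])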
I'm getting started with a simple NN, but my loss stays at one on every iteration. Can somebody point out what I'm doing wrong here?
This is from a Kaggle introductory course, and my modified training set contains shop id, category id, item id, month, and revenue. I'm basically trying to predict revenue per shop per category for the following month.
I've scaled the revenue and trained a simple NN with 2 hidden layers; however, training doesn't seem to be working, as the loss remains constant. I haven't done anything with the labels (i.e. shop ids, category ids), but I would still expect the loss to change on each iteration.
If you have comments on coding practice, I would be interested in those as well.
Thanks.
X_train = grouped_train.drop('revenue', axis=1)
y_train = grouped_train['revenue']
print('X & y trains')
print(X_train.head())
print(y_train.head())
scaler = StandardScaler()
y_train = pd.DataFrame(scaler.fit_transform(y_train.values.reshape(-1,1)))
print('Scaled y train')
print(y_train.head())
keras.backend.clear_session()
model = Sequential()
model.add(Dense(30, activation='relu', input_shape=(4,)))
model.add(Dense(30, activation='relu'))
model.add(Dense(1, activation='relu'))
model.summary()
print('Compile & fit')
model.compile(loss='mean_squared_error', optimizer='RMSprop')
model.fit(X_train, scaled_data, batch_size=128, epochs=13)
predictions = pd.DataFrame(model.predict(test))
print('Scaled predictions')
print(predictions.head())
print('Unscaled predictions')
print(pd.DataFrame(scaler.inverse_transform(predictions)).head())
Looks like you are using the wrong activation for the final layer. You have a regression problem, so the standard activation for the final layer should be activation='linear'. Replace
model.add(Dense(1, activation='relu'))
with
model.add(Dense(1, activation='linear'))
Edit:
Additionally, model.fit is using scaled_data; shouldn't scaled_data be replaced with y_train?
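A minimal sketch with both fixes applied, assuming the variable names from the question:

# Linear output for regression, and fit against the scaled target y_train
# (the DataFrame built above) instead of the undefined scaled_data.
model = Sequential()
model.add(Dense(30, activation='relu', input_shape=(4,)))
model.add(Dense(30, activation='relu'))
model.add(Dense(1, activation='linear'))

model.compile(loss='mean_squared_error', optimizer='RMSprop')
model.fit(X_train, y_train, batch_size=128, epochs=13)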
I'm applying an LSTM autoencoder for anomaly detection. Since anomalous data are very few compared to normal data, only normal instances are used for training. The test data consists of both anomalies and normal instances. During training, the model loss looks good, but on the test data the model produces poor accuracy, i.e. anomalous and normal points are not well separated.
A snippet of my code is below:
.............
.............
X_train = X_train.reshape(X_train.shape[0], lookback, n_features)
X_valid = X_valid.reshape(X_valid.shape[0], lookback, n_features)
X_test = X_test.reshape(X_test.shape[0], lookback, n_features)
.....................
......................
N = 1000
batch = 1000
lr = 0.0001
timesteps = 3
encoding_dim = int(n_features / 2)

lstm_model = Sequential()
lstm_model.add(LSTM(N, activation='relu', input_shape=(timesteps, n_features), return_sequences=True))
lstm_model.add(LSTM(encoding_dim, activation='relu', return_sequences=False))
lstm_model.add(RepeatVector(timesteps))
# Decoder
lstm_model.add(LSTM(timesteps, activation='relu', return_sequences=True))
lstm_model.add(LSTM(encoding_dim, activation='relu', return_sequences=True))
lstm_model.add(TimeDistributed(Dense(n_features)))
lstm_model.summary()

adam = optimizers.Adam(lr)
lstm_model.compile(loss='mse', optimizer=adam)

cp = ModelCheckpoint(filepath="lstm_classifier.h5",
                     save_best_only=True,
                     verbose=0)
tb = TensorBoard(log_dir='./logs',
                 histogram_freq=0,
                 write_graph=True,
                 write_images=True)

lstm_model_history = lstm_model.fit(X_train, X_train,
                                    epochs=epochs,
                                    batch_size=batch,
                                    shuffle=False,
                                    verbose=1,
                                    validation_data=(X_valid, X_valid),
                                    callbacks=[cp, tb]).history
.........................
test_x_predictions = lstm_model.predict(X_test)
mse = np.mean(np.power(preprocess_data.flatten(X_test) - preprocess_data.flatten(test_x_predictions), 2), axis=1)
error_df = pd.DataFrame({'Reconstruction_error': mse,
                         'True_class': y_test})

# Confusion Matrix
pred_y = [1 if e > threshold else 0 for e in error_df.Reconstruction_error.values]
conf_matrix = confusion_matrix(error_df.True_class, pred_y)

plt.figure(figsize=(5, 5))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Please suggest what can be done to the model to improve its accuracy.
If your model is not performing well on the test set, I would check a few things:
The training set is not contaminated with anomalies or with any information from the test set. If you use scaling, make sure you did not fit the scaler on the training and test sets combined.
Based on my experience, if an autoencoder has low training loss but cannot discriminate well enough on the test data then, provided your training set is pure, it has learned the specific details of the training set but not the generalized idea.
Your threshold value might be off, and you may need to come up with a better thresholding procedure. One example can be found here: https://dl.acm.org/citation.cfm?doid=3219819.3219845
If the problem is the 2nd one, the solution is to increase generalization. With autoencoders, one of the most effective generalization levers is the dimension of the bottleneck. Again based on my experience with anomaly detection in flight radar data, lowering the bottleneck dimension significantly increased my multi-class classification accuracy: I was using 14 features with an encoding_dim of 7, but an encoding_dim of 4 gave even better results. The value of the training loss was not important in my case because I was only comparing reconstruction errors, but since you are classifying with a threshold on the reconstruction error, a more robust thresholding procedure, such as the one in the paper I've shared, may improve accuracy.
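As a concrete starting point for the thresholding part, the cutoff can be derived from the reconstruction errors of normal-only data instead of being picked by hand. This is only a sketch: it reuses the variable names from the question, uses a plain reshape in place of the question's preprocess_data.flatten helper, and the 99th percentile is an arbitrary value to tune:

import numpy as np

# Reconstruction error on the (normal-only) validation windows.
valid_pred = lstm_model.predict(X_valid)
valid_err = np.mean(np.power(X_valid.reshape(len(X_valid), -1)
                             - valid_pred.reshape(len(valid_pred), -1), 2), axis=1)

# Flag roughly the top 1% of errors as anomalies; tune this percentile.
threshold = np.percentile(valid_err, 99)
pred_y = (error_df.Reconstruction_error.values > threshold).astype(int)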
I'm trying to follow a TensorFlow tutorial (I'm a beginner) for structured data models, with some changes along the way.
My goal is to create a model to which I provide data (in CSV format) that looks something like this (the example has only 2 features, but I want to extend it once I figure this out):
power_0,power_1,result
0.2,0.3,draw
0.8,0.1,win
0.3,0.1,draw
0.7,0.2,win
0.0,0.4,lose
I created the model using the following code:
def get_labels(df, label, mapping):
    raw_y_true = df.pop(label)
    y_true = np.zeros((len(raw_y_true)))
    for i, raw_label in enumerate(raw_y_true):
        y_true[i] = mapping[raw_label]
    return y_true

tf.compat.v1.enable_eager_execution()

mapping_to_numbers = {'win': 0, 'draw': 1, 'lose': 2}
data_frame = pd.read_csv('data.csv')
data_frame.head()

train, test = train_test_split(data_frame, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)

train_labels = np.array(get_labels(train, label='result', mapping=mapping_to_numbers))
val_labels = np.array(get_labels(val, label='result', mapping=mapping_to_numbers))
test_labels = np.array(get_labels(test, label='result', mapping=mapping_to_numbers))

train_features = np.array(train)
val_features = np.array(val)
test_features = np.array(test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(train_features.shape[-1],)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation='sigmoid'),
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
    run_eagerly=True)

epochs = 10
batch_size = 100
history = model.fit(
    train_features,
    train_labels,
    epochs=epochs,
    validation_data=(val_features, val_labels))

input_data_frame = pd.read_csv('input.csv')
input_data_frame.head()
input_data = np.array(input_data_frame)
print(model.predict(input_data))
input.csv looks as follows:
power_0,power_1
0.8,0.1
0.7,0.2
And the actual result is:
[[0.00604381 0.00242573 0.00440606]
[0.01321151 0.00634229 0.01041476]]
I expected to get the probability of each label ('win', 'draw' and 'lose'). Can anyone please help me with this?
Thanks in advance
Use softmax activation instead of sigmoid in this line: tf.keras.layers.Dense(3, activation='sigmoid').
This works well for me with your example:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(train_features.shape[-1],)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
    run_eagerly=True)
Note that I am also using a Flatten layer as the input layer here.
I have to write my suggestions here because I can't comment yet.
@zihaozhihao is right that you have to use softmax instead of sigmoid, because you are not working with a binary problem. Another problem might be your loss function, which is:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
    run_eagerly=True)
Try loss='categorical_crossentropy', because you are working with a multi-class classification. You can read more about multi-class classification here and here.
As for your probability question: you get the probability of each class for your two test inputs. For example:
     win        draw       lose
[[0.00604381 0.00242573 0.00440606]
 [0.01321151 0.00634229 0.01041476]]
The problem is your loss function and activation function, which lead to these strange probability values. You might want to check this post for more information.
Hope this helps a little, and feel free to ask.
Since it is a multi-class classification problem, please use categorical_crossentropy instead of binary_crossentropy for the loss function, and use softmax instead of sigmoid as the activation of the last layer.
Also, you should train for more epochs to get better convergence.
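The loss has to match the label format, so there are two consistent options. This is a sketch assuming the integer labels produced by get_labels in the question and a softmax output layer with 3 units:

# Option 1: keep the integer labels (0/1/2) and use sparse_categorical_crossentropy.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Option 2: one-hot encode the labels and use categorical_crossentropy.
train_labels_oh = tf.keras.utils.to_categorical(train_labels, num_classes=3)
val_labels_oh = tf.keras.utils.to_categorical(val_labels, num_classes=3)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(train_features, train_labels_oh,
                    epochs=epochs,
                    validation_data=(val_features, val_labels_oh))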
I am a newbie in ML and was experimenting with emotion detection on text.
I have the ISEAR dataset, which contains tweets labeled with their emotion.
My current accuracy is 63%, and I want to increase it to at least 70%, or even more if possible.
Here's the code:
inputs = Input(shape=(MAX_LENGTH, ))
embedding_layer = Embedding(vocab_size,
                            64,
                            input_length=MAX_LENGTH)(inputs)
# x = Flatten()(embedding_layer)
x = LSTM(32, input_shape=(32, 32))(embedding_layer)
x = Dense(10, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)

model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.summary()

filepath = "weights-simple.hdf5"
checkpointer = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
history = model.fit([X_train], batch_size=64, y=to_categorical(y_train), verbose=1, validation_split=0.1,
                    shuffle=True, epochs=10, callbacks=[checkpointer])
That's a pretty general question; optimizing the performance of a neural network may require tuning many factors.
For instance:
The optimizer chosen: in NLP tasks, rmsprop is also a popular optimizer
Tweaking the learning rate
Regularization, e.g. dropout, recurrent_dropout, batch norm. This may help the model generalize better
More units in the LSTM
More dimensions in the embedding
You can try a grid search, e.g. using different optimizers, and evaluate on a validation set; a sketch that combines several of these ideas is shown below.
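A minimal sketch that combines a larger embedding, more LSTM units, dropout and recurrent_dropout, and the rmsprop optimizer, reusing the names from the question (MAX_LENGTH, vocab_size, num_class); all values are starting points for tuning, not recommendations:

inputs = Input(shape=(MAX_LENGTH, ))
x = Embedding(vocab_size, 128, input_length=MAX_LENGTH)(inputs)  # larger embedding dimension
x = LSTM(64, dropout=0.2, recurrent_dropout=0.2)(x)              # more units plus regularization
x = Dense(10, activation='relu')(x)
predictions = Dense(num_class, activation='softmax')(x)

model = Model(inputs=[inputs], outputs=predictions)
model.compile(optimizer='rmsprop',                               # alternative optimizer to try
              loss='categorical_crossentropy',
              metrics=['acc'])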
The data may also need some tweaking, such as:
Text normalization: a better representation of the tweets, e.g. removing unnecessary tokens (#, @)
Shuffling the data before the fit, since Keras' validation_split creates the validation set from the last data records (see the sketch after this list)
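A small sketch of the shuffling point, assuming X_train and y_train are arrays as in the question; shuffle once before calling fit so that validation_split does not just slice off the last, possibly ordered, records:

from sklearn.utils import shuffle

X_train, y_train = shuffle(X_train, y_train, random_state=42)
history = model.fit(X_train, to_categorical(y_train),
                    batch_size=64, epochs=10, verbose=1,
                    validation_split=0.1, callbacks=[checkpointer])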
There is no simple answer to your question.