Anomalies have similar error values to normal data - keras

I have inertial measurement unit (IMU) data for which I am building an anomaly detection autoencoder neural net. I have about 5k training samples, of which I am using 10% for validation. I also have about 50 samples (though I can make more) to test anomaly detection. My dataset has 12 IMU features. I train for about 10,000 epochs and reach a reconstruction mean squared error (MSE) of about 0.004 during training. After training, I compute the MSE on the test data and get values very similar to those on the training data (0.003), and I do not know why!
I am making my test set by slicing 50 samples from the overall data (not part of X_train) and changing one of the features to all zeros. I have also tried adding noise to one of the features as well as making multiple features zero.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Shuffle the full dataset, then split it into a "normal" slice for training
# and a held-out slice that is turned into synthetic anomalies.
np.random.seed(404)
np.random.shuffle(all_imu_data)
norm_imu_data = all_imu_data[:len_slice]
anom_imu_data = all_imu_data[len_slice:]
anom_imu_data[:, 6] = 0  # zero out one feature to create the "anomalies"

# Fit the scaler on the normal data only, then apply it to the anomalous data.
scaler = MinMaxScaler()
norm_data = scaler.fit_transform(norm_imu_data)
anom_data = scaler.transform(anom_imu_data)

X_train = pd.DataFrame(norm_data)
X_test = pd.DataFrame(anom_data)
I have tried many different network sizes, varying the number of hidden layers and the number of nodes per layer. As an example, here is a topology of [12-7-4-7-12]:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers

input_dim = num_features  # 12 IMU features
input_layer = Input(shape=(input_dim,))
encoder = Dense(7, activation="tanh",
                activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(4, activation="tanh")(encoder)
decoder = Dense(7, activation="tanh")(encoder)
decoder = Dense(input_dim, activation="tanh")(decoder)

autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam', loss='mse', metrics=['mse'])

history = autoencoder.fit(X_train, X_train,
                          epochs=nb_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_split=0.1,
                          verbose=1,
                          callbacks=[checkpointer, tensorboard]).history

# Per-sample reconstruction error on train and test data
pred_train = autoencoder.predict(X_train)
pred_test = autoencoder.predict(X_test)
mse_train = np.mean(np.power(X_train - pred_train, 2), axis=1)
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
print('MSE mean() - X_train:', np.mean(mse_train))
print('MSE mean() - X_test:', np.mean(mse_test))
After doing this, I get mean MSE values of 0.004 for train and 0.003 for test. Therefore, I cannot select a good threshold for the anomalous data, as a lot of normal points have larger MSE scores than the 'anomalous' data.
Any thoughts as to why this network is unable to detect these anomalies?

It is completely normal. You train your autoencoder on a subsample of your whole data, so anomalies also contaminate your training data. The purpose of the autoencoder is to find a faithful reconstruction of your original data, which it does, anomalies included. It is a very powerful tool, so if you show it anomalies in the training data, it will reconstruct them easily.
You need to first remove the anomalous part of your data (roughly 5%) with another anomaly detection algorithm (for example, Isolation Forest) and do the subsampling on the remaining data, without outliers.
After that, you can find your outliers easily.
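A minimal sketch of that pre-filtering step, assuming the raw data is already in a NumPy array; the contamination fraction and variable names are illustrative, not taken from the question:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

# Hypothetical pre-filtering: drop the ~5% most anomalous rows before the
# autoencoder ever sees the training data.
iso = IsolationForest(contamination=0.05, random_state=404)
flags = iso.fit_predict(all_imu_data)      # +1 = inlier, -1 = outlier
clean_imu_data = all_imu_data[flags == 1]

scaler = MinMaxScaler()
X_train_clean = scaler.fit_transform(clean_imu_data)
# Train the autoencoder on X_train_clean only, then score new samples by
# their reconstruction error as before.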

Related

Keras training on big datasets separately - keras

I am working on a Keras denoising neural network that denoises high-dimension X-ray images. The idea is to train on some datasets (e.g. 1, 2, 3) and, after obtaining the weights, to start a new training on other datasets (e.g. 4, 5, 6) with the weights initialized from the previous training. Implementation-wise it works, however the weights resulting from the last rotation perform well only on the datasets that were used for training in that rotation. The same goes for the other rotations.
In other words, the weights resulting from training on datasets 4, 5, 6 do not give results on an image from dataset 1 as good as the weights that were trained on datasets 1, 2, 3, which is not what I intend.
The idea is that the weights should be tweaked to work with all datasets effectively, since training on the whole dataset does not fit into memory.
I have tried other solutions, such as a custom generator that reads images from disk and trains in batches, but it is very slow because it depends on factors like disk I/O and the time complexity of the processing functions running inside the custom Keras generator.
Below is the code that shows what I am doing. I have 12 datasets, separated into 4 checkpoints. For each checkpoint the data is loaded, training runs, the final model path is appended to an array, and the next rotation loads the weights from the previous rotation and continues.
import os
import h5py
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import load_model
from sklearn.model_selection import train_test_split

EPOCHES = 150
NUM_CHKPTS = 4
weights = []

for chk in range(1, NUM_CHKPTS + 1):
    log_dir = os.path.join(os.getcwd(), 'resnet_checkpts_' + str(EPOCHES) + "_tl2_chkpt" + str(chk))
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    else:
        print('Training log directory already exists # {}.'.format(log_dir))
    tb_output = TensorBoard(log_dir=log_dir, histogram_freq=1)

    # Load the clean/noisy patches for this checkpoint
    print("Loading Data From CHKPT #" + str(chk))
    h5f = h5py.File('C:\\autoencoder\\datasets\\mix\\chk' + str(chk) + '.h5', 'r')
    org_patch = h5f['train_data'][:]
    noisy_patch = h5f['train_noisy'][:]
    h5f.close()

    input_patch, test_patch, noisy_patch, test_noisy_patch = train_test_split(
        org_patch, noisy_patch, train_size=0.8, shuffle=True)

    print("Reshaping")
    train_data = np.array([np.reshape(input_patch[i], (52, 52, 1)) for i in range(input_patch.shape[0])], dtype=np.float32)
    train_noisy_data = np.array([np.reshape(noisy_patch[i], (52, 52, 1)) for i in range(noisy_patch.shape[0])], dtype=np.float32)
    test_data = np.array([np.reshape(test_patch[i], (52, 52, 1)) for i in range(test_patch.shape[0])], dtype=np.float32)
    test_noisy_data = np.array([np.reshape(test_noisy_patch[i], (52, 52, 1)) for i in range(test_noisy_patch.shape[0])], dtype=np.float32)
    print('Number of training samples are:', train_data.shape[0])
    print('Number of test samples are:', test_data.shape[0])
    # IN = np.ones((len(XTRAINFILES), 52, 52, 1 ))

    # The first rotation builds a fresh model; later rotations resume from the previous weights
    if chk == 1:
        print("Generating the Model For The First Time..")
        autoencoder_model = model_autoencoder(train_noisy_data)
        print("Done!")
    else:
        autoencoder_model = load_model(weights[chk - 2])

    checkpt_path = log_dir + r"\\cp-{epoch:04d}.ckpt"
    checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpt_path, verbose=0, save_weights_only=True, save_freq='epoch')
    optimizer = tf.keras.optimizers.Adam(lr=0.0001)
    autoencoder_model.compile(loss='mse', optimizer=optimizer)
    autoencoder_model.fit(train_noisy_data, train_data,
                          batch_size=128,
                          epochs=EPOCHES, shuffle=True, verbose=1,
                          validation_data=(test_noisy_data, test_data),
                          callbacks=[tb_output, checkpoint_callback])

    weight_dir = log_dir + '\\model_resnet_new_OL' + str(EPOCHES) + 'epochs.h5'
    weights.append(weight_dir)
    autoencoder_model.save(weight_dir)  # Saved model name includes the number of epochs.
TensorBoard graphs, rotations 1, 2, 3, 4 from top to bottom:
Your model will forget the previous datasets as you train on a new dataset; this is known as catastrophic forgetting.
I read that in reinforcement learning, when games are used to train deep reinforcement learning (DRL) agents, you have to create a replay memory that collects data from different rounds of the game, because each round produces different data; some of that data is then chosen at random to train the model. That way the DRL model can learn to play different rounds of the game without forgetting the previous ones.
You can try to create a single dataset by taking some random samples from each dataset.
When you train the model on a new dataset, make sure data from all previous rotations is also present in the current rotation, as in the sketch below.
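A minimal sketch of that mixing step, assuming each checkpoint file has the same 'train_data'/'train_noisy' layout as in the question; the replay fraction and helper names are illustrative:
import h5py
import numpy as np

REPLAY_FRACTION = 0.25  # hypothetical share of each earlier checkpoint to replay

def load_chunk(chk):
    with h5py.File('C:\\autoencoder\\datasets\\mix\\chk' + str(chk) + '.h5', 'r') as h5f:
        return h5f['train_data'][:], h5f['train_noisy'][:]

def build_rotation(current_chk):
    clean, noisy = load_chunk(current_chk)
    parts_clean, parts_noisy = [clean], [noisy]
    # Mix in a random slice of every earlier checkpoint so old data is revisited.
    for prev in range(1, current_chk):
        prev_clean, prev_noisy = load_chunk(prev)
        n = int(len(prev_clean) * REPLAY_FRACTION)
        idx = np.random.choice(len(prev_clean), n, replace=False)
        parts_clean.append(prev_clean[idx])
        parts_noisy.append(prev_noisy[idx])
    return np.concatenate(parts_clean), np.concatenate(parts_noisy)

# e.g. rotation 3 trains on chk3 plus random replay slices of chk1 and chk2
org_patch, noisy_patch = build_rotation(3)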
Also, in transfer learning, when you train a model on a new dataset you freeze the earlier layers so that the model does not forget its previous training. You are not using transfer learning here, but the effect is the same: when you start training on the 2nd dataset, the 1st dataset will slowly be washed out of the weights.
You can try freezing the initial, feature-extracting layers so that they are not updated, assuming all of the datasets contain similar images; that way your model forgets less of its previous training, as in transfer learning. Even so, some of the previous training will still be forgotten when you train on a new dataset. A sketch of the freezing idea follows.
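A minimal sketch of that freezing step, applied to the model loaded from the previous rotation as in the question; the number of layers to freeze is an arbitrary choice for illustration and should be tuned:
import tensorflow as tf
from tensorflow.keras.models import load_model

autoencoder_model = load_model(weights[chk - 2])

# Freeze the first few layers so their weights are not updated this rotation.
N_FROZEN = 4  # hypothetical count, depends on the architecture
for layer in autoencoder_model.layers[:N_FROZEN]:
    layer.trainable = False

# Re-compile after changing `trainable` so the change takes effect.
autoencoder_model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(1e-4))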

NaN loss in keras with triplet loss

I'm trying to learn an embedding for Paris6k images, combining VGG and Adrian Ung's triplet loss. The problem is that after a small number of iterations, in the first epoch, the loss becomes NaN, and then the accuracy and validation accuracy grow to 1.
I've already tried lowering the learning rate, increasing the batch size (only up to 16 because of memory), changing the optimizer (Adam and RMSprop), checking whether there are None values in my dataset, changing the data format from 'float32' to 'float64', adding a small bias to the data, and simplifying the model.
Here is my code:
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import numpy as np

base_model = VGG16(include_top=False, input_shape=(512, 384, 3))
input_images = base_model.input
input_labels = Input(shape=(1,), name='input_label')

# The label is concatenated to the embedding so the custom loss can see it.
embeddings = Flatten()(base_model.output)
labels_plus_embeddings = concatenate([input_labels, embeddings])
model = Model(inputs=[input_images, input_labels], outputs=labels_plus_embeddings)

batch_size = 16
epochs = 2
embedding_size = 64

opt = Adam(lr=0.0001)
# tl = module containing Adrian Ung's triplet loss
model.compile(loss=tl.triplet_loss_adapted_from_tf, optimizer=opt, metrics=['accuracy'])

label_list = np.vstack(label_list)
x_train = image_list[:2500]
x_val = image_list[2500:]
y_train = label_list[:2500]
y_val = label_list[2500:]

# The loss ignores y_true, so dummy targets are passed.
dummy_gt_train = np.zeros((len(x_train), embedding_size + 1))
dummy_gt_val = np.zeros((len(x_val), embedding_size + 1))

H = model.fit(
    x=[x_train, y_train],
    y=dummy_gt_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([x_val, y_val], dummy_gt_val),
    callbacks=callbacks_list)
There are 3366 images, with values scaled to the range [0, 1].
The network is given dummy target values because it tries to learn embeddings such that images of the same class have a small distance between them while images of different classes have a large distance; the real class labels are fed in as an input rather than used as the training target.
I've noticed that when I was previously making an incorrect class division (and keeping images that should have been discarded), I did not have the NaN loss problem.
What should I try?
Thanks in advance, and sorry for my English.
In some cases, a random NaN loss can be caused by your data: if there are no positive pairs in your batch, you will get a NaN loss.
As you can see in Adrian Ung's notebook (or in the TensorFlow Addons triplet loss; it's the same code):
semi_hard_triplet_loss_distance = math_ops.truediv(
    math_ops.reduce_sum(
        math_ops.maximum(
            math_ops.multiply(loss_mat, mask_positives), 0.0)),
    num_positives,
    name='triplet_semihard_loss')
There is a division by the number of positive pairs (num_positives), which can lead to NaN.
I suggest you inspect your data pipeline to ensure there is at least one positive pair in each of your batches. (You can, for example, adapt some of the code in triplet_loss_adapted_from_tf to get the num_positives of your batch and check that it is greater than 0.)
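A minimal sketch of such a check, run in NumPy over the label batches before training; the helper name is illustrative and the batch slicing assumes the data is fed in order:
import numpy as np

def count_positive_pairs(labels):
    # Number of (i, j) pairs, i != j, that share a label within this batch.
    labels = np.asarray(labels).reshape(-1)
    same = labels[:, None] == labels[None, :]
    return (same.sum() - len(labels)) // 2  # drop the diagonal, count each pair once

batch_size = 16
for start in range(0, len(y_train), batch_size):
    n_pos = count_positive_pairs(y_train[start:start + batch_size])
    if n_pos == 0:
        print('Batch starting at', start, 'has no positive pairs -> NaN risk')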
Try increasing your batch size. It happened to me as well: as mentioned in the previous answer, the network is unable to find any positive pairs (num_positives). I had 250 classes and was getting a NaN loss initially. I increased the batch size to 128/256 and then there was no issue.
I saw that Paris6k has 15 classes (or 12 classes). Increase your batch size to 32, and if you run into GPU memory limits you can try a model with fewer parameters. You can start with EfficientNet-B0, which has 5.3M parameters compared to VGG16's 138M.
I have implemented a package for triplet generation so that every batch is guaranteed to include positive pairs. It is compatible with TF/Keras only.
https://github.com/ma7555/kerasgen (Disclaimer: I am the owner)

LSTM Autoencoder producing poor results in test data

I'm applying an LSTM autoencoder for anomaly detection. Since anomalous data are very few compared to normal data, only normal instances are used for training. The test data consist of both anomalies and normal instances. During training, the model loss looks good. However, on the test data the model produces poor accuracy, i.e. anomalous and normal points are not well separated.
The snippet of my code is below:
.............
.............
# Reshape into (samples, lookback, n_features) windows for the LSTM
X_train = X_train.reshape(X_train.shape[0], lookback, n_features)
X_valid = X_valid.reshape(X_valid.shape[0], lookback, n_features)
X_test = X_test.reshape(X_test.shape[0], lookback, n_features)
.....................
......................
N = 1000
batch = 1000
lr = 0.0001
timesteps = 3
encoding_dim = int(n_features / 2)

# Encoder
lstm_model = Sequential()
lstm_model.add(LSTM(N, activation='relu', input_shape=(timesteps, n_features), return_sequences=True))
lstm_model.add(LSTM(encoding_dim, activation='relu', return_sequences=False))
lstm_model.add(RepeatVector(timesteps))
# Decoder
lstm_model.add(LSTM(timesteps, activation='relu', return_sequences=True))
lstm_model.add(LSTM(encoding_dim, activation='relu', return_sequences=True))
lstm_model.add(TimeDistributed(Dense(n_features)))
lstm_model.summary()

adam = optimizers.Adam(lr)
lstm_model.compile(loss='mse', optimizer=adam)

cp = ModelCheckpoint(filepath="lstm_classifier.h5",
                     save_best_only=True,
                     verbose=0)
tb = TensorBoard(log_dir='./logs',
                 histogram_freq=0,
                 write_graph=True,
                 write_images=True)

lstm_model_history = lstm_model.fit(X_train, X_train,
                                    epochs=epochs,
                                    batch_size=batch,
                                    shuffle=False,
                                    verbose=1,
                                    validation_data=(X_valid, X_valid),
                                    callbacks=[cp, tb]).history
.........................
# Reconstruction error per test sample
test_x_predictions = lstm_model.predict(X_test)
mse = np.mean(np.power(preprocess_data.flatten(X_test) - preprocess_data.flatten(test_x_predictions), 2), axis=1)
error_df = pd.DataFrame({'Reconstruction_error': mse,
                         'True_class': y_test})

# Confusion Matrix
pred_y = [1 if e > threshold else 0 for e in error_df.Reconstruction_error.values]
conf_matrix = confusion_matrix(error_df.True_class, pred_y)
plt.figure(figsize=(5, 5))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Please suggest what can be done in the model to improve the accuracy.
If your model is not performing well on the test set, I would check a few things:
The training set is not contaminated with anomalies or any information from the test set. If you use scaling, make sure you did not fit the scaler on the training and test sets combined.
Based on my experience, if an autoencoder has a low training loss but cannot discriminate well enough on the test data, and your training set is pure, it means the autoencoder learned the fine details of the training set but not the generalized idea.
Your threshold value might be off, and you may need to come up with a better thresholding procedure. One example can be found here: https://dl.acm.org/citation.cfm?doid=3219819.3219845
If the problem is the second one, the solution is to increase generalization. With autoencoders, one of the most effective generalization tools is the size of the bottleneck. Based on my experience with anomaly detection in flight radar data, lowering the bottleneck dimension significantly increased my multi-class classification accuracy: I was using 14 features with an encoding_dim of 7, but an encoding_dim of 4 gave even better results. The value of the training loss was not important in my case because I was only comparing reconstruction errors, but since you are making a classification with a threshold on the reconstruction error, a more robust thresholding procedure may improve accuracy, just as in the paper I've shared. A simple percentile-based version is sketched below.
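As a minimal sketch of the thresholding side, here is a simple percentile rule on the reconstruction errors of normal-only validation data (not the procedure from the paper; the 99th percentile is an arbitrary starting point):
import numpy as np

# Errors on normal-only validation windows
valid_predictions = lstm_model.predict(X_valid)
mse_valid = np.mean(np.power(preprocess_data.flatten(X_valid)
                             - preprocess_data.flatten(valid_predictions), 2), axis=1)

# Flag anything whose error exceeds the 99th percentile of normal errors.
threshold = np.percentile(mse_valid, 99)
pred_y = [1 if e > threshold else 0 for e in mse]  # mse = test errors from the snippet above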

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I ran into a problem regarding scaling of future, unseen (much higher) stock close values.
Let's say I use random forest regression to predict the stock price. I split the data into a train set and a test set.
For the train set, I use StandardScaler and do fit and transform,
and then I fit the regressor.
For the test set, I use StandardScaler and do transform,
and then I use the regressor to predict and compare against the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out at a ceiling. The problem is that the StandardScaler was fit on the train set, the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these out-of-range data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

def test(X, y):
    # split the data (no shuffle: keep the time order)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)

    # preprocessing: fit_transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train labels
    model.fit(X_train, y_train)

    # transform on test data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))

    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I get when plotting the predicted and actual close price against time:
The problem mainly comes from the model you are using. A RandomForest regressor is built from decision trees, which learn to map the inputs to the outputs seen in the training set. Consequently, a RandomForest regressor works well for values inside the training range, but for extreme values it has never seen during training it will of course behave as your picture shows: tree-based models cannot predict beyond the range of the training targets.
What you want is to learn a function directly, using linear/polynomial regression or more advanced algorithms such as ARIMA; a small illustration of the extrapolation issue follows.
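A minimal sketch of that extrapolation issue on synthetic trending data (purely illustrative, not the poster's dataset):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic upward-trending "price": train on the first 80%, test on the rest.
t = np.arange(200).reshape(-1, 1)
price = 100 + 0.5 * t.ravel() + np.random.normal(0, 2, 200)
t_train, t_test = t[:160], t[160:]
p_train, p_test = price[:160], price[160:]

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(t_train, p_train)
lin = LinearRegression().fit(t_train, p_train)

# The forest's predictions plateau near the maximum seen during training,
# while the linear model keeps following the trend.
print('max train price:       %.1f' % p_train.max())
print('random forest on test: %.1f (max prediction)' % rf.predict(t_test).max())
print('linear model on test:  %.1f (max prediction)' % lin.predict(t_test).max())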

VGG bottleneck features + LSTM in keras

I have pre-stored bottleneck features (.npy files) obtained from VGG16 for around 10k images. Training an SVM classifier (3-class classification) on these features gave me an accuracy of 90% on the test set. The images are obtained from videos. I want to train an LSTM in Keras on top of these features. My code snippet is below. The issue is that the training accuracy does not go above 43%, which is unexpected. Please help me debug the issue. I have tried different learning rates.
# Assume all other necessary setup is done
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TimeDistributed, Flatten, LSTM, Dense
from tensorflow.keras.optimizers import Adam

classes = 3
frames = 5
channels = 3
img_height = 224
img_width = 224
epochs = 20

# Model definition
model = Sequential()
model.add(TimeDistributed(Flatten(), input_shape=(frames, 7, 7, 512)))
model.add(LSTM(256, return_sequences=False))
model.add(Dense(1024, activation="relu"))
model.add(Dense(3, activation="softmax"))

optimizer = Adam(lr=0.1, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.summary()

train_data = np.load(open('bottleneck_features_train.npy', 'rb'))
# final_img_data shape --> (2342, 5, 7, 7, 512)
# one_hot_labels shape --> (2342, 3)
model.fit(final_img_data, one_hot_labels, epochs=epochs, batch_size=2)
You are probably missing the local minimum because the learning rate is too high. Try decreasing the learning rate to 0.01 -- 0.001 and increasing the number of epochs. Also, halve the Dense layer from 1024 neurons to 512; otherwise you may overfit. Something along the lines of the sketch below.
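A minimal sketch of those changes applied to the model from the question (the learning rate and epoch count are starting points to tune, not prescriptions):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TimeDistributed, Flatten, LSTM, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(TimeDistributed(Flatten(), input_shape=(frames, 7, 7, 512)))
model.add(LSTM(256, return_sequences=False))
model.add(Dense(512, activation="relu"))  # halved from 1024
model.add(Dense(classes, activation="softmax"))

model.compile(loss="categorical_crossentropy",
              optimizer=Adam(lr=0.001),  # lowered from 0.1
              metrics=["accuracy"])
model.fit(final_img_data, one_hot_labels, epochs=60, batch_size=2)  # more epochs than 20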
