Why do I get three different MSE values? - Keras

I wrote an MLP and want to start tuning it to get the best results, but I'm stuck on several different MSE values.
from pandas import read_csv
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics
import numpy
import joblib
# load dataset
#dataframe = read_csv("housing.csv", delim_whitespace=True, header=None)
dataframe = read_csv("100.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:6]
Y = dataset[:,6]
# define the model
def larger_model():
    # create model
    model = Sequential()
    model.add(Dense(20, input_dim=6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(50, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='linear'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae','mse'])
    return model
# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=larger_model, epochs=100, batch_size=5, verbose=1)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=2)
results = cross_val_score(pipeline, X, Y, cv=kfold)
pipeline.fit(X, Y)
prediction = pipeline.predict(X)
result_test = Y
print("%.2f (%.2f) MSE" % (results.mean(), results.std()))
print('Mean Absolute Error:', metrics.mean_absolute_error(prediction, result_test))
print('Mean Squared Error:', metrics.mean_squared_error(prediction, result_test))
This gives me the following result:
Epoch 98/100
200/200 [==============================] - 0s 904us/step - loss: 0.0086 - mae: 0.0669 - mse: 0.0086
Epoch 99/100
200/200 [==============================] - 0s 959us/step - loss: 0.0032 - mae: 0.0382 - mse: 0.0032
Epoch 100/100
200/200 [==============================] - 0s 894us/step - loss: 0.0973 - mae: 0.2052 - mse: 0.0973
200/200 [==============================] - 0s 600us/step
21.959478
-0.03 (0.02) MSE
Mean Absolute Error: 0.1959771416462339
Mean Squared Error: 0.0705598179059006
So I see three different MSE results here. Why is that, and which one should I keep in mind as the overall model score when I tune it?

Basically, what I understood is that if you print the results variable you will get 2 MSE values, because you used n_splits=2.
-0.03 (0.02) MSE
The output above is the mean (average) of the results (MSE) and the standard deviation of the results (MSE).
Epoch 100/100
200/200 [==============================] - 0s 894us/step - loss: 0.0973 - mae: 0.2052 - mse: 0.0973
The output above shows mse = 0.0973. I think this is for one split with n_splits=2, and it is trained on only 50% of the whole data (X), because the remaining 50% is taken as validation data.
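As a quick illustration of that 50/50 behaviour, here is a minimal sketch with a hypothetical 10-sample array (not the asker's data) showing that KFold(n_splits=2) fits each fold on one half of the data and scores it on the other half:
from sklearn.model_selection import KFold
import numpy as np

X_demo = np.arange(10)  # hypothetical toy data, only to show the split sizes
for train_idx, test_idx in KFold(n_splits=2).split(X_demo):
    # each fold fits on 5 samples and evaluates on the other 5
    print(len(train_idx), len(test_idx))  # -> 5 5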
Mean Squared Error: 0.0705598179059006
The output above comes from predicting on the whole data, not 50%, using the best model, so obviously you will get 3 different MSEs from the above 3 prints.
I am also solving a very similar kind of problem, so do one thing: divide the dataset into train and test sets, use the train data for training, and when you predict, use the test dataset, then calculate the MSE on the test data. Or else keep this as it is and take Mean Squared Error: 0.0705598179059006 as your final MSE.
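For example, a minimal sketch of that suggestion using sklearn's train_test_split (the 80/20 split ratio and the random_state are just assumptions, pick whatever suits your data):
from sklearn.model_selection import train_test_split
from sklearn import metrics

# hold out a test set that the model never sees during training
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

pipeline.fit(X_train, Y_train)            # train only on the training split
test_pred = pipeline.predict(X_test)      # predict on unseen data
print('Test MSE:', metrics.mean_squared_error(Y_test, test_pred))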

Related

sklearn.model_selection.cross_val_score returns mean_absolute_percentage_error regardless of what I choose in the estimator

Please tell me if it's possible to change the scoring for sklearn.model_selection.cross_val_score. It returns the negative value of the mean_absolute_percentage_error regardless of my choice in the estimator. For example, if I select metrics=['mean_absolute_error'] it is reported during ANN training, but cross_val_score will still return the negative value of the mean_absolute_percentage_error.
My code is:
def create_network():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(X.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=['mean_absolute_error'])
    return model
from keras.wrappers.scikit_learn import KerasRegressor
neural_network = KerasRegressor(build_fn=create_network,
epochs=20,
batch_size=10,
verbose=1)
X=feature_normalization(X)[0]
from sklearn.model_selection import cross_val_score
scores = cross_val_score(neural_network, X, y, cv=4)
print ('Scores:',scores)
print ('Average score:',np.average(scores))
Results:
Epoch 20/20
380/380 [==============================] - 0s 82us/step - loss: 18.5750 - mean_absolute_error: 2.9808
126/126 [==============================] - 0s 124us/step
Scores: [-27.60875144 -15.73322312 -35.57359647 -21.53566427]
Average score: -25.112808823950544
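For what it's worth, cross_val_score accepts a scoring argument; below is a minimal sketch using scikit-learn's built-in scorer names (which are negated by convention so that higher is always better):
from sklearn.model_selection import cross_val_score

# request negative MAE explicitly instead of relying on the estimator's default score
scores = cross_val_score(neural_network, X, y, cv=4, scoring='neg_mean_absolute_error')
print('MAE per fold:', -scores)       # flip the sign to read them as plain MAE
print('Average MAE:', -scores.mean())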

Validation and Test accuracy at random performance, whereas Train accuracy very high

I am trying to build a classifier in TensorFlow 2.1 for CIFAR10, using ResNet50 pre-trained on ImageNet from keras.applications and then stacking a small FNN on top of it:
# Load ResNet50 pre-trained on imagenet
resn = applications.resnet50.ResNet50(weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling='avg', include_top=False)
# Load CIFAR10
(c10_train, c10_test), info = tfds.load(name='cifar10', split=['train', 'test'], with_info=True, as_supervised=True)
# Make sure all the layers are not trainable
for layer in resn.layers:
    layer.trainable = False
# Transfer Learning for CIFAR10: fine-tune the network by stacking a trainable FNN on top of ResNet
from tensorflow.keras import models, layers
def build_model():
    model = models.Sequential()
    # Feature extractor
    model.add(resn)
    # Small FNN
    model.add(layers.Dense(256, activation='relu'))
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  metrics=['accuracy'])
    return model
# Build the resulting net
resn50_c10 = build_model()
I am facing the following issue when it comes to validate or test the accuracy:
history = resn50_c10.fit_generator(c10_train.shuffle(1000).batch(BATCH_SIZE), validation_data=c10_test.batch(BATCH_SIZE), epochs=20)
Epoch 1/20
25/25 [==============================] - 113s 5s/step - loss: 0.9659 - accuracy: 0.6634 - val_loss: 2.8157 - val_accuracy: 0.1000
Epoch 2/20
25/25 [==============================] - 109s 4s/step - loss: 0.8908 - accuracy: 0.6920 - val_loss: 2.8165 - val_accuracy: 0.1094
Epoch 3/20
25/25 [==============================] - 116s 5s/step - loss: 0.8743 - accuracy: 0.7038 - val_loss: 2.7555 - val_accuracy: 0.1016
Epoch 4/20
25/25 [==============================] - 132s 5s/step - loss: 0.8319 - accuracy: 0.7166 - val_loss: 2.8398 - val_accuracy: 0.1013
Epoch 5/20
25/25 [==============================] - 132s 5s/step - loss: 0.7903 - accuracy: 0.7253 - val_loss: 2.8624 - val_accuracy: 0.1000
Epoch 6/20
25/25 [==============================] - 132s 5s/step - loss: 0.7697 - accuracy: 0.7325 - val_loss: 2.8409 - val_accuracy: 0.1000
Epoch 7/20
25/25 [==============================] - 132s 5s/step - loss: 0.7515 - accuracy: 0.7406 - val_loss: 2.7697 - val_accuracy: 0.1000
#... (same for the remaining epochs)
Although the model seems to learn adequately from the training split, both the accuracy and the loss on the validation set do not improve at all. What is causing this behavior?
I am ruling out overfitting, since I am applying Dropout and since the model never really improves on the test set.
What I have done so far:
Checked that the one-hot labelling is consistent across train and test
Tried different FNN configurations
Tried the method fit_generator instead of fit
Preprocessed the images and resized them with different input_shapes
and always experienced the same problem.
Any hint would be extremely appreciated.
The problem is likely due to loading data using tfds and then passing to Keras .fit
Try to load your data with
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
And then
fit(x=x_train, y=y_train, batch_size=BATCH_SIZE, epochs=20, verbose=1, callbacks=None, validation_split=0.2, validation_data=None, shuffle=True)
Apparently, the problem was caused solely by the use of ResNet50.
As a workaround, I downloaded and used other pre-trained deep networks such as keras.applications.vgg16.VGG16, keras.applications.densenet.DenseNet121 and the accuracy on the test set increased as expected.
UPDATE
The above part of this answer is just a palliative. In order to understand what is really happening and eventually use transfer learning properly with ResNet50, keep on reading.
The root cause appears to be found in how Keras handles the Batch Normalization layer:
During fine-tuning, if a Batch Normalization layer is frozen it uses the mini-batch statistics. I believe this is incorrect and it can lead to reduced accuracy especially when we use Transfer learning. A better approach in this case would be to use the values of the moving mean and variance.
As explained more in-depth here: https://github.com/keras-team/keras/pull/9965
Even though the correct approach has been implemented in TensorFlow 2, when we use tf.keras.applications we reference the TensorFlow 1.0 behavior for Batch Normalization. That's why we need to explicitly inject the reference to TensorFlow 2 by adding the argument layers=tf.keras.layers when loading the module. So in my case, the loading of ResNet50 becomes
resn = applications.resnet50.ResNet50(weights='imagenet', input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling='avg', include_top=False, layers=tf.keras.layers)
and that will do the trick.
Credits for the solution to #rpeloff: https://github.com/keras-team/keras/pull/9965#issuecomment-549126009

Compute MSE after a Keras model; prediction looks to be wrong - update: need to reshape the array first

I would like to compute R2 = 1 - residual_ss/y_ss after fitting a Keras model. I used model.predict() to compute residual_ss. However, residual_ss is much larger than y_ss, which results in a negative R2. Since residual_ss = n*mse and mse is also the loss function, the code below shows the computation of mse after fitting the model:
import keras
keras.__version__
from keras.datasets import boston_housing
import pandas as pd
import numpy as np
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std
from keras import models
from keras import layers
def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
model=build_model()
model.fit(train_data, train_targets, epochs=200, batch_size=32)
#try to get mse
y_pred = model.predict(train_data)
mse=np.mean((train_targets-y_pred)*(train_targets-y_pred))
print(mse)
Here are the last 3 epochs and the mse at the end:
Epoch 198/200
404/404 [=======] - 0s 17us/step - loss: 3.4695 - mean_absolute_error: 1.3338
Epoch 199/200
404/404 [=======] - 0s 22us/step - loss: 3.5412 - mean_absolute_error: 1.3260
Epoch 200/200
404/404 [=======] - 0s 20us/step - loss: 3.2775 - mean_absolute_error: 1.2858
162.25934358457062
I only use train_data and train_targets here. Why did I get an mse not even close to the loss (mse) reported in each epoch? It seems the prediction is not close to the target. Please help.
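As the updated title hints, this is most likely a shape/broadcasting issue rather than a training problem: model.predict returns an array of shape (404, 1) while train_targets has shape (404,), so the elementwise subtraction broadcasts to a (404, 404) matrix and the mean is taken over the wrong thing. A minimal sketch of the fix under that assumption (flattening the prediction first):
y_pred = model.predict(train_data).flatten()   # shape (404,) instead of (404, 1)
mse = np.mean((train_targets - y_pred) ** 2)   # now on the same scale as the training loss
r2 = 1 - np.sum((train_targets - y_pred) ** 2) / np.sum((train_targets - train_targets.mean()) ** 2)
print(mse, r2)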

Why is my validation accuracy much higher than my train accuracy, while the test accuracy is only 0.5?

I am doing some image classification using the inception_v3 model in Keras; however, my train accuracy is lower than my validation accuracy during the whole training process, and my validation accuracy is above 0.95 from the first epoch. I also find that the train loss is much higher than the validation loss. In the end, the test accuracy is 0.5, which is pretty bad.
At first my optimizer was Adam with a learning rate of 0.00001, and the result was bad. Then I changed it to SGD with a learning rate of 0.00001, which didn't change the bad result at all. I also tried increasing the learning rate to 0.1, but the test accuracy was still around 0.5.
import numpy as np
import pandas as pd
import keras
from keras import layers
from keras.applications.inception_v3 import preprocess_input
from keras.models import Model
from keras.layers.core import Dense
from keras.layers import GlobalAveragePooling2D
from keras.optimizers import Adam, SGD, RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.utils.np_utils import to_categorical
from keras.utils import plot_model
from keras.models import model_from_json
from sklearn.metrics import confusion_matrix
import itertools
import matplotlib.pyplot as plt
import math
import copy
import pydotplus
train_path = 'data/train'
valid_path = 'data/validation'
test_path = 'data/test'
top_model_weights_path = 'model_weigh.h5'
# number of epochs to train top model
epochs = 100
# batch size used by flow_from_directory and predict_generator
batch_size = 2
img_width, img_height = 299, 299
fc_size = 1024
nb_iv3_layers_to_freeze = 172
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                                   rotation_range=30,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)
# this is the augmentation configuration we will use for testing:
# only rescaling
valid_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                                   rotation_range=30,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)
train_batches = train_datagen.flow_from_directory(train_path,
                                                  target_size=(img_width, img_height),
                                                  classes=None,
                                                  class_mode='categorical',
                                                  batch_size=batch_size,
                                                  shuffle=True)
valid_batches = valid_datagen.flow_from_directory(valid_path,
                                                  target_size=(img_width, img_height),
                                                  classes=None,
                                                  class_mode='categorical',
                                                  batch_size=batch_size,
                                                  shuffle=True)
test_batches = ImageDataGenerator().flow_from_directory(test_path,
                                                         target_size=(img_width, img_height),
                                                         classes=None,
                                                         class_mode='categorical',
                                                         batch_size=batch_size,
                                                         shuffle=False)
nb_train_samples = len(train_batches.filenames)
# get the size of the training set
nb_classes_train = len(train_batches.class_indices)
# get the number of classes
predict_size_train = int(math.ceil(nb_train_samples / batch_size))
nb_valid_samples = len(valid_batches.filenames)
nb_classes_valid = len(valid_batches.class_indices)
predict_size_validation = int(math.ceil(nb_valid_samples / batch_size))
nb_test_samples = len(test_batches.filenames)
nb_classes_test = len(test_batches.class_indices)
predict_size_test = int(math.ceil(nb_test_samples / batch_size))
def add_new_last_layer(base_model, nb_classes):
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(fc_size, activation='relu')(x)
    pred = Dense(nb_classes, activation='softmax')(x)
    model = Model(input=base_model.input, output=pred)
    return model
# freeze base_model layer in order to get the bottleneck feature
def setup_to_transfer_learn(model, base_model):
    for layer in base_model.layers:
        layer.trainable = False
    model.compile(optimizer=Adam(lr=0.00001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
base_model = keras.applications.inception_v3.InceptionV3(weights='imagenet', include_top=False)
model = add_new_last_layer(base_model, nb_classes_train)
setup_to_transfer_learn(model, base_model)
model.summary()
train_labels = train_batches.classes
train_labels = to_categorical(train_labels, num_classes=nb_classes_train)
validation_labels = valid_batches.classes
validation_labels = to_categorical(validation_labels, num_classes=nb_classes_train)
history = model.fit_generator(train_batches,
epochs=epochs,
steps_per_epoch=nb_train_samples // batch_size,
validation_data=valid_batches,
validation_steps=nb_valid_samples // batch_size,
class_weight='auto')
# save model to json
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize model to HDF5
model.save_weights(top_model_weights_path)
print("Saved model to disk")
# model visualization
plot_model(model,
show_shapes=True,
show_layer_names=True,
to_file='model.png')
(eval_loss, eval_accuracy) = model.evaluate_generator(
valid_batches,
steps=nb_valid_samples // batch_size,
verbose=1)
print("[INFO] evaluate accuracy: {:.2f}%".format(eval_accuracy * 100))
print("[INFO] evaluate loss: {}".format(eval_loss))
test_batches.reset()
predictions = model.predict_generator(test_batches,
steps=nb_test_samples / batch_size,
verbose=0)
# print(predictions)
predicted_class_indices = np.argmax(predictions, axis=1)
# print(predicted_class_indices)
labels = train_batches.class_indices
labels = dict((v, k) for k, v in labels.items())
final_predictions = [labels[k] for k in predicted_class_indices]
# print(final_predictions)
# save as csv file
filenames = test_batches.filenames
results = pd.DataFrame({"Filename": filenames,
"Predictions": final_predictions})
results.to_csv("results.csv", index=False)
# evaluation test result
(test_loss, test_accuracy) = model.evaluate_generator(
test_batches,
steps=nb_train_samples // batch_size,
verbose=1)
print("[INFO] test accuracy: {:.2f}%".format(test_accuracy * 100))
print("[INFO] test loss: {}".format(test_loss))
Here is a brief summary of the training process:
Epoch 1/100
2000/2000 [==============================] - 146s 73ms/step - loss: 0.4941 - acc: 0.7465 - val_loss: 0.1612 - val_acc: 0.9770
Epoch 2/100
2000/2000 [==============================] - 140s 70ms/step - loss: 0.4505 - acc: 0.7725 - val_loss: 0.1394 - val_acc: 0.9765
Epoch 3/100
2000/2000 [==============================] - 139s 70ms/step - loss: 0.4505 - acc: 0.7605 - val_loss: 0.1643 - val_acc: 0.9560
......
Epoch 98/100
2000/2000 [==============================] - 141s 71ms/step - loss: 0.1348 - acc: 0.9467 - val_loss: 0.0639 - val_acc: 0.9820
Epoch 99/100
2000/2000 [==============================] - 140s 70ms/step - loss: 0.1495 - acc: 0.9365 - val_loss: 0.0780 - val_acc: 0.9770
Epoch 100/100
2000/2000 [==============================] - 138s 69ms/step - loss: 0.1401 - acc: 0.9458 - val_loss: 0.0471 - val_acc: 0.9890
Here is the result that I get:
[INFO] evaluate accuracy: 98.55%
[INFO] evaluate loss: 0.05201659869024259
2000/2000 [==============================] - 47s 23ms/step
[INFO] test accuracy: 51.70%
[INFO] test loss: 7.737395915810134
I hope someone can help me deal with this problem.
As the code is now, you're not freezing the layers of the model for transfer learning. In setup_to_transfer_learn you're freezing the layers in base_model and then compiling the new model (which contains layers from the base model), but you're not actually freezing anything on the new model. Just change setup_to_transfer_learn:
def setup_to_transfer_learn(model):
    for layer in model.layers[:-3]:  # since you added three new layers (which should not freeze)
        layer.trainable = False
    model.compile(optimizer=Adam(lr=0.00001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
Then call the function like this:
model = add_new_last_layer(base_model, nb_classes_train)
setup_to_transfer_learn(model)
You should see a large difference in the number of trainable parameters when calling model.summary()
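If you prefer to verify the freeze programmatically rather than eyeballing the summary, a small sketch using the Keras backend's count_params helper (call it before and after freezing):
import numpy as np
from keras import backend as K

trainable_count = int(np.sum([K.count_params(w) for w in model.trainable_weights]))
non_trainable_count = int(np.sum([K.count_params(w) for w in model.non_trainable_weights]))
print('Trainable params:', trainable_count)
print('Non-trainable params:', non_trainable_count)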
Finally, I solved the problem. I had forgotten to apply image preprocessing to my test data. After I added it, everything works fine.
I changed this:
test_batches = ImageDataGenerator().flow_from_directory(test_path,
target_size=(img_width, img_height),
classes=None,
class_mode='categorical',
batch_size=batch_size,
shuffle=False)
to this:
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
test_batches = test_datagen.flow_from_directory(test_path,
target_size=(img_width, img_height),
classes=None,
class_mode='categorical',
batch_size=batch_size,
shuffle=False)
And the test accuracy is 0.98, test loss is 0.06.
What actually happens is that when you train with preprocessing, the model actually learns on those preprocessed inputs, so the same preprocessing has to be applied at test time. One way to check whether your model is learning good features is to use Grad-CAM.

Why does the accuracy of a char-level CNN for text classification stay unchanged?

I misused binary cross-entropy with softmax and changed it to categorical cross-entropy. I also reviewed the details of the problem in my own answer below.
I am trying to use open-source data (sogou_news_csv, converted to pinyin using jieba) for text classification, following https://arxiv.org/abs/1502.01710 "Text Understanding from Scratch" by Xiang Zhang and Yann LeCun (mainly following the idea of using a character-level CNN, not the exact structure proposed in the paper).
I did the preprocessing by one-hot encoding according to an alphabet collection and filling everything not in the alphabet collection with 0s.
As a result, I got training data with the shape (450000, 1000, 70), i.e. (data_size, sequence_length, alphabet_size).
Then I fed the data into a CNN structure following http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/.
The problem is:
During training, the loss and accuracy barely change. I tried preprocessing the data again and tried different learning rate settings, but neither helped. So what went wrong?
Below is one-hot encoding:
import numpy as np
all_letters = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_##$%^&*~`+-=<>()[]{}\n"
n_letters = len(all_letters)
def letterToIndex(letter):
    """
    'c' -> 2
    """
    return all_letters.find(letter)
def sets2tensors(clean_train, n_letters=n_letters, MAX_SEQUENCE_LENGTH=1000):
    """
    From lists of cleaned passages to np.array with shape (len(train),
    max_sequence_length, len(dict))
    Arg:
        obviously
    """
    m = len(clean_train)
    x_data = np.zeros((m, MAX_SEQUENCE_LENGTH, n_letters))
    for ix in range(m):
        for no, letter in enumerate(clean_train[ix]):
            if no >= 1000:
                break
            letter_index = letterToIndex(letter)
            if letter != -1:
                x_data[ix][no][letter_index] = 1
            else:
                continue
    return x_data
This is the Model:
num_classes = 5
from keras.models import Sequential
from keras.layers import Activation, GlobalMaxPool1D, Merge, concatenate, Conv1D, Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD
submodels = []
for kw in (3, 4, 5):    # kernel sizes
    submodel = Sequential()
    submodel.add(Conv1D(32,
                        kw,
                        padding='valid',
                        activation='relu',
                        strides=1, input_shape=(1000, n_letters)))
    submodel.add(GlobalMaxPool1D())
    submodels.append(submodel)
big_model = Sequential()
big_model.add(Merge(submodels, mode="concat"))
big_model.add(Dense(64))
big_model.add(Dropout(0.5))
big_model.add(Activation('relu'))
big_model.add(Dense(num_classes))
big_model.add(Activation('softmax'))
print('Compiling model')
opt = SGD(lr=1e-6)  # tried different learning rate from 1e-6 to 1e-1
# changed from binary crossentropy to categorical_crossentropy
big_model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])
Some results
Train on 5000 samples, validate on 5000 samples
Epoch 1/5
5000/5000 [==============================] - 54s - loss: 0.5198 - acc: 0.7960 - val_loss: 0.5001 - val_acc: 0.8000
Epoch 2/5
5000/5000 [==============================] - 56s - loss: 0.5172 - acc: 0.7959 - val_loss: 0.5000 - val_acc: 0.8000
Epoch 3/5
5000/5000 [==============================] - 56s - loss: 0.5198 - acc: 0.7965 - val_loss: 0.5000 - val_acc: 0.8000
Epoch 4/5
5000/5000 [==============================] - 57s - loss: 0.5222 - acc: 0.7950 - val_loss: 0.4999 - val_acc: 0.8000
Epoch 5/5
5000/5000 [==============================] - 59s - loss: 0.5179 - acc: 0.7960 - val_loss: 0.4999 - val_acc: 0.8000
I found that the problem was that I accidentally used binary cross-entropy (which I had used for another dataset) with softmax, when it should have been categorical cross-entropy. Initially, I figured it was just a stupid bug, since I hadn't carefully checked the code and logic.
But then I realized I didn't really understand what was going on here. I mean, I know the difference between binary cross-entropy and categorical cross-entropy, but I didn't really understand the details of why softmax and binary cross-entropy can't be chained together.
Luckily, I found a very nice explanation here (I did not expect anyone would actually ask or answer this question):
https://www.reddit.com/r/MachineLearning/comments/39bo7k/can_softmax_be_used_with_cross_entropy/#cs2b4jx
Basically, what it says is that in the binary cross-entropy case, the loss function treats the two different values of a single bit as two different classes, e.g. 1 for A and 0 for B, whereas in the categorical cross-entropy case, the loss function takes a vector like [0,0,0,1,0] as a label, in which the value of each bit stands for the confidence or probability of the corresponding training example being that particular class.
Given the description above, when we apply binary cross-entropy to a softmax output, we are misusing the definition of what one bit means in that setting, so it makes no sense.
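A small numeric sketch of that point (a hypothetical 5-class label and prediction, computed with plain NumPy rather than the Keras loss functions): categorical cross-entropy only looks at the probability assigned to the true class, while binary cross-entropy scores each of the 5 output bits as an independent yes/no decision, so a softmax output that never ranks the correct class first can still get a deceptively low binary cross-entropy (and a deceptively high "binary" accuracy, which is plausibly what the ~0.80 acc above was showing):
import numpy as np

y_true = np.array([0, 0, 0, 1, 0])            # one-hot label, class 3 is correct
y_pred = np.array([0.3, 0.2, 0.2, 0.1, 0.2])  # softmax output that ranks class 3 last

cce = -np.sum(y_true * np.log(y_pred))        # categorical CE: only p(true class) matters
bce = -np.mean(y_true * np.log(y_pred)
               + (1 - y_true) * np.log(1 - y_pred))  # binary CE: averaged over all 5 bits
print(cce)  # ~2.30, clearly a bad prediction
print(bce)  # ~0.67, looks moderate because 4 of the 5 bits are "correctly" near 0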
You have set the SGD optimizer's learning rate to 0.000001 (opt = SGD(lr=1e-6)).
The default learning rate for SGD is 0.01:
keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
I suspect that 1e-6 is too small; try increasing it and/or try a different optimizer.
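For example, a minimal sketch keeping the same architecture (the 0.01 learning rate is just the Keras default and the 0.9 momentum is only a common starting point, not a tuned value):
from keras.optimizers import SGD

opt = SGD(lr=0.01, momentum=0.9)  # noticeably larger than 1e-6
big_model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])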
