LSTM with classification - keras

Is it possible to use an LSTM together with an array of words that I've classified?
For example, I have an array with 1000 words:
'Green'
'Blue'
'Red'
'Yellow'
I classify the words as Green = 0, Blue = 1, Red = 2, Yellow = 3.
And I want to predict the 4th word. The words can come in different orders in the sequence: for example, the first sequence can be input = green, blue, red, target = yellow; the next sequence is input = blue, red, yellow, target = green; and so on.
Maybe I shouldn't use an LSTM for this, but I guess I should, since I want to look at the 3 earlier inputs and predict the 4th.
This is what I have so far. I'm more or less stuck with the reshape of my words list, and I can't really figure out what input_shape I should have. I guess it's timesteps = 3 and features = 4.
# define documents
words = [0,1,2,3,2,3,1,0,0,1,2,3,2,0,3,1,1,2,3,0]
words_cat = to_categorical(words,4)
X_train = ?
y_train = ?
# define the model
model = Sequential()
model.add(LSTM(32, input_shape=(3,4)))
model.add(Dense(4, activation='softmax'))
# compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(X_train, y_train, epochs=50, verbose=0)

As already mentioned in the first comment, an LSTM network may be a bit of overkill in this case. But I assume you are doing this for pedagogic reasons.
Here's a working example:
# define documents
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

words = [0,1,2,3,2,3,1,0,0,1,2,3,2,0,3,1,1,2,3,0]
# create labels
labels = np.roll(words[:-3], -3)
X_train = np.array([words[i:(i+3)%len(words)] for i in range(len(words)-3)]).reshape(-1,1,3)
y_train = labels
# define the model
model = Sequential()
model.add(LSTM(32, input_shape=(None,3)))
model.add(Dense(4, activation='softmax'))
# compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(X_train, y_train, epochs=5, batch_size=1, verbose=1)
preds = model.predict(X_train).argmax(1)
print(preds)
print(y_train)
Output:
Epoch 1/5
17/17 [==============================] - 2s 88ms/step - loss: 1.3771 - accuracy: 0.1765
Epoch 2/5
17/17 [==============================] - 0s 9ms/step - loss: 1.3647 - accuracy: 0.3529
Epoch 3/5
17/17 [==============================] - 0s 6ms/step - loss: 1.3568 - accuracy: 0.2353
Epoch 4/5
17/17 [==============================] - 0s 8ms/step - loss: 1.3496 - accuracy: 0.2353
Epoch 5/5
17/17 [==============================] - 0s 7ms/step - loss: 1.3420 - accuracy: 0.4118
[1 2 1 2 0 0 0 1 1 2 1 0 2 1 1 1 2]
[3 2 3 1 0 0 1 2 3 2 0 3 1 1 0 1 2]
So I took the words you provided and reshaped them. The first three entries are the series to train on and the fourth entry is the label.
If your sequence is random, the model will have a hard time predicting the next value. Otherwise you might want to train longer or provide more examples (however, the number of combinations in this case is fairly limited).
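If you would rather keep the one-hot encoding and the input_shape=(3, 4) idea from the question, here is a minimal sketch along those lines (assuming TensorFlow 2.x Keras; the windowing uses the word that actually follows each window as the integer label):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

words = [0,1,2,3,2,3,1,0,0,1,2,3,2,0,3,1,1,2,3,0]
words_cat = to_categorical(words, 4)                                   # shape (20, 4)

# sliding windows: 3 one-hot timesteps as input, the 4th word as integer label
X_train = np.array([words_cat[i:i+3] for i in range(len(words) - 3)])  # (17, 3, 4)
y_train = np.array(words[3:])                                          # (17,)

model = Sequential()
model.add(LSTM(32, input_shape=(3, 4)))   # timesteps=3, features=4
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, verbose=0)
Because the labels stay as integers, sparse_categorical_crossentropy is the matching loss here; with one-hot labels you would use categorical_crossentropy instead.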

Related

Zero validation loss and validation accuracy at classification problem

I'm running a multiclass classification problem using the below resnet model:
resnet = tf.keras.applications.ResNet50(
    include_top=False,
    weights='imagenet',
    input_shape=(96, 96, 3),
    pooling="avg"
)
for layer in resnet.layers:
    layer.trainable = True

model_resnet = tf.keras.Sequential()
model_resnet.add(resnet)
model_resnet.add(tf.keras.layers.Flatten())
model_resnet.add(tf.keras.layers.Dense(8, activation='softmax', name='output'))
model_resnet.compile(loss="sparse_categorical_crossentropy", optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), metrics=['accuracy'])
I also used a train and a test generator as below:
train_generator=img_gen.flow_from_dataframe(dataframe=train_dataset,x_col="file_loc",y_col='expr',target_size=(96, 96),batch_size=91,class_mode="raw")
test_generator=img_gen.flow_from_dataframe(dataframe=test_dataset,x_col="file_loc",target_size=(96, 96),batch_size=93,y_col=None,shuffle=False,class_mode=None)
When I run the code below I get the results I want and everything works fine:
model_resnet.fit_generator(train_generator,
                           steps_per_epoch=STEP_SIZE_TRAIN_resnet,
                           epochs=20
                           )
I wanted to compute the validation accuracy of every epoch so I wrote something like this
model_path = f"/content/weights" + "{val_accuracy:.4f}.hdf5"
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    model_path,
    monitor='val_accuracy',
    save_best_only=True,
    mode='max',
    verbose=1
)
history = model_resnet.fit_generator(
    train_generator,
    epochs=5,
    steps_per_epoch=STEP_SIZE_TRAIN_resnet,
    validation_data=test_generator,
    validation_steps=STEP_SIZE_TEST_resnet,
    max_queue_size=1,
    shuffle=True,
    callbacks=[checkpoint],
    verbose=1
)
The problem is that for every epoch the validation loss and validation accuracy remain zero, even though the training loss and accuracy change. I ran this code for over 20 epochs and it doesn't change at all. I can't find what I'm doing wrong, since without the validation part it works perfectly. Does anyone have any idea?
Epoch 1: val_accuracy improved from -inf to 0.00000, saving model to /content/weights0.0000.hdf5
500/500 [==============================] - 30s 60ms/step - loss: 1.0213 - accuracy: 0.6546 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/5
500/500 [==============================] - ETA: 0s - loss: 0.9644 - accuracy: 0.6672
Epoch 2: val_accuracy did not improve from 0.00000
500/500 [==============================] - 29s 58ms/step - loss: 0.9644 - accuracy: 0.6672 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Edit: I didn't specify the test labels of the test dataset because I compute the accuracy score afterwards, as below:
y_pred = model_resnet.predict(test_generator)
y_pred_max = np.argmax(y_pred, axis=1)
y_true = test_dataset["expr"].to_numpy()
print("accuracy",accuracy_score(y_true, y_pred_max))
I changed the test_generator as below:
test_generator=img_gen.flow_from_dataframe(dataframe=test_dataset,x_col="file_loc",target_size=(96, 96),batch_size=93,y_col='expr',shuffle=False,class_mode=None)
but nothing has changed, it still results in zero
As @Dr.Snoopy said, the problems were that I didn't specify the test labels in the test generator (they are required to compute validation accuracy) and that I had different class modes in the two generators; the correct setting was class_mode="raw" in both.
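For completeness, a sketch of what the corrected validation generator might look like (assuming the same dataframe columns as in the question):
# hypothetical corrected generator: provide labels and match the training class_mode
test_generator = img_gen.flow_from_dataframe(
    dataframe=test_dataset,
    x_col="file_loc",
    y_col="expr",            # labels are needed so Keras can compute val_loss/val_accuracy
    target_size=(96, 96),
    batch_size=93,
    shuffle=False,
    class_mode="raw"         # same class_mode as the training generator
)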

Nearly Constant training and validation accuracy

I'm new to PyTorch and my problem may be a little naive.
I'm training a pretrained VGG16 network on my dataset, which contains about 33000 images in 8 classes with labels [1,2,…,8], and the classes are imbalanced. My problem is that during training, the validation and training accuracy are low and don't increase. Is there any problem in my code?
If not, what do you suggest to improve training?
import torch
import time
import torch.nn as nn
import numpy as np
from sklearn.model_selection import train_test_split
from torch.optim import Adam
import cv2
import torchvision.models as models
from classify_dataset import Classification_dataset
from torchvision import transforms
transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.RandomHorizontalFlip(p=0.5),
                                transforms.RandomVerticalFlip(p=0.5),
                                transforms.RandomRotation(degrees=45),
                                transforms.ToTensor(),
                                transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
                                ])
dataset = Classification_dataset(root_dir=r'//home/arisa/Desktop/Hamid/IQA/Hamid_Dataset',
                                 csv_file=r'/home/arisa/Desktop/Hamid/IQA/new_label.csv', transform=transform)
target = dataset.labels - 1
train_indices, test_indices = train_test_split(np.arange(target.shape[0]), stratify=target)
test_dataset = torch.utils.data.Subset(dataset, indices=test_indices)
train_dataset = torch.utils.data.Subset(dataset, indices=train_indices)
class_sample_count = np.array([len(np.where(target[train_indices] == t)[0]) for t in np.unique(target)])
weight = 1. / class_sample_count
samples_weight = np.array([weight[t] for t in target[train_indices]])
samples_weight = torch.from_numpy(samples_weight)
samples_weight = samples_weight.double()
sampler = torch.utils.data.WeightedRandomSampler(samples_weight, len(samples_weight), replacement = True)
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=64,
                                           sampler=sampler)
test_loader = torch.utils.data.DataLoader(test_dataset,
                                          batch_size=64,
                                          shuffle=False)
model = models.vgg16(pretrained=True)   # pretrained VGG16, as described in the question
for param in model.parameters():
    param.requires_grad = False
num_ftrs = model.classifier[0].in_features
model.classifier = nn.Linear(num_ftrs, 8)
optimizer = Adam(model.parameters(), lr = 0.0001 )
criterion = nn.CrossEntropyLoss()
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.01)
path = '/home/arisa/Desktop/Hamid/IQA/'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
def train_model(model, train_loader, valid_loader, optimizer, criterion, scheduler=None, num_epochs=10):
    min_valid_loss = np.inf
    model.train()
    start = time.time()
    TrainLoss = []
    model = model.to(device)
    for epoch in range(num_epochs):
        total = 0
        correct = 0
        train_loss = 0
        #lr_scheduler.step()
        print('Epoch {}/{}'.format(epoch+1, num_epochs))
        print('-' * 10)
        train_loss = 0.0
        for x, y in train_loader:
            x = x.to(device)
            #print(y.shape)
            y = y.view(y.shape[0],).to(device)
            y = y.to(device)
            y -= 1
            out = model(x)
            loss = criterion(out, y)
            optimizer.zero_grad()
            loss.backward()
            TrainLoss.append(loss.item() * y.shape[0])
            train_loss += loss.item() * y.shape[0]
            _, predicted = torch.max(out.data, 1)
            total += y.size(0)
            correct += (predicted == y).sum().item()
            optimizer.step()
        lr_scheduler.step()
        accuracy = 100 * correct / total
        valid_loss = 0.0
        val_loss = []
        model.eval()
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for x_val, y_val in test_loader:
                x_val = x_val.to(device)
                y_val = y_val.view(y_val.shape[0],).to(device)
                y_val -= 1
                target = model(x_val)
                loss = criterion(target, y_val)
                valid_loss += loss.item() * y_val.shape[0]
                _, predicted = torch.max(target.data, 1)
                val_total += y_val.size(0)
                val_correct += (predicted == y_val).sum().item()
                val_loss.append(loss.item() * y_val.shape[0])
        val_acc = 100 * val_correct / val_total
        print(f'Epoch {epoch + 1} \t\t Training Loss: {train_loss / len(train_loader)} \t\t Validation Loss: {valid_loss / len(test_loader)} \t\t Train Acc:{accuracy} \t\t Validation Acc:{val_acc}')
        if min_valid_loss > (valid_loss / len(test_loader)):
            print(f'Validation Loss Decreased({min_valid_loss:.6f}--->{valid_loss / len(test_loader):.6f}) \t Saving The Model')
            min_valid_loss = valid_loss / len(test_loader)
            state = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict()}
            torch.save(state, '/home/arisa/Desktop/Hamid/IQA/checkpoint.t7')
    end = time.time()
    print('TRAIN TIME:')
    print('%.2gs' % (end - start))

train_model(model=model, train_loader=train_loader, optimizer=optimizer, criterion=criterion, valid_loader=test_loader, num_epochs=500)
Thanks in advance.
Here are the results of the first 15 epochs:
Epoch 1/500
----------
Epoch 1 Training Loss: 205.63448420514916 Validation Loss: 233.89266112356475 Train Acc:39.36360386127994 Validation Acc:24.142040038131555
Epoch 2/500
----------
Epoch 2 Training Loss: 199.05699240435197 Validation Loss: 235.08799531243065 Train Acc:41.90998291820601 Validation Acc:24.27311725452812
Epoch 3/500
----------
Epoch 3 Training Loss: 199.15626737127448 Validation Loss: 236.00033430619672 Train Acc:41.1035633416756 Validation Acc:23.677311725452814
Epoch 4/500
----------
Epoch 4 Training Loss: 199.02581041173886 Validation Loss: 233.60767459869385 Train Acc:41.86628530568466 Validation Acc:24.606768350810295
Epoch 5/500
----------
Epoch 5 Training Loss: 198.61493769454472 Validation Loss: 233.7503859202067 Train Acc:41.53656695665991 Validation Acc:25.0
Epoch 6/500
----------
Epoch 6 Training Loss: 198.71323942956585 Validation Loss: 234.17176149830675 Train Acc:41.639852222619474 Validation Acc:25.369399428026693
Epoch 7/500
----------
Epoch 7 Training Loss: 199.9395153770592 Validation Loss: 234.1744423635078 Train Acc:40.98041552456998 Validation Acc:24.84509056244042
Epoch 8/500
----------
Epoch 8 Training Loss: 199.3533399020355 Validation Loss: 235.4645173188412 Train Acc:41.26643626107337 Validation Acc:24.165872259294567
Epoch 9/500
----------
Epoch 9 Training Loss: 199.6451746921249 Validation Loss: 233.33387595956975 Train Acc:40.96452548365312 Validation Acc:24.59485224022879
Epoch 10/500
----------
Epoch 10 Training Loss: 197.9305159737011 Validation Loss: 233.76405122063377 Train Acc:41.8782028363723 Validation Acc:24.6186844613918
Epoch 11/500
----------
Epoch 11 Training Loss: 199.33247244055502 Validation Loss: 234.41085289463854 Train Acc:41.59218209986891 Validation Acc:25.119161105815063
Epoch 12/500
----------
Epoch 12 Training Loss: 199.87399289874256 Validation Loss: 234.23621463775635 Train Acc:41.028085647320545 Validation Acc:24.49952335557674
Epoch 13/500
----------
Epoch 13 Training Loss: 198.85540591944292 Validation Loss: 234.33149099349976 Train Acc:41.206848607635166 Validation Acc:24.857006673021925
Epoch 14/500
----------
Epoch 14 Training Loss: 199.92641723337513 Validation Loss: 233.37722391070741 Train Acc:41.15520597465539 Validation Acc:24.988083889418494
Epoch 15/500
----------
Epoch 15 Training Loss: 197.82172771698328 Validation Loss: 234.4943131533536 Train Acc:41.69943987605768 Validation Acc:24.380362249761678
You froze your model with
for param in model.parameters():
    param.requires_grad = False
which basically says "do not calculate any gradient for any weight", which is equivalent to not updating the weights at all, hence no optimization.
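A possible fix (a sketch, not from the original answer) is to freeze only the convolutional backbone and hand the optimizer just the parameters that still require gradients:
import torch.nn as nn
import torchvision.models as models
from torch.optim import Adam

model = models.vgg16(pretrained=True)
for param in model.features.parameters():      # freeze only the feature extractor
    param.requires_grad = False

num_ftrs = model.classifier[0].in_features
model.classifier = nn.Linear(num_ftrs, 8)      # new head, trainable by default

# give the optimizer only the parameters that will actually be updated
optimizer = Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)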
My problem was with model.train(). This call should be inside the training loop, but in my case I put it outside the loop, so once model.eval() was called the model stayed in evaluation mode for all later epochs.
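A minimal sketch of the intended placement (a toy model and toy loaders, only to illustrate the mode switching per epoch):
import torch
import torch.nn as nn

# toy setup just to illustrate the structure; shapes are arbitrary
model = nn.Linear(10, 8)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = [(torch.randn(4, 10), torch.randint(0, 8, (4,))) for _ in range(3)]
test_loader = [(torch.randn(4, 10), torch.randint(0, 8, (4,))) for _ in range(2)]

for epoch in range(2):
    model.train()                      # back to training mode at the start of every epoch
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()                       # evaluation mode only for the validation pass
    with torch.no_grad():
        for x_val, y_val in test_loader:
            val_loss = criterion(model(x_val), y_val)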

BERT Embeddings in Pytorch Embedding Layer

I'm working with word embeddings. I obtained the word embeddings using BERT.
I have data like this:
1992 regular unleaded 172 6 MANUAL all wheel drive 4 Luxury Midsize Sedan 21 16 3105 200
and as a label:
df['Make'] = df['Make'].replace(['Chrysler'],1)
I try to feed the embeddings to an LSTM as inputs.
I use the code below for BERT:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
For tokenizing, I use this code:
def tokenize_text(df, max_seq):
    return [
        tokenizer.encode(text, add_special_tokens=True)[:max_seq] for text in df
    ]

def pad_text(tokenized_text, max_seq):
    return np.array([el + [0] * (max_seq - len(el)) for el in tokenized_text])

def tokenize_and_pad_text(df, max_seq):
    tokenized_text = tokenize_text(df, max_seq)
    padded_text = pad_text(tokenized_text, max_seq)
    return padded_text
and:
train_indices = tokenize_and_pad_text(df_train, max_seq)
For the BERT embedding matrix:
def get_bert_embed_matrix():
    bert = transformers.BertModel.from_pretrained('bert-base-uncased')
    bert_embeddings = list(bert.children())[0]
    bert_word_embeddings = list(bert_embeddings.children())[0]
    mat = bert_word_embeddings.weight.data.numpy()
    return mat
embedding_matrix = get_bert_embed_matrix()
and the LSTM model:
embedding_layer = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                            weights=[embedding_matrix], input_length=max_seq_len, trainable=True)
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(128, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(1, activation='softmax'))
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
and this code for fitting the model:
model.fit(train_indices, y_train, epochs=20, verbose=1)
I get output like this:
Epoch 1/20
1/1 [==============================] - 3s 3s/step - loss: 0.0000e+00 - accuracy: 0.3704
Epoch 20/20
1/1 [==============================] - 0s 484ms/step - loss: 0.0000e+00 - accuracy: 0.3704
I don't know what I'm missing.
Firstly, what can we do about it?
Secondly, how can we implement this as a PyTorch model?
Thanks a lot.

How to extract cell state of LSTM model through model.fit()?

My LSTM model is like this, and I would like to get state_c
def _get_model(input_shape, latent_dim, num_classes):
    inputs = Input(shape=input_shape)
    lstm_lyr, state_h, state_c = LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = Dense(num_classes)(lstm_lyr)
    soft_lyr = Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

model = _get_model((n_steps_in, n_features), latent_dim, n_steps_out)
history = model.fit(X_train, Y_train)
But I cannot extract state_c from the history. How can I return it?
I am unsure of what you mean by "how to get state_c", because your LSTM layer already returns state_c thanks to the flag return_state=True. I assume you are trying to train a multi-output model in this case. Currently, you provide only a single set of targets, but your model is built with two outputs.
Here is how you work with multi-output models.
import numpy as np
from tensorflow.keras import layers, Model, utils

def _get_model(input_shape, latent_dim, num_classes):
    inputs = layers.Input(shape=input_shape)
    lstm_lyr, state_h, state_c = layers.LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = layers.Dense(num_classes)(lstm_lyr)
    soft_lyr = layers.Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_c])  #<------- One input, 2 outputs
    model.compile(optimizer='adam', loss='mse')
    return model

# Dummy data
X = np.random.random((100, 15, 5))
y1 = np.random.random((100, 4))
y2 = np.random.random((100, 7))

model = _get_model((15, 5), 7, 4)
model.fit(X, [y1, y2], epochs=4)  #<--------- One input, 2 outputs
Epoch 1/4
4/4 [==============================] - 2s 6ms/step - loss: 0.6978 - activation_9_loss: 0.2388 - lstm_9_loss: 0.4591
Epoch 2/4
4/4 [==============================] - 0s 6ms/step - loss: 0.6615 - activation_9_loss: 0.2367 - lstm_9_loss: 0.4248
Epoch 3/4
4/4 [==============================] - 0s 7ms/step - loss: 0.6349 - activation_9_loss: 0.2392 - lstm_9_loss: 0.3957
Epoch 4/4
4/4 [==============================] - 0s 8ms/step - loss: 0.6053 - activation_9_loss: 0.2392 - lstm_9_loss: 0.3661
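Once the model is fitted, the second output of predict is the cell state, so (continuing with the same dummy setup above) it can be retrieved directly:
soft_out, cell_state = model.predict(X)   # outputs come back in the order defined in Model(...)
print(cell_state.shape)                   # (100, 7): one latent_dim-sized state vector per sample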

Why acc of char-level CNN for text classification stays unchanged

Update: I misused binary cross-entropy with a softmax output and changed it to categorical cross-entropy. I also review some details of the problem in my own answer below.
I am trying to use open-source data (sogou_news_csv, converted to pinyin using jieba) for text classification, following https://arxiv.org/abs/1502.01710, "Text Understanding from Scratch" by Xiang Zhang and Yann LeCun (mainly following the idea of a character-level CNN, but not the exact structure proposed in the paper).
I did the preprocessing with one-hot encoding according to an alphabet collection, filling everything not in the alphabet collection with 0s.
As a result, I got training data with the shape (450000, 1000, 70), i.e. (data_size, sequence_length, alphabet_size).
Then I fed the data into a CNN structure following http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/.
Problem is
During training, the loss and accuracy barely change. I tried preprocessing the data again and tried different learning rate settings, but nothing helped. So what went wrong?
Below is the one-hot encoding:
import numpy as np

all_letters = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_##$%^&*~`+-=<>()[]{}\n"
n_letters = len(all_letters)

def letterToIndex(letter):
    """
    'c' -> 2
    """
    return all_letters.find(letter)

def sets2tensors(clean_train, n_letters=n_letters, MAX_SEQUENCE_LENGTH=1000):
    """
    From lists of cleaned passages to np.array with shape
    (len(train), max_sequence_length, len(dict))
    """
    m = len(clean_train)
    x_data = np.zeros((m, MAX_SEQUENCE_LENGTH, n_letters))
    for ix in range(m):
        for no, letter in enumerate(clean_train[ix]):
            if no >= MAX_SEQUENCE_LENGTH:
                break
            letter_index = letterToIndex(letter)
            if letter_index != -1:   # note: the original compared `letter` to -1, which is always true
                x_data[ix][no][letter_index] = 1
            else:
                continue
    return x_data
This is the Model:
num_classes = 5
from keras.models import Sequential
from keras.layers import Activation, GlobalMaxPool1D, Merge, concatenate, Conv1D, Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import SGD
submodels = []
for kw in (3, 4, 5): # kernel sizes
submodel = Sequential()
submodel.add(Conv1D(32,
kw,
padding='valid',
activation='relu',
strides=1, input_shape=(1000, n_letters)))
submodel.add(GlobalMaxPool1D())
submodels.append(submodel)
big_model = Sequential()
big_model.add(Merge(submodels, mode="concat"))
big_model.add(Dense(64))
big_model.add(Dropout(0.5))
big_model.add(Activation('relu'))
big_model.add(Dense(num_classes))
big_model.add(Activation('softmax'))
print('Compiling model')
opt = SGD(lr=1e-6) # tried different learning rate from 1e-6 to 1e-1
# changed from binary crossentropy to categorical_crossentropy
big_model.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Some results
Train on 5000 samples, validate on 5000 samples
Epoch 1/5
5000/5000 [==============================] - 54s - loss: 0.5198 - acc: 0.7960 - val_loss: 0.5001 - val_acc: 0.8000
Epoch 2/5
5000/5000 [==============================] - 56s - loss: 0.5172 - acc: 0.7959 - val_loss: 0.5000 - val_acc: 0.8000
Epoch 3/5
5000/5000 [==============================] - 56s - loss: 0.5198 - acc: 0.7965 - val_loss: 0.5000 - val_acc: 0.8000
Epoch 4/5
5000/5000 [==============================] - 57s - loss: 0.5222 - acc: 0.7950 - val_loss: 0.4999 - val_acc: 0.8000
Epoch 5/5
5000/5000 [==============================] - 59s - loss: 0.5179 - acc: 0.7960 - val_loss: 0.4999 - val_acc: 0.8000
I found that the problem was that I accidentally used binary cross-entropy (which I had used for another dataset) with softmax, where it should have been categorical cross-entropy. Initially, I figured it was just a stupid bug, since I didn't carefully check the code and logic.
But then I realized I didn't really understand what was going on here. I mean, I know the difference between binary cross-entropy and categorical cross-entropy, but I didn't really understand the details of why softmax and binary cross-entropy can't be chained together.
Luckily, I found a very nice explanation here (I did not expect anyone would actually ask or answer this question):
https://www.reddit.com/r/MachineLearning/comments/39bo7k/can_softmax_be_used_with_cross_entropy/#cs2b4jx
Basically what it says is that in the binary cross-entropy case, the loss function treats the two different values of a single bit as two different classes, like 1 for A and 0 for B, whereas in the categorical cross-entropy case, the loss function takes a vector like [0,0,0,1,0] as a label, in which the value of each bit stands for the confidence or probability of the corresponding training example being that particular class.
Given the description above, when we apply binary cross-entropy to a softmax output, we misuse the definition of what one bit means in that setting, and the result makes no sense.
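As a small illustration (toy numbers, not from the question), here is how the two losses read the same softmax output and one-hot label differently:
import numpy as np

y_true = np.array([0., 0., 0., 1., 0.])          # one-hot label for class 3
y_pred = np.array([0.1, 0.1, 0.1, 0.6, 0.1])     # softmax output

# categorical cross-entropy: one term, the log-probability of the true class
cce = -np.sum(y_true * np.log(y_pred))           # = -log(0.6), about 0.51

# binary cross-entropy: five independent "is this bit on?" terms, averaged
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(cce, bce)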
You have set the SGD optimizer's learning rate to 0.000001 (opt = SGD(lr=1e-6)).
The default learning rate for SGD is 0.01:
keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
I suspect that 1e-6 is too small; try increasing it and/or try a different optimizer.
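For example (a sketch, reusing the big_model from the question), one could start from the Keras default rate or switch to an adaptive optimizer:
from keras.optimizers import SGD, Adam

opt = SGD(lr=0.01, momentum=0.9)   # default lr is 0.01; momentum often helps
# opt = Adam()                     # or an adaptive optimizer
big_model.compile(loss='categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])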
