Keras XOR not training - python-3.x

We have been trying for a while to get this to work. This is probably the easiest example to create, and so now we need help. We've been changing the number of epochs in the fit function; that gives us different results, but never anything good, and when we increase the epochs too much, the outputs always converge on 0.5.
#%%
import numpy
import tensorflow
from tensorflow import keras

inputValues = numpy.array([[0,0],[0,1],[1,0],[1,1]])
inputResults = numpy.array([[0],[1],[1],[0]])
print(inputValues)
print(inputResults)
#%%
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(2,)),
    keras.layers.Dense(units=2, activation="relu"),
    keras.layers.Dense(units=2, activation="softmax")
])
model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=tensorflow.optimizers.Adam(),
              metrics=['accuracy'])
model.fit(inputValues, inputResults, epochs=2500)
model.summary()
print(model.weights)
#%%
print(model.predict_proba(inputValues))
print("End of file.")
From my understanding of ANNs, we should have 2 inputs in the first layer, specifically for the XOR example, and two outputs for the output (either a 0 or a 1). I assume that since it is not required to say what these outputs are (0 or 1), TensorFlow deals with this automatically by comparing the results in the fit function? Lastly, we have tried with both a hidden layer (of 2) and without, and still don't seem to get any better results.
Could someone let us know what we have done wrong?

Your problem is essentially a binary classification problem, because the output can be either 0 or 1. For this you don't need two output neurons; one will do, with a sigmoid activation that returns a value close to 0 or close to 1 (sigmoid generally works well for binary classification, because its characteristic S-shape pushes values toward either end).
Another adjustment you need to make is to set the loss function to binary crossentropy; your choice, sparse categorical crossentropy, is suitable for classifications into more than 2 categories.
So the code that I tried is:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(2,)),
    keras.layers.Dense(units=4, activation="sigmoid"),
    keras.layers.Dense(units=1, activation="sigmoid")
])
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False),
              optimizer=tensorflow.optimizers.Adam(),
              metrics=['accuracy'])
model.fit(inputValues, inputResults, epochs=2500)
With these settings I got training accuracy to 1.0000. It took a while to get there, and I suppose that could be sped up by playing around with the learning rate, but it should be enough to get the job done.
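As a quick sanity check (assuming training converged as described above), the sigmoid outputs can be thresholded at 0.5:
# round the sigmoid probabilities to hard 0/1 predictions
predictions = (model.predict(inputValues) > 0.5).astype(int)
print(predictions)  # expected: [[0], [1], [1], [0]]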

Related

Using Keras like TensorFlow for GPU computing

I would like to know if Keras can be used as an interface to TensorFlow just for running computations on my GPU.
I tested TF directly on my GPU. But for ML purposes, I started using Keras, including the backend. I would find it 'comfortable' to do all my stuff in Keras instead of using two tools.
This is also a matter of curiosity.
I found some examples like this one:
http://christopher5106.github.io/deep/learning/2018/10/28/understand-batch-matrix-multiplication.html
However, this example does not actually run the calculation, nor does it take any input data. I duplicate the snippet here:
from keras import backend as K

a = K.ones((3,4))
b = K.ones((4,5))
c = K.dot(a, b)
print(c.shape)
I would simply like to know if I can get the result numbers from this snippet above, and how?
Thanks,
Michel
Keras doesn't have an eager mode like TensorFlow; it depends on models or functions with "placeholders" to receive and output data.
So it's a little more complicated than TensorFlow for basic calculations like this.
The most user-friendly solution is to create a dummy model with one Lambda layer. (Be careful with the first dimension: Keras insists on interpreting it as a batch dimension and requires that input and output have the same batch size.)
def your_function_here(inputs):
    #if you have more than one input tensor, they arrive as a list:
    input1, input2, input3 = inputs

    #if you don't have a batch, you should probably use a first
    #dimension of 1 and index it away:
    input1 = input1[0]

    #do your calculations here

    #if you used the batch_size=1 workaround above, add the dimension back:
    output = K.expand_dims(output, 0)
    return output
Create your model:
inputs = Input(input_shape)
#maybe inputs2 ....
outputs = Lambda(your_function_here)(list_of_inputs)
#maybe outputs2
model = Model(inputs, outputs)
And use it to predict the result:
print(model.predict(input_data))
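Putting this together for the matrix product from the question, here is a minimal sketch (assuming the standalone Keras with the TensorFlow 1.x backend, as in the snippets above):
import numpy as np
from keras import backend as K
from keras.layers import Input, Lambda
from keras.models import Model

def matmul(inputs):
    a, b = inputs
    # drop the batch dimension Keras adds, multiply, then restore it
    c = K.dot(a[0], b[0])
    return K.expand_dims(c, 0)

a_in = Input((3, 4))
b_in = Input((4, 5))
model = Model([a_in, b_in], Lambda(matmul)([a_in, b_in]))

# feed the "ones" matrices with a batch dimension of 1
result = model.predict([np.ones((1, 3, 4)), np.ones((1, 4, 5))])
print(result.shape)  # (1, 3, 5)
print(result[0])     # every entry is 4.0 (each dot product sums 4 ones)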

Why does my keras model return more metrics than I specified?

I defined a simple generative adversarial network that consists of a generator and a discriminator. The generator is compiled two times: The first time for non-adversarial training (without the discriminator extension), and the second one for adversarial training.
After I have built and compiled everything, I can ask my compiled models for their losses and metrics. This is what I get:
net.generator.loss -> 'mean_absolute_error'
net.generator.metrics -> []
net.combined.loss -> ['mean_absolute_error', 'binary_crossentropy']
net.combined.metrics -> []
So far so good, this seems to be plausible. But when I then use the train_on_batch method on net.generator or net.combined, the format of the returned loss does not match my expectations. I found out that I can check this by using model.metrics_names:
net.generator.metrics_names -> ['loss']
net.combined.metrics_names -> ['loss', 'sequential_15_loss', 'discriminator_loss']
My question is: why does my net.combined loss contain 3 elements instead of just the two I defined (loss=[generator_loss_fct, 'binary_crossentropy'])? I don't want it to be 3 elements long. Additionally, the first two are almost always the same, or at least very similar.
Does someone understand this? If so, please explain why it is like this and whether I did something wrong. :)
Thanks in advance!
# build and compile the generator
self.encoder = self._build_encoder(input_shape, encoder_filters, kernel_size, latent_size)
self.decoder = self._build_decoder(self.encoder.layers[-1].output_shape[1:], decoder_filters, kernel_size)
self.generator = Sequential([self.encoder, self.decoder])
# compile for non-adversarial training
self.generator.compile(loss=generator_loss_fct, optimizer=self.optimizer)
# get the inputs
masked_img = Input(self.input_shape, name='masked-image')
filled_img = self.generator(masked_img)
# build and compile the (global) discriminator
self.discriminator = self._build_discriminator(input_shape, discriminator_filters, kernel_size)
self.discriminator.compile(loss='binary_crossentropy', optimizer=self.optimizer, metrics=['accuracy'])
# let the discriminator judge the validity of the reconstruction
valid = self.discriminator(filled_img)
# we freeze the discriminator when training the generator
self.discriminator.trainable = False
# build and compile the combined adversarial model
self.combined = Model(masked_img, [filled_img, valid])
self.combined.compile(loss=[generator_loss_fct, 'binary_crossentropy'], loss_weights=[self.alpha, self.beta], optimizer=self.optimizer)
When you have a multi-output model, Keras reports the total (weighted) loss together with the loss corresponding to each output, which is why metrics_names has three entries.
Besides, if, as you say, the first two losses are nearly identical, the weighted contribution of your last loss is probably negligible.
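A minimal sketch (with made-up layer names) reproducing the three metrics_names entries:
from keras.layers import Input, Dense
from keras.models import Model

inp = Input((4,))
out_a = Dense(1, name='out_a')(inp)
out_b = Dense(1, activation='sigmoid', name='out_b')(inp)

m = Model(inp, [out_a, out_b])
m.compile(loss=['mae', 'binary_crossentropy'], optimizer='adam')
print(m.metrics_names)  # ['loss', 'out_a_loss', 'out_b_loss']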
If you are willing to train a GAN model you can take a look at this Keras example

Cross-validation in sklearn using a custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes, listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters, and the samples associated with listTest to evaluate the error in a cross-validation process (hold-out set approach).
However, I have not been able to find the correct way to pass this to the sklearn GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=custom_cv,
                           n_jobs=-1, verbose=0, scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X is the feature matrix and y is the vector of labels.
Example: suppose that I only have one hyperparameter, alpha, that belongs to the set {1, 2, 3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3, and choose the alpha that minimizes the error.
EDIT: Actual answer to the question. Try passing the cv argument a generator of the indices:
def index_gen(listTrain, listTest):
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest), n_jobs=-1,
                           verbose=0, scoring=errorType)
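For reference, here is a self-contained sketch with made-up data (note that a plain list with one (train, test) tuple also satisfies the "iterable of splits" requirement from the documentation):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
listTrain = np.arange(80)        # first 80 samples for fitting
listTest = np.arange(80, 100)    # last 20 samples for validation

grid_search = GridSearchCV(estimator=LogisticRegression(),
                           param_grid={'C': [1, 2, 3]},
                           cv=[(listTrain, listTest)])
grid_search.fit(X, y)
print(grid_search.best_params_)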
EDIT: Original answer, before the edits:
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you propose will effectively leak test set information into the training stage and give you an overestimate of the model's ability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           n_jobs=-1, verbose=0, scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain])
Now your training set will be split into 5 folds (you can choose the number here), the model will be trained on 4 of those folds with a specific set of hyperparameters, and tested on the fold that was left out. This is repeated 5 times, until all of your training examples have been part of a left-out set. The whole procedure is done for each hyperparameter setting you are testing (5x3 in this case).
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain], y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

Confused about how to implement time-distributed LSTM + LSTM

After much reading and diagramming, I think I've come up with a model that I can use as the foundation for more testing on which parameters and features I need to tweak. However, I am confused about how to implement the following test case (all numbers are orders of magnitude smaller than the final model, but I want to start small):
Input data: 5000x1 time series vector, split into 5 epochs of 1000x1
For each time step, 3 epochs worth of data will be put through 3 time-distributed copies of a bidirectional LSTM layer, and each of those will output a vector of 10x1 (10 features extracted), which will then be taken as the input for a second bidirectional LSTM layer.
For each time step, the first and last labels are ignored, but the center one is what is desired.
Here's what I've come up with, which does compile. However, looking at model.summary(), I think I'm missing the fact that I want the first LSTM to run on 3 of the input sequences for each output time step. What am I doing wrong?
model = Sequential()
model.add(TimeDistributed(Bidirectional(LSTM(11, return_sequences=True,
                                             recurrent_dropout=0.1,
                                             unit_forget_bias=True),
                                        input_shape=(3, 3, epoch_len),
                                        merge_mode='sum'),
                          input_shape=(n_epochs, 3, epoch_len)))
model.add(TimeDistributed(Dense(7)))
model.add(TimeDistributed(Flatten()))
model.add(Bidirectional(LSTM(12, return_sequences=True, recurrent_dropout=0.1, unit_forget_bias=True), merge_mode='sum'))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Since your question is a bit confusing, I'll make the following assumptions.
You have one time series of 5000 time steps, each step with one feature. Shape (1, 5000, 1)
The main part of the answer to your question: you want to run a "sliding window" case, with the size of the window equal to 3000 and the stride of the window equal to 1000.
You want the window size to be divided in 3 internal time series, each of these 3 series with 1000 steps, each step with only one feature. Each of these series enters the same LSTM as independent series (which is equivalent to having 3 copies of the LSTM) - Shape (slidingWindowSteps, 3, 1000, 1)
Important: from these 3 series, you want 3 outputs with no time dimension, each with 10 features. Shape (1, 3, 10). (Your image says 1x10, but your text says 10x1; I'm assuming the image is correct.)
You want these 3 outputs to be merged in a single sequence of 3 steps, shape (1,3,10)
You want the LSTM that processes this 3 step sequence to also return a 3 step sequence
Preparing for the sliding window case:
In a sliding window case, duplicating data is unavoidable. You first need to rework your input.
Taking the initial time series (1, 5000, 1), we need to split it properly into a batch containing samples with 3 groups of 1000. Here I do this for X only; you will have to do a similar thing for Y.
import numpy as np

numberOfOriginalSequences = 1
totalSteps = 5000
features = 1

#example of original input with 5000 steps
originalSeries = np.arange(
    numberOfOriginalSequences * totalSteps * features
).reshape((numberOfOriginalSequences, totalSteps, features))

windowSize = 3000
windowStride = 1000
totalWindowSteps = ((totalSteps - windowSize) // windowStride) + 1

#at first, let's keep these dimensions for better understanding
processedSequences = np.empty((numberOfOriginalSequences,
                               totalWindowSteps,
                               windowSize,
                               features))

for seq in range(numberOfOriginalSequences):
    for winStep in range(totalWindowSteps):
        start = winStep * windowStride
        end = start + windowSize
        processedSequences[seq, winStep, :, :] = originalSeries[seq, start:end, :]

#now we reshape the array to turn each window step into an independent sequence:
totalSamples = numberOfOriginalSequences * totalWindowSteps
groupsInWindow = windowSize // windowStride
processedSequences = processedSequences.reshape((totalSamples,
                                                 groupsInWindow,
                                                 windowStride,
                                                 features))
print(originalSeries)
print(processedSequences)
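With the numbers above, a quick sanity check on the resulting shapes gives:
print(originalSeries.shape)      # (1, 5000, 1)
print(processedSequences.shape)  # (3, 3, 1000, 1): 3 window samples, 3 groups of 1000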
Creating the model:
A few comments about your first added layer:
The model only takes into account one input_shape, and this shape is (groupsInWindow, windowStride, features). It should go in the outermost wrapper: the TimeDistributed.
You don't want to keep the 1000 time steps, you want only the 10 resulting features: return_sequences=False. (You can stack several LSTMs in this first stage if you want more layers; in that case, the earlier ones can keep the steps, and only the last one needs return_sequences=False.)
You want 10 features, so units=10
I'll use the functional API just to see the input shape in the summary, which helps with understanding things.
from keras.models import Model
from keras.layers import Input, LSTM, Bidirectional, TimeDistributed, Dense, Lambda

intermediateFeatures = 10

inputTensor = Input((groupsInWindow, windowStride, features))
out = TimeDistributed(
    Bidirectional(
        LSTM(intermediateFeatures,
             return_sequences=False,
             recurrent_dropout=0.1,
             unit_forget_bias=True),
        merge_mode='sum'))(inputTensor)
At this point, you have eliminated the 1000 time steps. Since we used return_sequences=False, there will be no need to flatten or things like that. The data is already shaped in the form (samples, groupsInWindow,intermediateFeatures). The Dense layer is also not necessary. But it wouldn't be "wrong" if you wanted to do it the way you did, as long as the final shape is the same.
arbitraryLSTMUnits = 12
n_classes = 17

out = Bidirectional(
    LSTM(arbitraryLSTMUnits,
         return_sequences=True,
         recurrent_dropout=0.1,
         unit_forget_bias=True),
    merge_mode='sum')(out)
out = TimeDistributed(Dense(n_classes, activation='softmax'))(out)
And if you're going to discard the borders, you can add this layer:
out = Lambda(lambda x: x[:,1,:])(out) #model.add(Lambda(lambda x: x[:,1,:]))
Completing the model:
model = Model(inputTensor,out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Here is how dimensions are flowing through this model.
The first dimension I put here (totalSamples) is shown as None in the model.summary().
Input: (totalSamples,groupsInWindow,windowStride,features)
The Time Distributed LSTM works like this:
TimeDistributed allows a 4th dimension, which is groupsInWindow.
This dimension will be kept.
The LSTM with return_sequences=False eliminates windowStride and changes the features (windowStride, the second-to-last dimension, sits in the time-steps position for this LSTM):
result: (totalSamples, groupsInWindow, intermediateFeatures)
The other LSTM, without TimeDistributed, does not have the 4th dimension. This way, groupsInWindow (the second-to-last) becomes the "time steps". But return_sequences=True does not eliminate the time steps the way the first LSTM did. Result: (totalSamples, groupsInWindow, arbitraryLSTMUnits)
The final Dense layer, because it's receiving a 3D input, will interpret the second dimension as if it were a TimeDistributed and leave it unchanged, applying itself only to the features dimension. Result: (totalSamples, groupsInWindow, n_classes)
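As a quick check of this flow, dummy data can be pushed through the model (a sketch; the commented shapes assume the variables defined above):
# the expected prediction shape depends on whether the border-discarding
# Lambda layer was added before building the model
dummyX = np.zeros((totalSamples, groupsInWindow, windowStride, features))
print(model.predict(dummyX).shape)
# (totalSamples, n_classes) with the Lambda layer, or
# (totalSamples, groupsInWindow, n_classes) without it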

How to use hidden layer activations to construct loss function and provide y_true during fitting in Keras?

Assume I have a model like this. M1 and M2 are two layers linking left and right sides of the model.
The example model: Red lines indicate backprop directions
During training, I hope M1 can learn a mapping from L2_left activation to L2_right activation. Similarly, M2 can learn a mapping from L3_right activation to L3_left activation.
The model also needs to learn the relationship between two inputs and the output.
Therefore, I should have three loss functions for M1, M2, and L3_left respectively.
I probably can use:
model.compile(optimizer='rmsprop',
              loss={'M1': 'mean_squared_error',
                    'M2': 'mean_squared_error',
                    'L3_left': 'mean_squared_error'})
But during training, we need to provide y_true, for example:
model.fit([input_1,input_2], y_true)
In this case, the y_true is the hidden layer activations and not from a dataset.
Is it possible to build this model and train it using its hidden layer activations?
If you have only one output, you must have only one loss function.
If you want three loss functions, you must have three outputs, and, of course, three Y vectors for training.
If you want loss functions in the middle of the model, you must take outputs from those layers.
Creating the graph of your model: (if the model is already defined, see the end of this answer)
#Here, all "SomeLayer(blabla)" could be replaced by a "SomeModel" if necessary
#Example of using a layer or a model:
#M1 = SomeLayer(blablabla)(L12)
#M1 = SomeModel(L12)
from keras.models import Model
from keras.layers import *
inLef = Input((shape1))
inRig = Input((shape2))
L1Lef = SomeLayer(blabla)(inLef)
L2Lef = SomeLayer(blabla)(L1Lef)
M1 = SomeLayer(blablaa)(L2Lef) #this is an output
L1Rig = SomeLayer(balbla)(inRig)
conc2Rig = Concatenate(axis=?)([L1Rig,M1]) #Or Add, or Multiply, however you're joining the models
L2Rig = SomeLayer(nlanlab)(conc2Rig)
L3Rig = SomeLayer(najaljd)(L2Rig)
M2 = SomeLayer(babkaa)(L3Rig) #this is an output
conc3Lef = Concatenate(axis=?)([L2Lef,M2])
L3Lef = SomeLayer(blabla)(conc3Lef) #this is an output
Creating your model with three outputs:
Now that you've got your graph ready and you know what the outputs are, you create the model:
model = Model([inLef,inRig], [M1,M2,L3Lef])
model.compile(loss='mse', optimizer='rmsprop')
If you want different losses for each output, then you create a list:
#example of a custom loss function, if necessary
def lossM1(yTrue, yPred):
    return keras.backend.sum(keras.backend.abs(yTrue - yPred))

#compiling with three different loss functions
model.compile(loss=[lossM1, 'mse', 'binary_crossentropy'], optimizer=??)
But you've got to have three different yTraining too, for training with:
model.fit([input_1,input_2], [yTrainM1,yTrainM2,y_true], ....)
If your model is already defined and you didn't create its graph like I did:
Then you have to find in yourModel.layers[i] which ones are M1 and M2, and create a new model like this:
M1 = yourModel.layers[indexForM1].output
M2 = yourModel.layers[indexForM2].output
newModel = Model([inLef,inRig], [M1,M2,yourModel.output])
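To locate those indexes, a quick inspection loop helps (a sketch, assuming yourModel from above):
# print index, name and output shape of every layer to spot M1 and M2
for i, layer in enumerate(yourModel.layers):
    print(i, layer.name, layer.output_shape)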
If you want the two outputs to be equal:
In this case, just subtract the two outputs in a Lambda layer, and make that Lambda layer an output of your model, with expected values = 0.
Using the exact same vars as before, we'll just create two additional layers to subtract the outputs:
diffM1L1Rig = Lambda(lambda x: x[0] - x[1])([L1Rig,M1])
diffM2L2Lef = Lambda(lambda x: x[0] - x[1])([L2Lef,M2])
Now your model should be:
newModel = Model([inLef,inRig], [diffM1L1Rig, diffM2L2Lef, L3Lef])
And training will expect those two differences to be zero:
yM1 = np.zeros((shapeOfM1Output))
yM2 = np.zeros((shapeOfM2Output))
newModel.fit([input_1, input_2], [yM1, yM2, y_true], ...)
Trying to answer the last part: how to make gradients affect only one side of the model.
...well... at first that sounds unfeasible to me. But if it amounts to "train only a part of the model", then it's totally OK: define models that only go up to a certain point and make part of the layers untrainable.
By doing that, nothing will affect those layers. If that's what you want, you can do it:
#using the previous vars to define other models
modelM1 = Model([inLef,inRig],diffM1L1Rig)
The model above ends in diffM1L1Rig. Before compiling, you must make L2Rig untrainable:
modelM1.layers[??].trainable = False
#to find which layer is the right one, you can define layers with the "name" parameter,
#or inspect the shapes, types etc. in modelM1.summary()
modelM1.compile(.....)
modelM1.fit([input_1, input_2], yM1)
This suggestion makes you train only a single part of the model. You can repeat the procedure for M2, locking the layers you need before compiling.
You can also define a full model containing all the layers and lock only the ones you want. But you won't be able (I think) to make half the gradients pass through one side and half through the other.
So I suggest you keep three models, the full model, modelM1, and modelM2, and cycle through them in training. One epoch each, maybe...
That should be tested...
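A rough sketch of that cycling scheme (untested, as said; modelM2 is assumed to be built the same way as modelM1, and total_epochs is a made-up name):
# alternate one epoch per model; the layers are shared, so each fit
# updates the same underlying weights (minus whatever was frozen)
for epoch in range(total_epochs):
    newModel.fit([input_1, input_2], [yM1, yM2, y_true], epochs=1)
    modelM1.fit([input_1, input_2], yM1, epochs=1)
    modelM2.fit([input_1, input_2], yM2, epochs=1)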
