I have a fully implemented LSTM RNN in Keras, and I want to use gradient clipping with the gradient norm limited to 5 (I'm trying to reproduce a research paper). I'm quite a beginner at implementing neural networks; how would I implement this?
Is it just the following (I'm using the RMSprop optimizer)?
opt = optimizers.RMSprop(lr=0.01, clipnorm=5)
model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
According to the official documentation, any optimizer can take the optional arguments clipnorm and clipvalue. If clipnorm is provided, the gradients are rescaled (clipped) whenever their norm exceeds the given threshold.
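So the snippet above is essentially right. For reference, a minimal self-contained sketch (assuming the standalone Keras 2.x API; in tf.keras the argument is learning_rate rather than lr, and the layer shapes below are placeholders):

from keras import optimizers
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Toy stand-in for the existing LSTM model; shapes are placeholders.
model = Sequential([
    LSTM(32, input_shape=(10, 8)),
    Dense(5, activation='softmax'),
])

# clipnorm=5 rescales the gradients whenever their L2 norm exceeds 5.
opt = optimizers.RMSprop(lr=0.01, clipnorm=5.)
model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['accuracy'])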
The LinearRegression model from sklearn uses a closed-form solution (the normal equation) to find the parameters. However, with large datasets gradient descent is said to be more efficient. Is there any way to use LinearRegression from sklearn with gradient descent?
The class you are looking for is: sklearn.linear_model.SGDRegressor
You can modify the loss hyperparameter, which defines the loss function to be used.
Be aware that the SGD of SGDRegressor stands for Stochastic Gradient Descent, which means that the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (i.e. the learning rate).
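A minimal sketch of that (loss names assume scikit-learn >= 1.0; older versions use 'squared_loss'):

import numpy as np
from sklearn.linear_model import SGDRegressor

# Toy data: y = 3x + 2 plus a little noise.
rng = np.random.RandomState(0)
X = rng.rand(1000, 1)
y = 3 * X.ravel() + 2 + 0.1 * rng.randn(1000)

# 'squared_error' is the ordinary least-squares loss; 'huber' or
# 'epsilon_insensitive' select other linear models.
reg = SGDRegressor(loss='squared_error', learning_rate='invscaling',
                   eta0=0.01, max_iter=1000, tol=1e-3, random_state=0)
reg.fit(X, y)
print(reg.coef_, reg.intercept_)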
I have developed a Convolutional Neural Network using the TILDA image dataset which achieves over 90% accuracy with the following model. I trained it with 4 batches for 100 epochs.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(8, 2, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(16, 2, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 2, padding='same', activation='sigmoid',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(5, activation='softmax'),  # 5 output classes
])
Using the above model I could plot the following graphs for the training and validation accuracy.
Do you have any suggestions to increase the smoothness of these curves? What could be the possible reasons for getting such curves? I would appreciate your recommendations for improving this model.
The following may help in getting a smoother curve:
NEVER use dropout before the final layer. In your model, MaxPool + Dropout together discard 87.5% of the information flowing into the final layer. Avoid pooling as well, unless you need global or adaptive pooling to get a fixed-shape output; if you must pool, use a much larger number of kernels to compensate for the loss of information.
Use a lower learning rate. Judging from the training curve, the model is heading toward a minimum, but with several bumps along the way.
Are you using SGD without momentum? If so, introduce momentum. Also consider adaptive optimizers with built-in momentum, such as Adam.
Why the sigmoid in between? Sigmoid reduces the gradient magnitude and makes learning slower.
If you only care about the curve and are not restricted by the number of parameters, consider adding a few more layers and/or channels. A sketch applying these suggestions follows below.
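A sketch applying those suggestions to your model (hyperparameters are illustrative, assuming the same 30x30x1 inputs and 5 classes):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Input((30, 30, 1)),
    layers.Conv2D(16, 3, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(32, 3, padding='same', activation='relu',   # ReLU instead of sigmoid
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.Conv2D(64, 3, padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(0.01)),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling2D(),        # adaptive pooling for a fixed-shape output
    layers.Dense(5, activation='softmax'),  # no Dropout right before the output
])

# Adam has built-in momentum; a lower learning rate smooths the curve.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])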
To my understanding, batch (vanilla) gradient descent makes one parameter update using all of the training data. Stochastic gradient descent (SGD) updates the parameters for each training sample, helping the model converge faster at the cost of high fluctuation in the loss.
Batch (vanilla) gradient descent sets batch_size=corpus_size.
SGD sets batch_size=1.
And mini-batch gradient descent sets batch_size=k, in which k is usually 32, 64, 128...
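For concreteness, all three variants are the same update loop with a different batch_size; a plain NumPy sketch of linear regression (illustrative only, unrelated to gensim):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(1000)

def train(batch_size, lr=0.1, epochs=20):
    w = np.zeros(3)
    for _ in range(epochs):
        perm = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

w_batch = train(batch_size=len(X))   # batch (vanilla) gradient descent: one update per pass
w_sgd   = train(batch_size=1)        # SGD: one update per sample
w_mini  = train(batch_size=64)       # mini-batch gradient descent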
How does gensim apply SGD or mini-batch gradient descent? It seems that batch_words is the equivalent of batch_size, but I want to be sure.
Is setting batch_words=1 in gensim model equivalent to applying SGD?
No, batch_words in gensim refers to the size of work-chunks sent to worker threads.
The gensim Word2Vec class updates model parameters after each training micro-example of (context)->(target-word) (where context might be a single word, as in skip-gram, or the mean of several words, as in CBOW).
For example, you can review this optimized w2v_fast_sentence_sg_neg() Cython function for skip-gram with negative sampling, deep inside the Word2Vec training loop:
https://github.com/RaRe-Technologies/gensim/blob/460dc1cb9921817f71b40b412e11a6d413926472/gensim/models/word2vec_inner.pyx#L159
Observe that it is considering exactly one target-word (word_index parameter) and one context-word (word2_index), and updating both the word-vectors (aka 'projection layer' syn0) and the model's hidden-to-output weights (syn1neg) before it might be called again with a subsequent single (context)->(target-word) pair.
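To make the distinction concrete, a minimal sketch (assuming gensim 4.x parameter names; the corpus and values are made up). batch_words only changes the size of the job chunks handed to worker threads, not how often the model is updated:

from gensim.models import Word2Vec

sentences = [
    ['the', 'quick', 'brown', 'fox'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
] * 1000

# batch_words=1 would just make the work-chunks tiny (and training slow);
# parameter updates still happen per (context)->(target-word) micro-example.
model = Word2Vec(sentences, vector_size=50, window=3, sg=1,
                 workers=2, batch_words=10000, epochs=5)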
What is the difference between the SGD classifier and SGD regressor in Python's sklearn? Also, can we set a batch size in them for faster performance?
Well, it's in the name. SGDClassifier is a model that is optimized (trained) using SGD (taking the gradient of the loss one sample at a time and updating the model along the way) for classification problems. It can represent a variety of classification models (SVM, logistic regression, ...), selected via the loss parameter; by default it represents a linear SVM. SGDRegressor is a model that is optimized (trained) using SGD for regression tasks. It is basically a linear model that is updated along the way with a decaying learning rate.
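A minimal sketch of both (loss names assume a recent scikit-learn; older versions use 'log' and 'squared_loss'):

from sklearn.linear_model import SGDClassifier, SGDRegressor

# Classification: the loss parameter selects the model family.
svm_like    = SGDClassifier(loss='hinge')     # default: linear SVM
logreg_like = SGDClassifier(loss='log_loss')  # logistic regression

# Regression: a linear model fitted with SGD.
lin_reg = SGDRegressor(loss='squared_error')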
SGD (Stochastic Gradient Descent) is an optimization method used by machine learning algorithms and models to minimize the loss function.
The scikit-learn library provides the models SGDClassifier and SGDRegressor, whose names might lead you to think that SGD itself is a classifier or a regressor.
But that's not the case:
SGDClassifier - it is a classifier optimized by SGD
SGDRegressor - it is a regressor optimized by SGD.
Stochastic gradient descent (SGD) does not use a batch; it takes a single training example at a time, unlike (full-)batch gradient descent.
Example using sklearn's partial_fit:
import random
import numpy
from sklearn.linear_model import SGDClassifier

def batches(indices, n):
    # yield successive chunks of n indices (helper assumed by the original snippet)
    for start in range(0, len(indices), n):
        yield indices[start:start + n]

clf2 = SGDClassifier(loss='log_loss')  # logistic loss; use loss='log' on older scikit-learn. shuffle=True has no effect with partial_fit
shuffledRange = list(range(len(X)))    # X, Y are your training inputs and labels
n_iter = 5
for n in range(n_iter):
    random.shuffle(shuffledRange)
    shuffledX = [X[i] for i in shuffledRange]
    shuffledY = [Y[i] for i in shuffledRange]
    for batch in batches(range(len(shuffledX)), 10000):
        clf2.partial_fit(shuffledX[batch[0]:batch[-1] + 1],
                         shuffledY[batch[0]:batch[-1] + 1],
                         classes=numpy.unique(Y))
A classifier predicts which class some data belongs to:
this picture is a cat (not a dog)
A regressor predicts a continuous value, for example the probability that the data belongs to a class:
this picture is a cat with 99% probability
I have implemented an autoencoder using Keras. I understand that I can add an accuracy metric as follows:
autoencoder.compile(optimizer='adam',
                    loss='mean_squared_error',
                    metrics=['accuracy'])
My question is:
Is the accuracy metric applied to the last layer of the decoder by default? If so, how can I set it to use the representations from the middle (hidden) layer to compute accuracy? Do I need to define a custom metric? How would that work?
It seems that what you really want is a multiple output network.
So on top of your middle layer that defines your embedding, add a layer (or more) to do your classification.
Then have a look at Multiple outputs in Keras to create your global cost.
You may also want to start by training the autoencoder only, then train only the classifier's additional layers to see the performance. You can also balance the autoencoder's loss against the classifier's loss and train "both" networks at the same time.
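A sketch of such a multiple-output model (layer sizes and loss weights are made up; adapt them to your autoencoder):

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
encoded = layers.Dense(64, activation='relu', name='embedding')(inputs)

# Output 1: the usual reconstruction head of the autoencoder.
decoded = layers.Dense(784, activation='sigmoid', name='reconstruction')(encoded)

# Output 2: a classification head on top of the middle (embedding) layer.
class_out = layers.Dense(10, activation='softmax', name='classifier')(encoded)

model = keras.Model(inputs, [decoded, class_out])

# Per-output losses and metrics; loss_weights balances reconstruction vs. classification.
model.compile(optimizer='adam',
              loss={'reconstruction': 'mean_squared_error',
                    'classifier': 'categorical_crossentropy'},
              loss_weights={'reconstruction': 1.0, 'classifier': 0.5},
              metrics={'classifier': ['accuracy']})

# model.fit(x_train, {'reconstruction': x_train, 'classifier': y_train}, ...)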