I am running into some issues while training with identical code ported from PyTorch to TF2:
Model.fit converges in a completely different manner than a GradientTape training loop (and more similarly to PyTorch).
tf.keras.optimizers.SGD converges very differently from PyTorch's SGD (e.g. a learning rate of 0.1 is unstable during training in TF, while the same starting LR is used in many SOTA PyTorch implementations).
I couldn't come up with a toy example that demonstrates the issue in a few lines of code without needing many epochs of training. Any suggestions?
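One concrete, documented implementation difference that could serve as a starting point: tf.keras.optimizers.SGD folds the learning rate into its momentum velocity (velocity = momentum * velocity - lr * grad), while torch.optim.SGD keeps the momentum buffer unscaled (buf = momentum * buf + grad) and applies the learning rate afterwards. A minimal pure-Python sketch of the two update rules (the toy quadratic and the schedule are only illustrative):

```python
def keras_sgd_step(w, g, v, lr, momentum):
    # tf.keras.optimizers.SGD: the learning rate is absorbed into the velocity
    v = momentum * v - lr * g
    return w + v, v

def torch_sgd_step(w, g, b, lr, momentum):
    # torch.optim.SGD (dampening=0): lr is applied after the momentum buffer
    b = momentum * b + g
    return w - lr * b, b

grad = lambda w: 2.0 * w          # gradient of f(w) = w**2
wk = wt = 1.0
vk = bt = 0.0
for step in range(10):
    lr = 0.1 if step < 5 else 0.01   # a mid-training learning-rate drop
    wk, vk = keras_sgd_step(wk, grad(wk), vk, lr, 0.9)
    wt, bt = torch_sgd_step(wt, grad(wt), bt, lr, 0.9)

print(wk, wt)  # the two trajectories differ once the schedule kicks in
```

At a constant learning rate the two formulations trace identical trajectories; they diverge only once the rate changes, so learning-rate schedules and decay are one place the two frameworks can legitimately disagree.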
Related
I converted a model from tf.keras to Caffe. When I evaluate the model with Caffe on the test set, I find that the accuracy is higher with Caffe than with tf.keras. I can't think of a way to get a handle on the source of the problem (if there is a problem in the first place...).
Is this difference due to the lower-level libraries used to accelerate the computations (I am thinking of cuDNN and the Caffe engine)? Is there a well-known accuracy problem with the keras module of TensorFlow?
By the way, other people have reported a similar issue:
https://github.com/keras-team/keras/issues/4444
This can happen.
When you convert your Keras .h5 model to a .caffemodel, the weights are copied numerically, but the converted model is then loaded and run by Caffe, not Keras.
Since Caffe and Keras are two different libraries, their internal algorithms can vary slightly. Changing your pre-processing scheme can change the results too. Usually, pruning (to optimize model size) lowers performance, but in odd cases it can act as an extreme form of regularization and boost test performance.
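One common source of such accuracy drift after conversion is preprocessing rather than the weights: Caffe models conventionally expect BGR channel order with per-channel mean subtraction, while Keras pipelines often feed RGB scaled to [0, 1]. A minimal sketch of the mismatch, assuming the usual ImageNet BGR means (whether these apply to this particular model is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float32)  # fake RGB image

# Typical tf.keras-style preprocessing: RGB channel order, scaled to [0, 1]
keras_input = rgb / 255.0

# Typical Caffe-style preprocessing: BGR channel order, ImageNet mean subtraction
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)
caffe_input = rgb[..., ::-1] - IMAGENET_BGR_MEAN

# Feeding one network inputs prepared for the other silently shifts accuracy
print(keras_input.mean(), caffe_input.mean())
```

If the two evaluation pipelines prepare inputs differently, the accuracy gap can have nothing to do with cuDNN or the Caffe engine.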
I have implemented a Keras-based Bayesian deep-learning model (based on this repo).
My model's loss is always negative, as is the logits_variance_loss (see screenshot below). Any idea why this is happening, or what it means for training?
And this is after 2 epochs
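If the loss here is (or includes) a Gaussian negative log-likelihood, as in Kendall & Gal-style heteroscedastic models (an assumption about that repo), a negative value is not by itself a bug: a Gaussian density exceeds 1 when the predicted variance is small, which makes the NLL negative. A minimal sketch:

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    # Negative log-likelihood of y under N(mu, sigma^2)
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# Accurate prediction with small predicted variance: density > 1, so NLL < 0
print(gaussian_nll(1.0, 1.01, 0.05))

# Large predicted variance: NLL is positive
print(gaussian_nll(1.0, 1.0, 10.0))
```

What matters for training is that the loss keeps decreasing, not its sign.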
Going straight to the problem...
I am using Keras flow_from_directory to load the data for sound classification, with a data generator without any augmentation and shuffle=True. Although most of my models reach very good accuracy (92%) and a small val_loss, the confusion matrix shows that the model is not predicting the labels correctly.
I have tried simple and complex models with Keras flow_from_directory and a data generator on the UrbanSound8K dataset. I also tried batch normalization and bias and kernel regularizers to avoid overfitting.
The results look almost random.
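One classic cause of exactly this symptom (good reported accuracy, near-random confusion matrix) is building the matrix from a generator with shuffle=True: model.predict consumes batches in shuffled order while labels such as generator.classes stay in directory order, so predictions and labels are misaligned. A NumPy sketch of the effect (no Keras needed):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n = 10, 1000
y_true = rng.integers(0, n_classes, size=n)

y_pred = y_true.copy()                 # a "perfect" model: 100% accuracy
print((y_pred == y_true).mean())       # 1.0

# Pair predictions with labels taken in a different (shuffled) order,
# like matching model.predict(gen) against gen.classes when shuffle=True
misaligned = rng.permutation(y_pred)
print((misaligned == y_true).mean())   # roughly 0.1, i.e. chance level
```

If this is the cause, building the evaluation generator with shuffle=False should make the confusion matrix agree with the reported accuracy.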
To my understanding, batch (vanilla) gradient descent makes one parameter update using all of the training data. Stochastic gradient descent (SGD) updates the parameters for each training sample, helping the model converge faster at the cost of high fluctuation in the training loss.
Batch (vanilla) gradient descent sets batch_size=corpus_size.
SGD sets batch_size=1.
And mini-batch gradient descent sets batch_size=k, in which k is usually 32, 64, 128...
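The three regimes above differ only in how many samples feed each gradient estimate. A minimal NumPy sketch on a noiseless least-squares problem (the learning rate and epoch count are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def train(batch_size, lr=0.05, epochs=100):
    w = np.zeros(3)
    for _ in range(epochs):
        idx = rng.permutation(len(X))          # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
            w -= lr * grad
    return w

w_batch = train(batch_size=len(X))  # batch (vanilla) gradient descent
w_sgd = train(batch_size=1)         # stochastic gradient descent
w_mini = train(batch_size=32)       # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)       # all three approach w_true
```

The loop structure is identical in all three cases; only batch_size changes how noisy each gradient estimate is and how many updates an epoch performs.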
How does gensim apply SGD or mini-batch gradient descent? It seems that batch_words is the equivalent of batch_size, but I want to be sure.
Is setting batch_words=1 in a gensim model equivalent to applying SGD?
No: batch_words in gensim refers to the size of the work chunks sent to worker threads.
The gensim Word2Vec class updates model parameters after each training micro-example of (context)->(target-word) (where context might be a single word, as in skip-gram, or the mean of several words, as in CBOW).
For example, you can review the optimized w2v_fast_sentence_sg_neg() Cython function for skip-gram with negative sampling, deep in the Word2Vec training loop:
https://github.com/RaRe-Technologies/gensim/blob/460dc1cb9921817f71b40b412e11a6d413926472/gensim/models/word2vec_inner.pyx#L159
Observe that it is considering exactly one target-word (word_index parameter) and one context-word (word2_index), and updating both the word-vectors (aka 'projection layer' syn0) and the model's hidden-to-output weights (syn1neg) before it might be called again with a subsequent single (context)->(target-word) pair.
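For intuition, here is a toy NumPy sketch of one such single-pair skip-gram negative-sampling step. It mirrors gensim's syn0/syn1neg naming but is illustrative only, not gensim's actual code (gensim, for instance, initializes syn1neg to zeros, while this sketch uses small random values so the first update is visible):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, lr = 20, 8, 0.025
syn0 = rng.normal(0, 0.1, (vocab, dim))      # input word-vectors ("projection layer")
syn1neg = rng.normal(0, 0.1, (vocab, dim))   # hidden-to-output weights for negative sampling

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sg_neg_update(word_index, word2_index, neg_indices):
    """One SGD step for a single (context)->(target-word) pair."""
    ctx = syn0[word2_index]
    grad_ctx = np.zeros(dim)
    # The true target gets label 1, each sampled negative word gets label 0
    for idx, label in [(word_index, 1.0)] + [(n, 0.0) for n in neg_indices]:
        f = sigmoid(ctx @ syn1neg[idx])
        g = (label - f) * lr                 # error gradient scaled by the learning rate
        grad_ctx += g * syn1neg[idx]
        syn1neg[idx] += g * ctx              # update output weights immediately
    syn0[word2_index] += grad_ctx            # then update the context word-vector

before = syn0[3].copy()
sg_neg_update(word_index=7, word2_index=3, neg_indices=[11, 15])
print(np.abs(syn0[3] - before).sum())        # this single pair already changed the model
```

The point matches the description above: the model is updated after every micro-example, so gensim's training is SGD regardless of the batch_words setting.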
Need Suggestion
I am trying to design a model to guess facial keypoints. It's part of a Kaggle competition (https://www.kaggle.com/c/facial-keypoints-detection).
In this solution, I am trying to design a CNN model (using the Keras library) as a multi-variable regression model to predict the coordinates of the facial keypoints.
Issue faced --> I am getting the loss as "nan"
Solutions tried --
1. Tried optimizers - Adam, SGD
2. Tested with learning rates from 0.01 to 0.00001
3. Tried various batch sizes
Can anyone suggest what I might be missing? The code is at the link below:
https://www.kaggle.com/saurabhrathor/facialpoints-practice
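One thing worth checking before tuning optimizers: the training CSV for this competition contains many rows with missing keypoint values, and a single NaN in the regression targets makes an MSE loss NaN on the first batch regardless of optimizer or learning rate. A NumPy sketch of the symptom and the simplest fix (dropping incomplete rows):

```python
import numpy as np

# Toy targets: keypoint coordinates per face, some missing (NaN), as in the Kaggle CSV
y = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])
pred = np.zeros_like(y)

mse = np.mean((pred - y) ** 2)
print(mse)                       # nan: one missing target poisons the whole loss

mask = ~np.isnan(y).any(axis=1)  # keep only rows with complete annotations
mse_clean = np.mean((pred[mask] - y[mask]) ** 2)
print(mse_clean)                 # a finite number
```

Dropping (or imputing) the incomplete rows before training, and normalizing inputs and targets, is usually enough to keep the loss finite on this dataset.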