I am training various Autoencoder structures in keras+tensorflow for my master thesis on anomaly detection and during my literature research on how to train the autoencoder structures, I found various training schemes for Autoencoders including, Layerwise pretraining and Joint training. However, I have not been able to find any literature on how Keras.fit() handles the training process. Can anyone point me to any useful links?
Related
I'm using Windows 10 machine. Libraries: Keras with Tensorflow 2.0 Embeddings: Glove(100 dimensions).
I am trying to implement an LSTM architecture for multi-label text classification.
I am using different types of fine-tuning to achieve better results but with no luck so far.
The main problem I believe is the difference in class distributions of my dataset but after a lot of tries and errors, I couldn't implement stratified-k-split in Keras.
I am also experimenting with dropout layers, batch sizes, # of layers, learning rates, clip values, validation splits but I get a minimum boost or worst performance sometimes.
For metrics, I use mainly ROC and F1.
I also followed the suggestion from a StackOverflow member who said to delete some of my examples so I can balance my dataset but if I do that I will have a very low number of examples.
What would you suggest to me?
If someone can provide code based on my implementation for
stratified-k-split I would be grateful cause I have checked all the
online resources but can't implement it.
Any tips, suggestions will be really helpful.
Metrics Plots
Dataset form+Embedings form+train-test-split form
Dataset's labels distribution
My LSTM implementation
I have trained word embeddings using Fasttext - train.unsuperwised.
Is there a way to autotune the hyperparameters for this? Documentation gives autotuning for supervised training but I am not sure how supervised training can be done for embeddings.
You can used the supervised mode for embeddings, if you have target labels to predict per input text. But then the embeddings will be optimized for that classification purpose, rather than the more general usefulness people usually expect from unsupervised training.
Because such metaparameter optimization ("autotune") only makes sense if testing the results against a goal with clear right/wrong answers, it likely only works for the supervised mode, as shown by the docs.
If you're using the (normal, unsupervised) word-vectors for some other downstream task of your own, and you can create a repeatable evaluation for that task, you should write your own code to perform a search for the best metaparameters.
BERT pre-training of the base-model is done by a language modeling approach, where we mask certain percent of tokens in a sentence, and we make the model learn those missing mask. Then, I think in order to do downstream tasks, we add a newly initialized layer and we fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and Book corpus, which sums to 3,300M words. BERT-base has 109M model parameters. So, if you think you have large enough data to train BERT, then the answer to your question is yes.
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why do you prefer to train BERT from scratch instead of fine-tuning it? Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
I can give help.
First of all, MLM and NSP (which are the original pre-training objectives from NAACL 2019) are meant to train language encoders with prior language knowledge. Like a primary school student who read many books in the general domain. Before BERT, many neural networks would be trained from scratch, from a clean slate where the model doesn't know anything. This is like a newborn baby.
So my question is, "is it a good idea to start teaching a newborn baby when you can begin with a primary school student?" My answer is no. This is supported by numerous State-of-The-Arts achieved by the pre-trained models, compared to the old methods of training a neural network from scratch.
As someone who works in the field, I can assure you that it is a much better idea to fine-tune a pre-trained model. It doesn't matter if you have a 200k dataset or a 1mil datapoints. In fact, more fine-tuning data will only make the downstream results better if you use the right hyperparameters.
Though I recommend the learning rate between 2e-6 ~ 5e-5 for sentence classification tasks, you can explore. If your dataset is very, very domain-specific, it's up to you to fine-tune with a higher learning rate, which will deviate the model further away from its "pre-trained" knowledge.
And also, regarding your question on
can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?
I'm negative about this idea. Even though you have a dataset with 200k instances, BERT is pre-trained on 3300mil words. BERT is too inefficient to be trained with 200k instances (both size-wise and architecture-wise). If you want to train a neural network from scratch, I'd recommend you look into LSTMs or RNNs.
I'm not saying I recommend LSTMs. Just fine-tune BERT. 200k is not even too big anyways.
All the best luck with your NLP studies :)
I have some questions about learning GANS, for example in the source code of the link below:
https://github.com/ImagingLab/Colorizing-with-GANs
in the file src/model I want to know if the generator and discriminator are trained at the same time or just only the generator is trained because I read in some papers that is not recommended to train the two networks (Discriminator a Generator) at the same time. If they are trained at the same time in this code which line can disable discriminator training?
Is Google's pretrained word2vec model CBO or skipgram.
We load pretrained model by:
from gensim.models.keyedvectors as word2vec
model= word2vec.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz')
How can we specifically load pretrained CBOW or skipgram model ?
The GoogleNews word-vectors were trained by Google, using a proprietary corpus, but they're never explicitly described all the training-parameters used. (It's not encoded in the file.)
It's been asked a number of times on the Google Group devoted to the word2vec-toolkit code, without a definitive answer. For example, there's a response from word2vec author Mikolov that he doesn't remember the training parameters. Elsewhere, another poster thinks one of the word2vec papers implies skip-gram was used – but as that passage doesn't precisely match other aspects (like vocabulary-size) of the released GoogleNews vectors, I wouldn't be completely confident of that.
As Google hasn't been clear, and in any case hasn't released alternate versions based on different training modes, if you want to run any tests or make any conclusions about the different modes, you'll have to use other vector-sets, or train your own vectors in varying ways.
Late to the party, but Mikolov describes the hyperparameters here. The Google News pretrained vectors were trained using CBOW. I believe that's the only option for you to load; there is no pretrained skip-gram version available.