What optimization method does GPT-1 use? - nlp

I'm reviewing the paper <Improving Language Understanding by Generative Pre-Training(2018)>.
The concept is well understood, but there is a question about the optimization method used in the paper.
In the first image, it looked like SGD was used for pre-training, but the second image showed that Adam was used. I'm posting a question because I think I might have misunderstood the concept!
I think they used Adam, but I wonder what SGD was used for.


How to put more weight on one class during training in Pytorch [duplicate]

I have a multilabel classification problem, which I am trying to solve with CNNs in PyTorch. I have 80,000 training examples and 7900 classes; every example can belong to multiple classes at the same time, and the mean number of classes per example is 130.
The problem is that my dataset is very imbalanced. For some classes, I have only ~900 examples, which is around 1%. For “overrepresented” classes I have ~12000 examples (15%). When I train the model I use BCEWithLogitsLoss from PyTorch with the pos_weight parameter. I calculate the weights the same way as described in the documentation: the number of negative examples divided by the number of positives.
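For reference, the weighting described above can be sketched in plain Python. This is only an illustration of the negatives-over-positives rule; the function name `pos_weights`, the toy label matrix, and the fallback of 1.0 for a class with no positives are assumptions made for the example, not something taken from the question.

```python
def pos_weights(labels):
    """Return one weight per class: (#negative examples) / (#positive examples).

    labels is a list of rows, each row a list of 0/1 flags, one flag per class.
    """
    n_examples = len(labels)
    n_classes = len(labels[0])
    weights = []
    for c in range(n_classes):
        positives = sum(row[c] for row in labels)
        negatives = n_examples - positives
        # Assumption: fall back to 1.0 when a class has no positives at all.
        weights.append(negatives / positives if positives else 1.0)
    return weights


# Toy multilabel matrix: 4 examples, 2 classes.
toy_labels = [
    [1, 0],
    [0, 0],
    [0, 1],
    [0, 1],
]
print(pos_weights(toy_labels))  # [3.0, 1.0]
```

With the figures from the question, a class with about 900 positives out of 80,000 examples would get a weight of roughly 79,100 / 900 ≈ 88, and a class with about 12,000 positives roughly 68,000 / 12,000 ≈ 5.7. In PyTorch the resulting list would typically be converted to a tensor and passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.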
As a result, my model overestimates almost every class… For both minor and major classes I get almost twice as many predictions as true labels. And my AUPRC is just 0.18. Even so, it’s much better than no weighting at all, since in that case the model predicts everything as zero.
So my question is, how do I improve the performance? Is there anything else I can do? I tried different batch sampling techniques (to oversample minority class), but they don’t seem to work.
I would suggest one of these strategies:
Focal Loss
A very interesting approach for dealing with unbalanced training data by tweaking the loss function was introduced in
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Focal Loss for Dense Object Detection (ICCV 2017).
They propose modifying the binary cross-entropy loss so that it decreases the loss and gradient of easily classified examples while "focusing the effort" on examples where the model makes gross errors.
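To make the idea concrete, here is a minimal sketch of the focal loss for a single binary prediction, written in plain Python. It follows the formula from the paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with the paper's commonly quoted defaults gamma = 2 and alpha = 0.25. The function name and the clamping constant are assumptions for this example; a real training loop would use a vectorized, differentiable version.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example.

    p is the predicted probability of the positive class, y is the label (0 or 1).
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0); the constant is an arbitrary choice for this sketch.
    p_t = min(max(p_t, 1e-12), 1.0)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct
hard = binary_focal_loss(0.10, 1)  # confident and wrong
print(easy, hard)
```

With gamma = 0 the modulating factor disappears and the function reduces to alpha-weighted binary cross-entropy, so gamma controls how strongly easy examples are down-weighted.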
Hard Negative Mining
Another popular approach is to do "hard negative mining"; that is, propagate gradients only for part of the training examples - the "hard" ones.
See, e.g.:
Abhinav Shrivastava, Abhinav Gupta and Ross Girshick, Training Region-based Object Detectors with Online Hard Example Mining (CVPR 2016)
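The core selection step can be sketched as below. This is a simplified top-k selection over per-example losses and not the full pipeline from the cited paper; the function name `mean_of_hardest` and the `keep_fraction` parameter are hypothetical names chosen for the illustration.

```python
def mean_of_hardest(losses, keep_fraction=0.25):
    """Average only the largest per-example losses in a batch.

    Examples outside the kept set are ignored, so they would contribute
    no gradient if this value were used as the training loss.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    k = max(1, int(len(losses) * keep_fraction))
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / k, hardest


batch_losses = [0.02, 1.30, 0.05, 0.90, 0.01, 0.03, 0.04, 0.07]
loss, kept = mean_of_hardest(batch_losses)
print(kept)  # [1.3, 0.9]
```

In a framework such as PyTorch the same idea is usually expressed by computing the loss without reduction, selecting the top k values, and averaging those.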
@Shai has provided two strategies developed in the deep learning era. I would like to offer some additional, traditional machine learning options: over-sampling and under-sampling.
The main idea is to produce a more balanced dataset by resampling before you start training. Note that you will probably face some problems, such as losing data diversity (under-sampling) and overfitting the training data (over-sampling), but it might be a good starting point.
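A minimal sketch of both options, using only the standard library, might look like this. It assumes a single label per example to keep the illustration short; the multilabel case in the question needs more care, because duplicating an example changes the counts of every class it belongs to. The function names are made up for the example.

```python
import random


def _group_by_class(examples, labels):
    """Collect examples into a dict keyed by their label."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_oversample(examples, labels, seed=0):
    """Duplicate random minority examples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = max(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_x.extend(items + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(examples, labels, seed=0):
    """Randomly drop examples until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = _group_by_class(examples, labels)
    target = min(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        out_x.extend(rng.sample(items, target))
        out_y.extend([y] * target)
    return out_x, out_y


xs = list(range(10))
ys = ["major"] * 8 + ["minor"] * 2
over_x, over_y = random_oversample(xs, ys)
under_x, under_y = random_undersample(xs, ys)
print(over_y.count("minor"), under_y.count("major"))  # 8 2
```

Resampling should be applied to the training split only, so that validation metrics still reflect the original class distribution.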
See the wiki link for more information.

After a few epochs, the difference between Valid loss and Loss increases

I'm trying to train the model on the MagnaTagATune dataset. Is the model properly trained? What is the problem, does anyone know? Will waiting solve the problem?
The results are shown in the image.
Thank you pseudo_random_here for your answer. Your tips were helpful, but the problem was still there.
Unfortunately, changing the learning rate did not work. Now, following your advice, I will use the SGD optimizer with a learning rate of 0.1. I even tried another model designed for this task, but the problem was not solved.
from keras.optimizers import SGD
# Recent Keras releases name this argument learning_rate; older ones used lr.
opt = SGD(learning_rate=0.1)
model.compile(loss="categorical_crossentropy", optimizer=opt)
Short answer: I would say your val_loss is too high, and waiting is unlikely to solve your problem.
Explanation: I believe there are two possibilities here:
Your architecture is not suitable for the data
Your learning rate is too small
PS. It would help a lot if you were to provide info on what architecture of NNs you are using, what loss function we are looking at and what exactly is it that you are predicting?

Keras layers explanation

I want to get a deeper understanding of how Keras layers work in a model: what each layer does, and so on. I followed the Keras documentation, but the information there isn't enough. If any of you know a place to learn more, let me know. Thanks in advance.
Keras layers are widely used CNN, DNN and RNN layers. There is at least one research paper for each of them, and there is a lot of educational material out there. If you are really curious, you could look at Keras's code. Some links for you:

Can I use SGD with Multinomial Naive Bayes?

I'd like to understand whether I can train an MNB model with SGD, and whether that is a valid approach. My application is text classification. In sklearn I've found out that there is no MNB available, and by default it's SVM; however, NB is a linear model, isn't it?
So if my likelihood parameters (with Laplacian smoothing) can be estimated as
Can I update my parameters with SGD and minimize the cost function?
Please let me know if SGD is irrelevant here. Thanks in advance.
So I got the answer, and I hope I got it right: MNB's parameters are estimated from the word occurrences in the given input text (like tf-idf). But I still don't clearly understand why we can't use SGD for MNB training. I would understand it better with an explicit description or a mathematical interpretation. Thanks
In sklearn I've found out that there is no MNB available
Multinomial naive Bayes is implemented in scikit-learn. There is no gradient descent to use. This implementation just uses relative frequency counts (with smoothing) to find the parameters of the model in a single pass, which is the standard and most efficient way to fit an MNB model.
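To show why no iterative optimizer is needed, here is a small from-scratch sketch of the counting procedure in plain Python. It is not scikit-learn's code; the function names and the toy data are made up for illustration. It uses the usual smoothed estimate theta_ci = (N_ci + alpha) / (N_c + alpha * n), where N_ci is the count of feature i in class c, N_c is the total count in class c, and n is the number of features.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Fit MNB by counting. X holds word-count vectors, y holds class labels."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # N_ci for every feature i, and N_c as their sum.
        feature_counts = [sum(row[i] for row in rows) for i in range(n_features)]
        total = sum(feature_counts)
        log_likelihood[c] = [
            math.log((n_ci + alpha) / (total + alpha * n_features))
            for n_ci in feature_counts
        ]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {
        c: log_prior[c] + sum(xi * ll for xi, ll in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


X = [[2, 1, 0], [1, 0, 0], [0, 1, 3], [0, 0, 2]]
y = ["sports", "sports", "politics", "politics"]
log_prior, log_likelihood = fit_multinomial_nb(X, y)
print(predict([1, 0, 0], log_prior, log_likelihood))  # sports
```

The parameters come out of a single pass over the data because the smoothed maximum-likelihood estimate has a closed form, so there is no loss left to minimize step by step. The decision function is still linear in the counts, which is why NB is often described as a linear classifier even though it is not fitted by gradient descent.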

Writing code for A Neural Probabilistic Language Model Bengio, 2003. Not able to understand the model

I'm trying to write code for A Neural Probabilistic Language Model by Yoshua Bengio, 2003, but I'm not able to understand the connections between the input layer and the projection matrix, and between the projection matrix and the hidden layer. I also don't understand how exactly the word-vector representations are learned.
Have a look at this answer here.
It explains the difference between the hidden layer and the projection layer.
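As a rough illustration of that distinction, here is a toy forward pass in plain Python. It is a sketch under assumptions: the optional direct input-to-output connections from the paper are left out, the sizes and random initial values are arbitrary, and names such as `nplm_forward` are made up for the example. The projection layer is just a row lookup in the shared matrix C followed by concatenation, with no nonlinearity, while the hidden layer applies tanh to an affine map of that concatenated vector.

```python
import math
import random


def nplm_forward(context_ids, C, H, d, U, b):
    """Return next-word probabilities for one context.

    C: V x m word-feature table, H: h x ((n-1)*m), d: h, U: V x h, b: V.
    """
    # Projection layer: look up each context word in C and concatenate the rows.
    x = [value for w in context_ids for value in C[w]]
    # Hidden layer: tanh(d + H x).
    hidden = [
        math.tanh(d_j + sum(h_jk * x_k for h_jk, x_k in zip(row, x)))
        for row, d_j in zip(H, d)
    ]
    # Output layer: scores b + U * hidden, then a numerically stable softmax.
    scores = [b_i + sum(u * a for u, a in zip(row, hidden)) for row, b_i in zip(U, b)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


rng = random.Random(0)
V, m, context, h = 6, 3, 2, 4  # toy sizes chosen only for illustration


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


C = rand_matrix(V, m)            # one m-dimensional vector per vocabulary word
H = rand_matrix(h, context * m)  # hidden weights over the concatenated vectors
U = rand_matrix(V, h)            # hidden-to-output weights
d = [0.0] * h
b = [0.0] * V

probs = nplm_forward([2, 5], C, H, d, U, b)
print(len(probs), sum(probs))
```

The word vectors are learned because C is an ordinary parameter matrix: during backpropagation, the gradient of the log-likelihood flows through the hidden layer into x, and from there into exactly the rows of C that were looked up for the context words.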
Referring to this thesis
Also, do read this paper by Tomas Mikolov and go through this tutorial.
This will really improve your understanding.
Hope this helps!
