I'm running a DCGAN-based GAN, and am experimenting with WGANs, but am a bit confused about how to train the WGAN.
In the official Wasserstein GAN PyTorch implementation, the discriminator/critic is said to be trained Diters (usually 5) times per each generator training.
Does this mean that the critic/discriminator trains on Diters batches or the whole dataset Diters times? If I'm not mistaken, the official implementation suggests the discriminator/critic is trained on the whole dataset Diters times, but other implementations of WGAN (in PyTorch and TensorFlow etc.) do the opposite.
Which is correct? The WGAN paper (to me, at least), indicates that it is Diters batches. Training on the whole dataset is obviously orders of magnitude slower.
Thanks in advance!
The correct interpretation is that an iteration corresponds to a batch.
In the original paper, for each iteration of the critic/discriminator they sample a batch of size m from the real data and a batch of size m of prior samples from p(z) to work with. After the critic has been trained for Diters iterations, they train the generator, which also starts by sampling a batch of prior samples from p(z).
Therefore, each iteration is working on a batch.
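As a rough illustration, here is a minimal sketch of that schedule in PyTorch. The names (critic, generator, data_iterator, optimizers, batch_size, nz, clip_value, Diters) are placeholders for this answer, not the official implementation:

# Hypothetical sketch of the WGAN schedule described above: Diters critic
# batches per single generator update. All names below are placeholders.
import torch

for gen_iteration in range(num_generator_updates):
    # Train the critic on Diters batches (one batch per critic iteration).
    for _ in range(Diters):
        real = next(data_iterator)                              # batch of m real samples
        z = torch.randn(real.size(0), nz, device=real.device)   # batch of m prior samples
        critic_optimizer.zero_grad()
        # Critic maximizes E[f(real)] - E[f(fake)], so we minimize the negation.
        loss_critic = critic(generator(z).detach()).mean() - critic(real).mean()
        loss_critic.backward()
        critic_optimizer.step()
        for p in critic.parameters():                           # weight clipping (original WGAN)
            p.data.clamp_(-clip_value, clip_value)

    # One generator update, again starting from a fresh batch of prior samples.
    z = torch.randn(batch_size, nz)
    generator_optimizer.zero_grad()
    loss_generator = -critic(generator(z)).mean()
    loss_generator.backward()
    generator_optimizer.step()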
In the official implementation this is also happening. What may be confusing is that they use the variable name niter to represent the number of epochs to train the model. Although they use a different scheme to set Diters at lines 162-166:
# train the discriminator Diters times
if gen_iterations < 25 or gen_iterations % 500 == 0:
    Diters = 100
else:
    Diters = opt.Diters
they are, as in the paper, training the critic over Diters batches.
This implementation of WGAN also runs the discriminator on Diters batches for each run of the generator - https://github.com/shayneobrien/generative-models/blob/74fbe414f81eaed29274e273f1fb6128abdb0ff5/src/w_gan.py#L88
I'm trying to solve a multilabel classification task with 10 classes, using a relatively balanced training set consisting of ~25K samples and an evaluation set consisting of ~5K samples.
I'm using the Hugging Face transformers library:
model = transformers.BertForSequenceClassification.from_pretrained(...
and obtain quite nice results (ROC AUC = 0.98).
However, I'm witnessing some odd behavior which I can't make sense of -
I add the following lines of code:
for param in model.bert.parameters():
    param.requires_grad = False
while making sure that the other layers of the model are learned, that is:
[param[0] for param in model.named_parameters() if param[1].requires_grad == True]
gives
['classifier.weight', 'classifier.bias']
Training the model configured like this yields embarrassingly poor results (ROC AUC = 0.59).
I was working under the assumption that an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers. So, where did I go wrong?
From my experience, you are going wrong in your assumption that
an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers.
I have noticed similar experiences when trying to use BERT's output layer as word embeddings with little-to-no fine-tuning, which also gave very poor results; and this also makes sense, since the simplest form of output layer has only 768 * num_classes trainable connections. Compared to the millions of parameters in BERT, this gives you an almost negligible amount of control over the model's considerable complexity. However, I also want to cautiously point out the possibility of overfitting when training your full model, although I'm sure you are aware of that.
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers. The one instance in which it can be helpful to freeze at least part of the model would be the embedding component, depending on the model's vocabulary size (~30k for BERT-base).
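For illustration, a minimal sketch of that middle ground with the Hugging Face transformers API, freezing only the embedding matrices while the encoder and classifier stay trainable. The model name and num_labels are placeholders, not your exact setup:

# Hypothetical sketch: freeze only BERT's embedding component, fine-tune the rest.
import transformers

model = transformers.BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=10
)

# Freeze only the embeddings (token, position, and token-type embeddings).
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Everything else (encoder layers + classifier head) remains trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]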
I think the following will help in demystifying the odd behavior I reported here earlier –
First, as it turned out, when freezing the BERT layers (and using an out-of-the-box pre-trained BERT model without any fine-tuning), the number of training epochs required for the classification layer is far greater than that needed when allowing all layers to be learned.
For example,
Without freezing the BERT layers, I’ve reached:
ROC AUC = 0.98, train loss = 0.0988, validation loss = 0.0501 # end of epoch 1
ROC AUC = 0.99, train loss = 0.0484, validation loss = 0.0433 # end of epoch 2
Overfitting, train loss = 0.0270, validation loss = 0.0423 # end of epoch 3
Whereas, when freezing the BERT layers, I’ve reached:
ROC AUC = 0.77, train loss = 0.2509, validation loss = 0.2491 # end of epoch 10
ROC AUC = 0.89, train loss = 0.1743, validation loss = 0.1722 # end of epoch 100
ROC AUC = 0.93, train loss = 0.1452, validation loss = 0.1363 # end of epoch 1000
The (probable) conclusion that arises from these results is that working with an out-of-the-box pre-trained BERT model as a feature extractor (that is, freezing its layers) while learning only the classification layer suffers from underfitting.
This is demonstrated in two ways:
First, after running 1000 epochs, the model still hasn’t finished learning (the training loss is still higher than the validation loss).
Second, after running 1000 epochs, the loss values are still higher than the values achieved with the non-frozen version as early as the 1st epoch.
To sum it up, @dennlinger, I think I completely agree with you on this:
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers.
Why are we updating targets in this implementation of a Bayesian CNN with MC dropout?
https://github.com/sungyubkim/MCDO/blob/master/Bayesian_CNN_with_MCDO.ipynb?fbclid=IwAR18IMLcdUUp90TRoYodsJS7GW1smk-KGYovNpojn8LtRhDQckFI_gnpOYc
def update_target(target, original, update_rate):
    for target_param, param in zip(target.parameters(), original.parameters()):
        target_param.data.copy_((1.0 - update_rate) * target_param.data + update_rate * param.data)
The implementation you have referred to is a data-parallel one, which means the author intends to train multiple networks with the same architecture but different hyper-parameters.
Although in an unconventional way, this is what update_target does:
update_target(net_test, net, 0.001)
It updates net_test with a lower learning rate compared to net, but with the exact same parameter changes that are applied to the original net, which is actually being trained. Only the scale of the change is different.
I am assuming that this is found to be useful in terms of computational efficiency, since only one of the networks is actually being "trained" during the main training phase:
outputs = net(inputs)
loss = CE(outputs, labels)
loss.backward()
optimizer.step()
One less forward pass and one less backprop per step.
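In other words, net_test is kept as a slowly moving copy of net. A minimal, hypothetical sketch of how such a call could sit in a training loop; net, net_test, train_loader, and the optimizer are placeholders, not the notebook's exact code:

# Hypothetical sketch: net_test follows net via update_target, with no
# forward/backward pass of its own.
import torch
import torch.nn as nn

CE = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for inputs, labels in train_loader:
    optimizer.zero_grad()
    outputs = net(inputs)          # only net gets a forward pass and gradients
    loss = CE(outputs, labels)
    loss.backward()
    optimizer.step()

    # Copy a small fraction of net's change into net_test each step.
    update_target(net_test, net, 0.001)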
I have adapted the base Transformer model for my corpus of aligned Arabic-English sentences. The model has trained for 40 epochs, and accuracy (SparseCategoricalAccuracy) is improving by about 0.0004 per epoch.
To achieve good results, my estimate is that the final accuracy needs to be somewhere around 0.5, while the accuracy after 40 epochs is only 0.0592.
I am running the model on the tesla 2 p80 GPU. Each epoch is taking ~2690 sec.
This implies I need at least 600 epochs, and the training time would be 15-18 days.
Should I continue with the training, or is there something wrong in the procedure, given that the base Transformer in the research paper was trained on an English-French corpus?
Key highlights:
Byte-pair encoding of the sentences
Maxlen_len = 100
batch_size= 64
No pre-trained embeddings were used.
Do you mean a Tesla K80 on an AWS p2.xlarge instance?
If that is the case, these GPUs are very slow. You should use p3 instances on AWS with V100 GPUs; you will get around a 6-7x speedup.
Check out this for more details.
Also, if you are not using the standard model and have made some changes to the model or dataset, then try to tune the hyperparameters. The simplest thing is to decrease the learning rate and see if you get better results.
Also, first try to run the standard model with the standard dataset to benchmark the time taken in that case, and then make your changes and proceed. See when the model starts converging in the standard case. I feel it should already give some results after 40 epochs.
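On the learning-rate point specifically: the original Transformer paper does not use a fixed learning rate but the warm-up/decay schedule below. If your adaptation uses a constant rate, this is one knob worth checking. The sketch is a generic illustration of that schedule, not your training code; d_model and warmup_steps are the paper's defaults:

# Sketch of the learning-rate schedule from "Attention Is All You Need":
# lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate ramps up during warm-up, then decays.
for step in (100, 4000, 40000):
    print(step, transformer_lr(step))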
In the Adam optimization algorithm, the learning speed is adjusted according to the number of iterations. I don't quite understand Adam's design, especially when using batch training. With batch training, if there are 19,200 pictures and each step trains on 64 pictures, that is equivalent to 300 iterations per epoch. If we train for 200 epochs, there are 60,000 iterations in total. I don't know whether that many iterations will reduce the learning speed to a very small value. So when we are training, should we re-initialize the optimizer after each epoch, or do nothing throughout the process?
I'm using PyTorch. I have tried re-initializing the optimizer after each epoch when using batch training, and I do nothing when the amount of data is small.
For example, I don't know which of the two pieces of code is correct:
optimizer = optim.Adam(model.parameters(), lr=0.1)
for epoch in range(100):
    ### Some code
    optimizer.step()
Another piece of code:
for epoch in range(100):
    optimizer = optim.Adam(model.parameters(), lr=0.1)
    ### Some code
    optimizer.step()
You can read the official paper here: https://arxiv.org/pdf/1412.6980.pdf
Your update looks somewhat like this (for brevity's sake I have omitted the warm-up phase):
new_theta = old_theta - learning_rate * momentum / (velocity + eps)
The intuition here is that if momentum > velocity, then the optimizer is in a plateau, so the learning_rate is effectively increased because momentum/velocity > 1. On the other hand, if momentum < velocity, then the optimizer is in a steep slope or noisy region, so the learning_rate is effectively decreased.
The learning_rate isn't necessarily decreased throughout the training, as you suggested in your question.
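As for where to construct the optimizer: the usual PyTorch pattern is to build it once before the loop over epochs, since Adam's per-parameter state (the moment estimates) would be thrown away if you re-created it every epoch, which corresponds to your first piece of code. A minimal runnable sketch with a toy model; the model, data, and learning rate are placeholders:

# Sketch of the usual pattern: the optimizer is created once, so Adam's
# running moment estimates persist across epochs.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)                                # toy model for illustration
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)     # created once, outside the loop

for epoch in range(100):
    inputs = torch.randn(64, 10)                        # stand-in for a real data loader
    labels = torch.randint(0, 2, (64,))
    optimizer.zero_grad()                               # reset gradients, not optimizer state
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()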
I am training a model using Keras.
model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep,103), use_bias=True, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=536))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
while True:
    history = model.fit_generator(
        generator=data_generator(x_[train_indices],
                                 y_[train_indices], batch=batch, timestep=timestep),
        steps_per_epoch=int(train_indices.shape[0] / batch),
        epochs=1,
        verbose=1,
        validation_steps=int(validation_indices.shape[0] / batch),
        validation_data=data_generator(
            x_[validation_indices], y_[validation_indices], batch=batch, timestep=timestep))
It is a multioutput classification task according to the scikit-learn.org definition:
Multioutput regression assigns each sample a set of target values. This can be thought of as predicting several properties for each data-point, such as wind direction and magnitude at a certain location.
Thus, it is a recurrent neural network. I tried out different timestep sizes, but the result/problem is mostly the same.
After one epoch, my train loss is around 0.0X and my validation loss is around 0.6X, and these values stay stable for the next 10 epochs.
Dataset is around 680000 rows. Training data is 9/10 and validation data is 1/10.
I'm asking for some intuition behind this:
Is my model already overfitted after just one epoch?
Is 0.6xx even a good value for a validation loss?
High-level question:
Since it is a multioutput classification task (not multi-class), I see sigmoid with binary_crossentropy as the only option. Do you suggest another approach?
I've experienced this issue and found that the learning rate and batch size have a huge impact on the learning process. In my case, I've done two things.
Reduce the learning rate (try 0.00005)
Reduce the batch size (8, 16, 32)
Moreover, you can try the basic steps for preventing overfitting.
Reduce the complexity of your model
Increase the training data and also balance the samples per class.
Add more regularization (Dropout, BatchNorm)
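To make the first suggestions concrete, here is a minimal sketch of the same model compiled with a lower learning rate and slightly stronger dropout. The exact values (5e-5, dropout of 0.3) are illustrative starting points, not tuned settings:

# Sketch only: same architecture as in the question, with a reduced learning
# rate and a bit more regularization.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation
from tensorflow.keras.optimizers import Adam

timestep, n_features, n_outputs = 10, 103, 536   # shapes taken from the question

model = Sequential()
model.add(LSTM(units=300, input_shape=(timestep, n_features), use_bias=True,
               dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(units=n_outputs))
model.add(Activation("sigmoid"))
model.compile(loss="binary_crossentropy",
              optimizer=Adam(learning_rate=5e-5),   # lower learning rate
              metrics=["accuracy"])

# Then train with a smaller batch size, e.g. 16 or 32 instead of 64.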