Keras reinforcement training with softmax - keras

A project i am working on has a reinforcement learning stage using the REINFORCE algorithm. The used model has a final softmax activation layer and because of that a negative learning rate is used as a replacement for negative rewards. I have some doubts about this process and can't find much literature on using a negative learning rate.
Does reinforement learning work with switching learning rate between positive and negative? and if not what would be a better approach, get rid of softmax or has keras a nice option for this?
Loss function:
def log_loss(y_true, y_pred):
'''
Keras 'loss' function for the REINFORCE algorithm,
where y_true is the action that was taken, and updates
with the negative gradient will make that action more likely.
We use the negative gradient because keras expects training data
to minimize a loss function.
'''
return -y_true * K.log(K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon()))
Switching learning rate:
K.set_value(optimizer.lr, lr * (+1 if won else -1))
learner_net.train_on_batch(np.concatenate(st_tensor, axis=0),
np.concatenate(mv_tensor, axis=0))
Update, test results
I ran a test with only positive reinforcement samples, omitting all negative examples and thus the negative learning rate. Winning rate is rising, it is improving and i can safely assume using a negative learning rate is not correct.
anybody any thoughts on how we should implement it?
Update, model explanation
We are trying to recreate AlphaGo as described by DeepMind, the slow policy net:
For the first stage of the training pipeline, we build on prior work
on predicting expert moves in the game of Go using supervised
learning13,21–24. The SL policy network pσ(a| s) alternates between convolutional
layers with weights σ, and rectifier nonlinearities. A final softmax
layer outputs a probability distribution over all legal moves a.

Not sure if it the best way but at least i found a way that works.
for all negative training samples i reuse the network prediction, set the action i want to unlearn to zero and adjust all values to sum up to one again
i tried several ways to adjust them afterwards but haven't run enough tests to be sure what works best:
apply softmax ( action that has to be unlearned gets a nonzero value.. )
redistribute old action value over all other actions
set all illigal action values to zero and distribute the total removed value
distribute value proportional to value of other values
probably there are several other ways to do so, it might depend on use case what works best and there might be a better way to do so but this one works at least.

Related

Model underfitting

I have trained a model and it took me quite a while to find the correct hyperparameters.
The model has now been trained for 15h and it seems to to its job quite well.
When I observed the training and validation loss though, the training loss is somewhat higher than the validation loss. (red curve: training, green: validation)
I use dropout to regularize my model and as far as I have understood, droput is is only applied during training which might be the reason.
Now Iam wondering if I have trained a valid model?
It doesn't seem like the model is heavily underfitted?
Thanks in advance for any advice,
cheers,
M
First, check whether you have good data set, i.e., if it is a classification, then get equal number of images for all classes and get it from same source not from different sources. And regularization, dropout are used for overfitting/High variance so don't worry about these.
Then, I think your model is doing good when you trained your model the initial error between them are different but as you increased the epochs then they both got into some steady path. So it is good. And may be reason for this is as I mentioned above or you should try shuffle them then using train_test_split for getting better distribution of training and validation sets.
A plot of learning curves shows a good fit if:
The plot of training loss decreases to a point of stability.
The plot of validation loss decreases to a point of stability and has a small gap with the training loss.
In your case these conditions are satisfied.
Still if you want to deal with High Bias/underfitting then here are few methods:
Train bigger models
Train longer. Use better optimization techniques
Try different Neural Network Architecture and also hyper parameters
And also you can use cross-validation or GridSearchCV for finding better optimizer or hyper parameters but it may take really long because you have to train it on different parameters each time considering your time which is 15 hours then it might be very long but you will find better parameters and then train on it.
Above all I think your model is doing okay.
If your model underfits, its performance will be lower, similar as in the case of overfitting, because actually it can not learn effectively to get the optimal result, i.e the proper function to fit the given distribution. So you have to use less regularization technique e.g. less dropout to get the optimal result.
Furthermore the sampling can also be crucial, because there can be training-validation subsets where your model performs well on validation set and less effective on training set and vice-versa. This is one of the reason why we use crossvalidation and different sampling methods e.g. stratified k-fold.

What are the best normalization technique and LSTM structure for forecasting an output with jumps (outliers)?

I have a time series forecasting case with ten features (inputs), and only one output. I'm using 22 timesteps (history of features) for one step ahead prediction using LSTM. Also, I apply MinMaxScaler for input normalization, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), but the other values are between 0 and ~5 (all values are positive). In this case, it's important to forecast both normal and outlier outputs correctly so I dont want to miss the jumps in my forecasting model. I think if I use MinMaxScaler for output, most of the values will be something near the zero but the others (outliers) will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (currently, I'm using LSTM with relu and Dense layer with relu as the last layer so I the output will be a positive value). I think I should select activation functions correctly for this case.
I think first of all, you should decide on a metric to measure performance. For example, do you want to use MAE or MSE? Or some other metric you decide based on the task at hand. For example, you may tolerate greater error for the "rare jumps", but not for the normal cases, or vice versa. Once you are decided on the error metric, ideally, you should set that as the cost function that the LSTM network would be minimizing.
Now the goal would be to minimize the desired error metric you set. If this was a convex problem, the scaling of the output will not matter. But we now that this is not the case with the complex deep learning architectures. What this means is that while minimizing the cost function with gradient decent, it might get stuck in a local minimum with a very delayed convergence. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With last layers parameters initialized around zero and a bias value of zero (i.e. the linear transformation of relu), the network needs to learn that the bias should be around 5. Depending on the complexity of the network this could take some epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
Now back to your questions:
I would at least make the output zero mean and use Dense layer with linear output.
The architecture you have seems fine, you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the the code and the performance you get and we can discuss what else can be improved.

Loss functions in LightFM

I recently came across LightFM while learning to train a recommender system. And so far what I know is that it utilizes loss functions which are logistic, BPR, WARP and k-OS WARP. I did not go through the math behind all these functions. Now what I am confused about is that how will I know that which loss function to use where?
From lightfm model documentation page:
logistic: useful when both positive (1) and negative (-1) interactions are present.
BPR: Bayesian Personalised Ranking 1 pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until rank violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision#k) is desired.
k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-the positive example for any given user as a basis for pairwise updates.
Everything boils down to how your dataset is structured and what kind of user interacions you're looking at. Obviously one approach would be to include the loss function in your parameter grid when going through hyperparameter tuning (at least that's what I did) and check model accuracy. I find investingating why a given loss function performed better/worse on a dataset as a good learning exercise.

Best Way to Overcome Early Convergence for Machine Learning Model

I have a machine learning model built that tries to predict weather data, and in this case I am doing a prediction on whether or not it will rain tomorrow (a binary prediction of Yes/No).
In the dataset there is about 50 input variables, and I have 65,000 entries in the dataset.
I am currently running a RNN with a single hidden layer, with 35 nodes in the hidden layer. I am using PyTorch's NLLLoss as my loss function, and Adaboost for the optimization function. I've tried many different learning rates, and 0.01 seems to be working fairly well.
After running for 150 epochs, I notice that I start to converge around .80 accuracy for my test data. However, I would wish for this to be even higher. However, it seems like the model is stuck oscillating around some sort of saddle or local minimum. (A graph of this is below)
What are the most effective ways to get out of this "valley" that the model seems to be stuck in?
Not sure why exactly you are using only one hidden layer and what is the shape of your history data but here are the things you can try:
Try more than one hidden layer
Experiment with LSTM and GRU layer and combination of these layers together with RNN.
Shape of your data i.e. the history you look at to predict the weather.
Make sure your features are scaled properly since you have about 50 input variables.
Your question is little ambiguous as you mentioned RNN with a single hidden layer. Also without knowing the entire neural network architecture, it is tough to say how can you bring in improvements. So, I would like to add a few points.
You mentioned that you are using "Adaboost" as the optimization function but PyTorch doesn't have any such optimizer. Did you try using SGD or Adam optimizers which are very useful?
Do you have any regularization term in the loss function? Are you familiar with dropout? Did you check the training performance? Does your model overfit?
Do you have a baseline model/algorithm so that you can compare whether 80% accuracy is good or not?
150 epochs just for a binary classification task looks too much. Why don't you start from an off-the-shelf classifier model? You can find several examples of regression, classification in this tutorial.

How to get started with Tensorflow

I am pretty new to Tensorflow, and I am currently learning it through given website https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:"
q1."we don't know how good it is yet.", I didn't understand this
quote as the simple model created is a simple slope equation and on
what it should train for?, as the model is a simple slope. Is it
require an perfect slope or what? why am I training that model and
for what?
q2.what is a loss function? Is loss function is used to determine the
accuracy of the model? Why is it required?
q3. I didn't understand " 'sums the squares of the deltas' between
the current model and the provided data."
q4.I didn't understood this part of code,"squared_deltas =
tf.square(linear_model - y)
this is the code:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
this may be simple questions, but I am a beginner to Tensorflow and having a hard time understanding it.
1) So you're kind of right about "Why should we train for a simple problem" but this is just an introduction piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are just trying to train to find the coefficients for the line of best fit.
2) A loss function in any machine learning context represents your error with your model. This usually means a function of your "distance" of your calculated value to the ground truth value. Think of it as an internal evaluation score. You want to minimise your loss so the gradients and parameter changes are based on your loss.
3/4) Your question here is more to do with least square regression. It's a statistical method to create lines of best fit between points. The deltas represent the differences between your calculated values and the truth values. The aim is to minimise the area of the squares and hence minise the error and have a better line of best fit.
What you are doing in this Tensorflow example is creating a machine learning model that will learn the coefficients for the line of best fit automatically using a least squares based system.
Pretty much all of your question have to-do with the loss function.
The loss function is a function that determines how far apart your output are from the expected (correct) output.
It has two usages:
Help the algorithm determine if the tweaking of the weight is helping going in the good or bad direction
Determinate the accuracy (~the number of time your system guesses the correct answer)
The loss function is the sum of the deltas witch is: the addition of the diff (delta) between the expected output and the actual output.
I think It's squared to magnifies the error the algorithm makes.

Resources