My question follows my examination of the code in the PyTorch DQN tutorial, but then refers to Reinforcement Learning in general: what are the best practices for optimal exploration/exploitation in reinforcement learning?
In the DQN tutorial, steps_done is a global variable and EPS_DECAY = 200. This means that after 128 steps the epsilon threshold is 0.500, after 889 steps it is 0.0600, and after 1500 steps it is 0.05047.
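For reference, the schedule behind those numbers appears to be the exponential decay used in the tutorial, roughly as follows (EPS_START and EPS_END are inferred from the thresholds quoted above):

import math

EPS_START = 0.9    # initial epsilon (inferred)
EPS_END = 0.05     # final epsilon (inferred)
EPS_DECAY = 200    # decay constant, in steps

def eps_threshold(steps_done):
    # epsilon decays exponentially from EPS_START towards EPS_END
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)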
This might work for the CartPole problem featured in the tutorial – where the early episodes might be very short and the task fairly simple – but what about on more complex problems in which far more exploration is required? For example, if we had a problem with 40,000 episodes, each of which had 10,000 timesteps, how would we set up the epsilon greedy exploration policy? Is there some rule of thumb that’s used in RL work?
Thank you in advance for any help.
Well, for that I would say it is better to use a linearly annealed epsilon-greedy policy, which updates epsilon based on the number of steps taken:
EXPLORE = 3000000        # how many time steps to play (epsilon is annealed over this many steps)
FINAL_EPSILON = 0.001    # final value of epsilon
INITIAL_EPSILON = 1.0    # starting value of epsilon

# executed once per time step
if epsilon > FINAL_EPSILON:
    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
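As a rough sketch of how this fits into a training loop (env, state and best_action here are placeholders, not part of the snippet above):

import random

epsilon = INITIAL_EPSILON
for t in range(EXPLORE):
    # epsilon-greedy action selection
    if random.random() < epsilon:
        action = env.action_space.sample()   # explore: random action
    else:
        action = best_action(state)          # exploit: greedy w.r.t. the current value estimates

    # linear annealing: one small, constant decrement per time step
    if epsilon > FINAL_EPSILON:
        epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE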
In risk management it is frequent that we want to be more conservative in the face of high uncertainty. Is there a way of "adjusting" a linear regression prediction based on its uncertainty (i.e., the standard deviation of the prediction)? For example, if prediction 1 = prediction 2 = 100, but prediction 2 has a higher uncertainty, then I'd like to adjust prediction 2 to be smaller than prediction 1, because I'm acknowledging the risk and being more conservative.
I assume this is a common problem, but I haven't been able to find anything online for some reason.
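To make this concrete, something along these lines is what I have in mind (k would be a risk-aversion parameter I pick myself; the names are just placeholders):

import numpy as np

def conservative_prediction(mean_pred, std_pred, k=1.0):
    # penalize each point prediction by k standard deviations, so that of two
    # equal predictions the more uncertain one ends up being adjusted further down
    return np.asarray(mean_pred) - k * np.asarray(std_pred)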
Thanks!
I would like a recommendation on the following.
I am new to scikit-learn, so please excuse my ignorance. Using GridSearchCV I am trying to optimize a DecisionTreeRegressor. The broader I make the parameter space, the worse the scoring gets.
Setting min_samples_split to range(2, 10) gives me a neg_mean_squared_error of -0.04. When setting it to range(2, 5), the score is -0.004.
simple_tree = GridSearchCV(tree.DecisionTreeRegressor(random_state=42),
                           n_jobs=4,
                           param_grid={'min_samples_split': range(2, 10)},
                           scoring='neg_mean_squared_error',
                           cv=10,
                           refit='neg_mean_squared_error')
simple_tree.fit(x_tr, y_tr).score(x_tr, y_tr)
I would expect an equal or better (less negative) score for the more extensive grid search compared to the less extensive one.
You're right: you should get a score at least as close to 0 when searching over more parameter values, if you were really comparing the same model each time. That is not the case in the code you provided, because you have not set the random_state parameter in your Decision Tree.
Use DecisionTreeRegressor(random_state=42) (or any integer) and you should get more sensible results.
Using simple_tree.best_score_ gives you the mean cross-validated score of the best parameter setting over all CV folds.
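For example, after fitting you can look at the cross-validated results instead of the training-set score (a sketch reusing the names from the question):

simple_tree.fit(x_tr, y_tr)
print(simple_tree.best_params_)   # the parameter setting that won the grid search
print(simple_tree.best_score_)    # its mean cross-validated neg_mean_squared_error
# Note: simple_tree.score(x_tr, y_tr) is the training-set score of the refit model,
# which is not comparable to the cross-validated score above.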
I'm reading through the Keras visualization documentation, and the number of iterations used when running gradient ascent on the filter activation loss was set to 20:
# gradient ascent on the input image for a fixed 20 steps
for i in range(20):
    loss_value, grads_value = iterate([input_img_data])
    input_img_data += grads_value * step
Now, if I want to find the optimal number of iterations, how do I find it? Should I wait until the gradient becomes zero and then use the second-derivative test to check whether it is a maximum or a minimum? If so, are there built-in functions in Keras that support this?
Do you mean epochs when you say the best number of iterations? If so, I recommend using early stopping: when the loss value stops changing for a while, the optimization has converged and that gives you the best number of iterations.
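For the gradient-ascent loop in the question, the same idea can be hand-rolled: stop once the loss stops improving by more than some tolerance. A minimal sketch, reusing iterate, input_img_data and step from the snippet above (max_iters and tol are made-up knobs):

prev_loss = None
max_iters = 200
tol = 1e-4

for i in range(max_iters):
    loss_value, grads_value = iterate([input_img_data])
    input_img_data += grads_value * step
    # stop once the activation loss has effectively stopped increasing
    if prev_loss is not None and abs(loss_value - prev_loss) < tol:
        break
    prev_loss = loss_value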
A project I am working on has a reinforcement learning stage using the REINFORCE algorithm. The model has a final softmax activation layer, and because of that a negative learning rate is used as a replacement for negative rewards. I have some doubts about this process and can't find much literature on using a negative learning rate.
Does reinforcement learning work with switching the learning rate between positive and negative? And if not, what would be a better approach: get rid of the softmax, or does Keras have a nicer option for this?
Loss function:
def log_loss(y_true, y_pred):
    '''
    Keras 'loss' function for the REINFORCE algorithm,
    where y_true is the action that was taken, and updates
    with the negative gradient will make that action more likely.
    We use the negative gradient because Keras expects training data
    to minimize a loss function.
    '''
    return -y_true * K.log(K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon()))
Switching learning rate:
K.set_value(optimizer.lr, lr * (+1 if won else -1))
learner_net.train_on_batch(np.concatenate(st_tensor, axis=0),
                           np.concatenate(mv_tensor, axis=0))
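For comparison (this is not from the original code): the textbook REINFORCE formulation keeps the learning rate positive and instead scales the log-probability term by the signed reward. With Keras this can be sketched by passing the rewards as per-sample weights, so the log_loss above gets multiplied by them (game_results is a placeholder):

# keep optimizer.lr positive and weight each sample by its signed reward;
# a negative reward then pushes the taken action's probability down
rewards = np.array([+1.0 if won else -1.0 for won in game_results])
learner_net.train_on_batch(np.concatenate(st_tensor, axis=0),
                           np.concatenate(mv_tensor, axis=0),
                           sample_weight=rewards)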
Update: test results
I ran a test with only positive reinforcement samples, omitting all negative examples and thus the negative learning rate. The winning rate is rising and the model is improving, so I can safely assume that using a negative learning rate is not correct.
Does anybody have any thoughts on how we should implement it?
Update: model explanation
We are trying to recreate AlphaGo as described by DeepMind, specifically the slow (supervised learning) policy network:
For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning [13, 21–24]. The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a.
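As a rough sketch of that kind of architecture in Keras (the number of layers, filter counts and the 19x19x17 input shape here are placeholders, not the values DeepMind used):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

policy_net = Sequential([
    # a stack of convolutional layers with rectifier nonlinearities
    Conv2D(64, (3, 3), padding='same', activation='relu', input_shape=(19, 19, 17)),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    Flatten(),
    # final softmax layer: a probability distribution over all board positions
    Dense(19 * 19, activation='softmax'),
])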
Not sure if it is the best way, but at least I found a way that works.
For all negative training samples I reuse the network's prediction, set the action I want to unlearn to zero, and adjust all values to sum up to one again.
I tried several ways of redistributing the removed probability mass afterwards, but haven't run enough tests to be sure what works best:
apply a softmax (the action that has to be unlearned gets a nonzero value again..)
redistribute the old action's value evenly over all other actions
set all illegal action values to zero and distribute the total removed value
distribute the removed value proportionally to the values of the other actions
There are probably several other ways to do this, and which works best may depend on the use case, but this one works at least. A rough sketch of the zero-and-renormalize step is shown below.
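A minimal sketch of that zero-and-renormalize step, using the proportional-redistribution variant (names are placeholders):

import numpy as np

def unlearn_target(predicted_probs, action_to_unlearn):
    # start from the network's own prediction, zero out the action to unlearn,
    # then rescale the remaining probabilities so they sum to one again
    target = np.array(predicted_probs, dtype=float)
    target[action_to_unlearn] = 0.0
    remaining = target.sum()
    if remaining > 0:
        target /= remaining   # redistribute proportionally to the other actions' values
    return target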
I am wondering if categorical features, after converting to a one-hot encoding (e.g. 0 0 0 1 0 0 for 6 possible values of the variable), should be scaled along with the real-valued features using the svm-scale function. The libsvm guide apparently says so, I think.
Also, what is the effect on learning in an SVM if some features are undiscriminating, e.g. random noise? Should I remove such features before training? My guess is that they can affect learning, because an SVM essentially computes Euclidean distances between data points represented as feature vectors. I am not much concerned with running time, as the number of features is small. Please mention a standard feature selection algorithm implementation for SVMs. Any suggestion is welcome.
Thank you.
You have several questions in there:
1) Should 0-1 features get scaled?
2) What is the effect of noise features?
3) Should noise features be removed?
4) If so, how?
The general answer to (1) and (3) is that you should use cross-validation (or a holdout validation set), try it both ways, and keep whichever one scores better (a sketch of such a comparison is below). If I had to guess, I'd say that scaling 0-1 features probably doesn't matter very much, because SVMs are not that scale-dependent as long as all of the features are O(1), which those are. A moderate number of noise features is probably OK, too. As for (2), you are correct that noise features usually degrade SVM performance somewhat. Feature selection is a big topic; there is a decent introduction to it in the scikit-learn user guide.
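For instance, a quick way to run that comparison with scikit-learn (X, y and the choice of scaler are placeholders):

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# option A: scale everything, one-hot columns included
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
# option B: no scaling at all
plain_svm = SVC(kernel='rbf')

print(cross_val_score(scaled_svm, X, y, cv=5).mean())
print(cross_val_score(plain_svm, X, y, cv=5).mean())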