DDPG (Deep Deterministic Policy Gradients), how is the actor updated? - keras

I'm currently trying to implement DDPG in Keras. I know how to update the critic network (normal DQN algorithm), but I'm currently stuck on updating the actor network, which uses the equation:

dJ/dTheta = dQ/da * da/dTheta

So in order to reduce the loss of the actor network with respect to its weights, dJ/dTheta, it uses the chain rule to get dQ/da (from the critic network) * da/dTheta (from the actor network).
This looks fine, but I'm having trouble understanding how to derive the gradients from those 2 networks. Could someone perhaps explain this part to me?

So the main intuition is that here, J is something you want to maximize instead of minimize. Therefore, we can call it an objective function instead of a loss function. The equation simplifies down to:
dJ/dTheta = dQ / da * da / dTheta = dQ/dTheta
Meaning you want to change the parameters Theta to change Q. Since in RL we want to maximize Q, we do gradient ascent for this part instead. To do this, you simply perform gradient descent, except you feed in the gradients negated.
To derive the gradients, do the following:
Using the online actor network, send in a batch of states that was sampled from your replay memory. (The same batch used to train the critic)
Calculate the deterministic action for each of those states
Send those states, together with the actions calculated in step 2, to the online critic network to map those exact state-action pairs to Q values.
Calculate the gradient of the Q values with respect to the actions calculated in step 2. We can use tf.gradients(q_values, actions) to do this. Now we have dQ/dA.
Send the states to the online actor network again and map them to actions.
Calculate the gradient of the actions with respect to the online actor network weights, again using tf.gradients(a, network_weights). This will give you dA/dTheta
Multiply dQ/dA by -dA/dTheta to get GRADIENT ASCENT (feeding a gradient-descent optimizer the negated gradient makes it ascend). Up to sign, we are left with the gradient of the objective function, i.e., gradient J.
Divide all elements of gradient J by the batch size, i.e.,

for j in J:
    j / batch_size
Apply a variant of gradient descent by first zipping gradient J with the network parameters. This can be done using optimizer.apply_gradients(zip(J, network_params)), where optimizer is an instance such as tf.train.AdamOptimizer (apply_gradients is a method of the optimizer, not a top-level tf function).
And bam, your actor is training its parameters with respect to maximizing Q.
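For concreteness, here is a minimal sketch of those steps using the TensorFlow 1.x graph API. The layer sizes and names such as actor_params are my own assumptions for illustration, not something DDPG prescribes:

import tensorflow as tf  # TensorFlow 1.x graph API

state_dim, action_dim, batch_size = 3, 1, 64
states = tf.placeholder(tf.float32, [None, state_dim])

# Online actor: maps states to deterministic actions (steps 1-2).
with tf.variable_scope("actor"):
    hidden = tf.layers.dense(states, 32, tf.nn.relu)
    actions = tf.layers.dense(hidden, action_dim, tf.nn.tanh)

# Online critic: maps (state, action) pairs to Q values (step 3).
with tf.variable_scope("critic"):
    q_hidden = tf.layers.dense(tf.concat([states, actions], 1), 32, tf.nn.relu)
    q_values = tf.layers.dense(q_hidden, 1)

actor_params = tf.trainable_variables("actor")

# dQ/dA (step 4).
dq_da = tf.gradients(q_values, actions)[0]

# Chain rule with negation for ascent (steps 5-7): passing -dQ/dA as
# grad_ys makes tf.gradients compute -dQ/dA * dA/dTheta = -dJ/dTheta.
actor_grads = tf.gradients(actions, actor_params, grad_ys=-dq_da)

# Average over the batch (step 8).
actor_grads = [g / batch_size for g in actor_grads]

# Zip with the parameters and let the optimizer descend on -dJ/dTheta,
# i.e. ascend on J (step 9).
train_op = tf.train.AdamOptimizer(1e-4).apply_gradients(
    zip(actor_grads, actor_params))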
I hope this makes sense! I also had a hard time understanding this concept, and am still a little fuzzy on some parts to be completely honest. Let me know if I can clarify anything!

Related

Keras loss functions: how to round?

I'm trying to recognize turning points in sequences, the points after which some process behaves differently. I use a Keras model to do this. The input is the sequence (always the same length) and the output should be 0 before the turning point and 1 after it.
I want the loss function to depend on the distance between the actual turning point and the predicted turning point.
I tried rounding (to obtain the label 0 or 1), followed by summing the total number of 1's to get the "index" of the turning point. The assumption here is that the model gives just one turning point, as the data (synthetically produced) also has just one turning point. What I tried is:
from keras import backend as K

def dist_loss(yTrue, yPred):
    turningPointTrue = K.sum(yTrue)
    turningPointPred = K.sum(K.round(yPred))
    return K.abs(turningPointTrue - turningPointPred)
This does not work, the following error is given:
ValueError: An operation has None for gradient. Please make sure
that all of your ops have a gradient defined (i.e. are
differentiable). Common ops without gradient: K.argmax, K.round,
K.eval.
I think this means that K.round(yPred) gives a singular value, instead of a vector/tensor. Does anyone know how to solve this issue?
The round operation has no defined gradient, so it cannot be used inside a loss function. Training a neural network requires computing the gradient of the loss with respect to the weights, which implies that every part of the network and the loss must be differentiable (or that a differentiable approximation is available).
In your case you should try to find a differentiable approximation to round, though unfortunately I don't know of a standard one. One example of such an approximation is the softmax function, which acts as a differentiable approximation of the max function.
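One possible workaround, purely my own sketch rather than a standard Keras op: since yPred lies in [0, 1], a steep sigmoid centered at 0.5 behaves like a smooth round and keeps the gradient defined everywhere:

from keras import backend as K

def soft_round(y, steepness=20.0):
    # Steep sigmoid centered at 0.5: close to 0 below 0.5 and close to 1
    # above it, but differentiable everywhere, unlike K.round.
    return K.sigmoid(steepness * (y - 0.5))

def dist_loss(yTrue, yPred):
    turningPointTrue = K.sum(yTrue)
    turningPointPred = K.sum(soft_round(yPred))
    return K.abs(turningPointTrue - turningPointPred)

The steepness value trades off how closely this matches true rounding against how quickly the gradient vanishes away from 0.5.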

Feature scaling and its effect on various algorithms

Despite going through lots of similar questions related to this, I still could not understand why some algorithms are susceptible to feature scaling while others are not.
So far I have found that SVM and k-means are susceptible to feature scaling while linear regression and decision trees are not. Can somebody please explain why, either in general or for these four algorithms?
As I am a beginner, please explain this in layman's terms.
One reason I can think of off-hand is that SVM and k-means, at least with a basic configuration, use an L2 distance metric. An L1 or L2 distance between two points will give different results if you double delta-x or delta-y, for example.
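As a toy illustration of this L2 point (the numbers are made up purely to show the effect):

import numpy as np

# Two features on very different scales: height in metres, income in dollars.
X = np.array([[1.6, 30000.0],
              [1.8, 31000.0],
              [1.7, 90000.0]])

# Raw data: income dominates the L2 distance entirely.
print(np.linalg.norm(X[0] - X[1]))   # ~1000.0
print(np.linalg.norm(X[0] - X[2]))   # ~60000.0

# Standardize each feature; both now contribute comparably, which can
# change neighbourhoods and hence k-means or SVM results.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(Xs[0] - Xs[1]))
print(np.linalg.norm(Xs[0] - Xs[2]))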
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
With decision trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test; you pass this count into your entropy function. Because this rule format does not depend on the scale of the dimension (there is no continuous distance metric involved), we again have invariance.
Somewhat different reasons for each, but I hope that helps.

Calculate gradient of neural network

I am reading about adversarial images and breaking neural networks. I am trying to work through the article step by step, but due to my inexperience I am having a hard time trying to understand the following instructions.
At the moment, I have a logistic regression model for the MNIST data set. If you give an image, it will predict the number that it most likely is...
saver.restore(sess, "/tmp/model.ckpt")
# image of number 7
x_in = np.expand_dims(mnist.test.images[0], axis=0)
classification = sess.run(tf.argmax(pred, 1), feed_dict={x:x_in})
print(classification)
Now, the article states that in order to break this image, the first thing we need to do is get the gradient of the neural network. In other words, this will tell me the direction needed to make the image look more like a number 2 or 3, even though it is a 7.
The article states that this is relatively simple to do using back propagation. So you may define a function...
compute_gradient(image, intended_label)
...and this basically tells us what kind of shape the neural network is looking for at that point.
This may seem easy to implement to those more experienced but the logic evades me.
From the parameters of the function compute_gradient, I can see that you feed it an image and an array of labels where the value of the intended label is set to 1.
But I do not see how this is supposed to return the shape of the neural network.
Anyways, I want to understand how I should implement this back propagation algorithm to return the gradient of the neural network. If the answer is not very straightforward, I would like some step-by-step instructions as to how I may get my back propagation to work as the article suggests it should.
In other words, I do not need someone to just give me some code that I can copy but I want to understand how I may implement it as well.
Back propagation involves calculating the error in the network's output (the cost function) as a function of the inputs and the parameters of the network, then computing the partial derivative of the cost function with respect to each parameter. It's too complicated to explain in detail here, but this chapter from a free online book explains back propagation in its usual application as the process for training deep neural networks.
Generating images that fool a neural network simply involves extending this process one step further, beyond the input layer, to the image itself. Instead of adjusting the weights in the network slightly to reduce the error, we adjust the pixel values slightly to increase the error, or to reduce the error for the wrong class.
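As a sketch of what compute_gradient could look like with the question's TensorFlow setup (this assumes pred are the model's logits, x is the input placeholder, and sess is the session from the snippet above):

import tensorflow as tf  # TensorFlow 1.x, as in the question's snippet

# One-hot placeholder for the label we *want* the network to predict.
intended = tf.placeholder(tf.float32, shape=pred.get_shape())
loss = tf.nn.softmax_cross_entropy_with_logits(labels=intended, logits=pred)

# Backpropagate past the weights, all the way to the input image.
image_grad = tf.gradients(loss, x)[0]

def compute_gradient(image, intended_label):
    # d(loss)/d(pixel) for every pixel of the input image.
    return sess.run(image_grad, feed_dict={x: image, intended: intended_label})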
There's an easy (though computationally intensive) way to approximate the gradient with a technique from Calc 101: for a small enough e, df/dx is approximately (f(x + e) - f(x)) / e.
Similarly, to calculate the gradient with respect to an image with this technique, calculate how much the loss/cost changes after adding a small change to a single pixel, save that value as the approximate partial derivative with respect to that pixel, and repeat for each pixel.
Then the gradient with respect to the image is approximately:
(
  (cost(x1+e, x2, ..., xn) - cost(x1, x2, ..., xn)) / e,
  (cost(x1, x2+e, ..., xn) - cost(x1, x2, ..., xn)) / e,
  ...
  (cost(x1, x2, ..., xn+e) - cost(x1, x2, ..., xn)) / e
)
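A minimal NumPy sketch of that approximation; cost here stands for any callable that maps a flat image array to a scalar loss (the function name is mine, not the article's):

import numpy as np

def approx_image_gradient(cost, image, eps=1e-4):
    # Finite-difference estimate of d(cost)/d(pixel) for every pixel.
    grad = np.zeros_like(image)
    base = cost(image)
    for i in range(image.size):
        bumped = image.copy()
        bumped.flat[i] += eps
        grad.flat[i] = (cost(bumped) - base) / eps
    return grad

Note this needs one extra forward pass per pixel, so for MNIST's 784 pixels it is feasible but far slower than a single backpropagation pass.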

Representing classification confidence

I am working on a simple AI program that classifies shapes using an unsupervised learning method. Essentially, I use the number of sides and the angles between the sides, and generate aggregate percentages of match against the ideal values for each shape. This helps me introduce some fuzziness into the result.
The problem is how to represent the degree of error or confidence in the classification. For example: a small rectangle that looks very much like a square would yield high membership values for both categories, but how can I represent the degree of error?
Thanks
Your confidence depends on the model you use. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi-dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence intervals based on your empirical results. E.g., you can fit a multi-dimensional Gaussian distribution to your empirical observations of "rectangle objects"; once you get a new object, you simply check the probability of such a value under your Gaussian distribution and use it as your confidence (which is quite well justified under the assumption that your "observation" errors have a normal distribution).
Distance based, simple approach
A less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interval. For example, if you simply measure the distance from some perfect shape to your new object in some metric (which yields results in [0, inf)), you could map it using some sigmoid-like function, e.g.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
The hyperbolic tangent will "squash" values to the [0,1] interval, and the only remaining thing to do would be to select some scaling factor, as tanh grows quite quickly.
Such approach would be less valid in the mathematical terms, but would be similar to the approach taken in neural networks.
Relative approach
A more probabilistic approach could also be defined using your distance metric. If you have distances to each of your "perfect shapes", you can calculate the probability of an object being classified as some class under the assumption that classification is performed at random, with probability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
                        inv( d_i )
conf(object, class_i) = ------------------
                        sum_j inv( d_j )
where
inv( d_i ) = max( d_j ) - d_i
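A quick NumPy sketch of this relative scheme (the distances are made up for illustration):

import numpy as np

# Distances from the new object to each "perfect shape".
d = np.array([1.2, 1.3, 7.0])   # e.g. square, rectangle, triangle

inv_d = d.max() - d             # inv(d_i) = max(d_j) - d_i
conf = inv_d / inv_d.sum()      # relative confidence per class
print(conf)                     # similar distances give similar confidences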
Conclusions
The first two ideas can also be incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in a confidence of around 0.5 for both rectangle and square, while with the first approach it would be something closer to 0.01 (depending on how many such small objects you have in the "training" set). This shows the difference: the first two approaches express your confidence in the classification as a particular shape in itself, while the third one gives a relative confidence (so it can be low iff the confidence is high for some other class, whereas the first two can simply answer "no classification is confident").
Building slightly on what lejlot has put forward, my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V, and p is the classification you want the confidence of. You can then use something along the lines of the following as your confidence measure.
1-tanh( M(V, p) )
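A minimal SciPy sketch of that idea (the feature vectors, side count and mean corner angle, are invented for illustration):

import numpy as np
from scipy.spatial.distance import mahalanobis

# "Perfect" examples of one class: (number of sides, mean angle in degrees).
V = np.array([[4.0, 90.0],
              [4.0, 89.5],
              [4.0, 90.5]])

mu = V.mean(axis=0)
VI = np.linalg.pinv(np.cov(V.T))   # pseudo-inverse: covariance may be singular

p = np.array([4.0, 88.0])          # the new object's features
confidence = 1 - np.tanh(mahalanobis(p, mu, VI))
print(confidence)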

How to scale the input to DBSCAN in scikit-learn

Should the input to sklearn.cluster.DBSCAN be pre-processed?
In the example http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py the distances between the input samples X are calculated and normalized:
D = distance.squareform(distance.pdist(X))
S = 1 - (D / np.max(D))
db = DBSCAN(eps=0.95, min_samples=10).fit(S)
In another example for v0.14 (http://jaquesgrobler.github.io/online-sklearn-build/auto_examples/cluster/plot_dbscan.html) some scaling is done:
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
I based my code on the latter example and have the impression clustering works better with this scaling. However, this scaling "standardizes features by removing the mean and scaling to unit variance". I am trying to find 2D clusters. If my clusters are distributed in a square area, let's say 100x100, I see no problem with the scaling. However, if they are distributed in a rectangular area, e.g. 800x200, the scaling 'squeezes' my samples and changes the relative distances between them in one dimension. This deteriorates the clustering, doesn't it? Or am I understanding something wrong?
Do I need to apply some preprocessing at all, or can I simply input my 'raw' data?
It depends on what you are trying to do.
If you run DBSCAN on geographic data, and distances are in meters, you probably don't want to normalize anything, but set your epsilon threshold in meters, too.
And yes, a non-uniform scaling in particular does distort distances, while a uniform (non-distorting) scaling is equivalent to just using a different epsilon value!
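A quick scikit-learn sketch of that equivalence (random data; the values are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.rand(100, 2)

a = DBSCAN(eps=0.3, min_samples=10).fit(X)
b = DBSCAN(eps=0.6, min_samples=10).fit(X * 2)  # uniform scaling by 2

# Identical labels: the uniform scaling is absorbed by scaling eps too.
print((a.labels_ == b.labels_).all())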
Note that in the first example, apparently a similarity and not a distance matrix is processed. S = 1 - (D / np.max(D)) is a heuristic to convert a distance matrix into a similarity matrix. Epsilon 0.95 then effectively means at most "0.05 of the maximum dissimilarity observed". An alternate version that should yield the same result is:
D = distance.squareform(distance.pdist(X))
S = np.max(D) - D
db = DBSCAN(eps=0.95 * np.max(D), min_samples=10).fit(S)
Whereas in the second example, fit(X) actually processes the raw input data, and not a distance matrix. IMHO that is an ugly hack, to overload the method this way. It's convenient, but it leads to misunderstandings and maybe even incorrect usage sometimes.
Overall, I would not take sklearn's DBSCAN as a reference. The whole API seems to be heavily driven by classification, not by clustering. Usually, you don't fit a clustering; you do that for supervised methods only. Plus, sklearn currently does not use indexes for acceleration, and needs O(n^2) memory (which DBSCAN usually would not).
In general, you need to make sure that your distance works. If your distance function doesn't work, no distance-based algorithm will produce the desired results. On some data sets, naive distances such as Euclidean work better when you first normalize your data. On other data sets, you have a good understanding of what distance is (e.g. geographic data; standardizing it obviously does not make sense, nor does Euclidean distance!).
