Defining TF-Agents action in py_environment.PyEnvironment Class

So I'm trying to define the action_spec in the py_environment.PyEnvironment class for a DQN network using TF-Agents. Is the action limited to returning integer values? I have read through a few online tutorials, and in every case it was for a simple game like chess or tic-tac-toe, where you only needed to define a single integer in a small space to make a move/action.
For my use case, the action needs to define sets of coordinates in two-dimensional space. To make it more complicated, I want the agent to choose not only the X, Y coordinates but also the number of [X, Y] sets.
I need action to return a list of lists. Something like this:
action = [[100, 200], [300, 350], [550, 876]]
With this in mind, I'm not entirely sure how to define the action_spec.
I saw this thread here: tf_agents custom time_step_spec
Where it shows the use of a dictionary of ArraySpecs(). Is that what I need to do?

So according to this link: https://github.com/tensorflow/agents/issues/329
DQN for TF-Agents only supports a single action; the dimensions of the action_spec cannot be larger than shape (1,). Keep this in mind before using TF-Agents.
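For reference, here is a minimal sketch of the only kind of action_spec the DQN agent accepts: a single bounded integer (the bounds below are an assumption for illustration).
import numpy as np
from tf_agents.specs import array_spec

# A single scalar integer action in [0, 9]; DQN cannot handle a
# composite spec or a shape larger than (1,).
action_spec = array_spec.BoundedArraySpec(
    shape=(), dtype=np.int32, minimum=0, maximum=9, name='action')
A variable-length list of [X, Y] pairs would require a composite or multi-dimensional spec, which the DQN agent does not support; you would need a different agent (e.g. DDPG for continuous action vectors) or a fixed-size, flattened encoding of the coordinate grid.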

Related

Why do object detection methods have an output value for every class

Most recent object detection methods rely on a convolutional neural network. They create a feature map by running input data through a feature extraction step. They then add more convolutional layers to output a set of values like so (this set is from YOLO, but other architectures like SSD differ slightly):
pobj: probability of being an object
c1, c2 ... cn: indicating which class the object belongs to
x, y, w, h: bounding box of the object
However, one particular box cannot be multiple objects. As in, wouldn't having a high value for, say, c1 mean that the values for all the others c2 ... cn would be low? So why use different values for c1, c2 ... cn? Couldn't they all be represented by a single value, say 0-1, where each object has a certain range within the 0-1, say 0-0.2 is c1, 0.2-0.4 is c2 and so on...
This would reduce the dimension of the output from NxNx(5+C) (5 for the probability and bounding box, +C one for each class) to NxNx(5+1) (5 same as before and 1 for the class)
Thank you
Short answer, NO! That is almost certainly not an acceptable solution. It sounds like your core question is: why is a single value in the range [0,1] not a sufficient, compact output for object classification? As a clarification, I'd say this doesn't really have to do with single-shot detectors; the outputs from 2-stage detectors and almost all classification networks follow this same 1D embedding structure. As a secondary clarification, I'd say that many 1-stage networks also don't output pobj in their original implementations (YOLO is the main one that does, but RetinaNet and, I believe, SSD do not).
An object's class is a categorical attribute. Assumed within a standard classification problem is that the set of possible classes is flat (i.e. no class is a subclass of any other), mutually exclusive (each example falls into only a single class), and unrelated (not quite the right term here but essentially no class is any more or less related to any other class).
This assumed attribute structure is well represented by an orthonormal encoding vector of the same length as the set of possible attributes. A vector [1,0,0,0] is no more similar to [0,1,0,0] than it is to [0,0,0,1] in this space.
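As a quick numpy illustration of that claim (a sketch added here, not part of the original argument): every pair of one-hot vectors is equally far apart, so the encoding expresses no inter-class similarity.
import numpy as np

classes = np.eye(4)  # one-hot vectors [1,0,0,0], [0,1,0,0], ...
for i in range(4):
    for j in range(i + 1, 4):
        # every pairwise distance prints as sqrt(2) ~ 1.414
        print(i, j, np.linalg.norm(classes[i] - classes[j]))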
(As an aside, a separate branch of ML problems called multilabel classification removes the mutual exclusivity constraint, so [0,1,1,0] and [0,1,1,1] would both be valid label predictions. In this space, class or label combinations COULD be construed as more or less related, since they share constituent labels or "basis vectors" in the orthonormal categorical attribute space. But enough digression...)
A single, continuous variable output for class destroys the assumption that all classes are unrelated. In fact, it assumes that the relation between any two classes is exact and quantifiable! What an assumption! Consider attempting to arrange the classes of, let's say, the ImageNet classification task along a single dimension. Bus and car should be close, no? Let's say 0.1 and 0.2, respectively, in our 1D embedding range of [0,1]. Zebra must be far away from them, maybe 0.8. But should it be close to zebra fish (0.82)? Is a striped shirt closer to a zebra or a bus? Is the moon more similar to a bicycle or a trumpet? And is a zebra really 5 times more similar to a zebra fish than a bus is to a car? The exercise is immediately, patently absurd. A 1D embedding space for object class is not sufficiently rich to capture the differences between object classes.
Why can't we just place object classes randomly in the continuous range [0,1]? In a theoretical sense nothing is stopping you, but the gradient of the network would become horrendously, unmanageably non-convex and conventional approaches to training the network would fail. Not to mention the network architecture would have to encode extremely non-linear activation functions to predict the extremely hard boundaries between neighboring classes in the 1D space, resulting in a very brittle and non-generalizable model.
From here, the nuanced reader might suggest that in fact, some classes ARE related to one another (i.e. the unrelated assumption of the standard classification problem is not really correct). Bus and car are certainly more related than bus and trumpet, no? Without devolving into a critique on the limited usefulness of strict ontological categorization of the world, I'll simply suggest that in many cases there is an information embedding that strikes a middle ground. A vast field of work has been devoted to finding embedding spaces that are compact (relative to the exhaustive enumeration of "everything is its own class of 1") but still meaningful. This is the work of principal component analysis and object appearance embedding in deep learning.
Depending on the particular problem, you may be able to take advantage of a more nuanced embedding space better suited towards the final task you hope to accomplish. But in general, canonical deep learning tasks such as classification / detection ignore this nuance in the hopes of designing solutions that are "pretty good" generalized over a large range of problem spaces.
For the object classification head, a cross-entropy loss function is usually used, which operates on probability distributions to compute the difference between the ground truth (a one-hot encoded vector) and the predicted class scores.
On the other hand, you are proposing a different way of encoding the ground-truth class labels, which could be used with a custom loss function, say an L1/L2 loss. That looks theoretically plausible, but it is unlikely to be as good as cross-entropy in terms of model convergence/optimization.
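A small numpy sketch of the contrast (the numbers are made up for illustration): cross-entropy on a one-hot target only penalizes the probability assigned to the true class, while an L1/L2 loss on a scalar class code smuggles in a notion of inter-class distance.
import numpy as np

probs = np.array([0.1, 0.2, 0.6, 0.1])   # predicted class scores (softmax output)
one_hot = np.array([0., 0., 1., 0.])     # ground truth: class 2
print(-np.sum(one_hot * np.log(probs)))  # cross-entropy, ~0.51

# the proposed scalar encoding: suppose class 2 occupies [0.4, 0.6)
scalar_pred, scalar_true = 0.55, 0.50
print((scalar_pred - scalar_true) ** 2)  # L2 loss now encodes class "closeness"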

Is there a way to supply a numerical function to JiTCODE’s function argument instead of a symbolic one?

I am getting a function (a learned dynamical system) through a neural network and want to pass it to JiTCODE to calculate trajectories, Lyapunov exponents, etc. As per the JiTCODE documentation, the function f has to be a symbolic function. Is there any way to change this since ultimately JiTCODE is going to lambdify the symbolic function?
Basically, this is what I'm doing right now:
# learns derivatives from the neural-net model
# returns an array of numbers [\dot{x}, \dot{y}] for input [x, y]
learned_fn = lambda t, y0: NN_model(t, y0)
ODE = jitcode_lyap(learned_fn, n_lyap=2)
ODE.set_integrator("vode")
First beware that JiTCODE does not take regular functions like your learned_fn as an input. It takes either iterables of symbolic expressions or generator functions returning symbolic expressions. This is why your example code will likely produce an error.
What you are asking for
You can “inject” any derivative with the right signature into JiTCODE by changing the f property and telling it that it failed compiling the actual derivative. Here is a minimal example doing this:
from jitcode import jitcode, y

ODE = jitcode([0])
ODE.f = lambda t, y: y[0]
ODE.compile_attempt = False
ODE.set_integrator("dopri5")
ODE.set_initial_value([1], 0.0)
for time in range(30):
    print(time, *ODE.integrate(time))
Why you probably do not want to do this
Ignoring Lyapunov exponents for a second, the entire point of JiTCODE is to hard-code your derivative for you and pass it to SciPy’s ode or solve_ivp, which perform the actual integration. Thus the above example code is just an overly complicated way of passing a function to one of SciPy’s standard integrators (here ode), with no advantage. If your NN_model is very efficiently implemented in the first place, you may not even gain a speed boost from JiTCODE’s auto-compilation.
The main reason to use JiTCODE’s Lyapunov-exponent capabilities is that it automatically obtains the Jacobian and the ODE for the tangent-vector evolution (needed for the Benettin method) from the symbolic representation of the derivative. Without a symbolic input, it cannot possibly do this. You could theoretically inject a tangent-vector ODE as well, but then again you would leave little for JiTCODE to do and you would probably be better off using SciPy’s ode or solve_ivp directly.
What you probably need
If you want to use JiTCODE, you need to write a small piece of code that translates the output of your neural-network training to a symbolic representation of your ODE as needed by JiTCODE. This is probably much less scary than it sounds. You just need to obtain the trained coefficients and insert them into the equations of the general form of the neural network.
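For instance, here is a hedged sketch of that translation for a small fully connected network with tanh activations; the weights, shapes and layer structure are placeholders standing in for whatever your training actually produced.
from jitcode import jitcode_lyap, y
from symengine import tanh

n = 2  # dimension of the ODE (placeholder)
# pretend these coefficients came out of training
W1 = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
b1 = [0.0, 0.1, -0.2]
W2 = [[1.0, -0.5, 0.3], [0.2, 0.7, -0.1]]
b2 = [0.0, 0.0]

state = [y(i) for i in range(n)]
hidden = [tanh(sum(W1[j][i] * state[i] for i in range(n)) + b1[j])
          for j in range(len(b1))]
f_sym = [sum(W2[k][j] * hidden[j] for j in range(len(b1))) + b2[k]
         for k in range(n)]  # symbolic derivative, as JiTCODE expects

ODE = jitcode_lyap(f_sym, n_lyap=2)
ODE.set_integrator("vode")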
If you are lucky and your NN_model fully supports duck typing, you may instead do something like this:
from jitcode import t, y

n = 10  # dimension of your ODE
NN_input = [y(i) for i in range(n)]
learned_fn = NN_model(t, NN_input)[1]
The idea is that you feed NN_model once with abstract symbolic input (t and NN_input). NN_model then acts once on this abstract input, providing you with an abstract result (here you need the duck-typing support). If I interpreted the output of your NN_model correctly, the second component of this result should be the abstract derivative as required by JiTCODE as an input.
Note that your NN_model appears to expect dimensions to be indices, but JiTCODE’s y expects dimensions to be function arguments. Thus you cannot just choose NN_input = y, but you have to transform it as above.
To quote directly from the linked documentation:
JiTCODE takes an iterable (or generator function or dictionary) of symbolic expressions, which it translates to C code, compiles on the fly,
so there is no lambdification going on; the function is parsed, not just evaluated.
But in general that should be no problem: you just use the JiTCODE-provided symbolic vector y and symbol t instead of the function arguments t, y of the right-hand side of the ODE.

Random Index from a Tensor (Sampling with Replacement from a Tensor)

I'm trying to manipulate individual weights of different neural nets to see how their performance degrades. As part of these experiments, I'm required to sample randomly from their weight tensors, which I've come to understand as sampling with replacement (in the statistical sense). However, since it's high-dimensional, I've been stumped by how to do this in a fair manner. Here are the approaches and research I've put into considering this problem:
This was previously implemented by selecting a random layer and then selecting a random weight in that layer (ignore the implementation of picking a random weight). Since layers are different sizes, we discovered that weights were being sampled unevenly.
I considered what would happen if we sampled according to the numpy.shape of the tensor; however, I realize now that this encounters the same problem as above.
Consider what happens to a rank 2 tensor like this:
[[*, *, *],
[*, *, *, *]]
Selecting a row randomly and then a value from that row results in an unfair selection. This method could work if you're able to assert that this scenario never occurs, but it's far from a general solution.
Note that this possible duplicate actually implements it in this fashion.
I found people suggesting flattening the tensor and using numpy.random.choice to select randomly from a 1D array. That's a simple solution, except I have no idea how to map the flattened index back into the tensor's original shape. Further, flattening millions of weights would be a somewhat slow implementation.
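(For what it's worth, for a single regular-shaped array that inversion is available as numpy.unravel_index, and no actual flattening or copying is needed; a sketch with a stand-in weight matrix:)
import numpy as np

weights = np.random.rand(64, 32)              # stand-in weight matrix
flat_index = np.random.choice(weights.size)   # uniform over all entries
multi_index = np.unravel_index(flat_index, weights.shape)
print(multi_index, weights[multi_index])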
I found someone discussing tf.random.multinomial here, but I don't understand enough of it to know whether it's applicable or not.
I ran into this paper about reservoir sampling, but again, it went over my head.
I found another paper which specifically discusses tensors and sampling techniques, but it went even further over my head.
A teammate found this other paper which talks about random sampling from a tensor, but it's only for rank 3 tensors.
Any help understanding how to do this? I'm working in Python with Keras, but I'll take an algorithm in any form that it exists. Thank you in advance.
Before I forget to document the solution we arrived at, I'll talk about the two different paths I see for implementing this:
Use a total ordering on scalar elements of the tensor. This is effectively enumerating your elements, i.e. flattening them. However, you can do this while maintaining the original shape. Consider this sketch in Python:
from typing import Tuple

def count_scalars(tensor) -> int:
    """Counts the scalar elements under a (possibly ragged) nested list."""
    if isinstance(tensor, list):
        return sum(count_scalars(sub) for sub in tensor)
    return 1

def sample_tensor(tensor, chosen_index: int) -> Tuple[int, ...]:
    """Maps a chosen random number to its index in the given tensor.

    Args:
        tensor: A ragged-array n-tensor (nested lists of scalars).
        chosen_index: An integer in [0, num_scalar_elements_in_tensor).

    Returns:
        The tuple of indices that accesses this element in the tensor.
    """
    remaining = chosen_index
    for i, sub_list in enumerate(tensor):
        if isinstance(sub_list, list):
            size = count_scalars(sub_list)
            if remaining >= size:
                remaining -= size       # skip this whole sub-list
            else:
                return (i,) + sample_tensor(sub_list, remaining)
        else:
            if remaining == 0:
                return (i,)
            remaining -= 1              # skip past this scalar
    raise IndexError("chosen_index out of range")
The idea is to count down until you reach your element, with bookkeeping for indices along the way: to sample, draw chosen_index uniformly from [0, count_scalars(tensor)) and map it through sample_tensor.
We need to make crucial assumptions here. 1) All lists will eventually contain only scalars. 2) By direct consequence, if a list contains lists, it does not also contain scalars at the same level. (Stop and convince yourself of (2).)
A critical note: we cannot know the number of scalars in an arbitrary list without counting them, unless the list consists homogeneously of scalars (where len suffices). That is why count_scalars descends into each sub-list before we decide whether to skip it; for repeated sampling you would want to memoize these counts rather than recount on every call.
This algorithm has some consequences:
It's the fastest in its style of approaching the problem: if you want to write a function f: [0, total_elems) -> Tuple[int], you must know the number of preceding scalar elements along the total ordering of the tensor. This is effectively bounded at Theta(l), where l is the number of lists in the tensor (since we can call len on a list of scalars).
It's still slow: too slow compared to sampling from nicer tensors that have a defined shape.
It begs the question: can we do better? See the next solution.
Use a probability distribution in conjunction with numpy.random.choice. The idea here is that if we know ahead of time what the distribution of scalars is already like, we can sample fairly at each level of descending the tensor. The hard problem here is building this distribution.
I won't write pseudocode for this, but lay out some objectives:
This needs to be run only once to build the data structure.
The algorithm needs to combine iterative and recursive techniques to a) build distributions for sibling lists and b) build distributions for descendants, respectively.
The algorithm will need to map indices to a probability distribution respective to sibling lists (note the assumptions discussed above). This does require knowing the number of elements in an arbitrary sub-tensor.
At lower levels where lists contain only scalars, we can simplify by just storing the number of elements in said list (as opposed to storing probabilities of selecting scalars randomly from a 1D array).
You will likely need 2-3 functions: one that utilizes the probability distribution to return an index, a function that builds the distribution object, and possibly a function that just counts elements to help build the distribution.
Sampling is also faster, at O(n) where n is the rank (depth) of the tensor. I'm convinced this is the fastest possible algorithm, but I lack the time to try to prove it.
You might choose to store the distribution as an ordered dictionary that maps a probability to either another dictionary or the number of elements in a 1D array. I think this might be the most sensible structure.
Note that (2) is truly the same as (1), but we pre-compute knowledge about the densities of the tensor.
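A hedged sketch of approach (2), storing precomputed counts rather than explicit probabilities and letting numpy turn the counts into a distribution at each level:
import numpy as np

def build_counts(tensor):
    """Precomputes the number of scalars under every sub-list, recursively."""
    if not isinstance(tensor, list):
        return (1, None)  # a scalar leaf
    children = [build_counts(sub) for sub in tensor]
    return (sum(c for c, _ in children), children)

def sample_index(tensor, counts, rng=np.random.default_rng()):
    """Draws a uniformly random index tuple using the precomputed counts."""
    total, children = counts
    if children is None:
        return ()  # reached a scalar
    sizes = np.array([c for c, _ in children], dtype=float)
    i = rng.choice(len(children), p=sizes / total)
    return (i,) + sample_index(tensor[i], children[i], rng)
Build the counts once with counts = build_counts(tensor), then call sample_index(tensor, counts) per draw; weighting each branch by its scalar count makes every scalar equally likely.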
I hope this helps.

Quasi-Monte-Carlo vs. variable dimensionality?

I've been looking through the Matlab documentation on using quasi-random sampling of N-dimensional unit cubes. This represents a problem with N stochastic parameters. Based on the fact that it is a unit cube, I presume that I need to use the inverse CDF of each parameter to map from the [0,1] domain to the value range of each parameter.
I would like to try this on a problem for which I now use Monte Carlo. Unfortunately, the problem I'm analyzing does not have a fixed number of dimensions. For each instantiation of the problem, I generate a variable number of widgets (say) using a Poisson distribution. Only after that do I randomly generate the parameters for each widget. That whole process yields one instance of the problem to be analyzed, so the number of parameters varies from one instance to the next.
Is this kind of problem still amenable to Quasi-Monte-Carlo?
What I used once was to take the highest possible dimension of the problem, d, generate a Sobol sequence in d dimensions, and use however many points were necessary for a particular sampling. I would say it helped somewhat...
From talking to a much smarter colleague, we need to consider the various combinations of widget counts for each widget type. For example, if we have 2 of widget type #1, 4 of widget type #2, 1 of widget type #3, etc., that constitutes one combination, and QMC can be applied to that one combination. We are assuming that the number of widget #i is independent of the number of widget #j for i <> j, so the probability of each combination is just the product of p(2 widgets of type #1), p(4 widgets of type #2), p(1 widget of type #3), etc. The individual probabilities are easy to get from their Poisson distributions (or their flat distributions, or whatever distribution is being used). If there are N widget types, this is just a joint PMF in N-space. This probability is then used to weight the QMC result for that particular combination. Note that even when the exact combination is nailed down, QMC is still needed, because each widget is associated with 3 stochastic parameters.
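A minimal Python sketch of that weighting scheme, assuming two widget types, truncated Poisson counts, 3 stochastic parameters per widget, and a placeholder model function (all assumptions; scipy.stats.qmc supplies the Sobol sequence):
import numpy as np
from scipy.stats import poisson, qmc

lam = [2.0, 4.0]        # Poisson means for the two widget types
def model(params):      # placeholder for the actual analysis
    return params.sum()

estimate = 0.0
for n1 in range(8):     # truncate each Poisson count somewhere sensible
    for n2 in range(8):
        weight = poisson.pmf(n1, lam[0]) * poisson.pmf(n2, lam[1])
        d = 3 * (n1 + n2)  # dimensionality for this combination
        if d == 0:
            continue       # no widgets, nothing to integrate
        points = qmc.Sobol(d=d, scramble=True).random(256)
        # map each column through its parameter's inverse CDF here;
        # the identity map stands in for that step in this sketch
        estimate += weight * np.mean([model(p) for p in points])
print(estimate)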

I need a function that describes a set of sequences of zeros and ones?

I have multiple sets with a variable number of sequences. Each sequence is made of 64 numbers that are either 0 or 1 like so:
Set A
sequence 1: 0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0
sequence 2:
0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
sequence 3:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0
...
Set B
sequence1:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
sequence2:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0
...
I would like to find a mathematical function that describes all possible sequences in a set, maybe even predicts more, and that does not match the sequences in the other sets.
I need this because I am trying to recognize different gestures in a mobile app based on the cells in a grid that have been touched (1 touch/ 0 no touch). The sets represent each gesture and the sequences a limited sample of variations in each gesture.
Ideally, the function describing the sequences in a set would allow me to test user touches against it to determine which set/gesture they are part of.
I searched for a solution, either using Excel or Mathematica, but being very ignorant about both, and about mathematics in general, I am looking for direction from an expert.
Suggestions for basic documentation on the subject is also welcome.
It looks as if you are trying to treat what is essentially 2D data in 1D. For example, let s1 represent the first sequence in set A in your question. Then the command
ArrayPlot[Partition[s1, 8]]
produces a picture: an 8×8 grid plot showing which cells the gesture touched. The other sequences in the same set produce similar plots. One of the sequences from the second set produces, under the same operations, a different-looking picture.
I don't know what sort of mathematical function you would like to define to describe these pictures, but I'm not sure that you need to if your objective is to recognise user gestures.
You could do something much simpler, such as calculate the 'average' picture for each of your gestures. One way to do this would be to calculate the average value for each of the 64 pixels in each of the pictures. Perhaps there are 6 sequences in your set A describing gesture A. Sum the sequences element-by-element. You will now have a sequence with values ranging from 0 to 6. Divide each element by 6. Now each element represents a sort of probability that a new gesture, one you are trying to recognise, will touch that pixel.
Repeat this for all the sets of sequences representing your set of gestures.
To recognise a user gesture, simply compute the difference between the sequence representing the gesture and each of the sequences representing the 'average' gestures. The smallest (absolute) difference will direct you to the gesture the user made.
I don't expect that this will be entirely foolproof, it may well result in some user gestures being ambiguous or not recognisable, and you may want to try something more sophisticated. But I think this approach is simple and probably adequate to get you started.
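In Python/numpy terms (this answer is framed in Mathematica, but the idea is language-agnostic), a minimal sketch of the average-and-compare approach might look like this:
import numpy as np

def average_gestures(gesture_sets):
    """gesture_sets: dict mapping gesture name -> list of 64-element 0/1 lists."""
    return {name: np.mean(np.asarray(seqs, dtype=float), axis=0)
            for name, seqs in gesture_sets.items()}

def classify(averages, new_sequence):
    """Returns the gesture whose average picture is closest (L1 distance)."""
    seq = np.asarray(new_sequence, dtype=float)
    return min(averages, key=lambda name: np.abs(averages[name] - seq).sum())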
In Mathematica the following expression will enumerate all the possible combinations of {0,1} of length 64.
Tuples[{1, 0}, {64}]
But there are 2^64, or 18,446,744,073,709,551,616, of them, so I'm not sure what use that will be to you.
Maybe you just want the unique sequences contained in each set; in that case all you need is the Mathematica Union[] function applied to the set. If you have the sets grouped together in a list in Mathematica, say mySets, then you can apply the Union operator to every set in the list by using the map operator.
Union /@ mySets
If you want to do some type of prediction a little more information might be useful.
Thank you for the clarifications.
Machine Learning
The task you want to solve falls under the disciplines known by a variety of names, but probably most commonly as Machine Learning or Pattern Recognition; if you know which examples represent the same gestures, your case would be known as supervised learning.
Question: In your case, do you know which gesture each example represents?
You have a series of examples for which you know a label (the form of gesture it is), from which you want to train a model and use that model to assign an unseen example to one of a finite set of classes, in your case one of a number of gestures. This is typically known as classification.
Learning Resources
There is a very extensive background of research on this topic, but a popular introduction to the subject is Pattern Recognition and Machine Learning by Christopher Bishop.
Stanford has a series of machine learning video lectures (Stanford ML) available on the web.
Accuracy
You might want to consider how you will determine the accuracy of your system at predicting the type of gesture for an unseen example. Typically you train the model using some of your examples and then test its performance using examples the model has not seen. Two of the most common methods used to do this are 10-fold cross-validation and repeated 50/50 holdout. Having a measure of accuracy enables you to compare one method against another to see which is superior.
Have you thought about what level of accuracy you require in your task: is 70% accuracy enough, 85%, 99%, or better?
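A hedged sketch of both accuracy-estimation schemes using scikit-learn (an assumption about toolbox; WEKA and R provide the same facilities built in), with placeholder data in the 64-cell format described earlier:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X = np.random.randint(0, 2, size=(60, 64))  # placeholder gesture grids
y = np.random.choice(["A", "B"], size=60)   # placeholder gesture labels

model = RandomForestClassifier()
print(cross_val_score(model, X, y, cv=10).mean())  # 10-fold cross-validation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5)
print(model.fit(X_tr, y_tr).score(X_te, y_te))     # single 50/50 holdout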
Machine learning methods are typically quite sensitive to the specific type of data you have and the amount of examples you have to train the system with, the more examples, generally the better the performance.
You could try the method suggested above and compare it against a variety of well-proven methods, among which would be random forests, support vector machines and neural networks. All of these and many more are available to download in a variety of free toolboxes.
Toolboxes
Mathematica is a wonderful system, infinitely flexible and my favourite environment, but out of the box it doesn't have a great deal of support for machine learning.
I suspect you will make a great deal of progress more quickly by using a toolbox designed specifically for machine learning. Two of the most popular free toolboxes are WEKA and R; both support more than 50 different methods for solving your task, along with methods for measuring the accuracy of the solutions.
With just a little data reformatting, you can convert your gestures to a simple file format called ARFF, load them into WEKA or R and experiment with dozens of different algorithms to see how each performs on your data. The explorer tool in WEKA is definitely the easiest to use, requiring little more than a few mouse clicks and typing some parameters to get started.
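As an illustration of that reformatting step, a small hypothetical Python helper (not part of WEKA) that writes the 64-cell sequences to an ARFF file:
def write_arff(path, gesture_sets):
    """gesture_sets: dict mapping gesture label -> list of 64-element 0/1 lists."""
    with open(path, "w") as f:
        f.write("@relation gestures\n")
        for i in range(64):
            f.write("@attribute cell%d {0,1}\n" % i)
        f.write("@attribute gesture {%s}\n" % ",".join(gesture_sets))
        f.write("@data\n")
        for label, sequences in gesture_sets.items():
            for seq in sequences:
                f.write(",".join(str(v) for v in seq) + "," + label + "\n")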
Once you have an idea of how well the established methods perform on your data you have a good starting point to compare a customised approach against should they fail to meet your criteria.
Handwritten Digit Recognition
Your problem is similar to a very well researched machine learning problem known as hand written digit recognition. The methods that work well on this public data set of handwritten digits are likely to work well on your gestures.
