Reinforcement Learning: fine-tuning MCTS node selection and expansion stages with inaccurate values

I am implementing a Go playing program roughly following the architecture of the earlier versions of AlphaGo (AlphaGo Fan or AlphaGo Lee), i.e. using a policy network, a value network, and Monte Carlo tree search (MCTS). Currently I have trained a decent policy network and an insensitive value network, and I don't have a fast roll-out policy. By "insensitive" I mean that the value network cannot judge complicated situations: it only outputs a win rate around 50% unless the position is quiet. The value network can judge quiet boards (no big fight going on) correctly.
Using this policy network and value network, I also implemented the MCTS algorithm (the evaluation of a tree node is done only by the value network). Since the value network is not accurate, I am afraid MCTS is prone to getting trapped in bad moves before the search time is up. In order to better fine-tune the hyperparameters of MCTS to remedy the bad influence of the inaccurate value network, I have two questions:
Node selection is done by arg max (p_value + lambda * p_policy / visit_cnt). Does fine-tuning the parameter lambda help?
Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does setting the expansion condition to expand a leaf once it has been visited a very small number of times, say 3, help? What expansion method should I use?
EDIT: The second question is about the 'expand' stage of the typical 'selection, expansion, evaluation, backup' MCTS algorithm. I reckon that by expanding as quickly as possible, MCTS can explore deeper and give more accurate value approximations. I set a parameter n: the number of times a leaf node must be visited before it is expanded. I want to know, intuitively, how a large n and a small n would influence the performance of MCTS.

Node selection is done by arg max (p_value + lambda * p_policy / visit_cnt). Does fine-tuning the parameter lambda help?
Let's first try to develop a good understanding of what all the terms in that formula do:
p_value: average of all the evaluations at the end of iterations that went through this node. This is our current estimate of how good this node is according to the value network's evaluations at the end of rollouts.
p_policy/visit_cnt: p_policy will be high for nodes that seem good according to your policy network, and low for nodes that seem bad according to your policy network. visit_cnt will be high for nodes which we've already visited often, and low for nodes we have not yet visited often. This complete term makes us initially "lean towards" the policy network, but move away from the policy network as time goes on (because the nodes that are good according to the policy network will have high visit counts).
lambda: A parameter that determines a balance between the first and the second term of the two points above. If lambda is high, we'll rely more on the information that the policy network gives us, and less on the information that the value network gives us. If lambda is low, we'll more quickly start relying on the information provided to us by earlier MCTS iterations + value network evaluations, and rely less on the policy network.
In the text of your question, you have stated that you believe the policy network is decent and the value network isn't really informative. If that's the case, I'd try using a high value for lambda: when you trust the policy network more than the value network, you want the search to lean more heavily on the policy network, and that is exactly what a high lambda does.
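To make the selection rule concrete, here is a minimal sketch in Python of the arg max above (the node representation, and the 1 + visit_cnt guard against division by zero, are my own assumptions):

    def select_child(children, lam):
        # children: list of dicts with 'value_sum', 'visit_cnt', 'p_policy'
        # lam: the lambda parameter; a larger lam trusts the policy prior for longer
        best, best_score = None, float('-inf')
        for c in children:
            # p_value: mean of the value-network evaluations backed up through this child
            p_value = c['value_sum'] / c['visit_cnt'] if c['visit_cnt'] > 0 else 0.0
            # the policy term shrinks as visit_cnt grows; 1 + visit_cnt avoids division by zero
            score = p_value + lam * c['p_policy'] / (1 + c['visit_cnt'])
            if score > best_score:
                best, best_score = c, score
        return best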
Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does setting the expansion condition to expand a leaf once it has been visited a very small number of times, say 3, help? What expansion method should I use?
The only reason why the expansion phase is often rather limited in classical MCTS implementations (e.g. often only expands a single node per iteration) is memory concerns; if you expand too often, your tree grows too quickly, and you run out of memory.
In these AlphaGo-style setups (mixing deep learning and MCTS), you spend a lot more computation time in the networks, and therefore get far fewer MCTS iterations than a raw, pure MCTS algorithm without any deep learning (but often higher-quality, more informative iterations, which makes up for the lower iteration count). This lower iteration count significantly reduces the risk of running out of memory due to over-enthusiastic expansion, so I suspect you can afford to expand more aggressively. The only possible negative effect of expanding too much is that you run out of RAM, and you'll easily notice when that happens because your program will crash.
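As a rough sketch of the visit-threshold expansion discussed in the question (the threshold n, the node fields, and the policy_net interface are assumptions made purely for illustration):

    N_EXPAND = 3  # the parameter n: visits required before a leaf is expanded

    def maybe_expand(leaf, policy_net, board):
        # A small n grows the tree (and memory use) quickly but lets the search reach
        # deeper positions; a large n keeps the tree small, shallow, and cheap.
        if leaf['children'] is not None or leaf['visit_cnt'] < N_EXPAND:
            return
        priors = policy_net(board)  # assumed to return a {move: prior probability} dict
        leaf['children'] = [
            {'move': m, 'p_policy': p, 'value_sum': 0.0, 'visit_cnt': 0, 'children': None}
            for m, p in priors.items()
        ]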

Related

Optimise multiple objectives in MiniZinc

I am a newbie in CP, but I want to solve a problem I got in college.
I have a MiniZinc model which minimizes the number of Machines used to run some Tasks. Machines have some resources and Tasks have resource requirements. Besides minimizing that number, I am trying to minimize the cost of allocating Tasks to Machines (I have an array with costs). Is there any way to first minimize the number of machines and then optimize the cost in MiniZinc?
For example, I have 3 Tasks and 2 Machines. Every Machine has enough resources to hold all 3 Tasks, but I want to allocate the Tasks where the cost is lower.
Sorry for my English and thanks for your help. If needed, I will paste my code.
The technique that you are referring to is called lexicographic optimisation/objectives. The idea is to optimise for multiple objectives, where there is a clear ordering between the objectives. For example, when optimising (A, B, C) we would optimise B and C, subject to A. So if we can improve the value of A then we would allow B and C to worsen. Similarly, C is also optimised subject to B.
This technique is often used, but is currently not (yet) natively supported in MiniZinc. There are however a few workarounds:
As shown in the radiation model, we can scale the first objective by a value that is at least as large as the maximum of the second objective (and so on). This ensures that any improvement on the first objective trumps any improvement/stagnation on the second objective. The result of the instance should then be the lexicographic optimum (a small sketch of the arithmetic is shown below).
We can separate our models into multiple stages. In each stage we would only concern ourselves with a single objective value (working from the most important to the least important). Any subsequent stage would fix the objectives from the earlier stages. The solution of the final stage should give you the lexicographic optimal solution.
Some solvers support lexicographic optimisation natively. There is some experimental support for using these lexicographic objectives in MiniZinc, as found in std/experimental.mzn.
Note that lexicographic techniques might not always (explicitly) talk about both minimisation and maximisation; however, you can always convert from one to the other by negating the intended objective value.
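As a small sketch of the scaling workaround mentioned above, written here in plain Python only to show the arithmetic (ub_cost is an assumed upper bound on the secondary objective):

    # Lexicographic order via scaling: any improvement in num_machines outweighs
    # any change in cost, as long as cost never exceeds ub_cost.
    def lex_objective(num_machines, cost, ub_cost):
        return num_machines * (ub_cost + 1) + cost

    # e.g. with ub_cost = 100: 2 machines at cost 90 still beats 3 machines at cost 0
    assert lex_objective(2, 90, 100) < lex_objective(3, 0, 100)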

How can the FrozenLake OpenAI-Gym environment be solved with no intermediate rewards?

I'm looking at the FrozenLake environments in openai-gym. In both of them, there are no rewards, not even negative rewards, until the agent reaches the goal. Even if the agent falls through the ice, there is no negative reward -- although the episode ends. Without rewards, there is nothing to learn! Each episode starts from scratch with no benefit from previous episodes.
This should be a simple breadth-first search. It doesn't need RL. But assuming you use RL, one approach would be a reward of -1 for a step to a frozen square (that isn't the goal) and a reward of -10 for a step into a hole. The -1 would allow the agent to learn not to repeat squares. The -10 would allow the agent to learn to avoid the holes. So I'm tempted to create my own negative rewards on the agent side. This would make it more like the cliffwalker.
What am I missing? How would RL solve this (except via random search) with no rewards?
The problem you are describing is often answered with Reward Shaping.
Like the frozen lake environment or Montezuma's Revenge, some problems have very sparse rewards. This means that any RL agent must spend a long time exploring the environment before it sees these rewards. This can be very frustrating for the humans who designed the task for the agent. So, as in the frozen lake environment, people often add extra information like you have suggested. This makes the reward function denser and (sometimes) allows for faster learning (if the modified reward function actually follows what the human wants the agent to do).
In order for the agent to solve these kinds of problems faster than random search and without human intervention, such as reward shaping or giving the agent a video of an expert playing the game, the agent needs some mechanism to explore the space in an intelligent way [citation needed].
Some current research areas on this topic are Intrinsic Motivation, Curiosity, and Options and Option discovery.
Although promising, these research areas are still in their infancy, and sometimes it's just easier to say:
if agent_is_in_a_hole:
    return -10
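If you do go down that road, here is a hedged sketch of how the shaping could look as a gym wrapper. The -1 / -10 values mirror the question; reading the tile from env.unwrapped.desc is my assumption about FrozenLake's internals, and the old 4-tuple step API is assumed:

    import gym

    class ShapedFrozenLake(gym.Wrapper):
        """Adds a -1 step penalty and a -10 penalty for falling into a hole."""
        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            row, col = divmod(obs, self.env.unwrapped.ncol)
            tile = self.env.unwrapped.desc[row][col]  # assumed to be b'S', b'F', b'H' or b'G'
            if tile == b'H':
                reward = -10.0          # stepped into a hole
            elif tile != b'G':
                reward -= 1.0           # discourage wandering on frozen squares
            return obs, reward, done, info

    env = ShapedFrozenLake(gym.make('FrozenLake-v0'))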
I think the objective of this environment is to discover ways to balance exploration vs. exploitation. I think the reward manipulation is neither required nor desirable.
Now, if you try to run this with Q-learning on the 8x8 environment, you may find that it does not converge.
The fix for this was given by JKCooper on the OpenAI forum. You can check out this page and scroll all the way to the bottom to see the comment: https://gym.openai.com/evaluations/eval_xSOlwrBsQDqUW7y6lJOevQ
In there, he introduces the concept of an average terminal reward. This reward is then used to calibrate/tune the exploration.
At the beginning, the average terminal reward is undefined or null. On the very first "done" iteration, this variable is updated with the value of that reward.
On each subsequent iteration, if the current reward is greater than the existing value of the average terminal reward, then the epsilon value is "decayed", i.e. exploration is discouraged and exploitation is encouraged gradually.
Using this technique, you can see that Q-learning converges.
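A minimal sketch of that calibration idea, with names of my own choosing (this is only meant to illustrate the mechanism, not to reproduce the linked code):

    class EpsilonCalibrator:
        def __init__(self, epsilon=1.0, decay=0.95, mix=0.1):
            self.epsilon = epsilon                # exploration rate used by the agent
            self.decay = decay
            self.mix = mix
            self.avg_terminal_reward = None       # undefined until the first finished episode

        def update(self, terminal_reward):
            if self.avg_terminal_reward is None:
                self.avg_terminal_reward = terminal_reward   # first "done" episode sets it
                return
            if terminal_reward > self.avg_terminal_reward:
                self.epsilon *= self.decay                   # better than average: explore less
            # keep a running average of terminal rewards
            self.avg_terminal_reward += self.mix * (terminal_reward - self.avg_terminal_reward)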
The modified version on OpenAI is here (v0.0.2):
https://gym.openai.com/evaluations/eval_FVrk7LAVS3zNHzzvissRQ/

Where to get hardware model data?

I have a task which consists of 3 concurrent, self-defined (mutually recursive) processes. I need to somehow make it execute on a computer, but any attempt to convert the requirements into program code with just my brain fails: the first iteration already produces 3^3 entities with 27^2 cross-relations, and it would take at least several iterations to see whether the program works at all.
So I decided to give up on trying to understand the whole system; I formalized the problem and now want to map it to hardware to generate an algorithm and run it. The language doesn't matter (maybe even directly machine/assembly code?).
I have never done anything like this before, so all the topics I searched through (algorithm synthesis, software/hardware co-design, etc.) mention a hardware model as the second half of the solution generation (in addition to the problem model), but I have never seen one. The whole workflow is supposed to look like this:
I don't know yet at what level a hardware model is described, so I can't decide how the problem model must be formalized to fit the hardware-model layer.
For example, the target system may contain a CPU and a GPGPU, and say the target solution has 2 concurrent processes. The system must decide which process to run on the CPU and which on the GPGPU. The highest-level solution may come from comparing the computational intensity of the processes with that of the target hardware, which is ~300 for CPUs and ~50 for GPGPUs.
But a proper model would have to be much more complete, with at least the cache hierarchy, memory access batch size, etc.
Another example is implementing k-ary trees. A synthesized algorithm could address children and parents by computing k * i + c and (i - 1) / k, or store direct pointers, depending on the ratio of computation cost to memory latency.
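To make that example concrete, here is a small sketch of the computed-address variant for a k-ary tree stored in a flat array (0-based node index i, child slot c in 1..k; the pointer variant would simply store child references instead):

    # Implicit addressing for an array-stored k-ary tree:
    # the children of node i sit at k*i + c for c in 1..k, its parent at (i - 1) // k.
    def child(i, c, k):
        return k * i + c

    def parent(i, k):
        return (i - 1) // k

    # sanity check for a ternary tree (k = 3)
    assert parent(child(4, 2, 3), 3) == 4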
Where can I get a hardware model or data to use? Any hardware would suffice for now, just to see what such a model can look like; later it would be awesome to get models of modern processors, GPGPUs, and common heterogeneous clusters.
Do manufacturers supply such models, i.e. a description of how their systems work in some formal language?
I'm not sure whether this fits your case, but since you mention modeling, I immediately thought of Modelica. It's used to model physical systems, and combined with a simulation environment you can run simulations on it.

High Availability - What does Crossover mean in this context?

I'm working on a Mesos framework to run some jobs and it seems like a great opportunity to learn about making a highly available system. To that end, I'm doing some reading on distributed systems and I made the mistake of visiting wikipedia.
The passage in question is talking about a principle of HA engineering:
Reliable crossover. In multithreaded systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
My google-fu teaches me three things:
1) audio crossover devices split a single input into multiple outputs
2) genetic algorithms use crossover to combine solutions
3) buzzwordy white papers all copied from this wikipedia article :/
My question: what does a 'crossover point' mean in this context, and why is it a single point of failure?
Reliable crossover in this context means:
The ability to switch from a node X (which is broken somehow) to a node Y without losing data.
Non-reliable HA-database example:
Copy the database every 5 minutes to a passive node. => Here you can lose up to 5 minutes of data.
=> Here the copy action is the single point of failure.
Reliable HA-database example:
Setting up data replication where (per example) your insert statement only returns as "executed OK" when the transaction is copied to the secondary server.
(yes: data replication is more complex than this, this is a simplified example in the context of the question)
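To illustrate the difference, a toy sketch (the primary/secondary objects and their write API are made up purely for the example):

    # Non-reliable crossover: the write is acknowledged before the standby has it,
    # so a crash of the primary between copy runs loses data.
    def insert_async(primary, row):
        primary.write(row)
        return "executed OK"        # the standby only catches up via the periodic copy job

    # Reliable crossover: acknowledge only once the standby has the transaction too,
    # so switching from node X to node Y loses nothing.
    def insert_sync(primary, secondary, row):
        primary.write(row)
        secondary.write(row)        # in reality: wait for the replica's acknowledgement
        return "executed OK"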

Nools and Drools

I was really happy to see a rules engine in Node, and I was also looking at Drools in the Java world. While reading the documentation (specifically: http://docs.jboss.org/drools/release/6.1.0.Final/drools-docs/html_single/index.html#PHREAK) I found that Drools 6.0 has evolved and now uses the PHREAK method for rule matching. The specific paragraph that is of interest is:
Each successful join attempt in RETE produces a tuple (or token, or partial match) that will be propagated to the child nodes. For this reason it is characterised as a tuple oriented algorithm. For each child node that it reaches it will attempt to join with the other side of the node, again each successful join attempt will be propagated straight away. This creates a descent recursion effect. Thrashing the network of nodes as it ripples up and down, left and right from the point of entry into the beta network to all the reachable leaf nodes.
For complex rules and rules over a certain limit, the above quote says that the RETE-based method thrashes memory quite a lot, and so it was evolved into PHREAK.
Since nools is based on the Rete algorithm, is the above valid? Are there any optimizations done similar to PHREAK? Have any comparisons been done w.r.t. Drools?
The network thrashing is only an issue when you try to apply concurrency and parallelism, which requires locking in certain areas. As NodeJS is single threaded, that won't be an issue. We haven't attempted to solve this area in Drools yet either, but the Phreak work was preparation with this in mind, learning from the issues we found in our Rete implementation. On a separate note, Rete has used partition algorithms in the past for parallelism, and this work is in the same area for the problem it's trying to solve.
For single-threaded machines, lazy rule evaluation is much more interesting. However, as the document notes, a single rule of joins will not differ in performance between Phreak and Rete. As you add lots of rules, the lazy nature avoids potential work, thus saving overall CPU cycles. The algorithm is also more forgiving of a larger number of badly written rules, and should degrade less in performance. For instance, it doesn't need the traditional Rete root "Context" object that is used to drive rule selection and short-circuit wasteful matching; this would be seen as an anti-pattern in Phreak and may actually slow it down, as you blow away matches it might use again in the future.
http://www.dzone.com/links/rip_rete_time_to_get_phreaky.html
Also the collection oriented propagation is relevant when multiple subnetworks are used in rules, such as with multiple accumulates.
http://blog.athico.com/2014/02/drools-6-performance-with-phreak.html
I also did a follow up on the backward chaining and stack evaluation infrastructure:
http://blog.athico.com/2014/01/drools-phreak-stack-based-evaluations.html
Mark (Creator of Phreak)
