How do I write a Multi threaded Alpha-Beta Search algorithm?

I'm trying to create a chess engine using an alpha-beta minimax search algorithm, but the code is too slow. I've done all the optimisations I can think of, but it is still very slow in a single thread. I looked at the source code of some other engines and at the chess programming wiki (https://www.chessprogramming.org/Parallel_Search#Parallel_Alpha-Beta) to see how they do it, but the code is beyond my level and I don't understand it. I couldn't find any other written sources or code snippets either.
Can someone explain how to efficiently implement threading in an alpha-beta search algorithm? Thanks.

Alpha-beta is an inherently sequential algorithm: the alpha and beta values are updated continuously throughout the search, and cutoffs are decided based on these values. For this reason it is very hard to get a speedup by increasing the number of threads, and the more threads you throw at it, the smaller the gains will be.
However, there are still several ways to do it. Most of them are fairly complicated and scale poorly with more threads. The go-to algorithm used to be the Young Brothers Wait Concept (YBWC); it is a fairly complicated algorithm and was used, for example, by Stockfish until a few years back. However, with the increasing number of cores available on modern computers, its scaling was poor and the code very complex. Today most modern engines use something called Lazy SMP. This algorithm is almost as simple as it can be and scales better than the others.
In Lazy SMP all you have to do is start the exact same search as you would normally do, just on multiple threads. It relies on having a working transposition table through which the threads communicate with each other. The threads will never be exactly in sync, and this randomness leads each thread to explore slightly different parts of the search tree and then save its results into the transposition table, where they might be used by another thread. Of course a lot of work is repeated across threads, but this is still better than trying to be clever about splitting the work and slowing down the algorithm, and this is especially true as you scale up the number of threads.
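To make the idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy game (single-pile Nim, take 1 to 3 stones, taking the last stone wins) stands in for chess, the names TranspositionTable, negamax and lazy_smp are hypothetical, and a single lock guards the table where real engines use lockless schemes. In CPython the global interpreter lock also means these threads show the structure only, not a real speedup.

```python
import threading

EXACT, LOWER, UPPER = 0, 1, 2  # kind of bound stored with each table entry


class TranspositionTable:
    """Shared by all search threads; one lock keeps the sketch simple."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def probe(self, key):
        with self._lock:
            return self._data.get(key)

    def store(self, key, depth, score, flag):
        with self._lock:
            old = self._data.get(key)
            if old is None or old[0] <= depth:  # keep the deeper result
                self._data[key] = (depth, score, flag)


def negamax(stones, depth, alpha, beta, tt):
    """Alpha-beta in negamax form; scores are from the side to move."""
    if stones == 0:
        return -1  # the opponent just took the last stone and won
    if depth == 0:
        return 0   # search horizon reached: result unknown
    alpha_orig = alpha
    entry = tt.probe(stones)
    if entry is not None and entry[0] >= depth:
        _, score, flag = entry
        if flag == EXACT:
            return score
        if flag == LOWER:
            alpha = max(alpha, score)
        else:
            beta = min(beta, score)
        if alpha >= beta:
            return score
    best = -2
    for take in (1, 2, 3):
        if take > stones:
            break
        best = max(best, -negamax(stones - take, depth - 1, -beta, -alpha, tt))
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # cutoff
    flag = UPPER if best <= alpha_orig else LOWER if best >= beta else EXACT
    tt.store(stones, depth, best, flag)
    return best


def lazy_smp(stones, n_threads=4):
    """Every thread runs the same iterative-deepening search on one shared table."""
    tt = TranspositionTable()
    results = [None] * n_threads

    def worker(idx):
        score = 0
        for depth in range(1, stones + 1):
            score = negamax(stones, depth, -2, 2, tt)
        results[idx] = score

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]
```

The only thing the threads share is the table: nobody splits the tree or hands out work, which is why the approach is so easy to bolt onto an existing single-threaded search.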
I recommend you take a look at the chess programming wiki, where you can even find some pseudo code on how to implement it.
https://www.chessprogramming.org/Lazy_SMP
Though I should also point out that if what you are looking for is a better time to depth, implementing multithreading won't do all that much for you (and in some extreme cases it might actually even slow the search down!). What you need instead is more aggressive pruning of the search tree and a more efficient implementation (e.g. no memory allocations, so the garbage collector never has to run, etc.).

Related

Clojure: Create and manage multiple threads

I wrote a program which needs to process a very large dataset and I'm planning to run it with multiple threads in a high-end machine.
I'm a beginner in Clojure and I'm lost in the myriad of tools at my disposal:
agents, futures, core.async (and Quartzite?). I would like to know which one is most suited for this job.
The following describes my situation:
I have a function which transforms some data and stores it in a database.
The argument to this function is popped from a Redis set.
The function should run in several separate threads as long as there is a value in the Redis set.

For simplicity, futures can't be beat. They create a new thread, and return a value from it. However, often you need more fine-grained control than they provide.
The core.async library has nice support for parallelism (via pipeline, see below), and it also provides automatic back-pressure. You have to have a way to control the flow of data such that no one's starving for work, or burdened by too much of it. core.async channels must be bounded, and this helps with this problem. Also, it's a pretty logical model of your problem: taking a value from a source, transforming it (maybe using a transducer?) with some given parallelism, and then putting the result to your database.
You can also go the manual route of using Java's excellent j.u.concurrent library. There are low-level primitives as well as thread management tools for thread pools. All of this is accessible from within Clojure.
From a design standpoint, it comes down to whether you are more CPU-bound or I/O-bound. This affects decisions such as whether or not you will perform parallel reads from Redis and writes to your database. If you are CPU-bound and thus your bottleneck is the computation, then it wouldn't make much sense to parallelize your reads from Redis, or your writes to your database, would it? These are the types of things to consider.
You really have two problems to solve: (1) your familiarity with Clojure's/Java's concurrency mechanisms, and (2) your approach to this problem (i.e., how would you approach this problem, irrespective of the language you're using?). Once you solve #2, you will have a much better idea of which of the tools I mentioned above to use, and how to use them.
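Independent of language, the "bounded channel with back-pressure" shape looks roughly like the sketch below. It is written in Python rather than Clojure purely as a neutral illustration; run_workers and its parameters are hypothetical names, the items iterable stands in for popping from the Redis set, and appending to a list stands in for the database write.

```python
import queue
import threading


def run_workers(items, transform, n_workers=4, max_pending=8):
    """Push items through a bounded queue to n_workers threads and collect results."""
    # Bounded queue: put() blocks when it is full, which is the back-pressure.
    work = queue.Queue(maxsize=max_pending)
    results = []
    results_lock = threading.Lock()
    stop = object()  # sentinel telling a worker to exit

    def worker():
        while True:
            item = work.get()
            if item is stop:
                break
            out = transform(item)
            with results_lock:
                results.append(out)  # stand-in for the database write

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:  # stand-in for popping values from the Redis set
        work.put(item)
    for _ in threads:
        work.put(stop)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Because the queue is bounded, the producer can never run far ahead of the workers, and the workers never sit on an unbounded backlog.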

Sounds like you may have a good embarrassingly parallel problem to solve. In that case, you could start simply by coding up your processing into a top-level function that processes the first datum. Once that's working, wrap it in a map to handle all of the data sequentially (serially, one-at-a-time). You might want to start tackling the bigger problem with just a few items from your data set. That will make your testing smoother and faster.
After you have the map working, it's time to just add a p (parallel) to your code to make it a pmap. This is a very rewarding way to heat up your machine. Here is a discussion about the number of threads pmap uses.
The above is the simplest approach. If you need finer control over the concurrency, this concurrency screencast explores the use cases.
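The same "get map working, then switch to a parallel map" progression can be sketched outside Clojure too. The Python version below is only an analogy to map/pmap, not Clojure's implementation; process is a hypothetical stand-in for the real per-item function.

```python
from concurrent.futures import ThreadPoolExecutor


def process(datum):
    """Hypothetical stand-in for the real per-item transformation."""
    return datum * 2


def process_all_serial(data):
    # Step 1: plain map, one item at a time.
    return list(map(process, data))


def process_all_parallel(data, max_workers=4):
    # Step 2: same call shape, but items are handed to a pool of threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, data))  # results keep the input order
```

Keeping the serial version around is handy for testing: both functions should return the same list for the same input.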

It is hard to be precise without knowing the details of your problem. There are several choices, as you mention:
Plain Java threads & threadpools. If your problem is similar to a pre-existing Java solution, this may be the most straightforward.
Simple Clojure threading with future et al. Kicking off a thread with future and getting the result in a promise is very easy.
Replace map with pmap (parallel map). This can help in simple cases that are primarily map/reduce oriented.
The Claypoole library: Lots of tools to make multithreading simpler and easier. Please see their GitHub project and the Clojure/West talk.

What's the overhead of the different forms of parallelism in Julia v0.5?

As the title states, what is the overhead of the different forms of parallelism, at least in the current implementation of Julia (v0.5, in case the implementation changes drastically in the future)? I am looking for some "practical measures", some general heuristics or ballparks to keep in my head for when it can be useful. For example, it's pretty obvious that multiprocessing in a loop like:
addprocs(4)
@parallel (+) for i=1:4
    rand()
end
doesn't give you performance gains because each process is only taking one random number, but is there a general heuristic for knowing when it will be worthwhile? Also, what about a heuristic for threading? It surely has a lower overhead than multiprocessing, but for example, with 4 threads, for what N is it a good idea to multithread:
A = rand(4)
Base.@threads (+) for i = 1:N
    A[i%4+1]
end
(I know there isn't a threaded reduction right now, but let's act like there is, or edit with a better example). Sure, I can benchmark every example, but some good rules to keep in mind would go a long way.
In more concrete terms: what are some good rules of thumb?
How many numbers do you need to be adding/multiplying before threading gives performance enhancements, or before multiprocessing gives performance enhancements?
How much does this depend on Julia's current implementation?
How much does it depend on the number of threads/processes?
How much does this depend on the architecture? Are there good rules for knowing when the threshold should be higher/lower on a particular system?
What kinds of applications violate these heuristics?
Again, I'm not looking for hard rules, just general guidelines to guide development.

A few caveats: 1. I'm speaking from experience with version 0.4.6, (and prior), haven't played with 0.5 yet (but, as I hope my answer below demonstrates, I don't think this is essential vis-a-vis the response I give). 2. this isn't a fully comprehensive answer.
Nevertheless, from my experience, the overhead for multiple processes itself is very small provided that you aren't dealing with data movement issues. In other words, in my experience, any time that you find yourself wishing something were faster than a single process on your CPU can manage, you're well past the point where parallelism will be beneficial. For instance, in the sum of random numbers example that you gave, I found through testing just now that the break-even point was somewhere around 10,000 random numbers. Anything more and parallelism was the clear winner. Generating 10,000 random numbers is trivial for modern computers, taking a tiny fraction of a second, and is well below the threshold where I'd start getting frustrated by the slowness of my scripts and want parallelism to speed them up.
Thus, I at least am of the opinion, that although there are probably even more wonderful things that the Julia developers could do to cut down on the overhead even more, at this point, anything pertinent to Julia isn't going to be so much of your limiting factor, at least in terms of the computation aspects of parallelism. I think that there are still improvements to be made in terms of enhancing both the ease and the efficiency of parallel data movement (I like the package that you've started on that topic as a good step. You and I would probably both agree there's still a ways more to go). But, the big limiting factors will be:
How much data do you need to be moving around between processes?
How much read/write to your memory do you need to be doing during your computations? (e.g. flops per read/write)
Aspect 1 might at times lean against using parallelism. Aspect 2 is more likely just to mean that you won't get so much benefit from it. And, at least as I interpret "overhead," neither of these really falls so directly into that specific consideration. Both of these are, I believe, going to be far more heavily determined by your system hardware than by Julia.

What is the effectiveness of multithreaded alpha-beta-pruning?

What would the effectiveness be of multithreading with alpha beta pruning if:
The multithreading was used iteratively. For example, thread one would look at the first branch, the second thread would look at the second branch, etc. I believe this should only be done at the first depth (the next move the AI makes), since the other depths could be cut off.
One thread started at the first "move" generated and searched forward to the middle of the move list, and the second thread started at the last "move" generated and searched back to the middle. Here, I think there could be increased speedup, because the last move could turn out to be the best move, and as a result, the second thread could cause cutoffs the first thread couldn't.
The multithreading was used to think on the opponent's time. For example, say the opponent took some time to think and make a move. The AI could iteratively deepen its search and find results while the opponent is thinking, which I'd imagine would not necessarily cause a speedup but would give more time for minimax analysis.
There may be other optimizations, I'd imagine, but these were the few that came to mind. I don't know if they will actually improve anything, though.

If I understand your idea correctly, you plan to search moves in the root position in parallel. In comparison to a strictly sequential algorithm, it should be better but I would not expect it to scale well (with multiple CPUs).
For comparison, here is a summary of existing parallelization strategies in chess:
https://www.chessprogramming.org/Parallel_Search
As alpha-beta is a sequential algorithm, all parallelization strategies are speculative. So, you want to avoid spending time on searching parts of the search tree, which will eventually be cut by other moves. One relatively simple strategy to avoid searching irrelevant subtrees is called Young Brothers Wait Concept.
There are also algorithms with improved scalability but at the cost of being more difficult to understand and implement. For instance, supporting work-stealing should improve scalability.
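For reference, searching the root moves in parallel (the first idea in the question) might look like the sketch below. It is a toy under stated assumptions: single-pile Nim (take 1 to 3 stones, taking the last stone wins) stands in for a real game, and alphabeta and search_root_parallel are hypothetical names. Note that every thread starts with a full window, which is exactly the weakness discussed above: a good score found for one root move cannot tighten the bounds for its siblings.

```python
import threading


def alphabeta(stones, alpha, beta):
    """Negamax alpha-beta for toy Nim; the score is from the side to move."""
    if stones == 0:
        return -1  # the opponent just took the last stone and won
    best = -2
    for take in (1, 2, 3):
        if take > stones:
            break
        best = max(best, -alphabeta(stones - take, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # cutoff inside a subtree still works as usual
    return best


def search_root_parallel(stones):
    """One thread per root move; returns (best_move, score)."""
    root_moves = [take for take in (1, 2, 3) if take <= stones]
    scores = {}
    lock = threading.Lock()

    def worker(take):
        # Full window (-2, 2): no bound is shared between root moves.
        score = -alphabeta(stones - take, -2, 2)
        with lock:
            scores[take] = score

    threads = [threading.Thread(target=worker, args=(m,)) for m in root_moves]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    best_move = max(root_moves, key=lambda m: scores[m])
    return best_move, scores[best_move]
```

With 10 stones the winning move is to take 2 and leave a multiple of 4, so the call returns that move with a winning score. The number of useful threads is also capped by the number of root moves, which is one reason this approach does not scale.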

Data Parallelism in Ant Colony Optimization

I have been trying to understand how ACO can be implemented with data parallelism. I have read some content after searching on Google, but I only need the basic idea explained in a simple way. Most of the papers talk about everything else instead of explaining the main thing in simple words.
What I have understood so far is that we make it work in parallel by using multi-tasking (threading). But I am not sure what each thread would do or how we could separate the work into threads without causing trouble.
Does it mean that we should create a separate thread for each ant? But that would cause lots of threads to be created! So if there are 200 ants, then 200 threads?
I am still confused about data parallelism in ACO. I would really love to hear in simple words how we would implement it in parallel.

A few simple ideas to run ACO in parallel
Since you have already read up on ACO, here are a few simple ideas on ways to run ACO in parallel. Rather than getting caught up in multiple threads and multi-tasking, it might be helpful to think in terms of the 'parallel compute resources' at your disposal.
ACO is one case of Agent-Based Simulation (ABS), and ABS lends itself particularly well to parallelization.
Simple Options
Option 1. Run a full version of ACO in each of the parallel resources.
Code your ACO algorithm, run it in parallel fashion. (Since there is a stochastic element to the algorithm, you can then look for the 'best' solution for your problem.)
Option 2. To explore effects of varying ACO parameters
Like any simulation approach, any ACO implementation has a large number of runtime parameters: number of vertices, time to run, number of ants, pheromone evaporation rates, probability functions to choose path options and many more. When you multiply these options, they add up to a large number of cases to be run. Divide up the work among your parallel compute resources.
The two options mentioned above are sometimes referred to as 'embarrassingly' parallel. Very easy to implement (think of it as a Design of Experiments) and you get back a whole matrix of results, and you can make conclusions by studying what effect the changes in the parameters had on the solution.
Option with solution sharing
Option 3: Master-Slave approach, with Partial Solution sharing
Going up one more level in complexity, we can use each node to contribute its 'knowledge/findings' to the overall problem solution. This is sometimes called a master-slave approach. The master is trying to solve the overall problem (Could be TSP, or some similar complex problem) and each 'slave' is solving some aspect of it, but with some fairly simple algorithm. The idea is that when combined they produce powerful results.
After a certain number of iterations, the solutions are passed back and forth, with 'bad' solutions thrown out. Some variant of the Map-Shuffle-Reduce paradigm would do that. The master evaluates the current best solution, and that is transferred back to each 'slave' node (Example: the latest overall pheromone levels are given to all the slave nodes). The next round of solving resumes.
Option 3 has tons of nuanced variations, and some people spend their entire lives improving various aspects of it.
Hope some of these ideas help.
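As a rough illustration of Option 1 (independent full runs, keep the best), here is a deliberately tiny ACO for a symmetric TSP in Python. All names, parameter defaults and the pheromone update rule are assumptions chosen to keep the sketch short, not a reference implementation. Each run gets its own seed and its own pheromone matrix, so the runs share nothing and need no locking.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def run_aco(dist, seed, n_ants=10, n_iters=30, evaporation=0.5):
    """One complete, independent ACO run; returns (best_length, best_tour)."""
    rng = random.Random(seed)
    n = len(dist)
    pher = [[1.0] * n for _ in range(n)]
    best_len, best_tour = float("inf"), None
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):  # each ant builds one tour
            tour = [rng.randrange(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                cur = tour[-1]
                cand = sorted(unvisited)
                # Prefer edges with more pheromone and shorter distance.
                weights = [pher[cur][j] / dist[cur][j] for j in cand]
                nxt = rng.choices(cand, weights=weights)[0]
                tour.append(nxt)
                unvisited.remove(nxt)
            tours.append(tour)
        for row in pher:  # evaporation
            for j in range(n):
                row[j] *= 1.0 - evaporation
        for tour in tours:  # deposit: shorter tours leave more pheromone
            length = tour_length(tour, dist)
            if length < best_len:
                best_len, best_tour = length, tour
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                pher[a][b] += 1.0 / length
                pher[b][a] += 1.0 / length
    return best_len, best_tour


def parallel_aco(dist, seeds):
    """Option 1: one full run per seed, in parallel; keep the best result."""
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        results = list(pool.map(lambda s: run_aco(dist, s), seeds))
    return min(results)
```

So the unit of parallelism here is a whole colony, not an ant: 200 ants do not mean 200 threads. Option 2 has the same shape, with a list of parameter combinations in place of the list of seeds.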

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines, etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.

Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
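Amdahl's law is simple enough to put into a few lines, so you can play with the numbers before writing any real code. The sketch below is in Python; amdahl_speedup is a hypothetical helper name, and the formula is the standard one: speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelised and n is the number of processors.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Upper bound on speedup when only part of the run time can be parallelised."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)
```

For example, if 90% of the run time parallelises perfectly, 24 processors give at most about a 7.3x speedup, and no number of processors can exceed 10x. Communication overhead only pushes the real figure lower.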
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawrence Livermore National Laboratory.

Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.

If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits and there aren't lots of data dependencies), then I'd use a thread pool / task mechanism, presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, wrap them in tasks, and then throw the tasks at the thread pool.
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
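A pipeline like that can be sketched with one thread per stage and a bounded queue between neighbouring stages. The version below is Python, chosen only for illustration; stage, run_pipeline and the sentinel are hypothetical names. A separate feeder thread pushes the input so that the caller can drain the last queue at the same time, otherwise the bounded queues could fill up and stall.

```python
import queue
import threading

STOP = object()  # sentinel passed down the pipeline to shut each stage down


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # tell the next stage to stop as well
            break
        outbox.put(func(item))


def run_pipeline(items, funcs, max_pending=4):
    """Run items through funcs in order, one thread per function."""
    queues = [queue.Queue(maxsize=max_pending) for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(STOP)

    feeder = threading.Thread(target=feed)
    for t in threads + [feeder]:
        t.start()
    results = []
    while True:  # drain the final queue while the stages are still running
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads + [feeder]:
        t.join()
    return results
```

With one thread per stage the output order matches the input order. Giving a slow stage several threads, as in the "threads 3 through 8" example, trades that ordering guarantee for throughput.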
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.

For a 6 month project I'd say it definitely pays off to start by reading a good book about the subject. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I had been writing multithreaded programs for 10 years when I discovered this gem, and I still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.

There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
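To show the shape of the fix for the increment problem, here is a minimal sketch in Python rather than C++ or C# (Counter and count_in_parallel are hypothetical names). The same caveat applies there: value += 1 is a read-modify-write sequence with no atomicity guarantee, so the lock is what makes the final count reliable.

```python
import threading


class Counter:
    """A shared counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value
        # and one of the two updates would be lost.
        with self._lock:
            self.value += 1


def count_in_parallel(n_threads=4, n_increments=10000):
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Acquiring and releasing the lock also takes care of visibility: a thread that takes the lock sees the writes made by the previous holder.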
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you).
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well).
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.

One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.

There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.

The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Application".

Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.

Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first. Threading adds a lot of complexity, so you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).

I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx

Sorry, I just want to add a pessimistic, or better, realistic answer here.
You are under time pressure. You have a 6 month deadline and you don't even know for sure what language this system is written in, what it does and how it is organized. If it is not a trivial calculation then this is a very bad start.
Most importantly: you say you have never done multithreaded programming before. This is where I get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn if you want to do it right, and you need to do it right if you want to win a huge speed increase. Debugging is extremely nasty even with good tools like the TotalView debugger or Intel's VTune.
Then you say you want to rewrite the app in another language. Well, this isn't so bad, as you have to rewrite it anyway. The chance of turning a single-threaded program into a well-working multithreaded one without a total redesign is almost zero.
But learning multithreading and a new language (how are your C++ skills?) with a timeline of 3 months (you have to write a throwaway prototype, so I cut the timespan into two halves) is extremely challenging.
My advice here is simple and you will not like it: learn multithreading now, because it will be a required skill set in the future, but leave this job to someone who already has experience. Unless, that is, you don't care about the program being successful and are just looking for 6 months of pay.

If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.

You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously

Whatever technology you're going to write this in, take a look at the must-read book on concurrency, "Concurrent Programming in Java". For .NET I highly recommend the Retlang library for concurrent apps.

I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multiple threaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...
