How to calculate the maximum branching factor for the Towers of Hanoi - search

I am modeling the Towers of Hanoi problem with n discs and k pegs, and I am trying to find its maximum branching factor. The problem is that, because the numbers of both discs and pegs are variable, so is the number of actions possible at each node. How can I find a generic way of assessing the maximum branching factor as a function of k and n?

In general the smallest disc can move to any other peg: k-1 options.
The second smallest of the top discs (which might not be the second smallest overall) can move onto any peg except the one holding the smallest disc: k-2 options.
This continues down to the largest of the top discs, which can't move anywhere, because every other peg has a smaller disc on top (assuming n >= k, so every peg can be occupied).
So the expected (and maximum) branching factor is: (k-1) + (k-2) + (k-3) + ... + 2 + 1 = k*(k-1)/2
The only time you won't get this is when one of the pegs contains no discs. If n >> k this will rarely happen. But it means that if you are searching from random states to a goal state, you should consider searching backwards, because the standard goal state has the lowest branching factor, since only one peg holds any discs.
The n < k case can be analyzed similarly, except that you stop after n discs and subtract an additional term for the moves counted the first time around that aren't available now:
k*(k-1)/2 - (k-n)*(k-n-1)/2
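A minimal Python sketch of the two formulas above (the function name is mine, chosen for this example):

def max_branching_factor(n, k):
    """Maximum branching factor for n discs and k pegs."""
    if n >= k:
        # every peg can be occupied: (k-1) + (k-2) + ... + 1 + 0
        return k * (k - 1) // 2
    # only n pegs can hold a top disc: (k-1) + (k-2) + ... + (k-n)
    return k * (k - 1) // 2 - (k - n) * (k - n - 1) // 2

print(max_branching_factor(8, 3))   # classic 3-peg puzzle: 3
print(max_branching_factor(2, 4))   # 2 discs, 4 pegs: 3 + 2 = 5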

Related

Queue O(1) not as fast as append or pop in Python

I made an O(1) queue using nodes, where the Queue class contains "head" and "tail" and the Node class contains "next" and "back", but when I compared "enqueue and dequeue" to "append and pop" through timeit, I found that "append and pop" are way faster than the "enqueue and dequeue" I made.
Am I doing something wrong with Node or Queue, or will my O(1) queue simply not be as fast as append and pop?
An "arraylist" (which is the data structure behind Python's list) has an amortized cost of O(1) as well, so in terms of big oh, the two are equivalent. Indeed, the list has O(n) to append worst case, but this happens not very often. Typically the list has an initial capacity, and when the list is full, then it doubles the capacity. This means that if we want to make a list with n elements, and n is quite big, then the the list will be resized to capacities:
1, 2, 4, 8, 16, …, n
Each resize takes time proportional to the length of the list being copied into the larger array, so the total amount of time spent copying is the sum of these numbers, which is roughly 2×n-1, and that is O(n). This means that n appends take O(n) time in total, and thus the amortized cost of a single append is O(1).
But there are other reasons why your linked list will be less efficient. Each time you construct a new node, the interpreter has to allocate memory for it, and allocating memory takes some time. Furthermore, Python's lists are typically implemented inside the interpreter: CPython [GitHub], for example, which is likely the interpreter you are using, works with a PyListObject [GitHub], and this is often more efficient than implementing the same structure in Python itself, since Python code is interpreted.
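To see the constant-factor difference directly, here is a minimal sketch (class and variable names are mine, not taken from your code) comparing a node-based queue against plain list.append/list.pop with timeit:

from timeit import timeit

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedQueue:
    def __init__(self):
        self.head = self.tail = None
    def enqueue(self, value):
        node = Node(value)          # every enqueue allocates a new Python object
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next = node
            self.tail = node
    def dequeue(self):
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None
        return node.value

def use_linked_queue(n=10_000):
    q = LinkedQueue()
    for i in range(n):
        q.enqueue(i)
    for _ in range(n):
        q.dequeue()

def use_list(n=10_000):
    q = []
    for i in range(n):
        q.append(i)
    for _ in range(n):
        q.pop()                     # pop from the end, which list is optimised for

print(timeit(use_linked_queue, number=100))
print(timeit(use_list, number=100))

Both are O(1) per operation, but the list version spends almost all of its time inside C code, while the node version pays for object allocation and interpreted attribute access on every call.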

list.count() vs Counter() performance

While trying to find the frequency of a bunch of characters in a string, why does running string.count(character) 4 times for 4 different characters yield faster execution time (using time.time()) than using a collections.Counter(string)?
Background:
Given a sequence of moves represented by a string. Valid moves are R (right), L (left), U (up), and D (down). Return True if the sequence of moves takes me back to the origin. Otherwise, return False.
# approach 1: iterate 4 times (3.9*10^-6 seconds)
def foo1(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')

# approach 2: iterate once (3.9*10^-5 seconds)
def foo2(moves):
    from collections import Counter
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']

import time
start = time.time()
moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"
foo1(moves)
# foo2(moves)
end = time.time()
print("--- %s seconds ---" % (end - start))
These results are the opposite of what I had expected. My reasoning is that the first approach should take longer because the string is iterated over 4 times, whereas in the second approach we iterate only once. Could it be due to the library call overhead?
Counter is faster in theory, but has higher fixed overhead, especially compared to str.count, which can scan the underlying C array with direct memory comparisons, whereas list.count has to do rich comparisons for each element. Converting moves to a list of single characters nearly triples the time for foo1 in local tests, from 448 ns to 1.3 μs (while foo2 actually gets a tiny bit faster, dropping from 5.6 μs to 5.48 μs).
Other problems:
1. Importing an already imported module uses the cached import, but there is a surprising amount of overhead involved in even a cached import (the loading machinery has a lot of stuff to check to make sure it's okay to do so); in local tests, moving from collections import Counter to the top level reduced the runtime of foo2 by 1.6 μs (5.6 μs with single global import, 7.2 μs with local per-call import). This will vary a lot by environment; on another machine (with less stuff installed in both user and system site-packages), the overhead was only 0.75 μs. Regardless, it's a significant, avoidable disadvantage for foo2.
2. Counter on modern Python uses a C accelerator to speed up counting, but the accelerator only provides a benefit when the iterable is long enough. If you use the list form of moves, but multiply it by 100 to make a longer sequence, the difference drops, relatively speaking (to 106 µs for foo1 vs. 140 µs for foo2).
3. You're just not counting very many things; when there are only four things you care about, paying O(n) four times can easily beat paying O(n) once if the former case has lower constant multipliers (which aren't included in big-O notation) than the latter. Counter remains O(n) for any number of unique things being counted; calling .count is O(n) per call, but if you need to know the count of every unique thing in the input, for inputs that are mostly unique, individual .count calls for each will be asymptotically O(n²).
4. The .count approach is short-circuiting in your specific case, so it isn't even doing O(n) work four times, just twice; the U and D counts don't match, so it never counts L and R at all. Counter doesn't get meaningfully slower if it can't short-circuit (all the cost is paid in the single counting pass), but your foo1, in the same benchmark I used from point #2 (longer input, in list form), goes from 106 µs to 185 µs if I just add a single D to the end of the (pre-multiplication) moves (making the U and D counts the same, and requiring two more count calls); foo2 only goes up to 143 µs (from 140 µs), presumably because moves actually got longer (adding the D before multiplying by 100 meant it went from 2900 elements to count to 3000).
Basically, you had some minor implementation weaknesses, but mostly, you happened to choose a use case that gave all the advantage to .count, none to Counter. If your inputs are always str, and you're only counting them a small, fixed number of times, then sure, repeated calls to count are generally going to win. But for arbitrary input types (especially iterators, where count is impossible, both because it doesn't exist, and because you can only iterate it once), especially larger ones, with more unique things to count, where consistent performance counts (so relying on short-circuiting to reduce the number of count calls isn't acceptable), Counter will win.
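If you want to reproduce these comparisons yourself, timeit gives much more stable numbers than a single time.time() pair. A minimal sketch, reusing the moves string from the question and hoisting the import as suggested in point #1 (your absolute numbers will differ):

from timeit import timeit
from collections import Counter

moves = "LDRRLRUULRLRLRLRLRLRLRLRLRLRL"

def foo1(moves):
    return moves.count('U') == moves.count('D') and moves.count('L') == moves.count('R')

def foo2(moves):
    d = Counter(moves)
    return d['R'] == d['L'] and d['U'] == d['D']

# timeit runs each callable many times, averaging out scheduler noise
# that a single time.time() pair cannot.
print(timeit(lambda: foo1(moves), number=100_000))
print(timeit(lambda: foo2(moves), number=100_000))

Try the same with list(moves) or with moves * 100 to see the effects described in the points above.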

OpenCL: What if I have more tasks than available work items?

Let's make an example:
I want a vector dot product computed concurrently (it's not my actual case, this is only an example), so I have 2 large input vectors and a large output vector of the same size. The available work items are fewer than the sizes of these vectors. How can I compute this product in OpenCL if there are fewer work items than elements? Is this possible? Or do I just have to use some tricks?
Something like:
for (i = 0; i < n; i++) {
    output[i] = input1[i] * input2[i];
}
with n > available work items
If by "available work items" you mean you're running into the maximum given by CL_DEVICE_MAX_WORK_ITEM_SIZES, you can always enqueue your kernel multiple times for different ranges of the array.
Depending on your actual workload, it may be more sensible to make each work item perform more work though. In the simplest case, you can use the SIMD types such as float4, float8, float16, etc. and operate on large chunks like that in one go. As always though, there is no replacement for trying different approaches and measuring the performance of each.
Divide and conquer the data. If you keep the workgroup size as an exact divisor of the global work size, then you can have N workgroup launches, perhaps k of them at once per kernel launch. So you would launch N/k kernels, each with k*workgroup_size work items, with proper addressing of the buffers inside the kernels.
When you have per-workgroup partial sums of the dot product (computed with multiple in-group reduction steps), you can simply sum them on the CPU, or on whichever device the data is going to next.
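To make the first suggestion (enqueue the kernel several times over different ranges) concrete, here is a hedged sketch assuming pyopencl on the host side; the kernel source, the chunk size, and all variable names are mine, not from the question:

import numpy as np
import pyopencl as cl

KERNEL = """
__kernel void mul(__global const float *a,
                  __global const float *b,
                  __global float *out,
                  const int offset,
                  const int n)
{
    int i = get_global_id(0) + offset;   /* shift into this launch's chunk */
    if (i < n)
        out[i] = a[i] * b[i];
}
"""

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

prg = cl.Program(ctx, KERNEL).build()
chunk = 65536                            # hypothetical per-launch work-item budget
for offset in range(0, n, chunk):
    # one launch per chunk; the kernel's bounds check guards the last chunk
    prg.mul(queue, (min(chunk, n - offset),), None,
            a_buf, b_buf, out_buf, np.int32(offset), np.int32(n))

cl.enqueue_copy(queue, out, out_buf)
queue.finish()

The same pattern works from C or C++ host code: the only moving parts are the offset argument and the per-launch global size.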

linearK - large time difference between empirical and acceptance envelopes in spatstat

I am interested in the correlation between points at distances from 0 to 2 km on a linear network. I am using the following statement for the empirical data, and it finishes in about 2 minutes.
obs<-linearK(c, r=seq(0,2,by=0.20))
Now I want to test for randomness using acceptance envelopes, so I used envelope for the same r range.
acceptance_enve<-envelope(c, linearK, nsim=19, fix.n = TRUE, funargs = list(r=seq(0,2,by=0.20)))
But this shows an estimated time of a little less than 3 hours. I just want to ask if this large time difference is normal. Is my syntax correct for the call to envelope and the way its extra argument r is passed as a sequence?
Is there some efficient way to shorten this 3-hour execution time for the envelopes?
I have the road network of a whole city, so it is quite large, and I have checked that there are no disconnected subgraphs.
c
Point pattern on linear network
96 points
Linear network with 13954 vertices and 19421 lines
Enclosing window: rectangle = [559.653, 575.4999] x [4174.833, 4189.85] Km
thank you.
EDIT AFTER COMMENT
system.time({s <- runiflpp(npoints(c), as.linnet(c));
+ linearK(s, r=seq(0,2,by=0.20))})
user system elapsed
343.047 104.428 449.650
EDIT 2
I made some really small changes by deleting some peripheral network segments that seem to have little or no effect on the overall network. This also led to splitting some long segments into smaller ones. But now, on the same network with a different point pattern, I get an even longer estimated time:
> month1envelope=envelope(months[[1]], linearK ,nsim = 39, r=seq(0,2,0.2))
Generating 39 simulations of CSR ...
1, 2, [etd 12:03:43]
The new network is
> months[[1]]
Point pattern on linear network
310 points
Linear network with 13642 vertices and 18392 lines
Enclosing window: rectangle = [560.0924, 575.4999] x [4175.113, 4189.85] Km
System config: Mac OS X 10.9, 2.5 GHz, 16 GB RAM, R 3.3.3, RStudio 1.0.143
You don't need to use funargs in this context. Arguments can be passed directly through the ... argument. So I suggest
acceptance_enve <- envelope(c, linearK, nsim=19,
fix.n = TRUE, r=seq(0,2,by=0.20))
Please try this to see if it accelerates the execution.

Is there a search algorithm for minimizing the number of threads?

I am using the Intel Xeon Phi coprocessor, which has up to 240 threads, and I am working on minimizing the number of threads used for a particular application (or maximizing performance) while staying within a percentage of the best execution time. So for example, if I have the following measurements:
Threads | Execution time
    240 | 100 s
    200 | 105 s
    150 | 107 s
    120 | 109 s
    100 | 120 s
I would like to select a number of threads between 120 and 150, since the "performance curve" there seems to stabilize and the reduction in execution time is not that significant (in this case within around 15% of the best measured time). I did this using an exhaustive search algorithm (measuring from 1 to 240 threads), but my problem is that it takes too long for smaller numbers of threads (obviously depending on the size of the problem).
To try to reduce the number of measurements, I developed a sort of "binary search" algorithm. Basically I have an upper and a lower limit (beginning at 0 and 240 threads); I take the value in the middle, measure it, and also measure at 240. I get the percentage difference between both values, and if it is within 15% (this value was selected after analyzing the results of the exhaustive search) I assign a new lower or upper bound: if the difference is larger than 15%, the midpoint becomes the new lower bound (120-240), and if it is smaller, it becomes the new upper bound (0-120). If I get a better execution time I store it as the best execution time.
The problem with this algorithm is that, first of all, the execution times are not necessarily a sorted array, and for some problem sizes the exhaustive search results show two different minima; for example, in one case I get the best performance at 80 threads and at 170, and I would like the search to return 80, not 170 threads. However, for the other cases where there is only one minimum, the algorithm found a value very close to the one expected.
If anyone has a better idea or knows of an existing search algorithm or heuristic that could help me I would be really grateful.
I take it that your goal is to get the best relative performance for the fewest threads, while still maintaining some limit on performance based on a coefficient (<= 1) of the best possible performance. I.e., if the coefficient is 0.85, then the performance should be no less than 85% of the performance using all threads.
It seems like what you should be trying to do is simply find the minimum number of threads required to obtain the performance bound. Rather than looking at 1-240 threads, start at 240 threads and reduce the number of threads until you can place a lower bound on the performance limit. You can then work up from the lower bound in such a way that you find the minimum without passing over it. If you don't have a predefined performance bound, you can calculate one on the fly based on diminishing returns.
1. As long as the performance limit has not been exceeded, halve the number of threads (starting with the maximum number of threads). The first count that exceeds the performance limit is a lower bound, Z, on the number of threads required.
2. Starting at the lower bound on the number of threads, Z, add m threads if they can be added without getting within the performance limit. Repeatedly double the number of threads added until you get within the performance limit. If adding the threads gets within the performance limit, subtract the last addition and reset the number of threads to be added to m. If even just adding m gets within the limit, add those last m threads and return the resulting number of threads.
It might be clearer to give an example of what the process looks like step by step, where "Passed" means the thread count is still outside the performance limit (too slow), and "Failed" means it is on or inside the limit (fast enough).
Try adding 1m (Z + 1m). Passed. Threads = Z + m.
Try adding 2m (Z + 3m). Passed. Threads = Z + 3m.
Try adding 4m (Z + 7m). Failed. Threads = Z + 3m. Reset.
Try adding 1m. Passed. Threads = Z + 4m.
Try adding 2m. Passed. Threads = Z + 6m.
Z + 7m failed earlier so reset.
Comparisons/lookups are cheap, use them to prevent duplication of work.
Try adding 1m. Failed. Threads = Z + 6m. Reset.
Cannot add less than 1m and still be outside the performance limit.
The solution is Z + 7m threads.
Since Z + 6m is m threads short of the performance limit.
It's a bit inefficient, but it does find the minimum number of threads (>= Z) required to obtain the performance bound, to within an error of m-1 threads, requiring only O(log(N-Z)) tests. This should be enough in most cases; if it isn't, just skip step 1 and use Z = m, unless increasing the number of threads rapidly decreases the run time, causing very slow runs when Z is very small. In that case, doing step 1 and using interpolation can give an idea of how quickly the run time increases as the number of threads decreases, which is also useful for determining a good performance limit if none is given. A Python sketch of the whole procedure is given below.
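A hedged Python sketch of the procedure above; measure(t) stands for "run the application with t threads and return the execution time", and coeff and m are placeholders you would choose yourself:

def min_threads(measure, max_threads=240, coeff=0.85, m=1):
    # assumes run time is (roughly) non-increasing as the thread count grows
    cache = {}
    def timed(t):
        # comparisons/lookups are cheap: never measure the same count twice
        if t not in cache:
            cache[t] = measure(t)
        return cache[t]

    limit = timed(max_threads) / coeff     # slowest acceptable run time

    # Step 1: halve the thread count while the run time stays acceptable.
    # The first count that is too slow is the lower bound Z.
    hi = max_threads                       # smallest count known to be fast enough
    lo = 0                                 # largest count known to be too slow
    while hi > 1 and timed(hi // 2) <= limit:
        hi //= 2
    if hi > 1:
        lo = hi // 2                       # this count exceeded the limit: Z

    # Step 2: from Z, add m, 2m, 4m, ... threads while still too slow ("Passed").
    # When a count is fast enough ("Failed"), record it and reset the step to m.
    while hi - lo > m:
        step = m
        while lo + step < hi:
            t = lo + step
            if timed(t) <= limit:          # fast enough: new upper bound, reset
                hi = t
                break
            lo = t                         # still too slow: advance the lower bound
            step *= 2

    return hi                              # within m-1 threads of the true minimum

With the measurements from the question and coeff = 0.85, the limit would be about 118 s, so 120 threads already qualifies and the search would spend its remaining measurements only below that count.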
