'Simple' question: what is the fastest way to calculate the binomial coefficient? Some threaded algorithm?
I'm looking for hints :) - not implementations :)
Well, the fastest way, I reckon, would be to read them from a table rather than compute them.
Your requirement of integer accuracy from a double representation means that C(60,30) is just about too big, being around 1e17, so (assuming you want C(m,n) for all m up to some limit, and all n <= m) your table would only have around 1800 entries. As for filling the table in, I think Pascal's triangle is the way to go.
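A minimal sketch of that table approach in Python, assuming a limit of 60 (the LIMIT constant and function name are mine):

```python
# Precompute C(m, n) for all 0 <= n <= m <= LIMIT via Pascal's rule:
# C(m, n) = C(m-1, n-1) + C(m-1, n). Additions only, so no intermediate
# value ever exceeds the entries themselves.
LIMIT = 60

table = [[1] * (m + 1) for m in range(LIMIT + 1)]
for m in range(2, LIMIT + 1):
    for n in range(1, m):
        table[m][n] = table[m - 1][n - 1] + table[m - 1][n]

def binom_lookup(m, n):
    """Table lookup; valid for 0 <= n <= m <= LIMIT."""
    return table[m][n]
```

With a limit of 60 the triangle holds 61*62/2 = 1891 entries, matching the "around 1800" estimate above.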
According to the multiplicative formula from Wikipedia, C(n, k) = prod_{i=1..k} (n - k + i) / i, the fastest way would be to split the range i = 1..k among the threads, give each thread one range segment, and have each thread update the final result under a lock. The "academic way" would be to split the range into tasks, each task being to calculate (n - k + i)/i, and then, no matter how many threads you have, they all run in a loop asking for the next task. The first is faster; the second is... academic.
EDIT: further explanation - in both ways we have some arbitrary number of threads. Usually the number of threads equals the number of processor cores, because there is no benefit in adding more. The difference between the two ways is what those threads are doing.
In the first way, each thread is given N, K, I1 and I2, where [I1, I2] is a segment of the range 1..K. Each thread then has all the data it needs, so it calculates its part of the result and, upon finishing, updates the final result.
In the second way, each thread is given N, K, and access to a synchronized counter that counts from 1 to K. Each thread then acquires one value from this shared counter, calculates one factor (n - k + i)/i of the result, updates the final result, and loops on this until the counter informs the thread that there are no more items. If some processor cores are faster than others, this second way will put all cores to maximum use. The downside of the second way is too much synchronization, which effectively blocks, say, 20% of the threads all the time.
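A rough Python sketch of the "first way" (one segment per thread, shared result under a lock); note that CPython's GIL means the arithmetic won't actually run in parallel, so this only illustrates the structure:

```python
import threading
from math import prod

def binom_threaded(n, k, num_threads=4):
    """C(n, k) = prod_{i=1..k} (n - k + i) / i, split into segments.
    Numerator and denominator are accumulated separately so every
    intermediate value stays an integer; a lock guards the shared state."""
    result = {"num": 1, "den": 1}
    lock = threading.Lock()

    def worker(i1, i2):  # handles the segment [i1, i2] inclusive
        num = prod(n - k + i for i in range(i1, i2 + 1))
        den = prod(range(i1, i2 + 1))
        with lock:  # update the final result under a lock
            result["num"] *= num
            result["den"] *= den

    step = max(1, k // num_threads)
    threads = []
    i = 1
    while i <= k:
        t = threading.Thread(target=worker, args=(i, min(i + step - 1, k)))
        threads.append(t)
        t.start()
        i += step
    for t in threads:
        t.join()
    return result["num"] // result["den"]  # exact: den divides num
```

The division at the end is exact because the full numerator is n!/(n-k)! and the full denominator is k!.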
Hint: you want to do as few multiplications as possible. The formula is n! / (k! * (n-k)!). You should need fewer than 2m multiplications, where m is the minimum of k and n-k. If you want to work with (fairly) big numbers, you should use a special class for the number representation (Java has BigInteger, for instance).
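A sketch of that idea in Python (whose ints never overflow, standing in for Java's BigInteger): the loop runs m = min(k, n-k) times, doing one multiplication and one exact division per step, i.e. about 2m operations.

```python
def binom_mult(n, k):
    # Use the symmetry C(n, k) = C(n, n-k) to loop only m times.
    m = min(k, n - k)
    result = 1
    for i in range(1, m + 1):
        # After this step, result == C(n - m + i, i), an integer,
        # so the floor division is always exact.
        result = result * (n - m + i) // i
    return result
```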
Here's a way that never overflows if the final result is expressible natively in the machine, involves no multiplications/factorizations, is easily parallelized, and generalizes to BigInteger-types:
First note that the binomial coefficients satisfy Pascal's rule:
\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}.
This yields a straightforward recursion for computing the coefficient: the base cases are \binom{n}{0} and \binom{n}{n}, both of which are 1.
The individual results from the subcalls are integers and if \binom{n}{k} can be represented by an int, they can too; so, overflow is not a concern.
Naively implemented, the recursion leads to repeated subcalls and exponential runtimes.
This can be fixed by caching intermediate results. There are n^2 subproblems, which can be combined in O(1) time, yielding an O(n^2) complexity bound.
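The memoized recursion might be sketched like this (the function name is mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binom_rec(n, k):
    """Pascal's rule with memoization: O(n^2) subproblems, O(1) each.
    Only additions occur, so no intermediate value exceeds the final
    result -- no overflow if the answer itself is representable."""
    if k == 0 or k == n:
        return 1
    return binom_rec(n - 1, k - 1) + binom_rec(n - 1, k)
```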
This answer uses Python's math.comb for the binomial coefficients, printing the terms of the expansion of (a + bx)^c:
import math

def h(a, b, c):
    x = 0
    part = "="
    while x < (c + 1):
        nCr = math.comb(c, x)
        part = part + '+' + str((a ** (c - x)) * (b ** x) * nCr) + 'x^' + str(x)
        x = x + 1
    print(part)

h(2, 6, 4)
I am trying to determine the space and time complexity for TextRank the algorithm listed in this paper:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
Since it is using PageRank whose complexity is:
O(n+m) ( n - number of nodes, m - number of arcs/edges)
and we run it for i iterations (or until convergence), I believe the complexity for keyword extraction would be: O(i*(n+m))
and the space complexity would be O(V^2) using an adjacency matrix
For sentence extraction I believe it would be the same thing.
I'm really not sure, and any help would be great. Thank you.
If you repeat T times an algorithm (the inner one) with complexity O(n+m), or any other complexity for that matter, it is correct to conclude that the new algorithm (the outer one) will have a complexity of O(T*(n+m)), provided:
The outer algorithm will only add a constant complexity every time it repeats the inner one.
Parameters n and m remain the same at every invocation of the inner algorithm.
In other words, the outer algorithm should prepare the inputs for the inner one in constant time, and the parameters of new inputs should remain well represented by n and m along the T iterations. Otherwise, if any of these two requirements fail to be proved, you should sum T times the complexities associated to the new parameters, say
O(n_1 + m_1) + ... + O(n_T + m_T)
and also take into account all the pre- and post-processing of the outer algorithm before and after using the inner.
If complexity is O(n*log2(n))...
How can we prove the execution time for data of size 10^7 if we know that for data of size 10^5 the execution time is 0.1s?
In short: To my knowledge, you don't prove it in this way.
More verbosely:
The thing about complexity is that it is reported in Big O notation, in which any constants and lower-order terms are discarded. For example, the complexity in the question is O(n*log2(n)), but this could be the simplified form of k1 * n * log2(k2 * n + c2) + c1.
These constants cover things like initialization tasks which take the same time regardless of the number of samples, the proportional time it takes to do the log2(n) bit (each one of those could potentially take 10^6 times longer than the n bit), and so on.
In addition to the constants you also have variable factors, such as the hardware on which the algorithm is executed, any additional load on the system, etc.
In order to use this as the basis for an estimate of execution time, you would need enough samples of execution times with respect to problem sizes to estimate both the constants and the variable factors.
For practical purposes one could gather multiple samples of execution times for a sufficiently sizable set of problem sizes, then fit the data with a suitable function based on your complexity formula.
In terms of proving an execution time... not really doable, the best you can hope for is a best fit model and a significant p-value.
Of course, if all you want is a rough guess, you could always try assuming that all the constants and variables are 1 or 0 as appropriate and plug in the numbers you have: (0.1s / (10^5 * log2(10^5))) * (10^7 * log2(10^7)) = 14 ish
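That rough guess can be sketched in a few lines (the scaling function and its name are assumptions; it presumes all hidden constants cancel):

```python
import math

def extrapolate(t_known, n_known, n_target):
    """Scale a known running time by the ratio of n*log2(n) values,
    assuming all constants and lower-order terms cancel out."""
    f = lambda n: n * math.log2(n)
    return t_known * f(n_target) / f(n_known)

print(extrapolate(0.1, 10**5, 10**7))  # -> 14.0 seconds
```

Note the log ratio here is exactly 7/5 regardless of the logarithm's base, so the estimate is 0.1 * 100 * 1.4 = 14 seconds.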
I want to find a multithreaded algorithm to multiply an $n \times n$ matrix by an $n$-vector that achieves $\Theta(n^2/\lg n)$ parallelism while maintaining $\Theta(n^2)$ work.
I know an illegal solution, but any tips on how to make the span go down to $\Theta(\lg n)$?
There is an implementation of this problem in the CLRS textbook, as a procedure named MAT-VEC. But its span is $\Theta(n)$. To pull it down to logarithmic span, you can replace the serial summation in the inner for loop with a multithreaded divide-and-conquer strategy: recursively divide the range, spawn one half in parallel with the other, then synchronize and return the summed value left + right.
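A structural sketch of that divide-and-conquer summation (Python threads here only illustrate the spawn/sync pattern from the fork-join model; CPython's GIL prevents an actual speedup, and spawning one thread per recursion step would be far too heavy in practice):

```python
import threading

def parallel_sum(row, x, lo, hi):
    """Dot-product sum over row[lo:hi] * x[lo:hi] with Theta(lg n) span:
    the left half is "spawned", the right half computed directly."""
    if hi - lo == 1:
        return row[lo] * x[lo]
    mid = (lo + hi) // 2
    left_result = []
    t = threading.Thread(
        target=lambda: left_result.append(parallel_sum(row, x, lo, mid)))
    t.start()                              # spawn the left half
    right = parallel_sum(row, x, mid, hi)  # compute the right half
    t.join()                               # sync
    return left_result[0] + right

def mat_vec(A, x):
    # One output entry per row; each inner sum now has logarithmic span.
    return [parallel_sum(row, x, 0, len(x)) for row in A]
```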
I am working on multi objective Genetic Algorithms, I have say 4 objectives and no. of generations is 400, and a population size of 100.
So how many function evaluation will be there?
I mean to say is it 4*400*100 or 400*100?
If for each chromosome you evaluate 4 functions, then obviously you have a total of 4*400*100 evaluations.
What you might also want to consider is the running time of each of these evaluations, because if 3 of the functions run in O(n) and the fourth runs in O(n^2), the total running time will be bounded by O(number_of_gens*population_size*n^2), and will be only mildly affected by the other three functions in large problem instances.
If you're asking about the number of evaluations as counted by MOO researchers (i.e., you want to know whether your algorithm is better than mine with the same number of evaluations), then the accepted answer is incorrect. In multi-objective optimization, we formally consider the problem not as optimizing k different functions, but as optimizing one vector-valued function.
It's one evaluation per individual, regardless of the dimensionality of the objective space.
As far as I know, the number of function evaluations of a genetic algorithm can be calculated with the following equation:
Number of function evaluations = size of main population + [number of new children (from crossover) + number of mutated children (from mutation)] * number of iterations.
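Plugging in numbers as a sketch (the offspring counts below are made-up assumptions, not figures from the question):

```python
def ga_evaluations(pop_size, crossover_children, mutation_children, iterations):
    """Evaluation count per the formula above: the initial population is
    evaluated once, then only newly created individuals each iteration."""
    return pop_size + (crossover_children + mutation_children) * iterations

# e.g. population 100, 400 generations, 80 crossover + 20 mutation
# children per generation (hypothetical split):
print(ga_evaluations(100, 80, 20, 400))  # -> 40100
```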
Say I have this very common DP (dynamic programming) problem -
Given a cost matrix cost[][] and a position (m, n) in cost[][], write a function that returns cost of minimum cost path to reach (m, n) from (0, 0). Each cell of the matrix represents a cost to traverse through that cell. Total cost of a path to reach (m, n) is sum of all the costs on that path (including both source and destination). You can only traverse down, right and diagonally lower cells from a given cell, i.e., from a given cell (i, j), cells (i+1, j), (i, j+1) and (i+1, j+1) can be traversed. You may assume that all costs are positive integers.
PS: answer to this - 8
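A sequential bottom-up version of the DP described above might look like this (the example matrix is my own illustration, not the one from the original question, though it also happens to give 8):

```python
def min_cost_path(cost, m, n):
    """Bottom-up DP: dp[i][j] = cost[i][j] plus the cheapest of the
    three cells (above, left, diagonal) that can reach (i, j).
    Runs in O(m*n) time and space."""
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = cost[0][0]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = min(
                dp[i - 1][j] if i > 0 else INF,                 # from above
                dp[i][j - 1] if j > 0 else INF,                 # from left
                dp[i - 1][j - 1] if i > 0 and j > 0 else INF,   # diagonal
            )
            dp[i][j] = cost[i][j] + best
    return dp[m][n]

cost = [[1, 2, 3],
        [4, 8, 2],
        [1, 5, 3]]
print(min_cost_path(cost, 2, 2))  # -> 8 (path 1 -> 2 -> 2 -> 3)
```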
Now, after solving this question, the following question ran through my mind.
Say I have a 1000*1000 matrix; O(n^2) will take some time (<1 sec on an Intel i5 for sure).
But can I minimize it further, say by starting 6-8 threads on this algorithm and then synchronizing them back to get the answer at the end? Will that be faster, or even logically possible, or should I throw this thought away?
Generally speaking, on such small problems (as you say < 1sec), parallel computing is less efficient than sequential due to protocol overhead (thread starting and synchronizing). Another problem might be, that you increase the cache miss rate because you're choosing the data you want to operate on "randomly" (not linearly) from the input. However, when it comes to larger problems, say matrices with 10 times as many entries, it sure is worth a thought (or two).
This is a possible solution. Given an 8x8 matrix, we cut it into 4 equal squares, and one thread is responsible for each of those squares. The number in each little square indicates after how many time units the result in that square can be calculated.
So the total time is 33 units (whatever a unit is). Compared to the sequential solution with 64 units, that is just about half. You can convince yourself that the runtime for any 2^k x 2^k matrix is 2^(2k - 1) + 1.
However, this is only the first idea that came to my mind. I hope there is a (much) faster parallel solution out there.
What's more, for the reasons I mentioned at the beginning of my answer, for all practical purposes you would not achieve a speedup of 2 with my solution.
I'd start with algorithmic improvements. There's no need to test N^2 solutions.
One key is the direction from which you entered a square. If you entered it by moving downward, there's no need to check the square to the right. Likewise, if you entered it by moving right, there's no need to check the path downward from there. The destination of a right-angle turn can always be reached via a diagonal move, leaving out one square and its positive weight/cost.
As far as threading goes, I can see (at least) a couple of ways of splitting things up. One would be to simply queue up requests as you enter a square. I.e., instead of (for example) testing another square directly, you queue up requests to test its two or three exits. N threads process those requests, which generate more requests, continuing until all of them reach the end point.
This has the obvious disadvantage that you're likely to continue traversing some routes after serial code could have abandoned them, because they're already longer than the shortest route you've found so far.
Another possibility would be to start two threads, one traversing forward, the other backward. In each, you find the shortest route to any given point along the diagonal, then you're left with a purely linear scan through those candidates to find the shortest sum.