Multithreaded operators - multithreading

How can an operator +, -, *, /, etc... be multithreaded?
In this paper on the language IDL, it claims that all operators use the 'thread pool' for increased execution speed.
How is it that one can use multiple threads to execute a statement such as 'b = a - a' (as on page 42) of the paper?
Can someone explain this? (I currently consider IDL a total ripoff, but maybe someone can change my mind.)
(Really this applies to any language: how can an operator be multithreaded in any computer programming language?)

I think it's important to also consider that not all operations with + are created equal. If you're using some sort of bignum library, for example, you might be able to separate a large number into smaller parts, compute the partial sums independently (in parallel), then propagate the carries. In any case, it's not going to be a single-cycle addition of two integers. Multiplication involves a couple of steps, and division a lot of steps.
In the example given, the operands were floating point (where addition is already a non-trivial process) and had "4.2 million data points": I doubt they were storing that in a small 32-bit register.
The "simple" operation for addition has suddenly become a huge iterative process... or maybe something a lot faster if they are able to do it in parallel.
While simple operations with small integers might not be worth threading, it's worthwhile to note that B=A+A, while seeming simple, could actually lead to many calculations. 1 line of code doesn't necessarily mean 1 operation.
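To make the bignum idea concrete, here is a minimal Python sketch. The 32-bit limb size and the helper names (to_limbs, from_limbs, add_limbs) are my own assumptions for illustration, not how any particular bignum library does it. The per-limb sums in step 1 are independent of each other, so that is the part that could be handed to several threads; the carry pass in step 2 is sequential in this simple form.

```python
BASE = 2 ** 32  # assumed limb size; real libraries pick this per platform

def to_limbs(n):
    """Split a non-negative integer into little-endian base-2^32 limbs."""
    limbs = []
    while True:
        limbs.append(n % BASE)
        n //= BASE
        if n == 0:
            return limbs

def from_limbs(limbs):
    """Rebuild the integer from its limbs (used only for checking)."""
    return sum(limb * BASE ** i for i, limb in enumerate(limbs))

def add_limbs(a, b):
    size = max(len(a), len(b))
    a = a + [0] * (size - len(a))
    b = b + [0] * (size - len(b))
    # Step 1: independent per-limb sums; these could run in parallel.
    partial = [x + y for x, y in zip(a, b)]
    # Step 2: sequential carry propagation over the partial sums.
    result, carry = [], 0
    for p in partial:
        total = p + carry
        result.append(total % BASE)
        carry = total // BASE
    if carry:
        result.append(carry)
    return result

x, y = 2 ** 100 + 12345, 2 ** 64 - 1
print(from_limbs(add_limbs(to_limbs(x), to_limbs(y))) == x + y)
```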

I don't know about IDL, but it's certainly possible if you have some higher level types. For instance you could conveniently parallelize array operations. Presumably that's what the "4200000 pts" refers to, although someone decided to make the graphs really hard to read.
For comparison, in C (with possible OpenMP parallelization) you might have something like:
/* sizeof(B)/sizeof(B[0]) only works if B is a true array, not a pointer */
#pragma omp parallel for
for (int i = 0; i < sizeof(B) / sizeof(B[0]); i++) {
    B[i] -= A[i];
}
In a higher-level language or library, such as NumPy, Matlab or C++, it could be just B=B-A. All that said, B=A-A sounds confusingly like B=0 to me.
You asked for a parallel operator in a favorite language? Here's a bit of Haskell:
import Control.Parallel
pmap _ [] = []
pmap f (x:xs) =
  let rest = pmap f xs
  in rest `par` (f x) : rest
parOp op a b = pmap (uncurry op) (zip a b)
parAdd = parOp (+)
main = do
  putStrLn $ show ([0..300] `parAdd` [500..800])
Yes, it's still a loop. Multitude of operations (not operators) is key to this type of parallelism.

Primitive operations on matrices, arrays, etc. can be parallelised - scroll up to page 41 and you'll find:
For system comparison plots, the results are reported for arrays of 4.2 million elements.
Edit: Assume you have an array A = [1, 2, 3, 4, 5, 6].
Calculating B = A - A = [0, 0, 0, 0, 0, 0] involves 6 subtraction operations (1-1, 2-2, etc).
With a single CPU, regardless of the number of threads, the subtractions must be performed in series.
With multiple CPUs but only one thread, the subtractions are also performed in series - that's by definition of a thread.
With multiple CPUs and multiple threads, the subtractions can be divided amongst threads/CPUs and thus occur simultaneously (up to the number of CPUs available).
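Here is a rough Python sketch of that last case: the array is cut into contiguous chunks and each chunk is handed to a worker from a thread pool. The function names and the default of 4 workers are my own choices for illustration. One caveat: in standard CPython the global interpreter lock keeps pure-Python arithmetic from actually running simultaneously, so this only shows how the work is divided; a runtime like IDL's would run native code on each chunk.

```python
from concurrent.futures import ThreadPoolExecutor

def subtract_chunk(a, b, start, stop):
    """Element-wise a - b for the index range [start, stop)."""
    return [a[i] - b[i] for i in range(start, stop)]

def parallel_subtract(a, b, workers=4):
    n = len(a)
    chunk = max(1, -(-n // workers))  # ceiling division, at least 1
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    result = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields the chunk results in submission order
        for part in pool.map(lambda se: subtract_chunk(a, b, *se), bounds):
            result.extend(part)
    return result

A = [1, 2, 3, 4, 5, 6]
print(parallel_subtract(A, A))
```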

Related

Why are OpenMP Reduction Clauses Non-deterministic for Statically Scheduled Loops?

I have been working on a multi-GPU project where I have had problems with obtaining non-deterministic results. I was surprised when it turned out that I obtained non-deterministic results due to a reduction clause executed on the CPU.
In the book Using OpenMP - The Next Step it is written that
"[...] the order in which threads combine their value to construct the
value for the shared result is non-deterministic."
Maybe I just don't understand how the reduction clauses are implemented. Does it mean that if I use schedule(monotonic:static) in combination with a reduction clause each thread will execute its chunk of the iterations in a deterministic order, but that the order in which the partial results are combined at the end of the parallel region is non-deterministic?

Does it mean that if I use schedule(monotonic:static) in combination
with a reduction clause each thread will execute its chunk of the
iterations in a deterministic order, but that the order in which the
partial results are combined at the end of the parallel region is
non-deterministic?
The end result is known to be non-deterministic; detailed information can be found in:
What Every Computer Scientist Should Know about Floating Point Arithmetic. For instance:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
Now, regarding the order in which the threads perform the reduction: as far as I know, the OpenMP standard neither enforces any order nor requires that the order be deterministic. Hence, this is an implementation detail that is left up to the compiler implementing the OpenMP standard, and consequently it is something that your code should not rely upon.
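The quoted example is easy to reproduce. A minimal Python check, assuming ordinary IEEE 754 double-precision floats (which is what Python's float is on common platforms):

```python
x, y, z = 1e30, -1e30, 1.0

left = (x + y) + z   # x and y cancel first, then z is added
right = x + (y + z)  # z is absorbed into y before the cancellation

print(left, right)
```

The two groupings differ only in where the parentheses sit, which is exactly the freedom a parallel reduction takes when it combines partial results.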

Programming language semantics usually declares that a+b+c+d is evaluated as ((a+b)+c)+d. This is not parallel, so an OpenMP reduction is probably evaluated as (a+b)+(c+d). And so on for larger numbers of summands.
So you immediately have that, because of the non-associativity of floating point arithmetic, the result may be subtly different from the sequential value.
But more importantly, the exact value will depend on precisely how the combination is done. Is it a+(b+c) (on 2 threads) or (a+b)+c? So the result is at least "indeterministic" in the sense that you can not reconstruct how it was formed. It could probably even be done in two different ways, if you run the same reduction twice. That's what I would call "non-deterministic", but look in the standard for the exact definition of the term.
By the way, if you want to get some idea of how OpenMP actually does it, write your own reduction operator, and let each invocation print out what it computes. Here is a decent illustration: https://victoreijkhout.github.io/pcse/omp-reduction.html#Initialvalueforreductions
By the way, the standard actually doesn't use the word "non-deterministic" for this case. The following passage explains the issue:
Furthermore, using different numbers of threads may result in
different numeric results because of changes in the association of
numeric operations. For example, a serial addition reduction may have
a different pattern of addition associations than a parallel
reduction.
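To see the effect described in that passage without OpenMP, here is a small single-threaded Python simulation. It is only a sketch under my own assumptions: chunked_sum is a made-up helper that splits the input into contiguous blocks (roughly like a static schedule), sums each block, and then combines the partial sums in block order. A real OpenMP runtime may combine them in a different order.

```python
def chunked_sum(values, threads):
    """Sum each contiguous block separately, then add up the partial sums."""
    size = -(-len(values) // threads)  # ceiling division
    partials = []
    for start in range(0, len(values), size):
        partial = 0.0
        for v in values[start:start + size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

data = [1e30, 1.0, -1e30, 1.0]
for threads in (1, 2, 4):
    print(threads, chunked_sum(data, threads))
```

With this input, one block and two blocks give different totals, purely because the additions are associated differently.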

Are floating point SMT logics slower than real ones?

I wrote an application in Haskell that calls the Z3 solver to solve constraints with some complex formulas. Thanks to Haskell I can quickly switch the data type I'm working with.
When using SBV's AlgReal type for computations, I get results in sensible time; however, switching to Float or Double types makes Z3 consume ~2 GB of RAM, and it doesn't produce a result even after 30 minutes.
Is it expected that producing floating-point solutions requires much more time, or is it some mistake on my side?

As with any question regarding solver performance, it is impossible to make generalizations. Christoph Wintersteiger (https://stackoverflow.com/users/869966/christoph-wintersteiger) would be the expert on this to opine, but I'm not sure how closely he follows this group. Chris: If you're reading this, I'd love to hear your thoughts!
There's also the risk of comparing apples-to-oranges here: Reals and floats are two completely different logics, with different decision procedures, heuristics, algorithms, etc. I'm sure you can find problems where one outperforms the other, with no clear "performance" winner.
Having said all that, here are some things that make floating-point (FP) tricky:
Rewriting is almost impossible with FP, since rules like associativity simply
don't hold for FP addition/multiplication. So, there are fewer opportunities for
simplification before bit-blasting.
Similarly a * 1/a == 1 doesn't hold for floats. Neither does x + 1 /= x or (x + a == x) -> (a == 0) and many other "obvious" simplifications that you'd love to be able to make. All of this complicates reasoning.
The existence of NaN values makes equality non-reflexive: nothing compares equal to NaN, including itself. So, substitution of equals-for-equals is also problematic and requires side conditions.
Likewise, the existence of +0 and -0, which compare equal but behave differently due to rounding, complicates matters. The typical example is that x == 0 -> fma(a, b, x) == a * b doesn't hold (where fma is fused multiply-add), because depending on the sign of zero these two expressions can produce different values for different rounding modes.
Combining floats with integers and reals introduces non-linearity, which is always a weak spot for solvers. So, if you're using FP, it is advisable not to mix it with other theories, as the combination itself creates extra complexity.
Operations like multiplication, division, and remainder (yes, there's a floating-point remainder operation!) are inherently very complex and lead to extremely large SAT formulas. In particular, multiplication is known to challenge most SAT/BDD engines, due to the lack of any good variable ordering and splitting heuristics. Solvers end up bit-blasting fairly quickly, resulting in extremely large state spaces. I have observed that solvers have a hard time dealing with FP division and remainder operations even when the input is completely concrete; imagine what happens when it is fully symbolic!
The logic of reals has a decision procedure with double-exponential complexity. However, techniques like Fourier-Motzkin elimination (https://en.wikipedia.org/wiki/Fourier%E2%80%93Motzkin_elimination), while remaining exponential, perform really well in practice.
FP solvers are relatively new, and it's a niche area with nascent research. So existing solvers tend to be quite conservative and quickly bit-blast and reduce the problem to bit-vector logic. I would expect them to improve over time, just like all the other theories did.
Again, I want to emphasize comparing solver performance on these two different logics is misguided as they are totally different beasts. But hopefully, the above points illustrate why floating-point is tricky in practice.
A great paper to read about the treatment of IEEE754 floats in SMT solvers is: http://smtlib.cs.uiowa.edu/papers/BTRW14.pdf. You can see the myriad of operations it supports and get a sense of the complexity.
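Several of the failed "obvious" identities listed above can be observed with plain machine floats, no solver involved. A small Python sketch, assuming IEEE 754 doubles with the default round-to-nearest mode; the variable names are just for illustration:

```python
import math

# a * (1/a) == 1 does not hold for every a: collect small counterexamples.
bad_inverses = [a for a in range(1, 100) if a * (1.0 / a) != 1.0]

# x + 1 /= x fails once x is large enough to absorb the 1.
x = 1e30
absorbed = (x + 1.0 == x)

# NaN makes equality non-reflexive.
nan = float("nan")
nan_equals_itself = (nan == nan)

# +0 and -0 compare equal but still carry different signs.
zeros_equal = (0.0 == -0.0)
zero_signs = (math.copysign(1.0, 0.0), math.copysign(1.0, -0.0))

print(bad_inverses, absorbed, nan_equals_itself, zeros_equal, zero_signs)
```

A solver has to respect every one of these corner cases symbolically, which is why the usual algebraic rewrites are off the table.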

How can natural numbers be represented to offer constant time addition?

Cirdec's answer to a largely unrelated question made me wonder how best to represent natural numbers with constant-time addition, subtraction by one, and testing for zero.
Why Peano arithmetic isn't good enough:
Suppose we use
data Nat = Z | S Nat
Then we can write
Z + n = n
S m + n = S(m+n)
We can calculate m+n in O(1) time by placing m-r debits (for some constant r), one on each S constructor added onto n. To get O(1) isZero, we need to be sure to have at most p debits per S constructor, for some constant p. This works great if we calculate a + (b + (c+...)), but it falls apart if we calculate ((...+b)+c)+d. The trouble is that the debits stack up on the front end.
One option
The easy way out is to just use catenable lists, such as the ones Okasaki describes, directly. There are two problems:
O(n) space is not really ideal.
It's not entirely clear (at least to me) that the complexity of bootstrapped queues is necessary when we don't care about order the way we would for lists.

As far as I know, Idris (a dependently-typed purely functional language which is very close to Haskell) deals with this in a quite straightforward way. The compiler is aware of Nats and Fins (upper-bounded Nats) and replaces them with machine integer types and operations whenever possible, so the resulting code is pretty efficient. However, that's not true for custom types (even isomorphic ones), nor for the compilation stage (there were some code samples using Nats for type checking which resulted in exponential growth in compile time; I can provide them if needed).
In the case of Haskell, I think a similar compiler extension could be implemented. Another possibility is to write TH macros which would transform the code. Of course, neither option is easy.

My understanding is that in basic computer programming terminology the underlying problem is you want to concatenate lists in constant time. The lists don't have cheats like forward references, so you can't jump to the end in O(1) time, for example.
You can use rings instead, which you can merge in O(1) time, regardless of whether a+(b+(c+...)) or ((...+c)+b)+a logic is used. The nodes in the rings don't need to be doubly linked; a link to the next node is enough.
Subtraction is the removal of any node, O(1), and testing for zero (or one) is trivial. Testing for n > 1 is O(n), however.
If you want to reduce space, then at each operation you can merge the nodes at the insertion or deletion points and weight the remaining ones higher. The more operations you do, the more compact the representation becomes! I think the worst case will still be O(n), however.
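A rough Python sketch of the ring idea, under my own assumptions: a number n is a singly linked ring of n nodes, zero is None, and all the names (Node, add, pred, is_zero, to_int) are invented for illustration. Note that this version mutates its arguments, so unlike the Haskell setting of the question it is not persistent, and the space-compaction trick from the last paragraph is not implemented.

```python
class Node:
    __slots__ = ("next",)

    def __init__(self):
        self.next = self  # a lone node is a ring of size one

def add(a, b):
    """Merge two rings in O(1) by swapping one next-pointer from each."""
    if a is None:
        return b
    if b is None:
        return a
    a.next, b.next = b.next, a.next
    return a

def pred(a):
    """Subtract one in O(1) by unlinking the node after the handle."""
    if a is None:
        raise ValueError("cannot subtract one from zero")
    if a.next is a:
        return None
    a.next = a.next.next
    return a

def is_zero(a):
    return a is None

def from_int(n):
    ring = None
    for _ in range(n):
        ring = add(ring, Node())
    return ring

def to_int(a):
    """O(n) walk around the ring; only used to check results."""
    if a is None:
        return 0
    count, node = 1, a.next
    while node is not a:
        count += 1
        node = node.next
    return count

print(to_int(add(from_int(3), from_int(4))))
```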

We know that there are two "extremal" solutions for efficient addition of natural numbers:
Memory efficient: the standard binary representation of natural numbers, which uses O(log n) memory and requires O(log n) time for addition. (See also the chapter "Binary Representations" in Okasaki's book.)
CPU efficient: a representation that needs just O(1) time for addition. (See the chapter "Structural Abstraction" in the book.) However, this solution uses O(n) memory, as we'd represent a natural number n as a list of n copies of ().
I haven't done the actual calculations, but I believe for the O(1) numerical addition we won't need the full power of O(1) FIFO queues, it'd be enough to bootstrap standard list [] (LIFO) in the same way. If you're interested, I could try to elaborate on that.
The problem with the CPU efficient solution is that we need to add some redundancy to the memory representation so that we can spare enough CPU time. In some cases, adding such a redundancy can be accomplished without compromising the memory size (like for O(1) increment/decrement operation). And if we allow arbitrary tree shapes, like in the CPU efficient solution with bootstrapped lists, there are simply too many tree shapes to distinguish them in O(log n) memory.
So the question is: Can we find just the right amount of redundancy so that sub-linear amount of memory is enough and with which we could achieve O(1) addition? I believe the answer is no:
Let's have a representation+algorithm that has O(1) time addition. Let's then have a number of the magnitude of m bits, which we compute as a sum of 2^k numbers, each of them of the magnitude of (m-k) bits. To represent each of those summands we need (regardless of the representation) a minimum of (m-k) bits of memory, so at the beginning, we start with (at least) (m-k) 2^k bits of memory. Now at each of those 2^k additions, we are allowed to perform a constant number of operations, so we are able to process (and ideally remove) a total of C 2^k bits. Therefore at the end, the lower bound for the number of bits we need to represent the outcome is (m-k-C) 2^k bits. Since k can be chosen arbitrarily, our adversary can set k=m-C-1, which means the total sum will be represented with at least 2^(m-C-1) = 2^m/2^(C+1) ∈ Ω(2^m) bits. So a natural number n will always need Ω(n) bits of memory!

Number of assembly instruction execution orders

I am working on a presentation on multithreading and I want to demonstrate how the number of possible instruction orderings can grow factorially.
Consider the trivial program
a++;
b++;
c++;
In a single-threaded program the three assembly instructions (read, add one, write) that make up each ++ operation have only one order (read a, add one to a, write a to memory, read b,...)
In a program with just three threads executing these three lines in parallel there are many more configurations. The compiler can optimize and rewrite these instructions in any order, with the constraint that 'read', 'add one' and 'write' occur in order for each of a, b and c. How many valid orders are there?
Initial thoughts:
(3+3+3)!* 1/(3!+3!+3!)=20160
where (3+3+3)! is the total number of permutations without constraint and 1/(3!+3!+3!) is the proportion of permutations that have the correct order.

This might be more of an elaborate comment, but...
In the single-threaded version the compiler can reorder those additions without changing the output. C++ compilers are allowed to do so. So there are 3! possibilities for the single thread, and that is assuming the ++ is atomic.
When you go into multithreading, the notion of an order of operations loses its meaning; depending on the architecture, the operations can be executed at precisely the same time. In fact you do not even need threads for that, e.g. SSE instructions.
What you are trying to count is the execution of 3 additions where load->inc->store are not atomic, on a single thread. IMO, the way to impose order on the total of 9 elements would be similar to yours, but the factor would be (3!*3!*3!).
First you take 9!, then you impose order on 3 elements by dividing by 3!, and then you repeat the process 2 more times. However, I get the feeling that this factor is too big.
I would ask a mathematician who's good with combinatorics. The equivalent question is: you have NxM coloured balls, where N is the number of variables and M is the number of atomic operations you need to execute on each. What is the number of different orders for the balls? The colour is the variable, because you know that the 1st ball of a colour must be the load, the 2nd the ++ and the 3rd the store. So you get M=3 balls for each of N=3 colours. Maybe this representation would be better for a pure mathematician.
EDIT: Well, apparently, according to Wikipedia on permutations of multisets, my initial guess was right: 9!/(3!*3!*3!) = 1680. Still, I would check myself.
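For anyone who wants to check the count rather than trust the formula, here is a small Python sketch. It assumes the simple model from the question (each variable's three steps must stay in order, and any interleaving between variables is allowed); the function names are mine. One function evaluates the multiset-permutation formula, the other counts interleavings directly by choosing which variable takes the next step.

```python
from functools import lru_cache
from math import factorial

def multinomial(counts):
    """(sum of counts)! divided by the product of each count!."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

@lru_cache(maxsize=None)
def count_interleavings(remaining):
    """remaining[i] = number of steps variable i still has to execute."""
    if sum(remaining) == 0:
        return 1
    total = 0
    for i, r in enumerate(remaining):
        if r > 0:
            # let variable i execute its next step
            total += count_interleavings(remaining[:i] + (r - 1,) + remaining[i + 1:])
    return total

print(multinomial((3, 3, 3)), count_interleavings((3, 3, 3)))
```

Both come out to 1680, well below the 20160 from the question's initial formula, which divides by a sum of factorials instead of a product.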

What is the difference between bottom-up and top-down?

The bottom-up approach (to dynamic programming) consists in first looking at the "smaller" subproblems, and then solve the larger subproblems using the solution to the smaller problems.
The top-down consists in solving the problem in a "natural manner" and check if you have calculated the solution to the subproblem before.
I'm a little confused. What is the difference between these two?

rev4: A very eloquent comment by user Sammaron has noted that, perhaps, this answer previously confused top-down and bottom-up. While originally this answer (rev3) and other answers said that "bottom-up is memoization" ("assume the subproblems"), it may be the inverse (that is, "top-down" may be "assume the subproblems" and "bottom-up" may be "compose the subproblems"). Previously, I have read on memoization being a different kind of dynamic programming as opposed to a subtype of dynamic programming. I was quoting that viewpoint despite not subscribing to it. I have rewritten this answer to be agnostic of the terminology until proper references can be found in the literature. I have also converted this answer to a community wiki. Please prefer academic sources. List of references: {Web: 1,2} {Literature: 5}
Recap
Dynamic programming is all about ordering your computations in a way that avoids recalculating duplicate work. You have a main problem (the root of your tree of subproblems), and subproblems (subtrees). The subproblems typically repeat and overlap.
For example, consider your favorite example of Fibonacci. This is the full tree of subproblems, if we did a naive recursive call:
  TOP of the tree
fib(4)
 fib(3)...................... + fib(2)
  fib(2)......... + fib(1)       fib(1)........... + fib(0)
   fib(1) + fib(0)   fib(1)       fib(1)              fib(0)
    fib(1)   fib(0)
  BOTTOM of the tree
(In some other rare problems, this tree could be infinite in some branches, representing non-termination, and thus the bottom of the tree may be infinitely large. Furthermore, in some problems you might not know what the full tree looks like ahead of time. Thus, you might need a strategy/algorithm to decide which subproblems to reveal.)
Memoization, Tabulation
There are at least two main techniques of dynamic programming which are not mutually exclusive:
Memoization - This is a laissez-faire approach: You assume that you have already computed all subproblems and that you have no idea what the optimal evaluation order is. Typically, you would perform a recursive call (or some iterative equivalent) from the root, and either hope you will get close to the optimal evaluation order, or obtain a proof that you will arrive at the optimal evaluation order. You would ensure that the recursive call never recomputes a subproblem because you cache the results, and thus duplicate sub-trees are not recomputed.
example: If you are calculating the Fibonacci sequence fib(100), you would just call this, and it would call fib(100)=fib(99)+fib(98), which would call fib(99)=fib(98)+fib(97), ...etc..., which would call fib(2)=fib(1)+fib(0)=1+0=1. Then it would finally resolve fib(3)=fib(2)+fib(1), but it doesn't need to recalculate fib(2), because we cached it.
This starts at the top of the tree and evaluates the subproblems from the leaves/subtrees back up towards the root.
Tabulation - You can also think of dynamic programming as a "table-filling" algorithm (though usually multidimensional, this 'table' may have non-Euclidean geometry in very rare cases*). This is like memoization but more active, and involves one additional step: You must pick, ahead of time, the exact order in which you will do your computations. This should not imply that the order must be static, but that you have much more flexibility than memoization.
example: If you are performing fibonacci, you might choose to calculate the numbers in this order: fib(2),fib(3),fib(4)... caching every value so you can compute the next ones more easily. You can also think of it as filling up a table (another form of caching).
I personally do not hear the word 'tabulation' a lot, but it's a very decent term. Some people consider this "dynamic programming".
Before running the algorithm, the programmer considers the whole tree, then writes an algorithm to evaluate the subproblems in a particular order towards the root, generally filling in a table.
*footnote: Sometimes the 'table' is not a rectangular table with grid-like connectivity, per se. Rather, it may have a more complicated structure, such as a tree, or a structure specific to the problem domain (e.g. cities within flying distance on a map), or even a trellis diagram, which, while grid-like, does not have a up-down-left-right connectivity structure, etc. For example, user3290797 linked a dynamic programming example of finding the maximum independent set in a tree, which corresponds to filling in the blanks in a tree.
(At its most general, in a "dynamic programming" paradigm, I would say the programmer considers the whole tree, then writes an algorithm that implements a strategy for evaluating subproblems which can optimize whatever properties you want (usually a combination of time-complexity and space-complexity). Your strategy must start somewhere, with some particular subproblem, and perhaps may adapt itself based on the results of those evaluations. In the general sense of "dynamic programming", you might try to cache these subproblems, and more generally, try to avoid revisiting subproblems, with a subtle distinction perhaps being the case of graphs in various data structures. Very often, these data structures are at their core like arrays or tables. Solutions to subproblems can be thrown away if we don't need them anymore.)
[Previously, this answer made a statement about the top-down vs bottom-up terminology; there are clearly two main approaches called Memoization and Tabulation that may be in bijection with those terms (though not entirely). The general term most people use is still "Dynamic Programming" and some people say "Memoization" to refer to that particular subtype of "Dynamic Programming." This answer declines to say which is top-down and bottom-up until the community can find proper references in academic papers. Ultimately, it is important to understand the distinction rather than the terminology.]
Pros and cons
Ease of coding
Memoization is very easy to code (you can generally* write a "memoizer" annotation or wrapper function that automatically does it for you), and should be your first line of approach. The downside of tabulation is that you have to come up with an ordering.
*(this is actually only easy if you are writing the function yourself, and/or coding in an impure/non-functional programming language... for example if someone already wrote a precompiled fib function, it necessarily makes recursive calls to itself, and you can't magically memoize the function without ensuring those recursive calls call your new memoized function (and not the original unmemoized function))
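As an illustration of the "memoizer" wrapper idea, here is a minimal Python sketch. The decorator name memoize is my own (Python's standard library offers functools.lru_cache for the same purpose), and it assumes the arguments are hashable. Because the decorator rebinds the name fib, the recursive calls inside the function body go through the cache as well, which is exactly the condition the footnote is about.

```python
import functools

def memoize(func):
    """Wrap func so that each distinct argument tuple is computed once."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

@memoize
def fib(n):
    # The name fib now refers to the wrapper, so these calls hit the cache.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))
```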
Recursiveness
Note that both top-down and bottom-up can be implemented with recursion or iterative table-filling, though it may not be natural.
Practical concerns
With memoization, if the tree is very deep (e.g. fib(10^6)), you will run out of stack space, because each delayed computation must be put on the stack, and you will have 10^6 of them.
Optimality
Either approach may not be time-optimal if the order you happen (or try to) visit subproblems is not optimal, specifically if there is more than one way to calculate a subproblem (normally caching would resolve this, but it's theoretically possible that caching might not in some exotic cases). Memoization will usually add on your time-complexity to your space-complexity (e.g. with tabulation you have more liberty to throw away calculations, like using tabulation with Fib lets you use O(1) space, but memoization with Fib uses O(N) stack space).
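The O(1)-space remark for Fib can be sketched like this in Python (fib_tab is a made-up name; it uses the fib(0)=0, fib(1)=1 convention from the example above). Because the evaluation order is chosen up front, only the two most recent table entries ever need to be kept.

```python
def fib_tab(n):
    """Bottom-up Fibonacci that keeps only the last two table entries."""
    prev, curr = 0, 1  # fib(0), fib(1)
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev

print(fib_tab(100))
```

A plain memoized recursion cannot discard entries this way, since it does not know which cached values will still be requested.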
Advanced optimizations
If you are also working on extremely complicated problems, you might have no choice but to do tabulation (or at least take a more active role in steering the memoization where you want it to go). Also, if you are in a situation where optimization is absolutely critical and you must optimize, tabulation will allow you to do optimizations which memoization would not otherwise let you do in a sane way. In my humble opinion, in normal software engineering, neither of these two cases ever comes up, so I would just use memoization ("a function which caches its answers") unless something (such as stack space) makes tabulation necessary... though technically to avoid a stack blowout you can 1) increase the stack size limit in languages which allow it, or 2) eat a constant factor of extra work to virtualize your stack (ick), or 3) program in continuation-passing style, which in effect also virtualizes your stack (not sure of the complexity of this, but basically you will effectively take the deferred call chain from the stack of size N and de facto stick it in N successively nested thunk functions... though in some languages without tail-call optimization you may have to trampoline things to avoid a stack blowout).
More complicated examples
Here we list examples of particular interest, that are not just general DP problems, but interestingly distinguish memoization and tabulation. For example, one formulation might be much easier than the other, or there may be an optimization which basically requires tabulation:
the algorithm to calculate edit-distance[4], interesting as a non-trivial example of a two-dimensional table-filling algorithm
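For reference, a minimal Python sketch of that two-dimensional table-filling, assuming the usual Levenshtein definition (insertions, deletions and substitutions each cost 1); the function name is my own. Entry table[i][j] holds the distance between the first i characters of s and the first j characters of t, and the loops fill it row by row so every entry's three neighbours are ready when it is computed.

```python
def edit_distance(s, t):
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all i characters of s
    for j in range(cols):
        table[0][j] = j  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table[-1][-1]

print(edit_distance("kitten", "sitting"))
```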

Top down and bottom up DP are two different ways of solving the same problems. Consider a memoized (top down) vs dynamic (bottom up) programming solution to computing Fibonacci numbers.
fib_cache = {}
def memo_fib(n):
    # top down: recurse from n, caching every answer
    global fib_cache
    if n == 0 or n == 1:
        return 1
    if n in fib_cache:
        return fib_cache[n]
    ret = memo_fib(n - 1) + memo_fib(n - 2)
    fib_cache[n] = ret
    return ret
def dp_fib(n):
    # bottom up: build the list of answers from the smallest upwards
    partial_answers = [1, 1]
    while len(partial_answers) <= n:
        partial_answers.append(partial_answers[-1] + partial_answers[-2])
    return partial_answers[n]
print(memo_fib(5), dp_fib(5))
I personally find memoization much more natural. You can take a recursive function and memoize it by a mechanical process (first lookup answer in cache and return it if possible, otherwise compute it recursively and then before returning, you save the calculation in the cache for future use), whereas doing bottom up dynamic programming requires you to encode an order in which solutions are calculated, such that no "big problem" is computed before the smaller problem that it depends on.

A key feature of dynamic programming is the presence of overlapping subproblems. That is, the problem that you are trying to solve can be broken into subproblems, and many of those subproblems share subsubproblems. It is like "Divide and conquer", but you end up doing the same thing many, many times. An example that I have used since 2003 when teaching or explaining these matters: you can compute Fibonacci numbers recursively.
def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)
Use your favorite language and try running it for fib(50). It will take a very, very long time. Roughly as much time as fib(50) itself! However, a lot of unnecessary work is being done. fib(50) will call fib(49) and fib(48), but then both of those will end up calling fib(47), even though the value is the same. In fact, fib(47) will be computed three times: by a direct call from fib(49), by a direct call from fib(48), and also by a direct call from another fib(48), the one that was spawned by the computation of fib(49)... So you see, we have overlapping subproblems.
Great news: there is no need to compute the same value many times. Once you compute it once, cache the result, and the next time use the cached value! This is the essence of dynamic programming. You can call it "top-down", "memoization", or whatever else you want. This approach is very intuitive and very easy to implement. Just write a recursive solution first, test it on small tests, add memoization (caching of already computed values), and --- bingo! --- you are done.
Usually you can also write an equivalent iterative program that works from the bottom up, without recursion. In this case this would be the more natural approach: loop from 1 to 50 computing all the Fibonacci numbers as you go.
fib = [0] * 51  # room for fib[0] .. fib[50]
fib[0] = 0
fib[1] = 1
for i in range(49):  # fills fib[2] .. fib[50]
    fib[i+2] = fib[i] + fib[i+1]
In any interesting scenario the bottom-up solution is usually more difficult to understand. However, once you do understand it, you usually get a much clearer big picture of how the algorithm works. In practice, when solving nontrivial problems, I recommend first writing the top-down approach and testing it on small examples. Then write the bottom-up solution and compare the two to make sure you are getting the same thing. Ideally, compare the two solutions automatically. Write a small routine that generates lots of tests (ideally, all small tests up to a certain size) and validates that both solutions give the same result. After that use the bottom-up solution in production, but keep the top-down code, commented out. This will make it easier for other developers to understand what it is that you are doing: bottom-up code can be quite incomprehensible, even if you wrote it and even if you know exactly what you are doing.
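A minimal Python sketch of such an automatic comparison, using Fibonacci as the stand-in problem. The function names and the limit of 300 are arbitrary choices for illustration; for a real problem you would enumerate all small inputs instead of a simple range.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_top_down(n):
    return n if n < 2 else fib_top_down(n - 1) + fib_top_down(n - 2)

def fib_bottom_up(n):
    table = [0, 1]
    for i in range(2, n + 1):
        table.append(table[i - 1] + table[i - 2])
    return table[n]

def first_mismatch(limit):
    """Return the smallest n < limit where the two versions disagree, else None."""
    for n in range(limit):
        if fib_top_down(n) != fib_bottom_up(n):
            return n
    return None

print(first_mismatch(300))
```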
In many applications the bottom-up approach is slightly faster because of the overhead of recursive calls. Stack overflow can also be an issue in certain problems, and note that this can very much depend on the input data. In some cases you may not be able to write a test causing a stack overflow if you don't understand dynamic programming well enough, but some day this may still happen.
Now, there are problems where the top-down approach is the only feasible solution because the problem space is so big that it is not possible to solve all subproblems. However, the "caching" still works in reasonable time because your input only needs a fraction of the subproblems to be solved, yet it is too tricky to define explicitly which subproblems you need to solve, and hence to write a bottom-up solution. On the other hand, there are situations when you know you will need to solve all subproblems. In this case go on and use bottom-up.
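A toy illustration of the first situation, in Python. The recurrence f(0) = 1, f(n) = f(n // 2) + f(n // 3) is a made-up example, not taken from any particular problem. For n around 10^12 a bottom-up table over 0..n is out of the question, but the memoized recursion only ever touches the few hundred arguments it actually needs:

```python
def f(n, memo):
    """Hypothetical recurrence: f(0) = 1, f(n) = f(n // 2) + f(n // 3)."""
    if n == 0:
        return 1
    if n not in memo:
        # Only arguments reachable from the original n are ever stored.
        memo[n] = f(n // 2, memo) + f(n // 3, memo)
    return memo[n]

def solved_subproblems(n):
    """Return how many distinct subproblems the top-down solution cached."""
    memo = {}
    f(n, memo)
    return len(memo)
```

Here the set of needed subproblems happens to be easy to describe, but the same code shape works when it is not.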
I would personally use top-down for paragraph optimization, a.k.a. the word wrap optimization problem (look up the Knuth-Plass line-breaking algorithm; at least TeX uses it, and some software by Adobe Systems uses a similar approach). I would use bottom-up for the Fast Fourier Transform.

Let's take the Fibonacci series as an example:
1, 1, 2, 3, 5, 8, 13, 21, ...
First number: 1
Second number: 1
Third number: 2
Another way to put it:
Bottom (first) number: 1
Top (eighth) number in the given sequence: 21
In the case of the first five Fibonacci numbers:
Bottom (first) number: 1
Top (fifth) number: 5
Now let's take a look at the recursive Fibonacci algorithm as an example:
public int recursive(int n) {
    if (n <= 2) {
        return 1;  // first and second numbers are both 1
    } else {
        return recursive(n - 1) + recursive(n - 2);
    }
}
Now suppose we execute this program with the following call:
recursive(5);
If we look closely at the algorithm, in order to generate the fifth number it requires the third and fourth numbers. So the recursion actually starts from the top (5) and then goes all the way down to the bottom (lower) numbers. This is the top-down approach.
To avoid doing the same calculation multiple times we use dynamic programming techniques. We store each previously computed value and reuse it. This technique is called memoization. There is more to dynamic programming than memoization, but that is not needed to discuss the current problem.
Top-Down
Let's rewrite our original algorithm and add memoization.
public int memoized(int n, int[] memo) {
    if (n <= 2) {
        return 1;         // base cases: first and second numbers
    } else if (memo[n] != -1) {
        return memo[n];   // already computed: reuse the cached value
    } else {
        memo[n] = memoized(n - 1, memo) + memoized(n - 2, memo);
    }
    return memo[n];
}
And we execute this method as follows:
// requires: import java.util.Arrays;
int n = 5;
int[] memo = new int[n + 1];
Arrays.fill(memo, -1);  // -1 marks "not computed yet"
memoized(n, memo);
This solution is still top-down, as the algorithm starts from the top value and goes down towards the bottom at each step to get our top value.
Bottom-Up
But the question is, can we start from the bottom, i.e. from the first Fibonacci number, and then walk our way up? Let's rewrite it using this technique:
public int dp(int n) {
    if (n <= 2) {
        return 1;  // also avoids writing output[2] when n < 2
    }
    int[] output = new int[n + 1];
    output[1] = 1;
    output[2] = 1;
    for (int i = 3; i <= n; i++) {
        output[i] = output[i - 1] + output[i - 2];
    }
    return output[n];
}
Now if we look at this algorithm, it actually starts from the lower values and then goes to the top. If I need the 5th Fibonacci number, I am actually calculating the 1st, then the 2nd, then the 3rd, all the way up to the 5th number. This technique is called bottom-up.
The last two algorithms both fulfill the requirements of dynamic programming, but one is top-down and the other is bottom-up. Both have similar space and time complexity: O(n) each.

Dynamic programming is often equated with memoization, but the two differ:
1. Memoization is the top-down technique (start solving the given problem by breaking it down), and dynamic programming is a bottom-up technique (start solving from the trivial sub-problem, up towards the given problem).
2. DP finds the solution by starting from the base case(s) and working its way upwards. DP solves all the sub-problems, because it works bottom-up, unlike memoization, which solves only the needed sub-problems.
3. DP has the potential to transform exponential-time brute-force solutions into polynomial-time algorithms.
4. DP may be much more efficient because it is iterative, whereas memoization must pay for the (often significant) overhead due to recursion.
To put it more simply, memoization uses the top-down approach: it begins with the core (main) problem, breaks it into sub-problems, and solves those sub-problems in the same way. In this approach the same sub-problem can occur multiple times. Without a cache each occurrence would be recomputed, consuming more CPU cycles and increasing the time complexity, so memoization stores each result the first time it is computed. In bottom-up dynamic programming each sub-problem is likewise solved only once, and the prior result is reused.
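The difference between recomputing and reusing sub-problems can be made visible by counting calls. The following Python sketch (helper names are mine, with the first two Fibonacci numbers taken as 1) counts how many times the recursive function is entered with and without a cache:

```python
def count_calls_naive(n):
    """Number of calls made by plain recursion when computing fib(n)."""
    calls = 0

    def fib(k):
        nonlocal calls
        calls += 1
        if k <= 2:
            return 1
        return fib(k - 1) + fib(k - 2)   # same sub-problems solved repeatedly

    fib(n)
    return calls

def count_calls_memoized(n):
    """Number of calls made by the same recursion once results are cached."""
    calls = 0
    memo = {}

    def fib(k):
        nonlocal calls
        calls += 1
        if k <= 2:
            return 1
        if k not in memo:
            memo[k] = fib(k - 1) + fib(k - 2)   # each k is computed only once
        return memo[k]

    fib(n)
    return calls
```

The naive count grows roughly like the Fibonacci numbers themselves, while the memoized count grows linearly in n.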

Dynamic programming problems can be solved using either bottom-up or top-down approaches.
Generally, the bottom-up approach uses the tabulation technique, while the top-down approach uses the recursion (with memoization) technique.
But you can also have bottom-up and top-down approaches using recursion as shown below.
Bottom-Up: Start with the base condition and pass the value calculated so far along recursively. Generally, these are tail recursions.
int n = 5;
fibBottomUp(1, 1, 2, n);
// i and j are the two most recent values; count is the index of i + j.
// Indexing starts at 0 here (fib(0) = fib(1) = 1), so n = 5 gives 8.
private int fibBottomUp(int i, int j, int count, int n) {
    if (count > n) return 1;
    if (count == n) return i + j;
    return fibBottomUp(j, i + j, count + 1, n);
}
Top-Down: Start with the final condition and recursively get the result of its sub-problems.
int n = 5;
fibTopDown(n);
private int fibTopDown(int n) {
    if (n <= 1) return 1;
    return fibTopDown(n - 1) + fibTopDown(n - 2);
}

Simply put, the top-down approach uses recursion, calling sub-problems again and again, whereas the bottom-up approach uses a single loop without any recursive calls, and hence it is more efficient.

The following is a DP-based solution for the Edit Distance problem. It fills the table iteratively, so it is bottom-up. I hope it will also help in understanding the world of dynamic programming:
public int minDistance(String word1, String word2) { // Standard dynamic programming puzzle.
    int m = word2.length();
    int n = word1.length();
    if (m == 0) // Cannot miss the corner cases!
        return n;
    if (n == 0)
        return m;
    int[][] DP = new int[n + 1][m + 1];
    for (int j = 1; j <= m; j++) {
        DP[0][j] = j;
    }
    for (int i = 1; i <= n; i++) {
        DP[i][0] = i;
    }
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            if (word1.charAt(i - 1) == word2.charAt(j - 1))
                DP[i][j] = DP[i - 1][j - 1];
            else
                DP[i][j] = Math.min(Math.min(DP[i - 1][j], DP[i][j - 1]), DP[i - 1][j - 1]) + 1; // Main idea is this.
        }
    }
    return DP[n][m];
}
You can try writing its recursive implementation at home. It is quite a good challenge if you haven't solved something like this before.
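For readers who want to check their attempt, one possible recursive (top-down) version might look like the Python sketch below. It is my own reading of the same recurrence, not the answerer's code. The name `min_distance_top_down` and the inner helper `solve` are made up, and `functools.lru_cache` supplies the memoization:

```python
from functools import lru_cache

def min_distance_top_down(word1, word2):
    """Edit distance via memoized recursion over prefix lengths."""

    @lru_cache(maxsize=None)
    def solve(i, j):
        # Distance between the first i chars of word1 and first j of word2.
        if i == 0:
            return j          # insert the remaining j characters
        if j == 0:
            return i          # delete the remaining i characters
        if word1[i - 1] == word2[j - 1]:
            return solve(i - 1, j - 1)
        # One edit (delete, insert or replace) plus the best smaller problem.
        return 1 + min(solve(i - 1, j), solve(i, j - 1), solve(i - 1, j - 1))

    return solve(len(word1), len(word2))
```

The three arguments to `min` correspond to the three table cells read in the iterative version; the recursion simply asks for them on demand instead of filling the whole table first.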

Nothing to be confused about... you usually learn the language in a bottom-up manner (from basics to more complicated things), and often build your project in a top-down manner (from the overall goal and structure of the code to particular pieces of the implementation).
