Differences between Wallace Tree and Dadda Multipliers

Could anyone tell me the difference in the partial-product reduction method or mechanism between Wallace and Dadda multipliers?
I have been reading A_comparison_of_Dadda_and_Wallace_multiplier_delays.pdf

Both are very similar. Instead of the traditional row-based algorithm, they both implement a multiplication A*B by 1/ ANDing A with each bit b_i to generate the partial products, 2/ counting the bits in every column until only two rows remain, and 3/ performing the final addition with a fast adder.
I worked on a Dadda multiplier, but that was many, many years ago, and I am not sure I remember all the details. To the best of my knowledge, the main difference is in the counting process.
Wallace introduced the "Wallace tree" structure (which is still useful in some designs). Given n bits, it counts how many of them are at 1. An (n,m) Wallace tree (where m=ceil(log_2 n)) takes n input bits and outputs, on m bits, the number of inputs at 1. It is, in a way, a combinational counter. For instance, below is the schematic of a (7,3) Wallace tree made of full adders (which are themselves (3,2) Wallace trees).
As you can see, this tree generates result bits of weight 2^0, 2^1 and 2^2, assuming the input bits all have weight 2^0.
This allows a fast reduction of the column heights, but can be somewhat inefficient in terms of gate count.
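If it helps to see the counting in action, here is a small behavioral sketch in Python (purely illustrative, not a hardware description; the function names are mine) of a (7,3) counter built from four full adders:

```python
from itertools import product

def full_adder(a, b, c):
    """(3,2) counter: the sum and carry bits together encode how many of a, b, c are 1."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def counter_7_3(bits):
    """(7,3) Wallace tree built from four full adders: counts the 1s among 7 input bits."""
    s0, c0 = full_adder(bits[0], bits[1], bits[2])  # weight-1 sum, weight-2 carry
    s1, c1 = full_adder(bits[3], bits[4], bits[5])  # weight-1 sum, weight-2 carry
    s2, c2 = full_adder(s0, s1, bits[6])            # weight-1 output bit, weight-2 carry
    s3, c3 = full_adder(c0, c1, c2)                 # weight-2 output bit, weight-4 carry
    return 4 * c3 + 2 * s3 + s2                     # output bits of weight 2^2, 2^1, 2^0

# Sanity check: the counter output equals the population count of the 7 inputs.
assert all(counter_7_3(v) == sum(v) for v in product((0, 1), repeat=7))
```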
Luigi Dadda does not use such an aggressive reduction strategy and tries to keep the column heights more balanced. Only full (or half) adders are used, and every counting/reduction step only generates bits of weight 2^0 and 2^1. The reduction process is less efficient (as can be seen from the larger number of rows in your figure), but the gate count is better. Dadda's strategy was also supposed to be slightly less time efficient, but according to the enclosed paper, which I did not know, that is not true.
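For reference, Dadda reduces each column only down to the next value of the height sequence d_1 = 2, d_{j+1} = floor(3/2 * d_j), using as few adders as possible per stage. A tiny Python sketch of that sequence (names are mine):

```python
def dadda_heights(initial_height):
    """Dadda's target column heights: d1 = 2, d_{j+1} = floor(3/2 * d_j).
    Each reduction stage only brings the columns down to the largest target
    below the current height, which is what keeps the adder count low."""
    h, targets = 2, [2]
    while (3 * h) // 2 < initial_height:
        h = (3 * h) // 2
        targets.append(h)
    return targets

print(dadda_heights(32))  # [2, 3, 4, 6, 9, 13, 19, 28]
# A 32x32 multiplier starts with columns up to 32 bits high, so the stages go
# 32 -> 28 -> 19 -> 13 -> 9 -> 6 -> 4 -> 3 -> 2 before the final fast adder.
```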
The main interest of Wallace/Dadda multipliers is that they can achieve a multiplication in ~log n time, which is much better than the traditional O(n) array multiplier with carry-save adders. But, despite this theoretical advantage, they are not really used any longer. Present architectures are more concerned with throughput than latency and prefer simpler array structures that can be efficiently pipelined. Implementing a Wallace/Dadda structure is a real nightmare beyond a few bits, and adding pipelining to it is very complex due to its irregular structure.
Note that other multiplier designs also achieve O(log n) time with a more regular and implementable divide-and-conquer strategy, for instance the Luk-Vuillemin multiplier.

Related

Random primes and Rabin Karp substring search

I am reading the Rabin-Karp algorithm from Sedgewick. The book says:
We use a random prime Q taking as large a value as possible while
avoiding overflow
At first reading I didn't notice the significance of "random", and when I saw that a long is used in the code, my first thoughts were:
a) use the sieve of Eratosthenes to find a big prime that fits in a long,
or
b) look up, from a list of primes, any prime larger than an int and use it as a constant.
But then the rest of the explanation says:
We will use a long value greater than 10^20 making the probability
that a collision happens less than 10^-20
This part got me confused since a long can not fit 10^20 let alone a value greater than that.
Then when I checked the calculation for the prime the book defers to an exercise that has just the following hint:
A random n-digit number is prime with probability proportional to 1/n
What does that mean?
So basically what I don't get is:
a) what is the meaning of using a random prime? Why can't we just pre-calculate it and use it as a constant?
b) why is the 10^20 mentioned since it is out of range for long?
c) How is that hint helpful? What does it mean exactly?
Once again, Sedgewick has tried to simplify an algorithm and gotten the details slightly wrong. First, as you observe, 10^20 cannot be represented in 64 bits. Even taking a prime close to 2^63 - 1, however, you probably would want a bit of room to multiply the normal way without overflowing so that the subsequent modulo is correct. The answer uses a 31-bit prime, which makes this easy but only offers collision probabilities in the 10^-9 range.
The original version uses Rabin fingerprints and a random irreducible polynomial over GF(2)[x], which from the perspective of algebraic number theory behaves a lot like a random prime over the integers. If we choose the polynomial to be degree 32 or 64, then the fingerprints fit perfectly into a computer word of the appropriate length, and polynomial addition and subtraction both work out to bitwise XOR, so there is no overflow.
Now, Sedgewick presumably didn't want to explain how polynomial rings work. Fine. If I had to implement this approach in practice, I'd choose a prime p close to the max that was easy to mod by with cheap instructions (I'm partial to 2^31 - 2^27 + 1; EDIT actually 2^31 - 1 works even better since we don't need a smooth prime here) and then choose a random number in [1, p-1] to evaluate the polynomials at (this is how Wikipedia explains it). The reason that we need some randomness is that otherwise the oblivious adversary could choose an input that would be guaranteed to have a lot of hash collisions, which would severely degrade the running time.
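To make that concrete, here is a minimal Python sketch of this variant (fixed Mersenne prime, random evaluation point); the function names and the verification step are my own illustration, not Sedgewick's code:

```python
import random

P = (1 << 31) - 1                       # the Mersenne prime 2^31 - 1

def rabin_karp(pattern, text):
    m, n = len(pattern), len(text)
    if m > n:
        return -1
    x = random.randrange(1, P)          # random point at which to evaluate the polynomials
    xm = pow(x, m - 1, P)               # x^(m-1) mod p, used to drop the leading character

    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * x + ord(ch)) % P
        return h

    target = poly_hash(pattern)
    h = poly_hash(text[:m])
    for i in range(n - m + 1):
        if h == target and text[i:i + m] == pattern:   # verify to rule out a collision
            return i
        if i + m < n:                                   # roll the hash one position right
            h = ((h - ord(text[i]) * xm) * x + ord(text[i + m])) % P
    return -1

print(rabin_karp("needle", "finding a needle in a haystack"))   # 10
```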
Sedgewick wanted to follow the original a little more closely than that, however, which in essence evaluates the polynomials at a fixed value of x (literally x in the original version that uses polynomial rings). He needs a random prime so that the oblivious adversary can't engineer collisions. Sieving numbers big enough is quite inefficient, so he turns to the Prime Number Theorem (which is the math behind his hint, but it holds only asymptotically, which makes a big mess theoretically) and a fast primality test (which can be probabilistic; the cases where it fails won't influence the correctness of the algorithm, and they are rare enough that they won't affect the expected running time).
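If you do want a random prime rather than a random evaluation point, the recipe he alludes to looks roughly like the following Python sketch (names are mine; Miller-Rabin as the fast probabilistic test). By the Prime Number Theorem, roughly one in ln(2^31) ≈ 21 numbers near 2^31 is prime, so only a few dozen draws are needed on average:

```python
import random

def is_probable_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False          # definitely composite
    return True                   # prime with overwhelming probability

def random_prime(bits=31):
    """Draw random odd candidates with the top bit set until one passes the test."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate

print(random_prime())             # a random 31-bit prime, different on every run
```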
I'm not sure how he proves a formal bound on the collision probability. My rough idea is basically, show that there are enough primes in the window of interest, use the Chinese Remainder Theorem to show that it's impossible for there to be a collision for too many primes at once, conclude that the collision probability is bounded by the probability of picking a bad prime, which is low. But the Prime Number Theorem holds only asymptotically, so we have to rely on computer experiments regarding the density of primes in machine word ranges. Not great.

Naive Suffix Array time complexity

I'm trying to invent a programming exercise on suffix arrays. I learned the O(n*log(n)^2) algorithm for constructing them and then started playing with random input strings of varying length in order to find out when the naive approach becomes too slow, i.e. I wanted to choose the string length so that people would need to implement the "advanced" algorithm.
To my surprise, I found that the naive algorithm (sorting all suffixes with a standard comparison sort) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I understood that comparing suffixes of a randomly generated string does not cost O(n) on average. Usually we only compare the first few characters before hitting a difference and returning from the comparison function. This of course depends on the size of the alphabet, but it does not depend much on the length of the suffixes.
I tried a simple implementation in PHP that processes a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it really behaved like O(n^2), we would expect it to take at least several minutes (with 1e7 operations per second and ~1e9 operations in total).
So I understand that even if it is O(n^2 * log(n)), the constant factor must be a very small fraction of 1, really something close to 0. Or should we treat such a complexity as worst-case only?
But what is the amortized time complexity of the naive approach? I'm a bit bewildered about how to assess it.
You seem to be confusing amortized and expected complexity. In this case you are talking about expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; for randomly generated input you will only perform a constant number of character comparisons in most cases. Thus O(n^2*log(n)) is a worst-case complexity.
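For illustration, here is a minimal Python sketch of the naive construction with the early-exit comparison discussed above (names are mine). On random strings the comparison usually stops after a few characters, which is why the observed behaviour is far from the O(n^2*log(n)) bound:

```python
from functools import cmp_to_key

def naive_suffix_array(s):
    """Sort suffix start positions with a character-by-character comparison
    that stops at the first mismatch."""
    def compare(i, j):
        while i < len(s) and j < len(s):
            if s[i] != s[j]:
                return -1 if s[i] < s[j] else 1
            i += 1
            j += 1
        # The exhausted (shorter) suffix is a prefix of the other and sorts first.
        return (len(s) - i) - (len(s) - j)
    return sorted(range(len(s)), key=cmp_to_key(compare))

print(naive_suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
```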
One more note: on a modern computer you can perform a few billion elementary instructions per second, so it is quite possible to execute on the order of 50000^2 operations in 2 seconds. The correct way to benchmark the complexity of an algorithm is to measure the time it takes to complete for inputs of size N, N*2, N*4, ... (as far as you can go) and then fit a function that describes the computational complexity, as in the sketch below.
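A possible shape for such a doubling benchmark, in Python (the naive stand-in and the input sizes are arbitrary choices of mine):

```python
import random
import time

def naive_sort_suffixes(s):
    """Simple stand-in for the naive construction: sort suffix start positions."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def doubling_benchmark(fn, start=1000, doublings=4):
    """Time fn on inputs of size N, 2N, 4N, ... and print the ratio between steps:
    ~2x per doubling suggests near-linear growth, ~4x suggests quadratic, and so on."""
    prev, n = None, start
    for _ in range(doublings):
        s = "".join(random.choice("ACGT") for _ in range(n))
        t0 = time.perf_counter()
        fn(s)
        dt = time.perf_counter() - t0
        ratio = "" if prev is None else f"   ratio vs previous size: {dt / prev:.2f}"
        print(f"n = {n:6d}   time = {dt:.4f}s{ratio}")
        prev, n = dt, 2 * n

doubling_benchmark(naive_sort_suffixes)
```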

k-Nearest Neighbour Algorithm in verilog

I'm planning to do a Verilog implementation of KNN. The problem is the Euclidean distance measurement associated with KNN, since it needs subtraction, squaring and adding. I think the code will become complex when I code KNN with the Euclidean distance. Is there any simple (hardware-friendly) method to find the distance, so that the complexity of the code, and hence of the synthesized circuit, is reduced? My idea is to store the codebook in memory and, when an input is given, output the indices of the k nearest neighbours.
Finding the k-Nearest Neighbors involves two parts: 1) Calculate the distance between your input vector and every reference vector and 2) Find the k smallest distances.
For part 1), you can design a pipelined Euclidean distance function that consists of a subtractor, multiplier, and accumulator. Subtraction and accumulation (addition) need a relatively short clock period compared to multiplication, though depending on the bit width it may be worthwhile to pipeline those as well. A single-cycle multiplier would require a prohibitively long clock period, so it will certainly have to be pipelined.
Here I've assumed you're working with integers; if you have to work with floating point you're in for much more work, since floating-point multiply and add units are considerably more complex to design and pipeline.
For part 2), you have to compare all of the distances to find the k smallest. This can be done in several ways; one possibility is a tree of comparators that finds the single smallest distance. Once that is found, you remove that distance from the set of distances and repeat k times.
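As a behavioral reference to check a Verilog implementation against, here is a small integer-only Python sketch of both parts (the names and the example codebook are mine). It uses the squared Euclidean distance, so no square root is needed and the nearest-neighbour ranking is unchanged:

```python
def squared_distance(a, b):
    # Part 1: subtract, square (multiply), accumulate -- the same operations a
    # pipelined datapath would perform, one dimension at a time.
    return sum((x - y) * (x - y) for x, y in zip(a, b))

def k_nearest(query, codebook, k):
    # Part 2: repeatedly extract the index of the smallest distance, k times,
    # mirroring the "comparator tree, then remove the winner" approach.
    distances = [squared_distance(query, ref) for ref in codebook]
    nearest = []
    for _ in range(k):
        best = min(range(len(distances)), key=lambda i: distances[i])
        nearest.append(best)
        distances[best] = float("inf")   # remove the winner from further consideration
    return nearest

codebook = [(0, 0), (3, 4), (1, 1), (10, 10)]
print(k_nearest((1, 2), codebook, k=2))   # [2, 0]
```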
Notice that for part 1 you're basically re-implementing a CPU/GPU functional unit, and the CPU/GPU version is almost certainly going to be faster than your Verilog implementation. The biggest improvement over a CPU/GPU is to be had in part 2), finding the k minimum distances.

Why is naive string search algorithm faster?

I'm testing string search algorithms from this site: EXACT STRING MATCHING ALGORITHMS. Christian Charras, Thierry Lecroq. The test text is a random sequence of DNA bases (ACGT) of 1 GByte size. Test patterns are a list of random sequences of random size (1 kB max). The test system is an AMD Phenom II x4 955 at 3.2 GHz, 4 GB of RAM and Windows 7 64 bits. Code written in C and compiled with MinGW with the -O3 flag.
The naive search algorithm takes 4 seconds for short patterns to 8 seconds for 1 kB patterns. The deterministic finite state machine takes 2 seconds for short patterns to 4 seconds for 1 kB patterns. The Boyer-Moore algorithm takes 4 seconds for very short patterns, about 1/2 second for short patterns and 2 seconds for 1 kB patterns. The remaining algorithms perform worse than the naive search algorithm.
How can the naive search algorithm be faster than most other algorithms?
How can a deterministic finite state machine implemented with a transition table (always O(n) execution time) be 2 to 8 times slower than the Boyer-Moore algorithm? Yes, BM's best case is O(n/m), but its average case is O(n) and its worst case is O(nm).
There is no perfect string matching algorithm which is best for all circumstances.
Boyer-Moore (and Horspool, Sunday, etc.) work by creating jump tables ("How far can I move the search pointer when the characters do not match?"). The more distinct letters in the strings, the greater the benefit. You can imagine that a string with only 4 distinct letters produces a jump table allowing only a few shifts per mismatch, whereas searching case-sensitive English text (A-Z + a-z + punctuation) may result in a jump table allowing up to roughly 55 shifts per mismatch.
On the other hand, there is a negative impact on both the preparation (i.e. calculating the jump tables) and the loop itself. So these algorithms perform poorly on short strings (the preparation creates an overhead) and on strings with only a few distinct letters (as mentioned before).
The naive search algorithm is very compact and there are very few operations inside the loop, so the loop runs fast. As there is no setup overhead, it performs better when searching short strings.
The loop operations of a BM algorithm are quite complex compared to the naive search and take much longer per iteration. This (partly) offsets the positive performance impact of the jump tables.
So although you are using long strings, the small alphabet (= small jumps) makes BM perform poorly. KMP has less overhead in the loop (its jump table is smaller in general, but with small alphabets BM's is similarly limited), which is why KMP performs so well.
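To see the alphabet effect numerically, here is a small Python sketch (the helper and the example patterns are my own illustration) that builds a Horspool-style bad-character shift table and compares the shifts a DNA pattern allows with those of an English word:

```python
def bad_character_shifts(pattern, alphabet):
    """Horspool-style bad-character table: how far the window may slide on a mismatch."""
    m = len(pattern)
    shift = {c: m for c in alphabet}     # characters absent from the pattern allow a full shift
    for i, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - i             # distance from the last occurrence to the pattern end
    return shift

dna = bad_character_shifts("ACGTACGTACGTAGGT", "ACGT")
eng = bad_character_shifts("Blueberry",
                           "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(sorted(dna.values()))              # only small shifts are ever possible
print(sum(eng.values()) / len(eng))      # average shift is close to the pattern length
```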
Theoretically good algorithms (lower time complexity) often have high bookkeeping costs that can overwhelm that of a naive algorithm for small problem sizes. Also implementation details matter. By optimizing an implementation you can sometimes improve runtime by factors of 2 or more.
The naive implementation actually has a linear expected running time (same as BM/KMP, etc.) on random input data. I cannot write a full proof here, but it can be found in Algorithms Design Techniques and Analysis.
Most exact matching algorithms are optimized versions of the naive approach designed to avoid being slowed down by certain patterns. For instance, suppose we are searching for:
aaaaaaaaaaaaaaaaaaaaaaaab
on a stream of:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
It fails at the b many times. KMP/BM implementations are designed to avoid repeatedly comparing the a's. However, if the sequence itself is random, such situations almost never appear, and the naive implementation is likely to work better thanks to its lower bookkeeping overhead and possibly better spatial/temporal locality.
And, yeah, I'm not sure whether DNA sequences are random, or whether repetitions are common in them. Anyway, there's no way to examine this carefully without representative data.
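As a quick illustration of that point, here is a small Python experiment (pattern and sizes are arbitrary choices of mine) that counts the character comparisons the naive search performs on random DNA text versus the adversarial all-'a' input:

```python
import random

def naive_search_comparisons(pattern, text):
    """Run the naive search and count every character comparison it performs."""
    m, n, comparisons = len(pattern), len(text), 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons

random_text = "".join(random.choice("ACGT") for _ in range(100_000))
print(naive_search_comparisons("ACGTACGTACGTACGTACGTACGT", random_text))  # ~1.3x the text length
print(naive_search_comparisons("a" * 23 + "b", "a" * 100_000))            # ~24x the text length
```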

Numerical Integration

Generally speaking, when you are numerically evaluating an integral, say in MATLAB, do I just pick a large number for the bounds or is there a way to tell MATLAB to "take the limit"?
I am assuming that you just use a large number because different machines are able to handle numbers of different magnitudes.
I am just wondering if there is a way to improve my code. I am doing lots of expected-value calculations via Monte Carlo and often use the trapezoid method to check myself when my degrees of freedom are small enough.
Strictly speaking, it's impossible to evaluate a numerical integral out to infinity. In most cases, if the integral in question is finite, you can simply integrate over a reasonably large range. For an integrand that decays like a normal distribution, for example, truncating the range at about 10 sigma gives a value that is, for better or worse, as close as you are going to get to evaluating the same integral all the way out to infinity.
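A minimal Python/NumPy sketch of that truncation idea, assuming the integrand is a standard normal density (the tail beyond +/-10 sigma is around 1e-23, far below double precision):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)           # truncate the infinite range at +/- 10 sigma
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
print(np.trapz(pdf, x))                       # ~1.0, i.e. the exact integral over (-inf, inf)
```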
It depends very much on what type of function you want to integrate. If it is "smooth" (no jumps - preferably not in any derivatives either, but that becomes progressively less important) and finite, then you have two main choices (limiting myself to the simplest approaches):
1. If it is periodic - here meaning: if you put the left and right ends together, there are no jumps in value (or derivatives...) there either - distribute your points evenly over the interval, sample the function values to get the estimated average, and then multiply by the length of the interval to get your integral.
2. If it is not periodic: use Gauss-Legendre integration.
Monte Carlo is almost invariably a poor method: it progresses very slowly towards (machine) precision: for each additional significant digit you need 100 times more points!
The two methods above, for periodic and non-periodic "nice" (smooth, etc.) functions, give fair results already with a very small number of sample points and then progress very rapidly towards higher precision: 1 or 2 extra points usually add several digits to your precision! This far outweighs the burden of having to throw away the previous result when you make a new attempt with more sample points: you REPLACE the previous set of points with a fresh one, whereas in Monte Carlo you can simply add points to the existing set and so refine the outcome.
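To illustrate the difference in convergence rates, here is a small Python/NumPy sketch on a smooth, non-periodic test integral, the integral of exp(x) over [-1, 1], whose exact value is e - 1/e (the test function and sample counts are my own choices):

```python
import numpy as np

exact = np.e - 1.0 / np.e

# Gauss-Legendre: a handful of nodes already gives many correct digits.
for n in (2, 4, 6, 8):
    nodes, weights = np.polynomial.legendre.leggauss(n)
    approx = np.sum(weights * np.exp(nodes))
    print(f"Gauss-Legendre, {n} nodes:    error = {abs(approx - exact):.2e}")

# Monte Carlo: roughly 100x more samples per extra digit of precision.
rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    samples = rng.uniform(-1.0, 1.0, n)
    approx = 2.0 * np.mean(np.exp(samples))   # interval length times the sampled mean
    print(f"Monte Carlo, {n:7d} samples: error = {abs(approx - exact):.2e}")
```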

Resources