Naive Suffix Array time complexity - string

I'm trying to devise a programming exercise on suffix arrays. I learned the O(n*log(n)^2) algorithm for constructing one and then started playing with random input strings of varying length to find out when the naive approach becomes too slow. That is, I wanted to choose a string length that forces people to implement the "advanced" algorithm.
To my surprise, the naive algorithm (running a logarithmic comparison sort on all suffixes) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I realized that comparing suffixes of a randomly generated string is not O(n) amortized. We usually compare only the first few characters before we reach a difference, and at that point the comparison function returns. This of course depends on the size of the alphabet, but it hardly depends on the length of the suffixes.
I tried a simple implementation in PHP that processes a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it were at least O(n^2), we would expect it to take at least several minutes (at 1e7 operations per second and ~1e9 operations total).
So I understand that even if it is O(n^2 * log(n)), the constant factor is a very small fraction of 1, really something close to 0. Or should we describe such complexity as worst-case only?
But what is the amortized time complexity of the naive approach? I'm a bit bewildered about how to assess it.

You seem to be confusing amortized and expected complexity; here you are talking about expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; for randomly generated input, most comparisons inspect only a constant number of characters. Thus O(n^2*log(n)) is the worst-case complexity.
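To make the gap between expected and worst case concrete, here is a minimal sketch in Python (not the asker's PHP code) of a naive suffix sort that counts character comparisons. The function name `naive_suffix_array` and the counting scheme are assumptions made for illustration only.

```python
import random
from functools import cmp_to_key


def naive_suffix_array(s):
    """Sort suffix start positions with a character-by-character comparator.

    Returns (suffix_array, number_of_character_comparisons).
    """
    n = len(s)
    char_comparisons = 0

    def compare(i, j):
        nonlocal char_comparisons
        # Walk both suffixes until they differ or one of them ends.
        while i < n and j < n:
            char_comparisons += 1
            if s[i] != s[j]:
                return -1 if s[i] < s[j] else 1
            i += 1
            j += 1
        # One suffix is a prefix of the other: the one that ended sorts first.
        return (j == n) - (i == n)

    order = sorted(range(n), key=cmp_to_key(compare))
    return order, char_comparisons


if __name__ == "__main__":
    n = 2000
    random_text = "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(n))
    _, random_cost = naive_suffix_array(random_text)
    _, worst_cost = naive_suffix_array("a" * n)
    print("random text:", random_cost, "character comparisons")
    print("all-equal text:", worst_cost, "character comparisons")
```

On a random string each comparison stops after a few characters, while on a string of identical characters every comparison runs to the end of the shorter suffix.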
One more note: on a modern computer you can perform a few billion elementary instructions per second, so it is possible to execute on the order of 50000^2 operations in 2 seconds. The correct way to benchmark the complexity of an algorithm is to measure the time it takes for inputs of size N, N*2, N*4, ... (as far as you can go) and then fit a function that describes how the running time grows.
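The doubling experiment can be sketched as follows. This is one possible way to do the fit, assuming the running time behaves roughly like c * N^k, so that k is the slope of log(time) against log(N). The helper names `time_doubling` and `growth_exponent` are hypothetical.

```python
import math
import time


def growth_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def time_doubling(run, make_input, start_size, rounds):
    """Time run(make_input(n)) for n = start_size, 2*start_size, 4*start_size, ..."""
    sizes, times = [], []
    n = start_size
    for _ in range(rounds):
        data = make_input(n)
        started = time.perf_counter()
        run(data)
        times.append(time.perf_counter() - started)
        sizes.append(n)
        n *= 2
    return sizes, times
```

A slope near 1 suggests roughly linear (or n*log(n)) growth and a slope near 2 suggests quadratic growth. Single timings are noisy, so in practice you would repeat each measurement and use sizes large enough that each run takes a measurable amount of time.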

Related

Random primes and Rabin Karp substring search

I am reading about the Rabin-Karp algorithm in Sedgewick. The book says:
We use a random prime Q taking as large a value as possible while
avoiding overflow
At first reading I didn't notice the significance of random and when I saw that in the code a long is used my first thoughts were:
a) Use the sieve of Eratosthenes to find a big prime that fits in a long
or
b) look up from a list of primes any prime large enough that is greater than int and use it as a constant.
But then the rest of the explanation says:
We will use a long value greater than 10^20 making the probability
that a collision happens less than 10^-20
This part confused me, since a long cannot hold 10^20, let alone a value greater than that.
Then when I checked the calculation for the prime the book defers to an exercise that has just the following hint:
A random n-digit number is prime with probability proportional to 1/n
What does that mean?
So basically what I don't get is:
a) what is the meaning of using a random prime? Why can't we just pre-calculate it and use it as a constant?
b) why is the 10^20 mentioned since it is out of range for long?
c) How is that hint helpful? What does it mean exactly?
Once again, Sedgewick has tried to simplify an algorithm and gotten the details slightly wrong. First, as you observe, 10^20 cannot be represented in 64 bits. Even taking a prime close to 2^63 − 1, however, you probably would want a bit of room to multiply the normal way without overflowing so that the subsequent modulo is correct. The answer uses a 31-bit prime, which makes this easy but only offers collision probabilities in the 10^−9 range.
The original version uses Rabin fingerprints and a random irreducible polynomial over 𝔽_2[x], which from the perspective of algebraic number theory behaves a lot like a random prime over the integers. If we choose the polynomial to be degree 32 or 64, then the fingerprints fit perfectly into a computer word of the appropriate length, and polynomial addition and subtraction both work out to bitwise XOR, so there is no overflow.
Now, Sedgewick presumably didn't want to explain how polynomial rings work. Fine. If I had to implement this approach in practice, I'd choose a prime p close to the max that was easy to mod by with cheap instructions (I'm partial to 2^31 − 2^27 + 1; EDIT actually 2^31 − 1 works even better since we don't need a smooth prime here) and then choose a random number in [1, p−1] to evaluate the polynomials at (this is how Wikipedia explains it). The reason that we need some randomness is that otherwise the oblivious adversary could choose an input that would be guaranteed to have a lot of hash collisions, which would severely degrade the running time.
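A minimal sketch of that variant (fixed prime p = 2^31 − 1, random evaluation point in [1, p−1]) might look like this in Python. It is an illustration, not Sedgewick's code; the function name `rabin_karp` and the optional `base` parameter are assumptions. Python integers do not overflow, so the headroom concern above only matters in fixed-width languages.

```python
import random

MERSENNE_PRIME = 2**31 - 1  # the prime p suggested above


def rabin_karp(text, pattern, base=None):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    p = MERSENNE_PRIME
    if base is None:
        # Random evaluation point in [1, p-1], so an adversary who does not
        # know it cannot precompute colliding inputs.
        base = random.randrange(1, p)
    high = pow(base, m - 1, p)  # weight of the leading character in a window
    pattern_hash = text_hash = 0
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % p
        text_hash = (text_hash * base + ord(text[k])) % p
    for i in range(n - m + 1):
        # Verify on a hash match so that collisions never cause a wrong answer.
        if text_hash == pattern_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            text_hash = ((text_hash - ord(text[i]) * high) * base + ord(text[i + m])) % p
    return -1
```

Because every hash match is verified, a collision costs time but not correctness; the randomness only protects the expected running time.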
Sedgewick wanted to follow the original a little more closely than that, however, which in essence evaluates the polynomials at a fixed value of x (literally x in the original version that uses polynomial rings). He needs a random prime so that the oblivious adversary can't engineer collisions. Sieving numbers big enough is quite inefficient, so he turns to the Prime Number Theorem (which is the math behind his hint, but it holds only asymptotically, which makes a big mess theoretically) and a fast primality test (which can be probabilistic; the cases where it fails won't influence the correctness of the algorithm, and they are rare enough that they won't affect the expected running time).
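The "guess a random candidate, then test it" idea can be sketched as below. This is an assumption about how one might do it, not the book's solution. Since a random n-digit number is prime with probability proportional to 1/n, the loop is expected to need only a number of tries proportional to the number of digits. The names `is_prime` and `random_prime` are hypothetical; the Miller-Rabin test here uses the first twelve primes as bases, which is known to be sufficient for all 64-bit inputs.

```python
import random

_BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_prime(n):
    """Miller-Rabin with fixed bases; deterministic for 64-bit inputs."""
    if n < 2:
        return False
    for p in _BASES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in _BASES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def random_prime(bits=31, rng=random):
    """Draw random odd numbers with the top bit set until one is prime."""
    while True:
        candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_prime(candidate):
            return candidate
```

For example, `random_prime(31)` returns a prime in [2^30, 2^31), which leaves room to multiply two residues inside a 64-bit word.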
I'm not sure how he proves a formal bound on the collision probability. My rough idea is basically, show that there are enough primes in the window of interest, use the Chinese Remainder Theorem to show that it's impossible for there to be a collision for too many primes at once, conclude that the collision probability is bounded by the probability of picking a bad prime, which is low. But the Prime Number Theorem holds only asymptotically, so we have to rely on computer experiments regarding the density of primes in machine word ranges. Not great.

Longest Common Subsequence between very large strings

I am trying to solve the longest common subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between 2 strings.
This is a well-known dynamic programming problem. However, in my case the strings are too large. When I tried to use a 2D matrix for memoization, I ran out of memory.
One solution could be to use a sparse matrix instead, but I am a little concerned about the performance overhead of that.
Also I want to perform this algorithm across multiple strings. And it will be okay to provide approximate answer since I am only trying to measure the overlap between 2 strings.
EDIT: After some investigation I found the following alternatives
Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce memory usage, you don't need to store the entire 2D table. It is enough to keep the previous row and the current row, which brings memory consumption down to O(N); if you need a running maximum, keep it in a separate variable. Time complexity remains O(N^2). Note that two rows give you the length of the LCS, not the subsequence itself.
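The two-row idea can be sketched as follows. This is a minimal illustration that computes only the LCS length; the function name `lcs_length` is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence using two DP rows."""
    if len(b) > len(a):
        a, b = b, a  # iterate over the longer string, keep rows as short as possible
    prev = [0] * (len(b) + 1)  # row for the prefix of a processed so far
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Memory is proportional to the shorter string, while the running time is still proportional to the product of the two lengths.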

How does the Needleman Wunsch algorithm compare to brute force?

I'm wondering how you can quantify the results of the Needleman-Wunsch algorithm (typically used for aligning nucleotide/protein sequences).
Consider some fixed scoring scheme and two sequences of varying length S1 and S2. Say we calculate every possible alignment of S1 and S2 by brute force, and the highest scoring alignment has a score x. And of course, this has considerably higher complexity than the Needleman-Wunsch approach.
When using the Needleman-Wunsch algorithm to find a sequence alignment, say that it has a score y.
Consider r to be the score generated via Needleman-Wunsch for two random sequences R1 and R2.
How does x compare to y? Is y always greater than r for two sequences of known homology?
In general, I do understand that we use the Needleman-Wunsch algorithm to significantly speed up sequence alignment (vs a brute-force approach), but don't understand the cost in accuracy (if any) that comes with it. I had a go at reading the original paper (Needleman & Wunsch, 1970) but am still left with this question.
Needleman-Wunsch always produces an optimal answer: it's much faster than brute force and doesn't sacrifice accuracy in the process. The key insight it uses is that it's not actually necessary to generate all possible alignments, since most of them contain bad sub-alignments and couldn't possibly be optimal. The Needleman-Wunsch algorithm instead builds up optimal alignments for fragments of the original strands and then gradually grows those smaller alignments into larger ones, using the guarantee that any optimal alignment must contain an optimal alignment for a slightly smaller case.
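As a small sanity check of that claim, the sketch below computes the Needleman-Wunsch score with dynamic programming and compares it with an exhaustive recursion over all alignments. The scoring scheme (match +1, mismatch −1, linear gap −1) and both function names are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, filling the DP table row by row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def brute_force_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all alignments, explored without memoization."""
    if not a:
        return len(b) * gap
    if not b:
        return len(a) * gap
    pair = match if a[0] == b[0] else mismatch
    return max(
        pair + brute_force_score(a[1:], b[1:], match, mismatch, gap),
        gap + brute_force_score(a[1:], b, match, mismatch, gap),
        gap + brute_force_score(a, b[1:], match, mismatch, gap),
    )
```

For short inputs such as `"GATTACA"` and `"GCATGCG"` the two functions return the same score; the brute-force version simply becomes infeasible as the sequences grow.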
I think your question boils down to whether dynamic programming finds the optimal solution, i.e. guarantees that y >= x. For a discussion of this I would refer to people who are likely smarter than me:
https://cs.stackexchange.com/questions/23599/how-is-dynamic-programming-different-from-brute-force
Basically, it says that dynamic programming produces the optimal result, i.e. the same as brute force, but only for problems that satisfy the Bellman principle of optimality.
According to the Wikipedia page for Needleman-Wunsch, the problem does satisfy the Bellman principle of optimality:
https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
Specifically:
The Needleman–Wunsch algorithm is still widely used for optimal global
alignment, particularly when the quality of the global alignment is of
the utmost importance. However, the algorithm is expensive with
respect to time and space, proportional to the product of the length
of two sequences and hence is not suitable for long sequences.
There is also mention of optimality elsewhere in the same Wikipedia page.

Why is naive string search algorithm faster?

I'm testing string search algorithms from this site: EXACT STRING MATCHING ALGORITHMS. Christian Charras, Thierry Lecroq. The test text is a random sequence of DNA bases (ACGT) of 1 GByte size. The test patterns are a list of random sequences of random size (1 kB max). The test system is an AMD Phenom II x4 955 at 3.2 GHz with 4 GB of RAM and Windows 7 64-bit. The code is written in C and compiled with MinGW with the -O3 flag.
The naive search algorithm takes 4 seconds for short patterns to 8 seconds for 1 kB patterns. The deterministic finite state machine takes 2 seconds for short patterns to 4 seconds for 1 kB patterns. The Boyer-Moore algorithm takes 4 seconds for very short patterns, about 1/2 second for short patterns and 2 seconds for 1 kB patterns. The remaining algorithms perform worse than the naive search algorithm.
How can the naive search algorithm be faster than most other algorithms?
How can a deterministic finite state machine implemented with a transition table (always O(n) execution time) be 2 to 8 times slower than the Boyer-Moore algorithm? Yes, BM's best case is O(n/m), but its average case is O(n) and its worst case is O(nm).
There is no perfect string matching algorithm which is best for all circumstances.
Boyer-Moore (and Horspool, Sunday etc.) work by creating jump tables ("How far can I move the search pointer when the characters do not match?"). The more distinct letters in the strings, the better the positive impact. You can imagine that a text with only 4 distinct letters yields a jump table with only 4 entries, so most characters also occur near the end of the pattern and the shifts stay small. Searching for a case-sensitive English word, on the other hand, may give a jump table with roughly 55 entries (A-Z + a-z + punctuation), and many mismatches allow a long shift.
On the other hand, there is a negative impact on both the preparation (i.e. calculating the jump tables) and the loop itself. So these algorithms perform poorly on short strings (preparation creates overhead) and on strings with only a few distinct letters (as mentioned before).
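To illustrate the jump-table idea, here is a minimal sketch of the Horspool variant (a simplification of Boyer-Moore, not the exact code from the Charras and Lecroq site). The function names are assumptions.

```python
def horspool_shift_table(pattern):
    """Shift for each character: distance from its last occurrence
    (excluding the final position) to the end of the pattern."""
    m = len(pattern)
    # Later occurrences overwrite earlier ones, so each character keeps
    # the smallest (safe) shift.
    return {ch: m - 1 - k for k, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    shift = horspool_shift_table(pattern)
    i = 0
    while i + m <= n:
        if text[i:i + m] == pattern:
            return i
        # Shift by the table entry for the text character aligned with the
        # last pattern position; characters absent from the table shift by m.
        i += shift.get(text[i + m - 1], m)
    return -1
```

For the DNA pattern `"GCAGAGAG"` the table is `{"G": 2, "C": 6, "A": 1}`: with only four letters, almost every text character has a small entry, so the full shift of 8 is rarely taken.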
The naive search algorithm is very compact and there are very few operations inside the loop, so the loop runs fast. As there is no preparation overhead, it performs better when searching for short strings.
The loop operations of a BM algorithm are quite complex compared to the naive search and take much longer per iteration. This (partly) cancels out the positive performance impact of the jump tables.
So although you are using long strings, the small alphabet (= small jump tables) makes BM perform poorly. KMP has less overhead in the loop (its jump table is generally smaller, but similar to BM's for small alphabets), which is why KMP performs so well.
Theoretically good algorithms (lower time complexity) often have high bookkeeping costs that can overwhelm that of a naive algorithm for small problem sizes. Also implementation details matter. By optimizing an implementation you can sometimes improve runtime by factors of 2 or more.
The naive implementation actually has a linear expected running time (same as BM/KMP, etc.) for random input data. I cannot write a full proof here, but one can be found in "Algorithms: Design Techniques and Analysis".
Most exact matching algorithms are optimized versions of the naive implementation, designed so that certain patterns cannot slow them down. For instance, suppose we are searching for:
aaaaaaaaaaaaaaaaaaaaaaaab
on a stream of:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
The naive search fails at the b over and over again. KMP/BM implementations are designed to avoid repeatedly comparing the a's. However, if the sequence is truly random, such conditions are extremely unlikely to appear, and the naive implementation is likely to work better due to its lower bookkeeping overhead or possibly better spatial/temporal locality.
And, yeah, I'm not sure DNA sequences are random, or whether repetitions are common in them. Anyway, there's no way to examine this carefully without representative data.
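The difference between the two situations can be seen by counting character comparisons in a naive search. The following is a minimal sketch; the function name `naive_search` and the choice to return the comparison count are assumptions made for illustration.

```python
import random


def naive_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1
            if text[i + k] != pattern[k]:
                break  # first mismatch ends the attempt at this position
            k += 1
        if k == m:
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    # Pathological case from above: 24 a's followed by b, searched in 94 a's plus b.
    print(naive_search("a" * 94 + "b", "a" * 24 + "b"))
    # Random DNA: most attempts stop after one or two comparisons.
    dna = "".join(random.choice("ACGT") for _ in range(100000))
    probe = "".join(random.choice("ACGT") for _ in range(25))
    print(naive_search(dna, probe))
```

In the pathological case every attempt compares the whole pattern before failing, whereas on random DNA the comparison count stays close to the length of the text.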

Numerical Integration

Generally speaking, when you are numerically evaluating an integral, say in MATLAB, do I just pick a large number for the bounds, or is there a way to tell MATLAB to "take the limit"?
I am assuming that you just use the large number because different machines would be able to handle numbers of different magnitudes.
I am just wondering if there is a way to improve my code. I am doing lots of expected value calculations via Monte Carlo and often use the trapezoid method to check myself when my degrees of freedom are small enough.
Strictly speaking, it's impossible to numerically evaluate an integral all the way out to infinity. In most cases, if the integral in question is finite, you can simply integrate over a reasonably large range. For the normal error function, for example, going out to about 10 sigma is enough to converge to a stable value; for better or worse, that is as close as you are going to get to evaluating the same integral all the way out to infinity.
It depends very much on what type of function you want to integrate. If it is "smooth" (no jumps, preferably not in any derivatives either, but that becomes progressively less important) and finite, then you have two main choices (limiting myself to the simplest approach):
1. If it is periodic, meaning here that you could join the left and right ends of the interval without a jump in the value (or in the derivatives): distribute your points evenly over the interval, sample the function values to get the estimated average, and then multiply by the length of the interval to get your integral.
2. If it is not periodic: use Legendre integration (Gauss-Legendre quadrature).
Monte Carlo is almost invariably a poor method: it converges very slowly towards (machine) precision; for each additional significant digit you need 100 times more points!
The two methods above, for periodic and non-periodic "nice" (smooth etcetera) functions, give fair results already with a very small number of sample points and then converge very rapidly: 1 or 2 more points usually add several digits to your precision! This far outweighs the burden of having to throw away the previous result when you want to try again with more sample points: you REPLACE the previous set of points with a fresh one, while in Monte Carlo you can simply add points to the existing set and so refine the outcome.
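The periodic case can be tried out with a few lines of Python (rather than MATLAB). The sketch below is an illustration under stated assumptions: the test integrand exp(cos(x)) over [0, 2*pi] is chosen here only because it is smooth and periodic, and the function names are hypothetical. A plain Monte Carlo estimate is included for comparison.

```python
import math
import random


def periodic_equal_weight_rule(f, a, b, n):
    """Average of n evenly spaced samples times the interval length.

    For a smooth function that is periodic on [a, b] this is the
    trapezoid rule, since the two end points coincide.
    """
    h = (b - a) / n
    return h * sum(f(a + k * h) for k in range(n))


def monte_carlo_integral(f, a, b, n, rng=random):
    """Average of n uniformly random samples times the interval length."""
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(n)) / n


if __name__ == "__main__":
    f = lambda x: math.exp(math.cos(x))
    for n in (4, 8, 16, 32):
        print(n, periodic_equal_weight_rule(f, 0.0, 2.0 * math.pi, n))
    for n in (100, 10000, 1000000):
        print(n, monte_carlo_integral(f, 0.0, 2.0 * math.pi, n))
```

The evenly spaced estimates stop changing after a handful of points, while the Monte Carlo estimates improve by roughly one digit per hundredfold increase in samples.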

Resources