I am reading the Rabin-Karp algorithm from Sedgewick. The book says:
We use a random prime Q taking as large a value as possible while
avoiding overflow
At first reading I didn't notice the significance of "random", and when I saw that the code uses a long, my first thoughts were:
a) Use the sieve of Eratosthenes to find a big prime that fits in a long
or
b) look up a list of primes, pick any prime large enough (greater than the int range), and use it as a constant.
But then the rest of the explanation says:
We will use a long value greater than 10^20 making the probability
that a collision happens less than 10^-20
This part got me confused since a long cannot fit 10^20, let alone a value greater than that.
Then, when I checked the calculation for the prime, the book defers to an exercise that has just the following hint:
A random n-digit number is prime with probability proportional to 1/n
What does that mean?
So basically what I don't get is:
a) what is the meaning of using a random prime? Why can't we just pre-calculate it and use it as a constant?
b) why is the 10^20 mentioned since it is out of range for long?
c) How is that hint helpful? What does it mean exactly?
Once again, Sedgewick has tried to simplify an algorithm and gotten the details slightly wrong. First, as you observe, 10^20 cannot be represented in 64 bits. Even taking a prime close to 2^63 − 1, however, you probably would want a bit of room to multiply the normal way without overflowing so that the subsequent modulo is correct. The answer uses a 31-bit prime, which makes this easy but only offers collision probabilities in the 10^−9 range.
The original version uses Rabin fingerprints and a random irreducible polynomial over 𝔽_2[x], which from the perspective of algebraic number theory behaves a lot like a random prime over the integers. If we choose the polynomial to be degree 32 or 64, then the fingerprints fit perfectly into a computer word of the appropriate length, and polynomial addition and subtraction both work out to bitwise XOR, so there is no overflow.
Now, Sedgewick presumably didn't want to explain how polynomial rings work. Fine. If I had to implement this approach in practice, I'd choose a prime p close to the max that was easy to mod by with cheap instructions (I'm partial to 2^31 − 2^27 + 1; EDIT: actually 2^31 − 1 works even better since we don't need a smooth prime here) and then choose a random number in [1, p−1] to evaluate the polynomials at (this is how Wikipedia explains it). The reason that we need some randomness is that otherwise the oblivious adversary could choose an input that would be guaranteed to have a lot of hash collisions, which would severely degrade the running time.
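To make that concrete, here is a minimal Python sketch of the variant described in this paragraph (the names poly_hash and rabin_karp are mine, not Sedgewick's): fix the prime P = 2^31 − 1, pick a random base X in [1, P−1] at which to evaluate the string's "polynomial", and keep an explicit substring check to rule out the rare collision.

```python
import random

P = (1 << 31) - 1            # fixed Mersenne prime 2^31 - 1, as suggested above
X = random.randrange(1, P)   # random evaluation point: this is where the randomness lives

def poly_hash(s):
    # Horner's rule: evaluate s[0]*X^(m-1) + ... + s[m-1] modulo P.
    h = 0
    for ch in s:
        h = (h * X + ord(ch)) % P
    return h

def rabin_karp(text, pattern):
    m = len(pattern)
    if m == 0 or m > len(text):
        return 0 if m == 0 else -1
    target = poly_hash(pattern)
    h = poly_hash(text[:m])
    xm = pow(X, m - 1, P)      # X^(m-1) mod P, used to drop the leading character
    for i in range(len(text) - m + 1):
        if h == target and text[i:i + m] == pattern:   # verify to rule out collisions
            return i
        if i + m < len(text):
            h = ((h - ord(text[i]) * xm) * X + ord(text[i + m])) % P
    return -1
```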
Sedgewick wanted to follow the original a little more closely than that, however, which in essence evaluates the polynomials at a fixed value of x (literally x in the original version that uses polynomial rings). He needs a random prime so that the oblivious adversary can't engineer collisions. Sieving up to numbers that big is quite inefficient, so he turns to the Prime Number Theorem (which is the math behind his hint, but it holds only asymptotically, which makes a big mess theoretically) and a fast primality test (which can be probabilistic; the cases where it fails won't affect the correctness of the algorithm, and they are rare enough that they won't affect the expected running time).
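A rough sketch of that random-prime route, assuming a standard probabilistic Miller-Rabin test (the helper names are illustrative, not from the book). The Prime Number Theorem is what makes the loop cheap: a random b-bit odd number is prime with probability about 2/(b·ln 2), so only on the order of b candidates are expected before one passes.

```python
import random

def is_probable_prime(n, rounds=30):
    # Miller-Rabin: a composite survives one random round with probability at most 1/4,
    # so 30 rounds are far more reliable than this application needs.
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 11, 13):
        if n % q == 0:
            return n == q
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def random_prime(bits=31):
    # Expected O(bits) candidates by the Prime Number Theorem.
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1  # force top bit and oddness
        if is_probable_prime(candidate):
            return candidate
```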
I'm not sure how he proves a formal bound on the collision probability. My rough idea is basically, show that there are enough primes in the window of interest, use the Chinese Remainder Theorem to show that it's impossible for there to be a collision for too many primes at once, conclude that the collision probability is bounded by the probability of picking a bad prime, which is low. But the Prime Number Theorem holds only asymptotically, so we have to rely on computer experiments regarding the density of primes in machine word ranges. Not great.
I'm wondering how you can quantify the results of the Needleman-Wunsch algorithm (typically used for aligning nucleotide/protein sequences).
Consider some fixed scoring scheme and two sequences of varying length S1 and S2. Say we calculate every possible alignment of S1 and S2 by brute force, and the highest scoring alignment has a score x. And of course, this has considerably higher complexity than the Needleman-Wunsch approach.
When using the Needleman-Wunsch algorithm to find a sequence alignment, say that it has a score y.
Consider r to be the score generated via Needleman-Wunsch for two random sequences R1 and R2.
How does x compare to y? Is y always greater than r for two sequences of known homology?
In general, I do understand that we use the Needleman-Wunsch algorithm to significantly speed up sequence alignment (vs a brute-force approach), but don't understand the cost in accuracy (if any) that comes with it. I had a go at reading the original paper (Needleman & Wunsch, 1970) but am still left with this question.
Needleman-Wunsch always produces an optimal answer: it's much faster than brute force and doesn't sacrifice accuracy in the process. The key insight is that it's not actually necessary to generate all possible alignments, since most of them contain bad sub-alignments and couldn't possibly be optimal. Instead, the algorithm builds up optimal alignments for fragments of the original strands and then grows those smaller alignments into larger ones, using the guarantee that any optimal alignment must contain an optimal alignment for a slightly smaller case.
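For reference, a minimal Python sketch of that table-filling idea (the scoring values are arbitrary placeholders): each cell extends an optimal alignment of a smaller prefix pair, so the final cell holds the optimal global score without enumerating every alignment.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # score[i][j] = best score for aligning a[:i] with b[:j]
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # align a[:i] against nothing but gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# e.g. needleman_wunsch("GATTACA", "GCATGCU") gives the optimal global alignment score
```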
I think your question boils down to whether dynamic programming finds the optimal solution, i.e., guarantees that y >= x. For a discussion of this I would refer to people who are likely smarter than me:
https://cs.stackexchange.com/questions/23599/how-is-dynamic-programming-different-from-brute-force
Basically, it says that dynamic programming produces the optimal result, i.e., the same as brute force, but only for problems that satisfy the Bellman principle of optimality.
According to the Wikipedia page for Needleman-Wunsch, the problem does satisfy the Bellman principle of optimality:
https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
Specifically:
The Needleman–Wunsch algorithm is still widely used for optimal global
alignment, particularly when the quality of the global alignment is of
the utmost importance. However, the algorithm is expensive with
respect to time and space, proportional to the product of the length
of two sequences and hence is not suitable for long sequences.
There is also mention of optimality elsewhere in the same Wikipedia page.
I'm trying to invent a programming exercise on suffix arrays. I learned the O(n*log(n)^2) algorithm for constructing one and then started playing with random input strings of varying length to find out when the naive approach becomes too slow. That is, I wanted to choose a string length such that people would need to implement the "advanced" algorithm.
Suddenly I found that the naive algorithm (sorting all suffixes with a standard comparison sort) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I understood that comparing suffixes of a randomly generated string is not O(n) amortized: usually we only compare the first few characters before we hit a difference and return from the comparison function. This of course depends on the size of the alphabet, but it does not depend much on the length of the suffixes.
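To make the setup concrete, here is a minimal Python sketch of the naive approach described above (the original experiment was in PHP; this version is mine): the comparator returns as soon as it sees a mismatch, which is exactly why random inputs are fast.

```python
from functools import cmp_to_key

def naive_suffix_array(s):
    n = len(s)

    def compare(i, j):
        # Walk the two suffixes in parallel and stop at the first mismatch; on a
        # random string this usually happens after only a few characters.
        while i < n and j < n:
            if s[i] != s[j]:
                return -1 if s[i] < s[j] else 1
            i += 1
            j += 1
        # One suffix is a prefix of the other: the shorter (exhausted) one sorts first.
        return (n - i) - (n - j)

    return sorted(range(n), key=cmp_to_key(compare))
```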
I tried a simple implementation in PHP and it processed a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it really behaved like O(n^2) we'd expect it to take at least several minutes (at 1e7 operations per second and ~1e9 operations total).
So I understand that even if it is O(n^2 * log(n)), the constant factor is a very small fraction of 1, really something close to 0. Or should we describe such complexity as worst-case only?
But what is the amortized time complexity of the naive approach? I'm a bit bewildered about how to assess it.
You seem to be confusing amortized and expected complexity. In this case you are talking about expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; for randomly generated input you will only perform a constant number of character comparisons in most cases. Thus O(n^2*log(n)) is the worst-case complexity.
One more note: on a modern computer you can perform a few billion elementary instructions per second, so it is quite possible to execute on the order of 50000^2 operations in 2 seconds. The correct way to benchmark the complexity of an algorithm is to measure the time it takes to complete for inputs of size N, N*2, N*4, ... (as far as you can go) and then fit a function that describes how the running time grows.
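A minimal sketch of that doubling benchmark in Python (reusing the naive_suffix_array sketch above, or any other construction routine you want to test):

```python
import random
import string
import time

def benchmark(build, sizes):
    # Time the construction for n, 2n, 4n, ...; a ratio near 4 per doubling suggests
    # quadratic growth, while a ratio near 2 suggests close-to-linear behaviour.
    prev = None
    for n in sizes:
        s = ''.join(random.choices(string.ascii_lowercase, k=n))
        start = time.perf_counter()
        build(s)
        elapsed = time.perf_counter() - start
        ratio = elapsed / prev if prev else float('nan')
        print(f"n={n:>7}  time={elapsed:7.3f}s  ratio={ratio:5.2f}")
        prev = elapsed

# e.g. benchmark(naive_suffix_array, [12_500, 25_000, 50_000, 100_000])
```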
Generally speaking, when numerically evaluating an integral, say in MATLAB, do I just pick a large number for the bounds, or is there a way to tell MATLAB to "take the limit"?
I am assuming that you just use the large number because different machines would be able to handle numbers of different magnitudes.
I am just wondering if there is a way to improve my code. I am doing lots of expected-value calculations via Monte Carlo and often use the trapezoid method to check myself if my degrees of freedom are small enough.
Strictly speaking, it's impossible to evaluate a numerical integral out to infinity. In most cases, though, if the integral in question is finite, you can simply integrate over a reasonably large range. For an integrand that falls off like the normal distribution, for example, everything beyond about 10 sigma contributes less than machine precision, so stopping there is, for better or worse, as close as you are going to get to evaluating the same integral all the way out to infinity.
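A quick illustration (in Python/SciPy rather than MATLAB, but the same idea carries over): truncating a Gaussian integral at 10 sigma is indistinguishable in double precision from integrating to infinity.

```python
import numpy as np
from scipy.integrate import quad

# Standard normal density: the tail beyond ~10 sigma is far below double precision.
pdf = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

finite, _ = quad(pdf, -10, 10)               # truncated at 10 sigma
infinite, _ = quad(pdf, -np.inf, np.inf)     # quad also accepts infinite limits
print(finite, infinite)                      # both equal 1.0 to machine precision
```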
It depends very much on what type of function you want to integrate. If it is "smooth" (no jumps; preferably not in any derivatives either, though that becomes progressively less important) and finite, then you have two main choices (limiting myself to the simplest approaches):
1. if it is periodic (here meaning: you could join the left and right ends together and there would be no jumps in value, or derivatives, at the seam): distribute your points evenly over the interval, sample the function values to get an estimated average, and then multiply by the length of the interval to get your integral.
2. if not periodic: use Gauss-Legendre quadrature.
Monte Carlo is almost invariably a poor method: it progresses very slowly towards (machine) precision: for every additional significant digit you need roughly 100 times more points!
The two methods above, for periodic and non-periodic "nice" (smooth, etc.) functions, give fair results already with a very small number of sample points and then converge very rapidly: one or two more points usually add several digits of precision! This far outweighs the burden of having to throw away the previous result when you try again with more sample points: you REPLACE the previous set of points with a fresh one, whereas in Monte Carlo you can simply add points to the existing set and refine the outcome.
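A small Python/NumPy sketch of the comparison (the integrand, point counts, and seed are arbitrary choices for illustration): an 8-point Gauss-Legendre rule on a smooth, non-periodic integrand already reaches machine precision, while 100,000 Monte Carlo samples are still only good to a few digits.

```python
import numpy as np

f, a, b = np.exp, 0.0, 1.0                            # exact integral: e - 1
exact = np.e - 1

nodes, weights = np.polynomial.legendre.leggauss(8)   # nodes and weights on [-1, 1]
x = 0.5 * (b - a) * nodes + 0.5 * (a + b)             # map the nodes to [a, b]
gauss = 0.5 * (b - a) * np.dot(weights, f(x))

rng = np.random.default_rng(0)
mc = (b - a) * f(rng.uniform(a, b, 100_000)).mean()

print(abs(gauss - exact), abs(mc - exact))            # roughly 1e-16 vs. 1e-3

# For the periodic case (point 1), evenly spaced samples over one period do just as well:
# g = lambda t: np.exp(np.sin(t))
# t = np.linspace(0, 2 * np.pi, 16, endpoint=False)
# periodic_integral = 2 * np.pi * g(t).mean()
```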
How do I compute the generalized mean for extreme values of p (very close to 0, or very large) with reasonable computational error?
As per your link, the limit for p going to 0 is the geometric mean, for which bounds are derived.
The limit for p going to infinity is the maximum.
I have been struggling with the same problem. Here is how I handled this:
Let gmean_p(x1,...,xn) be the generalized mean, where p is real but not 0 and x1, ..., xn are nonnegative. For M > 0 we have gmean_p(x1,...,xn) = M*gmean_p(x1/M,...,xn/M), and the latter form can be exploited to reduce the computational error. For large p, I use M = max(x1,...,xn), and for p close to 0, I use M = mean(x1,...,xn). If M = 0, just add a small positive constant to it. This did the job for me.
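A minimal Python/NumPy sketch of that rescaling trick (the switch-over point at |p| = 1 is my own arbitrary choice, not part of the answer):

```python
import numpy as np

def gmean_p(x, p, M=None):
    # Rescaled generalized mean: gmean_p(x) = M * gmean_p(x / M).
    # Heuristic from the answer: M = max(x) for large p, M = mean(x) for p near 0.
    x = np.asarray(x, dtype=float)
    if M is None:
        M = x.max() if abs(p) >= 1 else x.mean()
    if M == 0:
        M = 1e-300                     # "just add a small positive constant"
    return M * np.mean((x / M) ** p) ** (1.0 / p)
```

With M = max(x), every ratio x_i/M is at most 1, so large positive p no longer overflows; with M = mean(x), the ratios cluster around 1, which tames the loss of precision near p = 0.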
I suspect if you're interested in very large or small values of p, it may be best to do some form of algebraic manipulation of the generalized-mean formula before putting in numerical values.
For example, in the small-p limit, one can show that the generalized mean tends to the n'th root of the product x_1*x_2*...*x_n. The higher-order terms in p involve sums and products of log(x_i), which should also be relatively stable to compute numerically. In fact, the first-order expansion in p has a simple relationship to the variance of log(x_i): log(gmean_p) ≈ mean(log(x_i)) + (p/2)*var(log(x_i)).
Applying this formula to a set of 100 random numbers drawn uniformly from the range [0.2, 2], the asymptotic formula becomes quite accurate for p below about 0.3, while the simple formula only starts to fail once p drops below about 1e-10.
The case of large p is dominated by the x_i with the largest magnitude (let's call its index i_max). One can rearrange the generalized-mean formula into the following form, which has less pathological behaviour for large p: log(gmean_p) = log(x_imax) + (1/p)*log( (1/n) * sum_i (x_i/x_imax)^p ), where every ratio x_i/x_imax is at most 1, so the powers can no longer overflow.
If this is applied (using standard numpy routines including numpy.log1p) to another 100 uniformly distributed samples over [0.2, 2.0], one finds that the rearranged formula agrees essentially exactly with the simple formula, but remains valid for much larger values of p for which the simple formula overflows when computing powers of x_i.
(For p less than about 1000 the rearranged and simple formulas are indistinguishable; beyond that the simple formula breaks down because computing the powers of x_i overflows.)
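Here is one way to write those two rearrangements down in Python/NumPy (a sketch of my reading of the answer, not the answer's own code):

```python
import numpy as np

def gmean_small_p(x, p):
    # First-order expansion near p = 0:  log M_p ~ mean(log x) + (p/2) * var(log x).
    logx = np.log(np.asarray(x, dtype=float))
    return np.exp(logx.mean() + 0.5 * p * logx.var())

def gmean_large_p(x, p):
    # Factor out the largest element so no power can overflow, then apply log1p to
    # the small remaining sum, as described above.
    logx = np.log(np.asarray(x, dtype=float))
    lmax = logx.max()
    ratios = np.exp(p * (logx - lmax))     # (x_i / x_max)^p, each at most 1
    rest = ratios.sum() - 1.0              # the term for the max itself is exactly 1
    return np.exp(lmax + (np.log1p(rest) - np.log(len(logx))) / p)
```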
I think the answer here should be to use a recursive solution. In the same way that mean(1,2,3,4) = mean(mean(1,2), mean(3,4)), you can do the same kind of recursion for generalized means. What this buys you is that you don't need to sum as many really large numbers at once, which decreases the likelihood of an overflow. The other danger when working with floating-point numbers is adding numbers of very different magnitudes (or subtracting numbers of very similar magnitudes), so to avoid those rounding errors it can also help to sort your data before calculating the generalized mean. A sketch of the recursion is below.
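A minimal sketch of that pairwise recursion, assuming the length of the list is a power of two (unequal halves would need explicit weights):

```python
def gmean_p_pairwise(x, p):
    # gmean_p(x1, x2, x3, x4) = gmean_p(gmean_p(x1, x2), gmean_p(x3, x4))
    if len(x) == 1:
        return x[0]
    mid = len(x) // 2
    a = gmean_p_pairwise(x[:mid], p)
    b = gmean_p_pairwise(x[mid:], p)
    return ((a ** p + b ** p) / 2.0) ** (1.0 / p)
```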
Here's a hunch:
First convert all your numbers into a representation in base p. Now to raise to a power of 1/p or p, you just have to shift them --- so you can very easily do all powers without losing precision.
Work out your mean in base p, then convert the result back to base two.
If that doesn't work, an even less practical hunch:
Try working out the discrete Fourier transform, and relating that to the discrete Fourier transform of the input vector.
I've been trying to find an answer to this for months (to be used in a machine learning application). It doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (ie. doesn't involve very very large or very very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.
Unfortunately you can't do machine learning without knowing some basic math; it's like asking somebody for help with programming but not wanting to know about "variables", "subroutines" and all that if-then stuff.
The better way to do this is called Bayesian integration, but there is a simpler approximation called "maximum a posteriori" (MAP). It's pretty much like the usual thinking, except you can put in the prior distribution.
Fancy words, but you may ask: where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it's the answer you get when you have "no prior". The method below is the next level of sophistication up, where you add a prior. Going to full Bayesian integration would be the level after that, but it's harder and perhaps unnecessary.
As I understand it, the problem is twofold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it comes up heads a fraction theta of the flips. The theta for this coin comes from the master distribution, which I will assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = h*log(theta) + t*log(1-theta) - (theta-P)^2 / (2*S^2) - 1/2 * log(2*pi*S^2)
By calculus to look for extrema you find the value of theta such that dLL/dtheta = 0.
Since the last term with the log has no theta in it you can ignore it.
dLL/dtheta = (h/theta) + (P-theta)/S^2 - (t/(1-theta)) = 0.
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, where you start with your proposed theta at the obvious (called maximum likelihood) estimate of theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the same derivation but without the Gaussian prior, you get h/theta - t/(1-theta) = 0, which solves to theta = h/(h+t).
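In pseudocode-ish Python, the Newton iteration suggested above might look like this (the clamping and iteration count are my own choices to keep theta inside (0, 1)):

```python
def map_theta(h, t, P, S, iterations=20):
    # Solve dLL/dtheta = h/theta - t/(1 - theta) + (P - theta)/S**2 = 0 by Newton's
    # method, starting from the maximum-likelihood estimate h/(h + t).
    eps = 1e-9
    theta = min(max((h + eps) / (h + t + 2 * eps), eps), 1 - eps)
    for _ in range(iterations):
        f = h / theta - t / (1 - theta) + (P - theta) / S ** 2
        df = -h / theta ** 2 - t / (1 - theta) ** 2 - 1 / S ** 2   # always negative: LL is concave
        theta = min(max(theta - f / df, eps), 1 - eps)
    return theta

# e.g. map_theta(h=3, t=1, P=0.5, S=0.1) pulls the naive 0.75 back toward the prior mean
```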
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it puts some weight on negative probabilities, which is clearly wrong. More appropriate is a Gaussian prior on log theta (a 'lognormal distribution'). Plug it in the same way and work through the calculus.
You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, this just becomes h / n. When h and n are small, it is just c * p / c = p. The choice of c is up to you; you can base it on s, but in the end you have to decide how small is too small.
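In pseudocode form (the default value of c here is just a placeholder):

```python
def smoothed_estimate(h, n, p, c=10.0):
    # Treat the prior mean p as if it came from c extra "virtual" flips.
    return (h + c * p) / (n + c)

# e.g. smoothed_estimate(h=3, n=4, p=0.5) -> 0.57 instead of the naive 0.75
```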
You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example, one coin is always heads, the other always tails) knowing p and s would be useful. If there are more than a few, and especially if only some of the coins are slightly weighted, then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tail? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of 2 possible weightings? Do they follow a bell curve? etc.
It boils down to this: the difference between a weighted and an unweighted coin, the distribution of weighted coins, and the number of coins in your box will all determine what n has to be for you to solve this with high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5, and all the standard deviation tells you is that at least some of the coins have a minor bias.
How many tosses you need will, again, be determined by the weighting of the coins. Even with 500 tosses, you won't get strong confidence (about 2/3) detecting a .51/.49 split.
In general, what you are looking for is Maximum Likelihood Estimation. The Wolfram Demonstrations Project has an illustration of estimating the probability of a coin landing heads, given a sample of tosses.
Well, I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to be worth putting a little thought into. Others above have already suggested this, but perhaps if you're like me you would prefer more verbosity.
In this lingo, you have a set of mutually exclusive hypotheses, H, and some data D, and you want to find the (posterior) probability that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis with the largest posterior probability (the MAP, as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from plain maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR information regarding which hypotheses are most likely, and you want to incorporate those priors.
So you have, from basic probability, Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(Hi|D) numerically by creating a series of discrete hypotheses Hi to test, e.g. [0.0, 0.05, 0.1, ..., 0.95, 1.0], and then determining your prior Pr(Hi) for each one. Above it is assumed you have a normal distribution of priors; if that is acceptable you could use the mean and stdev to get each Pr(Hi), or use another distribution if you prefer. With coin tosses, Pr(D|Hi) is of course given by the binomial, using the observed number of successes in n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting, but we assume our hypotheses cover all the bases, so Pr(D) is the sum of Pr(D|Hi)*Pr(Hi) over all Hi.
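A small Python/NumPy sketch of that grid computation (function and variable names are illustrative):

```python
import numpy as np
from math import comb

def posterior_on_grid(h, n, P, S, points=101):
    # Discrete hypotheses Hi on a grid, a (truncated) Gaussian prior with mean P and
    # sd S, the binomial likelihood Pr(D|Hi), and normalization by the sum over all Hi.
    theta = np.linspace(0.0, 1.0, points)
    prior = np.exp(-(theta - P) ** 2 / (2 * S ** 2))
    prior /= prior.sum()
    likelihood = comb(n, h) * theta ** h * (1.0 - theta) ** (n - h)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return theta, posterior

# theta, post = posterior_on_grid(h=3, n=4, P=0.5, S=0.1)
# map_estimate   = theta[np.argmax(post)]
# posterior_mean = float(np.sum(theta * post))
```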
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.