Fast approximate string difference for large strings - string

I'm trying to quantify the difference between two strings as part of a change-monitor system.
The issue I'm having is that the strings are large - I can often be dealing with strings with 100K+ characters.
I'm currently using Levenshtein distance, but computing the Levenshtein distance for large strings is very inefficient. The standard dynamic-programming implementations take O(mn) time for strings of lengths m and n (only the memory can be reduced, to O(min(m, n))).
Since both strings are of approximately the same length, the distance calculation process can take many seconds.
I do not need high precision. A change resolution of 1 in 1000 (i.e. 0.1%) would be plenty for my application.
What options are there for more efficient string distance computation?

If you can tolerate some error, you can try splitting the strings into smaller chunks and calculating the Levenshtein distances of corresponding chunk pairs.
This method yields an accurate result for replacements, but inserts and deletes incur an accuracy penalty that depends on the number of chunks (the worst case would give you a distance of 2 * <number of inserts/deletes> * <number of chunks> instead of <number of inserts/deletes>).
The next step could be to make the process adaptive. I see two ways of doing it, depending on the expected nature of the changes:
Try a small chunk size first then move on to larger and larger chunks and observe the drop between each iteration. That should help you estimate how much of your measured distance is error (though I haven't worked out exactly how).
Once you find a difference between two chunks, try to identify what the difference is (exactly how many characters were added/deleted overall), and shift your next chunk to the left or to the right accordingly.
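The basic (non-adaptive) chunking idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names and the default `chunk_size` are assumptions made for the example. Because the per-chunk edit scripts can be concatenated into one valid edit script, the sum is an upper bound on the true distance, and the running time drops from O(n^2) to roughly O(n * chunk_size).

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def chunked_distance(a: str, b: str, chunk_size: int = 1000) -> int:
    """Sum of edit distances between aligned fixed-size chunks.

    The result is an upper bound on the true Levenshtein distance.
    """
    total = 0
    for start in range(0, max(len(a), len(b)), chunk_size):
        total += levenshtein(a[start:start + chunk_size],
                             b[start:start + chunk_size])
    return total
```

A single replacement is counted exactly, while a single insertion near the start shifts every later chunk and is overcounted, which is the penalty described above.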

Related

Longest Common Subsequence between very large strings

I am trying to solve the longest common subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between 2 strings.
This is a well-known dynamic programming problem. However, in my case the strings are too large. When I tried to use the 2D matrix for memoization, I ran out of memory.
One solution could be to use a sparse matrix instead, but I am a little concerned about the performance overhead of that.
I also want to run this algorithm across multiple strings. An approximate answer is fine, since I am only trying to measure the overlap between 2 strings.
EDIT: After some investigation I found the following alternatives
Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce memory complexity, you don't need to store the entire 2D table. You only need to store the previous row and the current row, keeping track of the maximum in a separate variable if you need it. This reduces memory usage to O(N), but the time complexity remains O(N^2). Note that two rows are enough to compute the length of the LCS, not to reconstruct the subsequence itself.
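A minimal sketch of the two-row technique, assuming only the LCS length is needed (the function name is chosen for this example):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence using O(min(len)) memory."""
    if len(b) > len(a):
        a, b = b, a  # rows are indexed by the shorter string
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # extend the best result that ended before both characters
                current.append(previous[j - 1] + 1)
            else:
                # drop one character from either string
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

An overlap score can then be derived from the length, for example `lcs_length(a, b) / max(len(a), len(b))`, although which normalization is appropriate depends on the application.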

How to find unknown repeated patterns in the set of strings?

Here is a description of the problem. Suppose you have a set of strings (up to 10 billion strings, each up to 10k characters long, built from an alphabet of 1000 unique symbols). How can I find patterns with lengths from 2 up to N (let's say 10 for simplicity)? I'd also like to see only those patterns which occur in at least 1% of all strings (some threshold).
I'd like to find an algorithm which can help me solve this problem. The numbers are not exact but are of the same order of magnitude as those in our project.
Thank you
Index all your strings in a suffix tree. Construction can be done in O(total number of characters), and you only need to do it once before you start.
A suffix tree allows you to tell quickly (in O(pattern length)) whether a pattern appears in any of the strings you've indexed, and how many times.
You can do another pass through the structure and count the number of leaves in each subtree (O(N) again). That tells you how often the substring spelled out from the root to that node occurs, so you can drop patterns or do whatever you want based on how common they are.
Now, 10 billion strings of up to 10k characters each, with 2-byte characters (to fit the 1000 unique symbols), is quite large: 10^10 * 10^4 * 2 bytes is about 200 TB, which doesn't fit in RAM. So you'll either need to wait for a while or get more computers and set up a distributed solution. You can apply the solution above to batches of strings so that they fit into your available memory, but the lookup cost in the structure is then multiplied by the number of batches.
If everything is in batches, then the most efficient way would be to make the batches as big as you can. Once you've built the suffix tree for a batch, run all your queries through it, save the results and drop the tree to free memory for the next batch of input strings.
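The batch-and-merge workflow can be illustrated with a much simpler stand-in for the suffix tree. Because the pattern length is bounded (2 to N), the sketch below counts, with a plain hash map, how many strings contain each short substring. It is not a suffix tree, and it makes no attempt to bound the size of the counter, which would matter at the scale described; the function names and parameters are assumptions for this example.

```python
from collections import Counter


def patterns_in(s: str, min_len: int = 2, max_len: int = 10) -> set:
    """All distinct substrings of s with length min_len..max_len."""
    return {s[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(s) - n + 1)}


def frequent_patterns(batches, threshold: float = 0.01,
                      min_len: int = 2, max_len: int = 10) -> dict:
    """Patterns that occur in at least `threshold` of all strings.

    `batches` is any iterable of iterables of strings, so only one
    batch has to be held in memory at a time.
    """
    counts = Counter()
    total = 0
    for batch in batches:
        for s in batch:
            # a set, so each pattern counts at most once per string
            counts.update(patterns_in(s, min_len, max_len))
            total += 1
    return {p: c for p, c in counts.items() if c >= threshold * total}
```

For example, `frequent_patterns([["abcab", "xxab"], ["zzz"]], threshold=0.6)` keeps only `"ab"`, the one pattern present in two of the three strings.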

Naive Suffix Array time complexity

I'm trying to invent a programming exercise on suffix arrays. I learned the O(n*log(n)^2) algorithm for constructing one and then started playing with random input strings of varying length to find out when the naive approach becomes too slow. That is, I wanted to choose a string length at which people would need to implement the "advanced" algorithm.
To my surprise, I found that the naive algorithm (using an O(n log n) comparison sort on all suffixes) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I understood that comparing suffixes of a randomly generated string is not O(n) amortized. In practice, we usually compare only the first few characters before we reach a difference and return from the comparison function. This of course depends on the size of the alphabet, but it does not depend much on the length of the suffixes.
I tried a simple implementation in PHP that processes a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it ran in at least O(n^2) time, we would expect it to take at least several minutes (at 1e7 operations per second and ~1e9 operations in total).
So I understand that even if it is O(n^2 * log(n)), the constant factor is a very small fraction of 1, really something close to 0. Or should we say that this complexity is worst-case only?
But what is the amortized time complexity of the naive approach? I'm a bit bewildered about how to assess it.
You seem to be confusing amortized and expected complexity. In this case you are talking about expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; for randomly generated input you will perform only a small number of character comparisons in most cases. Thus O(n^2*log(n)) is the worst-case complexity.
One more note: a modern computer can perform a few billion elementary instructions per second, so executing on the order of 50000^2 = 2.5e9 operations in 2 seconds is plausible. The correct way to benchmark the complexity of an algorithm is to measure the time it takes to complete for inputs of size N, N*2, N*4, ... (as far as you can go) and then to fit a function that describes the computational complexity.
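The doubling benchmark can be sketched like this. It is an illustration under assumptions: the naive construction is written in Python rather than PHP, the helper names and default sizes are made up for the example, and the exponent is estimated as the least-squares slope of log(time) against log(size).

```python
import math
import random
import time
from functools import cmp_to_key


def naive_suffix_array(s: str) -> list:
    """Sort suffix start positions by comparing suffixes character by character."""
    n = len(s)

    def compare(i: int, j: int) -> int:
        while i < n and j < n:
            if s[i] != s[j]:
                return -1 if s[i] < s[j] else 1
            i += 1
            j += 1
        # the suffix that ran out first is a prefix of the other and sorts first
        return (i < n) - (j < n)

    return sorted(range(n), key=cmp_to_key(compare))


def estimate_exponent(sizes, times) -> float:
    """Slope of log(time) vs log(size): about 1 for linear, 2 for quadratic."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def run_benchmark(start_size=1000, doublings=4, alphabet="ACGT", seed=0):
    """Time the naive construction on random strings of size N, 2N, 4N, ..."""
    rng = random.Random(seed)
    sizes, times = [], []
    size = start_size
    for _ in range(doublings):
        text = "".join(rng.choice(alphabet) for _ in range(size))
        began = time.perf_counter()
        naive_suffix_array(text)
        times.append(time.perf_counter() - began)
        sizes.append(size)
        size *= 2
    return sizes, times
```

Calling `estimate_exponent(*run_benchmark())` gives a rough empirical exponent. Timing noise on small inputs can distort it, so larger sizes and repeated runs give a more trustworthy estimate.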

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart. I'm looking for something more like an old-fashioned library card catalog, where you have a set of drawers (bins): one might hold SAM - SOLD and the next SOLE - STE, while all of Y - ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the values will fall in some interval [a, b]. If you want at most n bins, make the bin size (b - a)/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element on your pass and dumping it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take the elements at multiples of size/(n*m) in your array, where size is the total number of elements seen.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
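The reservoir-sampling approach can be sketched in a few lines. The answer above used a Guava RangeMap in Java; the version below is a Python stand-in that stores the bin boundaries in a sorted list and looks values up with `bisect`, so the function names are assumptions for this example rather than the author's code.

```python
import bisect
import random


def reservoir_sample(stream, k: int, rng=random) -> list:
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            sample.append(item)
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def bin_boundaries(sample, bins: int) -> list:
    """Cut points that split the sorted sample into `bins` equal-count bins."""
    ordered = sorted(sample)
    return [ordered[(len(ordered) * i) // bins] for i in range(1, bins)]


def bin_index(boundaries, value) -> int:
    """Zero-based index of the bin that `value` falls into."""
    return bisect.bisect_right(boundaries, value)
```

The proportion of the population between two values can then be estimated as `(bin_index(boundaries, high) - bin_index(boundaries, low)) / bins`, which mirrors the bin 25 to bin 75 estimate described above. Like the RangeMap version, this works for any mutually comparable values.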

Why is naive string search algorithm faster?

I'm testing string search algorithms from this site: EXACT STRING MATCHING ALGORITHMS, by Christian Charras and Thierry Lecroq. The test text is a random sequence of DNA bases (ACGT), 1 GByte in size. The test patterns are a list of random sequences of random size (1 kB max). The test system is an AMD Phenom II X4 955 at 3.2 GHz with 4 GB of RAM and 64-bit Windows 7. The code is written in C and compiled with MinGW with the -O3 flag.
The naive search algorithm takes 4 seconds for short patterns to 8 seconds for 1 kB patterns. The deterministic finite state machine takes 2 seconds for short patterns to 4 seconds for 1 kB patterns. The Boyer-Moore algorithm takes 4 seconds for very short patterns, about 1/2 second for short patterns and 2 seconds for 1 kB patterns. The remaining algorithms perform worse than the naive search algorithm.
How can the naive search algorithm be faster than most other algorithms?
How can a deterministic finite state machine implemented with a transition table (O(n) execution time always) be 2 to 8 times slower than the Boyer-Moore algorithm? Yes, BM's best case is O(n/m), but its average case is O(n) and its worst case is O(nm).
There is no perfect string matching algorithm which is best for all circumstances.
Boyer-Moore (and Horspool, Sunday etc.) work by creating jump tables ("How far can I move the search pointer when the characters do not match?"). The more distinct letters in the strings, the greater the positive impact. You can imagine that a pattern over only 4 distinct letters produces a jump table whose shifts are typically only about 3 or 4 positions per mismatch, because every letter almost certainly occurs near the end of the pattern. Searching for case-sensitive English text (A-Z + a-z + punctuation, about 55 symbols) can yield far longer shifts, up to the pattern length.
On the other hand, there is a negative impact on both preparation (i.e. calculating the jump tables) and the search loop itself. So these algorithms perform poorly on short strings (preparation creates overhead) and on strings with only a few distinct letters (as mentioned before).
The naive search algorithm is very compact and there are very few operations inside the loop, so the loop runs fast. As there is no overhead, it performs better when searching for short strings.
The loop operations of a BM algorithm are quite complex compared to the naive search and take much longer per iteration. This (partly) cancels out the positive performance impact of the jump tables.
So although you are using long strings, the small alphabet (= small jumps) makes BM perform poorly. KMP has less overhead in the loop (its jump table is generally smaller, but comparable to BM's for small alphabets), which is why KMP performs so well.
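To make the jump-table argument concrete, here is a minimal sketch of Horspool's simplified bad-character rule (not the full Boyer-Moore algorithm, and not the C code that was benchmarked). The function names are chosen for this example. For a DNA pattern, every letter tends to occur close to the end of the pattern, so the shifts stay short.

```python
def horspool_shifts(pattern: str, alphabet: str) -> dict:
    """Shift for each letter: distance from its last occurrence
    (excluding the final position) to the end of the pattern."""
    m = len(pattern)
    shifts = {c: m for c in alphabet}  # letters absent from the pattern
    for i, c in enumerate(pattern[:-1]):
        shifts[c] = m - 1 - i
    return shifts


def horspool_search(text: str, pattern: str) -> int:
    """Index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    shifts = horspool_shifts(pattern, "")
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # shift by the table entry for the text character under the pattern's end
        pos += shifts.get(text[pos + m - 1], m)
    return -1
```

For the 7-letter pattern `"GATTACA"` the table over `"ACGT"` is `{"A": 2, "C": 1, "G": 6, "T": 3}`, an average shift of 3, whereas a letter that does not occur in the pattern at all would allow the full shift of 7.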
Theoretically good algorithms (lower time complexity) often have high bookkeeping costs that can outweigh their advantage over a naive algorithm for small problem sizes. Implementation details also matter: by optimizing an implementation you can sometimes improve runtime by a factor of 2 or more.
The naive implementation actually has a linear expected running time (the same as BM, KMP, etc.) for random input data. I can't write out a full proof here, but one can be found in Algorithms Design Techniques and Analysis.
Most exact matching algorithms are optimized versions of the naive implementation that avoid being slowed down by certain patterns. For instance, suppose we are searching for:
aaaaaaaaaaaaaaaaaaaaaaaab
on a stream of:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
The naive search fails at the b many times over. KMP/BM implementations are designed to avoid repeatedly comparing the a's. However, if the sequence is truly random, such conditions are very unlikely to appear, and the naive implementation is likely to work better due to its lower bookkeeping overhead or possibly better spatial/temporal locality.
That said, I'm not sure DNA sequences are random, or whether repetitions are common in them. Either way, there's no way to examine this carefully without representative data.
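The difference between the pathological and the typical case can be seen by counting character comparisons in a naive search. This is a small sketch written for illustration; the function name and return shape are assumptions, not part of the benchmarked code.

```python
def naive_search_comparisons(text: str, pattern: str):
    """Return (index of first match or -1, number of character comparisons)."""
    comparisons = 0
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch: move on to the next alignment
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons
```

On `"a" * 10 + "b"` with pattern `"aab"`, every alignment needs all 3 comparisons (27 in total), while on `"abcabcabd"` with pattern `"abd"` most alignments fail on the first character and only 13 comparisons are needed for 7 alignments.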

Resources