How to find unknown repeated patterns in a set of strings?

Here is a description of the problem. Suppose you have a set of strings (up to 10 billion strings, each up to 10k characters long, drawn from an alphabet of 1000 unique symbols). How can I find patterns of length 2 up to length N (let's say 10 for simplicity)? I'd also like to see only those patterns which occur in at least 1% of all strings (some threshold).
I'd like to find an algorithm which can help me solve this problem. The numbers are not exact, but they are the same order of magnitude as we have in the project.
Thank you

Index all your strings in a suffix tree. This can be done in O(total number of characters), and you only need to do it once before you start.
A suffix tree allows you to quickly (in O(pattern length)) tell whether a pattern appears in any of the strings you've indexed, and how many times.
You can do another pass through the structure and count the number of leaves in each subtree (O(N) again), which tells you how often the substring from the root to that node occurs, so you can drop nodes or do whatever you want based on how common they are.
Now, 10 billion strings of length 10k, with 2-byte characters (to fit the 1000 unique symbols), is quite large (roughly 200 TB if my math is right), which doesn't fit in RAM. So you'll either need to wait for a while or get more computers and set up a distributed solution. You can apply the solution above to batches of strings so that they fit into your available memory, but the lookup in the structure then needs to be multiplied by the number of batches you are doing.
If everything is in batches, the most efficient way is to make the batches as big as you can; then, when you've built the suffix tree for a batch, run all your queries through it, save the results, and drop the tree to free memory for the next batch of input strings.
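As a toy illustration of the per-batch bookkeeping (this is not a suffix tree; it brute-forces every substring of length 2..N per string, so it is only workable for small batches, but it shows the per-string dedup and the 1% threshold; the names are made up):

from collections import Counter

def frequent_patterns(batch, min_len=2, max_len=10, threshold=0.01):
    # Count in how many strings of the batch each pattern occurs (document frequency).
    doc_freq = Counter()
    for s in batch:
        seen = set()  # each pattern is counted at most once per string
        for n in range(min_len, max_len + 1):
            for i in range(len(s) - n + 1):
                seen.add(s[i:i + n])
        doc_freq.update(seen)
    cutoff = threshold * len(batch)
    return {p: c for p, c in doc_freq.items() if c >= cutoff}

The suffix tree (or a suffix automaton) replaces the inner double loop with a single linear pass per batch; the thresholding step stays the same.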

Related

Longest Common Subsequence between very large strings

I am trying to solve the Longest Common Subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between 2 strings.
This is a well-known dynamic programming problem. However, in my case the strings are too large. When I tried to use the 2D matrix to memoize, I ran into an out-of-memory problem.
One solution could be to use a sparse matrix instead, but I am a little concerned about the performance overhead of that.
Also, I want to perform this algorithm across multiple strings. An approximate answer would be okay, since I am only trying to measure the overlap between 2 strings.
EDIT: After some investigation I found the following alternatives
Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce memory complexity, you don't need to store the entire 2D table. You only need to keep the row above and the current row (tracking the running maximum in a separate variable if you need it), which brings memory consumption down to O(N). Time complexity remains O(N^2).
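A minimal sketch of the two-row trick (it returns only the LCS length, not the subsequence itself; pass the shorter string as b to get O(min(m, n)) memory):

def lcs_length(a, b):
    # Keep only the previous and current DP rows instead of the full table.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            cur[j] = prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]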

Find most repeated phrase on huge text

I have huge text data. My entire database is plain text in UTF-8.
I need a list of the most repeated phrases across my whole text data.
For example, my desired output would be something like this:
{
'a': 423412341,
'this': 423412341,
'is': 322472341,
'this is': 222472341,
'this is a': 122472341,
'this is a my': 5235634
}
Processing and storing every phrase takes a huge amount of database space,
for example in MySQL or MongoDB.
The question is: is there a more efficient database or algorithm for finding this result?
Solr, Elasticsearch, etc.?
I think a maximum of 10 words in each phrase would be good enough for me.
I'd suggest combining ideas from two fields here: streaming algorithms, and the Apriori algorithm from market-basket analysis.
Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, sampling (see Finding Frequent Items in Data Streams), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).
Now to the problem of the k most frequent phrases (possibly of multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using these, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.
A Short Description of The Sampling Algorithm
This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.
In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.
In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.
Output the top k of the items resulting from the second stage.
Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
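A rough Python sketch of the two stages (the ~log(n)/f sample size and plain random sampling are simplifications of the paper's version; `words` is assumed to be the tokenized corpus as a list):

import random
from collections import Counter
from math import log

def sampled_top_k(words, k, f):
    # Stage 1: draw ~log(n)/f random words as candidates.
    n = len(words)
    sample_size = min(n, max(1, int(log(n) / f)))
    candidates = set(random.sample(words, sample_size))
    # Stage 2: one pass over the corpus, counting only the candidates.
    counts = Counter(w for w in words if w in candidates)
    return counts.most_common(k)

The second pass parallelizes exactly as described: run the Counter over each text segment separately and sum the dictionaries.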
If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:
http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100
Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:
http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
Have you considered using MapReduce?
Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens of up to 10 words; I don't think that's a big deal. The outcome of the MR job will be token -> frequency pairs, which you can pass to another job to sort them by frequency (one option). I would suggest reading up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.
Original paper on MapReduce by Google.
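A single-process sketch of the map and reduce steps (the actual Hadoop/HBase wiring is omitted; `mapper` emits (n-gram, 1) pairs for windows of up to 10 words, and `reducer` sums them the way the reduce phase would):

from collections import defaultdict
from itertools import chain

def mapper(line, max_words=10):
    # Emit (ngram, 1) for every 1..max_words word window in the line.
    words = line.split()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1

def reducer(pairs):
    # Sum the counts per ngram, as the reduce phase would.
    totals = defaultdict(int)
    for ngram, count in pairs:
        totals[ngram] += count
    return totals

# Local simulation of the two phases over a couple of lines.
counts = reducer(chain.from_iterable(mapper(line) for line in ["this is a my sample", "this is a test"]))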
Tokenize the text into tokens of 1 to 10 words and insert them into 10 SQL tables, one per token length. Make sure to use a hash index on the column with the string tokens. Then just call SELECT token, COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.
EDIT: that would be infeasible for large datasets; instead, for each N-gram just update its count by +1 or insert a new row into the table (in MySQL the useful query is INSERT ... ON DUPLICATE KEY UPDATE). You should definitely still use hash indexes, though.
After that, just sort by the number of occurrences and merge the data from these 10 tables (you could do that in a single step, but that would put more strain on memory).
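A small sketch of the upsert-per-token idea using SQLite (SQLite's ON CONFLICT syntax stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE; the table name ngram2 and a single 2-word table are made up for illustration):

import sqlite3

conn = sqlite3.connect("ngrams.db")
conn.execute("CREATE TABLE IF NOT EXISTS ngram2 (token TEXT PRIMARY KEY, cnt INTEGER NOT NULL)")

def add_bigrams(line):
    # Upsert every 2-word token; in the scheme above there is one table per token length.
    words = line.split()
    for i in range(len(words) - 1):
        conn.execute(
            "INSERT INTO ngram2 (token, cnt) VALUES (?, 1) "
            "ON CONFLICT(token) DO UPDATE SET cnt = cnt + 1",
            (" ".join(words[i:i + 2]),),
        )

add_bigrams("this is a my sample and this is a test")
conn.commit()
top = conn.execute("SELECT token, cnt FROM ngram2 ORDER BY cnt DESC LIMIT 100").fetchall()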
Be wary of heuristic methods like the one suggested by Ami Tavory; if you select the wrong parameters, you can get wrong results. The flaw of the sampling algorithm shows up on some classic terms or phrases, e.g. "habeas corpus": neither "habeas" nor "corpus" will be selected as frequent by itself, but as a 2-word phrase it may very well rank higher than some phrases you get by appending/prepending a common word. There is surely no need to use such methods for tokens of shorter length; use them only when the classic methods fail (take too much time or memory).
The top answer by Ami Tavory states:
Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.
While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams is bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):
Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.
Looking at the most frequent single words, we get:
1-Gram Frequency
------ ---------
a 12
will 6
corpus 5
tricksy 4
or 3
from 2
it 2
the 2
very 2
you 2
The method suggested by Ami Tavori would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:
2-Gram Frequency
------ ---------
corpus will 5
tricksy corpus 4
or a 3
a very 2
And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:
3-Gram Frequency
------ ---------
tricksy corpus will 4
To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
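A small sketch of that prune-and-extend idea with exact counts (an (m+1)-gram is only counted if its m-word prefix itself repeats, which is a safe pruning rule):

from collections import Counter

def repeated_ngrams(words, max_len=10):
    # Level-wise search: keep only n-grams that occur more than once,
    # and extend only those whose (n-1)-word prefix survived the previous level.
    result = {}
    survivors = set()
    for n in range(1, max_len + 1):
        counts = Counter()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if n == 1 or gram[:-1] in survivors:
                counts[gram] += 1
        survivors = {g for g, c in counts.items() if c > 1}
        if not survivors:
            break
        result.update({" ".join(g): counts[g] for g in survivors})
    return result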
This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog, where you have a set of drawers (bins), and one might hold SAM - SOLD, the next bin SOLE - STE, while all of Y - ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values, or (B) a suggestion on how to encode the strings in a way that a standard numeric histogram algorithm would work? The algorithm should not require prior knowledge of the string population.
The best way I can think of to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If you can't make any assumptions about the data, you are going to have to make a pass to determine the bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the values will fall in some interval [a, b]. If you want at most n bins, make the bin size (b - a) / n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and dumping it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take every (size / n / m)-th element of that array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
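A Python sketch of the same approach (the original used a Guava RangeMap in Java; here the cutoffs are just a sorted list of sample quantiles, and reservoir_sample is textbook Algorithm R):

import bisect
import random

def reservoir_sample(stream, k):
    # Algorithm R: uniform random sample of k items from a stream of unknown length.
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def bin_cutoffs(sample, bins):
    # Equi-depth cutoffs: each bin holds roughly len(sample) / bins sampled strings.
    ordered = sorted(sample)
    step = max(1, len(ordered) // bins)
    return [ordered[min(i * step, len(ordered) - 1)] for i in range(1, bins)]

def bin_index(cutoffs, value):
    # Which bin a value falls into; with 100 bins this doubles as a rough percentile.
    return bisect.bisect_right(cutoffs, value)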

Fast approximate string difference for large strings

I'm trying to quantify the difference between two strings as part of a change-monitor system.
The issue I'm having is that the strings are large - I can often be dealing with strings with 100K+ characters.
I'm currently using Levenshtein distance, but computing the Levenshtein distance for large strings is very inefficient; even the best implementations only manage O(m*n).
Since both strings are of approximately the same length, the distance calculation process can take many seconds.
I do not need high precision. A change resolution of 1 in 1000 (e.g. 0.1%) would be plenty for my application.
What options are there for more efficient string distance computation?
If you can tolerate some error, you can try splitting the strings into smaller chunks and calculating their pairwise L-distances.
The method would obviously yield accurate results for replacements; inserts and deletes would incur an accuracy penalty depending on the number of chunks (the worst case would give you a distance of 2 * <number of inserts/deletes> * <number of chunks> instead of <number of inserts/deletes>).
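A rough sketch of the chunking idea (the plain two-row DP Levenshtein is only applied to fixed-size chunk pairs, so the quadratic cost stays bounded; chunk_size is a knob you would tune):

def levenshtein(a, b):
    # Standard two-row DP edit distance; fine for short strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def approx_distance(a, b, chunk_size=1000):
    # Sum pairwise chunk distances; exact for replacements, overestimates when
    # inserts/deletes shift later chunks out of alignment.
    chunks_a = [a[i:i + chunk_size] for i in range(0, len(a), chunk_size)]
    chunks_b = [b[i:i + chunk_size] for i in range(0, len(b), chunk_size)]
    total = sum(levenshtein(x, y) for x, y in zip(chunks_a, chunks_b))
    extra = chunks_a[len(chunks_b):] + chunks_b[len(chunks_a):]
    return total + sum(len(c) for c in extra)  # unmatched tail counts as pure inserts/deletes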
The next step could be to make the process adaptive; I see two ways of doing it, depending on the expected nature of the changes:
Try a small chunk size first, then move on to larger and larger chunks and observe the drop between each iteration. That should help you estimate how much of your measured distance is error (though I haven't worked out exactly how).
Once you find a difference between two chunks, try to identify what the difference is (exactly how many characters were added/deleted overall), and shift your next chunk to the left or to the right accordingly.

An efficient algorithm for finding smallest pangrammatic windows?

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:
I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'
The smallest pangrammatic window in the text is this string:
g very well; but he just looked up into my face with a very quizzical ex
Which indeed contains every letter at least once.
My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?
I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.
There is a simple naive algorithm that runs in time O(n^2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n^2) time.
We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest time that each character appears in the input array at or after the given index. Whichever of these numbers is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.
A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.
Are there any better algorithms out there?
For every letter, keep track of its most recent sighting. Whenever you process a letter, update the corresponding sighting index and calculate the range (max - min) of the sighting indexes over all letters. Find the location with the minimum range.
Complexity O(n), or O(n log m) if you consider the alphabet size m.
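A direct Python sketch of that approach (it recomputes the minimum over the 26 most recent indices at every letter, so it runs in O(26*n) rather than the O(n log m) mentioned above; assumes the plain letters a-z):

def smallest_pangram_window(text):
    # Track the most recent index of each letter; once all 26 have been seen,
    # the window [min(recent), i] contains every letter at least once.
    recent = {}
    best = None
    for i, ch in enumerate(text.lower()):
        if 'a' <= ch <= 'z':
            recent[ch] = i
            if len(recent) == 26:
                lo = min(recent.values())
                if best is None or i - lo < best[1] - best[0]:
                    best = (lo, i)
    return text[best[0]:best[1] + 1] if best else None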
This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):
1. Advance first iterator and increase counter for each processed letter. Stop when all 26 counters are non-zero.
2. Advance second iterator and decrease counter for each processed letter. Stop when any of these counters is zero.
3. Use difference between iterators to update best-so-far result and continue with step 1.
This algorithm may be improved a little bit if instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare with current position, and step 1 should update these positions and (most of the time) search for some character in the text.
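A sketch of the two-iterator version with per-letter counters (the comments refer to the numbered steps above; again assumes plain a-z letters):

from collections import Counter

def smallest_window_two_pointer(text):
    s = text.lower()
    counts = Counter()
    covered = 0      # number of distinct letters currently inside the window
    left = 0
    best = None
    for right, ch in enumerate(s):
        if 'a' <= ch <= 'z':          # step 1: advance the first iterator
            counts[ch] += 1
            if counts[ch] == 1:
                covered += 1
        while covered == 26:          # step 2: shrink from the left while all letters remain
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)  # step 3: update the best-so-far window
            cl = s[left]
            if 'a' <= cl <= 'z':
                counts[cl] -= 1
                if counts[cl] == 0:
                    covered -= 1
            left += 1
    return text[best[0]:best[1] + 1] if best else None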
