algorithm to partition strings by prefix - string

Given a list of strings L (sorted), and a positive integer N (N <= len(L)), how to efficiently partition L into no more than N groups by common prefix of length N?
Example: define data structure and function as below:
type PrefixGroup struct {
Prefix string
Count int
}
func partition(L []string, N int, prefix string) []PrefixGroup
The list L may contains thousands of strings, when called with
partition(L, 8, "")
the output may be:
[
{"Prefix":"13", "Count":1000},
{"Prefix":"180": "Count": 10},
{"Prefix":"X": "Count": 2},
... ...
]
which means in L, there are 1000 strings start with "13", 10 start with "180" and 2 start with "X". Note that the length of prefix is not fixed. The key requirement of this algorithm is to partition strings with common prefix so that the number of groups is as close as but not exceeding N.
With the result above, I can then call partition(L, 8, "13") to further drill down subset of L that start with "13":
[
{"Prefix":"131", "Count": 50},
{"Prefix":"135": "Count": 100},
{"Prefix":"136": "Count": 500},
... ...
]
This is not a homework question. I need to write such an algorithm for a project at hand. I can write it "brute-force"-ly, just wonder if there are any classic/well-known data structure and/or algorithm to achieve proven time/space efficiency.
I have considered trie, but wonder if it may consume too much memory...

You need to use a Radix trie.
You can read about the difference between a trie and a Radix trie.

Well there are a few algorithms, but prefix tree is to way to go.
A prefix tree, or trie (often pronounced "try"), is a tree whose nodes don't hold keys, but rather, hold partial keys. For example, if you have a prefix tree that stores strings, then each node would be a character of a string. If you have a prefix tree that stores arrays, each node would be an element of that array. The elements are ordered from the root. So if you had a prefix tree with the word "hello" in it, then the root node would have a child "h," and the "h" node would have a child, "e," and the "e" node would have a child node "l," etc. The deepest node of a key would have some sort of boolean flag on it indicating that it is the terminal node of some key. (This is important because the last node of a key isn't always a leaf node... consider a prefix tree with "dog" and "doggy" in it). Prefix trees are good for looking up keys with a particular prefix.

Related

The optimal data structure for filtering for objects that match criteria

I'll try to present the problem as generally as I can, and a response in any language will do.
Suppose there are a few sets of varying sizes, each containing arbitrary values to do with a category:
var colors = ["red", "yellow", "blue"] // 3 items
var letters = ["A", "B", "C", ... ] // 26 items
var digits = [0, 1, 2, 3, 4, ... ] // 10 items
... // each set has fixed amount of items
Each object in this master list I already have (which I want to restructure somehow to optimize searching) has properties that are each a selection of one of these sets, as such:
var masterList = [
{ id: 1, color: "red", letter: "F", digit: 5, ... },
{ id: 2, color: "blue", letter: "Q", digit: 0, ... },
{ id: 3, color: "red", letter: "Z", digit: 3, ... },
...
]
The purpose of the search would be to create a new list of acceptable objects from the master list. The program would filter the master list by given search criteria that, for each property, contains a list of acceptable values.
var criteria = {
color: ["red", "yellow"],
letter: ["A", "F", "P"],
digit: [1, 3, 5],
...
};
I'd like to think that some sort of tree would be most applicable. My understanding is that it would need to be balanced, so the root node would be the "median" object. I suppose each level would be defined by one of the properties so that as the program searches from the root, it would only continue down the branches that fit the search criteria, each time eliminating the objects that don't fit given the particular property for that level.
However, I understand that many of the objects in this master list will have matching property values. This connects them in a graphical manner that could perhaps be conducive to a speedy search.
My current search algorithm is fairly intuitive and can be done with just the master list as it is. The program
iterates through the properties in the search criteria,
with each property iterates over the master list, eliminating a number of objects that don't have a matching property, and
eventually removes all the objects that don't fit the criteria. There is surely some quicker filtering system that involves a more organized data structure.
Where could I go from here? I'm open to a local database instead of another data structure I suppose - GraphQL looks interesting. This is my first Stack Overflow question, so my apologies for any bad manners 😶
As I don't have the context of number of sets and also on the number of elements in the each set. I would suggest you some very small changes which will make things at-least relatively fast for you.
To keep things mathematical, I will define few terms here:
number of sets - n
size of masterlist - k
size of each property in the search criteria - p
So, from the algorithm that I believe you are using, you are doing n iterations over the search criteria, because there can be n possible keys in the search criteria.
Then in each of these n iterations, you are doing p iterations over the allowed values of that particular set. Finally, in each of these np iterations you are iterating over the master list, with k iterations ,and checking if this value of record should be allowed or not.
Thus, in the average case, you are doing this in O(npk) time complexity.
So, I won't suggest to change much here.
The best you can do is, change the values in the search criteria to a set (hashset) instead of keeping it as a list, and then iterate over the masterlist. Follow this Python code:
def is_possible(criteria, master_list_entry):
for key, value in master_list_entry.items(): # O(n)
if not key in criteria or value not in criteria[key]: # O(1) average
return False
return True
def search(master_list, criteria):
ans = []
for each_entry in master_list: # O(k)
if is_possible(criteria, each_entry): # O(n), check above
ans.append(each_entry)
return ans
Just call search function, and it will return the filtered masterlist.
Regarding the change, change your search criteria to:
criteria = {
color: {"red", "yellow"}, # This is a set, instead of a list
letter: {"A", "F", "P"},
digit: {1, 3, 5},
...
}
As, you can see, I have mentioned the complexities along with each line, thus we have reduced the problem to O(nk) in average case.

how to enable string and numeric comparison temporarily?

Given this simplified example to sort:
l = [10, '0foo', 2.5, 'foo', 'bar']
I want to sort l so that numeric is always before strings. In this case, I'd like to get [2.5, 10, '0foo', 'foo', 'bar']. Is it possible make numeric and string temporarily comparable (with strings always larger than numeric)?
Note it is not easy to provide a key function to sorted if you are thinking about it. For example, converting numeric to string won't work because "10" < "2.5".
A way that you might do this does involve passing a key to sorted. it looks like this:
sorted(l, key=lambda x:(isinstance(x str), x))
This works because the key returns a tuple with the type of x and its value. Because of the way tuples are compared. Items at index 0 are compared first and if it they are the same, then next two items are compared and so on if they also are the same. This allows for the values to be sorted by type (string or not a string), then value if they are a similar type.
A more robust solution that also can handle further types might use a dictionary in the key function like this:
sorted(l,key=lambda x:({int:0, float:0, str:1, list:2, set:3}[type(x)], x))
further types can be added as necessary.

What is a good algorithm for searching a string for multiple substrings?

To be more exact, I'm looking for an algorithm that will take two collections of strings, and return all elements from the first collection which contain all elements from the second collection.
e.g. If I have ["cat", "dog", "boy"], it would return, say, "The boy wants a cat and a dog", but not, for example, "The dog is a good boy."
I found the Aho-Corasick algorithm but that seems to be better for an "at least one match" instead of an "every match" solution.
I am not sure about the specifics of the questions but assuming you have
collection1 : ['this is a good boy',' this is a bad boy',....]
and
collection2: ['this', 'is a', 'good', 'boy']
it should only return 'this is a good boy'.
Again, I am not sure about the memory and speed requirements of such an algorithm but I would create a suffix tree to validate element existence :
psuedo code
for elm1 in collection1:
sTree = suffix_tree(elm)
valid = false
for elm2 in collection2:
valid = valid & search_in_suffix(elm2)
if valid:
return elm1
return 'NOT_FOUND'
you can read more about the suffix tree here. Please keep in mind that it also depends on your dataset, if you have really large strings, suffix tree might be quick but creating it would cost you a lot of memory.
The Aho-Corasick automaton can work here if we slightly modify it.
Let's build the Aho-Corasick automaton for the second collection of the strings.
If one string in the automaton is a prefix of the other, we can remove it. It won't change the answer.
Let's traverse the automaton using only initial edges (it's a tree this way) and precompute a node that is an ancestor of the given one and corresponds to an end of some string from the collection for each node (I'll denote it as anc_end. It's unique as no string is a prefix of any other as shown above). We can do it in linear time using depth first search (the parameters are the current node and the last node that corresponds to the end of some string we've seen on the path from the root to this node (or -1 if there's no such node)).
We can traverse the automaton as we would normally do for each string of the first collection. We need to set was[anc_end[v]] = true at each step where v is the current node. We just need to check that was is true for all nodes corresponding to the end of some string in the second collection when we're done.
We're almost there. Instead of initializing the was array for every string with the new collection, we'll use "array with version" structure (the idea is to keep a pair (value, timestamp) instead of just the value and increment the timer when we go to the next string)) and count the number of seen "end" nodes (we need to reset the counter to zero for each string from the first collection).
This solution is linear in the size of the input, so it's optimal in terms of time complexity (it also uses a linear amount of space).

Data Structure for Subsequence Queries

In a program I need to efficiently answer queries of the following form:
Given a set of strings A and a query string q return all s ∈ A such that q is a subsequence of s
For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd" exactly "abcdef" should be returned.
The following is what I have considered considered so far:
For each possible character, make a sorted list of all string/locations where it appears. For querying interleave the lists of the involved characters, and scan through it looking for matches within string boundaries.
This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.
For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.
This might speed things up a bit, but one could easily imagine some n-subsequences being present close to all strings in A, which means worst case is the same as just brute forcing the entire set.
Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large As? (My ss will be around 100 characters)
Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to remind that this can be done using a simple function such as:
def isSub(q,s):
i, j = 0, 0
while i != len(q) and j != len(s):
if q[i] == s[j]:
i += 1
j += 1
else:
j += 1
return i == len(q)
Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
Update 3: It has occurred to me, that a data-structure with O(n^{1-epsilon}) lookups, would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.
It could done by building an automaton. You can start with NFA (nondeterministic finite automaton which is like an indeterministic directed graph) which allows edges labeled with an epsilon character, which means that during processing you can jump from one node to another without consuming any character. I'll try to reduce your A. Let's say you A is:
A = {'ab, 'bc'}
If you build NFA for ab string you should get something like this:
+--(1)--+
e | a| |e
(S)--+--(2)--+--(F)
| b| |
+--(3)--+
Above drawing is not the best looking automaton. But there are a few points to consider:
S state is the starting state and F is the ending state.
If you are at F state it means your string qualifies as a subsequence.
The rule of propagating within an autmaton is that you can consume e (epsilon) to jump forward, therefore you can be at more then one state at each point in time. This is called e closure.
Now if given b, starting at state S I can jump one epsilon, reach 2, and consume b and reach 3. Now given end string I consume epsilon and reach F, thus b qualifies as a sub-sequence of ab. So does a or ab you can try yourself using above automata.
The good thing about NFA is that they have one start state and one final state. Two NFA could be easily connected using epsilons. There are various algorithms that could help you to convert NFA to DFA. DFA is a directed graph which can follow precise path given a character -- in particular, it is always in exactly one state at any point in time. (For any NFA, there is a corresponding DFA whose states correspond to sets of states in the NFA.)
So, for A = {'ab, 'bc'}, we would need to build NFA for ab then NFA for bc then join the two NFAs and build the DFA of the entire big NFA.
EDIT
NFA of subsequence of abc would be a?b?c?, so you can build your NFA as:
Now, consider the input acd. To query if ab is subsequence of {'abc', 'acd'}, you can use this NFA: (a?b?c?)|(a?c?d). Once you have NFA you can convert it to DFA where each state will contain whether it is a subsequence of abc or acd or maybe both.
I used link below to make NFA graphic from regular expression:
http://hackingoff.com/images/re2nfa/2013-08-04_21-56-03_-0700-nfa.svg
EDIT 2
You're right! In case if you've 10,000 unique characters in the A. By unique I mean A is something like this: {'abc', 'def'} i.e. intersection of each element of A is empty set. Then your DFA would be worst case in terms of states i.e. 2^10000. But I'm not sure when would that be possible given that there can never be 10,000 unique characters. Even if you have 10,000 characters in A still there will be repetitions and that might reduce states alot since e-closure might eventually merge. I cannot really estimate how much it might reduce. But even having 10 million states, you will only consume less then 10 mb worth of space to construct a DFA. You can even use NFA and find e-closures at run-time but that would add to run-time complexity. You can search different papers on how large regex are converted to DFAs.
EDIT 3
For regex (a?b?c?)|(e?d?a?)|(a?b?m?)
If you convert above NFA to DFA you get:
It actually lot less states then NFA.
Reference:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
EDIT 4
After fiddling with that website more. I found that worst case would be something like this A = {'aaaa', 'bbbbb', 'cccc' ....}. But even in this case states are lesser than NFA states.
Tests
There have been four main proposals in this thread:
Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried slightly in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).
J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.
In the note "Efficient Subsequence Search in Databases" Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively search the tree for the query. They also have a suggestion similar to the triplet idea.
Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of magnitudes faster with a well done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.
Test 1
A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Test 2
A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.
Construction times in seconds as a function of A size:
Query times in seconds as a function of A size:
Conclusions
The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:
Automatons are very fast at querying (constant time), however they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable for the remaining methods.
The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear amount of lists for the merge, instead of n^3 as suggested by j_random_hacker. It might be possible to tune the method better, but in general it was disappointing.
My trie implementation consistently does better than the naive approach by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree") or perhaps merging it with the triplet method, this seems like todays winner.
If you can do with a magnitude less performance, the naive method does comparatively just fine for very little cost.
As you point out, it might be that all strings in A contain q as a subsequence, in which case you can't hope to do better than O(|A|). (That said, you might still be able to do better than the time taken to run LCS on (q, A[i]) for each string i in A, but I won't focus on that here.)
TTBOMK there are no magic, fast ways to answer this question (in the way that suffix trees are the magic, fast way to answer the corresponding question involving substrings instead of subsequences). Nevertheless if you expect the set of answers for most queries to be small on average then it's worth looking at ways to speed up these queries (the ones yielding small-size answers).
I suggest filtering based on a generalisation of your heuristic (2): if some database sequence A[i] contains q as a subsequence, then it must also contain every subsequence of q. (The reverse direction is not true unfortunately!) So for some small k, e.g. 3 as you suggest, you can preprocess by building an array of lists telling you, for every length-k string s, the list of database sequences containing s as a subsequence. I.e. c[s] will contain a list of the ID numbers of database sequences containing s as a subsequence. Keep each list in numeric order to enable fast intersections later.
Now the basic idea (which we'll improve in a moment) for each query q is: Find all k-sized subsequences of q, look up each in the array of lists c[], and intersect these lists to find the set of sequences in A that might possibly contain q as a subsequence. Then for each possible sequence A[i] in this (hopefully small) intersection, perform an O(n^2) LCS calculation with q to see whether it really does contain q.
A few observations:
The intersection of 2 sorted lists of size m and n can be found in O(m+n) time. To find the intersection of r lists, perform r-1 pairwise intersections in any order. Since taking intersections can only produce sets that are smaller or of the same size, time can be saved by intersecting the smallest pair of lists first, then the next smallest pair (this will necessarily include the result of the first operation), and so on. In particular: sort lists in increasing size order, then always intersect the next list with the "current" intersection.
It is actually faster to find the intersection a different way, by adding the first element (sequence number) of each of the r lists into a heap data structure, then repeatedly pulling out the minimum value and replenishing the heap with the next value from the list that the most recent minimum came from. This will produce a list of sequence numbers in nondecreasing order; any value that appears fewer than r times in a row can be discarded, since it cannot be a member of all r sets.
If a k-string s has only a few sequences in c[s], then it is in some sense discriminating. For most datasets, not all k-strings will be equally discriminating, and this can be used to our advantage. After preprocessing, consider throwing away all lists having more than some fixed number (or some fixed fraction of the total) of sequences, for 3 reasons:
They take a lot of space to store
They take a lot of time to intersect during query processing
Intersecting them will usually not shrink the overall intersection much
It is not necessary to consider every k-subsequence of q. Although this will produce the smallest intersection, it involves merging (|q| choose k) lists, and it might well be possible to produce an intersection that is nearly as small using just a fraction of these k-subsequences. E.g. you could limit yourself to trying all (or a few) k-substrings of q. As a further filter, consider just those k-subsequences whose sequence lists in c[s] are below some value. (Note: if your threshold is the same for every query, you might as well delete all such lists from the database instead, since this will have the same effect, and saves space.)
One thought;
if q tends to be short, maybe reducing A and q to a set will help?
So for the example, derive to { (a,b,c,d,e,f), (a), (a,c,d) }. Looking up possible candidates for any q should be faster than the original problem (that's a guess actually, not sure how exactly. maybe sort them and "group" similar ones in bloom filters?), then use bruteforce to weed out false positives.
If A strings are lengthy, you could make the characters unique based on their occurence, so that would be {(a1,b1,c1,d1,e1,f1),(a1,a2,a3,a4,a5,a6),(a1,c1,d1,d2)}. This is fine, because if you search for "ddca" you only want to match the second d to a second d. The size of your alphabet would go up (bad for bloom or bitmap style operations) and would be different ever time you get new A's, but the amount of false positives would go down.
First let me make sure my understanding/abstraction is correct. The following two requirements should be met:
if A is a subsequence of B, then all characters in A should appear in B.
for those characters in B, their positions should be in an ascending order.
Note that, a char in A might appear more than once in B.
To solve 1), a map/set can be used. The key is the character in string B, and the value doesn't matter.
To solve 2), we need to maintain the position of each characters. Since a character might appear more than once, the position should be a collection.
So the structure is like:
Map<Character, List<Integer>)
e.g.
abcdefab
a: [0, 6]
b: [1, 7]
c: [2]
d: [3]
e: [4]
f: [5]
Once we have the structure, how to know if the characters are in the right order as they are in string A? If B is acd, we should check the a at position 0 (but not 6), c at position 2 and d at position 3.
The strategy here is to choose the position that's after and close to the previous chosen position. TreeSet is a good candidate for this operation.
public E higher(E e)
Returns the least element in this set strictly greater than the given element, or null if there is no such element.
The runtime complexity is O(s * (n1 + n2)*log(m))).
s: number of strings in the set
n1: number of chars in string (B)
n2: number of chars in query string (A)
m: number of duplicates in string (B), e.g. there are 5 a.
Below is the implementation with some test data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
public class SubsequenceStr {
public static void main(String[] args) {
String[] testSet = new String[] {
"abcdefgh", //right one
"adcefgh", //has all chars, but not the right order
"bcdefh", //missing one char
"", //empty
"acdh",//exact match
"acd",
"acdehacdeh"
};
List<String> subseqenceStrs = subsequenceStrs(testSet, "acdh");
for (String str : subseqenceStrs) {
System.out.println(str);
}
//duplicates in query
subseqenceStrs = subsequenceStrs(testSet, "aa");
for (String str : subseqenceStrs) {
System.out.println(str);
}
subseqenceStrs = subsequenceStrs(testSet, "aaa");
for (String str : subseqenceStrs) {
System.out.println(str);
}
}
public static List<String> subsequenceStrs(String[] strSet, String q) {
System.out.println("find strings whose subsequence string is " + q);
List<String> results = new ArrayList<String>();
for (String str : strSet) {
char[] chars = str.toCharArray();
Map<Character, TreeSet<Integer>> charPositions = new HashMap<Character, TreeSet<Integer>>();
for (int i = 0; i < chars.length; i++) {
TreeSet<Integer> positions = charPositions.get(chars[i]);
if (positions == null) {
positions = new TreeSet<Integer>();
charPositions.put(chars[i], positions);
}
positions.add(i);
}
char[] qChars = q.toCharArray();
int lowestPosition = -1;
boolean isSubsequence = false;
for (int i = 0; i < qChars.length; i++) {
TreeSet<Integer> positions = charPositions.get(qChars[i]);
if (positions == null || positions.size() == 0) {
break;
} else {
Integer position = positions.higher(lowestPosition);
if (position == null) {
break;
} else {
lowestPosition = position;
if (i == qChars.length - 1) {
isSubsequence = true;
}
}
}
}
if (isSubsequence) {
results.add(str);
}
}
return results;
}
}
Output:
find strings whose subsequence string is acdh
abcdefgh
acdh
acdehacdeh
find strings whose subsequence string is aa
acdehacdeh
find strings whose subsequence string is aaa
As always, I might be totally wrong :)
You might want to have a look into the Book Algorithms on Strings and Sequences by Dan Gusfield. As it turns out part of it is available on the internet. You might also want to read Gusfield's Introduction to Suffix Trees. As it turns out this book covers many approaches for you kind of question. It is considered one of the standard publications in this field.
Get a fast longest common subsequence algorithm implementation. Actually it suffices to determine the length of the LCS. Notice that Gusman's book has very good algorithms and also points to more sources for such algorithms.
Return all s ∈ A with length(LCS(s,q)) == length(q)

Efficient mass string search problem

The Problem: A large static list of strings is provided. A pattern string comprised of data and wildcard elements (* and ?). The idea is to return all the strings that match the pattern - simple enough.
Current Solution: I'm currently using a linear approach of scanning the large list and globbing each entry against the pattern.
My Question: Are there any suitable data structures that I can store the large list into such that the search's complexity is less than O(n)?
Perhaps something akin to a suffix-trie? I've also considered using bi- and tri-grams in a hashtable, but the logic required in evaluating a match based on a merge of the list of words returned and the pattern is a nightmare, furthermore I'm not convinced its the correct approach.
I agree that a suffix trie is a good idea to try, except that the sheer size of your dataset might make it's construction use up just as much time as its usage would save. Theyre best if youve got to query them multiple times to amortize the construction cost. Perhaps a few hundred queries.
Also note that this is a good excuse for parallelism. Cut the list in two and give it to two different processors and have your job done twice as fast.
you could build a regular trie and add wildcard edges. then your complexity would be O(n) where n is the length of the pattern. You would have to replace runs of ** with * in the pattern first (also an O(n) operation).
If the list of words were I am an ox then the trie would look a bit like this:
(I ($ [I])
a (m ($ [am])
n ($ [an])
? ($ [am an])
* ($ [am an]))
o (x ($ [ox])
? ($ [ox])
* ($ [ox]))
? ($ [I]
m ($ [am])
n ($ [an])
x ($ [ox])
? ($ [am an ox])
* ($ [I am an ox]
m ($ [am]) ...)
* ($ [I am an ox]
I ...
...
And here is a sample python program:
import sys
def addWord(root, word):
add(root, word, word, '')
def add(root, word, tail, prev):
if tail == '':
addLeaf(root, word)
else:
head = tail[0]
tail2 = tail[1:]
add(addEdge(root, head), word, tail2, head)
add(addEdge(root, '?'), word, tail2, head)
if prev != '*':
for l in range(len(tail)+1):
add(addEdge(root, '*'), word, tail[l:], '*')
def addEdge(root, char):
if not root.has_key(char):
root[char] = {}
return root[char]
def addLeaf(root, word):
if not root.has_key('$'):
root['$'] = []
leaf = root['$']
if word not in leaf:
leaf.append(word)
def findWord(root, pattern):
prev = ''
for p in pattern:
if p == '*' and prev == '*':
continue
prev = p
if not root.has_key(p):
return []
root = root[p]
if not root.has_key('$'):
return []
return root['$']
def run():
print("Enter words, one per line terminate with a . on a line")
root = {}
while 1:
line = sys.stdin.readline()[:-1]
if line == '.': break
addWord(root, line)
print(repr(root))
print("Now enter search patterns. Do not use multiple sequential '*'s")
while 1:
line = sys.stdin.readline()[:-1]
if line == '.': break
print(findWord(root, line))
run()
If you don't care about memory and you can afford to pre-process the list, create a sorted array of every suffix, pointing to the original word, e.g., for ['hello', 'world'], store this:
[('d' , 'world'),
('ello' , 'hello'),
('hello', 'hello'),
('ld' , 'world'),
('llo' , 'hello'),
('lo' , 'hello'),
('o' , 'hello'),
('orld' , 'world'),
('rld' , 'world'),
('world', 'world')]
Use this array to build sets of candidate matches using pieces of the pattern.
For instance, if the pattern is *or*, find the candidate match ('orld' , 'world') using a binary chop on the substring or, then confirm the match using a normal globbing approach.
If the wildcard is more complex, e.g., h*o, built sets of candidates for h and o and find their intersection before the final linear glob.
You say you're currently doing linear search. Does this give you any data on the most frequently performed query patterns? e.g. is blah* much more common than bl?h (which i'd assume it was) among your current users?
With that kind of prior knowledge you can focus your indexing efforts on the commonly used cases and get them down to O(1), rather than trying to solve the much more difficult, and yet much less worthwhile, problem of making every possible query equally fast.
You can achieve a simple speedup by keeping counts of the characters in your strings. A string with no bs or a single b can never match the query abba*, so there is no point in testing it. This works much better on whole words, if your strings are made of those, since there are many more words than characters; plus, there are plenty of libraries that can build the indexes for you. On the other hand, it is very similar to the n-gram approach you mentioned.
If you do not use a library that does it for you, you can optimize queries by looking up the most globally infrequent characters (or words, or n-grams) first in your indexes. This allows you to discard more non-matching strings up front.
In general, all speedups will be based on the idea of discarding things that cannot possibly match. What and how much to index depends on your data. For example, if the typical pattern length is near to the string length, you can simply check to see if the string is long enough to hold the pattern.
There are plenty of good algorithms for multi-string search. Google "Navarro string search" and you'll see a good analysis of multi-string options. A number of algorithsm are extremely good for "normal" cases (search strings that are fairly long: Wu-Manber; search strings with characters that are modestly rare in the text to be searched: parallel Horspool). Aho-Corasick is an algorithm that guarantees a (tiny) bounded amount of work per input character, no matter how the input text is tuned to create worst behaviour in the search. For programs like Snort, that's really important, in the face of denial-of-service attacks. If you are interested in how a really efficient Aho-Corasick search can be implemented, take a look at ACISM - an Aho-Corasick Interleaved State Matrix.

Resources