String search using suffix trees - string

A suffix tree can be used to efficiently search for a word in a set of words. Are suffix trees still the best method if:
1. the set of words is made from an infinite set of characters
2. the set of words is ordered alphabetically (or in a way that makes sense)?

A suffix tree is overkill if you just want to search for a word in a set of words (and you do not need to search for their substrings). A trie is a better choice (the time complexity is the same, but it is much simpler). If the words are ordered, you can use binary search to find the word (yes, it does have an additional log n factor, but that is not so bad). Even if they are not ordered, you can sort them once before searching. This approach is good because it does not require any custom data structures, and it usually has a smaller constant factor and lower memory usage (the space complexity is the same, but the constant is better).
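As a rough illustration of the binary search option, here is a minimal Python sketch. The function name `contains` and the sample words are made up for this example; it only assumes the words can be compared with `<`, so it does not depend on the size of the alphabet.

```python
from bisect import bisect_left

def contains(sorted_words, word):
    """Return True if word occurs in sorted_words (a list sorted ascending)."""
    i = bisect_left(sorted_words, word)  # leftmost position where word could go
    return i < len(sorted_words) and sorted_words[i] == word

# Sort once, then answer any number of membership queries.
words = sorted(["dentist", "elephant", "banana", "trie"])
```

Each query does O(log n) string comparisons, and each comparison costs at most the length of the word.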

Related

Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, or the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (nmslib) provides an efficient implementation of HNSW as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
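To make the metric-tree idea concrete, here is a minimal BK-tree sketch in Python. It is an illustration under assumptions, not a tuned implementation: the class name `BKTree`, the tuple-based node layout, and the sample words are all invented for this example, and the distance is the standard dynamic-programming Levenshtein distance.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None  # a node is (word, {distance_to_child: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs within max_dist of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: a match can only sit under an edge k
            # with d - max_dist <= k <= d + max_dist.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree()
for w in ["elephant", "dentist", "element", "dentistry"]:
    tree.add(w)
```

With this sample data, `tree.search("elepant", 1)` returns only `elephant`. How many subtrees get pruned depends heavily on the data and on `max_dist`, so O(log N) behaviour is not guaranteed.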
I made a similar algorithm some time ago.
The idea is to have an array indexed by character (char[255]),
where each value is a list of word hashes (word ids) of the words that contain this character.
When you are searching for 'dele....':
search(d) will return an empty list
search(e) will find everything with the character e, including elephant (twice, as it has two 'e's)
search(l) will give you a new list, which you need to combine with the results from the previous step
...
at the end of the input you will have a list,
which you can then group by wordHash and order by count, descending.
Another interesting thing: if a character of your input matches nothing, you will just get an empty list in the middle of the search, and it will not break this idea.
My initial algorithm was without ordering, and for every character I was storing the wordId, line number and char position.
My main problem was that I wanted to search
with ee to find 'elephant'
with eleant to find 'elephant'
with antph to find 'elephant'
Every word was actually a line from a file, so it was often very long,
and the number of files and lines was big.
I wanted a quick search over directories with more than 1 GB of text files,
so even storing them in memory was a problem. For this idea you need 3 parts:
a function to fill your cache
a function to find by char from the input
a function to filter and maybe order the results (I didn't use ordering, as I was filling my cache in the same order as I read the file, and I wanted the lines that contain the input to appear in that same order)
I hope it makes sense.
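The three parts described above could be sketched in Python roughly as follows. This is one reading of the idea, not the original code: the names `build_index` and `search` are hypothetical, a dictionary stands in for the char[255] array, and capping each character's contribution with `min` is an assumption added here so that a word with many repeated letters does not dominate the ranking.

```python
from collections import Counter, defaultdict

def build_index(words):
    """Fill the cache: map each character to the ids of the words
    containing it, with one entry per occurrence."""
    index = defaultdict(list)
    for word_id, word in enumerate(words):
        for ch in word:
            index[ch].append(word_id)
    return index

def search(index, words, query, top=3):
    """Look up every query character, then group by word id and
    order by the number of shared characters, descending."""
    hits = Counter()
    for ch, needed in Counter(query).items():
        per_word = Counter(index.get(ch, []))  # occurrences of ch per word id
        for word_id, have in per_word.items():
            hits[word_id] += min(needed, have)
    return [words[i] for i, _ in hits.most_common(top)]

words = ["elephant", "dentist", "ant"]
index = build_index(words)
```

Because the score ignores character order, queries such as `ee`, `eleant` and `antph` all rank `elephant` first for this word list. A character that matches nothing simply contributes no hits.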

What is the advantage of Suffix tree over suffix array?

I have been studying tries, suffix arrays and suffix trees. I know these data structures can be used for fast lookup and for many other applications.
Now my question is:
If a suffix array is more space efficient and easier to implement, then in what scenarios should a suffix tree be preferred over a suffix array?
Can you please list the advantages of each over the other?
Thanks in advance.
Here is the abstract from "Suffix arrays: A new method for on-line string searches", written by Udi Manber and Gene Myers.
link to the article.
It provides a list of advantages of the suffix array in comparison to the suffix tree structure in the general case:
A new and conceptually simple data structure, called a suffix array,
for on-line string searches is introduced in this paper. Constructing
and querying suffix arrays is reduced to a sort and search paradigm
that employs novel algorithms. The main advantage of suffix arrays
over suffix trees is that, in practice, they use three to five times
less space. From a complexity standpoint, suffix arrays permit on-line
string searches of the type, ‘‘Is W a substring of A?’’ to be answered
in time O(P + log N), where P is the length of W and N is the length
of A, which is competitive with (and in some cases slightly better
than) suffix trees. The only drawback is that in those instances where
the underlying alphabet is finite and small, suffix trees can be
constructed in O(N) time in the worst case, versus O(N log N) time for
suffix arrays. However, we give an augmented algorithm that,
regardless of the alphabet size, constructs suffix arrays in O(N)
expected time, albeit with lesser space efficiency. We believe that
suffix arrays will prove to be better in practice than suffix trees
for many applications
To make it brief, let's say that the suffix array has significantly lower space usage and better memory locality than the suffix tree; the trade-off is that the suffix tree can be built faster in terms of time complexity (O(n) versus O(n log n)). Both give the suffixes of a string online (you can receive the string one char at a time, you don't need the whole string to run the algorithm).
Another advantage of the suffix array is its adaptability: for a substring search, for instance, the structure allows for easier use of the data. It is also easier to implement.
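To show how simple the "sort and search" idea is, here is a small Python sketch. It is deliberately naive and the function names are invented for this example: the construction just sorts the suffixes, which is far slower than the O(N log N) algorithm in the paper, and the search is a plain binary search costing O(P log N) rather than the O(P + log N) that the paper reaches with extra longest-common-prefix information.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by suffix text.
    Worst case O(n^2 log n); meant only to show the structure."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search the sorted suffixes for those starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])  # start positions of every occurrence

text = "banana"
sa = build_suffix_array(text)
```

For `banana` the array is `[5, 3, 1, 0, 4, 2]`, and `find_occurrences(text, sa, "ana")` returns `[1, 3]`. The whole index is one array of integers, which is where the space advantage over a pointer-based tree comes from.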

Time complexity for inserting all suffixes of a string into a ternary search tree?

I have a ternary search tree that contains all the suffixes of a word. What is the time complexity of constructing this structure and of searching for a word in it?
Example:
the word banana$ has the suffixes banana$, anana$, nana$, ana$, na$, a$, $
and in lexicographical order: $, a$, ana$, anana$, banana$, na$, nana$.
Inserting all the suffixes into the ternary search tree in balanced order gives:
anana$, a$, $, ana$, na$, banana$, nana$.
Generally speaking, the time required to insert something into a TST is O(L log |Σ|), where L is the length of the string and Σ is the set of allowed characters in your string. The reason for this is that adding each individual character takes time O(log |Σ|) because you're adding each character into a BST of at most |Σ| elements. For the example you're describing, you're adding in strings of length 1, 2, 3, ..., n, so the runtime is O(n^2 log |Σ|).
That said, I think you can speed this up by going through a more indirect route. A ternary search tree can be thought of as a trie where the child pointers of each node are stored in a binary search tree. If you just want a trie of all the suffixes, you might want to look at suffix trees, which are specifically designed to represent that information. They can be built in time O(n) for a length-n string.
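For reference, here is a minimal ternary search tree sketch in Python that inserts the suffixes of banana$ in the order given in the question. The names `Node`, `insert` and `contains` are invented for this example, and the per-node BSTs are not rebalanced, so the O(log |Σ|) bound per character only holds if the insertion order keeps them balanced.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None  # smaller / next char / larger
        self.end = False                    # True if a stored word ends here

def insert(root, word):
    """Insert a non-empty word; returns the (possibly new) root."""
    if root is None:
        root = Node(word[0])
    node, i = root, 0
    while True:
        ch = word[i]
        if ch < node.ch:
            if node.lo is None:
                node.lo = Node(ch)
            node = node.lo
        elif ch > node.ch:
            if node.hi is None:
                node.hi = Node(ch)
            node = node.hi
        else:
            i += 1                      # character matched, move to the next one
            if i == len(word):
                node.end = True
                return root
            if node.eq is None:
                node.eq = Node(word[i])
            node = node.eq

def contains(root, word):
    node, i = root, 0
    while node is not None and word:
        ch = word[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        else:
            i += 1
            if i == len(word):
                return node.end
            node = node.eq
    return False

root = None
for suffix in ["anana$", "a$", "$", "ana$", "na$", "banana$", "nana$"]:
    root = insert(root, suffix)
```

The loop feeds 7 + 6 + ... + 1 = 28 characters into the tree for a 7-character word, which is the n(n+1)/2 term behind the quadratic bound.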

Tree searching algorithm

I'm looking for suggestions on strategies for searching a tree-like data structure.
The structure is a tree where each element is a string, each branch is a period, and a path is the concatenation of several strings and periods starting at the root. The root and edges from the root are a special case where there is no string behind them.
So given the tree,
    {root}
    /    \
   A      X
  / \    /
 B   C  Y
Valid paths are the strings "A", "A.B", "A.C", "X", and "X.Y".
What we have is a set of strings that we need to search for in this tree and find the element that terminates each string. Not all strings in the set appear in the tree. We stop searching when we find all strings. We need to run this search several times but the trees may differ each time. The set of strings to search is the same each run though.
Currently we're using depth-first search, but this isn't very efficient if all strings fall under say the last branch under the root. I feel like there should be a better way of doing this.
What would be a good algorithm for doing this repeated search? Would it be possible to leverage multithreading here as well?
It's an interesting problem; usually one would imagine a single tree being searched for a variable set of strings. Here the situation is reversed: the set of strings is fixed and the tree is highly variable.
I think that the best you can do is build a trie representing the set of strings. That way, you only have to search a tree once for any given prefix. (So, for the example strings you mentioned, you would only need to find the "A" prefix once and the "X" prefix once.) There are lots of trie data structures and algorithms for building them from a set of strings, but since that's a one-time operation for this problem, I wouldn't worry too much about the cost of this preprocessing.
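A possible shape for this, sketched in Python under some assumptions: each tree is modelled as a nested dict mapping an element's string to its subtree, so a child can be found by name in constant time, and the names `build_query_trie`, `find_all` and `END` are made up for this example. If the real tree stores children in a list, the lookup step would need a scan or a per-node index instead.

```python
END = object()  # sentinel key marking that a full query path ends at this trie node

def build_query_trie(paths):
    """Build a trie over the '.'-separated components of the query paths."""
    trie = {}
    for path in paths:
        node = trie
        for part in path.split("."):
            node = node.setdefault(part, {})
        node[END] = path
    return trie

def find_all(tree, trie, found=None):
    """Walk the tree and the trie together, descending only into
    branches that some query path actually needs."""
    if found is None:
        found = {}
    for label, sub_trie in trie.items():
        if label is END:
            continue
        subtree = tree.get(label)
        if subtree is None:
            continue  # no query with this prefix can match in this tree
        if END in sub_trie:
            found[sub_trie[END]] = subtree
        find_all(subtree, sub_trie, found)
    return found

# Built once, then reused for every tree that has to be searched.
queries = build_query_trie(["A", "A.B", "X.Y", "A.D"])
tree = {"A": {"B": {}, "C": {}}, "X": {"Y": {}}}
found = find_all(tree, queries)
```

Here `found` ends up with entries for `A`, `A.B` and `X.Y`, while `A.D` is skipped without visiting anything outside the `A` branch. Since the subtrees under different root edges are independent, they could also be handed to separate worker threads.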

Meta-information in DAWG/DAFSA

I would like to implement a string look-up data structure, for dynamic strings, that will support efficient search and insertion. Currently, I am using a trie but I would like to reduce the memory footprint if possible. This Wikipedia article describes a DAWG/DAFSA, which will obviously save a lot of space over a trie by compressing suffixes. However, while it will clearly test whether a string is legal, it is not obvious to me if there is any way to exclude illegal strings. For example, using the words "cite" and "cat" where the "t" and "e" are terminal states, a DAWG/DAFSA would look like this:
  c
 / \
a   i
 \ /
  t
  |
  e
and "cit" and "cate" will be incorrectly recognized as legal strings without some meta-information.
Questions:
1) Is there a preferred way to store meta-information about strings/paths (such as legality) in a DAWG/DAFSA?
2) If a DAWG/DAFSA is incompatible with the requirements (efficient search/insertion and storing meta-information) what's the best data structure to use? A minimal memory footprint would be nice, but perhaps not absolutely necessary.
In a DAWG, you only compress states together if they're completely indistinguishable from one another. This means that you actually wouldn't combine the T nodes for CAT and CITE together for precisely the reason you've noted - that gives you either a false positive on CIT or a false negative on CAT.
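The merging rule can be illustrated with a small Python sketch that builds a trie and then merges only nodes that are indistinguishable, meaning they have the same terminal flag and the same outgoing edges to the same merged children. This is a simple offline minimization written for illustration, with invented names (`build_trie`, `minimize`, `accepts`); it is not an incremental DAWG construction algorithm.

```python
def build_trie(words):
    root = {"end": False, "next": {}}
    for w in words:
        node = root
        for ch in w:
            node = node["next"].setdefault(ch, {"end": False, "next": {}})
        node["end"] = True  # legality is stored on the node itself
    return root

def minimize(node, registry):
    """Bottom-up: replace each node with the canonical node that has the
    same signature (terminal flag plus edges to canonical children)."""
    for ch, child in list(node["next"].items()):
        node["next"][ch] = minimize(child, registry)
    signature = (node["end"],
                 tuple(sorted((ch, id(child))
                              for ch, child in node["next"].items())))
    return registry.setdefault(signature, node)

def accepts(root, word):
    node = root
    for ch in word:
        node = node["next"].get(ch)
        if node is None:
            return False
    return node["end"]

registry = {}
dawg = minimize(build_trie(["cat", "cite"]), registry)
```

For "cat" and "cite" the 7 trie nodes shrink to 6 states: the final "t" of CAT merges with the final "e" of CITE (both are terminal with no outgoing edges), but the two T nodes stay separate because one is terminal and the other is not. As a result `accepts(dawg, "cit")` and `accepts(dawg, "cate")` are both false.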
DAWGs are typically most effective for static dictionaries when you have a huge number of words with common suffixes. A DAWG for all of English, for example, could save a lot of space by combining all the suffix "s"'s at the end of plural words and most of the "ING" suffixes from gerunds. If you're going to be doing a lot of insertions or deletions, DAWGs are almost certainly the wrong data structure for the job because adding or removing a single word from a DAWG can cause ripple effects that require lots of branches that were previously combined to be split or vice-versa.
Quite honestly, for reasonably-sized data sets, a trie isn't a bad call. A trie for all of English would only use up something like 26MB, which isn't very much. I would only go with the DAWG if space usage really is at a premium and you aren't doing many insertions or deletions.
Hope this helps!

Resources