Efficiently searching a large list of strings

I have a large list of strings that needs to be searched by the user of an iPhone/Android app. The strings are sorted alphabetically, but that isn't actually all that useful, since a string should be included in the results if the search query falls anywhere inside the string, not just at the beginning. As the user types their search query, the results should update to reflect what they have entered so far (e.g. if they type "cat", it should display the results for "c", "ca", and "cat" as they type it).
My current approach is the following:
I keep a stack of "search results", which starts out empty. When the user types another character, making the query longer, I push the current search results onto the stack and then search only within those current results for the new matches (anything that matches the longer query must also match the shorter one, so nothing in the full list can match without already being in the current results).
If the user hits backspace, I only need to pop the previous search results off the stack and restore them. This can be done nearly instantaneously.
This approach works great for "backwards" searching (making the search query shorter) and for cases where the query is already long enough that the number of results is low. However, it still has to search through the full list of strings in O(n) time for each of the first few letters the user types, which is quite slow.
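For concreteness, here is a minimal sketch of that stack-based incremental filter (the class and method names are illustrative, not from the original post, and it assumes a simple case-sensitive substring test):

class IncrementalSearch:
    def __init__(self, strings):
        self.strings = strings   # full list; only scanned for the first character
        self.stack = []          # saved result sets, one per typed character
        self.current = strings

    def extend_query(self, query):
        """Call when the query grows by one character; query is the full new query."""
        self.stack.append(self.current)
        # Anything matching the longer query already matches the shorter one,
        # so only the current results need to be scanned.
        self.current = [s for s in self.current if query in s]
        return self.current

    def backspace(self):
        """Call when the query shrinks by one character."""
        self.current = self.stack.pop() if self.stack else self.strings
        return self.current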
One approach I've considered is to have a pre-compiled list of results of all possible search queries of 2 or 3 letters. The problem with this approach is that it would require 26^2 or 26^3 such lists, and would take up a pretty large amount of space.
Any other optimizations or alternate approaches you can think of?

You should consider using a prefix tree (trie) to build a precomputed index. I'm not sure showing results for 'c', 'ca', and 'cat' on a character-by-character basis is a good idea. For example, say the user is searching for the word 'eat'. Your algorithm would have to find all the words that contain 'e', then 'ea', and finally 'eat', most of which will be of no use to the user. For a phone app, it would probably be better to do it on a word basis. A multi-word string can be tokenized, so searching for 'stake' in 'a large stake' will work fine, but searching for 'take' will not.
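A minimal sketch of such a word-level trie, assuming case-insensitive matching and whitespace tokenization (names are illustrative):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.string_ids = set()   # ids of strings containing a word with this prefix

def build_trie(strings):
    root = TrieNode()
    for i, s in enumerate(strings):
        for word in s.lower().split():   # tokenize: 'stake' in 'a large stake' is findable
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
                node.string_ids.add(i)
    return root

def search(root, query):
    node = root
    for ch in query.lower():
        if ch not in node.children:
            return set()
        node = node.children[ch]
    return node.string_ids   # every string containing a word starting with query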

I notice that Google, and I imagine others, do not provide a full list when only 1 or 2 characters have been pressed. In your case, perhaps a good starting point is to begin populating the search results only once the user has typed a minimum of 3 characters.
For later versions, if it's important, you can take a cue from the way Google does it, and do more sophisticated processing: keep track of the actual entries that previous users have selected, and order these by frequency. Then, run a cron job every day on your server to populate a small database table with the top 10 entries starting with each letter, and if only 1 or 2 letters have been pressed, use the results from this small table instead of scanning the full list.
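A hedged sketch of that offline aggregation step, assuming you have a log of the entries users actually selected (all names here are hypothetical):

from collections import Counter, defaultdict

def top_entries_by_letter(selection_log, k=10):
    """selection_log: iterable of selected entries, e.g. ["cat food", "camera", ...]."""
    counts = defaultdict(Counter)
    for entry in selection_log:
        first = entry[:1].lower()
        if first.isalpha():
            counts[first][entry] += 1
    # One short list per starting letter; store these in the small table.
    return {letter: [e for e, _ in c.most_common(k)]
            for letter, c in counts.items()}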

You could use a compressed suffix tree.

Find the most similar text in a list of strings

First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online so I'm looking for suggestions.
Suppose I want to collect a list containing over a hundred thousand strings, whose values are sentences that users have typed. Values are added to the list as soon as a user types a new message. For example:
["Hello world!", "Good morning, my name is John", "Good morning, everyone"]
But I also want to have a timeout for each string so if they are not repeated within 5 min, they should be removed, so I change it to following format:
[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]
Now suppose a user types the following message:
Good morning, everyBODY
I want to compare this string to all the messages in the list, and if one is 70% or more similar, update that message's count; otherwise, insert it as a new message. For this example, the application should update the count of "Good morning, everyone" to 2.
Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deleting after the timeout.
What is the best way to implement this? Or are there any libraries to help me with this?
NOTE: The strings do not need to be in an array, any data structure would work.
The main purpose of this algorithm is to detect similar messages when the count reaches a predefined value. For example: warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes.
I have looked at B-Trees, Nearest Neighbor, etc but I can't figure out what would be the best solution.
Update:
I plan on using Levenshtein distance for string similarity; however, the main problem is how to apply that to a list of strings in the most time-efficient way, without having to check every single string every time a new message is added.
Levenshtein distance
Unlike the other answer, I think Levenshtein distance is perfectly capable of dealing with spelling mistakes. Indeed, Levenshtein and LevXenshtein have a Levenshtein distance of only 1, and can thus be concluded to likely be the same message.
However, if you want to use this distance, you will have to compute the distance between the new message and every message stored, every time a new message comes in. There is likely no way around this.
Unfortunately there is no real useful pre-processing you can do for this.
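For reference, a standard dynamic-programming implementation of the distance itself (O(len(a) * len(b)) time, O(len(b)) memory):

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

assert levenshtein("Levenshtein", "LevXenshtein") == 1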
Other possibilities
If you can find a way to map every message to a fixed-size vector, you can use essentially any nearest neighbor search technique. I suggest doing so.
This leaves us with two problems to solve: generating the fixed-size vector, and doing the search.
Fixed-size vector representation
There are multiple ways of doing this, all with their own set of drawbacks. I'll specifically mention two, but it will depend on your architecture and data which method is best for you.
First, you could go the machine-learning way. You could map every word to a pre-trained vector with fastText, average the words in the message, and use that as your vector. The drawbacks of this method are that it will ignore word order, and it will work less well if the words used tend to be very informal. If your messages have their own culture to them (such as for example Twitch chat) you would have to retrain these vectors instead of using pre-trained ones.
Alternatively, you could use the structure of the text directly, and make an occurrence vector of bigrams. That is, jot down how often every 2-character combination occurs in a message. This is fairly robust, but has the drawback that the vectors will become relatively large.
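A minimal sketch of the bigram option; the vector is kept sparse (a Counter) since most possible bigrams never occur in any given message:

from collections import Counter

def bigram_vector(message):
    text = message.lower()
    # Count every overlapping 2-character combination in the message.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

Two such vectors can then be compared with, e.g., cosine similarity, or fed into any of the nearest neighbor methods below.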
Regardless, these are just two options, and it's impossible to tell what method is ideal for you. Unless of course someone has a brilliant idea.
Nearest neighbor search
Given that we have fixed-size vectors, we can now do nearest neighbor search. As you've probably found, there are once again many different methods for this, all with their own drawbacks. Exhausting, I know.
I'll choose to discuss three categories.
Approximate search: This method may seem a little silly, but it could be what you want. Specifically, locality-sensitive hashing is essentially just building a hash function where "similar" vectors are likely to end up in the same bucket (see the sketch after this list). You could then do anything you want, such as Levenshtein, with the other members of the bucket, because there should not be too many of them. The advantage of such an approximate algorithm is that it can be fast, and with some smart hashing you don't even need fixed-size vectors. A downside, of course, is that it is not guaranteed to work.
Exact search: We can also choose to instead solve the problem of Fixed-radius near neighbors. That is, find the points within some distance of the target point. You could do this by mapping vectors to integers (if they aren't already) and simply checking every lattice point within the distance you want to search. The primary drawback here is that the search time grows very fast not with the number of points, but with the number of dimensions of the vector. This method would necessitate small vectors.
Fancy data structures: This seems to me most likely to be the right solution. Unfortunately, there are a lot of letter-trees. You mention B-trees, but there are also R-trees, R+-trees, R*-trees, X-trees, and that's just the direct descendants of the R-tree. At the risk of missing the trees for the forest, I'd suggest taking a look at the k-d tree. It can do nearest neighbor search, insertion, and deletion in logarithmic time on average.
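To make the locality-sensitive hashing option above concrete, here is a minimal random-hyperplane sketch over hashed bigram counts; the names and parameter choices (DIM, NUM_PLANES) are illustrative, not tuned:

import random
from collections import defaultdict

DIM = 256          # dimensions for the hashed bigram counts
NUM_PLANES = 16    # one bit per hyperplane in the bucket key
random.seed(42)
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def vectorize(message):
    v = [0.0] * DIM
    text = message.lower()
    for i in range(len(text) - 1):
        # hash() is stable within one process, which is all an in-memory index needs
        v[hash(text[i:i + 2]) % DIM] += 1
    return v

def bucket_key(v):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0) for plane in PLANES)

buckets = defaultdict(list)

def add(message):
    buckets[bucket_key(vectorize(message))].append(message)

def candidates(message):
    # Run the expensive comparison (e.g. Levenshtein) only against this bucket.
    return buckets[bucket_key(vectorize(message))]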
You want to convert all of the words to their Soundex values (a Soundex sketch follows these steps).
Then you need a database for the Soundex values that ranks the importance of each word in the sentence; e.g. 'the' should probably get 0. The more information a word carries, the higher its value.
Then sort the words in the sentence into a list of integers.
Use the list of integers as the key to find similar sentences.
Since the key is a list of integers, a rose tree should work as the data structure.
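As a reference for the first step, a minimal Soundex sketch (this follows the common American Soundex rules, including the h/w exception):

def soundex(word):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return "0000"
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":   # h and w do not separate identical codes
            prev = code
    return (result + "000")[:4]   # pad or truncate to the usual 4 characters

assert soundex("Robert") == soundex("Rupert") == "R163"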
While some may suggest measuring similarity with something like Levenshtein distance, that presupposes that the sentences have no spelling mistakes or the like. You need something that is flexible enough to deal with human error.
I would suggest you use Algolia, whose ranking algorithm rates each matching record on several criteria (such as the number of typos or the geo-distance) and assigns each an integer score.
I would definitely take a look at it, since they have search-as-you-type and various ranking criteria.
https://blog.algolia.com/search-ranking-algorithm-unveiled/
I think a search engine like Solr or Elasticsearch is the best fit for your problem.
You would create a single collection in which to store the data as described in the question; after that, you just add the data to Solr and search it with your time limit applied.

Advice on how to search and return strings

Forgive me, this will be my first ever post to SO, so do let me know how I can improve.
I am currently looking for advice on a problem I am facing. I have a list of one billion unique strings of text. These text strings also have a list of tags associated with them to indicate the content of the string.
Example:
StringText: The cat ate on Sunday
AnimalCode: c001
ActionCode: a001
TimeCode: d001
where
c001 = The cat
a001= ate
d001 = on Sunday
I have loaded all of the strings and their codes as individual documents in an instance of MongoDB
At present, I am trying to devise a method by which I can enter a string and search against the database to return the match. My problem is that the search is taking far too long to return results.
I have created an index on the StringText field, but am guessing that it is too large to hold in memory.
Each string has an equal probability of being searched for so I can't reliably predict which strings have a higher probability of being searched for and pull them out into another collection.
Currently, I am running the DB off a single box with 16GB of RAM and a 4TB HDD.
Does anybody have any advice on how I might accomplish my task more efficiently? Is Mongo the right technology or are there others more adept at doing this kind of search and return?
My goal (forgive me if foolish) would be to try and return a result within 2 seconds or less.
I am very new to this whole arena so any and all advice would be welcome.
Thanks much to all in advance for the help and time.
Sincerely,
Zinga
As discussed in the comments, you could preprocess the input string to find the associated Animal and Action codes and search for StringText based on the indexed codes, which is much faster than text search.
You can't totally avoid text search, so reduce it to the Animal and/or Action collection by tokenizing the input string. See how you can use map/reduce techniques just for queries of this sort.
In your case, if you know that the first word or two will always contain the name of the animal, just use those one or two words to search for the relevant animal. Searching through the Animal/Actions collection shouldn't take long. In case it does, you can keep a periodically updating list of most common animals/actions (based on their frequency) and search against that to make it faster. This is also discussed in the articles on the linked page.
If even after that your search against StringText is slow, you could shard the StringText collection by Animal/Action codes. The official doc should suffice for this, and there's not much involved in the setup, so you might try this anyway. The basic idea everywhere is to restrict your target space as much as possible. Searching through a billion records for every query is plain overkill. Cache where you can, preprocess where you can, and show guesses while you run a slow query.
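As a hedged sketch of that code-first lookup using pymongo (the collection and field names are hypothetical, loosely based on the schema in the question, and it assumes single-word animal/action names for brevity):

from pymongo import MongoClient

db = MongoClient()["mydb"]

def find_matches(input_string):
    tokens = input_string.lower().split()
    # Resolve tokens against the small Animal/Action collections first.
    animal = db.animals.find_one({"name": {"$in": tokens}})
    action = db.actions.find_one({"name": {"$in": tokens}})
    query = {}
    if animal:
        query["AnimalCode"] = animal["code"]
    if action:
        query["ActionCode"] = action["code"]
    if not query:
        return []   # nothing resolved; fall back to a slower text search instead
    # An indexed equality match on the codes narrows the billion-row
    # collection before any expensive text comparison is attempted.
    return list(db.strings.find(query))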
Good luck!

Match snippets in an HTML document

Somebody has presented me with a very large list of copyedits to make to a long HTML document. The edits are in the format:
"religious" should be "religions"
"their" should be "there"
"you must persistent" should be "you must be persistent"
The copyedits were typed by hand; in some cases, the "actual" value on the left is not an exact match for the content in the document. The order of edits is usually correct, but even that is not guaranteed.
It's a straightforward but very large task to apply these edits by hand to the document. I'd like to automate the process as much as possible, e.g. by automatically searching for the snippets.
In a long document like this, I can't just search for all instances of "their" and replace them with "there." Sometimes "their" was used correctly, just not in one particular instance.
In other words, I'm looking for a fuzzy text match, where the order of the edits influences the search.
What's a good approach to a problem like this? I'm hoping that there's some off-the-shelf open source project that can search for the snippets in a fuzzy order.
I am not aware of any tool. But I would use edit distance for both:
for non-exact string matching: probably standard Levenshtein + swap (i.e. Damerau-Levenshtein distance)
for non-exact sequence matching: this time probably with only Match and Swap operations. You can use a free (zero-cost) Insert to account for the words that should not be edited.
It should not be hard to implement, but the computational complexity will be quite high. I would use some heuristics to skip hopeless matches. Preprocessing the words in the document and the edit list could help: e.g. compute the set of characters for each word to allow a quick comparison before calculating the full edit distance.
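One possible starting point in Python, using difflib from the standard library; the word-window scheme and cutoff here are assumptions, not a worked-out tool:

import difflib

def find_snippet(document, snippet, start=0, cutoff=0.8):
    """Return (score, word_index) of the best fuzzy match at or after start."""
    words = document.split()
    window = len(snippet.split())
    best = (0.0, None)
    for i in range(start, len(words) - window + 1):
        candidate = " ".join(words[i:i + window])
        score = difflib.SequenceMatcher(None, candidate, snippet).ratio()
        if score > best[0]:
            best = (score, i)
    return best if best[0] >= cutoff else (0.0, None)

Because the edits are usually in order, each search can pass the previous match position as start, which both speeds things up and disambiguates repeated words like "their".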

How do search engines conduct 'AND' operations?

Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
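In Python, the two-pointer walk over sorted postings lists looks roughly like this:

def intersect(postings_a, postings_b):
    # Both lists are sorted by document id, so one linear pass suffices.
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# e.g. intersect(index['david'], index['john']) -> ids of documents containing both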
I think you're approaching the problem from the wrong angle.
Google doesn't have tables/indices on a single machine. Instead they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, SUM(score) AS total_score, COUNT(score) AS matches
FROM rev_index
WHERE word IN ('david', 'john')
GROUP BY document_id
HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set, and a simple count determines whether all terms were matched.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that Google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
The search for "david" resulting in me setting the relevant bit in one of the bitmaps to signify that the record had the word "david" in it. Did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. A quick scan of the resulting bitmap gives you back the list of records that match both terms.
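A minimal sketch of the same idea in Python, using arbitrary-size integers as the bitmaps (bit i set means record i contains the word):

from collections import defaultdict

bitmaps = defaultdict(int)   # word -> bitmap over record ids

def index_record(record_id, text):
    for word in text.lower().split():
        bitmaps[word] |= 1 << record_id

def search_and(*words):
    if not words:
        return []
    combined = bitmaps[words[0]]
    for w in words[1:]:
        combined &= bitmaps[w]   # binary AND of the bitmaps
    # Scan the resulting bitmap for the matching record numbers.
    ids, i = [], 0
    while combined:
        if combined & 1:
            ids.append(i)
        combined >>= 1
        i += 1
    return ids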
This technique wouldn't work for Google though, so consider this my $0.02 worth.

A string searching algorithm to quickly match an abbreviation in a large list of unabbreviated strings?

I am having a lot of trouble finding a string matching algorithm that fits my requirements.
I have a very large database of strings in an unabbreviated form that need to be matched to an arbitrary abbreviation. A string that is an actual substring with no letters between its characters should also match, and with a higher score.
Example: if the word to be matched within was "download" and I searched "down", "ownl", and then "dl", I would get the highest matching score for "down", followed by "ownl" and then "dl".
The algorithm would have to be optimized for speed and for a large number of strings to be searched through, and should allow me to pull back a list of matching strings (if I had added both "download" and "upload" to the database, searching "load" should return both). Memory is still important, but not as important as speed.
Any ideas? I've done a bunch of research on some of these algorithms but I haven't found any that even touch abbreviations, let alone with all these conditions!
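To make the scoring requirement concrete, here is a minimal sketch in which a prefix beats a substring, a substring beats an abbreviation (subsequence), and tighter subsequences score higher; the particular scoring tiers are an assumption, not an established algorithm:

def match_score(candidate, query):
    c, q = candidate.lower(), query.lower()
    if c.startswith(q):
        return 3.0 + len(q) / len(c)   # 'down' in 'download': highest tier
    if q in c:
        return 2.0 + len(q) / len(c)   # 'ownl' in 'download': contiguous substring
    # Greedy subsequence check for abbreviations like 'dl' in 'download'.
    i, first, last = 0, None, None
    for j, ch in enumerate(c):
        if i < len(q) and ch == q[i]:
            if first is None:
                first = j
            last = j
            i += 1
    if i < len(q):
        return 0.0                       # not all query letters found: no match
    return len(q) / (last - first + 1)   # denser subsequences score higher

def search(strings, query):
    scored = [(match_score(s, query), s) for s in strings]
    return [s for score, s in sorted(scored, reverse=True) if score > 0]

With this, search(["download", "upload"], "load") returns both strings, and for "download" the queries "down", "ownl", and "dl" score in the order the question asks for.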
I'd wonder if Peter Norvig's spell checker could be adapted in some way for this problem.
It's a stretch that I haven't begun to work out, but it's such an elegant solution that it's worth knowing about.
