First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online so I'm looking for suggestions.
Suppose I want to collect a list containing over hundred thousand strings, values of these strings are sentences that a user has typed. The values are added to the list as soon as a user types a new message. For example:
["Hello world!", "Good morning, my name is John", "Good morning, everyone"]
But I also want to have a timeout for each string so if they are not repeated within 5 min, they should be removed, so I change it to following format:
[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]
Now suppose a user types the following message:
Good morning, everyBODY
I want to compare this string to all the messages in list and if one is 70% or more similar, update the count of that message, otherwise insert it as a new message. For this message for example, the application should update the count for Good morning, everyone to be equal to 2.
Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deleting after the timeout.
What is the best way to implement this? or are there any libraries to help me with this?
NOTE: The strings do not need to be in an array, any data structure would work.
The main purpose of this algorithm is to detect similar messages when the count reaches a predefined value. For example warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes
I have looked at B-Trees, Nearest Neighbor, etc but I can't figure out what would be the best solution.
Update:
I plan on using Levenshtein distance for string similarity, however the main problem is how to apply that to a list of strings in most time efficient way, without having to check every single string every time a new message is added.
Levenshtein distance
Unlike the other answer I think Levenshtein distance is perfectly capable of dealing with spelling mistakes. Indeed, Levenshtein and LevXenshtein only have Levenshtein distance 1, and thus can be concluded to likely be the same message.
However, if you want to use this distance, you will have to compute the distance between the new message and every message stored, every time a new message comes in. There is likely no way around this.
Unfortunately there is no real useful pre-processing you can do for this.
Other possibilities
If you can find a way to map every message to a fixed-size vector, you can use essentially any nearest neighbor search technique. I suggest doing so.
This leaves us with two problems to solve. Generating the fixed-length vector, and doing the search.
Fixed-size vector representation
There are multiple ways of doing this, all with their own set of drawbacks. I'll specifically mention two, but it will depend on your architecture and data which method is best for you.
First, you could go the machine-learning way. You could map every word to a pre-trained vector with fastText, average the words in the message, and use that as your vector. The drawbacks of this method are that it will ignore word order, and it will work less well if the words used tend to be very informal. If your messages have their own culture to them (such as for example Twitch chat) you would have to retrain these vectors instead of using pre-trained ones.
Alternatively, you could use the structure of the text directly, and make an occurrence vector of bigrams. That is, jot down how often every 2-character combination occurs in a message. This is fairly robust, but has the drawback that the vectors will become relatively large.
Regardless, these are just two options, and it's impossible to tell what method is ideal for you. Unless of course someone has a brilliant idea.
Nearest neighbor search
Given that we have fixed length vectors, we can now do nearest neighbor search. As you've probably found, there are once again many different methods for this, all with their own drawbacks. Exhausting, I know.
I'll choose to discuss three categories.
Approximate search: This method may seem a little silly, but it could be what you want. Specifically, Locality-sensitive hashing is essentially just making some hashing function where "similar" vectors are likely to end up in the same bucket. You could then do anything you want, such as Levenshtein, with all of the other members of the bucket, because there should not be too many of them. The advantage of such an approximate algorithm is that it can be fast, and with some smart hashing you don't even need fixed-length vectors. A downside, of course, is that it is not guaranteed to work.
Exact search: We can also choose to instead solve the problem of Fixed-radius near neighbors. That is, find the points within some distance of the target point. You could do this by mapping vectors to integers (if they aren't already) and simply checking every lattice point within the distance you want to search. The primary drawback here is that the search time grows very fast not with the number of points, but with the number of dimensions of the vector. This method would necessitate small vectors.
Fancy datastructures: This seems to me most likely to be the right solution. Unfortunately you have a lot of letter-trees. You mention B-trees, but there's also R-trees, R+-Trees, R*-Trees, X-Trees, and that's just the direct descendants of the R-tree. With the risk of missing the trees for the forest, I'd suggest taking a look at the k-d tree. It can do nearest neighbor search in logarithmic time, as well as insertion and deletion.
You want to covert all of the words to their Soundex value.
Then you need a database for the soundex values that ranks the importance of the word in the sentence, e.g. the should probably get 0. The more information the word carries the higher its value.
Then sort the words in the sentence into a list of integers.
Use the list of integers as the key to find similar sentences.
Since the key is a list of integers a Rose tree should work as data structure.
While some may suggest measuring using something like Levenshtein distance that presupposes that the sentences have no spelling mistakes or such. You need something that is flexible enough to deal with human error.
I would suggest you to use Algolia. Which has their own ranking algorithm rates each matching record on several criteria (such as the number of typos or the geo-distance), to which they individually assign a integer value score.
I would totally take a loook on it, since they have Search-as-you-type and different Ranking algorithm criterias.
https://blog.algolia.com/search-ranking-algorithm-unveiled/
I think Search Engine like SOLR or Elastic Search are best fit for your problem.
You have to create single collection in which you can store data as you have mention in the question after that you just have to add data to solr and search it in the solr search with your time limit.
I have a object with many fields. Each field has different range of values. I want to use hypothesis to generate different instances of this object.
Is there a limit to the number of combination of field values Hypothesis can handle? Or what does the search tree hypothesis creates look like? I don't need all the combinations but I want to make sure that I get a fair number of combinations where I test many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the max number of examples to generate
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a psudeo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, with strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
I have a scenario where a user can post a number of responses or phrases via a form field. I would like to be able to take the response and determine what they are asking for. For instance if the user types in car, train, bike, jet .... I can assume they are talking about a vehicle, and respond accordingly. I understand that I could use a switch statement or perhaps a regexp as well, however the larger the number of possible responses, the less efficient that computation will be. I'm wondering if there is an efficient algorithm for comparing a string with a group of strings. Any info would be great.
You may want to look into the Aho-Corasick algorithm. If you have a collection of strings that you want to search for, you can spend linear time doing preprocessing on those strings and from that point forward can, in O(n) time, check for all possible matches of those strings in a text corpus of length n. In other words, with a small preprocessing time to set up the algorithm once, you can extremely efficiently scan over numerous inputs again and again searching for those keywords.
Interestingly enough, the algorithm was specifically invented to build a fast index (that is, to look for a lot of different keywords in a huge body of text), and allegedly outperformed other methods by a factor of ten. I think it would work great in your application.
Hope this helps!
If you have a large number of "magic" words, I would suggest splitting the query into words, and using a hash-based lookup to check whether the words are recognized.
You can check Trie structure. I think one of best solution for your problem.
I am currently working on project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contains numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
First, I'd add to your list the techniques discussed at Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at a combination using testing to establish the set of coefficients for combining the results of the various techniques.
I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering the best way to model this in a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query, then when a new search is performed to find all the historical searches that match some amount of the top 10 results but ideally not matching all of them (matching all of them might suggest an equivalent search and hence not that useful as a suggestion).
I imagine that some people have done this functionality before and may be able to provide some ideas of different ways to do this. I'm not necessarily looking for one winning idea since the solution will no doubt vary substantially depending on the size and nature of the site.
have you considered a matrix of with keywords on 1 axis vs. documents on another axis. once you find the set of vetors representing the keywords, find sets of keyword(s) found in your initial result set and then find a way to rank the other keywords by how many documents they reference or how many times they interset the intial result set.
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (eg: with the Porter Stemming Algorithm and then use the stemmed terms instead of the initially provided queries, and given the user the option of searching for exactly the terms they specified (or do the opposite, search the exact terms first, and use stemming to find the terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexing term finds them.)
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)