How to gather useful data for autocompletion? - search

im trying to implement a typical autocompletion-box, like you know from amazon.com.
There you go, type in a letter and you get a reasonable suggest about what you might try to enter into the search-box.
The box itself will be implemented by jquery, the persistence-layer and suggest algorithm will be based on Apache Lucene/Solr and its Suggest-Feature.
Additionaly i get a weighted suggestion into the result, using WFST-Suggestion by lucene.
My problem is, what does e.g. amazon to achieve this kind of reasonable data?
I mean where do they get all this keywords and score, so it makes sense?
Is it a pure hand-made style information on each product? What I think would be real tough.
Or is it possible to gather the data using things like clustering or classification from machine-learning-theory? (then I could use mahout or carrot2).
Looking on amazon suggestions, I think the data contains:
name of the product
producer/manufacturer/author of the product/book
product-features (like color, type, size)
Does it contain more?
The next thing would be that it looks that the suggestion itself is ranked. How do they receive this kind of score to weight the suggestions?
Is it a simple user-click-path-tracking, where you look, what the user typed into the box and what he selected or which product he looks afterward?
Is this kind of score computed on each query (maybe cached) using some logic? (Which? maybe bayes theorem?)

They might use something as simple as building an n-gram model from user queries and/or product names and use that to predict the most likely auto-completions.

Related

tensorflow for classification of strings vs elasticsearch

So, a little bit on my problem.
TL;DR
Can I use machine-learning instead of Elastic Search to find results depending on the user's text input? Is it a good idea?
I am working on a car spare parts project, and we have split the car into 300 parts that we store on the database, with some data for each part (weight, availability, etc).
When the customer inputs the text of his part, we need to be able to classify the part, and map it to one in our database.
The current way it's being done is by people on our team manually mapping the customer text with the parts on our database, we want to automate that process.
We tried using MongoDB text search, but it was often inaccurate since parts have different names in different parts of the country.
So we wanted something that got more accurate results, and improved by the more data we have, we immediately considered TensorFlow, after some research and taking part of Google's Machine Learning Crash Course, I got to that point where it specified:
Models can't learn from string values, so you'll have to perform some feature engineering to convert those values to something numeric
That would be useful in the case we have limited number of features as strings, but we don't know what the user will input as a text.
So, my questions are:
1- Can we use Machine Learning to map text input by the user with some documents on our database?
2- If we can do that, is it a good idea to favor it over other search tools like ElasticSearch?
3- Can ElasticSearch improve its results the more data we have? How?
4- How would you go about this problem?
Note: I'd be doing that in Node.js, and since TensorFlow.js is new, I am inclining to go for other solutions, but if push comes to shove, and the results are much better, I would definitely go there.
TL;DR: Yes and yes.
TS;WM:
This is a perfectly suited problem for machine learning. Especially so, if you have a database of past customer texts that have already been mapped to parts. Ideally, you have hundreds of texts mapped to each part. If that is present, you can design and train a network. And models can learn from string values with some engineering, and it's not that bad.
I'm not sure ElasticSearch would improve much on the network. I don't know much about auto parts trading, but as a wild guess, "the large round thingy that helps change direction" would never be mapped to "steering wheel" by ES but could be learned easily by a network - provided there are at least some examples of people using that text to specify steering wheel.
You can but don't have to necessarily use tensorflow.js for your network. The AI could run on your server as a webservice, and you'd just send over the customer's text to it and it would send back it's recommendations of part SKUs and names.

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, you have to calculate it to every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem then record all the typos the users make (typo=no direct match) and offline build a correction database which contains all the typo->fix mappings. some companies do this even more clever, eg: google watches how users correct their own typos and learns the mappings from this.
Soundex or Metaphone might work.
I think what you are looking for is a huge field of Data Quality and Data Cleansing. I fear if you could find a python implementation regarding this as it has to be something which cleanses considerable amount of data in db which could be of business value.
Levensthein distance goes in the right direction but only half the way. There are several tricks to get it to use the half matches as well.
One would be to use a subsequence dynamic time warping (DTW is actually a generalization of levensthein distance). For this you relax the start and end cases when calcualting the cost matrix. If you only relax one of the conditions you can get autocompletion with spell checking. I am not sure if there is a python implementation available, but if you want to implement it for yourself it should not be more than 10-20 LOC.
The other idea would be to use a Trie for speed up, which can do DTW/Levensthein on multiple results simultaniously (huge speedup if your database is large). There is a paper on Levensthein on Tries at IEEE, so you can find the algorithm there. Again for this you would need to relax the final boundary condition, so you get partial matches. However since you step down in the trie you just need to check when you have fully consumed the input and then return all leaves.
check this one http://docs.python.org/library/difflib.html
it should help you

How to estimate search application's efficiency?

I hope it belongs here.
Can anyone please tell me is there any method to compare different search applications working in the same domain with the same dataset?
The problem is they are quite different - one is a web application which looks up the database where items are grouped in categories, and another one is a rich client which makes search by keywords.
Is there any standard test giudes for that purpose?
There are testing methods. You may use e.g. Precision/Recall or the F beta method to estimate a rate which computes the "efficiency". However you need to make a reference set by yourself. That means you will somehow measure not the efficiency in the domain, more likely the efficiency compared to your own reasoning.
The more you need to make sure that your reference set is representative for the data you have.
In most cases common reasoning will give you also the result.
If you want to measure the performance in matters of speed you need to formulate a set of assumed queries against the search and query your search engine with these at a given rate. Thats doable with every common loadtesting tool.

Word characteristics tags

I want to do a riddle AI chatbot for my AI class.
So i figgured the input to the chatbot would be :
Something like :
"It is blue, and it is up, but it is not the ceiling"
Translation :
<Object X>
<blue>
<up>
<!ceiling>
</Object X>
(Answer : sky?)
So Input is a set of characteristics (existing \ not existing in the object), output is a matched, most likely object.
The domain will be limited to a number of objects, i could input all attributes myself, but i was thinking :
How could I programatically build a database of characteristics for a word?
Is there such a database available? How could i tag a word, how could i programatically find all it's attributes? I was thinking on crawling wikipedia, or some forum, but i can't see it build any reliable word tag database.
Any ideas on how i could achieve such a thing? Any ideas on some literature on the subject?
Thank you
This sounds like a basic classification problem. You're essentially asking; given N features (color=blue, location=up, etc), which of M classifications is the most likely? There are many algorithms for accomplishing this (Naive Bayes, Maximum Entropy, Support Vector Machine), but you'll have to investigate which is the most accurate and easiest to implement. The biggest challenge is typically acquiring accurate training data, but if you're willing to restrict it to a list of manually entered examples, then that should simplify your implementation.
Your example suggests that whatever algorithm you choose will have to support sparse data. In other words, if you've trained the system on 300 features, it won't require you to enter all 300 features in order to get an answer. It'll also make your training and testing files smaller, because you'll be omit features that are irrelevant for certain objects. e.g.
sky | color:blue,location:up
tree | has_bark:true,has_leaves:true,is_an_organism=true
cat | has_fur:true,eats_mice:true,is_an_animal=true,is_an_organism=true
It might not be terribly helpful, since it's proprietary, but a commercial application that's similar to what you're trying to accomplish is the website 20q.net, albeit the system asks the questions instead of the user. It's interesting in that it's trained "online" based on user input.
Wikipedia certainly has a lot of data, but you'll probably find extracting that data for your program will be very difficult. Cyc's data is more normalized, but its API has a huge learning curve. Another option is the semantic dictionary project Wordnet. It has reasonably intuitive APIs for nearly every programming language, as well as an extensive hypernym/hyponym model for thousands of words (e.g. cat is a type of feline/mammal/animal/organism/thing).
The Cyc project has very similar aims: I believe it contains both inference engines to perform the AI, and databases of facts about commonsense knowledge (like the colour of the sky).

Cross Referencing Databases on Fuzzy Data

I am currently working on project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contains numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
First, I'd add to your list the techniques discussed at Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at a combination using testing to establish the set of coefficients for combining the results of the various techniques.

Resources