Improving a search engine - search

I'm working on a search engine. For the most part, I'm simply using Apache Lucene, which is working great so far, but I also want to improve the search results by establishing good "heuristics" within the search. (For example, if someone searches 'couch' and I have all of the couches cataloged as type 'sofa', I want the search algorithm to make the connection.)
I know this sounds a bit vague, but I don't know where to continue looking for further reading on this topic. (I Googled terms like 'heuristic search', 'heuristic function', etc., but they don't refer to the same thing I mean.) So I wanted to know if any of you have worked on similar problems in search engines, and whether you would recommend anything.

I had to build something similar for my Artificial Intelligence class. I built a web crawler that associated synonyms of words, similar to what you're looking to do. When a user searches for a term such as 'couch', I grab all of the synonyms of 'couch' and store them in a database with a reference to the original word. When the engine is run again and 'sofa' is searched, the application again grabs the synonyms of 'sofa' (which is a synonym of 'couch'). You should then be able to match that association.
There are plenty of free APIs for getting the synonyms of a word. Try changing your Google searches to "topic-specific web crawlers" or "topic-specific search engines"; you will get better results.
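One free source of synonyms is WordNet, which ships with NLTK. A minimal sketch of the lookup step, assuming NLTK and its WordNet data are installed:

from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')

def synonyms(word):
    # Collect lemma names from every WordNet synset of the word.
    syns = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            syns.add(lemma.name().replace('_', ' '))
    return syns

print(synonyms('couch'))  # typically includes 'sofa', 'lounge', ...

You could store each returned set in your database keyed by the original word, as described above.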

One of the "quick n' dirty" hack which is popped in my mind can be to implement a dictionary which holds similarities in context. e.g. make sofa and couch group similar. Or much better approach could be to build a square matrix to hold "similarity score" for each word pairs. Here is random matrix about what I mean:
        couch   sofa   chair
couch |   100 |    95 |    75 |
sofa  |    95 |   100 |    65 |
chair |    75 |    65 |   100 |
Another approach could be to adaptively update that matrix based on user selections, e.g. if a user searches for 'couch' and then clicks a 'chair' result, you can increase the couch-chair score by a defined amount (of course, you should also renormalize all scores after each update).
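As a rough sketch of that matrix plus the click-feedback update (the bump size and the simple clipping step are placeholder choices for illustration, not anything prescribed above):

import numpy as np

terms = ['couch', 'sofa', 'chair']
idx = {t: i for i, t in enumerate(terms)}

# Pairwise similarity scores on a 0-100 scale, mirroring the table above.
scores = np.array([[100.0,  95.0,  75.0],
                   [ 95.0, 100.0,  65.0],
                   [ 75.0,  65.0, 100.0]])

def record_click(searched, clicked, bump=5.0):
    # A user searched one term but clicked a result catalogued under another:
    # nudge that pair's score upward, then keep everything on the 0-100 scale.
    i, j = idx[searched], idx[clicked]
    scores[i, j] += bump
    scores[j, i] += bump
    np.clip(scores, 0.0, 100.0, out=scores)  # crude stand-in for a full renormalization

record_click('couch', 'chair')
print(scores[idx['couch'], idx['chair']])  # 80.0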

Related

Is there a way to group items based on a broader category (e.g., skittles and snickers get labeled as "candy")?

I'm wondering if there is a way (specific package, process, etc.) of grouping items based on an overall category? For example, I'm looking at empty search results and want to see what category customers are most interested in.
Let's say I have a list of searched terms: skittles, laundry, snickers and detergent. I would want to group these items based on a broader category (i.e., skittles and snickers are "candy" and laundry and detergent would be "cleaners").
I've done some research on this and have seen similar (but not exact) approaches (e.g., common keyword grouping using NLP), but I'm not sure whether something like this exists when there isn't necessarily any keyword commonality. Any help or direction would be greatly appreciated.
Update here: The best way to handle this scenario is to use pretrained word embeddings (from something like Google's BERT) as a first pass and then layer on another ML model that is specific to the use case.
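As one illustration of that update (not the original poster's code), here is a sketch that uses a sentence-transformers model in place of raw BERT, with k-means as the second-pass grouping; the model name and cluster count are arbitrary choices for the example:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

terms = ["skittles", "laundry", "snickers", "detergent"]

# Embed each search term with a small pretrained model, then cluster.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(terms)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for term, label in zip(terms, labels):
    print(label, term)
# In practice you would inspect the clusters and attach human-readable
# category names such as "candy" and "cleaners".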

Natural language processing keywords for building search engine

I've recently become interested in NLP and would like to build a search engine for product recommendation. (Actually, I've always wondered how search engines like Google's or Amazon's are built.)
Take an Amazon product as an example, where I can access all of the "word" information about one product:
Product_Name   Description     ReviewText
"XXX brand"    "Pain relief"   "This is super effective"
By applying the nltk and gensim packages I can easily compare the similarity of different products and make recommendations.
But here's another point I'm very vague about:
How to build a search engine for such products?
For example, if I feel pain and would like to search for medicine online, I'd type in "pain relief" or "pain", and the search results should include "XXX brand".
So this sounds more like a keyword extraction/tagging question? How should this be done in NLP? I know the corpus should contain nothing but single words, so it would look like:
["XXX brand" : ("pain", 1),("relief", 1)]
So if I typed in either "pain" or "relief" I could get "XXX brand"; but what if I search for "pain relief"?
One idea I came up with is to directly call Python from my JavaScript to calculate the similarity of the input words "pain relief" on a browser-based server and make recommendations; but is that really doable?
I would still prefer to build very big lists of keywords on the backend, stored in datasets/a database and surfaced directly in the search engine's web page.
Thanks!
Even though this does not provide a full how-to answer, there are two things that might be helpful.
First, it's important to note that Google does not only handle single words but also n-grams.
More or less every NLP problem, and therefore also information retrieval from text, needs to tackle n-grams, because phrases carry far more expressiveness and information than single tokens.
That's also why so-called NGramAnalyzers are popular in search engines, be it Solr or Elasticsearch. Since both are based on Lucene, you should take a look here.
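For illustration, generating n-grams from a token stream is only a few lines; once bigrams are indexed, "pain relief" becomes a single searchable unit rather than two unrelated tokens:

def ngrams(tokens, n):
    # All contiguous n-token windows over the sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this cream offers fast pain relief".split()
print(ngrams(tokens, 2))
# [('this', 'cream'), ('cream', 'offers'), ..., ('pain', 'relief')]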
With either framework, you can use a synonym analyser that adds, for each word, the synonyms you provide.
For example, you could add relief = remedy (and vice versa if you wish) to your synonym mapping. Then, both engines would retrieve relevant documents regardless if you search for "pain relief" or "pain remedy". However, you should probably also read this post about the issues you might encounter, especially when aiming for phrase synonyms.
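As a sketch of what such a synonym analyzer could look like in Elasticsearch, here is an index definition written as the Python dict you would pass when creating the index; the field, filter, and analyzer names are made up for the example:

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": ["relief, remedy", "couch, sofa"],
                }
            },
            "analyzer": {
                "product_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "description": {"type": "text", "analyzer": "product_analyzer"}
        }
    },
}
# With these settings, searches for "pain remedy" would also match documents
# that only say "pain relief" (and vice versa).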

Match a huge list of entries in a given text file

I am trying to match a list of entries against a given text file. The list is quite huge: it's a list of organization names, where a name can have more than one word. Each text file is a fairly ordinary write-up with several paragraphs, totaling approximately 5,000 words per file. It's plain-text content, and there is no clear boundary by which I can locate the organization names.
I am looking for a way to search for all the entries from the list in the text file, so that whichever entries match are recognized and tagged.
Is there any tool or framework to do this?
I tried going through all the text-mining tools listed on Wikipedia, but none seems to match this need.
Any inputs would be highly appreciated.
Approach 1: Finite State Machine
You can combine your search terms into a finite state machine (FSM). The resulting FSM can then scan a document for all the terms simultaneously in linear time. Since the FSM can be reused on each document, the expense of creating it is amortized over all the text you have to search.
A good regular expression library will make an FSM under the covers. Writing code to build your own is probably beyond the scope of a Stack Overflow answer.
The basic idea is to start with a regular expression that is an alternation of all your search terms. Suppose your organization list consists of "cat" and "dog". You'd combine those as cat|dog. If you also had to search for "pink pigs", your regular expression would be cat|dog|pink pigs.
From the regular expression, you can build a graph. The nodes of the graph are states, which keep track of what text you've just seen. The edges of the graph are transitions that tell the state machine which state to go to given the current state and the next character in the input. Some states are marked as "final" states, and if you ever get to one of those, you've just found an instance of one of your organizations.
Building the graph from all but the most trivial regular expressions is tedious and can be computationally expensive, so you probably want to find a well-tested regular expression library that already does this work.
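For a feel of the alternation step described above, here is a small sketch with Python's re module (note that re is a backtracking engine rather than a strict DFA, and dedicated automaton libraries exist, but the construction of the pattern is the same):

import re

terms = ["cat", "dog", "pink pigs"]
# Escape each term and put longer terms first so "pink pigs" is preferred
# over any shorter overlapping alternative.
pattern = re.compile("|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)))

text = "We saw a dog chasing pink pigs past the barn."
print(pattern.findall(text))  # ['dog', 'pink pigs']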
Approach 2: Search for One Term at a Time
Depending on how many search terms you have, how many documents you have, and how fast your simple text searching tool is (possibly sub-linear), it may be best to just loop through the terms and search each document for each term as a separate command. This is certainly the simplest approach.
# search() stands for whatever simple text-search command or tool you use
for doc in documents:
    for term in search_terms:
        search(term, doc)
Note that nesting the loops this way is probably most friendly to the disk cache.
This is the approach I would take if this were a one-time task. If you have to keep searching new documents (or with different lists of search terms), this might be too expensive.
Approach 3: Suffix Tree
Concatenate all the documents into one giant document, build a suffix tree, sort your search terms, and walk through the suffix tree looking for matches. Most of the details for building and using a suffix array are in this Jon Bentley article from Dr. Dobb's, but you can find many other resources for them as well.
This approach is memory intensive, mostly cache-friendly, and thus very fast.
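A deliberately naive sketch of the suffix-array variant (quadratic construction, purely for illustration; a real implementation would use one of the proper construction algorithms):

documents = ["the cat corporation met the dog institute",
             "pink pigs limited opened a new office"]
text = " \x00 ".join(documents)  # separator keeps terms from spanning documents

# Sort suffix start positions by the suffix they begin (naive and memory-hungry).
suffixes = sorted(range(len(text)), key=lambda i: text[i:])

def occurs(term):
    # Binary search for the first suffix >= term, then check whether it starts with term.
    lo, hi = 0, len(suffixes)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffixes[mid]:suffixes[mid] + len(term)] < term:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffixes) and text[suffixes[lo]:].startswith(term)

for term in sorted(["cat corporation", "dog institute", "pink pigs"]):
    print(term, occurs(term))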
Use a prefix tree, a.k.a. a trie.
Load all your candidate names into the prefix tree.
For your documents, match them against the tree.
A prefix tree looks roughly like this:
{}
+-> a
| +-> ap
| | +-> ... apple
| +-> az
| +-> ... azure
+-> b
+-> ba
+-> ... banana republic
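A minimal word-level version of that idea in Python (the organization list and the sample sentence are invented for the example):

def build_trie(names):
    root = {}
    for name in names:
        node = root
        for word in name.lower().split():
            node = node.setdefault(word, {})
        node["$"] = name  # sentinel marks the end of a complete name
    return root

def find_matches(text, trie):
    # Slide over the document; at each position, walk as deep as the trie allows.
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    matches = []
    for start in range(len(words)):
        node = trie
        for word in words[start:]:
            if word not in node:
                break
            node = node[word]
            if "$" in node:
                matches.append(node["$"])
    return matches

trie = build_trie(["Azure Dynamics", "Banana Republic", "Apple"])
print(find_matches("She wore Banana Republic while reading about Apple.", trie))
# ['Banana Republic', 'Apple']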

How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research into various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is finding product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string-similarity check to catch these.
Product names that are also ordinary English words would get caught, like Mustang the horse versus Mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.).
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and scikit-learn and didn't really glean how to do this from them. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose does not really answer your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS; proper nouns). The database behind it might be missing some new brands like Lyft (Uber's rival), but without developing your own proprietary database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that contains every brand name and simply extract them from the tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
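To make the POS-tagging suggestion concrete, here is a quick sketch with NLTK that keeps only NNP/NNPS tokens as candidate product mentions (a crude first pass, not a finished recognizer):

import nltk
# One-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def proper_nouns(text):
    # Tokenize, tag, and keep proper nouns (singular NNP and plural NNPS).
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens) if tag in ("NNP", "NNPS")]

print(proper_nouns("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
# something like ['Coke', 'Pepsi', 'Fri'] - still needs filtering against your product list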
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings with their canonical (i.e. correct) forms.
Phonetic similarity: A common reason for "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string-similarity check, but you may introduce a lot of noise into your results that you would then have to deal with, because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
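A rough sketch tying the three ideas above together; the vowel-dropping "phonetic key" is a toy stand-in for a real algorithm such as Soundex or Metaphone, and difflib stands in for a proper edit-distance check:

import difflib

def phonetic_key(word):
    # Toy phonetic reduction: keep the first letter, drop later vowels.
    word = word.lower()
    return word[0] + "".join(c for c in word[1:] if c not in "aeiou")

EXCEPTIONS = {"innovashun": "innovation"}  # hand-made misspelling -> canonical form

def canonicalize(word, vocabulary):
    if word in EXCEPTIONS:                       # hand-made mappings first
        return EXCEPTIONS[word]
    for known in vocabulary:                     # then phonetic similarity
        if phonetic_key(word) == phonetic_key(known):
            return known
    close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else word           # finally plain form similarity

print(canonicalize("definately", ["definitely", "delicately"]))  # 'definitely'
print(canonicalize("innovashun", ["innovation"]))                # 'innovation'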
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather, classes of entities, if you want to be pedantic). In fact, we humans can't tell which one is meant unless we also know the word's context. One widely used way of modelling this context is distributional lexical semantics: defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and following them in text.
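A toy illustration of that distributional idea: represent each word by counts of its neighbours within a small window, then compare the resulting vectors with cosine similarity (the corpus here is invented; real systems use far larger corpora or pretrained embeddings):

from collections import Counter
import math

corpus = [
    "i rode my mustang down the highway",
    "the mustang is a fast car",
    "the mustang galloped across the field",
    "the horse galloped across the field",
    "i drove my car down the highway",
]

def context_vector(target, sentences, window=2):
    # Count the words appearing within `window` positions of the target word.
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                counts.update(words[max(0, i - window):i] + words[i + 1:i + window + 1])
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

mustang, horse, car = (context_vector(w, corpus) for w in ("mustang", "horse", "car"))
print(cosine(mustang, horse), cosine(mustang, car))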
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

Fuzzy sentence search algorithms

Suppose I have a set of phrases, about 10,000, of average length 7-20 words, in which I want to find a given phrase. The phrase I am looking for could have some errors: for example, it might be missing one or two words, have some words misplaced, or contain some extra random words. For example, my database contains "As I was riding my red bike, I saw Christine", and I want it to match "As I was riding my blue bike, saw Christine" or "I was riding my bike, I saw Christine and Marion". What would be a good approach to this problem? I know about Levenshtein distance, and I also suspect that this problem may have no easy, good solution.
A good text search engine will provide capabilities such as you describe, fsh. A typical approach would be to create a query that matches if any of the words occurs, and to order the results using a weight based on the number of terms occurring in proximity to each other, weighted inversely to their probability of occurring, since uncommon words are less likely to co-occur by chance. There is a whole theory of this sort of thing called information retrieval, but maybe you already know about that. Furthermore, you'd want to make sure that word-level fuzziness is accounted for by normalizing case, punctuation and the like, applying some basic linguistic transformations (stemming), and in some cases introducing a dictionary of synonyms, especially when there is domain knowledge available to condition it.
If you're interested in messing around with this stuff, try an open-source search engine, this article by Vik gives a reasonable survey from the perspective of 2009, and this one by Middleton and Baeza-Yates gives a good detailed introduction to the topic.
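As a bare-bones sketch of the scoring idea in the answer above (any-word matching with inverse-document-frequency weights; the phrase list, smoothing, and cutoff are placeholder choices):

import math
from collections import Counter

phrases = [
    "as i was riding my red bike i saw christine",
    "yesterday i bought a new bike",
    "christine and marion went swimming",
]

def tokenize(text):
    return text.lower().split()

# Document frequency of each word across the stored phrases.
df = Counter(word for p in phrases for word in set(tokenize(p)))
N = len(phrases)

def idf(word):
    return math.log((N + 1) / (df[word] + 1)) + 1  # smoothed inverse document frequency

def score(query, phrase):
    # Sum the IDF weights of the query words that also appear in the phrase,
    # so shared rare words count for more than shared common ones.
    q, p = set(tokenize(query)), set(tokenize(phrase))
    return sum(idf(w) for w in q & p)

query = "as i was riding my blue bike saw christine"
for phrase in sorted(phrases, key=lambda p: score(query, p), reverse=True):
    print(round(score(query, phrase), 2), phrase)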
