Free text (natural language) query parsing with Solr

I'm trying to build a query parsing algorithm for a local search site that can classify a free-text search query (a single input text box) into the various types of searches possible on the site.
For example, the user could type "chinese restaurants near xyz". How should I go about breaking that down into cuisine:"chinese", locality:"xyz", given that:
- there could be spelling mistakes
- keywords may match in different columns, e.g. a restaurant may have "chinese" in its name
This is not really a natural language parsing problem, since we're searching within a very limited set of possibilities.
My initial thought is to dump all values of a particular type from the database into a dedicated field and match the user's query against all of those fields. Then, based on the score (and a predefined confidence level), divide the query into the 3-4 search fields like name/cuisine/locality.
Is there a better/standard way of doing this?

For spelling mistakes, you have to work with a dictionary/thesaurus. This can be part of your pre-processing and normalization.
For querying multiple columns, you can do: cuisine:chinese OR restaurant_name:chinese
You can boost one of the two: cuisine:chinese^0.8 OR restaurant_name:chinese
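A minimal sketch of the per-field scoring idea in Python, assuming a Solr core named "places", the name/cuisine/locality fields from the question, and the `requests` library; note that raw Lucene scores are not comparable across fields without calibrating the confidence threshold against real data:

```python
# Classify a free-text query by scoring it against each Solr field
# separately and keeping fields whose top score clears a threshold.
import requests

SOLR_URL = "http://localhost:8983/solr/places/select"  # assumed core name
FIELDS = ["name", "cuisine", "locality"]
CONFIDENCE = 2.0  # predefined confidence level; tune against real data

def classify(user_query):
    """Return {field: top_score} for fields that match confidently."""
    matches = {}
    for field in FIELDS:
        params = {
            "q": user_query,
            "defType": "edismax",
            "qf": field,      # restrict matching to this one field
            "fl": "score",
            "rows": 1,
            "wt": "json",
        }
        docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
        if docs and docs[0]["score"] >= CONFIDENCE:
            matches[field] = docs[0]["score"]
    return matches

# classify("chinese restaurants near xyz") might yield something like
# {"cuisine": 4.2, "locality": 3.1, "name": 2.3}
```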

Related

Internal Search optimization for relevance

My team is using Solr and I have a question regarding it.
There are some search terms which don't give relevant results, or which miss results that should have been displayed. For example:
Searching for Macy's without the apostrophe, like "Macys", doesn't give back any results for Macy's.
Searching for JPMorgan vs JP Morgan gives different results.
Searching for IBM doesn't show results which contain its full name, i.e. International Business Machines.
How can we improve and optimize such cases so that the fix applies generally, even to cases we haven't caught beyond these three?
Any suggestions?
All these issues are related to how you process the incoming text for those fields. You'll have to create a filter chain for the field that processes the input values to do what you want - and possibly use multiple fields for different use cases, prioritizing them with qf.
Your first case can be solved by using a PatternReplaceFilter to remove any apostrophes - depending on your use case and tokenizer you might want to use the CharFilter version, as it processes the text before it's split into multiple tokens.
Your second case is a straightforward synonym filter or a WordDelimiterFilter: either expand JPMorgan to "JP Morgan" via a synonym, or use the WordDelimiterFilter to expand case changes into separate tokens. That'll also allow you to search for JP and get JPMorgan-related entries. These might have different effects on score; use debugQuery=true to see exactly how each term in your query contributes to the score.
The third case is in general the same as the second. You'll have to build a decent synonym list for the terms used, and this is usually something you grow from user feedback, existing dictionaries and domain knowledge. There's also the option of preprocessing text using NLP, or, in this case, something as primitive as indexing the initials of consecutive capitalized words could help.
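The real configuration lives in the schema as charFilter/tokenizer/filter elements on a fieldType, but to make the behaviour concrete, here is a rough Python approximation of what the three suggested filters do at index time (the synonym list is a toy assumption):

```python
import re

# Toy synonym list; in Solr this would be a synonyms.txt file.
SYNONYMS = {"ibm": ["international", "business", "machines"]}

def analyze(text):
    # 1. PatternReplace(Char)Filter: strip apostrophes before tokenizing,
    #    so "Macy's" and "Macys" index identically.
    tokens = text.replace("'", "").split()
    out = []
    for tok in tokens:
        low = tok.lower()
        out.append(low)
        # 2. WordDelimiterFilter-style: split on case changes, so
        #    "JPMorgan" also yields "jp" and "morgan".
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", tok)
        if len(parts) > 1:
            out.extend(p.lower() for p in parts)
        # 3. SynonymFilter: expand tokens from the curated list.
        out.extend(SYNONYMS.get(low, []))
    return out

print(analyze("Macy's"))    # ['macys']
print(analyze("JPMorgan"))  # ['jpmorgan', 'jp', 'morgan']
print(analyze("IBM"))       # ['ibm', 'international', 'business', 'machines']
```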

How to implement synonyms for use in a search engine?

I am working on a search engine (SE) as a pet project.
What I have right now is a boolean keyword SE, built as a library that is split into two parts:
index: an inverted index, i.e. it associates terms with the original documents where they appear
query: supplied by the user; it can be an arbitrarily complex boolean expression that looks like (mobile OR android OR iphone) AND game
I'd like to improve the search engine so that it automatically extends simple queries into boolean queries that include search terms which do not appear in the original query, i.e. I'd like to support synonyms.
I need some help building the synonym graph.
How can I compute a list of words that appear in similar contexts?
Here is an example of the lists of synonyms I'd like to compute:
psql, pgsql, postgres, postgresql
mobile, iphone, android
and also synonyms that include n-grams, like:
rdbms, relational database management systems, ...
The algorithm doesn't have to be perfect; I can post-process the results by hand, but at least I need a clue about which terms are similar to which other terms.
In the standard Information Retrieval (IR) literature, this enrichment of a query with additional terms (that don't appear in the initial/original query) is known as query expansion.
There are plenty of standard approaches which, generally speaking, are based on the idea of scoring terms by some set of factors and then selecting a number of terms (say K, a parameter) that have the highest scores.
To compute the term selection score, the top M ranked documents retrieved by the initial query are assumed to be relevant; this is called pseudo-relevance feedback.
The factors on which the term selection function generally depends are:
The term frequency of a term in a top-ranked document - the higher the better.
The number of documents (out of the top M) in which the term occurs - the higher the better.
How many times an additional term co-occurs with a query term - the higher the better.
The co-occurrence factor is the most important and would give you terms such as 'pgsql' if the original query contains 'psql'.
Note that if documents are too short, this method will not work well, and you have to use other methods that are necessarily semantics-based, such as i) word-vector-based expansion or ii) WordNet-based expansion.
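A minimal sketch of such an expansion step, assuming whitespace tokenization, a `ranked_docs` list produced by the initial retrieval, and a naive multiplicative combination of the three factors (real systems weight them far more carefully):

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, M=10, K=5):
    """Score candidate expansion terms from the top M docs; return the K best."""
    top_docs = [doc.lower().split() for doc in ranked_docs[:M]]
    query = set(t.lower() for t in query_terms)
    tf = Counter()    # factor 1: term frequency in top docs
    df = Counter()    # factor 2: number of top docs containing the term
    cooc = Counter()  # factor 3: co-occurrence with a query term
    for tokens in top_docs:
        seen = set(tokens)
        has_query_term = bool(seen & query)
        for t in tokens:
            if t in query:
                continue
            tf[t] += 1
            if has_query_term:
                cooc[t] += 1
        for t in seen - query:
            df[t] += 1
    # naive multiplicative score; tune the weighting on real data
    scores = {t: tf[t] * df[t] * (1 + cooc[t]) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:K]

# expansion = expand_query(["psql"], results_of_initial_search)
# new_query = ["psql"] + expansion   # might surface "pgsql", "postgres", ...
```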

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
Also, I can think of having a lookup hash table with names of countries and cities, and then comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
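For instance, here is a sketch of calling the Stanford CRF model from Python through NLTK's wrapper; the jar and model paths are placeholders, and running it requires a local Stanford NER download plus a Java runtime:

```python
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger(
    "classifiers/english.all.3class.distsim.crf.ser.gz",  # PERSON/ORGANIZATION/LOCATION model
    "stanford-ner.jar",
)

tokens = "Heavy rain expected in New York and across Belgium".split()
locations = [tok for tok, tag in st.tag(tokens) if tag == "LOCATION"]
print(locations)  # ['New', 'York', 'Belgium'] - adjacent LOCATION tokens still need merging
```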
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitively, make sure the case in your list is already normalized.
Then all you have to do is loop over the individual "words" in your input text and, at the start of each new word, begin a new binary search in your location list. As soon as you find a non-match, you can skip the entire word and proceed to the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first word whenever the binary search leads you to a (possible!) multi-word result. Then, if the full comparison fails - possibly several words later - you simply revert to the word right after the one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
How fast are the tweets coming in? Is it the full Twitter firehose, or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few handle Twitter well because of all the leet speak. The NLP can be tuned for precision or recall depending on your needs, to limit the lookups performed against the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Improving type-ahead suggestions in Search using SOLR

What are the possible ways you could improve the type-ahead (auto-complete) suggestions that appear in a free-form search?
From my understanding, all the suggestions that appear for keywords are stored in a Solr index.
How do you ensure that it covers all the industry-specific relevant type-ahead suggestions?
Can you automate the inclusion of recent user-generated queries that currently return no search results, mapping them to relevant ones instead?
In preprocessing, the documents fed into the search engine need to be enriched with whatever is sensible and helps to find them. E.g. a document containing the string paris may be enriched with french capital, capital of france, ile-de-france, … You will need a dictionary to do so; you can take data from dbpedia.org or, for English only, WordNet. To avoid over-generalizing, you will need to implement some disambiguation (meaning discovery) as a first step, since paris, for example, could equally be expanded with alexandros, alaksandu of wilusa, king of troy, depending on the context.
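A minimal sketch of that enrichment step, assuming a toy alias dictionary and whitespace tokenization; a real pipeline would run the disambiguation step described above before expanding:

```python
# Toy alias dictionary; in practice built from dbpedia.org, WordNet, etc.
ALIASES = {
    "paris": ["french capital", "capital of france", "ile-de-france"],
}

def enrich(doc):
    """Attach an 'aliases' field with expansions for known terms in the doc."""
    aliases = []
    for token in doc["text"].lower().split():
        aliases.extend(ALIASES.get(token, []))
    doc["aliases"] = aliases  # index this field alongside the original text
    return doc

doc = enrich({"id": 1, "text": "A walking tour of Paris"})
print(doc["aliases"])  # ['french capital', 'capital of france', 'ile-de-france']
```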

Xapian multiple-language searching with stop words?

I have two Xapian databases, let's call one "EN" and the other "DE", and let's say the former contains some documents in English, and the latter in German.
If I want users to be able to search both at once, I can easily load both of the databases. However, it seems like I can only use one stemmer and one set of stop words?
There's no way to instantiate an English-language stemmer and have it apply just to those results that come from the "EN" database? There's no way to create a Stopper with english words, and have it apply just to those results that come from the "EN" database?
Can this be right?
Stemming is only useful if you know the language of the text you're stemming. If you've created your Xapian databases with stemming (i.e., the Xapian databases are storing stemmed forms of the original words) then you would have specified a language.
However at search time, you also need to know the language to stem correctly. If your users enter a query in English, you must stem in English before applying the query to the English database. The same applies for German. If you want to search each database perhaps you should create two separate, language-specific queries from each user request.
However, bear in mind that a query originally entered in German but then stemmed with an English stemmer may produce some odd results - if you have any way of finding out which language your users are using at query time, you can use that to apply the correct stemmer.
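A sketch of that two-query approach with the Python bindings; the database paths and stop word lists are placeholders, and the naive merge at the end is a simplification, since percent scores from separate databases are not strictly comparable:

```python
import xapian

def make_query(user_input, lang, stopwords):
    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem(lang))                       # language-specific stemmer
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    stopper = xapian.SimpleStopper()
    for w in stopwords:                                     # language-specific stop words
        stopper.add(w)
    qp.set_stopper(stopper)
    return qp.parse_query(user_input)

def search(db_path, query, count=10):
    enquire = xapian.Enquire(xapian.Database(db_path))
    enquire.set_query(query)
    return [(m.percent, m.document.get_data()) for m in enquire.get_mset(0, count)]

user_input = "the capital cities"
results = search("EN", make_query(user_input, "english", ["the", "a"])) \
        + search("DE", make_query(user_input, "german", ["der", "die", "das"]))
results.sort(reverse=True)  # naive merge by percent score
```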
HTH - by the way, the Xapian-discuss mailing list (see www.xapian.org) is a good place to ask this kind of question.
Charlie
