Strange results from DBpedia lookup API for common words - dbpedia

I ran the keyword and prefix search for some generic keywords like it, there, he, etc.
The most amazing part about these was that it gave wrong results and took around 10 times more time to process the request than some named entities like Nokia, Samsung, McDonald's.
Can anyone explain the weird results I get for these keywords
it ====> http://dbpedia.org/resource/United_States
there ====> http://dbpedia.org/resource/United_States
Why are the results wrong and why does it take so much time to process these requests?

I wonder what kind of results you were looking for with a query like "there" or "it"?
In the context of search engine terms these are often referred to as stop words and are sometimes ignored completely due to the fact they are so common that they add very little relevance to the search query or result. I think actually this is what the lookup tool does now as I do not get the same results you mentioned.
Why did the query take longer? This is likely because the words are very frequent and a query for them returns many more results. This means the search engine has more work to do in figuring out the most relevant result.
Why is United_States the top result? Probably because the wiki page for United_States is the highest ranked in terms of inbound links from other Wikipedia pages. This is the heart of the relevance algorithm used within the lookup tool. Essentially there are more links with the words "there", "it", etc pointing to United_States than any other page, so it is judged to be the most relavent for those terms.

Related

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

How to extract meaningful keywords from a query?

I am working on a project on web intelligence in which I have to build a system which accepts user query and extract meaningful keywords. Say for example user enters a query "How to do socket programming in Java", then I have to ignore "how", "to", "do", "in" and take "socket", "programming", "java" for further processing and clustering e.g. socket and programming are two different meaningful keyword but can be used together as keyword which produce different meaning. I am looking for some algorithm like TF-IDF to approach this problem. Any help will be appreciated.
Well what you are looking into a text analytics solution.
I have only used R for this purpose but one way to look at it is you need a list of words that you consider not meaningful keywords, this is often called "stop words". You can find online lists of stop words for almost any popular language. After doing this you might want to get a couple hundred inputs and calculate the frequency of every keyword there (having already removed stop words, as well as punctuation and having all text in lower-case) and try to identify other keywords that you think are irrelevant and add them to your list of words to remove.
After this there are a ton of options you can explore; an example would be stemming which is getting the core term of each word so that "pages" and "page" are considered the same keyword. (as you go deeper you will find a ton of stuff online to fine-tune your approach)
Hope this helps.

What's the best way to tune my Foursquare API search queries?

I'm getting some erratic results from Foursquare's venue search API and I'm wondering if anyone has any tips on how to process my input parameters for the most "intuitive" results.
For example, suppose I am searching for a venue called "Ise Sushi", around "New York, NY", which is equivalent to (lat: 40.7143528, lon: -74.00597309999999) using Google Maps API. Plugging into the Foursquare Venue API, we get:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7143528%2C-74.00597309999999
This yields pretty underwhelming results: the venue I'm looking for ends up rather far down the list, at 11th place. What's interesting is that reducing the precision of the coordinates appears to produce much better results. For example, suppose we were to round the coordinates to 3 significant digits:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7%2C-74.0
This time, the venue I'm looking for ends up in 2nd place, even though it is actually farther from the center of the search (1072 meters, vs. 833 meters using the first query).
Another modification that appears to help improve the quality of search is substituting underscores for spaces to separate our search terms. For example, here's the original query with underscores:
https://api.foursquare.com/v2/venues/search?query=ise_sushi&ll=40.7143528%2C-74.00597309999999
This produces the most intuitive-seeming results: the venue I'm looking for appears first, and is accompanied by just one other result, "Ise Restaurant" (which is tagged as a "sushi restaurant"). For what it's worth, this actually seems to be the result set of the same search conducted on Foursquare's own website.
I'm curious what lessons I should be learning from this. Should I be reducing the precision of my coordinates? Should I be connecting my search terms with underscores, and if so, does that limit how a user can order their search terms?
Although there are ranking improvements we can make on our end to find this distant exact match, it generally also helps to specify intent=browse (although it looks like in this case, for now, it may give you worse results). By default, /venues/search uses intent=checkin, which tries really hard to find close-by matches for checking in to, at the expense of other ways a venue might match your search. Learn more at https://developer.foursquare.com/docs/venues/search

"Did you mean?" feature in Lucene.net

Can someone please let me know how do I implement "Did you mean" feature in Lucene.net?
Thanks!
You should look into the SpellChecker module in the contrib dir. It's a port of Java lucene's SpellChecker module, so its documentation should be helpful.
(From the javadocs:)
Example Usage:
import org.apache.lucene.search.spell.SpellChecker;
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
AFAIK Lucene supports proximity-search, meaning that if you use something like:
field:stirng~0.5
(it s a tilde-sign)
will match "string". the float is how "tolerant" the search would be, where 1.0 is exact match and 0.0 is match everything (sort of).
Different parsers will however implement this differently.
A proximity-search is much slower than a fuzzy-search (stri*) so use it with caution. In your case, one would assume that if you find no matches on a regular search, you try a proximity-search to see what you find, and present "did you mean" based on the result somehow.
Might be useful to cache this sort of lookups for very common mispellings, for performance reasons.
Google's "Did you mean?" is (probably; they're secretive, of course) implemented by consulting their query log. Look to see if people who searched for the query you're processing searched for something very similar soon after; if so, it indicates they made a mistake, and realized what they ought to be searching for.
Since you probably don't have a huge query log, you could approximate it. Take the query, split up the terms, see if there are any similar terms in the database (by edit distance, whatever); replace your terms with those nearby terms, and rerun the query. If you get more hits, that was probably a better query. Suggest it to the user. (And since you've already got the hits, and most people only look at the top 2 results, show them those.)
Take a look at google code project called semanticvectors.
There's a decent amount of discussion on the Lucene mailing lists for doing functionality like what you're after using it - however it is written in java.
You will probably have to parse and use some machine learning algorithms on your search logs to build a feature like this!

Ways to do "related searches" functionality

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering the best way to model this in a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query, then when a new search is performed to find all the historical searches that match some amount of the top 10 results but ideally not matching all of them (matching all of them might suggest an equivalent search and hence not that useful as a suggestion).
I imagine that some people have done this functionality before and may be able to provide some ideas of different ways to do this. I'm not necessarily looking for one winning idea since the solution will no doubt vary substantially depending on the size and nature of the site.
have you considered a matrix of with keywords on 1 axis vs. documents on another axis. once you find the set of vetors representing the keywords, find sets of keyword(s) found in your initial result set and then find a way to rank the other keywords by how many documents they reference or how many times they interset the intial result set.
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (eg: with the Porter Stemming Algorithm and then use the stemmed terms instead of the initially provided queries, and given the user the option of searching for exactly the terms they specified (or do the opposite, search the exact terms first, and use stemming to find the terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexing term finds them.)
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)

Resources