Alternative to intent=match? - foursquare

I'm trying to match existing data to Foursquare Venues. I've tried matching about 100,000 records using intent=match and 30% of them don't return results. Now, sometimes these venues are actually missing, but sometimes the search just isn't finding results that would be obvious to a human. For example:
https://api.foursquare.com/v2/venues/search?intent=match&ll=40.075800000000001,-80.698800000000006&query=19%20TH%20HOLE
That returns no results. However, if I search for "19TH HOLE" I do get a result.
I could just add all these non-matches to Foursquare, but it seems that I'd end up creating a whole lot of duplicates... and I don't want to abuse the system. We're trying to make Foursquare our Venues database, and we can't go and process 300,000 records without matches by hand, either.
I'm open to suggestions on what else I can do.

You can "relax" the search strictness by specifying intent=checkin or intent=browse and using your own criteria to determine if the top result is the one you're looking for.
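As a rough illustration of "your own criteria", here is a minimal sketch that compares the name you have on record with the name of the top result. The helper names and the 0.85 threshold are assumptions made for this example, not part of the Foursquare API. It only covers the comparison step and performs no API call.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def is_probable_match(wanted: str, candidate: str, threshold: float = 0.85) -> bool:
    """Return True when two venue names look alike after normalization.

    The threshold is an arbitrary starting point (assumption); tune it
    against a hand-checked sample of your own records.
    """
    a, b = normalize(wanted), normalize(candidate)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

With this normalization, "19 TH HOLE" and "19TH HOLE" reduce to the same string, so a spacing difference like the one in the question no longer blocks the match. In practice you would probably combine the name check with a distance check on the coordinates.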

Related

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a query returns only 10 listings I would want to display them all instead of filtering any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... It also has the same problems as the first option.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and don't say anything about "this was xyz relevant!".
You'll have to decide why the irrelevant documents are being included and exclude them based on those criteria. Then you can either use the review score to boost documents further up (if you want the search to appear organic, i.e. ordered by relevance), or simply exclude the irrelevant documents and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to get right than just ordering by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
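A minimal sketch of what such a multiplicative boost could look like, assuming the eDisMax query parser and a numeric field named `rating` on a 0 to 5 scale. The field name, the scale, and the 0.1 weight are all assumptions for illustration. The helper only builds request parameters and does not contact Solr.

```python
from urllib.parse import urlencode


def boosted_query_params(user_query: str, rating_field: str = "rating") -> dict:
    """Build Solr request parameters that multiply the text score by a small rating boost."""
    return {
        "defType": "edismax",
        "q": user_query,
        # eDisMax multiplies the relevance score by this function query.
        # With a 0-5 rating the factor ranges from 1.0 to 1.5 (assumed weight).
        "boost": f"sum(1,product(0.1,{rating_field}))",
        # Return the final score so the effect of the boost can be inspected.
        "fl": "*,score",
    }


if __name__ == "__main__":
    print(urlencode(boosted_query_params("pizza")))
```

Keeping the factor close to 1 means the rating only reorders documents with similar text scores. How strong the boost should be is something to settle by evaluation, as noted above.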
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular, where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as a module.

Strange results from DBpedia lookup API for common words

I ran the keyword and prefix search for some generic keywords like it, there, he, etc.
The most surprising part was that these queries returned wrong results and took around 10 times longer to process than queries for named entities like Nokia, Samsung, or McDonald's.
Can anyone explain the weird results I get for these keywords?
it ====> http://dbpedia.org/resource/United_States
there ====> http://dbpedia.org/resource/United_States
Why are the results wrong and why does it take so much time to process these requests?
I wonder what kind of results you were looking for with a query like "there" or "it"?
In the context of search engine terms these are often referred to as stop words and are sometimes ignored completely due to the fact they are so common that they add very little relevance to the search query or result. I think actually this is what the lookup tool does now as I do not get the same results you mentioned.
Why did the query take longer? This is likely because the words are very frequent and a query for them returns many more results. This means the search engine has more work to do in figuring out the most relevant result.
Why is United_States the top result? Probably because the wiki page for United_States is the highest ranked in terms of inbound links from other Wikipedia pages. This is the heart of the relevance algorithm used within the lookup tool. Essentially there are more links with the words "there", "it", etc. pointing to United_States than to any other page, so it is judged to be the most relevant for those terms.
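If you want to avoid sending such terms to the lookup service at all, one option is to drop stop words on the client side first. This is only a sketch: the word list below is a tiny hand-picked sample made up for this example, not the list used by the DBpedia lookup tool.

```python
# A small illustrative stop-word list (assumption, not DBpedia's own list).
STOP_WORDS = frozenset({"it", "there", "he", "she", "the", "a", "an", "of", "and"})


def meaningful_terms(query: str) -> list:
    """Return the lowercase query terms that are not stop words."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]
```

An empty result, as for the query "it", is a signal to skip the lookup rather than spend a slow request on a term that carries almost no meaning.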

What's the best way to tune my Foursquare API search queries?

I'm getting some erratic results from Foursquare's venue search API and I'm wondering if anyone has any tips on how to process my input parameters for the most "intuitive" results.
For example, suppose I am searching for a venue called "Ise Sushi", around "New York, NY", which is equivalent to (lat: 40.7143528, lon: -74.00597309999999) using Google Maps API. Plugging into the Foursquare Venue API, we get:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7143528%2C-74.00597309999999
This yields pretty underwhelming results: the venue I'm looking for ends up rather far down the list, at 11th place. What's interesting is that reducing the precision of the coordinates appears to produce much better results. For example, suppose we were to round the coordinates to 3 significant digits:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7%2C-74.0
This time, the venue I'm looking for ends up in 2nd place, even though it is actually farther from the center of the search (1072 meters, vs. 833 meters using the first query).
Another modification that appears to help improve the quality of search is substituting underscores for spaces to separate our search terms. For example, here's the original query with underscores:
https://api.foursquare.com/v2/venues/search?query=ise_sushi&ll=40.7143528%2C-74.00597309999999
This produces the most intuitive-seeming results: the venue I'm looking for appears first, and is accompanied by just one other result, "Ise Restaurant" (which is tagged as a "sushi restaurant"). For what it's worth, this actually seems to be the result set of the same search conducted on Foursquare's own website.
I'm curious what lessons I should be learning from this. Should I be reducing the precision of my coordinates? Should I be connecting my search terms with underscores, and if so, does that limit how a user can order their search terms?
Although there are ranking improvements we can make on our end to find this distant exact match, it generally also helps to specify intent=browse (although it looks like in this case, for now, it may give you worse results). By default, /venues/search uses intent=checkin, which tries really hard to find close-by matches for checking in to, at the expense of other ways a venue might match your search. Learn more at https://developer.foursquare.com/docs/venues/search
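To compare intents and coordinate precision side by side, a small URL builder can help. The sketch below uses the endpoint and parameters that appear in the question. The function name and the `decimals` argument are made up for this example, and authentication parameters are left out.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"


def build_search_url(query, lat, lon, intent="browse", decimals=None):
    """Build a venue search URL with an explicit intent.

    If `decimals` is given, the coordinates are rounded to that many
    decimal places, mirroring the reduced-precision experiment above.
    """
    if decimals is not None:
        lat, lon = round(lat, decimals), round(lon, decimals)
    params = {"query": query, "ll": f"{lat},{lon}", "intent": intent}
    # quote_via=quote encodes spaces as %20, matching the URLs in the question.
    return f"{SEARCH_ENDPOINT}?{urlencode(params, quote_via=quote)}"


if __name__ == "__main__":
    for intent in ("checkin", "browse"):
        print(build_search_url("ise sushi", 40.7143528, -74.00597309999999, intent=intent))
```

Running the same query with each intent and checking where the expected venue lands is a simple way to decide which setting suits your data.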

Why does an exact match on a name return a useless set of venues?

This doesn't make much sense to me, and I'm hoping someone can shed some light on what's going on here and how I work around it.
If I query like this:
https://api.foursquare.com/v2/venues/search?ll=37.77%2C-122.41&radius=15000&intent=browse&oauth_token=xxx&limit=20&query=pi%20ba
I get a list of about 15 items, including the item I'm searching for (pi bar). However, if I search for the exact match name:
https://api.foursquare.com/v2/venues/search?ll=37.77%2C-122.41&radius=15000&intent=browse&oauth_token=xxx&limit=20&query=pi%20bar
I just get back the blanket list of venues within this area (mostly BART stops, etc.)
Is it expected that I should have to shave the last character off of user entered queries to get results back, or is this just a messed up venue name that I've been debugging with?
I'm not sure if this may help, but I've discovered placing an "and" between words in your query can produce more accurate results:
Searching for Chili's Bar & Grill
The first query has extraneous results:
https://api.foursquare.com/v2/venues/search?ll=34.07527923583984,-84.29469299316406&radius=5000&query=chili's bar grill&oauth_token=xxx&v=20111205
The second is much more accurate (although I've removed the ampersand: &)
https://api.foursquare.com/v2/venues/search?ll=34.07527923583984,-84.29469299316406&radius=5000&query=chili's and bar and grill&oauth_token=xxx&v=20111205
There's a known issue with quality of bigram matches in foursquare venue searches -- your query term includes a very popular word ("bar") which skews the results. The search team is working on quality improvements for these sorts of queries.

Twitter search API: get more results and since a specified date

I’m working on a simple search tool for Twitter where I want to extract all search results for a word (or words) since the dawn of time (or at least of Twitter). Is that possible?
I can only retrieve 100 results ordered by most recent tweets, but I want to see, for example, how many times “facebook” has been tweeted in the last month, the last year, and so on.
I’ve tried this URL: http://search.twitter.com/search.atom?lang=en&since=2006-01-01&rpp=1000&q=facebook but it still doesn’t give me more than 100 results. I’ve read that the rpp parameter has a maximum of 100 but is there a way to “scroll” through the list and get all results?
Offsets and pagination are probably your best bet.
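A minimal sketch of paging through results, assuming the search endpoint from the question accepts a `page` parameter alongside `rpp`. The `page` parameter is an assumption here, so check it against the API documentation. The generator only builds URLs and does not fetch anything.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://search.twitter.com/search.atom"


def page_urls(term, pages, rpp=100):
    """Yield one search URL per result page, starting at page 1.

    `page` is an assumed parameter name; `rpp` is capped at 100
    according to the question.
    """
    for page in range(1, pages + 1):
        params = {"q": term, "rpp": rpp, "page": page}
        yield f"{SEARCH_URL}?{urlencode(params)}"
```

You would request each URL in turn and stop as soon as a page comes back empty.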
