I'm working on a simple search tool for Twitter where I want to extract all search results for a word (or words) since the dawn of time (or at least since Twitter began). Is that possible?
I can only retrieve 100 results, ordered by most recently added tweets, but I want to see, for example, how many times "facebook" has been tweeted in the last month, the last year, and so on.
I've tried this URL: http://search.twitter.com/search.atom?lang=en&since=2006-01-01&rpp=1000&q=facebook but it still doesn't give me more than 100 results. I've read that the rpp parameter has a maximum of 100, but is there a way to "scroll" through the list and get all the results?
Offsets and pagination are probably your best bet.
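Each request is still capped at 100 results, but the legacy search API historically accepted a page parameter alongside rpp, so you can walk through the pages one at a time. A rough sketch, assuming the JSON flavour of the same endpoint and the page/rpp parameters are still available to you; note that this index only ever covered recent tweets, so it won't reach back to 2006:

    import requests

    def search_tweets(query, max_pages=15):
        """Collect up to max_pages * 100 recent tweets for a query."""
        results = []
        for page in range(1, max_pages + 1):
            resp = requests.get(
                "http://search.twitter.com/search.json",
                params={"q": query, "lang": "en", "rpp": 100, "page": page},
            )
            resp.raise_for_status()
            batch = resp.json().get("results", [])
            if not batch:
                break  # ran out of pages
            results.extend(batch)
        return results

    print(len(search_tweets("facebook")))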
First time posting.
I wanted to ask if anyone knows how I can search on YouTube for, let's say, music videos that have been viewed a number of times within a set range. Like the title says, for example, between 9 and 11 million times.
One reason I want to do this is that I want to find good music that I haven't heard before. The logic I'm working on is that the Got Talent-type videos that get viewed millions of times are generally viewed that many times for one of two reasons: 1) they're amazing, or 2) they're embarrassingly horrible.
And though I don't think a song being popular will necessarily mean I'll like it, I'm hoping this method will be successful to some degree.
Another reason is to look for trailers for independent films, with similar logic to the above. With these movies, though, I think I only hear about them six months to a year after they've been released, because they're flying under the radar.
If I were able to search for movie trailers with 'x' number of views, though (for example, between 500,000 and a million), maybe I'd be able to find movies I'll like more quickly than by waiting for a friend to mention them.
Any help would be greatly appreciated, as I've wanted to be able to perform these kinds of searches for a while now.
Thanks
You will need to use the YouTube Data API v3.
I haven't written this exact request, but it looks like you can list videos and then filter by 'chart' = 'mostPopular'.
https://developers.google.com/youtube/v3/docs/videos/list
Perhaps a bit of background reading on the API would help too...
https://developers.google.com/youtube/v3/
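I haven't run this exact call either, but something along these lines should work against the videos.list endpoint. YOUR_API_KEY and the region/category are placeholders, and the 9-11 million range check is applied on our side, since the API itself has no view-count filter:

    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder

    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={
            "part": "snippet,statistics",
            "chart": "mostPopular",
            "videoCategoryId": "10",  # assumption: category 10 is Music
            "regionCode": "US",
            "maxResults": 50,
            "key": API_KEY,
        },
    )
    resp.raise_for_status()

    # Keep only videos whose view count falls inside the range from the question.
    for item in resp.json().get("items", []):
        views = int(item["statistics"]["viewCount"])
        if 9_000_000 <= views <= 11_000_000:
            print(f"{views:>12,}  {item['snippet']['title']}")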
First off, you would need the YouTube Data API. "v3" means little on its own; it's simply the current version, like "Windows 10."
The API lets you get a video's view count, but it doesn't let you filter by a range like 9 million to 11 million.
YouTube's own search function is pretty sophisticated. For instance,
https://www.youtube.com/results?search_query=movie+trailer&search_sort=video_view_count&filters=month. This gives all results for "movie trailer" within the last month, sorted by view count. You can customize the URL, e.g. "week" instead of "month" would return only trailers from the last week, or "year", etc. Essentially this is a "Videos: List: MostPopular" query with a subject filter.
I have a few YouTube API scripts, and I hardly think it's worth the hassle to do it that way when YouTube's advanced search gets you 99% of the way there. If you did, you would need to do a Search:list query for a given subject (e.g. "movie trailer"), limited to a given time frame (e.g. last month). Then, for each video ID, make a Videos:list query to get its view count. Then print them all, sorted by views.
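If you did want to script it anyway, the flow above maps to two REST calls. A hedged sketch, with YOUR_API_KEY as a placeholder and the subject and time frame hard-coded as examples:

    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder
    BASE = "https://www.googleapis.com/youtube/v3"

    # Step 1: Search:list for the subject, limited to a time frame, ordered by views.
    search = requests.get(f"{BASE}/search", params={
        "part": "snippet",
        "q": "movie trailer",
        "type": "video",
        "order": "viewCount",
        "publishedAfter": "2016-01-01T00:00:00Z",  # example time frame
        "maxResults": 50,
        "key": API_KEY,
    }).json()

    video_ids = ",".join(item["id"]["videoId"] for item in search.get("items", []))

    # Step 2: Videos:list to fetch statistics (including viewCount) for those IDs.
    videos = requests.get(f"{BASE}/videos", params={
        "part": "snippet,statistics",
        "id": video_ids,
        "key": API_KEY,
    }).json()

    # Step 3: print them all, sorted by views.
    for item in sorted(videos.get("items", []),
                       key=lambda v: int(v["statistics"]["viewCount"]),
                       reverse=True):
        print(item["statistics"]["viewCount"], item["snippet"]["title"])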
Is it possible to search venues (via venues/search) in a whole city without passing the "radius" parameter? I ask because I don't know the radius of each city :) The documentation says "Searches can be done near a point or through a whole city", but how do I express that in venues/search?
Thanks.
I do not think there is a way to tell it "search the entire city", but I also think that might be the wrong use case.
You need to remember a few things when searching:
Foursquare will return up to 50 results (the limit parameter)
The 50 results are ordered by the most popular places around the center of your search
So if you are searching a city which has more than 50 venues in the Foursquare database, "searching the entire city" will always return the same (up to) 50 results.
This is where the filters come in handy. In our case, to get better results for our needs, we use categoryId combined with radius to get the things we want to show our users. Sometimes we get results from other cities because of a big radius, but for our application that's okay; we actually give our customers more options :) I can also guess that a lot of apps use the query filter, since they know the name of the place they are looking for.
You just need to experiment with it and discover how to get the data that is right for your application.
In theory, to search an entire city I would take the city's lat/lng from Google, OpenStreetMap, or GeoNames and do a 10 km search around that point (intent=browse, radius=10000). This is a guess, but it should get you 50 places for over 99% of the cities that smartphone owners live in :)
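A rough, untested sketch of that idea: the coordinates below are roughly one city centre taken from a geocoder, the credentials are placeholders, and v is whatever version date you use:

    import requests

    resp = requests.get("https://api.foursquare.com/v2/venues/search", params={
        "ll": "1.3521,103.8198",      # roughly the centre of Singapore, from a geocoder
        "intent": "browse",
        "radius": 10000,               # 10 km around that point
        "limit": 50,                   # the maximum Foursquare will return
        # add "categoryId": "..." here to narrow to the categories you care about
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "v": "20140601",               # version date placeholder
    })
    resp.raise_for_status()

    for venue in resp.json()["response"]["venues"]:
        print(venue["name"], venue["location"].get("city", ""))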
You can obtain results within a city as follows:
https://api.foursquare.com/v2/venues/search?near=Singapore,Singapore&client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&v=YYYYMMDD
For more details check the documentation:
https://developer.foursquare.com/docs/venues/explore
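If it helps, the same request expressed with the Python requests library (credentials and version date are placeholders):

    import requests

    resp = requests.get("https://api.foursquare.com/v2/venues/search", params={
        "near": "Singapore, Singapore",       # let Foursquare geocode the city for you
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "v": "20140601",                      # version date placeholder
    })
    print([v["name"] for v in resp.json()["response"]["venues"]])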
Assuming you're talking about requests with a query, I would just set a reasonable value for radius and center the search on the city's default center point. If you want to avoid showing results from neighboring cities, you can filter the response afterwards by the returned venue's "city" string in the location stanza.
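A sketch of that post-filtering idea, with placeholder credentials and New York's centre as an example point; the "coffee" query and 10 km radius are just illustrative:

    import requests

    resp = requests.get("https://api.foursquare.com/v2/venues/search", params={
        "query": "coffee",                  # example query
        "ll": "40.7143528,-74.0059731",     # New York, NY centre point
        "intent": "browse",
        "radius": 10000,                    # generous radius; may spill into neighbours
        "limit": 50,
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "v": "20140601",
    })
    venues = resp.json()["response"]["venues"]

    # Keep only venues whose location stanza says they are in the city we want.
    in_city = [v for v in venues if v.get("location", {}).get("city") == "New York"]
    for v in in_city:
        print(v["name"])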
I'm getting some erratic results from Foursquare's venue search API and I'm wondering if anyone has any tips on how to process my input parameters for the most "intuitive" results.
For example, suppose I am searching for a venue called "Ise Sushi", around "New York, NY", which is equivalent to (lat: 40.7143528, lon: -74.00597309999999) using Google Maps API. Plugging into the Foursquare Venue API, we get:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7143528%2C-74.00597309999999
This yields pretty underwhelming results: the venue I'm looking for ends up rather far down the list, at 11th place. What's interesting is that reducing the precision of the coordinates appears to produce much better results. For example, suppose we were to round the coordinates to 3 significant digits:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7%2C-74.0
This time, the venue I'm looking for ends up in 2nd place, even though it is actually farther from the center of the search (1072 meters, vs. 833 meters using the first query).
Another modification that appears to help improve the quality of search is substituting underscores for spaces to separate our search terms. For example, here's the original query with underscores:
https://api.foursquare.com/v2/venues/search?query=ise_sushi&ll=40.7143528%2C-74.00597309999999
This produces the most intuitive-seeming results: the venue I'm looking for appears first, and is accompanied by just one other result, "Ise Restaurant" (which is tagged as a "sushi restaurant"). For what it's worth, this actually seems to be the result set of the same search conducted on Foursquare's own website.
I'm curious what lessons I should be learning from this. Should I be reducing the precision of my coordinates? Should I be connecting my search terms with underscores, and if so, does that limit how a user can order their search terms?
Although there are ranking improvements we can make on our end to find this distant exact match, it generally also helps to specify intent=browse (although it looks like in this case, for now, it may give you worse results). By default, /venues/search uses intent=checkin, which tries really hard to find close-by matches for checking in to, at the expense of other ways a venue might match your search. Learn more at https://developer.foursquare.com/docs/venues/search
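For comparison, a minimal sketch that runs the same query both ways, with placeholder credentials; a radius is included since intent=browse generally expects a search area:

    import requests

    def names(intent):
        resp = requests.get("https://api.foursquare.com/v2/venues/search", params={
            "query": "ise sushi",
            "ll": "40.7143528,-74.00597309999999",
            "intent": intent,
            "radius": 2000,               # browse searches generally want an area
            "client_id": "YOUR_CLIENT_ID",
            "client_secret": "YOUR_CLIENT_SECRET",
            "v": "20140601",
        })
        return [v["name"] for v in resp.json()["response"]["venues"]]

    print("checkin:", names("checkin"))   # the default, tuned for nearby check-ins
    print("browse:", names("browse"))     # broader matching within the area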
I have a set of approximately 10 million search queries. The goal is to collect the number of hits returned by a search engine for each of them. For example, Google returns about 47,500,000 hits for the query "stackoverflow".
The problems are:
1- The Google API is limited to 100 queries per day. This is far from useful for my task, since I would have to get a very large number of counts.
2- I used the Bing API, but it does not return an accurate number (accurate in the sense of matching the number of hits shown in the Bing UI). Has anyone come across this issue before?
3- Issuing search queries to a search engine and parsing the HTML is one solution, but it triggers CAPTCHAs and does not scale to this number of queries.
All I care about is the number of hits, and I am open to any suggestions.
Well, I was really hoping that someone would answer this, since it's something I was also interested in finding out, but since it doesn't look like anyone will, I'll throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query Google as seemingly different people (which seems like a lot of work). Or you could download Wikipedia and write something to parse the data so that when you search a term you can see how many pages it appears in. Of course, that is a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages the search terms appear in. Maybe a combination of these methods could boost the accuracy for any given search term.
Certainly none of these methods are as good as if you could just get the Google page counts directly, but understandably that is data they don't want to give out for free.
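To make the n-grams idea concrete, here is a rough sketch of tallying a term's total occurrences from a downloaded Google Books 1-gram file. The filename is a placeholder, and the tab-separated layout (ngram, year, match_count, volume_count) is an assumption about the file you end up with:

    import gzip
    from collections import defaultdict

    queries = {"stackoverflow", "facebook"}      # the terms you want counts for
    counts = defaultdict(int)

    # Placeholder filename; the real corpus is split across many such files.
    with gzip.open("googlebooks-eng-all-1gram.gz", "rt", encoding="utf-8") as f:
        for line in f:
            ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
            token = ngram.split("_")[0].lower()  # drop part-of-speech tags like "_NOUN"
            if token in queries:
                counts[token] += int(match_count)

    for term, total in sorted(counts.items()):
        print(term, total)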
I see this is a very old question, but I was trying to do the same thing, which brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this, you need to go to page 2 of the results and then modify the 'first' parameter in the URL to go way higher. Doing this may allow you to find the end of the result set (this worked for me last year, I'm fairly sure, although today it only worked up to the first few thousand). Even if it doesn't let you reach the end of the result set, you will see that the estimate gets better as the query engine considers more hits.
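Purely illustrative, since the behaviour changes over time: this just builds the kind of deep-paged URLs described above by bumping the 'first' offset, so you can probe where the result set actually ends and how the estimate tightens:

    from urllib.parse import urlencode

    query = "stackoverflow"

    # Jump further and further into the result set; open (or fetch) these and watch
    # where results stop and how the reported estimate changes.
    for offset in (10, 100, 1000, 5000, 10000):
        print("https://www.bing.com/search?" + urlencode({"q": query, "first": offset}))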
I found Bing slightly easier to use in the above way, but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of its engine, which isn't that surprising. Bing also seems to hit limits, although they looked more like defects.
For my use case, I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323, whereas Bing went up to 700; both are wildly inaccurate, but that's not surprising, since this is not the intended use of the product.
If you want to do it for your own site, you can use the search engine's webmaster tools to view the indexed page count. For other sites, I think you'd need to use the search engine's API (at some cost).
I'm trying to match existing data to Foursquare Venues. I've tried matching about 100,000 records using intent=match and 30% of them don't return results. Now, sometimes these venues are actually missing, but sometimes the search just isn't finding results that would be obvious to a human. For example:
https://api.foursquare.com/v2/venues/search?intent=match&ll=40.075800000000001,-80.698800000000006&query=19%20TH%20HOLE
That returns no results. However, if I search for "19TH HOLE" I do get a result.
I could just add all these non-matches to Foursquare, but it seems that I'd end up creating a whole lot of duplicates... and I don't want to abuse the system. We're trying to make Foursquare our Venues database, and we can't go and process 300,000 records without matches by hand, either.
I'm open to suggestions on what else I can do.
You can "relax" the search strictness by specifying intent=checkin or intent=browse and using your own criteria to determine if the top result is the one you're looking for.