Not Getting Link Response Headers in GitHub Pagination For Issues - github-api

I'm writing a quick Python app to get stats on my public GitHub project.
When I call https://api.github.com/repos/user/project/pulls, I get back some JSON, and because my project has more than 30 outstanding PRs, I also get a Link response header with the next and last URLs to call to fetch the remaining PRs.
However, when I perform a parallel query for issues with a certain label (https://api.github.com/repos/user/project/issues?labels=label&status=opened), I only get 30 back (the pagination limit), but the response header doesn't have a next Link for me to follow. I know my project has more than 30 issues that match that label.
Is this a bug in the GitHub API, or in what I'm doing? Alternatively, I don't actually care about the issues themselves, just the count of issues with that label, so is there another way to just query for the count?
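For reference, a minimal sketch of inspecting the Link header with the requests library; the repository name, the label, and the per_page=1 counting trick are illustrative assumptions, not something from the original question (note the documented issues filter is state=open rather than status=opened):

import re
import requests

BASE = "https://api.github.com/repos/user/project"   # placeholder repository

# Pull requests: the Link header appears once there are more results than per_page.
pulls = requests.get(BASE + "/pulls")
print(pulls.headers.get("Link"))        # raw header, if any
print(pulls.links.get("next"))          # requests parses it for you

# Issues filtered by label, using the documented state=open parameter.
issues = requests.get(BASE + "/issues",
                      params={"labels": "label", "state": "open", "per_page": 1})

# With per_page=1, the page number in the rel="last" link equals the total count.
last_url = issues.links.get("last", {}).get("url", "")
match = re.search(r"[?&]page=(\d+)", last_url)
print("issue count:", match.group(1) if match else len(issues.json()))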

Related

Instagram - Obtaining data in realtime

I am trying to get the recent posts from a particular location, using this URL:
https://api.instagram.com/v1/media/search?lat=34.0500&lng=-118.2500&distance=50&MAX_ID=max_id&access_token=XXXX
So when I use this URL for the first time, I get 20 results. I obtain the max ID from the list of 20 results and modify my URL accordingly.
But when I use the modified URL, I obtain the same result as the first one.
How do I go about solving this?
Contrary to what I thought, the media search endpoint doesn't return a pagination object. Sorry. It also doesn't support the min_id/max_id parameters, which is why you are having problems.
If you want to get different data you are going to have to use the time-based request parameter min_timestamp. However, it looks like that parameter doesn't work for that endpoint either (even though the documentation says it is supported). Indeed, a quick search on the internet suggests this may be a long-standing bug in the API.
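For what it's worth, a sketch of what that time-based request would look like with Python's requests library; the coordinates and token come from the question, while pairing min_timestamp with max_timestamp into a window is my own assumption for stepping back through older media:

import time
import requests

url = "https://api.instagram.com/v1/media/search"
window = 60 * 60               # look one hour back per request (arbitrary)
now = int(time.time())

params = {
    "lat": 34.0500,
    "lng": -118.2500,
    "distance": 50,
    "access_token": "XXXX",
    # Page by time instead of max_id: only ask for media created in this window.
    "min_timestamp": now - window,
    "max_timestamp": now,
}

data = requests.get(url, params=params).json()
for media in data.get("data", []):
    print(media.get("created_time"), media.get("link"))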

Best Approach to Scrape Paginated Results using import.io

There are several websites within the cruise industry that I would like to scrape.
Examples:
http://www.silversea.com/cruise/cruise-results/?page_num=1
http://www.seabourn.com/find-luxury-cruise-vacation/FindCruises.action?cfVer=2&destCode=&durationCode=&dateCode=&shipCodeSearch=&portCode=
In some scenarios, like the first one shown, the results page follows a pattern - ?page_num=1...17. However, the number of results will vary over time.
In the second scenario, the URL does not change with pagination.
At the end of the day, what I'd like to do is to get the results for each website into a single file.
Q1: Is there any alternative to setting up 17 scrapers for scenario 1 and then actively watching as results grow/shrink over time?
Q2: I'm completely stumped about how to scrape content from second scenario.
Q1 - The free tool from import.io does not have the ability to actively watch the data change over time. What you could do is have the data Bulk Extracted by the Extractor (with 17 pages this would be really fast) and added to a database. After each entry to the database, the entries could be de-duped or marked as unique. You could do this manually in Excel or programmatically.
Their Enterprise (data as a service) could do this for you.
Q2 - If there is not a unique URL for each page, the only tool that will paginate the pages for you is the Connector.
I would recommend building an extractor to get the pagination. The result of this extractor will be a list of links, each link corresponding to a page.
This way, every time you run your application and the number of pages changes, you will always get all the pages.
After that, make a call for each page to get the data you want.
Extractor 1: Get pages -- Input: The first URL
Extractor 2: Get items (data) -- Input: The result from Extractor 1
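The same two-step chain, sketched in plain Python with the requests library; the URL pattern, the parse_items() placeholder, and the CSV output are assumptions standing in for what import.io's Extractors would produce:

import csv
import requests

BASE = "http://www.silversea.com/cruise/cruise-results/?page_num={}"

def parse_items(html):
    """Placeholder: pull the result rows out of one page of HTML."""
    return []   # in a real scraper, e.g. parsed with BeautifulSoup

# Step 1 ("Extractor 1"): discover how many result pages exist right now.
pages = []
page_num = 1
while True:
    resp = requests.get(BASE.format(page_num))
    items = parse_items(resp.text)
    if resp.status_code != 200 or not items:
        break               # ran off the end of the results
    pages.append(items)
    page_num += 1

# Step 2 ("Extractor 2"): write every page's items into a single file.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for items in pages:
        writer.writerows(items)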

Infinite Scrolling on post

I need to scroll blog posts/latest news infinitely in a browser.
The way it should work: I get the first 20 posts from the server in a list and render the first one in the browser. When the user gets within x height of the end of the page, the next post from the list should load. While loading the next post I need to make calls to analytics and advertisements, and also change the browser URL to the new title. Once I reach the 20th post, I make a call to the server to get the next 20 posts, and this continues.
My question is: what libraries are available to me to make a POC of this?
How do I compare them, and which one should I choose?
I need to make this project in Node.js and I am new to Node.js. Any available demos might help too.
Since you are interested in serving the latest data, this can be achieved with server-side pagination: query the latest blog posts limited to 20, and keep track of a page cursor (i.e. where the next query should start fetching the next 20 blog posts). You are building in Node.js, so I assume your database is MongoDB (assuming a MEAN stack). You can write your own pagination logic, but why reinvent the wheel? Ready-made solutions such as mongoose-paginate are available; a sketch of the underlying query idea follows below. This completes the back-end part.
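Purely to illustrate the page-cursor idea (the answer's actual suggestion is mongoose-paginate in Node), here is the shape of the query in Python with pymongo; the collection and field names are made up:

from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts
PAGE_SIZE = 20

def latest_posts(cursor=None):
    """Return one page of posts plus the cursor for the next page."""
    query = {}
    if cursor is not None:
        # The cursor is the creation time of the last post already served:
        # everything older than it belongs to the next page.
        query["created_at"] = {"$lt": cursor}
    page = list(posts.find(query).sort("created_at", DESCENDING).limit(PAGE_SIZE))
    next_cursor = page[-1]["created_at"] if page else None
    return page, next_cursor

# first page, then the one after it
page1, cur = latest_posts()
page2, cur = latest_posts(cur)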
On the front end there are various plug-ins available for various frameworks, such as:
1 - Don't want to use any plugin:
$(window).scroll(function () {
    // within 10px of the bottom of the document, load the next chunk
    if ($(window).scrollTop() >= $(document).height() - $(window).height() - 10) {
        // add newly-crunched data at the end of the page
    }
});
2 - In Angular use angular-ui pagination or ngInfiniteScroll
3 - In jQuery use infinite-scroll or jScroll
Here is a tuts+ tutorial: How to Create Infinite Scroll Pagination.
Happy Helping!

Querying GitHub API for users with README that matches text

I would like to retrieve users with repositories that contain a README file that contains text that is matched by a string passed in the query. Is this possible using the GitHub API?
In addition, I would like to include location and language in the query.
Thanks.
This is not straightforward with the API as it is now, but you can still use it to get what you want.
Be warned that there are over 10 million repositories on GitHub, so it will take a long time. As you can only retrieve a list of 100 repositories per query, you need to use pagination: more than 100,000 requests to get all the repositories. A user is limited to 5,000 requests per hour, after which you are "banned" for another hour. This will take more than 40 hours if you're using just one user's credentials.
Steps:
Get the JSON with all the repositories (https://developer.github.com/v3/repos/#list-all-public-repositories)
Use pagination to fetch 100 objects per query (https://developer.github.com/v3/#link-header)
Decode the JSON and retrieve the list of repositories
For each repository, get the url field from the JSON object, which gives you the API link to the repository.
Now you need to get the README contents. There are two ways:
a) You use the GitHub API, sending a GET request for https://api.github.com/repos/:owner/:repo/readme (https://developer.github.com/v3/repos/contents/#get-the-readme), and then either decode the file (it is encoded using Base64) or follow the html property of the JSON, e.g. "html": "https://github.com/pengwynn/octokit/blob/master/README.md". If there is no README, you will get a 404 Not Found code, so you can easily proceed to the next repository.
b) You build the URL for the README yourself from the repository URL in step 4, e.g. turn https://api.github.com/repos/octocat/Hello-World into https://github.com/octocat/Hello-World/README.MD; however, this is more complicated, e.g. in case there is no README.
Search through the file for your specific text, and record whether or not you have found it.
Iterate until you have gone through all the repositories.
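A condensed sketch of these steps with Python's requests library; the token and search term are placeholders, and error handling is kept minimal:

import base64
import requests

TOKEN = "YOUR-TOKEN"          # authenticate to get the 5,000 requests/hour limit
HEADERS = {"Authorization": f"token {TOKEN}"}
SEARCH_TERM = "machine learning"

url = "https://api.github.com/repositories"   # list-all-public-repositories
while url:
    resp = requests.get(url, headers=HEADERS)
    for repo in resp.json():
        readme = requests.get(repo["url"] + "/readme", headers=HEADERS)
        if readme.status_code == 404:          # repository has no README
            continue
        text = base64.b64decode(readme.json()["content"]).decode("utf-8", "ignore")
        if SEARCH_TERM in text:
            print(repo["full_name"], repo["owner"]["login"])
    # follow the Link header to the next page of repositories
    url = resp.links.get("next", {}).get("url")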
Advanced: if you plan on running this more often, I strongly recommend using conditional requests (https://developer.github.com/v3/#conditional-requests). You basically store the date and time (or the ETag) of your last query and use it later to see whether anything has changed in the repository. This will eliminate many of your subsequent queries if you need up-to-date information. You will still have to retrieve the whole list of repositories, but then you only search through the repositories that have been updated. A sketch of the idea follows.
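A sketch of the conditional-request idea using ETags (the Last-Modified/If-Modified-Since pair works the same way); the in-memory cache dict is a stand-in for whatever persistent store you would actually use:

import requests

cache = {}   # url -> (etag, cached_json); persist this to disk in practice

def cached_get(url, headers=None):
    headers = dict(headers or {})
    if url in cache:
        headers["If-None-Match"] = cache[url][0]
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:          # unchanged since the last run
        return cache[url][1]             # 304s do not count against the rate limit
    cache[url] = (resp.headers.get("ETag"), resp.json())
    return cache[url][1]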
Of course, to make it faster, you can make this algorithm parallel: while you retrieve the next 100 repositories, you search the previous 100 for a README file and for the text you are looking for, and so on. This will certainly make things faster, but you will need some sort of buffer, as you do not know which part finishes first (fetching the repository list or searching through it).
Hope it helps.

YouTube API v3 search returns different results from YouTube site

I'm trying to do a search with the v3 API using this URL:
https://www.googleapis.com/youtube/v3/search?part=id,snippet&channelId=UCtVd0c0tGXuTSbU5d8cSBUg&maxResults=10&order=date&q=game&key=[API_KEY]
but this returns only one playlist.
When I do the same search directly on the YouTube site, it returns more results:
https://www.youtube.com/user/YouTubeDev/search?query=game
Why does this happen? Is there something wrong with what I'm doing?
We ran into a similar issue when we tried to search for large amounts of content. It is especially evident if you set the time range you're looking for (using publishedAfter and publishedBefore) to a very small range, say one hour. You can only paginate around 20 times on the API using pageToken (at least back when we tried it), so even with very small result sets, where totalResults was less than 1,000, we were actually finding as few as 540 items.
We reached out to YouTube and our contacts there confirmed that totalResults is just an estimate and is not actually accurate. You may get up to the number of items specified, but there is no guarantee that you will get exactly that many. Your best bet is to capture as much as you can and scan for data using a different time range.
Source: Reddit
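A sketch of that approach (paging with pageToken inside a narrow publishedAfter/publishedBefore window) using Python's requests library; the channel ID and query come from the question, while the window dates and API key are placeholders:

import requests

URL = "https://www.googleapis.com/youtube/v3/search"
params = {
    "part": "id,snippet",
    "channelId": "UCtVd0c0tGXuTSbU5d8cSBUg",
    "q": "game",
    "maxResults": 50,
    "publishedAfter": "2015-01-01T00:00:00Z",
    "publishedBefore": "2015-01-02T00:00:00Z",
    "key": "API_KEY",
}

items = []
while True:
    data = requests.get(URL, params=params).json()
    items.extend(data.get("items", []))
    token = data.get("nextPageToken")
    if not token:                 # no more pages in this time window
        break
    params["pageToken"] = token

# totalResults is only an estimate; count what actually came back.
print(len(items), "items collected for this window")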
In the first one you are using the search->list method, which is searching for channels?
In the second one you are doing a playlist search inside the channel.
You can do the same on API via playlists->list.
(Or if you want the videos inside the channel straight, use videos->list)
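If playlists are indeed what you are after, a small sketch of that playlists->list call with Python's requests library; the API key is a placeholder, and the title filtering is done client-side since playlists->list has no q parameter:

import requests

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/playlists",
    params={
        "part": "snippet",
        "channelId": "UCtVd0c0tGXuTSbU5d8cSBUg",
        "maxResults": 50,
        "key": "API_KEY",
    },
).json()

for item in resp.get("items", []):
    title = item["snippet"]["title"]
    if "game" in title.lower():
        print(title, item["id"])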
Might be a bug. If so and not yet filed, you can file it here: https://code.google.com/p/gdata-issues/issues/list?q=label%3aAPI-YouTube
The problem seems to be caused by the parameter order=date.
Adding order to the "YouTube query" (using the channel), https://www.youtube.com/channel/UCtVd0c0tGXuTSbU5d8cSBUg/search?query=game&order=date , makes no difference. However, omitting order from the "API request" gives the same result (6 items): https://www.googleapis.com/youtube/v3/search?part=id,snippet&channelId=UCtVd0c0tGXuTSbU5d8cSBUg&maxResults=10&q=game&key=YOUR-API-KEY-HERE
Note that with order=date in the API request only 1 item is shown, while the same response shows "totalResults": 6 (which seems to be right). I did not try all the options, but order=relevance does not give this problem.
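A quick way to reproduce that comparison with Python's requests library; the API key is a placeholder:

import requests

URL = "https://www.googleapis.com/youtube/v3/search"
base = {
    "part": "id,snippet",
    "channelId": "UCtVd0c0tGXuTSbU5d8cSBUg",
    "maxResults": 10,
    "q": "game",
    "key": "API_KEY",
}

# Run the same search without order, then with order=date, and compare counts.
for extra in ({}, {"order": "date"}):
    data = requests.get(URL, params={**base, **extra}).json()
    print(extra or "no order",
          "items:", len(data.get("items", [])),
          "totalResults:", data["pageInfo"]["totalResults"])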

Resources