Querying the GitHub API for users with a README that matches text - github-api

I would like to retrieve users whose repositories contain a README file matching a string passed in the query. Is this possible using the GitHub API?
In addition, I would like to include location and language in the query.
Thanks.

This is not straightforward with the current API; there is no single search endpoint for it. However, you can combine existing endpoints to get what you want.
Be warned that there are over 10 million repositories on GitHub, so this will take a long time. You can only retrieve 100 repositories per query, so you need pagination - that is more than 100,000 requests just to list them all. An authenticated user is limited to 5,000 requests per hour; once you hit the limit you have to wait out the rest of the hour. With a single user's credentials this will take more than 40 hours.
Steps:
1. Get the JSON with all the repositories (https://developer.github.com/v3/repos/#list-all-public-repositories).
2. Use pagination to fetch 100 objects per query (https://developer.github.com/v3/#link-header).
3. Decode the JSON and retrieve the list of repositories.
4. For each repository, read its url property from the JSON, which gives you the link to the repository, e.g. https://api.github.com/repos/octocat/Hello-World.
5. Now you need to get the README contents. There are two ways:
a) Use the GitHub API: take the repo URL and send a GET request to https://api.github.com/repos/:owner/:repo/readme (https://developer.github.com/v3/repos/contents/#get-the-readme), then either decode the file (it is Base64-encoded) or follow the html property of the JSON, e.g. "html": "https://github.com/pengwynn/octokit/blob/master/README.md". If there is no README, you will get a 404 Not Found code, so you can easily proceed to the next repository.
b) Build the README URL yourself from the repository URL in step 4, e.g. turn https://api.github.com/repos/octocat/Hello-World into https://github.com/octocat/Hello-World/README.MD; however, this is more fragile when there is no README.
6. Search through the file for your specific text, and record whether you have found it.
7. Iterate until you have gone through all the repositories (a sketch of the whole loop follows below).
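A minimal Python sketch of steps 1-7, assuming the requests library and a personal access token; YOUR_TOKEN and SEARCH_TEXT are placeholders:
import base64
import requests

TOKEN = "YOUR_TOKEN"          # placeholder - use your own credentials
SEARCH_TEXT = "some text"     # placeholder - the string you are looking for
HEADERS = {"Authorization": "token " + TOKEN}

# Steps 1-2: list all public repositories, 100 per page, via the Link header.
url = "https://api.github.com/repositories?per_page=100"
while url:
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    # Steps 3-5a: for each repository, ask the API for its README; 404 means none.
    for repo in resp.json():
        r = requests.get(repo["url"] + "/readme", headers=HEADERS)
        if r.status_code == 404:
            continue
        readme = base64.b64decode(r.json()["content"]).decode("utf-8", "replace")
        # Step 6: record the owner if the text is found.
        if SEARCH_TEXT in readme:
            print(repo["full_name"], repo["owner"]["login"])
    # Step 7: requests parses the Link header into resp.links for us.
    url = resp.links.get("next", {}).get("url")
For the location and language parts of the question, follow-up requests are needed: the full repository object at repo["url"] includes a language field, and https://api.github.com/users/:username includes location.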
Advanced: if you plan on running this more often, I strongly recommend using conditional requests (https://developer.github.com/v3/#conditional-requests). You basically store the date and time (or the ETag) from your query, and send it later to see whether anything has changed in the repository. This eliminates many of your subsequent queries if you need up-to-date information. You will still have to retrieve the whole list of repositories, but then you only search the updated ones.
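A sketch of such a conditional request with requests, reusing the ETag GitHub returns; a 304 reply does not count against the rate limit:
import requests

HEADERS = {"Authorization": "token YOUR_TOKEN"}   # placeholder token
url = "https://api.github.com/repos/octocat/Hello-World"

first = requests.get(url, headers=HEADERS)
etag = first.headers.get("ETag")   # store this together with your results

# On the next run, replay the request with If-None-Match.
second = requests.get(url, headers=dict(HEADERS, **{"If-None-Match": etag}))
if second.status_code == 304:
    print("nothing changed - reuse the cached search result")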
Of course, to make it faster you can parallelize the algorithm: while you retrieve the next 100 repositories, you simultaneously check whether the previous 100 contain a README and whether that README has what you are searching for, and so on. This will most certainly speed things up. You will need some sort of buffer, as you do not know which side finishes faster (fetching the repository list or searching through it); see the sketch below.
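One way to get that overlap in Python is a thread pool, whose internal work queue doubles as the buffer; check_readme is a hypothetical helper standing in for steps 5-6 of the sketch above:
from concurrent.futures import ThreadPoolExecutor
import requests

HEADERS = {"Authorization": "token YOUR_TOKEN"}   # placeholder token

def check_readme(repo):
    # steps 5-6 from the sketch above: fetch, decode and search the README
    pass

url = "https://api.github.com/repositories?per_page=100"
with ThreadPoolExecutor(max_workers=10) as pool:
    while url:
        resp = requests.get(url, headers=HEADERS)
        # hand the current page to the workers, then fetch the next page
        # immediately; the executor's queue buffers unprocessed repositories
        for repo in resp.json():
            pool.submit(check_readme, repo)
        url = resp.links.get("next", {}).get("url")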
Hope it helps.

Related

Tags feed in GitLab shows only 20 entries

I need to find a way to get all release names and release dates for a project in GitLab.
I tried retrieving the Tags feed, but it seems to contain only 20 entries, so the older tags are not in the feed.
How can I easily get this release information for a project or group?
I couldn't download the Release_evidence JSON for each release, but that's not a problem; I just need an easy way to get all of this release information.
<project>/-/tags?format=atom
only displays 20 entries. How can I get all the release information entries easily?
The GitLab documentation helped me retrieve all the feeds in this case, specifically the pagination section: https://docs.gitlab.com/ee/api/#pagination
Scanning through the available pages with the page=n query parameter in the URL, and reading the feed entries from each page until a page comes back empty, did the trick:
<project>/-/tags?format=atom&page=n, where n=1,2,3...
The per_page query parameter didn't work for me, but the page query parameter did; a short sketch follows below.
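A Python sketch of that loop, parsing the Atom feed with the standard library; <group>/<project> is a placeholder for your own project path:
import requests
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = "https://gitlab.com/<group>/<project>/-/tags?format=atom"  # placeholder path

page = 1
while True:
    resp = requests.get(FEED, params={"page": page})
    entries = ET.fromstring(resp.content).findall(ATOM + "entry")
    if not entries:          # an empty page means every tag has been read
        break
    for entry in entries:
        # each Atom entry carries the tag/release name and its date
        print(entry.find(ATOM + "title").text,
              entry.find(ATOM + "updated").text)
    page += 1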
Hope it helps someone.

Instagram - Obtaining data in realtime

I am trying to get the recent posts from a particular location, using this URL:
https://api.instagram.com/v1/media/search?lat=34.0500&lng=-118.2500&distance=50&MAX_ID=max_id&access_token=XXXX
When I use this URL the first time, I get 20 results. I take the max ID from those 20 results and modify my URL accordingly.
But when I use the modified URL, I get the same results as the first time.
How do I go about solving this?
Contrary to what I thought, the media search endpoint doesn't return a pagination object. Sorry. It also doesn't support the min_id/max_id parameters, which is why you are having problems.
If you want to get different data, you will have to use the time-based request parameter MIN_TIMESTAMP. However, it looks like that parameter doesn't work for this endpoint either (though the documentation says it is supported). Indeed, a quick search on the internet reveals it might be a long-standing bug in the API.

Best Approach to Scrape Paginated Results using import.io

There are several websites within the cruise industry that I would like to scrape.
Examples:
http://www.silversea.com/cruise/cruise-results/?page_num=1
http://www.seabourn.com/find-luxury-cruise-vacation/FindCruises.action?cfVer=2&destCode=&durationCode=&dateCode=&shipCodeSearch=&portCode=
In some scenarios, like the first one shown, the results page follows a pattern: ?page_num=1...17. However, the number of results will vary over time.
In the second scenario, the URL does not change with pagination.
At the end of the day, what I'd like to do is to get the results for each website into a single file.
Q1: Is there any alternative to setting 17 scrapers for scenario 1 and then actively watching as results grow/shrink over time?
Q2: I'm completely stumped about how to scrape content from second scenario.
Q1 - The free import.io tool does not have the ability to actively watch the data change over time. What you could do is have the data bulk-extracted by an Extractor (with 17 pages this would be really fast) and added to a database. After each load into the database, the entries could be de-duplicated or marked as unique. You could do this manually in Excel or programmatically.
Their Enterprise (data as a service) could do this for you.
Q2 - If there is no unique URL for each page, the only import.io tool that will paginate the pages for you is the Connector.
I would recommend building an extractor to get the pagination. The result of this extractor will be a list of links, each corresponding to a page.
This way, every time you run your application and the number of pages changes, you will always get all the pages.
After that, make a call for each page to get the data you want.
Extractor 1: Get pages -- Input: The first URL
Extractor 2: Get items (data) -- Input: The result from Extractor 1
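Outside of import.io, the same two-step idea can be sketched in plain Python for the first site, where the page number is part of the URL; the CSS selector here is hypothetical and would need to match the real markup:
import requests
from bs4 import BeautifulSoup

BASE = "http://www.silversea.com/cruise/cruise-results/?page_num={}"

rows, page = [], 1
while True:
    # "Extractor 1": walk page_num upward until a page has no results,
    # so the scraper adapts as the result count grows or shrinks
    soup = BeautifulSoup(requests.get(BASE.format(page)).text, "html.parser")
    results = soup.select(".cruise-result")      # hypothetical selector
    if not results:
        break
    # "Extractor 2": pull the data you want from each result
    rows.extend(r.get_text(" ", strip=True) for r in results)
    page += 1

# a single file at the end, as asked
with open("cruises.txt", "w") as f:
    f.write("\n".join(rows))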

Not Getting Link Response Headers in GitHub Pagination For Issues

I'm writing a quick python app to get stats on my public GitHub project.
When I call https://api.github.com/repos/user/project/pulls, I get back some JSON, but because my project has more than 30 outstanding PRs, I also get a Link response header with the next and last URLs to call for the remaining PRs.
However, when I perform a parallel query for issues with a certain label (https://api.github.com/repos/user/project/issues?labels=label&status=opened), I only get 30 back (the pagination limit), but my response header doesn't have a next Link in it for me to follow. I know my project has more than 30 issues that match that label.
Is this a bug in the GitHub API, or in what I'm doing? Alternatively, I don't actually care about the issues themselves, just the count of issues with that label, so is there another way to just query for the count?
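Two details worth checking, sketched below: the issues endpoint filters by state=open (status=opened is not a recognized parameter), and requests already parses the Link header into response.links; for a bare count, the Search API returns a total_count field. user/project and label are placeholders from the question:
import requests

# the parsed Link header, e.g. {'next': {'url': '...', 'rel': 'next'}, ...}
resp = requests.get(
    "https://api.github.com/repos/user/project/issues",
    params={"labels": "label", "state": "open"},
)
print(resp.links)

# if only the count matters, the Search API reports it directly
search = requests.get(
    "https://api.github.com/search/issues",
    params={"q": 'repo:user/project label:"label" state:open'},
)
print(search.json()["total_count"])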

Code fragment repository search on github.com

How can I search for code fragments on github.com? When I search for MSG_PREPARE in the repository ErikZalm/Marlin github shows up nothing.
I'm using the repository code search syntax described on https://github.com/search with
repo:ErikZalm/Marlin MSG_PREPARE
No results, but MSG_PREPARE can be found inside this repository here. Am I missing something? Is there no code search on github.com?
At the time of writing this answer, about 8 years after this question was asked, GitHub has come a long way, though still not all the way to what you are looking for.
GitHub code search is limited by the following rules: https://docs.github.com/en/github/searching-for-information-on-github/searching-code. Quoting the same:
Code in forks is only searchable if the fork has more stars than the parent repository.
Forks with fewer stars than the parent repository are not indexed for code search.
To include forks with more stars than their parent in the search results, you will need to add fork:true or fork:only to your query.
For more information, see "Searching in forks."
So we can search within the fork using the fork:true option, but, as expected, since the repo ErikZalm/Marlin has a low star count compared to its parent MarlinFirmware/Marlin, the code in the fork is still not indexed. Hence the advanced search shows nothing except a match on the repository itself.
However, if you perform the same search on the parent, it does show matches in the code. Here are the matches for MSG_PREPARE in the parent repo MarlinFirmware/Marlin.
Fortunately, one company I know of working in this domain is SourceGraph: https://about.sourcegraph.com/
Hence, you can easily search for what you intended with SourceGraph.
Here are the matches for MSG_PREPARE in ErikZalm/Marlin using SourceGraph Cloud.
Update July 2013: "Preview the new Search API"
The GitHub search API on code now supports fragments, through text-match metadata.
Some API consumers will want to highlight the matching search terms when displaying search results. The API offers additional metadata to support this use case. To get this metadata in your search results, specify the text-match media type in your Accept header. For example, via curl, the above query would look like this:
curl -H 'Accept: application/vnd.github.preview.text-match+json' \
'https://api.github.com/search/code?q=octokit+in:file+extension:gemspec+-repo:octokit/octokit.rb&sort=indexed'
This produces the same JSON payload as above, with an extra key called text_matches, an array of objects. These objects provide information such as the position of your search terms within the text, as well as the property that included the search term.
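The same preview call sketched in Python with requests, printing the matching fragments:
import requests

resp = requests.get(
    "https://api.github.com/search/code",
    params={"q": "octokit in:file extension:gemspec -repo:octokit/octokit.rb",
            "sort": "indexed"},
    headers={"Accept": "application/vnd.github.preview.text-match+json"},
)
for item in resp.json()["items"]:
    for match in item.get("text_matches", []):
        # each object names the matched property and the surrounding fragment
        print(item["repository"]["full_name"], match["fragment"])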
Original answer (November 2012)
I don't think there is anything that you would have missed.
If you search for SdFile, you will find results in .pde files, but none in .cpp files such as this SdFile.cpp file.
The search was introduced 4 years ago (November 2008), but, as mentioned in "Search a github repository for the file defining a given function", GitHub repository code is simply not fully indexed.
