Retrieve more than 1000 commits - github-api

Using a keyword ("data analysis"), how is it possible to retrieve all commits for that keyword via the GitHub API?
Currently the general search provides only the first 1000 commits. Example link: https://github.com/search?q=%22data+analysis%22&type=commits

You can use gh api with the --paginate flag to manage getting all results:
gh api search/repositories --method=GET -F q="data analysis" --jq ".items[].html_url" \
-F per_page=100 --paginate --cache 1h
That won't be limited to the first 100 results.
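Since the question is about commits specifically, the same pattern can be pointed at the commit-search endpoint; a minimal sketch, assuming your gh version and token can query search/commits (note that GitHub's search API only exposes the first 1000 matches per query, even with pagination):
gh api search/commits --method=GET -F q='"data analysis"' \
-F per_page=100 --paginate --jq ".items[].html_url"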

Related

Return issues without pagination with `glab issue list`

I want to retrieve all issues from myrepo with the GitLab CLI.
The command glab issue list -R myrepo -P 1000 returns a page containing 100 issues, but I have 195 issues. Is this a limitation in the API, or am I missing something in the documentation to undo the pagination?
It is a limitation of the GitLab API: it can only return a maximum of 100 results per call.
Just to document my approach for getting the issues if anyone needs it:
issues=()
page=1
per_page=100
while true; do
  # Retrieve the current page of issues
  page_issues=$(glab issue list -R "$GITLAB_REPO" --page $page --per-page $per_page)
  # If the page is empty, we have retrieved all issues
  if [ "$page_issues" == "No open issues match your search in $GITLAB_REPO" ]; then
    break
  fi
  # Append the page's issues to the array
  issues+=("$(echo "$page_issues" | grep -Po '#\d+\s+.*')")
  # Increment the page number
  page=$((page + 1))
done
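An alternative that avoids scraping the CLI's text output is to page through the GitLab REST API directly until an empty page comes back. A minimal sketch, assuming a gitlab.example.com host, a project ID, and a personal access token (all placeholders):
page=1
while :; do
  body=$(curl --silent --header "PRIVATE-TOKEN: <token>" \
    "https://gitlab.example.com/api/v4/projects/<project-id>/issues?per_page=100&page=$page")
  # An empty JSON array means there are no more pages
  [ "$body" = "[]" ] && break
  echo "$body"
  page=$((page + 1))
done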

Databricks Jobs API "INVALID_PARAMETER_VALUE" when trying to get job

I'm just starting to explore the Databricks API. I've created a .netrc file as described in this doc and am able to get the API to work with this for other operations like "list clusters" and "list jobs". But when I try to query details of a particular job, it fails:
$ curl --netrc -X GET https://<my_workspace>.cloud.databricks.com/api/2.0/jobs/get/?job_id=job-395565384955064-run-12345678
{"error_code":"INVALID_PARAMETER_VALUE","message":"Job 0 does not exist."}
What am I doing wrong here?
The job ID should be a numeric identifier, but you're providing the job cluster name instead. You need to use the first number (395565384955064) from that name as the job ID in the REST API. Also, remove the / after get - it should be /api/2.0/jobs/get?job_id=<job-ID>
$ curl --netrc -X GET https://<my_workspace>.cloud.databricks.com/api/2.0/jobs/get/?job_id=job-395565384955064-run-12345678
In this call, an alphanumeric job name has been passed where job_id expects a numeric value. You can find the numeric job_id in the output of the "list jobs" call that already works for you.
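Putting those fixes together, the corrected request would look like this (workspace placeholder kept from the question, credentials still coming from .netrc):
$ curl --netrc -X GET "https://<my_workspace>.cloud.databricks.com/api/2.0/jobs/get?job_id=395565384955064"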

How to list more than 20 gitlab badges? [duplicate]

I am currently using GitLab API to return all projects within a group. The question I have is, how do I return all projects if there are over 100 projects in the group?
The curl command I'm using is curl --header "PRIVATE-TOKEN: **********" "http://gitlab.example.com/api/v4/groups/myGroup/projects?per_page=100&page=1"
I understand that the default page=1 and the max per_page=100 so what do I do if there are over 100 projects? If I set page=2, it just returns all the projects after the first 100.
Check the response for the X-Total-Pages header. As long as page is smaller than the total number of pages, call the API again with an incremented page parameter.
See https://docs.gitlab.com/ee/api/README.html#pagination-link-header
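A minimal sketch of that loop (host, group, and token are placeholders; the header is matched case-insensitively since it may come back as x-total-pages):
total_pages=$(curl --silent --output /dev/null --dump-header - \
  --header "PRIVATE-TOKEN: <token>" \
  "https://gitlab.example.com/api/v4/groups/myGroup/projects?per_page=100&page=1" \
  | tr -d '\r' | awk 'tolower($1) == "x-total-pages:" {print $2}')
for page in $(seq 1 "$total_pages"); do
  curl --silent --header "PRIVATE-TOKEN: <token>" \
    "https://gitlab.example.com/api/v4/groups/myGroup/projects?per_page=100&page=$page"
done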

How can I export GitHub issues to Excel?

How can I export all my issues from an Enterprise GitHub repository to an Excel file? I have tried searching many Stack Overflow answers but did not succeed. I also tried this solution (exporting Git issues to CSV) but got "ImportError: No module named requests" errors. Is there any tool or easy way to export all the issues to Excel?
To export from a private repo using curl, you can run the following:
curl -i https://api.github.com/repos/<repo-owner>/<repo-name>/issues --header "Authorization: token <token>"
The token can be generated under Personal access tokens in your account settings.
Inspect the API description for all details.
With the official GitHub CLI you can easily export all issues into a CSV format.
brew install gh
Log in:
gh auth login
Change directory to a repository and run this command:
gh issue list --limit 1000 --state all | tr '\t' ',' > issues.csv
In European .csv files the separator is a semicolon ';' rather than a comma; adjust the separator as needed.
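For example, to produce a semicolon-separated file instead:
gh issue list --limit 1000 --state all | tr '\t' ';' > issues.csv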
The hub command-line wrapper for github makes this pretty simple.
You can do something like this:
$ hub issue -f "%t,%l%n" > list.csv
which gives you something like this
$ more list.csv
Issue 1 title, tag1 tag2
Issue 2 title, tag3 tag2
Issue 3 title, tag1
If that is a one-time task, you may play around with the GitHub web API. It allows you to export the issues in JSON format. Then you can convert them to Excel (e.g. using some online converter).
Just open the following URL in a browser substituting the {owner} and {repo} with real values:
https://api.github.com/repos/{owner}/{repo}/issues?page=1&per_page=100
It is unfortunate that github.com does not make this easier.
In the meantime, if you have jq and curl, you can do this in two lines using something like the following example, which outputs issue number, title, and labels (tags) and works for private repos as well (if you don't want to filter by label, just remove the labels=$label& part of the URL). You'll need to substitute $owner, $repo, $label, and $username:
# with personal access token = $PAT
echo "number, title, labels" > issues.csv
curl "https://api.github.com/repos/$owner/$repo/issues?labels=$label&page=1&per_page=100" -u "$username:$PAT" \
| jq -r '.[] | [.number, .title, (.labels|map(.name)|join("/"))]|@csv' >> issues.csv
# without PAT (will be prompted for password)
echo "number, title, labels" > issues.csv
curl "https://api.github.com/repos/$owner/$repo/issues?labels=$label&page=1&per_page=100" -u "$username" \
| jq -r '.[] | [.number, .title, (.labels|map(.name)|join("/"))]|@csv' >> issues.csv
Note that if your data exceeds 1 page, it may require additional calls.
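A sketch of those additional calls, looping with curl and jq until the API returns an empty page (same placeholders as above):
echo "number, title, labels" > issues.csv
page=1
while :; do
  body=$(curl --silent "https://api.github.com/repos/$owner/$repo/issues?labels=$label&page=$page&per_page=100" -u "$username:$PAT")
  # Stop once a page comes back as an empty JSON array
  [ "$(echo "$body" | jq length)" -eq 0 ] && break
  echo "$body" | jq -r '.[] | [.number, .title, (.labels|map(.name)|join("/"))]|@csv' >> issues.csv
  page=$((page + 1))
done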
I tried the methods described in other comments regarding exporting issues in JSON format. It worked ok but the formatting was somehow screwed up. Then I found in Excel help that it is able to access APIs directly and load the data from the JSON response neatly into my Excel sheets.
The Google terms I used to find the help I needed were "excel power query web.content GET json". I found a How To Excel video which helped a lot.
URL that worked in the Excel query (same as from other posts):
https://api.github.com/repos/{owner}/{repo}/issues?page=1&per_page=100
Personally, I also add the parameter &state=open, otherwise I need to request hundreds of pages. At one point I reached GitHub's limit on unauthenticated API calls/hour for my IP address.
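That is, a URL along these lines (parameters as described above):
https://api.github.com/repos/{owner}/{repo}/issues?page=1&per_page=100&state=open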
You can also check out the one-liner that I created (it involves the GitHub CLI and jq):
gh issue list --limit 10000 --state all --json number,title,assignees,state,url | jq -r '["number","title","assignees","state","url"], (.[] | [.number, .title, (.assignees | if .|length==0 then "Unassigned" elif .|length>1 then map(.login)|join(",") else .[].login end) , .state, .url]) | @tsv' > issues-$(date '+%Y-%m-%d').tsv
Gist with documentation
I have tinkered with this for quite some time and found that Power BI is a good way of keeping the data up to date in the spreadsheet. I had to look into Power BI a little to make this work, because getting the right info out of the structured JSON fields, and collapsing lists into concatenated strings, especially for labels, wasn't super intuitive. But this Power BI query works well for me by removing all the noise and getting relevant info into an easily digestible format that can be reviewed with stakeholders:
let
    MyJsonRecord = Json.Document(Web.Contents("https://api.github.com/repos/<your org>/<your repo>/issues?&per_page=100&page=1&state=open&filter=all", [Headers=[Authorization="Basic <your auth token>", Accept="application/vnd.github.symmetra-preview+json"]])),
    MyJsonTable = Table.FromRecords(MyJsonRecord),
    #"Column selection" = Table.SelectColumns(MyJsonTable,{"number", "title", "user", "labels", "state", "assignee", "assignees", "comments", "created_at", "updated_at", "closed_at", "body"}),
    #"Expanded labels" = Table.ExpandListColumn(#"Column selection", "labels"),
    #"Expanded labels1" = Table.ExpandRecordColumn(#"Expanded labels", "labels", {"name"}, {"labels.name"}),
    #"Grouped Rows" = Table.Group(#"Expanded labels1", {"number","title", "user", "state", "assignee", "assignees", "comments", "created_at", "updated_at", "closed_at", "body"}, {{"Label", each Text.Combine([labels.name],","), type text}}),
    #"Removed Other Columns" = Table.SelectColumns(#"Grouped Rows",{"number", "title", "state", "assignee", "comments", "created_at", "updated_at", "closed_at", "body", "Label"}),
    #"Expanded assignee" = Table.ExpandRecordColumn(#"Removed Other Columns", "assignee", {"login"}, {"assignee.login"})
in
    #"Expanded assignee"
I added and then removed columns in this and did not clean this up - feel free to do that before you use it.
Obviously, you also have to fill in your own organization name and repo name into the URL, and obtain the auth token. I have tested the URL with a Chrome REST plugin and got the token from entering the user and api key there. You can authenticate explicitly from Excel with the user and key if you don't want to deal with the token. I just find it simpler to go the anonymous route in the query setup and instead provide the readily formatted request header.
Also, this works for repos with up to 100 open issues. If you have more, you need to duplicate the query (for page 2 etc) and combine the results.
Steps for using this query:
in a new sheet, on the "Data" tab, open the "Get Data" drop-down
select "Launch Power Query Editor"
in the editor, choose "New Query", "Other Sources", "Blank query"
now you click on "Advanced Editor" and paste the above query
click the "Done" button on the Advanced Editor, then "Close and Load" from the tool bar
the issues will load into your spreadsheet and you are in business
no crappy third-party tool needed
You can also try https://github.com/remoteorigin/git-issues-downloader, but be sure to use the develop branch. The npm version and the master branch are buggy.
Or you can use this patched version with
npm install -g https://github.com/mkobar/git-issues-downloader
and then run with (for public repo)
git-issues-downloader -n -p none -u none https://github.com/<user>/<repository>
or for a private repo:
git-issues-downloader -n -p <password or token> -u <user> https://github.com/<user>/<repository>
Works great.
Here is a tool that does it for you (uses the GitHub API):
https://github.com/gavinr/github-csv-tools
Export Pull Requests can export issues to a CSV file, which can be opened with Excel. It also supports GitLab and Bitbucket.
From its documentation:
Export open PRs and issues in sshaw/git-link and
sshaw/itunes_store_transporter:
epr sshaw/git-link sshaw/itunes_store_transporter > pr.csv
Export open pull requests not created by sshaw in padrino/padrino-framework:
epr -x pr -c '!sshaw' padrino/padrino-framework > pr.csv
It has several options for filtering what gets exported.
GitHub's JSON API can be queried directly from within Excel using Power Query. It does require some knowledge about how to convert JSON into Excel table format, but that's fairly Googlable.
Here's how to first get to the data:
In Excel, on the Ribbon, click Data > Get Data > From JSON. In the dialog box, enter the API URL ... in a format similar to (add params as you wish):
https://api.github.com/repos/{owner}/{repo}/issues
A dialog box labeled "Access Web content" will appear.
On the left-hand side, click the Basic tab.
In the User name textbox, enter your GitHub username.
In the Password textbox, enter a GitHub password/personal access token.
Click Connect.
Power Query Editor will be displayed with a list of items that say Record.
... now Google around for how to transform accordingly so that the appropriate issue data can be displayed as a single table.
As a one-time task, building on the 'hub'-based recommendation from @Chip... on a Windows system with a Git Bash prompt already installed:
Download the latest hub executable (such as Windows 64-bit) from https://github.com/github/hub/releases/ and extract it (hub.exe is in the .../bin directory).
Create a GitHub personal access token at https://github.com/settings/tokens and copy the token text string to the clipboard.
Create a text file (such as in Notepad) to use as the input file to hub.exe... the first line is your GitHub user name, and on the 2nd line paste the personal access token, followed by a newline (so that both lines will be processed when input to hub). Here I presume the file is infile.txt in the repository's base directory.
Run Git Bash... and remember to cd (change directory) to the repository of interest! Then enter a line like:
<path_to_hub_folder>/bin/hub.exe issue -s all -f "%U|%t|%S|%cI|%uI|%L%n" < infile.txt > outfile.csv
Then open the file with '|' as the column delimiter (and consider deleting the personal access token on GitHub).
You can do it using the Python package PyGithub:
from github import Github
# Authenticate with a personal access token
gh = Github('personal token key here')
repo = gh.get_repo('repo-owner/repo-name')
issues = repo.get_issues(state='all')
for issue in issues:
    print(issue.url)
Here I got back the URL; you can get back the content instead by changing the .url part.
Then just export the issue links or content to CSV.
The GitHub CLI (gh) now integrates jq via --jq <expression> to filter JSON output, as documented in the GitHub CLI manual: https://cli.github.com/manual/gh_issue_list.
TSV dump:
gh issue list --limit 10 --state all --json title,body --jq '["title","body"], (.[] | [.title,.body]) | @tsv' > issues-$(date '+%Y-%m-%d').tsv
CSV dump:
Surprisingly, the carriage-return character (U+000D) needs to be filtered out, e.g. with tr $'\r' ' '.
gh issue list --limit 10 --state all --json title,body --jq '["title","body"], (.[] | [.title,.body]) | @csv' | tr $'\r' ' ' > issues-$(date '+%Y-%m-%d').csv

Compare two websites and see if they are "equal?"

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff, it can give you a percentage of how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can be also easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(It actually served google.com in French... funny.)
The common % values show how similar the two texts are. Plus you can easily see the differences by word (instead of by line, which can be cluttered).
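For a migration check you can wrap this in a loop over the paths you care about and keep just the similarity statistics; a minimal sketch with placeholder hostnames and paths:
for path in / /about /contact; do
  w3m -dump "http://old.example.com$path" 2>/dev/null > /tmp/old.txt
  w3m -dump "http://new.example.com$path" 2>/dev/null > /tmp/new.txt
  echo "== $path =="
  # The last two lines of the wdiff -s output are the per-file similarity statistics
  wdiff -nis /tmp/old.txt /tmp/new.txt | tail -n 2
done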
The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5 or sha1 commands and check them against the new server.
If the pages have dynamic content, you will have to download the site using a tool like wget
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner, or do the hash thing again. I think diff may be the best way to go, since even a change of one character will mess up the hash.
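A minimal sketch of the hash comparison, assuming both servers can be mirrored locally (hostnames are placeholders; wget --mirror names each local directory after the host):
wget --mirror --quiet http://old.example.com/
wget --mirror --quiet http://new.example.com/
# Checksum every mirrored file, keyed by its relative path, then diff the two lists
(cd old.example.com && find . -type f -exec md5sum {} + | sort -k 2) > old.md5
(cd new.example.com && find . -type f -exec md5sum {} + | sort -k 2) > new.md5
diff old.md5 new.md5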
I've created the following PHP code that does what Weboide suggests here. Thanks Weboide!
The paste is here:
http://pastebin.com/0V7sVNEq
Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then run that test against the new server to find how they differ.
Use the free and open-source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions you currently need to manually list all relevant URLs. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered website and show you exactly where they differ (i.e. different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.
Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or wget and run a diff on the results.
Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the downloaded website after downloading recursively with wget.
wget [option]... [URL]...
-m
--mirror
    Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.
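For example, after mirroring both servers (directory names assumed to follow wget's host-based naming, as above):
diff -r --brief old.example.com new.example.com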
