Why can't I get all contributors of a GitHub repository using the GitHub API - github-api

Following is my code and every time it only returns 60 contributors:
import requests

base_url = "https://api.github.com"
owner = "http4s"
repo = "http4s"

# Set the headers for the request
headers = {
    "Accept": "application/vnd.github+json",
    "Authorization": "token xxxxxx"
}

url = f"{base_url}/repos/{owner}/{repo}/contributors?per_page=100"
more_pages = True

# Initialize a set to store the contributors
contributors = set()

while more_pages:
    # Send the request to the API endpoint
    response = requests.get(url, headers=headers)
    # Check the status code of the response
    if response.status_code == 200:
        # If the request is successful, add the list of contributors to the `contributors` set
        contributors.update([contributor["login"] for contributor in response.json()])
        # Check the `Link` header in the response to see if there are more pages to fetch
        link_header = response.headers.get("Link")
        if link_header:
            # The `Link` header has the format `<url>; rel="name"`, so we split it on `;` to get the URL
            next_url = link_header.split(";")[0][1:-1]
            # Check if the `rel` parameter is "next" to determine if there are more pages to fetch
            if "rel=\"next\"" in link_header:
                url = next_url
            else:
                more_pages = False
        else:
            more_pages = False
    else:
        # If the request is not successful, print the status code and the error message
        print(f"Failed to get contributors: {response.status_code} {response.text}")
        more_pages = False
Can someone tell me how I can optimize my code?
I have tried many ways to fetch more contributors, but every time it only returns 60 different contributors. I want to get the full list of contributors.
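One thing worth checking is the manual Link-header parsing: split(";")[0] always takes the first link in the header, and on pages after the first that link is typically the "prev" link rather than "next", so the loop can revisit earlier pages instead of advancing. Not meant as the definitive fix, but here is a minimal sketch (assuming the same placeholder token) that lets requests parse the Link header and follows only the "next" relation:
import requests

base_url = "https://api.github.com"
owner, repo = "http4s", "http4s"
headers = {
    "Accept": "application/vnd.github+json",
    "Authorization": "token xxxxxx",  # same placeholder token as above
}

contributors = set()
url = f"{base_url}/repos/{owner}/{repo}/contributors?per_page=100"

while url:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    contributors.update(c["login"] for c in response.json())
    # requests exposes the parsed Link header; "next" is absent on the last page
    url = response.links.get("next", {}).get("url")

print(len(contributors))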

Related

How to get the list of followers from an Instagram account without getting banned?

I am trying to scrape all the followers of some particular Instagram accounts. I am using Python 3.8.3 and the latest version of the Instaloader library. The code I have written is given below:
# Import the required libraries:
import instaloader
import time
from random import randint

# Start time:
start = time.time()

# Create an instance of instaloader:
loader = instaloader.Instaloader()

# Credentials & target account:
user_id = USERID
password = PASSWORD
target = TARGET  # Account of which the list of followers needs to be scraped

# Login or load the session:
loader.login(user_id, password)

# Obtain the profile metadata of the target:
profile = instaloader.Profile.from_username(loader.context, target)

# Print the list of followers and save it in a text file:
try:
    # The list to store the collected user handles of the followers:
    followers_list = []

    # Variables used to apply pauses to slow down scraping:
    count = 0
    short_counter = 1
    short_pauser = randint(19, 24)
    long_counter = 1
    long_pauser = randint(4900, 5000)

    # Fetch the followers one by one:
    for follower in profile.get_followers():
        sleeper = randint(840, 1020)

        # Short pause for the process:
        if (short_counter % short_pauser == 0):
            short_counter = 0
            short_pauser = randint(19, 24)
            print('\nShort Pause.\n')
            time.sleep(1)

        # Long pause for the process:
        if (long_counter % long_pauser == 0):
            long_counter = 0
            long_pauser = randint(4900, 5000)
            print('\nLong pause.\n')
            time.sleep(sleeper)

        # Append the list and print the follower's user handle:
        followers_list.append(follower.username)
        print(count, '', followers_list[count])

        # Increment the counters accordingly:
        count = count + 1
        short_counter = short_counter + 1
        long_counter = long_counter + 1

    # Store the followers list in a txt file:
    txt_file = target + '.txt'
    with open(txt_file, 'a+') as f:
        for the_follower in followers_list:
            f.write(the_follower)
            f.write('\n')
except Exception as e:
    print(e)

# End time:
end = time.time()
total_time = end - start

# Print the time taken for execution:
print('Time taken for complete execution:', total_time, 's.')
I am getting the following error after scraping some data:
HTTP Error 400 (Bad Request) on GraphQL Query. Retrying with shorter page length.
HTTP Error 400 (Bad Request) on GraphQL Query. Retrying with shorter page length.
400 Bad Request
In fact, the error occurs when Instagram detects unusual activity, temporarily disables the account, and prompts the user to change the password.
I have tried -
(1) Slowing down the process of scraping.
(2) Adding pauses in between in order to make the program more human-like.
Still, no progress.
How can I bypass such restrictions and get the complete list of all the followers?
If getting the entire list is not possible, what is the best way to get a list of at least 20,000 followers (from multiple accounts) without getting banned, having the account disabled, or facing similar inconveniences?
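This does not answer the rate-limit question itself, but two details in the code are worth noting: the comment says "Login or load the session" while the script always performs a fresh login, and the text file is written only after the whole loop finishes, so a mid-run HTTP 400 loses everything already fetched. A minimal sketch (reusing the same USERID/PASSWORD/TARGET placeholders, and assuming Instaloader's session-file helpers) that reuses a saved session and writes each handle to disk as soon as it is fetched:
import instaloader

loader = instaloader.Instaloader()

# Reuse a previously saved session instead of logging in on every run.
try:
    loader.load_session_from_file(USERID)
except FileNotFoundError:
    loader.login(USERID, PASSWORD)
    loader.save_session_to_file()

profile = instaloader.Profile.from_username(loader.context, TARGET)

# Append each handle immediately so an HTTP 400 mid-run does not lose progress.
with open(TARGET + '.txt', 'a+') as f:
    for follower in profile.get_followers():
        f.write(follower.username + '\n')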

python download file into memory and handle broken links

I'm using the following code to download a file into memory:
import requests

if 'login_before_download' in obj.keys():
    with requests.Session() as session:
        session.post(obj['login_before_download'], verify=False)
        request = session.get(obj['download_link'], allow_redirects=True)
else:
    request = requests.get(obj['download_link'], allow_redirects=True)

print("downloaded {} into memory".format(obj[download_link_key]))
file_content = request.content
obj is a dict that contains the download_link and another key that indicates whether I need to log in to the page to create a cookie.
The problem with my code is that if the URL is broken and there isn't any file to download, I still get the HTML content of the page instead of identifying that the download failed.
Is there any way to identify that the file wasn't downloaded?
I found the following solution in this url:
import requests

def is_downloadable(url):
    """
    Does the url contain a downloadable resource
    """
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    content_type = header.get('content-type')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

print(is_downloadable('https://www.youtube.com/watch?v=9bZkp7q19f0'))
# >> False
print(is_downloadable('http://google.com/favicon.ico'))
# >> True
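The HEAD-based check above only looks at Content-Type. As a complementary approach, here is a sketch (the fetch_file helper and its checks are illustrative, not from the original answer) that fails fast on non-2xx status codes and also looks at Content-Disposition before keeping the body:
import requests

def fetch_file(url):
    response = requests.get(url, allow_redirects=True)
    # Raise an exception for 4xx/5xx responses (e.g. a broken link returning 404).
    response.raise_for_status()

    content_type = response.headers.get('Content-Type', '')
    disposition = response.headers.get('Content-Disposition', '')
    # An HTML page with no attachment header is probably an error or landing page,
    # not the file we wanted.
    if 'html' in content_type.lower() and 'attachment' not in disposition.lower():
        raise ValueError("URL returned an HTML page instead of a file: " + url)
    return response.content

# Usage (assumes obj as defined in the question):
# file_content = fetch_file(obj['download_link'])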

export data from service now rest API using python

I have to export incident data from the ServiceNow REST API. The incident state is one of new, in progress, pending, not resolved, and closed. I am able to fetch data in the active state, but I am not able to apply the correct filter; also, the output shows an extra character 'b' before each line, so how do I remove that extra character?
Input:
import requests

URL = 'https://instance_name.service-now.com/incident.do?CSV&sysparm_query=active=true'
user = 'user_name'
password = 'password'
headers = {"Accept": "application/xml"}

response = requests.get(URL, auth=(user, password), headers=headers)
if response.status_code != 200:
    print('Status:', response.status_code, 'Headers:', response.headers, 'Error Response:', response.content)
    exit()

print(response.content.splitlines())
Output:
[b'"number","short_description","state"', b'"INC0010001","Test incident creation through REST","New"', b'"INC0010002","incident creation","Closed"', b'"INC0010004","test","In Progress"']
These are bytes literals (for more info, please refer to https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals).
To remove the bytes prefix, we need to decode the bytes into strings.
Just give this a try:
new_list = []
s = [b'"number","short_description","state"', b'"INC0010001","Test incident creation through REST","New"', b'"INC0010002","incident creation","Closed"', b'"INC0010004","test","In Progress"']
for ele in s:
    ele = ele.decode("utf-8")
    new_list.append(ele)
print(new_list)
Output:
['"number","short_description","state"', '"INC0010001","Test incident creation through REST","New"', '"INC0010002","incident creation","Closed"', '"INC0010004","test","In Progress"']
Hope it works!
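As a side note, since the whole CSV payload is fetched anyway, you can skip the per-line decode and let requests decode the body for you, then parse it with the standard csv module. A small sketch, reusing the response object from the question:
import csv
import io

# response.text is the decoded (str) form of response.content
reader = csv.reader(io.StringIO(response.text))
for row in reader:
    print(row)   # e.g. ['INC0010001', 'Test incident creation through REST', 'New']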

Lambda/boto3/python loop

This code acts as an early warning system for ADFS failures and works fine when run locally. The problem is that when I run it in Lambda, it loops non-stop.
In short:
lambda_handler() runs pagecheck()
pagecheck() produces the info needed then passes 2 lists (msgdet_list, error_list) and an int (error_count) to notification().
notification() collates and prints the output. The output is two key variables (notificationheader and notificationbody).
I've commented out the SNS piece which would usually email the info, and am using print() instead to send the info to the CloudWatch logs until I can get the loop sorted.
If I run this locally, it produces a clean single output. In Lambda, the function will loop until it times out. It's almost like every time the lists are updated, they're passed to the notification() module and it's run. I can limit the function time, but would rather fix the code!
Cheers,
tac
# This python/boto3/lambda script sends a request to an Office 365 landing page, parses return details to confirm
# a successful redirect to the organisation ADFS homepage, authenticates that the homepage is correct, raises any
# errors, and sends a consolidated report to an AWS SNS topic.
# Run once to produce pageserver and htmlchar values for global variables.

# Import required modules
import boto3
import urllib.request
import urllib.parse
from urllib.request import Request, urlopen
from datetime import datetime
import time
import re
import sys

# Global variables to be set
url = "https://outlook.com/CONTOSSO.com"
adfslink = "https://sts.CONTOSSO.com/adfs/ls/?client-request-id="

# Input after first run
pageserver = "Microsoft-HTTPAPI/2.0 Microsoft-HTTPAPI/2.0"
htmlchar = 18600

# Input AWS SNS ARN
snsarn = 'arn:aws:sns:ap-southeast-2:XXXXXXXXXXXXX:Daily_Check_Notifications_CONTOSSO'
sns = boto3.client('sns')

def pagecheck():
    # Present the request to the webpage as if coming from a user in a browser
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'user'}
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')

    # "Null" the Message Detail and Error lists
    msgdet_list = []
    error_list = []

    request = Request(url)
    req = urllib.request.Request(url, data, headers)
    response = urlopen(request)
    with urllib.request.urlopen(request) as response:

        # Get the URL. This gets the real URL.
        acturl = response.geturl()
        msgdet_list.append("\nThe Actual URL is:")
        msgdet_list.append(str(acturl))
        if adfslink not in acturl:
            error_list.append(str("Redirect Fail"))

        # Get the HTTP response code
        httpcode = response.code
        msgdet_list.append("\nThe HTTP code is: ")
        msgdet_list.append(str(httpcode))
        if httpcode//200 != 1:
            error_list.append(str("No HTTP 2XX Code"))

        # Get the Headers as a dictionary-like object
        headers = response.info()
        msgdet_list.append("\nThe Headers are:")
        msgdet_list.append(str(headers))
        if response.info() == "":
            error_list.append(str("Header Error"))

        # Get the date of request and compare to UTC (DD MMM YYYY HH MM)
        date = response.info()['date']
        msgdet_list.append("The Date is: ")
        msgdet_list.append(str(date))
        returndate = str(date.split( )[1:5])
        returndate = re.sub(r'[^\w\s]', '', returndate)
        returndate = returndate[:-2]
        currentdate = datetime.utcnow()
        currentdate = currentdate.strftime("%d %b %Y %H%M")
        if returndate != currentdate:
            date_error = ("Date Error. Returned Date: ", returndate, "Expected Date: ", currentdate, "Times in UTC (DD MMM YYYY HH MM)")
            date_error = str(date_error)
            date_error = re.sub(r'[^\w\s]', '', date_error)
            error_list.append(str(date_error))

        # Get the server
        headerserver = response.info()['server']
        msgdet_list.append("\nThe Server is: ")
        msgdet_list.append(str(headerserver))
        if pageserver not in headerserver:
            error_list.append(str("Server Error"))

        # Get all HTML data and confirm no major change to content size by character length (global var: htmlchar).
        html = response.read()
        htmllength = len(html)
        msgdet_list.append("\nHTML Length is: ")
        msgdet_list.append(str(htmllength))
        msgdet_list.append("\nThe Full HTML is: ")
        msgdet_list.append(str(html))
        msgdet_list.append("\n")
        if htmllength // htmlchar != 1:
            error_list.append(str("Page HTML Error - incorrect # of characters"))
        if adfslink not in str(acturl):
            error_list.append(str("ADFS Link Error"))

        error_list.append("\n")
        error_count = len(error_list)
        if error_count == 1:
            error_list.insert(0, 'No Errors Found.')
        elif error_count == 2:
            error_list.insert(0, 'Error Found:')
        else:
            error_list.insert(0, 'Multiple Errors Found:')

    # Pass completed results and data to the notification() module
    notification(msgdet_list, error_list, error_count)

# Use AWS SNS to create a notification email with the additional data generated
def notification(msgdet_list, error_list, errors):
    datacheck = str("\n".join(msgdet_list))
    errorcheck = str("\n".join(error_list))
    notificationbody = str(errorcheck + datacheck)
    if errors > 1:
        result = 'FAILED!'
    else:
        result = 'passed.'
    notificationheader = ('The daily ADFS check has been marked as ' + result + ' ' + str(errors) + ' ' + str(error_list))
    if result != 'passed.':
        # message = sns.publish(
        #     TopicArn = snsarn,
        #     Subject = notificationheader,
        #     Message = notificationbody
        #     )
        # Output result to CloudWatch logstream
        print('Response: ' + notificationheader)
    else:
        print('passed')
        sys.exit()

# Trigger the Lambda handler
def lambda_handler(event, context):
    aws_account_ids = [context.invoked_function_arn.split(":")[4]]
    pagecheck()
    return "Successful"
    sys.exit()
Your CloudWatch logs contain the following error message:
Process exited before completing request
This is caused by invoking sys.exit() in your code. Locally your Python interpreter will just terminate when encountering such a sys.exit().
AWS Lambda, on the other hand, expects a Python function to just return and treats sys.exit() as an error. As your function was probably invoked asynchronously, AWS Lambda retries the execution twice.
To solve your problem, replace the occurrences of sys.exit() with return, or better yet, just remove the sys.exit() calls, as there are already implicit returns in the places where you use sys.exit().
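A minimal sketch of what that change might look like for the two affected functions (pagecheck() and the rest of the script stay exactly as in the question):
def notification(msgdet_list, error_list, errors):
    datacheck = "\n".join(msgdet_list)
    errorcheck = "\n".join(error_list)
    notificationbody = errorcheck + datacheck
    result = 'FAILED!' if errors > 1 else 'passed.'
    notificationheader = ('The daily ADFS check has been marked as ' + result + ' '
                          + str(errors) + ' ' + str(error_list))
    if result != 'passed.':
        print('Response: ' + notificationheader)
    else:
        print('passed')
    # No sys.exit() here: simply returning lets Lambda finish cleanly.

def lambda_handler(event, context):
    pagecheck()
    return "Successful"  # return instead of sys.exit()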

Unable to get the response in POST method in Python

I am facing a unique problem.
Following is my code.
import requests
import json

url = 'ABCD.com'
cookies = {'cookies': 'xyz'}
r = requests.post(url, cookies=cookies)
print(r.status_code)
json_data = json.loads(r.text)
print("Printing = ", json_data)
When I use the URL and cookie in the Postman tool with a POST request, I get a JSON response. But when I use the above code with the POST request method in Python, I get
404
Printing = {'title': 'init', 'description': "Error: couldn't find a device with id: xxxxxxxxx in ABCD: d1"}
But when I use the following code, i.e. with the GET request method:
url = 'ABCD.com'
cookies = {'cookies': 'xyz'}
r = requests.get(url, cookies=cookies)
print(r.status_code)
json_data = json.loads(r.text)
print("Printing = ", json_data)
I get
200
Printing = {'apiVersion': '0.4.0'}
I am not sure why the POST method works with a JSON response in the Postman tool but does not work when I try it in Python. I use the latest Python, 3.6.4.
I finally found what was wrong; the following is the correct way:
url = 'ABCD.com'
cookie = 'cookies=xyz'
r = requests.post(url, headers={'Cookie': cookie})
print(r.status_code)
json_data = json.loads(r.text)
print("Printing = ", json_data)
The web page was expecting the cookie to be passed in the headers, and with that change I got the response correctly.
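If you hit this kind of mismatch again, a quick way to see what actually went over the wire (and compare it with what Postman sends) is to inspect the prepared request attached to the response object. A small sketch, using the same placeholder URL and cookie as above:
import requests

url = 'ABCD.com'  # placeholder from the question
r = requests.post(url, headers={'Cookie': 'cookies=xyz'})

# The request that was actually sent:
print(r.request.method, r.request.url)
print(r.request.headers)   # compare these headers with Postman's request headers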
