Error Using geopy library - python-3.x

I have the following question: I want to set up a routine that iterates over a pandas dataframe and extracts longitude and latitude data for each address, using the 'geopy' library.
The routine I created was:
import time
import os
import pandas as pd
from geopy.geocoders import GoogleV3

arquivo = pd.ExcelFile('path')
df = arquivo.parse("Table1")

def set_proxy():
    proxy_addr = 'http://{user}:{passwd}@{address}:{port}'.format(
        user='usuario', passwd='senha',
        address='IP', port=int('PORTA'))
    os.environ['http_proxy'] = proxy_addr
    os.environ['https_proxy'] = proxy_addr

def unset_proxy():
    os.environ.pop('http_proxy')
    os.environ.pop('https_proxy')

set_proxy()

geo_keys = ['AIzaSyBXkATWIrQyNX6T-VRa2gRmC9dJRoqzss0']  # Google API key
geolocator = GoogleV3(api_key=geo_keys)

for index, row in df.iterrows():
    location = geolocator.geocode(row['NO_LOGRADOURO'], timeout=10)
    time.sleep(2)
    lat = location.latitude
    lon = location.longitude
    address = location.address
    unset_proxy()
    print(str(lat) + ', ' + str(lon))
The problem I'm having is that when I run the code the following error is thrown:
GeocoderQueryError: Your request was denied.
I tried creating the geolocator without passing the Google API key; however, I then get the following message:
KeyError: 'http_proxy'
And if I remove the unset_proxy() call from inside the for loop, the message I receive is:
GeocoderQuotaExceeded: The given key has gone over the requests limit in the 24 hour period or has submitted too many requests in too short a period of time.
But I only made 5 requests today, and I'm putting a 2-second sleep between requests. Should the period be longer?
Any idea?

The api_key argument of the GoogleV3 class must be a string, not a list of strings (that's the cause of your first issue).
geopy doesn't guarantee that the http_proxy/https_proxy env vars will be respected (especially runtime modifications of os.environ). The usage of proxies advised by the docs is:
geolocator = GoogleV3(proxies={'http': proxy_addr, 'https': proxy_addr})
PS: Please don't ever post your API keys publicly. I suggest revoking the key you've posted in the question and generating a new one, to prevent it being abused by someone else.
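Putting both fixes together, here is a minimal corrected sketch of the loop (the API key string and the proxy address are placeholders, and NO_LOGRADOURO is the address column from the question):

import time
import pandas as pd
from geopy.geocoders import GoogleV3

proxy_addr = 'http://usuario:senha@IP:PORTA'  # placeholder proxy URL
df = pd.ExcelFile('path').parse("Table1")

# api_key is a single string; the proxies are passed to the geocoder directly
geolocator = GoogleV3(api_key='YOUR_NEW_API_KEY',
                      proxies={'http': proxy_addr, 'https': proxy_addr})

for index, row in df.iterrows():
    location = geolocator.geocode(row['NO_LOGRADOURO'], timeout=10)
    if location is not None:  # geocode returns None when nothing is found
        print(location.latitude, location.longitude)
    time.sleep(2)  # stay well under the per-second quota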

Related

How to loop through api using Python to pull records from specific group of pages

I wrote a Python script that pulls JSON data from an API. The API has 10k records.
While setting up my script I was only pulling 10 pages (each page contains 25 items), so it worked fine: I dumped everything into a .csv and also put everything into a MySQL db.
When I ran what I thought would be my last test and attempted to pull the data from all 500 pages, I got an internal server error. I researched that and think it is because I am pulling all this data at once. The API documentation is kind of crappy and I can't find any rate limit info, anyway...
Since this is not my API, I thought a quick solution would be just to run my script and let it pull the data from the first 10 pages, then the second 10 pages, the 3rd 10 pages, etc.
For obvious reasons I can't show all the code, but below are the basics/snippets. It is pretty simple: just grab the URL, manipulate it a bit so I can add the page number, then count the number of pages, then loop through and grab the data content.
Could someone help by explaining/showing how I can loop through my URL, get the content from pages 1-10, then loop through again and get the content from pages 11-20, and so on?
Any insight, suggestions, examples would be greatly appreciated.
import requests
import json
import pandas as pd
import time
from datetime import datetime, timedelta

time_now = time.strftime("%Y%m%d-%H%M%S")

# Make our request
def main_request(baseurl, x, endpoint, headers):
    r = requests.get(baseurl + f'{x}' + endpoint, headers=headers)
    return r.json()

# determine how many pages are needed to loop through, use for pagination
def get_pages(response):
    # return response['page']['size']
    return 2

def parse_json(response):
    animal_list = []
    for item in response['_embedded']['results']:
        animal_details = {
            'animal type': item['_type'],
            'animal title': item['title'],
            'animal phase': item['type']['value']
        }
        animal_list.append(animal_details)
    return animal_list

animal_main_list = []
animal_data = main_request(baseurl, 1, endpoint, headers)
for x in range(1, get_pages(animal_data) + 1):
    print(x)
    animal_main_list.extend(parse_json(main_request(baseurl, x, endpoint, headers)))
Got it working, using the range function and the f'{x}' in my URL. Setting the range in a for loop updates the URL, and I used time.sleep to slow the retrieval down.
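For anyone looking for a concrete version of that idea, here is a minimal sketch that pulls the pages in batches of 10 with a pause between batches. It reuses baseurl, endpoint, headers and parse_json from the snippet above; the batch size, pause length and page count are assumptions, since the API's actual rate limit isn't documented.

import time
import requests

BATCH_SIZE = 10    # pages per batch
PAUSE = 30         # seconds to wait between batches (a guess)
TOTAL_PAGES = 500

def fetch_page(baseurl, page, endpoint, headers):
    r = requests.get(baseurl + f'{page}' + endpoint, headers=headers)
    r.raise_for_status()
    return r.json()

animal_main_list = []
for start in range(1, TOTAL_PAGES + 1, BATCH_SIZE):
    for page in range(start, min(start + BATCH_SIZE, TOTAL_PAGES + 1)):
        animal_main_list.extend(parse_json(fetch_page(baseurl, page, endpoint, headers)))
    print(f'pulled pages {start}-{min(start + BATCH_SIZE - 1, TOTAL_PAGES)}')
    time.sleep(PAUSE)  # give the server a break before the next batch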

python3, Trying to get an output from my function I defined, need some guidance

I found a pretty cool ASN API tool that allows me to supply an AS number, and it will go out and pull down the subnets that relate to that ASN.
Here is rough but partial code. I am defining a function with an ASNNUMBER parameter (to which I will supply the number from another file).
When I call url here, it just gives me an n...
What I'm trying to do here, is append my str(ASNNUMBER) to the end of the ?q= parameter in the URL.
Once I do that, I'd like to display my results and output them to a file.
import requests
def asnfinder(ASNNUMBER):
    print('n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
The result I'd like to get is the output of the GET request I'm performing; right now, all I see is:
## Running ASNFinder
n
Try writing something like this:
import requests

def asnfinder(ASNNUMBER):
    print('\n\n######## Running ASNFinder ########\n')
    url = 'https://api.hackertarget.com/aslookup?q=' + str(ASNNUMBER)
    response = requests.get(url)
    data = response.text
    print(data)
    with open('filename', 'w') as f:  # 'w' so the file can actually be written
        f.write(data)
It should work fine.
P.S. If it helped ya, please make sure you mark this as the answer :)
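A quick usage sketch (the AS number is just an example, and the exact query format the API expects may differ):

asnfinder(15169)  # prints the lookup result and writes it to 'filename'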

Invalid hash, timestamp, and key combination in Marvel API Call

I'm trying to form a Marvel API Call.
Here's a link on authorization:
https://developer.marvel.com/documentation/authorization
I'm attempting to create a server-side application, so according to the link above, I need timestamp, apikey, and hash url parameters. The hash needs to be an md5 hash of the form md5(timestamp + privateKey + publicKey), and the apikey url param is my public key.
Here's my code. I'm making the request in Python 3, using the requests library to form the request, the time library to form the timestamp, and the hashlib library to form the hash.
#request.py: making a http request to marvel api
import requests;
import time;
import hashlib;
#timestamp
ts = time.time();
ts_str = str(float(ts));
#keys
public_key = 'a3c785ecc50aa21b134fca1391903926';
private_key = 'my_private_key';
#hash and encodings
m_hash = hashlib.md5();
ts_str_byte = bytes(ts_str, 'utf-8');
private_key_byte = bytes(private_key, 'utf-8');
public_key_byte = bytes(public_key, 'utf-8');
m_hash.update(ts_str_byte + private_key_byte + public_key_byte);
m_hash_str = str(m_hash.digest());
#all request parameters
payload = {'ts': ts_str, 'apikey': 'a3c785ecc50aa21b134fca1391903926', 'hash': m_hash_str};
#make request
r = requests.get('https://gateway.marvel.com:443/v1/public/characters', params=payload);
#for debugging
print(r.url);
print(r.json());
Here's the output:
$python3 request.py
https://gateway.marvel.com:443/v1/public/characters... (URL TRUNCATED FOR READABILITY)
{'code': 'InvalidCredentials', 'message': 'That hash, timestamp, and key combination is invalid'}
$
I'm not sure what exactly is causing the combination to be invalid.
I can provide more info on request. Any info would be appreciated. Thank you!
EDIT:
I'm a little new to API calls in general. Are there any resources for understanding more about how to perform them? So far, with my limited experience, they seem very specific, and getting each one to work takes a while. I'm a college student, and whenever I work at hackathons it takes me a long time just to figure out how to perform the API call. I admit I'm not experienced, but in general, does figuring out new APIs require a large learning curve, even for people who have done 10 or so of them?
Again, thanks for your time :)
I've also had similar issues when accessing the Marvel API key. For those that are still struggling, here is my templated code (that I use in a jupyter notebook).
# import dependencies
import hashlib #this is needed for the hashing library
import time #this is needed to produce a time stamp
import json #Marvel provides its information in json format
import requests #This is used to request information from the API
#Constructing the Hash
m = hashlib.md5() #I'm assigning the method to the variable m. Marvel
#requires md5 hashing, but I could also use SHA256 or others for APIS other
#than Marvel's
ts = str(time.time()) #This creates the time stamp as a string
ts_byte = bytes(ts, 'utf-8') #This converts the timestamp into a byte
m.update(ts_byte) # I add the timestamp (in byte format) to the hash
m.update(b"my_private_key") #I add the private key to
#the hash.Notice I added the b in front of the string to convert it to byte
#format, which is required for md5
m.update(b"b2aeb1c91ad82792e4583eb08509f87a") #And now I add my public key to
#the hash
hasht = m.hexdigest() #Marvel requires the string to be in hex; they
#don't say this in their API documentation, unfortunately.
#constructing the query
base_url = "https://gateway.marvel.com" #provided in Marvel API documentation
api_key = "b2aeb1c91ad82792e4583eb08509f87a" #My public key
query = "/v1/public/events" +"?" #My query is for all the events in Marvel Uni
#Building the actual query from the information above
query_url = base_url + query + "ts=" + ts + "&apikey=" + api_key + "&hash=" + hasht
print(query_url) #I like to look at the query before I make the request to
#ensure that it's accurate.
#Making the API request and receiving info back as a json
data = requests.get(query_url).json()
print(data) #I like to view the data to make sure I received it correctly
Credit where credit is due: I relied on this blog a lot. You can go here for more information on the hashlib library: https://docs.python.org/3/library/hashlib.html
I noticed in your terminal that your MD5 hash is uppercase. MD5 should be output in lowercase; make sure you convert it.
That was my issue: I was sending an uppercase hash.
As mentioned above, the solution was that the hash wasn't formatted properly; it needed to be a hexadecimal string, and now the issue is resolved.
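In terms of the code in the question, that fix is a one-line change: hexdigest() returns the lowercase hex string the API expects, rather than str() of the raw digest bytes.

m_hash_str = m_hash.hexdigest()  # instead of str(m_hash.digest())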
Your final URL should be like this:
http://gateway.marvel.com/v1/public/characters?apikey=(public_key)&ts=1&hash=(md5_type_hash)
So, you already have your public key in your developer account. But how can you produce md5_type_hash?
For ts=1, just use 1. So your pattern should be this:
1 + private_key(ee7) + public_key(aa3). For example: Convert 1ee7aa3 to MD5-Hash = 1ca3591360a252817c30a16b615b0afa (md5_type_hash)
You can create from this website: https://www.md5hashgenerator.com
Done, you can use the Marvel API now!
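Equivalently, the same hash can be computed in Python rather than on a website:

import hashlib
# md5 of ts + private_key + public_key, using the example values above
print(hashlib.md5(b'1ee7aa3').hexdigest())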

Python 3.6 Downloading .csv files from finance.yahoo.com using requests module

I was trying to download a .csv file from this url for the history of a stock. Here's my code:
import requests
r = requests.get("https://query1.finance.yahoo.com/v7/finance/download/CHOLAFIN.BO?period1=1514562437&period2=1517240837&interval=1d&events=history&crumb=JaCfCutLNr7")
file = open(r"history_of_stock.csv", 'w')
file.write(r.text)
file.close()
But when I opened the file history_of_stock.csv, this is what I found:
{
    "finance": {
        "error": {
            "code": "Unauthorized",
            "description": "Invalid cookie"
        }
    }
}
I couldn't find anything that could fix my problem. I found this thread in which someone has the same problem except that it is in C#: C# Download price data csv file from https instead of http
To complement the earlier answer and provide concrete, complete code, I wrote a script which accomplishes the task of getting historical stock prices from Yahoo Finance. I tried to write it as simply as possible. To give a summary: when you use requests to get a URL, in many instances you don't need to worry about crumbs or cookies. However, with Yahoo Finance, you need to get the crumb and the cookies. Once you have the cookies, you are good to go! Make sure to set a timeout on the requests.get call.
import re
import requests
import sys
from pdb import set_trace as pb
symbol = sys.argv[-1]
start_date = '1442203200' # start date timestamp
end_date = '1531800000' # end date timestamp
crumble_link = 'https://finance.yahoo.com/quote/{0}/history?p={0}'
crumble_regex = r'CrumbStore":{"crumb":"(.*?)"}'
cookie_regex = r'set-cookie: (.*?);'
quote_link = 'https://query1.finance.yahoo.com/v7/finance/download/{}?period1={}&period2={}&interval=1d&events=history&crumb={}'
link = crumble_link.format(symbol)
session = requests.Session()
response = session.get(link)
# get crumbs
text = str(response.content)
match = re.search(crumble_regex, text)
crumbs = match.group(1)
# get cookie
cookie = session.cookies.get_dict()
url = "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start_date, end_date, crumbs)
r = requests.get(url,cookies=session.cookies.get_dict(),timeout=5, stream=True)
out = r.text
filename = '{}.csv'.format(symbol)
with open(filename, 'w') as f:
    f.write(out)
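As a usage note for the script above (the filename is arbitrary): save it as, say, yahoo_history.py and run python yahoo_history.py CHOLAFIN.BO; the symbol is taken from the last command-line argument, and the history should be written to CHOLAFIN.BO.csv.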
There was a service for exactly this but it was discontinued.
Now you can do what you intend but first you need to get a Cookie. On this post there is an example of how to do it.
Basically, first you need to make a useless request to get the Cookie and later, with this Cookie in place, you can query whatever else you actually need.
There's also a post about another service which might make your life easier.
There's also a Python module to work around this inconvenience and code to show how to do it without it.

How to add a Timeout in concurrent.futures

As far as I can tell, my code works absolutely fine, though it probably looks a bit rudimentary and crude to more experienced eyes.
Objective:
Create a 'filter' that loops through a (large) range of possible ID numbers. Each ID is tried as a log-in at the url website. If the ID is valid, it is saved to hit_list.
Issue:
In large loops, the programme 'hangs' for indefinite periods of time. Although I have no evidence (no exception is thrown), I suspect this is a timeout issue (or rather, it would be if a timeout were specified).
Question:
I want to add a timeout, and then handle the timeout exception so that my programme stops hanging. If this theory is wrong, I would also like to hear what my issue might be.
How to add a timeout is a question that has been asked before (here and here), but after spending all weekend working on this, I'm still at a loss. Put bluntly, I don't understand those answers.
What I've tried:
I created a try/except block in the id_filter function. The try is at r = s.get(url) and the except is at the end of the function. I've read the requests docs in detail, here and here. This didn't work.
The more I read about futures, the more I'm convinced that handling errors has to be done in futures rather than in requests (as I did above). So I tried inserting a timeout in the brackets after boss.map, but as far as I could tell this had no effect; it seems too simple anyway.
So, to reiterate:
For large loops (50,000+) my programme tends to hang for an indefinite period of time (there is no exact point when this starts, though it's usually after 90% of the loop has been processed). I don't know why, but I suspect that adding a timeout would throw an exception, which I can then catch. This theory may, however, be wrong. I have tried to add a timeout and handle other errors in the requests part, but to no effect.
Python 3.5
My code:
import concurrent.futures as cf
import requests
import time
from bs4 import BeautifulSoup

hit_list = []
processed_list = []
startrange = 100050000
end_range = 100150000
loop_size = range(startrange, end_range)
workers = 70
chunks = 300
url = 'https://ndber.seai.ie/pass/ber/search.aspx'

def id_filter(_range):
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'FOR MORE INFORMATION ABOUT THIS DATA COLLECTION PLEASE CONTACT ########'
        })
        r = s.get(url)
        time.sleep(.1)
        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate = soup.find('input', {'name': '__VIEWSTATE'}).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value')
        validation = soup.find('input', {'name': '__EVENTVALIDATION'}).get('value')
        for ber in _range:
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': viewstategen,
                '__EVENTVALIDATION': validation,
            }
            y = s.post(url, data=data)
            if 'No results found' in y.text:
                pass  # print('Invalid ID', ber)
            else:
                hit_list.append(ber)
                print('Valid ID', ber)

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs = [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]
        boss.map(id_filter, jobs)
        # record data below
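For what it's worth, here is a minimal sketch of one way to get the timeout behaviour being asked about, reusing workers, loop_size, chunks and id_filter from the code above. It makes two assumptions: the s.get/s.post calls inside id_filter are themselves given a timeout (e.g. timeout=30) so requests.exceptions.Timeout can be raised, and the chunks are submitted individually so each future's result(timeout=...) can raise concurrent.futures.TimeoutError. The numeric values are placeholders, not tested against this site.

import concurrent.futures as cf
import requests

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs = [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]
        # submit() instead of map(), so each chunk gets its own future
        futures = {boss.submit(id_filter, job): job for job in jobs}
        for future, job in futures.items():
            try:
                # raises concurrent.futures.TimeoutError if this chunk overruns;
                # the worker thread keeps running, this only unblocks the caller
                future.result(timeout=600)
            except cf.TimeoutError:
                print('chunk starting at', job[0], 'did not finish in time')
            except requests.exceptions.Timeout:
                print('a request in chunk starting at', job[0], 'timed out')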
