SODA API Filtering - python-3.x

I am trying to filter the NY gov open data portal with their SODA API. I am following the docs on how to filter, but it is returning an empty dataframe.
# noinspection PyUnresolvedReferences
import numpy as np
# noinspection PyUnresolvedReferences
import pandas as pd
# noinspection PyUnresolvedReferences
from sodapy import Socrata
clientNYgov = Socrata('data.ny.gov', None)
Here is where I am trying to find only results in NY.
databaseM = clientNYgov.get('yg7h-zjbf.csv?business_city=NEW+YORK')
dfDatabaseM = pd.DataFrame.from_records(databaseM)
dfDatabaseM.to_csv('Manhattan Agents.csv')
print(dfDatabaseM)
But here is the Empty Output:
0 1 ... 9 10
0 business_address_1 business_address_2 ... license_number license_type
[1 rows x 11 columns]
Process finished with exit code 0
Please let me know if there's a problem with how I am filtering, not quite sure what is going wrong here. Thanks so much in advance!

Socrata exposes a JSON endpoint for exporting the data via the API. You can find it in the top right-hand corner of the dataset page by selecting API. For this solution I am using plain requests to retrieve the data. The sodapy module is nice to use, but it works the same way as a request.
import pandas as pd
import requests
data=requests.get('http://data.ny.gov/resource/yg7h-zjbf.json?$limit=50000&business_city=NEW YORK').json()
df=pd.DataFrame.from_records(data)
df

There are two approaches to do this with filters.
Method 1
This can be done with Socrata() by passing a SoQL query string to the query keyword of the get() method on the instantiated Socrata client. You will need an application token: if you do not use one, your requests will be subject to throttling. To avoid throttling, sign up for a Socrata account and create an app token.
query = f"""SELECT * WHERE business_city="NEW YORK" LIMIT 50000"""
client = Socrata("data.ny.gov", "YOUR-APP-TOKEN-HERE")
results = client.get("yg7h-zjbf", query=query)
df_socrata = pd.DataFrame.from_records(results)
Method 2
Using the JSON endpoint (same as @Joseph Gattuso's answer)
data = requests.get(
"http://data.ny.gov/resource/yg7h-zjbf.json?"
"$limit=50000&"
"business_city=NEW YORK"
).json()
df = pd.DataFrame.from_records(data)
Comparison of output - Verify that the two methods return the same result
assert df_socrata.equals(df)
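As a related note, sodapy's get() can also take SoQL parameters (where, limit, etc.) and simple field filters directly as keyword arguments, which avoids appending a query string to the dataset id as in the question. A minimal sketch, assuming the same app-token placeholder as above:
import pandas as pd
from sodapy import Socrata

client = Socrata("data.ny.gov", "YOUR-APP-TOKEN-HERE")
# Simple equality filters and options such as limit are passed as keyword arguments;
# the dataset id is given on its own, without a ".csv" suffix or query string.
results = client.get("yg7h-zjbf", business_city="NEW YORK", limit=50000)
df_kwargs = pd.DataFrame.from_records(results)
print(len(df_kwargs))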

Related

How to loop through api using Python to pull records from specific group of pages

I wrote a Python script that pulls JSON data from an API. The API has 10k records.
While I was setting up my script, I was only pulling 10 pages (each page contains 25 items), so it worked fine: it dumped everything into a .csv and also put everything into a MySQL db.
When I ran what I thought would be my last test and attempted to pull the data from all 500 pages, I got an internal server error. I researched that and think it is because I am pulling all this data at once. The API documentation is kind of crappy and I can't find any rate limit info, anyway...
Since this is not my API, I thought a quick solution would be to just run my script and let it pull the data from the first 10 pages, then the second 10 pages, then the third 10 pages, etc.
For obvious reasons I can't show all the code, but below are the basics/snippets. It is pretty simple: grab the URL, manipulate it a bit so I can add the page #, then count the number of pages, then loop through and grab the data content.
Could someone help by explaining/showing how I can loop through my URL, get the content from pages 1-10, then loop through and get the content from pages 11-20, and so on?
Any insight, suggestions, or examples would be greatly appreciated.
import requests
import json
import pandas as pd
import time
from datetime import datetime, timedelta

time_now = time.strftime("%Y%m%d-%H%M%S")

# Make our request
def main_request(baseurl, x, endpoint, headers):
    r = requests.get(baseurl + f'{x}' + endpoint, headers=headers)
    return r.json()

# determine how many pages are needed to loop through, use for pagination
def get_pages(response):
    # return response['page']['size']
    return 2

def parse_json(response):
    animal_list = []
    for item in response['_embedded']['results']:
        animal_details = {
            'animal type': item['_type'],
            'animal title': item['title'],
            'animal phase': item['type']['value']
        }
        animal_list.append(animal_details)
    return animal_list

animal_main_list = []
animal_data = main_request(baseurl, 1, endpoint, headers)

for x in range(1, get_pages(animal_data) + 1):
    print(x)
    animal_main_list.extend(parse_json(main_request(baseurl, x, endpoint, headers)))
Got it working. I used the range function and the f'{x}' in my URL: set the range in a for loop and it updates the URL. I used time.sleep to slow the retrieval down.
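For illustration, a minimal sketch of that chunked approach, with hypothetical baseurl/endpoint/headers placeholders standing in for the values elided from the question:
import time
import requests

# Hypothetical placeholders; substitute the real pieces of your API URL.
baseurl = 'https://example.com/api/items?page='
endpoint = '&size=25'
headers = {}

def main_request(baseurl, x, endpoint, headers):
    r = requests.get(baseurl + f'{x}' + endpoint, headers=headers)
    return r.json()

all_pages = []
total_pages = 500   # total number of pages reported by the API
chunk_size = 10     # pull 10 pages at a time, then pause

for start in range(1, total_pages + 1, chunk_size):
    for page in range(start, min(start + chunk_size, total_pages + 1)):
        all_pages.append(main_request(baseurl, page, endpoint, headers))
    time.sleep(5)   # pause between chunks to avoid hammering the API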

Use requests to download webpage that requires cookies into a dataframe in python

I can't get the code below to navigate through the disclaimer page of the website; I think the issue is how I try to collect the cookie.
I want to try and use requests rather than selenium.
import requests
import pandas as pd
from pandas import read_html
# open the page with the disclaimer just to get the cookies
disclaimer = "https://umm.gassco.no/disclaimer"
disclaimerdummy = requests.get(disclaimer)
# open the actual page and use the cookies from the fake page opened before
actualpage = "https://umm.gassco.no/disclaimer/acceptDisclaimer"
actualpage2 = requests.get(actualpage, cookies=disclaimerdummy.cookies)
# store the content of the actual page in text format
actualpagetext = (actualpage2.text)
# identify relevant data sources by looking at the 'msgTable' class in the webpage code
# This is where the tables with the realtime data can be found
gasscoflow = read_html(actualpagetext, attrs={"class": "msgTable"})
# create the dataframes for the two relevant tables
Table0 = pd.DataFrame(gasscoflow[0])
Table1 = pd.DataFrame(gasscoflow[1])
Table2 = pd.DataFrame(gasscoflow[2])
Table3 = pd.DataFrame(gasscoflow[3])
Table4 = pd.DataFrame(gasscoflow[4])
After seeing the website: first of all, it has only 2 tables, and you could use a session to reuse cookies across requests instead of storing them in a variable. Follow the code below to get all your expected data. It prints only the last 2 rows because I used tail; you can modify it and get your desired data from those tables.
import requests
import pandas as pd
from pandas import read_html
s=requests.session()
s1=s.get("https://umm.gassco.no")
s2=s.get("https://umm.gassco.no/disclaimer/acceptDisclaimer?")
data = read_html(s2.text, attrs={"class": "msgTable"})
t0 = pd.DataFrame(data[0])
t1 = pd.DataFrame(data[1])
print(t0.tail(2))
print(t1.tail(2))
Let me know if you have any questions :)

How to download bulk amount of images from google or any website

Actually, I need to do a project on machine learning, and for that I want a lot of images for training. I searched for a solution to this problem, but I failed to find one.
Can anyone help me solve this? Thanks in advance.
I used google images to download images using selenium. It is just a basic approach.
from selenium import webdriver
import time
import urllib.request
import os
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome("path\\to\\the\\webdriverFile")
browser.get("https://www.google.com")
search = browser.find_element_by_name('q')
search.send_keys(key_words, Keys.ENTER)  # use required key_words to download images
elem = browser.find_element_by_link_text('Images')
elem.get_attribute('href')
elem.click()

value = 0
for i in range(20):
    browser.execute_script("scrollBy(" + str(value) + ",+1000);")
    value += 1000
    time.sleep(3)

elem1 = browser.find_element_by_id('islmp')
sub = elem1.find_elements_by_tag_name("img")

try:
    os.mkdir('downloads')
except FileExistsError:
    pass

count = 0
for i in sub:
    src = i.get_attribute('src')
    try:
        if src != None:
            src = str(src)
            print(src)
            count += 1
            urllib.request.urlretrieve(src,
                os.path.join('downloads', 'image' + str(count) + '.jpg'))
        else:
            raise TypeError
    except TypeError:
        print('fail')
    if count == required_images_number:  ## use number as required
        break
check this for detailed explanation.
download driver here
My tip to you is: use an image search API. This is my favourite: Bing Image Search API.
The following text is from Send search queries using the REST API and Python.
Running the quickstart
To get started, set subscription_key to a valid subscription key for the Bing API service.
Python
subscription_key = None
assert subscription_key
Next, verify that the search_url endpoint is correct. At this writing, only one endpoint is used for Bing search APIs. If you encounter authorization errors, double-check this value against the Bing search endpoint in your Azure dashboard.
Python
search_url = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
Set search_term to look for images of puppies.
Python
search_term = "puppies"
The following block uses the requests library in Python to call out to the Bing search APIs and return the results as a JSON object. Observe that we pass in the API key via the headers dictionary and the search term via the params dictionary. To see the full list of options that can be used to filter search results, refer to the REST API documentation.
Python
import requests
headers = {"Ocp-Apim-Subscription-Key" : subscription_key}
params = {"q": search_term, "license": "public", "imageType": "photo"}
response = requests.get(search_url, headers=headers, params=params)
response.raise_for_status()
search_results = response.json()
The search_results object contains the actual images along with rich metadata such as related items. For example, the following line of code can extract the thumbnail URLs for the first 16 results.
Python
thumbnail_urls = [img["thumbnailUrl"] for img in search_results["value"][:16]]
Then use the PIL library to download the thumbnail images and the matplotlib library to render them on a 4x4 grid.
Python
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
from io import BytesIO

f, axes = plt.subplots(4, 4)
for i in range(4):
    for j in range(4):
        image_data = requests.get(thumbnail_urls[i + 4 * j])
        image_data.raise_for_status()
        image = Image.open(BytesIO(image_data.content))
        axes[i][j].imshow(image)
        axes[i][j].axis("off")
plt.show()
Sample JSON response
Responses from the Bing Image Search API are returned as JSON. This sample response has been truncated to show a single result.
JSON
{
"_type":"Images",
"instrumentation":{
"_type":"ResponseInstrumentation"
},
"readLink":"images\/search?q=tropical ocean",
"webSearchUrl":"https:\/\/www.bing.com\/images\/search?q=tropical ocean&FORM=OIIARP",
"totalEstimatedMatches":842,
"nextOffset":47,
"value":[
{
"webSearchUrl":"https:\/\/www.bing.com\/images\/search?view=detailv2&FORM=OIIRPO&q=tropical+ocean&id=8607ACDACB243BDEA7E1EF78127DA931E680E3A5&simid=608027248313960152",
"name":"My Life in the Ocean | The greatest WordPress.com site in ...",
"thumbnailUrl":"https:\/\/tse3.mm.bing.net\/th?id=OIP.fmwSKKmKpmZtJiBDps1kLAHaEo&pid=Api",
"datePublished":"2017-11-03T08:51:00.0000000Z",
"contentUrl":"https:\/\/mylifeintheocean.files.wordpress.com\/2012\/11\/tropical-ocean-wallpaper-1920x12003.jpg",
"hostPageUrl":"https:\/\/mylifeintheocean.wordpress.com\/",
"contentSize":"897388 B",
"encodingFormat":"jpeg",
"hostPageDisplayUrl":"https:\/\/mylifeintheocean.wordpress.com",
"width":1920,
"height":1200,
"thumbnail":{
"width":474,
"height":296
},
"imageInsightsToken":"ccid_fmwSKKmK*mid_8607ACDACB243BDEA7E1EF78127DA931E680E3A5*simid_608027248313960152*thid_OIP.fmwSKKmKpmZtJiBDps1kLAHaEo",
"insightsMetadata":{
"recipeSourcesCount":0,
"bestRepresentativeQuery":{
"text":"Tropical Beaches Desktop Wallpaper",
"displayText":"Tropical Beaches Desktop Wallpaper",
"webSearchUrl":"https:\/\/www.bing.com\/images\/search?q=Tropical+Beaches+Desktop+Wallpaper&id=8607ACDACB243BDEA7E1EF78127DA931E680E3A5&FORM=IDBQDM"
},
"pagesIncludingCount":115,
"availableSizesCount":44
},
"imageId":"8607ACDACB243BDEA7E1EF78127DA931E680E3A5",
"accentColor":"0050B2"
}
]
}

running in parallel requests.post over a pandas data frame

So I am trying to run a defined function that makes a requests.post call, takes its input from a pandas dataframe, and saves the result to the same dataframe in a different column.
import requests, json
import pandas as pd
import argparse

def postRequest(input, url):
    '''Post response from url'''
    headers = {'content-type': 'application/json'}
    r = requests.post(url=url, json=json.loads(input), headers=headers)
    response = r.json()
    return response

def payload(text):
    # get proper payload from text
    std_payload = {"auth_key": "key",
                   "org": {"id": org_id, "name": "org"},
                   "ver": {"id": ver_id, "name": "ver"},
                   "mess": {"id": 80}}
    std_payload['mess']['text'] = text  # key name matches the 'mess' entry defined above
    std_payload = json.dumps(std_payload)
    return std_payload

def find(df):
    ff = pd.DataFrame(columns=['text', 'expected', 'word', 'payload', 'response'])
    count = 0
    for leng in range(0, len(df)):
        search = df.text[leng].split()
        ff.loc[count] = df.iloc[leng]
        ff.loc[count, 'word'] = 'orginalphrase'
        count = count + 1
        for w in range(0, len(search)):
            if df.text[leng] == "3174":
                ff.append(df.iloc[leng], ignore_index=True)
                ff.loc[count, 'text'] = "3174"
                ff.loc[count, 'word'] = None
                ff.loc[count, 'expected'] = '[]'
                continue
            word = search[:]
            ff.loc[count, 'word'] = word[w]
            word[w] = 'z'
            phrase = ' '.join(word)
            ff.loc[count, 'text'] = phrase
            ff.loc[count, 'expected'] = df.loc[leng, 'expected']
            count = count + 1
        if df.text[leng] == "3174":
            continue
    return ff

# read in csv of phrases to be tested
df = pd.read_csv(filename, engine='python')
# allow empty cells by setting them to the placeholder phrase "3174"
df = df.fillna("3174")

sf = find(df)
for i in sf.index:
    sf.loc[i, 'payload'] = payload(sf.text[i])  # assign per row rather than overwriting the whole column
for index in df.index:
    sf.response[index] = postRequest(df.text[index], url)
From all my tests, this operation runs over the dataframe one row at a time, which means that when my dataframe is large it can take a few hours.
Searching online for running things in parallel gives me a few methods, but I do not understand what those methods are doing. I have seen pooling and threading examples, and I can get the examples themselves to work, such as:
Simultaneously run POST in Python
Asynchronous Requests with Python requests
When I try to apply them to my code, though, specifically to postRequest, I cannot get any method to work; it still goes one by one.
Can anyone provide assistance in getting the parallelism to work correctly? If more information is required, please let me know.
Thanks
Edit:
here is the last thing I was working with
import concurrent.futures  # needed for ThreadPoolExecutor and as_completed

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(postRequest, sf.payload[index], trends_url): index for index in range(10)}
    count = 0
    for future in concurrent.futures.as_completed(future_to_url):
        repo = future_to_url[future]
        data = future.result()
        sf.response[count] = data
        count = count + 1
Also, the dataframe has anywhere between 2000 and 4000 rows, so doing it in sequence can take up to 4 hours.
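No answer is reproduced here, so the following is only a minimal sketch of how the ThreadPoolExecutor approach can be wired up, assuming sf already has its 'payload' column built as in the question and trends_url is the same (elided) endpoint. The key difference from the snippet above is that each result is written back at the row index stored in the futures dict, rather than at a running counter that depends on completion order:
import concurrent.futures
import json
import requests

def postRequest(payload_str, url):
    '''POST a prepared JSON payload string and return the parsed response.'''
    headers = {'content-type': 'application/json'}
    r = requests.post(url=url, json=json.loads(payload_str), headers=headers)
    return r.json()

# sf and trends_url are assumed to come from the question's earlier code.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_index = {
        executor.submit(postRequest, sf.payload[i], trends_url): i
        for i in sf.index
    }
    for future in concurrent.futures.as_completed(future_to_index):
        i = future_to_index[future]                           # row this result belongs to
        sf.loc[i, 'response'] = json.dumps(future.result())   # store as a JSON string so the cell stays scalar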

Error Using geopy library

I have the following question: I want to set up a routine that iterates over a dataframe (pandas) and extracts longitude and latitude data after supplying the address, using the 'geopy' library.
The routine I created was:
import time
import os
import pandas as pd
from geopy.geocoders import GoogleV3

arquivo = pd.ExcelFile('path')
df = arquivo.parse("Table1")

def set_proxy():
    proxy_addr = 'http://{user}:{passwd}@{address}:{port}'.format(
        user='usuario', passwd='senha',
        address='IP', port=int('PORTA'))
    os.environ['http_proxy'] = proxy_addr
    os.environ['https_proxy'] = proxy_addr

def unset_proxy():
    os.environ.pop('http_proxy')
    os.environ.pop('https_proxy')

set_proxy()

geo_keys = ['AIzaSyBXkATWIrQyNX6T-VRa2gRmC9dJRoqzss0']  # API Google
geolocator = GoogleV3(api_key=geo_keys)

for index, row in df.iterrows():
    location = geolocator.geocode(row['NO_LOGRADOURO'], timeout=10)
    time.sleep(2)
    lat = location.latitude
    lon = location.longitude
    address = location.address
    unset_proxy()
    print(str(lat) + ', ' + str(lon))
The problem I'm having is that when I run the code the following error is thrown:
GeocoderQueryError: Your request was denied.
I tried creating it without passing the key to the Google API; however, I get the following message:
KeyError: 'http_proxy'
And if I remove the unset_proxy() statement from within the for loop, the message I receive is:
GeocoderQuotaExceeded: The given key has gone over the requests limit in the 24 hour period or has submitted too many requests in too short a period of time.
But I only made 5 requests today, and I'm putting a 2-second sleep between requests. Should the period be longer?
Any idea?
The api_key argument of the GoogleV3 class must be a string, not a list of strings (that's the cause of your first issue).
geopy doesn't guarantee that the http_proxy/https_proxy env vars are respected (especially runtime modifications of os.environ). The usage of proxies advised by the docs is:
geolocator = GoogleV3(proxies={'http': proxy_addr, 'https': proxy_addr})
PS: Please don't ever post your API keys publicly. I suggest revoking the key you've posted in the question and generating a new one, to prevent it from being abused by someone else.
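Putting both points together, here is a minimal sketch of the corrected setup; the key and proxy address are placeholders, and the sample address is only for illustration:
from geopy.geocoders import GoogleV3

# Hypothetical placeholders: substitute your own (regenerated) key and proxy details.
api_key = "YOUR-NEW-API-KEY"
proxy_addr = "http://usuario:senha@IP:PORTA"

# Pass the key as a plain string and the proxies directly to the constructor,
# instead of mutating os.environ at runtime.
geolocator = GoogleV3(api_key=api_key,
                      proxies={'http': proxy_addr, 'https': proxy_addr})

location = geolocator.geocode("1600 Amphitheatre Parkway, Mountain View, CA", timeout=10)
if location is not None:
    print(location.latitude, location.longitude)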

Resources