I am teaching myself geopy. It seems simple and straightforward, yet my code isn't working. It is supposed to:
read in a list of address fields from a CSV into a pandas df
concatenate the address fields into a single column formatted for geopy
make a list from the new column
feed each item in the list into geopy via a for loop and return the coordinates
add the coordinates to the original df and export it to a CSV
#setup
from geopy.geocoders import Nominatim
import pandas as pd
#create the df
df = pd.DataFrame(pd.read_csv('properties to geocode.csv'))
df['Location'] = df['Street Address'].astype(str)+","+df['City'].astype(str)+","+df['State'].astype(str)
#create the geolocator object
geolocator = Nominatim(timeout=1, user_agent = "My_Agent")
#create the locations list
locations = df['Location']
#empty lists for later columns
lats = []
longs = []
#process the location list
for item in locations:
    location = geolocator.geocode('item')
    lat = location.latitude
    long = location.longitude
    lats.append(lat)
    longs.append(long)
#add the lists to the df
df.insert(5,'Latitude',lats)
df.insert(6,'Longitude',longs)
#export
df.to_csv('geocoded-properties2.csv',index=False)
Something is not working because it returns the same latitude and longitude values for every row, instead of unique coordinates for each.
I have found working code using .apply elsewhere but am interested in learning what I did wrong. Any thoughts?
Your code does not contain sample data, so I have used some sample data available from public APIs to demonstrate.
Your code passes the literal string 'item' to geolocator.geocode() - it needs to be the address associated with the row (see the short fix sketch below).
I have provided an example of using pandas apply, a list comprehension, and a for loop equivalent of the comprehension.
The results show all three approaches are equivalent.
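For reference, a minimal fix to the question's own loop is simply to pass the loop variable rather than the string literal 'item' (a sketch reusing the df, Location column and geolocator defined in the question):
lats, longs = [], []
for item in df['Location']:
    location = geolocator.geocode(item)   # pass the variable, not the literal 'item'
    if location is not None:
        lats.append(location.latitude)
        longs.append(location.longitude)
    else:
        # keep list lengths aligned with the df when an address cannot be geocoded
        lats.append(None)
        longs.append(None)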
from geopy.geocoders import Nominatim
import requests
import pandas as pd
searchendpoint = "https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations"
# get all healthcare facilities in Herefordshire
dfhc = pd.concat([pd.json_normalize(requests
                                    .get(searchendpoint, params={"PostCode": f"HR{i}", "Status": "Active"})
                                    .json()["Organisations"])
                  for i in range(1, 10)]).reset_index(drop=True)

def gps(url, geolocator=None):
    # get the address and construct a space-delimited string
    a = " ".join(str(x) for x in requests.get(url).json()["Organisation"]["GeoLoc"]["Location"].values())
    lonlat = geolocator.geocode(a)
    if lonlat is not None:
        return lonlat[1]
    else:
        return (0, 0)
# work with just GPs
dfgp = dfhc.loc[dfhc.PrimaryRoleId.isin(["RO180","RO96"])].head(5).copy()
geolocator = Nominatim(timeout=1, user_agent = "My_Agent")
# pandas apply
dfgp["lonlat_apply"] = dfgp["OrgLink"].apply(gps, geolocator=geolocator)
# list comprehension
lonlat = [gps(url, geolocator=geolocator) for url in dfgp["OrgLink"].values]
dfgp["lonlat_listcomp"] = lonlat
# old school loop
lonlat = []
for item in dfgp["OrgLink"].values:
    lonlat.append(gps(item, geolocator=geolocator))
dfgp["lonlat_oldschool"] = lonlat
|    | Name | OrgId | Status | OrgRecordClass | PostCode | LastChangeDate | PrimaryRoleId | PrimaryRoleDescription | OrgLink | lonlat_apply | lonlat_listcomp | lonlat_oldschool |
|----|------|-------|--------|----------------|----------|----------------|---------------|------------------------|---------|--------------|-----------------|------------------|
| 7  | AYLESTONE HILL SURGERY | M81026002 | Active | RC2 | HR1 1HR | 2020-03-19 | RO96 | BRANCH SURGERY | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/M81026002 | (52.0612429, -2.7026047) | (52.0612429, -2.7026047) | (52.0612429, -2.7026047) |
| 9  | BARRS COURT SCHOOL | 5CN91 | Active | RC2 | HR1 1EQ | 2021-01-28 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN91 | (52.0619209, -2.7086105) | (52.0619209, -2.7086105) | (52.0619209, -2.7086105) |
| 13 | BODENHAM SURGERY | 5CN24 | Active | RC2 | HR1 3JU | 2013-05-08 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN24 | (52.152405, -2.6671942) | (52.152405, -2.6671942) | (52.152405, -2.6671942) |
| 22 | BELMONT ABBEY | 5CN16 | Active | RC2 | HR2 9RP | 2013-05-08 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN16 | (52.0423056, -2.7648698) | (52.0423056, -2.7648698) | (52.0423056, -2.7648698) |
| 24 | BELMONT HEALTH CENTRE | 5CN22 | Active | RC2 | HR2 7XT | 2013-05-08 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN22 | (52.0407746, -2.739788) | (52.0407746, -2.739788) | (52.0407746, -2.739788) |
An ongoing project... by a Python novice!! I have created four bs4.element.ResultSet objects called games (wins), draws, ties and custom from a school website. I am helping the league out by scraping all the school scores and aggregating them. I cannot figure out how to combine those four ResultSets into one so I can run the rest of the program. Right now it only saves the "games (wins)" to the Excel spreadsheet. Also, in the output below there are a ton of spaces - how can I get rid of those \n\t characters?? Thanks so much in advance for your help.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import re
url = 'https://www.loomischaffee.org/athletics/teams/fall/soccer-boys/varsity'
page = requests.get(url)
soup = bs(page.content, 'html.parser')
week = soup.find(id='fsEl_5138')
games = week.find_all(class_ ='fsResultWin')
draws = week.find_all(class_ ='fsResultTie')
ties = week.find_all(class_ ='fsResultLoss')
custom = week.find_all(class_ ='fsResultCustom')
# now creating 6 lists of the data contained in the above.
date = [games.find(class_ = 'fsDate').get_text() for games in games]
time = [games.find(class_ = 'fsTime').get_text() for games in games]
opponent = [games.find(class_ = 'fsAthleticsOpponentName').get_text() for games in games]
home_away = [games.find(class_ = 'fsAthleticsAdvantage').get_text() for games in games]
location = [games.find(class_ = 'fsAthleticsLocations').get_text() for games in games]
result = [games.find(class_ = 'fsAthleticsResult').get_text() for games in games]
score = [games.find(class_ = 'fsAthleticsScore').get_text() for games in games]
# now I turn data into a table using pandas so I can manipulate
results = pd.DataFrame(
    {'Date': date,
     'Time': time,
     'Opponent': opponent,
     'Home/Away': home_away,
     'Location': location,
     'Result': result,
     'Score': score,
     })
print(results)
results.to_excel('results.xls')
Where you write .get_text(), you could use .get_text().strip() to strip off whitespace.
You are storing several columns, which may work well enough; you can combine them with zip(x, y) if need be. But you might find it more convenient to ask BeautifulSoup to find the table, and then find_all('tr') within the table, that is, iterate over the rows.
Consider representing (part of) a table row like this:
row = dict(opponent='vs. Northfield Mt. Hermon',
           advantage='Home',
           score='1-1')
If you have a tr object, a table row, you could easily find those values. With that in hand, you could represent the whole table as a list of rows, with each row being a dict. Then output the rows to a spreadsheet as you've been doing.
Or $ pip install pandas and you can do:
rows = read_html_table_rows()
df = pandas.DataFrame(rows)
df.to_excel('results.xls')
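read_html_table_rows() above is just a placeholder name; a rough sketch of what it could look like for this particular page is below. The URL and the fs* class names are taken from the question's code, so treat the page structure as an assumption, and note it matches all four result classes in one find_all rather than hunting for tr elements. Rows missing one of the fields would need a guard around the corresponding find call.
from bs4 import BeautifulSoup as bs
import requests

def read_html_table_rows():
    # Fetch the schedule block and represent each result row as a dict.
    url = 'https://www.loomischaffee.org/athletics/teams/fall/soccer-boys/varsity'
    soup = bs(requests.get(url).content, 'html.parser')
    week = soup.find(id='fsEl_5138')
    rows = []
    # One pass over all four result classes combines wins, ties, losses and custom results.
    for el in week.find_all(class_=['fsResultWin', 'fsResultTie', 'fsResultLoss', 'fsResultCustom']):
        rows.append({
            'Date': el.find(class_='fsDate').get_text().strip(),
            'Opponent': el.find(class_='fsAthleticsOpponentName').get_text().strip(),
            'Result': el.find(class_='fsAthleticsResult').get_text().strip(),
            'Score': el.find(class_='fsAthleticsScore').get_text().strip(),
        })
    return rows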
I have data from eye-tracking (.edf file - from Eyelink by SR-research). I want to analyse it and get various measures such as fixation, saccade, duration, etc.
Is there an existing package to analyse Eye-Tracking data?
Thanks!
At least for importing the .edf-file into a pandas DF, you can use the following package by Niklas Wilming: https://github.com/nwilming/pyedfread/tree/master/pyedfread
This should already take care of saccades and fixations - have a look at the readme. Once they're in the data frame, you can apply whatever analysis you want to it.
pyeparse seems to be another (yet currently unmaintained as it seems) library that can be used for eyelink data analysis.
Here is a short excerpt from their example:
import numpy as np
import matplotlib.pyplot as plt
import pyeparse as pp
fname = '../pyeparse/tests/data/test_raw.edf'
raw = pp.read_raw(fname)
# visualize initial calibration
raw.plot_calibration(title='5-Point Calibration')
# create heatmap
raw.plot_heatmap(start=3., stop=60.)
EDIT: After I posted my answer I found a nice list compiling lots of potential tools for eyelink edf data analysis: https://github.com/davebraze/FDBeye/wiki/Researcher-Contributed-Eye-Tracking-Tools
Hey, the question seems rather old, but maybe I can reactivate it because I am currently facing the same situation.
To start, I recommend converting your .edf to an .asc file. That way it is easier to read and to get a first impression.
For this there exist many tools, but I used the SR-Research EyeLink Developers Kit (here).
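If the kit's command-line converter is available (it is usually called edf2asc), the conversion can also be scripted from Python. A minimal sketch, where recording.edf is a placeholder file name:
import subprocess

# Assumes the EyeLink Developers Kit's edf2asc converter is on the PATH;
# this writes recording.asc next to the .edf file.
subprocess.run(["edf2asc", "recording.edf"], check=True)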
I don't know your setup, but the EyeLink 1000 itself detects saccades and fixations. In my case the .asc file looks like this:
SFIX L 10350642
10350642 864.3 542.7 2317.0
...
...
10350962 863.2 540.4 2354.0
EFIX L 10350642 10350962 322 863.1 541.2 2339
SSACC L 10350964
10350964 863.4 539.8 2359.0
...
...
10351004 683.4 511.2 2363.0
ESACC L 10350964 10351004 42 863.4 539.8 683.4 511.2 5.79 221
The first number corresponds to the timestamp, the second and third to x-y coordinates and the last is your pupil diameter (what the last numbers after ESACC are, I don't know).
SFIX -> start fixation
EFIX -> end fixation
SSACC -> start saccade
ESACC -> end saccade
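As a quick illustration of how those markers can be picked apart (a sketch based only on the line layout shown above; exact field positions can differ with your tracker settings):
# Parse one EFIX line from the .asc excerpt above; fields are whitespace-separated.
line = "EFIX L 10350642 10350962 322 863.1 541.2 2339"
fields = line.split()
if fields[0] == "EFIX":
    start, end, duration = int(fields[2]), int(fields[3]), int(fields[4])
    mean_x, mean_y = float(fields[5]), float(fields[6])
    print(start, end, duration, mean_x, mean_y)  # 10350642 10350962 322 863.1 541.2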
You can also check out PyGaze, I haven't worked with it, but searching for a toolbox, this one always popped up.
EDIT
I found this toolbox here. It looks cool and works fine with the example data, but sadly it does not work with mine.
EDIT No 2
Revisiting this question after working on my own eye-tracking data, I thought I might share a function I wrote to work with my data:
import csv
import numpy as np
import pandas as pd

def eyedata2pandasframe(directory):
    '''
    This function takes a directory from which it tries to read in ASCII files containing eyetracking data.
    It returns:
    eye_data: a pandas dataframe containing data from fixations AND saccades
    fix_data: a pandas dataframe containing only data from fixations
    sac_data: a pandas dataframe containing only data from saccades
    fixation: a numpy array containing information about fixation onsets and offsets
    saccades: a numpy array containing information about saccade onsets and offsets
    blinks: a numpy array containing information about blink onsets and offsets
    trials: a numpy array containing information about trial onsets
    '''
    eye_data = []
    fix_data = []
    sac_data = []
    data_header = {0: 'TimeStamp', 1: 'X_Coord', 2: 'Y_Coord', 3: 'Diameter'}
    event_header = {0: 'Start', 1: 'End'}
    start_reading = False
    in_blink = False
    in_saccade = False
    fix_timestamps = []
    sac_timestamps = []
    blink_timestamps = []
    trials = []
    sample_rate_info = []
    sample_rate = 0
    # read the file and store the data depending on the messages
    # we have the following structure:
    # a header -- every line starts with a '**'
    # a bunch of messages containing information about calibration/validation and so on, all starting with 'MSG'
    # followed by:
    # START 10350638 LEFT SAMPLES EVENTS
    # PRESCALER 1
    # VPRESCALER 1
    # PUPIL AREA
    # EVENTS GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # SAMPLES GAZE LEFT RATE 500.00 TRACKING CR FILTER 2
    # followed by the actual data:
    # normal data --> [TIMESTAMP]\t [X-Coords]\t [Y-Coords]\t [Diameter]
    # Start of EVENTS [BLINKS FIXATION SACCADES] --> S[EVENTNAME] [EYE] [TIMESTAMP]
    # End of EVENTS --> E[EVENT] [EYE] [TIMESTAMP_START]\t [TIMESTAMP_END]\t [TIME OF EVENT]\t [X-Coords start]\t [Y-Coords start]\t [X_Coords end]\t [Y-Coords end]\t [?]\t [?]
    # Trial messages --> MSG timestamp\t TRIAL [TRIALNUMBER]
    try:
        with open(directory) as f:
            csv_reader = csv.reader(f, delimiter='\t')
            for i, row in enumerate(csv_reader):
                if any('RATE' in item for item in row):
                    sample_rate_info = row
                if any('SYNCTIME' in item for item in row):  # only start reading after this message
                    start_reading = True
                elif any('SFIX' in item for item in row):
                    pass
                    # fix_timestamps[0].append(row)
                elif any('EFIX' in item for item in row):
                    fix_timestamps.append([row[0].split(' ')[4], row[1]])
                    # fix_timestamps[1].append(row)
                elif any('SSACC' in item for item in row):
                    # sac_timestamps[0].append(row)
                    in_saccade = True
                elif any('ESACC' in item for item in row):
                    sac_timestamps.append([row[0].split(' ')[3], row[1]])
                    in_saccade = False
                elif any('SBLINK' in item for item in row):  # stop reading here because the blinks contain NaN
                    # blink_timestamps[0].append(row)
                    in_blink = True
                elif any('EBLINK' in item for item in row):  # start reading again, the blink ended
                    blink_timestamps.append([row[0].split(' ')[2], row[1]])
                    in_blink = False
                elif any('TRIAL' in item for item in row):
                    # the first element is 'MSG', we don't need it; split the second element to separate the timestamp and keep it as an integer
                    trials.append(int(row[1].split(' ')[0]))
                elif start_reading and not in_blink:
                    eye_data.append(row)
                    if in_saccade:
                        sac_data.append(row)
                    else:
                        fix_data.append(row)
        # drop the last data point, because it is the 'END' message
        eye_data.pop(-1)
        sac_data.pop(-1)
        fix_data.pop(-1)
        # convert every item in the lists into a float, subtract the start of the first trial to set the start of the first video to t0=0
        # then divide by 1000 to convert from milliseconds to seconds
        for row in eye_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in sac_data:
            for i, item in enumerate(row):
                row[i] = float(item)
        for row in fix_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item) - trials[0]) / 1000
        for row in sac_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item) - trials[0]) / 1000
        for row in blink_timestamps:
            for i, item in enumerate(row):
                row[i] = (float(item) - trials[0]) / 1000
        sample_rate = float(sample_rate_info[4])
        # convert into pandas DataFrames for a better overview
        eye_data = pd.DataFrame(eye_data)
        fix_data = pd.DataFrame(fix_data)
        sac_data = pd.DataFrame(sac_data)
        fix_timestamps = pd.DataFrame(fix_timestamps)
        sac_timestamps = pd.DataFrame(sac_timestamps)
        trials = np.array(trials)
        blink_timestamps = pd.DataFrame(blink_timestamps)
        # rename headers for an even better overview
        eye_data = eye_data.rename(columns=data_header)
        fix_data = fix_data.rename(columns=data_header)
        sac_data = sac_data.rename(columns=data_header)
        fix_timestamps = fix_timestamps.rename(columns=event_header)
        sac_timestamps = sac_timestamps.rename(columns=event_header)
        blink_timestamps = blink_timestamps.rename(columns=event_header)
        # subtract the first timestamp of trials to set the start of the first video to t0=0
        eye_data.TimeStamp -= trials[0]
        fix_data.TimeStamp -= trials[0]
        sac_data.TimeStamp -= trials[0]
        trials -= trials[0]
        trials = trials / 1000  # does not work with trials /= 1000
        # divide TimeStamp to get time in seconds
        eye_data.TimeStamp /= 1000
        fix_data.TimeStamp /= 1000
        sac_data.TimeStamp /= 1000
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
    except:
        print('Could not read ' + str(directory) + ' properly!!! Returned empty data')
        return eye_data, fix_data, sac_data, fix_timestamps, sac_timestamps, blink_timestamps, trials, sample_rate
Hope it helps you guys. Some parts of the code you may need to change, like the index at which to split the strings to get the crucial information about event on-/offsets. Or maybe you don't want to convert your timestamps into seconds or don't want to set the onset of your first trial to 0. That is up to you.
Additionally, in my data we sent a message ('SYNCTIME') to mark when we started measuring, and I had only ONE condition in my experiment, so there is only one 'TRIAL' message.
Cheers
I'm trying to clean some data in a pandas df, and I want the 'volume' column to go from a float to an int.
EDIT: The main issue was that the dtype of the 'volume' column I was looking at was actually str. So it first needed to be converted to float before being changed to int.
I deleted the two other solutions I was considering and left the one I used. The top snippet is the one with the errors, and the bottom one is the solution.
import pandas as pd
import numpy as np
#Call the df
t_df = pd.DataFrame(client.get_info())
#isolate only the 'symbol' column in t_df
tickers = t_df.loc[:, ['symbol']]
def tick_data(tickers):
    for i in tickers:
        tick_df = pd.DataFrame(client.get_ticker())
        tick = tick_df.loc[:, ['symbol', 'volume']]
        tick.iloc[:,['volume']].astype(int)
        if tick['volume'].dtype != np.number:
            print('yes')
        else:
            print('no')
        return tick
Below is the revised code:
import pandas as pd
#Call the df
def ticker():
    t_df = pd.DataFrame(client.get_info())
    #isolate only the 'symbol' column in t_df
    tickers = t_df.loc[:, ['symbol']]
    for i in tickers:
        #pulls out market data for each symbol
        tickers = pd.DataFrame(client.get_ticker())
        #isolates the symbol and volume
        tickers = tickers.loc[:, ['symbol', 'volume']]
        #floats volume
        tickers['volume'] = tickers.loc[:, ['volume']].astype(float)
        #volume to int
        tickers['volume'] = tickers.loc[:, ['volume']].astype(int)
        #deletes all symbols > 20,000 in volume, returns only symbol
        tickers = tickers.loc[tickers['volume'] >= 20000, 'symbol']
        return tickers
You have a few issues here.
In your first example, iloc only accepts integer locations for the rows and columns in the DataFrame, which is generating your error. I.e.
tick.iloc[:,['volume']].astype(int)
doesn't work. If you want label-based indexing, use .loc:
tick.loc[:,['volume']].astype(int)
Alternately, use bracket-based indexing, which allows you to take a whole column directly without using slice syntax (:) on the rows:
tick['volume'].astype(int)
Next, astype(int) returns a new value; it does not modify in place. So what you want is
tick['volume'] = tick['volume'].astype(int)
As for your dtype check: you don't want to compare against np.number with == (or !=), and you don't want to use is either, since those only match np.number itself and not a subclass like np.int64. Use np.issubdtype, or pd.api.types.is_numeric_dtype, i.e.:
if np.issubdtype(tick['volume'].dtype, np.number):
or:
if pd.api.types.is_numeric_dtype(tick['volume'].dtype):
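Putting those pieces together, the body of the loop might look like this (a sketch; client is the same object used in the question and is assumed to exist):
import numpy as np
import pandas as pd

tick_df = pd.DataFrame(client.get_ticker())                  # client comes from the question's setup
tick = tick_df.loc[:, ['symbol', 'volume']]
tick['volume'] = tick['volume'].astype(float).astype(int)    # str -> float -> int, assigned back
if pd.api.types.is_numeric_dtype(tick['volume']):
    print('yes')
else:
    print('no')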
I am trying to yield rows one by one from a pandas dataframe but I get an error. The dataframe holds stock price data, including daily open, close, high, low prices and volume.
The following is my code. This class gets data from a MySQL database.
class HistoricMySQLDataHandler(DataHandler):

    def __init__(self, events, symbol_list):
        """
        Initialises the historic data handler by requesting
        a list of symbols.

        Parameters:
        events - The Event Queue.
        symbol_list - A list of symbol strings.
        """
        self.events = events
        self.symbol_list = symbol_list
        self.symbol_data = {}
        self.latest_symbol_data = {}
        self.continue_backtest = True
        self._connect_MySQL()

    def _connect_MySQL(self): #get stock price for symbol s
        db_host = 'localhost'
        db_user = 'sec_user'
        db_pass = 'XXX'
        db_name = 'securities_master'
        con = mdb.connect(db_host, db_user, db_pass, db_name)
        for s in self.symbol_list:
            sql = "SELECT * FROM daily_price where symbol = s"
            self.symbol_data[s] = pd.read_sql(sql, con=con, index_col='price_date')

    def _get_new_bar(self, symbol):
        """
        Returns the latest bar from the data feed as a tuple of
        (symbol, datetime, open, low, high, close, volume).
        """
        for row in self.symbol_data[symbol].itertuples():
            yield tuple(symbol, datetime.datetime.strptime(row[0],'%Y-%m-%d %H:%M:%S'),
                        row[15], row[17], row[16], row[18], row[20])

    def update_bars(self):
        """
        Pushes the latest bar to the latest_symbol_data structure
        for all symbols in the symbol list.
        """
        for s in self.symbol_list:
            try:
                bar = self._get_new_bar(s).__next__()
            except StopIteration:
                self.continue_backtest = False
In the main function:
# Declare the components with respective parameters
symbol_list=["GOOG"]
events=queue.Queue()
bars = HistoricMySQLDataHandler(events,symbol_list)
while True:
    # Update the bars (specific backtest code, as opposed to live trading)
    if bars.continue_backtest == True:
        bars.update_bars()
    else:
        break
    time.sleep(1)
Data example:
symbol_data["GOOG"] =
price_date id exchange_id ticker instrument name ... high_price low_price close_price adj_close_price volume
2014-03-27 29 None GOOG stock Alphabet Inc Class C ... 568.0000 552.9200 558.46 558.46 13100
The update_bars function will call _get_new_bar to move to next row (next day price)
My objective is to get the stock price day by day (iterate over the rows of the dataframe), but self.symbol_data[s] in _connect_MySQL is a dataframe while in _get_new_bar it is a generator, hence I get this error:
AttributeError: 'generator' object has no attribute 'itertuples'
Anyone have any ideas?
I am using python 3.6. Thanks
self.symbol_data is a dict, and symbol is a string key used to get the dataframe. The data is stock price data. For example, self.symbol_data["GOOG"] returns a dataframe with Google's daily stock price information indexed by date; each row includes open, low, high, close price and volume. My goal is to iterate over this price data day by day using yield.
_connect_MySQL will get data from the database
In this example, s = "GOOG" in the function
I found the bug.
My code in another place changed the dataframe into a generator.
A stupid mistake lol.
I didn't post this line in the question, but this line changed the datatype:
# Reindex the dataframes
for s in self.symbol_list:
    self.symbol_data[s] = self.symbol_data[s].reindex(index=comb_index, method='pad').iterrows()
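In other words, dropping the trailing .iterrows() keeps self.symbol_data[s] as a DataFrame, so _get_new_bar can still call .itertuples() on it. A sketch of the corrected line, using the same names as in the question:
# Reindex the dataframes but keep them as DataFrames (no .iterrows())
for s in self.symbol_list:
    self.symbol_data[s] = self.symbol_data[s].reindex(index=comb_index, method='pad')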
I am attempting to modify this example with county data for Michigan. In short, it's working, but it seems to be adding some extra shapes here and there in the process of drawing the counties. I'm guessing that in some instances (where there are counties with islands), the island part needs to be listed as a separate "county", but I'm not sure about the other case, such as with Wayne county in the lower right part of the state.
Here's a picture of what I currently have:
Here's what I did so far:
Get county data from Bokeh's sample county data, just to get the state abbreviation per state number (my second, main data source only has state numbers). For this example, I'll simplify it by just filtering for state number 26.
Get state coordinates ('500k' file) by county from the U.S. Census site.
Use the following code to generate an 'interactive' map of Michigan.
Note: To pip install shapefile (really pyshp), I think I had to download the .whl file from here and then do pip install [path to .whl file].
import pandas as pd
import numpy as np
import shapefile
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Viridis6
from bokeh.plotting import figure, show, output_notebook
shpfile=r'Path\500K_US_Counties\cb_2015_us_county_500k.shp'
sf = shapefile.Reader(shpfile)
shapes = sf.shapes()
#Here are the rows from the shape file (plus lat/long coordinates)
rows=[]
lenrow=[]
for i,j in zip(sf.shapeRecords(),sf.shapes()):
    rows.append(i.record+[j.points])
    if len(i.record+[j.points])!=10:
        print("Found record with irregular number of columns")
fields1=sf.fields[1:] #Ignore first field as it is not used (maybe it's a meta field?)
fields=[seq[0] for seq in fields1]+['Long_Lat'] #Take the first element in each tuple of the list
c=pd.DataFrame(rows,columns=fields)
try:
    c['STATEFP']=c['STATEFP'].astype(int)
except:
    pass
#cns=pd.read_csv(r'Path\US_Counties.csv')
#cns=cns[['State Abbr.','STATE num']]
#cns=cns.drop_duplicates('State Abbr.',keep='first')
#c=pd.merge(c,cns,how='left',left_on='STATEFP',right_on='STATE num')
c['Lat']=c['Long_Lat'].apply(lambda x: [e[0] for e in x])
c['Long']=c['Long_Lat'].apply(lambda x: [e[1] for e in x])
#c=c.loc[c['State Abbr.']=='MI']
c=c.loc[c['STATEFP']==26]
#latitude=x, longitude=y
county_xs = c['Lat']
county_ys = c['Long']
county_names = c['NAME']
county_colors = [Viridis6[np.random.randint(1,6, size=1).tolist()[0]] for l in aland]
randns=np.random.randint(1,6, size=1).tolist()[0]
#county_colors = [Viridis6[e] for e in randns]
#county_colors = 'b'
source = ColumnDataSource(data=dict(
    x=county_xs,
    y=county_ys,
    color=county_colors,
    name=county_names,
    #rate=county_rates,
))
output_notebook()
TOOLS="pan,wheel_zoom,box_zoom,reset,hover,save"
p = figure(title="Title", tools=TOOLS,
x_axis_location=None, y_axis_location=None)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
fill_color='color', fill_alpha=0.7,
line_color="white", line_width=0.5)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
    ("Name", "@name"),
    #("Unemployment rate", "@rate%"),
    ("(Long, Lat)", "($x, $y)"),
]
show(p)
I'm looking for a way to avoid the extra lines and shapes.
Thanks in advance!
I have a solution to this problem, and I think I might even know why it is correct. First, let me quote Bryan Van de Ven from a Google Groups Bokeh discussion:
there is no built-in support for dealing with shapefiles. You will have to convert the data to the simple format that Bokeh understands. (As an aside: it would be great to have a contribution that made dealing with various GIS formats easier).
The format that Bokeh expects for patches is a "list of lists" of points. So something like:
xs = [ [patch0 x-coords], [patch1 x-coords], ... ]
ys = [ [patch0 y-coords], [patch1 y-coords], ... ]
Note that if a patch is comprised of multiple polygons, this is currently expressed by putting NaN values in the sublists. So, the task is basically to convert whatever form of polygon data you have to this format, and then Bokeh can display it.
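As a concrete illustration of that NaN convention (made-up coordinates, not census data): a single patch made of a mainland ring plus an island ring is one flat sublist with a NaN separating the two rings.
from math import nan

# one patch: mainland square, NaN separator, small island triangle
xs = [[0.0, 1.0, 1.0, 0.0, nan, 2.0, 2.5, 2.0]]
ys = [[0.0, 0.0, 1.0, 1.0, nan, 0.0, 0.5, 1.0]]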
So it seems like somehow you are ignoring NaNs or otherwise not handling multiple polygons properly. Here is some code that will download US census data, unzip it, read it properly for Bokeh, and make a data frame of lat, long, state, and county.
def get_map_data(shape_data_file, local_file_path):
    url = "http://www2.census.gov/geo/tiger/GENZ2015/shp/" + \
          shape_data_file + ".zip"
    zfile = local_file_path + shape_data_file + ".zip"
    sfile = local_file_path + shape_data_file + ".shp"
    dfile = local_file_path + shape_data_file + ".dbf"
    if not os.path.exists(zfile):
        print("Getting file: ", url)
        response = requests.get(url)
        with open(zfile, "wb") as code:
            code.write(response.content)
    if not os.path.exists(sfile):
        uz_cmd = 'unzip ' + zfile + " -d " + local_file_path
        print("Executing command: " + uz_cmd)
        os.system(uz_cmd)
    shp = open(sfile, "rb")
    dbf = open(dfile, "rb")
    sf = shapefile.Reader(shp=shp, dbf=dbf)

    lats = []
    lons = []
    ct_name = []
    st_id = []
    for shprec in sf.shapeRecords():
        st_id.append(int(shprec.record[0]))
        ct_name.append(shprec.record[5])
        lat, lon = map(list, zip(*shprec.shape.points))
        indices = shprec.shape.parts.tolist()
        lat = [lat[i:j] + [float('NaN')] for i, j in zip(indices, indices[1:]+[None])]
        lon = [lon[i:j] + [float('NaN')] for i, j in zip(indices, indices[1:]+[None])]
        lat = list(itertools.chain.from_iterable(lat))
        lon = list(itertools.chain.from_iterable(lon))
        lats.append(lat)
        lons.append(lon)

    map_data = pd.DataFrame({'x': lats, 'y': lons, 'state': st_id, 'county_name': ct_name})
    return map_data
The inputs to this function are the local directory where you want to download the map data to and the name of the shape file. I know there are at least two available maps at the url in the function above that you could use:
map_low_res = "cb_2015_us_county_20m"
map_high_res = "cb_2015_us_county_500k"
If the US Census changes their url, which they certainly will one day, then you will need to change the input file name and the url variable. So, you can call the function above like this:
map_output = get_map_data(map_low_res, ".")
Then you could plot it just as the code in the original question does. Add a color data column first ("county_colors" in the original question), and then set it to the source like this:
source = ColumnDataSource(map_output)
To make this all work you will need to import libraries such as requests, os, itertools, shapefile, bokeh.models.ColumnDataSource, etc...
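For reference, the imports that the function and plotting code above rely on are roughly:
import itertools
import os

import pandas as pd
import requests
import shapefile  # pyshp
from bokeh.models import ColumnDataSource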
One solution:
Use the 1:20,000,000 shape file instead of the 1:500,000 file.
It loses some detail around the shape of each county but does not have any extra shapes (and just a couple of extra lines).