I am attempting to modify this example with county data for Michigan. In short, it's working, but it seems to be adding some extra shapes here and there in the process of drawing the counties. I'm guessing that in some instances (where there are counties with islands), the island part needs to be listed as a separate "county", but I'm not sure what explains the other cases, such as Wayne County in the lower right part of the state.
Here's a picture of what I currently have:
Here's what I did so far:
Get Bokeh's sample county data just to get the state abbreviation per state number (my second, main data source only has state numbers). For this example, I'll simplify it by just filtering for state number 26.
Get county coordinates (the '500k' file) from the U.S. Census site.
Use the following code to generate an 'interactive' map of Michigan.
Note: To install shapefile (really pyshp) with pip, I think I had to download the .whl file from here and then run pip install [path to .whl file].
import pandas as pd
import numpy as np
import shapefile
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Viridis6
from bokeh.plotting import figure, show, output_notebook
shpfile=r'Path\500K_US_Counties\cb_2015_us_county_500k.shp'
sf = shapefile.Reader(shpfile)
shapes = sf.shapes()
#Here are the rows from the shape file (plus lat/long coordinates)
rows=[]
lenrow=[]
for i, j in zip(sf.shapeRecords(), sf.shapes()):
    rows.append(i.record + [j.points])
    if len(i.record + [j.points]) != 10:
        print("Found record with irregular number of columns")
fields1=sf.fields[1:] #Ignore first field as it is not used (maybe it's a meta field?)
fields=[seq[0] for seq in fields1]+['Long_Lat']#Take the first element in each tuple of the list
c=pd.DataFrame(rows,columns=fields)
try:
    c['STATEFP'] = c['STATEFP'].astype(int)
except:
    pass
#cns=pd.read_csv(r'Path\US_Counties.csv')
#cns=cns[['State Abbr.','STATE num']]
#cns=cns.drop_duplicates('State Abbr.',keep='first')
#c=pd.merge(c,cns,how='left',left_on='STATEFP',right_on='STATE num')
c['Lat']=c['Long_Lat'].apply(lambda x: [e[0] for e in x])
c['Long']=c['Long_Lat'].apply(lambda x: [e[1] for e in x])
#c=c.loc[c['State Abbr.']=='MI']
c=c.loc[c['STATEFP']==26]
#latitude=x, longitude=y
county_xs = c['Lat']
county_ys = c['Long']
county_names = c['NAME']
county_colors = [Viridis6[np.random.randint(1, 6)] for _ in range(len(c))]  # one random color per county
randns = np.random.randint(1, 6, size=1).tolist()[0]
#county_colors = [Viridis6[e] for e in randns]
#county_colors = 'b'
source = ColumnDataSource(data=dict(
    x=county_xs,
    y=county_ys,
    color=county_colors,
    name=county_names,
    #rate=county_rates,
))
output_notebook()
TOOLS="pan,wheel_zoom,box_zoom,reset,hover,save"
p = figure(title="Title", tools=TOOLS,
           x_axis_location=None, y_axis_location=None)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
          fill_color='color', fill_alpha=0.7,
          line_color="white", line_width=0.5)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
    ("Name", "@name"),
    #("Unemployment rate", "@rate%"),
    ("(Long, Lat)", "($x, $y)"),
]
show(p)
I'm looking for a way to avoid the extra lines and shapes.
Thanks in advance!
I have a solution to this problem, and I think I might even know why it is correct. First, let me quote Bryan Van de Ven from a Google Groups Bokeh discussion:
there is no built-in support for dealing with shapefiles. You will have to convert the data to the simple format that Bokeh understands. (As an aside: it would be great to have a contribution that made dealing with various GIS formats easier).
The format that Bokeh expects for patches is a "list of lists" of points. So something like:
xs = [ [patch0 x-coords], [patch1 x-coords], ... ]
ys = [ [patch0 y-coords], [patch1 y-coords], ... ]
Note that if a patch is comprised of multiple polygons, this is currently expressed by putting NaN values in the sublists. So, the task is basically to convert whatever form of polygon data you have to this format, and then Bokeh can display it.
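For illustration (my example, not part of the quote), a single patch made of a square plus a separate triangular island would be encoded with a NaN splitting the two polygons inside each sublist:
nan = float('nan')
# one patch (e.g. one county) made of two polygons
xs = [[0, 1, 1, 0, nan, 2.0, 2.5, 2.2]]
ys = [[0, 0, 1, 1, nan, 0.0, 0.5, 1.0]]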
So it seems like somehow you are ignoring NaNs or otherwise not handling multiple polygons properly. Here is some code that will download US census data, unzip it, read it properly for Bokeh, and make a data frame of lat, long, state, and county.
import os
import itertools
import requests
import pandas as pd
import shapefile

def get_map_data(shape_data_file, local_file_path):
    url = "http://www2.census.gov/geo/tiger/GENZ2015/shp/" + \
          shape_data_file + ".zip"
    zfile = local_file_path + shape_data_file + ".zip"
    sfile = local_file_path + shape_data_file + ".shp"
    dfile = local_file_path + shape_data_file + ".dbf"
    if not os.path.exists(zfile):
        print("Getting file: ", url)
        response = requests.get(url)
        with open(zfile, "wb") as code:
            code.write(response.content)
    if not os.path.exists(sfile):
        uz_cmd = 'unzip ' + zfile + " -d " + local_file_path
        print("Executing command: " + uz_cmd)
        os.system(uz_cmd)
    shp = open(sfile, "rb")
    dbf = open(dfile, "rb")
    sf = shapefile.Reader(shp=shp, dbf=dbf)
    lats = []
    lons = []
    ct_name = []
    st_id = []
    for shprec in sf.shapeRecords():
        st_id.append(int(shprec.record[0]))
        ct_name.append(shprec.record[5])
        # shapefile points are (lon, lat) pairs, so these names are swapped,
        # but the 'x'/'y' columns below still come out right for plotting
        lat, lon = map(list, zip(*shprec.shape.points))
        # shape.parts holds the start index of each polygon ("part");
        # append a NaN after each part so Bokeh draws the polygons separately
        indices = shprec.shape.parts.tolist()
        lat = [lat[i:j] + [float('NaN')] for i, j in zip(indices, indices[1:] + [None])]
        lon = [lon[i:j] + [float('NaN')] for i, j in zip(indices, indices[1:] + [None])]
        lat = list(itertools.chain.from_iterable(lat))
        lon = list(itertools.chain.from_iterable(lon))
        lats.append(lat)
        lons.append(lon)
    map_data = pd.DataFrame({'x': lats, 'y': lons, 'state': st_id, 'county_name': ct_name})
    return map_data
The inputs to this function are the name of the shape file and the local directory to download the map data into. I know of at least two maps available from the URL in the function above that you could use:
map_low_res = "cb_2015_us_county_20m"
map_high_res = "cb_2015_us_county_500k"
If the US Census changes their URL, which they certainly will one day, then you will need to change the input file name and the url variable. You can then call the function above:
map_output = get_map_data(map_low_res, ".")
Then you can plot it just as the code in the original question does: add a color data column first ("county_colors" in the original question), and then build the source like this:
source = ColumnDataSource(map_output)
To make this all work you will need to import libraries such as requests, os, itertools, shapefile, bokeh.models.ColumnDataSource, etc...
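Putting the pieces together, here is a minimal plotting sketch assembled from the steps above (my sketch, not verbatim from the answer); it assumes the map_output frame from the call above, filters to Michigan (state FIPS 26), and uses random Viridis6 colors as placeholders:
import numpy as np
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Viridis6
from bokeh.plotting import figure, show

mi = map_output.loc[map_output['state'] == 26].copy()
mi['color'] = [Viridis6[np.random.randint(0, 6)] for _ in range(len(mi))]

source = ColumnDataSource(mi)
p = figure(title="Michigan counties", tools="pan,wheel_zoom,reset,hover,save",
           x_axis_location=None, y_axis_location=None)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
          fill_color='color', fill_alpha=0.7,
          line_color="white", line_width=0.5)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [("Name", "@county_name"), ("(Long, Lat)", "($x, $y)")]
show(p)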
One solution:
Use the 1:20,000,000 shape file instead of the 1:500,000 file.
It loses some detail around the shape of each county but does not have any extra shapes (and just a couple of extra lines).
Related
I am teaching myself geopy. It seems simple and straightforward yet my code isn't working. It is supposed to:
read in a list of address fields from a CSV into a pandas df
concatenate the address fields into a single column formatted for geopy
make a list from the new column
feed each item in the list into geopy via a for loop and return the coordinates
add the coordinates to the original df and export it to a CSV
#setup
from geopy.geocoders import Nominatim
import pandas as pd
#create the df
df = pd.DataFrame(pd.read_csv('properties to geocode.csv'))
df['Location'] = df['Street Address'].astype(str)+","+df['City'].astype(str)+","+df['State'].astype(str)
#create the geolocator object
geolocator = Nominatim(timeout=1, user_agent = "My_Agent")
#create the locations list
locations = df['Location']
#empty lists for later columns
lats = []
longs = []
#process the location list
for item in locations:
    location = geolocator.geocode('item')
    lat = location.latitude
    long = location.longitude
    lats.append(lat)
    longs.append(long)
#add the lists to the df
df.insert(5,'Latitude',lats)
df.insert(6,'Longitude',longs)
#export
df.to_csv('geocoded-properties2.csv',index=False)
Something is not working because it returns the same latitude and longitude values for every row, instead of unique coordinates for each.
I have found working code using .apply elsewhere but am interested in learning what I did wrong. Any thoughts?
Your code does not contain sample data, so I have used some sample data available from public APIs to demonstrate.
Your code passes a literal string to geolocator.geocode() - it needs to be the address associated with the row.
I have provided examples using pandas apply, a list comprehension, and a for loop equivalent of the comprehension.
The results show all three approaches are equivalent.
from geopy.geocoders import Nominatim
import requests
import pandas as pd

searchendpoint = "https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations"
# get all healthcare facilities in Herefordshire
dfhc = pd.concat([pd.json_normalize(requests
                                    .get(searchendpoint, params={"PostCode": f"HR{i}", "Status": "Active"})
                                    .json()["Organisations"])
                  for i in range(1, 10)]).reset_index(drop=True)

def gps(url, geolocator=None):
    # get the address and construct a space-delimited string
    a = " ".join(str(x) for x in requests.get(url).json()["Organisation"]["GeoLoc"]["Location"].values())
    lonlat = geolocator.geocode(a)
    if lonlat is not None:
        return lonlat[1]
    else:
        return (0, 0)

# work with just GPs
dfgp = dfhc.loc[dfhc.PrimaryRoleId.isin(["RO180", "RO96"])].head(5).copy()
geolocator = Nominatim(timeout=1, user_agent="My_Agent")
# pandas apply
dfgp["lonlat_apply"] = dfgp["OrgLink"].apply(gps, geolocator=geolocator)
# list comprehension
lonlat = [gps(url, geolocator=geolocator) for url in dfgp["OrgLink"].values]
dfgp["lonlat_listcomp"] = lonlat
# old school loop
lonlat = []
for item in dfgp["OrgLink"].values:
    lonlat.append(gps(item, geolocator=geolocator))
dfgp["lonlat_oldschool"] = lonlat
|    | Name | OrgId | Status | OrgRecordClass | PostCode | LastChangeDate | PrimaryRoleId | PrimaryRoleDescription | OrgLink | lonlat_apply | lonlat_listcomp | lonlat_oldschool |
|----|------|-------|--------|----------------|----------|----------------|---------------|------------------------|---------|--------------|-----------------|------------------|
| 7  | AYLESTONE HILL SURGERY | M81026002 | Active | RC2 | HR1 1HR | 2020-03-19 | RO96 | BRANCH SURGERY | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/M81026002 | (52.0612429, -2.7026047) | (52.0612429, -2.7026047) | (52.0612429, -2.7026047) |
| 9  | BARRS COURT SCHOOL | 5CN91 | Active | RC2 | HR1 1EQ | 2021-01-28 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN91 | (52.0619209, -2.7086105) | (52.0619209, -2.7086105) | (52.0619209, -2.7086105) |
| 13 | BODENHAM SURGERY | 5CN24 | Active | RC2 | HR1 3JU | 2013-05-08 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN24 | (52.152405, -2.6671942) | (52.152405, -2.6671942) | (52.152405, -2.6671942) |
| 22 | BELMONT ABBEY | 5CN16 | Active | RC2 | HR2 9RP | 2013-05-08 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN16 | (52.0423056, -2.7648698) | (52.0423056, -2.7648698) | (52.0423056, -2.7648698) |
| 24 | BELMONT HEALTH CENTRE | 5CN22 | Active | RC2 | HR2 7XT | 2013-05-08 | RO180 | PRIMARY CARE TRUST SITE | https://directory.spineservices.nhs.uk/ORD/2-0-0/organisations/5CN22 | (52.0407746, -2.739788) | (52.0407746, -2.739788) | (52.0407746, -2.739788) |
I have created a 2D array that when plotted looks like this:
Basically, it is an array of size [101, 365], with values ranging from 0.0 to 1.2, and it contains NaNs.
I am writing it to a netCDF4 file in this manner:
nc_out = Dataset(nc_out_file, 'w', format='NETCDF4')
#Create Dimemsions
y = nc_out.createDimension('y',101)
x = nc_out.createDimension('x',365)
#Create Variables
latitudes = nc_out.createVariable('latitude', np.float32, ('y'))
days = nc_out.createVariable('days', np.float32,('x'))
on2_climo = nc_out.createVariable('on2_climo', np.float32, ('x', 'y'))
#Fill Variables
latitudes[:] = lat
days[:] = day
on2_climo[:] = data
nc_out.close()
However, when I plot the data I've saved in the file it looks nothing like the original data:
What is going on here? The faint diagonal lines make me think there is something weird going on here...
Is there a better way to code a netCDF4 file? I'd share a copy of the original data with you... but I can't seem to get a faithful version of it saved...
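One guess, since the original data isn't available: on2_climo is declared with dimensions ('x', 'y'), i.e. shape (365, 101), while the array is described as [101, 365]; a transposed or rewrapped write like that can produce exactly this kind of faint diagonal striping. A minimal sketch of the dimension-order fix, with random stand-in data and a hypothetical file name:
import numpy as np
from netCDF4 import Dataset

# random stand-in for the original array, shape (101, 365)
data = np.random.rand(101, 365).astype(np.float32)

nc_out = Dataset('on2_climo_fixed.nc', 'w', format='NETCDF4')
nc_out.createDimension('y', 101)
nc_out.createDimension('x', 365)
on2_climo = nc_out.createVariable('on2_climo', np.float32, ('y', 'x'))
on2_climo[:] = data  # dimension order now matches data.shape
# (alternatively, keep ('x', 'y') and write data.T)
nc_out.close()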
I want to load in 2 string arrays from a MATLAB structure file, zip/concatenate them together, and then save into a netCDF file.
I have the following data MATLAB structure file:
data.string1 = ['a','b','c']
data.string2 = ['d','e','f']
In Python, I want to concatenate / zip them into a 2x3 matrix, and save them as variable 'text' in a netCDF file 'file.nc' with dimensions 'dim1' = 2, 'dim2' = 3.
This is what I have so far:
f = h5py.File('data.mat','r')
data = {"string1":np.str(f.get('string1')), "string2":np.array(f.get('string2'))}
dataset = Dataset('file.nc', 'w', format='NETCDF4_CLASSIC')
dim1 = dataset.createDimension('dim1', 2)
dim2 = dataset.createDimension('dim2', 3)
My problem is that strings 1 and 2 are classed as the following when loaded into Python and I am not sure how to proceed:
HDF5 dataset "time_bounds_1": shape (1, 6), type "
How can I proceed to concatenate strings1, and 2, and save 'text' as a variable in the netCDF file with dimensions dim1 and dim2?
I also have the following code that could be adapted later to help:
text = dataset.createVariable('text', np.str, ('dim1','dim2')) # this does not work - error with np.str!
text[:,:] = np.asmatrix(text) # not sure this will work with strings
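For what it's worth, plain NETCDF4 files (not NETCDF4_CLASSIC) support variable-length string variables, which may be closer to the original goal; a minimal sketch with hard-coded stand-ins for the MATLAB arrays:
import numpy as np
from netCDF4 import Dataset

# hard-coded stand-ins for the two MATLAB string arrays
string1 = ['a', 'b', 'c']
string2 = ['d', 'e', 'f']
text_data = np.array([string1, string2])  # shape (2, 3)

ds = Dataset('file.nc', 'w', format='NETCDF4')  # plain NETCDF4, not CLASSIC
ds.createDimension('dim1', 2)
ds.createDimension('dim2', 3)
# passing the Python str type creates a variable-length string variable
text = ds.createVariable('text', str, ('dim1', 'dim2'))
for i in range(2):
    for j in range(3):
        text[i, j] = str(text_data[i, j])
ds.close()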
OK, so I decided to go down the route of creating the time lists in Python and then saving them as floating-point numbers in the netCDF file rather than strings. It is not what I originally wanted, but I have settled for this.
For those who may be interested, here is the Code (with different dimensions and variables to my simple example above):
import numpy as np
from netCDF4 import Dataset, num2date, date2num
import datetime
# ----------------------
nv = 2
time = 365
# ---------------------------------------------------
base = datetime.datetime(1950, 1, 1, 0, 0, 1)
numdays = 365
timevalue1 = [base + datetime.timedelta(days=x) for x in range(numdays)]
base = datetime.datetime(1950, 1, 1, 23, 59, 59)
timevalue2 = [base + datetime.timedelta(days=x) for x in range(numdays)]
#timevalue = datetime.datetime(2014,4,11,23,59)
time_unit_out = "days since 1950-01-01 00:00:00 UTC"
# ---------------------------------------------------
nc_out = Dataset('test.nc', 'w', format='NETCDF4')
time = nc_out.createDimension('time', time)
nv = nc_out.createDimension('nv', nv)
times = nc_out.createVariable('time', np.float64, ('time', 'nv'))
times.setncattr('units', time_unit_out)  # CF convention uses 'units'
times[:, 0] = date2num(timevalue1, time_unit_out)
times[:, 1] = date2num(timevalue2, time_unit_out)
nc_out.close()
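A quick sanity check of the round trip (my addition, reusing time_unit_out from the code above):
nc_in = Dataset('test.nc')
vals = nc_in.variables['time'][:3, 0]
print(num2date(vals, time_unit_out))  # should print the first three 1950-01-01-based dates
nc_in.close()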
I would like to improve the below code to split a list of values into two sub lists, which have been randomised and sorted. The below code works, but I'm sure there is a better/cleaner way to do it.
import random
data = list(range(1, 61))
random.shuffle(data)
Intervention = data[:30]
Control = data[30:]
Intervention.sort()
Control.sort()
f = open('Randomised_Groups.txt', 'w')
f.write('Intervention Group = ' + str(Intervention) + '\n' + 'Control Group = ' + str(Control))
f.close()
The expected output is:
Intervention = [1,3,7,9]
Control = [2,4,5,6,8,10]
I think your code is short and clean already. Some changes you can make:
Call sorted() when you slice it.
Intervention = sorted(data[:30])
You can also define both Intervention and Control on one line:
Intervention, Control = data[:30], data[30:]
I would replace the 30 with a variable:
half = len(data)//2
It is safer to open a file with with: the file is closed automatically when the with block ends.
with open('Randomised_Groups.txt', 'w') as f:
    ...
With the use of f-strings you can make the write statement shorter:
f.write(f'Intervention Group = {Intervention} \nControl Group = {Control}')
All combined:
import random
data = list(range(1, 61))
random.shuffle(data)
half = len(data)//2
Intervention, Control = sorted(data[:half]), sorted(data[half:])
with open('Randomised_Groups.txt', 'w') as f:
    f.write(f'Intervention Group = {Intervention}\nControl Group = {Control}')
Something like this might be what you want:
import random
my_rng = [random.randint(0,1) for i in range(60)]
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
print(Control)
The idea is to create 60 random 1s or 0s to use as indicators for which list to put each number in. This will only work if you do not need the two lists to be the same length. To get the same length would require changing how my_rng is created in this example.
I have tinkered a bit further and got the lists of the same length:
import random
my_rng = [0 for i in range(30)]
my_rng.extend([1 for i in range(30)])
random.shuffle(my_rng)
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
Here, instead of adding randomly 1 or 0 to my_rng I get a list of 30 0s and 30 1s to shuffle, then continue like before.
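As an aside (an editorial addition, not from the answers above), random.sample can produce the same equal-size split without building an indicator list:
import random

data = list(range(1, 61))
# sample half of the values for one group; the rest form the other group
Intervention = sorted(random.sample(data, len(data) // 2))
Control = sorted(set(data) - set(Intervention))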
Here is another, more dynamic solution using built-in random functionality; it only creates the lists needed (no extra memory) and works with lists containing any type of object (provided the objects can be sorted):
import random
def convert_to_random_list(data, num_list):
    """
    Takes in the data as one large list and converts it into
    [num_list] random sorted lists.
    """
    result_lists = [list() for _ in range(num_list)]  # one empty list per group
    for x in data:
        # Using randint we pick which list to insert into
        result_lists[random.randint(0, num_list - 1)].append(x)
    # You could use list comprehension here with `sorted(...)` but it would take a little extra memory.
    for _list in result_lists:
        _list.sort()
    return result_lists
Can be tested with:
data = list(range(1, 61))
random.shuffle(data)
temp = convert_to_random_list(data, 3)
print(temp)
I'm dealing with a text data file that has 8 columns, each listing temperature, time, damping coefficients, etc. I need to take lines of data only in the temperature range of 0.320 to 0.322.
Here is a sample line of my data (there are thousands of lines):
time temp acq. freq. amplitude damping etc....
6.28444 0.32060 413.00000 117.39371 48.65073 286.00159
The only columns I care about are time, temp, and damping. I need those three values to append to my lists, but only when the temperature is in the specified range (there are some lines of my data where the temperature is all the way up at 4 kelvins, and this data is garbage).
I am using Python 3. Here are the things I have tried thus far:
f = open('alldata', 'r')
c = f.readlines()
temperature = []
newtemp = []
damping = []
time = []
for line in c:
    line = line.split()
    temperature.append(line[1])
    damping.append(line[4])
    time.append(line[0])
for i in temperature:
    if float(i) > 0.320 and float(i) < 0.325:
        newtemp.append(float(i))
When I printed the list newtemp, I could see that this code correctly filled the list with temperature values only in that range. However, I also need my damping and time lists to contain only the values that correspond to that small temperature range, and I'm not sure how to achieve that with this code.
I have also tried this, recommended by someone here:
output = []
lines = open('alldata', 'r')
for line in lines:
    temp = line.split()
    if float(temp[1]) > 0.320 and float(temp[1]) < 0.322:
        output.append(line)
print(output)
And I get an error that says:
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--NotebookApp.iopub_data_rate_limit.
I will note that I am very new to coding, so I apologize if this turns out to be a silly question.
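One way to keep all three lists aligned (an editorial sketch, using the column positions from the question: time in column 0, temperature in column 1, damping in column 4) is to apply the range test per line, before appending to any list:
time_vals, temp_vals, damping_vals = [], [], []
with open('alldata') as f:
    for line in f:
        parts = line.split()
        try:
            t = float(parts[1])
        except (IndexError, ValueError):
            continue  # skip blank, header, or malformed lines
        if 0.320 < t < 0.322:
            time_vals.append(float(parts[0]))
            temp_vals.append(t)
            damping_vals.append(float(parts[4]))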
Data:
temperature, time, coeff...
0.32, 12:00:23, 2,..
0.43, 11:22:23, 3,..
Here, temperature is in the first column.
output = []
lines = open('data.file', 'r')
for line in lines:
    temp = line.split(',')
    if float(temp[0]) > 0.320 and float(temp[0]) < 0.322:
        output.append(line)
print(output)
You can use pandas module:
import pandas as pd
# if the file with the data is an excel file use:
df = pd.read_excel('data.xlsx')
# if the file is csv
df = pd.read_csv('data.csv')
# if the column name of interest is named 'temperature'
selected = df['temperature'][(df['temperature'] > 0.320) & (df['temperature'] < 0.322)]
If you do not have pandas installed see here
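If you also need the matching time and damping values rather than just the temperatures, a boolean mask over the whole frame is closer to what the question asks. A sketch, assuming a whitespace-delimited file with no header row and the column layout from the question's sample line:
import pandas as pd

# column names are assumptions based on the header shown in the question
cols = ['time', 'temp', 'acq_freq', 'amplitude', 'damping', 'etc']
df = pd.read_csv('alldata', sep=r'\s+', names=cols)

mask = (df['temp'] > 0.320) & (df['temp'] < 0.322)
subset = df.loc[mask, ['time', 'temp', 'damping']]
print(subset)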