Bokeh BoxPlot > KeyError: 'the label [SomeCategory] is not in the [index]' - python-3.x

I'm attempting to create a BoxPlot using Bokeh. When I get to the section where I need to identify outliers, it fails if a given category has no outliers.
If I remove the "problem" category, the BoxPlot executes properly. It only fails when I attempt to create the BoxPlot with a category that has no outliers.
Any suggestions on how to remedy this?
The failure occurs at the commented section "Prepare outlier data for plotting [...]".
import numpy as np
import pandas as pd
import datetime
import math
from bokeh.plotting import figure, show, output_file
from bokeh.models import NumeralTickFormatter
# Create time stamps to allow for figure to display span in title
today = datetime.date.today()
delta1 = datetime.timedelta(days=7)
delta2 = datetime.timedelta(days=1)
start = str(today - delta1)
end = str(today - delta2)
#Identify location of prices
itemloc = 'Everywhere'
df = pd.read_excel(r'C:\Users\me\prices.xlsx')
# Create a list from the dataframe that identifies distinct categories for the separate box plots
cats = df['subcategory_desc'].unique().tolist()
# Find the quartiles and IQR for each category
groups = df.groupby('subcategory_desc', sort=False)
q1 = groups.quantile(q=0.25)
q2 = groups.quantile(q=0.5)
q3 = groups.quantile(q=0.75)
iqr = q3 - q1
upper = q3 + 1.5*iqr
lower = q1 - 1.5*iqr
# Find the outliers for each category
def outliers(group):
    cat = group.name
    return group[(group.price > upper.loc[cat][0]) | (group.price < lower.loc[cat][0])]['price']
out = groups.apply(outliers).dropna()
# Prepare outlier data for plotting, we need coordinates for every outlier.
outx = []
outy = []
for cat in cats:
    # only add outliers if they exist
    if not out.loc[cat].empty:
        for value in out[cat]:
            outx.append(cat)
            outy.append(value)
I expect the box-and-whisker portion of categories with no outliers to simply show up without the outlier dots.

Have you tried the code from the official documentation, https://docs.bokeh.org/en/latest/docs/gallery/boxplot.html? It iterates over the index of out instead of over every category, so a category with no outliers is never looked up:
# prepare outlier data for plotting, we need coordinates for every outlier.
if not out.empty:
    outx = []
    outy = []
    for keys in out.index:
        outx.append(keys[0])
        outy.append(out.loc[keys[0]].loc[keys[1]])
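An alternative that avoids label lookups altogether is to flag outlier rows with a boolean mask. The following is only a minimal sketch under assumptions: the small DataFrame is made-up sample data, the column names `subcategory_desc` and `price` are taken from the question, and the 1.5 * IQR fences match the question's definition.

```python
import pandas as pd

# Made-up sample data: category "a" has one outlier, category "b" has none.
df = pd.DataFrame({
    "subcategory_desc": ["a"] * 6 + ["b"] * 5,
    "price": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0, 2.0, 2.1, 2.2, 2.3, 2.4],
})

# Per-category quartiles and 1.5 * IQR fences, as in the question.
prices = df.groupby("subcategory_desc", sort=False)["price"]
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

# Map each row's category to its fences and compare row by row.
cat = df["subcategory_desc"]
mask = (df["price"] > cat.map(upper)) | (df["price"] < cat.map(lower))
out = df[mask]

# Coordinates for the outlier glyphs; categories without outliers
# simply contribute no rows, so no KeyError can occur.
outx = out["subcategory_desc"].tolist()
outy = out["price"].tolist()
print(outx, outy)
```

Because nothing is ever indexed by category label, the loop over `cats` and the emptiness check are no longer needed.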

Related

Using python to plot 'Gridded' map

I would like to know how I can create a gridded map of a country (e.g. Singapore) with a resolution of 200m x 200m squares. (50m or 100m is OK too.)
I would then use the 'nearest neighbour' technique to assign rainfall data and a colour code to each square based on the nearest rainfall station's data.
[I have the latitude, longitude & rainfall data for all the stations for each date.]
Then, I would like to store the data in an array for each 'gridded map' (i.e. one per date from 1-Jan-1980 to 31-Dec-2021).
Can this be done using python?
P.S. Below is a 'simple' version I did as an example of how the 'gridded' map should look for one particular day.
https://i.stack.imgur.com/9vIeQ.png
Thank you so much!
Can this be done using python? YES
I have previously provided a similar answer on binning a spatial dataframe; refer to that for the concepts as well.
You have noted that you are working with Singapore geometry and rainfall data. To set up an answer I have sourced this data from government sources.
For the purpose of this answer I have used a 2 km x 2 km grid, which reduces resource utilisation when plotting to demonstrate the answer.
Core concept: create a grid of box polygons that covers the total bounds of the geometry. Note it's important to use a UTM CRS here so that bounds in meters make sense. Once the boxes are created, remove boxes that are within the total bounds but do not intersect with the actual geometry.
Next, create a geopandas dataframe of rainfall data, using the longitude and latitude of each weather station to create points.
Final step: sjoin_nearest() the grid geometry with the rainfall geometry and data.
The final data frame gdf_grid_rainfall is a data frame, which is effectively an array. You can use it as an array as you please ...
I have provided folium and plotly interactive visualisations that demonstrate the solution is working.
solution
Dependent on data sourcing
# number of meters
STEP = 2000
a, b, c, d = gdf_sg.to_crs(gdf_sg.estimate_utm_crs()).total_bounds
# create a grid for Singapore
gdf_grid = gpd.GeoDataFrame(
    geometry=[
        shapely.geometry.box(minx, miny, maxx, maxy)
        for minx, maxx in zip(np.arange(a, c, STEP), np.arange(a, c, STEP)[1:])
        for miny, maxy in zip(np.arange(b, d, STEP), np.arange(b, d, STEP)[1:])
    ],
    crs=gdf_sg.estimate_utm_crs(),
).to_crs(gdf_sg.crs)
# restrict grid to only squares that intersect with Singapore geometry
gdf_grid = (
    gdf_grid.sjoin(gdf_sg)
    .pipe(lambda d: d.groupby(d.index).first())
    .set_crs(gdf_grid.crs)
    .drop(columns=["index_right"])
)
# geodataframe of weather station locations and rainfall by date
gdf_rainfall = gpd.GeoDataFrame(
    df_stations.merge(df, on="id")
    .assign(
        geometry=lambda d: gpd.points_from_xy(
            d["location.longitude"], d["location.latitude"]
        )
    )
    .drop(columns=["location.latitude", "location.longitude"]),
    crs=gdf_sg.crs,
)
# weather station to nearest grid
gdf_grid_rainfall = gpd.sjoin_nearest(gdf_grid, gdf_rainfall).drop(
    columns=["Description", "index_right"]
)
# does it work? let's visualize with folium
gdf_grid_rainfall.loc[lambda d: d["Date"].eq("20220622")].explore("Rainfall (mm)", height=400, width=600)
data sourcing
import requests, itertools, io
from pathlib import Path
import urllib
from zipfile import ZipFile
import fiona.drvsupport
import geopandas as gpd
import numpy as np
import pandas as pd
import shapely.geometry
# get official Singapore planning area geometry
url = "https://geo.data.gov.sg/planning-area-census2010/2014/04/14/kml/planning-area-census2010.zip"
f = Path.cwd().joinpath(urllib.parse.urlparse(url).path.split("/")[-1])
if not f.exists():
    r = requests.get(url, stream=True, headers={"User-Agent": "XY"})
    with open(f, "wb") as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)
    zfile = ZipFile(f)
    zfile.extractall(f.stem)
fiona.drvsupport.supported_drivers['KML'] = 'rw'
gdf_sg = gpd.read_file(
    [_ for _ in Path.cwd().joinpath(f.stem).glob("*.kml")][0], driver="KML"
)
# get data about Singapore weather stations
df_stations = pd.json_normalize(
    requests.get("https://api.data.gov.sg/v1/environment/rainfall").json()["metadata"][
        "stations"
    ]
)
# dates to get data from weather.gov.sg
dates = pd.date_range("20220601", "20220730", freq="MS").strftime("%Y%m")
df = pd.DataFrame()
# fmt: off
bad = ['S100', 'S201', 'S202', 'S203', 'S204', 'S205', 'S207', 'S208',
'S209', 'S211', 'S212', 'S213', 'S214', 'S215', 'S216', 'S217',
'S218', 'S219', 'S220', 'S221', 'S222', 'S223', 'S224', 'S226',
'S227', 'S228', 'S229', 'S230', 'S900']
# fmt: on
for stat, month in itertools.product(df_stations["id"], dates):
    if stat not in bad:
        try:
            df_ = pd.read_csv(
                io.StringIO(
                    requests.get(
                        f"http://www.weather.gov.sg/files/dailydata/DAILYDATA_{stat}_{month}.csv"
                    ).text
                )
            ).iloc[:, 0:5]
        except pd.errors.ParserError as e:
            bad.append(stat)
            print(f"failed {stat} {month}")
        else:
            # only append when the download parsed cleanly, so a failed
            # station never re-adds the previous station's rows
            df = pd.concat([df, df_.assign(id=stat)])
df["Rainfall (mm)"] = pd.to_numeric(
    df["Daily Rainfall Total (mm)"], errors="coerce"
)
df["Date"] = pd.to_datetime(df[["Year","Month","Day"]]).dt.strftime("%Y%m%d")
df = df.loc[:,["id","Date","Rainfall (mm)", "Station"]]
visualisation using plotly animation
import plotly.express as px
# reduce dates so figure builds in sensible time
gdf_px = gdf_grid_rainfall.loc[
    lambda d: d["Date"].isin(
        gdf_grid_rainfall["Date"].value_counts().sort_index().index[0:15]
    )
]
px.choropleth_mapbox(
    gdf_px,
    geojson=gdf_px.geometry,
    locations=gdf_px.index,
    color="Rainfall (mm)",
    hover_data=gdf_px.columns[1:].tolist(),
    animation_frame="Date",
    mapbox_style="carto-positron",
    center={"lat":gdf_px.unary_union.centroid.y, "lon":gdf_px.unary_union.centroid.x},
    zoom=8.5
).update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0, "pad": 4})
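The question also asks for the values as an array per date. One possible way, sketched here under assumptions, is to pivot the long table into a date-by-cell matrix. The frame below is a made-up stand-in for gdf_grid_rainfall (geometry omitted), and the `cell` column is a hypothetical grid-cell identifier; in the real frame the index of the grid would play that role.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for gdf_grid_rainfall: one row per (grid cell, date).
df_cells = pd.DataFrame({
    "cell": [0, 1, 2, 0, 1, 2],
    "Date": ["20220621"] * 3 + ["20220622"] * 3,
    "Rainfall (mm)": [0.0, 1.2, 3.4, 5.0, 0.2, 0.0],
})

# Rows become dates, columns become grid cells.
wide = df_cells.pivot(index="Date", columns="cell", values="Rainfall (mm)")

# Plain 2-D NumPy array: rain[i, j] is the rainfall on date i in cell j.
rain = wide.to_numpy()
print(wide.index.tolist())
print(rain)
```

Keeping `wide.index` and `wide.columns` alongside the array preserves the mapping from array positions back to dates and grid cells.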

Finding the distance between latlong

I am a bit stuck. I have a CSV which includes:
Site Name
Latitude
Longitude.
This CSV has 100,000 locations. I need to generate a comma-separated list for each location, showing the other locations within 5KM.
I have tried the attached, which transposes the table and gives me 100,000 columns by 100,000 rows with the distance populated as the result. But I am not sure how to make a new pandas column that holds a list of all the sites within 5KM.
Can you help?
from geopy.distance import geodesic
def distance(row, csr):
    lat = row['latitude']
    long = row['longitude']
    lat_long = (lat, long)
    try:
        return round(geodesic(lat_long, lat_long_compare).kilometers,2)
    except:
        return 9999
for key, value in d.items():
    lat_compare = value['latitude']
    long_compare = value['longitude']
    lat_long_compare = (lat_compare, long_compare)
    csr = key
    df[key] = df.apply([distance, csr], axis=1)
Some sample data can be:
destinations = { 'bigben' : {'latitude': 51.510357,
'longitude': -0.116773},
'heathrow' : {'latitude': 51.470020,
'longitude': -0.454295},
'alton_towers' : {'latitude': 52.987662716,
'longitude': -1.888829778}
}
bigben is 0.8KM from the London Eye
heathrow is 23.55KM from the London Eye
alton_towers is 204.63KM from the London Eye
So, in this case, the field should show only big ben.
So we get:
Site | Sites within 5KM
28, BigBen
Here is one way with NearestNeighbors.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
# data from your input
df = pd.DataFrame.from_dict(destinations, orient='index').rename_axis('Site Name').reset_index()
radius = 50  # change to whatever, in km
# create the algo with the radius and the metric for geospatial distance
# (haversine works in radians, so the radius in km is divided by the Earth's radius, ~6371 km)
neigh = NearestNeighbors(radius=radius/6371, metric='haversine')
# fit the data in radians
neigh.fit(df[['latitude', 'longitude']].to_numpy()*np.pi/180)
# extract result and transform to get the expected output
df[f'Site_within_{radius}km'] = (
    pd.Series(neigh.radius_neighbors()[1])  # get a list of index for each row
    .explode()
    .map(df['Site Name'])  # get the site name from row index
    .groupby(level=0)  # transform back to row-row relation
    .agg(list)  # can use ', '.join instead of list
)
print(df)
      Site Name   latitude  longitude Site_within_50km
0        bigben  51.510357  -0.116773       [heathrow]
1      heathrow  51.470020  -0.454295         [bigben]
2  alton_towers  52.987663  -1.888830            [nan]
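To see why the coordinates are converted to radians and the radius is divided by 6371, it can help to compute one haversine distance by hand. This is a minimal NumPy sketch, not part of the answer's code: the helper `haversine_km` is a name introduced here for illustration, and 6371 km is the usual mean Earth radius assumed by the answer.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same constant the answer divides by

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    # 2 * arcsin(sqrt(a)) is the central angle in radians;
    # multiplying by the radius turns it into kilometres.
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# bigben and heathrow from the sample data
d = haversine_km(51.510357, -0.116773, 51.470020, -0.454295)
print(round(float(d), 1))

# The haversine metric compares central angles, so a 50 km radius
# is expressed as 50 / 6371 radians.
within_50km = (d / EARTH_RADIUS_KM) <= 50 / EARTH_RADIUS_KM
```

The result is close to 23.8 km, so the pair falls inside a 50 km radius but would not fall inside the 5 km radius the question asks for.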
Another way
# newer scikit-learn releases expose this class as sklearn.metrics.DistanceMetric
from sklearn.neighbors import DistanceMetric
import pandas as pd
import numpy as np
# df is assumed to be indexed by site name
# To radians
df['latitude'] = np.radians(df['latitude'])
df['longitude'] = np.radians(df['longitude'])
# Assume a spherical radius of 6373 km
dist = DistanceMetric.get_metric('haversine')  # DistanceMetric class
# Pairwise distances between all sites, in km
df = pd.DataFrame(dist.pairwise(df[['latitude','longitude']].to_numpy())*6373, columns=df.index.unique(), index=df.index.unique())
s = df.gt(0) & df.le(50)
df['Site_within_50km'] = s.agg(lambda x: x.index[x].values, axis=1)  # Filter
                  bigben    heathrow  alton_towers Site_within_50km
bigben          0.000000   23.802459    203.857533       [heathrow]
heathrow       23.802459    0.000000    195.048961         [bigben]
alton_towers  203.857533  195.048961      0.000000               []

Trouble aligning x-axis Matplotlib (Homework)

As stated above, I have a homework assignment for a fundamentals of Data Science class. I am filtering out a tower with faulty information and plotting the data of the good tower by amplitude and timing.
The issue is with the mean line for my graph. It is supposed to run through the average of my points. Unfortunately I cannot seem to align it across my X-axis.
My output looks like this:
I've tried solutions I found on Stack Overflow, but the best I could come up with was a mean line for the whole graph using: mplot.plot(np.unique(columnOneF),np.poly1d(np.polyfit(columnOneF,columnTwoF,1))(np.unique(columnOneF)))
import csv
import matplotlib.pyplot as mplot
import numpy as np
File = open("WhiteSwordfish_ch1.csv")
csv_file = csv.reader(File)
columnOneF = []
columnTwoF = []
columnThreeF = []
MeanAmp = []
Freq = []
TempFreq = []
last = 0
for row in csv_file:  # Loop over every row of the CSV file
    if float(row[2]) == 21.312057:  # Check whether the frequency is from the good tower;
        Freq.append(row)  # if so, grab THE WHOLE ROW and store it in a list
for row in Freq:  # Loop through only the good tower's data and sort it into
    columnOneF.append(float(row[0]))  # separate lists by type
    columnTwoF.append(float(row[1]))
    columnThreeF.append(float(row[2]))
# Mean Line Calculation
for i in Freq:
    current = float(i[0])
    if current == last:
        TempFreq.append(float(i[1]))
    else:
        last = current
        MeanAmp.append(np.mean(TempFreq))
        # MeanAmp.insert(int(current), np.mean(TempFreq))
        TempFreq = []
print(MeanAmp)
print(columnOneF)
# Graph One (Filter Data)
# ****************************************************************************
mplot.title("Filtered Data")
mplot.xlabel("Timing")
mplot.ylabel("Amplitude")
mplot.axis([-100, 800, -1.5, 1.5])
mplot.scatter(columnOneF, columnTwoF, color="red") # Clean Data POINTS
mplot.plot(MeanAmp, color="blue", linestyle="-") # Line
# mplot.plot(np.unique(columnOneF),np.poly1d(np.polyfit(columnOneF,columnTwoF,1))(np.unique(columnOneF)))
mplot.show() # Displays both graphs
You have passed only MeanAmp to the plot() function, which is interpreted as
plot(y) # plot y using x as index array 0..N-1
Source
If you provide x-coordinates, the same as for the scatter() function, the line will be aligned:
mplot.plot(columnOneF, MeanAmp, color="blue", linestyle="-")
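Note that plot(x, y) needs x and y of equal length, and the loop in the question appends one mean per change of timing value, so MeanAmp is likely shorter than columnOneF. A minimal sketch of one way to get matching arrays, assuming the goal is one mean amplitude per distinct timing value; the sample numbers and the names `timing`, `amplitude`, `x_unique` and `mean_amp` are made up for illustration:

```python
import numpy as np

# Made-up sample: several amplitude readings per timing value.
timing = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 2.0])
amplitude = np.array([0.2, 0.4, 1.0, 0.5, 0.0, -1.0])

# Distinct timing values, plus for every reading the position of its timing value.
x_unique, inverse = np.unique(timing, return_inverse=True)

# Sum and count the readings per timing value, then divide to get the mean.
sums = np.bincount(inverse, weights=amplitude)
counts = np.bincount(inverse)
mean_amp = sums / counts

print(x_unique)
print(mean_amp)
```

With arrays like these, `mplot.plot(x_unique, mean_amp, color="blue", linestyle="-")` draws the mean line at the same x positions as the scatter points.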

Why is Bokeh's plot not changing with plot selection?

Struggling to understand why this bokeh visual will not allow me to change plots and see the predicted data. The plot and select (dropdown-looking) menu appears, but I'm not able to change the plot for items in the menu.
Running Bokeh 1.2.0 via Anaconda. The code has been run both inside & outside of Jupyter. No errors display when the code is run. I've looked through the handful of SO posts relating to this same issue, but I've not been able to apply the same solutions successfully.
I wasn't sure how to create a toy problem out of this, so in addition to the code sample below, the full code (including the regression code and corresponding data) can be found at my github here (code: Regression&Plotting.ipynb, data: pred_data.csv, historical_data.csv, features_created.pkd.)
import pandas as pd
import datetime
from bokeh.io import curdoc, output_notebook, output_file
from bokeh.layouts import row, column
from bokeh.models import Select, DataRange1d, ColumnDataSource
from bokeh.plotting import figure
#Must be run from the command line
def get_historical_data(src_hist, drug_id):
    historical_data = src_hist.loc[src_hist['ndc'] == drug_id]
    historical_data.drop(['Unnamed: 0', 'date'], inplace = True, axis = 1)#.dropna()
    historical_data['date'] = pd.to_datetime(historical_data[['year', 'month', 'day']], infer_datetime_format=True)
    historical_data = historical_data.set_index(['date'])
    historical_data.sort_index(inplace = True)
    # csd_historical = ColumnDataSource(historical_data)
    return historical_data
def get_prediction_data(src_test, drug_id):
    #Assign the new date
    #Write a new dataframe with values for the new dates
    df_pred = src_test.loc[src_test['ndc'] == drug_id].copy()
    df_pred.loc[:, 'year'] = input_date.year
    df_pred.loc[:, 'month'] = input_date.month
    df_pred.loc[:, 'day'] = input_date.day
    df_pred.drop(['Unnamed: 0', 'date'], inplace = True, axis = 1)
    prediction = lin_model.predict(df_pred)
    prediction_data = pd.DataFrame({'drug_id': prediction[0][0], 'predictions': prediction[0][1], 'date': pd.to_datetime(df_pred[['year', 'month', 'day']], infer_datetime_format=True, errors = 'coerce')})
    prediction_data = prediction_data.set_index(['date'])
    prediction_data.sort_index(inplace = True)
    # csd_prediction = ColumnDataSource(prediction_data)
    return prediction_data
def make_plot(historical_data, prediction_data, title):
    #Historical Data
    plot = figure(plot_width=800, plot_height = 800, x_axis_type = 'datetime',
                  toolbar_location = 'below')
    plot.xaxis.axis_label = 'Time'
    plot.yaxis.axis_label = 'Price ($)'
    plot.axis.axis_label_text_font_style = 'bold'
    plot.x_range = DataRange1d(range_padding = 0.0)
    plot.grid.grid_line_alpha = 0.3
    plot.title.text = title
    plot.line(x = 'date', y='nadac_per_unit', source = historical_data, line_color = 'blue', ) #plot historical data
    plot.line(x = 'date', y='predictions', source = prediction_data, line_color = 'red') #plot prediction data (line from last date/price point to date, price point for input_date above)
    return plot
def update_plot(attrname, old, new):
    ver = vselect.value
    new_hist_source = get_historical_data(src_hist, ver) #calls the function above to get the data instead of handling it here on its own
    historical_data.data = ColumnDataSource.from_df(new_hist_source)
    # new_pred_source = get_prediction_data(src_pred, ver)
    # prediction_data.data = new_pred_source.data
#Import data source
src_hist = pd.read_csv('data/historical_data.csv')
src_pred = pd.read_csv('data/pred_data.csv')
#Prep for default view
#Initialize plot with ID number
ver = 781593600
#Set the prediction date
input_date = datetime.datetime(2020, 3, 31) #Make this selectable in future
#Select-menu options
menu_options = src_pred['ndc'].astype(str) #already contains unique values
#Create select (dropdown) menu
vselect = Select(value=str(ver), title='Drug ID', options=sorted((menu_options)))
#Prep datasets for plotting
historical_data = get_historical_data(src_hist, ver)
prediction_data = get_prediction_data(src_pred, ver)
#Create a new plot with the source data
plot = make_plot(historical_data, prediction_data, "Drug Prices")
#Update the plot every time 'vselect' is changed'
vselect.on_change('value', update_plot)
controls = row(vselect)
curdoc().add_root(row(plot, controls))
UPDATED: ERRORS:
1) No errors show up in Jupyter Notebook.
2) CLI shows a UserWarning: Pandas doesn't allow columns to be created via a new attribute name, referencing `historical_data.data = ColumnDataSource.from_df(new_hist_source)`.
Ultimately, the plot should have a line for historical data, and another line or dot for predicted data derived from sklearn. It also has a dropdown menu to select each item to plot (one at a time).
Your update_plot is a no-op that does not actually make any changes to Bokeh model state, which is what is necessary to change a Bokeh plot. Changing Bokeh model state means assigning a new value to a property on a Bokeh object. Typically, to update a plot, you would compute a new data dict and then set an existing CDS from it:
source.data = new_data # plain python dict
Or, if you want to update from a DataFrame:
source.data = ColumnDataSource.from_df(new_df)
As an aside, don't assign the .data from one CDS to another:
source.data = other_source.data # BAD
By contrast, your update_plot computes some new data and then throws it away. Note there is never any purpose to returning anything at all from any Bokeh callback. The callbacks are called by Bokeh library code, which does not expect or use any return values.
Lastly, I don't think any of those last JS console errors were generated by BokehJS.
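In the question, historical_data is a pandas DataFrame rather than a ColumnDataSource, which would explain the pandas UserWarning: the assignment creates an attribute on the DataFrame and never touches a Bokeh object. A minimal sketch of the pattern described above, assuming a single ColumnDataSource that the glyphs were created from; the `DATA` dict and its keys are hypothetical stand-ins for the per-drug DataFrames:

```python
from bokeh.models import ColumnDataSource

# Hypothetical per-drug data standing in for the question's DataFrames.
DATA = {
    "A": {"date": [1, 2, 3], "price": [1.0, 1.5, 2.0]},
    "B": {"date": [1, 2, 3], "price": [9.0, 8.0, 7.0]},
}

# The glyphs would be created once from this source, e.g.
# plot.line(x="date", y="price", source=source)
source = ColumnDataSource(data=DATA["A"])

def update_plot(attr, old, new):
    # Assigning a new dict to .data changes Bokeh model state,
    # which is what triggers a redraw. Nothing is returned.
    source.data = dict(DATA[new])

# In a Bokeh server app this would be wired up with
# vselect.on_change("value", update_plot); here it is called directly.
update_plot("value", "A", "B")
print(list(source.data["price"]))
```

The key design point is that the callback mutates an object the plot already references, instead of building new data and discarding it.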

How to plot in python using Legend as a checkbox?

I have been trying to plot a graph from a dataframe that has 3 columns. One is the "Hour", the second is the "amount" in rupees, and the third consists of "machine codes". I need to analyze the amount of transactions a machine does on an hourly basis. There are 67 unique machine codes in total.
Kindly check the data sample here.
These are the libraries I have been using:
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.layouts import row
from bokeh.palettes import Viridis3
from bokeh.plotting import figure
from bokeh.models import CheckboxGroup, CustomJS
output_notebook()
p = figure()
props = dict(line_width=4, line_alpha=0.7)
x = sl['Hour']
y = sl['amount']
Now I have appended a list labels[] with all the machine codes
labels = []
active1 = []
for s in sl['machinecode'].unique():
    labels.append(s)
    active1.append(0)
I basically want to create checkboxes for all those machine codes. When a user checks a machine code, a graph gets plotted. If the user then checks another machine code, the line for that machine code gets added to the graph so that I can compare machines.
j = 0
for i in sl['machinecode'].unique():
    l = p.line(x, y, color=Viridis3[0], legend="Line:" , **props)
    j = j + 1
checkbox = CheckboxGroup(labels=labels,
                         active=active1, width=100)
checkbox.callback = CustomJS(args=dict(l=l, checkbox=checkbox),
                             code="""
    l0.visible = 0 in checkbox.active;
    l1.visible = 1 in checkbox.active;
    l2.visible = 2 in checkbox.active;
    """)
layout = row(checkbox, p)
show(layout)
The above code shows something quite different; kindly check here what the graph actually shows. It plots every machine in a single color, and the checkboxes do not actually control the graph.
