Bokeh plot returned in function not rendering - python-3.x

I was writing a function to simplify my plotting. It does not raise any error, yet when I call
show(plt)
on the return value nothing happens. I'm working in a Jupyter notebook. I've already made a call to:
output_notebook()
Here is the function code :
def plot_dist(x, h, title, xl="X axis", yl="Y axis", categories=None, width=0.5, bottom=0, color="#DC143C", xmlo=None, ymlo=None, xlo=-18, ylo=5):
    total = np.sum(h)
    source = ColumnDataSource(data={
        "x":x,
        "h":h,
        "percentages":[str(round((x*100)/total, 2)) + "%" for x in h]
    })
    plt = figure(
        title=title,
        x_axis_label=xl,
        y_axis_label=yl
    )
    plt.vbar(
        x="x",
        width=width,
        bottom=bottom,
        top="h",
        source=source,
        color=color
    )
    if xmlo is None:
        if categories is None:
            raise ValueError("If no categories are provided xaxis.major_label_overrides must be defined")
        plt.xaxis.major_label_overrides = {
            int(x):("(" + str(c.left) + "-" + str(c.right) + "]") for x,c in enumerate(categories)
        }
    else:
        plt.xaxis.major_label_overrides = xmlo
    if ymlo is None:
        plt.yaxis.major_label_overrides = { int(x):(str(int(x)/1000)+"k") for x in range(0, h.max(), math.ceil((h.max()/len(h))) )}
    else:
        plt.yaxis.major_label_overrides = ymlo
    labels = LabelSet(
        x=str(x), y=str(h), text="percentages", level="glyph",
        x_offset=xlo, y_offset=ylo, source=source, render_mode="canvas"
    )
    plt.add_layout(labels)
    return plt
And this is how it is invoked :
X = [x for x in range(0, len(grps.index))]
H = grps.to_numpy()
plt = plot_dist(X, H, "Test", "xtest", "ytest", grps.index.categories)
X is just a list and grps is the result of a call to pandas' DataFrame.groupby
As I said, it does not give any error, so I think the problem is with the ColumnDataSource object; I must be creating it wrong. Any help is appreciated, thanks!
Edit 1 : Apparently removing the following line solved the problem :
plt.add_layout(labels)
The plot now renders correctly, yet I still need to add the labels. Any idea?
Edit 2 : Ok I've solved the problem, inspecting the web console when running the code the following error shows :
Error: attempted to retrieve property array for nonexistent field
The problem was in the following lines :
labels = LabelSet(
x=str(x), y=str(h), text="percentages", level="glyph",
x_offset=xlo, y_offset=ylo, source=source, render_mode="canvas"
)
In particular, the problem was assigning x=str(x) and y=str(h). Changing them to simply x="x" and y="h" solved it.

The problem with the code is with the labels declaration :
labels = LabelSet(
x=str(x), y=str(h), text="percentages", level="glyph",
x_offset=xlo, y_offset=ylo, source=source, render_mode="canvas"
)
It was discovered by inspecting the browser's web console, which gave the following error :
Error: attempted to retrieve property array for nonexistent field
The parameters x and y must refer to column names in the ColumnDataSource object passed to the glyph method used to draw on the plot.
I was mistakenly passing str(x) and str(h), which are the string representations of the contents of those variables. I had wrongly assumed they would give the names of the variables.
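A quick way to see why the lookup fails is to reproduce it with plain Python stand-ins. This is only a sketch: the sample values are made up, and the dictionary merely mimics the shape of the one handed to ColumnDataSource. It shows that str(x) yields the printed contents of the list, which is not one of the keys the source was built with.

```python
# Made-up sample values standing in for the real x and h.
x = [0, 1, 2]
h = [10, 20, 30]

# Same shape as the dictionary passed to ColumnDataSource(data=...).
data = {"x": x, "h": h}

wrong_field = str(x)   # the list's contents rendered as text
right_field = "x"      # the dictionary key, i.e. the column name

# A field name can only be resolved if it is a key of the data dictionary.
print(repr(wrong_field), wrong_field in data)
print(repr(right_field), right_field in data)
```

The first line reports a missing key and the second a present one, which matches the "nonexistent field" message shown in the browser console.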
To solve the problem, it is sufficient to pass to the x and y parameters of the LabelSet constructor the dictionary keys used in the ColumnDataSource constructor:
labels = LabelSet(
x="x", y="h", text="percentages", level="glyph",
x_offset=xlo, y_offset=ylo, source=source, render_mode="canvas"
)
In addition, if the ColumnDataSource was constructed from a DataFrame, the strings will be either the column names or, for data that comes from the index, the name of the index object (the string "index" if the index has no explicit name).
Thanks a lot to bigreddot for helping me with the problem and answer.

Related

Altair: Remove title from layered faceted graphs

I tried layering faceted graphs and it failed, so I moved to the method suggested here - https://stackoverflow.com/a/52882510/20390480 which basically layers the graphs and then calls .facet(column). With this method I am unable to remove the facet title.
I tried .facet(column, title=None), which throws the following error:
import altair as alt
from vega_datasets import data
cars = data.cars()
horse = alt.Chart().mark_point().encode(
    x = 'Weight_in_lbs',
    y = 'Horsepower'
)
miles = alt.Chart().mark_point(color='red').encode(
    x = 'Weight_in_lbs',
    y = 'Miles_per_Gallon'
)
alt.layer(horse, miles, data=cars).facet(column='Origin', title=None)
SchemaValidationError: Invalid specification
altair.vegalite.v4.api.Chart, validating 'required'
'data' is a required property
alt.FacetChart(...)
Try:
alt.layer(horse, miles, data=cars).facet(column=alt.Column('Origin', title=None))

Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe.
So I've defined an augment method, which I want to use in apply as follows:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
When I call this as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the 'apply' method, the prints and the display work fine, but the resulting dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is what the output data looks like after the apply call -
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?
Your test is very nice, thank you for the clear exposition.
I am happy to be your rubber duck.
In test A, you (successfully) mess with
testDF.iloc[0] and [1],
using kind of a Fortran-style API
for augment(), leaving a side effect result in tmp_df.
Test B is carefully constructed to
be "the same" except for the .apply() call.
So let's see, what's different?
Hard to say.
Let's go examine the docs.
Oh, right!
We're using the .apply() API,
so we'd better follow it.
Down at the end it explains:
Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.
But you're offering return None instead.
Now, I'm not here to pass judgement on
whether it's best to have side effects
on a target df -- that's up to you.
But .apply() will be bent out of shape
until you give it something nice
to store as its own result.
Happy hunting!
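Here is a minimal sketch of that difference, assuming a tiny made-up DataFrame and two hypothetical helper functions (returns_nothing and returns_row, neither of which is from the question). Both leave a side effect in a list, but only one hands something back for .apply() to store.

```python
import pandas as pd

# Tiny made-up frame; the column names are arbitrary.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def returns_nothing(row, sink):
    # Side effect only; the function implicitly returns None.
    sink.append(int(row["a"] + row["b"]))

def returns_row(row, sink):
    # Same side effect, but the row is handed back to .apply().
    sink.append(int(row["a"] + row["b"]))
    return row

sink_a, sink_b = [], []
nothing = df.apply(returns_nothing, args=(sink_a,), axis=1)
rows = df.apply(returns_row, args=(sink_b,), axis=1)

# The first result holds only missing values; the second is a full frame.
print(type(nothing).__name__, bool(nothing.isna().all()))
print(type(rows).__name__, rows.shape)
```

The side effects land in the lists either way; what changes is whether the object returned by .apply() contains anything useful.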
Tiny little style nit.
You wrote
args=('binMap', tmp_df, 4, )
to offer a 3-tuple. Better to write
args=('binMap', tmp_df, 4)
As written it tends to suggest 1-tuple notation.
When is a trailing comma helpful?
In a 1-tuple it is essential: x = (7,)
In multiline dict / list expressions it minimizes git diffs when, inevitably, another entry ('cherry'?) is added later:
fruits = [
'apple',
'banana',
]
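The tuple cases above can be checked directly in plain Python. In this small sketch the string "tmp_df" is only a placeholder standing in for the real dataframe.

```python
# With the comma this is a 1-tuple; without it, just a parenthesized int.
one_tuple = (7,)
just_an_int = (7)

# "tmp_df" is a placeholder string, not the actual dataframe.
args_with_comma = ("binMap", "tmp_df", 4, )
args_without_comma = ("binMap", "tmp_df", 4)

print(type(one_tuple).__name__, type(just_an_int).__name__)
print(args_with_comma == args_without_comma)
```

For three elements the trailing comma changes nothing, which is why dropping it is purely a readability choice.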
This change worked for me -
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
    return row
And I updated the call to apply as follows:
testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)
Thank you @J_H.
If there is a better way to achieve what I'm doing, please feel free to suggest improvements.

How do I find the value of a bokeh line glyph

I'm trying to build an application in which I need to extract the x,y values of a bokeh line. I'm able to do this for a bokeh circle (see below, where I find the x value of the circle is tmp1.glyph.x = 2), but the same syntax doesn't work for a line between two points (tmp.glyph.x = "x"). I would hope to see [-3,3]. I would be grateful for any advice.
from bokeh.plotting import figure, show
fig = figure(x_range=(-5,5),y_range=(-5, 5))
tmp1=fig.circle(x=2, y=-3, size=5)
tmp=fig.line(x = [-3,3], y = [4,-4])
print(tmp1.glyph.x)
# output: 2
print(tmp.glyph.x)
# output: x
show(fig)
For the line glyph a ColumnDataSource object is created. To print the data of this ColumnDataSource, use tmp.data_source.data['x'] in your example.
To explain this behavior in more detail: if you pass a single value for x and y to a glyph, that value is stored directly as a value (inside the object it looks like this: x = {'value': 2}). If you pass a list, the glyph instead stores a pointer holding the name of the column in the ColumnDataSource (inside it looks like this: x = {'field': 'x'}). The circle glyph behaves the same way; you can try it out by passing a single value wrapped in a list.
Therefore a general solution to print the values could look like the code below:
value = tmp.glyph.x
if isinstance(value, str):
    value = tmp.data_source.data[value]
print(value)
Here we check whether the value in tmp.glyph.x is a string. If it is a string, it is a pointer to a column in the ColumnDataSource.
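The value-versus-field distinction can be mimicked in plain Python. The following is a toy model only, not Bokeh's actual implementation: resolve and line_data are hypothetical names, and the dictionaries simply copy the two shapes quoted above.

```python
def resolve(spec, data):
    """Return what a property spec points at (toy model, not Bokeh code).

    spec is either {"value": v}, a constant stored directly, or
    {"field": name}, a column name to look up in data.
    """
    if "field" in spec:
        return data[spec["field"]]
    return spec["value"]

# Hypothetical column data mirroring the line in the question.
line_data = {"x": [-3, 3], "y": [4, -4]}

print(resolve({"value": 2}, {}))           # circle-like case: constant
print(resolve({"field": "x"}, line_data))  # line-like case: column lookup
```

In this model the circle's x comes back as the constant itself, while the line's x has to be fetched from the column data, which is the same two-step lookup the snippet above performs through tmp.data_source.data.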

Compare two values in the same column python

I'm trying to compare two values in the same column in a pandas.DataFrame.
If the two values are different I want to create a new value.
My code looks like this:
def f(x, var1, var2):
    if (x[var1].shift(1) != x[var1]):
        x[var2] = 1
    else:
        x[var2] = 0
    return x
sdf['2008':'2009'].apply(lambda x: f(x, 'ROW1','ROW2'),axis = 1)
Unfortunately, this doesn't work. I get the following error message:
'numpy.float64' object has no attribute 'shift'", 'occurred at index 2008-01-01 00:00:00'
Thanks for your help.
I think you need:
df0 = df.shift()
df['Row2'] = np.where(df0['Row1']!=df['Row1'], 1, 0)
EDIT:
As @jpp suggested in comments:
df['Row2'] = (df0['Row1']!=df['Row1']).astype(int)
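A small self-contained sketch of this approach, using made-up values, may help. It also shows why the original apply call failed: with axis=1 each x[var1] is a single scalar, and a scalar has no shift method. The first row is flagged because the shifted column holds a missing value there, and a missing value never compares equal.

```python
import pandas as pd

# Made-up values; only the column names follow the answer above.
df = pd.DataFrame({"Row1": [1.0, 1.0, 2.0, 2.0, 3.0]})

# Inside a row-wise apply, a cell is a plain scalar without .shift().
cell = df.iloc[0]["Row1"]
print(type(cell).__name__, hasattr(cell, "shift"))

# Column-wise: compare every value with the one in the previous row.
previous = df["Row1"].shift()
df["Row2"] = (previous != df["Row1"]).astype(int)
print(df["Row2"].tolist())
```

If the first row should not count as a change, it can be set to 0 afterwards; whether that is wanted depends on the data.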

Bokeh Mapping Counties

I am attempting to modify this example with county data for Michigan. In short, it's working, but it seems to be adding some extra shapes here and there in the process of drawing the counties. I'm guessing that in some instances (where there are counties with islands), the island part needs to be listed as a separate "county", but I'm not sure about the other case, such as with Wayne county in the lower right part of the state.
Here's a picture of what I currently have:
Here's what I did so far:
1. Get county data from Bokeh's sample county data, just to get the state abbreviation per state number (my second, main data source only has state numbers). For this example, I'll simplify it by just filtering for state number 26.
2. Get state coordinates ('500k' file) by county from the U.S. Census site.
3. Use the following code to generate an 'interactive' map of Michigan.
Note: To pip install shapefile (really pyshp), I think I had to download the .whl file from here and then do pip install [path to .whl file].
import pandas as pd
import numpy as np
import shapefile
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Viridis6
from bokeh.plotting import figure, show, output_notebook
shpfile=r'Path\500K_US_Counties\cb_2015_us_county_500k.shp'
sf = shapefile.Reader(shpfile)
shapes = sf.shapes()
#Here are the rows from the shape file (plus lat/long coordinates)
rows=[]
lenrow=[]
for i,j in zip(sf.shapeRecords(),sf.shapes()):
    rows.append(i.record+[j.points])
    if len(i.record+[j.points])!=10:
        print("Found record with irregular number of columns")
fields1=sf.fields[1:] #Ignore first field as it is not used (maybe it's a meta field?)
fields=[seq[0] for seq in fields1]+['Long_Lat']#Take the first element in each tuple of the list
c=pd.DataFrame(rows,columns=fields)
try:
    c['STATEFP']=c['STATEFP'].astype(int)
except:
    pass
#cns=pd.read_csv(r'Path\US_Counties.csv')
#cns=cns[['State Abbr.','STATE num']]
#cns=cns.drop_duplicates('State Abbr.',keep='first')
#c=pd.merge(c,cns,how='left',left_on='STATEFP',right_on='STATE num')
c['Lat']=c['Long_Lat'].apply(lambda x: [e[0] for e in x])
c['Long']=c['Long_Lat'].apply(lambda x: [e[1] for e in x])
#c=c.loc[c['State Abbr.']=='MI']
c=c.loc[c['STATEFP']==26]
#latitude=x, longitude=y
county_xs = c['Lat']
county_ys = c['Long']
county_names = c['NAME']
# one random palette colour per county
county_colors = [Viridis6[np.random.randint(1,6, size=1).tolist()[0]] for _ in county_names]
randns=np.random.randint(1,6, size=1).tolist()[0]
#county_colors = [Viridis6[e] for e in randns]
#county_colors = 'b'
source = ColumnDataSource(data=dict(
    x=county_xs,
    y=county_ys,
    color=county_colors,
    name=county_names,
    #rate=county_rates,
))
output_notebook()
TOOLS="pan,wheel_zoom,box_zoom,reset,hover,save"
p = figure(title="Title", tools=TOOLS,
           x_axis_location=None, y_axis_location=None)
p.grid.grid_line_color = None
p.patches('x', 'y', source=source,
          fill_color='color', fill_alpha=0.7,
          line_color="white", line_width=0.5)
hover = p.select_one(HoverTool)
hover.point_policy = "follow_mouse"
hover.tooltips = [
    ("Name", "@name"),
    #("Unemployment rate)", "@rate%"),
    ("(Long, Lat)", "($x, $y)"),
]
show(p)
I'm looking for a way to avoid the extra lines and shapes.
Thanks in advance!
I have a solution to this problem, and I think I might even know why it is correct. First, let me show a quote from Bryan Van de Ven in a Google Groups Bokeh discussion:
there is no built-in support for dealing with shapefiles. You will have to convert the data to the simple format that Bokeh understands. (As an aside: it would be great to have a contribution that made dealing with various GIS formats easier).
The format that Bokeh expects for patches is a "list of lists" of points. So something like:
xs = [ [patch0 x-coords], [patch1 x-coords], ... ]
ys = [ [patch0 y-coords], [patch1 y-coords], ... ]
Note that if a patch is comprised of multiple polygons, this is currently expressed by putting NaN values in the sublists. So, the task is basically to convert whatever form of polygon data you have to this format, and then Bokeh can display it.
So it seems like somehow you are ignoring NaNs or otherwise not handling multiple polygons properly. Here is some code that will download US census data, unzip it, read it properly for Bokeh, and make a data frame of lat, long, state, and county.
def get_map_data(shape_data_file, local_file_path):
    url = "http://www2.census.gov/geo/tiger/GENZ2015/shp/" + \
        shape_data_file + ".zip"
    zfile = local_file_path + shape_data_file + ".zip"
    sfile = local_file_path + shape_data_file + ".shp"
    dfile = local_file_path + shape_data_file + ".dbf"
    if not os.path.exists(zfile):
        print("Getting file: ", url)
        response = requests.get(url)
        with open(zfile, "wb") as code:
            code.write(response.content)
    if not os.path.exists(sfile):
        uz_cmd = 'unzip ' + zfile + " -d " + local_file_path
        print("Executing command: " + uz_cmd)
        os.system(uz_cmd)
    shp = open(sfile, "rb")
    dbf = open(dfile, "rb")
    sf = shapefile.Reader(shp=shp, dbf=dbf)
    lats = []
    lons = []
    ct_name = []
    st_id = []
    for shprec in sf.shapeRecords():
        st_id.append(int(shprec.record[0]))
        ct_name.append(shprec.record[5])
        lat, lon = map(list, zip(*shprec.shape.points))
        indices = shprec.shape.parts.tolist()
        lat = [lat[i:j] + [float('NaN')] for i, j in zip(indices, indices[1:]+[None])]
        lon = [lon[i:j] + [float('NaN')] for i, j in zip(indices, indices[1:]+[None])]
        lat = list(itertools.chain.from_iterable(lat))
        lon = list(itertools.chain.from_iterable(lon))
        lats.append(lat)
        lons.append(lon)
    map_data = pd.DataFrame({'x': lats, 'y': lons, 'state': st_id, 'county_name': ct_name})
    return map_data
The function takes two inputs: the name of the shape file and a local directory to download the map data into. Because the directory is concatenated directly with the file name, it should end with a path separator. I know of at least two maps available from the URL in the function above that you could request:
map_low_res = "cb_2015_us_county_20m"
map_high_res = "cb_2015_us_county_500k"
If the US census changes their URL, which they certainly will one day, then you will need to change the input file name and the url variable. So, you can call the function above like this:
map_output = get_map_data(map_low_res, "./")
Then you could plot it just as the code in the original question does. Add a color data column first ("county_colors" in the original question), and then set it to the source like this:
source = ColumnDataSource(map_output)
To make this all work you will need to import libraries such as requests, os, itertools, shapefile, bokeh.models.ColumnDataSource, etc...
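The NaN-separator step inside get_map_data can be looked at in isolation. The sketch below uses only the standard library and a made-up shape: points and parts are hypothetical stand-ins for shprec.shape.points and shprec.shape.parts, describing two rings, with the second one starting at index 3.

```python
import itertools
import math

# Hypothetical stand-ins for shape.points and shape.parts.
points = [(0, 0), (1, 0), (1, 1), (5, 5), (6, 5), (6, 6)]
parts = [0, 3]

# Split the (x, y) pairs into one list of x values and one of y values.
xs, ys = map(list, zip(*points))

# Pair each ring's start index with the next ring's start (None = to the end).
bounds = list(zip(parts, parts[1:] + [None]))

# Slice out every ring, append a NaN after it, and flatten into one list.
nan = float("nan")
patch_xs = list(itertools.chain.from_iterable(xs[i:j] + [nan] for i, j in bounds))
patch_ys = list(itertools.chain.from_iterable(ys[i:j] + [nan] for i, j in bounds))

print(patch_xs)
print([math.isnan(v) for v in patch_xs])
```

Each ring ends up followed by a NaN, which is the separator the quoted format relies on to keep one polygon from being joined to the next.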
One solution:
Use the 1:20,000,000 shape file instead of the 1:500,000 file.
It loses some detail around the shape of each county but does not have any extra shapes (and just a couple of extra lines).
