Is there a way to specify what the legend shows in Altair? - altair

I have the following graph in Altair:
The code used to generate it is as follows:
data = pd.read_csv(data_csv)
display(data)
display(set(data['algo_score_raw']))
# First generate base graph
base = alt.Chart(data).mark_circle(opacity=1, stroke='#4c78a8').encode(
x=alt.X('Paragraph:N', axis=None),
y=alt.Y('Section:N', sort=list(OrderedDict.fromkeys(data['Section']))),
size=alt.Size('algo_score_raw:Q', title="Number of Matches"),
).properties(
width=900,
height=500
)
# Next generate the overlying graph with the lines
lines = alt.Chart(data).mark_rule(stroke='#4c78a8').encode(
x=alt.X('Paragraph:N', axis=alt.Axis(labelAngle=0)),
y=alt.Y('Section:N', sort=list(OrderedDict.fromkeys(data['Section'])))
).properties(
width=900,
height=500
)
if max(data['algo_score_raw']) == 0:
return lines # no circles if no matches
else:
return base + lines
However, I don't want the decimal values in my legend; I only want 1.0, 2.0, and 3.0, because those are the only values actually present in my data. Altair seems to default to what you see above.

The legend is generated based on how you specify your encoding. It sounds like your data are better represented as ordered categories than as a continuous quantitative scale. You can specify this by changing the encoding type to ordinal:
size=alt.Size('algo_score_raw:O')
You can read more about encoding types at https://altair-viz.github.io/user_guide/encoding.html

You can use alt.Legend(tickCount=2) (labelExpr could also be helpful; see the docs for more):
import altair as alt
from vega_datasets import data
source = data.cars()
source['Acceleration'] = source['Acceleration'] / 10
chart = alt.Chart(source).mark_circle(size=60).encode(
x='Horsepower',
y='Miles_per_Gallon',
size='Acceleration',
)
chart
chart.encode(size=alt.Size('Acceleration', legend=alt.Legend(tickCount=2)))

Related

How to group-by twice, preserve original columns, and plot

I have the following data sets (only sample is shown):
I want to find the most impactful exercise per area and then plot it via Seaborn barplot.
I use the following code to do so.
# Create Dataset Using Only Area, Exercise and Impact Level Categories
CA_data = Data[['area', 'exercise', 'impact level']]
# Compute Mean Impact Level per Exercise per Area
mean_il_CA = CA_data.groupby(['area', 'exercise'])['impact level'].mean().reset_index()
mean_il_CA_hello = mean_il_CA.groupby('area')['impact level'].max().reset_index()
# Plot
cx = sns.barplot(x="impact level", y="area", data=mean_il_CA_hello)
plt.title('Most Impactful Exercises Considering Area')
plt.show()
The resulting dataset is:
This means that when I plot, on the y axis only the label relative to the area appears, NOT 'area label' + 'exercise label' like I would like.
How do I reinsert the 'exercise' column into my final dataset?
How do I get both the name of the area and the exercise on the y axis of the plot?
The problem of losing the values of 'exercise' when grouping by the maximum of 'area' can be solved by keeping the MultiIndex (i.e. not using reset_index) and using .transform to create a boolean mask to select the appropriate full rows of mean_il_CA that contain the maximum 'impact_level' values per 'area'. This solution is based on the code provided in this answer by unutbu. The full labels for the bar chart can be created by concatenating the labels of 'area' and 'exercise'.
Here is an example using the titanic dataset from the seaborn package. The variables 'class', 'embark_town', and 'fare' are used in place of 'area', 'exercise', and 'impact_level'. The categorical variables both contain three unique values: 'First', 'Second', 'Third', and 'Cherbourg', 'Queenstown', 'Southampton'.
import pandas as pd # v 1.2.5
import seaborn as sns # v 0.11.1
df = sns.load_dataset('titanic')
data = df[['class', 'embark_town', 'fare']]
data.head()
data_mean = data.groupby(['class', 'embark_town'])['fare'].mean()
data_mean
# Select max values in each class and create concatenated labels
mask_max = data_mean.groupby(level=0).transform(lambda x: x == x.max())
data_mean_max = data_mean[mask_max].reset_index()
data_mean_max['class, embark_town'] = data_mean_max['class'].astype(str) + ', ' \
+ data_mean_max['embark_town']
data_mean_max
# Draw seaborn bar chart
sns.barplot(data=data_mean_max,
x=data_mean_max['fare'],
y=data_mean_max['class, embark_town'])
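A different technique from the boolean mask above is to select the winning rows with idxmax. The sketch below uses an invented DataFrame with the question's column names. Note that idxmax keeps only the first row when several exercises tie for the maximum, whereas the mask keeps all of them:

```python
import pandas as pd

# Invented sample data using the column names from the question.
df = pd.DataFrame({
    "area": ["arms", "arms", "legs", "legs"],
    "exercise": ["curl", "press", "squat", "lunge"],
    "impact level": [2.0, 3.0, 5.0, 4.0],
})

# Mean impact level per (area, exercise) pair.
mean_il = df.groupby(["area", "exercise"])["impact level"].mean().reset_index()

# idxmax returns, for each area, the row label of the largest mean;
# .loc then pulls the full rows, so 'exercise' is preserved.
best = mean_il.loc[mean_il.groupby("area")["impact level"].idxmax()].copy()

# Combined label for the y axis of the bar plot.
best["label"] = best["area"] + ", " + best["exercise"]
print(best)
```

The resulting `best` frame can be passed to sns.barplot with x="impact level" and y="label".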

Hide the grid in a specific Altair plot within a set of vstacked plots

I am trying to create a plot composed of 2 charts stacked vertically: a time series chart showing the data and, below it, a time series chart showing texts that represent events on the time axis. I want the data chart to have a grid, but the mark_text chart below it should show neither an outer line nor a grid. I use the chart.configure_axis(grid=False) command to hide the grid but get the following error: Objects with "config" attribute cannot be used within LayerChart. Consider defining the config attribute in the LayerChart object instead.
I can't figure out where to apply the configure_axis(grid=False) option so that it only applies to the bottom plot. Any help on this would be greatly appreciated, as would any suggestion for implementing the label plot in a different way.
Here is my code:
import altair as alt
import pandas as pd
import locale
from altair_saver import save
from datetime import datetime
file = r'.\lagebericht.csv'  # raw string, so the backslash is not read as an escape
df = pd.read_csv(file, sep=';')
source = df
locale.setlocale(locale.LC_ALL, "de_CH")
min_date = '2020-02-29'
domain_pd = pd.to_datetime([min_date, '2020-12-1']).astype(int) / 10 ** 6
base = alt.Chart(source, title='Neumeldungen BS').encode(
alt.X('test_datum:T', axis=alt.Axis(title="",format="%b %y"), scale = alt.Scale(domain=list(domain_pd) ))
)
bar = base.mark_bar(width = 1).encode(
alt.Y('faelle_bs:Q', axis=alt.Axis(title="Anzahl Fälle"), scale = alt.Scale(domain=(0, 120)))
)
line = base.mark_line(color='blue').encode(
y='faelle_Total:Q')
chart1 = (bar + line).properties(width=600)
events= pd.DataFrame({
'datum': [datetime(2020,7,1), datetime(2020,5,15)],
'const': [1,1],
'label': ['allgememeiner Lockdown', 'Gruppen > 50 verboten'],
})
base = alt.Chart(events).encode(
alt.X('datum:T', axis=alt.Axis(title="", format="%b %y"), scale = alt.Scale(domain=list(domain_pd) ))
)
points = base.mark_rule(color='blue').encode(
y=alt.Y('const:Q', axis=alt.Axis(title="",ticks=False, domain=False, labels=False), scale = alt.Scale(domain=(0, 10)))
)
text = base.mark_text(
align='right',
baseline='bottom',
angle = 20,
dx=0, # Nudges text to right so it doesn't appear on top of the bar
dy=20,
).encode(text='label:O').configure_axis(grid=False)
chart2 = (points + text).properties(width=600, height = 50)
save(chart1 & chart2, r"images\figs.html")
This is what it looks like without the grid=False option:
The configure() method should be thought of as a way to specify a global chart theme; you cannot have different configurations within a single Chart (See https://altair-viz.github.io/user_guide/customization.html#global-config-vs-local-config-vs-encoding for a discussion of this).
The way to do what you want is not via global configuration, but via axis settings. For example, you can pass grid=False to alt.Axis:
points = alt.Chart(events).mark_rule(color='blue').encode(
x=alt.X('datum:T', axis=alt.Axis(title="", format="%b %y"), scale = alt.Scale(domain=list(domain_pd) )),
y=alt.Y('const:Q', axis=alt.Axis(title="",ticks=False, domain=False, labels=False), scale = alt.Scale(domain=(0, 10)))
)
text = alt.Chart(events).mark_text().encode(
x=alt.X('datum:T', axis=alt.Axis(title="", grid=False, format="%b %y"), scale = alt.Scale(domain=list(domain_pd) )),
text='label:O'
)

How do the factors in factor_cmap in Bokeh work?

I am trying to construct a grouped vertical bar chart in Bokeh from a pandas dataframe. I'm struggling with understanding the use of factor_cmap and how the color mapping works with this function. There's an example in the documentation (https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#pandas) that was helpful to follow, here:
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(plot_width=800, plot_height=300, title="Mean MPG by # Cylinders and Manufacturer",
x_range=group, toolbar_location=None, tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
This yields the following (again, a screen shot from the documentation):
Grouped Vbar output
I understand how factor_cmap is working here, I think. The index for the dataframe has multiple factors and we're only taking the first by slicing (as seen with end=1). But when I try to instead set coloring based on the second index level, mfr (setting start=1, end=2), the index mapping breaks and I get this. I based this change on my assumption that the factors were hierarchical and I needed to slice them to get the second level.
I think I must be thinking about the indexing with these categorical factors wrong, but I'm not sure what I'm doing wrong. How do I get a categorical mapper to color by the second level of the factor? I assumed the format of the factors was ('cyl', 'mfr') but maybe that assumption is wrong?
Here's the documentation for factor_cmap, although it wasn't very helpful: https://docs.bokeh.org/en/latest/docs/reference/transform.html#bokeh.transform.factor_cmap .
If you mean you are trying this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.cyl.unique()),
start=1, end=2)
Then there are at least two issues:
The value 2 is out of bounds for the length of the list of sub-factors ('cyl', 'mfr'). You would just want start=1 and leave end with its default value of None (which means to the end of the list, as usual for any Python slice).
In this specific case, start=1 means "colormap based on the mfr sub-factors of the values", but you are still configuring the colormapper with the cylinders as the factors for the map:
factors=sorted(df.cyl.unique())
When the colormapper goes to look up a value with mfr="mazda" in the mapping, it does not find anything (because you only put cylinder values in the mapping) so it gets shaded the default color grey (as expected).
So you could do something like this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.mfr.unique()),
start=1)
Which "works" modulo the fact that there are way more manufacturer values than there are colors in the Spectral5 palette:
In the real situation you'll need to make sure you use a palette at least as big as the number of (sub-)factors that you configure.
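The lookup described above can be modelled in a few lines of plain Python. This is a simplified illustration of the slicing idea, not Bokeh's actual implementation; `color_for` and the colour names are hypothetical:

```python
def color_for(value, factors, palette, start=0, end=None, nan_color="gray"):
    """Toy model: slice a multi-level factor, then look the result up in `factors`."""
    key = value[start:end]          # e.g. ("4", "mazda")[1:None] -> ("mazda",)
    if len(key) == 1:
        key = key[0]                # a single sub-factor is compared as a plain string
    if key in factors:
        idx = factors.index(key)
        if idx < len(palette):      # more factors than colours: fall back
            return palette[idx]
    return nan_color                # nothing matched, so the default colour is used


value = ("4", "mazda")
palette = ["c0", "c1", "c2", "c3", "c4"]   # placeholder colour names
cyl_factors = ["3", "4", "5", "6", "8"]
mfr_factors = ["ford", "mazda"]

by_cyl = color_for(value, cyl_factors, palette, end=1)          # first level
mismatch = color_for(value, cyl_factors, palette, start=1)      # mfr slice, cyl factors
by_mfr = color_for(value, mfr_factors, palette, start=1)        # mfr slice, mfr factors
print(by_cyl, mismatch, by_mfr)
```

The middle call reproduces the situation in the question: the slice selects the manufacturer, but the mapping only knows cylinder values, so the fallback colour is returned.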

Display figures and names with tooltips and mark_text with Altair

Here are three issues I have with tooltips and labels that I want to display on my Altair graph. All the issues are more or less linked.
First, I would like to modify the name of the information I display with the tooltip:
Year instead of properties.annee
Region instead of properties.region
Bioenergy instead of properties.bioenerie...
Second, I would like to round the values displayed in the tooltip.
"11.2" instead of "11.1687087653"
The code I wrote does what I want for the labels I put in the regions but it is not working for the tooltip.
Third, I would like to display the unit in the labels and in the tooltip but I don't find the correct syntax in the documentation.
Below is my code.
Thanks in advance for your answers.
Bertrand
Current result of my code
def gen_map(data: gpd.geodataframe.GeoDataFrame, title: str, abs_values: bool):
data_json = json.loads(data.to_json())
choro_data = alt.Data(values=data_json['features'])
# Absolute values or relative values
if abs_values:
column = data.columns[0]
units = 'MW'
form = '.0f'
else:
column = data.columns[1]
units = '%'
form = '.1f'
# Base layer
layer = alt.Chart(choro_data, title=title).mark_geoshape(
stroke='white',
strokeWidth=1
).encode(
alt.Color(f'properties.{column}:Q',
type='quantitative',
title = f'Installed Capacity in {units}'),
tooltip=[f'properties.annee:Q',
f'properties.region:O',
f'properties.{column}:Q',
alt.Text(f'properties.{column}:Q', format=form)]
).transform_lookup(
lookup='region',
from_=alt.LookupData(choro_data, 'region')
).properties(
width=600,
height=500
)
# Label layer
labels = alt.Chart(choro_data).mark_text(baseline='top'
).properties(
width=600,
height=500
).encode(
longitude='properties.centroid_lon:Q',
latitude='properties.centroid_lat:Q',
text=alt.Text(f'properties.{column}:Q', format=form),
size=alt.value(14),
opacity=alt.value(1)
)
return layer + labels
gen_map(bioenergies_2019, 'Bioenergy in France in 2019', False)
Instead of a list of strings, use a list of alt.Tooltip objects:
tooltip=[alt.Tooltip('properties.annee:Q', title='Annee'),
alt.Tooltip('properties.region:O', title='Region'),
alt.Tooltip(f'properties.{column}:Q', title=f'{column}')]
You can additionally pass the format argument to specify the format of the value; for number formats, use d3-format codes; for date/time formats use d3-date-format codes.

How to change the limits for geo_shape in altair (python vega-lite)

I am trying to plot locations in three states in the US in Python with Altair. I saw the tutorial about the US map, but I am wondering if there is any way to zoom the image to only the three states of interest, i.e. NY, NJ and CT.
Currently, I have the following code:
from vega_datasets import data
states = alt.topo_feature(data.us_10m.url, 'states')
# US states background
background = alt.Chart(states).mark_geoshape(
fill='lightgray',
stroke='white',
limit=1000
).properties(
title='US State Capitols',
width=700,
height=400
).project("albers")
points=alt.Chart(accts).mark_point().encode(
longitude = "longitude",
latitude = "latitude",
color = "Group")
background+points
I inspected the us_10m.url data set and seems like there is no field which specifies the individual states. So I am hoping if I could just somehow change the xlim and ylim for the background to [-80,-70] and [35,45] for example. I want to zoom in to the regions where there are data points(blue dots).
Could someone kindly show me how to do that? Thanks!!
Update
There is a field called ID in the JSON file and I manually found out that NJ is 34, NY is 36 and CT is 9. Is there a way to filter on these IDs? That will get the job done!
Alright, it seems the selection/zoom/xlim/ylim feature for geoshapes is not supported yet:
Document and add warning that geo-position doesn't support selection yet #3305
So I ended up with a hackish way to solve this problem: first filter on the IDs in pure Python. Basically, load the JSON file into a dictionary and then filter its list of state geometries before wrapping the dictionary as TopoJSON data. Below is an example for six states: PA, NJ, NY, CT, RI and MA.
import altair as alt
from vega_datasets import data
# Load the data, which is loaded as a dict object
us_10m = data.us_10m()
# Select the geometries under states under objects, filter on id (9,25,34,36,42,44)
us_10m['objects']['states']['geometries']=[item for item in us_10m['objects'] \
['states']['geometries'] if item['id'] in [9,25,34,36,42,44]]
# Make the topojson data
states = alt.Data(
values=us_10m,
format=alt.TopoDataFormat(feature='states',type='topojson'))
# Plot background (now only has 6 states)
background = alt.Chart(states).mark_geoshape(
fill='lightgray',
stroke='white',
limit=1000
).properties(
title='US State Capitols',
width=700,
height=400
).project("mercator")
# Plot the points
points=alt.Chart(accts).mark_circle(size=60).encode(
longitude = "longitude",
latitude = "latitude",
color = "Group").project("mercator")
# Overlay the two plots
background+points
The resulting plot looks ok:
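The filtering step can be wrapped in a small helper so the original dictionary is left untouched. This is a sketch on a tiny mock topology (real TopoJSON files carry arcs and transforms that are omitted here), and `keep_geometries` is a hypothetical name:

```python
import copy


def keep_geometries(topology, object_name, keep_ids):
    """Return a copy of `topology` whose `object_name` layer keeps only the given ids."""
    filtered = copy.deepcopy(topology)      # do not mutate the caller's dictionary
    layer = filtered["objects"][object_name]
    layer["geometries"] = [
        geom for geom in layer["geometries"] if geom.get("id") in keep_ids
    ]
    return filtered


# Mock topology with four "states"; only the ids matter for this sketch.
mock = {
    "type": "Topology",
    "objects": {
        "states": {
            "type": "GeometryCollection",
            "geometries": [{"id": 9}, {"id": 12}, {"id": 34}, {"id": 36}],
        }
    },
    "arcs": [],
}

tri_state = keep_geometries(mock, "states", {9, 34, 36})   # CT, NJ, NY
kept_ids = [geom["id"] for geom in tri_state["objects"]["states"]["geometries"]]
print(kept_ids)
```

The returned dictionary can then be passed to alt.Data(values=..., format=alt.TopoDataFormat(feature='states', type='topojson')) exactly as in the code above.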
