How to format the last line segment based on a datetime column? - altair

I'm trying to make the final segment of a line plot dashed to indicate incomplete data. From what I can tell I should be able to do this using a condition on strokeDash. However I can't figure out how to get the condition predicate to work using a datetime field.
alt.Chart(rates)
.mark_line(point=True)
.encode(
x=alt.X("start_date:T", scale=alt.Scale(nice="week")),
y="install_rate",
strokeDash=alt.condition(
f"datum.start_date > toDate({start_dates[-2].isoformat()})",
alt.value([5, 5]), # dashed line: 5 pixels dash + 5 pixels space
alt.value([0]), # solid line
)
)
This gives me an error:
Error: Illegal callee type: MemberExpression

You can fix the error you are encountering by making sure that pandas reads in the dates as a temporal data type:
import pandas as pd
import altair as alt
rates = pd.DataFrame({
'start_date': pd.to_datetime(['2022-05-06', '2022-05-13', '2022-05-19', '2022-05-25']),
'install_rate': [0.05, 0.06, 0.08, 0.09],
})
alt.Chart(rates).mark_line(point=True).encode(
x=alt.X("start_date:T"),
y="install_rate",
color=alt.condition(
f"datum.start_date > toDate('2022-05-19')",
alt.value('blue'),
alt.value('red')
)
)
However, as you can see, the line is not amenable to modification via a condition. I think this is because it is considered a single continuous mark, whereas the points are separate marks that can be changed individually.
You could group the line by creating a new separate field and grouping by it, which creates two separate lines.
rates['above_threshold'] = rates['start_date'] > '2022-05-13'
alt.Chart(rates).mark_line(point=True).encode(
x=alt.X("start_date:T"),
y="install_rate",
color='above_threshold')
However, that causes issues with the gap as you can see above. I think for your case the easiest might be to layer two charts with filter transforms:
base = alt.Chart(rates).encode(
x=alt.X("start_date:T"),
y="install_rate",
)
base.mark_line(strokeDash=[5, 5]).transform_filter(
f"datum.start_date > toDate('2022-05-19')"
) + base.mark_line().transform_filter(
f"datum.start_date < toDate('2022-05-20')"
)

Related

seasonal_decompose: how to use it, with a practical implementation

How do you use seasonal_decompose? How do you deal with the various errors it raises? How can it be used in practice?
Get all imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
from statsmodels.tsa.seasonal import seasonal_decompose
Prepare test data
data = {'Unix Timestamp': ['1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12'],
'Date': ['4/20/2021 0:02','4/20/2021 0:01','4/20/2021 0:00','4/19/2021 23:59','4/19/2021 23:58','4/19/2021 23:57','4/19/2021 23:56','4/19/2021 23:55','4/19/2021 23:54','4/19/2021 23:53','4/19/2021 23:52','4/19/2021 23:51','4/19/2021 23:50','4/19/2021 23:49','4/19/2021 23:48','4/19/2021 23:47','4/19/2021 23:46','4/20/2021 0:02','4/20/2021 0:01','4/20/2021 0:00','4/19/2021 23:59','4/19/2021 23:58','4/19/2021 23:57','4/19/2021 23:56','4/19/2021 23:55','4/19/2021 23:54','4/19/2021 23:53','4/19/2021 23:52','4/19/2021 23:51','4/19/2021 23:50','4/19/2021 23:49','4/19/2021 23:48','4/19/2021 23:47','4/19/2021 23:46'],
'Symbol': ['BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD'],
'Open': [55717.47,55768.94,55691.79,55777.86,55803.5,55690.64,55624.69,55651.82,55688.08,55749.28,55704.59,55779.38,55816.61,55843.69,55880.12,55890.88,0,55717.47,55768.94,55691.79,55777.86,55803.5,55690.64,55624.69,55651.82,55688.08,55749.28,55704.59,55779.38,55816.61,55843.69,55880.12,55890.88,0],
'High': [55723,55849.82,55793.15,55777.86,55823.88,55822.91,55713.02,55675.92,55730.21,55749.28,55759.27,55779.38,55835.57,55863.89,55916.47,55918.87,0,55723,55849.82,55793.15,55777.86,55823.88,55822.91,55713.02,55675.92,55730.21,55749.28,55759.27,55779.38,55835.57,55863.89,55916.47,55918.87,0],
'Low': [55541.69,55711.74,55691.79,55677.92,55773.08,55682.56,55624.63,55621.58,55641.46,55688.08,55695.42,55688.66,55769.46,55797.08,55815.99,55826.84,0,55541.69,55711.74,55691.79,55677.92,55773.08,55682.56,55624.63,55621.58,55641.46,55688.08,55695.42,55688.66,55769.46,55797.08,55815.99,55826.84,0]}
df=pd.DataFrame(data)
Perform decomposition
df_seasonal = seasonal_decompose(df)
We get our first error
ValueError: could not convert string to float:
Let's fix the above error by running the code below:
df['Date'] = df['Date'].apply(
lambda x : datetime.datetime.strptime(str(x),'%m/%d/%Y %H:%M')
)
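The same conversion can be done without a per-row lambda. This is a small sketch using `pd.to_datetime` with an explicit format; the two sample strings are taken from the Date column above, and the names `dates` and `parsed` are illustrative.

```python
import pandas as pd

# Two values in the same month/day/year hour:minute layout as the Date column.
dates = pd.Series(["4/20/2021 0:02", "4/19/2021 23:59"])

# Parse the whole column in one call; format matches the strptime pattern above.
parsed = pd.to_datetime(dates, format="%m/%d/%Y %H:%M")
```

Applied to the frame above, this would read `df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %H:%M')`.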
Now if you run seasonal_decompose again, you will get a new error:
df_seasonal = seasonal_decompose(df)
Now the new error will be
TypeError: float() argument must be a string or a number, not 'Timestamp'
To fix this error, pass one column at a time, and make sure the column passed is numeric. Try the decomposition using the code below:
df_seasonal = seasonal_decompose(df['Open'])
Now you get a new error, as shown below
ValueError: You must specify a period or x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None
There are two solutions to this error.
First solution: use the period parameter of seasonal_decompose
df_seasonal = seasonal_decompose(df['Open'], period=1) ## period=1 avoids the error, but it is not necessarily the right value
In the code above the data has one row per minute, so period was set to 1. However, this is not necessarily correct: period is the length of the seasonal cycle in the input data, counted in observations. The full list of freq abbreviations (offset aliases) is in the pandas time series documentation.
Second solution: create a datetime index for the data, along with a frequency
df = df.set_index(df.Date).asfreq('2Min') ## M is for months and S for seconds. The data is already at 1Min frequency, so we could not resample it to 1Min and used 2Min instead
df_seasonal = seasonal_decompose(df['Open']) ## no period or freq argument is passed here
In seasonal_decompose we also have to choose the model (the default is additive). The model can be either additive or multiplicative. A rule of thumb for selecting the right one is to check in a plot whether the trend and seasonal variation are relatively constant over time, in other words, linear. If so, select the additive model. If the trend and seasonal variation increase or decrease over time, use the multiplicative model. This means that before running seasonal_decompose we should plot the preprocessed data over time and look for trends or cycles.
Finally we could run it without error.
Another error that we might see is TypeError: Index(...) must be called with a collection of some kind, 'seasonal' was passed. This again comes from incorrect usage of seasonal_decompose, for example:
df_bt_decomp = seasonal_decompose(df_bt[['Open','High']],period=1) ## wrong: two columns are passed together, and both are value columns rather than an index

Is there a way to specify what the legend shows in Altair?

I have the following graph in Altair:
The code used to generate it is as follows:
data = pd.read_csv(data_csv)
display(data)
display(set(data['algo_score_raw']))
# First generate base graph
base = alt.Chart(data).mark_circle(opacity=1, stroke='#4c78a8').encode(
x=alt.X('Paragraph:N', axis=None),
y=alt.Y('Section:N', sort=list(OrderedDict.fromkeys(data['Section']))),
size=alt.Size('algo_score_raw:Q', title="Number of Matches"),
).properties(
width=900,
height=500
)
# Next generate the overlying graph with the lines
lines = alt.Chart(data).mark_rule(stroke='#4c78a8').encode(
x=alt.X('Paragraph:N', axis=alt.Axis(labelAngle=0)),
y=alt.Y('Section:N', sort=list(OrderedDict.fromkeys(data['Section'])))
).properties(
width=900,
height=500
)
if max(data['algo_score_raw']) == 0:
return lines # no circles if no matches
else:
return base + lines
However, I don't want the decimal values in my legend; I only want 1.0, 2.0, and 3.0, because those are the only values that are actually present in my data. However, Altair seems to default to what you see above.
The legend is generated based on how you specify your encoding. It sounds like your data are better represented as ordered categories than as a continuous quantitative scale. You can specify this by changing the encoding type to ordinal:
size=alt.Size('algo_score_raw:O')
You can read more about encoding types at https://altair-viz.github.io/user_guide/encoding.html
Alternatively, you can use alt.Legend(tickCount=2) (labelExpr could also be helpful; see the docs for more):
import altair as alt
from vega_datasets import data
source = data.cars()
source['Acceleration'] = source['Acceleration'] / 10
chart = alt.Chart(source).mark_circle(size=60).encode(
x='Horsepower',
y='Miles_per_Gallon',
size='Acceleration',
)
chart
chart.encode(size=alt.Size('Acceleration', legend=alt.Legend(tickCount=2)))

How do I set the domain of an axis to a value that isn't a multiple of five in Altair?

I'm trying to set the x-axis domain to 0-36, as some data I'm processing was collected in 6-week increments. Following the documentation, I used scale=alt.Scale(domain=[0,36]). However, the chart still extends to 40.
df = pd.DataFrame({'x':[0,6,12,18,24,30,36],'y':[0,3,1,4,2,5,3]})
alt.Chart(df).mark_line(point=True).encode(
x=alt.X('x:Q',
axis=alt.Axis(values=[0,6,12,18,24,30,36]),
scale=alt.Scale(domain=[0,36])),
y=alt.Y('y:Q'),
)
Output of code above
Changing the above code to cut off between 30 and 35 i.e., scale=alt.Scale(domain=[0,31]) generates this behavior, where the chart axis gets truncated at 30 (but shows the data after 30, appropriately since the data hasn't been clipped).
But why can't I cut off the graph at values that aren't multiples of 5?
I'm using Altair v4.0.1
The Vega-Lite renderer defaults to choosing "nice" values for the scale. If you want to disable this behavior, you can pass nice=False:
import pandas as pd
import altair as alt
df = pd.DataFrame({'x':[0,6,12,18,24,30,36],'y':[0,3,1,4,2,5,3]})
alt.Chart(df).mark_line(point=True).encode(
x=alt.X('x:Q',
axis=alt.Axis(values=[0,6,12,18,24,30,36]),
scale=alt.Scale(domain=[0,36], nice=False)),
y=alt.Y('y:Q'),
)

How do the factors in factor_cmap in Bokeh work?

I am trying to construct a grouped vertical bar chart in Bokeh from a pandas dataframe. I'm struggling with understanding the use of factor_cmap and how the color mapping works with this function. There's an example in the documentation (https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#pandas) that was helpful to follow, here:
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(plot_width=800, plot_height=300, title="Mean MPG by # Cylinders and Manufacturer",
x_range=group, toolbar_location=None, tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
This yields the following (again, a screen shot from the documentation):
Grouped Vbar output
I understand how factor_cmap is working here, I think. The index for the dataframe has multiple factors and we're only taking the first by slicing (as seen with the end = 1). But when I try to instead set coloring based on the second index level, mfr, (setting start = 1 , end = 2) , the index mapping breaks and I get this. I based this change on my assumption that the factors were hierarchical and I needed to slice them to get the second level.
I think I must be thinking about the indexing with these categorical factors wrong, but I'm not sure what I'm doing wrong. How do I get a categorical mapper to color by the second level of the factor? I assumed the format of the factors was ('cyl', 'mfr') but maybe that assumption is wrong?
Here's the documentation for factor_cmap, although it wasn't very helpful: https://docs.bokeh.org/en/latest/docs/reference/transform.html#bokeh.transform.factor_cmap .
If you mean you are trying this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.cyl.unique()),
start=1, end=2)
Then there are at least two issues:
First, end=2 is out of bounds for the length of the list of sub-factors ('cyl', 'mfr'). You would just want start=1 and leave end with its default value of None (which means to the end of the list, as usual for any Python slice).
Second, in this specific case start=1 means "colormap based on the mfr sub-factors of the values", but you are still configuring the colormapper with the cylinders as the factors for the map:
factors=sorted(df.cyl.unique())
When the colormapper goes to look up a value with mfr="mazda" in the mapping, it does not find anything (because you only put cylinder values in the mapping) so it gets shaded the default color grey (as expected).
So you could do something like this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.mfr.unique()),
start=1)
Which "works", modulo the fact that there are way more manufacturer values than there are colors in the Spectral5 palette.
In the real situation you'll need to make sure you use a palette at least as big as the number of (sub-)factors that you configure.
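One way to size the palette to the factors is to generate it from the number of factors. This is a sketch only: the manufacturer list is a hypothetical subset standing in for sorted(df.mfr.unique()), and viridis is just one of the palette functions in bokeh.palettes that could be used.

```python
from bokeh.palettes import viridis
from bokeh.transform import factor_cmap

# Hypothetical manufacturer names; in the question this would be
# sorted(df.mfr.unique()).
mfr_factors = sorted(
    ["amc", "audi", "bmw", "buick", "chevrolet", "datsun", "ford", "mazda"]
)

# Generate exactly one color per factor so no lookup falls back to grey.
palette = viridis(len(mfr_factors))

# start=1 colors by the second level (mfr) of the ('cyl', 'mfr') factors.
index_cmap = factor_cmap("cyl_mfr", palette=palette, factors=mfr_factors, start=1)
```

The resulting index_cmap can be passed as fill_color to p.vbar in the same way as in the snippets above.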

How can I put two bars of distinct series next to each other in the same chart?

Let's assume our data frame has two series of type integer: estimated_value and sell_price.
I want to have two bars next to each other in the same bar chart.
The left one shows average(estimated_value) and the right one shows average(sell_price).
They shall share the same axis.
I thought this would be a very common use case but I could not find any example in the docs. All the examples use 'colour' or 'column' to group bars.
I've tried using y2 but it seems to simply erase the difference to y1 instead of adding a second series.
Then I tried using a layeredChart but this puts both bars on top of each other instead of next to each other.
It sounds like you have wide-form data rather than long-form data. The difference is discussed in Long-form vs. Wide-form data.
Once you've transformed your data to long-form, you can use standard encodings to achieve this result. Here's how it might look, using some example data:
import altair as alt
import pandas as pd
data = pd.DataFrame({
'estimated_value': [500, 600, 700, 800, 900],
'sell_price': [550, 610, 690, 810, 950]
})
alt.Chart(data).transform_fold(
['estimated_value', 'sell_price'], as_=['category', 'price']
).mark_bar().encode(
y='category:N',
x='average(price):Q',
)
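If you would rather reshape the data in pandas than with transform_fold, the long-form conversion can be sketched with melt. The example reuses the sample data from the answer; the names long_form and averages are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "estimated_value": [500, 600, 700, 800, 900],
    "sell_price": [550, 610, 690, 810, 950],
})

# Wide to long: one row per (category, price) pair.
long_form = data.melt(var_name="category", value_name="price")

# The same averages the chart computes with average(price).
averages = long_form.groupby("category")["price"].mean()
```

The long_form frame can then be passed to alt.Chart with the same y='category:N' and x='average(price):Q' encodings, without the transform_fold step.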
