How to use seasonal_decompose in practice, and how to deal with the various errors it can raise along the way.
Get all imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
from statsmodels.tsa.seasonal import seasonal_decompose
Prepare test data
data = {'Unix Timestamp': ['1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12','1.61888E+12'],
'Date': ['4/20/2021 0:02','4/20/2021 0:01','4/20/2021 0:00','4/19/2021 23:59','4/19/2021 23:58','4/19/2021 23:57','4/19/2021 23:56','4/19/2021 23:55','4/19/2021 23:54','4/19/2021 23:53','4/19/2021 23:52','4/19/2021 23:51','4/19/2021 23:50','4/19/2021 23:49','4/19/2021 23:48','4/19/2021 23:47','4/19/2021 23:46','4/20/2021 0:02','4/20/2021 0:01','4/20/2021 0:00','4/19/2021 23:59','4/19/2021 23:58','4/19/2021 23:57','4/19/2021 23:56','4/19/2021 23:55','4/19/2021 23:54','4/19/2021 23:53','4/19/2021 23:52','4/19/2021 23:51','4/19/2021 23:50','4/19/2021 23:49','4/19/2021 23:48','4/19/2021 23:47','4/19/2021 23:46'],
'Symbol': ['BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD','BTCUSD'],
'Open': [55717.47,55768.94,55691.79,55777.86,55803.5,55690.64,55624.69,55651.82,55688.08,55749.28,55704.59,55779.38,55816.61,55843.69,55880.12,55890.88,0,55717.47,55768.94,55691.79,55777.86,55803.5,55690.64,55624.69,55651.82,55688.08,55749.28,55704.59,55779.38,55816.61,55843.69,55880.12,55890.88,0],
'High': [55723,55849.82,55793.15,55777.86,55823.88,55822.91,55713.02,55675.92,55730.21,55749.28,55759.27,55779.38,55835.57,55863.89,55916.47,55918.87,0,55723,55849.82,55793.15,55777.86,55823.88,55822.91,55713.02,55675.92,55730.21,55749.28,55759.27,55779.38,55835.57,55863.89,55916.47,55918.87,0],
'Low': [55541.69,55711.74,55691.79,55677.92,55773.08,55682.56,55624.63,55621.58,55641.46,55688.08,55695.42,55688.66,55769.46,55797.08,55815.99,55826.84,0,55541.69,55711.74,55691.79,55677.92,55773.08,55682.56,55624.63,55621.58,55641.46,55688.08,55695.42,55688.66,55769.46,55797.08,55815.99,55826.84,0]}
df=pd.DataFrame(data)
Perform decomposition
df_seasonal = seasonal_decompose(df)
This gives us our first error:
ValueError: could not convert string to float:
Let's fix this error. Start by converting the Date column from strings to datetime objects:
df['Date'] = df['Date'].apply(
lambda x : datetime.datetime.strptime(str(x),'%m/%d/%Y %H:%M')
)
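The same conversion can also be done on the whole column at once. This is a minimal sketch, assuming every Date string follows the month/day/year hour:minute layout of the test data; the three sample rows are copied from it and the name `sample` is only illustrative:

```python
import pandas as pd

# Three sample rows in the same layout as the test data above.
sample = pd.DataFrame({'Date': ['4/20/2021 0:02', '4/20/2021 0:01', '4/19/2021 23:59']})

# pd.to_datetime parses the whole column in one call; the explicit format
# removes any ambiguity between month-first and day-first dates.
sample['Date'] = pd.to_datetime(sample['Date'], format='%m/%d/%Y %H:%M')

print(sample['Date'].dtype)
```

Either approach leaves a datetime column; the vectorized call is usually the shorter one to write.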
If you run seasonal_decompose again, you get a new error:
df_seasonal = seasonal_decompose(df)
The new error is:
TypeError: float() argument must be a string or a number, not 'Timestamp'
To fix this error, pass one column at a time, and make sure that column is numeric. Try the decomposition with the code below:
df_seasonal = seasonal_decompose(df['Open'])
This raises yet another error:
ValueError: You must specify a period or x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None
There are two solutions to this error.
First solution: use the period parameter of seasonal_decompose
df_seasonal = seasonal_decompose(df['Open'], period=1)  ## the data has one row per minute, hence period=1, but this need not be the right choice
The code above uses period=1 because the data has one observation per minute. This need not be correct: period is the length of one seasonal cycle in the input data, counted in observations. To learn more about how to decide on the period, read this page. For the complete list of freq abbreviations, click here.
Second solution: create a datetime index for the data, with a frequency
df = df.set_index(df.Date).asfreq('2Min')  ## 'M' stands for months, 'S' for seconds. The data is already at a 1Min frequency, so it cannot be resampled to 1Min; hence 2Min is used here
df_seasonal = seasonal_decompose(df['Open'])  ## no period or freq argument is passed here
seasonal_decompose also takes a model argument, which is additive by default. It can be set to either additive or multiplicative. A rule of thumb for selecting the right model is to check in a plot whether the trend and seasonal variation are relatively constant over time, in other words, linear. If they are, select the additive model. If the trend and seasonal variation increase or decrease over time, use the multiplicative model. This means that before calling seasonal_decompose you should plot the preprocessed data over time and look for trends or cycles.
Finally we could run it without error.
Another error you might see is TypeError: Index(...) must be called with a collection of some kind, 'seasonal' was passed. This again comes from incorrect usage of seasonal_decompose, for example:
df_bt_decomp = seasonal_decompose(df_bt[['Open','High']], period=1)  ## wrong: two metric columns are passed together, and neither is an index; pass one column at a time
I have the following graph in Altair:
The code used to generate it is as follows:
data = pd.read_csv(data_csv)
display(data)
display(set(data['algo_score_raw']))
# First generate base graph
base = alt.Chart(data).mark_circle(opacity=1, stroke='#4c78a8').encode(
x=alt.X('Paragraph:N', axis=None),
y=alt.Y('Section:N', sort=list(OrderedDict.fromkeys(data['Section']))),
size=alt.Size('algo_score_raw:Q', title="Number of Matches"),
).properties(
width=900,
height=500
)
# Next generate the overlying graph with the lines
lines = alt.Chart(data).mark_rule(stroke='#4c78a8').encode(
x=alt.X('Paragraph:N', axis=alt.Axis(labelAngle=0)),
y=alt.Y('Section:N', sort=list(OrderedDict.fromkeys(data['Section'])))
).properties(
width=900,
height=500
)
if max(data['algo_score_raw']) == 0:
    return lines  # no circles if no matches
else:
    return base + lines
However, I don't want the decimal values in my legend; I only want 1.0, 2.0, and 3.0, because those are the only values actually present in my data. Altair seems to default to what you see above.
The legend is generated based on how you specify your encoding. It sounds like your data are better represented as ordered categories than as a continuous quantitative scale. You can specify this by changing the encoding type to ordinal:
size=alt.Size('algo_score_raw:O')
You can read more about encoding types at https://altair-viz.github.io/user_guide/encoding.html
You can use alt.Legend(tickCount=2) (labelExpr could also be helpful; see the docs for more):
import altair as alt
from vega_datasets import data
source = data.cars()
source['Acceleration'] = source['Acceleration'] / 10
chart = alt.Chart(source).mark_circle(size=60).encode(
x='Horsepower',
y='Miles_per_Gallon',
size='Acceleration',
)
chart
chart.encode(size=alt.Size('Acceleration', legend=alt.Legend(tickCount=2)))
I'm trying to set the x-axis domain to 0-36, as some data I'm processing was collected in 6-week increments. Following the documentation, I used scale=alt.Scale(domain=[0,36]). However, the chart still extends to 40.
df = pd.DataFrame({'x':[0,6,12,18,24,30,36],'y':[0,3,1,4,2,5,3]})
alt.Chart(df).mark_line(point=True).encode(
x=alt.X('x:Q',
axis=alt.Axis(values=[0,6,12,18,24,30,36]),
scale=alt.Scale(domain=[0,36])),
y=alt.Y('y:Q'),
)
[Image: output of the code above]
Changing the above code to cut off between 30 and 35 i.e., scale=alt.Scale(domain=[0,31]) generates this behavior, where the chart axis gets truncated at 30 (but shows the data after 30, appropriately since the data hasn't been clipped).
But why can't I cut off the graph at values that aren't multiples of 5?
I'm using Altair v4.0.1
The Vega-Lite renderer defaults to choosing "nice" values for the scale. If you want to disable this behavior, you can pass nice=False:
import pandas as pd
import altair as alt
df = pd.DataFrame({'x':[0,6,12,18,24,30,36],'y':[0,3,1,4,2,5,3]})
alt.Chart(df).mark_line(point=True).encode(
x=alt.X('x:Q',
axis=alt.Axis(values=[0,6,12,18,24,30,36]),
scale=alt.Scale(domain=[0,36], nice=False)),
y=alt.Y('y:Q'),
)
I am trying to construct a grouped vertical bar chart in Bokeh from a pandas dataframe. I'm struggling with understanding the use of factor_cmap and how the color mapping works with this function. There's an example in the documentation (https://docs.bokeh.org/en/latest/docs/user_guide/categorical.html#pandas) that was helpful to follow, here:
from bokeh.io import output_file, show
from bokeh.palettes import Spectral5
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap
output_file("bar_pandas_groupby_nested.html")
df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)
group = df.groupby(by=['cyl', 'mfr'])
index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)
p = figure(plot_width=800, plot_height=300, title="Mean MPG by # Cylinders and Manufacturer",
x_range=group, toolbar_location=None, tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])
p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
line_color="white", fill_color=index_cmap, )
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None
show(p)
This yields the following (again, a screen shot from the documentation):
[Image: grouped vbar output]
I think I understand how factor_cmap works here. The index of the dataframe has multiple factors, and we take only the first by slicing (as seen with end=1). But when I instead try to color by the second index level, mfr (setting start=1, end=2), the index mapping breaks and I get this. I based this change on my assumption that the factors were hierarchical and that I needed to slice them to get the second level.
I think I must be thinking about the indexing with these categorical factors wrong, but I'm not sure what I'm doing wrong. How do I get a categorical mapper to color by the second level of the factor? I assumed the format of the factors was ('cyl', 'mfr') but maybe that assumption is wrong?
Here's the documentation for factor_cmap, although it wasn't very helpful: https://docs.bokeh.org/en/latest/docs/reference/transform.html#bokeh.transform.factor_cmap .
If you mean you are trying this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.cyl.unique()),
start=1, end=2)
Then there are at least two issues:
1. The value 2 is out of bounds for the length of the list of sub-factors ('cyl', 'mfr'). You would just want start=1 and leave end with its default value of None (which means to the end of the list, as usual for any Python slice).
2. In this specific case, start=1 means "colormap based on the mfr sub-factors of the values", but you are still configuring the colormapper with the cylinders as the factors for the map:
factors=sorted(df.cyl.unique())
When the colormapper goes to look up a value with mfr="mazda" in the mapping, it does not find anything (because you only put cylinder values in the mapping) so it gets shaded the default color grey (as expected).
So you could do something like this:
index_cmap = factor_cmap('cyl_mfr',
palette=Spectral5,
factors=sorted(df.mfr.unique()),
start=1)
Which "works" modulo the fact that there are way more manufacturer values than there are colors in the Spectral5 palette:
In a real situation you'll need to make sure you use a palette at least as big as the number of (sub-)factors that you configure.
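One way to keep the palette in step with the factors is to generate it from their count. This is a minimal sketch; the manufacturer list is a hypothetical stand-in for sorted(df.mfr.unique()), and viridis is just one of the palette-generating functions in bokeh.palettes:

```python
from bokeh.palettes import viridis
from bokeh.transform import factor_cmap

# Hypothetical manufacturer names standing in for sorted(df.mfr.unique()).
mfr_factors = sorted(['amc', 'audi', 'bmw', 'buick', 'chevrolet', 'datsun',
                      'dodge', 'fiat', 'ford', 'honda', 'mazda', 'toyota'])

# viridis(n) returns a sequence of n colors, so the palette always has
# as many entries as there are factors.
palette = viridis(len(mfr_factors))

# start=1 selects the second level (mfr) of the ('cyl', 'mfr') factors.
index_cmap = factor_cmap('cyl_mfr', palette=palette,
                         factors=mfr_factors, start=1)
print(len(palette), len(mfr_factors))
```

The resulting index_cmap can be passed as fill_color to p.vbar in the same way as in the example above.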
Let's assume our data frame has two series of type integer: estimated_value and sell_price.
I want to have two bars next to each other in the same bar chart.
The left one shows average(estimated_value) and the right one shows average(sell_price).
They shall share the same axis.
I thought this would be a very common use case but I could not find any example in the docs. All the examples use 'colour' or 'column' to group bars.
I've tried using y2 but it seems to simply erase the difference to y1 instead of adding a second series.
Then I tried using a layeredChart but this puts both bars on top of each other instead of next to each other.
It sounds like you have wide-form data rather than long-form data. The difference is discussed in Long-form vs. Wide-form data.
Once you've transformed your data to long-form, you can use standard encodings to achieve this result. Here's how it might look, using some example data:
import altair as alt
import pandas as pd
data = pd.DataFrame({
'estimated_value': [500, 600, 700, 800, 900],
'sell_price': [550, 610, 690, 810, 950]
})
alt.Chart(data).transform_fold(
['estimated_value', 'sell_price'], as_=['category', 'price']
).mark_bar().encode(
y='category:N',
x='average(price):Q',
)
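If you would rather reshape the data in pandas before handing it to Altair, the fold can be sketched with melt. This uses the same example data; the names long_form and averages are only illustrative:

```python
import pandas as pd

data = pd.DataFrame({
    'estimated_value': [500, 600, 700, 800, 900],
    'sell_price': [550, 610, 690, 810, 950]
})

# melt stacks the two columns into long form: one row per
# (category, price) pair, mirroring what transform_fold produces.
long_form = data.melt(var_name='category', value_name='price')

# The per-category average is the quantity each bar represents.
averages = long_form.groupby('category')['price'].mean()
print(averages)
```

The long_form frame can then be charted with the same encodings, y='category:N' and x='average(price):Q', without the transform_fold step.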