I have a question about the peculiar behaviour of Azure AutoML when using forecasting with historical data context.
Basically, I want to apply this use case from the documentation.
The idea is to train a model with historical data (say, three months of it) and then feed the model the current prediction context (for example, the last two weeks) in order to predict a certain prediction horizon.
According to the documentation, to train the model with historical data, I need a configuration something like this:
forecasting_parameters = ForecastingParameters(
    time_column_name='Timestamp',
    target_aggregation_function='mean',
    freq='H',
    forecast_horizon=prediction_horizon_hours,
    target_lags='auto',
)
label = signalTags
automl_config = AutoMLConfig(
    task='forecasting',
    primary_metric='normalized_root_mean_squared_error',
    experiment_timeout_minutes=30,
    blocked_models=["AutoArima"],
    enable_early_stopping=True,
    training_data=Data,
    label_column_name=label,
    n_cross_validations=3,
    enable_ensembling=False,
    verbosity=logging.INFO,
    forecasting_parameters=forecasting_parameters)
After training, in order to perform a prediction I need to feed the model the "context" for what I want to predict in the form of a dataframe, where the values for the target column are filled in for the context rows and empty for the rows I want predicted, and then just call forecast. Something like this:
Timestamp Signal
0 2022-08-07T23:00:00Z 63.16
1 2022-08-08T00:00:00Z 62.92
2 2022-08-08T01:00:00Z 62.89
3 2022-08-08T02:00:00Z 62.79
4 2022-08-08T03:00:00Z 62.75
.. ... ...
233 2022-08-23T17:00:00Z nan
234 2022-08-23T18:00:00Z nan
235 2022-08-23T19:00:00Z nan
236 2022-08-23T20:00:00Z nan
237 2022-08-23T21:00:00Z nan
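For concreteness, here is a minimal sketch of how such a context frame can be built with pandas (context_hours, horizon_hours and the dummy values are mine, purely for illustration):

import numpy as np
import pandas as pd

context_hours = 14 * 24   # two weeks of known history
horizon_hours = 24        # rows to forecast

timestamps = pd.date_range('2022-08-07T23:00:00Z',
                           periods=context_hours + horizon_hours, freq='H')
signal = np.full(len(timestamps), np.nan)
# dummy values standing in for the real history
signal[:context_hours] = np.random.default_rng(0).normal(63, 0.5, size=context_hours)

context_df = pd.DataFrame({'Timestamp': timestamps, 'Signal': signal})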
After all this context (pun intended) here is the question/problem.
When I use the above dataframe to forecast ahead I get an error that mentions the following:
ForecastingConfigException:
Message: Expected column(s) target value column not found in y_pred.
InnerException: None
ErrorResponse
{
"error": {
"code": "UserError",
"message": "Expected column(s) target value column not found in y_pred.",
"target": "y_pred",
"inner_error": {
"code": "BadArgument",
"inner_error": {
"code": "MissingColumnsInData"
}
},
"reference_code": "ac316505-87e4-4877-a855-65a24c3a796b"
}
}
However, if I feed a slightly different dataframe, where the rows to be forecast have any timestamp except exactly on the hour (e.g. 10h30, 11h01, 10h23), it works normally. If I give it something like this:
Timestamp Signal
0 2022-08-07T23:00:00Z 63.16
1 2022-08-08T00:00:00Z 62.92
2 2022-08-08T01:00:00Z 62.89
3 2022-08-08T02:00:00Z 62.79
4 2022-08-08T03:00:00Z 62.75
.. ... ...
233 2022-08-23T17:00:01Z nan
234 2022-08-23T18:00:01Z nan
235 2022-08-23T19:00:01Z nan
236 2022-08-23T20:00:01Z nan
237 2022-08-23T21:00:01Z nan
It outputs good results. What gives?
I have tried resetting the index of the dataframe and replacing None with NaN, but nothing seems to work. Azure AutoML can predict any date except ones that fall exactly on the hour.
What can I do to fix this?
Thanks!
I managed to get it to work by changing how I call the forecast model.
Taking into account these variables:
X : time values
y : target values
df : df[[X, y]]
For a univariate series, instead of using this:
model.forecast(X, y)
I need to call:
model.forecast(df, y)
Remember that to call forecast you need to supply the arguments as a dataframe or a NumPy array.
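Put together, a minimal sketch of the working call (assuming fitted_model is the best model returned by the AutoML run and context_df is a context frame like the one above, with known Signal values for the context rows and np.nan for the horizon rows; the tuple return follows the AutoML forecasting examples):

import numpy as np

# the target array mirrors the Signal column: known values plus np.nan for the horizon
y_context = context_df['Signal'].to_numpy()

# pass the whole frame (time column included) together with the target array
y_forecast, X_trans = fitted_model.forecast(context_df, y_context)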
In order to compare different data sets I need a way to put them on a common time basis. What is the most efficient way to achieve this?
I've tried a few ways, and the easiest should, to my understanding, be pandas DataFrame.reindex:
I have an unevenly spaced time array with associated values for a status (on/off) that persists after each entry. As such I want to use the previous value of the status column until a new value is set at a later time.
A typical array looks like the following; df is a one-column DataFrame with time as index and status as column:
In [58]: df
Out[58]:
status
time
1632160022 0
1632986376 <NA>
1632986496 0
1633448715 1
1633452437 0
1633454358 1
1633461201 0
1633534763 1
1633551686 0
...
From the docs of pandas DataFrame.reindex I read that rebasing / re-indexing with the fill-method pad / ffill should yield the previous value:
# creating an evenly-spaced time base for the observation duration
tmin = df.index.min()
tmax = df.index.max()
tspacing = 120
tbase = list(range(tmin, tmax, tspacing))
# create the temporally evenly-spaced DataFrame
ndf = df.reindex(index=tbase, method='pad', tolerance=120)
However, the result differs from what I expect: all subsequent status entries get assigned NaN instead of the forward-filled value:
In[62]: ndf
Out[62]:
status
time
1632160022 0
1632160142 0
1632160262 NaN
1632160382 NaN
1632160502 NaN
...
Any idea what I'm missing or doing wrong, or, if this method does not do the trick, is there another ready-made method available?
As such I want to use the previous value of the status column until a new value is set at a later time.
IIUC, the tolerance=120 is what bites you: with method='pad', any new time that lies more than 120 seconds from the previous original entry gets NaN instead of the padded value. Since your entries are unevenly spaced and often far apart, drop the tolerance so the last status is carried forward indefinitely:
ndf = df.reindex(tbase, method='ffill')
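A self-contained sketch of the fix, using a few of the timestamps from the question (the status values are illustrative):

import pandas as pd

df = pd.DataFrame(
    {'status': [0, 0, 1, 0]},
    index=pd.Index([1632160022, 1632986496, 1633448715, 1633452437], name='time'))

tbase = range(df.index.min(), df.index.max(), 120)
# without a tolerance, every 120 s step carries the last status forward
ndf = df.reindex(tbase, method='ffill')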
I have to support the ability for a user to run any formula against a frame to produce a new column.
I may have a frame that looks like
dim01 dim02 msr01
0 A 25 1.0
1 B 26 5.3
2 C 53 NaN
I interpret user code to allow them to run a formula using supported functions, standard operators, and other columns.
So a formula might look like SQRT([msr01]*100+7)
I convert the user input to Python syntax so this would evaluate to something like
formula_str = '(math.sqrt((row.msr01*100)+7))'
I then apply it to my pandas dataframe like this
data_frame['msr002'] = data_frame.apply(lambda row: eval(formula_str), axis=1)
This was working well until I hit data with a NaN in a column used in the calculation. I noticed that when this happens I get a frame like this in return:
dim01 dim02 msr01 msr02
0 A 25 1.0 10.344
1 B 26 5.3 23.173
2 C 53 NaN 7.342
So it appears that the eval is not evaluating the NaN correctly.
I am using a lexer/parser to ensure that the user-sent formula isn't dangerous, and to convert from everyday user syntax to Python functions that work against pandas columns.
Any advice on how to fix this?
Perhaps I should include something in the lambda that checks whether any required column is NaN and, in that case, just hardcodes the result to NaN? But that doesn't seem like the best solution to me.
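For illustration, a minimal self-contained sketch of that guard (the dummy frame stands in for the real data):

import math
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({'dim01': ['A', 'B', 'C'], 'dim02': [25, 26, 53],
                           'msr01': [1.0, 5.3, np.nan]})
formula_str = '(math.sqrt((row.msr01*100)+7))'

# return NaN outright whenever a required column is NaN, otherwise evaluate
data_frame['msr002'] = data_frame.apply(
    lambda row: float('nan') if pd.isna(row.msr01) else eval(formula_str),
    axis=1)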
I did see this question, which is similar, but I didn't think it answered my exact need.
So you can try the vectorized equivalent, which propagates NaN naturally:
df.msr01.mul(100).add(7)**0.5
Out[716]:
0 10.34408
1 23.17326
2 NaN
Name: msr01, dtype: float64
Also with your original code
df.apply(lambda row: eval(formula_str), axis=1)
Out[714]:
0 10.34408
1 23.17326
2 NaN
dtype: float64
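If your lexer can target whole columns instead of row attributes, another option (a sketch, not your parser's actual output) is to evaluate the formula once against the columns with NumPy, so NaN propagates without any per-row guard:

import numpy as np
import pandas as pd

df = pd.DataFrame({'msr01': [1.0, 5.3, np.nan]})

# hypothetical translation target: column expressions rather than row.msr01
formula_str = 'np.sqrt(df["msr01"] * 100 + 7)'
df['msr02'] = eval(formula_str, {'np': np, 'df': df})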
I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider for example the following
import numpy as np
import pandas as pd
import dask.dataframe as dd

N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1 and I follow the solution from here
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions, so I guess the apply is somehow working on each one individually.
However, if I want the mean and use mean()
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask doc: SeriesGroupBy.apply):
Pandas' groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask's groupby-apply will apply func once to each partition-group pair, so when func is a reduction you'll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
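For reference, here is what such a custom aggregation looks like (a sketch adapted from the Dask docs; note that an exact median cannot be decomposed into per-partition chunk/agg steps, which is why apply is the natural route for it):

import dask.dataframe as dd

# custom mean: per-partition (count, sum), combine the partials, then finalize
custom_mean = dd.Aggregation(
    name='custom_mean',
    chunk=lambda s: (s.count(), s.sum()),
    agg=lambda count, total: (count.sum(), total.sum()),
    finalize=lambda count, total: total / count,
)
# usage: ddf2.groupby('bin_num')['col_1'].agg(custom_mean)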
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). By casting the category column to another column type (float, int, str), then the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
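# cast the categorical bin column to int so the groupby reduces across partitions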
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
@MRocklin or @TomAugspurger
Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)
I am working on a simple time series linear regression using statsmodels.api.OLS, and am running these regressions on groups of data based on an identifier variable. I have been able to get the grouped regressions working, but am now looking to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
import pandas as pd
import statsmodels.api as sm

def ols_reg(df, xcol, ycol):
    x = df[xcol]
    y = df[ycol]
    x = sm.add_constant(x)
    model = sm.OLS(y, x, missing='drop').fit()
    predictions = model.predict()
    return pd.Series(predictions)
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg, xcol='time', ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a 0 0.5
1 1
2 2.5
3 3
b 0 0.5
1 1
2 2.5
3 3
The column starting with 0.5 (ignore the values; not the actual output) is the column with predicted values from the regression. As the return on the function shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.
UPDATE:
I've solved this with a relatively simple method, in which I converted the series to a list, and just set a new column in the dataframe equal to the list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!
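For what it's worth, here is an index-preserving sketch (assuming missing='drop' removes no rows, so the prediction length matches the group length): keep the group's original index on the returned Series, then drop the added 'id' level so pandas can align the values on assignment instead of raising the incompatible-index TypeError:

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'id': ['a']*4 + ['b']*4,
                   'value': [1, 1.5, 2, 2.5]*2,
                   'time': [1, 2, 3, 4]*2})

def ols_reg(df, xcol, ycol):
    x = sm.add_constant(df[xcol])
    fit = sm.OLS(df[ycol], x, missing='drop').fit()
    # keep the group's original index so the results can align later
    return pd.Series(fit.predict(), index=df.index)

var = df.groupby('id').apply(ols_reg, xcol='time', ycol='value')
# drop the 'id' level; the remaining index matches df's, so assignment aligns
df['results'] = var.reset_index(level=0, drop=True)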
To not lose the positions when inserting the predictions for the missing values, you can use the following approach. In this example:
X_train: the train data, a pandas DataFrame with corresponding known real results (in y_train).
X_test: the test data, a pandas DataFrame without corresponding known real results; these need to be predicted.
y_train: a pandas Series with the known real results for the train data.
prediction: the predictions, a pandas Series object.
To get the complete data merged into one pandas DataFrame, first put the parts together:
# merge the train and test parts of the data into one dataframe
X_train = X_train.sort_index()
y_train = y_train.sort_index()
result = pd.concat([X_train, X_test])
# if you need to convert a numpy array to a pandas series:
# prediction = pd.Series(prediction)
# here is the magic: fill the rows whose target is missing with the predictions
result.loc[result['specie'].isnull(), 'specie'] = prediction.values
If there are no other missing values, this will do the job.
I need a hand with this problem: in an Excel workbook I have 10 time series (with monthly frequency) for 10 securities, which should cover the past 15 years. Unfortunately, not all of the securities cover the full 15-year period. For example, one only has data back to 2003, so in that column the first 5 years contain "Not Available" instead of a value. Once I have imported the data into Matlab, NaN obviously appears in the column of the shorter series where there are no values.
>> Prices = xlsread('PrezziTitoli.xls');
>> whos
Name Size Bytes Class Attributes
Prices 182x10 6360 double
My goal is to estimate the variance-covariance matrix; however, because of the missing data, the calculation is not possible for me. Before computing the variance-covariance matrix, I thought of interpolating to fill the values that come back as NaN in Matlab, for example with fillts, but I am having difficulty using it.
Is there some code that could be useful to me? Can you help me?
Thanks!
Do you have the statistics toolbox installed? In that case, the solution is simple:
>> x = randn(10,4); % x is a 10x4 matrix of random numbers
>> x(randi(40,10,1)) = NaN; % set some random entries to NaN
>> disp(x)
-1.1480 NaN -2.1384 2.9080
0.1049 -0.8880 NaN 0.8252
0.7223 0.1001 1.3546 1.3790
2.5855 -0.5445 NaN -1.0582
-0.6669 NaN NaN NaN
NaN -0.6003 0.1240 -0.2725
-0.0825 0.4900 1.4367 1.0984
-1.9330 0.7394 -1.9609 -0.2779
-0.4390 1.7119 -0.1977 0.7015
-1.7947 -0.1941 -1.2078 -2.0518
>> nancov(x) % compute covariances after removing all rows that contain NaN
1.2977 0.0520 1.6248 1.3540
0.0520 0.5359 -0.0967 0.3966
1.6248 -0.0967 2.2940 1.6071
1.3540 0.3966 1.6071 1.9358
>> nancov(x, 'pairwise') % compute covariances pairwise, ignoring NaNs
1.9195 -0.5221 1.4491 -0.0424
-0.5221 0.7325 -0.1240 0.2917
1.4491 -0.1240 2.1454 0.2279
-0.0424 0.2917 0.2279 2.1305
If you don't have the Statistics Toolbox, we need to think harder. Let me know!