How to use fitted_model.forecast() for AutoML forecasting model? - azure-machine-learning-service

Say I have a dataset with a monthly granularity with the following columns:
Timestamp
Issues (i.e. number of GitHub issues)
There is data for each month for all of 2016-2019, so I divide the data accordingly.
training_data: 2016-2017
validation_data: 2018
holdout_data: 2019
I have a fitted_model that is a ForecastingPipelineWrapper. It is the best run from AutoML, which I gave training_data and validation_data.
Reading the ForecastingPipelineWrapper class docstring only confuses me more. What are X_past, X_future, and Y_future?
How do I use the above dataframes with fitted_model.forecast() to manually validate model fit on the holdout_data dataframe?

The following notebook illustrates how to leverage y_past, x_past, y_future, x_future, and fitted_model.forecast in the bottom half, 'Forecasting away from training data'. https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-high-frequency/auto-ml-forecasting-function.ipynb
The notebook is probably a better guide to these concepts than the docstring. If you have more questions or need clarification, let us know!
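As a rough illustration of the pattern the notebook describes, here is a minimal pandas sketch of how a holdout query might be assembled: recent actuals supply context, and the target values for the rows to be predicted are set to NaN. The helper make_forecast_query and the synthetic frames are hypothetical, using the 2018 validation data as context is an assumption, and the commented-out forecast call is an assumption based on the linked notebook, so check it against your SDK version.

```python
import numpy as np
import pandas as pd


def make_forecast_query(context, holdout, target_col="Issues"):
    """Stack recent actuals (context) on top of the rows to be predicted.

    Hypothetical helper: returns the features for all rows and a target
    array where context rows keep their actuals and holdout rows are NaN.
    """
    combined = pd.concat([context, holdout], ignore_index=True)
    X_query = combined.drop(columns=[target_col])
    # np.array copies, so the original frames are left untouched.
    y_query = np.array(combined[target_col], dtype=float)
    y_query[len(context):] = np.nan  # hide the holdout actuals
    return X_query, y_query


# Synthetic stand-ins for the 2018 and 2019 frames from the question.
validation_data = pd.DataFrame({
    "Timestamp": pd.date_range("2018-01-01", periods=12, freq="MS"),
    "Issues": np.arange(100, 112),
})
holdout_data = pd.DataFrame({
    "Timestamp": pd.date_range("2019-01-01", periods=12, freq="MS"),
    "Issues": np.arange(112, 124),
})

X_query, y_query = make_forecast_query(validation_data, holdout_data)

# Assumed call, not verified here (see the linked notebook):
# y_pred, X_trans = fitted_model.forecast(X_query, y_query)
# The predictions for the 2019 rows could then be compared with
# holdout_data["Issues"] to judge the fit on the holdout period.
```

The rows whose target is NaN are the ones the model is asked to predict; the rows with actuals only provide the history that lag-based features need.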

Example for Azure AutoML Forecasting for time series with multiple covariate features

I would like to use Azure AutoML for forecasting where I have multiple features for one timeseries. Is there any example which I can replicate?
I have been looking into: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-beer-remote/auto-ml-forecasting-beer-remote.ipynb
and
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb
but no luck using multiple features instead of only one timeseries.
Any help is greatly appreciated
It looks like you are trying to find a notebook that shows how to predict a target variable when exogenous features are provided. The OJ sample notebook you included is actually a good example to reference for this scenario.
On a second glance, you'll find that in the OJ sample, 'Quantity' is a function of 'Price' and other variables. We suggest focusing on a single time series within the OJ dataset (a single store and brand combination), since the concept can get lost when looking at multiple series. Also note that the OJ dataset in this example does have multiple features; the notebook only specifies which features need to be excluded.
OJ Sample Notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb
-Sabina, Azure Machine Learning PM
Please check here,
Auto-train a time-series forecast model - Azure Machine Learning | Microsoft Docs
Please check the below many models accelerator which models timeseries data (but in a different domain). This can be useful.
buswrecker/energy-many-models: An offshoot of the original AML Many-Models - for the Energy Sector (github.com)
AML AutoML forecasting models handle missing data in the featurization stage: a missing value in the target column is forward filled, and a missing value in a feature column is replaced with the median. Libraries such as Prophet, which AutoML supports, can also be robust to missing data.
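To make the imputation rule concrete, here is a small pandas sketch of the same idea (forward fill for the target, median for a feature). It only mimics the strategy described above on made-up data; it is not AutoML's own implementation. The 'Price' and 'Quantity' column names are borrowed from the OJ sample.

```python
import numpy as np
import pandas as pd

# Made-up monthly series with gaps in both a feature and the target.
df = pd.DataFrame({
    "Timestamp": pd.date_range("2019-01-01", periods=6, freq="MS"),
    "Price": [2.0, np.nan, 4.0, 3.0, np.nan, 5.0],         # feature column
    "Quantity": [10.0, 12.0, np.nan, np.nan, 15.0, 16.0],  # target column
}).sort_values("Timestamp")

imputed = df.copy()
# Target: carry the last observed value forward in time.
imputed["Quantity"] = imputed["Quantity"].ffill()
# Feature: replace gaps with the column median of the observed values.
imputed["Price"] = imputed["Price"].fillna(imputed["Price"].median())

print(imputed)
```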

How to see correlation between features in scikit-learn?

I am developing a model that predicts whether an employee stays in their job or leaves the company.
The features are as below
satisfaction_level
last_evaluation
number_projects
average_monthly_hours
time_spend_company
work_accident
promotion_last_5years
Department
salary
left (boolean)
During feature analysis I tried two approaches, and they gave different results for the features, as shown in the image here.
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for the analysis, I get results like this
In the above image, it can be seen that satisfaction_level is quite important in the analysis since employees with higher satisfaction_level retain the job.
For time_spend_company, however, the heatmap shows it is important, while in the second image the difference is not very pronounced.
Now I am confused about whether to keep this as one of my features, and about which approach I should use to select features.
Someone please help me with this.
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features has little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most cases when you are talking about feature importance, you must provide the context of the model that you are using. Different models may choose different features as important. Moreover, many models assume that the data come from IID (independent and identically distributed) random variables, so correlation close to zero is desirable.
For example, for a scikit-learn regression model you can examine the coef_ attribute to get an estimate of feature importance.
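A minimal sketch of the two views side by side, on synthetic data rather than the dataset from the question: the pandas correlation with left, and the coef_ values of a fitted LogisticRegression. The data-generating numbers are invented for illustration, and the features are standardized first (an assumption of this sketch) so that the coefficient magnitudes are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: leaving is made less likely by high satisfaction
# and slightly more likely by long tenure (invented relationship).
rng = np.random.default_rng(0)
n = 2000
satisfaction = rng.uniform(0, 1, n)
tenure = rng.integers(2, 11, n)
logit = 2.0 - 5.0 * satisfaction + 0.1 * tenure
left = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

features = ["satisfaction_level", "time_spend_company"]
df = pd.DataFrame({
    "satisfaction_level": satisfaction,
    "time_spend_company": tenure,
    "left": left.astype(int),
})

# View 1: pairwise correlation with the target (what the heat map shows).
corr_with_left = df.corr()["left"].drop("left")

# View 2: model-specific importance via the fitted coefficients.
X = StandardScaler().fit_transform(df[features])
model = LogisticRegression().fit(X, df["left"])
coefs = pd.Series(model.coef_[0], index=features)

print(corr_with_left)
print(coefs)
```

Both views agree on the sign for satisfaction_level here, but the coefficients depend on the chosen model and on the other features included, which is why importance has to be discussed in the context of a model.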

Azure Machine Learning Experiment Creation

I am new to creating Experiments in Azure ML. I want to do a small sample POC on Azure ML.
I have data for students consisting of StudentID, Student Name and Marks for Monthly Tests 1, 2 and 3. I just want to predict the marks for the Final Monthly Test (i.e., Monthly Test 4).
I don't know how to create the experiment or what kind of transformations to use for predicting the data.
Anyone Please...
Thanks in Advance
Pradeep
You can simply start with basic tutorials.
https://azure.microsoft.com/en-in/documentation/articles/machine-learning-create-experiment/
It is really helpful. I also referred to this.
You can draw a simple flow chart for your experiment and then follow it when you drag in the dataset.
HTH
This is a supervised machine learning problem (a regression, since you are predicting a numeric mark). Look at the algorithms that you can use for solving it (most probably linear regression will suit this case). Do the data pre-processing first. Then follow the steps in the link mentioned above by #kunal to build the model.
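One possible way to frame the regression, sketched in scikit-learn rather than in the Azure ML designer: since no Test 4 marks exist yet, learn how the previous two tests predict the next one (Test 1 and Test 2 to Test 3), then slide the window forward (Test 2 and Test 3 to Test 4). The sliding-window framing, the column names and the marks are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up marks; column names are hypothetical.
marks = pd.DataFrame({
    "StudentID": [1, 2, 3, 4, 5],
    "Test1": [55, 70, 62, 80, 45],
    "Test2": [58, 72, 65, 83, 50],
    "Test3": [61, 75, 66, 85, 54],
})

# Train: previous two tests -> next test.
X_train = marks[["Test1", "Test2"]].to_numpy()
y_train = marks["Test3"].to_numpy()
model = LinearRegression().fit(X_train, y_train)

# Predict: slide the window forward by one test.
X_next = marks[["Test2", "Test3"]].to_numpy()
marks["Test4_predicted"] = model.predict(X_next)

print(marks[["StudentID", "Test4_predicted"]])
```

With only three tests per student this is a toy setup; StudentID and Student Name are identifiers and are left out of the features.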

Get Spark metrics on each iteration step?

Applying Spark's logistic regression to a specific dataset requires defining a number of iterations. So far I've learned that outputting the result of the cost function on each iteration might be useful information to plot. It can be used to visualize how many iterations a function needs to converge to a minimum. I was wondering if there is a way to output such information in Spark? Looping over a train() function with different iteration numbers sounds like a solution that takes a lot of time on large datasets. It would be nice to know if there is a better one already built in. Thanks for any advice on this topic.
After you've trained a model (call it myModel) that has such a history, you can get the iteration-by-iteration history with
myModel.summary.objectiveHistory.foreach(...)
There's a nice example here in the Spark ML documentation -- once you know the right search terms.
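To show what such an objective history contains, here is a plain NumPy sketch (not Spark code) of logistic regression trained by gradient descent that records the loss at every iteration. The function fit_logistic, the learning rate and the synthetic data are all hypothetical; Spark's own optimizer works differently, so this only illustrates the kind of per-iteration values one would plot.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, lr=0.5):
    """Gradient descent on the mean log loss, recording it each iteration."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted probabilities
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        history.append(loss)              # loss before this update
        w -= lr * X.T @ (p - y) / len(y)  # gradient step
    return w, history


# Synthetic, roughly linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(float)

w, history = fit_logistic(X, y)
for i, loss in enumerate(history[:5]):
    print(i, loss)
```

Plotting history against the iteration index gives the convergence curve the question asks about, without retraining for each iteration count.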

Time Series Modeling of Choppy Data

I'm trying to model 10 years of monthly time series data that is very choppy and overall has an upward trend. At first glance it looks like a strongly seasonal series; however, the test results indicate that it is definitely not seasonal. This is a pricing variable that I'm trying to model as a function of the macroeconomic environment, such as interest rates and yield curves.
I've tried linear OLS regression (proc reg), but I don't get a very good model with that.
I've also tried autoregressive error models (proc autoreg), but it captures 7 lags of the error term as significant factors. I don't really want to include that many lags of the error term in the model. In addition, most of the macroeconomic variables become insignificant when I include all these error lags in the model.
Any suggestions on a modeling method or technique that could help me model this choppy data would be really appreciated.
On a past project, we used proc arima to predict future product sales based on a time series of past sales:
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_arima_sect019.htm (note that arima is also an autoregressive model)
But as Joe said, for proper statistical feedback on your question, you're better off asking at the Cross Validated site.
