Hey guys, I need your help. I want to predict rice production in India using a simple regression. For this I have a dataset with yield and production data for the last 40 years. As explanatory variables I have daily data on rainfall, temperature, etc. Now to my problem: obviously the number of observations of the y-variable (40) does not match the number of observations of the x-variables (about 15,000), so a regression is not directly feasible. What is the best way to proceed?
Average the weather data over each year and use those averages as regressors, i.e. a kind of downsampling of the x-variables (see the sketch below). Of course, this means that important information such as outliers is lost.
Attach the annual production value to each daily weather entry in the corresponding year. This would repeat the same y value 365 times, which doesn't sound reasonable to me either.
What other ideas do you guys have? If interested, I'll be happy to attach the datasets as well.
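One way to make option 1 less lossy is to aggregate the daily data into several annual summary statistics rather than a single mean, so extremes survive the aggregation. A minimal pandas sketch, assuming hypothetical file and column names:

import pandas as pd

# Hypothetical file/column names: daily weather plus annual production
weather = pd.read_csv("weather_daily.csv", parse_dates=["date"])
weather["year"] = weather["date"].dt.year

# One row per year, keeping more than just the mean so extremes are not lost
annual = weather.groupby("year").agg(
    rain_total=("rainfall", "sum"),
    rain_max=("rainfall", "max"),
    temp_mean=("temperature", "mean"),
    temp_max=("temperature", "max"),
)

production = pd.read_csv("production_annual.csv")  # ~40 rows: year, production
data = annual.merge(production, on="year")  # 40 rows, ready for a simple regression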
An example: the time someone left home and the time someone called 9-1-1, with these points used to predict, ideally, the time of the incident in Excel. I can put a time in column A and column B, but all it does is give me the halfway point between the two. For example, column A says 12:00 and column B says 1:00, and the result is 12:30. If I can get something more predictive using this approach, that is ideally what I'm looking for.
I used some of the standard functions in Excel to predict time-based series.
We were looking at predicting data points for 1mis, 3mis and 6mis (mis = Months In Service).
We found that the FORECAST() function with some "fiddle" factors - sorry, finely tuned polynomial assumptions - gave a reasonable prediction for our needs. We fed it increasing amounts of historical data and checked the performance until it was suitable for what we needed.
I'm trying to build an Excel sheet that calculates synthetic options prices and Greeks for time-series data, to model intraday options pricing; the input is simply intraday price data, say tick level to 5-minute intervals. I found this https://www.thebiccountant.com/2021/12/28/black-scholes-option-pricing-with-power-query-in-power-bi/ which provides Black-Scholes for Power BI, but possibly not very accurately. I prefer the binomial method (I have used this excellent tutorial to build a manual version for a large number of strikes, but it takes a long time to calculate, is very complex, and is also inaccurate because Excel tops out before many steps can be calculated: https://www.macroption.com/binomial-option-pricing-excel/).
Does anyone have any idea whether it is possible to create an entire column in Power Query that calculates binomially derived options prices using >100, even up to 1,000, steps? The reason is that intraday pricing with high-resolution data (5-minute, 1-minute, seconds, and tick) needs a large number of steps to converge properly. This is just about building a good-enough model for visualising the progress of a trade on a given day.
Any pointers on how this could be done and calculated using M Language would be much appreciated and useful!
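Power Query M has no looping construct well suited to thousand-step trees per row, so any M approach may be slow. As a point of reference (Python rather than M, and a sketch rather than a production pricer), a Cox-Ross-Rubinstein binomial model vectorised over the tree handles 1,000 steps in milliseconds:

import numpy as np

def crr_call(S, K, T, r, sigma, steps=1000):
    """European call price from a Cox-Ross-Rubinstein binomial tree."""
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))      # up factor
    d = 1.0 / u                          # down factor
    p = (np.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    j = np.arange(steps + 1)             # number of up moves at expiry
    payoff = np.maximum(S * u**j * d**(steps - j) - K, 0.0)
    disc = np.exp(-r * dt)
    for _ in range(steps):               # roll back through the tree
        payoff = disc * (p * payoff[1:] + (1 - p) * payoff[:-1])
    return payoff[0]

# Example inputs are made up
print(crr_call(S=100, K=105, T=30 / 365, r=0.02, sigma=0.25))

At 1,000 steps the price converges closely to Black-Scholes, which is the convergence behaviour the question is after.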
I have a question about how to calculate features for future time frames. Consider the dataset below and assume today is 2019-11-11. I have the last 2 years of daily data; below are the last 6 rows:
Date, Temperature, Sales
2019-11-06, 25.5, 500000
2019-11-07, 24.2, 550000
2019-11-08, 25.1, 560000
2019-11-09, 22.6, 510000
2019-11-10, 22.3, 520000
2019-11-11, 24.4, 535000
Now I have to predict Sales for 2019-11-12, 2019-11-13 and 2019-11-14. In order to predict sales for those dates, I have to provide the following test data to the trained machine-learning model:
Date, Temperature
2019-11-12, temperatureX
2019-11-13, temperatureY
2019-11-14, temperatureZ
What will the values of temperatureX, temperatureY and temperatureZ be, given that these values also lie in the future?
There are different solutions.
I suggest you start with Time-series Forecasting using Azure AutoML or, to dig deeper, Auto-train a time-series forecast model.
If you need an interpretable model, you could train a Linear Model (LM) or another regression model in R or Python. It might make sense to derive some features from the date, such as month, day of month, or season (sketched after this list). This approach has the benefit that you can calculate confidence intervals.
As it is a multivariate time series, also have a look at Vector Auto Regression, see A Multivariate Time Series Guide to Forecasting and Modeling (with Python codes)
If you are just interested in predicting values, you can try a recurrent neural network or LSTM. See GitHub Azure/DeepLearningForTimeSeriesForecasting
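To make the linear-model suggestion concrete, here is a rough scikit-learn sketch; the file name, the derived features, and the forecast temperatures are all hypothetical:

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("sales.csv", parse_dates=["Date"])  # Date, Temperature, Sales
df["month"] = df["Date"].dt.month
df["dayofweek"] = df["Date"].dt.dayofweek

model = LinearRegression().fit(df[["Temperature", "month", "dayofweek"]], df["Sales"])

# The future rows still need a Temperature value, e.g. from a weather forecast
future = pd.DataFrame({
    "Temperature": [24.0, 23.5, 25.0],  # assumed forecast values for Nov 12-14
    "month": [11, 11, 11],
    "dayofweek": [1, 2, 3],
})
print(model.predict(future))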
Easy answer? You can't predict if you don't have the independent variables that explain your target at prediction time.
That being said, you can usually get a weather forecast for at least a week ahead of any date through a simple web search. So if you don't require a very long maximum horizon, you can use forecast temperatures for your values (x, y, and z). Your retraining period would then become weekly, or however far out you're able to find existing weather forecasts.
Ref: https://datascience.stackexchange.com/questions/27171/what-to-give-as-predictors-to-predict-future-values
I have a large Excel file that has monthly sales per customer for January through December 2016. I want to predict what their sales will be in January 2017.
You could average each client's data and ignore the zeros with a formula like
=AVERAGEIF(D2:D12,"<>0")
D2:D12 would be the range of a single client's sales, and the formula would give you a monthly average for that client that you could use as the predicted January sales.
You have several problems to solve:
1. Determining (a) candidate forecasting model(s) to use.
2. Organising your existing data to test whether such model(s) are actually suitable, performing such tests and selecting (a) suitable model(s). [There may be more than one model to be used, dependent on whether your data are homogeneous or not.]
3. Organising your existing data to apply your chosen model(s) for the purposes of making your prediction. (A different organisation to 2. may be required.)
Your description talks about "sales" but the data sample you provided mentions "claims". These are very different entities - sales (dependent on what type of sales) may well be as frequent as monthly, but claims are likely to be a lot less frequent. If this is the case and claims are highly infrequent, then there is little sense in trying to predict an individual customer's claim. In such a case it would make more sense to predict the aggregate level of claims across a group of customers.
With all modelling, and particularly with forecasting models, context is highly important in steering towards which particular types of model are likely to be suitable. As it is, you have provided no context about what your data really represents, so are unlikely (beyond random chance) to find that any solution offered to you is actually going to be suitable. A solution might compute but, in the context in which you are operating, will it provide anything like a sensible or justifiable set of forecasts?
The "AverageIf" solution may be sufficient; however, you may be able to do better if there is in fact any trends/seasonality in the data that could be used to modeling advantage. For each customer, I would check for autocorrelation in the data. "Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them."(https://en.wikipedia.org/wiki/Autocorrelation) For instance, if there is significant autocorrelation at lag = 12, this would suggest yearly seasonality in the data (maybe every January is similar). There is a nice tutorial to analyze autocorrelation in Excel at:
http://www.real-statistics.com/time-series-analysis/stochastic-processes/autocorrelation-function/
If autocorrelation does exist, it would likely be useful to perform regression with that time component. If there is a trend with time in addition to a cyclical component, that should also be factored into the regression (e.g., as a "Year" variable); or a more sophisticated time-series method that accommodates trend and autocorrelation could be applied, such as an Autoregressive Integrated Moving Average (ARIMA) model:
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
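Before reaching for ARIMA, the autocorrelation check itself is a one-liner in Python with statsmodels, if you'd rather not do it in Excel (file and column names are hypothetical, and you need more than one year of history for a lag-12 estimate to exist):

import pandas as pd
from statsmodels.tsa.stattools import acf

# Hypothetical: one customer's monthly sales, in date order, several years long
sales = pd.read_csv("customer_sales.csv")["sales"]
r = acf(sales, nlags=12)
print(r[12])  # a large value at lag 12 hints at yearly seasonality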
Excel has a forecasting function that might help:
FORECAST.ETS function
Calculates or predicts a future value based on existing (historical) values by using the AAA version of the Exponential Smoothing (ETS) algorithm. The predicted value is a continuation of the historical values in the specified target date, which should be a continuation of the timeline. You can use this function to predict future sales, inventory requirements, or consumer trends.
This function requires the timeline to be organized with a constant step between the different points. For example, that could be a monthly timeline with values on the 1st of every month, a yearly timeline, or a timeline of numerical indices. For this type of timeline, it’s very useful to aggregate raw detailed data before you apply the forecast, which produces more accurate forecast results as well.
Syntax
FORECAST.ETS(target_date, values, timeline, [seasonality], [data_completion], [aggregation])
And you can see it in action in a workbook from the FORECAST.ETS.SEASONALITY page:
Download a sample workbook
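FORECAST.ETS is exponential smoothing under the hood, so if you ever outgrow Excel, a rough statsmodels equivalent looks like this (the numbers are made up, and with only one year of monthly history there is no room to fit a seasonal term, so this uses trend only):

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly sales for Jan-Dec 2016
sales = pd.Series(
    [210, 195, 230, 250, 240, 260, 275, 270, 255, 265, 280, 300],
    index=pd.date_range("2016-01-01", periods=12, freq="MS"),
)
fit = ExponentialSmoothing(sales, trend="add").fit()
print(fit.forecast(1))  # predicted January 2017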
We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used to produce charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in one zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone: say the prices are share prices (they are not, but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices, and sometimes one or two are way off. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
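A minimal Python version of that idea (the 3-standard-deviation threshold is a judgment call, and note the answer's point about needing a decent backlog: with only a handful of points, a single wild value inflates the standard deviation and can mask itself):

import numpy as np

def drop_outliers(prices, threshold=3.0):
    """Keep only values within `threshold` standard deviations of the mean."""
    prices = np.asarray(prices, dtype=float)
    mu, sigma = prices.mean(), prices.std()
    return prices[np.abs(prices - mu) <= threshold * sigma]

day = [100.1, 99.8, 100.3, 100.0, 99.9, 100.2,
       100.4, 99.7, 100.1, 100.0, 99.9, 1000.0]  # one mistyped entry
print(drop_outliers(day))  # the 1000.0 is dropped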
That's a great question, but it may lead to quite a bit of discussion, as the answers could be very varied. It depends on:
how much effort are you willing to put into this?
could some values genuinely differ by +/-20% or whatever test you invent? So will there always be a need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said, the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement.
The next level of complexity would involve some statistical measure over all values (or the previous x values, or the values from the last 3 months); a normal (Gaussian) distribution would enable you to assign each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function; there are adequate links from these pages to help in programming this, and depending on the language you're using there are likely to be functions and/or plugins available to help as well.
A more advanced method would be a learning algorithm that takes other parameters into account on top of the last x values: it could consider the product type or manufacturer, for instance, or even the time of day or the user who entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code and to train the algorithm.
I think the second option is the right one for you. Using standard deviation (a lot of languages have a function for this) may be a simpler alternative: it is simply a measure of how far a value has deviated from the mean of the previous x values. I'd put the standard-deviation option somewhere between options 1 and 2.
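A sketch of that middle ground, testing each new entry against the rolling statistics of the previous entries; the window and threshold are arbitrary choices, and it flags rather than deletes, so a human can still confirm:

import numpy as np

def flag_suspects(prices, window=20, threshold=3.0):
    """Flag entries far from the mean of the previous `window` values."""
    prices = np.asarray(prices, dtype=float)
    flags = np.zeros(len(prices), dtype=bool)
    for i in range(window, len(prices)):
        prev = prices[i - window:i]
        mu, sigma = prev.mean(), prev.std()
        flags[i] = bool(sigma > 0 and abs(prices[i] - mu) > threshold * sigma)
    return flags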
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
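In pandas that is one line; note that a plain moving average is still dragged by a spike, whereas a rolling median (a small swap from what this answer literally says) shrugs it off:

import pandas as pd

prices = pd.Series([100.2, 100.5, 100.4, 1004.0, 100.6, 100.7])  # one typo entry
print(prices.rolling(window=3, min_periods=1).mean())    # the spike still bleeds through
print(prices.rolling(window=3, min_periods=1).median())  # robust to the single bad entry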
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
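The quoted recipe translates almost line-for-line into code. A rough sketch of one variant (distance from the mean of the remaining values, standardized by their SD, then a Gaussian P value - essentially a one-off Grubbs-style test, not a polished library routine):

import numpy as np
from scipy import stats

def outlier_p_value(values, i):
    """How surprising is values[i], measured against the remaining values?"""
    values = np.asarray(values, dtype=float)
    rest = np.delete(values, i)
    z = abs(values[i] - rest.mean()) / rest.std()  # standardized distance
    return 2 * stats.norm.sf(z)                    # two-sided Gaussian P value

print(outlier_p_value([100.2, 100.5, 100.4, 1004.0, 100.6], 3))  # tiny P value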
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to treat the outliers as outliers and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point, or the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the outlier: if it really is an outlier, there will be a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5th and 95th percentiles will portray the history well.
Choose your display methods, and how much outlier detection you need to do, based on the analysis question. If you only care about medians or percentiles, the outliers are probably irrelevant.
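For the trimmed-mean / percentile-band plot, scipy and numpy have everything needed; the data here is simulated to match the "about 150 prices a day, one or two wrong" scenario:

import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(0)
day = rng.normal(100, 0.5, size=150)    # ~150 prices on a typical day
day[60] = 1004.0                        # one mistyped entry
print(day.mean())                       # dragged up by the bad entry
print(trim_mean(day, 0.05))             # mean of the middle 90% - barely moves
print(np.percentile(day, [5, 50, 95]))  # robust daily band for plotting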