How to: Analysis of Variance

Using the mv2020.csv dataset, determine whether kilometers traveled differ significantly by fuel type, month, and office.
File: https://docs.google.com/spreadsheets/d/10iew59515ntUAMK9ujjztLiJu2hSYL1O/edit?usp=sharing&ouid=102353892068679695500&rtpof=true&sd=true
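One way to approach this is a one-way ANOVA per factor. The sketch below computes the F statistic by hand using only the standard library. The column names `fuel_type` and `kms` and the sample rows are assumptions for illustration, not taken from mv2020.csv; the same call could be repeated for the month and office columns.

```python
from collections import defaultdict
from statistics import mean


def group_values(rows, factor, value):
    """Collect the numeric `value` column into lists keyed by the `factor` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(float(row[value]))
    return list(groups.values())


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical rows; the real column names and values in mv2020.csv may differ.
rows = [
    {"fuel_type": "diesel", "kms": 100}, {"fuel_type": "diesel", "kms": 200},
    {"fuel_type": "diesel", "kms": 300}, {"fuel_type": "gasoline", "kms": 200},
    {"fuel_type": "gasoline", "kms": 300}, {"fuel_type": "gasoline", "kms": 400},
]
f_stat, df_b, df_w = one_way_anova_f(group_values(rows, "fuel_type", "kms"))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

To decide significance, the F statistic still has to be compared with the F distribution for those degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` returns the statistic together with a p-value.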

Related

Predict yearly harvest - Regression

Hey guys, I need your help. I want to predict rice production in India using a simple regression. For this I have a dataset with the yield and production data for the last 40 years. As explanatory variables I have daily data on rainfall, temperature, etc. Now to my problem: the number of observations of the y-variable (40) obviously does not match the number of observations of the x-variables (about 15,000), so a regression is not feasible as is. What is the best way to proceed?
1. Average the weather data over the year and use that to estimate the y-variable, i.e. a kind of downsampling of the x-variables. Of course, this means that important information such as outliers is lost.
2. Attach the annual production value to each weather entry in the associated year. This would give us the same y value 365 times. That doesn't sound reasonable to me either.
What other ideas do you guys have? If interested, I'll be happy to attach the datasets as well.
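As a rough illustration of the first option, here is a minimal sketch that collapses daily weather into one feature row per year, so that x and y both have 40 observations. It keeps a yearly maximum and a count of hot days alongside the mean, so that extremes are not lost entirely. The tuple layout, the feature names, and the 35 °C threshold are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def yearly_features(daily_rows, hot_day_threshold=35.0):
    """Collapse daily (year, rainfall_mm, temp_c) rows into one feature dict per year."""
    by_year = defaultdict(list)
    for year, rainfall_mm, temp_c in daily_rows:
        by_year[year].append((rainfall_mm, temp_c))

    features = {}
    for year, days in sorted(by_year.items()):
        rain = [r for r, _ in days]
        temp = [t for _, t in days]
        features[year] = {
            "total_rain_mm": sum(rain),
            "mean_temp_c": mean(temp),
            "max_temp_c": max(temp),  # keeps the yearly extreme
            "hot_days": sum(1 for t in temp if t > hot_day_threshold),
        }
    return features


# Hypothetical daily observations: (year, rainfall in mm, temperature in Celsius).
daily = [(2000, 10.0, 30.0), (2000, 0.0, 36.0), (2001, 5.0, 25.0)]
print(yearly_features(daily))
```

With only 40 yearly rows, the number of such features has to stay small, otherwise the regression will overfit.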

Binomial Options Pricing Calculation in PowerQuery

I'm trying to build an Excel sheet that calculates synthetic options prices and greeks for time series data, in order to model intraday options pricing. The input is simply intraday price data, anywhere from tick level to 5-minute intervals. I found this https://www.thebiccountant.com/2021/12/28/black-scholes-option-pricing-with-power-query-in-power-bi/ which covers Power BI and Black-Scholes, but possibly not very accurately. I prefer the binomial method (I have used this excellent tutorial to build a manual version for a large number of strikes, but it takes a long time to calculate, is very complex, and is also inaccurate because Excel tops out before I can calculate many steps: https://www.macroption.com/binomial-option-pricing-excel/).
Does anyone have any idea whether it is possible to create an entire column in Power Query that calculates binomially derived options prices using more than 100, or even up to 1000, steps? The reason is that intraday pricing with high-resolution data (5-minute, 1-minute, seconds, and tick) needs, I think, a large number of steps to converge properly. This is just about building a good enough model that can be used for visualising the progress of a trade on a given day.
Any pointers on how this could be done and calculated using M Language would be much appreciated and useful!
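For reference, the calculation being asked about can be sketched compactly outside of Excel. The following is a minimal Cox-Ross-Rubinstein binomial tree for a European call in Python, not M, and not the method of either linked tutorial. The function name and the example inputs are assumptions for illustration. It keeps only one layer of the tree in memory, so 1000 steps are not a problem.

```python
from math import exp, sqrt


def crr_european_call(spot, strike, rate, sigma, maturity, steps):
    """Price a European call with a Cox-Ross-Rubinstein binomial tree."""
    dt = maturity / steps
    up = exp(sigma * sqrt(dt))
    down = 1.0 / up
    prob_up = (exp(rate * dt) - down) / (up - down)  # risk-neutral probability
    discount = exp(-rate * dt)

    # Payoffs at expiry; node j has had j up moves and (steps - j) down moves.
    values = [
        max(spot * up ** j * down ** (steps - j) - strike, 0.0)
        for j in range(steps + 1)
    ]
    # Roll back one step at a time until only the root node remains.
    for step in range(steps, 0, -1):
        values = [
            discount * (prob_up * values[j + 1] + (1.0 - prob_up) * values[j])
            for j in range(step)
        ]
    return values[0]


# Hypothetical inputs: spot 100, strike 100, 5% rate, 20% volatility, 1 year.
for n in (10, 100, 1000):
    print(n, round(crr_european_call(100, 100, 0.05, 0.2, 1.0, n), 4))
```

The work grows with the square of the number of steps, which is part of why a cell-by-cell spreadsheet tree becomes slow. An American option would need an early-exercise check inside the roll-back loop, which this sketch leaves out.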

Weighted percentile calculation from group of percentiles

Can we calculate the overall kth percentile for a time period if we only have the kth percentile of each 1-minute window within that period?
The underlying data is not available. Only the kth percentile and count of underlying data is available.
Are there any existing algorithms available for this?
How approximate will the calculated kth percentile be?

No. If you have only one percentile (and count) for every time period, then you cannot reasonably estimate that same percentile for the entire time period.
This is because percentiles are only semi-numerical measures (like medians) and don't implicitly tell you enough about their distributions above and below their measured values at each measurement time. There are a few exceptions to the above:
1. If the percentile that you have is the 50th percentile (i.e., the median), then you can do some extrapolation to the median of the whole period, but it's a bit sketchy and I'm not sure how bad the variance would be.
2. If all of your percentile measures are very close together (compared to the actual range of the measured population), then obviously you can use that as a reasonable estimate of the overall percentile.
3. If you can assume with high assurance that every minute's data is an independent sampling of the exact same population distribution (i.e., there is no time-dependence), then you may be able to combine them, possibly even if the exact distribution is not fully known (it has parameters that are unknown, but still known to be fixed over the time period). Again, I am not sure what the valid functions and variance calculations are for this.
4. If the distribution is known (or can be assumed) to be a specific function or shape with some unknown value or values, and time-dependence has a known role in that function, then you should be able to use weighting and time adjustments to transform it into the same situation as #3 above. So for instance, if the distributions were a time-varying exponential distribution of the form pdf(x; k, t) = (k*t) * e^(-(k*t)*x), then I believe that you could derive an overall percentile estimate by estimating the value of k and adjusting for each different minute (t).
Unfortunately I am not a professional statistician. I have Math/CS background, enough to have some idea of what's mathematically possible/reasonable, but not enough to tell exactly how to do it. If you think that your situation falls into one of the above categories, then you might be able to take it to https://stats.stackexchange.com but you will need to also provide the information I mentioned in those categories and/or detailed and specific information about what you are measuring and how you are measuring it.
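To get a feel for how far off a naive combination can be, here is a small simulation sketch. It takes the count-weighted mean of the per-window percentiles, which is only a heuristic and not an exact method, and compares it with the percentile of the pooled data. The synthetic exponential data, the window sizes, and the function names are all assumptions for illustration.

```python
import random
from statistics import quantiles


def percentile(values, k):
    """kth percentile (k in 1..99) using the statistics module's default method."""
    return quantiles(values, n=100)[k - 1]


def count_weighted_estimate(window_percentiles, window_counts):
    """Count-weighted mean of per-window percentiles (a heuristic only)."""
    total = sum(window_counts)
    return sum(p * c for p, c in zip(window_percentiles, window_counts)) / total


random.seed(0)
# Hypothetical data: 60 one-minute windows drawn from one fixed distribution.
windows = [
    [random.expovariate(1.0) for _ in range(random.randint(50, 200))]
    for _ in range(60)
]
k = 95
per_window = [percentile(w, k) for w in windows]
counts = [len(w) for w in windows]

estimate = count_weighted_estimate(per_window, counts)
exact = percentile([x for w in windows for x in w], k)
print(f"weighted estimate: {estimate:.3f}  pooled percentile: {exact:.3f}")
```

Changing the simulation so that the distribution drifts from window to window shows how quickly the heuristic degrades once the "same population" assumption no longer holds.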

Based on statistical intuition, if you build an approximation for a longer time span from the discrete per-window kth percentiles, the error will be proportional to the standard deviation of the total set. [This claim may need clarification or proof.]

How to calculate features for forecasted time frame

I have a question about how to calculate features for future time frames. Consider the dataset below, and assume today is 2019-11-11. I have the last 2 years of daily data; below are the last 6 rows:
Date, Temperature, Sales
2019-11-06, 25.5, 500000
2019-11-07, 24.2, 550000
2019-11-08, 25.1, 560000
2019-11-09, 22.6, 510000
2019-11-10, 22.3, 520000
2019-11-11, 24.4, 535000
Now I have to predict Sales for 2019-11-12, 2019-11-13, and 2019-11-14. To predict sales for those dates, I have to provide the test data below to the trained machine learning model:
Date, Temperature
2019-11-12, temperatureX
2019-11-13, temperatureY
2019-11-14, temperatureZ
What should the values of temperatureX, temperatureY, and temperatureZ be, since these values also lie in the future?

There are different solutions.
I suggest you start with "Time-series Forecasting using Azure AutoML" or, to dig deeper, "Auto-train a time-series forecast model".
If you need an interpretable model, you could train a linear model (LM) or another regression model in R or Python. It might make sense to derive some features from the date, such as month, day of month, or season. This approach has the benefit that you can calculate the confidence interval.
As it is a multivariate time series, also have a look at vector autoregression; see "A Multivariate Time Series Guide to Forecasting and Modeling (with Python codes)".
If you are just interested in predicting values, you can try a recurrent neural network or LSTM. See the GitHub repository Azure/DeepLearningForTimeSeriesForecasting.
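Deriving features from the date, as suggested above, could look like the following minimal sketch. The feature names are assumptions for illustration. Unlike temperature, these features are known in advance for any future date, so they are always available at prediction time.

```python
from datetime import date


def date_features(iso_date):
    """Derive calendar features from an ISO date string such as '2019-11-12'."""
    d = date.fromisoformat(iso_date)
    return {
        "month": d.month,
        "day_of_month": d.day,
        "weekday": d.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": d.weekday() >= 5,
    }


for day in ("2019-11-12", "2019-11-13", "2019-11-14"):
    print(day, date_features(day))
```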

Easy answer? You can't predict if you don't have the independent variables that explain your target at prediction time.
That being said, you can usually get a weather forecast for at least a week ahead of any date through a simple web search. So if you don't require a very large max horizon, you can use predicted weather forecasts for your temperature values (x, y, and z). Your retraining period would then become weekly, or however far out you're able to find existing weather forecasts.
Ref: https://datascience.stackexchange.com/questions/27171/what-to-give-as-predictors-to-predict-future-values
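A minimal sketch of that idea: fit a simple least-squares line of Sales on Temperature using the six rows from the question, then plug in forecast temperatures for the three future dates. The forecast values below are made up for illustration, and six rows are far too few for a real model; the point is only to show where the weather forecast enters.

```python
# (temperature, sales) pairs taken from the six rows in the question.
history = [
    (25.5, 500000), (24.2, 550000), (25.1, 560000),
    (22.6, 510000), (22.3, 520000), (24.4, 535000),
]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(history)

# Hypothetical weather-forecast temperatures standing in for X, Y and Z.
forecast_temps = {"2019-11-12": 23.8, "2019-11-13": 24.9, "2019-11-14": 22.1}
predicted = {day: slope * t + intercept for day, t in forecast_temps.items()}
for day, sales in predicted.items():
    print(day, round(sales))
```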

Excel predicting future value

I have a large Excel file that has monthly sales per customer for January - December 2016. I want to predict what their sales will be in January 2017.

You could average each client's data and ignore the zeros with a formula like
=AVERAGEIF(D2:D12,"<>0")
D2:D12 would be the range of a single client's sales variable and it would give you a monthly average for that client that you could use for January Predicted Sales.
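For anyone working with the same data outside of Excel, the same logic is roughly the following in Python. This is a sketch; the function name and the sample values are assumptions for illustration.

```python
def average_nonzero(values):
    """Mean of the non-zero entries, mirroring the AVERAGEIF formula above."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        raise ValueError("no non-zero values to average")
    return sum(nonzero) / len(nonzero)


# Hypothetical monthly sales for one client; zeros are months without sales.
monthly_sales = [0, 10, 20, 0, 30]
print(average_nonzero(monthly_sales))
```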

You have several problems to solve:
1. Determining (a) candidate forecasting model(s) to use.
2. Organising your existing data to test whether such model(s) are actually suitable, performing such tests and selecting (a) suitable model(s). [There may be more than one model to be used, dependent on whether your data are homogeneous or not.]
3. Organising your existing data to apply your chosen model(s) for the purposes of making your prediction. (A different organisation to 2. may be required.)
Your description talks about "sales" but the data sample you provided mentions "claims". These are very different entities - sales (dependent on what type of sales) may well be as frequent as monthly, but claims are likely to be a lot less frequent. If this is the case and claims are highly infrequent, then there is little sense in trying to predict an individual customer's claim. In such a case it would make more sense to predict the aggregate level of claims across a group of customers.
With all modelling, and particularly with forecasting models, context is highly important in steering towards which particular types of model are likely to be suitable. As it is, you have provided no context about what your data really represents, so are unlikely (beyond random chance) to find that any solution offered to you is actually going to be suitable. A solution might compute but, in the context in which you are operating, will it provide anything like a sensible or justifiable set of forecasts?

The "AverageIf" solution may be sufficient; however, you may be able to do better if there are in fact any trends or seasonality in the data that could be used to advantage in modeling. For each customer, I would check for autocorrelation in the data. "Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them." (https://en.wikipedia.org/wiki/Autocorrelation) For instance, if there is significant autocorrelation at lag = 12, this would suggest yearly seasonality in the data (maybe every January is similar). There is a nice tutorial on analyzing autocorrelation in Excel at:
http://www.real-statistics.com/time-series-analysis/stochastic-processes/autocorrelation-function/
If autocorrelation does exist, it would likely then be useful to perform regression with the time component(s). If there is a trend with time in addition to a cyclical component, that should also be factored into the regression (e.g., as a "Year" variable); or a more sophisticated time series method that accommodates trend and autocorrelation could be applied, such as an Autoregressive Integrated Moving Average (ARIMA) model:
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
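The autocorrelation check described above can be sketched in a few lines of Python using the usual sample autocorrelation formula. The three years of synthetic monthly values are an assumption for illustration. Note that a lag-12 autocorrelation needs more than 12 observations, so it cannot be estimated from a single year of monthly data.

```python
from math import pi, sin


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    m = sum(series) / n
    denominator = sum((x - m) ** 2 for x in series)
    numerator = sum((series[i] - m) * (series[i + lag] - m) for i in range(n - lag))
    return numerator / denominator


# Hypothetical series: 36 monthly values with a yearly cycle.
monthly = [100 + 20 * sin(2 * pi * i / 12) for i in range(36)]
for lag in (1, 6, 12):
    print(lag, round(autocorrelation(monthly, lag), 3))
```

A clearly positive value at lag 12 together with a negative value at lag 6 is the kind of pattern that points to yearly seasonality.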

Excel has a forecasting function that might help:
FORECAST.ETS function
Calculates or predicts a future value based on existing (historical) values by using the AAA version of the Exponential Smoothing (ETS) algorithm. The predicted value is a continuation of the historical values in the specified target date, which should be a continuation of the timeline. You can use this function to predict future sales, inventory requirements, or consumer trends.
This function requires the timeline to be organized with a constant step between the different points. For example, that could be a monthly timeline with values on the 1st of every month, a yearly timeline, or a timeline of numerical indices. For this type of timeline, it’s very useful to aggregate raw detailed data before you apply the forecast, which produces more accurate forecast results as well.
Syntax
FORECAST.ETS(target_date, values, timeline, [seasonality], [data_completion], [aggregation])
You can see it in action in the sample workbook available for download from the FORECAST.ETS.SEASONALITY page.
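To give an idea of what exponential smoothing with a trend does, here is a minimal sketch of Holt's linear (additive trend) method in Python. It is not Excel's implementation: it has no seasonal component, and the smoothing parameters are fixed by hand rather than optimized. The function name, the default parameters, and the sample values are assumptions for illustration.

```python
def holt_forecast(values, alpha=0.5, beta=0.3, horizon=1):
    """Forecast `horizon` steps ahead with Holt's additive-trend smoothing."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = values[0]
    trend = values[1] - values[0]
    for y in values[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * y + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + horizon * trend


# Hypothetical monthly sales for one customer.
sales_2016 = [120, 130, 128, 140, 150, 149, 160, 170, 168, 180, 190, 195]
print(round(holt_forecast(sales_2016), 1))  # estimate for the next month
```

On a perfectly linear series the method simply continues the line, which is a quick sanity check for the implementation.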
