The problem I am currently working on has data for 27 customers and the total purchase amount each of them transacted in every month of 2021, from January through September. The data looks like the image attached to this question/post.
sample dataset
I could simply use the average to find the next value, but that would not be very precise. In the absence of any other data or features/columns, is that the only way to solve this, or are there other methods anyone can suggest? Note: both Excel and/or Python examples are fine.
Additional note: I have already tried the FORECAST functions in Excel, but I am not sure whether the outcome is correct, since the Microsoft documentation merely states the formula the function uses for its calculations. Excel provides five FORECAST.* functions in total, but the documentation is poor, so if tomorrow I want to reproduce the same solution in Python or any other programming language, I would not know how to do it.
Taking a cursory glance at the data, there may be complexity I'm missing, like seasonality, trend, noise, or outliers, but let's just assume that this data is a simple trend line for each client.
At a purely high level, Excel can do this with a simple FORECAST.ETS(target_date, values, timeline, [seasonality], [data_completion], [aggregation]) formula.
This can be streamlined with Excel's built-in Data tool, Forecast Sheet.
I could talk about Python, but that's a little more hands-on with a time series forecast.
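For example, under that simple-trend assumption, a minimal Python sketch might fit a straight line per client and extrapolate one month ahead. The file name and month column names below are placeholders, not anything from the original post:

import numpy as np
import pandas as pd

# Hypothetical export of the data: one row per client, one column per month.
df = pd.read_csv("purchases_2021.csv")
month_cols = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"]

def forecast_next(row):
    """Fit a straight line through the nine monthly values and extrapolate one step."""
    x = np.arange(len(month_cols))                     # 0..8 for Jan..Sep
    slope, intercept = np.polyfit(x, row[month_cols].astype(float).values, deg=1)
    return slope * len(month_cols) + intercept         # predicted value for October

df["Oct_forecast"] = df.apply(forecast_next, axis=1)
print(df["Oct_forecast"].head())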
Related
An example: the time someone left home and the time someone called 9-1-1, with those points used to predict, ideally, the time of the incident in Excel. I can put a time in column A and a time in column B, but all it does is give me the halfway point between the two. For example, if column A says 12:00 and column B says 1:00, the result would be 12:30. If I can get something more predictive using this approach, that is ideally what I'm looking for.
I used some of the standard functions in Excel to predict time-series data.
We were looking at predicting data points for 1mis, 3mis and 6mis (mis = Months In Service).
We found that the FORECAST() function with some "fiddle" factors (sorry, finely tuned polynomial assumptions) gave a reasonable prediction for our needs. We fed it increasing steps of historical data to check its performance until it was suitable for what we needed.
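As a hedged Python illustration of that "feed it steps of history" idea (the numbers below are invented, and the polynomial degree is simply the knob the answer calls a fiddle factor):

import numpy as np

# Nine invented monthly values standing in for real history.
values = np.array([120, 135, 150, 160, 170, 165, 180, 190, 200], dtype=float)

for k in range(5, len(values)):
    x = np.arange(k)
    coeffs = np.polyfit(x, values[:k], deg=2)   # the degree is the "fiddle factor" to tune
    pred = np.poly1d(coeffs)(k)                 # predict the next, unseen month
    print(f"history={k} months  predicted={pred:.1f}  actual={values[k]:.1f}")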
I am trying to create a forecast using a monthly timeseries data set of marketing expenses for a fictional company. The data looks something like this:
Using linear regression to forecast future sales, I get the following result:
My problem lies with the seasonality of the marketing expenses (higher in the summer months, for instance). I would ideally like to calculate the forecasted values of future months including seasonality. I read somewhere about ARIMA forecasting, but am really searching for some good ideas on how to accomplish the task.
To be clear, I do not JUST want a chart and trendline, but the data to support it too.
Any help would be much appreciated!
You can do that easily using Excel (2016) Forecast tool by first selecting your data, then clicking on:
Data -> Forecast Sheet -> Options -> Set Manually (under Seasonality)
You can also play with the options. Once you click on "Create", Excel will generate a graph, and a table with relevant data.
Alternatively, you can create a binary variable for each season and run a multiple regression of the marketing expenses on time and on each of the binary season variables but one (which serves as the reference group). You could either use Excel's analysis tools or any other statistical software.
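If you prefer to run that dummy-variable regression in Python rather than Excel, a minimal sketch with statsmodels might look like this (the file name and the "date"/"expense" column names are assumptions):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical monthly expenses file with "date" and "expense" columns.
df = pd.read_csv("marketing_expenses.csv", parse_dates=["date"])
df["t"] = range(len(df))              # linear time trend
df["month"] = df["date"].dt.month     # 1..12

# C(month) expands into dummy variables, with one month left out as the reference group.
model = smf.ols("expense ~ t + C(month)", data=df).fit()
print(model.summary())

# For forecasts, build future rows with the same "t" and "month" columns
# and pass them to model.predict(...).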
I have a large Excel file that has monthly sales per customer for January through December 2016. I want to predict what their sales will be in January 2017.
You could average each client's data and ignore the zeros with a formula like
=AVERAGEIF(D2:D12,"<>0")
D2:D12 would be the range of a single client's sales variable and it would give you a monthly average for that client that you could use for January Predicted Sales.
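If the file is easier to handle outside Excel, a rough pandas equivalent of that AVERAGEIF idea could be the following (the file name and wide layout are assumptions):

import numpy as np
import pandas as pd

# Hypothetical wide file: one row per customer, one column per month of 2016.
df = pd.read_csv("sales_2016.csv", index_col="customer")

# Replace zeros with NaN so mean() skips them, mirroring AVERAGEIF(...,"<>0").
pred_jan_2017 = df.replace(0, np.nan).mean(axis=1)
print(pred_jan_2017.head())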
You have several problems to solve:
1. Determining (a) candidate forecasting model(s) to use.
2. Organising your existing data to test whether such model(s) are actually suitable, performing such tests, and selecting (a) suitable model(s). [There may be more than one model to be used, depending on whether your data are homogeneous or not.]
3. Organising your existing data to apply your chosen model(s) for the purposes of making your prediction. (A different organisation to 2. may be required.)
Your description talks about "sales" but the data sample you provided mentions "claims". These are very different entities - sales (dependent on what type of sales) may well be as frequent as monthly, but claims are likely to be a lot less frequent. If this is the case and claims are highly infrequent, then there is little sense in trying to predict an individual customer's claim. In such a case it would make more sense to predict the aggregate level of claims across a group of customers.
With all modelling, and particularly with forecasting models, context is highly important in steering towards which particular types of model are likely to be suitable. As it is, you have provided no context about what your data really represents, so are unlikely (beyond random chance) to find that any solution offered to you is actually going to be suitable. A solution might compute but, in the context in which you are operating, will it provide anything like a sensible or justifiable set of forecasts?
The "AverageIf" solution may be sufficient; however, you may be able to do better if there is in fact any trends/seasonality in the data that could be used to modeling advantage. For each customer, I would check for autocorrelation in the data. "Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them."(https://en.wikipedia.org/wiki/Autocorrelation) For instance, if there is significant autocorrelation at lag = 12, this would suggest yearly seasonality in the data (maybe every January is similar). There is a nice tutorial to analyze autocorrelation in Excel at:
http://www.real-statistics.com/time-series-analysis/stochastic-processes/autocorrelation-function/
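For the same check in Python, a small sketch (with invented values standing in for a real customer's series) could use pandas' Series.autocorr:

import numpy as np
import pandas as pd

# Three years of invented monthly sales with a mild yearly cycle,
# just to show the check; substitute the real customer's series here.
months = np.arange(36)
rng = np.random.default_rng(0)
sales = pd.Series(100 + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 36))

# Correlation of the series with itself shifted by 12 months;
# a value close to +1 hints at yearly seasonality.
print(sales.autocorr(lag=12))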
If autocorrelation does exist, it would likely be useful to perform a regression with that time component. If there is a trend with time in addition to a cyclical component, that should also be factored into the regression (e.g., as a "Year" variable); or a more sophisticated time series method could be applied that accommodates trend and autocorrelation, such as an Autoregressive Integrated Moving Average (ARIMA) model:
https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average
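A minimal ARIMA sketch in Python with statsmodels, purely as an illustration (the file/column names and the model orders are assumptions to be tuned):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly expense series indexed by date.
series = pd.read_csv("expenses.csv", parse_dates=["date"], index_col="date")["expense"]

# (1, 1, 1) with a yearly seasonal AR term is only an illustrative starting point.
model = ARIMA(series, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12)).fit()
print(model.forecast(steps=6))        # next six months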
Excel has a forecasting function that might help:
FORECAST.ETS function
Calculates or predicts a future value based on existing (historical) values by using the AAA version of the Exponential Smoothing (ETS) algorithm. The predicted value is a continuation of the historical values in the specified target date, which should be a continuation of the timeline. You can use this function to predict future sales, inventory requirements, or consumer trends.
This function requires the timeline to be organized with a constant step between the different points. For example, that could be a monthly timeline with values on the 1st of every month, a yearly timeline, or a timeline of numerical indices. For this type of timeline, it’s very useful to aggregate raw detailed data before you apply the forecast, which produces more accurate forecast results as well.
Syntax
FORECAST.ETS(target_date, values, timeline, [seasonality], [data_completion], [aggregation])
And you can see it in action in a workbook from the FORECAST.ETS.SEASONALITY page:
Download a sample workbook
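If you later want something comparable outside Excel, a rough Python counterpart to FORECAST.ETS is Holt-Winters exponential smoothing in statsmodels; the file/column names and the 12-month seasonality below are assumptions:

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly sales series indexed by date.
series = pd.read_csv("sales.csv", parse_dates=["date"], index_col="date")["sales"]

# Additive trend and additive yearly seasonality, roughly in the spirit of AAA ETS.
fit = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))               # twelve months ahead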
How Can I Model Multiple Short Time Series Samples?
For example, let's say I have a new subject each month, and I measure each subject every day for the entire month. I then want to model these multiple strings of independent time series because I assume that there is an underlying pattern that applies to all 12 subjects. However, a time series with an n of 30 is too short to model, so is there some way to group these 12 time series together for a parallel analysis?
I imagine the way to handle this is similar to how one might handle a time series with multiple breaks of unknown length. Unfortunately, I am unaware of how to deal with this type of data structure.
Any thoughts on where to even begin? What terms I should research?
Well, it depends on what you're interested in. It makes it a lot easier if we know what kind of data you have and what you're trying to analyse.
Trying to answer your question: if you assume that there is some underlying structure which is homogeneous for, say, 6 of the subjects, and different for the other half, you can just pool the two data sets and do some kind of group-mean analysis. If you're interested in a temporal change over the 12 months, then you need to assume that each subject is homogeneous across whatever variable you're measuring.
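As a small illustration of that pooling idea, assuming the 12 subjects' daily values sit in a wide table (the file layout and column names are invented):

import pandas as pd

# Hypothetical wide layout: rows are days 1..30, columns are subject_1..subject_12.
wide = pd.read_csv("subjects.csv", index_col="day")

# Stack into long format: one row per (day, subject, value).
long = wide.stack().rename("value").reset_index()
long.columns = ["day", "subject", "value"]

# Group-mean analysis: the average within-month pattern pooled across subjects.
pooled_pattern = long.groupby("day")["value"].mean()
print(pooled_pattern)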
Normally, for time series in economics for example, what you're describing is called "censored" or "truncated" data.
If we want to measure the income of everyone in a country, we do this by checking electronic paychecks or something. But some people at the ends of the tails may not have a visible income: poor people may be earning income in other ways, and rich people may want to hide some of their income. This is censored data, and any advanced time series statistics book will have something on it.
Truncated data is similar. Imagine income again: if we truncate everyone who makes less than $10,000 a year, then this will "cut off the end" of your distribution. There are also remedies for this; again, check an advanced time series book.
Hope this helped a bit.
I use GV for business and have since about 2011. Over that time I've amassed about 10,000 calls with various clients. I'd like to analyze this data to understand things like what days of the week did I have the most calls, what months had the highest call volume, what hour of the day has the highest call volume, et cetera. (Eventually I would also like to compare that to my Google Calendar data to analyze my conversion rates for a given month, but that's step 2)
My question is, is there any easy way to do this short of actually learning to use Excel? Are there any free or relatively cheap statistics programs that will cut some of the work out for me? It's easy enough to clean the data and drop it into Excel, but there are so many intermediary steps between having a good clean data set and actually getting a histogram out of it that it's starting to feel like it isn't worth it.
I have a list of about 10k calls in this format:
Col.A Col.B Col.C
client date 24hr time
I'm not particularly concerned with who the client is... I just want to analyze the second two columns.
Any help at all would be greatly appreciated.
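A minimal pandas sketch of those counts, assuming the cleaned file has the three columns described above (the file name, column names, and header handling are assumptions):

import pandas as pd

# Hypothetical cleaned export with the three columns described above.
calls = pd.read_csv("calls.csv", header=None, names=["client", "date", "time"])

# Combine date and time into one timestamp; a format string may be needed
# depending on how the call log exports them.
when = pd.to_datetime(calls["date"] + " " + calls["time"])

print(when.dt.day_name().value_counts())            # calls per day of week
print(when.dt.month.value_counts().sort_index())    # calls per month
print(when.dt.hour.value_counts().sort_index())     # calls per hour of day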