Statsmodels seasonal decomposition - Trend not a straight line - python-3.x

This query refers to the decomposition of the classic airline passengers data into Trend, Seasonal and Residual components. We expect the trend to be a straight line. However, the result is not so. I wonder what the logic behind the extraction of the trend is. Can you please shed some light?
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(airline['Thousands of Passengers'], model='additive')
result.plot();

Two things to clarify:
1) Not all trends are linear
2) Even linear trends can be subject to some variation depending on the time series in question.
For instance, consider the trend for maximum air temperature in Dublin, Ireland over a number of years (modelled using statsmodels). In that example, the trend both ascends and descends - given that air temperature is subject to changing seasons, we would expect this.
In terms of the airline dataset, the trend is observed over a number of years. Even once the seasonal and residual components have been extracted from the observed series, the trend itself will still be subject to shifts over time.
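For what it's worth, the trend in seasonal_decompose is not obtained by fitting a line at all: by default it is estimated with a centred moving average whose window equals the seasonal period, which is why it tracks the data rather than coming out straight. Below is a rough sketch that reproduces that default by hand; the airline DataFrame and column name are taken from the question, and the 2x12 filter weights reflect the statsmodels default for an even period (treat this as an approximation of the library's behaviour, not a definitive reference).
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

series = airline['Thousands of Passengers']                        # DataFrame from the question
result = seasonal_decompose(series, model='additive', period=12)   # period=12 for monthly data (older versions use freq=12)

# Default trend filter for an even period: a 2x12 centred moving average
# with weights [0.5, 1, ..., 1, 0.5] / 12
weights = np.array([0.5] + [1.0] * 11 + [0.5]) / 12
manual_trend = series.rolling(window=13, center=True).apply(
    lambda window: np.dot(window, weights), raw=True
)

# Apart from the NaN padding at both ends, the two estimates should match closely
print((result.trend - manual_trend).abs().max())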

Related

How to predict something along with dates in python?

I have time series data with two columns: traffic density and date. I wish to predict the density for the next 7 days.
I am using ARIMA for time series forecasting. I am able to forecast the density, but I want to forecast the density along with the dates. How can this be done?
Go with an RNN (LSTM) or FBProphet.
Here's a good piece of work for FBProphet:
https://towardsdatascience.com/a-quick-start-of-time-series-forecasting-with-a-practical-example-using-fb-prophet-31c4447a2274
Here's a good piece of work for LSTM:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
However, you can also look into ARIMA variants.
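To get the forecast paired with dates using ARIMA, one option is to give statsmodels a series with a DatetimeIndex and a set frequency, so the forecast index itself carries the future dates. A minimal sketch, assuming a CSV with 'date' and 'density' columns and one observation per day (the file name and the ARIMA order are placeholders):
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv('traffic.csv', parse_dates=['date'])     # hypothetical file
series = df.set_index('date')['density'].asfreq('D')      # daily frequency assumed, no gaps

# The (p, d, q) order is illustrative only; tune it for your data
model = ARIMA(series, order=(1, 1, 1)).fit()

# Because the series has a dated index, the forecast is indexed by the next 7 dates
forecast = model.forecast(steps=7)
print(forecast)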

Convert GMM-UBM scores to equivalent accuracy percent

I have constructed a GMM-UBM model for speaker recognition. The output of the models adapted for each speaker is a set of scores calculated as log-likelihood ratios. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentages of your samples will not be uniformly distributed.
Ideally you need to take a large dataset and collect statistics on what acceptance/rejection rate you get for each score. Then, once you build a histogram, you can normalize the score difference by that histogram to make sure that, say, 30% of your subjects are accepted when you see a certain score difference. That normalization will allow you to create uniformly distributed probability percentages. See for example How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes
This problem is rarely solved in speaker identification systems because a confidence interval is not what you actually want to display. You need a simple accept/reject decision, and for that you need to know the false-reject and false-accept rates. So it is enough to find just a threshold; you do not need to build the whole distribution.
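As a rough illustration of the histogram-based normalisation described above, you can map each raw log-likelihood-ratio score to its empirical percentile among scores collected on a large calibration set, which yields a number between 0 and 100. The calibration scores below are synthetic placeholders, not real data:
import numpy as np

# Placeholder calibration scores; in practice, collect these from a large held-out dataset
calibration_scores = np.random.normal(loc=0.0, scale=2.0, size=10_000)

def score_to_percent(score, calibration_scores):
    # Empirical percentile of the score among the calibration scores (0-100)
    return 100.0 * np.mean(calibration_scores <= score)

print(score_to_percent(1.5, calibration_scores))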

Tableau, Scatterplot with Trendlines to show confidence levels

I have scatterplot in Tableau, and I displayed trendlines. However, I cannot understand why there are three of them.
When I research this on Tableau, they say upper line is upper 95% confidence, and lower line is lower 95% confidence.
When I think of confidence levels, I think of 100 black and white marbles: if I take a sample and look at the ratio, I can say that 95% of the time the proportion of white marbles will be between 40% and 60%.
And to create confidence bounds, I might instead say that 92% or 98% of the time the proportion of white marbles will be between 40% and 60%.
But I'm having difficulty translating this to tableau trendlines. Please advise.
Think of your data set as just one random sample drawn from a larger population of possible data sets. You could have sampled another time or place or in a parallel universe.
If you could build a scatter plot for the entire population, it would have a best fit trend line also. You can think of your trend line as a sample trend line attempting to estimate this true population trend line.
Now imagine you actually did collect many different sample data sets from that same population. Also imagine you used identical procedures to create scatter plots and trend lines for hundreds or thousands of these data sets (samples). Different samples would lead to (slightly?) different trend lines in each plot.
The confidence bands are constructed in such a way that you can expect them to enclose the true population trend line in 95% of your samples.
You are using statistical inference to estimate the confidence in the population trend model parameters, all based on the Central Limit Theorem.
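If it helps to see the same idea outside Tableau, here is a small sketch with statsmodels that fits a trend line and requests the 95% confidence band of the fitted mean - the analogue of Tableau's upper and lower lines. The data are synthetic placeholders:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)   # noisy linear data

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# 95% confidence band for the fitted trend line (the middle line is the fit itself)
pred = fit.get_prediction(X).summary_frame(alpha=0.05)
print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper']].head())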

Excel Graphing help

I didn't know what stack exchange site to put this on, so I put it here. I am trying to determine if there is a correlation between the size of a school and the major that the school specializes in.
In order to do this, I programmatically collected and analyzed data. To make my report, I need to make a few graphs in Excel, but I have no clue how to do this.
What I'm looking for is a scatter plot with quantitative values on the Y-axis (the school size) and qualitative values on the X-axis; I would like every major listed out (kind of like a bar graph). From there, I want to plot a point above the major that a school specializes in, and have that point be as high as its student size.
Any help?
Edit:
Here is my sample data set. I want it to have categories that are to the right of the data, and points on the graph that correspond.
When you say "correlation" between X and Y, I think regression.
I would recommend doing an X-Y scatter plot and asking Excel to add a trend line. Not only will you get a least-squares fit for the "best" line for your data, you'll also get the correlation coefficient, which tells you whether or not there's a relationship. The correlation coefficient ranges from -1 to +1; the closer its absolute value is to 1.0, the stronger the relationship.
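For reference, the same least-squares fit and correlation coefficient that Excel reports can be checked with a few lines of Python; the school-size and major values below are made-up placeholders:
import numpy as np

sizes = np.array([1200.0, 5300.0, 800.0, 15000.0, 9700.0])   # hypothetical school sizes
majors = np.array([0.0, 1.0, 2.0, 1.0, 0.0])                 # hypothetical category codes

slope, intercept = np.polyfit(majors, sizes, deg=1)   # least-squares line
r = np.corrcoef(majors, sizes)[0, 1]                  # correlation coefficient

print(f"size = {slope:.1f} * major + {intercept:.1f}, r = {r:.2f}")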

From one histogram, create a new histogram from just a mean or median?

Suppose I have a list of values that I can histogram and calculate descriptive statistics on, such as the mean, max, standard deviation, etc. Perhaps this histogram is bimodal or right-skewed. Let's call this group of data "DataSet1".
Suppose I had just the mean or median of another set of data. Let's call that DataSet2. I do not have all the raw data for DataSet2, just the median or mean. There is a strong belief that DataSet1 and DataSet2 would show the same variability in values.
If I knew just a single value of either the mean or median, can I apply the descriptive statistics from DataSet1 to create a new histogram that mirrors the bimodal or right-skewed behavior of DataSet1?
Thanks
Dan
Alternative intent:
I have 3 years of historical data, and the data definitely has a "day of week" trend to it. I am using a Python API to apply seasonal ARIMA to forecast the next 7 days from the 3 years of historical data. The predicted value is great, but it is only one value. I would like to use that predicted value as the "mean" and create a histogram from the variability of values shown to exist historically for that day of the week.
So, today is Thursday. Let's say I predict tomorrow to have a value of 78.6.
I want to sample potential values for tomorrow based on a mean of 78.6, but with variability similar to that shown to exist on all historical Fridays.
If I look at historical Fridays, perhaps they show left-skewed behavior;
so when I sample with a mean of 78.6, if I sampled 100 times, the sampled values, if plotted in a histogram, would also skew to the left.
Hope that helps.
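One hedged sketch of what is being described: re-centre the historical Friday values on the forecast mean and bootstrap-sample from them, which preserves the historical skew. The Friday values here are made-up placeholders:
import numpy as np

historical_fridays = np.array([60.2, 65.0, 71.3, 55.8, 62.1, 58.4, 69.9])   # placeholder data
forecast_mean = 78.6

# Shift the historical distribution so its mean equals the forecast,
# then resample from it; the shape (skew) of the histogram is preserved
shifted = historical_fridays - historical_fridays.mean() + forecast_mean
samples = np.random.choice(shifted, size=100, replace=True)

print(samples.mean())   # close to 78.6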