From one histogram, create a new histogram from just a mean or median?

Suppose I have a list of values that I can histogram and calculate descriptive statistics on, such as the mean, median, max, and standard deviation. Perhaps this histogram is bimodal or right-skewed. Let's call this group of data "DataSet1".
Suppose I had just the mean or median of another set of data. Let's call that DataSet2. I do not have the raw data for DataSet2, just the median or mean. There is a strong belief that DataSet1 and DataSet2 would show the same variability in values.
If I knew just a single value, either the mean or the median, could I apply the descriptive statistics from DataSet1 to create a new histogram that mirrors the bimodal or right-skewed behavior of DataSet1?
Thanks
Dan
Alternative intent:
I have 3 years of historical data, where the data definitely has a "day of week" trend to it. I am using a Python API to apply seasonal ARIMA to forecast the next 7 days from the 3 years of historical data. The predicted value is great, but it is only one value. I would like to use that predicted value as the "mean" and create a histogram from the variability of values shown to exist historically by day of week.
So, today is Thursday. Let's say I predict tomorrow to have a value of 78.6.
I want to sample potential values for tomorrow based upon a mean of 78.6, but with variability similar to that shown to exist on all historical Fridays.
If I look at historical Fridays, perhaps they show skewed-to-the-left behavior.
So when I sample with a mean of 78.6, if I sampled 100 times, the sampled values, plotted in a histogram, would also skew to the left.
Hope that helps.
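One common way to do exactly this is to recentre the historical deviations on the forecast and resample from them. Below is a minimal Python sketch; the gamma-distributed fridays array is only a stand-in for your real historical Friday values:

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the historical Friday observations; replace with real data.
fridays = rng.gamma(shape=2.0, scale=10.0, size=500)

forecast_mean = 78.6  # the seasonal-ARIMA point forecast for tomorrow

# Shift the historical deviations-from-the-mean onto the forecast mean,
# then resample with replacement: the samples keep the historical shape
# (skew, modality) but are centred on the new mean.
deviations = fridays - fridays.mean()
samples = forecast_mean + rng.choice(deviations, size=100, replace=True)

A histogram of samples will have the same shape as the historical Fridays, just shifted so its mean sits at 78.6. If you want the median pinned to the forecast instead, subtract np.median(fridays) rather than the mean.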

Related

Normalisation or Standardisation for detecting outliers?

When should I use min-max scaling (normalisation) and when should I use standardisation (the z-score) for data pre-processing?
I know that normalisation brings the range of a feature down to [0, 1], and the z-score roughly down to [-3, 3], but I am unsure which of the two techniques to use for detecting outliers in the data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, a plain normalization does indeed limit the range of the possible outcomes, but will not help you find outliers, since it just bounds the data.
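To see both points concretely, here is a quick sketch on synthetic long-tailed data (not data from the question):

import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.5, size=10_000)  # long-tailed sample

z = (x - x.mean()) / x.std()              # standardisation (z-score)
mm = (x - x.min()) / (x.max() - x.min())  # min-max scaling

print(z.min(), z.max())    # the largest z-score lands well above 3
print(mm.min(), mm.max())  # always exactly 0.0 and 1.0, outliers included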
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many plotting libraries offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
The box [of the box plot] always spans the first and third quartiles, and the band inside the box is the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
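The most common whisker default in plotting libraries is Tukey's fences at 1.5 × IQR beyond the quartiles; a minimal sketch of that rule:

import numpy as np

def iqr_outliers(x, k=1.5):
    # Return the points outside the whiskers [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

rng = np.random.default_rng(1)
data = rng.lognormal(sigma=1.0, size=1000)  # long-tailed sample
print(iqr_outliers(data))  # the points a box plot would draw individually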

How to predict something along with dates in Python?

I have time-series data; the two columns are traffic density and date. I wish to predict the density for the next 7 days.
I am using ARIMA time-series forecasting. I am able to forecast the density, but I want to forecast the density along with its time. How can it be done?
Go with an RNN (LSTM) or FBProphet.
Here's a good piece of work for FBProphet:
https://towardsdatascience.com/a-quick-start-of-time-series-forecasting-with-a-practical-example-using-fb-prophet-31c4447a2274
Here's a good piece of work for LSTM:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
However, you can also look into ARIMA variants.
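If you stay with ARIMA, getting dates alongside the forecast mostly comes down to keeping a DatetimeIndex on the series: the forecast then comes back already paired with the next 7 dates. A sketch using statsmodels' SARIMAX, where the file name traffic.csv, the column names, and the model orders are all assumptions:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Daily traffic density indexed by date (file and column names assumed).
df = pd.read_csv("traffic.csv", parse_dates=["date"], index_col="date")
density = df["density"].asfreq("D")

model = SARIMAX(density, order=(1, 1, 1), seasonal_order=(1, 0, 1, 7))
fit = model.fit(disp=False)

# Because the series carries a dated index, forecast() returns a Series
# indexed by the next 7 calendar dates: value and time together.
print(fit.forecast(steps=7))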

Correlation statistics

Naive Question:
In the attached snapshot, I am trying to figure out the correlation concept when applied to actual values and to a calculation performed on those actual values that creates a new stream of data.
In the example,
columns A, B, C, D, E have very different correlations, but when I do a rolling sum on the same columns to get G, H, I, J, K, the correlations are very much the same (strongly negative or positive).
Are these two different types of correlation, or am I missing something?
Thanks in advance!!
Yes, these are different correlations. It's as if you were to measure the acceleration over time of 5 automobiles (your first set of data) and correlate those accelerations. Each car accelerates at a different rate over time, leaving your correlations all over the place.
Your second set of data would be the velocity of each car at each point in time. Because each car is accelerating at a pretty constant rate (and doing so in two different directions from the starting point), you either get a big positive or a big negative correlation.
It's not inevitable that you get a big positive or big negative correlation in the second set; but since the data in each list is consistently positive or negative and grows at a consistent rate, each list correlates strongly with the others. A rolling sum accumulates values the way integrating acceleration gives velocity, so it smooths the noise and turns any persistent sign into a trend that dominates the correlation.
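You can reproduce the effect with synthetic data (a sketch, not the asker's spreadsheet). Note it uses a cumulative sum to mimic the integration in the car analogy; a rolling sum over a long window behaves similarly on trending data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# "Accelerations": noisy columns, each with a small drift of random sign.
drift = rng.choice([-0.2, 0.2], size=5)
raw = pd.DataFrame(rng.normal(0, 0.5, size=(500, 5)) + drift,
                   columns=list("ABCDE"))
print(raw.corr().round(2))     # scattered, near-zero correlations

# "Velocities": accumulating turns each drift into a steady trend, so
# every pair of columns correlates strongly, positively or negatively.
summed = raw.cumsum()
print(summed.corr().round(2))  # big positive or big negative values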

Excel - Plot average of two plots with inconsistent time (X) axis

I have managed to plot two different data sets on the same axes; however, I'm also looking to plot another line showing their average.
The main problem is that both data sets have different X (time) values, so it's not possible to add an average column at the end and plot that. (See the highlighted row 22, for example: the corresponding Time values are different.)
Is there any way I can plot an average of two plots on the same axis?
One idea that might work is to place the values of both series, one above the other in two new columns, sort this new data according to time, smooth it, then plot the smoothed combined data. Alternatively, you could do the smoothing by simply plotting the new sorted series, adding a moving average trendline to it, then change the formatting of the new series so that it is no longer visible (but the trendline is). Something like this:
In the above picture, series 3 is the plot of the sorted aggregate data of series 1 and 2. If you change the formatting of series 3 so that there is no line, you get something like this:
For my relatively small mock data sets, the results are admittedly poor (it was based on just 25 data points in each series), but if you have a large amount of closely spaced data, and you play around with the moving average window size, you might get something acceptable. If not, you should probably just interpolate both datasets to obtain two consistent time series.
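The same stack-sort-smooth recipe is easy to express outside Excel as well; here is a pandas sketch in which the file, sheet, and column names are all assumptions:

import pandas as pd

# Two (Time, Value) series with different time stamps.
s1 = pd.read_excel("data.xlsx", sheet_name="Series1")
s2 = pd.read_excel("data.xlsx", sheet_name="Series2")

# Stack both series one above the other, sort by time, then smooth with
# a centred moving average -- the same steps as the Excel recipe above.
combined = pd.concat([s1, s2]).sort_values("Time").reset_index(drop=True)
combined["Smoothed"] = combined["Value"].rolling(window=5, center=True).mean()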

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure what code to start with to accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of the x-values is not uniform. There are many more values close to t = 0, but at t = 5 (for example) the frequency of data points is much lower.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly distributed, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that, by its nature, interpolation will likely reduce the range of your values; i.e., the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
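Here is the interpolate-onto-a-common-grid idea as a Python/NumPy sketch, where np.interp plays the role of MATLAB's interp1 and the synthetic curves are stand-ins for your data:

import numpy as np

rng = np.random.default_rng(3)

# Stand-ins for n curves, each sampled at its own irregular times.
curves = []
for _ in range(5):
    t = np.sort(rng.uniform(0, 5, size=40))
    y = np.sin(t) + rng.normal(0, 0.1, size=t.size)
    curves.append((t, y))

# Interpolate every curve onto one common grid, then average across the
# curves; the same grid also gives a pointwise standard deviation.
grid = np.linspace(0, 5, 101)
stack = np.vstack([np.interp(grid, t, y) for t, y in curves])
mean_curve = stack.mean(axis=0)
std_curve = stack.std(axis=0)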
An alternative is to do a weighted mean; this way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the weighted mean by:
timeStep = diff(t);  % time interval covered by each sample
% sum() is needed here: without it the expression yields a vector of
% weighted values rather than a single scalar weighted mean.
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".
