How to quantify magnitude of change in dataset containing base values of 0? - statistics

I have a dataset with the low-water and high-water surface area of lakes/ponds within a delta for each year. These lakes can undergo substantial change from year to year, and sometimes can dry out completely. As such, surface area can have values of 0 during the low-water period. I'm trying to quantify the magnitude of flooding in the spring on the surface areas of these lakes. Given the high inter annual variations in surface area, I need to compare the low-water value from the previous year to the high-water value of the following year to quantify this magnitude; comparing to a mean isn't sensitive enough. However, given the low water surface area of 0 for some lakes, I cannot quantify percent change.
My current idea is to do an "inverse" of percent change (don't know how else to describe it), where I divide the low-water value by the high-water value. This gives me a scale where large change will equal 0 and little change will equal 1. However, again small changes from a surface area of 0 will be over represented. Any idea how I could accurately compare the magnitude of flooding in such a case?

Related

bollinger bands versus statistics: is 1 standard deviation supposed to be split into two halves by it's mean? or top and bottom bands from mean?

I have a question about how bollinger bands are plotted in relation to statistics. In statistics, once a standard deviation is calculated from a mean of a set of numbers, shouldn't interpreting a 1 standard deviation be done so that you divide this number is half, and plot each half above and below the mean? By doing so, you can then determine whether or not it's data points fall within this 1 standard deviation.
Then, correct me if I am wrong, but aren't bollinger bands NOT calculated this way?? Instead, it takes a 1 standard deviation (if you have set it to 1) and plots the WHOLE value both above and below the mean (not splitting in two), thereby doubling the size of this standard-deviation?
Bollinger bands loosely state that that 68% of data falls within the 1st band, 1 standard deviation (loosely because the empirical rule in statistics requires that distributions be normal distributions which most often stock prices are not). However if this empirical rule is from statistics where 1 standard deviation is split in half, that means that applying a 68% probability in to an entire bollinger band is wrong. ??? is this correct??
You can modify the deviation multiples to suite your purpose, you can use 0.5 for example.

Ratio Correction

In my study I calculate some ratios. The theoretical background is as follows:
There is the effect of Binocular Rivalry, where a different picture is presented to the left eye than to the right eye (e.g. a black and a white square). Most of the time, the test persons do not see a mixture of colours (i.e. something grey), but the picture changes back and forth, so a black square is seen once and then a white one. During the time of the trial (e.g. 60 seconds) the test persons indicate what they see (black square, white square, mixed picture). These durations can be used to calculate the predominance ratio as an indication of whether one stimulus is seen significantly more often than the other. The ratio is calculated from [T(stimulus1)-T(stimulus2)/T(stimulus1)+T(stimulus2)], where T is the cumulative time the stimulus was seen during the 60 seconds. The times for the mixed image are completely omitted from this calculation. In the end the ratio is tested if it is significantly different from zero with a one-sample t-test. If it is significantly different from zero and positive, stimulus 1 is seen longer, if it is significantly different from zero and negative, stimulus 2 is seen longer. Now I have two conditions and I calculate a predominance ratio for each.
Let us suppose that condition 1 would be the squares I mentioned above and condition 2 would be a stick figure in black and a tree in white. I want to know if there is a significant predominance ratio in the stickman/tree condition, but without the influence of the colors. Therefore I want to somehow deduct the predominance ratio from condition 1 from condition 2. So I would like to do a kind of "baseline correction". The value of this predominance ratio can vary between -1 and 1. Now my question is how to do this correction without changing the metrics of the ratio. In order to test the corrected ratio towards zero in a meaningful way, it must not take any other values than from -1 to 1.
Does anyone have an idea?
Thanks a lot!

Generate positive only distribution based on array

I have an array of data, for example:
[1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
From this data, I want to generate random numbers that fit the same distribution.
I can generate a random number using code like the following:
dist = scipy.stats.gaussian_kde(data)
randomVar = np.floor(dist.resample()[0])
This results in random number generation that includes negative numbers, which I believe I can dump fairly easily without changing the overall shape of the rest of the curve (I just generate sufficient resamples that I still have enough for purpose after dumping the negatives).
However, because the original data was positive values only - and heaped up against that boundary, I end up with a kde that is highest a short distance before it gets to zero, but then drops off sharply from there as it approaches zero; and that downward tick in the KDE is preventing me from generating appropriate numbers.
I can set the bandwidth lower, in order to get a sharper corner, closer to zero, but then due to the low quantity of the original data it ends up sawtoothing elsewhere. Higher bandwidths unfortunately hide the shape of the curve before they remove the downward tick.
As broadly suggested in the comments by Hilbert's Drinking Problem, the real solution was to find a better distribution that fit the parameters. In my case Chi-Squared, which fit both the shape of the curve, and also the fact that it only took positive values.
However in the comments Stelios made the good suggestion of using scipy.stats.rv_histogram, which I used and was satisfied with for a while. This enabled me to fit a curve to the data exactly, though it had two problems:
1) It assumes zero value in the absence of data. I.e. if you set the
settings to fit too closely to the data, then during gaps in your
data it will drop to zero rather than interpolate.
2) As an extension
to point 1, it wont extrapolate beyond the seed data's maximum and
minimum (those data ranges are effectively giant gaps, so everything
eventually zeroes out).

Correlation statistics

Naive Question:
In the attached snapshot, I am trying to figure out the correlation concept when applied to actual values and to calculation performed on those actual values and creating a new stream of data.
In the example,
Columns A,B,C,D,E have very different correlation but when I do a rolling sum on the same columns to get G,H,I,J,K the correlation is very much the same(negative or positive.
Are these to different types of correlation or am I missing out on something.
Thanks in advance!!
Yes, these are different correlations. It's similar to if you were to measure acceleration over time of 5 automobiles (your first piece of data) and correlate those accelerations. Each car accelerates at different rates over time leaving your correlation all over the place.
Your second set of data would be the velocity of each car at each point in time. Because each car is accelerating at a pretty constant rate (and doing so in two different directions from the starting point) you either get a big positive or big negative correlation.
It's not necessary that you get that big positive or big negative correlation in the second set, but since your data in each list is consistently positive or negative and grows at a consistent rate, it correlates with either similar lists.

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series' are not uniformly distributed, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) rather than the mean. If you plan on going this route, you'll need the interp1 and mean functions
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);
weightedMean = timeStep .* x(1:end-1) / sum(timeStep);
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Resources