Ratio Correction - statistics

In my study I calculate some ratios. The theoretical background is as follows:
There is the effect of binocular rivalry, where a different picture is presented to the left eye than to the right eye (e.g. a black and a white square). Most of the time, participants do not see a mixture of colours (i.e. something grey); instead the percept switches back and forth, so a black square is seen for a while and then a white one. During the trial (e.g. 60 seconds) the participants indicate what they see (black square, white square, mixed picture). These durations can be used to calculate the predominance ratio as an indication of whether one stimulus is seen significantly more than the other. The ratio is calculated as [T(stimulus1) - T(stimulus2)] / [T(stimulus1) + T(stimulus2)], where T is the cumulative time the stimulus was seen during the 60 seconds. The times for the mixed image are omitted from this calculation entirely. Finally, a one-sample t-test checks whether the ratio differs significantly from zero. If it is significantly different from zero and positive, stimulus 1 is seen longer; if it is significantly different from zero and negative, stimulus 2 is seen longer. Now I have two conditions and I calculate a predominance ratio for each.
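To make the calculation concrete, here is a minimal Python sketch of the predominance ratio and of the t statistic for a one-sample test against zero. The function names and the dominance times are made up for illustration; only the formula itself comes from the description above.

```python
from math import sqrt
from statistics import mean, stdev


def predominance_ratio(t_stim1, t_stim2):
    """(T1 - T2) / (T1 + T2); time spent on the mixed percept is left out."""
    total = t_stim1 + t_stim2
    if total == 0:
        raise ValueError("no exclusive dominance time recorded")
    return (t_stim1 - t_stim2) / total


def one_sample_t(values, popmean=0.0):
    """t statistic of a one-sample t-test against popmean (df = n - 1)."""
    n = len(values)
    return (mean(values) - popmean) / (stdev(values) / sqrt(n))


# Hypothetical cumulative dominance times in seconds, one pair per participant.
times = [(30.0, 20.0), (28.0, 22.0), (35.0, 15.0), (26.0, 24.0)]
ratios = [predominance_ratio(t1, t2) for t1, t2 in times]
t_stat = one_sample_t(ratios)
```

Each ratio lies between -1 (only stimulus 2 seen) and 1 (only stimulus 1 seen), which is the property the corrected ratio is supposed to keep.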
Let us suppose that condition 1 is the squares I mentioned above and condition 2 is a stick figure in black and a tree in white. I want to know if there is a significant predominance ratio in the stick figure/tree condition, but without the influence of the colours. Therefore I want to somehow subtract the predominance ratio of condition 1 from that of condition 2, i.e. do a kind of "baseline correction". The value of the predominance ratio can vary between -1 and 1. My question is how to do this correction without changing the scale of the ratio. In order to test the corrected ratio against zero in a meaningful way, it must not take any values outside the range -1 to 1.
Does anyone have an idea?
Thanks a lot!

Related

How to quantify magnitude of change in dataset containing base values of 0?

I have a dataset with the low-water and high-water surface area of lakes/ponds within a delta for each year. These lakes can undergo substantial change from year to year, and sometimes dry out completely. As such, surface area can have values of 0 during the low-water period. I'm trying to quantify the magnitude of spring flooding in terms of the surface areas of these lakes. Given the high interannual variation in surface area, I need to compare the low-water value from the previous year to the high-water value of the following year to quantify this magnitude; comparing to a mean isn't sensitive enough. However, given the low-water surface area of 0 for some lakes, I cannot compute percent change.
My current idea is to do an "inverse" of percent change (I don't know how else to describe it), where I divide the low-water value by the high-water value. This gives me a scale where large change equals 0 and little change equals 1. However, small changes from a surface area of 0 will again be overrepresented. Any idea how I could accurately compare the magnitude of flooding in such a case?
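A small Python sketch of the low-water/high-water ratio described above shows the concern in numbers. The function name and the surface areas are hypothetical.

```python
def low_to_high_ratio(low_area, high_area):
    """Low-water area divided by the following high-water area.

    0.0 means the lake was dry at low water; 1.0 means no change.
    Returns None when the high-water area is also 0 (ratio undefined).
    """
    if high_area == 0:
        return None
    return low_area / high_area


# Hypothetical (low-water, high-water) surface areas in hectares.
pairs = [(0.0, 0.5), (0.0, 40.0), (10.0, 40.0), (38.0, 40.0)]
ratios = [low_to_high_ratio(low, high) for low, high in pairs]
```

The first two lakes both score 0.0, although one gained 0.5 ha and the other 40 ha. That is the overrepresentation of small changes from a dry lake mentioned in the question.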

Generate positive only distribution based on array

I have an array of data, for example:
[1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
From this data, I want to generate random numbers that fit the same distribution.
I can generate a random number using code like the following:
import numpy as np
import scipy.stats
dist = scipy.stats.gaussian_kde(data)
randomVar = np.floor(dist.resample()[0])
This results in random number generation that includes negative numbers, which I believe I can dump fairly easily without changing the overall shape of the rest of the curve (I just generate sufficient resamples that I still have enough for purpose after dumping the negatives).
However, because the original data was positive values only, and heaped up against that boundary, I end up with a KDE that is highest a short distance before it gets to zero but then drops off sharply as it approaches zero. That downward tick in the KDE is preventing me from generating appropriate numbers.
I can set the bandwidth lower, in order to get a sharper corner, closer to zero, but then due to the low quantity of the original data it ends up sawtoothing elsewhere. Higher bandwidths unfortunately hide the shape of the curve before they remove the downward tick.
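For readers who want to see what "resample and dump the negatives" amounts to, here is a pure-Python sketch that does not use SciPy. Drawing from a Gaussian KDE is the same as picking a random data point and adding Gaussian noise with the kernel bandwidth. The bandwidth rule (Scott's rule of thumb) and the function name are assumptions of this sketch.

```python
import math
import random
from statistics import stdev

data = [1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210, 180,
        150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3]


def sample_kde_nonnegative(data, n_samples, rng):
    """Draw from a Gaussian KDE of `data`, discarding negative draws."""
    # Scott's rule-of-thumb bandwidth for 1-D data (an assumption here).
    bandwidth = stdev(data) * len(data) ** (-1 / 5)
    samples = []
    while len(samples) < n_samples:
        # A Gaussian KDE draw is a random data point plus Gaussian noise.
        draw = rng.choice(data) + rng.gauss(0.0, bandwidth)
        if draw >= 0:  # rejection step: negative values are thrown away
            samples.append(math.floor(draw))
    return samples


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = sample_kde_nonnegative(data, 1000, rng)
```

Rejection keeps the shape of the curve above zero, but it does not restore the probability mass that the kernels spread below zero, so the dip near the boundary remains.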
As broadly suggested in the comments by Hilbert's Drinking Problem, the real solution was to find a better distribution that fit the parameters. In my case Chi-Squared, which fit both the shape of the curve, and also the fact that it only took positive values.
However in the comments Stelios made the good suggestion of using scipy.stats.rv_histogram, which I used and was satisfied with for a while. This enabled me to fit a curve to the data exactly, though it had two problems:
1) It assumes zero value in the absence of data. I.e. if you set the settings to fit too closely to the data, then during gaps in your data it will drop to zero rather than interpolate.
2) As an extension to point 1, it won't extrapolate beyond the seed data's maximum and minimum (those data ranges are effectively giant gaps, so everything eventually zeroes out).
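Both limitations can be reproduced with a pure-Python sketch of histogram-based sampling. This is not scipy.stats.rv_histogram itself, only an imitation of the piecewise-constant density idea; the function name and the bin count of 20 are assumptions.

```python
import random

data = [1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210, 180,
        150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3]


def histogram_sampler(data, n_bins):
    """Return (counts, draw) for a piecewise-constant density over data."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in data:
        # The maximum value is assigned to the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1

    def draw(rng):
        # Pick a bin in proportion to its count, then a uniform point in it.
        index = rng.choices(range(n_bins), weights=counts)[0]
        return lo + (index + rng.random()) * width

    return counts, draw


counts, draw = histogram_sampler(data, n_bins=20)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [draw(rng) for _ in range(2000)]
```

With 20 bins, several bins contain no data (for example the one covering roughly 252 to 302), so no sample ever lands there, and no sample falls below 3 or above 1000.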

Netlogo model result conveying method

I have designed a NetLogo model that outputs the number of turtles in each run. The number of turtles increases with ticks and then levels off at a constant value N. I ran the model 50 times, and the 50 resulting N values vary from 9 to 12. I have to report the result with a graph showing the number of turtles increasing with the ticks. For one simulation it levels off at 9 (N = 9), and for another it levels off at 10 (N = 10).
Which of the 50 simulations should I draw the graph for?
Or
should I take the average of the 50 values at each tick and draw a graph of that?
What is the right approach to convey that, in my results from 50 simulations, the number of turtles increases with ticks and then becomes constant (at a value between 9 and 12, depending on the simulation)?
Thank you.
The point of doing multiple simulations is to average out the stochastic effects. Without seeing your data, the most appropriate graph is probably one that averages your variable of interest (e.g. final turtle count, or turtle count at each tick). That average should be taken across the simulations that run the same scenario (that is, have the same starting parameters) if you want to compare scenarios.
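As a rough illustration of averaging at each tick, here is a Python sketch with three made-up runs (the counts are hypothetical, and exporting the NetLogo output to lists is assumed to have happened already). The minimum and maximum per tick are included as one possible way to show the spread between runs.

```python
from statistics import mean

# Hypothetical turtle counts per tick for three runs of the same scenario.
runs = [
    [1, 3, 6, 9, 9, 9],
    [1, 4, 7, 10, 10, 10],
    [2, 5, 8, 11, 12, 12],
]

# zip(*runs) groups the counts of all runs tick by tick.
mean_per_tick = [mean(counts) for counts in zip(*runs)]
min_per_tick = [min(counts) for counts in zip(*runs)]
max_per_tick = [max(counts) for counts in zip(*runs)]
```

Plotting mean_per_tick as a line with the band between min_per_tick and max_per_tick would show both the common trend and the range of final values.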

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating onto a common time grid before computing the mean is one way to avoid biasing the result towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, because the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
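The interpolate-then-average idea can be sketched in plain Python as follows. This is only an analogue of the MATLAB interp1/mean route, with made-up curves; the helper name interp_linear and the common grid are assumptions, and the grid must lie inside the time range of every curve.

```python
from bisect import bisect_right
from statistics import mean, stdev


def interp_linear(t_query, t, x):
    """Linearly interpolate x(t) at t_query; t must be sorted ascending."""
    if not t[0] <= t_query <= t[-1]:
        raise ValueError("t_query lies outside the measured time range")
    i = bisect_right(t, t_query)
    if i == len(t):  # t_query coincides with the last measurement time
        return x[-1]
    t0, t1 = t[i - 1], t[i]
    x0, x1 = x[i - 1], x[i]
    return x0 + (x1 - x0) * (t_query - t0) / (t1 - t0)


# Two hypothetical curves sampled at different times: (times, values).
curves = [
    ([0.0, 1.0, 2.0, 4.0], [0.0, 2.0, 4.0, 8.0]),
    ([0.0, 0.5, 3.0, 4.0], [1.0, 1.5, 4.0, 5.0]),
]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]  # common time grid (an assumption)

# Resample every curve on the grid, then average across curves per grid time.
resampled = [[interp_linear(tq, t, x) for tq in grid] for t, x in curves]
mean_curve = [mean(values) for values in zip(*resampled)]
sd_curve = [stdev(values) for values in zip(*resampled)]
```

Because every curve contributes exactly one value per grid time, dense sampling near t=0 no longer dominates the average, and sd_curve gives the standard deviation across curves at each grid time.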
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".
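For a worked number, here is a Python sketch of a time-weighted mean in which each value is weighted by the interval until the next measurement and the weighted values are summed before dividing by the total time. The function name and the sample data are hypothetical.

```python
def time_weighted_mean(t, x):
    """Mean of x weighted by the time step that follows each measurement."""
    steps = [t1 - t0 for t0, t1 in zip(t, t[1:])]
    # zip stops at the shorter list, so the last value of x gets no weight.
    return sum(dt * xi for dt, xi in zip(steps, x)) / sum(steps)


t = [0.0, 1.0, 2.0, 5.0]      # measurement times in seconds
x = [10.0, 20.0, 30.0, 40.0]  # measured values
result = time_weighted_mean(t, x)  # (1*10 + 1*20 + 3*30) / 5
```

The value 30 counts three times as much as 10 or 20 because it is followed by a 3-second gap, which pulls the result above the plain mean of the first three values.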

Find contour of 2D unorganized pointcloud

I have a set of 2D points, unorganized, and I want to find the "contour" of this set (not the convex hull). I can't use alpha shapes because I have a speed objective (less than 10ms on an average computer).
My first approach was to compute a grid and find the outline squares (squares which have an empty square as a neighbor). This reduced my number of points efficiently (from 22000 to roughly 3000), but I still need to refine this new set.
My question is: how do I find the real outline points among my green points?
After a weekend full of reflection, I may have found a convenient solution.
So we need a grid, and we need to fill it with our points; no difficulty here.
We have to decide which squares are considered "Contour" squares. Our criterion is: at least one empty neighbor and at least 3 non-empty neighbors.
We lack connectivity information. So we choose a "Contour" square which has 2 "Contour" neighbors or fewer. We then pick one of the neighbors. From that, we can start the expansion. We just circle around the current square to find the next "Contour" square, knowing the previous "Contour" squares. Our contour criterion prevents us from hitting a dead end.
We now have vectors of connected squares, and normally, if our shape doesn't have a hole, only one vector of connected squares!
Now for each square, we need to find the best point for the contour. We select the one which is farthest from the barycenter of our plane. It works for most shapes. Another technique is to compute the barycenter of the empty neighbors of the selected square and choose the nearest point.
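A minimal Python sketch of the square classification and the "farthest from the barycenter" selection could look like this. It leaves out the step that chains the squares into a connected contour, and it assumes an 8-square neighborhood, which the description does not state; the function name and cell_size parameter are hypothetical.

```python
from collections import defaultdict
from math import hypot

# Offsets of the 8 surrounding squares (assumed neighborhood).
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def contour_points(points, cell_size):
    """Return one representative point per "Contour" square."""
    # Fill the grid: map each occupied square to the points it contains.
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell_size), int(y // cell_size))].append((x, y))

    # Barycenter of the whole point set.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)

    result = []
    for (i, j), cell_points in grid.items():
        filled = sum((i + di, j + dj) in grid for di, dj in NEIGHBORS)
        # At least one empty neighbor and at least 3 non-empty neighbors.
        if 3 <= filled < 8:
            farthest = max(cell_points,
                           key=lambda p: hypot(p[0] - cx, p[1] - cy))
            result.append(farthest)
    return result


# Example: a filled 10 x 10 block of points, one point per square.
block = [(float(x), float(y)) for x in range(10) for y in range(10)]
outline = contour_points(block, cell_size=1.0)
```

For the 10 x 10 block only the 36 border squares qualify: interior squares have 8 filled neighbors, edge squares 5, and corner squares exactly 3.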
The red points are the contour of the green one. The technique used is the plane barycenter one.
For a set of 28000 points, this technique takes 8 ms. CGAL's alpha shapes take 125 ms on average for 28000 points.
PS: I hope I made myself clear, English is not my mother tongue :s
You really should use the alpha shapes. Maybe use only the green points as input to the alpha-shape algorithm.
