NetLogo model result conveying method - statistics

I have designed a NetLogo model which outputs the number of turtles in each run. The number of turtles increases with ticks and then becomes constant at a value N. I ran the model 50 times and have 50 N values, varying from 9 to 12. I have to report the result with a graph showing the number of turtles increasing with the ticks. For one simulation it will become constant at 9 (N = 9) and for another it will become constant at 10 (N = 10).
Which of the 50 simulations should I draw the graph for?
or
Should I take the average of the 50 values at each tick and draw a graph of that?
What is the right approach to convey that, in my result confirmed by 50 simulations, the number of turtles increases with ticks and then becomes constant (at a value that varies in the range 9 to 12 across simulations)?
Thank you.

The point of doing multiple simulations is to average out the stochastic effects. Without seeing your data, the most appropriate graph is probably one that averages your variable of interest (e.g. final turtle count, or turtle count at each tick). If you want to compare scenarios, take that average across the simulations that run the same scenario (that is, have the same starting parameters).
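As a minimal sketch of the per-tick averaging, assume the runs have been exported into one row per run and one column per tick. The function name `summarize_runs` and the numbers below are made up for illustration; they are not the asker's data. Returning the per-tick minimum and maximum alongside the mean lets the plot show both the common trend and the spread of the plateau.

```python
import numpy as np

def summarize_runs(counts):
    """counts: 2-D array-like, shape (n_runs, n_ticks).

    Returns the per-tick mean, minimum and maximum across runs.
    """
    counts = np.asarray(counts, dtype=float)
    return counts.mean(axis=0), counts.min(axis=0), counts.max(axis=0)

# Made-up illustrative data: 3 runs, 5 ticks, plateaus at 9, 10 and 12.
runs = [
    [1, 4, 7, 9, 9],
    [1, 5, 8, 10, 10],
    [2, 6, 10, 12, 12],
]
mean, low, high = summarize_runs(runs)
```

Plotting `mean` as a line with the band between `low` and `high` shaded would show the increase followed by a plateau, together with how much the plateau varies between runs.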

Related

How to quantify magnitude of change in dataset containing base values of 0?

I have a dataset with the low-water and high-water surface area of lakes/ponds within a delta for each year. These lakes can undergo substantial change from year to year, and sometimes can dry out completely. As such, surface area can have values of 0 during the low-water period. I'm trying to quantify the magnitude of flooding in the spring on the surface areas of these lakes. Given the high interannual variations in surface area, I need to compare the low-water value from the previous year to the high-water value of the following year to quantify this magnitude; comparing to a mean isn't sensitive enough. However, given the low-water surface area of 0 for some lakes, I cannot quantify percent change.
My current idea is to do an "inverse" of percent change (don't know how else to describe it), where I divide the low-water value by the high-water value. This gives me a scale where large change will equal 0 and little change will equal 1. However, small changes from a surface area of 0 will again be overrepresented. Any idea how I could accurately compare the magnitude of flooding in such a case?

Ratio Correction

In my study I calculate some ratios. The theoretical background is as follows:
There is the effect of binocular rivalry, where a different picture is presented to the left eye than to the right eye (e.g. a black and a white square). Most of the time, the test persons do not see a mixture of colours (i.e. something grey), but the picture changes back and forth, so a black square is seen once and then a white one. During the time of the trial (e.g. 60 seconds) the test persons indicate what they see (black square, white square, mixed picture). These durations can be used to calculate the predominance ratio as an indication of whether one stimulus is seen significantly more often than the other. The ratio is calculated from [T(stimulus1) - T(stimulus2)] / [T(stimulus1) + T(stimulus2)], where T is the cumulative time the stimulus was seen during the 60 seconds. The times for the mixed image are completely omitted from this calculation. In the end, a one-sample t-test checks whether the ratio is significantly different from zero. If it is significantly different from zero and positive, stimulus 1 is seen longer; if it is significantly different from zero and negative, stimulus 2 is seen longer. Now I have two conditions and I calculate a predominance ratio for each.
Let us suppose that condition 1 would be the squares I mentioned above and condition 2 would be a stick figure in black and a tree in white. I want to know if there is a significant predominance ratio in the stickman/tree condition, but without the influence of the colors. Therefore I want to somehow subtract the predominance ratio of condition 1 from that of condition 2. So I would like to do a kind of "baseline correction". The value of this predominance ratio can vary between -1 and 1. Now my question is how to do this correction without changing the metric of the ratio. In order to test the corrected ratio against zero in a meaningful way, it must not take any values outside the range -1 to 1.
Does anyone have an idea?
Thanks a lot!

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating onto a common time grid before computing the mean is one way to avoid biasing the result towards times where you have more frequent samples. Note that interpolation will likely reduce the range of your values, because the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);  % duration of each interval between samples
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % time-weighted mean (scalar)
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".
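The interpolation route can be sketched in Python, where `np.interp` plays the role of interp1. This is only a sketch under stated assumptions: each curve is given as a (t, x) pair with t sorted in increasing order, the common grid is restricted to the time span shared by all curves so that nothing is extrapolated, and the name `average_curves` and the sample curves are made up for illustration.

```python
import numpy as np

def average_curves(curves, n_points=50):
    """Interpolate each (t, x) curve onto a shared time grid, then average.

    curves: list of (t, x) pairs; each t must be sorted in increasing order.
    Returns the grid, the mean curve and the sample standard deviation.
    """
    t_start = max(t[0] for t, _ in curves)   # latest start among the curves
    t_end = min(t[-1] for t, _ in curves)    # earliest end among the curves
    grid = np.linspace(t_start, t_end, n_points)
    # One row per curve, all evaluated at the same grid times.
    resampled = np.array([np.interp(grid, t, x) for t, x in curves])
    return grid, resampled.mean(axis=0), resampled.std(axis=0, ddof=1)

# Made-up curves sampled at different, non-uniform times.
curve_a = (np.array([0.0, 0.5, 1.0, 4.0]), np.array([0.0, 1.0, 2.0, 8.0]))  # x = 2t
curve_b = (np.array([0.0, 2.0, 5.0]), np.array([0.0, 8.0, 20.0]))           # x = 4t
grid, mean_curve, std_curve = average_curves([curve_a, curve_b], n_points=5)
```

Because every curve is evaluated at the same grid times, dense sampling near t=0 no longer carries extra weight, and two measurements that would have landed in one histogram bin are handled by the interpolation itself.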

Computing average grid size

I am trying to compute the average cell size of a set of points, which I plotted using gnuplot:
gnuplot> plot "debug.dat" using 1:2
The points are almost aligned on a rectangular grid, but not quite. There seems to be a bias (jitter?) of, say, 10-15% along either X or Y. How would one efficiently compute a proper partition into tiles so that there is virtually only one point per tile? The tile size would be expressed as (tilex, tiley). I say virtually because the 10-15% bias may have moved a point into an adjacent tile.
Just for reference, I have manually sorted (hopefully correct) and extracted the first 10 points:
-133920,33480
-132480,33476
-131044,33472
-129602,33467
-128162,33463
-139679,34576
-138239,34572
-136799,34568
-135359,34564
-133925,34562
Just for clarification, a valid tile as per the above description would be (1435,1060), but I am really looking for a quick automated way.
Let's do this for X coordinate only:
1) sort the X coordinates
2) look at deltas between two subsequent X coordinates. These deltas will fall into two categories - either they correspond to spaces between two columns, or to spaces between crosses within the same column. Your goal is to find a threshold that will separate the long spaces from the short ones. This can be done by finding a threshold that separates the deltas into two groups whose means are the furthest apart (I think)
3) once you have the threshold, separate points into columns. A column starts and ends with a delta larger than the threshold you measured previously
4) calculate average position of each detected column
5) take deltas between subsequent columns. Now, the problem is that you may get a stray point that would break your columns. Use a median to get the strays out.
6) You should have a robust estimate of your gridX
Example, using your data, looking at axis X:
-133920 -132480 -131044 -129602 -128162 -139679 -138239 -136799 -135359 -133925
Sorted + deltas:
5 1434 1436 1440 1440 1440 1440 1440 1442
Here you can see that there is a very obvious threshold between small (5) and large (1434 and up) deltas. 1434 will define your space here.
Split the points into columns:
-139679|-138239|-136799|-135359|-133925 -133920|-132480|-131044|-129602|-128162
1440 1440 1440 1434 5 1440 1436 1442 1440
Almost all points are alone, except the two -133925 -133920.
The average grid line positions are:
-139679 -138239 -136799 -135359 -133922.5 -132480 -131044 -129602 -128162
Sorted deltas:
1436.0 1436.5 1440.0 1440.0 1440.0 1440.0 1442.0 1442.5
Median:
1440
Which is the correct answer for your SMALL data set, IMHO.
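The steps above can be sketched in Python for one axis. Note the swapped-in simplification: instead of searching for the threshold that maximizes the separation of the two groups of deltas, this sketch assumes that any delta below half of the largest delta is jitter within a column. That `gap_ratio` heuristic and the function name `estimate_spacing` are assumptions for illustration; the heuristic would break if a missing column produced one unusually large gap.

```python
import numpy as np

def estimate_spacing(coords, gap_ratio=0.5):
    """Estimate grid spacing along one axis from jittered coordinates.

    gap_ratio is an assumed heuristic: a delta smaller than
    gap_ratio * (largest delta) is treated as jitter within one column.
    """
    xs = np.sort(np.asarray(coords, dtype=float))
    deltas = np.diff(xs)
    threshold = gap_ratio * deltas.max()
    # Start a new column wherever the gap to the previous point is large.
    column_ids = np.concatenate(([0], np.cumsum(deltas > threshold)))
    # Average position of the points assigned to each column.
    centers = np.array([xs[column_ids == c].mean() for c in np.unique(column_ids)])
    # The median of the column-to-column gaps is robust to a few stray points.
    return float(np.median(np.diff(centers)))

# X coordinates of the ten points listed in the question.
x = [-133920, -132480, -131044, -129602, -128162,
     -139679, -138239, -136799, -135359, -133925]
grid_x = estimate_spacing(x)  # 1440.0 for these ten points
```

Running the same function on the Y coordinates would give the second component of the tile size.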

Algorithm for drawing box plot for given data

I have a sorted array of real values, say X, drawn from some unknown distribution. I would like to draw a box plot for this data.
In the simplest case, I need to know five values: min, Q1, median, Q3, and max.
Trivially, min = X[0], max = X[length(X)-1], and possibly median = X[floor(length(X)/2)] (for an odd number of values). But I'm wondering how to determine the quartiles Q1 and Q3.
When I plot X = [1,2,4] using MATLAB, the box is drawn with Q1 = 1.25 and Q3 = 3.5.
It seems to me like there is some magic in how these values are obtained, but I don't know what the magic is. Does anybody have experience with this?
If you go to the original definition of box plots (look up John Tukey), you use the median for the midpoint (i.e., 2 in your data set of 1, 2, 4). The endpoints are the min and max.
The top and bottom of the box are not exactly defined by quartiles, instead they are called "hinges". Hinges are the medians of the top and bottom halves of the data. If there is an odd number of observations, the median of the entire set is used in determining both hinges. The lower hinge is the median of (1,2), or 1.5. The top hinge is the median of (2,4), or 3.
There are actually dozens of definitions of a box plot's quartiles (Wikipedia: "There is no universal agreement on choosing the quartile values"). If you want to rationalize MATLAB's box plot, you'll have to check its documentation. Otherwise, you could Google your brains out to try to find a method that matches the results.
Minitab gives 1 and 4 for the hinges in your data set. Excel's PERCENTILE function gives 1.5 and 3, which incidentally matches Tukey's algorithm at least in this case.
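A minimal sketch of the hinge rule described above, in Python. It assumes the input is already sorted; the function name `tukey_hinges` is made up for illustration. With an odd number of observations the overall median is included in both halves, as the answer states.

```python
import statistics

def tukey_hinges(sorted_values):
    """Return (lower hinge, upper hinge) of an already sorted sequence."""
    n = len(sorted_values)
    half = (n + 1) // 2  # with odd n, the median belongs to both halves
    lower = statistics.median(sorted_values[:half])
    upper = statistics.median(sorted_values[n - half:])
    return lower, upper

hinges = tukey_hinges([1, 2, 4])  # (1.5, 3.0)
```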
The median divides the data into two halves. The median of the first half = Q1, and the median of the second half = Q3.
More info: http://www.purplemath.com/modules/boxwhisk.htm
Note on the MATLAB boxplot: Q1 and Q3 may be calculated in a different way in MATLAB; I'd try with a larger amount of test data. With my method, Q1 should be 1 and Q3 should be 4.
EDIT:
A possible calculation that MATLAB does is to take the difference between the median and the first number of the first half, and take a quarter of that. Add that to the first number to get Q1: 1 + (2 - 1)/4 = 1.25.
The same (roughly) applies to Q3: take the difference between the median and the highest number, and subtract a quarter of that from the highest number: 4 - (4 - 2)/4 = 3.5. That is Q3.
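One interpolation rule that reproduces Q1 = 1.25 and Q3 = 3.5 for [1, 2, 4] is to treat the i-th of n sorted values as the (i - 0.5)/n quantile and interpolate linearly in between. Whether this is exactly what MATLAB's boxplot does is an assumption here and should be checked against its documentation; the function name `midpoint_quantile` is made up for illustration.

```python
import numpy as np

def midpoint_quantile(sorted_values, p):
    """Quantile p (0..1) of already sorted data.

    The i-th of n values (1-based) is placed at probability (i - 0.5) / n,
    with linear interpolation between neighbours. np.interp clamps to the
    minimum or maximum outside that range.
    """
    x = np.asarray(sorted_values, dtype=float)
    n = len(x)
    positions = (np.arange(n) + 0.5) / n
    return float(np.interp(p, positions, x))

q1 = midpoint_quantile([1, 2, 4], 0.25)
q3 = midpoint_quantile([1, 2, 4], 0.75)
```

For this three-point data set the rule agrees with the quarter-of-the-difference calculation above; on larger data sets the two need not coincide, so testing with more values would help tell them apart.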
