Decimal places in annotations of forest plots generated with the metafor package

In a previous post about significant figures displayed in forest plots generated with the metafor package, the digits option was suggested for specifying the number of decimal places used for the tick mark labels of the x-axis and for the plot annotations.
Is it possible to specify a different number of decimal places for different parts of the annotation, i.e. one decimal for the weights and more than one for the effect sizes, and, if so, how?

Not at the moment. But I think this is a useful feature (showing many digits on the weights is often not that useful, so being able to adjust the number of digits for the weights separately from what is used for the effect size estimates and the x-axis labels makes sense). I have just pushed an update to the development version of metafor that allows you to specify three values for digits: the first for the annotations, the second for the x-axis labels, and the third for the weights. You can install the development version as described here:
https://github.com/wviechtb/metafor#installation

Related

Generate positive only distribution based on array

I have an array of data, for example:
[1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
From this data, I want to generate random numbers that fit the same distribution.
I can generate a random number using code like the following:
import numpy as np
import scipy.stats
dist = scipy.stats.gaussian_kde(data)     # data: the array listed above
randomVar = np.floor(dist.resample()[0])  # floor of one batch of KDE resamples
This results in random number generation that includes negative numbers, which I believe I can discard fairly easily without changing the overall shape of the rest of the curve (I just generate enough resamples that I still have enough for my purpose after dropping the negatives).
However, because the original data contained only positive values, heaped up against that boundary, I end up with a KDE that is highest a short distance above zero but then drops off sharply as it approaches zero; and that downward tick in the KDE is preventing me from generating appropriate numbers.
I can set the bandwidth lower in order to get a sharper corner closer to zero, but then, due to the small quantity of original data, it ends up sawtoothing elsewhere. Higher bandwidths unfortunately hide the shape of the curve before they remove the downward tick.
As broadly suggested in the comments by Hilbert's Drinking Problem, the real solution was to find a better-fitting distribution, in my case chi-squared, which matched both the shape of the curve and the fact that it only takes positive values.
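For illustration, a minimal sketch of that idea using scipy.stats.chi2 (the fit/rvs calls and variable names here are my own illustration, not the code from the question):
import numpy as np
import scipy.stats

data = np.array([1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210,
                 180, 150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3])

# Fit a chi-squared distribution to the sample; floc=0 pins the lower
# bound at zero, so the fitted distribution only produces positive values.
df, loc, scale = scipy.stats.chi2.fit(data, floc=0)

# Draw new random values from the fitted distribution.
samples = scipy.stats.chi2.rvs(df, loc=loc, scale=scale, size=1000)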
However, in the comments Stelios made the good suggestion of using scipy.stats.rv_histogram, which I used and was satisfied with for a while. This let me fit a curve to the data exactly, though it had two problems:
1) It assumes zero probability in the absence of data, i.e. if you set it to fit the data too closely, then during gaps in your data it will drop to zero rather than interpolate.
2) As an extension of point 1, it won't extrapolate beyond the seed data's minimum and maximum (those ranges are effectively giant gaps, so everything eventually zeroes out).
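For reference, a minimal sketch of the rv_histogram route (the bin count here is an arbitrary illustration):
import numpy as np
import scipy.stats

data = np.array([1000, 800, 700, 650, 630, 500, 370, 350, 310, 250, 210,
                 180, 150, 100, 80, 50, 30, 20, 15, 12, 10, 8, 6, 3])

# Build a histogram of the data, then wrap it as a distribution object.
hist = np.histogram(data, bins=10)
dist = scipy.stats.rv_histogram(hist)

# Sampling stays within [data.min(), data.max()] and drops to zero
# wherever a bin happens to be empty (the two problems noted above).
samples = dist.rvs(size=1000)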

Why do Excel and Matlab give different results?

I have 352k values and I want to find the most frequent value among them.
Numbers are rounded to two decimal places.
I use the commands mode(a) in Matlab and mode(B1:B352000) in Excel, but the results are different.
Where did I make a mistake, or which one can I believe?
Thanks
//edit: When I use other commands like average, the results are the same.
From Wikipedia:
For a sample from a continuous distribution, such as [0.935..., 1.211..., 2.430..., 3.668..., 3.874...], the concept is unusable in its raw form, since no two values will be exactly the same, so each value will occur precisely once. In order to estimate the mode of the underlying distribution, the usual practice is to discretize the data by assigning frequency values to intervals of equal distance, as for making a histogram, effectively replacing the values by the midpoints of the intervals they are assigned to. The mode is then the value where the histogram reaches its peak. For small or middle-sized samples the outcome of this procedure is sensitive to the choice of interval width if chosen too narrow or too wide
Thus, it is likely that the two programs use a different interval size, yielding different answers. You can (I presume) believe both, keeping in mind that the value returned is an approximation to the true mode of the underlying distribution.
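To illustrate that sensitivity, here is a small hypothetical sketch (not what either program does internally) showing how a histogram-based mode estimate moves as the bin count changes:
import numpy as np

rng = np.random.default_rng(0)
values = np.round(rng.normal(loc=5, scale=2, size=352000), 2)

# Estimate the mode from a histogram: the midpoint of the tallest bin.
# The result depends on the number (i.e. width) of the bins chosen.
for bins in (50, 200, 1000):
    counts, edges = np.histogram(values, bins=bins)
    peak = counts.argmax()
    print(bins, (edges[peak] + edges[peak + 1]) / 2)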

Round to 2 decimal places on range AndroidPlot

I am using AndroidPlot to create a simple XYPlot.
My y-axis is (by default) rounding to 1 decimal place. How can I change this to 2 decimal places?
I have found another answer that shows how you would do something similar with the old version of AndroidPlot (the snippet below gets rid of decimal places entirely, but I assume it is the same function I would use):
// Gets rid of decimal places
mySimpleXYPlot.setDomainValueFormat(new DecimalFormat("0"));
Does anyone know how to do this with AndroidPlot 1.*?
Thank you!
Since support was added in 1.x to display labels along any of the four edges of the graph, the way to attach a formatter was modified to support an arbitrary edge.
If you're using the standard domain value labels along the bottom of the plot, this should give you the same behavior as before:
// Formats the domain (bottom edge) labels with one decimal place; use "0.00"
// for two decimal places, or Edge.LEFT to format the range (y-axis) labels.
plot.getGraph().getLineLabelStyle(XYGraphWidget.Edge.BOTTOM)
    .setFormat(new DecimalFormat("0.0"));

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time; say I have n curves whose data points correspond to different times along that axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of the x-values is not uniform. There are many more values close to t = 0, but at t = 5 (for example) the data points are much less frequent.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that, by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points are unlikely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
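For instance, a minimal NumPy sketch of the same interpolate-then-average idea (np.interp standing in for MATLAB's interp1; the curve data is made up purely for illustration):
import numpy as np

# Three made-up curves, each sampled at its own (sorted) time points.
rng = np.random.default_rng(0)
curves = [(np.sort(rng.uniform(0, 5, 40)), rng.random(40)) for _ in range(3)]

# Common time grid restricted to the overlap of all curves, so nothing
# has to be extrapolated.
t_grid = np.linspace(max(t.min() for t, _ in curves),
                     min(t.max() for t, _ in curves), 100)

# Interpolate each curve onto the grid, then average across curves.
resampled = np.array([np.interp(t_grid, t, y) for t, y in curves])
mean_curve = resampled.mean(axis=0)
std_curve = resampled.std(axis=0)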
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);  % time between successive samples
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % scalar weighted mean
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Programming machine learning, compare two plotted lines with x y coordinates

So I have multiple paths stored; each path consists of data points (x1, y1), (x2, y2), (x3, y3), etc.
I would like to compare these paths with one another to work out whether any similarities are present.
I could run through each point and see if it matched any of the points in the first path, then look to see if the next point matches the next point.
I think this would work if there were no anomalies, but it could skip over points if the next point did not match.
I would like to build in some level of tolerance, e.g. (10, 10) may match (12, 12) or (8, 8).
Is this a good way to compare the data, or is there a better approach?
As a second step I may want to consider time as a value too, so each point would have a time value associated with it.
Some possible approaches you can use:
handle both paths as polygons and compare them as such
see: How to compare two shapes?
use OCR algorithms/approaches
see: OCR and character similarity
transform both paths to a synchronized dataset and correlate
either extract significant points only and/or resample the paths to the same point count, then synchronize both datasets (as in bullet 1) and use the correlation coefficient (see the sketch after this list)
[notes]
Depending on the input data, you can also exploit DCT/DFT transforms to remove unimportant data (as in JPEG compression) and/or compare in the frequency domain instead of the spatial/time domain.
You can also compare obvious things (invariant on rotation and translation) like
area
perimeter length
number of self-intersections
number of inflection points
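A minimal sketch of the resample-and-correlate bullet; the arc-length resampling and the helper names are my own illustrative choices:
import numpy as np

def resample(path, n=100):
    # path: (N, 2) array of x, y points; resample to n points spaced
    # evenly along the path's cumulative arc length.
    path = np.asarray(path, dtype=float)
    seg = np.hypot(*np.diff(path, axis=0).T)
    s = np.concatenate(([0.0], np.cumsum(seg)))
    s_new = np.linspace(0.0, s[-1], n)
    return np.column_stack([np.interp(s_new, s, path[:, i]) for i in range(2)])

def similarity(path_a, path_b, n=100):
    # Correlation coefficient between the flattened, resampled paths.
    a = resample(path_a, n).ravel()
    b = resample(path_b, n).ravel()
    return np.corrcoef(a, b)[0, 1]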
You could compare the means and variances of the two sets of points. If they lie on straight lines, as you hypothesize, you could fit straight lines through the two datasets and then compare the parameters of the two fitted lines to infer how far apart they are. It would be more helpful if you could describe the behaviour of the two datasets.
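If the points really do lie on straight lines, a small sketch of that comparison (np.polyfit used purely for illustration, with made-up data):
import numpy as np

def line_params(xs, ys):
    # Least-squares fit of y = slope * x + intercept.
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope, intercept

# Compare two paths by the difference in their fitted slopes and intercepts.
s1, i1 = line_params([0, 1, 2, 3], [0.1, 1.0, 2.1, 2.9])
s2, i2 = line_params([0, 1, 2, 3], [0.3, 1.2, 2.2, 3.2])
print(abs(s1 - s2), abs(i1 - i2))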
