@Timed annotation to generate Distribution Summary - Micrometer

Micrometer's @Timed annotation generates a Timer which, when exported to Datadog, only provides sum, min, max, median, and 95th percentile functionality.
To get p50, p75, p90, etc. in Datadog, the metric should be of type distribution. How can I change the meter generated from @Timed to a distribution summary?

A Timer is already a distribution in Micrometer. It is a special one, though, because it records durations and also gives you a convenient way to measure elapsed time.
You can call .publishPercentiles(0.5, 0.75, 0.90) on the builder to get those percentiles published, and you can call .publishPercentileHistogram() to publish histograms. Please see the docs: https://micrometer.io/docs/concepts#_histograms_and_percentiles

Old question, but I just found the answer while searching for the same thing:
@Timed(value = "your.name.here", histogram = true, percentiles = {0.5, 0.75, 0.90})

Related

Generating multiple sets of n samples from data set where standard deviation of each set is minimized

I prepared a dataset and later learned that it is skewed.
Assume a plot of user_count vs. score, where user_count is the number of users with that particular score.
I have to split the total users into multiple samples of size 100 <= n <= 1000 in such a way that the standard deviation of each created sample is minimized.
How do I do that?
I have tried binning methods like custom binning, quantile binning, etc., but that has not helped, because with manual binning some of my bins have a high SD.
Example:
I created 19 custom bins with intervals .05-.10, .10-.15, ..., .90-.95, >.95,
which gives me something like this:
The problem here is that q19 has a high SD.
So I am trying to figure out a way to create an optimal number of bins automatically, each with minimal standard deviation.
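To make the attempted approach concrete, here is a minimal Python sketch of quantile binning with a per-bin SD check; the column name, bin count, tolerance, and synthetic data are placeholders, not taken from the question.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic skewed scores, one per user (stand-in for the real dataset).
df = pd.DataFrame({"score": rng.beta(2, 8, size=20_000)})

# Quantile bins put roughly equal numbers of users in each bin, which usually
# narrows the score range (and hence the SD) within each bin.
df["bin"] = pd.qcut(df["score"], q=19, duplicates="drop")

per_bin = df.groupby("bin", observed=True)["score"].agg(["count", "std"])
print(per_bin)

# Flag bins whose SD is still above a tolerance; those are candidates for
# splitting further (or for increasing q until every bin passes).
tolerance = 0.02
print(per_bin[per_bin["std"] > tolerance])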

Sample size for a single-arm study based on median time to event

In my master's thesis, I need to determine and calculate the number of cases for the median time to event. The method follows Brookmeyer & Crowley, 1982. My question is: how can I determine the sample size according to Brookmeyer, i.e. the number of cases for the median time to event? How can I define the equation for N? I know how to calculate the confidence interval, but my problem is how to determine the case number theoretically.
Edit:
"Designing the trial with different characteristics: planning a single-arm study without historical control. How can I determine the sample size N, and what method is best?" That is my plan, assuming the endpoint is median time to event ("PFS"). I want to determine the sample size N and then calculate it, which is why I thought I could find or define a formula for N. I firmly assume that the survival times are exponentially distributed. What I want to see: 1) a sample size based on distributional assumptions? 2) if no implementation is available, how do I derive a p-value? Thanks for further help, best regards.

Integrating Power pdf to get energy pdf?

I'm trying to work out how to solve what seems like a simple problem, but I can't convince myself of the correct method.
I have time-series data that represents the pdf of a power output P, varying over time, and also the cdf and quantile functions: f(P,t), F(P,t), and q(P,t). I need to find the pdf, cdf, and quantile function for the energy in a given time interval [t1,t2] from this data - say e(), E(), and qe().
Clearly energy is the integral of the power over [t1,t2], but how do I best calculate e, E, and qe?
My best guess is that, since q(P,t) is a power, I should generate qe by integrating q over the time interval and then calculate the other distributions from that.
Is it as simple as that, or do I need to get to grips with stochastic calculus?
Additional details for clarification
The data we're getting is a time series of 'black-box' forecasts for f(P), F(P), and q(P) at each time t, where P is the instantaneous power, and there will be around 100 forecasts covering the interval I'd like to get e() for. By 'black-box' I mean that there will be a function I can call to evaluate f, F, and q for a given P, but I don't know the underlying distribution.
The black-box functions are almost certainly interpolating output data from the model that produces the power forecasts, but we don't have access to that. I would guess that it won't be anything straightforward, since it comes from a chain of non-linear transformations. It's actually wind farm production forecasts: the wind speeds may be normally distributed, but multiple terrain and turbine transformations will change that.
Further clarification
(I've edited the original text to remove confusing variable names in the energy distribution functions.)
The forecasts will be provided as follows:
The interval [t1,t2] that we need e, E and qe for is sub-divided into 100 (say) sub-intervals k=1...100. For each k we are given a distinct f(P), call them f_k(P). We need to calculate the energy distributions for the interval from this set of f_k(P).
Thanks for the clarification. From what I can tell, you don't have enough information to solve this problem properly. Specifically, you need to have some estimate of the dependence of power from one time step to the next. The longer the time step, the less the dependence; if the steps are long enough, power might be approximately independent from one step to the next, which would be good news because that would simplify the analysis quite a bit. So, how long are the time steps? An hour? A minute? A day?
If the time steps are long enough to be independent, the distribution of energy is the distribution of a sum of 100 variables, which will be very nearly normal by the central limit theorem. It's easy to work out the mean and variance of the total energy in this case.
Otherwise, the distribution will be some more complicated result. My guess is that the variance as estimated by the independent-steps approach will be too small -- with positive autocorrelation between steps, the actual variance of the sum would be somewhat larger, I believe.
From what you say, you don't have any information about temporal dependence. Maybe you can find or derive from some other source an estimate of the autocorrelation function -- I wouldn't be surprised if that question has already been studied for wind power. I also wouldn't be surprised if a general version of this problem has already been studied -- perhaps you can search for something like "distribution of a sum of autocorrelated variables." You might get some interest in that question on stats.stackexchange.com.
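As a concrete illustration of the independent-steps approximation above, here is a rough Python sketch that turns 100 per-step quantile functions into a normal approximation for the interval energy. The Weibull stand-ins for the black-box forecasts and the sub-interval length are invented for the example.

import numpy as np
from scipy import stats

n_steps = 100                            # number of sub-intervals k
dt_hours = 0.25                          # assumed sub-interval length (made up)

# Stand-in 'black-box' quantile functions, one per sub-interval.
q_k = [stats.weibull_min(2.0, scale=8.0 + 0.02 * k).ppf for k in range(n_steps)]

# Mean and variance of each step's power via the inverse-CDF transform:
# if U ~ Uniform(0, 1), then q(U) has the forecast distribution for that step.
u = (np.arange(10_000) + 0.5) / 10_000
step_mean = np.array([q(u).mean() for q in q_k])
step_var = np.array([q(u).var() for q in q_k])

# Energy over [t1, t2] is dt * (sum of step powers); with independent steps the
# means and variances add, and the total is approximately normal (CLT).
energy_mean = dt_hours * step_mean.sum()
energy_sd = dt_hours * np.sqrt(step_var.sum())
print(f"Energy ~ Normal(mean={energy_mean:.1f}, sd={energy_sd:.1f})")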

K-means metrics

I have read through the scikit-learn documentation and Googled to no avail. I have 2000 data sets, clustered as the picture shows. Some of the clusters, as shown, are wrong - here the red cluster. I need a metric or method to validate all 2000 cluster-sets. Almost every metric in scikit-learn requires the ground-truth class labels, which I do not think I have, or can have for that matter. I have the hourly traffic flow for 30 days and I am clustering it using k-means; the lines are the cluster centers. What should I do? Am I even on the right track? The horizontal axis is the hour, 0 to 23, and the vertical axis is the traffic flow, so the data points represent the traffic flow in each hour over the 30 days, and k = 3.
Scikit-learn has no methods for internal evaluation apart from the silhouette coefficient, to my knowledge. We can implement the DB Index (Davies-Bouldin) and the Dunn Index for such problems. The article here provides good metrics for k-means:
http://www.iaeng.org/publication/IMECS2012/IMECS2012_pp471-476.pdf
Both the silhouette coefficient and the Calinski-Harabasz index are implemented in scikit-learn nowadays and will help you evaluate your clustering results when there is no ground truth.
More details here:
http://scikit-learn.org/stable/modules/clustering.html
And here:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_samples.html#sklearn.metrics.silhouette_samples
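For reference, a minimal Python sketch of ground-truth-free evaluation with the two scores mentioned above, using the current scikit-learn function names. The 30 x 24 layout follows the question, but the traffic values are synthetic.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(2)
X = rng.poisson(lam=100.0, size=(30, 24)).astype(float)   # 30 days x 24 hourly flows (synthetic)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("silhouette:        ", silhouette_score(X, labels))
print("calinski-harabasz: ", calinski_harabasz_score(X, labels))
# Higher is better for both; comparing these scores across the 2000 cluster-sets
# (or across candidate values of k) helps flag the questionable ones.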
Did you look at agglomerative clustering, and in particular the subsection "Varying the metric":
http://scikit-learn.org/stable/modules/clustering.html#varying-the-metric
To me it seems very similar to what you are trying to do.

A method to find the inconsistency or variation in the data

I am running an experiment (an image-processing experiment) in which I have a set of paper samples, and each sample has a set of lines. For each line in the paper sample, its strength is calculated, denoted by, say, 's'. For a given paper sample I have to find the variation amongst the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
I started with the standard deviation of the values, but the problem I am facing is that the order of magnitude of s differs from sample to sample (because of various properties of a line, like its length, sharpness, darkness, etc.), and the calculated standard deviations also differ a lot in magnitude. So I can't really use this method across different samples.
Is there any way I can find a suitable limit that is applicable to all samples?
Since I don't have any history of how the strength values should behave (for a sample where the strength values have a large order of magnitude, more variation could be tolerated, whereas in a sample where the magnitude is smaller there should be less variation), I am thinking that I first need a way of baselining the variation across different samples. I don't know what approaches I could try to get started.
Please note that I have to measure the variation between lines within a sample, whereas the limit should be applicable to any good sample.
Please help me out.
You seem to have a set of samples. Then, for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample then it is understandable that the standard deviation was not a good metric. The standard deviation is notoriously sensitive to outliers. So, try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the Interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing, but you have to choose a suitable threshold for the metric computed above, and it helps to make that threshold robust too. You can compute a robust estimate of their mean (e.g., the median) and a robust estimate of their standard deviation (e.g., 1.4826 * MAD), and then identify outliers as metric values more than some number of robust standard deviations above the robust mean (see the sketch after this list).
Histogram. Another simple method is to histogram your computed metrics from step #1. This is non-parametric, so it doesn't require you to model your data. You can histogram your metric values and then use the top 1% (or some other cut-off) as your threshold limit.
Triangle Method. A neat and simple heuristic for thresholding is the triangle method, which performs a binary classification of a skewed distribution.
Anomaly detection. Read up on other outlier-detection methods.
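To make the MAD-based thresholding concrete, here is a small Python sketch; the strength values and the 3-robust-SD cut-off are invented for illustration.

import numpy as np

# Line strengths for one paper sample (made-up numbers).
s = np.array([3.1, 2.9, 3.4, 3.0, 2.8, 3.2, 9.7, 3.3])

centre = np.median(s)
mad = np.median(np.abs(s - centre))
robust_sd = 1.4826 * mad                 # consistent with the SD for normal data

robust_z = (s - centre) / robust_sd
outliers = np.abs(robust_z) > 3.0        # e.g. more than 3 robust SDs away
print("robust z-scores:", np.round(robust_z, 2))
print("outlier lines:  ", np.where(outliers)[0])

# Because the robust z-scores are scale-free, the same cut-off can be applied to
# samples whose strengths differ by orders of magnitude.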
