Why does the Gamma distribution in SciPy have three parameters?

Usually, the Gamma distribution has two parameters: shape and scale (or alternatively shape and rate). However, it seems that in SciPy the Gamma distribution has three parameters: two shape parameters and a location parameter.
Does anyone know the mapping between the SciPy parameters of the Gamma distribution and, e.g., the parameters in the definition given on Wikipedia:
http://en.wikipedia.org/wiki/Gamma_distribution
Thanks!

All the continuous distributions in scipy.stats have location and scale parameters, even those for which the location is not generally used. For the gamma distribution, just leave the location at its default value 0. If you are using the fit method, use the argument floc=0 to ensure that it does not treat the location as a free parameter.
The shape and scale parameters in the scipy gamma distribution correspond to k and θ, respectively, on the Wikipedia page.
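For concreteness, here is a minimal sketch of that mapping (assuming the Wikipedia shape k and scale θ; in scipy the shape argument is named a):
import numpy as np
from scipy import stats

k, theta = 2.5, 3.0  # Wikipedia shape k and scale theta

# Evaluate the pdf: scipy's shape argument `a` plays the role of k,
# `scale` plays the role of theta, and `loc` stays at its default of 0.
x = np.linspace(0.01, 30, 200)
pdf = stats.gamma.pdf(x, a=k, scale=theta)

# Fitting: fix the location at 0 so only shape and scale are estimated.
data = stats.gamma.rvs(a=k, scale=theta, size=1000)
k_hat, loc_hat, theta_hat = stats.gamma.fit(data, floc=0)
print(k_hat, loc_hat, theta_hat)  # loc_hat is exactly 0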

How to calculate added path length (APL) image segmentation metric?

Background
I'm trying to calculate the added path length (APL) metric used in semantic segmentation for radiotherapy treatment planning. The metric originated in this paper, but I can't find any explanation of how to calculate it, just the following figure, which indicates shape A (black) and the surface edits (dotted yellow lines) required to create shape B:
[Figure: added path length illustration]
Currently I'm calculating this metric by summing all the surface pixels that are in shape B but not in shape A (similar to this) and multiplying by the pixel width (assuming isotropic pixels) to obtain a value in mm. I've also added a tolerance parameter that allows some deviation between the surfaces of shapes A and B before considering the surface "edited".
Questions
Any good references for how the original authors calculated this metric?
Any thoughts on going from voxel to mm version of this metric?
I had the same problem, and implemented a version of the APL metric in platipy. The code is available here. This is open-source Python code.
$ pip install platipy
(Note - you may have to update pip with pip install -U pip if you get errors).
from platipy.imaging.label.comparison import compute_metric_total_apl
compute_metric_total_apl(label_A, label_B, distance_threshold=3)
This will return the (total) APL in millimetres. You may also find the function compute_metric_mean_apl useful, which computes the slice-wise averaged APL.
You may notice the added variable distance_threshold. If the two contours are closer than this distance, they are considered identical. It is used to make the APL more sensitive to true differences (i.e. ignore negligible, voxel-scale variations). Just set it to zero to get the APL as per the original definition.
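If you would rather see roughly what such a computation looks like without platipy, here is a minimal sketch under a few assumptions (2D boolean masks as NumPy arrays, isotropic pixel spacing, tolerance in mm); it is only an illustration of the "surface of B not already explained by the surface of A" idea, not the platipy implementation:
import numpy as np
from scipy import ndimage

def apl_2d(mask_a, mask_b, spacing_mm=1.0, tolerance_mm=0.0):
    # Length (in mm) of the surface of B that lies farther than
    # tolerance_mm from the surface of A.
    mask_a = mask_a.astype(bool)
    mask_b = mask_b.astype(bool)

    # Surface pixels = mask minus its erosion.
    surf_a = mask_a & ~ndimage.binary_erosion(mask_a)
    surf_b = mask_b & ~ndimage.binary_erosion(mask_b)

    # Distance (in mm) from every pixel to the nearest surface pixel of A.
    dist_to_a = ndimage.distance_transform_edt(~surf_a, sampling=spacing_mm)

    # Surface pixels of B not within tolerance of A's surface count as "edited".
    edited = surf_b & (dist_to_a > tolerance_mm)
    return edited.sum() * spacing_mm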

What does "noise_multiplier" mean in the tensorflow-federated tutorial?

I found the term noise_multiplier in the following part of the tensorflow-federated tutorial.
def train(rounds, noise_multiplier, clients_per_round, data_frame):
  # Using the `dp_aggregator` here turns on differential privacy with adaptive
  # clipping.
  aggregation_factory = tff.learning.model_update_aggregator.dp_aggregator(
      noise_multiplier, clients_per_round)
I have read the paper about differential privacy with adaptive clipping (Andrew et al. 2021, Differentially Private Learning with Adaptive Clipping). I guessed that noise_multiplier is the noise we inject into the system, but it is just a single scalar that we set. I would have expected different noise to be added to each of the corresponding weight variables, so I am confused about this.
To make an aggregation in TFF differentially private (focusing on an additive application of the Gaussian mechanism, which seems to be what is happening in the symbol you're using), two steps are required:
Incoming tensors must be clipped so that the total norm of these (potentially structured) tensors, considered as a single vector, has an explicit upper bound. The adaptivity in adaptive clipping refers to this upper bound: the norm to which the incoming tensors are clipped is what gets adjusted over time.
Noise sampled from an isotropic Gaussian with some variance is added to these clipped tensors (possibly before or after aggregation, depending on the DP model, e.g. local vs. central).
The noise multiplier here relates the clipping norm in step 1 to the noise in step 2: it is the ratio of the noise's standard deviation to the clipping norm. It is this ratio that determines the privacy budget 'used up' by one application of the query; this can be seen, e.g., in the relationship between sensitivity, epsilon and the noise scale in the Wikipedia article on the Gaussian mechanism.
The parameter you set here, then, can be understood as specifying how much privacy each step of the algorithm 'costs'. It is a scalar because it is simply the ratio of two scalars; the 'vectorizing' is handled by sampling from the high-dimensional Gaussian, but since this Gaussian is isotropic (spherical), only its scalar variance needs to be known (rather than a full covariance matrix).
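As a rough numerical sketch of that relationship (plain NumPy, not the TFF internals, so the exact scaling inside dp_aggregator may differ):
import numpy as np

clip_norm = 1.0          # adaptive in TFF; fixed here for illustration
noise_multiplier = 0.5   # the scalar passed to dp_aggregator
rng = np.random.default_rng(0)

update = rng.normal(size=10)  # a flattened model update from one client

# Step 1: clip so the L2 norm of the update is at most clip_norm.
clipped = update * min(1.0, clip_norm / np.linalg.norm(update))

# Step 2: add isotropic Gaussian noise; the same scalar standard deviation
# (noise_multiplier * clip_norm) is used for every coordinate.
noisy = clipped + rng.normal(scale=noise_multiplier * clip_norm, size=clipped.shape)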

Gaussian approximation of old states

I came across the following sentence, referring to the usual Extended Kalman Filter, and I'm trying to make sense of it:
States before the current state are approximated with a normal distribution
What does it mean?
The modeled quantity has uncertainty because it is derived from measurements; you can't be sure it is exactly the value X. That's why the quantity is represented by a probability density function (or a cumulative distribution function, which is the integral of the density).
A probability distribution can look very arbitrary, but there are many "simple" distributions that approximate the real world well. You've heard of the normal distribution (Gaussian), the uniform distribution (rectangle), and so on.
The normal distribution (parameters mu and sigma) occurs everywhere in nature, so it's likely that your measurements already fit a normal distribution very well.
"A Gaussian" implies that your distribution isn't a mixture (sum) of Gaussians but a single Gaussian.

input for torch.nn.functional.gumbel_softmax

Say I have a tensor named attn_weights of size [1,a], entries of which indicate the attention weights between the given query and |a| keys. I want to select the largest one using torch.nn.functional.gumbel_softmax.
The docs for this function describe the parameter as logits - […, num_features] unnormalized log probabilities. I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax? Also, Wikipedia defines logit = log(p/(1-p)), which is different from a bare logarithm. Which one should I pass to the function?
Further, I wonder how to choose tau in gumbel_softmax, any guidelines?
I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax?
If attn_weights are probabilities (sum to 1; e.g., output of a softmax), then yes. Otherwise, no.
I wonder how to choose tau in gumbel_softmax, any guidelines?
Usually, it requires tuning. The references provided in the docs can help you with that.
From Categorical Reparameterization with Gumbel-Softmax:
Figure 1, caption:
... (a) For low temperatures (τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform distribution over the categories.
Section 2.2, 2nd paragraph (emphasis mine):
While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature. For learning, there is a tradeoff between small temperatures, where samples are close to one-hot but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small (Figure 1). In practice, we start at a high temperature and anneal to a small but non-zero temperature.
Lastly, they remind the reader that tau can be learned:
If τ is a learned parameter (rather than annealed via a fixed schedule), this scheme can be interpreted as entropy regularization (Szegedy et al., 2015; Pereyra et al., 2016), where the Gumbel-Softmax distribution can adaptively adjust the "confidence" of proposed samples during the training process.
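Putting the two parts of the answer together, a minimal sketch (assuming attn_weights already sums to 1 along the last dimension, and a tau value that you would tune or anneal):
import torch
import torch.nn.functional as F

# Stand-in for your [1, a] attention weights (rows sum to 1).
attn_weights = torch.softmax(torch.randn(1, 5), dim=-1)

# gumbel_softmax expects unnormalized log-probabilities, so take the plain
# log of the normalized weights -- not the logit function log(p / (1 - p)).
logits = torch.log(attn_weights + 1e-20)

# hard=True returns a one-hot selection in the forward pass while keeping
# gradients from the soft sample (straight-through estimator).
selection = F.gumbel_softmax(logits, tau=0.5, hard=True)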

Does fitting a Weibull distribution to data using scipy.stats perform poorly?

I am working on fitting a Weibull distribution to some integer data and estimating the relevant shape, scale, and location parameters. However, I noticed poor performance of the scipy.stats library while doing so.
So I took a different direction and checked the fit performance using the code below. I first create 100 numbers from a Weibull distribution with parameters shape=3, scale=200, location=1. Subsequently, I estimate the best distribution fit using the fitter library.
from fitter import Fitter
import numpy as np
from scipy.stats import weibull_min
# generate numbers
x = weibull_min.rvs(3, scale=200, loc=1, size=100)
# make them integers
data = np.asarray(x, dtype=int)
# fit one of the four distributions
f = Fitter(data, distributions=["gamma", "rayleigh", "uniform", "weibull_min"])
f.fit()
f.summary()
I expect the best fit to be the Weibull distribution. I have tried re-running this test; sometimes the Weibull fit is a good estimate, but most of the time the Weibull fit is reported as the worst result. In this case, the estimated parameters are (0.13836651040093312, 66.99999999999999, 1.3200752378443505). I assume these parameters correspond to shape, scale, and location, in that order. Below is the summary of the fit procedure.
$ f.summary()
             sumsquare_error          aic          bic  kl_div
gamma               0.001601  1182.739756 -1090.410631     inf
rayleigh            0.001819  1154.204133 -1082.276256     inf
uniform             0.002241  1113.815217 -1061.400668     inf
weibull_min         0.004992  1558.203041  -976.698452     inf
Additionally, a plot comparing the fitted distributions is produced (not reproduced here).
Also, the Rayleigh distribution is a special case of the Weibull distribution with shape parameter = 2, so I expect the resulting Weibull fit to be at least as good as the Rayleigh fit.
Update
I ran the tests above on a Linux/Ubuntu 20.04 machine with numpy version 1.19.2 and scipy version 1.5.2. The same code seems to run as expected and return proper results for the Weibull distribution on a Mac machine.
I have also tested fitting a Weibull distribution to the data x generated above, on the Linux machine, using the R library fitdistrplus:
fit.weib <- fitdist(x, "weibull")
and observed that the estimated shape and scale values are very close to the initially given values. My best guess so far is that the problem is due to some Python-Ubuntu bug/incompatibility.
I am still something of a newbie in this area, so I am wondering: am I doing something wrong here, or is this result somehow expected? Any help is greatly appreciated.
Thank you.
The fitter library doesn't allow you to specify fixed parameters for distributions, such as a, loc, etc. And strangely, for the same versions of NumPy and SciPy, Mac produces a better fit while Linux heavily skews the best-fit results. Possible underlying reasons include different BLAS/LAPACK implementations on Linux and Mac (https://stackoverflow.com/a/49274049/6806531), weibull_min not initializing the parameter a = 1 (which is discussed online), or default floating-point accuracy. However, one can work around this inside the fitter library. Knowing that weibull_min is exponweib with the parameter a fixed at 1, change the run function inside _timed_run in fitter.py to
def run(self):
    try:
        if distribution == "exponweib":
            self.result = func(args, floc=0, fa=1, **kwargs)
        else:
            self.result = func(args, floc=0, **kwargs)
    except Exception as err:
        self.exc_info = sys.exc_info()
Using exponweib in place of weibull_min then gives nearly the same results as R's fitdist.
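If you prefer not to patch fitter, a similar workaround can be sketched directly with scipy.stats, fixing the extra exponweib shape parameter at 1 and the location at 0 (assuming data is the integer array from the question):
from scipy.stats import exponweib

# exponweib(a, c) reduces to weibull_min(c) when a == 1, so fix a and loc
# and estimate only the Weibull shape c and the scale.
a_fix, c_hat, loc_fix, scale_hat = exponweib.fit(data, fa=1, floc=0)
print(c_hat, scale_hat)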
I am not familiar with the Fitter library, but in order to draw some conclusions I would suggest:
Retry your code, but with size=10000. Then there are sufficient data points for the fitting methods to use, and theoretically you would expect the Weibull to deliver the best fit.
I noticed that the location parameter can sometimes be a pain. You could try to run your fits with the location parameter fixed via floc=1 (i.e. equal to your sampling parameter for the location), as in the sketch after this list. What do you get? Additionally, FYI, with MLE it suffices to take loc=min(x), where x is your dataset; for the exponential distribution this is in fact the MLE of the location parameter. For other distributions I am not sure, but I wouldn't be surprised if it holds as well. This would reduce the fitting procedure by one parameter.
Lastly, I noticed that if you take small values for the location/scale/shape of some distributions, the logpdf and logcdf functions of the scipy.stats distributions return np.inf values. In that scenario, you could perhaps use the Powell optimization algorithm and set bounds on the values of your parameters.
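For the second suggestion, a minimal sketch of fitting weibull_min with the location held fixed (using data from the question; floc could equally be set to data.min()):
from scipy.stats import weibull_min

# Fix the location so only the shape and scale remain free parameters.
shape_hat, loc_hat, scale_hat = weibull_min.fit(data, floc=1)
print(shape_hat, loc_hat, scale_hat)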
