What does "noisemultiplier" mean in tensorflow-federated tutorial? - python-3.x

I found the term NoiseMultiplier in the following part of the tensorflow-federated tutorial:
def train(rounds, noise_multiplier, clients_per_round, data_frame):
  # Using the `dp_aggregator` here turns on differential privacy with adaptive
  # clipping.
  aggregation_factory = tff.learning.model_update_aggregator.dp_aggregator(
      noise_multiplier, clients_per_round)
I have read the paper on differential privacy with adaptive clipping (Andrew et al. 2021, Differentially Private Learning with Adaptive Clipping). My guess was that NoiseMultiplier is the noise we inject into the system. However, NoiseMultiplier is a single scalar that we set, whereas I would expect different noise to be added to each of the corresponding weight variables, so I am confused.

To make an aggregation in TFF differentially private (focusing on an additive application of the Gaussian mechanism, which seems to be what is happening in the symbol you're using), two steps are required:
1. Incoming tensors must be clipped, so that the total norm of these (potentially structured) tensors considered as a single vector has an explicit upper bound. In this case, the adaptivity in adaptive clipping refers to the upper bound on this norm; this upper bound (the norm to which the incoming tensors are clipped) is what is adjusted through time.
2. Noise sampled according to an isotropic Gaussian with some variance is added to these clipped tensors (possibly before or after aggregating, depending on the DP model, e.g. local vs central).
The noise multiplier here refers to the relationship between the clipping norm in step 1 and the noise scale in step 2; in fact, it is the ratio of the noise standard deviation to the clipping norm. It is this ratio which determines the privacy budget 'used up' by one application of the query; this can be seen, e.g., in the relationship between sensitivity, epsilon and variance in the Wikipedia article on the Gaussian mechanism.
The parameter set here, then, can be understood as a mechanism for specifying how much privacy each step of the algorithm 'costs'. It is a scalar because it is simply the ratio of two scalars; the 'vectorizing' is handled by sampling from the high-dimensional Gaussian, but since this Gaussian is isotropic (spherical), only its scalar variance needs to be known (rather than the full covariance matrix).
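To make the role of the scalar concrete, here is a minimal sketch in plain Python of the two steps for a central-DP sum. It is not TFF's implementation: the function names are hypothetical, the clipping norm is held fixed instead of being adapted over rounds, and the noise standard deviation is assumed to be noise_multiplier * clip_norm.

```python
import math
import random


def clip_by_global_norm(update, clip_norm):
    """Scale the whole update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def noisy_clipped_sum(client_updates, clip_norm, noise_multiplier, rng):
    """Step 1: clip every client update. Step 2: add isotropic Gaussian noise."""
    clipped = [clip_by_global_norm(u, clip_norm) for u in client_updates]
    total = [sum(coords) for coords in zip(*clipped)]
    # Assumption: one scalar standard deviation is shared by all coordinates.
    stddev = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, stddev) for t in total]
    return noisy, stddev


rng = random.Random(0)
# Hypothetical flattened model updates from three clients.
updates = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
noisy_sum, stddev = noisy_clipped_sum(
    updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(noisy_sum, stddev)
```

Every coordinate receives its own independent noise draw, but all draws share the same scalar standard deviation, which is why a single number is enough to configure the mechanism.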

Related

How to use a residual plot to determine if the relationship looks linear

I was attempting some questions based on residplot() in seaborn. There were two residual plots for which I had to tell whether the relationship is linear. Can anyone explain how this is determined just by looking at the plot? Apparently:
1. This plot shows the linear relationship
2. This plot shows a non-linear relationship
Roughly speaking, these residual plots enable you to visually check whether the residuals still contain some nonlinear behaviour with respect to your explanatory variables. Two remarks for further explanation:
The residuals of a correctly specified model (e.g. the baseline linear model) should be similar to random noise. In the absence of remaining patterns in the residuals, we have no indication that important features have been omitted.
If the residuals suggest a pattern, then this means that we failed to take some (nonlinear) effects into account. You should reconsider the model specification. If the baseline model was linear, then including some nonlinear terms might "clean" the residuals.
This kind of visual inspection is often subjective. However, you can argue that 1. is just a random cloud of points whereas 2. shows some remaining curvature. There is also a statistical test that does this kind of assessment for you: the Ramsey Regression Equation Specification Error Test (RESET).
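As a rough numerical counterpart to eyeballing the plot, the sketch below fits a straight line to two synthetic datasets and measures how strongly the residuals still correlate with a squared term. This is an informal check in the spirit of RESET, not the test itself; the data, the function names and the choice of a squared term are all assumptions made for illustration.

```python
import math
import random


def linear_fit_residuals(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return the residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def correlation(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mean_u) ** 2 for a in u)
                           * sum((b - mean_v) ** 2 for b in v))


def curvature_signal(xs, ys):
    """|correlation| between the residuals and the squared, centered x values."""
    mean_x = sum(xs) / len(xs)
    squared = [(x - mean_x) ** 2 for x in xs]
    return abs(correlation(linear_fit_residuals(xs, ys), squared))


rng = random.Random(0)
xs = [i / 10 for i in range(-50, 51)]
linear_ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
curved_ys = [1.0 + 2.0 * x + 0.5 * x * x + rng.gauss(0.0, 1.0) for x in xs]

print("linear data:", round(curvature_signal(xs, linear_ys), 3))
print("curved data:", round(curvature_signal(xs, curved_ys), 3))
```

A value near 0 corresponds to the "random cloud" case, while a value near 1 corresponds to residuals that still bend with x.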

input for torch.nn.functional.gumbel_softmax

Say I have a tensor named attn_weights of size [1,a], entries of which indicate the attention weights between the given query and |a| keys. I want to select the largest one using torch.nn.functional.gumbel_softmax.
The docs for this function describe the parameter as logits - […, num_features] unnormalized log probabilities. I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax? Also, Wikipedia defines logit = log(p/(1-p)), which is different from the plain logarithm. Which one should I pass to the function?
Further, I wonder how to choose tau in gumbel_softmax, any guidelines?
I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax?
If attn_weights are probabilities (sum to 1; e.g., output of a softmax), then yes. Otherwise, no.
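The reason the plain logarithm is the right transform can be checked with a small sketch in plain Python. It is not PyTorch's implementation: the helper names and the example weights are hypothetical, and the sampler is only an illustration of the Gumbel-Softmax idea (perturb the logits with Gumbel noise, then apply a tempered softmax).

```python
import math
import random


def softmax(logits, tau=1.0):
    scaled = [l / tau for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    # Clamp away from 0 and 1 so that both logarithms are defined.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_sample(logits, tau, rng):
    noisy = [l + gumbel_noise(rng) for l in logits]
    return softmax(noisy, tau)


attn_weights = [0.7, 0.2, 0.1]  # hypothetical attention weights, sum to 1
logits = [math.log(w) for w in attn_weights]

recovered = softmax(logits)         # equals attn_weights up to rounding
mismatched = softmax(attn_weights)  # a different, flatter distribution

rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(5000):
    sample = gumbel_softmax_sample(logits, tau=1.0, rng=rng)
    counts[sample.index(max(sample))] += 1
frequencies = [c / 5000 for c in counts]
print(recovered, mismatched, frequencies)
```

With log(attn_weights) as logits, the index of the largest entry is drawn with frequencies close to the original weights. In PyTorch the corresponding call would be along the lines of torch.nn.functional.gumbel_softmax(attn_weights.log(), tau=tau, hard=True).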
I wonder how to choose tau in gumbel_softmax, any guidelines?
Usually, it requires tuning. The references provided in the docs can help you with that.
From Categorical Reparameterization with Gumbel-Softmax:
Figure 1, caption:
... (a) For low temperatures (τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform distribution over the categories.
Section 2.2, 2nd paragraph (emphasis mine):
While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature. For learning, there is a tradeoff between small temperatures, where samples are close to one-hot but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small (Figure 1). In practice, we start at a high temperature and anneal to a small but non-zero temperature.
Lastly, they remind the reader that tau can be learned:
If τ is a learned parameter (rather than annealed via a fixed schedule), this scheme can be interpreted as entropy regularization (Szegedy et al., 2015; Pereyra et al., 2016), where the Gumbel-Softmax distribution can adaptively adjust the "confidence" of proposed samples during the training process.
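The effect of the temperature described in these quotes can be seen in a small plain-Python sketch. The sampler is an illustration and not PyTorch's code, and the annealing schedule (exponential decay with a non-zero floor) together with its rate and floor values is a hypothetical choice that would need tuning.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Sketch of one sample: perturb the logits, then apply a tempered softmax."""
    noisy = []
    for l in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((l - math.log(-math.log(u))) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mean_peak(logits, tau, rng, draws=2000):
    """Average of the largest entry per sample; near 1 means almost one-hot."""
    peaks = (max(gumbel_softmax_sample(logits, tau, rng)) for _ in range(draws))
    return sum(peaks) / draws


def annealed_tau(step, tau_start=1.0, tau_min=0.5, rate=1e-4):
    """Hypothetical schedule: exponential decay that never goes below tau_min."""
    return max(tau_min, tau_start * math.exp(-rate * step))


rng = random.Random(0)
logits = [math.log(p) for p in (0.7, 0.2, 0.1)]
peak_cold = mean_peak(logits, tau=0.1, rng=rng)
peak_hot = mean_peak(logits, tau=10.0, rng=rng)
print(peak_cold, peak_hot, annealed_tau(0), annealed_tau(100_000))
```

At a low temperature the samples are close to one-hot, and at a high temperature they drift toward the uniform value of 1/3 per category.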

How can r-squared be negative when the correlation between prediction and truth is positive?

I am trying to understand how the r-squared (and also explained variance) metrics can be negative (thus indicating non-existent forecasting power) when at the same time the correlation between prediction and truth (as well as the slope in a linear regression of truth on prediction) is positive.
R squared can be negative, although this is uncommon.
R squared = 1 - (SSR/SST)
Here, SST stands for the total sum of squares, which measures how much the observed data points vary around the mean of the target variable. The mean acts as the baseline regression line here.
SST = Sum (Square (Each data point - Mean of the target variable))
For example,
If we want to build a regression model to predict the height of a student with weight as the independent variable, then a possible prediction that takes little effort is to calculate the mean height of all current students and use it as the prediction.
In the above diagram, the red line is the regression line, which is simply the mean of all heights. This mean is calculated without much effort and can be considered one of the worst methods of prediction, with poor accuracy. In the diagram itself we can see that the prediction is nowhere near the original data points.
Now come to SSR,
SSR stands for Sum of Squared Residuals. The residuals are calculated from the model we build with our chosen approach (linear regression, Bayesian regression, polynomial regression or any other approach). If we use a more sophisticated approach rather than a naive one like the mean, our accuracy will usually increase.
SSR = Sum (Square (Each data point - The corresponding point on the regression line))
In the above diagram, let's consider that the blue line indicates a more sophisticated model. We can see that it has clearly higher accuracy than the red line.
Now come to the formula,
R Squared = 1- (SSR/SST)
Here,
SST will be a large number because it comes from a very poor model (red line).
SSR will be a small number because it comes from the best model we developed after much mathematical analysis (blue line).
So, SSR/SST will be a very small number (it becomes smaller whenever SSR decreases).
So, 1 - (SSR/SST) will be close to 1.
So we can infer that the higher R squared is, the better the model fits.
This is the generic case, but it does not carry over cleanly to cases where multiple independent variables are present. In the example, we had only one independent variable and one target variable, but in a real case we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of hundreds of independent variables:
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So, R squared is calculated on the assumption that the average line of the target, which is a horizontal line (perpendicular to the y axis), is the worst fit a model should have. SST is the sum of squared differences between this average line and the original data points. Similarly, SSR is the sum of squared differences between the predicted data points (from the model plane) and the original data points.
SSR/SST gives a ratio of how bad SSR is with respect to SST. If your model builds a plane that is at least somewhat better than this baseline, then SSR < SST, which is the usual case, and R squared is positive when you substitute into the equation.
But what if SSR > SST? This means that your regression plane is worse than the mean line. In this case, R squared will be negative. This is uncommon, but it can happen, for example when a model is evaluated on data it was not fitted to. It can occur even when the predictions correlate positively with the truth, because correlation ignores any constant offset or rescaling of the predictions, while SSR penalizes both.
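A tiny worked example shows SSR > SST alongside a perfect positive correlation. The numbers are made up for illustration: the predictions follow the truth exactly but sit 10 units too high.

```python
import math


def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ssr / sst


def pearson(u, v):
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mean_u) ** 2 for a in u)
                           * sum((b - mean_v) ** 2 for b in v))


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical predictions: right shape, wrong level.
y_pred = [t + 10.0 for t in y_true]

# SSR = 5 * 10^2 = 500 and SST = 4 + 1 + 0 + 1 + 4 = 10.
print(pearson(y_true, y_pred))    # 1.0
print(r_squared(y_true, y_pred))  # -49.0
```

Regressing truth on prediction here gives a slope of 1, so both the correlation and the slope are positive while R squared is strongly negative.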
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr

Why does scikit learn return log-density?

The function score_samples from sklearn.neighbors.kde.KernelDensity returns the log of the density. What is the advantage of that over returning the density itself?
I know that the logarithm makes sense for probabilities, which are between 0 and 1 (see this question: Why use log-probability estimates in GaussianNB [scikit-learn]?). But why do the same for densities, which are between 0 and infinity?
Is there a way to estimate log-density directly, or is it just the logarithm taken from the estimated density?
Much of what applies to probabilities also applies to densities, so the answers in Why use log-probability estimates in GaussianNB [scikit-learn]? apply:
As long as the density is everywhere positive, the logarithm is well defined. It has much better numerical resolution and stability as the density tends toward 0. Imagine a Gaussian kernel of a certain width to model your points and imagine them in a cluster somewhere. As you move away from this dense area, the log density behaves like the negative squared distance to the cluster. The exponential of that will quickly yield very small quantities which you can rightfully no longer trust.
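The underflow argument can be checked directly with a small plain-Python sketch of a 1-D Gaussian kernel density estimate that stays in log space by using the log-sum-exp trick. This is only an illustration under assumed data and bandwidth; it is not a description of how scikit-learn computes its result internally.

```python
import math


def gaussian_log_kernel(x, center, bandwidth):
    z = (x - center) / bandwidth
    return -0.5 * z * z - math.log(bandwidth * math.sqrt(2.0 * math.pi))


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def kde_log_density(x, points, bandwidth):
    """Log of the average Gaussian kernel, computed without leaving log space."""
    logs = [gaussian_log_kernel(x, p, bandwidth) for p in points]
    return log_sum_exp(logs) - math.log(len(points))


points = [0.0, 0.5, 1.0]  # hypothetical 1-D training cluster
far_x = 60.0              # a query point far away from the cluster

log_density = kde_log_density(far_x, points, bandwidth=1.0)
density = math.exp(log_density)  # underflows to 0.0 in double precision
print(log_density, density)
```

The log density is a perfectly usable finite number, while the density itself is indistinguishable from zero, so two far-away points could no longer be compared.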

Representing classification confidence

I am working on a simple AI program that classifies shapes using an unsupervised learning method. Essentially I use the number of sides and the angles between the sides and generate aggregate percentages relative to the ideal values of a shape. This helps me create some fuzziness in the result.
The problem is how do I represent the degree of error or confidence in the classification? For example: a small rectangle that looks very much like a square would yield high membership values for the two categories, but how can I represent the degree of error?
Thanks
Your confidence depends on the model used. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi-dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence intervals based on your empirical results, e.g. you can fit a multi-dimensional Gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such a value under your Gaussian distribution and use it as your confidence (which would be quite well justified under the assumption that your "observation" errors are normally distributed).
Distance based, simple approach
A less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interval. For example, if you simply measure the distance from some perfect shape to your new object in some metric (which yields results in [0,inf)), you could map it using some sigmoid-like function, e.g.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
The hyperbolic tangent will "squash" values to the [0,1] interval, and the only remaining thing to do would be to select some scaling factor (as it grows quite quickly).
Such an approach would be less valid in mathematical terms, but would be similar to the approach taken in neural networks.
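A minimal sketch of this distance-based confidence, including the scaling factor, could look as follows. The feature choice (number of sides, width/height ratio), the Euclidean metric and the scale values are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tanh_confidence(obj, perfect_shape, scale=1.0):
    """1 - tanh(scale * distance): 1.0 at distance 0, approaching 0 far away."""
    return 1.0 - math.tanh(scale * euclidean_distance(obj, perfect_shape))


# Hypothetical features: (number of sides, width / height ratio).
perfect_square = (4.0, 1.0)
near_square = (4.0, 1.1)
long_rectangle = (4.0, 3.0)

print(tanh_confidence(near_square, perfect_square))
print(tanh_confidence(long_rectangle, perfect_square))
print(tanh_confidence(long_rectangle, perfect_square, scale=0.25))
```

A smaller scale makes the confidence fall off more slowly with distance, which is the knob that needs to be chosen for the metric in use.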
Relative approach
A more probabilistic approach could also be defined using your distance metric. If you have distances to each of your "perfect shapes", you can calculate the probability of an object being classified as some class under the assumption that classification is performed at random, with probability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
conf(object, class_i) = inv( d_i ) / sum_j inv( d_j )
where
inv( d_i ) = max( d_j ) - d_i
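The formula above can be sketched in a few lines of Python. The distances are hypothetical, and the uniform fallback for the case where all distances are equal (where the formula would be 0/0) is an assumption added for the sketch.

```python
def relative_confidence(distances):
    """conf_i = inv(d_i) / sum_j inv(d_j), with inv(d_i) = max_j(d_j) - d_i."""
    d_max = max(distances.values())
    inv = {name: d_max - d for name, d in distances.items()}
    total = sum(inv.values())
    if total == 0.0:
        # Assumption: all distances are equal, so fall back to a uniform answer.
        return {name: 1.0 / len(distances) for name in distances}
    return {name: value / total for name, value in inv.items()}


# Hypothetical distances from one object to each "perfect shape".
distances = {"rectangle": 0.2, "square": 0.2, "circle": 3.0}
confidences = relative_confidence(distances)
print(confidences)
```

Note that with this definition of inv the farthest class always receives a confidence of 0, so the scheme is only informative when there are at least three classes.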
Conclusions
The first two ideas can also be incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in confidence of around 0.5 for both rectangle and square, while the first approach would give something closer to 0.01 (depending on how many such small objects you have in the "training" set). This shows the difference: the first two approaches show your confidence in classifying an object as a particular shape by itself, while the third one shows relative confidence (so it can be low iff it is high for some other class, while the first two can simply answer "no classification is confident").
Building slightly on what lejlot has put forward, my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V, and p is the object whose classification you want the confidence of. You can then use something along the lines of the following as your confidence score.
1-tanh( M(V, p) )
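For two features the Mahalanobis distance can be written out by hand, which gives a dependency-free sketch of this idea. The feature choice (aspect ratio, mean corner angle in degrees) and the sample values are hypothetical, and the distribution V is summarized by its sample mean and covariance.

```python
import math


def mean_and_cov_2d(samples):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(point, mean, cov):
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # (dx, dy) * inverse(cov) * (dx, dy)^T written out for the 2x2 case.
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)


def confidence(point, mean, cov):
    return 1.0 - math.tanh(mahalanobis_2d(point, mean, cov))


# Hypothetical "perfect" squares: (aspect ratio, mean corner angle in degrees).
squares = [(1.00, 90.0), (1.02, 89.0), (0.98, 91.5), (1.01, 90.5), (0.99, 89.5)]
mean, cov = mean_and_cov_2d(squares)

conf_near = confidence((1.01, 90.2), mean, cov)
conf_far = confidence((1.50, 90.1), mean, cov)
print(conf_near, conf_far)
```

Unlike a plain Euclidean distance, the Mahalanobis distance rescales each direction by how much the "perfect" examples themselves vary, so a deviation in a tightly controlled feature costs more confidence than the same deviation in a loose one.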

Resources