Say I have a tensor named attn_weights of size [1,a], entries of which indicate the attention weights between the given query and |a| keys. I want to select the largest one using torch.nn.functional.gumbel_softmax.
I find docs about this function describe the parameter as logits - […, num_features] unnormalized log probabilities. I wonder whether should I take log of attn_weights before passing it into gumbel_softmax? And I find Wiki defines logit=lg(p/1-p), which is different from barely logrithm. I wonder which one should I pass to the function?
Further, I wonder how to choose tau in gumbel_softmax, any guidelines?
I wonder whether should I take log of attn_weights before passing it into gumbel_softmax?
If attn_weights are probabilities (sum to 1; e.g., output of a softmax), then yes. Otherwise, no.
I wonder how to choose tau in gumbel_softmax, any guidelines?
Usually, it requires tuning. The references provided in the docs can help you with that.
From Categorical Reparameterizaion with Gumbel-Softmax:
Figure 1, caption:
... (a) For low temperatures (τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform distribution over the categories.
Section 2.2, 2nd paragraph (emphasis mine):
While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature. For learning, there is a tradeoff between small temperatures, where samples are close to one-hot but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small (Figure 1). In practice, we start at a high temperature and anneal to a small but non-zero temperature.
Lastly, they remind the reader that tau can be learned:
If τ is a learned parameter (rather than annealed via a fixed
schedule), this scheme can be interpreted as entropy regularization (Szegedy et al., 2015; Pereyra et al., 2016), where the Gumbel-Softmax distribution can adaptively adjust the "confidence" of proposed samples during the training process.
Related
I find the term NoiseMultiplier in the following part of tensorflow-federated tutorial.
def train(rounds, noise_multiplier, clients_per_round, data_frame):
# Using the `dp_aggregator` here turns on differential privacy with adaptive
# clipping.
aggregation_factory = tff.learning.model_update_aggregator.dp_aggregator(
noise_multiplier, clients_per_round)
I have read the paper about differential privacy with adaptive clipping Andrew et al. 2021, Differentially Private Learning with Adaptive Clipping. I guess NoiseMultiplier is the noise we input to the system. However, I find the NoiseMultiplier is a scalar we set. Actually, the different noise should be put into the corresponding weight_variables, so I am so confused about that.
To make an aggregation in TFF differentially private (focusing on an additive application of the Gaussian mechanism, which seems to be what is happening in the symbol you're using), two steps are required:
Incoming tensors must be clipped, so that the total norm of these (potentially structured) tensors considered as a single vector has an explicit upper bound. In this case, the adaptivity in adaptive clipping refers to the upper bound on this norm; this upper bound (the norm to which the incoming tensors are clipped) is what is adjusted through time.
Noise sampled according to an isotropic Gaussian with some variance is added to these clipped tensors (possibly before or after aggregating, depending on the DP model, e.g. local vs central).
The noise multiplier here refers to the relationship between the clipping norm in step 1 and the variance in step 2; in fact, it is their ratio. It is this ratio which determines the privacy budget `used up' by one application of the query; this can be seen, e.g., in the relationship between sensitivity, epsilon and variance in the Wikipedia article on the Gaussian mechanism.
The parameter set here, then, can be understood as a mechanism for specifying how much privacy each step of the algorithm 'costs'. It is a scalar because it is simply the ratio of two scalars; the 'vectorizing' is handled by sampling from the high-dimensional Gaussian, but since this Gaussian is isotropic (spherical), only its scalar variance needs to be known (rather than the full covariance matrix).
Trying to understand how the r-squared (and also explained variance) metrics can be negative (thus indicating non-existant forecasting power) when at the same time the correlation factor between prediction and truth (as well as slope in a linear-regression (regressing truth on prediction)) are positive
R Squared can be negative in a rare scenario.
R squared = 1 – (SSR/SST)
Here, SST stands for Sum of Squared Total which is nothing but how much does the predicted points get varies from the mean of the target variable. Mean is nothing but a regression line here.
SST = Sum (Square (Each data point- Mean of the target variable))
For example,
If we want to build a regression model to predict height of a student with weight as the independent variable then a possible prediction without much effort is to calculate the mean height of all current students and consider it as the prediction.
In the above diagram, red line is the regression line which is nothing but the mean of all heights. This mean calculated without much effort and can be considered as one of the worst method of prediction with poor accuracy. In the diagram itself we can see that the prediction is nowhere near to the original data points.
Now come to SSR,
SSR stands for Sum of Squared Residuals. This residual is calculated from the model which we build from our mathematical approach (Linear regression, Bayesian regression, Polynomial regression or any other approach). If we use a sophisticated approach rather than using a naive approach like mean then our accuracy will obviously increase.
SSR = Sum (Square (Each data point - Each corresponding data point in the regression line))
In the above diagram, let's consider that the blue line indicates a sophisticated model with large mathematical analysis. We can see that it has obviously higher accuracy than the red line.
Now come to the formula,
R Squared = 1- (SSR/SST)
Here,
SST will be large number because it a very poor model (red line).
SSR will be a small number because it is the best model we developed
after much mathematical analysis (blue line).
So, SSR/SST will be a very small number (It will become very small
whenever SSR decreases).
So, 1- (SSR/SST) will be large number.
So we can infer that whenever R Squared goes higher, it means the
model is too good.
This is a generic case but this cannot be applied in many cases where multiple independent variables are present. In the example, we had only one independent variable and one target variable but in real case, we will have 100's of independent variables for a single dependent variable. The actual problem is that, out of 100's of independent variables-
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So, RSquared is calculated on an assumption that the average line of the target which is perpendicular line of y axis is the worst fit a model can have at a maximum riskiest case. SST is the squared difference between this average line and original data points. Similarly, SSR is the squared difference between the predicted data points (by the model plane) and original data points.
SSR/SST gives a ratio how SSR is worst with respect to SST. If your model can somewhat build a plane which is a comparatively good than the worst, then in 99% cases SSR<SST. It eventually makes R squared as positive if you substitute it in the equation.
But what if SSR>SST ? This means that your regression plane is worse than the mean line (SST). In this case, R squared will be obviously negative. But it happens only at 1% of cases or smaller.
Answer was originally written in quora by me -
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr
Lets say I have a database with users buying products(There are no ratings or something similar) and I want to recommend others products for them. I am using ATL.trainImplicit where the training data has the following format:
[Rating(user=2, product=23053, rating=1.0),
Rating(user=2, product=2078, rating=1.0),
Rating(user=3, product=23, rating=1.0)]
So all the ratings in the training dataset is always 1.
Is it normal that the predictions ratings gave min value -0.6 and max rating 1.85? I would expect something between 0 and 1.
Yes, it is normal. The implicit version of ALS essentially tries to reconstruct a binary preference matrix P (rather than a matrix of explicit ratings, R). In this case, the "ratings" are treated as confidence levels - higher ratings equals higher confidence that the binary preference p(ij) should be reconstructed as 1 instead of 0.
However, ALS essentially solves a (weighted) least squares regression problem to find the user and item factor matrices that reconstruct matrix P. So the predicted values are not guaranteed to be in the range [0, 1] (though in practice they are usually close to that range). It's enough to interpret the predictions as "opaque" values where higher values equate to greater likelihood that the user might purchase that product. That's enough for sorting recommended products by predicted score.
(Note item-item or user-user similarities are typically computed using cosine similarity between the factor vectors, so these scores will lie in [-1, 1]. That computation is not directly available in Spark but can be done yourself).
I am working on a simple AI program that classifies shapes using unsupervised learning method. Essentially I use the number of sides and angles between the sides and generate aggregates percentages to an ideal value of a shape. This helps me create some fuzzingness in the result.
The problem is how do I represent the degree of error or confidence in the classification? For example: a small rectangle that looks very much like a square would yield night membership values from the two categories but can I represent the degree of error?
Thanks
Your confidence is based on used model. For example, if you are simply applying some rules based on the number of angles (or sides), you have some multi dimensional representation of objects:
feature 0, feature 1, ..., feature m
Nice, statistical approach
You can define some kind of confidence intervals, baesd on your empirical results, eg. you can fit multi-dimensional gaussian distribution to your empirical observations of "rectangle objects", and once you get a new object you simply check the probability of such value in your gaussian distribution, and have your confidence (which would be quite well justified with assumption, that your "observation" errors have normal distribution).
Distance based, simple approach
Less statistical approach would be to directly take your model's decision factor and compress it to the [0,1] interaval. For example, if you simply measure distance from some perfect shape to your new object in some metric (which yields results in [0,inf)) you could map it using some sigmoid-like function, eg.
conf( object, perfect_shape ) = 1 - tanh( distance( object, perfect_shape ) )
Hyperbolic tangent will "squash" values to the [0,1] interval, and the only remaining thing to do would be to select some scaling factor (as it grows quite quickly)
Such approach would be less valid in the mathematical terms, but would be similar to the approach taken in neural networks.
Relative approach
And more probabilistic approach could be also defined using your distance metric. If you have distances to each of your "perfect shapes" you can calculate the probability of an object being classified as some class with assumption, that classification is being performed at random, with probiability proportional to the inverse of the distance to the perfect shape.
dist(object, perfect_shape1) = d_1
dist(object, perfect_shape2) = d_2
dist(object, perfect_shape3) = d_3
...
inv( d_i )
conf(object, class_i) = -------------------
sum_j inv( d_j )
where
inv( d_i ) = max( d_j ) - d_i
Conclusions
First two ideas can be also incorporated into the third one to make use of knowledge of all the classes. In your particular example, the third approach should result in confidence of around 0.5 for both rectangle and circle, while in the first example it would be something closer to 0.01 (depending on how many so small objects would you have in the "training" set), which shows the difference - first two approaches show your confidence in classifing as a particular shape itself, while the third one shows relative confidence (so it can be low iff it is high for some other class, while the first two can simply answer "no classification is confident")
Building slightly on what lejlot has put forward; my preference would be to use the Mahalanobis distance with some squashing function. The Mahalanobis distance M(V, p) allows you to measure the distance between a distribution V and a point p.
In your case, I would use "perfect" examples of each class to generate the distribution V and p is the classification you want the confidence of. You can then use something along the lines of the following to be your confidence interval.
1-tanh( M(V, p) )
I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, which attributes/dates contribute to the result to what extent. Therefore I am just using the feature_importances_, which works well for me.
However, I would like to know how they are getting calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.
There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.
In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
(Note that both algorithms are available in the randomForest R package.)
[1]: Breiman, Friedman, "Classification and regression trees", 1984.
The usual way to compute the feature importance values of a single tree is as follows:
you initialize an array feature_importances of all zeros with size n_features.
you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].
The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy, MSE, ...). Its the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
Its important that these values are relative to a specific dataset (both error reduction and the number of samples are dataset specific) thus these values cannot be compared between different datasets.
As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.
Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.
Edit: this description is only partially correct: Gilles and Peter's answers are the correct answer.
As #GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting, where you randomly permute the values for each of your features one-by-one and see how much worse your out-of-bag performance is.
Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.
If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:
https://github.com/pjh2011/rf_perm_feat_import
Edit: This works for Python 2.7, not 3
code:
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)
decision_tree plot:
enter image description here
We get
compute_feature_importance:[0. ,0.01333333,0.06405596,0.92261071]
Check source code:
cpdef compute_feature_importances(self, normalize=True):
"""Computes the importance of each feature (aka variable)."""
cdef Node* left
cdef Node* right
cdef Node* nodes = self.nodes
cdef Node* node = nodes
cdef Node* end_node = node + self.node_count
cdef double normalizer = 0.
cdef np.ndarray[np.float64_t, ndim=1] importances
importances = np.zeros((self.n_features,))
cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data
with nogil:
while node != end_node:
if node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = &nodes[node.left_child]
right = &nodes[node.right_child]
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1
importances /= nodes[0].weighted_n_node_samples
if normalize:
normalizer = np.sum(importances)
if normalizer > 0.0:
# Avoid dividing by zero (e.g., when root is pure)
importances /= normalizer
return importances
Try calculate the feature importance:
print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))
We get feature_importance: np.array([0,1.332,6.418,92.30]).
After normalized, we get array ([0., 0.01331334, 0.06414793, 0.92253873]),this is same as clf.feature_importances_.
Be careful all classes are supposed to have weight one.
For those looking for a reference to the scikit-learn's documentation on this topic or a reference to the answer by #GillesLouppe:
In RandomForestClassifier, estimators_ attribute is a list of DecisionTreeClassifier (as mentioned in the documentation). In order to compute the feature_importances_ for the RandomForestClassifier, in scikit-learn's source code, it averages over all estimator's (all DecisionTreeClassifer's) feature_importances_ attributes in the ensemble.
In DecisionTreeClassifer's documentation, it is mentioned that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."
Here is a direct link for more info on variable and Gini importance, as provided by scikit-learn's reference below.
[1] L. Breiman, and A. Cutler, “Random Forests”, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Feature Importance in Random Forest
Random forest uses many trees, and thus, the variance is reduced
Random forest allows far more exploration of feature combinations as well
Decision trees gives Variable Importance and it is more if there is reduction in impurity (reduction in Gini impurity)
Each tree has a different Order of Importance
Here is what happens in the background!
We take an attribute and check in all the trees where it is present and take the average values of the change in the homogeneity on this attribute split. This average value of change in the homogeneity gives us the feature importance of the attribute