I'm looking into how to do class weighting using BCEWithLogitsLoss.
https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
The example of how to use pos_weight seems clear to me: if there are 3x more negative samples than positive samples, then you can set pos_weight=3.
Does the weight parameter do the same thing?
Say that I set weight=torch.tensor([1, 3]). Is that the same thing as pos_weight=3?
Also, is weight normalized? Is weight=torch.tensor([1, 3]) the same as weight=torch.tensor([3, 9]), or are they different in how they affect the magnitude of the loss?
They are different things. pos_weight has size n_classes, whereas weight has size batch_size. In the formula on the page you linked, weight is the w_n variable, whereas pos_weight is the p_c variable.
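A minimal sketch of the difference (the toy batch and values below are purely illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(4)                     # 4 samples, 1 output each
targets = torch.tensor([1., 0., 0., 0.])    # ~3x more negatives than positives

# pos_weight: one value per class/output; it multiplies only the positive term,
# so pos_weight=3 counteracts the 3:1 negative/positive imbalance.
crit_pos = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))

# weight: one value per batch element (the w_n in the docs' formula);
# it rescales each sample's whole loss, whether its target is 0 or 1.
crit_w = nn.BCEWithLogitsLoss(weight=torch.tensor([1., 1., 1., 3.]))

print(crit_pos(logits, targets).item(), crit_w(logits, targets).item())
```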
In PyTorch, there is a sampler class called WeightedRandomSampler (https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler). Its 'weights' parameter expects probabilities for N samples. For a uniform distribution, I believe it expects an array with the value 1/N for each sample.
But if I put, say, 0.5 for each sample, where N*0.5 is not equal to 1, does the sampling still stay uniform, given that the probabilities are equal for each sample?
Yes, the sampling will still be uniform. Only the relative magnitude of the weights with respect to the other weights is important, not the absolute magnitude, as pytorch normalizes the weights.
If we look under the hood of WeightedRandomSampler, it makes a call to torch.multinomial which itself makes a call to torch.distributions.Categorical, which we can see here (line 57) normalizes the weights such that they sum to one.
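As a quick empirical check (the sample counts below are just illustrative), equal but unnormalized weights still give uniform sampling:

```python
from collections import Counter

from torch.utils.data import WeightedRandomSampler

N = 4
weights = [0.5] * N  # equal weights, but N * 0.5 != 1

sampler = WeightedRandomSampler(weights, num_samples=100_000, replacement=True)
counts = Counter(list(sampler))
print(counts)  # each index 0..3 should appear roughly 25,000 times
```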
I have a model and some observational data. I used MCMC to obtain the best-fit free parameters and plotted contours of the 1 to 3 sigma confidence levels (as you can see in the plot). I want the +/- error on each best-fit value at each confidence level, but I know the distribution is not symmetric, so I didn't find the usual square-root-of-variance formula useful. Is there another way to calculate the +/- errors?
These are my contours: [figure]
This is what I want to get: [figure]
I used np.percentile(w, [15, 85]), np.percentile(w, [5, 95]) and np.percentile(w, [0.5, 99.5]) on the final probability samples and found the errors for each confidence level correctly.
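As a minimal sketch of that recipe (here `samples` stands in for the MCMC chain of a single parameter, and I use the standard 68%/95% percentile cut points):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)  # toy asymmetric "posterior"

best = np.percentile(samples, 50)               # median as the best-fit value
lo1, hi1 = np.percentile(samples, [16, 84])     # ~1-sigma interval (68%)
lo2, hi2 = np.percentile(samples, [2.5, 97.5])  # ~2-sigma interval (95%)

print(f"1 sigma: {best:.3f} +{hi1 - best:.3f} / -{best - lo1:.3f}")
print(f"2 sigma: {best:.3f} +{hi2 - best:.3f} / -{best - lo2:.3f}")
```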
Say I have a tensor named attn_weights of size [1,a], entries of which indicate the attention weights between the given query and |a| keys. I want to select the largest one using torch.nn.functional.gumbel_softmax.
The docs describe the parameter as logits - [..., num_features] unnormalized log probabilities. I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax? Also, Wikipedia defines logit = log(p/(1-p)), which is different from just the logarithm. Which one should I pass to the function?
Further, I wonder how to choose tau in gumbel_softmax. Any guidelines?
I wonder whether I should take the log of attn_weights before passing it into gumbel_softmax?
If attn_weights are probabilities (sum to 1; e.g., output of a softmax), then yes. Otherwise, no.
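For example, a minimal sketch (the sizes and values are illustrative):

```python
import torch
import torch.nn.functional as F

# attn_weights assumed to be probabilities over |a| keys (rows sum to 1)
attn_weights = torch.softmax(torch.randn(1, 5), dim=-1)

# gumbel_softmax expects (unnormalized) log-probabilities, so take the log first
logits = torch.log(attn_weights)

# hard=True returns a one-hot selection while keeping gradients via the soft sample
one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True)
print(one_hot)
```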
I wonder how to choose tau in gumbel_softmax, any guidelines?
Usually, it requires tuning. The references provided in the docs can help you with that.
From Categorical Reparameterization with Gumbel-Softmax:
Figure 1, caption:
... (a) For low temperatures (τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches the expected value of a categorical random variable with the same logits. As the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform distribution over the categories.
Section 2.2, 2nd paragraph (emphasis mine):
While Gumbel-Softmax samples are differentiable, they are not identical to samples from the corresponding categorical distribution for non-zero temperature. For learning, there is a tradeoff between small temperatures, where samples are close to one-hot but the variance of the gradients is large, and large temperatures, where samples are smooth but the variance of the gradients is small (Figure 1). In practice, we start at a high temperature and anneal to a small but non-zero temperature.
Lastly, they remind the reader that tau can be learned:
If τ is a learned parameter (rather than annealed via a fixed schedule), this scheme can be interpreted as entropy regularization (Szegedy et al., 2015; Pereyra et al., 2016), where the Gumbel-Softmax distribution can adaptively adjust the "confidence" of proposed samples during the training process.
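A minimal sketch of that annealing scheme (the schedule constants below are illustrative, not taken from the paper):

```python
import math

import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, requires_grad=True)

tau_start, tau_min, anneal_rate = 5.0, 0.5, 1e-3  # illustrative values
for step in range(10_000):
    # exponential decay from a high temperature towards a small, non-zero one
    tau = max(tau_min, tau_start * math.exp(-anneal_rate * step))
    sample = F.gumbel_softmax(logits, tau=tau, hard=False)
    # ... use `sample` in the forward pass and backpropagate as usual
```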
Let's say I have a database of users buying products (there are no ratings or anything similar) and I want to recommend other products to them. I am using ALS.trainImplicit, where the training data has the following format:
[Rating(user=2, product=23053, rating=1.0),
Rating(user=2, product=2078, rating=1.0),
Rating(user=3, product=23, rating=1.0)]
So all the ratings in the training dataset are always 1.
Is it normal that the predicted ratings range from a minimum of -0.6 to a maximum of 1.85? I would expect something between 0 and 1.
Yes, it is normal. The implicit version of ALS essentially tries to reconstruct a binary preference matrix P (rather than a matrix of explicit ratings, R). In this case, the "ratings" are treated as confidence levels - higher ratings equal higher confidence that the binary preference p(ij) should be reconstructed as 1 instead of 0.
However, ALS essentially solves a (weighted) least squares regression problem to find the user and item factor matrices that reconstruct matrix P. So the predicted values are not guaranteed to be in the range [0, 1] (though in practice they are usually close to that range). It's best to interpret the predictions as "opaque" scores, where higher values indicate a greater likelihood that the user might purchase that product. That's enough for ranking recommended products by predicted score.
(Note that item-item or user-user similarities are typically computed using cosine similarity between the factor vectors, so those scores will lie in [-1, 1]. That computation is not directly available in Spark, but you can do it yourself; see the sketch below.)
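A rough sketch with pyspark.mllib (assuming `sc` is an existing SparkContext, `ratings` is the RDD of Rating objects from the question, and the hyperparameters are illustrative):

```python
import numpy as np
from pyspark.mllib.recommendation import ALS

# Train the implicit-feedback ALS model on the binary "purchase" ratings
model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)

# Pull the learned item factor vectors to the driver: {product_id: factor_vector}
factors = dict(model.productFeatures().collect())

def cosine(a, b):
    """Cosine similarity between two factor vectors; always lies in [-1, 1]."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(factors[23053], factors[2078]))
```

Collecting the factors to the driver like this is only sensible for a small catalogue; for a large one you would keep the computation distributed.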
If we have a skew normal distribution with location=0, scale=1 and shape=0, then it is the same as a standard normal distribution with mean 0 and variance 1. But if we change the shape parameter, say to shape=5, then the mean and variance also change. How can we fix the mean and variance for different values of the shape parameter?
Just look up how the mean and variance of a skew normal distribution are computed and you have the answer! The mean is

$\mu = \xi + \omega\,\delta\sqrt{2/\pi}$, with $\delta = \dfrac{\alpha}{\sqrt{1+\alpha^2}}$,

and the variance is

$\sigma^2 = \omega^2\left(1 - \dfrac{2\delta^2}{\pi}\right)$.
You can see that with $\xi=0$ (location), $\omega=1$ (scale) and $\alpha=0$ (shape) you really get a standard normal distribution (with mean 0, standard deviation 1): $\delta=0$, so $\mu = 0 + 1\cdot 0\cdot\sqrt{2/\pi} = 0$ and $\sigma^2 = 1\cdot(1-0) = 1$.
If you only change the alpha (shape) to 5, you can expect the mean to differ a lot, and it will be positive. If you want to hold the mean around zero with a higher alpha (shape), you will have to decrease the other parameters, e.g. the omega (scale). The most obvious solution could be to set it to zero instead of 1. See: with $\omega=0$, $\mu = \xi + 0\cdot\delta\sqrt{2/\pi} = 0$.
The mean is set; now we have to get a variance equal to one with omega set to zero and shape set to 5. The formula is known: $\sigma^2 = \omega^2\left(1 - \dfrac{2\delta^2}{\pi}\right)$. With our known parameters: $\sigma^2 = 0^2\cdot\left(1 - \dfrac{2\delta^2}{\pi}\right) = 0 \neq 1$.
Which is insane :) That cannot be done this way. You may instead go back and alter the value of xi rather than omega to get a mean equal to zero. But in that case you first have to compute the only possible value of omega from the variance formula.
Setting $\sigma^2 = 1$ with $\alpha=5$ (so $\delta = 5/\sqrt{26}$) gives $\omega^2\left(1 - \dfrac{2\cdot 25/26}{\pi}\right) = 1$, so omega should be around 1.605681 (negative or positive).
Getting back to the mean: $0 = \xi + \omega\,\delta\sqrt{2/\pi} = \xi + 1.605681\cdot\dfrac{5}{\sqrt{26}}\cdot\sqrt{2/\pi}$, which gives $\xi \approx -1.256269$ (for the positive omega).
So, with the following parameters you should get the distribution you intended: location = 1.256269 (negative or positive), scale = 1.605681 (negative or positive, with location and scale taking opposite signs) and shape = 5.
Please, someone test it, as I might have miscalculated somewhere in the given example.
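As a quick numerical check (a sketch using scipy.stats.skewnorm, taking the positive scale and therefore the negative location):

```python
from scipy.stats import skewnorm

a, loc, scale = 5, -1.256269, 1.605681  # shape (alpha), location (xi), scale (omega)
mean, var = skewnorm.stats(a, loc=loc, scale=scale, moments='mv')
print(mean, var)  # both should come out at approximately 0 and 1
```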