Understanding batched `F.cosine_similarity` in `torch` - pytorch

In the SimCLR torch implementation, cosine similarity is computed between feature maps of shape (512,128) in the following way:
cos_sim = F.cosine_similarity(feats[:,None,:], feats[None,:,:], dim=-1)
Why do we need to expand the dimensions this way, instead of just computing F.cosine_similarity(feats, feats) according to the documentation? I would really like to see more examples for that.
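A quick sketch (with random features standing in for the real SimCLR embeddings) showing how the two calls differ in what they compute:

```python
import torch
import torch.nn.functional as F

feats = torch.randn(512, 128)

# Broadcasting (512, 1, 128) against (1, 512, 128) compares every row with every
# other row, giving the full (512, 512) pairwise similarity matrix SimCLR needs.
pairwise = F.cosine_similarity(feats[:, None, :], feats[None, :, :], dim=-1)
print(pairwise.shape)  # torch.Size([512, 512])

# Without the extra dimensions the rows are matched element-wise instead,
# so you only get each row's similarity with itself (all ones here).
diagonal = F.cosine_similarity(feats, feats, dim=-1)
print(diagonal.shape)  # torch.Size([512])
```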

Related

How to calculate mutual information in PyTorch (differentiable estimator)

I am training a model with pytorch, where I need to calculate the degree of dependence between two tensors (let's say they are two tensors, each containing values very close to zero or one, e.g. v1 = [0.999, 0.998, 0.001, 0.98] and v2 = [0.97, 0.01, 0.997, 0.999]) as a part of my loss function. I am trying to calculate mutual information, but I can't find any mutual information estimation implementation in PyTorch. Has such a thing been provided anywhere?
Mutual information is defined for distributions, not individual points. So, I will write the next part assuming v1 and v2 are samples from a distribution p. I will also assume that you have n samples from p, with n > 1.
You want a method to estimate mutual information from samples. There are many ways to do this. One of the simplest would be to use a non-parametric estimator like NPEET (https://github.com/gregversteeg/NPEET). It works with numpy (you can convert from torch to numpy for this). There are more involved parametric models for which you may be able to find implementations in PyTorch (see https://arxiv.org/abs/1905.06922).
If you only have two vectors and want to compute a similarity measure, a dot product similarity would be more suitable than mutual information as there is no distribution.
It is not provided in the official PyTorch code, but here is a PyTorch implementation that uses kernel density estimation for the histogram approximation. Note that this method is fully differentiable.
Alternatively, you can also use the differentiable histogram functions in Kornia to compute the MI metric yourself if you want more control for whatever reason.
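To illustrate the kernel-density idea, here is a minimal sketch of a differentiable MI estimate built from Gaussian soft histograms; the bin count, bandwidth, and the assumption that values lie in [0, 1] are arbitrary choices for this example, not taken from the linked implementation:

```python
import torch

def soft_histogram_2d(x, y, bins=32, sigma=0.01, eps=1e-10):
    # Differentiable 2D histogram: each sample is softly assigned to bins
    # via a Gaussian kernel centred at the bin midpoints (assumes values in [0, 1]).
    centers = torch.linspace(0.0, 1.0, bins, device=x.device)
    wx = torch.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)
    wy = torch.exp(-0.5 * ((y[:, None] - centers[None, :]) / sigma) ** 2)
    wx = wx / (wx.sum(dim=1, keepdim=True) + eps)
    wy = wy / (wy.sum(dim=1, keepdim=True) + eps)
    joint = wx.t() @ wy                  # (bins, bins) joint histogram
    return joint / joint.sum()

def mutual_information(x, y, bins=32, sigma=0.01, eps=1e-10):
    # I(X;Y) = sum_ij p(i,j) * log( p(i,j) / (p(i) p(j)) )
    pxy = soft_histogram_2d(x, y, bins, sigma, eps)
    px = pxy.sum(dim=1, keepdim=True)
    py = pxy.sum(dim=0, keepdim=True)
    return (pxy * (torch.log(pxy + eps) - torch.log(px * py + eps))).sum()

v1 = torch.tensor([0.999, 0.998, 0.001, 0.98], requires_grad=True)
v2 = torch.tensor([0.97, 0.01, 0.997, 0.999])
mi = mutual_information(v1, v2)
mi.backward()  # gradients flow back to v1, so it can be used in a loss
```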

In the scikit learn implementation of LDA what is the difference between transform and decision_function?

I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that those two functions give different AUC scores, which indicates that one is not a monotonic transformation of the other, as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.
LDA projects your multivariate data onto a 1D space. The projection is based on a linear combination of all your attributes (columns in X). The weights of the attributes are determined by maximizing the class separation. Subsequently, a threshold value in this 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space, x' = transform(X). decision_function(X) gives you the log likelihood ratio of the positive class for each observation, log(P(y=1 | x') / P(y=0 | x')).
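A small sketch on made-up toy data (the blobs and labels below are hypothetical), showing how to put the two outputs side by side and compare their AUCs on your own data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: two Gaussian blobs with a binary label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(1.0, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

proj = lda.transform(X).ravel()    # position of each sample in the 1D discriminant space
score = lda.decision_function(X)   # log likelihood ratio of the positive class

# Both are linear functions of X but with different scaling and offset;
# comparing them (and their AUCs) makes the relationship explicit on your data.
print(np.corrcoef(proj, score)[0, 1])
print(roc_auc_score(y, proj), roc_auc_score(y, score))
```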

RandomForestClassifier in Multi-label problem - how it works?

How does the RandomForestClassifier of sklearn handle a multilabel problem (under the hood)?
For example, does it break the problem into distinct one-label problems?
Just to be clear, I have not really tested it yet but I see y : array-like, shape = [n_samples] or [n_samples, n_outputs] at the .fit() function of the RandomForestClassifier.
Let me cite scikit-learn. From the user guide on random forests:
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
And from the multi-output problems section of the decision trees user guide:
… to support multi-output problems. This requires the following changes:
Store n output values in leaves, instead of 1;
Use splitting criteria that compute the average reduction across all n outputs.
And I hope this will answer your question. If not, you can look at the section's reference:
M. Dumont et al., Fast multi-class image annotation with random subwindows and multiple output randomized trees, International Conference on Computer Vision Theory and Applications, 2009.
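As an illustration of that, here is a minimal sketch with made-up data: a single forest is fitted jointly on a 2D Y (one column per output), and predict/predict_proba return one result per output rather than one forest per label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Two binary outputs, i.e. Y has shape (n_samples, n_outputs).
Y = np.column_stack([(X[:, 0] > 0).astype(int),
                     (X[:, 1] > 0).astype(int)])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, Y)

print(clf.predict(X[:3]))             # shape (3, 2): one column per output
print(len(clf.predict_proba(X[:3])))  # list with one (n_samples, n_classes) array per output
```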
I was a bit confused when I started using trees. If you refer to the sklearn doc:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
If you scroll down through the methods to predict_proba, you can see:
"The predicted class probability is the fraction of samples of the same class in a leaf."
So in predict, the class is the mode of the classes in that leaf. This can change if you use weighted classes:
"class_weight : dict, list of dicts, “balanced” or None, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one."
Hope this helps! :)

Sklearn and gensim's TF-IDF implementation

I've been trying to determine the similarity between a set of documents, and one of the methods I'm using is the cosine similarity with the results of the TF-IDF.
I tried to use both sklearn and gensim's implementations, which give me similar results, but my own implementation results in a different matrix.
After analyzing, I noticed that their implementations are different from the ones I've studied and come across:
Sklearn and gensim use raw counts as the TF and apply an L2 norm to the resulting vectors.
On the other side, the implementations I found normalize the term count, like
TF = term count / sum of all term counts in the document
My question is, what is the difference with their implementations? Do they give better results in the end, for clustering or other purposes?
EDIT (so the question is clearer):
What is the difference between normalizing the end result vs. normalizing the term count at the beginning?
I ended up understanding why the normalization is done at the end of the tf-idf calculations instead of doing it on the term frequencies.
After searching around, I noticed they use L2 normalization in order to facilitate cosine similarity calculations.
So, instead of using the formula dot(vector1, vector2) / (norm(vector1) * norm(vector2)) to get the similarity between 2 vectors, we can directly use the result of the fit_transform function: tfidf * tfidf.T, without the need to normalize, since the norm of the vectors is already 1.
I tried adding normalization to the term frequency as well, but it just gives the same results in the end once the whole vectors are normalized, so it ends up being a waste of time.
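For illustration, a short sketch with a few toy documents (the texts are made up): with the default norm='l2' every row of the tf-idf matrix has unit length, so the plain dot product with its transpose already gives the cosine similarities:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are animals"]

# Default norm='l2', so every row of the tf-idf matrix has unit length.
tfidf = TfidfVectorizer().fit_transform(docs)

# Because the rows are already unit-normalised, the dot product IS the cosine similarity.
cos_sim = (tfidf @ tfidf.T).toarray()
print(cos_sim.round(3))
```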
With scikit-learn, you can set the normalization as desired when calling TfidfTransformer() by setting norm to 'l1', 'l2', or None.
If you try this with None, you may get results similar to your own hand-rolled tf-idf implementation.
The normalization is typically used to reduce the effects of document length on a particular tf-idf weighting so that words appearing in short documents are treated on more equal footing to words appearing in much longer documents.

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
1. set the feature weights for each feature when "training" the KNN learner?
2. learn what the optimal weight values are, with or without pre-processing the data?
On a related note, I understand that generally KNN does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others" it usually means that a difference in feature two is worth, say, 10x a difference in the other coords. A simple way to achieve this is by multiplying coord #2 by its weight. So you put into the tree not the original coords but the coords multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.
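As a rough sketch of that multiplication trick (the weights and data below are made up for illustration), scaling each column by its weight before fitting is the same as using a weighted Euclidean distance inside the tree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical per-feature weights: feature 2 counts 10x as much as the others.
weights = np.ones(20)
weights[2] = 10.0

# Multiplying each column by its weight before fitting the KNN learner
# is equivalent to using a weighted Euclidean distance.
knn = make_pipeline(
    FunctionTransformer(lambda X: X * weights),
    KNeighborsClassifier(n_neighbors=5),
)

X = np.random.default_rng(0).normal(size=(100, 20))
y = (X[:, 2] > 0).astype(int)
knn.fit(X, y)
print(knn.predict(X[:5]))
```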
The answer to question two is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling (whitening) the data, which you can approximate with StandardScaler when the features are uncorrelated. Ideally you would want your metric to take the labels into account.