How to calculate mutual information in PyTorch (differentiable estimator) - pytorch

I am training a model with PyTorch in which I need to measure the degree of dependence between two tensors as part of my loss function (say, two tensors whose values are all very close to zero or one, e.g. v1 = [0.999, 0.998, 0.001, 0.98] and v2 = [0.97, 0.01, 0.997, 0.999]). I am trying to calculate mutual information, but I can't find any mutual information estimator implemented in PyTorch. Has such a thing been provided anywhere?

Mutual information is defined for distributions, not for individual points. So I will write the next part assuming that v1 and v2 are samples from a joint distribution p, and that you have n > 1 samples from p.
You want a method to estimate mutual information from samples, and there are many ways to do this. One of the simplest is a non-parametric estimator such as NPEET (https://github.com/gregversteeg/NPEET). It works with NumPy arrays, so you would have to convert your tensors from torch to NumPy, which also means gradients will not flow through the estimate. There are more involved parametric models for which you may be able to find an implementation in PyTorch (see https://arxiv.org/abs/1905.06922).
If you only have two vectors and want to compute a similarity measure, a dot product similarity would be more suitable than mutual information, as there is no distribution to estimate from.

It is not provided in official PyTorch, but here is a PyTorch implementation that uses kernel density estimation to approximate the histograms. Note that this method is fully differentiable.
Alternatively, you can use the differentiable histogram functions in Kornia to compute the MI metric yourself if you want more control.
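As a rough illustration of the kernel-density idea, here is a minimal sketch. It is not the implementation linked above and it does not use Kornia's API. It soft-assigns every sample to a set of histogram bins with a Gaussian kernel, builds a joint histogram from those soft assignments, and computes MI from it. The function name soft_mutual_information, the number of bins and the bandwidth sigma are assumptions made for this example, and the inputs are assumed to be 1-D tensors with values in [0, 1].

```python
import torch


def soft_mutual_information(x, y, num_bins=16, sigma=0.1, eps=1e-10):
    """Differentiable MI estimate (in nats) for two 1-D tensors in [0, 1].

    Hypothetical helper: bin count and bandwidth are illustrative choices.
    """
    centers = torch.linspace(0.0, 1.0, num_bins, dtype=x.dtype, device=x.device)
    # Gaussian kernel weight of each sample for each bin centre: (n, num_bins).
    wx = torch.exp(-0.5 * ((x.unsqueeze(1) - centers) / sigma) ** 2)
    wy = torch.exp(-0.5 * ((y.unsqueeze(1) - centers) / sigma) ** 2)
    # Normalise per sample so every sample contributes a total mass of one.
    wx = wx / (wx.sum(dim=1, keepdim=True) + eps)
    wy = wy / (wy.sum(dim=1, keepdim=True) + eps)
    # Soft joint histogram and its marginals.
    joint = wx.t() @ wy / x.shape[0]        # (num_bins, num_bins)
    px = joint.sum(dim=1, keepdim=True)     # (num_bins, 1)
    py = joint.sum(dim=0, keepdim=True)     # (1, num_bins)
    # MI = sum_ij p(i, j) * log(p(i, j) / (p(i) * p(j)))
    return (joint * (torch.log(joint + eps) - torch.log(px @ py + eps))).sum()


torch.manual_seed(0)
v1 = torch.rand(512, requires_grad=True)
v2_dependent = v1.detach().clone()   # fully dependent on v1
v2_independent = torch.rand(512)     # independent of v1

mi_dep = soft_mutual_information(v1, v2_dependent)
mi_ind = soft_mutual_information(v1, v2_independent)
mi_dep.backward()                    # gradients flow back to v1
```

The estimate depends on the chosen bandwidth and bin count, so it is better treated as a relative dependence score inside a loss than as an exact MI value.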

Related

Does Gpytorch use Analytic gradient or Automatic differentiation for training?

I am confused about how GPyTorch calculates the gradients with respect to the parameters of the model. For instance, let's say I am using ExactGP with a Gaussian likelihood, an RBF kernel, and a constant mean, and I use MLE (maximum likelihood estimation) to find the parameters of the model (mean, kernel parameters, and noise). One way to calculate the gradient with respect to the parameters is the analytical gradient, which means taking the derivative of the negative log-likelihood with respect to each parameter and deriving a closed-form expression for it. Another way is to use the automatic differentiation provided by PyTorch.
The GPyTorch authors mention in their paper "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration" that they use the analytical gradient, or at least this is what I understood from reading the paper. Am I correct? Also, I couldn't find the code where they implement the analytical gradient.
Could anyone help me understand this better, please?
The "automatic differentiation provided by PyTorch" does compute the analytic gradient (via back-propagation, note that there is no finite differencing or anything like that involved) - it just does so automatically.
https://github.com/cornellius-gp/gpytorch/discussions/1949#discussioncomment-2384471
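To make this concrete, here is a small sketch in plain PyTorch (it is not GPyTorch's code). It evaluates the negative log marginal likelihood of an exact GP with an RBF kernel, with the constant term dropped, and compares the gradient that autograd returns with the hand-derived expression 0.5 * tr((K_y^{-1} - alpha alpha^T) dK/dtheta), where alpha = K_y^{-1} y. The toy data, the fixed noise variance and the log-lengthscale parameterisation are assumptions made for the example.

```python
import torch

# Toy 1-D regression data (assumed for illustration).
X = torch.linspace(0, 1, 8, dtype=torch.float64).unsqueeze(1)
y = torch.sin(6 * X).squeeze(1)
noise = 0.1                                   # fixed noise variance
eye = torch.eye(len(X), dtype=torch.float64)
sq_dist = (X - X.t()) ** 2                    # pairwise squared distances

log_ell = torch.tensor(-1.0, dtype=torch.float64, requires_grad=True)


def rbf(log_ell):
    # K_ij = exp(-0.5 * d_ij^2 / ell^2) with ell = exp(log_ell)
    return torch.exp(-0.5 * sq_dist / torch.exp(log_ell) ** 2)


def nll(log_ell):
    Ky = rbf(log_ell) + noise * eye
    alpha = torch.linalg.solve(Ky, y)
    return 0.5 * y @ alpha + 0.5 * torch.logdet(Ky)


# Gradient via automatic differentiation.
nll(log_ell).backward()
autograd_grad = log_ell.grad.item()

# Gradient via the hand-derived formula.
with torch.no_grad():
    K = rbf(log_ell)
    Ky_inv = torch.linalg.inv(K + noise * eye)
    alpha = Ky_inv @ y
    dK = K * sq_dist / torch.exp(log_ell) ** 2   # dK / d(log_ell)
    manual_grad = 0.5 * torch.trace((Ky_inv - torch.outer(alpha, alpha)) @ dK).item()
```

Both numbers agree up to floating-point error, because back-propagation applies the chain rule exactly rather than approximating the derivative.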

In the scikit learn implementation of LDA what is the difference between transform and decision_function?

I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that those two functions give a different AUC score, and thus indicate that it is not a monotonic transformation as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.
LDA projects your multivariate data onto a 1D space. The projection is a linear combination of all your attributes (the columns of X), and the weight of each attribute is determined by maximizing the class separation. Subsequently, a threshold value in this 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space, x' = transform(X). decision_function(X) gives you, for each observation, the log-likelihood ratio of the positive class, log(P(y=1|x) / P(y=0|x)), so a value above zero means the positive class is predicted.
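The relationship between the two outputs can be checked on synthetic data. The sketch below is only an illustration for the binary case with scikit-learn's default LDA settings; the data and variable names are made up for the example. It shows that transform returns the projected coordinate, while decision_function returns a score whose sigmoid is the predicted probability of the positive class.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two Gaussian classes in 5 dimensions (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),
    rng.normal(1.0, 1.0, size=(100, 5)),
])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

projected = lda.transform(X)         # coordinates in the 1D discriminant space
scores = lda.decision_function(X)    # signed score, positive means class 1
proba = lda.predict_proba(X)[:, 1]   # P(y = 1 | x)

# In the binary case the score behaves like a log-odds value:
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))
predicted_from_scores = (scores > 0).astype(int)
```

Here projected has shape (200, 1) and scores has shape (200,); comparing sigmoid_of_scores with proba, and predicted_from_scores with lda.predict(X), shows how the score relates to the classifier's output.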

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two, they seem equivalent to me.
In order to use 'predict' you must call 'fit' first. So calling 'fit()' and then 'predict()' on the same data gives the same result as calling 'fit_predict()'. However, calling 'fit()' on its own is useful when you need the fitted model and its parameters, whereas 'fit_predict()' only gives you the labels obtained by running the model on the data.
fit_predict is usually used for unsupervised, transductive estimators.
Basically, fit_predict(x) is equivalent to fit(x).predict(x).
This might be very late to add an answer here, but someone might benefit from it in the future.
The reason I can see for having predict in KMeans and only fit_predict in DBSCAN is the following.
In KMeans you get centroids based on the number of clusters requested. So once you have fitted the model to your data points using fit(), you can use predict() to assign a new, single data point to a specific cluster.
In DBSCAN you don't have centroids. Clusters are formed based on the min_samples and eps (the maximum distance between two points for them to be considered neighbors) you define, and the algorithm returns cluster labels for all the data points. This behavior explains why there is no predict() method for a single new data point. The difference between fit() and fit_predict() was already explained by another user.
Another spatial clustering algorithm, hdbscan, gives us the option to predict using approximate_predict(). It's worth exploring.
Again, this is my understanding based on the source code I explored. Any experts are welcome to point out differences.
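The points made in the answers above can be sketched with a small example. The data is synthetic and the parameter values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two well-separated blobs (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])

# KMeans learns centroids, so it can label the training data in one step
# and also assign points it has never seen.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_fit_predict = kmeans.fit_predict(X)   # fit and label the training data
labels_predict = kmeans.predict(X)           # label the same data with the fitted model
new_point_label = kmeans.predict(np.array([[4.8, 5.1]]))

# DBSCAN has no centroids: it only labels the data it was fitted on.
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
has_predict = hasattr(dbscan, "predict")     # no predict() method is defined
```

The two KMeans label arrays are identical, the new point lands in the cluster of the second blob, and has_predict is False.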

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are the question "Difference in SGD classifier results and statsmodels results for logistic with l1" and the script
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
The best documentation is for Stata, http://www.stata.com/manuals14/rfracreg.pdf, and the references therein. I went through those references before Stata had fracreg, and the approach works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x the same difference in the other coordinates. A simple way to achieve this is to multiply coordinate #2 by its weight. So you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.
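The rescaling trick can be sketched with scikit-learn as follows. The data is synthetic and the feature_weights values are hypothetical, hand-picked numbers for the illustration; in practice they would have to be chosen or learned for your data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One informative feature and one large-scale nuisance feature (synthetic).
rng = np.random.default_rng(0)
n = 300
informative = rng.normal(size=n)
nuisance = 5.0 * rng.normal(size=n)
X = np.column_stack([informative, nuisance])
y = (informative > 0).astype(int)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# Hypothetical weights: emphasise feature 0, damp feature 1.
feature_weights = np.array([10.0, 0.1])

# Unweighted kNN on the raw coordinates.
plain = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
plain_acc = plain.score(X_test, y_test)

# Weighted kNN: multiply every coordinate by its weight before fitting,
# and apply exactly the same scaling to the query points.
weighted = KNeighborsClassifier(n_neighbors=5).fit(X_train * feature_weights, y_train)
weighted_acc = weighted.score(X_test * feature_weights, y_test)
```

Fitting on X * w with the plain Euclidean metric is the same as using the distance sqrt(sum_i w_i^2 (x_i - x'_i)^2) on the original coordinates, so the tree structure itself does not need to change.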
The answer to question two is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler. Ideally you would want your metric to take the labels into account.
