Element-wise calculation breaks autograd - python-3.x

I am using PyTorch to calculate the loss for a logistic regression (I know PyTorch can do this automatically, but I have to implement it myself). My function is defined below, but the cast to torch.tensor breaks autograd and gives me w.grad = None. I'm new to PyTorch, so I'm sorry.
logistic_loss = lambda X,y,w: torch.tensor([torch.log(1 + torch.exp(-y[i] * torch.matmul(w, X[i,:]))) for i in range(X.shape[0])], requires_grad=True)

Your post isn't very clear on details and this is a monster of a one-liner. I first reworked it to make a minimal, complete, verifiable example. Please correct me if I misunderstood your intentions and please do it yourself next time.
import torch

# unroll the one-liner to have an easier time understanding what's going on
def logistic_loss(X, y, w):
    elementwise = []
    for i in range(X.shape[0]):
        mm = torch.matmul(w, X[i, :])
        exp = torch.exp(-y[i] * mm)
        elementwise.append(torch.log(1 + exp))
    return torch.tensor(elementwise, requires_grad=True)

# I assume these are the expected dimensions of your input
X = torch.randn(5, 30, requires_grad=True)
y = torch.randn(5)
w = torch.randn(30)

# I assume you backpropagate from a reduced version of your loss,
# because you can't call .backward on multi-dimensional tensors
loss = logistic_loss(X, y, w)
loss.mean().backward()
print(X.grad)
The simplest solution to your problem is to replace torch.tensor(elementwise, requires_grad=True) with torch.stack(elementwise). You can think of torch.tensor as a constructor for entirely new tensors; if your tensor is the result of some mathematical expression, you should use operations like torch.stack or torch.cat instead.
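For reference, here is the unrolled function with that one-line fix applied:
def logistic_loss(X, y, w):
    elementwise = []
    for i in range(X.shape[0]):
        mm = torch.matmul(w, X[i, :])
        exp = torch.exp(-y[i] * mm)
        elementwise.append(torch.log(1 + exp))
    # stack assembles the result while keeping each element's autograd
    # history, whereas torch.tensor(...) created a brand-new leaf tensor
    return torch.stack(elementwise)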
That being said, this code is still wildly inefficient because of the manual loop over i. Instead, you could simply write
def logistic_loss_vectorized(X, y, w):
    mm = torch.matmul(X, w)
    exp = torch.exp(-y * mm)
    return torch.log(1 + exp)
which is mathematically equivalent, but will be much faster in practice, because it allows for better parallelization due to lack of explicit looping.
Note that there is still a numerical issue with this code: you're taking the logarithm of an exponential, and the intermediate result, called exp, can attain very large values, causing loss of precision. There are workarounds for that, which is why the loss functions provided by PyTorch are preferable.
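One such workaround, for reference: log(1 + exp(z)) is exactly softplus(z), and torch.nn.functional.softplus implements it in a numerically stable way. A sketch of the vectorized loss using it (my addition, not part of the original answer):
import torch.nn.functional as F

def logistic_loss_stable(X, y, w):
    mm = torch.matmul(X, w)
    # softplus(z) = log(1 + exp(z)); for large z it falls back to the
    # identity (see its threshold argument) instead of overflowing
    return F.softplus(-y * mm)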

Related

MLE for censored distributions of the exponential family

With the scipy.stats package it is straightforward to fit a distribution to data, e.g. scipy.stats.expon.fit() can be used to fit data to an exponential distribution.
However, I am trying to fit data to a censored/conditional distribution in the exponential family. In other words, using MLE, I am trying to find the maximum of

$$\ell(\theta) = \sum_{i=1}^{n} \left[ \log f(x_i; \theta) - \log F(c; \theta) \right],$$

where $f$ is a PDF of a distribution in the exponential family and $F$ is its corresponding CDF ($c$ being the censoring point, max_delays in the code below).
Mathematically, I have found that the log-likelihood function is convex in the parameter space $\theta$, so my assumption was that it should be relatively straightforward to apply the scipy.optimize.minimize function. Notice in the above log-likelihood that by taking $c \to \infty$ (so that $F(c; \theta) \to 1$) we obtain the traditional/uncensored MLE problem.
However, I find that even for simple distributions, e.g. the Nelder-Mead simplex algorithm does not always converge, or it converges but the estimated parameters are far off from the true ones. I have attached my code below. Note that one can choose a distribution, and that the code is generic enough to fit the loc and scale parameters, as well as the optional shape parameters (e.g. for a Beta or Gamma distribution).
My question is: what am I doing wrong that I obtain these bad estimates, or sometimes convergence issues? I have tried a few algorithms, but to my surprise none works easily, even though the problem is convex. Are there smoothness issues, so that I need to find a way to use the Jacobian and Hessian in a generic way for this problem?
Are there other methods to tackle this problem? Initially I thought of overriding the fit() function in the scipy.stats.rv class to take care of the censoring with the CDF, but that seemed quite cumbersome. Since the problem is convex, I would guess that using scipy's minimize function I should be able to get the results easily...
Comments and help are very welcome!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import expon, gamma, beta, norm
from scipy.optimize import minimize
from scipy.stats import rv_continuous as rv
def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars) -> float:
    return np.sum(np.log(func.pdf(delays, **func_pars)) - np.log(func.cdf(max_delays, **func_pars)))
def minimize_log_likelihood(func: rv, delays, max_delays):
    # Determine number of parameters to estimate (always 'loc', 'scale', sometimes shape parameters)
    n_pars = 2 + func.numargs
    # Initialize guess (loc, scale, [shapes])
    x0 = np.ones(n_pars)

    def wrapper(params, *args):
        func = args[0]
        delays = args[1]
        max_delays = args[2]
        loc, scale = params[0], params[1]
        # Set 'loc' and 'scale' parameters
        kwargs = {'loc': loc, 'scale': scale}
        # Add shape parameters, if any, to kwargs
        if func.shapes is not None:
            for i, s in enumerate(func.shapes.split(', ')):
                kwargs[s] = params[2 + i]
        return -log_likelihood(func=func, delays=delays, max_delays=max_delays, **kwargs)

    # Find maximum of log-likelihood (thus minimum of minus log-likelihood; see wrapper function)
    return minimize(wrapper, x0, args=(func, delays, max_delays), options={'disp': True},
                    method='nelder-mead', tol=1e-8)
# Test the code by sampling from a known distribution and retrieving its parameters
distribution = expon
dist_pars = {'loc': 0, 'scale': 4}
x = np.linspace(distribution.ppf(0.0001, **dist_pars), distribution.ppf(0.9999, **dist_pars), 1000)
res = minimize_log_likelihood(distribution, x, 10**8)
print(res)
I have found that the convergence is bad due to numerical inaccuracies. The best fix is to replace
np.log(func.pdf(x, **func_kwargs))
with
func.logpdf(x, **func_kwargs)
This leads to correct estimation of the parameters. The same holds for the CDF. The scipy documentation also indicates that the log-variants are numerically more accurate.
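Applied to the log_likelihood function from the question, a sketch of the corrected version:
def log_likelihood(func: rv, delays, max_delays=10**8, **func_pars) -> float:
    # logpdf/logcdf evaluate directly in log space, avoiding the loss
    # of precision of np.log(pdf(...)) when the density is very small
    return np.sum(func.logpdf(delays, **func_pars)
                  - func.logcdf(max_delays, **func_pars))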
This all works nicely with the Exponential, Normal, Gamma, and chi2 distributions. The Beta distribution still gives me issues, but I think this is again due to some (other) numerical inaccuracies, which I will analyse separately.

How to calculate geometric mean in a differentiable way?

How to calculate the geometric mean along a dimension using PyTorch? Some numbers can be negative. The function must be differentiable.
A known (reasonably) numerically-stable version of the geometric mean is:
import torch
def gmean(input_x, dim):
    log_x = torch.log(input_x)
    return torch.exp(torch.mean(log_x, dim=dim))
x = torch.Tensor([2.0] * 1000).requires_grad_(True)
print(gmean(x, dim=0))
# tensor(2.0000, grad_fn=<ExpBackward>)
This kind of implementation can be found, for example, in SciPy (see here), which is a fairly stable library.
The implementation above does not handle zeros and negative numbers. Some will argue that the geometric mean with negative numbers is not well-defined, at least when not all of them are negative.
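For exact zeros, one common workaround is to clamp the input before taking the log. A sketch (the eps value is an arbitrary choice of mine, not from the original answer):
def gmean_positive(input_x, dim, eps=1e-12):
    # clamping avoids log(0) = -inf; gradients still flow through clamp
    log_x = torch.log(input_x.clamp(min=eps))
    return torch.exp(torch.mean(log_x, dim=dim))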
torch.prod() helps:
import torch
x = torch.FloatTensor(3).uniform_().requires_grad_(True)
print(x)
y = x.prod() ** (1.0/x.shape[0])
print(y)
y.backward()
print(x.grad)
# tensor([0.5692, 0.7495, 0.1702], requires_grad=True)
# tensor(0.4172, grad_fn=<PowBackward0>)
# tensor([0.2443, 0.1856, 0.8169])
EDIT: what about
y = (x.abs() ** (1.0 / x.shape[0]) * x.sign()).prod()
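Wrapped into a function (a sketch based on the EDIT above; note that x.sign() contributes zero gradient everywhere, so the backward pass effectively flows only through the abs() branch):
def gmean_signed(x):
    # n-th root of each |x_i| with the sign re-applied, then the product
    return (x.abs() ** (1.0 / x.shape[0]) * x.sign()).prod()

x = torch.tensor([2.0, -8.0], requires_grad=True)
y = gmean_signed(x)  # -(2 * 8) ** 0.5 = -4.0
y.backward()
print(x.grad)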

Vectorized implementation of field-aware factorization

I would like to implement the field-aware factorization machine (FFM) in a vectorized way. In FFM, a prediction is made by the following equation:

$$\phi(\mathbf{w}, \mathbf{x}) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} \left( \mathbf{w}_{j_1, f_2} \cdot \mathbf{w}_{j_2, f_1} \right) x_{j_1} x_{j_2},$$

where the $\mathbf{w}$ are the embeddings, indexed by a feature and by the field of the other feature. For more info, see equation (4) in the FFM paper.
To do so, I have defined the following parameter:
import torch
W = torch.nn.Parameter(torch.Tensor(n_features, n_fields, n_factors), requires_grad=True)
Now, given an input x of size (batch_size, n_features), I want to be able to compute the previous equation. Here is my current (non-vectorized) implementation:
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
    for j in range(i + 1, n_features):
        temp1 = torch.mm(
            x[:, i].unsqueeze(1),
            W[i, feature2field[j], :].unsqueeze(0))
        temp2 = torch.mm(
            x[:, j].unsqueeze(1),
            W[j, feature2field[i], :].unsqueeze(0))
        total_inter += torch.sum(temp1 * temp2, dim=1)
Unsurprisingly, this implementation is horribly slow, since n_features can easily be as large as 1000! Note, however, that most of the entries of x are 0. All inputs are appreciated!
Edit:
If it can help in any ways, here are some implementations of this model in PyTorch:
pytorch-fm
ctr_model_zoo
Unfortunately, I cannot figure out exactly how they have done it.
Additional update:
I can now obtain the product of x and W in a more efficient way by doing:
temp = torch.einsum('ij, jkl -> ijkl', x, W)
Thus, my loop is now:
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
    for j in range(i + 1, n_features):
        temp1 = temp[:, i, feature2field[j], :]
        temp2 = temp[:, j, feature2field[i], :]
        total_inter += 0.5 * torch.sum(temp1 * temp2, dim=1)
It is, however, still too slow, since this loop runs for about 500,000 iterations.
Something that could potentially help you speed up the multiplication is using PyTorch sparse tensors.
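A minimal sketch of that idea, ignoring the field-aware indexing for the moment (the sizes and the dense factor matrix V are made up for illustration):
import torch

batch_size, n_features, n_factors = 32, 1000, 8
x = torch.rand(batch_size, n_features)
x[x < 0.99] = 0.0                        # mostly zeros, as in the question
V = torch.randn(n_features, n_factors)   # hypothetical factor matrix

# sparse @ dense only touches the non-zero entries of x
out = torch.sparse.mm(x.to_sparse(), V)  # shape: (batch_size, n_factors)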
Also something that might work would be the following:
Create n arrays, one for each feature i, holding its corresponding field factors in each row. E.g. for feature i = 0:
[W[0, feature2field[0], :],
 W[0, feature2field[1], :],
 ...
 W[0, feature2field[n], :]]
Then calculate the product of those arrays, let's call them F, with X:
R[i] = F[i] * X
Each R[i] would then hold the result of multiplying F[i] with X.
Next, you would multiply each R[i] with its transpose:
R[i] = R[i] * R[i].T
Now you can do the summation in a loop like before
for i in range(n_features):
    total_inter += torch.sum(R[i], dim=1)
Please take this with a grain of salt as I haven't tested it. In any case, I think it will point you in the right direction.
One problem that might occur is in the multiplication with the transpose, where each element is also multiplied by itself and then added to the sum. I don't think it will affect the classifier, but in any case you can set the diagonal and the elements above it to 0.
Also, though minor, please move the first unsqueeze operation outside of the nested for loop, as sketched below.
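A sketch of that hoisting applied to the original loop (untested):
# compute the unsqueezed columns of x once per feature,
# instead of once per (i, j) pair inside the nested loop
x_cols = [x[:, i].unsqueeze(1) for i in range(n_features)]
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
    for j in range(i + 1, n_features):
        temp1 = torch.mm(x_cols[i], W[i, feature2field[j], :].unsqueeze(0))
        temp2 = torch.mm(x_cols[j], W[j, feature2field[i], :].unsqueeze(0))
        total_inter += torch.sum(temp1 * temp2, dim=1)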
I hope it helps.

pytorch: how to directly find gradient w.r.t. loss

In Theano, it was very easy to get the gradient of some variable w.r.t. a given loss:
loss = f(x, w)
dl_dw = tt.grad(loss, wrt=w)
I get that PyTorch goes by a different paradigm, where you'd do something like:
loss = f(x, w)
loss.backward()
dl_dw = w.grad
The thing is, I might not want to do a full backward propagation through the graph - just along the path needed to get to w.
I know you can define Variables with requires_grad=False if you don't want to backpropagate through them. But then you have to decide that at the time of variable-creation (and the requires_grad=False property is attached to the variable, rather than the call which gets the gradient, which seems odd).
My question is: is there some way to backpropagate on demand (i.e. only backpropagate along the path needed to compute dl_dw, as you would in Theano)?
It turns out that this is really easy. Just use torch.autograd.grad.
Example:
import torch
import numpy as np
from torch.autograd import grad
x = torch.autograd.Variable(torch.from_numpy(np.random.randn(5, 4)))
w = torch.autograd.Variable(torch.from_numpy(np.random.randn(4, 3)), requires_grad=True)
y = torch.autograd.Variable(torch.from_numpy(np.random.randn(5, 3)))
loss = ((x.mm(w) - y)**2).sum()
(d_loss_d_w, ) = grad(loss, w)
assert np.allclose(d_loss_d_w.data.numpy(), (x.transpose(0, 1).mm(x.mm(w)-y)*2).data.numpy())
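For what it's worth, on PyTorch 0.4+ the Variable wrapper is no longer needed, and an equivalent sketch looks like:
import torch

x = torch.randn(5, 4)
w = torch.randn(4, 3, requires_grad=True)
y = torch.randn(5, 3)

loss = ((x.mm(w) - y) ** 2).sum()
# computes d(loss)/d(w) directly, without populating .grad on any tensor
(d_loss_d_w,) = torch.autograd.grad(loss, w)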
Thanks to JerryLin for answering the question here.

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in the case of svm.SVC, svm.LinearSVC, and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier clf_c. The logistic loss should then be (am I correct?):
def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is the number of data points and p is the number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        print('yes')
        s = 2. * y - 1
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l
I realized that LogisticRegression has predict_log_proba(), which gives you exactly that when the data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y + 1) / 2).astype(np.int8)]).mean() == logistic_loss(x, y, w, b, b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you would expect the classifier (here, LogisticRegression) to perform similarly (in terms of the loss function value) when the data is balanced and class_weight=None versus when the data is imbalanced and class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much appreciated. I tried going through the source code, but I am not a programmer and I got stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1: (y == 1).sum() / (y == -1).sum(),
                1: 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway, to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight

n_classes = 3
n_samples = 1000

X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
                           n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))

balanced_weights = n_samples / (n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and LinearSVC the docstring is pretty clear:
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the SVM to classify it properly.
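To illustrate with a sketch based purely on that docstring (not on the liblinear/libsvm internals):
from sklearn.svm import SVC

# class 0 is trained with an effective C of 1.0 * 100 = 100, so its
# misclassifications are penalized 100x harder than those of class 1
clf = SVC(C=1.0, class_weight={0: 100, 1: 1})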
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence methods such as predict_proba directly. They change its output because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.
