How does grad() work in PyTorch?

I need some conceptual clarity about the inputs of the PyTorch grad() function.
Please see the following code:
import torch
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 1/3*a**3 - 1/2*b**2
Here, I have defined three tensors, and I was trying to compute the derivative of Q w.r.t. a.
The following lines compute the first and second derivatives:
Q_a = torch.autograd.grad(Q.sum(), a, create_graph=True)[0]
Q_aa = torch.autograd.grad(Q_a.sum(), a, create_graph=True)[0]
print('Q_a =',Q_a.detach().numpy())
print('Q_aa =',Q_aa.detach().numpy())
The output is:
Q_a = [4. 9.]
Q_aa = [4. 6.]
I am wondering why I need to pass Q.sum() or Q_a.sum(), each of which is just one value, while the second argument a has two values.
>>> print(Q.sum())
tensor(-14.3333, grad_fn=<SumBackward0>)
>>> print(a)
tensor([2., 3.], requires_grad=True)
Can someone explain how Q.sum() helps in computing the correct gradient? Is it possible to compute the derivatives with just Q, not Q.sum()?

Well, your question is based on a wrong assumption. You said:
I was trying to compute the derivative of Q w.r.t a
No, you are not. In the code sample you provided, you are trying to compute the derivative of Q.sum() w.r.t. a, and those are different things.
The "derivative of Q w.r.t. a" is a matrix, called the Jacobian, whereas ...
the "derivative of Q.sum() w.r.t. a" is a vector, known as the gradient.
Both can be computed, and they are used in different places to achieve different things. It's your decision which one you want.
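If you do want the Jacobian itself, here is a minimal sketch reusing a, b, and Q from the question (torch.autograd.functional.jacobian is a stock PyTorch helper; the lambda is just an illustrative way to re-express Q as a function of a):
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)
Q = 1/3*a**3 - 1/2*b**2

# grad_outputs=torch.ones_like(Q) is what Q.sum() does implicitly:
# it left-multiplies the Jacobian by a vector of ones, collapsing
# the matrix into the gradient vector.
Q_a = torch.autograd.grad(Q, a, grad_outputs=torch.ones_like(Q))[0]
print(Q_a)  # tensor([4., 9.])

# The full Jacobian dQ/da; Q is elementwise in a, so it is diagonal.
J = torch.autograd.functional.jacobian(lambda t: 1/3*t**3 - 1/2*b**2, a)
print(J)  # tensor([[4., 0.], [0., 9.]])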

Related

What is the meaning of "value" in a node in sklearn decision tree plot_tree

I plotted my sklearn decision tree using the plot_tree function. The nodes have the following structure:
But I don't understand what value = [2417, 1059] means. Other nodes have other values. Thanks for explaining.
DecisionTreeClassifier:
value in a DecisionTreeClassifier is the class split in each node's samples.
Keep in mind it might also be weighted if you weighted your classes on the call to fit().
For example:
cw={0: 0.6495288248337029, 1: 2.1719184430027805}
Taking the True node, your actual (unweighted) class split is calculated as:
>>> [3819.229 / cw[0], 1216.274 / cw[1]]
[5880, 560]
And if it's not clear, your criterion is calculated on the weighted split:
>>> import math
>>> a, b = 3819.229, 1216.274
>>> ab = a + b
>>> (-(a / ab)*math.log2(a / ab)) - ((b / ab)*math.log2(b / ab))
0.7975914228753467
DecisionTreeRegressor:
value in a DecisionTreeRegressor is the value that the tree would predict for a new example falling in that node. If your criterion is MSE, you'll find that value is the mean of the target values of the samples in that node.
For example:
(Data: Seaborn's "dots" example dataset.)
A depth-1 regressor tree fitted on coherence to predict firing_rate. It's not a very useful tree, but it illustrates the idea.
Taking the True node, value is calculated as:
>>> import seaborn as sns
>>> data = sns.load_dataset("dots")
>>> value = data[data.coherence <= 19.2].firing_rate.mean()
>>> value
40.48326118418657
squared_error for that node is:
>>> ((data[data.coherence <= 19.2].firing_rate - value)**2).mean()
134.6504380931471
They indicate the number of samples per class that you have at that step.
For example, your picture shows that before the "hops<=5" split you have 2417 samples of class 0 and 1059 samples of class 1.
Notice that if you sum these two values, you obtain the same number (3476) as the "samples" parameter.
If the tree works, you will observe how the data splits better at every step. In a final leaf you will see clear values like [300, 2]; then you can say that all those samples are class 0.
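If you want to read these numbers programmatically rather than off the plot, here is a minimal sketch on synthetic data; clf.tree_.value is the standard scikit-learn attribute (note that, depending on your scikit-learn version, it holds weighted counts or, in recent versions, class fractions):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Node 0 is the root: its entry is the class split over all samples.
print(clf.tree_.value[0])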

Retrieve elements from a 3D tensor with a 2D index tensor

I am playing around with GPT2 and I have 2 tensors:
O: An output tensor shaped (B, S-1, V), where B is the batch size, S is the number of timesteps, and V is the vocabulary size. This is the output of a generative model and is softmaxed along the 2nd dimension.
L: A 2D tensor shaped (B, S-1), where each element is the index of the correct token at each timestep for each sample. These are basically the labels.
I want to extract the predicted probability of the corresponding correct token from tensor O based on tensor L, such that I end up with a 2D tensor shaped (B, S-1). Is there an efficient way of doing this apart from using loops?
For reference, I based my answer on this Medium article.
Essentially, your answer lies in torch.gather, assuming that both of your tensors are just regular torch.Tensors (or can be converted to one).
import torch
# Specify some arbitrary dimensions for now
B = 3
V = 6
S = 4
# Make example reproducible
torch.manual_seed(42)
# L necessarily has to be a torch.LongTensor, otherwise indexing will fail.
L = torch.randint(0, V, size=[B, S])
O = torch.rand([B, S, V])
# Now collect the results. The index tensor must have the same
# number of dimensions as O, hence the unsqueeze along dim 2.
X = torch.gather(O, dim=2, index=L.unsqueeze(dim=2))
# Make sure X has no "unnecessary" dimension
X = X.squeeze(dim=2)
It is a bit difficult to see whether this produces exactly the correct results, which is why I included a random seed that makes the example deterministic, so you can easily verify that it gets you the desired results. For clarification, one could also use a lower-dimensional tensor, for which it becomes clearer what exactly torch.gather does.
Note that torch.gather also allows you to collect multiple indices from the same row. If you instead have a multilabel example for which multiple values are correct, you could similarly use a tensor L of shape [B, S, number_of_correct_samples].
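For instance, a quick sanity check against an explicit loop, reusing B, S, O, L, and X from the snippet above:
# Every gathered entry should equal the probability of the labeled token.
expected = torch.empty(B, S)
for b in range(B):
    for s in range(S):
        expected[b, s] = O[b, s, L[b, s]]
assert torch.equal(X, expected)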

Gradient flow through torch.nn.Parameter()

I have a toy example
a = torch.ones(10)
b = torch.nn.Parameter(a,requires_grad=True)
c = (b**2).sum()
c.backward()
print(b.grad)
print(a.grad)
b.grad is calculated successfully, but a.grad is None. How do I make the gradient flow through torch.nn.Parameter? This example looks artificial, but I work with a class A derived from nn.Module whose parameters are initialized with outputs from some other Module B, and I want to make gradients flow through A's parameters to B's parameters.
@a_guest's answer is wrong. Using requires_grad=True here will change nothing, since torch.nn.Parameter is not tracked in the computation graph. You should do it the other way around: create a Parameter tensor, and then extract a raw tensor reference out of it:
a = torch.nn.Parameter(torch.ones((10,)), requires_grad=True)
b = a[:] # a hack to obtain a raw tensor that keeps the computation graph
b.retain_grad() # otherwise the backward pass will not store the gradient, since b is not a leaf
c = (b**2).sum()
c.backward()
print(b.grad)
print(a.grad)
Another approach would be to manually copy the content of tensor a into b.
You could fix this by making the copy explicit:
a = torch.ones((10,), requires_grad=True)
b = torch.nn.Parameter(a.clone(), requires_grad=True)
c = (b**2).sum()
c.backward()
print(b.grad)
print(a.grad)
Yet it is not very convenient since the copy must be done systematically.

Expectation Maximization algorithm (Gaussian Mixture Model): ValueError: the input matrix must be positive semidefinite

I am trying to implement the Expectation Maximization algorithm (Gaussian Mixture Model) on a data set data = [[x, y], ...]. I am using the mv_norm.pdf(data, mean, cov) function to calculate cluster responsibilities. But after calculating the new values of the covariance matrix, after 6-7 iterations the cov matrix becomes singular, i.e. the determinant of cov is 0 (a very small value), and hence it gives the errors
ValueError: the input matrix must be positive semidefinite
and
raise np.linalg.LinAlgError('singular matrix')
Can someone suggest any solution for this?
from scipy.stats import multivariate_normal as mv_norm  # assuming scipy's multivariate normal
import copy

#E-step: Compute cluster responsibilities, given cluster parameters
def calculate_cluster_responsibility(data, centroids, cov_m):
    pdfmain = [[] for i in range(0, len(data))]
    for i in range(0, len(data)):
        sum1 = 0
        pdfeach = [[] for m in range(0, len(centroids))]
        pdfeach[0] = 1/3. * mv_norm.pdf(data[i], mean=centroids[0], cov=[[cov_m[0][0][0], cov_m[0][0][1]], [cov_m[0][1][0], cov_m[0][1][1]]])
        pdfeach[1] = 1/3. * mv_norm.pdf(data[i], mean=centroids[1], cov=[[cov_m[1][0][0], cov_m[1][0][1]], [cov_m[1][1][0], cov_m[1][1][1]]])
        pdfeach[2] = 1/3. * mv_norm.pdf(data[i], mean=centroids[2], cov=[[cov_m[2][0][0], cov_m[2][0][1]], [cov_m[2][1][0], cov_m[2][1][1]]])
        sum1 += pdfeach[0] + pdfeach[1] + pdfeach[2]
        pdfeach[:] = [x / sum1 for x in pdfeach]
        pdfmain[i] = pdfeach
    global old_pdfmain
    if old_pdfmain == pdfmain:
        return
    old_pdfmain = copy.deepcopy(pdfmain)
    softcounts = [sum(i) for i in zip(*pdfmain)]
    calculate_cluster_weights(data, centroids, pdfmain, softcounts)
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
The problem is that your data lies in some manifold of dimension strictly smaller than the dimension of the input data. In other words, your data may lie on a circle, for example, while you have 3-dimensional data. As a consequence, when your method tries to estimate a 3-dimensional ellipsoid (covariance matrix) that fits your data, it fails, since the optimal one is a 2-dimensional ellipse (the third dimension is 0).
How to fix it? You will need some regularization of your covariance estimator. There are many possible solutions, all in the M step, not the E step, since the problem is with computing the covariance (both options are sketched below):
Simple solution: instead of doing something like cov = np.cov(X), add some regularizing term, like cov = np.cov(X) + eps * np.identity(X.shape[1]) with a small eps.
Use a nicer estimator, like the LedoitWolf estimator from scikit-learn.
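A minimal sketch of both options, assuming X is the (n_samples, n_features) array of points currently assigned to one cluster:
import numpy as np
from sklearn.covariance import LedoitWolf

def regularized_cov(X, eps=1e-6):
    # rowvar=False: samples are rows, features are columns.
    cov = np.cov(X, rowvar=False)
    # A small ridge on the diagonal keeps the matrix positive definite
    # even when the points lie on a lower-dimensional manifold.
    return cov + eps * np.identity(X.shape[1])

def ledoit_wolf_cov(X):
    # Shrinkage estimator; well-conditioned by construction.
    return LedoitWolf().fit(X).covariance_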
Initially, I've passed [[3,0],[0,3]] for each cluster covariance since expected number of clusters is 3.
This makes no sense; the covariance matrix values have nothing to do with the number of clusters. You can initialize it with anything more or less reasonable.

How to make sense the output of DecisionTreeClassifier in scikit-learn?

I'm learning ML and using scikit-learn to do a basic decision tree classification.
The values of the features are categorical, so I used DictVectorizer to convert the original feature values. Here's my code:
training_set # list of dicts representing the training set
labels # corresponding labels of the training set
vec = DictVectorizer()
vectorized = vec.fit_transform(training_set)
clf = tree.DecisionTreeClassifier()
clf.fit(vectorized.toarray(), labels)
with open("output.dot", "w") as output_file:
tree.export_graphviz(clf, out_file=output_file)
But I don't understand the output graph. It contains a tree with each node marked X[1] <= 0.5000 or something like that. What I expected was nodes marked with FEATURE_1 == VALUE_1, i.e. the un-vectorized information shown on the tree.
Is it possible?
UPDATE:
For example, FEATURE_1 has three possible values A, B, C, which are in turn vectorized into 0,0, 0,1, 1,0 respectively. What I want on the graph is FEATURE_1 == A instead of X[1] <= 0.5.
You can pass the feature names to the tree exporter method:
with open("output.dot", "w") as output_file:
tree.export_graphviz(clf, feature_names=vec.get_feature_names(),
out_file=output_file)
The classifier itself is unaware of the "meaning" of the data; it just deals with continuous numerical values. Hence the need to use a vectorizer to one-hot-encode the categorical variables as binary variables, which can safely be treated as continuous variables in the range [0, 1], with all the actual values being either 0 or 1 and nothing in between.
To understand how the DictVectorizer does the one-hot-encoding, have a look at the example snippet in the documentation.
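For instance, a minimal sketch of what that encoding looks like, using the FEATURE_1 placeholder from the question (in newer scikit-learn versions, get_feature_names is called get_feature_names_out):
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
X = vec.fit_transform([{"FEATURE_1": "A"}, {"FEATURE_1": "B"}, {"FEATURE_1": "C"}])
print(vec.get_feature_names())  # ['FEATURE_1=A', 'FEATURE_1=B', 'FEATURE_1=C']
print(X.toarray())  # one binary column per (feature, value) pair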
X[1] <= 0.5000 means X[1] = 0 if you have binary variables. If the inequality holds, the left branch is chosen; otherwise, the right branch. You can certainly parse the dot file and overwrite it (it's merely a text file, which is easy to process with regular expressions), but the way it is constructed initially is fixed like this, because by default a tree node is an inequality.
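For instance, a minimal sketch of such a rewrite, assuming the tree was exported with feature_names so the node tests look like "FEATURE_1=A <= 0.5" (the file name and pattern are illustrative, not from the question):
import re

with open("output.dot") as f:
    dot = f.read()
# "FEATURE=VALUE <= 0.5" is true exactly when the one-hot column is 0,
# i.e. when the feature does NOT take that value.
dot = re.sub(r"(\w+)=(\w+) <= 0\.5", r"\1 != \2", dot)
with open("output.dot", "w") as f:
    f.write(dot)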
When the values lie in a continuous interval, the learner will sort them and consider the intermediate values to find the split with the best Gini score.
This is reasonable, since in the continuous domain the chance of finding a test instance with a value of exactly, let's say, 3.1415 is zero; in such cases the classifier wouldn't know what to do.
I don't know about scikit-learn, but in WEKA, for instance, one can specify whether the values are continuous or discrete.
When you call export_graphviz, specify feature_names, which in this case are the column names of the independent variables DataFrame.
This will give you the column names in your output file, as below.
model = clf.fit(X, y)
dot_data = tree.export_graphviz(model, out_file=None, feature_names=X.columns.values.tolist(), class_names = None, filled=True, rounded=True, special_characters=True)
with open("output.dot", "w") as output_file:
output_file.write(dot_data)
