sklearn: LogisticRegression - predict_proba(X) - calculation - scikit-learn

I was wondering if someone could take a quick look at the following code snippet and point me toward my misunderstanding in calculating the probability of a sample for each class in the model, and the related bug in my code. I tried to manually reproduce the results of the sklearn function lm.predict_proba(X); sadly the results differ, so I made a mistake somewhere.
I think the bug is in part d) of the code walkthrough below. Maybe in the math, but I cannot see why.
a) Creating and training a logistic regression model ( works fine )
lm = LogisticRegression(random_state=413, multi_class='multinomial', solver='newton-cg')
lm.fit(X, train_labels)
b) Saving coefficient and bias ( works fine )
W = lm.coef_
b = lm.intercept_
c) Using lm.predict_proba(X) ( works fine)
def reshape_single_element(x, num):
    singleElement = x[num]
    nx, ny = singleElement.shape
    return singleElement.reshape((1, nx * ny))
select_image_number = 6
X_select_image_data = reshape_single_element(train_dataset, select_image_number)
Y_probabilities = lm.predict_proba(X_select_image_data)
Y_pandas_probabilities = pd.Series(Y_probabilities[0], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
print"estimate probabilities for each class: \n" ,Y_pandas_probabilities , "\n"
print"all probabilities by lm.predict_proba(..) sum up to ", np.sum(Y_probabilities) , "\n"
The output was:
estimate probabilities for each class:
a 0.595426
b 0.019244
c 0.001343
d 0.004033
e 0.017185
f 0.004193
g 0.160380
h 0.158245
i 0.003093
j 0.036860
dtype: float64
all probabilities by lm.predict_proba(..) sum up to 1.0
d) Manually performing the calculation done by lm.predict_proba ( no error/warning, but results are not the same )
manual_calculated_probabilities = []
for select_class_k in range(0, 10):  # a=0, b=1, c=2, ...
    z_for_class_k = np.sum(W[select_class_k] * X_select_image_data) + b[select_class_k]
    p_for_class_k = 1 / (1 + math.exp(-z_for_class_k))
    manual_calculated_probabilities.append(p_for_class_k)
print "formula: ", manual_calculated_probabilities , "\n"
def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e = np.exp(x)
    dist = e / np.sum(np.exp(x), axis=0)
    return dist
abc = softmax(manual_calculated_probabilities)
print "softmax:" , abc
The output was:
formula: [0.9667598370531315, 0.48453459121301334, 0.06154496922245115, 0.16456194859398865, 0.45634781280053394, 0.16999340794727547, 0.8867996361191054, 0.8854473986336552, 0.13124464656251109, 0.642913996162282]
softmax: [ 0.15329642  0.09464644  0.0620015   0.0687293   0.0920159   0.06910361  0.14151607  0.14132483  0.06647715  0.11088877]
Softmax was used because of a comment in logistic.py on GitHub:
For a multi_class problem, if multi_class is set to be "multinomial" the softmax function is used to find the predicted probability of each class.
Note:
print "shape of X: " , X_select_image_data.shape
print "shape of W: " , W.shape
print "shape of b: " , b.shape
shape of X: (1, 784)
shape of W: (10, 784)
shape of b: (10,)
I found a very similar question here, but sadly I could not adapt it to my code so that the predictions matched. I tried many different combinations for calculating the variables z_for_class_k and p_for_class_k, but without success in reproducing the values from predict_proba(X).

I think the problem is with
p_for_class_k = 1/ (1 + math.exp(-z_for_class_k))
1 / (1 + exp(-logit)) is a simplification that works only for binary problems.
The real equation, before being simplified, looks like this (the softmax over all class logits):
p_for_classA =
    exp(logit_classA) /
    [exp(logit_classA) + exp(logit_classB) + ... + exp(logit_classK)]
In other words, when calculating a probability for a specific class, you must incorporate ALL the weights and biases from the other classes as well into your formula.
I didn't have the data to test this out, but hopefully this points you in the right direction.

change
p_for_class_k = 1/ (1 + math.exp(-z_for_class_k))
manual_calculated_probabilities.append(p_for_class_k)
to
manual_calculated_probabilities.append(z_for_class_k)
In other words, the input to softmax should be the z values rather than the p values, in your notation (see multinomial logistic regression).
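Putting both answers together, a minimal sketch of the corrected manual computation (my addition, untested on the original data; it reuses W, b and X_select_image_data from the question):
import numpy as np

# one logit z_k per class, shape (10,)
z = np.dot(W, X_select_image_data.ravel()) + b
# softmax over the class logits; subtracting max(z) keeps exp() numerically stable
e = np.exp(z - np.max(z))
probs = e / e.sum()
# probs should now match lm.predict_proba(X_select_image_data)[0]
print(probs)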

I was able to replicate the method lr.predict_proba by doing the following:
>>> sigmoid = lambda x: 1/(1+np.exp(-x))
>>> sigmoid(lr.intercept_+np.sum(lr.coef_*X.values, axis=1))
Assuming that X is a numpy array and lr is a fitted sklearn LogisticRegression.
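Note (my addition): this sigmoid-based replication matches predict_proba only in the binary case; with multi_class='multinomial' and more than two classes, the softmax form above is needed. A minimal sketch under that binary assumption:
import numpy as np

# binary case: P(y=1 | x) is the sigmoid of the decision function
z = lr.intercept_ + np.sum(lr.coef_ * X, axis=1)
p1 = 1 / (1 + np.exp(-z))
# p1 should match lr.predict_proba(X)[:, 1]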

Related

Could someone please help me with the use of sklearn.metrics.roc_curve and what the function expects?

I am trying to construct two numpy ndarrays from a networkx graph's data structures: one that looks like a list of tuples and one that is a simple list. I would like to make a ROC curve where the validation set is the above-mentioned list of tuples of the edges of a graph G, which I was trying to construct like this:
x = []
for i in G_orig.nodes():
    for j in G_orig.nodes():
        if j > i and (i, j) not in G.edges():
            if (i, j) in G_orig.edges():
                x.append((i, j, 1))
            else:
                x.append((i, j, 0))
y_validation = np.array(x)
It looks something like this: [(1, 344, 1), (2, 23, 0), (3, 5, 0), ...... (333, 334, 1)].
The first 2 numbers mean 2 nodes, the 3rd one means whether there is an edge between them. 1 means edge, 0 means no edge.
Then roc_curve expects something called y_score in the documentation. I have a list for that, made with a method called preferential attachment, so I named it pref_att_types. I tried to make a numpy array of it in case roc_curve only accepts that.
positive_class_predicted_probabilities = np.array(pref_att_types)
Then I just did what we used in class:
FPRs, TPRs, thresholds = roc_curve(y_validation,
                                   positive_class_predicted_probabilities,
                                   pos_label=1)
It is literally just Ctrl+C / Ctrl+V, but it raises a ValueError: 'multiclass-multioutput format is not supported'. Please note that I am not a programmer, just someone studying to be a mathematics analyst.
The first argument, y_true, needs to be just the true labels, in this case the 0/1 values without the pairs of nodes. Just make sure that the indices of the arrays y_validation and pref_att_types match.
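A minimal sketch of that fix (my addition; it assumes x and pref_att_types as built in the question):
import numpy as np
from sklearn.metrics import roc_curve

# keep only the 0/1 label column; drop the node pairs
y_true = np.array([label for (_, _, label) in x])
y_score = np.array(pref_att_types)
FPRs, TPRs, thresholds = roc_curve(y_true, y_score, pos_label=1)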
The code below draws the ROC curves for two RF models:
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# create arrays of probabilities
y_test_predict1_probaRF = rf1.predict_proba(X_test)
y_test_predict2_probaRF = rf2.predict_proba(X_test)
RFfpr1, RFtpr1, thresholds = roc_curve(y_test, y_test_predict1_probaRF[:, 1])
RFfpr2, RFtpr2, thresholds = roc_curve(y_test, y_test_predict2_probaRF[:, 1])
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], "k--")
    plt.axis([0, 1, 0, 1])
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
plot_roc_curve(RFfpr1, RFtpr1, "RF1")
plot_roc_curve(RFfpr2, RFtpr2, "RF2")
plt.legend()
plt.show()

Neural Network Python Gradient Descent

I am new to machine learning and trying to understand it (self-learning). So I grabbed a book (this one, if you are interested: https://www.amazon.com/Neural-Networks-Unity-Programming-Windows/dp/1484236726) and started to read the first chapter. While reading, there were a few things I did not understand, so I did some research online.
However, I still have trouble with a few points that I cannot understand even after so much reading and research:
How are we calculating l2_delta and l1_delta? (marked with "# what is this part doing?" in the code below)
How does gradient descent relate? (I looked up the formula and tried to read a bit about it, but I could not relate the one-line formula to the code below.)
Is this a network with 3 layers (layer 1: 3 input nodes, layer 2: not sure, layer 3: 1 output node)?
Neural Network Full Code:
# trying to write my first neural network!
import numpy as np

# activation function (sigmoid, maps values between 0 and 1)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid, expressed in terms of the sigmoid output x
def derivative(x):
    return x * (1 - x)

# initialize input (4 training samples (rows), 3 features (cols))
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
# initialize output for training data (4 training samples (rows), 1 output each (col))
Y = np.array([[0],[1],[1],[0]])
np.random.seed(1)
# synapses (weight matrices)
syn0 = 2 * np.random.random((3,4)) - 1
syn1 = 2 * np.random.random((4,1)) - 1
for iter in range(60000):
    # layers
    l0 = X
    l1 = sigmoid(np.dot(l0, syn0))
    l2 = sigmoid(np.dot(l1, syn1))
    # error
    l2_error = Y - l2
    if iter % 10000 == 0:  # only print error every 10000 steps to limit output
        print("Error L2: " + str(np.mean(np.abs(l2_error))))
    # what is this part doing?
    l2_delta = l2_error * derivative(l2)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * derivative(l1)
    if iter % 10000 == 0:  # only print error every 10000 steps to limit output
        print("Error L1: " + str(np.mean(np.abs(l1_error))))
    # update weights (derivative with respect to the cost function)
    syn1 = syn1 + l1.T.dot(l2_delta)
    syn0 = syn0 + l0.T.dot(l1_delta)
print(l2)
Thank you!
In general, the layerwise computations (hence the notation l1 and l2 above) are simply the dot product of a vector $x \in \mathbb{R}^n$ with a weight matrix, followed by applying the sigmoid function to each component.
Gradient descent: imagine, in two dimensions, the graph of $f(x) = x^2$, and suppose we don't know how to find its minimum. Gradient descent basically evaluates $f'(x)$ at various points and checks whether $f'(x)$ is close to zero; each step moves in the direction opposite the sign of $f'(x)$, i.e. downhill toward the minimum.
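To connect this to the two lines the question asks about: a sketch of the chain rule behind them (my addition; it uses the question's convention that derivative(x) = x * (1 - x) is the sigmoid derivative written in terms of the sigmoid's output):
# With error E = 0.5 * (Y - l2)^2 and l2 = sigmoid(l1 . syn1),
# backpropagation applies the chain rule layer by layer:
l2_delta = (Y - l2) * derivative(l2)   # negative gradient of E w.r.t. l2's pre-sigmoid input
l1_error = l2_delta.dot(syn1.T)        # how much each l1 unit contributed to that error
l1_delta = l1_error * derivative(l1)   # same chain rule, one layer down
# The weight updates are one gradient-descent step (learning rate folded in as 1):
syn1 += l1.T.dot(l2_delta)             # step along the negative gradient w.r.t. syn1
syn0 += l0.T.dot(l1_delta)             # step along the negative gradient w.r.t. syn0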

Vectorized implementation of field-aware factorization

I would like to implement the field-aware factorization machine (FFM) in a vectorized way. In FFM, a prediction is made by the following pairwise-interaction equation (equation (4) in the FFM paper, reconstructed here to match the code below):
$\phi(\mathbf{w}, \mathbf{x}) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (\mathbf{w}_{j_1,f_2} \cdot \mathbf{w}_{j_2,f_1}) \, x_{j_1} x_{j_2}$
where the $\mathbf{w}$ are embeddings that depend on the feature and on the field of the other feature. For more info, see equation (4) in FFM.
To do so, I have defined the following parameter:
import torch
W = torch.nn.Parameter(torch.Tensor(n_features, n_fields, n_factors), requires_grad=True)
Now, given an input x of size (batch_size, n_features), I want to be able to compute the previous equation. Here is my current (non-vectorized) implementation:
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
    for j in range(i + 1, n_features):
        temp1 = torch.mm(
            x[:, i].unsqueeze(1),
            W[i, feature2field[j], :].unsqueeze(0))
        temp2 = torch.mm(
            x[:, j].unsqueeze(1),
            W[j, feature2field[i], :].unsqueeze(0))
        total_inter += torch.sum(temp1 * temp2, dim=1)
Unsurprisingly, this implementation is horribly slow since n_features can easily be as large as 1000! Note however that most of the entries of x are 0. All inputs are appreciated!
Edit:
If it can help in any ways, here are some implementations of this model in PyTorch:
pytorch-fm
ctr_model_zoo
Unfortunately, I cannot figure out exactly how they have done it.
Additional update:
I can now obtain the product of x and W in a more efficient way by doing:
temp = torch.einsum('ij, jkl -> ijkl', x, W)
Thus, my loop is now:
total_inter = torch.zeros(x.shape[0])
for i in range(n_features):
    for j in range(i + 1, n_features):
        temp1 = temp[:, i, feature2field[j], :]
        temp2 = temp[:, j, feature2field[i], :]
        total_inter += 0.5 * torch.sum(temp1 * temp2, dim=1)
It is, however, still too slow, since this loop runs for about 500,000 iterations.
Something that could potentially help you speed up the multiplication is using pytorch sparse tensors.
Also something that might work would be the following:
Create n arrays, one for each feature i, each holding that feature's corresponding field factors in its rows, e.g. for feature i = 0:
[W[0, feature2field[0], :],
 W[0, feature2field[1], :],
 ...
 W[0, feature2field[n], :]]
Then calculate the multiplication of those arrays, let's call them F, with X:
R[i] = F[i] * X
So each element of R holds the result (an array) of multiplying F[i] with X.
Next you would multiply each R[i] with its transpose
R[i] = R[i] * R[i].T
Now you can do the summation in a loop like before
for i in range(n_features):
    total_inter += torch.sum(R[i], dim=1)
Please take this with a grain of salt as I haven't tested it. In any case, I think it will point you in the right direction.
One problem that might occur is in the multiplication with the transpose, where each element is also multiplied with itself and then added to the sum. I don't think it will affect the classifier, but in any case you can zero out the diagonal and the elements above it.
Also, a minor point: please move the first unsqueeze operation outside of the nested for loop.
I hope it helps.
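Building on the asker's einsum update, here is one way to remove the double loop entirely (my sketch, not from the original answers; it assumes feature2field can be turned into a LongTensor of length n_features, and that a (batch, n, n, k) intermediate fits in memory):
import torch

# toy sizes for illustration (hypothetical; the question has n_features ~ 1000)
n_features, n_fields, n_factors, batch_size = 8, 3, 4, 5
W = torch.nn.Parameter(torch.randn(n_features, n_fields, n_factors))
x = torch.randn(batch_size, n_features)
fields = torch.randint(0, n_fields, (n_features,))  # stands in for feature2field

# temp[b, i, f, k] = x[b, i] * W[i, f, k], exactly as in the asker's update
temp = torch.einsum('ij,jkl->ijkl', x, W)

# gather the field of the *other* feature: T1[b, i, j, k] = x_i * W[i, field(j), k]
T1 = temp[:, :, fields, :]                     # (batch, n, n, k)
# pairwise dot products M[b, i, j] = <x_i * W[i, f_j], x_j * W[j, f_i]>
M = (T1 * T1.transpose(1, 2)).sum(dim=-1)      # (batch, n, n)

# keep only the i < j pairs, matching the original double loop
mask = torch.triu(torch.ones(n_features, n_features), diagonal=1)
total_inter = (M * mask).sum(dim=(1, 2))       # (batch,)
Since most entries of x are 0, restricting the computation to the nonzero columns of x (or using sparse tensors, as suggested above) would further reduce the memory cost of the (batch, n, n, k) intermediate.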

Function implementation in keras tensorflow

Suppose I've got three tensors T, r, E of shape (n,1), and I want to implement a function that calculates the sum over (i, j) of (T[j] < T[i]) * (r[j] > r[i]) * E[j].
How can I proceed?
This is what I've got:
#tensor examples
T=K.constant([1,4,5])
r=K.constant([.8,.3,.7])
E=K.constant([1,0,1])
# cartesian product of T to compare element wise
c = tf.stack(tf.meshgrid(T, T, indexing='ij'), axis=-1)
cartesian_T = tf.reshape(c, (-1, 2))
# cartesian product of r to compare element wise
r = tf.stack(tf.meshgrid(r, r, indexing='ij'), axis=-1)
cartesian_r = tf.reshape(r, (-1, 2))
# TO DO:
# compare the two columns in cartesian T and cast to integer 1/0 if
# second column in T less/greater than first column in T => return tensor
# compare the two columns in cartesian r and cast to integer 1/0 if
# second column in r greater/less than first column in r => return tensor
# multiply previous tensors by a broadcasted version of E, then do K.sum()
Do you think I'm on the right track? What would you suggest to implement this?
Try:
import keras.backend as K
import tensorflow as tf
T=K.constant([1,4,5])
r=K.constant([.8,.3,.7])
E=K.constant([1,0,1])
T_new = tf.less(T[tf.newaxis,:],T[:,tf.newaxis])
r_new = tf.greater(r[tf.newaxis,:],r[:,tf.newaxis])
E_row,_ = tf.meshgrid(E, E)
result = tf.reduce_sum(tf.boolean_mask(E_row,tf.logical_and(T_new,r_new)))
with tf.Session() as sess:
    print(sess.run(result))
# prints: 2.0
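A quick numpy sanity check of the same logic (my addition; it mirrors the broadcast comparisons above). For these inputs, only the pairs (i, j) = (1, 0) and (2, 0) satisfy both conditions, so the sum is E[0] + E[0] = 2:
import numpy as np

T = np.array([1, 4, 5]); r = np.array([.8, .3, .7]); E = np.array([1, 0, 1])
# mask[i, j] is True when T[j] < T[i] and r[j] > r[i]
mask = (T[np.newaxis, :] < T[:, np.newaxis]) & (r[np.newaxis, :] > r[:, np.newaxis])
# E broadcast along rows plays the role of E_row: entry [i, j] is E[j]
print(np.broadcast_to(E, mask.shape)[mask].sum())  # 2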

tensorflow map_fn on single value Tensor

Is it possible to run map_fn on a tensor with a single value?
The following works:
import tensorflow as tf
a = tf.constant(1.0, shape=[3])
tf.map_fn(lambda x: x+1, a)
#output: [2.0, 2.0, 2.0]
However this does not:
import tensorflow as tf
b = tf.constant(1.0)
tf.map_fn(lambda x: x+1, b)
#expected output: 2.0
Is it possible at all?
What am I doing wrong?
Any hints will be greatly appreciated!
Well, I see you accepted an answer, which correctly states that tf.map_fn() applies a function to the elements of a tensor, and a scalar tensor has no elements. But it's not impossible to do this for a scalar tensor; you just have to tf.reshape() it before and after, like this code (tested):
import tensorflow as tf
b = tf.constant(1.0)
if () == b.get_shape():
    c = tf.reshape(tf.map_fn(lambda x: x + 1, tf.reshape(b, (1,))), ())
else:
    c = tf.map_fn(lambda x: x + 1, b)
#expected output: 2.0
with tf.Session() as sess:
    print(sess.run(c))
will output:
2.0
as desired.
This way you can factor this into an agnostic function that can take both scalar and non-scalar tensors as an argument.
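For instance, a small wrapper along those lines (my sketch, following the answer's reshape trick; map_fn_any is a hypothetical name):
import tensorflow as tf

def map_fn_any(fn, t):
    # scalar tensors have an empty shape: lift to a length-1 vector, map, drop back
    if t.get_shape() == ():
        return tf.reshape(tf.map_fn(fn, tf.reshape(t, (1,))), ())
    return tf.map_fn(fn, t)

# works for both cases:
a = tf.constant(1.0, shape=[3])
b = tf.constant(1.0)
c1 = map_fn_any(lambda x: x + 1, a)  # [2.0, 2.0, 2.0]
c2 = map_fn_any(lambda x: x + 1, b)  # 2.0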
No, this is not possible. As you probably saw it throws an error:
ValueError: elems must be a 1+ dimensional Tensor, not a scalar
The point of map_fn is to apply a function to each element of a tensor, so it makes no sense to use this for a scalar (single-element) tensor.
As to "what you are doing wrong": This is difficult to say without knowing what you're trying to achieve.
