Sklearn LogisticRegression predict_proba result look weird - scikit-learn

I am quite new to SKlearn, Machine learning and its related. I have searched for a day but still cannot figure out the answer.
model = LogisticRegression(C=1)
model.fit(X, y)
print(model.predict_proba(X_test))
// output
[[ 1.01555532e-08 2.61926230e-01 7.37740949e-01 3.32810963e-04]]
I am quite confused whether the output is correct or not. When I tried on SVM with the same dataset, I got [[ 0.21071225 0.42531172 0.01024818 0.35372784]] which looks like probability and this is what I want. How can I make LogisticRegression model get the same probability style like SVM? What do I misunderstand?

This is just printing-style!
Have a look at this demo:
Code:
import numpy as np
p = np.array([[ 1.01555532e-08, 2.61926230e-01, 7.37740949e-01, 3.32810963e-04]])
print('p: ', p)
print('sum: ', p.sum()) # approximately a probability-distribution?
np.set_printoptions(suppress=True)
print('p: ', p) # same print as above
# but printing-style was changed before!
Output:
p: [[1.01555532e-08 2.61926230e-01 7.37740949e-01 3.32810963e-04]]
sum: 1.0000000001185532
p: [[0.00000001 0.26192623 0.73774095 0.00033281]]
Numpy uses a lot of code to decide on how to print your arrays, depending on the values inside! Here we changed something, using np.set_printoptions.
Your output looks different, because the output of your SVM-prediction has no small values, like the other one did!
suppress : bool, optional
Whether or not suppress printing of small floating point values using scientific notation (default False).
The use of scientific-notation also applies to python's types:
x = 0.00000001
print(x)
# 1e-08

Related

To change the output class label value of a predict function in OneclassSVM

When I use OneClassSVM, we confirm that the results obtained by estimator.predict (X_test) derive the results as 1 and -1, respectively. Each means an outlier value and an internal value. But what I want is to label it with different values, like 0,1 not -1,1. I thought I could give a specific argument to predict to do so, but I couldn't find the search result I wanted.
from sklearn import OneClassSVM
check = OneClassSVM(kernel='rbf', gamma='scale')
check.fit(X_train, y_train)
check.predict(X_test)
I used the above code.
There is no built-in function to specify the labels. However, you can perform this operation using np.where():
import numpy as np
pred = np.array([-1, 1, -1, 1])
np.where(pred==-1, 'outlier_value', 'internal_value')
Output:
array(['outlier_value', 'internal_value', 'outlier_value',
'internal_value'], dtype='<U14')

Python Surprise package gives different predictions for predict method vs manual compute using latent factors

I am using the surprise package for matrix factorization. Below is the code for the tutorial:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
algo.predict(str(196), str(302))
Out:
Prediction(uid='196', iid='301', r_ui=4, est=3.0740854315737174, details={'was_impossible': False})
However, when I use the SVD equation from its documentation and source code to manually compute the r_hat (r prediction):
algo.trainset.global_mean + algo.bi[301] + algo.bu[196] + np.dot(algo.qi[301], algo.pu[196])
Out:
2.817335384596893
The predictions does not match at all. Am I doing anything wrong or missing something?
I managed to figure it out. There's a difference between raw users/items and inner users/items. The former refers to the actual names of the users and items (e.g., user = John or a number like 10; items = Avengers or a number like 20) while the latter I assume to be the label encoded values given to the original users/items.
The hidden attributes of the trainset contain 4 attributes, _inner2raw_id_items, _inner2raw_id_users, _raw2inner_id_items, _raw2inner_id_users, which are dicts containing the conversion from one to the other.
If we call trainset._raw2inner_id_users and trainset._raw2inner_id_items, we get:
_raw2inner_id_users
{'196': 0,
'186': 1,
'22': 2, ...}
_raw2inner_id_items
{'242': 0,
'302': 1,
'377': 2, ...
'301': 404, ...}
Therefore, when we call:
algo.predict(str(196), str(302))
Out:
# different from original post as the prediction changes from run to run
Prediction(uid='196', iid='301', r_ui=None, est=3.2072618383879736, details={'was_impossible': False})
We are actually referring to the 0th user and 1st item. So when we use the manual computation using the latent factors, bias, and global mean according to the SVD equation, we should use these numbers instead:
algo.trainset.global_mean + algo.bi[404] + algo.bu[0] + np.dot(algo.qi[404], algo.pu[0])
Output:
3.2072618383879736

Why is ColumnTransformer producing a different output using the same code but different .csv files?

I am trying to finish this course tooth and nail with the hopes of being able to do this kind of stuff entry level by Spring time. This is my first post here on this incredible resource, and will do my best to conform to posting format. As a potential way to enforce my learning and commit to long term memory, I'm trying the same things on my own dataset of > 500 entries containing data more relevant to me as opposed to dummy data.
I'm learning about the data preprocessing phase where you fill in missing values and separate the columns into their respective X and Y to be fed into the models later on, if I understand correctly.
So in the course example, it's the top left dataset of countries. Then the bottom left is my own database of data I've been keeping for about a year on a multiplayer game I play. It has 100 or so characters you can choose from who are played between 5 different categorical roles.
Course data set (top left) personal dataset (bottom left
personal dataset column transformed results
What's up with the different outputs that are produced, with the only difference being the dataset (.csv file)? The course's dataset looks right; that first column of countries (textual categories) gets turned into binary vectors in the output no? Why is the output on my data set omitting columns, and producing these bizarre looking tuples followed by what looks like a random number? I've tried removing the np.array function, I've tried printing each output at each level, unable to see what's causing the difference. I expected on my dataset it would transform the characters' names into binary vectors (combinations of 1s/0s?) so the computer can understand the difference and map them to the appropriate results. Instead I'm getting that weird looking output I've never seen before.
EDIT: It turns out these bizarre number combinations are what's called a "sparse matrix." Had to do some research starting with the type() which yielded csr_array. If I understood what I Read correctly all the stuff inside takes up one column, so I just tried all rows/columns using [:] and I didn't get an error.
Really appreciate your time and assistance.
EDIT: Thanks to this thread I was able to make my way to the end of this data preprocessing/import/cleaning/ phase exercise, to feature scaling using my own dataset of ~ 550 rows.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# IMPORT RAW DATA // ASSIGN X AND Y RAW
df = pd.read_csv('datasets/winpredictor.csv')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# TRANSFORM CATEGORICAL DATA
ct = ColumnTransformer(transformers=\
[('encoder', OneHotEncoder(), [0, 1])], remainder='passthrough')
le = LabelEncoder()
X = ct.fit_transform(X)
y = le.fit_transform(y)
# SPLIT THE DATA INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(\
X, y, train_size=.8, test_size=.2, random_state=1)
# FEATURE SCALING
sc = StandardScaler(with_mean=False)
X_train[:, :] = sc.fit_transform(X_train[:, :])
X_test[:, :] = sc.transform(X_test[:, :])
First of all I encourage you to keep working with this course and for sure you will be a perfect Data Science in a few weeks.
Let's talk about your problem. It' seems that you only have a problem of visualization due to the big size of different types of "Hero" (I think you have 37 unique values).
I will explain you the results you have plotted. They programm only indicate you the values of the samples that are different of 0:
(0,10)=1 --> 0 refers to the first sample, and 10 refers to the 10th
value of the sample that is equal to 1.
(0,37)=5 --> 0 refers to the first sample, and 37 refers to the 37th, which is equal to 5.
etc..
So your first sample will be something like:
[0,0,0,0,0,0,0,0,0,0,1,.........., 5, 980,-30, 1000, 6023]
Which is the way to express the first sample of "Jakiro".
["Jakiro",5, 980,-30, 1000, 6023]
To sump up, the first 37 values refers to your OneHotEncoder, and last 5 refers to your initial numerical values.
So it seems to be correct, just a different way to plot the result due to the big size of classes of the categorical variable.
You can try to reduce the number of X rows (to 4 for example), and try the same process. Then you will have a similar output as the course.

How to use get_operation_by_name() in tensorflow, from a graph built from a different function?

I'd like to build a tensorflow graph in a separate function get_graph(), and to print out a simple ops a in the main function. It turns out that I can print out the value of a if I return a from get_graph(). However, if I use get_operation_by_name() to retrieve a, it print out None. I wonder what I did wrong here? Any suggestion to fix it? Thank you!
import tensorflow as tf
def get_graph():
graph = tf.Graph()
with graph.as_default():
a = tf.constant(5.0, name='a')
return graph, a
if __name__ == '__main__':
graph, a = get_graph()
with tf.Session(graph=graph) as sess:
print(sess.run(a))
a = sess.graph.get_operation_by_name('a')
print(sess.run(a))
it prints out
5.0
None
p.s. I'm using python 3.4 and tensorflow 1.2.
Naming conventions in tensorflow are subtle and a bit offsetting at first.
The thing is, when you write
a = tf.constant(5.0, name='a')
a is not the constant op, but its output. Names of op outputs derive from the op name by adding a number corresponding to its rank. Here, constant has only one output, so its name is
print(a.name)
# `a:0`
When you run sess.graph.get_operation_by_name('a') you do get the constant op. But what you actually wanted is to get 'a:0', the tensor that is the output of this operation, and whose evaluation returns an array.
a = sess.graph.get_tensor_by_name('a:0')
print(sess.run(a))
# 5

How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can i manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package I was able to create the same results I got from predict_proba().
It appears that sklearn uses a special softmax() function that differs from the usual softmax function in their code.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict(X) or use the sklearn function mentioned above to calculate them manually like this.
from sklearn.utils.extmath import softmax,
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
Replicate sklearn calcs (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P=A/(1+A)
P /= P.sum(axis=1).reshape((-1, 1))
seems slightly different than softmax calcs, or the UCLA stat example, but it works.

Resources