Saving LinearRegression (from sklearn.linear_model) coefficients in a list - python-3.x

I'm stuck on a problem that should be very simple. I'm running four simple linear regressions (changing only the x variable) and I need to store both the intercept and the slope coefficient in a list, for all regressions.
I thought it would be very easy, but it seems I'm not good at handling lists. The list ends up holding the same coefficients for all four models.
This is my code:
from sklearn.linear_model import LinearRegression
variables = ['Number_of_likes','Number_of_comments','Number_of_followers','Number_of_repplies']
models = [None] * 4
lm = LinearRegression()
#Fit regressions
models[0] = lm.fit(X[[variables[0]]],y)
models[1] = lm.fit(X[[variables[1]]],y)
models[2] = lm.fit(X[[variables[2]]],y)
models[3] = lm.fit(X[[variables[3]]],y)
When I look at "models", it seems to be storing the results only for the last regression, in all four slots.
I hope I explained my problem well.

lm.fit() modifies the existing instance and returns it; it does not create a new copy. The models list therefore stores four references to the same object, which yields the behavior you are seeing.
To solve this, you need to create a new LinearRegression every time you want to fit it to a new input, rather than re-using the same model. For example:
models = []  # just an empty list; we will append our models to it one by one
for var in variables:
    lm = LinearRegression()  # create a new object
    lm.fit(X[[var]], y)      # fit it
    models.append(lm)        # add it to the list
Or, a version closer to your original code (using sklearn.base.clone):
from sklearn.base import clone # to create a new copy of the lm object
lm = LinearRegression()
#Fit regressions
models[0] = clone(lm).fit(X[[variables[0]]],y)
models[1] = clone(lm).fit(X[[variables[1]]],y)
models[2] = clone(lm).fit(X[[variables[2]]],y)
models[3] = clone(lm).fit(X[[variables[3]]],y)
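To see both the aliasing and the fix in one place, here is a minimal sketch. It assumes synthetic NumPy data in place of the question's X and y (which are not shown), so columns are selected with X[:, [j]] rather than X[[name]]; the names shared, models and results are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: the question's real X and y are not shown).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=50)

# Re-using one estimator: fit() returns the estimator itself, so every slot
# points at the same object, which holds only the last fit.
lm = LinearRegression()
shared = [lm.fit(X[:, [j]], y) for j in range(X.shape[1])]
all_same_object = all(m is lm for m in shared)

# One fresh estimator per regression keeps the fits separate.
models = [LinearRegression().fit(X[:, [j]], y) for j in range(X.shape[1])]

# Store (intercept, slope) pairs in a plain list.
results = [(float(m.intercept_), float(m.coef_[0])) for m in models]
for j, (intercept, slope) in enumerate(results):
    print(f"column {j}: intercept={intercept:.3f}, slope={slope:.3f}")
```

With a pandas DataFrame the same pattern should work by looping over the column names instead of column positions.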

Related

cross Validation in Sklearn using a Custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = custom_cv, n_jobs = -1, verbose = 0,scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X is the feature matrix and y is the vector of labels.
Example: Suppose that I only have one hyperparameter alpha that belongs to the set {1,2,3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients of a regression) using the samples associated with listTrain and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3. Then I choose the alpha that minimizes the error.
EDIT: Actual answer to the question. Try passing the cv argument a generator of the indices:
def index_gen(listTrain, listTest):
    yield listTrain, listTest
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest), n_jobs=-1,
                           verbose=0, scoring=errorType)
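A plain list holding one (train, test) tuple is also an iterable of splits, so it can be passed as cv in the same way. The sketch below assumes synthetic data, a LogisticRegression estimator and a small grid over C; none of these come from the question, where the model, param_grid and errorType are not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption, stands in for X and y).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical hold-out partition: first 150 rows train, last 50 rows evaluate.
indices = np.arange(len(y))
list_train, list_test = indices[:150], indices[150:]

# One (train, test) pair means exactly one split per hyperparameter value.
custom_cv = [(list_train, list_test)]

grid_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=custom_cv,
    scoring="accuracy",
)
grid_search.fit(X, y)  # pass the full arrays; cv supplies the row indices
print(grid_search.best_params_)
```

Note that with the default refit=True, best_estimator_ is refit on every row passed to fit, including the rows in list_test; pass refit=False if that is not wanted.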
EDIT: Before Edits:
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = 5,
                           n_jobs = -1, verbose = 0,scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain])
Now, your training set will be split into 5 folds (you can choose the number here). For a specific set of hyperparameters, the model is trained on 4 of those folds and tested on the fold that was left out. This is repeated 5 times, until every training example has been part of a left-out fold. The whole procedure is done for each hyperparameter setting you are testing (5 folds x 3 settings = 15 fits in this case).
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain],
                                                         y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])
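Put together, the workflow above might look like the following sketch. The data, the LogisticRegression estimator, the grid over C and the accuracy metric are all assumptions made for illustration. It also relies on GridSearchCV's default refit=True, which already retrains the best configuration on all rows passed to fit, so best_estimator_ can stand in for the manual refit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Synthetic data and a hypothetical train/test partition (assumptions).
x, y = make_classification(n_samples=300, n_features=5, random_state=0)
indices = np.arange(len(y))
listTrain, listTest = indices[:200], indices[200:]

# Tune hyperparameters with 5-fold CV on the training rows only.
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
grid_search.fit(x[listTrain], y[listTrain])

# refit=True (the default) has already retrained the best setting on listTrain.
clf = grid_search.best_estimator_

# The held-out rows are touched only once, for the final estimate.
predictions = clf.predict(x[listTest])
test_accuracy = accuracy_score(y[listTest], predictions)
print(grid_search.best_params_, test_accuracy)
```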

How to use categorical data neural network in tensorflow without estimator?

I am trying to build a neural network without using estimators. I have defined layers as,
x_categorical = tf.placeholder(tf.string)
x_numeric = tf.placeholder(tf.float32)
l1 = tf.add(tf.matmul(x_numeric,weights), biases)
l2 = tf.add(tf.matmul(x_categorical,weights), biases)
tf.matmul works well for numeric features, but I also have some categorical features, and I am unable to use them.
I tried tf.string_to_hash_bucket_fast, but it converts the string to int64, which is not supported by tf.matmul. I also tried tf.decode_raw, which did not work either. Please help me with this; I want to use categorical features as well.
To handle categorical values in a neural network you have to represent them as one-hot vectors. If they are strings (as seems to be your case), you first have to convert them to an integer representation. Step by step:
Import the encoders with from sklearn.preprocessing import LabelEncoder, OneHotEncoder
Define your categorical string values:
categorical_values = np.array([['Foo','bar','values'],['more','foo','bar'],['many','foo','bar']])
Then encode them as integers:
categorical_values[:,0] = LabelEncoder().fit_transform(categorical_values[:,0])
categorical_values[:,1] = LabelEncoder().fit_transform(categorical_values[:,1])
categorical_values[:,2] = LabelEncoder().fit_transform(categorical_values[:,2])
And use OneHotEncoder to obtain the OneHot representation:
oneHot_values = OneHotEncoder().fit_transform(categorical_values).toarray()
Define your graph:
x_categorical = tf.placeholder(shape=[NUM_OBSERVATIONS,NUM_FEATURES],dtype=tf.float32)
weights = tf.Variable(tf.truncated_normal([NUM_FEATURES,NUM_CLASSES]),dtype=tf.float32)
bias = tf.Variable(tf.zeros([NUM_CLASSES]),dtype=tf.float32)  # one bias per class, initialized to zero
l2 = tf.add(tf.matmul(x_categorical,weights),bias)
And execute it obtaining the results:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    _l2 = sess.run(l2,feed_dict={x_categorical : oneHot_values})
Edit: As requested, no-sklearn version.
Using just numpy.unique() and tensorflow.one_hot()
categorical_values = np.array(['Foo','bar','values']) #For one observation
lookup, labeledValues = np.unique(categorical_values, return_inverse=True)
oneHotValues = tf.one_hot(labeledValues,depth=NUM_FEATURES)
The full example is in a Jupyter Notebook with the code on my GitHub.
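To make the encoding step concrete without a TensorFlow session, here is a NumPy-only sketch of the same idea: np.unique maps each string to an integer label, and indexing an identity matrix with those labels gives the one-hot rows as floats, which can then go through a matrix product. The weights, biases and num_classes are hypothetical stand-ins for the graph variables above.

```python
import numpy as np

# Four observations of a single categorical feature (illustrative values).
categorical_values = np.array(["Foo", "bar", "values", "bar"])

# lookup holds the sorted distinct strings; labels holds each value's index in lookup.
lookup, labels = np.unique(categorical_values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for label i.
one_hot = np.eye(len(lookup), dtype=np.float32)[labels]

# Hypothetical layer parameters, standing in for the TensorFlow variables.
rng = np.random.default_rng(0)
num_classes = 2
weights = rng.normal(size=(len(lookup), num_classes)).astype(np.float32)
biases = np.zeros(num_classes, dtype=np.float32)

# The float one-hot matrix can be multiplied like any numeric input.
l2 = one_hot @ weights + biases
print(one_hot)
```

Because each one-hot row selects exactly one row of weights, the product is equivalent to a lookup of that category's weight row.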

How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can I manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package I was able to create the same results I got from predict_proba().
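It appears that sklearn uses its own softmax() implementation in their code. It differs from the textbook softmax only in how it is computed (it is written to avoid overflow); the resulting probabilities are the same.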
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict_proba(X) or use the sklearn function mentioned above to calculate them manually like this:
from sklearn.utils.extmath import softmax
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
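The max-subtraction trick described in that docstring can also be written out by hand. The sketch below assumes synthetic 4-class data and a LogisticRegression left at its default solver, which in recent scikit-learn versions fits a multinomial model for multiclass targets; under that assumption the manual result should agree with predict_proba.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class data (assumption, stands in for the X and Y above).
X, Y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X, Y)

# Linear scores: one column per class.
scores = X @ model.coef_.T + model.intercept_

# Numerically stable softmax: subtract each row's maximum before exponentiating.
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
manual_proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Compare against the library's own output.
matches = np.allclose(manual_proba, model.predict_proba(X))
print(matches)
```

Subtracting the row maximum leaves the ratios unchanged because the same factor cancels in the numerator and denominator.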
Replicate sklearn calcs (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P=A/(1+A)
P /= P.sum(axis=1).reshape((-1, 1))
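This is slightly different from the softmax calculation and from the UCLA Stata example: it applies a sigmoid to each class score and then normalizes each row, which is the one-vs-rest calculation rather than the multinomial one. It should therefore match predict_proba only when the model was fit in one-vs-rest mode, but in that case it works.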
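For comparison, the two calculations discussed in this thread can be written side by side. This is a sketch in notation introduced here, not taken from the sklearn docs: z_{ik} is the score of sample i for class k, m_i is the largest score in row i, and \sigma is the logistic sigmoid (so A/(1+A) in the code above is \sigma(U)).

```latex
z_{ik} = x_i^{\top} w_k + b_k

% Multinomial: softmax over the class scores (shifting by m_i does not change the result)
p_{ik}^{\text{softmax}} = \frac{e^{z_{ik}}}{\sum_{j} e^{z_{ij}}}
                        = \frac{e^{z_{ik} - m_i}}{\sum_{j} e^{z_{ij} - m_i}},
\qquad m_i = \max_{j} z_{ij}

% One-vs-rest: per-class sigmoid, then row normalization
p_{ik}^{\text{ovr}} = \frac{\sigma(z_{ik})}{\sum_{j} \sigma(z_{ij})},
\qquad \sigma(z) = \frac{e^{z}}{1 + e^{z}}
```

The two formulas generally give different probabilities for the same scores, which is why a manual calculation has to follow the scheme the model was actually fit with.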

Scikit-learn TruncatedSVD documentation

I plan to use sklearn.decomposition.TruncatedSVD to perform LSA for a Kaggle competition. I know the math behind SVD and LSA, but I'm confused by scikit-learn's user guide, so I'm not sure how to actually apply TruncatedSVD.
In the doc, it states that:
After this operation,
U_k * transpose(S_k) is the transformed training set with k features (called n_components in the API)
Why is this? I thought that after SVD the rank-k approximation X_k should be U_k * S_k * transpose(V_k)?
And then it says,
To also transform a test set X, we multiply it with V_k: X' = X * V_k
What does this mean?
I like the documentation Here a bit better. Sklearn is pretty consistent in that you almost always use some kind of combination of the following code:
#import desired sklearn class
from sklearn.decomposition import TruncatedSVD
trainData = ...  # some array of shape (n_samples, n_features)
testData = ...   # some array with the same number of columns
model = TruncatedSVD(n_components=5, random_state=42)
model.fit(trainData) #you fit your model on the underlying data
If you want to transform that data instead of just fitting it:
model.fit_transform(trainData) #fit and transform underlying data
Similarly, to apply the fitted model to new data such as a test set, call transform (for an estimator that makes predictions you would call predict instead; TruncatedSVD has no predict method):
testTransformed = model.transform(testData)
Hope that helps...
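The two statements quoted from the user guide can be checked numerically. Since X = U * S * transpose(V) and the columns of V are orthonormal, multiplying by V_k gives X * V_k = U_k * S_k, so the transformed training set is U_k * S_k (S_k is diagonal, so its transpose is itself), and a test set is projected the same way with X' = X * V_k. The sketch below assumes small random matrices and uses algorithm="arpack" so the result is close to an exact SVD; signs are compared in absolute value because each singular vector is only defined up to sign.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Small random matrices standing in for the training and test sets (assumption).
rng = np.random.default_rng(0)
trainData = rng.random((20, 8))
testData = rng.random((5, 8))
k = 3

model = TruncatedSVD(n_components=k, algorithm="arpack", random_state=42)
trainReduced = model.fit_transform(trainData)  # k features per training row
testReduced = model.transform(testData)        # same projection for test rows

# Reference decomposition: trainData = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(trainData, full_matrices=False)
U_k, S_k = U[:, :k], S[:k]

# Transformed training set equals U_k * S_k up to the sign of each column.
train_matches = np.allclose(np.abs(trainReduced), np.abs(U_k * S_k), atol=1e-6)

# Test rows are multiplied by V_k, stored transposed in components_.
test_matches = np.allclose(testReduced, testData @ model.components_.T)
print(train_matches, test_matches)
```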

Scikit-Learn Multiple Regression Fails with ElasticNetCV

According to the documentation and other SO questions, ElasticNetCV accepts multiple output regression. When I try it, though, it fails. Code:
from sklearn import linear_model
import numpy as np
import numpy.random as rnd
nsubj = 10
nfeat_train = 5
nfeat_predict = 20
x = rnd.random((nsubj, nfeat_train))
y = rnd.random((nsubj, nfeat_predict))
lm = linear_model.LinearRegression()
lm.fit(x,y) # works
el = linear_model.ElasticNetCV()
el.fit(x,y) # fails
Error message:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
This is with scikit-learn version 0.14.1. Is this a mismatch between the documentation and implementation?
You may want to take a look at sklearn.linear_model.MultiTaskElasticNetCV. But beware, this object assumes that your multiple targets share features. Thus, a feature is either active for all tasks (with variable activation for each, which can be small), or active for none of them. Before using this object, make sure this is the functionality you need.
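As a minimal sketch of that suggestion, the question's setup can be refit with MultiTaskElasticNetCV. The data here is synthetic, with more rows than in the question and targets that actually depend on the features; cv=3 and the weight matrix W are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

# Synthetic multi-output regression data (assumption; sizes adapted from the question).
rng = np.random.default_rng(0)
nsubj, nfeat_train, nfeat_predict = 30, 5, 20
x = rng.random((nsubj, nfeat_train))
W = rng.normal(size=(nfeat_train, nfeat_predict))  # hypothetical true weights
y = x @ W + 0.1 * rng.normal(size=(nsubj, nfeat_predict))

# Accepts a 2-D target; the regularization strength is chosen by cross-validation.
el = MultiTaskElasticNetCV(cv=3)
el.fit(x, y)

# coef_ has one row per target and one column per feature.
print(el.coef_.shape)

# A feature counts as active if any target gives it a nonzero coefficient.
active_features = np.any(el.coef_ != 0, axis=0)
print(active_features)
```

Inspecting active_features is one way to see the shared-feature behavior described above: the penalty acts on whole columns of coef_, so a feature is selected or dropped for all targets together.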
