Predicting values using Polynomial/Least Squares Regression - scikit-learn

I have a dataset with 2 input variables (called x, with shape n x 2, holding the values of x1 and x2) and 1 output (called y). I am having trouble understanding how to calculate predicted output values from the polynomial features together with the weights. My understanding is that y = X dot w, where X is the matrix of polynomial features and w is the weight vector.
The polynomial features were generated using PolynomialFeatures from sklearn.preprocessing. The weights were generated with np.linalg.lstsq. Below is sample code that I created for this.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame()
df['x1'] = [1,2,3,4,5]
df['x2'] = [11,12,13,14,15]
df['y'] = [75,96,136,170,211]
x = np.array([df['x1'],df['x2']]).T
y = np.array(df['y']).reshape(-1,1)
poly = PolynomialFeatures(interaction_only=False, include_bias=True)
poly_features = poly.fit_transform(x)
print(poly_features)
w = np.linalg.lstsq(x,y)
weight_list = []
for item in w:
    if type(item) is np.int32:
        weight_list.append(item)
        continue
    for weight in item:
        if type(weight) is np.ndarray:
            weight_list.append(weight[0])
            continue
        weight_list.append(weight)
weight_list
y_pred = np.dot(poly_features, weight_list)
print(y_pred)
regression_model = LinearRegression()
regression_model.fit(x,y)
y_predicted = regression_model.predict(x)
print(y_predicted)
The y_pred values are nowhere near the list of y values that I created. Am I passing the wrong inputs to np.linalg.lstsq, or is there a lapse in my understanding?
Using the built-in LinearRegression() function, y_predicted is much closer to my provided y values, while y_pred is orders of magnitude higher.

In the lstsq call, the generated polynomial features should be the first input, not the x data that was initially supplied.
Additionally, the first element of the tuple returned by lstsq holds the regression coefficients/weights, which can be accessed by indexing with 0.
The corrected code for this explicit linear-algebra method of obtaining the least-squares regression weights/coefficients would be:
w = np.linalg.lstsq(poly_features,y, rcond=None)
y_pred = np.dot(poly_features, w[0])
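For reference, with two inputs and the default degree of 2, fit_transform produces one column per term in the order [1, x1, x2, x1^2, x1*x2, x2^2], so w[0] contains six coefficients in that same order. You can confirm the column order from the transformer itself (on older scikit-learn versions the method may be called get_feature_names instead):
# column order of the generated polynomial features
print(poly.get_feature_names_out(['x1', 'x2']))
# expected: ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']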
For the entire corrected code (note that this method gives predictions closer to the provided y values than the LinearRegression model below, because that model is fit on the raw x rather than on the polynomial features):
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame()
df['x1'] = [1,2,3,4,5]
df['x2'] = [11,12,13,14,15]
df['y'] = [75,96,136,170,211]
x = np.array([df['x1'],df['x2']]).T
y = np.array(df['y']).reshape(-1,1)
poly = PolynomialFeatures(interaction_only=False, include_bias=True)
poly_features = poly.fit_transform(x)
print(poly_features)
w = np.linalg.lstsq(poly_features,y, rcond=None)
print(w)
y_pred = np.dot(poly_features, w[0])
print(y_pred)
regression_model = LinearRegression()
regression_model.fit(x,y)
y_predicted = regression_model.predict(x)
print(y_predicted)
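As a side note, fitting LinearRegression on the polynomial features instead of the raw x should reproduce the lstsq predictions, since both solve the same least-squares problem. A minimal sketch, reusing poly_features and y from above:
# fit the linear model on the polynomial features rather than on the raw x
poly_model = LinearRegression(fit_intercept=False)  # the bias column is already in poly_features
poly_model.fit(poly_features, y)
print(poly_model.predict(poly_features))  # should match y_pred from lstsq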

Related

How can the coefficients of the multiclass logistic regression model be used to predict the probabilities of class membership of observations?

I am trying to solve a problem that resembles Fisher's iris classification. The problem is that I can train the model on my computer, but the trained model has to predict class membership on a computer where it is impossible to install Python and scikit-learn. I want to understand how, given the coefficients of the logistic regression model, I can predict class membership without using the model's predict method.
Using the Fisher problem as an example, I do the following.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score
# data preparation
iris = load_iris()
data = pd.DataFrame(data=np.hstack([iris.data, iris.target[:, np.newaxis]]),
                    columns=iris.feature_names + ['target'])
names = data.columns
# split data
X_train, X_test, y_train, y_test = train_test_split(data[names[:-1]], data[names[-1]], random_state=42)
# train model
cls = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=2, random_state=42)
)
cls = cls.fit(X_train.to_numpy(), y_train)
preds_train = cls.predict(X_train)
# prediction
preds_test = cls.predict(X_test)
# scores
train_score = accuracy_score(preds_train, y_train), f1_score(preds_train, y_train, average='macro') # on train data
# train_score = (0.9642857142857143, 0.9653621232568601)
test_score = accuracy_score(preds_test, y_test), f1_score(preds_test, y_test, average='macro') # on test data
# test_score = (1.0, 1.0)
# model coefficients
cls[1].coef_, cls[1].intercept_
>>> (array([[-1.13948079, 1.30623841, -2.21496793, -2.05617771],
[ 0.66515676, -0.2541143 , -0.55819748, -0.86441227],
[ 0.47432404, -1.05212411, 2.77316541, 2.92058998]]),
array([-0.35860337, 2.43929019, -2.08068682]))
Now I have the coefficients of the model. And I want to use them to make predictions.
First, I compute the class probabilities with the predict_proba method for the first five observations of the test sample.
preds_test = cls.predict_proba(X_test)
preds_test[0:5]
>>>array([[5.66019001e-03, 9.18455687e-01, 7.58841233e-02],
[9.75854479e-01, 2.41455095e-02, 1.10881450e-08],
[1.18780156e-09, 6.53295166e-04, 9.99346704e-01],
[6.71574900e-03, 8.14174200e-01, 1.79110051e-01],
[6.98756622e-04, 8.09096425e-01, 1.90204818e-01]])
Then I manually calculate the predictions of the class probabilities for the observations using the coefficients of the model.
# define two functions for making predictions
def logit(x, w):
    return np.dot(x, w)

# from here: https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
n, k = X_test.shape
X_ = np.hstack((np.ones((n, 1)), X_test)) # add column with 1 for intercept
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_)) # create weights matrix
results = softmax(logit(X_, weights.T)) # calculate probabilities
results[0:5]
>>>array([[3.67343725e-14, 4.63938438e-06, 9.99995361e-01],
[2.81976786e-05, 8.63083152e-01, 1.36888650e-01],
[1.24572182e-22, 5.47800683e-11, 1.00000000e+00],
[3.32990060e-14, 3.08352323e-06, 9.99996916e-01],
[2.66415118e-15, 1.78252465e-06, 9.99998217e-01]])
If you compare the two results (preds_test[0:5] and results[0:5]), you can see that they do not coincide at all. Please explain to me what I am doing wrong and how I can use the model's coefficients to calculate predictions without using the predict method.
I forgot that a scaler was applied. If you change the code a little, then the results are the same.
scaler = StandardScaler()
scaler.fit(X_train)
X_test_transf = scaler.transform(X_test)
def logit(x, w):
    return np.dot(x, w)

def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
n, k = X_test_transf.shape
X_ = np.hstack((np.ones((n, 1)), X_test_transf))
weights = np.hstack((cls[1].intercept_[:, np.newaxis], cls[1].coef_))
results = softmax(logit(X_, weights.T))
np.allclose(preds_test, results)
>>>True
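Since the StandardScaler is the only preprocessing step, you can also fold the scaling into the coefficients once and then work directly with the raw features, which is convenient for the machine without Python. This is a minimal sketch, not from the original answer, reusing the softmax defined above:
# fold (x - mean) / scale into the weights, so raw features can be used directly:
# W @ ((x - mean) / scale) + b  ==  (W / scale) @ x + (b - (W / scale) @ mean)
scaler_ = cls[0]  # the StandardScaler step of the pipeline
lr_ = cls[1]      # the LogisticRegression step of the pipeline
W_raw = lr_.coef_ / scaler_.scale_
b_raw = lr_.intercept_ - (lr_.coef_ / scaler_.scale_) @ scaler_.mean_
raw_results = softmax(X_test.to_numpy() @ W_raw.T + b_raw)
print(np.allclose(raw_results, cls.predict_proba(X_test)))  # expected: True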
Note for the binary case: predict_proba returns two values per sample. The first is the probability of the event not occurring and the second is the probability of the event occurring; use predict_proba(X)[:, 1] to get the probability of the event occurring.
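In that binary case the manual computation reduces to a sigmoid instead of a softmax. A minimal sketch, assuming a fitted binary LogisticRegression named clf and a feature matrix X already on the scale the model was trained on (both hypothetical names):
# probability of the positive class, equivalent to clf.predict_proba(X)[:, 1]
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
p_event = sigmoid(X @ clf.coef_.ravel() + clf.intercept_[0])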

SciKitLearn Estimator choice

Good afternoon,
As a humble newbie working on a machine learning project, I was trying out the most basic estimator (linear regression), even though I'm pretty sure I made the wrong choice based on my data.
My data has 38 columns, among which are a datetime column and two string columns; my three targets are two int columns and a single-character string column, while the other columns hold floats.
Using linear regression (after dropping the datetime columns and converting every string column to a numerical one), I'm getting a maximum score of 44% (0.44) out of my model after 10000 iterations of a for loop.
Here's my code.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pickle
# import the xls files
data_18_19 = pd.read_excel(r'c:\Users\unkno\Desktop\xxxxx_x_.xls')
data_19_20 = pd.read_excel(r'c:\Users\unkno\Desktop\yyyyy_y_.xls')
# merge the dataframes
merge_data = [data_18_19, data_19_20]
data = pd.concat(merge_data, sort=False)
# drop the Div column, all l1, and the time/date columns because they are problematic
data = data.drop(['Div'], axis=1)
data = data.drop(['Time'], axis=1)
data = data.drop(['Date'], axis=1)
# droplist: list comprehension over the column names to drop
droplist = [str(x) for x in data.iloc[0:0, 37:]]
data = data.drop(droplist, axis=1)
# map H, D, A to 1, 0, 2 in the HTR and FTR columns
data['FTR'] = data['FTR'].replace(['H','D','A'], [1,0,2])
data['HTR'] = data['HTR'].replace(['H','D','A'], [1,0,2])
# map the team strings to numbers in alphabetical order
dt = {'At':1,'Bo':2,'Br':3,'Ca':4,'Ch':23,'Em':22,'Fr':21,'Fi':5,'Ge':6,'In':7,'Ju':8,'La':9,'Le':10,'Mi':11,'Na':12,'Pa':13,'Ro':14,'Sa':15,'Sas':16,'Sp':17,'To':18,'Ud':19,'Ve':20}
data['HT'] = data['HT'].replace([i for i in dt.keys()], [j for j in dt.values()])
data['AT'] = data['AT'].replace([i for i in dt.keys()], [j for j in dt.values()])
# define the target column of the prediction
predict = 'FTR'
# build the features (X) and the target (y)
X = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
best = 0
for i in range(10000):
    # split the data for validation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    # define and train the model on the training split
    model = LinearRegression()
    model.fit(X_train, y_train)
    # accuracy check
    acc = model.score(X_test, y_test)
    # predictions
    predicts = model.predict(X_test)
    hr_predicts = np.around(predicts)
    if acc > best:
        best = acc
        with open(r"c:\Users\unkno\Desktop\dump.pickle", "wb") as doc:
            pickle.dump(model, doc)
        print("Precision: ", acc)
I was wondering how I could increase accuracy, and which estimator I should choose for better results.
Thanks in advance.
There are several approaches; however, here are a few simple steps that don't change much of what has already been done.
First, check the model performance after scaling the data. If the scales of the predictors are very different, the model does not converge properly. I assume the y that you want to predict is also a continuous variable; check whether it follows a normal distribution, and if it does not, applying a log transform helps to normalize it (see the sketch below).
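A minimal sketch of the log-transform idea, assuming the target is non-negative (np.log1p and np.expm1 handle zero values safely):
# fit on the log-transformed target, then map the predictions back
y_train_log = np.log1p(y_train)            # log(1 + y)
model.fit(X_train, y_train_log)
y_pred = np.expm1(model.predict(X_test))   # invert the transform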
Secondly, random_state is not specified in train_test_split. Is this intentional? Please check the performance after fixing it to an int value, e.g. train_test_split(X, y, test_size=0.1, random_state=42).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# only transform the test data, else it leads to data leakage
X_test = scaler.transform(X_test)
# scale the target with its own scaler; the reshape is needed because y is 1-D
y_scaler = StandardScaler()
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
y_test = y_scaler.transform(y_test.reshape(-1, 1))
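As a side note (not part of the original answer), wrapping the scaler and the model in a Pipeline avoids the leakage issue by construction, because the scaler is always fit on the training fold only:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)           # the scaler statistics come from the training data only
print(pipe.score(X_test, y_test))    # R^2 on the held-out split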

LSTM Sequence Learning -- Prediction Result Issues

I am attempting to implement sequence learning in Python 3.5.2 | Anaconda 4.2.0 (64-bit) on my Windows 10 machine. I have scoured the internet in order to get as far as I have, but the documentation lacks the details I need. I will share my code and then ask my questions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow
import keras as krs
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout
from keras.preprocessing import sequence
from sklearn.preprocessing import QuantileTransformer, MinMaxScaler
from sklearn.metrics import mean_squared_error
DF = pd.read_csv("./integer_seqn/train.csv", delimiter = ",")
N = DF.shape[0]
data = []
# turn pandas df into list
for rows in DF['Sequence']:
    fix = rows.split(",")
    fix = [int(elem) for elem in fix]
    data.append(fix)
# for now I am working with one sequence at a time
D = data[0]
D = np.reshape(D, (len(D),1))
D = D.astype('float64')
# scaling is needed since LSTMs are sensitive to large data values
# for this first sequence I scale to [0, 1] with MinMaxScaler (QuantileTransformer would be an alternative since the data is skewed)
scaler = MinMaxScaler((0,1))
X_train_scaled = scaler.fit_transform(D)
# split each sequence into a training and a test set
train_size = int(len(X_train_scaled)*(0.90))
test_size = len(X_train_scaled) - train_size
train, test = X_train_scaled[0:train_size], X_train_scaled[train_size:len(X_train_scaled)]
# creates a data set so that the first column
# is every n-th element in a sequence
# and the second column is every (n+1)-th element in a sequence
dataX, dataY = [], []
time_step = 2
for i in range(0, (len(train) - time_step)):
    dataX.append(train[i:(i + time_step), 0])
    dataY.append(train[i + time_step, 0])
dataX = np.array(dataX)
dataY = np.array(dataY)
testX, testY = [], []
for i in range(0, (len(test) - time_step)):
    testX.append(test[i:(i + time_step), 0])
    testY.append(test[i + time_step, 0])
testX = np.array(testX)
testY = np.array(testY)
# need to reshape the data so that model.fit will accept it
dataX = np.reshape(dataX, (dataX.shape[0],dataX.shape[1],1))
testX = np.reshape(testX, (testX.shape[0],testX.shape[1],1))
# build RNN-LSTM for a sequence
# the number of LSTM nodes is adjustable
model = Sequential()
model.add(LSTM(units = 20, input_shape = (1,1)))
model.add(Dense(1))
model.compile(loss="mean_squared_error", optimizer = "adam")
model.fit(dataX, dataY, epochs = 100, batch_size = 1, verbose = 0)
# predict using the model (validation with the train set and test with the test set)
trainPredict = model.predict(dataX)
testPredict = model.predict(testX)
# "unscale" the data
trainPredict = scaler.inverse_transform(trainPredict)
dataY = scaler.inverse_transform(np.reshape(dataY,(len(dataY),1)))
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform(np.reshape(testY, (len(testY),1)))
I am testing this code on an integer sequence
data[0] = [1, 3, 13, 87, 1053, 28576, 2141733, 508147108, 402135275365, 1073376057490373, 9700385489355970183, 298434346895322960005291, 31479360095907908092817694945, 11474377948948020660089085281068730]
I am getting some pretty poor predictions. When I do model.predict(dataX), the order of magnitude is significantly larger than the true values. Additionally, the predictions are all the same. Most recently, I get a 7x1 numpy array filled with the value 1.2242328e+32. I get a similar result with model.predict(testX), where I get a 3x1 numpy array filled with the value 1.2242328e+32, but at least this prediction is closer (though not as close as I would like) to the final element in testY.
My specific question is: why is my prediction array filled with the same value?

Confidence score too low

I'm wondering why the model score is very low, only 0.13. I already made sure the data is clean and scaled, and that there is high correlation between the features, but the model score using linear regression is still very low. Why is this happening, and how can I solve it? This is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
path = r"D:\python projects\avocado.csv"
df = pd.read_csv(path)
df = df.reset_index(drop=True)
df.set_index('Date', inplace=True)
df = df.drop(['Unnamed: 0','year','type','region','AveragePrice'], axis=1)
df.rename(columns={'4046': 'Small HASS sold',
                   '4225': 'Large HASS sold',
                   '4770': 'XLarge HASS sold'},
          inplace=True)
print(df.head())
sns.heatmap(df.corr())
sns.pairplot(df)
df.plot()
_=plt.xticks(rotation=20)
forecast_line = 35
df['target'] = df['Total Volume'].shift(-forecast_line)
X = np.array(df.drop(['target'], 1))
X = preprocessing.scale(X)
X_lately = X[-forecast_line:]
X = X[:-forecast_line]
df.dropna(inplace=True)
y = np.array(df['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
lr = LinearRegression()
lr.fit(X_train,y_train)
confidence = lr.score(X_test,y_test)
print(confidence)
this is the link of the dataset i use https://www.kaggle.com/neuromusic/avocado-prices
So the score function you are using is:
Return the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
So, as you realise, your model is already doing better than the constant prediction.
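To make that comparison explicit, you can score a constant-mean baseline next to your model; a minimal sketch reusing lr, X_train, y_train, X_test and y_test from the question:
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # close to 0.0 by the definition of R^2
print(lr.score(X_test, y_test))        # the fitted model, about 0.13 here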
My advice: try to plot your data to see what kind of regression you should use. Here you can see an overview of the available linear models: https://scikit-learn.org/stable/modules/linear_model.html
Logistic regression makes sense if your data follows a logistic curve, which means that your target values are either close to 0 or close to 1, with not many points in between.

ValueError: Can only tuple-index with a MultiIndex

For a multilabel classification problem I am trying to plot a precision-recall curve.
The sample code is taken from https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py, under the section "Create multi-label data, fit, and predict".
I am trying to fit it into my code, but I get the error "ValueError: Can only tuple-index with a MultiIndex" when I run the code below.
train_df.columns.values
array(['DefId', 'DefectCount', 'SprintNo', 'ReqName', 'AreaChange',
       'CodeChange', 'TestSuite'], dtype=object)
TestSuite is the value to be predicted.
X_train = train_df.drop("TestSuite", axis=1)
Y_train = train_df["TestSuite"]
X_test = test_df.drop("DefId", axis=1).copy()
classes --> I have hardcoded these with the TestSuite values
import numpy as np
from sklearn import svm
from sklearn.preprocessing import label_binarize
# Use label_binarize to get multi-label-like settings
Y = label_binarize(Y_train, classes=np.array([0, 1, 2, 3, 4]))
n_classes = Y.shape[1]
# We use OneVsRestClassifier for multi-label prediction
from sklearn.multiclass import OneVsRestClassifier
# Run classifier
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=3))
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
import pandas as pd
# For each class
precision = dict()
recall = dict()
average_precision = dict()
#n_classes = Y.shape[1]
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_train[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_train[:, i], y_score[:, i])
Input data -> the values have been categorised.
