Different R-squared scores for different times - python-3.x

I just learned about cross-validation and when I give in different arguments, there are different results.
This is the code for building the Regression Model and the R-squared output was about .5 :
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
boston = load_boston()
X = boston.data
y = boston['target']
X_rooms = X[:,5]
X_train, X_test, y_train, y_test = train_test_split(X_rooms, y)
reg = LinearRegression()
reg.fit(X_train.reshape(-1,1), y_train)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
plt.scatter(X_test, y_test)
plt.plot(prediction_space, reg.predict(prediction_space), color = 'black')
reg.score(X_test.reshape(-1,1), y_test)
Now when I give the cross-validation for X_train, X_test, and X(respectively), it shows different R-squared values.
Here's the X_test and y_test arguments:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_test.reshape(-1,1), y_test, cv = 8)
cv
The result:
array([ 0.42082715, 0.6507651 , -3.35208835, 0.6959869 , 0.7770039 ,
0.59771158, 0.53494622, -0.03308137])
Now when I use the arguments, X_train and y_train, different results are outputted.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
cv
The result:
array([0.46500321, 0.27860944, 0.02537985, 0.72248968, 0.3166983 ,
0.51262191, 0.53049663, 0.60138472])
Now, when I input different arguments again; this time X(which in my case is X_rooms) and y, I yet again get different R-squared values.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_rooms.reshape(-1,1), y, cv = 8)
cv
The output:
array([ 0.61748403, 0.79811218, 0.61559222, 0.6475456 , 0.61468198,
-0.7458466 , -3.71140488, -1.17174927])
Which one should I use?
I know this is a long question so Thanks!!

Train set should be distinctly use for training your model, while test set is for final evaluation. But unfortunately, you need to test your model's score on some set before checking it on final result (test set): for example when you try to tune some hyper-parameters. There are some other reasons to use cv, it's just one of them.
Usually the process is:
Split train and test
Train model use cv to check stability, including hyper-tune params (which is irrelevant in your case)
Assess model score on test set.
scikit-learn's cross_val_score receives an object (before training!) and data. It trains each time model on different section of data, and then returns the score. It's like having a lot of "train-test" checks.
Therefore, you should:
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
solely on train set. Test set should be used for other purposes.
What you get is a list of accuracy score. You can see if your model is stable (does accuracy is in same range among all folds?) or general performance of model (avg score)

Related

How to improve negative R square in DecisionTree Regressor

I was trying to apply some regressor to make an IMDB rating predict. This is what I tried:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
data = pd.read_csv("D:/Code/imdb_project/movie_metadata.csv")
df = data[["duration","budget", "title_year","imdb_score"]]
df = df.dropna()
feature = np.array(df[["duration","budget","title_year"]])
rating = np.array(df["imdb_score"])
scaler = MinMaxScaler()
scaler.fit(feature)
X = scaler.transform(feature)
y = rating
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 5)
regressor = DecisionTreeRegressor(criterion='mse')
regressor.fit(x_train, y_train)
regressor.score(x_test, y_test)
For clarification, my dataset contains 3 features: Budge, Release year, and duration, y is the IMDB rating.
When applying this regressor for the test data, I always receive a negative R square (it works just fine with the train data.) I understand that R square can be negative but I am still wondering if there is a way that I can improve it? The only way I know is normalizing the data and I did it before fitting the model.
Negative R^2 score means your model fits the data very poorly. In this case Decision tree may be too simple. Or maybe you've chosen wrong criterion.
I would recommend to try tune your model's hyperparameters or choose another one.

using sklearn.train_test_split for Imbalanced data

I have a very imbalanced dataset. I used sklearn.train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I used to count number of type1(my data set has 2 categories and types(type1 and tupe2) but approximately all of my train data are type1. So I cant oversample.
Previously I used to split train test datasets with my written code. In that code 0.8 of all type1 data and 0.8 of all type2 data were in the train dataset.
How I can use this method with train_test_split function or other spliting methods in sklearn?
*I should just use sklearn or my own written methods.
You're looking for stratification. Why?
There's a parameter stratify in method train_test_split to which you can give the labels list e.g. :
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.2)
There's also StratifiedShuffleSplit.
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need and scikit does not offer the functionality you want. You will want to implement your own code.
This is what I came up for my application. Note that I have not had extensive time to debug it but I believe it works from the testing I have done. Hope it helps:
def equal_sampler(classes, data, target, test_frac):
# Find the least frequent class and its fraction of the total
_, count = np.unique(target, return_counts=True)
fraction_of_total = min(count) / len(target)
# split further into train and test
train_frac = (1-test_frac)*fraction_of_total
test_frac = test_frac*fraction_of_total
# initialize index arrays and find length of train and test
train=[]
train_len = int(train_frac * data.shape[0])
test=[]
test_len = int(test_frac* data.shape[0])
# add values to train, drop them from the index and proceed to add to test
for i in classes:
indeces = list(target[target ==i].index.copy())
train_temp = np.random.choice(indeces, train_len, replace=False)
for val in train_temp:
train.append(val)
indeces.remove(val)
test_temp = np.random.choice(indeces, test_len, replace=False)
for val in test_temp:
test.append(val)
# X_train, y_train, X_test, y_test
return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of possible values, data expects the dataframe columns used for prediction, target expects the target column.
Take care that the algorithm may not be extremely efficient, due to the triple for-loop(list.remove takes linear time). Despite that, it should be reasonably fast.
You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn and I have performed the following steps. However, when it comes to predicting y_pred using the trained model I am getting a perfect r^2 = 1.0. Does anyone know why this is the case/what's going wrong with my code?
Also sorry I'm new to this site so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
'Pop_density_MOE',
'Age_median_MOE',
'Sex_ratio_MOE',
'Income_median_household_MOE',
'Pop_total_pov_status_determ_MOE',
'Pop_total_50percent_pov_MOE',
'Pop_total_125percent_pov_MOE',
'Poverty_percent_below_MOE',
'Total_labourforceMOE',
'Unemployed_total_MOE',
'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
how='any',
inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So turns out it was a stupid mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns haha, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
With regression, it's often a good idea to run a correlation matrix against all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and just leaving one in the model) is better for R^2 value (aka model fit). Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."

How to compute accuracy and the confusion matrix using K-fold cross-validation?

I tried to do K-fold cross-validation with K=30 folds, with one confusion matrix for each fold. How to compute the accuracy and the confusion matrix to the model with confidence interval?
Could someone help me?
My code is:
import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression
UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')
previsores = UNSW.iloc[:,UNSW.columns.isin(('sload','dload',
'spkts','dpkts','swin','dwin','smean','dmean',
'sjit','djit','sinpkt','dinpkt','tcprtt','synack','ackdat','ct_srv_src','ct_srv_dst','ct_dst_ltm',
'ct_src_ltm','ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm')) ].values
classe= UNSW.iloc[:, -1].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(
previsores, classe, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
#((90, 4), (90,))
print(X_test.shape, y_test.shape)
#((60, 4), (60,))
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
print(previsores.shape)
########K FOLD
print('########K FOLD########K FOLD########K FOLD########K FOLD')
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
kf = KFold(n_splits=30, random_state=None, shuffle=False)
kf.get_n_splits(previsores)
for train_index, test_index in kf.split(previsores):
X_train, X_test = previsores[train_index], previsores[test_index]
y_train, y_test = classe[train_index], classe[test_index]
logmodel.fit(X_train, y_train)
print (confusion_matrix(y_test, logmodel.predict(X_test)))
print(10* '#')
For accuracy, I would use the function cross_val_score that does exactly what you are looking for. It outputs a list of 30 validation accuracies and you can then compute their mean, standard deviation, etc and create some kind of a confidence interval (mean +- 2*std)
.
Since confusion matrix cannot be seen as a performance metric (not a single number but a matrix) I would recommend creating a list and then iteratively just append it with a corresponding validation confusion matrix (currently you just print it). At the end, you can use this list to extract a lot of interesting information.
UPDATE:
...
...
cm_holder = []
for train_index, test_index in kf.split(previsores):
X_train, X_test = previsores[train_index], previsores[test_index]
y_train, y_test = classe[train_index], classe[test_index]
logmodel.fit(X_train, y_train)
cm_holder.append(confusion_matrix(y_test, logmodel.predict(X_test))))
Note that the len(cm_holder) = 30 and each of the elements is an array of shape=(n_classes, n_classes).

cross validation and text classification for imbalanced data

I am new to NLP and I am trying to build a text classifier but my data is currently imbalanced.The highest category having as much as 280 entries while the lowest as much as 30.
I am trying to use cross validation technique for the current data, but after looking for days now i am unable to implement it.It looks pretty straightforward but I am still unable to implement it. Here is my code
y = resample.Subsystem
X = resample['new description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
#SVM
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),('tfidf', TfidfTransformer()),('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, n_iter=5, random_state=42)),])
text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
print('The best accuracy is : ',np.mean(predicted_svm == y_test))
I have done some gridsearch and Stemmer further but right now I would work on cross validation on this code.I have cleaned the data pretty well but i am stil getting an accuracy of 60%
Any help would be appreciated
Try to do oversampling or under sampling. As the data is highly imbalanced, There is more bias towards the class with more data points. After the over/under sampling the bias will be very less and accuracy will up.
Else instead of SVM you can use MLP. It gives good results even with unbalanced data.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
from sklearn.model_selection import RepeatedKFold
kf = RepeatedKFold(n_splits=20, n_repeats=10, random_state=None)
for train_index, test_index in kf.split(X):
#print("Train:", train_index, "Validation:",test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

Resources