How to improve negative R square in DecisionTree Regressor - python-3.x

I was trying to apply a regressor to predict IMDB ratings. This is what I tried:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
data = pd.read_csv("D:/Code/imdb_project/movie_metadata.csv")
df = data[["duration","budget", "title_year","imdb_score"]]
df = df.dropna()
feature = np.array(df[["duration","budget","title_year"]])
rating = np.array(df["imdb_score"])
scaler = MinMaxScaler()
scaler.fit(feature)
X = scaler.transform(feature)
y = rating
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 5)
regressor = DecisionTreeRegressor(criterion='squared_error')  # 'mse' was renamed to 'squared_error' in recent scikit-learn versions
regressor.fit(x_train, y_train)
regressor.score(x_test, y_test)
For clarification, my feature set contains three columns: budget, release year, and duration; y is the IMDB rating.
When applying this regressor to the test data, I always get a negative R squared (it works just fine on the training data). I understand that R squared can be negative, but I am still wondering if there is a way to improve it. The only way I know is normalizing the data, and I did that before fitting the model.

A negative R^2 score means your model fits the test data very poorly, worse than simply predicting the mean. Since the tree fits the training data well but not the test data, it is probably overfitting rather than being too simple, or maybe you've chosen the wrong criterion.
I would recommend tuning your model's hyperparameters (for example max_depth or min_samples_leaf) or choosing another model.
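For example, a minimal tuning sketch (reusing x_train, x_test, y_train, y_test from the question; the parameter grid below is only illustrative):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
# Illustrative grid only; useful ranges depend on your data.
param_grid = {
    "max_depth": [2, 4, 6, 8, None],
    "min_samples_leaf": [1, 5, 20, 50],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=5),
    param_grid,
    scoring="r2",   # optimize the same metric you report
    cv=5,
)
search.fit(x_train, y_train)
print(search.best_params_)
print(search.best_score_)                             # cross-validated R^2 on the training folds
print(search.best_estimator_.score(x_test, y_test))   # held-out R^2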

Related

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn and I have performed the following steps. However, when it comes to predicting y_pred using the trained model, I am getting a perfect r^2 = 1.0. Does anyone know why this is the case / what's going wrong with my code?
Also, sorry, I'm new to this site, so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012].copy()  # .copy() so the in-place drops below don't trigger SettingWithCopyWarning
ml_data_1218.drop(columns=['Pop_MOE',
'Pop_density_MOE',
'Age_median_MOE',
'Sex_ratio_MOE',
'Income_median_household_MOE',
'Pop_total_pov_status_determ_MOE',
'Pop_total_50percent_pov_MOE',
'Pop_total_125percent_pov_MOE',
'Poverty_percent_below_MOE',
'Total_labourforceMOE',
'Unemployed_total_MOE',
'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
how='any',
inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So it turns out it was a simple mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
With regression, it's often a good idea to run a correlation matrix against all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and just leaving one in the model) is better for model fit. Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV, and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."
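A rough sketch of that check, assuming the ml_data_1218 DataFrame from the question and 'Burglaries ' as the DV:
# Correlations of every numeric column with the dependent variable. Anything
# correlated ~0.97+ with the DV (or the DV itself accidentally left among the
# predictors) is a red flag for leakage or redundancy.
corr_with_dv = (
    ml_data_1218.select_dtypes("number").corr()["Burglaries "]
    .drop("Burglaries ")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_dv.head(10))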

Confidence score too low

I'm wondering why the model score is very low, only 0.13. I already made sure the data is clean and scaled, and there is high correlation between the features, but the model score using linear regression is still very low. Why is this happening and how can I solve it? This is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
path = r"D:\python projects\avocado.csv"
df = pd.read_csv(path)
df = df.reset_index(drop=True)
df.set_index('Date', inplace=True)
df = df.drop(['Unnamed: 0','year','type','region','AveragePrice'], axis=1)
df.rename(columns={'4046':'Small HASS sold',
'4225':'Large HASS sold',
'4770':'XLarge HASS sold'},
inplace=True)
print(df.head())
sns.heatmap(df.corr())
sns.pairplot(df)
df.plot()
_=plt.xticks(rotation=20)
forecast_line = 35
# Shift the volume 35 rows up, so each row's target is the volume 35 steps ahead;
# the last 35 rows end up with NaN targets.
df['target'] = df['Total Volume'].shift(-forecast_line)
X = np.array(df.drop(['target'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_line:]   # rows without a known target, kept for forecasting
X = X[:-forecast_line]
df.dropna(inplace=True)         # drops the same trailing rows, keeping X and y aligned
y = np.array(df['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
lr = LinearRegression()
lr.fit(X_train,y_train)
confidence = lr.score(X_test,y_test)
print(confidence)
this is the link of the dataset i use https://www.kaggle.com/neuromusic/avocado-prices
So the score function you are using is:
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum
of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible
score is 1.0 and it can be negative (because the model can be
arbitrarily worse). A constant model that always predicts the expected
value of y, disregarding the input features, would get a R^2 score of
0.0.
So, as you realise, you are already above the constant prediction.
My advice: try to plot your data to see what kind of regression you should use. Here you can see an overview of the linear models that are available: https://scikit-learn.org/stable/modules/linear_model.html
Logistic regression makes sense if your data follows a logistic curve, which means that your points are either close to 0 or to 1, with not many points in the middle.
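To make that definition concrete, here is a tiny sketch with made-up numbers that computes R^2 by hand and checks it against sklearn:
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])
u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
print(1 - u / v)                 # 0.9
print(r2_score(y_true, y_pred))  # same value
# A constant model that always predicts y_true.mean() gives u == v, i.e. R^2 = 0;
# anything worse than that goes negative.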

Prediction with linear regression is very inaccurate

This is the CSV I'm using: https://gist.github.com/netj/8836201. Currently I'm trying to predict the variety, which is categorical data, with linear regression, but somehow the prediction is very, very inaccurate. The actual labels are just combinations of 0.0 and 1.0, but the predictions are fractional values, even negative ones, which in my opinion is very inaccurate. What part did I get wrong, and what is the solution for this inaccuracy? This is the assignment my teacher gave me; he said we could predict categorical data with linear regression, not only logistic regression.
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn import metrics
path= r"D:\python projects\iris.csv"
df = pd.read_csv(path)
array = df.values
X = array[:,0:3]   # note: this takes only the first three of the four measurement columns
y = array[:,4]
le = preprocessing.LabelEncoder()
# categorical_features has been removed from OneHotEncoder; encode the single label column directly
ohe = preprocessing.OneHotEncoder()
y = le.fit_transform(y)
y = y.reshape(-1,1)
y = ohe.fit_transform(y).toarray()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)  # note: this rescales the training targets but not y_test, so predictions won't be on the original 0/1 scale
model = LinearRegression(n_jobs=-1).fit(X_train, y_train)
y_pred = model.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})  # compare the true one-hot labels with the predictions
the output :
y_pred
Out[46]:
array([[-0.08676055, 0.43120144, 0.65555911],
[ 0.11735424, 0.72384335, 0.1588024 ],
[ 1.17081347, -0.24484483, 0.07403136],
X_test
Out[61]:
array([[-0.09544771, -0.58900572, 0.72247648],
[ 0.14071157, -1.98401928, 0.10361279],
[-0.44968663, 2.66602591, -1.35915595],
Linear Regression is used to predict continuous output data. As you correctly said, you are trying to predict categorical (discrete) output data. Essentially, you want to be doing classification instead of regression - linear regression is not appropriate for this.
As you also said, logistic regression can and should be used instead as it is applicable to classification tasks.
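A minimal classification sketch, assuming the iris.csv from the gist with the variety label in the last column; LogisticRegression accepts the string labels directly, so no one-hot encoding of y is needed:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
df = pd.read_csv(r"D:\python projects\iris.csv")
X = df.iloc[:, :4].values   # the four numeric measurements
y = df.iloc[:, 4].values    # the variety column (string labels are fine)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)   # predicts class labels, not fractional values
print(accuracy_score(y_test, y_pred))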

Why do my ML models have horrible accuracy?

I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars; it contains features for price, safety, and luxury, and classifies each car as good, very good, acceptable, or unacceptable. I converted all the non-numeric columns to numeric, trained the model, and predicted on a test set. However, my predictions are awful; I used LinearRegression and r2_score outputs 0.05, which is practically 0. I have tried a few different models and all have given me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials and read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data, and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
    try:
        df[col].astype(int)   # already numeric, leave it as-is
    except ValueError:
        # label-encode the non-numeric columns
        enc = preprocessing.LabelEncoder()
        enc.fit(df[col])
        df[col] = enc.transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use Linear Regression, which is used for predicting continuous values, not categorical ones. In your case, what you are trying to predict is categorical; technically, each situation is a class.
I would suggest trying Logistic Regression or other classification methods such as Naive Bayes, SVM, or decision tree classifiers instead.
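As a rough sketch of that suggestion, reusing x_train, x_test, y_train, y_test from the question's split, you could swap the linear regression for a decision tree classifier and score with accuracy:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Same split as in the question, but treated as a classification task.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(accuracy_score(y_test, y_pred))   # fraction of cars assigned the correct class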

Why does SVR predict the same value?

I'm trying to predict stock prices with SVR using Python. Given below is the code that I have used:
import pandas as pd
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
train= pd.read_csv("ntrain1.csv")
X_train = train.drop("Close Now", axis=1)
Y_train = train["Close Now"]
clf = SVR(kernel= 'rbf', C=100000, gamma=0.2, epsilon = 0.1)
clf.fit(X_train, Y_train)
test= pd.read_csv("ntestbri.csv")
X_test = test.drop("Close Now", axis=1)
Y_test = test["Close Now"]
y_prediksi = clf.predict(X_test)
y_prediksi_series = pd.Series(y_prediksi)
y_prediksi= pd.DataFrame()
y_prediksi["y_prediksi"] = y_prediksi_series
y_prediksi.to_csv("npredksibri3.csv")
rmse = np.sqrt( mean_squared_error( Y_test, y_prediksi ) )
rmse
The problem is that this code produces the same predicted value, 4436.021668, for every test sample, and an RMSE that reflects that.
How do I fix this?
@maulita - not sure what the specific question is, but if you are looking to improve your predictor, a best practice is to do a train/test split on the training set. This allows you to assess the quality of your predictions and calibrate your predictor before loading the test dataset. Hope that helps - Sandeep
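A sketch of that idea, assuming the same ntrain1.csv layout as in the question: hold out a validation set from the training data, and put a scaler in front of the SVR, since an RBF kernel on unscaled features is a common cause of near-constant predictions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
train = pd.read_csv("ntrain1.csv")
X = train.drop("Close Now", axis=1)
y = train["Close Now"]
# Hold out part of the training data for calibration before touching the test file.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
# Scaling matters a lot for RBF kernels; unscaled features often push SVR
# toward predicting (almost) the same value everywhere.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, gamma="scale", epsilon=0.1))
model.fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(rmse)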
