Can't set scoring parameter to 'neg_mean_squared_error' in cross_val_score - scikit-learn

I'm working on a regression problem, and I am evaluating my model using cross_val_score. I am trying to predict car prices based on some features.
I'm trying to set the 'scoring' parameter to 'neg_mean_squared_error', but when I run it, I get the following error:
TypeError Traceback (most recent call last)
Cell In [46], line 1
----> 1 cross_val_score(model, transformed_X, y, cv=5, scoring='neg_mean_squared_error')
TypeError: 'numpy.float64' object is not callable
Code that gave the error:
cross_val_score(model, transformed_X, y, cv=5, scoring='neg_mean_squared_error')
transformed_X contains the categorical features after one-hot encoding, and y holds the labels (the car prices). Code:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
OneHotEncoder = OneHotEncoder()
categorical_features = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer([("one-hot",
                                  OneHotEncoder,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
np.random.seed(0)
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)
cross_val_score(model, transformed_X, y, cv=5, scoring='neg_mean_squared_error')
This is an example of transformed_X (it is an array of floats):
array([[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, ..., 3.5431000e+04,
4.3200000e+02, 2.2015860e+04],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 1.9271400e+05,
1.0300000e+02, 1.1974723e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, ..., 8.4714000e+04,
3.3400000e+02, 5.2638970e+04],
...,
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, ..., 6.6604000e+04,
4.7300000e+02, 4.1385910e+04],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, ..., 2.1588300e+05,
1.8000000e+01, 1.3414381e+05],
[0.0000000e+00, 0.0000000e+00, 0.0000000e+00, ..., 2.4836000e+05,
5.1000000e+01, 1.5432413e+05]])
If I run np.mean(cross_val_score(model, transformed_X, y, cv=5)) without the scoring parameter, it works normally.
I searched online, and the answers I saw don't really answer my question.
Any help would be appreciated!

I cannot reproduce this. This feels like a "Jupyter notebook out-of-order cell execution bug."
If you're trying to combine a column transformer + one-hot encoding + Random Forests + cross validation + negative mean squared error, you're probably looking for this:
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
# Generate fake dataset with three numeric features and a categorical column
X, y = make_regression(n_samples=1000, n_features=3)
df = pd.DataFrame(X, columns=["A", "B", "C"])
df["D"] = ["a"] * 500 + ["b"] * 500
pipe = make_pipeline(
    ColumnTransformer(
        transformers=[
            ("one-hot", OneHotEncoder(), ["D"])
        ],
        remainder="passthrough",
    ),
    RandomForestRegressor(),
)
print(cross_val_score(pipe, df, y, cv=5, scoring="neg_mean_squared_error"))
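One way an out-of-order notebook run can produce this exact message is a cell that rebinds a function name to a number. The sketch below is only an assumption about the cause, since the question does not show such a cell; cross_val_score_stub is a hypothetical stand-in, not scikit-learn's function.

```python
import numpy as np


def cross_val_score_stub(values):
    """Hypothetical stand-in for a scoring helper; not scikit-learn's API."""
    return np.array(values, dtype=float)


# "Cell 1": the name still refers to a function, so the call works.
scores = cross_val_score_stub([0.8, 0.9, 0.7])

# "Cell 2" (the slip): the averaged result overwrites the function name.
cross_val_score_stub = np.mean(scores)

# "Cell 3": the name now holds a numpy.float64, so calling it fails.
try:
    cross_val_score_stub([0.8, 0.9, 0.7])
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If something like this happened, restarting the kernel and running the cells from top to bottom restores the original function.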

Related

List object not callable in SVM

I'm trying to run this SVM using stratified k-fold in Python, but I keep getting the error shown below.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, zero_one_loss, confusion_matrix
import pandas as pd
import numpy as np
z = pd.read_csv('/home/User/datasets/gtzan.csv', header=0)
X = z.iloc[:, :-1]
y = z.iloc[:, -1:]
X = np.array(X)
y = np.array(y)
# Performing min-max scaling
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Defining the SVM with 'rbf' kernel
svc = SVC(kernel='rbf', C=100, random_state=50)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, shuffle=True)
skf = StratifiedKFold(n_splits=10, shuffle=True)
accuracy_score = []
#skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Training the model
    svc.fit(X_train, np.ravel(y_train))
    # Prediction on the test dataset
    y_pred = svc.predict(X_test)
    # Obtaining the accuracy scores of the model
    score = accuracy_score(y_test, y_pred)
    accuracy_score.append(score)
# Print the accuracy of the SVM model
print('accuracy score: %0.3f' % np.mean(accuracy_score))
However, it gives me the following error:
Traceback (most recent call last):
File "/home/User/Test_SVM.py", line 55, in <module>
score = accuracy_score(y_test, y_pred)
TypeError: 'list' object is not callable
What makes this score list uncallable and how do I fix this error?
accuracy_score is a list in my code, and I was also calling the same name as a function, so the list shadowed the imported accuracy_score function. Renaming the list to acc_score solved the problem.
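A minimal sketch of the renamed loop, assuming synthetic data from make_classification in place of the gtzan.csv file (which is not available here). The list is called acc_scores, an illustrative name, so accuracy_score keeps pointing at the scikit-learn function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the asker's dataset (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc_scores = []  # distinct name: the imported accuracy_score stays callable

for train_index, test_index in skf.split(X, y):
    # Scaling lives inside the pipeline, so it is fitted on each training fold.
    model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=100))
    model.fit(X[train_index], y[train_index])
    y_pred = model.predict(X[test_index])
    acc_scores.append(accuracy_score(y[test_index], y_pred))

print('accuracy score: %0.3f' % np.mean(acc_scores))
```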

Error in Grid search CV - RidgeClassifierCV as the constructor either does not set or modifies parameter alphas

I am running GridSearchCV on RidgeClassifierCV to tune the hyperparameters of my model.
So I imported the libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')
np.random.seed(27)
Then I imported the dataset, split it, scaled the features, and label-encoded the target variable:
!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv
churn = pd.read_csv("ChurnData.csv")
X = churn.drop(['churn'], axis='columns')
y1 = churn[['churn']]
y1['churn']=y1['churn'].astype('int')
scaler=StandardScaler()
X_scaled=scaler.fit_transform(X)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(churn['churn'].unique())
y = le.transform(y1)
# split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2)
Then I performed the grid search:
alphas = [(0.1, 1, 2, 5 , 10)]
solver_churn = ['auto', 'svd','cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
fit_intercept = [True, False]
class_weight = [{0:0.5,1:0.5},{0:0.6,1:0.4}]
param_grid_churn = dict(alphas=alphas, fit_intercept=fit_intercept,class_weight=class_weight)
ridgecv = linear_model.RidgeClassifierCV()
grids_churn = GridSearchCV(estimator=ridgecv, param_grid=param_grid_churn, scoring='roc_auc', verbose=1, n_jobs=-1)
grid_result_churn = grids_churn.fit(X_train, y_train)
alphas is listed in the docs as a parameter, but I still get:
Error in Grid search CV - RidgeClassifierCV as the constructor either does not set or modifies parameter alphas
How to resolve this?
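RidgeClassifierCV already cross-validates over its alphas internally, so pass alphas to the constructor and leave it out of the grid. Adjust your code like this: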
alphas = (0.1, 1, 2, 5 , 10)
solver_churn = ['auto', 'svd','cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
fit_intercept = [True, False]
class_weight = [{0:0.5,1:0.5},{0:0.6,1:0.4}]
param_grid_churn = dict(fit_intercept=fit_intercept,class_weight=class_weight)
ridgecv = linear_model.RidgeClassifierCV(alphas=alphas)
grids_churn = GridSearchCV(estimator=ridgecv, param_grid=param_grid_churn, scoring='roc_auc', verbose=1, n_jobs=-1)
grid_result_churn = grids_churn.fit(X_train, y_train)
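For a version that runs without ChurnData.csv, here is a minimal sketch on synthetic data (make_classification is an assumption standing in for the churn dataset). It also reads back alpha_, the regularization strength that RidgeClassifierCV selected on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for the churn dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=27)

# alphas goes to the constructor; RidgeClassifierCV searches it internally.
candidate_alphas = (0.1, 1, 2, 5, 10)
ridgecv = RidgeClassifierCV(alphas=candidate_alphas)

# The outer grid only covers the remaining hyperparameters.
param_grid = {
    "fit_intercept": [True, False],
    "class_weight": [{0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4}],
}

grid = GridSearchCV(ridgecv, param_grid=param_grid, scoring="roc_auc", cv=3)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_estimator_.alpha_)  # alpha chosen by the inner search
```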

How to use .fit when the X value is in time format

Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,Y,test_size=0.2, random_state = 10)
You have to preprocess data before feeding your model. Here is a complete working example. First, let's import the required modules:
from datetime import datetime
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer
Then, define the training data:
X = ['17:00','17:05', '17:10', '17:15', '17:20', '17:25']
X = np.array(X).reshape(-1, 1)
y = [1, 0, 1, 1, 0, 1]
Note that X must be a 2D array. You also have to convert the time strings to a numerical format. One way to do this is to convert the strings to timestamps using the built-in datetime module. Here is a function that will be used to transform the data:
def transform(X, y=None):
    X_new = np.apply_along_axis(
        lambda x: [datetime.strptime(x[0], '%H:%M').timestamp()],
        axis=1,
        arr=X)
    return X_new
Don't forget to scale your data, since SVC models are sensitive to the scale of their input features. You can easily combine all the preprocessing steps using a Pipeline:
pipeline = Pipeline(steps=[
    ('transformer', FunctionTransformer(transform, validate=False)),
    ('scaler', MinMaxScaler()),
    ('predictor', SVC(kernel='linear'))
])
Finally, let's fit the model:
print('Build and fit a model...')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print('Done. Score', score)
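Parsing only '%H:%M' gives a datetime on 1900-01-01, and turning such an early date into a timestamp may not be supported on every platform. As an alternative sketch (a swapped-in technique, not the conversion used above), the strings can be mapped to minutes since midnight; to_minutes is a hypothetical helper name.

```python
import numpy as np


def to_minutes(X, y=None):
    """Convert an (n, 1) array of 'HH:MM' strings to minutes since midnight."""
    flat = np.asarray(X).ravel()
    minutes = [int(h) * 60 + int(m) for h, m in (s.split(":") for s in flat)]
    # Return a 2D column so the result can feed a scikit-learn estimator.
    return np.array(minutes, dtype=float).reshape(-1, 1)


X_times = np.array(['17:00', '17:05', '17:10']).reshape(-1, 1)
X_minutes = to_minutes(X_times)
print(X_minutes.ravel())
```

A function with this signature could be passed to FunctionTransformer in place of transform.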

Cross-validation for paragraph-vector model

I just came across an error when trying to apply a cross-validation for a paragraph vector model:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import D2VTransformer
from gensim.utils import simple_preprocess
data = pd.read_csv('https://pastebin.com/raw/bSGWiBfs')
np.random.seed(0)
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
y_train = data.label
model = D2VTransformer(size=10, min_count=1, iter=5, seed=1)
clf = LogisticRegression(random_state=0)
pipeline = Pipeline([
    ('vec', model),
    ('clf', clf)
])
pipeline.fit(X_train, y_train)
score = pipeline.score(X_train, y_train)
print("Score:", score) # This works
cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=3)
print("Cross-Validation:", cval) # This doesn't work
KeyError: 0
I experimented by replacing X_train in cross_val_score with model.transform(X_train) or model.fit_transform(X_train). Also, I tried the same with raw input data (data.text), instead of pre-processed text. I suspect that something must be wrong with the format of X_train for the cross-validation, as compared to the .score function for Pipeline, which works just fine. I also noted that the cross_val_score worked with CountVectorizer().
Does anyone spot the mistake?
No, this has nothing to do with the transformation from the model. It's related to cross_val_score.
cross_val_score will split the supplied data according to the cv param. For this, it will do something like this:
for train, test in splitter.split(X_train, y_train):
    new_X_train, new_y_train = X_train[train], y_train[train]
But your X_train is a pandas.Series, where square-bracket selection with integers is label-based rather than positional, so it does not behave like this. See: https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position
Change this line:
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1)
to:
# Access the internal numpy array
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).values
OR
# Convert series to list
X_train = data.apply(lambda r: simple_preprocess(r['text'], min_len=2), axis=1).tolist()
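A minimal sketch of why a Series subset can raise KeyError: 0 while an array or list does not. It assumes that some step reads element 0 with square brackets; the question does not show where that lookup happens, and the documents below are made up for illustration.

```python
import pandas as pd

# Hypothetical tokenised documents standing in for the asker's X_train.
docs = pd.Series([["good", "film"], ["bad", "plot"], ["fine", "cast"]])

# A cross-validation fold keeps only some rows, so label 0 can be missing.
fold = docs.iloc[[1, 2]]

try:
    fold[0]  # label-based lookup: no row in this fold is labelled 0
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Arrays and lists are positional, so element 0 always exists.
first_from_array = fold.values[0]
first_from_list = fold.tolist()[0]

print(lookup_failed, first_from_array, first_from_list)
```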

I am trying to run Gradient Boosting Classifier

from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_val, y_train, y_val = train_test_split(X, y)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
# staged_predict yields one prediction per stage, so index 0 means 1 estimator
bst_n_estimators = np.argmin(errors) + 1
gbrt_best = GradientBoostingRegressor(max_depth = 2, n_estimators = bst_n_estimators)
gbrt_best.fit(X_train, y_train)
When I run this code I get the following error
ValueError: could not convert string to float: '<=50K'
I am using the following data
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
After the boosting classifier I want to check the performance boost on area under the curve, but the above error needs to be fixed first
Based on the code and data preview you provided, the ValueError occurs because you are feeding string (categorical) values to the GBM model. I recommend one-hot encoding them first, either with OneHotEncoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or with pd.get_dummies (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), and then fitting the model.
For ROC curve, please check out: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py. The example should be fairly straightforward for what you need.
import pandas as pd
df = pd.read_csv('PLEASE SPECIFY YOUR FILE PATH', thousands = ',')
df.columns = ['V' + str(col) for col in df.columns]
list_cat = ['V1', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V13', 'V14']
list_target = ['V0']
df = pd.get_dummies(df, columns = list_cat, drop_first = True)
X = df.loc[:, df.columns != list_target[0]].values
y = df[list_target].values
print(df.shape)
df.head()
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)
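Since the goal is an area-under-the-curve check, a classifier is the more natural fit for a two-class target. The sketch below uses a small made-up frame (the column names and values are illustrative, not the real adult.data layout), encodes the string column with pd.get_dummies, maps the string target to 0/1, and scores a GradientBoostingClassifier with roc_auc_score.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; columns and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60, 33, 47] * 10,
    "workclass": ["Private", "State-gov", "Private", "Self-emp",
                  "Private", "State-gov", "Self-emp", "Private"] * 10,
    "income": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", "<=50K", ">50K"] * 10,
})

# Strings cannot be converted to float, so encode the target as 0/1 ...
y = (df["income"] == ">50K").astype(int)
# ... and one-hot encode the categorical feature column.
X = pd.get_dummies(df.drop(columns="income"), columns=["workclass"],
                   drop_first=True, dtype=float)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = GradientBoostingClassifier(max_depth=2, n_estimators=20, random_state=0)
clf.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(auc)
```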
