Applying GridSearch on Catboost Regressor with Categorical Variables - scikit-learn

I have a CatBoost regressor with some categorical input features. I've passed them to the model using the code below.
from catboost import CatBoostRegressor, Pool

cb = CatBoostRegressor(
    iterations=400,
    eval_metric='RMSE',
    loss_function='RMSE',
    learning_rate=0.05,
    max_depth=16,
    random_state=1,
    verbose=False)
pool_train = Pool(X_train, y_train,
                  cat_features=['A', 'B', 'C', 'D'])
pool_test = Pool(X_test, cat_features=['A', 'B', 'C', 'D'])
cb.fit(pool_train)
y_pred = cb.predict(pool_test)
The idea is to use the Pool object to pass the categorical features. However, I can't do the same with scikit-learn's GridSearchCV.
How can I pass pool_train to either CatBoost's grid search or scikit-learn's GridSearchCV?
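One possible approach (a sketch, not a definitive answer; the param_grid values below are placeholders): CatBoost's own grid_search method accepts a Pool directly, so the categorical features declared there are respected. Alternatively, for scikit-learn's GridSearchCV, cat_features can be declared in the constructor so the search can fit on the raw DataFrame instead of a Pool.

from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import GridSearchCV

# Option 1: CatBoost's built-in grid search, fed with the Pool
cb = CatBoostRegressor(eval_metric='RMSE', loss_function='RMSE',
                       random_state=1, verbose=False)
param_grid = {'learning_rate': [0.03, 0.05, 0.1],  # placeholder values
              'max_depth': [4, 8, 16]}
result = cb.grid_search(param_grid, X=pool_train, cv=3)
print(result['params'])  # best parameter combination found

# Option 2: declare cat_features up front so GridSearchCV can fit on X_train
cb_sk = CatBoostRegressor(cat_features=['A', 'B', 'C', 'D'],
                          loss_function='RMSE', verbose=False)
search = GridSearchCV(cb_sk, param_grid, cv=3,
                      scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)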

Related

Sklearn Voting ensemble with models using different features and testing with k-fold cross validation

I have a data frame with 4 different groups of features.
I need to create 4 different models with these four different feature groups and combine them with the ensemble voting classifier.
Furthermore, I need to test the classifier using k-fold cross validation.
However, I am finding it difficult to combine different feature sets, the voting classifier, and k-fold cross-validation with the functionality available in sklearn. The following is the code I have so far.
from sklearn import preprocessing, svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

y = df1.index
x = preprocessing.scale(df1)

SVM = svm.SVC(kernel='rbf', C=1)
rf = RandomForestClassifier(n_estimators=200)
ann = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(25, 2), random_state=1)
neigh = KNeighborsClassifier(n_neighbors=10)

models = list()
models.append(('facial', SVM))
models.append(('posture', rf))
models.append(('computer', ann))
models.append(('physio', neigh))

ens = VotingClassifier(estimators=models)
cv = KFold(n_splits=10, random_state=None, shuffle=True)
scores = cross_val_score(ens, x, y, cv=cv, scoring='accuracy')
As you can see, this program uses the same features for all 4 models. How can I improve it to achieve my objective?
I did manage to achieve this using Pipelines:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Keep x as a DataFrame so ColumnTransformer can select columns by name;
# each pipeline does its own scaling.
y = df1.index
x = df1

phy_features = ['A', 'B', 'C']
phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                  ('scaler', StandardScaler())])
phy_processor = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])

fa_features = ['D', 'E', 'F']
fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                 ('scaler', StandardScaler())])
fa_processor = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])

pipe_phy = Pipeline(steps=[('preprocessor', phy_processor), ('classifier', SVM)])
pipe_fa = Pipeline(steps=[('preprocessor', fa_processor), ('classifier', SVM)])

# VotingClassifier expects (name, estimator) tuples
ens = VotingClassifier(estimators=[('phy', pipe_phy), ('fa', pipe_fa)])

cv = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in cv.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ens.fit(x_train, y_train)
    print(ens.score(x_test, y_test))
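Since each pipeline now selects its own columns internally, the manual KFold loop can also be collapsed into a single call, mirroring the original attempt (a sketch using the objects defined above):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(ens, x, y, cv=cv, scoring='accuracy')
print(scores.mean())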
Please refer to sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable if you are receiving a TypeError when using ColumnTransformer.

Vectorize list of strings with Word2Vec to feed to Keras sequential layer

I am trying to build a custom word embedding model with fastText that represents my data (a list of sentences) as vectors, so I can "feed" it to a Keras CNN for abusive language detection.
My tokenised data is stored in a list like this:
data = [['is',
'this',
'a',
'news',
'if',
'you',
'have',
'no',
'news',
'than',
'shutdown',
'the',
'channel'],
['if',
'interest',
'rate',
'will',
'hike',
'by',
'fed',
'then',
'what',
'is',
'the',
'effect',
'on',
'nifty']]
I am currently applying the fastText model like this:
from gensim.models import FastText

model = FastText(data, size=100, window=5, min_count=5, workers=16, sg=0, negative=5)
And then I perform:
import numpy as np

model = FastText(sentences, min_count=1)
documents = []
for document in textList:
    word_vectors = []
    for word in document:
        word_vectors.append(model.wv[word])
    documents.append(np.concatenate(word_vectors))
document_matrix = np.concatenate(documents)
Obviously, the document_matrix doesn't fit as the input for my Keras model:
from keras.models import Sequential
from keras import layers

model = Sequential()
model.add(layers.Conv1D(filters=250, kernel_size=4, padding='same', input_shape=(1,)))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(250, activation='relu'))
model.add(layers.Dense(3, activation='sigmoid'))
I am stuck and running out of ideas on how to make the output of the embedding fit the input of the Keras model.
Thank you very much in advance, you guys are the best !
Lisa
You can take each word's representation from the word2vec model with model[YOURKEYWORD]. Some word embeddings may not exist in your word2vec model, so you can use a try/except in your code.
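A minimal sketch of that idea, assuming the gensim model trained above and a hypothetical max_len cutoff: skip out-of-vocabulary words with try/except, then pad or truncate each document to a fixed length so the result is a 3-D array that a Conv1D layer can accept.

import numpy as np

max_len = 50                     # assumed fixed document length
dim = model.vector_size

def document_to_matrix(document):
    vectors = []
    for word in document:
        try:
            vectors.append(model.wv[word])
        except KeyError:         # word missing from the embedding vocabulary
            continue
    vectors = vectors[:max_len]
    # pad with zero vectors up to max_len
    vectors += [np.zeros(dim)] * (max_len - len(vectors))
    return np.stack(vectors)

document_matrix = np.stack([document_to_matrix(d) for d in data])
# shape (num_documents, max_len, dim) matches e.g. input_shape=(max_len, dim)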

GridSearchCV with Invalid parameter gamma for estimator LogisticRegression

I need to perform a grid search over the parameters listed below for a Logistic Regression classifier, using recall for scoring and three-fold cross-validation.
The data is in a csv file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
I have grid_values = {'gamma': [0.01, 0.1, 1, 10, 100]} and I need to apply the L1 and L2 penalties in a Logistic Regression.
I couldn't verify whether the scores will run because I get the following error:
Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
This is my code:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    grid_values = {'gamma': [0.01, 0.1, 1, 10, 100]}

    # train the model with many parameters for "C" and penalty='l1'
    lr_l1 = LogisticRegression(penalty='l1')
    grid_lr_l1 = GridSearchCV(lr_l1, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_l1.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l1.decision_function(X_test)

    lr_l2 = LogisticRegression(penalty='l2')
    grid_lr_l2 = GridSearchCV(lr_l2, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_l2.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_l2.decision_function(X_test)

    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    results = pd.DataFrame()
    results['l1_results'] = pd.DataFrame(grid_lr_l1.cv_results_)
    results['l1_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    results['l2_results'] = pd.DataFrame(grid_lr_l2.cv_results_)
    results['l2_results'] = results['l2_results'].sort_values(by='mean_test_precision_score', ascending=False)
    return results

LogisticR_penalty()
From .cv_results_, I expected the average test score for each parameter combination to be available as mean_test_precision_score, but I'm not sure.
The output is: ValueError: Invalid parameter gamma for estimator LogisticRegression. Check the list of available parameters with estimator.get_params().keys().
The error message contains the answer to your question. You can use estimator.get_params().keys() to see all available parameters for your estimator:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
print(lr.get_params().keys())
Output:
dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])
According to scikit-learn's documentation, LogisticRegression has no gamma parameter, but it does have a parameter C, the inverse of the regularization strength.
If you change grid_values = {'gamma': [0.01, 0.1, 1, 10, 100]} to grid_values = {'C': [0.01, 0.1, 1, 10, 100]}, your code should work.
My code contained some errors; the main one was using param_grid incorrectly. I had to apply the L1 and L2 penalties with the values 0.01, 0.1, 1, 10, and 100 for C. The right way to do this is:
grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
Then I had to correct the way I was training my logistic regression, and the way I retrieved and averaged the scores in cv_results_.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def LogisticR_penalty():
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    grid_values = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}

    # liblinear supports both the l1 and l2 penalties
    lr = LogisticRegression(solver='liblinear')

    # We use GridSearchCV to find the value in the range that optimizes a given metric.
    grid_lr_recall = GridSearchCV(lr, param_grid=grid_values, cv=3, scoring='recall')
    grid_lr_recall.fit(X_train, y_train)
    y_decision_fn_scores_recall = grid_lr_recall.decision_function(X_test)

    # The precision, recall, and accuracy scores for every combination
    # of the parameters in param_grid are stored in cv_results_
    CVresults = pd.DataFrame(grid_lr_recall.cv_results_)

    # test scores and their mean across the three folds
    split_test_scores = np.vstack((CVresults['split0_test_score'],
                                   CVresults['split1_test_score'],
                                   CVresults['split2_test_score']))
    mean_scores = split_test_scores.mean(axis=0).reshape(5, 2)
    return mean_scores

LogisticR_penalty()
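Note that cv_results_ already aggregates the per-split columns into mean_test_score, so the manual vstack/mean above could be replaced with a one-liner (a sketch, giving the same 5x2 shape as before):

mean_scores = CVresults['mean_test_score'].values.reshape(5, 2)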

SKLearn Error with Pipeline and Gridsearch

I would like to first split my data into a train and a test set. Then I want to run GridSearchCV on my training set (internally split into train/validation sets). In the end I want to collect all the test data and do some other things (not in the scope of this question).
I have to scale my data, so I want to handle this in a pipeline. Some things in my SVC should be fixed (kernel='rbf', class_weight=...).
When I run the code, the following error occurs:
"ValueError: Invalid parameter estimator for estimator Pipeline"
I don't understand what I'm doing wrong. I tried to follow this thread: StandardScaler with Pipelines and GridSearchCV
The only difference is that I fix some parameters in my SVC. How can I handle this?
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

target = np.array(target).ravel()
loo = LeaveOneOut()
loo.get_n_splits(input)

# Outer loop
for train_index, test_index in loo.split(input):
    X_train, X_test = input[train_index], input[test_index]
    y_train, y_test = target[train_index], target[test_index]

    p_grid = {'estimator__C': np.logspace(-5, 2, 20),
              'estimator__gamma': np.logspace(-5, 3, 20)}

    SVC_Kernel = SVC(kernel='rbf', class_weight='balanced', tol=10e-4,
                     max_iter=200000, probability=False)
    pipe_SVC = Pipeline([('scaler', RobustScaler()), ('SVC', SVC_Kernel)])

    n_splits = 5
    inner_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=5)

    clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                             cv=inner_cv, scoring='f1_micro', n_jobs=-1)
    clfSearch.fit(X_train, y_train)

    print("Best parameters set found on validation set for Support Vector Machine:")
    print()
    print(clfSearch.best_params_)
    print()
    print(clfSearch.best_score_)
    print("Grid scores on validation set:")
    print()
I also tried it this way:
p_grid = {'estimator__C': np.logspace(-5, 2, 20),
          'estimator__gamma': np.logspace(-5, 3, 20),
          'estimator__tol': [10e-4],
          'estimator__kernel': ['rbf'],
          'estimator__class_weight': ['balanced'],
          'estimator__max_iter': [200000],
          'estimator__probability': [False]}
SVC_Kernel = SVC()
This also doesn't work.
The problem is in your p_grid. You are grid searching over your Pipeline, and that doesn't have anything called estimator. It does have a step called SVC, so if you want to set that SVC's parameters, you should prefix your keys with SVC__ instead of estimator__. So replace p_grid with:
p_grid = {'SVC__C': np.logspace(-5, 2, 20),
          'SVC__gamma': np.logspace(-5, 3, 20)}
Also, you can replace your outer for loop with the cross_validate function, as sketched below.
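A minimal sketch of that replacement, assuming pipe_SVC, p_grid (with the SVC__ prefixes), inner_cv, input, and target are defined as above:

from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_validate

# The grid search becomes the estimator; cross_validate drives the outer loop
clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                         cv=inner_cv, scoring='f1_micro', n_jobs=-1)
outer_scores = cross_validate(clfSearch, input, target,
                              cv=LeaveOneOut(), scoring='f1_micro')
print(outer_scores['test_score'].mean())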

Logistic Regression fitting issue

I'm having an issue with a logistic regression analysis. The data is read from a csv file of size (2039, 7).
The first 6 columns contain the inputs (i.e. the 6 features) and the 7th column contains the output I want to predict. The program runs without error, but the problem is that I get too many coefficients and intercepts: for the coefficients I get an array of shape (1239, 6), and for the intercepts I get a list of 1239 numbers. I would expect just 6 coefficients (one per feature) and one intercept.
Also, the accuracy of the regression model is excessively low. Any ideas on what I am doing wrong would be greatly appreciated. Code is below.
import pandas
import numpy
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
filename = '1.csv'
names = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
dataframe = pandas.read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:6]
Y = numpy.asarray(array[:,6], dtype="|S6")
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print(result*100.0)
print(model.coef_.shape)
print(model.intercept_.shape)
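A likely explanation (my assumption; not confirmed in the question): LogisticRegression stores coef_ with shape (n_classes, n_features), so an array of shape (1239, 6) means Y contains 1239 distinct labels. Casting a continuous 7th column to byte strings with dtype="|S6" turns nearly every value into its own class, which would also explain the very low accuracy. A quick check:

# How many distinct labels is the model actually seeing?
print(numpy.unique(Y).shape)  # (1239,) would confirm the diagnosis
# If column G is continuous, a regressor (e.g. LinearRegression) is the better fit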
