countVectorizer error Found input variables with inconsistent numbers of samples when using in sklearn pipeline - scikit-learn

I have a dataframe:
df =
   A  B  Text
   1  2  'hello and good morning'
   3  4  'I am watching TV'
I want to apply sentence classification pipeline:
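from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

text_feats = ['Text']
num_feats = ['A','B']
text_transformer = Pipeline(steps=[
    ('tfidf_vectorizer', TfidfVectorizer())])
numeric_transformer = Pipeline(steps=[
    ('scale', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_feats),
        ('num', numeric_transformer, num_feats)])
clf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ("model", RandomForestClassifier())])
clf_pipe.fit(df, df[Y_COL])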
But I get the following error with both TfidfVectorizer and CountVectorizer:
Found input variables with inconsistent numbers of samples: [1, 2]
Any idea what the problem is?
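One likely cause is the column specification for the text step. ColumnTransformer passes a 2-D DataFrame to the transformer when the columns are given as a list (['Text']), and a 1-D Series when they are given as a single string ('Text'). TfidfVectorizer and CountVectorizer expect a 1-D iterable of documents, so with a one-column DataFrame they iterate over the column names and see a single "document", which would explain the 1 in [1, 2]. The sketch below assumes that diagnosis; the 'label' column is a hypothetical stand-in for Y_COL, which is not shown in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'A': [1, 3],
    'B': [2, 4],
    'Text': ['hello and good morning', 'I am watching TV'],
    'label': [0, 1],  # hypothetical target column standing in for Y_COL
})

preprocessor = ColumnTransformer(transformers=[
    # A plain string selects the column as a 1-D Series, which is what
    # the text vectorizers expect.
    ('text', TfidfVectorizer(), 'Text'),
    # A list selects a 2-D frame, which is what StandardScaler expects.
    ('num', StandardScaler(), ['A', 'B']),
])

clf_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=10, random_state=0)),
])
clf_pipe.fit(df, df['label'])

# The preprocessor now yields one row per sample.
features = clf_pipe.named_steps['preprocessor'].transform(df)
print(features.shape)
```

If the text step needs to stay a Pipeline, the same idea applies: keep text_transformer as it is and change only text_feats from ['Text'] to 'Text'.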

Related

Sklearn Voting ensemble with models using different features and testing with k-fold cross validation

I have a data frame with 4 different groups of features.
I need to create 4 different models with these four different feature groups and combine them with the ensemble voting classifier.
Furthermore, I need to test the classifier using k-fold cross validation.
However, I am finding it difficult to combine different feature sets, voting classifier and k-fold cross validation with functionality available in sklearn. Following is the code that I have so far.
y = df1.index
x = preprocessing.scale(df1)
SVM = svm.SVC(kernel='rbf', C=1)
rf=RandomForestClassifier(n_estimators=200)
ann = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(25, 2), random_state=1)
neigh = KNeighborsClassifier(n_neighbors=10)
models = list()
models.append(('facial', SVM))
models.append(('posture', rf))
models.append(('computer', ann))
models.append(('physio', neigh))
ens = VotingClassifier(estimators=models)
cv = KFold(n_splits=10, random_state=None, shuffle=True)
scores = cross_val_score(ens, x, y, cv=cv, scoring='accuracy')
As you can see, this program uses same features for all 4 models. How can I improve this program to achieve my objective?
I did manage to achieve this using Pipelines:
y = df1.index
# Keep the DataFrame so ColumnTransformer can select columns by name;
# scaling now happens inside each pipeline.
x = df1
phy_features = ['A', 'B', 'C']
phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])
fa_features = ['D', 'E', 'F']
fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
fa_processer = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])
pipe_phy = Pipeline(steps=[('preprocessor', phy_processer), ('classifier', SVM)])
pipe_fa = Pipeline(steps=[('preprocessor', fa_processer), ('classifier', SVM)])
# VotingClassifier expects a list of (name, estimator) tuples.
ens = VotingClassifier(estimators=[('phy', pipe_phy), ('fa', pipe_fa)])
cv = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in cv.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ens.fit(x_train, y_train)
    print(ens.score(x_test, y_test))
If you receive a TypeError when using ColumnTransformer, please refer to "sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable".
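For reference, here is a minimal self-contained sketch of the same idea: one pipeline per feature group, combined in a VotingClassifier and scored with cross_val_score instead of a manual loop. The synthetic data, the column names, the group_pipeline helper, and the choice of classifiers are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: two feature groups, ['A', 'B'] and ['D', 'E'].
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 4)), columns=['A', 'B', 'D', 'E'])
y = (X['A'] + X['D'] > 0).astype(int)


def group_pipeline(columns, classifier):
    """Hypothetical helper: scale only `columns`, then fit `classifier` on them."""
    selector = ColumnTransformer([('select', StandardScaler(), columns)])
    return Pipeline([('preprocessor', selector), ('classifier', classifier)])


# Each named pipeline only sees its own feature group; other columns are dropped.
ens = VotingClassifier(estimators=[
    ('phy', group_pipeline(['A', 'B'], SVC(kernel='rbf', C=1))),
    ('fa', group_pipeline(['D', 'E'],
                          RandomForestClassifier(n_estimators=50, random_state=0))),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(ens, X, y, cv=cv, scoring='accuracy')
print(scores.mean())
```

Because the column selection lives inside each pipeline, the whole DataFrame can be passed to cross_val_score and every fold applies the same per-group preprocessing.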

Setting the parameters of an imputer within a three levels pipeline

I'm new to data science, and I'm using a Pipeline to organize my code.
The snippet of the code I'm trying to organize follows:
### Preprocessing ###
# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
### Model ###
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000, learning_rate=0.05)
### Processing ###
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])
parameters = {}
# => How to set the parameters for one of the parts of the numerical_transformer pipeline?
# GridSearch
CV = GridSearchCV(my_pipeline, parameters, scoring = 'neg_mean_absolute_error', n_jobs= 1)
CV.fit(X_train, y_train)
How can I change the parameters for the Imputer found in the numerical_transformer pipeline?
Thank you,
After #desernaut pointed me in the right direction, this is the answer:
parameters['preprocessor__num__imputer__strategy'] = ['most_frequent','mean', 'median',]
Thanks #desernaut!
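The key is built by joining the step names from the outside in with double underscores: pipeline step 'preprocessor', then ColumnTransformer entry 'num', then inner pipeline step 'imputer', then the parameter 'strategy'. A small sketch of how to discover such keys with get_params() follows; Ridge is only a stand-in for XGBRegressor so the example has no extra dependency, and the column names 'a' and 'b' are made up.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, ['a', 'b']),  # hypothetical column names
])
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Ridge()),  # stand-in for XGBRegressor
])

# Every tunable parameter is listed under its fully qualified key.
strategy_keys = [k for k in my_pipeline.get_params() if k.endswith('imputer__strategy')]
print(strategy_keys)

# The same key works for set_params and as a GridSearchCV param_grid entry.
my_pipeline.set_params(preprocessor__num__imputer__strategy='median')
parameters = {'preprocessor__num__imputer__strategy': ['most_frequent', 'mean', 'median']}
```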

SKLearn Error with Pipeline and Gridsearch

I would like to first split my data in a test and train set. Then I want to use GridSearchCV on my training set (internally split into train/validation set). In the end I want to collect all the testdata and do some other things (not in the scope of the question).
I have to scale my data, so I want to handle this in a pipeline. Some parameters of my SVC should be fixed (kernel='rbf', class_weight=...).
When I run the code the following occurs:
"ValueError: Invalid parameter estimator for estimator Pipeline"
I don't understand what I'm doing wrong. I tried to follow this thread: StandardScaler with Pipelines and GridSearchCV
The only difference is that I fix some parameters in my SVC. How can I handle this?
target = np.array(target).ravel()
loo = LeaveOneOut()
loo.get_n_splits(input)
# Outer Loop
for train_index, test_index in loo.split(input):
    X_train, X_test = input[train_index], input[test_index]
    y_train, y_test = target[train_index], target[test_index]
    p_grid = {'estimator__C': np.logspace(-5, 2, 20),
              'estimator__gamma': np.logspace(-5, 3, 20)}
    SVC_Kernel = SVC(kernel='rbf', class_weight='balanced', tol=10e-4, max_iter=200000, probability=False)
    pipe_SVC = Pipeline([('scaler', RobustScaler()), ('SVC', SVC_Kernel)])
    n_splits = 5
    scoring = "f1_micro"
    inner_cv = StratifiedKFold(n_splits=n_splits,
                               shuffle=True, random_state=5)
    clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                             cv=inner_cv, scoring='f1_micro', iid=False, n_jobs=-1)
    clfSearch.fit(X_train, y_train)
    print("Best parameters set found on validation set for Support Vector Machine:")
    print()
    print(clfSearch.best_params_)
    print()
    print(clfSearch.best_score_)
    print("Grid scores on validation set:")
    print()
I also tried it this way:
p_grid = {'estimator__C': np.logspace(-5, 2, 20),
'estimator__gamma': np.logspace(-5, 3, 20),
'estimator__tol': [10e-4],
'estimator__kernel': ['rbf'],
'estimator__class_weight': ['balanced'],
'estimator__max_iter':[200000],
'estimator__probability': [False]}
SVC_Kernel = SVC()
This also doesn't work.
The problem is in your p_grid. You are grid searching on your Pipeline, and it has no step called estimator. It does have a step called SVC, so if you want to set that SVC's parameters, you should prefix your keys with SVC__ instead of estimator__. So replace p_grid with:
p_grid = {'SVC__C': np.logspace(-5, 2, 20),
          'SVC__gamma': np.logspace(-5, 3, 20)}
Also, you can replace your outer for loop with the cross_validate function.
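A minimal sketch of that nested setup, passing the GridSearchCV object itself to cross_validate, might look like the following. The synthetic data, the smaller grids, and the use of StratifiedKFold for the outer loop (instead of the question's LeaveOneOut) are assumptions made to keep the example quick to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for the question's input/target arrays.
X, y = make_classification(n_samples=80, n_features=5, random_state=0)

# Fixed SVC settings go in the constructor; only C and gamma are searched.
pipe_SVC = Pipeline([
    ('scaler', RobustScaler()),
    ('SVC', SVC(kernel='rbf', class_weight='balanced')),
])
p_grid = {
    'SVC__C': np.logspace(-2, 2, 3),
    'SVC__gamma': np.logspace(-3, 1, 3),
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=5)
outer_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=7)

clfSearch = GridSearchCV(estimator=pipe_SVC, param_grid=p_grid,
                         cv=inner_cv, scoring='f1_micro')

# The outer loop: each outer fold runs its own inner grid search.
results = cross_validate(clfSearch, X, y, cv=outer_cv,
                         scoring='f1_micro', return_estimator=True)

print(results['test_score'])
for fitted_search in results['estimator']:
    print(fitted_search.best_params_)
```

With return_estimator=True, the fitted search for each outer fold is kept, so the best parameters per fold can still be inspected afterwards.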

How to include multiple predictors (int and string types) in sk-learn machine learning pipeline?

Here is the data table:
I am following this ML tutorial and customized the code for my needs as follows. The goal is to use one or more predictors to predict the label, which is multiclass. I have also created dummy variables based on the 'label' column, as in the tutorial.
df = pd.read_csv(directory_data+final_data_file, encoding='utf-8', low_memory=False)
# Text cleaning
def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = text.strip(' ')
    return text
df['study_title'] = df['study_title'].map(lambda com: clean_text(com))
df['study_desc'] = df['study_desc'].map(lambda com: clean_text(com))
df['condition'] = df['condition'].map(lambda com: clean_text(com))
df['min_age'] = df['min_age'].astype(int)
# Split data into train and test sets
# Need to do a sophisticated randomization since currently same study can occupy multiple columns
# Need to first randomize study ids into train or test sets
# Then remap the studies into the sets based on the matching study ids
unique_study_id_list = cf.unique(df, 'study_id')
rand_seed = 888
random.seed(rand_seed)
random.shuffle(unique_study_id_list)
percent_test = 0.50
test_study_id = unique_study_id_list[0:int(len(unique_study_id_list)*percent_test)]
train_study_id = unique_study_id_list[(int(len(unique_study_id_list)*percent_test)):]
test = df[df['study_id'].isin(test_study_id)]
train = df[df['study_id'].isin(train_study_id)]
# Specify what to train with
X_train = train['study_desc']
X_test = test['study_desc']
# ML pipeline: define a pipeline combining a text feature extractor with a multi-label classifier
NB_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),])
LogReg_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),])
SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),])
#######################
#######################
#######################
# Testing outputs
categories = cf.unique(df, 'label')
output_switch = 'test2' # 'real' or 'test' or 'off'
if output_switch == 'test':
    print(df['label'].value_counts())
elif output_switch == 'test2':
    for category in categories:
        print('... Processing {}'.format(category))
        # train the model using X_dtm & y
        NB_pipeline.fit(X_train, train[category])
        # compute the testing accuracy
        prediction = NB_pipeline.predict(X_test)
        print('Test auc-score is {}'.format(roc_auc_score(test[category], prediction)))
else:
    pass
However, I have not figured how to modify the following to include multiple predictors. Currently I am using 'study_desc' only, but how can I also include 'study_title' and 'min_age' as my predictors?
I have tried the following but got errors:
X_train = train['study_desc', 'study_title', 'min_age']
X_test = test['study_desc', 'study_title', 'min_age']
KeyError: ('study_desc', 'study_title', 'min_age')
And
X_train = train[['study_desc', 'study_title', 'min_age']]
X_test = test[['study_desc', 'study_title', 'min_age']]
ValueError: Found input variables with inconsistent numbers of samples: [3, 3649]
////////////////////
Edits: trying out the suggested reference:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

NB_pipeline_multi = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('min_age', Pipeline([
                ('selector', ItemSelector(key='min_age')),
            ])),
            ('study_title', Pipeline([
                ('selector', ItemSelector(key='study_title')),
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
            ])),
            ('study_desc', Pipeline([
                ('selector', ItemSelector(key='study_desc')),
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
            ])),
        ],
    )),
    ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None))),
])
KeyError: 'min_age'
What you want is to combine your tf-idf features with standard numerical/categorical features. You can achieve this with a FeatureUnion transformer. Two very nice resources are the API documentation and an example. The example is very relevant to your application because it also implements an ItemSelector transformer that selects specific columns from a pd.DataFrame.
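As a rough sketch of that approach: the pipeline must be fitted on the whole DataFrame (not on a single column such as train['study_desc'], which may be what triggers the KeyError), and the numeric column must reach FeatureUnion as a 2-D block. The as_frame flag on ItemSelector, the MinMaxScaler step, and the toy data below are assumptions added for illustration and are not part of the referenced example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one DataFrame column; as_frame=True keeps it 2-D for numeric steps."""

    def __init__(self, key, as_frame=False):
        self.key = key
        self.as_frame = as_frame

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]] if self.as_frame else X[self.key]


# Toy rows standing in for the question's table (values are made up).
train = pd.DataFrame({
    'study_title': ['asthma trial', 'diabetes study',
                    'asthma inhaler study', 'diabetes diet trial'],
    'study_desc': ['inhaled steroid for asthma', 'insulin dosing in adults',
                   'new inhaler for children', 'diet plan for diabetes'],
    'min_age': [6, 18, 4, 21],
})
y = [1, 0, 1, 0]  # one binary dummy column of the label

union = FeatureUnion(transformer_list=[
    ('min_age', Pipeline([
        ('selector', ItemSelector(key='min_age', as_frame=True)),
        ('scale', MinMaxScaler()),  # keeps values non-negative for MultinomialNB
    ])),
    ('study_title', Pipeline([
        ('selector', ItemSelector(key='study_title')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('study_desc', Pipeline([
        ('selector', ItemSelector(key='study_desc')),
        ('tfidf', TfidfVectorizer()),
    ])),
])
NB_pipeline_multi = Pipeline([('union', union), ('clf', MultinomialNB())])

# Fit on the whole DataFrame so every selector can find its column.
NB_pipeline_multi.fit(train, y)
features = union.transform(train)
print(features.shape)
```

In the question's loop this would correspond to X_train = train[['study_desc', 'study_title', 'min_age']] combined with NB_pipeline_multi instead of NB_pipeline.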

sklearn GridSearchCV: how to get classification report?

I am using GridSearchCV like this:
corpus = load_files('corpus')
with open('stopwords.txt', 'r') as f:
stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]
x = corpus.data
y = corpus.target
pipeline = Pipeline([
    ('vec', CountVectorizer(stop_words=stop_words)),
    ('classifier', MultinomialNB())])
parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3],
              'classifier__fit_prior': [True, False]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf = gs_clf.fit(x, y)
joblib.dump(gs_clf.best_estimator_, 'MultinomialNB.pkl', compress=1)
Then, in another file, to classify new documents (not from the corpus), I do this:
classifier = joblib.load(filepath) # path to .pkl file
result = classifier.predict(tokenlist)
My question is: Where do I get the values needed for the classification_report?
In many other examples, I see people split the corpus into a training set and a test set.
However, since I am using GridSearchCV with kfold-cross-validation, I don't need to do that.
So how can I get those values from GridSearchCV?
If you have GridSearchCV object:
from sklearn.metrics import classification_report
clf = GridSearchCV(....)
clf.fit(x_train, y_train)
print(classification_report(y_test, clf.best_estimator_.predict(x_test)))
If you have saved the best estimator and loaded it then:
classifier = joblib.load(filepath)
print(classification_report(y_test, classifier.predict(x_test)))
The best model is in clf.best_estimator_. With the default refit=True, GridSearchCV has already refit it on all the data passed to fit, so you only need to predict your test data and pass y_test and the predictions to classification_report.
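Put together, a self-contained sketch of this workflow could look like the following. The tiny made-up corpus, the train_test_split settings, and the reduced parameter grid are assumptions for illustration; a real corpus would be loaded as in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up two-class corpus standing in for load_files('corpus').
pos = ['great film', 'wonderful acting',
       'great plot and wonderful cast', 'loved this great movie']
neg = ['terrible film', 'boring acting',
       'terrible plot and boring cast', 'hated this terrible movie']
docs = (pos + neg) * 5
labels = ([1] * 4 + [0] * 4) * 5

# Hold out a test set that the grid search never sees.
x_train, x_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

pipeline = Pipeline([
    ('vec', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3]}

gs_clf = GridSearchCV(pipeline, parameters, cv=3, scoring='f1')
gs_clf.fit(x_train, y_train)

# best_estimator_ is already refit on x_train, so it can predict directly.
y_pred = gs_clf.best_estimator_.predict(x_test)
report = classification_report(y_test, y_pred)
print(report)
```

The cross-validation inside GridSearchCV is used only to choose parameters, which is why the report is computed on data that was held out before the search.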
