RF model loses accuracy when I remove it from Pipeline - scikit-learn

Hoping I'm overlooking something stupid here or maybe I don't understand how this is working...
I have an NLP pipeline that does basically the following:
rf_pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=spacy_tokenizer)),
    ('fit', RandomForestClassifier())
])
I run it:
clf = rf_pipeline.fit(X_train, y_train)
preds = clf.predict(X_test)
When I optimize I get accuracy in the high 90's with the following:
confusion_matrix(y_test, preds)
accuracy_score(y_test, preds)
precision_score(y_test, preds)
The TfidfVectorizer is the bottleneck in my computations, so I wanted to break up the pipeline: run the vectorizer once, and then do a grid search on the classifier alone rather than on the whole pipeline. Here's how I broke it out:
# initialize
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)
rf_class = RandomForestClassifier()
# transform and fit
vect = tfidf.fit_transform(X_train)
clf = rf_class.fit(vect, y_train)
# predict
clf.predict(tfidf.fit_transform(X_test))
When I took a look at the accuracy before running a full grid search, it had plummeted to just over 50%. When I tried increasing the number of trees, the score dropped almost 10%.
Any ideas?

For the test set, you can't call fit_transform(), only transform(); otherwise the elements of the tf-idf vectors have a different meaning, because the vocabulary and IDF weights get re-learned from the test data.
Try this
# predict
clf.predict(tfidf.transform(X_test))
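If the goal is to avoid re-running the vectorizer during the grid search, one option (a sketch, reusing spacy_tokenizer from above; the parameter grid is just an illustration) is to vectorize once and then search only over the classifier's parameters:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# vectorize once: fit on train, transform only on test
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)
X_train_vect = tfidf.fit_transform(X_train)
X_test_vect = tfidf.transform(X_test)
# grid search over the classifier only, on the pre-vectorized data
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X_train_vect, y_train)
preds = search.predict(X_test_vect)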

Related

Viewing model coefficients for a single prediction

I have a logistic regression model housed in a scikit-learn pipeline using the following:
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        solver='lbfgs',
        cv=10,
        scoring='roc_auc',
        class_weight='balanced'
    )
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
I can view the model's coefficients for predictions as a whole with this code ...
# Look at model's coefficients to see what features are most important
plt.rcParams['figure.dpi'] = 50
model = pipeline.named_steps['logisticregressioncv']
coefficients = pd.Series(model.coef_[0], X_train.columns)
plt.figure(figsize=(10,12))
coefficients.sort_values().plot.barh(color='grey');
Which returns a bar plot of the features and their coefficients.
What I'm trying to do is be able to see how different input values for a single observation impact its prediction. The idea is to be able to run predictions on a sample population and examine the group with "low" predictions ... for example if I run predictions for 10 observations, I'd like to see how different input values impacted each of those 10 predictions, individually.
I recall that I can achieve this via SHAP values using something along the lines of the following (but using LinearExplainer instead of TreeExplainer, since this is a linear model):
# Instantiate model and encoder outside of pipeline for
# use with shap
model = RandomForestClassifier(random_state=25)
# Fit on train, score on val
model.fit(X_train_encoded, y_train2)
y_pred_shap = model.predict(X_val_encoded)
# Get an individual observation to explain.
row = X_test_encoded.iloc[[-3]]
# Why did the model predict this?
# Look at a Shapley Values Force Plot
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value[1],
    shap_values=shap_values[1],
    features=row
)
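A sketch of how this might adapt to the logistic regression pipeline above (assuming the pipeline and splits from earlier; the scaler has to be applied before handing data to the explainer, and the step names follow make_pipeline's lowercase convention):
import shap
# pull the fitted steps out of the pipeline
scaler = pipeline.named_steps['standardscaler']
logreg = pipeline.named_steps['logisticregressioncv']
# SHAP expects the same representation the model was trained on,
# so scale both the background data and the row to explain
X_train_scaled = scaler.transform(X_train)
row = X_test.iloc[[-3]]
row_scaled = scaler.transform(row)
explainer = shap.LinearExplainer(logreg, X_train_scaled)
shap_values = explainer.shap_values(row_scaled)
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=shap_values,
    features=row   # show the original (unscaled) values in the plot
)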

Hyperparameter optimization in Gaussian Process in scikit-learn

Are the hyperparameters of the Gaussian Process Regressor optimized during fitting in scikit-learn?
In the page
https://scikit-learn.org/stable/modules/gaussian_process.html
it is said:
"The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer"
So it is not required, for instance, to optimize them by using grid search?
A hyperparameter is something that you need to specify yourself; usually the best way to do it is within a pipeline (a series of steps) in which you try many hyperparameter values and keep the best one. Here is an example of trying different hyperparameters for a k-NN classifier, where you give a list of values for n_neighbors in order to see which one works best. Hope it helps you!
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.show()
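To connect this back to the Gaussian Process question: GaussianProcessRegressor already optimizes the kernel hyperparameters internally when you call fit, by maximizing the log-marginal-likelihood as the docs describe, so a grid search over them is usually not needed. A minimal sketch (toy 1-D data, assuming an RBF kernel):
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
X = np.random.rand(50, 1)
y = np.sin(3 * X).ravel()
# length_scale=1.0 is only the starting point; fit() maximizes the
# log-marginal-likelihood, restarting the optimizer a few times
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), n_restarts_optimizer=5)
gpr.fit(X, y)
print(gpr.kernel_)   # the fitted kernel with the optimized length_scale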

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I ran into a problem regarding scaling of future unseen (much higher) stock close values.
Let's say I use random forest regression to predict the stock price. I split the data into a train set and a test set.
For the train set, I use StandardScaler and do fit_transform.
Then I fit the regressor.
For the test set, I use the same StandardScaler and do transform only.
Then I use the regressor to predict and compare to the test labels.
If I plot the predictions and test labels on a graph, the predictions seem to max out or hit a ceiling. The problem is that the StandardScaler was fit on the train set; the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these extreme data points.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

def test(X, y):
    # split the data (no shuffling, to keep the time order)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing: fit_transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train labels
    model.fit(X_train, y_train)
    # transform the test data with the scaler fitted on train
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted and actual close price against time:
The problem comes mainly from the model you are using. RandomForestRegressor is built on decision trees: it learns to map each input to the outputs it saw in the training set, and it predicts by averaging the training targets that fall into its leaves. Consequently, a random forest works for values inside the training range, but for extreme values it hasn't seen during training it cannot extrapolate and will of course behave as your picture is showing.
What you want is to learn a function that can extrapolate, for example with linear/polynomial regression or more advanced time-series methods like ARIMA.
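To see the extrapolation limitation concretely, here is a small sketch (toy data, not the stock data) comparing a random forest and a linear regression on a trend that keeps rising past the training range:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
# train on x in [0, 100), test on x in [100, 120) -- an upward trend
X_train = np.arange(0, 100).reshape(-1, 1)
y_train = X_train.ravel() * 1.5
X_test = np.arange(100, 120).reshape(-1, 1)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)
print(rf.predict(X_test)[:3])  # saturates near the maximum training target (~148.5)
print(lr.predict(X_test)[:3])  # keeps following the trend (150, 151.5, 153)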

Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. What I have observed is that, when fed a tf-idf matrix (from sklearn's TfidfVectorizer), the Logistic Regression model always outperforms the MultinomialNB model. Below is my code for training both:
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)
clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB uses the tf-idf matrix, since the term frequencies in tf-idf are calculated without any class information. Is it possible that I should not feed the tf-idf matrix to MultinomialNB the same way I did to LogisticRegression?
Update: I understand the difference between the results of TfidfVectorizer and CountVectorizer. I also just checked the source code of sklearn's MultinomialNB.fit() function, and it looks like it does expect counts as opposed to frequencies. This would also explain the performance boost mentioned in my comment below. However, I'm still wondering whether passing tf-idf into MultinomialNB makes sense under any circumstances. The sklearn documentation briefly mentions the possibility, but without much detail.
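One way to check this empirically (a sketch reusing the train/test split above; CountVectorizer stands in for the raw counts that MultinomialNB's derivation assumes) is:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
count_vec = CountVectorizer(stop_words='english')
X_train_counts = count_vec.fit_transform(X_train)
X_test_counts = count_vec.transform(X_test)
clf_mnb_counts = MultinomialNB()
clf_mnb_counts.fit(X_train_counts, y_train)
mnb_count_score = accuracy_score(y_test, clf_mnb_counts.predict(X_test_counts))
# compare mnb_count_score with mnb_score (tf-idf input) from above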
Any advice would be much appreciated!

Binary classification (logistic regression) predicts the wrong label with high accuracy

I have a problem where a binary logistic regression classifier (scikit-learn, Python 2.7) is predicting the wrong/opposite class with high confidence. That is, after fitting the model, the predicted scores and predicted probabilities for each class are very consistent, but always for the wrong class. I cannot share the data, but some pseudo-code of my approach is:
X = np.vstack((cond_1, cond_2))  # shape of X = 200*51102
y = np.concatenate([np.zeros(len(cond_1)), np.ones(len(cond_2))])
scls = []
clfs = []
scores = []
for train, test in cv.split(X, y):
    clf = LogisticRegression(C=1)
    scl = StandardScaler()
    scl.fit(X[train])
    X_train = scl.transform(X[train])
    scls.append(scl)
    X_test = scl.transform(X[test])
    clf.fit(X_train, y[train])
    y_pred = clf.predict(X_test)
    scores.append(roc_auc_score(y[test], y_pred))
The ROC AUC scores have a mean of 0.065 and a standard deviation of 0.05, so something systematic seems to be going on, but what? I have plotted the features and they seem to be normally distributed. I have also looked at the probabilities from predict_proba, and they are mostly above 80% for the wrong class/label.
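One thing I have been meaning to check (a sketch against the loop above) is whether the predict_proba column I read actually corresponds to the label I think it does, since the columns follow clf.classes_:
# inside the CV loop, after fitting clf:
print(clf.classes_)                 # e.g. array([0., 1.])
proba = clf.predict_proba(X_test)   # columns are ordered like clf.classes_
# probability of the positive class (label 1) is the column where classes_ == 1
pos_col = list(clf.classes_).index(1)
scores.append(roc_auc_score(y[test], proba[:, pos_col]))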
Any ideas what is going on and/or how to properly diagnose the problem?
I apologise for not being able to ask a more precise question, but I'm lacking the vocabulary.