Why does the RFECV visualizer from yellowbrick find an optimal number of features greater than the number of features in the train set? - scikit-learn

I am trying to perform feature selection using RFECV from yellowbrick.model_selection. I have 48 features in my train set, but when I run the code below, the visualizer reports the number of features as 49.
import xgboost as xgb
from yellowbrick.model_selection import RFECV

xgboost_base_model = xgb.XGBClassifier(tree_method='hist')
visualizer_xgb = RFECV(xgboost_base_model, step=3, cv=3, scoring='roc_auc')
visualizer_xgb.fit(X_train, y_train)
visualizer_xgb.show()
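For reference, here is a minimal sketch of the arithmetic that could explain the extra point, under the assumption (not confirmed here) that the visualizer builds its grid of subset sizes by stepping upward from 1 in increments of step:

import numpy as np

n_features, step = 48, 3

# Assumed subset-size grid: stepping from 1 in increments of 3 overshoots
# 48 and ends at 49, which would match the extra point on the plot.
subset_sizes = np.arange(1, n_features + step, step)
print(subset_sizes[-3:])  # [43 46 49]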

Related

Viewing model coefficients for a single prediction

I have a logistic regression model housed in a scikit-learn pipeline using the following:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        solver='lbfgs',
        cv=10,
        scoring='roc_auc',
        class_weight='balanced'
    )
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
I can view the model's coefficients for predictions as a whole with this code ...
# Look at the model's coefficients to see which features are most important
import matplotlib.pyplot as plt
import pandas as pd

plt.rcParams['figure.dpi'] = 50
model = pipeline.named_steps['logisticregressioncv']
coefficients = pd.Series(model.coef_[0], X_train.columns)
plt.figure(figsize=(10, 12))
coefficients.sort_values().plot.barh(color='grey');
This returns a bar plot of the features and their coefficients.
What I'm trying to do is be able to see how different input values for a single observation impact its prediction. The idea is to be able to run predictions on a sample population and examine the group with "low" predictions ... for example if I run predictions for 10 observations, I'd like to see how different input values impacted each of those 10 predictions, individually.
I recall that I can achieve this via SHAP values using something along the following lines (but using LinearExplainer instead of TreeExplainer):
# Instantiate model and encoder outside of pipeline for
# use with shap
model = RandomForestClassifier( random_state=25)
# Fit on train, score on val
model.fit(X_train_encoded, y_train2)
y_pred_shap = model.predict(X_val_encoded)
# Get an individual observation to explain.
row = X_test_encoded.iloc[[-3]]
# Why did the model predict this?
# Look at a Shapley Values Force Plot
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value[1],
    shap_values=shap_values[1],
    features=row
)
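A rough sketch of the analogous setup for the logistic-regression pipeline above, assuming the fitted pipeline, X_train and X_test from the first code block: LinearExplainer takes the fitted linear model plus background data, and both the background data and the observation must be scaled the same way the model saw them.

import shap

# Pull the fitted pieces back out of the pipeline.
scaler = pipeline.named_steps['standardscaler']
logreg = pipeline.named_steps['logisticregressioncv']

# Background data and the single observation, scaled as the model saw them.
X_train_scaled = scaler.transform(X_train)
row = X_test.iloc[[-3]]
row_scaled = scaler.transform(row)

explainer = shap.LinearExplainer(logreg, X_train_scaled)
shap_values = explainer.shap_values(row_scaled)

shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=shap_values[0],
    features=row
)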

How to add more features in multi-class text classification?

I have a retail dataset with product_description, price, supplier, category as columns.
I used product_description as feature:
from sklearn import model_selection, preprocessing, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])
# label encode the target variable (fit on train, transform only on validation)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
# predict the labels on the validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(predictions, valid_y)  # ~20%, very low
Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this into the code?
I have tried other classifiers like LR, SVM, and Random Forest, but they had (almost) the same outcome.
The TF-IDF vectorizer returns a matrix: one row per example with the scores. You can modify this matrix as you wish before feeding it into the classifier.
Prepare your additional features as an array of shape: number of examples × number of features.
Concatenate it with the TF-IDF matrix column-wise; since the TF-IDF output is a sparse matrix, scipy.sparse.hstack plays the role of np.concatenate with axis=1 here (a sketch follows below).
Fit the classifier as you did before.
It is usually a good idea to normalize real-valued features. Also, you can try different classifiers: Logistic Regression or SVM might do a better job on real-valued features than Naive Bayes.
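A minimal sketch of that concatenation, reusing train_x, valid_x, train_y and the TF-IDF matrices from the question, and pulling the extra price and supplier columns back out of df via the preserved indices:

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn import naive_bayes
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Align the extra columns with the text split via the preserved indices.
train_extra = df.loc[train_x.index, ['price', 'supplier']]
valid_extra = df.loc[valid_x.index, ['price', 'supplier']]

# Scale price to [0, 1] (MultinomialNB rejects negative values) and
# one-hot encode the categorical supplier column.
scaler = MinMaxScaler()
price_train = scaler.fit_transform(train_extra[['price']])
price_valid = scaler.transform(valid_extra[['price']])

ohe = OneHotEncoder(handle_unknown='ignore')
sup_train = ohe.fit_transform(train_extra[['supplier']])
sup_valid = ohe.transform(valid_extra[['supplier']])

# Stack text features and extra features column-wise (all kept sparse).
xtrain_full = hstack([xtrain_tfidf, csr_matrix(price_train), sup_train])
xvalid_full = hstack([xvalid_tfidf, csr_matrix(price_valid), sup_valid])

classifier = naive_bayes.MultinomialNB().fit(xtrain_full, train_y)
predictions = classifier.predict(xvalid_full)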

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I have run into a problem regarding scaling when future, unseen (much higher) stock close values appear.
Let's say I use random forest regression to predict the stock price. I split the data into a train set and a test set.
For the train set, I use StandardScaler and do fit and transform,
and then I fit the regressor.
For the test set, I use StandardScaler and do transform only,
and then I use the regressor to predict and compare to the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out or hit a ceiling. The problem is that the StandardScaler was fit on the train set; the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these extreme data points.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def test(X, y):
    # split the data chronologically (no shuffling for a time series)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing: fit and transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train labels
    model.fit(X_train, y_train)
    # transform test data with the scaler fit on train data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted and actual close price vs. time.
The problem mainly comes from the model you are using. The RandomForest regressor is built from decision trees: it learns to map inputs to outputs for the examples in the training set, so its predictions cannot go beyond the range of target values it saw during training. Consequently, the RandomForest regressor will work for middle values, but for extreme values that it hasn't seen during training it will of course perform as your picture shows.
What you want is to learn a function directly, using linear/polynomial regression or more advanced algorithms like ARIMA.
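A minimal toy sketch of that limitation on a strictly rising series, independent of the test() function above: the tree ensemble caps near its training maximum, while a linear model extrapolates.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# A strictly increasing series: the test portion is higher than anything seen in training.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)
X_train, y_train = X[:80], y[:80]
X_test = X[80:]

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

print(rf.predict(X_test).max())   # capped near the training maximum (~79)
print(lin.predict(X_test).max())  # extrapolates to ~99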

How can this feature ranking problem be implemented with Support Vector Classification?

If I want the classifier to be an SVM (using scikit-learn), how can I modify the 'clf' variable so that the SVM classifier used for feature ranking results in high accuracy? What parameters/arguments do I need to add? Which kernel type of SVC ('linear', 'rbf', 'sigmoid', or others) would you suggest for the best accuracy?
The code is adapted from the following GitHub link:
https://github.com/CynthiaKoopman/Network-Intrusion-Detection/blob/master/RandomForest_IDS.ipynb
I have 10 features which are ranked (with recursive feature elimination, RFE, from scikit-learn) from 1 to 10. They come from the DoS attacks of the NSL-KDD dataset and were ranked using RandomForestClassifier with 99% accuracy (using RFC as the prediction model).
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
#from sklearn.svm import SVC

# Create a random forest classifier. clf is the 'variable for classifier'
clf = RandomForestClassifier(n_jobs=2)
# If the classifier used is an SVM
#clf = SVC(kernel="linear")

# rank all features, i.e. continue the elimination until the last one
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)
print("DoS Features sorted by their rank:")
#print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS)))
sorted_newcolname_DoS = sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS))
sorted_newcolname_DoS
I expected more or less 99% similarity between the ranked features of the two classifiers, which I did not observe.
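A minimal sketch of swapping in the SVM, assuming the X_newDoS, Y_DoS and newcolname_DoS objects from the question: RFE needs an estimator that exposes coef_ or feature_importances_, so a linear-kernel SVC works directly, whereas 'rbf' or 'sigmoid' kernels provide no per-feature coefficients and cannot be used with RFE out of the box.

from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# The linear kernel exposes coef_, which RFE uses to rank features.
# Scaling the features beforehand usually helps an SVM; C is a tunable parameter.
clf_svm = SVC(kernel="linear", C=1.0)

rfe_svm = RFE(clf_svm, n_features_to_select=1)
rfe_svm.fit(X_newDoS, Y_DoS)

print("DoS features sorted by their SVM-RFE rank:")
print(sorted(zip(rfe_svm.ranking_, newcolname_DoS)))

Note that the two rankings need not agree closely: a linear SVM and a random forest measure feature importance in very different ways.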

cross_validation for time series in scikit learn machine learning

I wasn't able to find the information I am looking for, so I will post my question here.
I am just venturing into machine learning. I did my first multiple regression for a time series using the scikit-learn library. My code is shown below:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

X = df[feature_cols]
y = df[['scheduled_amount']]
index = y.reset_index().drop('scheduled_amount', axis=1)
linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
li = []
for train_index, test_index in tscv.split(X):
    train = index.iloc[train_index]
    train_start, train_end = train.iloc[0, 0], train.iloc[-1, 0]
    test = index.iloc[test_index]
    test_start, test_end = test.iloc[0, 0], test.iloc[-1, 0]
    X_train, X_test = X[train_start:train_end], X[test_start:test_end]
    y_train, y_test = y[train_start:train_end], y[test_start:test_end]
    linreg.fit(X_train, y_train)
    y_predict = linreg.predict(X_test)
    # score() returns the R^2 of the fold, not the RSS
    print('R^2: ' + str(linreg.score(X_test, y_test)))
    y_test['predicted_amount'] = y_predict
    y_test.plot()
Note that my data is time-series data, and I want to keep the datetime index in my DataFrame when fitting my model.
I am using TimeSeriesSplit for cross-validation, but I still don't really understand cross-validation.
First, is there a need for cross-validation on a time-series dataset? Second, should I use the last set of fitted coefficients (coef_), or the average over all folds, for my future predictions?
Yes, there is a need for cross-validation on a time-series dataset. Basically, you need to ensure your model does not overfit your current data and is able to capture past seasonal changes, so you can have some confidence that it will do the same in the future. This method is also used to choose model hyperparameters (e.g. alpha in a Ridge regression).
In order to make future predictions, you should refit your regressor on the whole data with the best hyperparameters or, as @Marcus V. mentioned in the comments, it may be best to train it only on the most recent data.
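A minimal sketch of that workflow, assuming df, feature_cols and the scheduled_amount column from the question, and using a hypothetical Ridge alpha as the hyperparameter being chosen: TimeSeriesSplit drives the hyperparameter search, and a single final model is then refit on all of the data.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

X = df[feature_cols]
y = df['scheduled_amount']

tscv = TimeSeriesSplit(n_splits=11)

# Each fold trains on the past and validates on the following period,
# so the score reflects genuine forecasting rather than interpolation.
search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.1, 1.0, 10.0]},
    cv=tscv,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

# Refit one final model on all of the data with the chosen hyperparameter.
final_model = Ridge(alpha=search.best_params_['alpha']).fit(X, y)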
