How to classify unbalanced classes with Random Forest while avoiding overfitting - scikit-learn

I'm stuck on a data science problem.
I'm trying to predict some future classes using a random forest.
My features are categorical and numerical.
My classes are unbalanced.
When I fit the model, the score looks very good, but the cross-validation looks awful.
My model must be overfitting.
Here is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

features_cat = ["area", "country", "id", "company", "unit"]
features_num = ["year", "week"]
classes = ["type"]

print("Data", len(data_forest))
print(data_forest["type"].value_counts(normalize=True))

# One-hot encode the categorical features and concatenate with the numeric ones
X_cat = pd.get_dummies(data_forest[features_cat])
print("Cat features dummies", len(X_cat))
X_num = data_forest[features_num]
X = pd.concat([X_cat, X_num], axis=1)
X.index = range(1, len(X) + 1)
y = data_forest[classes].values.ravel()

test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)

forest = RandomForestClassifier(n_estimators=50, n_jobs=4, oob_score=True, max_features="log2", criterion="entropy")
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)
print("Score on Random Test Sample:", score)

# Score on the B and C rows only, over the whole dataset
X_BC = X[y != "A"]
y_BC = y[y != "A"]
score = forest.score(X_BC, y_BC)
print("Score on only Bs, Cs rows of all dataset:", score)
Here is the output:
Data 768296
A 0.845970
B 0.098916
C 0.055114
Name: type, dtype: float64
Cat features dummies 725
Score on Random Test Sample: 0.961434335546
Score on only Bs, Cs rows of all dataset: 0.959194193052
So far I'm happy with the model...
But when I try to predict future dates, it gives mostly the same outcome.
So I checked the cross-validation:
rf = RandomForestClassifier(n_estimators=50, n_jobs=4, oob_score=True, max_features="log2", criterion="entropy")
scores = cross_val_score(rf, X, y, cv=5, n_jobs=4)  # cross_val_score from sklearn.model_selection
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
And it gives poor results...
Accuracy: 0.55 (+/- 0.57)
What am I missing?

What happens if you change (or remove) random_state? train_test_split is not stratified by default, so it could be that your classifier is mostly just predicting the most common class A, and the test set produced by that particular split happens to contain mostly A's.
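If it helps, here is a minimal sketch (reusing the X and y built in the question, and assuming a recent scikit-learn) that stratifies the split, weights the classes, and scores with a metric that is sensitive to the minority classes rather than plain accuracy:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

# Stratified split keeps the A/B/C proportions in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=50, n_jobs=4, max_features="log2",
    criterion="entropy", class_weight="balanced")  # class_weight helps with imbalance
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # see what happens to B and C specifically
print(classification_report(y_test, y_pred))  # per-class precision/recall

# Cross-validate with a minority-sensitive metric instead of accuracy
scores = cross_val_score(forest, X, y, cv=5, scoring="f1_macro", n_jobs=4)
print("Macro F1: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
With ~85% of the rows in class A, a classifier that always predicts A already gets ~0.85 accuracy, so the per-class numbers are the ones worth watching.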

Related

Different results using OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2)) compared to KNeighborsClassifier(n_neighbors=2)

I'm implementing a multi-class classifier and I'm getting different results when wrapping KNN in a multi-class classifier.
I'm not sure why, as I understood KNN already handles multiclass problems.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

y = rock_df['Sample_type']
X = rock_df[col_list]

def model_eval(model, X, y):
    """Fit the classifier on X and y with a 0.33 hold-out split, stratified by y,
    and report cross-validated accuracy and its standard deviation.

    Inputs:
        model: the ML model to be tested
        X: the cleaned and preprocessed data (normalized, NaNs dealt with)
        y: target labels for input data X
    """
    # Split train/test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42, stratify=y)
    n = X_test.size
    # Fit model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Scoring
    confusion_matrix(y_test, y_pred)
    balanced_accuracy_score(y_test, y_pred)
    scores = cross_val_score(model, X, y, cv=3)
    mean = scores.mean()
    sd = scores.std()
    print("For {} : {:.1%} accuracy on cross validation, with a standard deviation of {:.1%}".format(model, mean, sd))
    # binomial confidence interval - 95% -- confirm difference with SD
    #interval = 1.96 * sqrt( (mean * (1 - mean)) /n )
    #print('Confidence Interval: {:.3%}'.format(interval) )
    #return balanced_accuracy_score, confusion_matrix

model = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=2))
model_eval(model, X, y)

model = KNeighborsClassifier(n_neighbors=2)
model_eval(model, X, y)
With the first model I get:
For OneVsRestClassifier(estimator=KNeighborsClassifier(n_neighbors=2)) : 78.6% accuracy on cross validation, with a standard deviation of 5.8%
and with the second:
For KNeighborsClassifier(n_neighbors=2) : 83.3% accuracy on cross validation, with a standard deviation of 8.9%
Thanks.
It is OK that you get different results. KNeighborsClassifier doesn't use a one-vs-rest strategy; its majority vote handles 3 or more classes natively, so there is no need for OvR in the original implementation. Trying OneVsRestClassifier can still be useful, though, and in general the decision boundaries will differ. I played with the Iris dataset to compare the decision boundaries of KNeighborsClassifier(n_neighbors=5) and OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)).
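As a minimal sketch on the Iris dataset (a stand-in for the original plots, assuming a recent scikit-learn), the two strategies can disagree on individual test points and therefore produce different scores:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
ovr = OneVsRestClassifier(KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

# The two strategies can disagree on individual points, hence different scores
disagree = np.sum(knn.predict(X_test) != ovr.predict(X_test))
print("KNN accuracy:", knn.score(X_test, y_test))
print("OvR(KNN) accuracy:", ovr.score(X_test, y_test))
print("Points where the two models disagree:", disagree)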

Sklearn incorrect support value (number of samples in each class) for classification report

I am fitting an SVM to some data using sklearn. I have 24 samples in total (10 negative, 14 positive).
from sklearn import svm
from sklearn.model_selection import train_test_split

# Set model
clf = svm.SVC(kernel='linear', C=1)
# Create train/test splits and fit the SVM
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3, stratify=y)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
I stratified by y to make sure the class balance is preserved in my test set, which seems to have worked; however, the classification report says there are no negative samples:
The signature for classification_report is (y_true, y_pred, ...); you've reversed the inputs.
Here's one of the places where using explicit keyword arguments is a good practice.
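A minimal sketch of the difference, reusing the question's y_test and y_pred:
from sklearn.metrics import classification_report

# Wrong: this treats the predictions as ground truth, so the "support" column
# counts the classes in y_pred instead of the classes in y_test.
print(classification_report(y_pred, y_test))

# Right: passing keyword arguments makes the order impossible to flip silently.
print(classification_report(y_true=y_test, y_pred=y_pred))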

Why is my detection score high despite obvious misclassifications during prediction?

I am working on an intrusion classification problem using the NSL-KDD dataset. I used 10 features (out of 42) for training, selected with recursive feature elimination using a Random Forest classifier as the estimator and the Gini index as the splitting criterion for the decision trees. After training the classifier, I use the same classifier to predict the classes of the test data. My cross-validation scores (accuracy, precision, recall, F-score) from sklearn's cross_val_score are all above 99%. But the confusion matrix shows otherwise, with high false positive and false negative counts. Clearly, they don't match the accuracy and the other scores. Where did I go wrong?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas

# Train set contains X_train (dataframe of features) and Y_train (series of target labels)
# Test set contains X_test and Y_test

# Classifier
clf = RandomForestClassifier(n_estimators=10, criterion='gini')

# Training
clf.fit(X_train, Y_train)

# Testing
Y_pred = clf.predict(X_test)
pandas.crosstab(Y_test, Y_pred, rownames=['Actual'], colnames=['Predicted'])

# Scoring
accuracy = cross_val_score(clf, X_test, Y_test, cv=10, scoring='accuracy')
print("Accuracy: %0.5f (+/- %0.5f)" % (accuracy.mean(), accuracy.std() * 2))
precision = cross_val_score(clf, X_test, Y_test, cv=10, scoring='precision_weighted')
print("Precision: %0.5f (+/- %0.5f)" % (precision.mean(), precision.std() * 2))
recall = cross_val_score(clf, X_test, Y_test, cv=10, scoring='recall_weighted')
print("Recall: %0.5f (+/- %0.5f)" % (recall.mean(), recall.std() * 2))
f = cross_val_score(clf, X_test, Y_test, cv=10, scoring='f1_weighted')
print("F-Score: %0.5f (+/- %0.5f)" % (f.mean(), f.std() * 2))
I got accuracy, precision, recall and f-score of
Accuracy 0.99825
Precision 0.99826
Recall 0.99825
F-Score 0.99825
However, the confusion matrix showed otherwise
Predicted →
Actual ↓     9670      41
             5113    2347
Am I training the whole thing wrong, or is it just a misclassification problem caused by poor feature selection?
Your predicted values are stored in Y_pred.
accuracy_score(Y_test, Y_pred)
Just check whether this works...
You are not comparing equivalent results! For the confusion matrix, you train on (X_train, Y_train) and test on (X_test, Y_test).
However, cross_val_score fits the estimator on k-1 folds of (X_test, Y_test) and tests it on the remaining fold of (X_test, Y_test), because cross_val_score does its own cross-validation (with 10 folds here) on the dataset you provide. Check out the cross_val_score documentation for more explanation.
So basically, you don't fit and test your algorithm on the same data, which might explain some of the inconsistency in the results.
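To compare like with like, a minimal sketch (reusing the question's variables) that scores the held-out predictions directly and, if cross-validation is wanted, runs it on the training data instead:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

# Score the same predictions the confusion matrix was built from
print("Hold-out accuracy:", accuracy_score(Y_test, Y_pred))
print(confusion_matrix(Y_test, Y_pred))

# If you cross-validate, do it on the training data; cross_val_score refits
# the classifier on k-1 folds of whatever dataset you hand it.
cv_scores = cross_val_score(clf, X_train, Y_train, cv=10, scoring="accuracy")
print("CV accuracy on the training set: %0.5f (+/- %0.5f)"
      % (cv_scores.mean(), cv_scores.std() * 2))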

Predicting the Y label using CuDNNLSTM

I want to use CuDNNLSTM to predict the Y label. I have a dataset where the sentences are the X values and the codes are the Y labels.
The model actually outputs a probability matrix over the classes. I want to know:
1. How can I predict the actual sentence code?
2.
The dataset is somewhat this kind of:
Google headquarters is in California 98873
Google pixel is a very nice phone 98873
Steve Jobs was a great man 15890
Steve Jobs has done great technology innovations 15890
Microsoft is another great giant in technology 89736
Bill Gates founded Microsoft 89736
I took help from this link:
https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17
The code below predicts the probability matrix; I want to know how it can predict the actual sentence code.
Also, can we use a TF-IDF vectorizer?
import numpy as np
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, Dense
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split

# The maximum number of words to be used (most frequent)
MAX_NB_WORDS = 50000
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

tokenizer = keras.preprocessing.text.Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['procedureNew'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X = tokenizer.texts_to_sequences(df['procedureNew'].values)
X = keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Y = pd.get_dummies(df['SuggestedCpt1']).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(CuDNNLSTM(100))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 10
batch_size = 40
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

accr = model.evaluate(X_test, Y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}%'.format(accr[0], accr[1] * 100))

new_sentence = ['Pixel phone is launched by Google']
seq = tokenizer.texts_to_sequences(new_sentence)  # was new_procedure, which is undefined
padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
labels = ['98873', '15890', '89736', '87325', '23689', '10368', '45789', '36975', '26987', '64721']
print(pred, labels[np.argmax(pred)])
print("\npredicted sentence code is", labels[np.argmax(pred)])

Feature selection using logistic regression

I am performing feature selection (on a dataset with 1,930,388 rows and 88 features) using logistic regression. If I test the model on held-out data, the accuracy is just above 60%. The response variable is equally distributed. My question is: if the model's performance is not good, can I still treat the features it highlights as genuinely important? Or should I try to improve the accuracy of the model, even though my end goal is not better accuracy but only the important features?
sklearn's GridSearchCV has some pretty neat methods to give you the best feature set. For example, consider the following code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])

parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30)
}
Here the parameters dict holds all of the different parameter values that I want to consider. Notice the use of vect__max_df: max_df is an actual parameter used by my vectorizer, which is my feature selector. So
'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
actually specifies that I want to try out the above 5 values for my vectorizer, and similarly for the others. Notice how I have tied my vectorizer to the key 'vect' and my classifier to the key 'clf'. Can you see the pattern? Moving on:
import re
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]

X, y = traindf['ingredients_string'], traindf['cuisine'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('best score: %0.3f' % grid_search.best_score_)
print('best parameters set:')
bestParameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t %s: %r' % (param_name, bestParameters[param_name]))

predictions = grid_search.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))
print('Confusion Matrix:', confusion_matrix(y_test, predictions))
print('Classification Report:', classification_report(y_test, predictions))
Note that bestParameters will give me the best set of parameters out of all the options that I specified when creating my pipeline.
Hope this helps.
Edit: to get a list of the selected features
Once you have your best set of parameters, create the vectorizer and classifier with those parameter values:
vect = TfidfVectorizer('''use the best parameters here''')
Then you basically fit this vectorizer again; in doing so, the vectorizer will build its vocabulary of features from your training set.
traindf = pd.read_json('../../data/train.json')
traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]
traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]
X, y = traindf['ingredients_string'], traindf['cuisine'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
termDocMatrix = vect.fit_transform(X_train, y_train)
Now termDocMatrix holds all of the selected features, and you can use the vectorizer to get the feature names. Let's say you want the top 100 features, and your metric for comparison is the chi-square score:
from sklearn.feature_selection import SelectKBest, chi2
import numpy as np

getKbest = SelectKBest(chi2, k=100)
getKbest.fit(termDocMatrix, y_train)  # the selector must be fitted before get_support() can be used
Now just
print(np.asarray(vect.get_feature_names())[getKbest.get_support()])
should give you the top 100 features. Try this.
