How to retrieve the training history from LinearSVC model? - python-3.x

I am trying to retrieve the training history of my SVM model to plot its learning curve. Something like:
history = model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
I have already looked at GridSearchCV best model CV history and https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html to do that, but the first approach did not work for LinearSVC model, and the second approach is not quite what I would like to do (as far as I understood, If I use the learning curve method I will have to train my model again after the grid search).
model = GridSearchCV(LinearSVC(verbose=0),
{'C': [1, 10, 100, 1000]}, cv=5,
iid=False, scoring='recall_macro')
model.fit(x_train, y_train)
_, loss, val_loss = learning_curve(model.best_estimator_.fit(X, Y, cv=5))
How can I get this history? I am using sklearn 0.20.3.

Related

Scaling of stock data

I am trying to apply machine learning on stock prediction, and I run into problem regarding scaling on future unseen (much higher) stock close value.
Lets say I use random forrest regression on predicting stock price. I break the data into train set and test set.
For the train set, I use standardscaler, and do fit and transform
And then I use regressor to fit
For the test set, I use standardscaler, and do transform
And then I use regressor to predict, and compare to test label
If I plot predict and test label on a graph, predict seems to max out or ceiling. The problem is that standardscaler fit on train set, test set (later in the timeline) have much higher value, the algorithm does not know what to do with these extreme data
def test(X, y):
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle=False)
# preprocess the data
pipeline = Pipeline([
('std_scaler', StandardScaler()),
])
# model = LinearRegression()
model = RandomForestRegressor(n_estimators=20, random_state=0)
# preprocessing fit transform on train data
X_train = pipeline.fit_transform(X_train)
# fit model on train data with train label
model.fit(X_train, y_train)
# transform on test data
X_test = pipeline.transform(X_test)
# predict on test data
y_pred = model.predict(X_test)
# print(np.sqrt(mean_squared_error(y_test, y_pred)))
d = {'actual': y_test, 'predict': y_pred}
plot_data = pd.DataFrame.from_dict(d)
sns.lineplot(data=plot_data)
plt.show()
What should be done with the scaling?
This is what I got for plotting prediction, actual close price vs time
The problem mainly comes from the model you are using. RandomForest regressor is created upon Decision Trees. It is learning to map an input to an output for every examples in the training set. Consequently RandomForest regressor will work for middle values but for extreme values that it hasn't seen during training it will of course perform has your picture is showing.
What you want, is to learn a function directly using linear/polynomial regression or more advanced algorithms like ARIMA.

Neural network only predicts one class from binary class

My task is to learn defected items in a factory. It means, I try to detect defected goods or fine goods. This led a problem where one class dominates the others (one class is 99.7% of the data) as the defected items were very rare. Training accuracy is 0.9971 and validation accuracy is 0.9970. It sounds amazing.
But the problem is, the model only predicts everything is 0 class which is fine goods. That means, it fails to classify any defected goods.
How can I solve this problem? I have checked other questions and tried out, but I still have the situation. the total data points are 122400 rows and 5 x features.
In the end, my confusion matrix of the test set is like this
array([[30508, 0],
[ 92, 0]], dtype=int64)
which does a terrible job.
My code is as below:
le = LabelEncoder()
y = le.fit_transform(y)
ohe = OneHotEncoder(sparse=False)
y = y.reshape(-1,1)
y = ohe.fit_transform(y)
scaler = StandardScaler()
x = scaler.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.25, random_state = 777)
#DNN Modelling
epochs = 15
batch_size =128
Learning_rate_optimizer = 0.001
model = Sequential()
model.add(Dense(5,
kernel_initializer='glorot_uniform',
activation='relu',
input_shape=(5,)))
model.add(Dense(5,
kernel_initializer='glorot_uniform',
activation='relu'))
model.add(Dense(8,
kernel_initializer='glorot_uniform',
activation='relu'))
model.add(Dense(2,
kernel_initializer='glorot_uniform',
activation='softmax'))
model.compile(loss='binary_crossentropy',
optimizer=Adam(lr = Learning_rate_optimizer),
metrics=['accuracy'])
history = model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
verbose=1,
validation_data=(x_test, y_test))
y_pred = model.predict(x_test)
confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
Thank you
it sounds like you have highly imbalanced dataset, the model is learning only how to classify fine goods.
you can try one of the approaches listed here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
The best attempt would be to firstly take almost equal portions of data of both classes, split them into train-test-val, train the classifier and do thorough testing on your complete dataset. You can also try and use data augmentation techniques to your other set to get more data from the same set. Keep on iterating and maybe even try and change your loss function to suit your condition.

Cross-validation in sklearn: do I need to call fit() as well as cross_val_score()?

I would like to use k-fold cross validation while learning a model. So far I am doing it like this:
# splitting dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)
# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)
At this step I am not quite sure whether I should use model.fit() or not, because in the official documentation of sklearn they do not fit but just call cross_val_score as following (they do not even split the data into training and test sets):
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
I would like to tune the hyper parameters of the model while learning the model. What is the right pipeline?
If you want to do hyperparameter selection then look into RandomizedSearchCV or GridSearchCV. If you want to use the best model afterwards, then call either of these with refit=True and then use best_estimator_.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
log_params = {'penalty': ['l1', 'l2'], 'C': [1E-7, 1E-6, 1E-6, 1E-4, 1E-3]}
clf = LogisticRegression()
search = RandomizedSearchCV(clf, scoring='average_precision', cv=10,
n_iter=10, param_distributions=log_params,
refit=True, n_jobs=-1)
search.fit(X_train, y_train)
clf = search.best_estimator_
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Your second example is right for doing the cross validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
The fitting will be done inside the cross_val_score function, you don't need to worry about this beforehand.
[Edited] If, besides cross validation, you want to train a model, you can call model.fit() afterwards.

Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm do some text classification tasks. What I have observed is that if fed tfidf matrix(from sklearn's TfidfVectorizer), Logistic Regression model is always outperforming MultinomialNB model. Below is my code for training both:
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)
clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB is using the tfidf matrix since the term frequency in tfidf is calculated based on no class information. Any chance that I should not feed tfidf matrix to MultinomialNB the same way I did to LogisticRegression?
Update: I understand the difference between results of TfidfVectorizer and CountVectorizer. And I also just checked the sources code of sklearn's MultinomialNB.fit() function, looks like it does expect a count as oppose to frequency. This will also explain the performance boost mentioned in my comment below. However, I'm still wondering if under any circumstances pass tfidf into MultinomialNB makes sense. The sklearn documentation briefly mentioned the possibility, but not much details.
Any advice would be much appreciated!

From SKLearn to Keras - What is the difference?

I'm trying to go from SKLearn to Keras in order to make specific improvements to my models.
However, I can't get the same performance I had with my SKLearn model :
mlp = MLPClassifier(
solver='adam', activation='relu',
beta_1=0.9, beta_2=0.999, learning_rate='constant',
alpha=0, hidden_layer_sizes=(238,),
max_iter=300
)
dev_score(mlp)
Gives ~0.65 score everytime
Here is my corresponding Keras code :
def build_model(alpha):
level_moreargs = {'kernel_regularizer':l2(alpha), 'kernel_initializer': 'glorot_uniform'}
model = Sequential()
model.add(Dense(units=238, input_dim=X.shape[1], **level_moreargs))
model.add(Activation('relu'))
model.add(Dense(units=class_names.shape[0], **level_moreargs)) # output
model.add(Activation('softmax'))
model.compile(loss=keras.losses.categorical_crossentropy, # like sklearn
optimizer=keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0),
metrics=['accuracy'])
return model
k_dnn = KerasClassifier(build_fn=build_model, epochs=300, batch_size=200, validation_data=None, shuffle=True, alpha=0.5, verbose=0)
dev_score(k_dnn)
From looking at the documentation (and digging into SKLearn code), this should correspond exactly to the same thing.
However, I get ~0.5 accuracy when I run this model, which is very bad.
And if I set alpha to 0, SKLearn's score barely changes (0.63), while Keras's goes random from 0.2 to 0.4.
What is the difference between these models ? Why is Keras, although being supposed to be better than SKLearn, outperformed by so far here ? What's my mistake ?
Thanks,

Resources