sklearn.linear_model.SGDClassifier manual inference for multiclass classification - scikit-learn

I have trained a model for 3-class classification using sklearn.linear_model.SGDClassifier. Now I'm looking for a way to run inference on the model manually. The problem is that the model contains three pairs of [coef_, intercept_], so I don't understand how I can make a prediction in C++.
The training code follows the sklearn example:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(train_features, train_labels)
I tried to calculate the values coef_ * sample + intercept_ for each of the classes but didn't understand how to determine the class from those numbers.
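For reference, SGDClassifier handles multiclass problems with a one-vs-all scheme: each [coef_, intercept_] pair scores one class, and the predicted class is the one with the highest score. Because the pipeline also contains a StandardScaler, the sample has to be standardized first. A minimal Python sketch of the manual computation (straightforward to port to C++), assuming clf is the fitted pipeline from the question and sample is one raw feature vector:

import numpy as np

scaler = clf.named_steps['standardscaler']
sgd = clf.named_steps['sgdclassifier']

# apply the StandardScaler exactly as the pipeline would
sample_std = (sample - scaler.mean_) / scaler.scale_

# one score per class: coef_[k] . sample_std + intercept_[k]
scores = sgd.coef_ @ sample_std + sgd.intercept_
predicted = sgd.classes_[np.argmax(scores)]  # the highest score wins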

Related

How to use class weights for GaussianNB and KNeighborsRegressor in sklearn?

I have a highly imbalanced data set from which I want to obtain both binary classifications and probabilities. I have managed to use logistic regression as well as random forest to obtain results from cross_val_predict using class weights.
I am aware that RandomForestClassifier and LogisticRegression can take class weights as an argument while KNeighborsRegressor and GaussianNB do not. However, the documentation for KNN and NB says that I can use fit, which incorporates sample weights:
fit(self, X, y, sample_weight=None)
So I was thinking of working around it by calculating class weights and using these to create an array of sample weights depending on the classification of the sample. Here is the code for that:
import numpy as np
from sklearn.utils import class_weight

c_w = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
sw = []
for i in range(len(y)):
    if y[i] == False:
        sw.append(c_w[0])
    else:
        sw.append(c_w[1])
I'm not sure whether this workaround makes sense; however, I managed to fit the model using this method and I seem to get better results for my smaller class.
The issue now is that I want to use this method in sklearn's cross_val_predict(), however I am not managing to pass sample weights through cross-validation.
I have 2 questions:
Does my workaround to use sample weights to substitute class weights make sense?
Is there a way to pass sample weights through cross_val_predict just like you would when you use fit without cross validation?
Please see the response to this post for a description of the difference between sample and class weights. In general, if you use class weights, you "make your model aware" of class imbalance. If you use sample weights, you make the model aware that some samples must be "considered more carefully" or not taken into account at all.
The fit_params argument should do the job, see here:
fit_params : dict, default=None - parameters to pass to the fit method of the estimator.
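For example, assuming X, y, and the sw list built above, the sample weights can be forwarded to each internal fit call like this:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# fit_params is passed on to GaussianNB.fit inside every CV fold
preds = cross_val_predict(GaussianNB(), X, y, cv=5,
                          fit_params={'sample_weight': np.array(sw)})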

What is the classifier used in scikit-learn's VotingClassifier?

I looked at the documentation of scikit-learn but it is not clear to me what sort of classification method is used under the hood of the VotingClassifier. Is it logistic regression, SVM or some sort of tree method?
I'm interested in ways to vary the classifier method used under the hood. If scikit-learn does not offer such an option, is there a Python package that integrates easily with scikit-learn and offers such functionality?
EDIT:
I meant the classifier method used for the second level model. I'm perfectly aware that the first level classifiers can be any type of classifier supported by scikit-learn.
The second level classifier uses the predictions of the first level classifiers as inputs. So my question is - what method does this second level classifier use? Is it logistic regression? Or something else? Can I change it?
General
The VotingClassifier is not limited to one specific method or algorithm. You can choose several different algorithms and combine them into one VotingClassifier. See the example below:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)], voting='hard')
eclf.fit(X, y)
Read more about the usage here: VotingClassifier-Usage.
When it comes down to how the VotingClassifier "votes" you can either specify voting='hard' or voting='soft'. See the paragraph below for more detail.
Voting
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier.
E.g., if the prediction for a given sample is

classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2

the VotingClassifier (with voting='hard') would classify the sample as "class 1" based on the majority class label.
Source: scikit-learn-majority-class-labels-majority-hard-voting
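As a tiny illustration, the majority vote is just the mode of the predicted labels, using the three predictions from the example above:

import numpy as np

predictions = np.array([1, 1, 2])             # classifiers 1, 2 and 3
majority = np.bincount(predictions).argmax()  # -> 1, i.e. "class 1"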
Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the class label as the argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights parameter. When weights are provided, the predicted class probabilities for each classifier are collected, multiplied by the classifier weight, and averaged. The final class label is then derived from the class label with the highest average probability.
Source/Read more here: scikit-learn-weighted-average-probabilities-soft-voting
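A small sketch of the soft-voting arithmetic, using made-up probabilities for three classifiers and two classes:

import numpy as np

# predicted class probabilities, one row per classifier
probas = np.array([[0.2, 0.8],    # classifier 1
                   [0.6, 0.4],    # classifier 2
                   [0.7, 0.3]])   # classifier 3
weights = np.array([1.0, 1.0, 1.0])  # the `weights` parameter

avg = np.average(probas, axis=0, weights=weights)
predicted_class = np.argmax(avg)  # index of the highest average probability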
The VotingClassifier does not fit any meta-model on top of the first-level classifiers' output.
It just aggregates the output of each first-level classifier, either by the mode (if voting is hard) or by averaging the probabilities (if voting is soft).
In simple terms, the VotingClassifier does not learn anything from the first level of classifiers. It only consolidates the outputs of the individual classifiers.
If you want your meta-model to be more intelligent, try using AdaBoost or gradient boosting models.

Pre-training for multi label classification

I have to pre-train a model for multi-label classification. I'm pre-training with the CIFAR-10 dataset, and I wonder whether I should use 'categorical_crossentropy' (softmax) or 'binary_crossentropy' (sigmoid) for the pre-training, since in the first case I have a multi-class classification problem.
For the CIFAR-10 pre-training you should use softmax, because it gives you a probability distribution over the classes, no matter how many there are, and each CIFAR-10 image has exactly one label. Sigmoid, as you have written, is used with binary_crossentropy and is used in binary (and multi-label) classification, hence the "binary" in the name. I hope it's clearer now.
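A minimal Keras sketch of that pre-training setup (the layer sizes and the backbone are illustrative, not part of the original answer):

from tensorflow.keras import layers, models

# hypothetical backbone for CIFAR-10 pre-training
model = models.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),  # one probability per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# for the later multi-label task, the head would instead be
# Dense(n_labels, activation='sigmoid') with loss='binary_crossentropy'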

How to adopt multiple different loss functions in each steps of LSTM in Keras

I have a set of sentences and their scores, and I would like to train a marking system that can predict the score for a given sentence. One example looks like this:
(X =Tomorrow is a good day, Y = 0.9)
I would like to use an LSTM to build such a marking system, and also consider the sequential relationship between the words in the sentence, so the training example shown above is transformed as follows:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=day, y4=0.9)
When training this LSTM, I would like the first three time steps to use a softmax classifier and the final step to use MSE. Obviously the loss function used in this LSTM is composed of two different loss functions. In this case, it seems that Keras does not provide a way to address my problem directly. In addition, I am not sure whether my method of building the marking system is correct or not.
Keras supports multiple loss functions as well:
model = Model(inputs=inputs,
              outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'mse'],
              metrics=['accuracy'], loss_weights=[1., 1.])
Based on your explanation, I think you need a model that first predicts a token based on the previous tokens (in the NLP domain this is usually called a language model), and then computes a score, which I assume is a sentiment score (it is applicable to other domains as well).
To do so, you can train your language model with an LSTM and pick the last output of the LSTM for your ranking task. To this end, you need to define two loss functions: categorical_crossentropy for the language model and MSE for the ranking task.
This tutorial would be helpful: https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
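For completeness, a minimal sketch of such a two-output model, assuming a hypothetical vocab_size and seq_len: the lang output predicts the next token at every time step (the language model), and the score output produces the final mark from the last hidden state.

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, seq_len, embed_dim = 10000, 20, 64  # hypothetical sizes

inputs = Input(shape=(seq_len,))
x = Embedding(vocab_size, embed_dim)(inputs)
h = LSTM(128, return_sequences=True)(x)  # one hidden state per time step

# head 1: next-token prediction at every step (the language model)
lang_model = Dense(vocab_size, activation='softmax', name='lang')(h)
# head 2: a single score in [0, 1], derived from the last time step
sent_model = Dense(1, activation='sigmoid', name='score')(h[:, -1, :])

model = Model(inputs=inputs, outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'mse'],
              loss_weights=[1., 1.])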

Imbalanced dataset using MLP classifier in python

I am dealing with an imbalanced dataset and I'm trying to build a predictive model using an MLP classifier. Unfortunately the algorithm classifies all the observations from the test set into class "1", and hence the f1-score and recall values in the classification report are 0. Does anyone know how to deal with this?
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, classification_report

model = MLPClassifier(solver='lbfgs', activation='tanh')
model.fit(X_train, y_train)
score = accuracy_score(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cr = classification_report(y_test, model.predict(X_test))
There are a few techniques to handle an imbalanced dataset. A fully dedicated Python library, "imbalanced-learn", is available here. But one should be cautious about which technique to use in a specific case.
A few interesting examples are also available at https://svds.com/learning-imbalanced-classes/
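A minimal sketch of one such technique, random oversampling of the minority class with imbalanced-learn (assuming the X_train and y_train from the question):

from imblearn.over_sampling import RandomOverSampler
from sklearn.neural_network import MLPClassifier

# duplicate minority-class samples until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)

model = MLPClassifier(solver='lbfgs', activation='tanh')
model.fit(X_res, y_res)  # train on the rebalanced data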
