Classifying an unknown class in LIBSVM - svm

I have a dataset of 2598 instances, of which 108 belong to Class 1 and 87 belong to Class 2. I need to classify all the remaining instances as Class 1, Class 2, or as belonging to neither. Is it possible to do this with LIBSVM, given that I train only on Class 1 and Class 2 and need to find the instances that belong to neither class?
Do help me in this regard.

Use multiclass classification and train the 'rest' as class '3'.

The problem here is that the number of class 3 instances is too low:
class 1: 129
class 2: 239
class 3: 30
How do I define weights for these classes in LIBSVM? I used multi_class_learn as referred to in LIBSVM, but I could not assign weights there and the prediction accuracy became very low.
Are there any other packages where I can do multiclass SVM with weights more easily?
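One way to pick weights for counts like these is the inverse-frequency heuristic, n_samples / (n_classes * count). That choice is an assumption here, not something the thread prescribes. LIBSVM's svm-train accepts per-class weights through its -wi options, which scale the C parameter for class i in C-SVC. A minimal sketch that computes such weights and formats them as options (the helper names are made up for illustration):

```python
# Class counts taken from the comment above.
counts = {1: 129, 2: 239, 3: 30}

def balanced_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}

def libsvm_weight_options(weights):
    """Format the weights as svm-train '-w<label> <weight>' options."""
    return " ".join(f"-w{label} {w:.3f}" for label, w in sorted(weights.items()))

weights = balanced_weights(counts)
options = libsvm_weight_options(weights)
print(options)
```

The rare class 3 ends up with the largest weight, so misclassifying it costs more during training. The resulting string can be appended to the other svm-train options; the weights themselves usually still need tuning by cross-validation.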

Related

How to give more importance to one class within a batch when training the model

I have five classes to train on and would like one of my models to prioritize class 3 during training, so that class 3 gets higher accuracy at prediction time. How could I code that in PyTorch?
The order doesn't matter as the addition operator is commutative.
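A common way to prioritize one class is to weight its term in the cross-entropy loss. PyTorch's torch.nn.CrossEntropyLoss accepts a per-class weight tensor for this. The sketch below reimplements the same idea in plain Python so the arithmetic is visible. The weight of 3.0, the use of index 3 for "class 3", and the example logits are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(batch_logits, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    numerator = 0.0
    denominator = 0.0
    for logits, target in zip(batch_logits, targets):
        w = class_weights[target]
        numerator += -w * math.log(softmax(logits)[target])
        denominator += w
    return numerator / denominator

# Hypothetical batch: both samples are predicted as class 0,
# but the second one actually belongs to class 3.
batch_logits = [[5.0, 0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]

plain_loss = weighted_cross_entropy(batch_logits, targets, [1.0] * 5)
class3_loss = weighted_cross_entropy(batch_logits, targets, [1.0, 1.0, 1.0, 3.0, 1.0])
print(plain_loss, class3_loss)
```

With the larger weight, the mistake on the class 3 sample dominates the batch loss, which pushes the optimizer to fix class 3 errors first. The trade-off is usually somewhat lower accuracy on the other classes.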

What is the classifier used in scikit-learn's VotingClassifier?

I looked at the documentation of scikit-learn, but it is not clear to me what sort of classification method is used under the hood of the VotingClassifier. Is it logistic regression, SVM or some sort of tree method?
I'm interested in ways to vary the classifier method used under the hood. If scikit-learn does not offer such an option, is there a Python package that integrates easily with scikit-learn and offers this functionality?
EDIT:
I meant the classifier method used for the second level model. I'm perfectly aware that the first level classifiers can be any type of classifier supported by scikit-learn.
The second level classifier uses the predictions of the first level classifiers as inputs. So my question is - what method does this second level classifier use? Is it logistic regression? Or something else? Can I change it?
General
The VotingClassifier is not limited to one specific method/algorithm. You can choose multiple different algorithms and combine them into one VotingClassifier. See the example below:
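# Imports needed by the example; the (...) arguments are hyperparameters elided in the original.
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression(...)
clf2 = RandomForestClassifier(...)
clf3 = SVC(...)
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)], voting='hard')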
Read more about the usage here: VotingClassifier-Usage.
When it comes down to how the VotingClassifier "votes" you can either specify voting='hard' or voting='soft'. See the paragraph below for more detail.
Voting
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample
is the class label that represents the majority (mode) of the class
labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample
as “class 1” based on the majority class label.
Source: scikit-learn-majority-class-labels-majority-hard-voting
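The majority rule quoted above can be sketched in a few lines of plain Python. This is an illustration of the rule, not scikit-learn's implementation, and tie handling in particular may differ from what the library does.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the most common label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# The example from the quoted documentation: two votes for class 1, one for class 2.
label = hard_vote(["class 1", "class 1", "class 2"])
print(label)
```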
Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the
class label as argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights
parameter. When weights are provided, the predicted class
probabilities for each classifier are collected, multiplied by the
classifier weight, and averaged. The final class label is then derived
from the class label with the highest average probability.
Source/Read more here: scikit-learn-weighted-average-probabilities-soft-voting
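Soft voting as described in the quote can be sketched the same way. The probabilities and weights below are made up for illustration, and the function is a plain-Python stand-in rather than the library's code.

```python
def soft_vote(probabilities, weights=None):
    """Return (winning class index, weighted average probabilities)."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=averaged.__getitem__)
    return winner, averaged

# Hypothetical per-classifier probabilities for classes 0 and 1.
probabilities = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]

unweighted_winner, _ = soft_vote(probabilities)
weighted_winner, _ = soft_vote(probabilities, weights=[1, 3, 3])
print(unweighted_winner, weighted_winner)
```

Two of the three classifiers lean towards class 1, so hard voting would pick class 1 here. Unweighted soft voting picks class 0 because the first classifier is very confident, and giving more weight to the other two classifiers flips the result back to class 1.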
The VotingClassifier does not fit any meta model on the outputs of the first-level classifiers.
It just aggregates the output of each first-level classifier, by taking the mode (if voting is hard) or by averaging the probabilities (if voting is soft).
In simple terms, VotingClassifier does not learn anything from the first-level classifiers. It only consolidates the outputs of the individual classifiers.
If you want your meta model to be more intelligent, try AdaBoost or gradient boosting models.

Scikit-learn precision_recall_fscore_support multi-class

I am trying to get the precision, recall and F-score for multi-class classification with scikit-learn. My classes have labels 0 and 1, but this is NOT binary classification. The scikit-learn precision_recall_fscore_support() method assumes that my classification is binary and reports results only for class 1. If I convert my labels to strings, it requires pos_label. If I provide pos_label='1', it again reports results only for class 1.
How do I make it consider '0' and '1' as two independent classes and show me averaged results for both, not just for 1?
The solution is to pass the argument pos_label=None.
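To make clear what "averaged results for both classes" means, here is a plain-Python sketch of per-class and macro-averaged precision, recall and F-score. The labels are made up. As far as I know, recent scikit-learn versions control this through the average parameter (for example average=None for per-class values or average='macro'), so it is worth checking the documentation for the installed version.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, treating every label as its own class."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_average(scores):
    """Unweighted mean of each metric over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

# Hypothetical labels: 0 and 1 are two independent classes.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

scores = per_class_scores(y_true, y_pred)
macro = macro_average(scores)
print(scores, macro)
```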

Spark, MLlib: Adjusting classifier discrimination threshold

I am trying to use Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to create a model that discriminates between two classes represented by sets whose cardinalities differ quite a lot.
One set has 150 000 000 negative instances and the other just 50 000 positive instances.
After training both LR and RF classifiers with default parameters, I get very similar results for both classifiers. For example, for the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0
Classifier detects:
truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 433.0
falseNegatives = 0.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
It looks like the classifier cannot detect any positive instances at all.
Also, no matter how the data was split into train and test sets, the classifier produces exactly as many false positives as there are positives in the test set.
The LR classifier's default threshold is 0.5. Setting the threshold to 0.8 does not make any difference.
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
Questions:
1) Please advise how to manipulate the classifier threshold to make the classifier more sensitive to a class with a tiny fraction of positive instances versus a class with a huge number of negative instances.
2) Are there any other MLlib classifiers that solve this problem?
3) What does the intercept parameter do in the Logistic Regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
Well, I think what you have here is a very unbalanced dataset problem:
150 000 000 Class1
50 000 Class2, 3000 times smaller.
So if you train a classifier that assumes everything is Class1, you are going to have:
0.999666 accuracy. So the best classifier by accuracy will always be "all are Class1". This is what your model is learning here.
There are different ways to handle these cases. In general you can downsample the larger class or upsample the smaller class. There are also other things you can do with random forests, for example sampling in a balanced (stratified) way, or adding weights:
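The arithmetic behind that accuracy figure is easy to check. The sketch below evaluates a classifier that always answers Class1 (the negative class) on the full dataset from the question:

```python
negatives = 150_000_000  # Class1
positives = 50_000       # Class2

# A classifier that always predicts the majority (negative) class.
tp, fn = 0, positives
tn, fp = negatives, 0

ratio = negatives // positives
accuracy = (tp + tn) / (negatives + positives)
recall = tp / (tp + fn)

print(f"imbalance ratio: {ratio}")
print(f"accuracy: {accuracy:.6f}")
print(f"recall on the positive class: {recall}")
```

Accuracy looks excellent while recall on the positive class is zero, which is why accuracy alone is a poor metric for data this skewed.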
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Other methods also exist, such as SMOTE (which also relies on sampling). For more details you can read here:
https://www3.nd.edu/~dial/papers/SPRINGER05.pdf
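Random downsampling of the larger class, the simplest of the options mentioned above, can be sketched in plain Python as follows. The function name, the ratio parameter and the toy data are assumptions for illustration. On a Spark-sized dataset the same idea would be applied with the framework's own sampling tools rather than Python lists.

```python
import random

def undersample(rows, labels, majority_label, ratio=1.0, seed=0):
    """Keep every minority row and a random subset of the majority rows.

    ratio is the desired majority/minority size after sampling.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    keep = min(len(majority), int(ratio * len(minority)))
    chosen = sorted(rng.sample(majority, keep) + minority)
    return [rows[i] for i in chosen], [labels[i] for i in chosen]

# Toy data with the same 3000:1 imbalance as in the question.
labels = [0] * 30_000 + [1] * 10
rows = list(range(len(labels)))

new_rows, new_labels = undersample(rows, labels, majority_label=0)
print(new_labels.count(0), new_labels.count(1))
```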
The threshold you can change for your logistic regression is the probability threshold. You can try playing with "probabilityCol" in the parameters of the logistic regression example here:
http://spark.apache.org/docs/latest/ml-guide.html
But a problem with MLlib right now is that not all classifiers return a probability. I asked the developers about this and it is on their roadmap.
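On the threshold itself: a sample is predicted positive when its score is at or above the threshold, so raising it from 0.5 to 0.8 makes positive predictions rarer, not more frequent. To catch more of a rare positive class the threshold has to go down. The sketch below shows this on made-up scores. To get such scores out of a spark.mllib logistic regression model, I believe calling clearThreshold() on the model makes predict return raw scores instead of 0/1 labels, but please verify this against the documentation for your Spark version.

```python
def confusion_at(scores, labels, threshold):
    """Return (tp, fp, fn, tn) when score >= threshold is predicted positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical scores: with heavy imbalance, positives often score well below 0.5.
scores = [0.02, 0.05, 0.10, 0.30, 0.01, 0.20, 0.03, 0.40]
labels = [0, 0, 1, 1, 0, 0, 0, 1]

for threshold in (0.8, 0.5, 0.25, 0.08):
    tp, fp, fn, tn = confusion_at(scores, labels, threshold)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: tp={tp} fp={fp} fn={fn} tn={tn} recall={recall:.2f}")
```

Lowering the threshold trades false negatives for false positives, so the right value depends on how costly each kind of error is.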

Classification score: SVM

I am using LIBSVM for multi-class classification. How can I attach classification scores to the output for a given sample, so that I can compare the confidence of the classification, like this:
Class 1: score1
Class 2: score2
Class 3: score3
Class 4: score4
You can use a one-vs-all approach first and treat each problem as a two-class classification, using the decision value output in LIBSVM. This is done by taking each class in turn as the positive class and the rest of the classes as the negative class.
Then compare the decision values of the results to classify the samples: assign each sample to the class with the highest decision value. For example, if sample 1 has decision value 0.54 for class 1, 0.64 for class 2, 0.43 for class 3 and 0.80 for class 4, then you classify it as class 4.
You can also use probability values instead of decision function values by using the -b option in LIBSVM.
Hope this helps..
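The comparison step described above can be sketched like this, using the decision values from the example. How the per-class values are obtained from the one-vs-all models is left out; only the ranking is shown.

```python
# Decision values for one sample, one per one-vs-all model (from the example above).
decision_values = {1: 0.54, 2: 0.64, 3: 0.43, 4: 0.80}

def rank_classes(values):
    """Sort (class, score) pairs from most to least confident."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)

ranking = rank_classes(decision_values)
predicted = ranking[0][0]

for label, score in ranking:
    print(f"Class {label}: {score:.2f}")
print(f"Predicted class: {predicted}")
```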
Another option is to use the LIBLINEAR package, which internally implements the one-vs-all strategy for multi-class problems. In LIBSVM, multi-class classification is based on the one-vs-one strategy.
