Reuters dataset classes - svm

I am researching text classification using SVM. I am using the Reuters-21578 ModApte dataset in ARFF format and classifying it with Weka. After classification I get two classes, namely (-inf-0.5] and (0.5-inf). What are these classes? And how should I proceed with learning about SVMs?

Related

NLP Classification on a dataset

I am trying to learn NLP. I understand the basic concepts, from text preprocessing to tf-idf and word embeddings. How do I apply this learning? I have a dataset with two columns: Answer and Gender. I want to use NLP to transform the Answer column into vectors and then use supervised machine learning to train a model that predicts whether a certain type of answer was given by a male or a female.
I don't know how to proceed after I have preprocessed the text.
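One possible next step after preprocessing, sketched with scikit-learn: turn each answer into a tf-idf vector and fit a linear classifier on the gender labels. The toy `answers` and `genders` lists below are hypothetical stand-ins for the Answer and Gender columns, and logistic regression is only one reasonable choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the Answer and Gender columns.
answers = [
    "i enjoy hiking in the mountains",
    "i like reading novels at home",
    "football on weekends is my favourite",
    "i spend my free time painting",
    "i go fishing with my friends",
    "i love baking cakes for my family",
    "i watch motor racing every sunday",
    "i practice yoga in the morning",
]
genders = ["male", "female", "male", "female",
           "male", "female", "male", "female"]

# Hold out part of the data so the model is scored on answers it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    answers, genders, test_size=0.25, random_state=0, stratify=genders
)

# The pipeline vectorises the text and trains the classifier in one fit call.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
```

With a real dataset the two lists would come from the table's columns; the rest of the sketch stays the same.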
You can download datasets which are available in Matlab format.
All of them are divided into train and test datasets.
Check my GitHub.

Extracting word embeddings for an XLNet classification model in Simple Transformers?

I am trying to implement an XLNet transformer model using the Simple Transformers library. I am following this tutorial: https://simpletransformers.ai/docs/multi-class-classification/
According to it, I can train the model on train_df and then produce results such as accuracy and F1 score. But is there a way to extract the word embeddings produced by the model once it has been trained on the training data? I would like to analyze and plot those embeddings for academic purposes, but I cannot figure out how to do so in the Simple Transformers library.

Extract CNN features using Caffe and train using SVM

I want to extract features using Caffe and train an SVM on those features. I have gone through this link: http://caffe.berkeleyvision.org/gathered/examples/feature_extraction.html. It shows how to extract features using CaffeNet, but I want to use the LeNet architecture instead. I am unable to adapt this command for LeNet:
./build/tools/extract_features.bin models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel examples/_temp/imagenet_val.prototxt fc7 examples/_temp/features 10 leveldb
Also, after extracting the features, how do I train an SVM on them? I want to use Python for this. For example, if I get features from this code:
features = net.blobs['pool2'].data.copy()
then how can I train an SVM on these features using my own class labels?
You have two questions here:
Extracting features using LeNet
Training an SVM
Extracting features using LeNet
To extract features from LeNet using the extract_features.bin tool, you need the trained model file (.caffemodel) and the model definition used for testing (.prototxt).
The signature of extract_features.bin is here:
Usage: extract_features pretrained_net_param feature_extraction_proto_file extract_feature_blob_name1[,name2,...] save_feature_dataset_name1[,name2,...] num_mini_batches db_type [CPU/GPU] [DEVICE_ID=0]
So if you take this validation prototxt file as an example (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/train_val.prototxt), you can change it to the LeNet architecture and point it to your LMDB / LevelDB. That should get you most of the way there. If you get stuck after that, you can update your question or post a comment here so we can help.
Training SVM on top of features
I highly recommend using Python's scikit-learn for training an SVM from the features. It is super easy to get started, including reading in features saved from Caffe's format.
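A minimal sketch of the scikit-learn side, assuming the features have already been collected into a NumPy array, for example by stacking the results of `net.blobs['pool2'].data.copy()`. The random arrays below are only stand-ins for real features, and the shape (images, 50, 4, 4) and the two-class labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for real Caffe features: one (channels, height, width) block per image.
# The two groups are drawn around different means so that the toy classes differ.
n_per_class = 20
class0 = rng.normal(0.0, 1.0, size=(n_per_class, 50, 4, 4))
class1 = rng.normal(1.0, 1.0, size=(n_per_class, 50, 4, 4))
features = np.concatenate([class0, class1])

# Your own class labels, one integer per image.
labels = np.array([0] * n_per_class + [1] * n_per_class)

# An SVM expects one flat feature vector per sample.
X = features.reshape(len(features), -1)

# Scaling the features first usually helps SVMs; the linear kernel is one choice.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X, labels)

train_accuracy = clf.score(X, labels)
predicted = clf.predict(X[:5])
```

Accuracy on the training set is only a sanity check; a held-out set of images is needed for a meaningful estimate.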
A very late reply, but it should help.
It is not 100% what you want, but I have used the VGG-16 net to extract face features with Caffe and run an accuracy test on a small subset of the LFW dataset. What you need is in the code: it creates the classes for training and testing and passes them to the SVM for classification.
https://github.com/wajihullahbaig/VGGFaceMatching

Scikit-learn processing pipeline for text across test, train and validation datasets

I am using scikit-learn to build text classifiers. As part of the preprocessing I use a tf-idf transformer. I have had difficulty validating a model against an unseen dataset because its vocabulary is different. How can a pipeline be applied to unseen data that needs to be used for prediction at a later time?
Thanks
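One common way to handle this, sketched here under the assumption that the vectorizer and classifier live in a single scikit-learn `Pipeline`: fit the pipeline on the training text only, then call `predict` on the unseen text, so that the vocabulary learned during training is reused and unknown words are simply ignored. The example texts and labels are hypothetical, and `pickle` is used only to illustrate saving the fitted pipeline for later use.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data.
train_texts = [
    "crude oil prices rise sharply",
    "oil exports fall as prices drop",
    "wheat harvest exceeds forecast",
    "grain and wheat shipments increase",
]
train_labels = ["oil", "oil", "grain", "grain"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

# fit() learns the vocabulary and the classifier from the training text only.
text_clf.fit(train_texts, train_labels)

# Unseen text may contain words that were never in the training vocabulary.
unseen_texts = ["oil prices zzzunknown today"]
unseen_pred = text_clf.predict(unseen_texts)

# The vectorizer maps unseen text into the same feature space as the training data.
tfidf = text_clf.named_steps["tfidf"]
vocab_size = len(tfidf.vocabulary_)
unseen_matrix = tfidf.transform(unseen_texts)

# The fitted pipeline can be serialised and restored for prediction later on.
restored = pickle.loads(pickle.dumps(text_clf))
restored_pred = restored.predict(unseen_texts)
```

The important point is that `fit` is never called on the validation or prediction data; only `transform` and `predict` are.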

unary class text classification in weka?

I have a training dataset (text) for a particular category (say, Cancer). I want to train an SVM classifier for this class in Weka. But when I create a folder 'cancer', put all the training files into it, and run the code, I get the following error:
weka.classifiers.functions.SMO: Cannot handle unary class!
What I want is this: if the classifier finds a document related to 'cancer', it should report that class name, and if I feed it a non-cancer document, it should say something like 'unknown'.
What should I do to get this behavior?
The SMO classifier in Weka needs at least two classes to separate (problems with more than two classes are handled by pairwise classification), so it cannot learn from a single class. Sequential Minimal Optimization is a specific algorithm for solving an SVM, and Weka provides a basic implementation of it. If you have some examples that are cancer and some that are not, then the problem is binary, and perhaps you haven't labeled them correctly.
However, if you are using training data which is all examples of cancer and you want it to tell you whether a future example fits the pattern or not, then you are attempting to do one-class SVM, aka outlier detection.
LibSVM in Weka can handle one-class SVM. Unlike the Weka SMO implementation, LibSVM is a standalone program that has been interfaced into Weka and that incorporates many different variants of SVM. This post on the Wekalist explains how to use LibSVM for this in Weka.
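To illustrate the one-class idea outside Weka, here is a minimal sketch using scikit-learn's `OneClassSVM` (which is built on LibSVM) rather than the Weka interface described above. The documents are hypothetical, and the `nu` value and linear kernel are arbitrary choices that would need tuning on real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Hypothetical training set: every document belongs to the single known class.
cancer_docs = [
    "tumor biopsy confirmed malignant cancer cells",
    "chemotherapy is a common cancer treatment",
    "the patient was diagnosed with lung cancer",
    "radiation therapy targets the tumor",
    "cancer screening detects tumors early",
]

# No labels are passed to fit(): the model only learns what the class looks like.
detector = make_pipeline(TfidfVectorizer(), OneClassSVM(kernel="linear", nu=0.1))
detector.fit(cancer_docs)

new_docs = [
    "new cancer treatment shrinks the tumor",
    "the football match ended in a draw",
]

# predict() returns +1 for documents that fit the learned class and -1 otherwise.
raw = detector.predict(new_docs)
labels = ["cancer" if r == 1 else "unknown" for r in raw]
```

With only a handful of training documents the decisions are not reliable; the sketch only shows the shape of the workflow.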
