Java SVM Text Classification: Train & Test Files?

I'm trying to classify text documents into categories, for example:
Document 1: "Basketball is a good sport" ---> Category: Sport
Document 2: "World war 2 .." ---> Category: History
...
My goal is to create a Java interface that uses an SVM algorithm.
So I should use an SVM library for Java. I found two:
SVMlight
LIBSVM
Should I use the first one or the second?
I have done a lot of research, and I found that I need to do two things:
I should prepare a training file.
SVM libraries expect a special format for this file (example: 1 1:317.5).
But the question is: what should I generate this file from? From the documents only, or from something else?
I should have a test file, meaning a new document to classify. Should I transform the new document into the SVM test file format?
Is that correct?
Please guide me. I'm truly lost and I don't know what I should do!

Yes, you should convert your data to the standard SVM format.
An SVM classifier has no idea about text, so you first have to convert your texts (both train and test) to the standard format.
You can start with Weka: it has a simple GUI, and you can classify your datasets with a few clicks.
Once you are confident in your classifier and its accuracy, implement it in Java.
You can use Weka from your Java code too.
PS:
1- WEKA Text Classification for First Time & Beginner Users : http://www.youtube.com/watch?v=IY29uC4uem8
2- http://www.cs.waikato.ac.nz/ml/weka/
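To make "convert your texts to the standard format" more concrete, here is a minimal sketch in Python (chosen for brevity; the same logic ports directly to Java). It is not Weka, LIBSVM, or SVMlight code. It only writes lines in the sparse `label index:value` layout that LIBSVM reads; SVMlight uses a very similar one. The tokenizer, the numeric category ids, and the use of raw term counts as values are all assumptions made for the example, and real setups often use TF-IDF weights instead. The vocabulary is built from the training documents only and reused for the new document, so words never seen in training are dropped.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs; a deliberately naive tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(train_texts):
    # Assign each distinct training term a 1-based feature index.
    vocab = {}
    for text in train_texts:
        for term in tokenize(text):
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab

def to_svm_line(label, text, vocab):
    # Count known terms; terms missing from the vocabulary are ignored.
    counts = Counter(vocab[t] for t in tokenize(text) if t in vocab)
    # Feature indices are written in ascending order.
    features = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {features}".rstrip()

# Hypothetical category ids: 1 = Sport, 2 = History.
train = [(1, "Basketball is a good sport"), (2, "World war 2")]
vocab = build_vocabulary(text for _, text in train)
train_lines = [to_svm_line(label, text, vocab) for label, text in train]

# The true label of a new document is unknown, so a placeholder (0) is written.
test_line = to_svm_line(0, "Is basketball a sport or a war?", vocab)

print("\n".join(train_lines))
print(test_line)
```

So the training file is generated from the labelled documents, and the test file from the new document using the same term-to-index mapping.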

Related

NLP Classification on a dataset

I am trying to learn NLP. I understand the basic concepts, from text preprocessing to TF-IDF and word embeddings. How do I apply this learning? I have a dataset with two columns: Answer and Gender. I want to use NLP to transform the Answer column into vectors and then use supervised machine learning to train a model that predicts whether a certain type of answer was given by a male or a female.
I don't know how to proceed after I have preprocessed the text.
You can download datasets which are available in Matlab format.
All of them are divided into train and test datasets.
Check my GitHub.

How to recognize entities in text that is the output of optical character recognition (OCR)?

I am trying to do multi-class classification with textual data. The problem I am facing is that the textual data is unstructured. I'll explain the problem with an example.
Consider this image, for example:
I want to extract and classify the text information given in the image. The problem is that when I extract the information, the OCR engine gives output something like this:
18
EURO 46
KEEP AWAY
FROM FIRE
MADE IN CHINA
2226249917581
7412501
DOROTHY
PERKINS
The target classes here are:
18 -> size
EURO 46 -> price
KEEP AWAY FROM FIRE -> usage_instructions
MADE IN CHINA -> manufacturing_location
2226249917581 -> product_id
7412501 -> style_id
DOROTHY PERKINS -> brand_name
The problem I am facing is that the input text is not separable, meaning "multiple lines can belong to the same class" and there can be cases where "a single line can have multiple classes".
So I don't know how to split or merge lines before passing them to the classification model. Is there any way, using NLP, to split a paragraph based on the target class? In other words, given an input paragraph, split it based on the target labels.
If you only consider the text, this is a Named Entity Recognition (NER) task.
What you can do is train a spaCy NER model for your particular problem.
Here is what you will need to do:
First, gather a list of training text data
Label that data with the corresponding entity types
Split the data into a training set and a testing set
Train a model with spaCy NER using the training set
Score the model using the testing set
...
Profit!
See the spaCy documentation on training custom NER models.
Good luck!
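The first three steps (gather, label, split) could be sketched roughly as follows. This is plain Python, not spaCy code: it only builds character-offset annotations of the shape `(text, {"entities": [(start, end, label)]})`, a layout commonly used for spaCy NER training examples, so check the spaCy documentation for the exact format your version expects. The helper names `annotate` and `split_train_test` are hypothetical, and the single hand-labelled sample is taken from the question.

```python
import random

def annotate(segments):
    # segments: list of (text, label) pieces in reading order; label may be None.
    # Pieces are joined with newlines, and each labelled piece becomes
    # a (start, end, label) character span over the joined text.
    parts, entities, pos = [], [], 0
    for text, label in segments:
        if label is not None:
            entities.append((pos, pos + len(text), label))
        parts.append(text)
        pos += len(text) + 1  # +1 for the newline separator
    return "\n".join(parts), {"entities": entities}

def split_train_test(examples, test_fraction=0.2, seed=0):
    # Shuffle a copy deterministically, then cut off the test portion.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# One hand-labelled tag; an entity may span several OCR lines.
sample = annotate([
    ("18", "size"),
    ("EURO 46", "price"),
    ("KEEP AWAY\nFROM FIRE", "usage_instructions"),
    ("MADE IN CHINA", "manufacturing_location"),
    ("2226249917581", "product_id"),
    ("7412501", "style_id"),
    ("DOROTHY\nPERKINS", "brand_name"),
])
text, annotations = sample
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

Because the labels are spans over the whole text rather than one label per line, an entity can cover several OCR lines, which is the case the question was worried about.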

applying regression on bag of words

I have a text document and have cleaned the text. Now I have a list of words that I want to apply regression to, but I don't know how to do it. Can anyone please help?
And can I use other machine learning algorithms on the list of words?
Please provide details on what kind of prediction you are doing.
In the general case (using scikit-learn):
Step-1 : Use the Snowball stemmer (from nltk) to stem the words.
Step-2 : Using this parsed data, create features and labels for the training and test sets.
Step-3 : Convert the text to vectors of numbers using TfidfVectorizer.
Step-4 : As this will produce a huge set of features, select the top 10 percent (or whatever you want) using SelectPercentile to remove less-weighted features.
Now you can use your feature set for whatever purpose you want!
Hope this helps :)
PS: You will need to do some research on nltk and the vectorizer for appropriate parameters and tuning.
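To show what Step-3 produces, here is a small standard-library sketch of TF-IDF weighting. It is a stand-in for illustration, not scikit-learn code. The formula (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document) is meant to mirror what I understand to be the TfidfVectorizer defaults, so treat that as an assumption and check the scikit-learn documentation. The tokenizer and the three toy documents are made up for the example, and stemming and feature selection are left out.

```python
import math
import re
from collections import Counter

def tfidf_matrix(docs):
    # Tokenize, then build a sorted vocabulary so the column order is stable.
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])  # L2-normalize each document
    return vocab, rows

docs = ["the cat sat", "the dog sat", "the bird flew away"]
vocab, X = tfidf_matrix(docs)
for term, weight in zip(vocab, X[0]):
    print(f"{term}: {weight:.3f}")
```

Each row of `X` is one document as a list of numbers, which is the kind of input a regression model (or most other supervised learners) can take. Terms that occur in every document, such as "the", get a lower weight than rarer ones.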

Where can I get CoNLL-X training data?

I'm trying to train the Stanford Neural Network Dependency Parser to check phrase similarity.
The way I tried is:
java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile trainPath -devFile devPath -embedFile wordEmbeddingFile -embeddingSize wordEmbeddingDimensionality -model modelOutputFile.txt.gz
The error that I got is:
Train File: C:\Users\rohit\Downloads\CoreNLP-master\CoreNLP-master\data\edu\stanford\nlp\parser\trees\en-onetree.txt
Dev File: null
Model File: modelOutputFile.txt.gz
Embedding File: null
Pre-trained Model File: null
################### Train
#Trees: 1
0 tree(s) are illegal (0.00%).
1 tree(s) are legal but have multiple roots (100.00%).
0 tree(s) are legal but not projective (0.00%).
###################
#Word: 3
#POS:3
#Label: 2
###################
#Transitions: 3
#Labels: 1
ROOTLABEL: null
Random generator initialized with seed 1459831358061
Exception in thread "main" java.lang.NullPointerException
at edu.stanford.nlp.parser.nndep.Util.scaling(Util.java:49)
at edu.stanford.nlp.parser.nndep.DependencyParser.readEmbedFile(DependencyParser.java:636)
at edu.stanford.nlp.parser.nndep.DependencyParser.setupClassifierForTraining(DependencyParser.java:787)
at edu.stanford.nlp.parser.nndep.DependencyParser.train(DependencyParser.java:676)
at edu.stanford.nlp.parser.nndep.DependencyParser.main(DependencyParser.java:1247)
The help embedded in the code says that the training file should be a "Path to a training treebank in CoNLL-X format".
Does anyone know where I can find some CoNLL-X training data?
I gave a training file but no embedding file, and got this error.
My guess is that it might work if I provide the embedding file.
Please shed some light on which training file and embedding file I should use and where I can find them.
CoNLL-X treebanks
Training data for Danish, Dutch, Portuguese, and Swedish is available for free here. For other languages, you'll probably need to license a treebank from LDC, unfortunately (details for many languages are on that page).
Universal Dependencies are in CoNLL-U format, which can usually be converted to CoNLL-X format with some work.
Lastly, there's a large list of treebanks and their availability on this page. You should be able to convert many of the dependency treebanks in this list into CoNLL-X format if they're not already in that format.
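As a rough idea of what the CoNLL-U to CoNLL-X conversion involves, here is a minimal Python sketch. Both formats have ten tab-separated columns, and the first eight (ID, FORM, LEMMA, coarse and fine POS tags, FEATS, HEAD, DEPREL) line up. The sketch drops the CoNLL-U-only constructs (comment lines, multiword token ranges such as `1-2`, empty nodes such as `1.1`) and blanks the last two columns. The function name and the two-token sample are hypothetical, and whether the result is good enough for a particular parser is an assumption to verify, since tags and relation labels stay in the Universal Dependencies scheme.

```python
def conllu_to_conllx(conllu_text):
    out = []
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue  # sentence-level comments exist only in CoNLL-U
        if not line.strip():
            out.append("")  # a blank line separates sentences
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword token ranges (1-2) and empty nodes (1.1)
        # Keep ID..DEPREL; blank out DEPS and MISC, which sit where
        # CoNLL-X has PHEAD and PDEPREL.
        out.append("\t".join(cols[:8] + ["_", "_"]))
    return "\n".join(out)

# Hypothetical two-token sentence in CoNLL-U.
sample = "\n".join([
    "# text = Dogs bark",
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No",
    "",
])
converted = conllu_to_conllx(sample)
print(converted)
```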
Training the Stanford Neural Net Dependency parser
From this page: The embedding file is optional, but the treebank is not. The best treebank and embedding files to use depend on which language and type of text you'd like to parse. Ideally, you would train on as much data as possible in the domain/genre that you're trying to parse.

How to use reuters-21578 dataset with svm.net for text classification?

I've just started an application for text classification and I've read lots of papers on the topic, but I still don't know how to start; I feel like I haven't got the whole picture. I've got the training dataset, read its description, and found a great implementation of the SVM algorithm (SVM.NET), but I don't know how to use that dataset with this implementation. I know that I should extract features from the dataset's texts and use these features as input to the SVM. Could anybody please point me to a detailed tutorial on how to extract features from text, use them as input to the SVM algorithm, and then use this algorithm to classify a new text?
And if there is a full example of using SVM for text classification, that would be great.
Any help would be appreciated.
Thanks in advance.
Creating features for text classification can be as complex as you want it to be.
A simple approach is to map each distinct term to a feature index. You then represent each document as a vector of the frequencies of each term. (You can remove stop words, weight terms, etc.) For text classification you would also assign a label to each vector.
For example, if the document was the sentence:
John loves Mary
with a label "spam".
Then you might have the following mapping:
John : 1
loves: 2
Mary: 3
Your vector, written as pairs of feature index and weight, then becomes:
1 1 2 1 3 1
(I have assumed that each feature has a weight of one.)
I don't know about SVM.NET, but most supervised machine learning methods will accept vector-based input.
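The mapping in this answer can be reproduced with a few lines of Python. This is only a sketch of the idea: whitespace tokenization and raw term frequencies as weights are assumptions, the function names are hypothetical, and how the pairs are finally passed to SVM.NET is not covered here.

```python
from collections import Counter

def index_terms(documents):
    # Give every distinct term a 1-based index in order of first appearance.
    mapping = {}
    for doc in documents:
        for term in doc.split():
            mapping.setdefault(term, len(mapping) + 1)
    return mapping

def to_pairs(document, mapping):
    # Flatten (index, frequency) pairs, ordered by feature index.
    counts = Counter(mapping[t] for t in document.split() if t in mapping)
    return [x for pair in sorted(counts.items()) for x in pair]

mapping = index_terms(["John loves Mary"])
vector = to_pairs("John loves Mary", mapping)
labelled = ("spam", vector)  # the label travels with the vector
print(mapping)   # {'John': 1, 'loves': 2, 'Mary': 3}
print(vector)    # [1, 1, 2, 1, 3, 1]
```

A term that occurs twice simply gets a frequency of 2 in its pair, so "Mary loves Mary" would become `[2, 1, 3, 2]` under the same mapping.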
