Text Classification using Naive Bayes

Do guide me along if I am not posting in the right section.
I have some text files for my training data, stored as unformatted Word documents. They all contain ASCII characters only.
I would like to train a model on the text files using data mining methods.
The text files contain about 300 words each on average.
Is there any software you would recommend for getting started?
My initial idea is to use all the words in one of the files as training data and the remaining files as test data, in order to perform k-fold cross-validation.
However, the tools I have, such as Weka, do not seem to satisfy my needs: converting to CSV files is not feasible in my case because the documents are kept as separate text files.
I have been trying to perform cross-validation in such a way that all the words in the training data are treated as features.

You need to use Weka's StringToWordVector filter to convert your text files to ARFF format. After that you can use Weka's classification algorithms.
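If Python is an option instead of the Weka route described above, here is a minimal sketch of an analogous pipeline in scikit-learn: CountVectorizer plays roughly the role of StringToWordVector (every word becomes a count feature), and a Naive Bayes classifier is scored under cross-validation. The texts and labels are toy placeholders, not real data:

    # Hedged sketch: words-as-features Naive Bayes with cross-validation.
    # Assumes each document's text has been read into `texts` and its
    # class into `labels`; both are shown here with made-up values.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    texts = [
        "first document of class a", "second document of class a",
        "first document of class b", "second document of class b",
    ]
    labels = ["class_a", "class_a", "class_b", "class_b"]

    # CountVectorizer turns each document into word-count features,
    # much like Weka's StringToWordVector filter does.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    scores = cross_val_score(model, texts, labels, cv=2)
    print(scores.mean())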

Related

Pre-Processing Raw Texts Before Running Vader or TextBlob Sentiment Analysis?

I learned from a few examples that one needs to pre-process raw text (e.g. removing stop-words and punctuation, lowercasing, lemmatizing, etc.) before running a sentiment analysis.
However, when I looked at some example code for sentiment analysis using Vader or TextBlob, it did not go over any pre-processing steps; the analyzers were applied directly to the raw text.
My impression is that people pre-process raw text when they want to train and test a model themselves (in which case corresponding labels such as positive or negative are identified), whereas they skip pre-processing when they run an unsupervised sentiment analyzer (e.g. TextBlob, Vader).
Or are there other reasons for deciding whether to pre-process or not?
Any helpful comments will be appreciated.
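One concrete reason VADER is usually run on raw text: its rules use capitalization, punctuation, and emoticons as intensity cues, so aggressive cleaning removes signal. A minimal sketch, assuming the vaderSentiment package is installed:

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    raw = "This movie was GREAT!!!"
    cleaned = "this movie was great"  # lowercased, punctuation stripped

    # VADER treats all-caps words and exclamation marks as intensity
    # boosters, so the raw sentence scores more strongly positive.
    print(analyzer.polarity_scores(raw))
    print(analyzer.polarity_scores(cleaned))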

Is there any way to classify text based on some given keywords using Python?

I have been trying to learn a bit of machine learning for a project that I'm working on. At the moment I have managed to classify text using an SVM with sklearn and spaCy, with some good results, but I don't want to classify the text with the SVM alone; I also want it to be classified based on a list of keywords that I have. For example: if the sentence contains the word "fast" or "seconds", I would like it to be classified as "performance".
I'm really new to machine learning and I would really appreciate any advice.
I assume that you are already taking a portion of your data, classifying it manually and then using the result as your training data for the SVM algorithm.
If yes, then you could just append your list of keywords (features) and desired classifications (labels) to your training data. If you are not doing it already, I'd recommend using the SnowballStemmer on your training data features.
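As an illustration of the other route the question hints at, here is a minimal sketch that checks the keyword list first and falls back to a trained SVM otherwise; the keywords, labels, and training sentences are all hypothetical:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    PERFORMANCE_KEYWORDS = {"fast", "seconds"}  # assumed example keywords

    def classify(text, model):
        # Rule first: any keyword hit forces the "performance" label.
        if PERFORMANCE_KEYWORDS & set(text.lower().split()):
            return "performance"
        # Otherwise fall back to the trained SVM.
        return model.predict([text])[0]

    # Toy training data, purely illustrative.
    texts = ["the app crashed on startup", "login page loads quickly"]
    labels = ["stability", "performance"]
    model = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

    print(classify("it finished in two seconds", model))  # -> performance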

Convert Text Data into SVMFile format for Spam Classification?

How can I convert text data into LibSVM file format for training a spam classification model?
Are SVM files already labeled?
The SVM file format is neither required nor that useful; it is used in the Apache Spark ML examples only because it can be mapped directly to the required format.
"Are SVM files already labeled?"
Not necessarily, but Spark can read only the labeled variant.
In practice you should use the org.apache.spark.ml.feature tools to extract relevant features from your data.
You can follow the documentation as well as a number of questions on SO.
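For a concrete picture of the format itself, here is a minimal sketch in Python using scikit-learn's dump_svmlight_file, which writes the LibSVM/svmlight layout (label index:value ...); the example texts, labels, and file name are made up:

    from sklearn.datasets import dump_svmlight_file
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["win a free prize now", "meeting rescheduled to friday"]
    labels = [1, 0]  # 1 = spam, 0 = ham

    X = TfidfVectorizer().fit_transform(texts)
    # Each row becomes: <label> <index>:<value> ...
    dump_svmlight_file(X, labels, "spam_train.libsvm")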

Training Stanford POS tagger using multiple text files

I have a corpus of about 20,000 text files and I want to train the tagger using these files. Which is better: to group the text files into one large text file (I don't know whether this will affect tagging accuracy), or to include all of them in the props file?
I don't think it matters. The code should just load all of the data in; splitting it into multiple files is just a convenience. Also, you can specify different input formats for different files, but this will not affect the final model.
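For reference, a sketch of listing several training files in the tagger's props file; to the best of my knowledge the trainFile property accepts multiple entries separated by semicolons (the paths below are hypothetical):

    trainFile = corpus/part1.tagged;corpus/part2.tagged;corpus/part3.tagged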

Using topic modeling Java toolkit

I'm working on text classification and I want to use Topic models (LDA).
My corpus consists of at least 24,000 Persian news documents. Each doc in the corpus is in the format of (keyword, weight) pairs extracted from the news.
I saw two Java toolkits: MALLET and LingPipe.
I've read the MALLET tutorial on importing data, and it takes data as plain text, not the format that I have. Is there any way I could change that?
I also read a little about LingPipe; the example from the tutorial was using arrays of integers. Is it convenient for large data?
I need to know which implementation of LDA is better for me. Are there any other implementations that suit my data? (in Java)
From the keyword-weight file you can create an artificial text containing the words in random order, each repeated according to its weight. Run MALLET on the texts generated this way to retrieve the topics.
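A minimal sketch of that trick in Python; the scaling factor, file paths, and the MALLET options shown in the comments are assumptions to adapt:

    import os
    import random

    def expand(pairs, scale=10):
        # Repeat each keyword roughly in proportion to its weight,
        # then shuffle so word order carries no information.
        words = []
        for word, weight in pairs:
            words.extend([word] * max(1, round(weight * scale)))
        random.shuffle(words)
        return " ".join(words)

    os.makedirs("artificial_docs", exist_ok=True)
    doc = [("economy", 0.9), ("oil", 0.5), ("election", 0.1)]  # toy pairs
    with open("artificial_docs/doc0.txt", "w", encoding="utf-8") as f:
        f.write(expand(doc))

    # Then import and model as usual, e.g.:
    #   mallet import-dir --input artificial_docs --output docs.mallet \
    #       --keep-sequence
    #   mallet train-topics --input docs.mallet --num-topics 50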
