How to perform classification - statistics

I'm trying to perform document classification into two categories (category1 and category2), using Weka.
I've gathered a training set consisting of 600 documents belonging to both categories and the total number of documents that are going to be classified is 1,000,000.
To perform the classification, I apply the StringToWordVector filter, with the following options set to true:
- IDF transform
- TF transform
- OutputWordCounts
I'd like to ask a few questions about this process.
1) How many documents should I use as the training set, so that over-fitting is avoided?
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result from the classifier, or does it not matter?
3) As the classification method I usually choose naiveBayes, but the results I get are the following:
-------------------------
Correctly Classified Instances 393 70.0535 %
Incorrectly Classified Instances 168 29.9465 %
Kappa statistic 0.415
Mean absolute error 0.2943
Root mean squared error 0.5117
Relative absolute error 60.9082 %
Root relative squared error 104.1148 %
----------------------------
and if I use SMO the results are:
------------------------------
Correctly Classified Instances 418 74.5098 %
Incorrectly Classified Instances 143 25.4902 %
Kappa statistic 0.4742
Mean absolute error 0.2549
Root mean squared error 0.5049
Relative absolute error 52.7508 %
Root relative squared error 102.7203 %
Total Number of Instances 561
------------------------------
So in document classification, which one is the "better" classifier?
Which one is better for small data sets, like the one I have?
I've read that naiveBayes performs better with big data sets, but if I increase my data set, will that cause the "over-fitting" effect?
Also, about the Kappa statistic: is there any accepted threshold, or does it not matter in this case because there are only two categories?
Sorry for the long post, but I've been trying for a week to improve the classification results with no success, although I tried to get documents that fit better in each category.

1) How many documents shall I use as training set, so that over-fitting is avoided?
You don't need to choose the size of the training set by hand; in Weka, you can just use 10-fold cross-validation. Back to the question: the choice of machine learning algorithm influences over-fitting much more than the data set does.
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
It definitely plays a role, but there is no guarantee that removing words will improve the result.
3) As classification method I usually choose naiveBayes but the results I get are the followings:
To judge whether a classification algorithm is good, the ROC curve, AUC and F-measure are usually treated as the most important indicators. Any machine learning book covers them.
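As a rough illustration of the F-measure mentioned above, here is a minimal Python sketch (not Weka code) that derives precision, recall and F1 for one class from raw counts. The counts in the example are hypothetical and are not taken from the Weka output in the question.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for category1: true positives, false positives, false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Weka reports these per-class values in its detailed accuracy output, so a sketch like this is mainly useful for checking your understanding of the numbers.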

To answer your questions:
I would use (10-fold) cross-validation to evaluate your method. The model is trained 10 times on 90% of the data and tested on the remaining 10%, using a different part of the data for testing each time. The results are therefore less biased towards your current (random) selection of training and test set.
Removing stop words (i.e., frequently occurring words with little discriminating value, like "the", "he" or "and") is a common strategy to improve your classifier. Weka's StringToWordVector allows you to select a file containing these stop words, but it should also have a default list with English stop words.
Given your results, SMO performs the best of the two classifiers (e.g., it has more Correctly Classified Instances). You might also want to take a look at (Lib)SVM or LibLinear (You may need to install them if they are not in Weka natively; Weka 3.7.6 has a package manager allowing for easy installation), which can perform quite well on document classification as well.
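To make the 10-fold cross-validation described in this answer concrete, here is a small Python sketch of how the folds can be formed. Weka does this internally; the function name `k_fold_indices` is hypothetical and only illustrates the idea. The 561 in the example matches the number of instances in the question's output.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(561, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each round trains on roughly 90% of the instances and tests on the held-out 10%; the reported accuracy is then aggregated over all ten test folds.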

Regarding the second question
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
I was building a classifier and training it on the famous 20 Newsgroups dataset. When I tested it without preprocessing, the results were not good, so I pre-processed the data according to the following steps:
Substitute TAB, NEWLINE and RETURN characters by SPACE.
Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
Turn all letters to lowercase.
Substitute multiple SPACES by a single SPACE.
The title/subject of each document is simply added in the beginning of the document's text.
no-short: obtained from the previous file by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
no-stop: obtained from the previous file by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
stemmed: obtained from the previous file by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
These steps are taken from http://web.ist.utl.pt/~acardoso/datasets/
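The steps above can be sketched in plain Python roughly as follows. This is only an illustration under a few assumptions: the stop list here is a tiny hypothetical stand-in for the 524-word SMART list, the function name `preprocess` is made up, and the final stemming step is left out because it needs a Porter stemmer implementation.

```python
import re

# Tiny illustrative stop list; the steps above use the 524-word SMART list instead.
STOPWORDS = {"the", "and", "for", "that", "with", "this"}


def preprocess(title, body, stopwords=STOPWORDS, min_len=3):
    text = f"{title} {body}"                   # title is prepended to the text
    text = re.sub(r"[^A-Za-z]+", " ", text)    # tabs, newlines, digits, punctuation -> space
    tokens = text.lower().split()              # split() also collapses repeated spaces
    tokens = [t for t in tokens if len(t) >= min_len]    # "no-short"
    tokens = [t for t in tokens if t not in stopwords]   # "no-stop"
    return " ".join(tokens)


print(preprocess("Re: GPU drivers", "He said:\tthe new drivers (v2.1)\nwork for him!"))
# prints: gpu drivers said new drivers work him
```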

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to train multiple texts supplied by myself iteratively. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading; beyond that, more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. (Finding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.)

How to get negative word samples in Gensim Word2Vec Model?

I am using gensim Word2Vec model to train word embeddings. My code is:
w2v_model = Word2Vec(min_count=20,
window=2,
vector_size=50,
sample=6e-5,
alpha=0.03,
min_alpha=0.0007,
negative=20,
workers=cores-1)
w2v_model.build_vocab(sentences, progress_per=10000)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=50, report_delay=1)
I wonder whether I can access the negative and positive word samples during the process?
Thanks in advance.
Deep inside the training loops, for each individual 'center' word in the training texts that is to be predicted – a micro-training-example for the shallow neural-net – a different set of negative words will be chosen.
Those negative-words will be used for just that one set of forward/backward neural-net nudges, then discarded when training moves to the next word.
There's no way to access them other than changing that core code – which is actually written in Cython, & re-compiled into a native library after any changes. (It's a bit harder to tinker with than pure Python code.)
You can see where the exact choice-of-negative samples happens in the source code for one of the modes (CBOW w/ negative-sampling) here:
https://github.com/RaRe-Technologies/gensim/blob/91175ddc7e3d6f3a2af245c20af21ec3bf5e360f/gensim/models/word2vec_inner.pyx#L427
If you just need a representative set of negative-words, you could copy these steps in your own code.
If you want to know (& potentially log?) the negative words chosen for every positive prediction, I suspect that's a misguided idea:
Meaningful analysis of this algorithm's behavior won't depend on either individual micro-examples or the arbitrarily-random negative words chosen over all training. The interesting properties only arise from the tug-of-war happening across the interplay of all training.
As this is very deep in the training loops, even the most-efficient extra-steps, as a function of the negative-words, would slow things down a lot. Or, in the case of logging, result in 20x (for negative=20) more logged-negative-words than your original training corpus. For the kinds of large corpora where this algorithm works well, such a slowdown/log could be onerous; for tiny toy-sized examples, this algorithm won't be working interestingly at all.
So the mere question, if you truly want a peek at all the (random, arbitrary) negative words during the process, suggests you may be going down a questionable path.
It'd be easier for me to imagine just wanting to see a representative set of the negatively-sampled words - because any 10, or 10,000, or 1,000,000 such randomly-chosen words are as good as any other, and the algorithm (on adequately-sized data) is robust against usual variance in which negative-words are actually chosen. And for that, you could just run the same sampling-process outside the training.
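For a feel of what "running the same sampling-process outside the training" could look like, here is a rough pure-Python sketch. It assumes the usual word2vec convention of drawing negatives with probability proportional to word count raised to the 0.75 power; it is not Gensim's actual Cython code, the function name `build_negative_sampler` is hypothetical, and redrawing when the center word comes up is a simplification.

```python
import random
from collections import Counter


def build_negative_sampler(sentences, exponent=0.75, seed=0):
    """Return a function that draws words with probability ~ count ** exponent."""
    counts = Counter(word for sentence in sentences for word in sentence)
    words = sorted(counts)
    weights = [counts[w] ** exponent for w in words]
    rng = random.Random(seed)

    def sample(center_word, k=5):
        drawn = []
        while len(drawn) < k:
            candidate = rng.choices(words, weights=weights, k=1)[0]
            if candidate != center_word:  # never use the predicted word as a negative
                drawn.append(candidate)
        return drawn

    return sample


# Toy corpus purely for illustration; real training data would be far larger.
toy_sentences = [["vegan", "pizza"], ["pizza", "burger"], ["coffee", "pizza"]]
sample = build_negative_sampler(toy_sentences)
print(sample("pizza", k=5))
```

Any such draw is only representative: the words actually used during training come from a different random stream and are discarded immediately.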
Separately: those are odd non-default choices for alpha & min_alpha - values that usually don't need any tweaking, and if tweaked should really only be done so with a conscious plan, driven by quantitative evaluations comparing the results of alternate values. But, those specific odd unmotivated values are pretty common in some of the worst online tutorials. So beware where you're learning about word2vec!

Cluster similar words using word2vec

I have various restaurant labels, and I also have some words that are unrelated to restaurants, like below:
vegan
vegetarian
pizza
burger
transportation
coffee
Bookstores
Oil and Lube
I have a mix of around 500 such labels. I want to know whether there is a way to pick the labels that are related to food choices and leave out words like oil and lube or transportation.
I tried using word2vec, but some of the labels have more than one word and I could not figure out the right way.
The brute-force approach is to tag them manually, but I want to know whether there is a way to use NLP or Word2Vec to cluster all related labels together.
Word2Vec could help with this, but key factors to consider are:
How are your word-vectors trained? Off-the-shelf vectors (like say the popular GoogleNews vectors trained on a large corpus of news stories) are unlikely to closely match the senses of these words in your domain, or to include multi-word tokens like 'oil_and_lube'. But, if you have a good training corpus from your own domain, with multi-word tokens from a controlled vocabulary (like oil_and_lube) that are used in context, you might get quite good vectors for exactly the tokens you need.
The similarity of word-vectors isn't strictly 'synonymity' but often other forms of close-relation including oppositeness and other ways words can be interchangeable or be used in similar contexts. So whether or not the word-vector similarity-values provide a good threshold cutoff for your particular desired "related to food" test is something you'd have to try out & tinker around. (For example: whether words that are drop-in replacements for each other are closest to each other, or words that are common-in-the-same-topics are closest to each other, can be influenced by whether the window parameter is smaller or larger. So you could find tuning Word2Vec training parameters improve the resulting vectors for your specific needs.)
Making more recommendations for how to proceed would require more details on the training data you have available – where do these labels come from? what's the format they're in? how much do you have? – and your ultimate goals – why is it important to distinguish between restaurant- and non-restaurant- labels?
OK, thank you for the details.
To train word2vec, you should take the following into account:
1. You need a huge and varied text dataset. Review your training set and make sure it contains the data you need in order to obtain what you want.
2. Put one sentence/phrase per line.
3. For preprocessing, delete punctuation and convert all strings to lower case.
4. Do NOT lemmatize or stem, because the text will be less complex!
5. Try different settings:
5.1 Algorithm: I used word2vec, and I can say that Continuous Bag of Words (CBOW) provided better results than SkipGram on different training sets.
5.2 Number of layers: 200 layers provide good results.
5.3 Vector size: a vector length of 300 is OK.
6. Now run the training algorithm. Then use the obtained model to perform different tasks. For example, in your case, for synonymy, you can compare two words (i.e. vectors) with cosine similarity. From my experience, cosine provides a satisfactory result: the similarity between two words is a real number between -1 and 1. Synonyms have high cosine values; you must find the threshold that separates words that are synonyms from words that are not.
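As a small illustration of comparing two word vectors with cosine similarity and a cutoff, here is a pure-Python sketch. The three-dimensional vectors and the 0.7 threshold are hypothetical; real vectors would come from your trained model, and the threshold has to be found by experiment. Note that cosine similarity can in general range from -1 to 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for trained word vectors.
vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "burger": [0.8, 0.2, 0.1],
    "transportation": [0.0, 0.1, 0.9],
}
THRESHOLD = 0.7  # assumed cutoff, to be tuned on real data

related = [
    word for word in vectors
    if word != "pizza"
    and cosine_similarity(vectors["pizza"], vectors[word]) >= THRESHOLD
]
print(related)  # prints: ['burger']
```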

Dataset for RNN-LSTM as Spell checker in python

I have a dataset of more than 5 million records which has many noisy features (words) in it, so I thought of doing spell correction and abbreviation handling.
When I googled for spell correction packages in Python, I found packages like autocorrect, textblob and hunspell, as well as Peter Norvig's method.
Below is a sample of my dataset:
Id description
1 switvch for air conditioner..............
2 control tfrmr...........
3 coling pad.................
4 DRLG machine
5 hair smothing kit...............
I tried the spell correction functions from the above packages using this code:
dataset['description']=dataset['description'].apply(lambda x: list(set([spellcorrection_function(item) for item in x])))
For the entire dataset it took more than 12 hours to complete spell correction, and it also introduced some noise (for 20% of the total words, which are important).
For example: in the last row, "smothing" was corrected to "something" but it should be "smoothing" (I don't expect "something" in this context).
Approaching Further
When I looked at the dataset, the spelling of a word is not wrong every time; correct instances of the spelling also occur elsewhere in the dataset. So I tokenized the entire dataset, split correct words from wrong words using a dictionary, applied the Jaro-Winkler similarity method between all pairs of words, and selected the pairs with a similarity value of 0.93 or more.
Wrong word    Correct word    Similarity score
switvch       switch          0.98
coling        cooling         0.98
smothing      smoothing       0.99
I got more than 50k pairs of similar words, which I put in a dictionary with the wrong word as key and the correct word as value.
I also kept a list of words with their abbreviations (~3k pairs) in a dictionary:
key value
tfrmr transformer
drlg drilling
Search and replace key-value pairs using this code:
dataset['description']=dataset['description'].replace(similar_word_dictionary,regex=True)
dataset['description']=dataset['description'].replace(abbreviation_dictionary,regex=True)
This code took more than a day to complete for only 10% of my entire dataset, which is not efficient.
Along with the Python packages, I also found deep spelling, which is a very efficient way of doing spelling correction. There was a very clear explanation of RNN-LSTM as a spell checker.
As I don't know much about RNNs and LSTMs, I got only a very basic understanding from the above link.
Question
I am confused about what to use as the training set for the RNN in my problem. Should I:
use the correct words (without any spelling mistakes) in the entire dataset as the training set and the entire description column of my dataset as the test set,
or use the pairs of similar words and the abbreviation list as the training set and the description column of my dataset as the test set (where the model finds the wrong words in a description and corrects them),
or take some other approach? Could someone please tell me how I can proceed?
Could you give some more information about the model you are building?
It makes sense to use a character level sequence to sequence model, similar to the one you would use for translation. There are already some approaches trying to do the same (1, 2, 3).
Maybe draw on them for some inspiration?
Now, with regards to the dataset, it seems that the one you are trying to use mostly has errors? If you don't have the correct version of each phrase, I don't think you can use this dataset.
A simple approach would be to get an existing dataset and introduce random noise in it. The deep spelling blog talks about how you can do that with an existing text corpus. Also, a recommendation from myself would be to use small-ish standalone sentences as the training set. A good place to find those is machine translation datasets (like the tatoeba project), using only the English phrases. Out of those you can create pairs of (input_phrase, target_phrase) where the input_phrase is potentially noisy (but not always).
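A minimal sketch of that noise-injection idea, using only the Python standard library, might look like this. The function names, the three edit types (delete, replace, duplicate) and the default error rate are all assumptions made for illustration; the deep spelling blog may use a different noise model.

```python
import random
import string


def add_noise(phrase, error_rate=0.1, rng=None):
    """Randomly delete, replace or duplicate letters to simulate typos."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in phrase:
        if ch.isalpha() and rng.random() < error_rate:
            op = rng.choice(("delete", "replace", "duplicate"))
            if op == "delete":
                continue
            if op == "replace":
                out.append(rng.choice(string.ascii_lowercase))
                continue
            out.append(ch)  # duplicate: one extra copy, the second is appended below
        out.append(ch)
    return "".join(out)


def make_pairs(clean_phrases, error_rate=0.1, seed=0):
    """Build (input_phrase, target_phrase) pairs; the target is always the clean text."""
    rng = random.Random(seed)
    return [(add_noise(p, error_rate, rng), p) for p in clean_phrases]


for noisy, clean in make_pairs(["hair smoothing kit", "cooling pad"], error_rate=0.2):
    print(repr(noisy), "->", repr(clean))
```

Because the noise is applied per character with a small probability, some inputs come out unchanged, which matches the "potentially noisy (but not always)" requirement.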
With regards to performance, firstly 12hrs training for 1 pass of a 5M dataset sounds about right for a home pc. You can use a GPU or a cloud solution (1, 2) for faster training.
Now for false-positive correction, the dictionary you have created could indeed be handy: if a word exists in this dictionary, don't accept a "correction" on it from the model.
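The false-positive guard described above could be sketched like this. Here `fake_model` is a hypothetical stand-in for whatever word-level correction your trained model produces, and `known_words` stands for the set of correct words you already extracted from the dataset.

```python
def accept_correction(original, proposed, known_words):
    """Keep the original word when it is already a known-correct word."""
    if original.lower() in known_words:
        return original
    return proposed


def correct_phrase(phrase, model_correct, known_words):
    """Apply a word-level corrector, but never overwrite known-correct words."""
    return " ".join(
        accept_correction(word, model_correct(word), known_words)
        for word in phrase.split()
    )


# Hypothetical stand-in for a trained model: it fixes one typo but also
# wrongly "corrects" a word that was fine.
def fake_model(word):
    return {"smothing": "smoothing", "hair": "chair"}.get(word, word)


known_words = {"hair", "kit", "smoothing"}
print(correct_phrase("hair smothing kit", fake_model, known_words))
# prints: hair smoothing kit
```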

Document classification using LSA/SVD

I am trying to do document classification using Support Vector Machines (SVM). The documents I have are collection of emails. I have around 3000 documents to train the SVM classifier and have a test document set of around 700 for which I need classification.
I initially used binary DocumentTermMatrix as the input for SVM training. I got around 81% accuracy for the classification with the test data. DocumentTermMatrix was used after removing several stopwords.
Since I wanted to improve the accuracy of this model, I tried using LSA/SVD based dimensional reduction and use the resulting reduced factors as input to the classification model (I tried with 20, 50, 100 and 200 singular values from the original bag of ~ 3000 words). The performance of the classification worsened in each case. (Another reason for using LSA/SVD was to overcome memory issues with one of the response variables, which had 65 levels).
Can someone provide some pointers on how to improve the performance of LSA/SVD classification? I realize this is general question without any specific data or code but would appreciate some inputs from the experts on where to start the debugging.
FYI, I am using R for doing the text preprocessing (packages: tm, snowball,lsa) and building classification models (package: kernelsvm)
Thank you.
Here's some general advice - nothing specific to LSA, but it might help improving the results nonetheless.
'binary documentMatrix' seems to imply your data is represented by binary values, i.e. 1 for a term existing in a document and 0 for a non-existing term; moving to another scoring scheme (e.g. tf/idf) might lead to better results.
LSA is a good method for dimensionality reduction in some cases, but less so in others. So depending on the exact nature of your data, it might be a good idea to consider additional methods, e.g. Infogain.
If the main incentive for reducing the dimensionality is the one parameter with 65 levels, maybe treating this parameter specifically, e.g. by some form of quantization, would lead to a better tradeoff?
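To show how the tf/idf scoring suggested in the first point differs from binary 0/1 values, here is a small sketch. It is written in Python rather than R, purely for illustration, and uses one common tf-idf variant (relative term frequency times the log of inverse document frequency); the tm package in R offers its own weighting functions, whose exact formula may differ. The toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical tokenized emails.
docs = [
    ["meeting", "tomorrow", "meeting"],
    ["invoice", "tomorrow"],
    ["invoice", "overdue"],
]
for weights in tfidf(docs):
    print({term: round(w, 3) for term, w in weights.items()})
```

With this weighting, a term that is frequent in one document but rare across the collection gets a high score, while a term that appears in every document gets a weight of zero.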
This might not be the best tailored answer. Hope these suggestions may help.
Maybe you could use lemmatization over stemming to reduce unacceptable outcomes.
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude
heuristic process that chops off the ends of words in the hope of achieving this
goal correctly most of the time, and often includes the removal of derivational
affixes. Lemmatization usually refers to doing things properly with the use of a
vocabulary and morphological analysis of words, normally aiming to remove
inflectional endings only and to return the base or dictionary form of a word,
which is known as the lemma.
One instance:
go,goes,going ->Lemma: go,go,go ||Stemming: go, goe, go
Also use a predefined set of rules so that contractions are expanded. For instance:
I'm -> I am
shouldn't -> should not
can't -> can not
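A rule table like the one above can be applied with a single regular expression. The sketch below is in Python for illustration and only knows the three rules listed here; a real rule set would be much longer.

```python
import re

# Only the example rules from above; extend as needed.
CONTRACTIONS = {"i'm": "I am", "shouldn't": "should not", "can't": "can not"}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I'm sure you shouldn't worry, it can't fail."))
# prints: I am sure you should not worry, it can not fail.
```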
How to deal with parentheses inside a sentence:
This is a dog(Its name is doggy)
Text inside parentheses often gives alias names for the entities mentioned. You can either remove it, or do coreference analysis and treat it as a new sentence.
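Both options for parenthesised text can be sketched with a regular expression. This Python example only handles non-nested parentheses and does no coreference analysis at all; it simply either drops the aside or returns it as a separate sentence. The function names are hypothetical.

```python
import re

# Matches one non-nested parenthesised aside, including any space before it.
_PAREN = re.compile(r"\s*\(([^()]*)\)")


def strip_parentheticals(text):
    """Drop parenthesised text entirely."""
    return _PAREN.sub("", text).strip()


def split_parentheticals(text):
    """Return the main sentence followed by each aside as its own sentence."""
    asides = [m.strip() for m in _PAREN.findall(text) if m.strip()]
    return [strip_parentheticals(text)] + asides


print(strip_parentheticals("This is a dog(Its name is doggy)"))
# prints: This is a dog
print(split_parentheticals("This is a dog(Its name is doggy)"))
# prints: ['This is a dog', 'Its name is doggy']
```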
Try to use Local LSA, which can improve the classification process compared to Global LSA. In addition, LSA's power depends entirely on its parameters, so try to tweak parameters (start with 1, then 2 or more) and compare results to enhance the performance.

Resources