How to use tfidf in text classification? - nlp

I have a dataset with 300000 lines, each of which is an article title. I want to extract features like tf or tf-idf from this dataset.
I am able to count the words (tf) in this dataset, for example:
WORD FREQUENCY
must 10000
amazing 9999
or word percentage:
must 0.2
amazing 0.19
But how do I calculate idf? I mean, I need to find features that discriminate this dataset from others. Or, how is tf-idf used in text classification?

In your case a document is a single article title. Therefore the inverse document frequency of a term t is IDF(t) = log(300000 / num(t)), where num(t) is the number of documents (article titles) that contain the term t.
See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2
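In practice you rarely compute this by hand. A minimal sketch with scikit-learn's TfidfVectorizer, assuming the titles sit in a Python list (the two example titles are placeholders):

    # Compute tf-idf features for article titles with scikit-learn.
    # `titles` would be your list of ~300000 title strings.
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = ["must read amazing article", "amazing weather today"]  # placeholder data

    vectorizer = TfidfVectorizer()        # smoothed, log-scaled idf by default
    X = vectorizer.fit_transform(titles)  # sparse matrix: rows = titles, columns = terms

    # Inspect the learned idf weight of a single term, e.g. "amazing"
    print(vectorizer.idf_[vectorizer.vocabulary_["amazing"]])

The resulting matrix X can then be fed directly to a classifier, which is the usual way tf-idf is used in text classification.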

Related

Minimum Number of Words for Each Sentence for Training Gensim Word2vec Model

Suppose I have a corpus of short sentences in which the number of words ranges from 1 to around 500 and the average number of words is around 9. If I train a Gensim Word2vec model using window=5 (which is the default), should I use all of the sentences, or should I remove sentences with a low word count? If so, is there a rule of thumb for the minimum number of words?
Texts with only 1 word are essentially 'empty' to the word2vec algorithm: there are no neighboring words, which are necessary for all training modes. You could drop them, but there's little harm in leaving them in, either. They're essentially just no-ops.
Any text with 2 or more words can contribute to the training.
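If you do want to drop them, the filter is trivial. A minimal sketch, assuming the corpus is already a list of token lists (the sentences and parameters here are illustrative, gensim 4.x API):

    from gensim.models import Word2Vec

    sentences = [["hello"], ["the", "sky", "is", "sunny"], ["short", "title", "here"]]

    # Single-word sentences have no neighbors, so filtering them out is harmless but optional.
    usable = [s for s in sentences if len(s) >= 2]

    model = Word2Vec(usable, vector_size=100, window=5, min_count=1, epochs=10)
    print(model.wv["sky"][:5])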

Word embeddings perform poorly for text classification

I am working on a text classification use case. The text is basically the contents of legal documents, for example companies' annual reports, W-9 forms, etc. There are 10 different categories and 500 documents in total, i.e. 50 documents per category. The dataset therefore consists of 500 rows and 2 columns, where the 1st column contains the text and the 2nd column is the target.
I have built a basic model using TF-IDF for my textual features. I have used Multinomial Naive Bayes, SVC, Linear SGD, Multilayer Perceptron, and Random Forest. These models give me an F1-score of approx 70-75%.
I wanted to see if creating word embeddings would help me improve the accuracy. I trained the word vectors using gensim Word2vec and fed them to the same ML models as above, but I am getting a score of about 30-35%. I have a very small dataset and a lot of categories; is that the problem? Is it the only reason, or is there something I am missing?
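For reference, a common way to turn word vectors into the fixed-length per-document features such classifiers need is to average the vectors of the words in each document. A minimal sketch, assuming a gensim model trained on the corpus; the tokenised texts below are placeholders:

    import numpy as np
    from gensim.models import Word2Vec

    texts = [["annual", "report", "revenue"], ["w9", "tax", "form"]]  # placeholder tokens

    w2v = Word2Vec(texts, vector_size=50, min_count=1, epochs=20)

    def doc_vector(tokens, model):
        # Average the vectors of in-vocabulary words; zero vector if none are known.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    X = np.vstack([doc_vector(t, w2v) for t in texts])  # one row per document
    # X can then be passed to SVC, Random Forest, etc. in place of the TF-IDF matrix.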

Using Word Embeddings to find similarity between documents with certain words having more weight

Using word embeddings, I am calculating the similarity distance between 2 paragraphs, where the distance between the 2 paragraphs is the sum of the Euclidean distances between vectors of 2 words, 1 from each paragraph.
The larger this sum, the less similar the 2 documents are.
How can I assign preference/weights to certain words while calculating this similarity distance?
It sounds like you've improvised your own paragraph-to-paragraph distance measure based on doing (lots of?) word-to-word distances.
Are you picking the words for each word-to-word comparison randomly, and doing it a lot to find the overall difference?
One naive measure that works better than nothing is to average all the word vectors in a paragraph to get a single vector for the paragraph. You could conceivably overweight words there quite easily by assigning each word a weight, with a default of 1.0 (for a normal average) and larger values for the words you want to emphasize.
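A minimal sketch of that weighted average, assuming a trained gensim Word2Vec model; the weights dict and tokens are illustrative:

    import numpy as np

    def paragraph_vector(tokens, model, weights=None):
        # Weighted average of word vectors; the default weight 1.0 gives a plain average.
        weights = weights or {}
        vecs, ws = [], []
        for tok in tokens:
            if tok in model.wv:
                vecs.append(model.wv[tok])
                ws.append(weights.get(tok, 1.0))
        return np.average(vecs, axis=0, weights=ws)

    # Example: emphasize "contract" when comparing two paragraphs
    # v1 = paragraph_vector(tokens1, model, {"contract": 3.0})
    # v2 = paragraph_vector(tokens2, model, {"contract": 3.0})
    # distance = np.linalg.norm(v1 - v2)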
Another more sophisticated comparison based on word-vectors is "Word Mover's Distance" - it essentially considers each word to be a "pile of meaning", and then finds the minimal pairwise "moves" to transform one paragraph (as a bag-of-words) into another. (It's available in Python gensim as wmdistance(), and in other libraries.) It's quite a bit more expensive to calculate, though, especially as a function of text word count.
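For completeness, calling it in gensim looks roughly like this, assuming `model` is a trained Word2Vec (or loaded KeyedVectors); the token lists are placeholders, and depending on the gensim version an extra optimal-transport dependency (pyemd or POT) may be required:

    doc1 = ["legal", "annual", "report"]
    doc2 = ["company", "yearly", "filing"]

    distance = model.wv.wmdistance(doc1, doc2)  # smaller distance = more similar
    print(distance)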

data representation for svm

I have a million files containing free text. Each file has been assigned a code or a number of codes; the codes can be treated as categories. I have normalized the text by removing stop words. I am using scikit-learn's libsvm implementation to train the model to predict the right code(s) (categories) for the files.
I have read and searched a lot, but I couldn't understand how to represent my textual data as numbers, since SVMs and most machine learning tools use numerical values for learning.
I think I would need to find the tf-idf of each term in the whole corpus, but I am still not sure how that would help me convert my textual data into libsvm format.
Any help would be greatly appreciated. Thank you.
You are not forced to use tf-idf.
To begin with, follow this simple approach:
Select all distinct words in all your documents. This will be your vocabulary. Save it in a file.
For each word in a specific document, replace it with the index of the word in your vocabulary file, and also add the number of times the word appears in the document.
Example:
I have two documents (stop word removed, stemmed) :
hello world
and
hello sky sunny hello
Step 1: I generate the following vocabulary
hello
sky
sunny
world
Step 2:
I can represent my documents like this:
1 4
(because the word hello is in position 1 in the vocabulary and the word world is in position 4)
and
1 2 3 1
Step 3: I add the term frequency next to each term and remove duplicates
1:1 4:1
(because the word hello appears 1 time in the document, and the word world appears 1 time)
and
1:2 2:1 3:1
If you add the class number in front of each line, you have a file in libsvm format:
1 1:1 4:1
2,3 1:2 2:1 3:1
Here the first document has class 1, and the second document has class 2 and 3.
In this example each word is associated with its term frequency. To use tf-idf you do the same, but replace the tf with the computed tf-idf value.
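If you prefer not to build the file by hand, scikit-learn can produce the same format. A rough sketch (documents and labels are placeholders, and dump_svmlight_file takes one label per document):

    from sklearn.feature_extraction.text import CountVectorizer  # TfidfVectorizer for tf-idf
    from sklearn.datasets import dump_svmlight_file

    docs = ["hello world", "hello sky sunny hello"]
    labels = [1, 2]

    X = CountVectorizer().fit_transform(docs)   # term-frequency matrix
    dump_svmlight_file(X, labels, "corpus.libsvm", zero_based=False)  # 1-based indices as above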

SVM for Text Mining using scikit

Can someone share a code snippet that shows how to use SVM for text mining using scikit? I have seen an example of SVM on numerical data, but I am not quite sure how to deal with text. I looked at http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
but couldn't find SVM.
In text mining problems, text is represented by numeric values. Each feature represents a word, and the values are binary. That gives a matrix with lots of zeros and a few 1s, which indicate that the corresponding words exist in the text. Words can also be given weights according to their frequency or some other criterion; then you get real numbers instead of 0 and 1.
After converting the dataset to numerical values you can use this example: http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
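A minimal sketch of that end-to-end in scikit-learn; the toy documents and labels are placeholders for a real corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["annual report revenue growth", "tax form w9 request", "board meeting minutes"]
    labels = ["finance", "tax", "governance"]

    # text -> tf-idf matrix -> linear SVM, wrapped in a single pipeline
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(docs, labels)

    print(clf.predict(["quarterly revenue statement"]))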
