Can someone share a code snippet that shows how to use SVM for text mining with scikit-learn? I have seen an example of SVM on numerical data, but I am not quite sure how to deal with text. I looked at http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html but couldn't find an SVM example.
In text mining problems, text is represented by numeric values. Each feature represents a word, and the values are binary. That gives a matrix with lots of zeros and a few 1s, where a 1 means the corresponding word occurs in the text. Words can also be given weights according to their frequency or some other criterion, in which case you get real numbers instead of 0s and 1s.
After converting the dataset to numerical values you can use this example: http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
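For instance, here is a minimal sketch of that idea, using a TfidfVectorizer to turn text into numbers and feeding the result into an SVC. The variable names `texts` and `labels` are just placeholders for your own data:

```python
# Minimal sketch: tf-idf features fed into an SVM.
# `texts` and `labels` are placeholder names for your own data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

texts = ["the cat sat on the mat", "dogs are great pets", "cats and dogs"]
labels = [0, 1, 1]

# TfidfVectorizer turns each document into a sparse vector of weighted word counts;
# the SVC then learns on those numeric features.
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
clf.fit(texts, labels)

print(clf.predict(["a cat and a dog"]))
```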
I am trying to calculate the emotion of paragraphs by measuring the emotion of their individual sentences. But this results in vectors of varying length, since a paragraph might be as short as 1 sentence or as long as 30 sentences. How would you suggest converting these vectors to scalars?
The first option is taking the average, but this biases the results: shorter paragraphs end up with higher scores while longer ones get a score around the mean.
The second option is summing up the values, but this biases the results again, as longer paragraphs will have bigger scores.
The third option is the method used in VADER, which sums the values and then normalizes them, but I could not find a reliable resource that explains how the results are normalized. The only thing I found is the following formula from the VADER code:
norm_score = score / math.sqrt((score * score) + alpha)
VADER sets alpha to 15, but how should this number be changed, and based on what? Also, where does this normalization method come from?
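For reference, the formula above maps any raw sum into the open interval (-1, 1), and alpha controls how quickly the score saturates towards the extremes. A quick sketch to see that effect (the loop values are just illustrative):

```python
import math

def normalize(score, alpha=15):
    # VADER-style normalization: maps an unbounded sum into (-1, 1).
    # A larger alpha makes the curve saturate more slowly.
    return score / math.sqrt(score * score + alpha)

for alpha in (5, 15, 50):
    print(alpha, [round(normalize(s, alpha), 3) for s in (1, 5, 10, 30)])
```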
I would like to use an MNIST dataset where each digit is assigned a specific colour: not the background, but the digit itself.
The following dataset colours the background of the image: https://www.wouterbulten.nl/blog/tech/getting-started-with-gans-2-colorful-mnist/
Maybe you are looking for the coloured MNIST dataset?
I have seen two papers proposing it:
Invariant Risk Minimisation: source code to generate the data
PREDICTING WITH HIGH CORRELATION FEATURES: source code to generate the data
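If you only need digits whose strokes (not the background) are coloured by class, a rough sketch of the idea is below. This is not the generation code from either paper; `colour_digits` and the palette are made-up names, and it assumes you already have the grayscale images as an (N, 28, 28) uint8 array, e.g. from sklearn.datasets.fetch_openml('mnist_784'):

```python
import numpy as np

# Hypothetical helper: paint the foreground (the digit strokes) of each grayscale
# MNIST image with a colour chosen per class label; the background stays black.
def colour_digits(images, labels, palette):
    images = images.astype(np.float32) / 255.0   # (N, 28, 28), values in [0, 1]
    colours = palette[np.asarray(labels)]        # (N, 3), one RGB colour per image
    # Multiply each pixel intensity by the per-image colour -> (N, 28, 28, 3).
    return images[..., None] * colours[:, None, None, :]

# Example palette: one RGB colour per digit class 0..9 (arbitrary choices).
palette = np.array([
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1],
    [0, 1, 1], [1, 0.5, 0], [0.5, 0, 1], [0, 0.5, 0.5], [0.5, 0.5, 0.5],
], dtype=np.float32)

# images: (N, 28, 28) uint8 grayscale MNIST, labels: (N,) ints in 0..9
# coloured = colour_digits(images, labels, palette)  # -> (N, 28, 28, 3) floats in [0, 1]
```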
I am working on a text classification use case. The text is basically the contents of legal documents, for example companies' annual reports, W9 forms, etc. There are 10 different categories and 500 documents in total, so 50 documents per category. The dataset therefore consists of 500 rows and 2 columns: the 1st column contains the text and the 2nd column is the target.
I have built a basic model using TF-IDF for my textual features. I have tried Multinomial Naive Bayes, SVC, linear SGD, a multilayer perceptron, and random forest. These models give me an F1-score of approximately 70-75%.
I wanted to see if creating word embeddings would help me improve the accuracy. I trained word vectors using gensim Word2Vec and fed them into the same ML models as above, but I am getting a score of about 30-35%. I have a very small dataset and a lot of categories; is that the problem? Is it the only reason, or is there something I am missing?
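To make the setup concrete: a common way to feed Word2Vec vectors into those classifiers is to average the word vectors of each document, giving one fixed-length vector per row. A rough sketch of that step, assuming gensim 4.x (`tokenized_docs` and the parameters are placeholders, not the asker's actual configuration):

```python
import numpy as np
from gensim.models import Word2Vec

# `tokenized_docs` is a placeholder: one list of tokens per document.
tokenized_docs = [["annual", "report", "revenue"], ["tax", "form", "w9"]]

w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1)

def doc_vector(tokens, model):
    # Average the vectors of the tokens the model knows about;
    # fall back to a zero vector for documents with no known tokens.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in tokenized_docs])
# X can now be passed to the same classifiers used with the tf-idf features.
```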
I have a million files containing free text. Each file has been assigned one or more codes. The codes can be thought of as categories. I have normalized the text by removing stop words. I am using scikit-learn's libsvm-based SVM to train a model that predicts the right code(s) (category) for each file.
I have read and searched a lot, but I couldn't understand how to represent my textual data as integers, since SVMs and most machine learning tools need numerical values for learning.
I think I would need to compute tf-idf for each term in the whole corpus, but I am still not sure how that would help me convert my textual data into libsvm format.
Any help would be greatly appreciated. Thank you.
You are not forced to use tf-idf.
To begin with, follow this simple approach:
Select all distinct words in all your documents. This will be your vocabulary. Save it in a file.
For each word in a specific document, replace it with the index of that word in your vocabulary file, and also record the number of times the word appears in the document.
Example:
I have two documents (stop words removed, stemmed):
hello world
and
hello sky sunny hello
Step 1: I generate the following vocabulary:
hello
sky
sunny
world
Step 2:
I can represent my documents like this:
1 4
(because the word hello is in position 1 in the vocabulary and the word world is in position 4)
and
1 2 3 1
Step 3: I add the term frequency next to each term and remove duplicates:
1:1 4:1
(because the word hello appears 1 time in the document, and the word world appears 1 time)
and
1:2 2:1 3:1
If you add the class number in front of each line, you have a file in libsvm format:
1 1:1 4:1
2,3 1:2 2:1 3:1
Here the first document has class 1, and the second document has classes 2 and 3.
In this example, each word is associated with its term frequency. To use tf-idf, you do the same but replace the tf with the computed tf-idf value.
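If you would rather not build the vocabulary and the index:count pairs by hand, scikit-learn can produce the same kind of file for you. A small sketch (the document strings, labels, and output filename are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer  # or TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

docs = ["hello world", "hello sky sunny hello"]
labels = [1, 2]  # one class per document in this simple sketch

# CountVectorizer builds the vocabulary and the term-frequency matrix in one step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of term counts

# Write "label index:count index:count ..." lines, i.e. the libsvm/svmlight format.
dump_svmlight_file(X, labels, "corpus.svmlight", zero_based=False)

# Swapping CountVectorizer for TfidfVectorizer gives tf-idf weights instead of counts.
```

dump_svmlight_file also accepts lists of labels per document via its multilabel=True flag, which corresponds to the "2,3" line in the example above.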
This paper contains confusion matrices for spelling errors in a noisy channel. It describes how to correct the errors based on conditional probabilities.
The conditional probability computation is on page 2, left column. In footnote 4 (page 2, left column), the authors say: "The chars matrices can be easily replicated, and are therefore omitted from the appendix." I cannot figure out how they can be replicated!
How can they be replicated? Do I need the original corpus, or did the authors mean they could be recomputed from the material in the paper itself?
Looking at the paper, you just need to calculate them using a corpus, either the same one or one relevant to your application.
In replicating the matrices, note that they implicitly define two different chars matrices: a vector and an n-by-n matrix. For each character x, the vector chars contains a count of the number of times the character x occurred in the corpus. For each character sequence xy, the matrix chars contains a count of the number of times that sequence occurred in the corpus.
chars[x] represents a look-up of x in the vector; chars[x,y] represents a look-up of the sequence xy in the matrix. Note that chars[x] = the sum over chars[x,y] for each value of y.
Note that their counts are all based on the 1988 AP Newswire corpus (available from the LDC). If you can't use their exact corpus, I don't think it would be unreasonable to use another text from the same genre (i.e. another newswire corpus) and scale your counts such that they fit the original data. That is, the frequency of a given character shouldn't vary too much from one text to another if they're similar enough, so if you've got a corpus of 22 million words of newswire, you could count characters in that text and then double them to approximate their original counts.
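As a concrete illustration of what those counts are, here is a small sketch of building the two chars tables from an arbitrary corpus string; this is just straightforward counting, not the authors' code, and the corpus string is a placeholder:

```python
from collections import Counter

corpus = "some large body of newswire text"  # placeholder for your corpus

# chars[x]: how often the single character x occurs in the corpus.
chars_unigram = Counter(corpus)

# chars[x, y]: how often the two-character sequence xy occurs in the corpus.
chars_bigram = Counter(zip(corpus, corpus[1:]))

print(chars_unigram["e"])        # count of 'e'
print(chars_bigram[("t", "e")])  # count of the sequence 'te'
```

Up to the corpus's final character, summing chars_bigram over its second element recovers chars_unigram, which matches the relation stated above.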