I am working on a text classification use case. The text is the content of legal documents, for example company annual reports, W-9 forms, etc. There are 10 categories and 500 documents in total, so 50 documents per category. The dataset therefore consists of 500 rows and 2 columns: the first column contains the text and the second is the target.
I have built a basic model using TF-IDF for my textual features. I have used Multinomial Naive Bayes, SVC, Linear SGD, Multilayer Perceptron, and Random Forest. These models give me an F1-score of approximately 70-75%.
I wanted to see whether word embeddings would improve the accuracy. I trained word vectors using gensim Word2vec and fed them to the same ML models as above, but I am getting a score of only about 30-35%. I have a very small dataset and a lot of categories; is that the problem? Is it the only reason, or is there something I am missing?
I am trying to implement a scoring model following the link https://rstudio-pubs-static.s3.amazonaws.com/376828_032c59adbc984b0ab892ce0026370352.html#1_introduction.
After completing the entire implementation, when I create a pivot of my generated scores against the original labels, the average score for "good" labels is significantly lower than the one for "high" labels.
Hence, my problem can be simplified to: why would logistic regression give reversed probabilities for a 0-1 target variable? (In my model I am using 0 for bad and 1 for good.)
Any suggestions and solutions would be welcome.
Suppose I have a corpus of short sentences whose length ranges from 1 to around 500 words, with an average of around 9 words. If I train a Gensim Word2vec model using window=5 (which is the default), should I use all of the sentences, or should I remove sentences with a low word count? If so, is there a rule of thumb for the minimum number of words?
Texts with only 1 word are essentially 'empty' to the word2vec algorithm: there are no neighboring words, which are necessary for all training modes. You could drop them, but there's little harm in leaving them in, either. They're essentially just no-ops.
Any text with 2 or more words can contribute to the training.
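To make the point above concrete, here is a small pure-Python sketch of how a context window produces (center, context) training pairs. It is only an illustration, not gensim's actual code: the function name context_pairs is made up for this example, and details such as subsampling or window shrinking are ignored.

```python
def context_pairs(tokens, window=5):
    """Return the (center, context) pairs a window-based model would see."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


one_word = context_pairs(["hello"])
two_words = context_pairs(["hello", "world"])

print(one_word)   # no pairs: nothing to train on
print(two_words)  # each word serves as context for the other
```

A one-word text yields an empty list, which is why it behaves like a no-op, while a two-word text already yields two pairs.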
I have a dataset of 300000 lines, each of which is an article title. I want to compute features like tf or tf-idf for this dataset.
I am able to count the words (tf) in this dataset, such as:
WORD FREQUENCY
must 10000
amazing 9999
or word percentage:
must 0.2
amazing 0.19
But how do I calculate idf? I need to find features that discriminate this dataset from others. More generally, how is tf-idf used in text classification?
In your case a document is a single article title. Therefore the inverse document frequency (IDF) of a term t is log(300000/num(t)), where num(t) is the number of documents (article titles) that contain the term t.
See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2
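As a rough sketch of the formula above, the following pure-Python snippet computes log(N/num(t)) over a handful of made-up titles (the titles and the function name inverse_document_frequency are assumptions for illustration; with the real dataset N would be 300000). Tokenization is a plain whitespace split, which is a simplification.

```python
import math


def inverse_document_frequency(documents):
    """Map each term t to log(N / num(t)), where num(t) counts documents containing t."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so a term repeated inside one title is counted once
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


titles = [
    "must see amazing photos",
    "amazing recipes you must try",
    "must read market news",
    "local weather update",
]
idf = inverse_document_frequency(titles)

for term in ("must", "amazing", "weather"):
    print(term, round(idf[term], 3))
```

Terms that occur in many titles (like "must") get a low IDF, while rare terms get a high one. Multiplying a term's frequency within a title by its IDF gives the tf-idf weight that can be used as a classification feature.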
I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained?
If you are querying your learner with the same dataset you trained on, then with k=1 the output values should be perfect, unless your data contains points with identical parameters but different outcome values. Do some reading on overfitting as it applies to KNN learners.
In the case where you are querying with the same dataset as you trained with, the query will come in for each learner with some given parameter values. Because that point exists in the learner from the dataset you trained with, the learner will match that training point as closest to the parameter values and therefore output whatever Y value existed for that training point, which in this case is the same as the point you queried with.
The possibilities are:
The training data and the test data are the same data
The test data is highly similar to the training data
The boundaries between classes are very clear
The optimal value of k depends on the data. In general, a larger value of k reduces the effect of noise on the classification, but makes the boundaries between classes more blurred.
If your result variable contains values of 0 or 1, make sure you convert it with as.factor; otherwise it might be interpreted as continuous data.
Accuracy is generally calculated on points that are not in the training dataset, that is, on unseen data. Only accuracy measured on unseen data tells you how well the model will perform on new inputs.
If you calculate accuracy on the training dataset for KNN with k=1, you get 100%, because every point has already been seen by the model and a very rough decision boundary is formed for k=1. On unseen data such a model can perform badly: the training error is very low but the actual error can be very high. So it is better to choose an optimal k. To do that, plot the error against k for the unseen (test) data and choose the value of k where the error is lowest.
To answer your question now,
1) you might have used the entire dataset as the training set and then chosen a subset of it as the test set.
(or)
2) you might have measured accuracy on the training dataset.
If neither of these is the case, then check the accuracy values for higher k; you will likely get even better accuracy with k>1 on the unseen (test) data.
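The effect described in this answer can be reproduced with a tiny hand-written KNN in pure Python. This is only a sketch on made-up 2D points (not Weka IBk and not text data); the names knn_predict and accuracy and the data are assumptions for illustration. One training point is deliberately mislabeled to act as noise.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to query."""
    def sq_dist(item):
        (x, y), _ = item
        return (x - query[0]) ** 2 + (y - query[1]) ** 2

    neighbors = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy(train, data, k):
    hits = sum(knn_predict(train, point, k) == label for point, label in data)
    return hits / len(data)


train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.2), "a"), ((2.2, 2.4), "a"),
    ((5.0, 5.0), "b"), ((5.5, 4.2), "b"), ((6.0, 5.5), "b"), ((4.8, 5.9), "b"),
    ((2.1, 1.9), "b"),  # noisy label inside the "a" cluster
]
held_out = [
    ((1.8, 1.8), "a"), ((2.3, 2.0), "a"), ((5.2, 5.1), "b"), ((5.8, 4.9), "b"),
]

train_acc = {k: accuracy(train, train, k) for k in (1, 3, 5)}
held_out_acc = {k: accuracy(train, held_out, k) for k in (1, 3, 5)}
best_k = max(held_out_acc, key=held_out_acc.get)

print("training accuracy:", train_acc)
print("held-out accuracy:", held_out_acc)
print("best k on held-out data:", best_k)
```

With k=1 every training point is its own nearest neighbor, so training accuracy is perfect even though the noisy point misleads nearby held-out queries. Choosing k from the held-out accuracy, rather than the training accuracy, is what the answer recommends.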
Can someone share a code snippet that shows how to use SVM for text mining with scikit-learn? I have seen an example of SVM on numerical data, but I am not sure how to deal with text. I looked at http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
but couldn't find SVM.
In text mining problems, text is represented by numeric values. In the simplest scheme each feature represents a word and its value is binary: 1 if the word occurs in the text, 0 otherwise. That gives a matrix with lots of zeros and a few 1s. Words can also be weighted by their frequency or some other criterion (for example tf-idf), in which case you get real numbers instead of 0 and 1.
After converting the dataset to numerical values you can use this example: http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
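A minimal sketch of both steps together, assuming scikit-learn is installed: TfidfVectorizer turns the text into a weighted word matrix as described above, and a linear SVM is trained on it. The example sentences and labels are made up, and LinearSVC is used here as one reasonable choice; SVC(kernel="linear") from the linked page could be swapped in.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (made up for illustration)
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the football team",
    "shares fell as the bank cut rates",
    "the bank reported higher profit",
    "investors sold shares after the report",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]

# Step 1: text -> tf-idf matrix; step 2: linear SVM on that matrix
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

new_texts = ["the goal decided the football match", "the bank sold shares"]
predictions = model.predict(new_texts)
print(list(predictions))
```

Wrapping the vectorizer and the classifier in one pipeline means new text is converted with the same vocabulary and weights that were learned during training.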