If I am training a sentiment classifier on a tagged dataset where most documents are negative, say ~95%, should the classifier be trained on that same distribution of negative comments? If not, what are the other options for "normalizing" the data set?
You don't say what type of classifier you have, but in general you don't have to normalize the class distribution of the training set. More data is usually better, but you should always run blind tests on held-out data to guard against over-fitting.
In your case you will end up with a strong classifier for negative comments and, unless you have a very large sample size, a weaker one for positive comments. If your sample size is large enough it won't really matter, since you hit a point where you might start over-fitting your negative data anyway.
In short, it's impossible to say for sure without knowing the actual algorithm, the size of the data set, and the diversity within it.
Your best bet is to carve off something like 10% of your training data (randomly), train on the remaining 90%, and see how the classifier performs on the held-out portion.
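For concreteness, here is a minimal sketch of that 90/10 hold-out check in scikit-learn; the `texts`/`labels` variables and the TF-IDF + logistic regression pipeline are placeholders, not a recommendation from the answer above.

```python
# Minimal sketch of the 90/10 hold-out check suggested above.
# `texts` and `labels` are assumed to be your tagged dataset already loaded
# into memory; the TF-IDF + logistic regression pipeline is only a placeholder.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# stratify=labels keeps the ~95/5 negative/positive ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.10, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Look at per-class precision/recall, not just overall accuracy,
# since accuracy alone is misleading with a 95% majority class.
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```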
I want to compare the performance of ELMo and word2vec as word embeddings with a CNN model, classifying 4,000 tweets into five class labels, but the results show that ELMo performs worse than word2vec.
I used ELMoformanylangs for ELMo and pretrained word2vec on 1 million tweets.
(Loss curves for the word2vec-CNN and ELMo-CNN models; plots not reproduced here.)
The curves show that both models are overfitting, but why would ELMo be worse than word2vec?
From the elmoformanylangs project you've linked, it looks like your generic ELMo model was trained "on a set of 20-million-words data randomly sampled from the raw text released by the shared task (wikidump + common crawl)".
Given that many tweets are longer than 20 words, your 1-million-tweet training set for word2vec might be more training data than was used for the ELMo model. And, coming from actual tweets, it may also reflect the words and word-senses used in tweets better than generic wikidump/common-crawl text.
Given that, I'm not sure why you'd have expected the ELMo approach to necessarily be better.
But also, as you've noted, the fact that your classifier performs worse with more training is highly indicative of extreme overfitting. You may want to fix that before trying to reason any further about the relative merits of the two approaches. (When both classifiers are massively broken, exactly why one's brokenness is slightly better than the other's is a fairly moot point. Once both are fixed to do as well as they can, the remaining difference may be worth choosing between, or understanding deeply.)
I am classifying data with categorical variables, where the values have been provided by people.
My training dataset is of varying quality: for some records I have high confidence that people have provided correct information, while for others I am less sure.
How can I pass this information into a classification algorithm such as Naive Bayes or K nearest neighbour?
Or should I instead look to another algorithm?
I think what you want to do is provide an individual weight (reflecting importance/confidence) for each data point you have.
For instance, if you are very certain that one data point is of higher quality and should carry more weight than others you are less confident about, you can specify that when fitting your classifier.
Sklearn's Gaussian Naive Bayes classifier (GaussianNB), for instance, supports this: you can pass sample_weight when calling its fit() method.
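As a minimal sketch (the data and confidence values below are made up purely for illustration):

```python
# Minimal sketch of per-sample confidence weights with GaussianNB.
# X, y and the confidence values are made up purely for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([1, 1, 0, 0])

# One weight per training row: e.g. 1.0 for answers you trust,
# 0.3 for answers you are less sure about.
confidence = np.array([1.0, 0.3, 1.0, 0.3])

clf = GaussianNB()
clf.fit(X, y, sample_weight=confidence)

print(clf.predict([[0.9, 0.8]]))
```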
I am using sklearn's random forest module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect the additional information to only make the model more accurate, but the extra features make it less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes. The additional features you added might not have much predictive power, and because a random forest draws a random subset of features to build each individual tree, the original 50 features may have been crowded out. To test this hypothesis, you can plot the variable importances using sklearn, as in the sketch below.
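A minimal sketch of that check, assuming X (n_samples × 150 features) and y are the arrays you already train on:

```python
# Minimal sketch of the variable-importance check suggested above.
# X (n_samples x 150 features) and y are assumed to be your existing arrays.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Sort features by impurity-based importance; if the 100 added features
# cluster near zero, they are mostly contributing noise.
order = np.argsort(forest.feature_importances_)[::-1]
plt.bar(range(X.shape[1]), forest.feature_importances_[order])
plt.xlabel("feature rank")
plt.ylabel("importance")
plt.show()
```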
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
(Illustration of overfitting: https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c)
There are plenty of illustrations of overfitting, but this 2D plot, for instance, shows the different decision functions that could be learned for a binary classification task. Because the function on the right has too many parameters, it learns spurious patterns in the data that don't generalize properly.
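One quick way to see this numerically, rather than from a picture, is to compare training and held-out scores; a minimal sketch, assuming the same X and y as in your random-forest setup:

```python
# Quick numerical check for overfitting: compare training and held-out scores.
# X and y are assumed to be the same arrays used for the random forest above.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# A large gap between the two scores is the classic overfitting signature.
print("train R^2:", forest.score(X_train, y_train))
print("test  R^2:", forest.score(X_test, y_test))
```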
I have a very large data set. The response variable is binary 1/0, and the bad population (tagged as 1) is a very small portion of the whole: 7,000 bad records versus 8,000,000 good records.
I used a decision tree that takes the features as inputs and classifies each individual as either 1 or 0.
Because the population was so large, R was not able to process all the data efficiently, so I decided to randomly sample the good records while keeping all the bad ones. I selected 8,000 good samples and included all 7,000 bad samples, giving 15,000 samples in total. I randomly split them into training and testing sets; after training the decision tree on the training set and scoring the testing set, the results were very promising.
However, I am worried about how this model will work on the entire population. I compared the distributions of the different variables between the good samples and the good population, and the good samples were very consistent with the good population.
Because the good and bad samples are roughly equally weighted in the sampled data, the effect of the "bad" class is exaggerated when training the model. I suspect that what looks "bad" in the sample will not look "bad" when the entire data set is fed into the model, because the bad part is so tiny. Do you think this is a potential failure mode for the model? Do you have any suggestions to fix this problem?
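For reference, the sampling-and-split procedure described above looks roughly like this (sketched in Python/pandas rather than R, purely for illustration; `df` and its "label" column are placeholders for the actual data):

```python
# Rough sketch of the sampling-and-split procedure described above, written in
# Python/pandas rather than R purely for illustration; `df` and its binary
# "label" column are placeholders for your actual data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bad = df[df["label"] == 1]                                  # keep all ~7,000 bad records
good = df[df["label"] == 0].sample(n=8000, random_state=0)  # random 8,000 of the 8M good

sample = pd.concat([bad, good])
train, test = train_test_split(
    sample, test_size=0.5, stratify=sample["label"], random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(train.drop(columns="label"), train["label"])
print("held-out accuracy:", tree.score(test.drop(columns="label"), test["label"]))
```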
There's something I don't understand about neural networks. I've tried to use them with financial data analysis and audio pitch classification. In both cases, I need a classifier that can detect the significant item from among the many. My audio application literally has one positive hit for every thousand negative hits. I run the network trainer, and it learns that it's a pretty darn fine guess to just go with the negative. Is there some other algorithm for detecting the rare gem? Is there some form of neural network training that is especially suited for this type of problem? I can change the range on my positive data to be equivalent to the sum of the negative values, but I don't understand how that fits in with the preferred range of zero to one on the typical neural network.
Here are two possible suggestions:
Balance your training set
Even if the real-world data contains 1,000 times as many negatives as positives, your training data does not have to. You can modify your training set to increase the proportion of positives. That will improve recall (more true positives) but also hurt precision (more false positives as well), so you'd have to experiment with the ideal ratio of positives to negatives in the training set; a quick resampling sketch follows the reference below.
This paper discusses this approach in more detail: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2243711/pdf/procamiasymp00003-0260.pdf
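A minimal sketch of such re-balancing by oversampling the rare positives (plain NumPy; X_train and y_train are assumed to be your existing arrays, and the 1:1 target ratio is arbitrary):

```python
# Minimal sketch of re-balancing by oversampling the rare positives.
# X_train and y_train are assumed to be existing NumPy arrays; the 1:1 target
# ratio is arbitrary and is exactly the knob you would experiment with.
import numpy as np

rng = np.random.default_rng(0)
pos = np.where(y_train == 1)[0]
neg = np.where(y_train == 0)[0]

# Repeat positive indices (with replacement) until there is one positive
# per negative, then shuffle the combined index set.
pos_upsampled = rng.choice(pos, size=len(neg), replace=True)
balanced = np.concatenate([neg, pos_upsampled])
rng.shuffle(balanced)

X_balanced, y_balanced = X_train[balanced], y_train[balanced]
```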
Anomaly detection
... on the other hand, if you have too few positive examples to train the neural network with a more balanced training set, then perhaps you could try anomaly detection. With anomaly detection, you train your algorithm (e.g., a neural network) to recognize what negative data points look like; any data point that looks sufficiently different from normal then gets flagged as positive.
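As a rough sketch of the idea, here is how it might look with scikit-learn's IsolationForest standing in for the neural-network detector described above (X_negative and X_new are assumed feature arrays):

```python
# Rough anomaly-detection sketch: learn what the abundant negative class looks
# like, then flag anything unusual as a candidate positive. IsolationForest is
# used here as a stand-in for the neural-network detector described above;
# X_negative and X_new are assumed feature arrays.
from sklearn.ensemble import IsolationForest

detector = IsolationForest(contamination=0.001, random_state=0)
detector.fit(X_negative)  # train on negative examples only

# predict() returns -1 for "looks different from normal" (candidate positive)
# and +1 for "looks like the negatives it was trained on".
flags = detector.predict(X_new)
candidate_positives = X_new[flags == -1]
```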