There's something I don't understand about neural networks. I've tried to use them with financial data analysis and audio pitch classification. In both cases, I need a classifier that can detect the significant item from among the many. My audio application literally has one positive hit for every thousand negative hits. I run the network trainer, and it learns that it's a pretty darn fine guess to just go with the negative. Is there some other algorithm for detecting the rare gem? Is there some form of neural network training that is especially suited for this type of problem? I can change the range on my positive data to be equivalent to the sum of the negative values, but I don't understand how that fits in with the preferred range of zero to one on the typical neural network.
Here are two possible suggestions:
Balance your training set
Even if the real-world data contains 1000x as many negatives as positives, your training data does not have to. You can modify your training data set to increase the proportion of positives in your training set. That will improve the recall (more true positives), but also worsen the precision (also more false positives). So, you'd have to experiment with the ideal proportion of positives to negatives in the training set.
This paper discusses this approach in more detail: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2243711/pdf/procamiasymp00003-0260.pdf
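As a rough illustration of rebalancing, here is a minimal sketch that oversamples the positives (with replacement) until they reach a chosen share of the training set. The function name `rebalance`, the `positive_fraction` parameter, and the toy data are all made up for this example; the right proportion is something you would have to find by experiment.

```python
import random

def rebalance(samples, labels, positive_fraction=0.2, seed=0):
    """Oversample positives (with replacement) until they make up
    roughly `positive_fraction` of the returned training set."""
    rng = random.Random(seed)
    positives = [s for s, y in zip(samples, labels) if y == 1]
    negatives = [s for s, y in zip(samples, labels) if y == 0]
    if not positives:
        raise ValueError("need at least one positive example")
    # Solve pos / (pos + neg) = positive_fraction for pos.
    target = round(positive_fraction * len(negatives) / (1.0 - positive_fraction))
    target = max(target, len(positives))
    # Draw the missing positives by repeating existing ones at random.
    extra = [rng.choice(positives) for _ in range(target - len(positives))]
    data = [(s, 1) for s in positives + extra] + [(s, 0) for s in negatives]
    rng.shuffle(data)
    return data

# Toy data (hypothetical): one positive per thousand samples.
samples = list(range(5000))
labels = [1 if i % 1000 == 0 else 0 for i in samples]
balanced = rebalance(samples, labels, positive_fraction=0.2)
n_pos = sum(y for _, y in balanced)
print(n_pos, len(balanced))
```

Keep the test set at the real-world proportions, otherwise the measured precision will look better than what you get in production.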
Anomaly detection
... on the other hand, if you have too few positive examples to train the neural network with a more balanced training set, then perhaps you could try anomaly detection. With anomaly detection, you train your algorithm (e.g., a neural network) to recognize what negative data points look like. Then, any data point that looks different than normal gets flagged as positive.
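To make the idea concrete, here is a minimal sketch that swaps the neural network for a much simpler model: it learns the per-feature mean and standard deviation of the negatives and flags anything far from that profile. The function names, the toy data, and the threshold of 4.0 are assumptions for illustration, not a recommendation.

```python
import math
import random
import statistics

def fit_normal_profile(negatives):
    """Per-feature mean and standard deviation of the 'normal' (negative) data."""
    columns = list(zip(*negatives))
    means = [statistics.fmean(c) for c in columns]
    # Guard against a zero standard deviation for constant features.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return means, stds

def anomaly_score(x, profile):
    """Euclidean norm of the per-feature z-scores of x."""
    means, stds = profile
    return math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds)))

# Toy negatives (hypothetical): two roughly Gaussian features.
rng = random.Random(0)
negatives = [(rng.gauss(0, 1), rng.gauss(5, 2)) for _ in range(1000)]
profile = fit_normal_profile(negatives)

THRESHOLD = 4.0  # assumed cut-off; tune it on held-out data

def is_positive(x):
    """Flag a point as positive when it looks unlike the negatives."""
    return anomaly_score(x, profile) > THRESHOLD

print(is_positive((10.0, 5.0)), is_positive((0.0, 5.0)))
```

This only uses negative examples for fitting, so the few positives you do have can be kept entirely for choosing the threshold and for evaluation.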
I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the entire data set: samples with target ranging from 1-3 are quite underrepresented (only account for around 5% of the data). However, they are of big interest, since the low range of the target is kind of the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
So far, I have used typical metrics like MAE, MSE and RMSE and got satisfying results. I would however like to know how the model performs on the "critical" samples.
From my point of view, I would first compute the test metrics (MSE, RMSE) on the whole test set, which covers the whole range of values (1-10). Then, of course, I would compute them separately on the specific range that you consider critical (let's say 1-3) and compare the two results. You can even test whether the difference between the two populations is statistically significant (Wilcoxon tests etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can also compare the MSE and RMSE.
What you need to do is find identifiers for these critical samples. Oftentimes row indices are used for this. Once you have predicted all of your samples, use those stored indices to find the critical samples in your predictions and run whatever metric you like over those filtered samples. I hope this answers your question.
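The filtering described above can be sketched in a few lines. This is a minimal example with made-up numbers; `metrics_by_range` is a hypothetical helper name, and the targets and predictions are toy values, not results from any real model.

```python
import math

def metrics_by_range(y_true, y_pred, low, high):
    """MAE and RMSE computed only on samples whose true target lies in [low, high]."""
    idx = [i for i, t in enumerate(y_true) if low <= t <= high]
    if not idx:
        raise ValueError("no samples in the requested range")
    errors = [y_pred[i] - y_true[i] for i in idx]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"n": len(idx), "mae": mae, "rmse": rmse}

# Toy targets and predictions (hypothetical values).
y_true = [1, 2, 3, 5, 7, 9, 10]
y_pred = [2.5, 3.0, 3.5, 5.2, 6.9, 9.1, 9.8]

overall = metrics_by_range(y_true, y_pred, 1, 10)   # whole range
critical = metrics_by_range(y_true, y_pred, 1, 3)   # critical range only
print(overall)
print(critical)
```

In this toy case the overall error looks fine while the error on the 1-3 range is clearly worse, which is exactly the situation the global metrics would hide.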
I am using a Keras sequential model for binary classification, but my data is unbalanced. I have 2 feature columns and 1 output column (1/0). I have 10000 rows of data, of which only 20 have output 1; all the others are 0. I then extended the data size to 40000, and still only 20 rows have output 1. Since the data is unbalanced (0 dominates 1), which neural network will be better for correct prediction?
First of all, two features is a really small number. Neural networks are highly non-linear models with a very large number of degrees of freedom, so if you try to train a network with more than just a couple of neurons it will overfit even with balanced classes. For such a small dimensionality you can find more suitable models, like Support Vector Machines, in the scikit-learn library.
Now, about unbalanced data: the most common techniques are undersampling and oversampling. Undersampling basically means training your model several times on a fraction of the dataset that contains the non-dominant class plus a random sample of the dominant class, so that the ratio is acceptable, whereas oversampling consists of generating artificial data to balance the classes. In most cases undersampling works better.
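A minimal sketch of the undersampling scheme described above: every subset keeps all minority samples and draws a fresh random sample of the majority class. The function name, the `ratio` parameter, and the toy data are assumptions for illustration; label 1 is assumed to be the minority class.

```python
import random

def undersampled_subsets(samples, labels, n_subsets=5, ratio=1.0, seed=0):
    """Yield training subsets that keep every minority sample (label 1)
    and a random draw of majority samples (label 0).
    `ratio` is the number of majority samples per minority sample."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == 1]
    majority = [s for s, y in zip(samples, labels) if y == 0]
    k = min(len(majority), int(ratio * len(minority)))
    for _ in range(n_subsets):
        picked = rng.sample(majority, k)  # sampled without replacement
        subset = [(s, 1) for s in minority] + [(s, 0) for s in picked]
        rng.shuffle(subset)
        yield subset

# Toy data (hypothetical): 10000 rows with 2 features, 20 positives.
rng = random.Random(1)
samples = [(rng.random(), rng.random()) for _ in range(10000)]
labels = [1 if i < 20 else 0 for i in range(10000)]

subsets = list(undersampled_subsets(samples, labels, n_subsets=5, ratio=1.0))
print(len(subsets), len(subsets[0]))
```

You would train one model per subset and then combine their predictions, for example by averaging or voting, so that the discarded majority samples still contribute across the ensemble.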
Also when working with unbalanced data it's quite important to choose the right metric based on what is more important for the problem (is minimizing false positives more important than false negatives, etc).
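To see why the metric matters, here is a small sketch that computes precision, recall and F-beta by hand. The helper name and the toy predictions are made up; `beta` greater than 1 weights recall (missed positives) more heavily, and `beta` less than 1 weights precision (false alarms) more heavily.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Precision, recall and F-beta for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

# Toy labels (hypothetical): 10 positives among 1000 samples.
y_true = [1] * 10 + [0] * 990

# A classifier that always answers "negative".
always_negative = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)

# A classifier that finds 8 of the 10 positives at the cost of 12 false alarms.
useful = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

print(accuracy, precision_recall_fbeta(y_true, always_negative))
print(precision_recall_fbeta(y_true, useful, beta=2.0))
```

The always-negative classifier scores high on accuracy while its recall is zero, which is why accuracy alone is a poor guide on unbalanced data.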
I have programmed a Keras neural network to train on sequences. Does the choice of the number of LSTM units in Keras depend on the length of the sequence?
There isn't a set way of determining how many units you should have based on your input.
More units are a way of making the model more complex. Generally speaking, if the look back period for your neural network is longer, then you have more features to train on, which means a more complex model would be better suited for learning your data.
Personally, I like to use the number of timesteps in each sample as my number of units, and I decrease this number as I move deeper into the network.
I encountered this problem when I designed a sports betting prediction engine with an LSTM RNN.
There's a rule of thumb that helps for supervised learning problems. Please check this link.
But in my opinion, there is still no correct method or formula for calculating the number of neurons per layer and the number of hidden layers from the training dataset.
If I am training a sentiment classifier off of a tagged dataset where most documents are negative, say ~95%, should the classifier be trained with the same distribution of negative comments? If not, what would be other options to "normalize" the data set?
You don't say what type of classifier you have, but in general you don't have to normalize the distribution of the training set. Usually, the more data the better, but you should always do blind tests to detect over-fitting.
In your case you will have a strong classifier for negative comments and unless you have a very large sample size, a weaker positive classifier. If your sample size is large enough it won't really matter since you hit a point where you might start over-fitting your negative data anyway.
In short, it's impossible to say for sure without knowing the actual algorithm and the size of the data sets and the diversity within the dataset.
Your best bet is to carve off something like 10% of your training data (randomly) and just see how the classifier performs after being trained on the 90% subset.
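A minimal sketch of such a random hold-out split, using only the standard library. The function name and the toy documents are made up for the example; the 95% negative share mirrors the question.

```python
import random

def holdout_split(data, holdout_fraction=0.1, seed=0):
    """Shuffle a copy of the data and carve off a random hold-out set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(holdout_fraction * len(shuffled)))
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Toy tagged documents (hypothetical): (document id, label), ~95% negative.
documents = [(i, "positive" if i % 20 == 0 else "negative") for i in range(1000)]
train, test = holdout_split(documents, holdout_fraction=0.1)
print(len(train), len(test))
```

With a class this rare, check how many positives actually landed in the hold-out set before trusting the numbers; a purely random 10% may contain only a handful.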
If you had a training set containing instances of various classes and it was highly imbalanced, what strategy would you use to balance it?
Information about real-world population: 7 classes whereof the smallest accounts for 5%.
Information about training set: frequencies differ largely from the population's frequencies.
Here are two options:
Bias it to the population's class frequencies.
Bias it to a uniform distribution.
By biasing I mean something like SMOTE or cost-sensitive classification.
I am unsure which strategy to follow. I am also open to other suggestions. How would you evaluate the success of the strategy?
As you mentioned, for training you have two options: either balance your data set (this works if you have a very large amount of data and/or a small number of features, so that throwing away some samples won't affect learning), or use different weights for different classes, according to their frequencies. The latter is typically straightforward to do, but the details depend on the method and library you choose.
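For the weighting option, one common heuristic (an assumption here, not the only choice) is to weight each class inversely to its frequency: n_samples / (n_classes * class_count). A minimal sketch with a made-up helper name and toy labels:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rare classes count more in the loss."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (hypothetical): 95% class 0, 5% class 1.
labels = [0] * 95 + [1] * 5
weights = inverse_frequency_weights(labels)
print(weights)
```

How the resulting dictionary is passed to the learner depends on the library; in Keras, for instance, a mapping like this is what the `class_weight` argument of `model.fit` expects.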
Once you have your classifier trained (with some prior given by your training set), you can easily update the predicted probabilities if your priors change (different class frequencies in training and in the population). There is an excellent overview of how to replace the prior information that explains it better than I could in a short post. Take a look at Combining probabilities, Section 3 (Replacing prior information).
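As a rough sketch of the idea, the usual Bayes-rule correction divides each predicted probability by the class prior seen in training, multiplies by the new prior, and renormalizes. This is the standard reweighting and may differ in notation from the referenced overview; the function name and numbers are made up for the example, and it assumes the classifier outputs reasonably calibrated probabilities.

```python
def replace_prior(probs, train_prior, new_prior):
    """Re-weight predicted class probabilities from the training prior
    to a new prior, then renormalize so they sum to 1."""
    adjusted = [p * new / old for p, old, new in zip(probs, train_prior, new_prior)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Hypothetical case: trained on a balanced set, deployed where
# the second class makes up only 5% of the population.
predicted = [0.3, 0.7]
train_prior = [0.5, 0.5]
population_prior = [0.95, 0.05]

corrected = replace_prior(predicted, train_prior, population_prior)
print(corrected)
```

A prediction that looked like a fairly confident positive on the balanced training distribution becomes a clear negative once the rarity of the class in the population is taken into account.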