Multi-feature modeling with a binary target that is rarely 1 (imbalanced data) when there is a cost - python-3.x

I need to model multivariate time-series data to predict a binary target that is rarely 1 (imbalanced data).
In other words, the model is built around one binary feature (outbreak) that is rarely 1.
All of the features are binary and rarely 1.
What is the suggested solution?
This feature enters the cost function given below. We want to decide, for each day, whether to prepare or not, given that cost.
Problem definition:
Model the outbreak variable, which is rarely 1.
We either prepare or do not prepare to avoid a disease outbreak; the cost of an outbreak is 20 times the cost of preparation.
Cost of each day (the next day):
cost = 20 * outbreak * !prepared + prepared
Model: for which days should we prepare (preparation is for the next day) for an outbreak?
Questions:
Build a model to predict outbreaks.
Report the cost estimate for every year.
A CSV file is uploaded; the data is recorded at the end of each day.
Each row of the CSV file is one day with its features. Some of the features are binary, and the last one is outbreak, which is rarely 1 and is the main feature in the cost.

You are describing class imbalance.
A typical approach is to generate balanced training data
by repeatedly running through examples containing
your (rare) positive class,
and each time choosing a new random sample
from the negative class.
Also, pay attention to your cost function.
You wouldn't want to reward a simple model
for always choosing the majority class.
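
The resampling idea above can be sketched as follows. This is a minimal illustration, assuming binary labels in a NumPy array and more negatives than positives; the function name balanced_batches and the toy labels are hypothetical, not part of the question.

```python
import numpy as np

def balanced_batches(y, n_rounds, rng):
    """Yield index arrays: every positive plus a fresh random sample of negatives."""
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    for _ in range(n_rounds):
        # Draw as many negatives as there are positives, without replacement.
        sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        yield np.concatenate([pos_idx, sampled_neg])

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)  # toy labels: 5% positives
batches = list(balanced_batches(y, n_rounds=3, rng=rng))
```

Each batch can then be used to fit one model (or one training pass), so that over many rounds most of the negative examples are seen while every round stays balanced.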

My suggestions:
Supervised Approach
SMOTE for upsampling
XGBoost, tuning scale_pos_weight
Replicate the minority class, e.g. 10 times
Try ensemble tree algorithms; trying to fit a linear decision surface is risky in your case.
Since your data is a time series, you can generate days with the minority class just before the real outbreak happened. For example, suppose you have a minority-class observation at 2010-07-20 and the last observation before that is at 2010-06-27. You can generate observations, by slightly changing the variance, at 2010-07-15, 2010-07-18, etc.
Unsupervised Approach
Try anomaly detection algorithms such as IsolationForest (try the extended version of it as well).
Cluster your observations and check whether the minority class forms a cluster of its own. If it does, you can label your data with the cluster names (cluster1, cluster2, cluster3, etc.) and then train a decision tree to see the split patterns (KMeans + DecisionTreeClassifier).
Model Evaluation
Set up a cost matrix. Do not rely directly on confusion-matrix metrics such as precision. You can find further information about cost matrices here: http://mlwiki.org/index.php/Cost_Matrix
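
As a rough sketch of cost-based evaluation, the cost from the question (20 for an unprepared outbreak, 1 for preparing) can be scored directly. The constant and function names below are hypothetical, and the decision rule assumes the model outputs a calibrated outbreak probability, which may not hold in practice.

```python
import numpy as np

OUTBREAK_COST = 20.0  # cost of an outbreak when not prepared (from the question)
PREP_COST = 1.0       # cost of preparing for a day (from the question)

def daily_cost(outbreak, prepared):
    """cost = 20 * outbreak * (1 - prepared) + prepared, element-wise."""
    outbreak = np.asarray(outbreak)
    prepared = np.asarray(prepared)
    return OUTBREAK_COST * outbreak * (1 - prepared) + PREP_COST * prepared

def decide(p_outbreak):
    """Prepare when the expected cost of not preparing exceeds the cost of preparing."""
    return (np.asarray(p_outbreak) * OUTBREAK_COST > PREP_COST).astype(int)

# All four (outbreak, prepared) combinations.
costs = daily_cost([0, 0, 1, 1], [0, 1, 0, 1])
decisions = decide([0.01, 0.04, 0.06, 0.5])
```

Under these assumptions the break-even probability is 1/20 = 0.05, so the model is judged on total cost rather than on accuracy or precision.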
Note:
Following the OP's question in the comments, grouping by year could be done like this:
import pandas as pd
df["date"] = pd.to_datetime(df["date"])
df.groupby(df["date"].dt.year).mean()
You can also use other aggregators (sum, count, etc.)
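
For the yearly cost report asked for in the question, a minimal sketch could combine the cost formula with a group-by on the year. The column names outbreak and prepared and the toy rows are assumptions; the real CSV layout is not shown in the question.

```python
import pandas as pd

# Toy data standing in for the uploaded CSV (hypothetical values).
df = pd.DataFrame({
    "date": ["2010-07-19", "2010-07-20", "2011-01-05", "2011-03-02"],
    "outbreak": [0, 1, 0, 1],
    "prepared": [1, 0, 0, 1],
})
df["date"] = pd.to_datetime(df["date"])

# Daily cost from the question: 20 * outbreak * (not prepared) + prepared.
df["cost"] = 20 * df["outbreak"] * (1 - df["prepared"]) + df["prepared"]

# Total cost per calendar year.
yearly_cost = df.groupby(df["date"].dt.year)["cost"].sum()
```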

Related

Neural network regression evaluation based on target range

I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the entire data set: samples with target ranging from 1-3 are quite underrepresented (only account for around 5% of the data). However, they are of big interest, since the low range of the target is kind of the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
So far I have used typical metrics like MAE, MSE, and RMSE and get satisfying results. I would, however, like to know how the model performs on the "critical" samples.
From my point of view, I would compute the test metrics (classification performance, MSE, RMSE) on the whole test set, which covers the whole range of values (1-10). Then I would compute them separately on the specific range that you consider critical (say 1-3) and compare how the two populations diverge. You can even test the statistical significance of the difference between the two populations (Wilcoxon tests, etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can also compare MSE and RMSE.
What you need to do is find identifiers for these critical samples. Often row indices are used for this. Once you have predicted all of your samples, use those stored indices to find the critical samples in your predictions and run whatever automatic metric you like over those filtered samples. I hope this answers your question.
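
A minimal sketch of that filtering, assuming the true targets and predictions are NumPy arrays and that "critical" means a true target of at most 3; the numbers are made up for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy test-set targets and model predictions (hypothetical values).
y_true = np.array([1.5, 2.0, 5.0, 7.0, 9.0, 2.5])
y_pred = np.array([2.5, 3.0, 5.5, 6.5, 9.0, 4.0])

# Boolean mask identifying the critical low-range samples.
critical = y_true <= 3.0

overall_mae = mae(y_true, y_pred)
critical_mae = mae(y_true[critical], y_pred[critical])
```

Comparing overall_mae with critical_mae shows whether a good global score hides poor performance in the range of interest.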

How to combine LIBSVM probability estimates from two (or three) two class SVM classifiers.

I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An Inverse-variance weighted average (since each task is repeated a few times and sample variance (VARe, VARm and VARd) can be calculated - in which case Pe would be a simple average of all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (( 1/VARe ) + ( 1/VARm ) + ( 1/VARd ))
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication would provide a very low number, so it's unclear how to interpret that as an overall probability when the results of the voting are very clear.
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the outputs from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer neural net where your inputs represent the inputs to the intermediates, the (first) hidden layer represents the outputs of the intermediate problem, and subsequent layer(s) represent the final decision you want. This way you get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate that you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
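
To make the latent-variable idea concrete, here is a rough sketch of inference over i on a grid. Everything specific in it is an assumption for illustration: a standard normal prior on i, a logistic pass probability sigmoid(i - hardness), the hardness values, and the threshold i > 0 standing in for "Yes".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posterior_over_skill(outcomes, hardness, grid):
    """Grid posterior over latent skill i given pass/fail outcomes per test."""
    log_post = -0.5 * grid ** 2  # assumed standard normal prior (unnormalised)
    for passed, h in zip(outcomes, hardness):
        p_pass = sigmoid(grid - h)  # assumed logistic link with a hardness offset
        log_post += np.log(p_pass if passed else 1.0 - p_pass)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

grid = np.linspace(-4.0, 4.0, 161)
hardness = [-1.0, 0.0, 1.0]  # hypothetical: easy, medium, difficult

post_all_pass = posterior_over_skill([1, 1, 1], hardness, grid)
post_all_fail = posterior_over_skill([0, 0, 0], hardness, grid)

# Probability that the person is above the (assumed) "Yes" threshold i > 0.
p_yes_all_pass = float(post_all_pass[grid > 0].sum())
p_yes_all_fail = float(post_all_fail[grid > 0].sum())
```

The overall probability here comes from the posterior over i, so the ordering of the tests by difficulty is used explicitly instead of being averaged away.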
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.

Training Set Balancing Strategies

If you had a training set containing instances of various classes and it was highly imbalanced, what strategy would you use to balance it?
Information about the real-world population: 7 classes, of which the smallest accounts for 5%.
Information about the training set: its class frequencies differ greatly from the population's frequencies.
Here are two options:
Bias it to the population's class frequencies.
Bias it to a uniform distribution.
By biasing I mean something like SMOTE or cost-sensitive classification.
I am unsure which strategy to follow. I am also open to other suggestions. How would you evaluate the success of the strategy?
As you mentioned, for training you have two options: either balance your data set (this works if you have a very large amount of data and/or a small number of features, so that throwing away some samples won't affect learning), or use different weights for different classes, according to their frequencies. The latter is typically straightforward to do, but depends on the method and library you choose.
Once you have your classifier trained (with some prior on your training set), you can easily update the prediction probabilities if your priors change (different frequencies in training and population). There is an excellent overview of how to replace the prior information that explains it better than I could in a short post. Take a look at Combining probabilities, Section 3 (Replacing prior information).
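
A minimal sketch of one standard way to do this, which is not necessarily the exact procedure in the referenced overview: by Bayes' rule, each posterior is rescaled by the ratio of new prior to training prior and then renormalised. The function name replace_prior and the numbers are hypothetical.

```python
import numpy as np

def replace_prior(posteriors, train_prior, new_prior):
    """Rescale class posteriors from the training prior to a new prior.

    posteriors: array of shape (n_samples, n_classes), rows summing to 1.
    """
    weights = np.asarray(new_prior) / np.asarray(train_prior)
    adjusted = np.asarray(posteriors) * weights
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Classifier trained on a balanced set, deployed where class 0 is 90% of cases.
posteriors = np.array([[0.5, 0.5], [0.2, 0.8]])
adjusted = replace_prior(posteriors, train_prior=[0.5, 0.5], new_prior=[0.9, 0.1])
```

A sample the balanced classifier was undecided about (0.5 / 0.5) moves to 0.9 / 0.1 once the population prior is applied.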

Online learning with Naive Bayes Classifier

I am trying to predict the inter-arrival time of incoming network packets. I measure the inter-arrival times of network packets and represent this data as binary features: xi= 0,1,1,1,0,... where xi=0 if the inter-arrival time is less than a break-even time and 1 otherwise. The data has to be mapped into two possible classes C={0,1}, where C=0 represents a short inter-arrival time and C=1 represents a long inter-arrival time. I want to implement the classifier in an online fashion: as soon as I observe a feature vector xi=0,1,1,0..., I calculate the MAP class. Since I don't have prior estimates of the conditional and prior probabilities, I initialize them as follows:
p(x=0|c=0)=p(x=1|c=0)=p(x=0|c=1)=p(x=1|c=1)=0.5
p(c=0)=p(c=1)=0.5
For each feature vector (x1=m1,x2=m2,...,xn=mn), when I output a class C, I update the conditional and prior probabilities as follows:
p(xi=mi|y=c)=a+(1-a)*p(xi=mi|y=c)
p(y=c)=b+(1-b)*p(y=c)
The problem is that I always get a biased prediction. Since long inter-arrival times are comparatively rarer than short ones, the posterior of the short class always remains higher than that of the long class. Is there any way to improve this, or am I doing something wrong? Any help will be appreciated.
Since you have a long time series, the best path would probably be to take into account more than a single previous value. The standard way of doing this would be to use a time window, i.e. split the long vector Xi into overlapping pieces of a constant length, with the last value treated as the class, and use them as the training set. This could also be done on streaming data in an online manner, by incrementally updating the NB model with new data as it arrives.
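
The windowing step can be sketched as below, assuming the series is a flat array of the binary values; make_windows and the window width are hypothetical choices.

```python
import numpy as np

def make_windows(series, width):
    """Split a series into overlapping windows of `width` features plus one label.

    Row k of X holds series[k : k + width]; y[k] is the value that follows it.
    """
    series = np.asarray(series)
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X, y

series = [0, 1, 1, 1, 0, 0, 1]
X, y = make_windows(series, width=3)
```

In a streaming setting the same function can be applied to the most recent width + 1 values each time a new observation arrives, and the resulting (features, label) pair used to update the model.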
Note that with this method, other regression algorithms might end up being a better choice than NB.
Weka (version 3.7.3 and up) has a very nice dedicated tool supporting time-series analysis. Alternatively, MOA is also based on Weka and supports modeling of streaming data.
EDIT: it might also be a good idea to move from binary features to the real values (maybe normalized) and apply the threshold post-classification. This might give more information to the regression model (NB or other), allowing better accuracy.

Training set - proportion of pos / neg / neutral sentences

I am hand-tagging Twitter messages as Positive, Negative, or Neutral. I am trying to understand whether there is some logic one can use to decide what proportion of the messages in the training set should be positive, negative, and neutral.
So, for example, if I am training a Naive Bayes classifier with 1000 Twitter messages, should the proportion of pos : neg : neutral be 33% : 33% : 33%, or should it be 25% : 25% : 50%?
Logically, it seems to me that if I train with more samples for neutral, the system would be better at identifying neutral sentences than at telling whether they are positive or negative. Is that true, or am I missing some theory here?
Thanks
Rahul
The problem you're referring to is known as the imbalance problem. Many machine learning algorithms perform badly when confronted with imbalanced training data, i.e. when the instances of one class heavily outnumber those of the other class. Read this article to get a good overview of the problem and how to approach it. For techniques like naive Bayes or decision trees it is always a good idea to balance your data somehow, e.g. by random oversampling (explained in the referenced paper). I disagree with mjv's suggestion to have a training set match the proportions in the real world. This may be appropriate in some cases, but I'm quite confident it's not in your setting. For a classification problem like the one you describe, the more the sizes of the class sets differ, the more trouble most ML algorithms will have discriminating the classes properly. However, you can always use the information about which class is the largest in reality by taking it as a fallback: when the classifier's confidence for a particular instance is low, or the instance couldn't be classified at all, you assign it the largest class.
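
Both ideas, random oversampling and falling back to the largest real-world class, can be sketched as follows. The function names, the 0.5 confidence threshold, and the choice of "neu" as the largest class are assumptions for illustration only.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Resample every class (with replacement) up to the size of the largest one."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Only draw extra copies for classes smaller than the target size.
        extra = rng.choice(idx, size=target - n, replace=True) if target > n else idx[:0]
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

def predict_with_fallback(proba, classes, fallback, min_conf):
    """Pick the most probable class, or `fallback` when confidence is too low."""
    labels = classes[proba.argmax(axis=1)]  # fancy indexing returns a copy
    labels[proba.max(axis=1) < min_conf] = fallback
    return labels

rng = np.random.default_rng(0)
X = np.arange(12).reshape(6, 2)
y = np.array(["pos", "neu", "neu", "neu", "neu", "neg"])
X_bal, y_bal = random_oversample(X, y, rng)

classes = np.array(["neg", "neu", "pos"])
proba = np.array([[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]])
labels = predict_with_fallback(proba, classes, fallback="neu", min_conf=0.5)
```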
One further remark: finding the positivity/negativity/neutrality in Twitter messages seems to me to be a question of degree. As such, it may be viewed as a regression rather than a classification problem, i.e. instead of a three-class scheme you may want to calculate a score that tells you how positive/negative the message is.
There are many other factors... but an important one (in determining a suitable ratio and volume of training data) is the expected distribution of each message category (Positive, Neutral, Negative) in the real world. Effectively, a good baseline for the training set (and the control set) is one that is
[qualitatively] as representative as possible of the whole "population"
[quantitatively] big enough that measurements made from such sets are statistically significant.
The effect of the [relative] abundance of a certain category of messages in the training set is hard to determine; it is in any case a lesser factor, or rather one that is highly sensitive to other factors. Improvements in the accuracy of the classifier, as a whole or with regard to a particular category, are typically tied more to the specific implementation of the classifier (e.g. is it Bayesian, what are the tokens, are noise tokens eliminated, is proximity a factor, are we using bi-grams, etc.) than to purely quantitative characteristics of the training set.
While the above is generally factual, it is only moderately helpful for selecting the training set's size and composition. There are, however, ways of determining, post facto, when an adequate size and composition of training data has been supplied.
One way to achieve this is to introduce a control set, i.e. one that is manually labeled but not part of the training set, and to measure, for different test runs with various subsets of the training set, the recall and precision obtained for each category (or some similar accuracy measurements) when classifying the control set. When these measurements no longer improve or degrade beyond what is statistically representative, the size and composition of the training [sub-]set is probably the right one (unless it is an over-fitting set :-(, but that's another issue altogether...).
This approach implies that one uses a training set that could be 3 to 5 times the size of the training subset effectively needed, so that one can build, randomly (within each category), many different subsets for the various tests.
