Scaling data real time for LibSVM - svm

I am using LibSVM to classify data. I train and test the classifier with linearly scaled feature data on the interval [-1 1]. After establishing a model which produces acceptable accuracy, I want to classify new data which arrives periodically, almost in real time.
I don't know how to rescale the feature columns of the 'real time' data to [-1, 1], since each new input yields only one row of features. I could store the min/max values of the training/testing feature columns and use them to scale new data. However, if a new real-time value falls outside that min/max range, the model would seem to be no longer valid: I would have to rescale all prior data to accommodate the new min/max and generate a new model.
I have thought about using other scaling techniques such as mean normalization, but I have read that SVM works particularly well with linearly scaled features so I am hesitant to apply another methodology.
How does one deal with the rescaling of new features to a linear interval, when the new features are a single row vector, and could have higher/lower feature values than the max/min feature values used in rescaling the training data?
This is the equation I'm using to rescale the training/testing feature set.
Even if one were to use another feature scaling technique (such as mean normalization), would it be prudent to recalculate the mean, min and max over ALL (new, test and train) data before rescaling with each additional 'real time' classification? Or is it acceptable to use the stored scaling values from training/testing for new samples until the classifier is retrained to account for all the newly acquired data?
All in all, I think what I'm having trouble with is: how does one deal with linear feature scaling in an 'online' classification problem?
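A minimal sketch of the stored-parameter approach described in the question, assuming NumPy; the helper names fit_minmax and scale_to_unit_interval and the toy numbers are hypothetical. The min/max are computed once from the training data and reused for every new row. A new value outside the training range then simply maps outside [-1, 1]; clipping it back is an optional choice shown here for illustration, not something the question prescribes. (LibSVM's svm-scale tool follows the same idea: it can save scaling parameters with -s and restore them with -r.)

```python
import numpy as np

def fit_minmax(X_train):
    """Return per-column (min, max) computed on the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.min(axis=0), X_train.max(axis=0)

def scale_to_unit_interval(x, col_min, col_max, clip=False):
    """Map x linearly so that col_min -> -1 and col_max -> +1."""
    x = np.asarray(x, dtype=float)
    # Guard against constant columns to avoid dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    scaled = 2.0 * (x - col_min) / span - 1.0
    return np.clip(scaled, -1.0, 1.0) if clip else scaled

# Toy training set: two features with ranges [0, 10] and [10, 30].
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
col_min, col_max = fit_minmax(X_train)

# A single new row; the first feature exceeds the training maximum.
new_row = np.array([12.0, 15.0])
scaled_row = scale_to_unit_interval(new_row, col_min, col_max)
clipped_row = scale_to_unit_interval(new_row, col_min, col_max, clip=True)

print(scaled_row)   # first feature lands above +1
print(clipped_row)  # first feature clipped to +1
```

If out-of-range values become frequent, that is a signal to refit the scaling parameters and retrain the model on the accumulated data.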

Related

Using discretization before or after splitting data?

I am new to data mining concepts and have a question regarding implementation of a technique.
I am using a dataset with large continuous values.
Now, I am trying to code an algorithm where I need to discretize the data (not scale it: scaling has no effect here, since the algorithm is not distance-based).
Now for discretization, I have a similar question with regards to scaling and train test split.
For scaling, I know we should split data and then fit transform the train and transform the test based on what we fit from train.
But what do we do for discretization? I am using scikit-learn's KBinsDiscretizer and trying to work out whether I should split first and then discretize, the same way we normally scale, or discretize first and then split.
The issue came up because I used 17 bins with the uniform strategy (0-16 value range).
With split then discretize, I get (0-16) range throughout in train but not in test.
With discretize and split, I get (0-16) range in both.
With the former strategy, my accuracy is around 85%, but with the latter it's a whopping 97%, which leads me to believe I have definitely overfit the data.
Please advise on what I should be doing for discretization and whether the data interpretation was correct.
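A minimal sketch of the split-then-discretize pattern, using plain NumPy as a stand-in for a uniform-strategy discretizer; the synthetic data, the to_bins helper, and the 150/50 split are assumptions for illustration. The bin edges are learned from the training split only, so the test split need not cover the full 0-16 range, which matches the behaviour observed above.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=15.0, size=200)  # synthetic feature
train, test = values[:150], values[150:]

n_bins = 17
# Uniform bin edges are learned from the training split only.
edges = np.linspace(train.min(), train.max(), n_bins + 1)

def to_bins(x, edges):
    # Using only the interior edges yields bin ids 0..n_bins-1; values
    # outside the training range fall into the first or last bin.
    return np.digitize(x, edges[1:-1])

train_bins = to_bins(train, edges)
test_bins = to_bins(test, edges)

print(train_bins.min(), train_bins.max())  # full 0..16 range on train
print(test_bins.min(), test_bins.max())    # may be narrower on test
```

Computing the edges on the full dataset before splitting lets the test values influence the binning, which is one plausible reason the second strategy scores higher.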

when to normalize data with zscore (before or after split)

I was taking a Udemy course, which made a strong case for normalizing only the train data (after the split from test data), since the model will typically be used on fresh data with features on the scale of the original set. And if you scale the test data, then you are not scoring the model properly.
On the other hand, what I found was that my two-class logistic regression model (created with Azure Machine Learning Studio) was getting terrible results after Z-Score scaling only the train data.
a. Is this a problem only with Azure's tools?
b. What is a good rule of thumb for when feature data needs to be scaled (one, two, or three orders of magnitude in difference)?
Not scoring the model properly due to normalized test set doesn't seem to make sense:
you would presumably also normalize data that you use for predictions in the future.
I found this similar question in datascience stackexchange and the top answer suggests not only that test data has to be normalized, but you need to apply the exact same scaling as you have done to the training data, because the scale of your data is also taken into account by your model: differently scaled test/prediction data would potentially lead to over/under-exaggeration of a feature.
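A minimal sketch of "the exact same scaling" for z-scores, assuming NumPy and synthetic data: the mean and standard deviation come from the training split only and are then reused, unchanged, on the test split (and on any future prediction data).

```python
import numpy as np

rng = np.random.default_rng(1)
# Two synthetic features on very different scales.
X = rng.normal(loc=[100.0, 0.5], scale=[20.0, 0.1], size=(500, 2))
X_train, X_test = X[:400], X[400:]

# Statistics are estimated on the training split only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# The same mu and sigma are applied to both splits.
X_train_z = (X_train - mu) / sigma
X_test_z = (X_test - mu) / sigma

print(X_train_z.mean(axis=0), X_train_z.std(axis=0))  # ~0 and ~1
print(X_test_z.mean(axis=0), X_test_z.std(axis=0))    # close, not exact
```

Scaling only the training data and then feeding unscaled test data to the model would present it with features on a scale it never saw, which is consistent with the terrible results described in the question.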

What is the difference between scale and fit in sklearn?

I am new to data science, and when I was going through one of the Kaggle blogs, I saw that the user is using both scale and fit on the data set. I tried to understand the difference by going through the documentation but was not able to understand it.
It's hard to understand the source of your confusion without any code. Inside the link you provided, the data is first scaled with sklearn.preprocessing.scale() and then fit to a sklearn.ensemble.GradientBoostingRegressor.
So the scaling operation transforms data such that all the features are represented on the same scale, and the fitting operation trains the model with the said data.
From your question it sounds like you thought these two operations were mutually exclusive, or somehow equivalent, but they are actually logical consecutive steps.
In general, before a model is trained, the data is preprocessed in some way (with .scale() in this case), and then the model is trained on it. In sklearn, the .fit() methods are for training (fitting functions/models to the data).
Hope it makes sense!
Scale is a data normalization technique. It is used when different features have very different ranges, for example one feature with values from 1 to 10 and another with values from 1000 to 10000.
Whereas fit is the function that actually starts your model training.
Scaling is conversion of data, a method used to normalize the range of independent variables or features of data. The fit method is a training step.
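A minimal sketch of the two consecutive steps, using the two scikit-learn pieces named in the first answer (sklearn.preprocessing.scale and GradientBoostingRegressor); the synthetic data and target are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Two features on very different scales (1-10 and 1000-10000).
X = np.column_stack([rng.uniform(1, 10, 100), rng.uniform(1000, 10000, 100)])
y = 3.0 * X[:, 0] + 0.001 * X[:, 1]  # made-up target

# Step 1: scaling is preprocessing. It only transforms the features.
X_scaled = scale(X)

# Step 2: fitting is training. It learns a model from the scaled data.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_scaled, y)

predictions = model.predict(X_scaled[:5])
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
print(predictions)
```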

Multi-features modeling based on one binary-feature which is rarely 1 (imbalanced data) when there is a cost

I need to model multivariate time-series data to predict a binary target which is rarely 1 (imbalanced data).
In other words, the target is a single binary feature (outbreak) that is rarely 1.
All of the features are binary and rarely 1.
What is the suggested solution?
The predictions affect the cost through the cost function given below. We want to decide, for each day, whether to be prepared or not prepared under that cost.
Problem Definition:
Model based on outbreak which is rarely 1.
Decide whether to be prepared or not prepared to avoid the outbreak of a disease; the cost of an outbreak is 20 times the cost of preparation.
Cost of each day (next day):
cost=20*outbreak*!prepared+prepared
Model: for which days should we prepare (prepare for the next day) for an outbreak?
Questions:
Build a model to predict outbreaks?
Report the cost estimation for every year
A CSV file is uploaded; the data is for the end of each day.
Each row of the CSV file is a day with its features, some of them binary; the last feature is outbreak, which is rarely 1 and is the main feature considered in the cost.
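A minimal sketch of the cost function stated above, in plain Python; the function names and the ten-day toy series are hypothetical. Under this cost, preparing costs 1 and an unprepared outbreak costs 20, so with a predicted outbreak probability p the expected cost of not preparing is 20p, and preparing is cheaper whenever p > 1/20.

```python
def daily_cost(outbreak: int, prepared: int) -> int:
    """cost = 20 * outbreak * !prepared + prepared, with 0/1 inputs."""
    return 20 * outbreak * (1 - prepared) + prepared

def total_cost(outbreaks, decisions):
    """Sum the daily cost over aligned outbreak and decision sequences."""
    return sum(daily_cost(o, p) for o, p in zip(outbreaks, decisions))

# Toy series: 10 days, outbreaks on 2 of them.
outbreaks = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

never_prepare = [0] * len(outbreaks)
always_prepare = [1] * len(outbreaks)
perfect_foresight = list(outbreaks)  # prepare exactly on outbreak days

print(total_cost(outbreaks, never_prepare))      # 40
print(total_cost(outbreaks, always_prepare))     # 10
print(total_cost(outbreaks, perfect_foresight))  # 2
```

The two constant strategies give useful baselines: a model is only worth having if its decisions cost less than the cheaper of "always prepare" and "never prepare".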
You are describing class imbalance.
Typical approach is to generate balanced training data
by repeatedly running through examples containing
your (rare) positive class,
and each time choosing a new random sample
from the negative class.
Also, pay attention to your cost function.
You wouldn't want to reward a simple model
for always choosing the majority class.
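A minimal sketch of the resampling scheme just described, in plain Python; the balanced_batches helper and the placeholder examples are hypothetical. Every round reuses all positive examples and draws a fresh random sample of negatives of the same size.

```python
import random

def balanced_batches(positives, negatives, n_rounds, seed=0):
    """Yield balanced (example, label) batches, one per round."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        # A new random subset of the majority class for each round.
        sampled_negatives = rng.sample(negatives, k=len(positives))
        batch = [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]
        rng.shuffle(batch)
        yield batch

positives = ["p1", "p2", "p3"]               # rare class
negatives = [f"n{i}" for i in range(100)]    # majority class

batches = list(balanced_batches(positives, negatives, n_rounds=4))
for batch in batches:
    print(sum(label for _, label in batch), "positives of", len(batch))
```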
My suggestions:
Supervised Approach
SMOTE for upsampling
XGBoost, tuning scale_pos_weight
Replicate the minority class, e.g. 10 times
Try ensemble tree algorithms; trying to fit a linear decision surface is risky in your case.
Since your data is a time series, you can generate minority-class days just before the real outbreak happened. For example, say you have a minority-class observation at 2010-07-20 and the last observation before it is at 2010-06-27. You can generate observations at 2010-07-15, 2010-07-18, etc. by adding slight variation.
Unsupervised Approach
Try anomaly detection algorithms such as IsolationForest (also try the extended version of it).
Cluster your observations and check whether the minority class forms a cluster of its own. If it does, you can label your data with cluster names (cluster1, cluster2, cluster3, etc.) and then train a decision tree to see the split patterns (KMeans + DecisionTreeClassifier).
Model Evaluation
Set up a cost matrix. Do not rely directly on confusion-matrix metrics such as precision. You can find further information about cost matrices here: http://mlwiki.org/index.php/Cost_Matrix
Note:
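According to the OP's question in the comments, grouping by year could be done like this:
import pandas as pd
# df is assumed to be the DataFrame loaded from the CSV, with a "date" column
df["date"] = pd.to_datetime(df["date"])
df.groupby(df["date"].dt.year).mean(numeric_only=True)
You can also use other aggregators (sum, count, etc.).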

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data, reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data, throw in the raw data and see if it works. If you have an idea of which properties of the data may be beneficial for classification, try to work them into a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
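A minimal sketch of the reshaping and of one engineered feature, assuming NumPy and synthetic swipe-like data of shape (n_samples, n_timepoints, 2); the mean-speed feature is a hypothetical illustration, not a recommendation for any particular dataset. Flattening gives the n_timepoints * 2 raw features described above, and the velocity is a simple difference along the time axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_timepoints, d = 6, 50, 2
# Synthetic 2D trajectories: cumulative sums of random steps.
series = rng.normal(size=(n_samples, n_timepoints, d)).cumsum(axis=1)

# Raw features: flatten each time series into one row.
X_raw = series.reshape(n_samples, n_timepoints * d)

# Instantaneous velocity: difference along the time axis.
velocity = np.diff(series, axis=1)  # (n_samples, n_timepoints - 1, d)

# One summary feature per sample: mean speed (norm of the velocity).
mean_speed = np.linalg.norm(velocity, axis=2).mean(axis=1, keepdims=True)

# Append the engineered feature as a further column to the raw data.
X = np.hstack([X_raw, mean_speed])
print(X_raw.shape, velocity.shape, X.shape)
```

The resulting two-dimensional X can be passed to svm.SVC().fit(X, y) together with one label per time series.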
If you have lots and lots and lots and lots of data (say an internet full of good examples), deep learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to trial and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.
