When using Standardize in H2O on New Data - modeling

I am curious how the Standardize option in an H2O model in R works when scoring new data.
I know that when it standardizes the training set it sets the mean to 0 and the standard deviation to 1 based on the training data, but what does it do with new data?
Does it standardize based on the training data's mean and standard deviation, or based on the new data being scored?

The score function applies the same mapping used to standardize the training data to the test dataset. This is handled automatically by H2O.
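The idea of "the same mapping" can be sketched in plain Python (the helper names below are illustrative, not part of H2O's API): the mean and standard deviation are learned once from the training data, and those stored parameters are reused for every new row.

```python
import statistics

def fit_standardizer(train_col):
    """Learn the mean and (sample) standard deviation from the training column."""
    mu = statistics.mean(train_col)
    sigma = statistics.stdev(train_col)
    return mu, sigma

def standardize(col, mu, sigma):
    """Apply the TRAINING-set mean/std to any column, old or new."""
    return [(x - mu) / sigma for x in col]

train = [2.0, 4.0, 6.0, 8.0]
mu, sigma = fit_standardizer(train)

# New data is standardized with the training mean/std, not its own:
new = [10.0, 0.0]
print(standardize(new, mu, sigma))
```

A value of 10.0 lands well above 1 here because it is extreme relative to the training distribution; that is exactly the behavior you want at scoring time.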

Related

Evaluate Model Node in Azure ML Studio does not take all the rows of the dataset in confusion matrix

I have this dataset in which the positive class consists of component failures for a specific component of the APS system.
I am doing Predictive Maintenance using Microsoft Azure Machine Learning Studio.
As you can see from the pictures below, I am using 4 algorithms: Logistic Regression, Random Forest, Decision Tree, and SVM. The output dataset of the Score Model node consists of 16k rows. However, in the Evaluate Model output, the confusion matrix contains only 160 observations for Logistic Regression, while Random Forest shows the correct number, 16k. I have the same problem, only 160 observations, with the Decision Tree and SVM models. The same issue recurs in other experiments (for example after feature selection, normalization, etc.): some Evaluate Model nodes use all the rows of the test dataset, and some do not.
How can I fix this problem? I am interested in the true number of false positives and false negatives.
The metrics shown (e.g. "validation metric", "val-accuracy") are computed on the validation set, not on the original training set. Evaluating only on held-out data is deliberate: otherwise we would inflate the model's apparent performance by scoring it on data already used to train it.

Pytorch How to normalize new records with regard to previous dataset?

I am trying to build a neural network using PyTorch. I am using sklearn's MinMaxScaler to normalize my dataset. But how do I normalize a new incoming record that I need to predict, with respect to the min/max values of my dataset?
scaler = MinMaxScaler()
scaler.fit_transform(file_x[list_of_features_to_normalize])
In order to use sklearn.preprocessing.MinMaxScaler you first need to fit the scaler to the values of your training data. This is done (as you already did) using
scaler.fit_transform(file_x[list_of_features_to_normalize])
After this fit, your scaling object scaler has its internal parameters (e.g. min_, scale_, etc.) tuned according to the training data.
Once training is done and you wish to evaluate your model on new records, you only need to apply the scaler without fitting it to the new data:
val_t = scaler.transform(validation_data)
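Under the hood, transform just reapplies the min/max learned during fit. A minimal sketch of the equivalent arithmetic for feature_range=(0, 1), with hypothetical helper names:

```python
def minmax_fit(train_col):
    # Learn the min and max from the TRAINING data only.
    return min(train_col), max(train_col)

def minmax_transform(x, lo, hi):
    # The same formula MinMaxScaler applies for feature_range=(0, 1).
    return (x - lo) / (hi - lo)

lo, hi = minmax_fit([10.0, 20.0, 30.0])
print(minmax_transform(25.0, lo, hi))  # → 0.75 (new record, training min/max)
print(minmax_transform(40.0, lo, hi))  # → 1.5 (out-of-range values fall outside [0, 1])
```

Note that a new value outside the training range simply maps outside [0, 1]; the scaler does not refit itself.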

How do you treat a new sample after training a model using sklearn preprocessing scale?

Assume I have a dataset X and labels Y for a supervised machine learning task.
Assume X has 10 features and 1,000 samples, and I believe it is appropriate to scale my data using sklearn.preprocessing.scale. I perform this operation and train my model.
I now wish to use the model on new data, so I collect a new sample of the 10 features of X and wish to use my trained model to classify it.
Is there an easy way to apply the same scaling that was performed on X before training my model to this single new sample, before attempting classification?
If not, is the only solution to have kept a copy of X before scaling, add my new sample to that data, scale the combined dataset, and then attempt classification on the new sample after it has been scaled via this process?
Use the class API instead of the function API, e.g. preprocessing.MinMaxScaler or preprocessing.StandardScaler.
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
The function scale provides a quick and easy way to perform this operation on a single array-like dataset.
The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.
Let's say you have the training dataset "training_dataset" and you did the following to scale it:
x__feature_scaler = MinMaxScaler(feature_range = (0, 1))
training_scaled_dataset = x__feature_scaler.fit_transform(training_dataset)
Use the same instance of MinMaxScaler to scale the new dataset. If your new dataset is "new_dataset", do the following:
new_scaled_dataset = x__feature_scaler.transform(new_dataset)
That way you will scale your new dataset to the same scale as your training dataset.
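In practice, keeping the fitted scaler around is the whole trick, so it is worth persisting it alongside the model. One common approach (a sketch, using a minimal stand-in class rather than sklearn itself) is to pickle the fitted object and reload it at scoring time:

```python
import pickle

class SimpleMinMaxScaler:
    """Minimal stand-in for sklearn's MinMaxScaler with feature_range=(0, 1)."""
    def fit(self, rows):
        # Learn per-feature min/max from the training rows.
        self.min_ = [min(col) for col in zip(*rows)]
        self.max_ = [max(col) for col in zip(*rows)]
        return self

    def transform(self, rows):
        # Reapply the stored training parameters to any rows.
        return [[(x - lo) / (hi - lo)
                 for x, lo, hi in zip(row, self.min_, self.max_)]
                for row in rows]

scaler = SimpleMinMaxScaler().fit([[0.0, 100.0], [10.0, 200.0]])

# Persist the fitted scaler alongside the model...
blob = pickle.dumps(scaler)

# ...and later restore it to scale a single new sample the same way.
restored = pickle.loads(blob)
print(restored.transform([[5.0, 150.0]]))  # → [[0.5, 0.5]]
```

With sklearn objects the same pattern applies (pickle, or joblib, the fitted scaler), so there is no need to keep the raw unscaled training data just to score one new sample.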

Scaling data real time for LibSVM

I am using LibSVM to classify data. I train and test the classifier with feature data linearly scaled onto the interval [-1, 1]. After establishing a model which produces acceptable accuracy, I want to classify new data which arrives periodically, almost in real time.
I don't know how to rescale the feature columns of the 'real time' data onto [-1, 1], since I'm only generating one row of features for this input data. I could store the min/max values of the training/testing feature columns in order to scale new data, but this raises a problem: if the new real-time data falls outside that min/max range, the model is no longer valid, and I would have to rescale all prior data to accommodate the new min/max and generate a new model.
I have thought about using other scaling techniques such as mean normalization, but I have read that SVM works particularly well with linearly scaled features so I am hesitant to apply another methodology.
How does one deal with the rescaling of new features to a linear interval, when the new features are a single row vector, and could have higher/lower feature values than the max/min feature values used in rescaling the training data?
This is the equation I'm using to rescale the training/testing feature set.
Even if one were to use another feature-scaling technique (such as mean normalization), would it be prudent, with each additional 'real time' classification, to recalculate the mean, min, and max over ALL data (new, test, and train) before rescaling? Or is it acceptable to use the stored scaling values from training/testing for new samples, until the classifier is retrained to account for all the newly acquired data?
All in all, I think what I'm having trouble with is: how does one deal with linear feature scaling in an 'online' classification problem?
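The stored-parameter approach discussed above can be sketched as follows. The helper names are hypothetical, and clamping out-of-range values to the interval edges is one pragmatic option (an assumption on my part, not something the question prescribes):

```python
def fit_linear_scaler(train_col):
    # Store the training min/max once; reuse them for every incoming row.
    return min(train_col), max(train_col)

def scale_to_unit_interval(x, lo, hi, clip=True):
    """Linearly map x from [lo, hi] onto [-1, 1]."""
    scaled = -1.0 + 2.0 * (x - lo) / (hi - lo)
    if clip:
        # Pragmatic choice: clamp out-of-range 'real time' values to the
        # interval edges instead of refitting the scaler on all data.
        scaled = max(-1.0, min(1.0, scaled))
    return scaled

lo, hi = fit_linear_scaler([0.0, 5.0, 10.0])
print(scale_to_unit_interval(5.0, lo, hi))   # → 0.0
print(scale_to_unit_interval(12.0, lo, hi))  # out of range, clipped to 1.0
```

Clipping keeps the stored model valid between retrainings; if out-of-range values arrive often, that is a signal the training distribution has drifted and a retrain (with fresh min/max) is due.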

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows of data, so each original row of data leads to 10 new samples. In addition, when training my model I would like to be able to perform cross validation as well.
The scoring function I have uses the original data to compute the score so I would like cross validation scoring to work on the original data as well rather than the generated one. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
add the feature extractor to a pipeline and feed it to, say, GridSearchCV, for example
implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection to a competition going on right now on Kaggle
Maybe you can use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) on the expanded samples, using the original sample index as the stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
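The core requirement here is that all 10 expanded rows derived from one original sample land in the same fold, so no original sample leaks across the train/test split. A plain-Python sketch of that grouping (hypothetical helper names):

```python
from collections import defaultdict

def group_folds(orig_ids, n_folds=5):
    """Assign each ORIGINAL sample id to one fold, so every expanded row
    derived from the same original sample lands in the same fold."""
    unique_ids = sorted(set(orig_ids))
    id_to_fold = {oid: i % n_folds for i, oid in enumerate(unique_ids)}
    folds = defaultdict(list)
    for row_idx, oid in enumerate(orig_ids):
        folds[id_to_fold[oid]].append(row_idx)
    return [folds[k] for k in sorted(folds)]

# 10 original samples, each expanded into 10 rows: row r comes from original r // 10.
orig_ids = [r // 10 for r in range(100)]
folds = group_folds(orig_ids, n_folds=5)
print([len(f) for f in folds])  # → [20, 20, 20, 20, 20]
```

sklearn's GroupKFold implements this same idea directly (pass the original sample index as the groups argument), which may be simpler than a custom splitter.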
