How to correctly scale new data points sklearn - scikit-learn

Imagine a simple regression problem, where you are using Gradient Descent. For correct implementation you will need to scale values using mean of entire training dataset. Imagine your model is already trained and you feed it another example you wish to predict. How do you scale it correctly with respect to previous dataset? Do you include new example in training set and then scale it with mean of this training dataset + new data points? How this should be done the right way?
By referring to new data points i mean something the model hasn't seen before, neither in training nor testing. How do you handle scaling for anything you pass to regr.predict() if scaling of training set is done with respect to the whole set and not a single observation.
Imagine you have ndarray of features:
to_predict = [10, 12, 1, 330, 1311, 225].
The dataset used for train and test is already oscillating around 0 for every feature. Taking into account the below answer (pseudo code, that's why I'm asking how to do it right):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
new_Xs = X_train.append(to_predict)
X_train_std_with_new = scalar.fit_transform(new_Xs)
scaled_to_predit = X_train_std_with_new[-1]
regr.predict(scaled_to_predict) ??

Related

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
my jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the inbuilt data-set in sklearn (breast cancer data-set) which is very similar to my data-set as both are binary classifications. Here I get an mean accuracy of 95 %. So I think right now that the problem might be my data-set. Can I get some help on how do I pre-process my data or any other steps that I might look into to improve accuracy.
Encode labels
Categorical data are variables that contain label values rather than numeric values.The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group etc. We will use Label Encoder to label the categorical data. Label Encoder is the part of SciKit Learn library in Python and used to convert categorical data, or text data, into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations. We need to bring all features to the same level of magnitudes. This can be achieved by scaling. This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use StandardScaler method from SciKit-Learn library.
#Feature Scalingfrom sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You kight also want to vhoose the appropriate model. You can't just use neural nets or so for all problems it's the no free luch theorem. For this you could use K-fold cross validation, AIC and BIC

Neural network regression with skewed data

I have been trying to build a machine learning model using Keras which predicts the radiation dose based on pre-treatment parameters. My dataset has approximately 2200 samples of which 20% goes into validation and testing.
The problem with the target variable is that it is very skewed since large radiation doses are much more rare than the small ones. Hence, I suspect that my regression model fails to predict the large values at all, and predicts everything around the mean, which is apparent from the figure. I have tried to log-normalise the target variable to make it more normally distributed, but it has had no effect.
Any suggestion how to fix this?
Target variable
Regression predictions
Computing individual sample weights based on 10 histogram bins helped in my case. See the code below:
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
hist, bin_edges = np.histogram(training_targets, bins = 10)
classes = training_targets.apply(lambda x: pd.cut(x, bin_edges, labels = False,
include_lowest = True)).values
sample_weights = compute_sample_weight('balanced', classes)

How do you treat a new sample after training a model using sklearn preprocessing scale?

Assume I have a dataset X and labels Y for a supervised machine learning task.
Assume X has 10 features and 1,000 samples and I believe that it is appropriate to scale my data using sklearn.preprocessing.scale. This operation is taken and I train my model.
I now wish to use it for model for new data, so I collect a new sample of the 10 features of X and wish to use my trained model to classify this sample.
Is there an easy way to apply the same scaling that was performed on X before training my model to this single new sample, before attempting classification?
If not, is the only solution to have retained a copy of X before scaling and to add my new sample to this data and then scale this dataset and attempt classification on the new sample after it has been scaled via this process?
use class api instead of function api. like preprocessing.MinMaxScaler, preprocessing.StandardScaler
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
The function scale provides a quick and easy way to perform this operation on a
single array-like dataset
The preprocessing module further provides a utility class StandardScaler that
implements the Transformer API to compute the mean and standard deviation on a
training set so as to be able to later reapply the same transformation on the
testing set.
lets say you you have the training dataset "training_dataset" and you did the following to scale it,
x__feature_scaler = MinMaxScaler(feature_range = (0, 1))
training_scaled_dataset = x__feature_scaler.fit_transform(training_dataset)
Use the same instance of MinMaxScaler to scale the new dataset. If your new dataset is "new_dataset", do the following,
new_scaled_dataset = x__feature_scaler.transform(new_dataset)
That way you will scale your new dataset to the same scale as your training dataset.

Multi-output spatial statistics with gaussian processes

I've been investigating Gaussian processes lately. The perspective of probabilistic multi-output is promising in my field. In particular, spatial statistics. But I encountered three problems:
multi-ouput
overfitting and
anisotropy.
Let me run a simple case study with the meuse data set (from the R package sp).
UPDATE: The Jupyter notebook used for this question, and updated according to Grr's answer, is here.
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
meuse = pd.read_csv(filepath_or_buffer='https://gist.githubusercontent.com/essicolo/91a2666f7c5972a91bca763daecdc5ff/raw/056bda04114d55b793469b2ab0097ec01a6d66c6/meuse.csv', sep=',')
For the example, we will focus on copper and lead.
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(121, aspect=1)
ax1.set_title('Lead')
ax1.scatter(x=meuse.x, y=meuse.y, s=meuse.lead, alpha=0.5, color='grey')
ax2 = fig.add_subplot(122, aspect=1)
ax2.set_title('Copper')
ax2.scatter(x=meuse.x, y=meuse.y, s=meuse.copper, alpha=0.5, color='orange')
In fact, concentrations of copper and lead are correlated.
plt.plot(meuse['lead'], meuse['copper'], '.')
plt.xlabel('Lead')
plt.ylabel('Copper')
This is thus a multi-output problem.
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
reg = GPR(kernel=RBF())
reg.fit(X=meuse[['x', 'y']], y=meuse[['lead', 'copper']])
predicted = reg.predict(meuse[['x', 'y']])
First question: Is the kernel built for correlated multi-output when y has more than one dimension? If not, how may I specify the kernel?
I keep on with the analysis to show the second problem, overfitting:
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax1.set_title('Lead')
ax1.set_xlabel('Measured')
ax1.set_ylabel('Predicted')
ax1.plot(meuse.lead, predicted[:,0], '.')
ax2 = fig.add_subplot(122)
ax2.set_title('Copper')
ax2.set_xlabel('Measured')
ax2.set_ylabel('Predicted')
ax2.plot(meuse.copper, predicted[:,1], '.')
I created a grid of x and y coordinates and all concentrations on that grid were predicted as zeros.
Finally, a last concern which particularly arises in 3D for soils: how may I specify anisotropy in such models?
First off you need to split your data. Training a model and then predicting on that same training data will look like overfitting as you have observed, but you did not test your model on any hold out data, so you have no idea how it performs in the wild. Try splitting your data with sklearn.model_selection.train_test_split like so:
X_train, X_test, y_train, y_test = train_test_split(meuse[['x', 'y']], meuse[['lead', 'copper']])
Then you can train your model. However, you also have an issue there. When you train the model the way you do you end up with a kernel with a length_scale=1e-05. Essentially there is no noise in your model. The predictions made with this setup will be centered so tightly around your input points (X_train) that you won't be able to make any predictions about the sites around them. You need to change the alpha parameter of the GaussianProcessRegressor to fix this. This is something that you will likely need to do a grid search on as the default value is 1e-10. For an example I used alpha=0.1.
reg = GPR(RBF(), alpha=0.1)
reg.fit(X_train, y_train)
predicted = reg.predict(X_test)
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121)
ax1.set_title('Lead')
ax1.set_xlabel('Measured')
ax1.set_ylabel('Predicted')
ax1.plot(y_test.lead, predicted[:,0], '.')
ax2 = fig.add_subplot(122)
ax2.set_title('Copper')
ax2.set_xlabel('Measured')
ax2.set_ylabel('Predicted')
ax2.plot(y_test.copper, predicted[:,1], '.')
That results in this graph:
As you can see there is no overfitting issue here, in fact this may be underfit. Like I said you will need to do some GridSearchCV on this model to come up with the optimal setup given your data.
So to answer your questions:
The model handles multi output quite well as is.
The overfitting can be addressed by properly splitting the data or testing on a different hold out set.
Take a look at the Radial Basis Function RBF Kernel section of the Gaussian Processes guide for some insight on applying the anisotropic kernel as opposed to the isotropic kernel we applied above.
Update for Question in Comments
When you write "the model handles multi output quite well as is", are you saying that the model "as is" is built for correlated targets, or that the model handles them quite well as a collection of independent models?
Good question. From what I understand about the GaussianProcessRegressor I do not believe that it is capable of storing multiple models internally. So this is a single model. That being said what is interesting about your question is the statement "built for correlated targets". In this situation our two targets do seem to be fairly correlated (Pearson Correlation Coefficient = 0.818, p=1.25e-38) so I really see two questions here:
For correlated data, if we built models for both targets as well as the individual targets how would the results compare?
For non-correlated data does the above hold true?
Unfortunately we cannot test the second question without creating a new "fake" dataset, which is somewhat beyond the scope of what we are doing here. We can however, answer the first question quite easily. Using our same train/test split we can train two new models with the same hyperparameters for predicting lead and copper individually. Then we can train a MultiOutputRegressor using both classes. And finally compare them all to the original model. Like so:
reg = GPR(RBF(), alpha=1)
reg.fit(X_train, y_train)
preds = reg.predict(X_test)
reg_lead = GPR(RBF(), alpha=1)
reg_lead.fit(X_train, y_train.lead)
lead_preds = reg_lead.predict(X_test)
reg_cop = GPR(RBF(), alpha=1)
reg_cop.fit(X_train, y_train.copper)
cop_preds = reg_cop.predict(X_test)
multi_reg = MultiOutputRegressor(GPR(RBF(), alpha=1))
multi_reg.fit(X_train, y_train)
multi_preds = multi_reg.predict(X_test)
Now we have several models we can compare. Let's plot the predictions and see what we get.
Interestingly there is no visible difference in the lead predictions, but there is some in the copper predictions. And these only exist between the origin GPR model and our other models. Moving on to more quantitative measures of error we can see that for explained variance the original model performs ever so slightly better than our MultiOutputRegressor. What is interesting is that the explained variance for the copper model is significantly lower than for the lead model (this in fact corresponds to the behavior of the individual components of the two other models as well). This is all very interesting and would lead us down a number of different development routes in coming to our final model.
I think the important take away here is that all of the model iterations appear to be in the same ballpark and there is no clear cut winner in this scenario. This is a case where you are going to need to do some significant grid searching and perhaps implementing the anisotropic kernel and any other domain specific knowledge would help, but as it is our example is a far cry from a useful model.

Sklearn overfitting

I have a data set containing 1000 points each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purpose. I am training it using sklearn support vector regressor. I have got 100% accuracy with training set but results obtained with test set are not good. I think it may be because of overfitting. Please can you suggest me something to solve the problem.
You may be right: if your model scores very high on the training data, but it does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model under a different situation. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR and create several models and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated using several input parameters, each of which could be set to a number of different values. For the simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
for kernel in ('rbf', 'linear'):
for c in (0.1, 1, 10):
svr = SVR(kernel=kernel, C=c, degree=4)
svr.fit(train_features, train_target)
score = svr.score(test_features, test_target)
print kernel, c, score
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
parameters = {'kernel':('linear', 'rbf'), 'C':(0.1, 1, 10)}
clf = GridSearchCV(SVC(degree=4), parameters)
clf.fit(train_features, train_target)
print clf.best_score_
print clf.best_params_
model = clf.best_estimator_ # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data using training data split Sklearn K-Fold cross-validation, this provides you a fair split of data and better model , though at a cost of performance , which should really matter for small dataset and where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note from #shahins answer (I am not allowed to add comments), both implementations are not equivalent. GridSearchCV is better since it performs cross-validation in the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data

Resources