When is it preferable to use MinMaxScaler, and when StandardScaler?
I think it depends on the data. Are there any properties of the data I should look at to decide which preprocessing method to use?
I looked at the docs, but can someone give me more insight into this?
The scaling will indeed depend on the type of data you have. For most cases, StandardScaler is the scaler of choice. If you know that you have some outliers, go for the RobustScaler.
Then, if you are dealing with features that have an unusual distribution, such as the digits dataset, these scalers are not the best choice. In that dataset many pixels are zero, so the distribution has a peak at zero, and dividing by the standard deviation will not be beneficial. So, basically, when the distribution of a feature is far from normal, you need an alternative.
In the case of the digits, the MinMaxScaler is a much better choice. However, if you want to keep the zeros at zero (because you use sparse matrices), go for a MaxAbsScaler.
NB: also look at the QuantileTransformer and the PowerTransformer if you want a feature to follow a Normal/Uniform distribution whatever the original distribution was.
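To make the comparison concrete, here is a minimal sketch that runs the four scalers mentioned above on one toy feature. The array `X` (mostly zeros plus one large outlier) is invented purely for illustration and does not come from the digits dataset.

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# Hypothetical toy feature: a peak at zero plus one large outlier.
X = np.array([[0.0], [0.0], [0.0], [1.0], [2.0], [100.0]])

scaled = {
    "StandardScaler": StandardScaler().fit_transform(X),  # zero mean, unit variance
    "RobustScaler": RobustScaler().fit_transform(X),      # centers on median, scales by IQR
    "MinMaxScaler": MinMaxScaler().fit_transform(X),      # maps to [0, 1]
    "MaxAbsScaler": MaxAbsScaler().fit_transform(X),      # divides by max |x|, zeros stay zero
}

for name, values in scaled.items():
    print(name, np.round(values.ravel(), 3))
```

Printing the results side by side shows how differently the single outlier affects each scaler, which is usually the quickest way to decide between them for a given feature.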
StandardScaler assumes that the data has normally distributed features and will scale them to zero mean and unit standard deviation. Use StandardScaler() if you know the data distribution is normal. For most cases, StandardScaler will do no harm. It is especially important for methods that are sensitive to the variance of the features (PCA, clustering, logistic regression, SVMs, perceptrons, neural networks). On the other hand, it will not make much of a difference if you are using tree-based classifiers or regressors.
MinMaxScaler will transform each value in the column proportionally into the range [0,1]. This is quite acceptable in cases where we are not concerned about standardisation along the variance axes, e.g. image processing or neural networks expecting values between 0 and 1.
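As a rough illustration of why scaling matters for variance-based methods such as PCA, the sketch below fits PCA with and without a StandardScaler. The two synthetic features and their scales are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two independent, hypothetical features on very different scales.
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

# Without scaling, the large-scale feature dominates the first component.
raw_pca = PCA(n_components=2).fit(X)

# With scaling, both features contribute roughly equally.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

raw_ratio = raw_pca.explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```

A tree-based model trained on the same two features would pick the same split points either way, which is why scaling makes little difference there.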
I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you have added might not have good predictive power, and since a random forest considers a random subset of features when building individual trees, the original 50 features may have been left out of many of them. To test this hypothesis, you can plot the variable importances using sklearn.
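One way to check this is to inspect `feature_importances_` on the fitted forest. The sketch below uses synthetic data (5 informative columns followed by 20 pure-noise columns, both invented for illustration) and a regressor, assuming the task is regression; the same attribute exists on RandomForestClassifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n_samples, n_informative, n_noise = 500, 5, 20

# Hypothetical data: the target depends only on the first 5 columns.
X_informative = rng.normal(size=(n_samples, n_informative))
X_noise = rng.normal(size=(n_samples, n_noise))
X = np.hstack([X_informative, X_noise])
y = X_informative.sum(axis=1) + 0.1 * rng.normal(size=n_samples)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking[:10]:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

If the newly added columns all sit at the bottom of the ranking, dropping them (or raising `max_features`) is a reasonable next experiment.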
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
There are plenty of illustrations of overfitting. For instance, this 2D plot shows the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrong patterns in the data that don't generalize properly.
I have seen samples where the input data for the features are just any double values.
I am wondering if I need to normalize the input features for the MultilayerPerceptronClassifier to the range [-1,1] or [0,1].
I could not find that information in the Spark documentation.
Maybe it is something I have to decide depending on the results..
.. then I might want to use one of these:
Yes, you should normalize them. This is not specific to any framework, but a general good practice for neural networks. If you do not normalize inputs and outputs, you might run into learning issues.
Whether you use [0,1] or [-1,1], both work equally well. There is probably little difference.
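Since this is not framework-specific, here is a small NumPy-only sketch of the rescaling itself. The helper name `rescale` and the sample matrix are made up for illustration; Spark and scikit-learn both ship their own min-max scalers that do the equivalent.

```python
import numpy as np

def rescale(X, low=0.0, high=1.0):
    """Linearly map each column of X onto [low, high]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return low + (X - col_min) / col_range * (high - low)

# Hypothetical input features on very different scales.
X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 800.0]])

unit = rescale(X)                   # every column now spans [0, 1]
symmetric = rescale(X, -1.0, 1.0)   # every column now spans [-1, 1]
print(unit)
print(symmetric)
```

In a real setup the minimum and range should be computed on the training data only and then reused for the test data, so that both are transformed identically.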
I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training it using sklearn's support vector regressor. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be instantiated with several input parameters, each of which can be set to a number of different values. For simplicity, let's assume you are only dealing with two parameters that you want to tweak, 'kernel' and 'C', while keeping a third parameter, 'degree', set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
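    from sklearn.svm import SVR

    # Try every (kernel, C) combination and print its R^2 score.
    # Note: degree is only used by the 'poly' kernel.
    for kernel in ('rbf', 'linear'):
        for c in (0.1, 1, 10):
            svr = SVR(kernel=kernel, C=c, degree=4)
            svr.fit(train_features, train_target)
            score = svr.score(test_features, test_target)
            print(kernel, c, score)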
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
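    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
    # Use SVR rather than SVC, since this is a regression task
    clf = GridSearchCV(SVR(degree=4), parameters)
    clf.fit(train_features, train_target)
    print(clf.best_score_)   # mean cross-validated score of the best setting
    print(clf.best_params_)
    model = clf.best_estimator_  # This is your model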
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
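A minimal sketch of the PolynomialFeatures idea follows. The synthetic target (a quadratic in two inputs) is an assumption made for illustration, and a plain LinearRegression stands in for SVR only to keep the effect of the added features easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
# Hypothetical data: the target is a nonlinear function of two inputs.
X = rng.uniform(-3, 3, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline: linear model on the raw inputs.
linear = LinearRegression().fit(X_train, y_train)

# Same model, but on degree-2 polynomial features (x0^2, x0*x1, x1^2, ...).
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)

linear_r2 = linear.score(X_test, y_test)
poly_r2 = poly.score(X_test, y_test)
print("raw features R^2:       ", round(linear_r2, 3))
print("polynomial features R^2:", round(poly_r2, 3))
```

Whether this helps on real data depends on the actual relationship, so it is worth comparing both variants on held-out data as done here.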
Try fitting your model using sklearn's K-Fold cross-validation on the training data. This gives you a fair split of the data and a better model, though at a cost in computation time, which should not really matter for a small dataset where the priority is accuracy.
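For reference, K-Fold cross-validation with an SVR could look roughly like this. The synthetic data and the SVR settings are placeholders, not values taken from the question.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.RandomState(0)
# Hypothetical data with 2 inputs and 1 output.
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=200)

# 5 folds: each sample is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVR(kernel="rbf", C=1.0), X, y, cv=cv)

print("R^2 per fold:", np.round(scores, 3))
print("mean R^2:    ", round(scores.mean(), 3))
```

A large spread between the fold scores is itself a useful warning sign that the model is sensitive to which points it was trained on.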
A few hints:
Since you have only two inputs, it would be a good idea to plot your data. Try either a scatter plot with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. This corresponds to regularizing the estimate more strongly.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on #shahins's answer (I am not allowed to add comments): the two implementations are not equivalent. GridSearchCV is better, since it performs cross-validation on the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data.
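Putting these hints together, a sketch might look like the following: scaling and SVR are wrapped in a Pipeline so the scaler is fitted inside each cross-validation fold, hyperparameters are tuned on the training set only, and the test set is scored once at the end. The synthetic data and the parameter grid are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
# Hypothetical stand-in for the 1000-point, 2-input dataset.
X = rng.uniform(0, 100, size=(1000, 2))
y = 0.05 * X[:, 0] + np.sin(X[:, 1] / 10.0) + 0.2 * rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The scaler is part of the pipeline, so it never sees validation folds.
pipeline = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}

# Cross-validated search on the training data only.
search = GridSearchCV(pipeline, param_grid, cv=5).fit(X_train, y_train)

# The test set is used once, for the final estimate.
test_score = search.score(X_test, y_test)
print(search.best_params_, round(search.best_score_, 3))
print("held-out R^2:", round(test_score, 3))
```

If the grid becomes large, RandomizedSearchCV accepts the same pipeline and a comparable parameter specification.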
I am using sklearn's DictVectorizer to construct a large, sparse feature matrix, which is fed to an ElasticNet model. Elastic net (and similar linear models) work best when the predictors (columns in the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline that uses a StandardScaler prior to the regressor; however, that doesn't work with sparse features, as stated in the docs.
I thought to use the normalize=True flag in ElasticNet which seems to support sparse data, however it's not clear whether the normalization is applied during prediction to the test data as well. Does anyone know if normalize=True applies for prediction as well? If not, is there a way to use the same standardization on the training and test set when dealing with sparse features?
Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to the way glmnet in R handles standardization. The relevant code snippet is the method _set_intercept of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. So predictions on unseen data use coefficients in the original scale, i.e., normalize=True is safe to use.
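As an alternative that does not rely on the normalize flag (which, as far as I know, is no longer available in recent scikit-learn releases), a scaler that accepts sparse input can be placed in a Pipeline. The sketch below uses MaxAbsScaler; StandardScaler(with_mean=False) is another option that keeps the matrix sparse. The records and targets are invented for illustration, and neither scaler centers the data.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical training and test records.
train_records = [
    {"color": "red", "size": 10.0},
    {"color": "blue", "size": 200.0},
    {"color": "red", "size": 50.0},
    {"color": "green", "size": 120.0},
]
train_target = [1.0, 4.0, 2.0, 3.0]
test_records = [{"color": "blue", "size": 80.0}]

# The scaling learned in fit() is reapplied automatically in predict().
model = make_pipeline(
    DictVectorizer(sparse=True), MaxAbsScaler(), ElasticNet(alpha=0.1))
model.fit(train_records, train_target)
predictions = model.predict(test_records)

# Inspect the scaled training matrix: still sparse, values within [-1, 1].
vectorizer = model.named_steps["dictvectorizer"]
scaler = model.named_steps["maxabsscaler"]
scaled_train = scaler.transform(vectorizer.transform(train_records))
print(sparse.issparse(scaled_train), predictions)
```

Because the scaler lives inside the pipeline, the training and test sets are guaranteed to go through the same transformation.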
Can anyone explain the difference between the RandomForestClassifier and ExtraTreesClassifier in scikit-learn? I've spent a good bit of time reading the paper:
P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006
It seems these are the differences for ET:
1) When choosing variables at a split, samples are drawn from the entire training set instead of a bootstrap sample of the training set.
2) Splits are chosen completely at random from the range of values in the sample at each split.
The result of these two things is many more "leaves".
Yes, both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable the bootstrap resampling.
In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs, but it's hard to guess when that is the case without trying both first (and tuning n_estimators, max_features and min_samples_split by cross-validated grid search).
The ExtraTrees classifier always tests random splits over a fraction of the features (in contrast to RandomForest, which tests all possible splits over a fraction of the features).
The main difference between random forests and extra trees (also called extremely randomized trees) lies in the fact that, instead of computing the locally optimal feature/split combination (as the random forest does), a random split value is selected for each feature under consideration (for the extra trees). Here is a good resource for more detail on their differences: Random forest vs extra tree.
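To see these differences on a concrete case, one can fit both estimators on the same data and compare cross-validated accuracy and tree size. The dataset below is generated with make_classification and is purely illustrative, so the numbers say nothing about which model is better in general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical classification problem.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    cv_accuracy = cross_val_score(model, X, y, cv=5).mean()
    model.fit(X, y)
    # Total number of leaves across all trees, as a rough size measure.
    total_leaves = sum(tree.get_n_leaves() for tree in model.estimators_)
    results[name] = (cv_accuracy, total_leaves)
    print(name, "bootstrap =", model.bootstrap,
          "| cv accuracy =", round(cv_accuracy, 3),
          "| leaves =", total_leaves)
```

The printed `bootstrap` values show the differing defaults in scikit-learn; both estimators accept the parameter, so it can be switched explicitly when comparing them.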