Can I simplify the Lasso-Lars runtime for zeroed coefficients? - scikit-learn

I have a dataset with tens of thousands of features and I use a 2nd order PolynomialFeatures component in my pipeline. When I train with Lasso(Lars) many of these features drop to 0, which is great. However, when I want to use this model in practice for even more features, I would like to avoid the calculation overhead for the 0 coefficients.
Is there a way in scikit-learn that I can take the result of the Lasso regression coefficients and create a model that only processes the non-zero coefficients when evaluating?

Related

Large dataset - ANN

I am trying to classify around 400K data with 13 attributes. I have used python sklearn's SVM package, but it didn't work, and then I learned that SVM's are not suitable for large dataset classification. Then I used the (sklearn) ANN using the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples, and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is over trained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure that there is no overfit? (I have learned that the default implementation has 100 hidden neurons. Is it ok to use the default implementation as is?)
To know if your are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated this scores, compare it. If training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data, instead of learning from it to make future predictions.
If you are overfitting with Neuronal Networks you probably have to reduce the number of layers and reduce the number of neurons per layer. There isn't any strict rule that says the number of layer or neurons you need depending on you dataset size. Every dataset can behaves completely different with the same dataset size.
So, to conclude, if you are overfitting, you would have to evaluate your model accuracy using different parameters of layers and number of neurons, and, then, observe with which values you obtain the best results. There are some methods you can use to find the best parameters, is like gridsearchCV.

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forests module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but more features tend to make the model less accurate.
I suspect that splitting might only be done across one dimension which means that features which are actually more important get less attention when building trees. Could this be the reason?
Yes, the additional features you have added might not have good predictive power and as random forest takes random subset of features to build individual trees, the original 50 features might have got missed out. To test this hypothesis, you can plot variable importance using sklearn.
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c
There are plenty of illustrations of overfitting, but for instance, this 2d plot represents the different functions that would have been learned for a binary classification task. Because the function on the right has too many parameters, it learns wrongs data patterns that don't generalize properly.

Sklearn overfitting

I have a data set containing 1000 points each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing purpose. I am training it using sklearn support vector regressor. I have got 100% accuracy with training set but results obtained with test set are not good. I think it may be because of overfitting. Please can you suggest me something to solve the problem.
You may be right: if your model scores very high on the training data, but it does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model under a different situation. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR and create several models and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initiated using several input parameters, each of which could be set to a number of different values. For the simplicity, let's assume you are only dealing with two parameters that you want to tweak: 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
for kernel in ('rbf', 'linear'):
for c in (0.1, 1, 10):
svr = SVR(kernel=kernel, C=c, degree=4)
svr.fit(train_features, train_target)
score = svr.score(test_features, test_target)
print kernel, c, score
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn to do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
parameters = {'kernel':('linear', 'rbf'), 'C':(0.1, 1, 10)}
clf = GridSearchCV(SVC(degree=4), parameters)
clf.fit(train_features, train_target)
print clf.best_score_
print clf.best_params_
model = clf.best_estimator_ # This is your model
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your data using training data split Sklearn K-Fold cross-validation, this provides you a fair split of data and better model , though at a cost of performance , which should really matter for small dataset and where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by #shahins.
Especially, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. It corresponds to regularize more the estimation.
If it's taking too long, you can also try RandomizedSearchCV
As a side note from #shahins answer (I am not allowed to add comments), both implementations are not equivalent. GridSearchCV is better since it performs cross-validation in the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data

How to consistently standardize sparse feature matrix in scikit-learn?

I am using sklearn's DictVectorizer to construct a large, sparse feature matrix, which is fed to an ElasticNet model. Elastic net (and similar linear models) work best when predictors (columns in the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline that uses a StandardScaler prior to the regressor, however that doesn't work with sparse features, as stated in the docs.
I thought to use the normalize=True flag in ElasticNet which seems to support sparse data, however it's not clear whether the normalization is applied during prediction to the test data as well. Does anyone know if normalize=True applies for prediction as well? If not, is there a way to use the same standardization on the training and test set when dealing with sparse features?
Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to the way glmnet in R handles standardization. The relevant code snippet is the method _set_intercept of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. So predictions on unseen data use coefficients in the original scale, i.e., normalize=True is safe to use.

Comparing a Poisson Regression to a logistic Regression

I have data which has an associated binary outcome variable. Naturally I ran a logistic regression in order to see parameter estimates and odds ratios. I was curious though, to change this data from a binary outcome to count data. Then I ran a poisson regression (and negative binomial regression) on the count data.
I have no idea of how to compare these different models though, all comparisons I see seem to only be concerned with nested models.
How would you go about deciding on the best model to use in this situation?
Essentially both models will be roughly equal. What really matters is what is your objective- what you really want to predict. If you want to determine how many of cases are good or bad (1 or 0), then you go for logistic regression. If you are really interested on how much the cases are going to do (counts) then do poisson.
In other words, the only difference between these two models is the logistic transformation and the fact that logistic regression tries to minimize the misclassification error (-2 log likelihood) .To put it simply, even if you run a linear regression (OLS) on the binary outcome, you should not see big differences from your logistic model apart from the fact that the results may not be between 0 and 1 (e.g. the Area under the RoC curve will be similar to the logistic model) .
To sum up, don't worry about which of these two models is better, they should be roughly the same in the way the capture your features' information. Just think what makes more sense to optimize, counts or probabilties. The answer might have been different if you were considering non-linear models (e.g random forests or neural networks etc), but the two you are considering are both (almost) linear- so don't worry about it.
One thing to consider is the sample design. If you are using a case-control study, then logistic regression is the way to go because of its logit link function, rather than log of ratios as in Poisson regression. This is because, where there is an oversampling of cases such as in case-control study, odds ratio is unbiased.

Resources