sklearn SGDClassifier can't stop - scikit-learn

I am using sklearn to train a model. The training dataset is about 3000k, so I use SGDClassifier. The features are not very good, so I know it may not converge. But I want SGDClassifier to stop early according to a setting of mine, such as max_iter = 1000. As far as I know, SGDClassifier has no parameter like max_iter. How can I do it?
This is the code.
This is the print information.
Any help will be appreciated...

This is weird: by default in scikit-learn 0.18.2, n_iter is set to 5 epochs. Can you please update your question with a script that reproduces the behavior on a toy dataset (for instance generated with numpy.random.randn or similar)?
Note that in scikit-learn master and 0.19 once released, n_iter will be deprecated and replaced by max_iter and a tol (for instance set to 1e-3) to automatically stop when the objective function is no longer making progress.
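A minimal sketch of what that looks like, assuming scikit-learn 0.19 or later (where max_iter and tol are available). The toy data and the specific values below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data: 1000 samples, 20 features, labels from a noisy linear rule.
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
y = (X[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)

# max_iter caps the number of epochs; tol allows stopping earlier
# once the loss is no longer improving.
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)

print("epochs actually run:", clf.n_iter_)
print("training accuracy:", clf.score(X, y))
```

Even when the data never converges, training ends after at most max_iter epochs, which is the hard stop the question asks for.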

The 20 hours of running time may not be so strange, since you have a dataset of 3000k and SGDClassifier is slow on that much data. What processor do you have?
Try stopping it with CTRL+C if you are on Windows. Then use n_iter to control the number of iterations you want; the default is 5, however.
Finally, if you want to save a model see here:
Save and Load Machine Learning Models in Python with scikit-learn
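For reference, one common way to do this is Python's built-in pickle module. This is only a sketch on made-up toy data, and it assumes scikit-learn 0.19 or later for max_iter and tol:

```python
import pickle

import numpy as np
from sklearn.linear_model import SGDClassifier

# Small toy model to have something to save.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

# Serialize to bytes and restore; pickle.dump / pickle.load do the
# same thing with a file opened in binary mode.
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

same = np.array_equal(clf.predict(X), restored.predict(X))
print("predictions identical after reload:", same)
```

A pickled model should be loaded with the same scikit-learn version that saved it.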

Related

Pytorch DDP find_unused_parameters setting during using multiple forwards/backwards

Issues like https://github.com/pytorch/pytorch/issues/69031
My training task needs multiple forward steps, which means some of my epochs need find_unused_parameters=True and other epochs need find_unused_parameters=False.
PyTorch still can't track used and unused parameters automatically.
Are there any alternative solutions now?
Thanks!

XGBoost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is backward elimination method a good idea for this? I have used it in regression but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation line importance and it has yielded good results! Looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern? Removing "noisy" features that drive down your results, or obtaining a sparse model? Backward selection is one way to do it, of course. That said, you may not be aware that XGBoost computes its own "variable importance" values.
# plot feature importance using built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define, for instance, a threshold below which you do not keep a variable. However, do not forget that:
This variable importance has been obtained on the training data only
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.
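The thresholding idea can be sketched as follows. To keep the example runnable without xgboost, it uses scikit-learn's GradientBoostingClassifier as a stand-in (its feature_importances_ are impurity-based, not split counts); XGBClassifier also exposes feature_importances_, so the same pattern should carry over. The synthetic data and the mean-importance cut-off are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: 10 features, only 3 of which carry signal.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = model.feature_importances_

# Hypothetical cut-off: keep features at or above the mean importance.
threshold = importances.mean()
keep = importances >= threshold
X_selected = X[:, keep]

print("kept feature indices:", np.flatnonzero(keep))
print("shape before/after:", X.shape, X_selected.shape)
```

After dropping features, refit the model and compare scores on held-out data, since the importances themselves come from the training data only.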

I'm trying to implement 'multi-threading' to do both training and prediction(testing) at the same time

I'm trying to implement 'multi-threading' to do both training and prediction (testing) at the same time, and I'm going to use the Python module 'threading' as shown in https://www.tensorflow.org/api_docs/python/tf/FIFOQueue
My questions are the following.
If I use the Python module 'threading', does TensorFlow use a larger share of the GPU or of the CPU?
Do I have to make two graphs (neural nets with the same topology) in TensorFlow, one for prediction and the other for training? Or is it okay to make just one graph?
I'll be very grateful to anyone who can answer these questions! thanks!
If you use the Python threading module, it will only make use of the CPU; also, Python threading does not give you run-time parallelism, so you should use multiprocessing.
If your model uses ops such as dropout or batch_norm, which behave differently in training and validation, it's a good idea to create separate graphs, with the validation graph reusing all of the training variables.
Note: you can also use a single graph, with additional operations that change behavior depending on training/validation.
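To illustrate what "changing behavior depending on training/validation" means, here is a small sketch in plain NumPy (not TensorFlow). The dropout function and its training flag are hypothetical names for illustration only:

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: random masking when training, identity otherwise."""
    if not training:
        return x
    # Keep each unit with probability (1 - rate) and rescale the survivors.
    mask = rng.uniform(size=x.shape) >= rate
    return x * mask / (1.0 - rate)


rng = np.random.RandomState(0)
activations = np.ones((4, 3))

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)
print(eval_out)
```

The same weights serve both modes; only the flag differs, which is why a single graph with a training/validation switch can be enough.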

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training on it using sklearn's support vector regressor. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Can you please suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split provided in sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initialized with several input parameters, each of which can be set to a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping a third parameter, 'degree', set to 4 (note that degree only affects the 'poly' kernel). You are considering 'rbf' and 'linear' for kernel, and 0.1, 1, 10 for C. A simple solution is this:
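from sklearn.svm import SVR
# Fit one SVR per (kernel, C) combination and report its score
for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)  # R^2 on the test split
        print(kernel, c, score)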
This way, you can generate 6 models and see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
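from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)  # SVR, since the target is continuous
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model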
I am working on a little tool to simplify using sklearn for small projects, and make it a matter of configuring a yaml file, and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try using something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
Try fitting your model with sklearn's K-Fold cross-validation on the training data. This gives you a fair split of the data and a better model, though at the cost of extra computation, a trade-off that is usually acceptable for a small dataset where accuracy is the priority.
A few hints:
Since you have only two inputs, it would be great if you plot your data. Try either a scatter with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it. Decreasing C corresponds to regularizing the estimate more strongly.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins' answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is better since it performs cross-validation on the training set for tuning the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data.
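Putting these hints together, a sketch could look like this. The data is made up to match the shape in the question (1000 points, 2 inputs, 1 output), and the parameter grid is only an example:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up data: 1000 points, 2 inputs, 1 noisy nonlinear output.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.randn(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scaling lives inside the pipeline, so each CV fold is scaled using
# statistics from its own training part only.
pipeline = make_pipeline(StandardScaler(), SVR())
param_grid = {"svr__kernel": ["linear", "rbf"], "svr__C": [0.1, 1, 10]}

# Hyperparameters are tuned by 5-fold CV on the training set only.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("cross-validated R^2:", search.best_score_)
# The test set is used once, at the very end.
print("held-out test R^2:", search.score(X_test, y_test))
```

If the cross-validated score and the held-out score are close, the selected model is probably not overfitting badly.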

Adaptive learning rate Lasagne

I am using the Lasagne and Theano libraries to build my own deep learning model, following the MNIST example. Can anyone please tell me how to adaptively change the learning rate?
I recommend having a look at https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py.
If you are using sgd, then you can use a momentum term (e.g. https://github.com/Lasagne/Lasagne/blob/master/lasagne/updates.py#L156) to adaptively change the learning rate. If you want to do anything non-standard, the momentum implementation gives you enough hints on how to create something similar on your own.
I think the best way of doing this is to create a Theano shared variable for your learning rate, pass the shared variable to the updates function, and change it through the set_value method, as follows:
import numpy as np
import theano
import lasagne
lr_shared = theano.shared(np.array(0.1, dtype=theano.config.floatX))
updates = lasagne.updates.rmsprop(..., learning_rate=lr_shared)
...
for epoch in range(num_epochs):
    if epoch % 10 == 0:
        # Divide the learning rate by 10 every 10 epochs (including epoch 0)
        lr_shared.set_value(lr_shared.get_value() / 10)
Of course you can change the optimizer and the if condition; this is just an example.
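If you prefer to compute the rate from the epoch number instead of mutating it in place, the schedule can be written as a small pure-Python helper. This is only a sketch: step_decay and its parameters are hypothetical names, not part of Lasagne or Theano, and unlike the loop above it keeps the initial rate for the first 10 epochs instead of dividing at epoch 0:

```python
def step_decay(initial_lr, epoch, every=10, factor=10.0):
    """Divide initial_lr by `factor` once for each `every` completed epochs."""
    return initial_lr / (factor ** (epoch // every))


# Print the rate at a few sample epochs.
for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.1, epoch))
```

The returned value could then be handed to lr_shared.set_value at the start of each epoch, cast to theano.config.floatX.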

Resources