How to continue training SVM and KNN in scikit-learn?

Training costs a lot of time, so is there a way for me to continue training afterwards and add samples, using NuSVC() and NearestNeighbors() in scikit-learn?

For the SVM, you might be able to use the online-learning abilities of the SGDClassifier class (with hinge loss it approximates a linear SVM). To do so, you would need to use its partial_fit() method.
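A minimal sketch of that approach, assuming a linear SVM is an acceptable substitute for NuSVC (the data below is a placeholder for your own samples):

import numpy as np
from sklearn.linear_model import SGDClassifier

# hinge loss makes SGDClassifier behave like a linear SVM trained online
clf = SGDClassifier(loss="hinge")

# the full set of classes must be given on the first call to partial_fit
X_first, y_first = np.random.rand(100, 5), np.random.randint(0, 2, 100)
clf.partial_fit(X_first, y_first, classes=np.array([0, 1]))

# later on, continue training with new samples instead of refitting from scratch
X_more, y_more = np.random.rand(50, 5), np.random.randint(0, 2, 50)
clf.partial_fit(X_more, y_more)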

Related

XGBoost classifier

I am new to XGBoost and I am currently working on a project where we have built an XGBoost classifier. Now we want to run some feature selection techniques. Is the backward elimination method a good idea for this? I have used it in regression, but I am not sure if/how to use it in a classification problem. Any leads will be greatly appreciated.
Note: I have already tried permutation importance and it has yielded good results! Looking for another method to evaluate the features in the model.
Consider asking your question on Cross Validated since feature selection is more about theory/practice than code.
What is your concern? Removing "noisy" features that drive down your results, or obtaining a sparse model? Backward selection is one way to do it, of course. That being said, in case you are not aware of this, XGBoost computes its own "variable importance" values.
# plot feature importance using XGBoost's built-in function
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
from sklearn.datasets import load_breast_cancer

# X, y stand in for your own training data; a toy dataset keeps the snippet runnable
X, y = load_breast_cancer(return_X_y=True)

model = XGBClassifier()
model.fit(X, y)

# plot feature importance (by default, how often each feature is used in a split)
plot_importance(model)
pyplot.show()
Something like this. This importance is based on how many times a feature is used to make a split. You can then define, for instance, a threshold below which you do not keep the variables (a sketch follows below). However, do not forget that:
This variable importance has been obtained on the training data only.
The removal of a variable with high importance may not affect your prediction error, e.g. if it is correlated with another highly important variable. Other tricks such as this one may exist.
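If you go the threshold route, a hedged sketch using scikit-learn's SelectFromModel (my addition, not something the answer above requires; the threshold value is arbitrary) could look like this:

from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel

# toy data standing in for your own training set
X, y = load_breast_cancer(return_X_y=True)
model = XGBClassifier().fit(X, y)

# keep only the features whose importance exceeds the threshold
selector = SelectFromModel(model, threshold=0.01, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, "->", X_reduced.shape)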

Extending Keras TensorBoard callback to visualize model predictions

When training deep semantic segmentation models, it is often convenient to visualize a sample of predictions on the validation set during training. Right now I'm simply saving some predictions to disk on my training server. I'm looking to migrate this task to TensorBoard. Simply put, I want to visualize a set of predictions (say 5) each epoch.
I know there is a simple way to do it in pure TensorFlow like tf.summary.image(..) but I don't see any easy way to incorporate this into the Keras TensorBoard callback.
Any guidance would be much appreciated.
Fabio Perez provided an answer that should do exactly what you're looking for here: How to display custom images in TensorBoard using Keras?
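For reference, a rough sketch of the idea in TensorFlow 2.x (the callback name, val_images, and log_dir are placeholders I made up, not part of the linked answer):

import tensorflow as tf

class PredictionLogger(tf.keras.callbacks.Callback):
    def __init__(self, val_images, log_dir, num_samples=5):
        super().__init__()
        self.val_images = val_images[:num_samples]
        self.writer = tf.summary.create_file_writer(log_dir)

    def on_epoch_end(self, epoch, logs=None):
        # predictions are assumed to be image-shaped: (batch, height, width, channels)
        preds = self.model.predict(self.val_images)
        with self.writer.as_default():
            tf.summary.image("val_predictions", preds, step=epoch,
                             max_outputs=len(self.val_images))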

Automatic classification model selection

I want to know if there is any method by which the computer can decide which classification model to use (decision trees, logistic regression, KNN, etc.) just by looking at the training data.
Even just the math would be extremely helpful.
I am going to be writing this in Python 3, so if there's any built-in method in scikit-learn or TensorFlow for this purpose, it would be of great help.
This scikit-learn-based toolkit solves it:
https://automl.github.io/auto-sklearn/stable/index.html
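A hedged sketch of how it is typically used (this assumes the auto-sklearn package is installed; the dataset and time limit are placeholders):

import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# auto-sklearn searches over candidate models and hyperparameters automatically
automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=120)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))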

Pytorch: Intermediate testing during training

How can I test my pytorch model on validation data during training?
I know that there is the function myNet.eval() which apparently switches off any dropout layers, but does it also prevent gradients from being accumulated?
Also how would I undo the myNet.eval() command in order to continue with the training?
If anyone has some code snippet / toy example I would be grateful!
How can I test my pytorch model on validation data during training?
There are plenty of examples where there are train and test steps for every epoch during training. An easy one would be the official MNIST example. Since pytorch does not offer any high-level training, validation or scoring framework, you have to write it yourself. Commonly this consists of:
a data loader (commonly based on torch.utils.data.DataLoader)
a main loop over the total number of epochs
a train() function that uses training data to optimize the model
a test() or valid() function to measure the effectiveness of the model given validation data and a metric
This is also what you will find in the linked example.
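As a rough illustration of that structure (the model, data, and hyperparameters below are placeholders, not taken from the MNIST example):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

train_loader = DataLoader(TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,))), batch_size=16)
val_loader = DataLoader(TensorDataset(torch.randn(40, 10), torch.randint(0, 2, (40,))), batch_size=16)

def train():
    model.train()  # enable dropout and batch-norm updates
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

def validate():
    model.eval()  # disable dropout, use running batch-norm statistics
    correct = 0
    with torch.no_grad():  # no autograd bookkeeping during evaluation
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    return correct / len(val_loader.dataset)

for epoch in range(5):
    train()
    print(f"epoch {epoch}: validation accuracy {validate():.3f}")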
Alternatively you can use a framework that provides basic looping and validation facilities so you don't have to implement everything by yourself all the time.
tnt is torchnet for pytorch, supplying you with different metrics (such as accuracy) and abstraction of the train loop. See this MNIST example.
inferno and torchsample attempt to model things very similarly to Keras and provide some tools for validation
skorch is a scikit-learn wrapper for pytorch that lets you use all the tools and metrics from sklearn
Also how would I undo the myNet.eval() command in order to continue with the training?
myNet.train() or, alternatively, supply a boolean to switch between eval and training: myNet.train(True) for train mode.
I know that there is the function myNet.eval() which apparently switches off any dropout layers, but does it also prevent gradients from being accumulated?
It doesn't prevent gradients from accumulating.
But I think during testing you do want to ignore gradients. In that case, you should wrap the forward pass in torch.no_grad() (in older PyTorch versions this was done by marking the input Variables as volatile=True), and it will save some time and memory in the forward calculation.
Also how would I undo the myNet.eval() command in order to continue with the training?
myNet.train()

Logistic regression overfits even using cross validation in sklearn?

I am implementing a logistic regression model using sklearn, for a text classification competition on Kaggle.
When I use unigrams, there are 23,617 features. The best mean_test_score that cross-validation search (sklearn's GridSearchCV) gives me is similar to the score I got from Kaggle using the best model.
There are 1,046,524 features if I use bigrams. GridSearchCV gives me a better mean_test_score compared to unigrams, but using this new model I got a much lower score on Kaggle.
I guess the reason might be overfitting, since I have too many features. I have tried setting GridSearchCV to use 5-fold, or even 2-fold, cross-validation, but the scores are still inconsistent.
Does it really indicate my second model is overfitting, even in the validation stage? If so, how can I tune the regularization term for my logistic model using sklearn? Any suggestions are appreciated!
Assuming you are using sklearn's text vectorizers, you could try looking into the tuning parameters max_df, min_df, and max_features. Throwing these into a GridSearch may take a long time, but you will likely get some interesting results back. I know these parameters are implemented in sklearn.feature_extraction.text.TfidfVectorizer, but they are used elsewhere as well. Essentially the idea is that including too many n-grams can lead to overfitting, and the same goes for keeping n-grams with very low or very high document frequencies.
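For example, a hedged sketch (toy data; the parameter grid is illustrative only) that tunes the vectorizer pruning parameters together with the logistic-regression regularization strength C:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# tiny placeholder corpus standing in for the Kaggle data
texts = [
    "the movie was great and fun", "the movie was terrible and boring",
    "a great fun film", "a boring terrible film",
    "i loved this great movie", "i hated this boring movie",
    "fun and great all around", "terrible and boring all around",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__max_features": [None, 20],  # cap the vocabulary size
    "clf__C": [0.01, 0.1, 1, 10],       # smaller C means stronger regularization
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)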
