Does python's scikit-learn have any regression models that work well with sparse data?
I was poking around and found this "sparse linear regression" module, but it seems outdated. (It's so old that scikit-learn was called 'scikits-learn' at the time, I think.)
Most scikit-learn regression models, whether linear (such as Ridge, Lasso, ElasticNet) or non-linear (e.g. RandomForestRegressor), support both dense and sparse input data in recent versions of scikit-learn (0.16.0 is the latest stable version at the time of writing).
Edit: if you are unsure, check the docstring of the fit method of the class of interest.
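For illustration, here is a minimal sketch (with random, purely illustrative data) showing that an estimator such as Ridge accepts a SciPy CSR matrix directly:
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import Ridge

# sparse feature matrix: 1000 samples, 10000 features, 1% non-zero
X = sp.rand(1000, 10000, density=0.01, format='csr')
y = np.random.randn(1000)

model = Ridge(alpha=1.0)
model.fit(X, y)                     # X is never densified
predictions = model.predict(X[:5])  # prediction works on sparse rows too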
I trained some data science models with scikit-learn v0.19.1. The models are stored in a pickle file. After upgrading to the latest version (v0.23.1), I get the following error when I try to load them:
File "../../Utils/WebsiteContentSelector.py", line 100, in build_page_selector
page_selector = pickle.load(pkl_file)
AttributeError: Can't get attribute 'DeprecationDict' on <module 'sklearn.utils.deprecation' from '/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py'>
Is there a way to upgrade without retraining all my models (which is very expensive)?
You used a new version of sklearn to load a model which was trained by an old version of sklearn.
So, the options are:
Retrain the model with the current version of sklearn, if you have the training script and data
Or fall back to the older sklearn version reported in the warning message
Depending on the kind of sklearn model used: if it is a simple regression model, all you probably need are the actual weight and bias (or intercept) values.
You can check these values in your model:
model.classes_
model.coef_
model.intercept_
They are of numpy type and can be pickled easily. You also need the same parameters that were passed to the model constructor, for example:
tol
max_iter
and so on. With these, the same model can be created in the upgraded version with the same parameters, and the old weights and intercept assigned to it. This way no re-training is needed and you can use the upgraded sklearn.
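A minimal sketch of that weight transfer, assuming a simple linear model such as LogisticRegression (the attribute names and constructor parameters are illustrative; adapt them to your actual model):
# Step 1: under the OLD sklearn version, dump only plain numpy arrays
import pickle

params = {
    'coef': old_model.coef_,            # old_model: your unpickled estimator
    'intercept': old_model.intercept_,
    'classes': old_model.classes_,
}
with open('weights.pkl', 'wb') as f:
    pickle.dump(params, f)

# Step 2: under the NEW sklearn version, rebuild the estimator
import pickle
from sklearn.linear_model import LogisticRegression

with open('weights.pkl', 'rb') as f:
    params = pickle.load(f)

new_model = LogisticRegression(tol=1e-4, max_iter=100)  # same constructor args as before
new_model.coef_ = params['coef']
new_model.intercept_ = params['intercept']
new_model.classes_ = params['classes']
# new_model.predict(...) now works without retraining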
When lib versions are not backward compatible, you can do the following:
Downgrade sklearn back to the original version
Load each model, then extract and store its coefficients (which are model-specific - check the documentation)
Upgrade sklearn, load the coefficients, initialize models with them, and save the models
I am using scikit-learn to convert my training data to polynomial features and then fit a linear model.
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error
TypeError: A sparse matrix was passed, but dense data is required
I know my data is in sparse matrix format. When I try to convert it to a dense matrix I get a memory error, because my data is huge (~50k); with that much data I simply can't densify it.
I also found GitHub issues where this feature was requested, but it has still not been implemented.
So can someone please tell me how to use sparse data with PolynomialFeatures in scikit-learn without converting it to dense format?
This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements. If you really want to test it out, you can install the development version by following the instructions in Sklearn - Advanced Installation - Install Bleeding Edge.
Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much, much faster than if the input is a CSC matrix or dense (assuming the data's sparse to any reasonable degree - even slightly).
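A short sketch of what that looks like in sklearn >= 0.21 (random data for illustration):
import scipy.sparse as sp
from sklearn.preprocessing import PolynomialFeatures

X = sp.rand(50000, 100, density=0.01, format='csr')  # illustrative sparse data
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)   # output stays sparse; no MemoryError
print(type(X_poly), X_poly.shape)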
While we wait for the latest sklearn update, you can find an implementation of sparse interactions here:
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
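I haven't verified the exact interface, but assuming the linked class follows the usual scikit-learn transformer conventions with a degree argument (check the file itself), usage would look roughly like:
from SparseInteractions import SparseInteractions  # the linked file, saved locally

interactions = SparseInteractions(degree=2)        # degree argument is assumed
X_interactions = interactions.fit_transform(X)     # X is a scipy sparse matrix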
I'm successfully using scikit-learn on my machine. I'm experimenting with an Anaconda implementation (which relies on MKL for multithreading) and an OpenBLAS implementation.
I'd really like to use a parallel version of the k-nearest neighbour classifier, and according to https://github.com/scikit-learn/scikit-learn/pull/4009, sklearn should have merged these changes a year ago, in version 0.17.
Multithreading works successfully for PCA and all numpy operations. I can tell multithreading is working from the high number of threads I see when I do dot products and PCA. But when I launch KNN, it takes around 10 minutes.
I'm classifying a high-dimensional dataset of MNIST (image digits). So I'm doing PCA to get vectors of dimension 35-50, and then I'm doing a nonlinear expansion, giving vectors of dimension 600-100. That's why I need parallelism so badly.
My version of sklearn is:
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.18.1.
I'm using python3 and this is a sample of the code:
from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    clf = KNeighborsClassifier(algorithm='ball_tree')
    clf = clf.fit(train, train_labels)
    return clf.predict(test)
I've tried with and without 'ball_tree'. No one should be using Python 2.7 in 2017, and neither am I.
Just passing the parameter
n_jobs = -1
solved the issue.
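For reference, a sketch of the asker's function with the fix applied (n_jobs=-1 uses all available cores for the neighbor queries):
from sklearn.neighbors import KNeighborsClassifier

def classify_knn(train, test, train_labels):
    # n_jobs=-1 parallelizes the neighbor search across all CPU cores
    clf = KNeighborsClassifier(algorithm='ball_tree', n_jobs=-1)
    clf = clf.fit(train, train_labels)
    return clf.predict(test)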
I have trained an SVM (SVC) using scikit-learn over half a terabyte of data. The model is working fine and I need to port it to C, but I don't want to re-train the SVM from scratch because it takes way too long. Is there a way to easily export the model generated by scikit-learn and import it into LibSVM? Internally scikit-learn uses LibSVM, so in theory it should be possible, but I haven't been able to find anything in the documentation. Any suggestions?
Is there a way to easily export the model generated by scikit-learn and import it into LibSVM?
No. The scikit-learn version of LIBSVM has been hacked up severely to fit it into the Python environment and the model is stored as NumPy/SciPy data structures.
Your best shot is to study the SVM decision function and reimplement it in C. The support vectors can be obtained from the SVC object as NumPy arrays, which are easily translated to C arrays.
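As a sketch of that approach, assuming an RBF kernel (the binary decision rule is sign of sum_i dual_coef_i * K(sv_i, x) + b); note that _gamma is a private attribute and may differ across sklearn versions:
import numpy as np

svs = clf.support_vectors_      # (n_SV, n_features) array of support vectors
dual_coef = clf.dual_coef_      # (n_classes - 1, n_SV) dual coefficients
intercept = clf.intercept_
gamma = clf._gamma              # private attribute; check your sklearn version

def decision_function_rbf(x):
    # reference implementation (binary case) to validate a C port against
    k = np.exp(-gamma * np.sum((svs - x) ** 2, axis=1))  # RBF kernel row
    return np.dot(dual_coef[0], k) + intercept[0]
These are all plain NumPy arrays, so dumping them as C arrays is straightforward.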
I'm training a RandomForestClassifier on a binary classification problem in scikit-learn. I want to maximize my auc score for the model. I understand this is not possible in the 0.13 stable version but is possible in the 0.14 bleeding edge version.
I tried this but I seemed to get a worse result:
ic = RandomForestClassifier(n_estimators=100, compute_importances=True, criterion='entropy', score_func=auc_score)
Does this work as a parameter for the model, or only in GridSearchCV?
If I use it in GridSearchCV, will it make the model fit the data better for auc_score? I also want to try maximizing recall_score.
I am surprised the above does not raise an error. You can use AUC only for model selection, as in GridSearchCV.
If you use it there (scoring='roc_auc' iirc), the model with the best AUC will be selected. It does not make the individual models better with respect to this score.
It is still worth trying, though.
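A sketch of using AUC for model selection (modern import path shown; at the time of the question GridSearchCV lived in sklearn.grid_search, and the grid values here are illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      scoring='roc_auc', cv=5)
search.fit(X, y)   # selects the candidate with the best cross-validated AUC
print(search.best_params_, search.best_score_)
The same pattern works with scoring='recall' if you want to select on recall_score instead.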
I have found a journal article that addresses highly imbalanced classes with random forests. Although it is aimed at running RDF on Hadoop clusters, the same techniques seem to work well on smaller problems as well:
del Río, S., López, V., Benítez, J. M., & Herrera, F. (2014). On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences, 285, 112-137.
http://sci2s.ugr.es/rf_big_imb/pdf/rio14_INS.pdf