How to make polynomial features using sparse matrix in Scikit-learn - scikit-learn

I am using Scikit-learn for converting my train data to polynomials features and then fit it to a linear model.
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)
But it throws an error
TypeError: A sparse matrix was passed, but dense data is required
I know my data is sparse matrix format. So when I try to convert my data to dense matrix it shows memory error. Because my data is huge(50k~). Because of these large amounts of data I can't convert it to a dense matrix.
I also find Github Issues where this feature is requested. But still not implemented.
So please can someone tell how to use sparse data format in PolynomialFeatures in Scikit-learn without converting it to dense format?

This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements If you really wanted to test it out you could install the development version by following the instructions in Sklean - Advanced Installation - Install Bleeding Edge.

Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much, much faster than if the input is a CSC matrix or dense (assuming the data's sparse to any reasonable degree - even slightly).

While we are waiting for the latest update of Sklearn - you can find an implementation of sparse interaction here:
https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py

Related

Is there any way to speed up the predicting process for tensorflow lattice?

I build my own model with Keras Premade Models in tensorflow lattice using python3.7 and save the trained model. However, when I use the trained model for predicting, the speed of predicting each data point is at millisecond level, which seems very slow. Is there any way to speed up the predicting process for tfl?
There are multiple ways to improve speed, but they may involve a tradeoff with prediction accuracy. I think the three most promising options are:
Reduce the number of features
Reduce the number of lattices per feature
Use an ensemble of lattice models where every lattice model only gets a subsets of the features and then average the predictions of the different models (like described here)
As the lattice model is a standard Keras model, I recommend trying OpenVINO. It optimizes your model by converting to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization in runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes from OpenVINO Development Package. It converts the Tensorflow model to IR, a default format for OpenVINO. You can also try the precision of FP16, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

sklearn: Regression models on sparse data?

Does python's scikit-learn have any regression models that work well with sparse data?
I was poking around and found this "sparse linear regression" module, but it seems outdated. (It's so old that scikit-learn was called 'scikits-learn' at the time, I think.)
Most scikit-learn regression models (linear such as Ridge, Lasso, ElasticNet or non-linear, e.g. with RandomForestRegressor) support both dense and sparse input data recent versions of scikit-learn (0.16.0 is the latest stable version at the time of writing).
Edit: if you are unsure, check the docstring of the fit method of the class of interest.

How to consistently standardize sparse feature matrix in scikit-learn?

I am using sklearn's DictVectorizer to construct a large, sparse feature matrix, which is fed to an ElasticNet model. Elastic net (and similar linear models) work best when predictors (columns in the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline that uses a StandardScaler prior to the regressor, however that doesn't work with sparse features, as stated in the docs.
I thought to use the normalize=True flag in ElasticNet which seems to support sparse data, however it's not clear whether the normalization is applied during prediction to the test data as well. Does anyone know if normalize=True applies for prediction as well? If not, is there a way to use the same standardization on the training and test set when dealing with sparse features?
Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to the way glmnet in R handles standardization. The relevant code snippet is the method _set_intercept of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. So predictions on unseen data use coefficients in the original scale, i.e., normalize=True is safe to use.

Getting probability of each new observation being an outlier when using scikit-learn OneClassSVM

I'm new to scikit-learn, and SVM methods in general. I've got my data set working well with scikit-learn OneClassSVM in order to detect outliers; I train the OneClassSVM using observation all of which are 'inliers' and then use predict() to generate binary inlier/outlier predictions on my testing set of data.
However to continue further with my analysis I'd like to get the probabilities associated with each new observation in my test set. E.g. The probability of being an outlier associated with each new observation. I've noticed other classification methods in scikit-learn offer the ability to pass the parameter probability=True to compute this, but OneClassSVM does not offer this. Is there an easy way to get these results?
I was searching for an answer for the same question of yours until I got to this page. Stuck for sometime, then, I went back to check the original LIBSVM package since OneClassSVM of scikit-learn is based on the implementation of LIBSVM as stated here.
At the main page of LIBSVM, they state the following for option '-b' that is used to activate returning probability output scores for some variants of SVM:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
In other words, the one-class SVM which is of type SVM (neither SVC nor SVR) does not have implementation for probability estimation.
If I go and try to force this option (i.e. -b) using the command line interface of LIBSVM, for example:
./svm-train -s 2 -t 2 -b 1 heart_scale
I receive the following error message:
ERROR: one-class SVM probability output not supported yet
In summary, this very desired output is not yet supported by LIBSVM and thus, scikit-learn is not offering it for the moment. I hope in near future, they activate this functionality and update the thread here.
It provides decision function scores which in theory is the distance from the marginal decision boundary between normal and anomales OCSVM does unsupervised classification. This means that the anomaly inside the algorithm is defined based on the distance to the origin (quoted from Scholkopf's paper from NIPS https://papers.nips.cc/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf).
TLDR: use
clf.decision_function(samples) * (-1)
as scores. you get a sparse distributiion of scores.

Convert scikit-learn SVM model to LibSVM

I have trained a SVM (svc) using scikit-learn over half a terabyte of data. The model is working fine and I need to port it to C, but I don't want to re-train the SVM from scratch because it takes way too long for me. Is there a way to easily export the model generated by scikit-learn and import it into LibSVM? Internally scikit-learn uses LibSVM so theoretically it should be possible, but I haven't been able to find anything in the documentation. Any suggestion?
Is there a way to easily export the model generated by scikit-learn and import it into LibSVM?
No. The scikit-learn version of LIBSVM has been hacked up severely to fit it into the Python environment and the model is stored as NumPy/SciPy data structures.
Your best shot is to study the SVM decision function and reimplement it in C. The support vectors can be obtained from the SVC object as NumPy arrays, which are easily translated to C arrays.

Resources