RobustScaler partial_fit() similar to MinMaxScaler or StandardScaler - scikit-learn

I have been using RobustScaler to scale data, and recently we added more data that is pushing the memory limits of fit_transform. I was hoping to call partial_fit on subsets of the data, but it looks like RobustScaler does not provide that functionality, while most of the other scalers (MinMax, Standard, MaxAbs) do.
Since I have outliers in the data, I need to use RobustScaler; I tried the MinMax and Standard scalers, but the outliers influence the scaling too much.
I am looking for an alternative to running fit_transform on the whole large dataset, similar to partial_fit in the other scalers.

If using scikit-learn is not a hard requirement for you, you could check out msmbuilder, a library for biomolecular dynamics.
According to its documentation, it provides a RobustScaler similar to scikit-learn's, with the option of using partial_fit.
Link: http://msmbuilder.org/3.7.0/_preprocessing/msmbuilder.preprocessing.RobustScaler.html#msmbuilder.preprocessing.RobustScaler
PS: I have not tested it.
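A rough, untested sketch of how it might be used, assuming msmbuilder's RobustScaler follows the same partial_fit(X) / transform(X) convention as the scikit-learn scalers (the load_chunks helper below is hypothetical, standing in for however you read memory-sized pieces of your data):

# untested sketch: assumes msmbuilder.preprocessing.RobustScaler mirrors the
# scikit-learn scaler API; load_chunks() is a hypothetical helper that yields
# memory-sized numpy arrays from your data on disk
from msmbuilder.preprocessing import RobustScaler

scaler = RobustScaler()

for chunk in load_chunks('/path/to/data'):
    scaler.partial_fit(chunk)        # update the scaling statistics incrementally

scaled = [scaler.transform(chunk) for chunk in load_chunks('/path/to/data')]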

Related

How do I visualize a CNN in PyTorch?

I've just learned a little about PyTorch. Following the official PyTorch documentation, I built a CNN to compare the effects of various optimization algorithms (I've just worked through SGD to Adagrad). However, most of the official documents and tutorial videos end once the accuracy and run time are calculated, and I have no idea how to write the model visualization code. I would like to ask what is used to produce visualizations like the two figures I attached. Is it Matplotlib's pyplot, or some visualization tool specific to PyTorch?
I cannot tell you which library was used to generate the plots you linked to.
There are plenty of options, all of which you can use once you have the data.
One of these options is matplotlib. Others include using Matlab or pgfplots if you want to include your plots in a LaTeX document. These are the tools I use somewhat frequently. They are purely subjective choices.
However, pytorch also supports tensorboard, which is especially useful for live tracking of the training progress.
Have a look at this tutorial: https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html
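A minimal sketch of that route, assuming you log per-epoch values from your own training loop (num_epochs, train_loss and train_acc are placeholders for your code); you can then run tensorboard --logdir=runs to watch training live:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()  # writes event files to ./runs/ by default

for epoch in range(num_epochs):
    # ... your training step computes train_loss and train_acc here ...
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Accuracy/train', train_acc, epoch)

writer.close()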
It looks like matplotlib's pyplot to me. While training, you can store all the loss and accuracy values and plot them after the network is trained.
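For the matplotlib route, a minimal sketch along those lines, assuming losses and accuracies are Python lists you append to once per epoch during training:

import matplotlib.pyplot as plt

# losses and accuracies are assumed to be collected during your training loop
epochs = range(1, len(losses) + 1)

plt.figure()
plt.plot(epochs, losses, label='training loss')
plt.plot(epochs, accuracies, label='training accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()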

How to use GroupKFold with CalibratedClassifierCV?

Unlike GridSearchCV, CalibratedClassifierCV doesn't seem to support passing the groups parameter to the fit method. I found this very old github issue that reports this problem but it doesn't seem to be fixed yet. The documentation makes it clear that not properly stratifying your cv folds will result in an incorrectly calibrated model. My dataset has multiple observations from the same users so I would need to use GroupKFold to ensure proper calibration.
scikit-learn can take an iterable of (train, test) splits as the cv object, so just create them manually. For example:
from sklearn.model_selection import GroupKFold
from sklearn.calibration import CalibratedClassifierCV

my_cv = (
    (train, test)
    for train, test in GroupKFold(n_splits=5).split(X, groups=my_groups)
)
cal_clf = CalibratedClassifierCV(clf, cv=my_cv)
I've created a modified version of CalibratedClassifierCV that addresses this issue for now. Until this is fixed in sklearn master, you can similarly modify the fit method of CalibratedClassifierCV to use GroupKFold. My solution can be found in this gist. It is based on sklearn version 0.24.1, but you can easily adapt it to your version of sklearn as needed.

what are the methods to check if my model fits the data (without using graphs)

I am working on a binary logistic regression dataset in Python. I want to know if there are any numerical methods to measure how well the model fits the data.
Please don't include graphical methods such as plotting.
Thanks :)
Read through section 3.3.2, "Classification metrics", in the sklearn documentation:
http://scikit-learn.org/stable/modules/model_evaluation.html
Hope it helps.
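For a binary logistic regression, a minimal sketch of a few of the metrics from that page, assuming you already have a fitted model clf and a held-out X_test, y_test:

from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, confusion_matrix

# clf, X_test and y_test are assumed to exist already
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print('accuracy :', accuracy_score(y_test, y_pred))
print('log loss :', log_loss(y_test, y_prob))
print('ROC AUC  :', roc_auc_score(y_test, y_prob))
print('confusion matrix:')
print(confusion_matrix(y_test, y_pred))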

How to consistently standardize sparse feature matrix in scikit-learn?

I am using sklearn's DictVectorizer to construct a large, sparse feature matrix, which is fed to an ElasticNet model. Elastic net, like similar linear models, works best when the predictors (columns of the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline with a StandardScaler before the regressor, but that doesn't work with sparse features, as stated in the docs.
I thought of using the normalize=True flag in ElasticNet, which seems to support sparse data, but it's not clear whether the normalization is also applied to the test data at prediction time. Does anyone know if normalize=True applies to prediction as well? If not, is there a way to apply the same standardization to the training and test sets when dealing with sparse features?
Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to the way glmnet in R handles standardization. The relevant code snippet is the method _set_intercept of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. So predictions on unseen data use coefficients in the original scale, i.e., normalize=True is safe to use.
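If you prefer not to rely on normalize=True, one common alternative (not from the answer above, just a sketch) is a Pipeline with a scaler that accepts sparse input, such as StandardScaler(with_mean=False): it scales each column to unit variance without centering (centering would densify the matrix), and the Pipeline guarantees the identical scaling is reapplied at prediction time:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet

# with_mean=False keeps the matrix sparse: columns are scaled but not centered
model = make_pipeline(
    StandardScaler(with_mean=False),
    ElasticNet(alpha=1.0),
)

model.fit(X_train, y_train)           # X_train: sparse matrix from DictVectorizer
predictions = model.predict(X_test)   # the same scaling is reused automatically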

Possibility to apply online algorithms on big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) dictionary learning to big text corpora.
My input data naturally do not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory.
Is it possible to do this with sklearn? Are there alternatives?
Thanks
For some algorithms supporting partial_fit, it would be possible to write an outer loop in a script to do out-of-core, large-scale text classification. However, some elements are missing: a dataset reader that iterates over the data on disk (as folders of flat files, a SQL database, a NoSQL store, or a Solr index with stored fields, for instance), and an online text vectorizer.
Here is a sample integration template to explain how it would fit together.
import numpy as np
import joblib

from sklearn.linear_model import Perceptron

from mymodule import SomeTextDocumentVectorizer
from mymodule import DataSetReader

dataset_reader = DataSetReader('/path/to/raw/data')

# need to know the possible classes ahead of time
expected_classes = dataset_reader.get_all_classes()

feature_extractor = SomeTextDocumentVectorizer()
classifier = Perceptron()

for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):
    vectors = feature_extractor.transform(documents)
    classifier.partial_fit(vectors, labels, classes=expected_classes)

    if i % 100 == 0:
        # dump model to be able to monitor quality and later analyse convergence externally
        joblib.dump(classifier, 'model_%04d.pkl' % i)
The dataset reader class is application specific and will probably never make it into scikit-learn (except maybe for a folder of flat text files or CSV files that would not require to add a new dependency to the library).
The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.
Alternatively we could build an HashingTextVectorizer that would use the hashing trick to drop the dictionary requirements. None of those exist at the moment (although we already have some building blocks such as a murmurhash wrapper and a pull request for hashing features).
In the meantime, I would advise you to have a look at Vowpal Wabbit and maybe those Python bindings.
Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.
Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizer, which can directly deal with text data.
Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.
Since scikit-learn 0.13 there is indeed an implementation of HashingVectorizer.
EDIT: Here is a full-fledged example of such an application
Basically, this example demonstrates that you can learn (e.g. classify text) on data that cannot fit in the computer's main memory (but rather on disk / network / ...).
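With HashingVectorizer the vectorizer is stateless, so the out-of-core loop sketched earlier no longer needs to build a vocabulary. A minimal sketch (iter_document_chunks and all_classes are placeholders for your own data access code):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18)  # stateless: no fit or vocabulary needed
classifier = SGDClassifier()

# iter_document_chunks and all_classes are hypothetical, application-specific pieces
for documents, labels in iter_document_chunks('/path/to/raw/data'):
    X = vectorizer.transform(documents)              # sparse matrix, one row per document
    classifier.partial_fit(X, labels, classes=all_classes)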
In addition to Vowpal Wabbit, gensim might be interesting as well; it too features online Latent Dirichlet Allocation.
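A minimal sketch of gensim's online LDA, assuming your documents are already tokenized (tokenized_docs and more_tokenized_docs are placeholders, lists of token lists):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tokenized_docs / more_tokenized_docs are placeholders: lists of token lists
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=50)

# later, update the same model online with a new batch of documents
lda.update([dictionary.doc2bow(doc) for doc in more_tokenized_docs])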
