I'm attempting to use sklearn 0.11's LogisticRegression object to fit a model on 200,000 observations with about 80,000 features. The goal is to classify short text descriptions into 1 of 800 classes.
When I attempt to fit the classifier, pythonw.exe gives me an Application Error:
"The instruction at ... referenced memory at 0x00000000. The memory could not be written."
The features are extremely sparse (about 10 per observation) and binary (either 1 or 0), so by my back-of-the-envelope calculation my 4 GB of RAM should be able to handle the memory requirements, but that doesn't appear to be the case. The model only fits when I use fewer observations and/or fewer features.
If anything, I would like to use even more observations and features. My naive understanding is that the liblinear library running things behind the scenes is capable of supporting that. Any ideas for how I might squeeze a few more observations in?
My code looks like this:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

y_vectorizer = LabelVectorizer(y)  # my custom vectorizer for labels
y = y_vectorizer.fit_transform(y)
x_vectorizer = CountVectorizer(binary=True, analyzer=features)
x = x_vectorizer.fit_transform(x)
clf = LogisticRegression()
clf.fit(x, y)
The features() function I pass to analyzer just returns a list of strings indicating the features detected in each observation.
I'm using Python 2.7, sklearn 0.11, Windows XP with 4 GB of RAM.
liblinear (the backing implementation of sklearn.linear_model.LogisticRegression) will host its own copy of the data because it is a C++ library whose internal memory layout cannot be directly mapped onto a pre-allocated sparse matrix in scipy such as scipy.sparse.csr_matrix or scipy.sparse.csc_matrix.
In your case I would recommend loading your data as a scipy.sparse.csr_matrix and feeding it to a sklearn.linear_model.SGDClassifier (with loss='log' if you want a logistic regression model and the ability to call the predict_proba method). SGDClassifier will not copy the input data if it is already using the scipy.sparse.csr_matrix memory layout.
Expect it to allocate a dense model of 800 classes * (80,000 features + 1 intercept) * 8 bytes per float64 / (1024 ** 2) ≈ 488 MB in memory (in addition to the size of your input dataset).
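A minimal sketch of that setup (assuming x is the CSR matrix produced by CountVectorizer and y is a 1-D array of integer class labels, not a vectorized label matrix):
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log')  # loss='log' gives a logistic regression model with predict_proba
clf.fit(x, y)                    # no extra copy when x is already a scipy.sparse.csr_matrix
probas = clf.predict_proba(x)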
Edit: how to optimize the memory access for your dataset
To free memory after dataset extraction you can:
x_vectorizer = CountVectorizer(binary = True, analyzer = features)
x = x_vectorizer.fit_transform(x)
from sklearn.externals import joblib
joblib.dump(x.tocsr(), 'dataset.joblib')
Then quit this python process (to force complete memory deallocation) and in a new process:
x_csr = joblib.load('dataset.joblib')
Under Linux / OS X you could memory-map that even more efficiently with:
x_csr = joblib.load('dataset.joblib', mmap_mode='c')
I am working with R scripts, specifically for ANN models built with the nnet package. I run my scripts on my local computer (Windows) and my colleague runs the same R scripts on his computer (Docker -> Linux). We get similar but not identical results for the ANN models: there are large differences in the neuron weights, and slight differences in the fitted values and predictions.
We are setting the same seed just before the nnet() call, so we are drawing from the same random stream. Additionally, I have set the initialization weights ("Wts") to the same value (1) for all coefficients, biases, etc. of the model. I have also tested the randomization of both systems by setting the seed and drawing a random sample(), which returns the same results.
I have also tested our model inputs (spectra) and they match 1:1 on both machines.
We build a number of other models including PLS, GPR and SVR (with grid-search parameters) and those always give identical results. Those models do not use randomization, so our assumption is that the randomization within the ANN models is the cause of the difference.
We have also updated R to the most recent version (4.2.2) and updated all of our packages, including nnet and its dependencies, from the same repository.
I am at a loss as to where the difference could come from; my last thought is the difference between operating systems (me = Windows, he = Linux). Could there be another difference that affects the nnet() function, such as floating-point rounding (the model input variables are low-magnitude decimals) or ordering differences between the operating systems?
The expectation is to have identical ANN models (weights, fitted values and predictions) on both systems.
Sorry for the lack of reproducible code; the models work on high-dimensional data (spectra with > 1000 variables, n > 1000). I can share our nnet() call:
cv_wts <- rep(1, cv_wts_n)  # identical starting weights for every connection
set.seed(seed)              # same seed set immediately before fitting
cal <- nnet(TV ~ NIR, data = training_dat, size = n, decay = d, Wts = cv_wts,
            linout = TRUE, maxit = 1000000, MaxNWts = 1000000, trace = FALSE)
I am currently using the kernels that come with the scikit-learn support vector machine library.
How do I extract the kernel matrix for a classifier created using sklearn.svm.SVC?
Unfortunately, scikit-learn does not provide a direct method to get the kernel matrix from a trained SVM.
However, scikit-learn allows an SVM to take a custom (precomputed) kernel, so what I did is:
1. train an SVM with a specific kernel,
2. manually compute the kernel matrix from the parameters of the trained SVM,
3. define a new SVM with kernel='precomputed' and feed it that matrix,
4. check the new SVM on the same training data to see if it behaves the same as the previous one.
Here is the code, taking rbf and poly as examples (clf is the SVC that was trained on X_train_C, y_train_C):
import numpy as np
from sklearn.svm import SVC

# rbf kernel: K(x, z) = exp(-gamma * ||x - z||^2)
K_train = np.exp(-clf.gamma * np.sum((X_train_C[..., None, :] - X_train_C) ** 2, axis=2))
# poly kernel: K(x, z) = (gamma * <x, z> + coef0) ** degree
# K_train = (clf.gamma * X_train_C.dot(X_train_C.T) + clf.coef0) ** clf.degree

clf_pre = SVC(kernel='precomputed')
clf_pre.fit(K_train, y_train_C)
pred_pre = clf_pre.predict(K_train)
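To complete the last step above, a quick sanity check against the original classifier (the two should normally agree):
# compare the precomputed-kernel SVM's predictions with the original SVM's
print(np.array_equal(pred_pre, clf.predict(X_train_C)))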
There is one last thing I am not sure about: when I load the precomputed kernel, I cannot use it directly; I need to fit again, which is the same as in the examples given by scikit-learn.
Here are the examples provided by scikit-learn:
https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html#sphx-glr-auto-examples-svm-plot-custom-kernel-py
https://scikit-learn.org/stable/modules/svm.html?highlight=svc+custom+kernel
(1.4.6.2 Custom Kernels)
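For predicting on new samples with the precomputed kernel, the matrix passed to predict must hold the kernel values between the test samples (rows) and the training samples (columns). A rough sketch for the rbf case, assuming a test set X_test_C with the same number of features as X_train_C:
# shape (n_test, n_train): kernel between each test sample and each training sample
K_test = np.exp(-clf.gamma * np.sum((X_test_C[..., None, :] - X_train_C) ** 2, axis=2))
pred_test = clf_pre.predict(K_test)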
I am working on a binary classification problem. I have ~1.5 million data points, and the dimensionality of the feature space is 1 million. This dataset is stored as a sparse array, with a density of ~0.0001. For this post, I'll limit the scope to assume that the model is a shallow feedforward neural network, and also assume that the dimensionality has already been optimized (so it cannot be reduced below 1 million). Naive approaches to creating mini-batches out of this data to feed into the network take a lot of time (as an example, a basic approach of creating a TensorDataset (map style) from a torch.sparse.FloatTensor representation of the input array, and wrapping a DataLoader around it, means ~20 s to get a mini-batch of 32 to the network, as opposed to ~0.1 s to perform the actual training). I am looking for ways to speed this up.
What I've tried
I first figured that reading from such a large sparse array in every iteration of the DataLoader was computationally intensive, so I broke this sparse array down into smaller sparse arrays.
For the DataLoader to read from these multiple sparse arrays in an iterative fashion, I replaced the map-style dataset that I had inside the DataLoader with an IterableDataset, and streamed these smaller sparse arrays into this IterableDataset like so:
import torch
from itertools import chain
from scipy import sparse

class SparseIterDataset(torch.utils.data.IterableDataset):
    def __init__(self, fpaths):
        super(SparseIterDataset, self).__init__()
        self.fpaths = fpaths

    def read_from_file(self, fpath):
        # load one of the smaller sparse arrays and densify it row by row
        data = sparse.load_npz(fpath).toarray()
        for d in data:
            yield torch.Tensor(d)

    def get_stream(self, fpaths):
        return chain.from_iterable(map(self.read_from_file, fpaths))

    def __iter__(self):
        return self.get_stream(self.fpaths)
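For reference, this is roughly how I wrap it in a DataLoader (the file names below are just placeholders for the smaller .npz chunks):
fpaths = ['chunk_0.npz', 'chunk_1.npz']
dataset = SparseIterDataset(fpaths)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)
for batch in loader:
    pass  # batch is a dense float tensor of shape (32, n_features)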
With this approach, I was able to bring the time down from the naive base case of ~20 s to ~0.2 s per mini-batch of 32. However, given that my dataset has ~1.5 million samples, this still implies a lot of time spent even making one pass through the dataset. (As a comparison, even though it's slightly apples to oranges, running a logistic regression in scikit-learn on the original sparse array takes about ~6 s per iteration through the whole dataset. With PyTorch, with the approach I just outlined, it would take ~3000 s just to load all the mini-batches in an epoch.)
One thing which I am aware of but have yet to try is multiprocess data loading by setting the num_workers argument of the DataLoader. I believe this has its own catches in the case of iterable-style datasets, though. Plus, even a 10x speedup would still mean ~300 s per epoch spent loading mini-batches. I feel I'm being inordinately slow! Are there any other approaches/improvements/best practices that you could suggest?
Your dataset in un-sparsified form would be 1.5M x 1M x 1 byte = 1.5 TB as uint8, or 1.5M x 1M x 4 bytes = 6 TB as float32. Simply reading 6 TB from memory to the CPU could take 5-10 minutes on a modern CPU (depending on the architecture), and transfer speeds from CPU to GPU would be a bit slower than that (an NVIDIA V100 on PCIe has 32 GB/s theoretical).
Approaches:
Benchmark everything individually - e.g. in Jupyter:
data = sparse.load_npz(fpath)
dense = data.toarray()
%timeit sparse.load_npz(fpath)
%timeit data.toarray()       # un-sparsify for comparison
%timeit torch.tensor(dense)  # probably about the same as the line above
Also print out the shapes and datatypes of everything to make sure they are as expected. I haven't tried running your code, but I am pretty sure that (a) sparse.load_npz is extremely fast and unlikely to be a bottleneck, and (b) building a dense tensor with torch.tensor is quite slow here.
Use torch.sparse. I think torch sparse tensors can be used as regular tensors in most cases. You'd have to do some data prep to convert from scipy.sparse to torch.sparse:
A sparse tensor is represented as a pair of dense tensors: a tensor of values and a 2D tensor of indices. A sparse tensor can be constructed by providing these two tensors, as well as the size of the sparse tensor.
You mention torch.sparse.FloatTensor but I'm pretty sure you're not making sparse tensors in your code - there is no reason to expect those would be constructed simply from passing a scipy.sparse array to a regular tensor constructor, since that's not how they're usually made.
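As a rough, untested sketch of what that conversion could look like (x_scipy here stands for one of your scipy.sparse arrays):
import numpy as np
import torch
from scipy import sparse

x_coo = x_scipy.tocoo()  # COO format exposes explicit row/col indices
indices = torch.as_tensor(np.vstack((x_coo.row, x_coo.col)), dtype=torch.int64)
values = torch.as_tensor(x_coo.data, dtype=torch.float32)
x_sparse = torch.sparse_coo_tensor(indices, values, x_coo.shape)
# slices/batches can then be densified with .to_dense(), ideally only after moving them to the GPU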
If you figure out a good way to do this, I recommend you post it as a project or repo on GitHub; it would be quite useful.
If torch.sparse doesn't work out, think of other ways to either convert the data to dense only on the GPU, or avoid converting it entirely.
See also:
https://towardsdatascience.com/sparse-matrices-in-pytorch-be8ecaccae6
https://github.com/rusty1s/pytorch_sparse
I have the following snippet running to train a model for text classification. I optimized it quite a bit and it's running pretty smoothly; however, it still uses a lot of RAM. Our dataset is huge (13 million documents + 18 million words in the vocabulary), but the point in the execution where the error is thrown is very strange, in my opinion. The script:
import numpy
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

encoder = LabelEncoder()
y = encoder.fit_transform(categories)
classes = list(range(0, len(encoder.classes_)))

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             binary=True,
                             dtype=numpy.int8)

classifier = SGDClassifier(loss='modified_huber',
                           n_jobs=-1,
                           average=True,
                           random_state=1)

tokenpath = modelpath.joinpath("tokens")
for i in range(0, len(batches)):
    token_matrix = joblib.load(
        tokenpath.joinpath("{}.pickle".format(i)))
    batchsize = len(token_matrix)
    classifier.partial_fit(
        vectorizer.transform(token_matrix),
        y[i * batchsize:(i + 1) * batchsize],
        classes=classes
    )

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))
I got the MemoryError at this line:
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
At this point in the execution, training is finished and the classifier has already been dumped. It should be collected by the garbage collector if more memory is needed. Besides, why would joblib allocate so much memory if it isn't even compressing the data?
I do not have deep knowledge of the inner workings of the Python garbage collector. Should I be forcing gc.collect() or using 'del' statements to free the objects that are no longer needed?
Update:
I have tried using the HashingVectorizer and, even though it greatly reduces memory usage, the vectorizing is way slower, which makes it not a very good alternative.
I have to pickle the vectorizer so I can use it later in the classification process to generate the sparse matrix that is fed to the classifier. Here is my classification code:
extracted_features = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(features.extractor)(d, extraction_options) for d in documents)

probabilities = classifier.predict_proba(
    vectorizer.transform(extracted_features))

predictions = category_encoder.inverse_transform(
    probabilities.argmax(axis=1))
trust = probabilities.max(axis=1)
If you are providing your custom vocabulary to the CountVectorizer, it should not be a problem to recreate it later on, during classification. Since you provide a set of strings instead of a mapping, you probably want to use the parsed vocabulary, which you can access with:
parsed_vocabulary = vectorizer.vocabulary_
joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))
and then load it and use it to re-create the CountVectorizer:
vectorizer = CountVectorizer(
vocabulary=parsed_vocabulary,
binary=True,
dtype=numpy.int8
)
Note that you do not need to use joblib here; the standard pickle should perform the same; you might get better results using any of the available alternatives, with PyTables being worth mentioning.
If that uses too much memory as well, you should try using the original vocabulary to recreate the vectorizer; currently, when provided with a set of strings as the vocabulary, vectorizers just convert the set to a sorted list, so you shouldn't need to worry about reproducibility (although I would double-check that before using it in production). Or you could just convert the set to a list on your own.
To sum up: because you do not fit() the vectorizer, the whole added value of using CountVectorizer is its transform() method; since all the needed data is the vocabulary (and the parameters), you can reduce memory consumption by pickling just your vocabulary, either processed or not.
As you asked for an answer drawing on official sources, I would like to point you to https://github.com/scikit-learn/scikit-learn/issues/3844 where an owner and a contributor of scikit-learn mention recreating a CountVectorizer, albeit for other purposes. You may have better luck reporting your problems in the linked repo, but make sure to include a dataset which causes the excessive memory usage, to make it reproducible.
And finally, you may just use HashingVectorizer, as mentioned earlier in a comment.
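For reference, a rough sketch of that route: HashingVectorizer is stateless, so the same construction can be repeated at classification time and there is nothing to pickle beyond the parameters (n_features below is just an illustrative value, and the classifier is assumed to have been trained on hashed features as well):
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 22, binary=True)
X = vectorizer.transform(extracted_features)  # no fit() and no vocabulary_ to store
probabilities = classifier.predict_proba(X)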
PS: regarding the use of gc.collect() - I would give it a go in this case; regarding the technical details, you will find many questions on SO tackling this issue.
I am trying to run the Spark MLlib packages in pyspark with a test machine learning data set. I am splitting the data set into a half training set and a half test set. Below is my code that builds the model. However, it shows weights of NaN, NaN, ... across all features. I couldn't figure out why. But it works when I standardize the data with the StandardScaler function.
from pyspark.mllib.regression import LinearRegressionWithSGD

model = LinearRegressionWithSGD.train(train_data, step=0.01)
# evaluate the model on the test data set
valuesAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)))
Thank you very much for the help.
Below is the code that I used to do the scaling.
scaler = StandardScaler(withMean = True, withStd = True).fit(data.map(lambda x:x.features))
feature = [scaler.transform(x) for x in data.map(lambda x:x.features).collect()]
label = data.map(lambda x:x.label).collect()
scaledData = [LabeledPoint(l, f) for l,f in zip(label, feature)]
Try scaling the features
StandardScaler standardizes features by scaling to unit variance and/or removing the mean using column summary statistics on the samples in the training set. This is a very common pre-processing step.
Standardization can improve the convergence rate during the optimization process, and it also prevents features with very large variances from exerting an overly large influence during model training. Since you have some variables that are large numbers (e.g. revenue) and some that are smaller (e.g. number of clients), this should solve your problem.
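A rough sketch of how that scaling could look with the RDD API (train_data is assumed to be an RDD of LabeledPoint; the same fitted scaler should also be applied to test_data before evaluation):
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

features = train_data.map(lambda p: p.features)
labels = train_data.map(lambda p: p.label)

# fit the scaler on the training features, then transform the whole RDD at once
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled_train = labels.zip(scaler.transform(features)) \
                     .map(lambda lf: LabeledPoint(lf[0], lf[1]))

model = LinearRegressionWithSGD.train(scaled_train, step=0.01)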