MemoryError using MLPClassifier from sklearn.neural_network - python-3.x

I'm running Python 3.5 on a Windows 10 64-bit operating system.
When I try to implement MLPClassifier, the code runs for a while and then gives me a MemoryError.
I think it's due to the size of the hidden layers that I'm asking it to run, but I need to run this size to collect my data. How can I circumvent this error?
Code
from sklearn.neural_network import MLPClassifier

gamma = [1, 10, 100, 1000, 10000, 100000]  # range of hidden-layer sizes to try
score_train = []
score_test = []
for j in gamma:
    mlp = MLPClassifier(solver='lbfgs', random_state=0,
                        hidden_layer_sizes=[j, j],
                        activation='tanh').fit(data_train, classes_train)
    score_train.append(mlp.score(data_train, classes_train))
    score_test.append(mlp.score(data_test, classes_test))
print(score_train)
print(score_test)
Error
MemoryError Traceback

the code runs for a while and then gives me a MemoryError. I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data.
Yes, it's the size of the hidden layers! And the remaining part of that sentence does not make much sense (continue reading)!
Please make sure to read the tutorial and the API docs.
Now some more specific remarks:
The size of the hidden layers has nothing to do with the collection of your data!
The input and output layers are built based on the shapes of your X and y!
hidden_layer_sizes=[j,j] actually creates 2 hidden layers!
In an MLP, all layers are fully connected!
A call with hidden_layer_sizes=[100000, 100000], as you try to do, needs a 100,000 x 100,000 weight matrix just to connect those two hidden layers: at 8 bytes per 64-bit double, that is roughly 75 GiB of memory for this single weight matrix alone!
And this is just one connection layer: input-h0 and h1-output are still missing.
lbfgs is a completely different solver than all the others. Don't use it without some understanding of the implications! It's not the default!
It's a full-batch method and therefore uses a lot more memory when the sample size is big!
Additionally, there are more internal reasons why it uses more memory than the other (first-order) methods.
The docs are not that precise, but they already give a hint: "The default solver 'adam' works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, 'lbfgs' can converge faster and perform better."
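To make that concrete, here is a quick back-of-the-envelope check of the weight-matrix size, followed by a far more modest configuration; the hidden-layer sizes and the use of the default adam solver below are illustrative suggestions, not values from the question:

from sklearn.neural_network import MLPClassifier

# One hidden-to-hidden weight matrix for hidden_layer_sizes=[100000, 100000]:
# 100,000 x 100,000 float64 weights at 8 bytes each.
n = 100_000
print(n * n * 8 / 1024**3)  # ~74.5 GiB for that single weight matrix

# A configuration that actually fits in memory: a few hundred units per layer
# and the default mini-batch solver 'adam' (sizes are illustrative).
mlp = MLPClassifier(solver='adam', random_state=0,
                    hidden_layer_sizes=[100, 100],
                    activation='tanh', max_iter=500)
# mlp.fit(data_train, classes_train)  # data_train / classes_train as in the question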

Related

How do I prevent a lack of VRAM halfway through training a Huggingface Transformers (Pegasus) model?

I'm taking a pre-trained Pegasus model through Huggingface Transformers (specifically google/pegasus-cnn_dailymail, used through PyTorch) and I want to fine-tune it on my own data. This is, however, quite a large dataset, and I've run into the problem of running out of VRAM halfway through training, which, because of the size of the dataset, can be a few days after training even started; that makes a trial-and-error approach very inefficient.
I'm wondering how I can make sure ahead of time that it doesn't run out of memory. I would think that the memory usage of the model is in some way proportional to the size of the input, so I've passed truncation=True, padding=True, max_length=1024 to my tokenizer, which, if my understanding is correct, should make all the outputs of the tokenizer the same size per line. Considering that the batch size is also constant, I would think that the amount of VRAM in use should be stable. Then I should just be able to cut the dataset into manageable parts, look at the RAM/VRAM use of the first run, and infer that it will run smoothly from start to finish.
However, the opposite seems to be true. I've been observing the amount of VRAM used at any time and it can vary wildly, from ~12 GB at one point to suddenly requiring more than 24 GB and crashing (because I don't have more than 24 GB).
So, how do I make sure that the amount of VRAM in use will stay within reasonable bounds for the full duration of the training process, and avoid crashing due to a lack of VRAM when I'm already days into training?
padding=True actually doesn't pad to max_length, but to the longest sample in the list you pass to the tokenizer. To pad to max_length you need to set padding='max_length'.
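For example, a minimal sketch (the model name and max_length come from the question; the example texts and variable names are just for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
texts = ["a short document", "a much longer document that will dominate the batch length"]

# padding=True pads only to the longest sample in this particular call,
# so tensor shapes (and memory use) vary from batch to batch.
dynamic = tokenizer(texts, truncation=True, padding=True,
                    max_length=1024, return_tensors="pt")

# padding='max_length' pads every sample to max_length, so every batch has
# the same shape and a predictable memory footprint.
fixed = tokenizer(texts, truncation=True, padding="max_length",
                  max_length=1024, return_tensors="pt")

print(dynamic["input_ids"].shape, fixed["input_ids"].shape)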

Loading a model checkpoint with less memory

I have a question that I can't find any answers to online. I have trained a model whose checkpoint file is about 20 GB. Since I do not have enough RAM on my system (nor does Colaboratory/Kaggle - the limit there is 16 GB), I can't use my model for predictions.
I know that the model has to be loaded into memory for inferencing to work. However, is there a workaround or a method that can:
save some memory and still load the model within 16 GB of RAM (for CPU), or within the memory of the TPU/GPU
work with any framework (since I would be using both), i.e. TensorFlow + Keras or PyTorch (which I am using right now)
Is such a method even possible in either of these libraries? One of my tentative solutions was perhaps to load it in chunks, essentially maintaining a buffer for the model weights and biases and performing calculations accordingly - though I haven't found any implementations of that.
I would also like to add that I wouldn't mind the performance slowdown since it is to be expected with low-specification hardware. As long as it doesn't take more than two weeks :) I can definitely wait that long...
You can try the following:
split the model into two parts
load the weights into both parts separately, calling model.load_weights(filepath, by_name=True) on each part
call the first part with your input
call the second part with the output of the first part
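A minimal Keras sketch of that idea, assuming the checkpoint stores weights by layer name; the layer names, sizes and file name below are purely illustrative:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# First half of the network, rebuilt with the same layer names as in the checkpoint.
inp1 = keras.Input(shape=(1024,), name="input")
h1 = layers.Dense(4096, activation="relu", name="dense_1")(inp1)
out1 = layers.Dense(4096, activation="relu", name="dense_2")(h1)
part1 = keras.Model(inp1, out1)

# Second half, taking the intermediate activations as its input.
inp2 = keras.Input(shape=(4096,), name="intermediate")
h2 = layers.Dense(4096, activation="relu", name="dense_3")(inp2)
out2 = layers.Dense(10, activation="softmax", name="output")(h2)
part2 = keras.Model(inp2, out2)

x = np.random.rand(8, 1024).astype("float32")  # stand-in for real input

# by_name=True loads only the weights whose layer names exist in each part.
part1.load_weights("full_model_weights.h5", by_name=True)
intermediate = part1.predict(x)
del part1  # release the first half before loading the second

part2.load_weights("full_model_weights.h5", by_name=True)
predictions = part2.predict(intermediate)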

How does the model in sklearn handle large data sets in python?

Now I have a 10 GB data set to train a model in sklearn, but my computer only has 8 GB of memory; do I have any other ways to go besides an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online-learning approach, then you're on track. The chunk size you pick may influence your success; a minimal sketch follows below.
This link may be useful: Working with big data in python and numpy, not enough ram, how to save partial results on the disc?
Another thing you can do is randomly pick whether or not to keep each row of your csv file... and save the result to a .npy file so it loads quicker. That way you get a sample of your data that lets you start playing with it with all the algorithms... and deal with the bigger-data issue along the way (or not at all! sometimes a sample with a good approach is good enough, depending on what you want).
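A minimal sketch of the chunked partial_fit idea mentioned above (the file name, chunk size and column names are assumptions made for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])           # all class labels must be declared up front
clf = SGDClassifier(random_state=0)

# Stream the 10 GB CSV in chunks that comfortably fit in 8 GB of RAM.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    y = chunk["label"].to_numpy()
    X = chunk.drop(columns=["label"]).to_numpy()
    clf.partial_fit(X, y, classes=classes)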

Memory error while running LSTM Autoencoders

I am setting up an LSTM autoencoder with multivariate time sequences. Each of my sequences has a different number of time steps (approximately 30 million steps in one sequence) and 6 features. I know that to give one sequence as input to the LSTM autoencoder, I have to reshape it as (1, 30 million, 6). I reshaped all 9 of my sequences in a similar manner. I want the autoencoder to reconstruct my sequences. However, my program is crashing due to the large number of time steps in each sequence. How can I solve this memory error? Even if I feed the data with a small batch size, my program runs out of memory. I am new to machine learning and sequence learning, so please help me. My network is attached below:
from keras import backend as K
from keras.layers import Input, Lambda, LSTM, RepeatVector
from keras.models import Model

def repeat_vector(args):
    # Repeat the latent vector once per time step of the original input sequence.
    [layer_to_repeat, sequence_layer] = args
    return RepeatVector(K.shape(sequence_layer)[1])(layer_to_repeat)

encoder_input = Input(shape=(None, self._input_features))
encoder_output = LSTM(self._latent_space)(encoder_input)
decoder_input = Lambda(repeat_vector, output_shape=(None, self._latent_space))([encoder_output, encoder_input])
decoder_output = LSTM(self._input_cells, return_sequences=True)(decoder_input)
self._autoencoder = Model(encoder_input, decoder_output)
I have already tried to take input via hdf files.
I am not sure what system configuration you are using. OOM errors can be tackled from both the software and the hardware end. If you are using a system with, say, 4 GB RAM and some i5 processor (assuming it's Intel), it might not work. If you are working on a GPU (which is not very likely), it should not be a hardware issue.
If your system has a graphics card, then you can optimize the code a bit:
Try a batch size of 1.
If you have a pre-processing queue etc., try to tweak the queue size.
I would suggest you try this on a smaller series first before going for the complete thing, and check whether it works; a sketch of cutting a long sequence into windows follows below.
If you take the time step to be too large, it will lose precision, and if it's too small, it's heavy to compute. Check for each of them whether the time step can be increased without compromising much on precision.
You can use PCA to find the important features and reduce dimensionality. You can also use a random forest as a preprocessing step to determine feature importance and drop the features with low importance.
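For instance, one way to act on the "smaller series" suggestion is to cut each long sequence into fixed-length windows before feeding the autoencoder; the window length and array shapes below are illustrative:

import numpy as np

def to_windows(sequence, window_len):
    # Cut a (timesteps, features) array into (n_windows, window_len, features),
    # dropping the incomplete tail window.
    n_windows = len(sequence) // window_len
    trimmed = sequence[:n_windows * window_len]
    return trimmed.reshape(n_windows, window_len, sequence.shape[1])

long_sequence = np.random.rand(1_000_000, 6)         # stand-in for one long sequence
windows = to_windows(long_sequence, window_len=1000)
print(windows.shape)                                  # (1000, 1000, 6)
# self._autoencoder.fit(windows, windows, batch_size=1)  # train on windows instead of one huge sequence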

MemoryError on joblib dump

I have the following snippet running to train a model for text classification. I optimized it quite a bit and it's running pretty smoothly; however, it still uses a lot of RAM. Our dataset is huge (13 million documents and 18 million words in the vocabulary), but the point in the execution that throws the error is very weird, in my opinion. The script:
import joblib
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(categories)
classes = list(range(0, len(encoder.classes_)))

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             binary=True,
                             dtype=numpy.int8)
classifier = SGDClassifier(loss='modified_huber',
                           n_jobs=-1,
                           average=True,
                           random_state=1)

tokenpath = modelpath.joinpath("tokens")
for i in range(0, len(batches)):
    token_matrix = joblib.load(
        tokenpath.joinpath("{}.pickle".format(i)))
    batchsize = len(token_matrix)
    classifier.partial_fit(
        vectorizer.transform(token_matrix),
        y[i * batchsize:(i + 1) * batchsize],
        classes=classes
    )

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))
I got the MemoryError at this line:
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
At this point in the execution, training is finished and the classifier has already been dumped. It should be collected by the garbage collector if more memory is needed. On top of that, why would joblib allocate so much memory if it isn't even compressing the data?
I do not have deep knowledge of the inner workings of the Python garbage collector. Should I be forcing gc.collect() or using 'del' statements to free the objects that are no longer needed?
Update:
I have tried using the HashingVectorizer and, even though it greatly reduces memory usage, the vectorizing is way slower, which makes it not a very good alternative.
I have to pickle the vectorizer to use it later in the classification process, so that I can generate the sparse matrix that is submitted to the classifier. I will post my classification code here:
extracted_features = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(features.extractor)(d, extraction_options) for d in documents)
probabilities = classifier.predict_proba(
    vectorizer.transform(extracted_features))
predictions = category_encoder.inverse_transform(
    probabilities.argmax(axis=1))
trust = probabilities.max(axis=1)
If you are providing your own custom vocabulary to the CountVectorizer, it should not be a problem to recreate it later on, during classification. As you provide a set of strings instead of a mapping, you probably want to use the parsed vocabulary, which you can access with:
parsed_vocabulary = vectorizer.vocabulary_
joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))
and then load it and use it to re-create the CountVectorizer:
vectorizer = CountVectorizer(
    vocabulary=parsed_vocabulary,
    binary=True,
    dtype=numpy.int8
)
Note that you do not need to use joblib here; the standard pickle should perform the same. You might get better results using any of the available alternatives, with PyTables being worth mentioning.
If that uses too much memory as well, you should try using the original vocabulary to recreate the vectorizer; currently, when provided with a set of strings as the vocabulary, vectorizers just convert the set to a sorted list, so you shouldn't need to worry about reproducibility (although I would double-check that before using it in production). Or you could just convert the set to a list on your own.
To sum up: because you do not fit() the vectorizer, the whole added value of using CountVectorizer is its transform() method; as all the data it needs is the vocabulary (and the parameters), you can reduce memory consumption by pickling just your vocabulary, either processed or not.
As you asked for an answer drawing on official sources, I would like to point you to https://github.com/scikit-learn/scikit-learn/issues/3844, where an owner and a contributor of scikit-learn mention recreating a CountVectorizer, albeit for other purposes. You may have better luck reporting your problems in the linked repo, but make sure to include a dataset that causes the excessive memory usage, to make it reproducible.
And finally, you may just use HashingVectorizer, as mentioned earlier in a comment.
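A tiny illustrative sketch of that route; HashingVectorizer is stateless, so there is no vocabulary-sized object to pickle at all (the parameters below are just examples):

from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no fit() and no vocabulary_ attribute, hence nothing large to dump.
hasher = HashingVectorizer(n_features=2**20, binary=True, norm=None)
X = hasher.transform(["some preprocessed document", "another document"])
print(X.shape)  # (2, 1048576)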
PS: regarding the use of gc.collect() - I would give it a go in this case; as for the technical details, you will find many questions on SO tackling this issue.
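For instance, a sketch of the del + gc.collect() idea applied to the training script above (not a guaranteed fix, just the pattern):

import gc

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
del classifier   # drop the reference once the model is on disk
gc.collect()     # ask the collector to release the memory right away
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))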
