Loading a model checkpoint with less memory - python-3.x

I have a question that I can't find any answer to online. I have trained a model whose checkpoint file is about 20 GB. Since I do not have enough RAM on my system (or on Colaboratory/Kaggle either - the limit being 16 GB), I can't use my model for predictions.
I know that the model has to be loaded into memory for inference to work. However, is there a workaround or a method that can:
Save some memory so the model can be loaded in 16 GB of RAM (for CPU), or in the memory of the TPU/GPU
Work with any framework (since I would be working with both): TensorFlow + Keras, or PyTorch (which I am using right now)
Is such a method even possible in either of these libraries? One of my tentative solutions was to load it in chunks, essentially maintaining a buffer for the model weights and biases and performing calculations accordingly - though I haven't found any implementations for that.
I would also like to add that I wouldn't mind the performance slowdown, since it is to be expected with low-specification hardware. As long as it doesn't take more than two weeks :) I can definitely wait that long...

You can try the following (a rough sketch follows the steps):
split the model into two parts
load the weights into both parts separately by calling model.load_weights(by_name=True)
call the first model with your input
call the second model with the output of the first model
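For illustration, here is a rough sketch of that split-and-chain approach in Keras, assuming a checkpoint saved with save_weights() in HDF5 format and hypothetical layer names and shapes (dense_1 ... output, checkpoint.h5) that you would replace with your own:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_part1():
    inp = keras.Input(shape=(1024,), name="input")
    x = layers.Dense(4096, activation="relu", name="dense_1")(inp)
    x = layers.Dense(4096, activation="relu", name="dense_2")(x)
    return keras.Model(inp, x, name="part1")

def build_part2():
    inp = keras.Input(shape=(4096,), name="part2_input")
    x = layers.Dense(4096, activation="relu", name="dense_3")(inp)
    out = layers.Dense(10, activation="softmax", name="output")(x)
    return keras.Model(inp, out, name="part2")

# Load each half separately; by_name=True matches layers by name, so layers
# that belong to the other half are simply left untouched.
part1 = build_part1()
part1.load_weights("checkpoint.h5", by_name=True)
intermediate = part1.predict(np.random.rand(1, 1024))  # your real input here
del part1  # free the first half before loading the second

part2 = build_part2()
part2.load_weights("checkpoint.h5", by_name=True)
predictions = part2.predict(intermediate)

Only one half of the weights is resident at a time, at the cost of running inference in two passes.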

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to iteratively train a Word2Vec model on multiple texts that I supply myself. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...
First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
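For reference, the corrected calls with the variable names from the question would be:

model.build_vocab(sentences, update=True)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)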
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a workers count equal to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading; beyond that, more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. Finding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.
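Putting those remarks together, a minimal single-pass setup (assuming Gensim 4.x, where the constructor takes an epochs argument) might look like the following; the min_count, workers, and epochs values are illustrative starting points rather than tuned recommendations:

from multiprocessing import cpu_count
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,                    # the full corpus, supplied once
    min_count=5,                  # discard very rare words instead of min_count=0
    workers=min(8, cpu_count()),  # beyond roughly 4-12 workers, more rarely helps
    epochs=5,
)
model.save('onepassmodel.model')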

How to optimize memory footprint of Stanza models

I'm using Stanza to get tokens, lemmas and tags from documents in multiple languages for the purposes of a language learning app. This means that I need to store and load many Stanza (default) models for different languages.
My main problem right now is that if I want to load all those models the memory requirement is too much for my resources. I currently deploy a web API running Stanza NLP on AWS. I want to keep my infrastructure costs at a minimum.
One possible solution is to load one model at a time when I need to run my script. I guess that means there will be some extra overhead each time in order to load the model in memory.
Another thing I tried is just to use the processors that I really need which decreases the memory footprint but not by that much.
I tried looking at open and closed issues on Github and Google but didn't find much.
What other possible solutions are out there?
The bottom line is a model for a language has to be in memory during execution, so by some means or another you need to make the model smaller or tolerate storing models on disk. I can offer some suggestions to make the models smaller, though be warned that making your model smaller will probably result in poorer accuracy.
You could examine the percentage breakdown of language requests, and store commonly requested languages in memory and only go to disk for rarer language requests.
The strategy with the most immediate impact on model size is to shrink the vocabulary. It is possible you could cut the vocabulary even smaller and still get similar accuracy. We have done some optimization on this front, but there may be more opportunity to cut model size.
You could experiment with a smaller model size and smaller word embeddings and may only get a small accuracy drop; we haven't really aggressively experimented with different model sizes to see how much accuracy you lose. This would mean retraining the model and just setting the embedding size and model size parameters smaller.
I don't know a lot about this, but there is a strategy of tagging a bunch of data with your big accurate model, and then training a smaller model to mimic the big model. I believe this is called "knowledge distillation".
In a similar direction, you could tag a bunch of data with Stanza, and then train a CoreNLP model (which I think would have a smaller memory footprint).
In summary, I think the easiest thing to do would be to retrain a model with a smaller vocabulary size. I think it currently has 250,000 words, and cutting to 10,000 or 50,000 will reduce model size, but may not affect accuracy too badly.
Unfortunately, I don't think there is a magical option you can select that will just solve this issue; you will have to retrain models and see what kind of accuracy you are willing to sacrifice for a lower memory footprint.
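As a sketch of the "keep only the commonly requested languages in memory" idea from the suggestions above, one low-effort option is an LRU cache around pipeline construction; maxsize and the processors string below are placeholders to tune for your own memory budget and needs:

from functools import lru_cache
import stanza

@lru_cache(maxsize=3)  # number of language models kept resident at once
def get_pipeline(lang: str) -> stanza.Pipeline:
    # Request only the processors you actually need to reduce the footprint.
    return stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma")

def analyze(text: str, lang: str):
    doc = get_pipeline(lang)(text)
    return [(w.text, w.lemma, w.upos) for s in doc.sentences for w in s.words]

Rarely used languages get evicted and reloaded from disk on their next request, which trades some latency for a bounded memory footprint.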

Continue training a deeplearning4j model after it has been saved and loaded

I am using a Convolutional Neural Network and I am saving it and loading it via the model serializer class.
What I want to do is to be able to come back at a later time and continue training the model on new data provided to it.
What I am doing is I load it using
ComputationGraph net = ModelSerializer.restoreComputationGraph(modelFileName);
and then I give it the data like before with
net.train(dataSetIterator);
This seems to work, but it makes my accuracy really bad. It was about 89% before I did this, and, using the same data, it gets to be around 50% accurate after a few iterations (using the same data it just trained itself on, so if anything it should be getting stupidly more accurate right?).
Am I missing a step?
I think it'll be difficult to answer based on the information given, but I'll give you an example.
I had this exact problem. I had based my app on the GravesLSTMCharModellingExample (which is LSTM). I had saved my model after running for a couple of epochs (at which point it generated legible sentences), but when loading it, it produced garbage.
I thought everything was the same, but in the end it turned out I didn't initialize the CharacterIterator the same. When I fixed it, it worked as expected.
So, to cut a long story short:
Check your values when initializing the auxiliary classes.

MemoryError using MLPClassifier from sklearn.neural_network

I'm running Python 3.5 on a Windows 10 64-bit operating system.
When I try to implement MLPClassifier the code runs for a while and then gives me a MemoryError.
I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data. How can I circumvent this error?
Code
from sklearn.neural_network import MLPClassifier

gamma = [1, 10, 100, 1000, 10000, 100000]  # create array for range of gamma values
score_train = []
score_test = []
for j in gamma:
    mlp = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[j, j], activation='tanh').fit(data_train, classes_train)
    score_train.append(mlp.score(data_train, classes_train))
    score_test.append(mlp.score(data_test, classes_test))
print(score_train)
print(score_test)
Error
MemoryError traceback
the code runs for a while and then gives me a MemoryError. I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data.
Yes, it's the size of the hidden-layers! And the remaining part of that sentence does not make much sense (continue reading)!
Please make sure to read the tutorial and API docs.
Now some more specific remarks:
The sizes of the hidden layers do not have anything to do with the collection of your data!
The input and output layers will be built based on the sizes of your X and y!
hidden_layer_sizes=[j,j] is actually creating 2 hidden-layers!
In the MLP, all layers are fully connected!
a call with hidden_layer_sizes=[100000, 100000] as you try to do will use ~76 gigabytes of memory (assuming 64-bit doubles) just for these weights connecting these 2 layers alone!
and this is just one connection-layer: input-h0 and h1-output are still missing
lbfgs is a completely different solver than all the others. Don't use it without some understanding of the implications! It's not default!
It's a full-batch method and therefore uses a lot more memory when sample-size is big!
Additionally, there are more internal reasons to use more memory compared to the other (first-order-) methods
Not that precise, but the docs already gave some hints: Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
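To make the remarks above concrete, here is a small sketch (with placeholder data shapes) that estimates the dominant weight matrix and uses far more modest layer sizes with the default first-order solver instead of the full-batch lbfgs:

import numpy as np
from sklearn.neural_network import MLPClassifier

def hidden_weights_bytes(h0, h1, bytes_per_weight=8):
    # Memory for just the weight matrix connecting two hidden layers (64-bit floats).
    return h0 * h1 * bytes_per_weight

print(hidden_weights_bytes(100000, 100000) / 2**30, "GiB")  # roughly 75 GiB
print(hidden_weights_bytes(1000, 1000) / 2**20, "MiB")      # roughly 8 MiB

X, y = np.random.rand(500, 20), np.random.randint(0, 2, 500)  # placeholder data
mlp = MLPClassifier(hidden_layer_sizes=[100, 100], activation='tanh',
                    solver='adam', random_state=0).fit(X, y)
print(mlp.score(X, y))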

How to handle multi-task in one process using theano for machine learning?

I have a CNN model. Requests to use this model, for example to classify a picture, come in about once a second.
I would like to collect the requests as new unsupervised data, and keep training my model.
My question is: how can I handle the training task and the classification task effectively?
I will explain why this is a problem:
Every training step takes a long time, at least several seconds, runs on the GPU, and cannot be interrupted. So, if my classification tasks use the GPU too, I cannot respond to the requests in time. I would like to run the classification tasks on the CPU, but it looks like Theano does not support two different config.device settings in one process.
Using multiple processes is not acceptable, because my memory is limited and Theano uses too much of it.
Any help or advice would be appreciated.
You could build two separate copies of the same CNN, one on the CPU and one on the GPU. I think this could be done under either the old GPU backend or the new one, but in different ways. Some ideas:
Under the old backend:
Load Theano with device=cpu. Build your inference function and compile it. Then call theano.sandbox.cuda.use('gpu'), and build a new copy of your inference function and take gradients of that one to make any training functions. Now the inference function should execute on the CPU, and the training should happen on the GPU. (I've never done this on purpose but I had it happen to me on accident!)
Under the new backend:
As far as I know, you have to tell Theano about any GPUs right when importing, not later. In this case, you could use THEANO_FLAGS="contexts=dev0->cuda0", which doesn't force using one device over another. Then build the inference version of your function like normal, and for the training version, again put all the shared variables on the GPU, and the input variables to any of your training functions should also be GPU variables (e.g. input_var_1.transfer('dev0')). When all your functions are compiled, look at the programs using theano.printing.debugprint(function) to see what's on GPU vs CPU. (When compiling the CPU functions, it might give a warning that it cannot infer the context, and as far as I've seen, that lands it on the CPU...not sure if this behavior is safe to depend on.)
In either case, this depends on your GPU-based functions NOT returning anything to the CPU (make sure the output variables are GPU ones). This should allow the training function to run concurrently with your inference function, and later you copy what you need back to the CPU. For example, when you take a training step, just copy the new values over to your inference network's parameters.
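To make the old-backend recipe concrete, here is a minimal, untested sketch with a toy softmax layer; it assumes the process is started with THEANO_FLAGS="device=cpu" so the first compile lands on the CPU, and the shapes, names, and learning rate are all illustrative:

import numpy as np
import theano
import theano.tensor as T

W0 = np.zeros((100, 10), dtype=theano.config.floatX)
b0 = np.zeros(10, dtype=theano.config.floatX)

# 1) Compile the inference function while the default device is still the CPU.
x = T.matrix('x')
W_cpu = theano.shared(W0, name='W')
b_cpu = theano.shared(b0, name='b')
predict_cpu = theano.function([x], T.nnet.softmax(T.dot(x, W_cpu) + b_cpu))

# 2) Switch to the GPU and build a second copy whose gradients drive training.
import theano.sandbox.cuda
theano.sandbox.cuda.use('gpu')

t = T.ivector('t')
W_gpu = theano.shared(W0, name='W')
b_gpu = theano.shared(b0, name='b')
y = T.nnet.softmax(T.dot(x, W_gpu) + b_gpu)
loss = T.nnet.categorical_crossentropy(y, t).mean()
updates = [(p, p - 0.01 * T.grad(loss, p)) for p in (W_gpu, b_gpu)]
train_gpu = theano.function([x, t], loss, updates=updates)

# After a training step, copy the fresh values back into the CPU copy.
W_cpu.set_value(W_gpu.get_value())
b_cpu.set_value(b_gpu.get_value())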
Let us hear what you come up with!
