Memory saving gradients or memory checkpointing in Keras - python-3.x

I recently found a github repo: https://github.com/openai/gradient-checkpointing
The main purpose is to reduce GPU memory consumption, and the usage seems pretty straightforward:
from tensorflow.python.keras._impl.keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory
How can I do the same thing with Keras installed separately, not as a part of tensorflow? The following didn't work:
from keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory
Thank you in advance

I know I am a bit late, but I recently ran into the same problem, and I was able to solve it.
The problem (I think) is that memory_saving_gradients.gradients_memory uses a heuristic approach which does not work well for many scenarios. Fortunately, there is an alternative function: memory_saving_gradients.gradients_collection, which works perfectly fine, but it requires you to specify at which points in the network the gradient must be checkpointed.
As an example of how this can be accomplished, suppose that we want to checkpoint all the Keras layers whose name contains the word 'add' (for instance, to make a resnet memory efficient). Then, you could include something like this after building your model, but before training it:
layer_names = [layer.name for layer in self.model.layers]
[tf.add_to_collection("checkpoints", self.model.get_layer(l).get_output_at(0))
 for l in [i for i in layer_names if 'add' in i]]
K.__dict__["gradients"] = memory_saving_gradients.gradients_collection
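Put together, a rough end-to-end sketch of the whole flow (assuming memory_saving_gradients.py from the linked repo is importable, and that model is your already-built Keras model) might look like this:

import tensorflow as tf
from keras import backend as K
import memory_saving_gradients  # file from the openai/gradient-checkpointing repo

# checkpoint the output of every layer whose name contains 'add'
for layer in model.layers:
    if 'add' in layer.name:
        tf.add_to_collection("checkpoints", layer.get_output_at(0))

# monkey-patch Keras so gradients are computed with the collection-based version
K.__dict__["gradients"] = memory_saving_gradients.gradients_collection

model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.fit(...) as usual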
I hope it helps!

Related

How to get the inference compute graph of the pytorch model?

I want to hand-write a framework to perform inference of a given neural network. The network is quite complicated, so to make sure my implementation is correct, I need to know how exactly the inference process is done on device.
I tried to use torchviz to visualize the network, but what I got seems to be the back propagation compute graph, which is really hard to understand.
Then I tried to convert the pytorch model to ONNX format, following the instructions, but when I tried to visualize it, it seems that the original layers of the model had been separated into very small operators.
I just want to get a result like the one shown in the attached screenshot.
How can I get this? Thanks!
Have you tried saving the model with torch.save (https://pytorch.org/tutorials/beginner/saving_loading_models.html) and opening it with Netron? The last view you showed is a view of the Netron app.
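A minimal sketch of what I mean (model being whatever nn.Module you already have):

import torch

# save the whole model object (not just the state_dict) so Netron can read the graph
torch.save(model, "model.pt")
# then open model.pt with the Netron app (or at https://netron.app)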
You can also try the package torchview, which provides several features (useful especially for large models). For instance, you can set the display depth (depth in the nested hierarchy of modules).
It is also based on forward propagation.
github repo
Disclaimer: I am the author of the package
Note: the accepted input format for the tool is a PyTorch model.
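A minimal usage sketch (the input_size below is just an example shape):

from torchview import draw_graph

# depth controls how far down the nested hierarchy of modules the graph is expanded
model_graph = draw_graph(model, input_size=(1, 3, 224, 224), depth=3)
model_graph.visual_graph  # the rendered forward-prop graph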

Keras Batch Normalization "is broken": model fails to predict. Is it _really_ broken? Is there a fix? Or specific documentation about?

Intro
I am making a classifier to recognize the presence of defects in pictures, and in the course of improving my models, I tried Batch Normalization, mainly to exploit its ability to speed up convergence.
While it gives the expected speed benefits, I also observed some strange symptoms:
validation metrics are far from good; it smells of overfitting, of course
predictions calculated at any point during training are completely wrong, particularly when images are picked from the training dataset; the corresponding metrics match (val_loss, val_acc) rather than the (loss, acc) printed during training
This failure to predict is the evidence that worries me the most. A model which does not predict the same way as during training is useless!
Searches
Googling around I found some posts that seem to be related, particularly this one (Keras BN layer is broken), which also claims the existence of a patch and of a pull request that sadly "was rejected".
This is quite convincing, in that it explains a failure mechanism that matches my observations. As far as I understand, since BN calculates and keeps moving statistics (exponential averages and standard deviations) to do its job, and these require many iterations to stabilize and become significant, it will of course behave badly when asked to make a prediction from scratch, while those statistics are not yet mature (in case I have misunderstood this concept, please tell me).
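For reference, this is how I inspect those moving statistics at any point during training (assuming bn is one of the model's BatchNormalization layers, with the default center=True, scale=True):

# gamma and beta are trainable; moving_mean and moving_variance are the running statistics
gamma, beta, moving_mean, moving_variance = bn.get_weights()
print(moving_mean[:5], moving_variance[:5])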
Actual Questions
But thinking more thoroughly, this doesn't really close the issue, and actually raises further doubts. I am still perplexed that:
This Keras BN being broken is said to affect the use case of transfer learning, while mine is a classical case of a convolutional classifier, trained starting from standard Glorot initialization. This should have been complained about by thousands of users, yet there isn't much discussion about it.
technically: if my understanding is correct, why aren't these statistics (since they are so fundamental for prediction) saved in the model, so that their latest update is available when making a prediction? It seems perfectly feasible to keep and use them at prediction time, as with any trainable parameter
management-wise: if Keras' BN were really broken, how could such a dreadful bug remain unaddressed for more than one year? Isn't there really anybody out there using BN and needing predictions out of their models? And not even anybody able to fix it?
more practically: on the contrary, if it is not a bug, but just a misunderstanding of how to use it, where do I get a clear illustration of "how to correctly get a prediction in Keras for a model which uses BN"? (demo code would be appreciated)
Obviously I would really love the right question to be the last one, but I had to include the previous ones, given the evidence of someone claiming that Keras BN is broken.
Note to SE OP: before *closing the question as too broad*, please consider that, since it is not really clear what the issue is (Keras BN being broken, or users being unable to use it properly), I had to offer several directions, among which whoever wishes to answer can choose.
Details
I am using keras 2.2.4 from a python 3.6 virtual environment (under pyenv/virtualenv).
data are fed through a classic ImageDataGenerator() + flow_from_directory() / flow_from_dataframe() scheme (augmentation is turned off though: only rescale=1./255 is applied), but I also tried to make them static
actually, in the end, for verifying the above behaviour, I generated only one dataset x,y=next(valid_generator) and used a single, fixed batch for both training and validation. While on the training side it converges (yes, the aim was exactly to let it overfit!), on the validation side both metrics are poor and predictions are completely wrong and erratic (almost random)
in this setup, if BN is turned off, val_loss and val_acc match exactly with loss and acc, and with those that I can obtain from predictions calculated after training has finished.
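In code, the verification setup from the last two points is roughly this (the model and generator names are mine):

# one fixed batch, used both for training and for validation
x, y = next(valid_generator)

model.fit(x, y, epochs=100, batch_size=len(x), validation_data=(x, y), verbose=1)

# loss/acc converge (it overfits, as intended), but val_loss/val_acc on the very
# same batch stay poor, and so do the predictions computed afterwards:
print(model.evaluate(x, y))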
Update
In the process of writing a minimal example of the issue, after battling to make the problem evident, I realized that it shows up on some machines but not on others. In particular, the problem is evident on a host running Keras 2.3.1, while another host with Keras 2.2.4 doesn't show it.
I'll post a minimal example here along with specific module versions asap.

MemoryError on joblib dump

I have the following snippet running to train a model for text classification. I optimized it quite a bit and it's running pretty smoothly; however, it still uses a lot of RAM. Our dataset is huge (13 million documents + 18 million words in the vocabulary), but the point in execution where the error is thrown is very weird, in my opinion. The script:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
import joblib
import numpy

encoder = LabelEncoder()
y = encoder.fit_transform(categories)
classes = list(range(0, len(encoder.classes_)))

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             binary=True,
                             dtype=numpy.int8)

classifier = SGDClassifier(loss='modified_huber',
                           n_jobs=-1,
                           average=True,
                           random_state=1)

tokenpath = modelpath.joinpath("tokens")
for i in range(0, len(batches)):
    token_matrix = joblib.load(
        tokenpath.joinpath("{}.pickle".format(i)))
    batchsize = len(token_matrix)
    classifier.partial_fit(
        vectorizer.transform(token_matrix),
        y[i * batchsize:(i + 1) * batchsize],
        classes=classes
    )

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))
I got the MemoryError at this line:
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
At this point in execution, training is finished and the classifier has already been dumped. It should be collected by the garbage collector in case more memory is needed. In addition, why should joblib allocate so much memory if it isn't even compressing the data?
I do not have deep knowledge of the inner workings of the Python garbage collector. Should I be forcing gc.collect() or using 'del' statements to free those objects that are no longer needed?
Update:
I have tried using the HashingVectorizer and, even though it greatly reduces memory usage, the vectorizing is way slower, making it not a very good alternative.
I have to pickle the vectorizer to use it later in the classification process, so I can generate the sparse matrix that is submitted to the classifier. Here is my classification code:
extracted_features = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(features.extractor)(d, extraction_options) for d in documents)

probabilities = classifier.predict_proba(
    vectorizer.transform(extracted_features))
predictions = category_encoder.inverse_transform(
    probabilities.argmax(axis=1))
trust = probabilities.max(axis=1)
If you are providing your custom vocabulary to the CountVectorizer, it should not be a problem to recreate it later on, during classification. As you provide a set of strings instead of a mapping, you probably want to use the parsed vocabulary, which you can access with:
parsed_vocabulary = vectorizer.vocabulary_
joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))
and then load it and use to re-create the CountVectorizer:
vectorizer = CountVectorizer(
    vocabulary=parsed_vocabulary,
    binary=True,
    dtype=numpy.int8
)
Note that you do not need to use joblib here; the standard pickle should perform the same; you might get better results using any of the available alternatives, with PyTables being worth mentioning.
If that uses too much memory as well, you should try using the original vocabulary to recreate the vectorizer; currently, when provided with a set of strings as vocabulary, vectorizers just convert the set to a sorted list, so you shouldn't need to worry about reproducibility (although I would double check that before using it in production). Or you could just convert the set to a list on your own.
To sum up: because you do not fit() the Vectorizer, the whole added value of using CountVectorizer is its transform() method; as all the data it needs is the vocabulary (and the parameters), you might reduce memory consumption by pickling just your vocabulary, either processed or not.
As you asked for answer drawing from official sources, I would like to point you to: https://github.com/scikit-learn/scikit-learn/issues/3844 where an owner and a contributor of scikit-learn mention recreating a CountVectorizer, albeit for other purposes. You may have better luck reporting your problems in the linked repo, but make sure to include a dataset which causes excessive memory usage issues to make it reproducible.
And finally you may just use HashingVectorizer as mentioned earlier in a comment.
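A minimal sketch of that last option (n_features is just an example value; the hashing trick removes the need to pickle any vocabulary at all):

import numpy
from sklearn.feature_extraction.text import HashingVectorizer

# stateless: nothing to fit, so only the parameters need to be stored/recreated
vectorizer = HashingVectorizer(n_features=2**22, binary=True, dtype=numpy.int8)
X = vectorizer.transform(extracted_features)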
PS: regarding the use of gc.collect() - I would give it a go in this case; regarding the technical details, you will find many questions on SO tackling this issue.
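A sketch of what I mean, based on your snippet (the trained classifier is released before the vectorizer is dumped):

import gc

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
del classifier   # no longer needed in memory
gc.collect()     # encourage the garbage collector to release it right away
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))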

I'm trying to implement 'multi-threading' to do both training and prediction(testing) at the same time

I'm trying to implement 'multi-threading' to do both training and prediction(testing) at the same time. And I'm gonna use the python module 'threading' as shown in https://www.tensorflow.org/api_docs/python/tf/FIFOQueue
The following are my questions.
If I use the python module 'threading', does tensorflow use more portion of gpu or more portion of cpu?
Do I have to make two graphs (neural nets which have the same topology) in tensorflow, one for prediction and the other for training? Or is it okay to make just one graph?
I'll be very grateful to anyone who can answer these questions! thanks!
If you use the Python threading module, it will only make use of the CPU; also, Python threading is not meant for run-time parallelism, so you should use multiprocessing instead.
If your model uses ops like dropout or batch_norm, whose behavior changes between training and validation, it's a good idea to create separate graphs, reusing the common variables for validation/testing (the validation graph will reuse all training variables).
Note: you can also use one graph, with additional operations which change behavior based on training/validation, as in the sketch below.
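A rough sketch of the single-graph option (assuming TF 1.x), where a boolean placeholder switches dropout/batch-norm behavior between training and prediction:

import tensorflow as tf

is_training = tf.placeholder(tf.bool, name="is_training")
x = tf.placeholder(tf.float32, [None, 784])

h = tf.layers.dense(x, 128, activation=tf.nn.relu)
h = tf.layers.dropout(h, rate=0.5, training=is_training)      # active only while training
h = tf.layers.batch_normalization(h, training=is_training)    # batch stats only while training
logits = tf.layers.dense(h, 10)

# training step:  sess.run(train_op, feed_dict={x: batch, is_training: True})
# prediction:     sess.run(logits,   feed_dict={x: batch, is_training: False})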

Custom operation implementation for RBM/DBN with tensorflow?

Since Google released TensorFlow, it has become something of a trend among current deep learning choices.
I'd like to do some experiments with RBM/DBN (Restricted Boltzmann Machine/Deep Belief Network). I've made some attempts by myself and managed to implement it reasonably well by combining the available APIs from tensorflow. See the code and my previous answer.
So, if running performance is not a concern, here is a working RBM/DBN implementation with tensorflow.
But running performance must be considered for the future. Because of the special procedure of the CD (Contrastive Divergence) algorithm, I think it just works against the framework (data flow graph) used by tensorflow. That's why my code seems weird.
So, the custom operation should be implemented for acceleration. I've followed the current documentation about adding custom ops.
REGISTER_OP("NaiveRbm")
    .Input("visible: float32")
    .Input("weights: float32")
    .Input("h_bias: float32")
    .Input("v_bias: float32")
    .Output("hidden: float32")
    .Doc(R"doc(
Naive Rbm for separate training use. DO NOT mix up with other operations
)doc");
In my design, NaiveRbm should be an operation that takes visible, weights, h_bias, v_bias as inputs, but produces its output from only the first three of them (simply sigmoid(X*W+hb)); its gradient should return at least the gradients for the last three Variables.
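For reference, a minimal sketch (in TF 1.x style) of that forward computation expressed with existing ops, which is roughly what the custom kernel would replace:

import tensorflow as tf

def naive_rbm_forward(visible, weights, h_bias):
    # hidden = sigmoid(X * W + hb), as described above
    return tf.sigmoid(tf.matmul(visible, weights) + h_bias)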
Imagine example pseudo code like this:
X = tf.placeholder()
W1, hb1, vb1 = tf.Variable()
W2, hb2, vb2 = tf.Variable()

rbm1 = NaiveRbm(X, W1, hb1, vb1)
train_op = tf.train.MomentumOptimizer(0.01, 0.5).minimize(rbm1)

rbm2 = NaiveRbm(tf.stop_gradient(rbm1), W2, hb2, vb2)
train_op2 = tf.train.MomentumOptimizer(0.01, 0.5).minimize(rbm2)

with tf.Session() as sess:
    for batch in batches:
        sess.run(train_op, feed_dict={X: batch})
    for batch in batches:
        sess.run(train_op2, feed_dict={X: batch})
But the tensorflow library is too complex for me. And after spending a lot of time searching for how to use these existing operations (sigmoid, matmul, ma_add, relu, random_uniform) inside a custom operation, I could not find a solution by myself.
So, I'd like to ask if someone could help me finish the remaining work.
PS: before getting more ideas, I'd like to dive into Theano, since it already implements RBM/DBN. In my opinion, Caffe is not really suitable for RBM/DBN because of its framework.
Update: after going through the Theano tutorials, I found that the key reason Theano has an RBM/DBN implementation while tensorflow doesn't is its scan technology. So we may have to wait for tensorflow to implement scan before an RBM/DBN implementation becomes practical.
