Incrementally transferring PyTorch weights over a limited channel - pytorch

I am looking for a way to distribute a new version of a PyTorch weights file to multiple computers doing inference. The connections are relatively slow, so reducing the amount of data transferred could bring a great improvement.
So I am wondering whether there is a way to create a "diff" between the previous and new sets of weights and transmit only that instead of sending the whole file, especially considering that only a few layers are trainable.
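One possible way to think about the "diff" idea, as a minimal sketch under some assumptions: treat each checkpoint as a mapping from parameter name to value, send only the entries that changed, and merge them into the old mapping on the receiving side. The helper names `make_delta` and `apply_delta` are hypothetical, and the toy dictionaries below only stand in for real state dicts. With PyTorch one would presumably take the mappings from `model.state_dict()`, pass `torch.equal` as the comparison, serialize the delta with `torch.save`, and hand the merged mapping to `load_state_dict`.

```python
import operator
from typing import Any, Callable, Dict, Mapping


def make_delta(old: Mapping[str, Any], new: Mapping[str, Any],
               same: Callable[[Any, Any], bool] = operator.eq) -> Dict[str, Any]:
    """Keep only the entries of `new` that are absent from `old` or differ from it.

    For tensors, pass a comparison that returns a single bool
    (e.g. torch.equal); the default `==` is only suitable for plain values.
    """
    return {name: value for name, value in new.items()
            if name not in old or not same(old[name], value)}


def apply_delta(old: Mapping[str, Any], delta: Mapping[str, Any]) -> Dict[str, Any]:
    """Rebuild the new mapping on the receiver from the old one plus the delta."""
    patched = dict(old)
    patched.update(delta)
    return patched


# Toy stand-ins for two checkpoints: the backbone is frozen, the head was retrained.
old = {"backbone.weight": [0.1, 0.2], "head.weight": [0.5, 0.6]}
new = {"backbone.weight": [0.1, 0.2], "head.weight": [0.7, 0.8]}

delta = make_delta(old, new)          # only this needs to go over the slow link
restored = apply_delta(old, delta)    # done on each inference machine
```

If only a few layers are trainable, the frozen layers compare equal and drop out of the delta, so the transfer size scales with the trained layers rather than with the whole file. This sketch does not handle parameters that were removed in the new version.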

Related

Loading a model checkpoint in lesser amount of memory

I had a question that I can't find any answers to online. I have trained a model whose checkpoint file is about 20 GB. Since I do not have enough RAM with my system (or Colaboratory/Kaggle either - the limit being 16 GB), I can't use my model for predictions.
I know that the model has to be loaded into memory for the inferencing to work. However, is there a workaround or a method that can:
Save some memory and be able to load it in 16 GB of RAM (for CPU), or the memory in the TPU/GPU
Can use any framework (since I would be working with both) TensorFlow + Keras, or PyTorch (which I am using right now)
Is such a method even possible in either of these libraries? One of my tentative solutions was to load the model in chunks, essentially maintaining a buffer for the model weights and biases and performing the calculations chunk by chunk, though I haven't found any implementations of that.
I would also like to add that I wouldn't mind the performance slowdown since it is to be expected with low-specification hardware. As long as it doesn't take more than two weeks :) I can definitely wait that long...
You can try the following:
split the model into two parts
load the weights into each part separately by calling model.load_weights(by_name=True)
call the first model with your input
call the second model with the output of the first model
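The steps above can be sketched in a framework-agnostic way. This is only an illustration of the control flow, assuming each part of the model can be built and loaded on its own: `run_in_stages` and the two toy loaders are hypothetical names, and in Keras each loader would presumably build one sub-model and load its weights by name as described above.

```python
import gc
from typing import Any, Callable, Sequence


def run_in_stages(loaders: Sequence[Callable[[], Callable[[Any], Any]]], x: Any) -> Any:
    """Load one part of the model at a time, apply it, then free it.

    Each loader returns a callable stage; only one stage is alive at any
    moment, so peak memory is roughly the largest stage, not the whole model.
    """
    for load in loaders:
        stage = load()      # only this part's weights are in memory now
        x = stage(x)        # output of this part feeds the next part
        del stage           # drop the reference before loading the next part
        gc.collect()
    return x


# Toy loaders standing in for "build sub-model and load its weights".
def load_first_half() -> Callable[[int], int]:
    return lambda v: v * 2


def load_second_half() -> Callable[[int], int]:
    return lambda v: v + 1


result = run_in_stages([load_first_half, load_second_half], 3)
```

The price is that every prediction batch reloads each part from disk, so it pays to push as many inputs as possible through a stage before moving on to the next one.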

How does pytorch's parallel method and distributed method work?

I'm not an expert in distributed systems or CUDA. But there is one really interesting feature that PyTorch supports, namely nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Module takes no keyword arguments; submodules are registered as attributes
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

# DataParallel splits the input batch across the visible GPUs
model = nn.DataParallel(Model()).cuda()
x = torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()
out = model(x).cpu()
PyTorch can split the input and send them to many GPUs and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code but it's very hard to know how the fundamentals work.
That's a great question.
PyTorch's DataParallel paradigm is actually quite simple, and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
As for DistributedDataParallel, that's trickier. It is currently the more advanced approach and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
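To make the "each replica handles a portion of the input, gradients are averaged" description concrete, here is a toy sketch in plain Python rather than PyTorch. It assumes a one-parameter model y = w * x with a mean-squared-error loss and equal-sized chunks; the function names are made up for the illustration. Under those assumptions, averaging the per-replica gradients gives the same value as computing the gradient on the full batch, which is why all replicas can apply the same update and stay in sync.

```python
def mse_gradient(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split(batch, n_replicas):
    """Chunk the batch along its first dimension, one equal chunk per replica."""
    size = len(batch) // n_replicas
    return [batch[i * size:(i + 1) * size] for i in range(n_replicas)]


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 1.0

# Each "replica" sees only its own chunk and computes a local gradient.
local_grads = [mse_gradient(w, chunk) for chunk in split(batch, 2)]

# Averaging the local gradients stands in for the synchronization step.
averaged = sum(local_grads) / len(local_grads)

# Reference: the gradient a single device would compute on the whole batch.
full = mse_gradient(w, batch)
```

With unequal chunk sizes a plain average would no longer match the full-batch gradient, so this equivalence depends on the equal-split assumption.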
There are several approaches to averaging the gradients from each node. I would recommend this paper to get a real sense of how things work. Generally speaking, there is a trade-off in transferring data from one GPU to another, in terms of bandwidth and speed, and we want that part to be really efficient. One possible approach is to connect the GPUs pairwise in a ring over a really fast link and pass only part of the gradients from one to the next, so that in total we transfer less data, more efficiently, and all the nodes end up with all the gradients (or at least their average). There will still be a master GPU in that situation, or at least a master process, but now there is no bottleneck on any single GPU: they all share the same amount of data (up to...).
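The ring idea can be simulated in a few lines. This is a sketch of a generic ring all-reduce, not a claim about what PyTorch actually does internally; to keep it short it assumes each node's gradient vector has exactly one element per node, so that one element plays the role of one chunk. The function name `ring_allreduce` is made up for the example.

```python
def ring_allreduce(vectors):
    """Simulate a ring all-reduce (sum) over n nodes with n chunks each.

    Phase 1 (reduce-scatter): after n - 1 steps, node i holds the full sum
    of chunk (i + 1) % n. Phase 2 (all-gather): the finished chunks travel
    around the ring until every node has every sum.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]
    assert all(len(v) == n for v in data), "sketch assumes one chunk per node"

    for step in range(n - 1):
        # Snapshot what every node sends before anyone receives.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] += value       # neighbor accumulates

    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] = value        # neighbor overwrites with the final sum

    return data


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(grads)
```

Each node sends one chunk per step for 2 * (n - 1) steps, so the traffic per node stays roughly constant as nodes are added, instead of one device having to receive everything. Dividing the result by the number of nodes would turn the sum into the average.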
Now this can be further optimized if we don't wait for all the batches to finish computing and instead do a kind of time-sharing, where each node sends its portion as soon as it is ready. Don't hold me to the details, but it turns out that if we don't wait for everything to finish, and do the averaging as soon as we can, it may also speed up the gradient averaging.
Please refer to literature for more information about that area as it is still developing (as of today).
PS 1: Distributed training setups like these usually work better on machines built for the task, e.g. AWS deep learning instances that implement those protocols in hardware.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow PyTorch best practices without trying to outsmart them. I recommend you do the same unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
Approach to ml parallelism with Pytorch
DataParallel & DistributedDataParallel
Model parallel https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See Will switching GPU device affect the gradient in PyTorch back propagation?

Pruning in Keras

I'm trying to design a neural network using Keras with priority on prediction performance, and I cannot get sufficiently high accuracy if I further reduce the number of layers and nodes per layer. I have noticed that a very large portion of my weights (>95%) are effectively zero. Is there a way to prune dense layers in the hope of reducing prediction time?
Not a dedicated way :(
There's currently no easy (dedicated) way of doing this with Keras.
A discussion is ongoing at https://groups.google.com/forum/#!topic/keras-users/oEecCWayJrM.
You may also be interested in this paper: https://arxiv.org/pdf/1608.04493v1.pdf.
Take a look at Keras Surgeon:
https://github.com/BenWhetton/keras-surgeon
I have not tried it myself, but the documentation claims that it has functions to remove or insert nodes.
Also, after looking at some papers on pruning, it seems that many researchers create a new model with fewer channels (or fewer layers), and then copy the weights from the original model to the new model.
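A rough sketch of that "smaller model, copied weights" idea for a single hidden dense layer, in plain Python lists rather than Keras. It assumes a hidden unit is removable when all of its outgoing weights are (near) zero, since such a unit cannot influence the output; `prune_hidden_units`, `forward`, and the tolerance are hypothetical choices for the illustration, and copying the surviving weights into a real Keras model is left out.

```python
def forward(x, w_in, b, w_out):
    """Dense -> ReLU -> Dense (no output bias). w_in is [inputs][hidden], w_out is [hidden][outputs]."""
    hidden = [max(0.0, sum(x[i] * w_in[i][j] for i in range(len(x))) + b[j])
              for j in range(len(b))]
    return [sum(hidden[j] * w_out[j][k] for j in range(len(hidden)))
            for k in range(len(w_out[0]))]


def prune_hidden_units(w_in, b, w_out, tol=1e-6):
    """Drop hidden units whose outgoing weights are all within `tol` of zero."""
    keep = [j for j, row in enumerate(w_out) if any(abs(v) > tol for v in row)]
    new_w_in = [[row[j] for j in keep] for row in w_in]   # drop matching columns
    new_b = [b[j] for j in keep]
    new_w_out = [w_out[j] for j in keep]                  # drop matching rows
    return new_w_in, new_b, new_w_out


w_in = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
b = [0.0, 1.0, -2.0]
w_out = [[1.0], [0.0], [2.0]]          # the middle hidden unit has no effect on the output
x = [1.0, 2.0]

small_w_in, small_b, small_w_out = prune_hidden_units(w_in, b, w_out)
before = forward(x, w_in, b, w_out)
after = forward(x, small_w_in, small_b, small_w_out)
```

Note that this only shrinks the layer when whole units are dead. Scattered individual zeros inside a dense matrix do not make the matrix multiplication smaller by themselves.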
See this dedicated tooling for tf.keras. https://www.tensorflow.org/model_optimization/guide/pruning
As the overview suggests, support for latency improvements is a work in progress.
Edit: Keras -> tf.keras based on LucG's suggestion.
If you set an individual weight to zero, won't that prevent it from being updated during back propagation? Shouldn't that weight remain zero from one epoch to the next? That's why you set the initial weights to nonzero values before training. If you want to "remove" an entire node, just set all of the weights on that node's output to zero, and that will prevent that node from having any effect on the output throughout training.

Incremental training of ALS model

I'm trying to find out if it is possible to have "incremental training" on data using MLlib in Apache Spark.
My platform is Prediction IO, and it's basically a wrapper for Spark (MLlib), HBase, ElasticSearch and some other Restful parts.
In my app data "events" are inserted in real-time, but to get updated prediction results I need to "pio train" and "pio deploy". This takes some time and the server goes offline during the redeploy.
I'm trying to figure out if I can do incremental training during the "predict" phase, but cannot find an answer.
I imagine you are using Spark MLlib's ALS model, which performs matrix factorization. The result of the model is two matrices: a user-features matrix and an item-features matrix.
Assuming we receive a stream of data with ratings (or transactions, in the implicit case), a real (100%) online update of this model would update both matrices for each new rating by triggering a full retrain of the ALS model on the entire data plus the new rating. In this scenario one is limited by the fact that running the entire ALS model is computationally expensive and the incoming stream of data could be frequent, so it would trigger a full retrain too often.
Knowing this, we can look for alternatives. A single rating should not change the matrices much, and there are optimization approaches that are incremental, for example SGD. There is an interesting (still experimental) library written for the case of explicit ratings which does incremental updates for each batch of a DStream:
https://github.com/brkyvz/streaming-matrix-factorization
The idea behind an incremental approach such as SGD is that, as long as each step moves along the negative gradient (this is a minimization problem) with a small enough step size, it moves towards a minimum of the error function. So even if we apply the update for a single new rating only to the row of the user-feature matrix for that specific user and the row of the item-feature matrix for that specific item, the update still follows the gradient, so we keep moving towards the minimum; as an approximation, of course, but still towards the minimum.
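As an illustration only, a single-rating SGD step for a matrix factorization model could look like the sketch below. It is written in plain Python rather than Spark, the function names and the learning rate and regularization values are arbitrary assumptions, and it is not taken from the library linked above.

```python
def predict(user_vec, item_vec):
    """Predicted rating is the dot product of the two feature vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def sgd_update(user_vec, item_vec, rating, lr=0.05, reg=0.01):
    """One SGD step on the regularized squared error for a single rating.

    Only the feature vector of this user and of this item are touched;
    the rest of both matrices stays as it was.
    """
    err = rating - predict(user_vec, item_vec)
    new_user = [u + lr * (err * v - reg * u) for u, v in zip(user_vec, item_vec)]
    new_item = [v + lr * (err * u - reg * v) for u, v in zip(user_vec, item_vec)]
    return new_user, new_item


user = [0.1, 0.2]      # row of the user-features matrix for this user
item = [0.3, 0.4]      # row of the item-features matrix for this item
rating = 5.0           # the newly observed rating

error_before = abs(rating - predict(user, item))
user, item = sgd_update(user, item, rating)
error_after = abs(rating - predict(user, item))
```

One step only nudges the prediction towards the new rating. Too large a learning rate can overshoot, which is part of why this is an approximation rather than a replacement for a periodic full retrain.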
The other problem comes from Spark itself and its distributed nature. Ideally the updates should be done sequentially, one for each new incoming rating, but Spark treats the incoming stream as a batch, which is distributed as an RDD, so the update operations are applied to the entire batch with no guarantee of sequentiality.
In more detail: if you are using Prediction.IO, for example, you could do offline training with the regular built-in train and deploy functions. If you want online updates, however, you will have to access both matrices for each batch of the stream, run the updates using SGD, and then ask for the new model to be deployed. This functionality is of course not in Prediction.IO; you would have to build it on your own.
Interesting notes for SGD updates:
http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf
To update your model near-online (I write "near" because, let's face it, a true online update is impossible), you can use the fold-in technique, e.g.:
Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems.
Or you can look at the code of:
MyMediaLite
Oryx - a framework built with the Lambda Architecture paradigm. It should support updates with fold-in of new users/items.
This is part of my answer to a similar question, where both problems (near-online training and handling new users/items) were mixed.

Multi threaded AForge.NET training

I am using AForge.NET ANN and training it on my training set. Because the training is single threaded and the process can take ages, I wondered if it's possible to run a multi threaded training.
Because it is a problem to use threads while training a Resilient Backpropagation network, I thought about splitting my training set between different networks and, once every N epochs, combining the weights of all networks into one, then duplicating the result to all threads (so the next epoch starts with the new weights).
I can't seem to find a method in the AForge.NET that combines two (or more) networks. Looking for some help on how to get started with the implementation process.
Combining the neural networks every N number of iterations won't work really well. It can be very tricky to just take the weights and combine them. In some ways this is how the crossover operation of a Genetic Algorithm works.
Really the only way you are going to be able to do this is modify AForge's training to support multiple threads. Basically to do this you need to map the gradient calculation and then do a reduce-sum on the gradients. Then use the reduced gradients to update the network.
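The map and reduce-sum idea can be sketched as follows. This is a Python illustration of the structure only, not AForge.NET or Encog code: it assumes a tiny linear model y = w0 + w1 * x with a squared-error loss, uses a plain gradient step instead of the RPROP update rule, and all function names are made up. In CPython the threads will not speed up pure-Python arithmetic, so treat it as a picture of the data flow rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(weights, shard):
    """Map step: summed squared-error gradient of y = w0 + w1 * x over one shard."""
    w0, w1 = weights
    g0 = g1 = 0.0
    for x, y in shard:
        err = (w0 + w1 * x) - y
        g0 += 2.0 * err
        g1 += 2.0 * err * x
    return [g0, g1]


def reduce_sum(partials):
    """Reduce step: add the per-shard gradients component by component."""
    return [sum(component) for component in zip(*partials)]


def parallel_gradient(weights, shards):
    """Compute one gradient per shard on worker threads, then sum them."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard_gradient(weights, shard), shards))
    return reduce_sum(partials)


data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]   # samples of y = 1 + 2x
shards = [data[:2], data[2:]]
weights = [0.0, 0.0]

grad = parallel_gradient(weights, shards)
serial = shard_gradient(weights, data)      # same gradient computed on one thread

# A single shared network is updated with the reduced gradient.
weights = [w - 0.01 * g for w, g in zip(weights, grad)]
```

Because the gradients are summed before the update, every thread works on the same weights, which avoids having to merge separately trained networks.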
I've implemented this exact thing in the Encog Framework. It supports multi-threaded training (RPROP) and has a C# version: http://www.heatonresearch.com/encog.
