Keras use multi-gpu without Model object (not for training) - python-3.x

I have a set of tensor operations (matmul, transpose, etc.) that I would like to run on a large dataset.
Since these are all matrix operations, and since I am already using Keras generators to load the data in batches, it would make sense to compute them on GPUs.
I have searched for a while and cannot find the correct way to use Keras to run parallel GPU operations, fed by generators, outside the standard Model object interface.
Does anyone know how to do it? Thanks!

Related

Parallelization of sklearn functions using MPI without cross-validation

I have a group of time series to which I want to apply a LASSO regression using sklearn. Because the datasets are quite sparse, I need the whole length of each time series, so I can't cross-validate. The datasets are big and training is time-consuming, so I have to run it on a cluster.
To use multiple nodes I use MPI. As far as I know, it is possible to run sklearn functions on a cluster using MPI, but this basically works by distributing cross-validation chunks, as in the following project:
https://github.com/sebp/scikit-learn-mpi-grid-search
Is there any other way to use MPI to parallelize training in sklearn without cross-validation? I think this would mean that the underlying algorithm of the sklearn function itself would have to be parallelized.

Using Kernel K-Means in Scikit

I am working with a very large dataset (1.5 million rows) and thought about using an SVR.
Since there is so much data, I thought about switching to a linear SVM and using the Nystroem
method to approximate a kernel from uniformly sampled data.
However, I would rather construct the kernel via Kernel K-Means, but I have not found an official
implementation yet.
This link provides an unofficial implementation, but the resulting serialized model is very large.
https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KernelKMeans.html
Maybe someone has a clue where to look for this, or how to implement it in code for an arbitrary dataset?

How does pytorch's parallel method and distributed method work?

I'm not an expert in distributed systems or CUDA. But PyTorch supports one really interesting feature: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they split up shared embeddings and synchronize data?
Here is a basic example of DataParallel.
import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Submodules are registered by assigning them as attributes.
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

# DataParallel splits the input along the batch dimension across the visible GPUs.
model = nn.DataParallel(Model()).cuda()
x = torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()
out = model(x).cpu()
PyTorch can split the input, send the pieces to many GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code, but it's very hard to work out how the fundamentals work.
That's a great question.
PyTorch's DataParallel paradigm is actually quite simple, and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
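The scatter, replicate, gather pattern in that description can be sketched without any framework. This is only a toy illustration, not PyTorch's implementation: the one-parameter model y = w * x, the helper names (scatter, forward, grad_sum_squared_error), and the plain lists standing in for devices are all assumptions made for the example.

```python
# Toy sketch of the DataParallel pattern using plain Python lists.
# The model, loss, and helper names are hypothetical.

def scatter(batch, num_devices):
    """Split a batch into contiguous chunks along the batch dimension."""
    chunk = -(-len(batch) // num_devices)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

def forward(w, xs):
    """Hypothetical one-parameter model: y = w * x."""
    return [w * x for x in xs]

def grad_sum_squared_error(w, xs, ts):
    """d/dw of sum((w * x - t)^2) over one chunk."""
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts))

w = 0.5
inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

# Forward pass: every "device" gets a copy of w and its own chunk of the batch.
x_chunks = scatter(inputs, 2)
t_chunks = scatter(targets, 2)
replica_outputs = [forward(w, xs) for xs in x_chunks]

# The per-replica outputs are gathered back into one batch.
gathered = [y for out in replica_outputs for y in out]

# Backward pass: per-replica gradients are summed into the original parameter's gradient.
replica_grads = [grad_sum_squared_error(w, xs, ts)
                 for xs, ts in zip(x_chunks, t_chunks)]
total_grad = sum(replica_grads)

# For comparison: the gradient computed on the whole batch at once.
full_grad = grad_sum_squared_error(w, inputs, targets)
print(gathered, total_grad, full_grad)
```

With a loss that is a sum over samples, the summed replica gradients match the gradient of the full batch, which is why the result is the same as on a single device.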
As for DistributedDataParallel, that's trickier. It is currently the more advanced approach, and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
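A small framework-free sketch can show why averaging keeps the replicas consistent. It assumes a toy model y = w * x with a mean-squared-error loss and equally sized shards; WORLD_SIZE, LR, and grad_mse are hypothetical names, and the plain average stands in for whatever collective operation a real backend would use.

```python
# Toy sketch: each "process" holds its own copy of w and its own data shard.
# After every backward pass the local gradients are averaged, so all copies
# apply the same update and stay identical.

def grad_mse(w, xs, ts):
    """d/dw of mean((w * x - t)^2) for the toy model y = w * x."""
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)

WORLD_SIZE = 3   # hypothetical number of processes
LR = 0.01        # hypothetical learning rate

inputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
targets = [2.0 * x for x in inputs]

# Each rank sees a disjoint, equally sized shard of the data.
shards = [(inputs[r::WORLD_SIZE], targets[r::WORLD_SIZE])
          for r in range(WORLD_SIZE)]

replica_w = [0.5] * WORLD_SIZE   # all replicas start from the same weights
reference_w = 0.5                # single-process training on the full batch

for _ in range(5):
    local_grads = [grad_mse(w, xs, ts) for w, (xs, ts) in zip(replica_w, shards)]
    averaged = sum(local_grads) / WORLD_SIZE   # stands in for the gradient averaging step
    replica_w = [w - LR * averaged for w in replica_w]
    reference_w -= LR * grad_mse(reference_w, inputs, targets)

print(replica_w, reference_w)
```

Because the shards have equal size, the average of the per-shard mean gradients equals the mean gradient over the whole batch, so the replicas follow the same trajectory as single-process training up to floating-point rounding.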
There are several approaches to averaging the gradients from each node. I would recommend this paper to get a real sense of how things work. Generally speaking, transferring data from one GPU to another is constrained by bandwidth and speed, and we want that part to be really efficient. One possible approach is to connect each pair of neighboring GPUs with a really fast protocol in a ring and to pass only part of the gradients from one to the next, such that in total we transfer less data, more efficiently, and all the nodes still end up with all the gradients (or at least their average). There will still be a master GPU in that situation, or at least a master process, but now there is no bottleneck on any single GPU: they all exchange the same amount of data (up to...).
This can be further optimized if we don't wait for all the batches to finish computing and instead use a kind of time-sharing scheme in which each node sends its portion as soon as it is ready. Don't quote me on the details, but it turns out that if we don't wait for everything to finish, and do the averaging as soon as we can, it might also speed up the gradient averaging.
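The ring-style exchange described above is usually called ring all-reduce. The simulation below is a minimal sketch of that idea in plain Python, with lists standing in for GPUs. The function name ring_allreduce and the chunk layout are assumptions for illustration, and nothing here claims to match what PyTorch's backends actually do.

```python
# Minimal simulation of ring all-reduce (sum) over N workers.
# Each worker's gradient vector is cut into N chunks; in every step a worker
# sends exactly one chunk to its right-hand neighbor.

def ring_allreduce(vectors):
    n = len(vectors)
    length = len(vectors[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    size = length // n
    # chunks[r][c] is worker r's copy of chunk c
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in vectors]

    # Phase 1, reduce-scatter: after n - 1 steps worker r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so every send in a step uses pre-step data.
        messages = [((r + 1) % n, (r - step) % n, list(chunks[r][(r - step) % n]))
                    for r in range(n)]
        for dst, c, payload in messages:
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2, all-gather: the completed chunks travel once more around the ring.
    for step in range(n - 1):
        messages = [((r + 1) % n, (r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n]))
                    for r in range(n)]
        for dst, c, payload in messages:
            chunks[dst][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]

grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
summed = ring_allreduce(grads)
averaged = [[x / len(grads) for x in worker] for worker in summed]
print(summed[0], averaged[0])
```

In this scheme every worker sends 2 * (n - 1) chunks of size length / n, so the traffic per worker stays roughly constant as workers are added, and no single worker has to receive every gradient in full.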
Please refer to the literature for more information about this area, as it is still developing (as of today).
PS 1: Distributed training usually works better on machines that are set up for that task, e.g. AWS deep learning instances that implement those protocols in hardware.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow PyTorch best practices without trying to outsmart them. I recommend you do the same unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
Approach to ml parallelism with Pytorch
DataParallel & DistributedDataParallel
Model parallel https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See Will switching GPU device affect the gradient in PyTorch back propagation?

Training one model with several GPUs

How can you program Keras or TensorFlow to partition training across multiple GPUs? Let's say you are on an Amazon EC2 instance that has 8 GPUs and you want to use all of them to train faster, but your code is written for just a single CPU or GPU.
Yes, you can run Keras models on multiple GPUs. This is only possible with the TensorFlow backend for the time being, because the Theano feature is still rather new. We are looking at adding support for multi-GPU in Theano in the near future (it should be fairly straightforward).
With the TensorFlow backend, you can achieve this the same way as you would in pure TensorFlow: by using the with tf.device(d) scope when defining Keras layers.
Originally from here

I'm trying to implement 'multi-threading' to do both training and prediction(testing) at the same time

I'm trying to implement 'multi-threading' to do both training and prediction (testing) at the same time, and I'm going to use the Python module 'threading' as shown in https://www.tensorflow.org/api_docs/python/tf/FIFOQueue
My questions are as follows.
If I use the Python module 'threading', does TensorFlow use a larger share of the GPU or a larger share of the CPU?
Do I have to make two graphs (neural nets with the same topology) in TensorFlow, one for prediction and the other for training? Or is it okay to make just one graph?
I'll be very grateful to anyone who can answer these questions! Thanks!
If you use the Python threading module, it will only make use of the CPU. Also, Python threading does not provide run-time parallelism; you should use multiprocessing for that.
If your model uses ops such as dropout or batch_norm, which behave differently during training and validation, it's a good idea to create separate graphs, with the validation/testing graph reusing all of the training variables.
Note: you can also use a single graph, with additional operations that change behavior depending on training/validation.
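The single-graph option can be pictured as one forward function that takes a training flag and shares the same weights in both modes. The sketch below uses plain Python rather than TensorFlow; the dropout and forward helpers, the rate of 0.5, and the toy weights are all assumptions for illustration.

```python
import random

# One forward function, one set of weights, and a flag that switches
# the behavior of dropout between training and inference.

def dropout(values, rate, training, rng):
    """Inverted dropout: active only in training mode, identity otherwise."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    # Kept activations are scaled up so the expected value is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def forward(weights, inputs, training, rng=None):
    """Hypothetical one-layer model whose behavior depends on the flag."""
    if rng is None:
        rng = random.Random(0)
    hidden = [w * x for w, x in zip(weights, inputs)]
    return dropout(hidden, rate=0.5, training=training, rng=rng)

weights = [0.5, -1.0, 2.0]   # shared by both modes
inputs = [1.0, 2.0, 3.0]

train_out = forward(weights, inputs, training=True, rng=random.Random(42))
eval_out = forward(weights, inputs, training=False)
print(train_out, eval_out)
```

In training mode each activation is either dropped or rescaled, while in inference mode the same weights produce a deterministic output.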
