I am working with Python 3 and TensorFlow to generate image patches using numpy's view_as_windows, but since numpy can't run on the GPU, is there any way to do it with TensorFlow?
ex: view_as_windows(array2d, window_shape, stride)
Thanks
Note: This answer does not answer the OP's exact question, but addresses the actual need of the OP as clarified in the comments (i.e., generate image patches, quickly). I just thought this would fit better here than in a badly-formatted comment.
If all you need to do is generate image patches, TensorFlow (and GPU acceleration in general) is not the right tool for this, because the actual computation is trivial (extracting a sub-area of an image) and the bottleneck would be the memory transfer between GPU and CPU.
My suggestion, then, is to write CPU-only code that uses view_as_windows and to parallelize it via multiprocessing, splitting the workload across all your CPU cores.
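For illustration, here is a minimal sketch of that approach; the 512x512 images, the 64x64 window, the stride of 32, and the extract_patches helper name are all placeholders, not anything from the original question:

import numpy as np
from multiprocessing import Pool
from skimage.util import view_as_windows

def extract_patches(image, window_shape=(64, 64), stride=32):
    # view_as_windows returns a strided view; reshape + copy gives a flat array of patches
    patches = view_as_windows(image, window_shape, step=stride)
    return patches.reshape(-1, *window_shape).copy()

if __name__ == "__main__":
    images = [np.random.rand(512, 512) for _ in range(100)]  # placeholder images
    with Pool() as pool:  # one worker process per CPU core by default
        all_patches = pool.map(extract_patches, images)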
Should you need to feed those patches to a TensorFlow graph afterwards, the way to go would be to first generate the patches on the CPU (with whatever input pipeline you like), batch them, and then feed them to the GPU for the graph computation.
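As a rough sketch of that hand-off, tf.data can wrap a Python generator that yields the CPU-produced patches and handle batching and prefetching; the shapes, batch size, and the trivial crop standing in for real patch extraction below are placeholders:

import numpy as np
import tensorflow as tf

images = [np.random.rand(512, 512).astype("float32") for _ in range(10)]  # placeholder images

def patch_generator():
    # Plug in your CPU-side patch extraction here (e.g. the view_as_windows helper above);
    # a fixed 64x64 crop stands in for it in this sketch.
    for image in images:
        yield image[:64, :64]

dataset = (tf.data.Dataset
           .from_generator(patch_generator,
                           output_signature=tf.TensorSpec(shape=(64, 64), dtype=tf.float32))
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))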
Related
I am using PyTorch to compute a 3D convolution via the FFT.
The current code does an FFT first, then a pointwise multiplication in Fourier space, and finally an inverse FFT. The code works but seems to be very slow compared to an optimized C++ CUDA implementation (roughly a factor of 5 slower).
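For reference, the pipeline described above presumably looks roughly like the following sketch (fft_conv3d is just an illustrative name, not the actual code in question):

import torch

def fft_conv3d(signal, kernel):
    # Zero-pad both tensors to the full linear-convolution size,
    # transform, multiply pointwise in Fourier space, transform back.
    size = [s + k - 1 for s, k in zip(signal.shape[-3:], kernel.shape[-3:])]
    f_signal = torch.fft.rfftn(signal, s=size, dim=(-3, -2, -1))
    f_kernel = torch.fft.rfftn(kernel, s=size, dim=(-3, -2, -1))
    f_out = f_signal * f_kernel  # pointwise multiplication
    return torch.fft.irfftn(f_out, s=size, dim=(-3, -2, -1))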
I think the main problem is that each PyTorch operation is handled by a separate CUDA kernel. Is it somehow possible to merge these operations?
I heard about cuFFTDx. Can this somehow be used from python?
Or does cupy help?
Thanks for any hint.
best wishes
Florian
I want to use MLPRegressor from sklearn with all 12 cores available to me; however, I do not see any option to select the number of cores (such as n_jobs, which RandomForestClassifier has).
Is there another way to make sure it uses all 12 cores? I have vaguely heard of joblib, but how would I use it correctly?
MLPRegressor does not contain any multithreading per se, though the matrix operations will be vectorized and parallelized via numpy.
You may be able to get better performance by varying your batch size, but if performance is critical you should use a deep learning library like TensorFlow.
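For reference, a minimal sketch of the batch-size knob; the data and all parameter values below are placeholders:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))  # placeholder data
y = X @ rng.normal(size=20)

# Larger batches mean fewer, bigger matrix multiplications, which the
# underlying BLAS can parallelize more effectively across cores.
model = MLPRegressor(hidden_layer_sizes=(64, 64), batch_size=512, max_iter=50)
model.fit(X, y)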
The docs (see also this) for autocast in PyTorch only discuss training. Does it speed things up if I also use autocast for inference?
Yes, it can (though it may not in some cases).
You are processing data with lower precision (e.g. float16 vs. float32), so your program has to read and process less data.
This can help with cache locality and with hardware that is specialized for lower precision (e.g. tensor cores on CUDA devices).
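A minimal sketch of autocast at inference time (the model and input shapes are placeholders; on older PyTorch versions torch.cuda.amp.autocast() plays the same role):

import torch

model = torch.nn.Linear(128, 10).cuda().eval()  # placeholder model
x = torch.randn(32, 128, device="cuda")

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(x)  # ops inside the block run in float16 where autocast deems it safe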
I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to variational GP. This github issue explains how to execute multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation for a cluster or the Cloud?
In terms of computation, GPflow can do whatever TensorFlow does. In other words, if TensorFlow supported cloud evaluation, GPflow would support it as well. That doesn't mean you cannot implement your own version of the TensorFlow computation, perhaps more efficiently, and run it on the cloud. You could start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
The linear-algebra operations involved, like the Cholesky decomposition, are hard to parallelize, and the time savings would be questionable, although memory-wise the advantage of cluster computing is obvious.
If you're interested in MVM-based inference we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic Lanczos quadrature for the logdet, and preconditioned CG for the solve, but so far have not committed those into TFP.
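To make the "MVM-based" idea concrete, here is a generic sketch (not the GPflow/TFP code linked above) of a conjugate-gradient solve that only ever touches the kernel matrix through matrix-vector products, which is what makes it amenable to batching or distributing the matvec across workers:

import numpy as np

def cg_solve(matvec, b, tol=1e-6, max_iter=1000):
    # Solve A x = b for symmetric positive-definite A, given only the map x -> A @ x.
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

Here matvec could, for example, compute the kernel-matrix product in chunks, so the full covariance matrix never has to be materialized or inverted on a single core.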
I'm not an expert in distributed systems and CUDA, but there is one really interesting feature that PyTorch supports, namely nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

model = nn.DataParallel(Model()).cuda()
output = model(torch.tensor([1, 2, 3, 4, 5, 6], dtype=torch.int64).cuda()).cpu()
PyTorch can split the input, send the chunks to many GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch's code, but it's very hard to figure out how the fundamentals work.
That's a great question.
The PyTorch DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension (other objects will be copied once per device). In the forward
pass, the module is replicated on each device, and each replica handles a
portion of the input. During the backwards pass, gradients from each replica
are summed into the original module.
As for DistributedDataParallel, that's more tricky. This is currently the more advanced approach, and it is quite efficient (see here).
This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the batch
dimension. The module is replicated on each machine and each device, and
each such replica handles a portion of the input. During the backwards
pass, gradients from each node are averaged.
There are several approaches to averaging the gradients from each node. I would recommend this paper to get a real sense of how things work. Generally speaking, there is a trade-off in transferring data from one GPU to another, regarding bandwidth and speed, and we want that part to be really efficient. So one possible approach is to connect each pair of GPUs with a really fast link in a ring, and to pass only part of the gradients from one to another, such that in total we transfer less data, more efficiently, and all the nodes end up with all the gradients (or at least their average); this is essentially the ring all-reduce scheme. There will still be a master GPU, or at least a master process, in that situation, but now no single GPU is a bottleneck; they all exchange the same amount of data (up to...).
Now this can be further optimized if we don't wait for all of the backward computation to finish before starting the exchange, and instead have each node send its portion of the gradients as soon as it is ready. Don't quote me on the details, but it turns out that if we don't wait for everything to end and do the averaging as soon as we can, it can also speed up the gradient averaging.
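To make the gradient-averaging step concrete, here is a hand-rolled sketch of what a synchronous average over workers amounts to; the real DistributedDataParallel handles this for you and overlaps the communication with the backward pass in buckets rather than looping over parameters like this:

import torch.distributed as dist

def average_gradients(model):
    # Assumes dist.init_process_group(...) has already been called on every worker.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum grads across workers
            param.grad /= world_size                           # turn the sum into a mean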
Please refer to the literature for more information, as this area is still developing (as of today).
PS 1: Usually such distributed training works better on machines that are set up for that task, e.g. AWS deep learning instances that implement those protocols in hardware.
PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow the PyTorch best practices without trying to outsmart them. I recommend you do the same unless you are really into researching this area.
References:
[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective
[2] Approach to ml parallelism with Pytorch
[3] DataParallel & DistributedDataParallel
[4] Model parallel: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
See Will switching GPU device affect the gradient in PyTorch back propagation?