Combine multiple FFT operations on the GPU within Python - PyTorch

I am using PyTorch to compute a 3D convolution via the FFT.
The current code performs an FFT first, then a pointwise multiplication in Fourier space, and finally an inverse FFT. The code works but is very slow compared to an optimized C++ CUDA implementation (roughly a factor of 5 slower).
I think the main problem is that each PyTorch operation is handled by a separate CUDA kernel. Is it somehow possible to merge (fuse) these operations?
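For concreteness, here is a minimal sketch of the pattern in question (assuming real-valued input, the torch.fft module, and a precomputed kernel spectrum kernel_ft; the actual code's shapes and normalization may differ):

    import torch

    def fft_conv3d(signal, kernel_ft):
        # FFT -> pointwise multiply in Fourier space -> inverse FFT;
        # each call below is launched as its own CUDA kernel.
        signal_ft = torch.fft.rfftn(signal, dim=(-3, -2, -1))
        out_ft = signal_ft * kernel_ft
        return torch.fft.irfftn(out_ft, s=signal.shape[-3:], dim=(-3, -2, -1))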
I have heard about cuFFTDx. Can it somehow be used from Python?
Or does CuPy help?
Thanks for any hint.
Best wishes
Florian

Related

Using Kernel K-Means in Scikit

I am working with a very large dataset (1.5 million rows) and thought about using an SVR.
Since there is so much data, I thought about switching to a linear SVM and using the Nystroem method to build a kernel approximation from uniformly sampled data.
However, I would rather construct the kernel via Kernel K-Means, but I have not found an official implementation yet.
This link provides an unofficial implementation, but it results in a very large model once it is serialized:
https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KernelKMeans.html
Maybe someone has a clue where to look for this, or how to implement it in code for an arbitrary dataset?
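For the Nystroem route mentioned above, a minimal scikit-learn sketch might look like this (the RBF kernel, gamma, n_components and the random data are placeholders, not recommendations):

    import numpy as np
    from sklearn.kernel_approximation import Nystroem
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVR

    # Stand-in data; replace with the real 1.5M-row dataset.
    X = np.random.rand(10000, 20)
    y = np.random.rand(10000)

    # Approximate an RBF kernel on a sampled subset of the data,
    # then fit a linear SVR in the resulting feature space.
    model = make_pipeline(
        Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
        LinearSVR(C=1.0, max_iter=10000),
    )
    model.fit(X, y)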

Do ReLU1 in PyTorch

I want to use the ReLU1 non-linear activation. ReLU1 is linear in [0, 1] but clamps values less than 0 to 0 and values greater than 1 to 1.
It will be used only for the last layer of my deep net in PyTorch, which has a very high-resolution output of 2048x4096. Since the code has to be highly optimized in terms of speed and memory, I do not know which of the following will be the best implementation.
Following are the two implementations I can think of for a tensor x:
x.clamp_(min=0.0, max=1.0)
For this one I am unable to find the source code referenced in its docs, so I do not know whether it is the best choice. I would prefer an in-place operation, since backpropagation can still happen through it.
The second alternative I have is to use torch.nn.functional.hardtanh_(x, min_val=0.0, max_val=1.0). This is definitely an in-place function, and the source shows that it calls the C++ implementation torch._C._nn.hardtanh(input, min_val, max_val), so I think it will be fast.
Please suggest which is the most efficient implementation, and another one if possible.
Thank you
Without trying it, my guess is that clamp and hardtanh will have the same speed, and it will be hard to do this operation any faster if you optimize it in isolation. The arithmetic is trivial so this operation will be bottlenecked by GPU memory bandwidth. To run faster, you'd want to fuse this operation with the operation that produced x. If you don't want to write a custom kernel for the combined operation, you can try using TorchScript.
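As a hedged sketch of what such fusion could look like with TorchScript (assuming x comes from some pointwise op such as a bias add; whether the fuser actually merges the kernels depends on the PyTorch version and device):

    import torch

    @torch.jit.script
    def biased_relu1(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
        # Pointwise ops inside a scripted function are candidates for kernel fusion,
        # so the add and the clamp may end up running as a single CUDA kernel.
        return torch.clamp(x + bias, 0.0, 1.0)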

Parallelizing GPflow 2.0 GP regression for large datasets

I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to a variational GP. This GitHub issue explains how to do multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation across a cluster or in the cloud?
In terms of computation, GPflow can do whatever TensorFlow can do. In other words, if TensorFlow supported distributed cloud evaluation, GPflow would support it as well. That said, nothing stops you from implementing your own version of the computation, possibly more efficient, and running it on the cloud. You can start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
Linear algebra operations like the Cholesky decomposition are hard to parallelise across machines, and the time savings would be questionable, although memory-wise the advantage of cluster computing is obvious.
If you're interested in MVM-based inference we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic Lanczos quadrature for the log-determinant, and preconditioned CG for the solve, but so far I have not committed those into TFP.
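To make the MVM idea concrete, here is a rough sketch of a matrix-free conjugate-gradient solve in eager TensorFlow (no preconditioning; the kernel matrix is only touched through a matvec callback, and this is not the TFP code linked above):

    import tensorflow as tf

    def cg_solve(matvec, b, max_iter=200, tol=1e-6):
        # Solve K x = b using only matrix-vector products with K,
        # so the full covariance matrix never has to be formed or factorised.
        x = tf.zeros_like(b)
        r = b - matvec(x)
        p = r
        rs_old = tf.reduce_sum(r * r)
        for _ in range(max_iter):
            Kp = matvec(p)
            alpha = rs_old / tf.reduce_sum(p * Kp)
            x = x + alpha * p
            r = r - alpha * Kp
            rs_new = tf.reduce_sum(r * r)
            if tf.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x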

What kinds of optimization are used in PyTorch methods?

I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering whether PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not using a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any action to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not mistaken) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything larger than very small matrices this cost is probably negligible.
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on the matrix shapes, a product ABCD may perform very differently when computed as A(B(CD)) than as (AB)(CD), and so on.
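A quick illustration of the ordering point (the shapes are made up; torch.linalg.multi_dot, available in recent PyTorch versions, picks the association for you):

    import torch

    A = torch.randn(10, 10000)
    B = torch.randn(10000, 10)
    C = torch.randn(10, 10000)
    D = torch.randn(10000, 10)

    # (A @ B) @ (C @ D) only ever forms small 10x10 intermediates,
    # whereas A @ (B @ (C @ D)) would build a 10000x10 intermediate first.
    out_manual = (A @ B) @ (C @ D)

    # multi_dot chooses the multiplication order automatically.
    out_auto = torch.linalg.multi_dot([A, B, C, D])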

Is there any tensorflow version of numpy.view_as_windows?

I am working with Python 3 and TensorFlow to generate image patches using NumPy's view_as_windows, but since NumPy can't run on the GPU, is there any way to do this with TensorFlow?
e.g. view_as_windows(array2d, window_shape, stride)
Thanks
Note: This answer does not answer the OP's exact question, but addresses the actual need of the OP as clarified in the comments (i.e., generate image patches, quickly). I just thought this would fit better here than in a badly-formatted comment.
If all you need to do is generate image patches, TensorFlow (and GPU acceleration in general) is not the right tool for this, because the actual computation is trivial (extracting a sub-area of an image) and the bottleneck would be the memory transfer between GPU and CPU.
My suggestion, then, is to write CPU-only code that uses view_as_windows and to parallelize it via multiprocessing to split the workload across all your CPU cores.
Should you need to feed those patches to a TensorFlow graph afterwards, the way to go would be to first generate the patches on the CPU (with whatever input pipeline you like), batch them and then feed them to the GPU for the graph computation.
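A rough sketch of that CPU-side approach (the window shape, step and random images are placeholders):

    import numpy as np
    from multiprocessing import Pool
    from skimage.util import view_as_windows

    def extract_patches(image, window_shape=(64, 64), step=32):
        # view_as_windows returns a strided view; reshape/copy to get a flat batch of patches.
        windows = view_as_windows(image, window_shape, step)
        return windows.reshape(-1, *window_shape).copy()

    if __name__ == "__main__":
        images = [np.random.rand(512, 512).astype(np.float32) for _ in range(16)]  # stand-in data
        with Pool() as pool:
            patches = pool.map(extract_patches, images)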
