I'm using PyTorch to implement an intensive sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering whether PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not using a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any action to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
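For example, a minimal sketch of scripting such wrapper code (the function and shapes here are purely illustrative):

import torch

@torch.jit.script
def chained_mm(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # TorchScript trims some Python interpreter overhead around these calls;
    # the matrix multiplies themselves already run in multithreaded native BLAS code.
    return torch.mm(torch.mm(a, b), c)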
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to order the matrix multiplies properly. As you probably know, depending on the matrix shapes, a product ABCD can be much cheaper computed as A(B(CD)) than as (AB)(CD), or vice versa.
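A toy illustration (the shapes are made up, but they show the effect):

import torch

A = torch.randn(1000, 10)
B = torch.randn(10, 1000)
C = torch.randn(1000, 10)
D = torch.randn(10, 1000)

# (AB)(CD): both intermediates are 1000x1000, and the final product alone
# costs about 10^9 multiply-adds.
slow = (A @ B) @ (C @ D)

# A(B(CD)): every intermediate stays small; the total is about 3*10^7
# multiply-adds -- the same result for roughly 30x less work.
fast = A @ (B @ (C @ D))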
Related
I am using PyTorch to compute a 3D convolution using the FFT.
The current code does an FFT first, then a pointwise multiplication in Fourier space, and finally an inverse FFT. The code works but seems to be very slow compared to an optimized C++ CUDA code (roughly a factor of 5 slower).
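For concreteness, a simplified sketch of that pipeline (assuming real-valued inputs of matching shape; zero-padding for a linear rather than circular convolution is omitted):

import torch

def fft_conv3d(x, k):
    # forward FFTs of the signal and the kernel
    X = torch.fft.rfftn(x, dim=(-3, -2, -1))
    K = torch.fft.rfftn(k, dim=(-3, -2, -1))
    # pointwise multiplication in Fourier space, then the inverse FFT
    return torch.fft.irfftn(X * K, s=x.shape[-3:], dim=(-3, -2, -1))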
I think the main problem is that each PyTorch operation is handled by a separate CUDA kernel. Is it somehow possible to merge these operations?
I heard about cuFFTDx. Can this somehow be used from Python?
Or does CuPy help?
Thanks for any hint.
best wishes
Florian
I want to use the ReLU1 non-linear activation. ReLU1 is linear in [0, 1] but clamps values less than 0 to 0 and values greater than 1 to 1.
It will be used only for the last layer of my deep net in PyTorch, which has a very high-resolution output of 2048x4096. Since the code has to be highly optimized in terms of speed and memory, I do not know which of the following is the best implementation.
Following are the two implementations I can think of for the tensor x:
x.clamp_(min=0.0, max=1.0)
For this one, I am unable to see the source code linked from its docs, so I do not know whether it is the best choice. I would prefer an in-place operation, since backpropagation can happen through it.
The second alternative I have is to use torch.nn.functional.hardtanh_(x, min_val=0.0, max_val=1.0). This is definitely an in-place function, and the source code shows that it calls into the C++ backend via torch._C._nn.hardtanh(input, min_val, max_val), so I think it will be fast.
Please suggest which is the more efficient implementation, and another one if possible.
Thank you.
Without trying it, my guess is that clamp and hardtanh will have the same speed, and it will be hard to do this operation any faster if you optimize it in isolation. The arithmetic is trivial so this operation will be bottlenecked by GPU memory bandwidth. To run faster, you'd want to fuse this operation with the operation that produced x. If you don't want to write a custom kernel for the combined operation, you can try using TorchScript.
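For example, a rough sketch (the linear layer here is just a stand-in for whatever actually produces x):

import torch

@torch.jit.script
def linear_relu1(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # the JIT fuser may merge the elementwise clamp with adjacent elementwise work,
    # avoiding an extra full pass over the 2048x4096 output
    return torch.clamp(torch.nn.functional.linear(x, w, b), 0.0, 1.0)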
I am trying to run a GP regression over 2D space + 1D time with ~8000 observations and a composite kernel with 4 Matern 3/2 covariance functions -- more than a single core can handle.
It would be great to be able to distribute the GPR computation over multiple nodes rather than having to resort to a variational GP. This GitHub issue explains how to enable multithreading in GPflow 1.0, but I am not looking for a way to parallelize many predict_f calls.
Rather, I want to do GPR on a large dataset, which means inverting a covariance matrix larger than a single core can handle. Is there a way to parallelize this computation for a cluster or the Cloud?
In terms of computation, GPflow can do whatever TensorFlow does. In other words, if TensorFlow supported cloud evaluation, GPflow would support it as well. But that doesn't mean you cannot implement your own version of the computation, perhaps more efficiently, and run it on the cloud. You can start by looking into TensorFlow custom ops: https://www.tensorflow.org/guide/create_op.
Linear-algebra operations like the Cholesky decomposition are hard to parallelize across machines, and the time savings would be questionable, although memory-wise the advantage of cluster computing is obvious.
If you're interested in MVM-based inference, we have a bit of a start here:
https://github.com/tensorflow/probability/blob/7c70d4a3389680670e989b93561440caaa0fb8cd/tensorflow_probability/python/experimental/linalg/linear_operator_psd_kernel.py#L252
I've been playing with stochastic Lanczos quadrature for the log-determinant and preconditioned CG for the solve, but so far have not committed those to TFP.
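For illustration, a bare-bones CG solve that only touches the matrix through matrix-vector products (a NumPy sketch, not the TFP code linked above):

import numpy as np

def cg_solve(matvec, b, tol=1e-6, max_iter=1000):
    # conjugate gradients: the kernel matrix appears only through matvec,
    # which is what makes it possible to shard the matrix across workers
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x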
I am trying to develop a lightweight system that uses an unsupervised learning method on system parameters such as CPU and RAM utilization to train an anomaly detector. I could not think of anything beyond a self-organizing map (SOM). Is there any other learning technique that I can consider here?
You don't have many options here with a SOM. The only thing you could consider is whether to do batch or sequential training, if the implementation you use offers both options. But this choice mainly affects training time (the first is much quicker) and not the resulting map (in theory, at least).
You could also select a distance function other than the Euclidean one, but the vast majority of the literature doesn't bother with this.
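If it helps, a small sketch of the batch-vs-sequential choice using the MiniSom package (the metric data here is made up):

import numpy as np
from minisom import MiniSom  # third-party package, assumed to be available

# toy samples of (CPU utilization, RAM utilization), scaled to [0, 1]
data = np.random.rand(1000, 2)

som = MiniSom(10, 10, 2, sigma=1.0, learning_rate=0.5)
som.random_weights_init(data)
som.train_batch(data, 5000)   # batch training; train_random() is the sequential variant

def anomaly_score(x):
    # distance of a new sample to its best matching unit (quantization error)
    w = som.get_weights()[som.winner(x)]
    return np.linalg.norm(x - w)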
I have written a backpropagation class in VB.NET (it works well) and I'm using it in a C# artificial intelligence project.
But I have an AMD Phenom X3 at home and an Intel i5 at school, and my neural network is not multithreaded.
How can I convert that backpropagation class to a multithreaded algorithm? Or how can I use GPGPU programming with it? Or should I use a third-party library that provides a multithreaded backpropagation neural network?
Jeff Heaton has recommended using resilient propagation (RPROP) instead of backpropagation. There are examples of how to do multithreaded RPROP (MPROP):
Article on C# multithreaded backpropagation (from Jeff Heaton)
Chapter 7.2.1, "Propagation and Multithreading" (p. 94 of Introduction to Encog 2.5 for C#)
It's difficult to discuss all of the details here, so I would recommend that you read that article and take a look at the relevant chapter of the book I referenced. This, of course, assumes you're familiar with concurrent programming.
Update:
Resilient propagation will typically outperform backpropagation by a considerable factor. Additionally, RPROP has no parameters that must be set. Backpropagation requires that a learning rate and momentum value be specified. Finding an optimal learning rate and momentum value for backpropagation can be difficult. This is not necessary with resilient propagation.
(source: Encog Machine Learning)
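For reference, a simplified sketch of an RPROP-style update (the iRPROP- variant; NumPy pseudocode with the usual default parameters, since the idea is independent of language):

import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    # per-weight step sizes grow while the gradient sign is stable and shrink
    # when it flips, so no global learning rate or momentum has to be tuned
    same_sign = np.sign(grad) * np.sign(prev_grad)
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(same_sign < 0, 0.0, grad)  # skip the update right after a sign flip
    w = w - np.sign(grad) * step
    return w, grad, step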
I've tried implementing multiple threads for RPROP batch processing, but it always seemed slower than using a single thread. I tried parallelizing at the loop level with "#pragma omp parallel", and separately by computing the errors, gradients and weights in different threads. My interpretation is that the work done in each thread is too small to outweigh the cost of switching threads and synchronizing the results (mutexes). Am I doing something wrong? My conclusion is that it would be smarter to run RPROP single-threaded while processing multiple neural networks at the same time in separate threads. Most implementations usually involve multiple interconnected NNs anyway, so that would make sense.