Do ReLU1 in PyTorch - pytorch

I want to use the ReLU1 non-linear activation. ReLU1 is linear in [0, 1], clamps values below 0 to 0, and clamps values above 1 to 1.
It will be used only for the last layer of my deep net in PyTorch, which has a very high-resolution output of 2048x4096. Since the code has to be highly optimized for speed and memory, I do not know which of the following implementations is best.
These are the two implementations I can think of for the tensor x:
x.clamp_(min=0.0, max=1.0)
I cannot find the source code for this from its docs, so I do not know whether it is the best choice. I prefer an in-place operation, since backpropagation can happen through it.
The second alternative is torch.nn.functional.hardtanh_(x, min_val=0.0, max_val=1.0). This is definitely an in-place function, and its source shows that it uses the C++ function torch._C._nn.hardtanh(input, min_val, max_val), so I think it will be fast.
Please suggest which is the most efficient implementation, and another one if possible.
Thank you

Without trying it, my guess is that clamp and hardtanh will have the same speed, and it will be hard to do this operation any faster if you optimize it in isolation. The arithmetic is trivial so this operation will be bottlenecked by GPU memory bandwidth. To run faster, you'd want to fuse this operation with the operation that produced x. If you don't want to write a custom kernel for the combined operation, you can try using TorchScript.
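To make the comparison concrete, here is a minimal sketch showing that the two candidates from the question compute the same values. The helper names relu1_clamp and relu1_hardtanh are made up for illustration, and the sketch says nothing about speed; timing would have to be measured on the actual model and device.

```python
import torch
import torch.nn.functional as F


def relu1_clamp(x: torch.Tensor) -> torch.Tensor:
    # Out-of-place clamp to [0, 1] (hypothetical helper name)
    return torch.clamp(x, min=0.0, max=1.0)


def relu1_hardtanh(x: torch.Tensor) -> torch.Tensor:
    # hardtanh with custom bounds is the same piecewise-linear function
    return F.hardtanh(x, min_val=0.0, max_val=1.0)


torch.manual_seed(0)
x = torch.randn(4, 8) * 2  # small stand-in for the 2048x4096 output

a = relu1_clamp(x)
b = relu1_hardtanh(x)
same = torch.equal(a, b)

# In-place variant from the question, applied to a copy so x is untouched
y = x.clone()
y.clamp_(min=0.0, max=1.0)
same_inplace = torch.equal(a, y)

print(same, same_inplace)
```

Since both produce identical outputs, the choice between them comes down to measured performance and readability rather than correctness.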

Related

combine multiple FFT operation on GPU within python

I am using PyTorch to calculate a 3D convolution using the FFT.
The current code does an FFT first, then a pointwise multiplication in Fourier space, and finally an inverse FFT. The code works but seems to be very slow compared to optimized C++ CUDA code (approximately a factor of 5 slower).
I think the main problem is that each PyTorch operation is handled by a separate CUDA kernel. Is it somehow possible to merge these operations?
I heard about cuFFTDx. Can this somehow be used from python?
Or does cupy help?
thanks for any hint.
best wishes
Florian

Why is the triton language faster than pytorch?

This blog, introducing OpenAI's new Python extension called Triton, says this about why Triton can do matrix math faster than PyTorch (referring to an example of how Triton can be used to compute softmax along the rows of an m by n matrix):
Importantly, this particular implementation of softmax keeps the rows of X in SRAM throughout the entire normalization process, which maximizes data reuse when applicable (~<32K columns). This differs from PyTorch’s internal CUDA code, whose use of temporary memory makes it more general but significantly slower (below). The bottom line here is not that Triton is inherently better, but that it simplifies the development of specialized kernels that can be much faster than those found in general-purpose libraries.
How does PyTorch allocate memory for device tensors, and what is the "temporary memory" being referred to here? Why is the use of this temporary memory more general, but slower than the use of SRAM?
Is SRAM here referring to cache memory? If so, how/why does this library make better use of cache memory than pytorch internals? My understanding is that the decision about what data to cache is mostly up to the hardware rather than software.

BayesSearchCV parameters

I just read about Bayesian optimization and I want to try it.
I installed scikit-optimize and checked the API, and I'm confused:
I read that Bayesian optimization starts with some initial samples.
I can't see where to change this number in BayesSearchCV.
n_points changes the number of parameter settings to sample in parallel, and n_iter is the number of iterations (if I'm not wrong, the iterations can't run in parallel, since the algorithm improves the parameters after every iteration).
I read that we can use different acquisition functions.
I can't see where to change the acquisition function in BayesSearchCV either.
Is this something you are looking for?
BayesSearchCV(..., optimizer_kwargs={'n_initial_points': 20, 'acq_func': 'gp_hedge'}, ...)
skopt.Optimizer is the one actually doing the hyperparameter optimization.
BayesSearchCV builds the Optimizer with the optimizer_kwargs parameters.
https://github.com/scikit-optimize/scikit-optimize/blob/de32b5fd2205a1e58526f3cacd0422a26d315d0f/skopt/searchcv.py#L551

What kinds of optimization are used in PyTorch methods?

I'm using PyTorch to implement an intense sequence of matrix operations, using methods such as torch.mm or torch.dot. I was wondering whether PyTorch uses multithreading or other optimization mechanisms to speed up the process. I am not using a GPU. I would appreciate it if you could tell me how fast these methods are and whether I need to take any action to help the process.
PyTorch uses an efficient BLAS implementation and multithreading (OpenMP, if I'm not wrong) to parallelize such operations across multiple cores. Some performance loss comes from Python itself: since it is an interpreted language, no significant compiler-like optimization can be done. You can use the jit module to speed up the "wrapper" code around the matrix multiplies, but for anything more than very small matrices this cost is probably negligible.
One big improvement you may be able to get manually, but which PyTorch doesn't apply automatically, is to properly order the matrix multiplies. As you probably know, depending on matrix shapes, a multiplication ABCD may have different performance computed as A(B(CD)) than if computed as (AB)(CD), etc.
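As a rough illustration of how much the ordering can matter, the sketch below counts scalar multiplications for three parenthesizations of ABCD. The shapes are invented for the example, and the cost model (rows x inner x cols per product) ignores memory traffic and BLAS details, so treat the numbers as an estimate only.

```python
def matmul(a, b):
    """Multiply two symbolic matrices given as (rows, cols, cost_so_far)."""
    rows, inner, cost_a = a
    inner_b, cols, cost_b = b
    assert inner == inner_b, "inner dimensions must match"
    # A (rows x inner) times (inner x cols) product needs rows*inner*cols multiplies
    return (rows, cols, cost_a + cost_b + rows * inner * cols)


# Hypothetical shapes: A is 1000x10, B is 10x1000, C is 1000x10, D is 10x1
A = (1000, 10, 0)
B = (10, 1000, 0)
C = (1000, 10, 0)
D = (10, 1, 0)

right_to_left = matmul(A, matmul(B, matmul(C, D)))[2]   # A(B(CD))
balanced = matmul(matmul(A, B), matmul(C, D))[2]        # (AB)(CD)
left_to_right = matmul(matmul(matmul(A, B), C), D)[2]   # ((AB)C)D

print(right_to_left, balanced, left_to_right)
```

With these shapes, A(B(CD)) needs 30,000 multiplications while (AB)(CD) needs about 11 million, because the latter materializes a 1000x1000 intermediate. When the chain is known up front, torch.linalg.multi_dot can be called explicitly to pick an efficient order.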

MemoryError using MLPClassifier from sklearn.neural_network

I'm running Python 3.5 on a 64-bit Windows 10 operating system.
When I try to implement MLPClassifier the code runs for a while and then gives me a MemoryError.
I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data. How can I circumvent this error?
Code
from sklearn.neural_network import MLPClassifier

# Widths to try for each of the two hidden layers
gamma = [1, 10, 100, 1000, 10000, 100000]
score_train = []
score_test = []
for j in gamma:
    # data_train, classes_train, data_test and classes_test come from the asker's own data split
    mlp = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[j, j], activation='tanh').fit(data_train, classes_train)
    score_train.append(mlp.score(data_train, classes_train))
    score_test.append(mlp.score(data_test, classes_test))
print(score_train)
print(score_test)
Error
MemoryError traceback
the code runs for a while and then gives me a MemoryError. I think it's due to the size of the hidden layer that I'm asking it to run but I need to run this size to collect my data.
Yes, it's the size of the hidden layers! And the remaining part of that sentence does not make much sense (continue reading)!
Please make sure to read the tutorial and the API docs.
Now some more specific remarks:
The size of the hidden layers does not have anything to do with the collection of your data!
Input and output layers will be built based on the sizes of your X, y!
hidden_layer_sizes=[j,j] is actually creating 2 hidden layers!
In the MLP, all layers are fully connected!
A call with hidden_layer_sizes=[100000, 100000], as you are trying to make, needs about 80 GB (roughly 74.5 GiB) of memory, assuming 64-bit doubles, just for the weights connecting these 2 layers!
And this is just one connection layer: input-h0 and h1-output are still missing.
lbfgs is a completely different solver from all the others. Don't use it without some understanding of the implications! It's not the default!
It's a full-batch method and therefore uses a lot more memory when the sample size is big!
Additionally, there are further internal reasons why it uses more memory than the other (first-order) methods.
Not that precise, but the docs already gave some hints: Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
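The weight-memory arithmetic can be checked with a few lines of plain Python. This is a back-of-the-envelope sketch: the helper dense_weight_bytes is a made-up name, it counts only the weight matrix between the two hidden layers, and it assumes 8 bytes per value; biases, gradients, and solver workspace would come on top.

```python
def dense_weight_bytes(n_in, n_out, bytes_per_value=8):
    # A fully connected layer stores one weight per (input unit, output unit) pair
    return n_in * n_out * bytes_per_value


for width in [1, 10, 100, 1000, 10000, 100000]:
    size = dense_weight_bytes(width, width)
    print(f"{width:>6} x {width:<6} -> {size / 1e9:.6g} GB ({size / 2**30:.6g} GiB)")
```

The cost grows quadratically with the layer width, so going from 10000 to 100000 units multiplies the memory for this one matrix by 100.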
