Convert function to exploit parallelization of the GPU

Convert function to exploit parallelization of the GPU - pytorch

I have a function that uses values stored in one array to operate on another array. This behaves similar to the numpy.hist function. For example:
import numpy as np
from numba import jit
#jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
for row in range(x.size):
col = int(x[row] * 10)
output_counts[col] += 1
output_weights[col] += y[row]
return (output_counts, output_weights)
# in the current code these arrays exists ad pytorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, followed by the CPU function being run in series.
Can this function be converted to run in parallel on the GPU?
##EDIT##
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors then there is no benefit to running on the GPU (it actually kills the performance). I wonder if there is a way I can break of x so each value gets sent to different threads (similar to how apply_async does within multiprocessing)?
I'm open to other options.
In it's current form the function is fast, but the GPU-to-CPU data transfer is killing me.

Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU regarding the number of items to scan, the size of the histogram and the distribution of the values.
For example, one solution consist in building local histograms in each separate kernel blocks and then perform a reduction. However, this solution is not well suited in your case since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solutions only scale well if there is no atomic conflicts which is dependent of the actual input data. Indeed, if all value of x are equal, then all updates will be serialized which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent on small histograms but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targetting Nvidia GPUs). Here is an example of kernel solving your problem:
#cuda.jit
def array_func(x, y, output_counts, output_weights):
tx = cuda.threadIdx.x # Thread id in a 1D block
ty = cuda.blockIdx.x # Block id in a 1D grid
bw = cuda.blockDim.x # Block width, i.e. number of threads per block
pos = tx + ty * bw # Compute flattened index inside the array
if pos < x.size:
col = int(x[pos] * 10)
cuda.atomic.add(output_counts, col, 1)
cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can possibly be directly created on the GPU so to avoid transfers. x and y should be on the GPU for better performance (otherwise a CPU reduction will be certainly faster). Also note that the kernel should be pretty fast so the overhead to run/wait it and allocate/free temporary array may be significant and even possibly slower than the kernel itself (but certainly faster than doing a double transfer from/to the CPU so to compute things on the CPU assuming data was on the GPU). Note also that such atomic accesses are only fast on quite recent Nvidia GPU that benefit from specific computing units for atomic operations.

Related

What is the fastest way to sum 2 matrices using Numba?

I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy.
Here is my code:
import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time
# function 1:
def sum_numpy(A,B):
return A+B
# function 2:
sum_numba_simple= njit(cache=True,fastmath=True) (sum_numpy)
# function 3:
#vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
return A+B
# function 4:
#njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
n=A.shape[0]
m=A.shape[1]
C = np.empty((n, m), A.dtype)
for i in prange(n):
for j in prange(m):
C[i,j]=A[i,j]+B[i,j]
return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation ?

Your code is bound by page-faults (see here, here and there for more information about this). Page-faults happens because the array is newly allocated. A solution is to preallocate it and then write within it so to no cause pages to be remapped in physical memory. np.add(A, B, out=C) does this as indicated by #August in the comments. Another solution could be to adapt the standard allocator so not to give the memory back to the OS at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that for example).
There is another issue on most platforms (especially x86 ones): the cache-line write allocations of write-back caches are expensive during writes. The typical solution to avoid this is to do non-temporal store (if available on the target processor, which is the case on x86-64 one but maybe not others). That being said, neither Numpy nor Numba are able to do that yet. For Numba, I filled an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tends not to generate such instructions because they can be detrimental in performance when arrays fit in cache and compilers do not know the size of the array at compile time (they could generate a specific code when they can evaluate the amount of data computed but this is not easy and can slow-down some other codes). AFAIK, the only possible way to fix this is to write a C code and use low-level instructions or to use compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown up to 33%.
Using multiple threads do not always make memory-bound code faster. In fact, it generally barely scale because using more core do not speed up the execution when the RAM is already saturated. Few cores are generally required so to saturate the RAM on most platforms. Page faults can benefit from using multiple cores regarding the target system (Linux does that in parallel quite well, Windows generally does not scale well, IDK for MacOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine while it can be). On solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input array should also be specified for the compiler to generate a fast SIMD code. Here is the resulting Numba code:
#njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
n, m = A.shape
assert C.shape == A.shape
A_flat = A.reshape(n*m)
B_flat = B.reshape(n*m)
C_flat = C.reshape(n*m)
for i in prange(n*m):
C_flat[i]=A_flat[i]+B_flat[i]
return C
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The Numba code and the Numpy one saturate my RAM. Using more core does not help (in fact it is a bit slower certainly due to the contention of the memory controller). Both are sub-optimal since they do not use non-temporal store instructions that can prevent cache-line write allocations (causing data to be fetched from the RAM before being written back). The optimal time is the one expected using such instruction. Note that it is expected to reach only 65-80% of the RAM bandwidth because of RAM mixed read/writes. Indeed, interleaving reads and writes cause low-level overheads preventing the RAM to be saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).

scikit learn unwanted parallel processing

I have a problem with nested multiprocessing witch starts when I use scikit-learn (v. 0.22) Quadratic Discriminant Analysis. Necessary is system configuration: 24 thread Xeon machine running fedora 30.
I run consecutively qda on the randomly selected subset of predictors:
def process(X,y,n_features,i=1):
comb = np.random.choice(range(X.shape[1]),n_features,replace=False)
qda = QDA(tol=1e-8)
qda.fit(X[:,comb],y)
y_pred = qda.predict(X[:,comb])
return (accuracy_score(y,y_pred),comb,i)
where n_features is number of features randomly selected from the full set of possible predictors, X,y explanatory and depended variables.
When n_features is 18 or less process works in single-thread mode, which means that I can use any tool to parallel processing (I use ray). When n_features is 19, and above for unknown reason it (not me) starts all available threads, and entire calculation get more time even in comparison to a single thread.
tmp = [process(X,y,n_features,i=1) for _ in range(1000)]
Based on my previous experiences with other Linux libraries (R gstat precisely) the same situation (uncontrolled multithreading mode) was caused by Linux implementation of blas, but here it could not be the case. In general, the question is: what starts this multithreading and how to control it even if LDA/QDA hasn't n_jobs parameter to avoid nested multiprocessing.

QDA in scikit-learn does not expose n_jobs meaning that you cannot set anything. However, it could be due to numpy which does not restrict the number of threads.
The solution to limit the number of threads are:
set the environment variable OMP_NUM_THREADS, MKL_NUM_THREADS, or OPENBLAS_NUM_THREADS to be sure that you will limit the number of threads;
you can use threadpoolctl which provides a context manager to set the number of threads.

How to speed up parallel loading (& unloading) of matrices onto multiple GPUs in Matlab

I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.
At each iteration, I have to work with two large m by n matrices (of doubles), A and B, where m = 1600000, and n = 500. Due to the size of the matrices and the memory capacity of each GPU (~8 GB memory each), I decompose the problem by partitioning the matrices row-wise into K chunks of smaller matrices who has the same number of n columns but fewer rows (M /K).
In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.
To achieve this, I tried using the parfor loop in Matlab (with the parallel computing toolbox), utilizing best practices such as slicing, loading only relevant data for each worker. For posterity, here is a complete code snippet. I have provided small, decomposed problems deeper down in this post.
M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);
%delete(gcp('nocreate'));
%p = parpool('local',K);
tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
tic
A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
C_blk = gpuArray(C);
D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
toc_load(j) = toc;
tic
B_blk = D_blk * C_blk' + A_blk + B_blk;
toc_compute(j) = toc;
tic
B(j,:,:) = gather(B_blk);
toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);
Using the tic-toc time checker (I run the code only after starting the parpool to ensure that time-tracker is accurate), I found that each worker takes on average:
6.33 seconds to load the data onto the GPU
0.18 seconds to run the computations on the GPU
4.91 seconds to unload the data from the GPU.
However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.
I then tried a simple for loop without parallelization, see snippet below.
%% for loop
tic
for j = 1:K
A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
C_blk = gpuArray(C);
D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
toc_load(j) = toc;
B_blk = D_blk * C_blk' + A_blk + B_blk;
toc_compute(j) = toc;
B(j,:,:) = gather(B_blk);
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);
This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.
From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.
In lieu of this, I have the following questions:
What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?
How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and then back in Matlab? Are there tricks involving parfor, spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I maintain my current CPU/GPU setup) ?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?
Thank you for any assistance!

Sending data to/from AWS instances to use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead because all communication with AWS workers use secure communication.
You can use ticBytes and tocBytes to see how much data is being transferred.
To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.
Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using rand - which is easy to run on the cluster, but presumably not really representative.
Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).

Reduce multiprocessing for statsmodels glm

I am currently doing proof of concept for one of our business process that requires logistic regression. I have been using statsmodels glm to perform classification against our data set (as per below code). Our data set consists of ~10M rows and around 80 features (where almost 70+ are dummies e.g. "1" or "0" based on the defined categorical variables). Using smaller data set, glm works fine, however if i run it against the full data set, python is throwing an error "cannot allocate memory".
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that this might be due to statsmodels is designed to make use of all the available cpu cores and each subprocess underneath creates a copy of the data set into RAM (please correct me if I am mistaken). Question now would be if there is a way for glm to just make use of minimal number of cores? I am not into performance but just want to be able to run the glm against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25GB as there are other processes running on the
same machine)
swap: 16 GB
dataset size: 1.4 GB (based on Panda's DataFrame.info(memory_usage='deep')

GLM uses multiprocessing only through the linear algbra libraries
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.
However, some of the underlying Blas/Lapack libraries in numpy/scipy also use mutliple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations especially when we want to use parallel processing on a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used. see mailing list thread
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX, set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)

This is because Statsmodels use IRLS in estimating GLM and the IRLS process utilize its WLS regression routine which again uses QR decomposition. The QR decomposition is directly done on the X and your X has 10million rows, 80 columns which turns out putting a lot of stress on the memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
pinv_wexog = np.linalg.pinv(self.wexog)
params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
Q, R = np.linalg.qr(self.wexog)
params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,

Python3 multiprocessing: Memory Allocation Error

I know that this question has been asked a lot of times, but the answers are not applicable.
This is answer one of a parallelized loop using multiprocessing on StackoverFlow:
import multiprocessing as mp
def processInput(i):
return i * i
if __name__ == '__main__':
inputs = range(1000000)
pool = mp.Pool(processes=4)
results = pool.map(processInput, inputs)
print(results)
This code works fine. But if I increase the range to 1000000000, my 16GB of Ram are getting filled completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?

The pool.map function starts 4 processes as you instructed it (in the line processes=4 you instruct the pool on how many processes it can use to perform your logic).
There is however a different issue underlying this implementation.
The pool.map function will return a list of objects, in this case its numbers.
Numbers do not act like int-s in ANSI-C they have overhead and will not overflow (e.g. turn to -2^31 whenever reaching 2^31+1 on 32-bit).
Also python lists are not array and do incur an overhead.
To be more specific, on python 3.6, running the following code will reveal some overhead:
>>>import sys
>>>t = [1,2,3,4]
>>>sys.getsizeof(t)
96
>>>t = [x for x in range(1000)]
>>>sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list the size of 10^9 we get about 8.5GB
EDIT: 1. As tfb mentioned, this is not even the size of the underlying Number objects, just pointers and list overhead, meaning there is much more memory overhead I did not account for in the original answer.
Default python installation on windows is 32-bit (you can get 64-bit installation but you need to check the section of all available downloads in the python website), So I assumed you are using the 32-bit installation.

range(1000000000) creates a list of 10^9 ints. This is around 8GB (8 bytes per int on a 64-bit system). You are then trying to process this to create another list of 10^9 ints. A really really smart implementation might be able to do this on a 16GB machine, but its basically a lost cause.
In Python 2 you could try using xrange which might or might not help. I am not sure what the Python 3 equivalent is.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string