What is the fastest way to sum 2 matrices using Numba?

What is the fastest way to sum 2 matrices using Numba? - multithreading

I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy.
Here is my code:
import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time
# function 1:
def sum_numpy(A,B):
return A+B
# function 2:
sum_numba_simple= njit(cache=True,fastmath=True) (sum_numpy)
# function 3:
#vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
return A+B
# function 4:
#njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
n=A.shape[0]
m=A.shape[1]
C = np.empty((n, m), A.dtype)
for i in prange(n):
for j in prange(m):
C[i,j]=A[i,j]+B[i,j]
return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation ?

Your code is bound by page-faults (see here, here and there for more information about this). Page-faults happens because the array is newly allocated. A solution is to preallocate it and then write within it so to no cause pages to be remapped in physical memory. np.add(A, B, out=C) does this as indicated by #August in the comments. Another solution could be to adapt the standard allocator so not to give the memory back to the OS at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that for example).
There is another issue on most platforms (especially x86 ones): the cache-line write allocations of write-back caches are expensive during writes. The typical solution to avoid this is to do non-temporal store (if available on the target processor, which is the case on x86-64 one but maybe not others). That being said, neither Numpy nor Numba are able to do that yet. For Numba, I filled an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tends not to generate such instructions because they can be detrimental in performance when arrays fit in cache and compilers do not know the size of the array at compile time (they could generate a specific code when they can evaluate the amount of data computed but this is not easy and can slow-down some other codes). AFAIK, the only possible way to fix this is to write a C code and use low-level instructions or to use compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown up to 33%.
Using multiple threads do not always make memory-bound code faster. In fact, it generally barely scale because using more core do not speed up the execution when the RAM is already saturated. Few cores are generally required so to saturate the RAM on most platforms. Page faults can benefit from using multiple cores regarding the target system (Linux does that in parallel quite well, Windows generally does not scale well, IDK for MacOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine while it can be). On solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input array should also be specified for the compiler to generate a fast SIMD code. Here is the resulting Numba code:
#njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
n, m = A.shape
assert C.shape == A.shape
A_flat = A.reshape(n*m)
B_flat = B.reshape(n*m)
C_flat = C.reshape(n*m)
for i in prange(n*m):
C_flat[i]=A_flat[i]+B_flat[i]
return C
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The Numba code and the Numpy one saturate my RAM. Using more core does not help (in fact it is a bit slower certainly due to the contention of the memory controller). Both are sub-optimal since they do not use non-temporal store instructions that can prevent cache-line write allocations (causing data to be fetched from the RAM before being written back). The optimal time is the one expected using such instruction. Note that it is expected to reach only 65-80% of the RAM bandwidth because of RAM mixed read/writes. Indeed, interleaving reads and writes cause low-level overheads preventing the RAM to be saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).

Related

Convert function to exploit parallelization of the GPU

I have a function that uses values stored in one array to operate on another array. This behaves similar to the numpy.hist function. For example:
import numpy as np
from numba import jit
#jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
for row in range(x.size):
col = int(x[row] * 10)
output_counts[col] += 1
output_weights[col] += y[row]
return (output_counts, output_weights)
# in the current code these arrays exists ad pytorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, followed by the CPU function being run in series.
Can this function be converted to run in parallel on the GPU?
##EDIT##
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors then there is no benefit to running on the GPU (it actually kills the performance). I wonder if there is a way I can break of x so each value gets sent to different threads (similar to how apply_async does within multiprocessing)?
I'm open to other options.
In it's current form the function is fast, but the GPU-to-CPU data transfer is killing me.

Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU regarding the number of items to scan, the size of the histogram and the distribution of the values.
For example, one solution consist in building local histograms in each separate kernel blocks and then perform a reduction. However, this solution is not well suited in your case since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solutions only scale well if there is no atomic conflicts which is dependent of the actual input data. Indeed, if all value of x are equal, then all updates will be serialized which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent on small histograms but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targetting Nvidia GPUs). Here is an example of kernel solving your problem:
#cuda.jit
def array_func(x, y, output_counts, output_weights):
tx = cuda.threadIdx.x # Thread id in a 1D block
ty = cuda.blockIdx.x # Block id in a 1D grid
bw = cuda.blockDim.x # Block width, i.e. number of threads per block
pos = tx + ty * bw # Compute flattened index inside the array
if pos < x.size:
col = int(x[pos] * 10)
cuda.atomic.add(output_counts, col, 1)
cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can possibly be directly created on the GPU so to avoid transfers. x and y should be on the GPU for better performance (otherwise a CPU reduction will be certainly faster). Also note that the kernel should be pretty fast so the overhead to run/wait it and allocate/free temporary array may be significant and even possibly slower than the kernel itself (but certainly faster than doing a double transfer from/to the CPU so to compute things on the CPU assuming data was on the GPU). Note also that such atomic accesses are only fast on quite recent Nvidia GPU that benefit from specific computing units for atomic operations.

Reduce multiprocessing for statsmodels glm

I am currently doing proof of concept for one of our business process that requires logistic regression. I have been using statsmodels glm to perform classification against our data set (as per below code). Our data set consists of ~10M rows and around 80 features (where almost 70+ are dummies e.g. "1" or "0" based on the defined categorical variables). Using smaller data set, glm works fine, however if i run it against the full data set, python is throwing an error "cannot allocate memory".
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that this might be due to statsmodels is designed to make use of all the available cpu cores and each subprocess underneath creates a copy of the data set into RAM (please correct me if I am mistaken). Question now would be if there is a way for glm to just make use of minimal number of cores? I am not into performance but just want to be able to run the glm against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25GB as there are other processes running on the
same machine)
swap: 16 GB
dataset size: 1.4 GB (based on Panda's DataFrame.info(memory_usage='deep')

GLM uses multiprocessing only through the linear algbra libraries
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.
However, some of the underlying Blas/Lapack libraries in numpy/scipy also use mutliple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations especially when we want to use parallel processing on a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used. see mailing list thread
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX, set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)

This is because Statsmodels use IRLS in estimating GLM and the IRLS process utilize its WLS regression routine which again uses QR decomposition. The QR decomposition is directly done on the X and your X has 10million rows, 80 columns which turns out putting a lot of stress on the memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
pinv_wexog = np.linalg.pinv(self.wexog)
params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
Q, R = np.linalg.qr(self.wexog)
params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,

Python3 multiprocessing: Memory Allocation Error

I know that this question has been asked a lot of times, but the answers are not applicable.
This is answer one of a parallelized loop using multiprocessing on StackoverFlow:
import multiprocessing as mp
def processInput(i):
return i * i
if __name__ == '__main__':
inputs = range(1000000)
pool = mp.Pool(processes=4)
results = pool.map(processInput, inputs)
print(results)
This code works fine. But if I increase the range to 1000000000, my 16GB of Ram are getting filled completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?

The pool.map function starts 4 processes as you instructed it (in the line processes=4 you instruct the pool on how many processes it can use to perform your logic).
There is however a different issue underlying this implementation.
The pool.map function will return a list of objects, in this case its numbers.
Numbers do not act like int-s in ANSI-C they have overhead and will not overflow (e.g. turn to -2^31 whenever reaching 2^31+1 on 32-bit).
Also python lists are not array and do incur an overhead.
To be more specific, on python 3.6, running the following code will reveal some overhead:
>>>import sys
>>>t = [1,2,3,4]
>>>sys.getsizeof(t)
96
>>>t = [x for x in range(1000)]
>>>sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list the size of 10^9 we get about 8.5GB
EDIT: 1. As tfb mentioned, this is not even the size of the underlying Number objects, just pointers and list overhead, meaning there is much more memory overhead I did not account for in the original answer.
Default python installation on windows is 32-bit (you can get 64-bit installation but you need to check the section of all available downloads in the python website), So I assumed you are using the 32-bit installation.

range(1000000000) creates a list of 10^9 ints. This is around 8GB (8 bytes per int on a 64-bit system). You are then trying to process this to create another list of 10^9 ints. A really really smart implementation might be able to do this on a 16GB machine, but its basically a lost cause.
In Python 2 you could try using xrange which might or might not help. I am not sure what the Python 3 equivalent is.

Can we use `shuffle()` instruction for reg-to-reg data-exchange between items (threads) in WaveFront?

As we known, WaveFront (AMD OpenCL) is very similar to WARP (CUDA): http://research.cs.wisc.edu/multifacet/papers/isca14-channels.pdf
GPGPU languages, like OpenCL™ and CUDA, are called SIMT because they
map the programmer’s view of a thread to a SIMD lane. Threads
executing on the same SIMD unit in lockstep are called a wavefront
(warp in CUDA).
Also known, that AMD suggested us the (Reduce) addition of numbers using a local memory. And for accelerating of addition (Reduce) suggests using vector types: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/AMD_OpenCL_Tutorial_SAAHPC2010.pdf
But are there any optimized register-to-register data-exchage instructions between items (threads) in WaveFront:
such as int __shfl_down(int var, unsigned int delta, int width=warpSize); in WARP (CUDA): https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
or such as __m128i _mm_shuffle_epi8(__m128i a, __m128i b); SIMD-lanes on x86_64: https://software.intel.com/en-us/node/524215
This shuffle-instruction can, for example, execute Reduce (add up the numbers) of 8 elements from 8 threads/lanes, for 3 cycles without any synchronizations and without using any cache/local/shared-memory (which has ~3 cycles latency for each access).
I.e. threads sends its value directly to register of other threads: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
Or in OpenCL we can use only instruction gentypen shuffle( gentypem x, ugentypen mask ) which can be used only for vector-types such as float16/uint16 into each item (thread), but not between items (threads) in WaveFront: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/shuffle.html
Can we use something looks like shuffle() for reg-to-reg data-exchange between items (threads) in WaveFront which more faster than data-echange via Local memory?
Are there in AMD OpenCL instructions for register-to-register data-exchange intra-WaveFront such as instructions __any(), __all(), __ballot(), __shfl() for intra-WARP(CUDA): http://on-demand.gputechconf.com/gtc/2015/presentation/S5151-Elmar-Westphal.pdf
Warp vote functions:
__any(predicate) returns non-zero if any of the predicates for the
threads in the warp returns non-zero
__all(predicate) returns non-zero if all of the predicates for the
threads in the warp returns non-zero
__ballot(predicate) returns a bit-mask with the respective bits
of threads set where predicate returns non-zero
__shfl(value, thread) returns value from the requested thread
(but only if this thread also performed a __shfl()-operation)
CONCLUSION:
As known, in OpenCL-2.0 there is Sub-groups with SIMD execution model akin to WaveFronts: Does the official OpenCL 2.2 standard support the WaveFront?
For Sub-Group there are - page-160: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
int sub_group_all(int predicate) the same as CUDA-__all(predicate)
int sub_group_any(int predicate); the same as CUDA-__any(predicate)
But in OpenCL there is no similar functions:
CUDA-__ballot(predicate)
CUDA-__shfl(value, thread)
There is only Intel-specified built-in shuffle functions in Version 4, August 28, 2016 Final Draft OpenCL Extension #35: intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up: https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt
Also in OpenCL there are functions, which usually implemented by shuffle-functions, but there are not all of functions which can be implemented by using shuffle-functions:
<gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id );
<gentype> sub_group_reduce_<op>( <gentype> x );
<gentype> sub_group_scan_exclusive_<op>( <gentype> x );
<gentype> sub_group_scan_inclusive_<op>( <gentype> x );
Summary:
shuffle-functions remain more flexible functions , and ensure the fastest possible communication between threads with direct register-to-register data-exchanging.
But functions sub_group_broadcast/_reduce/_scan doesn't guarantee direct register-to-register data-exchanging, and these sub-group-functions less flexible.

There is
gentype work_group_reduce<op> ( gentype x)
for version >=2.0
but its definition doesn't say anything about using local memory or registers. This just reduces each collaborator's x value to a single sum of all. This function must be hit by all workgroup-items so its not on a wavefront level approach. Also the order of floating-point operations is not guaranteed.
Maybe some vendors do it register way while some use local memory. Nvidia does with register I assume. But an old mainstream Amd gpu has local memory bandwidth of 3.7 TB/s which is still good amount. (edit: its not 22 TB/s) For 2k cores, this means nearly 1.5 byte per cycle per core or much faster per cache line.
For %100 register(if not spills to global memory) version, you can reduce number of threads and do vectorized reduction in threads themselves without communicating with others if number of elements are just 8 or 16. Such as
v.s0123 += v.s4567
v.s01 += v.s23
v.s0 += v.s1
which should be similar to a __m128i _mm_shuffle_epi8 and its sum version when compiled on a CPU and non-scalar implementations will use same SIMD on a GPU to do these 3 operations.
Also using these vector types tend to use efficient memory transactions even for global and local, not just registers.
A SIMD works on only a single wavefront at a time, but a wavefront may be processed by multiple SIMDs, so, this vector operation does not imply a whole wavefront is being used. Or even whole wavefront may be computing 1st elements of all vectors in a cycle. But for a CPU, most logical option is SIMD computing work items one by one(avx,sse) instead of computing them in parallel by their same indexed elements.
If main work group doesn't fit ones requirements, there are child kernels to spawn and use dynamic width kernels for this kind of operations. Child kernel works on another group called sub-group concurrently. This is done within device-side queue and needs OpenCl version to be at least 2.0.
Look for "device-side enqueue" in http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
AMD APP SDK supports Sub-Group

yield loss with OpenMP in Fortran

First of all sorry if I make grammatical mistakes. I'm not english.
I'm trying to improve the yield loss that is occurring when increasing the number of threads using OpenMP in Fortran.
I'm using two Intel Xeon X5650 (12 physical cores) with 96 Gb of RAM
The best results that I've obtained are the following:
1 proc -> 15.50 sec; 2 proc -> 8.10 sec; 4 proc -> 4.42 sec; 8 proc -> 2.81 sec; 12 proc -> 2.43 sec
Like you see, the improvement decreases the more threads I run.
Here's the code:
allocate(SUM(PUNTOST,PUNTOSP,NUM_DATA,2))
allocate(SUMATORIO(1))
allocate(SUMATORIO(1)%REGION(REGIONS))
DO i=1,REGIONES
allocate(SUMATORIO(1)%REGION(i)%VALOR(2,PUNTOSP,PUNTOST))
SUMATORIO(1)%REGION(i)%VALOR= cmplx(0.0,0.0)
END DO
allocate(valor_aux(2,PUNTOSP,PUNTOST))
!...
call SYSTEM_CLOCK(counti,count_rate)
!$OMP PARALLEL NUM_THREADS(THREADS) DEFAULT(PRIVATE) FIRSTPRIVATE(REGIONS) &
!$OMP SHARED(SUMATORIO,SUM,PP,VEC_1,VEC_2,IDENT,TIPO,MUESTRA,PUNTOST,PUNTOSP)
!$OMP DO SCHEDULE(DYNAMIC,8)
DO i=1,REGIONS
INDICE=VEC_1(i)
valor_aux = cmplx(0.0,0.0)
DO j=1,VEC_2(i)
ii=IDENT(INDICE+1)
INDICE=INDICE+1
IF(TIPO(ii).ne.4) THEN
j1=MUESTRA(ii)
DO I1=1,PUNTOST
DO I2=1,PUNTOSP
valor_aux(1,I2,I1)=valor_aux(1,I2,I1)+SUM(I1,I2,J1,1)*PP(ii)
valor_aux(2,I2,I1)=valor_aux(2,I2,I1)+SUM(I1,I2,J1,2)*PP(ii)
END DO
END DO
END IF
END DO
SUMATORIO(1)%REGION(i)%VALOR= valor_aux
END DO
!$OMP END DO
!$OMP END PARALLEL
call SYSTEM_CLOCK(countf)
dt=REAL(countf-counti)/REAL(count_rate)
write(*,*)'FASE_1: Time: ',dt,'seconds'
Some points to know:
All data types are COMPLEX, except for loop vectors
NUM_DATA = 14000000
REGIONS = 1000000
Values contained in VEC_2 are between 10 and 20
PUNTOST = 21
PUNTOSP = 20
All allocated memory consume about 60 Gb of RAM
I've tried to change the dimensions of the matrixes to evade excesive memory caching (SUM(2,PUNTOSP,PUNTOST,NUM_DATA) for example) but this is the way in which I've obtained the best performance (I don't know the reason because in most of documents I've read they say that you have to try to make memory access be "sequential" to make the CPU brings the least amount of memory to cachee).
Also I've changed memory alignment to 32, 64 and 128 bytes but It didn't improve nothing.
Also I've changed the SCHEDULE option to STATIC with different chunk sizes and DYNAMIC with different chunk sizes but the results are the same or worse.
Do you have some ideas that I could use to improve the performance when using 8 or more cores?
Thank you so much for your attention and help.

Dividing by 6 the cpu time using 12 processors is rather good. In my applications I get rarely more than 4 or 5 (but it always remains sequential parts which is possibly the reason of that).
You could try the option collapse allowing to merge two loops together... But I don't know whether this is possible in your case because they are conditions to fulfill (for instance no instruction between the two loops).

While working on multidimensional arrays in Fortran, the leftmost index should change the fastest. You could try to change the order of the indices of valor_aux and SUM to
valor_aux(PUNTOSP, PUNTOST, 2)
SUM(PUNTOSP, PUNTOST, NUM_DATA, 2)
Additionally, you should always mind Amdahl's law. There is always some overhead, which yields additional speedup impossible.
Also: In your two innermost loops, the factor PP(ii) doesn't change. You should try to apply it after these loops (except, you know, that you are using FMA). And these loops are only a SUM of many values. You should try the intrinsic function SUM to remove these loops. Both things could require a massive redesign of your loops.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string