Prange slowing down Cython loop - multithreading

Consider two ways of generating random numbers: one single-threaded and one multithreaded, using Cython's prange with OpenMP:
from libc.stdlib cimport rand

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1
and
from libc.stdlib cimport rand
from cython.parallel cimport parallel, prange

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1
Function rnd_test is first compiled with the following setup.py
from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = 'Hello world app',
    ext_modules = cythonize("cython_test.pyx"),
)
rnd_test(100_000_000) runs in 0.7s.
Then, rnd_test_par is compiled with the following setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "cython_test_openmp",
        ["cython_test_openmp.pyx"],
        extra_compile_args=["-O3", '-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]

setup(
    name='hello-parallel-world',
    ext_modules=cythonize(ext_modules),
)
rnd_test_par(100_000_000) runs in 10s!!!
Similar results are obtained using cython within ipython:
%%cython
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1
%%timeit
rnd_test(100_000_000)
1 loop, best of 3: 1.5 s per loop
and
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1
%%timeit
rnd_test_par(100_000_000)
1 loop, best of 3: 8.42 s per loop
What am I doing wrong? I am completely new to Cython; this is my second time using it. I had a good experience last time, so I decided to use it for a project with Monte Carlo simulation (hence the use of rand).
Is this expected? Having read all the documentation, I think prange should work well in an embarrassingly parallel case like this. I don't understand why it fails to speed up the loop and even makes it so much slower.
Some additional information:
I am running Python 3.6, Cython 0.26.
The gcc version is "gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609".
CPU usage confirms the parallel version is actually using many cores (90% vs. 25% in the serial case).
I appreciate any help you can provide. I tried Numba first and it did speed up the calculation, but it has other problems that make me want to avoid it. I'd like Cython to work in this case.
Thanks!!!

With DavidW's useful feedback and links, I have a multithreaded solution for random number generation.
However, the time savings over the single-threaded (vectorized) NumPy solution are not that large. The NumPy approach generates 100 million numbers (5 GB in memory) in 1.2 s versus 0.7 s for the multithreaded approach. Given the increased complexity (using C++ libraries, for example), I wonder if it's worth it. Maybe I will leave the random number generation single-threaded and work on parallelizing the calculations that follow this step.
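For reference, the single-threaded NumPy baseline I am comparing against is essentially just this (a sketch):

import numpy as np

narr = np.random.uniform(0.0, 1.0, size=100_000_000)  # vectorized, one thread, ~1.2 s here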
The exercise is, however, very useful for understanding the problems of random number generators. Ultimately, I'd like to have a framework that could work in a distributed environment, and I can see now that the challenge would be even larger with regard to the random number generator, since generators essentially have a state that cannot be ignored.
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11
import cython
cimport numpy as np
import numpy as np
from cython.parallel cimport parallel, prange, threadid
cimport openmp

cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass mt19937:
        mt19937()  # we need to define this constructor to stack allocate classes in Cython
        mt19937(unsigned int seed)  # not worrying about matching the exact int type for seed
    cdef cppclass uniform_real_distribution[T]:
        uniform_real_distribution()
        uniform_real_distribution(T a, T b)
        T operator()(mt19937 gen)  # ignore the possibility of using other classes for "gen"

@cython.boundscheck(False)
@cython.wraparound(False)
def test_rnd_par(long size):
    cdef:
        mt19937 gen
        uniform_real_distribution[double] dist = uniform_real_distribution[double](0.0, 1.0)
        narr = np.empty(size, dtype=np.dtype("double"))
        double [:] narr_view = narr
        long i
    with nogil, parallel():
        gen = mt19937(openmp.omp_get_thread_num())
        for i in prange(size, schedule='static'):
            narr_view[i] = dist(gen)
    return narr

I would like to note two things that might be worth your consideration:
A: If you take a look at the implementation of rand() in glibc, you will see that using rand() in a multi-threaded program leads to unspecified behavior: the produced numbers are always the same (assuming we have the same seed), but you cannot say which number will be used by which thread due to possible race conditions. There is only one common state shared between all threads, and it needs to be protected by a lock, otherwise even worse things could happen:
long int
__random ()
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}
From this code a possible workaround becomes clear if we are not allowed to use C++11: every thread could have its own seed and we could use the rand_r() function.
This lock is the reason you cannot see any speed-up with the original version.
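A minimal sketch of that rand_r() workaround in Cython (untested here; the function name is illustrative, and each thread is seeded from its OpenMP thread id), compiled with -fopenmp as above:

from cython.parallel cimport parallel, prange
cimport openmp

cdef extern from "stdlib.h" nogil:
    int rand_r(unsigned int *seedp)

def rnd_test_par_r(long size1):
    cdef long i
    cdef unsigned int seed
    with nogil, parallel():
        seed = openmp.omp_get_thread_num() + 1  # private per-thread seed, no shared state
        for i in prange(size1, schedule='static'):
            rand_r(&seed)
    return 1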
B: Why don't you see more speed-up with your C++11 solution? You produce 5 GB of data and write it to memory, so it is a pretty memory-bound task. If one thread is working, the memory bandwidth is enough to transport the created data, and the bottleneck is the calculation of the next random number. If there are two threads, there is twice as much data but no more memory bandwidth. So there will be a number of threads at which the memory bandwidth becomes the bottleneck, and you will not be able to achieve any speed-up by adding more threads/cores.
So is there no gain in parallelizing the random number generation? The problem is not the random number generation, but the amount of data written to memory: if the created random numbers are consumed by the same thread without being stored in RAM, parallelizing is a much better solution than producing the numbers with a single thread and distributing them (see the sketch after this list):
You don't have to write these numbers to RAM.
You don't have to read these numbers from RAM.
You calculate them faster than with a single thread.
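As an illustration, here is a sketch (hypothetical, reusing the mt19937/uniform_real_distribution/openmp declarations from the %%cython cell in the question's edit) where each thread consumes its own numbers immediately, e.g. a Monte Carlo estimate of pi, so nothing large is written to RAM:

def mc_pi(long n):
    cdef:
        mt19937 gen
        uniform_real_distribution[double] dist = uniform_real_distribution[double](0.0, 1.0)
        long i
        double x, y
        int tid
        int nthreads = openmp.omp_get_max_threads()
        hits_arr = np.zeros(nthreads, dtype=np.int64)  # one counter per thread, summed at the end
        np.int64_t[:] hits = hits_arr
    with nogil, parallel():
        tid = openmp.omp_get_thread_num()
        gen = mt19937(tid)                             # one generator per thread
        for i in prange(n, schedule='static'):
            x = dist(gen)
            y = dist(gen)
            if x * x + y * y <= 1.0:
                hits[tid] += 1                         # each thread writes only its own slot
    return 4.0 * hits_arr.sum() / n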


Performance degradation due to loading a shared library with thread local storage

I am writing a Python wrapper around a large Fortran program with pybind11, exposed as a Python module. The Fortran program is a large simulation tool that uses OpenMP for multithreading. My initial work was to reproduce the Fortran executable from a Python function. That yielded (as expected) exactly the same results and the same performance. But when I started to add more functions, I observed a large performance degradation (about 50% to 100% longer runtimes).
Tracking the cause in pybind11
I could track it down to a call of the pybind11 macro PYBIND11_NUMPY_DTYPE, which internally loads the numpy library numpy.core._multiarray_umath. I could reproduce the performance degradation with the following code:
import ctypes
import time
# This is the fortran code, compiled to a shared library and a subroutine modulemain, that resembles the main program.
fcode = ctypes.CDLL("./libfcode.so")
# Only loading the library results in a worse performance of the Fortran code.
import numpy.core._multiarray_umath
t = time.time()
fcode.modulemain()
print("runtime: ", time.time()-t)
Tracking the cause in numpy
After finding that the reason for my bad performance lies simply in loading the numpy.core._multiarray_umath library, I dug into it further. Ultimately I could track it down to two lines in that library, where two variables with thread local storage are defined.
// from numpy 1.21.5, numpy/core/src/multiarray/multiarraymodule.c:4011
static NPY_TLS int sigint_buf_init = 0;
static NPY_TLS NPY_SIGJMP_BUF _NPY_SIGINT_BUF;
where NPY_TLS is defined as
#define NPY_TLS __thread
So the inclusion of a shared object with __thread TLS is the root cause for my performance degradation. This leads me to my two questions:
Why?
Is there any way to prevent it? Not using PYBIND11_NUMPY_DTYPE is not an option, as loading the numpy library after my module will trigger the problem as well!
Minimal working example
My problem arose in a large, heavy Fortran code that I wanted to expose to Python via pybind11. In the end it boiled down to using OpenMP thread local storage and then loading, in the Python interpreter, a library that exports a variable with __thread thread local storage. I could create a minimal working example that reproduces the behavior.
The worker program work.f90
module data
    integer, parameter :: N = 10000
    real :: X(1:N)
    !$omp threadprivate(X)
end module

subroutine work() bind(C, name="worker")
    use data, only: X,N
    !$omp parallel
    X(1) = 0.131
    do i=2,N
        do j=1,i-1
            X(i) = X(i) + 0.431*sin(X(i-1))
        end do
    end do
    !$omp end parallel
end subroutine
The bad library tl.c
__thread int badVariable = 3;
A Python script that shows the effect, run.py
import ctypes
import time
work = ctypes.CDLL("./libwork.so")
# first worker run without loaded libtl.so. Good performance!
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
# load the bad library
bad = ctypes.CDLL("./libtl.so")
# second worker with degraded performance
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
The Makefile
FLAGS = -fPIC -shared

all: libwork.so libtl.so

libwork.so: work.f90
	gfortran-11 $(FLAGS) work.f90 -fopenmp -o $@

libtl.so: tl.c
	gcc-11 $(FLAGS) tl.c -o $@
The worker is so simple that enabling optimization will hide the effect. I guess it could be a call to access the thread local storage area that is easily optimized out here. But in a real program, the effect is there even with optimization.
Setup
I have the problem on an Ubuntu 22.04 LTS computer with an x86 CPU (Xeon 8280M). gcc is Ubuntu 11.3.0-1ubuntu1~22.04 (I tried others down to 7.5.0 with the same effect). Python is version 3.10.6.
The problem is not Fortran-specific; I can easily write a worker in plain C with the same effect. I also tried this on a Raspberry Pi with the same result (ARM, GCC 8.3.0, Python 2.7.16).

What is the fastest way to sum 2 matrices using Numba?

I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy.
Here is my code:
import numpy as np
from numba import njit, vectorize, prange, float64
import timeit
import time

# function 1:
def sum_numpy(A, B):
    return A+B

# function 2:
sum_numba_simple = njit(cache=True, fastmath=True)(sum_numpy)

# function 3:
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A, B):
    return A+B

# function 4:
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A, B):
    n = A.shape[0]
    m = A.shape[1]
    C = np.empty((n, m), A.dtype)
    for i in prange(n):
        for j in prange(m):
            C[i,j] = A[i,j] + B[i,j]
    return C
#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3))
t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")
Here are the results:
function 1 (sum_numpy): t1= 0.1655790419999903
function 2 (sum_numba_simple): t2= 0.3019776669998464
function 3 (sum_numba_vectorized): t3= 0.16486266700030683
function 4 (sum_numba_loop): t4= 0.1862256660001549
As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation ?
Your code is bound by page faults (see here, here and there for more information about this). Page faults happen because the output array is newly allocated on every call. A solution is to preallocate it and then write into it, so that pages do not have to be remapped in physical memory. np.add(A, B, out=C) does this, as indicated by @August in the comments. Another solution could be to adapt the standard allocator so that it does not give the memory back to the OS, at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that, for example).
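For example, a minimal sketch of the preallocation idea (reusing the arrays from the question):

C = np.empty_like(A)  # allocated (and page-faulted) once, outside the timed region
np.add(A, B, out=C)   # subsequent calls write into already-mapped pages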
There is another issue on most platforms (especially x86 ones): cache-line write allocations of write-back caches are expensive during writes. The typical solution to avoid this is to use non-temporal stores (if available on the target processor, which is the case on x86-64 but maybe not on others). That being said, neither Numpy nor Numba is able to do that yet. For Numba, I filed an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tend not to generate such instructions because they can be detrimental to performance when arrays fit in cache, and compilers do not know the size of the array at compile time (they could generate specific code when they can evaluate the amount of data computed, but this is not easy and can slow down other code). AFAIK, the only possible way to fix this is to write C code and use low-level instructions or compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown of up to 33%.
Using multiple threads does not always make memory-bound code faster. In fact, it generally barely scales, because using more cores does not speed up the execution when the RAM is already saturated. Only a few cores are generally required to saturate the RAM on most platforms. Page faults can benefit from using multiple cores depending on the target system (Linux handles them in parallel quite well, Windows generally does not scale well, IDK about MacOS).
Finally, there is another issue: the code is not vectorized (at least not on my machine, while it can be). One solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input arrays should also be specified so the compiler can generate fast SIMD code. Here is the resulting Numba code:
@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
    n, m = A.shape
    assert C.shape == A.shape
    A_flat = A.reshape(n*m)
    B_flat = B.reshape(n*m)
    C_flat = C.reshape(n*m)
    for i in prange(n*m):
        C_flat[i] = A_flat[i] + B_flat[i]
    return C
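It can be called like this, with the output preallocated once as discussed above:

C = np.empty_like(A)          # preallocate the output once
sum_numba_fast_loop(A, B, C)  # repeated calls reuse the same pages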
Here are results on my 6-core i5-9600KF processor with a ~42 GiB/s RAM:
sum_numpy: 0.642 s 13.9 GiB/s
sum_numba_simple: 0.851 s 10.5 GiB/s
sum_numba_vectorized: 0.639 s 14.0 GiB/s
sum_numba_loop serial: 0.759 s 11.8 GiB/s
sum_numba_loop parallel: 0.472 s 18.9 GiB/s
Numpy "np.add(A, B, out=C)": 0.281 s 31.8 GiB/s <----
Numba fast: 0.288 s 31.0 GiB/s <----
Optimal time: 0.209 s 32.0 GiB/s
The Numba code and the Numpy one saturate my RAM. Using more cores does not help (in fact, it is a bit slower, certainly due to contention on the memory controller). Both are sub-optimal since they do not use non-temporal store instructions, which can prevent cache-line write allocations (that cause data to be fetched from RAM before being written back). The optimal time is the one expected when using such instructions. Note that it is expected to reach only 65-80% of the RAM bandwidth because of mixed RAM reads/writes. Indeed, interleaving reads and writes causes low-level overheads that prevent the RAM from being saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).

Convert function to exploit parallelization of the GPU

I have a function that uses values stored in one array to operate on another array. This behaves similarly to the numpy.histogram function. For example:
import numpy as np
from numba import jit

@jit(nopython=True)
def array_func(x, y, output_counts, output_weights):
    for row in range(x.size):
        col = int(x[row] * 10)
        output_counts[col] += 1
        output_weights[col] += y[row]
    return (output_counts, output_weights)

# in the current code these arrays exist as pytorch tensors
# on the GPU and get converted to numpy arrays on the CPU before
# being passed to "array_func"
x = np.random.randint(0, 11, (1000)) / 10
y = np.random.randint(0, 100, (10000))
output_counts, output_weights = array_func(x, y, np.zeros(y.size), np.zeros(y.size))
While this works for arrays it does not work for torch tensors that are on the GPU. This is close to what histogram functions do, but I also need the summation of binned values (i.e., the output_weights array/tensor). The current function requires me to continually pass the data from GPU to CPU, followed by the CPU function being run in series.
Can this function be converted to run in parallel on the GPU?
EDIT:
The challenge is caused by the following line:
output_weights[col] += y[row]
If it weren't for that line I could just use the torch.histc function.
Here's my thought: GPUs are "fast" because they have hundreds/thousands of threads available and can run parts of a big job (or many smaller jobs) on these threads. However, if I convert the function above to work on torch tensors, then there is no benefit to running on the GPU (it actually kills the performance). I wonder if there is a way I can break up x so each value gets sent to a different thread (similar to how apply_async does within multiprocessing)?
I'm open to other options.
In its current form the function is fast, but the GPU-to-CPU data transfer is killing me.
Your computation is indeed a general histogram operation. There are multiple ways to compute this on a GPU, depending on the number of items to scan, the size of the histogram and the distribution of the values.
For example, one solution consists of building local histograms in separate kernel blocks and then performing a reduction. However, this solution is not well suited to your case since len(x) / len(y) is relatively small.
An alternative solution is to perform atomic updates of the histogram in parallel. This solution only scales well if there are few atomic conflicts, which depends on the actual input data. Indeed, if all values of x are equal, then all updates will be serialized, which is slower than doing the accumulation sequentially on a CPU (due to the overhead of the atomic operations). Such a case is frequent with small histograms, but assuming the distribution is close to uniform, this can be fine.
This operation can be done with Numba using CUDA (targeting Nvidia GPUs). Here is an example of a kernel solving your problem:
from numba import cuda

@cuda.jit
def array_func(x, y, output_counts, output_weights):
    tx = cuda.threadIdx.x  # Thread id in a 1D block
    ty = cuda.blockIdx.x   # Block id in a 1D grid
    bw = cuda.blockDim.x   # Block width, i.e. number of threads per block
    pos = tx + ty * bw     # Compute flattened index inside the array
    if pos < x.size:
        col = int(x[pos] * 10)
        cuda.atomic.add(output_counts, col, 1)
        cuda.atomic.add(output_weights, col, y[pos])
For more information about how to run this kernel, please read the documentation. Note that the arrays output_counts and output_weights can be created directly on the GPU so as to avoid transfers. x and y should be on the GPU for better performance (otherwise a CPU reduction will certainly be faster). Also note that the kernel should be pretty fast, so the overhead of launching/waiting for it and allocating/freeing temporary arrays may be significant and possibly even slower than the kernel itself (but certainly faster than doing a round-trip transfer from/to the CPU so as to compute things on the CPU, assuming the data was already on the GPU). Note also that such atomic accesses are only fast on fairly recent Nvidia GPUs that have specific compute units for atomic operations.
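A sketch of how the kernel could be launched (the block size and the float64 cast of y are assumptions, mirroring the shapes used in the question; see the Numba CUDA documentation for details):

import numpy as np
from numba import cuda

threads_per_block = 256
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block

d_x = cuda.to_device(x)                     # ideally x and y already live on the GPU
d_y = cuda.to_device(y.astype(np.float64))  # cast so the atomic add matches output_weights' dtype
d_counts = cuda.to_device(np.zeros(y.size))     # zero-initialized outputs, transferred once
d_weights = cuda.to_device(np.zeros(y.size))

array_func[blocks_per_grid, threads_per_block](d_x, d_y, d_counts, d_weights)

output_counts = d_counts.copy_to_host()     # only the (small) results come back to the CPU
output_weights = d_weights.copy_to_host()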

cython ctypedef large double arrays lead to segfault on Ubuntu 18.04

ctypedef struct ReturnRows:
    double[50000] v1
    double[50000] v2
    double[50000] v3
    double[50000] v4
works, but
ctypedef struct ReturnRows:
    double[100000] v1
    double[100000] v2
    double[100000] v3
    double[100000] v4
fails with Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
It does not make sense to me, because the upper limit should be close to the available limit of the system dedicated to that processing task. Is there an upper limit set in some way?
Here is my builder:
from distutils.core import setup
import numpy as np
from distutils.core import setup, Extension
from Cython.Build import cythonize

file_names = ['example_cy.pyx', 'enricher_cy.pyx']
for fn in file_names:
    print("cythonize %s" % fn)
    setup(ext_modules = cythonize(fn),
          author='CGi',
          author_email='hi@so.com',
          description='Utils for faster data enrichment.',
          packages=['distutils', 'distutils.command'],
          include_dirs=[np.get_include()])
From the question: How do I use the struct? I iterate over it, coming from a pandas dataframe:
cpdef ReturnRows cython_atrs(list v1, list v2, list v3, list v4):
    cdef ReturnRows s_ReturnRows # Allocate memory for the struct
    s_ReturnRows.v1 = [0] * 50000
    s_ReturnRows.v2 = [0] * 50000
    s_ReturnRows.v3 = [0] * 50000
    s_ReturnRows.v4 = [0] * 50000

    # tmp counters to keep track of the latest data averages and so on.
    cdef int lines_over_v1 = 0
    cdef double all_ranges = 0
    cdef int some_index = 0

    for i in range(len(v3)-1):
        # trs
        s_ReturnRows.v1[i] = (round(v2[i] - v3[i],2))
        # A lot more calculations, why I really need this loop.
As @ead suggested in the linked question, the solution is to allocate the variables on the heap rather than on the stack (as a function-local variable). The reason is that space on the stack is pretty limited (~8 MB on Linux), while the heap is (usually) whatever memory is available on your PC.
The linked question mainly refers to new/delete as the C++ way of doing so; while Cython code can use C++, C is more commonly used, and for that you use malloc/free. The Cython documentation covers this well, but to demonstrate it with the code in your question:
from libc.stdlib cimport malloc, free

# note some discussion about the return-value at the end of this answer...
def cython_atrs(list v1, list v2, list v3, list v4):
    cdef ReturnRows *s_ReturnRows  # pointer, not value
    s_ReturnRows = <ReturnRows*>malloc(sizeof(ReturnRows))
    try:
        # all your code goes here and remains the same...
    finally:
        free(s_ReturnRows)
You could also use a module-level global variable, but you probably don't want to do that.
Another option would be to use a cdef class instead of a struct:
cdef class ReturnRows:
    cdef double[50000] v1
    cdef double[50000] v2
    cdef double[50000] v3
    cdef double[50000] v4
This is automatically allocated on the heap, and memory is tracked by Python.
You could also use 2D Numpy arrays (or arrays from other Python libraries). These are also allocated on the heap (but that's hidden from you). The advantage is that Python keeps track of the memory, so you can't forget to free it. The second advantage is that you can easily vary the array size without recompilation. If you invent some particularly pointless microbenchmark that allocates lots of arrays and does nothing else, you may find a performance difference, but for most normal code you won't. Access through a typed memoryview should be as quick as a C pointer. You can find lots of questions comparing the speed and other features of these options, but really you should just pick the one you find easiest to write (with structs that may be C, maybe).
The return of ReturnRows from your function adds a complication (and makes your existing code dubious too, even when it doesn't crash). You should either write a cdef function and return a ReturnRows*, moving the deallocation to the calling function, or you should write a def function and return a valid Python object. That might push you towards Numpy arrays as a better solution, or maybe a cdef class.
What your current function will do is a huge and inefficient conversion of ReturnRows to a Python dictionary (containing Python lists for the arrays) whenever it is called from Python. This probably isn't what you want.
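If you go the Numpy route, a sketch (hypothetical; only v1 shown, sized to the input rather than a fixed 50000) of what the def function could look like:

import numpy as np

def cython_atrs(list v1, list v2, list v3, list v4):
    cdef Py_ssize_t n = len(v3)
    out_v1 = np.zeros(n)                 # heap-allocated; Python owns and frees this memory
    cdef double[:] out_v1_view = out_v1  # typed memoryview for C-speed element access
    cdef Py_ssize_t i
    for i in range(n - 1):
        out_v1_view[i] = round(v2[i] - v3[i], 2)
        # ...the rest of the calculations...
    return out_v1                        # an ordinary Python object, safe to return from a def function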

Does partial fit runs in parallel in sklearn.decomposition.IncrementalPCA?

I've followed Imanol Luengo's answer to build a partial fit and transform for sklearn.decomposition.IncrementalPCA. But for some reason, it looks like (from htop) it uses all CPU cores at maximum. I could find neither an n_jobs parameter nor anything related to multiprocessing. My question is: if this is the default behavior of these functions, how can I set the number of CPUs, and where can I find information about it? If not, obviously I am doing something wrong in previous sections of my code.
PS: I need to limit the number of CPU cores because using all cores on a shared server causes a lot of trouble with other people.
Additional information and debug code:
So, it has been a while and I still couldn't figure out the reason for this behavior or how to limit the number of CPU cores used at a time. I've decided to provide sample code to test it. Note that this code snippet is taken from sklearn's website; the only difference is that the dataset size is increased, so one can easily see the behavior.
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
import numpy as np

X, _ = load_digits(return_X_y=True)

# Copy-paste and increase the size of the dataset to see the behavior at htop.
for _ in range(8):
    X = np.vstack((X, X))
print(X.shape)

transformer = IncrementalPCA(n_components=7, batch_size=200)
transformer.partial_fit(X[:100, :])
X_transformed = transformer.fit_transform(X)
print(X_transformed.shape)
And the output is:
(460032, 64)
(460032, 7)
Process finished with exit code 0
And htop shows all CPU cores in use (screenshot omitted).
TL;DR: Solved the issue by setting the BLAS environment variables before importing numpy, or any library that imports numpy, with the code below. Detailed information can be found here.
Long story:
I was looking for a workaround to this problem in another post of mine, and I figured out this is not a fault of the scikit-learn implementation but rather of the BLAS library (specifically OpenBLAS) used by numpy, which in turn is used by sklearn's IncrementalPCA. OpenBLAS is set to use all available threads by default. Detailed information can be found here.
import os
os.environ["OMP_NUM_THREADS"] = "1"         # export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1"    # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1"         # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1"     # export NUMEXPR_NUM_THREADS=1
