Performance degradation due to loading a shared library with thread local storage - multithreading

I am writing a Python wrapper around a large Fortran program, exposed as a Python module via pybind11. The Fortran program is a large simulation tool that uses OpenMP for multithreading. My initial work was to reproduce the behavior of the Fortran executable from a Python function. That yielded (as expected) exactly the same results and the same performance. But when I started to add more functions, I observed a large performance degradation (about 50% to 100% longer runtimes).
Tracking the cause in pybind11
I could track it down to a call of the pybind11 macro PYBIND11_NUMPY_DTYPE, which internally imports the numpy module numpy.core._multiarray_umath. I could reproduce the performance degradation with the following code:
import ctypes
import time
# The Fortran code, compiled to a shared library; it exports a subroutine modulemain that resembles the main program.
fcode = ctypes.CDLL("./libfcode.so")
# Merely importing the following library already results in worse performance of the Fortran code.
import numpy.core._multiarray_umath
t = time.time()
fcode.modulemain()
print("runtime: ", time.time()-t)
Tracking the cause in numpy
After finding that the cause of my bad performance was merely importing the numpy.core._multiarray_umath library, I dug further into it. Ultimately I could track it down to two lines in that library, where two variables with thread local storage are defined.
// from numpy 1.21.5, numpy/core/src/multiarray/multiarraymodule.c:4011
static NPY_TLS int sigint_buf_init = 0;
static NPY_TLS NPY_SIGJMP_BUF _NPY_SIGINT_BUF;
where NPY_TLS is defined as
#define NPY_TLS __thread
So the inclusion of a shared object with __thread TLS is the root cause of my performance degradation. This leads me to my two questions:
Why?
Is there any way to prevent it? Not using PYBIND11_NUMPY_DTYPE is not an option, as loading the numpy library after my module triggers the problem as well!
Minimal working example
My problem arose in a large and heavy Fortran code that I wanted to export to Python via pybind11. But in the end it boils down to using OpenMP thread local storage and then loading, in the Python interpreter, a library that exports a variable with __thread thread local storage. I could create a minimal working example that reproduces the behavior.
The worker program work.f90
module data
integer, parameter :: N = 10000
real :: X(1:N)
!$omp threadprivate(X)
end module
subroutine work() bind(C, name="worker")
  use data, only: X, N
  integer :: i, j
  !$omp parallel
  X(1) = 0.131
  do i = 2, N
    do j = 1, i-1
      X(i) = X(i) + 0.431*sin(X(i-1))
    end do
  end do
  !$omp end parallel
end subroutine work
The bad library tl.c
__thread int badVariable = 3;
A Python script that shows the effect, run.py
import ctypes
import time
work = ctypes.CDLL("./libwork.so")
# first worker run without loaded libtl.so. Good performance!
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
# load the bad library
bad = ctypes.CDLL("./libtl.so")
# second worker with degraded performance
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
The Makefile
FLAGS = -fPIC -shared
all: libwork.so libtl.so
libwork.so: work.f90
	gfortran-11 $(FLAGS) work.f90 -fopenmp -o $@
libtl.so: tl.c
	gcc-11 $(FLAGS) tl.c -o $@
The worker is so simple that enabling optimization will hide the effect. I guess it is the call to access the thread local storage area, which can easily be optimized out here. But in a real program, the effect is there even with optimization.
Setup
I have the problem on an Ubuntu 22.04 LTS computer with an x86 CPU (Xeon 8280M). gcc is Ubuntu 11.3.0-1ubuntu1~22.04 (I tried others down to 7.5.0 with the same effect). Python is version 3.10.6.
The problem is not Fortran specific; I can easily write a worker in plain C with the same effect. I also tried this on a Raspberry Pi with the same effect (ARM, GCC 8.3.0, Python 2.7.16).

Related

Python keeps running for 10+mins (after last statement in program) when there is huge (33GB) data structure in memory (nothing in swap)

I need to parse a huge gz file (about ~10GB compressed, ~100GB uncompressed). The code creates a data structure ('data_struct') in memory. I am running on a machine with an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz with 16 CPUs and plenty of RAM (i.e. 200+ GB), running CentOS-6.9. I have implemented this using a class in Python 3.6.3 (CPython) as shown below:
import subprocess

class my_class():
    def __init__(self):
        cmd = 'gunzip -c huge-file.gz'
        self.process = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
        self.data_struct = dict()
    def populate_struct(self):
        for line in self.process.stdout:
            <populate the self.data_struct dictionary>
    def __del__(self):
        self.process.wait()
        #del self.data_struct # presence/absence of this statement decreases/increases runtime respectively
#================End of my_class===================

def main():
    my_object = my_class()
    my_object.populate_struct()
    print('~~~~ Finished populate_struct() ~~~~') # last statement in my program.
    ## Python keeps running at 100% past the previous statement for 10+mins

if __name__ == '__main__':
    main()
#================End of Main=======================
The resident memory consumption of my data_struct (RAM only, no swap) is about ~33GB. I used $ top to find the PID of the Python process and traced it using $ strace -p <PID> -o <out_file> (to see what Python is doing). While it is executing populate_struct(), I can see in the out_file of strace that Python is using calls like mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b0684160000 to create data_struct. While Python was running past the last print() statement, I found that it was issuing only munmap() operations, as shown below:
munmap(0x2b3c75375000, 41947136) = 0
munmap(0x2b3c73374000, 33558528) = 0
munmap(0x2b4015d2a000, 262144) = 0
munmap(0x2b4015cea000, 262144) = 0
munmap(0x2b4015caa000, 262144) = 0
munmap(0x2b4015c6a000, 262144) = 0
munmap(0x2b4015c2a000, 262144) = 0
munmap(0x2b4015bea000, 262144) = 0
munmap(0x2b4015baa000, 262144) = 0
...
...
Python keeps running for anywhere between 10 and 12 minutes after the last print() statement. An observation is that if I have a del self.data_struct statement in the __del__() method, then it takes only 2 minutes. I have done these experiments multiple times, and the runtime decreases/increases with the presence/absence of del self.data_struct in __del__().
My questions:
My understanding is that Python is doing cleanup work using munmap(), but unlike Python, other languages like Perl release memory and exit the program immediately. Am I doing it right by implementing it as shown above? Is there a way to tell Python to avoid this munmap()?
Why does it take 10+ minutes to clean up if there is no del self.data_struct statement in __del__(), and only 2 minutes if there is one?
Is there a way to speed up the cleanup work, i.e. the munmap() calls?
Is there a way to exit the program immediately without the cleanup work?
Other thoughts/suggestions about tackling this problem are appreciated.
Please try a more recent version of Python (at least 3.8)? This shows several signs of being a mild(!) form of a worst-case quadratic-time algorithm in CPython's object deallocator, which was rewritten here (and note that the issue linked to here in turn contains a link to an older StackOverflow post with more details):
https://bugs.python.org/issue37029
Some glosses
If my guess is right, the amount of memory isn't particularly important - it's instead the sheer number of distinct Python objects being managed by CPython's "small object allocator" (obmalloc.c), combined with "bad luck" in the order in which their memory is released.
When that code was first written, RAM wasn't big enough to hold millions of Python objects, so nobody noticed that one particular part of deallocation logic could take time quadratic in the number of allocated "arenas" (details aren't really helpful, but "arenas" are the granularity at which system mmap() and munmap() calls are made - 256 KiB chunks).
It's not those mapping calls that are consuming mounds of time, and any decent implementation of any language using OS memory mapping facilities will eventually call munmap() enough times to release the OS resources consumed by its mmap() calls.
So that's a red herring. munmap() is being called many times simply because you allocated many objects, which required many mmap() calls.
There isn't any crisp or easy way to explain exactly when the problem shows up. See "bad luck" above ;-) The relevant code was rewritten for CPython 3.8 to be worst-case linear time instead, which gave a factor of ~250 speedup for the specific program that triggered the issue report (see the link already given).
As a comment noted, you can exit your program immediately at any time by invoking os._exit(), but the leading underscore is meant to scare you off: "immediately" means "immediately". No cleanups of any kind are performed. For example, the __del__ method in your class? Skipped. __del__ is run as a side effect of deallocation, but if you actually "immediately release memory and exit the program" then no destructors of any kind are run, nor any handlers registered with the atexit module, etc etc. It's as drastic as a program dying, e.g., with a segfault.
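As a minimal sketch of that escape hatch (the dict comprehension here is just a hypothetical stand-in for the real 33 GB structure), os._exit() terminates the process right after the final print, skipping the deallocation sweep entirely:
import os

def main():
    data_struct = {i: str(i) for i in range(10_000_000)}  # stand-in for the huge structure
    print('~~~~ Finished populate ~~~~')
    os._exit(0)  # exit immediately: no deallocation sweep, no __del__, no atexit handlers

if __name__ == '__main__':
    main()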

How to use child kernels (CUDA dynamic parallelism) using PyCUDA

My Python code has a GPU kernel function which is called multiple times in a for loop from the host, like this:
for i in range(n):
    gpu_kernel_func(blocksize, grid)
Since this function call requires communication between the host and the GPU device multiple times, which is not efficient, I want to restructure it as
gpu_kernel_function(){
    for(){
        computation;
    }
}
But this requires an extra step to make sure all the blocks in the grid are in sync. With dynamic parallelism, calling a dummy child kernel should ensure that every thread (in the whole grid) finishes that child kernel before the code continues running. So I defined another kernel just like gpu_kernel_function and I tried this:
GPUcode = '''
__global__ void gpu_kernel_function() { ... }
__global__ void dummy_child_kernel() { ... }
'''
gpu_kernel_function(){
    for(){
        computation;
    }
    dummy_child_kernel<<<1,1>>>();
}
But I am getting this error: "nvcc fatal : Option '--cubin (-cubin)' is not allowed when compiling for a virtual compute architecture"
I am using a Tesla P100 (compute capability 6.0), Python 3.5, CUDA 8.0.44. I am compiling my SourceModule like this:
mod = SourceModule(GPUcode, options=['-rdc=true' ,'-lcudart','-lcudadevrt','--machine=64'],arch='compute_60' )
I tried compute_35 too, which gives the same error.
The error message is explicitly telling you what the issue is. compute_60 is a virtual architecture. You can't statically compile virtual architectures to machine code. They are intended for producing PTX (virtual machine assembler) for JIT translation to machine code by the runtime. PyCUDA compiles code to a binary payload ("cubin") using the CUDA toolchain and then loads it via the driver API into the CUDA context. Thus the error.
You can fix the error by specifying a valid physical GPU target architecture. So you should modify the source module constructor call to something like this:
mod = SourceModule(GPUcode,
options=['-rdc=true','-lcudart','-lcudadevrt','--machine=64'],
arch='sm_60' )
This should fix the compiler error.
However, note that using dynamic parallelism requires device code linkage, and I am 99% sure that PyCUDA still doesn't support this, so you likely won't be able to do what you are asking about via a SourceModule. You could link your own cubin by hand using the compiler outside of PyCUDA and then load that cubin inside PyCUDA. You will find many examples of how to compile dynamic parallelism correctly if you search for them.
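If you do go the hand-linked route, loading the resulting binary from Python could look roughly like this (a sketch only: the file name kernel.cubin, the kernel name, and the block/grid shapes are placeholders, and the cubin is assumed to have been compiled and device-linked with nvcc outside of PyCUDA):
import pycuda.autoinit          # creates a context on the default device
import pycuda.driver as drv

# Load a cubin that was built and device-linked outside of PyCUDA
mod = drv.module_from_file("kernel.cubin")
gpu_kernel_function = mod.get_function("gpu_kernel_function")

# Launch it as usual; block/grid shapes here are just placeholders
gpu_kernel_function(block=(256, 1, 1), grid=(4, 1))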

Prange slowing down Cython loop

Consider two ways of calculating random numbers: one using a single thread and one multithreaded, using Cython's prange with OpenMP:
from libc.stdlib cimport rand

def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1
and
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand

def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1
Function rnd_test is first compiled with the following setup.py
from distutils.core import setup
from Cython.Build import cythonize
setup(
name = 'Hello world app',
ext_modules = cythonize("cython_test.pyx"),
)
rnd_test(100_000_000) runs in 0.7s.
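For reference, a sketch of how the build-and-time step is driven (assuming the extension module is named cython_test, matching the cythonize() call above):
# Build the extension in place first:  python setup.py build_ext --inplace
import time
from cython_test import rnd_test

t = time.time()
rnd_test(100_000_000)
print("single-threaded:", time.time() - t, "s")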
Then, rnd_test_par is compiled with the following setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
ext_modules = [
Extension(
"cython_test_openmp",
["cython_test_openmp.pyx"],
extra_compile_args=["-O3", '-fopenmp'],
extra_link_args=['-fopenmp'],
)
]
setup(
name='hello-parallel-world',
ext_modules=cythonize(ext_modules),
)
rnd_test_par(100_000_000) runs in 10s!!!
Similar results are obtained using cython within ipython:
%%cython
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand
def rnd_test(long size1):
    cdef long i
    for i in range(size1):
        rand()
    return 1
%%timeit
rnd_test(100_000_000)
1 loop, best of 3: 1.5 s per loop
and
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
import cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport rand
def rnd_test_par(long size1):
    cdef long i
    with nogil, parallel():
        for i in prange(size1, schedule='static'):
            rand()
    return 1
%%timeit
rnd_test_par(100_000_000)
1 loop, best of 3: 8.42 s per loop
What am I doing wrong? I am completely new to Cython; this is my second time using it. I had a good experience last time, so I decided to use it for a project with Monte Carlo simulation (hence the use of rand).
Is this expected? Having read all the documentation, I think prange should work well in an embarrassingly parallel case like this. I don't understand why it fails to speed up the loop and even makes it so much slower.
Some additional information:
I am running Python 3.6 and Cython 0.26.
gcc version is "gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
CPU usage confirms that the parallel version is actually using many cores (90% vs 25% in the serial case)
I appreciate any help you can provide. I tried first with numba and it did speed up the calculation but it has other problems that make me want to avoid it. I'd like Cython to work in this case.
Thanks!!!
With DavidW's useful feedback and links, I now have a multithreaded solution for random number generation.
However, the time savings over the single-threaded (vectorized) NumPy solution are not that massive. The NumPy approach generates 100 million numbers (5GB in memory) in 1.2s versus 0.7s for the multithreaded approach. Given the increased complexity (using C++ libraries, for example), I wonder if it's worth it. Maybe I will leave the random number generation single-threaded and work on parallelizing the calculations that follow this step.
The exercise is, however, very useful for understanding the problems of random number generators. Ultimately, I'd like to have a framework that could work in a distributed environment, and I can see now that the challenge would be even larger with regard to the random number generator, since generators essentially have a state that cannot be ignored.
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11
import cython
cimport numpy as np
import numpy as np
from cython.parallel cimport parallel, prange, threadid
cimport openmp

cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass mt19937:
        mt19937() # we need to define this constructor to stack allocate classes in Cython
        mt19937(unsigned int seed) # not worrying about matching the exact int type for seed
    cdef cppclass uniform_real_distribution[T]:
        uniform_real_distribution()
        uniform_real_distribution(T a, T b)
        T operator()(mt19937 gen) # ignore the possibility of using other classes for "gen"

@cython.boundscheck(False)
@cython.wraparound(False)
def test_rnd_par(long size):
    cdef:
        mt19937 gen
        uniform_real_distribution[double] dist = uniform_real_distribution[double](0.0,1.0)
        narr = np.empty(size, dtype=np.dtype("double"))
        double [:] narr_view = narr
        long i
    with nogil, parallel():
        gen = mt19937(openmp.omp_get_thread_num())
        for i in prange(size, schedule='static'):
            narr_view[i] = dist(gen)
    return narr
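For reference, the single-threaded, vectorized NumPy baseline mentioned above can be timed with a small snippet like this (timings will of course vary by machine):
import time
import numpy as np

t = time.time()
arr = np.random.random(100_000_000)   # 100 million uniform doubles in one call
print("numpy, single thread:", time.time() - t, "s")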
I would like to note two things that might be worth your consideration:
A: If you take a look at the implementation of rand() in glibc, you will see that using rand() in a multi-threaded program leads to unspecified behavior: the produced numbers are always the same (assuming we have the same seed), but you cannot say which number will be used by which thread due to possible race conditions. There is only one common state which is shared between all threads, and it needs to be protected by a lock, otherwise even worse things could happen:
long int
__random ()
{
int32_t retval;
__libc_lock_lock (lock);
(void) __random_r (&unsafe_state, &retval);
__libc_lock_unlock (lock);
return retval;
}
From this code a possible workaround becomes clear, if we are not allowed to use C++11: every thread could have its own seed and we could use the rand_r() function.
This lock is the reason you cannot see any speed-up with the original version.
B: Why don't you see more speed-up with your C++11 solution? You produce 5GB of data and write it to memory - it is a pretty memory-bound task. With one thread working, the memory bandwidth is enough to transport the created data, and the bottleneck is the calculation of the next random number. With two threads there is twice as much data but no more memory bandwidth. So there will be a number of threads at which the memory bandwidth becomes the bottleneck, and you will not be able to achieve any speed-up by adding more threads/cores.
So is there no gain in parallelizing the random number generation? The problem is not the random number generation itself, but the amount of data written to memory: if each created random number is consumed by the thread that created it, without storing it in RAM, it is a much better solution to parallelize than to produce the numbers in a single thread and distribute them:
You don't have to write these numbers to RAM.
You don't have to read these numbers from RAM.
You calculate them faster than with a single thread.

How can I compile C code to get a bare-metal skeleton of a minimal RISC-V assembly program?

I have the following simple C code:
void main(){
int A = 333;
int B=244;
int sum;
sum = A + B;
}
When I compile this with
$riscv64-unknown-elf-gcc code.c -o code.o
If I want to see the assembly code I use
$riscv64-unknown-elf-objdump -d code.o
But when I explore the assembly code, I see that this generates a lot of code which I assume is for proxy kernel support (I am a newbie to RISC-V). However, I do not want this code to have proxy kernel support, because the idea is to implement only this simple C code on an FPGA.
I read that RISC-V provides three types of compilation: bare-metal mode, newlib proxy kernel, and RISC-V Linux. According to my research so far, the kind of compilation that I should do is bare-metal mode. This is because I want a minimal assembly program without support for an operating system or proxy kernel. Assembly functions such as system calls are not required.
However, I have not yet been able to find out how I can compile C code to get a skeleton of a minimal RISC-V assembly program. How can I compile the C code above in bare-metal mode, to get a minimal RISC-V assembly skeleton?
Warning: this answer is somewhat out-of-date as of the latest RISC-V Privileged Spec v1.9, which includes the removal of the tohost Control/Status Register (CSR), which was a part of the non-standard Host-Target Interface (HTIF) which has since been removed. The current (as of 2016 Sep) riscv-tests instead perform a memory-mapped store to a tohost memory location, which in a tethered environment is monitored by the front-end server.
If you really and truly need/want to run RISC-V code bare-metal, then here are the instructions to do so. You lose a bunch of useful stuff, like printf or FP-trap software emulation, which the riscv-pk (proxy kernel) provides.
First things first - Spike boots up at 0x200. As Spike is the golden ISA simulator model, your core should also boot up at 0x200.
(cough, as of 2015 Jul 13, the "master" branch of riscv-tools (https://github.com/riscv/riscv-tools) is using an older pre-v1.7 Privileged ISA, and thus starts at 0x2000. This post will assume you are using v1.7+, which may require using the "new_privileged_isa" branch of riscv-tools).
So when you disassemble your bare-metal program, it better
start at 0x200!!! If you want to run it on top of the proxy-kernel, it
better start at 0x10000 (and if Linux, it’s something even larger…).
Now, if you want to run bare metal, you’re forcing yourself to write up the
processor boot code. Yuck. But let’s punt on that and pretend that’s not
necessary.
(You can also look into riscv-tests/env/p, for the “virtual machine”
description for a physically addressed machine. You’ll find the linker script
you need and some macros.h to describe some initial setup code. Or better
yet, in riscv-tests/benchmarks/common.crt.S).
Anyways, armed with the above (confusing) knowledge, let’s throw that all
away and start from scratch ourselves...
hello.s:
.align 6
.globl _start
_start:
# screw boot code, we're going minimalist
# mtohost is the CSR in machine mode
csrw mtohost, 1;
1:
j 1b
and link.ld:
OUTPUT_ARCH( "riscv" )
ENTRY( _start )
SECTIONS
{
/* text: test code section */
. = 0x200;
.text :
{
*(.text)
}
/* data: Initialized data segment */
.data :
{
*(.data)
}
/* End of uninitialized data segment */
_end = .;
}
Now to compile this…
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -Tlink.ld -o hello hello.s
This compiles to (riscv64-unknown-elf-objdump -d hello):
hello: file format elf64-littleriscv
Disassembly of section .text:
0000000000000200 <_start>:
200: 7810d073 csrwi tohost,1
204: 0000006f j 204 <_start+0x4>
And to run it:
spike hello
It’s a thing of beauty.
The link script places our code at 0x200. Spike will start at
0x200, and then write a 1 to the control/status register
“tohost”, which tells Spike “stop running”. And then we spin on an address
(1: j 1b) until the front-end server has gotten the message and kills us.
It may be possible to ditch the linker script if you can figure out how to
tell the compiler to move <_start> to 0x200 on its own.
For other examples, you can peruse the following repositories:
The riscv-tests repository holds the RISC-V ISA tests that are very minimal
(https://github.com/riscv/riscv-tests).
This Makefile has the compiler options:
https://github.com/riscv/riscv-tests/blob/master/isa/Makefile
And many of the “virtual machine” description macros and linker scripts can
be found in riscv-tests/env (https://github.com/riscv/riscv-test-env).
You can take a look at the “simplest” test at (riscv-tests/isa/rv64ui-p-simple.dump).
And you can check out riscv-tests/benchmarks/common for start-up and support code for running bare-metal.
The "extra" code is put there by gcc and is the sort of stuff required for any program. The proxy kernel is designed to be the bare minimum amount of support required to run such things. Once your processor is working, I would recommend running things on top of pk rather than bare-metal.
In the meantime, if you want to look at simple assembly, I would recommend skipping the linking phase with '-c':
riscv64-unknown-elf-gcc code.c -c -o code.o
riscv64-unknown-elf-objdump -d code.o
For examples of running code without pk or linux, I would look at riscv-tests.
I'm surprised no one mentioned gcc -S, which skips assembling and linking altogether and outputs assembly code, albeit with a bunch of boilerplate; it may be convenient just to poke around.
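For example:
riscv64-unknown-elf-gcc -S code.c -o code.s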

Data corruption when threading Vector Statistical Library-Math Kernel Library

I've just parallelized a Fortran routine that simulates individuals' behavior, and I've had some problems when generating random numbers with the Vector Statistical Library (a library from the Math Kernel Library). The structure of the program is the following:
program example
...
!$omp parallel do num_threads(proc) default(none) private(...) shared(...)
do i=1,n
call firstroutine(...)
enddo
!$omp end parallel do
...
end program example
subroutine firstroutine
...
call secondroutine(...)
...
end subroutine
subroutine secondroutine
...
VSL calls
...
end subroutine
I use the Intel Fortran Compiler for the compilation with a makefile that looks as follows:
f90comp = ifort
libdir = /home
mklpath = /opt/intel/mkl/10.0.5.025/lib/32/
mklinclude = /opt/intel/mkl/10.0.5.025/include/
exec: Example.o Firstroutine.o Secondroutine.o
	$(f90comp) -O3 -fpscomp logicals -openmp -o aaa -L$(mklpath) -I$(mklinclude) Example.o Firstroutine.o Secondroutine.o -lmkl_ia32 -lguide -lpthread
Example.o: $(libdir)Example.f90
	$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Example.f90
Firstroutine.o: $(libdir)Firstroutine.f90
	$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Firstroutine.f90
Secondroutine.o: $(libdir)Secondroutine.f90
	$(f90comp) -O3 -fpscomp logicals -openmp -c -L$(mklpath) -I$(mklinclude) $(libdir)Secondroutine.f90 -lmkl_ia32 -lguide -lpthread
At compilation time everything works fine. When I run my program and generate variables with it, everything seems to work fine at first. However, from time to time (say once every 200-500 iterations), it generates crazy numbers for a couple of iterations and then runs normally again. I have not found any pattern to when this corruption happens.
Any idea why this is happening?
The random number code is either using a global variable internally, or all threads use the same generator. Eventually, two threads will try to update the same piece of memory at the same time and the result will be unpredictable.
So you must allocate one random number generator per thread.
Alternatively, protect the call to the random routine with a semaphore/lock.
I got the solution! I was modifying the generated pseudo-random numbers with some values taken from a file. From time to time, more than one thread tried to read from the same file, which produced the corruption. To solve this, I added an OMP critical section around the file access and it worked.
