Memory leak in the assignment using Intel Fortran compiler

Consider the following minimal working example:
module lib
  type FG_t
    real, allocatable :: g(:)
  contains
    procedure, private :: FG_Assign
    generic :: assignment(=) => FG_Assign
  end type
  interface operator(-)
    procedure FG_Sub
  end interface
contains
  elemental subroutine FG_Assign(this, that)
    class(FG_t), intent(inout) :: this
    class(FG_t), intent(in) :: that
    this%g = that%g
  end
  elemental type(FG_t) function FG_Sub(this, that)
    class(FG_t), intent(in) :: this
    real, intent(in) :: that
    FG_Sub = FG_t(this%g - that)
  end
end

program prog
  use lib
  type(FG_t) :: arr(1000), arr_(SIZE(arr))
  do i = 1, SIZE(arr)
    allocate(arr(i)%g(10))
  end do
  do i = 1, 100000
    arr_ = arr - 1.
  end do
end
When running the executable built from this code with ifx (2022.2.1), ifort (2021.7.1), nvfortran (22.9), or nagfor (7.1), memory usage grows rapidly (with a higher iteration count this can even crash the PC). A plot of memory usage versus time shows the steady growth (not reproduced here).
Using gfortran (11.1.0), or replacing elemental before FG_Assign with pure, fixes the problem with my version of the Intel compilers (but not with the Nvidia and NAG compilers). However, the elemental keyword is used for assignment in a similar context in the code of the Fortran stdlib.
Intel VTune Profiler shows that most of the memory is allocated at the line this%g = that%g, reached after the call to FG_Sub from the line arr_ = arr - 1. (the assignment in the main loop).
What is the reason for this compiler-dependent problem, and is there a way to avoid it?

Related

Performance degradation due to loading a shared library with thread local storage

I am writing a Python wrapper around a large Fortran program, exposed as a Python module via pybind11. The Fortran program is a large simulation tool that uses OpenMP for multithreading. My initial step was to reproduce the Fortran executable's behaviour from a Python function. That yielded (as expected) exactly the same results and the same performance. But when I started to add more functions, I observed a large performance degradation (about 50% to 100% longer runtimes).
Tracking the cause in pybind11
I tracked it down to a call of the pybind11 macro PYBIND11_NUMPY_DTYPE, which internally loads the numpy extension module numpy.core._multiarray_umath. I could reproduce the performance degradation with the following code:
import ctypes
import time
# The Fortran code, compiled into a shared library; modulemain is a subroutine that resembles the main program.
fcode = ctypes.CDLL("./libfcode.so")
# Merely loading the following numpy module already degrades the performance of the Fortran code.
import numpy.core._multiarray_umath
t = time.time()
fcode.modulemain()
print("runtime: ", time.time()-t)
Tracking the cause in numpy
After finding that the cause of the bad performance lies merely in loading the numpy.core._multiarray_umath library, I dug into it further. Ultimately I tracked it down to two lines in that library where two variables with thread-local storage are defined.
// from numpy 1.21.5, numpy/core/src/multiarray/multiarraymodule.c:4011
static NPY_TLS int sigint_buf_init = 0;
static NPY_TLS NPY_SIGJMP_BUF _NPY_SIGINT_BUF;
where NPY_TLS is defined as
#define NPY_TLS __thread
So loading a shared object that uses __thread TLS is the root cause of my performance degradation. This leads me to two questions:
Why?
Is there any way to prevent it? Not using PYBIND11_NUMPY_DTYPE is not an option, since loading the numpy library after my module would trigger the problem as well!
Minimal working example
My problem arose in a large, heavy Fortran code that I wanted to export to Python via pybind11. In the end it boils down to using OpenMP thread-local storage and then loading, in the Python interpreter, a library that exports a variable with __thread thread-local storage. I could create a minimal working example that reproduces the behavior.
The worker program work.f90
module data
  integer, parameter :: N = 10000
  real :: X(1:N)
  !$omp threadprivate(X)
end module

subroutine work() bind(C, name="worker")
  use data, only: X, N
  !$omp parallel
  X(1) = 0.131
  do i = 2, N
    do j = 1, i-1
      X(i) = X(i) + 0.431*sin(X(i-1))
    end do
  end do
  !$omp end parallel
end subroutine
The bad library tl.c
__thread int badVariable = 3;
A Python script that shows the effect, run.py:
import ctypes
import time
work = ctypes.CDLL("./libwork.so")
# first worker run without loaded libtl.so. Good performance!
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
# load the bad library
bad = ctypes.CDLL("./libtl.so")
# second worker with degraded performance
t = time.time()
work.worker()
print("TIME: ", time.time()-t)
The Makefile
FLAGS = -fPIC -shared

all: libwork.so libtl.so

libwork.so: work.f90
	gfortran-11 $(FLAGS) work.f90 -fopenmp -o $@

libtl.so: tl.c
	gcc-11 $(FLAGS) tl.c -o $@
The worker is so simple that enabling optimization hides the effect. I suspect the overhead comes from a call to access the thread-local storage area, which is easily optimized out here. But in a real program the effect is there even with optimization.
Setup
I see the problem on an Ubuntu 22.04 LTS machine with an x86 CPU (Xeon 8280M). gcc is Ubuntu 11.3.0-1ubuntu1~22.04 (I tried other versions down to 7.5.0 with the same effect). Python is version 3.10.6.
The problem is not Fortran-specific; I can easily write a worker in plain C with the same effect. I also tried this on a Raspberry Pi with the same result (ARM, GCC 8.3.0, Python 2.7.16).
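For illustration, a plain-C worker equivalent to work.f90 could look roughly like the sketch below; the file name work_c.c and the details are made up for this example and are not from the original post, but it uses an OpenMP threadprivate array just like the Fortran version:

/* work_c.c -- hypothetical plain-C analogue of work.f90.
 * Build: gcc-11 -fPIC -shared -fopenmp work_c.c -o libworkc.so
 */
#include <math.h>

#define N 10000

/* Thread-private working array, analogous to !$omp threadprivate(X). */
static float X[N];
#pragma omp threadprivate(X)

void worker(void)
{
    #pragma omp parallel
    {
        X[0] = 0.131f;
        for (int i = 1; i < N; ++i)
            for (int j = 0; j < i; ++j)
                X[i] = X[i] + 0.431f * sinf(X[i - 1]);
    }
}

Loading this through ctypes and calling worker() before and after loading libtl.so should show the same effect as the Fortran version.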

How to use child kernels (CUDA dynamic parallelism) using PyCUDA

My Python code has a GPU kernel function which is called multiple times in a for loop from the host, like this:
for i in range(n):               # pseudocode: n host-side launches
    gpu_kernel_func(blocksize, grid)
Since this call requires communication between the host and the GPU device multiple times, which is not efficient, I want to restructure it as:
gpu_kernel_function(){
    for(){
        computation;
    }
}
But this requires an extra step to make sure all the blocks in the grid are in sync. According to the dynamic parallelism documentation, calling a dummy child kernel should ensure that every thread (in the whole grid) finishes that child kernel before the code continues running. So I defined another kernel just like gpu_kernel_function and tried this:
GPUcode = '''
__global__ gpu_kernel_function() { ... }
__global__ dummy_child_kernel() { ... }
'''

gpu_kernel_function(){
    for(){
        computation;
    }
    dummy_child_kernel(void);
}
But I am getting this error: "nvcc fatal : Option '--cubin (-cubin)' is not allowed when compiling for a virtual compute architecture".
I am using a Tesla P100 (compute capability 6.0), Python 3.5, and CUDA 8.0.44. I am compiling my SourceModule like this:
mod = SourceModule(GPUcode, options=['-rdc=true' ,'-lcudart','-lcudadevrt','--machine=64'],arch='compute_60' )
I tried compute_35 too, which gives the same error.
The error message is explicitly telling you what the issue is. compute_60 is a virtual architecture. You can't statically compile virtual architectures to machine code; they are intended for producing PTX (virtual machine assembler) for JIT translation to machine code by the runtime. PyCUDA compiles code to a binary payload ("cubin") using the CUDA toolchain and then loads it via the driver API into the CUDA context. Thus the error.
You can fix the error by specifying a valid physical GPU target architecture. So you should modify the source module constructor call to something like this:
mod = SourceModule(GPUcode,
options=['-rdc=true','-lcudart','-lcudadevrt','--machine=64'],
arch='sm_60' )
This should fix the compiler error.
However, note that using dynamic parallelism requires device code linkage, and I am 99% sure that PyCUDA still doesn't support this, so you likely won't be able to do what you are asking about via a SourceModule. You could link your own cubin by hand using the compiler outside of PyCUDA and then load that cubin inside PyCUDA. You will find many examples of how to compile dynamic parallelism correctly if you search for them.
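For context, loading a pre-built cubin "via the driver API" amounts to the calls below. This is only an illustrative C sketch (the file name kernels.cubin and the kernel name are placeholders invented for this example), roughly what PyCUDA does internally when it loads a compiled module:

/* load_cubin.c -- illustrative sketch of loading a pre-built, device-linked cubin
 * with the CUDA driver API.  Build: gcc load_cubin.c -lcuda -o load_cubin
 */
#include <cuda.h>
#include <stdio.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
        fprintf(stderr, "driver API error %d at line %d\n", (int)r, __LINE__); \
        return 1; } } while (0)

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction kernel;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* kernels.cubin must already be compiled and device-linked for the real
       architecture (e.g. sm_60) outside of PyCUDA; the kernel should be
       declared extern "C" so its name is not mangled. */
    CHECK(cuModuleLoad(&mod, "kernels.cubin"));
    CHECK(cuModuleGetFunction(&kernel, mod, "gpu_kernel_function"));

    /* Launch a single block of a single thread, with no arguments. */
    CHECK(cuLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL));
    CHECK(cuCtxSynchronize());

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}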

What happens when you execute an instruction that your CPU does not support?

What happens if a CPU attempts to execute a binary that has been compiled with some instructions that the CPU doesn't support? I'm specifically wondering about some of the new AVX instructions running on older processors.
I'm assuming this can be tested for, and a friendly message could in theory be displayed to the user. Presumably most low-level libraries will check this on your behalf. Assuming you didn't make this check, what would you expect to happen? What signal would your process receive?
A new instruction can be designed to be "legacy compatible" or not.
To the former class belong instructions like tzcnt or xacquire, whose encodings are valid instructions on older architectures: tzcnt is encoded as rep bsf and xacquire is just repne. The semantics are different, of course.
To the second class belong the majority of new instructions, AVX being one popular example.
When the CPU encounters an invalid or reserved encoding it generates the #UD (for UnDefined) exception - that's interrupt number 6.
The Linux kernel sets the IDT entry for #UD early in entry_64.S:
idtentry invalid_op do_invalid_op has_error_code=0
The entry points to do_invalid_op, which is generated with a macro in traps.c:
DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
The macro DO_ERROR generates a function that calls do_error_trap in the same file.
do_error_trap uses fill_trap_info (in the same file) to create a siginfo_t structure containing the Linux signal information:
case X86_TRAP_UD:
sicode = ILL_ILLOPN;
siaddr = uprobe_get_trap_addr(regs);
break;
from there the following calls happen:
do_trap in traps.c
force_sig_info in signal.c
specific_send_sig_info in signal.c
that ultimately culminates in calling the signal handler for SIGILL of the offending process.
The following program is a very simple example that generates a #UD:
BITS 64
GLOBAL _start
SECTION .text
_start:
ud2
We can use strace to check the signal received when running that program:
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x400080} ---
+++ killed by SIGILL +++
as expected.
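From user space you can also observe the signal by installing a SIGILL handler. Below is a minimal C sketch (not from the original answer); it uses __builtin_trap, which GCC compiles to ud2 on x86, so the handler should receive the SIGILL generated by the #UD exception:

/* sigill_demo.c -- hypothetical example of catching SIGILL.
 * Build: gcc sigill_demo.c -o sigill_demo
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_sigill(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;
    /* async-signal-safety is ignored here for brevity */
    fprintf(stderr, "caught signal %d, si_code=%d, faulting address=%p\n",
            sig, info->si_code, info->si_addr);
    _exit(1);   /* returning would re-execute the invalid instruction */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGILL, &sa, NULL);

    __builtin_trap();   /* GCC emits ud2 here, raising #UD and therefore SIGILL */
    return 0;
}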
As Cody Gray commented, libraries don't usually rely on SIGILL; instead they use a CPU dispatcher or check for the presence of an instruction explicitly.
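For example, an explicit run-time check can be done with the GCC/Clang CPU-feature builtins. This is just an illustrative sketch (the function names avx_impl and generic_impl are invented for the example), not code from any particular library:

/* cpu_dispatch.c -- illustrative sketch of explicit CPU feature detection. */
#include <stdio.h>

static void avx_impl(void)     { puts("using the AVX code path"); }
static void generic_impl(void) { puts("using the generic fallback"); }

int main(void)
{
    /* Initialize the data used by the __builtin_cpu_* builtins. */
    __builtin_cpu_init();

    /* Choose an implementation at run time instead of letting an
       unsupported instruction raise #UD (and therefore SIGILL). */
    if (__builtin_cpu_supports("avx"))
        avx_impl();
    else
        generic_impl();
    return 0;
}

A CPU dispatcher automates the same idea by selecting among several compiled variants of a function at startup or on first call.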

how to boot this code?

I am a newbie to assembly and I program in C (using GCC on Linux).
Can anyone here tell me how to compile C code into assembly and boot from it using a pen drive?
I use the command (in a Linux terminal):
gcc -S bootcode.c
which gives me a bootcode.s file.
What do I do with that?
I just want to compile the following code and run it directly from a USB stick:
#include<stdio.h>
void main()
{
printf ("hi");
}
any help here ???
First of all, you should be aware that when you are writing bootloader code you are CREATING YOUR OWN ENVIRONMENT: there is no ready-made C library available to you or anything similar, ONLY the BIOS SERVICES (interrupt routines).
Once you understand this, you will see that the above code won't boot: there is no C library behind your compiled code, so the call to printf (which is a C library function declared in stdio.h) has nothing to back it.
So if you want to print a string you need to write that function YOURSELF, either in a separate file whose object file you link in when creating the final binary, or in the same file; it is up to you. There could be other ways, but I'm not familiar with them, so do some research.
Another thing you should know: it is the BIOS that loads this boot code (your code above, in your case) to memory location 0x7C00 (0x0000:0x7C00 in segment:offset representation), so you HAVE to tell the assembler that the code will run at this memory location, either by
1 - using the ORG directive, or
2 - loading the appropriate segment registers (cs, ds, es).
Also, you should get familiar with the segment:offset memory addressing scheme; just google it or read the Intel manuals.
Finally, for the BIOS to load your code to 0x7C00, the boot code must not exceed 512 bytes (it occupies ONLY the FIRST SECTOR of the bootable medium, since a sector is 512 bytes), and the BIOS must find the boot signature 0x55AA in the last two bytes of that first sector (bytes 510 and 511); otherwise it won't consider the code BOOTABLE.
Usually this is coded as:
ORG 0x7C00
...
; your boot code, plus code to load more sectors, since 512 bytes won't be sufficient
...
times 510 - ($ - $$) db 0x00   ; zero-fill up to 510 bytes
dw 0xAA55                      ; boot sector signature, written in reverse order
                               ; since it will be stored in little-endian notation
Just so you know, I'm not covering everything here, because then I would be writing pages about it; you need to look for more resources on the net. Here is a link to start with (coding in assembly):
http://www.brokenthorn.com/Resources/OSDevIndex.html
That's all, hopefully this was helpful to you...^_^
Khilo - ALGERIA
Booting a computer is not that easy. A bootloader needs to be written, and the bootloader must obey certain rules and interact with the hardware and the BIOS ROM. You also need to disable interrupts, reserve some memory, etc. Look up MikeOS; it's a great project that can help you better understand the process.
Cheers

Memory leak in Ada.Strings.Unbounded?

I have a curious memory leak; it seems that the library function To_Unbounded_String is leaking!
Code snippets:
procedure Parse (Str : in String;
   ... do stuff ...
   declare
      New_Element : constant Ada.Strings.Unbounded.Unbounded_String :=
        Ada.Strings.Unbounded.To_Unbounded_String (Str); -- this leaks
   begin
valgrind output:
==6009== 10,276 bytes in 1 blocks are possibly lost in loss record 153 of 153
==6009== at 0x4025BD3: malloc (vg_replace_malloc.c:236)
==6009== by 0x42703B8: __gnat_malloc (in /usr/lib/libgnat-4.4.so.1)
==6009== by 0x4269480: system__secondary_stack__ss_allocate (in /usr/lib/libgnat-4.4.so.1)
==6009== by 0x414929B: ada__strings__unbounded__to_unbounded_string (in /usr/lib/libgnat-4.4.so.1)
==6009== by 0x80F8AD4: syntax__parser__dash_parser__parseXn (token_parser_g.adb:35)
Where token_parser_g.adb:35 is listed above as the "-- this leaks" line.
Other info: gnatmake version 4.4.5, gcc version 4.4, valgrind version 3.6.0.SVN-Debian; valgrind options: -v --leak-check=full --read-var-info=yes --show-reachable=no.
Any help or insights appreciated,
NWS.
Valgrind says that there is possibly a memory leak; that doesn't necessarily mean there is one. For example, if the first call to that function allocates a pool of memory that is reused during the lifetime of the program but never freed, Valgrind will report it as a possible memory leak even though it is not, as this is a common practice and the memory will be returned to the OS upon process termination.
Now, if you think there really is a memory leak, call this function in a loop and see if memory continues to grow. If it does, file a bug report, or even better, try to find and fix the leak and send a patch along with the bug report.
Hope it helps.
Was trying to keep this to comments, but what I was saying got too long and started to need formatting.
In Ada, string objects are generally assumed to be perfectly sized. The language provides functions to return the size and bounds of any string. Because of this, string handling in Ada is very different from C, and in fact more resembles how you'd do it in a functional language like Lisp.
But the basic principle is that, except in some very unusual situations, if you find yourself using Ada.Strings.Unbounded, you are going about things the wrong way.
The one case where you really can't get around using a variable-length string (or perhaps a buffer with a separate valid_length variable) is when reading strings as input from some external source. As you say, your parsing example is such a situation.
However, even here you should only have that situation on the initial buffer. Your call to your Parse routine should look something like this:
Ada.Text_IO.Get_Line (Buffer, Buffer_Len);
Parse (Buffer(Buffer'first..Buffer'first + Buffer_Len - 1));
Now inside the Parse routine you have a perfectly-sized constant Ada string to work with. If for some reason you need to pull out a subslice, you would do the following:
... --// Code to find start and end indices of my subslice
New_Element : constant String := Str(Element_Start .. Element_End);
If you don't actually need to make a copy of that data for some reason though, you are better off just finding Element_Start and Element_End and working with a slice of the original string buffer. Eg:
if Str(Element_Start..Element_End) = "MyToken" then
I know this doesn't answer your question about Ada.Strings.Unbounded possibly leaking. But even if it doesn't leak, that code is relatively wasteful of machine resources (CPU and memory), and probably shouldn't be used for string manipulation unless you really need it.
Are bounded strings scoped?
Expanding on @T.E.D.'s comments, Ada.Strings.Bounded "objects should not be implemented by implicit pointers and dynamic allocation." Instead, the maximum size is fixed when the generic is instantiated. As an implementation detail, GNAT uses a discriminant to specify the maximum size of the string and a record to store the current size and contents.
In contrast, Ada.Strings.Unbounded requires that "No storage associated with an Unbounded_String object shall be lost upon assignment or scope exit." As an implementation detail, GNAT uses a buffered implementation derived from Ada.Finalization.Controlled. As a result, the memory used by an Unbounded_String may appear to be a leak until the object is finalized, for example when the code returns to an enclosing scope.
