The futex facility returned an unexpected error code? - multithreading

Two threads in the same process, both using an rwlock object stored in shared memory, crash during a pthreads stress test. I spent a while looking for memory corruption or a deadlock but have found nothing so far. Is this just a less-than-optimal way of telling me that I have created a deadlock? Any pointers on tools or methods for debugging this?
Thread 5 "tms_test" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff28a7700 (LWP 3777)]
0x00007ffff761e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff761e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff762002a in __GI_abort () at abort.c:89
#2 0x00007ffff76607ea in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff77776cc "%s") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff766080e in __GI___libc_fatal (message=message@entry=0x7ffff79c4ae0 "The futex facility returned an unexpected error code.") at ../sysdeps/posix/libc_fatal.c:185
#4 0x00007ffff79be7e5 in futex_fatal_error () at ../sysdeps/nptl/futex-internal.h:200
#5 futex_wait (private=<optimized out>, expected=<optimized out>, futex_word=0x7ffff7f670d9) at ../sysdeps/unix/sysv/linux/futex-internal.h:77
#6 futex_wait_simple (private=<optimized out>, expected=<optimized out>, futex_word=0x7ffff7f670d9) at ../sysdeps/nptl/futex-internal.h:135
#7 __pthread_rwlock_wrlock_slow (rwlock=0x7ffff7f670cd) at pthread_rwlock_wrlock.c:67
#8 0x00000000004046e3 in _memstat (offset=0x7fffdc0b11a5, func=0x0, lineno=0, size=134, flag=1 '\001') at tms_mem.c:107
#9 0x000000000040703b in TmsMemReallocExec (in=0x7fffdc0abb81, size=211, func=0x43f858 "_malloc_thread", lineno=478) at tms_mem.c:390
#10 0x000000000042a008 in _malloc_thread (arg=0x644c11) at tms_test.c:478
#11 0x000000000041a1d6 in _threadStarter (arg=0x644c51) at tms_mem.c:2384
#12 0x00007ffff79b96ba in start_thread (arg=0x7ffff28a7700) at pthread_create.c:333
#13 0x00007ffff76ef82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)

It is hard to debug something that is not well documented. I looked for information about "The futex facility returned an unexpected error code", but the message does not appear in the futex documentation.
In my case the message was triggered by sem_wait(sem), where sem was no longer a valid sem_t pointer: after initializing it with sem_init(sem,1,1), I was accidentally overwriting the memory it pointed to with random integers.
Check that you are passing a valid pointer to the locking function.
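One detail in the question's backtrace is worth checking first: rwlock=0x7ffff7f670cd and futex_word=0x7ffff7f670d9 are both odd addresses. The futex system call expects a 4-byte-aligned word and fails with EINVAL otherwise, and glibc appears to treat that failure as fatal instead of returning it to the caller. The following is a minimal sketch, not the asker's code, of placing a process-shared rwlock only at a suitably aligned address inside a shared region. The helper names aligned_for_rwlock, align_up_for_rwlock and init_shared_rwlock are made up for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <pthread.h>

// Hypothetical helper: true when p is suitably aligned for a pthread_rwlock_t.
bool aligned_for_rwlock(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(pthread_rwlock_t) == 0;
}

// Hypothetical helper: rounds p up to the next address aligned for a
// pthread_rwlock_t (alignments are powers of two, so masking works).
void *align_up_for_rwlock(void *p) {
    const std::uintptr_t a = alignof(pthread_rwlock_t);
    const std::uintptr_t v = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<void *>((v + a - 1) & ~(a - 1));
}

// Hypothetical helper: initializes a process-shared rwlock at `where`.
// Returns 0 on success, EINVAL for a misaligned address, or the error code
// reported by the pthread calls.
int init_shared_rwlock(void *where) {
    assert(where != nullptr);
    if (!aligned_for_rwlock(where)) {
        return EINVAL;  // refuse to build a lock the kernel cannot wait on
    }
    pthread_rwlockattr_t attr;
    int rc = pthread_rwlockattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0) {
        rc = pthread_rwlock_init(static_cast<pthread_rwlock_t *>(where), &attr);
    }
    pthread_rwlockattr_destroy(&attr);
    return rc;
}
```

If a custom allocator hands out offsets into the shared region, rounding each lock's address with something like align_up_for_rwlock before calling pthread_rwlock_init would rule this cause out.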

I got this error when I declared sem_t mutex as a local variable.
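A semaphore declared as a local variable stops existing when its function returns, while other threads may still hold a pointer to it. Below is a minimal sketch of the safer arrangement, assuming Linux unnamed semaphores: the semaphore lives in static storage and is destroyed only after every thread has been joined. The names g_sem, g_counter, worker and run_workers are illustrative and do not come from the posts above.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <semaphore.h>
#include <vector>

static sem_t g_sem;         // static storage: valid for the whole program run
static long g_counter = 0;  // protected by g_sem

static void *worker(void *arg) {
    const long iterations = *static_cast<const long *>(arg);
    for (long i = 0; i < iterations; ++i) {
        sem_wait(&g_sem);  // binary semaphore used as a lock
        ++g_counter;
        sem_post(&g_sem);
    }
    return nullptr;
}

// Starts nthreads workers, waits for all of them, and returns the final count.
long run_workers(int nthreads, long iterations) {
    g_counter = 0;
    int rc = sem_init(&g_sem, 0, 1);  // pshared = 0: threads of one process
    assert(rc == 0);
    (void)rc;
    std::vector<pthread_t> tids(static_cast<std::size_t>(nthreads));
    for (pthread_t &tid : tids) {
        // Passing the address of `iterations` is safe only because every
        // thread is joined before this function returns.
        pthread_create(&tid, nullptr, worker, &iterations);
    }
    for (pthread_t &tid : tids) {
        pthread_join(tid, nullptr);
    }
    sem_destroy(&g_sem);  // no thread can touch the semaphore any more
    return g_counter;
}
```

The same lifetime rule applies to a semaphore placed in shared memory: the mapping has to stay in place, and its contents intact, for as long as anyone can wait on it.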


A question about malloc implementation in glibc

I was reading source code of glibc.
In function void *__libc_malloc(size_t bytes):
void *__libc_malloc(size_t bytes) {
mstate ar_ptr;
void *victim;
_Static_assert(PTRDIFF_MAX <= SIZE_MAX / 2, "PTRDIFF_MAX is not more than half of SIZE_MAX");
if (!__malloc_initialized) ptmalloc_init();
...
}
This shows that when the first thread calls malloc, it runs ptmalloc_init(), which links thread_arena to main_arena and sets __malloc_initialized to true.
A second thread, on the other hand, is stopped by the following check in ptmalloc_init():
static void ptmalloc_init(void) {
if (__malloc_initialized) return;
__malloc_initialized = true;
thread_arena = &main_arena;
malloc_init_state(&main_arena);
...
Thus the thread_arena of the second thread is NULL, and it has to mmap() an additional arena.
My question is:
This looks like a possible race condition, because no lock protects __malloc_initialized, so the thread_arena of both the first and the second thread may end up linked to main_arena. Why not use a lock to protect __malloc_initialized?
"It seems possible to cause a race condition because there is no lock around __malloc_initialized"
It is impossible [1] for a program to create a second running thread without having called an allocation routine (and therefore ptmalloc_init) while it was still single-threaded.
Because of that, ptmalloc_init can assume that it runs while only a single thread exists.
[1] Why is it impossible? Because creating a thread itself calls calloc.
For example, in this program:
#include <pthread.h>
void *fn(void *p) { return p; }
int main()
{
pthread_t tid;
pthread_create(&tid, NULL, fn, NULL);
pthread_join(tid, NULL);
return 0;
}
ptmalloc_init is called here (only a single thread exists at that point):
Breakpoint 2, ptmalloc_init () at /usr/src/debug/glibc-2.34-42.fc35.x86_64/malloc/arena.c:283
283 if (__malloc_initialized)
(gdb) bt
#0 ptmalloc_init () at /usr/src/debug/glibc-2.34-42.fc35.x86_64/malloc/arena.c:283
#1 __libc_calloc (n=17, elem_size=16) at malloc.c:3526
#2 0x00007ffff7fdd6c3 in calloc (b=16, a=17) at ../include/rtld-malloc.h:44
#3 allocate_dtv (result=result@entry=0x7ffff7dae640) at ../elf/dl-tls.c:375
#4 0x00007ffff7fde0e2 in __GI__dl_allocate_tls (mem=mem@entry=0x7ffff7dae640) at ../elf/dl-tls.c:634
#5 0x00007ffff7e514e5 in allocate_stack (stacksize=<synthetic pointer>, stack=<synthetic pointer>,
pdp=<synthetic pointer>, attr=0x7fffffffde30)
at /usr/src/debug/glibc-2.34-42.fc35.x86_64/nptl/allocatestack.c:429
#6 __pthread_create_2_1 (newthread=0x7fffffffdf58, attr=0x0, start_routine=0x401136 <fn>, arg=0x0)
at pthread_create.c:648
#7 0x0000000000401167 in main () at p.c:7
GLIBC's dynamic memory allocator is designed to perform well in both single-threaded and multi-threaded programs. It uses several mutexes instead of a single central one, which would end up serializing every concurrent access to the allocator. The concept of arenas, each protected by its own mutex, was introduced to provide a kind of reserved memory area for each thread. Threads can therefore access the allocator's data structures in parallel as long as they use different arenas.
The main goal is to avoid contention on the mutexes as much as possible.
The initialization step is critical because the main arena must be set up exactly once. The __malloc_initialized global variable is a flag that prevents multiple initializations. Strictly speaking, in a multi-threaded environment the flag should be protected by a mutex, because an unsynchronized check of a variable is not thread-safe. But doing so would break the main design principle of avoiding a central mutex that would serialize concurrent threads for the lifetime of the process.
So the unprotected __malloc_initialized is a trade-off that works as long as the first access to the memory allocator happens while the process is still single-threaded.
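Application code usually cannot rely on its first call happening while the process is single-threaded, so the same once-only setup needs a guard there. The sketch below is not glibc's implementation. It shows the guarded alternative using std::call_once from the C++ standard library, with hypothetical names (g_once, g_init_calls, init_state, ensure_initialized, count_inits_with) standing in for a real initializer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

static std::once_flag g_once;             // guards the one-time setup
static std::atomic<int> g_init_calls{0};  // counts how often setup really ran

// Hypothetical initializer; a real one would set up shared state here.
static void init_state() {
    g_init_calls.fetch_add(1);
}

// Safe to call from any thread at any time: init_state runs exactly once,
// and every caller returns only after it has completed.
void ensure_initialized() {
    std::call_once(g_once, init_state);
}

// Races nthreads threads into ensure_initialized and reports how many times
// the initializer has run in total.
int count_inits_with(int nthreads) {
    assert(nthreads > 0);
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(nthreads));
    for (int i = 0; i < nthreads; ++i) {
        threads.emplace_back(ensure_initialized);
    }
    for (std::thread &t : threads) {
        t.join();
    }
    return g_init_calls.load();
}
```

The guard adds a synchronized check to every call, which is the kind of cost the unguarded flag avoids when single-threaded first use can be assumed.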
Under Linux, a process starts single-threaded (the main thread). For both dynamically and statically linked programs, GLIBC has an initialization entry point (CSU = C Start Up) called __libc_start_main(), defined in csu/libc-start.c in the library's source tree. It performs many initializations before calling the main() function. This is where a first call to the dynamic allocator occurs to initialize the main arena.
Let's look at the following program which does not explicitly call any service from the dynamic memory allocator and does not create any thread:
#include <unistd.h>
int main(void)
{
pause();
return 0;
}
Let's compile it and run it with gdb and a breakpoint on malloc():
$ gcc -g mm.c -o mm
$ gdb ./mm
[...]
(gdb) br malloc
Function "malloc" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (malloc) pending.
(gdb) run
Starting program: /.../mm
Breakpoint 1, malloc (n=1441) at dl-minimal.c:49
49 dl-minimal.c: No such file or directory.
(gdb) where
#0 malloc (n=1441) at dl-minimal.c:49
#1 0x00007ffff7fec5e5 in calloc (nmemb=<optimized out>, size=size@entry=1) at dl-minimal.c:103
#2 0x00007ffff7fdc284 in _dl_new_object (realname=realname@entry=0x7ffff7ff4342 "", libname=libname@entry=0x7ffff7ff4342 "", type=type@entry=0, loader=loader@entry=0x0,
mode=mode@entry=536870912, nsid=nsid@entry=0) at dl-object.c:89
#3 0x00007ffff7fd1d2f in dl_main (phdr=0x555555554040, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:1330
#4 0x00007ffff7febc4b in _dl_sysdep_start (start_argptr=start_argptr@entry=0x7fffffffdf70, dl_main=dl_main@entry=0x7ffff7fd15e0 <dl_main>) at ../elf/dl-sysdep.c:252
#5 0x00007ffff7fd104c in _dl_start_final (arg=0x7fffffffdf70) at rtld.c:449
#6 _dl_start (arg=0x7fffffffdf70) at rtld.c:539
#7 0x00007ffff7fd0108 in _start () from /lib64/ld-linux-x86-64.so.2
#8 0x0000000000000001 in ?? ()
#9 0x00007fffffffe2e2 in ?? ()
#10 0x0000000000000000 in ?? ()
(gdb)
The output above shows that even when the main program does not call malloc() explicitly, GLIBC's internals call the memory allocator at least once, which triggers the initialization of the main arena.
We may consequently wonder why __malloc_initialized still needs to be checked during the lifetime of the process, after the internal initialization step. The GLIBC initialization sets up various internal modules (main stack, pthreads, ...) and some of them may call the dynamic memory allocator. Hence __malloc_initialized allows the allocator to be called at any time during the initialization step. And if the allocator is not needed because of some esoteric configuration, it is not initialized at all.

Stack traces of a few threads show nothing except __nanosleep_nocancel in a generated core

The stack traces of some threads show nothing except __nanosleep_nocancel when I open the core dump with gdb on Debian. I observed this while analyzing the thread stack traces in a core dump generated by the kernel; the application itself triggers the dump when it detects an anomaly.
Thread 5 (Thread 0x7f8b307bf700 (LWP 27000)):
#0 ......Application function .....
#1 ......Application function .....
#2 ......Application function .....
#3 ......Application function .....
#4 0x00007f8b303c9494 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#5 0x00007f8b2f666aff in __libc_ifunc_impl_list () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x0000000000000000 in ?? ()
Thread 3 (Thread 0x7f8b30685700 (LWP 27025)):
#0 0x00007f8b303d27dd in __nanosleep_nocancel () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f8b2eb31700 (LWP 27032)):
#0 0x00007f8b303d27dd in __nanosleep_nocancel () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f8b306c3700 (LWP 27022)):
#0 0x00007f8b303d2f9f in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
Here the stack traces of threads 2 and 3 show only __nanosleep_nocancel, whereas I expected them to look like the trace of thread 5.
Any leads on this would be greatly appreciated.

OpenCV app compiled with C++11 creates extra thread

I'm debugging an OpenCV app compiled with C++11 (I use OpenCV 2.4.10). The app has two threads that do some image processing on the CPU (no GPU functions are used, but I also included libopencv_gpu.so in the linked libraries).
Using gdb I found three threads running instead of just two (the main process thread and another thread created by it):
(gdb) info threads
Id Target Id Frame
78 Thread 0x7fffe2ff5700 (LWP 20531) "app_name" 0x00007ffff5bb2f3d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
2 Thread 0x7fffe3c42700 (LWP 20454) "app_name" 0x00007ffff5bdf12d in poll () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7ffff7fab800 (LWP 20450) "app_name" 0x00007ffff5bb2f3d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
Threads 1 and 78 (gdb IDs) are executing my code. I added a sleep call to each one so I could be sure that those are my threads.
Thread 2 (gdb ID) is, I believe, created before the main function of the main process is entered. As far as I could debug it, this thread just calls poll() all the time.
I'm new to gdb. Can you tell me how to find out who creates this thread and what its purpose is? Is it related to OpenCV or to C++11? When I compile the same app using OpenCV4Tegra and run it on a Tegra K1 board, thread number 2 does not exist.
EDIT
This is the backtrace when creating thread number 2. It seems that libusb creates this but I don't know why yet:
(gdb) backtrace
#0 __pthread_create_2_1 (newthread=0x7fffea79c438, attr=0x0, start_routine=0x7fffea5941c0, arg=0x0) at pthread_create.c:466
#1 0x00007fffea5943df in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2 0x00007fffea5926a5 in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#3 0x00007fffea58b715 in libusb_init () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#4 0x00007ffff2f06a0e in ?? () from /usr/lib/x86_64-linux-gnu/libdc1394.so.22
#5 0x00007ffff2ef5465 in dc1394_new () from /usr/lib/x86_64-linux-gnu/libdc1394.so.22
#6 0x00007ffff6f615e9 in CvDC1394::CvDC1394() () from /usr/local/lib/libopencv_highgui.so.2.4
#7 0x00007ffff6f373f0 in _GLOBAL__sub_I_cap_dc1394_v2.cpp () from /usr/local/lib/libopencv_highgui.so.2.4
#8 0x00007ffff7dea13a in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffdcd8, env=env@entry=0x7fffffffdcf8) at dl-init.c:78
#9 0x00007ffff7dea223 in call_init (env=<optimized out>, argv=<optimized out>, argc=<optimized out>, l=<optimized out>) at dl-init.c:36
#10 _dl_init (main_map=0x7ffff7ffe1c8, argc=3, argv=0x7fffffffdcd8, env=0x7fffffffdcf8) at dl-init.c:126
#11 0x00007ffff7ddb30a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
(gdb) quit

GDB debug output for multi-thread program

All,
I am debugging a 24-thread program with GDB. I have found the line of code where the error occurs, but I cannot tell from GDB's output what the error is. The following line leads to the error; it is just a normal insertion into a map structure.
current_node->children.insert(std::pair<string, ComponentTrieNode*>(comps[j], temp_node));
I used GDB to find out in which thread the error happens and switched to that thread; the backtrace command shows the function calls on the stack. (The last several lines try to print the values of some variables in a function, but fail.)
What should I do to find out exactly what error is happening?
[root@localhost nameComponentEncoding]# gdb NCE_david
GNU gdb (GDB) Fedora (7.2.90.20110429-36.fc15)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david...done.
(gdb) r /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
Starting program: /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
[Thread debugging using libthread_db enabled]
[New Thread 0x7fffd2bf5700 (LWP 13129)]
[New Thread 0x7fffd23f4700 (LWP 13130)]
[New Thread 0x7fffd1bf3700 (LWP 13131)]
[New Thread 0x7fffd13f2700 (LWP 13132)]
[New Thread 0x7fffd0bf1700 (LWP 13133)]
[New Thread 0x7fffd03f0700 (LWP 13134)]
[New Thread 0x7fffcfbef700 (LWP 13135)]
[New Thread 0x7fffcf3ee700 (LWP 13136)]
[New Thread 0x7fffcebed700 (LWP 13137)]
[New Thread 0x7fffce3ec700 (LWP 13138)]
[New Thread 0x7fffcdbeb700 (LWP 13139)]
[New Thread 0x7fffcd3ea700 (LWP 13140)]
[New Thread 0x7fffccbe9700 (LWP 13141)]
[New Thread 0x7fffcc3e8700 (LWP 13142)]
[New Thread 0x7fffcbbe7700 (LWP 13143)]
[New Thread 0x7fffcb3e6700 (LWP 13144)]
[New Thread 0x7fffcabe5700 (LWP 13145)]
[New Thread 0x7fffca3e4700 (LWP 13146)]
[New Thread 0x7fffc9be3700 (LWP 13147)]
[New Thread 0x7fffc93e2700 (LWP 13148)]
[New Thread 0x7fffc8be1700 (LWP 13149)]
[New Thread 0x7fffc83e0700 (LWP 13150)]
[New Thread 0x7fffc7bdf700 (LWP 13151)]
this is thread 1
this is thread 7
this is thread 14
this is thread 18
this is thread 2
this is thread 19
this is thread 6
this is thread 8
this is thread 24
base: 64312646
this is thread 11
this is thread 5
this is thread 12
this is thread 13
this is thread 3
this is thread 15
this is thread 16
this is thread 17
this is thread 4
this is thread 20
this is thread 21
this is thread 22
this is thread 23
this is thread 9
this is thread 10
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc8be1700 (LWP 13149)]
std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126 __x->_M_right = __y->_M_left;
(gdb) info threads
Id Target Id Frame
24 Thread 0x7fffc7bdf700 (LWP 13151) "NCE_david" compare (__n=<optimized out>, __s2=<optimized out>, __s1=<optimized out>)
at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/char_traits.h:257
(... other 22 threads not listed)
2 Thread 0x7fffd2bf5700 (LWP 13129) "NCE_david" compare (__n=<optimized out>, __s2=<optimized out>, __s1=<optimized out>)
at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/char_traits.h:257
1 Thread 0x7ffff7fe57a0 (LWP 13126) "NCE_david" strtok () at ../sysdeps/x86_64/strtok.S:76
(gdb) thread 22
[Switching to thread 22 (Thread 0x7fffc8be1700 (LWP 13149))]
#0 std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126 __x->_M_right = __y->_M_left;
(gdb) bt
#0 std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
#1 0x0000003cdd26e848 in std::_Rb_tree_insert_and_rebalance (__insert_left=<optimized out>, __x=0x7fffc0005ba0, __p=<optimized out>, __header=...)
at ../../../../libstdc++-v3/src/tree.cc:266
#2 0x00000000004029ca in std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*>, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> > >::_M_insert_ (this=0x608108, __x=<optimized out>, __p=0x16cd3e30, __v=...)
at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/stl_pair.h:87
#3 0x0000000000402b7d in std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*>, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ComponentTrieNode*> > >::_M_insert_unique (this=0x608108, __v=...)
at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/stl_tree.h:1281
#4 0x000000000040444c in insert (__x=..., this=0x608108) at /usr/lib/gcc/x86_64-redhat-linux/4.6.0/../../../../include/c++/4.6.0/bits/stl_map.h:518
#5 ComponentTrie::add_prefix (this=0x7fffffffe2e0, prefix_input=<optimized out>, port=10) at ComponentTrie_david.cpp:112
#6 0x0000000000401c3b in main._omp_fn.0 () at NameComponentEncoding_david.cpp:277
#7 0x0000003cd2607fea in gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:115
#8 0x0000003cd0607cd1 in start_thread (arg=0x7fffc8be1700) at pthread_create.c:305
#9 0x0000003cd02dfd3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb) p 'ComponentTrie::add_prefix(char*, int)'::comps[j]
No symbol "comps" in specified context.
(gdb) p 'ComponentTrie::add_prefix(char*, int)'::prefix
No symbol "prefix" in specified context.
Edit: I have run the code with valgrind --tool=memcheck, the following is the result.
[root@localhost nameComponentEncoding]# valgrind --tool=memcheck ./NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
(... many lines omitted)
==13261==
==13261== Thread 11:
==13261== Invalid read of size 1
==13261== at 0x3CD02849BC: strtok (strtok.S:141)
==13261== by 0x40426A: ComponentTrie::add_prefix(char*, int) (ComponentTrie_david.cpp:99)
==13261== by 0x40242C: main._omp_fn.0 (NameComponentEncoding_david.cpp:531)
==13261== by 0x3CD2607FE9: gomp_thread_start (team.c:115)
==13261== by 0x3CD0607CD0: start_thread (pthread_create.c:305)
==13261== by 0x3CD02DFD3C: clone (clone.S:115)
==13261== Address 0x234422c02 is not stack'd, malloc'd or (recently) free'd
==13261==
==13261== Invalid read of size 1
==13261== at 0x3CD02849EC: strtok (strtok.S:167)
==13261== by 0x40426A: ComponentTrie::add_prefix(char*, int) (ComponentTrie_david.cpp:99)
==13261== by 0x40242C: main._omp_fn.0 (NameComponentEncoding_david.cpp:531)
==13261== by 0x3CD2607FE9: gomp_thread_start (team.c:115)
==13261== by 0x3CD0607CD0: start_thread (pthread_create.c:305)
==13261== by 0x3CD02DFD3C: clone (clone.S:115)
==13261== Address 0x234422c02 is not stack'd, malloc'd or (recently) free'd
==13261==
Insertion and lookup cost time(us): 994669532 67108864 14.821731 0.067469
component number:4849478, state number: 2545847
Parallel threads:24
==13261==
==13261== HEAP SUMMARY:
==13261== in use at exit: 4,239,081,584 bytes in 76,746,193 blocks
==13261== total heap usage: 80,050,114 allocs, 3,303,921 frees, 4,323,622,103 bytes allocated
==13261==
==13261== LEAK SUMMARY:
==13261== definitely lost: 0 bytes in 0 blocks
==13261== indirectly lost: 0 bytes in 0 blocks
==13261== possibly lost: 4,111,951,106 bytes in 74,746,429 blocks
==13261== still reachable: 127,130,478 bytes in 1,999,764 blocks
==13261== suppressed: 0 bytes in 0 blocks
==13261== Rerun with --leak-check=full to see details of leaked memory
==13261==
==13261== For counts of detected and suppressed errors, rerun with: -v
==13261== Use --track-origins=yes to see where uninitialised values come from
==13261== ERROR SUMMARY: 45 errors from 30 contexts (suppressed: 6 from 6)
We know that the program is segfaulting on this line:
current_node->children.insert(std::pair<string, ComponentTrieNode*>(comps[j], temp_node));
From the stack trace, we know that the segfault happens deep in the red black tree implementation of std::map:
#0 std::local_Rb_tree_rotate_left (__x=0xa057c90, __root=@0x608118) at ../../../../libstdc++-v3/src/tree.cc:126
126 __x->_M_right = __y->_M_left;
This implies that:
1. The segfault could be caused by:
   a) evaluating __x->_M_right
   b) evaluating __y->_M_left
   c) storing the right hand side to the left hand side of __x->_M_right = __y->_M_left
2. std::map::insert() being called implies that the segfault was NOT caused while building the arguments to the call. In particular comps[j] is not out of bounds.
This leads me to think that your heap was already corrupted by previous memory operation errors by this time and that the crash in std::map::insert() is a symptom and not a cause.
Run your program under the Valgrind memcheck tool:
$ valgrind --tool=memcheck /mnt/disk2/experiments_BLOODMOON/two_stage_bloom_filter/programs/nameComponentEncoding/NCE_david /mnt/disk2/FIB_with_port/10_1.txt /mnt/disk2/trace/a_10_1.trace /mnt/disk2/FIB_with_port/10_2.txt
and carefully read Valgrind's output afterwards to find the first memory error in your program.
Valgrind is implemented as a virtual CPU, so your program would slow down by a factor of ~30. This is time consuming but should allow you to make progress in troubleshooting the problem.
In addition to Valgrind, you might also want to try enabling debug mode for the libstdc++ containers:
To use the libstdc++ debug mode, compile your application with the compiler flag -D_GLIBCXX_DEBUG. Note that this flag changes the sizes and behavior of standard class templates such as std::vector, and therefore you can only link code compiled with debug mode and code compiled without debug mode if no instantiation of a container is passed between the two translation units.
If your program uses no external libraries then rebuilding the whole thing with -D_GLIBCXX_DEBUG added to CXXFLAGS in the Makefile should work. Otherwise you'd need to know whether C++ containers are passed between components compiled with and without the debug flag.
Valgrind Log Review
I'm surprised that you're using strtok() in a multi-threaded program: it keeps its parsing position in hidden static state, so concurrent calls from different threads interfere with each other. Is ComponentTrie::add_prefix() never called from two threads concurrently? While fixing the invalid read by inspecting how strtok() is used at ComponentTrie_david.cpp:99, you might want to replace strtok() with strtok_r() as well.
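The Valgrind log shows strtok() being reached through gomp_thread_start, that is, from OpenMP worker threads. A minimal sketch of the strtok_r() replacement follows. The function name split_components and the "/" delimiter are assumptions for illustration, not taken from ComponentTrie_david.cpp.

```cpp
#include <cassert>
#include <string.h>  // strtok_r is POSIX and declared here
#include <string>
#include <vector>

// Splits `prefix` into components separated by any character in `delims`.
// The parsing state lives in the local `save` pointer, so concurrent calls
// from different threads do not interfere with each other.
std::vector<std::string> split_components(const std::string &prefix,
                                          const char *delims = "/") {
    assert(delims != nullptr);
    std::vector<std::string> comps;
    // strtok_r modifies its input, so work on a private, NUL-terminated copy.
    std::vector<char> buf(prefix.begin(), prefix.end());
    buf.push_back('\0');
    char *save = nullptr;
    for (char *tok = strtok_r(buf.data(), delims, &save); tok != nullptr;
         tok = strtok_r(nullptr, delims, &save)) {
        comps.emplace_back(tok);
    }
    return comps;
}
```

Copying the input also avoids writing into a buffer that another thread may be reading, which is a separate hazard when several threads tokenize the same char*.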
Concurrent Access to STL Containers
The standard C++ containers are explicitly documented to not do thread synchronization:
The user code must guard against concurrent function calls which access any particular library object's state when one or more of those accesses modifies the state. An object will be modified by invoking a non-const member function on it or passing it as a non-const argument to a library function. An object will not be modified by invoking a const member function on it or passing it to a function as a pointer- or reference-to-const. Typically, the application programmer may infer what object locks must be held based on the objects referenced in a function call and whether the objects are accessed as const or non-const.
(That's from the GNU libstdc++ documentation, but the C++11 standard specifies essentially the same behavior.) Concurrent modification of std::map and other containers is a serious error and likely the culprit behind the crash. Guard each container with its own pthread_mutex_t or use the OpenMP synchronization mechanisms.
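As a minimal sketch of guarding a container, the following uses std::mutex (the C++11 counterpart of pthread_mutex_t) so that only one thread at a time modifies the map. Node is a hypothetical stand-in for ComponentTrieNode, and add_child and insert_concurrently are illustrative names. The real trie may need coarser or finer locking than one mutex per node.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for ComponentTrieNode: a children map plus the mutex
// that must be held for every access that may modify it.
struct Node {
    std::map<std::string, Node *> children;
    std::mutex children_mutex;

    // Inserts under the lock; returns true if the key was newly added.
    bool add_child(const std::string &name, Node *child) {
        std::lock_guard<std::mutex> lock(children_mutex);
        return children.insert(std::pair<std::string, Node *>(name, child)).second;
    }
};

// Inserts per_thread distinct keys from each of nthreads threads into one
// shared node and returns the resulting number of children.
std::size_t insert_concurrently(int nthreads, int per_thread) {
    assert(nthreads > 0 && per_thread >= 0);
    Node root;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&root, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                root.add_child(std::to_string(t) + "/" + std::to_string(i), nullptr);
            }
        });
    }
    for (std::thread &th : threads) {
        th.join();
    }
    return root.children.size();  // all writers have finished, no lock needed
}
```

Lookups that can run while another thread inserts need the same lock, because a rebalancing insert can change the tree structure underneath a reader.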

Debugging a failed node-ffi callback / segmentation fault

I'm trying to use libvlc from within node.js using node-ffi, and while it seems to work great for the general basic media player functionality, I keep getting crashes, segmentation faults and general freezes in my program when I try to use libvlc's asynchronous event system and integrate it with node's EventEmitter. The code I'm using thus far is hosted at https://gist.github.com/2644721 but doesn't seem to work.
GDB produces a mixed bag of results, but the last crash I received was:
Program received signal SIGSEGV, Segmentation fault.
0x000000000057cc86 in v8::Function::Call(v8::Handle<v8::Object>, int, v8::Handle<v8::Value>*) ()
(gdb) bt
#0 0x000000000057cc86 in v8::Function::Call(v8::Handle<v8::Object>, int, v8::Handle<v8::Value>*) ()
#1 0x00007ffff5997a41 in CallbackInfo::DispatchToV8(CallbackInfo*, void*, void**) ()
from /home/adam/node_modules/node-ffi/compiled/0.6/linux/x64/ffi_bindings.node
#2 0x00007ffff5997adb in CallbackInfo::WatcherCallback(uv_async_s*, int) ()
from /home/adam/node_modules/node-ffi/compiled/0.6/linux/x64/ffi_bindings.node
#3 0x00000000007be12f in ev_invoke_pending ()
#4 0x00000000007c2087 in ev_run ()
#5 0x00000000007b597f in uv_run ()
#6 0x000000000052a147 in node::Start(int, char**) ()
#7 0x00007ffff63ca76d in __libc_start_main ()
from /lib/x86_64-linux-gnu/libc.so.6
#8 0x0000000000524fe5 in _start ()
It's obvious I'm doing something wrong here. The node-ffi documentation says that it's really easy to cause this sort of behaviour if you do something wrong. I'm thinking perhaps the callback isn't being run from the same thread or scope, but I'm not sure how to check or even fix that. Any help would be appreciated...
Program received signal SIGSEGV, Segmentation fault.
IsGlobalObject (this=0x1)
at /build/buildd/nodejs-0.6.17/deps/v8/src/objects-inl.h:796
796 in /build/buildd/nodejs-0.6.17/deps/v8/src/objects-inl.h
(gdb) bt
#0 IsGlobalObject (this=0x1)
at /build/buildd/nodejs-0.6.17/deps/v8/src/objects-inl.h:796
#1 v8::internal::Invoke (construct=<optimised out>, func=..., receiver=...,
argc=2, args=0x7fffffffdeb0, has_pending_exception=0x7fffffffde1f)
at /build/buildd/nodejs-0.6.17/deps/v8/src/execution.cc:101
#2 0x00000000005ae967 in v8::internal::Execution::Call (callable=...,
receiver=..., argc=2, args=0x7fffffffdeb0,
pending_exception=0x7fffffffde1f, convert_receiver=<optimised out>)
at /build/buildd/nodejs-0.6.17/deps/v8/src/execution.cc:175
#3 0x000000000057cd31 in v8::Function::Call (this=0xc0aae0, recv=..., argc=2,
argv=0x7fffffffdeb0) at /build/buildd/nodejs-0.6.17/deps/v8/src/api.cc:3601
#4 0x00007ffff5997a41 in CallbackInfo::DispatchToV8(CallbackInfo*, void*, void**) ()
from /home/adam/node_modules/node-ffi/compiled/0.6/linux/x64/ffi_bindings.node
#5 0x00007ffff5997adb in CallbackInfo::WatcherCallback(uv_async_s*, int) ()
from /home/adam/node_modules/node-ffi/compiled/0.6/linux/x64/ffi_bindings.node
#6 0x00000000007be12f in ev_invoke_pending (loop=0xb9dea0)
at src/unix/ev/ev.c:2149
#7 0x00000000007c2087 in ev_run (loop=0xb9dea0, flags=0)
at src/unix/ev/ev.c:2525
#8 0x00000000007b597f in uv_run (loop=<optimised out>) at src/unix/core.c:194
