tcmalloc not generating stack traces - memory-leaks

I am running a binary linked with tcmalloc and it is not generating stack traces for the leaks it is detecting.
The output says:
The 1 largest leaks:
Leak of 1401231 bytes in 82093 objects allocated from:
If the preceding stack traces are not enough to find the leaks, try running THIS shell command:
pprof ../../prog "/tmp/prog.15062.prog-end.heap" --inuse_objects --lines --heapcheck --edgefraction=1e-10 --nodefraction=1e-10 --gv
When I run pprof I get a message that there are no nodes to print.
I am enclosing the code with the suspected memory leak in:
HeapLeakChecker checker("prog");
....
assert(checker.NoLeaks());
Any ideas as to how to debug this?

I would suggest trying to build the program with -fno-omit-frame-pointer (gcc), as frame pointers might be needed to get a stack trace in some setups.
tcmalloc usually uses libunwind to get the stack traces, but because of deadlock issues this is not usable everywhere.
It would also be interesting to know whether the generated file (/tmp/prog.15062.prog-end.heap in this case) contains any addresses.
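For reference, here is a minimal sketch of the explicit heap check shown above, with a deliberate leak (the leaky function and sizes are made up). It assumes gperftools' heap checker: build with -g -fno-omit-frame-pointer, link with -ltcmalloc, and run with the HEAPCHECK environment variable set (e.g. HEAPCHECK=local ./prog):

#include <gperftools/heap-checker.h>
#include <cassert>

// Hypothetical leaking function, for illustration only.
static void leaky() {
    new int[100];   // never deleted
}

int main() {
    HeapLeakChecker checker("prog");
    leaky();
    // NoLeaks() performs the check; with working stack traces the report
    // should point at the allocation inside leaky().
    assert(checker.NoLeaks());
    return 0;
}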

Related

Weird Backtrace in Perf

I used the following command to extract backtraces leading to user level L3-misses in a simple evince benchmark:
sudo perf record -d --call-graph dwarf -c 10000 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince
As is clear, the sampling period is quite large (10000 events between consecutive samples). For this experiment, the output of perf script had some samples similar to this one:
EvJobScheduler 27529 26441.375932: 10000 mem_load_uops_retired.l3_miss:uppp: 7fffcd5d8ec0 5080022 N/A|SNP N/A|TLB N/A|LCK N/A
7ffff17bec7f bits_image_fetch_separable_convolution_affine+0x2df (inlined)
7ffff17bec7f bits_image_fetch_separable_convolution_affine_pad_x8r8g8b8+0x2df (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
7ffff17d1fd1 general_composite_rect+0x301 (/usr/lib/x86_64-linux-gnu/libpixman-1.so.0.34.0)
ffffffffffffffff [unknown] ([unknown])
At the bottom of the backtrace there is a symbol called [unknown], which seems OK. But right above it, general_composite_rect() is shown as being called from that unknown frame. Is this backtrace OK?
AFAIK, the first caller in the backtrace should be something like _start() or __GI___clone(). But the backtrace is not in this form. What is wrong?
Is there any way to resolve the issue? Are the truncated (parts of) backtraces reliable?
TL;DR: perf's backtracing may stop at a function if there is no frame pointer saved on the stack (the "fp" method) or no CFI tables (the "dwarf" method). Recompile libraries with -fno-omit-frame-pointer or with -g, or install debuginfo. With release binaries and libraries, perf will often stop the backtrace early, with no chance of reaching main(), _start, or the clone()/start_thread() top functions.
The perf profiling tool in Linux is a statistical sampling profiler (without binary instrumentation): it programs a software timer, an event source, or the hardware performance monitoring unit (PMU) to generate a periodic interrupt. In your example,
-c 10000 -e mem_load_uops_retired.l3_miss:uppp selects a hardware PMU event on x86_64 in a kind of PEBS mode (https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR) to generate an interrupt after 10000 mem_load_uops_retired events (with the l3_miss mask). The generated interrupt is handled by the Linux kernel (perf_events subsystem, kernel/events and arch/x86/events). In this handler the PMU is reset (reprogrammed) to generate the next interrupt after 10000 more events, and a sample is generated. The sample data is saved into the perf.data file by the perf record command; every wake-up of the tool can save thousands of samples. The samples can be read with perf script or perf script -D.
The perf_events interrupt handler, somewhere near __perf_event_overflow in kernel/events/core.c, has full access to the registers of the current function and some time to do additional data retrieval, recording the current time, pid, and so on. Part of that process is collecting the call stack (https://en.wikipedia.org/wiki/Call_stack). But on x86_64 with -fomit-frame-pointer (often enabled for many system libraries on Debian/Ubuntu/other distributions), there is no default place in the registers or in the function's stack frame where the frame pointer is stored:
https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/Optimize-Options.html#index-fomit_002dframe_002dpointer-692
-fomit-frame-pointer
Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines.
Starting with GCC version 4.6, the default setting (when not optimizing for size) for 32-bit Linux x86 and 32-bit Darwin x86 targets has been changed to -fomit-frame-pointer. The default can be reverted to -fno-omit-frame-pointer by configuring GCC with the --enable-frame-pointer configure option.
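For illustration only, here is a minimal, hypothetical sketch of the frame-pointer chain walk that the "fp" call-graph method relies on (x86-64, using the GCC/Clang builtin __builtin_frame_address; this is not the kernel's actual unwinder). If any function in the chain was compiled with -fomit-frame-pointer, the saved-rbp slot is missing and the walk stops or goes wrong at that frame:

#include <cstdint>
#include <cstdio>

// With -fno-omit-frame-pointer each frame begins with [saved rbp][return address],
// and rbp points at the saved rbp of the caller's frame, forming a linked list.
void print_fp_backtrace() {
    uintptr_t* frame = static_cast<uintptr_t*>(__builtin_frame_address(0));
    while (frame != nullptr) {
        uintptr_t return_address = frame[1];                 // slot above the saved rbp
        if (return_address == 0)
            break;                                           // end of the chain
        std::printf("  return address %#lx\n", static_cast<unsigned long>(return_address));
        uintptr_t* next = reinterpret_cast<uintptr_t*>(frame[0]);  // follow the saved rbp
        if (next <= frame)
            break;                                           // broken or terminated chain
        frame = next;
    }
}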
With frame pointers saved on the function stack, backtracing/unwinding like this is easy. But modern gcc (and other compilers) may not generate a frame pointer for some functions, so backtracing code like the perf_events handler either stops the backtrace at such a function or needs another method of recovering the caller's frame. The -g / --call-graph option of perf record selects the method to be used. It is documented in man perf-record (http://man7.org/linux/man-pages/man1/perf-record.1.html):
--call-graph
Setup and enable call-graph (stack chain/backtrace) recording, implies -g. Default is "fp".
Allows specifying "fp" (frame pointer) or "dwarf" (DWARF's CFI - Call Frame Information) or "lbr" (Hardware Last Branch Record facility) as the method to collect the information used to show the call graphs.
In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus call graphs, using "dwarf", if available (perf tools linked to the libunwind or libdw library) should be used instead. Using the "lbr" method doesn't require any compiler options. It will produce call graphs from the hardware LBR registers. The main limitation is that it is only available on new Intel platforms, such as Haswell. It can only get user call chain. It doesn't work with branch stack sampling at the same time.
When "dwarf" recording is used, perf also records (user) stack dump when sampled. Default size of the stack dump is 8192 (bytes). User can change the size by passing the size after comma like "--call-graph dwarf,4096".
So, the dwarf method reuses CFI tables to find stack frame sizes and locate the caller's stack frame. I'm not sure whether CFI tables are stripped from release libraries by default, but debuginfo will probably have them. LBR will not help here because it is a rather short hardware buffer. Dwarf split processing (the kernel handler saves part of the stack, and the perf user-space tool parses it with libdw+libunwind) may lose parts of the call stack, so also try increasing the dwarf stack dumps with --call-graph dwarf,10240 or --call-graph dwarf,81920, etc.
Backtracing is implemented in the arch-dependent part of perf_events: arch/x86/events/core.c:perf_callchain_user(), called from kernel/events/callchain.c:get_perf_callchain() <- perf_callchain <- perf_prepare_sample <- __perf_event_output <- *(event->overflow_handler), i.e. READ_ONCE(event->overflow_handler)(event, data, regs) in __perf_event_overflow.
Brendan Gregg warned about incomplete call stacks in perf: http://www.brendangregg.com/blog/2014-06-22/perf-cpu-sample.html
Incomplete stacks usually mean -fomit-frame-pointer was used – a compiler optimization that makes little positive difference in the real world, but breaks stack profilers. Always compile with -fno-omit-frame-pointer. More recent perf has a -g dwarf option, to use the alternate libunwind/dwarf method for retrieving stacks.
I also wrote about backtraces in perf, with some additional links: How does linux's perf utility understand stack traces?
I had the same problem, and it was like this: when you are collecting traces with --call-graph dwarf, if the stack is too big you will get unknown in the stack backtrace.
The default maximum stack dump size is 8 kB, but it can be increased like this: --call-graph dwarf,16578. Unfortunately, perf has some other problems when you increase the stack size. In my case the solution was to get rid of a large stack-allocated array by allocating it on the heap.
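For example, a hypothetical change of that kind (function and buffer names are made up): the large local array inflates the stack region that --call-graph dwarf has to copy out on every sample, while the heap-allocated version keeps the sampled stack small:

#include <vector>

// Before: a 64 KiB scratch buffer lives on the stack of every call.
void process_before() {
    unsigned char scratch[64 * 1024];
    (void)scratch;
    // ... work on scratch ...
}

// After: the buffer lives on the heap, so the stack stays shallow and the
// dwarf stack dump fits within the configured size.
void process_after() {
    std::vector<unsigned char> scratch(64 * 1024);
    // ... work on scratch.data() ...
}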

Two questions regarding threads

If there is a missing/corrupted library in the gdb core, how do I isolate it?
I also read that there is a possibility the thread could have overwritten its own stack; how do I detect that?
How do I isolate the above problems with the bt below?
/etc/gdb/gdbinit:105: Error in sourced command file:
Error while executing Python code.
Reading symbols from /opt/hsp/bin/addrman...done.
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
warning: Corrupted shared library list: 0x0 != 0x4ed700
warning: no loadable sections found in added symbol-file system-supplied DSO at
0x7ffd50ff6000
Core was generated by `addrman --notification-socket
/opt/hsp/sockets/memb_notify.socket'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
292 ./ps/ps_lock_track.h: No such file or directory.
(gdb) bt
#0 0x00000000004759e4 in ps_locktrk_info::lktrk_locker_set (this=0x348,
locker_ip=<optimized out>) at ./ps/ps_lock_track.h:292
#1 0x0000000000000000 in ?? ()
It looks like the core file is corrupt, likely due to heap or stack corruption. Corruption is oftentimes the result of a buffer overflow or other undefined behavior.
If you are running on Linux, I would try valgrind. It can oftentimes spot corruption very quickly. Windows has some similar tools.
Yes, a multithreaded application can overflow the stack. Each thread is only allocated a limited amount of stack space. This usually only happens if you have a very deep function call chain or you are allocating large local objects on the stack.
Some interesting information here and here on setting the stack size for Linux applications.
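If a thread legitimately needs more stack than the default, it can be given one explicitly; here is a minimal sketch with POSIX threads (the worker function is made up for illustration):

#include <pthread.h>
#include <cstdio>

// Hypothetical thread body with deep recursion or large locals.
static void* worker(void*) {
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 16 * 1024 * 1024);  // 16 MiB instead of the default

    pthread_t tid;
    if (pthread_create(&tid, &attr, worker, nullptr) != 0) {
        std::perror("pthread_create");
        return 1;
    }
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}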
Faced with your problem, I would:
Check all the callers of the lktrk_locker_set method. Carefully investigate each, if possible, to see if there is an obvious stack overflow or heap corruption
Try to use Valgrind or similar tools to spot the issue
Add debug logging to isolate the issue
warning: Corrupted shared library list: 0x0 != 0x7c8d48ea8948c089
The above error is usually a sign that you gave GDB different system libraries (or the main binary) from the ones used when the core dump was produced.
Either you are analyzing a "production" core dump on a development machine, or you've upgraded system libraries between the time the core dump was produced and the time you are analyzing it, or you've rebuilt the main binary.
See this answer for what to do if one of the above is correct.

best approach to debug "corrupted double-linked list" crash

I am in the process of debugging a "corrupted double-linked list" crash. I have seen the source and understand the chunk struct and the fd/bk pointers, etc, so I think I know why this crash has occurred. I am now trying to fix it and I have a couple of questions.
Question #1: where (with respect to the pointer returned from malloc) is the malloc_chunk struct maintained? Is it before the memory block or after it?
Question #2: the malloc_chunks for allocated memory are different from the malloc_chunks for unallocated memory. It appears (??) that the allocated buffer case does not have the fd/bk pointers. Is this correct?
Question #3: what is the recommended approach to debugging this type of error? I am assuming that I should set a breakpoint on the malloc_chunk struct so I can break when it is overwritten. But I am not sure how to access those malloc structs in order to set a breakpoint in gdb.
Any suggestions on how to proceed would be very appreciated.
Thanks,
-Andres
what is the recommended approach to debug this type of error?
The usual way is not to peek into GLIBC internals, but to use a tool like Valgrind or AddressSanitizer, either of which is likely to point you straight at the problem.
Update:
Valgrind crashes ...
You should try building the latest Valgrind version from source, and if that still crashes, report the crash to Valgrind developers.
Chances are the Valgrind problem is already fixed, and building new Valgrind and testing your program with it will still be faster than trying to debug GLIBC internals (heap corruption bugs are notoriously difficult to find by program inspection or debugging).
AddressSanitizer, I thought it was a clang only tool -- I do not think it is available for linux.
Two points:
Clang works just fine on Linux, I use it almost every day,
Recent GCC versions have an equivalent -fsanitize=address option.
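As a minimal, deliberately buggy sketch of the kind of overrun that tramples the neighbouring chunk's malloc metadata and later surfaces as errors like "corrupted double-linked list": built with g++ -g -fsanitize=address (or run under Valgrind), the faulty write itself is reported instead:

#include <cstdlib>

int main() {
    int* data = static_cast<int*>(std::malloc(16 * sizeof(int)));
    for (int i = 0; i <= 16; ++i)   // off-by-one: i == 16 writes past the block
        data[i] = i;
    std::free(data);                // glibc may only detect the damage here or later
    return 0;
}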
There are ways to debug heap overruns without valgrind.
One way is to use a malloc debug library such as Electric Fence. It will make your program crash exactly at the moment of accessing an illegal address in the heap.
The other way is to use the built-in debug capabilities of GNU malloc. See man mcheck. If you call mcheck_pedantic before the first call to malloc, then every memory block is checked at every allocation. This is very slow but does allow you to isolate the fault.
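A minimal sketch of that mcheck approach (glibc-specific; the callback name is made up). mcheck_pedantic() has to be installed before anything has been malloc'd, which is why its return value is checked:

#include <mcheck.h>
#include <cstdlib>
#include <cstdio>

// Called by glibc when a heap consistency check fails.
static void heap_abort(enum mcheck_status status) {
    std::fprintf(stderr, "heap consistency error: %d\n", static_cast<int>(status));
    std::abort();
}

int main() {
    if (mcheck_pedantic(heap_abort) != 0) {
        std::fprintf(stderr, "mcheck_pedantic failed (something already allocated?)\n");
        return 1;
    }
    char* p = static_cast<char*>(std::malloc(8));
    p[8] = 'x';        // one-byte overrun into the malloc bookkeeping
    std::free(p);      // the pedantic check should flag the corruption here
    return 0;
}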

How to increase probability of Linux core dumps matching symbols?

I have a very complex cross-platform application. Recently my team and I have been running stress tests and have encountered several crashes (and core dumps accompanying them). Some of these core dumps are very precise, and show me the exact location where the crash occurred with around 10 or more stack frames. Others sometimes have just one stack frame with ?? being the only symbol!
What I'd like to know is:
Is there a way to increase the probability of core dumps pointing in the right direction?
Why isn't the number of stack frames reported consistent?
Any best-practice advice for managing core dumps?
Here's how I compile the binaries (in release mode):
Compiler and platform: g++ with glibc-2.3.2-95.50 on CentOS 3.6 x86_64 -- This helps me maintain compatibility with older versions of Linux.
All files are compiled with the -g flag.
Debug symbols are stripped from the final binary and saved in a separate file.
When I have a core dump, I use GDB with the executable which created the core, and the symbols file. GDB never complains that there's a mismatch between the core/binary/symbols.
Yet I sometimes get core dumps with no symbols at all! It's understandable that I'm linking against non-debug versions of libstdc++ and libgcc, but it would be nice if the stack trace at least showed where in my code the faulty call originated (even if it ultimately ends in ??).
Others sometimes have just one stack frame with "??" being the only symbol!
There can be many reasons for that, among others:
the stack frame was trashed (overwritten)
EBP/RBP (on x86/x64) is currently not holding any meaningful value — this can happen e.g. in units compiled with -fomit-frame-pointer or asm units that do so
Note that the second point may occur simply by, for example, glibc being compiled in such a way. Having the debug info for such system libraries installed could mitigate this (something like what the glibc-debug{info,source} packages are on openSUSE).
gdb has more control over the program than glibc, so glibc's backtrace call would naturally be unable to print a backtrace if gdb cannot do so either.
But shipping the source would be much easier :-)
As an alternative, on a glibc system, you could use the backtrace function call (or backtrace_symbols or backtrace_symbols_fd) and filter out the results yourself, so only the symbols belonging to your own code are displayed. It's a bit more work, but then, you can really tailor it to your needs.
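A minimal sketch of that backtrace approach (glibc's execinfo.h). Symbol names show up only for symbols in the dynamic symbol table, so link with -rdynamic; and, like gdb, it stops where frame/unwind information runs out:

#include <execinfo.h>
#include <cstdio>
#include <cstdlib>

// Prints the current call chain; filter the lines for your own binary's name
// if you only want to show frames from your own code.
void dump_own_backtrace() {
    void* frames[64];
    int depth = backtrace(frames, 64);
    char** symbols = backtrace_symbols(frames, depth);
    if (symbols == nullptr)
        return;
    for (int i = 0; i < depth; ++i)
        std::printf("%s\n", symbols[i]);
    std::free(symbols);   // backtrace_symbols allocates the array with malloc
}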
Have you tried installing debugging symbols of the various libraries that you are using? For example, my distribution (Ubuntu) provides libc6-dbg, libstdc++6-4.5-dbg, libgcc1-dbg etc.
If you're building with optimisation enabled (e.g. -O2), then the compiler can blur the boundary between stack frames, for example by inlining. I'm not sure that this would cause backtraces with just one stack frame, but in general the rule is to expect great debugging difficulty, since the code you are looking at in the core dump has been modified and so does not necessarily correspond to your source.

Why does the call stack in a dump file sometimes not look correct?

Recently we had a production issue where the application froze. We tried to break in and analyse the dump file, but unfortunately the call stack in the dump file does not look right, and it is hard to track down the cause of the freeze.
Two reasons why a call stack might look incorrect:
The stack might be corrupted. If the stack was corrupted for some reason (for instance, due to an overflow of a buffer which was allocated on the stack), all the stack frames are destroyed. This makes it impossible to compute the list of callers.
The symbols you use (if any) might not be appropriate for the binary which crashed. You need to use the exact same symbols which were used when compiling the binary. A slight change to the source code can render all the symbols invalid.
If the application hung rather than crashed, try loading the dump into WinDbg and running !analyze -v -hang, or try using adplus in hang mode. This tries to determine the cause of the hang, which should give you a more meaningful call stack. The !locks command can be useful too if you have a deadlock, by showing you what is blocking on a resource.
If you call into the Windows APIs, which then call back into you on the same thread (via Windows message handlers, for instance), it is not uncommon for the operations in the Windows DLLs to use stack conventions that the debugger cannot interpret. There's no requirement that the stack be traceable at all times during the execution of a C/C++ function/method; the stack-related registers can be reused for other purposes, and the standard places for stashing stack information can be ignored. I see this a lot on Windows.

Resources