pthread_detach() causes SIGSEGV on 64 bit Linux - linux

Here is a description of my situation: I have to take care of the bug in our product. The thread is created as joinable , it must do its work, terminate and nobody will call pthread_join() for it. So the thread is created with JOINABLE attribute (by default) and before termination it calls the next code:
{ pthread_detach(pthread_self()); pthread_exit(NULL); }
It works like a charm on all 32 bit linux distros I met, but it causes SIGSEGV on 64 bit distros (Ubuntu 13.04 x86_64 and Debian). I didn't try with Slackware. Here is a core:
Core was generated by `IsaVM -s=1 -PrjPath="/home/taf/Linux_Fov_540148/Cmds" -stgMode=1 -PR -Failover'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f5911a7c009 in pthread_detach () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007f5911a7c009 in pthread_detach () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x000000000041310d in _kerCltDownloadThr (StartParams=0x6bfce0 <RESFOV>) at ./dker0clt.c:1258
#2 0x00007f5911a7ae9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x00007f591159f3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000000000 in ?? ()
I figured out how to fix this bug - I set CREATE_DETACHABLE attribute (with pthread_attr_setdetachstate()) for the thread before it is created and it works as expected.
But my question - is it a crime to call this code?
{ pthread_detach(pthread_self()); pthread_exit(NULL); }
Does pthread_detach() do something asynchronously after call and that causes pthread_exit() to bring problems? But the crash point is pthread_detach() not pthread_exit()! I don't understand the reason for this crash completely! Why does it work on 32 bits? Is it a race condition somewhere in the pthread implementation?
pthread_join() doesn't called for this thread.
Thanks in advance for any ideas.

A thread detaching itself does not feel right. It is normally responsibility of the thread that called pthread_create() which can create a detached thread if necessary.
It could be that the thread has already been detached. Because attempting to detach an already detached thread results in unspecified behaviour.
My top wild guesses would be:
The thread gets detached more than once. As a quick check I would try setting a breakpoint on pthread_detach in gdb to see whether duplicate thread ids gets passed in this function. If it is difficult to run your application under gdb, another option is to override pthread_create and pthread_detach and track thread ids to detect double detach. See http://hackerboss.com/overriding-system-functions-for-fun-and-profit/
Memory corruption. valgrind may help you detect memory corruption if it is possible to run your application under it. Alternatively, try instrumenting your application with run-time error checks by compiling with -fstack-protector-all, -fsanitize=address, -fsanitize=thread if you use gcc. clang compiler also have an array of options to detect such errors, see sanitizers on http://clang.llvm.org/docs/index.html.

I finished my research with approaches offered by a respectable #MaximYegorushkin. AddressSanitizer shows me one buffer obverflow in our product but it isn't related to my problem (I will definitely fix it later, it is always good to have such a wise tool to hunt the bugs). So decided to override all necessary pthread_xxx functions with LD_PRELOAD method. I run a simple test to be sure my library works as expected:
[HACK] Loading pthread hack.
Starting thread...!
[HACK] pthread_create: thread=7FAC6C86D700
Waiting for 2 seconds...
[HACK] pthread_self: thread=7FAC6C86D700
thread_func: thread id = 7FAC6C86D700
Thread: sin(3.26) = -0.121109
[HACK] pthread_self: thread=7FAC6C86D700
[HACK] pthread_detach: thread=7FAC6C86D700
Terminating...
All strings started from [HACK] are produced by my threadhack.so library.
Then I run my project with this library it points me exactly where the problem is:
Code executed: { pthread_detach(pthread_self()); pthread_exit(NULL); }
Debug traces:
[HACK] pthread_create: thread=7F403251CB00
.....
[HACK] pthread_self: thread=7F403251CB00
[HACK] pthread_detach: thread=3251CB00
So we see that pthread_self returns a good thread id, but pthread_detach received it already mangled (cut to 32 bit). How could this be? I generated assembler code for both my simple working test application as a reference and for my project:
Reference application:
call pthread_self
movq %rax, %rdi
call pthread_detach
movl $0, %edi
call pthread_exit
So we see here that movq instruction is used to copy 64 bit thread id (movq %rax, %rdi). OK, check what GCC generated for my project:
movl $0, %eax
call pthread_self
movl %eax, %edi
movl $0, %eax
call pthread_detach
movl $0, %edi
movl $0, %eax
call pthread_exit
Woa! We have two movl instructions (32 bit), one copies the least significant 32 bits (movl %eax, %edi) and instead of most significan part it always put zero! (movl $0, %eax). So this is a reason for the mangled thead id. I have no idea why the code is so different - compilation flags are the same. I saw this bug in GCC 4.7 I see this bug in GCC 4.8 (Latest package from the Ubuntu 13.10 x86_64).
So at least now I see what hapenning. Thanks to #Maxim and brilliant tools. I learned a new thing again.
P.S. I don't know how to submit a bug report to the GCC team. I can't reproduce the problem on a small simple application and I can't hand them my project because it is a proprietary software and I'm NDA-ed to not distribute it.

My guess is that you don't have the prototype for either pthread_detach or pthread_self in the code that invokes pthread_detach(pthread_self()); Without the prototype, the compiler will assume the argument is int (pthread_detach) or that the function returns an int (pthread_self).
Although thinking it through further, I'm more suspecting that pthread_self is the culprit being either undefined (returning an int) or defined incorrectly as returning an int. The compiler then correctly extends this to a 64 bit integer by adding the leading 32 bits of zero.

Related

On x64 Linux, what is the difference between syscall, int 0x80 and ret to exit a program?

I decided yesterday to learn assembly (NASM syntax) after years of C++ and Python and I'm already confused about the way to exit a program. It's mostly about ret because it's the suggested instruction on SASM IDE.
I'm speaking for main obviously. I don't care about x86 backward compatibility. Only the x64 Linux best way. I'm curious.
If you use printf or other libc functions, it's best to ret from main or call exit. (Which are equivalent; main's caller will call the libc exit function.)
If not, if you were only making other raw system calls like write with syscall, it's also appropriate and consistent to exit that way, but either way, or call exit are 100% fine in main.
If you want to work without libc at all, e.g. put your code under _start: instead of main: and link with ld or gcc -static -nostdlib, then you can't use ret. Use mov eax, 231 (__NR_exit_group) / syscall.
main is a real & normal function like any other (called with a valid return address), but _start (the process entry point) isn't. On entry to _start, the stack holds argc and argv, so trying to ret would set RIP=argc, and then code-fetch would segfault on that unmapped address. Nasm segmentation fault on RET in _start
System call vs. ret-from-main
Exiting via a system call is like calling _exit() in C - skip atexit() and libc cleanup, notably not flushing any buffered stdout output (line buffered on a terminal, full-buffered otherwise).
This leads to symptoms such as Using printf in assembly leads to empty output when piping, but works on the terminal (or if your output doesn't end with \n, even on a terminal.)
main is a function, called (indirectly) from CRT startup code. (Assuming you link your program normally, like you would a C program.) Your hand-written main works exactly like a compiler-generate C main function would. Its caller (__libc_start_main) really does do something like int result = main(argc, argv); exit(result);,
e.g. call rax (pointer passed by _start) / mov edi, eax / call exit.
So returning from main is exactly1 like calling exit.
Syscall implementation of exit() for a comparison of the relevant C functions, exit vs. _exit vs. exit_group and the underlying asm system calls.
C question: What is the difference between exit and return? is primarily about exit() vs. return, although there is mention of calling _exit() directly, i.e. just making a system call. It's applicable because C main compiles to an asm main just like you'd write by hand.
Footnote 1: You can invent a hypothetical intentionally weird case where it's different. e.g. you used stack space in main as your stdio buffer with sub rsp, 1024 / mov rsi, rsp / ... / call setvbuf. Then returning from main would involve putting RSP above that buffer, and __libc_start_main's call to exit could overwrite some of that buffer with return addresses and locals before execution reached the fflush cleanup. This mistake is more obvious in asm than C because you need leave or mov rsp, rbp or add rsp, 1024 or something to point RSP at your return address.
In C++, return from main runs destructors for its locals (before global/static exit stuff), exit doesn't. But that just means the compiler makes asm that does more stuff before actually running the ret, so it's all manual in asm, like in C.
The other difference is of course the asm / calling-convention details: exit status in EAX (return value) or EDI (first arg), and of course to ret you have to have RSP pointing at your return address, like it was on function entry. With call exit you don't, and you can even do a conditional tailcall of exit like jne exit. Since it's a noreturn function, you don't really need RSP pointing at a valid return address. (RSP should be aligned by 16 before a call, though, or RSP%16 = 8 before a tailcall, matching the alignment after call pushes a return address. It's unlikely that exit / fflush cleanup will do any alignment-required stores/loads to the stack, but it's a good habit to get this right.)
(This whole footnote is about ret vs. call exit, not syscall, so it's a bit of a tangent from the rest of the answer. You can also run syscall without caring where the stack-pointer points.)
SYS_exit vs. SYS_exit_group raw system calls
The raw SYS_exit system call is for exiting the current thread, like pthread_exit().
(eax=60 / syscall, or eax=1 / int 0x80).
SYS_exit_group is for exiting the whole program, like _exit.
(eax=231 / syscall, or eax=252 / int 0x80).
In a single-threaded program you can use either, but conceptually exit_group makes more sense to me if you're going to use raw system calls. glibc's _exit() wrapper function actually uses the exit_group system call (since glibc 2.3). See Syscall implementation of exit() for more details.
However, nearly all the hand-written asm you'll ever see uses SYS_exit1. It's not "wrong", and SYS_exit is perfectly acceptable for a program that didn't start more threads. Especially if you're trying to save code size with xor eax,eax / inc eax (3 bytes in 32-bit mode) or push 60 / pop rax (3 bytes in 64-bit mode), while push 231/pop rax would be even larger than mov eax,231 because it doesn't fit in a signed imm8.
Note 1: (Usually actually hard-coding the number, not using __NR_... constants from asm/unistd.h or their SYS_... names from sys/syscall.h)
And historically, it's all there was. Note that in unistd_32.h, __NR_exit has call number 1, but __NR_exit_group = 252 wasn't added until years later when the kernel gained support for tasks that share virtual address space with their parent, aka threads started by clone(2). This is when SYS_exit conceptually became "exit current thread". (But one could easily and convincingly argue that in a single-threaded program, SYS_exit does still mean exit the whole program, because it only differs from exit_group if there are multiple threads.)
To be honest, I've never used eax=252 / int 0x80 in anything, only ever eax=1. It's only in 64-bit code where I often use mov eax,231 instead of mov eax,60 because neither number is "simple" or memorable the way 1 is, so might as well be a cool guy and use the "modern" exit_group way in my single-threaded toy program / experiment / microbenchmark / SO answer. :P (If I didn't enjoy tilting at windmills, I wouldn't spend so much time on assembly, especially on SO.)
And BTW, I usually use NASM for one-off experiments so it's inconvenient to use pre-defined symbolic constants for call numbers; with GCC to preprocess a .S before running GAS you can make your code self-documenting with #include <sys/syscall.h> so you can use mov $SYS_exit_group, %eax (or $__NR_exit_group), or mov eax, __NR_exit_group with .intel_syntax noprefix.
Don't use the 32-bit int 0x80 ABI in 64-bit code:
What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? explains what happens if you use the COMPAT_IA32_EMULATION int 0x80 ABI in 64-bit code.
It's totally fine for just exiting, as long as your kernel has that support compiled in, otherwise it will segfault just like any other random int number like int 0x7f. (e.g. on WSL1, or people that built custom kernels and disabled that support.)
But the only reason you'd do it that way in asm would be so you could build the same source file with nasm -felf32 or nasm -felf64. (You can't use syscall in 32-bit code, except on some AMD CPUs which have a 32-bit version of syscall. And the 32-bit ABI uses different call numbers anyway so this wouldn't let the same source be useful for both modes.)
Related:
Why am I allowed to exit main using ret? (CRT startup code calls main, you're not returning directly to the kernel.)
Nasm segmentation fault on RET in _start - you can't ret from _start
Using printf in assembly leads to empty output when piping, but works on the terminal stdout buffer (not) flushing with raw system call exit
Syscall implementation of exit() call exit vs. mov eax,60/syscall (_exit) vs. mov eax,231/syscall (exit_group).
Can't call C standard library function on 64-bit Linux from assembly (yasm) code - modern Linux distros config GCC in a way that call exit or call puts won't link with nasm -felf64 foo.asm && gcc foo.o.
Is main() really start of a C++ program? - Ciro's answer is a deep dive into how glibc + its CRT startup code actually call main (including x86-64 asm disassembly in GDB), and shows the glibc source code for __libc_start_main.
Linux x86 Program Start Up
or - How the heck do we get to main()? 32-bit asm, and more detail than you'll probably want until you're a lot more comfortable with asm, but if you've ever wondered why CRT runs so much code before getting to main, that covers what's happening at a level that's a couple steps up from using GDB with starti (stop at the process entry point, e.g. in the dynamic linker's _start) and stepi until you get to your own _start or main.
https://stackoverflow.com/tags/x86/info lots of good links about this and everything else.

Linux exit function

I am trying to understand linux syscalls mechanism. I am reading a book and it in the book it says that exit function look like that(with gdb):
mov $0x0,%ebx
mov $0x1,%eax
80 int $0x80
I understand that this is a syscall to exit, but in my Debian it looks like that:
jmp *0x8049698
push $0x8
jmp 0x80482c0
maybe can someone explain me why it's not the same? When I try to do disas on 0x80482c0
gdb prints me:
No function contains specified address.
Also, can someone give me a good reference to Linux Internals material(as Windows internals)?
Thanks!
The function you most likely called is exit() from C Standard Library (see man 3 exit). This function is a library function which, in turn, calls SYS_exit system call, but not being a system call itself. You will not see that good looking int 0x80 code in your C program disassembly. All existing functions (exit(), syscall(), etc.) are called from some library, so your program is only doing call to that library, and those functions are not belong to your program.
If you want to see exactly that int 0x80 code -- you can inline that asm code in your C application. But this is considered a bad practice, though, as your code become architecture-dependent (only applicable to x86 architecture, in your case).
can someone give me a good reference to Linux Internals material
The code itself is the best up-to-date reference. All books are more or less outdated. Also look into Documentation/ directory in kernel sources.

Linux assembly x86 | trying to get stack value, syntax error

I have been messing around with linux assembly on an x86 machine,
Basically my question is: I have pushed couple values into the stack moved the stack pointer into the base pointer and moved a value of 8 into a register to get a pushed value and in the end i wanted to get the value and put it into %ebx for the system call so i would get the value, but it seems to get an error. no clue why.
Error is: junk (%ebp) after register
Example:
.section .data
.section .text
.globl _start
_start:
pushl $50
pushl $20
movl %esp,%ebp
movl $8,%edx
movl %edx(%ebp),%ebx ## Supposed to be return value at system termination // PROBLEM HERE
movl $1,%eax ## System call
int $0x80 # Terminate program
I think part of the problem might be that in x86 the stack actually grows downwards, not up. You're adding to the base pointer, which is giving junk, where you have to subtract from it. I don't have an x86 machine so I can't test this, but have you tried something like movl -%edx(%ebp),%ebx?
Oops, I reversed the direction of the operands in my head. In this case, your stack looks like this:
1952 - ???
1948 - 20
1944 - 50 <- ebp <- esp
So when you take ebp+8, you aren't getting 20, you're getting address 1952, and you don't know what that contains.
Check out the links in https://stackoverflow.com/tags/x86/info. I updated them recently, and added the info about using gdb to single-step asm.
What do you mean "get an error"? Segmentation fault? Syntax error? (The normal syntax is (%ebp, %edx). Only numeric-constant displacements go outside the parens, e.g. -4(%ebp, %edx))
Also, if you're going to use stack frame pointers at all, do the mov %esp, %ebp after pushing any registers you want to preserve, but before pushing args to any functions you're going to call. However, there's no need to use %ebp that way at all, though. gcc defaults to -fomit-frame-pointer since 4.4 I think. It can make it easier to keep track of where your local variables are, if you're pushing/popping stuff.
You might want to just start with 64bit asm, instead of messing around with the obsolete x86 args-on-the-stack ABI.
This just made me think of what's probably wrong with your code. You're probably getting a segfault. (But you didn't say if it was that, syntax error, or something else.) Because you probably built your code in 64bit mode. Build a 32bit binary, or change your code to use %rsp.
You might want to just start with 64bit asm, instead of messing around with the obsolete x86 args-on-the-stack ABI.
This just made me think of what's probably wrong with your code. You're probably getting a segfault. (But you didn't say if it was that, syntax error, or something else.) Because you probably built your code in 64bit mode. Build a 32bit binary, or change your code to use %rsp.

x86 reserved EFLAGS bit 1 == 0: how can this happen?

I'm using the Win32 API to stop/start/inspect/change thread state. Generally works pretty well. Sometimes it fails, and I'm trying to track down the cause.
I have one thread that is forcing context switches on other threads by:
thread stop
fetch processor state into windows context block
read thread registers from windows context block to my own context block
write thread registers from another context block into windows context block
restart thread
This works remarkably well... but ... very rarely, context switches seem to fail.
(Symptom: my multithread system blows sky high executing a strange places with strange register content).
The context control is accomplished by:
if ((suspend_count=SuspendThread(WindowsThreadHandle))<0)
{ printf("TimeSlicer Suspend Thread failure");
...
}
...
Context.ContextFlags = (CONTEXT_INTEGER | CONTEXT_CONTROL | CONTEXT_FLOATING_POINT);
if (!GetThreadContext(WindowsThreadHandle,&Context))
{ printf("Context fetch failure");
...
}
call ContextSwap(&Context); // does the context swap
if (ResumeThread(WindowsThreadHandle)<0)
{ printf("Thread resume failure");
...
}
None of the print statements ever get executed. I conclude that Windows thinks the context operations all happened reliably.
Oh, yes, I do know when a thread being stopped is not computing [e.g., in a system function] and won't attempt to stop/context switch it. I know this because each thread that does anything other-than-computing sets a thread specific "don't touch me" flag, while it is doing other-than-computing. (Device driver programmers will recognize this as the equivalent of "interrupt disable" instructions).
So, I wondered about the reliability of the content of the context block.
I added a variety of sanity tests on various register values pulled out of the context block; you can actually decide that ESP is OK (within bounds of the stack area defined in the TIB), PC is in the program that I expect or in a system call, etc. No surprises here.
I decided to check that the condition code bits (EFLAGS) were being properly read out; if this were wrong, it would cause a switched task to take a "wrong branch" when its state was restored. So I added the following code to verify that the purported EFLAGS register contains stuff that only look like EFLAGS according to the Intel reference manual (http://en.wikipedia.org/wiki/FLAGS_register).
mov eax, Context.EFlags[ebx] ; ebx points to Windows Context block
mov ecx, eax ; check that we seem to have flag bits
and ecx, 0FFFEF32Ah ; where we expect constant flag bits to be
cmp ecx, 000000202h ; expected state of constant flag bits
je #f
breakpoint ; trap if unexpected flag bit status
##:
On my Win 7 AMD Phenom II X6 1090T (hex core),
it traps occasionally with a breakpoint, with ECX = 0200h. Fails same way on my Win 7 Intel i7 system. I would ignore this,
except it hints the EFLAGS aren't being stored correctly, as I suspected.
According to my reading of the Intel (and also the AMD) reference manuals, bit 1 is reserved and always has the value "1". Not what I see here.
Obviously, MS fills the context block by doing complicated things on a thread stop. I expect them to store the state accurately. This bit isn't stored correctly.
If they don't store this bit correctly, what else don't they store?
Any explanations for why the value of this bit could/should be zero sometimes?
EDIT: My code dumps the registers and the stack on catching a breakpoint.
The stack area contains the context block as a local variable.
Both EAX, and the value in the stack at the proper offset for EFLAGS in the context block contain the value 0244h. So the value in the context block really is wrong.
EDIT2: I changed the mask and comparsion values to
and ecx, 0FFFEF328h ; was FFEF32Ah where we expect flag bits to be
cmp ecx, 000000200h
This seems to run reliably with no complaints. Apparently Win7 doesn't do bit 1 of eflags right, and it appears not to matter.
Still interested in an explanation, but apparently this is not the source of my occasional context switch crash.
Microsoft has a long history of squirreling away a few bits in places that aren't really used. Raymond Chen has given plenty of examples, e.g. using the lower bit(s) of a pointer that's not byte-aligned.
In this case, Windows might have needed to store some of its thread context in an existing CONTEXT structure, and decided to use an otherwise unused bit in EFLAGS. You couldn't do anything with that bit anyway, and Windows will get that bit back when you call SetThreadContext.

x64 memset core, is passed buffer address truncated?

1. Problem Background
Recently a core dump occurred on one of our on-line search server. The core happens in memset() due to the attempt to write to an invalid address, and hence received the SIGSEGV signal. The following information is from dmsg:
is_searcher_ser[17405]: segfault at 000000002c32a668 rip 0000003da0a7b006 rsp 0000000053abc790 error 6
The environment of our on-line servers goes as follows:
OS: RHEL 5.3
Kernel: 2.6.18-131.el5.custom, x86_64 (64-bit)
GCC: 4.1.2 20080704 (Red Hat 4.1.2-44)
Glibc: glibc-2.5-49.6
The following is the relevant code snippet:
CHashMap<…>::CHashMap(…)
{
…
typedef HashEntry *HashEntryPtr;
m_ppEntry = new HashEntryPtr[m_nHashSize]; // m_nHashSize is 389 when core
assert(m_ppEntry != NULL);
memset(m_ppEntry, 0x0, m_nHashSize*sizeof(HashEntryPtr)); // Core in this memset() invocation
…
}
The assembly code of the above code is:
…
0x000000000091fe9e <+110>: callq 0x502638 <_Znam#plt> // new HashEntryPtr[m_nHashSize]
0x000000000091fea3 <+115>: mov 0xc(%rbx),%edx // Get the value of m_nHashSize
0x000000000091fea6 <+118>: mov %rax,%rdi // Put m_ppEntry pointer to %rdi for later memset invocation
0x000000000091fea9 <+121>: mov %rax,0x20(%rbx) // Store the pointer to m_ppEntry member variable(%rbx holds the this pointer)
0x000000000091fead <+125>: xor %esi,%esi // Generate 0
0x000000000091feaf <+127>: shl $0x3,%rdx // m_nHashSize*sizeof(HashEntryPtr)
0x000000000091feb3 <+131>: callq 0x502b38 <memset#plt> // Call the memset() function
…
In the core dump, the assembly of memset#plt is:
(gdb) disassemble 0x502b38
Dump of assembler code for function memset#plt:
0x0000000000502b38 <+0>: jmpq *0x771b92(%rip) # 0xc746d0 <memset#got.plt>
0x0000000000502b3e <+6>: pushq $0x53
0x0000000000502b43 <+11>: jmpq 0x5025f8
End of assembler dump.
(gdb) x/ag 0x0000000000502b3e+0x771b92
0xc746d0 <memset#got.plt>: 0x3da0a7acb0 <memset>
(gdb) disassemble 0x3da0a7acb0
Dump of assembler code for function memset:
0x0000003da0a7acb0 <+0>: cmp $0x1,%rdx
0x0000003da0a7acb4 <+4>: mov %rdi,%rax
…
For the above GDB analysis, we know that the address of memset() has been resolved in the relocation PLT table. That is to say, the first jmpq *0x771b92(%rip) will directly jump to the first instruction of function memset(). Besides, the program had run nearly one day on-line, the relocation address of memset() should have been already resolved earlier.
2. Weird phenomenon
This core fired at the instruction => 0x0000003da0a7b006 <+854>: mov %rdx,-0x8(%rdi) in the memset(). Actually this is the instruction in the memset() to set the 0 at the right begin position of the buffer which is the first parameter of memset().
When cored , in frame 0, the value of $rdi is 0x2c32a670 ,and $rax is 0x2c32a668. From the assembly analysis and off-line test, $rax should hold the source buffer of the memset, i.e., the first parameter of memset().
So, in our example, $rax should be same as the address of m_ppEntry, the value of which is stored in the this object (this pointer is stored in %rbx) first before it is zeroed by memset later. However, the value of m_ppEntry is 0x2ab02c32a668.
Then use info files GDB command to check, the address 0x2c32a668 is indeed invalid (not mapped), and address 0x2ab02c32a668 is a valid address.
3. Why it is weird?
The weird place of this core is that: If the real address of memset has been resolved already(very very probably), then there are only very few instructions between the operation to put the pointer value into m_ppEntry and the attempt to memset it. And actually the value of register $rax (holding the passed buffer address) are not changed at all during these instructions. So, how can m_ppEntry isn’t equal to $rax?
What is weird More is that: when core, the value of $rax (0x2c32a668) is actually the value of lower 4 bytes of m_ppEntry (0x2ab02c32a668). If there is indeed some relationship between the two values, is the m_ppEntry parameter passed to memset being truncated? However, the involved several instructions all use %rax, rather than %eax. By the way, I cannot reproduce this issue offline.
So,
1) Which address is valid? If 0x2c32a668 is valid? Is the heap corrupted just between the several instructions? And how to paraphrase that the value of m_ppEntry is 0x2ab02c32a668, and why the low 4 bytes of this two value is the same?
2) If 0x2ab02c32a668 is valid, why the address is truncated when passed into the 64-bit memset()? Under which condition this error will occur? I cannot reproduce this offline. Is this issue an known bug? I didn't find it through Google.
3) Or, is it due to some hardware or power issue to make the 4 higher bytes of %rdi passed to memset zeroed? (I’m very very reluctant to believe this).
At last, any comment on this core is appreciated.
Thanks,
Gary Hu
I'm assuming most of the time this code works fine, given your mention of one day's running.
I agree signals are worth inspecting, it does look suspiciously like pointer truncation is happening somewhere else.
Only other thing I'm thinking it could be an issue with the new. Is there any possibly that on occasion you could end up calling an overloaded new operator?
Also for completeness what is the declaration of m_ppEntry ?
I'm assuming you're using a no throw new otherwise the assert(m_ppEntry != NULL); would be meaningless.

Resources