any way to stop unaligned access from c++ standard library on x86_64? - linux

I am trying to check for any unaligned reads in my program. I enable unaligned access processor exception via (using x86_64 on g++ on linux kernel 3.19):
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x40000, %%rax \n"
"push %%rax \n"
"popf \n" ::: "rax");
I do an optional forced unaligned read which triggers the exception so i know its working. After i disable that I get an error in a piece of code which otherwise seems fine :
char fullpath[eMaxPath];
snprintf(fullpath, eMaxPath, "%s/%s", "blah", "blah2");
the stacktrace shows a failure via __memcpy_sse2 which leads me to suspect that the standard library is using sse to fulfill my memcpy but it doesnt realize that i have now made unaligned reads unacceptable.
Is my thinking correct and is there any way around this (ie can i make the standard library use an unaligned safe sprintf/memcpy instead)?
thanks

While I hate to discourage an admirable notion, you're playing with fire, my friend.
It's not merely sse2 access but any unaligned access. Even a simple int fetch.
Here's a test program:
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
void *intptr;
void
require_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x00040000, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
relax_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"andl $0xFFFBFFFF, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
msg(const char *str)
{
int len;
len = strlen(str);
write(1,str,len);
}
void
grab(void)
{
volatile int x = *(int *) intptr;
}
int
main(void)
{
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = malloc(256);
printf("intptr=%p\n",intptr);
// normal access to aligned pointer
msg("normal\n");
grab();
// enable alignment check exception
require_aligned();
// access aligned pointer under check [will be okay]
msg("aligned_norm\n");
grab();
// this grab will generate a bus error
intptr += 1;
msg("aligned_except\n");
grab();
return 0;
}
The output of this is:
intptr=0x1996010
normal
aligned_norm
aligned_except
Bus error (core dumped)
The program generated this simply because of an attempted 4 byte int fetch from address 0x1996011 [which is odd and not a multiple of 4].
So, once you turn on the AC [alignment check] flag, even simple things will break.
IMO, if you truly have some things that are not aligned optimally and are trying to find them, using printf, instrumenting your code with debug asserts, or using gdb with some special watch commands or breakpoints with condition statements are a better/safer way to go
UPDATE:
I a using my own custom allocator am preparing my code to run on an architecture that doesnt suport unaligned read/writes so I want to make sure my code will not break on that architecture.
Fair enough.
Side note: My curiousity has gotten the better of me as the only [major] arches I can recall [at the moment] that have this issue are Motorola mc68000 and older IBM mainframes (e.g. IBM System 370).
One practical reason for my curiosity is that for certain arches (e.g. ARM/android, MIPS) there are emulators available. You could rebuild the emulator from source, adding any extra checks, if needed. Otherwise, doing your debugging under the emulator might be an option.
I can trap unaligned read/write using either the asm , or via gdb but both cause SIGBUS which i cant continue from in gdb and im getting too many false positives from std library (in the sense that their implementation would be aligned access only on the target).
I can tell you from experience that trying to resume from a signal handler after this doesn't work too well [if at all]. Using gdb is the best bet if you can eliminate the false positives by having AC off in the standard functions [see below].
Ideally i guess i would like to use something like perf to show me callstacks that have misaligned but so far no dice.
This is possible, but you'd have to verify that perf even reports them. To see, you could try perf against my original test program above. If it works, the "counter" should be zero before and one after.
The cleanest way may be to pepper your code with "assert" macros [that can be compiled in and out with a -DDEBUG switch].
However, since you've gone to the trouble of laying the groundwork, it may be worthwhile to see if the AC method can work.
Since you're trying to debug your memory allocator, you only need AC on in your functions. If one of your functions calls libc, disable AC, call the function, and then reenable AC.
A memory allocator is fairly low level, so it can't rely on too many standard functions. Most standard functions rely on being able to call malloc. So, you might also want to consider a vtable interface to the rest of the [standard] library.
I've coded some slightly different AC bit set/clear functions. I put them into a .S function to eliminate inline asm hassles.
I've coded up a simple sample usage in three files.
Here are the AC set/clear functions:
// acbit/acops.S -- low level AC [alignment check] operations
#define AC_ON $0x00040000
#define AC_OFF $0xFFFFFFFFFFFBFFFF
.text
// acpush -- turn on AC and return previous mask
.globl acpush
acpush:
// get old mask
pushfq
pop %rax
mov %rax,%rcx // save to temp
or AC_ON,%ecx // turn on AC bit
// set new mask
push %rcx
popfq
ret
// acpop -- restore previous mask
.globl acpop
acpop:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
and AC_ON,%edi // isolate the AC bit in argument
or %edi,%eax // lay it in
// set new mask
push %rax
popfq
ret
// acon -- turn on AC
.globl acon
acon:
jmp acpush
// acoff -- turn off AC
.globl acoff
acoff:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
// set new mask
push %rax
popfq
ret
Here is a header file that has the function prototypes and some "helper" macros:
// acbit/acbit.h -- common control
#ifndef _acbit_acbit_h_
#define _acbit_acbit_h_
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
typedef unsigned long flags_t;
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#ifdef ACDEBUG
#define ACPUSH \
do { \
flags_t acflags = acpush()
#define ACPOP \
acpop(acflags); \
} while (0)
#define ACEXEC(_expr) \
do { \
acoff(); \
_expr; \
acon(); \
} while (0)
#else
#define ACPUSH /**/
#define ACPOP /**/
#define ACEXEC(_expr) _expr
#endif
void *intptr;
flags_t
acpush(void);
void
acpop(flags_t omsk);
void
acon(void);
void
acoff(void);
#endif
Here is a sample program that uses all of the above:
// acbit/acbit2 -- sample allocator
#include <acbit.h>
// mymalloc1 -- allocation function [raw calls]
void *
mymalloc1(size_t len)
{
flags_t omsk;
void *vp;
// function prolog
// NOTE: do this on all "outer" (i.e. API) functions
omsk = acpush();
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
acoff();
printf("%p\n",vp);
acon();
// function epilog
acpop(omsk);
return vp;
}
// mymalloc2 -- allocation function [using helper macros]
void *
mymalloc2(size_t len)
{
void *vp;
// function prolog
ACPUSH;
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
ACEXEC(printf("%p\n",vp));
// function epilog
ACPOP;
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc1(256);
intptr = mymalloc2(256);
x = *(int *) intptr;
return x;
}
UPDATE #2:
I like the idea of disabling the check before any library calls.
If the AC H/W works and you wrap the library calls, this should yield no false positives. The only exception would be if the compiler made a call to its internal helper library (e.g. doing 64 bit divide on 32 bit machine, etc.).
Be aware/wary of the ELF loader (e.g. /lib64/ld-linux-x86-64.so.2) doing dynamic symbol resolution on "lazy" bindings of symbols. Shouldn't be a big problem. There are ways to force the relocations to occur at program start, if necessary.
I have given up on perf for this as it seems to show me garbage even for a simple program like the one you wrote.
The perf code in the kernel is complex enough that it may be more trouble than it's worth. It has to communicate with the perf program with a pipe [IIRC]. Also, doing the AC thing is [probably] uncommon enough that the kernel's code path for this isn't well tested.
Im using ocperf with misalign_mem_ref.loads and stores but either way the counters dont correlate at all. If i record and look at the callstacks i get completely unrecognizable callstacks for these counters so i suspect either the counter doesnt work on my hardware/perf or it doesnt actually count what i think it counts
I honestly don't know if perf handles process reschedules to different cores properly [or not]--it should [IMO]. But, using sched_setaffinity to lock your program to a single core might help.
But, using the AC bit is more direct and definitive, IMO. I think that's the better bet.
I've talked about adding "assert" macros in the code.
I've coded some up below. These are what I'd use. They are independent of the AC code. But, they can also be used in conjunction with the AC bit code in a "belt and suspenders" approach.
These macros have one distinct advantage. When properly [and liberally] inserted, they can check for bad pointer values at the time they're calculated. That is, much closer to the true source of the problem.
With AC, you may calculate a bad value, but AC only kicks in [sometime] later, when the pointer is dereferenced [which may not happen in your API code at all].
I've done a complete memory allocator before [with overrun checks and "guard" pages, etc.]. The macro approach is what I used. And, if I had only one tool for this, it is the one I'd use. So, I recommend it above all else.
But, as I said, it can be used with the AC code as well.
Here's the header file for the macros:
// acbit/acptr.h -- alignment check macros
#ifndef _acbit_acptr_h_
#define _acbit_acptr_h_
#include <stdio.h>
typedef unsigned int u32;
// bit mask for given width
#define ACMSKOFWID(_wid) \
((1u << (_wid)) - 1)
#ifdef ACDEBUG2
#define ACPTR_MSK(_ptr,_msk) \
acptrchk(_ptr,_msk,__FILE__,__LINE__)
#else
#define ACPTR_MSK(_ptr,_msk) /**/
#endif
#define ACPTR_WID(_ptr,_wid) \
ACPTR_MSK(_ptr,(_wid) - 1)
#define ACPTR_TYPE(_ptr,_typ) \
ACPTR_WID(_ptr,sizeof(_typ))
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno);
// acptrchk -- check pointer for given alignment
static inline void
acptrchk(const void *ptr,u32 msk,const char *file,int lno)
{
#ifdef ACDEBUG2
#if ACDEBUG2 >= 2
printf("acptrchk: TRACE ptr=%p msk=%8.8X file='%s' lno=%d\n",
ptr,msk,file,lno);
#endif
if (((unsigned long) ptr) & msk)
acptrfault(ptr,file,lno);
#endif
}
#endif
Here's the "fault" handler function:
// acbit/acptr -- alignment check macros
#include <acbit/acptr.h>
#include <acbit/acbit.h>
#include <stdlib.h>
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno)
{
// NOTE: it's easy to set a breakpoint on this function
printf("acptrfault: pointer fault -- ptr=%p file='%s' lno=%d\n",
ptr,file,lno);
exit(1);
}
And, here's a sample program that uses them:
// acbit/acbit3 -- sample allocator using check macros
#include <acbit.h>
#include <acptr.h>
static double static_array[20];
// mymalloc3 -- allocation function
void *
mymalloc3(size_t len)
{
void *vp;
// get something valid
vp = static_array;
// do lots of stuff ...
printf("BEF vp=%p\n",vp);
// check pointer
// NOTE: these can be peppered after every [significant] calculation
ACPTR_TYPE(vp,double);
// do something bad ...
vp += 1;
printf("AFT vp=%p\n",vp);
// check again -- this should fault
ACPTR_TYPE(vp,double);
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc3(256);
x = *(int *) intptr;
return x;
}
Here's the program output:
BEF vp=0x601080
acptrchk: TRACE ptr=0x601080 msk=00000007 file='acbit/acbit3.c' lno=22
AFT vp=0x601081
acptrchk: TRACE ptr=0x601081 msk=00000007 file='acbit/acbit3.c' lno=29
acptrfault: pointer fault -- ptr=0x601081 file='acbit/acbit3.c' lno=29
I left off the AC code in this example. On your real target system, the dereference of intptr in main would/should fault on an alignment, but notice how much later that is in the execution timeline.

Like I commented on the question, that asm isn't safe, because it steps on the red-zone. Instead, use
asm volatile ("add $-128, %rsp\n\t"
"pushf\n\t"
"orl $0x40000, (%rsp)\n\t"
"popf\n\t"
"sub $-128, %rsp\n\t"
);
(-128 fits in a sign-extended 8bit immediate, but 128 doesn't, hence using add $-128 to subtract 128.)
Or in this case, there are dedicated instructions for toggling that bit, like there are for the carry and direction flags:
asm("stac"); // Set AC flag
asm("clac"); // Clear AC flag
It's a good idea to have some idea when your code uses unaligned memory. It's not necessarily a good idea to change your code to avoid it in every case. Sometimes better locality from packing data closer together is more valuable.
Given that you shouldn't necessarily aim to eliminate all unaligned accesses anyway, I don't think this is the easiest way to find the ones you do have.
modern x86 hardware has fast hardware support for unaligned loads/stores. When they don't span a cache-line boundary, or lead to store-forwarding stalls, there's literally no penalty.
What you might try is looking at performance counters for some of these events:
misalign_mem_ref.loads [Speculative cache line split load uops dispatched to L1 cache]
misalign_mem_ref.stores [Speculative cache line split STA uops dispatched to L1 cache]
ld_blocks.store_forward [This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load.
The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceeding smaller uncompleted store.
See the table of not supported store forwards in the Intel? 64 and IA-32 Architectures Optimization Reference Manual.
The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.]
(from ocperf.py list output on my Sandybridge CPU).
There are probably other ways to detect unaligned memory access. Maybe valgrind? I searched on valgrind detect unaligned and found this mailing list discussion from 13 years ago. Probably still not implemented.
The hand-optimized library functions do use unaligned accesses because it's the fastest way for them to get their job done. e.g. copying bytes 6 to 13 of a string to somewhere else can and should be done with just a single 8-byte load/store.
So yes, you would need special slow&safe versions of library functions.
If your code would have to execute extra instructions to avoid using unaligned loads, it's often not worth it. Esp. if the input is usually aligned, having a loop that does the first up-to-alignment-boundary elements before starting the main loop may just slow things down. In the aligned case, everything works optimally, with no overhead of checking alignment. In the unaligned case, things might work a few percent slower, but as long as the unaligned cases are rare, it's not worth avoiding them.
Esp. if it's not SSE code, since non-AVX legacy SSE can only fold loads into memory operands for ALU instructions when alignment is guaranteed.
The benefit of having good-enough hardware support for unaligned memory ops is that software can be faster in the aligned case. It can leave alignment-handling to hardware, instead of running extra instructions to handle pointers that are probably aligned. (Linus Torvalds had some interesting posts about this on the http://realworldtech.com/ forums, but they're not searchable so I can't find it.

You're not going to like it, but there is only one answer: don't link against the standard libraries. By changing that setting you have changed the ABI and the standard library doesn't like it. memcpy and friends are hand-written assembly so it's not a matter of compiler options to convince the compiler to do something else.

Related

RIP register doesn't understand valid memory address [duplicate]

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writting this simple assembly program (it's not supposed to really do anything)
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?
Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache (sum, sum + sizeof(sum)); // GNU C
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.
Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c hence the addresses near 401000
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on #AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4095ULL; // round down
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+2) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.
This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect -function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define ".word 0x00010001 \n"
Basically, c.nop instruction was not supported by my current assembler. So, I defined CNOP as the hex equivalent of c.nop with proper syntax and used inside asm, with which I was aware of.
.rept <NUM> .endr will basically, repeat the instruction NUM times.
This solution is working and verified.

Debugging in threading building Blocks

I would like to program in threading building blocks with tasks. But how does one do the debugging in practice?
In general the print method is a solid technique for debugging programs.
In my experience with MPI parallelization, the right way to do logging is that each thread print its debugging information in its own file (say "debug_irank" with irank the rank in the MPI_COMM_WORLD) so that the logical errors can be found.
How can something similar be achieved with TBB? It is not clear how to access the thread number in the thread pool as this is obviously something internal to tbb.
Alternatively, one could add an additional index specifying the rank when a task is generated but this makes the code rather complicated since the whole program has to take care of that.
First, get the program working with 1 thread. To do this, construct a task_scheduler_init as the first thing in main, like this:
#include "tbb/tbb.h"
int main() {
tbb::task_scheduler_init init(1);
...
}
Be sure to compile with the macro TBB_USE_DEBUG set to 1 so that TBB's checking will be enabled.
If the single-threaded version works, but the multi-threaded version does not, consider using Intel Inspector to spot race conditions. Be sure to compile with TBB_USE_THREADING_TOOLS so that Inspector gets enough information.
Otherwise, I usually first start by adding assertions, because the machine can check assertions much faster than I can read log messages. If I am really puzzled about why an assertion is failing, I use printfs and task ids (not thread ids). Easiest way to create a task id is to allocate one by post-incrementing a tbb::atomic<size_t> and storing the result in the task.
If I'm having a really bad day and the printfs are changing program behavior so that the error does not show up, I use "delayed printfs". Stuff the printf arguments in a circular buffer, and run printf on the records later after the failure is detected. Typically for the buffer, I use an array of structs containing the format string and a few word-size values, and make the array size a power of two. Then an atomic increment and mask suffices to allocate slots. E.g., something like this:
const size_t bufSize = 1024;
struct record {
const char* format;
void *arg0, *arg1;
};
tbb::atomic<size_t> head;
record buf[bufSize];
void recf(const char* fmt, void* a, void* b) {
record* r = &buf[head++ & bufSize-1];
r->format = fmt;
r->arg0 = a;
r->arg1 = b;
}
void recf(const char* fmt, int a, int b) {
record* r = &buf[head++ & bufSize-1];
r->format = fmt;
r->arg0 = (void*)a;
r->arg1 = (void*)b;
}
The two recf routines record the format and the values. The casting is somewhat abusive, but on most architectures you can print the record correctly in practice with printf(r->format, r->arg0, r->arg1) even if the the 2nd overload of recf created the record.
~
~

Confusing result from counting page fault in linux

I was writing programs to count the time of page faults in a linux system. More precisely, the time kernel execute the function __do_page_fault.
And somehow I wrote two global variables, named pfcount_at_beg and pfcount_at_end, which increase once when the function __do_page_fault is executed at different locations of the function.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;
static void __kprobes
__do_page_fault(...)
{
struct vm_area_sruct *vma;
... // VARIABLES DEFINITION
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
pfcount_at_beg++; // I add THIS
...
...
// ORIGINAL CODE OF THE FUNCTION
...
pfcount_at_end++; // I add THIS
}
I expected that the value of pfcount_at_end is smaller than the value of pfcount_at_beg.
Because, I think, every time kernel executes the instructions of code pfcount_at_end++, it must have executed pfcount_at_beg++(Every function starts at the very beginning of the code).
On the other hand, as there are many conditional return between these two lines of code.
However, the result turns out oppositely. The value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables through a self-defined syscall. And I wrote the user level program to call the system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
printk( KERN_INFO "total pf_at_beg%lu\ntotal pf_at_end%lu\n", pfcount_at_beg, pfcount_at_end)
return 0;
}
// user-level program
#include<linux/unistd.h>
#include<sys/syscall.h>
#define __NR_mysyscall 223
int main()
{
syscall(__NR_mysyscall);
return 0;
}
Is there anybody who knows what exactly happened during this?
Just now I modified the code, to make pfcount_at_beg and pfcount_at_end static. However the result did not change, i.e. the value of pfcount_at_end is larger than the value of pfcount_at_beg.
So possibly it might be caused by in-atomic operation of increment. Would it be better if I use read-write lock?
The ++ operator is not garanteed to be atomic, so your counters may suffer concurrent access and have incorrect values. You should protect your increment as a critical section, or use the atomic_t type defined in <asm/atomic.h>, and its related atomic_set() and atomic_add() functions (and a lot more).
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).

How does linux know when to allocate more pages to a call stack?

Given the program below, segfault() will (As the name suggests) segfault the program by accessing 256k below the stack. nofault() however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps I see the allocated stack space grows between the first and second call, this explains why segfault() doesn't crash afterwards - there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
char * x;
int a;
for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
}
void nofault(){
char * x;
int a;
sleep(20);
for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
sleep(20);
}
int main(){
nofault();
segfault();
}
The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from linux/arch/x86/mm.c:
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.

Two structs, one references another

typedef struct Radios_Frequencia {
char tipo_radio[3];
int qt_radio;
int frequencia;
}Radiof;
typedef struct Radio_Cidade {
char nome_cidade[30];
char nome_radio[30];
char dono_radio[3];
int numero_horas;
int audiencia;
Radiof *fre;
}R_cidade;
void Cadastrar_Radio(R_cidade**q){
printf("%d\n",i);
q[0]=(R_cidade*)malloc(sizeof(R_cidade));
printf("informa a frequencia da radio\n");
scanf("%d",&q[0]->fre->frequencia); //problem here
printf("%d\n",q[0]->fre->frequencia); // problem here
}
i want to know why this function void Cadastrar_Radio(R_cidade**q) does not print the data
You allocated storage for your primary structure but not the secondary one. Change
q[0]=(R_cidade*)malloc(sizeof(R_cidade));
to:
q[0]=(R_cidade*)malloc(sizeof(R_cidade));
q[0]->fre = malloc(sizeof(Radiof));
which will allocate both. Without that, there's a very good chance that fre will point off into never-never land (as in "you can never never tell what's going to happen since it's undefined behaviour).
You've allocated some storage, but you've not properly initialized any of it.
You won't get anything reliable to print until you put reliable values into the structures.
Additionally, as PaxDiablo also pointed out, you've allocated the space for the R_cidade structure, but not for the Radiof component of it. You're using scanf() to read a value into space that has not been allocated; that is not reliable - undefined behaviour at best, but most usually core dump time.
Note that although the two types are linked, the C compiler most certainly doesn't do any allocation of Radiof simply because R_cidade mentions it. It can't tell whether the pointer in R_cidade is meant to be to a single structure or the start of an array of structures, for example, so it cannot tell how much space to allocate. Besides, you might not want to initialize that structure every time - you might be happy to have left pointing nowhere (a null pointer) except in some special circumstances known only to you.
You should also verify that the memory allocation succeeded, or use a memory allocator that guarantees never to return a null or invalid pointer. Classically, that might be a cover function for the standard malloc() function:
#undef NDEBUG
#include <assert.h>
void *emalloc(size_t nbytes)
{
void *space = malloc(nbytes);
assert(space != 0);
return(space);
}
That's crude but effective. I use non-crashing error reporting routines of my own devising in place of the assert:
#include "stderr.h"
void *emalloc(size_t nbytes)
{
void *space = malloc(nbytes);
if (space == 0)
err_error("Out of memory\n");
return space;
}

Resources