How does linux know when to allocate more pages to a call stack?

How does linux know when to allocate more pages to a call stack? - linux

Given the program below, segfault() will (As the name suggests) segfault the program by accessing 256k below the stack. nofault() however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps I see the allocated stack space grows between the first and second call, this explains why segfault() doesn't crash afterwards - there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
char * x;
int a;
for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
}
void nofault(){
char * x;
int a;
sleep(20);
for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
sleep(20);
}
int main(){
nofault();
segfault();
}

The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from linux/arch/x86/mm.c:
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.

Related

How to test/validate the vmalloc guard page is working in Linux

I am studying stack guarding in Linux. I found that the Linux kernel VMAP_STACK config parameter is using the guard page mechanism along with vmalloc() to provide stack guarding.
I am trying to find a way to check how this guard page is working in Linux kernel. I googled and checked the kernel code, but did NOT find out the codes.
A further question is how to verify the guarded stack.
I had a kernel module to underrun/overflow a process's kernel stack, like this
static void shoot_kernel_stack(void)
{
unsigned char *ptr = task_stack_page(current);
unsigned char *tmp = NULL;
tmp = ptr + THREAD_SIZE + PAGE_SIZE + 0;
// tmp -= 0x100;
memset(tmp, 0xB4, 0x10); // Underrun
}
I really get the kernel panic like below,
[ 8006.358354] BUG: stack guard page was hit at 00000000e8dc2d98 (stack is 00000000cff0f921..00000000653b24a9)
[ 8006.361276] kernel stack overflow (page fault): 0000 [#1] SMP PTI
Is this the right way to verify the guard page?

The VMAP_STACK Linux feature is used to map the kernel stack of the threads into VMA. By virtually mapping stack, the underlying physical pages don't need to be contiguous. It is possible to detect cross-page overflows by adding guard pages. As the VMA are followed by a guard (unless the VM_NO_GUARD flag is passed at allocation time), the stacks allocated in those area benefits from it for stack overflow detection.
ALLOCATION
The thread stacks are allocated at thread creation time with alloc_thread_stack_node() in kernel/fork.c. When VMAP_STACK is activated, the stacks are cached because according to the comments in the source code:
vmalloc() is a bit slow, and calling vfree() enough times will force a TLB
flush. Try to minimize the number of calls by caching stacks.
The kernel stack size is THREAD_SIZE (equal to 4 pages on x86_64 platforms). The source code of the allocation invoked at thread creation time is:
static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
{
#ifdef CONFIG_VMAP_STACK
void *stack;
int i;
[...] // <----- Part which gets a previously cached stack. If no stack in cache
// the following is run to allocate a brand new stack:
/*
* Allocated stacks are cached and later reused by new threads,
* so memcg accounting is performed manually on assigning/releasing
* stacks to tasks. Drop __GFP_ACCOUNT.
*/
stack = __vmalloc_node_range(THREAD_SIZE, THREAD_ALIGN,
VMALLOC_START, VMALLOC_END,
THREADINFO_GFP & ~__GFP_ACCOUNT,
PAGE_KERNEL,
0, node, __builtin_return_address(0));
[...]
__vmalloc_node_range() is defined in mm/vmalloc.c. This calls __get_vm_area_node(). As the latter is not passed the VM_NO_GUARD flags, an additional page is added at the end of the allocated area. This is the guard page of the VMA:
static struct vm_struct *__get_vm_area_node(unsigned long size,
unsigned long align, unsigned long flags, unsigned long start,
unsigned long end, int node, gfp_t gfp_mask, const void *caller)
{
struct vmap_area *va;
struct vm_struct *area;
BUG_ON(in_interrupt());
size = PAGE_ALIGN(size);
if (unlikely(!size))
return NULL;
if (flags & VM_IOREMAP)
align = 1ul << clamp_t(int, get_count_order_long(size),
PAGE_SHIFT, IOREMAP_MAX_ORDER);
area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
if (unlikely(!area))
return NULL;
if (!(flags & VM_NO_GUARD)) // <----- A GUARD PAGE IS ADDED
size += PAGE_SIZE;
va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
if (IS_ERR(va)) {
kfree(area);
return NULL;
}
setup_vmalloc_vm(area, va, flags, caller);
return area;
}
OVERFLOW MANAGEMENT
The stack overflow management is architecture dependent (i.e. source code located in arch/...). The links referenced below provide some pointers on some architecture dependent implementations.
For x86_64 platform, the overflow check is done upon the page fault interruption which triggers the following chain of function calls: do_page_fault()->__do_page_fault()->do_kern_addr_fault()->bad_area_nosemaphore()->no_context() function defined in arch/x86/mm/fault.c. In no_context(), there is a part dedicated to VMAP_STACK management for the detection of the stack under/overflow:
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
unsigned long address, int signal, int si_code)
{
struct task_struct *tsk = current;
unsigned long flags;
int sig;
[...]
#ifdef CONFIG_VMAP_STACK
/*
* Stack overflow? During boot, we can fault near the initial
* stack in the direct map, but that's not an overflow -- check
* that we're in vmalloc space to avoid this.
*/
if (is_vmalloc_addr((void *)address) &&
(((unsigned long)tsk->stack - 1 - address < PAGE_SIZE) ||
address - ((unsigned long)tsk->stack + THREAD_SIZE) < PAGE_SIZE)) {
unsigned long stack = __this_cpu_ist_top_va(DF) - sizeof(void *);
/*
* We're likely to be running with very little stack space
* left. It's plausible that we'd hit this condition but
* double-fault even before we get this far, in which case
* we're fine: the double-fault handler will deal with it.
*
* We don't want to make it all the way into the oops code
* and then double-fault, though, because we're likely to
* break the console driver and lose most of the stack dump.
*/
asm volatile ("movq %[stack], %%rsp\n\t"
"call handle_stack_overflow\n\t"
"1: jmp 1b"
: ASM_CALL_CONSTRAINT
: "D" ("kernel stack overflow (page fault)"),
"S" (regs), "d" (address),
[stack] "rm" (stack));
unreachable();
}
#endif
[...]
}
In the above code, when a stack under/overflow is detected, the handle_stack_overflow() function defined in arch/x86/kernel/traps.c) is called:
#ifdef CONFIG_VMAP_STACK
__visible void __noreturn handle_stack_overflow(const char *message,
struct pt_regs *regs,
unsigned long fault_address)
{
printk(KERN_EMERG "BUG: stack guard page was hit at %p (stack is %p..%p)\n",
(void *)fault_address, current->stack,
(char *)current->stack + THREAD_SIZE - 1);
die(message, regs, 0);
/* Be absolutely certain we don't return. */
panic("%s", message);
}
#endif
The example error message "BUG: stack guard page was hit at..." pointed out in the question comes from the above handle_stack_overflow() function.
FROM YOUR EXAMPLE MODULE
When VMAP_STACK is defined, the stack_vm_area field of the task descriptor appears and is set with the VMA address associated to the stack. From there, it is possible to grab interesting information:
struct task_struct *task;
#ifdef CONFIG_VMAP_STACK
struct vm_struct *vm;
#endif // CONFIG_VMAP_STACK
task = current;
printk("\tKernel stack: 0x%lx\n", (unsigned long)(task->stack));
printk("\tStack end magic: 0x%lx\n", *(unsigned long *)(task->stack));
#ifdef CONFIG_VMAP_STACK
vm = task->stack_vm_area;
printk("\tstack_vm_area->addr = 0x%lx\n", (unsigned long)(vm->addr));
printk("\tstack_vm_area->nr_pages = %u\n", vm->nr_pages);
printk("\tstack_vm_area->size = %lu\n", vm->size);
#endif // CONFIG_VMAP_STACK
printk("\tLocal var in stack: 0x%lx\n", (unsigned long)(&task));
The nr_pages field is the number of pages without the additional guard page. The last unsigned long at the top of the stack is set with STACK_END_MAGIC defined in include/uapi/linux/magic.h as:
#define STACK_END_MAGIC 0x57AC6E9D
REFERENCES:
Preventing stack guard-page hopping
arm64: VMAP_STACK support
CONFIG_VMAP_STACK: Use a virtually-mapped stack
Linux 4.9 On x86_64 To Support Vmapped Stacks
A Decade of Linux Kernel Vulnerabilities

any way to stop unaligned access from c++ standard library on x86_64?

I am trying to check for any unaligned reads in my program. I enable unaligned access processor exception via (using x86_64 on g++ on linux kernel 3.19):
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x40000, %%rax \n"
"push %%rax \n"
"popf \n" ::: "rax");
I do an optional forced unaligned read which triggers the exception so i know its working. After i disable that I get an error in a piece of code which otherwise seems fine :
char fullpath[eMaxPath];
snprintf(fullpath, eMaxPath, "%s/%s", "blah", "blah2");
the stacktrace shows a failure via __memcpy_sse2 which leads me to suspect that the standard library is using sse to fulfill my memcpy but it doesnt realize that i have now made unaligned reads unacceptable.
Is my thinking correct and is there any way around this (ie can i make the standard library use an unaligned safe sprintf/memcpy instead)?
thanks

While I hate to discourage an admirable notion, you're playing with fire, my friend.
It's not merely sse2 access but any unaligned access. Even a simple int fetch.
Here's a test program:
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
void *intptr;
void
require_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"or $0x00040000, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
relax_aligned(void)
{
asm volatile("pushf \n"
"pop %%rax \n"
"andl $0xFFFBFFFF, %%eax \n"
"push %%rax \n"
"popf \n" ::: "rax");
}
void
msg(const char *str)
{
int len;
len = strlen(str);
write(1,str,len);
}
void
grab(void)
{
volatile int x = *(int *) intptr;
}
int
main(void)
{
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = malloc(256);
printf("intptr=%p\n",intptr);
// normal access to aligned pointer
msg("normal\n");
grab();
// enable alignment check exception
require_aligned();
// access aligned pointer under check [will be okay]
msg("aligned_norm\n");
grab();
// this grab will generate a bus error
intptr += 1;
msg("aligned_except\n");
grab();
return 0;
}
The output of this is:
intptr=0x1996010
normal
aligned_norm
aligned_except
Bus error (core dumped)
The program generated this simply because of an attempted 4 byte int fetch from address 0x1996011 [which is odd and not a multiple of 4].
So, once you turn on the AC [alignment check] flag, even simple things will break.
IMO, if you truly have some things that are not aligned optimally and are trying to find them, using printf, instrumenting your code with debug asserts, or using gdb with some special watch commands or breakpoints with condition statements are a better/safer way to go
UPDATE:
I a using my own custom allocator am preparing my code to run on an architecture that doesnt suport unaligned read/writes so I want to make sure my code will not break on that architecture.
Fair enough.
Side note: My curiousity has gotten the better of me as the only [major] arches I can recall [at the moment] that have this issue are Motorola mc68000 and older IBM mainframes (e.g. IBM System 370).
One practical reason for my curiosity is that for certain arches (e.g. ARM/android, MIPS) there are emulators available. You could rebuild the emulator from source, adding any extra checks, if needed. Otherwise, doing your debugging under the emulator might be an option.
I can trap unaligned read/write using either the asm , or via gdb but both cause SIGBUS which i cant continue from in gdb and im getting too many false positives from std library (in the sense that their implementation would be aligned access only on the target).
I can tell you from experience that trying to resume from a signal handler after this doesn't work too well [if at all]. Using gdb is the best bet if you can eliminate the false positives by having AC off in the standard functions [see below].
Ideally i guess i would like to use something like perf to show me callstacks that have misaligned but so far no dice.
This is possible, but you'd have to verify that perf even reports them. To see, you could try perf against my original test program above. If it works, the "counter" should be zero before and one after.
The cleanest way may be to pepper your code with "assert" macros [that can be compiled in and out with a -DDEBUG switch].
However, since you've gone to the trouble of laying the groundwork, it may be worthwhile to see if the AC method can work.
Since you're trying to debug your memory allocator, you only need AC on in your functions. If one of your functions calls libc, disable AC, call the function, and then reenable AC.
A memory allocator is fairly low level, so it can't rely on too many standard functions. Most standard functions rely on being able to call malloc. So, you might also want to consider a vtable interface to the rest of the [standard] library.
I've coded some slightly different AC bit set/clear functions. I put them into a .S function to eliminate inline asm hassles.
I've coded up a simple sample usage in three files.
Here are the AC set/clear functions:
// acbit/acops.S -- low level AC [alignment check] operations
#define AC_ON $0x00040000
#define AC_OFF $0xFFFFFFFFFFFBFFFF
.text
// acpush -- turn on AC and return previous mask
.globl acpush
acpush:
// get old mask
pushfq
pop %rax
mov %rax,%rcx // save to temp
or AC_ON,%ecx // turn on AC bit
// set new mask
push %rcx
popfq
ret
// acpop -- restore previous mask
.globl acpop
acpop:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
and AC_ON,%edi // isolate the AC bit in argument
or %edi,%eax // lay it in
// set new mask
push %rax
popfq
ret
// acon -- turn on AC
.globl acon
acon:
jmp acpush
// acoff -- turn off AC
.globl acoff
acoff:
// get current mask
pushfq
pop %rax
and AC_OFF,%rax // clear current AC bit
// set new mask
push %rax
popfq
ret
Here is a header file that has the function prototypes and some "helper" macros:
// acbit/acbit.h -- common control
#ifndef _acbit_acbit_h_
#define _acbit_acbit_h_
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <malloc.h>
typedef unsigned long flags_t;
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#ifdef ACDEBUG
#define ACPUSH \
do { \
flags_t acflags = acpush()
#define ACPOP \
acpop(acflags); \
} while (0)
#define ACEXEC(_expr) \
do { \
acoff(); \
_expr; \
acon(); \
} while (0)
#else
#define ACPUSH /**/
#define ACPOP /**/
#define ACEXEC(_expr) _expr
#endif
void *intptr;
flags_t
acpush(void);
void
acpop(flags_t omsk);
void
acon(void);
void
acoff(void);
#endif
Here is a sample program that uses all of the above:
// acbit/acbit2 -- sample allocator
#include <acbit.h>
// mymalloc1 -- allocation function [raw calls]
void *
mymalloc1(size_t len)
{
flags_t omsk;
void *vp;
// function prolog
// NOTE: do this on all "outer" (i.e. API) functions
omsk = acpush();
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
acoff();
printf("%p\n",vp);
acon();
// function epilog
acpop(omsk);
return vp;
}
// mymalloc2 -- allocation function [using helper macros]
void *
mymalloc2(size_t len)
{
void *vp;
// function prolog
ACPUSH;
// do lots of stuff ...
vp = NULL;
// encapsulate standard library calls like this to prevent false positives:
ACEXEC(printf("%p\n",vp));
// function epilog
ACPOP;
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc1(256);
intptr = mymalloc2(256);
x = *(int *) intptr;
return x;
}
UPDATE #2:
I like the idea of disabling the check before any library calls.
If the AC H/W works and you wrap the library calls, this should yield no false positives. The only exception would be if the compiler made a call to its internal helper library (e.g. doing 64 bit divide on 32 bit machine, etc.).
Be aware/wary of the ELF loader (e.g. /lib64/ld-linux-x86-64.so.2) doing dynamic symbol resolution on "lazy" bindings of symbols. Shouldn't be a big problem. There are ways to force the relocations to occur at program start, if necessary.
I have given up on perf for this as it seems to show me garbage even for a simple program like the one you wrote.
The perf code in the kernel is complex enough that it may be more trouble than it's worth. It has to communicate with the perf program with a pipe [IIRC]. Also, doing the AC thing is [probably] uncommon enough that the kernel's code path for this isn't well tested.
Im using ocperf with misalign_mem_ref.loads and stores but either way the counters dont correlate at all. If i record and look at the callstacks i get completely unrecognizable callstacks for these counters so i suspect either the counter doesnt work on my hardware/perf or it doesnt actually count what i think it counts
I honestly don't know if perf handles process reschedules to different cores properly [or not]--it should [IMO]. But, using sched_setaffinity to lock your program to a single core might help.
But, using the AC bit is more direct and definitive, IMO. I think that's the better bet.
I've talked about adding "assert" macros in the code.
I've coded some up below. These are what I'd use. They are independent of the AC code. But, they can also be used in conjunction with the AC bit code in a "belt and suspenders" approach.
These macros have one distinct advantage. When properly [and liberally] inserted, they can check for bad pointer values at the time they're calculated. That is, much closer to the true source of the problem.
With AC, you may calculate a bad value, but AC only kicks in [sometime] later, when the pointer is dereferenced [which may not happen in your API code at all].
I've done a complete memory allocator before [with overrun checks and "guard" pages, etc.]. The macro approach is what I used. And, if I had only one tool for this, it is the one I'd use. So, I recommend it above all else.
But, as I said, it can be used with the AC code as well.
Here's the header file for the macros:
// acbit/acptr.h -- alignment check macros
#ifndef _acbit_acptr_h_
#define _acbit_acptr_h_
#include <stdio.h>
typedef unsigned int u32;
// bit mask for given width
#define ACMSKOFWID(_wid) \
((1u << (_wid)) - 1)
#ifdef ACDEBUG2
#define ACPTR_MSK(_ptr,_msk) \
acptrchk(_ptr,_msk,__FILE__,__LINE__)
#else
#define ACPTR_MSK(_ptr,_msk) /**/
#endif
#define ACPTR_WID(_ptr,_wid) \
ACPTR_MSK(_ptr,(_wid) - 1)
#define ACPTR_TYPE(_ptr,_typ) \
ACPTR_WID(_ptr,sizeof(_typ))
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno);
// acptrchk -- check pointer for given alignment
static inline void
acptrchk(const void *ptr,u32 msk,const char *file,int lno)
{
#ifdef ACDEBUG2
#if ACDEBUG2 >= 2
printf("acptrchk: TRACE ptr=%p msk=%8.8X file='%s' lno=%d\n",
ptr,msk,file,lno);
#endif
if (((unsigned long) ptr) & msk)
acptrfault(ptr,file,lno);
#endif
}
#endif
Here's the "fault" handler function:
// acbit/acptr -- alignment check macros
#include <acbit/acptr.h>
#include <acbit/acbit.h>
#include <stdlib.h>
// acptrfault -- pointer alignment fault
void
acptrfault(const void *ptr,const char *file,int lno)
{
// NOTE: it's easy to set a breakpoint on this function
printf("acptrfault: pointer fault -- ptr=%p file='%s' lno=%d\n",
ptr,file,lno);
exit(1);
}
And, here's a sample program that uses them:
// acbit/acbit3 -- sample allocator using check macros
#include <acbit.h>
#include <acptr.h>
static double static_array[20];
// mymalloc3 -- allocation function
void *
mymalloc3(size_t len)
{
void *vp;
// get something valid
vp = static_array;
// do lots of stuff ...
printf("BEF vp=%p\n",vp);
// check pointer
// NOTE: these can be peppered after every [significant] calculation
ACPTR_TYPE(vp,double);
// do something bad ...
vp += 1;
printf("AFT vp=%p\n",vp);
// check again -- this should fault
ACPTR_TYPE(vp,double);
return vp;
}
int
main(void)
{
int x;
setlinebuf(stdout);
// minimum alignment from malloc is [usually] 8
intptr = mymalloc3(256);
x = *(int *) intptr;
return x;
}
Here's the program output:
BEF vp=0x601080
acptrchk: TRACE ptr=0x601080 msk=00000007 file='acbit/acbit3.c' lno=22
AFT vp=0x601081
acptrchk: TRACE ptr=0x601081 msk=00000007 file='acbit/acbit3.c' lno=29
acptrfault: pointer fault -- ptr=0x601081 file='acbit/acbit3.c' lno=29
I left off the AC code in this example. On your real target system, the dereference of intptr in main would/should fault on an alignment, but notice how much later that is in the execution timeline.

Like I commented on the question, that asm isn't safe, because it steps on the red-zone. Instead, use
asm volatile ("add $-128, %rsp\n\t"
"pushf\n\t"
"orl $0x40000, (%rsp)\n\t"
"popf\n\t"
"sub $-128, %rsp\n\t"
);
(-128 fits in a sign-extended 8bit immediate, but 128 doesn't, hence using add $-128 to subtract 128.)
Or in this case, there are dedicated instructions for toggling that bit, like there are for the carry and direction flags:
asm("stac"); // Set AC flag
asm("clac"); // Clear AC flag
It's a good idea to have some idea when your code uses unaligned memory. It's not necessarily a good idea to change your code to avoid it in every case. Sometimes better locality from packing data closer together is more valuable.
Given that you shouldn't necessarily aim to eliminate all unaligned accesses anyway, I don't think this is the easiest way to find the ones you do have.
modern x86 hardware has fast hardware support for unaligned loads/stores. When they don't span a cache-line boundary, or lead to store-forwarding stalls, there's literally no penalty.
What you might try is looking at performance counters for some of these events:
misalign_mem_ref.loads [Speculative cache line split load uops dispatched to L1 cache]
misalign_mem_ref.stores [Speculative cache line split STA uops dispatched to L1 cache]
ld_blocks.store_forward [This event counts loads that followed a store to the same address, where the data could not be forwarded inside the pipeline from the store to the load.
The most common reason why store forwarding would be blocked is when a load's address range overlaps with a preceeding smaller uncompleted store.
See the table of not supported store forwards in the Intel? 64 and IA-32 Architectures Optimization Reference Manual.
The penalty for blocked store forwarding is that the load must wait for the store to complete before it can be issued.]
(from ocperf.py list output on my Sandybridge CPU).
There are probably other ways to detect unaligned memory access. Maybe valgrind? I searched on valgrind detect unaligned and found this mailing list discussion from 13 years ago. Probably still not implemented.
The hand-optimized library functions do use unaligned accesses because it's the fastest way for them to get their job done. e.g. copying bytes 6 to 13 of a string to somewhere else can and should be done with just a single 8-byte load/store.
So yes, you would need special slow&safe versions of library functions.
If your code would have to execute extra instructions to avoid using unaligned loads, it's often not worth it. Esp. if the input is usually aligned, having a loop that does the first up-to-alignment-boundary elements before starting the main loop may just slow things down. In the aligned case, everything works optimally, with no overhead of checking alignment. In the unaligned case, things might work a few percent slower, but as long as the unaligned cases are rare, it's not worth avoiding them.
Esp. if it's not SSE code, since non-AVX legacy SSE can only fold loads into memory operands for ALU instructions when alignment is guaranteed.
The benefit of having good-enough hardware support for unaligned memory ops is that software can be faster in the aligned case. It can leave alignment-handling to hardware, instead of running extra instructions to handle pointers that are probably aligned. (Linus Torvalds had some interesting posts about this on the http://realworldtech.com/ forums, but they're not searchable so I can't find it.

You're not going to like it, but there is only one answer: don't link against the standard libraries. By changing that setting you have changed the ABI and the standard library doesn't like it. memcpy and friends are hand-written assembly so it's not a matter of compiler options to convince the compiler to do something else.

Why my implementation of sbrk system call does not work?

I try to write a very simple os to better understand the basic principles. And I need to implement user-space malloc. So at first I want to implement and test it on my linux-machine.
At first I have implemented the sbrk() function by the following way
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work. Instead, when I use sbrk given by os, this works fine.
I have tryed to use another implementation of the sbrk()
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk invoked from this function
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code also does not work, on the line (***) I get segmentation fault.
Where is a problem ?
Thanks All. I have solved my problem using new implementation of the sbrk. The given code works fine.
void* __sbrk__(intptr_t increment)
{
void *new, *old = (void *)syscall(__NR_brk, 0);
new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old :
(void *)-1;
}

The first sbrk should probably have a long increment. And you forgot to handle errors (and set errno)
The second sbrk function does not change the address space (as sbrk does). You could use mmap to change it (but using mmap instead of sbrk won't update the kernel's view of data segment end as sbrk does). You could use cat /proc/1234/maps to query the address space of process of pid 1234). or even read (e.g. with fopen&fgets) the /proc/self/maps from inside your program.
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application, every syscall is an atomic elementary operation (done by a single SYSENTER machine instruction with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, the GNU libc is a free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
At last compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers, if you wanted to). Perhaps read the x86/64-ABI specification and the Linux Assembly HowTo.

How to restore stack frame in gcc?

I want to build my own checkpoint library. I'm able to save the stack frame to a file calling checkpoint_here(stack pointer) and that can be restored later via calling recover(stack pointer) function.
Here is my problem: I'm able to jump from function recover(sp) to main() but the stack frame gets changed(stack pointer,frame pointer). So I want to jump to main from recover(sp) just after is checkpoint_here(sp) called retaining the stack frame of main(). I've tried setjmp/longjmp but can't make them working.Thanks in anticipation.
//jmp_buf env;
void *get_pc () { return __builtin_return_address(1); }
void checkpoint_here(register int *sp){
//printf("%p\n",get_pc());
void *pc;
pc=get_pc();//getting the program counter of caller
//printf("pc inside chk:%p\n",pc);
size_t i;
long size;
//if(!setjmp(env)){
void *l=__builtin_frame_address(1);//frame pointer of caller
int fd=open("ckpt1.bin", O_WRONLY|O_CREAT,S_IWUSR|S_IRUSR|S_IRGRP);
int mfd=open("map.bin", O_WRONLY|O_CREAT,S_IWUSR|S_IRUSR|S_IRGRP);
size=(long)l-(long)sp;
//printf("s->%ld\n",size);
write(mfd,&size,sizeof(long)); //writing the size of the data to be written to file.
write(mfd,&pc,sizeof(long)); //writing program counter of the caller.
write(fd,(char *)sp,(long)l-(long)sp); //writing local variables on the stack frame of caller.
close(fd);
close(mfd);
//}
}
void recover(register int *sp){
//int dummy;
long size;
void *pc;
//printf("old %p\n",sp);
/*void *newsp=(void *)&dummy;
printf("new %p old %p\n",newsp,sp);
if(newsp>=(void *)sp)
recover(sp);*/
int fd=open("ckpt1.bin", O_RDONLY,0644);
int mfd=open("map.bin", O_RDONLY,0644);
read(mfd,&size,sizeof(long)); //reading size of data written
read(mfd,&pc,sizeof(long)); //reading program counter
read(fd,(char *)sp,size); //reading local variables
close(mfd);
close(fd);
//printf("got->%ld\n",size);
//longjmp(env,1);
void (*foo)(void) =pc;
foo(); //trying to jump to main just after checkpoint_here() is called.
//asm volatile("jmp %0" : : "r" (pc));
}
int main(int argc,char **argv)
{
register int *sp asm ("rsp");
if(argc==2){
if(strcmp(argv[1],"recover")==0){
recover(sp); //restoring local variables
exit(0);
}
}
int a, b, c;
float s, area;
char x='a';
printf("Enter the sides of triangle\n");
//printf("\na->%p b->%p c->%p s->%p area->%p\n",&a,&b,&c,&s,&area);
scanf("%d %d %d",&a,&b,&c);
s = (a+b+c)/2.0;
//printf("%p\n",get_pc());
checkpoint_here(sp); //saving stack
//printf("here\n");
//printf("nsp->%p\n",sp);
area = (s*(s-a)*(s-b)*(s-c));
printf("%d %d %d %f %f %d\n",a,b,c,s,area,x);
printf("Area of triangle = %f\n", area);
printf("%f\n",s);
return 0;
}

You cannot do that in general.
You might try non-portable extended asm instructions (to restore %rsp and %rbp on x86-64). You could use longjmp (see setjmp(3) and longjmp(3)) -since longjmp is restoring the stack pointer- assuming you understand the implementation details.
The stack has, thanks to ASLR, a "random", non reproducible, location. In other words, if you start twice the same program, the stack pointer of main would be different. And in C some stack frames contain a pointer into other stack frames. See also this answer.
Read more about application checkpointing (see this) and study the source code (or use) BLCR.
You could perhaps restrict the C code to be used (e.g. if you generate the C code) and you might perhaps extend GCC using MELT for your needs. This is a significant amount of work.
BTW, MELT is (internally also) generating C++ code, with restricted stack frames which could be easily checkpointable. You could take that as an inspiration source.
Read also about x86 calling conventions and garbage collection (since a precise GC has to scan local pointers, which is similar to your needs).

Segmentation Fault With Multiple Threads

I get error segmentation fault because of the free() at the end of this equation...
don't I have to free the temporary variable *stck? Or since it's a local pointer and
was never assigned a memory space via malloc, the compiler cleans it up for me?
void * push(void * _stck)
{
stack * stck = (stack*)_stck;//temp stack
int task_per_thread = 0; //number of push per thread
pthread_mutex_lock(stck->mutex);
while(stck->head == MAX_STACK -1 )
{
pthread_cond_wait(stck->has_space,stck->mutex);
}
while(task_per_thread <= (MAX_STACK/MAX_THREADS)&&
(stck->head < MAX_STACK) &&
(stck->item < MAX_STACK)//this is the amount of pushes
//we want to execute
)
{ //store actual value into stack
stck->list[stck->head]=stck->item+1;
stck->head = stck->head + 1;
stck->item = stck->item + 1;
task_per_thread = task_per_thread+1;
}
pthread_mutex_unlock(stck->mutex);
pthread_cond_signal(stck->has_element);
free(stck);
return NULL;
}

Edit: You totally changed the question so my old answer doesn't really make sense anymore. I'll try to answer the new one (old answer still below) but for reference, next time please just ask a new question instead of changing an old one.
stck is a pointer that you set to point to the same memory as _stck points to. A pointer does not imply allocating memory, it just points to memory that is already (hopefully) allocated. When you do for example
char* a = malloc(10); // Allocate memory and save the pointer in a.
char* b = a; // Just make b point to the same memory block too.
free(a); // Free the malloc'd memory block.
free(b); // Free the same memory block again.
you free the same memory twice.
-- old answer
In push, you're setting stck to point to the same memory block as _stck, and at the end of the call you free stack (thereby calling free() on your common stack once from each thread)
Remove the free() call and, at least for me, it does not crash anymore. Deallocating the stack should probably be done in main() after joining all the threads.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string