Linux kernel system call returns -1 instead of values in {-1, ..., -256}

I'm a kernel newbie and am facing a weird issue. I have written a proof-of-concept calculator syscall, and while it works fine for most computations, it returns -1 whenever the subtraction result is between -1 and -256. If someone can shed some light on what could be happening, I would appreciate it. Below is the syscall code.
SYSCALL_DEFINE3(calc, int, a, int, b, char, op) {
	int res_int = 0;

	switch (op) {
	case '+':
		res_int = a + b;
		break;
	case '-':
		res_int = a - b;
		break;
	case '*':
		res_int = a * b;
		break;
	case '/':
		res_int = (a * 1000) / b; /* fixed-point: result scaled by 1000 */
		break;
	}
	/* res_int is an int, so the format specifier must be %d, not %ld */
	printk(KERN_INFO "KERNEL CALC RESULT : %d %c %d = %d\n", a, op, b, res_int);
	return res_int;
}
EDIT:
Kernel version: Android Linux Kernel 3.10.xxx.
Platform: Nexus7 ARM EABI.
What I don't understand is why it is failing. errno is not useful at all, since it is being set to -res_int. Also, I don't understand why it fails only when res_int is in {-1, ..., -256}.
a=1200, b=1300, op='-' => res_int=-100 is an example case: printk prints -100, but in my userspace app I receive -1.
It looks like whenever res_int is in {-1, ..., -256}, the syscall returns -1 and errno is set to -res_int.
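The transcript below comes from a small test program along these lines (a sketch; __NR_calc is a hypothetical syscall number, and the real program parses the 'num1 oper num2' input and scales the operands by 1000):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_calc 361 /* hypothetical: whatever number the syscall was assigned */

int main(void)
{
	/* 2.2 - 2.45, with operands scaled to integers by the caller */
	long res_int = syscall(__NR_calc, 2200, 2450, '-');
	printf("returned from syscall with res_int = %ld\n", res_int);
	printf("errno = %d, strerror(errno) = %s\n", errno, strerror(errno));
	printf("Calculator result = %f\n", res_int / 1000.0);
	return 0;
}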
root@android:/data/local # ./calc
Please enter request in 'num1 oper num2' format:
2.45 - 2.2
returned from syscall with res_int = 250
errno = 22, strerror(errno) = Invalid argument
Calculator result = 0.250000
root@android:/data/local # ./calc
Please enter request in 'num1 oper num2' format:
2.2 - 2.45
returned from syscall with res_int = -1
errno = 250, strerror(errno) = Unknown error 250
Calculator result = -0.001000
root@android:/data/local #

You didn't mention your kernel version and platform but, FWIW, from the kernel's perspective a syscall usually returns zero on success and a negative number on error. This is the general ABI convention between kernel developers and application developers, so that both sides can interpret the result the same way.
But user programs almost always make system calls through a C library wrapper, and the C API must comply with its own standard. That is, the C library checks the return value from kernel space and sets errno based on it. (Note that errno lives in user space; the kernel knows nothing about it.)
Say syscall foo(), from the user's point of view, returns -1 on failure and sets errno to indicate the error (ERR1 for failure 1, ERR2 for failure 2, etc.). The kernel implementation would then return -ERR1 or -ERR2 accordingly, and the C library would compare the return value against zero: if it is negative, it returns -1 to the caller and sets errno to ERR1/ERR2.
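In pseudo-C, the wrapper logic is roughly this (a sketch, not glibc's actual source; raw_syscall() stands in for the architecture-specific trap, and the exact error band is libc- and architecture-specific):

#include <errno.h>

long raw_syscall(long nr, long a, long b, long c); /* stand-in for the trap */

long syscall_wrapper(long nr, long a, long b, long c)
{
	long ret = raw_syscall(nr, a, b, c);
	if (ret < 0 && ret >= -4095) { /* glibc x86-64 treats -1 .. -4095 as errors */
		errno = -ret;          /* e.g. kernel's -EINVAL becomes errno = 22 */
		return -1;
	}
	return ret;                    /* anything else is a valid result */
}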
Hope this helps.
Update: I checked the x86_64 syscall code of glibc; this may help explain what you are seeing:
.text
ENTRY (syscall)
movq %rdi, %rax /* Syscall number -> rax. */
movq %rsi, %rdi /* shift arg1 - arg5. */
movq %rdx, %rsi
movq %rcx, %rdx
movq %r8, %r10
movq %r9, %r8
movq 8(%rsp),%r9 /* arg6 is on the stack. */
syscall /* Do the system call. */
cmpq $-4095, %rax /* Check %rax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
L(pseudo_end):
ret /* Return to caller. */
PSEUDO_END (syscall)
File: "sysdeps/unix/sysv/linux/x86_64/syscall.S"
Linus said he will make sure that no syscall returns a value in -1 .. -4095 as a valid result, so we can safely test against -4095.
Update: so I guess it's your C library that converts the return value. Your platform's ABI may define that a syscall returning a value in {-1, ..., -256} indicates failure, so the C wrapper sets the return value to -1 in such cases and sets errno accordingly.
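The usual way around this is to keep the return value purely for status and pass the computed result out through a user-space pointer instead, e.g. (a sketch based on the question's syscall; the extra result parameter is the change):

SYSCALL_DEFINE4(calc, int, a, int, b, char, op, int __user *, result)
{
	int res_int;

	switch (op) {
	case '+': res_int = a + b; break;
	case '-': res_int = a - b; break;
	case '*': res_int = a * b; break;
	case '/': res_int = (a * 1000) / b; break;
	default:  return -EINVAL;
	}

	/* Copy the result out; the return value now only signals success/failure. */
	if (copy_to_user(result, &res_int, sizeof(res_int)))
		return -EFAULT;
	return 0;
}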

Related

What is the point of atomic.Load and atomic.Store

In Go's memory model nothing is stated about atomics and their relation to memory fencing, although many internal packages seem to rely on the memory ordering they could provide if atomics created memory fences around them. See this issue for details.
Since I could not understand how it really works, I went to the sources, in particular src/runtime/internal/atomic/atomic_amd64.go, and found the following implementations of Load and Store:
//go:nosplit
//go:noinline
func Load(ptr *uint32) uint32 {
	return *ptr
}
Store is implemented in asm_amd64.s in the same package.
TEXT runtime∕internal∕atomic·Store(SB), NOSPLIT, $0-12
	MOVQ ptr+0(FP), BX
	MOVL val+8(FP), AX
	XCHGL AX, 0(BX)
	RET
Both look as if they had nothing to do with parallelism.
I did look into other architectures, but the implementations seem to be equivalent.
However, if atomics are indeed weak and provide no memory ordering guarantees, then the code below could fail, but it does not.
In addition, I tried replacing the atomic calls with plain assignments, but it still produces a consistent and "successful" result in both cases.
func try() {
	var a, b int32
	go func() {
		// atomic.StoreInt32(&a, 1)
		// atomic.StoreInt32(&b, 1)
		a = 1
		b = 1
	}()
	for {
		// if n := atomic.LoadInt32(&b); n == 1 {
		if n := b; n == 1 {
			if a != 1 {
				panic("fail")
			}
			break
		}
		runtime.Gosched()
	}
}

func main() {
	n := 1000000000
	for i := 0; i < n; i++ {
		try()
	}
}
The next thought was that the compiler does some magic to provide ordering guarantees. So below is the listing of the variant with the atomic Store and Load left uncommented. The full listing is available on the pastebin.
// Anonymous function implementation with atomic calls inlined
TEXT "".try.func1(SB) gofile../path/atomic.go
atomic.StoreInt32(&a, 1)
0x816 b801000000 MOVL $0x1, AX
0x81b 488b4c2408 MOVQ 0x8(SP), CX
0x820 8701 XCHGL AX, 0(CX)
atomic.StoreInt32(&b, 1)
0x822 b801000000 MOVL $0x1, AX
0x827 488b4c2410 MOVQ 0x10(SP), CX
0x82c 8701 XCHGL AX, 0(CX)
}()
0x82e c3 RET
// Important "cycle" part of try() function
0x6ca e800000000 CALL 0x6cf [1:5]R_CALL:runtime.newproc
for {
0x6cf eb12 JMP 0x6e3
runtime.Gosched()
0x6d1 90 NOPL
checkTimeouts()
0x6d2 90 NOPL
mcall(gosched_m)
0x6d3 488d0500000000 LEAQ 0(IP), AX [3:7]R_PCREL:runtime.gosched_m·f
0x6da 48890424 MOVQ AX, 0(SP)
0x6de e800000000 CALL 0x6e3 [1:5]R_CALL:runtime.mcall
if n := atomic.LoadInt32(&b); n == 1 {
0x6e3 488b442420 MOVQ 0x20(SP), AX
0x6e8 8b08 MOVL 0(AX), CX
0x6ea 83f901 CMPL $0x1, CX
0x6ed 75e2 JNE 0x6d1
if a != 1 {
0x6ef 488b442428 MOVQ 0x28(SP), AX
0x6f4 833801 CMPL $0x1, 0(AX)
0x6f7 750a JNE 0x703
0x6f9 488b6c2430 MOVQ 0x30(SP), BP
0x6fe 4883c438 ADDQ $0x38, SP
0x702 c3 RET
As you can see, again no fences or locks are in place.
Note: all tests were done on x86_64 (i5-8259U).
The question:
So, is there any point in wrapping a simple pointer dereference in a function call, or is there some hidden meaning to it, and why do these atomics still work as memory barriers (if they do)?
I don't know Go at all, but it looks like the x86-64 implementations of .load() and .store() are sequentially-consistent. Presumably on purpose / for a reason!
//go:noinline on the load means the compiler can't reorder around a blackbox non-inline function, I assume. On x86 that's all you need for the load side of sequential-consistency, or acq-rel. A plain x86 mov load is an acquire load.
The compiler-generated code gets to take advantage of x86's strongly-ordered memory model, which is sequential consistency + a store buffer (with store forwarding), i.e. acq/rel. To recover sequential consistency, you only need to drain the store buffer after a release-store.
.store() is written in asm, loading its stack args and using xchg as a seq-cst store.
XCHG with memory has an implicit lock prefix which is a full barrier; it's an efficient alternative to mov+mfence to implement what C++ would call a memory_order_seq_cst store.
It flushes the store buffer before later loads and stores are allowed to touch L1d cache. Why does a std::atomic store with sequential consistency use XCHG?
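For comparison, here is what the same seq-cst pair looks like in C11 (a sketch; atomic_load/atomic_store default to memory_order_seq_cst, and on x86-64 current GCC/Clang typically emit a plain mov for the load and an xchg for the store):

#include <stdatomic.h>

atomic_int flag;

int load_flag(void)
{
	return atomic_load(&flag);  /* plain mov: loads are already acquire on x86 */
}

void store_flag(int value)
{
	atomic_store(&flag, value); /* xchg: drains the store buffer, like Go's Store */
}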
See https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
C/C++11 mappings to processors describes the sequences of instructions that implement relaxed load/store, acq/rel load/store, seq-cst load/store, and various barriers, on various ISAs, so you can recognize things like xchg with memory.
Does lock xchg have the same behavior as mfence? (TL:DR: yes except for maybe some corner cases with NT loads from WC memory, e.g. from video RAM). You may see a dummy lock add $0, (SP) used as an alternative to mfence in some code.
IIRC, AMD's optimization manual even recommends this. It's good on Intel as well, especially on Skylake where mfence was strengthened by microcode update to fully block out-of-order exec even of ALU instructions (like lfence) as well as memory reordering. (To fix an erratum with NT loads.)
https://preshing.com/20120913/acquire-and-release-semantics/

register_kprobe returns EINVAL (-22) error for instructions involving rip

I am trying to insert probes at different instructions in a function of a kernel module with kprobes.
But register_kprobe is returning an EINVAL (-22) error for the instruction addresses 0xffffffffa33c1085 and 0xffffffffa33c109b in the assembly code below (it passes for all other instruction addresses).
Instructions giving errors:
0xffffffffa33c1085 <test_increment+5>: mov 0x21bd(%rip),%eax # 0xffffffffa33c3248
0xffffffffa33c109b <test_increment+27>: mov %esi,0x21a7(%rip) # 0xffffffffa33c3248
I observed that both these instructions use the %rip register. I tried functions from other modules and saw the same error for instructions that use %rip.
Why is register_kprobe failing? Does it have any constraints involving %rip? Any help is appreciated.
The system has kernel 3.10.0-514 installed, on x86_64.
kprobe function:
kp = kzalloc(sizeof(struct kprobe), GFP_KERNEL);
kp->post_handler = exit_func;
kp->pre_handler = entry_func;
kp->addr = sym_addr;
atomic_set(&pcount, 0);

ret = register_kprobe(kp);
if (ret != 0) {
	printk(KERN_INFO "register_kprobe returned %d for %s\n", ret, str);
	kfree(kp);
	kp = NULL;
	return ret;
}
probed function:
int race = 0;

void test_increment(void)
{
	race++;
	printk(KERN_INFO "VALUE=%d\n", race);
	return;
}
assembly code:
crash> dis -l test_increment
0xffffffffa33c1080 <test_increment>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffffa33c1085 <test_increment+5>: mov 0x21bd(%rip),%eax # 0xffffffffa33c3248
0xffffffffa33c108b <test_increment+11>: push %rbp
0xffffffffa33c108c <test_increment+12>: mov $0xffffffffa33c2024,%rdi
0xffffffffa33c1093 <test_increment+19>: mov %rsp,%rbp
0xffffffffa33c1096 <test_increment+22>: lea 0x1(%rax),%esi
0xffffffffa33c1099 <test_increment+25>: xor %eax,%eax
0xffffffffa33c109b <test_increment+27>: mov %esi,0x21a7(%rip) # 0xffffffffa33c3248
0xffffffffa33c10a1 <test_increment+33>: callq 0xffffffff81659552 <printk>
0xffffffffa33c10a6 <test_increment+38>: pop %rbp
0xffffffffa33c10a7 <test_increment+39>: retq
Thanks
Turns out, register_kprobe does have limitations on instructions involving %rip-relative addressing on x86_64.
Here is a snippet of the __copy_instruction function code causing the error (call chain: register_kprobe -> prepare_kprobe -> arch_prepare_kprobe -> arch_copy_kprobe -> __copy_instruction):
#ifdef CONFIG_X86_64
	if (insn_rip_relative(&insn)) {
		s64 newdisp;
		u8 *disp;
		kernel_insn_init(&insn, dest);
		insn_get_displacement(&insn);
		/*
		 * The copied instruction uses the %rip-relative addressing
		 * mode. Adjust the displacement for the difference between
		 * the original location of this instruction and the location
		 * of the copy that will actually be run. The tricky bit here
		 * is making sure that the sign extension happens correctly in
		 * this calculation, since we need a signed 32-bit result to
		 * be sign-extended to 64 bits when it's added to the %rip
		 * value and yield the same 64-bit result that the sign-
		 * extension of the original signed 32-bit displacement would
		 * have given.
		 */
		newdisp = (u8 *) src + (s64) insn.displacement.value - (u8 *) dest;
		if ((s64) (s32) newdisp != newdisp) {
			pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
			pr_err("\tSrc: %p, Dest: %p, old disp: %x\n", src, dest, insn.displacement.value);
			return 0;
		}
		disp = (u8 *) dest + insn_offset_displacement(&insn);
		*(s32 *) disp = (s32) newdisp;
	}
#endif
http://elixir.free-electrons.com/linux/v3.10/ident/__copy_instruction
A new displacement value is calculated based on the new instruction address (where the original instruction is copied to). If that value doesn't fit into 32 bits, __copy_instruction returns 0, which results in the EINVAL error. Hence the failure.
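To make the overflow concrete (the source address and displacement come from the disassembly above; the destination is a hypothetical location for the copied instruction):

s64 src = (s64) 0xffffffffa33c1085;  /* original instruction address */
s64 disp = 0x21bd;                   /* original %rip-relative displacement */
s64 dest = (s64) 0xffffc90000001000; /* assumed address of the copy (illustrative) */
s64 newdisp = src + disp - dest;     /* 0x36ffa33c2242: does not fit in an s32 */

Any copy that lands more than ±2 GiB away from the referenced data at 0xffffffffa33c3248 overflows the signed 32-bit displacement.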
As a workaround, we can attach the kprobe to the previous or the next instruction instead, and use a post-handler or pre-handler accordingly (works for me).
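For instance, instead of probing the %rip-relative instruction at <test_increment+5>, probe the instruction just after it (a sketch; the offset is taken from the disassembly above):

kp->addr = (kprobe_opcode_t *)((unsigned long)sym_addr + 11); /* <test_increment+11>: push %rbp */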

In MSVC, why do InterlockedOr and InterlockedAnd generate a loop instead of a simple locked instruction?

On MSVC for x64 (19.10.25019),
InterlockedOr(&g, 1)
generates this code sequence:
prefetchw BYTE PTR ?g@@3JC
mov eax, DWORD PTR ?g@@3JC ; g
npad 3
$LL3@f:
mov ecx, eax
or ecx, 1
lock cmpxchg DWORD PTR ?g@@3JC, ecx ; g
jne SHORT $LL3@f
I would have expected the much simpler (and loopless):
mov eax, 1
lock or [?g@@3JC], eax
InterlockedAnd generates analogous code to InterlockedOr.
It seems wildly inefficient to need a loop for this operation. Why is this code generated?
(As a side note: the whole reason I was using InterlockedOr was to do an atomic load of the variable - I have since learned that InterlockedCompareExchange is the way to do this. It is odd to me that there is no InterlockedLoad(&x), but I digress...)
The documented contract for InterlockedOr has it returning the original value:
InterlockedOr
Performs an atomic OR operation on the specified LONG values. The function prevents more than one thread from using the same variable simultaneously.
LONG __cdecl InterlockedOr(
_Inout_ LONG volatile *Destination,
_In_ LONG Value
);
Parameters:
Destination [in, out]
A pointer to the first operand. This value will be replaced with the result of the operation.
Value [in]
The second operand.
Return value
The function returns the original value of the Destination parameter.
This is why the unusual code that you've observed is required. The compiler cannot simply emit an OR instruction with a LOCK prefix, because the OR instruction does not return the previous value. Instead, it has to use the odd workaround with LOCK CMPXCHG in a loop. In fact, this apparently unusual sequence is the standard pattern for implementing interlocked operations when they aren't natively supported by the underlying hardware: capture the old value, perform an interlocked compare-and-exchange with the new value, and keep trying in a loop until the old value from this attempt is equal to the captured old value.
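That pattern is easy to express with C11 atomics (a sketch for illustration; this is not MSVC's actual implementation):

#include <stdatomic.h>

long fetch_or_cas_loop(_Atomic long *dest, long value)
{
	long old = atomic_load(dest);  /* capture the old value */
	/* If another thread modified *dest between the load and the CAS, the
	   CAS fails and reloads 'old', and we retry with the fresh value. */
	while (!atomic_compare_exchange_weak(dest, &old, old | value))
		;
	return old;  /* the original value, as the documented contract requires */
}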
As you observed, you see the same thing with InterlockedAnd, for exactly the same reason: the x86 AND instruction doesn't return the original value, so the code-generator has to fallback on the general pattern involving compare-and-exchange, which is directly supported by the hardware.
Note that, at least on x86 where InterlockedOr is implemented as an intrinsic, the optimizer is smart enough to figure out whether you're using the return value or not. If you are, then it uses the workaround code involving CMPXCHG. If you are ignoring the return value, then it goes ahead and emits code using LOCK OR, just like you would expect.
#include <intrin.h>

LONG InterlockedOrWithReturn()
{
	LONG val = 42;
	return _InterlockedOr(&val, 8);
}

void InterlockedOrWithoutReturn()
{
	LONG val = 42;
	LONG old = _InterlockedOr(&val, 8);
}
InterlockedOrWithoutReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
lock or DWORD PTR [rsp+8], 8
ret 0
InterlockedOrWithoutReturn ENDP
InterlockedOrWithReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
prefetchw BYTE PTR [rsp+8]
mov eax, DWORD PTR [rsp+8]
LoopTop:
mov ecx, eax
or ecx, 8
lock cmpxchg DWORD PTR [rsp+8], ecx
jne SHORT LoopTop
ret 0
InterlockedOrWithReturn ENDP
The optimizer is equally smart for InterlockedAnd, and should be for the other Interlocked* functions as well.
As intuition would tell you, the LOCK OR implementation is more efficient than the LOCK CMPXCHG in a loop. Not only is there the expanded code size and the overhead of looping, but you risk branch prediction misses, which can cost a large number of cycles. In performance-critical code, if you can avoid relying on the return value for interlocked operations, you can gain a performance boost.
However, what you really should be using in modern C++ is std::atomic, which allows you to specify the desired memory model/semantics, and then let the standard library maintainers deal with the complexity.
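The same used/unused distinction carries over to standard atomics. A quick C11 illustration (std::atomic in C++ mirrors this API; whether the unused case compiles to lock or depends on the compiler):

#include <stdatomic.h>

atomic_long g2;

void set_bit(void)
{
	atomic_fetch_or(&g2, 1);         /* result unused: lock or is sufficient */
}

long set_bit_get_old(void)
{
	return atomic_fetch_or(&g2, 1);  /* result used: may need a cmpxchg loop on x86 */
}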

linux syscall using spinlock returning value to userspace

I'm currently struggling with the correct implementation of a kernel spinlock in combination with a return statement that should return a value to userspace. I implemented a kernel syscall 'sys_kernel_entropy_is_recording' which should return the value of the kernel variable 'is_kernel_entropy_recording':
asmlinkage bool sys_kernel_entropy_is_recording(void)
{
	spin_lock(&entropy_analysis_lock);
	return is_kernel_entropy_recording;
	spin_unlock(&entropy_analysis_lock);
}
At this point two questions arise:
Q1: Is this implementation correct at all, meaning will the correct value of 'is_kernel_entropy_recording' be returned to userspace and the spinlock be released afterwards?
My concerns are:
a) is it allowed to return a value from kernelspace to userspace this way at all?
b) the return statement is located before the spin_unlock statement, hence will spin_unlock even be called?
Q2: To answer these questions myself I disassembled the compiled .o file, but it appears (at least to me) that the spin_lock/spin_unlock calls are ignored entirely by the compiler: it just moves the value of 'is_kernel_entropy_recording' to eax and calls ret (I'm not sure about the line 'callq 0xa5'):
(gdb) disassemble /m sys_kernel_entropy_is_recording
Dump of assembler code for function sys_kernel_entropy_is_recording:
49 {
0x00000000000000a0 <+0>: callq 0xa5 <sys_kernel_entropy_is_recording+5>
0x00000000000000a5 <+5>: push %rbp
0x00000000000000ad <+13>: mov %rsp,%rbp
50 spin_lock(&entropy_analysis_lock);
51 return is_kernel_entropy_recording;
52 spin_unlock(&entropy_analysis_lock);
53 }
0x00000000000000b5 <+21>: movzbl 0x0(%rip),%eax # 0xbc <sys_kernel_entropy_is_recording+28>
0x00000000000000bc <+28>: pop %rbp
0x00000000000000bd <+29>: retq
Hence I guess my use of the spinlock is not correct. Could someone please advise an appropriate approach?
Thanks a lot in advance!
You must not return from a syscall while holding a spinlock. And, as usual in C, no instruction is executed after a return statement, so your spin_unlock is never reached.
Common practice is to save the value obtained under the lock into a local variable, then return that variable's value after unlocking:
bool ret;

spin_lock(&entropy_analysis_lock);
ret = is_kernel_entropy_recording;
spin_unlock(&entropy_analysis_lock);

return ret;
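Applied to the syscall in question, the whole thing might look like this (a sketch; SYSCALL_DEFINE0 is the conventional way to declare a no-argument syscall):

SYSCALL_DEFINE0(kernel_entropy_is_recording)
{
	bool ret;

	spin_lock(&entropy_analysis_lock);
	ret = is_kernel_entropy_recording;
	spin_unlock(&entropy_analysis_lock);

	return ret; /* implicitly widened to the syscall's long return type */
}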

VC++ SSE code generation - is this a compiler bug?

A very particular code sequence in VC++ generated the following instruction (for Win32):
unpcklpd xmm0,xmmword ptr [ebp-40h]
Two questions arise:
(1) As far as I understand the Intel manual, unpcklpd requires its second argument, a memory address, to be 128-bit (16-byte) aligned. If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug?
(2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be?
The particular exception thrown is an access violation at 0xFFFFFFFF, but AFAIK that's just a code for misalignment.
[Edit:]
Here's some source that demonstrates the bad code generation - but typically doesn't cause a crash. (that's mostly what I'm wondering about)
[Edit 2:]
The code sample now reproduces the actual crash. This one also crashes outside the debugger - I suspect the difference occurs because the debugger launches the program at different typical base addresses.
// mock.cpp
#include <stdio.h>

struct mockVect2d
{
	double x, y;
	mockVect2d() {}
	mockVect2d(double a, double b) : x(a), y(b) {}
	mockVect2d operator + (const mockVect2d& u) {
		return mockVect2d(x + u.x, y + u.y);
	}
};

struct MockPoly
{
	MockPoly() {}
	mockVect2d* m_Vrts;
	double m_Area;
	int m_Convex;
	bool m_ParClear;
	void ClearPar() { m_Area = -1.; m_Convex = 0; m_ParClear = true; }
	MockPoly(int len) { m_Vrts = new mockVect2d[len]; }
	mockVect2d& Vrt(int i) {
		if (!m_ParClear) ClearPar();
		return m_Vrts[i];
	}
	const mockVect2d& GetCenter() { return m_Vrts[0]; }
};

struct MockItem
{
	MockItem() : Contour(1) {}
	MockPoly Contour;
};

struct Mock
{
	Mock() {}
	MockItem m_item;
	virtual int GetCount() { return 2; }
	virtual mockVect2d GetCenter() { return mockVect2d(1.0, 2.0); }
	virtual MockItem GetItem(int i) { return m_item; }
};

void testInner(int a)
{
	int c = 8;
	printf("%d", c);
	Mock* pMock = new Mock;
	int Flag = true;
	int nlr = pMock->GetCount();
	if (nlr == 0)
		return;
	int flr = 1;
	if (flr == nlr)
		return;
	if (Flag)
	{
		if (flr < nlr && flr > 0) {
			int c = 8;
			printf("%d", c);
			MockPoly pol(2);
			mockVect2d ctr = pMock->GetItem(0).Contour.GetCenter();
			// The mess happens here:
			// ; 74 : pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
			//
			//   call ?Vrt@MockPoly@@QAEAAUmockVect2d@@H@Z ; MockPoly::Vrt
			//   movdqa xmm0, XMMWORD PTR $T4[ebp]
			//   unpcklpd xmm0, QWORD PTR tv190[ebp]   **** crash!
			//   movdqu XMMWORD PTR [eax], xmm0
			pol.Vrt(0) = ctr + mockVect2d(1.0, 0.);
			pol.Vrt(1) = ctr + mockVect2d(0., 1.0);
		}
	}
}

void main()
{
	testInner(2);
	return;
}
If you prefer, download a ready vcxproj with all the switches set from here. This includes the complete ASM too.
Update: this is now a confirmed VC++ compiler bug, hopefully to be resolved in VS2015 RTM.
Edit: The Connect report, like many others, is now garbage. However, the compiler bug seems to be resolved in VS2017, though not in VS2015 Update 3.
Since no one else has stepped up, I'm going to take a shot.
1) If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug?
I'm not sure it is true that you cannot force alignment for stack variables. Consider this code:
struct foo
{
	char a;
	int b;
	unsigned long long c;
};

int wmain(int argc, wchar_t* argv[])
{
	foo moo;
	moo.a = 1;
	moo.b = 2;
	moo.c = 3;
}
Looking at the startup code for main, we see:
00E31AB0 push ebp
00E31AB1 mov ebp,esp
00E31AB3 sub esp,0DCh
00E31AB9 push ebx
00E31ABA push esi
00E31ABB push edi
00E31ABC lea edi,[ebp-0DCh]
00E31AC2 mov ecx,37h
00E31AC7 mov eax,0CCCCCCCCh
00E31ACC rep stos dword ptr es:[edi]
00E31ACE mov eax,dword ptr [___security_cookie (0E440CCh)]
00E31AD3 xor eax,ebp
00E31AD5 mov dword ptr [ebp-4],eax
Adding __declspec(align(16)) to moo gives
01291AB0 push ebx
01291AB1 mov ebx,esp
01291AB3 sub esp,8
01291AB6 and esp,0FFFFFFF0h <------------------------
01291AB9 add esp,4
01291ABC push ebp
01291ABD mov ebp,dword ptr [ebx+4]
01291AC0 mov dword ptr [esp+4],ebp
01291AC4 mov ebp,esp
01291AC6 sub esp,0E8h
01291ACC push esi
01291ACD push edi
01291ACE lea edi,[ebp-0E8h]
01291AD4 mov ecx,3Ah
01291AD9 mov eax,0CCCCCCCCh
01291ADE rep stos dword ptr es:[edi]
01291AE0 mov eax,dword ptr [___security_cookie (12A40CCh)]
01291AE5 xor eax,ebp
01291AE7 mov dword ptr [ebp-4],eax
Apparently the compiler (VS2010, compiled as debug for Win32), recognizing that we need a specific alignment for the code, takes steps to ensure it can provide it.
2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching to the process and executing this code does not throw. How can this be??
So, a couple of thoughts:
"and even then not always" - Not standing over your shoulder when you run this, I can't say for certain. However it seems plausible that just by random chance, stacks could get created with the alignment you need. By default, x86 uses 4byte stack alignment. If you need 16 byte alignment, you've got a 1 in 4 shot.
As for the rest (from https://msdn.microsoft.com/en-us/library/aa290049%28v=vs.71%29.aspx#ia64alignment_topic4):
On the x86 architecture, the operating system does not make the alignment fault visible to the application. ...you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
TLDR: Using __declspec(align(16)) should give you the alignment you want, even for stack variables. For unaligned accesses, the OS will catch the exception and handle it for you (at a performance cost).
Edit 1: Responding to the first two comments below:
Based on MS's docs, you are correct about the alignment of stack parameters, but they propose a solution as well:
You cannot specify alignment for function parameters. When data that has an alignment attribute is passed by value on the stack, its alignment is controlled by the calling convention. If data alignment is important in the called function, copy the parameter into correctly aligned memory before use.
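That suggested workaround would look something like this (a sketch in MSVC-flavored C; the struct mirrors mockVect2d, and _mm_load_pd stands in for whatever SSE operation needs the alignment):

#include <emmintrin.h>

struct vect2d { double x, y; };

void use(struct vect2d v)  /* v's stack slot may be only 4-byte aligned on Win32 */
{
	/* copy the by-value parameter into correctly aligned memory first */
	__declspec(align(16)) struct vect2d local = v;
	__m128d xy = _mm_load_pd(&local.x);  /* 16-byte-aligned load is now safe */
	/* ... use xy ... */
	(void)xy;
}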
Neither your sample on Microsoft Connect nor the code above produces the same code for me (I'm only on VS2010), so I can't test this. But given this code from your sample:
struct mockVect2d
{
	double x, y;
	mockVect2d(double a, double b) : x(a), y(b) {}
It would seem that aligning either mockVect2d or the 2 doubles might help.