Prohibit unaligned memory accesses on x86/x86_64 - linux

I want to emulate a system that prohibits unaligned memory accesses, on x86/x86_64.
Is there some debugging tool or special mode to do this?
I want to run many (CPU-intensive) tests on several x86/x86_64 PCs while working with software (C/C++) designed for SPARC or some other similar CPU, but my access to SPARC hardware is limited.
As far as I know, SPARC always checks that memory reads and writes are naturally aligned: a byte can be read from any address, but a 4-byte word may only be read when the address is divisible by 4.
Maybe Valgrind or PIN has such a mode? Or a special mode of the compiler?
I'm searching for a non-commercial Linux tool, but Windows tools are allowed too.
Or maybe there is a secret CPU flag in EFLAGS?

I've just read the question Does unaligned memory access always cause bus errors?, which links to the Wikipedia article Segmentation Fault.
In the article, there's a wonderful reminder of the rather uncommon Intel processor flag AC, aka Alignment Check.
And here's how to enable it (from Wikipedia's Bus Error example, with a red-zone clobber bug fixed for x86-64 System V so this is safe on Linux and macOS, and converted from Basic asm, which is never a good idea inside functions): you want changes to AC to be ordered wrt. memory accesses.
#if defined(__GNUC__)
# if defined(__i386__)
/* Enable Alignment Checking on x86 */
__asm__("pushf\n"
        "orl $0x40000,(%%esp)\n"
        "popf" ::: "memory");
# elif defined(__x86_64__)
/* Enable Alignment Checking on x86_64 */
__asm__("add $-128, %%rsp\n"   // skip past the red zone, in case there is one and the compiler has local vars there
        "pushf\n"
        "orl $0x40000,(%%rsp)\n"
        "popf\n"
        "sub $-128, %%rsp"     // and restore the stack pointer
        ::: "memory");         // ordered wrt. other memory accesses
# endif
#endif
Once enabled, it works a lot like the ARM alignment settings in /proc/cpu/alignment; see the answer to How to trap unaligned memory access? for examples.
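For a quick sanity check, here's a minimal sketch (x86-64 Linux assumed) that sets AC as above and then performs a deliberately misaligned load; with AC enabled it should die with SIGBUS instead of silently working:
/* Sketch (x86-64 Linux): enable AC, then do a misaligned load. */
#include <stdint.h>

int main(void)
{
    /* set EFLAGS.AC, skipping the red zone as in the snippet above */
    __asm__("add $-128, %%rsp\n"
            "pushf\n"
            "orl $0x40000,(%%rsp)\n"
            "popf\n"
            "sub $-128, %%rsp" ::: "memory");

    char buf[16] = {0};
    volatile uint32_t *p = (volatile uint32_t *)(buf + 1);  /* misaligned by 1 byte */
    return (int)*p;   /* with AC set (and CR0.AM enabled, as it is on Linux), this raises SIGBUS */
}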
Additionally, if you're using GCC, I suggest you enable -Wcast-align warnings. When building for a target with strict alignment requirements (ARM for example), GCC will report locations that might lead to unaligned memory access.
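For example, a cast like the following is the kind of thing -Wcast-align flags when building for a strict-alignment target (or with -Wcast-align=strict on x86); the function name is just for illustration:
#include <stdint.h>

/* Sketch: the cast increases the required alignment from 1 (uint8_t) to 4 (uint32_t),
   so GCC warns when strict-alignment checking is in effect. */
uint32_t read_word(const uint8_t *buf)
{
    return *(const uint32_t *)buf;   /* -Wcast-align: cast increases required alignment */
}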
But note that libc's handwritten asm for memcpy and other functions will still make unaligned accesses, so setting AC is often not practical on x86 (including x86-64). GCC will sometimes emit asm that makes unaligned accesses even if your source doesn't, e.g. as an optimization to copy or zero two adjacent array elements or struct members at once.

It's tricky and I haven't done it personally, but I think you can do it in the following way:
x86_64 CPUs (specifically I've checked an Intel Core i7, but I guess others as well) have a performance counter MISALIGN_MEM_REF which counts misaligned memory references.
So first of all, you can run your program and use the "perf" tool under Linux to get a count of the number of misaligned accesses your code has done.
A more tricky and interesting hack would be to write a kernel module that programs the performance counter to generate an interrupt on overflow, and arrange for it to overflow on the first unaligned load/store. Respond to this interrupt in your kernel module by sending a signal to your process.
This will, in effect, turn the x86_64 into a core that doesn't support unaligned access.
This won't be simple though - besides your code, the system libraries also use unaligned accesses, so it will be tricky to separate them from your own code.
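For the counting part, here's a rough user-space sketch using perf_event_open(2) instead of the perf tool (x86-64 Linux assumed). The raw encoding 0x0105 for MISALIGN_MEM_REF.LOADS is my guess at a typical Intel encoding; verify it against `perf list` and your CPU's event tables before trusting the numbers:
/* Sketch: count misaligned loads around a region of code via perf_event_open(2). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x0105;        /* assumed MISALIGN_MEM_REF.LOADS encoding (event 0x05, umask 0x01) */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code under test here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof count);
    printf("misaligned loads: %llu\n", (unsigned long long)count);
    return 0;
}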

Both GCC and Clang have UndefinedBehaviorSanitizer built in. One of those checks, alignment, can be enabled with -fsanitize=alignment. It'll emit code to check pointer alignment at runtime and abort if unaligned pointers are dereferenced.
See online documentation at:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
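For example, building the following with gcc -g -fsanitize=alignment (or the clang equivalent) and running it reports a misaligned load at the marked line; the exact diagnostic wording varies by compiler version:
/* Sketch: a misaligned dereference caught by UBSan at run time. */
#include <stdint.h>

int main(void)
{
    char buf[8] = {0};
    uint32_t *p = (uint32_t *)(buf + 1);   /* not 4-byte aligned */
    return *p;                             /* UBSan: load of misaligned address */
}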

Perhaps you could somehow compile to SSE, with all aligned moves. Unaligned accesses with movaps are illegal and would probably behave like illegal unaligned accesses on other architectures.
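Something along these lines (a sketch, not a complete solution): _mm_load_si128 maps to the aligned movdqa/movaps family, so feeding it an unaligned pointer faults. Note that on Linux the resulting #GP is normally delivered as SIGSEGV rather than SIGBUS:
#include <emmintrin.h>

/* noinline keeps the compiler from folding the load away */
__attribute__((noinline))
static __m128i load_aligned(const void *p)
{
    return _mm_load_si128((const __m128i *)p);   /* aligned 16-byte load (movdqa) */
}

int main(void)
{
    char buf[32] = {0};
    __m128i v = load_aligned(buf + 1);   /* deliberately not 16-byte aligned */
    return _mm_cvtsi128_si32(v);         /* not reached: #GP -> SIGSEGV */
}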

Related

Are all linux kernel functions aligned by 0x10? and why?

I tried to solve the problem that "kallsyms_lookup_name is not exported anymore in kernels > 5.7", and found a solution at https://github.com/xcellerator/linux_kernel_hacking/issues/3.
It says that "the kernel functions are all aligned so that the final nibble is 0x0", and I wonder why?
In general: no, kernel functions aren't all and always aligned to 16 bytes. Saying "the kernel functions are all aligned so that the final nibble is 0x0" is wrong. However, in the most common case, which is Linux x86-64 compiled with GCC with default kernel compiler flags, this happens to be true. Take some other case, like the default config for ARM64 for example, and you'll see that this does not hold.
The kernel itself does not specify any alignment for functions, but the compiler optimizations that it enables can (and will) align functions.
Alignment of functions is in fact a compiler optimization that on GCC is enabled using -falign-functions=. According to the GCC doc, compiling with at least -O2 (selected by CC_OPTIMIZE_FOR_PERFORMANCE=y, which is the default) will enable this optimization without an explicit value set. This means that the actual alignment value is chosen by GCC based on the architecture. On x86, the default is 16 bytes for the "generic" machine type (-march=x86-64, see doc).
Clang also supports -falign-functions= since version 6.0.1 according to their repository (it was previously ignored), though I am not sure if it is enabled or not at different optimization levels.
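A quick way to see the effect (a sketch; the function names are arbitrary) is to print the addresses of a few functions built with the flag, e.g. gcc -O2 -falign-functions=16 align_demo.c:
/* Sketch: observe function alignment chosen by the compiler. */
#include <stdio.h>
#include <stdint.h>

void f1(void) {}
void f2(void) {}

int main(void)
{
    printf("f1 = %p (mod 16 = %ju)\n", (void *)f1, (uintmax_t)((uintptr_t)f1 % 16));
    printf("f2 = %p (mod 16 = %ju)\n", (void *)f2, (uintmax_t)((uintptr_t)f2 % 16));
    return 0;
}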
Why is this an optimization? Well, alignment can offer performance advantages. In theory, cache line alignment would be "optimal" for cache performance, but there are other factors to consider: aligning to 64 bytes (cache line size on x86) would probably waste a lot of space for no good reason without improving performance that much. See How much does function alignment actually matter on modern processors?
Most importantly, this is not true generically. Some architectures will have other alignment criteria. Why is often a really difficult question to answer. It is completely possible for Linux not to do this. It could even change between kernel versions for the same architecture.
A low nibble of zero means 16-byte alignment. It is enforced by the linker and the compiler (via CPU and/or ABI restrictions/efficiencies). Generally, you want a function to start on a cache line, so that functions do not overlap with each other in the cache. It is also easier to fill; i.e., the expectation is that you will only have one partial cache line, at the end of the function. It is possible that branches, case labels and other constructs could also be aligned, depending on the CPU.
Even without a cache, primary SDRAM fills in bursts. 16 bytes seems like a reasonable amount to balance the overhead of alignment (wasted bytes) against efficiency: fewer SDRAM cycles. Of course the SDRAM burst and cache lines are of the same order, as they work together to get code to the CPU's decode units.
There can be other reasons for alignment, such as hardware and internal tables that only use a subset of address bits. This alignment can be for external functions only. Some instructions will operate faster (or only be possible) on aligned data, so some kernel function instrumentation may also benefit from the alignment (either through more compact tables or by adding veneers, etc.).
See: x86_64 stack alignment - where many of the rationales for the stack alignment can apply to code as the kernel can on occasion treat code as data.

Why does x86 allow unaligned accesses, and how can unaligned accesses be detected?

Perhaps I misunderstand something, but it seems that unaligned access on x86 causes security problems such as a Return Address Integrity issue.
Why do x86 designers allow for unaligned accesses in the first place? (Performance is the only benefit I can think of.)
If x86 designers permit this unaligned access trouble, they should somehow know how to solve it, don't they? Can unaligned accesses get detected with static techniques or sanitization techniques?
I'm skeptical of the entire premise that there's a security downside here; a quick search of your link doesn't find any mention of unaligned access being a problem.
Many other ISAs support unaligned access now, too. e.g. AArch64, later ARM including ARMv6 and ARMv7, and even MIPS32r6 (but earlier MIPS revisions didn't guarantee that). Non-x86 implementations often have a performance penalty for unaligned load or especially store, even when it's within a single cache line (which has no penalty on modern x86 for cacheable loads/stores).
The primary designer of 8086 was Stephen Morse (who wrote a book about it, The 8086 Primer, which is now free on his web site).
The x86 design choice was made between 1976 and 1978. (And couldn't be changed in later x86 without breaking backwards compat, which is the main thing x86 has going for it.) 8086 needed to support byte loads and stores, and the hardware required to support unaligned 2-byte words on its 16-bit bus was presumably minor. Especially since 8088 was also planned, with an 8-bit bus. I think its only differences from 8086 were in the bus-interface unit. Or it might have been cheaper to just do it than to implement some mechanism for alignment faults.
There is no obvious security problem, and certainly none that anyone then would have heard of.
8086 was designed for easy asm source-porting from 8080 - IDK if 8080 could ever load or store 2 bytes at once, but if it allowed doing so, it probably didn't care about alignment, so 8086 needed to support it. Modern static analysis tools probably weren't even dreamed of yet, and most 8080 code was hand-written in asm. (Like much early 8086 code, I'd guess.)
The Internet barely existed at the time and almost certainly wasn't a consideration. 8086 had no memory protection or privilege levels, so it certainly wasn't designed with security in mind. (Unlike contemporary CPUs for minicomputers that ran multi-user OSes).
The only real security threat for PCs at the time AFAIK was boot-sector viruses, and usually those spread by directly executing code that the system auto-ran during boot or from floppies, not attacking vulnerabilities in other programs. I could imagine malicious data files like .zip or word-processor formats were thought of at some point, but if there is any security advantage to disallowing misaligned accesses, it wasn't anything known then.
Software certainly wasn't spending extra code-size or cycles on hardening, not for decades after 8086.
Can unaligned accesses get detected with static techniques or sanitization techniques?
There's HW support for detecting unaligned accesses on x86, in the form of the AC bit in EFLAGS. But that's normally unusable because compilers (and hand-written asm memcpy etc. in libc) sometimes use unaligned loads, e.g. to initialize or copy adjacent narrow members of a struct.
GCC has -fsanitize=alignment which seems to check for C UB of dereferencing pointers that aren't sufficiently aligned for their type. e.g. it checks *int_ptr, but doesn't add checks for memcpy(char_arr, &my_int, 4) even though it inlines as a dword store. https://godbolt.org/z/ac6K13nc1
Misaligned locked instructions are extremely expensive (like system-wide bus lock or something), at least when split across two cache lines, and there is special support for detecting them specifically, without complaining about the normal misaligned loads/stores that happen in memcpy for odd sizes. The mechanisms include a perf counter for it, and a recent addition of an MSR (Model Specific Register) config bit to let the kernel make them raise an exception.
Cache-line-split locked instructions can apparently be a problem in terms of letting unprivileged code on one core interfere with hard-realtime code on another core.
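As a sketch of what such a split lock looks like in C (assuming a 64-byte cache line; with the kernel's split_lock_detect=fatal option this is the kind of access that gets killed, otherwise it merely takes a system-wide bus lock):
/* Sketch: a LOCKed RMW that straddles a 64-byte cache-line boundary. */
#include <stdint.h>
#include <stdalign.h>

int main(void)
{
    alignas(64) char buf[128] = {0};
    /* a 4-byte object at offset 62 spans bytes 62..65, crossing the line boundary at 64 */
    volatile uint32_t *p = (volatile uint32_t *)(buf + 62);
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);   /* lock add across two cache lines */
    return 0;
}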
It seems that unaligned access on x86 causes security problems such as a Return Address Integrity issue.
How so?
The paper you linked mentions alignment of the Function Lookup Table in this proposed hardening mechanism. There are only two instances of the string "align" in the whole paper, and neither of them talk about ARMv7-M's support for unaligned load/store creating any difficulty. (ARMv7-M is the ISA they're discussing, since it's about hardening embedded systems.)

Memory available to assembly program in Linux

For fun I am just trying to write a program in assembly for Linux on a laptop with an x86 processor to get some system information. So one of the things I am trying to find is how much memory is available to my program, and where e.g. the stack is and if and how I can allocate additional memory if needed.
A long time ago I did things like this on an Atari ST; there was just a system 'malloc' I could ask for memory from, and there were functions to find the available memory.
I know Linux is set up differently and I kind of have the whole address space to myself, but I guess there are some memory areas I am not allowed to touch.
And somehow a default stack seems to have been setup.
I researched quite a bit for this, but I can't find any 'assembly' system call. Most people point to linking the C malloc for memory management, but I am not looking for a memory manager. I just want to know the memory boundaries of my program.
I find things like getrlimit, setrlimit, prlimit and brk and sbrk, but those seem to be C functions and not system calls.
What am I missing?
Linux uses virtual memory (and ASLR). The Atari ST used neither, so it did have a fixed memory map for some OS data structures and code. (Because the OS was in ROM and couldn't easily be updated, some people even documented some internal addresses.)
Linux tries to keep the boundary between kernel and user-space rigid, with a well-defined documented API / ABI for user-space to interact with the kernel via system calls. (e.g. on x86-64, via the syscall instruction). User-space doesn't need to care what's on the other side of that wall, and usually not even where its pages are in virtual memory as long as it has pointers to them.
When glibc malloc wants more pages from the OS, it uses mmap(MAP_ANONYMOUS) or brk to get them, and hands out chunks of them for small calls to malloc. It keeps its bookkeeping data structures in user-space (so that's per-process, of course).
I know Linux is set up differently and I kind of have the whole address space to myself, but I guess there are some memory areas I am not allowed to touch.
Yeah, every process has its own virtual address space. You can only touch the parts you've allocated, otherwise the resulting page fault will be "invalid" (OS knows there isn't supposed to be a physical page for that virtual page) and will deliver a SIGSEGV signal to your process if you try to read or write it. ("valid" page faults happen because of swap space or lazy allocation / copy-on-write; the kernel updates the HW page tables and returns to user-space for it to re-run the instruction that faulted.)
Also, the kernel claims the high half of virtual address space for its own use. (https://wiki.osdev.org/Higher_Half_Kernel). See also https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt for Linux's x86-64 memory map layout.
I can't find any 'assembly' system call.
mmap and brk are true system calls. See the "notes" section of the brk(2) man page. Section 2 man pages are system calls, section 3 are libc functions.
Of course in C when you call mmap(...), you're actually calling a wrapper function in glibc. glibc provides wrapper functions, not inline asm macros that use the syscall instruction directly.
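As a sketch (x86-64 assumed), the wrapper and the raw syscall end up in the same place; hand-written asm would load the same arguments into registers and execute the syscall instruction itself:
/* Sketch: both calls request anonymous pages from the kernel; the second
 * bypasses the glibc wrapper, which is what asm effectively does. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    size_t len = 1 << 20;   /* 1 MiB of anonymous, zero-filled pages */

    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    void *b = (void *)syscall(SYS_mmap, NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("via glibc wrapper: %p\nvia raw syscall:   %p\n", a, b);
    return 0;
}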
See also The Definitive Guide to Linux System Calls which explains the asm interface, and also the VDSO pages. Linux maps some kernel memory (read-only) into your user-space process, holding code and data so getpid() and clock_gettime() can run in user-space.
Also various Q&As on Stack Overflow, including What are the calling conventions for UNIX & Linux system calls on i386 and x86-64
So one of the things I am trying to find is how much memory is available to my program
There isn't a system call to query the current memory map of your process. Parsing /proc/self/maps would be your best bet.
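A minimal sketch of doing that from C (each line has the form start-end perms offset dev inode path):
/* Sketch: dump this process's own memory map. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return 1;
    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}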
See Finding mapped memory from inside a process for some fun ideas on using system calls to scan ranges of virtual address space for mapped pages. e.g. Linux's mincore(2) syscall returns -ENOMEM if the specified range contains any unmapped pages.

memory barrier and atomic_t on linux

Recently, I've been reading some Linux kernel-space code, and I see this:
uint64_t used;
uint64_t blocked;
used = atomic64_read(&g_variable->used); //#1
barrier(); //#2
blocked = atomic64_read(&g_variable->blocked); //#3
What are the semantics of this code snippet? Does #2 make sure that #1 executes before #3?
But I am a little bit confused, because:
#A On a 64-bit platform, the atomic64_read macro is expanded to
used = (&g_variable->used)->counter // where counter is volatile.
On a 32-bit platform, it is converted to use lock cmpxchg8b. I assume these two have the same semantics, and for the 64-bit version, I think it means:
all-or-nothing: we can exclude the case where the address is unaligned and the word size is larger than the CPU's native word size.
no optimization: force the CPU to read from the memory location.
atomic64_read doesn't have semantics for preserving read ordering!!! See this.
#B the barrier macro is defined as
/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")
From the wiki, this just prevents the gcc compiler from reordering reads and writes.
What I am confused about is how it disables reordering optimizations in the CPU. In addition, can I treat the barrier macro as a full fence?
32-bit x86 processors don't provide simple atomic read operations for 64-bit types. The only atomic operation on 64-bit types on such CPUs that deals with "normal" registers is LOCK CMPXCHG8B, which is why it is used here. The alternative is to use MOVQ and MMX/XMM registers, but that requires knowledge of the FPU state/registers, and requires that all operations on that value are done with the MMX/XMM instructions.
On 64-bit x86_64 processors, aligned reads of 64-bit types are atomic, and can be done with a MOV instruction, so only a plain read is required --- the use of volatile is just to ensure that the compiler actually does a read, and doesn't cache a previous value.
As for the read ordering, the inline assembler you quote ensures that the compiler emits the instructions in the right order, and this is all that is required on x86/x86_64 CPUs, provided the writes are correctly sequenced. LOCKed writes on x86 have a total ordering; plain MOV writes provide "causal consistency", so if thread A does x=1 then y=2, if thread B reads y==2 then a subsequent read of x will see x==1.
On IA-64, PowerPC, SPARC, and other processors with a more relaxed memory model there may well be more to atomic64_read() and barrier().
x86 CPUs don’t do read-after-read reordering, so it is sufficient to prevent the compiler from doing any reordering. On other platforms such as PowerPC, things will look a lot different.
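For comparison, here is a user-space C11 sketch of the same pattern: a relaxed atomic load plus a compiler-only fence, which (by the reasoning above) is enough on x86 but not on weaker-ordered CPUs. The struct and function names are just for illustration:
/* Sketch: atomic_signal_fence only constrains the compiler (like barrier());
 * weaker architectures would need atomic_thread_fence / acquire loads instead. */
#include <stdatomic.h>
#include <stdint.h>

struct g { _Atomic uint64_t used, blocked; };

void snapshot(struct g *g, uint64_t *used, uint64_t *blocked)
{
    *used = atomic_load_explicit(&g->used, memory_order_relaxed);       /* #1 */
    atomic_signal_fence(memory_order_acq_rel);                          /* #2: compiler-only barrier */
    *blocked = atomic_load_explicit(&g->blocked, memory_order_relaxed); /* #3 */
}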

Debugging SIGBUS on x86 Linux

What can cause SIGBUS (bus error) on a generic x86 userland application in Linux? All of the discussion I've been able to find online is regarding memory alignment errors, which from what I understand doesn't really apply to x86.
(My code is running on a Geode, in case there are any relevant processor-specific quirks there.)
SIGBUS can happen in Linux for quite a few reasons other than memory alignment faults - for example, if you attempt to access an mmap region beyond the end of the mapped file.
Are you using anything like mmap, shared memory regions, or similar?
You can get a SIGBUS from an unaligned access if you turn on the unaligned access trap, but normally that's off on an x86. You can also get it from accessing a memory mapped device if there's an error of some kind.
Your best bet is using a debugger to identify the faulting instruction (SIGBUS is synchronous), and trying to see what it was trying to do.
SIGBUS on x86 (including x86_64) Linux is a rare beast. It may result from an attempt to access past the end of an mmapped file, or from some other situations described by POSIX.
But from hardware faults it's not easy to get SIGBUS. Namely, unaligned access from any instruction — be it SIMD or not — usually results in SIGSEGV. Stack overflows result in SIGSEGV. Even accesses to addresses not in canonical form result in SIGSEGV. All this due to #GP being raised, which almost always maps to SIGSEGV.
Now, here are some ways to get SIGBUS due to a CPU exception:
Enable AC bit in EFLAGS, then do unaligned access by any memory read or write instruction. See this discussion for details.
Do canonical violation via a stack pointer register (rsp or rbp), generating #SS. Here's an example for GCC (compile with gcc test.c -o test -masm=intel):
int main()
{
    __asm__("mov rbp,0x400000000000000\n"   /* non-canonical address in a stack register */
            "mov rax,[rbp]\n"               /* load through rbp raises #SS -> SIGBUS */
            "ud2\n");                       /* not reached if the load faults */
}
Oh yes there's one more weird way to get SIGBUS.
If the kernel fails to page in a code page due to memory pressure (with the OOM killer disabled) or a failed IO request, you get SIGBUS.
This was briefly mentioned above as a "failed IO request", but I'll expand upon it a bit.
A frequent case is when you lazily grow a file using ftruncate, map it into memory, start writing data and then run out of space in your filesystem. Physical space for the mapped file is allocated on page faults; if there's none left, the process receives a SIGBUS.
If you need your application to correctly recover from this error, it makes sense to explicitly reserve space prior to mmap using fallocate. Handling ENOSPC in errno after fallocate call is much simpler than dealing with signals, especially in a multi-threaded application.
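A sketch of that approach (Linux-specific fallocate assumed; the function name is illustrative):
/* Sketch: reserve the blocks up front so running out of space surfaces as
 * ENOSPC here rather than as SIGBUS on a later page fault in the mapping. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int map_file(const char *path, size_t len, void **out)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    if (fallocate(fd, 0, 0, len) != 0) {   /* unlike ftruncate, this actually allocates blocks */
        close(fd);
        return -1;                         /* check errno == ENOSPC */
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    *out = p;
    return 0;
}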
You may see SIGBUS when you're running the binary off NFS (network file system) and the file is changed. See https://rachelbythebay.com/w/2018/03/15/core/.
If you request a mapping backed by hugepages with mmap and the MAP_HUGETLB flag, you can get SIGBUS if the kernel runs out of allocated huge pages and thus cannot handle a page fault.
In this case, you'll need to raise the number of allocated huge pages via
/sys/kernel/mm/hugepages/hugepages-<size>/nr_hugepages or
/sys/devices/system/node/nodeX/hugepages/hugepages-<size>/nr_hugepages on NUMA systems.
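A minimal sketch of such a mapping; note that mmap itself can succeed, and the SIGBUS only arrives when a page fault finds the huge-page pool empty:
/* Sketch: explicit 2 MiB huge-page mapping. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;   /* one 2 MiB huge page */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    p[0] = 1;   /* first touch: SIGBUS here if no huge page is available */
    return 0;
}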
A common cause of a bus error on x86 Linux is attempting to dereference something that is not really a pointer, or is a wild pointer. For example, failing to initialize a pointer, or assigning an arbitrary integer to a pointer and then attempting to dereference it, will normally produce either a segmentation fault or a bus error.
Alignment applies to x86 too, at least at the language level. Even though memory on x86 is byte-addressable (so you can have a char pointer to any address), a pointer to, for example, a 4-byte integer is required by C to be suitably aligned, even though the hardware normally tolerates the unaligned access.
You should run your program in gdb and determine which pointer access is generating the bus error to diagnose the issue.
It's a bit off the beaten path, but you can get SIGBUS from an unaligned SSE2 (__m128) load.
