Are all Linux kernel functions aligned to 0x10? And why? - linux

I tried to solve the problem that kallsyms_lookup_name is not exported anymore in kernels > 5.7, and found a solution at https://github.com/xcellerator/linux_kernel_hacking/issues/3.
It says that "the kernel functions are all aligned so that the final nibble is 0x0", and I wonder why?

In general: no, kernel functions aren't all and always aligned to 16 bytes. Saying "the kernel functions are all aligned so that the final nibble is 0x0" is wrong. However, in the most common case, which is Linux x86-64 compiled with GCC using the default kernel compiler flags, it happens to be true. Take some other case, such as the default config for ARM64, and you'll see that it does not hold.
The kernel itself does not specify any alignment for functions, but the compiler optimizations that it enables can (and will) align functions.
Alignment of functions is in fact a compiler optimization that on GCC is enabled using -falign-functions=. According to the GCC doc, compiling with at least -O2 (selected by CC_OPTIMIZE_FOR_PERFORMANCE=y, which is the default) will enable this optimization without an explicit value set. This means that the actual alignment value is chosen by GCC based on the architecture. On x86, the default is 16 bytes for the "generic" machine type (-march=x86-64, see doc).
Clang also supports -falign-functions= since version 6.0.1 according to their repository (it was previously ignored), though I am not sure if it is enabled or not at different optimization levels.
Why is this an optimization? Well, alignment can offer performance advantages. In theory, cache line alignment would be "optimal" for cache performance, but there are other factors to consider: aligning to 64 bytes (cache line size on x86) would probably waste a lot of space for no good reason without improving performance that much. See How much does function alignment actually matter on modern processors?
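As a quick empirical check of the common x86-64 + GCC case, here is a small userspace sketch (my own illustration, not part of the linked workaround) that counts how many kernel text symbols end in a zero nibble. It assumes /proc/kallsyms shows real addresses, i.e. you run it as root or with kptr_restrict=0; otherwise every address reads as 0 and the result is meaningless.

/* check_align.c - count 16-byte-aligned kernel text symbols (sketch) */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/kallsyms", "r");
    if (!f) {
        perror("fopen /proc/kallsyms");
        return 1;
    }

    unsigned long long addr;
    char type;
    char name[512];
    unsigned long total = 0, aligned = 0;

    /* Lines look like: "<hex address> <type> <name> [module]" */
    while (fscanf(f, "%llx %c %511s%*[^\n]", &addr, &type, name) == 3) {
        if (type != 't' && type != 'T')   /* keep only text (function) symbols */
            continue;
        total++;
        if ((addr & 0xf) == 0)            /* final nibble 0x0 => 16-byte aligned */
            aligned++;
    }
    fclose(f);

    printf("%lu of %lu text symbols are 16-byte aligned\n", aligned, total);
    return 0;
}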

Most importantly, this is not true generically. Some architectures will have other alignment criteria. "Why" is often a really difficult question to answer. It is completely possible for Linux not to do this. It could even change with kernel versions for the same architecture.
A low nibble of zero means 16-byte alignment. It is enforced by the linker and the compiler (via CPU and/or ABI restrictions/efficiencies). Generally, you want a function to start on a cache line, so that functions do not overlap with each other in the cache. It is also easier to fill; i.e., the expectation is that you will have only one partial cache line, at the end of the function. Depending on the CPU, branch targets, case labels and other constructs may also be aligned.
Even without a cache, main SDRAM is filled in bursts. 16 bytes seems like a reasonable amount to balance the overhead of alignment (wasted bytes) against efficiency: fewer SDRAM cycles. Of course, SDRAM bursts and cache lines are of the same order, since they work together to get code to the CPU's decode units.
There can be other reasons for alignment such as hardware and internal tables that only use a sub-set of address bits. This alignment can be for external functions only. Some instructions will operate faster (or only be possible) on aligned data. So, some kernel function instrumentation may also benefit from the alignment (either through more compact tables or adding veneers, etc).
See: x86_64 stack alignment - where many of the rationales for the stack alignment can apply to code as the kernel can on occasion treat code as data.

Related

Why does x86 allows for unaligned accesses, and how unaligned accesses can be detected?

Perhaps I misunderstand something, but it seems unaligned access on x86 causes security trouble such as a Return Address Integrity issue.
Why did the x86 designers allow unaligned accesses in the first place? (Performance is the only benefit I can think of.)
If the x86 designers permit this unaligned-access trouble, they should somehow know how to deal with it, shouldn't they? Can unaligned accesses be detected with static techniques or sanitization techniques?
I'm skeptical of the entire premise that there's a security downside here; a quick search of your link doesn't find any mention of unaligned access being a problem.
Many other ISAs support unaligned access now, too. e.g. AArch64, later ARM including ARMv6 and ARMv7, and even MIPS32r6 (but earlier MIPS revisions didn't guarantee that). Non-x86 implementations often have a performance penalty for unaligned load or especially store, even when it's within a single cache line (which has no penalty on modern x86 for cacheable loads/stores).
The primary designer of 8086 was Stephen Morse (who wrote a book about it, The 8086 Primer, which is now free on his web site).
The x86 design choice was made between 1976 and 1978. (And couldn't be changed in later x86 without breaking backwards compat, which is the main thing x86 has going for it.) 8086 needed to support byte loads and stores, and the hardware required to support unaligned 2-byte words on its 16-bit bus was presumably minor. Especially since 8088 was also planned, with an 8-bit bus. I think its only differences from 8086 were in the bus-interface unit. Or it might have been cheaper to just do it than to implement some mechanism for alignment faults.
There is no obvious security problem, and certainly none that anyone then would have heard of.
8086 was designed for easy asm source-porting from 8080 - IDK if 8080 could ever load or store 2 bytes at once, but if it allowed doing so, it probably didn't care about alignment, so 8086 needed to support it. Modern static analysis tools probably weren't even dreamed of yet, and most 8080 code was hand-written in asm. (Like much early 8086 code, I'd guess.)
The Internet barely existed at the time and almost certainly wasn't a consideration. 8086 had no memory protection or privilege levels, so it certainly wasn't designed with security in mind. (Unlike contemporary CPUs for minicomputers that ran multi-user OSes).
The only real security threat for PCs at the time AFAIK was boot-sector viruses, and usually those spread by directly executing code that the system auto-ran during boot or from floppies, not attacking vulnerabilities in other programs. I could imagine malicious data files like .zip or word-processor formats were thought of at some point, but if there is any security advantage to disallowing misaligned accesses, it wasn't anything known then.
Software certainly wasn't spending extra code-size or cycles on hardening, not for decades after 8086.
Can unaligned accesses get detected with static techniques or sanitization techniques?
There's HW support for detecting unaligned accesses on x86, in the form of the AC bit in EFLAGS. But that's normally unusable because compilers (and hand-written asm memcpy etc. in libc) sometimes use unaligned loads, e.g. to initialize or copy adjacent narrow members of a struct.
GCC has -fsanitize=alignment which seems to check for C UB of dereferencing pointers that aren't sufficiently aligned for their type. e.g. it checks *int_ptr, but doesn't add checks for memcpy(char_arr, &my_int, 4) even though it inlines as a dword store. https://godbolt.org/z/ac6K13nc1
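For illustration, here is a minimal sketch of that distinction (my example, not taken from the Godbolt link; compile with e.g. gcc -O1 -fsanitize=alignment demo.c):

#include <stdio.h>
#include <string.h>

int main(void)
{
    _Alignas(4) char buf[16] = {0};
    int x = 42;

    /* Not flagged: memcpy through a char buffer is not a misaligned
       dereference in C terms, even if the compiler inlines it as an
       unaligned dword store. */
    memcpy(buf + 1, &x, sizeof x);

    /* Flagged at runtime: load through an int pointer that is not
       suitably aligned for 'int'. */
    int *p = (int *)(buf + 1);
    printf("%d\n", *p);
    return 0;
}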
Misaligned locked instructions are extremely expensive (like system-wide bus lock or something), at least when split across two cache lines, and there is special support for detecting them specifically, without complaining about the normal misaligned loads/stores that happen in memcpy for odd sizes. The mechanisms include a perf counter for it, and a recent addition of an MSR (Model Specific Register) config bit to let the kernel make them raise an exception.
Cache-line-split locked instructions can apparently be a problem in terms of letting unprivileged code on one core interfere with hard-realtime code on another core.
It seems unaligned access on x86 causes security trouble such as a Return Address Integrity issue.
How so?
The paper you linked mentions alignment of the Function Lookup Table in this proposed hardening mechanism. There are only two instances of the string "align" in the whole paper, and neither of them talk about ARMv7-M's support for unaligned load/store creating any difficulty. (ARMv7-M is the ISA they're discussing, since it's about hardening embedded systems.)

Why do register renaming, when we can increase the number of registers in the architecture?

In processors, why can't we simply increase the number of registers instead of having a huge reorder buffer and mapping registers to resolve name dependencies?
Lots of reasons.
First, we are often designing micro-architectures to execute programs for an existing architecture. Adding registers would change the architecture. At best, existing binaries would not benefit from the new registers; at worst they won't run at all without some kind of JIT compilation.
Then there is the problem of encoding. Adding new registers means increasing the number of bits dedicated to encoding the registers, probably increasing the instruction size, with effects on the cache and elsewhere.
There is the issue of the size of the visible state. Context switching would have to save all the visible registers, taking more time and more space (and thus affecting the cache, hence more time again).
There is also the fact that dynamic renaming can be applied in places where static renaming and register allocation are impossible, or at least hard to do; and when they are possible, they take more instructions, thus increasing cache pressure.
In conclusion, there is a sweet spot, usually considered to be 16 or 32 registers for the integer/general-purpose case. For floating-point and vector registers, there are arguments for more registers (ISTR that Fujitsu at one time used 128 or 256 floating-point registers for its own extended SPARC).
Related question on electronics.se.
As an additional note, the Mill architecture takes another approach to statically scheduled processors and avoids some of the drawbacks, apparently changing the trade-off. But AFAIK, it is not yet known whether there will ever be silicon available for it.
Because static scheduling at compile time is hard (software pipelining) and inflexible to variable timings like cache misses. Having the CPU able to find and exploit ILP (Instruction Level Parallelism) in more cases is very useful for hiding latency of cache misses and FP or integer math.
Also, instruction-encoding considerations. For example, Haswell's 168-entry integer register file would need about 8 bits per operand to encode if we had that many architectural registers. vs. 3 or 4 for actual x86 machine code.
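A toy illustration of the latency-hiding point (my sketch, not a benchmark): two independent dependency chains that renaming plus out-of-order execution can overlap, so the loop runs at roughly the speed of a single chain.

#include <stdio.h>

int main(void)
{
    double a = 1.0, b = 1.0;

    for (long i = 0; i < 100000000; i++) {
        /* Chain 1 depends only on a, chain 2 only on b, so an OoO core
           can execute both chains in parallel; the loop takes about the
           same time as it would with only one of the chains. */
        a = a * 1.0000001 + 0.0000001;
        b = b * 1.0000002 + 0.0000002;
    }

    printf("%f %f\n", a, b);
    return 0;
}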
Related:
http://www.lighterra.com/papers/modernmicroprocessors/ great intro to CPU design and how smarter CPUs can find more ILP
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths shows how OoO exec can overlap exec of two dependency chains, unless you block it.
http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some specific examples of how much OoO exec can do to hide cache-miss or other latency
this Q&A about how superscalar execution works.
Register-identifier encoding space will be a problem. Indeed, many more registers have been tried. For example, SPARC has register windows: 72 to 640 registers, of which 32 are visible at one time.
Instead, consider this, from Computer Organization and Design: RISC-V Edition:
Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
BTW, ROB size has to do with the processor being out-of-order, superscalar, rather than renaming and providing lots of general purpose registers.

Understanding how bootmem works

I have been studying OS concepts and decided to look into how this stuff is actually implemented in Linux. But I am having trouble understanding some things related to memory management during the boot process, before the page allocator is turned on; more precisely, how bootmem works. I do not need its exact workings, just an understanding of how some things are/can be solved.
So obviously, bootmem cannot use dynamic memory, meaning that its size must be known before runtime so appropriate steps can be taken, i.e. the maximum size of its bitmap must be known in advance. From what I understand, this is most likely solved by simply mapping enough memory during kernel initialization; if the architecture changes, simply change the size of the mapped memory. Obviously, there is probably a lot more going on, but I guess I got the general idea? However, what really makes no sense to me is the NUMA architecture. Everywhere I read, it says that a pg_data_t is created for each memory node. This pg_data is put into a list (how can it know the size of the list? Or is the size fixed for a specific arch?) and for each node a bitmap is allocated. So, basically, it sounds like it can create an undefined number of these pg_data, each of which has its own memory bitmap of arbitrary size. How? What am I missing?
EDIT: Sorry for not including reference. Here is bootmem code, it can also be found in mm/bootmem.c: http://lxr.free-electrons.com/source/mm/bootmem.c
It is architecture-dependent. On the x86 architecture, early in the boot process the kernel issues one BIOS call - function 0xE820 of the interrupt at vector 0x15. This returns a memory map that the kernel can use to build its memory tables, including holes for non-memory (PCI or ISA) devices, etc. Bootloaders (before the kernel) do the same.
See: Detecting Memory
After looking into this more, I think it works this way: basically, all the necessary things are statically allocated, i.e. preprocessor defines ensure that certain sections of the bootmem code (as well as other parts of the kernel) either exist or do not exist in the compiled code for a specific architecture (even though the code itself is architecture-independent). These defines are specified in architecture-dependent sources found under arch/ (e.g. arch/i386, arch/arm/, etc.). For NUMA architectures there is a define called MAX_NUMNODES, ensuring that the list of structs (more specifically, pg_data_t structures) representing nodes is allocated as a static array (which is then treated as a list). The bitmaps representing the memory map are relatively small, since each page is represented by only one bit, so they take up KBs, or maybe MBs. Whatever the case, the architecture-dependent head.S sets up all the structures needed for the system to function (like page tables) and ensures enough physical memory is mapped into virtual memory so that these bitmaps fit without causing a page fault (in the case of x86, the initial 8MB of RAM is mapped, which is more than enough for both the kernel and additional structures, like bitmaps).
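A simplified sketch (my illustration, not the actual kernel source) of the idea: because MAX_NUMNODES is a compile-time constant coming from the architecture's config, the per-node descriptors can live in a static array and no dynamic allocator is needed yet, while each per-node bitmap stays small (one bit per page frame).

#include <stddef.h>

#define MAX_NUMNODES 64                 /* hypothetical value; really set per-arch */

struct pglist_data_sketch {
    unsigned long node_start_pfn;       /* first page frame of this node */
    unsigned long node_spanned_pages;   /* number of page frames it spans */
    unsigned char *bootmem_bitmap;      /* 1 bit per page; set = allocated */
};

/* One statically sized slot per possible node; only the nodes the
 * firmware actually reports get filled in during early boot. */
static struct pglist_data_sketch node_data_sketch[MAX_NUMNODES];

/* Bitmap size for a node: one bit per page frame. For 4 GiB of 4 KiB
 * pages that is 1M bits = 128 KiB, which easily fits in the memory
 * already mapped by head.S. */
static size_t bootmem_bitmap_bytes(unsigned long spanned_pages)
{
    return (spanned_pages + 7) / 8;
}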

What <4GB workloads would have worse performance in the Linux x32 ABI than x64?

There is a relatively new Linux ABI referred to as x32, where the x86-64 processor runs in 32-bit mode, so pointers are still only 32-bits, but the 64-bit architecture specific registers are still used. So you're still limited to 4GB max memory use as in normal 32-bit, but your pointers use up less cache space than they do in 64-bit, you can do 64-bit arithmetic efficiently, and you get access to more registers (16) than you would in vanilla 32-bit (8).
Assuming you have a workload that fits nicely within 4GB, is there any way the performance of x32 could be worse than on x86-64?
It seems to me that if you don't need the extra memory space nothing is lost -- you should always get the same perf (when you already fit in cache) or better (when the pointer space savings lets you fit more in cache). But it wouldn't surprise me if there are paging/TLB/etc. details that I don't know about.
Certainly if you have a multithreaded program, the fact that data structures are smaller on x32 might cause cache line fighting between threads -- different objects might get allocated on the same cache line in x32 mode and different cache lines in x86_64 mode. If two threads modify those objects independently the cache ping-ponging could severely slow down the x32 code. Of course, this kind of cache effect could happen regardless of pointer size, but if the code has been tuned assuming 64-bit pointers, going to 32-bit pointers could de-tune things.
In x32 the processor is actually executing in "long mode", the same mode as for x86_64. That is, addresses as seen by the processor when doing addressing are still 64 bits; however, the x32 ABI makes sure that all addresses are small enough to fit into 32 bits. As a result, in some cases there is a slight overhead when pointers have to be zero-extended from 32 bits to 64.
Also, needing x86/x86-64/x32 libraries in RAM, which I suppose is what one will end up with in practice (unless you're talking about some embedded or other tightly controlled system rather than a general-purpose computer), may eat up some of the benefit of x32.
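To see the data-model difference concretely, here is a trivial sketch (assuming your toolchain and distro have x32 multilib support installed, so that gcc -mx32 actually links):

/* sizes.c - compare the three ABIs (hypothetical commands, assuming
 * multilib support is installed):
 *   gcc -m32  sizes.c   -> pointer 4, long 4   (i386)
 *   gcc -mx32 sizes.c   -> pointer 4, long 4   (x32: ILP32, 64-bit registers)
 *   gcc -m64  sizes.c   -> pointer 8, long 8   (x86-64)
 */
#include <stdio.h>

int main(void)
{
    printf("sizeof(void *) = %zu, sizeof(long) = %zu\n",
           sizeof(void *), sizeof(long));
    return 0;
}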

Prohibit unaligned memory accesses on x86/x86_64

I want to emulate a system that prohibits unaligned memory accesses, on x86/x86_64.
Is there some debugging tool or special mode to do this?
I want to run many (CPU-intensive) tests on several x86/x86_64 PCs when working with software (C/C++) designed for SPARC or some other similar CPU, but my access to SPARC hardware is limited.
As far as I know, SPARC always requires memory reads and writes to be naturally aligned (a byte can be read from any address, but a 4-byte word can only be read when the address is divisible by 4).
Maybe Valgrind or PIN has such a mode? Or a special mode of the compiler?
I'm searching for a non-commercial Linux tool, but Windows tools are allowed too.
Or maybe there is a secret CPU flag in EFLAGS?
I've just read question Does unaligned memory access always cause bus errors? which linked to Wikipedia article Segmentation Fault.
In the article, there's a wonderful reminder of a rather uncommon Intel processor flag: AC, aka Alignment Check.
And here's how to enable it (adapted from Wikipedia's Bus Error example, with a red-zone clobber bug fixed for the x86-64 System V ABI so this is safe on Linux and macOS, and converted from Basic asm, which is never a good idea inside functions: you want changes to AC to be ordered wrt. memory accesses).
#if defined(__GNUC__)
# if defined(__i386__)
    /* Enable Alignment Checking on x86 */
    __asm__("pushf\n orl $0x40000,(%%esp)\n popf" ::: "memory");
# elif defined(__x86_64__)
    /* Enable Alignment Checking on x86_64 */
    __asm__("add $-128, %%rsp \n"   // skip past the red zone, in case there is one and the compiler has local vars there
            "pushf \n"
            "orl $0x40000,(%%rsp) \n"
            "popf \n"
            "sub $-128, %%rsp"      // and restore the stack pointer
            ::: "memory");          // ordered wrt. other memory accesses
# endif
#endif
Once enabled, it works a lot like the ARM alignment settings in /proc/cpu/alignment; see the answer to How to trap unaligned memory access? for examples.
Additionally, if you're using GCC, I suggest you enable -Wcast-align warnings. When building for a target with strict alignment requirements (ARM for example), GCC will report locations that might lead to unaligned memory access.
But note that libc's handwritten asm for memcpy and other functions will still make unaligned accesses, so setting AC is often not practical on x86 (including x86-64). GCC will sometimes emit asm that makes unaligned accesses even if your source doesn't, e.g. as an optimization to copy or zero two adjacent array elements or struct members at once.
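For a self-contained demonstration (a sketch of mine; x86-64 Linux with GCC or Clang assumed), enabling AC and then performing one deliberately misaligned load is enough to get the process killed with SIGBUS, which the shell reports as "Bus error":

#include <stdio.h>

int main(void)
{
    _Alignas(4) char buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

#if defined(__GNUC__) && defined(__x86_64__)
    /* Same sequence as above: set EFLAGS.AC, stepping past the red zone. */
    __asm__("add $-128, %%rsp \n"
            "pushf \n"
            "orl $0x40000, (%%rsp) \n"
            "popf \n"
            "sub $-128, %%rsp"
            ::: "memory");
#endif

    volatile int *p = (volatile int *)(buf + 1);  /* misaligned for int */
    int x = *p;   /* with AC set, this raises #AC -> SIGBUS ("Bus error") */

    printf("unexpectedly survived: %d\n", x);     /* only reached if AC is off */
    return 0;
}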
It's tricky and I haven't done it personally, but I think you can do it in the following way:
x86_64 CPUs (specifically I've checked Intel Core i7, but I guess others as well) have a performance counter, MISALIGN_MEM_REF, which counts misaligned memory references.
So first of all, you can run your program and use the "perf" tool under Linux to get a count of the number of misaligned accesses your code has done.
A more tricky and interesting hack would be to write a kernel module that programs the performance counter to generate an interrupt on overflow, and make it overflow on the first unaligned load/store. Respond to this interrupt in your kernel module by sending a signal to your process.
This will, in effect, turn the x86_64 into a core that doesn't support unaligned access.
This won't be simple though - besides your code, the system libraries also use unaligned accesses, so it will be tricky to separate them from your own code.
Both GCC and Clang have UndefinedBehaviorSanitizer built in. One of its checks, alignment, can be enabled with -fsanitize=alignment. It'll emit code to check pointer alignment at runtime and report (or, with -fno-sanitize-recover=alignment, abort) when insufficiently aligned pointers are dereferenced.
See online documentation at:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
Perhaps you could somehow compile to SSE, with all aligned moves. Misaligned accesses with movaps fault, and would probably behave like illegal unaligned accesses on other architectures.
