Logging and debugging unaligned accesses on Linux / aarch64

How can I log unaligned memory accesses on Linux / aarch64 (Cortex-A57)?
I understand there are two different things involved here:
Choosing to raise an interrupt from the CPU on an unaligned access (i.e. interrupts for unaligned memory accesses that would otherwise be supported by the CPU at a performance cost)
Choosing how to handle these interrupts in Linux (log them / fire a SIGBUS / soft-emulate the unaligned access)
My problem is that, first, I do not know how to manage the CPU's control registers from my program (or whether I should do it from my userspace application at all), and second, the /proc/cpu/alignment interface for managing unaligned accesses in Linux seems to be gone (I am using a 4.4.0 kernel); see the link below.
Managing unaligned accesses from the kernel:
https://www.kernel.org/doc/Documentation/arm/mem_alignment (likely out-of-date)
Related:
Does AArch64 support unaligned access?

You can't do this. Not with Linux, anyway.
Alignment faults for EL0 are governed by the SCTLR_EL1.A bit, but that also affects EL1. Thus even if you wrote a hacky kernel module to enable it (you obviously can't touch privileged system control registers directly from userspace), you're pretty much guaranteed that the kernel's going to panic as soon as the next network packet arrives. The arm64 kernel port relies on having the unaligned access capability provided by AArch64. It doesn't have the ARM port's /proc/cpu/alignment handler, because it doesn't have the legacy of pre-ARMv6 CPUs that didn't support unaligned access at all (well, in any usable fashion at least).
What you can do, though, is use perf tools to monitor any or all of Cortex-A57's microarchitectural PMU events 0x68, 0x69 or 0x6a, to count the unaligned-access-related events which your program triggers. There's no means to trap or debug individual accesses as there might be with the blunt instrument of alignment faults, but otherwise it's arguably more useful since it'll only count events attributable to your program.
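As a rough sketch of the perf approach (not part of the original answer), the same raw events can also be counted from inside a C program through the perf_event_open system call. The helper names raw_event_attr and open_counter are made up for illustration; the event numbers are the ones quoted above, and exclude_kernel is what restricts counting to your own userspace code.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Build an attribute block for one raw PMU event (e.g. 0x68, 0x69 or 0x6a
 * on Cortex-A57, as quoted above). Hypothetical helper, not a perf API. */
struct perf_event_attr raw_event_attr(uint64_t raw_event)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;   /* CPU-specific event number */
    attr.size = sizeof attr;
    attr.config = raw_event;
    attr.disabled = 1;           /* start stopped, enable explicitly later */
    attr.exclude_kernel = 1;     /* count userspace only */
    attr.exclude_hv = 1;
    return attr;
}

/* Open the counter for the calling thread on any CPU.
 * Returns a file descriptor, or -1 with errno set. */
int open_counter(struct perf_event_attr *attr)
{
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}
```

To use it, open a counter, bracket the code of interest with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) and ioctl(fd, PERF_EVENT_IOC_DISABLE, 0), then read() a 64-bit count from the descriptor. From the shell, perf's raw-event syntax should give the same numbers for a whole run: perf stat -e r68,r69,r6a ./your_program.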

Related

How is hardware context switch used/unused in Linux?

Old x86 Intel architecture provided context switching (in the form of the TSS) at the hardware level. But I have read that Linux has long "abandoned" hardware context switching because it was less optimised, less flexible and not available on all architectures.
What confuses me is how software (Linux) can control hardware operations (saving/restoring context). Linux can choose not to use the context set up by hardware, but the hardware context switch would nevertheless happen (making the "optimisation" argument irrelevant).
Also, if Linux is not using hardware context switching, how can the value of %eip (pointing to the next instruction in the user program) be saved and the kernel stack pointer restored by the kernel (and vice versa)?
I think the kernel would need some support from hardware to save the user program's %eip and switch %esp (from user to kernel stack) even before the interrupt service routine starts.
If this support is indeed provided by hardware, then how is Linux not using hardware context switches?
Terribly confused!!!

Difference between user-space driver and kernel driver [duplicate]

I have been reading "Linux Device Drivers" by Jonathan Corbet. I have some questions that I want to know:
What are the main differences between a user-space driver and a kernel driver?
What are the limitations of both of them?
Why are user-space drivers commonly used and preferred nowadays over kernel drivers?

What are the main differences between a user-space driver and a kernel driver?
User space drivers run in user space. Kernel drivers run in kernel space.
What are the limitations of both of them?
The kernel driver can do anything the kernel can, so you could say it has no limitations. But kernel drivers are much harder to "prove correct" and debug. It's all too easy to introduce race conditions, or to use a kernel function in the wrong context or with the wrong locking. Things will appear to work for a while, but cause problems (including crashing the whole system) down the road. Drivers must also be wary when reading any input (both from the device and from userspace) because invalid data can sometimes cause crashes.
A user-space driver usually needs a small shim in the kernel to do its bidding. Usually, that 'shim' provides a simpler API. For example, the FUSE layer lets people write file systems in any language. They can be mounted, read/written, then unmounted. The shim must also protect the kernel against all invalid input.
User-space drivers have lots of limitations. For example, the kernel reserves some memory for use during emergencies, but that is not available to user-space. During memory pressure, the kernel may kill user-space programs (via the OOM killer), but it never kills kernel threads. User-space programs may be swapped out, which could lead to your device being unavailable for several seconds. (Kernel code cannot be swapped out.) Running code in user-space requires several context switches. These waste a "lot" of CPU time. If your device is a 300 baud modem, nobody will notice. But if it's a gigabit Ethernet card, and every packet has to go to your userspace driver before it gets to the real user, the system will have major bottlenecks.
User space programs are also "harder" to use because you have to install that user-space software, which often has many library dependencies. Kernel modules "just work".
Why are user-space drivers commonly used and preferred nowadays over kernel drivers?
The question is "Does this complexity really need to be in the kernel?"
I used to work for a company that made USB dongles that talked a particular protocol. We could have written a full kernel driver, but instead just wrote our program on top of libUSB.
The advantages: The program was portable between Linux, Mac, Win. No worrying about our code vs the GPL.
The disadvantages: If the device needed to send data to the PC and get a response quickly, there was no guarantee that would happen. For example, if we needed a real-time control loop on the PC, it would be harder to have bounded response times. (Maybe not entirely impossible on Linux.)
If there is a way to do it in userspace, I would try that first. Only if there are significant performance bottlenecks, or significant complexity in keeping it in userspace would you move it. Even then, consider the "shim" approach, and/or the "emulator" approach (where your kernel module makes your device look like a serial port or a block device.)
On the other hand, if there are already several kernel modules similar to what you want, then start there.

Prohibit unaligned memory accesses on x86/x86_64

I want to emulate a system that prohibits unaligned memory accesses on x86/x86_64.
Is there some debugging tool or special mode to do this?
I want to run many (CPU-intensive) tests on several x86/x86_64 PCs when working with software (C/C++) designed for SPARC or some other similar CPU, but my access to SPARC is limited.
As far as I know, SPARC always checks that memory reads and writes are naturally aligned (a byte can be read from any address, but a 4-byte word can only be read when the address is divisible by 4).
Maybe Valgrind or PIN has such a mode? Or a special compiler mode?
I'm looking for a non-commercial Linux tool, but Windows tools are fine too.
Or maybe there is a secret CPU flag in EFLAGS?

I've just read the question Does unaligned memory access always cause bus errors?, which linked to the Wikipedia article Segmentation Fault.
In the article, there's a wonderful reminder of a rather uncommon Intel processor flag: AC, aka Alignment Check.
And here's how to enable it (from Wikipedia's Bus Error example, with a red-zone clobber bug fixed for x86-64 System V so this is safe on Linux and MacOS, and converted from Basic asm, which is never a good idea inside functions: you want changes to AC to be ordered wrt. memory accesses):
#if defined(__GNUC__)
# if defined(__i386__)
/* Enable Alignment Checking on x86 */
__asm__("pushf\n orl $0x40000,(%%esp)\n popf" ::: "memory");
# elif defined(__x86_64__)
/* Enable Alignment Checking on x86_64 */
__asm__("add $-128, %%rsp \n" // skip past the red-zone, in case there is one and the compiler has local vars there.
"pushf\n"
"orl $0x40000,(%%rsp)\n"
"popf \n"
"sub $-128, %%rsp" // and restore the stack pointer.
::: "memory"); // ordered wrt. other mem access
# endif
#endif
Once enabled, it works a lot like the ARM alignment settings in /proc/cpu/alignment; see the answer to How to trap unaligned memory access? for examples.
Additionally, if you're using GCC, I suggest you enable -Wcast-align warnings. When building for a target with strict alignment requirements (ARM for example), GCC will report locations that might lead to unaligned memory access.
But note that libc's handwritten asm for memcpy and other functions will still make unaligned accesses, so setting AC is often not practical on x86 (including x86-64). GCC will sometimes emit asm that makes unaligned accesses even if your source doesn't, e.g. as an optimization to copy or zero two adjacent array elements or struct members at once.

It's tricky and I haven't done it personally, but I think you can do it in the following way:
x86_64 CPUs (specifically I've checked an Intel Core i7, but I guess others as well) have a performance counter, MISALIGN_MEM_REF, which counts misaligned memory references.
So first of all, you can run your program and use the "perf" tool under Linux to get a count of the number of misaligned accesses your code has made.
A more tricky and interesting hack would be to write a kernel module that programs the performance counter to generate an interrupt on overflow and gets it to overflow on the first unaligned load/store. Respond to this interrupt in your kernel module by sending a signal to your process.
This will, in effect, turn the x86_64 into a core that doesn't support unaligned access.
This won't be simple though: besides your code, the system libraries also use unaligned accesses, so it will be tricky to separate them from your own code.

Both GCC and Clang have UndefinedBehaviorSanitizer built in. One of its checks, alignment, can be enabled with -fsanitize=alignment. It emits code to check pointer alignment at runtime and reports an error when a misaligned pointer is dereferenced.
See online documentation at:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
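For illustration, here is a minimal sketch of the kind of code this check is about and the usual portable fix. It is an added example, not taken from the sanitizer documentation; is_aligned and load_u32 are hypothetical helper names. Dereferencing a cast pointer such as *(uint32_t *)(buf + 1) is what -fsanitize=alignment would flag, whereas copying the bytes with memcpy is valid at any address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if p is a multiple of align, which must be a power of two.
 * This mirrors the "natural alignment" rule: a 4-byte word needs an
 * address divisible by 4. */
int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Read a 32-bit value from any address without ever forming a
 * misaligned uint32_t pointer. Compilers typically turn the memcpy
 * into a plain load where the target allows it. */
uint32_t load_u32(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof v);
    return v;
}
```

Building the test suite with -fsanitize=alignment and replacing each reported dereference with a helper like load_u32 is one way to get SPARC-safe code while still testing on x86.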

Perhaps you could somehow compile to SSE, with all aligned moves. Unaligned accesses with movaps are illegal and would probably behave like illegal unaligned accesses on other architectures.

Linux Interrupt Handling in User Space

In Linux, what are the options for handling device interrupts in user space code rather than in kernel space?

Experience shows that it is possible to write good and stable user-space drivers for almost any PCI adapter. It just requires some sophistication and a small proxying layer in the kernel. UIO is a step in that direction, but if you want to handle interrupts correctly in user-space then UIO might not be enough, for example if the device doesn't support the PCI spec's interrupt disable bit, which UIO relies on.
Notice that process wakeup latencies are a few microseconds, so if your implementation requires very low latency then user-space might be a drag on it.
If I were to implement a user-space driver, I would reduce the kernel ISR to just a "disable & ack & wakeup-userspace" operation, handle the interrupt inside the woken-up process, and then re-enable the interrupt (of course, by writing to mapped PCI memory from the userspace process).
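A minimal sketch of the user-space half of such a loop, assuming a UIO device node (for example /dev/uio0) is already open. With UIO, a blocking read() of a 32-bit value returns once an interrupt has fired and yields the running interrupt count; writing a 32-bit 1 asks the kernel side to re-enable the interrupt, which only works if the kernel driver implements irqcontrol. The function names are made up for this example, and a device that is re-enabled through mapped registers, as described above, would replace uio_enable_irq with a register write.

```c
#include <stdint.h>
#include <unistd.h>

/* Block until the next interrupt on a UIO file descriptor.
 * On success stores the total interrupt count and returns 0. */
int uio_wait_irq(int fd, uint32_t *count)
{
    ssize_t n = read(fd, count, sizeof *count);

    return n == (ssize_t)sizeof *count ? 0 : -1;
}

/* Ask the kernel-side driver to re-enable the interrupt.
 * Assumption: the driver supports irqcontrol; otherwise write() fails. */
int uio_enable_irq(int fd)
{
    uint32_t on = 1;
    ssize_t n = write(fd, &on, sizeof on);

    return n == (ssize_t)sizeof on ? 0 : -1;
}
```

The main loop is then: uio_wait_irq(), service the device through its mapped registers, uio_enable_irq(), repeat. Comparing successive counts shows whether any interrupts were missed in between.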

There is Userspace I/O system (UIO), but handling should still be done in kernelspace. OTOH, if you just need to notice the interrupt, you don't need the kernel part.

You may like to take a look at Chapter 10, Interrupt Handling, of the book Linux Device Drivers, Third Edition.

Userland code has to be triggered indirectly.
The kernel ISR signals the interrupt by writing to a file, setting a register or sending a signal. The user-space application polls for this and carries on with the appropriate code.
Edge cases: more or fewer interrupts than expected (timeout / too many interrupts per time interval).
The Linux file abstraction is used to connect kernel and user space. This is done with character devices and ioctl() calls. Some may prefer sysfs entries for this purpose.
This can look odd, because event-triggered device notifications (interrupts) are hooked up to 'time-triggered' polling, but it is actually asynchronous blocking (read/select). Still, it raises some questions about performance.
So interrupts cannot be handled directly outside the kernel.
E.g. shared memory can be in user space, and with some I/O permission settings addresses can be mapped, so UIO works, but not for direct interrupt handling.
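The blocking wait described above could look roughly like this on the user-space side. This is a sketch under assumptions: the descriptor is taken to come from a character device whose driver makes it readable when its ISR has run, and wait_for_event and consume_event are hypothetical names. The timeout covers the "fewer interrupts than expected" edge case.

```c
#include <poll.h>
#include <unistd.h>

/* Wait until the driver signals an event on fd.
 * Returns 1 if an event is pending, 0 on timeout, -1 on error. */
int wait_for_event(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;                      /* no interrupt within the timeout */
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Consume one notification byte so the next wait blocks again.
 * Assumption: the driver reports each event as one readable byte. */
int consume_event(int fd)
{
    char byte;

    return read(fd, &byte, 1) == 1 ? 0 : -1;
}
```

The process sleeps inside poll() and is woken by the kernel, so no CPU time is spent spinning; the cost is the wakeup latency and the extra system calls per interrupt.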
I have found only one 'minority report' in topic vfio (http://lxr.free-electrons.com/source/Documentation/vfio.txt):
https://stackoverflow.com/a/21197797/5349798
Similar questions:
Running user thread in context of an interrupt in linux
Is it possible in linux to register a interrupt handler from any user-space program?
Linux Kernel: invoke call back function in user space from kernel space
Linux Interrupt vs. Polling
Linux user space PCI driver
How do I inform a user space application that the driver has received an interrupt in linux?

Disabling Multithreading during runtime

I am wondering if Intel's processor provides instructions in their instruction set
to turn on and off the multithreading or hyperthreading capability? Basically, I wanna
know if an Operating System can control these feature via instructions somehow?
Thank you so much
Mareike

Most operating systems have a facility for changing a process' CPU affinity, thereby restricting it to a single physical or virtual core. But multithreading is a program architecture, not a CPU facility.

I think that what you are trying to ask is, "Is there a way to prevent the OS from utilizing hyperthreading and/or multiple cores?"
The answer is, definitely. This isn't governed by a single instruction, and indeed it's not like you can just write a device driver that would automagically disable all of that hardware. Most of this depends on how the kernel configures the interrupt controllers at boot time.
When a machine is first started, there is a designated processor that is used for bootstrapping. It is the responsibility of the OS to configure the multiprocessor hardware accordingly. On PC platforms this would involve reading information about the multiprocessor configuration from in-memory tables provided by the boot firmware. This data would likely conform to either the ACPI or the Intel multiprocessor specifications. The kernel then uses that data to configure the APIC hardware accordingly.

Multithreading and multitasking are not special instructions or modes in the CPU. They're just fancy ways in which people who write operating systems use interrupts. There is a hardware timer, basically a counter incremented by a clock signal, that triggers an interrupt when it overflows. The exact interrupt is platform specific. In the olden days this timer was a separate chip/circuit on the motherboard that was simply attached to one of the CPU's interrupt pins. Modern CPUs have this timer built in. So, to turn off multithreading and multitasking the OS can simply disable the interrupt signal.
Alternatively, since it's the OS's job to actually schedule processes/threads, the OS can simply decide to ignore all threads and not run them.
Hyperthreading is a different thing. It sort of allows the OS to see a second virtual CPU that it can execute code on. Never had to deal with the thing directly so I'm not sure how to turn it off (or even if it is possible).

There is no x86 instruction that disables HyperThreading or additional cores. But there are BIOS settings that can turn off these features. Because they are set in the BIOS, changing them requires rebooting, and generally it's beyond OS control. There is a Windows boot option that limits the number of active cores, but HyperThreading can be turned on/off only by the BIOS. Intel's current HyperThreading implementation doesn't allow turning it on and off dynamically (and that won't be easy to implement in the near future).
I have assumed that 'multithreading' in your question means 'hardware multithreading', which is technically identical to HyperThreading. However, if you really meant software-level multithreading (i.e., multitasking), then it's a totally different question. It is (almost) impossible for modern operating systems, since they support multitasking by default, so the question doesn't really make sense. It could make sense if you want to run MS-DOS (in x86 real mode, where only a single task can run).
p.s. Please note that 'multithreading' can be either hardware or software. Also, I agree with the other answers, such as those on processor/thread affinity.
