Has Hardware Lock Elision gone forever due to Spectre Mitigation? - security

Is it correct that Hardware Lock Elision is disabled for all current CPUs due to Spectre mitigation, and that any attempt to build a mutex using HLE intrinsics/instructions would result in a usual mutex?
Is it likely that there will not be anything like HLE mutexes in the future, in order to avoid vulnerabilities like Spectre?

So TSX may be disabled not to mitigate Spectre, but as part of the mitigation of another vulnerability, TSX Asynchronous Abort (TAA).
Here's the relevant article on Intel's website:
Intel® Transactional Synchronization Extensions (Intel® TSX) Asynchronous Abort / CVE-2019-11135 / INTEL-SA-00270
Which links to two more detailed articles:
TSX Asynchronous Abort (TAA) CVE-2019-11135
Microarchitectural Store Buffer Data Sampling (MSBDS) CVE-2018-12126
Links contain the following information:
Some future or even current CPUs may have hardware mitigation for TAA, detected by IA32_ARCH_CAPABILITIES[TAA_NO]=1.
Otherwise, if the CPU is susceptible to MDS (IA32_ARCH_CAPABILITIES[MDS_NO]=0), the software mitigation for MDS also mitigates TAA.
In the case of IA32_ARCH_CAPABILITIES[TAA_NO]=0 and IA32_ARCH_CAPABILITIES[MDS_NO]=1, TAA should be mitigated by one of the following:
Software mitigation
Selectively disabling TSX
The ability to selectively disable TSX mentioned above arrives with a microcode update. After such a microcode update, the presence of the TSX control MSR is enumerated by IA32_ARCH_CAPABILITIES[TSX_CTRL] (bit 7)=1.
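On Linux you can inspect these enumeration bits yourself through the msr driver. Here is a minimal sketch, assuming the bit positions documented by Intel (MDS_NO = bit 5, TSX_CTRL = bit 7, TAA_NO = bit 8 of IA32_ARCH_CAPABILITIES, MSR 0x10A); it needs root and the msr kernel module loaded:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_ARCH_CAPABILITIES 0x10A   /* IA32_ARCH_CAPABILITIES */

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* needs root and the msr module */
    if (fd < 0) { perror("open"); return 1; }
    uint64_t val;
    if (pread(fd, &val, sizeof val, MSR_ARCH_CAPABILITIES) != sizeof val) {
        perror("pread"); return 1;
    }
    /* Bit positions per Intel's documentation; verify against the current SDM. */
    printf("MDS_NO=%u TSX_CTRL=%u TAA_NO=%u\n",
           (unsigned)(val >> 5) & 1, (unsigned)(val >> 7) & 1, (unsigned)(val >> 8) & 1);
    close(fd);
    return 0;
}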
Now, about HLE. TAA article says:
Some processors may need to load a microcode update to add support for IA32_TSX_CTRL. The MSR supports disabling the RTM functionality of Intel TSX by setting TSX_CTRL_RTM_DISABLE (bit 0). When this bit is set, all RTM transactions will abort with abort code 0 before any instructions can execute within the transaction, even speculatively. On processors that enumerate IA32_ARCH_CAPABILITIES[TSX_CTRL] (bit 7)=1, HLE prefix hints are always ignored.
The HLE feature is also marked as removed in the Intel® 64 and IA-32 Architectures Software Developer's Manual:
2.5 INTEL INSTRUCTION SET ARCHITECTURE AND FEATURES REMOVED
Intel® Memory Protection Extensions (Intel® MPX)
MSR_TEST_CTRL, bit 31 (MSR address 33H)
Hardware Lock Elision (HLE)
I believe that I have answers to my questions:
Is it correct that Hardware Lock Elision is disabled for all current CPUs due to TAA (rather than Spectre) mitigation, and that any attempt to build a mutex using HLE intrinsics/instructions would result in a usual mutex?
Yes. It is deprecated. Unless Intel undeprecates it.
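One way to check what your own CPU still enumerates (for example, after a TSX-disabling microcode update) is CPUID leaf 7; here is a small sketch using GCC's cpuid.h, with the bit positions (HLE = EBX bit 4, RTM = EBX bit 11) taken from Intel's CPUID documentation:

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 1;
    /* CPUID.(EAX=7,ECX=0):EBX -- bit 4 = HLE, bit 11 = RTM */
    printf("HLE: %u  RTM: %u\n", (ebx >> 4) & 1, (ebx >> 11) & 1);
    return 0;
}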
Is this likely that there will not be anything like HLE mutexes in future to avoid vulnerabilities like Spectre?
No. There is still RTM, which may not be disabled, and it can be used to create mutexes that behave like HLE mutexes. There may also be future processors not susceptible to TAA, on which RTM may keep working.
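For illustration, here is a minimal, untested sketch of RTM-based lock elision using the _xbegin/_xend intrinsics (compile with -mrtm); it is the same idea as an HLE mutex, built explicitly from RTM, with a plain spinlock as the fallback path:

#include <immintrin.h>    /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_held = false;

void elided_lock(void) {
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Put the lock word in our read set; abort if someone really holds it. */
        if (!atomic_load_explicit(&lock_held, memory_order_relaxed))
            return;                    /* running transactionally, lock elided */
        _xabort(0xff);
    }
    /* Transaction failed (or TSX is disabled): take the lock for real.
       A production version would retry the transaction a few times first. */
    while (atomic_exchange_explicit(&lock_held, true, memory_order_acquire))
        ;
}

void elided_unlock(void) {
    if (!atomic_load_explicit(&lock_held, memory_order_relaxed))
        _xend();                       /* commit the transaction */
    else
        atomic_store_explicit(&lock_held, false, memory_order_release);
}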

Related

Why does x86 allow unaligned accesses, and how can unaligned accesses be detected?

Perhaps I misunderstand something, but it seems that unaligned access on x86 causes security trouble, such as a Return Address Integrity issue.
Why do x86 designers allow for unaligned accesses in the first place? (Performance is the only benefit I can think of.)
If x86 designers permit this unaligned-access trouble, they should somehow know how to solve it, shouldn't they? Can unaligned accesses be detected with static techniques or sanitization techniques?
I'm skeptical of the entire premise that there's a security downside here; a quick search of your link doesn't find any mention of unaligned access being a problem.
Many other ISAs support unaligned access now, too. e.g. AArch64, later ARM including ARMv6 and ARMv7, and even MIPS32r6 (but earlier MIPS revisions didn't guarantee that). Non-x86 implementations often have a performance penalty for unaligned load or especially store, even when it's within a single cache line (which has no penalty on modern x86 for cacheable loads/stores).
The primary designer of 8086 was Stephen Morse (who wrote a book about it, The 8086 Primer, which is now free on his web site).
The x86 design choice was made between 1976 and 1978. (And couldn't be changed in later x86 without breaking backwards compat, which is the main thing x86 has going for it.) 8086 needed to support byte loads and stores, and the hardware required to support unaligned 2-byte words on its 16-bit bus was presumably minor. Especially since 8088 was also planned, with an 8-bit bus. I think its only differences from 8086 were in the bus-interface unit. Or it might have been cheaper to just do it than to implement some mechanism for alignment faults.
There is no obvious security problem, and certainly none that anyone then would have heard of.
8086 was designed for easy asm source-porting from 8080 - IDK if 8080 could ever load or store 2 bytes at once, but if it allowed doing so, it probably didn't care about alignment, so 8086 needed to support it. Modern static analysis tools probably weren't even dreamed of yet, and most 8080 code was hand-written in asm. (Like much early 8086 code, I'd guess.)
The Internet barely existed at the time and almost certainly wasn't a consideration. 8086 had no memory protection or privilege levels, so it certainly wasn't designed with security in mind. (Unlike contemporary CPUs for minicomputers that ran multi-user OSes).
The only real security threat for PCs at the time AFAIK was boot-sector viruses, and usually those spread by directly executing code that the system auto-ran during boot or from floppies, not attacking vulnerabilities in other programs. I could imagine malicious data files like .zip or word-processor formats were thought of at some point, but if there is any security advantage to disallowing misaligned accesses, it wasn't anything known then.
Software certainly wasn't spending extra code-size or cycles on hardening, not for decades after 8086.
Can unaligned accesses be detected with static techniques or sanitization techniques?
There's HW support for detecting unaligned accesses on x86, in the form of the AC bit in EFLAGS. But that's normally unusable because compilers (and hand-written asm memcpy etc. in libc) sometimes use unaligned loads, e.g. to initialize or copy adjacent narrow members of a struct.
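To make that concrete, here is a hedged sketch (x86-64 Linux and GCC inline asm assumed) that sets EFLAGS.AC and then performs a deliberately misaligned load, which should arrive as SIGBUS. Once AC is set, unaligned accesses inside library code (memcpy, printf internals) may fault too, which is exactly why the flag is normally unusable:

#include <stdio.h>

int main(void) {
    /* Set EFLAGS.AC (bit 18); with the kernel's CR0.AM=1, user-mode
       misaligned accesses now raise #AC, delivered as SIGBUS on Linux. */
    __asm__ volatile("pushf; orl $0x40000,(%rsp); popf");

    char buf[8] = {0};
    volatile int *p = (volatile int *)(buf + 1);  /* misaligned on purpose */
    printf("%d\n", *p);                           /* expect SIGBUS on this load */
    return 0;
}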
GCC has -fsanitize=alignment which seems to check for C UB of dereferencing pointers that aren't sufficiently aligned for their type. e.g. it checks *int_ptr, but doesn't add checks for memcpy(char_arr, &my_int, 4) even though it inlines as a dword store. https://godbolt.org/z/ac6K13nc1
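A small example of that distinction, as a sketch: with -fsanitize=alignment, the direct dereference below gets a runtime check, while the memcpy version does not, even though both may compile to the same unaligned load:

#include <string.h>

int load_deref(const char *p) {
    return *(const int *)p;       /* UB if p isn't 4-byte aligned; UBSan inserts a check here */
}

int load_memcpy(const char *p) {
    int v;
    memcpy(&v, p, sizeof v);      /* defined for any alignment; no UBSan check, may still inline to one load */
    return v;
}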
Misaligned locked instructions are extremely expensive (like system-wide bus lock or something), at least when split across two cache lines, and there is special support for detecting them specifically, without complaining about the normal misaligned loads/stores that happen in memcpy for odd sizes. The mechanisms include a perf counter for it, and a recent addition of an MSR (Model Specific Register) config bit to let the kernel make them raise an exception.
Cache-line-split locked instructions can apparently be a problem in terms of letting unprivileged code on one core interfere with hard-realtime code on another core.
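For illustration only, a sketch of what a cache-line-split locked operation looks like from C (GCC/Clang __atomic builtins assumed; the misaligned cast is deliberate and is exactly the kind of thing real code should not do):

#include <stdint.h>

/* 128 bytes starting at a cache-line boundary, so buf+62 names a 4-byte
   object that straddles the 64-byte line. */
_Alignas(64) static char buf[128];

int main(void) {
    int *p = (int *)(buf + 62);                   /* deliberately misaligned */
    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);   /* locked RMW across two lines: a split lock */
    return 0;
}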
It seems unaligned access in x86 gives security troubles such as a Return Address Integrity issue.
How so?
The paper you linked mentions alignment of the Function Lookup Table in this proposed hardening mechanism. There are only two instances of the string "align" in the whole paper, and neither of them talk about ARMv7-M's support for unaligned load/store creating any difficulty. (ARMv7-M is the ISA they're discussing, since it's about hardening embedded systems.)

Can we use x86_64 CPU atomics to generate compound atomic operations on PCI Express?

As you know, starting with version 2.0, PCI Express supports compound atomic operations: FetchAdd, Swap, CAS: https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf
It is also known that x86_64 CPUs have compound atomic operations in assembly: lock add, [lock] xchg, lock cmpxchg: https://godbolt.org/g/MmqMRw
These can be produced by a C compiler from volatile atomic_int operations:
#include <stdatomic.h>
volatile atomic_int a;
int main(void) {
    int expected_cas = 0;
    atomic_fetch_add(&a, 1);
    atomic_exchange(&a, 1);
    atomic_compare_exchange_weak(&a, &expected_cas, 1);
}
I want to access the buffer memory on a device (Ethernet, GPU, ...) that is connected by PCI Express to an x86_64 PC, using compound atomic operations. That is, we already know how the hardware bus works (PCIe supports atomic FetchAdd/Swap/CAS), but we want to know what assembler source code is required to use these PCIe features.
Can we use the x86_64 CPU compound atomic operations lock add, [lock] xchg, lock cmpxchg to generate the compound atomic operations FetchAdd, Swap, CAS on PCI Express?
Or what asm code should we use on an x86_64 CPU to perform the atomic operations FetchAdd, Swap, CAS over PCI Express 2.0/3.0?
From what I can gather from the Internet, the latest generations of Intel CPUs at the time of writing [1] [2] [3] only support PCIe AtomicOps as completers.
The PCIe devices integrated into the uncore can complete an AtomicOp but cannot request one; the PCIe ports can request an AtomicOp, but that's possibly just for forwarding device-initiated requests.
It seems that the PCI root complex is unable to request AtomicOps.
Enabling AtomicOps would require a tight coupling between the processor and the root complex: not only does the processor have to transmit the type of operation it is performing - thereby implementing a mapping between x86 instructions and PCIe AtomicOps - but also its operands.
Furthermore, the root complex must be able to identify when a write targets an AtomicOps enabled device among all the possible destinations - thereby requiring a set of software configurable address ranges.
Finally, AtomicOps need to be handled specially by the QPI Quiesce Master - since the target device is already taking care of the atomicity, a global QPI lock can be avoided.
All of this, of course, assuming that the target memory is not cacheable (or a cache lock would take place instead).
I don't think these are insurmountable obstacles; rather, I believe that AtomicOps were invented primarily to shorten the latency of an IO->HostMem atomic write or an IO->IO write.
Looking at what Intel wrote:
Today, message-based transactions are used for PCIe devices, and these use interrupts that can experience long latency, unlike CPU updates to main memory that use atomic transactions.
it seems that the primary concern is the use of an interrupt to notify a device driver that an atomic write must be performed on behalf of its managed device.
Host->IO AtomicOps are allowed, but it seems they can't be generated today, certainly not with a lock prefix alone.
I also believe that issuing an AtomicOp to a device from the processor would only be useful to perform a write that is atomic with respect to other PCIe devices, as the processors usually synchronise among themselves with locks.

Are RISC-V instruction execution durations standardized for the sake of cryptographic security?

Some cryptographic functions require a consistent execution duration to avoid timing attacks. I read that such functions are hard to write for x86, for reasons potentially including the emulated nature of the ISA and out-of-order processing. Preventing timing attacks on x86 is therefore not easy, because it depends on complex and/or unknown factors at any given moment.
In a standard RISC-V core, are instruction timings predictably consistent relative to one another? What about in the case of a standard core with out-of-order processing, or proprietary implementations of the base ISA?
RISC-V could be implemented in a machine with deterministic latencies; this has to do more with the implementation than the ISA.
See this project for a RISC-V implementation that supports predictable-latency execution: https://github.com/pretis/flexpret. It was developed for the embedded space, but would seem to be suitable for your proposed application as well.
It is important to differentiate an ISA from an implementation of it. Nothing in the RISC-V spec mandates instruction execution latencies. Most implementations will do whatever gives them the highest performance. A security-paranoid processor could be designed to have consistent latencies for all instructions and yet still conform to the RISC-V spec.
A nice feature of RISC-V is that plenty of opcode space was intentionally left unused to make room for ISA extensions. There appear to be no publicly announced plans for a crypto extension, so this feature could be incorporated into a crypto extension when it is made if needed.
I'm not sure about the core itself, but I've read that in the RISC-V Cryptography Extensions Volume I (riscv-crypto-spec-scalar-v1.0.1.pdf), the cryptographic instructions come with this requirement:
This instruction must always be implemented such that its execution latency does not depend on the data being operated on.
So in the context of cryptographic-specific instructions, yes.
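To make the connection concrete: a guarantee like the one quoted above is what lets library authors write data-independent code such as this sketch of a constant-time comparison (assuming the underlying XOR/OR instructions really do have data-independent latency):

#include <stddef.h>
#include <stdint.h>

/* Compares two buffers with no early exit, so the running time does not
   depend on where (or whether) they differ. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];
    return diff == 0;
}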
"is there a standard for how long each instruction should take to complete relative to other operations?"
No.
Such behavior is consistent with all other major ISAs, as far as I am aware.
An out-of-order processor will execute instructions as their dependencies resolve. Cache misses and the potentially random nature of issue select will mean that successive loop iterations will behave differently with regards to when instructions execute relative to one another. Any number of other micro-architecture issues get in the way, including instruction fetch misses, dcache misses, resource stalls causing replays, etc. Even a typical in-order core will face such issues.
how does the RISC-V team plan to address potential standard or non-standard complexity that a cryptographic library developer must find some way to address?
I can't speak for the RISC-V team, but if I may hazard a guess, I suspect that this (and similar) areas will involve the wider community to discuss and address such issues.

How are threads implemented in Windows 7?

Microsoft announced that Windows 7 has an improved threading subsystem, introducing hybrid (N:M user-space / kernel-space) thread mapping.
Does somebody know the specifics of the threading implementation? While there is a lot of material (and obviously open source for the Linux NPTL implementation) and some info on the Mac OS threads implementation, I couldn't find anything on the Windows 7 threading implementation specifics.
I'm especially interested in:
Synchronization primitives implementation (like futexes in Linux)
Thread queueing policies
Thread data structures
Thread local storage implementation
Memory allocation and deallocation
... other threading-related features I've forgotten to mention
I would appreciate any info and/or links.
Nothing drastic was changed in Windows 7, just a minor improvement in the "thread mapping" (aka thread affinity). The scheduler improves the odds that a thread stays scheduled on a particular core and doesn't jump from one core to another. This is good for power consumption, reduces cache thrashing, and supports Intel Nehalem and AMD Phenom II, architectures that have per-core low-power states. No software considerations apply, that I can think of anyway.
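If you want to influence that placement yourself from user space, the Win32 API exposes it directly; a minimal sketch (standard Win32 calls, but treat the exact scheduling effect as implementation-defined):

#include <windows.h>

int main(void) {
    /* Hint that this thread should preferentially run on CPU 2 (a hint only). */
    SetThreadIdealProcessor(GetCurrentThread(), 2);
    /* Or hard-pin the thread to CPU 2 (bit mask of allowed processors). */
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 2);
    return 0;
}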

What is Intel microcode?

From what I've read it's used to fix bugs in the CPU without modifying the BIOS.
From my basic knowledge of assembly I know that instructions are split into micro-operations internally by the CPU and executed accordingly. But Intel somehow provides a way to apply some updates while the system is up and running.
Does anyone have more info on this? Is there any documentation regarding what can be done with microcode and how it can be used?
EDIT:
I've read the Wikipedia article, but I couldn't figure out how I could write some of my own, or what uses it would have.
In older times, microcode was heavily used in CPUs: every single instruction was split into microcode. This enabled relatively complex instruction sets in modest CPUs (consider that a Motorola 68000, with its many operand modes and eight 32-bit registers, fits in 40,000 transistors, whereas a single-core modern x86 has more than a hundred million). This is not true anymore. For performance reasons, most instructions are now "hardwired": their interpretation is performed by inflexible circuitry, outside of any microcode.
In a recent x86, it is plausible that some complex instructions such as fsin (which computes the sine function on a floating point value) are implemented with microcode, but simple instructions (including integer multiplication with imul) are not. This limits what can be achieved with custom microcode.
That being said, the microcode format is not only very specific to the processor model (e.g. microcode for a Pentium III and a Pentium IV cannot be freely exchanged with each other -- and, of course, using Intel microcode for an AMD processor is out of the question), but it is also a severely protected secret. Intel has published the method by which an operating system or a motherboard BIOS may update the microcode (it must be done after each hard reset; the update is kept in volatile RAM) but the microcode contents are undocumented. The Intel® 64 and IA-32 Architectures Software Developer's Manual (volume 3a) describes the update procedure (section 9.11 "microcode update facilities") but states that the actual microcode is "encrypted" and chock-full of checksums. The wording is vague enough that just about any kind of cryptographic protection may be hidden, but the bottom line is that it is not currently possible, for people other than Intel, to write and try some custom microcode.
If the "encryption" does not include a digital (asymmetric) signature and/or if the people at Intel botched the protection system somehow, then it may be conceivable that some remarkable reverse-engineering effort could potentially enable one to produce such microcode, but, given the probably limited applicability (since most instructions are hardwired), chances are that this would not buy much, as far as programming power is concerned.
Think loosely about a virtual machine or simulator where, say, qemu-arm can simulate an ARM processor on an x86 host; ideally the software running on the simulated ARM has no idea that it isn't a real ARM. Take this idea to the level where the whole chip is designed such that it always looks like it is an x86, the software never knows there are some programmable items inside the chip, and some other processor inside is somewhat designed for the purpose of implementing/simulating an x86. Supposedly the popular AMD 29000 product line just went away because the hardware team, and perhaps the processor/core, became the guts of an early x86 clone. Transmeta, where Linus worked, had a VLIW processor that was made to be a low-power x86. In that case the translation layer was not (as much of) a secret. VLIW, very long instruction word, RISC taken to the extreme, is the kind of thing you build for this kind of task.
No, it is not as much of an emulation layer as I am implying; there isn't some Linux running there with a qemu program inside each chip. It is somewhere between hardwired, where there is no software/microcode in the middle, and a full-blown emulation. The programmable bits may be like an FPGA, programmable gates, or it may be software or programmable state machines, meaning non-programmable gates where just what runs on the gates is programmable.
Your non-x86, non-big-iron type processors, take ARM for example, are hardwired: no microcode. Microcontrollers (PIC, MSP430, AVR), assume these are not microcoded. Basically, do not assume all processors are microcoded; few if any processor families are. It is just that the ones we deal with in PCs have been and may still be, so it may feel like they all are.
As fun as it may sound to play with this microcode, it is likely very specific to the processor family, and you will likely never gain access to how it works unless you work for Intel or AMD, each of which likely has its own internals. So you would need to get a job at one of the two, then work your way through the trenches to become part of what is likely an elite team that does this work. And once you get that far your career is trapped: your skills may be limited to one job at one company. You might have more fun programming individual GPUs on a video card, something that is documented or at least has tools, something you can do today without spending 10 years at AMD or Intel to possibly get nowhere.
