SSE on x86, stack alignment and __m128i function args - visual-c++

The SSE code I have was written for x64, where the stack is 16-byte aligned. The optimised code paths have now been requested for 32-bit x86 (MSVC/Windows and GCC/Linux), and I am getting it working on MSVC first.
Apart from some inlines taking more than 3 __m128 parameters, which it refused to compile (fixed by passing a const reference and hoping the compiler optimises it away), everything seems to work as is.
//error C2719: 'd': formal parameter with __declspec(align('16')) won't be aligned
inline __m128i foo(__m128i a, __m128i b, __m128i c, __m128i d) {...}
However, I was under the impression that the stack is not 16-byte aligned on x86 Windows. Yet some __declspec(align(16)) arrays on the stack didn't even get a warning, and I am sure it must be spilling the __m128s to the stack (I recall working out that 12 registers were required on x64, and even then it moved some it didn't need for a while onto the stack and did its own thing anyway).
I even added some asserts on the array memory addresses (and turned off NDEBUG) and they all seem to pass.
__declspec(align(16)) uint32_t blocks[64];
assert(((uintptr_t)blocks) % 16 == 0);
__m128i a = ...;
__m128i b = ...;
__m128i c = ...;
__m128i d = ...;
__m128i e = ...;
__m128i f = ...;
__m128i g = ...;
//do other stuff, which surely means there are not enough registers on x86
Did I just get really lucky or is there some magic going on here to realign the stack? And is this portable? I am sure I recall having issues getting some D3DX stuff to align on x86 when I was doing D3D9 back with VS2008.
One thing I did get a bunch of warnings about, however, was the __m128 -> __m128& conversion being non-standard. Is this really unsupported on some compilers that do support SSE, and how is one meant to avoid it (e.g. inlines with __m128 output parameters, or more than 3 params)?
Also a quick look suggests MS themselves somehow break these rules (e.g. XMMatrixTransformation http://msdn.microsoft.com/en-us/library/windows/desktop/microsoft.directx_sdk.matrix.xmmatrixtransformation%28v=vs.85%29.aspx takes 6 SSE objects; the only difference I can see is that they're wrapped in structs).
XMMATRIX XMMatrixTransformation(
[in] XMVECTOR ScalingOrigin,
[in] XMVECTOR ScalingOrientationQuaternion,
[in] XMVECTOR Scaling,
[in] XMVECTOR RotationOrigin,
[in] XMVECTOR RotationQuaternion,
[in] XMVECTOR Translation
);

The variables on the stack are aligned; as far as I recall, Visual C++ has always properly overaligned stack variables, realigning the stack frame when necessary.
The error you see for the fourth parameter occurs because this version of Visual C++ cannot pass an overaligned type as a by-value parameter on the stack; the first three are passed via registers.
Use __vectorcall to pass more parameters via registers (six); the remaining parameters are then passed in a way that avoids the error even for a 7th parameter.
Or use the latest Visual C++, which can pass overaligned types normally (starting with Visual C++ 2017). (There was a bug fixed relatively recently, but it concerned passing non-trivially-copyable overaligned types; XMM types are trivially copyable, so they are fine.)
Better still, use both the latest Visual C++ and __vectorcall. :-)
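For illustration, a minimal sketch of the __vectorcall route (the function name and body are made up; __vectorcall passes the first six vector arguments in XMM registers, so this compiles on 32-bit MSVC without error C2719):

#include <emmintrin.h>

inline __m128i __vectorcall foo(__m128i a, __m128i b, __m128i c, __m128i d)
{
    //All four arguments arrive in XMM registers, so no overaligned stack copies are needed.
    return _mm_add_epi32(_mm_add_epi32(a, b), _mm_add_epi32(c, d));
}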

Is it safe to attempt (and fail) to write to a const on an STM32?

So, we are experimenting with an approach to perform some matrix math. This is embedded, so memory is limited, and we will have large matrices, so it helps us to keep some of them stored in flash rather than RAM.
I've written a matrix structure, two arrays (one const/flash and the other in RAM), and "modify" and "get" functions. One matrix I initialize to point at the RAM data, and the other at the flash data, using a cast from const f32 * to f32 *.
What I find is that when I run this code on my STM32 embedded processor, the RAM matrix is modifiable, and the matrix pointing to the flash data simply doesn't change (the set to 12.0 doesn't "take", the value remains 2.0).
(before change) a=2, b=2, (after change) c=2, d=12
This is acceptable behavior: by design we will not attempt to modify matrices of flash data, but if we make a mistake we don't want it to crash.
If I run the same code on my Windows machine with Visual C++, however, I get an "access violation" when I try to modify the const array to 12.0.
It is not surprising that Windows would object, but I'd like to understand the difference in behavior better. It seems related to CPU architecture. Is it safe, on our STM32, to let the code attempt to write to a const array and have it simply have no effect? Or are there side effects, or reasons to avoid this?
static const f32 constarray[9] = {1,2,3,1,2,3,1,2,3};
static f32 ramarray[9] = {1,2,3,1,2,3,1,2,3};
typedef struct {
    u16 rows;
    u16 cols;
    f32 * mat;
} matrix_versatile;

void modify_versatile_matrix(matrix_versatile * m, uint16_t r, uint16_t c, double new_value)
{
    m->mat[r * m->cols + c] = new_value;
}

double get_versatile_matrix_value(matrix_versatile * m, uint16_t r, uint16_t c)
{
    return m->mat[r * m->cols + c];
}

double a;
double b;
double c;
double d;

int main(void)
{
    matrix_versatile matrix_with_const_data;
    matrix_versatile matrix_with_ram_data;

    matrix_with_const_data.cols = 3;
    matrix_with_const_data.rows = 3;
    matrix_with_const_data.mat = (f32 *) constarray;

    matrix_with_ram_data.cols = 3;
    matrix_with_ram_data.rows = 3;
    matrix_with_ram_data.mat = ramarray;

    a = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    b = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);

    modify_versatile_matrix(&matrix_with_const_data, 1, 1, 12.0);
    modify_versatile_matrix(&matrix_with_ram_data, 1, 1, 12.0);

    c = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    d = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);
}
"but if we make a mistake we don't want it to crash."
Attempting to write to ROM will not in itself cause a crash, but the code attempting to write it is by definition buggy and may crash in any case, and will certainly not behave as intended.
It is almost entirely the wrong way to think about it; if you have a bug, you really want it to crash during development, not after deployment. If it silently does the wrong thing, you may never notice the bug, or the failure may occur somewhere far removed from the bug and so be very hard to find.
Architectures with an MMU or MPU may raise an exception if you attempt to write to memory marked as read-only. That is what is happening on Windows. In that case it can be a useful debugging aid, given an exception handler that reports such errors by some means; the error is then reported exactly when it occurs, rather than the program crashing some time later when some invalid data is accessed or an incorrect result is acted upon.
Some, but not all, STM32 parts include the MPU (application note).
The answer may depend on the series (STM32F1, STM32F4, STM32L1 etc), as they have somewhat different flash controllers.
I've once made the same mistake on an STM32F429, and investigated a bit, so I can tell what would happen on an STM32F4.
Probably nothing.
The flash is protected by default, in order to be somewhat resilient to this kind of programming error. To modify the flash, one has to write certain key values to the FLASH->KEYR register. If a wrong value is written, the flash is locked until reset, so nothing really bad can happen unless the program happens to write the two correct 32-bit key values. No unexpected interrupts can happen either, because the interrupt enable bits are protected by the same key. The failed attempt will set some error bits in FLASH->SR, so a program can check them and warn the user (preferably the tester).
However if there is some code there (e.g. a bootloader, or logging something into flash) that is supposed to write something in the flash, i.e. it unlocks the flash with the correct keys, then bad things can happen.
If the flash is left unlocked after a preceding write operation, then writing to a previously programmed area will change bits from 1 to 0, but not from 0 to 1. It means that the flash will contain the bitwise AND of the old and the newly written value.
If the failed write attempt occurs first, and unlocked afterwards, then no legitimate write or erase operation would succeed unless the status bits are properly cleared first.
If the intended and unintended accesses occur interleaved, e.g. in interrupt handlers, then all bets are off.
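As a rough sketch of the check mentioned above (assuming the STM32F4xx CMSIS device header, which defines FLASH and the FLASH_SR_PGPERR/FLASH_SR_PGSERR bits; the function name is made up):

#include "stm32f4xx.h"

//Returns 1 (and clears the flags, which are write-1-to-clear) if a stray
//programming attempt left error bits set in the flash status register.
int flash_write_error_pending(void)
{
    if (FLASH->SR & (FLASH_SR_PGPERR | FLASH_SR_PGSERR)) {
        FLASH->SR = FLASH_SR_PGPERR | FLASH_SR_PGSERR;
        return 1;
    }
    return 0;
}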
Even if the values are in immutable flash memory, there can still be unexpected results. Consider this code:
int foo(int *array) {
    array[0] = 1;
    array[1] = 3;
    array[2] = 5;
    return array[0];
}
An optimizing compiler might recognize that the return value should always be 1, and emit code to that effect. Or it might not, and reload array[0] from wherever it is stored, possibly a different value from flash. It may behave differently in debug and release builds, or when the function is called from different places, as it might be inlined differently.
If the pointer points to an unmapped area, neither RAM nor flash nor some memory-mapped register, then a fault will occur, and as the default fault handlers contain just an infinite loop, the program will hang unless it has a fault handler installed that can deal with the situation. Needless to say, overwriting random RAM areas or registers can result in barely predictable behaviour.
UPDATE
I've tried your code on actual hardware. When I ran it verbatim, the compiler (gcc-arm-none-eabi-7-2018-q2-update -O3 -lto) optimized away everything, since the variables were not used afterwards. Marking a, b, c, d as volatile resulted in c=2 and d=12; it was still treating the first array as const, and no accesses to the arrays were generated. constarray did not show up in the map file at all; the linker had eliminated it completely.
So I've tried a few things one at a time to force the optimizer to generate code that would actually access the arrays.
Disabling optimization (-O0)
Making all variables volatile
Inserting a couple of compile-time memory barriers (asm volatile("":::"memory");)
Doing some complex calculations in the middle
Each of these produced varying effects on different MCUs, but the behaviour was always consistent on a given platform.
STM32F103: Hard fault. Only halfword (16-bit) write accesses to the flash are allowed; 8-bit or 32-bit writes always result in a fault. When I changed the data types to short, the code ran, of course without any effect on the flash.
STM32F417: Code runs, with no effect on the flash contents, but bits 6 and 7 (PGPERR and PGSERR) in FLASH->SR were set a few cycles after the first write attempt to constarray.
STM32L151: Code runs, with no effects on the flash controller status.

Why does the ICU use this aliasing barrier when doing a reinterpret_cast?

I'm porting code from ICU 58.2 to ICU 59.1, where they changed the character type from uint16_t to char16_t. I was going to just do a straight reinterpret_cast where I needed to convert the types, but found that ICU 59.1 actually provides functions for this conversion. What I don't understand is why they need to use this anti-aliasing barrier before doing the reinterpret_cast.
#elif (defined(__clang__) || defined(__GNUC__)) && U_PLATFORM != U_PF_BROWSER_NATIVE_CLIENT
# define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory")
#endif
...
inline const UChar *toUCharPtr(const char16_t *p) {
#ifdef U_ALIASING_BARRIER
    U_ALIASING_BARRIER(p);
#endif
    return reinterpret_cast<const UChar *>(p);
}
Why wouldn't it be safe just to use reinterpret_cast without calling U_ALIASING_BARRIER?
At a guess, it's to stop any violations of the strict aliasing rule, that might occur in calling code that hasn't been completely cleaned up, from resulting in unexpected behaviour when optimizing (the hint to this is in the comment above: "Barrier for pointer anti-aliasing optimizations even across function boundaries.").
The strict aliasing rule forbids dereferencing pointers that alias the same object when they have incompatible types (a C notion, but C++ says a similar thing with more words). Here's a small gotcha: char16_t and uint16_t aren't required to be compatible. uint16_t is actually an optionally-supported type (in both C and C++); char16_t has the width of uint_least16_t, which isn't necessarily the same type as uint16_t. They will have the same width on x86, but a compiler isn't required to have them tagged as actually being the same thing. It might even intentionally decline to assume that types which typically indicate different intent can alias.
There's a more complete explanation in the linked answer, but basically given code like this:
uint16_t buffer[] = ...
buffer[0] = u'a';
uint16_t * pc1 = buffer;
char16_t * pc2 = (char16_t *)pc1;
pc2[0] = u'b';
uint16_t c3 = pc1[0];
...if for whatever reason the compiler doesn't have char16_t and uint16_t tagged as compatible, and you're compiling with optimizations on including its equivalent of -fstrict-aliasing, it's allowed to assume that the write through pc2 couldn't have modified whatever pc1 points at, and not reload the value before assigning it to c3, possibly giving it u'a' instead.
Code a bit like the example could plausibly arise mid-way through a conversion process where the previous code was happily using uint16_t * everywhere, but now a char16_t * is made available at the top of a block for compatibility with ICU 59, before all the code below has been completely changed to read only through the correctly-typed pointer.
Since compilers don't generally optimize hand-coded assembly, the presence of an asm block will force it to check all of its assumptions about the state of registers and other temporary values, and do a full reload of every value the first time it's dereferenced after U_ALIASING_BARRIER, regardless of optimization flags. This won't protect you from any further aliasing problems if you continue to write through the uint16_t * below the conversion (if you do that, it's legitimately your own fault), but it should at least ensure the state from before the conversion call doesn't persist in a way that could cause writes through the new pointer to be accidentally skipped afterwards.
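As a hypothetical illustration of that migration scenario (legacyLength is a made-up name; it assumes the toUCharPtr shown above is in scope, plus ICU's u_strlen):

#include <unicode/ustring.h>   //UChar, u_strlen

//Old code still written against UChar obtains its pointer through toUCharPtr(),
//so the barrier makes the compiler drop cached assumptions about the buffer
//before the differently-typed accesses that follow.
int32_t legacyLength(const char16_t *text) {
    const UChar *p = toUCharPtr(text);  //barrier + reinterpret_cast
    return u_strlen(p);
}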

Function for unaligned memory access on ARM

I am working on a project where data is read from memory. Some of this data consists of integers, and there was a problem accessing them at unaligned addresses. My idea is to use memcpy for that, i.e.
uint32_t readU32(const void* ptr)
{
    uint32_t n;
    memcpy(&n, ptr, sizeof(n));
    return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char* cp = (const char*)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}
So my questions:
Isn't the second version undefined behaviour? The C standard says in 6.3.2.3 Pointers, paragraph 7:
A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that undefined behaviour, then, if the uint32_t* is not correctly aligned? And if it is correctly aligned, there is no point to the function, as you could just write *(uint32_t*) to fetch the memory. Additionally, I think I read somewhere that the compiler may expect an int* to be aligned correctly, and that any unaligned int* would mean undefined behaviour as well, so the generated code for this function might take shortcuts because it may expect the function argument to be aligned properly.
The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer (no registers) inside a driver). Maybe that's why it does not use memcpy since it won't work on volatile data. But, in which world would that make sense? If the underlying data can change at any time, all bets are off. The data could even change between those byte copy operations. So you would have to have some kind of mutex to synchronize access to this data. But if you have such a synchronization, why would you need volatile?
Is there a canonical/accepted/better solution to this memory access problem? After some searching I come to the conclusion that you need a mutex and do not need volatile and can use memcpy.
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
This code
uint32_t readU32(const uint32_t* ptr)
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char* cp = (const char*)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}
takes the pointer as a const uint32_t *. If it doesn't actually point to a (suitably aligned) uint32_t, that's UB. The argument should probably be a const void *.
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
The use of volatile with respect to the correct way to access memory/registers directly on your particular hardware would have no canonical/accepted/best solution. Any solution for that would be specific to your system and beyond the scope of standard C.
Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of memcpy as an escape valve for aliasing issues even when pointers are known to be aligned, the only way compilers can efficiently process code which needs to make type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be aligned suitably for that type. As a result, your instinct that the approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is passed either a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.
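For completeness, one compiler-specific alternative sketch (a GCC/Clang extension, not standard C; the names are made up): a packed wrapper struct tells the compiler the access may be unaligned, so on cores without hardware unaligned access it emits byte loads instead of a single word load.

#include <stdint.h>

struct unaligned_u32 { uint32_t v; } __attribute__((packed));

//Read a 32-bit value from a possibly unaligned address.
uint32_t readU32_unaligned(const void *ptr)
{
    return ((const struct unaligned_u32 *)ptr)->v;
}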

accessing __m128 fields across compilers

I've noticed that accessing __m128 fields by index is possible in gcc, without using the union trick.
__m128 t;
float r(t[0] + t[1] + t[2] + t[3]);
I can also load a __m128 just like an array:
__m128 t{1.f, 2.f, 3.f, 4.f};
This is all in line with gcc's vector extensions. These, however, may not be available elsewhere. Are the loading and accessing features supported by the intel compiler and msvc?
If you want your code to work on other compilers, then don't use those GCC extensions. Use the set/load/store intrinsics. _mm_setr_ps is fine for setting constant values but should not be used in a loop. To access elements, I normally store the values to an array first and then read the array.
If you have an array a you should read/store it in with
__m128 t = _mm_loadu_ps(a);
_mm_storeu_ps(a, t);
If the array is 16-byte aligned you can use an aligned load/store which is slightly faster on newer systems but much faster on older systems.
__m128 t = _mm_load_ps(a);
_mm_store_ps(a, t);
To get 16-byte aligned memory on the stack use
__declspec(align(16)) const float a[] = ...//MSVC
__attribute__((aligned(16))) const float a[] ...//GCC, ICC
For 16-byte aligned dynamic arrays use:
float *a = (float*)_mm_malloc(sizeof(float)*n, 16); //MSVC, GCC, ICC, MinGW
You can also use macros for this:
//Test to see if we are using MSVC as MSVC AVX types are slightly different to the GCC ones
#ifdef _MSC_VER
#define GET_F32_AVX_MULTIPLATTFORM(vector,index) (vector).m256_f32[index]
#define GET_F64_AVX_MULTIPLATTFORM(vector,index) (vector).m256d_f64[index]
#else
#define GET_F32_AVX_MULTIPLATTFORM(vector,index) (vector)[index]
#define GET_F64_AVX_MULTIPLATTFORM(vector,index) (vector)[index]
#endif
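Putting the store-then-read approach mentioned above into a small helper (a sketch; get_element is a made-up name):

#include <xmmintrin.h>

//Store the vector to a temporary array, then index it; works the same on
//MSVC, GCC, ICC and clang.
float get_element(__m128 v, int i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    return tmp[i];
}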
To load a __m128, you can write _mm_setr_ps(1.f, 2.f, 3.f, 4.f), which is supported by GCC, ICC, MSVC and clang.
So far as I know, clang and recent versions of GCC support accessing __m128 fields by index. I don't know how to do this in ICC or MSVC. I guess _mm_extract_ps works for all 4 compilers, but its return type is insane, making it painful to use.
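To illustrate that awkwardness, a sketch (SSE4.1; the lane index must be a compile-time constant, and extract_lane1 is a made-up name):

#include <smmintrin.h>
#include <string.h>

//_mm_extract_ps returns the raw bits of the lane as an int, so recovering a
//float means copying those bits back.
float extract_lane1(__m128 v)
{
    int bits = _mm_extract_ps(v, 1);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}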

instruction set emulator guide

I am interested in writing emulators, e.g. for the Game Boy and other handheld consoles, but I read that the first step is to emulate the instruction set. I found a link that suggested beginners emulate the Commodore 64's 8-bit microprocessor; the thing is, I don't know a thing about emulating instruction sets. I know the MIPS instruction set, so I think I can manage understanding other instruction sets, but the problem is: what is meant by emulating them?
NOTE: If someone can provide me with a step-by-step guide to instruction set emulation for beginners, I would really appreciate it.
NOTE #2: I am planning to write in C.
NOTE #3: This is my first attempt at learning the whole emulation thing.
Thanks
EDIT: I found this site, a detailed step-by-step guide to writing an emulator, which seems promising. I'll start reading it, and I hope it helps other people who are looking into writing emulators too.
Emulator 101
An instruction set emulator is a software program that reads binary data from a software device and carries out the instructions that data contains as if it were a physical microprocessor accessing physical data.
The Commodore 64 used a 6502 microprocessor. I wrote an emulator for this processor once. The first thing you need to do is read the datasheets on the processor and learn about its behavior. What sort of opcodes does it have? How does it address memory? How does it do I/O? What are its registers? How does it start executing? These are all questions you need to be able to answer before you can write an emulator.
Here is a general overview of how it might look in C (not 100% accurate):
uint8_t RAM[65536];  //Declare a memory buffer for emulated RAM (64k)
uint16_t A;          //Declare Accumulator
uint16_t X;          //Declare X register
uint16_t Y;          //Declare Y register
uint16_t PC = 0;     //Declare Program counter, start executing at address 0
uint16_t FLAGS = 0;  //Start with all flags cleared

//Return 1 if the carry flag is set, 0 otherwise; in this example the 3rd bit is
//the carry flag (not true for the actual 6502)
#define CARRY_FLAG(flags) ((0x4 & flags) >> 2)
#define ADC 0x69
#define LDA 0xA9

while (executing) {
    switch (RAM[PC]) {  //Grab the opcode at the program counter
    case ADC:           //Add with carry
        A = A + RAM[PC+1] + CARRY_FLAG(FLAGS);
        UpdateFlags(A);
        PC += ADC_SIZE;
        break;
    case LDA:           //Load accumulator
        A = RAM[PC+1];
        UpdateFlags(A);
        PC += MOV_SIZE;
        break;
    default:
        //Invalid opcode!
    }
}
According to this reference, ADC actually has 8 opcodes in the 6502 processor, which means you will have 8 different ADC cases in your switch statement, one for each opcode and memory addressing scheme. You will have to deal with endianness and byte order, and of course pointers. I would get a solid understanding of pointers and type casting in C if you don't already have one. To manipulate the flags register you need a solid understanding of bitwise operations in C. If you are clever you can make use of C macros and even function pointers to save yourself some work, as in the CARRY_FLAG example above.
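For instance, the function-pointer idea could look like this rough sketch (the handler names and table are made up):

#include <stdint.h>

typedef void (*op_handler)(void);

void op_adc_immediate(void) { /* add with carry, immediate addressing */ }
void op_lda_immediate(void) { /* load accumulator, immediate addressing */ }
void op_invalid(void)       { /* report or ignore an unknown opcode */ }

//A 256-entry dispatch table indexed by opcode, instead of one big switch.
op_handler dispatch[256];

void init_dispatch(void)
{
    for (int i = 0; i < 256; ++i)
        dispatch[i] = op_invalid;
    dispatch[0x69] = op_adc_immediate;  //ADC #immediate
    dispatch[0xA9] = op_lda_immediate;  //LDA #immediate
}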
Every time you execute an instruction, you must advance the program counter by the size of that instruction, which is different for each opcode. Some opcodes don't take any arguments, so their size is just 1 byte, while others take 16-bit integers, as in my MOV example above. All this should be pretty well documented.
Branch instructions (JMP, JE, JNE etc.) are simple: if some flag is set in the flags register, then load the PC with the address specified. This is how "decisions" are made in a microprocessor, and emulating them is simply a matter of changing the PC, just as the real microprocessor would.
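A minimal sketch of such a branch in the same style as the example above (BEQ is the 6502's branch-if-equal; the ZERO_FLAG bit position is illustrative, like CARRY_FLAG above, and branch_if_equal is a made-up helper):

#include <stdint.h>

#define ZERO_FLAG(flags) ((0x2 & (flags)) >> 1)  //illustrative bit position

//6502 branches take a signed 8-bit offset relative to the following instruction.
void branch_if_equal(const uint8_t RAM[], uint16_t *PC, uint16_t FLAGS)
{
    if (ZERO_FLAG(FLAGS))
        *PC += 2 + (int8_t)RAM[*PC + 1];  //branch taken: add the signed offset
    else
        *PC += 2;                         //not taken: skip opcode and operand
}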
The hardest part about writing an instruction set emulator is debugging. How do you know everything is working like it should? There are plenty of resources to help you. People have written test programs that will help you debug every instruction. You can execute them one instruction at a time and compare against the reference output. If something is different, you know you have a bug somewhere and can fix it.
This should be enough to get you started. The important thing is that you have A) A good solid understanding of the instruction set you want to emulate and B) a solid understanding of low level data manipulation in C, including type casting, pointers, bitwise operations, byte order, etc.
