Why does ICU use this aliasing barrier when doing a reinterpret_cast?

I'm porting code from ICU 58.2 to ICU 59.1, where they changed the character type from uint16_t to char16_t. I was going to just do a straight reinterpret_cast where I needed to convert the types, but found that ICU 59.1 actually provides functions for this conversion. What I don't understand is why they need to use this anti-aliasing barrier before doing a reinterpret_cast.
#elif (defined(__clang__) || defined(__GNUC__)) && U_PLATFORM != U_PF_BROWSER_NATIVE_CLIENT
#   define U_ALIASING_BARRIER(ptr) asm volatile("" : : "rm"(ptr) : "memory")
#endif
...
inline const UChar *toUCharPtr(const char16_t *p) {
#ifdef U_ALIASING_BARRIER
    U_ALIASING_BARRIER(p);
#endif
    return reinterpret_cast<const UChar *>(p);
}
Why wouldn't it be safe just to use reinterpret_cast without calling U_ALIASING_BARRIER?

At a guess, it's to stop any violations of the strict aliasing rule that might occur in calling code which hasn't been completely cleaned up from resulting in unexpected behaviour when optimizing (the hint to this is in the comment above: "Barrier for pointer anti-aliasing optimizations even across function boundaries.").
The strict aliasing rule forbids dereferencing pointers that alias the same value when they have incompatible types (a C notion, but C++ says a similar thing with more words). Here's a small gotcha: char16_t and uint16_t aren't required to be compatible. uint16_t is actually an optionally-supported type (in both C and C++); char16_t has the same width and signedness as uint_least16_t, which isn't necessarily the same type. They will have the same width on x86, but a compiler isn't required to treat them as actually being the same thing, and it might even be intentionally lax about assuming that types which typically indicate different intent could alias.
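A quick compile-time check makes that distinction concrete (a minimal sketch in standard C++, nothing ICU-specific):

#include <cstdint>
#include <type_traits>

// char16_t is its own fundamental type in C++. Even on platforms where it
// has exactly the width of uint16_t, it is never the *same* type, so the
// compiler is free to treat the two as non-aliasing.
static_assert(!std::is_same<char16_t, std::uint16_t>::value,
              "char16_t and uint16_t are distinct types");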
There's a more complete explanation in the linked answer, but basically given code like this:
uint16_t buffer[] = ...
buffer[0] = u'a';
uint16_t * pc1 = buffer;
char16_t * pc2 = (char16_t *)pc1;   // same storage, incompatible type
pc2[0] = u'b';                      // write through the char16_t alias
uint16_t c3 = pc1[0];               // compiler may assume this is still u'a'
...if for whatever reason the compiler doesn't have char16_t and uint16_t tagged as compatible, and you're compiling with optimizations on including its equivalent of -fstrict-aliasing, it's allowed to assume that the write through pc2 couldn't have modified whatever pc1 points at, and not reload the value before assigning it to c3, possibly giving it u'a' instead.
Code a bit like the example could plausibly arise mid-way through a conversion process where the previous code was happily using uint16_t * everywhere, but now a char16_t * is made available at the top of a block for compatibility with ICU 59, before all the code below has been completely changed to read only through the correctly-typed pointer.
Since compilers don't generally optimize hand-coded assembly, the presence of an asm block will force it to check all of its assumptions about the state of registers and other temporary values, and do a full reload of every value the first time it's dereferenced after U_ALIASING_BARRIER, regardless of optimization flags. This won't protect you from any further aliasing problems if you continue to write through the uint16_t * below the conversion (if you do that, it's legitimately your own fault), but it should at least ensure the state from before the conversion call doesn't persist in a way that could cause writes through the new pointer to be accidentally skipped afterwards.
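In practice, that means the helpers are the safe way to cross the boundary while porting. A minimal sketch of how calling code might use them (legacyConsumer is a hypothetical pre-59 function of mine, not an ICU API):

#include <unicode/utypes.h>
#include <unicode/char16ptr.h>

void legacyConsumer(const UChar *s, int32_t len);  // hypothetical old-style API

void feed(const char16_t *text, int32_t len) {
    // toUCharPtr emits the aliasing barrier (where available) before its
    // reinterpret_cast, so the optimizer discards any cached assumptions
    // about what the pointer may alias.
    legacyConsumer(icu::toUCharPtr(text), len);
}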

Related

Is it safe to attempt (and fail) to write to a const on an STM32?

So, we are experimenting with an approach to perform some matrix math. This is embedded, so memory is limited, and we will have large matrices so it helps us to keep some of them stored in flash rather than RAM.
I've written a matrix structure, two arrays (one const/flash and the other RAM), and a "modify" and a "get" function. One matrix I initialize to the RAM data, and the other matrix I initialize to the flash data, using a cast from const f32 * to f32 *.
What I find is that when I run this code on my STM32 embedded processor, the RAM matrix is modifiable, and the matrix pointing to the flash data simply doesn't change (the set to 12.0 doesn't "take", the value remains 2.0).
(before change) a=2, b=2, (after change) c=2, d=12
This is acceptable behavior: by design we will not attempt to modify matrices of flash data, but if we make a mistake we don't want it to crash.
If I run the same code on my Windows machine with Visual C++, however, I get an "access violation" in the code below when I try to modify the const array to 12.0.
It is not surprising that Windows would object, but I'd like to understand the difference in behavior better. It seems related to CPU architecture. Is it safe, on our STM32, to let the code attempt to write to a const array and let it have no effect? Or are there side effects, or reasons to avoid this?
#include <stdint.h>

typedef float    f32;   /* typedefs implied by the question's code */
typedef uint16_t u16;

static const f32 constarray[9] = {1,2,3,1,2,3,1,2,3};  /* placed in flash */
static f32 ramarray[9] = {1,2,3,1,2,3,1,2,3};          /* placed in RAM */

typedef struct {
    u16 rows;
    u16 cols;
    f32 * mat;
} matrix_versatile;

void modify_versatile_matrix(matrix_versatile * m, uint16_t r, uint16_t c, double new_value)
{
    m->mat[r * m->cols + c] = new_value;
}

double get_versatile_matrix_value(matrix_versatile * m, uint16_t r, uint16_t c)
{
    return m->mat[r * m->cols + c];
}
double a;
double b;
double c;
double d;
int main(void)
{
    matrix_versatile matrix_with_const_data;
    matrix_versatile matrix_with_ram_data;

    matrix_with_const_data.cols = 3;
    matrix_with_const_data.rows = 3;
    matrix_with_const_data.mat = (f32 *) constarray;   /* casts const away */

    matrix_with_ram_data.cols = 3;
    matrix_with_ram_data.rows = 3;
    matrix_with_ram_data.mat = ramarray;

    a = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    b = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);
    modify_versatile_matrix(&matrix_with_const_data, 1, 1, 12.0);
    modify_versatile_matrix(&matrix_with_ram_data, 1, 1, 12.0);
    c = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    d = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);
}
"but if we make a mistake we don't want it to crash."
Attempting to write to ROM will not in itself cause a crash, but the code attempting to write it is by definition buggy and may crash in any case, and will certainly not behave as intended.
That is almost entirely the wrong way to think about it: if you have a bug, you really want it to crash during development, not after deployment. If it silently does the wrong thing, you may never notice the bug, or the crash may occur somewhere far from the bug itself and so be very hard to find.
Architectures with an MMU or MPU may issue an exception if you attempt to write to memory marked as read-only, which is what is happening on Windows. That can be a useful debugging aid, given an exception handler that reports such errors by some means: the error is reported exactly when it occurs, rather than the code crashing some time later when invalid data is accessed or an incorrect result is acted upon.
Some, but not all, STM32 parts include the MPU (application note).
The answer may depend on the series (STM32F1, STM32F4, STM32L1 etc), as they have somewhat different flash controllers.
I once made the same mistake on an STM32F429 and investigated a bit, so I can tell what would happen on an STM32F4.
Probably nothing.
The flash is by default protected, in order to be somewhat resilient to those kinds of programming errors. In order to modify the flash, one has to write certain values to the FLASH->KEYR register. If the wrong value is written, then the flash will be locked until reset, so nothing really bad can happen unless the program writes 64 bits of correct values. No unexpected interrupts can happen, because the interrupt enable bit is protected by this key too. The attempt will set some error bits in FLASH->SR, so a program can check it and warn the user (preferably the tester).
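For illustration, a minimal sketch of that check, using CMSIS register and mask names as they appear in the stm32f4xx.h device header (verify the flags against the reference manual for your exact part):

#include "stm32f4xx.h"   /* CMSIS device header, F4 series assumed */

/* Returns nonzero if a stray write into flash address space has latched
 * error flags in FLASH->SR, and clears them (write-1-to-clear) so that a
 * later legitimate program/erase operation is not refused. */
int flash_stray_write_detected(void)
{
    if (FLASH->SR & (FLASH_SR_PGPERR | FLASH_SR_PGSERR)) {
        FLASH->SR = FLASH_SR_PGPERR | FLASH_SR_PGSERR;
        return 1;
    }
    return 0;
}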
However if there is some code there (e.g. a bootloader, or logging something into flash) that is supposed to write something in the flash, i.e. it unlocks the flash with the correct keys, then bad things can happen.
If the flash is left unlocked after a preceding write operation, then writing to a previously programmed area will change bits from 1 to 0, but not from 0 to 1. It means that the flash will contain the bitwise AND of the old and the newly written value.
If the failed write attempt occurs first and the flash is unlocked afterwards, then no legitimate write or erase operation will succeed unless the status bits are properly cleared first.
If the intended and unintended accesses occur interleaved, e.g. in interrupt handlers, then all bets are off.
Even if the values are in immutable flash memory, there can still be unexpected results. Consider this code:
int foo(int *array) {
    array[0] = 1;
    array[1] = 3;
    array[2] = 5;
    return array[0];   /* 1, or whatever the storage actually kept? */
}
An optimizing compiler might recognize that the return value should always be 1, and emit code to that effect. Or it might not, and reload array[0] from wherever it is stored, possibly a different value from flash. It may behave differently in debug and release builds, or when the function is called from different places, as it might be inlined differently.
If the pointer points to an unmapped area, neither RAM nor FLASH nor some memory mapped register, then a fault will occur, and as the default fault handlers contain just an infinite loop, the program will hang unless it has a fault handler installed that can deal with the situation. Needless to say, overwriting random RAM areas or registers can result in barely predictable behaviour.
UPDATE
I've tried your code on actual hardware. When I ran it verbatim, the compiler (gcc-arm-none-eabi-7-2018-q2-update, -O3 -flto) optimized away everything, since the variables were not used afterwards. Marking a, b, c, d as volatile resulted in c=2 and d=12; the compiler was still treating the first array as const, and no accesses to the arrays were generated. constarray did not show up in the map file at all; the linker had eliminated it completely.
So I've tried a few things one at a time to force the optimizer to generate code that would actually access the arrays.
Disabling optimization (-O0)
Making all variables volatile
Inserting a couple of compile-time memory barriers (asm volatile("":::"memory");)
Doing some complex calculations in the middle
Each of these produced varying effects on different MCUs, but the effects were always consistent on a single platform.
STM32F103: Hard Fault. Only halfword (16-bit) write accesses are allowed to the flash; 8- or 32-bit accesses always result in a fault. When I changed the data types to short, the code ran, of course without any effect on the flash.
STM32F417: Code runs, with no effects on the flash contents, but bits 6 and 7, PGPERR and PGSERR in FLASH->SR were set a few cycles after the first write attempt to constarray.
STM32L151: Code runs, with no effects on the flash controller status.
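For reference, a boiled-down sketch of the kind of test that forces real accesses (my reconstruction, not the exact test code; GNU C assumed for the asm):

#include <stdint.h>

typedef float f32;

static const f32 constarray[9] = {1,2,3,1,2,3,1,2,3};  /* placed in flash */

volatile f32 readback;   /* volatile: the final read must actually be emitted */

int main(void)
{
    f32 *mat = (f32 *)constarray;   /* the questionable cast from the question */

    mat[4] = 12.0f;                 /* attempted write into flash */

    /* Compile-time barrier: the compiler must forget any copy of
     * constarray[4] it still holds in a register. */
    asm volatile("" ::: "memory");

    readback = constarray[4];       /* re-read from memory: still 2.0f? */
    return 0;
}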

Function for unaligned memory access on ARM

I am working on a project where data is read from memory. Some of this data are integers, and there was a problem accessing them at unaligned addresses. My idea would be to use memcpy for that, i.e.
#include <stdint.h>
#include <string.h>

uint32_t readU32(const void* ptr)
{
    uint32_t n;
    memcpy(&n, ptr, sizeof(n));
    return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char* cp = (const char*)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}
So my questions:
Isn't the second version undefined behaviour? The C standard says in 6.3.2.3 Pointers, paragraph 7:
A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that undefined behaviour, then, if the uint32_t* is not correctly aligned? And if it is correctly aligned, there is no point in the function, as you could just write *(uint32_t*) to fetch the memory. Additionally, I think I read somewhere that the compiler may assume an int* is aligned correctly, and that any unaligned int* would mean undefined behaviour as well, so the generated code for this function might take shortcuts because it may expect the function argument to be aligned properly.
The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer (no registers) inside a driver). Maybe that's why it does not use memcpy since it won't work on volatile data. But, in which world would that make sense? If the underlying data can change at any time, all bets are off. The data could even change between those byte copy operations. So you would have to have some kind of mutex to synchronize access to this data. But if you have such a synchronization, why would you need volatile?
Is there a canonical/accepted/better solution to this memory access problem? After some searching, I have come to the conclusion that you need a mutex, do not need volatile, and can use memcpy.
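To make that concrete, a sketch of the combination (POSIX threads assumed; the lock and its name are mine, not from the original project):

#include <stdint.h>
#include <string.h>
#include <pthread.h>

static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read a possibly unaligned 32-bit value from a shared buffer. The mutex
 * guarantees the bytes cannot change mid-copy, which is exactly what
 * volatile could never guarantee, and memcpy handles the alignment. */
uint32_t read_shared_u32(const void *ptr)
{
    uint32_t n;
    pthread_mutex_lock(&buf_lock);
    memcpy(&n, ptr, sizeof n);
    pthread_mutex_unlock(&buf_lock);
    return n;
}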
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
This code
uint32_t readU32(const uint32_t* ptr)
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char* cp = (const char*)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}
passes the pointer as a uint32_t *. If it's not actually a uint32_t, that's UB. The argument should probably be a const void *.
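A sketch of the same function with the weaker parameter type (the body is unchanged from the question):

#include <stdint.h>

uint32_t readU32(const void *ptr)   /* no longer claims uint32_t alignment */
{
    union {
        uint32_t n;
        char data[4];
    } tmp;
    const char *cp = (const char *)ptr;
    tmp.data[0] = *cp++;
    tmp.data[1] = *cp++;
    tmp.data[2] = *cp++;
    tmp.data[3] = *cp;
    return tmp.n;
}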
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
The use of volatile with respect to the correct way to access memory/registers directly on your particular hardware would have no canonical/accepted/best solution. Any solution for that would be specific to your system and beyond the scope of standard C.
Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of "memcpy" as an escape valve for aliasing issues even when pointers are known to be aligned, the only way compilers can efficiently process code which needs to make type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be aligned suitably for such type. As a result, your instinct that the approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is either passed a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.
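For what it's worth, C11's _Generic can approximate that compile-time checking at the macro level (my sketch of the idea; it is a macro, not the function declaration the answer says the language lacks):

#include <stdint.h>
#include <string.h>

static inline uint32_t readU32_impl(const void *ptr)
{
    uint32_t n;
    memcpy(&n, ptr, sizeof n);   /* byte-wise, alignment-agnostic access */
    return n;
}

/* Only void* and uint32_t* (and their const versions) match an
 * association; passing a uint16_t* or a double* fails to compile. */
#define readU32(p) _Generic((p),              \
        void *:           readU32_impl,       \
        const void *:     readU32_impl,       \
        uint32_t *:       readU32_impl,       \
        const uint32_t *: readU32_impl)(p)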

Why is there an infinite loop at the end of do_exit() defined in kernel/exit.c?

In the Linux kernel, I am confused about the purpose of having the loop at the end of do_exit().
Isn't the call to schedule() the last code that will ever be executed by do_exit()?
void do_exit(long code)
{
        struct task_struct *tsk = current;
        int group_dead;

        ..........

        /* causes final put_task_struct in finish_task_switch(). */
        tsk->state = TASK_DEAD;
        tsk->flags |= PF_NOFREEZE;      /* tell freezer to ignore us */
        schedule();
        BUG();
        /* Avoid "noreturn function does return". */
        for (;;)
                cpu_relax();    /* For when BUG is null */
}
Depending on the kernel version it may appear slightly different, but generally the structure is the same.
In some implementations it starts with:
NORET_TYPE void do_exit(long code)
NORET_TYPE may be defined differently across GCC versions (it may be an attribute that marks a function as not returning), or it may be volatile. What does such a declaration on a function do? It effectively says: I won't return. You can find more about it in the GCC documentation, which says:
The attribute noreturn is not implemented in GCC versions earlier than 2.5. An alternative way to declare that a function does not return, which works in the current version and in some older versions, is as follows:
typedef void voidfn ();
volatile voidfn fatal;
volatile void functions are non-conforming extensions to the C standard created by the GCC developers. You won't find it in the ANSI C Standard (C89).
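In modern GCC the same intent is usually spelled with the noreturn attribute; a minimal sketch of the whole pattern (my example, not kernel code):

/* The modern spelling of NORET_TYPE: promise the compiler we never return. */
__attribute__((noreturn)) void die(long code)
{
    (void)code;   /* a real implementation would do its teardown here */

    /* Without this loop, GCC warns: "'noreturn' function does return". */
    for (;;)
        ;         /* the kernel uses cpu_relax() as the loop body */
}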
It happens that when the kernel reaches do_exit() it doesn't intend to return from the function. Generally it will block indefinitely or until something resets the system. The problem is that if you mark a function as not returning, the compiler will warn you if the function can exit. So you generally see an infinite loop of some sort (a while, a for, a goto, etc.). In this case it does:
/* Avoid "noreturn function does return". */
for (;;)
Interestingly enough, the comment pretty much gives the reason: "noreturn function does return" is a GCC compiler warning. An infinite loop that is obvious to the compiler (like for (;;)) is enough to stop it from complaining, as it will determine that the function can't reach a point where it can exit.
Even if there were no compiler warning to worry about, the infinite loop prevents the function from returning. At some point a kernel (not necessarily Linux) is going to be faced with the fact that it was started by a jmp instruction (at least on x86 systems). Often the jmp is used to set the code segment for entering protected mode, or at the most basic level the BIOS jumps to the code that loads the boot sector (in very simple OSes). This means that there is a finite end to the code, and to prevent the processor from executing invalid instructions it is better to keep it busy doing nothing of interest.
The code cpu_relax(); /* For when BUG is null */ is to fix a kernel bug. It is mentioned in this post by Linus:
sched: Fix ancient race in do_exit()

What is the use of __iomem in linux while writing device drivers?

I have seen that __iomem is used for the return type of ioremap(), but I have used u32 on ARM for it and it works well.
So what difference does __iomem make here? And in which circumstances should I use it exactly?
Lots of type casts are going to just "work well". However, this is not very strict. Nothing stops you from casting a u32 to a u32 * and dereference it, but this is not following the kernel API and is prone to errors.
__iomem is a cookie used by Sparse, a tool used to find possible coding faults in the kernel. If you don't compile your kernel code with Sparse enabled, __iomem will be ignored anyway.
Use Sparse by first installing it, and then adding C=1 to your make call. For example, when building a module, use:
make -C $KPATH M=$PWD C=1 modules
__iomem is defined like this:
# define __iomem __attribute__((noderef, address_space(2)))
Adding (and requiring) a cookie like __iomem for all I/O accesses is a way to be stricter and avoid programming errors. You don't want to read/write from/to I/O memory regions with absolute addresses because you're usually using virtual memory. Thus,
void __iomem *ioremap(phys_addr_t offset, unsigned long size);
is usually called to get the virtual address of an I/O physical address offset, for a specified length size in bytes. ioremap() returns a pointer with an __iomem cookie, so this may now be used with inline functions like readl()/writel() (although it's now preferable to use the more explicit macros ioread32()/iowrite32(), for example), which accept __iomem addresses.
Also, the noderef attribute is used by Sparse to make sure you don't dereference an __iomem pointer. Dereferencing should work on some architecture where the I/O is really memory-mapped, but other architectures use special instructions for accessing I/Os and in this case, dereferencing won't work.
Let's look at an example:
void *io = ioremap(42, 4);
Sparse is not happy:
warning: incorrect type in initializer (different address spaces)
expected void *io
got void [noderef] <asn:2>*
Or:
u32 __iomem* io = ioremap(42, 4);
pr_info("%x\n", *io);
Sparse is not happy either:
warning: dereference of noderef expression
In the last example, the first line is correct, because ioremap() returns its value to an __iomem variable. But then we dereference it, and we're not supposed to.
This makes Sparse happy:
void __iomem* io = ioremap(42, 4);
pr_info("%x\n", ioread32(io));
Bottom line: always use __iomem where it's required (as a return type or as a parameter type), and use Sparse to make sure you did so. Also: do not dereference an __iomem pointer.
Edit: Here's a great LWN article about the inception of __iomem and functions using it.
Simple, Straight and Short (S3) Explanation.
There is an article https://lwn.net/Articles/653585/ for more details.

SSE on x86, stack alignment and __m128i function args

The SSE code I have was written for x64, where the stack is aligned by 16. The optimised code paths have now been requested for 32-bit x86 (for MSVC/Windows and GCC/Linux). I'm getting this working on MSVC first.
Now, apart from some inlines that took more than 3 __m128 parameters, which it refused to compile (fixed by making one a const ref and hoping the compiler will optimise it out), everything seems to work as is.
//error C2719: 'd': formal parameter with __declspec(align('16')) won't be aligned
inline __m128i foo(__m128i a, __m128i b, __m128i c, __m128i d) {...}
However, I was under the impression the stack is not 16-byte aligned on x86 Windows. Yet some __declspec(align(16)) arrays on the stack didn't even get a warning, and I am sure it must be pushing and popping the __m128s (I recall working out that 12 registers were required on x64, and even then it moved some to the stack it didn't need for a bit and did its own thing anyway).
I even added some asserts on the array memory addresses (and turned off NDEBUG) and they all seem to pass.
__declspec(align(16)) uint32_t blocks[64];
assert(((uintptr_t)blocks) % 16 == 0);
__m128i a = ...;
__m128i b = ...;
__m128i c = ...;
__m128i d = ...;
__m128i e = ...;
__m128i f = ...;
__m128i g = ...;
//do other stuff, which surely means there is not enough registers on x86
Did I just get really lucky or is there some magic going on here to realign the stack? And is this portable? I am sure I recall having issues getting some D3DX stuff to align on x86 when I was doing D3D9 back with VS2008.
One thing I did get a bunch of warnings for, however, was the __m128 -> __m128& conversions being non-standard. Is this really not supported on some compilers that support SSE, and how is one meant to avoid it (e.g. inlines with output __m128s, or more than 3 params)?
Also, a quick look suggests MS themselves somehow break these rules (e.g. XMMatrixTransformation http://msdn.microsoft.com/en-us/library/windows/desktop/microsoft.directx_sdk.matrix.xmmatrixtransformation%28v=vs.85%29.aspx takes 6 SSE objects; the only difference I can see is that they're wrapped in structs):
XMMATRIX XMMatrixTransformation(
[in] XMVECTOR ScalingOrigin,
[in] XMVECTOR ScalingOrientationQuaternion,
[in] XMVECTOR Scaling,
[in] XMVECTOR RotationOrigin,
[in] XMVECTOR RotationQuaternion,
[in] XMVECTOR Translation
);
The variables on the stack are aligned. As far as I recall, Visual C++ has always properly overaligned stack variables.
The error that you see for the fourth parameter occurs because Visual C++ is not able to pass an overaligned type as a by-value parameter on the stack. The first three are passed via registers.
Use __vectorcall to pass more parameters via registers (six), and to pass the rest of the parameters by value on the stack (thus avoiding the error even for a 7th parameter).
Use the latest Visual C++, which can pass overaligned types normally (starting in Visual C++ 2017). (There was a bug fixed relatively recently, but it was about passing non-trivially copyable overaligned types; xmm types are trivially copyable, so they are fine.)
Better use both latest Visual C++ and __vectorcall :-)
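For instance, the declaration from the question compiles on 32-bit MSVC once __vectorcall is applied (a sketch; the body is mine, since the original was elided):

#include <emmintrin.h>

// __vectorcall passes up to six vector arguments in XMM registers, so all
// four __m128i parameters avoid the (insufficiently aligned) x86 stack.
inline __m128i __vectorcall foo(__m128i a, __m128i b, __m128i c, __m128i d)
{
    return _mm_add_epi32(_mm_add_epi32(a, b), _mm_add_epi32(c, d));
}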
