Is it safe to attempt (and fail) to write to a const on an STM32?

So, we are experimenting with an approach to perform some matrix math. This is embedded, so memory is limited, and we will have large matrices so it helps us to keep some of them stored in flash rather than RAM.
I've written a matrix structure, two arrays (one const/flash and the other RAM), and a "modify" and "get" function. One matrix I initialize to the RAM data, and the other matrix I initialize to the flash data, using a cast from const f32 * to f32 *.
What I find is that when I run this code on my STM32 embedded processor, the RAM matrix is modifiable, and the matrix pointing to the flash data simply doesn't change (the set to 12.0 doesn't "take", the value remains 2.0).
(before change) a=2, b=2, (after change) c=2, d=12
This is acceptable behavior: by design we will not attempt to modify matrices of flash data, but if we make a mistake we don't want it to crash.
If I run the same code on my Windows machine with Visual C++, however, I get an "access violation" when the code below tries to modify the const array to 12.0.
It is not surprising that Windows would object, but I'd like to understand the difference in behavior better; it seems related to the CPU architecture. Is it safe, on our STM32, to let the code attempt to write to a const array and have no effect? Or are there side effects, or other reasons to avoid this?
static const f32 constarray[9] = {1, 2, 3, 1, 2, 3, 1, 2, 3};
static f32 ramarray[9] = {1, 2, 3, 1, 2, 3, 1, 2, 3};

typedef struct {
    u16 rows;
    u16 cols;
    f32 *mat;
} matrix_versatile;

void modify_versatile_matrix(matrix_versatile *m, uint16_t r, uint16_t c, double new_value)
{
    m->mat[r * m->cols + c] = new_value;
}

double get_versatile_matrix_value(matrix_versatile *m, uint16_t r, uint16_t c)
{
    return m->mat[r * m->cols + c];
}

double a;
double b;
double c;
double d;

int main(void)
{
    matrix_versatile matrix_with_const_data;
    matrix_versatile matrix_with_ram_data;

    matrix_with_const_data.cols = 3;
    matrix_with_const_data.rows = 3;
    matrix_with_const_data.mat = (f32 *) constarray;

    matrix_with_ram_data.cols = 3;
    matrix_with_ram_data.rows = 3;
    matrix_with_ram_data.mat = ramarray;

    a = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    b = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);

    modify_versatile_matrix(&matrix_with_const_data, 1, 1, 12.0);
    modify_versatile_matrix(&matrix_with_ram_data, 1, 1, 12.0);

    c = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
    d = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);

    return 0;
}

but if we make a mistake we don't want it to crash.
Attempting to write to ROM will not in itself cause a crash, but the code attempting to write it is by definition buggy and may crash in any case, and will certainly not behave as intended.
That is almost entirely the wrong way to think about it; if you have a bug, you really want it to crash during development, not after deployment. If it silently does the wrong thing, you may never notice the bug, or the crash may occur somewhere far from the bug itself, making it very hard to find.
Architectures with an MMU or MPU may issue an exception if you attempt to write to memory marked as read-only. That is what is happening on Windows. In that case it can be a useful debug aid, given an exception handler that reports such errors by some means. The error is then reported exactly when it occurs, rather than the program crashing some time later when some invalid data is accessed or an incorrect result is acted upon.
Some, but not all, STM32 parts include an MPU (application note).
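On parts that do have one, a minimal sketch of using the MPU as such a debug aid might look like this. It assumes the CMSIS register and bit names and an example region covering 1 MB of flash at 0x08000000; the header name and region layout are assumptions, not something from the question:

#include "stm32f4xx.h"            /* assumption: CMSIS device header for your part */

void protect_flash_region(void)
{
    __DMB();

    MPU->CTRL = 0;                              /* disable the MPU while configuring       */
    MPU->RNR  = 0;                              /* select region 0                          */
    MPU->RBAR = 0x08000000u;                    /* flash base, must be aligned to the size */
    MPU->RASR = (0x6u << MPU_RASR_AP_Pos)       /* AP = 0b110: read-only, priv + unpriv    */
              | (19u  << MPU_RASR_SIZE_Pos)     /* SIZE = 19 -> 2^(19+1) = 1 MB            */
              | MPU_RASR_C_Msk                  /* normal, cacheable memory                */
              | MPU_RASR_ENABLE_Msk;            /* enable this region                      */
    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk         /* keep the default map elsewhere          */
              | MPU_CTRL_ENABLE_Msk;            /* enable the MPU                          */

    SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;    /* route violations to MemManage_Handler   */

    __DSB();
    __ISB();
}

With something like this in place, the stray write in the question would raise a MemManage fault at the offending store instead of being silently ignored, which is exactly the debug aid described above.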

The answer may depend on the series (STM32F1, STM32F4, STM32L1 etc), as they have somewhat different flash controllers.
I've once made the same mistake on an STM32F429, and investigated a bit, so I can tell what would happen on an STM32F4.
Probably nothing.
The flash is by default protected, in order to be somewhat resilient to those kind of programming errors. In order to modify the flash, one has to write certain values to the FLASH->KEYR register. If the wrong value is written, then the flash will be locked until reset, so nothing really bad can happen unless the program writes 64 bits of correct values. No unexpected interrupts can happen, because the interrupt enable bit is protected by this key too. The attempt will set some error bits in FLASH->SR, so a program can check it and warn the user (preferably the tester).
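As an illustration, a check of those status bits might look like the sketch below (assuming an STM32F4 CMSIS device header; the flag names are the standard CMSIS ones, and the function name is just an example):

#include "stm32f4xx.h"   /* assumption: CMSIS device header */

/* Returns non-zero if a (possibly stray) programming attempt has tripped the
 * error flags; clears them so later legitimate programming is not blocked. */
int flash_check_and_clear_errors(void)
{
    uint32_t errs = FLASH->SR & (FLASH_SR_PGSERR | FLASH_SR_PGPERR |
                                 FLASH_SR_PGAERR | FLASH_SR_WRPERR);
    if (errs != 0u) {
        FLASH->SR = errs;   /* these flags are cleared by writing 1 to them */
        return 1;           /* warn the user (preferably the tester)        */
    }
    return 0;
}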
However if there is some code there (e.g. a bootloader, or logging something into flash) that is supposed to write something in the flash, i.e. it unlocks the flash with the correct keys, then bad things can happen.
If the flash is left unlocked after a preceding write operation, then writing to a previously programmed area will change bits from 1 to 0, but not from 0 to 1. It means that the flash will contain the bitwise AND of the old and the newly written value.
If the failed write attempt occurs first, and unlocked afterwards, then no legitimate write or erase operation would succeed unless the status bits are properly cleared first.
If the intended and unintended accesses occur interleaved, e.g. in interrupt handlers, then all bets are off.
Even if the values are in immutable flash memory, there can still be unexpected results. Consider this code:
int foo(int *array) {
    array[0] = 1;
    array[1] = 3;
    array[2] = 5;
    return array[0];
}
An optimizing compiler might recognize that the return value should always be 1, and emit code to that effect. Or it might not, and reload array[0] from wherever it is stored, possibly a different value from flash. It may behave differently in debug and release builds, or when the function is called from different places, as it might be inlined differently.
If the pointer points to an unmapped area, neither RAM nor flash nor some memory-mapped register, then a fault will occur, and as the default fault handlers contain just an infinite loop, the program will hang unless it has a fault handler installed that can deal with the situation. Needless to say, overwriting random RAM areas or registers can result in barely predictable behaviour.
UPDATE
I've tried your code on actual hardware. When I ran it verbatim, the compiler (gcc-arm-none-eabi-7-2018-q2-update -O3 -lto) optimized away everything, since the variables were not used afterwards. Marking a, b, c, d as volatile resulted in c=2 and d=12; the compiler still treated the first array as const, and no accesses to the arrays were generated. constarray did not show up in the map file at all; the linker had eliminated it completely.
So I've tried a few things one at a time to force the optimizer to generate code that would actually access the arrays.
Disabling optimization (-O0)
Making all variables volatile
Inserting a couple of compile-time memory barriers (asm volatile("":::"memory");)
Doing some complex calculations in the middle
Each of these produced varying effects on different MCUs, but the results were always consistent on any single platform.
STM32F103: Hard Fault. Only halfword (16 bit) write accesses are allowed to the flash, 8 or 32 bits always result in a fault. When I've changed the data types to short, the code ran, of course without any effect on the flash.
STM32F417: Code runs, with no effects on the flash contents, but bits 6 and 7, PGPERR and PGSERR in FLASH->SR were set a few cycles after the first write attempt to constarray.
STM32L151: Code runs, with no effects on the flash controller status.


Setting registers using embedded rust

So... I've been following the embedded Rust book... and I'm currently reading about registers. Now, the book does suggest that I use an STM32F303VC Discovery to avoid issues, but I couldn't find one, so I got a Nucleo F303RE instead. The cargo targets and related setup remain the same, so I thought there wouldn't be any issues.
So, the MCU that I'm using has the LED attached to port A (0x48000000), which has a BSRR at an offset of 0x18. Now, I read in the datasheet that the default value for port A is 0xA8000000, which I don't understand. But when I try setting the LED pin using
ptr::write_volatile(PORTA_BSRR as *mut u32, 1 << 5); Nothing happens. Even my gdb terminal doesn't reflect any changes. So I tried checking with portE as the original tutorial suggests (0x48001018). But even then the register values don't change. I am unable to debug this issue.
Now, I am able to run the previous tutorials and check variables and such. Nothing seems to be wrong with my STM32, as I am able to control it just fine using STM32CubeIDE.
here's the code in case you want to refer to it
EDIT: So, I read @Ikolbly's comment and looked into the RCC_AHBENR register, which I guess is like setting pinMode(pin, HIGH) in Arduino: it turns the port on.
I've modified the code to set that bit, but there seems to be no change. I'm guessing the auxiliary code already did it for portE which is why I didn't have to do any initialization for that... but even changing the register values for portE did not work.
//#![deny(unsafe_code)]
#![no_main]
#![no_std]

use aux5::entry;
use core::ptr;

#[entry]
fn main() -> ! {
    const RCC_AHBENR: u32 = 0x48000014;
    const PORTA_BSRR: u32 = 0x48000018;

    let _y;
    let x = 42;
    _y = x;

    unsafe {
        // EDIT: enabling port A
        ptr::write_volatile(RCC_AHBENR as *mut u32, 1 << 17);
        // Toggling pin PA5
        ptr::write_volatile(PORTA_BSRR as *mut u32, 1 << 5);
        // Toggling other pins to see if anything works
        ptr::write_volatile(PORTA_BSRR as *mut u32, 1 << 6);
        ptr::write_volatile(PORTA_BSRR as *mut u32, 1 << 7);
        ptr::write_volatile(PORTA_BSRR as *mut u32, 1 << 8);
    }

    // infinite loop; just so we don't leave this stack frame
    loop {}
}
Well over 99% of bare metal is reading.
So you figured out from the Nucleo datasheet that D13 is LD2, which on the F303 variant is PA5: port A, pin 5.
In the reference manual for the STM32F303...
The base addresses are:
RCC: 0x40021000
GPIOA: 0x48000000
It appears from the comments that you had the RCC base wrong.
RCC_AHBENR is at the RCC base + 0x14. Bit 17 is IOPAEN (I/O port A clock enable). The reset value is 0x00000014, which leaves the flash and SRAM clocks enabled out of reset; that is a pretty good thing to have enabled.
Now the minimum you need to do with these ST parts to blink an LED is: enable the GPIO port (in this case, bit 17 of RCC_AHBENR needs to be set), then set the GPIOA_MODER register to make the pin an output. You do not need to mess with the pull-up/down or speed registers, and the output type register is already set for push-pull output on reset. So enable the port, make the pin a push-pull output, then use BSRR to blink.
GPIOA_MODER resets to 0xA8000000, making pins 15, 14 and 13 alternate function after reset; you will find that these are the JTAG/SWD pins, which you can use with an SWD debugger (one comes built into the Nucleo board). It is easier to just copy the binary file over to the virtual thumb drive than to use SWD directly (the debug MCU then uses SWD to write the target MCU and reset it).
And as pointed out above, RCC_AHBENR has SRAM and flash enabled at reset.
As a general rule (there are exceptions) you want to do read-modify-writes. Simply writing 1<<17 to RCC_AHBENR will enable port A, but you will also have just disabled SRAM and flash during sleep mode. If you are not going into a sleep mode you are technically okay, but you really should do:
x = read RCC_AHBENR
x|= 1<<17
write RCC_AHBENR = x
Which can be done with an or-equals; as a habit I would recommend against it, but for hand optimization of this register it is fine. I am not a Rust expert yet (some day), so I do not know the idiomatic ways to do this in Rust. I think you should have tried asm or C first and succeeded, then taken that knowledge to Rust.
For the MODER register the reason for the read-modify-write becomes more apparent, because the field is more than one bit.
x = read GPIOA_MODER
x&=(~(3<<10))
x|=1<<10
write GPIOA_MODER = x
would be a proper generic way to do this. Now, looking at the documentation and the reset value, you could technically just write 0xA8000400 in a single write, or you could do a read-modify-write
x = read
x |= 1<<10
write = x
without clearing bit 11.
Now some STM32 parts document this and some do not; most people get lucky because they use a canned tool or have a read-modify-write habit, and the logic has a strangeness to it with the MODER register in particular (likely there to make their code work).
If you were to prepare registers up front and do back to back writes
ldr r0,=0x40021014
ldr r1,=0x00020014
ldr r2,=0x48000000
ldr r3,=0xA8000400
str r1,[r0]
str r3,[r2]
assuming I have my bits and addresses right (which is not the point here), the back-to-back writes DO NOT WORK on all STM32 parts; you have a race condition, because the write that enables the I/O port clock needs a delay before you talk to the I/O port. Now even on those chips, this
ldr r0,=0x40021014
ldr r1,=0x00020014
ldr r2,=0x48000000
str r1,[r0]
ldr r3,[r2]
modify
write
does work, and I am sure that is not dumb luck (even though it has the same race condition).
And of course this can get worse if you adjust the clocks from reset speeds to faster values.
If you do back-to-back writes in a high-level language, there is no guarantee it will generate the above; it may generate
ldr r0,=0x40021014
ldr r1,=0x00020014
ldr r2,=0x48000000
ldr r3,=0xA8000400
str r1,[r0]
one or both of the r2/r3 inserted here
str r3,[r2]
And at least at reset clock speeds, on the STM32 parts I can break, that works. But I have seen one compiler with one set of command-line options generate code that broke, while others, or other options, did not; they happened to insert an instruction in between just due to the nature of the code generation (making the same C code both work and not work based on dumb luck).
So I recommend you do not do the writes back to back, and/or that you examine the output of the compiler (which you should be doing on any new project, especially something like your first bare-metal Rust program, in a language where generating bare-metal code like this is still uncommon; quite doable, but the number of, say, C folks doing this is roughly one in a million, and for Rust it is an even smaller fraction of the planet).
You should be examining the vector table, placement in the binary, things like the loads and stores with their addresses and data, and so on. Likewise, when converting from whatever binary format to the raw binary needed to copy to the Nucleo board, examine that binary as well to see that it starts with the vector table and that any padding needed is in place.
So....
Fix your RCC register address, and I would fix the data as well.
Add a read or a delay after the RCC register write; you can do a simple throwaway read of the GPIO MODER if you want, or better, do a read-modify-write.
Write MODER with bits 11:10 as 0b01 to make PA5 a general-purpose output.
THEN you can mess with bit 5 and bit 16+5 in the BSRR.
Lastly, from the Nucleo documentation you do not even have to look at the schematic. They nicely document under LD2 that
when the I/O is HIGH, the LED is on.
So you want to SET PA5 to turn the LED on and RESET it to turn it off.
Write 1<<(0+5) to GPIOA_BSRR in one version of the code, then write 1<<(16+5) in another. The first should leave the LED on, the second should leave it off. (THEN mess with a delay and try to blink it.)
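Putting those steps together, here is a hedged C sketch (not Rust) using the addresses quoted above; it is a sketch of the sequence described, not necessarily how you would structure real code:

#include <stdint.h>

#define RCC_AHBENR   (*(volatile uint32_t *)0x40021014u)   /* RCC base 0x40021000 + 0x14 */
#define GPIOA_MODER  (*(volatile uint32_t *)0x48000000u)   /* GPIOA base + 0x00          */
#define GPIOA_BSRR   (*(volatile uint32_t *)0x48000018u)   /* GPIOA base + 0x18          */

void led_on_pa5(void)
{
    uint32_t x;

    x  = RCC_AHBENR;
    x |= 1u << 17;                /* IOPAEN: enable the GPIOA clock                    */
    RCC_AHBENR = x;

    x  = GPIOA_MODER;             /* this read also spaces out the back-to-back access */
    x &= ~(3u << 10);             /* clear MODER5                                      */
    x |=  (1u << 10);             /* MODER5 = 0b01: general-purpose output             */
    GPIOA_MODER = x;

    GPIOA_BSRR = 1u << 5;         /* set PA5: LD2 on, since the LED is active high     */
    /* GPIOA_BSRR = 1u << (16 + 5);   would reset PA5 and turn the LED off */
}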
The infinite loop at the end is only needed if the bootstrap or pre-main code messes things up on a return; if you wrote your own bootstrap, you could simply have it do an infinite loop on return from main (that is how I prefer it, sometimes a wfi/wfe loop; some libraries, though, will hose things on return, hence this habit). Since you should be very aware of the boot process and code for any bare-metal project (vendor sandbox or your own), you should know the answer before writing main and know what the requirements are. And maybe you do in this case, and just have not shown it here.

Is it possible to {activate|de-activate} SIGFPE generation only for some specific C++11 code segments?

I am writing a path tracer in C++11, on Linux, for the numerical simulation of light transport and I am using
#include <fenv.h>
...
feenableexcept( FE_INVALID   |
                FE_DIVBYZERO |
                FE_OVERFLOW  |
                FE_UNDERFLOW );
in order to catch and debug any numerical errors that may eventually occur during execution.
At some point in the code I have to compute the intersection of rays (line segments) against axis-aligned bounding boxes (AABBs). For this computation I am using a very optimized and robust ray-box intersection algorithm which relies on the generation of some special values (e.g. NaN and inf) described in the IEEE 754 standard. Obviously, I am not interested in catching floating point exceptions generated specifically by this ray-box intersection routine.
Thus, my questions are:
Is it possible to deactivate the generation of floating point exception signals (SIGFPE) for only some sections of the code (i.e. for the ray-box intersection code section)?
When we are calculating simulations we are very concerned about performance. In the case that it is possible to suppress exception signals only for specific code sections, can this be done at compile time (i.e. instrumenting/de-instrumenting code during its generation, such that we could avoid expensive function calls)?
Thank you for any help!
UPDATE
It is possible to instrument/de-instrument code through the use of the feenableexcept and fedisableexcept function calls (actually, I posted this question because I was not aware of fedisableexcept, only feenableexcept... shame on me!). For instance:
#define _GNU_SOURCE   // feenableexcept/fedisableexcept are GNU extensions
#include <fenv.h>

int main() {
    float a = 1.0f;

    fedisableexcept(FE_DIVBYZERO); // disable div-by-zero trapping
    // generates an inf that **won't be** caught
    float c = a / 0.0f;

    feenableexcept(FE_DIVBYZERO);  // re-enable div-by-zero trapping
    // this division by zero **will be** caught (raises SIGFPE)
    float d = a / 0.0f;

    return 0;
}
Standard C++ does not provide any way to mark code at compile-time as to whether it should run with floating-point trapping enabled or disabled. In fact, support for manipulating the floating-point environment is not required by the standard, so whether an implementation has it at all is implementation-dependent. Any answer beyond standard C++ depends on the particular hardware and software you are using, but you have not reported that information.
On typical processors, enabling and disabling floating-point trapping is achieved by changing a processor control register. You do not need a function call to do this, but it is not the function call that is expensive, as you suggest in your question. The actual instruction may consume time as it may require the processor to serialize instruction execution. (Modern processors may have hundreds of instructions executing at the same time—some being decoded, some waiting for a subunit within the processor, some in various stages of calculation, some waiting to write their results to general registers, and so on. When changing a control register, the processor may have to wait for all currently executing instructions to finish, then change the register, then start executing new instructions.) If your hardware behaves this way, there is no way to get around it. (With such hardware, which is common, it is not possible to compile code to run with or without trapping without actually executing the run-time instruction to change the control register.)
You might be able to mitigate the time cost by batching path-tracing calculations, so they are performed in groups with only two changes to the floating-point control register (one to turn traps off, one to turn them on) for the entire group.
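As a rough sketch of that batching idea (intersect_ray_aabb is a hypothetical stand-in for the ray-box routine, and the six-float layout is just an assumption for the example):

#define _GNU_SOURCE            /* feenableexcept/fedisableexcept are GNU extensions */
#include <fenv.h>
#include <stddef.h>

extern int intersect_ray_aabb(const float *ray, const float *box);  /* hypothetical */

void intersect_batch(const float *rays, const float *boxes, int *hits, size_t n)
{
    /* one control-register change for the whole batch */
    fedisableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);

    for (size_t i = 0; i < n; ++i)
        hits[i] = intersect_ray_aabb(&rays[6 * i], &boxes[6 * i]);

    /* drop any flags the batch raised on purpose, then one change to re-arm the traps */
    feclearexcept(FE_ALL_EXCEPT);
    feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);
}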

What does the IS_ALIGNED macro in the linux kernel do?

I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;

if (!IS_ALIGNED(addr, 1 << 9)) {
    DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
    BUG();
}
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned along the size of a datatype to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts 1, which gives 111111111 in binary. Then that value is bitwise-ANDed with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, !IS_ALIGNED(addr, 1 << 9) checks whether addr is on a 512-byte (2^9) boundary (the last 9 bits are zero). This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this I ran into: I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (the destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem is, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that had nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?", if you subtract 1 from a number that ends with all zeroes, you will get a number that ends with all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. This mask has ones only in the bits we are interested in checking for zero-value. After we AND the address with the mask containing ones in the lowest-order bits we get a 0 if any only if the lowest 9 (in this case) bits are zero.
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process then reading it, and typically it requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with a one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and tradeoff game (like anything else in engineering) and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference): https://graphics.stanford.edu/~seander/bithacks.html

independent searches on GPU -- how to synchronize its finish?

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.
I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs), until at least 100 trues are found. Then the estimate for p is 100/n, where n is the total number of times that generateRandomNumbersAndTestThem() was executed.
For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I'm looking to do this on GPUs. But I'm struggling a bit how to implement the stop condition properly. My idea was to have something along these lines as the kernel:
__kernel void sampleKernel(all_the_input, __global unsigned long *totAttempts) {
    int gid = get_global_id(0);
    // here: code that localizes all_the_input for faster access
    while (lessThan100truesFound) {
        totAttempts[gid]++;
        if (generateRandomNumbersAndTestThem())
            reportTrue();
    }
}
How should I implement this without severe performance loss, given that
triggering of the "if" will be a very rare event and so it is not a problem if all threads have to wait while reportTrue() is executed
lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don't even know if a boolean is the right way)
the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.
I assume that something can be done similar to Java's "synchronized" modifier, but I fail to find the exact way to do it. What is the "right" way to do this, i.e. any way that works without severe performance loss?
I'd suggest not using a global flag to stop the kernel, but rather running the kernel for a certain number of attempts, checking on the host whether you have accumulated enough 'successes', and repeating if necessary. Using a loop of undefined length in a kernel is bad, since the GPU driver could be killed by the watchdog timer. Besides, checking some global variable at each iteration would certainly hurt kernel performance.
This way, reportTrue can be implemented as an atomic_inc on a counter residing in global memory.
__kernel void sampleKernel(all_the_input, __global unsigned long *successes) {
    int gid = get_global_id(0);
    // here: code that localizes all_the_input for faster access
    for (int i = 0; i < ATT_PER_THREAD; ++i) {
        if (generateRandomNumbersAndTestThem())
            atomic_inc(successes);
    }
}
ATT_PER_THREAD is to be adjusted depending on how long it takes to execute generateRandomNumbersAndTestThem(). Kernel launch overhead is pretty small, so there is usually no need to make a single kernel run for more than about 0.1 to 1 second.
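To make the control flow concrete, here is a plain-C, CPU-only analogue of that host-side loop; run_batch() is a hypothetical stand-in for one kernel launch (enqueue, wait, read back the counter), simulated with rand() so the sketch runs on its own:

#include <stdio.h>
#include <stdlib.h>

#define ATTEMPTS_PER_LAUNCH 1000000UL

/* stand-in for: enqueue sampleKernel, wait, read back 'successes' */
static unsigned long run_batch(double p)
{
    unsigned long hits = 0;
    for (unsigned long i = 0; i < ATTEMPTS_PER_LAUNCH; ++i)
        if ((double)rand() / RAND_MAX < p)
            ++hits;
    return hits;
}

int main(void)
{
    const double p_true = 0.000001;          /* unknown in the real program        */
    unsigned long successes = 0, total = 0;

    while (successes < 100) {                /* stop condition checked on the host */
        successes += run_batch(p_true);
        total     += ATTEMPTS_PER_LAUNCH;
    }
    printf("estimate: %g after %lu attempts\n", (double)successes / total, total);
    return 0;
}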

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programming)?
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
    int val;
    int *pval;

    do
    {
        // fetch the pointer to the value
        pval = *ptr;

        // if it's NULL, then just return NULL, the smart pointer
        // will then become NULL as well
        if(pval == NULL)
            return NULL;

        // Grab the reference count
        val = lwarx(pval);

        // make sure the pointer we grabbed the value from
        // is still the same one referred to by 'ptr'
        if(pval != *ptr)
            continue;

        // Increment the reference count via 'stwcx'; if any other thread
        // has done anything that could potentially break this, it should
        // fail and try again
    } while(!stwcx(pval, val + 1));

    return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate them by using x86's double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is that one of the two words contains the value of interest and the other contains an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping between attempts is so low that it is a reasonable substitute for most purposes.
x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with the lock prefix to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
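For comparison, here is a hedged C-level sketch of the questioner's loop written with the GCC/Clang __atomic builtins, which compile down to lock cmpxchg on x86. This is still a plain CAS, so the ABA caveat described above applies; it approximates LWARX/STWCX rather than replacing it:

#include <stddef.h>

int *IncrementAndRetrieve(int **ptr)
{
    for (;;)
    {
        int *pval = __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
        if (pval == NULL)
            return NULL;

        int val = __atomic_load_n(pval, __ATOMIC_RELAXED);

        /* re-check that the slot still refers to the same object */
        if (pval != __atomic_load_n(ptr, __ATOMIC_ACQUIRE))
            continue;

        /* lock cmpxchg under the hood; retries if the count changed meanwhile */
        if (__atomic_compare_exchange_n(pval, &val, val + 1, 0,
                                        __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
            return pval;
    }
}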
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If you have word-aligned pointers, the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
    int val;
    int *unpacked;
    do
    {
        val = *ptr;
        unpacked = unpack(val);
        if(unpacked == NULL)
            return NULL;
        // pointer is on the bottom
    } while(!cas(unpacked, val, val + 1));
    return unpacked;
}
I don't know if LWARX and STWCX invalidate the whole cache line; CAS and DCAS do. That means that unless you are willing to throw away a lot of memory (64 bytes for each independently "lockable" pointer) you won't see much improvement if you are really pushing your software under stress. The best results I've seen so far were when people consciously sacrificed 64 bytes, planned their structures around it (packing stuff that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. Cache-line invalidation can cost approximately 20 to 100 cycles, making it a bigger real performance issue than just lock avoidance.
Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing": one request "leaks" and then releases all of its memory in bulk at the end) or detailed allocation management, so that one structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.
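As a small illustration of the 64-byte advice (a sketch only; 64 bytes is the usual x86 cache-line size, which is an assumption about the target):

#include <stdalign.h>
#include <stdint.h>

/* One contended reference count per cache line, so CAS traffic on one counter
 * does not invalidate the line holding its neighbours (no false sharing). */
struct padded_refcount {
    alignas(64) volatile int64_t count;
    char pad[64 - sizeof(int64_t)];
};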
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.
