x86 equivalent for LWARX and STWCX

x86 equivalent for LWARX and STWCX - multithreading

I'm looking for an equivalent of LWARX and STWCX (as found on the PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/web sites/forums for lock/wait-free programing).
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if its NULL, then just return NULL, the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've so far found for the x86).
Thanks

As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate it by using the x86s double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural sized' words at once, instead of just the usual one. The basic idea is one of the two words contains the value of interest, and the other one contains an always incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter to wrap between attempts is so low that it's a reasonable substitute for most purposes.

x86 does not directly support "optimistic concurrency" like PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by intrinsically asserting the LOCK prefix, whether the assembly code programmer has actually coded it or not). It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).

You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with a lock instruction to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]
mov edx,dword ptr [esp+8]
mov eax,dword ptr [esp+12]
lock cmpxchg dword ptr [ecx],edx
ret 12
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.

if you are on 64 bits and limit yourself to say 1tb of heap, you can pack the counter into the 24 unused top bits. if you have word aligned pointers the bottom 5 bits are also available.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *unpacked;
do
{
val = *ptr;
unpacked = unpack(val);
if(unpacked == NULL)
return NULL;
// pointer is on the bottom
} while(!cas(unpacked, val, val + 1));
return unpacked;
}

Don't know if LWARX and STWCX invalidate the whole cache line, CAS and DCAS do. Meaning that unless you are willing to throw away a lot of memory (64 bytes for each independent "lockable" pointer) you won't see much improvement if you are really pushing your software into stress. The best results I've seen so far were when people consciously casrificed 64b, planed their structures around it (packing stuff that won't be subject of contention), kept everything alligned on 64b boundaries, and used explicit read and write data barriers. Cache line invalidation can cost approx 20 to 100 cycles, making it a bigger real perf issue then just lock avoidance.
Also, you'd have to plan different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all it's memory bulk at the end) or datailed allocation management so that one structure under contention never receives memory realesed by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive but it's either that or paying the price for GC.

What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (Win32 function; assembly: XADD).
The reason that your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and stwcx without invalidating the stwcx.

Related

Results of doing += on a double from multiple threads

Consider the following code:
void add(double& a, double b) {
a += b;
}
which according to godbolt compiles on a Skylake to:
add(double&, double):
vaddsd xmm0, xmm0, QWORD PTR [rdi]
vmovsd QWORD PTR [rdi], xmm0
ret
If I call add(a, 1.23) and add(a, 2.34) from different threads (for the same variable a), will a definitely end up as either a+1.23, a+2.34, or a+1.23+2.34?
That is, will one of these results definitely happen given this assembly, and a will not end up in some other state?

Here is a relevant questions to me:
Does the CPU fetch the word you are dealing with in a single operation?
Some processor might allow memory access to a variable that happens to be not aligned in memory by doing two fetches one after the other - non atomically of course.
In that case, problems would arise if another thread interjects writing on that area of memory while the first thread had fetched already the first part of the word and then fetches the second part when the other thread has already modified the word.
thread 1 fetches first part of a XXXX
thread 1 fetches second part of a YYYY
thread 2 fetches first part of a XXXX
thread 1 increments double represented as XXXXYYYY that becomes ZZZZWWWW by adding b
thread 1 writes back in memory ZZZZ
thread 1 writes back in memory WWWW
thread 2 fetches second part of a that is now WWWW
thread 2 increments double represented as XXXXWWWW that becomes VVVVPPPP by adding b
thread 2 writes back in memory VVVV
thread 2 writes back in memory PPPP
For keeping it compact I used one character to represent 8 bits.
Now XXXXWWWW and VVVVPPPP are going to be representation of total different floating point values than the one you would have expected. That is because you ended up mixing two parts of two different binary representation (IEEE-754) of double variables.
Said that, I know that in certain ARM based architectures data access are not allowed (that would cause a trap to be generated), but I suspect that Intel processors do allow that instead.
Therefore, if your variable a is aligned, your result can be any of
a+1.23, a+2.34, a+1.23+2.34
if your variable might be mis-aligned (i.e. has got an address that is not a multiple of 8) your result can be any of
a+1.23, a+2.34, a+1.23+2.34 or a rubbish value
As a further note, please bear in mind that even if your environment alignof(double) == 8 that is not necessarily enough to conclude you are not going to have misalignment issues. All depends from where your particular variable comes from. Consider the following (or run it here):
#pragma push()
#pragma pack(1)
struct Packet
{
unsigned char val1;
unsigned char val2;
double val3;
unsigned char val4;
unsigned char val5;
};
#pragma pop()
int main()
{
static_assert(alignof(double) == 8);
double d;
add(d,1.23); // your a parameter is aligned
Packet p;
add(p.val3,1.23); // your a parameter is now NOT aligned
return 0;
}
Therefore asserting alignof() doesn't necessarily guarantee your variable is aligned. If your variable is not involved in any packing then you should be OK.
Please allow me just a disclaimer for whoever else is reading this answer: using std::atomic<double> in these situations is the best compromise in term of implementation effort and performance to achieve thread safety. There are CPUs architectures that have special efficient instructions for dealing with atomic variables without injecting heavy fences. That might end up satisfying your performance requirements already.

Forcing MSVC to generate FIST instructions with the /QIfist option

I'm using the /QIfist compiler switch regularly, which causes the compiler to generate FISTP instructions to round floating point values to integers, instead of calling the _ftol helper function.
How can I make it use FIST(P) DWORD, instead of QWORD?
FIST QWORD requires the CPU to store the result on stack, then read stack into register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.

FIST QWORD requires the CPU to store the result on stack, then read stack into register and finally store to destination memory, while FIST DWORD just stores directly into destination memory.
I don't understand what you are trying to say here.
The FIST and FISTP instructions differ from each other in exactly two ways:
FISTP pops the top value off of the floating point stack, while FIST does not. This is the obvious difference, and is reflected in the opcode naming: FISTP has that P suffix, which means "pop", just like ADDP, etc.
FISTP has an additional encoding that works with 64-bit (QWORD) operands. That means you can use FISTP to convert a floating point value to a 64-bit integer. FIST, on the other hand, maxes out at 32-bit (DWORD) operands.
(I don't think there's a technical reason for this. I certainly can't imagine it is related to the popping behavior. I assume that when the Intel engineers added support for 64-bit operands some time later, they figured there was no reason for a non-popping version. They were probably running out of opcode encodings.)
There are lots of online references for the x86 instruction set. For example, this site is the top hit for most Google searches. Or you can look in Intel's manuals (FIST/FISTP are on p. 365).
Where the two instructions read the value from, and where they store it to, is exactly the same. Both read the value from the top of the floating point stack, and both store the result to memory.
There would be absolutely no advantage to the compiler using FIST instead of FISTP. Remember that you always have to pop all values off of the floating point stack when exiting from a function, so if FIST is used, you'd have to follow it by an additional FSTP instruction. That might not be any slower, but it would needlessly inflate the code.
Besides, there's another reason that the compiler prefers FISTP: the support for 64-bit operands. It allows the code generator to be identical, regardless of what size integer you're rounding to.
The only time you might prefer to use FIST is if you're hand-writing assembly code and want to re-use the floating point value on the stack after rounding it. The compiler doesn't need to do that.
So anyway, all of that to say that the answer to your question is no. The compiler can't be made to generate FIST instructions automatically. If you're still insistent, you can write inline assembly that uses whatever instructions you want:
int32 RoundToNearestEven(float value)
{
int32 result;
__asm
{
fld DWORD PTR value
fist DWORD PTR result
// do something with the value on the floating point stack...
//
// ... but be sure to pop it off before returning
fstp st(0)
}
return result;
}

instruction set emulator guide

I am interested in writing emulators like for gameboy and other handheld consoles, but I read the first step is to emulate the instruction set. I found a link here that said for beginners to emulate the Commodore 64 8-bit microprocessor, the thing is I don't know a thing about emulating instruction sets. I know mips instruction set, so I think I can manage understanding other instruction sets, but the problem is what is it meant by emulating them?
NOTE: If someone can provide me with a step-by-step guide to instruction set emulation for beginners, I would really appreciate it.
NOTE #2: I am planning to write in C.
NOTE #3: This is my first attempt at learning the whole emulation thing.
Thanks
EDIT: I found this site that is a detailed step-by-step guide to writing an emulator which seems promising. I'll start reading it, and hope it helps other people who are looking into writing emulators too.
Emulator 101

An instruction set emulator is a software program that reads binary data from a software device and carries out the instructions that data contains as if it were a physical microprocessor accessing physical data.
The Commodore 64 used a 6502 Microprocessor. I wrote an emulator for this processor once. The first thing you need to do is read the datasheets on the processor and learn about its behavior. What sort of opcodes does it have, what about memory addressing, method of IO. What are its registers? How does it start executing? These are all questions you need to be able to answer before you can write an emulator.
Here is a general overview of how it would look like in C (Not 100% accurate):
uint8_t RAM[65536]; //Declare a memory buffer for emulated RAM (64k)
uint16_t A; //Declare Accumulator
uint16_t X; //Declare X register
uint16_t Y; //Declare Y register
uint16_t PC = 0; //Declare Program counter, start executing at address 0
uint16_t FLAGS = 0 //Start with all flags cleared;
//Return 1 if the carry flag is set 0 otherwise, in this example, the 3rd bit is
//the carry flag (not true for actual 6502)
#define CARRY_FLAG(flags) ((0x4 & flags) >> 2)
#define ADC 0x69
#define LDA 0xA9
while (executing) {
switch(RAM[PC]) { //Grab the opcode at the program counter
case ADC: //Add with carry
A = X + RAM[PC+1] + CARRY_FLAG(FLAGS);
UpdateFlags(A);
PC += ADC_SIZE;
break;
case LDA: //Load accumulator
A = RAM[PC+1];
UpdateFlags(X);
PC += MOV_SIZE;
break;
default:
//Invalid opcode!
}
}
According to this reference ADC actually has 8 opcodes in the 6502 processor, which means you will have 8 different ADC in your switch statement, each one for different opcodes and memory addressing schemes. You will have to deal with endianess and byte order, and of course pointers. I would get a solid understanding of pointer and type casting in C if you dont already have one. To manipulate the flags register you have to have a solid understanding of bitwise operations in C. If you are clever you can make use of C macros and even function pointers to save yourself some work, as the CARRY_FLAG example above.
Every time you execute an instruction, you must advance the program counter by the size of that instruction, which is different for each opcode. Some opcodes dont take any arguments and so their size is just 1 byte, while others take 16-bit integers as in my MOV example above. All this should be pretty well documented.
Branch instructions (JMP, JE, JNE etc) are simple: If some flag is set in the flags register then load the PC to the address specified. This is how "decisions" are made in a microprocessor and emulating them is simply a matter of changing the PC, just as the real microprocessor would do.
The hardest part about writing an instruction set emulator is debugging. How do you know if everything is working like it should? There are plenty of resources for helping you. People have written test codes that will help you debug every instruction. You can execute them one instruction at a time and compare the reference output. If something is different, you know you have a bug somewhere and can fix it.
This should be enough to get you started. The important thing is that you have A) A good solid understanding of the instruction set you want to emulate and B) a solid understanding of low level data manipulation in C, including type casting, pointers, bitwise operations, byte order, etc.

Writing a custom malloc which stores informations in the pointer

I have recently been reading about a family of automatic memory management techniques that rely on storing information in the pointer returned by the allocator, i.e. few bits of header e.g. to differentiate between pointers or to store thread-related information (note that I'm not talking about limited-field reference counting here, only immutable information).
I'd like to toy with these techniques. Now, to implement them, I need to be able to return pointers with a specific shape from my allocator. I suppose I could play with the least weight bits but this would require padding that looks extremely memory consuming, so I believe that I should play with the heaviest bits. However, I have no good idea on how to do this. Is there a way for me to, call malloc or malloc_create_zone or some related function and request a pointer that always starts with the given bits?
Thanks everyone!

The amount of information you can actually store in a pointer is pretty limited (typically one or two bits per pointer). And every attempt to dereference the pointer has to first mask out the magic information. The technique is often called tagging, BTW.
#define TAG_MASK 0x3
#define CONS_TAG 0x1
#define STRING_TAG 0x2
#define NUMBER_TAG 0x3
typedef uintptr_t value_t;
typedef struct cons {
value_t car;
value_t cdr;
} cons_t;
value_t
create_cons(value_t t1, value_t t2)
{
cons_t* pair = malloc(sizeof(cons_t));
value_t addr = (value_t)pair;
pair->car = t1;
pair->cdr = t2;
return addr | CONS_TAG;
}
value_t
car_of_cons(value_t v)
{
if ((v % TAG_MASK) != CONS_TAG) error("wrong type of argument");
return ((cons_t*) (v & ~TAG_MASK))->car;
}
One advantage of this technique is, that you can directly infer the type of the object from the pointer itself. You don't need to dereference it (say, in order to read a special type field or similar). Many language implementations using this scheme also have a special tag combination for "immediate" numbers and other small values, which can be represented direcly using the "pointer".
The disadvatage is, that the amount of information, which can be stored, is pretty limited. Also, as the example code shows, you have to be aware of the tagging in every access to the object, and need to "untag" the pointer before actually using it.
The use of the least significant bits for tagging stemms from the observation, that on most platforms, all pointer to malloced memory is actually aligned on a non-byte boundary (usually 8 bytes), so the least significant bits are always zero.

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole proccesor(with wbinvd()). Program runs as root but I'd prefer to stay in user space if possible.

Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>
inline void
clflush(volatile void *p)
{
asm volatile ("clflush (%0)" :: "r"(p));
}
inline uint64_t
rdtsc()
{
unsigned long a, d;
asm volatile ("rdtsc" : "=a" (a), "=d" (d));
return a | ((uint64_t)d << 32);
}
volatile int i;
inline void
test()
{
uint64_t start, end;
volatile int j;
start = rdtsc();
j = i;
end = rdtsc();
printf("took %lu ticks\n", end - start);
}
int
main(int ac, char **av)
{
test();
test();
printf("flush: ");
clflush(&i);
test();
test();
return 0;
}

I dont know of any generic command to get the the cache state, but there are ways:
I guess this is the easiest: If you got your kernel module, just disassemble it and look for cache invalidation / flushing commands (atm. just 3 came to my mind: WBINDVD, CLFLUSH, INVD).
You just said it is for i386, but I guess you dont mean a 80386. The problem is that there are many different with different extension and features. E.g. the newest Intel series has some performance/profiling registers for the cache system included, which you can use to evalute cache misses/hits/number of transfers and similar.
Similar to 2, very depending on the system you got. But when you have a multiprocessor configuration you could watch the first cache coherence protocol (MESI) with the 2nd.
You mentioned WBINVD - afaik that will always flush complete, i.e. all, cache lines

It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string