Atomic Compare And Swap with struct in Go

I am trying to create a non-blocking queue package for concurrent applications using the algorithm by Maged M. Michael and Michael L. Scott as described here.
This requires the use of atomic CompareAndSwap which is offered by the "sync/atomic" package.
I am however not sure what the Go-equivalent to the following pseudocode would be:
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)
where tail and next are of type:
type pointer_t struct {
    ptr   *node_t
    count uint
}
and node is of type:
type node_t struct {
    value interface{}
    next  pointer_t
}
If I understood it correctly, it seems that I need to do a CAS with a struct (both a pointer and a uint). Is this even possible with the atomic-package?
Thanks for help!

If I understood it correctly, it seems that I need to do a CAS with a struct (both a pointer and a uint). Is this even possible with the atomic-package?
No, that is not possible. Most architectures only support atomic operations on a single word. Many academic papers, however, use more powerful CAS primitives (e.g. a double-word compare-and-swap) that are generally not available. Luckily, there are a few tricks that are commonly used in such situations:
You could, for example, steal a couple of bits from the pointer (especially on 64-bit systems) and use them to encode your counter. Then you could simply use Go's CompareAndSwapPointer, but you need to mask off the relevant bits of the pointer before you try to dereference it.
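As a rough illustration of this bit-stealing trick, here is a minimal sketch. It assumes the queue stores each link as a single unsafe.Pointer word, that node_t values are at least 8-byte aligned (so the low three bits of a *node_t are always zero), and that "sync/atomic" and "unsafe" are imported. Note that hiding extra bits in a pointer bends Go's unsafe.Pointer rules, so treat this as a sketch only:
const countBits = 3
const countMask = (1 << countBits) - 1

// pack hides a small counter in the alignment bits of a *node_t.
func pack(p *node_t, count uint) unsafe.Pointer {
    return unsafe.Pointer(uintptr(unsafe.Pointer(p)) | uintptr(count&countMask))
}

// unpack recovers the real pointer and the counter; the pointer must be
// masked before it is dereferenced.
func unpack(tagged unsafe.Pointer) (*node_t, uint) {
    raw := uintptr(tagged)
    return (*node_t)(unsafe.Pointer(raw &^ uintptr(countMask))), uint(raw & countMask)
}

// casTagged atomically replaces the tagged pointer stored at addr, bumping the
// counter, roughly in the spirit of line E9 of the paper.
func casTagged(addr *unsafe.Pointer, old unsafe.Pointer, node *node_t, oldCount uint) bool {
    return atomic.CompareAndSwapPointer(addr, old, pack(node, oldCount+1))
}
A three-bit counter is of course much smaller than the counter in the paper, so it only shrinks the ABA window rather than eliminating it; on 64-bit systems you can steal more bits, at the cost of further portability assumptions.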
The other possibility is to work with pointers to your (immutable!) pointer_t struct. Whenever you want to modify an element of your pointer_t struct, you create a copy, modify the copy, and atomically replace the pointer to the struct. This idiom is called COW (copy on write) and works with arbitrarily large structures. If you want to use this technique, you have to change the next attribute to next *pointer_t.
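A minimal sketch of that copy-on-write variant, assuming "sync/atomic" and "unsafe" are imported, might look like this (the next field is declared as unsafe.Pointer rather than *pointer_t here purely to keep the atomic calls simple):
type pointer_t struct { // treated as immutable once published
    ptr   *node_t
    count uint
}

type node_t struct {
    value interface{}
    next  unsafe.Pointer // always holds a *pointer_t, accessed atomically
}

// loadNext atomically reads the current *pointer_t of a node.
func loadNext(n *node_t) *pointer_t {
    return (*pointer_t)(atomic.LoadPointer(&n.next))
}

// casNext installs a freshly allocated pointer_t, but only if n.next still
// refers to the pointer_t we read earlier (old).
func casNext(n *node_t, old *pointer_t, node *node_t, count uint) bool {
    return atomic.CompareAndSwapPointer(
        &n.next,
        unsafe.Pointer(old),
        unsafe.Pointer(&pointer_t{ptr: node, count: count}),
    )
}
Every successful CAS publishes a freshly allocated pointer_t and never modifies the old one, which is what keeps concurrent readers safe.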
I have recently written a lock-free list in Go for educational reasons. You can find the (imho well documented) source here: https://github.com/tux21b/goco/blob/master/list.go
This rather short example uses atomic.CompareAndSwapPointer extensively and also introduces an atomic type for marked pointers (the MarkAndRef struct). This type is very similar to your pointer_t struct (except that it stores a bool+pointer instead of an int+pointer). It's used to ensure that a node has not been marked as deleted while you are trying to insert an element directly after it. Feel free to use this source as a starting point for your own projects.

You can do something like this (this only works on a single pointer-sized word, e.g. if you have changed next to a *pointer_t as suggested above):
if atomic.CompareAndSwapPointer(
    (*unsafe.Pointer)(unsafe.Pointer(&tail.ptr.next)), // address of the next field
    unsafe.Pointer(next),                              // the *pointer_t read earlier
    unsafe.Pointer(&pointer_t{&node, next.count + 1}), // freshly allocated replacement
) {
    // CAS succeeded
}

Related

Need help porting a wait free trie from C to Rust

For some HPC computing I need to port a wait-free trie (but you can think of it as a tree) from C to Rust. The trie is 99% read, 1% write, and is used both in parallel-concurrent scenarios and in concurrent-only scenarios (a single thread with multiple coroutines). The size of the tree is usually between 100 KB and 8 MB.
Basically the tree is like:
pub struct WFTrie {
    nodes: Vec<Node>,
    leaves: Vec<Leaf>,
    updates: Vec<Update>,
    update_pos: AtomicUsize,
    update_cap: usize,
}
//....
let mut wftrie = WFTrie::new();
let wftrie_ptr = AtomicPtr::new(&mut wftrie);
//....
As you can see, the trie already uses something very similar to the arena approach that many suggest, by using a Vec and storing indices into it.
When I want to update, I do a fetch_add on update_pos. If it's greater than update_cap I return an error (size exhausted); otherwise I'm sure my coroutine/thread has exclusive access to updates[update_pos % update_cap], where I can put my update.
Every X updates (if update_pos % 8 == 0 { //update tree }) a coroutine will clone the tree, apply all pending updates, and then compare_and_swap the wftrie_ptr.
When I want to read I do an atomic load on wftrie_ptr and I can access the tree, taking care to also consider the pending updates.
My questions are:
If I have multiple coroutines having an immutable reference, how can I have one coroutine doing updates? What's the most idiomatic way to translate this to rust?
If a coroutine is still holding a ref to the old tree during an update what happens? Should I replace the AtomicPtr with Arc?
Does this design fit well with the rust borrow checker or am I going to have to use unsafe?
In the case of the concurrent-only scenario, can I drop atomics without using unsafe?
I'm not particularly informed about HPC, but I hope I can give you some useful pointers about programming with concurrency in Rust.
… I'm sure my coroutine/thread has exclusive access to updates[update_pos], where I can put my update
This is necessary, but not sufficient: if you are going to write to data behind an & reference (whether from multiple threads or not), then you are implementing interior mutability, and you must always signal that to the compiler by using some cell type. The minimal at-your-own-risk way to do that is with an UnsafeCell, which provides no synchronization and is unsafe to operate on:
updates: Vec<UnsafeCell<Update>>,
but you can also use a safer type such as a lock (that you know will never experience any contention since you arranged that no other thread is using it). In Rust, all locks and other interior mutability primitives — including atomics — are built on top of UnsafeCell; it is how you tell the compiler “this memory is exempt from the usual rule of N readers xor 1 writer”.
Every X updates … a coroutine will clone the tree …
You will have to arrange so that the clone isn't trying to read the data that is being modified. That is, don't clone the WFTrie, but only the nodes and leaves vectors since that's the data you actually need.
If you were to use UnsafeCell or a lock as I mentioned above, then you'll find that you can't just clone the updates vector, precisely because it wouldn't be sound to do so without explicit locking. If necessary, you can read it step-by-step in some way that agrees with your concurrency requirements.
If I have multiple coroutines having an immutable reference, how can I have one coroutine doing updates? What's the most idiomatic way to translate this to rust?
I'm not aware of a specifically Rust-y way to do this. It sounds like you could manage it with just an AtomicBool, but you probably know more than I do here.
If a coroutine is still holding a ref to the old tree during an update what happens? Should I replace the AtomicPtr with Arc?
Yes, this is a problem. Note that AtomicPtr is unsafe to read, because it does not guarantee the pointed-to thing is alive. You need a way to use something like Arc and atomically swap which specific Arc is in the wftrie_ptr location; the arc-swap library provides that.
Does this design fit well with the rust borrow checker or am I going to have to use unsafe?
From a perspective of using the data structure, it seems fine — it will not be any more inconvenient to read or write than any other data structure held behind Arc.
For implementation, you will not be having very much “fight with the borrow checker” because you are not going to be writing functions with very many borrows — since your data structure owns its contents. But, insofar as you are writing your own concurrency primitives with unsafe, you will need to be careful to obey all the memory rules anyway. Remember that unsafe is not a license to disregard the rules; it's telling the compiler “I'm going to obey the rules but in a way you can't check”.
In the case of the concurrent-only scenario, can I drop atomics without using unsafe?
Yes; in a no-threads scenario, you can use Cell for all purposes that atomics would serve.

The most idiomatic way to efficiently serialize/deserialize a Copy struct into/out of [u8]

Copy means that the struct could be copied just by copying bytes as is. As a result, it should be easily possible to re-interpret such a struct as [u8]. What's the most idiomatic way to do so, preferably without involving unsafe?
I want to have an optimized struct which could be easily sent via processes/wire/disk. I understand that there are a lot of details which need to be taken care of, like alignment, and I am looking for a solution for such a high-performance use case. I.e. I am looking for close-to-zero-copy, high-performance serialization.
Copy means that the struct could be copied just by copying bytes as is.
This is true.
As a result, it should be easily possible to re-interpret such a struct as [u8].
This is not true, because Copy structs can still contain padding, which is not permitted to be read except incidentally while copying.
What's the most idiomatic way to do so, preferably without involving unsafe?
You should start with bytemuck. It is a library which provides trivial conversion to and from [u8] when it is safe to do so. In particular, it checks that there is no padding in the struct, and that the representation is well-defined (not subject to the whims of the compiler).
You will still need to consider alignment, and for that purpose may need to introduce explicit “padding” fields (whose value is explicitly set rather than being left undefined) so that the alignment of other fields is satisfied.
Your program's data will also not be compatible with machines of different endianness unless you take care. (However, it is possible to handle endianness in ways that have zero run-time overhead when no conversion is needed, and since most machines today are little-endian, that cost will almost never actually apply.)

golang write struct as raw data

I am working on a new type of database, using Go. One of the things I would like to do is have a distributed disk so that I can distribute queries over multiple machines (think Pi-type architectures). This means building my own structures on raw disk.
My challenge is that I can't find a Go package that will let me write N bytes from a pointer to a structure. All the IO packages limit the access to []byte slices.
That's nice for protection, but if I have to buffer everything through a byte array via some form of encoding it will slow down the access to a specific object.
Anyone got any idea on how to do raw IO? Or am I going to have to handle GOBs as my unit of IO and suffer the penalty for encoding/decoding?
Big warning first: don't do it; it is neither safe nor portable.
For a given struct, you can reflect over it to figure out the in-memory size of the actual struct, then unsafely cast it to a []byte using unsafe.
e.g.: (*[in-mem size]byte)(unsafe.Pointer(&mystruct))
This will give you something C-ish with absolutely no safety guarantees or portability.
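As a concrete sketch, with a hypothetical record type and assuming "unsafe" is imported (on Go 1.17+ unsafe.Slice can replace the array cast):
type record struct { // hypothetical example type with no padding between fields
    ID    uint64
    Score float64
}

// rawBytes views the memory of r as a byte slice. Neither safe nor portable:
// field order, padding and sizes are implementation details.
func rawBytes(r *record) []byte {
    // Equivalent on Go 1.17+: unsafe.Slice((*byte)(unsafe.Pointer(r)), unsafe.Sizeof(*r))
    return (*[unsafe.Sizeof(record{})]byte)(unsafe.Pointer(r))[:]
}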
I'll quote the Go spec:
A package using unsafe must be vetted manually for type safety and may not be portable.
You can find a lot more details in this Go and Memory layout post, including all the steps you need to unsafely treat structs as just bytes.
Overall, it's fascinating to examine how Go functions on a low level, but this is absolutely the wrong thing to do in your case. Any real data infrastructure will need storage logic way more complicated than just dumping in-memory structs to disk anyway.
In general, you cannot do raw IO of a Go struct (i.e. memdump). This is because many things in Go contain pointers, and the actual data is not contiguous in memory.
For example, a struct like this:
type Person struct {
Name string
}
contains a string, which in turn contains a pointer to the bytes of the string. A raw memdump would only dump the pointer.
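To make that concrete, a short sketch (assuming "fmt", "strings" and "unsafe" are imported, and a 64-bit platform):
// Person only holds the string header (pointer + length), not the character
// data, so its size never changes no matter how long the name is.
p := Person{Name: strings.Repeat("x", 1<<20)}
fmt.Println(unsafe.Sizeof(p)) // 16 on 64-bit, even though the name is 1 MiB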
The solution is serialization. This is never free, although some implementations do a pretty good job.
The closest to what you are describing is something like go-memdump, but I wouldn't recommend it for production.
Otherwise, I recommend looking at a performant serialization technique. (Go's gob encoding is not the best.)
...Or am I going to have to handle GOBs as my unit of IO and suffer the penalty for encoding/decoding?
Just use GOBs.
Premature optimization is the root of all evil.
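A minimal gob round trip with a Person struct like the one above (assuming "bytes", "encoding/gob" and "log" are imported) would look roughly like this:
var buf bytes.Buffer
// Encode into an in-memory buffer; any io.Writer (file, socket) works the same way.
if err := gob.NewEncoder(&buf).Encode(Person{Name: "Ada"}); err != nil {
    log.Fatal(err)
}
var decoded Person
if err := gob.NewDecoder(&buf).Decode(&decoded); err != nil {
    log.Fatal(err)
}
// decoded.Name == "Ada"; buf.Bytes() is what you would write to disk or the wire.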

Box<X> vs move semantics on X

I have an easy question regarding Box<X>.
I understand what it does: it allocates X on the heap.
In C++ you use the new operator to allocate something on the heap so it can outlive the current scope (because if you create something on the stack it goes away at the end of the current block).
But reading Rust's documentation, it looks like you can create something on the stack and still return it taking advantage of the language's move semantics without having to resort to the heap.
Then it's not clear to me when to use Box<X> as opposed to simply X.
I just started reading about Rust so I apologize if I'm missing something obvious.
First of all: C++11 (and newer) has move semantics with rvalue references, too. So your question would also apply to C++. Keep in mind, though, that C++'s move semantics are -- unlike Rust's -- highly unsafe.
Second: the term "move semantics" somehow hints at the absence of a "copy", which is not true. Suppose you have a struct with 100 64-bit integers. If you transfer an object of this struct via move semantics, those 100 integers will be copied (of course, the compiler's optimizer can often remove those copies, but anyway...). The advantage of move semantics comes into play when dealing with objects that manage some kind of data on the heap (or pointers in general).
For example, take a look at Vec (similar to C++'s vector): the type itself only contains a pointer and two pointer-sized integers (ptr, len and cap). Those three times 64 bits are still copied when the vector is moved, but the main data of the vector (which lives on the heap) is not touched.
That being said, let's discuss the main question: "Why use Box at all?". There are actually many use cases:
Unsized types: some types (e.g. trait objects, which include closures) are unsized, meaning their size is not known to the compiler. But the compiler has to know the size of each stack frame, hence those unsized types cannot live on the stack.
Recursive data structures: think of a BinaryTreeNode struct. It stores two members named "left" and "right" of type... BinaryTreeNode? That won't work. So you box both children so that the compiler knows the size of your struct.
Huge structs: think of the 100-integer struct mentioned above. If you don't want to copy it every time, you can allocate it on the heap (this is needed pretty seldom).
There are cases where you can't return X, e.g. if X is ?Sized (traits, non-compile-time-sized arrays, etc.). In those cases Box<X> will still work.

What's going on in the 'offsetof' macro?

Visual C++ 2008's C runtime offers an operator 'offsetof', which is actually a macro defined like this:
#define offsetof(s,m) (size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
This allows you to calculate the offset of the member variable m within the class s.
What I don't understand in this declaration is:
Why are we casting m to anything at all and then dereferencing it? Wouldn't this have worked just as well:
&(((s*)0)->m)
?
What's the reason for choosing char reference (char&) as the cast target?
Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
An offset is in bytes. So to get a number expressed in bytes, you have to cast the address to a char type, because char is defined to have a size of one byte.
The use of volatile is perhaps a cautious step to ensure that no compiler optimisations (either that exist now or may be added in the future) will change the precise meaning of the cast.
Update:
If we look at the macro definition:
(size_t)&reinterpret_cast<const volatile char&>((((s *)0)->m))
With the cast-to-char removed it would be:
(size_t)&((((s *)0)->m))
In other words, get the address of member m in an object at address zero, which does look okay at first glance. So there must be some way that this would potentially cause a problem.
One thing that springs to mind is that the operator & may be overloaded on whatever type m happens to be. If so, this macro would be executing arbitrary code on an "artificial" object that is somewhere quite close to address zero. This would probably cause an access violation.
This kind of abuse may be outside the applicability of offsetof, which is supposed to only be used with POD types. Perhaps the idea is that it is better to return a junk value instead of crashing.
(Update 2: As Steve pointed out in the comments, there would be no similar problem with operator ->)
offsetof is something to be very careful with in C++. It's a relic from C. These days we are supposed to use member pointers. That said, I believe that member pointers to data members are overdesigned and broken - I actually prefer offsetof.
Even so, offsetof is full of nasty surprises.
First, for your specific questions, I suspect the real issue is that they've adapted the traditional C macro (which I thought was mandated by the C++ standard). They probably use reinterpret_cast for "it's C++!" reasons (so why the (size_t) cast?), and a char& rather than a char* to try to simplify the expression a little.
Casting to char looks redundant in this form, but probably isn't. (size_t) is not equivalent to reinterpret_cast, and if you try to cast pointers to other types into integers, you run into problems. I don't think the compiler even allows it, but to be honest, I'm suffering memory failure ATM.
The fact that char is a single byte type has some relevance in the traditional form, but that may only be why the cast is correct again. To be honest, I seem to remember casting to void*, then char*.
Incidentally, having gone to the trouble of using C++-specific stuff, they really should be using std::ptrdiff_t for the final cast.
Anyway, coming back to the nasty surprises...
VC++ and GCC probably won't use that macro. IIRC, they have a compiler intrinsic, depending on options.
The reason is to do what offsetof is intended to do, rather than what the macro does, which is reliable in C but not in C++. To understand this, consider what would happen if your struct uses multiple or virtual inheritance. In the macro, when you dereference a null pointer, you end up trying to access a virtual table pointer that isn't there at address zero, meaning that your app probably crashes.
For this reason, some compilers have an intrinsic that just uses the specified struct's layout instead of trying to deduce a run-time type. But the C++ standard doesn't mandate or even suggest this - it's only there for C compatibility reasons. And you still have to be careful if you're working with class hierarchies, because as soon as you use multiple or virtual inheritance, you cannot assume that the layout of the derived class matches the layout of the base class - you have to ensure that the offset is valid for the exact run-time type, not just a particular base.
If you're working on a data structure library, maybe using single inheritance for nodes, but apps cannot see or use your nodes directly, offsetof works well. But strictly speaking, even then, there's a gotcha. If your data structure is in a template, the nodes may have fields with types from template parameters (the contained data type). If that isn't POD, technically your structs aren't POD either. And all the standard demands for offsetof is that it works for POD. In practice, it will work - your type hasn't gained a virtual table or anything just because it has a non-POD member - but you have no guarantees.
If you know the exact run-time type when you dereference using a field offset, you should be OK even with multiple and virtual inheritance, but ONLY if the compiler provides an intrinsic implementation of offsetof to derive that offset in the first place. My advice - don't do it.
Why use inheritance in a data structure library? Well, how about...
class node_base { ... };
class leaf_node : public node_base { ... };
class branch_node : public node_base { ... };
The fields in the node_base are automatically shared (with identical layout) in both the leaf and branch, avoiding a common error in C with accidentally different node layouts.
BTW - offsetof is avoidable with this kind of stuff. Even if you are using offsetof for some jobs, node_base can still have virtual methods and therefore a virtual table, so long as it isn't needed to dereference member variables. Therefore, node_base can have pure virtual getters, setters and other methods. Normally, that's exactly what you should do. Using offsetof (or member pointers) is a complication, and should only be used as an optimisation if you know you need it. If your data structure is in a disk file, for instance, you definitely don't need it - a few virtual call overheads will be insignificant compared with the disk access overheads, so any optimisation efforts should go into minimising disk accesses.
Hmmm - went off on a bit of a tangent there. Whoops.
char is guaranteed to be the smallest number of bits the architecture can "bite" (a.k.a. a byte).
All pointers are actually numbers, so cast address 0 to that type because it's the beginning.
Take the address of the member starting from 0 (resulting in 0 + location_of_m).
Cast that back to size_t.
1) I also do not know why it is done in this way.
2) The char type is special in two ways.
No other type has weaker alignment restrictions than the char type. This is important for reinterpret_cast between pointers and between expressions and references.
It is also the only type (together with its unsigned variant) for which the specification defines the behavior when it is used to access the stored value of objects of a different type. I do not know if this applies to this specific situation.
3) I think that the volatile modifier is used to ensure that no compiler optimization will result in an attempt to read the memory.
2. What's the reason for choosing char reference (char&) as the cast target?
If the member's type has operator& overloaded, we can't take its address with a plain &. So we reinterpret_cast it to the primitive type char, which cannot have an overloaded operator&, and then take the address of that. In C, the reinterpret_cast is not required.
3. Why use volatile? Is there a danger of the compiler optimizing the loading of m? If so, in what exact way could that happen?
Here volatile is not about compiler optimization. If the member is declared const, volatile, or both, a reinterpret_cast to plain char& would not compile, because reinterpret_cast cannot remove cv-qualifiers. Casting to const volatile char& therefore works for any combination of qualifiers.
