std::string, const, and thread safety

This question has been discussed but I haven't seen a set answer yet. It's partially discussed here:
https://www.justsoftwaresolutions.co.uk/cplusplus/const-and-thread-safety.html
But the answer still isn't totally clear to me.
If you define std::string const kstrValue = "Value", is kstrValue inherently thread-safe?
What my research has indicated is that it is thread-safe as long as you don't call standard library functions that mutate the string.
Is this true?

You can't call any functions that modify that string, std:: or otherwise. What the article is saying is: write your classes like std::string (or a hypothetical int class).
By ensuring that methods that don't modify the class are marked const, you can have const Foo objects freely shared among threads, safe in the knowledge that there can be no data races, because there are no modifications.
It is slightly more subtle with const Foo & references. You don't know if the underlying object really is const, or that const was added to the reference and somewhere else modifications can occur. Access to such objects still needs to be synchronised between threads.

Yes, it is.
[res.on.data.races]/3:
A C++ standard library function shall not directly or indirectly modify objects ([intro.multithread]) accessible by threads other than the current thread unless the objects are accessed directly or indirectly via the function's non-const arguments, including this.


How can I safely share objects between Rust and C++?

One way to construct and destruct C++ objects from Rust is to call the constructor and return the object's address as an int64_t to Rust. Then, Rust can call methods on the object by passing back the int64_t, which will be cast to the pointer again.
void do_something(int64_t pointer, char* methodName, ...) {
//cast pointer and call method here
}
However, this is extremely unsafe. Instead I tend to store the pointer into a map and pass the map key to Rust, so it can call C++ back:
void do_something(int id, char* methodName, ...) {
//retrieve pointer from id and call method on it
}
Now, imagine I create, from Rust, a C++ object that calls Rust back. I could do the same: give C++ an int64_t and then C++ calls Rust:
#[no_mangle]
pub fn do_something(pointer: i64, method_name: &CString, ...) {
}
but that's also insecure. Instead I'd do something similar as C++, using an id:
#[no_mangle]
pub fn do_something(id: u32, method_name: &CString, ...) {
//search id in map and get object
//execute method on the object
}
However, this isn't possible, as Rust does not have support for static variables like a map. And Rust's lazy_static is immutable.
The only way to do safe calls from C++ back to Rust is to pass the address of something static (the function do_something), so calling it will always point to something concrete. Passing pointers is insecure, as the pointed-to object could stop existing. However, there should be a way for this function to maintain a map of created objects and ids.
So, how to safely call Rust object functions from C++? (for a Rust program, not a C++ program)
Pointers or Handles
Ultimately, this is about object identity: you need to pass something which allows identifying one instance of an object.
The simplest interface is to return a Pointer. It is the most performant interface, although it requires trust between the parties, and clear ownership.
When a Pointer is not suitable, the fallback is to use a Handle. This is, for example, typically what kernels do: a file descriptor, in Linux, is just an int.
Handles do not preclude strong typing.
C and Linux are poor examples, here. Just because a Handle is, often, an integral ID does not preclude encapsulating said integer into a strong type.
For example, you could define struct FileDescriptor(i32); to represent a file descriptor handed over from Linux.
Handles do not preclude strongly typed functions.
Similarly, just because you have a Handle does not mean that you have a single syscall interface where the name of the function must be passed by ID (or worse string) and an unknown/untyped soup of arguments follow.
You can perfectly, and really should, use strongly typed functions:
int read(FileDescriptor fd, std::byte* buffer, std::size_t size);
Handles are complicated.
Handles are, to a degree, more complicated than pointers.
First of all, handles are meaningless without some repository: 33 has no intrinsic meaning, it is just a key to look-up the real instance.
The repository need not be a singleton. It can perfectly be passed along in the function call.
The repository should likely be thread-safe and re-entrant.
There may be data-races between usage and deletion of a handle.
The latter point is maybe the most surprising, and means that care must be taken when using the repository: accesses to the underlying values must also be thread-safe, and re-entrant.
(Non thread-safe or non re-entrant underlying values leave you open to Undefined Behavior)
Use Pointers.
In general, my recommendation is to use Pointers.
While Handles may feel safer, implementing a correct system is much more complicated than it looks. Furthermore, Handles do not intrinsically solve ownership issues. Instead of Undefined Behavior, you'll get null or dangling Handle errors... and have to reinvent the tooling to track them down.
If you cannot solve the ownership issues with Pointers, you are unlikely to solve them with Handles.

Why does pthread_exit use void*?

I recently started using POSIX threads, and the choice of argument types in the standard made me curious. I haven't been able to answer the question of why pthread_exit uses void* instead of int for the returning status of the thread (the same as exit does).
The only advantage I see is that it lets the programmers define the status how they want (e.g. return a pointer to a complex structure), but I doubt it is widely used like this.
It seems that in most cases this choice has more overhead because of necessary casting.
This isn't just for status, it's the return value of the thread. Using a pointer allows the thread to return a pointer to a dynamically-allocated array or structure.
You can't really compare it with the exit() parameter, because that's for sending status to the operating system. This is intentionally very simple to allow portability with many OSes.
The only advantage I see is that it lets the programmers define the status how they want (e.g. return a pointer to a complex structure), but I doubt it is widely used like this.
Indeed, that's the reason. And it's probably not used that widely (e.g. you can communicate values via other means, such as a pointer passed to the thread function, or a global variable with synchronisation). But if it were declared as void pthread_exit(int);, that takes away the ability to return pointers. So void pthread_exit(void*); is a more flexible design.
It seems that in most cases this choice has more overhead because of necessary casting.
In most cases, it's not used at all as the common way is to return nothing i.e. pthread_exit(NULL);. So it only matters when returning pointers (to structs and such) and in those cases a conversion to void * isn't necessary as any pointer type can be converted to void * without an explicit cast.

How to implement long-lived variables/state in a library?

I understand that the preferred way to implement something like a global/instance/module variable in Rust is to create said variable in main() or other common entry point and then pass it down to whoever needs it.
It also seems possible to use a lazy_static for an immutable variable, or it can be combined with a mutex to implement a mutable one.
In my case, I am using Rust to create a .so with bindings to Python and I need to have a large amount of mutable state stored within the Rust library (in response to many different function calls invoked by the Python application).
What is the preferred way to store that state?
Is it only via the mutable lazy_static approach since I have no main() (or more generally, any function which does not terminate between function calls from Python), or is there another way to do it?
Bundle it
In general, and absent other requirements, the answer is to bundle your state in some object and hand it over to the client. A popular name is Context.
Then, the client should have to pass the object around in each function call that requires it:
Either by defining the functionality as methods on the object.
Or by requiring the object as parameter of the functions/methods.
This gives full control to the client.
The client may end up creating a global for it, or may actually appreciate the flexibility of being able to juggle multiple instances.
Note: There is no need to provide any access to the inner state of the object; all the client needs is a handle (ref-counted, in Python) to control the lifetime and decide when to use which handle. In C, this would be a void*.
Exceptions
There are cases, such as a cache, where the functionality is not impacted, only the performance.
In this case, while the flexibility could be appreciated, it may be more of a burden than anything. A global, or thread-local, would then make sense.
I'd be tempted to dip into unsafe code here. You cannot use non-static lifetimes, as the lifetime of your state would be determined by the Python code, which Rust can't see. On the other hand, 'static state has other problems:
It necessarily persists until the end of the program, which means there's no way of recovering memory you're no longer using.
'static variables are essentially singletons, making it very difficult to write an application that makes multiple independent usages of your library.
I would go with a solution similar to what @Matthieu M. suggests, but instead of passing the entire data structure back and forth over the interface, allocate it on the heap, unsafely, and then pass some sort of handle (i.e. pointer) back and forth.
You would probably want to write a cleanup function, and document your library to compel users to call the cleanup function when they're done using a particular handle. Effectively, you're explicitly delegating the management of the lifecycle of the data to the calling code.
With this model, if desired, an application could create, use, and cleanup multiple datasets (each represented by their own handle) concurrently and independently. If an application "forgets" to cleanup a handle when finished, you have a memory leak, but that's no worse than storing the data in a 'static variable.
There may be helper crates and libraries to assist with doing this sort of thing. I'm not familiar enough with Rust to know.

magic statics: similar constructs, interesting non-obvious uses?

C++11 introduced threadsafe local static initialization, aka "magic statics": Is local static variable initialization thread-safe in C++11?
In particular, the spec says:
If control enters the declaration concurrently while the variable is
being initialized, the concurrent execution shall wait for completion
of the initialization.
So there's an implicit mutex lock here. This is very interesting, and seems like an anomaly -- that is, I don't know of any other implicit mutexes built into C++ (i.e. mutex semantics without any use of things like std::mutex). Are there any others, or is this unique in the spec?
I'm also curious whether magic static's implicit mutex (or other implicit mutexes, if there are any) can be leveraged to implement other synchronization primitives. For example, I see that they can be used to implement std::call_once, since this:
std::call_once(onceflag, some_function);
can be expressed as this:
static int dummy = (some_function(), 0);
Note, however, that the magic static version is more limited than std::call_once, since with std::call_once you could re-initialize onceflag and so use the code multiple times per program execution, whereas with magic statics, you really only get to use it once per program execution.
That's the only somewhat non-obvious use of magic statics that I can think of.
Is it possible to use magic static's implicit mutex to implement other synchronization primitives, e.g. a general std::mutex, or other useful things?
Initialization of block-scope static variables is the only place where the language requires synchronization. Several library functions require synchronization, but aren't directly synchronization functions (e.g. atexit).
Since the synchronization on the initialization of a local static is a one-time affair, it would be hard, if not impossible, to implement a general purpose synchronization mechanism on top of it, since every time you needed a synchronization point you would need to be initializing a different local static object.
Though they can be used in place of call_once in some circumstances, they can't be used as a general replacement for that, since a given once_flag object may be used from many places.

How should I use storage class specifiers like ref, in, out, etc. in function arguments in D?

There are comparatively many storage class specifiers for functions arguments in D, which are:
none
in (which is equivalent to const scope)
out
ref
scope
lazy
const
immutable
shared
inout
What's the rationale behind them? Their names already put forth the obvious use. However, there are some open questions:
Should I use ref combined with in for struct type function arguments by default?
Does out imply ref implicitly?
When should I use none?
Does ref on classes and/or interfaces make sense? (Class types are references by default.)
How about ref on array slices?
Should I use const for built-in arithmetic types, whenever possible?
More generally put: When and why should I use which storage class specifier for function argument types in case of built-in types, arrays, structs, classes and interfaces?
(In order to isolate the scope of the question a little bit, please don't discuss shared, since it has its own isolated meaning.)
I wouldn't use either by default. ref parameters only take lvalues, and it implies that you're going to be altering the argument that's being passed in. If you want to avoid copying, then use const ref or auto ref. But const ref still requires an lvalue, so unless you want to duplicate your functions, it's frequently more annoying than it's worth. And while auto ref will avoid copying lvalues (it basically makes it so that there's a version of the function which takes lvalues by ref and one which takes rvalues without ref), it only works with templates, limiting its usefulness. And using const can have far-reaching consequences due to the fact that D's const is transitive and the fact that it's undefined behavior to cast away const from a variable and modify it. So, while it's often useful, using it by default is likely to get you into trouble.
Using in gives you scope in addition to const, which I'd generally advise against. scope on function parameters is supposed to make it so that no reference to that data can escape the function, but the checks for it aren't properly implemented yet, so you can actually use it in a lot more situations than are supposed to be legal. There are some cases where scope is invaluable (e.g. with delegates, since it makes it so that the compiler doesn't have to allocate a closure for it), but for other types, it can be annoying (e.g. if you pass an array by scope, then you couldn't return a slice of that array from the function). And any structs with any arrays or reference types would be affected. And while you won't get many complaints about incorrectly using scope right now, if you've been using it all over the place, you're bound to get a lot of errors once it's fixed. Also, it's utterly pointless for value types, since they have no references to escape. So, using const and in on a value type (including structs which are value types) is effectively identical.
out is the same as ref except that it resets the parameter to its init value so that you always get the same value passed in regardless of what the previous state of the variable being passed in was.
Almost always as far as function arguments go. You use const or scope or whatnot when you have a specific need it, but I wouldn't advise using any of them by default.
Of course it does. ref is separate from the concept of class references. It's a reference to the variable being passed in. If I do
void func(ref MyClass obj)
{
obj = new MyClass(7);
}
auto var = new MyClass(5);
func(var);
then var will refer to the newly constructed new MyClass(7) after the call to func rather than the new MyClass(5). You're passing the reference by ref. It's just like how taking the address of a reference (like var) gives you a pointer to the reference and not a pointer to a class object.
MyClass* p = &var; //points to var, _not_ to the object that var refers to.
Same deal as with classes. ref makes the parameter refer to the variable passed in. e.g.
void func(ref int[] arr)
{
arr ~= 5;
}
auto var = [1, 2, 3];
func(var);
assert(var == [1, 2, 3, 5]);
If func didn't take its argument by ref, then var would have been sliced, and appending to arr would not have affected var. But since the parameter was ref, anything done to arr is done to var.
That's totally up to you. Making it const makes it so that you can't mutate it, which means that you're protected from accidentally mutating it if you don't intend to ever mutate it. It might also enable some optimizations, but if you never write to the variable, and it's a built-in arithmetic type, then the compiler knows that it's never altered and the optimizer should be able to do those optimizations anyway (though whether it does or not depends on the compiler's implementation).
immutable and const are effectively identical for the built-in arithmetic types in almost all cases, so personally, I'd just use immutable if I want to guarantee that such a variable doesn't change. In general, using immutable instead of const if you can gives you better optimizations and better guarantees, since it allows the variable to be implicitly shared across threads (if applicable) and it always guarantees that the variable can't be mutated (whereas for reference types, const just means only that that reference can't mutate the object, not that it can't be mutated).
Certainly, if you mark your variables const and immutable as much as possible, then it does help the compiler with optimizations at least some of the time, and it makes it easier to catch bugs where you mutated something when you didn't mean to. It also can make your code easier to understand, since you know that the variable is not going to be mutated. So, using them liberally can be valuable. But again, using const or immutable can be overly restrictive depending on the type (though that isn't a problem with the built-in integral types), so just automatically marking everything as const or immutable can cause problems.
