Thread-safe data structure design - multithreading

I have to design a data structure that will be used in a multi-threaded environment. The basic API is simple: insert element, remove element, retrieve element, check that element exists. The structure's implementation uses implicit locking to guarantee the atomicity of a single API call. After I implemented this, it became apparent that what I really need is atomicity across several API calls. For example, if a caller needs to check for the existence of an element before trying to insert it, they can't do that atomically even if each single API call is atomic:
if(!data_structure.exists(element)) {
    data_structure.insert(element);
}
The example is somewhat awkward, but the basic point is that we can't trust the result of the "exists" call anymore once we have returned from its atomic context: nothing prevents a context switch, and another thread's modification, between the two calls.
What I currently have in mind to solve this is exposing the lock through the data structure's public API. This way clients will have to lock things explicitly, but at least they won't have to create their own locks. Is there a better commonly known solution to these kinds of problems? And while we're at it, can you recommend some good literature on thread-safe design?
EDIT: I have a better example. Suppose that element retrieval returns either a reference or a pointer to the stored element and not a copy of it. How can a caller be protected so that it can safely use this pointer/reference after the call returns? If you think that not returning copies is the problem, then think about deep copies, i.e. objects that must also copy the other objects they point to internally.
Thank you.

You either provide a mechanism for external locking (bad), or redesign the API, like putIfAbsent. The latter approach is used, for instance, by Java's concurrent data structures.
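For illustration, here is a minimal C++ sketch of what such a combined operation might look like; the class and method names (ConcurrentSet, insert_if_absent) are mine, not a standard interface:

#include <mutex>
#include <set>

// insert_if_absent combines the "exists" check and the insert under one
// lock, so callers never see the race described in the question.
template <typename T>
class ConcurrentSet {
    std::mutex mutex_;
    std::set<T> data_;
public:
    // returns true if the element was inserted, false if it was already present
    bool insert_if_absent(const T& element) {
        std::lock_guard<std::mutex> lock(mutex_);
        return data_.insert(element).second;
    }
};

The check-then-insert from the question then collapses into a single atomic call.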
And when it comes to such basic collection types, you should check whether your language of choice already offers them in its standard library.
[edit]To clarify: external locking is bad for the user of the class, as it introduces another source of potential bugs. Yes, there are times when performance considerations really do make a concurrent data structure worse than an externally synchronized one, but those cases are rare, and they can usually only be solved/optimized by people with far more knowledge/experience than me.
One, maybe important, performance hint is found in Will's answer below.
[/edit]
[edit2]Given your new example: basically, you should try to keep the synchronization of the collection and that of the elements as separate as possible. If the lifetime of an element is bound to its presence in the collection, you will run into problems; with a GC this kind of problem actually becomes simpler. Otherwise you will have to store a kind of proxy instead of the raw elements in the collection; in the simplest case for C++ you would use boost::shared_ptr, which uses an atomic ref-count. Insert the usual performance disclaimer here. Since you seem to be using C++ (you talk about pointers and references), the combination of boost::shared_ptr and boost::make_shared should suffice for a while.
[/edit2]
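A rough sketch of that proxy idea, here with std::shared_ptr (C++11) rather than boost; the class and method names are illustrative, not from any library:

#include <map>
#include <memory>
#include <mutex>

// The collection stores and hands out shared_ptr proxies. An element stays
// alive as long as any caller still holds a pointer to it, even if another
// thread removes it from the collection in the meantime.
template <typename Key, typename Value>
class SharedMap {
    std::mutex mutex_;
    std::map<Key, std::shared_ptr<Value>> data_;
public:
    void insert(const Key& k, Value v) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_[k] = std::make_shared<Value>(std::move(v));
    }
    std::shared_ptr<Value> find(const Key& k) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = data_.find(k);
        if (it == data_.end())
            return nullptr;
        return it->second;
    }
    void erase(const Key& k) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.erase(k);   // callers holding the shared_ptr keep the element alive
    }
};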

Sometimes it's expensive to create an element to be inserted. In these scenarios you can't really afford to routinely construct objects that might already exist, only to find out that they do.
One approach is for the insertIfAbsent() method to return a 'cursor' that is locked: it inserts a placeholder into the internal structure so that no other thread can believe the element is absent, but does not insert the new object. The placeholder can contain a lock so that other threads that want to access that particular element must wait for it to be inserted.
In an RAII language like C++ you can use a smart stack class to encapsulate the returned cursor so that it automatically rolls back if the calling code does not commit. In Java it's a bit more deferred with the finalize() method, but it can still work.
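A minimal sketch of the placeholder idea in C++, using a std::shared_future as the placeholder; the class name, the future-based waiting, and the factory parameter are my own choices, and error handling (e.g. set_exception if construction throws) is omitted:

#include <future>
#include <map>
#include <mutex>

template <typename Key, typename Value>
class LazyCache {
    std::mutex mutex_;
    std::map<Key, std::shared_future<Value>> entries_;
public:
    // Returns the element for 'key', constructing it with 'make' exactly once.
    template <typename Factory>
    Value get_or_create(const Key& key, Factory make) {
        std::promise<Value> p;
        std::shared_future<Value> f;
        bool creator = false;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = entries_.find(key);
            if (it == entries_.end()) {
                f = p.get_future().share();
                entries_.emplace(key, f);   // placeholder: the key now "exists"
                creator = true;
            } else {
                f = it->second;             // another thread reserved it first
            }
        }
        if (creator)
            p.set_value(make());            // expensive construction happens outside the lock
        return f.get();                     // other threads block here until the value is ready
    }
};

The expensive construction runs at most once per key, while competing threads wait on the placeholder rather than building a duplicate.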
Another approach is for the caller to create the object that isn't present, but to accept that the actual insertion will occasionally fail if another thread has 'won the race'. This is how, for example, memcache updates are done. It can work very well.

What about moving the existence check into the .insert() method? A client calls it, and if it returns false the client knows the element could not be inserted. Much like what malloc() does in plain old C: return NULL on failure and set errno.
Obviously you can also throw an exception, or return an instance of an object, and complicate your life from there.
But please, don't rely on the user setting their own locks.

In RAII fashion you could create accessor/handle objects (I don't know what this is called; there probably is a pattern for it), e.g. for a List:
// forward declaration so that List can befriend ListHandle<T>
template <typename T> class ListHandle;

template <typename T>
class List {
    friend class ListHandle<T>;

    std::mutex mutex_;

    // private methods use NO locking
    bool _exists( const T& e ) { ... }
    void _insert( const T& e ) { ... }
    void _lock()   { mutex_.lock(); }
    void _unlock() { mutex_.unlock(); }
public:
    // public methods use internal locking and just wrap the private methods
    bool exists( const T& e ) {
        std::lock_guard<std::mutex> l( mutex_ );
        return _exists( e );
    }
    void insert( const T& e ) {
        std::lock_guard<std::mutex> l( mutex_ );
        _insert( e );
    }
    ...
};
template <typename T>
class ListHandle {
    List<T>& list;
public:
    explicit ListHandle( List<T>& l ) : list(l) {
        list._lock();
    }
    ~ListHandle() {
        list._unlock();
    }
    // non-copyable: the handle holds the lock for its whole lifetime
    ListHandle( const ListHandle& ) = delete;
    ListHandle& operator=( const ListHandle& ) = delete;

    bool exists( const T& e ) { return list._exists(e); }
    void insert( const T& e ) { list._insert(e); }
};
List<int> list;

void foo() {
    ListHandle<int> hndl( list ); // locks the list as long as hndl exists

    if( !hndl.exists(element) ) {
        hndl.insert(element);
    }
    // list is unlocked here (ListHandle destructor)
}
You duplicate (or even triplicate) the public interface, but you give users the choice between internal locking and safe, convenient external locking wherever it is required.

First of all, you should really separate your concerns. You have two things to worry about:
The datastructure and its methods.
The thread synchronization.
I highly suggest you use an interface or virtual base class that represents the type of data structure you are implementing. Create a simple implementation that does not do any locking at all. Then create a second implementation that wraps the first implementation and adds locking on top of it. This allows a more performant implementation where locking isn't needed and greatly simplifies your code.
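As a rough illustration of that layering (the interface and class names here are mine):

#include <mutex>
#include <set>

// Abstract interface for the data structure, with no locking policy baked in.
template <typename T>
struct Set {
    virtual ~Set() = default;
    virtual bool exists(const T& e) const = 0;
    virtual void insert(const T& e) = 0;
};

// Plain, unsynchronized implementation.
template <typename T>
class PlainSet : public Set<T> {
    std::set<T> data_;
public:
    bool exists(const T& e) const override { return data_.count(e) != 0; }
    void insert(const T& e) override { data_.insert(e); }
};

// Thread-safe decorator that wraps any Set<T> and adds locking on top.
template <typename T>
class LockedSet : public Set<T> {
    mutable std::mutex mutex_;
    Set<T>& inner_;
public:
    explicit LockedSet(Set<T>& inner) : inner_(inner) {}
    bool exists(const T& e) const override {
        std::lock_guard<std::mutex> lock(mutex_);
        return inner_.exists(e);
    }
    void insert(const T& e) override {
        std::lock_guard<std::mutex> lock(mutex_);
        inner_.insert(e);
    }
};

Code that never shares the structure across threads can use PlainSet directly and skip the locking cost.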
It looks like you are implementing some sort of dictionary. One thing you can do is provide methods whose semantics are equivalent to the combined statement. For example, setdefault is a reasonable function to provide; it sets a value only if the corresponding key does not already exist in the dictionary.
In other words, my recommendation would be to figure out what combinations of methods are frequently used together, and simply create API methods that perform that combination of operations atomically.

Related

How can I safely share objects between Rust and C++?

One way to construct and destruct C++ objects from Rust is to call the constructor and return the pointer to Rust as an int64_t. Rust can then call methods on the object by passing the int64_t back, where it will be cast to the pointer again.
void do_something(int64_t pointer, char* methodName, ...) {
    // cast pointer and call method here
}
However, this is extremely unsafe. Instead I tend to store the pointer in a map and pass the map key to Rust, so it can call C++ back:
void do_something(int id, char* methodName, ...) {
    // retrieve pointer from id and call method on it
}
Now, imagine I create, from Rust, a C++ object that calls Rust back. I could do the same: give C++ an int64_t and then C++ calls Rust:
#[no_mangle]
pub fn do_something(pointer: i64, method_name: &CString, ...) {
}
but that's also insecure. Instead I'd do something similar to the C++ side, using an id:
#[no_mangle]
pub fn do_something(id: u32, method_name: &CString, ...) {
    // search id in map and get object
    // execute method on the object
}
However, this isn't possible, as Rust does not have support for static variables like a map, and Rust's lazy_static is immutable.
The only way to do safe calls from C++ back to Rust is to pass the address of something static (the function do_something), so calling it will always point to something concrete. Passing pointers is insecure, as the pointed-to object could stop existing. However, there should be a way for this function to maintain a map of created objects and their ids.
So, how to safely call Rust object functions from C++? (for a Rust program, not a C++ program)
Pointers or Handles
Ultimately, this is about object identity: you need to pass something that allows one instance of an object to be identified.
The simplest interface is to return a Pointer. It is the most performant interface, although it requires trust between the parties and clear ownership.
When a Pointer is not suitable, the fallback is to use a Handle. This is, for example, typically what kernels do: a file descriptor, in Linux, is just an int.
Handles do not preclude strong typing.
C and Linux are poor examples here. Just because a Handle is often an integral ID does not preclude encapsulating said integer in a strong type.
For example, you could struct FileDescriptor(i32); to represent a file descriptor handed over from Linux.
Handles do not preclude strongly typed functions.
Similarly, just because you have a Handle does not mean that you have a single syscall interface where the name of the function must be passed by ID (or, worse, by string) and an unknown/untyped soup of arguments follows.
You can perfectly, and really should, use strongly typed functions:
int read(FileDescriptor fd, std::byte* buffer, std::size_t size);
Handles are complicated.
Handles are, to a degree, more complicated than pointers.
First of all, handles are meaningless without some repository: 33 has no intrinsic meaning, it is just a key to look up the real instance.
The repository need not be a singleton. It can perfectly be passed along in the function call.
The repository should likely be thread-safe and re-entrant.
There may be data-races between usage and deletion of a handle.
The latter point is maybe the most surprising, and it means that care must be taken when using the repository: accesses to the underlying values must also be thread-safe and re-entrant.
(Non thread-safe or non re-entrant underlying values leave you open to Undefined Behavior)
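A minimal sketch of such a repository in C++; the class name and API are my own, and a stale handle is simply reported as an empty pointer:

#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Thread-safe repository mapping opaque integer handles to live objects.
// Lookups return shared_ptr, so an object cannot be destroyed while a caller
// is still using it, even if another thread releases the handle meanwhile.
template <typename T>
class HandleRepository {
    std::mutex mutex_;
    std::uint64_t next_ = 1;
    std::unordered_map<std::uint64_t, std::shared_ptr<T>> objects_;
public:
    std::uint64_t add(std::shared_ptr<T> obj) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::uint64_t handle = next_++;
        objects_[handle] = std::move(obj);
        return handle;
    }
    std::shared_ptr<T> get(std::uint64_t handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = objects_.find(handle);
        if (it == objects_.end())
            return nullptr;   // stale or unknown handle
        return it->second;
    }
    void release(std::uint64_t handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        objects_.erase(handle);
    }
};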
Use Pointers.
In general, my recommendation is to use Pointers.
While Handles may feel safer, implementing a correct system is much more complicated than it looks. Furthermore, Handles do not intrinsically solve ownership issues. Instead of Undefined Behavior, you'll get Null Pointer Dangling Handle Exceptions... and have to reinvent the tooling to track them down.
If you cannot solve the ownership issues with Pointers, you are unlikely to solve them with Handles.

Why does pthread_exit use void*?

I recently started using POSIX threads, and the choice of argument types in the standard made me curious. I haven't been able to answer the question of why pthread_exit uses void* instead of int for the returned status of the thread (the way exit does).
The only advantage I see is that it lets the programmers define the status how they want (e.g. return a pointer to a complex structure), but I doubt it is widely used like this.
It seems that in most cases this choice has more overhead because of necessary casting.
This isn't just for status, it's the return value of the thread. Using a pointer allows the thread to return a pointer to a dynamically-allocated array or structure.
You can't really compare it with the exit() parameter, because that's for sending status to the operating system. This is intentionally very simple to allow portability with many OSes.
The only advantage I see is that it lets the programmers define the status how they want (e.g. return a pointer to a complex structure), but I doubt it is widely used like this.
Indeed, that's the reason. And it's probably not used that widely (e.g. you can communicate values via other means, such as a pointer passed to the thread function, or a global variable with synchronisation). But if it were declared as void pthread_exit(int);, that would take away the ability to return pointers. So void pthread_exit(void*); is the more flexible design.
It seems that in most cases this choice has more overhead because of necessary casting.
In most cases it's not used at all, as the common way is to return nothing, i.e. pthread_exit(NULL);. So it only matters when returning pointers (to structs and such), and in those cases a conversion to void * isn't necessary, as any object pointer type can be converted to void * without an explicit cast.
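For illustration, a small made-up example of a thread returning a pointer to a dynamically allocated structure, which the joiner retrieves via pthread_join (compile with -pthread):

#include <pthread.h>
#include <cstdio>

struct Result {
    int count;
    double average;
};

static void* worker(void*) {
    Result* r = new Result{42, 3.14};   // dynamically allocated result
    pthread_exit(r);                    // Result* converts to void* implicitly
}

int main() {
    pthread_t tid;
    void* ret = nullptr;
    pthread_create(&tid, nullptr, worker, nullptr);
    pthread_join(tid, &ret);            // receives the pointer passed to pthread_exit
    Result* r = static_cast<Result*>(ret);
    std::printf("count=%d average=%f\n", r->count, r->average);
    delete r;
    return 0;
}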

Examples of languages that hide variable multiplicity

What are some examples of programming languages, extensions to programming languages, or other solutions that hide the multiplicity of variables when operating on them, calling methods, etc.?
Specifically, I imagine a system where I have a single typed collection of objects that transparently forwards any method call on the collection to each of the objects individually, including using the return values in a meaningful way. Preferably I would like to see examples of languages that do this in a good way, but it could also be interesting to see solutions where this does not work well.
I imagine something like this:
struct Foo
{
    int bar();
};

void myFunction()
{
    // 4 Foo objects are created in a vector
    vector<Foo> vals(4);

    // The bar() method is applied to each of the Foo objects and each call
    // returns an int that is automatically inserted into a new vector
    vector<int> results = vals.bar();
}
Take a look at Java 8 streams. Basically, you'd "stream" the container's contents and indicate that each element that goes through should have the method Foo::bar applied to it:
vals.stream().forEach(Foo::bar);
If you also want to collect the return values, you would use map and a collector instead, e.g. vals.stream().map(Foo::bar).collect(Collectors.toList());.
A lot of these concepts come from earlier languages, including Lisp (list processing).
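For comparison with the question's C++-style sketch, the same effect can be written explicitly with std::transform (no language-level forwarding involved; Foo::bar's body here is made up):

#include <algorithm>
#include <iterator>
#include <vector>

struct Foo {
    int bar() const { return 42; }   // illustrative body
};

int main() {
    std::vector<Foo> vals(4);
    std::vector<int> results;
    // apply bar() to every element and collect the return values
    std::transform(vals.begin(), vals.end(), std::back_inserter(results),
                   [](const Foo& f) { return f.bar(); });
    return 0;
}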

Are methods with no return value more secure?

Please excuse my naivety as I am not familiar with exploiting or eliminating software vulnerabilities, so I may be asking a question that doesn't make any sense. This is something I was thinking about recently and couldn't find anything online that specifically addressed my question.
Please compare the following classes (Java):
public class ProvidesCollection {
    private Item[] items;

    public Item[] getItemsArray() {
        return Arrays.copyOf(items, items.length);
    }
}
public class ContainsCollection {
    private Item[] items;

    public void actionItem(int itemNumber, ItemAction itemAction) {
        if (itemNumber < 0 || itemNumber >= items.length)
            return; // Or handle as error

        Item i = items[itemNumber];

        if (i != null)
            itemAction.performOn(i);
    }
}
The ProvidesCollection class is a typical OO design. The caller is trusted to loop through the items array returned from getItemsArray. If the caller forgets to do a bounds check, it could open the code to a buffer overflow attack (if I'm not mistaken). I know Java's memory management avoids buffer overflows, so maybe Java is a bad example. Let's assume there is no mechanism for catching overflows.
The ContainsCollection class keeps the array completely hidden. Notice how the actionItem method allows the programmer to check for input errors and resolve them. Those responsible for implementing the API have more control over the data and flow of execution.
I would like to know, is the ContainsCollection class more secure than the ProvidesCollection class? Is there any evidence that avoiding return values (void methods) helps at all to remove a hacker's ability to exploit errors in the code?
No, void methods are not intrinsically more secure than methods that return values. You can write secure methods that return values, and you can write insecure methods that return nothing.
Typically, you will have void methods when you want to encapsulate some code that achieves a side-effect. For example, sending a file to a printer, changing the internal state of an object, or performing some other action. That should be the litmus test of whether or not the signature's return type should be void -- when it's a "fire and forget" type of operation.
Methods that return values are really only more insecure than void methods when they expose sensitive data to unscrupulous people. However, that doesn't mean that the same unscrupulous people couldn't pass certain data into a void method's arguments to compromise security. Though void methods don't return values, they can still throw exceptions. A caller could possibly learn certain things about a void method's data by making it throw exceptions and catching them. Also, I have had the unfortunate opportunity to read code that logged passwords to trace files, and that logging method was void.
Say your Item object had properties like CreditCardNumber and SocialSecurityNumber. In this case, your first method may potentially expose a security vulnerability. However, you could mitigate that by encrypting those values before returning the array reference (or by not exposing them at all). Any operations that need the credit card number to perform a side-effect action (such as authorizing a transaction) could be marked void and do the decryption internally (or obtain the unencrypted value in an encapsulated operation).
But it's not necessarily the method's return signature that makes it more or less secure -- it's the data that is being exposed, and who it's being exposed to. Remember, anyone can write a silly void method that writes their database connection string to a public web page.
Update
...say a vulnerability exists because a method returns a bad value or from bad usage of the return value. How can you fix the problem if users depend on the returned value? There is no chance to go back and remove the return because others depend on it.
If you need to, then you introduce a breaking change. This is a good reason to have clients depend on abstractions like interfaces rather than concrete types. Where you must have concrete types, design them carefully. Expose the minimum amount of information needed.
In the end, all data is just text. Your Item class will have string, integer, boolean, and other primitive values or nested objects that wrap primitives. You can still make changes to the encapsulated getItemsArray method to obfuscate sensitive data before returning the value if needed. If this has the potential to break client code, then you decide whether to bite the bullet and issue a breaking change or live with the flaw.
Is the void method better because you can fix it?
No. Then, you would end up with an API that only performs actions, like a black hole where you send data and never hear from it again. Like I said before, a method should be void if it performs some side effect and the caller does not need to get anything back (except possibly catching exceptions). When your API needs to return data, return an abstraction like an interface rather than a concrete type.

Is it a bad idea to use the new dynamic keyword as a replacement for a switch statement?

I like the new dynamic keyword and have read that it can be used as a replacement for the visitor pattern.
It makes the code more declarative, which I prefer.
Is it a good idea, though, to replace all instances of switching on 'Type' with a class that implements dynamic dispatch?
class VistorTest
{
    public string DynamicVisit(dynamic obj)
    {
        return Visit(obj);
    }

    private string Visit(string str)
    {
        return "a string was called with value " + str;
    }

    private string Visit(int value)
    {
        return "an int was called with value " + value;
    }
}
It really depends on what you consider a "good idea".
This works, and it works in a fairly elegant manner. It has some advantages and some disadvantages to other approaches.
On the advantage side:
It's concise, and easy to extend
The code is fairly simple
For the disadvantages:
Error checking is potentially more difficult than in a classic visitor implementation, since all error checking must be done at runtime. For example, if you call visitorTest.DynamicVisit(4.2);, you'll get an exception at runtime, but no compile-time complaints.
The code may be less obvious, and have a higher maintenance cost.
Personally, I think this is a reasonable approach. The visitor pattern, in a classic implementation, has a fairly high maintenance cost and is often difficult to test cleanly. This potentially makes the cost slightly higher, but makes the implementation much simpler.
With good error checking, I don't have a problem with using dynamic as an approach here. Personally, I'd probably use an approach like this, since the alternatives that perform in a reasonable manner get pretty nasty.
However, there are a couple of changes I would make here. First, as I mentioned, you really need to include error checking.
Second, I would actually make DynamicVisit take a dynamic directly, which might make it (slightly) more obvious what's happening:
class VistorTest
{
    public string DynamicVisit(dynamic obj)
    {
        try
        {
            return Visit(obj);
        }
        catch (RuntimeBinderException e)
        {
            // Handle the exception here!
            Console.WriteLine("Invalid type specified");
        }
        return string.Empty;
    }

    // ...Rest of code
The visitor pattern exists primarily to work around the fact that some languages do not allow double dispatch and multiple dispatch.
Multiple dispatch or multimethods is the feature of some object-oriented programming languages in which a function or method can be dynamically dispatched based on the run time (dynamic) type of more than one of its arguments. This is an extension of single dispatch polymorphism where a method call is dynamically dispatched based on the actual derived type of the object. Multiple dispatch generalizes the dynamic dispatching to work with a combination of two or more objects.
Until version 4, C# was one of those languages. With the introduction of the dynamic keyword, however, C# allows developers to opt-in to this dispatch mechanism just as you've shown. I don't see anything wrong with using it in this manner.
You haven't changed the type safety at all, because even a switch (or, more likely, a dispatch dictionary, given that C# does not allow switching on type) would have to have a default case that throws when it can't match a function to call, and this will do exactly the same if it can't find a suitable function to bind to.
