Perfectly parallel problems in Rust - multithreading

I'm struggling to see how I could best implement a common concurrency pattern I use in C++.
Pseudocode:
void kernel(inputs, outputs, start, stride) {
    for (size_t i = start; i < length(inputs); i += stride) {
        outputs[i] = process(inputs[i]);
    }
}

void run(inputs, number_of_threads) {
    threads = []
    vector outputs(length(inputs))
    for (int i = 0; i < number_of_threads; i++) {
        threads.push(thread(kernel, inputs, &outputs, i, number_of_threads));
    }
    for t in threads {
        t.join()
    }
    return outputs
}
That is, applying some function to lots of inputs by striding over the input space. It's perfectly parallel, with no race conditions. But with Rust's ownership model, each kernel would need a mutable reference to outputs, so I don't understand how it could work.
What's the safe way to do this simple problem in Rust?

You can use slice::split_at_mut to split the array into disjoint sub-slices, each with its own mutable reference. You can pass the first slice to a scoped thread, then keep splitting the second slice for the remaining threads.
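For concreteness, here is a minimal sketch of that approach using scoped threads (std::thread::scope, stable since Rust 1.63). One thing to note: split_at_mut hands out disjoint contiguous sub-slices, so the work is divided into contiguous chunks rather than strided as in the C++ version; process here is just a placeholder for the real per-element work.

use std::thread;

// Placeholder for the real per-element work.
fn process(x: &u64) -> u64 {
    *x * 2
}

fn run(inputs: &[u64], number_of_threads: usize) -> Vec<u64> {
    let mut outputs = vec![0u64; inputs.len()];
    // Round up so every element lands in some chunk.
    let chunk_len = (inputs.len() + number_of_threads - 1) / number_of_threads;

    thread::scope(|s| {
        let mut rest: &mut [u64] = &mut outputs;
        let mut offset = 0;
        while !rest.is_empty() {
            let take = chunk_len.min(rest.len());
            // split_at_mut yields two disjoint &mut sub-slices, so no two
            // threads ever alias the same elements. mem::take moves the
            // remaining slice out of `rest` so both halves keep the full
            // lifetime of the borrow of `outputs`.
            let (chunk, tail) = std::mem::take(&mut rest).split_at_mut(take);
            let input_chunk = &inputs[offset..offset + take];
            s.spawn(move || {
                for (out, inp) in chunk.iter_mut().zip(input_chunk) {
                    *out = process(inp);
                }
            });
            offset += take;
            rest = tail;
        }
    });

    outputs
}

The same division can be written more directly with outputs.chunks_mut(chunk_len) zipped against inputs.chunks(chunk_len), and the rayon crate wraps this whole pattern in a single parallel-iterator call.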

Related

Peterson's algorithm and deadlock

I am trying to experiment with some mutual exclusion algorithms, and I have implemented Peterson's algorithm. It prints the correct counter value, but sometimes it stalls indefinitely, as if some kind of deadlock had occurred. That should not be possible, since this algorithm is deadlock-free.
PS: Is this related to the compiler-optimization problems often mentioned when discussing the danger of "benign" data races? If so, how can I disable such optimizations?
PPS: When the victim field is stored/loaded atomically, the problem seems to disappear, which makes the compiler's optimizations even more suspicious.
package main

import (
    "fmt"
    "sync"
)

type mutex struct {
    flag   [2]bool
    victim int
}

func (m *mutex) lock(id int) {
    m.flag[id] = true // I'm interested
    m.victim = id     // you can go before me if you want
    for m.flag[1-id] && m.victim == id {
        // spin while the other thread is inside the CS
        // and the victim is me (I expressed my interest after the other one did)
    }
}

func (m *mutex) unlock(id int) {
    m.flag[id] = false // I'm not interested anymore
}

func main() {
    var wg sync.WaitGroup
    var mu mutex
    var cpt, n = 0, 100000
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := 0; j < n; j++ {
                mu.lock(id)
                cpt = cpt + 1
                mu.unlock(id)
            }
        }(i)
    }
    wg.Wait()
    fmt.Println(cpt)
}
There is no "benign" data race. Your program has a data race, and its behavior is undefined.
At the core of the problem is the mutex implementation. Modifications made to a shared object from one goroutine are not necessarily observable by others until the goroutines communicate using one of the synchronization primitives. You write mutex.victim from multiple goroutines without synchronization, so those writes may never be observed. You also read the mutex.flag elements written by other goroutines, and those reads may not see the latest values. That is, the for-loop may never terminate even after the other goroutine changes the variables.
And since the mutex implementation is broken, the updates to cpt will not necessarily be correct either.
To implement this correctly, you need the sync/atomic package.
See the Go Memory Model: https://go.dev/ref/mem
For Peterson's algorithm (the same goes for Dekker's), you need your code to be sequentially consistent. In Go you can achieve that with atomics, which prevent both the compiler and the hardware from reordering the accesses.
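As a concrete illustration, here is a sketch of the same program with every shared field of the lock accessed through sync/atomic (atomic.Bool and atomic.Int32 require Go 1.19+). Go's atomic operations are sequentially consistent, which is what Peterson's algorithm needs. It is still a busy-waiting lock, so treat it as a demonstration rather than a replacement for sync.Mutex:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

type mutex struct {
    flag   [2]atomic.Bool
    victim atomic.Int32
}

func (m *mutex) lock(id int) {
    m.flag[id].Store(true)    // I'm interested
    m.victim.Store(int32(id)) // you can go before me if you want
    for m.flag[1-id].Load() && m.victim.Load() == int32(id) {
        // spin while the other thread is interested and I'm the victim
    }
}

func (m *mutex) unlock(id int) {
    m.flag[id].Store(false) // I'm not interested anymore
}

func main() {
    var wg sync.WaitGroup
    var mu mutex
    var cpt, n = 0, 100000
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := 0; j < n; j++ {
                mu.lock(id)
                cpt = cpt + 1 // now properly protected by the lock
                mu.unlock(id)
            }
        }(i)
    }
    wg.Wait()
    fmt.Println(cpt) // expect 200000
}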

How to join all threads before deleting the ThreadPool

I am using a MultiThreading class which creates the required number of threads in its own threadpool and deletes itself after use.
std::thread *m_pool; // number of threads according to available cores
std::mutex m_locker;
std::condition_variable m_condition;
std::atomic<bool> m_exit;
int m_processors;

m_pool = new std::thread[m_processors + 1];

void func()
{
    // code
}

for (int i = 0; i < m_processors; i++)
{
    m_pool[i] = std::thread(func);
}
void reset(void)
{
    {
        std::lock_guard<std::mutex> lock(m_locker);
        m_exit = true;
    }
    m_condition.notify_all();
    for (int i = 0; i <= m_processors; i++)
        m_pool[i].join();
    delete[] m_pool;
}
After running through all tasks, the for-loop is supposed to join all running threads before delete[] is executed.
But there seems to be one last thread still running while m_pool no longer exists.
This leads to the problem that I can't close my program anymore.
Is there any way to check that all threads are joined, or to wait for all threads to be joined, before deleting the thread pool?
A simple off-by-one bug, I think.
Your creation loop uses i < m_processors, but your join loop uses i <= m_processors, so the two loops don't cover the same entries. The array is allocated with m_processors + 1 elements, yet only the first m_processors of them are ever assigned a running thread. The join loop therefore also calls join() on the last, default-constructed std::thread, and join() on a thread that isn't joinable throws std::system_error.
You likely intended to allocate m_processors entries and join with i < m_processors, assuming the extra slot isn't needed elsewhere.
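A minimal consistent pairing under that assumption:

m_pool = new std::thread[m_processors]; // one entry per worker
// ...
for (int i = 0; i < m_processors; i++)  // join exactly what was created
    m_pool[i].join();
delete[] m_pool;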
The real source of the problem is addressed by Wick's answer. I will extend it with some tips that also solve your problem while improving other aspects of your code.
If you use C++11's std::thread, you shouldn't manage the thread handles with operator new[]. Other C++ constructs make everything simpler and exception-safe (no memory leak if an unexpected exception is thrown).
Store your thread objects in a std::vector. It manages allocation and deallocation for you (no more new and delete). You can use a more flexible container such as std::list if you insert/delete threads dynamically.
Fill the vector in place with std::generate_n or similar:
// needs <algorithm> and <iterator>
std::vector<std::thread> m_pool;
m_pool.reserve(m_processors);
// Fill the vector
std::generate_n(std::back_inserter(m_pool), m_processors,
                [](){ return std::thread(func); });
Join all the elements using a range-based for loop, and let the container clean up the handles:
for (std::thread& t : m_pool) {
    t.join();
}
m_pool.clear();

Allocating a pool of equivalent resources to a group of threads using semaphores

I have a question about a concurrent programming problem.
More specifically, we are working in the shared-memory model (i.e. threads). The problem: given a pool of N equivalent resources, with the constraint that at any instant a resource R can be in use by at most one thread, write a program that allocates these resources on demand to the threads that ask for them. To do this we have to use semaphores. Note that what these resources are and what the threads do with them is out of scope; the focus is on how the resources are managed and allocated. My professor gave us a C/Java-like pseudocode solution for the resource manager class:
class ResourcesManager {
    semaphore mutex = 1;
    semaphore availableResourcesSemaphore = N;
    boolean available[N];
    Resource resources[N];

    public ResourcesManager() {
        for (int i = 0; i < N; i++) {
            available[i] = true;
            resources[i] = new Resource();
        }
    }

    public int acquireResource() {
        int i = 0;
        P(availableResourcesSemaphore);
        P(mutex);
        while (available[i] == false) i++;
        available[i] = false;
        V(mutex);
        return i;
    }

    public void releaseResource(int i) {
        P(mutex);
        available[i] = true; // HERE IS THE PROBLEM
        V(mutex);
        V(availableResourcesSemaphore);
    }
}
While I see that this solution works, there is something I don't get about the releaseResource(int i) method. Why does the line marked with the comment "HERE IS THE PROBLEM",
available[i] = true;
need to be executed in mutual exclusion? I have thought about it, and it looks to me like nothing bad happens if we do otherwise.
What I mean is that while in the original solution we have:
P(mutex);
available[i] = true;
V(mutex);
we can replace these three lines simply with
available[i] = true;
and the solution is still correct.
Now, of course I see that mutual exclusion is needed when operating on the available array in the other method, acquireResource(), and since the instruction
available[i] = true;
operates on the same variable, it is more elegant and conceptually cleaner to execute it in mutual exclusion too. On the other hand, as a beginner in concurrent programming, I don't think it's good to have mutual exclusion where it is not needed.
So am I right (the instruction can be executed without mutual exclusion), or am I missing something, and removing the mutual exclusion causes some issue? One final note on the execution environment: it can be either uniprocessor or multiprocessor, so the solution has to work in both cases. Thanks for your help!

Can there be a deadlock in Bakery Algorithm max() operation?

As many resources state, the Bakery algorithm is supposed to be deadlock-free.
But while trying to understand the pseudocode, I came across a line that, as far as I can tell, could cause a problem.
Referring to the code below, in the lock() function we have a line saying
label[i] = max( label[0], ..., label[n-1] ) + 1;
What if two threads reach that line at the same time? Since max is not atomic, both labels can end up with the same value.
Then, since the two labels have the same value, won't both threads get permission to enter the critical section at the same time? Wouldn't that cause a deadlock?
I've tried my best to explain the problem here; comment if it is still not clear. Thanks.
class Bakery implements Lock {
    volatile boolean[] flag;
    volatile Label[] label;

    public Bakery(int n) {
        flag = new boolean[n];
        label = new Label[n];
        for (int i = 0; i < n; i++) {
            flag[i] = false;
            label[i] = 0;
        }
    }

    public void lock() {
        flag[i] = true;
        label[i] = max(label[0], ..., label[n-1]) + 1;
        while (∃ k != i: flag[k] && (label[i], i) > (label[k], k)) {} // spin
    }

    public void unlock() {
        flag[i] = false;
    }
}
Then, since the two labels have the same value, won't both threads get permission to enter the critical section at the same time? Wouldn't that cause a deadlock?
To begin with, you probably mean a race, not a deadlock.
However, no, there won't be a race here. If you look, there's the condition
(label[i], i) > (label[k], k)
and while it holds, the thread effectively busy-waits.
This means that even if label[i] is the same as label[k] (because both performed the max concurrently), the higher-numbered thread will defer to the lower-numbered one.
(Arguably, this is a weakness of the algorithm, as it inherently prioritizes the threads.)
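To make that tie-break concrete, here is a sketch in Java. It uses AtomicIntegerArray rather than volatile arrays, since volatile on an array field does not make the array's elements volatile; the comparison (label[k], k) < (label[i], i) is written out explicitly. This is an illustration, not the original pseudocode:

import java.util.concurrent.atomic.AtomicIntegerArray;

class BakeryLock {
    private final int n;
    private final AtomicIntegerArray flag;  // 1 = interested
    private final AtomicIntegerArray label;

    BakeryLock(int n) {
        this.n = n;
        this.flag = new AtomicIntegerArray(n);  // all zero initially
        this.label = new AtomicIntegerArray(n);
    }

    void lock(int i) {
        flag.set(i, 1);
        int max = 0;
        for (int k = 0; k < n; k++) max = Math.max(max, label.get(k));
        label.set(i, max + 1); // not atomic as a whole: ties are possible
        for (int k = 0; k < n; k++) {
            if (k == i) continue;
            // Spin while k is interested and (label[k], k) < (label[i], i):
            // on equal labels the lower thread id wins, so exactly one of
            // two tied threads proceeds.
            while (flag.get(k) == 1
                    && (label.get(k) < label.get(i)
                        || (label.get(k) == label.get(i) && k < i))) {
                Thread.onSpinWait();
            }
        }
    }

    void unlock(int i) {
        flag.set(i, 0);
    }
}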

How can I make this prime finder operate in parallel

I know prime finding is well studied and there are a lot of different implementations. My question is: using the provided method (code sample), how can I go about breaking up the work? The machine it will be running on has 4 quad-core hyperthreaded processors and 16 GB of RAM. I realize that there are some improvements that could be made, particularly in the IsPrime method. I also know that problems will occur once the list has more than int.MaxValue items in it. I don't care about any of those improvements. The only thing I care about is how to break up the work.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Prime
{
    class Program
    {
        static List<ulong> primes = new List<ulong>() { 2 };

        static void Main(string[] args)
        {
            ulong reportValue = 10;
            for (ulong possible = 3; possible <= ulong.MaxValue; possible += 2)
            {
                if (possible > reportValue)
                {
                    Console.WriteLine(String.Format("\nThere are {0} primes less than {1}.", primes.Count, reportValue));
                    try
                    {
                        checked
                        {
                            reportValue *= 10;
                        }
                    }
                    catch (OverflowException)
                    {
                        reportValue = ulong.MaxValue;
                    }
                }
                if (IsPrime(possible))
                {
                    primes.Add(possible);
                    Console.Write("\r" + possible);
                }
            }
            Console.WriteLine(primes[primes.Count - 1]);
            Console.ReadLine();
        }

        static bool IsPrime(ulong value)
        {
            foreach (ulong prime in primes)
            {
                if (value % prime == 0) return false;
                if (prime * prime > value) break;
            }
            return true;
        }
    }
}
There are two basic schemes I see: 1) use all threads to test a single number, which is probably great for larger primes, but I can't really think of how to implement it; or 2) use each thread to test a single candidate, which means primes can be found out of order and resources sit idle whenever the next number to be tested is greater than the square of the highest prime found.
To me it feels like both schemes are challenging only in the early stages of building the list of primes, but I'm not entirely sure. This is being done as a personal exercise in breaking up this kind of work.
If you want, you can parallelize both operations: checking one candidate against the list of primes, and checking multiple candidates at once. Though I'm not sure this would help; to be honest, I'd consider removing the threading in main().
I've tried to stay faithful to your algorithm, but to speed it up a lot I've used x*x instead of reportValue; this is something you could easily revert if you wish.
To further improve the per-core splitting, you could estimate the cost of the divisions from the size of the numbers and partition the prime list that way (smaller numbers take less time to divide by, so make the first partitions larger).
Also, my concept of a thread pool may not exist the way I want to use it.
Here's my go at it (pseudo-ish code):
List<int> primes = {2};
List<int> nextPrimes = {};
int cores = 4;

main()
{
    for (int x = 3; x < MAX; x = x*x) {
        int localmax = x*x;
        for (int y = x; y < localmax; y += 2) {
            thread{ primecheck(y); }
        }
        "wait for all threads to be executed"
        primes.add(nextPrimes);
        nextPrimes = {};
    }
}

void primecheck(int y)
{
    bool primality;
    threadpool? pool;
    for (int x = 0; x < cores; x++) {
        pool.add(thread{
            if (!smallcheck(x*primes.length/cores, (x+1)*primes.length/cores, y)) {
                primality = false;
                pool.kill();
            }
        });
    }
    "wait for all threads to be executed or killed"
    if (primality)
        nextPrimes.add(y);
}

bool smallcheck(int a, int b, int y)
{
    foreach (int div in primes[a to b])
        if (y % div == 0)
            return false;
    return true;
}
Edit: I added what I think the pooling should look like; look at the revision history if you want to see it without.
Use the sieve of Eratosthenes instead. It's not worthwhile to parallelize unless you use a good algorithm in the first place.
Separate the space to sieve into large regions and sieve each in its own thread, or better, feed the regions to a work queue.
Use a bit array to represent the prime numbers; it takes less space than representing them explicitly.
See also this answer for a good implementation of a sieve (in Java, without the split into regions).
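To sketch what that plan can look like in C# (the question's language): sieve the base primes up to sqrt(limit) sequentially, then let each thread mark off one large region with its own local bit array. The segment size, Parallel.For, and the counting are choices of this sketch, not prescribed by the answer above.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class SegmentedSieve
{
    // Classic sequential sieve for the base primes up to limit.
    static List<int> BasePrimes(int limit)
    {
        var composite = new BitArray(limit + 1);
        var primes = new List<int>();
        for (int p = 2; p <= limit; p++)
        {
            if (composite[p]) continue;
            primes.Add(p);
            for (long m = (long)p * p; m <= limit; m += p)
                composite[(int)m] = true;
        }
        return primes;
    }

    static void Main()
    {
        const int limit = 100_000_000;
        const int segmentSize = 1 << 20;
        int sqrt = (int)Math.Sqrt(limit);
        var basePrimes = BasePrimes(sqrt);

        int segments = (limit - sqrt + segmentSize - 1) / segmentSize;
        long total = basePrimes.Count; // primes up to sqrt(limit)

        // Each region [low, high] gets its own BitArray, so the
        // worker threads share no mutable state.
        Parallel.For(0, segments, () => 0L, (s, state, subtotal) =>
        {
            long low = sqrt + 1 + (long)s * segmentSize;
            long high = Math.Min(low + segmentSize - 1, limit);
            var composite = new BitArray((int)(high - low + 1));
            foreach (int p in basePrimes)
            {
                // First multiple of p inside [low, high].
                long start = Math.Max((long)p * p, (low + p - 1) / p * p);
                for (long m = start; m <= high; m += p)
                    composite[(int)(m - low)] = true;
            }
            for (int i = 0; i < composite.Length; i++)
                if (!composite[i]) subtotal++;
            return subtotal;
        }, subtotal => Interlocked.Add(ref total, subtotal));

        Console.WriteLine(total + " primes up to " + limit);
    }
}

Besides enabling the parallel split, the per-segment bit array keeps each worker's working set small enough to stay cache-resident, which is where much of the speedup comes from in practice.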
