I'm porting my C++ chess engine to Rust. I have a big hash table shared between search threads, and in the C++ version this table is lock-free: there is no mutex guarding read/write access. Here is the theory, if you are interested.
The Rust version of this code works fine, but uses a Mutex:
let shared_hash = Arc::new(Mutex::new(new_hash()));
for _ in 0..n_cpu {
    println!("start thread");
    let my_hash = shared_hash.clone();
    thread_pool.push(thread::spawn(move || {
        let mut my_hash = my_hash.lock().unwrap();
        let mut search_engine = SearchEngine::new();
        search_engine.search(&mut my_hash);
    }));
}
for i in thread_pool {
    let _ = i.join();
}
How could I share the table between threads without a mutex?
Quite simply, actually: the Mutex is unnecessary if the underlying structure is already Sync.
In your case, an array of structs of atomics, for example, would work. You can find Rust's available atomics here.
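For example, here is a sketch of what such a table entry might look like (the fields are illustrative, not taken from your engine):

use std::sync::atomic::{AtomicU16, AtomicU64};

// A hypothetical table entry built entirely from atomics. Because every
// field is an atomic, Entry is Sync, and so is a Vec or array of them.
struct Entry {
    key: AtomicU64,
    score: AtomicU16,
}

type Table = Vec<Entry>;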
Data races are undefined behavior in both C++ and Rust. Just Say No.
The right way is to build your table out of atomic integers. It isn't rocket science, but you do have to decide, case by case, how much you care about the order of memory operations. This does clutter up your code:
// non-atomic array access
table[h] = 0;
// atomic array access
table[h].store(0, Ordering::SeqCst);
But it's worth it.
There's no telling what the performance penalty will be -- you just have to try it out.
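Putting the two answers together, here is a minimal sketch of sharing such a table without a Mutex (new_hash, the table size, the index, and the Relaxed ordering are illustrative assumptions, not taken from your engine):

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Placeholder table constructor: a Vec of atomics is already Sync.
fn new_hash(size: usize) -> Vec<AtomicU64> {
    (0..size).map(|_| AtomicU64::new(0)).collect()
}

fn main() {
    let shared_hash = Arc::new(new_hash(1 << 20)); // no Mutex around it
    let mut thread_pool = Vec::new();
    for _ in 0..4 {
        let my_hash = Arc::clone(&shared_hash);
        thread_pool.push(thread::spawn(move || {
            let h = 12345 % my_hash.len(); // placeholder index computation
            my_hash[h].store(42, Ordering::Relaxed);
            let _entry = my_hash[h].load(Ordering::Relaxed);
        }));
    }
    for t in thread_pool {
        let _ = t.join();
    }
}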
I'm new to Rust. I'm trying to create a static variable DATA of type Vec<u8> in a library so that it is initialized when the lib is compiled. I then include the lib in the main code, hoping to use DATA directly without calling init_data() again. Here's what I've tried:
my_lib.rs:
use lazy_static::lazy_static;

pub fn init_data() -> Vec<u8> {
    // some expensive calculations
}

lazy_static! {
    pub static ref DATA: Vec<u8> = init_data(); // supposed to call init_data() only once during compilation
}
main.rs:
use my_lib::DATA;
call1(&DATA); // use DATA here without calling init_data()
call2(&DATA);
But it turned out that init_data() is still called in the main.rs. What's wrong with this code?
Update: as Ivan C pointed out, lazy_static is not run at compile time. So, what's the right choice for 'pre-loading' the data?
There are two problems here: the choice of type, and performing the allocation.
It is not possible to construct a Vec, a Box, or any other type that requires heap allocation at compile time, because the heap allocator and the heap do not yet exist at that point. Instead, you must use a reference type, which can point to data allocated in the binary rather than in the run-time heap, or an array without any reference (if the data is not too large).
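For example, both of these compile, because neither requires the heap:

static BYTES_REF: &'static [u8] = &[1, 2, 3]; // reference to data stored in the binary
static BYTES_ARR: [u8; 3] = [1, 2, 3];        // plain array, no reference at all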
Next, we need a way to perform the computation. Theoretically, the cleanest option is constant evaluation — straightforwardly executing parts of your code at compile time.
static DATA: &'static [u8] = {
    // code goes here
};
However, in current stable Rust versions (1.58.1 as I'm writing this), constant evaluation is very limited, because you cannot do anything that looks like dropping a value, or use any function belonging to a trait. It can still do some things, mostly integer arithmetic or constructing other "almost literal" data; for example:
const N: usize = 10;

static FIRST_N_FIBONACCI: &'static [u32; N] = &{
    let mut array = [0; N];
    array[1] = 1;
    let mut i = 2;
    while i < array.len() {
        array[i] = array[i - 1] + array[i - 2];
        i += 1;
    }
    array
};

fn main() {
    dbg!(FIRST_N_FIBONACCI);
}
If your computation cannot be expressed using const evaluation, then you will need to perform it another way:
Procedural macros are effectively compiler plugins, and they can perform arbitrary computation, but their output is generated Rust syntax. So, a procedural macro could produce an array literal with the precomputed data.
The main limitation of procedural macros is that they must be defined in dedicated crates (so if your project is one library crate, it would now be two instead).
Build scripts are ordinary Rust code which can compile or generate files used by the main compilation. They don't interact with the compiler, but are run by Cargo before compilation starts.
(Unlike const evaluation, neither build scripts nor proc macros can use the types or constants defined in the crate being built itself; they can read its source code, but they run too early to use other items of the crate in their own code.)
In your case, because you want to precompute some [u8] data, I think the simplest approach would be to add a build script which writes the data to a file, after which your normal code can embed this data from the file using include_bytes!.
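As a sketch of that approach (the file name and the stand-in computation are illustrative, not from the question):

// build.rs - runs before the crate itself is compiled.
use std::env;
use std::fs;
use std::path::Path;

fn main() {
    // Stand-in for the expensive calculations in init_data().
    let data: Vec<u8> = (0u8..=255).map(|b| b.wrapping_mul(31)).collect();
    let out_dir = env::var("OUT_DIR").unwrap();
    fs::write(Path::new(&out_dir).join("data.bin"), &data).unwrap();
}

// main.rs - embeds the precomputed bytes into the binary at compile time.
static DATA: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/data.bin"));

fn main() {
    assert_eq!(DATA.len(), 256);
}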
How can I mutate the variable i inside the closure? Race conditions are considered to be acceptable.
use rayon::prelude::*;

fn main() {
    let mut i = 0;
    let mut closure = |_| {
        i = i + 1;
    };
    (0..100).into_par_iter().for_each(closure);
}
This code fails with:
error[E0525]: expected a closure that implements the `Fn` trait, but this closure only implements `FnMut`
--> src\main.rs:6:23
|
6 | let mut closure = |_| {
| ^^^ this closure implements `FnMut`, not `Fn`
7 | i = i + 1;
| - closure is `FnMut` because it mutates the variable `i` here
...
10 | (0..100).into_par_iter().for_each(closure);
| -------- the requirement to implement `Fn` derives from here
There is a difference between a race condition and a data race.
A race condition is any situation when the outcome of two or more events depends on which one happens first, and nothing enforces a relative ordering between them. This can be fine, and as long as all possible orderings are acceptable, you may accept that your code has a race in it.
A data race is a specific kind of race condition where the events are unsynchronized accesses to the same memory and at least one of them is a mutation. Data races are undefined behavior. You cannot "accept" a data race, because its existence invalidates the entire program: a program with an unavoidable data race has no defined behavior at all, so it does nothing useful.
Here's a version of your code that has a race condition, but not a data race:
use rayon::prelude::*;
use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let i = AtomicI32::new(0);
    let closure = |_| {
        i.store(i.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
    };
    (0..100).into_par_iter().for_each(closure);
    println!("{}", i.into_inner()); // final value is indeterminate (see below)
}
Because the loads and stores are not ordered with respect to the concurrently executing threads, there is no guarantee that the final value of i will be exactly 100. It could be 99, or 72, or 41, or even 1. This code has indeterminate, but defined behavior because although you don't know the exact order of events or the final outcome, you can still reason about its behavior. In this case, you can prove that the final value of i must be at least 1 and no greater than 100.
Note that in order to write this racy code, I still had to use AtomicI32 and atomic load and store. Not caring about the order of events in different threads doesn't free you from having to think about synchronizing memory access.
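For contrast, here is a hedged sketch where the separate load and store are replaced by a single atomic read-modify-write; this removes even the race condition on the final value:

use rayon::prelude::*;
use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let i = AtomicI32::new(0);
    (0..100).into_par_iter().for_each(|_| {
        // One indivisible increment per iteration: no update can be lost.
        i.fetch_add(1, Ordering::Relaxed);
    });
    println!("{}", i.into_inner()); // always 100
}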
If your original code compiled, it would have a data race.¹ This means there are no guarantees about its behavior at all. So, assuming you actually accept data races, here's a version of your code that is consistent with what a compiler is allowed to do with it:
fn main() {}
Oh, right, undefined behavior must never occur. So this hypothetical compiler just deleted all your code because it is never allowed to run in the first place.
It's actually even worse than that. Suppose you had written something like this:
fn main() {
    let mut i = 0;
    let mut closure = |_| {
        i = i + 1;
    };
    (0..100).into_par_iter().for_each(closure);
    if i < 100 || i >= 100 {
        println!("this should always print");
    } else {
        println!("this should never print");
    }
}
What should this code print? If there are no data races, this code must emit the following:
this should always print
But if we allow data races, it might also print this:
this should never print
Or it could even print this:
this should never print
this should always print
If you think there is no way it could do the last thing, you are wrong. Undefined behavior in a program cannot be accepted, because it invalidates analysis even of correct code that has nothing obvious to do with the original error.
How likely is any of this to happen if you just use unsafe and ignore the possibility of a data race? Probably not very likely, to be honest. If you use unsafe to bypass the checks and inspect the generated assembly, it may well even be correct. But the only way to be sure is to write in assembly language directly, and understand and code to the machine's model; if you want to use Rust, you have to code to Rust's model, even if that means you lose a little performance.
How much performance? Probably not much, if anything. Atomic operations are very efficient and on many architectures, including the one you're probably using right now to read this, they actually are exactly as fast as non-atomic operations in cases like this. If you really want to know how much potential performance you lose, write both versions and benchmark them, or simply compare the assembly code with and without atomic operations.
¹ Technically, we can't say that a data race must occur, because it depends on whether any threads actually access i at the same time or not. If for_each decided for some reason to run all the closures on the same OS thread, for example, this code would not have a data race. But the fact that it may have a data race still poisons our analysis because we can't be sure it doesn't.
You cannot do exactly that; you need to ensure that some safe synchronisation happens in the layers underneath, for example an Arc combined with some kind of atomic operations.
You have some examples in the documentation:
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    let val = Arc::new(AtomicUsize::new(5));
    for _ in 0..10 {
        let val = Arc::clone(&val);
        thread::spawn(move || {
            let v = val.fetch_add(1, Ordering::SeqCst);
            println!("{:?}", v);
        });
    }
}
Playground
(As Adien4 points out, there is no need for the Arc or the move in the second example: Rayon only requires the closure to be Send + Sync.)
Which leads us to your example, which can be adapted as:
use std::sync::atomic::{AtomicUsize, Ordering};
use rayon::prelude::*;

fn main() {
    let i = AtomicUsize::new(5);
    let closure = |_| {
        i.fetch_add(1, Ordering::SeqCst);
    };
    (0..100).into_par_iter().for_each(closure);
}
Playground
This is not possible, as it would require parallel mutable access to i, which causes data races. You can use a Mutex to allow safe access from multiple threads.
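A minimal sketch of the Mutex approach (note that the lock serializes the increments):

use rayon::prelude::*;
use std::sync::Mutex;

fn main() {
    let i = Mutex::new(0);
    (0..100).into_par_iter().for_each(|_| {
        // Each iteration locks, increments, and unlocks.
        *i.lock().unwrap() += 1;
    });
    println!("{}", i.into_inner().unwrap()); // always 100
}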
The accepted answer explains the situation thoroughly - you definitely don't want data races in your code, because they are undefined behavior and distinct from the more general "race conditions". Nor do you need data races to update shared data; there are better, more efficient ways to do that. But to satisfy curiosity, this answer attempts to answer the question as literally asked - if you were reckless enough to intentionally ignore data races and incur undefined behavior at your own peril, could you do it in unsafe Rust?
You indeed can. Code and discussion in this answer is provided for educational purposes, such as to check what kind of code the compiler generates. If code that intentionally incurs UB offends you, please stop reading here. You've been warned. :)
The obvious way to convince Rust to allow this data race is to create a raw pointer to i, send the pointer to the closure, and dereference it to mutate i. This dereference is unsafe because it leaves it to the programmer to ensure that no mutable references exist simultaneously and that writes to the underlying data are synchronized with other accesses to it. While we can easily ensure the former by just not creating a reference, we obviously won't ensure the latter:
use rayon::prelude::*;

// Must wrap the raw pointer in a type that implements Sync.
struct Wrap(*mut i32);
unsafe impl Sync for Wrap {}

// Contains undefined behavior - don't use this!
fn main() {
    let mut i = 0;
    let i_ptr = Wrap(&mut i as *mut i32);
    let closure = |_| {
        unsafe { *i_ptr.0 = *i_ptr.0 + 1 }; // XXX: UB!
    };
    (0..100).into_par_iter().for_each(closure);
    println!("{}", i);
}
Playground
Note that pointers implement neither Sync nor Send, so they require a wrapper to be used in threads. The wrapper unsafely implements Sync, but this unsafe is actually not UB - accessing the pointer itself is safe, and there would be no UB if we, say, only printed it, or even dereferenced it for reading (as long as no one else writes to i). Writing through the dereferenced pointer is where we create UB, and that itself requires unsafe.
While this is the kind of code the OP might have been after (it even prints 100 when run), it is of course still undefined behavior, and could break on different hardware or after an upgrade to a different compiler. Even a slight change to the code, such as creating a mutable reference with let i_ref = unsafe { &mut *i_ptr.0 } and updating it with *i_ref += 1, can change its behavior.
In the context of C++11 Hans Boehm wrote an entire article on the danger of so-called "benign" data races, and why they cannot be allowed in the C++ memory model (which Rust shares).
Does into_inner() return all the relaxed writes in this example program? If so, which concept guarantees this?
extern crate crossbeam;

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let thread_count = 10;
    let increments_per_thread = 100000;
    let i = AtomicUsize::new(0);

    crossbeam::scope(|scope| {
        for _ in 0..thread_count {
            scope.spawn(|| {
                for _ in 0..increments_per_thread {
                    i.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    println!(
        "Result of {}*{} increments: {}",
        thread_count,
        increments_per_thread,
        i.into_inner()
    );
}
(https://play.rust-lang.org/?gist=96f49f8eb31a6788b970cf20ec94f800&version=stable)
I understand that crossbeam guarantees that all threads are finished and since the ownership goes back to the main thread, I also understand that there will be no outstanding borrows, but the way I see it, there could still be outstanding pending writes, if not on the CPUs, then in the caches.
Which concept guarantees that all writes are finished and all caches are synced back to the main thread when into_inner() is called? Is it possible to lose writes?
Does into_inner() return all the relaxed writes in this example program? If so, which concept guarantees this?
It's not into_inner that guarantees it, it's join.
What into_inner guarantees is that either some synchronization has been performed since the final concurrent write (join of thread, last Arc having been dropped and unwrapped with try_unwrap, etc.), or the atomic was never sent to another thread in the first place. Either case is sufficient to make the read data-race-free.
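For illustration, here is a hedged sketch combining two of those synchronization points (join, then Arc::try_unwrap); the names are mine, not from the question:

use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    let i = Arc::new(AtomicUsize::new(0));
    let handle = {
        let i = Arc::clone(&i);
        thread::spawn(move || {
            i.fetch_add(1, Ordering::Relaxed);
        })
    };
    handle.join().unwrap(); // synchronizes with the thread's completion
    // Only one Arc is left, so try_unwrap succeeds and the read is data-race-free.
    let value = Arc::try_unwrap(i).unwrap().into_inner();
    assert_eq!(value, 1);
}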
Crossbeam documentation is explicit about using join at the end of a scope:
This [the thread being guaranteed to terminate] is ensured by having the parent thread join on the child thread before the scope exits.
Regarding losing writes:
Which concept guarantees that all writes are finished and all caches are synced back to the main thread when into_inner() is called? Is it possible to lose writes?
As stated in various places in the documentation, Rust inherits the C++ memory model for atomics. In C++11 and later, the completion of a thread synchronizes with the corresponding successful return from join. This means that by the time join completes, all actions performed by the joined thread must be visible to the thread that called join, so it is not possible to lose writes in this scenario.
In terms of atomics, you can think of a join as an acquire read of an atomic that the thread performed a release store on just before it finished executing.
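A rough sketch of that mental model with explicit atomics (the spin loop stands in for what join gives you for free):

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DONE: AtomicBool = AtomicBool::new(false);
static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    thread::spawn(|| {
        COUNTER.fetch_add(1, Ordering::Relaxed);
        DONE.store(true, Ordering::Release); // like the thread completing
    });
    while !DONE.load(Ordering::Acquire) {} // like a successful join
    // The release/acquire pair guarantees the relaxed write is visible here.
    assert_eq!(COUNTER.load(Ordering::Relaxed), 1);
}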
I will include this answer as a potential complement to the other two.
The kind of inconsistency that was mentioned, namely whether some writes could be missing before the final reading of the counter, is not possible here. It would have been undefined behaviour if writes to a value could be postponed until after its consumption with into_inner. However, there are no unexpected race conditions in this program, even without the counter being consumed with into_inner, and even without the help of crossbeam scopes.
Let us write a new version of the program without crossbeam scopes and where the counter is not consumed (Playground):
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    let thread_count = 10;
    let increments_per_thread = 100000;
    let i = Arc::new(AtomicUsize::new(0));

    let threads: Vec<_> = (0..thread_count)
        .map(|_| {
            let i = i.clone();
            thread::spawn(move || for _ in 0..increments_per_thread {
                i.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();

    for t in threads {
        t.join().unwrap();
    }

    println!(
        "Result of {}*{} increments: {}",
        thread_count,
        increments_per_thread,
        i.load(Ordering::Relaxed)
    );
}
This version still works pretty well! Why? Because a synchronizes-with relation is established between the ending thread and its corresponding join. And so, as is well explained in a separate answer, all actions performed by the joined thread must be visible to the caller thread.
One could probably also wonder whether even the relaxed memory ordering constraint is sufficient to guarantee that the full program behaves as expected. This part is addressed by the Rust Nomicon, emphasis mine:
Relaxed accesses are the absolute weakest. They can be freely re-ordered and provide no happens-before relationship. Still, relaxed operations are still atomic. That is, they don't count as data accesses and any read-modify-write operations done to them occur atomically. Relaxed operations are appropriate for things that you definitely want to happen, but don't particularly otherwise care about. For instance, incrementing a counter can be safely done by multiple threads using a relaxed fetch_add if you're not using the counter to synchronize any other accesses.
The mentioned use case is exactly what we are doing here. Each thread is not required to observe the incremented counter in order to make decisions, and yet all operations are atomic. In the end, the thread joins synchronize with the main thread, thus implying a happens-before relation, and guaranteeing that the operations are made visible there. As Rust adopts the same memory model as C++11's (this is implemented by LLVM internally), we can see regarding the C++ std::thread::join function that "The completion of the thread identified by *this synchronizes with the corresponding successful return". In fact, the very same example in C++ is available in cppreference.com as part of the explanation on the relaxed memory order constraint:
#include <vector>
#include <iostream>
#include <thread>
#include <atomic>

std::atomic<int> cnt = {0};

void f()
{
    for (int n = 0; n < 1000; ++n) {
        cnt.fetch_add(1, std::memory_order_relaxed);
    }
}

int main()
{
    std::vector<std::thread> v;
    for (int n = 0; n < 10; ++n) {
        v.emplace_back(f);
    }
    for (auto& t : v) {
        t.join();
    }
    std::cout << "Final counter value is " << cnt << '\n';
}
The fact that you can call into_inner (which consumes the AtomicUsize) means that there are no more borrows on that backing storage.
Each fetch_add is an atomic operation with Relaxed ordering, so once the threads are complete there shouldn't be anything left that changes it (if there were, it would be a bug in crossbeam).
See the description of into_inner for more info.
I am a newbie to Rust, and I want to sum a large quantity of numbers using concurrency. I found this code:
use std::thread;
use std::sync::{Arc, Mutex};

static NTHREAD: usize = 10;

fn main() {
    let mut threads = Vec::new();
    let x = 0;
    // A thread-safe, sharable mutex object
    let data = Arc::new(Mutex::new(x));
    for i in 1..(NTHREAD + 1) {
        // Increment the count of the mutex
        let mutex = data.clone();
        threads.push(thread::spawn(move || {
            // Lock the mutex
            let n = mutex.lock();
            match n {
                Ok(mut n) => *n += i,
                Err(str) => println!("{}", str),
            }
        }));
    }
    // Wait for all threads to end
    for thread in threads {
        let _ = thread.join().unwrap();
    }
    assert_eq!(*data.lock().unwrap(), 55);
}
This works when there are 10 threads, but does not work when there are more than 20.
I think it should be fine with any number of threads.
Do I misunderstand something? Is there another way to sum from 1 to 1000000 with concurrency?
There are several problems with the provided code.
thread::spawn creates an OS-level thread, which means the existing code cannot possibly scale to numbers up to a million as indicated in the title. That would require a million threads running in parallel, where typical modern OSes support at best a few thousand threads. More constrained environments, such as embedded systems or virtual/paravirtual machines, allow far fewer; for example, the Rust playground appears to allow a maximum of 24 concurrent threads. Instead, one needs to create a fixed small number of threads and carefully divide the work among them.
The function executing in each thread runs inside a lock, which effectively serializes the work done by the threads. Even if one could spawn arbitrarily many threads, the loop as written would execute no faster than what would be achieved by a single thread - and in practice it would be orders of magnitude slower because it would spend a lot of time on locking/unlocking of a heavily contended mutex.
One good way to approach this kind of problem while still managing threads manually is provided in the comment by Boiethios: if you have 4 threads, just sum 1..250k, 250k..500k, etc. in each thread, and then sum the results returned by the threads.
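A sketch of that manual division of work (the thread count and ranges are illustrative):

use std::thread;

fn main() {
    const N: u64 = 1_000_000;
    const NTHREADS: u64 = 4;
    let chunk = N / NTHREADS; // assumes NTHREADS divides N evenly

    let handles: Vec<_> = (0..NTHREADS)
        .map(|t| {
            let start = t * chunk + 1;
            let end = (t + 1) * chunk;
            // Each thread sums its own sub-range; no shared state, no locks.
            thread::spawn(move || (start..=end).sum::<u64>())
        })
        .collect();

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, N * (N + 1) / 2);
}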
Or is there another way to sum up from 1 to 1000000 with concurrency?
I would recommend using a higher-level library that encapsulates creation/pooling of worker threads and division of work among them. Rayon is an excellent one, providing a "parallel iteration" facility, which works like iteration, but automatically dividing up the work among multiple cores. Using Rayon, parallel summing of integers would look like this:
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let sum: usize = (1..1000001).collect::<Vec<_>>().par_iter().sum();
    assert_eq!(sum, 500000500000);
}
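As a side note, rayon can also parallelize integer ranges directly, which avoids allocating the intermediate Vec (assuming a rayon version that implements parallel iteration for integer ranges):

use rayon::prelude::*;

fn main() {
    // The range itself becomes a parallel iterator; no Vec is needed.
    let sum: u64 = (1..1_000_001u64).into_par_iter().sum();
    assert_eq!(sum, 500_000_500_000);
}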
Editor's note: This code is from a version of Rust prior to 1.0 and is not syntactically or semantically valid Rust 1.0 code.
So, I'm scoping out shared box pointers as a purely academic learning exercise.
#[feature(managed_boxes)];

struct Monster {
    legs: int
}

fn main() {
    let mut steve = @Monster{ legs: 2 };
    steve.legs = 8;
}
I'm a little surprised to be getting this compiler error:
shared_box.rs:10:5: 10:15 error: cannot assign to immutable field
shared_box.rs:10 steve.legs = 8;
What gives?
The error goes away if I switch to an Owned Box pointer. Is this some kind of restriction on managed pointer access?
You can't.
@ is immutable.
Managed boxes are being steadily destroyed, so you shouldn't use them.
@mut has been removed from the language.
There is, however, a way of getting around this: RefCell. If you wrap an object in it then you can modify it even though it appears to be immutable. This is sometimes useful, but where possible you should avoid it. Here's an example of using it (with Gc; you should probably tend to use Rc at present instead, because Gc is not properly implemented):
let steve = box(GC) RefCell::new(Monster { legs: 2 });
steve.borrow().borrow_mut().get().legs = 8;
assert_eq!(steve.borrow().borrow().get().legs, 8);
It's not pretty; smart pointer traits may well improve the situation. But where possible, avoid such things. Immutable data is good, task-local data is good.
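For comparison, the same idea in modern (post-1.0) Rust would use Rc<RefCell<T>>, as the answer itself hints; a minimal sketch:

use std::cell::RefCell;
use std::rc::Rc;

struct Monster {
    legs: i32,
}

fn main() {
    // Rc provides shared ownership; RefCell defers the mutability check to run time.
    let steve = Rc::new(RefCell::new(Monster { legs: 2 }));
    steve.borrow_mut().legs = 8;
    assert_eq!(steve.borrow().legs, 8);
}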