Loading a counter with Relaxed ordering - rust

In a recent issue of RustMagazine, the author shared the snippet below:
async fn update_metadata(file: &mut File, counter: AtomicU64) -> Result<()> {
    let next_number = counter.load(Ordering::Relaxed) + 1;
    persist_number(file, next_number).await?;
    counter.fetch_add(1, Ordering::Relaxed);
}
Shouldn't the first line use Acquire instead of Relaxed ordering? If another thread calls fetch_update, there is no guarantee that counter.load would return the latest value from that thread (as it is using Relaxed ordering):
counter.load(Ordering::Relaxed) + 1;
Also, is it correct to say that fetch_add establishes ordering on counter, so it always gets the latest atomic value even with Relaxed ordering?

Related

Atomic wrappers vs primitives

I'm trying to understand a few differences between the std::sync::atomic::Atomic* structs and primitives such as i32, usize, and bool in the context of multithreading.
First question: will another thread see changes made to a non-atomic type by a different thread?
fn main() {
    let mut counter = 0;
    std::thread::scope(|scope| {
        scope.spawn(|| counter += 1)
    });
    println!("{counter}");
}
Can I be sure that counter will be 1 right after the other thread writes that value into it, or could the thread cache the value? If not, will it only work with an atomic type?
use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let counter = AtomicI32::new(0);
    std::thread::scope(|scope| {
        scope.spawn(|| counter.store(1, Ordering::Release))
    });
    println!("{}", counter.load(Ordering::Acquire)); // Ordering::Acquire to prevent reordering of previous instructions
}
Second question: does the Ordering type affect when the value from a store will be visible in other threads, or will it be visible right after the store, even if Ordering::Relaxed was used? For example, will the same code, but with Ordering::Relaxed and no instruction reordering, show 1 in counter?
use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let counter = AtomicI32::new(0);
    std::thread::scope(|scope| {
        scope.spawn(|| counter.store(1, Ordering::Relaxed))
    });
    println!("{}", counter.load(Ordering::Relaxed));
}
I understand the difference between atomic and non-atomic writes to the same variable; I'm only interested in whether another thread will see the changes, even if those changes aren't consistent.
First question: will another thread see changes made to a non-atomic type by a different thread?
Yes. The difference between atomic and non-atomic variables is that you can change atomic variables using shared references, &AtomicX, and not just using mutable references, &mut X. This means that they can be changed in parallel from different threads. For primitives, the compiler will reject any attempt to do that, e.g.:
fn main() {
    let mut counter = 0;
    std::thread::scope(|scope| {
        scope.spawn(|| counter += 1);
        scope.spawn(|| counter += 1);
    });
    println!("{counter}");
}
Or even the following, where we use the variable on the main thread but before the spawned thread is joined:
fn main() {
    let mut counter = 0;
    std::thread::scope(|scope| {
        scope.spawn(|| counter += 1);
        counter += 1;
    });
    println!("{counter}");
}
While with atomics this will work:
use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let counter = AtomicI32::new(0);
    std::thread::scope(|scope| {
        scope.spawn(|| counter.store(1, Ordering::Relaxed));
        scope.spawn(|| counter.store(1, Ordering::Relaxed));
    });
    println!("{}", counter.load(Ordering::Relaxed));
}
Second question: does the Ordering type affect when the value from a store will be visible in other threads, or will it be visible right after the store, even if Ordering::Relaxed was used? For example, will the same code, but with Ordering::Relaxed and no instruction reordering, show 1 in counter?
No. Ordering does not change what other threads will observe with this variable. And therefore, your usage of Release and Acquire is wrong.
On the other hand, Relaxed here will suffice, for other reasons.
You are guaranteed to see the value 1 in your code no matter what ordering you use, because std::thread::scope() implicitly joins all spawned threads on exit, and joining a thread forms a happens-before relationship between everything done in that thread and the code after the join. In other words, you are guaranteed that everything done in the thread (and that includes storing to counter) will happen before everything you do after you join it (and that includes reading counter).
If there was not a join, for example, in this code:
use std::sync::atomic::{AtomicI32, Ordering};

fn main() {
    let counter = AtomicI32::new(0);
    std::thread::scope(|scope| {
        scope.spawn(|| counter.store(1, Ordering::Release));
        scope.spawn(|| println!("{}", counter.load(Ordering::Acquire)));
    });
}
Then, despite the Release and Acquire orderings, you are not guaranteed to read the updated value: you may read the new value, or you may read the old one.
Orderings are useful for creating happens-before relationships involving other variables and code. But this is a complicated subject; I recommend reading this book (written by a Rust libs team member).
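As a minimal sketch of that idea (the DATA and READY statics below are made up for the example): if the reader's Acquire load observes the writer's Release store to READY, it is also guaranteed to see the earlier Relaxed store to DATA, even though DATA is a different variable:
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // write the payload
        READY.store(true, Ordering::Release); // publish it
    });
    let reader = thread::spawn(|| {
        // Spin until the Acquire load actually observes the Release store...
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // ...at which point the earlier write to DATA is guaranteed to be visible.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    writer.join().unwrap();
    reader.join().unwrap();
}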

Rust deadlock with shared struct: Arc + channel + atomic

I'm new to Rust and was trying to generate plenty of JSON data on the fly for a project, but I'm having deadlocks.
I've tried removing the serialization (serde_json) and sending the HashMaps through the channel instead, but I still get deadlocks on my computer. If, however, I comment out the send(generator.next()) line and send a string myself, the code works flawlessly, so the deadlock is caused by my DatasetGenerator, but I don't understand why.
Code summary:
Have a DatasetGenerator object that can generate sequences of "events" and serialize them to JSON.
generator.next() works like an "iterator" - It increments an internal atomic counter in the generator and then generates the i-th item in the sequence + serializes the JSON.
Have a generator threadpool generate these JSONs at high throughput (very large payloads each)
Send these JSONs through a channel to another thread (which will send them over the network, but that's irrelevant for this question)
Depending on whether I call tx_ref.send(generator_ref.next()) or tx_ref.send(some_new_string) below, my code deadlocks or succeeds:
src/main.rs:
extern crate threads_pool;

use threads_pool::*;

mod generator;

use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

fn main() {
    // N will be an argument, and a very high number. For tests use this:
    const N: i64 = 12; // Increase this if you're not getting the deadlock yet, or run cargo run again until it happens.

    let (tx, rx) = mpsc::channel();
    let tx_producer = tx.clone();

    let producer_thread = thread::spawn(move || {
        let pool = ThreadPool::new(4);
        let generator = Arc::new(generator::data_generator::DatasetGenerator::new(3000));
        for i in 0..N {
            println!("Generating #{}", i);
            let tx_ref = tx_producer.clone();
            let generator_ref = generator.clone();
            pool.execute(move || {
                ////////// v !!!DEADLOCK HERE!!! v //////////
                tx_ref.send(generator_ref.next()).expect("tx failed."); // This locks!
                //tx_ref.send(format!(" {} ", i)).expect("tx failed."); // This works!
                ////////// ^ !!!DEADLOCK HERE!!! ^ //////////
            })
            .unwrap();
        }

        println!("Generator done!");
    });

    println!("-» Consumer consuming!");
    for j in 0..N {
        let s = rx.recv().expect("rx failed");
        println!("-» Consumed #{}: {} ... ", j, &s[..10]);
    }
    println!("Consumer done!!");

    producer_thread.join().unwrap();
    println!("Success. Exit!");
}
This is my DatasetGenerator, which seems to be causing all the trouble (even without serde, outputting the HashMaps directly still gives deadlocks). src/generator/dataset_generator.rs:
use serde_json::Value;
use std::collections::HashMap;
use std::sync::atomic;

pub struct DatasetGenerator {
    num_features: usize,
    pub counter: atomic::AtomicI64,
    feature_names: Vec<String>,
}

type Datapoint = HashMap<String, Value>;
type Out = String;

impl DatasetGenerator {
    pub fn new(num_features: usize) -> DatasetGenerator {
        let mut feature_names = Vec::new();
        for i in 0..num_features {
            feature_names.push(format!("f_{}", i));
        }
        DatasetGenerator {
            num_features,
            counter: atomic::AtomicI64::new(0),
            feature_names,
        }
    }

    /// Generates the next item in the sequence (iterator-like).
    pub fn next(&self) -> Out {
        let value = self.counter.fetch_add(1, atomic::Ordering::SeqCst);
        self.gen(value)
    }

    /// Generates the ith item in the sequence. DEADLOCKS!!! ///////////////////////////
    pub fn gen(&self, ith: i64) -> Out {
        let mut data = Datapoint::with_capacity(self.num_features);
        for f in 0..self.num_features {
            let name = self.feature_names.get(f).unwrap();
            data.insert(name.to_string(), Value::from(ith));
        }
        serde_json::json!(data).to_string() // Tried without serialization and still deadlocks!
    }
}
Commit with deadlock code is here if you want to try out yourself with cargo run: https://github.com/AlbertoEAF/learn-rust/tree/dc5fa867e5a70b605553ef65796fdc9dd42d38a0/rest-injector
The deadlock occurs on Windows with Rust 1.60.0.
Thank you for the help! it's greatly appreciated :)
Update
I've followed the suggestions from @kmdreko's answer below, and apparently the problem is in the generator: not all the items are generated. Even though pool.execute() is called N times, only a random number of closures c < N are executed, even if I place pool.close() before leaving the producer_thread. Why does that happen, and how can it be fixed?
Fix: it turns out this lockup is caused by the threads_pool library (0.2.6). I switched the thread pool to rayon's and it worked smoothly on the first try.
One thing you should change: an mpsc::Receiver will return an error on .recv() if it can no longer possibly yield a result, which it detects by noticing that all the associated mpsc::Senders have been dropped; that is a good indicator that all the work is done. Your tx_refs and even tx_producer will be dropped when their respective tasks/threads complete, but you still have tx in scope, which could theoretically still send a value. This is what gives you the apparent deadlock. You should simply remove tx_producer and use tx directly, so it is moved into the producer thread and dropped accordingly.
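Here is a simplified, std-only sketch of that point (it leaves out threads_pool and the generator entirely): once tx is moved into the producer thread and every clone of it has been dropped, the receiving loop ends on its own instead of blocking forever:
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();

    // `move` transfers ownership of `tx` into the producer thread, so once the
    // producer and all of its per-task clones finish, no Sender remains alive.
    let producer = thread::spawn(move || {
        for i in 0..4 {
            let tx_ref = tx.clone();
            thread::spawn(move || {
                tx_ref.send(format!("item {}", i)).expect("tx failed");
            });
        }
        // `tx` itself is dropped here when the closure returns.
    });

    // Instead of counting to N, drain until the channel reports disconnection.
    for msg in rx {
        println!("-» Consumed: {}", msg);
    }

    producer.join().unwrap();
    println!("Success. Exit!");
}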
Now, you'll see either all N tasks complete, or you'll get an error indicating that some tasks did not complete. The reason not all tasks are completing is because you're creating the thread pool, spawning all the tasks, and then immediately destroying it. The threads_pool documentation says that the threads will finish their current job when the pool is destroyed, but you want to wait until all jobs have completed. For that you need to call the .close() method provided by the PoolManager trait before the end of the closure.
The reason you saw inconsistent behavior, but benefited from returning a string directly, is that those jobs required less work, so the threads could get away with completing all of them before they saw their signal to exit. Your generator_ref.next() requires much more computation, so it's not surprising they'd only process 4-plus-a-bit jobs before they see they've been told to exit.

Mutex<bool> with an atomic read&write

I need to have a global boolean flag that will be accessed by multiple threads.
Here is an example of what I need:
static GLOBAL_FLAG: SyncLazy<Mutex<bool>> = SyncLazy::new(|| {
    Mutex::new(false)
});

fn set_flag_to_true() { // can be called by 2+ threads concurrently
    *GLOBAL_FLAG.lock().unwrap() = true;
}

fn get_flag_and_set_to_true() -> bool { // only one thread is calling this function
    let v = *GLOBAL_FLAG.lock().unwrap(); // Obtain current flag value
    *GLOBAL_FLAG.lock().unwrap() = true; // Always set the flag to true
    v // Return the previous value
}
The get_flag_and_set_to_true() implementation doesn't feel quite right. I imagine it would be best if I only locked once. What's the best way to do that?
BTW I suppose Arc<[AtomicBool]> can also be used and should in theory be faster, although in my particular case the speed benefit will be unnoticeable.
BTW I suppose Arc<[AtomicBool]> can also be used and should in theory be faster, although in my particular case the speed benefit will be unnoticeable.
It's not just about benefit in performance, but also in amount of code and ease of reasoning about the code. With AtomicBool you don't need either SyncLazy or the mutex, and the code is shorter and clearer:
use std::sync::atomic::{AtomicBool, Ordering};

static GLOBAL_FLAG: AtomicBool = AtomicBool::new(false);

pub fn set_flag_to_true() {
    GLOBAL_FLAG.store(true, Ordering::SeqCst);
}

pub fn get_flag_and_set_to_true() -> bool {
    GLOBAL_FLAG.swap(true, Ordering::SeqCst)
}
Playground
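For completeness, here is a self-contained usage sketch (the main function is hypothetical, added only to exercise the two functions above): several threads call set_flag_to_true() concurrently, while get_flag_and_set_to_true() performs its read and its write as one atomic step, so no store can slip in between them:
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static GLOBAL_FLAG: AtomicBool = AtomicBool::new(false);

fn set_flag_to_true() {
    GLOBAL_FLAG.store(true, Ordering::SeqCst);
}

fn get_flag_and_set_to_true() -> bool {
    GLOBAL_FLAG.swap(true, Ordering::SeqCst) // read + write in a single atomic step
}

fn main() {
    // Several threads may set the flag concurrently...
    let setters: Vec<_> = (0..4).map(|_| thread::spawn(set_flag_to_true)).collect();
    for h in setters {
        h.join().unwrap();
    }
    // ...while the single reader atomically fetches the previous value and sets the flag.
    assert!(get_flag_and_set_to_true());
    assert!(get_flag_and_set_to_true());
}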
Conceivably, another thread could come in between when you read GLOBAL_FLAG and when you set GLOBAL_FLAG to true. To work around this you can directly store the MutexGuard (docs) that GLOBAL_FLAG.lock().unwrap() returns:
fn get_flag_and_set_to_true() -> bool { // only one thread is calling this function
    let mut global_flag = GLOBAL_FLAG.lock().unwrap();
    let v = *global_flag; // Obtain current flag value
    *global_flag = true; // Always set the flag to true
    v // Return the previous value
}
global_flag will keep the mutex locked until it gets dropped.

Some confusion regarding Rust memory ordering

I have some questions regarding Rust memory barriers. Let's take a look at this example; starting from it, I made some changes:
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

struct UsizePair {
    atom: AtomicUsize,
    norm: UnsafeCell<usize>,
}

// UnsafeCell is not thread-safe. So manually mark our UsizePair to be Sync.
// (Effectively telling the compiler "I'll take care of it!")
unsafe impl Sync for UsizePair {}

static NTHREADS: usize = 8;
static NITERS: usize = 1000000;

fn main() {
    let upair = Arc::new(UsizePair::new(0));

    // Barrier is a counter-like synchronization structure (not to be confused
    // with a memory barrier). It blocks on a `wait` call until a fixed number
    // of `wait` calls are made from various threads (like waiting for all
    // players to get to the starting line before firing the starter pistol).
    let barrier = Arc::new(Barrier::new(NTHREADS + 1));

    let mut children = vec![];

    for _ in 0..NTHREADS {
        let upair = upair.clone();
        let barrier = barrier.clone();
        children.push(thread::spawn(move || {
            barrier.wait();

            let mut v = 0;
            while v < NITERS - 1 {
                // Read both members `atom` and `norm`, and check whether `atom`
                // contains a newer value than `norm`. See `UsizePair` impl for
                // details.
                let (atom, norm) = upair.get();

                if atom != norm {
                    // If `Acquire`-`Release` ordering is used in `get` and
                    // `set`, then this statement will never be reached.
                    println!("Reordered! {} != {}", atom, norm);
                }
                v = atom;
            }
        }));
    }

    barrier.wait();

    for v in 1..NITERS {
        // Update both members `atom` and `norm` to value `v`. See the impl for
        // details.
        upair.set(v);
    }

    for child in children {
        let _ = child.join();
    }
}

impl UsizePair {
    pub fn new(v: usize) -> UsizePair {
        UsizePair {
            atom: AtomicUsize::new(v),
            norm: UnsafeCell::new(v),
        }
    }

    pub fn get(&self) -> (usize, usize) {
        let atom = self.atom.load(Ordering::Acquire); //Ordering::Acquire

        // If the above load operation is performed with `Acquire` ordering,
        // then all writes before the corresponding `Release` store are
        // guaranteed to be visible below.
        let norm = unsafe { *self.norm.get() };
        (atom, norm)
    }

    pub fn set(&self, v: usize) {
        unsafe { *self.norm.get() = v };

        // If the below store operation is performed with `Release` ordering,
        // then the write to `norm` above is guaranteed to be visible to all
        // threads that "loads `atom` with `Acquire` ordering and sees the same
        // value that was stored below". However, no guarantees are provided as
        // to when other readers will witness the below store, and consequently
        // the above write. On the other hand, there is also no guarantee that
        // these two values will be in sync for readers. Even if another thread
        // sees the same value that was stored below, it may actually see a
        // "later" value in `norm` than what was written above. That is, there
        // is no restriction on visibility into the future.
        self.atom.store(v, Ordering::Release); //Ordering::Release
    }
}
Basically, I just changed the check to if atom != norm and the memory orderings in the get and set methods.
According to what I have learned so far, all memory operations (1. they don't need to operate on the same memory location, 2. it doesn't matter whether they are atomic operations or normal memory operations) that happen before a Release store will be visible to memory operations after an Acquire load.
I don't get why atom != norm can ever be true. Actually, the comments in the example do point this out:
However, no guarantees are provided as to when other readers will witness the below store, and consequently the above write. On the other hand, there is also no guarantee that these two values will be in sync for readers. Even if another thread sees the same value that was stored below, it may actually see a "later" value in norm than what was written above. That is, there is no restriction on visibility into the future.
Can someone explain to me why norm can see some "future value"?
Also, in this C++ example, is it the same reason that causes these statements in the code?
v0, v1, v2 might turn out to be -1, some, or all of them.
all memory operations ... that happen before a Release store will be visible to memory operations after an Acquire load.
That's true only if the acquire load sees the value from the release store.
If not, the acquire load ran before the release store was globally visible, so there are no guarantees about anything; you didn't actually synchronize with that writer. The load of norm happens after the acquire load, so another store might have become globally visible [1] during that interval.
Also, the norm store is done first [2], so even if atom and norm were loaded simultaneously (e.g. by one wide atomic load), it would still be possible to see norm updated but atom not yet.
Footnote 1: (Or visible to this thread, on the rare machine where that can happen without being globally visible, e.g. PowerPC)
Footnote 2: The only actual guarantee is not-later; they could both become globally visible as one wider transaction, e.g. the compiler would be allowed to merge the norm store and the atom store into one wider atomic store, or hardware could do that via store coalescing in the store buffer. So there might never be a time interval when you could observe norm updated but atom not; it depends on the implementation (hardware and compiler).
(IDK what kind of guarantees Rust gives here or how it formally defines synchronization and memory order. But the basics of acquire and release synchronization are fairly universal. https://preshing.com/20120913/acquire-and-release-semantics/. In C++ reading a non-atomic norm at all without achieving synchronization would be data-race UB (undefined behaviour), but of course when compiled for real hardware the effects I describe are what would happen in practice, whether the source language is C++ or Rust.)

Is it possible to share a HashMap between threads without locking the entire HashMap?

I would like to have a shared struct between threads. The struct has many fields that are never modified and a HashMap, which is. I don't want to lock the whole HashMap for a single update/remove, so my HashMap looks something like HashMap<u8, Mutex<u8>>. This works, but it makes no sense since the thread will lock the whole map anyway.
Here's a working version, without threads; I don't think they're necessary for the example.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let s = Arc::new(Mutex::new(S::new()));
    let z = s.clone();
    let _ = z.lock().unwrap();
}

struct S {
    x: HashMap<u8, Mutex<u8>>, // other non-mutable fields
}

impl S {
    pub fn new() -> S {
        S {
            x: HashMap::default(),
        }
    }
}
Playground
Is this possible in any way? Is there something obvious I missed in the documentation?
I've been trying to get this working, but I'm not sure how. Basically, in every example I see, there's always a Mutex (or RwLock, or something like that) guarding the inner value.
I don't see how your request is possible, at least not without some exceedingly clever lock-free data structures; what should happen if multiple threads need to insert new values that hash to the same location?
In previous work, I've used a RwLock<HashMap<K, Mutex<V>>>. When inserting a value into the hash, you get an exclusive lock for a short period. The rest of the time, you can have multiple threads with reader locks to the HashMap and thus to a given element. If they need to mutate the data, they can get exclusive access to the Mutex.
Here's an example:
use std::{
    collections::HashMap,
    sync::{Arc, Mutex, RwLock},
    thread,
    time::Duration,
};

fn main() {
    let data = Arc::new(RwLock::new(HashMap::new()));

    let threads: Vec<_> = (0..10)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || worker_thread(i, data))
        })
        .collect();

    for t in threads {
        t.join().expect("Thread panicked");
    }

    println!("{:?}", data);
}

fn worker_thread(id: u8, data: Arc<RwLock<HashMap<u8, Mutex<i32>>>>) {
    loop {
        // Assume that the element already exists
        let map = data.read().expect("RwLock poisoned");

        if let Some(element) = map.get(&id) {
            let mut element = element.lock().expect("Mutex poisoned");

            // Perform our normal work updating a specific element.
            // The entire HashMap only has a read lock, which
            // means that other threads can access it.
            *element += 1;
            thread::sleep(Duration::from_secs(1));

            return;
        }

        // If we got this far, the element doesn't exist.
        // Get rid of our read lock and switch to a write lock.
        // You want to minimize the time we hold the writer lock.
        drop(map);
        let mut map = data.write().expect("RwLock poisoned");

        // We use HashMap::entry to handle the case where another thread
        // inserted the same key while we were unlocked.
        thread::sleep(Duration::from_millis(50));
        map.entry(id).or_insert_with(|| Mutex::new(0));

        // Let the loop start us over to try again
    }
}
This takes about 2.7 seconds to run on my machine, even though it starts 10 threads that each wait for 1 second while holding the exclusive lock to the element's data.
This solution isn't without issues, however. When there's a huge amount of contention for that one master lock, getting a write lock can take a while and completely kills parallelism.
In that case, you can switch to a RwLock<HashMap<K, Arc<Mutex<V>>>>. Once you have a read or write lock, you can then clone the Arc of the value, returning it and unlocking the hashmap.
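Here is a sketch of that variant (adapted from the example above; the get_or_insert helper name is made up): the map lock is held only long enough to find or insert the entry and clone its Arc, and the element's Mutex is locked after the map lock has already been released:
use std::{
    collections::HashMap,
    sync::{Arc, Mutex, RwLock},
};

type Shared = Arc<RwLock<HashMap<u8, Arc<Mutex<i32>>>>>;

fn get_or_insert(data: &Shared, id: u8) -> Arc<Mutex<i32>> {
    {
        // Fast path: read lock, clone the Arc, release the map lock immediately.
        let map = data.read().expect("RwLock poisoned");
        if let Some(element) = map.get(&id) {
            return Arc::clone(element);
        }
    } // read lock released here

    // Slow path: briefly take the write lock to insert the missing entry.
    let mut map = data.write().expect("RwLock poisoned");
    Arc::clone(map.entry(id).or_insert_with(|| Arc::new(Mutex::new(0))))
}

fn main() {
    let data: Shared = Arc::new(RwLock::new(HashMap::new()));
    let element = get_or_insert(&data, 7);
    // The HashMap is no longer locked here; only this element's Mutex is.
    *element.lock().expect("Mutex poisoned") += 1;
    println!("{:?}", data);
}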
The next step up would be to use a crate like arc-swap, which says:
Then one would lock, clone the [RwLock<Arc<T>>] and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.
The ArcSwap can be used instead, which solves the above problems and has better performance characteristics than the RwLock, both in contended and non-contended scenarios.
I often advocate for performing some kind of smarter algorithm. For example, you could spin up N threads each with their own HashMap. You then shard work among them. For the simple example above, you could use id % N_THREADS, for example. There are also complicated sharding schemes that depend on your data.
As Go has done a good job of evangelizing: do not communicate by sharing memory; instead, share memory by communicating.
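As a rough sketch of that message-passing style (all names and types below are invented for the example): each shard's HashMap is owned outright by one worker thread, and other threads send it commands over a channel instead of locking a shared map:
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

const N_SHARDS: usize = 4;

enum Cmd {
    // A real design would also have Get/Remove variants, reply channels, etc.;
    // this sketch only shows the sharding idea.
    Add { key: u8, delta: i32 },
}

fn main() {
    // One worker per shard, each owning its HashMap outright: no locks at all.
    let (senders, handles): (Vec<_>, Vec<_>) = (0..N_SHARDS)
        .map(|_| {
            let (tx, rx) = mpsc::channel::<Cmd>();
            let handle = thread::spawn(move || {
                let mut map: HashMap<u8, i32> = HashMap::new();
                for cmd in rx {
                    match cmd {
                        Cmd::Add { key, delta } => *map.entry(key).or_insert(0) += delta,
                    }
                }
                map // return the shard's contents once all senders are gone
            });
            (tx, handle)
        })
        .unzip();

    // Producers shard the work by key: key % N_SHARDS picks the owning worker.
    for key in 0u8..100 {
        let shard = key as usize % N_SHARDS;
        senders[shard].send(Cmd::Add { key, delta: 1 }).unwrap();
    }

    drop(senders); // close the channels so the workers' receive loops end
    for (i, h) in handles.into_iter().enumerate() {
        println!("shard {}: {:?}", i, h.join().unwrap());
    }
}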
Suppose the key of the data is mappable to a u8.
You can have Arc<HashMap<u8, Mutex<HashMap<Key, Value>>>>.
When you initialize the data structure, you populate the entire first-level map before putting it in the Arc (it will be immutable after initialization).
When you want a value from the map, you will need to do a double get, something like:
data.get(&map_to_u8(&key)).unwrap().lock().expect("poison").get(&key)
where the unwrap is safe because we initialized the first map with every possible key.
To write to the map, something like:
data.get(&map_to_u8(id)).unwrap().lock().expect("poison").entry(id).or_insert_with(|| value);
It's easy to see that contention is reduced because we now have 256 Mutexes, and the probability of multiple threads contending for the same Mutex is low.
@Shepmaster's example with 100 threads takes about 10 s on my machine; the following example takes a little more than 1 second.
use std::{
    collections::HashMap,
    sync::{Arc, Mutex},
    thread,
    time::Duration,
};

fn main() {
    let mut inner = HashMap::new();
    for i in 0..=u8::max_value() {
        inner.insert(i, Mutex::new(HashMap::new()));
    }
    let data = Arc::new(inner);

    let threads: Vec<_> = (0..100)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || worker_thread(i, data))
        })
        .collect();

    for t in threads {
        t.join().expect("Thread panicked");
    }
    println!("{:?}", data);
}

fn worker_thread(id: u8, data: Arc<HashMap<u8, Mutex<HashMap<u8, Mutex<i32>>>>>) {
    loop {
        // The first unwrap is safe because we populated an entry for every `u8`.
        if let Some(element) = data.get(&id).unwrap().lock().expect("poison").get(&id) {
            let mut element = element.lock().expect("Mutex poisoned");

            // Perform our normal work updating a specific element.
            // Only this shard's Mutex is held, so threads working on
            // other shards are unaffected.
            *element += 1;
            thread::sleep(Duration::from_secs(1));

            return;
        }

        // If we got this far, the element doesn't exist yet.
        // We use HashMap::entry to handle the case where another thread
        // inserted the same key while the shard was unlocked.
        thread::sleep(Duration::from_millis(50));
        data.get(&id).unwrap().lock().expect("poison").entry(id).or_insert_with(|| Mutex::new(0));

        // Let the loop start us over to try again
    }
}
Maybe you want to consider evmap:
A lock-free, eventually consistent, concurrent multi-value map.
The trade-off is eventual-consistency: Readers do not see changes until the writer refreshes the map. A refresh is atomic and the writer decides when to do it and expose new data to the readers.
