Some confusion regarding Rust memory ordering - rust

I have some questions regarding Rust memory barriers. Let's have a look at this example; based on it, I made some changes:
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

struct UsizePair {
    atom: AtomicUsize,
    norm: UnsafeCell<usize>,
}

// UnsafeCell is not thread-safe. So manually mark our UsizePair to be Sync.
// (Effectively telling the compiler "I'll take care of it!")
unsafe impl Sync for UsizePair {}

static NTHREADS: usize = 8;
static NITERS: usize = 1000000;

fn main() {
    let upair = Arc::new(UsizePair::new(0));

    // Barrier is a counter-like synchronization structure (not to be confused
    // with a memory barrier). It blocks on a `wait` call until a fixed number
    // of `wait` calls are made from various threads (like waiting for all
    // players to get to the starting line before firing the starter pistol).
    let barrier = Arc::new(Barrier::new(NTHREADS + 1));

    let mut children = vec![];

    for _ in 0..NTHREADS {
        let upair = upair.clone();
        let barrier = barrier.clone();
        children.push(thread::spawn(move || {
            barrier.wait();

            let mut v = 0;
            while v < NITERS - 1 {
                // Read both members `atom` and `norm`, and check whether `atom`
                // contains a newer value than `norm`. See `UsizePair` impl for
                // details.
                let (atom, norm) = upair.get();
                if atom != norm {
                    // If `Acquire`-`Release` ordering is used in `get` and
                    // `set`, then this statement will never be reached.
                    println!("Reordered! {} != {}", atom, norm);
                }
                v = atom;
            }
        }));
    }

    barrier.wait();

    for v in 1..NITERS {
        // Update both members `atom` and `norm` to value `v`. See the impl for
        // details.
        upair.set(v);
    }

    for child in children {
        let _ = child.join();
    }
}

impl UsizePair {
    pub fn new(v: usize) -> UsizePair {
        UsizePair {
            atom: AtomicUsize::new(v),
            norm: UnsafeCell::new(v),
        }
    }

    pub fn get(&self) -> (usize, usize) {
        let atom = self.atom.load(Ordering::Acquire); // Ordering::Acquire

        // If the above load operation is performed with `Acquire` ordering,
        // then all writes before the corresponding `Release` store are
        // guaranteed to be visible below.
        let norm = unsafe { *self.norm.get() };
        (atom, norm)
    }

    pub fn set(&self, v: usize) {
        unsafe { *self.norm.get() = v };

        // If the below store operation is performed with `Release` ordering,
        // then the write to `norm` above is guaranteed to be visible to all
        // threads that "load `atom` with `Acquire` ordering and see the same
        // value that was stored below". However, no guarantees are provided as
        // to when other readers will witness the below store, and consequently
        // the above write. On the other hand, there is also no guarantee that
        // these two values will be in sync for readers. Even if another thread
        // sees the same value that was stored below, it may actually see a
        // "later" value in `norm` than what was written above. That is, there
        // is no restriction on visibility into the future.
        self.atom.store(v, Ordering::Release); // Ordering::Release
    }
}
Basically, I just changed the check to `if atom != norm` and the memory ordering used in the `get` and `set` methods.
According to what I have learned so far, all memory operations (1. they don't need to operate on the same memory location, 2. they can be either atomic or plain memory operations) that happen before a `Release` store will be visible to memory operations after an `Acquire` load.
What I don't get is why `atom != norm` can sometimes be true. Actually, the comments in the example do point out that:
However, no guarantees are provided as to when other readers will witness the below store, and consequently the above write. On the other hand, there is also no guarantee that these two values will be in sync for readers. Even if another thread sees the same value that was stored below, it may actually see a "later" value in norm than what was written above. That is, there is no restriction on visibility into the future.
Can someone explain to me why norm can see some "future value"?
Also, in this C++ example, is it the same reason that causes this statement in the code?
v0, v1, v2 might turn out to be -1, some, or all of them.

all memory operations ... that happen before a `Release` store will be visible to memory operations after an `Acquire` load.
That's true only if the acquire load sees the value from the release store.
If not, the acquire load ran before the release store was globally visible, so there are no guarantees about anything; you didn't actually synchronize with that writer. The load of `norm` happens after the acquire load, so another store might have become globally visible¹ during that interval.
Also, the `norm` store is done first², so even if `atom` and `norm` were loaded simultaneously (e.g. by one wide atomic load), it would still be possible to see `norm` updated but `atom` not yet.
Footnote 1: (Or visible to this thread, on the rare machine where that can happen without being globally visible, e.g. PowerPC)
Footnote 2: The only actual guarantee is not-later; they could both become globally visible as one wider transaction, e.g. the compiler would be allowed to merge the `norm` store and the `atom` store into one wider atomic store, or hardware could do that via store coalescing in the store buffer. So there might never be a time interval when you could observe `norm` updated but `atom` not; it depends on the implementation (hardware and compiler).
(IDK what kind of guarantees Rust gives here or how it formally defines synchronization and memory order. But the basics of acquire and release synchronization are fairly universal. https://preshing.com/20120913/acquire-and-release-semantics/. In C++ reading a non-atomic norm at all without achieving synchronization would be data-race UB (undefined behaviour), but of course when compiled for real hardware the effects I describe are what would happen in practice, whether the source language is C++ or Rust.)
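To make the first point concrete, here is a minimal message-passing sketch of my own (not part of the question's example): the reader is only synchronized with the writer on the iteration where the `Acquire` load actually observes the value written by the `Release` store, and only then is the earlier write to `DATA` guaranteed to be visible.
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // write that must be published
        READY.store(true, Ordering::Release); // the "publishing" store
    });

    // Spin until the acquire load sees the value written by the release store.
    while !READY.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }

    // Only now have we synchronized with the writer, so this must read 42.
    assert_eq!(DATA.load(Ordering::Relaxed), 42);

    writer.join().unwrap();
}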

Related

Stack overflow occurring in my Rust program, especially when using `tokio` in combination with `crossbeam` channels [closed]

I have encountered a very weird problem, while using crossbeam channels in combination with tokio.
Seemingly without allocating much memory, my program just stack overflows in debug mode.
But when I decided to replace the unbounded channels with bounded ones and thus allocating more memory to the stack everything seems to work, even in debug mode.
On another application I am writing, this is happening too, but without using tokio and I am required to run it in release mode.
I tried to wrap everything that might consume a lot of memory and is stack-allocated in a Box, but without any success.
My question is if I should be worried about this, or just run my program in release mode without any further concerns.
I am writing a server for my game, so stability is very important.
EDIT
I found out that the Rust compiler apparently performs "tail-call optimization", which allows an infinite loop in release mode (source).
And because I am running 2 infinite loops in both programs, I think this explains why they only work in release mode.
minimal working example
communication.rs
use crate::channel::*;
use std::mem::forget;
use tokio::net::{TcpStream, ToSocketAddrs};
use tokio::runtime;
use tokio::io::AsyncWriteExt;

pub struct Communicator {
    event_sender: Sender<[u8; 255]>,
    event_receiver: Receiver<[u8; 255]>,
}

impl Communicator {
    pub fn init<H: ToSocketAddrs>(host: H) -> Option<Self> {
        let rt = runtime::Builder::new_multi_thread()
            .enable_io()
            .build()
            .unwrap();
        if let Some((event_sender, event_receiver)) = rt.block_on(async move {
            if let Ok(socket) = TcpStream::connect(host).await {
                let (mut read, mut write) = socket.into_split();

                let (ev_sender_tx, ev_sender_rx): (Sender<[u8; 255]>, Receiver<[u8; 255]>) = channel();
                let (ev_receiver_tx, ev_receiver_rx) = channel();

                // rx
                tokio::spawn(async move {
                    loop {
                        let mut buffer: [u8; 255] = [0; 255];
                        ev_receiver_tx.send(buffer).unwrap();
                    }
                });

                // tx
                tokio::spawn(async move {
                    loop {
                        if let Some(event) = ev_sender_rx.recv() {
                            write.write_all(&event).await;
                        }
                    }
                });

                Some((ev_sender_tx, ev_receiver_rx))
            } else {
                None
            }
        }) {
            // not allowed to run destructor, must leak
            forget(rt);
            Some(Self { event_sender, event_receiver })
        } else {
            None
        }
    }
}
channel.rs
use crossbeam::channel;
use crossbeam::channel::SendError;

#[derive(Debug, Clone)]
pub struct Receiver<T> {
    inner: channel::Receiver<T>,
}

impl<T> Receiver<T> {
    pub fn recv(&self) -> Option<T> {
        self.inner.recv().ok()
    }

    #[allow(dead_code)]
    pub fn try_recv(&self) -> Option<T> {
        self.inner.try_recv().ok()
    }
}

#[derive(Debug, Clone)]
pub struct Sender<T> {
    inner: channel::Sender<T>,
}

impl<T> Sender<T> {
    pub fn send(&self, data: T) -> Result<(), SendError<T>> {
        self.inner.send(data)
    }
}

pub fn channel<T>() -> (Sender<T>, Receiver<T>) {
    let (s, r) = channel::unbounded();
    (
        Sender { inner: s },
        Receiver { inner: r },
    )
}

// will block if full
pub fn bounded_channel<T>(bound: usize) -> (Sender<T>, Receiver<T>) {
    let (s, r) = channel::bounded(bound);
    (
        Sender { inner: s },
        Receiver { inner: r },
    )
}
loop {
    let mut buffer: [u8; 255] = [0; 255];
    ev_receiver_tx.send(buffer).unwrap();
}
Unless I'm understanding something wrong, this loop just fills up the channel completely. And for an unbounded channel, I'm not surprised that this results in an infinite amount of memory being used. What's the purpose of this loop, or is that just for the example?
If you just run this loop on its own, with an unbounded channel, you can watch the memory get eaten until the program crashes.
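As a rough illustration of that point (my own sketch, not the question's code), a producer with no consumer makes an unbounded channel grow without limit:
use crossbeam::channel;

fn main() {
    let (tx, _rx) = channel::unbounded::<[u8; 255]>();
    loop {
        // Nothing ever receives, so every message stays buffered inside the
        // channel and memory usage grows until the process is killed.
        tx.send([0u8; 255]).unwrap();
    }
}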
But when I decided to replace the unbounded channels with bounded ones and thus allocating more memory to the stack everything seems to work, even in debug mode.
I don't quite follow ... how does replacing the channels change the amount of memory allocated to the stack? This actually limits the amount of memory a channel could take if flooded with items, which seems to be the root cause of the crash in this case. I'm still a bit uncertain, though, because this should cause an out-of-memory error, not a stack overflow.
// not allowed to run destructor, must leak
forget(rt);
That one is also highly suspicious to me. According to its documentation, dropping the runtime handle is the official way of shutting down the runtime, so calling forget on it sounds to me like a memory leak and incorrect behaviour. forget really is one of those functions that don't serve much of a purpose unless paired with some form of unsafe code.
If you added this line because it complained that it cannot destroy it while it is borrowed by event_sender and event_receiver, then that was definitely the wrong way to 'fix' that problem.
I think the general approach (if I understand it correctly) of spawning an async runtime for every connection is not possible the way you implemented it. Especially a multi_thread runtime will spawn a lot of threads as you get more and more connections, and as you forget the runtime handle, those threads will never go away again, even if the users disconnect.
I think a more useful way would be to have a reusable multi_thread runtime somewhere (which is already capable of using 100% of your cpu) and then to use tokio::spawn to create new tasks for the new connection in your init function, as you already do.
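A minimal sketch of that shape, under my own assumptions (the shared_runtime helper is illustrative, not something from the question's code):
use std::sync::OnceLock;
use tokio::runtime::{Builder, Runtime};

// One process-wide runtime instead of one leaked runtime per connection.
static RUNTIME: OnceLock<Runtime> = OnceLock::new();

fn shared_runtime() -> &'static Runtime {
    RUNTIME.get_or_init(|| {
        Builder::new_multi_thread()
            .enable_io()
            .build()
            .expect("failed to build tokio runtime")
    })
}

fn main() {
    shared_runtime().block_on(async {
        // Connect and spawn the per-connection reader/writer tasks here with
        // tokio::spawn, as `Communicator::init` already does, but without
        // building or forgetting a new runtime for every connection.
    });
}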
My question is if I should be worried about this, or just run my program in release mode without any further concerns.
Yes, in my opinion, you should definitely be worried about this. Normal healthy Rust programs do not just randomly stack overflow in debug mode. This is a strong indication of a programming error.
Try to use memory profiling tools to find out where the memory usage actually comes from.
I found out that the Rust compiler apparently performs "tail-call optimization" which allows an infinite loop in release mode
I think you misunderstood that post. It means:
In recursion, the compiler might reduce the usage of the stack by replacing the recursive function with a loop. Here is a discussion about how the compiler might achieve this.
This does not, under any circumstances, mean that the compiler just randomly decides to hang in an infinite loop in release mode. That would be horrible and would take all credibility from the Rust compiler and the language as a whole.
The only reason this article mentioned "infinite" loops is because it was an infinite unbounded recursion to begin with. That's what the article was actually about, to show how fast a stack overflow happens if you cause one on purpose, not how to prevent it.
Unless you perform recursion, it's almost impossible to cause a stack overflow in Rust (and in most languages, for that matter). Memory leaks almost exclusively happen on the heap, as memory usage on the stack is in almost all circumstances already known at compile time and stays within a constant limit. (again, recursions excluded) Further, almost all data structures that store large amounts of data store the data on the heap. Among other things, this is done to prevent excessive copies; memory on the stack usually gets moved around a lot, while memory on the heap stays at a constant location.
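A tiny illustration of that distinction (my own example, not from the linked post): the recursive version below keeps adding stack frames and will overflow for a large enough n, especially in debug builds, while the loop always runs in constant stack space.
fn count_down(n: u64) -> u64 {
    // A tail call: the optimizer *may* turn this into a loop in release mode,
    // but nothing guarantees it, and in debug builds each call usually keeps
    // its own stack frame, so a large `n` overflows the stack.
    if n == 0 { 0 } else { count_down(n - 1) }
}

fn count_down_loop(mut n: u64) -> u64 {
    // The explicit loop uses constant stack space regardless of `n`.
    while n > 0 {
        n -= 1;
    }
    n
}

fn main() {
    println!("{}", count_down_loop(10_000_000));
    // count_down(10_000_000) would likely overflow the stack in a debug build.
    println!("{}", count_down(10_000));
}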
I falsely assumed that data sent through an unbounded channel is stored on the heap, while data sent through a bounded channel is stored on the stack if the datatype allows that.
What seems to fix this issue was to just wrap the data that is sent in a Box, thus forcing a heap allocation.
The code would now look somewhat like this:
use crate::channel::*;
use std::mem::forget;
use tokio::net::{TcpStream, ToSocketAddrs};
use tokio::runtime;
use tokio::io::AsyncWriteExt;

pub struct Communicator {
    event_sender: Sender<Box<[u8; 255]>>,
    event_receiver: Receiver<Box<[u8; 255]>>,
}

impl Communicator {
    pub fn init<H: ToSocketAddrs>(host: H) -> Option<Self> {
        let rt = runtime::Builder::new_multi_thread()
            .enable_io()
            .build()
            .unwrap();
        if let Some((event_sender, event_receiver)) = rt.block_on(async move {
            if let Ok(socket) = TcpStream::connect(host).await {
                let (mut read, mut write) = socket.into_split();

                let (ev_sender_tx, ev_sender_rx): (Sender<Box<[u8; 255]>>, Receiver<Box<[u8; 255]>>) = channel();
                let (ev_receiver_tx, ev_receiver_rx) = channel();

                // rx
                tokio::spawn(async move {
                    loop {
                        let mut buffer: [u8; 255] = [0; 255];
                        ev_receiver_tx.send(Box::new(buffer)).unwrap();
                    }
                });

                // tx
                tokio::spawn(async move {
                    loop {
                        if let Some(event) = ev_sender_rx.recv() {
                            write.write_all(&(*event)).await;
                        }
                    }
                });

                Some((ev_sender_tx, ev_receiver_rx))
            } else {
                None
            }
        }) {
            // not allowed to run destructor, must leak
            forget(rt);
            Some(Self { event_sender, event_receiver })
        } else {
            None
        }
    }
}
main.rs
mod communicator;
mod channel;

fn main() {
    let comm = communicator::Communicator::init("0.0.0.0:8000");
    std::thread::park();
}
EDIT
The problem was that I had a data structure somewhere else in the code, which tried to store a big 3-dimensional array.
Just converting [[[u16; 32]; 32]; 32] to Box<[[[u16; 32]; 32]; 32]> worked.
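For context, a [[[u16; 32]; 32]; 32] is 32 × 32 × 32 × 2 bytes = 64 KiB, which lives on the stack when held by value and gets copied on every move; boxing it leaves only a pointer-sized handle on the stack. A hypothetical sketch (the Chunk type is illustrative, the real struct is not shown in the question):
// Hypothetical example type, not the asker's actual data structure.
struct Chunk {
    voxels: Box<[[[u16; 32]; 32]; 32]>, // the 64 KiB array lives on the heap
}

fn main() {
    let chunk = Chunk {
        voxels: Box::new([[[0u16; 32]; 32]; 32]),
    };
    println!("{}", chunk.voxels[10][20][30]);
}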

Is accessing an object through the same pointer from multiple threads for atomic operations undefined behavior?

So I am experimenting with some lock-free and wait-free algorithms in Rust. I have found that the AtomicPtr type is quite useful here, as it allows operations like compare-exchange or swapping on pointer values for implementations of lock-free data structures. However, I read something potentially concerning in Rust's rules about pointer aliasing:
Breaking the pointer aliasing rules. &mut T and &T follow LLVM’s scoped noalias model, except if the &T contains an UnsafeCell.
Mutating immutable data. All data inside a const item is immutable. Moreover, all data reached through a shared reference or data owned by an immutable binding is immutable, unless that data is contained within an UnsafeCell.
What's more, the LLVM section they link to also sounds concerning:
This indicates that memory locations accessed via pointer values based on the argument or return value are not also accessed, during the execution of the function, via pointer values not based on the argument or return value. This guarantee only holds for memory locations that are modified, by any means, during the execution of the function.
From this, it's not quite clear how atomics could be considered defined behavior, as an atomic operation does mutate the contents of the data a pointer points to. Neither the Rust docs nor the LLVM docs list atomics as an exception to this.
Thus, I wanted to know if the following code would be considered undefined behavior in Rust:
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};
use std::thread;

struct Point {
    x: AtomicUsize,
    y: AtomicUsize,
}

impl Point {
    fn new_ptr() -> *mut Point {
        Box::into_raw(Box::new(Point {
            x: AtomicUsize::new(0),
            y: AtomicUsize::new(0),
        }))
    }
}

fn main() {
    let point = Point::new_ptr();
    let cont = AtomicPtr::new(point);

    let handle = thread::spawn(move || {
        let r = unsafe { cont.load(Ordering::SeqCst).as_ref().unwrap() };
        for _ in 0..10000 {
            r.x.fetch_add(1, Ordering::SeqCst);
        }
    });

    unsafe {
        point.as_ref().unwrap().x.fetch_add(1, Ordering::SeqCst);
    }

    handle.join().unwrap();

    unsafe {
        drop(Box::from_raw(point));
    }
}
It compiles and executes as expected, along with slight variations. But I am unsure whether what I am doing is undefined or not. I want to ensure this allowed behavior is not restricted or changed in the future.

Is it safe to `Send` struct containing `Rc` if strong_count is 1 and weak_count is 0?

I have a struct that is not Send because it contains Rc. Let's say that Arc has too much overhead, so I want to keep using Rc. I would still like to occasionally Send this struct between threads, but only when I can verify that the Rc has strong_count 1 and weak_count 0.
Here is a (hopefully safe) abstraction that I have in mind:
mod my_struct {
    use std::rc::Rc;

    #[derive(Debug)]
    pub struct MyStruct {
        reference_counted: Rc<String>,
        // more fields...
    }

    impl MyStruct {
        pub fn new() -> Self {
            MyStruct {
                reference_counted: Rc::new("test".to_string()),
            }
        }

        pub fn pack_for_sending(self) -> Result<Sendable, Self> {
            if Rc::strong_count(&self.reference_counted) == 1
                && Rc::weak_count(&self.reference_counted) == 0
            {
                Ok(Sendable(self))
            } else {
                Err(self)
            }
        }

        // There are more methods, some may clone the `Rc`!
    }

    /// `Send`able wrapper for `MyStruct` that does not allow you to access it,
    /// only unpack it.
    pub struct Sendable(MyStruct);

    // Safety: `MyStruct` is not `Send` because of `Rc`. `Sendable` can be
    // only created when the `Rc` has strong count 1 and weak count 0.
    unsafe impl Send for Sendable {}

    impl Sendable {
        /// Retrieve the inner `MyStruct`, making it not-sendable again.
        pub fn unpack(self) -> MyStruct {
            self.0
        }
    }
}
use crate::my_struct::MyStruct;

fn main() {
    let handle = std::thread::spawn(|| {
        let my_struct = MyStruct::new();
        dbg!(&my_struct);
        // Do something with `my_struct`, but at the end the inner `Rc` should
        // not be shared with anybody.
        my_struct.pack_for_sending().expect("Some Rc was still shared!")
    });

    let my_struct = handle.join().unwrap().unpack();
    dbg!(&my_struct);
}
I did a demo on the Rust playground.
It works. My question is, is it actually safe?
I know that the Rc has only a single owner and nobody can change that behind my back, because it can't be accessed by other threads and we wrap it into Sendable, which does not allow access to the contained value.
But in some crazy world Rc could for example internally use thread local storage and this would not be safe... So is there some guarantee that I can do this?
I know that I must be extremely careful to not introduce some additional reason for the MyStruct to not be Send.
No.
There are multiple points that need to be verified to be able to send Rc across threads:
There can be no other handle (Rc or Weak) sharing ownership.
The content of Rc must be Send.
The implementation of Rc must use a thread-safe strategy.
Let's review them in order!
Guaranteeing the absence of aliasing
While your algorithm -- checking the counts yourself -- works for now, it would be better to simply ask Rc whether it is aliased or not.
fn is_aliased<T>(t: &mut Rc<T>) -> bool { Rc::get_mut(t).is_some() }
The implementation of get_mut will be adjusted should the implementation of Rc change in ways you have not foreseen.
Sendable content
While your implementation of MyStruct currently puts String (which is Send) into Rc, it could tomorrow change to Rc<str>, and then all bets are off.
Therefore, the sendable check needs to be implemented at the Rc level itself, otherwise you need to audit any change to whatever Rc holds.
fn sendable<T: Send>(mut t: Rc<T>) -> Result<Rc<T>, ...> {
    if !is_aliased(&mut t) {
        Ok(t)
    } else {
        ...
    }
}
Thread-safe Rc internals
And that... cannot be guaranteed.
Since Rc is not Send, its implementation can be optimized in a variety of ways:
The entire memory could be allocated using a thread-local arena.
The counters could be allocated using a thread-local arena, separately, so as to seamlessly convert to/from Box.
...
This is not the case at the moment, AFAIK, however the API allows it, so the next release could definitely take advantage of this.
What should you do?
You could make pack_for_sending unsafe, and dutifully document all assumptions that are counted on -- I suggest using get_mut to remove one of them. Then, on each new release of Rust, you'd have to double-check each assumption to ensure that your usage is still safe.
Or, if you do not mind making an allocation, you could write a conversion to Arc<T> yourself (see Playground):
fn into_arc<T>(this: Rc<T>) -> Result<Arc<T>, Rc<T>> {
    Rc::try_unwrap(this).map(|t| Arc::new(t))
}
Or, you could write a RFC proposing a Rc <-> Arc conversion!
The API would be:
fn Rc<T: Send>::into_arc(this: Self) -> Result<Arc<T>, Rc<T>>
fn Arc<T>::into_rc(this: Self) -> Result<Rc<T>, Arc<T>>
This could be made very efficiently inside std, and could be of use to others.
Then, you'd convert from MyStruct to MySendableStruct, just moving the fields and converting Rc to Arc as you go, send to another thread, then convert back to MyStruct.
And you would not need any unsafe...
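A rough sketch of that approach (illustrative only: MySendableStruct and into_arc are names from this proposal, not an existing std API, and the extra fields are omitted):
use std::rc::Rc;
use std::sync::Arc;

fn into_arc<T>(this: Rc<T>) -> Result<Arc<T>, Rc<T>> {
    Rc::try_unwrap(this).map(Arc::new)
}

pub struct MyStruct {
    reference_counted: Rc<String>,
    // more fields...
}

// Every Rc field is replaced by an Arc, so this type is Send automatically,
// with no `unsafe impl` needed.
pub struct MySendableStruct {
    reference_counted: Arc<String>,
}

impl MyStruct {
    pub fn pack_for_sending(self) -> Result<MySendableStruct, Rc<String>> {
        // Fails if the Rc is still aliased, just like the count checks.
        let reference_counted = into_arc(self.reference_counted)?;
        Ok(MySendableStruct { reference_counted })
    }
}

impl MySendableStruct {
    pub fn unpack(self) -> MyStruct {
        // The Arc was created from a unique Rc and never cloned, so
        // try_unwrap cannot fail here.
        let value = Arc::try_unwrap(self.reference_counted)
            .expect("Arc was never shared");
        MyStruct { reference_counted: Rc::new(value) }
    }
}

fn main() {
    let handle = std::thread::spawn(|| {
        let s = MyStruct { reference_counted: Rc::new("test".to_string()) };
        s.pack_for_sending().expect("Rc was still shared")
    });
    let _back: MyStruct = handle.join().unwrap().unpack();
}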
The only difference between Arc and Rc is that Arc uses atomic counters. The counters are only accessed when the pointer is cloned or dropped, so the difference between the two is negligible in applications which just share pointers between long-lived threads.
If you have never cloned the Rc, it is safe to send between threads. However, if you can guarantee that the pointer is unique then you can make the same guarantee about a raw value, without using a smart pointer at all!
This all seems quite fragile, for little benefit; future changes to the code might not meet your assumptions, and you will end up with Undefined Behaviour. I suggest that you at least try making some benchmarks with Arc. Only consider approaches like this when you measure a performance problem.
You might also consider using the archery crate, which provides a reference-counted pointer that abstracts over atomicity.

Is it possible to share a HashMap between threads without locking the entire HashMap?

I would like to have a shared struct between threads. The struct has many fields that are never modified and a HashMap, which is. I don't want to lock the whole HashMap for a single update/remove, so my HashMap looks something like HashMap<u8, Mutex<u8>>. This works, but it makes no sense since the thread will lock the whole map anyways.
Here's this working version, without threads; I don't think that's necessary for the example.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let s = Arc::new(Mutex::new(S::new()));
    let z = s.clone();
    let _ = z.lock().unwrap();
}

struct S {
    x: HashMap<u8, Mutex<u8>>, // other non-mutable fields
}

impl S {
    pub fn new() -> S {
        S {
            x: HashMap::default(),
        }
    }
}
Playground
Is this possible in any way? Is there something obvious I missed in the documentation?
I've been trying to get this working, but I'm not sure how. Basically every example I see there's always a Mutex (or RwLock, or something like that) guarding the inner value.
I don't see how your request is possible, at least not without some exceedingly clever lock-free data structures; what should happen if multiple threads need to insert new values that hash to the same location?
In previous work, I've used a RwLock<HashMap<K, Mutex<V>>>. When inserting a value into the hash, you get an exclusive lock for a short period. The rest of the time, you can have multiple threads with reader locks to the HashMap and thus to a given element. If they need to mutate the data, they can get exclusive access to the Mutex.
Here's an example:
use std::{
    collections::HashMap,
    sync::{Arc, Mutex, RwLock},
    thread,
    time::Duration,
};

fn main() {
    let data = Arc::new(RwLock::new(HashMap::new()));

    let threads: Vec<_> = (0..10)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || worker_thread(i, data))
        })
        .collect();

    for t in threads {
        t.join().expect("Thread panicked");
    }

    println!("{:?}", data);
}

fn worker_thread(id: u8, data: Arc<RwLock<HashMap<u8, Mutex<i32>>>>) {
    loop {
        // Assume that the element already exists
        let map = data.read().expect("RwLock poisoned");

        if let Some(element) = map.get(&id) {
            let mut element = element.lock().expect("Mutex poisoned");

            // Perform our normal work updating a specific element.
            // The entire HashMap only has a read lock, which
            // means that other threads can access it.
            *element += 1;
            thread::sleep(Duration::from_secs(1));

            return;
        }

        // If we got this far, the element doesn't exist

        // Get rid of our read lock and switch to a write lock
        // You want to minimize the time we hold the writer lock
        drop(map);
        let mut map = data.write().expect("RwLock poisoned");

        // We use HashMap::entry to handle the case where another thread
        // inserted the same key while we were unlocked.
        thread::sleep(Duration::from_millis(50));
        map.entry(id).or_insert_with(|| Mutex::new(0));

        // Let the loop start us over to try again
    }
}
This takes about 2.7 seconds to run on my machine, even though it starts 10 threads that each wait for 1 second while holding the exclusive lock to the element's data.
This solution isn't without issues, however. When there's a huge amount of contention for that one master lock, getting a write lock can take a while and completely kills parallelism.
In that case, you can switch to a RwLock<HashMap<K, Arc<Mutex<V>>>>. Once you have a read or write lock, you can then clone the Arc of the value, returning it and unlocking the hashmap.
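A small sketch of that variant (illustrative names, assuming the same u8-to-counter shape as the example above); the map lock is released as soon as the Arc has been cloned out, so the element can be locked and mutated without blocking access to the rest of the map:
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

type SharedMap = RwLock<HashMap<u8, Arc<Mutex<i32>>>>;

// Clone the Arc out while holding the map lock, then drop the map lock
// before doing any real work on the element.
fn get_or_insert(map: &SharedMap, key: u8) -> Arc<Mutex<i32>> {
    if let Some(value) = map.read().expect("RwLock poisoned").get(&key) {
        return Arc::clone(value);
    }
    let mut write = map.write().expect("RwLock poisoned");
    Arc::clone(write.entry(key).or_insert_with(|| Arc::new(Mutex::new(0))))
}

fn main() {
    let map: SharedMap = RwLock::new(HashMap::new());
    let element = get_or_insert(&map, 1);
    // The HashMap is completely unlocked here; only this element is locked.
    *element.lock().expect("Mutex poisoned") += 1;
    println!("{:?}", map.read().expect("RwLock poisoned").get(&1));
}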
The next step up would be to use a crate like arc-swap, which says:
Then one would lock, clone the [RwLock<Arc<T>>] and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.
The ArcSwap can be used instead, which solves the above problems and has better performance characteristics than the RwLock, both in contended and non-contended scenarios.
I often advocate for performing some kind of smarter algorithm. For example, you could spin up N threads each with their own HashMap. You then shard work among them. For the simple example above, you could use id % N_THREADS, for example. There are also complicated sharding schemes that depend on your data.
As Go has done a good job of evangelizing: do not communicate by sharing memory; instead, share memory by communicating.
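As a rough sketch of the sharding idea (my own illustration; N_THREADS and the routing rule are assumptions, not from this answer): each worker owns its own HashMap, and work is routed to it over a channel by key % N_THREADS, so nothing is ever shared or locked.
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

const N_THREADS: usize = 4;

fn main() {
    let mut senders = Vec::new();
    let mut workers = Vec::new();

    for _ in 0..N_THREADS {
        let (tx, rx) = mpsc::channel::<(u8, i32)>();
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // Each worker exclusively owns its shard; no locks needed.
            let mut shard: HashMap<u8, i32> = HashMap::new();
            for (key, delta) in rx {
                *shard.entry(key).or_insert(0) += delta;
            }
            shard.len()
        }));
    }

    for key in 0..=255u8 {
        // Route each key to the worker that owns its shard.
        senders[key as usize % N_THREADS].send((key, 1)).unwrap();
    }

    drop(senders); // closing the channels lets the workers finish
    for worker in workers {
        println!("shard held {} keys", worker.join().unwrap());
    }
}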
Suppose the key of the data is mappable to a u8.
You can have Arc<HashMap<u8, Mutex<HashMap<Key, Value>>>>.
When you initialize the data structure, you populate the entire first-level map before putting it in the Arc (it will be immutable after initialization).
When you want a value from the map, you need to do a double get, something like:
data.get(&map_to_u8(&key)).unwrap().lock().expect("poison").get(&key)
where the unwrap is safe because we initialized the first map with all the values.
To write to the map, something like:
data.get(&map_to_u8(id)).unwrap().lock().expect("poison").entry(id).or_insert_with(|| value);
It's easy to see that contention is reduced because we now have 256 Mutexes, and the probability of multiple threads contending for the same Mutex is low.
Shepmaster's example with 100 threads takes about 10 s on my machine; the following example takes a little more than 1 second.
use std::{
    collections::HashMap,
    sync::{Arc, Mutex, RwLock},
    thread,
    time::Duration,
};

fn main() {
    let mut inner = HashMap::new();
    for i in 0..=u8::max_value() {
        inner.insert(i, Mutex::new(HashMap::new()));
    }

    let data = Arc::new(inner);

    let threads: Vec<_> = (0..100)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || worker_thread(i, data))
        })
        .collect();

    for t in threads {
        t.join().expect("Thread panicked");
    }

    println!("{:?}", data);
}

fn worker_thread(id: u8, data: Arc<HashMap<u8, Mutex<HashMap<u8, Mutex<i32>>>>>) {
    loop {
        // The first unwrap is safe because we populated an entry for every `u8`
        if let Some(element) = data.get(&id).unwrap().lock().expect("poison").get(&id) {
            let mut element = element.lock().expect("Mutex poisoned");

            // Perform our normal work updating a specific element.
            // Only this shard's Mutex is held, which means that other
            // threads can still access the rest of the map.
            *element += 1;
            thread::sleep(Duration::from_secs(1));

            return;
        }

        // If we got this far, the element doesn't exist

        // We use HashMap::entry to handle the case where another thread
        // inserted the same key while we were unlocked.
        thread::sleep(Duration::from_millis(50));
        data.get(&id).unwrap().lock().expect("poison").entry(id).or_insert_with(|| Mutex::new(0));

        // Let the loop start us over to try again
    }
}
Maybe you want to consider evmap:
A lock-free, eventually consistent, concurrent multi-value map.
The trade-off is eventual consistency: readers do not see changes until the writer refreshes the map. A refresh is atomic, and the writer decides when to do it and expose new data to the readers.

What are the performance implications of consuming self and returning it?

I've been reading questions like Why does a function that accepts a Box<MyType> complain of a value being moved when a function that accepts self works?, Preferable pattern for getting around the "moving out of borrowed self" checker, and How to capture self consuming variable in a struct?, and now I'm curious about the performance characteristics of consuming self but possibly returning it to the caller.
To make a simpler example, imagine I want to make a collection type that's guaranteed to be non-empty. To achieve this, the "remove" operation needs to consume the collection and optionally return itself.
struct NonEmptyCollection { ... }

impl NonEmptyCollection {
    fn pop(mut self) -> Option<Self> {
        if self.len() == 1 {
            None
        } else {
            // really remove the element here
            Some(self)
        }
    }
}
(I suppose it should return the value it removed from the list too, but it's just an example.) Now let's say I call this function:
let mut c = NonEmptyCollection::new(...);
if let Some(new_c) = c.pop() {
    c = new_c
} else {
    // never use c again
}
What actually happens to the memory of the object? What if I have some code like:
let mut opt: Option<NonEmptyCollection> = Some(NonEmptyCollection::new(...));
opt = opt.take().and_then(|c| c.pop());
The function's signature can't guarantee that the returned object is actually the same one, so what optimizations are possible? Does something like the C++ return value optimization apply, allowing the returned object to be "constructed" in the same memory it was in before? If I have the choice between an interface like the above, and an interface where the caller has to deal with the lifetime:
enum PopResult {
    StillValid,
    Dead,
}

impl NonEmptyCollection {
    fn pop(&mut self) -> PopResult {
        // really remove the element
        if self.len() == 0 { PopResult::Dead } else { PopResult::StillValid }
    }
}
is there ever a reason to choose this dirtier interface for performance reasons? In the answer to the second example I linked, trentcl recommends storing Options in a data structure to allow the caller to do a change in-place instead of doing remove followed by insert every time. Would this dirty interface be a faster alternative?
YMMV
Depending on the optimizer's whim, you may end up with:
close to a no-op,
a few register moves,
a number of bit-copies.
This will depend on:
whether the call is inlined or not,
whether the caller re-assigns to the original variable or creates a fresh variable (and how well LLVM handles reusing dead space),
the size_of::<Self>().
The only guarantee you get is that no deep copy will occur, as there is no .clone() call.
For anything else, you need to check the LLVM IR or assembly.
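For a concrete way to do that (my own sketch; the type and build command are assumptions, not part of this answer), mark the by-value function #[inline(never)] so it survives as a separate symbol, build with --release, and inspect the output of e.g. cargo rustc --release -- --emit=asm or a compiler explorer:
pub struct NonEmpty {
    items: Vec<u32>,
}

impl NonEmpty {
    // Prevent inlining so the move in and out of `pop` stays visible
    // in the generated assembly.
    #[inline(never)]
    pub fn pop(mut self) -> Option<Self> {
        if self.items.len() == 1 {
            None
        } else {
            self.items.pop();
            Some(self)
        }
    }
}

fn main() {
    let c = NonEmpty { items: vec![1, 2, 3] };
    // In optimized builds the "move" is typically just a copy of the Vec's
    // (pointer, capacity, length) triple between registers or stack slots,
    // never a deep copy of the elements.
    let c = c.pop();
    println!("{}", c.is_some());
}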

Resources