Using thread unsafe values in an async block - multithreading

In this code snippet (playground link), we have some simple communication between two threads. The main thread (which executes the second async block) sends 2 to the task spawned in the async move block, which receives it, adds its own value, and sends the result back over another channel to the main thread, which prints the value.
The spawned task contains some local state, the thread_unsafe variable, which is neither Send nor Sync and is held across an .await. Therefore the impl Future object that we are creating is itself neither Send nor Sync, and hence the call to pool.spawn_ok fails to compile.
However, this seems like it should be fine. I understand why spawn_ok() can't accept a future that is not Send, and I also understand why the compilation of the async block into a state machine results in a struct that contains a non-Send value, but in this example the only things I want to send to the other thread are recv and send2. How do I express that the future should switch to non-thread-safe mode only after it has been sent?
use std::rc::Rc;
use std::cell::RefCell;
use futures::channel::oneshot::channel;
use futures::executor::{ThreadPool, block_on};

fn main() {
    let pool = ThreadPool::new().unwrap();
    let (send, recv) = channel();
    let (send2, recv2) = channel();
    pool.spawn_ok(async move {
        let thread_unsafe = Rc::new(RefCell::new(40));
        let a = recv.await.unwrap();
        send2.send(a + *thread_unsafe.borrow()).unwrap();
    });
    let r = block_on(async {
        send.send(2).unwrap();
        recv2.await.unwrap()
    });
    println!("the answer is {}", r)
}

but in this example the only things I want to send to the other thread are recv and send2
There is also the local variable thread_unsafe, which is used across an .await. Since .await can suspend an async function and later resume it on another thread, this could move thread_unsafe to a different thread, which is not allowed.
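One way to express this, sketched against the example above: create the non-Send value only after the last .await, so it is never held across a suspension point. The compiler only considers variables that live across an .await when computing the future's auto traits, so the generated future stays Send:

use std::rc::Rc;
use std::cell::RefCell;
use futures::channel::oneshot::channel;
use futures::executor::{ThreadPool, block_on};

fn main() {
    let pool = ThreadPool::new().unwrap();
    let (send, recv) = channel();
    let (send2, recv2) = channel();
    pool.spawn_ok(async move {
        let a = recv.await.unwrap();
        // Created after the await, so it is never part of the state
        // saved at a suspension point and the future remains Send.
        let thread_unsafe = Rc::new(RefCell::new(40));
        send2.send(a + *thread_unsafe.borrow()).unwrap();
    });
    let r = block_on(async {
        send.send(2).unwrap();
        recv2.await.unwrap()
    });
    println!("the answer is {}", r)
}

If the value genuinely must live across an .await, replace it with a Send equivalent such as Arc<Mutex<_>> instead.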

Related

`RefCell<std::string::String>` cannot be shared between threads safely?

This is a continuation of How to re-use a value from the outer scope inside a closure in Rust?; I opened a new question for better presentation.
// main.rs
// The value will be modified eventually inside `main`
// and a http request should respond with whatever "current" value it holds.
let mut test_for_closure: Arc<RefCell<String>> = Arc::new(RefCell::from("Foo".to_string()));
// ...
// Handler for HTTP requests
// From https://docs.rs/hyper/0.14.8/hyper/service/fn.service_fn.html
let make_svc = make_service_fn(|_conn| async {
    Ok::<_, Infallible>(service_fn(|req: Request<Body>| async move {
        if req.version() == Version::HTTP_11 {
            let foo: String = *test_for_closure.borrow();
            Ok(Response::new(Body::from(foo.as_str())))
        } else {
            Err("not HTTP/1.1, abort connection")
        }
    }))
});
Unfortunately, I get RefCell<std::string::String> cannot be shared between threads safely:
RefCell only works within a single thread. You will need to use Mutex, which is similar but works across multiple threads. You can read more about Mutex here: https://doc.rust-lang.org/std/sync/struct.Mutex.html.
Here is an example of moving an Arc<Mutex<>> into a closure:
use std::sync::{Arc, Mutex};

fn main() {
    let test: Arc<Mutex<String>> = Arc::new(Mutex::from("Foo".to_string()));
    let test_for_closure = Arc::clone(&test);
    let closure = || async move {
        // lock it so it can't be used in other threads
        let foo = test_for_closure.lock().unwrap();
        println!("{}", foo);
    };
}
The first error in your error message is that Sync is not implemented for RefCell<String>. This is by design, as stated by Sync's rustdoc:
Types that are not Sync are those that have “interior mutability” in a
non-thread-safe form, such as Cell<T> and RefCell<T>. These types allow for
mutation of their contents even through an immutable, shared
reference. For example the set method on Cell<T> takes &self, so it
requires only a shared reference &Cell<T>. The method performs no
synchronization, thus Cell<T> cannot be Sync.
Thus it's not safe to share RefCells between threads, because you can cause a data race through a regular, shared reference.
But what if you wrap it in an Arc? Well, the rustdoc is quite clear again:
Arc<T> will implement Send and Sync as long as the T implements Send
and Sync. Why can’t you put a non-thread-safe type T in an Arc<T> to
make it thread-safe? This may be a bit counter-intuitive at first:
after all, isn’t the point of Arc<T> thread safety? The key is this:
Arc<T> makes it thread safe to have multiple ownership of the same
data, but it doesn’t add thread safety to its data. Consider
Arc<RefCell<T>>. RefCell<T> isn’t Sync, and if Arc<T> was always Send,
Arc<RefCell<T>> would be as well. But then we’d have a problem:
RefCell<T> is not thread safe; it keeps track of the borrowing count
using non-atomic operations.
In the end, this means that you may need to pair Arc<T> with some sort
of std::sync type, usually Mutex<T>.
For the same reason, Arc<T> will not be Sync unless T is Sync. Given that, you should probably use a std or tokio Mutex instead of RefCell.
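Applied to the original handler, that looks roughly like the following sketch (hyper 0.14 with tokio assumed, as in the question's link; the bind address and update logic are illustrative). The key moves are wrapping the string in Arc<Mutex<String>> and cloning the Arc once per connection and once per request, so each async move block owns its own handle:

use std::convert::Infallible;
use std::sync::{Arc, Mutex};
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server, Version};

#[tokio::main]
async fn main() {
    // Shared, mutable state: Mutex instead of RefCell.
    // `main` keeps `current` around so it can update the value later.
    let current = Arc::new(Mutex::new("Foo".to_string()));

    let shared = Arc::clone(&current);
    let make_svc = make_service_fn(move |_conn| {
        // One clone per connection...
        let shared = Arc::clone(&shared);
        async move {
            Ok::<_, Infallible>(service_fn(move |req: Request<Body>| {
                // ...and one per request, so every future owns its handle.
                let shared = Arc::clone(&shared);
                async move {
                    if req.version() == Version::HTTP_11 {
                        // The guard lives only for this statement, so it
                        // is never held across an .await.
                        let foo = shared.lock().unwrap().clone();
                        Ok(Response::new(Body::from(foo)))
                    } else {
                        Err("not HTTP/1.1, abort connection")
                    }
                }
            }))
        }
    });

    let addr = ([127, 0, 0, 1], 3000).into();
    Server::bind(&addr).serve(make_svc).await.unwrap();
}

Because the MutexGuard is confined to a single statement, it is never held across a suspension point, which also keeps the request future Send.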

What are idiomatic ways to send data between threads?

I want to do some calculation in a separate thread, and then recover the data from the main thread. What are the canonical ways to pass some data from one thread to another in Rust?
fn main() {
    let handle = std::thread::spawn(|| {
        // I want to send this to the main thread:
        String::from("Hello world!")
    });
    // How to recover the data from the other thread?
    handle.join().unwrap();
}
There are lots of ways to send data between threads -- without a clear "best" solution. It depends on your situation.
Using just thread::join
Many people do not realize that you can very easily send data with only the thread API, but only twice: once to the new thread and once back.
use std::thread;

let data_in = String::from("lots of data");
let handle = thread::spawn(move || {
    println!("{}", data_in); // we can use the data here!
    let data_out = heavy_computations();
    data_out // <-- simply return the data from the closure
});
let data_out = handle.join().expect("thread panicked :(");
println!("{}", data_out); // we can use the data generated in the thread here!
(Playground)
This is immensely useful for threads that are just spawned to do one specific job. Note the move keyword before the closure that makes sure all referenced variables are moved into the closure (which is then moved to another thread).
Channels from std
The standard library offers a multi producer single consumer channel in std::sync::mpsc. You can send arbitrarily many values through a channel, so it can be used in more situations. Simple example:
use std::{
    sync::mpsc::channel,
    thread,
    time::Duration,
};

let (sender, receiver) = channel();
thread::spawn(move || {
    sender.send("heavy computation 1").expect("receiver hung up :(");
    thread::sleep(Duration::from_millis(500));
    sender.send("heavy computation 2").expect("receiver hung up :(");
});
let result1 = receiver.recv().unwrap();
let result2 = receiver.recv().unwrap();
(Playground)
Of course you can create another channel to provide communication in the other direction as well.
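A minimal sketch of that two-channel setup (the names and the doubling workload are illustrative):

use std::sync::mpsc::channel;
use std::thread;

let (to_worker, from_main) = channel();
let (to_main, from_worker) = channel();

thread::spawn(move || {
    // Receive a request from the main thread, answer on the other channel.
    let n: i32 = from_main.recv().unwrap();
    to_main.send(n * 2).unwrap();
});

to_worker.send(21).unwrap();
assert_eq!(from_worker.recv().unwrap(), 42);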
More powerful channels by crossbeam
Unfortunately, the standard library currently only provides channels that are restricted to a single consumer (i.e. Receiver can't be cloned). To get more powerful channels, you probably want to use the channels from the awesome crossbeam library. Their description:
This crate is an alternative to std::sync::mpsc with more features and better performance.
In particular, it is an MPMC (multi-consumer!) channel. This provides a nice way to easily share work between multiple threads. Example:
use std::thread;

// You might want to use a bounded channel instead...
let (sender, receiver) = crossbeam_channel::unbounded();
for _ in 0..num_cpus::get() {
    let receiver = receiver.clone(); // clone for this thread
    thread::spawn(move || {
        for job in receiver {
            // process job
        }
    });
}

// Generate jobs
for x in 0..10_000 {
    sender.send(x).expect("all threads hung up :(");
}
(Playground)
Again, adding another channel allows you to communicate results back to the main thread.
Other methods
There are plenty of other crates that offer some other means of sending data between threads. Too many to list them here.
Note that sending data is not the only way to communicate between threads. There is also the possibility of sharing data between threads via Mutex, atomics, lock-free data structures, and many other ways. This is conceptually very different. Whether sending or sharing data better describes your cross-thread communication depends on the situation.
The idiomatic way to do so is to use a channel. It conceptually behaves like a unidirectional tunnel: you put something in one end and it comes out the other side.
use std::sync::mpsc::channel;

fn main() {
    let (sender, receiver) = channel();
    let handle = std::thread::spawn(move || {
        sender.send(String::from("Hello world!")).unwrap();
    });
    let data = receiver.recv().unwrap();
    println!("Got {:?}", data);
    handle.join().unwrap();
}
The channel stops working once the receiver is dropped: further sends will fail.
There are mainly three ways to recover the data:
recv will block until something is received.
try_recv will return immediately. If the channel is not closed, it returns either Ok(data) or Err(TryRecvError::Empty).
recv_timeout is like try_recv, but it waits for data for up to a certain amount of time.
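A small sketch of the three variants (the values are illustrative):

use std::sync::mpsc::{channel, RecvTimeoutError, TryRecvError};
use std::time::Duration;

let (sender, receiver) = channel::<i32>();

// try_recv returns immediately, whether or not data is available:
match receiver.try_recv() {
    Ok(data) => println!("got {}", data),
    Err(TryRecvError::Empty) => println!("nothing yet"),
    Err(TryRecvError::Disconnected) => println!("channel closed"),
}

sender.send(1).unwrap();

// recv blocks until a value arrives:
assert_eq!(receiver.recv().unwrap(), 1);

// recv_timeout blocks, but gives up after the given duration:
match receiver.recv_timeout(Duration::from_millis(100)) {
    Ok(data) => println!("got {}", data),
    Err(RecvTimeoutError::Timeout) => println!("timed out"),
    Err(RecvTimeoutError::Disconnected) => println!("channel closed"),
}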

How can I send a message to a specific thread?

I need to create several threads, some of which will keep running until their runner variable is changed. This is my minimal code.
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let log_runner = Arc::new(Mutex::new(true));
    println!("{}", *log_runner.lock().unwrap());
    let mut threads = Vec::new();
    {
        let log_runner_ref = Arc::clone(&log_runner);
        // log runner thread
        let handle = thread::spawn(move || {
            while *log_runner_ref.lock().unwrap() == true {
                // DO SOME THINGS CONTINUOUSLY
                println!("I'm a separate thread!");
            }
        });
        threads.push(handle);
    }
    // let the main thread sleep for x time
    thread::sleep(Duration::from_millis(1));
    // stop the log_runner thread
    *log_runner.lock().unwrap() = false;
    // join all threads
    for handle in threads {
        handle.join().unwrap();
        println!("Thread joined!");
    }
    println!("{}", *log_runner.lock().unwrap());
}
It looks like I'm able to set log_runner_ref to false from the main thread after its sleep. Is there a way to mark the threads with some name / ID or something similar and send a message to a specific thread using that marker (name / ID)?
If I understand it correctly, let (tx, rx) = mpsc::channel(); can be used for sending messages to all the threads simultaneously rather than to a specific one. I could send some identifier with each message and have every thread check it to decide whether to act on the message, but I would like to avoid the broadcasting effect.
MPSC stands for Multiple Producers, Single Consumer. As such, no, you cannot use that by itself to send a message to all threads, since for that you'd have to be able to duplicate the consumer. There are tools for this, but choosing one requires a bit more info than just "MPMC" or "SPMC".
Honestly, if you can rely on channels for messaging (there are cases where it'd be a bad idea), you can create a channel per thread, assign the ID outside of the thread, and keep a HashMap instead of a Vec with the IDs associated to the threads. Receiver<T> can be moved into the thread (it implements Send if T implements Send), so you can quite literally move it in.
You then keep the Senders outside and send stuff through them, as sketched below :-)
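A minimal sketch of that setup, assuming string IDs and String messages (both illustrative):

use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

fn main() {
    let mut senders: HashMap<&str, Sender<String>> = HashMap::new();
    let mut handles = Vec::new();

    for id in ["logger", "worker"] {
        let (tx, rx) = channel();
        senders.insert(id, tx);
        handles.push(thread::spawn(move || {
            // Each thread only ever sees messages sent to its own channel.
            for msg in rx {
                println!("{} received: {}", id, msg);
            }
        }));
    }

    // Message one specific thread -- no broadcast involved:
    senders["logger"].send("stop".to_string()).unwrap();

    drop(senders); // closing the channels ends the receive loops
    for handle in handles {
        handle.join().unwrap();
    }
}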

Is it possible to share a HashMap between threads without locking the entire HashMap?

I would like to have a shared struct between threads. The struct has many fields that are never modified and a HashMap, which is. I don't want to lock the whole HashMap for a single update/remove, so my HashMap looks something like HashMap<u8, Mutex<u8>>. This works, but it makes no sense since the thread will lock the whole map anyway.
Here's this working version, without threads; I don't think that's necessary for the example.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let s = Arc::new(Mutex::new(S::new()));
    let z = s.clone();
    let _ = z.lock().unwrap();
}

struct S {
    x: HashMap<u8, Mutex<u8>>, // other non-mutable fields
}

impl S {
    pub fn new() -> S {
        S {
            x: HashMap::default(),
        }
    }
}
Playground
Is this possible in any way? Is there something obvious I missed in the documentation?
I've been trying to get this working, but I'm not sure how. Basically, in every example I see, there's always a Mutex (or RwLock, or something like that) guarding the inner value.
I don't see how your request is possible, at least not without some exceedingly clever lock-free data structures; what should happen if multiple threads need to insert new values that hash to the same location?
In previous work, I've used a RwLock<HashMap<K, Mutex<V>>>. When inserting a value into the hash, you get an exclusive lock for a short period. The rest of the time, you can have multiple threads with reader locks to the HashMap and thus to a given element. If they need to mutate the data, they can get exclusive access to the Mutex.
Here's an example:
use std::{
    collections::HashMap,
    sync::{Arc, Mutex, RwLock},
    thread,
    time::Duration,
};

fn main() {
    let data = Arc::new(RwLock::new(HashMap::new()));

    let threads: Vec<_> = (0..10)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || worker_thread(i, data))
        })
        .collect();

    for t in threads {
        t.join().expect("Thread panicked");
    }

    println!("{:?}", data);
}

fn worker_thread(id: u8, data: Arc<RwLock<HashMap<u8, Mutex<i32>>>>) {
    loop {
        // Assume that the element already exists
        let map = data.read().expect("RwLock poisoned");

        if let Some(element) = map.get(&id) {
            let mut element = element.lock().expect("Mutex poisoned");

            // Perform our normal work updating a specific element.
            // The entire HashMap only has a read lock, which
            // means that other threads can access it.
            *element += 1;
            thread::sleep(Duration::from_secs(1));

            return;
        }

        // If we got this far, the element doesn't exist.
        // Get rid of our read lock and switch to a write lock.
        // You want to minimize the time we hold the writer lock.
        drop(map);
        let mut map = data.write().expect("RwLock poisoned");

        // We use HashMap::entry to handle the case where another thread
        // inserted the same key while we were unlocked.
        thread::sleep(Duration::from_millis(50));
        map.entry(id).or_insert_with(|| Mutex::new(0));

        // Let the loop start us over to try again
    }
}
This takes about 2.7 seconds to run on my machine, even though it starts 10 threads that each wait for 1 second while holding the exclusive lock to the element's data.
This solution isn't without issues, however. When there's a huge amount of contention for that one master lock, getting a write lock can take a while and completely kills parallelism.
In that case, you can switch to a RwLock<HashMap<K, Arc<Mutex<V>>>>. Once you have a read or write lock, you can then clone the Arc of the value, returning it and unlocking the hashmap.
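For example, a lookup might look like this sketch (same map shape as above, with the value wrapped in an Arc):

use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

fn get_element(
    data: &RwLock<HashMap<u8, Arc<Mutex<i32>>>>,
    id: u8,
) -> Option<Arc<Mutex<i32>>> {
    let map = data.read().expect("RwLock poisoned");
    // Cloning the Arc is cheap, and the read lock is released when
    // `map` goes out of scope -- before the caller locks the Mutex.
    map.get(&id).map(Arc::clone)
}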
The next step up would be to use a crate like arc-swap, which says:
Then one would lock, clone the [RwLock<Arc<T>>] and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.
The ArcSwap can be used instead, which solves the above problems and has better performance characteristics than the RwLock, both in contended and non-contended scenarios.
I often advocate for some kind of smarter algorithm instead. For example, you could spin up N threads, each with its own HashMap, and shard work among them. For the simple example above, you could shard by id % N_THREADS. There are also more complicated sharding schemes that depend on your data.
As Go has done a good job of evangelizing: do not communicate by sharing memory; instead, share memory by communicating.
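A sketch of that sharding idea, using one channel per shard thread (N_THREADS and the counting workload are illustrative):

use std::collections::HashMap;
use std::sync::mpsc::channel;
use std::thread;

const N_THREADS: usize = 4;

fn main() {
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..N_THREADS {
        let (tx, rx) = channel::<u8>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            // Each shard owns its map outright -- no locks at all.
            let mut map: HashMap<u8, i32> = HashMap::new();
            for id in rx {
                *map.entry(id).or_insert(0) += 1;
            }
            map
        }));
    }

    // Route every key to the shard that owns it.
    for id in 0..=u8::max_value() {
        senders[id as usize % N_THREADS].send(id).unwrap();
    }
    drop(senders); // closing the channels lets the shard loops finish

    for handle in handles {
        println!("{:?}", handle.join().unwrap());
    }
}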
Suppose the key of the data can be mapped to a u8.
You can have Arc<HashMap<u8, Mutex<HashMap<Key, Value>>>>.
When you initialize the data structure, you populate the entire first-level map before putting it in the Arc (it will be immutable after initialization).
When you want a value from the map, you need to do a double get, something like:
data.get(&map_to_u8(&key)).unwrap().lock().expect("poison").get(&key)
where the unwrap is safe because we initialized the first map with all the values.
To write to the map, something like:
data.get(&map_to_u8(id)).unwrap().lock().expect("poison").entry(id).or_insert_with(|| value);
It's easy to see that contention is reduced because we now have 256 Mutexes, and the probability of multiple threads contending for the same Mutex is low.
Shepmaster's example with 100 threads takes about 10 seconds on my machine; the following example takes a little more than 1 second.
use std::{
    collections::HashMap,
    sync::{Arc, Mutex},
    thread,
    time::Duration,
};

fn main() {
    let mut inner = HashMap::new();
    for i in 0..=u8::max_value() {
        inner.insert(i, Mutex::new(HashMap::new()));
    }
    let data = Arc::new(inner);

    let threads: Vec<_> = (0..100)
        .map(|i| {
            let data = Arc::clone(&data);
            thread::spawn(move || worker_thread(i, data))
        })
        .collect();

    for t in threads {
        t.join().expect("Thread panicked");
    }

    println!("{:?}", data);
}

fn worker_thread(id: u8, data: Arc<HashMap<u8, Mutex<HashMap<u8, Mutex<i32>>>>>) {
    loop {
        // The first unwrap is safe because we populated the outer map
        // with an entry for every `u8`.
        if let Some(element) = data.get(&id).unwrap().lock().expect("poison").get(&id) {
            let mut element = element.lock().expect("Mutex poisoned");

            // Perform our normal work updating a specific element.
            // Only this shard's Mutex is held, so threads working on
            // other shards are unaffected.
            *element += 1;
            thread::sleep(Duration::from_secs(1));

            return;
        }

        // If we got this far, the element doesn't exist yet.
        // We use HashMap::entry to handle the case where another thread
        // inserted the same key while the shard was unlocked.
        thread::sleep(Duration::from_millis(50));
        data.get(&id).unwrap().lock().expect("poison").entry(id).or_insert_with(|| Mutex::new(0));

        // Let the loop start us over to try again
    }
}
Maybe you want to consider evmap:
A lock-free, eventually consistent, concurrent multi-value map.
The trade-off is eventual consistency: readers do not see changes until the writer refreshes the map. A refresh is atomic, and the writer decides when to perform it and expose the new data to the readers.

Prevent `chan::Receiver` from blocking on empty buffer

I'd like to build a Multi-Producer Multi-Consumer (MPMC) channel with different concurrent tasks processing and producing data in it. Some of these tasks are responsible for interfacing with the filesystem or network.
Two examples:
PrintOutput(String) would be consumed by a logger, a console output, or a GUI.
NewJson(String) would be consumed by a logger or a parser.
To achieve this, I've selected chan as the MPMC channel provider and tokio as the system to manage event loops for each listener on the channel.
After reading the example on tokio's site, I began to implement futures::stream::Stream for chan::Receiver. This would allow the use of a for_each future to listen on the channel. However, the documentation of these two libraries highlights a conflict:
fn poll(&mut self) -> Poll<Option<Self::Item>, Self::Error>
Attempt to pull out the next value of this stream, returning None if the stream is finished.
This method, like Future::poll, is the sole method of pulling out a value from a stream. This method must also be run within the context of a task typically and implementors of this trait must ensure that implementations of this method do not block, as it may cause consumers to behave badly.
fn recv(&self) -> Option<T>
Receive a value on this channel.
If this is an asynchronous channel, recv only blocks when the buffer is empty.
If this is a synchronous channel, recv only blocks when the buffer is empty.
If this is a rendezvous channel, recv blocks until a corresponding send sends a value.
For all channels, if the channel is closed and the buffer is empty, then recv always and immediately returns None. (If the buffer is non-empty on a closed channel, then values from the buffer are returned.)
Values are guaranteed to be received in the same order that they are sent.
This operation will never panic! but it can deadlock if the channel is never closed.
chan::Receiver may block when the buffer is empty, but futures::stream::Stream expects to never block when polled.
If an empty buffer blocks, there isn't a clear way to confirm that it is empty. How do I check if the buffer is empty to prevent blocking?
Although Kabuki is on my radar and seems to be the most mature of the actor model crates, it almost entirely lacks documentation.
This is my implementation so far:
extern crate chan;
extern crate futures;

struct RX<T>(chan::Receiver<T>);

impl<T> futures::stream::Stream for RX<T> {
    type Item = T;
    type Error = Box<std::error::Error>;

    fn poll(&mut self) -> futures::Poll<Option<Self::Item>, Self::Error> {
        let &mut RX(ref receiver) = self;
        // Problem: recv() blocks while the buffer is empty, which
        // violates poll's contract of never blocking.
        let item = receiver.recv();
        match item {
            Some(value) => Ok(futures::Async::Ready(Some(value))),
            // recv() only returns None when the channel is closed,
            // so the stream is finished at that point.
            None => Ok(futures::Async::Ready(None)),
        }
    }
}
I've finished a quick test to see how it works. It seems alright but, as expected, it blocks once the buffer is drained. While this should work, I'm somewhat worried about what it means for a consumer to "behave badly". For now I'll continue to test this approach and hopefully I won't encounter bad behaviour.
extern crate chan;
extern crate futures;
extern crate tokio_core;

use futures::{Stream, Future};

fn my_test() {
    let mut core = tokio_core::reactor::Core::new().unwrap();
    let handle = core.handle();
    let (tx, rx) = chan::async::<String>();
    tx.send("Hello".to_string()); // fill the buffer before it blocks; single thread here.
    let incoming = RX(rx).for_each(|s| {
        println!("Result: {}", s);
        Ok(())
    });
    core.run(incoming).unwrap()
}
The chan crate provides a chan_select! macro that allows a non-blocking recv; but to implement Future (or Stream) for such primitives you also need to wake up the task when the channel becomes ready (see futures::task::current()).
You can implement Future by combining existing primitives; implementing new ones is usually more difficult. In this case you would probably have to fork chan to make it Future compatible.
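To illustrate the wakeup requirement, here is a sketch of the pattern against futures 0.1, using std's mpsc with try_recv in place of chan (which has no non-blocking recv method). The task storage is illustrative, and the sending side would have to call notify() on the stored task after each send:

extern crate futures;

use std::sync::{Arc, Mutex};
use std::sync::mpsc::{Receiver, TryRecvError};

struct Rx<T> {
    inner: Receiver<T>,
    // Shared with the sender side so it can notify us after a send.
    task: Arc<Mutex<Option<futures::task::Task>>>,
}

impl<T> futures::stream::Stream for Rx<T> {
    type Item = T;
    type Error = ();

    fn poll(&mut self) -> futures::Poll<Option<T>, ()> {
        match self.inner.try_recv() {
            // A value is ready right now.
            Ok(value) => Ok(futures::Async::Ready(Some(value))),
            // The channel is closed: the stream is finished.
            Err(TryRecvError::Disconnected) => Ok(futures::Async::Ready(None)),
            // Empty: register the current task before returning NotReady,
            // or this stream will never be polled again.
            Err(TryRecvError::Empty) => {
                *self.task.lock().unwrap() = Some(futures::task::current());
                Ok(futures::Async::NotReady)
            }
        }
    }
}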
It seems the multiqueue crate has a Future-compatible MPMC channel, mpmc_fut_queue.

Resources