What are idiomatic ways to send data between threads? - multithreading

I want to do some calculation in a separate thread, and then recover the data from the main thread. What are the canonical ways to pass some data from a thread to another in Rust?
fn main() {
let handle = std::thread::spawn(|| {
// I want to send this to the main thread:
String::from("Hello world!")
});
// How to recover the data from the other thread?
handle.join().unwrap();
}

There are lots of ways to send send data between threads -- without a clear "best" solution. It depends on your situation.
Using just thread::join
Many people do not realize that you can very easily send data with only the thread API, but only twice: once to the new thread and once back.
use std::thread;
let data_in = String::from("lots of data");
let handle = thread::spawn(move || {
println!("{}", data_in); // we can use the data here!
let data_out = heavy_compuations();
data_out // <-- simply return the data from the closure
});
let data_out = handle.join().expect("thread panicked :(");
println!("{}", data_out); // we can use the data generated in the thread here!
(Playground)
This is immensely useful for threads that are just spawned to do one specific job. Note the move keyword before the closure that makes sure all referenced variables are moved into the closure (which is then moved to another thread).
Channels from std
The standard library offers a multi producer single consumer channel in std::sync::mpsc. You can send arbitrarily many values through a channel, so it can be used in more situations. Simple example:
use std::{
sync::mpsc::channel,
thread,
time::Duration,
};
let (sender, receiver) = channel();
thread::spawn(move || {
sender.send("heavy computation 1").expect("receiver hung up :(");
thread::sleep(Duration::from_millis(500));
sender.send("heavy computation 2").expect("receiver hung up :(");
});
let result1 = receiver.recv().unwrap();
let result2 = receiver.recv().unwrap();
(Playground)
Of course you can create another channel to provide communication in the other direction as well.
More powerful channels by crossbeam
Unfortunately, the standard library currently only provides channels that are restricted to a single consumer (i.e. Receiver can't be cloned). To get more powerful channels, you probably want to use the channels from the awesome crossbeam library. Their description:
This crate is an alternative to std::sync::mpsc with more features and better performance.
In particular, it is a mpmc (multi consumer!) channel. This provides a nice way to easily share work between multiple threads. Example:
use std::thread;
// You might want to use a bounded channel instead...
let (sender, receiver) = crossbeam_channel::unbounded();
for _ in 0..num_cpus::get() {
let receiver = receiver.clone(); // clone for this thread
thread::spawn(move || {
for job in receiver {
// process job
}
});
}
// Generate jobs
for x in 0..10_000 {
sender.send(x).expect("all threads hung up :(");
}
(Playground)
Again, adding another channel allows you to communicate results back to the main thread.
Other methods
There are plenty of other crates that offer some other means of sending data between threads. Too many to list them here.
Note that sending data is not the only way to communicate between threads. There is also the possibility to share data between threads via Mutex, atomics, lock-free data structures and many other ways. This is conceptually very different. It depends on the situation whether sending or sharing data is the better way to describe your cross thread communication.

The idiomatic way to do so is to use a channel. It conceptually behaves like an unidirectional tunnel: you put something in one end and it comes out the other side.
use std::sync::mpsc::channel;
fn main() {
let (sender, receiver) = channel();
let handle = std::thread::spawn(move || {
sender.send(String::from("Hello world!")).unwrap();
});
let data = receiver.recv().unwrap();
println!("Got {:?}", data);
handle.join().unwrap();
}
The channel won't work anymore when the receiver is dropped.
They are mainly 3 ways to recover the data:
recv will block until something is received
try_recv will return immediately. If the channel is not closed, it is either Ok(data) or Err(TryRevcError::Empty).
recv_timeout is the same as try_recv but it waits to get a data a certain amount of time.

Related

How can I return a shared channel or data stream that can be consumed by multiple receivers?

my goal/use case:
Subscribe to a datafeed
Publish to internal subscribers
Example use case: subscribe to stock prices, consume from multiple different bots running on different threads within the same app.
In other languages, I'd be using an RX Subject and simply subscribe to that from anywhere else and I can choose which thread to observe the values on (threadpool or same thread, etc).
Here is my attempt using a simulated data feed:
Code:
async fn test_observable() -> Receiver<Decimal> {
let (x, response) = mpsc::channel::<Decimal>(100);
tokio::spawn(async move {
for i in 0..10 {
sleep(Duration::from_secs(1)).await;
x.send(Decimal::from(i)).await;
}
});
response
}
#[tokio::main]
async fn main() {
let mut o = test_observable().await;
while let Some(x) = o.recv().await {
println!("{}", x);
}
}
Questions:
Is this the right approach? I normally use RX in other languages but it is too complicated in Rust so I resorted to using Rust channels. RX for Rust
I think this approach won't work if I have multiple receivers. How do I work around that? I just want something like an RX observable, it should not be difficult to achieve that.
Is this creating any threads?

Using thread unsafe values in an async block

In this code snippet (playground link), we have some simple communication between two threads. The main thread (which executes the second async block) sends 2 to thread 2 in the async move block, which receives it, adds its own value, and sends the result back over another channel to the main thread, which prints the value.
Thread 2 contains some local state, the thread_unsafe variable, which is neither Send nor Sync, and is maintained across an .await. Therefore the impl Future object that we are creating is itself neither Send nor Sync, and hence the call to pool.spawn_ok is a compile error.
However, this seems like it should be fine. I understand why spawn_ok() can't accept a future that is not Send, and I also understand why the compilation of the async block into a state machine results in a struct that contains a non-Send value, but in this example the only thing I want to send to the other thread is recv and send2. How do I express that the future should switch to non-thread safe mode only after it has been sent?
use std::rc::Rc;
use std::cell::RefCell;
use futures::channel::oneshot::channel;
use futures::executor::{ThreadPool, block_on};
fn main() {
let pool = ThreadPool::new().unwrap();
let (send, recv) = channel();
let (send2, recv2) = channel();
pool.spawn_ok(async move {
let thread_unsafe = Rc::new(RefCell::new(40));
let a = recv.await.unwrap();
send2.send(a + *thread_unsafe.borrow()).unwrap();
});
let r = block_on(async {
send.send(2).unwrap();
recv2.await.unwrap()
});
println!("the answer is {}", r)
}
but in this example the only thing I want to send to the other thread is recv and send2
There is also the local variable thread_unsafe which is used across an .await. Since .await can suspend an async function, and later resume it on another thread, this could send thread_unsafe to a different thread, which is not allowed.

How can I send a message to a specific thread?

I need to create some threads where some of them are going to run until their runner variable value has been changed. This is my minimal code.
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;
fn main() {
let mut log_runner = Arc::new(Mutex::new(true));
println!("{}", *log_runner.lock().unwrap());
let mut threads = Vec::new();
{
let mut log_runner_ref = Arc::clone(&log_runner);
// log runner thread
let handle = thread::spawn(move || {
while *log_runner_ref.lock().unwrap() == true {
// DO SOME THINGS CONTINUOUSLY
println!("I'm a separate thread!");
}
});
threads.push(handle);
}
// let the main thread to sleep for x time
thread::sleep(Duration::from_millis(1));
// stop the log_runner thread
*log_runner.lock().unwrap() = false;
// join all threads
for handle in threads {
handle.join().unwrap();
println!("Thread joined!");
}
println!("{}", *log_runner.lock().unwrap());
}
It looks like I'm able to set the log_runner_ref in the log runner thread after 1 second to false. Is there a way to mark the treads with some name / ID or something similar and send a message to a specific thread using its specific marker (name / ID)?
If I understand it correctly, then the let (tx, rx) = mpsc::channel(); can be used for sending messages to all the threads simultaneously rather than to a specific one. I could send some identifier with the messages and each thread will be looking for its own identifier for the decision if to act on received message or not, but I would like to avoid the broadcasting effect.
MPSC stands for Multiple Producers, Single Consumer. As such, no, you cannot use that by itself to send a message to all threads, since for that you'd have to be able to duplicate the consumer. There are tools for this, but the choice of them requires a bit more info than just "MPMC" or "SPMC".
Honestly, if you can rely on channels for messaging (there are cases where it'd be a bad idea), you can create a channel per thread, assign the ID outside of the thread, and keep a HashMap instead of a Vec with the IDs associated to the threads. Receiver<T> can be moved into the thread (it implements Send if T implements Send), so you can quite literally move it in.
You then keep the Sender outside and send stuff to it :-)

Is it possible to share a HashMap between threads without locking the entire HashMap?

I would like to have a shared struct between threads. The struct has many fields that are never modified and a HashMap, which is. I don't want to lock the whole HashMap for a single update/remove, so my HashMap looks something like HashMap<u8, Mutex<u8>>. This works, but it makes no sense since the thread will lock the whole map anyways.
Here's this working version, without threads; I don't think that's necessary for the example.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
fn main() {
let s = Arc::new(Mutex::new(S::new()));
let z = s.clone();
let _ = z.lock().unwrap();
}
struct S {
x: HashMap<u8, Mutex<u8>>, // other non-mutable fields
}
impl S {
pub fn new() -> S {
S {
x: HashMap::default(),
}
}
}
Playground
Is this possible in any way? Is there something obvious I missed in the documentation?
I've been trying to get this working, but I'm not sure how. Basically every example I see there's always a Mutex (or RwLock, or something like that) guarding the inner value.
I don't see how your request is possible, at least not without some exceedingly clever lock-free data structures; what should happen if multiple threads need to insert new values that hash to the same location?
In previous work, I've used a RwLock<HashMap<K, Mutex<V>>>. When inserting a value into the hash, you get an exclusive lock for a short period. The rest of the time, you can have multiple threads with reader locks to the HashMap and thus to a given element. If they need to mutate the data, they can get exclusive access to the Mutex.
Here's an example:
use std::{
collections::HashMap,
sync::{Arc, Mutex, RwLock},
thread,
time::Duration,
};
fn main() {
let data = Arc::new(RwLock::new(HashMap::new()));
let threads: Vec<_> = (0..10)
.map(|i| {
let data = Arc::clone(&data);
thread::spawn(move || worker_thread(i, data))
})
.collect();
for t in threads {
t.join().expect("Thread panicked");
}
println!("{:?}", data);
}
fn worker_thread(id: u8, data: Arc<RwLock<HashMap<u8, Mutex<i32>>>>) {
loop {
// Assume that the element already exists
let map = data.read().expect("RwLock poisoned");
if let Some(element) = map.get(&id) {
let mut element = element.lock().expect("Mutex poisoned");
// Perform our normal work updating a specific element.
// The entire HashMap only has a read lock, which
// means that other threads can access it.
*element += 1;
thread::sleep(Duration::from_secs(1));
return;
}
// If we got this far, the element doesn't exist
// Get rid of our read lock and switch to a write lock
// You want to minimize the time we hold the writer lock
drop(map);
let mut map = data.write().expect("RwLock poisoned");
// We use HashMap::entry to handle the case where another thread
// inserted the same key while where were unlocked.
thread::sleep(Duration::from_millis(50));
map.entry(id).or_insert_with(|| Mutex::new(0));
// Let the loop start us over to try again
}
}
This takes about 2.7 seconds to run on my machine, even though it starts 10 threads that each wait for 1 second while holding the exclusive lock to the element's data.
This solution isn't without issues, however. When there's a huge amount of contention for that one master lock, getting a write lock can take a while and completely kills parallelism.
In that case, you can switch to a RwLock<HashMap<K, Arc<Mutex<V>>>>. Once you have a read or write lock, you can then clone the Arc of the value, returning it and unlocking the hashmap.
The next step up would be to use a crate like arc-swap, which says:
Then one would lock, clone the [RwLock<Arc<T>>] and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.
The ArcSwap can be used instead, which solves the above problems and has better performance characteristics than the RwLock, both in contended and non-contended scenarios.
I often advocate for performing some kind of smarter algorithm. For example, you could spin up N threads each with their own HashMap. You then shard work among them. For the simple example above, you could use id % N_THREADS, for example. There are also complicated sharding schemes that depend on your data.
As Go has done a good job of evangelizing: do not communicate by sharing memory; instead, share memory by communicating.
Suppose the key of the data is map-able to a u8
You can have Arc<HashMap<u8,Mutex<HashMap<Key,Value>>>
When you initialize the data structure you populate all the first level map before putting it in Arc (it will be immutable after initialization)
When you want a value from the map you will need to do a double get, something like:
data.get(&map_to_u8(&key)).unwrap().lock().expect("poison").get(&key)
where the unwrap is safe because we initialized the first map with all the value.
to write in the map something like:
data.get(&map_to_u8(id)).unwrap().lock().expect("poison").entry(id).or_insert_with(|| value);
It's easy to see contention is reduced because we now have 256 Mutex and the probability of multiple threads asking the same Mutex is low.
#Shepmaster example with 100 threads takes about 10s on my machine, the following example takes a little more than 1 second.
use std::{
collections::HashMap,
sync::{Arc, Mutex, RwLock},
thread,
time::Duration,
};
fn main() {
let mut inner = HashMap::new( );
for i in 0..=u8::max_value() {
inner.insert(i, Mutex::new(HashMap::new()));
}
let data = Arc::new(inner);
let threads: Vec<_> = (0..100)
.map(|i| {
let data = Arc::clone(&data);
thread::spawn(move || worker_thread(i, data))
})
.collect();
for t in threads {
t.join().expect("Thread panicked");
}
println!("{:?}", data);
}
fn worker_thread(id: u8, data: Arc<HashMap<u8,Mutex<HashMap<u8,Mutex<i32>>>>> ) {
loop {
// first unwrap is safe to unwrap because we populated for every `u8`
if let Some(element) = data.get(&id).unwrap().lock().expect("poison").get(&id) {
let mut element = element.lock().expect("Mutex poisoned");
// Perform our normal work updating a specific element.
// The entire HashMap only has a read lock, which
// means that other threads can access it.
*element += 1;
thread::sleep(Duration::from_secs(1));
return;
}
// If we got this far, the element doesn't exist
// Get rid of our read lock and switch to a write lock
// You want to minimize the time we hold the writer lock
// We use HashMap::entry to handle the case where another thread
// inserted the same key while where were unlocked.
thread::sleep(Duration::from_millis(50));
data.get(&id).unwrap().lock().expect("poison").entry(id).or_insert_with(|| Mutex::new(0));
// Let the loop start us over to try again
}
}
Maybe you want to consider evmap:
A lock-free, eventually consistent, concurrent multi-value map.
The trade-off is eventual-consistency: Readers do not see changes until the writer refreshes the map. A refresh is atomic and the writer decides when to do it and expose new data to the readers.

How can I use hyper::client from another thread?

I have multiple threads performing some heavy operations and I need to use a client in middle of work. I'm using Hyper v0.11 as a HTTP client and I would like to reuse the connections so I need to share the same hyper::Client in order to keep open the connections (under keep-alive mode).
The client is not shareable among threads (it doesn't implement Sync or Send). Here a small snippet with the code I've tried to do:
let mut core = Core::new().expect("Create Client Event Loop");
let handle = core.handle();
let remote = core.remote();
let client = Client::new(&handle.clone());
thread::spawn(move || {
// intensive operations...
let response = &client.get("http://google.com".parse().unwrap()).and_then(|res| {
println!("Response: {}", res.status());
Ok(())
});
remote.clone().spawn(|_| {
response.map(|_| { () }).map_err(|_| { () })
});
// more intensive operations...
});
core.run(futures::future::empty::<(), ()>()).unwrap();
This code doesn't compile:
thread::spawn(move || {
^^^^^^^^^^^^^ within `[closure#src/load-balancer.rs:46:19: 56:6 client:hyper::Client<hyper::client::HttpConnector>, remote:std::sync::Arc<tokio_core::reactor::Remote>]`, the trait `std::marker::Send` is not implemented for `std::rc::Weak<std::cell::RefCell<tokio_core::reactor::Inner>>`
thread::spawn(move || {
^^^^^^^^^^^^^ within `[closure#src/load-balancer.rs:46:19: 56:6 client:hyper::Client<hyper::client::HttpConnector>, remote:std::sync::Arc<tokio_core::reactor::Remote>]`, the trait `std::marker::Send` is not implemented for `std::rc::Rc<std::cell::RefCell<hyper::client::pool::PoolInner<tokio_proto::util::client_proxy::ClientProxy<tokio_proto::streaming::message::Message<hyper::http::MessageHead<hyper::http::RequestLine>, hyper::Body>, tokio_proto::streaming::message::Message<hyper::http::MessageHead<hyper::http::RawStatus>, tokio_proto::streaming::body::Body<hyper::Chunk, hyper::Error>>, hyper::Error>>>>`
...
remote.clone().spawn(|_| {
^^^^^ the trait `std::marker::Sync` is not implemented for `futures::Future<Error=hyper::Error, Item=hyper::Response> + 'static`
Is there any way to reuse the same client from different threads or some other approach?
The short answer is no, but it's better that way.
Each Client object holds a pool of connections. Here's how Hyper's Pool is defined in version 0.11.0:
pub struct Pool<T> {
inner: Rc<RefCell<PoolInner<T>>>,
}
As inner is reference-counted with an Rc and borrow-checked in run-time with RefCell, the pool is certainly not thread-safe. When you tried to move that Client to a new thread, that object would be holding a pool that lives in another thread, which would have been a source of data races.
This implementation is understandable. Attempting to reuse an HTTP connection across multiple threads is not very usual, as it requires synchronized access to a resource that is mostly I/O intensive. This couples pretty well with Tokio's asynchronous nature. It is actually more reasonable to perform multiple requests in the same thread, and let Tokio's core take care of sending messages and receiving them asynchronously, without waiting for each response in sequence. Moreover, computationally intensive tasks can be executed by a CPU pool from futures_cpupool. With that in mind, the code below works fine:
extern crate tokio_core;
extern crate hyper;
extern crate futures;
extern crate futures_cpupool;
use tokio_core::reactor::Core;
use hyper::client::Client;
use futures::Future;
use futures_cpupool::CpuPool;
fn main() {
let mut core = Core::new().unwrap();
let handle = core.handle();
let client = Client::new(&handle.clone());
let pool = CpuPool::new(1);
println!("Begin!");
let req = client.get("http://google.com".parse().unwrap())
.and_then(|res| {
println!("Response: {}", res.status());
Ok(())
});
let intensive = pool.spawn_fn(|| {
println!("I'm working hard!!!");
std::thread::sleep(std::time::Duration::from_secs(1));
println!("Phew!");
Ok(())
});
let task = req.join(intensive)
.map(|_|{
println!("End!");
});
core.run(task).unwrap();
}
If the response is not received too late, the output will be:
Begin!
I'm working hard!!!
Response: 302 Found
Phew!
End!
If you have multiple tasks running in separate threads, the problem becomes open-ended, since there are multiple architectures feasible. One of them is to delegate all communications to a single actor, thus requiring all other worker threads to send their data to it. Alternatively, you can have one client object to each worker, thus also having separate connection pools.

Resources