Preferred method for awaiting concurrent threads - multithreading

I have a program that loops over HTTP responses. These don't depend on each other, so they can be done simultaneously. I am using threads to do this:
extern crate hyper;
use std::thread;
use std::sync::Arc;
use hyper::Client;
fn main() {
let client = Arc::new(Client::new());
for num in 0..10 {
let client_helper = client.clone();
thread::spawn(move || {
client_helper.get(&format!("http://example.com/{}", num))
.send().unwrap();
}).join().unwrap();
}
}
This works, but I can see other possibilities of doing this such as:
let mut threads = vec![];
threads.push(thread::spawn(move || {
    /* snip */
}));
for thread in threads {
    let _ = thread.join();
}
It would also make sense to me to use a function that returns the thread handle, but I couldn't figure out how to do that; I'm not sure what the return type has to be.
What is the optimal/recommended way to wait for concurrent threads in Rust?

Your first program does not actually have any parallelism. Each time you spin up a worker thread, you immediately wait for it to finish before you start the next one. This is, of course, worse than useless.
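To make the difference concrete, here is a minimal sketch that spawns every worker first and only then joins them. It also shows that a function handing back a thread handle returns std::thread::JoinHandle<T>, where T is the closure's return type. The names spawn_worker and run_all are hypothetical, and the arithmetic merely stands in for the HTTP request.

```rust
use std::thread::{self, JoinHandle};

// Hypothetical helper: starts one worker and hands back its handle.
// `T` in `JoinHandle<T>` is whatever the closure returns (here `u32`).
fn spawn_worker(num: u32) -> JoinHandle<u32> {
    thread::spawn(move || {
        // Stand-in for the real work (e.g. the HTTP request).
        num * 2
    })
}

// Spawn all workers first, then join them, so they can run concurrently.
fn run_all(count: u32) -> Vec<u32> {
    let handles: Vec<JoinHandle<u32>> = (0..count).map(spawn_worker).collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let results = run_all(10);
    println!("{:?}", results);
}
```

The intermediate collect() matters: iterators are lazy, so chaining the spawn and join steps without it would start and join one thread at a time, which is the sequential behaviour described above.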
The second way works, but there are crates that do some of the busywork for you. For example, scoped_threadpool and crossbeam have thread pools that allow you to write something like (untested, may contain mistakes):
let client = &Client::new(); // No Arc needed
run_in_pool(|scope| {
    for num in 0..10 {
        scope.spawn(move || {
            client.get(&format!("http://example.com/{}", num)).send().unwrap();
        });
    }
});
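Since Rust 1.63 the standard library also offers std::thread::scope, which allows the same kind of borrowing without an Arc. The following is only a sketch: the hypothetical fetch_len function stands in for the HTTP call, and base stands in for the shared client.

```rust
use std::thread;

// Hypothetical stand-in for the HTTP request: builds the URL and returns its length.
fn fetch_len(base: &str, num: u32) -> usize {
    format!("{}/{}", base, num).len()
}

fn fetch_all(base: &str, count: u32) -> Vec<usize> {
    // All threads spawned inside the scope are joined before `scope` returns,
    // so they may borrow `base` directly; no `Arc` is needed.
    thread::scope(|scope| {
        let handles: Vec<_> = (0..count)
            .map(|num| scope.spawn(move || fetch_len(base, num)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let lengths = fetch_all("http://example.com", 3);
    println!("{:?}", lengths);
}
```

Any thread that is not joined explicitly is still joined automatically when the scope ends, so forgetting a handle cannot leave a worker running past the borrowed data.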

Related

Why the channel in the example code of tokio::sync::Notify is a mpsc?

I'm learning Tokio's synchronization primitives. In the example code for Notify, I find it hard to understand why Channel<T> is an mpsc (single-consumer) channel.
use tokio::sync::Notify;
use std::collections::VecDeque;
use std::sync::Mutex;
struct Channel<T> {
values: Mutex<VecDeque<T>>,
notify: Notify,
}
impl<T> Channel<T> {
pub fn send(&self, value: T) {
self.values.lock().unwrap()
.push_back(value);
// Notify the consumer a value is available
self.notify.notify_one();
}
// This is a single-consumer channel, so several concurrent calls to
// `recv` are not allowed.
pub async fn recv(&self) -> T {
loop {
// Drain values
if let Some(value) = self.values.lock().unwrap().pop_front() {
return value;
}
// Wait for values to be available
self.notify.notified().await;
}
}
}
If there are elements in values, a consumer task takes one.
If values is empty, the consumer task yields until a producer notifies it.
But after writing some test code, I could not find a case where a consumer loses a notification from a producer.
Could someone give me test code that shows the above Channel<T> failing when used as an mpmc channel?
The following code shows why it is unsafe to use the above channel as mpmc.
use std::sync::Arc;
#[tokio::main]
async fn main() {
let mut i = 0;
loop{
let ch = Arc::new(Channel {
values: Mutex::new(VecDeque::new()),
notify: Notify::new(),
});
let mut handles = vec![];
for i in 0..100{
if i % 2 == 1{
for _ in 0..2{
let sender = ch.clone();
tokio::spawn(async move{
sender.send(1);
});
}
}else{
for _ in 0..2{
let receiver = ch.clone();
let handle = tokio::spawn(async move{
receiver.recv().await;
});
handles.push(handle);
}
}
}
futures::future::join_all(handles).await;
i += 1;
println!("No.{i} loop finished.");
}
}
If the next iteration of the loop never starts, some consumer task never finished, which means that it missed a notification.
Quote from the documentation you linked:
If you have two calls to recv and two calls to send in parallel, the following could happen:
Both calls to try_recv return None.
Both new elements are added to the vector.
The notify_one method is called twice, adding only a single permit to the Notify.
Both calls to recv reach the Notified future. One of them consumes the permit, and the other sleeps forever.
Replace try_recv with self.values.lock().unwrap().pop_front() in our case; the rest of the explanation stays identical.
The third point is the important one: multiple calls to notify_one result in only a single permit if no task is waiting yet. And there is a short time window in which multiple tasks may have already checked for an item but aren't waiting yet.
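The single-permit rule can be illustrated with a toy model. This is only a std-based sketch of the documented behaviour that notify_one stores at most one permit when nobody is waiting; PermitModel and its methods are hypothetical and are not how Tokio implements Notify.

```rust
use std::sync::Mutex;

// Toy model: at most one permit can be stored, like `Notify` with no waiter.
struct PermitModel {
    permit: Mutex<bool>,
}

impl PermitModel {
    fn new() -> Self {
        PermitModel { permit: Mutex::new(false) }
    }

    // Storing a permit twice still leaves exactly one permit.
    fn notify_one(&self) {
        *self.permit.lock().unwrap() = true;
    }

    // Returns true if a permit was available, and consumes it.
    // A real `notified().await` would sleep instead of returning false.
    fn try_consume(&self) -> bool {
        let mut permit = self.permit.lock().unwrap();
        std::mem::replace(&mut *permit, false)
    }
}

// Two producers notify before either consumer starts waiting.
fn two_notifies_then_two_waiters() -> (bool, bool) {
    let model = PermitModel::new();
    model.notify_one();
    model.notify_one();
    (model.try_consume(), model.try_consume())
}

fn main() {
    let (first, second) = two_notifies_then_two_waiters();
    println!("first consumer woke: {first}, second consumer woke: {second}");
}
```

In the model the second consumer finds no permit even though two values were sent. In the real channel that consumer would have already checked the queue, so it sleeps forever although a value is waiting for it.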

How to tokio::join multiple tasks?

Imagine that some futures are stored in a Vec whose length is determined at run time, and you need to join these futures concurrently. What should you do?
Going by the example in the tokio::join documentation, one option is to enumerate every length the Vec could have (1, 2, 3, ...) and handle each case separately.
extern crate tokio;
let mut v = Vec::new();
v.push(future_1);
// directly or indirectly you push many futures to the vector
v.push(future_N);
// to join these futures concurrently one possible way is
if v.len() == 0 {}
if v.len() == 1 { join!(v.pop()); }
if v.len() == 2 { join!(v.pop(), v.pop() ); }
// ...
I also noticed that the documentation says tokio::join! takes a list of futures as its parameter, but when I use syntax like
tokio::join!(v);
or something like
tokio::join![ v ] / tokio::join![ v[..] ] / tokio::join![ v[..][..] ]
it just doesn't work.
So my question is: is there a more efficient way to join these futures, or am I missing something in the documentation?
You can use futures::future::join_all to "merge" your collection of futures together into a single future, that resolves when all of the subfutures resolve.
join_all and try_join_all, as well as more versatile FuturesOrdered and FuturesUnordered utilities from the same crate futures, are executed as a single task. This is probably fine if the constituent futures are not often concurrently ready to perform work, but if you want to make use of CPU parallelism with the multi-threaded runtime, consider spawning the individual futures as separate tasks and waiting on the tasks to finish.
Tokio 1.21.0 or later: JoinSet
With recent Tokio releases, you can use JoinSet to get the maximum flexibility, including the ability to abort all tasks. The tasks in the set are also aborted when JoinSet is dropped.
use tokio::task::JoinSet;
let mut set = JoinSet::new();
for fut in v {
set.spawn(fut);
}
while let Some(res) = set.join_next().await {
let out = res?;
// ...
}
Older API
Spawn tasks with tokio::spawn and wait on the join handles:
use futures::future;
// ...
let outputs = future::try_join_all(v.into_iter().map(tokio::spawn)).await?;
You can also use the FuturesOrdered and FuturesUnordered combinators to process the outputs asynchronously in a stream:
use futures::stream::FuturesUnordered;
use futures::prelude::*;
// ...
let mut completion_stream = v.into_iter()
.map(tokio::spawn)
.collect::<FuturesUnordered<_>>();
while let Some(res) = completion_stream.next().await {
// ...
}
One caveat with waiting for tasks this way is that the tasks are not cancelled when the future (e.g. an async block) that has spawned the task and possibly owns the returned JoinHandle gets dropped. The JoinHandle::abort method needs to be used to explicitly cancel the task.
A full example:
use std::time::Duration;
use futures::stream::FuturesUnordered;
use tokio::time::sleep;
#[tokio::main]
async fn main() {
    let tasks = (0..5).map(|i| tokio::spawn(async move {
        sleep(Duration::from_secs(1)).await; // simulate some work
        i * 2
    })).collect::<FuturesUnordered<_>>();
    let result = futures::future::join_all(tasks).await;
    println!("{:?}", result); // [Ok(8), Ok(6), Ok(4), Ok(2), Ok(0)]
}

How can I use hyper::client from another thread?

I have multiple threads performing some heavy operations, and I need to use an HTTP client in the middle of that work. I'm using Hyper v0.11 as the HTTP client and would like to reuse connections, so I need to share the same hyper::Client to keep the connections open (in keep-alive mode).
The client is not shareable among threads (it implements neither Sync nor Send). Here is a small snippet of what I've tried:
let mut core = Core::new().expect("Create Client Event Loop");
let handle = core.handle();
let remote = core.remote();
let client = Client::new(&handle.clone());
thread::spawn(move || {
// intensive operations...
let response = &client.get("http://google.com".parse().unwrap()).and_then(|res| {
println!("Response: {}", res.status());
Ok(())
});
remote.clone().spawn(|_| {
response.map(|_| { () }).map_err(|_| { () })
});
// more intensive operations...
});
core.run(futures::future::empty::<(), ()>()).unwrap();
This code doesn't compile:
thread::spawn(move || {
^^^^^^^^^^^^^ within `[closure#src/load-balancer.rs:46:19: 56:6 client:hyper::Client<hyper::client::HttpConnector>, remote:std::sync::Arc<tokio_core::reactor::Remote>]`, the trait `std::marker::Send` is not implemented for `std::rc::Weak<std::cell::RefCell<tokio_core::reactor::Inner>>`
thread::spawn(move || {
^^^^^^^^^^^^^ within `[closure#src/load-balancer.rs:46:19: 56:6 client:hyper::Client<hyper::client::HttpConnector>, remote:std::sync::Arc<tokio_core::reactor::Remote>]`, the trait `std::marker::Send` is not implemented for `std::rc::Rc<std::cell::RefCell<hyper::client::pool::PoolInner<tokio_proto::util::client_proxy::ClientProxy<tokio_proto::streaming::message::Message<hyper::http::MessageHead<hyper::http::RequestLine>, hyper::Body>, tokio_proto::streaming::message::Message<hyper::http::MessageHead<hyper::http::RawStatus>, tokio_proto::streaming::body::Body<hyper::Chunk, hyper::Error>>, hyper::Error>>>>`
...
remote.clone().spawn(|_| {
^^^^^ the trait `std::marker::Sync` is not implemented for `futures::Future<Error=hyper::Error, Item=hyper::Response> + 'static`
Is there any way to reuse the same client from different threads or some other approach?
The short answer is no, but it's better that way.
Each Client object holds a pool of connections. Here's how Hyper's Pool is defined in version 0.11.0:
pub struct Pool<T> {
inner: Rc<RefCell<PoolInner<T>>>,
}
As inner is reference-counted with an Rc and borrow-checked in run-time with RefCell, the pool is certainly not thread-safe. When you tried to move that Client to a new thread, that object would be holding a pool that lives in another thread, which would have been a source of data races.
This implementation is understandable. Attempting to reuse an HTTP connection across multiple threads is not very usual, as it requires synchronized access to a resource that is mostly I/O intensive. This couples pretty well with Tokio's asynchronous nature. It is actually more reasonable to perform multiple requests in the same thread, and let Tokio's core take care of sending messages and receiving them asynchronously, without waiting for each response in sequence. Moreover, computationally intensive tasks can be executed by a CPU pool from futures_cpupool. With that in mind, the code below works fine:
extern crate tokio_core;
extern crate hyper;
extern crate futures;
extern crate futures_cpupool;
use tokio_core::reactor::Core;
use hyper::client::Client;
use futures::Future;
use futures_cpupool::CpuPool;
fn main() {
let mut core = Core::new().unwrap();
let handle = core.handle();
let client = Client::new(&handle.clone());
let pool = CpuPool::new(1);
println!("Begin!");
let req = client.get("http://google.com".parse().unwrap())
.and_then(|res| {
println!("Response: {}", res.status());
Ok(())
});
let intensive = pool.spawn_fn(|| {
println!("I'm working hard!!!");
std::thread::sleep(std::time::Duration::from_secs(1));
println!("Phew!");
Ok(())
});
let task = req.join(intensive)
.map(|_|{
println!("End!");
});
core.run(task).unwrap();
}
If the response is not received too late, the output will be:
Begin!
I'm working hard!!!
Response: 302 Found
Phew!
End!
If you have multiple tasks running in separate threads, the problem becomes open-ended, since several architectures are feasible. One of them is to delegate all communication to a single actor, which requires every other worker thread to send its data to that actor. Alternatively, you can give each worker its own client object, and therefore its own connection pool.
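The single-actor architecture can be sketched with standard-library channels. Here a hypothetical FakeClient (made non-Send by holding an Rc) stands in for hyper::Client; it is created inside the actor thread and never leaves it, while workers send it requests together with a reply channel. All names in this sketch are assumptions for illustration.

```rust
use std::rc::Rc;
use std::sync::mpsc::{self, Sender};
use std::thread;

// Hypothetical stand-in for a client that is neither `Send` nor `Sync`.
struct FakeClient {
    prefix: Rc<String>,
}

impl FakeClient {
    fn get(&self, path: &str) -> String {
        format!("{}{}", self.prefix, path)
    }
}

// A request carries the path and a channel on which to send the reply.
struct Request {
    path: String,
    reply: Sender<String>,
}

// Starts the actor thread; the client is built inside it and never moves.
fn spawn_client_actor() -> Sender<Request> {
    let (tx, rx) = mpsc::channel::<Request>();
    thread::spawn(move || {
        let client = FakeClient { prefix: Rc::new("response for ".to_string()) };
        // The loop ends once every `Sender<Request>` has been dropped.
        for request in rx {
            let _ = request.reply.send(client.get(&request.path));
        }
    });
    tx
}

// Called from any worker thread: sends a request and waits for the reply.
fn fetch_via_actor(actor: &Sender<Request>, path: &str) -> String {
    let (reply_tx, reply_rx) = mpsc::channel();
    actor
        .send(Request { path: path.to_string(), reply: reply_tx })
        .expect("actor thread has stopped");
    reply_rx.recv().expect("actor dropped the reply channel")
}

fn main() {
    let actor = spawn_client_actor();
    let workers: Vec<_> = (0..3)
        .map(|i| {
            let actor = actor.clone();
            thread::spawn(move || fetch_via_actor(&actor, &format!("/item/{}", i)))
        })
        .collect();
    for worker in workers {
        println!("{}", worker.join().expect("worker panicked"));
    }
}
```

Only the cloneable Sender crosses thread boundaries, so the connection pool stays on one thread and is still shared by every worker.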

Running interruptible Rust program that spawns threads

I am trying to write a program that spawns a bunch of threads and then joins the threads at the end. I want it to be interruptible, because my plan is to make this a constantly running program in a UNIX service.
The idea is that worker_pool will contain all the threads that have been spawned, so terminate can be called at any time to collect them.
I can't seem to find a way to utilize the chan_select crate to do this, because this requires I spawn a thread first to spawn my child threads, and once I do this I can no longer use the worker_pool variable when joining the threads on interrupt, because it had to be moved out for the main loop. If you comment out the line in the interrupt that terminates the workers, it compiles.
I'm a little frustrated, because this would be really easy to do in C. I could set up a static pointer, but when I try and do that in Rust I get an error because I am using a vector for my threads, and I can't initialize to an empty vector in a static. I know it is safe to join the workers in the interrupt code, because execution stops here waiting for the signal.
Perhaps there is a better way to do the signal handling, or maybe I'm missing something that I can do.
The error and code follow:
MacBook8088:video_ingest pjohnson$ cargo run
Compiling video_ingest v0.1.0 (file:///Users/pjohnson/projects/video_ingest)
error[E0382]: use of moved value: `worker_pool`
--> src/main.rs:30:13
|
24 | thread::spawn(move || run(sdone, &mut worker_pool));
| ------- value moved (into closure) here
...
30 | worker_pool.terminate();
| ^^^^^^^^^^^ value used here after move
<chan macros>:42:47: 43:23 note: in this expansion of chan_select! (defined in <chan macros>)
src/main.rs:27:5: 35:6 note: in this expansion of chan_select! (defined in <chan macros>)
|
= note: move occurs because `worker_pool` has type `video_ingest::WorkerPool`, which does not implement the `Copy` trait
main.rs
#[macro_use]
extern crate chan;
extern crate chan_signal;
extern crate video_ingest;
use chan_signal::Signal;
use video_ingest::WorkerPool;
use std::thread;
use std::ptr;
///
/// Starts processing
///
fn main() {
let mut worker_pool = WorkerPool { join_handles: vec![] };
// Signal gets a value when the OS sent a INT or TERM signal.
let signal = chan_signal::notify(&[Signal::INT, Signal::TERM]);
// When our work is complete, send a sentinel value on `sdone`.
let (sdone, rdone) = chan::sync(0);
// Run work.
thread::spawn(move || run(sdone, &mut worker_pool));
// Wait for a signal or for work to be done.
chan_select! {
signal.recv() -> signal => {
println!("received signal: {:?}", signal);
worker_pool.terminate(); // <-- Comment out to compile
},
rdone.recv() => {
println!("Program completed normally.");
}
}
}
fn run(sdone: chan::Sender<()>, worker_pool: &mut WorkerPool) {
loop {
worker_pool.ingest();
worker_pool.terminate();
}
}
lib.rs
extern crate libc;
use std::thread;
use std::thread::JoinHandle;
use std::os::unix::thread::JoinHandleExt;
use libc::pthread_join;
use libc::c_void;
use std::ptr;
use std::time::Duration;
pub struct WorkerPool {
pub join_handles: Vec<JoinHandle<()>>
}
impl WorkerPool {
///
/// Does the actual ingestion
///
pub fn ingest(&mut self) {
// Use 9 threads for an example.
for i in 0..10 {
self.join_handles.push(
thread::spawn(move || {
// Get the videos
println!("Getting videos for thread {}", i);
thread::sleep(Duration::new(5, 0));
})
);
}
}
///
/// Joins all threads
///
pub fn terminate(&mut self) {
println!("Total handles: {}", self.join_handles.len());
for handle in &self.join_handles {
println!("Joining thread...");
unsafe {
let mut state_ptr: *mut *mut c_void = 0 as *mut *mut c_void;
pthread_join(handle.as_pthread_t(), state_ptr);
}
}
self.join_handles = vec![];
}
}
"terminate can be called at any time to collect them."
"I don't want to stop the threads; I want to collect them with join. I agree stopping them would not be a good idea."
These two statements don't make sense to me. You can only join a thread when it's complete. The word "interruptible" and "at any time" would mean that you could attempt to stop a thread while it is still doing some processing. Which behavior do you want?
If you want to be able to stop a thread that has partially completed, you have to enhance your code to check if it should exit early. This is usually complicated by the fact that you are doing some big computation that you don't have control over. Ideally, you break that up into chunks and check your exit flag frequently. For example, with video work, you could check every frame. Then the response delay is roughly the time to process a frame.
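A minimal sketch of that idea, assuming the work can be split into frames. The function process_frames and the frame counts are hypothetical; the worker checks a shared AtomicBool once per frame and returns how many frames it finished.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Processes up to `total_frames` frames, checking the exit flag before each one.
// Returns the number of frames actually processed.
fn process_frames(total_frames: u32, should_exit: &AtomicBool) -> u32 {
    let mut processed = 0;
    for _frame in 0..total_frames {
        if should_exit.load(Ordering::SeqCst) {
            break; // exit early; the delay is at most one frame of work
        }
        // Stand-in for the real per-frame work.
        processed += 1;
    }
    processed
}

fn main() {
    let should_exit = Arc::new(AtomicBool::new(false));
    let worker_flag = Arc::clone(&should_exit);
    let worker = thread::spawn(move || process_frames(1_000, &worker_flag));

    // In a real program this would be set when the interrupt arrives.
    should_exit.store(true, Ordering::SeqCst);

    let processed = worker.join().expect("worker thread panicked");
    println!("processed {} frames before stopping", processed);
}
```

How many frames the worker completes in main depends on timing, which is the point: the thread stops at the next check and not at an arbitrary instruction.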
"this would be really easy to do in C."
This would be really easy to do incorrectly. For example, the code currently presented attempts to perform mutation to the pool from two different threads without any kind of synchronization. That's a sure-fire recipe to make broken, hard-to-debug code.
The comment says "// Use 9 threads for an example.", but 0..10 creates 10 threads.
Anyway, it seems like the missing piece of knowledge is Arc and Mutex. Arc allows sharing ownership of a single item between threads, and Mutex allows for run-time mutable borrowing between threads.
#[macro_use]
extern crate chan;
extern crate chan_signal;
use chan_signal::Signal;
use std::thread::{self, JoinHandle};
use std::sync::{Arc, Mutex};
fn main() {
let worker_pool = Arc::new(Mutex::new(WorkerPool::new()));
let signal = chan_signal::notify(&[Signal::INT, Signal::TERM]);
let (work_done_tx, work_done_rx) = chan::sync(0);
let worker_pool_clone = worker_pool.clone();
thread::spawn(move || run(work_done_tx, worker_pool_clone));
// Wait for a signal or for work to be done.
chan_select! {
signal.recv() -> signal => {
println!("received signal: {:?}", signal);
let mut pool = worker_pool.lock().expect("Unable to lock the pool");
pool.terminate();
},
work_done_rx.recv() => {
println!("Program completed normally.");
}
}
}
fn run(_work_done_tx: chan::Sender<()>, worker_pool: Arc<Mutex<WorkerPool>>) {
loop {
let mut worker_pool = worker_pool.lock().expect("Unable to lock the pool");
worker_pool.ingest();
worker_pool.terminate();
}
}
pub struct WorkerPool {
join_handles: Vec<JoinHandle<()>>,
}
impl WorkerPool {
pub fn new() -> Self {
WorkerPool {
join_handles: vec![],
}
}
pub fn ingest(&mut self) {
self.join_handles.extend(
(0..10).map(|i| {
thread::spawn(move || {
println!("Getting videos for thread {}", i);
})
})
)
}
pub fn terminate(&mut self) {
for handle in self.join_handles.drain(..) {
handle.join().expect("Unable to join thread")
}
}
}
Beware that the program logic itself is still poor; even though an interrupt is sent, the loop in run continues to execute. The main thread will lock the mutex, join all the current threads [1], unlock the mutex and exit the program. However, the loop can lock the mutex before the main thread has exited and start processing some new data! And then the program exits right in the middle of processing. It's almost the same as if you didn't handle the interrupt at all.
[1]: Haha, tricked you! There are no running threads at that point. Since the mutex is locked for the entire loop, the only time another lock can be made is when the loop is resetting. However, since the last instruction in the loop is to join all the threads, there won't be any more running.
"I don't want to let the program terminate before all threads have completed."
Perhaps it's an artifact of the reduced problem, but I don't see how the infinite loop can ever exit, so the "I'm done" channel seems superfluous.
I'd probably just add a flag that says "please stop" when an interrupt is received. Then I'd check that instead of the infinite loop and wait for the running thread to finish before exiting the program.
use std::sync::atomic::{AtomicBool, Ordering};
fn main() {
let worker_pool = WorkerPool::new();
let signal = chan_signal::notify(&[Signal::INT, Signal::TERM]);
let please_stop = Arc::new(AtomicBool::new(false));
let threads_please_stop = please_stop.clone();
let runner = thread::spawn(|| run(threads_please_stop, worker_pool));
// Wait for a signal
chan_select! {
signal.recv() -> signal => {
println!("received signal: {:?}", signal);
please_stop.store(true, Ordering::SeqCst);
},
}
runner.join().expect("Unable to join runner thread");
}
fn run(please_stop: Arc<AtomicBool>, mut worker_pool: WorkerPool) {
while !please_stop.load(Ordering::SeqCst) {
worker_pool.ingest();
worker_pool.terminate();
}
}

What do I use to share an object with many threads and one writer in Rust?

What is the right approach to share a common object between many threads when the object may sometimes be written to by one owner?
I tried to create one Configuration trait object with several methods to get and set config keys. I'd like to pass this to other threads where configuration items may be read. Bonus points would be if it can be written and read by everyone.
I found a Reddit thread which talks about Rc and RefCell; would that be the right way? I think these would not allow me to borrow the object immutably multiple times and still mutate it.
Rust has a built-in concurrency primitive exactly for this task called RwLock. Together with Arc, it can be used to implement what you want:
use std::sync::{Arc, RwLock};
use std::sync::mpsc;
use std::thread;
const N: usize = 12;
let shared_data = Arc::new(RwLock::new(Vec::new()));
let (finished_tx, finished_rx) = mpsc::channel();
for i in 0..N {
let shared_data = shared_data.clone();
let finished_tx = finished_tx.clone();
if i % 4 == 0 {
thread::spawn(move || {
let mut guard = shared_data.write().expect("Unable to lock");
guard.push(i);
finished_tx.send(()).expect("Unable to send");
});
} else {
thread::spawn(move || {
let guard = shared_data.read().expect("Unable to lock");
println!("From {}: {:?}", i, *guard);
finished_tx.send(()).expect("Unable to send");
});
}
}
// wait until everything's done
for _ in 0..N {
let _ = finished_rx.recv();
}
println!("Done");
This example is very silly but it demonstrates what RwLock is and how to use it.
Also note that Rc and RefCell/Cell are not appropriate in a multithreaded environment because they are not synchronized. Rust won't even allow you to use them with thread::spawn(). To share data between threads you must use an Arc, and to share mutable data you must additionally use one of the synchronization primitives like RwLock or Mutex.
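Applied to the configuration object from the question, a sketch might look like the following. Config, get, and set are hypothetical names, and a HashMap<String, String> stands in for whatever the real configuration holds.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical shared configuration: cheap to clone, safe to share.
#[derive(Clone, Default)]
struct Config {
    inner: Arc<RwLock<HashMap<String, String>>>,
}

impl Config {
    // Many readers may hold the read lock at the same time.
    fn get(&self, key: &str) -> Option<String> {
        self.inner.read().expect("lock poisoned").get(key).cloned()
    }

    // A writer gets exclusive access for the duration of the update.
    fn set(&self, key: &str, value: &str) {
        self.inner
            .write()
            .expect("lock poisoned")
            .insert(key.to_string(), value.to_string());
    }
}

fn main() {
    let config = Config::default();
    config.set("mode", "fast");

    let readers: Vec<_> = (0..4)
        .map(|i| {
            let config = config.clone();
            thread::spawn(move || println!("reader {}: {:?}", i, config.get("mode")))
        })
        .collect();
    for reader in readers {
        reader.join().expect("reader thread panicked");
    }
}
```

Because every clone of Config points at the same RwLock, any thread holding a clone can also call set, which covers the case where the object is written and read by everyone.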

Resources