Polling many futures of different types - rust

I'm trying to understand how to implement polling multiple futures with different types. For context, I'm calling an API that will return something like:
[{"type": "source_a", "id": 123}, {"type": "source_b", "id": 234}, ...]
Each type in the API response requires a call to another API, with each API returning different data types. The code I've written works something like this:
async fn get_data(sources: Vec<Source>) -> Data {
    let mut data = Default::default();
    for source in sources {
        if source.kind == "source_a" {
            let source_data = get_source_a(source).await;
            process_source_a(source_data, &mut data);
        } else if source.kind == "source_b" {
            // ...
        }
    }
    data
}
This won't run concurrently; it will simply fetch sources one at a time and process them. How can I rewrite this so that each source is fetched concurrently and then processed once data is available? Speaking Rustily, I think what I want is to execute a closure that mutably borrows data when the future is ready. Should I be looking at something like an Arc<RefCell<Data>>?

To process the futures in parallel, you need to await something like join_all, which will run them concurrently and return when they are all done. For this to work, you have to resolve two issues:
1. join_all requires futures of the same type, so you need to box them or otherwise unify them.
2. data needs to be accessed by multiple async blocks, so it needs to be protected by Arc and Mutex.
The first issue can be solved simply by spawning the async fns as tasks, which has the added advantage of potentially running them in parallel (in addition to them being run concurrently). The example below uses tokio::spawn, but it would be almost exactly the same for async_std. Since there is no reproducible example, I can't test the code, but it could look like this:
use std::sync::{Arc, Mutex};

async fn get_data(sources: Vec<Source>) -> Data {
    let data = Arc::new(Mutex::new(Data::default()));
    let mut tasks = vec![];
    for source in sources {
        if source.kind == "source_a" {
            let data = Arc::clone(&data);
            tasks.push(tokio::task::spawn(async move {
                let source_data = get_source_a(source).await;
                process_source_a(source_data, &mut data.lock().unwrap());
            }));
        } else if source.kind == "source_b" {
            // ...
        }
    }
    // Wait for all sources to finish and propagate the panic, if any.
    // With async_std this wouldn't require the `for_each()`.
    futures::future::join_all(tasks)
        .await
        .into_iter()
        .for_each(|x| x.unwrap());
    // As all tasks are done, there should be no other references to `data`
    // at this point, so we can extract it out of the Arc<Mutex<_>> wrapping.
    Arc::try_unwrap(data).unwrap().into_inner().unwrap()
}
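If you'd rather not spawn tasks, the other route the first point suggests is to unify the futures by boxing them. A minimal sketch of that alternative (not from the original answer; it reuses the question's hypothetical types and functions):

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};

async fn get_data(sources: Vec<Source>) -> Data {
    let data = Arc::new(Mutex::new(Data::default()));
    // Unify the differently-typed futures behind one boxed trait object type.
    let mut futs: Vec<Pin<Box<dyn Future<Output = ()>>>> = vec![];
    for source in sources {
        let data = Arc::clone(&data);
        if source.kind == "source_a" {
            futs.push(Box::pin(async move {
                let source_data = get_source_a(source).await;
                process_source_a(source_data, &mut data.lock().unwrap());
            }));
        } // ... other source kinds ...
    }
    // Run all the boxed futures concurrently on the current task.
    futures::future::join_all(futs).await;
    Arc::try_unwrap(data).unwrap().into_inner().unwrap()
}

Since everything runs on the current task here, nothing executes in parallel, which is why spawning is usually the better option.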

Related

How to understand FusedFuture

Recently, I have been reading the source code in the parity-bridges-common repo. Some of the .rs files are full of asynchronous syntax that I am not familiar with, especially around FusedFuture. Example files:
sync_loop.rs
message_loop.rs
According to futures::future::Fuse, my understanding is that we can convert a Future into a FusedFuture and then poll it again and again. Besides this, I found a function named terminated() on Fuse, with the example below.
Creates a new Fuse-wrapped future which is already terminated.
This can be useful in combination with looping and the select! macro, which bypasses terminated futures.
use futures::channel::mpsc;
use futures::future::{Fuse, FusedFuture, FutureExt};
use futures::stream::StreamExt;
use futures::{pin_mut, select};

let (sender, mut stream) = mpsc::unbounded();

// Send a few messages into the stream.
sender.unbounded_send(()).unwrap();
sender.unbounded_send(()).unwrap();
drop(sender);

// Use `Fuse::terminated()` to create an already-terminated future
// which may be instantiated later.
// bear: We must use terminated() here in order to use select!?
let foo_printer = Fuse::terminated();
pin_mut!(foo_printer);

loop {
    select! {
        _ = foo_printer => {},
        () = stream.select_next_some() => {
            if !foo_printer.is_terminated() {
                println!("Foo is already being printed!");
            } else {
                // bear: here we reset the value that foo_printer points to?
                foo_printer.set(async {
                    // do some other async operations
                    println!("Printing foo from `foo_printer` future");
                }.fuse());
            }
        },
        complete => break, // `foo_printer` is terminated and the stream is done
    }
}
I cannot understand when to use this function and how to combine it with select!. Can anyone help me understand it more easily? Or are there better docs or examples about this usage?
Some posts I found that may be useful:
Why doesn’t tokio::select! require FusedFuture?
What is the difference between futures::select! and tokio::select?
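For intuition on the first linked question, here is a minimal sketch adapted from the futures::select! documentation: futures::select! may poll the same futures again on a later loop iteration, so each branch must be a FusedFuture that reports through is_terminated() when it must no longer be polled. Plain futures are adapted on the spot with .fuse():

use futures::{future::FutureExt, pin_mut, select};

async fn task_one() { /* ... */ }
async fn task_two() { /* ... */ }

async fn race_tasks() {
    // `.fuse()` lets `select!` detect when a future has already
    // completed, so it won't be polled again afterwards.
    let t1 = task_one().fuse();
    let t2 = task_two().fuse();

    // `select!` also requires the futures to be pinned.
    pin_mut!(t1, t2);

    select! {
        () = t1 => println!("task one completed first"),
        () = t2 => println!("task two completed first"),
    }
}

tokio::select!, by contrast, evaluates fresh futures each time the macro runs and never polls a branch after one completes within a single invocation, so it doesn't need the fused marker; that is the gist of the second linked question.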

How do I spawn a long running Tokio task within another task without blocking the parent task?

I'm trying to construct an object that can manage a feed from a websocket but be able to switch between multiple feeds.
There is a Feed trait:
trait Feed {
    async fn start(&mut self);
    async fn stop(&mut self);
}
There are three structs that implement Feed: A, B, and C.
When start is called, it starts an infinite loop of listening for messages from a websocket and processing each one as it comes in.
I want to implement a FeedManager that maintains a single active feed but can receive commands to switch what feed source it is using.
enum FeedCommand {
    Start(String),
    Stop,
}

struct FeedManager {
    active_feed_handle: tokio::task::JoinHandle<()>,
    controller: mpsc::Receiver<FeedCommand>,
}

impl FeedManager {
    async fn start(&mut self) {
        while let Some(command) = self.controller.recv().await {
            match command {
                FeedCommand::Start(feed_type) => {
                    // somehow tell the active feed to stop (need channel probably) or kill the task?
                    if feed_type == "A" {
                        // replace active feed task with a new tokio task for consuming feed A
                    } else if feed_type == "B" {
                        // replace active feed task with a new tokio task for consuming feed B
                    } else {
                        // replace active feed task with a new tokio task for consuming feed C
                    }
                }
                FeedCommand::Stop => {
                    // stop the active feed
                }
            }
        }
    }
}
I'm struggling to understand how to manage all of the Tokio tasks properly. FeedManager's core loop is to listen forever for new commands that come in, but it needs to be able to spawn another long-lived task without blocking on it (so it can keep listening for commands).
My first attempt was:
if feed_type == "A" {
self.active_feed_handle = tokio::spawn(async {
A::new().start().await;
});
self.active_feed_handle.await
}
The .await on the handle would cause the core loop to no longer accept commands, right?
Can I omit that last .await and have the task still run?
Do I need to clean up the currently active task in some way?
You spawn a long-running Tokio task without blocking the parent task by spawning a task; that's a primary reason tasks exist. If you don't .await the task, then you won't wait for the task:
use std::time::Duration;
use tokio::{task, time}; // 1.3.0

#[tokio::main]
async fn main() {
    task::spawn(async {
        time::sleep(Duration::from_secs(100)).await;
        eprintln!(
            "You'll likely never see this printed \
             out because the parent task has exited \
             and so has the entire program"
        );
    });
}
See also:
What happens to an async task when it is aborted?
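For the cleanup question specifically, one common pattern is to abort() the old task's handle before spawning its replacement. A rough sketch under assumptions (it treats active_feed_handle as an Option<JoinHandle<()>>, which is not how the question declared it):

// Stop the old feed task, if any, before starting a new one.
if let Some(old) = self.active_feed_handle.take() {
    // Cancels the task at its next .await point; see the link
    // above for what abort() does and doesn't guarantee.
    old.abort();
}
self.active_feed_handle = Some(tokio::spawn(async {
    let mut feed = A::new();
    feed.start().await;
}));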
One way to do this would be to use Tokio's join!() macro, which takes multiple futures and awaits on all of them. You could then create multiple futures and join!() them together to await them collectively.
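A minimal sketch of join!() (illustrative only; the sleeps stand in for real feed work):

use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let a = async {
        sleep(Duration::from_millis(50)).await;
        "feed A"
    };
    let b = async {
        sleep(Duration::from_millis(80)).await;
        "feed B"
    };
    // Polls both futures concurrently on the current task and
    // returns only once both have completed.
    let (ra, rb) = tokio::join!(a, b);
    println!("{ra} / {rb}");
}

Note that because join!() waits for all of its futures, it would block the FeedManager command loop if used there; it fits better when the set of futures is known up front.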

Rust: Safe multi threading with recursion

I'm new to Rust.
For learning purposes, I'm writing a simple program to search for files in Linux, and it uses a recursive function:
use colored::Colorize;
use std::thread;

fn ffinder(base_dir: String, prmtr: &'static str, e: bool, h: bool) -> std::io::Result<()> {
    let mut handle_vec = vec![];
    let pth = std::fs::read_dir(&base_dir)?;
    for p in pth {
        let p2 = p?.path().clone();
        if p2.is_dir() {
            if !h { // search doesn't include hidden directories
                let sstring: String = get_fname(p2.display().to_string());
                let slice: String = sstring[..1].to_string();
                if slice != ".".to_string() {
                    let handle = thread::spawn(move || {
                        let _ = ffinder(p2.display().to_string(), prmtr, e, h);
                    });
                    handle_vec.push(handle);
                }
            } else { // search includes hidden directories
                let handle2 = thread::spawn(move || {
                    let _ = ffinder(p2.display().to_string(), prmtr, e, h);
                });
                handle_vec.push(handle2);
            }
        } else {
            let handle3 = thread::spawn(move || {
                if compare(rmv_underline(get_fname(p2.display().to_string())), rmv_underline(prmtr.to_string()), e) {
                    println!("File found at: {}", p2.display().to_string().blue());
                }
            });
            handle_vec.push(handle3);
        }
    }
    for h in handle_vec {
        h.join().unwrap();
    }
    Ok(())
}
I've tried using multithreading (thread::spawn); however, it can create too many threads, exceeding the OS limit and breaking the program's execution.
Is there a way to multithread with recursion using a safe, limited (fixed) number of threads?
As one of the commenters mentioned, this is an absolutely perfect case for using Rayon. The blog post mentioned doesn't show how Rayon might be used in recursion, only making an allusion to crossbeam's scoped threads with a broken link. However, Rayon provides its own scoped threads implementation that solves your problem as well, in that it only uses as many threads as you have cores available, avoiding the error you ran into.
Here's the documentation for it:
https://docs.rs/rayon/1.0.1/rayon/fn.scope.html
Here's an example from some code I recently wrote. Basically what it does is recursively scan a folder, and each time it nests into a folder it creates a new job to scan that folder while the current thread continues. In my own tests it vastly outperforms a single threaded approach.
use rayon::Scope;
use std::fs::{self, DirEntry};
use std::path::{Path, PathBuf};
use std::sync::mpsc::{self, Sender};

let source = PathBuf::from("/foo/bar/");
let (tx, rx) = mpsc::channel();
rayon::scope(|s| scan(&source, tx, s));

fn scan<'a, U: AsRef<Path>>(
    src: &U,
    tx: Sender<(Result<DirEntry, std::io::Error>, u64)>,
    scope: &Scope<'a>,
) {
    let dir = fs::read_dir(src).unwrap();
    dir.into_iter().for_each(|entry| {
        let info = entry.as_ref().unwrap();
        let path = info.path();
        if path.is_dir() {
            let tx = tx.clone();
            scope.spawn(move |s| scan(&path, tx, s)) // recursive call here
        } else {
            // dbg!("{}", path.as_os_str().to_string_lossy());
            let size = info.metadata().unwrap().len();
            tx.send((entry, size)).unwrap();
        }
    });
}
I'm not an expert on Rayon, but I'm fairly certain the threading strategy works like this:
Rayon creates a pool of threads to match the number of logical cores you have available in your environment. The first call to the scoped function creates a job that the first available thread "steals" from the queue of jobs available. Each time we make another recursive call, it doesn't necessarily execute immediately, but it creates a new job that an idle thread can then "steal" from the queue. If all of the threads are busy, the job queue just fills up each time we make another recursive call, and each time a thread finishes its current job it steals another job from the queue.
The full code can be found here: https://github.com/1Dragoon/fcp
(Note that the repo is a work in progress; the code there is currently broken and probably won't work at the time you're reading this.)
As an aside, I'm more of a sysadmin than an actual developer, so I also don't know if this is the ideal approach. From Rayon's documentation linked earlier:
scope() is a more flexible building block compared to join(), since a loop can be used to spawn any number of tasks without recursing
The language of that is a bit confusing. I'm not sure what they mean by "without recursing". join() seems intended for when you already know the tasks ahead of time, executing them in parallel as threads become available, whereas scope() seems more aimed at creating jobs only when you need them, rather than having to know everything you need to do in advance. Either that or I'm understanding their meaning backwards.
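For comparison, here is a minimal sketch (not from the Rayon docs) of the recursive divide-and-conquer shape that join() fits: it runs exactly two closures, potentially in parallel, so each level of recursion splits the work in two:

// Parallel sum via divide-and-conquer with rayon::join.
fn par_sum(slice: &[i64]) -> i64 {
    // Small inputs aren't worth the scheduling overhead.
    if slice.len() <= 1024 {
        return slice.iter().sum();
    }
    let (left, right) = slice.split_at(slice.len() / 2);
    // Each half may be stolen by another worker thread.
    let (a, b) = rayon::join(|| par_sum(left), || par_sum(right));
    a + b
}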

Can I miss a value by calling select on two async receivers?

Is it possible, if one task sends to a and another (at the same time) sends to b, that tokio::select! on a and b drops one of the values by cancelling the remaining future? Or is it guaranteed to be received on the next loop iteration?
use tokio::sync::mpsc::Receiver;

async fn foo(mut a: Receiver<()>, mut b: Receiver<()>) {
    loop {
        tokio::select! {
            _ = a.recv() => {
                println!("A!");
            }
            _ = b.recv() => {
                println!("B!");
            }
        }
    }
}
My mind can't get around what is really happening behind the async magic in that case.
It doesn't appear to be guaranteed in the documentation anywhere, but it is likely to work for reading directly from a channel because of the way Rust's poll-based architecture works. A select is equivalent to polling each of the futures in a random order until one of them is ready or, if none are, waiting until a waker is signaled and then repeating the process. A message is only removed from the channel when returned by a successful poll, and a successful poll stops the select, so the rest of the channels will not be touched. They will therefore be polled again on the next loop iteration and return the message then.
However, this is a dangerous approach, because if the receiver is replaced with something that returns a future doing anything more complex than a direct read, where it could potentially suspend after the read, then you could lose messages when that happens. As such, it should probably be treated as if it wouldn't work. A safer approach is to store the futures in mutable variables that you update when they fire. A stored future can't borrow its receiver across loop iterations, so the version below passes each receiver through a small helper that owns it while waiting and hands it back:
use tokio::sync::mpsc::Receiver;

// Owns the receiver while waiting, then hands it back, so the in-flight
// future can live across loop iterations without borrowing the receiver.
async fn recv_owned(mut rx: Receiver<()>) -> (Option<()>, Receiver<()>) {
    let msg = rx.recv().await;
    (msg, rx)
}

async fn foo(a: Receiver<()>, b: Receiver<()>) {
    let mut a_fut = Box::pin(recv_owned(a));
    let mut b_fut = Box::pin(recv_owned(b));
    loop {
        tokio::select! {
            (_, a) = &mut a_fut => {
                println!("A!");
                a_fut = Box::pin(recv_owned(a));
            }
            (_, b) = &mut b_fut => {
                println!("B!");
                b_fut = Box::pin(recv_owned(b));
            }
        }
    }
}

Join futures with limited concurrency

I have a large vector of Hyper HTTP request futures and want to resolve them into a vector of results. Since there is a limit on the maximum number of open files, I want to limit concurrency to N futures.
I've experimented with Stream::buffer_unordered, but it seems like it executed the futures one by one.
We've used code like this in a project to avoid opening too many TCP sockets. These futures have Hyper futures within, so it seems like exactly the same case.
// Convert the iterator into a `Stream`. We will process
// `PARALLELISM` futures at the same time, but with no specified
// order.
let all_done =
    futures::stream::iter(iterator_of_futures.map(Ok))
        .buffer_unordered(PARALLELISM);

// Everything after here is just using the stream in
// some manner, not directly related.
let mut successes = Vec::with_capacity(LIMIT);
let mut failures = Vec::with_capacity(LIMIT);

// Pull values off the stream, dividing them into success and
// failure buckets.
let mut all_done = all_done.into_future();
loop {
    match core.run(all_done) {
        Ok((None, _)) => break,
        Ok((Some(v), next_all_done)) => {
            successes.push(v);
            all_done = next_all_done.into_future();
        }
        Err((v, next_all_done)) => {
            failures.push(v);
            all_done = next_all_done.into_future();
        }
    }
}
This is used in a piece of example code, so the event loop (core) is explicitly driven. Watching the number of file handles used by the program showed that it was capped. Additionally, before this bottleneck was added, we quickly ran out of allowable file handles, whereas afterward we did not.
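The answer above predates async/await (core.run comes from the futures 0.1 / tokio-core era). A minimal sketch of the same idea with futures 0.3, using a hypothetical async fetch function in place of the Hyper calls:

use futures::stream::{self, StreamExt};

// Hypothetical stand-in for a Hyper request.
async fn fetch(url: String) -> Result<String, std::io::Error> {
    Ok(url)
}

#[tokio::main]
async fn main() {
    let urls: Vec<String> = (0..100).map(|i| format!("https://example.com/{i}")).collect();

    // At most 16 fetches are in flight at once; results arrive in
    // completion order.
    let results: Vec<Result<String, std::io::Error>> = stream::iter(urls)
        .map(fetch)
        .buffer_unordered(16)
        .collect()
        .await;

    let (successes, failures): (Vec<_>, Vec<_>) =
        results.into_iter().partition(Result::is_ok);
    println!("{} ok, {} failed", successes.len(), failures.len());
}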
