How to wait for tokio tasks to finish?

I am trying to write to a HashMap behind the Arc<Mutex<T>> pattern as part of a website-scraping exercise inspired by the Rust Cookbook.
This first part uses the tokio runtime. I can't get past the point where the tasks complete and the HashMap is returned: the program just hangs.
type Db = Arc<Mutex<HashMap<String, bool>>>;

pub async fn handle_async_tasks(db: Db) -> BoxResult<HashMap<String, bool>> {
    let links = NodeUrl::new("https://www.inverness-courier.co.uk/")
        .await
        .unwrap();
    let arc = db.clone();
    let mut handles = Vec::new();
    for link in links.links_with_paths {
        let x = arc.clone();
        handles.push(tokio::spawn(async move {
            process(x, link).await;
        }));
    }
    // for handle in handles {
    //     handle.await.expect("Task panicked!");
    // } <-- I tried this as well
    futures::future::join_all(handles).await;
    let readables = arc.lock().await;
    for (key, value) in readables.clone().into_iter() {
        println!("Checking db: k, v ==>{} / {}", key, value);
    }
    let clone_db = readables.clone();
    return Ok(clone_db);
}

async fn process(db: Db, url: Url) {
    let mut db = db.lock().await;
    println!("checking {}", url);
    if check_link(&url).await.is_ok() {
        db.insert(url.to_string(), true);
    } else {
        db.insert(url.to_string(), false);
    }
}

async fn check_link(url: &Url) -> BoxResult<bool> {
    let res = reqwest::get(url.as_ref()).await?;
    Ok(res.status() != StatusCode::NOT_FOUND)
}

pub struct NodeUrl {
    domain: String,
    pub links_with_paths: Vec<Url>,
}

#[tokio::main]
async fn main() {
    let db: Db = Arc::new(Mutex::new(HashMap::new()));
    let db = futures::executor::block_on(task::handle_async_tasks(db));
}
I would like to return the HashMap to main(), where the main thread blocks on the result. How can I wait for all the spawned tasks to complete and then return the HashMap?

let links = NodeUrl::new("https://www.some-site.com/.co.uk/").await.unwrap();
This doesn't seem like a valid URL to me.
async fn process(db: Db, url: Url) {
let mut db = db.lock().await;
println!("checking {}", url);
if check_link(&url).await.is_ok() {
db.insert(url.to_string(), true);
} else {
db.insert(url.to_string(), false);
}
}
This is highly problematic. You hold the exclusive lock on the database for the duration of the entire request, which makes your application effectively serial.
The default timeout in reqwest is 30 seconds, so if the server isn't responsive and you have a lot of links to go through, the program might just seem to 'hang'.
You should hold the database lock for as short a time as possible - just to do the insert:
async fn process(db: Db, url: Url) {
    println!("checking {}", url);
    if check_link(&url).await.is_ok() {
        let mut db = db.lock().await;
        db.insert(url.to_string(), true);
    } else {
        let mut db = db.lock().await;
        db.insert(url.to_string(), false);
    }
}
Or even better, eliminating the useless if:
async fn process(db: Db, url: Url) {
    println!("checking {}", url);
    let valid = check_link(&url).await.is_ok();
    db.lock().await.insert(url.to_string(), valid);
}
Finally, you didn't show your main function; the way you invoke handle_async_tasks, or other things you have running alongside it, might also be problematic.

My main issue was how to handle the MutexGuard, which I solved in the end by cloning the guarded HashMap and returning the inner value.
There was no need to use a futures::executor in main: since we are already inside the tokio runtime, calling .await was sufficient to obtain the final value.
Cloning the Arc once per task was enough; I had been cloning it twice before passing it into the multi-threaded context.
Thanks to @orlp for pointing out the flawed logic around the check_link function.
pub async fn handle_async_tasks() -> BoxResult<HashMap<String, bool>> {
    let get_links = NodeUrl::new("https://www.invernesscourier.co.uk/")
        .await
        .unwrap();
    let db: Db = Arc::new(Mutex::new(HashMap::new()));
    let mut handles = Vec::new();
    for link in get_links.links_with_paths {
        let x = db.clone();
        handles.push(tokio::spawn(async move {
            process(x, link).await;
        }));
    }
    futures::future::join_all(handles).await;
    let guard = db.lock().await;
    let cloned = guard.clone();
    Ok(cloned)
}

#[tokio::main]
async fn main() {
    let db = task::handle_async_tasks().await.unwrap();
    for (key, value) in db.into_iter() {
        println!("Checking db: {} / {}", key, value);
    }
}
This is by no means the best Rust code, but I wanted to share how I tackled things in the end.
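One refinement worth noting (an editorial sketch, not part of the original answer): because tokio::spawn returns a JoinHandle, join_all yields a Vec of Results, so task panics can be surfaced rather than silently discarded:

let results = futures::future::join_all(handles).await;
for res in results {
    // Each entry is Err(JoinError) if the corresponding task panicked.
    res.expect("a scraping task panicked");
}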

Related

deno_runtime running multiple invokes on single worker concurrently

I'm trying to run multiple invocations of the same script on a single deno MainWorker concurrently, and to wait for their results (since the scripts can be async). Conceptually, I want something like the loop in run_worker below.
type Tx = Sender<(String, Sender<String>)>;
type Rx = Receiver<(String, Sender<String>)>;

struct Runner {
    worker: MainWorker,
    futures: FuturesUnordered<Pin<Box<dyn Future<Output = (String, Result<Global<Value>, Error>)>>>>,
    response_futures: FuturesUnordered<Pin<Box<dyn Future<Output = (String, Result<(), SendError<String>>)>>>>,
    result_senders: HashMap<String, Sender<String>>,
}
impl Runner {
    fn new() ...

    async fn run_worker(&mut self, rx: &mut Rx, main_module: ModuleSpecifier, user_module: ModuleSpecifier) {
        self.worker.execute_main_module(&main_module).await.unwrap();
        self.worker.preload_side_module(&user_module).await.unwrap();
        loop {
            tokio::select! {
                msg = rx.recv() => {
                    if let Some((id, sender)) = msg {
                        let global = self.worker.js_runtime.execute_script("test", "mod.entry()").unwrap();
                        self.result_senders.insert(id, sender);
                        self.futures.push(Box::pin(async {
                            let resolved = self.worker.js_runtime.resolve_value(global).await;
                            return (id, resolved);
                        }));
                    }
                },
                script_result = self.futures.next() => {
                    if let Some((id, out)) = script_result {
                        self.response_futures.push(Box::pin(async {
                            let value = deserialize_value(out.unwrap(), &mut self.worker);
                            let res = self.result_senders.remove(&id).unwrap().send(value).await;
                            return (id.clone(), res);
                        }));
                    }
                },
                // also handle response_futures here
                else => break,
            }
        }
    }
}
The worker can't be borrowed mutably multiple times, so this won't compile. So the worker has to go into a RefCell, and I've created a BorrowingFuture:
struct BorrowingFuture {
    worker: RefCell<MainWorker>,
    global: Global<Value>,
    id: String,
}
And its poll implementation:
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
    match Pin::new(&mut Box::pin(self.worker.borrow_mut().js_runtime.resolve_value(self.global.clone()))).poll(cx) {
        Poll::Ready(result) => Poll::Ready((self.id.clone(), result)),
        Poll::Pending => {
            cx.waker().clone().wake_by_ref();
            Poll::Pending
        }
    }
}
So the above

self.futures.push(Box::pin(async {
    let resolved = self.worker.js_runtime.resolve_value(global).await;
    return (id, resolved);
}));

would become

self.futures.push(Box::pin(BorrowingFuture { worker: self.worker, global: global.clone(), id: id.clone() }));

and this would have to be done for the response_futures above as well.
But I see a few issues with this:

- Creating a new future on every poll and then polling that seems wrong, but it does work.
- It probably has a performance impact, because new objects are created on every poll.
- The same issue would occur for the response futures, which would call send on each poll, which seems completely wrong.
- waker.wake_by_ref is called on every poll, because there is no way to know when a script result will resolve. This results in the future being polled thousands of times (and more) per second, always creating a new object, which I guess amounts to checking it in a loop.
Note: my current setup doesn't use select!, but an enum as the Output of multiple Future implementations, pushed into a single FuturesUnordered and then matched on to handle the correct type (script, send, receive). I used select! here because it's far less verbose and gets the point across.
Is there a way to do this better or more efficiently? Or is it just not how MainWorker was meant to be used?
main for completeness:

#[tokio::main]
async fn main() {
    let main_module = deno_runtime::deno_core::resolve_url(MAIN_MODULE_SPECIFIER).unwrap();
    let user_module = deno_runtime::deno_core::resolve_url(USER_MODULE_SPECIFIER).unwrap();
    let (tx, mut rx) = channel(1);
    let (result_tx, mut result_rx) = channel(1);
    let handle = thread::spawn(move || {
        let runtime = tokio::runtime::Builder::new_multi_thread().enable_all().build().unwrap();
        let mut runner = Runner::new();
        runtime.block_on(runner.run_worker(&mut rx, main_module, user_module));
    });
    tx.send(("test input".to_string(), result_tx)).await.unwrap();
    let result = result_rx.recv().await.unwrap();
    println!("result from worker {}", result);
    handle.join().unwrap();
}
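As a general, deno-agnostic aside: the usual way to avoid building a new future on every poll is to create the inner future once, store it pinned, and poll the stored instance; the inner future then registers the waker itself, so no manual wake_by_ref busy-polling is needed. A minimal sketch of that wrapper pattern (hypothetical names, and it assumes the inner future does not borrow from the surrounding struct, which is the hard part here):

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct Tagged<F: Future> {
    id: String,
    inner: Pin<Box<F>>, // created once, then only polled
}

impl<F: Future> Future for Tagged<F> {
    type Output = (String, F::Output);
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        match this.inner.as_mut().poll(cx) {
            Poll::Ready(out) => Poll::Ready((this.id.clone(), out)),
            // The inner future has stored the waker and will wake this
            // task when it can make progress; no wake_by_ref is required.
            Poll::Pending => Poll::Pending,
        }
    }
}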

Receiver on tokio's mpsc channel only receives messages when buffer is full

I've spent a few hours trying to figure this out and I'm pretty much done. I found a question with a similar name, but there the culprit was a blocking synchronous call interfering with tokio. That may well be the issue here too, but I have absolutely no idea what is causing it.
Here is a heavily stripped-down version of my project which hopefully gets the issue across.
use std::io;

use futures_util::{
    SinkExt,
    stream::{SplitSink, SplitStream},
    StreamExt,
};
use tokio::{
    net::TcpStream,
    sync::mpsc::{channel, Receiver, Sender},
};
use tokio_tungstenite::{
    connect_async,
    MaybeTlsStream,
    tungstenite::Message,
    WebSocketStream,
};

#[tokio::main]
async fn main() {
    connect_to_server("wss://a_valid_domain.com".to_string()).await;
}

async fn read_line() -> String {
    loop {
        let mut str = String::new();
        io::stdin().read_line(&mut str).unwrap();
        str = str.trim().to_string();
        if !str.is_empty() {
            return str;
        }
    }
}

async fn connect_to_server(url: String) {
    let (ws_stream, _) = connect_async(url).await.unwrap();
    let (write, read) = ws_stream.split();
    let (tx, rx) = channel::<ChannelMessage>(100);
    tokio::spawn(channel_thread(write, rx));
    tokio::spawn(handle_std_input(tx.clone()));
    read_messages(read, tx).await;
}

#[derive(Debug)]
enum ChannelMessage {
    Text(String),
    Close,
}

// PROBLEMATIC FUNCTION
async fn channel_thread(
    mut write: SplitSink<WebSocketStream<MaybeTlsStream<TcpStream>>, Message>,
    mut rx: Receiver<ChannelMessage>,
) {
    while let Some(msg) = rx.recv().await {
        println!("{:?}", msg); // This only fires when the buffer is full
        match msg {
            ChannelMessage::Text(text) => write.send(Message::Text(text)).await.unwrap(),
            ChannelMessage::Close => {
                write.close().await.unwrap();
                rx.close();
                return;
            }
        }
    }
}

async fn read_messages(
    mut read: SplitStream<WebSocketStream<MaybeTlsStream<TcpStream>>>,
    tx: Sender<ChannelMessage>,
) {
    while let Some(msg) = read.next().await {
        let msg = match msg {
            Ok(m) => m,
            Err(_) => continue,
        };
        match msg {
            Message::Text(m) => println!("{}", m),
            Message::Close(_) => break,
            _ => {}
        }
    }
    if !tx.is_closed() {
        let _ = tx.send(ChannelMessage::Close).await;
    }
}

async fn handle_std_input(tx: Sender<ChannelMessage>) {
    loop {
        let str = read_line().await;
        if tx.is_closed() {
            break;
        }
        tx.send(ChannelMessage::Text(str)).await.unwrap();
    }
}
As you can see, what I'm trying to do is:

- Connect to a websocket
- Print incoming messages from the websocket
- Forward any input from stdin to the websocket
- (There was also a custom heartbeat solution, which was trimmed out)

The problem lies in the channel_thread() function. I move the websocket writer into it along with the channel receiver, but the loop only processes the sent messages once the channel's buffer is full.
I've spent a lot of time trying to solve this; any help is greatly appreciated.
Here, you make a blocking synchronous call in an async context:
async fn read_line() -> String {
    loop {
        let mut str = String::new();
        io::stdin().read_line(&mut str).unwrap();
        // ^^^^^^^^^^^^^^^^^^^
        // This is sync + blocking
        str = str.trim().to_string();
        if !str.is_empty() {
            return str;
        }
    }
}
You should never, ever make blocking synchronous calls in an async context, because doing so prevents the entire thread from running other async tasks. Your channel-receiver task is likely assigned to the same thread, so it has to wait until all the blocking calls are done and whatever invokes this function yields back to the async runtime.
Tokio has its own async version of stdin, which you should use instead; see the sketch below.
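This sketch assumes the rest of the program is unchanged and just swaps the std::io call for tokio::io::stdin with an AsyncBufReadExt line reader, so the task yields to the runtime while waiting for input:

use tokio::io::{AsyncBufReadExt, BufReader};

async fn read_line() -> String {
    // Await lines instead of blocking the worker thread on stdin.
    let mut lines = BufReader::new(tokio::io::stdin()).lines();
    while let Ok(Some(line)) = lines.next_line().await {
        let line = line.trim().to_string();
        if !line.is_empty() {
            return line;
        }
    }
    // stdin was closed (or a read failed); return an empty line as EOF.
    String::new()
}

Alternatively, the blocking read could be wrapped in tokio::task::spawn_blocking, which moves it onto a thread pool dedicated to blocking work.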

Waiting for a list of futures in Rust

I am attempting to make a future that continuously finds new work to do and then maintains a set of futures for those work items. I would like to make sure that the main future that finds work is never blocked for long periods of time, and that the work items are processed concurrently.
Here is a rough overview of what I am trying to do. Specifically, isDone does not exist, and from what I can understand from the docs it isn't a valid way to use futures in Rust anyway. What is the idiomatic way of doing this kind of thing?
use std::collections::HashMap;
use tokio::runtime::Runtime;

async fn find_work() -> HashMap<i64, String> {
    // Go read from the DB or something...
    let mut work = HashMap::new();
    work.insert(1, "test".to_string());
    work.insert(2, "test".to_string());
    return work;
}

async fn do_work(id: i64, value: String) -> () {
    // Result<(), Error> {
    println!("{}: {}", id, value);
}

async fn async_main() -> () {
    let mut pending_work = HashMap::new();
    loop {
        for (id, value) in find_work().await {
            if !pending_work.contains_key(&id) {
                let fut = do_work(id, value);
                pending_work.insert(id, fut);
            }
        }
        pending_work.retain(|id, fut| {
            if isDone(fut) {
                // do something with the result
                false
            } else {
                true
            }
        });
    }
}

fn main() {
    let runtime = Runtime::new().unwrap();
    let exec = runtime.executor();
    exec.spawn(async_main());
    runtime.shutdown_on_idle();
}
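For reference, one idiomatic approach (a sketch, not from the original post) is to keep the in-flight work in a FuturesUnordered from the futures crate and drain whatever has already completed without waiting, which plays the role the imagined isDone had:

use std::collections::HashSet;
use futures::{FutureExt, stream::{FuturesUnordered, StreamExt}};

async fn async_main() {
    let mut pending_ids = HashSet::new();
    let mut pending_work = FuturesUnordered::new();
    loop {
        for (id, value) in find_work().await {
            // Track ids separately so the same work item isn't queued twice.
            if pending_ids.insert(id) {
                pending_work.push(do_work(id, value).map(move |_| id));
            }
        }
        // now_or_never polls without waiting: only already-finished
        // futures are drained, so the work-finding loop is never blocked.
        while let Some(Some(id)) = pending_work.next().now_or_never() {
            pending_ids.remove(&id);
        }
    }
}

Under tokio, each work item could also be handed to tokio::spawn, so it keeps running even while this loop is inside find_work().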

How can I use Server-Sent events in Iron?

I have a small Rust application that receives some requests through a serial port, does some processing, and saves the results locally. I wanted to use a browser as a remote monitor so I can see everything that is happening, and as I understand it, Server-Sent Events (SSE) are pretty good for that.
I tried using Iron for this, but I can't find a way to keep the connection open. The request handlers all need to return a Response, so I can't keep sending data.
This was my (dumb) attempt:
fn monitor(req: &mut Request) -> IronResult<Response> {
    let mut headers = Headers::new();
    headers.set(ContentType(Mime(TopLevel::Text, SubLevel::EventStream, vec![])));
    headers.set(CacheControl(vec![CacheDirective::NoCache]));
    println!("{:?}", req);
    let mut count = 0;
    loop {
        let mut response = Response::with((iron::status::Ok, format!("data: Count!:{}", count)));
        response.headers = headers.clone();
        return Ok(response); // obviously won't do what I want
        count += 1;
        std::thread::sleep_ms(1000);
    }
}
I think the short answer is: you can't. The current version of Iron is built around a single request-response interaction. You can see this in your code: the only way to send a response is to return it, which terminates the handler.
There's an issue in Iron about utilizing the new async support in Hyper, which itself was merged relatively recently. There are even other people trying to use Server-Sent Events in Hyper who haven't succeeded yet.
If you are willing to use the Hyper master branch, something like this seems to work. No guarantees that this is a good solution or that it doesn't eat up all your RAM or CPU. It seems to work in Chrome, though.
extern crate hyper;

use std::time::{Duration, Instant};
use std::io::prelude::*;

use hyper::{Control, Encoder, Decoder, Next};
use hyper::server::{Server, HandlerFactory, Handler, Request, Response};
use hyper::status::StatusCode;
use hyper::header::ContentType;
use hyper::net::HttpStream;

fn main() {
    let address = "0.0.0.0:7777".parse().expect("Invalid address");
    let server = Server::http(&address).expect("Invalid server");
    let (_listen, server_loop) = server.handle(MyFactory).expect("Failed to handle");
    println!("Starting...");
    server_loop.run();
}

struct MyFactory;

impl HandlerFactory<HttpStream> for MyFactory {
    type Output = MyHandler;

    fn create(&mut self, ctrl: Control) -> Self::Output {
        MyHandler {
            control: ctrl,
        }
    }
}

struct MyHandler {
    control: Control,
}

impl Handler<HttpStream> for MyHandler {
    fn on_request(&mut self, _request: Request<HttpStream>) -> Next {
        println!("A request was made");
        Next::write()
    }

    fn on_request_readable(&mut self, _request: &mut Decoder<HttpStream>) -> Next {
        println!("Request has data to read");
        Next::write()
    }

    fn on_response(&mut self, response: &mut Response) -> Next {
        println!("A response is ready to be sent");
        response.set_status(StatusCode::Ok);
        let mime = "text/event-stream".parse().expect("Invalid MIME");
        response.headers_mut().set(ContentType(mime));
        every_duration(Duration::from_secs(1), self.control.clone());
        Next::wait()
    }

    fn on_response_writable(&mut self, response: &mut Encoder<HttpStream>) -> Next {
        println!("A response can be written");
        // Waited long enough, send some data
        let fake_data = r#"event: userconnect
data: {"username": "bobby", "time": "02:33:48"}"#;
        println!("Writing some data");
        response.write_all(fake_data.as_bytes()).expect("Failed to write");
        response.write_all(b"\n\n").expect("Failed to write");
        Next::wait()
    }
}
use std::thread;

fn every_duration(max_elapsed: Duration, control: Control) {
    let mut last_sent: Option<Instant> = None;
    let mut count = 0;
    thread::spawn(move || {
        loop {
            // Terminate after a fixed number of messages
            if count >= 5 {
                println!("Maximum messages sent, ending");
                control.ready(Next::end()).expect("Failed to trigger end");
                return;
            }
            // Wait a little while between messages
            if let Some(last) = last_sent {
                let elapsed = last.elapsed();
                println!("It's been {:?} since the last message", elapsed);
                if elapsed < max_elapsed {
                    let remaining = max_elapsed - elapsed;
                    println!("There's {:?} remaining", remaining);
                    thread::sleep(remaining);
                }
            }
            // Trigger a message
            control.ready(Next::write()).expect("Failed to trigger write");
            last_sent = Some(Instant::now());
            count += 1;
        }
    });
}
And the client-side JS:

var evtSource = new EventSource("http://127.0.0.1:7777");
evtSource.addEventListener("userconnect", function(e) {
    const obj = JSON.parse(e.data);
    console.log(obj);
}, false);
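For a quick check without a browser, the stream can also be inspected from the command line, e.g. with curl -N http://127.0.0.1:7777/ (-N disables curl's output buffering, so events appear as they arrive).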

Application on OSX cannot spawn more than 2048 threads

I have a Rust application on OSX firing up a large number of threads, as can be seen in the code below. However, after checking how many threads my version of OSX allows a task to create via the sysctl kern.num_taskthreads command, I can see that it is kern.num_taskthreads: 2048, which explains why I can't spin up more than 2048 threads.
How do I get past this hard limit?
let threads = 300000;
let requests = 1;
for _x in 0..threads {
    println!("{}", _x);
    let request_clone = request.clone();
    let handle = thread::spawn(move || {
        for _y in 0..requests {
            request_clone.lock().unwrap().push(request::Request::new(request::Request::create_request()));
        }
    });
    child_threads.push(handle);
}
Before starting, I'd encourage you to read about the C10K problem. When you get to this scale, there are a lot more things you need to keep in mind.
That being said, I'd suggest looking at mio...

a lightweight IO library for Rust with a focus on adding as little overhead as possible over the OS abstractions.

Specifically, mio provides an event loop, which allows you to handle a large number of connections without spawning threads. Unfortunately, I don't know of an HTTP library that currently supports mio. You could create one and be a hero to the Rust community!
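To make the event-loop idea concrete, here is a minimal sketch of mio's readiness loop (written against the mio 0.6-era Poll API, which postdates this answer; the address and token are placeholders). One thread multiplexes readiness events for any number of registered sockets:

use mio::net::TcpStream;
use mio::{Events, Poll, PollOpt, Ready, Token};

fn main() -> std::io::Result<()> {
    let poll = Poll::new()?;
    let mut events = Events::with_capacity(1024);
    // Register one non-blocking socket; real code would register thousands.
    let addr = "127.0.0.1:8080".parse().unwrap();
    let stream = TcpStream::connect(&addr)?;
    poll.register(&stream, Token(0), Ready::readable() | Ready::writable(), PollOpt::edge())?;
    loop {
        // Block until the OS reports readiness on any registered socket.
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            println!("token {:?} ready: {:?}", event.token(), event.readiness());
        }
    }
}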
Not sure how helpful this will be, but I was trying to create a small pool of threads that create connections and then send them over to an event loop via a channel for reading.
I'm sure this code is probably pretty bad, but here it is anyway as an example. It uses the Hyper library, like you mentioned.
extern crate hyper;

use std::io::Read;
use std::thread;
use std::thread::JoinHandle;
use std::sync::{Arc, Mutex};
use std::sync::mpsc::channel;

use hyper::Client;
use hyper::client::Response;
use hyper::header::Connection;

const TARGET: i32 = 100;
const THREADS: i32 = 10;

struct ResponseWithString {
    index: i32,
    response: Response,
    data: Vec<u8>,
    complete: bool,
}

fn main() {
    // Create a client.
    let url: &'static str = "http://www.gooogle.com/";
    let mut threads = Vec::<JoinHandle<()>>::with_capacity((TARGET * 2) as usize);
    let conn_count = Arc::new(Mutex::new(0));
    let (tx, rx) = channel::<ResponseWithString>();
    for _ in 0..THREADS {
        // Move var references into thread context
        let conn_count = conn_count.clone();
        let tx = tx.clone();
        let t = thread::spawn(move || {
            loop {
                let idx: i32;
                {
                    // Lock, increment, and release
                    let mut count = conn_count.lock().unwrap();
                    *count += 1;
                    idx = *count;
                }
                if idx > TARGET {
                    break;
                }
                let mut client = Client::new();
                // Creating an outgoing request.
                println!("Creating connection {}...", idx);
                let res = client.get(url)              // Get URL...
                    .header(Connection::close())       // Set headers...
                    .send().unwrap();                  // Fire!
                println!("Pushing response {}...", idx);
                tx.send(ResponseWithString {
                    index: idx,
                    response: res,
                    data: Vec::<u8>::with_capacity(1024),
                    complete: false,
                }).unwrap();
            }
        });
        threads.push(t);
    }
    let mut responses = Vec::<ResponseWithString>::with_capacity(TARGET as usize);
    let mut buf: [u8; 1024] = [0; 1024];
    let mut completed_count = 0;
    loop {
        if completed_count >= TARGET {
            break; // No more work!
        }
        match rx.try_recv() {
            Ok(r) => {
                println!("Incoming response! {}", r.index);
                responses.push(r)
            },
            _ => {}
        }
        for r in &mut responses {
            if r.complete {
                continue;
            }
            // Read the Response.
            let res = &mut r.response;
            let data = &mut r.data;
            let idx = &r.index;
            match res.read(&mut buf) {
                Ok(i) => {
                    if i == 0 {
                        println!("No more data! {}", idx);
                        r.complete = true;
                        completed_count += 1;
                    } else {
                        println!("Got data! {} => {}", idx, i);
                        for x in 0..i {
                            data.push(buf[x]);
                        }
                    }
                }
                Err(e) => {
                    panic!("Oh no! {} {}", idx, e);
                }
            }
        }
    }
}
