Processing a HUGE file in parallel by splitting it into logical shards - rust

I'm trying to process a huge file (~15 GB to ~60 GB, containing 560 million to 2 billion records) in parallel.
A record looks something like the following:
<id>, <amount>
123, 6000
123, 4593
111, 1
111, 100
111, -50
111, 10000
There could be thousands of clients contained within a file, each client's activity recorded as a series of transactions.
I processed this file sequentially. Not an issue.
The work can be safely parallelized as long as every client's data is processed by the same thread/task.
But then I tried to process it in parallel, to make use of the other available cores, by creating logical groups that are each processed by the same tokio task. For now I'm sticking to spawning a single task per available core, and a transaction is routed to its task by looking at the client id.
This approach is way slower than sequential.
Following is a snippet of the approach:
#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let max_threads_supported = num_cpus::get();
    let mut account_state: HashMap<u16, Client> = HashMap::new();
    let (result_sender, mut result_receiver) =
        mpsc::channel::<HashMap<u16, Client>>(max_threads_supported);
    // repository of senders, one for each shard
    let mut sender_repository: HashMap<u16, Sender<Transaction>> = HashMap::new();
    for task_counter in 0..max_threads_supported {
        let result_sender_clone = result_sender.clone();
        // create a separate mpsc channel for each processor
        let (sender, mut receiver) = mpsc::channel::<Transaction>(10_000);
        sender_repository.insert(task_counter as u16, sender);
        tokio::spawn(async move {
            let mut exec_engine = Engine::initialize();
            while let Some(tx) = receiver.recv().await {
                match exec_engine.execute_transaction(tx) {
                    Ok(_) => (),
                    Err(_err) => (),
                }
            }
            result_sender_clone
                .send(exec_engine.get_account_state_owned())
                .await
        });
    }
    drop(result_sender);
    tokio::spawn(async move {
        // reading txs from the file sequentially (std::io::BufReader-based; setup omitted)
        for result in reader.deserialize::<Transaction>() {
            match result {
                Ok(tx) => {
                    // shard by client id (cast needed: the map keys are u16)
                    match sender_repository.get(&(tx.get_client_id() % max_threads_supported as u16)) {
                        Some(sender) => {
                            sender.send(tx).await;
                        }
                        None => (),
                    }
                }
                _ => (),
            }
        }
    });
    // accumulate the results from all the processors
    while let Some(result) = result_receiver.recv().await {
        account_state.extend(result.into_iter());
    }
    // do whatever you like with the result
    Ok(())
}
But this is considerably slower than the sequential approach. What am I doing wrong?
Btw, I've also tried a broadcast approach, but a lagging consumer there can lose messages, so I moved to mpsc.
How can I optimize this for better performance?

There are a couple of misconceptions here. The main one is that tokio is not meant for CPU-bound parallelism. It is meant for IO-bound, event-based situations where short reaction times and good scalability are required, like web servers.
What further reinforces my impression is that you "spawn one task per CPU core", which is the opposite of what you want in tokio. Tokio's strength is that you can spawn a very large number of tasks, and tokio efficiently schedules them on the available CPU resources. In fact, some configurations of the tokio runtime are single-threaded! So spawning more tasks achieves no speedup whatsoever; spawning tasks is not for speedup, but for waiting at more await points at the same time. For example in a web server that is connected to 100 clients at the same time, you need 100 wait points to wait for a message from each of them. That's where you need one task per connection.
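To make that concrete, here is a minimal standalone sketch (hypothetical, unrelated to the transaction code): even on a single-threaded runtime, 100 spawned tasks sleep concurrently, so this finishes in about one second rather than one hundred.
use std::time::Duration;

// Even a single-threaded tokio runtime drives all 100 tasks concurrently:
// the sleeps overlap, without any CPU parallelism being involved.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let handles: Vec<_> = (0..100)
        .map(|i| {
            tokio::spawn(async move {
                tokio::time::sleep(Duration::from_secs(1)).await;
                i
            })
        })
        .collect();
    for h in handles {
        h.await.unwrap();
    }
}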
What you actually want is not asynchronism but parallelism.
The current go-to library for structured parallelism is rayon combined with the excellent crossbeam-channel library for dataflow.
This should point you in the right direction:
use std::collections::HashMap;
use std::sync::Mutex;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let max_threads_supported = num_cpus::get();
    let account_state: Mutex<HashMap<u16, Client>> = Mutex::new(HashMap::new());
    rayon::scope(|s| {
        let (work_queue_sender, work_queue_receiver) = crossbeam_channel::bounded(1000);
        // capture a shared reference so each `move` closure can use the Mutex
        let account_state = &account_state;
        for _task_counter in 0..max_threads_supported {
            let work_receiver = work_queue_receiver.clone();
            s.spawn(move |_| {
                let mut exec_engine = Engine::initialize();
                // recv() returns Err once all senders have been dropped
                while let Ok(tx) = work_receiver.recv() {
                    // TODO: do some proper error handling
                    exec_engine.execute_transaction(tx).unwrap();
                }
                account_state
                    .lock()
                    .unwrap()
                    .extend(exec_engine.get_account_state_owned().into_iter());
            });
        }
        let reader = Reader; // your sequential reader, as in the question
        for result in reader.deserialize::<Transaction>() {
            work_queue_sender.send(result.unwrap()).unwrap();
        }
        drop(work_queue_sender);
    });
    // Do whatever you want with the `account_state` HashMap.
    Ok(())
}
Many imports were missing from your code (please provide a minimal reproducible example next time), so I wasn't able to test this.
But it should look somewhat similar to the above.
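One caveat worth noting: a single shared work queue no longer guarantees that all transactions for a given client are handled by the same worker. If that invariant matters (the question routes transactions by client id), the same sharding idea carries over to plain threads with crossbeam-channel. A minimal sketch, reusing the question's hypothetical Transaction, Engine, Client, and reader:
use std::collections::HashMap;
use std::thread;

// One bounded channel per shard; all transactions for a given client map
// to the same shard, so per-client ordering is preserved.
let n_shards = num_cpus::get();
let (senders, handles): (Vec<_>, Vec<_>) = (0..n_shards)
    .map(|_| {
        let (tx, rx) = crossbeam_channel::bounded::<Transaction>(10_000);
        let handle = thread::spawn(move || {
            let mut engine = Engine::initialize();
            while let Ok(t) = rx.recv() {
                engine.execute_transaction(t).unwrap();
            }
            engine.get_account_state_owned()
        });
        (tx, handle)
    })
    .unzip();

for result in reader.deserialize::<Transaction>() {
    let t = result.unwrap();
    let shard = (t.get_client_id() as usize) % n_shards;
    senders[shard].send(t).unwrap();
}
drop(senders); // the workers exit their recv loops once all senders are gone

let mut account_state: HashMap<u16, Client> = HashMap::new();
for h in handles {
    account_state.extend(h.join().unwrap());
}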

Related

Rust threadpool with init code in each thread?

The following code works; it can be tested in the Playground:
use std::{thread, time::Duration};
use rand::Rng;
fn main() {
    let mut hiv = Vec::new();
    let (sender, receiver) = crossbeam_channel::unbounded();
    // make workers
    for t in 0..5 {
        println!("Make worker {}", t);
        let receiver = receiver.clone(); // clone for this thread
        let handler = thread::spawn(move || {
            let mut rng = rand::thread_rng(); // each thread has its own
            loop {
                let r = receiver.recv();
                match r {
                    Ok(x) => {
                        let s = rng.gen_range(100..1000);
                        thread::sleep(Duration::from_millis(s));
                        println!("w={} r={} working={}", t, x, s);
                    },
                    _ => { println!("No more work for {} --- {:?}.", t, r); break },
                }
            }
        });
        hiv.push(handler);
    }
    // generate jobs
    for x in 0..10 {
        sender.send(x).expect("all threads hung up :(");
    }
    drop(sender);
    // wait for jobs to finish
    println!("Wait for all threads to finish.\n");
    for h in hiv {
        h.join().unwrap();
    }
    println!("join() done. Work Finish.");
}
My question is the following:
Can I remove this boilerplate code by using threadpool, rayon, or some other Rust crate?
I know that I could do my own implementation, but I would like to know whether there is a crate with the same functionality.
From my research, threadpool/rayon are useful when you "send" code and it is executed, but I have not found a way to make N threads that each hold some state/logic they need to remember.
The basic idea is in let mut rng = rand::thread_rng(); — this is an instance that each thread needs to have on its own.
Also, if there are any other problems with the code, please point them out.
Yes, you can use Rayon to eliminate a lot of that code and make the remaining code much more readable, as illustrated in this gist:
https://gist.github.com/BillBarnhill/db07af903cb3c3edb6e715d9cedae028
The worker pool model is not great in Rust, due to the ownership rules. As a result parallel iterators are often a better choice.
I originally forgot to address your main concern, per-thread context. You can see how to store per-thread context using the thread_local! macro in this answer:
https://stackoverflow.com/a/42656422/204343
I will try to come back and edit the code to reflect thread_local! use as soon as I have more time.
The gist requires nightly because of thread_id_value, but that is all but stable and can be removed if needed.
The real catch is that the gist includes timing, comparing main_new with main_original, with surprising results. Or perhaps not so surprising: Rayon has good debug support.
On Debug build the timing output is:
main_new duration: 1.525667954s
main_original duration: 1.031234059s
You can see main_new takes almost 50% longer to run.
On release, however, main_new is marginally faster:
main_new duration: 1.584190936s
main_original duration: 1.5851124s
A slimmed version of the gist is below, with only the new code.
#![feature(thread_id_value)]
use std::{thread, time::Duration, time::Instant};
use rand::Rng;
#[allow(unused_imports)]
use rayon::prelude::*;
fn do_work(x: u32) -> String {
    let mut rng = rand::thread_rng(); // each thread has its own
    let s = rng.gen_range(100..1000);
    let thread_id = thread::current().id();
    let t = thread_id.as_u64();
    thread::sleep(Duration::from_millis(s));
    format!("w={} r={} working={}", t, x, s)
}

fn process_work_product(output: String) {
    println!("{}", output);
}

fn main() {
    // a bit hacky, but let's pin the size of the global thread pool
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .unwrap();
    let x = 0..10;
    x.into_par_iter()
        .map(do_work)
        .for_each(process_work_product);
}
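To address the per-thread-state concern directly: rayon's ParallelIterator also provides map_init, which creates a value once per rayon job (roughly once per worker thread, rather than once per item) and hands it to each call by mutable reference. A sketch of the same RNG workload using it:
use rand::Rng;
use rayon::prelude::*;

fn main() {
    // map_init builds the RNG per rayon job instead of per item,
    // so each worker thread keeps reusing its own instance
    let outputs: Vec<String> = (0..10u32)
        .into_par_iter()
        .map_init(
            || rand::thread_rng(), // per-thread initializer
            |rng, x| {
                let s = rng.gen_range(100..1000);
                format!("r={} working={}", x, s)
            },
        )
        .collect();
    for o in outputs {
        println!("{}", o);
    }
}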

If "futures do nothing unless awaited", why does `tokio::spawn` work anyway?

I have read here that futures in Rust do nothing unless they are awaited. However, I tried a more complex example and it is a little unclear why I get a message printed by the 2nd print in this example, because task::spawn gives me a JoinHandle on which I do not do any .await.
Meanwhile, I tried the same example but with an .await above the 2nd print, and now only the message in the 1st print gets printed.
If I wait for all the futures at the end, both messages get printed, which I understand. My question is why the behaviour differs in the previous 2 cases.
use futures::stream::{FuturesUnordered, StreamExt};
use futures::TryStreamExt;
use rand::prelude::*;
use std::collections::VecDeque;
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinHandle;
use tokio::{task, time};
fn candidates() -> Vec<i32> {
    Vec::from([2, 2])
}

async fn produce_event(nanos: u64) -> i32 {
    println!("waiting {}", nanos);
    time::sleep(time::Duration::from_nanos(nanos)).await;
    1
}

async fn f(seconds: i64, semaphore: &Arc<Semaphore>) {
    let mut futures = vec![];
    for (i, j) in (0..1).enumerate() {
        for (i, event) in candidates().into_iter().enumerate() {
            let permit = Arc::clone(semaphore).acquire_owned().await;
            let secs = 500;
            futures.push(task::spawn(async move {
                let _permit = permit;
                produce_event(500); // 2nd example has an .await here
                println!("Event produced at {}", seconds);
            }));
        }
    }
}

#[tokio::main()]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(45000));
    for _ in 0..1 {
        let mut futures: FuturesUnordered<_> = (0..2).map(|moment| f(moment, &semaphore)).collect();
        while let Some(item) = futures.next().await {
            let () = item;
        }
    }
}
However, I tried a more complex example and it is a little unclear why I get a message printed by the 2nd print in this example because task::spawn gives me a JoinHandle on which I do not do any .await.
You're spawning tasks. A task is a separate thread of execution which can execute concurrently to the current task, and can be scheduled in parallel.
All the JoinHandle does is let you wait for that task to end; it doesn't control whether the task runs.
Meanwhile, I tried the same example, but with an await above the 2nd print, and now I get printed only the message in the 1st print.
You spawn a bunch of tasks and make them sleep. But since you don't wait for them to terminate (you don't join them), and there is no sleep in their parent task either, once all the tasks have been spawned the loops end, you reach the end of the main function, and the program terminates.
At that point all the tasks are still sleeping.
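A minimal standalone sketch of the difference: the spawned task starts running as soon as it is spawned, and whether its message appears depends only on whether the program is still alive when the print is reached.
use std::time::Duration;

#[tokio::main]
async fn main() {
    // The task starts running immediately, even though
    // the JoinHandle has not been awaited yet.
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(100)).await;
        println!("printed only if the program is still alive by now");
    });
    println!("spawned");
    // Comment this out and main usually exits before the task's sleep
    // finishes, so its message is never printed.
    handle.await.unwrap();
}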

Non-blocking recv on Tokio mpsc Receiver

I am using Rust and Tokio 1.6 to build an app which can interact with an Elgato StreamDeck via hidapi = "1.2". I want to poll the HID device for events (key down / key up) and send those events on an mpsc channel, while watching a separate mpsc channel for incoming commands to update the device state (reset, change brightness, update image, etc). Since the device handle is not thread safe, I need to do both things from a single thread.
major edits below
This is a rewrite of my original question. I've left my interim answer below, but in the interest of a more self-contained example, here is the basic process using device_query = "0.2":
use device_query::{DeviceState, Keycode};
use std::time::Duration;
use tokio;
use tokio::sync::mpsc::{Receiver, Sender};
use tokio::time::timeout;
#[tokio::main]
async fn main() {
    // channel for key press events coming from device loop
    let (key_tx, mut key_rx) = tokio::sync::mpsc::channel(32);
    // channel for commands sent to device loop
    let (dev_tx, mut dev_rx) = tokio::sync::mpsc::channel(32);
    start_device_loop(60, key_tx, dev_rx);
    println!("Waiting for key presses");
    while let Some(k) = key_rx.recv().await {
        match k {
            Some(ch) => match ch {
                Keycode::Q => dev_tx.clone().try_send(String::from("Quit!")).expect("Could not send command"),
                ch => println!("{}", ch),
            },
            _ => (),
        }
    }
    println!("Done.")
}

/// Starts a tokio task, polling the supplied device and sending key events
/// on the supplied mpsc sender
pub fn start_device_loop(hz: u32, tx: Sender<Option<Keycode>>, mut rx: Receiver<String>) {
    let poll_wait = 1000 / hz;
    let poll_wait = Duration::from_millis(poll_wait as u64);
    tokio::task::spawn(async move {
        let dev = DeviceState::new();
        loop {
            let mut keys = dev.query_keymap();
            match keys.len() {
                0 => (),
                1 => tx.clone().try_send(Some(keys.remove(0))).unwrap(),
                _ => println!("So many keys..."),
            }
            match timeout(poll_wait, rx.recv()).await {
                Ok(cmd) => println!("Command '{}' received.", cmd.unwrap()),
                _ => (),
            };
            // std::thread::sleep(poll_wait);
        }
    });
}
Note this does not compile - I get an error future created by async block is not 'Send' and within 'impl Future', the trait 'Send' is not implemented for '*mut x11::xlib::_XDisplay'. My understanding of the error is that because device_query is not thread-safe, and awaiting introduces the possibility of the scope moving across threads, nothing may be awaited while a non-thread-safe object is in scope. And indeed, if I comment out the block around match timeout... and uncomment the std::thread::sleep, everything compiles and runs.
Which brings me back to the original question: how can I both send and receive messages in a single thread without using await or the apparently forbidden fruit of poll_recv()?
After much hunting I found noop_waker in the futures crate which appears to do what I need in combination with poll_recv:
use std::task::Poll; // needed for matching on poll_recv's result

pub fn start_device_loop(hz: u32, tx: Sender<Option<Keycode>>, mut rx: Receiver<String>) {
    let poll_wait = 1000 / hz;
    let poll_wait = Duration::from_millis(poll_wait as u64);
    tokio::task::spawn_blocking(move || {
        let dev = DeviceState::new();
        let waker = futures::task::noop_waker();
        let mut cx = std::task::Context::from_waker(&waker);
        loop {
            let mut keys = dev.query_keymap();
            match keys.len() {
                0 => (),
                1 => tx.clone().try_send(Some(keys.remove(0))).unwrap(),
                _ => println!("So many keys..."),
            }
            match rx.poll_recv(&mut cx) {
                Poll::Ready(cmd) => println!("Command '{}' received.", cmd.unwrap()),
                _ => (),
            };
            std::thread::sleep(poll_wait);
        }
    });
}
After digging through docs and tokio source more I can't find anything that suggests poll_recv is supposed to be an internal-only function or that using it here would have any obvious side effects. Letting the process run at 125hz I'm not seeing any excess resource usage either.
I'm leaving the above code for posterity, but since asking this question the try_recv method has been added to Receivers, making this all much cleaner.
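For reference, the polling loop with try_recv looks something like this (a sketch, assuming a tokio version that provides Receiver::try_recv; the rest of the function is unchanged):
use tokio::sync::mpsc::error::TryRecvError;

// inside the loop, replacing the noop_waker/poll_recv machinery:
match rx.try_recv() {
    Ok(cmd) => println!("Command '{}' received.", cmd),
    Err(TryRecvError::Empty) => (),           // nothing queued this tick
    Err(TryRecvError::Disconnected) => break, // all senders dropped
}
std::thread::sleep(poll_wait);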

Rusoto async using FuturesOrdered combinator

I am trying to send off parallel asynchronous Rusoto SQS requests using FuturesOrdered:
use futures::prelude::*; // 0.1.26
use futures::stream::futures_unordered::FuturesUnordered;
use rusoto_core::{Region, HttpClient}; // 0.38.0
use rusoto_credential::EnvironmentProvider; // 0.17.0
use rusoto_sqs::{SendMessageBatchRequest, SendMessageBatchRequestEntry, Sqs, SqsClient}; // 0.38.0
fn main() {
    let client = SqsClient::new_with(
        HttpClient::new().unwrap(),
        EnvironmentProvider::default(),
        Region::UsWest2,
    );
    let messages: Vec<u32> = (1..12).map(|n| n).collect();
    let chunks: Vec<_> = messages.chunks(10).collect();
    let tasks: FuturesUnordered<_> = chunks.into_iter().map(|c| {
        let batch = create_batch(c);
        client.send_message_batch(batch)
    }).collect();
    let tasks = tasks
        .for_each(|t| {
            println!("{:?}", t);
            Ok(())
        })
        .map_err(|e| println!("{}", e));
    tokio::run(tasks);
}

fn create_batch(ids: &[u32]) -> SendMessageBatchRequest {
    let queue_url = "https://sqs.us-west-2.amazonaws.com/xxx/xxx".to_string();
    let entries = ids
        .iter()
        .map(|id| SendMessageBatchRequestEntry {
            id: id.to_string(),
            message_body: id.to_string(),
            ..Default::default()
        })
        .collect();
    SendMessageBatchRequest {
        entries,
        queue_url,
    }
}
The tasks complete correctly, but tokio::run(tasks) doesn't stop. I assume that is because tasks.for_each() forces it to keep running and looking for more futures?
Why doesn't tokio::run(tasks) stop? Am I using FuturesOrdered correctly?
I am also a little worried about memory usage when creating up to 60,000 futures and pushing them into the FuturesUnordered combinator.
I discovered that it was the SqsClient in the main function that was causing it to block, as it keeps doing some housekeeping even after the tasks have finished.
A solution provided by one of the Rusoto people was to add this just above tokio::run:
std::mem::drop(client);
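In context, the end of the main function above would then look something like this (a sketch; client and tasks as defined earlier):
    // ... build `client` and `tasks` as above ...
    // Drop the client first so its background machinery shuts down,
    // letting the runtime become idle once the tasks complete.
    std::mem::drop(client);
    tokio::run(tasks);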

Is there an API to race N threads (or N closures on N threads) to completion?

Given several threads that complete with an Output value, how do I get the first Output that's produced? Ideally while still being able to get the remaining Outputs later in the order they're produced, and bearing in mind that some threads may or may not terminate.
Example:
struct Output(i32);
fn main() {
    let mut spawned_threads = Vec::new();
    for i in 0..10 {
        let join_handle: ::std::thread::JoinHandle<Output> = ::std::thread::spawn(move || {
            // pretend to do some work that takes some amount of time
            ::std::thread::sleep(::std::time::Duration::from_millis(
                (1000 - (100 * i)) as u64,
            ));
            Output(i) // then pretend to return the `Output` of that work
        });
        spawned_threads.push(join_handle);
    }
    // I can do this to wait for each thread to finish and collect all `Output`s
    let outputs_in_order_of_thread_spawning = spawned_threads
        .into_iter()
        .map(::std::thread::JoinHandle::join)
        .collect::<Vec<::std::thread::Result<Output>>>();
    // but how would I get the `Output`s in order of completed threads?
}
I could solve the problem myself using a shared queue/channels/similar, but are there built-in APIs or existing libraries which could solve this use case for me more elegantly?
I'm looking for an API like:
fn race_threads<A: Send>(
    threads: Vec<::std::thread::JoinHandle<A>>
) -> (::std::thread::Result<A>, Vec<::std::thread::JoinHandle<A>>) {
    unimplemented!("so far this doesn't seem to exist")
}
(Rayon's join is the closest I could find, but a) it only races 2 closures rather than an arbitrary number of closures, and b) the thread pool w/ work stealing approach doesn't make sense for my use case of having some closures that might run forever.)
It is possible to solve this use case using the pointers from How to check if a thread has finished in Rust?, just as it's possible to solve it using an MPSC channel; however, here I'm after a clean API to race n threads (or, failing that, n closures on n threads).
These problems can be solved by using a condition variable:
use std::sync::{Arc, Condvar, Mutex};
#[derive(Debug)]
struct Output(i32);
enum State {
    Starting,
    Joinable,
    Joined,
}

fn main() {
    let pair = Arc::new((Mutex::new(Vec::new()), Condvar::new()));
    let mut spawned_threads = Vec::new();
    let &(ref lock, ref cvar) = &*pair;
    for i in 0..10 {
        let my_pair = pair.clone();
        // register the slot before spawning, so the thread cannot index
        // into the vector before its entry exists
        lock.lock().unwrap().push(State::Starting);
        let join_handle: ::std::thread::JoinHandle<Output> = ::std::thread::spawn(move || {
            // pretend to do some work that takes some amount of time
            ::std::thread::sleep(::std::time::Duration::from_millis(
                (1000 - (100 * i)) as u64,
            ));
            let &(ref lock, ref cvar) = &*my_pair;
            let mut joinable = lock.lock().unwrap();
            joinable[i] = State::Joinable;
            cvar.notify_one();
            Output(i as i32) // then pretend to return the `Output` of that work
        });
        spawned_threads.push(Some(join_handle));
    }
    let mut should_stop = false;
    while !should_stop {
        let locked = lock.lock().unwrap();
        let mut locked = cvar.wait(locked).unwrap();
        should_stop = true;
        for (i, state) in locked.iter_mut().enumerate() {
            match *state {
                State::Starting => {
                    should_stop = false;
                }
                State::Joinable => {
                    *state = State::Joined;
                    println!("{:?}", spawned_threads[i].take().unwrap().join());
                }
                State::Joined => (),
            }
        }
    }
}
(playground link)
I'm not claiming this is the simplest way to do it. The condition variable wakes the main thread every time a child thread is done. The list tracks the state of each thread; if one is (about to be) finished, it can be joined.
No, there is no such API.
You've already been presented with multiple options to solve your problem:
Use channels
Use a CondVar
Use futures
Sometimes when programming, you have to go beyond sticking pre-made blocks together. This is supposed to be a fun part of programming. I encourage you to embrace it. Go create your ideal API using the components available and publish it to crates.io.
I really don't see what's so terrible about the channels version:
use std::{sync::mpsc, thread, time::Duration};
#[derive(Debug)]
struct Output(i32);
fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..10 {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis((1000 - (100 * i)) as u64));
            tx.send(Output(i)).unwrap();
        });
    }
    // Don't hold on to the sender ourselves,
    // otherwise the loop would never terminate.
    drop(tx);
    for r in rx {
        println!("{:?}", r);
    }
}
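And if you specifically want the first Output while keeping the rest available for later, as the question asks, the receiving end can simply be split (a sketch on top of the code above):
    // the first Output to be produced:
    let first = rx.recv().unwrap();
    println!("first: {:?}", first);
    // the remaining Outputs, in completion order, whenever convenient:
    for r in rx {
        println!("{:?}", r);
    }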
