Parallelising file processing using rayon - multithreading

I've got 7 CSV files (55 MB each) in my local source folder which I want to convert into JSON format and store into local folder. My OS is MacOS (Quad-Core Intel i5).
Basically, it is a simple Rust program which is run from a console as
./target/release/convert <source-folder> <target-folder>
My multithreading approach using Rust threads is ass following
fn main() -> Result<()> {
let source_dir = PathBuf::from(get_first_arg()?);
let target_dir = PathBuf::from(get_second_arg()?);
let paths = get_file_paths(&source_dir)?;
let mut handles = vec![];
for source_path in paths {
let target_path = create_target_file_path(&source_path, &target_dir)?;
handles.push(thread::spawn(move || {
let _ = convert(&source_path, &target_path);
}));
}
for h in handles {
let _ = h.join();
}
Ok(())
}
I run it using time to measure CPU utilisation which gives
2.93s user 0.55s system 316% cpu 1.098 total
Then I try to implement the same task using the rayon (threadpool) crate:
fn main() -> Result<()> {
let source_dir = PathBuf::from(get_first_arg()?);
let target_dir = PathBuf::from(get_second_arg()?);
let paths = get_file_paths(&source_dir)?;
let pool = rayon::ThreadPoolBuilder::new().num_threads(15).build()?;
for source_path in paths {
let target_path = create_target_file_path(&source_path, &target_dir)?;
pool.install(|| {
let _ = convert(&source_path, &target_path);
});
}
Ok(())
}
I run it using time to measure CPU utilisation which gives
2.97s user 0.53s system 98% cpu 3.561 total
I can't see any improvements when I use rayon. I probably use rayon the wrong way.
Does anyone have an idea what is wrong with it?
Update (09 Apr)
After some time of fight with the Rust checker 😀, just want to share a solution, maybe it could help others, or anyone else could suggest a better approach/solution
pool.scope(move |s| {
for source_path in paths {
let target_path = create_target_file_path(&source_path, &target_dir).unwrap();
s.spawn(move |_s| {
convert(&source_path, &target_path).unwrap();
});
}
});
But still does not beat the approach using rust std::thread for 113 files.
46.72s user 8.30s system 367% cpu 14.955 total
Update (10 Apr)
After #maxy comment
// rayon solution
paths.into_par_iter().for_each(|source_path| {
let target_path = create_target_file_path(&source_path, &target_dir);
match target_path {
Ok(target_path) => {
info!(
"Processing {}",
target_path.to_str().unwrap_or("Unable to convert")
);
let res = convert(&source_path, &target_path);
if let Err(e) = res {
error!("{}", e);
}
}
Err(e) => error!("{}", e),
}
});
// std::thread solution
let mut handles = vec![];
for source_path in paths {
let target_path = create_target_file_path(&source_path, &target_dir)?;
handles.push(thread::spawn(move || {
let _ = convert(&source_path, &target_path);
}));
}
for handle in handles {
let _ = handle.join();
}
Comparison on 57 files:
std::threads: 23.71s user 4.19s system 356% cpu 7.835 total
rayon: 23.36s user 4.08s system 324% cpu 8.464 total

The docu for rayon install is not super clear, but the signature:
pub fn install<OP, R>(&self, op: OP) -> R where
R: Send,
OP: FnOnce() -> R + Send,
says it returns type R. The same type R that your closure returns. So obviously install() has to wait for the result.
This only makes sense if the closure spawns additional tasks, for example by using .par_iter() inside the closure. I suggest to use rayon's parallel iterators directly (instead of your for loop) over the list of files. You don't even need to create your own thread pool, the default pool is usually fine.
If you insist on doing it manually, you'll have to use spawn() instead of install. And you'll probably have to move your loop into a lambda passed to scope().

Related

"there is no signal driver running, must be called from the context of Tokio runtime" despite running in Runtime::block_on

I'm writing a commandline application using Tokio that controls its lifecycle by listening for keyboard interrupt events (i.e. ctrl + c); however, at the same time it must also monitor the other tasks that are spawned and potentially initiate an early shutdown if any of the tasks panic or otherwise encounter an error. To do this, I have wrapped tokio::select in a while loop that terminates once the application has at least had a chance to safely shut down.
However, as soon as the select block polls the future returned by tokio::signal::ctrl_c, the main thread panics with the following message:
thread 'main' panicked at 'there is no signal driver running, must be called from the context of Tokio runtime'
...which is confusing, because this is all done inside a Runtime::block_on call. I haven't published this application (yet), but the problem can be reproduced with the following code:
use tokio::runtime::Builder;
use tokio::signal;
use tokio::sync::watch;
use tokio::task::JoinSet;
fn main() {
let runtime = Builder::new_multi_thread().worker_threads(2).build().unwrap();
runtime.block_on(async {
let _rt_guard = runtime.enter();
let (ping_tx, mut ping_rx) = watch::channel(0u32);
let (pong_tx, mut pong_rx) = watch::channel(0u32);
let mut tasks = JoinSet::new();
let ping = tasks.spawn(async move {
let mut val = 0u32;
ping_tx.send(val).unwrap();
while val < 10u32 {
pong_rx.changed().await.unwrap();
val = *pong_rx.borrow();
ping_tx.send(val + 1).unwrap();
println!("ping! {}", val + 1);
}
});
let pong = tasks.spawn(async move {
let mut val = 0u32;
while val < 10u32 {
ping_rx.changed().await.unwrap();
val = *ping_rx.borrow();
pong_tx.send(val + 1).unwrap();
println!("pong! {}", val + 1);
}
});
let mut interrupt = Box::pin(signal::ctrl_c());
let mut interrupt_read = false;
while !interrupt_read && !tasks.is_empty() {
tokio::select! {
biased;
_ = &mut interrupt, if !interrupt_read => {
ping.abort();
pong.abort();
interrupt_read = true;
},
_ = tasks.join_next() => {}
}
}
});
}
Rust Playground
This example is a bit contrived, but the important parts are:
I am intentionally using Runtime::block_on() instead of tokio::main as I want to control the number of runtime threads at runtime.
Although, curiously, this example works if rewritten to use tokio::main.
I added let _rt_guard = runtime.enter() to ensure that the runtime context was set, but its presence or absence don't appear to make a difference.
I discovered the answer to this while I was finishing writing up the question, but since I couldn't find the answer by searching on the error message I'll share the answer here.
As I noted, the example I gave works if run from within a function annotated with tokio::main. Looking at the docs, the main macro expands to:
fn main() {
tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap()
.block_on(async {
println!("Hello world");
})
}
The example I gave did not call enable_all() on the builder (or, more specifically, enable_io()). Adding this call to the builder chain stops the panic from occurring. The runtime.enter() call I made can also be removed, since it wasn't really doing anything.
Updated example (one of the threads still panics due to a Receiver handle being dropped, but the ctrl_c await no longer panics, which is the key):
use tokio::runtime::Builder;
use tokio::signal;
use tokio::sync::watch;
use tokio::task::JoinSet;
fn main() {
Builder::new_multi_thread()
.worker_threads(2)
.enable_io()
.build()
.unwrap()
.block_on(async {
let (ping_tx, mut ping_rx) = watch::channel(0u32);
let (pong_tx, mut pong_rx) = watch::channel(0u32);
let mut tasks = JoinSet::new();
let ping = tasks.spawn(async move {
let mut val = 0u32;
ping_tx.send(val).unwrap();
while val < 10u32 {
pong_rx.changed().await.unwrap();
val = *pong_rx.borrow();
ping_tx.send(val + 1).unwrap();
println!("ping! {}", val + 1);
}
});
let pong = tasks.spawn(async move {
let mut val = 0u32;
while val < 10u32 {
ping_rx.changed().await.unwrap();
val = *ping_rx.borrow();
pong_tx.send(val + 1).unwrap();
println!("pong! {}", val + 1);
}
});
let mut interrupt = Box::pin(signal::ctrl_c());
let mut interrupt_read = false;
while !interrupt_read && !tasks.is_empty() {
tokio::select! {
biased;
_ = &mut interrupt, if !interrupt_read => {
ping.abort();
pong.abort();
interrupt_read = true;
},
_ = tasks.join_next() => {}
}
}
});
}
Rust Playground

Why Sequential is faster than parallel?

The code is to count the frequency of each word in an article. In the code, I implemented sequential, muti-thread, and muti-thread with a thread pool.
I test the running time of three methods, however, I found that the sequential method is the fastest one. I use the article (data) at 37423.txt, the code is at play.rust-lang.org.
Below is just the single- and multi version (without the threadpool version):
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::SystemTime;
pub fn word_count(article: &str) -> HashMap<String, i64> {
let now1 = SystemTime::now();
let mut map = HashMap::new();
for word in article.split_whitespace() {
let count = map.entry(word.to_string()).or_insert(0);
*count += 1;
}
let after1 = SystemTime::now();
let d1 = after1.duration_since(now1);
println!("single: {:?}", d1.as_ref().unwrap());
map
}
fn word_count_thread(word_vec: Vec<String>, counts: &Arc<Mutex<HashMap<String, i64>>>) {
let mut p_count = HashMap::new();
for word in word_vec {
*p_count.entry(word).or_insert(0) += 1;
}
let mut counts = counts.lock().unwrap();
for (word, count) in p_count {
*counts.entry(word.to_string()).or_insert(0) += count;
}
}
pub fn mt_word_count(article: &str) -> HashMap<String, i64> {
let word_vec = article
.split_whitespace()
.map(|x| x.to_owned())
.collect::<Vec<String>>();
let count = Arc::new(Mutex::new(HashMap::new()));
let len = word_vec.len();
let q1 = len / 4;
let q2 = len / 2;
let q3 = q1 * 3;
let part1 = word_vec[..q1].to_vec();
let part2 = word_vec[q1..q2].to_vec();
let part3 = word_vec[q2..q3].to_vec();
let part4 = word_vec[q3..].to_vec();
let now2 = SystemTime::now();
let count1 = count.clone();
let count2 = count.clone();
let count3 = count.clone();
let count4 = count.clone();
let handle1 = thread::spawn(move || {
word_count_thread(part1, &count1);
});
let handle2 = thread::spawn(move || {
word_count_thread(part2, &count2);
});
let handle3 = thread::spawn(move || {
word_count_thread(part3, &count3);
});
let handle4 = thread::spawn(move || {
word_count_thread(part4, &count4);
});
handle1.join().unwrap();
handle2.join().unwrap();
handle3.join().unwrap();
handle4.join().unwrap();
let x = count.lock().unwrap().clone();
let after2 = SystemTime::now();
let d2 = after2.duration_since(now2);
println!("muti: {:?}", d2.as_ref().unwrap());
x
}
The result of mine is: single:7.93ms, muti: 15.78ms, threadpool: 25.33ms
I did the separation of the article before calculating time.
I want to know if the code has some problem.
First you may want to know the single-threaded version is mostly parsing whitespace (and I/O, but the file is small so it will be in the OS cache on the second run). At most ~20% of the runtime is the counting that you parallelized. Here is the cargo flamegraph of the single-threaded code:
In the multi-threaded version, it's a mess of thread creation, copying and hashmap overhead. To make sure it's not a "too little data" problem, I've used used 100x your input txt file and I'm measuring a 2x slowdown over the single-threaded version. According to the time command, it uses 2x CPU-time compared to wall-clock, so it seems to do some parallel work. The profile looks like this: (clickable svg version)
I'm not sure what to make of it exactly, but it's clear that memory management overhead has increased a lot. You seem to be copying strings for each hashmap, while an ideal wordcount would probably do zero string copying while counting.
More generally I think it's a bad idea to join the results in the threads - the way you do it (as opposed to a map-reduce pattern) the thread needs a global lock, so you could just pass the results back to the main thread instead for combining. I'm not sure if this is the main problem here, though.
Optimization
To avoid string copying, use HashMap<&str, i64> instead of HashMap<String, i64>. This requires some changes (lifetime annotations and thread::scope()). It makes mt_word_count() about 6x faster compared to the old version.
With a large input I'm measuring now a 4x speedup compared to word_count(). (Which is the best you can hope for with four threads.) The multi-threaded version is now also faster overall, but only by ~20% or so, for the reasons explained above. (Note that the single-threaded baseline has also profited the same &str optimization. Also, many things could still be improved/optimized, but I'll stop here.)
fn word_count_thread<'t>(word_vec: Vec<&'t str>, counts: &Arc<Mutex<HashMap<&'t str, i64>>>) {
let mut p_count = HashMap::new();
for word in word_vec {
*p_count.entry(word).or_insert(0) += 1;
}
let mut counts = counts.lock().unwrap();
for (word, count) in p_count {
*counts.entry(word).or_insert(0) += count;
}
}
pub fn mt_word_count<'t>(article: &'t str) -> HashMap<&'t str, i64> {
let word_vec = article.split_whitespace().collect::<Vec<&str>>();
// (skipping 16 unmodified lines)
let x = thread::scope(|scope| {
let handle1 = scope.spawn(move || {
word_count_thread(part1, &count1);
});
let handle2 = scope.spawn(move || {
word_count_thread(part2, &count2);
});
let handle3 = scope.spawn(move || {
word_count_thread(part3, &count3);
});
let handle4 = scope.spawn(move || {
word_count_thread(part4, &count4);
});
handle1.join().unwrap();
handle2.join().unwrap();
handle3.join().unwrap();
handle4.join().unwrap();
count.lock().unwrap().clone()
});
let after2 = SystemTime::now();
let d2 = after2.duration_since(now2);
println!("muti: {:?}", d2.as_ref().unwrap());
x
}

Rust threadpool with init code in each thread?

Following code is working, it can be tested in Playground
use std::{thread, time::Duration};
use rand::Rng;
fn main() {
let mut hiv = Vec::new();
let (sender, receiver) = crossbeam_channel::unbounded();
// make workers
for t in 0..5 {
println!("Make worker {}", t);
let receiver = receiver.clone(); // clone for this thread
let handler = thread::spawn(move || {
let mut rng = rand::thread_rng(); // each thread have one
loop {
let r = receiver.recv();
match r {
Ok(x) => {
let s = rng.gen_range(100..1000);
thread::sleep(Duration::from_millis(s));
println!("w={} r={} working={}", t, x, s);
},
_ => { println!("No more work for {} --- {:?}.", t, r); break},
}
}
});
hiv.push(handler);
}
// Generate jobs
for x in 0..10 {
sender.send(x).expect("all threads hung up :(");
}
drop(sender);
// wait for jobs to finish.
println!("Wait for all threads to finish.\n");
for h in hiv {
h.join().unwrap();
}
println!("join() done. Work Finish.");
}
My question is following :
Can I remove boilerplate code by using threadpool, rayon or some other Rust crate ?
I know that I could do my own implementation, but would like to know is there some crate with same functionality ?
From my research threadpool/rayon are useful when you "send" code and it is executed, but I have not found way to make N threads that will have some code/logic that they need to remember ?
Basic idea is in let mut rng = rand::thread_rng(); this is instance that each thread need to have on it own.
Also is there are some other problems with code, please point it out.
Yes, you can use Rayon to eliminate a lot of that code and make the remaining code much more readable, as illustrated in this gist:
https://gist.github.com/BillBarnhill/db07af903cb3c3edb6e715d9cedae028
The worker pool model is not great in Rust, due to the ownership rules. As a result parallel iterators are often a better choice.
I forgot to address your main concern, per thread context, originally. You can see how to store per thread context using a ThreadLocal! in this answer:
https://stackoverflow.com/a/42656422/204343
I will try to come back and edit the code to reflect ThreadLocal! use as soon as I have more time.
The gist requires nightly because of thread_id_value, but that is all but stable and can be removed if needed.
The real catch is that the gist has timing, and compares main_new with main_original, with surprising results. Perhaps not so surprising, Rayon has good debug support.
On Debug build the timing output is:
main_new duration: 1.525667954s
main_original duration: 1.031234059s
You can see main_new takes almost 50% longer to run.
On release however main_new is a little faster:
main_new duration: 1.584190936s
main_original duration: 1.5851124s
A slimmed version of the gist is below, with only the new code.
#![feature(thread_id_value)]
use std::{thread, time::Duration, time::Instant};
use rand::Rng;
#[allow(unused_imports)]
use rayon::prelude::*;
fn do_work(x : u32) -> String {
let mut rng = rand::thread_rng(); // each thread have one
let s = rng.gen_range(100..1000);
let thread_id = thread::current().id();
let t = thread_id.as_u64();
thread::sleep(Duration::from_millis(s));
format!("w={} r={} working={}", t, x, s)
}
fn process_work_product(output : String) {
println!("{}", output);
}
fn main() {
// bit hacky, but lets set number of threads to 5
rayon::ThreadPoolBuilder::new()
.num_threads(4)
.build_global()
.unwrap();
let x = 0..10;
x.into_par_iter()
.map(do_work)
.for_each(process_work_product);
}

Run function over range in multiple threads

I have a function,
fn calculate(x: i64) -> i64 {
// do stuff
x
}
which I want to apply to a range
for i in 0..100 {
calculate(i);
}
I want to multithread this though. I've tried different things: having an atomic i would be a good idea, but then I'd have to go into the details of shared ownership using libraries etc... is there a simple way of doing this?
If you just want to run stuff on multiple threads and don't really care about the specifics, rayon might be helpful:
use rayon::prelude::*;
fn calculate(x: i64) -> i64 {
x
}
fn main() {
let results = (0..100i64)
.into_par_iter()
.map(calculate)
.collect::<Vec<i64>>();
println!("Results: {:?}", results);
}
This will automatically spin up threads based on how many cores you have and distribute work between them.
Not really certain about what you want to achieve precisely, but here is a trivial example.
let th: Vec<_> = (0..4) // four threads will be launched
.map(|t| {
std::thread::spawn(move || {
let begin = t * 25;
let end = (t + 1) * 25;
for i in begin..end { // each one handles a quarter of the overall computation
let r = calculate(i);
println!("t={} i={} r={}", t, i, r);
}
})
})
.collect();
for t in th { // wait for the four threads to terminate
let _ = t.join();
}

Is it normal to experience large overhead using the 1:1 threading that comes in the standard library?

While working through learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance that I'm seeing:
rust_primes --threads 8 --verbose --count 1000000
Options { verbose: true, count: 1000000, threads: 8 }
Non-concurrent using while (15485863): 2.814 seconds.
Concurrent using mutexes (15485863): 876.561 seconds.
Concurrent using channels (15485863): 798.217 seconds.
Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:
fn non_concurrent(options: &Options) {
let mut count = 0;
let mut current = 0;
let ts = Instant::now();
while count < options.count {
if is_prime(current) {
count += 1;
}
current += 1;
}
let d = ts.elapsed();
println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_mutex(options: &Options) {
let count = Arc::new(Mutex::new(0));
let highest = Arc::new(Mutex::new(0));
let mut cc = 0;
let mut current = 0;
let ts = Instant::now();
while cc < options.count {
let mut handles = vec![];
for x in current..(current + options.threads) {
let count = Arc::clone(&count);
let highest = Arc::clone(&highest);
let handle = thread::spawn(move || {
if is_prime(x) {
let mut c = count.lock().unwrap();
let mut h = highest.lock().unwrap();
*c += 1;
if x > *h {
*h = x;
}
}
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
cc = *count.lock().unwrap();
current += options.threads;
}
let d = ts.elapsed();
println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_channel(options: &Options) {
let mut count = 0;
let mut current = 0;
let mut highest = 0;
let ts = Instant::now();
while count < options.count {
let (tx, rx) = mpsc::channel();
for x in current..(current + options.threads) {
let txc = mpsc::Sender::clone(&tx);
thread::spawn(move || {
if is_prime(x) {
txc.send(x).unwrap();
}
});
}
drop(tx);
for message in rx {
count += 1;
if message > highest && count <= options.count {
highest = message;
}
}
current += options.threads;
}
let d = ts.elapsed();
println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?
Here is a MCVE that shows the same problem. I didn't limit the number of threads it starts up at once here like I did in the code above. The point is, threading seems to have a very significant overhead unless I'm doing something horribly wrong.
use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;
fn main() {
let iterations = 100_000;
non_threaded(iterations);
threaded(iterations);
}
fn threaded(iterations: u32) {
let tx = Instant::now();
let counter = Arc::new(Mutex::new(0));
let mut handles = vec![];
for _ in 0..iterations {
let counter = Arc::clone(&counter);
let handle = thread::spawn(move || {
let mut num = counter.lock().unwrap();
*num = test(*num);
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
let d = tx.elapsed();
println!("Threaded in {}.", dur_to_string(d));
}
fn non_threaded(iterations: u32) {
let tx = Instant::now();
let mut _q = 0;
for x in 0..iterations {
_q = test(x + 1);
}
let d = tx.elapsed();
println!("Non-threaded in {}.", dur_to_string(d));
}
fn dur_to_string(d: Duration) -> String {
let mut s = d.as_secs().to_string();
s.push_str(".");
s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
s
}
fn test(x: u32) -> u32 {
x
}
Here are the results of this on my machine:
Non-threaded in 0.9.
Threaded in 5.785.
threading seems to have a very significant overhead
It's not the general concept of "threading", it's the concept of creating and destroying lots of threads.
By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen as some threads will exit, memory will be reused, etc. I only saw it use ~400MiB.
On top of that, there is overhead involved with inter-thread communication (Mutex, channels, Atomic*) compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend to not have a lot of communication required. There are also different amounts of time required for different communication primitives. Atomic variables tend to be faster than others in many cases, but aren't as widely usable.
Then there's compiler optimizations to account for. Non-threaded code is way easier to optimize compared to threaded code. For example, running your code in release mode shows:
Non-threaded in 0.0.
Threaded in 142.775.
That's right, the non-threaded code took no time. The compiler can see through the code and realizes that nothing actually happens and removes it all. I don't know how you got 5 seconds for the threaded code as opposed to the 2+ minutes I saw.
Switching to a threadpool will reduce a lot of the unneeded creation of threads. We can also use a threadpool that provides scoped threads, which allows us to avoid the Arc as well:
extern crate scoped_threadpool;
use scoped_threadpool::Pool;
fn threaded(iterations: u32) {
let tx = Instant::now();
let counter = Mutex::new(0);
let mut pool = Pool::new(8);
pool.scoped(|scope| {
for _ in 0..iterations {
scope.execute(|| {
let mut num = counter.lock().unwrap();
*num = test(*num);
});
}
});
let d = tx.elapsed();
println!("Threaded in {}.", dur_to_string(d));
}
Non-threaded in 0.0.
Threaded in 0.675.
As with most pieces of programming, it's crucial to understand the tools you have and to use them appropriately.

Resources