Why is sequential faster than parallel? - multithreading

This code counts the frequency of each word in an article. I implemented three versions: sequential, multi-threaded, and multi-threaded with a thread pool.
I measured the running time of the three methods; however, I found that the sequential method is the fastest one. I use the article (data) at 37423.txt; the code is at play.rust-lang.org.
Below is just the single- and multi-threaded version (without the threadpool version):
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::SystemTime;

pub fn word_count(article: &str) -> HashMap<String, i64> {
    let now1 = SystemTime::now();
    let mut map = HashMap::new();
    for word in article.split_whitespace() {
        let count = map.entry(word.to_string()).or_insert(0);
        *count += 1;
    }
    let after1 = SystemTime::now();
    let d1 = after1.duration_since(now1);
    println!("single: {:?}", d1.as_ref().unwrap());
    map
}
fn word_count_thread(word_vec: Vec<String>, counts: &Arc<Mutex<HashMap<String, i64>>>) {
    let mut p_count = HashMap::new();
    for word in word_vec {
        *p_count.entry(word).or_insert(0) += 1;
    }
    let mut counts = counts.lock().unwrap();
    for (word, count) in p_count {
        *counts.entry(word.to_string()).or_insert(0) += count;
    }
}
pub fn mt_word_count(article: &str) -> HashMap<String, i64> {
    let word_vec = article
        .split_whitespace()
        .map(|x| x.to_owned())
        .collect::<Vec<String>>();
    let count = Arc::new(Mutex::new(HashMap::new()));
    let len = word_vec.len();
    let q1 = len / 4;
    let q2 = len / 2;
    let q3 = q1 * 3;
    let part1 = word_vec[..q1].to_vec();
    let part2 = word_vec[q1..q2].to_vec();
    let part3 = word_vec[q2..q3].to_vec();
    let part4 = word_vec[q3..].to_vec();
    let now2 = SystemTime::now();
    let count1 = count.clone();
    let count2 = count.clone();
    let count3 = count.clone();
    let count4 = count.clone();
    let handle1 = thread::spawn(move || {
        word_count_thread(part1, &count1);
    });
    let handle2 = thread::spawn(move || {
        word_count_thread(part2, &count2);
    });
    let handle3 = thread::spawn(move || {
        word_count_thread(part3, &count3);
    });
    let handle4 = thread::spawn(move || {
        word_count_thread(part4, &count4);
    });
    handle1.join().unwrap();
    handle2.join().unwrap();
    handle3.join().unwrap();
    handle4.join().unwrap();
    let x = count.lock().unwrap().clone();
    let after2 = SystemTime::now();
    let d2 = after2.duration_since(now2);
    println!("muti: {:?}", d2.as_ref().unwrap());
    x
}
My result is: single: 7.93ms, muti: 15.78ms, threadpool: 25.33ms.
I did the splitting of the article before starting the timer.
I want to know if the code has some problem.

First, you may want to know that the single-threaded version is mostly parsing whitespace (and I/O, but the file is small, so it will be in the OS cache on the second run). At most ~20% of the runtime is the counting that you parallelized. (A cargo flamegraph of the single-threaded code shows this; the image is omitted here.)
In the multi-threaded version, it's a mess of thread creation, copying, and hashmap overhead. To make sure it's not a "too little data" problem, I used 100x your input txt file and I'm measuring a 2x slowdown over the single-threaded version. According to the time command, it uses 2x CPU time compared to wall-clock time, so it does seem to do some parallel work. (The profile SVG is likewise omitted.)
I'm not sure what to make of it exactly, but it's clear that memory-management overhead has increased a lot. You seem to be copying strings for each hashmap, while an ideal word count would do zero string copying while counting.
More generally, I think it's a bad idea to combine the results inside the threads: the way you do it (as opposed to a map-reduce pattern), each thread needs the global lock, so you could just pass the partial results back to the main thread instead and combine them there. I'm not sure if this is the main problem here, though.
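As a minimal sketch of that pass-results-back idea (the merge_counts helper and the chunking in main are illustrative names of mine, not from the original code), each thread can simply return its private HashMap through join(), so no Mutex is needed:

use std::collections::HashMap;
use std::thread;

// Each worker counts its own chunk; the main thread does the final merge.
fn merge_counts(chunks: Vec<Vec<String>>) -> HashMap<String, i64> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                let mut local = HashMap::new();
                for word in chunk {
                    *local.entry(word).or_insert(0) += 1;
                }
                local // the partial result travels back through join()
            })
        })
        .collect();
    let mut total: HashMap<String, i64> = HashMap::new();
    for handle in handles {
        for (word, count) in handle.join().unwrap() {
            *total.entry(word).or_insert(0) += count;
        }
    }
    total
}

fn main() {
    let text = "the quick brown fox jumps over the lazy dog the end";
    let words: Vec<String> = text.split_whitespace().map(str::to_owned).collect();
    let chunks: Vec<Vec<String>> = words.chunks(4).map(|c| c.to_vec()).collect();
    println!("{:?}", merge_counts(chunks));
}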
Optimization
To avoid string copying, use HashMap<&str, i64> instead of HashMap<String, i64>. This requires some changes (lifetime annotations and thread::scope()). It makes mt_word_count() about 6x faster compared to the old version.
With a large input, I'm now measuring a 4x speedup compared to word_count(), which is the best you can hope for with four threads. The multi-threaded version is now also faster overall, but only by ~20% or so, for the reasons explained above. (Note that the single-threaded baseline has also profited from the same &str optimization. Also, many things could still be improved/optimized, but I'll stop here.)
fn word_count_thread<'t>(word_vec: Vec<&'t str>, counts: &Arc<Mutex<HashMap<&'t str, i64>>>) {
    let mut p_count = HashMap::new();
    for word in word_vec {
        *p_count.entry(word).or_insert(0) += 1;
    }
    let mut counts = counts.lock().unwrap();
    for (word, count) in p_count {
        *counts.entry(word).or_insert(0) += count;
    }
}
pub fn mt_word_count<'t>(article: &'t str) -> HashMap<&'t str, i64> {
    let word_vec = article.split_whitespace().collect::<Vec<&str>>();
    // (skipping 16 unmodified lines)
    let x = thread::scope(|scope| {
        let handle1 = scope.spawn(move || {
            word_count_thread(part1, &count1);
        });
        let handle2 = scope.spawn(move || {
            word_count_thread(part2, &count2);
        });
        let handle3 = scope.spawn(move || {
            word_count_thread(part3, &count3);
        });
        let handle4 = scope.spawn(move || {
            word_count_thread(part4, &count4);
        });
        handle1.join().unwrap();
        handle2.join().unwrap();
        handle3.join().unwrap();
        handle4.join().unwrap();
        count.lock().unwrap().clone()
    });
    let after2 = SystemTime::now();
    let d2 = after2.duration_since(now2);
    println!("muti: {:?}", d2.as_ref().unwrap());
    x
}

Related

How to create threads that last entire duration of program and pass immutable chunks for threads to operate on?

I have a bunch of math that has real-time constraints. My main loop will call this function repeatedly, and it will always store results into an existing buffer. However, I want to spawn the threads at init time and then let them run, do their work, and wait for more data. For synchronization I will use a Barrier, and I have that part working. What I can't get working, despite trying various iterations of Arc and crossbeam, is splitting up the thread spawning and the actual workload. This is what I have now:
pub const WORK_SIZE: usize = 524_288;
pub const NUM_THREADS: usize = 6;
pub const NUM_TASKS_PER_THREAD: usize = WORK_SIZE / NUM_THREADS;

fn main() {
    let mut work: Vec<f64> = Vec::with_capacity(WORK_SIZE);
    for i in 0..WORK_SIZE {
        work.push(i as f64);
    }
    crossbeam::scope(|scope| {
        let threads: Vec<_> = work
            .chunks(NUM_TASKS_PER_THREAD)
            .map(|chunk| scope.spawn(move |_| chunk.iter().cloned().sum::<f64>()))
            .collect();
        let threaded_time = std::time::Instant::now();
        let thread_sum: f64 = threads.into_iter().map(|t| t.join().unwrap()).sum();
        let threaded_micros = threaded_time.elapsed().as_micros() as f64;
        println!("threaded took: {:#?}", threaded_micros);
        let serial_time = std::time::Instant::now();
        let no_thread_sum: f64 = work.iter().cloned().sum();
        let serial_micros = serial_time.elapsed().as_micros() as f64;
        println!("serial took: {:#?}", serial_micros);
        assert_eq!(thread_sum, no_thread_sum);
        println!(
            "Threaded performance was {:?}",
            serial_micros / threaded_micros
        );
    })
    .unwrap();
}
But I can't find a way to spin these threads up in an init function and then pass work into them from a do_work function. I attempted to do something like this with Arcs and Mutexes but couldn't get everything straight there either. What I want is to turn this into something like the following:
use std::sync::{Arc, Barrier, Mutex};
use std::{slice::Chunks, thread::JoinHandle};

pub const WORK_SIZE: usize = 524_288;
pub const NUM_THREADS: usize = 6;
pub const NUM_TASKS_PER_THREAD: usize = WORK_SIZE / NUM_THREADS;

// simplified version of the actual work the code base will do
fn do_work(data: &[f64], result: Arc<Mutex<f64>>, barrier: Arc<Barrier>) {
    loop {
        barrier.wait();
        let sum = data.into_iter().cloned().sum::<f64>();
        let mut result = *result.lock().unwrap();
        result += sum;
    }
}

fn init(
    mut data: Chunks<'_, f64>,
    result: &Arc<Mutex<f64>>,
    barrier: &Arc<Barrier>,
) -> Vec<std::thread::JoinHandle<()>> {
    let mut handles = Vec::with_capacity(NUM_THREADS);
    // spawn threads; in the actual code these would be stored in a lib crate struct
    for i in 0..NUM_THREADS {
        let result = result.clone();
        let barrier = barrier.clone();
        let chunk = data.nth(i).unwrap();
        handles.push(std::thread::spawn(|| {
            // Pass the particular thread the particular chunk it will operate on.
            do_work(chunk, result, barrier);
        }));
    }
    handles
}

fn main() {
    let mut work: Vec<f64> = Vec::with_capacity(WORK_SIZE);
    let mut result = Arc::new(Mutex::new(0.0));
    for i in 0..WORK_SIZE {
        work.push(i as f64);
    }
    let work_barrier = Arc::new(Barrier::new(NUM_THREADS + 1));
    let threads = init(work.chunks(NUM_TASKS_PER_THREAD), &result, &work_barrier);
    loop {
        work_barrier.wait();
        // actual code base would do something with summation stored in result.
        println!("{:?}", result.lock().unwrap());
    }
}
I hope this expresses the intent clearly enough. The issue with this specific implementation is that the chunks don't seem to live long enough, and when I tried wrapping them in an Arc, it just moved the "argument doesn't live long enough" error to the Arc::new(data.chunk(_)) line.
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

pub const WORK_SIZE: usize = 524_288;
pub const NUM_THREADS: usize = 6;
pub const NUM_TASKS_PER_THREAD: usize = WORK_SIZE / NUM_THREADS;

// simplified version of the actual work the code base will do
fn do_work(data: &[f64], result: Arc<Mutex<f64>>, barrier: Arc<Barrier>) {
    loop {
        barrier.wait();
        let sum = data.iter().sum::<f64>();
        *result.lock().unwrap() += sum;
    }
}

fn init(
    work: Vec<f64>,
    result: Arc<Mutex<f64>>,
    barrier: Arc<Barrier>,
) -> Vec<thread::JoinHandle<()>> {
    let mut handles = Vec::with_capacity(NUM_THREADS);
    // spawn threads; in the actual code these would be stored in a lib crate struct
    for i in 0..NUM_THREADS {
        // each thread gets an owned copy of its slice, which sidesteps the lifetime problem
        let slice = work[i * NUM_TASKS_PER_THREAD..(i + 1) * NUM_TASKS_PER_THREAD].to_owned();
        let result = Arc::clone(&result);
        let w = Arc::clone(&barrier);
        handles.push(thread::spawn(move || {
            do_work(&slice, result, w);
        }));
    }
    handles
}

fn main() {
    let mut work: Vec<f64> = Vec::with_capacity(WORK_SIZE);
    let result = Arc::new(Mutex::new(0.0));
    for i in 0..WORK_SIZE {
        work.push(i as f64);
    }
    let work_barrier = Arc::new(Barrier::new(NUM_THREADS + 1));
    let _threads = init(work, Arc::clone(&result), Arc::clone(&work_barrier));
    loop {
        thread::sleep(std::time::Duration::from_secs(3));
        work_barrier.wait();
        // actual code base would do something with summation stored in result.
        println!("{:?}", result.lock().unwrap());
    }
}
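One caveat worth noting (my observation, not part of the original answer): a single barrier.wait() only synchronizes the start of a round, so the main thread may lock and print result before all workers have added their sums. A second rendezvous after the work makes the read well-defined; std's Barrier is reusable, so the same one serves both. A self-contained sketch of that round structure, with a trivial stand-in for the real sum:

use std::sync::{Arc, Barrier, Mutex};
use std::thread;

fn main() {
    const N: usize = 4;
    let result = Arc::new(Mutex::new(0.0f64));
    // N workers plus the main thread participate in each rendezvous.
    let barrier = Arc::new(Barrier::new(N + 1));
    for id in 0..N {
        let result = Arc::clone(&result);
        let barrier = Arc::clone(&barrier);
        let _handle = thread::spawn(move || loop {
            barrier.wait(); // round start: main has published new work
            *result.lock().unwrap() += id as f64; // stand-in for the real sum
            barrier.wait(); // round end: all partial sums are in
        });
    }
    for _ in 0..3 {
        *result.lock().unwrap() = 0.0; // reset before releasing the workers
        barrier.wait(); // release the round
        barrier.wait(); // wait until every worker has added its sum
        println!("round result = {}", result.lock().unwrap());
    }
    // (workers are detached here; a real program would signal shutdown and join)
}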

How do I avoid obfuscating logic in a `loop`?

Trying to respect Rust safety rules leads me to write code that is, in this case, less clear than the alternative.
It's marginal, but this must be a very common pattern, so I wonder if there's a better way.
The following example doesn't compile:
async fn query_all_items() -> Vec<u32> {
    let mut items = vec![];
    let limit = 10;
    loop {
        let response = getResponse().await;
        // response is moved here
        items.extend(response);
        // can't do this, response is moved above
        if response.len() < limit {
            break;
        }
    }
    items
}
In order to satisfy Rust safety rules, we can pre-compute the break condition:
async fn query_all_items() -> Vec<u32> {
    let mut items = vec![];
    let limit = 10;
    loop {
        let response = getResponse().await;
        let should_break = response.len() < limit;
        // response is moved here
        items.extend(response);
        // meh
        if should_break {
            break;
        }
    }
    items
}
Is there any other way?
I agree with Daniel's point that this should be a while rather than a loop, though I'd move the logic to the while rather than creating a boolean:
let mut len = limit;
while len >= limit {
    let response = queryItems(limit).await?;
    len = response.len();
    items.extend(response);
}
Not that you should do this, but an async stream version is possible. However, a plain old loop is much easier to read.
use futures::{future, stream, StreamExt}; // 0.3.19
use rand::{
    distributions::{Distribution, Uniform},
    rngs::ThreadRng,
};
use std::sync::{Arc, Mutex};
use tokio; // 1.15.0

async fn get_response(rng: Arc<Mutex<ThreadRng>>) -> Vec<u32> {
    let mut rng = rng.lock().unwrap();
    let range = Uniform::from(0..100);
    let len_u32 = range.sample(&mut *rng);
    let len_usize = usize::try_from(len_u32).unwrap();
    vec![len_u32; len_usize]
}

async fn query_all_items() -> Vec<u32> {
    let rng = Arc::new(Mutex::new(ThreadRng::default()));
    stream::iter(0..)
        .then(|_| async { get_response(Arc::clone(&rng)).await })
        .take_while(|v| future::ready(v.len() >= 10))
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .flatten()
        .collect()
}

#[tokio::main]
async fn main() {
    // [46, 46, 46, ..., 78, 78, 78], or whatever random list you get
    println!("{:?}", query_all_items().await);
}
I would do this in a while loop since the while will surface the flag more easily.
async fn query_all_items() -> Vec<Item> {
    let mut items = vec![];
    let limit = 10;
    let mut limit_reached = true; // start true so the first query always runs
    while limit_reached {
        let response = queryItems(limit).await?;
        limit_reached = response.len() >= limit;
        items.extend(response);
    }
    items
}
Without context it's hard to advise ideal code. I would do:
fn my_body_is_ready() -> Vec<u32> {
    let mut acc = vec![];
    let min = 10;
    loop {
        let foo = vec![42];
        if foo.len() < min {
            acc.extend(foo);
            break acc;
        } else {
            acc.extend(foo);
        }
    }
}

How to give each CPU core mutable access to a portion of a Vec? [duplicate]

This question already has an answer here:
How do I pass disjoint slices from a vector to different threads? (closed 4 years ago)
I've got an embarrassingly parallel bit of graphics rendering code that I would like to run across my CPU cores. I've coded up a test case (the function computed is nonsense) to explore how I might parallelize it. I'd like to code this using std Rust in order to learn about using std::thread. But, I don't understand how to give each thread a portion of the framebuffer. I'll put the full testcase code below, but I'll try to break it down first.
The sequential form is super simple:
let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
for j in 0..HEIGHT {
    for i in 0..WIDTH {
        buffer0[j][i] = compute(i as i32, j as i32);
    }
}
I thought that it would help to make a buffer that was the same size, but re-arranged to be 3D & indexed by core first. This is the same computation, just a reordering of the data to show the workings.
let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
for c in 0..num_logical_cores {
    for y in 0..y_per_core {
        let j = y * num_logical_cores + c;
        if j >= HEIGHT {
            break;
        }
        for i in 0..WIDTH {
            buffer1[c][y][i] = compute(i as i32, j as i32)
        }
    }
}
But, when I try to put the inner part of the code in a closure & create a thread, I get errors about the buffer & lifetimes. I basically don't understand what to do & could use some guidance. I want per_core_buffer to just temporarily refer to the data in buffer2 that belongs to that core & allow it to be written, synchronize all the threads & then read buffer2 afterwards. Is this possible?
let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let mut handles = Vec::new();
for c in 0..num_logical_cores {
    let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
    let handle = thread::spawn(move || {
        for y in 0..y_per_core {
            let j = y * num_logical_cores + c;
            if j >= HEIGHT {
                break;
            }
            for i in 0..WIDTH {
                per_core_buffer[y][i] = compute(i as i32, j as i32)
            }
        }
    });
    handles.push(handle)
}
for handle in handles {
    handle.join().unwrap();
}
The error is this, and I don't understand it:
error[E0597]: `buffer2` does not live long enough
--> src/main.rs:50:36
|
50 | let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
| ^^^^^^^ borrowed value does not live long enough
...
88 | }
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
The full testcase is:
extern crate num_cpus;
use std::time::Instant;
use std::thread;

fn compute(x: i32, y: i32) -> i32 {
    (x * y) % (x + y + 10000)
}

fn main() {
    let num_logical_cores = num_cpus::get();
    const WIDTH: usize = 40000;
    const HEIGHT: usize = 10000;
    let y_per_core = HEIGHT / num_logical_cores + 1;

    // ------------------------------------------------------------
    // Serial Calculation...
    let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
    let start0 = Instant::now();
    for j in 0..HEIGHT {
        for i in 0..WIDTH {
            buffer0[j][i] = compute(i as i32, j as i32);
        }
    }
    let dur0 = start0.elapsed();

    // ------------------------------------------------------------
    // On the way to Parallel Calculation...
    // Reorder the data buffer to be 3D with one 2D region per core.
    let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
    let start1 = Instant::now();
    for c in 0..num_logical_cores {
        for y in 0..y_per_core {
            let j = y * num_logical_cores + c;
            if j >= HEIGHT {
                break;
            }
            for i in 0..WIDTH {
                buffer1[c][y][i] = compute(i as i32, j as i32)
            }
        }
    }
    let dur1 = start1.elapsed();

    // ------------------------------------------------------------
    // Actual Parallel Calculation...
    let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
    let mut handles = Vec::new();
    let start2 = Instant::now();
    for c in 0..num_logical_cores {
        let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
        let handle = thread::spawn(move || {
            for y in 0..y_per_core {
                let j = y * num_logical_cores + c;
                if j >= HEIGHT {
                    break;
                }
                for i in 0..WIDTH {
                    per_core_buffer[y][i] = compute(i as i32, j as i32)
                }
            }
        });
        handles.push(handle)
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let dur2 = start2.elapsed();
    println!("Runtime: Serial={0:.3}ms, AlmostParallel={1:.3}ms, Parallel={2:.3}ms",
        1000. * dur0.as_secs() as f64 + 1e-6 * (dur0.subsec_nanos() as f64),
        1000. * dur1.as_secs() as f64 + 1e-6 * (dur1.subsec_nanos() as f64),
        1000. * dur2.as_secs() as f64 + 1e-6 * (dur2.subsec_nanos() as f64));

    // Sanity check
    for j in 0..HEIGHT {
        let c = j % num_logical_cores;
        let y = j / num_logical_cores;
        for i in 0..WIDTH {
            if buffer0[j][i] != buffer1[c][y][i] {
                println!("wtf1? {0} {1} {2} {3}", i, j, buffer0[j][i], buffer1[c][y][i])
            }
            if buffer0[j][i] != buffer2[c][y][i] {
                println!("wtf2? {0} {1} {2} {3}", i, j, buffer0[j][i], buffer2[c][y][i])
            }
        }
    }
}
Thanks to @Shepmaster for the pointers and for clarifying that this is not an easy problem for Rust, and that I needed to consider crates to find a reasonable solution. I'm only just starting out in Rust, so this really wasn't clear to me.
I liked the ability to control the number of threads that scoped_threadpool gives, so I went with that. Translating my code from above directly, I tried to use the 4D buffer with core as the most-significant index, and that ran into trouble because the inner 3D vector does not implement the Copy trait. The fact that the data would need to implement Copy made me concerned about performance anyway, but I went back to the original problem, implemented it more directly, and found a reasonable speedup by making each row a thread. Copying each row will not be a large memory overhead.
The code that works for me is:
let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
let mut pool = Pool::new(num_logical_cores as u32);
pool.scoped(|scope| {
    let mut y = 0;
    for e in &mut buffer2 {
        scope.execute(move || {
            for x in 0..WIDTH {
                (*e)[x] = compute(x as i32, y as i32);
            }
        });
        y += 1;
    }
});
On a 6-core, 12-thread i7-8700K, a 400000x4000 testcase runs in 3.2 seconds serially and 481ms in parallel, a reasonable speedup.
EDIT: I continued to think about this issue and got a suggestion from Rustlang on Twitter that I should consider rayon. I converted my code to rayon and got a similar speedup with the following code.
let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
buffer2
    .par_iter_mut()
    .enumerate()
    .map(|(y, e): (usize, &mut Vec<i32>)| {
        for x in 0..WIDTH {
            (*e)[x] = compute(x as i32, y as i32);
        }
    })
    .collect::<Vec<_>>();
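For completeness: on current Rust, the original wish to stay within the standard library is also achievable, since std::thread::scope (stable since Rust 1.63) plus chunks_mut() hands each thread a disjoint mutable slice. A minimal sketch of that approach, assuming a flat row-major buffer instead of the nested Vecs above (sizes and thread count are arbitrary illustrative values):

use std::thread;

const WIDTH: usize = 1024;
const HEIGHT: usize = 1024;

fn compute(x: i32, y: i32) -> i32 {
    (x * y) % (x + y + 10000)
}

fn main() {
    let num_threads = 8;
    let mut buffer = vec![0i32; WIDTH * HEIGHT];
    let rows_per_chunk = (HEIGHT + num_threads - 1) / num_threads;
    thread::scope(|scope| {
        // chunks_mut yields non-overlapping &mut [i32] slices, so each
        // thread can write its band of rows without any locking.
        for (chunk_idx, chunk) in buffer.chunks_mut(rows_per_chunk * WIDTH).enumerate() {
            scope.spawn(move || {
                for (offset, cell) in chunk.iter_mut().enumerate() {
                    let j = chunk_idx * rows_per_chunk + offset / WIDTH;
                    let i = offset % WIDTH;
                    *cell = compute(i as i32, j as i32);
                }
            });
        }
    }); // all scoped threads are joined here
}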

Is it normal to experience large overhead using the 1:1 threading that comes in the standard library?

While working through learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance that I'm seeing:
rust_primes --threads 8 --verbose --count 1000000
Options { verbose: true, count: 1000000, threads: 8 }
Non-concurrent using while (15485863): 2.814 seconds.
Concurrent using mutexes (15485863): 876.561 seconds.
Concurrent using channels (15485863): 798.217 seconds.
Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:
fn non_concurrent(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let ts = Instant::now();
    while count < options.count {
        if is_prime(current) {
            count += 1;
        }
        current += 1;
    }
    let d = ts.elapsed();
    println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}

fn concurrent_mutex(options: &Options) {
    let count = Arc::new(Mutex::new(0));
    let highest = Arc::new(Mutex::new(0));
    let mut cc = 0;
    let mut current = 0;
    let ts = Instant::now();
    while cc < options.count {
        let mut handles = vec![];
        for x in current..(current + options.threads) {
            let count = Arc::clone(&count);
            let highest = Arc::clone(&highest);
            let handle = thread::spawn(move || {
                if is_prime(x) {
                    let mut c = count.lock().unwrap();
                    let mut h = highest.lock().unwrap();
                    *c += 1;
                    if x > *h {
                        *h = x;
                    }
                }
            });
            handles.push(handle);
        }
        for handle in handles {
            handle.join().unwrap();
        }
        cc = *count.lock().unwrap();
        current += options.threads;
    }
    let d = ts.elapsed();
    println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}

fn concurrent_channel(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let mut highest = 0;
    let ts = Instant::now();
    while count < options.count {
        let (tx, rx) = mpsc::channel();
        for x in current..(current + options.threads) {
            let txc = mpsc::Sender::clone(&tx);
            thread::spawn(move || {
                if is_prime(x) {
                    txc.send(x).unwrap();
                }
            });
        }
        drop(tx);
        for message in rx {
            count += 1;
            if message > highest && count <= options.count {
                highest = message;
            }
        }
        current += options.threads;
    }
    let d = ts.elapsed();
    println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?
Here is an MCVE that shows the same problem. I didn't limit the number of threads it starts at once here like I did in the code above. The point is, threading seems to have a very significant overhead unless I'm doing something horribly wrong.
use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;

fn main() {
    let iterations = 100_000;
    non_threaded(iterations);
    threaded(iterations);
}

fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];
    for _ in 0..iterations {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            let mut num = counter.lock().unwrap();
            *num = test(*num);
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let d = tx.elapsed();
    println!("Threaded in {}.", dur_to_string(d));
}

fn non_threaded(iterations: u32) {
    let tx = Instant::now();
    let mut _q = 0;
    for x in 0..iterations {
        _q = test(x + 1);
    }
    let d = tx.elapsed();
    println!("Non-threaded in {}.", dur_to_string(d));
}

fn dur_to_string(d: Duration) -> String {
    let mut s = d.as_secs().to_string();
    s.push_str(".");
    s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
    s
}

fn test(x: u32) -> u32 {
    x
}
Here are the results of this on my machine:
Non-threaded in 0.9.
Threaded in 5.785.
threading seems to have a very significant overhead
It's not the general concept of "threading", it's the concept of creating and destroying lots of threads.
By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen as some threads will exit, memory will be reused, etc. I only saw it use ~400MiB.
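(For reference, the default stack size can be overridden per thread via std::thread::Builder; a small sketch, where the 64 KiB figure is an arbitrary illustrative choice:)

use std::thread;

fn main() {
    // Spawn a thread with a much smaller stack than the 2 MiB default.
    let handle = thread::Builder::new()
        .name("small-stack".into())
        .stack_size(64 * 1024) // 64 KiB; arbitrary illustrative value
        .spawn(|| {
            println!("hello from a small-stack thread");
        })
        .unwrap();
    handle.join().unwrap();
}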
On top of that, there is overhead involved with inter-thread communication (Mutex, channels, Atomic*) compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend to not have a lot of communication required. There are also different amounts of time required for different communication primitives. Atomic variables tend to be faster than others in many cases, but aren't as widely usable.
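To make the cost difference concrete, here is a rough sketch (my own toy comparison, not a rigorous benchmark; iteration and thread counts are arbitrary) that bumps a shared counter through a Mutex and then through an atomic:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

fn main() {
    const ITERS: u64 = 1_000_000;
    const THREADS: u64 = 4;

    // Shared counter behind a Mutex: every increment takes the lock.
    let mutex_counter = Arc::new(Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let c = Arc::clone(&mutex_counter);
            thread::spawn(move || {
                for _ in 0..ITERS {
                    *c.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("mutex:  {:?} -> {}", start.elapsed(), *mutex_counter.lock().unwrap());

    // Same work with an atomic: no lock, just a hardware read-modify-write.
    let atomic_counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let c = Arc::clone(&atomic_counter);
            thread::spawn(move || {
                for _ in 0..ITERS {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("atomic: {:?} -> {}", start.elapsed(), atomic_counter.load(Ordering::Relaxed));
}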
Then there are compiler optimizations to account for. Non-threaded code is much easier to optimize than threaded code. For example, running your code in release mode shows:
Non-threaded in 0.0.
Threaded in 142.775.
That's right: the non-threaded code took no time at all. The compiler sees through the code, realizes that nothing actually happens, and removes it all. I don't know how you got 5 seconds for the threaded code, as opposed to the 2+ minutes I saw.
Switching to a threadpool will reduce a lot of the unneeded creation of threads. We can also use a threadpool that provides scoped threads, which allows us to avoid the Arc as well:
extern crate scoped_threadpool;
use scoped_threadpool::Pool;

fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Mutex::new(0);
    let mut pool = Pool::new(8);
    pool.scoped(|scope| {
        for _ in 0..iterations {
            scope.execute(|| {
                let mut num = counter.lock().unwrap();
                *num = test(*num);
            });
        }
    });
    let d = tx.elapsed();
    println!("Threaded in {}.", dur_to_string(d));
}
Non-threaded in 0.0.
Threaded in 0.675.
As with most pieces of programming, it's crucial to understand the tools you have and to use them appropriately.

How should I read the contents of a file respecting endianness?

I can see that in Rust I can read a file to a byte array with:
File::open(&Path::new("fid")).read_to_end();
I can also read just one u32 in either big endian or little endian format with:
File::open(&Path::new("fid")).read_be_u32();
File::open(&Path::new("fid")).read_le_u32();
but as far as I can see, I'm going to have to do something like this (simplified):
let path = Path::new("fid");
let mut file = File::open(&path);
let mut v = vec![];
for n in range(1u64, path.stat().unwrap().size / 4u64) {
    v.push(if big {
        file.read_be_u32()
    } else {
        file.read_le_u32()
    });
}
But that's ugly as hell and I'm just wondering if there's a nicer way to do this.
OK, so the if in the loop was a big part of what was ugly, so I hoisted that as suggested; the new version is as follows:
let path = Path::new("fid");
let mut file = File::open(&path);
let mut v = vec![];
let fun = if big {
    || -> IoResult<u32> { file.read_be_u32() }
} else {
    || -> IoResult<u32> { file.read_le_u32() }
};
for n in range(1u64, path.stat().unwrap().size / 4u64) {
    v.push(fun());
}
Learned about range_step and using _ as an index, so now I'm left with:
let path = Path::new("fid");
let mut file = File::open(&path);
let mut v = vec![];
let fun = if big {
    || -> IoResult<u32> { file.read_be_u32() }
} else {
    || -> IoResult<u32> { file.read_le_u32() }
};
for _ in range_step(0u64, path.stat().unwrap().size, 4u64) {
    v.push(fun().unwrap());
}
Any more advice? This is already looking much better.
This solution reads the whole file into a buffer, then creates a view of the buffer as words, then maps those words into a vector, converting endianness. collect() avoids all the reallocations of growing a mutable vector. You could also mmap the file rather than reading it into a buffer.
use std::io::File;
use std::num::{Int, Num};

fn from_bytes<'a, T: Num>(buf: &'a [u8]) -> &'a [T] {
    unsafe {
        std::mem::transmute(std::raw::Slice {
            data: buf.as_ptr(),
            len: buf.len() / std::mem::size_of::<T>()
        })
    }
}

fn main() {
    let buf = File::open(&Path::new("fid")).read_to_end().unwrap();
    let words: &[u32] = from_bytes(buf.as_slice());
    let big = true;
    let v: Vec<u32> = words.iter().map(if big {
        |&n| { Int::from_be(n) }
    } else {
        |&n| { Int::from_le(n) }
    }).collect();
    println!("{}", v);
}
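Note that this answer predates Rust 1.0 and no longer compiles. On modern Rust (a sketch assuming edition 2021, where TryInto is in the prelude) the same idea needs no unsafe: read the file into a Vec<u8> and convert fixed-size chunks with u32::from_be_bytes / u32::from_le_bytes. The byteorder crate offers the same conversions with more convenience.

use std::fs;
use std::io;

fn read_u32s(path: &str, big: bool) -> io::Result<Vec<u32>> {
    let buf = fs::read(path)?;
    // chunks_exact(4) silently drops a trailing partial word, just like
    // the size/4 loop in the question.
    Ok(buf
        .chunks_exact(4)
        .map(|c| {
            let bytes: [u8; 4] = c.try_into().unwrap();
            if big {
                u32::from_be_bytes(bytes)
            } else {
                u32::from_le_bytes(bytes)
            }
        })
        .collect())
}

fn main() -> io::Result<()> {
    let v = read_u32s("fid", true)?;
    println!("{:?}", v);
    Ok(())
}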
