Rust, how to perform basic recursive async?

I am doing some quick experimenting to learn the Rust language. I have done a few successful async tests, and this is my starting point:
use async_std::task;
use futures;
use std::time::SystemTime;

fn main() {
    let now = SystemTime::now();
    task::block_on(async {
        let mut fs = Vec::new();
        let sum = 100000000;
        let chunks: u64 = 5; //only valid for factors of sum
        let chunk_size: u64 = sum / chunks;
        for n in 1..=chunks {
            fs.push(task::spawn(async move {
                add_range((n - 1) * chunk_size + 1, n * chunk_size + 1)
            }));
        }
        let vals = futures::future::join_all(fs).await;
        // 5000000050000000 for this configuration of inputs
        println!("{}", vals.iter().sum::<u64>());
    });
    println!("{}ms", now.elapsed().unwrap().as_millis());
}

fn add_range(start: u64, end: u64) -> u64 {
    println!("{}, {}", start, end);
    let mut total: u64 = 0;
    for n in start..end {
        total += n;
    }
    return total;
}
By changing the value of chunks you can change how many task::spawn calls there are. Now, rather than a flat set of workers, I want the add_range function to be recursive and to keep forking off workers based on the inputs. However, following the compiler errors, I have gotten myself quite tangled up:
use async_std::task;
use futures;
use std::future::Future;
use std::pin::Pin;

fn main() {
    let pin_box_u64 = task::block_on(add_range(0, 10, 10, 1, 1001));
    println!("{}", pin_box_u64 /*how do i get u64 out of this*/)
}

// recursively calls itself in a branching tree structure
// forking off more worker threads
async fn add_range(
    depth: u64,
    chunk_split: u64,
    chunk_size: u64,
    start: u64,
    end: u64,
) -> Pin<Box<dyn Future<Output = u64>>> {
    println!("{}, {}, {}", depth, start, end);
    // if the range of start to end is more than the allowed
    // chunk_size then fork off more workers dividing
    // the work up further.
    if end - start > chunk_size {
        let mut fs = Vec::new();
        let next_chunk_size = (end - start) / chunk_split;
        for n in 0..chunk_split {
            let s = start + (next_chunk_size * n);
            let mut e = start + (next_chunk_size * (n + 1));
            if e > end {
                e = end;
            }
            // spawn more workers
            fs.push(task::spawn(add_range(depth + 1, chunk_split, chunk_size, s, e)));
        }
        return Box::pin(async move {
            // join workers back up and do joining sum.
            return futures::future::join_all(fs).await.iter().map(/*how do i get u64s out of here*/).sum::<u64>();
        });
    } else {
        // else the work is less than the allowed chunk_size
        // so lets now do the actual sum for my chunk
        let mut total: u64 = 0;
        for n in start..end {
            total += n;
        }
        return Box::pin(async move { total });
    }
}
I have played around with this for a while, but I feel like I'm just becoming more and more lost in the compiler errors.

You need to box the returned future, otherwise the compiler can't determine the size of the return type.
Additional context can be found here: https://rust-lang.github.io/async-book/07_workarounds/04_recursion.html
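As a minimal sketch of that workaround (sum_to is a hypothetical function, not part of the question's code): a recursive async fn becomes a plain fn whose body is an async move block boxed with FutureExt::boxed, so the return type has a known size.
use futures::future::{BoxFuture, FutureExt};

// The recursion goes through a boxed future, so the future type is no
// longer infinitely sized.
fn sum_to(n: u64) -> BoxFuture<'static, u64> {
    async move {
        if n == 0 {
            0
        } else {
            n + sum_to(n - 1).await
        }
    }
    .boxed()
}
The full solution below applies the same idea to add_range, using FutureExt::boxed rather than hand-writing Box::pin.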
use std::pin::Pin;
use async_std::task;
use futures::Future;
use futures::FutureExt;

fn main() {
    let pin_box_u64 = task::block_on(add_range(0, 10, 10, 1, 1001));
    println!("{}", pin_box_u64)
}

// recursively calls itself in a branching tree structure
// forking off more worker threads
fn add_range(
    depth: u64,
    chunk_split: u64,
    chunk_size: u64,
    start: u64,
    end: u64,
) -> Pin<Box<dyn Future<Output = u64> + Send + 'static>> {
    println!("{}, {}, {}", depth, start, end);
    // if the range of start to end is more than the allowed
    // chunk_size then fork off more workers dividing
    // the work up further.
    if end - start > chunk_size {
        let mut fs = Vec::new();
        let next_chunk_size = (end - start) / chunk_split;
        for n in 0..chunk_split {
            let s = start + (next_chunk_size * n);
            let mut e = start + (next_chunk_size * (n + 1));
            if e > end {
                e = end;
            }
            // spawn more workers
            fs.push(task::spawn(add_range(
                depth + 1,
                chunk_split,
                chunk_size,
                s,
                e,
            )));
        }
        // join workers back up and do joining sum.
        return futures::future::join_all(fs)
            .map(|v| v.iter().sum::<u64>())
            .boxed();
    } else {
        // else the work is less than the allowed chunk_size
        // so lets now do the actual sum for my chunk
        let mut total: u64 = 0;
        for n in start..end {
            total += n;
        }
        return futures::future::ready(total).boxed();
    }
}
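To address the "how do i get u64 out of this" comment in the question: with this version add_range itself returns the boxed future, so block_on drives it to completion and hands back a plain u64, with no further unwrapping needed. A small sketch of the call site (assuming the add_range above; the expected value is just the closed-form sum, worked out by hand):
// Inside main (or any sync context), assuming the add_range above:
let total: u64 = async_std::task::block_on(add_range(0, 10, 10, 1, 1001));
// 1..1001 covers 1 through 1000, so the expected total is 500500.
assert_eq!(total, (1..1001u64).sum::<u64>());
println!("{}", total);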

Related

How to create threads that last entire duration of program and pass immutable chunks for threads to operate on?

I have a bunch of math that has real-time constraints. My main loop will just call this function repeatedly, and it will always store results into an existing buffer. However, I want to be able to spawn the threads at init time, then let the threads run, do their work, and wait for more data. For synchronization I will use a Barrier, and I have that part working. What I can't get working, despite trying various iterations of Arc and crossbeam, is splitting up the thread spawning and the actual workload. This is what I have now:
pub const WORK_SIZE: usize = 524_288;
pub const NUM_THREADS: usize = 6;
pub const NUM_TASKS_PER_THREAD: usize = WORK_SIZE / NUM_THREADS;
fn main() {
let mut work: Vec<f64> = Vec::with_capacity(WORK_SIZE);
for i in 0..WORK_SIZE {
work.push(i as f64);
}
crossbeam::scope(|scope| {
let threads: Vec<_> = work
.chunks(NUM_TASKS_PER_THREAD)
.map(|chunk| scope.spawn(move |_| chunk.iter().cloned().sum::<f64>()))
.collect();
let threaded_time = std::time::Instant::now();
let thread_sum: f64 = threads.into_iter().map(|t| t.join().unwrap()).sum();
let threaded_micros = threaded_time.elapsed().as_micros() as f64;
println!("threaded took: {:#?}", threaded_micros);
let serial_time = std::time::Instant::now();
let no_thread_sum: f64 = work.iter().cloned().sum();
let serial_micros = serial_time.elapsed().as_micros() as f64;
println!("serial took: {:#?}", serial_micros);
assert_eq!(thread_sum, no_thread_sum);
println!(
"Threaded performace was {:?}",
serial_micros / threaded_micros
);
})
.unwrap();
}
But I can't find a way to spin these threads up in an init function and then pass work into them from a do_work function. I attempted to do something like this with Arcs and Mutexes but couldn't get everything straight there either. What I want to turn this into is something like the following:
use std::sync::{Arc, Barrier, Mutex};
use std::{slice::Chunks, thread::JoinHandle};
pub const WORK_SIZE: usize = 524_288;
pub const NUM_THREADS: usize = 6;
pub const NUM_TASKS_PER_THREAD: usize = WORK_SIZE / NUM_THREADS;
//simplified version of what actual work that code base will do
fn do_work(data: &[f64], result: Arc<Mutex<f64>>, barrier: Arc<Barrier>) {
loop {
barrier.wait();
let sum = data.into_iter().cloned().sum::<f64>();
let mut result = *result.lock().unwrap();
result += sum;
}
}
fn init(
mut data: Chunks<'_, f64>,
result: &Arc<Mutex<f64>>,
barrier: &Arc<Barrier>,
) -> Vec<std::thread::JoinHandle<()>> {
let mut handles = Vec::with_capacity(NUM_THREADS);
//spawn threads, in actual code these would be stored in a lib crate struct
for i in 0..NUM_THREADS {
let result = result.clone();
let barrier = barrier.clone();
let chunk = data.nth(i).unwrap();
handles.push(std::thread::spawn(|| {
//Pass the particular thread the particular chunk it will operate on.
do_work(chunk, result, barrier);
}));
}
handles
}
fn main() {
let mut work: Vec<f64> = Vec::with_capacity(WORK_SIZE);
let mut result = Arc::new(Mutex::new(0.0));
for i in 0..WORK_SIZE {
work.push(i as f64);
}
let work_barrier = Arc::new(Barrier::new(NUM_THREADS + 1));
let threads = init(work.chunks(NUM_TASKS_PER_THREAD), &result, &work_barrier);
loop {
work_barrier.wait();
//actual code base would do something with summation stored in result.
println!("{:?}", result.lock().unwrap());
}
}
I hope this expresses the intent of what I need to do clearly enough. The issue with this specific implementation is that the chunks don't seem to live long enough, and when I tried wrapping them in an Arc it just moved the "argument doesn't live long enough" error to the Arc::new(data.chunk(_)) line.
use std::sync::{Arc, Barrier, Mutex};
use std::thread;
pub const WORK_SIZE: usize = 524_288;
pub const NUM_THREADS: usize = 6;
pub const NUM_TASKS_PER_THREAD: usize = WORK_SIZE / NUM_THREADS;
//simplified version of what actual work that code base will do
fn do_work(data: &[f64], result: Arc<Mutex<f64>>, barrier: Arc<Barrier>) {
loop {
barrier.wait();
let sum = data.iter().sum::<f64>();
*result.lock().unwrap() += sum;
}
}
fn init(
work: Vec<f64>,
result: Arc<Mutex<f64>>,
barrier: Arc<Barrier>,
) -> Vec<thread::JoinHandle<()>> {
let mut handles = Vec::with_capacity(NUM_THREADS);
//spawn threads, in actual code these would be stored in a lib crate struct
for i in 0..NUM_THREADS {
let slice = work[i * NUM_TASKS_PER_THREAD..(i + 1) * NUM_TASKS_PER_THREAD].to_owned();
let result = Arc::clone(&result);
let w = Arc::clone(&barrier);
handles.push(thread::spawn(move || {
do_work(&slice, result, w);
}));
}
handles
}
fn main() {
let mut work: Vec<f64> = Vec::with_capacity(WORK_SIZE);
let result = Arc::new(Mutex::new(0.0));
for i in 0..WORK_SIZE {
work.push(i as f64);
}
let work_barrier = Arc::new(Barrier::new(NUM_THREADS + 1));
let _threads = init(work, Arc::clone(&result), Arc::clone(&work_barrier));
loop {
thread::sleep(std::time::Duration::from_secs(3));
work_barrier.wait();
//actual code base would do something with summation stored in result.
println!("{:?}", result.lock().unwrap());
}
}
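The key change in the code above is that each worker thread gets an owned copy of its slice via to_owned(), so nothing borrowed from work has to outlive the spawning function. As a hedged alternative sketch (assuming Rust 1.63+ and that per-call scopes are acceptable instead of the long-lived workers the question asks for), std::thread::scope lets the threads borrow disjoint chunks of work directly, with no copying; sum_in_parallel is an illustrative name:
use std::thread;

// Sketch only: sums chunks of `work` on scoped threads that borrow the slice.
// Scoped threads are joined when the scope ends, so this trades the
// "spawn once at init, reuse forever" design for a per-call scope.
fn sum_in_parallel(work: &[f64], num_threads: usize) -> f64 {
    let chunk_size = work.len() / num_threads; // assumes work.len() >= num_threads
    thread::scope(|s| {
        let handles: Vec<_> = work
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}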

How to declare generic types for a function that computes k-shortest-paths using Yen's algorithm and petgraph?

I have implemented Yen's algorithm Wikipedia using petgraph in Rust.
In a main function, the code looks like this:
use std::collections::BinaryHeap;
use std::cmp::Reverse;
use std::collections::HashSet;
use petgraph::{Graph, Undirected};
use petgraph::graph::NodeIndex;
use petgraph::stable_graph::StableUnGraph;
use petgraph::algo::{astar};
use petgraph::visit::NodeRef;
fn main() {
let mut graph: Graph<String, u32, Undirected> = Graph::new_undirected();
let c = graph.add_node(String::from("C"));
let d = graph.add_node(String::from("D"));
let e = graph.add_node(String::from("E"));
let f = graph.add_node(String::from("F"));
let g = graph.add_node(String::from("G"));
let h = graph.add_node(String::from("H"));
graph.add_edge(c, d, 3);
graph.add_edge(c, e, 2);
graph.add_edge(d, e, 1);
graph.add_edge(d, f, 4);
graph.add_edge(e, f, 2);
graph.add_edge(e, g, 3);
graph.add_edge(f, g, 2);
graph.add_edge(f, h, 1);
graph.add_edge(g, h, 2);
let start = c;
let goal = h;
// start solving Yen's k-shortest-paths
let (length, path) = match astar(&graph, start, |n| n == goal.unwrap(), |e| *e.weight(), |_| 0) {
Some(x) => x,
None => panic!("Testing!"),
};
println!("Initial path found\tlength: {}", length);
for i in 0..(path.len() - 1) {
println!("\t{:?}({:?}) -> {:?}({:?})", graph.node_weight(path[i].id()).unwrap(), path[i].id(), graph.node_weight(path[i+1].id()).unwrap(), path[i+1].id());
}
let k = 10;
let mut ki = 0;
let mut visited = HashSet::new();
let mut routes = vec![(length, path)];
let mut k_routes = BinaryHeap::new();
for ki in 0..(k - 1) {
println!("Computing path {}", ki);
if routes.len() <= ki {
// We have no more routes to explore
break;
}
let previous = routes[ki].1.clone();
for i in 0..(previous.len() - 1) {
let spur_node = previous[i].clone();
let root_path = &previous[0..i];
let mut graph_copy = StableUnGraph::<String, u32>::from(graph.clone());
println!("\tComputing pass {}\tspur: {:?}\troot: {:?}", i, graph.node_weight(spur_node), root_path.iter().map(|n| graph.node_weight(*n).unwrap()));
for (_, path) in &routes {
if path.len() > i + 1 && &path[0..i] == root_path {
let ei = graph.find_edge_undirected(path[i], path[i + 1]);
if ei.is_some() {
let edge = ei.unwrap().0;
graph_copy.remove_edge(edge);
let edge_obj = graph.edge_endpoints(edge);
let ns = edge_obj.unwrap();
println!("\t\tRemoving edge {:?} from {:?} -> {:?}", edge, graph.node_weight(ns.0).unwrap(), graph.node_weight(ns.1).unwrap());
}
else {
panic!("\t\tProblem finding edge");
}
}
}
if let Some((_, spur_path)) =
astar(&graph_copy, spur_node, |n| n == goal.unwrap(), |e| *e.weight(), |_| 0)
{
let nodes: Vec<NodeIndex> = root_path.iter().cloned().chain(spur_path).collect();
let mut node_names = vec![];
for ni in 0..nodes.len() {
node_names.push(graph.node_weight(nodes[ni]).unwrap());
}
// compute root_path length
let mut path_length = 0;
for i_rp in 0..(nodes.len() - 1) {
let ei = graph.find_edge_undirected(nodes[i_rp], nodes[i_rp + 1]);
if ei.is_some() {
let ew = graph.edge_weight(ei.unwrap().0);
if ew.is_some() {
path_length += ew.unwrap();
}
}
}
println!("\t\t\tfound path: {:?} with cost {}", node_names, path_length);
if !visited.contains(&nodes) {
// Mark as visited
visited.insert(nodes.clone());
// Build a min-heap
k_routes.push(Reverse((path_length, nodes)));
}
}
}
if let Some(k_route) = k_routes.pop() {
println!("\tselected route {:?}", k_route.0);
routes.push(k_route.0);
}
}
}
Now, I want to put this algorithm within a function that I can call from my code. I made an initial attempt with the signature like this:
pub fn yen_k_shortest_paths<G, E, Ty, Ix, F, K>(
graph: Graph<String, u32, Undirected>,
start: NodeIndex<u32>,
goal: NodeIndex<u32>,
mut edge_cost: F,
k: usize,
) -> Result<Vec<(u32, Vec<NodeIndex<u32>>)>, Box<dyn std::error::Error>>
where
G: IntoEdges + Visitable,
Ty: EdgeType,
Ix: IndexType,
E: Default + Debug + std::ops::Add,
F: FnMut(G::EdgeRef) -> K,
K: Measure + Copy,
{
// implementation here
}
However, when I try to call the function with:
let paths = yen::yen_k_shortest_paths(graph, start, goal, |e: EdgeReference<u32>| *e.weight(), 5);
the compiler complains: type annotations needed: cannot satisfy `<_ as IntoEdgeReferences>::EdgeRef == petgraph::graph::EdgeReference<'_, u32>`
I already tried several alternatives without success. Do you have any suggestion on how to fix this issue?
The issue with the yen_k_shortest_paths() function signature as written is the generic type parameters aren't used correctly. As an example, consider the first declared type parameter on yen_k_shortest_paths(): G, which is intended to represent the graph type. Declaring G like this means that the code that calls yen_k_shortest_paths() gets to pick the graph type G. But the graph argument is declared with the concrete type Graph<String, u32, Undirected>—the caller has no choice. This contradiction is the problem with G. Similar reasoning applies to the other type parameters, except F and K. There are two ways to fix this kind of issue:
1. Keep the graph argument as Graph<String, u32, Undirected> and remove the G type parameter.
2. Change the graph argument to take a G.
Approach #1 is simpler, but your function won't be as general. Approach #2 can require adding extra bounds and making some code changes inside the function in order for it to compile.
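To make that contradiction concrete, here is a deliberately simplified sketch (broken is an illustrative name, not the original signature): G is declared as caller-chosen, but no argument mentions it, so the compiler has nothing to infer it from and calls fail with the kind of "type annotations needed" error quoted above.
use petgraph::visit::IntoEdges;
use petgraph::{Graph, Undirected};

// `G` never appears in the argument types, so a call such as
// `broken(&graph, |e| *e.weight())` cannot determine `G`.
fn broken<G, F, K>(_graph: &Graph<String, u32, Undirected>, _edge_cost: F) -> Option<K>
where
    G: IntoEdges,
    F: FnMut(G::EdgeRef) -> K,
{
    None
}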
In this case, the simplest approach doesn't need any type parameters at all:
fn yen_k_shortest_paths(
graph: &Graph<String, u32, Undirected>,
start: NodeIndex<u32>,
goal: NodeIndex<u32>,
edge_cost: fn(EdgeReference<u32>) -> u32,
k: usize,
) -> Vec<(u32, Vec<NodeIndex<u32>>)> {...}
Here's the full code, which can be run:
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::collections::HashSet;
use petgraph::algo::astar;
use petgraph::graph::{EdgeReference, NodeIndex};
use petgraph::stable_graph::StableUnGraph;
use petgraph::visit::NodeRef;
use petgraph::{Graph, Undirected};
fn main() {
let mut graph: Graph<String, u32, Undirected> = Graph::new_undirected();
let c = graph.add_node(String::from("C"));
let d = graph.add_node(String::from("D"));
let e = graph.add_node(String::from("E"));
let f = graph.add_node(String::from("F"));
let g = graph.add_node(String::from("G"));
let h = graph.add_node(String::from("H"));
graph.add_edge(c, d, 3);
graph.add_edge(c, e, 2);
graph.add_edge(d, e, 1);
graph.add_edge(d, f, 4);
graph.add_edge(e, f, 2);
graph.add_edge(e, g, 3);
graph.add_edge(f, g, 2);
graph.add_edge(f, h, 1);
graph.add_edge(g, h, 2);
let start = c;
let goal = h;
let edge_cost = |e: EdgeReference<u32>| *e.weight();
let k = 10;
let _paths = yen_k_shortest_paths(&graph, start, goal, edge_cost, k);
}
fn yen_k_shortest_paths(
graph: &Graph<String, u32, Undirected>,
start: NodeIndex<u32>,
goal: NodeIndex<u32>,
edge_cost: fn(EdgeReference<u32>) -> u32,
k: usize,
) -> Vec<(u32, Vec<NodeIndex<u32>>)> {
let (length, path) = match astar(graph, start, |n| n == goal, edge_cost, |_| 0) {
Some(x) => x,
None => panic!("Testing!"),
};
println!("Initial path found\tlength: {}", length);
for i in 0..(path.len() - 1) {
println!(
"\t{:?}({:?}) -> {:?}({:?})",
graph.node_weight(path[i].id()).unwrap(),
path[i].id(),
graph.node_weight(path[i + 1].id()).unwrap(),
path[i + 1].id()
);
}
let mut visited = HashSet::new();
let mut routes = vec![(length, path)];
let mut k_routes = BinaryHeap::new();
for ki in 0..(k - 1) {
println!("Computing path {}", ki);
if routes.len() <= ki {
// We have no more routes to explore
break;
}
let previous = routes[ki].1.clone();
for i in 0..(previous.len() - 1) {
let spur_node = previous[i];
let root_path = &previous[0..i];
let mut graph_copy = StableUnGraph::from(graph.clone());
println!(
"\tComputing pass {}\tspur: {:?}\troot: {:?}",
i,
graph.node_weight(spur_node),
root_path
.iter()
.map(|n| graph.node_weight(*n).unwrap())
.collect::<Vec<_>>()
);
for (_, path) in &routes {
if path.len() > i + 1 && &path[0..i] == root_path {
let ei = graph.find_edge_undirected(path[i], path[i + 1]);
if let Some(ei) = ei {
let edge = ei.0;
graph_copy.remove_edge(edge);
let edge_obj = graph.edge_endpoints(edge);
let ns = edge_obj.unwrap();
println!(
"\t\tRemoving edge {:?} from {:?} -> {:?}",
edge,
graph.node_weight(ns.0).unwrap(),
graph.node_weight(ns.1).unwrap()
);
} else {
panic!("\t\tProblem finding edge");
}
}
}
if let Some((_, spur_path)) = astar(
&graph_copy,
spur_node,
|n| n == goal,
|e| *e.weight(),
|_| 0,
) {
let nodes: Vec<NodeIndex> = root_path.iter().cloned().chain(spur_path).collect();
let mut node_names = vec![];
for &node in &nodes {
node_names.push(graph.node_weight(node).unwrap());
}
// compute root_path length
let mut path_length = 0;
for i_rp in 0..(nodes.len() - 1) {
let ei = graph.find_edge_undirected(nodes[i_rp], nodes[i_rp + 1]);
if let Some(ei) = ei {
let ew = graph.edge_weight(ei.0);
if let Some(&ew) = ew {
path_length += ew;
}
}
}
println!(
"\t\t\tfound path: {:?} with cost {}",
node_names, path_length
);
if !visited.contains(&nodes) {
// Mark as visited
visited.insert(nodes.clone());
// Build a min-heap
k_routes.push(Reverse((path_length, nodes)));
}
}
}
if let Some(k_route) = k_routes.pop() {
println!("\tselected route {:?}", k_route.0);
routes.push(k_route.0);
}
}
routes
}
As another example of a possible function signature, this one is generic over the node type N and the edge cost function F:
fn yen_k_shortest_paths<'a, N, F>(
graph: &'a Graph<N, u32, Undirected>,
start: NodeIndex<u32>,
goal: NodeIndex<u32>,
edge_cost: F,
k: usize,
) -> Vec<(u32, Vec<NodeIndex<u32>>)>
where
&'a Graph<N, u32, Undirected>:
GraphBase<NodeId = NodeIndex<u32>> + IntoEdgeReferences<EdgeRef = EdgeReference<'a, u32>>,
N: Debug + Clone,
F: FnMut(EdgeReference<u32>) -> u32,
{...}
As you can see, these bounds can get pretty complicated. Figuring them out involved reading the error messages the compiler emitted, as well as reading the docs for the involved types/traits. (Although I think in this case the complicated bound &'a Graph<N, u32, Undirected>: GraphBase<NodeId = NodeIndex<u32>> + IntoEdgeReferences<EdgeRef = EdgeReference<'a, u32>> should be inferred, but currently isn't due to a compiler bug/limitation.)

Severe performance degradation over time in multi-threading: what am I missing?

In my application a method runs quickly once started but begins to continuously degrade in performance as it nears completion; this seems to happen regardless of the amount of work (the number of iterations of the function each thread has to perform). Once it nears the end it slows to an incredibly slow pace compared to earlier (worth noting this is not just a result of fewer threads remaining incomplete; each thread itself seems to slow down).
I cannot figure out why this occurs, so I'm asking. What am I doing wrong?
An overview of CPU usage (the original post showed this as a series of screenshots; their captions are summarized here). Worth noting that CPU temperature remains low throughout. The first stage varies with however much work is set; more work produces a better appearance, with all threads constantly near 100%, and at this point everything looks good. That performance continues for a while, then starts to degrade, and I do not know why. After some period of chaos most of the threads have finished their work and the remaining threads continue; at this point, although they appear to be at 100%, they actually work through their remaining workload very slowly. I cannot understand why this occurs.
Printing progress
I have written a multi-threaded random_search (documentation link) function for optimization. Most of the complexity in this function comes from printing data and passing data between threads; this supports giving outputs showing progress like:
2300
565 (24.57%) 00:00:11 / 00:00:47 [25.600657363049734] { [563.0ns, 561.3ms, 125.0ns, 110.0ns] [2.0µs, 361.8ms, 374.0ns, 405.0ns] [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] }
I have been trying to use this output to figure out what's gone wrong, but I have no idea.
This output describes:
The total number of iterations 2300.
The total number of current iterations 565.
The time running 00:00:11 (mm:ss:ms).
The estimated time remaining 00:00:47 (mm:ss:ms).
The current best value [25.600657363049734].
The most recently measured times between execution positions (effectively the time taken for a thread to go from one specific line to another, with these positions set via update_execution_position in the code below): [563.0ns, 561.3ms, 125.0ns, 110.0ns].
The average times between execution positions (averaged across the entire runtime rather than since last measured): [2.0µs, 361.8ms, 374.0ns, 405.0ns].
The execution positions of the threads (0 means a thread has completed; other values mean a thread has passed the line that sets that position but not yet the next one, so it is effectively between two positions): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The random_search code:
Given that I have tested implementations of the other methods in my library, grid_search and simulated_annealing, it would suggest to me that the problem does not, at least not entirely, reside in random_search.rs.
random_search.rs:
pub fn random_search<
A: 'static + Send + Sync,
T: 'static + Copy + Send + Sync + Default + SampleUniform + PartialOrd,
const N: usize,
>(
// Generics
ranges: [Range<T>; N],
f: fn(&[T; N], Option<Arc<A>>) -> f64,
evaluation_data: Option<Arc<A>>,
polling: Option<Polling>,
// Specifics
iterations: u64,
) -> [T; N] {
// Gets cpu data
let cpus = num_cpus::get() as u64;
let search_cpus = cpus - 1; // 1 cpu is used for polling, this one.
let remainder = iterations % search_cpus;
let per = iterations / search_cpus;
let ranges_arc = Arc::new(ranges);
let (best_value, best_params) = search(
// Generics
ranges_arc.clone(),
f,
evaluation_data.clone(),
// Since we are doing this on the same thread, we don't need to use these
Arc::new(AtomicU64::new(Default::default())),
Arc::new(Mutex::new(Default::default())),
Arc::new(AtomicBool::new(false)),
Arc::new(AtomicU8::new(0)),
Arc::new([
Mutex::new((Duration::new(0, 0), 0)),
Mutex::new((Duration::new(0, 0), 0)),
Mutex::new((Duration::new(0, 0), 0)),
Mutex::new((Duration::new(0, 0), 0)),
]),
// Specifics
remainder,
);
let thread_exit = Arc::new(AtomicBool::new(false));
// (handles,(counters,thread_bests))
let (handles, links): (Vec<_>, Vec<_>) = (0..search_cpus)
.map(|_| {
let ranges_clone = ranges_arc.clone();
let counter = Arc::new(AtomicU64::new(0));
let thread_best = Arc::new(Mutex::new(f64::MAX));
let thread_execution_position = Arc::new(AtomicU8::new(0));
let thread_execution_time = Arc::new([
Mutex::new((Duration::new(0, 0), 0)),
Mutex::new((Duration::new(0, 0), 0)),
Mutex::new((Duration::new(0, 0), 0)),
Mutex::new((Duration::new(0, 0), 0)),
]);
let counter_clone = counter.clone();
let thread_best_clone = thread_best.clone();
let thread_exit_clone = thread_exit.clone();
let evaluation_data_clone = evaluation_data.clone();
let thread_execution_position_clone = thread_execution_position.clone();
let thread_execution_time_clone = thread_execution_time.clone();
(
thread::spawn(move || {
search(
// Generics
ranges_clone,
f,
evaluation_data_clone,
counter_clone,
thread_best_clone,
thread_exit_clone,
thread_execution_position_clone,
thread_execution_time_clone,
// Specifics
per,
)
}),
(
counter,
(
thread_best,
(thread_execution_position, thread_execution_time),
),
),
)
})
.unzip();
let (counters, links): (Vec<Arc<AtomicU64>>, Vec<_>) = links.into_iter().unzip();
let (thread_bests, links): (Vec<Arc<Mutex<f64>>>, Vec<_>) = links.into_iter().unzip();
let (thread_execution_positions, thread_execution_times) = links.into_iter().unzip();
if let Some(poll_data) = polling {
poll(
poll_data,
counters,
remainder,
iterations,
thread_bests,
thread_exit,
thread_execution_positions,
thread_execution_times,
);
}
let joins: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();
let (_, best_params) = joins
.into_iter()
.fold((best_value, best_params), |(bv, bp), (v, p)| {
if v < bv {
(v, p)
} else {
(bv, bp)
}
});
return best_params;
fn search<
A: 'static + Send + Sync,
T: 'static + Copy + Send + Sync + Default + SampleUniform + PartialOrd,
const N: usize,
>(
// Generics
ranges: Arc<[Range<T>; N]>,
f: fn(&[T; N], Option<Arc<A>>) -> f64,
evaluation_data: Option<Arc<A>>,
counter: Arc<AtomicU64>,
best: Arc<Mutex<f64>>,
thread_exit: Arc<AtomicBool>,
thread_execution_position: Arc<AtomicU8>,
thread_execution_times: Arc<[Mutex<(Duration, u64)>; 4]>,
// Specifics
iterations: u64,
) -> (f64, [T; N]) {
let mut execution_position_timer = Instant::now();
let mut rng = thread_rng();
let mut params = [Default::default(); N];
let mut best_value = f64::MAX;
let mut best_params = [Default::default(); N];
for _ in 0..iterations {
// Gen random values
for (range, param) in ranges.iter().zip(params.iter_mut()) {
*param = rng.gen_range(range.clone());
}
// Update execution position
execution_position_timer = update_execution_position(
1,
execution_position_timer,
&thread_execution_position,
&thread_execution_times,
);
// Run function
let new_value = f(&params, evaluation_data.clone());
// Update execution position
execution_position_timer = update_execution_position(
2,
execution_position_timer,
&thread_execution_position,
&thread_execution_times,
);
// Check best
if new_value < best_value {
best_value = new_value;
best_params = params;
*best.lock().unwrap() = best_value;
}
// Update execution position
execution_position_timer = update_execution_position(
3,
execution_position_timer,
&thread_execution_position,
&thread_execution_times,
);
counter.fetch_add(1, Ordering::SeqCst);
// Update execution position
execution_position_timer = update_execution_position(
4,
execution_position_timer,
&thread_execution_position,
&thread_execution_times,
);
if thread_exit.load(Ordering::SeqCst) {
break;
}
}
// Update execution position
// 0 represents ended state
thread_execution_position.store(0, Ordering::SeqCst);
return (best_value, best_params);
}
}
util.rs:
pub fn update_execution_position<const N: usize>(
i: usize,
execution_position_timer: Instant,
thread_execution_position: &Arc<AtomicU8>,
thread_execution_times: &Arc<[Mutex<(Duration, u64)>; N]>,
) -> Instant {
{
let mut data = thread_execution_times[i - 1].lock().unwrap();
data.0 += execution_position_timer.elapsed();
data.1 += 1;
}
thread_execution_position.store(i as u8, Ordering::SeqCst);
Instant::now()
}
pub struct Polling {
pub poll_rate: u64,
pub printing: bool,
pub early_exit_minimum: Option<f64>,
pub thread_execution_reporting: bool,
}
impl Polling {
const DEFAULT_POLL_RATE: u64 = 10;
pub fn new(printing: bool, early_exit_minimum: Option<f64>) -> Self {
Self {
poll_rate: Polling::DEFAULT_POLL_RATE,
printing,
early_exit_minimum,
thread_execution_reporting: false,
}
}
}
pub fn poll<const N: usize>(
data: Polling,
// Current count of each thread.
counters: Vec<Arc<AtomicU64>>,
offset: u64,
// Final total iterations.
iterations: u64,
// Best values of each thread.
thread_bests: Vec<Arc<Mutex<f64>>>,
// Early exit switch.
thread_exit: Arc<AtomicBool>,
// Current positions of execution of each thread.
thread_execution_positions: Vec<Arc<AtomicU8>>,
// Current average times between execution positions for each thread
thread_execution_times: Vec<Arc<[Mutex<(Duration, u64)>; N]>>,
) {
let start = Instant::now();
let mut stdout = stdout();
let mut count = offset
+ counters
.iter()
.map(|c| c.load(Ordering::SeqCst))
.sum::<u64>();
if data.printing {
println!("{:20}", iterations);
}
let mut poll_time = Instant::now();
let mut held_best: f64 = f64::MAX;
let mut held_average_execution_times: [(Duration, u64); N] =
vec![(Duration::new(0, 0), 0); N].try_into().unwrap();
let mut held_recent_execution_times: [Duration; N] =
vec![Duration::new(0, 0); N].try_into().unwrap();
while count < iterations {
if data.printing {
// loop {
let percent = count as f32 / iterations as f32;
// If count == 0, give 00... for remaining time as placeholder
let remaining_time_estimate = if count == 0 {
Duration::new(0, 0)
} else {
start.elapsed().div_f32(percent)
};
print!(
"\r{:20} ({:.2}%) {} / {} [{}] {}\t",
count,
100. * percent,
print_duration(start.elapsed(), 0..3),
print_duration(remaining_time_estimate, 0..3),
if held_best == f64::MAX {
String::from("?")
} else {
format!("{}", held_best)
},
if data.thread_execution_reporting {
let (average_execution_times, recent_execution_times): (
Vec<String>,
Vec<String>,
) = (0..thread_execution_times[0].len())
.map(|i| {
let (mut sum, mut num) = (Duration::new(0, 0), 0);
for n in 0..thread_execution_times.len() {
{
let mut data = thread_execution_times[n][i].lock().unwrap();
sum += data.0;
held_average_execution_times[i].0 += data.0;
num += data.1;
held_average_execution_times[i].1 += data.1;
*data = (Duration::new(0, 0), 0);
}
}
if num > 0 {
held_recent_execution_times[i] = sum.div_f64(num as f64);
}
(
if held_average_execution_times[i].1 > 0 {
format!(
"{:.1?}",
held_average_execution_times[i]
.0
.div_f64(held_average_execution_times[i].1 as f64)
)
} else {
String::from("?")
},
if held_recent_execution_times[i] > Duration::new(0, 0) {
format!("{:.1?}", held_recent_execution_times[i])
} else {
String::from("?")
},
)
})
.unzip();
let execution_positions: Vec<u8> = thread_execution_positions
.iter()
.map(|pos| pos.load(Ordering::SeqCst))
.collect();
format!(
"{{ [{}] [{}] {:.?} }}",
recent_execution_times.join(", "),
average_execution_times.join(", "),
execution_positions
)
} else {
String::from("")
}
);
stdout.flush().unwrap();
}
// Updates best and does early exiting
match (data.early_exit_minimum, data.printing) {
(Some(early_exit), true) => {
for thread_best in thread_bests.iter() {
let thread_best_temp = *thread_best.lock().unwrap();
if thread_best_temp < held_best {
held_best = thread_best_temp;
if thread_best_temp <= early_exit {
thread_exit.store(true, Ordering::SeqCst);
println!();
return;
}
}
}
}
(None, true) => {
for thread_best in thread_bests.iter() {
let thread_best_temp = *thread_best.lock().unwrap();
if thread_best_temp < held_best {
held_best = thread_best_temp;
}
}
}
(Some(early_exit), false) => {
for thread_best in thread_bests.iter() {
if *thread_best.lock().unwrap() <= early_exit {
thread_exit.store(true, Ordering::SeqCst);
return;
}
}
}
(None, false) => {}
}
thread::sleep(saturating_sub(
Duration::from_millis(data.poll_rate),
poll_time.elapsed(),
));
poll_time = Instant::now();
count = offset
+ counters
.iter()
.map(|c| c.load(Ordering::SeqCst))
.sum::<u64>();
}
if data.printing {
println!(
"\r{:20} (100.00%) {} / {} [{}] {}\t",
count,
print_duration(start.elapsed(), 0..3),
print_duration(start.elapsed(), 0..3),
held_best,
if data.thread_execution_reporting {
let (average_execution_times, recent_execution_times): (Vec<String>, Vec<String>) =
(0..thread_execution_times[0].len())
.map(|i| {
let (mut sum, mut num) = (Duration::new(0, 0), 0);
for n in 0..thread_execution_times.len() {
{
let mut data = thread_execution_times[n][i].lock().unwrap();
sum += data.0;
held_average_execution_times[i].0 += data.0;
num += data.1;
held_average_execution_times[i].1 += data.1;
*data = (Duration::new(0, 0), 0);
}
}
if num > 0 {
held_recent_execution_times[i] = sum.div_f64(num as f64);
}
(
if held_average_execution_times[i].1 > 0 {
format!(
"{:.1?}",
held_average_execution_times[i]
.0
.div_f64(held_average_execution_times[i].1 as f64)
)
} else {
String::from("?")
},
if held_recent_execution_times[i] > Duration::new(0, 0) {
format!("{:.1?}", held_recent_execution_times[i])
} else {
String::from("?")
},
)
})
.unzip();
let execution_positions: Vec<u8> = thread_execution_positions
.iter()
.map(|pos| pos.load(Ordering::SeqCst))
.collect();
format!(
"{{ [{}] [{}] {:.?} }}",
recent_execution_times.join(", "),
average_execution_times.join(", "),
execution_positions
)
} else {
String::from("")
}
);
stdout.flush().unwrap();
}
}
// Since `Duration::saturating_sub` is unstable this is an alternative.
fn saturating_sub(a: Duration, b: Duration) -> Duration {
if let Some(dur) = a.checked_sub(b) {
dur
} else {
Duration::new(0, 0)
}
}
main.rs
use std::{cmp,sync::Arc};
type Image = Vec<Vec<Pixel>>;
#[derive(Clone)]
pub struct Pixel {
pub luma: u8,
}
impl From<&u8> for Pixel {
fn from(x: &u8) -> Pixel {
Pixel { luma: *x }
}
}
fn main() {
// Setup
// -------------------------------------------
fn open_image(path: &str) -> Image {
let example = image::open(path).unwrap().to_rgb8();
let dims = example.dimensions();
let size = (dims.0 as usize, dims.1 as usize);
let example_vec = example.into_raw();
// Binarizes image
let img_vec = from_raw(&example_vec, size);
img_vec
}
println!("Started ...");
let example: Image = open_image("example.jpg");
let target: Image = open_image("target.jpg");
// let first_image = Some(Arc::new((examples[0].clone(), targets[0].clone())));
println!("Opened...");
let image = Some(Arc::new((example, target)));
// Running the optimization
// -------------------------------------------
println!("Started opt...");
let best = simple_optimization::random_search(
[0..255, 0..255, 0..255, 1..255, 1..255],
eval_one,
image,
Some(simple_optimization::Polling {
poll_rate: 100,
printing: true,
early_exit_minimum: None,
thread_execution_reporting: true,
}),
2300,
);
println!("{:.?}", best); // [34, 220, 43, 253, 168]
assert!(false);
fn eval_one(arr: &[u8; 5], opt: Option<Arc<(Image, Image)>>) -> f64 {
let bin_params = (
arr[0] as u8,
arr[1] as u8,
arr[2] as u8,
arr[3] as usize,
arr[4] as usize,
);
let arc = opt.unwrap();
// Gets average mean-squared-error
let binary_pixels = binarize_buffer(arc.0.clone(), bin_params);
mse(binary_pixels, &arc.1)
}
// Mean-squared-error
fn mse(prediction: Image, target: &Image) -> f64 {
let n = target.len() * target[0].len();
prediction
.iter()
.flatten()
.zip(target.iter().flatten())
.map(|(p, t)| difference(p, t).powf(2.))
.sum::<f64>()
/ (2. * n as f64)
}
#[rustfmt::skip]
fn difference(p: &Pixel, t: &Pixel) -> f64 {
p.luma as f64 - t.luma as f64
}
}
pub fn from_raw(raw: &[u8], (_i_size, j_size): (usize, usize)) -> Vec<Vec<Pixel>> {
(0..raw.len())
.step_by(j_size)
.map(|index| {
raw[index..index + j_size]
.iter()
.map(Pixel::from)
.collect::<Vec<Pixel>>()
})
.collect()
}
pub fn binarize_buffer(
mut img: Vec<Vec<Pixel>>,
(_, _, local_luma_boundary, local_field_reach, local_field_size): (u8, u8, u8, usize, usize),
) -> Vec<Vec<Pixel>> {
let (i_size, j_size) = (img.len(), img[0].len());
let i_chunks = (i_size as f32 / local_field_size as f32).ceil() as usize;
let j_chunks = (j_size as f32 / local_field_size as f32).ceil() as usize;
let mut local_luma: Vec<Vec<u8>> = vec![vec![u8::default(); j_chunks]; i_chunks];
// Gets average luma in local fields
// O((s+r)^2*(n/s)*(m/s)) : s = local field size, r = local field reach
for (i_chunk, i) in (0..i_size).step_by(local_field_size).enumerate() {
let i_range = zero_checked_sub(i, local_field_reach)
..cmp::min(i + local_field_size + local_field_reach, i_size);
let i_range_length = i_range.end - i_range.start;
for (j_chunk, j) in (0..j_size).step_by(local_field_size).enumerate() {
let j_range = zero_checked_sub(j, local_field_reach)
..cmp::min(j + local_field_size + local_field_reach, j_size);
let j_range_length = j_range.end - j_range.start;
let total: u32 = i_range
.clone()
.map(|i_range_indx| {
img[i_range_indx][j_range.clone()]
.iter()
.map(|p| p.luma as u32)
.sum::<u32>()
})
.sum();
local_luma[i_chunk][j_chunk] = (total / (i_range_length * j_range_length) as u32) as u8;
}
}
// Apply binarization
// O(nm)
for i in 0..i_size {
let i_group: usize = i / local_field_size; // == floor(i as f32 / local_field_size as f32) as usize
for j in 0..j_size {
let j_group: usize = j / local_field_size;
// Local average boundaries
// --------------------------------
if let Some(local) = local_luma[i_group][j_group].checked_sub(local_luma_boundary) {
if img[i][j].luma < local {
img[i][j].luma = 0;
continue;
}
}
if let Some(local) = local_luma[i_group][j_group].checked_add(local_luma_boundary) {
if img[i][j].luma > local {
img[i][j].luma = 255;
continue;
}
}
// White is the negative (false/0) colour in our binarization, thus this is our else case
img[i][j].luma = 255;
}
}
img
}
#[rustfmt::skip]
fn zero_checked_sub(a: usize, b: usize) -> usize { if a > b { a - b } else { 0 } }
Project zip (in case you'd rather not spend time setting it up).
Otherwise, here are the images being used as /target.jpg and /example.jpg (it shouldn't matter that it's specifically these images; any should work):
And Cargo.toml dependencies:
[dependencies]
rand = "0.8.4"
itertools = "0.10.1" # izip!
num_cpus = "1.13.0" # Multi-threading
print_duration = "1.0.0" # Printing progress
num = "0.4.0" # Generics
rand_distr = "0.4.1" # Normal distribution
image = "0.23.14"
serde = { version="1.0.118", features=["derive"] }
serde_json = "1.0.50"
I do feel rather reluctant to post such a large question and inevitably require people to read a few hundred lines (especially given the project doesn't work in a playground), but I'm really lost here and can see no other way to communicate the whole area of the problem. Apologies for this.
As noted, I have tried for a while to figure out what is happening here, but I have come up short; any help would be really appreciated.
Some basic debugging (aka println! everywhere) shows that your performance problem is not related to the multithreading at all. It just happens randomly, and when there are 24 threads doing their job, the fact that one is randomly stalling is not noticeable, but when there is only one or two threads left, they stand out as slow.
But where is this performance bottleneck? Well, you state it yourself in the code: in binarize_buffer you say:
// Gets average luma in local fields
// O((s+r)^2*(n/s)*(m/s)) : s = local field size, r = local field reach
The values of s and r seem to be random values between 0 and 255, while n is the length of an image row in bytes, 3984 * 3 = 11952, and m is the number of rows, 2271.
Now, most of the time that O() is around a few million, which is quite manageable. But if s happens to be small and r big, such as (3, 200), then the number of computations blows up to over 1e11!
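A rough check of those numbers, using the formula from that comment with the sizes quoted above (n = 11952, m = 2271); cost is just a throwaway helper for the estimate:
// Back-of-the-envelope model of the comment's O((s+r)^2 * (n/s) * (m/s)).
fn cost(s: u64, r: u64) -> u64 {
    let (n, m) = (11952u64, 2271u64);
    (s + r).pow(2) * (n / s) * (m / s)
}
// cost(200, 50) is roughly 4.1e7: manageable.
// cost(3, 200) is roughly 1.2e11: the pathological case described above.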
Fortunately I think you can define the ranges of those values in the original call to random_search so a bit of tweaking there should send you back to reasonable complexity. Changing the ranges to:
[0..255, 0..255, 0..255, 1..255, 20..255],
// ^ here
seems to do the trick for me.
PS: These lines at the beginning of binarize_buffer were key to discovering this:
let o = (i_size / local_field_size) * (j_size / local_field_size) * (local_field_size + local_field_reach).pow(2);
println!("\nO() = {}", o);

How to use `waitpid` to wait for a process in Rust?

I am trying to implement a merge sort using processes but I have a problem using the waitpid function:
extern crate nix;
extern crate rand;
use nix::sys::wait::WaitStatus;
use rand::Rng;
use std::io;
use std::process::exit;
use std::thread::sleep;
use std::time::{Duration, Instant};
use nix::sys::wait::waitpid;
use nix::unistd::Pid;
use nix::unistd::{fork, getpid, getppid, ForkResult};
static mut process_count: i32 = 0;
static mut thread_count: i32 = 0;
fn generate_array(len: usize) -> Vec<f64> {
let mut my_vector: Vec<f64> = Vec::new();
for _ in 0..len {
my_vector.push(rand::thread_rng().gen_range(0.0, 100.0)); // 0 - 99.99999
}
return my_vector;
}
fn get_array_size_from_user() -> usize {
let mut n = String::new();
io::stdin()
.read_line(&mut n)
.expect("failed to read input.");
let n: usize = n.trim().parse().expect("invalid input");
return n;
}
fn display_array(array: &mut Vec<f64>) {
println!("{:?}", array);
println!();
}
fn clear_screen() {
print!("{}[2J", 27 as char);
//print!("\x1B[2J"); // 2nd option
}
pub fn mergeSort(a: &mut Vec<f64>, low: usize, high: usize) {
let middle = (low + high) / 2;
let mut len = (high - low + 1);
if (len <= 1) {
return;
}
let lpid = fork();
match lpid {
Ok(ForkResult::Child) => {
println!("Left Process Running ");
mergeSort(a, low, middle);
exit(0);
}
Ok(ForkResult::Parent { child }) => {
let rpid = fork();
match rpid {
Ok(ForkResult::Child) => {
println!("Right Process Running ");
mergeSort(a, middle + 1, high);
exit(0);
}
Ok(ForkResult::Parent { child }) => {}
Err(err) => {
panic!("Right process not created: {}", err);
}
};
}
Err(err) => {
panic!("Left process not created {}", err);
}
};
//waitpid(lpid, None);
//waitpid(rpid, None);
// Merge the sorted subarrays
merge(a, low, middle, high);
}
fn merge(a: &mut Vec<f64>, low: usize, m: usize, high: usize) {
println!("x");
let mut left = a[low..m + 1].to_vec();
let mut right = a[m + 1..high + 1].to_vec();
println!("left: {:?}", left);
println!("right: {:?}", right);
left.reverse();
right.reverse();
for k in low..high + 1 {
if left.is_empty() {
a[k] = right.pop().unwrap();
continue;
}
if right.is_empty() {
a[k] = left.pop().unwrap();
continue;
}
if right.last() < left.last() {
a[k] = right.pop().unwrap();
} else {
a[k] = left.pop().unwrap();
}
}
println!("array: {:?}", a);
}
unsafe fn display_process_thread_counts() {
unsafe {
println!("process count:");
println!("{}", process_count);
println!("thread count:");
println!("{}", thread_count);
}
}
unsafe fn process_count_plus_plus() {
process_count += 1;
}
unsafe fn thread_count_plus_plus() {
thread_count += 1;
}
fn print_time(start: Instant, end: Instant) {
println!("TIME:");
println!("{:?}", end.checked_duration_since(start));
}
fn main() {
println!("ENTER SIZE OF ARRAY \n");
let array_size = get_array_size_from_user();
let mut generated_array = generate_array(array_size);
clear_screen();
println!("GENERATED ARRAY: \n");
display_array(&mut generated_array);
// SORTING
let start = Instant::now();
mergeSort(&mut generated_array, 0, array_size - 1);
let end = Instant::now();
// RESULT
//unsafe{
// process_count_plus_plus();
// thread_count_plus_plus();
//}
println!("SORTED ARRAY: \n");
display_array(&mut generated_array);
print_time(start, end);
unsafe {
display_process_thread_counts();
}
}
I get these results without using waitpid for the vector [3, 70, 97, 74]:
array before comparison: [3, 70, 97, 74]
comparison: [97], [74]
array after comparison: [3, 70, 74, 97]
array before comparison: [3, 70, 97, 74]
comparison: [3], [70]
array after comparison: [3, 70, 97, 74]
array before comparison: [3, 70, 97, 74]
comparison: [3, 70], [97, 74] (should be [74, 97])
array after comparison: [3, 70, 97, 74]
This has nothing to do with waitpid and everything to do with fork. When you fork a process, the OS creates a copy of your data and the child operates on this copy [1]. When the child exits, its memory is discarded. The parent never sees the changes made by the child.
If you need the parent to see the changes made by the child, you should do one of the following:
Easiest and fastest is to use threads instead of processes. Threads share memory, so the parent and children all use the same memory. In Rust, the borrow checker ensures that parent and children behave correctly when accessing the same piece of memory (a sketch of this approach follows after this list).
Use mmap or something equivalent to share memory between the parent and children processes. Note however that it will be very difficult to ensure memory safety while the processes all try to access the same memory concurrently.
Use some kind of Inter-Process Communication (IPC) mechanism to send the result back from the children to the parent. This is easier than mmap since there is no risk of collision during memory accesses but in your case, given the amount of data that will need to be sent, this will be the slowest.
[1] Actually, it uses copy-on-write, so data that is simply read is shared, but anything that either the parent or the child writes will be copied, and the other will not see the result of the write.
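As a hedged sketch of the first option (threads), not a drop-in replacement for the code above: with std::thread::scope (Rust 1.63+), both halves of the slice can be sorted by child threads that borrow disjoint parts of the same buffer via split_at_mut, so the parent sees their writes, which is exactly what fork cannot provide here. merge_sort_threads is an illustrative name, and real code would cap the recursion depth rather than spawn two threads per level.
use std::thread;

// Illustrative only: each half is sorted on its own scoped thread, and the
// scope joins both children before the merge (playing the role of waitpid).
fn merge_sort_threads(a: &mut [f64]) {
    if a.len() <= 1 {
        return;
    }
    let mid = a.len() / 2;
    let (left, right) = a.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(move || merge_sort_threads(left));
        s.spawn(move || merge_sort_threads(right));
    });
    // Merge the two sorted halves back into `a` through a temporary buffer.
    let mut merged = Vec::with_capacity(a.len());
    let (mut i, mut j) = (0, mid);
    while i < mid && j < a.len() {
        if a[i] <= a[j] {
            merged.push(a[i]);
            i += 1;
        } else {
            merged.push(a[j]);
            j += 1;
        }
    }
    merged.extend_from_slice(&a[i..mid]);
    merged.extend_from_slice(&a[j..]);
    a.copy_from_slice(&merged);
}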

How to give each CPU core mutable access to a portion of a Vec? [duplicate]

This question already has an answer here:
How do I pass disjoint slices from a vector to different threads?
(1 answer)
Closed 4 years ago.
I've got an embarrassingly parallel bit of graphics rendering code that I would like to run across my CPU cores. I've coded up a test case (the function computed is nonsense) to explore how I might parallelize it. I'd like to code this using std Rust in order to learn about using std::thread. But, I don't understand how to give each thread a portion of the framebuffer. I'll put the full testcase code below, but I'll try to break it down first.
The sequential form is super simple:
let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
for j in 0..HEIGHT {
for i in 0..WIDTH {
buffer0[j][i] = compute(i as i32,j as i32);
}
}
I thought that it would help to make a buffer that was the same size, but re-arranged to be 3D & indexed by core first. This is the same computation, just a reordering of the data to show the workings.
let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
for c in 0..num_logical_cores {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
buffer1[c][y][i] = compute(i as i32,j as i32)
}
}
}
But, when I try to put the inner part of the code in a closure & create a thread, I get errors about the buffer & lifetimes. I basically don't understand what to do & could use some guidance. I want per_core_buffer to just temporarily refer to the data in buffer2 that belongs to that core & allow it to be written, synchronize all the threads & then read buffer2 afterwards. Is this possible?
let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let mut handles = Vec::new();
for c in 0..num_logical_cores {
let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
let handle = thread::spawn(move || {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
per_core_buffer[y][i] = compute(i as i32,j as i32)
}
}
});
handles.push(handle)
}
for handle in handles {
handle.join().unwrap();
}
The error is this & I don't understand:
error[E0597]: `buffer2` does not live long enough
--> src/main.rs:50:36
|
50 | let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
| ^^^^^^^ borrowed value does not live long enough
...
88 | }
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
The full testcase is:
extern crate num_cpus;
use std::time::Instant;
use std::thread;
fn compute(x: i32, y: i32) -> i32 {
(x*y) % (x+y+10000)
}
fn main() {
let num_logical_cores = num_cpus::get();
const WIDTH: usize = 40000;
const HEIGHT: usize = 10000;
let y_per_core = HEIGHT/num_logical_cores + 1;
// ------------------------------------------------------------
// Serial Calculation...
let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
let start0 = Instant::now();
for j in 0..HEIGHT {
for i in 0..WIDTH {
buffer0[j][i] = compute(i as i32,j as i32);
}
}
let dur0 = start0.elapsed();
// ------------------------------------------------------------
// On the way to Parallel Calculation...
// Reorder the data buffer to be 3D with one 2D region per core.
let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let start1 = Instant::now();
for c in 0..num_logical_cores {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
buffer1[c][y][i] = compute(i as i32,j as i32)
}
}
}
let dur1 = start1.elapsed();
// ------------------------------------------------------------
// Actual Parallel Calculation...
let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let mut handles = Vec::new();
let start2 = Instant::now();
for c in 0..num_logical_cores {
let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
let handle = thread::spawn(move || {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
per_core_buffer[y][i] = compute(i as i32,j as i32)
}
}
});
handles.push(handle)
}
for handle in handles {
handle.join().unwrap();
}
let dur2 = start2.elapsed();
println!("Runtime: Serial={0:.3}ms, AlmostParallel={1:.3}ms, Parallel={2:.3}ms",
1000.*dur0.as_secs() as f64 + 1e-6*(dur0.subsec_nanos() as f64),
1000.*dur1.as_secs() as f64 + 1e-6*(dur1.subsec_nanos() as f64),
1000.*dur2.as_secs() as f64 + 1e-6*(dur2.subsec_nanos() as f64));
// Sanity check
for j in 0..HEIGHT {
let c = j % num_logical_cores;
let y = j / num_logical_cores;
for i in 0..WIDTH {
if buffer0[j][i] != buffer1[c][y][i] {
println!("wtf1? {0} {1} {2} {3}",i,j,buffer0[j][i],buffer1[c][y][i])
}
if buffer0[j][i] != buffer2[c][y][i] {
println!("wtf2? {0} {1} {2} {3}",i,j,buffer0[j][i],buffer2[c][y][i])
}
}
}
}
Thanks to @Shepmaster for the pointers and the clarification that this is not an easy problem for Rust, and that I needed to consider crates to find a reasonable solution. I'm only just starting out in Rust, so this really wasn't clear to me.
I liked the ability to control the number of threads that scoped_threadpool gives, so I went with that. Translating my code from above directly, I tried to use the 4D buffer with core as the most-significant index, and that ran into trouble because that 3D vector does not implement the Copy trait; the copying this would imply also made me concerned about performance. So I went back to the original problem, implemented it more directly, and found a reasonable speedup by making each row a thread. Copying each row will not be a large memory overhead.
The code that works for me is:
let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
let mut pool = Pool::new(num_logical_cores as u32);
pool.scoped(|scope| {
let mut y = 0;
for e in &mut buffer2 {
scope.execute(move || {
for x in 0..WIDTH {
(*e)[x] = compute(x as i32,y as i32);
}
});
y += 1;
}
});
On a 6-core, 12-thread i7-8700K, for a 400000x4000 test case this runs in 3.2 seconds serially and 481ms in parallel--a reasonable speedup.
EDIT: I continued to think about this issue and got a suggestion from Rustlang on Twitter that I should consider rayon. I converted my code to rayon and got a similar speedup with the following code:
let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
buffer2
.par_iter_mut()
.enumerate()
.map(|(y,e): (usize, &mut Vec<i32>)| {
for x in 0..WIDTH {
(*e)[x] = compute(x as i32,y as i32);
}
})
.collect::<Vec<_>>();
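As a small side note on the rayon version above: map followed by collect builds a Vec of unit values just to drive the iterator. A hedged alternative sketch using for_each from rayon's parallel iterator traits (reusing the compute function and WIDTH constant from this post) expresses the side-effecting fill directly:
use rayon::prelude::*;

// Same parallel row fill as above, but without collecting a Vec<()>.
buffer2
    .par_iter_mut()
    .enumerate()
    .for_each(|(y, row): (usize, &mut Vec<i32>)| {
        for x in 0..WIDTH {
            row[x] = compute(x as i32, y as i32);
        }
    });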
