How can I move the data between threads safely? - multithreading

I'm currently trying to call a function to which I pass multiple file names and expect the function to read the files and generate the appropriate structs and return them in a Vec<Audit>. I've been able to accomplish it reading the files one by one but I want to achieve it using threads.
This is the function:
fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
let mut audits = Arc::new(Mutex::new(vec![]));
let mut handlers = vec![];
for file in files {
let audits = Arc::clone(&audits);
handlers.push(thread::spawn(move || {
let mut audits = audits.lock().unwrap();
audits.push(audit_from_xml_file(file.clone()));
audits
}));
}
for handle in handlers {
let _ = handle.join();
}
audits
.lock()
.unwrap()
.into_iter()
.fold(vec![], |mut result, audit| {
result.push(audit);
result
})
}
But it won't compile due to the following error:
error[E0277]: `MutexGuard<'_, Vec<Audit>>` cannot be sent between threads safely
--> src/main.rs:82:23
|
82 | handlers.push(thread::spawn(move || {
| ^^^^^^^^^^^^^ `MutexGuard<'_, Vec<Audit>>` cannot be sent between threads safely
|
::: /home/enthys/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:618:8
I have tried wrapping the generated Audit structs in Some(Audit) to avoid the MutexGuard but then I stumble with Poisonned Thread issues.

The cause of the error is that after after pushing the new Audit into the (locked) audits vec you then try to return the vec's MutexGuard.
In Rust, a thread's function can actually return values, the point of doing that is to send the value back to whoever is join-ing the thread. This means the value is going to move between threads, so the value needs to be movable betweem threads (aka Send), which mutex guards have no reason to be[0].
The easy solution is to just... not do that. Just delete the last line of the spawn function. Though it's not like the code works after that as you still have borrowing issue related to the thing at the end.
An alternative is to lean into the feature (especially if Audit objects are not too big): drop the audits vec entirely and instead have each thread return its audit, then collect from the handlers when you join them:
pub fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
let mut handlers = vec![];
for file in files {
handlers.push(thread::spawn(move || {
audit_from_xml_file(file)
}));
}
handlers.into_iter()
.map(|handler| handler.join().unwrap())
.collect()
}
Though at that point you might as well just let Rayon handle it:
use rayon::prelude::*;
pub fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
files.into_par_iter().map(audit_from_xml_file).collect()
}
That also avoids crashing the program or bringing the machine to its knees if you happen to have millions of files.
[0] and all the reasons not to be, locking on one thread and unlocking on an other is not necessarily supported e.g. ReleaseMutex
The ReleaseMutex function fails if the calling thread does not own the mutex object.
(NB: in the windows lingo, "owning" a mutex means having acquired it via WaitForSingleObject, which translates to lock in posix lingo)
and can be plain UB e.g. pthread_mutex_unlock
If a thread attempts to unlock a mutex that it has not locked or a mutex which is unlocked, undefined behavior results.

Your problem is that you are passing your Vec<Audit> (or more precisely the MutexGuard<Vec<Audit>>), to the threads and back again, without really needing it.
And you don't need Mutex or Arc for this simpler task:
fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
let mut handlers = vec![];
for file in files {
handlers.push(thread::spawn(move || {
audit_from_xml_file(file)
}));
}
handlers
.into_iter()
.flat_map(|x| x.join())
.collect()
}

Related

Rust - How to pass function parameters to closure

I'm trying to write a function that takes two parameters. The function starts two threads and uses one of the parameters inside one of the thread closures. This doesn't work because of the error "Borrowed data escapes outside of closure". Here's the code.
pub fn measure_stats(testdatapath: &PathBuf, filenameprefix: &String) {
let (tx, rx) = mpsc::channel();
let filename = format!("test.txt")
let measure_thread = thread::spawn(move || {
let stats = sar();
fs::write(filename, stats).expect("failed to write output to file");
// Send a signal that we're done.
let _ = tx.send(());
});
thread::spawn(move || {
let mut n = 0;
loop {
// Break if the measure thread is done.
match rx.try_recv() {
Ok(_) | Err(TryRecvError::Disconnected) => break,
Err(TryRecvError::Empty) => {}
}
let filename = format!("{:04}.img", n);
let filepath = Path::new(testdatapath).join(&filename);
random_file_write(&filepath).unwrap();
random_file_read(&filepath).unwrap();
fs::remove_file(&filepath).expect("failed to remove file");
n += 1;
}
});
measure_thread.join().expect("joining measure thread panicked");
}
The problem is that testdatapath escapes the function body. I think this is a problem because the lifetime of testdatapath is only guaranteed until the end of the closure, but it needs to be the lifetime of the entire program. But it's a little confusing to me.
I've tried cloning the variable, but that didn't help. I'm not sure how I'm supposed to do this. How do I use a function parameter inside the closure or accomplish the same goal some other more canonical way?
If it's okay for the function not to return until both threads complete, then use std::thread::scope() to create scoped threads instead of std::thread::spawn(). Scoped threads allow borrowing data whereas regular spawning cannot, but require the threads to all terminate before the scope ends and the function that created them returns.
If this has to be a “background” task, then you need to make sure that all the data used by each thread is owned, i.e. not a reference. In this case, that means you should change the parameters to be owned:
pub fn measure_stats(testdatapath: PathBuf, filenameprefix: String) {
Then, those values will be moved into the receiving thread, without any lifetime constraints.
You're trying to make testdata live longer than the function, since this is a value you're borrowing and since you can't guarantee that the original PathBuff will outlive closure running in the new thread the compiler is warning you that you're assuming that this would be the case, but not taking any precautions to do so.
The 3 simpler choices:
Move the PathBuff to the function instead of borrowing it (remove the &).
Use an Arc
clone it and move the clone into the thread.

How to loop over thread handles and join if finished, within another loop?

I have a program that creates threads in a loop, and also checks if they have finished and cleans them up if they have. See below for a minimal example:
use std::thread;
fn main() {
let mut v = Vec::<std::thread::JoinHandle<()>>::new();
for _ in 0..10 {
let jh = thread::spawn(|| {
thread::sleep(std::time::Duration::from_secs(1));
});
v.push(jh);
for jh in v.iter_mut() {
if jh.is_finished() {
jh.join().unwrap();
}
}
}
}
This gives the error:
error[E0507]: cannot move out of `*jh` which is behind a mutable reference
--> src\main.rs:13:17
|
13 | jh.join().unwrap();
| ^^^------
| | |
| | `*jh` moved due to this method call
| move occurs because `*jh` has type `JoinHandle<()>`, which does not implement the `Copy` trait
|
note: this function takes ownership of the receiver `self`, which moves `*jh`
--> D:\rust\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\library\std\src\thread\mod.rs:1461:17
|
1461 | pub fn join(self) -> Result<T> {
How can I get the borrow checker to allow this?
JoinHandle::join actually consumes the JoinHandle.
iter_mut(), however, only borrows the elements of the vector and keeps the vector alive. Therefore your JoinHandles are only borrowed, and you cannot call consuming methods on borrowed objects.
What you need to do is to take the ownership of the elements while iterating over the vector, so they can be then consumed by join(). This is achieved by using into_iter() instead of iter_mut().
The second mistake is that you (probably accidentally) wrote the two for loops inside of each other, while they should be independent loops.
The third problem is a little more complex. You cannot check if a thread has finished and then join it the way you did. Therefore I removed the is_finished() check for now and will talk about this further down again.
Here is your fixed code:
use std::thread;
fn main() {
let mut v = Vec::<std::thread::JoinHandle<()>>::new();
for _ in 0..10 {
let jh = thread::spawn(|| {
thread::sleep(std::time::Duration::from_secs(1));
});
v.push(jh);
}
for jh in v.into_iter() {
jh.join().unwrap();
}
}
Reacting to finished threads
This one is harder. If you just want to wait until all of them are finished, the code above is the way to go.
However, if you have to react to finished threads right away, you basically have to set up some kind of event propagation. You don't want to loop over all threads over and over again until they are all finished, because that is something called idle-waiting and consumes a lot of computational power.
So if you want to achieve that there are two problems that have to be dealt with:
join() consumes the JoinHandle(), which would leave behind an incomplete Vec of JoinHandles. This isn't possible, so we need to wrap JoinHandle in a type that can actually be ripped out of the vector partially, like Option.
we need a way to signal to the main thread that a new child thread is finished, so that the main thread doesn't have to continuously iterate over the threads.
All in all this is very complex and tricky to implement.
Here is my attempt:
use std::{
thread::{self, JoinHandle},
time::Duration,
};
fn main() {
let mut v: Vec<Option<JoinHandle<()>>> = Vec::new();
let (send_finished_thread, receive_finished_thread) = std::sync::mpsc::channel();
for i in 0..10 {
let send_finished_thread = send_finished_thread.clone();
let join_handle = thread::spawn(move || {
println!("Thread {} started.", i);
thread::sleep(Duration::from_millis(2000 - i as u64 * 100));
println!("Thread {} finished.", i);
// Signal that we are finished.
// This will wake up the main thread.
send_finished_thread.send(i).unwrap();
});
v.push(Some(join_handle));
}
loop {
// Check if all threads are finished
let num_left = v.iter().filter(|th| th.is_some()).count();
if num_left == 0 {
break;
}
// Wait until a thread is finished, then join it
let i = receive_finished_thread.recv().unwrap();
let join_handle = std::mem::take(&mut v[i]).unwrap();
println!("Joining {} ...", i);
join_handle.join().unwrap();
println!("{} joined.", i);
}
println!("All joined.");
}
Important
This code is just a demonstration. It will deadlock if one of the threads panic. But this shows how complicated that problem is.
It could be solved by utilizing a drop guard, but I think this answer is convoluted enough ;)

Waiting on multiple futures borrowing mutable self

Each of the following methods need (&mut self) to operate. The following code gives the error.
cannot borrow *self as mutable more than once at a time
How can I achieve this correctly?
loop {
let future1 = self.handle_new_connections(sender_to_connector.clone());
let future2 = self.handle_incoming_message(&mut receiver_from_peers);
let future3 = self.handle_outgoing_message();
tokio::pin!(future1, future2, future3);
tokio::select! {
_=future1=>{},
_=future2=>{},
_=future3=>{}
}
}
You are not allowed to have multiple mutable references to an object and there's a good reason for that.
Imagine you pass an object mutably to 2 different functions and they edited the object out of sync since you don't have any mechanism for that in place. then you'd end up with something called a race condition.
To prevent this bug rust allows only one mutable reference to an object at a time but you can have multiple immutable references and often you see people use internal mutability patterns.
In your case, you want data not to be able to be modified by 2 different threads at the same time so you'd wrap it in a Lock or RwLock then since you want multiple threads to be able to own this value you'd wrap that in an Arc.
here you can read about interior mutability in more detail.
Alternatively, while declaring the type of your function you could add proper lifetimes to indicate the resulting Future will be waited on in the same context by giving it a lifetime since your code waits for the future before the next iteration that would do the trick as well.
I encountered the same problem when dealing with async code. Here is what I figured out:
Let's say you have an Engine, that contains both incoming and outgoing:
struct Engine {
log: Arc<Mutex<Vec<String>>>,
outgoing: UnboundedSender<String>,
incoming: UnboundedReceiver<String>,
}
Our goal is to create two functions process_incoming and process_logic and then poll them simultaneously without messing up with the borrow checker in Rust.
What is important here is that:
You cannot pass &mut self to these async functions simultaneously.
Either incoming or outgoing will be only held by one function at most.
The data access by both process_incoming and process_logic need to be wrapped by a lock.
Any trying to lock Engine directly will lead to a deadlock at runtime.
So that leaves us giving up using the method in favor of the associated function:
impl Engine {
// ...
async fn process_logic(outgoing: &mut UnboundedSender<String>, log: Arc<Mutex<Vec<String>>>) {
loop {
Delay::new(Duration::from_millis(1000)).await.unwrap();
let msg: String = "ping".into();
println!("outgoing: {}", msg);
log.lock().push(msg.clone());
outgoing.send(msg).await.unwrap();
}
}
async fn process_incoming(
incoming: &mut UnboundedReceiver<String>,
log: Arc<Mutex<Vec<String>>>,
) {
while let Some(msg) = incoming.next().await {
println!("incoming: {}", msg);
log.lock().push(msg);
}
}
}
And we can then write main as:
fn main() {
futures::executor::block_on(async {
let mut engine = Engine::new();
let a = Engine::process_incoming(&mut engine.incoming, engine.log.clone()).fuse();
let b = Engine::process_logic(&mut engine.outgoing, engine.log).fuse();
futures::pin_mut!(a, b);
select! {
_ = a => {},
_ = b => {},
}
});
}
I put the whole example here.
It's a workable solution, only be aware that you should add futures and futures-timer in your dependencies.

Is it possible to share a HashMap between threads without locking the entire HashMap?

I would like to have a shared struct between threads. The struct has many fields that are never modified and a HashMap, which is. I don't want to lock the whole HashMap for a single update/remove, so my HashMap looks something like HashMap<u8, Mutex<u8>>. This works, but it makes no sense since the thread will lock the whole map anyways.
Here's this working version, without threads; I don't think that's necessary for the example.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
fn main() {
let s = Arc::new(Mutex::new(S::new()));
let z = s.clone();
let _ = z.lock().unwrap();
}
struct S {
x: HashMap<u8, Mutex<u8>>, // other non-mutable fields
}
impl S {
pub fn new() -> S {
S {
x: HashMap::default(),
}
}
}
Playground
Is this possible in any way? Is there something obvious I missed in the documentation?
I've been trying to get this working, but I'm not sure how. Basically every example I see there's always a Mutex (or RwLock, or something like that) guarding the inner value.
I don't see how your request is possible, at least not without some exceedingly clever lock-free data structures; what should happen if multiple threads need to insert new values that hash to the same location?
In previous work, I've used a RwLock<HashMap<K, Mutex<V>>>. When inserting a value into the hash, you get an exclusive lock for a short period. The rest of the time, you can have multiple threads with reader locks to the HashMap and thus to a given element. If they need to mutate the data, they can get exclusive access to the Mutex.
Here's an example:
use std::{
collections::HashMap,
sync::{Arc, Mutex, RwLock},
thread,
time::Duration,
};
fn main() {
let data = Arc::new(RwLock::new(HashMap::new()));
let threads: Vec<_> = (0..10)
.map(|i| {
let data = Arc::clone(&data);
thread::spawn(move || worker_thread(i, data))
})
.collect();
for t in threads {
t.join().expect("Thread panicked");
}
println!("{:?}", data);
}
fn worker_thread(id: u8, data: Arc<RwLock<HashMap<u8, Mutex<i32>>>>) {
loop {
// Assume that the element already exists
let map = data.read().expect("RwLock poisoned");
if let Some(element) = map.get(&id) {
let mut element = element.lock().expect("Mutex poisoned");
// Perform our normal work updating a specific element.
// The entire HashMap only has a read lock, which
// means that other threads can access it.
*element += 1;
thread::sleep(Duration::from_secs(1));
return;
}
// If we got this far, the element doesn't exist
// Get rid of our read lock and switch to a write lock
// You want to minimize the time we hold the writer lock
drop(map);
let mut map = data.write().expect("RwLock poisoned");
// We use HashMap::entry to handle the case where another thread
// inserted the same key while where were unlocked.
thread::sleep(Duration::from_millis(50));
map.entry(id).or_insert_with(|| Mutex::new(0));
// Let the loop start us over to try again
}
}
This takes about 2.7 seconds to run on my machine, even though it starts 10 threads that each wait for 1 second while holding the exclusive lock to the element's data.
This solution isn't without issues, however. When there's a huge amount of contention for that one master lock, getting a write lock can take a while and completely kills parallelism.
In that case, you can switch to a RwLock<HashMap<K, Arc<Mutex<V>>>>. Once you have a read or write lock, you can then clone the Arc of the value, returning it and unlocking the hashmap.
The next step up would be to use a crate like arc-swap, which says:
Then one would lock, clone the [RwLock<Arc<T>>] and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.
The ArcSwap can be used instead, which solves the above problems and has better performance characteristics than the RwLock, both in contended and non-contended scenarios.
I often advocate for performing some kind of smarter algorithm. For example, you could spin up N threads each with their own HashMap. You then shard work among them. For the simple example above, you could use id % N_THREADS, for example. There are also complicated sharding schemes that depend on your data.
As Go has done a good job of evangelizing: do not communicate by sharing memory; instead, share memory by communicating.
Suppose the key of the data is map-able to a u8
You can have Arc<HashMap<u8,Mutex<HashMap<Key,Value>>>
When you initialize the data structure you populate all the first level map before putting it in Arc (it will be immutable after initialization)
When you want a value from the map you will need to do a double get, something like:
data.get(&map_to_u8(&key)).unwrap().lock().expect("poison").get(&key)
where the unwrap is safe because we initialized the first map with all the value.
to write in the map something like:
data.get(&map_to_u8(id)).unwrap().lock().expect("poison").entry(id).or_insert_with(|| value);
It's easy to see contention is reduced because we now have 256 Mutex and the probability of multiple threads asking the same Mutex is low.
#Shepmaster example with 100 threads takes about 10s on my machine, the following example takes a little more than 1 second.
use std::{
collections::HashMap,
sync::{Arc, Mutex, RwLock},
thread,
time::Duration,
};
fn main() {
let mut inner = HashMap::new( );
for i in 0..=u8::max_value() {
inner.insert(i, Mutex::new(HashMap::new()));
}
let data = Arc::new(inner);
let threads: Vec<_> = (0..100)
.map(|i| {
let data = Arc::clone(&data);
thread::spawn(move || worker_thread(i, data))
})
.collect();
for t in threads {
t.join().expect("Thread panicked");
}
println!("{:?}", data);
}
fn worker_thread(id: u8, data: Arc<HashMap<u8,Mutex<HashMap<u8,Mutex<i32>>>>> ) {
loop {
// first unwrap is safe to unwrap because we populated for every `u8`
if let Some(element) = data.get(&id).unwrap().lock().expect("poison").get(&id) {
let mut element = element.lock().expect("Mutex poisoned");
// Perform our normal work updating a specific element.
// The entire HashMap only has a read lock, which
// means that other threads can access it.
*element += 1;
thread::sleep(Duration::from_secs(1));
return;
}
// If we got this far, the element doesn't exist
// Get rid of our read lock and switch to a write lock
// You want to minimize the time we hold the writer lock
// We use HashMap::entry to handle the case where another thread
// inserted the same key while where were unlocked.
thread::sleep(Duration::from_millis(50));
data.get(&id).unwrap().lock().expect("poison").entry(id).or_insert_with(|| Mutex::new(0));
// Let the loop start us over to try again
}
}
Maybe you want to consider evmap:
A lock-free, eventually consistent, concurrent multi-value map.
The trade-off is eventual-consistency: Readers do not see changes until the writer refreshes the map. A refresh is atomic and the writer decides when to do it and expose new data to the readers.

One mutable borrow and multiple immutable borrows

I'm trying to write a program that spawns a background thread that continuously inserts data into some collection. At the same time, I want to keep getting input from stdin and check if that input is in the collection the thread is operating on.
Here is a boiled down example:
use std::collections::HashSet;
use std::thread;
fn main() {
let mut set: HashSet<String> = HashSet::new();
thread::spawn(move || {
loop {
set.insert("foo".to_string());
}
});
loop {
let input: String = get_input_from_stdin();
if set.contains(&input) {
// Do something...
}
}
}
fn get_input_from_stdin() -> String {
String::new()
}
However this doesn't work because of ownership stuff.
I'm still new to Rust but this seems like something that should be possible. I just can't find the right combination of Arcs, Rcs, Mutexes, etc. to wrap my data in.
First of all, please read Need holistic explanation about Rust's cell and reference counted types.
There are two problems to solve here:
Sharing ownership between threads,
Mutable aliasing.
To share ownership, the simplest solution is Arc. It requires its argument to be Sync (accessible safely from multiple threads) which can be achieved for any Send type by wrapping it inside a Mutex or RwLock.
To safely get aliasing in the presence of mutability, both Mutex and RwLock will work. If you had multiple readers, RwLock might have an extra performance edge. Since you have a single reader there's no point: let's use the simple Mutex.
And therefore, your type is: Arc<Mutex<HashSet<String>>>.
The next trick is passing the value to the closure to run in another thread. The value is moved, and therefore you need to first make a clone of the Arc and then pass the clone, otherwise you've moved your original and cannot access it any longer.
Finally, accessing the data requires going through the borrows and locks...
use std::sync::{Arc, Mutex};
fn main() {
let set = Arc::new(Mutex::new(HashSet::new()));
let clone = set.clone();
thread::spawn(move || {
loop {
clone.lock().unwrap().insert("foo".to_string());
}
});
loop {
let input: String = get_input_from_stdin();
if set.lock().unwrap().contains(&input) {
// Do something...
}
}
}
The call to unwrap is there because Mutex::lock returns a Result; it may be impossible to lock the Mutex if it is poisoned, which means a panic occurred while it was locked and therefore its content is possibly garbage.

Resources