Per-thread initialization in Rayon

Per-thread initialization in Rayon - multithreading

I am trying to optimize my function using Rayon's par_iter().
The single threaded version is something like:
fn verify_and_store(store: &mut Store, txs: Vec<Tx>) {
let result = txs.iter().map(|tx| {
tx.verify_and_store(store)
}).collect();
...
}
Each Store instance must be used only by one thread, but multiple instances of Store can be used concurrently, so I can make this multithreaded by clone-ing store:
fn verify_and_store(store: &mut Store, txs: Vec<Tx>) {
let result = txs.par_iter().map(|tx| {
let mut local_store = store.clone();
tx.verify_and_store(&mut local_store)
}).collect();
...
}
However, this clones the store on every iteration, which is way too slow. I would like to use one store instance per thread.
Is this possible with Rayon? Or should I resort to manual threading and a work-queue?

It is possible to use a thread-local variable to ensure that local_store is not created more than once in a given thread.
For example, this compiles (full source):
fn verify_and_store(store: &mut Store, txs: Vec<Tx>) {
use std::cell::RefCell;
thread_local!(static STORE: RefCell<Option<Store>> = RefCell::new(None));
let mut result = Vec::new();
txs.par_iter().map(|tx| {
STORE.with(|cell| {
let mut local_store = cell.borrow_mut();
if local_store.is_none() {
*local_store = Some(store.clone());
}
tx.verify_and_store(local_store.as_mut().unwrap())
})
}).collect_into(&mut result);
}
There are two problems with this code, however. One, if the clones of store need to do something when par_iter() is done, such as flush their buffers, it simply won't happen - their Drop will only be called when Rayon's worker threads exit, and even that is not guaranteed.
The second, and more serious problem, is that the clones of store are created exactly once per worker thread. If Rayon caches its thread pool (and I believe it does), this means that an unrelated later call to verify_and_store will continue working with last known clones of store, which possibly have nothing to do with the current store.
This can be rectified by complicating the code somewhat:
Store the cloned variables in a Mutex<Option<...>> instead of Option, so that they can be accessed by the thread that invoked par_iter(). This will incur a mutex lock on every access, but the lock will be uncontested and therefore cheap.
Use an Arc around the mutex in order to collect references to the created store clones in a vector. This vector is used to clean up the stores by resetting them to None after the iteration has finished.
Wrap the whole call in an unrelated mutex, so that two parallel calls to verify_and_store don't end up seeing each other's store clones. (This might be avoidable if a new thread pool were created and installed before the iteration.) Hopefully this serialization won't affect the performance of verify_and_store, since each call will utilize the whole thread pool.
The result is not pretty, but it compiles, uses only safe code, and appears to work:
fn verify_and_store(store: &mut Store, txs: Vec<Tx>) {
use std::sync::{Arc, Mutex};
type SharedStore = Arc<Mutex<Option<Store>>>;
lazy_static! {
static ref STORE_CLONES: Mutex<Vec<SharedStore>> = Mutex::new(Vec::new());
static ref NO_REENTRY: Mutex<()> = Mutex::new(());
}
thread_local!(static STORE: SharedStore = Arc::new(Mutex::new(None)));
let mut result = Vec::new();
let _no_reentry = NO_REENTRY.lock();
txs.par_iter().map({
|tx| {
STORE.with(|arc_mtx| {
let mut local_store = arc_mtx.lock().unwrap();
if local_store.is_none() {
*local_store = Some(store.clone());
STORE_CLONES.lock().unwrap().push(arc_mtx.clone());
}
tx.verify_and_store(local_store.as_mut().unwrap())
})
}
}).collect_into(&mut result);
let mut store_clones = STORE_CLONES.lock().unwrap();
for store in store_clones.drain(..) {
store.lock().unwrap().take();
}
}

Old question, but I feel the answer needs revisiting. In general, there are two methods:
Use map_with. This will clone every time a thread steals a work item from another thread. This will possibly clone more stores than there are threads, but it should be fairly low. If the clones are too expensive, you can increase the size rayon will split workloads with with_min_len.
fn verify_and_store(store: &mut Store, txs: Vec<Tx>) {
let result = txs.iter().map_with(|| store.clone(), |store, tx| {
tx.verify_and_store(store)
}).collect();
...
}
Or use the scoped ThreadLocal from the thread_local crate. This will ensure that you only use as many objects as there are threads, and that they are destroyed once the ThreadLocal object goes out of scope.
fn verify_and_store(store: &mut Store, txs: Vec<Tx>) {
let tl = ThreadLocal::new();
let result = txs.iter().map(|tx| {
let store = tl.get_or(|| Box::new(RefCell::new(store.clone)));
tx.verify_and_store(store.get_mut());
}).collect();
...
}

Related

Is it possible to create a global that is assigned once at the beginning of runtime and then borrowed across threads?

lazy_static doesn't work because I need to assign to this variable at runtime after some user interaction. thread_local doesn't work because I need to read this variable across threads.
From a system perspective, I think what I'm trying to do should be simple. At the beginning of execution I'm single threaded, I initialize some things, and then I tokio::spawn some tasks which only need to read those things.
I can get past the problem by using a mutex, but I don't really see why I should need to use a mutex when I can guarantee that no tasks will ever try to get mutable access other than at the very beginning of runtime when I'm still in a single thread. Is there a better way that using a mutex?
This is what I have so far, in case anyone is curious:
lazy_static! {
pub static ref KEYPAIRSTORE_GLOBAL: Mutex<KeypairStore> = Mutex::new(KeypairStore::new());
}
// ...
// at top of main:
let mut keypairstore = KEYPAIRSTORE_GLOBAL.lock().unwrap();
*keypairstore = KeypairStore::new_from_password();
// somewhere later in a tokio::spawn:
let keypair_store = KEYPAIRSTORE_GLOBAL.lock().unwrap();
let keypair = keypair_store.get_keypair();
println!("{}", keypair.address());
I don't see why I need to use this mutex... I'd be happy to use unsafe during assignment, but I'd rather not have to use it every time I want to read.

As written, you need the Mutex because you are mutating it after it is initialised. Instead, do the mutation during the initialisation:
lazy_static! {
pub static ref KEYPAIRSTORE_GLOBAL: KeypairStore = {
let mut keystore = KeypairStore::new_from_password();
// ... more initialisation here...
keystore
}
}
// somewhere later in a tokio::spawn:
let keypair = KEYPAIRSTORE_GLOBAL.get_keypair();
println!("{}", keypair.address());
This is assuming that the signature of get_keypair is:
pub fn get_keypair(&self) -> Keypair;

How can I move the data between threads safely?

I'm currently trying to call a function to which I pass multiple file names and expect the function to read the files and generate the appropriate structs and return them in a Vec<Audit>. I've been able to accomplish it reading the files one by one but I want to achieve it using threads.
This is the function:
fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
let mut audits = Arc::new(Mutex::new(vec![]));
let mut handlers = vec![];
for file in files {
let audits = Arc::clone(&audits);
handlers.push(thread::spawn(move || {
let mut audits = audits.lock().unwrap();
audits.push(audit_from_xml_file(file.clone()));
audits
}));
}
for handle in handlers {
let _ = handle.join();
}
audits
.lock()
.unwrap()
.into_iter()
.fold(vec![], |mut result, audit| {
result.push(audit);
result
})
}
But it won't compile due to the following error:
error[E0277]: `MutexGuard<'_, Vec<Audit>>` cannot be sent between threads safely
--> src/main.rs:82:23
|
82 | handlers.push(thread::spawn(move || {
| ^^^^^^^^^^^^^ `MutexGuard<'_, Vec<Audit>>` cannot be sent between threads safely
|
::: /home/enthys/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:618:8
I have tried wrapping the generated Audit structs in Some(Audit) to avoid the MutexGuard but then I stumble with Poisonned Thread issues.

The cause of the error is that after after pushing the new Audit into the (locked) audits vec you then try to return the vec's MutexGuard.
In Rust, a thread's function can actually return values, the point of doing that is to send the value back to whoever is join-ing the thread. This means the value is going to move between threads, so the value needs to be movable betweem threads (aka Send), which mutex guards have no reason to be[0].
The easy solution is to just... not do that. Just delete the last line of the spawn function. Though it's not like the code works after that as you still have borrowing issue related to the thing at the end.
An alternative is to lean into the feature (especially if Audit objects are not too big): drop the audits vec entirely and instead have each thread return its audit, then collect from the handlers when you join them:
pub fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
let mut handlers = vec![];
for file in files {
handlers.push(thread::spawn(move || {
audit_from_xml_file(file)
}));
}
handlers.into_iter()
.map(|handler| handler.join().unwrap())
.collect()
}
Though at that point you might as well just let Rayon handle it:
use rayon::prelude::*;
pub fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
files.into_par_iter().map(audit_from_xml_file).collect()
}
That also avoids crashing the program or bringing the machine to its knees if you happen to have millions of files.
[0] and all the reasons not to be, locking on one thread and unlocking on an other is not necessarily supported e.g. ReleaseMutex
The ReleaseMutex function fails if the calling thread does not own the mutex object.
(NB: in the windows lingo, "owning" a mutex means having acquired it via WaitForSingleObject, which translates to lock in posix lingo)
and can be plain UB e.g. pthread_mutex_unlock
If a thread attempts to unlock a mutex that it has not locked or a mutex which is unlocked, undefined behavior results.

Your problem is that you are passing your Vec<Audit> (or more precisely the MutexGuard<Vec<Audit>>), to the threads and back again, without really needing it.
And you don't need Mutex or Arc for this simpler task:
fn generate_audits_from_files(files: Vec<String>) -> Vec<Audit> {
let mut handlers = vec![];
for file in files {
handlers.push(thread::spawn(move || {
audit_from_xml_file(file)
}));
}
handlers
.into_iter()
.flat_map(|x| x.join())
.collect()
}

Is it possible to share a HashMap between threads without locking the entire HashMap?

I would like to have a shared struct between threads. The struct has many fields that are never modified and a HashMap, which is. I don't want to lock the whole HashMap for a single update/remove, so my HashMap looks something like HashMap<u8, Mutex<u8>>. This works, but it makes no sense since the thread will lock the whole map anyways.
Here's this working version, without threads; I don't think that's necessary for the example.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
fn main() {
let s = Arc::new(Mutex::new(S::new()));
let z = s.clone();
let _ = z.lock().unwrap();
}
struct S {
x: HashMap<u8, Mutex<u8>>, // other non-mutable fields
}
impl S {
pub fn new() -> S {
S {
x: HashMap::default(),
}
}
}
Playground
Is this possible in any way? Is there something obvious I missed in the documentation?
I've been trying to get this working, but I'm not sure how. Basically every example I see there's always a Mutex (or RwLock, or something like that) guarding the inner value.

I don't see how your request is possible, at least not without some exceedingly clever lock-free data structures; what should happen if multiple threads need to insert new values that hash to the same location?
In previous work, I've used a RwLock<HashMap<K, Mutex<V>>>. When inserting a value into the hash, you get an exclusive lock for a short period. The rest of the time, you can have multiple threads with reader locks to the HashMap and thus to a given element. If they need to mutate the data, they can get exclusive access to the Mutex.
Here's an example:
use std::{
collections::HashMap,
sync::{Arc, Mutex, RwLock},
thread,
time::Duration,
};
fn main() {
let data = Arc::new(RwLock::new(HashMap::new()));
let threads: Vec<_> = (0..10)
.map(|i| {
let data = Arc::clone(&data);
thread::spawn(move || worker_thread(i, data))
})
.collect();
for t in threads {
t.join().expect("Thread panicked");
}
println!("{:?}", data);
}
fn worker_thread(id: u8, data: Arc<RwLock<HashMap<u8, Mutex<i32>>>>) {
loop {
// Assume that the element already exists
let map = data.read().expect("RwLock poisoned");
if let Some(element) = map.get(&id) {
let mut element = element.lock().expect("Mutex poisoned");
// Perform our normal work updating a specific element.
// The entire HashMap only has a read lock, which
// means that other threads can access it.
*element += 1;
thread::sleep(Duration::from_secs(1));
return;
}
// If we got this far, the element doesn't exist
// Get rid of our read lock and switch to a write lock
// You want to minimize the time we hold the writer lock
drop(map);
let mut map = data.write().expect("RwLock poisoned");
// We use HashMap::entry to handle the case where another thread
// inserted the same key while where were unlocked.
thread::sleep(Duration::from_millis(50));
map.entry(id).or_insert_with(|| Mutex::new(0));
// Let the loop start us over to try again
}
}
This takes about 2.7 seconds to run on my machine, even though it starts 10 threads that each wait for 1 second while holding the exclusive lock to the element's data.
This solution isn't without issues, however. When there's a huge amount of contention for that one master lock, getting a write lock can take a while and completely kills parallelism.
In that case, you can switch to a RwLock<HashMap<K, Arc<Mutex<V>>>>. Once you have a read or write lock, you can then clone the Arc of the value, returning it and unlocking the hashmap.
The next step up would be to use a crate like arc-swap, which says:
Then one would lock, clone the [RwLock<Arc<T>>] and unlock. This suffers from CPU-level contention (on the lock and on the reference count of the Arc) which makes it relatively slow. Depending on the implementation, an update may be blocked for arbitrary long time by a steady inflow of readers.
The ArcSwap can be used instead, which solves the above problems and has better performance characteristics than the RwLock, both in contended and non-contended scenarios.
I often advocate for performing some kind of smarter algorithm. For example, you could spin up N threads each with their own HashMap. You then shard work among them. For the simple example above, you could use id % N_THREADS, for example. There are also complicated sharding schemes that depend on your data.
As Go has done a good job of evangelizing: do not communicate by sharing memory; instead, share memory by communicating.

Suppose the key of the data is map-able to a u8
You can have Arc<HashMap<u8,Mutex<HashMap<Key,Value>>>
When you initialize the data structure you populate all the first level map before putting it in Arc (it will be immutable after initialization)
When you want a value from the map you will need to do a double get, something like:
data.get(&map_to_u8(&key)).unwrap().lock().expect("poison").get(&key)
where the unwrap is safe because we initialized the first map with all the value.
to write in the map something like:
data.get(&map_to_u8(id)).unwrap().lock().expect("poison").entry(id).or_insert_with(|| value);
It's easy to see contention is reduced because we now have 256 Mutex and the probability of multiple threads asking the same Mutex is low.
#Shepmaster example with 100 threads takes about 10s on my machine, the following example takes a little more than 1 second.
use std::{
collections::HashMap,
sync::{Arc, Mutex, RwLock},
thread,
time::Duration,
};
fn main() {
let mut inner = HashMap::new( );
for i in 0..=u8::max_value() {
inner.insert(i, Mutex::new(HashMap::new()));
}
let data = Arc::new(inner);
let threads: Vec<_> = (0..100)
.map(|i| {
let data = Arc::clone(&data);
thread::spawn(move || worker_thread(i, data))
})
.collect();
for t in threads {
t.join().expect("Thread panicked");
}
println!("{:?}", data);
}
fn worker_thread(id: u8, data: Arc<HashMap<u8,Mutex<HashMap<u8,Mutex<i32>>>>> ) {
loop {
// first unwrap is safe to unwrap because we populated for every `u8`
if let Some(element) = data.get(&id).unwrap().lock().expect("poison").get(&id) {
let mut element = element.lock().expect("Mutex poisoned");
// Perform our normal work updating a specific element.
// The entire HashMap only has a read lock, which
// means that other threads can access it.
*element += 1;
thread::sleep(Duration::from_secs(1));
return;
}
// If we got this far, the element doesn't exist
// Get rid of our read lock and switch to a write lock
// You want to minimize the time we hold the writer lock
// We use HashMap::entry to handle the case where another thread
// inserted the same key while where were unlocked.
thread::sleep(Duration::from_millis(50));
data.get(&id).unwrap().lock().expect("poison").entry(id).or_insert_with(|| Mutex::new(0));
// Let the loop start us over to try again
}
}

Maybe you want to consider evmap:
A lock-free, eventually consistent, concurrent multi-value map.
The trade-off is eventual-consistency: Readers do not see changes until the writer refreshes the map. A refresh is atomic and the writer decides when to do it and expose new data to the readers.

One mutable borrow and multiple immutable borrows

I'm trying to write a program that spawns a background thread that continuously inserts data into some collection. At the same time, I want to keep getting input from stdin and check if that input is in the collection the thread is operating on.
Here is a boiled down example:
use std::collections::HashSet;
use std::thread;
fn main() {
let mut set: HashSet<String> = HashSet::new();
thread::spawn(move || {
loop {
set.insert("foo".to_string());
}
});
loop {
let input: String = get_input_from_stdin();
if set.contains(&input) {
// Do something...
}
}
}
fn get_input_from_stdin() -> String {
String::new()
}
However this doesn't work because of ownership stuff.
I'm still new to Rust but this seems like something that should be possible. I just can't find the right combination of Arcs, Rcs, Mutexes, etc. to wrap my data in.

First of all, please read Need holistic explanation about Rust's cell and reference counted types.
There are two problems to solve here:
Sharing ownership between threads,
Mutable aliasing.
To share ownership, the simplest solution is Arc. It requires its argument to be Sync (accessible safely from multiple threads) which can be achieved for any Send type by wrapping it inside a Mutex or RwLock.
To safely get aliasing in the presence of mutability, both Mutex and RwLock will work. If you had multiple readers, RwLock might have an extra performance edge. Since you have a single reader there's no point: let's use the simple Mutex.
And therefore, your type is: Arc<Mutex<HashSet<String>>>.
The next trick is passing the value to the closure to run in another thread. The value is moved, and therefore you need to first make a clone of the Arc and then pass the clone, otherwise you've moved your original and cannot access it any longer.
Finally, accessing the data requires going through the borrows and locks...
use std::sync::{Arc, Mutex};
fn main() {
let set = Arc::new(Mutex::new(HashSet::new()));
let clone = set.clone();
thread::spawn(move || {
loop {
clone.lock().unwrap().insert("foo".to_string());
}
});
loop {
let input: String = get_input_from_stdin();
if set.lock().unwrap().contains(&input) {
// Do something...
}
}
}
The call to unwrap is there because Mutex::lock returns a Result; it may be impossible to lock the Mutex if it is poisoned, which means a panic occurred while it was locked and therefore its content is possibly garbage.

How can I pass the same data to different threads without copying it?

Struct S may be actually some big data, for example a large Vec. If I have one thread and do not use the data after creating a thread, I can move data to it, but with two threads (or using the same data in main thread), it is impossible.
struct S {
i : i32,
}
fn thr(s : &S)
{
}
fn main()
{
let s1 = S { i:1 };
thr(&s1);
let t1 = std::thread::spawn(|| thr(&s1)); // does not work
let t2 = std::thread::spawn(|| thr(&s1)); // does not work
t1.join();
t2.join();
}

I'd highly recommend reading The Rust Programming Language, specifically the chapter on concurrency. In it, you are introduced to Arc:
use std::sync::Arc;
struct S {
i: i32,
}
fn thr(s: &S) {}
fn main() {
let s1 = Arc::new(S { i: 1 });
thr(&s1);
let s2 = s1.clone();
let t2 = std::thread::spawn(move || thr(&s2));
let s3 = s1.clone();
let t3 = std::thread::spawn(move || thr(&s3));
t2.join();
t3.join();
}
Notably, when Arcs are cloned, they simply bump a reference count, not duplicate the contained data.

If this is C or C++ the allocated memory won't go away until its explicitly freed, and also, unless the objects are declared threadlocal then all threads in your app can refer to them.
In the code you wrote it seems your structure is on the stack rather than the heap, so once it goes out of scope that object pointer is invalid.
The brutal way to deal with this, without ARC, is to simply allocate the object and the pass the allocated pointer to your threads. Some thread needs to be the one to ultimately delete it (that's where ARC is a good idea).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Per-thread initialization in Rayon - multithreading

Related

Is it possible to create a global that is assigned once at the beginning of runtime and then borrowed across threads?

How can I move the data between threads safely?

Is it possible to share a HashMap between threads without locking the entire HashMap?

One mutable borrow and multiple immutable borrows

How can I pass the same data to different threads without copying it?

Categories

Resources