How do I initialize a Vec from multiple threads?

I have code that generates a CSR sparse matrix by reading multiple parquet files in parallel, preprocessing the data, then finally acquiring a mutex to sequentially write into the sparse array data structure. This is roughly the pseudocode for each thread:
    read parquet data
    get list of nonzero values
    acquire mutex
    append nonzero values to sparse array (basically memcpy)
    release mutex
    repeat
The speed of the code doesn't increase with more CPUs, so I suspect mutex contention is the bottleneck. I would therefore like to try replacing the mutex with an atomic "offset" variable into the sparse array, so that I can do something like this:
fn push(&self, rhs: &Self, offset: &AtomicU64) -> MyResult<()> {
    // Pack two counters into one atomic: the row count m in the high 32 bits
    // and the nonzero count nnz in the low 32 bits, so both can be claimed
    // with a single fetch_add.
    let old = offset.fetch_add(rhs.data.0.len() as u64 + (1 << 32), Ordering::SeqCst);
    let (nnz, m) = ((old & 0xffff_ffff) as usize, (old >> 32) as usize);
    let indptr = (nnz + rhs.data.0.len()).try_into()?;
    let range = nnz..nnz + rhs.data.0.len();
    self.data[range.clone()].copy_from_slice(&rhs.data.0);
    self.indices[range].copy_from_slice(&rhs.indices.0);
    self.indptr[m] = indptr;
    self.y[m] = rhs.y.0[0];
    Ok(())
}
I.e., I want each thread to first atomically bump the offset into the sparse array, then write into the region it just claimed. The problem is that the data is behind an immutable reference & (so that it can be passed to each thread), while I need a mutable reference &mut to update it. Reading similar questions and the Rust documentation, it seems you can't do this simple task like you would in C, but no alternative is provided, which is quite irritating.
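One way to express this disjoint-write pattern is to share a raw pointer to the pre-allocated buffer and let each thread claim a region with a single atomic bump before copying into it. The following is only a minimal sketch of that idea (using std::thread::scope, Rust 1.63+, with plain u64 payloads standing in for the CSR arrays); the safety argument rests entirely on the claimed regions never overlapping:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Raw pointers are not Send, so wrap one in order to move it into each
// thread. Safety rests entirely on the threads writing disjoint regions.
#[derive(Clone, Copy)]
struct SendPtr(*mut u64);
unsafe impl Send for SendPtr {}

fn main() {
    let n_threads = 4;
    let chunk = 8;
    // Pre-allocate the full output; every slot is written exactly once.
    let mut data = vec![0u64; n_threads * chunk];
    let dst = SendPtr(data.as_mut_ptr());
    let offset = AtomicUsize::new(0);

    thread::scope(|s| {
        for t in 0..n_threads as u64 {
            let offset = &offset;
            s.spawn(move || {
                // Stand-in for "read parquet data, get nonzero values".
                let values: Vec<u64> = (0..chunk as u64).map(|i| t * 100 + i).collect();
                // Claim a disjoint region with a single atomic bump...
                let start = offset.fetch_add(values.len(), Ordering::SeqCst);
                // ...then copy into it; no lock is held during the copy.
                unsafe {
                    std::ptr::copy_nonoverlapping(values.as_ptr(), dst.0.add(start), values.len());
                }
            });
        }
    });

    println!("{:?}", data);
}

If each thread's chunk size were known up front, chunks_mut could instead hand each thread its own &mut slice and give the same disjointness guarantee without any unsafe.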

Related

How to safely reinterpret Vec<f64> as Vec<num_complex::Complex<f64>> with half the size?

I have complex-number data filled into a Vec<f64> by an external C library (which I'd prefer not to change) in the form [i_0_real, i_0_imag, i_1_real, i_1_imag, ...], and it appears that this Vec<f64> has the same memory layout as a Vec<num_complex::Complex<f64>> of half the length, given that num_complex::Complex<f64> is documented to be memory-layout compatible with [f64; 2]. I'd like to use it as such without reallocating a potentially large buffer.
I'm assuming it's valid to use std::vec::Vec's from_raw_parts() to fake a new Vec that takes ownership of the old Vec's memory (by forgetting the old Vec), using len / 2 and capacity / 2, but that requires unsafe code. Is there a "safe" way to do this kind of data reinterpretation?
The Vec is allocated in Rust as a Vec<f64> and is populated by a C function using .as_mut_ptr() that fills in the Vec<f64>.
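For context, the allocate-in-Rust, fill-from-C pattern looks roughly like this sketch; fill_complex is a hypothetical stand-in for the external routine, not a real binding:

// `fill_complex` is hypothetical and mimics the C routine; a real binding
// would be an extern "C" declaration of the shape:
//     void fill_complex(double *out, size_t len);
unsafe fn fill_complex(out: *mut f64, len: usize) {
    // Pretend the C library wrote interleaved re/im pairs.
    for i in 0..len {
        *out.add(i) = (i + 3) as f64;
    }
}

fn read_pairs(n_pairs: usize) -> Vec<f64> {
    // Allocate in Rust; the foreign code fills the buffer through a raw pointer.
    let mut buffer = vec![0.0f64; n_pairs * 2];
    unsafe { fill_complex(buffer.as_mut_ptr(), buffer.len()) };
    buffer
}

fn main() {
    println!("{:?}", read_pairs(2)); // [3.0, 4.0, 5.0, 6.0]
}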
My current compiling unsafe implementation:
extern crate num_complex;

pub fn convert_to_complex_unsafe(mut buffer: Vec<f64>) -> Vec<num_complex::Complex<f64>> {
    let new_vec = unsafe {
        Vec::from_raw_parts(
            buffer.as_mut_ptr() as *mut num_complex::Complex<f64>,
            buffer.len() / 2,
            buffer.capacity() / 2,
        )
    };
    std::mem::forget(buffer);
    return new_vec;
}

fn main() {
    println!(
        "Converted vector: {:?}",
        convert_to_complex_unsafe(vec![3.0, 4.0, 5.0, 6.0])
    );
}
Is there a "safe" way to do this kind of data re-interpretation?
No. At the very least, this is because the information you need to know is not expressed in the Rust type system but is expressed via prose (a.k.a. the docs):
Complex<T> is memory layout compatible with an array [T; 2].
— Complex docs
If a Vec has allocated memory, then [...] its pointer points to len initialized, contiguous elements in order (what you would see if you coerced it to a slice),
— Vec docs
Arrays coerce to slices ([T])
— Array docs
Since a Complex is memory-compatible with an array, an array's data is memory-compatible with a slice, and a Vec's data is memory-compatible with a slice, this transformation should be safe, even though the compiler cannot tell this.
This information should be attached (via a comment) to your unsafe block.
I would make some small tweaks to your function:
Having two Vecs at the same time pointing to the same data makes me very nervous. This can be trivially avoided by introducing some variables and forgetting one before creating the other.
Remove the return keyword to be more idiomatic
Add some asserts that the starting length of the data is a multiple of two.
As rodrigo points out, the capacity could easily be an odd number. To attempt to avoid this, we call shrink_to_fit. This has the downside that the Vec may need to reallocate and copy the memory, depending on the implementation.
Expand the unsafe block to cover all of the related code that is required to ensure that the safety invariants are upheld.
pub fn convert_to_complex(mut buffer: Vec<f64>) -> Vec<num_complex::Complex<f64>> {
    // This is where I'd put the rationale for why this `unsafe` block
    // upholds the guarantees that I must ensure. Too bad I
    // copy-and-pasted from Stack Overflow without reading this comment!
    unsafe {
        buffer.shrink_to_fit();

        let ptr = buffer.as_mut_ptr() as *mut num_complex::Complex<f64>;
        let len = buffer.len();
        let cap = buffer.capacity();

        assert!(len % 2 == 0);
        assert!(cap % 2 == 0);

        std::mem::forget(buffer);

        Vec::from_raw_parts(ptr, len / 2, cap / 2)
    }
}
To avoid all the worrying about the capacity, you could just convert a slice into the Vec. This also doesn't have any extra memory allocation. It's simpler because we can "lose" any odd trailing values because the Vec still maintains them.
pub fn convert_to_complex(buffer: &[f64]) -> &[num_complex::Complex<f64>] {
    // This is where I'd put the rationale for why this `unsafe` block
    // upholds the guarantees that I must ensure. Too bad I
    // copy-and-pasted from Stack Overflow without reading this comment!
    unsafe {
        let ptr = buffer.as_ptr() as *const num_complex::Complex<f64>;
        let len = buffer.len();

        assert!(len % 2 == 0);

        std::slice::from_raw_parts(ptr, len / 2)
    }
}
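A quick usage sketch of the slice-based version; the output shown assumes num_complex's derived Debug formatting:

fn main() {
    let raw = vec![3.0, 4.0, 5.0, 6.0];
    let complexes = convert_to_complex(&raw);
    assert_eq!(complexes.len(), 2);
    // With num_complex's derived Debug impl this prints:
    // [Complex { re: 3.0, im: 4.0 }, Complex { re: 5.0, im: 6.0 }]
    println!("{:?}", complexes);
}

Note that raw remains usable afterwards, and the odd-capacity worry disappears because the original Vec keeps ownership of any trailing element.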

How can I use Rayon to split a big range into chunks of ranges and have each thread find within a chunk?

I am making a program that brute forces a password by parallelization. At the moment the password to crack is already available as plain text, I'm just attempting to brute force it anyway.
I have a function called generate_char_array() which, based on an integer seed, performs a base conversion and returns a u8 slice of characters to try and check. This goes through the alphabet first for 1-character strings, then 2, etc.
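Since generate_char_array() isn't shown, here is a hypothetical sketch of what such a function could look like (bijective base-26 over lowercase letters; the real implementation may well differ):

// Hypothetical sketch: maps seed 0 to "a", 25 to "z", 26 to "aa", and so on,
// writing into `out` and returning the filled prefix.
fn generate_char_array(seed: u64, out: &mut [u8]) -> &[u8] {
    let mut n = seed + 1; // shift into bijective base-26 to avoid leading-"a" ambiguity
    let mut len = 0;
    while n > 0 {
        n -= 1;
        out[len] = b'a' + (n % 26) as u8;
        len += 1;
        n /= 26;
    }
    out[..len].reverse();
    &out[..len]
}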
let found_string_index = (0..1e12 as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let bytes = generate_char_array(*i, &mut array);
    return &password_bytes == &bytes;
});
With the found string index (or seed integer rather), I can generate the found string.
The problem is that Rayon parallelizes this by splitting the arbitrarily large integer range into thread_count slices (e.g. for 4 threads: 0..2.5e11, 2.5e11..5e11, etc.). This is not good, because the end of the range corresponds to arbitrarily long password lengths (10+ characters, I don't know exactly), whereas most passwords (including the fixed "zzzzz" I tend to try) are much shorter. What I get is that the first thread does all the work while the rest waste time testing passwords that are far too long and synchronizing, ending up slower than single-threaded performance.
How could I instead split the arbitrarily big range (it doesn't actually have to have an end) into chunks of ranges and have each thread search within a chunk? That would make the workers in the different threads actually useful.
This goes through the alphabet first for 1 character strings, then 2
You wish to impose some sequencing on your data processing, but the whole point of Rayon is to go in parallel.
Instead, use regular iterators to sequentially go up in length and then use parallel iterators inside a specific length to quickly process all of the values of that length.
Since you haven't provided enough code for a runnable example, I've made this rough approximation to show the general shape of such a solution:
extern crate rayon;

use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::ops::RangeInclusive;

type Seed = u8;

const LENGTHS: RangeInclusive<usize> = 1..=3;
const SEEDS: RangeInclusive<Seed> = 0..=std::u8::MAX;

fn find<F>(test_password: F) -> Option<(usize, Seed)>
where
    F: Fn(usize, Seed) -> bool + Sync,
{
    // Rayon doesn't support RangeInclusive yet
    let seeds: Vec<_> = SEEDS.collect();

    // Step 1-by-1 through the lengths, sequentially
    LENGTHS
        .flat_map(|length| {
            // In parallel, investigate every value in this length
            // This doesn't do that, but it shows how the parallelization
            // would be introduced
            seeds
                .par_iter()
                .find_any(|&&seed| test_password(length, seed))
                .map(|&seed| (length, seed))
        })
        .next()
}

fn main() {
    let pass = find(|l, s| {
        println!("{}, {}", l, s);
        // Actually generate and check the password based on the search criteria
        l == 3 && s == 250
    });

    println!("Found password length and seed: {:?}", pass);
}
This can "waste" a little time at the end of each length as the parallel threads spin down one-by-one before spinning back up for the next length, but that seems unlikely to be a primary concern.
This is a version of what I suggested in my comment.
The main loop is parallel and is only over the first byte of each attempt. For each first byte, do the full brute force search for the remainder.
let matched_bytes = (0 .. 0xFFu8)
    .into_par_iter()
    .filter_map(|n| {
        let mut array = [0u8; 8];
        // the first digit is always the same in this run
        array[0] = n;
        // The highest byte is 0 because it's provided from the outer loop
        (0 ..= 0x0FFFFFFFFFFFFFFF as u64)
            .into_iter()
            .filter_map(|i| {
                // pass a slice so that the first byte is not affected
                generate_char_array(i, &mut array[1 .. 8]);
                if &password_bytes[..] == &array[0 .. password_bytes.len()] {
                    Some(array.clone())
                } else {
                    None
                }
            })
            .next()
    })
    .find_any(|_| true);

println!("found = {:?}", matched_bytes);
Also, even for a brute-force method, this is probably still highly inefficient!
If Rayon splits the slices as you described, then apply simple math to balance the password lengths:
let found_string_index = (0..max_val as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    // Interleave the search space so every thread sees short candidates early
    let v = i / span + (i % span) * num_cpu;
    let bytes = generate_char_array(v, &mut array);
    return &password_bytes == &bytes;
});
The span value depends on the number of CPUs (the number of threads used by Rayon), in your case:
let num_cpu = 4;
let span = 2.5e11 as u64;
let max_val = span * num_cpu;
Note that the performance of this approach depends heavily on how Rayon actually splits the sequence across parallel threads. Verify that it splits the range the way you reported in the question.

How exactly does Rust handle return values?

I've got a question about my code:
pub fn get_signals(path: &String) -> Vec<Vec<f64>> {
    let mut rdr = csv::ReaderBuilder::new()
        .delimiter(b';')
        .from_path(&path)
        .unwrap();
    let mut signals: Vec<Vec<f64>> = Vec::new();

    for record in rdr.records() {
        let r = record.unwrap();
        for (i, value) in r.iter().enumerate() {
            match signals.get(i) {
                Some(_) => {}
                None => signals.push(Vec::new()),
            }
            signals[i].push(value.parse::<f64>().unwrap());
        }
    }

    signals
}
How exactly does Rust handle return? When I write, for example, let signals = get_signals(&"data.csv".to_string());, does Rust assume I want a new instance of Vec (copying all the data) or does it just pass a pointer to the previously allocated (via Vec::new()) memory? What is the most efficient way to do this? Also, what happens to rdr? I assume, given Rust's memory safety, it's destroyed.
How exactly does Rust handle return?
The only guarantee Rust, the language, makes is that values are never cloned without an explicit .clone() in the code. Therefore, from a semantic point of view, the value is moved, which does not require allocating memory.
does Rust assume I want a new instance of Vec(copies all the data) or just pass a pointer to previously allocated (via Vec::new()) memory?
This is implementation specific, and part of the ABI (Application Binary Interface). The Rust ABI is not formalized, and not stable, so there is no standard describing it and no guarantee about this holding up.
Furthermore, this will depend on whether the function call is inlined or not. If the call is inlined there is of course no return any longer, yet the same behavior should be observed.
For small values, they should be returned via a register (or a couple of registers).
For larger values:
the caller should reserve memory on the stack (properly sized and aligned) and pass a pointer to this area to the callee,
the callee will then construct the return value at the place pointed to, so that by the time it returns the value exists there for the caller to use.
Note: the size here is the size on the stack, as returned by std::mem::size_of; so size_of::<Vec<_>>() == 24 on 64-bit architectures.
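A tiny check of that claim (assuming a typical 64-bit target):

use std::mem::size_of;

fn main() {
    // A Vec<T> is three words on the stack: pointer, capacity, and length;
    // the heap data it owns is not part of this size.
    assert_eq!(size_of::<Vec<f64>>(), 3 * size_of::<usize>());
    println!("{}", size_of::<Vec<f64>>()); // 24 on a 64-bit target
}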
What is the most efficient way to do this?
Returning is as efficient as it gets for a single call.
If, however, you find yourself in a situation where, say, you want to read a file line by line, then it makes sense to reuse the buffer from one call to the next, which can be accomplished either by:
taking a &mut reference to the buffer (a String or Vec<u8>, say),
or taking the buffer by value and returning it.
The point being to avoid memory allocations.
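For instance, the read-a-file-line-by-line case might look like this sketch, reusing a single String across calls (the file name is purely illustrative):

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open("data.csv")?);
    // One allocation, reused for every line.
    let mut line = String::new();
    while reader.read_line(&mut line)? > 0 {
        // process `line` here
        line.clear(); // keep the capacity, drop the contents
    }
    Ok(())
}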

How can I sum up using concurrency from 1 to 1000000 with Rust?

I am a newbie to Rust, and I want to sum up a large amount of numbers using concurrency. I found this code:
use std::thread;
use std::sync::{Arc, Mutex};

static NTHREAD: usize = 10;

fn main() {
    let mut threads = Vec::new();
    let x = 0;
    // A thread-safe, sharable mutex object
    let data = Arc::new(Mutex::new(x));

    for i in 1..(NTHREAD + 1) {
        // Increment the count of the mutex
        let mutex = data.clone();
        threads.push(thread::spawn(move || {
            // Lock the mutex
            let n = mutex.lock();
            match n {
                Ok(mut n) => *n += i,
                Err(str) => println!("{}", str),
            }
        }));
    }

    // Wait for all threads to end
    for thread in threads {
        let _ = thread.join().unwrap();
    }

    assert_eq!(*data.lock().unwrap(), 55);
}
This works when the number of threads is 10, but it does not work when the number is larger than 20.
I think it should be fine with any number of threads.
Do I misunderstand something? Is there another way to sum up from 1 to 1000000 with concurrency?
There are several problems with the provided code.
thread::spawn creates an OS-level thread, which means the existing code cannot possibly scale to numbers up to a million as indicated in the title. That would require a million threads in parallel, whereas typical modern OSes support up to a few thousand threads at best. More constrained environments, such as embedded systems or virtual/paravirtualized machines, allow far fewer; for example, the Rust playground appears to allow a maximum of 24 concurrent threads. Instead, one needs to create a small fixed number of threads and carefully divide the work among them.
The function executing in each thread runs inside a lock, which effectively serializes the work done by the threads. Even if one could spawn arbitrarily many threads, the loop as written would execute no faster than what would be achieved by a single thread - and in practice it would be orders of magnitude slower because it would spend a lot of time on locking/unlocking of a heavily contended mutex.
One good way to approach this kind of problem while still managing threads manually is provided in the comment by Boiethios: if you have 4 threads, just sum 1..250k, 250k..500k, etc. in each thread and then sum up the return of the threaded functions.
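A minimal sketch of that manual division, assuming four threads and sub-ranges chosen to cover 1..=1_000_000 exactly:

use std::thread;

fn main() {
    let n_threads = 4u64;
    let total = 1_000_000u64;
    let chunk = total / n_threads;

    // Each thread sums its own sub-range; no shared state, no mutex.
    let handles: Vec<_> = (0..n_threads)
        .map(|t| {
            let start = t * chunk + 1;
            // The last thread absorbs any remainder.
            let end = if t == n_threads - 1 { total } else { (t + 1) * chunk };
            thread::spawn(move || (start..=end).sum::<u64>())
        })
        .collect();

    // Combine the partial sums returned through join.
    let sum: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(sum, 500_000_500_000);
}

Each thread returns its partial sum through join, so no Arc or Mutex is needed at all.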
Or is there another way to sum up from 1 to 1000000 with concurrency?
I would recommend using a higher-level library that encapsulates the creation/pooling of worker threads and the division of work among them. Rayon is an excellent one: it provides a "parallel iteration" facility that works like regular iteration but automatically divides the work among multiple cores. Using Rayon, parallel summing of the integers would look like this:
extern crate rayon;

use rayon::prelude::*;

fn main() {
    let sum: usize = (1..1000001).collect::<Vec<_>>().par_iter().sum();
    assert_eq!(sum, 500000500000);
}
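Depending on the Rayon version available, the intermediate Vec may not be needed at all, since Rayon can parallelize directly over plain integer ranges:

extern crate rayon;

use rayon::prelude::*;

fn main() {
    // Rayon implements IntoParallelIterator for integer ranges,
    // so no intermediate buffer is allocated.
    let sum: u64 = (1..1_000_001u64).into_par_iter().sum();
    assert_eq!(sum, 500_000_500_000);
}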

Reverse order of a reference to immutable array slice

I would like to reverse the order of a slice:
&[1, 2, 3] -> &[3, 2, 1]
This is my code:
fn iterate_over_file(data: &[u8], ...) -> ... {
    ...
    for cur_data in data.chunks(chunk_size) {
        let reversed_cur_data = cur_data.reverse(); // this returns ()
        ...
    }
    ...
}
This data parameter comes from a file I read in using FileBuffer, and I'd like to keep it as a referenced slice (and not turn it into an owned Vec, since it's a heavy computation to make).
How could I reverse the order of cur_data with the minimal number of operations and memory allocations? Its length is known for a specific run of my program (called chunk_size here), but it changes between runs. reverse() returns (), which makes sense as it works in place, but I only have a shared slice. .iter().rev() creates an iterator, but then I'd have to call .next() on it several times to get the slice back, which is neither elegant nor efficient, as I have at least tens of millions of cur_data chunks per file.
Not only does reverse return (), it also requires a mutable slice, which you do not have. The optimal solution depends on exactly what you need to do with the data. If you only need to iterate over the data, cur_data.iter().rev() is exactly the right and the most efficient choice.
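For the iteration-only case, that might look like this (same data and chunk_size as in the question):

for cur_data in data.chunks(chunk_size) {
    // Walk the chunk back to front; no copy, no allocation.
    for &byte in cur_data.iter().rev() {
        // process `byte` here
    }
}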
If you need the reversed data inside a slice for further processing, such as to send the reversed chunk to a function that expects a slice, you can collect the data into a vector. To avoid a new allocation for each chunk, you can reuse the same vector across all loop iterations:
let mut reversed = Vec::with_capacity(chunk_size);

for cur_data in data.chunks(chunk_size) {
    // truncate the vector at the beginning of each iteration.
    // Vec explicitly guarantees that this will *not* deallocate,
    // it will only reset its internal length. An allocation will
    // thus only happen prior to the loop.
    reversed.truncate(0);
    reversed.extend(cur_data.iter().rev());
    // &reversed is now the reversed_cur_data you need
    ...
}
