Parallelize groupby with Rayon in Rust

I am trying to do something where I have a little logic in groupby. Consider this working code for example:
use itertools::Itertools;
use std::collections::HashMap;
let counts = vec![Some(1), Some(1), None, None, Some(1), Some(1), Some(3), Some(3)];
let vals = vec![1, 3, 2, 2, 1, 0, 1, 2];
let mut hm1: HashMap<u32,u32> = HashMap::new();
let groups = &(counts.into_iter().zip(vals.into_iter())).group_by(|(d1,_d2)| *d1);
groups.into_iter()
.map(|(key, group)| {
group.into_iter()
.map(|(count, val)| {
hm1.entry(val).and_modify(|e| *e += 1).or_insert(1);
}).collect::<Vec<_>>();
}).collect::<Vec<_>>();
Here, I am essentially creating a histogram in HashMap. Please ignore the fact that in this particular case I do not actually need to groupby to get my HashMap. It's illustrative.
Now, I am trying to understand how I can parallelize this type of code with Rayon. Upon searching, I found this thread that says groupby in itertools cannot be used with Rayon. Reading the Rayon documentation, I could not find any way to achieve this. Note that I don't particularly care about using itertools, so any other solution that parallelizes a groupby in rust would work.
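
One possible way to parallelize this kind of group-wise work is sketched below. It is a minimal sketch under stated assumptions, not a Rayon solution: it uses only std::thread::scope (Rust 1.63+) in place of Rayon, and the names consecutive_groups, parallel_histogram and the workers parameter are hypothetical. The grouping itself (runs of consecutive equal keys, which is what itertools' group_by produces) is done sequentially because it is cheap; the per-group work is then spread over threads, each filling its own HashMap, and the partial maps are merged at the end.

```rust
use std::collections::HashMap;
use std::thread;

/// Split `items` into runs of consecutive elements that share the same key.
/// (Hypothetical helper; mirrors what itertools' `group_by` yields.)
fn consecutive_groups<K: PartialEq, V>(items: &[(K, V)]) -> Vec<&[(K, V)]> {
    let mut groups = Vec::new();
    let mut start = 0;
    for i in 1..=items.len() {
        // Close the current run at the end of input or when the key changes.
        if i == items.len() || items[i].0 != items[start].0 {
            groups.push(&items[start..i]);
            start = i;
        }
    }
    groups
}

/// Build the histogram of `vals`, processing batches of groups on scoped threads.
/// `workers` is an assumed tuning knob, not something from the question.
fn parallel_histogram(counts: &[Option<u32>], vals: &[u32], workers: usize) -> HashMap<u32, u32> {
    let pairs: Vec<(Option<u32>, u32)> =
        counts.iter().copied().zip(vals.iter().copied()).collect();
    let groups = consecutive_groups(&pairs);
    let workers = workers.max(1);
    // Number of groups handed to each thread (at least 1, since chunks(0) panics).
    let per_worker = ((groups.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = groups
            .chunks(per_worker)
            .map(|batch| {
                s.spawn(move || {
                    // Each thread owns its map, so no locking is needed.
                    let mut local: HashMap<u32, u32> = HashMap::new();
                    for group in batch {
                        for (_key, val) in group.iter() {
                            *local.entry(*val).or_insert(0u32) += 1;
                        }
                    }
                    local
                })
            })
            .collect();

        // Merge the per-thread histograms.
        let mut total = HashMap::new();
        for handle in handles {
            for (val, n) in handle.join().unwrap() {
                *total.entry(val).or_insert(0) += n;
            }
        }
        total
    })
}
```

Calling parallel_histogram(&counts, &vals, 4) with the vectors from the question should give the same map as the sequential loop. The same shape (thread-local accumulator plus a final merge) is what a fold-then-reduce over a parallel iterator would express.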

Related

Given `Vec<HashSet>`, how to update `v[i]` while iterating `v[i - 1]`? [duplicate]

This question already has answers here:
How to get mutable references to two array elements at the same time?
(8 answers)
Closed 2 months ago.
Let v be Vec<HashSet<usize>>.
Is it possible to update v[i] while iterating v[i - 1]?
Normally, Rust's ownership rule doesn't allow this, but I believe some way should exist since v[i] and v[i - 1] are essentially independent.
unsafe is allowed because unsafe sometimes lets us bypass (in a sense) the ownership rule. (For example, swapping values of HashMap is normally impossible, but using unsafe makes it possible. ref: swapping two entries of a HashMap)
Please assume v.len() is very large because, if v.len() is small, you can give up using Vec container in the first place.
Very artificial but minimum working example is shown below (Rust Playground). This type of source code is often seen in doing dynamic programming.
use std::collections::HashSet;
fn main() {
let n = 100000;
let mut v = vec![HashSet::new(); n];
v[0].insert(0);
for i in 1..n {
v[i] = v[i - 1].clone(); // This `clone()` is necessary.
let prev = v[i - 1].clone(); //I want to eliminate this `clone()`.
prev.iter().for_each(|e| {
v[i].insert(*e + 1);
})
}
println!("{:?}", v); //=> [{0}, {0, 1}, {0, 1, 2}, {0, 1, 2, 3}, ...]
}

When you modify a vector with v[i], you are using the IndexMut trait, which requires a mutable borrow to Self, ie. the whole vector. For this reason, Rust will never allow taking v[i] and v[i-1] at the same time, if at least one of them is a mutable borrow.
To solve this issue, you must work a little harder to show Rust that v[i] and v[i-1] are not aliased (the borrow checker does not reason about index values, so it cannot see on its own that the two elements are distinct).
The "bad" news is that it's impossible to do so without relying on unsafe somewhere. The good news is that someone else already did that, and wrapped it in a safe interface, namely split_at_mut. This will break a single vector into two subslices, which are guaranteed to be disjoint (this is where unsafe kicks in).
So, for instance, in your case, you could do
use std::collections::HashSet;
fn main() {
let n = 100000;
let mut v = vec![HashSet::new(); n];
v[0].insert(0);
for i in 1..n {
v[i] = v[i - 1].clone(); // This `clone()` is necessary.
let (left, right) = v.split_at_mut(i);
left[i-1].iter().for_each(|e| {
right[0].insert(*e + 1);
})
}
println!("{:?}", v); //=> [{0}, {0, 1}, {0, 1, 2}, {0, 1, 2, 3}, ...]
}
See the playground.
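
The split_at_mut trick above can also be wrapped in a small helper. This is a minimal sketch; prev_and_current and build_sets are hypothetical names that do not come from the answer. Because the helper hands back a shared reference to v[i - 1] together with a mutable reference to v[i], the remaining clone() can be dropped as well, assuming the sets start out empty.

```rust
use std::collections::HashSet;

/// Return `(&v[i - 1], &mut v[i])`. Hypothetical helper built on `split_at_mut`.
/// Panics if `i == 0` or `i >= v.len()`.
fn prev_and_current<T>(v: &mut [T], i: usize) -> (&T, &mut T) {
    assert!(i >= 1 && i < v.len(), "index out of range");
    // `left` is v[..i] and `right` is v[i..]; the two halves are disjoint.
    let (left, right) = v.split_at_mut(i);
    (&left[i - 1], &mut right[0])
}

/// Build [{0}, {0, 1}, {0, 1, 2}, ...] without cloning any set.
fn build_sets(n: usize) -> Vec<HashSet<usize>> {
    let mut v: Vec<HashSet<usize>> = vec![HashSet::new(); n];
    if n == 0 {
        return v;
    }
    v[0].insert(0);
    for i in 1..n {
        let (prev, cur) = prev_and_current(&mut v, i);
        for &e in prev {
            cur.insert(e); // carry over the previous set
            cur.insert(e + 1); // and add the shifted element
        }
    }
    v
}
```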
Besides, maybe this is just because your example is simplified, but there is actually no point in creating 100000 HashSets up front if you just overwrite them right away. A simpler solution would be
use std::collections::HashSet;
fn main() {
let n = 100000;
let mut v = Vec::with_capacity(n);
v.push(HashSet::from([0])); // Vec::insert would need an index; push appends
for i in 0..n-1 {
let mut new_set = v[i].clone();
for e in v[i].iter().copied() {
new_set.insert(e+1);
}
v.push(new_set);
}
println!("{:?}", v);
}
See the playground.

Transpose Vec<Vec<T>> to Vec<Vec<T>> where T has no Copy trait [duplicate]

This question already has answers here:
How to transpose a vector of vectors in Rust?
(2 answers)
Closed 2 months ago.
The goal is to achieve the following conversion from A to B both of type Vec<Vec<T>> in Rust, where type T has no Copy trait:
A = [ [t1,t2,t3], [t4,t5,t6] ]
B = [ [t1,t4], [t2,t5], [t3,t6] ]
Through testing, I know the next three ideas do work very well:
Suppose A is a vector of M elements, each of which is a vector of N elements.
Idea One:
let B: Vec<Vec<T>> = (0..N)
.map(|i| A
.iter()
.map(|x| (*x.iter().skip(i).next().unwrap()).clone())
.collect::<Vec<_>>()
)
.collect::<Vec<_>>();
Idea Two:
let B: Vec<Vec<T>> = (0..N)
.map(|i| A
.iter()
.map(|x| x[i].clone())
.collect::<Vec<_>>()
)
.collect::<Vec<_>>();
Idea Three:
let A = A.into_iter().flatten().collect::<Vec<T>>();
let B: Vec<Vec<T>> = (0..N)
.map(|i| A
.iter()
.enumerate()
.filter(|(v, _)| *v % N == i)
.map(|(_, j)| j.clone())
.collect::<Vec<_>>()
)
.collect::<Vec<_>>();
Is there any other idea that avoids using the Clone trait or indexing (such as A[i]), or at least tries to use them as little as possible? Thanks in advance.
I tried to find my answer on Google, stackoverflow, github or Rust Programming community. Unfortunately, I can't see any similar questions.
I expect there might be some clues to solve this question in a way that avoids using the Clone trait or indexing (such as A[i]), or at least tries to use them as little as possible.
I believe this is rather simple for an experienced Rust programmer, but I am kinda stuck somewhere that I have no idea about.

Here's a solution using no cloning or indexing: convert the rows into consuming iterators, then to make each row in the output, take one item from each of those iterators.
let matrix = vec![vec![1, 2, 3], vec![4, 5, 6]];
let num_cols = matrix.first().unwrap().len();
let mut row_iters: Vec<_> = matrix.into_iter().map(Vec::into_iter).collect();
let mut out: Vec<Vec<_>> = (0..num_cols).map(|_| Vec::new()).collect();
for out_row in out.iter_mut() {
for it in row_iters.iter_mut() {
out_row.push(it.next().unwrap());
}
}
println!("{:?}", out);
// [[1, 4], [2, 5], [3, 6]]
Playground link
If you don't like the for loops either, you can replace them with this:
let out: Vec<Vec<_>> = (0..num_cols)
.map(|_| row_iters.iter_mut().map(|it| it.next().unwrap()).collect())
.collect();
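
For reuse, the same idea can be wrapped in a generic function. The following is a sketch under assumptions: the name transpose is hypothetical, and the choice to truncate to the shortest row (and return an empty result for an empty input) is mine, not something the answer specifies. No Clone or Copy bound is needed because every element is moved out of its row iterator.

```rust
/// Transpose a row-major `Vec<Vec<T>>` by moving elements, without cloning.
/// Assumption: ragged input is truncated to the length of the shortest row.
fn transpose<T>(matrix: Vec<Vec<T>>) -> Vec<Vec<T>> {
    // Using the shortest row length guarantees every `next()` below succeeds.
    let num_cols = matrix.iter().map(Vec::len).min().unwrap_or(0);
    // One consuming iterator per input row.
    let mut row_iters: Vec<_> = matrix.into_iter().map(Vec::into_iter).collect();
    (0..num_cols)
        .map(|_| {
            row_iters
                .iter_mut()
                .map(|it| it.next().expect("row shorter than num_cols"))
                .collect()
        })
        .collect()
}
```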

parallel sorting on separate sections of a single slice

I'm trying to implement a sort of parallel bubble sort, i.e. have a number of threads work on distinct parts of the same slice and then have a final thread sort those two parts together, similar to a kind of merge sort.
I have this code so far
pub fn parallel_bubble_sort(to_sort: Arc<&[i32]>) {
let midpoint = to_sort.len() / 2;
let ranges = [0..midpoint, midpoint..to_sort.len()];
let handles = (ranges).map(|range| {
thread::spawn(|| {
to_sort[range].sort();
})
});
}
But I get a series of errors, relating to 'to_sort's lifetime, etc
How would someone go about modifying distinct slices of a larger slice across thread bounds?

Disclaimer: I assume that you want to sort in place, as you call .sort().
There's a couple of problems with your code:
to_sort isn't mutable, so you won't be able to modify it, which is an essential part of sorting ;) So I think that Arc<&[i32]> should almost certainly be &mut [i32].
You cannot split a mutable slice like this. Rust doesn't know whether your ranges overlap, and therefore disallows this entirely. You can, however, use split_at_mut to split it into two disjoint mutable parts, which is what you need here.
You cannot move mutable references into threads created with thread::spawn, because it's unknown how long the thread will exist. Overcoming this issue is the hardest part, I'm afraid; I don't know how easy it is in normal Rust without the use of unsafe. I think the easiest solution would be to use a library like rayon, which has already solved those problems for you.
EDIT: Rust 1.63 introduced scoped threads (std::thread::scope), which eliminate the need for rayon in this use case.
This should be a good start for you:
pub fn parallel_bubble_sort(to_sort: &mut [i32]) {
let midpoint = to_sort.len() / 2;
let (left, right) = to_sort.split_at_mut(midpoint);
std::thread::scope(|s| {
s.spawn(|| left.sort());
s.spawn(|| right.sort());
});
// TODO: merge left and right
}
fn main() {
let mut data = [1, 6, 3, 4, 9, 7, 4];
parallel_bubble_sort(&mut data);
println!("{:?}", data);
}
[1, 3, 6, 4, 4, 7, 9]
Previous answer for Rust versions older than 1.63
pub fn parallel_bubble_sort(to_sort: &mut [i32]) {
let midpoint = to_sort.len() / 2;
let (left, right) = to_sort.split_at_mut(midpoint);
rayon::scope(|s| {
s.spawn(|_| left.sort());
s.spawn(|_| right.sort());
});
// TODO: merge left and right
}
fn main() {
let mut data = [1, 6, 3, 4, 9, 7, 4];
parallel_bubble_sort(&mut data);
println!("{:?}", data);
}
[1, 3, 6, 4, 4, 7, 9]
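
The TODO left in both versions could be filled in along these lines. This is only a sketch: merge_sorted and parallel_sort are hypothetical names, and the merge goes through a temporary Vec rather than working in place, which is the simplest option but costs an extra allocation.

```rust
use std::thread;

/// Merge two already-sorted slices into a new sorted Vec.
fn merge_sorted(left: &[i32], right: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(left.len() + right.len());
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        if left[i] <= right[j] {
            out.push(left[i]);
            i += 1;
        } else {
            out.push(right[j]);
            j += 1;
        }
    }
    // At most one of these two tails is non-empty.
    out.extend_from_slice(&left[i..]);
    out.extend_from_slice(&right[j..]);
    out
}

/// Sort both halves on scoped threads (Rust 1.63+), then merge them.
pub fn parallel_sort(to_sort: &mut [i32]) {
    let midpoint = to_sort.len() / 2;
    {
        let (left, right) = to_sort.split_at_mut(midpoint);
        thread::scope(|s| {
            s.spawn(|| left.sort());
            s.spawn(|| right.sort());
        });
        // Both threads are joined when `scope` returns.
    }
    let merged = merge_sorted(&to_sort[..midpoint], &to_sort[midpoint..]);
    to_sort.copy_from_slice(&merged);
}
```

With the data from the example above, [1, 6, 3, 4, 9, 7, 4], the slice ends up fully sorted instead of sorted per half.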

How to convert Vec<T> to HashMap<T,T>?

I have a vector of strings. I need to convert it to a HashMap.
The vector's element 0 should become a key and element 1 should become its value. The same goes for elements 2 and 3, and so on.
The obvious solution is just to write a for loop and add them to the HashMap one by one. However, that ends up being several lines of code. I am curious whether there is a cleaner one-liner.
I know you can do vec.into_iter().collect(). However, this requires the vector to contain tuples (rather than being flat).

You can use chunks_exact plus a few combinators to achieve this. However, I wouldn't recommend putting this on only one line for readability reasons. This does have a downside, and that is extra elements (if the vector has an odd number of elements) will be discarded.
use std::collections::HashMap;
fn main() {
// vector with elements
let vector = vec!["a", "b", "c", "d", "e", "f"];
let map = vector.chunks_exact(2) // chunks_exact returns an iterator of slices
.map(|chunk| (chunk[0], chunk[1])) // map slices to tuples
.collect::<HashMap<_, _>>(); // collect into a hashmap
// outputs: Map {"e": "f", "c": "d", "a": "b"}
println!("Map {:?}", map);
}
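
If silently dropping a trailing element is not acceptable, the odd-length case can be made explicit. A minimal std-only sketch, assuming an owned Vec<String> as input; the function name pairs_to_map and the error message are hypothetical. It moves the strings into the map instead of borrowing or cloning them.

```rust
use std::collections::HashMap;

/// Turn [k0, v0, k1, v1, ...] into a map, rejecting odd-length input.
/// Later duplicates of a key overwrite earlier ones, as with `collect()`.
fn pairs_to_map(flat: Vec<String>) -> Result<HashMap<String, String>, String> {
    if flat.len() % 2 != 0 {
        return Err(format!("expected an even number of elements, got {}", flat.len()));
    }
    let mut it = flat.into_iter();
    let mut map = HashMap::with_capacity(it.len() / 2);
    // Pull two items per step; the length check above means both are Some
    // until the input is exhausted.
    while let (Some(k), Some(v)) = (it.next(), it.next()) {
        map.insert(k, v);
    }
    Ok(map)
}
```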

slice::array_chunks is currently unstable but when it's stabilized in the future, I would prefer this over .chunks(2):
#![feature(array_chunks)]
use std::collections::HashMap;
fn main() {
let vec = vec![1, 2, 3, 4, 5, 6, 7];
let map = vec
.array_chunks::<2>()
.map(|[k, v]| (k, v))
.collect::<HashMap<_, _>>();
dbg!(map);
}
Output:
[src/main.rs:11] map = {
1: 2,
3: 4,
5: 6,
}
Playground

Using itertools's tuples:
use itertools::Itertools;
use std::collections::HashMap;
fn main() {
let v: Vec<String> = vec!["key1".into(), "val1".into(), "key2".into(), "val2".into()];
// Extra elements are discarded
let hm: HashMap<String, String> = v.into_iter().tuples().collect();
assert_eq!(hm, HashMap::from([("key1".into(), "val1".into()), ("key2".into(), "val2".into())]));
}

How do I output multiple values from .map() or use map twice in one iteration?

How do I use map twice on one into_iter? Currently I have:
let res_arr_to: Vec<String> = v.result.transactions.into_iter().map( |x| x.to).collect();
let res_arr_from: Vec<String> = v.result.transactions.into_iter().map( |x| x.from).collect();
What I want is both arrays in one array; the order doesn't matter. I need either a closure that outputs two values (if that is even a closure?) or a way to use map twice in one iteration, without using the generated value but instead using the untouched iterator, if that makes sense and is possible. I am a total noob in functional programming, so if there is a completely different way to do this, another explanation is fine too.
v is an EthBlockTxResponse:
#[derive(Debug, Deserialize)]
struct EthTransactionObj {
from: String,
to: String
}
#[derive(Debug, Deserialize)]
struct EthTransactions {
transactions : Vec<EthTransactionObj>
}
#[derive(Debug, Deserialize)]
struct EthBlockTxResponse {
result : EthTransactions
}
Thanks

You can use .unzip() to collect two vectors at once like this:
let (res_arr_to, res_arr_from): (Vec<_>, Vec<_>) =
v.result.transactions.into_iter().map(|x| (x.to, x.from)).unzip();
Note that into_iter consumes v.result.transactions - moving out of that field. This is probably not what you want, and you should copy the strings instead in that case:
let (res_arr_to, res_arr_from): (Vec<_>, Vec<_>) =
v.result.transactions.iter().map(|x| (x.to.clone(), x.from.clone())).unzip();
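
For reference, here are both variants as self-contained functions. This is a sketch: the struct is a stand-in for the one in the question with the serde derive left out, and the names split_moving and split_cloning are hypothetical.

```rust
// Stand-in for the question's struct (serde's Deserialize derive omitted).
#[derive(Debug, Clone)]
struct EthTransactionObj {
    from: String,
    to: String,
}

/// Consumes the transactions and moves the strings out; nothing is copied.
fn split_moving(transactions: Vec<EthTransactionObj>) -> (Vec<String>, Vec<String>) {
    transactions.into_iter().map(|x| (x.to, x.from)).unzip()
}

/// Leaves the transactions intact and clones each string.
fn split_cloning(transactions: &[EthTransactionObj]) -> (Vec<String>, Vec<String>) {
    transactions
        .iter()
        .map(|x| (x.to.clone(), x.from.clone()))
        .unzip()
}
```

Both return the pair (to addresses, from addresses) in the original order; the choice between them only depends on whether the transactions are still needed afterwards.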

I find the question a bit vague, but think you're trying to get both the x.to and the x.from at the same time instead of having to iterate the data twice and build two vectors. I'll address that first and then some cases of what you might have meant by some other things you mentioned.
One way you can do it is with .flat_map(). This will produce one flat vector, removing the extra level of nesting. If you wanted tuples, you could just use .map(|x| (x.from, x.to)). I'm assuming you actually want everything in a single vector without nesting; because into_iter() yields owned values, the x.to and x.from strings are moved into the array rather than copied.
let res_arr_combined = v.result.transactions.into_iter()
.flat_map( |x| [x.to, x.from])
.collect::<Vec<_>>();
Reference:
Iterator::flat_map()
Excerpt:
The map adapter is very useful, but only when the closure argument produces values. If it produces an iterator instead, there’s an extra layer of indirection. flat_map() will remove this extra layer on its own.
fn main()
{
// Adding more data to an iterator stream.
(0..5).flat_map(|n| [n, n * n])
.for_each(|n| print!("{}, ", n));
println!("");
}
output:
0, 0, 1, 1, 2, 4, 3, 9, 4, 16,
You may not really require the following, but wrt your comment about wanting to get data from an iterator without using the value or changing the state of the iterator, there is a .peek() operation you can invoke on iterators wrapped in Peekable.
To get a peekable iterator, you just invoke .peekable() on any iterator.
let mut p = [1, 2, 3, 4].into_iter().peekable();
println!("{:?}", p.peek());
println!("{:?}", p.next());
output:
Some(1)
Some(1)
The peekable behaves the same way as the iterator it was taken from, but adds a couple of interesting methods like .next_if(|x| *x > 0), which returns the next item only if the condition evaluates to true; otherwise it returns None and leaves that item in the iterator. Calling it in a loop therefore consumes items until the condition first fails, without consuming the item that failed.
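
A small sketch of next_if in a loop may help here. Peekable::next_if is a standard-library method that returns Option<Item>; the wrapper function take_leading_positive is a hypothetical name used only for illustration.

```rust
/// Collect the leading positive values, then report the first item that
/// failed the test (it is still in the iterator, so `next()` returns it).
fn take_leading_positive(values: &[i32]) -> (Vec<i32>, Option<i32>) {
    let mut it = values.iter().copied().peekable();
    let mut taken = Vec::new();
    // `next_if` passes a reference to the upcoming item to the closure and
    // only advances the iterator when the closure returns true.
    while let Some(x) = it.next_if(|x| *x > 0) {
        taken.push(x);
    }
    (taken, it.next())
}
```

For the input [1, 2, -3, 4] this returns ([1, 2], Some(-3)): the -3 stopped the loop but was not consumed by it.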
And one last topic in line with "using map twice in one iteration": perhaps you meant pulling items from a slice in chunks of 2. If v.result.transactions is itself a Vec, you can use the .chunks() method to group its items by 2's, or by 3's as I have below:
let a = [1, 2, 3, 4, 5, 6, 7, 8, 9].chunks(3).collect::<Vec<_>>();
println!("{:?}", a);
output:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
