Rust Unique Return Counts / Unique with Frequency - rust

What is the fastest way to get the unique elements in a vector and their counts, similar to numpy.unique(return_counts=True)? The code below becomes exceedingly slow as the array grows into the millions, because it rescans the whole vector once for every unique element.
use std::collections::HashMap;
use itertools::Itertools;
fn main() {
    let kmers: Vec<u8> = vec![64, 64, 64, 65, 65, 65];
    let nodes: HashMap<u8, usize> = kmers
        .iter()
        .unique()
        .map(|kmer| {
            let count = kmers.iter().filter(|x| x == &kmer).count();
            (kmer.to_owned(), count)
        })
        .collect();
    println!("{:?}", nodes)
}

You can use the entry API for this. The linked docs have a similar example to what you need, here it is modified to fit your case:
use std::collections::HashMap;
fn main() {
    let kmers: Vec<u8> = vec![64, 64, 64, 65, 65, 65];
    let mut nodes: HashMap<u8, usize> = HashMap::new();
    for n in kmers.iter() {
        nodes.entry(*n).and_modify(|count| *count += 1).or_insert(1);
    }
    println!("{:?}", nodes)
}
If you want the output to be sorted, you can use a BTreeMap instead.
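As a minimal sketch of the BTreeMap variant mentioned above: the counting loop stays the same and only the map type changes, so the keys come out in ascending order. The helper name `count_sorted` is made up for illustration.

```rust
use std::collections::BTreeMap;

// Hypothetical helper: counts occurrences, keeping the keys in ascending order.
fn count_sorted(kmers: &[u8]) -> BTreeMap<u8, usize> {
    let mut counts = BTreeMap::new();
    for &kmer in kmers {
        // Insert 0 the first time a key is seen, then increment in place.
        *counts.entry(kmer).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let kmers: Vec<u8> = vec![65, 64, 65, 64, 64, 65];
    let nodes = count_sorted(&kmers);
    println!("{:?}", nodes); // prints {64: 3, 65: 3}
}
```

The trade-off is that each BTreeMap insertion or lookup costs O(log n) in the number of distinct keys, so this is worth it mainly when sorted output is actually needed.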

If you prefer a one-liner, you can use itertools' counts(). It uses the same approach as @PitaJ's answer under the hood, with a small improvement:
use std::collections::HashMap;
use itertools::Itertools;
fn main() {
    let kmers: Vec<u8> = vec![64, 64, 64, 65, 65, 65];
    let nodes: HashMap<u8, usize> = kmers.iter().copied().counts();
    println!("{:?}", nodes)
}

Related

How to collect a flattened Vec from a stream of Results containing a list?

I'm trying to improve the following method:
let output: Vec<i32> = stream::iter(vec![1, 2, 3])
    .then(|val| {
        future::ok::<_, ()>(vec![val * 10, val * 10 + 1])
    })
    .try_collect::<Vec<Vec<_>>>()
    .await?
    .into_iter()
    .flatten() // How to flatten directly from the stream?
    .collect();
assert_eq!(output, vec![10, 11, 20, 21, 30, 31]);
This method works but I think this could be improved because, as you can see, I need to collect two times in order to have the output I want.
This issue comes from the fact that I'm trying to flatten Results that contain a list. I tried to use try_flatten() however I absolutely can't make it work. Does anybody have an idea on how to achieve this?
I'm not sure how to use the future iterator adaptors, but you can accomplish the same thing like this:
use futures::{future, stream, StreamExt};
async fn f() -> Result<Vec<i32>, ()> {
    let mut output = Vec::new();
    let mut stream = stream::iter(vec![1, 2, 3])
        .then(|val| future::ok::<_, ()>(vec![val * 10, val * 10 + 1]));
    while let Some(items) = stream.next().await {
        output.extend(items?);
    }
    assert_eq!(output, vec![10, 11, 20, 21, 30, 31]);
    Ok(output)
}
#[tokio::main]
async fn main() {
    println!("{:?}", f().await.unwrap());
}
Just to let you know, I found the try_concat method in the TryStreamExt trait, which does exactly what I wanted:
let output: Vec<i32> = stream::iter(vec![1, 2, 3])
    .then(|val| {
        future::ok::<_, ()>(vec![val * 10, val * 10 + 1])
    })
    .try_concat()
    .await?;
assert_eq!(output, vec![10, 11, 20, 21, 30, 31]);

How can I iterate over a sequence multiple times within a function?

I have a function that I would like to take an argument that can be looped over. However I would like to loop over it twice. I tried using the Iterator trait however I can only iterate over it once because it consumes the struct when iterating.
How should I make it so my function can loop twice? I know I could use values: Vec<usize> however I would like to make it generic over any object that is iterable.
Here's an example of what I would like to do: (Please ignore what the loops are actually doing. In my real code I can't condense the two loops into one.)
fn perform<'a, I>(values: I) -> usize
where
    I: Iterator<Item = &'a usize>,
{
    // Loop one: This works.
    let sum = values.sum::<usize>();
    // Loop two: This doesn't work due to `error[E0382]: use of moved value:
    // `values``.
    let max = values.max().unwrap();
    sum * max
}
fn main() {
    let v: Vec<usize> = vec![1, 2, 3, 4];
    let result = perform(v.iter());
    print!("Result: {}", result);
}
You can't iterate over the same iterator twice, because iterators are not guaranteed to be randomly accessible. For example, std::iter::from_fn produces an iterator that is most definitely not randomly accessible.
As @mousetail already mentioned, one way to get around this problem is to require an iterator that implements Clone:
fn perform<'a, I>(values: I) -> usize
where
    I: Iterator<Item = &'a usize> + Clone,
{
    // Loop one: consumes a clone, so `values` itself stays usable.
    let sum = values.clone().sum::<usize>();
    // Loop two: now works, because `values` has not been moved yet.
    let max = values.max().unwrap();
    sum * max
}
fn main() {
    let v: Vec<usize> = vec![1, 2, 3, 4];
    let result = perform(v.iter());
    println!("Result: {}", result);
}
Result: 40
Although in your specific example, I'd compute both sum and max in the same iteration:
fn perform<'a, I>(values: I) -> usize
where
    I: Iterator<Item = &'a usize>,
{
    let (sum, max) = values.fold((0, usize::MIN), |(sum, max), &el| {
        (sum + el, usize::max(max, el))
    });
    sum * max
}
fn main() {
    let v: Vec<usize> = vec![1, 2, 3, 4];
    let result = perform(v.iter());
    println!("Result: {}", result);
}
Result: 40
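Returning to the Clone-based approach, a variation is to bound on IntoIterator + Clone instead of Iterator + Clone, so that callers can pass `&v` directly as well as `v.iter()`. This is only a sketch: the name `perform_twice` is made up, and unlike the versions above it returns 0 for an empty input instead of panicking.

```rust
// Hypothetical variant: accepts anything that can be turned into an iterator
// more than once, for example `&Vec<usize>` or `v.iter()`.
fn perform_twice<'a, I>(values: I) -> usize
where
    I: IntoIterator<Item = &'a usize> + Clone,
{
    // Cloning is cheap when `values` is a shared reference or a slice iterator.
    let sum = values.clone().into_iter().sum::<usize>();
    // Assumption: treat an empty input as having a maximum of 0.
    let max = values.into_iter().max().copied().unwrap_or(0);
    sum * max
}

fn main() {
    let v: Vec<usize> = vec![1, 2, 3, 4];
    println!("Result: {}", perform_twice(&v)); // Result: 40
    println!("Result: {}", perform_twice(v.iter())); // Result: 40
}
```

Note that passing an owned collection such as a Vec would also satisfy a Clone bound in general, but then the clone copies every element, so references are the cheaper choice.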

Use Rayon to chunk a hash map

Is it possible to use Rayon to chunk the data in a HashMap? I see several chunking methods, but they seem to want to work on a slice (or something similar).
use rayon::prelude::*;
use std::collections::HashMap;
use log::info;
fn main() {
    let foo = vec![1, 2, 3, 4, 5, 6, 7, 8];
    foo.par_chunks(3).for_each(|x| {
        info!("x: {:?}", x);
    });
    let bar = HashMap::<String, String>::default();
    // None of the following three attempts compile.
    bar.par_chunks(3).for_each(|x| {
        info!("x: {:?}", x);
    });
    bar.chunks(3).for_each(|x| {
        info!("x: {:?}", x);
    });
    bar.par_iter().chunks(3).for_each(|x| {
        info!("x: {:?}", x);
    });
}
The vec code compiles without error, but all of the HashMap attempts fail with "no method named ..." errors.
Edit: The question about how to use an existing iterator with rayon does not answer this question. This question is how to get an iterator that chunks a hash map.
Answer
One way to chunk a hash map is the following, which iterates sequentially and uses itertools' chunks() rather than Rayon:
use itertools::Itertools;
use std::collections::HashMap;
fn main() {
    let mut m: HashMap<usize, usize> = HashMap::default();
    for n in 0..100 {
        m.insert(n, 2 * n);
    }
    println!("m: {:?}", m);
    let res: HashMap<usize, usize> = (&m)
        .into_iter()
        .chunks(7)
        .into_iter()
        .map(|c| c.map(|(a, b)| (a + b, b - a)))
        .flatten()
        .collect();
    println!("M still usable: {}", m.len());
    println!("res: {:?}", res);
}
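Since the chunking methods in the question want a slice, another option is to borrow the map's entries into a Vec first and chunk that. The sketch below uses only the standard library's `chunks`; the helper name `chunk_sums` and the per-chunk summing are made up for illustration. Replacing `chunks` with Rayon's `par_chunks` (which the question already calls on a Vec) should be the parallel equivalent, but that is an assumption not tested here.

```rust
use std::collections::HashMap;

// Hypothetical helper: sums the values in each chunk of `size` entries.
fn chunk_sums(map: &HashMap<usize, usize>, size: usize) -> Vec<usize> {
    // Borrow the entries into a Vec so that slice methods become available.
    let entries: Vec<(&usize, &usize)> = map.iter().collect();
    entries
        .chunks(size) // panics if `size` is 0
        .map(|chunk| chunk.iter().map(|(_, v)| **v).sum::<usize>())
        .collect()
}

fn main() {
    let m: HashMap<usize, usize> = (0..10).map(|n| (n, 2 * n)).collect();
    // HashMap iteration order is unspecified, so which entries share a chunk
    // can differ between runs; only the chunk sizes are predictable.
    let sums = chunk_sums(&m, 3);
    println!("{} chunks: {:?}", sums.len(), sums);
}
```

The extra Vec costs one allocation of references, not a copy of the keys and values.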

Rust - padded array of bytes from str

This rust does exactly what I want, but I don't do much rust and I get the feeling this could be done much better - like maybe in one line. Can anyone give hints to a more "rust idiomatic" way?
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=1f139ccf6e8f88dbe92f1f1e4d7a487a
fn fill_from_str(bytes: &mut [u8], s: &str) {
    let mut i = 0;
    for b in s.as_bytes() {
        bytes[i] = *b;
        i = i + 1;
    }
}
fn main() {
    let mut bytes: [u8; 10] = [0; 10];
    fill_from_str(&mut bytes, "hello");
    println!("{:?}", bytes);
}
This can be done very succinctly via std::io::Write which is implemented for &mut [u8]:
use std::io::Write;
fn fill_from_str(mut bytes: &mut [u8], s: &str) {
    bytes.write(s.as_bytes()).unwrap();
}
fn main() {
    let mut bytes: [u8; 10] = [0; 10];
    fill_from_str(&mut bytes, "hello");
    println!("{:?}", bytes);
}
[104, 101, 108, 108, 111, 0, 0, 0, 0, 0]
There is a better way to do this: the copy_from_slice method. If the slice and the string have the same length in bytes, this is a one-liner (copy_from_slice panics if the two lengths differ):
fn copy_from_str(dest: &mut [u8], src: &str) {
    dest.copy_from_slice(src.as_bytes());
}
The copy_from_slice method is also essentially a single call to memcpy, so it should be faster than your byte-by-byte loop. If you want to support different sizes, a little more code is needed:
fn copy_from_str(dest: &mut [u8], src: &str) {
    if dest.len() == src.len() {
        dest.copy_from_slice(src.as_bytes());
    } else if dest.len() > src.len() {
        dest[..src.len()].copy_from_slice(src.as_bytes());
    } else {
        dest.copy_from_slice(&src.as_bytes()[..dest.len()]);
    }
}
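That function will also not panic if it winds up slicing on the boundary of a multibyte character, because it slices the bytes rather than the str. The copied prefix may then end in an incomplete UTF-8 sequence.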
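The three branches can also be folded into one by copying the shorter of the two lengths. This is a sketch of that idea, not a replacement the answer itself proposes; the name `copy_prefix_from_str` and the returned byte count are additions for illustration.

```rust
// Hypothetical compact variant: copies as many bytes as fit into `dest`,
// leaves any remaining bytes of `dest` untouched, and returns the count.
fn copy_prefix_from_str(dest: &mut [u8], src: &str) -> usize {
    // `src.len()` is the length in bytes, not in characters.
    let n = dest.len().min(src.len());
    // Both sides are exactly `n` bytes long, so `copy_from_slice` cannot panic.
    dest[..n].copy_from_slice(&src.as_bytes()[..n]);
    n
}

fn main() {
    let mut bytes = [0u8; 10];
    let copied = copy_prefix_from_str(&mut bytes, "hello");
    println!("{} {:?}", copied, bytes);
}
```

Returning the count lets the caller detect truncation by comparing it with the string's byte length.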

How to obtain the chunk index in Rayon's par_chunks_mut

I have some data and I want to process it and use it to fill an array that already exists. For example, suppose I want to repeat each value 4 times:
use rayon::prelude::*; // 1.3.0
fn main() {
    let input = vec![4, 7, 2, 3, 5, 8];
    // This already exists.
    let mut output = vec![0; input.len() * 4];
    output.par_chunks_mut(4).for_each(|slice| {
        for x in slice.iter_mut() {
            *x = input[?];
        }
    });
}
This almost works but Rayon doesn't pass the chunk index to me so I can't put anything in input[?]. Is there an efficient solution?
The easiest thing to do is avoid the need for an index at all. For this example, we can just zip the iterators:
use rayon::prelude::*; // 1.3.0
fn main() {
    let input = vec![4, 7, 2, 3, 5, 8];
    let mut output = vec![0; input.len() * 4];
    // Can also use `.zip(&input)` if you don't want to give up ownership
    output.par_chunks_mut(4).zip(input).for_each(|(o, i)| {
        for o in o {
            *o = i
        }
    });
    println!("{:?}", output)
}
For traditional iterators, this style of implementation is beneficial because it avoids unneeded bounds checks: the iterator guarantees that every access is in range, whereas indexing checks the index on each access. I'm not sure that Rayon benefits in exactly the same way, but I also don't see any reason it wouldn't.
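For comparison, here is what the same zip pattern looks like with the standard library's sequential `chunks_mut`, where no indexing into `input` occurs at all. The helper name `repeat_each` is made up for illustration.

```rust
// Hypothetical helper: repeats each input value `times` times.
fn repeat_each(input: &[i32], times: usize) -> Vec<i32> {
    let mut output = vec![0; input.len() * times];
    // `times` must be non-zero: `chunks_mut` panics on a chunk size of 0.
    for (chunk, &value) in output.chunks_mut(times).zip(input) {
        // Each chunk is paired with its value by the zip, not by an index.
        chunk.fill(value);
    }
    output
}

fn main() {
    let input = vec![4, 7, 2, 3, 5, 8];
    println!("{:?}", repeat_each(&input, 4));
}
```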
Rayon provides an enumerate() function for most of its iterators that works just like the non-parallel counterpart:
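use rayon::prelude::*;
fn main() {
    let input = vec![4, 7, 2, 3, 5, 8];
    let mut output = vec![0; input.len() * 4];
    // `i` is the chunk index, so chunk `i` is filled with `input[i]`.
    output.par_chunks_mut(4).enumerate().for_each(|(i, slice)| {
        for x in slice.iter_mut() {
            *x = input[i];
        }
    });
    println!("{:?}", output)
}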
