I'm doing some coding practice in Rust and got a somewhat weird test result. Maybe I misunderstood something.
Test Result
Time elapsed in function bruteforce is: 2.387µs
Time elapsed in function hashset is: 24.413µs // Why does this take relatively long?
Time elapsed in function rusty is: 1.13µs
test arrays::contains_common_item::test_time_measure ... ok
use std::collections::HashSet;

// O(N^2) time, O(1) space
fn contains_common_item_bruteforce(arr1: Vec<char>, arr2: Vec<char>) -> bool {
    for arr1_item in arr1.iter() {
        for arr2_item in arr2.iter() {
            if arr1_item == arr2_item {
                return true;
            }
        }
    }
    false
}

// O(2N) ≈ O(N) time, O(N) space
fn contains_common_item_hashset(arr1: Vec<char>, arr2: Vec<char>) -> bool {
    let mut contained_items_map = HashSet::new();
    // iter() iterates over the items by reference
    // iter_mut() iterates over the items, giving a mutable reference to each item
    // into_iter() iterates over the items, moving them into the new scope
    for item in arr1.iter() {
        contained_items_map.insert(*item);
    }
    for item in arr2.iter() {
        if contained_items_map.contains(item) {
            return true;
        }
    }
    false
}

fn contains_common_item_rusty(arr1: Vec<char>, arr2: Vec<char>) -> bool {
    arr1.iter().any(|x| arr2.contains(x)) // contains() is O(n)
}
#[test]
fn test_time_measure() {
    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];
    let start = std::time::Instant::now();
    contains_common_item_bruteforce(arr1, arr2);
    let duration: std::time::Duration = start.elapsed();
    eprintln!("Time elapsed in function bruteforce is: {:?}", duration);

    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];
    let start = std::time::Instant::now();
    contains_common_item_hashset(arr1, arr2);
    let duration: std::time::Duration = start.elapsed();
    eprintln!("Time elapsed in function hashset is: {:?}", duration);

    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];
    let start = std::time::Instant::now();
    contains_common_item_rusty(arr1, arr2);
    let duration: std::time::Duration = start.elapsed();
    eprintln!("Time elapsed in function rusty is: {:?}", duration);
}
There are a few things wrong with both your testing methodology and with your expectations.
First of all: optimizations are variable, CPUs are variable, caches are variable... Running a single pass, in a combined test, with fixed values does not account for these variables. You should be using a proper benchmarking framework if you want practical results. Look into using criterion.
Also, either your computer is quite slow, or you're testing in debug mode. The Rust Playground gives 210ns, 4.25µs, and 170ns respectively. Benchmarking in debug mode is fairly useless since the performance wouldn't reflect how it'd behave in a release environment.
Second, HashSet advertises O(1) access time, but there's no such thing as a free lunch. For one thing, it is a variable-sized collection that must be built before you can even use it. A similar crude test shows that this step alone is 4x as costly as the other two functions in their entirety. That cost includes allocation time, hashing time, and any other record-keeping the HashSet does.
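For illustration, here's a crude sketch that times only the construction step (my own example, with the same single-pass caveats as above, so treat the numbers as rough):

use std::collections::HashSet;
use std::time::Instant;

fn main() {
    let arr1 = vec!['a', 'b', 'c', 'x'];
    let start = Instant::now();
    // Only the construction step: allocate the set and hash every item into it.
    let set: HashSet<char> = arr1.iter().copied().collect();
    eprintln!("HashSet construction alone: {:?}", start.elapsed());
    assert_eq!(set.len(), 4); // keep the set alive so it isn't optimized away
}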
You may have been surprised because the Big-O complexity suggests the HashSet should perform better, but you're probably missing that Big-O notation only gives an upper bound on the work being done and expresses how that work grows as n grows larger. Here you have fixed sets of just a few items, so the time is dominated by the fixed costs that Big-O notation leaves out (like the HashSet creation). Big-O notation also leaves out how much work each step actually takes: computing a hash, looking up the bucket, potentially handling collisions, and checking whether an item exists is much more work than comparing two chars. The n in the O(n^2) of the bruteforce method and the n in the O(n) of the hashset method are not directly comparable.
In summary, if your usage means doing intersection checks on small datasets then there's a pretty good chance that bruteforce will be faster. But you should use realistic data in a proper benchmarking test to verify.
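For reference, a minimal criterion setup might look like the sketch below. The crate path mycrate and the bench-file wiring are my assumptions; adapt them to your project (criterion goes under [dev-dependencies], and the [[bench]] entry needs harness = false):

// benches/contains_common_item.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use mycrate::contains_common_item_bruteforce; // hypothetical crate path

fn bench_contains(c: &mut Criterion) {
    c.bench_function("bruteforce", |b| {
        b.iter(|| {
            // black_box keeps the optimizer from constant-folding the inputs
            let arr1 = black_box(vec!['a', 'b', 'c', 'x']);
            let arr2 = black_box(vec!['z', 'y', 'i']);
            contains_common_item_bruteforce(arr1, arr2)
        })
    });
}

criterion_group!(benches, bench_contains);
criterion_main!(benches);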
Related
I have a long string stored in a variable in Rust. I often remove some characters from its front with a drain method and use the value returned from it:
my_str.drain(0..i).collect::<String>();
The problem is that draining from this string happens really often in the program and is slowing it down a lot (it takes ~99.6% of the runtime). This is a very expensive operation, since every time the entire string has to be moved left in memory.
I do not drain from the end of the string at all (which should be much faster), just from the front.
How can I make this more efficient? Is there some alternative to String, that uses a different memory layout, which would be better for this use case?
If you can't use slices because of the lifetimes, you could use a type that provides shared-ownership like SharedString from the shared-string crate or Str from the bytes-utils crate. The former looks more fully-featured but both provide methods that can take the prefix from a string in O(1) because the original data is never moved.
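To illustrate the shared-ownership idea (this is just a sketch of the underlying technique, not the actual API of either crate): keep the text in one reference-counted allocation and hand out (buffer, range) pairs, so taking a prefix only bumps an index.

use std::rc::Rc;

// A toy shared string slice: the underlying text is never moved,
// so taking a prefix is O(1).
#[derive(Clone)]
struct SharedStr {
    buf: Rc<str>,
    start: usize,
    end: usize,
}

impl SharedStr {
    fn new(s: &str) -> Self {
        SharedStr { buf: Rc::from(s), start: 0, end: s.len() }
    }

    // Split off the first `n` bytes (n must fall on a char boundary).
    fn take_prefix(&mut self, n: usize) -> SharedStr {
        let prefix = SharedStr {
            buf: Rc::clone(&self.buf),
            start: self.start,
            end: self.start + n,
        };
        self.start += n;
        prefix
    }

    fn as_str(&self) -> &str {
        &self.buf[self.start..self.end]
    }
}

fn main() {
    let mut s = SharedStr::new("hello world");
    let head = s.take_prefix(5);
    assert_eq!(head.as_str(), "hello");
    assert_eq!(s.as_str(), " world");
}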
As stated by @Jmb, keeping the original string intact and working with slices is certainly a big win.
I don't know, from the question, the context and usage of these strings, but this quick and dirty benchmark shows a substantial difference in performance.
This benchmark is flawed because there is a useless clone() at each repetition, there is no warm-up, there is no black-box for the result, there are no statistics... but it just gives an idea.
use std::time::Instant;

fn with_drain(mut my_str: String) -> usize {
    let mut total = 0;
    'work: loop {
        for &i in [1, 2, 3, 4, 5].iter().cycle() {
            if my_str.len() < i {
                break 'work;
            }
            let s = my_str.drain(0..i).collect::<String>();
            total += s.len();
        }
    }
    total
}

fn with_slice(my_str: String) -> usize {
    let mut total = 0;
    let mut pos = 0;
    'work: loop {
        for &i in [1, 2, 3, 4, 5].iter().cycle() {
            let next_pos = pos + i;
            if my_str.len() <= next_pos {
                break 'work;
            }
            let s = &my_str[pos..next_pos];
            pos = next_pos;
            total += s.len();
        }
    }
    total
}

fn main() {
    let my_str = "I have a long string stored in a variable in Rust.
I often remove some characters from its front with a drain method and use the value returned from it:
my_str.drain(0..i).collect::<String>();
The problem is, that draining from this string is done really often in the program and it's slowing it down a lot (it takes ~99.6% of runtime). This is a very expensive operation, since every time, the entire string has to be moved left in the memory.
I do not drain from the end of the string at all (which should be much faster), just from the front.
How can I make this more efficient? Is there some alternative to String, that uses a different memory layout, which would be better for this use case?
".to_owned();
    let repeat = 1_000_000;

    let instant = Instant::now();
    for _ in 0..repeat {
        let _ = with_drain(my_str.clone());
    }
    let drain_duration = instant.elapsed();

    let instant = Instant::now();
    for _ in 0..repeat {
        let _ = with_slice(my_str.clone());
    }
    let slice_duration = instant.elapsed();

    println!("{:?} {:?}", drain_duration, slice_duration);
}

/*
$ cargo run --release
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/prog`
5.017018957s 310.466253ms
*/
As proposed by @SUTerliakov, using VecDeque<char> in this case is much more effective than String, either with the pop_front method or the drain method (when draining from the front, of course).
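A minimal sketch of that idea (my own illustration, not @SUTerliakov's exact code):

use std::collections::VecDeque;

fn main() {
    let my_str = "a long string we keep chewing from the front".to_owned();
    // One O(n) conversion up front; afterwards removals at the front are O(1)
    // per char, because VecDeque is a ring buffer and never shifts its contents.
    let mut deque: VecDeque<char> = my_str.chars().collect();

    // Take the first 6 chars, like `my_str.drain(0..6)` would:
    let prefix: String = deque.drain(0..6).collect();
    assert_eq!(prefix, "a long");

    // Or pop chars one at a time:
    assert_eq!(deque.pop_front(), Some(' '));
}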
I can see a huge difference in performance between an iterator-based algorithm (slow) and a procedural algorithm (fast). I want to improve the speed of the iterator-based version, but I don't know how to do it.
Context
I need to convert strings like 12345678 to 12_345_678. Questions like this have already been asked, for example this question. (Note: I don't want to use a crate for this, because I have some custom needs.)
Edit: it is assumed that the input string is ASCII.
I wrote two versions of the algorithm:
an iterator-based version, which I'm not satisfied with because it has two .collect() calls, which means many allocations (at least one .collect() is extra, but I can't remove it; see the code below);
a procedural version of it.
Then I compared the execution time of the two versions: my iterator version is about ~8x slower compared to the procedural version.
Benchmark code
use itertools::Itertools; // Using itertools v0.10.0

/// Size of the string blocks to separate:
const BLOCK_SIZE: usize = 3;

/// Procedural version.
fn procedural(input: &str) -> String {
    let len = input.len();
    let nb_complete_blocks = (len as isize - 1).max(0) as usize / BLOCK_SIZE;
    let first_block_len = len - nb_complete_blocks * BLOCK_SIZE;
    let capacity = first_block_len + nb_complete_blocks * (BLOCK_SIZE + 1);
    let mut output = String::with_capacity(capacity);
    output.push_str(&input[..first_block_len]);
    for i in 0..nb_complete_blocks {
        output.push('_');
        let start = first_block_len + i * BLOCK_SIZE;
        output.push_str(&input[start..start + BLOCK_SIZE]);
    }
    output
}

/// Iterator version.
fn with_iterators(input: &str) -> String {
    input.chars()
        .rev()
        .chunks(BLOCK_SIZE)
        .into_iter()
        .map(|c| c.collect::<String>())
        .join("_")
        .chars()
        .rev()
        .collect::<String>()
}

fn main() {
    let input = "12345678";
    macro_rules! bench {
        ( $version:ident ) => {{
            let now = std::time::Instant::now();
            for _ in 0..1_000_000 {
                $version(input);
            }
            println!("{:.2?}", now.elapsed());
        }};
    }
    print!("Procedural benchmark: ");
    bench!(procedural);
    print!("Iterator benchmark: ");
    bench!(with_iterators);
}
Typical benchmark result
Procedural benchmark: 17.07ms
Iterator benchmark: 240.19ms
Question
How can I improve the iterator-based version in order to reach the performance of the procedural version?
The with_iterators version allocates a new String for every chunk in the input, whereas the procedural version just slices the input and appends to the output as necessary. Afterwards, these Strings are joined into a reversed version of the target String, which then has to be reversed into yet another String, including determining its char offsets and collecting them once more. This is a lot of additional work and would explain the massive slowdown.
You can do something very similar, which will even be more robust than the procedural version, using chars and for_each:
/// Iterator version.
fn with_iterators(input: &str) -> String {
    let n_chars = input.chars().count();
    let capacity = n_chars + n_chars / BLOCK_SIZE;
    let mut acc = String::with_capacity(capacity);
    input
        .chars()
        .enumerate()
        .for_each(|(idx, c)| {
            if idx != 0 && (n_chars - idx) % BLOCK_SIZE == 0 {
                acc.push('_');
            }
            acc.push(c);
        });
    acc
}
Just slicing into a &str is prone to panics if you can't rule out multi-byte encoded characters in the input. I.e., procedural assumes a 1-to-1 mapping between u8 and char, which is not a given.
chars returns each character, and with enumerate() you can track its offset into the str in terms of characters and determine when to push the '_'.
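To see the failure mode (my own example, not from the question): slicing at a byte offset that falls inside a multi-byte character panics, which str::get lets you check without panicking:

fn main() {
    let s = "1é2"; // 'é' is two bytes in UTF-8, so s.len() == 4
    // Byte index 2 falls in the middle of 'é', so &s[..2] would panic
    // with "byte index 2 is not a char boundary".
    assert!(s.get(..2).is_none());
    assert_eq!(s.get(..3), Some("1é")); // index 3 is a valid boundary
}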
The proposed version and the procedural version both run between 10-20ms on my machine.
This is also very slow. But here's a slightly different way for what it's worth.
use itertools::Itertools; // for chunks() and join()

fn iterators_todd(input: &str) -> String {
    let len = input.len();
    let n_first = len % BLOCK_SIZE;
    let it = input.chars();
    let mut out = it.clone().take(n_first).collect::<String>();
    if len > BLOCK_SIZE && n_first != 0 {
        out.push('_');
    }
    out.push_str(&it.skip(n_first)
        .chunks(BLOCK_SIZE)
        .into_iter()
        .map(|c| c.collect::<String>())
        .join("_"));
    out
}
I wouldn't blame the slowness entirely on .collect(). The chunking iterator may also be slowing it down. Generally, chaining together a number of iterators, then iteratively pulling data through the chain is just not going to be as efficient as approaches with fewer iterators chained together.
The code above is roughly an O(N) algorithm (but a slow one), at least judging from the visual appearance of the code, without digging into the implementations of the iterators.
I used the timeit crate to time the performance of each solution. This Gist has the code that produced the results below.
iterators_yolenoyer - yolenoyer's iterator approach.
procedural_yolenoyer - yolenoyer's "procedural" function.
procedural_yolenoyer_modified - the previous with some changes.
iterators_sebpuetz - sebpuetz' iterator example.
procedural_sebpuetz - the above using a for loop instead of .for_each().
iterators_todd - the example in this answer.
Output:
iterators_yolenoyer benchmark: 701.427605 ns
procedural_yolenoyer benchmark: 62.651766 ns
procedural_yolenoyer_modified benchmark: 59.283306 ns
iterators_sebpuetz benchmark: 94.315160 ns
procedural_sebpuetz benchmark: 121.447247 ns
iterators_todd benchmark: 606.468828 ns
It's interesting to note that using .for_each() is faster than a simple for loop (compare iterators_sebpuetz, which uses .for_each(), to the same algorithm using a for loop instead, procedural_sebpuetz, shown below). The for loop being slower may not be the case for all workloads, though.
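For reference, the for-loop variant compared above would look roughly like this (my reconstruction of procedural_sebpuetz; the actual code lives in the Gist):

fn procedural_sebpuetz(input: &str) -> String {
    let n_chars = input.chars().count();
    let capacity = n_chars + n_chars / BLOCK_SIZE;
    let mut acc = String::with_capacity(capacity);
    // Same logic as the .for_each() version, written as a plain for loop.
    for (idx, c) in input.chars().enumerate() {
        if idx != 0 && (n_chars - idx) % BLOCK_SIZE == 0 {
            acc.push('_');
        }
        acc.push(c);
    }
    acc
}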
I can't imagine this hasn't been asked before, but I have searched everywhere and could not find the answer.
I have an iterable which contains duplicate elements. I want to count the number of times each element occurs in this iterable and return the n-th most frequent one.
I have working code which does exactly that, but I really doubt it's the most optimal way to achieve this.
use std::collections::{BinaryHeap, HashMap};

// returns n-th most frequent element in collection
pub fn most_frequent<T: std::hash::Hash + std::cmp::Eq + std::cmp::Ord>(array: &[T], n: u32) -> &T {
    // initialize empty hashmap
    let mut map = HashMap::new();
    // count occurrence of each element in iterable and save as (value, count) in hashmap
    for value in array {
        // taken from https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.entry
        // not exactly sure how this works
        let counter = map.entry(value).or_insert(0);
        *counter += 1;
    }
    // determine highest frequency of some element in the collection
    let mut heap: BinaryHeap<_> = map.values().collect();
    let mut max = heap.pop().unwrap();
    // get n-th largest value
    for _i in 1..n {
        max = heap.pop().unwrap();
    }
    // find that element (get key from value in hashmap)
    // taken from https://stackoverflow.com/questions/59401720/how-do-i-find-the-key-for-a-value-in-a-hashmap
    map.iter()
        .find_map(|(key, &val)| if val == *max { Some(key) } else { None })
        .unwrap()
}
Are there any better ways or more optimal std methods to achieve what I want? Or maybe there are some community-made crates that I could use?
Your implementation has a time complexity of Ω(n log n), where n is the length of the array. The optimal solution to this problem has a complexity of Ω(n log k) for retrieving the k-th most frequent element. The usual implementation of this optimal solution indeed involves a binary heap, but not in the way you used it.
Here's a suggested implementation of the common algorithm:
use std::cmp::{Eq, Ord, Reverse};
use std::collections::{BinaryHeap, HashMap};
use std::hash::Hash;

pub fn most_frequent<T>(array: &[T], k: usize) -> Vec<(usize, &T)>
where
    T: Hash + Eq + Ord,
{
    let mut map = HashMap::new();
    for x in array {
        *map.entry(x).or_default() += 1;
    }

    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (x, count) in map.into_iter() {
        heap.push(Reverse((count, x)));
        if heap.len() > k {
            heap.pop();
        }
    }
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}
(Playground)
I changed the prototype of the function to return a vector of the k most frequent elements together with their counts, since this is what you need to keep track of anyway. If you only want the k-th most frequent element itself, you can take result[k - 1].1.
The algorithm itself first builds a map of element counts the same way your code does – I just wrote it in a more concise form.
Next, we build a BinaryHeap for the most frequent elements. After each iteration, this heap contains at most k elements, which are the most frequent ones seen so far. If there are more than k elements in the heap, we drop the least frequent one. Since we always drop the least frequent element seen so far, the heap always retains the k most frequent elements seen so far. We need to use the Reverse wrapper to get a min-heap, as described in the BinaryHeap documentation.
Finally, we collect the results into a vector. The into_sorted_vec() function basically does this job for us, but we still want to unwrap the items from the Reverse wrapper – that wrapper is an implementation detail of our function and should not be returned to the caller.
(In Rust Nightly, we could also use the into_iter_sorted() method, saving one vector allocation.)
The code in this answer makes sure the heap is essentially limited to k elements, so an insertion to the heap has a complexity of Ω(log k). In your code, you push all elements from the array to the heap at once, without limiting the size of the heap, so you end up with a complexity of Ω(log n) for insertions. You essentially use the binary heap to sort a list of counts. Which works, but it's certainly neither the easiest nor the fastest way to achieve that, so there is little justification for going that route.
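A quick usage example of the rewritten function (my own, just to show the shape of the result):

fn main() {
    let data = ['a', 'b', 'a', 'c', 'b', 'a'];
    let top2 = most_frequent(&data, 2);
    // Sorted by descending count: 'a' occurs 3 times, 'b' twice.
    assert_eq!(top2, vec![(3, &'a'), (2, &'b')]);
    // The 2nd most frequent element itself:
    assert_eq!(*top2[1].1, 'b');
}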
As I was reading the Rust documentation, I stumbled upon this code that iterates over the array a using a while loop (with an index):
fn main() {
    let a = [10, 20, 30, 40, 50];
    let mut index = 0;

    while index < 5 {
        println!("the value is: {}", a[index]);
        index += 1;
    }
}
The documentation says:
... this approach is error prone; we could cause the program to panic if the index length is incorrect. It’s also slow, because the compiler adds runtime code to perform the conditional check on every element on every iteration through the loop.
The first reason was self-explanatory. The second reason was where I got confused.
Furthermore, they suggested to use a for-loop for this.
fn main() {
    let a = [10, 20, 30, 40, 50];

    for element in a.iter() {
        println!("the value is: {}", element);
    }
}
I just can't seem to wrap my head around this. Is there some kind of behavior that the Rust compiler does?
The two parts are complementary:
we could cause the program to panic if the index length is incorrect.
Every time you write some_slice[some_index], the standard library does the equivalent of:
if some_index < some_slice.len() {
    some_slice.get_the_value_without_checks(some_index)
} else {
    panic!("Hey, stop that");
}
the compiler adds runtime code to perform the conditional check on every element
In a loop, this works out to be something like:
while some_index < limit {
    if some_index < some_slice.len() {
        some_slice.get_the_value_without_checks(some_index)
    } else {
        panic!("Hey, stop that");
    }
    some_index += 1;
}
Those repeated conditionals aren't the most efficient code.
The implementations of Iterator for slices utilize unsafe code to be more efficient at the expense of more complicated code. The iterators contain raw pointers to the data but ensure that you can never misuse them to cause memory unsafety. Without needing to perform that conditional at each step, the iterator solution is often faster [1]. It's more-or-less equivalent to:
while some_index < limit {
    some_slice.get_the_value_without_checks(some_index)
    some_index += 1;
}
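As a concrete illustration (my own example, not from the documentation), here are indexed and iterator versions of a simple sum; the indexed version relies on the optimizer to elide the per-element bounds check, while the iterator version never emits one:

fn sum_indexed(a: &[i32]) -> i32 {
    let mut total = 0;
    let mut i = 0;
    while i < a.len() {
        total += a[i]; // bounds check here, unless the optimizer elides it
        i += 1;
    }
    total
}

fn sum_iter(a: &[i32]) -> i32 {
    // The slice iterator advances a pointer internally; no per-element check.
    a.iter().sum()
}

fn main() {
    let a = [10, 20, 30, 40, 50];
    assert_eq!(sum_indexed(&a), sum_iter(&a));
}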
See also:
Does Rust's array bounds checking affect performance?
Does Rust optimize for loops over calculated ranges?
[1] — as Matthieu M. points out:
It should be noted that the optimizer may (or may not) be able to remove the bounds check in the while case. If it succeeds, then performance is equivalent; if it fails, suddenly your code is slower. In microbenchmarks, with simple code, chances are it will succeed... but this may not carry over to your production code, or it may carry over now and the next change in the loop body will prevent the optimization, etc. In short, a while loop can be a ticking performance bomb.
See also:
Why does my code run slower when I remove bounds checks?
I am making a program that brute forces a password by parallelization. At the moment the password to crack is already available as plain text, I'm just attempting to brute force it anyway.
I have a function called generate_char_array() which, based on an integer seed, does a base conversion and returns a u8 slice of characters to try and check. This goes through the alphabet first for 1-character strings, then 2, and so on.
let found_string_index = (0..1e12 as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let bytes = generate_char_array(*i, &mut array);
    return &password_bytes == &bytes;
});
With the found string index (or seed integer rather), I can generate the found string.
The problem is that Rayon parallelizes this by splitting the arbitrarily large integer range into thread_count slices (e.g. for 4 threads: 0..2.5e11, 2.5e11..5e11, and so on). This is not good, because the end of the range corresponds to arbitrarily long passwords (10+ characters, say), whereas most passwords (including the fixed "zzzzz" I tend to try) are much shorter. As a result, the first thread does all the work while the other threads waste time testing passwords that are far too long and synchronizing, so the whole thing actually ends up slower than single-threaded performance.
How could I instead split the arbitrary big range (doesn't have to have an end actually) into chunks of ranges and have each thread find within chunks? That would make the workers in different threads actually useful.
This goes through the alphabet first for 1 character strings, then 2
You wish to impose some sequencing on your data processing, but the whole point of Rayon is to go in parallel.
Instead, use regular iterators to sequentially go up in length and then use parallel iterators inside a specific length to quickly process all of the values of that length.
Since you haven't provided enough code for a runnable example, I've made this rough approximation to show the general shape of such a solution:
extern crate rayon;

use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::ops::RangeInclusive;

type Seed = u8;

const LENGTHS: RangeInclusive<usize> = 1..=3;
const SEEDS: RangeInclusive<Seed> = 0..=std::u8::MAX;

fn find<F>(test_password: F) -> Option<(usize, Seed)>
where
    F: Fn(usize, Seed) -> bool + Sync,
{
    // Rayon doesn't support RangeInclusive yet
    let seeds: Vec<_> = SEEDS.collect();

    // Step 1-by-1 through the lengths, sequentially
    LENGTHS.flat_map(|length| {
        // In parallel, investigate every value in this length
        // This doesn't do that, but it shows how the parallelization
        // would be introduced
        seeds
            .par_iter()
            .find_any(|&&seed| test_password(length, seed))
            .map(|&seed| (length, seed))
    }).next()
}

fn main() {
    let pass = find(|l, s| {
        println!("{}, {}", l, s);
        // Actually generate and check the password based on the search criteria
        l == 3 && s == 250
    });

    println!("Found password length and seed: {:?}", pass);
}
This can "waste" a little time at the end of each length as the parallel threads spin down one-by-one before spinning back up for the next length, but that seems unlikely to be a primary concern.
This is a version of what I suggested in my comment.
The main loop is parallel and is only over the first byte of each attempt. For each first byte, do the full brute force search for the remainder.
let matched_bytes = (0 ..= 0xFFu8).into_par_iter().filter_map(|n| {
    let mut array = [0u8; 8];
    // The first byte is always the same within this run
    array[0] = n;
    // The highest byte of the seed stays 0 because the first byte
    // is provided by the outer loop
    (0 ..= 0x0FFF_FFFF_FFFF_FFFF as u64).into_iter().filter_map(|i| {
        // Pass a slice so that the first byte is not affected
        generate_char_array(i, &mut array[1 .. 8]);
        if &password_bytes[..] == &array[0 .. password_bytes.len()] {
            Some(array.clone())
        } else {
            None
        }
    }).next()
}).find_any(|_| true);

println!("found = {:?}", matched_bytes);
Also, even for a brute force method, this is probably still highly inefficient!
If Rayon splits the slices as you described, then apply simple math to balance the password lengths:
let found_string_index = (0..max_val as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    // Remap the index so that consecutive seeds are spread across the chunks
    let v = i / span + (i % span) * num_cpu;
    let bytes = generate_char_array(v, &mut array);
    return &password_bytes == &bytes;
});
The span value depends on the number of CPUs (the number of threads used by Rayon), in your case:
let num_cpu = 4;
let span = 2.5e11 as u64;
let max_val = span * num_cpu;
Note the performance of this approach is highly dependent on how Rayon performs the split of the sequence on parallel threads. Verify that it works as you reported in the question.
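To see how the remapping interleaves the work, here is the formula on a toy scale (my own illustration, with num_cpu = 4 and span = 4, so 16 indices):

fn main() {
    let num_cpu: u64 = 4;
    let span: u64 = 4;
    for i in 0..span * num_cpu {
        let v = i / span + (i % span) * num_cpu;
        print!("{} ", v);
    }
    println!();
    // Prints: 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
    // Chunk 0 (i = 0..4) now tests seeds 0, 4, 8, 12; chunk 1 tests 1, 5, 9, 13;
    // and so on, so every thread starts with short passwords.
}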