I have a long string stored in a variable in Rust. I often remove some characters from its front with the drain method and use the value it returns:
my_str.drain(0..i).collect::<String>();
The problem is that draining from this string happens very often in the program and it's slowing it down a lot (it takes ~99.6% of the runtime). This is a very expensive operation, since every time, the entire string has to be shifted left in memory.
I do not drain from the end of the string at all (which should be much faster), just from the front.
How can I make this more efficient? Is there some alternative to String that uses a different memory layout, which would be better for this use case?
If you can't use slices because of the lifetimes, you could use a type that provides shared ownership, like SharedString from the shared-string crate or Str from the bytes-utils crate. The former looks more fully-featured, but both provide methods that take a prefix from a string in O(1), because the original data is never moved.
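The underlying idea of those crates can be sketched with a tiny hand-rolled type: share ownership of the original buffer and advance an offset, so taking a prefix never moves the remaining bytes. Note that `SharedStr` below is a made-up name for illustration, not the actual API of either crate, and the sketch assumes byte-aligned (e.g. ASCII) split points.

```rust
use std::rc::Rc;

// Hypothetical type sketching the shared-ownership idea; not the crates' API.
struct SharedStr {
    buf: Rc<str>, // the original data, never moved
    start: usize, // current view into the buffer
    end: usize,
}

impl SharedStr {
    fn new(s: &str) -> Self {
        SharedStr { buf: Rc::from(s), start: 0, end: s.len() }
    }

    /// Split off the first `n` bytes in O(1): both halves point into the
    /// same shared buffer, so nothing is copied or shifted.
    fn take_prefix(&mut self, n: usize) -> SharedStr {
        let prefix = SharedStr {
            buf: Rc::clone(&self.buf),
            start: self.start,
            end: self.start + n,
        };
        self.start += n;
        prefix
    }

    fn as_str(&self) -> &str {
        &self.buf[self.start..self.end]
    }
}

fn main() {
    let mut s = SharedStr::new("hello world");
    let prefix = s.take_prefix(6);
    assert_eq!(prefix.as_str(), "hello ");
    assert_eq!(s.as_str(), "world");
}
```

The real crates handle the bookkeeping (and multi-byte boundaries) for you; this only shows why the operation can be O(1).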
As stated by @Jmb, keeping the original string intact and working with slices is certainly a big win.
I don't know, from the question, the context and usage of these strings, but this quick and dirty benchmark shows a substantial difference in performance.
This benchmark is flawed: there is a useless clone() at each repetition, there is no warm-up, no black-box for the result, no statistics... but it gives an idea.
use std::time::Instant;

fn with_drain(mut my_str: String) -> usize {
    let mut total = 0;
    'work: loop {
        for &i in [1, 2, 3, 4, 5].iter().cycle() {
            if my_str.len() < i {
                break 'work;
            }
            let s = my_str.drain(0..i).collect::<String>();
            total += s.len();
        }
    }
    total
}

fn with_slice(my_str: String) -> usize {
    let mut total = 0;
    let mut pos = 0;
    'work: loop {
        for &i in [1, 2, 3, 4, 5].iter().cycle() {
            let next_pos = pos + i;
            if my_str.len() <= next_pos {
                break 'work;
            }
            let s = &my_str[pos..next_pos];
            pos = next_pos;
            total += s.len();
        }
    }
    total
}
fn main() {
    let my_str = "I have a long string stored in a variable in Rust.
I often remove some characters from its front with a drain method and use the value returned from it:
my_str.drain(0..i).collect::<String>();
The problem is, that draining from this string is done really often in the program and it's slowing it down a lot (it takes ~99.6% of runtime). This is a very expensive operation, since every time, the entire string has to be moved left in the memory.
I do not drain from the end of the string at all (which should be much faster), just from the front.
How can I make this more efficient? Is there some alternative to String, that uses a different memory layout, which would be better for this use case?
".to_owned();
    let repeat = 1_000_000;

    let instant = Instant::now();
    for _ in 0..repeat {
        let _ = with_drain(my_str.clone());
    }
    let drain_duration = instant.elapsed();

    let instant = Instant::now();
    for _ in 0..repeat {
        let _ = with_slice(my_str.clone());
    }
    let slice_duration = instant.elapsed();

    println!("{:?} {:?}", drain_duration, slice_duration);
}
/*
$ cargo run --release
Finished release [optimized] target(s) in 0.00s
Running `target/release/prog`
5.017018957s 310.466253ms
*/
As proposed by @SUTerliakov, using VecDeque<char> in this case is much more effective than String, either with the pop_front method or the drain method (when draining from the front, of course).
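For illustration, a minimal sketch of that approach (note that VecDeque<char> gives up String's contiguous UTF-8 layout, so it fits best when you work character by character anyway):

```rust
use std::collections::VecDeque;

fn main() {
    // A ring buffer keeps both ends cheap to modify: removing from the
    // front only advances the head index, the remaining data never moves.
    let mut chars: VecDeque<char> = "hello world".chars().collect();

    // Take the first 6 characters; the tail is not shifted in memory.
    let prefix: String = chars.drain(0..6).collect();
    assert_eq!(prefix, "hello ");
    assert_eq!(chars.iter().collect::<String>(), "world");

    // pop_front() removes a single character at a time.
    assert_eq!(chars.pop_front(), Some('w'));
}
```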
I'm doing some coding practice in Rust.
I found a somewhat odd test result. Maybe I misunderstood something.
Test result:
Time elapsed in function bruteforce is: 2.387µs
Time elapsed in function hashset is: 24.413µs // Why does it take relatively long?
Time elapsed in function rusty is: 1.13µs
test arrays::contains_common_item::test_time_measure ... ok
use std::collections::HashSet;

// O(N^2) time, O(1) space
fn contains_common_item_bruteforce(arr1: Vec<char>, arr2: Vec<char>) -> bool {
    for arr1_item in arr1.iter() {
        for arr2_item in arr2.iter() {
            if arr1_item == arr2_item {
                return true;
            }
        }
    }
    false
}

// O(2N) time, O(N) space
fn contains_common_item_hashset(arr1: Vec<char>, arr2: Vec<char>) -> bool {
    let mut contained_items_map = HashSet::new();
    // iter() iterates over the items by reference
    // iter_mut() iterates over the items, giving a mutable reference to each item
    // into_iter() iterates over the items, moving them into the new scope
    for item in arr1.iter() {
        contained_items_map.insert(*item);
    }
    for item in arr2.iter() {
        if contained_items_map.contains(item) {
            return true;
        }
    }
    false
}

fn contains_common_item_rusty(arr1: Vec<char>, arr2: Vec<char>) -> bool {
    arr1.iter().any(|x| arr2.contains(x)) // contains is O(n)
}
#[test]
fn test_time_measure() {
    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];
    let start = std::time::Instant::now();
    contains_common_item_bruteforce(arr1, arr2);
    let duration: std::time::Duration = start.elapsed();
    eprintln!("Time elapsed in function bruteforce is: {:?}", duration);

    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];
    let start = std::time::Instant::now();
    contains_common_item_hashset(arr1, arr2);
    let duration: std::time::Duration = start.elapsed();
    eprintln!("Time elapsed in function hashset is: {:?}", duration);

    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];
    let start = std::time::Instant::now();
    contains_common_item_rusty(arr1, arr2);
    let duration: std::time::Duration = start.elapsed();
    eprintln!("Time elapsed in function rusty is: {:?}", duration);
}
There are a few things wrong with both your testing methodology and your expectations.
First of all: optimizations are variable, CPUs are variable, caches are variable... Running a single pass, in a combined test, with fixed values does not account for these variables. You should use a proper benchmarking framework if you want practical results. Look into criterion.
Also, either your computer is quite slow, or you're testing in debug mode. The Rust Playground gives 210ns, 4.25µs, and 170ns respectively. Benchmarking in debug mode is fairly useless, since the performance won't reflect how the code behaves in a release build.
Second, HashSet promises O(1) access time, but there's no such thing as a free lunch. For one thing, it is a variable-sized collection that must be built before you can even use it. A similar crude test shows that this step alone is 4x as costly as the other two functions in their entirety. This includes allocation time, hashing time, and any other record-keeping that the HashSet does.
You may have been surprised that the Big-O complexity suggests the HashSet should perform better, but you're probably missing that Big-O notation only gives an upper bound on the work being done and expresses how the work scales as n grows larger. Here you have fixed sets of 4-5 items, so the time is dominated by the fixed costs that Big-O notation leaves out (like the HashSet creation). Big-O notation also leaves out how much work each step actually takes: it takes much more work to compute a hash, look up the bucket, potentially handle collisions, and check whether an item exists than it takes to compare two chars. The n in the O(n^2) of the bruteforce method and the O(n) of the hashset method are not directly comparable.
In summary, if your usage means doing intersection checks on small datasets, then there's a pretty good chance that bruteforce will be faster. But you should use realistic data in a proper benchmark to verify.
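To see those fixed costs in isolation, a crude sketch (a single timing, not a proper criterion benchmark) can separate building the HashSet from querying it:

```rust
use std::collections::HashSet;
use std::time::Instant;

fn main() {
    let arr1 = vec!['a', 'b', 'c', 'x'];
    let arr2 = vec!['z', 'y', 'i'];

    // The setup cost that Big-O leaves out: allocating and filling the set.
    let start = Instant::now();
    let set: HashSet<char> = arr1.iter().copied().collect();
    let build = start.elapsed();

    // The part the O(1)-per-lookup claim actually refers to.
    let start = Instant::now();
    let found = arr2.iter().any(|c| set.contains(c));
    let query = start.elapsed();

    println!("build: {:?}, query: {:?}", build, query);
    assert!(!found); // the two arrays share no item
    assert_eq!(set.len(), 4);
}
```

On inputs this small, the build step typically dwarfs the queries, which is consistent with the 4x figure mentioned above.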
I'm trying to solve my first ever Project Euler problem, just to have fun with Rust, and got stuck on what seems to be an extremely long compute time.
Problem:
https://projecteuler.net/problem=757
I came up with this code to try to solve it; with it I'm able to solve the base problem (up to 10^6) in ~245 ms and get the expected result of 2,851.
use std::time::Instant;

fn factor(num: u64) -> Vec<u64> {
    let mut counter = 1;
    let mut factors = Vec::with_capacity(((num as f64).log(10.0) * 100.0) as _);
    while counter <= (num as f64).sqrt() as _ {
        let div = num / counter;
        let rem = num % counter;
        if rem == 0 {
            factors.push(counter);
            factors.push(div);
        }
        counter += 1
    }
    factors.shrink_to_fit();
    factors
}
fn main() {
    let now = Instant::now();
    let max = 10u64.pow(6);
    let mut counter = 0;

    'a: for i in 1..max {
        // Optimization: all numbers in the pattern appear to be evenly divisible by 4
        let div4 = i / 4;
        let mod4 = i % 4;
        if mod4 != 0 { continue }
        // Optimization: and the remainder of that divided by 3 is always 0 or 1
        if div4 % 3 > 1 { continue }

        let mut factors = factor(i);
        if factors.len() >= 4 {
            // Optimization: the later-found factors seem to be the most likely
            // to fit the pattern, so try them first
            factors.reverse();
            let pairs: Vec<_> = factors.chunks(2).collect();
            for paira in pairs.iter() {
                for pairb in pairs.iter() {
                    if pairb[0] + pairb[1] == paira[0] + paira[1] + 1 {
                        counter += 1;
                        continue 'a;
                    }
                }
            }
        }
    }

    println!("{}, {} ms", counter, now.elapsed().as_millis());
}
It looks like my code spends most of its time factoring, and in my search for a more efficient factoring algorithm than what I came up with on my own, I couldn't find any Rust code already made (the code I did find was actually slower). But I ran a simulation to estimate how long it would take even with a perfect factoring algorithm, and the non-factoring portions of this code alone would take 13 days to check all numbers up to 10^14. Probably not what the creator of this problem intends.
Given that I'm relatively new to programming, is there some concept or programming method I'm not aware of (like, say, using a hashmap to do fast lookups) that can be used in this situation? Or is the solution going to involve spotting patterns in the numbers and making optimizations like the ones I have found so far?
If Vec::push is called when the vector is at its capacity, it re-allocates its internal buffer at double the size and copies all its elements to the new allocation.
Vec::new() creates a vector with no space allocated, so it will keep doing this re-allocation as it grows.
You can use Vec::with_capacity((num / 2) as usize) to avoid this and just allocate the maximum you might need up front.
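A small sketch of that growth behaviour (the exact capacity sequence is an implementation detail of the standard library, so only the pattern is asserted here):

```rust
// Count how many times the Vec's buffer moves to a bigger allocation
// while pushing `n` elements, by watching capacity() change.
fn count_reallocations(n: usize) -> usize {
    let mut v = Vec::new();
    let mut reallocations = 0;
    let mut last_cap = v.capacity();
    for i in 0..n {
        v.push(i as u64);
        if v.capacity() != last_cap {
            reallocations += 1;
            last_cap = v.capacity();
        }
    }
    reallocations
}

fn main() {
    // Vec::new() reallocates (and copies) several times on its way to 1000 elements...
    let grown = count_reallocations(1000);
    assert!(grown > 1);
    println!("Vec::new() reallocated {} times", grown);

    // ...while a pre-sized vector keeps the same buffer throughout.
    let mut v: Vec<u64> = Vec::with_capacity(1000);
    let cap_before = v.capacity();
    for i in 0..1000 {
        v.push(i);
    }
    assert_eq!(v.capacity(), cap_before);
}
```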
I can see a huge difference in performance between an iterator-based algorithm (slow) and a procedural algorithm (fast). I want to improve the speed of the iterator-based algorithm, but I don't know how.
Context
I need to convert strings like 12345678 to 12_345_678. Questions like this have already been asked, as in this question. (Note: I don't want to use a crate for this, because I have some custom needs.)
Edit: it is assumed that the input string is ASCII.
I wrote two versions of the algorithm:
an iterator-based version, which I'm not satisfied with because there are two .collect() calls, which means many allocations (at least one .collect() is extra, but I can't see how to remove it; see the code below);
a procedural version.
Then I compared the execution time of the two versions: my iterator version is about ~8x slower than the procedural version.
Benchmark code
use itertools::Itertools; // Using itertools v0.10.0

/// Size of the string blocks to separate:
const BLOCK_SIZE: usize = 3;

/// Procedural version.
fn procedural(input: &str) -> String {
    let len = input.len();
    let nb_complete_blocks = (len as isize - 1).max(0) as usize / BLOCK_SIZE;
    let first_block_len = len - nb_complete_blocks * BLOCK_SIZE;
    let capacity = first_block_len + nb_complete_blocks * (BLOCK_SIZE + 1);
    let mut output = String::with_capacity(capacity);
    output.push_str(&input[..first_block_len]);
    for i in 0..nb_complete_blocks {
        output.push('_');
        let start = first_block_len + i * BLOCK_SIZE;
        output.push_str(&input[start..start + BLOCK_SIZE]);
    }
    output
}

/// Iterator version.
fn with_iterators(input: &str) -> String {
    input
        .chars()
        .rev()
        .chunks(BLOCK_SIZE)
        .into_iter()
        .map(|c| c.collect::<String>())
        .join("_")
        .chars()
        .rev()
        .collect::<String>()
}
fn main() {
    let input = "12345678";

    macro_rules! bench {
        ( $version:ident ) => {{
            let now = std::time::Instant::now();
            for _ in 0..1_000_000 {
                $version(input);
            }
            println!("{:.2?}", now.elapsed());
        }};
    }

    print!("Procedural benchmark: ");
    bench!(procedural);
    print!("Iterator benchmark: ");
    bench!(with_iterators);
}
Typical benchmark result
Procedural benchmark: 17.07ms
Iterator benchmark: 240.19ms
Question
How can I improve the iterator-based version in order to reach the performance of the procedural version?
The with_iterators version allocates a new String for every chunk in the input, whereas the procedural version just slices the input and appends to the output as necessary. Afterwards, these Strings are joined into a reversed version of the target String, which then has to be reversed into yet another String, including determining its char offsets and collecting into another String. This is a lot of additional work and explains the massive slowdown.
You can do something very similar, and even more robust than the procedural version, using chars and for_each:
/// Iterator version.
fn with_iterators(input: &str) -> String {
    let n_chars = input.chars().count();
    let capacity = n_chars + n_chars / BLOCK_SIZE;
    let mut acc = String::with_capacity(capacity);
    input.chars().enumerate().for_each(|(idx, c)| {
        if idx != 0 && (n_chars - idx) % BLOCK_SIZE == 0 {
            acc.push('_');
        }
        acc.push(c);
    });
    acc
}
Just slicing into a &str is prone to panics if you can't rule out that multi-byte characters are part of the input, i.e. procedural assumes a 1-to-1 mapping between u8 and char, which is not a given.
chars returns each character; with enumerate() you can track its offset into the str in terms of characters and determine when to push the '_'.
The proposed version and the procedural version both run in 10-20ms on my machine.
This is also very slow, but here's a slightly different way, for what it's worth.
fn iterators_todd(input: &str) -> String {
    let len = input.len();
    let n_first = len % BLOCK_SIZE;
    let it = input.chars();
    let mut out = it.clone().take(n_first).collect::<String>();
    if len > BLOCK_SIZE && n_first != 0 {
        out.push('_');
    }
    out.push_str(
        &it.skip(n_first)
            .chunks(BLOCK_SIZE)
            .into_iter()
            .map(|c| c.collect::<String>())
            .join("_"),
    );
    out
}
I wouldn't blame the slowness entirely on .collect(). The chunking iterator may also be slowing it down. Generally, chaining together a number of iterators and then iteratively pulling data through the chain is just not going to be as efficient as approaches with fewer iterators chained together.
The code above is roughly an O(N) algorithm (but a slow one), at least judging from the structure of the code without digging into the implementations of the iterators.
I used the timeit crate to time the performance of each solution. This Gist has the code that produced the results below.
iterators_yolenoyer - yolenoyer's iterator approach.
procedural_yolenoyer - yolenoyer's "procedural" function.
procedural_yolenoyer_modified - the previous with some changes.
iterators_sebpuetz - sebpuetz' iterator example.
procedural_sebpuetz - the above using a for loop instead of .for_each().
iterators_todd - the example in this answer.
Output:
iterators_yolenoyer benchmark: 701.427605 ns
procedural_yolenoyer benchmark: 62.651766 ns
procedural_yolenoyer_modified benchmark: 59.283306 ns
iterators_sebpuetz benchmark: 94.315160 ns
procedural_sebpuetz benchmark: 121.447247 ns
iterators_todd benchmark: 606.468828 ns
It's interesting to note that using .for_each() is faster than a simple for loop (compare iterators_sebpuetz, which uses .for_each(), to the same algorithm using a for loop instead, procedural_sebpuetz). The for loop being slower may not generally be the case, though.
My plan is to write a simple method which does exactly what std::cin >> from the C++ standard library does:
use std::io::BufRead;

pub fn input<T: std::str::FromStr>(handle: &std::io::Stdin) -> Result<T, T::Err> {
    let mut x = String::new();
    let mut guard = handle.lock();
    loop {
        let mut trimmed = false;
        let available = guard.fill_buf().unwrap();
        let l = match available.iter().position(|&b| !(b as char).is_whitespace()) {
            Some(i) => {
                trimmed = true;
                i
            }
            None => available.len(),
        };
        guard.consume(l);
        if trimmed {
            break;
        }
    }
    let available = guard.fill_buf().unwrap();
    let l = match available.iter().position(|&b| (b as char).is_whitespace()) {
        Some(i) => i,
        None => available.len(),
    };
    x.push_str(std::str::from_utf8(&available[..l]).unwrap());
    guard.consume(l);
    T::from_str(&x)
}
The loop is meant to trim away all the whitespace before valid input begins. The match block outside the loop is where the length of the valid input (that is, before trailing whitespace begins or EOF is reached) is calculated.
Here is an example using the above method.
let handle = std::io::stdin();
let x: i32 = input(&handle).unwrap();
println!("x: {}", x);
let y: String = input(&handle).unwrap();
println!("y: {}", y);
When I tried a few simple tests, the method worked as intended. However, when I use this with online judges like Codeforces, I get complaints that the program sometimes stays idle, or that the wrong input was read, among other issues, which leads me to suspect that I missed a corner case or something like that. This usually happens when the input is a few hundred lines long.
What input is going to break the method? What is the correction?
After a lot of experimentation, I noticed a lag when reading each input, which added up as the number of inputs increased. The function doesn't keep its own buffer: it goes back to the stream every time it needs to fill a variable, which is slow, hence the lag.
Lesson learnt: always use a buffer with a good capacity.
However, the idleness issue still persisted, until I replaced the fill_buf/consume pairs with something like read_line or read_to_string.
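As a sketch of that fix (Tokens is a made-up name here, and this is just one possible shape): read whole lines through a BufReader and hand out whitespace-separated tokens, so the underlying stream is not touched for every single value.

```rust
use std::io::{BufRead, BufReader, Read};

// Hypothetical helper: buffers one line at a time and parses tokens from it.
struct Tokens<R: Read> {
    reader: BufReader<R>,
    buf: Vec<String>, // tokens of the current line, stored reversed
}

impl<R: Read> Tokens<R> {
    fn new(inner: R) -> Self {
        Tokens { reader: BufReader::new(inner), buf: Vec::new() }
    }

    /// Next whitespace-separated token, parsed as T.
    /// Returns None at EOF (or, in this simple sketch, on a parse failure).
    fn next<T: std::str::FromStr>(&mut self) -> Option<T> {
        loop {
            // Serve tokens from the current line first.
            if let Some(tok) = self.buf.pop() {
                return tok.parse().ok();
            }
            let mut line = String::new();
            if self.reader.read_line(&mut line).ok()? == 0 {
                return None; // EOF
            }
            // Reverse so pop() yields tokens in their original order.
            self.buf = line.split_whitespace().rev().map(str::to_owned).collect();
        }
    }
}

fn main() {
    // Any Read source works; a byte slice stands in for stdin here.
    let mut input = Tokens::new(&b"  42 hello\n7\n"[..]);
    let x: i32 = input.next().unwrap();
    let s: String = input.next().unwrap();
    let y: i32 = input.next().unwrap();
    assert_eq!((x, s.as_str(), y), (42, "hello", 7));
}
```

For stdin, you would construct it as `Tokens::new(std::io::stdin())`, locking the handle once if throughput matters.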
I am making a program that brute-forces a password with parallelization. At the moment the password to crack is already available as plain text; I'm just attempting to brute-force it anyway.
I have a function called generate_char_array() which, based on an integer seed, does a base conversion and returns a u8 slice of characters to check. This goes through the alphabet first for 1-character strings, then 2, etc.
let found_string_index = (0..1e12 as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let bytes = generate_char_array(*i, &mut array);
    &password_bytes == &bytes
});
With the found string index (or rather, seed integer), I can generate the found string.
The problem is that Rayon parallelizes this by splitting the arbitrarily large integer range into thread_count slices (e.g. for 4 threads, 0..2.5e11, 2.5e11..5e11, etc.). This is not good, because the end of the range corresponds to arbitrarily long passwords (10+ characters, I don't know exactly), whereas most passwords (including the fixed "zzzzz" I tend to try) are much shorter. Thus the first thread does all the work, and the rest of the threads just waste time testing passwords that are way too long, and synchronizing; the result is actually slower than single-threaded performance.
How could I instead split the arbitrarily big range (it doesn't actually have to have an end) into chunks of ranges and have each thread search within those chunks? That would make the workers in different threads actually useful.
This goes through the alphabet first for 1 character strings, then 2
You wish to impose some sequencing on your data processing, but the whole point of Rayon is to go in parallel.
Instead, use regular iterators to sequentially go up in length and then use parallel iterators inside a specific length to quickly process all of the values of that length.
Since you haven't provided enough code for a runnable example, I've made this rough approximation to show the general shape of such a solution:
extern crate rayon;

use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::ops::RangeInclusive;

type Seed = u8;

const LENGTHS: RangeInclusive<usize> = 1..=3;
const SEEDS: RangeInclusive<Seed> = 0..=std::u8::MAX;

fn find<F>(test_password: F) -> Option<(usize, Seed)>
where
    F: Fn(usize, Seed) -> bool + Sync,
{
    // Rayon doesn't support RangeInclusive yet
    let seeds: Vec<_> = SEEDS.collect();

    // Step 1-by-1 through the lengths, sequentially
    LENGTHS
        .flat_map(|length| {
            // In parallel, investigate every value in this length
            // This doesn't do that, but it shows how the parallelization
            // would be introduced
            seeds
                .par_iter()
                .find_any(|&&seed| test_password(length, seed))
                .map(|&seed| (length, seed))
        })
        .next()
}

fn main() {
    let pass = find(|l, s| {
        println!("{}, {}", l, s);
        // Actually generate and check the password based on the search criteria
        l == 3 && s == 250
    });

    println!("Found password length and seed: {:?}", pass);
}
This can "waste" a little time at the end of each length as the parallel threads spin down one-by-one before spinning back up for the next length, but that seems unlikely to be a primary concern.
This is a version of what I suggested in my comment.
The main loop is parallel and iterates only over the first byte of each attempt. For each first byte, a full brute-force search is done over the remainder.
let matched_bytes = (0 .. 0xFFu8).into_par_iter().filter_map(|n| {
    let mut array = [0u8; 8];
    // the first byte is always the same in this run
    array[0] = n;
    // The highest byte is 0 because it's provided by the outer loop
    (0 ..= 0x0FFFFFFFFFFFFFFF as u64).into_iter().filter_map(|i| {
        // pass a slice so that the first byte is not affected
        generate_char_array(i, &mut array[1 .. 8]);
        if &password_bytes[..] == &array[0 .. password_bytes.len()] {
            Some(array.clone())
        } else {
            None
        }
    }).next()
}).find_any(|_| true);

println!("found = {:?}", matched_bytes);
Also, even for a brute force method, this is probably highly inefficient still!
If Rayon splits the slices as you described, then you can apply simple math to rebalance the password lengths:

let found_string_index = (0..max_val as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let v = i / span + (i % span) * num_cpu;
    let bytes = generate_char_array(v, &mut array);
    &password_bytes == &bytes
});
The span value depends on the number of CPUs (the number of threads used by Rayon); in your case:

let num_cpu = 4;
let span = 2.5e11 as u64;
let max_val = span * num_cpu;

Note that the performance of this approach depends heavily on how Rayon actually splits the sequence across parallel threads. Verify that it behaves as you reported in the question.
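The effect of that formula can be checked with plain numbers (num_cpu = 4 and a toy span of 4 are assumed here): contiguous per-thread index blocks map to interleaved seed values, so every thread gets to test short candidates early.

```rust
// Hypothetical helper mirroring the rebalancing formula from the answer above.
fn rebalance(i: u64, span: u64, num_cpu: u64) -> u64 {
    i / span + (i % span) * num_cpu
}

fn main() {
    let num_cpu = 4;
    let span = 4; // toy value; the answer uses 2.5e11

    let mapped: Vec<u64> = (0..span * num_cpu)
        .map(|i| rebalance(i, span, num_cpu))
        .collect();

    // Thread 0 owns i = 0..4 and now tests seeds 0, 4, 8, 12 instead of 0, 1, 2, 3.
    assert_eq!(&mapped[0..4], &[0, 4, 8, 12]);
    // Thread 1 owns i = 4..8 and tests seeds 1, 5, 9, 13.
    assert_eq!(&mapped[4..8], &[1, 5, 9, 13]);
}
```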