Construct string slice from vector of string slices - rust

I have a vector holding n string slices. I would like to construct a string slice based on these.
fn main() {
let v: Vec<&str> = vec!["foo", "bar"];
let h: &str = "home";
let result = format!("hello={}#{}&{}#{}", v[0], h, v[1], h);
println!("{}", result);
}
I searched through the docs but I failed to find anything on this subject.

This can be done (somewhat inefficiently) with iterators:
let result = format!("hello={}",
v.iter().map(|s| format!("{}#{}", s, h))
.collect::<Vec<_>>()
.join("&")
);
(Playground)
If high performance is needed, a loop that builds a String will be quite a bit faster. The approach above allocates an additional String for each input &str, then a vector to hold them all before finally joining them together.
Here's a more efficient way to implement this. The operation carried out by this function is to call the passed function for each element in the iterator, giving it access to the std::fmt::Write reference passed in, and sticking the iterator in between successive calls. (Note that String implements std::fmt::Write!)
use std::fmt::Write;
fn delimited_write<W, I, V, F>(writer: &mut W, seq: I, delim: &str, mut func: F)
-> Result<(), std::fmt::Error>
where W: Write,
I: IntoIterator<Item=V>,
F: FnMut(&mut W, V) -> Result<(), std::fmt::Error>
{
let mut iter = seq.into_iter();
match iter.next() {
None => { },
Some(v) => {
func(writer, v)?;
for v in iter {
writer.write_str(delim)?;
func(writer, v)?;
}
},
};
Ok(())
}
You'd use it to implement your operation like so:
use std::fmt::Write;
fn main() {
let v: Vec<&str> = vec!["foo", "bar"];
let h: &str = "home";
let mut result: String = "hello=".to_string();
delimited_write(&mut result, v.iter(), "&", |w, i| {
write!(w, "{}#{}", i, h)
}).expect("write succeeded");
println!("{}", result);
}
It's not as pretty, but it makes no temporary String or Vec allocations. (Playground)

You will need to iterate over the vector as cdhowie suggests above. Let me explain why this is necessarily an O(n) problem and you can't create a single string slice from a vector of string slices without iterating over the vector:
Your vector only holds references to the strings; it doesn't hold the strings themselves. The strings are likely not stored contiguously in memory (only their references inside your vector are) so combining them into a single slice is not as simple as creating a slice that points to the beginning of the first string referenced in the vector and then extending the size of the slice.
Given that a &str is just an integer indicating the length of the slice and a pointer to a location in memory or the application binary where a str (essentially an array of char's) is stored, you can imagine that if the first &str in your vector references a string on the stack and the next one references a hardcoded string that is stored in the executable binary of the program, there is no way to create a single &str that points to both str's without copying at least one of them (in practice, probably both of them will be copied).
In order to get a single string slice from all of those &str's in your vector, you need to copy each of the str's they reference to a single, contiguous chunk of memory and then create a slice of that chunk. That copying requires iterating over the vector.

Related

Rust string comparison same speed as Python . Want to parallelize the program

I am new to rust. I want to write a function which later can be imported into Python as a module using the pyo3 crate.
Below is the Python implementation of the function I want to implement in Rust:
def pcompare(a, b):
letters = []
for i, letter in enumerate(a):
if letter != b[i]:
letters.append(f'{letter}{i + 1}{b[i]}')
return letters
The first Rust implemention I wrote looks like this:
use pyo3::prelude::*;
#[pyfunction]
fn compare_strings_to_vec(a: &str, b: &str) -> PyResult<Vec<String>> {
if a.len() != b.len() {
panic!(
"Reads are not the same length!
First string is length {} and second string is length {}.",
a.len(), b.len());
}
let a_vec: Vec<char> = a.chars().collect();
let b_vec: Vec<char> = b.chars().collect();
let mut mismatched_chars = Vec::new();
for (mut index,(i,j)) in a_vec.iter().zip(b_vec.iter()).enumerate() {
if i != j {
index += 1;
let mutation = format!("{i}{index}{j}");
mismatched_chars.push(mutation);
}
}
Ok(mismatched_chars)
}
#[pymodule]
fn compare_strings(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
m.add_function(wrap_pyfunction!(compare_strings_to_vec, m)?)?;
Ok(())
}
Which I builded in --release mode. The module could be imported to Python, but the performance was quite similar to the performance of the Python implementation.
My first question is: Why is the Python and Rust function similar in speed?
Now I am working on a parallelization implementation in Rust. When just printing the result variable, the function works:
use rayon::prelude::*;
fn main() {
let a: Vec<char> = String::from("aaaa").chars().collect();
let b: Vec<char> = String::from("aaab").chars().collect();
let length = a.len();
let index: Vec<_> = (1..=length).collect();
let mut mismatched_chars: Vec<String> = Vec::new();
(a, index, b).into_par_iter().for_each(|(x, i, y)| {
if x != y {
let mutation = format!("{}{}{}", x, i, y).to_string();
println!("{mutation}");
//mismatched_chars.push(mutation);
}
});
}
However, when I try to push the mutation variable to the mismatched_charsvector:
use rayon::prelude::*;
fn main() {
let a: Vec<char> = String::from("aaaa").chars().collect();
let b: Vec<char> = String::from("aaab").chars().collect();
let length = a.len();
let index: Vec<_> = (1..=length).collect();
let mut mismatched_chars: Vec<String> = Vec::new();
(a, index, b).into_par_iter().for_each(|(x, i, y)| {
if x != y {
let mutation = format!("{}{}{}", x, i, y).to_string();
//println!("{mutation}");
mismatched_chars.push(mutation);
}
});
}
I get the following error:
error[E0596]: cannot borrow `mismatched_chars` as mutable, as it is a captured variable in a `Fn` closure
--> src/main.rs:16:13
|
16 | mismatched_chars.push(mutation);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot borrow as mutable
For more information about this error, try `rustc --explain E0596`.
error: could not compile `testing_compare_strings` due to previous error
I tried A LOT of different things. When I do:
use rayon::prelude::*;
fn main() {
let a: Vec<char> = String::from("aaaa").chars().collect();
let b: Vec<char> = String::from("aaab").chars().collect();
let length = a.len();
let index: Vec<_> = (1..=length).collect();
let mut mismatched_chars: Vec<&str> = Vec::new();
(a, index, b).into_par_iter().for_each(|(x, i, y)| {
if x != y {
let mutation = format!("{}{}{}", x, i, y).to_string();
mismatched_chars.push(&mutation);
}
});
}
The error becomes:
error[E0596]: cannot borrow `mismatched_chars` as mutable, as it is a captured variable in a `Fn` closure
--> src/main.rs:16:13
|
16 | mismatched_chars.push(&mutation);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot borrow as mutable
error[E0597]: `mutation` does not live long enough
--> src/main.rs:16:35
|
10 | let mut mismatched_chars: Vec<&str> = Vec::new();
| -------------------- lifetime `'1` appears in the type of `mismatched_chars`
...
16 | mismatched_chars.push(&mutation);
| ----------------------^^^^^^^^^-
| | |
| | borrowed value does not live long enough
| argument requires that `mutation` is borrowed for `'1`
17 | }
| - `mutation` dropped here while still borrowed
I suspect that the solution is quite simple, but I cannot see it myself.
You have the right idea with what you are doing, but you will want to try to use an iterator chain with filter and map to remove or convert iterator items into different values. Rayon also provides a collect method similar to regular iterators to convert items into a type T: FromIterator (such as Vec<T>).
fn compare_strings_to_vec(a: &str, b: &str) -> Vec<String> {
// Same as with the if statement, but just a little shorter to write
// Plus, it will print out the two values it is comparing if it errors.
assert_eq!(a.len(), b.len(), "Reads are not the same length!");
// Zip the character iterators from a and b together
a.chars().zip(b.chars())
// Iterate with the index of each item
.enumerate()
// Rayon function which turns a regular iterator into a parallel one
.par_bridge()
// Filter out values where the characters are the same
.filter(|(_, (a, b))| a != b)
// Convert the remaining values into an error string
.map(|(index, (a, b))| {
format!("{}{}{}", a, index + 1, b)
})
// Turn the items of this iterator into a Vec (Or any other FromIterator type).
.collect()
}
Rust Playground
Optimizing for speed
On the other hand, if you want speed we need to approach this problem from a different direction. You may have noticed, but the rayon version is quite slow since the cost of spawning a thread and using concurrency structures is orders of magnitude more than just simply comparing the bytes in the original thread. In my benchmarks, I found that even with better workload distribution, additional threads were only helpful on my machine (64GB RAM, 16 cores) when the strings were at least 1-2 million bytes long. Given that you have stated they are typically ~30,000 bytes long I think using rayon (or really any other threading for comparisons of this size) will only slow down your code.
Using criterion for benchmarking, I eventually came to this implementation. It generally gets about 2.8156 µs per run on strings of 30,000 characters with 10 different bytes. For comparison, the code posted in the original question usually gets around 61.156 µs on my system under the same conditions so this should give a ~20x speedup. It can vary a bit, but it consistently got the best results in the benchmark. I'm guessing this should be fast enough to have this step no-longer be the bottleneck in your code.
This key focus of this implementation is to do the comparisons in batches. We can take advantage of the 128bit registers on most CPUs to compare the input in 16 byte batches. Upon an inequality being found, the 16 byte section it covers is re-scanned for the exact position of the discrepancy. This gives a decent boost to performance. I initially thought that a usize would work better, but it seems that was not the case. I also attempted to use the portable_simd nightly feature to write a simd version of this code, but I was unable to match the speed of this code. I suspect this was either due to missed optimizations or a lack of experience to effectively use simd on my part.
I was worried about drops in speed due to alignment of chunks not being enforced for u128 values, but it seems to mostly be a non-issue. First of all, it is generally quite difficult to find allocators which are willing to allocate to an address which is not a multiple of the system word size. Of course, this is due to practicality rather than any actual requirement. When I manually gave it unaligned slices (unaligned for u128s), it is not significantly effected. This is why I do not attempt to enforce that the start index of the slice be aligned to align_of::<u128>().
fn compare_strings_to_vec(a: &str, b: &str) -> Vec<String> {
let a_bytes = a.as_bytes();
let b_bytes = b.as_bytes();
let remainder = a_bytes.len() % size_of::<u128>();
// Strongly suggest to the compiler we are iterating though u128
a_bytes
.chunks_exact(size_of::<u128>())
.zip(b_bytes.chunks_exact(size_of::<u128>()))
.enumerate()
.filter(|(_, (a, b))| {
let a_block: &[u8; 16] = (*a).try_into().unwrap();
let b_block: &[u8; 16] = (*b).try_into().unwrap();
u128::from_ne_bytes(*a_block) != u128::from_ne_bytes(*b_block)
})
.flat_map(|(word_index, (a, b))| {
fast_path(a, b).map(move |x| word_index * size_of::<u128>() + x)
})
.chain(
fast_path(
&a_bytes[a_bytes.len() - remainder..],
&b_bytes[b_bytes.len() - remainder..],
)
.map(|x| a_bytes.len() - remainder + x),
)
.map(|index| {
format!(
"{}{}{}",
char::from(a_bytes[index]),
index + 1,
char::from(b_bytes[index])
)
})
.collect()
}
/// Very similar to regular route, but with nothing fancy, just get the indices of the overlays
#[inline(always)]
fn fast_path<'a>(a: &'a [u8], b: &'a [u8]) -> impl 'a + Iterator<Item = usize> {
a.iter()
.zip(b.iter())
.enumerate()
.filter_map(|(x, (a, b))| (a != b).then_some(x))
}
You cannot directly access the field mismatched_chars in a multithreading environment.
You can use Arc<RwLock> to access the field in multithreading.
use rayon::prelude::*;
use std::sync::{Arc, RwLock};
fn main() {
let a: Vec<char> = String::from("aaaa").chars().collect();
let b: Vec<char> = String::from("aaab").chars().collect();
let length = a.len();
let index: Vec<_> = (1..=length).collect();
let mismatched_chars: Arc<RwLock<Vec<String>>> = Arc::new(RwLock::new(Vec::new()));
(a, index, b).into_par_iter().for_each(|(x, i, y)| {
if x != y {
let mutation = format!("{}{}{}", x, i, y);
mismatched_chars
.write()
.expect("could not acquire write lock")
.push(mutation);
}
});
for mismatch in mismatched_chars
.read()
.expect("could not acquire read lock")
.iter()
{
eprintln!("{}", mismatch);
}
}

Why does this HashMap key have to be dereferenced twice?

This function computes the mode of a Vec<i32> using a HashMap to keep count of the occurrence of each value. I do not understand why this will not compile unless the key is deferenced twice in this last line:
fn mode(vec: &Vec<i32>) -> i32 {
let mut counts = HashMap::new();
for n in vec {
let count = counts.entry(n).or_insert(0);
*count += 1;
}
**counts.iter().max_by_key(|a| a.1).unwrap().0
}
It has to be dereferenced twice because you've created a double reference.
You are iterating over &Vec<T> which produces &T.
You called HashMap::iter on HashMap<K, V> which produces (&K, &V).
fn mode(vec: &[i32]) -> i32 {
let mut counts = std::collections::HashMap::new();
for &n in vec {
*counts.entry(n).or_insert(0) += 1;
}
counts.into_iter().max_by_key(|a| a.1).unwrap().0
}
See also:
What is the difference between iter and into_iter?
What does it mean to pass in a vector into a `for` loop versus a reference to a vector?
Iterating over a slice's values instead of references in Rust?
Meaning of '&variable' in arguments/patterns
What is the difference between `e1` and `&e2` when used as the for-loop variable?
What is the purpose of `&` before the loop variable?
Why is it discouraged to accept a reference to a String (&String), Vec (&Vec), or Box (&Box) as a function argument?

How to find the starting offset of a string slice of another string? [duplicate]

This question already has answers here:
How to get the byte offset between `&str`
(2 answers)
Closed 3 years ago.
Given a string and a slice referring to some substring, is it possible to find the starting and ending index of the slice?
I have a ParseString function which takes in a reference to a string, and tries to parse it according to some grammar:
ParseString(inp_string: &str) -> Result<(), &str>
If the parsing is fine, the result is just Ok(()), but if there's some error, it usually is in some substring, and the error instance is Err(e), where e is a slice of that substring.
When given the substring where the error occurs, I want to say something like "Error from characters x to y", where x and y are the starting and ending indices of the erroneous substring.
I don't want to encode the position of the errors directly in Err, because I'm nesting these invocations, and the offsets in the nested slice might not correspond to the some slice in the top level string.
As long as all of your string slices borrow from the same string buffer, you can calculate offsets with simple pointer arithmetic. You need the following methods:
str::as_ptr(): Returns the pointer to the start of the string slice
A way to get the difference between two pointers. Right now, the easiest way is to just cast both pointers to usize (which is always a no-op) and then subtract those. On 1.47.0+, there is a method offset_from() which is slightly nicer.
Here is working code (Playground):
fn get_range(whole_buffer: &str, part: &str) -> (usize, usize) {
let start = part.as_ptr() as usize - whole_buffer.as_ptr() as usize;
let end = start + part.len();
(start, end)
}
fn main() {
let input = "Everyone ♥ Ümläuts!";
let part1 = &input[1..7];
println!("'{}' has offset {:?}", part1, get_range(input, part1));
let part2 = &input[7..16];
println!("'{}' has offset {:?}", part2, get_range(input, part2));
}
Rust actually used to have an unstable method for doing exactly this, but it was removed due to being obsolete, which was a bit odd considering the replacement didn't remotely have the same functionality.
That said, the implementation isn't that big, so you can just add the following to your code somewhere:
pub trait SubsliceOffset {
/**
Returns the byte offset of an inner slice relative to an enclosing outer slice.
Examples
```ignore
let string = "a\nb\nc";
let lines: Vec<&str> = string.lines().collect();
assert!(string.subslice_offset_stable(lines[0]) == Some(0)); // &"a"
assert!(string.subslice_offset_stable(lines[1]) == Some(2)); // &"b"
assert!(string.subslice_offset_stable(lines[2]) == Some(4)); // &"c"
assert!(string.subslice_offset_stable("other!") == None);
```
*/
fn subslice_offset_stable(&self, inner: &Self) -> Option<usize>;
}
impl SubsliceOffset for str {
fn subslice_offset_stable(&self, inner: &str) -> Option<usize> {
let self_beg = self.as_ptr() as usize;
let inner = inner.as_ptr() as usize;
if inner < self_beg || inner > self_beg.wrapping_add(self.len()) {
None
} else {
Some(inner.wrapping_sub(self_beg))
}
}
}
You can remove the _stable suffix if you don't need to support old versions of Rust; it's just there to avoid a name conflict with the now-removed subslice_offset method.

How to translate "x-y" to vec![x, x+1, … y-1, y]?

This solution seems rather inelegant:
fn parse_range(&self, string_value: &str) -> Vec<u8> {
let values: Vec<u8> = string_value
.splitn(2, "-")
.map(|part| part.parse().ok().unwrap())
.collect();
{ values[0]..(values[1] + 1) }.collect()
}
Since splitn(2, "-") returns exactly two results for any valid string_value, it would be better to assign the tuple directly to two variables first and last rather than a seemingly arbitrary-length Vec. I can't seem to do this with a tuple.
There are two instances of collect(), and I wonder if it can be reduced to one (or even zero).
Trivial implementation
fn parse_range(string_value: &str) -> Vec<u8> {
let pos = string_value.find(|c| c == '-').expect("No valid string");
let (first, second) = string_value.split_at(pos);
let first: u8 = first.parse().expect("Not a number");
let second: u8 = second[1..].parse().expect("Not a number");
{ first..second + 1 }.collect()
}
Playground
I would recommend returning a Result<Vec<u8>, Error> instead of panicking with expect/unwrap.
Nightly implementation
My next thought was about the second collect. Here is a code example which uses nightly code, but you won't need any collect at all.
#![feature(conservative_impl_trait, inclusive_range_syntax)]
fn parse_range(string_value: &str) -> impl Iterator<Item = u8> {
let pos = string_value.find(|c| c == '-').expect("No valid string");
let (first, second) = string_value.split_at(pos);
let first: u8 = first.parse().expect("Not a number");
let second: u8 = second[1..].parse().expect("Not a number");
first..=second
}
fn main() {
println!("{:?}", parse_range("3-7").collect::<Vec<u8>>());
}
Instead of calling collect the first time, just advance the iterator:
let mut values = string_value
.splitn(2, "-")
.map(|part| part.parse().unwrap());
let start = values.next().unwrap();
let end = values.next().unwrap();
Do not call .ok().unwrap() — that converts the Result with useful error information to an Option, which has no information. Just call unwrap directly on the Result.
As already mentioned, if you want to return a Vec, you'll want to call collect to create it. If you want to return an iterator, you can. It's not bad even in stable Rust:
fn parse_range(string_value: &str) -> std::ops::Range<u8> {
let mut values = string_value
.splitn(2, "-")
.map(|part| part.parse().unwrap());
let start = values.next().unwrap();
let end = values.next().unwrap();
start..end + 1
}
fn main() {
assert!(parse_range("1-5").eq(1..6));
}
Sadly, inclusive ranges are not yet stable, so you'll need to continue to use +1 or switch to nightly.
Since splitn(2, "-") returns exactly two results for any valid string_value, it would be better to assign the tuple directly to two variables first and last rather than a seemingly arbitrary-length Vec. I can't seem to do this with a tuple.
This is not possible with Rust's type system. You are asking for dependent types, a way for runtime values to interact with the type system. You'd want splitn to return a (&str, &str) for a value of 2 and a (&str, &str, &str) for a value of 3. That gets even more complicated when the argument is a variable, especially when it's set at run time.
The closest workaround would be to have a runtime check that there are no more values:
assert!(values.next().is_none());
Such a check doesn't feel valuable to me.
See also:
What is the correct way to return an Iterator (or any other trait)?
How do I include the end value in a range?

Convert "*mut *mut f32" into "&[&[f32]]"

I want to convert arrays.
Example:
func()-> *mut *mut f32;
...
let buffer = func();
for n in 0..48000 {
buffer[0][n] = 1.0;
buffer[1][n] = 3.0;
}
In Rust &[T]/&mut [T] is called a slice. A slice is not an array; it is a pointer to the beginning of an array and the number of items in this array. Therefore, to create &mut [T] out of *mut T, you need to known the length of the array behind the pointer.
*mut *mut T looks like a C implementation of a 2D, possibly jagged, array, i.e. an array of arrays (this is different from a contiguous 2D array, as you probably know). There is no free way to convert it to &mut [&mut [T]], because, as I said before, *mut T is one pointer-sized number, while &mut [T] is two pointer-sized numbers. So you can't, for example, transmute *mut T to &mut [T], it would be a size mismatch. Therefore, you can't simply transform *mut *mut f32 to &mut [&mut [f32]] because of the layout mismatch.
In order to safely work with numbers stored in *mut *mut f32, you need, first, determine the length of the outer array and lengths of all of the inner arrays. For simplicity, let's consider that they are all known statically:
const ROWS: usize = 48000;
const COLUMNS: usize = 48000;
Now, since you know the length, you can convert the outer pointer to a slice of raw pointers:
use std::slice;
let buffer: *mut *mut f32 = func();
let buf_slice: &mut [*mut f32] = unsafe {
slice::from_raw_parts_mut(buffer, ROWS);
};
Now you need to go through this slice and convert each item to a slice, collecting the results into a vector:
let matrix: Vec<&mut [f32]> = buf_slice.iter_mut()
.map(|p| unsafe { slice::from_raw_parts_mut(p, COLUMNS) })
.collect();
And now you can indeed access your buffer by indices:
for n in 0..COLUMNS {
matrix[0][n] = 1.0;
matrix[1][n] = 3.0;
}
(I have put explicit types on bindings for readability, most of them in fact can be omitted)
So, there are two main things to consider when converting raw pointers to slices:
you need to know exact length of the array to create a slice from it; if you know it, you can use slice::from_raw_parts() or slice::from_raw_parts_mut();
if you are converting nested arrays, you need to rebuild each layer of the indirection because pointers have different size than slices.
And naturally, you have to track who is the owner of the buffer and when it will be freed, otherwise you can easily get a slice pointing to a buffer which does not exist anymore. This is unsafe, after all.
Since your array seems to be an array of pointers to an array of 48000 f32s, you can simply use fixed size arrays ([T; N]) instead of slices ([T]):
fn func() -> *mut *mut f32 { unimplemented!() }
fn main() {
let buffer = func();
let buffer: &mut [&mut [f32; 48000]; 2] = unsafe { std::mem::transmute(buffer) };
for n in 0..48000 {
buffer[0][n] = 1.0;
buffer[1][n] = 3.0;
}
}

Resources