Searching for a matching subslice? [duplicate] - rust

I have a &[u8] slice over a binary buffer. I need to parse it, but a lot of the methods that I would like to use (such as str::find) don't seem to be available on slices.
I've seen that I can covert both by buffer slice and my pattern to str by using from_utf8_unchecked() but that seems a little dangerous (and also really hacky).
How can I find a subsequence in this slice? I actually need the index of the pattern, not just a slice view of the parts, so I don't think split will work.

Here's a simple implementation based on the windows iterator.
fn find_subsequence(haystack: &[u8], needle: &[u8]) -> Option<usize> {
haystack.windows(needle.len()).position(|window| window == needle)
}
fn main() {
assert_eq!(find_subsequence(b"qwertyuiop", b"tyu"), Some(4));
assert_eq!(find_subsequence(b"qwertyuiop", b"asd"), None);
}
The find_subsequence function can also be made generic:
fn find_subsequence<T>(haystack: &[T], needle: &[T]) -> Option<usize>
where for<'a> &'a [T]: PartialEq
{
haystack.windows(needle.len()).position(|window| window == needle)
}

I don't think the standard library contains a function for this. Some libcs have memmem, but at the moment the libc crate does not wrap this. You can use the twoway crate however. rust-bio implements some pattern matching algorithms, too. All of those should be faster than using haystack.windows(..).position(..)

I found the memmem crate useful for this task:
use memmem::{Searcher, TwoWaySearcher};
let search = TwoWaySearcher::new("dog".as_bytes());
assert_eq!(
search.search_in("The quick brown fox jumped over the lazy dog.".as_bytes()),
Some(41)
);

How about Regex on bytes? That looks very powerful. See this Rust playground demo.
extern crate regex;
use regex::bytes::Regex;
fn main() {
//see https://doc.rust-lang.org/regex/regex/bytes/
let re = Regex::new(r"say [^,]*").unwrap();
let text = b"say foo, say bar, say baz";
// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<usize> =
re.captures_iter(text)
.map(|c| c.get(0).unwrap().start())
.collect();
assert_eq!(cstrs, vec![0, 9, 18]);
}

Related

How to rotate a vector without standard library?

I'm getting into Rust and Arduino at the same time.
I was programming my LCD display to show a long string by rotating it through the top column of characters. Means: Every second I shift all characters by one position and show the new String.
This was fairly complex in the Arduino language, especially because I had to know the size of the String at compile time (given my limited knowledge).
Since I'd like to use Rust in the long term, I was curious to see if that could be done more easily in a modern language. Not so much.
This is the code I came up with, after hours of experimentation:
#![no_std]
extern crate alloc;
use alloc::{vec::Vec};
fn main() {
}
fn rotate_by<T: Copy>(rotate: Vec<T>, by: isize) -> Vec<T> {
let real_by = modulo(by, rotate.len() as isize) as usize;
Vec::from_iter(rotate[real_by..].iter().chain(rotate[..real_by].iter()).cloned())
}
fn modulo(a: isize, b: isize) -> isize {
a - b * (a as f64 /b as f64).floor() as isize
}
mod tests {
use super::*;
#[test]
fn test_rotate_five() {
let chars: Vec<_> = "I am the string and you should rotate me! ".chars().collect();
let res_chars: Vec<_> = "the string and you should rotate me! I am ".chars().collect();
assert_eq!(rotate_by(chars, 5), res_chars);
}
}
My questions are:
Could you provide an optimized version of this function? I'm aware that there already is Vec::rotate but it uses unsafe code and can panic, which I would like to avoid (by returning a Result).
Explain whether or not it is possible to achieve this in-place without unsafe code (I failed).
Is Vec<_> the most efficient data structure to work with? I tried hard to use [char], which I thought would be more efficient, but then I have to know the size at compile time, which hardly works. I thought Rust arrays would be similar to Java arrays, which can be sized at runtime yet are also fixed size once created, but they seem to have a lot more constraints.
Oh and also what happens if I index into a vector at an invalid index? Will it panic? Can I do this better? Without "manually" checking the validity of the slice indices?
I realize that's a lot of questions, but I'm struggling and this is bugging me a lot, so if somebody could set me straight it would be much appreciated!
You can use slice::rotate_left and slice::rotate_right:
#![no_std]
extern crate alloc;
use alloc::vec::Vec;
fn rotate_by<T>(data: &mut [T], by: isize) {
if by > 0 {
data.rotate_left(by.unsigned_abs());
} else {
data.rotate_right(by.unsigned_abs());
}
}
I made it rotate in-place because that is more efficient. If you don't want to do it in-place you still have the option of cloning the vector first, so this is more flexible than if the function creates a new vector, as you have done, because you aren't be able to opt out of that when you call it.
Notice that rotate_by takes a mutable slice, but you can still pass a mutable reference to a vector, because of deref coercion.
#[test]
fn test_rotate_five() {
let mut chars: Vec<_> = "I am the string and you should rotate me! ".chars().collect();
let res_chars: Vec<_> = "the string and you should rotate me! I am ".chars().collect();
rotate_by(&mut chars, 5);
assert_eq!(chars, res_chars);
}
There are some edge cases with moving chars around like this because some valid UTF-8 will contain grapheme clusters that are made up of multiple codepoints (chars in Rust). This will result in strange effects when a grapheme cluster is split between the start and end of the string. For example, rotating "abcdéfghijk" by 5 will result in "efghijkabcd\u{301}", with the acute accent stranded on its own, away from the 'e'.
If your strings are ASCII then you don't have to worry about that, but then you can also just treat them as byte strings anyway:
#[test]
fn test_rotate_five_ascii() {
let mut chars = b"I am the string and you should rotate me! ".clone();
let res_chars = b"the string and you should rotate me! I am ";
rotate_by(&mut chars, 5);
assert_eq!(chars, &res_chars[..]);
}

Can I transform the strings in a [&'static str; N] into a &[&str] without creating multiple temporary values? [duplicate]

Per Steve Klabnik's writeup in the pre-Rust 1.0 documentation on the difference between String and &str, in Rust you should use &str unless you really need to have ownership over a String. Similarly, it's recommended to use references to slices (&[]) instead of Vecs unless you really need ownership over the Vec.
I have a Vec<String> and I want to write a function that uses this sequence of strings and it doesn't need ownership over the Vec or String instances, should that function take &[&str]? If so, what's the best way to reference the Vec<String> into &[&str]? Or, is this coercion overkill?
You can create a function that accepts both &[String] and &[&str] using the AsRef trait:
fn test<T: AsRef<str>>(inp: &[T]) {
for x in inp { print!("{} ", x.as_ref()) }
println!("");
}
fn main() {
let vref = vec!["Hello", "world!"];
let vown = vec!["May the Force".to_owned(), "be with you.".to_owned()];
test(&vref);
test(&vown);
}
This is actually impossible without either memory allocation or per-element call1.
Going from String to &str is not just viewing the bits in a different light; String and &str have a different memory layout, and thus going from one to the other requires creating a new object. The same applies to Vec and &[]
Therefore, whilst you can go from Vec<T> to &[T], and thus from Vec<String> to &[String], you cannot directly go from Vec<String> to &[&str]. Your choices are:
either accept &[String]
allocate a new Vec<&str> referencing the first Vec, and convert that into a &[&str]
As an example of the allocation:
fn usage(_: &[&str]) {}
fn main() {
let owned = vec![String::new()];
let half_owned: Vec<_> = owned.iter().map(String::as_str).collect();
usage(&half_owned);
}
1 Using generics and the AsRef<str> bound as shown in #aSpex's answer you get a slightly more verbose function declaration with the flexibility you were asking for, but you do have to call .as_ref() in all elements.

Convert Vec<String> into a slice of &str in Rust?

Per Steve Klabnik's writeup in the pre-Rust 1.0 documentation on the difference between String and &str, in Rust you should use &str unless you really need to have ownership over a String. Similarly, it's recommended to use references to slices (&[]) instead of Vecs unless you really need ownership over the Vec.
I have a Vec<String> and I want to write a function that uses this sequence of strings and it doesn't need ownership over the Vec or String instances, should that function take &[&str]? If so, what's the best way to reference the Vec<String> into &[&str]? Or, is this coercion overkill?
You can create a function that accepts both &[String] and &[&str] using the AsRef trait:
fn test<T: AsRef<str>>(inp: &[T]) {
for x in inp { print!("{} ", x.as_ref()) }
println!("");
}
fn main() {
let vref = vec!["Hello", "world!"];
let vown = vec!["May the Force".to_owned(), "be with you.".to_owned()];
test(&vref);
test(&vown);
}
This is actually impossible without either memory allocation or per-element call1.
Going from String to &str is not just viewing the bits in a different light; String and &str have a different memory layout, and thus going from one to the other requires creating a new object. The same applies to Vec and &[]
Therefore, whilst you can go from Vec<T> to &[T], and thus from Vec<String> to &[String], you cannot directly go from Vec<String> to &[&str]. Your choices are:
either accept &[String]
allocate a new Vec<&str> referencing the first Vec, and convert that into a &[&str]
As an example of the allocation:
fn usage(_: &[&str]) {}
fn main() {
let owned = vec![String::new()];
let half_owned: Vec<_> = owned.iter().map(String::as_str).collect();
usage(&half_owned);
}
1 Using generics and the AsRef<str> bound as shown in #aSpex's answer you get a slightly more verbose function declaration with the flexibility you were asking for, but you do have to call .as_ref() in all elements.

Ergonomics issues with fixed size byte arrays in Rust

Rust sadly cannot produce a fixed size array [u8; 16] with a fixed size slicing operator s[0..16]. It'll throw errors like "expected array of 16 elements, found slice".
I've some KDFs that output several keys in wrapper structs like
pub struct LeafKey([u8; 16]);
pub struct MessageKey([u8; 32]);
fn kdfLeaf(...) -> (MessageKey,LeafKey) {
// let mut r: [u8; 32+16];
let mut r: (MessageKey, LeafKey);
debug_assert_eq!(mem::size_of_val(&r), 384/8);
let mut sha = Sha3::sha3_384();
sha.input(...);
// sha.result(r);
sha.result(
unsafe { mem::transmute::<&mut (MessageKey, LeafKey),&mut [u8;32+16]>(&r) }
);
sha.reset();
// (MessageKey(r[0..31]), LeafKey(r[32..47]))
r
}
Is there a safer way to do this? We know mem::transmute will refuse to compile if the types do not have the same size, but that only checks that pointers have the same size here, so I added that debug_assert.
In fact, I'm not terribly worried about extra copies though since I'm running SHA3 here, but afaik rust offers no ergonomic way to copy amongst byte arrays.
Can I avoid writing (MessageKey, LeafKey) three times here? Is there a type alias for the return type of the current function? Is it safe to use _ in the mem::transmute given that I want the code to refuse to compile if the sizes do not match? Yes, I know I could make a type alias, but that seems silly.
As an aside, there is a longer discussion of s[0..16] not having type [u8; 16] here
There's the copy_from_slice method.
fn main() {
use std::default::Default;
// Using 16+8 because Default isn't implemented
// for [u8; 32+16] due to type explosion unfortunateness
let b: [u8; 24] = Default::default();
let mut c: [u8; 16] = Default::default();
let mut d: [u8; 8] = Default::default();
c.copy_from_slice(&b[..16])
d.copy_from_slice(&b[16..16+8]);
}
Note, unfortunately copy_from_slice throws a runtime error if the slices are not the same length, so make sure you thoroughly test this yourself, or use the lengths of the other arrays to guard.
Unfortunately, c.copy_from_slice(&b[..c.len()]) doesn't work because Rust thinks c is borrowed both immutably and mutably at the same time.
I marked the accepted answer as best since it's safe, and led me to the clone_into_array answer here, but..
Another idea that improves the safety is to make a version of mem::transmute for references that checks the sizes of the referenced types, as opposed to just the pointers. It might look like :
#[inline]
unsafe fn transmute_ptr_mut<A,B>(v: &mut A) -> &mut B {
debug_assert_eq!(core::mem::size_of(A),core::mem::size_of(B));
core::mem::transmute::<&mut A,&mut B>(v)
}
I have raised an issue on the arrayref crate to discuss this, as arrayref might be a reasonable crate for it to live in.
Update : We've a new "best answer" by the arrayref crate developer :
let (a,b) = array_refs![&r,32,16];
(MessageKey(*a), LeafKey(*b))

Repeat string with integer multiplication

Is there an easy way to do the following (from Python) in Rust?
>>> print ("Repeat" * 4)
RepeatRepeatRepeatRepeat
I'm starting to learn the language, and it seems String doesn't override Mul, and I can't find any discussion anywhere on a compact way of doing this (other than a map or loop).
Rust 1.16+
str::repeat is now available:
fn main() {
let repeated = "Repeat".repeat(4);
println!("{}", repeated);
}
Rust 1.0+
You can use iter::repeat:
use std::iter;
fn main() {
let repeated: String = iter::repeat("Repeat").take(4).collect();
println!("{}", repeated);
}
This also has the benefit of being more generic — it creates an infinitely repeating iterator of any type that is cloneable.
This one doesn't use Iterator::map but Iterator::fold instead:
fn main() {
println!("{:?}", (1..5).fold(String::new(), |b, _| b + "Repeat"));
}

Resources