Using str and String interchangeably - string

Suppose I'm trying to do a fancy zero-copy parser in Rust using &str, but sometimes I need to modify the text (e.g. to implement variable substitution). I really want to do something like this:
fn main() {
let mut v: Vec<&str> = "Hello there $world!".split_whitespace().collect();
for t in v.iter_mut() {
if (t.contains("$world")) {
*t = &t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
But of course the String returned by t.replace() doesn't live long enough. Is there a nice way around this? Perhaps there is a type which means "ideally a &str but if necessary a String"? Or maybe there is a way to use lifetime annotations to tell the compiler that the returned String should be kept alive until the end of main() (or have the same lifetime as v)?

Rust has exactly what you want in the form of the Cow (clone-on-write) type.
use std::borrow::Cow;
fn main() {
let mut v: Vec<_> = "Hello there $world!".split_whitespace()
.map(|s| Cow::Borrowed(s))
.collect();
for t in v.iter_mut() {
if t.contains("$world") {
*t.to_mut() = t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
As @sellibitze correctly notes, to_mut() first clones the borrowed value into a new String, a heap allocation that is thrown away as soon as the result of replace() is assigned over it. If you are sure you only have borrowed strings, then you can use
*t = Cow::Owned(t.replace("$world", "Earth"));
In case the Vec contains Cow::Owned elements, this would still throw away the allocation. You can prevent that using the following very fragile and unsafe code inside your for loop. (It does direct byte-based manipulation of UTF-8 strings and relies on the fact that, once the $ sign is removed, the replacement is exactly the same number of bytes.)
let mut last_pos = 0; // so we don't start at the beginning every time
while let Some(pos) = t[last_pos..].find("$world") {
let p = pos + last_pos; // find always starts at last_pos
last_pos = p + 5; // resume right after the 5 bytes of the replacement
unsafe {
let s = t.to_mut().as_mut_vec(); // operating on Vec is easier
s.remove(p); // remove $ sign
for (c, sc) in "Earth".bytes().zip(&mut s[p..]) {
*sc = c;
}
}
}
Note that this is tailored exactly to the "$world" -> "Earth" mapping. Any other mappings require careful consideration inside the unsafe code.
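A safe alternative to the unsafe loop, sketched under the assumption that String::replace_range is acceptable for the in-place edit. The function name replace_in_place is hypothetical. It keeps an existing owned buffer, only clones a borrowed value when a match is actually found, and does not depend on the two strings having related lengths.

```rust
use std::borrow::Cow;

/// Hypothetical helper: replace every `from` in `t` with `to`,
/// reusing the owned buffer if `t` is already `Cow::Owned`.
fn replace_in_place(t: &mut Cow<str>, from: &str, to: &str) {
    if from.is_empty() {
        return; // an empty pattern would match forever
    }
    let mut start = 0;
    while let Some(pos) = t[start..].find(from) {
        let p = start + pos; // `find` is relative to `start`
        // `to_mut` clones only if the value is still borrowed
        t.to_mut().replace_range(p..p + from.len(), to);
        start = p + to.len(); // continue after the inserted text
    }
}

fn main() {
    let mut v: Vec<Cow<str>> = "Hello there $world!"
        .split_whitespace()
        .map(Cow::Borrowed)
        .collect();
    for t in v.iter_mut() {
        replace_in_place(t, "$world", "Earth");
    }
    println!("{:?}", v);
}
```

Tokens without a match stay Cow::Borrowed, so they never allocate.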

Use std::borrow::Cow, specifically as Cow<'a, str>, where 'a is the lifetime of the string being parsed.
use std::borrow::Cow;
fn main() {
let mut v: Vec<Cow<'static, str>> = vec![];
v.push("oh hai".into());
v.push(format!("there, {}.", "Mark").into());
println!("{:?}", v);
}
Produces:
["oh hai", "there, Mark."]
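Applied to the parser from the question, the same idea might look like the sketch below. The function substitute and its parameters are hypothetical names. It assumes tokens are split on whitespace and builds each element directly as borrowed or owned, so only tokens that contain the variable allocate.

```rust
use std::borrow::Cow;

/// Hypothetical helper: tokenize `input`, substituting `value` for `var`.
/// Unchanged tokens borrow from `input`; changed ones own a new String.
fn substitute<'a>(input: &'a str, var: &str, value: &str) -> Vec<Cow<'a, str>> {
    input
        .split_whitespace()
        .map(|tok| {
            if tok.contains(var) {
                Cow::Owned(tok.replace(var, value))
            } else {
                Cow::Borrowed(tok)
            }
        })
        .collect()
}

fn main() {
    let v = substitute("Hello there $world!", "$world", "Earth");
    assert!(matches!(v[0], Cow::Borrowed("Hello")));
    assert!(matches!(v[2], Cow::Owned(_)));
    println!("{:?}", v);
}
```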

Related

reusable rust vector with associated vector of slices

I need rust code to read lines of a file, and break them into an array of slices. The working code is
use std::io::{self, BufRead};
fn main() {
let stdin = io::stdin();
let mut f = stdin.lock();
let mut line : Vec<u8> = Vec::new();
loop {
line.clear();
let sz = f.read_until(b'\n', &mut line).unwrap();
if sz == 0 {break};
let body : Vec<&[u8]> = line.split(|ch| *ch == b'\t').collect();
DoStuff(body);
}
}
However, that code is slower than I'd like. The code I want to write is
use std::io::{self, BufRead};
fn main() {
let stdin = io::stdin();
let mut f = stdin.lock();
let mut line : Vec<u8> = Vec::new();
let mut body: Vec<&[u8]> = Vec::new();
loop {
line.clear();
let sz = f.read_until(b'\n', &mut line).unwrap();
if sz == 0 {break};
body.extend(&mut line.split(|ch| *ch == b'\t'));
DoStuff(body);
body.clear();
}
}
but that runs afoul of the borrow checker.
In general, I'd like a class containing a Vec<u8> and an associated Vec<&[u8]>, which is the basis of a lot of C++ code I'm trying to replace.
Is there any way I can accomplish this?
I realize that I could replace the slices with pairs of integers, but that seems clumsy.
No, I can't just use the items from the iterator as they come through -- I need random access to the individual column values. In the simplified case where I do use the iterator directly, I get a 3X speedup, which is why I suspect a significant speedup by replacing collect with extend.
Other comments on this code are also welcome.
Just for sake of completeness, and since you are coming from C++, a more Rusty way of writing the code would be
use std::io::{self, BufRead};
fn do_stuff(body: &[&str]) {}
fn main() {
for line in io::stdin().lock().lines() {
let line = line.unwrap();
let body = line.split('\t').collect::<Vec<_>>();
do_stuff(&body);
}
}
This uses .lines() from BufRead to get an iterator over \n-delimited lines from the input. It assumes that your input is actually valid UTF-8, which in your code was not a requirement. If it is not UTF-8, use BufRead's .split(b'\n') to read the lines, .split(|b| *b == b'\t') on each line, and &[&[u8]] instead.
Notice that this does allocate and subsequently free a new Vec via .collect() every time the loop executes. We are somewhat relying on the allocator's free-list to make this cheap. But it is correct in all cases.
The reason your second example does not compile (after fixing the call to DoStuff(&body)) is this:
12 |         line.clear();
   |         ^^^^^^^^^^^^ mutable borrow occurs here
...
15 |         body.extend(&mut line.split(|ch| *ch == b'\t'));
   |         ----             ---- immutable borrow occurs here
   |         |
   |         immutable borrow later used here
The problem here is the loop: from the second iteration onwards, line.clear() on line 12 executes after body.extend() on line 15. But the compiler has figured out that body borrows from line (it contains references to the bytes inside line). The call to line.clear() mutably borrows line, all of line, and as far as the compiler is concerned it is free to do anything it wants with the data it holds. This is an error because line.clear() could possibly mutate data that body has borrowed immutably. The compiler does not reason about the fact that .clear() obviously does not mutate the borrowed data, quite the opposite in fact; its reasoning stops at the function signature.
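One way to keep both buffers across iterations is to store byte ranges instead of slices, so the second Vec never borrows from the first. The sketch below assumes that is acceptable. The helper split_ranges is a hypothetical name, and an in-memory byte string stands in for stdin.

```rust
use std::io::{self, BufRead};
use std::ops::Range;

/// Hypothetical helper: record column boundaries as byte ranges into `line`.
/// `out` is cleared and refilled, so its allocation is reused.
fn split_ranges(line: &[u8], delim: u8, out: &mut Vec<Range<usize>>) {
    out.clear();
    let mut begin = 0;
    for (i, &b) in line.iter().enumerate() {
        if b == delim {
            out.push(begin..i);
            begin = i + 1;
        }
    }
    out.push(begin..line.len());
}

fn main() -> io::Result<()> {
    let mut input: &[u8] = b"a\tbb\tccc\nx\ty\n"; // stand-in for stdin
    let mut line = Vec::new();
    let mut cols = Vec::new();
    loop {
        line.clear(); // fine: `cols` holds integers, not references
        if input.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        if line.last() == Some(&b'\n') {
            line.pop();
        }
        split_ranges(&line, b'\t', &mut cols);
        // random access to a column: turn the range back into a slice
        let second = &line[cols[1].clone()];
        println!("{}", String::from_utf8_lossy(second));
    }
    Ok(())
}
```

Each column costs one extra indexing step when it is read, in exchange for no per-line allocation.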
It seems like the answer is
No, it's not possible to reuse the vector of slices.
The way to go is to make something like a slice, but with integer offsets rather than pointers. Code is attached, comments welcome.
Performance is currently 15% better than the C++, but the C++ is part of a larger system, and is probably doing some additional stuff.
/// pointers into a vector, simulating a slice without the ownership issues
#[derive(Debug, Clone)]
pub struct FakeSlice {
begin: u32,
end: u32,
}
/// A line of a text file, broken into columns.
/// Access to the `line` and `parts` is allowed, but should seldom be necessary
/// `line` does not include the trailing newline
/// An empty line contains one empty column
///```
/// use std::io::BufRead;
/// let mut data = b"one\ttwo\tthree\n";
/// let mut dp = &data[..];
/// let mut line = cdx::TextLine::new();
/// let eof = line.read(&mut dp).unwrap();
/// assert_eq!(eof, false);
/// assert_eq!(line.strlen(), 13);
/// line.split(b'\t');
/// assert_eq!(line.len(), 3);
/// assert_eq!(line.get(1), b"two");
///```
#[derive(Debug, Clone)]
pub struct TextLine {
pub line: Vec<u8>,
pub parts: Vec<FakeSlice>,
}
impl TextLine {
/// make a new TextLine
pub fn new() -> TextLine {
TextLine {
line: Vec::new(),
parts: Vec::new(),
}
}
fn clear(&mut self) {
self.parts.clear();
self.line.clear();
}
/// How many columns in the line
pub fn len(&self) -> usize {
self.parts.len()
}
/// How many bytes in the line
pub fn strlen(&self) -> usize {
self.line.len()
}
/// should always be false, but required by clippy
pub fn is_empty(&self) -> bool {
self.parts.is_empty()
}
/// Get one column. Return an empty column if index is too big.
pub fn get(&self, index: usize) -> &[u8] {
if index >= self.parts.len() {
&self.line[0..0]
} else {
&self.line[self.parts[index].begin as usize..self.parts[index].end as usize]
}
}
/// Read a new line from a file, should generally be followed by `split`
pub fn read<T: std::io::BufRead>(&mut self, f: &mut T) -> std::io::Result<bool> {
self.clear();
let sz = f.read_until(b'\n', &mut self.line)?;
if sz == 0 {
Ok(true)
} else {
if self.line.last() == Some(&b'\n') {
self.line.pop();
}
Ok(false)
}
}
/// split the line into columns
/// hypothetically you could split on one delimiter, do some work, then split on a different delimiter.
pub fn split(&mut self, delim: u8) {
self.parts.clear();
let mut begin: u32 = 0;
let mut end: u32 = 0;
#[allow(clippy::explicit_counter_loop)] // I need the counter to be u32
for ch in self.line.iter() {
if *ch == delim {
self.parts.push(FakeSlice { begin, end });
begin = end + 1;
}
end += 1;
}
self.parts.push(FakeSlice { begin, end });
}
}

How to pass &mut str and change the original mut str without a return?

I'm learning Rust from the Book and I was tackling the exercises at the end of chapter 8, but I'm hitting a wall with the one about converting words into Pig Latin. I wanted to see specifically if I could pass a &mut String to a function that takes a &mut str (to also accept slices) and modify the referenced string inside it so the changes are reflected back outside without the need of a return, like in C with a char **.
I'm not quite sure if I'm just messing up the syntax or if it's more complicated than it sounds due to Rust's strict rules, which I have yet to fully grasp. For the lifetime errors inside to_pig_latin() I remember reading something that explained how to properly handle the situation but right now I can't find it, so if you could also point it out for me it would be very appreciated.
Also what do you think of the way I handled the chars and indexing inside strings?
use std::io::{self, Write};
fn main() {
let v = vec![
String::from("kaka"),
String::from("Apple"),
String::from("everett"),
String::from("Robin"),
];
for s in &v {
// cannot borrow `s` as mutable, as it is not declared as mutable
// cannot borrow data in a `&` reference as mutable
to_pig_latin(&mut s);
}
for (i, s) in v.iter().enumerate() {
print!("{}", s);
if i < v.len() - 1 {
print!(", ");
}
}
io::stdout().flush().unwrap();
}
fn to_pig_latin(mut s: &mut str) {
let first = s.chars().nth(0).unwrap();
let mut pig;
if "aeiouAEIOU".contains(first) {
pig = format!("{}-{}", s, "hay");
s = &mut pig[..]; // `pig` does not live long enough
} else {
let mut word = String::new();
for (i, c) in s.char_indices() {
if i != 0 {
word.push(c);
}
}
pig = format!("{}-{}{}", word, first.to_lowercase(), "ay");
s = &mut pig[..]; // `pig` does not live long enough
}
}
Edit: here's the fixed code with the suggestions from below.
fn main() {
// added mut
let mut v = vec![
String::from("kaka"),
String::from("Apple"),
String::from("everett"),
String::from("Robin"),
];
// iterate by mutable reference so each String can be changed in place
for s in &mut v {
to_pig_latin(s);
}
for (i, s) in v.iter().enumerate() {
print!("{}", s);
if i < v.len() - 1 {
print!(", ");
}
}
println!();
}
// converted into &mut String
fn to_pig_latin(s: &mut String) {
let first = s.chars().nth(0).unwrap();
if "aeiouAEIOU".contains(first) {
s.push_str("-hay");
} else {
// added code to make the new first letter uppercase
let second = s.chars().nth(1).unwrap();
*s = format!(
"{}{}-{}ay",
second.to_uppercase(),
// the slice starts at the third char of the string; add the two byte
// lengths instead of doubling the first, since the chars may differ in width
&s[first.len_utf8() + second.len_utf8()..],
first.to_lowercase()
);
}
}
What you're trying to do can't work: with a mutable reference you can update the referent in place, but this is extremely limited here:
A &mut str can't change its length or anything of that sort.
A &mut str is still just a reference, so the memory has to live somewhere. Here you're creating new Strings inside your function and then trying to use them as the new backing buffers for the reference, which, as the compiler tells you, doesn't work: the String will be deallocated at the end of the function.
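A minimal sketch of that difference, with hypothetical function names shout and add_suffix: an edit that keeps the byte length works through &mut str, while anything that grows the text needs &mut String.

```rust
/// Works through `&mut str`: the byte length does not change.
fn shout(s: &mut str) {
    s.make_ascii_uppercase();
}

/// Needs `&mut String`: the text grows, so the buffer may be reallocated.
fn add_suffix(s: &mut String) {
    s.push_str("-hay");
}

fn main() {
    let mut word = String::from("apple");
    shout(&mut word); // `&mut String` coerces to `&mut str`
    add_suffix(&mut word);
    assert_eq!(word, "APPLE-hay");
}
```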
What you could do is take an &mut String, that lets you modify the owned string itself in-place, which is much more flexible. And, in fact, corresponds exactly to your request: an &mut str corresponds to a char*, it's a pointer to a place in memory.
A String is also a pointer, so an &mut String is a double-pointer to a zone in memory.
So something like this:
fn to_pig_latin(s: &mut String) {
let first = s.chars().nth(0).unwrap();
if "aeiouAEIOU".contains(first) {
*s = format!("{}-{}", s, "hay");
} else {
let mut word = String::new();
for (i, c) in s.char_indices() {
if i != 0 {
word.push(c);
}
}
*s = format!("{}-{}{}", word, first.to_lowercase(), "ay");
}
}
You can also likely avoid some of the complete string allocations by using somewhat finer methods e.g.
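use std::fmt::Write as _; // lets write! append to a String

fn to_pig_latin(s: &mut String) {
    let first = s.chars().nth(0).unwrap();
    if "aeiouAEIOU".contains(first) {
        s.push_str("-hay")
    } else {
        // drop the first char in place, keeping the rest of the word
        s.replace_range(..first.len_utf8(), "");
        write!(s, "-{}ay", first.to_lowercase()).unwrap();
    }
}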
although the replace_range + write! is not very readable and not super likely to be much of a gain, so that might as well be a format!, something along the lines of:
fn to_pig_latin(s: &mut String) {
let first = s.chars().nth(0).unwrap();
if "aeiouAEIOU".contains(first) {
s.push_str("-hay")
} else {
*s = format!("{}-{}ay", &s[first.len_utf8()..], first.to_lowercase());
}
}

How do I convert reverse domain notation to PascalCase?

I want to convert "foo.bar.baz" to "FooBarBaz". My input will always be only ASCII. I tried:
let result = "foo.bar.baz"
.to_string()
.split(".")
.map(|x| x[0..1].to_string().to_uppercase() + &x[1..])
.fold("".to_string(), |acc, x| acc + &x);
println!("{}", result);
but that feels inefficient.
Your solution is a good start. You could probably make it work without heap allocations in the "functional" style; I prefer putting complex logic into normal for loops though.
Also I don't like assuming input is in ASCII without actually checking - this should work with any string.
You probably could also use String::with_capacity in your code to avoid reallocations in standard cases.
fn dotted_to_pascal_case(s: &str) -> String {
let mut result = String::with_capacity(s.len());
for part in s.split('.') {
let mut cs = part.chars();
if let Some(c) = cs.next() {
result.extend(c.to_uppercase());
}
result.push_str(cs.as_str());
}
result
}
fn main() {
println!("{}", dotted_to_pascal_case("foo.bar.baz"));
}
Stefan's answer is correct, but I decided to get rid of that first String allocation and go full-functional, without loops:
fn dotted_to_pascal_case(s: &str) -> String {
s.split('.')
.map(|piece| piece.chars())
.flat_map(|mut chars| {
chars
.next()
.expect("empty section between dots!")
.to_uppercase()
.chain(chars)
})
.collect()
}
fn main() {
println!("{}", dotted_to_pascal_case("foo.bar.baz"));
}
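The functional version panics on an empty section such as "foo..bar" because of its expect. If skipping empty sections is preferable, a variant might look like the following sketch; that behavior is an assumption, not something the question asks for.

```rust
fn dotted_to_pascal_case(s: &str) -> String {
    s.split('.')
        .flat_map(|piece| {
            let mut chars = piece.chars();
            chars
                .next() // None for an empty section
                .into_iter()
                .flat_map(char::to_uppercase)
                .chain(chars) // rest of the section, unchanged
        })
        .collect()
}

fn main() {
    assert_eq!(dotted_to_pascal_case("foo.bar.baz"), "FooBarBaz");
    assert_eq!(dotted_to_pascal_case("foo..bar"), "FooBar");
}
```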

Idiomatic Rust method for handling references to a buffer

I would like to be able to construct objects that contain immutable references to a mutable buffer object. The following code does not work but illustrates my use case, is there an idiomatic Rust method for handling this?
#[derive(Debug)]
struct Parser<'a> {
buffer: &'a String
}
fn main() {
let mut source = String::from("Peter");
let buffer = &source;
let parser = Parser { buffer };
// How can I legally change source?
source.push_str(" Pan");
println!("{:?}", parser);
}
The golden rule of the Rust borrow checker is: only one writer OR multiple readers can access a resource at a time. This ensures that algorithms are safe to run in multiple threads.
You breach this rule here:
#[derive(Debug)]
struct Parser<'a> {
buffer: &'a String
}
fn main() {
// mutable access begins here
let mut source = String::from("Peter");
// immutable access begins here
let buffer = &source;
let parser = Parser { buffer };
source.push_str(" Pan");
println!("{:?}", parser);
// Both immutable and mutable access end here
}
If you are sure that your program doesn't actively access resources at the same time mutably and immutably, you can move the check from compile time to run time by wrapping your resource in a RefCell:
use std::cell::RefCell;
use std::rc::Rc;
#[derive(Debug)]
struct Parser {
buffer: Rc<RefCell<String>>
}
fn main() {
let source = Rc::new(RefCell::new(String::from("Peter")));
let parser = Parser { buffer: source.clone() };
source.borrow_mut().push_str(" Pan");
println!("{:?}", parser);
}
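What "moving the check to run time" means can be seen with try_borrow_mut, which reports a conflict instead of panicking the way borrow_mut would. A small sketch, where try_append is a hypothetical helper:

```rust
use std::cell::RefCell;

/// Hypothetical helper: append only if no other borrow is active.
fn try_append(cell: &RefCell<String>, extra: &str) -> bool {
    match cell.try_borrow_mut() {
        Ok(mut s) => {
            s.push_str(extra);
            true
        }
        Err(_) => false, // a reader or writer is still alive
    }
}

fn main() {
    let source = RefCell::new(String::from("Peter"));
    {
        let reader = source.borrow();
        // the write is refused while `reader` exists
        assert!(!try_append(&source, " Pan"));
        assert_eq!(*reader, "Peter");
    }
    // the reader is gone, so the write now succeeds
    assert!(try_append(&source, " Pan"));
    assert_eq!(*source.borrow(), "Peter Pan");
}
```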
If you plan on passing your resource around threads, you can use an RwLock to block the thread until the resource is available:
use std::sync::{RwLock, Arc};
#[derive(Debug)]
struct Parser {
buffer: Arc<RwLock<String>>
}
fn main() {
let source = Arc::new(RwLock::new(String::from("Peter")));
let parser = Parser { buffer: source.clone() };
source.write().unwrap().push_str(" Pan");
println!("{:?}", parser);
}
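The example above stays on one thread. A sketch of the cross-thread case, with the hypothetical helper append_in_thread, could look like this:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical helper: append to the shared string from another thread
/// and wait for that thread to finish.
fn append_in_thread(shared: &Arc<RwLock<String>>, extra: &'static str) {
    let writer = Arc::clone(shared);
    thread::spawn(move || {
        // blocks until no reader or other writer holds the lock
        writer.write().unwrap().push_str(extra);
    })
    .join()
    .unwrap();
}

fn main() {
    let source = Arc::new(RwLock::new(String::from("Peter")));
    append_in_thread(&source, " Pan");
    assert_eq!(*source.read().unwrap(), "Peter Pan");
}
```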
On another note, you should prefer &str over &String.
It's hard to tell what exactly you want to achieve by mutating the source; I would assume you don't want it to happen while the parser is doing its work? You can always try (depending on your specific use case) to separate the immutable from the mutable with an extra scope:
fn main() {
let mut source = String::from("Peter");
{
let buffer = &source;
let parser = Parser { buffer };
println!("{:?}", parser);
}
source.push_str(" Pan");
}
If you don't want to use RefCell, unsafe (or to simply keep a mutable reference to source in Parser and use that), I'm afraid it doesn't get better than plain refactoring.
To elaborate on how this can be done unsafely, what you've described can be achieved by using a raw const pointer to avoid the borrowing rules, which of course is inherently unsafe, as the very concept of what you've described is pretty unsafe. There are ways to make it safer though, should you choose this path. But I would probably default to using an Arc<RwLock> or Arc<Mutex> should safety be important.
use std::fmt::{self, Display};
#[derive(Debug)]
struct Parser {
buffer: *const String
}
impl Display for Parser {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let buffer = unsafe { &*self.buffer };
write!(f, "{}", buffer)
}
}
fn main() {
let mut source = String::from("Peter");
let buffer = &source as *const String;
let parser = Parser { buffer };
source.push_str(" Pan");
println!("{}", parser);
}

How to "crop" characters off the beginning of a string in Rust?

I want a function that can take two arguments (string, number of letters to crop off front) and return the same string except with the letters before character x gone.
If I write
let mut example = "stringofletters";
CropLetters(example, 3);
println!("{}", example);
then the output should be:
ingofletters
Is there any way I can do this?
In many uses it would make sense to simply return a slice of the input, avoiding any copy. Converting @Shepmaster's solution to use immutable slices:
fn crop_letters(s: &str, pos: usize) -> &str {
match s.char_indices().nth(pos) {
Some((pos, _)) => &s[pos..],
None => "",
}
}
fn main() {
let example = "stringofletters"; // works with a String if you take a reference
let cropped = crop_letters(example, 3);
println!("{}", cropped);
}
Advantages over the mutating version are:
No copy is needed. You can call cropped.to_string() if you want a newly allocated result; but you don't have to.
It works with static string slices as well as mutable String etc.
The disadvantage is that if you really do have a mutable string you want to modify, it would be slightly less efficient as you'd need to allocate a new String.
Issues with your original code:
Functions use snake_case, types and traits use CamelCase.
"foo" is a string literal of type &str. These may not be changed. You will need something that has been heap-allocated, such as a String.
The call crop_letters(stringofletters, 3) would transfer ownership of stringofletters to the method, which means you wouldn't be able to use the variable anymore. You must pass in a mutable reference (&mut).
Rust strings are not ASCII, they are UTF-8. You need to figure out how many bytes each character requires. char_indices is a good tool here.
You need to handle the case of when the string is shorter than 3 characters.
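To make the byte-versus-character point concrete, here is a small sketch. The helper byte_offset is a hypothetical name; it uses char_indices to find where the n-th character starts.

```rust
/// Hypothetical helper: byte offset at which the `n`-th character starts,
/// or `None` if the string has fewer than `n + 1` characters.
fn byte_offset(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(i, _)| i)
}

fn main() {
    let s = "ЖФ1"; // two 2-byte characters followed by a 1-byte one
    assert_eq!(s.len(), 5); // bytes, not characters
    assert_eq!(s.chars().count(), 3);
    assert_eq!(byte_offset(s, 1), Some(2));
    assert_eq!(byte_offset(s, 2), Some(4));
    assert_eq!(byte_offset(s, 3), None);
}
```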
Once you have the byte position of the new beginning of the string, you can use drain to move a chunk of bytes out of the string. We just drop these bytes and let the String move over the remaining bytes.
fn crop_letters(s: &mut String, pos: usize) {
match s.char_indices().nth(pos) {
Some((pos, _)) => {
s.drain(..pos);
}
None => {
s.clear();
}
}
}
fn main() {
let mut example = String::from("stringofletters");
crop_letters(&mut example, 3);
assert_eq!("ingofletters", example);
}
See Chris Emerson's answer if you don't actually need to modify the original String.
I found this answer which I don't consider really idiomatic:
fn crop_with_allocation(string: &str, len: usize) -> String {
string.chars().skip(len).collect()
}
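fn crop_without_allocation(string: &str, len: usize) -> &str {
// `len` is a byte offset here, not a character count:
// slicing panics if it falls inside a multi-byte character
// optional length check
if string.len() < len {
return "";
}
&string[len..]
}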
fn main() {
let example = "stringofletters"; // works with a String if you take a reference
let cropped = crop_with_allocation(example, 3);
println!("{}", cropped);
let cropped = crop_without_allocation(example, 3);
println!("{}", cropped);
}
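Byte-offset slicing like this panics when the offset lands inside a multi-byte character. If a checked variant is wanted, str::get returns None in that case and when the offset is past the end. A sketch, where crop_bytes_checked is a hypothetical name:

```rust
/// Hypothetical variant: `None` instead of a panic for a bad byte offset.
fn crop_bytes_checked(s: &str, byte_offset: usize) -> Option<&str> {
    s.get(byte_offset..)
}

fn main() {
    assert_eq!(crop_bytes_checked("stringofletters", 3), Some("ingofletters"));
    assert_eq!(crop_bytes_checked("ЖФ1", 1), None); // inside 'Ж'
    assert_eq!(crop_bytes_checked("ЖФ1", 2), Some("Ф1"));
    assert_eq!(crop_bytes_checked("abc", 10), None); // past the end
}
```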
My version:
fn crop_str(s: &str, n: usize) -> &str {
let mut it = s.chars();
for _ in 0..n {
it.next();
}
it.as_str()
}
#[test]
fn test_crop_str() {
assert_eq!(crop_str("123", 1), "23");
assert_eq!(crop_str("ЖФ1", 1), "Ф1");
assert_eq!(crop_str("ЖФ1", 2), "1");
}

Resources