Reusable Rust vector with associated vector of slices

I need rust code to read lines of a file, and break them into an array of slices. The working code is
use std::io::{self, BufRead};
fn main() {
let stdin = io::stdin();
let mut f = stdin.lock();
let mut line : Vec<u8> = Vec::new();
loop {
line.clear();
let sz = f.read_until(b'\n', &mut line).unwrap();
if sz == 0 {break};
let body : Vec<&[u8]> = line.split(|ch| *ch == b'\t').collect();
DoStuff(body);
}
}
However, that code is slower than I'd like. The code I want to write is
use std::io::{self, BufRead};
fn main() {
let stdin = io::stdin();
let mut f = stdin.lock();
let mut line : Vec<u8> = Vec::new();
let mut body: Vec<&[u8]> = Vec::new();
loop {
line.clear();
let sz = f.read_until(b'\n', &mut line).unwrap();
if sz == 0 {break};
body.extend(&mut line.split(|ch| *ch == b'\t'));
DoStuff(body);
body.clear();
}
}
but that runs afoul of the borrow checker.
In general, I'd like a class containing a Vec<u8> and an associated Vec<&[u8]>, which is the basis of a lot of C++ code I'm trying to replace.
Is there any way I can accomplish this?
I realize that I could replace the slices with pairs of integers, but that seems clumsy.
No, I can't just use the items from the iterator as they come through -- I need random access to the individual column values. In the simplified case where I do use the iterator directly, I get a 3X speedup, which is why I suspect a significant speedup by replacing collect with extend.
Other comments on this code are also welcome.

Just for the sake of completeness, and since you are coming from C++, a more Rusty way of writing the code would be
use std::io::{self, BufRead};
fn do_stuff(body: &[&str]) {}
fn main() {
for line in io::stdin().lock().lines() {
let line = line.unwrap();
let body = line.split('\t').collect::<Vec<_>>();
do_stuff(&body);
}
}
This uses .lines() from BufRead to get an iterator over \n-delimited lines from the input. It assumes that your input is actually valid UTF-8, which was not a requirement in your code. If it is not UTF-8, use BufRead's .split(b'\n') to read each line as a Vec<u8>, the slice method .split(|b| *b == b'\t') to break it into columns, and &[&[u8]] as the parameter type instead.
Notice that this does allocate and subsequently free a new Vec via .collect() every time the loop executes. We are somewhat relying on the allocator's free-list to make this cheap. But it is correct in all cases.
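For input that is not guaranteed to be UTF-8, the byte-oriented variant mentioned above might look like the following. This is only a sketch: `column_counts` and the return value of `do_stuff` are hypothetical names introduced for illustration, and the input is an in-memory byte slice rather than stdin so the example is self-contained.

```rust
use std::io::{self, BufRead};

// Hypothetical stand-in for the real per-line work: returns the column count.
fn do_stuff(body: &[&[u8]]) -> usize {
    body.len()
}

// Reads newline-delimited records as raw bytes and splits each one on tabs.
fn column_counts<R: BufRead>(input: R) -> io::Result<Vec<usize>> {
    let mut counts = Vec::new();
    // BufRead::split yields io::Result<Vec<u8>> with the delimiter removed.
    for line in input.split(b'\n') {
        let line = line?;
        // The slice method `split` takes a predicate, not a byte.
        let body: Vec<&[u8]> = line.split(|b| *b == b'\t').collect();
        counts.push(do_stuff(&body));
    }
    Ok(counts)
}

fn main() -> io::Result<()> {
    // 0xFF is not valid UTF-8, so `.lines()` would return an error here.
    let data: &[u8] = b"one\ttwo\xff\tthree\nfour\n";
    println!("{:?}", column_counts(data)?); // prints [3, 1]
    Ok(())
}
```

Like the `.lines()` version, this allocates a fresh Vec for every line and for every column list.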
The reason your second example does not compile (after fixing the call to DoStuff(&body)) is this:
12 | line.clear();
| ^^^^^^^^^^^^ mutable borrow occurs here
...
15 | body.extend(&mut line.split(|ch| *ch == b'\t'));
| ---- ---- immutable borrow occurs here
| |
| immutable borrow later used here
The problem here is the loop: Line 12 line.clear() will execute after line 15 body.extend() from the second iteration onwards. But the compiler has figured out that body borrows from line (it contains references to the fields inside line). The call to line.clear() mutably borrows line - all of line - and as far as the compiler is concerned is free to do anything it wants with the data it holds. This is an error because line.clear() could possibly mutate data that body has borrowed immutably. The compiler does not reason about the fact that .clear() obviously does not mutate the borrowed data, quite the opposite in fact, but the compiler's reasoning stops at the function signature.

It seems like the answer is
No, it's not possible to reuse the vector of slices.
The way to go is to make something like a slice, but with integer offsets rather than pointers. Code is attached, comments welcome.
Performance is currently 15% better than the C++, but the C++ is part of a larger system, and is probably doing some additional stuff.
/// pointers into a vector, simulating a slice without the ownership issues
#[derive(Debug, Clone)]
pub struct FakeSlice {
begin: u32,
end: u32,
}
/// A line of a text file, broken into columns.
/// Access to `line` and `parts` is allowed, but should seldom be necessary
/// `line` does not include the trailing newline
/// An empty line contains one empty column
///```
/// use std::io::BufRead;
/// let mut data = b"one\ttwo\tthree\n";
/// let mut dp = &data[..];
/// let mut line = cdx::TextLine::new();
/// let eof = line.read(&mut dp).unwrap();
/// assert_eq!(eof, false);
/// assert_eq!(line.strlen(), 13);
/// line.split(b'\t');
/// assert_eq!(line.len(), 3);
/// assert_eq!(line.get(1), b"two");
///```
#[derive(Debug, Clone)]
pub struct TextLine {
pub line: Vec<u8>,
pub parts: Vec<FakeSlice>,
}
impl TextLine {
/// make a new TextLine
pub fn new() -> TextLine {
TextLine {
line: Vec::new(),
parts: Vec::new(),
}
}
fn clear(&mut self) {
self.parts.clear();
self.line.clear();
}
/// How many columns in the line
pub fn len(&self) -> usize {
self.parts.len()
}
/// How many bytes in the line
pub fn strlen(&self) -> usize {
self.line.len()
}
/// should always be false, but required by clippy
pub fn is_empty(&self) -> bool {
self.parts.is_empty()
}
/// Get one column. Return an empty column if index is too big.
pub fn get(&self, index: usize) -> &[u8] {
if index >= self.parts.len() {
&self.line[0..0]
} else {
&self.line[self.parts[index].begin as usize..self.parts[index].end as usize]
}
}
/// Read a new line from a file, should generally be followed by `split`
pub fn read<T: std::io::BufRead>(&mut self, f: &mut T) -> std::io::Result<bool> {
self.clear();
let sz = f.read_until(b'\n', &mut self.line)?;
if sz == 0 {
Ok(true)
} else {
if self.line.last() == Some(&b'\n') {
self.line.pop();
}
Ok(false)
}
}
/// split the line into columns
/// hypothetically you could split on one delimiter, do some work, then split on a different delimiter.
pub fn split(&mut self, delim: u8) {
self.parts.clear();
let mut begin: u32 = 0;
let mut end: u32 = 0;
#[allow(clippy::explicit_counter_loop)] // I need the counter to be u32
for ch in self.line.iter() {
if *ch == delim {
self.parts.push(FakeSlice { begin, end });
begin = end + 1;
}
end += 1;
}
self.parts.push(FakeSlice { begin, end });
}
}
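The same offsets idea can also be sketched with `std::ops::Range<usize>` instead of a custom struct, at the cost of wider indices than the `u32` pair above. The helper name `split_ranges` is hypothetical; both buffers are reused across iterations, so the loop itself does not allocate once they have grown to size.

```rust
use std::io::{self, BufRead};
use std::ops::Range;

// Hypothetical helper: records the byte range of every `delim`-separated
// column of `line` in `cols`, reusing the allocation of `cols`.
fn split_ranges(line: &[u8], delim: u8, cols: &mut Vec<Range<usize>>) {
    cols.clear();
    let mut begin = 0;
    for (i, &b) in line.iter().enumerate() {
        if b == delim {
            cols.push(begin..i);
            begin = i + 1;
        }
    }
    // The last column runs to the end of the line (an empty line gives 0..0).
    cols.push(begin..line.len());
}

fn main() -> io::Result<()> {
    let mut input: &[u8] = b"one\ttwo\tthree\nfour\tfive\n";
    let mut line = Vec::new();
    let mut cols = Vec::new();
    loop {
        line.clear();
        if input.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        if line.last() == Some(&b'\n') {
            line.pop();
        }
        split_ranges(&line, b'\t', &mut cols);
        // Random access: a range is turned back into a slice only when needed.
        let second = &line[cols[1].clone()];
        println!("{}", String::from_utf8_lossy(second));
    }
    Ok(())
}
```

Because `cols` holds plain integers rather than references, clearing `line` at the top of the loop no longer conflicts with any outstanding borrow.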

Related

Temporarily cache owned value between iterator adapters

I'd like to know if there's a way to cache an owned value between iterator adapters, so that adapters later in the chain can reference it.
(Or if there's another way to allow later adapters to reference an owned value that lives inside the iterator chain.)
To illustrate what I mean, let's look at this (contrived) example:
I have a function that returns a String, which is called in an Iterator map() adapter, yielding an iterator over Strings. I'd like to get an iterator over the chars() in those Strings, but the chars() method requires a string slice, meaning a reference.
Is this possible to do, without first collecting the Strings?
Here's a minimal example that of course fails:
fn greet(c: &str) -> String {
"Hello, ".to_owned() + c
}
fn main() {
let names = ["Martin", "Helena", "Ingrid", "Joseph"];
let iterator = names.into_iter().map(greet);
let fails = iterator.flat_map(<str>::chars);
}
Using a closure instead of <str>::chars - |s| s.chars() - does of course not work either. It makes the types match, but breaks lifetimes.
Edit (2022-10-03): In response to the comments, here's some pseudocode of what I have in mind, but with incorrect lifetimes:
struct IteratorCache<'a, T, I>{
item : Option<T>,
inner : I,
_p : core::marker::PhantomData<&'a T>
}
impl<'a, T, I> Iterator for IteratorCache<'a, T,I>
where I: Iterator<Item=T>
{
type Item=&'a T;
fn next(&mut self) -> Option<&'a T> {
self.item = self.inner.next();
if let Some(x) = &self.item {
Some(&x)
} else {
None
}
}
}
The idea would be that the reference could stay valid until the next call to next(). However I don't know if this can be expressed with the function signature of the Iterator trait. (Or if this can be expressed at all.)
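The Iterator trait cannot express this, because its Item type cannot borrow from the iterator itself. The idea can still be sketched as an inherent method whose return value borrows from `&mut self`. In the sketch below, the name `next_ref` and the `new` constructor are assumptions made for illustration, and the type cannot be used with iterator adapters or `for` loops.

```rust
// A cache that owns the most recent item and lends out a reference to it.
struct IteratorCache<I: Iterator> {
    item: Option<I::Item>,
    inner: I,
}

impl<I: Iterator> IteratorCache<I> {
    fn new(inner: I) -> Self {
        Self { item: None, inner }
    }

    // Not Iterator::next: the returned reference borrows from `self`,
    // so it is only valid until the next call.
    fn next_ref(&mut self) -> Option<&I::Item> {
        self.item = self.inner.next();
        self.item.as_ref()
    }
}

fn main() {
    let names = ["Martin", "Helena"];
    let mut cache = IteratorCache::new(names.into_iter().map(|n| format!("Hello, {n}")));
    let mut out = String::new();
    // Each borrowed String is released before the next call to `next_ref`.
    while let Some(s) = cache.next_ref() {
        out.extend(s.chars());
    }
    println!("{out}");
}
```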
I don't think something like this exists yet, and collecting into a Vec<char> creates some overhead, but you can write such an iterator yourself with a little bit of trickery:
struct OwnedCharsIter {
s: String,
index: usize,
}
impl OwnedCharsIter {
pub fn new(s: String) -> Self {
Self { s, index: 0 }
}
}
impl Iterator for OwnedCharsIter {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
// Slice of leftover characters
let slice = &self.s[self.index..];
// Iterator over leftover characters
let mut chars = slice.chars();
// Query the next char
let next_char = chars.next()?;
// Compute the new index by looking at how many bytes are left
// after querying the next char
self.index = self.s.len() - chars.as_str().len();
// Return next char
Some(next_char)
}
}
fn greet(c: &str) -> String {
"Hello, ".to_owned() + c
}
fn main() {
let names = ["Martin", "Helena", "Ingrid", "Joseph"];
let iterator = names.into_iter().map(greet);
let chars_iter = iterator.flat_map(OwnedCharsIter::new);
println!("{:?}", chars_iter.collect::<String>())
}
"Hello, MartinHello, HelenaHello, IngridHello, Joseph"

How can I create a fixed size array of Strings using constant generics?

I have a function using a constant generic:
fn foo<const S: usize>() -> Vec<[String; S]> {
// Some code
let mut row: [String; S] = Default::default(); // Fails: Default is only implemented for arrays of up to 32 elements
// Some code
}
How can I create a fixed size array of Strings in my case? let mut row: [String; S] = ["".to_string(); S]; doesn't work because String doesn't implement the Copy trait.
You can do it with MaybeUninit and unsafe:
use std::mem::MaybeUninit;
fn foo<const S: usize>() -> Vec<[String; S]> {
// Some code
let mut row: [String; S] = unsafe {
let mut result = MaybeUninit::uninit();
let start = result.as_mut_ptr() as *mut String;
for pos in 0 .. S {
// SAFETY: safe because loop ensures `start.add(pos)`
// is always on an array element, of type String
start.add(pos).write(String::new());
}
// SAFETY: safe because loop ensures entire array
// has been manually initialised
result.assume_init()
};
// Some code
todo!()
}
Of course, it might be easier to abstract such logic to your own trait:
use std::mem::MaybeUninit;
trait DefaultArray {
fn default_array() -> Self;
}
impl<T: Default, const S: usize> DefaultArray for [T; S] {
fn default_array() -> Self {
let mut result = MaybeUninit::uninit();
let start = result.as_mut_ptr() as *mut T;
unsafe {
for pos in 0 .. S {
// SAFETY: safe because loop ensures `start.add(pos)`
// is always on an array element, of type T
start.add(pos).write(T::default());
}
// SAFETY: safe because loop ensures entire array
// has been manually initialised
result.assume_init()
}
}
}
(The only reason for using your own trait rather than Default is that implementations of the latter would conflict with those provided in the standard library for arrays of up to 32 elements; I wholly expect the standard library to replace its implementation of Default with something similar to the above once const generics have stabilised).
In which case you would now have:
fn foo<const S: usize>() -> Vec<[String; S]> {
// Some code
let mut row: [String; S] = DefaultArray::default_array();
// Some code
todo!()
}
See it on the Playground.
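On a toolchain where `std::array::from_fn` is available (it has been stable since Rust 1.63), the same array can be built without any unsafe code. The following is a sketch under that assumption; `empty_row` is a hypothetical helper name, and the body of `foo` is a placeholder for the elided "Some code".

```rust
use std::array;

// Hypothetical helper: builds an array of S empty Strings without `unsafe`.
fn empty_row<const S: usize>() -> [String; S] {
    // The closure is called once per index, so no Copy or Default bound is needed.
    array::from_fn(|_| String::new())
}

fn foo<const S: usize>() -> Vec<[String; S]> {
    let mut row: [String; S] = empty_row();
    // Placeholder work: write something into the first column, if there is one.
    if let Some(first) = row.first_mut() {
        first.push_str("x");
    }
    vec![row]
}

fn main() {
    // 40 is larger than the 32-element limit of the Default implementation.
    let rows = foo::<40>();
    println!("{} row(s) of {} columns", rows.len(), rows[0].len());
}
```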
When this answer was written, const generics could not yet be compiled on stable Rust. As @AlexLarionov said, you can try to use procedural macros, but that approach still has its bugs and limitations.
If you need a generic that has to be a number, you can use the num crate, or the more verbose std::num.

Rust creating String with pointer offset

So let's say I have a String, "Foo Bar" and I want to create a substring of "Bar" without allocating new memory.
So I moved the raw pointer of the original string to the start of the substring (in this case offsetting it by 4) and use the String::from_raw_parts() function to create the String.
So far I have the following code, which as far as I understand should do this just fine. I just don't understand why this does not work.
use std::mem;
fn main() {
let s = String::from("Foo Bar");
let ptr = s.as_ptr();
mem::forget(s);
unsafe {
// no error when using ptr.add(0)
let txt = String::from_raw_parts(ptr.add(4) as *mut _, 3, 3);
println!("{:?}", txt); // This even prints "Bar" but crashes afterwards
println!("prints because 'txt' is still in scope");
}
println!("won't print because 'txt' was dropped",)
}
I get the following error on Windows:
error: process didn't exit successfully: `target\debug\main.exe` (exit code: 0xc0000374, STATUS_HEAP_CORRUPTION)
And these on Linux (cargo run; cargo run --release):
munmap_chunk(): invalid pointer
free(): invalid pointer
I think it has something to do with the destructor of String, because as long as txt is in scope the program runs just fine.
Another thing to notice is that when I use ptr.add(0) instead of ptr.add(4) it runs without an error.
Creating a slice, on the other hand, didn't give me any problems. Dropping that worked just fine.
let t = slice::from_raw_parts(ptr.add(4), 3);
In the end I want to split an owned String in place into multiple owned Strings without allocating new memory.
Any help is appreciated.
The reason for the errors is the way that the allocator works. It is Undefined Behaviour to ask the allocator to free a pointer that it didn't give you in the first place. In this case, the allocator allocated 7 bytes for s and returned a pointer to the first one. However, when txt is dropped, it tells the allocator to deallocate a pointer to byte 4, which it has never seen before. This is why there is no issue when you add(0) instead of add(4).
Using unsafe correctly is hard, and you should avoid it where possible.
Part of the purpose of the &str type is to allow portions of an owned string to be shared, so I would strongly encourage you to use those if you can.
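As a minimal sketch of that borrowed approach, the substring can be taken as a `&str` that points into the original allocation. The helper `after_first_space` is hypothetical and only illustrates the idea for the "Foo Bar" example.

```rust
// Hypothetical helper: returns everything after the first space, borrowing from `s`.
fn after_first_space(s: &str) -> &str {
    match s.find(' ') {
        Some(i) => &s[i + 1..],
        None => "",
    }
}

fn main() {
    let s = String::from("Foo Bar");
    let txt = after_first_space(&s);
    // `txt` points into the buffer owned by `s`; nothing new was allocated.
    assert_eq!(txt.as_ptr(), s.as_ptr().wrapping_add(4));
    println!("{txt}"); // prints Bar
    // Only `s` is freed at the end of main, with the pointer the allocator gave out.
}
```

The borrow checker ensures that `txt` cannot outlive `s`, which is exactly the guarantee the raw-pointer version gives up.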
If the reason you can't just use &str on its own is because you aren't able to track the lifetimes back to the original String, then there are still some solutions, with different trade-offs:
Leak the memory, so it's effectively static:
let mut s = String::from("Foo Bar");
let s = Box::leak(s.into_boxed_str());
let txt: &'static str = &s[4..];
let s: &'static str = &s[..4];
Obviously, you can only do this a few times in your application, or else you are going to use up too much memory that you can't get back.
Use reference-counting to make sure that the original String stays around long enough for all of the slices to remain valid. Here is a sketch solution:
use std::{fmt, ops::Deref, rc::Rc};
struct RcStr {
rc: Rc<String>,
start: usize,
len: usize,
}
impl RcStr {
fn from_rc_string(rc: Rc<String>, start: usize, len: usize) -> Self {
RcStr { rc, start, len }
}
fn as_str(&self) -> &str {
&self.rc[self.start..self.start + self.len]
}
}
impl Deref for RcStr {
type Target = str;
fn deref(&self) -> &str {
self.as_str()
}
}
impl fmt::Display for RcStr {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
fmt::Display::fmt(self.as_str(), f)
}
}
impl fmt::Debug for RcStr {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
fmt::Debug::fmt(self.as_str(), f)
}
}
fn main() {
let s = Rc::new(String::from("Foo Bar"));
let txt = RcStr::from_rc_string(Rc::clone(&s), 4, 3);
let s = RcStr::from_rc_string(Rc::clone(&s), 0, 4);
println!("{:?}", txt); // "Bar"
println!("{:?}", s); // "Foo "
}

error: `line` does not live long enough but it's ok in playground

I can't figure out why my local variable line does not live long enough. You can see my code below. It works on the Rust playground.
I may have an idea of the issue: I use a structure (load is a method of this structure). As I want to store the result of the line in a member of my struct, that could be the issue. But I don't see what I should do to resolve this problem.
pub struct Config<'a> {
file: &'a str,
params: HashMap<&'a str, &'a str>
}
impl<'a> Config<'a> {
pub fn new(file: &str) -> Config {
Config { file: file, params: HashMap::new() }
}
pub fn load(&mut self) -> () {
let f = match fs::File::open(self.file) {
Ok(e) => e,
Err(e) => {
println!("Failed to load {}, {}", self.file, e);
return;
}
};
let mut reader = io::BufReader::new(f);
let mut buffer = String::new();
loop {
let result = reader.read_line(&mut buffer);
if result.is_ok() && result.ok().unwrap() > 0 {
let line: Vec<String> = buffer.split("=").map(String::from).collect();
let key = line[0].trim();
let value = line[1].trim();
self.params.insert(key, value);
}
buffer.clear();
}
}
...
}
And I get this error:
src/conf.rs:33:27: 33:31 error: `line` does not live long enough
src/conf.rs:33 let key = line[0].trim();
^~~~
src/conf.rs:16:34: 41:6 note: reference must be valid for the lifetime 'a as defined on the block at 16:33...
src/conf.rs:16 pub fn load(&mut self) -> () {
src/conf.rs:17 let f = match fs::File::open(self.file) {
src/conf.rs:18 Ok(e) => e,
src/conf.rs:19 Err(e) => {
src/conf.rs:20 println!("Failed to load {}, {}", self.file, e);
src/conf.rs:21 return;
...
src/conf.rs:31:87: 37:14 note: ...but borrowed value is only valid for the block suffix following statement 0 at 31:86
src/conf.rs:31 let line: Vec<String> = buffer.split("=").map(String::from).collect();
src/conf.rs:32
src/conf.rs:33 let key = line[0].trim();
src/conf.rs:34 let value = line[1].trim();
src/conf.rs:35
src/conf.rs:36 self.params.insert(key, value);
...
There are three steps in realizing why this does not work.
let line: Vec<String> = buffer.split("=").map(String::from).collect();
let key = line[0].trim();
let value = line[1].trim();
self.params.insert(key, value);
line is a Vec of Strings, meaning the vector owns the strings it contains. An effect of this is that when the vector is freed from memory, its elements, the strings, are also freed.
If we look at str::trim, we see that it takes and returns a &str. In other words, the function does not allocate anything or transfer ownership: the string it returns is simply a slice of the original string. So if we were to free the original string, the trimmed string would no longer point at valid data.
The signature of HashMap::insert is fn insert(&mut self, k: K, v: V) -> Option<V>. The function moves both the key and the value, because these need to be valid for as long as they may be in the hashmap. We would like to give the hashmap the two strings. However, both key and value are just references to strings which are owned by the vector (we are only borrowing them), so we can't give them away.
The solution is simple: copy the strings after they have been split.
let line: Vec<String> = buffer.split("=").map(String::from).collect();
let key = line[0].trim().to_string();
let value = line[1].trim().to_string();
self.params.insert(key, value);
This will allocate two new strings, and copy the trimmed slices into the new strings.
We could have moved the strings out of the vector (e.g. with Vec::remove) if we didn't trim them afterwards; I was unable to find an easy way of trimming a string without allocating a new one.
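One possible way to trim an owned String without a new allocation is to cut it down in place with `truncate` and `replace_range`. This is a sketch rather than a standard library facility; `trim_in_place` is a hypothetical helper, and removing the leading whitespace still shifts the remaining bytes within the existing buffer.

```rust
// Hypothetical helper: trims leading and trailing whitespace in place.
fn trim_in_place(s: &mut String) {
    // Drop trailing whitespace first; this only shortens the length.
    let end = s.trim_end().len();
    s.truncate(end);
    // Then remove the leading whitespace, shifting the rest to the front.
    let start = s.len() - s.trim_start().len();
    s.replace_range(..start, "");
}

fn main() {
    let mut line: Vec<String> = " key = value \n".split('=').map(String::from).collect();
    // Remove the higher index first so the lower index stays valid.
    let mut value = line.remove(1);
    let mut key = line.remove(0);
    trim_in_place(&mut key);
    trim_in_place(&mut value);
    println!("{key:?} -> {value:?}");
}
```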
In addition, as malbarbo mentions, we can avoid the extra allocation that is done with map(String::from), and the creation of the vector with collect(), by simply omitting them.
In this case you have to use String instead of &str. See this to understand the difference.
You can also eliminate the creation of the intermediate vector and use the iterator returned by split directly:
pub struct Config<'a> {
file: &'a str,
params: HashMap<String, String>
}
...
let mut line = buffer.split("=");
let key = line.next().unwrap().trim().to_string();
let value = line.next().unwrap().trim().to_string();
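Putting these pieces together, a complete loader might look like the sketch below. It is not the original Config type: `parse_params` is a hypothetical free function, it reads from any BufRead so that it can be run without a file, it uses `str::split_once` instead of calling `next()` twice, and it silently skips lines without an `=`.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

// Hypothetical helper: parses `key = value` lines into owned Strings.
fn parse_params<R: BufRead>(reader: R) -> io::Result<HashMap<String, String>> {
    let mut params = HashMap::new();
    // `lines()` stops at end of input, so no manual loop exit is needed.
    for line in reader.lines() {
        let line = line?;
        // Lines without '=' are skipped instead of panicking.
        if let Some((key, value)) = line.split_once('=') {
            // `to_string` copies the trimmed slices, so the map owns its data.
            params.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    Ok(params)
}

fn main() -> io::Result<()> {
    let input = "name = demo\nport=8080\nnot a pair\n";
    let params = parse_params(input.as_bytes())?;
    println!("{:?}", params.get("port")); // prints Some("8080")
    Ok(())
}
```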

Using str and String interchangeably

Suppose I'm trying to do a fancy zero-copy parser in Rust using &str, but sometimes I need to modify the text (e.g. to implement variable substitution). I really want to do something like this:
fn main() {
let mut v: Vec<&str> = "Hello there $world!".split_whitespace().collect();
for t in v.iter_mut() {
if (t.contains("$world")) {
*t = &t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
But of course the String returned by t.replace() doesn't live long enough. Is there a nice way around this? Perhaps there is a type which means "ideally a &str but if necessary a String"? Or maybe there is a way to use lifetime annotations to tell the compiler that the returned String should be kept alive until the end of main() (or have the same lifetime as v)?
Rust has exactly what you want in the form of the Cow (clone-on-write) type.
use std::borrow::Cow;
fn main() {
let mut v: Vec<_> = "Hello there $world!".split_whitespace()
.map(|s| Cow::Borrowed(s))
.collect();
for t in v.iter_mut() {
if t.contains("$world") {
*t.to_mut() = t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
As @sellibitze correctly notes, to_mut() creates a new String, which causes a heap allocation to store the previously borrowed value. If you are sure you only have borrowed strings, then you can use
*t = Cow::Owned(t.replace("$world", "Earth"));
In case the Vec contains Cow::Owned elements, this would still throw away the allocation. You can prevent that using the following very fragile and unsafe code inside your for loop. (It does direct byte-based manipulation of UTF-8 strings and relies on the fact that the replacement is exactly one byte shorter than the pattern, so removing the $ sign leaves exactly enough room for it.)
let mut last_pos = 0; // so we don't start at the beginning every time
while let Some(pos) = t[last_pos..].find("$world") {
let p = pos + last_pos; // find returns an offset relative to last_pos
last_pos = p + 5; // resume right after the 5 bytes of "Earth" written below
unsafe {
let s = t.to_mut().as_mut_vec(); // operating on Vec is easier
s.remove(p); // remove $ sign
// overwrite "world" with "Earth" byte by byte; both are 5 ASCII bytes
for (c, sc) in "Earth".bytes().zip(&mut s[p..]) {
*sc = c;
}
}
}
Note that this is tailored exactly to the "$world" -> "Earth" mapping. Any other mappings require careful consideration inside the unsafe code.
Use std::borrow::Cow, specifically as Cow<'a, str>, where 'a is the lifetime of the string being parsed.
use std::borrow::Cow;
fn main() {
let mut v: Vec<Cow<'static, str>> = vec![];
v.push("oh hai".into());
v.push(format!("there, {}.", "Mark").into());
println!("{:?}", v);
}
Produces:
["oh hai", "there, Mark."]
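Applied to the tokenizer from the question, the "borrow unless a substitution is needed" idea could be sketched as a function that returns a Cow. The name `substitute` is hypothetical; only tokens containing `$world` allocate.

```rust
use std::borrow::Cow;

// Hypothetical helper: borrows the token unchanged, or owns the substituted copy.
fn substitute(token: &str) -> Cow<'_, str> {
    if token.contains("$world") {
        Cow::Owned(token.replace("$world", "Earth"))
    } else {
        Cow::Borrowed(token)
    }
}

fn main() {
    let v: Vec<Cow<str>> = "Hello there $world!"
        .split_whitespace()
        .map(substitute)
        .collect();
    println!("{:?}", v); // prints ["Hello", "there", "Earth!"]
}
```

Since Cow<str> dereferences to str, the rest of the parser can treat borrowed and owned tokens the same way.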
