Rust creating String with pointer offset

So let's say I have a String, "Foo Bar" and I want to create a substring of "Bar" without allocating new memory.
So I moved the raw pointer of the original string to the start of the substring (in this case offsetting it by 4) and use the String::from_raw_parts() function to create the String.
So far I have the following code, which as far as I understand should do this just fine. I just don't understand why this does not work.
use std::mem;
fn main() {
let s = String::from("Foo Bar");
let ptr = s.as_ptr();
mem::forget(s);
unsafe {
// no error when using ptr.add(0)
let txt = String::from_raw_parts(ptr.add(4) as *mut _, 3, 3);
println!("{:?}", txt); // This even prints "Bar" but crashes afterwards
println!("prints because 'txt' is still in scope");
}
println!("won't print because 'txt' was dropped",)
}
I get the following error on Windows:
error: process didn't exit successfully: `target\debug\main.exe` (exit code: 0xc0000374, STATUS_HEAP_CORRUPTION)
And these on Linux (cargo run; cargo run --release):
munmap_chunk(): invalid pointer
free(): invalid pointer
I think it has something to do with the destructor of String, because as long as txt is in scope the program runs just fine.
Another thing to notice is that when I use ptr.add(0) instead of ptr.add(4) it runs without an error.
Creating a slice, on the other hand, didn't give me any problems. Dropping that worked just fine.
let t = slice::from_raw_parts(ptr.add(4), 3);
In the end I want to split an owned String in place into multiple owned Strings without allocating new memory.
Any help is appreciated.

The reason for the errors is the way that the allocator works. It is Undefined Behaviour to ask the allocator to free a pointer that it didn't give you in the first place. In this case, the allocator allocated 7 bytes for s and returned a pointer to the first one. However, when txt is dropped, it tells the allocator to deallocate a pointer to byte 4, which it has never seen before. This is why there is no visible issue when you add(0) instead of add(4), although passing a capacity of 3 for what was a 7-byte allocation still breaks the contract of String::from_raw_parts, so that version is not sound either.
Using unsafe correctly is hard, and you should avoid it where possible.
Part of the purpose of the &str type is to allow portions of an owned string to be shared, so I would strongly encourage you to use those if you can.
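As a small illustration of that sharing, the sketch below checks that slices of a String point into the String's own buffer, and that removing the front of a String with drain keeps the original allocation. The helper names shares_buffer and keep_tail are made up for this example and are not part of the standard library.

```rust
/// Returns true if `sub` lies entirely inside the buffer that `owner` points to.
/// (Hypothetical helper, only used to make the sharing visible.)
fn shares_buffer(owner: &str, sub: &str) -> bool {
    let start = owner.as_ptr() as usize;
    let end = start + owner.len();
    let p = sub.as_ptr() as usize;
    p >= start && p + sub.len() <= end
}

/// Keeps only the bytes from `from` onward, reusing the existing allocation.
/// Panics if `from` is not on a char boundary.
fn keep_tail(mut s: String, from: usize) -> String {
    // Dropping the Drain shifts the tail to the front of the same buffer.
    drop(s.drain(..from));
    s
}

fn main() {
    let s = String::from("Foo Bar");

    // Borrowed slices: no allocation, both point into `s`.
    let (foo, bar) = s.split_at(4);
    assert!(shares_buffer(&s, foo) && shares_buffer(&s, bar));
    println!("{:?} {:?}", foo, bar);

    // Owned tail: same heap pointer as before, so the allocator is happy.
    let ptr_before = s.as_ptr();
    let tail = keep_tail(s, 4);
    assert_eq!(tail, "Bar");
    assert_eq!(tail.as_ptr(), ptr_before);
}
```

Note that keep_tail gives up the "Foo " part; it only helps when a single owned piece is needed.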
If the reason you can't just use &str on its own is because you aren't able to track the lifetimes back to the original String, then there are still some solutions, with different trade-offs:
Leak the memory, so it's effectively static:
let s = String::from("Foo Bar");
let s = Box::leak(s.into_boxed_str());
let txt: &'static str = &s[4..];
let s: &'static str = &s[..4];
Obviously, you can only do this a few times in your application, or else you are going to use up too much memory that you can't get back.
Use reference-counting to make sure that the original String stays around long enough for all of the slices to remain valid. Here is a sketch solution:
use std::{fmt, ops::Deref, rc::Rc};
struct RcStr {
rc: Rc<String>,
start: usize,
len: usize,
}
impl RcStr {
fn from_rc_string(rc: Rc<String>, start: usize, len: usize) -> Self {
RcStr { rc, start, len }
}
fn as_str(&self) -> &str {
&self.rc[self.start..self.start + self.len]
}
}
impl Deref for RcStr {
type Target = str;
fn deref(&self) -> &str {
self.as_str()
}
}
impl fmt::Display for RcStr {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
fmt::Display::fmt(self.as_str(), f)
}
}
impl fmt::Debug for RcStr {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
fmt::Debug::fmt(self.as_str(), f)
}
}
fn main() {
let s = Rc::new(String::from("Foo Bar"));
let txt = RcStr::from_rc_string(Rc::clone(&s), 4, 3);
let s = RcStr::from_rc_string(Rc::clone(&s), 0, 4);
println!("{:?}", txt); // "Bar"
println!("{:?}", s); // "Foo "
}
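Going one step further towards the goal stated in the question (splitting one owned String into several owned pieces), here is a minimal sketch of a split function built on the same reference-counting idea. SharedStr and split_shared are hypothetical names. The text bytes are never copied, but Rc::new does make one small allocation for the reference counts and the String header, so this is not strictly allocation-free.

```rust
use std::ops::Range;
use std::rc::Rc;

/// Hypothetical owned handle to a byte range of a shared String.
#[derive(Clone, Debug)]
struct SharedStr {
    buf: Rc<String>,
    range: Range<usize>,
}

impl SharedStr {
    fn as_str(&self) -> &str {
        &self.buf[self.range.clone()]
    }
}

/// Splits `s` at byte index `mid` into two handles that share one buffer.
/// Panics if `mid` is not on a char boundary.
fn split_shared(s: String, mid: usize) -> (SharedStr, SharedStr) {
    assert!(s.is_char_boundary(mid), "mid must lie on a char boundary");
    let len = s.len();
    let buf = Rc::new(s); // moves the String; the text bytes are not copied
    let head = SharedStr { buf: Rc::clone(&buf), range: 0..mid };
    let tail = SharedStr { buf, range: mid..len };
    (head, tail)
}

fn main() {
    let (head, tail) = split_shared(String::from("Foo Bar"), 4);
    drop(head); // the tail stays valid after the other piece is gone
    println!("{:?}", tail.as_str());
}
```

The buffer is freed only when the last handle is dropped, which is the trade-off compared with truly independent Strings.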

Related

Where does this slice point to in Rust? Who is its owner?

I'm currently learning Rust's lifetimes and ownership. I encountered code like this (minimized):
fn main() {
let mut text = Text::new();
println!("text: {:?}", text);
let mut byte = [1_u8];
let txt = text.get_text(byte);
println!("byte: {:?}", byte);
println!("txt: {:?}", txt);
byte[0] = 7_u8;
println!("byte: {:?}", byte);
println!("txt: {:?}", txt);
}
#[derive(Debug)]
struct Text<'a> {
temp: [u8; 1],
data: &'a [u8],
}
impl<'a> Text<'a> {
fn new() -> Self {
Text {
temp: [0_u8],
data: &[0],
}
}
fn get_lifetime(&self) -> &'a Self {
unsafe {
// &*(self as *const Self)
std::mem::transmute(self)
}
}
fn get_text(&mut self, byte: [u8; 1]) -> Text<'a> {
self.temp = byte;
let res = Text {
temp: [0_u8],
data: &self.get_lifetime().temp,
};
res
}
}
the output is
text: Text { temp: [0], data: [0] }
byte: [1]
txt: Text { temp: [0], data: [1] }
byte: [7]
txt: Text { temp: [0], data: [1] }
It seems that the slice data inside the variable text points to some place independent of byte. I think maybe byte is copied into get_text. But does the slice data always refer to the copied one? How does Rust determine when to drop the copied byte?
Furthermore, I'm confused by the use of get_lifetime. Can anyone explain its usage further, and whether it is safe to use?
"It seems that the slice data inside variable text point to some place independent of byte."
The value of data is initially set to &[0]. A slice literal is 'static and the reference is valid for the entire duration of the program.
After you call get_text, the value of byte is copied into temp. Then data is made into a reference to that field.
This is not safe. In the code you've actually shown, there is no problem. The problem is how this struct might be used later.
Consider this example:
fn main() {
let mut text = Text::new();
let mut byte = [1_u8];
let txt = text.get_text(byte);
println!("txt: {:?}", txt); // txt: Text { temp: [0], data: [1] }
use_text(text);
}
// Don't inline, so it's guaranteed that text is physically moved in memory
#[inline(never)]
fn use_text<'a>(text: Text<'a>) {
println!("txt: {:?}", text); // Undefined Behaviour
}
In this example text is moved into the use_text function's call frame. That means it is stored in a new location in memory. But data will still point to the old place in memory where the temp field used to be stored.
This illustrates one of the big dangers of using unsafe. The consequences may be observed far away from where you used that keyword. This example was sound until text was moved, but after that point, it is Undefined Behaviour.
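One way to avoid the dangling reference, sketched here under the assumption that data only ever needs to view temp, is to drop the stored reference entirely and hand out the slice on demand. The method names set and data are hypothetical; the point is that the returned borrow is tied to &self, so the compiler can check it and moving the struct is harmless.

```rust
#[derive(Debug)]
struct Text {
    temp: [u8; 1],
}

impl Text {
    fn new() -> Self {
        Text { temp: [0] }
    }

    /// Stores a copy of `byte` (arrays of u8 are Copy).
    fn set(&mut self, byte: [u8; 1]) {
        self.temp = byte;
    }

    /// Borrows the stored bytes; the borrow cannot outlive `self`.
    fn data(&self) -> &[u8] {
        &self.temp
    }
}

// Don't inline, so `text` really is moved into a new call frame.
#[inline(never)]
fn use_text(text: Text) -> u8 {
    // The slice is computed from the struct's current location.
    text.data()[0]
}

fn main() {
    let mut text = Text::new();
    text.set([1]);
    println!("{:?}", text.data());
    println!("{}", use_text(text));
}
```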

Temporarily cache owned value between iterator adapters

I'd like to know if there's a way to cache an owned value between iterator adapters, so that adapters later in the chain can reference it.
(Or if there's another way to allow later adapters to reference an owned value that lives inside the iterator chain.)
To illustrate what I mean, let's look at this (contrived) example:
I have a function that returns a String, which is called in an Iterator map() adapter, yielding an iterator over Strings. I'd like to get an iterator over the chars() in those Strings, but the chars() method requires a string slice, meaning a reference.
Is this possible to do, without first collecting the Strings?
Here's a minimal example that of course fails:
fn greet(c: &str) -> String {
"Hello, ".to_owned() + c
}
fn main() {
let names = ["Martin", "Helena", "Ingrid", "Joseph"];
let iterator = names.into_iter().map(greet);
let fails = iterator.flat_map(<str>::chars);
}
Playground
Using a closure instead of <str>::chars - |s| s.chars() - does of course not work either. It makes the types match, but breaks lifetimes.
Edit (2022-10-03): In response to the comments, here's some pseudocode of what I have in mind, but with incorrect lifetimes:
struct IteratorCache<'a, T, I>{
item : Option<T>,
inner : I,
_p : core::marker::PhantomData<&'a T>
}
impl<'a, T, I> Iterator for IteratorCache<'a, T,I>
where I: Iterator<Item=T>
{
type Item=&'a T;
fn next(&mut self) -> Option<&'a T> {
self.item = self.inner.next();
if let Some(x) = &self.item {
Some(&x)
} else {
None
}
}
}
The idea would be that the reference could stay valid until the next call to next(). However I don't know if this can be expressed with the function signature of the Iterator trait. (Or if this can be expressed at all.)
I don't think something like this exists yet, and collecting into a Vec<char> creates some overhead, but you can write such an iterator yourself with a little bit of trickery:
struct OwnedCharsIter {
s: String,
index: usize,
}
impl OwnedCharsIter {
pub fn new(s: String) -> Self {
Self { s, index: 0 }
}
}
impl Iterator for OwnedCharsIter {
type Item = char;
fn next(&mut self) -> Option<Self::Item> {
// Slice of leftover characters
let slice = &self.s[self.index..];
// Iterator over leftover characters
let mut chars = slice.chars();
// Query the next char
let next_char = chars.next()?;
// Compute the new index by looking at how many bytes are left
// after querying the next char
self.index = self.s.len() - chars.as_str().len();
// Return next char
Some(next_char)
}
}
fn greet(c: &str) -> String {
"Hello, ".to_owned() + c
}
fn main() {
let names = ["Martin", "Helena", "Ingrid", "Joseph"];
let iterator = names.into_iter().map(greet);
let chars_iter = iterator.flat_map(OwnedCharsIter::new);
println!("{:?}", chars_iter.collect::<String>())
}
"Hello, MartinHello, HelenaHello, IngridHello, Joseph"
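For comparison, the Vec<char> route mentioned above is shorter to write. This is only a sketch of that alternative, with greeting_chars as a made-up function name; it allocates one extra Vec per String, which is the overhead that OwnedCharsIter avoids.

```rust
fn greet(c: &str) -> String {
    "Hello, ".to_owned() + c
}

/// Hypothetical helper: concatenates the chars of every greeting.
fn greeting_chars(names: &[&str]) -> String {
    names
        .iter()
        .map(|name| greet(name))
        // Collect each String's chars into an owned Vec so that nothing
        // borrows from the temporary String once the closure returns.
        .flat_map(|s| s.chars().collect::<Vec<char>>())
        .collect()
}

fn main() {
    let names = ["Martin", "Helena", "Ingrid", "Joseph"];
    println!("{:?}", greeting_chars(&names));
}
```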

reusable rust vector with associated vector of slices

I need rust code to read lines of a file, and break them into an array of slices. The working code is
use std::io::{self, BufRead};
fn main() {
let stdin = io::stdin();
let mut f = stdin.lock();
let mut line : Vec<u8> = Vec::new();
loop {
line.clear();
let sz = f.read_until(b'\n', &mut line).unwrap();
if sz == 0 {break};
let body : Vec<&[u8]> = line.split(|ch| *ch == b'\t').collect();
DoStuff(body);
}
}
However, that code is slower than I'd like. The code I want to write is
use std::io::{self, BufRead};
fn main() {
let stdin = io::stdin();
let mut f = stdin.lock();
let mut line : Vec<u8> = Vec::new();
let mut body: Vec<&[u8]> = Vec::new();
loop {
line.clear();
let sz = f.read_until(b'\n', &mut line).unwrap();
if sz == 0 {break};
body.extend(&mut line.split(|ch| *ch == b'\t'));
DoStuff(body);
body.clear();
}
}
but that runs afoul of the borrow checker.
In general, I'd like a class containing a Vec<u8> and an associated Vec<&[u8]>, which is the basis of a lot of C++ code I'm trying to replace.
Is there any way I can accomplish this?
I realize that I could replace the slices with pairs of integers, but that seems clumsy.
No, I can't just use the items from the iterator as they come through -- I need random access to the individual column values. In the simplified case where I do use the iterator directly, I get a 3X speedup, which is why I suspect a significant speedup by replacing collect with extend.
Other comments on this code are also welcome.
Just for sake of completeness, and since you are coming from C++, a more Rusty way of writing the code would be
use std::io::{self, BufRead};
fn do_stuff(body: &[&str]) {}
fn main() {
for line in io::stdin().lock().lines() {
let line = line.unwrap();
let body = line.split('\t').collect::<Vec<_>>();
do_stuff(&body);
}
}
This uses .lines() from BufRead to get an iterator over \n-delimited lines from the input. It assumes that your input is actually valid UTF8, which in your code was not a requirement. If it is not UTF8, use BufRead's .split(b'\n'), the slice method .split(|b| *b == b'\t') and &[&[u8]] instead.
Notice that this does allocate and subsequently free a new Vec via .collect() every time the loop executes. We are somewhat relying on the allocator's free-list to make this cheap. But it is correct in all cases.
The reason your second example does not compile (after changing the call to DoStuff(&body)) is this:
12 |         line.clear();
   |         ^^^^^^^^^^^^ mutable borrow occurs here
...
15 |         body.extend(&mut line.split(|ch| *ch == b'\t'));
   |         ----             ---- immutable borrow occurs here
   |         |
   |         immutable borrow later used here
The problem here is the loop: Line 12 line.clear() will execute after line 15 body.extend() from the second iteration onwards. But the compiler has figured out that body borrows from line (it contains references to the fields inside line). The call to line.clear() mutably borrows line - all of line - and as far as the compiler is concerned is free to do anything it wants with the data it holds. This is an error because line.clear() could possibly mutate data that body has borrowed immutably. The compiler does not reason about the fact that .clear() obviously does not mutate the borrowed data, quite the opposite in fact, but the compiler's reasoning stops at the function signature.
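The conflict goes away if the reusable vector holds index ranges instead of references, because ranges do not borrow from line. The following is a minimal sketch of that idea, not a drop-in replacement for the code in the question: split_ranges is a hypothetical helper, and the input comes from in-memory byte strings rather than stdin so that the example is self-contained.

```rust
use std::ops::Range;

/// Fills `out` with the byte ranges of the columns in `line`.
/// `out` is cleared first, so its allocation is reused across calls.
fn split_ranges(line: &[u8], delim: u8, out: &mut Vec<Range<usize>>) {
    out.clear();
    let mut begin = 0;
    for (i, &ch) in line.iter().enumerate() {
        if ch == delim {
            out.push(begin..i);
            begin = i + 1;
        }
    }
    out.push(begin..line.len());
}

fn main() {
    let inputs: [&[u8]; 2] = [&b"one\ttwo\tthree"[..], &b"a\tb"[..]];
    let mut line: Vec<u8> = Vec::new();
    let mut body: Vec<Range<usize>> = Vec::new();
    for input in inputs.iter() {
        // Both buffers are reused; `body` holds no borrow of `line`,
        // so clearing `line` here is accepted by the borrow checker.
        line.clear();
        line.extend_from_slice(input);
        split_ranges(&line, b'\t', &mut body);
        // Random access to a column: turn a range back into a slice.
        let second = &line[body[1].clone()];
        println!("{} columns, second = {:?}", body.len(), second);
    }
}
```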
It seems like the answer is
No, it's not possible to reuse the vector of slices.
The way to go is to make something like a slice, but with integer offsets rather than pointers. Code is attached, comments welcome.
Performance is currently 15% better than the C++, but the C++ is part of a larger system, and is probably doing some additional stuff.
/// pointers into a vector, simulating a slice without the ownership issues
#[derive(Debug, Clone)]
pub struct FakeSlice {
begin: u32,
end: u32,
}
/// A line of a text file, broken into columns.
/// Access to the `line` and `parts` is allowed, but should seldom be necessary
/// `line` does not include the trailing newline
/// An empty line contains one empty column
///```
/// use std::io::BufRead;
/// let mut data = b"one\ttwo\tthree\n";
/// let mut dp = &data[..];
/// let mut line = cdx::TextLine::new();
/// let eof = line.read(&mut dp).unwrap();
/// assert_eq!(eof, false);
/// assert_eq!(line.strlen(), 13);
/// line.split(b'\t');
/// assert_eq!(line.len(), 3);
/// assert_eq!(line.get(1), b"two");
///```
#[derive(Debug, Clone)]
pub struct TextLine {
pub line: Vec<u8>,
pub parts: Vec<FakeSlice>,
}
impl TextLine {
/// make a new TextLine
pub fn new() -> TextLine {
TextLine {
line: Vec::new(),
parts: Vec::new(),
}
}
fn clear(&mut self) {
self.parts.clear();
self.line.clear();
}
/// How many columns in the line
pub fn len(&self) -> usize {
self.parts.len()
}
/// How many bytes in the line
pub fn strlen(&self) -> usize {
self.line.len()
}
/// should always be false, but required by clippy
pub fn is_empty(&self) -> bool {
self.parts.is_empty()
}
/// Get one column. Return an empty column if index is too big.
pub fn get(&self, index: usize) -> &[u8] {
if index >= self.parts.len() {
&self.line[0..0]
} else {
&self.line[self.parts[index].begin as usize..self.parts[index].end as usize]
}
}
/// Read a new line from a file, should generally be followed by `split`
pub fn read<T: std::io::BufRead>(&mut self, f: &mut T) -> std::io::Result<bool> {
self.clear();
let sz = f.read_until(b'\n', &mut self.line)?;
if sz == 0 {
Ok(true)
} else {
if self.line.last() == Some(&b'\n') {
self.line.pop();
}
Ok(false)
}
}
/// split the line into columns
/// hypothetically you could split on one delimiter, do some work, then split on a different delimiter.
pub fn split(&mut self, delim: u8) {
self.parts.clear();
let mut begin: u32 = 0;
let mut end: u32 = 0;
#[allow(clippy::explicit_counter_loop)] // I need the counter to be u32
for ch in self.line.iter() {
if *ch == delim {
self.parts.push(FakeSlice { begin, end });
begin = end + 1;
}
end += 1;
}
self.parts.push(FakeSlice { begin, end });
}
}

Why can I just pass an immutable reference to BufReader, instead of a mutable reference? [duplicate]

This question already has an answer here:
Why is it possible to implement Read on an immutable reference to File?
I am writing a simple TCP-based echo server. When I tried to use BufReader and BufWriter to read from and write to a TcpStream, I found that passing a TcpStream to BufReader::new() by value moves its ownership so that I couldn't pass it to a BufWriter. Then, I found an answer in this thread that solves the problem:
fn handle_client(stream: TcpStream) {
let mut reader = BufReader::new(&stream);
let mut writer = BufWriter::new(&stream);
// Receive a message
let mut message = String::new();
reader.read_line(&mut message).unwrap();
// ignored
}
This is simple and it works. However, I cannot quite understand why this code works. Why can I just pass an immutable reference to BufReader::new(), instead of a mutable reference?
The whole program can be found here.
More Details
In the above code, I used reader.read_line(&mut message). So I opened the source code of BufRead in Rust standard library and saw this:
fn read_line(&mut self, buf: &mut String) -> Result<usize> {
// ignored
append_to_string(buf, |b| read_until(self, b'\n', b))
}
Here we can see that it passes the self (which may be a &mut BufReader in my case) to read_until(). Next, I found the following code in the same file:
fn read_until<R: BufRead + ?Sized>(r: &mut R, delim: u8, buf: &mut Vec<u8>)
-> Result<usize> {
let mut read = 0;
loop {
let (done, used) = {
let available = match r.fill_buf() {
Ok(n) => n,
Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
Err(e) => return Err(e)
};
match memchr::memchr(delim, available) {
Some(i) => {
buf.extend_from_slice(&available[..i + 1]);
(true, i + 1)
}
None => {
buf.extend_from_slice(available);
(false, available.len())
}
}
};
r.consume(used);
read += used;
if done || used == 0 {
return Ok(read);
}
}
}
In this part, there are two places using the BufReader: r.fill_buf() and r.consume(used). I thought r.fill_buf() is what I want to see. Therefore, I went to the code of BufReader in Rust standard library and found this:
fn fill_buf(&mut self) -> io::Result<&[u8]> {
// ignored
if self.pos == self.cap {
self.cap = try!(self.inner.read(&mut self.buf));
self.pos = 0;
}
Ok(&self.buf[self.pos..self.cap])
}
It seems like it uses self.inner.read(&mut self.buf) to read the data from self.inner. Then, we take a look at the structure of BufReader and the BufReader::new():
pub struct BufReader<R> {
inner: R,
buf: Vec<u8>,
pos: usize,
cap: usize,
}
// ignored
impl<R: Read> BufReader<R> {
// ignored
#[stable(feature = "rust1", since = "1.0.0")]
pub fn new(inner: R) -> BufReader<R> {
BufReader::with_capacity(DEFAULT_BUF_SIZE, inner)
}
// ignored
#[stable(feature = "rust1", since = "1.0.0")]
pub fn with_capacity(cap: usize, inner: R) -> BufReader<R> {
BufReader {
inner: inner,
buf: vec![0; cap],
pos: 0,
cap: 0,
}
}
// ignored
}
From the above code, we can know that inner is a type which implements Read. In my case, the inner may be a &TcpStream.
I knew the signature of Read.read() is:
fn read(&mut self, buf: &mut [u8]) -> Result<usize>
It requires a mutable reference here, but I only lent it an immutable reference. Is this supposed to be a problem when the program reaches self.inner.read() in fill_buf()?
Quick answer: we pass a &TcpStream as R: Read, not a TcpStream. Thus self in Read::read is &mut &TcpStream, not &mut TcpStream. Read is implemented for &TcpStream, as you can see in the documentation.
Look at this working code:
let stream = TcpStream::connect("...").unwrap();
let mut buf = [0; 100];
Read::read(&mut (&stream), &mut buf);
Note that stream is not even bound as mut, because we use it immutably, just having a mutable reference to the immutable one.
Next, you could ask why Read can be implemented for &TcpStream, because it's necessary to mutate something during the read operation.
This is where the nice Rust-world 🌈 ☮ ends, and the evil C-/operating system-world starts 😈. For example, on Linux you have a simple integer as "file descriptor" for the stream. You can use this for all operations on the stream, including reading and writing. Since you pass the integer by value (it's also a Copy-type), it doesn't matter if you have a mutable or immutable reference to the integer as you can just copy it.
Therefore a minimal amount of synchronization has to be done by the operating system or by the Rust std implementation, because usually it's strange and dangerous to mutate through an immutable reference. This behavior is called "interior mutability" and you can read a little bit more about it in the cell documentation and in the book 📖.
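To make the pattern concrete without a network connection, here is a small sketch that implements Read for a shared reference to a made-up in-memory type. SharedSource and read_first_line are hypothetical names, and the Cell stands in for the state that, in the TcpStream case, lives inside the operating system.

```rust
use std::cell::Cell;
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical in-memory source. `pos` is a `Cell` so that it can be
/// advanced through a shared reference (interior mutability).
struct SharedSource {
    data: Vec<u8>,
    pos: Cell<usize>,
}

impl SharedSource {
    fn new(data: &[u8]) -> Self {
        SharedSource { data: data.to_vec(), pos: Cell::new(0) }
    }
}

// `Read` is implemented for the shared reference, in the same spirit as
// the standard library's `impl Read for &TcpStream`.
impl<'a> Read for &'a SharedSource {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Here `self` is `&mut &SharedSource`.
        let start = self.pos.get();
        let remaining = &self.data[start..];
        let n = remaining.len().min(buf.len());
        buf[..n].copy_from_slice(&remaining[..n]);
        self.pos.set(start + n);
        Ok(n)
    }
}

/// Reads one line through a `BufReader` that only holds `&SharedSource`.
fn read_first_line(src: &SharedSource) -> io::Result<String> {
    let mut reader = BufReader::new(src);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line)
}

fn main() -> io::Result<()> {
    let src = SharedSource::new(b"hello\nworld\n"); // not bound as mut
    let line = read_first_line(&src)?;
    println!("{:?}", line);
    Ok(())
}
```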

Using str and String interchangeably

Suppose I'm trying to do a fancy zero-copy parser in Rust using &str, but sometimes I need to modify the text (e.g. to implement variable substitution). I really want to do something like this:
fn main() {
let mut v: Vec<&str> = "Hello there $world!".split_whitespace().collect();
for t in v.iter_mut() {
if (t.contains("$world")) {
*t = &t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
But of course the String returned by t.replace() doesn't live long enough. Is there a nice way around this? Perhaps there is a type which means "ideally a &str but if necessary a String"? Or maybe there is a way to use lifetime annotations to tell the compiler that the returned String should be kept alive until the end of main() (or have the same lifetime as v)?
Rust has exactly what you want in form of a Cow (Clone On Write) type.
use std::borrow::Cow;
fn main() {
let mut v: Vec<_> = "Hello there $world!".split_whitespace()
.map(|s| Cow::Borrowed(s))
.collect();
for t in v.iter_mut() {
if t.contains("$world") {
*t.to_mut() = t.replace("$world", "Earth");
}
}
println!("{:?}", &v);
}
As @sellibitze correctly notes, the to_mut() creates a new String, which causes a heap allocation to store the previously borrowed value. If you are sure you only have borrowed strings, then you can use
*t = Cow::Owned(t.replace("$world", "Earth"));
In case the Vec contains Cow::Owned elements, this would still throw away the allocation. You can prevent that using the following very fragile and unsafe code inside your for loop. (It does direct byte-based manipulation of UTF-8 strings and relies on the fact that, once the $ sign is removed, the replacement is exactly the same number of bytes as the text it overwrites.)
let mut last_pos = 0; // so we don't start at the beginning every time
while let Some(pos) = t[last_pos..].find("$world") {
let p = pos + last_pos; // find returns an offset relative to last_pos
last_pos = p + 5; // continue right after the 5 bytes of "Earth"
unsafe {
let s = t.to_mut().as_mut_vec(); // operating on Vec is easier
s.remove(p); // remove $ sign
// overwrite "world" with "Earth", byte for byte
for (c, sc) in "Earth".bytes().zip(&mut s[p..]) {
*sc = c;
}
}
}
Note that this is tailored exactly to the "$world" -> "Earth" mapping. Any other mappings require careful consideration inside the unsafe code.
Use std::borrow::Cow, specifically as Cow<'a, str>, where 'a is the lifetime of the string being parsed.
use std::borrow::Cow;
fn main() {
let mut v: Vec<Cow<'static, str>> = vec![];
v.push("oh hai".into());
v.push(format!("there, {}.", "Mark").into());
println!("{:?}", v);
}
Produces:
["oh hai", "there, Mark."]
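Putting the two ideas together, a parser can decide per token whether to borrow or to allocate. The following is a minimal sketch under the assumption that "$world" is the only variable; substitute and expand are hypothetical names.

```rust
use std::borrow::Cow;

/// Returns the token unchanged (borrowed) unless it needs substitution.
fn substitute<'a>(token: &'a str) -> Cow<'a, str> {
    if token.contains("$world") {
        // Only this branch allocates a new String.
        Cow::Owned(token.replace("$world", "Earth"))
    } else {
        Cow::Borrowed(token)
    }
}

/// Splits on whitespace and substitutes each token.
fn expand<'a>(input: &'a str) -> Vec<Cow<'a, str>> {
    input.split_whitespace().map(substitute).collect()
}

fn main() {
    let v = expand("Hello there $world!");
    println!("{:?}", v);
}
```

Tokens that need no substitution stay zero-copy slices of the input, so the cost of a String is paid only where the text actually changes.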
