How can I better store a string to avoid many clones? - rust

I am using tokio's UdpCodec trait:
pub trait UdpCodec {
type In;
type Out;
fn decode(&mut self, src: &SocketAddr, buf: &[u8]) -> Result<Self::In>;
fn encode(&mut self, msg: Self::Out, buf: &mut Vec<u8>) -> SocketAddr;
}
My associated type for In is a (SocketAddr, Vec<Metric>). Metric is defined as:
#[derive(Debug, PartialEq)]
pub struct Metric {
pub name: String,
pub value: f64,
pub metric_type: MetricType,
pub sample_rate: Option<f64>,
}
I have used owned strings to avoid lifetime constraints with the associated types. However, I also do HashMap lookups and inserts with these metric names, which involves a lot of cloning since I borrow metrics in other functions.
How can I better store a string within this Metric type to avoid many inefficient clones? Using the Cow type has crossed my mind but it also obviously has a lifetime association.

Expanding on @Josh's suggestion, I would suggest using interning.
Depending on how memory or CPU intensive your task is, make your pick between:
A double hash-map: ID <-> String, shared between components
A single hash-map: String -> Rc<str>
If you can afford the latter, I definitely advise it. Also note that you can likely fold MetricType within the Rc: Rc<(MetricType, str)>.
Then you still need to call clone left and right, but each is just a cheap non-atomic increment operation... and moving to multithreading is as simple as replacing Rc with Arc.
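To make the single hash-map variant concrete, here is a minimal sketch using only the standard library. The `Interner` type and the trimmed-down `Metric` (without `metric_type` and `sample_rate`) are hypothetical names for illustration, not part of tokio or of the question's code.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical interner: hands out shared `Rc<str>` handles for metric names.
#[derive(Default)]
pub struct Interner {
    names: HashSet<Rc<str>>,
}

impl Interner {
    /// Returns the shared handle for `name`, allocating only on first sight.
    pub fn intern(&mut self, name: &str) -> Rc<str> {
        // `Rc<str>: Borrow<str>`, so the lookup itself needs no allocation.
        if let Some(existing) = self.names.get(name) {
            return Rc::clone(existing);
        }
        let handle: Rc<str> = Rc::from(name);
        self.names.insert(Rc::clone(&handle));
        handle
    }
}

/// Simplified `Metric` (an assumption): the owned `String` becomes a handle.
#[derive(Debug, PartialEq)]
pub struct Metric {
    pub name: Rc<str>,
    pub value: f64,
}
```

Cloning `Metric::name` then only bumps a reference count, and the handle can still be used to look things up in a `HashMap<Rc<str>, _>` by `&str`.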

Related

How To Do Zero-Copy Deserialization of Recursive Enums with Serde?

I'm not even sure it's possible with serde, but what I'm trying to do is something along the following:
#[derive(serde::Deserialize)]
pub enum Tree<'a> {
Zero,
One(&'a Tree<'a>),
Two(&'a Tree<'a>, &'a Tree<'a>),
Three(&'a Tree<'a>, &'a Tree<'a>, &'a Tree<'a>),
}
Is this possible using specific serde attributes (like #[serde(borrow)], etc.)? Is it required to do a custom implementation of Deserialize? Or is it not something serde can do?
You can't because something has to own all the new Tree objects.
You can however create a similar structure:
#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub enum Tree<'a> {
Zero(&'a str),
One(Box<Tree<'a>>),
Two(Box<(Tree<'a>, Tree<'a>)>),
Three(Box<(Tree<'a>, Tree<'a>, Tree<'a>)>),
}
I added a &'a str argument to Zero to have some use for that lifetime; otherwise you could just get rid of it altogether.
Boxes are needed because otherwise the type would have infinite size.
This is still zero-copy since we're not copying any data from the underlying array. It is, however, not zero-allocation, which might work with some hacks or in special cases but is generally impossible.
I figured out the closest possible thing to what I wanted to do without allocation:
#[derive(serde::Deserialize)]
pub enum Tree<'a> {
Zero,
One(&'a [u8]),
Two(&'a [u8], &'a [u8]),
Three(&'a [u8], &'a [u8], &'a [u8]),
}
Then each individual slice would be deserialized into Tree on descent. As @Caesar pointed out this would not technically be zero-copy, though, depending on your definition (I think it's kind of a gray area).

How can I write a self-referential Rust struct with Arc and BufReader?

I'm trying to write this following code for a server:
use std::io::{BufReader, BufWriter};
use std::net::TcpStream;
use std::sync::Arc;
struct User<'a> {
stream: Arc<TcpStream>,
reader: BufReader<&'a TcpStream>,
writer: BufWriter<&'a TcpStream>,
}
fn accept_socket(users: &mut Vec<User>, stream: Arc<TcpStream>) {
let stream_clone = stream.clone();
let user = User {
stream: stream_clone,
reader: BufReader::new(stream_clone.as_ref()),
writer: BufWriter::new(stream_clone.as_ref()),
};
users.push(user);
}
The stream is behind an Arc because it is shared across threads. The BufReader and BufWriter point to the User's own Arc, but the compiler complains that the reference stream_clone.as_ref() does not live long enough, even though it obviously does (it points to the Arc, which isn't dropped as long as the User is alive). How do I get the compiler to accept this code?
Self-referential structs are a no-go. Rust has no way of updating the address in the references if the struct is moved since moving is always a simple bit copy. Unlike C++ with its move constructors, there's no way to attach behavior to moves.
What you can do instead is store Arcs inside the reader and writer so they share ownership of the TcpStream.
struct User {
stream: Arc<TcpStream>,
reader: BufReader<IoArc<TcpStream>>,
writer: BufWriter<IoArc<TcpStream>>,
}
The tricky part is that Arc doesn't implement Read and Write. You'll need a newtype that does (IoArc, above). Yoshua Wuyts wrote about this problem:
One of those patterns is perhaps lesser known but integral to std’s functioning: impl Read/Write for &Type. What this means is that if you have a reference to an IO type, such as File or TcpStream, you’re still able to call Read and Write methods thanks to some interior mutability tricks.
The implication of this is also that if you want to share a std::fs::File between multiple threads you don’t need to use an expensive Arc<Mutex<File>> because an Arc<File> suffices.
You might expect that if we wrap an IO type T in an Arc that it would implement Clone + Read + Write. But in reality it only implements Clone + Deref<T>... However, there's an escape hatch here: we can create a wrapper type around Arc<T> that implements Read + Write by dereferencing &T internally.
Here is his solution:
use std::io::{self, Read, Write};
use std::sync::Arc;
/// A variant of `Arc` that delegates IO traits if available on `&T`.
#[derive(Debug)]
pub struct IoArc<T>(Arc<T>);
impl<T> IoArc<T> {
/// Create a new instance of IoArc.
pub fn new(data: T) -> Self {
Self(Arc::new(data))
}
}
impl<T> Read for IoArc<T>
where
for<'a> &'a T: Read,
{
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
(&mut &*self.0).read(buf)
}
}
impl<T> Write for IoArc<T>
where
for<'a> &'a T: Write,
{
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
(&mut &*self.0).write(buf)
}
fn flush(&mut self) -> io::Result<()> {
(&mut &*self.0).flush()
}
}
MIT license
IoArc is available in the io_arc crate, though it is short enough to implement yourself if you don't want to pull in the dependency.
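As a rough sketch of how the wrapper might be wired into the question's `User` struct: the wrapper is restated compactly so the snippet compiles on its own, and the manual `Clone` impl and the changed `accept_socket` signature (taking the `TcpStream` by value) are assumptions of this example rather than something the quoted code shows.

```rust
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::net::TcpStream;
use std::sync::Arc;

/// Compact restatement of the wrapper so this sketch compiles alone.
pub struct IoArc<T>(Arc<T>);

impl<T> IoArc<T> {
    pub fn new(data: T) -> Self {
        Self(Arc::new(data))
    }
}

// Assumption: a manual `Clone` that only bumps the reference count.
impl<T> Clone for IoArc<T> {
    fn clone(&self) -> Self {
        Self(Arc::clone(&self.0))
    }
}

impl<T> Read for IoArc<T>
where
    for<'a> &'a T: Read,
{
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        (&mut &*self.0).read(buf)
    }
}

impl<T> Write for IoArc<T>
where
    for<'a> &'a T: Write,
{
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        (&mut &*self.0).write(buf)
    }
    fn flush(&mut self) -> io::Result<()> {
        (&mut &*self.0).flush()
    }
}

/// No lifetime parameter: reader and writer co-own the stream.
pub struct User {
    pub stream: IoArc<TcpStream>,
    pub reader: BufReader<IoArc<TcpStream>>,
    pub writer: BufWriter<IoArc<TcpStream>>,
}

pub fn accept_socket(users: &mut Vec<User>, stream: TcpStream) {
    let stream = IoArc::new(stream);
    users.push(User {
        reader: BufReader::new(stream.clone()),
        writer: BufWriter::new(stream.clone()),
        stream,
    });
}
```

Because every field owns a handle to the same stream, `User` can be moved freely and nothing in it borrows from a sibling field.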
Simple answer: You can't.
In Rust, every type is implicitly movable by memcpy. So if your type stores references to itself, it would break as soon as the move happens; the references would be dangling.
More complex answer: You can't, unless you use Pin, unsafe and raw pointers.
But I'm pretty sure that using Arc for everything is the way to go instead.
Arc<TcpStream> does not implement Read or Write
You could just write a very thin wrapper struct around Arc<TcpStream> which implements Read and Write. It should be fairly easy.
Edit: Take a look at @JohnKugelman's answer for such a wrapper.

How to best parse a string in rust [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I'm looking to build a toy string parser in Rust, but fairly beginner to the language. The design I'm thinking of is building a Parser struct, that uses the Iterator trait to grab the next token for the end-user. Something like:
pub struct Scanner {
pub start: i32, // start of token
pub current: i32, //current char
pub line: i32, //line currently being parsed
pub char_it: //storing some kind of string iterator state
}
...
impl Iterator for Scanner {
type Item = Token;
fn next(&mut self) -> Option<Token> {
do_something(self.char_it.next()); // look at the next char or combination of chars and build a token
return some_token;
}
}
Is this the most idiomatic way to approach this design? In most languages I would just track an index or pointer over the string I'm parsing but my understanding is it's not best practice to index into a string, and usually you want to use an Iterator like test_str.chars().next().
How exactly would you store an iterator in a struct so that it keeps an internal state of where the scanner is at? What datatype is this? I tried pub char_it: Iterator<Item = char> which did not seem to work.
Is there any way to index into the data structure without an iterator efficiently? I know there's char_at(), but I hear this is potentially an O(n) operation?
Thanks
Is this the most idiomatic way to approach this design? In most languages I would just track an index or pointer over the string I'm parsing but my understanding is it's not best practice to index into a string, and usually you want to use an Iterator like test_str.chars().next().
Using either indexes or an iterator would be fine. The issue to keep in mind is that Rust strings are UTF-8 encoded, and slicing at a byte index in the middle of a multi-byte code point will result in a panic. However, if you're determining the index via .char_indices() or another UTF-8-aware mechanism, then it won't be a problem.
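For instance, a small sketch of index-based splitting that stays on character boundaries (the helper name `split_first_word` is made up for this example):

```rust
/// Splits `input` at the first whitespace character, returning
/// (first word, rest). Hypothetical helper for illustration.
pub fn split_first_word(input: &str) -> (&str, &str) {
    // `char_indices` yields byte offsets that always fall on char
    // boundaries, so slicing at `idx` cannot panic.
    match input.char_indices().find(|(_, c)| c.is_whitespace()) {
        Some((idx, _)) => (&input[..idx], &input[idx..]),
        None => (input, ""),
    }
}
```

Since the offsets come from the string itself, this keeps working when the input contains multi-byte characters.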
How exactly would you store an iterator in a struct so that it keeps an internal state of where the scanner is at? What datatype is this? I tried pub char_it: Iterator<Item = char> which did not seem to work.
There are a few ways to go about it:
In this case you can use the concrete type directly. .chars() always returns a Chars value, so the field can be declared as Chars with a lifetime parameter that ties the iterator to the original string:
use std::str::Chars;
pub struct Scanner<'source> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: Chars<'source>
}
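A minimal sketch of how such a struct might be constructed and driven. It is simplified to a `line` counter only, and the `new`, `advance` and `peek` methods are assumptions added for illustration:

```rust
use std::str::Chars;

pub struct Scanner<'source> {
    pub line: usize,
    char_it: Chars<'source>,
}

impl<'source> Scanner<'source> {
    pub fn new(source: &'source str) -> Self {
        Scanner { line: 1, char_it: source.chars() }
    }

    /// Consumes one character, bumping the line counter on '\n'.
    pub fn advance(&mut self) -> Option<char> {
        let c = self.char_it.next()?;
        if c == '\n' {
            self.line += 1;
        }
        Some(c)
    }

    /// Looks at the next character without consuming it.
    pub fn peek(&self) -> Option<char> {
        // Cloning `Chars` is cheap: it only copies the iterator's position.
        self.char_it.clone().next()
    }
}
```

The stored iterator is the scanner's position, so no separate index is needed to remember where it is.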
Iterator is a trait, not a concrete type itself. But in general, you can make a type generic over the iterator type:
pub struct Scanner<I: Iterator<Item = char>> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: I
}
The other option is to use a trait object (using Box):
pub struct Scanner<'source> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: Box<dyn Iterator<Item = char> + 'source>
}
The 'source lifetime bound is needed here because a boxed trait object defaults to 'static, which won't work for an iterator that borrows from a non-'static string.
For a purpose like this, I would recommend keeping the slice of the string that is left to be parsed:
pub struct Scanner<'source> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: &'source str
}
You can call .chars() on the slice to consume characters, then call .as_str() on the iterator to get the remaining slice and store it back.
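Here is one way the remaining-slice approach could look as an `Iterator`. The `Token` enum, the field name `rest`, and the token rules are invented for this sketch and would need adapting to a real grammar:

```rust
#[derive(Debug, PartialEq)]
pub enum Token<'source> {
    Number(&'source str),
    Word(&'source str),
    Symbol(char),
}

pub struct Scanner<'source> {
    rest: &'source str, // the part of the input not yet consumed
}

impl<'source> Scanner<'source> {
    pub fn new(source: &'source str) -> Self {
        Scanner { rest: source }
    }

    /// Cuts off the longest prefix whose chars satisfy `pred`.
    fn eat_while(&mut self, pred: impl Fn(char) -> bool) -> &'source str {
        let end = self.rest.find(|c: char| !pred(c)).unwrap_or(self.rest.len());
        let (head, tail) = self.rest.split_at(end);
        self.rest = tail;
        head
    }
}

impl<'source> Iterator for Scanner<'source> {
    type Item = Token<'source>;

    fn next(&mut self) -> Option<Self::Item> {
        self.rest = self.rest.trim_start();
        let mut chars = self.rest.chars();
        let first = chars.next()?;
        if first.is_ascii_digit() {
            Some(Token::Number(self.eat_while(|c| c.is_ascii_digit())))
        } else if first.is_alphabetic() {
            Some(Token::Word(self.eat_while(char::is_alphabetic)))
        } else {
            // `as_str` gives back whatever the iterator has not consumed.
            self.rest = chars.as_str();
            Some(Token::Symbol(first))
        }
    }
}
```

Tokens borrow directly from the source string, so scanning allocates nothing beyond whatever the caller collects into.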
Is there any way to index into the data structure without an iterator efficiently? I know there's char_at(), but I hear this is potentially an O(n) operation?
I don't know what char_at() is. Using [] to index a string is O(1) because it just works on bytes. But again, it will panic if the index falls within a code point.
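If the panic is a concern, the standard library also has non-panicking accessors. A short sketch (the helper names `byte_at` and `try_slice` are hypothetical wrappers around `as_bytes` and `str::get`):

```rust
/// Returns the byte at `idx` in O(1), or `None` when out of range.
pub fn byte_at(s: &str, idx: usize) -> Option<u8> {
    s.as_bytes().get(idx).copied()
}

/// Non-panicking slice: `None` if either end is out of range or falls
/// inside a multi-byte code point.
pub fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

`str::is_char_boundary` can be used to check an offset up front when the caller needs to know why a slice was rejected.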
It really depends on what you're parsing, and whether you need to move backwards through a stream, or peek from a stream. Probably you will want to make some kind of object to represent the input stream, complete with a reference to the stream data, as well as the current positions in that data. That input stream object would then have the low level methods needed to read in the next expected input, whether it is a single character or a whole token. It's hard to elaborate more than that without knowing more specifics. You might want to look at some of the existing parsing libraries, such as https://github.com/Geal/nom

Shared ownership of an str between a HashMap and a Vec

I come from a Java/C#/JavaScript background and I am trying to implement a Dictionary that would assign each passed string an id that never changes. The dictionary should be able to return a string by the specified id. This makes it possible to store data that has a lot of repetitive strings far more efficiently in the file system, because only the ids of strings would be stored instead of entire strings.
I thought that a struct with a HashMap and a Vec would do but it turned out to be more complicated than that.
I started with the usage of &str as a key for HashMap and an item of Vec like in the following sample. The value of HashMap serves as an index into Vec.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<&'a str>
}
impl<'a> Dictionary<'a> {
pub fn put_and_get_key(&mut self, value: &'a str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
self.keys_map.push(value);
self.values_map.insert(value, id);
id
},
Some(&mut id) => id
}
}
}
This works just fine until it turns out that the strs need to be stored somewhere, preferably in this same struct as well. I tried to store a Box<str> in the Vec and &'a str in the HashMap.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<Box<str>>
}
The borrow checker did not allow this of course because it would have allowed a dangling pointer in the HashMap when an item is removed from the Vec (or in fact sometimes when another item is added to the Vec, but this is off-topic here).
I understood that I either need to write unsafe code or use some form of shared ownership, the simplest kind of which seems to be an Rc. The usage of Rc<Box<str>> looks like introducing double indirection but there seems to be no simple way to construct an Rc<str> at the moment.
pub struct Dictionary {
values_map: HashMap<Rc<Box<str>>, u32>,
keys_map: Vec<Rc<Box<str>>>
}
impl Dictionary {
pub fn put_and_get_key(&mut self, value: &str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
let value_to_store = Rc::new(value.to_owned().into_boxed_str());
self.keys_map.push(value_to_store.clone());
self.values_map.insert(value_to_store, id);
id
},
Some(&mut id) => id
}
}
}
Everything seems fine with regard to ownership semantics, but the code above does not compile because the HashMap now expects an Rc, not an &str:
error[E0277]: the trait bound `std::rc::Rc<Box<str>>: std::borrow::Borrow<str>` is not satisfied
--> src/file_structure/sample_dictionary.rs:14:31
|
14 | match self.values_map.get_mut(value) {
| ^^^^^^^ the trait `std::borrow::Borrow<str>` is not implemented for `std::rc::Rc<Box<str>>`
|
= help: the following implementations were found:
= help: <std::rc::Rc<T> as std::borrow::Borrow<T>>
Questions:
Is there a way to construct an Rc<str>?
Which other structures, methods or approaches could help to resolve this problem. Essentially, I need a way to efficiently store two maps string-by-id and id-by-string and be able to retrieve an id by &str, i.e. without any excessive allocations.
Is there a way to construct an Rc<str>?
Annoyingly, not that I know of. Rc::new requires a Sized argument, and I am not sure whether it is an actual limitation, or just something which was forgotten.
Which other structures, methods or approaches could help to resolve this problem?
If you look at the signature of get you'll notice:
fn get<Q: ?Sized>(&self, k: &Q) -> Option<&V>
where K: Borrow<Q>, Q: Hash + Eq
As a result, you could search by &str if K implements Borrow<str>.
String implements Borrow<str>, so the simplest solution is to simply use String as a key. Sure it means you'll actually have two String instead of one... but it's simple. Certainly, a String is simpler to use than a Box<str> (although it uses 8 more bytes).
If you want to shave off this cost, you can use a custom structure instead:
#[derive(Clone, Debug)]
struct RcStr(Rc<String>);
And then implement Borrow<str> for it. You'll then have 2 allocations per key (1 for Rc and 1 for String). Depending on the size of your String, it might consume less or more memory.
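A sketch of what that implementation might look like. The extra derives and the `lookup` helper are additions for this example; the derives matter because `Borrow` requires `Hash` and `Eq` on the key to agree with those on `str`:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::rc::Rc;

// Deriving `Hash` and `Eq` delegates to `String`, whose implementations
// agree with those of `str`, as the `Borrow` contract requires.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct RcStr(Rc<String>);

impl Borrow<str> for RcStr {
    fn borrow(&self) -> &str {
        self.0.as_str()
    }
}

/// Hypothetical helper: look up an id by `&str` without allocating.
fn lookup(map: &HashMap<RcStr, u32>, key: &str) -> Option<u32> {
    map.get(key).copied()
}
```

With this in place, `HashMap<RcStr, u32>` and `Vec<RcStr>` can share the same allocation per string.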
If you wish to go further (why not?), here are some ideas:
implement your own reference-counted string, in a single heap-allocation,
use a single arena for the slice inserted in the Dictionary,
...
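For readers on a newer toolchain: the standard library has since gained `From<&str>` for `Rc<str>`, and `Rc<str>` implements `Borrow<str>`, so the single-allocation variant can be sketched without a custom type. The `get_value` method and the `Default` derive are additions for this example:

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Default)]
pub struct Dictionary {
    values_map: HashMap<Rc<str>, u32>,
    keys_map: Vec<Rc<str>>,
}

impl Dictionary {
    pub fn put_and_get_key(&mut self, value: &str) -> u32 {
        // `Rc<str>: Borrow<str>`, so the lookup takes a plain `&str`.
        if let Some(&id) = self.values_map.get(value) {
            return id;
        }
        let id = self.keys_map.len() as u32;
        let stored: Rc<str> = Rc::from(value); // one allocation, shared twice
        self.keys_map.push(Rc::clone(&stored));
        self.values_map.insert(stored, id);
        id
    }

    pub fn get_value(&self, id: u32) -> Option<&str> {
        self.keys_map.get(id as usize).map(|s| &**s)
    }
}
```

Only the first insertion of a string allocates; repeated calls with the same string are a hash lookup.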

How to make a struct where one of the fields refers to another field

I have the following problem: I have a data structure that is parsed from a buffer and contains some references into this buffer, so the parsing function looks something like
fn parse_bar<'a>(buf: &'a [u8]) -> Bar<'a>
So far, so good. However, to avoid certain lifetime issues I'd like to put the data structure and the underlying buffer into a struct as follows:
struct BarWithBuf<'a> {bar: Bar<'a>, buf: Box<[u8]>}
// not even sure if these lifetime annotations here make sense,
// but it won't compile unless I add some lifetime to Bar
However, now I don't know how to actually construct a BarWithBuf value.
fn make_bar_with_buf<'a>(buf: Box<[u8]>) -> BarWithBuf<'a> {
let my_bar = parse_bar(&*buf);
BarWithBuf {buf: buf, bar: my_bar}
}
doesn't work, since buf is moved in the construction of the BarWithBuf value, but we borrowed it for parsing.
I feel like it should be possible to do something along the lines of
fn make_bar_with_buf<'a>(buf: Box<[u8]>) -> BarWithBuf<'a> {
let mut bwb = BarWithBuf {buf: buf};
bwb.bar = parse_bar(&*bwb.buf);
bwb
}
to avoid moving the buffer after parsing the Bar, but I can't do that because the whole BarWithBuf struct has to be initialised in one go.
Now I suspect that I could use unsafe code to partially construct the struct, but I'd rather not do that.
What would be the best way to solve this problem? Do I need unsafe code? If I do, would it be safe to do this here? Or am I completely on the wrong track here and there is a better way to tie a data structure and its underlying buffer together?
I think you're right in that it's not possible to do this without unsafe code. I would consider the following options:
Change the reference in Bar to an index. The contents of the box won't be protected by a borrow, so the index might become invalid if you're not careful. However, an index might convey the meaning of the reference in a clearer way.
Move Box<[u8]> into Bar, and add a function buf() -> &[u8] to the implementation of Bar; instead of references, store indices in Bar. Now Bar is the owner of the buffer, so it can control its modification and keep the indices valid (thereby avoiding the problem of option #1).
As per DK's suggestion below, store indices in BarWithBuf (or in a helper struct BarInternal) and add a function fn bar(&self) -> Bar to the implementation of BarWithBuf, which constructs a Bar on-the-fly.
Which of these options is the most appropriate one depends on the actual problem context. I agree that some form of "member-by-member construction" of structs would be immensely helpful in Rust.
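The index-based variant (storing ranges and building a `Bar` on the fly) can be done entirely in safe code. A minimal sketch, where the comma-splitting "parser" and the field names are placeholders invented for the example:

```rust
use std::ops::Range;

/// Borrowed view handed out to callers; rebuilt on demand.
pub struct Bar<'a> {
    pub refs: Vec<&'a [u8]>,
}

/// Owns the buffer and remembers *where* the pieces are, not pointers to them.
pub struct BarWithBuf {
    buf: Box<[u8]>,
    ranges: Vec<Range<usize>>,
}

impl BarWithBuf {
    /// Placeholder parser (an assumption): splits the buffer on b','.
    pub fn new(buf: Box<[u8]>) -> Self {
        let mut ranges = Vec::new();
        let mut start = 0;
        for (i, &byte) in buf.iter().enumerate() {
            if byte == b',' {
                ranges.push(start..i);
                start = i + 1;
            }
        }
        ranges.push(start..buf.len());
        BarWithBuf { buf, ranges }
    }

    /// Builds a `Bar` whose slices borrow from `self`.
    pub fn bar(&self) -> Bar<'_> {
        Bar {
            refs: self.ranges.iter().map(|r| &self.buf[r.clone()]).collect(),
        }
    }
}
```

The trade-off is that `bar()` rebuilds the view on each call, which may or may not matter depending on how large the parsed structure is.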
Here's an approach that will work through a little bit of unsafe code. This approach requires that you are okay with putting the referred-to thing (here, your [u8]) on the heap, so it won't work for direct reference of a sibling field.
Let's start with a toy Bar<'a> implementation:
struct Bar<'a> {
refs: Vec<&'a [u8]>,
}
impl<'a> Bar<'a> {
pub fn parse(src: &'a [u8]) -> Self {
// placeholder for actually parsing goes here
Self { refs: vec![src] }
}
}
We'll make BarWithBuf that uses a Bar<'static>, as 'static is the only lifetime with an accessible name. The buffer we store things in can be anything that doesn't move the target data around on us. I'm going to go with a Vec<u8>, but Box, Pin, whatever will work fine.
struct BarWithBuf {
buf: Vec<u8>,
bar: Bar<'static>,
}
The implementation requires a tiny bit of unsafe code.
impl BarWithBuf {
pub fn new(buf: Vec<u8>) -> Self {
// The `&'static [u8]` is inferred, but writing it here for demo
let buf_slice: &'static [u8] = unsafe {
// Going through a pointer is a "good" way to get around lifetime checks
// `as_ptr` also works for an empty buffer, where `&buf[0]` would panic
std::slice::from_raw_parts(buf.as_ptr(), buf.len())
};
let bar = Bar::parse(buf_slice);
Self { buf, bar }
}
/// Access to Bar should always come through this function.
pub fn bar(&self) -> &Bar {
&self.bar
}
}
BarWithBuf::bar is an important function because it re-associates the proper lifetimes with the references. Rust's lifetime elision rules make the function equivalent to pub fn bar<'a>(&'a self) -> &'a Bar<'a>, which turns out to be exactly what we want. The lifetimes of the slices in bar.refs are tied to the lifetime of the BarWithBuf.
WARNING: You have to be very careful with your implementation here. You cannot make #[derive(Clone)] for BarWithBuf, since the default clone implementation will clone buf, but the elements of bar.refs will still point to the original. It is only one line of unsafe code, but the safety is still off in the "safe" bits.
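If cloning is needed, one option is a hand-written `Clone` that re-parses the copied buffer. The sketch below restates the toy types so it compiles alone and assumes that re-running the parser is acceptable; it is an illustration of the idea, not a vetted pattern.

```rust
struct Bar<'a> {
    refs: Vec<&'a [u8]>,
}

impl<'a> Bar<'a> {
    fn parse(src: &'a [u8]) -> Self {
        Self { refs: vec![src] } // placeholder parser
    }
}

struct BarWithBuf {
    buf: Vec<u8>,
    bar: Bar<'static>,
}

impl BarWithBuf {
    fn new(buf: Vec<u8>) -> Self {
        // SAFETY: relies on `buf`'s heap allocation never being moved,
        // resized, or freed while `bar` is alive; `buf` must stay private
        // and unmodified.
        let slice: &'static [u8] =
            unsafe { std::slice::from_raw_parts(buf.as_ptr(), buf.len()) };
        Self { bar: Bar::parse(slice), buf }
    }

    fn bar(&self) -> &Bar<'_> {
        &self.bar
    }
}

impl Clone for BarWithBuf {
    fn clone(&self) -> Self {
        // Copy the bytes, then re-parse so the new `bar` points at the copy.
        Self::new(self.buf.clone())
    }
}
```

The same reasoning applies to any other operation that would replace or reallocate `buf`: it has to rebuild `bar` as well.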
For larger bits of self-referencing structures, there's the ouroboros crate, which wraps up a lot of unsafe bits for you. The techniques are similar to the one I described above, but they live behind macros, which is a more pleasant experience if you find yourself making a number of self-references.
