How to best parse a string in Rust [closed]

I'm looking to build a toy string parser in Rust, but I'm fairly new to the language. The design I'm thinking of is a Scanner struct that implements the Iterator trait to hand the next token to the end user. Something like:
pub struct Scanner {
pub start: i32, // start of token
pub current: i32, //current char
pub line: i32, //line currently being parsed
pub char_it: //storing some kind of string iterator state
}
...
impl Iterator for Scanner {
type Item = Token;
fn next(&mut self) -> Option<Token> {
do_something(self.char_it.next()); // look at the next char or combination of chars and build a token
return some_token;
}
}
Is this the most idiomatic way to approach this design? In most languages I would just track an index or pointer over the string I'm parsing but my understanding is it's not best practice to index into a string, and usually you want to use an Iterator like test_str.chars().next().
How exactly would you store an iterator in a struct so that it keeps an internal state of where the scanner is at? What datatype is this? I tried pub char_it: Iterator<Item = char> which did not seem to work.
Is there any way to index into the data structure without an iterator efficiently? I know there's char_at(), but I hear this is potentially an O(n) operation?
Thanks

Is this the most idiomatic way to approach this design? In most languages I would just track an index or pointer over the string I'm parsing but my understanding is it's not best practice to index into a string, and usually you want to use an Iterator like test_str.chars().next().
Using either indexes or an iterator would be fine. The issue to keep in mind is that Rust strings are UTF-8 encoded, and slicing at a byte offset that falls in the middle of a multi-byte code point will panic. However, if you obtain the index via .char_indices() or another UTF-8-aware mechanism, that won't be a problem.
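As a minimal sketch of the index-based approach, the helper below (first_word is a hypothetical name used only for illustration) slices at offsets produced by .char_indices(), so the slice can never land inside a code point:

```rust
/// Returns the first whitespace-delimited word of `input` as a slice.
/// `first_word` is a hypothetical helper, not part of any library.
fn first_word(input: &str) -> &str {
    // `char_indices` yields (byte_offset, char) pairs, so every offset it
    // produces is guaranteed to sit on a UTF-8 character boundary.
    for (byte_offset, ch) in input.char_indices() {
        if ch.is_whitespace() {
            return &input[..byte_offset];
        }
    }
    // No whitespace found: the whole input is one word.
    input
}
```

Because the offsets come from the string itself, this works the same for ASCII and multi-byte input.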
How exactly would you store an iterator in a struct so that it keeps an internal state of where the scanner is at? What datatype is this? I tried pub char_it: Iterator<Item = char> which did not seem to work.
There are a few ways to go about it:
In this case you can name the type directly: .chars() always returns a Chars value, so the field can have that type, with a lifetime parameter binding the iterator to the original string:
use std::str::Chars;
pub struct Scanner<'source> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: Chars<'source>
}
Iterator is a trait, not a concrete type itself. But in general, you can make a type generic over the iterator type:
pub struct Scanner<I: Iterator<Item = char>> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: I
}
The other option is to use a trait object (using Box):
pub struct Scanner<'source> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: Box<dyn Iterator<Item = char> + 'source>
}
The 'source lifetime is needed here because boxed trait objects are 'static by default, which won't work for an iterator that borrows from a non-'static string, such as the contents of a String.
For a purpose like this, I would recommend keeping the slice of the string that is left to be parsed:
pub struct Scanner<'source> {
pub start: i32,
pub current: i32,
pub line: i32,
pub char_it: &'source str
}
You can use .chars() on the slice and store it back with .as_str() to get the remaining slice.
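Here is one way that could look in practice. This is only a sketch: the field is renamed to rest, the tokens are just whitespace-separated words, and the advance, peek and line helpers are names I made up for the example.

```rust
/// A toy scanner that stores only the unparsed tail of the input.
pub struct Scanner<'source> {
    rest: &'source str,
    line: u32,
}

impl<'source> Scanner<'source> {
    pub fn new(source: &'source str) -> Self {
        Scanner { rest: source, line: 1 }
    }

    /// Line number of the current position (starting at 1).
    pub fn line(&self) -> u32 {
        self.line
    }

    /// Looks at the next character without consuming it.
    fn peek(&self) -> Option<char> {
        self.rest.chars().next()
    }

    /// Consumes the next character and stores the remaining slice back.
    fn advance(&mut self) -> Option<char> {
        let mut chars = self.rest.chars();
        let ch = chars.next()?;
        // `Chars::as_str` returns the part of the string not yet iterated.
        self.rest = chars.as_str();
        if ch == '\n' {
            self.line += 1;
        }
        Some(ch)
    }
}

impl<'source> Iterator for Scanner<'source> {
    // Tokens here are plain slices of the input; a real scanner would
    // return its own Token type instead.
    type Item = &'source str;

    fn next(&mut self) -> Option<&'source str> {
        // Skip leading whitespace.
        while self.peek().map_or(false, char::is_whitespace) {
            self.advance();
        }
        let token_start = self.rest;
        // Consume characters up to the next whitespace or end of input.
        while self.peek().map_or(false, |c| !c.is_whitespace()) {
            self.advance();
        }
        let token_len = token_start.len() - self.rest.len();
        if token_len == 0 {
            None
        } else {
            Some(&token_start[..token_len])
        }
    }
}
```

Since the struct implements Iterator, Scanner::new(text).collect::<Vec<_>>() gives all tokens, and the tokens borrow from the original string without copying.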
Is there any way to index into the data structure without an iterator efficiently? I know there's char_at(), but I hear this is potentially an O(n) operation?
I don't know what char_at() is. Slicing a string with [] and a byte range is O(1) because it works on byte offsets. But again, it will panic if an offset falls within a code point.
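If you would rather get a None than a panic, the standard library's str::get does the same O(1) slicing with a boundary check. A small sketch (safe_slice is just an illustrative wrapper name):

```rust
/// Returns `input[start..end]`, or `None` if either offset is out of range
/// or falls inside a multi-byte code point.
/// `safe_slice` is a hypothetical wrapper around the real `str::get`.
fn safe_slice(input: &str, start: usize, end: usize) -> Option<&str> {
    input.get(start..end)
}
```

str::is_char_boundary(offset) answers the same question for a single offset if you only need the check.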

It really depends on what you're parsing, and whether you need to move backwards through a stream, or peek from a stream. Probably you will want to make some kind of object to represent the input stream, complete with a reference to the stream data, as well as the current positions in that data. That input stream object would then have the low level methods needed to read in the next expected input, whether it is a single character or a whole token. It's hard to elaborate more than that without knowing more specifics. You might want to look at some of the existing parsing libraries, such as https://github.com/Geal/nom
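For the peeking case specifically, the standard Peekable adapter may already be enough. A rough sketch, assuming the tokens are unsigned decimal numbers (read_number is a made-up helper, and it does not guard against overflow):

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Reads a run of ASCII digits from the front of the stream and returns its
/// value, leaving the first non-digit character unconsumed.
/// Hypothetical helper; very long digit runs would overflow the u32.
fn read_number(chars: &mut Peekable<Chars<'_>>) -> Option<u32> {
    let mut value: Option<u32> = None;
    // `peek` inspects the next char without advancing the iterator.
    while let Some(digit) = chars.peek().and_then(|c| c.to_digit(10)) {
        value = Some(value.unwrap_or(0) * 10 + digit);
        chars.next(); // now actually consume the digit
    }
    value
}
```

The caller creates the stream with input.chars().peekable() and can keep reading other token kinds from the same iterator afterwards.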

Related

How To Do Zero-Copy Deserialization of Recursive Enums with Serde?

I'm not even sure it's possible with serde, but what I'm trying to do is something along the following:
#[derive(serde::Deserialize)]
pub enum Tree<'a> {
Zero,
One(&'a Tree<'a>),
Two(&'a Tree<'a>, &'a Tree<'a>),
Three(&'a Tree<'a>, &'a Tree<'a>, &'a Tree<'a>),
}
Is this possible using specific serde attributes (like #[serde(borrow)], etc.)? Is it required to do a custom implementation of Deserialize? Or is it not something serde can do?
You can't because something has to own all the new Tree objects.
You can however create a similar structure:
#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub enum Tree<'a> {
Zero(&'a str),
One(Box<Tree<'a>>),
Two(Box<(Tree<'a>, Tree<'a>)>),
Three(Box<(Tree<'a>, Tree<'a>, Tree<'a>)>),
}
I added a &'a str argument to Zero to have some use for that lifetime; otherwise you could get rid of the lifetime altogether.
Boxes are needed because otherwise the type would have infinite size.
This is still zero-copy since we're not copying any data from the underlying array. It is, however, not zero-allocation; avoiding allocation might work with some hacks or in special cases, but is generally impossible.
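To show the ownership shape on its own, here is a cut-down version of that enum without the serde derives (so it only needs the standard library), built by hand over a borrowed input. The leaf_count function is a hypothetical helper for the example:

```rust
/// Same shape as the enum above, minus the serde derives and the
/// three-child variant, to keep the sketch short.
#[derive(Debug, PartialEq)]
pub enum Tree<'a> {
    Zero(&'a str),
    One(Box<Tree<'a>>),
    Two(Box<(Tree<'a>, Tree<'a>)>),
}

/// Counts the `Zero` leaves. Hypothetical helper for illustration.
fn leaf_count(tree: &Tree<'_>) -> usize {
    match tree {
        Tree::Zero(_) => 1,
        // Each Box owns its child, so recursion just follows the pointers.
        Tree::One(child) => leaf_count(child),
        Tree::Two(pair) => leaf_count(&pair.0) + leaf_count(&pair.1),
    }
}
```

The string data in the leaves is borrowed from the input, while the tree nodes themselves are owned by the boxes, which is where the allocations come from.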
I figured out the closest possible thing to what I wanted to do without allocation:
#[derive(serde::Deserialize)]
pub enum Tree<'a> {
Zero,
One(&'a [u8]),
Two(&'a [u8], &'a [u8]),
Three(&'a [u8], &'a [u8], &'a [u8]),
}
Then each individual slice would be deserialized into Tree on descent. As @Caesar pointed out, this would not technically be zero-copy, though, depending on your definition (I think it's kind of a gray area).

When to use impl on an empty struct [closed]

I saw the following code on the internet:
pub struct Processor;
impl Processor {
pub fn process() -> i32 {
// some stuff here
}
}
and used as:
let a = Processor::process();
What's the advantage of having struct here at all? Can the same thing somehow be achieved without it?
Here are a couple of the more common examples you might run into. I'm sure there are others I forgot, but these will hopefully help you when reading Rust code in the future.
Creating structs to hold different implementations of a trait
You may find yourself in a situation where it would be nice to be able to use a custom handler, but want to avoid the overhead of storing a function inside of each struct. An easy alternative is to create a trait for it instead and define types for the sole purpose of having different versions of a trait implemented on them.
This way you can use them like compile-time type modifiers which allow for core functionality to be easily swapped or redefined later without any extra overhead or requiring extra information be stored in a struct.
use std::marker::PhantomData;
trait Smoothing {
fn smooth(a: i32, b: i32) -> i32;
}
struct LinearStrategy;
impl Smoothing for LinearStrategy {
fn smooth(a: i32, b: i32) -> i32 {
(a + b) / 2
}
}
struct GeometricStrategy;
impl Smoothing for GeometricStrategy {
fn smooth(a: i32, b: i32) -> i32 {
((a * a + b * b) as f64).sqrt() as i32 // i32 has no sqrt method, so go through f64
}
}
struct ComplexStruct<T> {
/* etc. */
_phantom: PhantomData<T>,
}
// Change how ComplexStruct operates at compile time
impl<T: Smoothing> ComplexStruct<T> {
pub fn sample(&self, x: i32) -> i32 {
T::smooth(self.raw_sample(x - 1), self.raw_sample(x + 1))
}
}
The best example I could find of this is probably Vec. It may not seem like it at first, but Vec has 2 type parameters. In Vec<T, A = Global>, the A is the allocator used by Vec. By default it is set to the global allocator, but in some cases it can be really handy to be able to easily switch it out with something else and still have access to all of the normal functionality of Vec.
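For a version of the strategy pattern that compiles on its own, here is a small sketch. Sampler, its data field and MaxStrategy are stand-ins I invented so the example is complete; they are not from any library.

```rust
use std::marker::PhantomData;

trait Smoothing {
    fn smooth(a: i32, b: i32) -> i32;
}

struct LinearStrategy;
impl Smoothing for LinearStrategy {
    fn smooth(a: i32, b: i32) -> i32 {
        (a + b) / 2
    }
}

// Hypothetical second strategy, just to have something to swap in.
struct MaxStrategy;
impl Smoothing for MaxStrategy {
    fn smooth(a: i32, b: i32) -> i32 {
        a.max(b)
    }
}

/// Hypothetical container; `T` selects the strategy at compile time and
/// takes up no space in the struct.
struct Sampler<T> {
    data: Vec<i32>,
    _strategy: PhantomData<T>,
}

impl<T: Smoothing> Sampler<T> {
    fn new(data: Vec<i32>) -> Self {
        Sampler { data, _strategy: PhantomData }
    }

    /// Smooths the two neighbours of index `i`.
    /// Panics if `i` is the first or last index.
    fn sample(&self, i: usize) -> i32 {
        T::smooth(self.data[i - 1], self.data[i + 1])
    }
}
```

Sampler::<LinearStrategy>::new(data) and Sampler::<MaxStrategy>::new(data) are then two distinct types sharing the same code, and the strategy structs themselves are zero-sized.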
Placeholders for FFI
When creating a Rust API for a C (or other) library, it may make sense to add types that mirror the C API without containing the same data. Normally that ends up looking like this, where a pointer is wrapped in a safe Rust type with a placeholder lifetime (since Rust does not own this data).
pub struct Foo<'a> {
ptr: *mut sys::Foo,
_phantom: PhantomData<&'a ()>,
}
However, it may also make sense to use a struct when you need to enforce that constructors and destructors are called for a resource. Since Rust runs a value's Drop implementation when it goes out of scope, it is quite easy to enforce these rules. In this case, by having a struct we can require some prerequisite to be fulfilled before granting access to the functions in the struct.
pub struct FooApi;
impl FooApi {
pub fn new() -> Self {
unsafe { sys::init_thread_foo(); }
FooApi
}
/// Some call that is only safe if sys::init_thread_foo() has been called
pub fn do_something(&self) { /* ... */}
}
/// Call FFI function to dispose of this resource once this FooApi is no longer needed
impl Drop for FooApi {
fn drop(&mut self) {
unsafe {
sys::dispose_thread_foo();
}
}
}
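To see the init/dispose pairing run without a real C library, here is a sketch where the sys module is a mock that only counts live initialisations. The counter, the private _private field and the return value of do_something are my additions for the demo; the private field is there so that new() is the only way to obtain a FooApi.

```rust
use std::sync::atomic::Ordering;

/// Mock of the foreign library (an assumption for this sketch); real code
/// would declare these as `extern "C"` functions instead.
mod sys {
    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Number of initialisations that have not been disposed yet.
    pub static LIVE: AtomicUsize = AtomicUsize::new(0);

    pub unsafe fn init_thread_foo() {
        LIVE.fetch_add(1, Ordering::SeqCst);
    }

    pub unsafe fn dispose_thread_foo() {
        LIVE.fetch_sub(1, Ordering::SeqCst);
    }
}

pub struct FooApi {
    // Private field: code outside this module cannot construct the struct
    // without going through `new`.
    _private: (),
}

impl FooApi {
    pub fn new() -> Self {
        unsafe { sys::init_thread_foo(); }
        FooApi { _private: () }
    }

    /// Only reachable once `new` has run; here it just reports the counter.
    pub fn do_something(&self) -> usize {
        sys::LIVE.load(Ordering::SeqCst)
    }
}

impl Drop for FooApi {
    fn drop(&mut self) {
        // Runs automatically when the FooApi value goes out of scope.
        unsafe { sys::dispose_thread_foo(); }
    }
}
```

Creating a FooApi in an inner scope and letting it fall out of scope brings the counter back to where it started, with no explicit dispose call at the use site.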
Verbosity
Sometimes a struct could probably be replaced with a module. However, some developers prefer a struct in cases where it makes more conceptual sense to think of acting upon an object. These cases are reasonably rare and usually indicate that a type previously was, or in the future will be, split into traits and made generic. A struct may also be used where there are a couple of equivalent, but non-identical, alternatives that can be swapped between (e.g. types of CPUs).
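For comparison, here are the two spellings side by side. The module name processor and the return value 42 are placeholders for this sketch:

```rust
// Unit struct used purely as a namespace for an associated function.
pub struct Processor;

impl Processor {
    pub fn process() -> i32 {
        42 // placeholder result
    }
}

// The same function exposed through a module instead.
pub mod processor {
    pub fn process() -> i32 {
        42 // placeholder result
    }
}
```

Processor::process() and processor::process() are called the same way and neither needs an instance; the struct version only pays off once you want to implement traits on it or pass it as a type parameter.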
So far those are a couple of the causes that come to mind, but I may come back to add more.

Shared ownership of an str between a HashMap and a Vec

I come from a Java/C#/JavaScript background and I am trying to implement a Dictionary that would assign each passed string an id that never changes. The dictionary should be able to return a string by the specified id. This allows data with a lot of repetitive strings to be stored far more efficiently in the file system, because only the ids of the strings are stored instead of the entire strings.
I thought that a struct with a HashMap and a Vec would do but it turned out to be more complicated than that.
I started with the usage of &str as a key for HashMap and an item of Vec like in the following sample. The value of HashMap serves as an index into Vec.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<&'a str>
}
impl<'a> Dictionary<'a> {
pub fn put_and_get_key(&mut self, value: &'a str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
self.keys_map.push(value);
self.values_map.insert(value, id);
id
},
Some(&mut id) => id
}
}
}
This works just fine until it turns out that the strs need to be stored somewhere, preferably in this same struct as well. I tried to store a Box<str> in the Vec and &'a str in the HashMap.
pub struct Dictionary<'a> {
values_map: HashMap<&'a str, u32>,
keys_map: Vec<Box<str>>
}
The borrow checker did not allow this of course because it would have allowed a dangling pointer in the HashMap when an item is removed from the Vec (or in fact sometimes when another item is added to the Vec, but that is off-topic here).
I understood that I either need to write unsafe code or use some form of shared ownership, the simplest kind of which seems to be an Rc. The usage of Rc<Box<str>> looks like introducing double indirection but there seems to be no simple way to construct an Rc<str> at the moment.
pub struct Dictionary {
values_map: HashMap<Rc<Box<str>>, u32>,
keys_map: Vec<Rc<Box<str>>>
}
impl Dictionary {
pub fn put_and_get_key(&mut self, value: &str) -> u32 {
match self.values_map.get_mut(value) {
None => {
let id_usize = self.keys_map.len();
let id = id_usize as u32;
let value_to_store = Rc::new(value.to_owned().into_boxed_str());
self.keys_map.push(value_to_store);
self.values_map.insert(value_to_store, id);
id
},
Some(&mut id) => id
}
}
}
Everything seems fine with regard to ownership semantics, but the code above does not compile because the HashMap now expects an Rc, not an &str:
error[E0277]: the trait bound `std::rc::Rc<Box<str>>: std::borrow::Borrow<str>` is not satisfied
--> src/file_structure/sample_dictionary.rs:14:31
|
14 | match self.values_map.get_mut(value) {
| ^^^^^^^ the trait `std::borrow::Borrow<str>` is not implemented for `std::rc::Rc<Box<str>>`
|
= help: the following implementations were found:
= help: <std::rc::Rc<T> as std::borrow::Borrow<T>>
Questions:
Is there a way to construct an Rc<str>?
Which other structures, methods or approaches could help to resolve this problem. Essentially, I need a way to efficiently store two maps string-by-id and id-by-string and be able to retrieve an id by &str, i.e. without any excessive allocations.
Is there a way to construct an Rc<str>?
Not with Rc::new, which requires a Sized argument. Since Rust 1.21, however, the standard library implements From<&str> for Rc<str>, so Rc::from(value) constructs one directly.
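Assuming a compiler where From<&str> for Rc<str> is available (Rust 1.21 or later), the dictionary from the question could be sketched like this. The get_value method and the Default derive are additions for the example:

```rust
use std::collections::HashMap;
use std::rc::Rc;

#[derive(Default)]
pub struct Dictionary {
    values_map: HashMap<Rc<str>, u32>,
    keys_map: Vec<Rc<str>>,
}

impl Dictionary {
    pub fn put_and_get_key(&mut self, value: &str) -> u32 {
        // Rc<str> implements Borrow<str>, so the lookup takes a plain &str
        // and does not allocate.
        if let Some(&id) = self.values_map.get(value) {
            return id;
        }
        let id = self.keys_map.len() as u32;
        // One allocation holding both the reference counts and the bytes.
        let stored: Rc<str> = Rc::from(value);
        self.keys_map.push(Rc::clone(&stored));
        self.values_map.insert(stored, id);
        id
    }

    /// Returns the string registered under `id`, if any.
    pub fn get_value(&self, id: u32) -> Option<&str> {
        self.keys_map.get(id as usize).map(|s| &**s)
    }
}
```

Both collections share the same heap allocation per string, and cloning the Rc only bumps a counter.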
Which other structures, methods or approaches could help to resolve this problem?
If you look at the signature of get you'll notice:
fn get<Q: ?Sized>(&self, k: &Q) -> Option<&V>
where K: Borrow<Q>, Q: Hash + Eq
As a result, you could search by &str if K implements Borrow<str>.
String implements Borrow<str>, so the simplest solution is to simply use String as a key. Sure it means you'll actually have two String instead of one... but it's simple. Certainly, a String is simpler to use than a Box<str> (although it uses 8 more bytes).
If you want to shave off this cost, you can use a custom structure instead:
#[derive(Clone, Debug)]
struct RcStr(Rc<String>);
And then implement Borrow<str> for it. You'll then have 2 allocations per key (1 for Rc and 1 for String). Depending on the size of your String, it might consume less or more memory.
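A sketch of what that wrapper might look like. The Borrow contract requires Hash and Eq on the key to agree with those of str, so they are written out here in terms of the string contents:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::rc::Rc;

#[derive(Clone, Debug)]
struct RcStr(Rc<String>);

impl Borrow<str> for RcStr {
    fn borrow(&self) -> &str {
        self.0.as_str()
    }
}

// Compare and hash exactly like the underlying str, so that a lookup by
// &str finds the entry inserted under the RcStr key.
impl PartialEq for RcStr {
    fn eq(&self, other: &Self) -> bool {
        self.0.as_str() == other.0.as_str()
    }
}

impl Eq for RcStr {}

impl Hash for RcStr {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.as_str().hash(state);
    }
}
```

With this in place a HashMap<RcStr, u32> can be queried with map.get("some key") directly.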
If you wish to go further (why not?), here are some ideas:
implement your own reference-counted string, in a single heap-allocation,
use a single arena for the slice inserted in the Dictionary,
...

How can I better store a string to avoid many clones?

I am using tokio's UdpCodec trait:
pub trait UdpCodec {
type In;
type Out;
fn decode(&mut self, src: &SocketAddr, buf: &[u8]) -> Result<Self::In>;
fn encode(&mut self, msg: Self::Out, buf: &mut Vec<u8>) -> SocketAddr;
}
My associated type for In is a (SocketAddr, Vec<Metric>). Metric is defined as:
#[derive(Debug, PartialEq)]
pub struct Metric {
pub name: String,
pub value: f64,
pub metric_type: MetricType,
pub sample_rate: Option<f64>,
}
I have used owned strings to avoid lifetime constraints with the associated types. However I also do HashMap lookups and inserts with these metric names which involves a lot of cloning since I borrow metrics in other functions.
How can I better store a string within this Metric type to avoid many inefficient clones? Using the Cow type has crossed my mind but it also obviously has a lifetime association.
Expanding on @Josh's suggestion, I would suggest using interning.
Depending on how memory or CPU intensive your task is, make your pick between:
A double hash-map: ID <-> String, shared between components
A single hash-map: String -> Rc<str>
If you can afford the latter, I definitely advise it. Also note that you can likely fold MetricType within the Rc: Rc<(MetricType, str)>.
Then you still need to call clone left and right, but each is just a cheap non-atomic increment operation... and moving to multithreading is as simple as replacing Rc with Arc.
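A minimal sketch of the single-map variant, assuming Rust 1.21 or later for Rc::from(&str). Interner and intern are names chosen for the example:

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hands out one shared `Rc<str>` per distinct metric name.
#[derive(Default)]
pub struct Interner {
    names: HashMap<String, Rc<str>>,
}

impl Interner {
    pub fn intern(&mut self, name: &str) -> Rc<str> {
        // Known name: return another handle to the existing allocation.
        if let Some(shared) = self.names.get(name) {
            return Rc::clone(shared);
        }
        // New name: allocate once and remember it.
        let shared: Rc<str> = Rc::from(name);
        self.names.insert(name.to_owned(), Rc::clone(&shared));
        shared
    }
}
```

Metric::name would then be an Rc<str> obtained from the interner at decode time, and every later clone is just a reference-count increment.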

How to make a struct where one of the fields refers to another field

I have the following problem: I have a data structure that is parsed from a buffer and contains some references into this buffer, so the parsing function looks something like
fn parse_bar<'a>(buf: &'a [u8]) -> Bar<'a>
So far, so good. However, to avoid certain lifetime issues I'd like to put the data structure and the underlying buffer into a struct as follows:
struct BarWithBuf<'a> {bar: Bar<'a>, buf: Box<[u8]>}
// not even sure if these lifetime annotations here make sense,
// but it won't compile unless I add some lifetime to Bar
However, now I don't know how to actually construct a BarWithBuf value.
fn make_bar_with_buf<'a>(buf: Box<[u8]>) -> BarWithBuf<'a> {
let my_bar = parse_bar(&*buf);
BarWithBuf {buf: buf, bar: my_bar}
}
doesn't work, since buf is moved in the construction of the BarWithBuf value, but we borrowed it for parsing.
I feel like it should be possible to do something along the lines of
fn make_bar_with_buf<'a>(buf: Box<[u8]>) -> BarWithBuf<'a> {
let mut bwb = BarWithBuf {buf: buf};
bwb.bar = parse_bar(&*bwb.buf);
bwb
}
to avoid moving the buffer after parsing the Bar, but I can't do that because the whole BarWithBuf struct has to be initialised in one go.
Now I suspect that I could use unsafe code to partially construct the struct, but I'd rather not do that.
What would be the best way to solve this problem? Do I need unsafe code? If I do, would it be safe to do this here? Or am I completely on the wrong track here and there is a better way to tie a data structure and its underlying buffer together?
I think you're right in that it's not possible to do this without unsafe code. I would consider the following options:
Change the reference in Bar to an index. The contents of the box won't be protected by a borrow, so the index might become invalid if you're not careful. However, an index might convey the meaning of the reference in a clearer way.
Move Box<[u8]> into Bar, and add a function buf() -> &[u8] to the implementation of Bar; instead of references, store indices in Bar. Now Bar is the owner of the buffer, so it can control its modification and keep the indices valid (thereby avoiding the problem of option #1).
As per DK's suggestion below, store indices in BarWithBuf (or in a helper struct BarInternal) and add a function fn bar(&self) -> Bar to the implementation of BarWithBuf, which constructs a Bar on-the-fly.
Which of these options is the most appropriate one depends on the actual problem context. I agree that some form of "member-by-member construction" of structs would be immensely helpful in Rust.
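The index-based variant with an on-the-fly Bar could be sketched as follows. The comma-splitting parser and the fields / field_ranges names are invented for the example; only the overall shape (owned buffer plus byte ranges, view built on demand) comes from the options above.

```rust
use std::ops::Range;

/// Borrowing view, rebuilt on demand from the stored ranges.
pub struct Bar<'a> {
    pub fields: Vec<&'a [u8]>,
}

/// Owns the buffer and remembers byte ranges instead of references, so the
/// struct contains no self-references and needs no unsafe code.
pub struct BarWithBuf {
    buf: Box<[u8]>,
    field_ranges: Vec<Range<usize>>,
}

impl BarWithBuf {
    /// Hypothetical parser: treats the buffer as comma-separated fields.
    pub fn parse(buf: Box<[u8]>) -> Self {
        let mut field_ranges = Vec::new();
        let mut start = 0;
        for (i, &byte) in buf.iter().enumerate() {
            if byte == b',' {
                field_ranges.push(start..i);
                start = i + 1;
            }
        }
        field_ranges.push(start..buf.len());
        BarWithBuf { buf, field_ranges }
    }

    /// Constructs a `Bar` whose slices borrow from `self.buf`.
    pub fn bar(&self) -> Bar<'_> {
        Bar {
            fields: self
                .field_ranges
                .iter()
                .map(|range| &self.buf[range.clone()])
                .collect(),
        }
    }
}
```

The cost is rebuilding the Vec of slices on each call to bar(); since buf is private and never modified after parsing, the stored ranges stay valid.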
Here's an approach that will work through a little bit of unsafe code. This approach requires that you are okay with putting the referred-to thing (here, your [u8]) on the heap, so it won't work for direct reference of a sibling field.
Let's start with a toy Bar<'a> implementation:
struct Bar<'a> {
refs: Vec<&'a [u8]>,
}
impl<'a> Bar<'a> {
pub fn parse(src: &'a [u8]) -> Self {
// placeholder for actually parsing goes here
Self { refs: vec![src] }
}
}
We'll make a BarWithBuf that uses a Bar<'static>, as 'static is the only lifetime with an accessible name. The buffer we store things in can be anything that doesn't move the target data around on us. I'm going to go with a Vec<u8>, but Box, Pin, whatever will work fine.
struct BarWithBuf {
buf: Vec<u8>,
bar: Bar<'static>,
}
The implementation requires a tiny bit of unsafe code.
impl BarWithBuf {
pub fn new(buf: Vec<u8>) -> Self {
// The `&'static [u8]` is inferred, but writing it here for demo
let buf_slice: &'static [u8] = unsafe {
// Going through a pointer is a "good" way to get around lifetime checks
std::slice::from_raw_parts(buf.as_ptr(), buf.len()) // as_ptr() also works for an empty Vec, where &buf[0] would panic
};
let bar = Bar::parse(buf_slice);
Self { buf, bar }
}
/// Access to Bar should always come through this function.
pub fn bar(&self) -> &Bar {
&self.bar
}
}
BarWithBuf::bar is an important function: it re-associates the proper lifetimes with the references. Rust's lifetime elision rules make the function equivalent to pub fn bar<'a>(&'a self) -> &'a Bar<'a>, which turns out to be exactly what we want. The lifetimes of the slices in BarWithBuf::bar::refs are tied to the lifetime of the BarWithBuf.
WARNING: You have to be very careful with your implementation here. You cannot #[derive(Clone)] for BarWithBuf, since the derived clone implementation will clone buf, but the elements of bar.refs will still point to the original. It is only one line of unsafe code, but its safety still depends on the "safe" bits around it.
For larger bits of self-referencing structures, there's the ouroboros crate, which wraps up a lot of unsafe bits for you. The techniques are similar to the one I described above, but they live behind macros, which is a more pleasant experience if you find yourself making a number of self-references.
