Adding state to a nom parser

I wrote a parser in nom that is completely stateless, now I need to wrap it in a few stateful layers.
I have a top-level parsing function named alt_fn that will provide me the next bit of parsed output as an enum variant, the details of which probably aren't important.
I have three things I need to do that involve state:
1) I need to conditionally perform a transformation on the output of alt_fn if there is a match in a non-mutable HashMap that is part of my State struct. This should basically be like a map! but as a method call on my struct. Something like this:
named!(alt_fn<AllTags>, alt!(/* snipped for brevity */));
fn applyMath(self, i: AllTags) -> AllTags { /* snipped for brevity */ }
method!(apply_math<State, &[u8], AllTags>, mut self, call_m!(self.applyMath, call!(alt_fn)));
This currently gives me: error: unexpected end of macro invocation with alt_fn underlined.
2) I need to update the other fields of the state struct with the data I got from the input (such as computing checksums and updating timestamps, etc.), and then transform the output again with this new knowledge. This will probably look like the following:
fn updateState(mut self, i: AllTags) -> AllTags { /* snipped for brevity */ }
method!(update_state<State, &[u8], AllTags>, mut self, call_m!(self.updateState, call_m!(self.applyMath)));
3) I need to call the method from part two repeatedly until all the input is used up:
method!(pub parse<State,&[u8],Vec<AllTags>>, mut self, many1!(update_state));
Unfortunately the nom docs are pretty limited, and I'm not great with macro syntax so I don't know what I'm doing wrong.

When I need to do something complicated with nom, I normally write my own functions.
For example
named!(my_func<T>, <my_macros>);
is equivalent to
fn my_func(i: &[u8]) -> nom::IResult<&[u8], T> {
    <my_macros>
}
with the proviso that you must pass i to the macro (see my comment).
Creating your own function means you can have any control flow you want in there, and it will play nicely with nom as long as it takes a &[u8] and returns a nom::IResult whose &[u8] part is the remaining unparsed raw input.
If you need some more info comment and I'll try to improve my answer!
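For what it's worth, here is a minimal sketch of that approach for the three steps above, written as plain functions instead of method!. It is my own guess at the shape of the code: it assumes nom 4's Result-based IResult (so `?` works), borrows State instead of consuming it, and renames applyMath/updateState to snake_case; the bodies of those methods stay yours.
// Hypothetical sketch: hand-written stateful wrappers around the stateless alt_fn.
// Assumes State::apply_math(&self, AllTags) -> AllTags and
// State::update_state(&mut self, AllTags) -> AllTags exist.
fn apply_math<'a>(state: &State, i: &'a [u8]) -> nom::IResult<&'a [u8], AllTags> {
    let (rest, tag) = alt_fn(i)?;             // step 1: run the stateless parser...
    Ok((rest, state.apply_math(tag)))         // ...then transform via the HashMap in State
}

fn update_state<'a>(state: &mut State, i: &'a [u8]) -> nom::IResult<&'a [u8], AllTags> {
    let (rest, tag) = apply_math(state, i)?;  // step 2: do the checksum/timestamp bookkeeping...
    Ok((rest, state.update_state(tag)))       // ...and transform again with the new knowledge
}

fn parse<'a>(state: &mut State, mut i: &'a [u8]) -> nom::IResult<&'a [u8], Vec<AllTags>> {
    let mut out = Vec::new();                 // step 3: loop, roughly like many1!, until the input is consumed
    while !i.is_empty() {
        let (rest, tag) = update_state(state, i)?;
        out.push(tag);
        i = rest;
    }
    Ok((i, out))
}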

Related

Ergonomically passing a slice of trait objects

I am converting a variety of types to String when they are passed to a function. I'm not concerned about performance as much as ergonomics, so I want the conversion to be implicit. The original, less generic implementation of the function simply used &[impl Into<String>], but I think that it should be possible to pass a variety of types at once without manually converting each to a string.
The key is that ideally, all of the following cases should be valid calls to my function:
// String literals
perform_tasks(&["Hello", "world"]);
// Owned strings
perform_tasks(&[String::from("foo"), String::from("bar")]);
// Non-string types
perform_tasks(&[1,2,3]);
// A mix of any of them
perform_tasks(&["All", 3, String::from("types!")]);
Some of the various signatures I've attempted to use:
fn perform_tasks(items: &[impl Into<String>])
The original version fails twice; it can't handle numeric types without manual conversion, and it requires all of the arguments to be the same type.
fn perform_tasks(items: &[impl ToString])
This is slightly closer, but it still requires all of the arguments to be of one type.
fn perform_tasks(items: &[&dyn ToString])
Doing it this way is almost enough, but it won't compile unless I manually add a borrow on each argument.
And that's where we are. I suspect that either Borrow or AsRef will be involved in a solution, but I haven't found a way to get them to handle this situation. For convenience, here is a playground link to the final signature in use (without the needed references for it to compile), alongside the various tests.
The following way works for the first three cases if I understand your intention correctly.
pub fn perform_tasks<I, A>(values: I) -> Vec<String>
where
    A: ToString,
    I: IntoIterator<Item = A>,
{
    values.into_iter().map(|s| s.to_string()).collect()
}
As the other comments pointed out, Rust does not support an array of mixed types. However, you can do one extra step to convert them into a &[&dyn fmt::Display] and then call the same function perform_tasks to get their strings.
let slice: &[&dyn std::fmt::Display] = &[&"All", &3, &String::from("types!")];
perform_tasks(slice);
Here is the playground.
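To make that concrete, here is a small usage sketch (assuming the perform_tasks definition above) covering the four calls from the question:
fn main() {
    // String literals, owned Strings and numbers all satisfy A: ToString:
    let a = perform_tasks(&["Hello", "world"]);
    let b = perform_tasks(vec![String::from("foo"), String::from("bar")]);
    let c = perform_tasks(&[1, 2, 3]);
    // The mixed case needs the extra trait-object step:
    let mixed: &[&dyn std::fmt::Display] = &[&"All", &3, &String::from("types!")];
    let d = perform_tasks(mixed);
    assert_eq!(d, ["All", "3", "types!"]);
    println!("{:?} {:?} {:?} {:?}", a, b, c, d);
}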
If I understand your intention right, what you want is like this
fn main() {
    let a = 1;
    myfn(a);
}

fn myfn(i: &dyn SomeTrait) {
    // do something
}
So it's like implicitly borrowing an object as a function argument. However, Rust won't let you implicitly borrow objects: borrowing is an important safety measure in Rust, and the & helps other programmers quickly identify what is a reference and what is not. Rust is therefore designed to require the explicit & to avoid confusion.

Compress and serialize with rust: the specific case of `Vec<u8>`

I am currently working on a way to compress data (structs I made) when serializing with serde. Everything works fine except for the specific case of Vec<u8>. I'd like to know if some of you have already met this problem, or if you have any thoughts to share :)
My goal is to provide a simple way of compressing any part of a struct by adding the #[serde(with="crate::compress")] attribute. Whatever the underlying data structure, I want it to be compressed with my custom serialize function.
For instance, I want this structure to be serializable with compression:
#[derive(Serialize)]
struct MyCustomStruct {
    data: String,
    #[serde(with="crate::compress")]
    data2: SomeOtherStruct,
    #[serde(with="crate::compress")]
    data3: Vec<u8>,
}
For now, everything works fine and calls my custom module:
// in the compress module
pub fn serialize<T, S>(data: T, serializer: S) -> Result<S::Ok, S::Error>
where
    T: Serialize,
    S: Serializer,
{
    // Simplified functioning below:
    let serialized_data: Vec<u8> = some_function(data);
    let compressed_data: Vec<u8> = some_other_function(serialized_data);
    Ok(chosen_serializer::serialize(compressed_data, serializer)?)
}
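For reference, a concrete sketch of such a module could look like the following. The choices here are arbitrary, not necessarily what the question uses: bincode for the intermediate encoding, flate2 (zlib) for the compression, errors reduced to ser::Error::custom, and data taken by reference (which is how #[serde(with = ...)] calls it anyway). Note that this still pushes a Vec<u8> field through the intermediate serialization step, which is exactly the redundancy the rest of the question is about.
// Hedged sketch of a compress module; bincode + flate2 are arbitrary choices here.
use flate2::{write::ZlibEncoder, Compression};
use serde::ser::{Error, Serialize, Serializer};
use std::io::Write;

pub fn serialize<T, S>(data: &T, serializer: S) -> Result<S::Ok, S::Error>
where
    T: Serialize,
    S: Serializer,
{
    // 1. Turn the value into bytes with some byte-oriented format (bincode here).
    let serialized: Vec<u8> = bincode::serialize(data).map_err(S::Error::custom)?;
    // 2. Compress those bytes.
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(&serialized).map_err(S::Error::custom)?;
    let compressed = encoder.finish().map_err(S::Error::custom)?;
    // 3. Emit them compactly as a byte string via serde_bytes.
    serde_bytes::serialize(&compressed, serializer)
}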
However, I do have a problem when it comes to compressing Vec<u8> elements (such as data3 in the struct above).
Since the data is already a Vec<u8>, I don't need to serialize it and can pass it directly to my compression function. Worse: if I serialize it, I may not be using serde_bytes, in which case the compression attribute will actually increase the size!
I also don't want two different functions (one for Vec<u8>, one for everything else), since it would be up to the user to choose which one to use, and using the wrong one with Vec<u8> would still work but would increase the size instead of decreasing it.
I have already thought about or tried a few things, but none of them works:
1) Macro:
Very complicated, since it would need to generate another macro invocation: a macro that writes either #[serde(with="crate::compress")] or #[serde(with="crate::compress_vec_u8")] depending on the field's type. I don't even know if this is possible, a meta-macro? :')
2) Trait implementation
It would be something like this:
trait CompressAndSerialize {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer;
}

impl CompressAndSerialize for Vec<u8> { ... }

impl<T> CompressAndSerialize for T where T: Serialize { ... }
but then I got an error (conflicting implementations of trait CompressAndSerialize for Vec<u8>), which seems normal, since there are indeed two implementations that apply to Vec<u8> :/
3) Worse solution, but the one I'm heading for: using TypeId::of::<T>()
and skipping serialization if the data is already a Vec<u8>. I would still have to encapsulate it in an enum so that deserialization knows whether the data is a Vec<u8> or something else...
Edit: this isn't possible, because the type must have a 'static lifetime (TypeId::of requires T: 'static), which is almost never satisfiable (not in my case, anyway).
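(For reference, the 'static requirement comes from TypeId itself; a hypothetical check like the one below only compiles because of the T: 'static bound, which borrowed data cannot satisfy.)
use std::any::TypeId;

// Hypothetical helper: TypeId::of::<T>() is only callable when T: 'static.
fn is_vec_u8<T: 'static>() -> bool {
    TypeId::of::<T>() == TypeId::of::<Vec<u8>>()
}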
Sorry this is a bit long and quite specific but I hope maybe one of you will have suggestions on how to deal with this problem :D
Edit: in the serde documentation (https://serde.rs/impl-serialize.html#other-special-cases) there is a link to an issue in rust-lang/rust (https://github.com/rust-lang/rust/issues/31844); once it is resolved, I won't have any problem with the serialization of Vec<u8>, since serde_bytes won't be needed anymore. Too bad this issue has been open since Feb 2016 :'(

How do I create a streaming parser in nom?

I've created a few non-trivial parsers in nom, so I'm pretty familiar with it at this point. All the parsers I've created until now always provide the entire input slice to the parser.
I'd like to create a streaming parser, which I assume means that I can continue to feed bytes into the parser until it is complete. I've had a hard time finding any documentation or examples that illustrate this, and I also question my assumption of what a "streaming parser" is.
My questions are:
Is my understanding of what a streaming parser is correct?
If so, are there any good examples of a parser using this technique?
nom parsers neither maintain a buffer to feed more data into, nor do they maintain "state" where they previously needed more bytes.
But if you take a look at the IResult structure you see that you can return a partial result or indicate that you need more data.
There seem to be some structures provided to handle streaming: I think you are supposed to create a Consumer from a parser using the consumer_from_parser! macro, implement a Producer for your data source, and call run until it returns None (and start again when you have more data). Examples and docs seem to be mostly missing so far - see bottom of https://github.com/Geal/nom :)
Also it looks like most functions and macros in nom are not documented well (or at all) regarding their behavior when hitting the end of the input. For example take_until! returns Incomplete if the input isn't long enough to contain the substr to look for, but returns an error if the input is long enough but doesn't contain substr.
Also nom mostly uses either &[u8] or &str for input; you can't signal an actual "end of stream" through these types. You could implement your own input type (related traits: nom::{AsBytes,Compare,FindSubstring,FindToken,InputIter,InputLength,InputTake,Offset,ParseTo,Slice}) to add a "reached end of stream" flag, but the nom provided macros and functions won't be able to interpret it.
All in all I'd recommend splitting streamed input through some other means into chunks you can handle with simple non-streaming parsers (maybe even use synom instead of nom).
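To make the Incomplete mechanism concrete, here is a small sketch of my own (written against the streaming combinators of a newer nom, 5+, rather than the Consumer/Producer machinery mentioned above): the caller keeps a buffer and feeds in more bytes whenever the parser reports Err::Incomplete.
use nom::{bytes::streaming::take, Err, IResult};

// A trivial streaming parser: it needs exactly four bytes before it can succeed.
fn parse_chunk(input: &[u8]) -> IResult<&[u8], &[u8]> {
    take(4usize)(input)
}

fn main() {
    let mut buffer: Vec<u8> = b"ab".to_vec(); // not enough bytes yet
    loop {
        let done = match parse_chunk(&buffer) {
            Ok((_rest, chunk)) => {
                println!("parsed: {:?}", chunk);
                true
            }
            Err(Err::Incomplete(needed)) => {
                println!("need more input: {:?}", needed);
                false
            }
            Err(e) => {
                eprintln!("parse error: {:?}", e);
                true
            }
        };
        if done {
            break;
        }
        buffer.extend_from_slice(b"cd"); // simulate reading more bytes from the source
    }
}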
Here is a minimal working example. As @Stefan wrote, "I'd recommend splitting streamed input through some other means into chunks you can handle".
What somewhat works (and I'd be glad for suggestions on how to improve it) is to combine the File::bytes() method (from std::io::Read), take only as many bytes as necessary, and pass them to nom::bytes::streaming::take.
let reader = file.bytes();
let buf = reader.take(length).collect::<B>()?;
let (_input, chunk) = take(length)(&*buf)...;
The complete function can look like this:
/// Parse the first handful of bytes and return the bytes interpreted as UTF8
fn parse_first_bytes(file: std::fs::File, length: usize) -> Result<String> {
    type B = std::result::Result<Vec<u8>, std::io::Error>;
    let reader = file.bytes();
    let buf = reader.take(length).collect::<B>()?;
    let (_input, chunk) = take(length)(&*buf)
        .finish()
        .map_err(|nom::error::Error { input: _, code: _ }| eyre!("..."))?;
    let s = String::from_utf8_lossy(chunk);
    Ok(s.to_string())
}
Here is the rest of the program, an implementation similar to the Unix head command.
use color_eyre::Result;
use eyre::eyre;
use nom::{bytes::streaming::take, Finish};
use std::{fs::File, io::Read, path::PathBuf};
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt(about = "A minimal example of parsing a file only partially.
This implements the POSIX 'head' utility.")]
struct Args {
    /// Input File
    #[structopt(parse(from_os_str))]
    input: PathBuf,
    /// Number of bytes to consume
    #[structopt(short = "c", default_value = "32")]
    num_bytes: usize,
}

fn main() -> Result<()> {
    let args = Args::from_args();
    let file = File::open(args.input)?;
    let head = parse_first_bytes(file, args.num_bytes)?;
    println!("{}", head);
    Ok(())
}

Concisely initializing a vector of Strings

I'm trying to create a vector of Strings to test arg parsing (since this is what std::env::args() returns) but struggling with how to do this concisely.
What I want:
let test_args = vec!["-w", "60", "arg"]; // should be a Vec of Strings
let expected_results = my_arg_parser(test_args);
This obviously doesn't work because the vector's contents are all &strs.
Using String::from works, but it doesn't scale well and is ugly :)
let args = vec![String::from("-w"), String::from("60"), String::from("args")];
I could map over the references and return string objects, but this seems very verbose:
let args = vec!["-w", "60", "args"].iter().map(|x| x.to_string()).collect::<Vec<String>>();
Should I just create a helper function to do the conversion, or is there an easier way?
You can use the to_string() method directly on the literals:
let test_args = vec!["-w".to_string(), "60".to_string(), "arg".to_string()];
Otherwise a macro to do this would be as simple as:
macro_rules! vec_of_strings {
    ($($x:expr),*) => (vec![$($x.to_string()),*]);
}
See the play.rust-lang.org example.
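Hypothetical usage of that macro, for the example from the question:
let test_args: Vec<String> = vec_of_strings!["-w", "60", "arg"];
assert_eq!(test_args, vec!["-w".to_string(), "60".to_string(), "arg".to_string()]);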
JDemler already provided a nice answer. I have two additional things to say:
First, you can also use into() instead of to_string() for all elements but the first (the first to_string() pins the element type to String, so the remaining into() calls can be inferred). This is slightly shorter and also equivalent to to_string()/String::from(). Looks like this:
vec!["a".to_string(), "b".into(), "c".into()];
Second, you might want to redesign your arg parsing. I will assume here that you won't mutate the Strings you get from env::args(). I imagine your current function to look like:
fn parse_args(args: &[String]) -> SomeResult { ... }
But you can make that function more generic by accepting not just String but anything that implements AsRef<str>. It would look like this:
fn parse_args<T: AsRef<str>>(args: &[T]) -> SomeResult { ... }
In the documentation you can see that String as well as str itself implement that trait. Therefore you can pass a &[String] and a &[&str] into your function. Awesome, eh?
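Here is a small sketch of why that helps; the body and the return type are placeholders of mine, not your real parser:
fn parse_args<T: AsRef<str>>(args: &[T]) -> Vec<String> {
    // Placeholder body: just normalize everything to owned Strings.
    args.iter().map(|a| a.as_ref().to_owned()).collect()
}

fn main() {
    // Works with owned Strings (e.g. collected from env::args())...
    let owned: Vec<String> = vec!["-w".into(), "60".into(), "arg".into()];
    // ...and with plain string literals in tests, no conversion needed.
    let literals: &[&str] = &["-w", "60", "arg"];
    assert_eq!(parse_args(&owned), parse_args(literals));
}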
In similar fashion, if you want to accept anything that can be converted into an owned String, you can accept <T: Into<String>> and if you want to return either a String or an &str, you can use Cow. You can read more about that here and here.
Apart from all that: there are plenty of good CLI-Arg parsers out there (clap-rs, docopt-rs, ...), so you might not need to write your own.
I agree that Lukas Kalbertodt's answer is the best — use generics to accept anything that can look like a slice of strings.
However, you can clean up the map version a little bit:
There's no need to allocate a vector for the initial set of strings.
There's no need to use the complete type (Vec<String>); you could specify just the collection (Vec<_>). If you pass the result to a function that only accepts a Vec<String>, then you don't need any explicit types at all; it can be completely inferred.
You can use a slightly shorter s.into() in the map.
fn do_stuff_with_args(args: Vec<String>) { println!("{}", args.len()) }
fn main() {
    let args = ["-w", "60", "args"].iter().map(|&s| s.into()).collect();
    do_stuff_with_args(args);
}

How to write a fn that processes input and returns an iterator instead of the full result?

Forgive me if this is a dumb question, but I'm new to Rust, and having a hard time writing this toy program to test my understanding.
I want a function that, given a string, returns the first word in each line as an iterator (because the input could be huge, I don't want to buffer the result as an array). Here's the program I wrote, which collects the result as an array first:
fn get_first_words(input: ~str) -> ~[&str] {
    return input.lines_any().filter_map(|x| x.split_str(" ").nth(0)).collect();
}

fn main() {
    let s = ~"Hello World\nFoo Bar";
    let words = get_first_words(s);
    for word in words.iter() {
        println!("{}", word);
    }
}
Result (as expected):
Hello
Foo
How do I modify this to return an Iterator instead? I'm apparently not allowed to make Iterator<&str> the return type. If I try @Iterator<&str>, rustc says
error: The managed box syntax is being replaced by the `std::gc::Gc` and `std::rc::Rc` types. Equivalent functionality to managed trait objects will be implemented but is currently missing.
I can't figure out for the life of me how to make that work.
Similarly, trying to return ~Iterator<&str> makes rustc complain that the actual type is std::iter::FilterMap<....blah...>.
In C# this is really easy, as you simply return the result of the equivalent map call as an IEnumerable<string>. Then the callee doesn't have to know what the actual type is that's returned, it only uses methods available in the IEnumerable interface.
Is there nothing like returning an interface in Rust??
(I'm using Rust 0.10)
I believe that the equivalent of the C# example would be returning ~Iterator<&str>. This can be done, but must be written explicitly: rather than returning x, return ~x as ~Iterator<&'a str>. (By the way, your function is going to have to take &'a str rather than ~str—if you don’t know why, ask and I’ll explain.)
This is not, however, idiomatic Rust because it is needlessly inefficient. The idiomatic Rust is to list the return type explicitly. You can specify it in one place like this if you like:
use std::iter::{FilterMap, Map};
use std::str::CharSplits;

type Foo<'a> = FilterMap<'a, &'a str, &'a str,
                         Map<'a, &'a str, &'a str,
                             CharSplits<'a, char>>>;
And then list Foo as the return type.
Yes, this is cumbersome. At present, there is no such thing as inferring a return type in any way. This has, however, been discussed and I believe it likely that it will come eventually in some syntax similar to fn foo<'a>(&'a str) -> Iterator<&'a str>. For now, though, there is no fancy sugar.
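(For readers on a modern Rust, long after the 0.10 of this question: the boxed-trait-object version of the answer looks like the sketch below, and the anticipated return-type inference eventually arrived as impl Trait.)
// Modern-Rust sketch (not valid Rust 0.10): return the iterator as a boxed trait object.
fn get_first_words<'a>(input: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
    Box::new(input.lines().filter_map(|line| line.split(' ').next()))
}

fn main() {
    for word in get_first_words("Hello World\nFoo Bar") {
        println!("{}", word); // prints "Hello" then "Foo"
    }
}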
