I've created a few non-trivial parsers in nom, so I'm pretty familiar with it at this point. All the parsers I've created so far have been given the entire input slice up front.
I'd like to create a streaming parser, which I assume means that I can continue to feed bytes into the parser until it is complete. I've had a hard time finding any documentation or examples that illustrate this, and I also question my assumption of what a "streaming parser" is.
My questions are:
Is my understanding of what a streaming parser is correct?
If so, are there any good examples of a parser using this technique?
nom parsers neither maintain a buffer to feed more data into, nor do they maintain "state" where they previously needed more bytes.
But if you take a look at the IResult type, you'll see that a parser can return a partial result or indicate that it needs more data.
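In the nom versions this answer refers to, IResult was roughly this enum (simplified from memory; later nom releases turned it into a Result alias):
pub enum IResult<I, O> {
    Done(I, O),          // fully parsed: remaining input plus the produced value
    Error(Err),          // parsing failed
    Incomplete(Needed),  // more data needed; Needed::Size(n) says how much
}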
There seem to be some structures provided to handle streaming: I think you are supposed to create a Consumer from a parser using the consumer_from_parser! macro, implement a Producer for your data source, and call run until it returns None (and start again when you have more data). Examples and docs seem to be mostly missing so far - see bottom of https://github.com/Geal/nom :)
Also it looks like most functions and macros in nom are not documented well (or at all) regarding their behavior when hitting the end of the input. For example, take_until! returns Incomplete if the input isn't long enough to contain the substring it's looking for, but returns an error if the input is long enough yet doesn't contain it.
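For comparison, modern nom (7.x) makes this end-of-input behavior explicit by shipping every parser in a streaming and a complete flavor; a minimal sketch (in the streaming flavor, a pattern that has not appeared yet is always reported as Incomplete):
use nom::bytes::{complete, streaming};
use nom::{Err, IResult};

fn main() {
    // streaming: the input may still grow, so a missing pattern means "need more data"
    let r: IResult<&str, &str> = streaming::take_until("stop")("abc");
    assert!(matches!(r, Err(Err::Incomplete(_))));

    // complete: the input is final, so a missing pattern is a hard error
    let r: IResult<&str, &str> = complete::take_until("stop")("abc");
    assert!(matches!(r, Err(Err::Error(_))));
}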
Also nom mostly uses either &[u8] or &str for input; you can't signal an actual "end of stream" through these types. You could implement your own input type (related traits: nom::{AsBytes,Compare,FindSubstring,FindToken,InputIter,InputLength,InputTake,Offset,ParseTo,Slice}) to add a "reached end of stream" flag, but the nom provided macros and functions won't be able to interpret it.
All in all I'd recommend splitting streamed input through some other means into chunks you can handle with simple non-streaming parsers (maybe even use synom instead of nom).
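For example, if the records are line-delimited, the chunking can be done with BufRead, running an ordinary non-streaming parse on each complete line (parse_record below is a hypothetical stand-in for your actual parser):
use std::io::{BufRead, BufReader, Read};

// Hypothetical record parser: any ordinary, non-streaming parse function works here.
fn parse_record(line: &str) -> Option<(u32, &str)> {
    let (id, rest) = line.split_once(',')?;
    Some((id.parse().ok()?, rest))
}

// Chunk the stream into lines, then feed each complete chunk to the parser.
fn parse_stream<R: Read>(source: R) -> Vec<(u32, String)> {
    BufReader::new(source)
        .lines()
        .filter_map(|line| {
            let line = line.ok()?;
            parse_record(&line).map(|(id, rest)| (id, rest.to_string()))
        })
        .collect()
}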
Here is a minimal working example. As @Stefan wrote, "I'd recommend splitting streamed input through some other means into chunks you can handle".
What somewhat works (and I'd be glad for suggestions on how to improve it) is to use the Read::bytes() iterator, take only as many bytes as necessary, and pass them to nom::bytes::streaming::take.
let reader = file.bytes();
let buf = reader.take(length).collect::<B>()?;
let (_input, chunk) = take(length)(&*buf)...;
The complete function can look like this:
/// Parse the first handful of bytes and return the bytes interpreted as UTF-8
fn parse_first_bytes(file: std::fs::File, length: usize) -> Result<String> {
    type B = std::result::Result<Vec<u8>, std::io::Error>;
    let reader = file.bytes();
    let buf = reader.take(length).collect::<B>()?;
    // Note: Finish::finish panics on Err::Incomplete, which streaming take
    // returns when buf holds fewer than `length` bytes.
    let (_input, chunk) = take(length)(&*buf)
        .finish()
        .map_err(|nom::error::Error { input: _, code: _ }| eyre!("..."))?;
    let s = String::from_utf8_lossy(chunk);
    Ok(s.to_string())
}
Here is the rest of the program, implementing something similar to Unix's head command.
use color_eyre::Result;
use eyre::eyre;
use nom::{bytes::streaming::take, Finish};
use std::{fs::File, io::Read, path::PathBuf};
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt(about = "A minimal example of parsing a file only partially.
This implements the POSIX 'head' utility.")]
struct Args {
    /// Input File
    #[structopt(parse(from_os_str))]
    input: PathBuf,
    /// Number of bytes to consume
    #[structopt(short = "c", default_value = "32")]
    num_bytes: usize,
}

fn main() -> Result<()> {
    let args = Args::from_args();
    let file = File::open(args.input)?;
    let head = parse_first_bytes(file, args.num_bytes)?;
    println!("{}", head);
    Ok(())
}
Most operations on str in Rust create newly-allocated Strings. I understand that, because UTF-8 is complex, one cannot generally know beforehand the size of the output, and so that output must be growable. However, I might have my own buffer, such as a Vec<u8>, that I'd like to grow. Is there any way to specify an existing output container to string operations?
e.g.,
let s = "my string";
let s: Vec<u8> = Vec::with_capacity(100); // explicit allocation
s.upper_into(s); // perhaps new allocation here, if result fits in `v`
-EDIT-
This is, of course, just for one case. I'd love to be able to treat all of the str methods this way, including, for example, those in sentencecase, without having to copy their internal logic.
You can walk char-by-char and use char::to_uppercase():
let mut uppercase = String::with_capacity(100);
uppercase.extend(s.chars().flat_map(char::to_uppercase));
I think it doesn't handle every edge case correctly, but this is exactly what str::to_uppercase() does too, so I assume it's OK.
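A complete, runnable version of the same idea (the capacity of 100 is just an illustrative guess; the String still grows safely if it turns out too small):
fn main() {
    let s = "my string";
    // Pre-allocate the output buffer and extend it in place, char by char.
    let mut uppercase = String::with_capacity(100);
    uppercase.extend(s.chars().flat_map(char::to_uppercase));
    assert_eq!(uppercase, "MY STRING");
}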
I was trying to use Rust for competitive coding, and I was wondering what the most efficient way of storing user input in a Vec is. I have come up with a method, but I am afraid that it is slow and redundant.
Here is my code:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).expect("can't read line");
    let input: Vec<&str> = input.split(" ").collect();
    let input: Vec<String> = input.iter().map(|x| x.to_string()).collect();
    let input: Vec<i32> = input.iter().map(|x| x.trim().parse().unwrap()).collect();
    println!("{:?}", input);
}
PS: I am new to Rust.
I see these ways of improving the performance of the code (a few of them are combined in the sketch after this list):
Although not really relevant for std::io::stdin(), std::io::BufReader can have a great effect when reading e.g. from std::fs::File. Buffer capacity can also matter.
Using a locked stdin: let si = std::io::stdin(); let si = si.lock();
Avoiding allocations by keeping vectors around and using Extend::extend instead of collect, if the code reads multiple lines (unlike the sample you posted in the question).
Maybe avoiding temporary vectors altogether and just chaining Iterator operations together, or using a loop like for line in input.split(...) { ... }. This may affect performance either way; you need to experiment to find out.
Avoiding to_string() and just storing references into the input buffer (which can also be used to parse() into i32). Note that this may invite Rust's famous borrowing-and-lifetimes complexity.
Maybe finding some fast SIMD-enhanced string-to-int parser instead of libstd's parse().
Maybe streaming the results to the algorithm instead of collecting everything into a Vec first. This can be beneficial, especially if multiple threads can be used. For performance, you would still likely need to send the data in chunks, not one i32 at a time.
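A minimal sketch combining the locked-stdin, single-pass, and buffer-reuse points above (illustrative only, not a benchmark):
use std::io::{self, Read};

fn main() {
    // Lock stdin once up front instead of re-locking on every read.
    let stdin = io::stdin();
    let mut locked = stdin.lock();

    // Read all input in one go, then parse it in a single iterator pass.
    let mut input = String::new();
    locked.read_to_string(&mut input).expect("failed to read stdin");

    // Reuse one Vec across parses via Extend::extend instead of collect().
    let mut numbers: Vec<i32> = Vec::new();
    numbers.extend(
        input
            .split_ascii_whitespace()
            .map(|tok| tok.parse::<i32>().expect("not an integer")),
    );
    println!("{:?}", numbers);
}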
Yeah, there are some changes you can make that will make your code simpler, more precise, and faster.
A better version:
use std::io;

fn main() {
    let mut input = String::new();
    io::stdin().read_line(&mut input).unwrap();
    let input: Vec<i32> = input.split_whitespace().map(|x| x.parse().unwrap()).collect();
    println!("{:?}", input);
}
Explanation
input.split_whitespace() returns an iterator over the pieces separated by any kind of whitespace, including line breaks. This saves the time spent splitting on just a single space with input.split(" ") and then iterating over the result again to call .trim() on each string slice to remove surrounding whitespace.
(You can also check out input.split_ascii_whitespace() if you want to restrict the split to ASCII whitespace.)
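A small demonstration of the difference:
fn main() {
    let s = " 1  2\n3 ";
    // split_whitespace() splits on any run of whitespace and skips empty pieces:
    assert_eq!(s.split_whitespace().collect::<Vec<_>>(), ["1", "2", "3"]);
    // split(" ") splits on single spaces only, keeping empty pieces and the '\n':
    assert_eq!(s.split(" ").collect::<Vec<_>>(), ["", "1", "", "2\n3", ""]);
}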
There was no need for input.iter().map(|x| x.to_string()).collect(): you can call .trim() and .parse() directly on a string slice, so converting to String buys nothing.
This saves time both at runtime and while writing the code, since .collect() is called only once and the input is iterated over just once.
I wrote a parser in nom that is completely stateless, now I need to wrap it in a few stateful layers.
I have a top-level parsing function named alt_fn that will provide me the next bit of parsed output as an enum variant, the details of which probably aren't important.
I have three things I need to do that involve state:
1) I need to conditionally perform a transformation on the output of alt_fn if there is a match in an immutable HashMap that is part of my State struct. This should basically be like a map! but as a method call on my struct. Something like this:
named!(alt_fn<AllTags>, alt!(/* snipped for brevity */));
fn applyMath(self, i: AllTags) -> AllTags { /* snipped for brevity */ }
method!(apply_math<State, &[u8], AllTags>, mut self, call_m!(self.applyMath, call!(alt_fn)));
This currently gives me: error: unexpected end of macro invocation with alt_fn underlined.
2) I need to update the other fields of the state struct with the data I got from the input (such as computing checksums and updating timestamps, etc.), and then transform the output again with this new knowledge. This will probably look like the following:
fn updateState(mut self, i: AllTags) -> AllTags { /* snipped for brevity */ }
method!(update_state<State, &[u8], AllTags>, mut self, call_m!(self.updateState, call_m!(self.applyMath)));
3) I need to call the method from part two repeatedly until all the input is used up:
method!(pub parse<State,&[u8],Vec<AllTags>>, mut self, many1!(update_state));
Unfortunately the nom docs are pretty limited, and I'm not great with macro syntax, so I don't know what I'm doing wrong.
When I need to do something complicated with nom, I normally write my own functions.
For example
named!(my_func<T>, <my_macros>);
is equivalent to
fn my_func(i: &[u8]) -> nom::IResult<&[u8], T> {
    <my_macros>
}
with the proviso that you must pass i to the macro (see my comment).
Creating your own function means you can have any control flow you want in there, and it will play nice with nom as long as it takes a &[u8] and returns nom::IResult, where the &[u8] it carries is the remaining unparsed raw input.
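For illustration, here's what such a hand-written function looks like in modern nom's function style (the answer above uses the older macro style, but the shape is the same; key_value and its grammar are made up):
use nom::bytes::complete::tag;
use nom::character::complete::{alpha1, digit1};
use nom::IResult;

// A hand-written parser: plain Rust control flow, as long as it takes the
// input and returns IResult with the remaining input first.
fn key_value(i: &[u8]) -> IResult<&[u8], (&[u8], &[u8])> {
    let (i, key) = alpha1(i)?;   // a run of letters
    let (i, _) = tag("=")(i)?;   // the literal '='
    let (i, value) = digit1(i)?; // a run of digits
    Ok((i, (key, value)))
}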
If you need some more info, comment and I'll try to improve my answer!
I've written the following method to parse binary data from a gzipped file, using GzDecoder from the flate2 library.
fn read_primitive<T: Copy>(reader: &mut GzDecoder<File>) -> std::io::Result<T> {
    let sz = mem::size_of::<T>();
    let mut vec = Vec::<u8>::with_capacity(sz);
    let ret: T;
    unsafe {
        vec.set_len(sz);
        let mut s = &mut vec[..];
        try!(reader.read(&mut s));
        let ptr: *const u8 = s.as_ptr();
        ret = *(ptr as *const T)
    }
    Ok(ret)
}
It works, but I'm not particularly happy with the code, especially with using the dummy vector and the temporary variable ptr. It all feels very inelegant to me and I'm sure there's a better way to do this. I'd be happy to hear any suggestions of how to clean up this code.
Your code allows any copyable T, not just primitives. That means you could try to read in something containing a reference, which is probably not what you want:
#[derive(Copy, Clone)]
struct Foo<'a>(&'a str);
However, the general sketch of your code is what I'd expect. You need a temporary place to store some data, and then you must convert that data to the appropriate primitive (perhaps dealing with endianness issues).
I'd recommend the byteorder library. With it, you call specific methods for the primitive that is required:
reader.read_u16::<LittleEndian>()
Since these methods know the desired size, they can stack-allocate an array to use as the temporary buffer, which is likely a bit more efficient than a heap allocation. Additionally, I'd suggest changing your code to accept a generic object with the Read trait, instead of the specific GzDecoder.
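A sketch of both suggestions together (the header fields here are made up for illustration):
use byteorder::{LittleEndian, ReadBytesExt};
use std::io::Read;

// Generic over Read: works with a File, a GzDecoder, a &[u8] slice, etc.
fn read_header<R: Read>(reader: &mut R) -> std::io::Result<(u16, u32)> {
    let version = reader.read_u16::<LittleEndian>()?; // reads into a 2-byte stack buffer
    let length = reader.read_u32::<LittleEndian>()?;  // reads into a 4-byte stack buffer
    Ok((version, length))
}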
You may also want to look into a serialization library like rustc-serialize or serde to see if they fit any of your use cases.
I'm trying to take an image that is converted into a vector of bytes and write those bytes to a new file. The first part is working, and my code is compiling, but the new file that is created ends up empty (nothing is written to it). What am I missing?
Is there a cleaner way to convert Vec<u8> into &[u8] so that it can be written? The way I'm currently doing it seems kind of ridiculous...
use std::os;
use std::io::BufferedReader;
use std::io::File;
use std::io::BufferedWriter;

fn get_file_buffer(path_str: String) -> Vec<u8> {
    let path = Path::new(path_str.as_bytes());
    let file = File::open(&path);
    let mut reader = BufferedReader::new(file);
    match reader.read_to_end() {
        Ok(x) => x,
        Err(_) => vec![0],
    }
}

fn main() {
    let file = get_file_buffer(os::args()[1].clone());
    let mut new_file = File::create(&Path::new("foo.png")).unwrap();
    let mut writer = BufferedWriter::new(new_file);
    writer.write(String::from_utf8(file).unwrap().as_bytes()).unwrap();
    writer.flush().unwrap();
}
Given a Vec<T>, you can get a &[T] out of it in two ways (both are shown in the sketch after this list):
Take a reference to a dereference of it, i.e. &*file; this works because Vec<T> implements Deref<Target = [T]>, so *file is effectively of type [T] (though a bare *file without immediately borrowing it is not legal, since [T] is unsized).
Call the as_slice() method.
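Both in one compact sketch (the third line shows the deref coercion that usually makes this implicit in today's Rust):
fn main() {
    let file: Vec<u8> = vec![1, 2, 3];
    let a: &[u8] = &*file;          // explicit deref, then re-borrow
    let b: &[u8] = file.as_slice(); // the method form
    let c: &[u8] = &file;           // deref coercion inserts the * for you
    assert_eq!(a, b);
    assert_eq!(b, c);
}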
As the BufWriter docs say, "the buffer will be written out when the writer is dropped", so the writer.flush().unwrap() is not strictly necessary; it serves only to make the handling of errors explicit.
But as for the behaviour you describe, I mostly cannot reproduce it. As long as you do not encounter any I/O errors, the version without the String dance will work fine, while the version with it will panic if the input data is not valid UTF-8 (which, if you're dealing with images, it almost certainly won't be). String::from_utf8 returns an Err in such cases, and unwrapping that panics.
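For reference, in today's Rust the whole program collapses to something like this (a sketch; the question's code predates Rust 1.0, so every API it uses has since been replaced):
use std::env;
use std::fs::{self, File};
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    let path = env::args().nth(1).expect("usage: prog <input>");
    let bytes = fs::read(path)?; // Vec<u8>, no UTF-8 involved
    let file = File::create("foo.png")?;
    let mut writer = BufWriter::new(file);
    writer.write_all(&bytes)?; // &Vec<u8> coerces to &[u8]
    writer.flush()?; // optional: dropping the writer also flushes, but this surfaces errors
    Ok(())
}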