How can I create a negative lookahead parser for nom?
For example, I'd like to parse "hello", except if it's followed by " world". The equivalent regex would be hello(?! world).
I tried to combine the cond, not and peek parsers:
fn parser(input: &str) -> IResult<&str, &str> {
    cond(peek(not(tag(" world"))(input)), tag("hello"))(input)
}
but this doesn't work as cond expects the condition as bool instead of as IResult.
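Try using terminated() together with not(). not succeeds only when its inner parser fails, and it consumes no input, so it behaves as a negative lookahead: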
terminated(tag("hello"), not(tag(" world")))
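To make the behaviour concrete without pulling in nom, here is a minimal std-only sketch of the same logic: match the literal, then check (without consuming anything) that the forbidden suffix does not follow. The name `hello_not_world` and the `Option` return type are assumptions made for illustration; they are not part of nom's API.

```rust
/// Std-only illustration of `hello(?! world)`.
/// Returns `(rest, matched)` on success. Hypothetical helper, not nom code.
fn hello_not_world(input: &str) -> Option<(&str, &str)> {
    // Consume the literal "hello", as `tag("hello")` would.
    let rest = input.strip_prefix("hello")?;
    // Negative lookahead: fail if " world" follows, but consume nothing.
    if rest.starts_with(" world") {
        return None;
    }
    let matched = &input[..input.len() - rest.len()];
    Some((rest, matched))
}
```

Calling it on "hello there" should leave " there" as the remaining input, while "hello world" is rejected outright.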
Using latest (v7) nom crate.
Trying to build a parser capable of extracting code blocks from markdown. In the flavor of markdown I need to support, a code block only ends when there are three grave/backtick characters on a line by themselves, optionally followed by whitespace.
Here is an example, where I replace backticks with single quotes (') to make editing with the StackOverflow markdown sane:
'''python
print("""
'''")
// this is all still a code block
'''
The obvious solution is to just use take_until("'''"). However, that will end the take early, since it just searches for the first occurrence of ''', which is not accurate. I need the termination condition to be tuple((tag(code_end), space0, newline)).
The next obvious solution is to use regular expressions as the pattern in take_until... but I would prefer to avoid that. Is there any prebuilt parser (or available in another crate) that will take all until a parser returns Ok?
use nom::IResult;
use nom::combinator::opt;
use nom::sequence::{terminated, tuple};
use nom::bytes::complete::{tag, take_until};
use nom::character::complete::{newline, space0, alpha1};
fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let (input, contents) = terminated(take_until("'''"), terminator)(input)?;
    Ok((input, contents))
}
fn main() {
    let test = &b"'''python
print(\"\"\"
'''\"\"\"
// this is all still a code block
'''
";
    assert!(code(&test[..]).is_ok());
}
The above assertion will fail. However, if you remove the line with the three single quotes (''') inside the block, it will pass. This is because of the difference between terminator and take_until("'''"). What is my best pattern for solving this problem?
Thanks for any help. I have a feeling I'm missing something obvious or just doing something wrong. Let me know if anything isn't clear.
Here is a link to the above example in the Rust Playground for convenience: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d5459edded1e4258ba3e034658ea4acf
I think the proper combinator would be many_till:
Applies the parser f until the parser g produces a result.
That combined with anychar will return a Vec<char> for your code block.
I think there is no anybyte in nom, but you can easily write it yourself if you prefer to get Vec<u8>.
Or, if you want to avoid allocating and want a slice referencing the original slice, and don't mind a bit of unsafe, you can ignore the consumed characters and compute the slice from the start and end pointers:
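use nom::IResult;
use nom::bytes::complete::tag;
use nom::character::complete::{alpha1, anychar, newline, space0};
use nom::combinator::{map, opt};
use nom::multi::many_till;
use nom::sequence::tuple;

fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let start = input;
    // Skip one character at a time until the terminator matches; `end` is the matched "'''".
    let (input, (_, (end, _, _))) = many_till(map(anychar, drop), terminator)(input)?;
    // The distance from `start` to `end` is the length of the block's contents.
    let len = unsafe { end.as_ptr().offset_from(start.as_ptr()) as usize };
    Ok((input, &start[..len]))
}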
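If the fence only needs to be recognised at the start of a line, as the question describes, the same "take everything until the terminator matches" idea can be sketched with safe, std-only code by tracking byte offsets. This is a sketch under assumptions, not a nom parser: `is_terminator` and `take_until_terminator` are hypothetical helper names, the input is assumed to be a `&str` beginning just after the opening fence line, and (unlike `many_till`, which tries the terminator at every position) it only checks for the terminator at line starts.

```rust
/// True if `line` (without its trailing '\n') is a closing fence: three quotes
/// followed only by spaces or tabs, mirroring `tuple((tag("'''"), space0, newline))`.
fn is_terminator(line: &str) -> bool {
    match line.strip_prefix("'''") {
        Some(rest) => rest.chars().all(|c| c == ' ' || c == '\t'),
        None => false,
    }
}

/// Hypothetical helper: returns `(remaining, contents)`, where `contents`
/// borrows from `body` and ends just before the closing fence line.
fn take_until_terminator(body: &str) -> Option<(&str, &str)> {
    let mut offset = 0;
    // `split_inclusive` keeps the '\n', so summing line lengths yields byte offsets.
    for line in body.split_inclusive('\n') {
        if let Some(stripped) = line.strip_suffix('\n') {
            if is_terminator(stripped) {
                let contents = &body[..offset];
                let remaining = &body[offset + line.len()..];
                return Some((remaining, contents));
            }
        }
        offset += line.len();
    }
    None
}
```

Because the result is sliced out of `body` by offset, no allocation and no pointer arithmetic is needed.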
I'm trying to learn NOM for a project in Rust. I have a text file that consists of:
[tag="#43674"]char[/tag] with multiple tags back to back on each line. I'm trying to pull the '#43674' and 'char', store them in a tuple (x, y) and then push those into a vector Vec<(x, y)> for each line of the text file.
So far I have successfully combined parsers into two functions, one for the '#43674' and one for 'char', which I then combine to return IResult<&str, (String, String)>. Here is the code:
fn color_tag(i: &str) -> IResult<&str, &str> {
    delimited(tag("[color="), take_until("]"), tag("]"))(i)
}
fn char_take(i: &str) -> IResult<&str, &str> {
    terminated(take_until("[/color]"), tag("[/color]"))(i)
}
pub fn color_char(i: &str) -> IResult<&str, (String, String)> {
    let (i, color) = color_tag(i)?;
    let (i, chara) = char_take(i)?;
    let colors = color.to_string();
    let charas = chara.to_string();
    let tuple = (colors, charas);
    Ok((i, tuple))
}
How can I iterate this function over a given line of the text file? I already have a function that iterates the text file into lines, but I need color_char to repeat for each closure in that line. Am I missing the point entirely?
You'll probably want to use the nom::multi::many0 combinator to match a parser multiple times, and you can also use the nom::sequence::tuple combinator to combine your color_tag and char_take parsers:
// Match color_tag followed by char_take
fn color_char(i: &str) -> IResult<&str, (&str, &str)> {
    tuple((color_tag, char_take))(i)
}
// Match 0 or more times on the same line
// (the element type must match what color_char returns: borrowed &str pairs)
fn color_char_multiple(i: &str) -> IResult<&str, Vec<(&str, &str)>> {
    many0(color_char)(i)
}
To parse multiple lines, you can modify color_char() to match a trailing newline character with one of the character parsers provided by nom, such as nom::character::complete::line_ending, make it optional using nom::combinator::opt, and combine it with something like nom::sequence::terminated:
terminated(tuple((color_tag, char_take)), opt(line_ending))(i)
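For readers who want to see what "apply a parser repeatedly and collect the results" amounts to, here is a std-only sketch of the idea behind many0. It is not nom code: both functions are hypothetical stand-ins that return `Option` instead of `IResult`, and the input format `[color=...]text[/color]` is taken from the parsers in the question.

```rust
/// Hypothetical std-only stand-in for one `[color=..]text[/color]` element.
/// Returns `(rest, (color, text))`, or `None` if the input does not match.
fn color_char(input: &str) -> Option<(&str, (&str, &str))> {
    let after_open = input.strip_prefix("[color=")?;
    let (color, after_color) = after_open.split_once(']')?;
    let (text, rest) = after_color.split_once("[/color]")?;
    Some((rest, (color, text)))
}

/// Applies `color_char` until it fails, collecting every match.
/// Like `many0`, zero matches is still a success (an empty Vec).
fn color_char_multiple(mut input: &str) -> (&str, Vec<(&str, &str)>) {
    let mut items = Vec::new();
    while let Some((rest, item)) = color_char(input) {
        items.push(item);
        input = rest;
    }
    (input, items)
}
```

Note that the color value keeps its surrounding double quotes, just as take_until("]") would in the nom version; stripping them would be a separate step.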
I'm looking at the nom crate for rust, which contains lots of functions to parse bytes/characters.
Many of the functions, such as tag(), seen below, process input that's provided not as a parameter to the function, but that appears instead in a second set of parentheses, following what I would call the parameters. If, in examples, one looks for a needle in a haystack, then the tag() function uses a parameter of its own, which is how one specifies the needle, but the haystack is specified separately, after the parameter parentheses, inside parentheses of its own (perhaps because it's a single value tuple?).
use nom::IResult;
use nom::bytes::complete::tag;
fn parser(s: &str) -> IResult<&str, &str> {
    tag("Hello")(s)
}
In the example above, tag()'s job is to test whether the input s starts with Hello. You can call parser, passing in "Hello everybody!", and the tag() function does indeed verify that the start of s is Hello. But how did (s) find its way into tag()?
Can someone explain this syntax to me, or show where to read about it. It works, and I can use it, but I don't understand what I'm looking at!
thanks
The return value of tag() is impl Fn(Input) -> IResult<Input, Input, Error>, i.e. the function returns another function. The first set of parentheses is for calling tag(); the second set is for calling the function it returns.
This allows you to store the "parser" returned by these functions in a variable and use it multiple times. Or, put differently, instead of the function definition in your question you could also write
let parser = tag("Hello");
and then call parser the same way you would call the function.
tag("Hello") just returns a function, which is then immediately invoked with the argument s, i.e. tag("Hello")(s). Here's a simple implementation example:
fn tag<'a>(needle: &'a str) -> impl Fn(&str) -> bool + 'a {
    move |haystack: &str| haystack.starts_with(needle)
}
fn parser(s: &str) -> bool {
    tag("Hello")(s)
}
fn main() {
    println!("{}", parser("Hello everybody!")); // true
    println!("{}", parser("Bye everybody!")); // false
}
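A slightly closer imitation of nom's shape returns the remaining input and the matched prefix instead of a bool. The sketch below is a simplification and an assumption on my part: it uses `Option` where nom uses `IResult`, and it is not nom's actual signature. It mainly shows that the returned closure can be stored in a variable and called more than once.

```rust
/// Simplified stand-in for nom's `tag`: returns a closure that, on success,
/// yields `(remaining_input, matched_prefix)`. Not nom's real signature.
fn tag<'a>(needle: &'a str) -> impl Fn(&str) -> Option<(&str, &str)> + 'a {
    move |haystack: &str| {
        // `strip_prefix` returns the rest of the input if it starts with `needle`.
        let rest = haystack.strip_prefix(needle)?;
        Some((rest, &haystack[..needle.len()]))
    }
}
```

With `let parser = tag("Hello");`, both `parser("Hello everybody!")` and `parser("Bye everybody!")` reuse the same closure; `tag("Hello")(s)` simply skips the intermediate variable.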
In trim_right_matches I can pass a string value:
println!("{}", "[(foo)]".trim_right_matches(")]"));
// [(foo
I cannot, however, use a string value in trim_matches:
println!("{}", "[(foo)]".trim_matches("[()]"));
Because I get the following error:
error[E0277]: the trait bound `std::str::pattern::StrSearcher<'_, '_>: std::str::pattern::DoubleEndedSearcher<'_>` is not satisfied
 --> test.rs:2:27
  |
2 | println!("{}", "[(foo)]".trim_matches("[()]"));
  |                          ^^^^^^^^^^^^ the trait `std::str::pattern::DoubleEndedSearcher<'_>` is not implemented for `std::str::pattern::StrSearcher<'_, '_>`

error: aborting due to previous error
The following code works:
println!("{}", "[(foo)]".trim_matches(&['(', '[', ']', ')'] as &[_]));
// foo
However, it is long and not as easy to read as a single string value; I want to be able to use a string value like with trim_right_matches.
These two functions have similar signatures, but if you look closer you'll notice that they place different bounds on the pattern's searcher:
trim_right_matches:
pub fn trim_right_matches<'a, P>(&'a self, pat: P) -> &'a str
where
    P: Pattern<'a>,
    <P as Pattern<'a>>::Searcher: ReverseSearcher<'a> // ReverseSearcher
trim_matches:
pub fn trim_matches<'a, P>(&'a self, pat: P) -> &'a str
where
    P: Pattern<'a>,
    <P as Pattern<'a>>::Searcher: DoubleEndedSearcher<'a> // DoubleEndedSearcher
In the docs for DoubleEndedSearcher you can find the explanation of why the searcher for &str is not a DoubleEndedSearcher:
(&str)::Searcher is not a DoubleEndedSearcher because the pattern "aa"
in the haystack "aaa" matches as either "[aa]a" or "a[aa]", depending
from which side it is searched.
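The ambiguity described in that passage can be observed directly with the forward and reverse search methods on str. The two helper names below are made up for this illustration; `match_indices` and `rmatch_indices` are the standard library methods doing the work.

```rust
/// Byte offsets at which `pat` matches, scanning from the front.
fn forward_matches(haystack: &str, pat: &str) -> Vec<usize> {
    haystack.match_indices(pat).map(|(i, _)| i).collect()
}

/// Byte offsets at which `pat` matches, scanning from the back.
fn backward_matches(haystack: &str, pat: &str) -> Vec<usize> {
    haystack.rmatch_indices(pat).map(|(i, _)| i).collect()
}
```

For the haystack "aaa" and the pattern "aa", the forward search reports offset 0 and the backward search reports offset 1, so the two directions disagree about where the match is. A single-character pattern such as "a" has no such problem: both directions find the same three positions, only in opposite order.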
As for why your workaround works:
"[(foo)]".trim_matches(&['(', '[', ']', ')'] as &[_])
It's because the pattern is not a &str but a &[char] (a slice of characters rather than a string slice), and the searcher for &[char] does implement DoubleEndedSearcher.
The first bit of code doesn't do what you think. It trims exactly the string ")]" from the end, but it would not modify the string "([foo])". In other words, passing a string to the trim functions means "trim exactly this string", not "trim all of the characters occurring in this string". The code that doesn't compile wouldn't do what you want, because it would only trim away the exact string "[()]", and this doesn't occur in your examples.
Passing an array of chars instead trims all of the characters individually, no matter what order.
So you want the array of chars, even though passing a string looks so much more convenient.
As for why the code you wrote doesn't compile, ljedrz answered that part.
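If the goal is to keep the convenience of writing the character set as a string, one option is a closure pattern: a `char -> bool` closure can be searched from both ends, so trim_matches accepts it. The helper name `trim_set` below is hypothetical, not a standard library method. As a side note, trim_right_matches has since been deprecated in favor of trim_end_matches, which behaves the same way.

```rust
/// Trims any of the characters in `set` from both ends of `s`.
/// `trim_set` is a hypothetical helper name, not a std method.
fn trim_set<'a>(s: &'a str, set: &str) -> &'a str {
    // The closure is asked about one character at a time, so order in `set` is irrelevant.
    s.trim_matches(|c: char| set.contains(c))
}
```

With this helper, `trim_set("[(foo)]", "[()]")` and `trim_set("([foo])", "[()]")` both yield "foo", whereas trimming the exact string ")]" from the end only affects inputs that literally end in ")]".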
I am experimenting with Rust by implementing a small F# snippet of mine.
I am at the point where I want to destructure a string of characters. Here is the F#:
let rec internalCheck acc = function
| w :: tail when Char.IsWhiteSpace(w) ->
internalCheck acc tail
| other
| matches
| here
..which can be called like this: internalCheck [] "String here" where the :: operator signifies the right hand side is the "rest of the list".
So I checked the Rust documentation and there are examples for destructuring vectors like this:
let v = vec![1,2,3];
match v {
[] => ...
[first, second, ..rest] => ...
}
..etc. However this is now behind the slice_patterns feature gate. I tried something similar to this:
match input.chars() {
[w, ..] => ...
}
Which informed me that feature gates require non-stable releases to use.
So I downloaded multirust and installed the latest nightly I could find (2016-01-05) and when I finally got the slice_patterns feature working ... I ran into endless errors regarding syntax and "rest" (in the above example) not being allowed.
So, is there an equivalent way to destructure a string of characters, utilizing ::-like functionality ... in Rust? Basically I want to match 1 character with a guard and use "everything else" in the expression that follows.
It is perfectly acceptable if the answer is "No, there isn't". I certainly cannot find many examples of this sort online anywhere and the slice pattern matching doesn't seem to be high on the feature list.
(I will happily delete this question if there is something I missed in the Rust documentation)
You can use the pattern matching with a byte slice:
#![feature(slice_patterns)]
fn internal_check(acc: &[u8]) -> bool {
    match acc {
        &[b'-', ref tail..] => internal_check(tail),
        &[ch, ref tail..] if (ch as char).is_whitespace() => internal_check(tail),
        &[] => true,
        _ => false,
    }
}
fn main() {
    for s in ["foo", "bar", " ", " - "].iter() {
        println!("text '{}', checks? {}", s, internal_check(s.as_bytes()));
    }
}
You can use it with a char slice (where char is a Unicode Scalar Value):
#![feature(slice_patterns)]
fn internal_check(acc: &[char]) -> bool {
    match acc {
        &['-', ref tail..] => internal_check(tail),
        &[ch, ref tail..] if ch.is_whitespace() => internal_check(tail),
        &[] => true,
        _ => false,
    }
}
fn main() {
    for s in ["foo", "bar", " ", " - "].iter() {
        println!("text '{}', checks? {}",
                 s, internal_check(&s.chars().collect::<Vec<char>>()));
    }
}
But as of now it doesn't work with a &str (producing E0308). I think that is for the best, since &str is neither here nor there: it's a byte slice under the hood, but Rust guarantees that it is valid UTF-8 and tries to remind you to work with &str in terms of Unicode sequences and characters rather than bytes. So to efficiently match on the &str we have to explicitly use the as_bytes method, essentially telling Rust that "we know what we're doing".
That's my reading, anyway. If you want to dig deeper and into the source code of the Rust compiler you might start with issue 1844 and browse the commits and issues linked there.
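For readers on a current toolchain: slice patterns were stabilized in Rust 1.42 with a different syntax for the "rest" binding (`tail @ ..` instead of `ref tail..`), so the feature gate is no longer needed. The following is a sketch of the byte-slice version above ported to that syntax; the logic is unchanged.

```rust
// Compiles on stable Rust 1.42 or later; no feature gate required.
fn internal_check(acc: &[u8]) -> bool {
    match acc {
        // `tail @ ..` binds the rest of the slice, much like `:: tail` in F#.
        [b'-', tail @ ..] => internal_check(tail),
        // Matching through the reference makes `ch` a `&u8`, hence the dereference.
        [ch, tail @ ..] if (*ch as char).is_whitespace() => internal_check(tail),
        [] => true,
        _ => false,
    }
}
```

It is still called on bytes, for example `internal_check(" - ".as_bytes())`; matching directly on a &str with slice patterns remains unsupported.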
Basically I want to match 1 character with a guard and use "everything
else" in the expression that follows.
If you only want to match on a single character then using the chars iterator to get the characters and matching on the character itself might be better than converting the entire UTF-8 &str into a &[char] slice. For instance, with the chars iterator you don't have to allocate the memory for the characters array.
fn internal_check(acc: &str) -> bool {
    for ch in acc.chars() {
        match ch {
            '-' => (),
            ch if ch.is_whitespace() => (),
            _ => return false,
        }
    }
    return true;
}
fn main() {
    for s in ["foo", "bar", " ", " - "].iter() {
        println!("text '{}', checks? {}", s, internal_check(s));
    }
}
You can also use the chars iterator to split the &str on the Unicode Scalar Value boundary:
fn internal_check(acc: &str) -> bool {
    let mut chars = acc.chars();
    match chars.next() {
        Some('-') => internal_check(chars.as_str()),
        Some(ch) if ch.is_whitespace() => internal_check(chars.as_str()),
        None => true,
        _ => false,
    }
}
fn main() {
    for s in ["foo", "bar", " ", " - "].iter() {
        println!("text '{}', checks? {}", s, internal_check(s));
    }
}
But keep in mind that as of now Rust provides no guarantee that this tail-recursive function will be optimized into a loop. (Tail call optimization would be a welcome addition to the language, but it hasn't been implemented so far due to LLVM-related difficulties.)
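When recursion depth is a concern, the same check can be written without recursion at all. This is a minimal sketch using the iterator adapter `all`; the name `internal_check_iter` is chosen here only to distinguish it from the recursive version.

```rust
/// Iterative equivalent of the recursive check: true if every character is
/// either '-' or whitespace (an empty string also passes).
fn internal_check_iter(acc: &str) -> bool {
    acc.chars().all(|ch| ch == '-' || ch.is_whitespace())
}
```

It stops at the first character that fails the test, just as the recursive version returns false at the first mismatch.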
I don't believe so. Slice patterns aren't likely to be amenable to this, either, since the "and the rest" part of the pattern goes inside the array pattern, which would imply some way of putting said pattern inside a string, which implies an escaping mechanism that doesn't exist.
In addition, Rust doesn't have a proper "concatenation" operator, and the operators it does have can't participate in destructuring. So, I wouldn't hold your breath on this one.
Just going to post this here... it seems to do what I want. As a simple test, this will just print every character in a string but print Found a whitespace character when it finds a whitespace character. It does this recursively by destructuring a vector of bytes. I must give a shout out to @ArtemGr, who gave me the inspiration to look at working with bytes to see if that fixed the compiler issues I was having with chars.
There are no doubt memory issues I am unaware of as yet here (copying/allocations, etc; especially around the String instances)... but I'll work on those as I dig deeper into the inner workings of Rust. It's also probably much more verbose than it needs to be; this is just where I got to after a little tinkering.
#![feature(slice_patterns)]
use std::iter::FromIterator;
use std::vec::Vec;
fn main() {
    process("Hello world!".to_string());
}
fn process(input: String) {
    match input.as_bytes() {
        &[c, ref _rest..] if (c as char).is_whitespace() => {
            println!("Found a whitespace character");
            process(string_from_rest(_rest))
        },
        &[c, ref _rest..] => {
            println!("{}", c as char);
            process(string_from_rest(_rest))
        },
        _ => ()
    }
}
fn string_from_rest(rest: &[u8]) -> String {
    String::from_utf8(Vec::from_iter(rest.iter().cloned())).unwrap()
}
Output:
H
e
l
l
o
Found a whitespace character
w
o
r
l
d
!
Obviously, as it's testing against individual bytes (and only considering possible UTF-8 characters when rebuilding the string), it's not going to work with wide characters. My actual use case only requires characters in the ASCII space, so this is sufficient for now.
I guess, to work on wider characters the Rust pattern matching would require the ability to type coerce (which I don't believe you can do currently?), since a Chars<'T> iterator seems to be inferred as &[_]. That could just be my immaturity with the Rust language though during my other attempts.
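One way to handle wider characters without slice patterns or intermediate String allocations is to split the &str at the first character boundary, which gives something close to F#'s `head :: tail`. The sketch below is an assumption-laden illustration: `uncons` and `describe` are hypothetical names, and `describe` collects its output into a Vec instead of printing so that it is easy to inspect.

```rust
/// Splits a string into its first character and the remainder, like `head :: tail`.
/// `uncons` is a hypothetical helper name.
fn uncons(s: &str) -> Option<(char, &str)> {
    let mut chars = s.chars();
    let head = chars.next()?;
    // `as_str` returns the not-yet-consumed part of the original string slice.
    Some((head, chars.as_str()))
}

/// Char-aware variant of `process` above; returns the lines it would print.
fn describe(input: &str, mut acc: Vec<String>) -> Vec<String> {
    match uncons(input) {
        Some((c, rest)) if c.is_whitespace() => {
            acc.push("Found a whitespace character".to_string());
            describe(rest, acc)
        }
        Some((c, rest)) => {
            acc.push(c.to_string());
            describe(rest, acc)
        }
        None => acc,
    }
}
```

Since the split always lands on a character boundary, multi-byte characters such as 'é' stay intact, and the remainder is a borrowed slice of the input.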