I'm using the latest (v7) nom crate.
I'm trying to build a parser capable of extracting code blocks from markdown. In the flavor of markdown I need to support, a code block only ends when three backtick (grave accent) characters appear on a line by themselves, optionally followed by whitespace.
Here is an example, where I replace backticks with single quotes (') to make editing with the StackOverflow markdown sane:
'''python
print("""
'''")
// this is all still a code block
'''
The obvious solution is to just use take_until("'''"). However, that ends the take early, because it simply searches for the first occurrence of ''', which is not accurate. I need the termination condition to be tuple((tag(code_end), space0, newline)).
The next obvious solution is to use regular expressions as the pattern in take_until, but I would prefer to avoid that. Is there any prebuilt parser (in nom or in another crate) that will take everything until a given parser returns Ok?
use nom::IResult;
use nom::combinator::opt;
use nom::sequence::{terminated, tuple};
use nom::bytes::complete::{tag, take_until};
use nom::character::complete::{newline, space0, alpha1};
fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let (input, contents) = terminated(take_until("'''"), terminator)(input)?;
    Ok((input, contents))
}
fn main() {
    let test = &b"'''python
print(\"\"\"
'''\"\"\"
// this is all still a code block
'''
";
    assert!(code(&test[..]).is_ok());
}
The above assertion fails. However, if you remove the line inside the block that begins with three single quotes ('''), it passes. This is because of the difference between what terminator accepts and where take_until("'''") stops. What is the best pattern for solving this problem?
Thanks for any help. I have a feeling I'm missing something obvious or just doing something wrong. Let me know if anything isn't clear.
Here is a link to the above example in the Rust Playground for convenience: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=d5459edded1e4258ba3e034658ea4acf
I think the proper combinator would be many_till:
Applies the parser f until the parser g produces a result.
That combined with anychar will return a Vec<char> for your code block.
I think there is no anybyte in nom, but you can easily write it yourself if you prefer to get Vec<u8>.
Or, if you want to avoid allocating and want a slice referencing the original input, and don't mind a bit of unsafe, you can ignore the consumed characters and compute the slice from the start and end pointers (playground):
use nom::character::complete::anychar;
use nom::combinator::map;
use nom::multi::many_till;
fn code(i: &[u8]) -> IResult<&[u8], &[u8]> {
    let (input, _) = tuple((tag("'''"), opt(alpha1), tag("\n")))(i)?;
    let terminator = tuple((tag("'''"), space0, newline));
    let start = input;
    // Discard each consumed char; `end` is the slice matched by the closing tag.
    let (input, (_, (end, _, _))) = many_till(map(anychar, drop), terminator)(input)?;
    let len = unsafe { end.as_ptr().offset_from(start.as_ptr()) as usize };
    Ok((input, &start[..len]))
}
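To make the termination rule from the question concrete, here is a minimal sketch that does not use nom at all; it is a plain standard-library scan, not the many_till approach above. The function name code_block is hypothetical, and the sketch assumes the closing ''' must start a line, which is slightly stricter than trying the terminator at every position.

```rust
/// Returns (contents, rest) for a block opened by ''' plus an optional
/// alphabetic info string, and closed by a line made of ''' followed only
/// by spaces or tabs. `code_block` is a hypothetical helper name.
fn code_block(input: &str) -> Option<(&str, &str)> {
    let after_open = input.strip_prefix("'''")?;
    // The opening line may carry a language tag such as "python".
    let (info, body) = after_open.split_once('\n')?;
    if !info.chars().all(|c| c.is_ascii_alphabetic()) {
        return None;
    }
    // Walk the body line by line, tracking the byte offset of each line start.
    let mut offset = 0;
    for line in body.split_inclusive('\n') {
        if let Some(tail) = line.strip_prefix("'''") {
            if let Some(ws) = tail.strip_suffix('\n') {
                if ws.chars().all(|c| c == ' ' || c == '\t') {
                    let rest = &body[offset + line.len()..];
                    return Some((&body[..offset], rest));
                }
            }
        }
        offset += line.len();
    }
    // No terminator line was found.
    None
}
```

A line such as '''""") inside the block is skipped because something other than whitespace follows the quotes.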
I need to match a few words at the start of a string, handle them, then remove them. How should I remove a few chars or bytes at the end of a String?
I am using the regex crate to match the string, but I can't find a way to remove chars at the end of the String.
Maybe something like this, but the input may contain non-ASCII chars:
use lazy_static::lazy_static;
use regex::Regex;
fn func(s: &mut String) {
    lazy_static! {
        static ref RE: Regex = Regex::new(r"123").unwrap();
    }
    let cap = match RE.captures(s.as_str()) {
        Some(v) => v.get(0).unwrap(),
        None => panic!("Error"),
    };
    do_something(cap.as_str());
    // `delete` is the hypothetical method I am looking for; String has no such method.
    s.delete(0, cap.end());
}
fn do_something(s: &str) {
    assert_eq!(s, "123")
}
fn main() {
    let mut s = String::from("123456");
    func(&mut s);
    assert_eq!(s, "456");
}
I have seen the remove method, but the documentation says it's O(n). If so, I think O(nm) is a little too slow for me.
You can use the regex crate's Match::start to get the start of the match.
You can then use truncate to get rid of everything after that.
fn main() {
    let mut text: String = "this is a text with some garbage after!abc".into();
    let re = regex::Regex::new("abc$").unwrap();
    let m = re.captures(&text).unwrap();
    let g = m.get(0).unwrap();
    text.truncate(g.start());
    dbg!(text);
}
What you're looking for is truncate - except with non-ascii support.
For ascii only, this works:
let mut s = String::from("123456789");
s.truncate(s.len() - 3);
assert_eq!(s, "123456");
However, since a String can contain Unicode characters that are not always 1 byte long, this doesn't work for non-ASCII text (truncate panics if the new length does not lie on a char boundary).
If you want non-ASCII support, there isn't an O(1) solution according to this answer. That answer does give an implementation using char_indices(); I think it's the best way unless I'm missing something.
There is also the unicode-truncate crate, which also seems to use char_indices(), so it might be worth a look.
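As a minimal sketch of the char_indices() idea, assuming you want to drop the last n characters (code points, not grapheme clusters): the helper name truncate_last_chars is hypothetical. It walks backwards from the end of the string, so the cost depends on how many characters are removed rather than on the full string length.

```rust
/// Removes the last `n` chars (Unicode scalar values) from `s`.
/// `truncate_last_chars` is a hypothetical helper name.
fn truncate_last_chars(s: &mut String, n: usize) {
    if n == 0 {
        return;
    }
    // Iterate from the back; the n-th item is the first char to remove,
    // and its byte index is always a valid char boundary.
    match s.char_indices().rev().nth(n - 1) {
        Some((idx, _)) => s.truncate(idx),
        // Fewer than `n` chars in total: remove everything.
        None => s.clear(),
    }
}
```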
I have a String, and I want to make a new String, with every character in the first one doubled. So "abc" would become "aabbcc" and so on.
The best I've come up with is:
let mut result = String::new();
for c in original_string.chars() {
    result.push(c);
    result.push(c);
}
result
This works fine, but is there a more succinct (or more idiomatic) way to do this?
In JavaScript I would probably write something like:
original.split('').map(c => c+c).join('')
Or in Ruby:
(original.chars.map { |c| c+c }).join('')
Since Rust also has functional elements, I was wondering if there is a similarly succinct solution.
I would use std::iter::repeat to repeat every char value from the input. This creates an infinite iterator, but for your case we only need to iterate 2 times, so we can use take to limit our iterator, then flatten all the iterators that hold the doubled chars.
use std::iter;
fn main() {
    let input = "abc"; //"abc".to_string();
    let output = input
        .chars()
        .flat_map(|c| iter::repeat(c).take(2))
        .collect::<String>();
    println!("{:?}", output);
}
Note: to double each character we use take(2), but you can pass any usize to change the repetition count.
Personally, I would just do exactly what you're doing. Its intent is clear (more clear than the functional approaches you presented from JavaScript or Ruby, in my opinion) and it is efficient. The only thing I would change is perhaps reserve space for the characters, since you know exactly how much space you will need.
let mut result = String::with_capacity(original_string.len() * 2);
However, if you are really in love with this style, you could use flat_map:
let result: String = original_string.chars()
    .flat_map(|c| std::iter::repeat(c).take(2))
    .collect();
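For reference, here is a small sketch that combines the original loop with the pre-allocation suggestion. The function name double_chars is hypothetical. Doubling every character doubles the byte length exactly, so len() * 2 is the exact capacity needed.

```rust
/// Returns a new String with every char of `original` repeated twice.
/// `double_chars` is a hypothetical helper name.
fn double_chars(original: &str) -> String {
    // `len()` is in bytes, and each char is pushed twice, so this is exact.
    let mut result = String::with_capacity(original.len() * 2);
    for c in original.chars() {
        result.push(c);
        result.push(c);
    }
    result
}
```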
I want to split a String that I read as input on the whitespace in it.
I have used the split_whitespace() function, but when I use it on custom input it just gives me the first string slice.
let s: String = read!();
let mut i: usize = 0;
for token in s.split_whitespace() {
    println!("token {} {}", i, token);
    i += 1;
}
What am I missing?
As far as I know, read! is not a standard macro. A quick search reveals that it is probably from the text_io crate (if you are using external crates, you should say so in the question).
From the docs in that crate:
The read!() macro will always read until the next ascii whitespace character (\n, \r, \t or space).
So what you are seeing is by design.
If you want to read a whole line from stdin, you can try the standard function std::io::Stdin::read_line.
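A minimal sketch of the read_line approach follows. The helper name read_tokens is hypothetical, and it is written against any BufRead so it can be exercised without a terminal.

```rust
use std::io::{self, BufRead};

/// Reads one line from `reader` and splits it on whitespace.
/// `read_tokens` is a hypothetical helper name.
fn read_tokens<R: BufRead>(reader: &mut R) -> io::Result<Vec<String>> {
    let mut line = String::new();
    // read_line keeps the trailing newline; split_whitespace ignores it.
    reader.read_line(&mut line)?;
    Ok(line.split_whitespace().map(str::to_owned).collect())
}
```

With real input you would pass a locked stdin, for example read_tokens(&mut io::stdin().lock()).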
You are missing test cases which could locate the source of the problem. Split the code into a function and replace the read!()-macro with a test case, which you could put in main for now, where you provide different strings to the function and observe the output.
fn strsplit(s: String) {
    let mut i: usize = 0;
    for token in s.split_whitespace() {
        println!("token {} {}", i, token);
        i += 1;
    }
}
fn main() {
    println!("Hello, world!");
    strsplit("Hello Huge World".to_string());
}
Then you will see that your code works as it should. As noted in the other answers, though, the read!() macro only returns the string up to the first whitespace, so you should probably use another way of reading your input.
I've created a few non-trivial parsers in nom, so I'm pretty familiar with it at this point. All the parsers I've created until now always provide the entire input slice to the parser.
I'd like to create a streaming parser, which I assume means that I can continue to feed bytes into the parser until it is complete. I've had a hard time finding any documentation or examples that illustrate this, and I also question my assumption of what a "streaming parser" is.
My questions are:
Is my understanding of what a streaming parser is correct?
If so, are there any good examples of a parser using this technique?
nom parsers neither maintain a buffer to feed more data into, nor do they maintain "state" where they previously needed more bytes.
But if you take a look at the IResult structure you see that you can return a partial result or indicate that you need more data.
There seem to be some structures provided to handle streaming: I think you are supposed to create a Consumer from a parser using the consumer_from_parser! macro, implement a Producer for your data source, and call run until it returns None (and start again when you have more data). Examples and docs seem to be mostly missing so far - see bottom of https://github.com/Geal/nom :)
Also it looks like most functions and macros in nom are not documented well (or at all) regarding their behavior when hitting the end of the input. For example take_until! returns Incomplete if the input isn't long enough to contain the substr to look for, but returns an error if the input is long enough but doesn't contain substr.
Also nom mostly uses either &[u8] or &str for input; you can't signal an actual "end of stream" through these types. You could implement your own input type (related traits: nom::{AsBytes,Compare,FindSubstring,FindToken,InputIter,InputLength,InputTake,Offset,ParseTo,Slice}) to add a "reached end of stream" flag, but the nom provided macros and functions won't be able to interpret it.
All in all I'd recommend splitting streamed input through some other means into chunks you can handle with simple non-streaming parsers (maybe even use synom instead of nom).
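To illustrate what "splitting streamed input into chunks" could look like, here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: the FrameBuffer type is hypothetical, and the wire format (one length byte followed by that many payload bytes) is invented. Each complete frame it returns could then be handed to an ordinary non-streaming parser.

```rust
/// Accumulates streamed bytes and yields complete frames.
/// The framing (1 length byte + payload) is a hypothetical format.
struct FrameBuffer {
    buf: Vec<u8>,
}

impl FrameBuffer {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Appends newly received bytes to the internal buffer.
    fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Returns the next complete frame, or None if more data is needed.
    fn next_frame(&mut self) -> Option<Vec<u8>> {
        let len = *self.buf.first()? as usize;
        if self.buf.len() < 1 + len {
            return None; // the frame is still incomplete
        }
        let frame = self.buf[1..1 + len].to_vec();
        // Drop the consumed length byte and payload from the buffer.
        self.buf.drain(..1 + len);
        Some(frame)
    }
}
```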
Here is a minimal working example. As @Stefan wrote, "I'd recommend splitting streamed input through some other means into chunks you can handle".
What somewhat works (and I'd be glad for suggestions on how to improve it) is to use the Read::bytes() method on a File, take only as many bytes as necessary, and pass them to nom::bytes::streaming::take.
let reader = file.bytes();
let buf = reader.take(length).collect::<B>()?;
let (_input, chunk) = take(length)(&*buf)...;
The complete function can look like this:
/// Parse the first handful of bytes and return the bytes interpreted as UTF8
fn parse_first_bytes(file: std::fs::File, length: usize) -> Result<String> {
    type B = std::result::Result<Vec<u8>, std::io::Error>;
    let reader = file.bytes();
    let buf = reader.take(length).collect::<B>()?;
    // Note: `finish()` panics on `Err::Incomplete`, which the streaming `take`
    // returns when the file holds fewer than `length` bytes.
    let (_input, chunk) = take(length)(&*buf)
        .finish()
        .map_err(|nom::error::Error { input: _, code: _ }| eyre!("..."))?;
    let s = String::from_utf8_lossy(chunk);
    Ok(s.to_string())
}
Here is the rest of main for an implementation similar to Unix' head command.
use color_eyre::Result;
use eyre::eyre;
use nom::{bytes::streaming::take, Finish};
use std::{fs::File, io::Read, path::PathBuf};
use structopt::StructOpt;
#[derive(Debug, StructOpt)]
#[structopt(about = "A minimal example of parsing a file only partially.
This implements the POSIX 'head' utility.")]
struct Args {
    /// Input File
    #[structopt(parse(from_os_str))]
    input: PathBuf,
    /// Number of bytes to consume
    #[structopt(short = "c", default_value = "32")]
    num_bytes: usize,
}
fn main() -> Result<()> {
    let args = Args::from_args();
    let file = File::open(args.input)?;
    let head = parse_first_bytes(file, args.num_bytes)?;
    println!("{}", head);
    Ok(())
}
I'd like to capitalize the first letter of a &str. It's a simple problem and I hope for a simple solution. Intuition tells me to do something like this:
let mut s = "foobar";
s[0] = s[0].to_uppercase();
But &strs can't be indexed like this. The only way I've been able to do it seems overly convoluted. I convert the &str to an iterator, convert the iterator to a vector, upper case the first item in the vector, which creates an iterator, which I index into, creating an Option, which I unwrap to give me the upper-cased first letter. Then I convert the vector into an iterator, which I convert into a String, which I convert to a &str.
let s1 = "foobar";
let mut v: Vec<char> = s1.chars().collect();
v[0] = v[0].to_uppercase().nth(0).unwrap();
let s2: String = v.into_iter().collect();
let s3 = &s2;
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Similar question
Why is it so convoluted?
Let's break it down, line-by-line
let s1 = "foobar";
We've created a literal string that is encoded in UTF-8. UTF-8 allows us to encode the 1,114,112 code points of Unicode in a manner that's pretty compact if you come from a region of the world that types in mostly characters found in ASCII, a standard created in 1963. UTF-8 is a variable length encoding, which means that a single code point might take from 1 to 4 bytes. The shorter encodings are reserved for ASCII, but many Kanji take 3 bytes in UTF-8.
let mut v: Vec<char> = s1.chars().collect();
This creates a vector of characters. A character is a 32-bit number that directly maps to a code point. If we started with ASCII-only text, we've quadrupled our memory requirements. If we had a bunch of characters from the astral plane, then maybe we haven't used that much more.
v[0] = v[0].to_uppercase().nth(0).unwrap();
This grabs the first code point and requests that it be converted to an uppercase variant. Unfortunately for those of us who grew up speaking English, there's not always a simple one-to-one mapping of a "small letter" to a "big letter". Side note: we call them upper- and lower-case because one box of letters was above the other box of letters back in the day.
This code would panic if to_uppercase yielded no characters. As far as I can tell that never happens: char::to_uppercase is documented to yield the same character when there is no uppercase mapping. It could also semantically fail when a code point has an uppercase variant that has multiple characters, such as the German ß. Note that ß may never actually be capitalized in The Real World; this is just the example I can always remember and search for. As of 2017-06-29, in fact, the official rules of German spelling have been updated so that both "ẞ" and "SS" are valid capitalizations!
let s2: String = v.into_iter().collect();
Here we convert the characters back into UTF-8 and require a new allocation to store them in, as the original variable was stored in constant memory so as to not take up memory at run time.
let s3 = &s2;
And now we take a reference to that String.
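The memory point in the breakdown above (UTF-8 text versus a Vec<char>) can be checked with a small sketch. The helper name storage_bytes is hypothetical, and it counts element storage only, ignoring the Vec's own bookkeeping.

```rust
use std::mem::size_of;

/// Returns (bytes as UTF-8, bytes as a Vec<char>'s elements) for `s`.
/// `storage_bytes` is a hypothetical helper name.
fn storage_bytes(s: &str) -> (usize, usize) {
    // `len()` counts UTF-8 bytes; every char occupies 4 bytes.
    let utf8 = s.len();
    let as_chars = s.chars().count() * size_of::<char>();
    (utf8, as_chars)
}
```

For "foobar" this gives 6 bytes against 24, while a string made only of four-byte code points comes out the same size either way.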
It's a simple problem
Unfortunately, this is not true. Perhaps we should endeavor to convert the world to Esperanto?
I presume char::to_uppercase already properly handles Unicode.
Yes, I certainly hope so. Unfortunately, Unicode isn't enough in all cases.
Thanks to huon for pointing out the Turkish I, where both the upper (İ) and lower case (i) versions have a dot. That is, there is no one proper capitalization of the letter i; it depends on the locale of the source text as well.
why the need for all data type conversions?
Because the data types you are working with are important when you are worried about correctness and performance. A char is 32-bits and a string is UTF-8 encoded. They are different things.
indexing could return a multi-byte, Unicode character
There may be some mismatched terminology here. A char is a multi-byte Unicode character.
Slicing a string is possible if you go byte-by-byte, but the standard library will panic if you are not on a character boundary.
One of the reasons that indexing a string to get a character was never implemented is because so many people misuse strings as arrays of ASCII characters. Indexing a string to set a character could never be efficient - you'd have to be able to replace 1-4 bytes with a value that is also 1-4 bytes, causing the rest of the string to bounce around quite a lot.
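As a short sketch of the character-boundary rule: the non-panicking str::get returns None where the indexing syntax would panic. The helper name byte_prefix is hypothetical.

```rust
/// Returns the first `n` bytes of `s` as a &str, or None when byte `n`
/// falls inside a multi-byte character or past the end of the string.
/// `byte_prefix` is a hypothetical helper name.
fn byte_prefix(s: &str, n: usize) -> Option<&str> {
    // `get` checks bounds and char boundaries instead of panicking.
    s.get(..n)
}
```

With an ASCII first letter, byte_prefix(s, 1) is that letter; with a two-byte first letter such as é, it is None, whereas &s[..1] would panic.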
to_uppercase could return an upper case character
As mentioned above, ß is a single character that, when capitalized, becomes two characters.
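A tiny sketch shows why to_uppercase returns an iterator rather than a single char. The helper name upper_of is hypothetical, and ß is written as the escape \u{df} only to keep the source ASCII.

```rust
/// Collects the uppercase mapping of a single char into a String.
/// `upper_of` is a hypothetical helper name.
fn upper_of(c: char) -> String {
    // The iterator may yield one char or several.
    c.to_uppercase().collect()
}
```

Characters without an uppercase mapping, such as digits, come back unchanged.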
Solutions
See also trentcl's answer which only uppercases ASCII characters.
Original
If I had to write the code, it'd look like:
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().chain(c).collect(),
    }
}
fn main() {
    println!("{}", some_kind_of_uppercase_first_letter("joe"));
    println!("{}", some_kind_of_uppercase_first_letter("jill"));
    println!("{}", some_kind_of_uppercase_first_letter("von Hagen"));
    println!("{}", some_kind_of_uppercase_first_letter("ß"));
}
But I'd probably search for uppercase or unicode on crates.io and let someone smarter than me handle it.
Improved
Speaking of "someone smarter than me", Veedrac points out that it's probably more efficient to convert the iterator back into a slice after the first capital codepoints are accessed. This allows for a memcpy of the rest of the bytes.
fn some_kind_of_uppercase_first_letter(s: &str) -> String {
    let mut c = s.chars();
    match c.next() {
        None => String::new(),
        Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
    }
}
Is there an easier way than this, and if so, what? If not, why is Rust designed this way?
Well, yes and no. Your code is, as the other answer pointed out, not correct, and will panic if you give it something like བོད་སྐད་ལ་. So doing this with Rust's standard library is even harder than you initially thought.
However, Rust is designed to encourage code reuse and make bringing in libraries easy. So the idiomatic way to capitalize a string is actually quite palatable:
extern crate inflector;
use inflector::Inflector;
let capitalized = "some string".to_title_case();
It's not especially convoluted if you are able to limit your input to ASCII-only strings.
Since Rust 1.23, str has a make_ascii_uppercase method (in older Rust versions, it was available through the AsciiExt trait). This means you can uppercase ASCII-only string slices with relative ease:
fn make_ascii_titlecase(s: &mut str) {
    if let Some(r) = s.get_mut(0..1) {
        r.make_ascii_uppercase();
    }
}
This will turn "taylor" into "Taylor", but it won't turn "édouard" into "Édouard". (playground)
Use with caution.
I did it this way:
fn str_cap(s: &str) -> String {
    format!("{}{}", (&s[..1].to_string()).to_uppercase(), &s[1..])
}
If it is not an ASCII string:
fn str_cap(s: &str) -> String {
    format!("{}{}", s.chars().next().unwrap().to_uppercase(),
        s.chars().skip(1).collect::<String>())
}
The OP's approach taken further:
replace the first character with its uppercase representation
let mut s = "foobar".to_string();
let r = s.remove(0).to_uppercase().to_string() + &s;
or
let r = format!("{}{s}", s.remove(0).to_uppercase());
println!("{r}");
This works with Unicode characters as well, e.g. "😎foobar".
If the first character is guaranteed to be ASCII, it can be changed to a capital letter in place:
let mut s = "foobar".to_string();
if !s.is_empty() {
    s[0..1].make_ascii_uppercase(); // Foobar
}
This panics with a non-ASCII character in the first position!
Since to_uppercase() returns a new String, you can simply append the remainder of the string, like so. Note that this panics on an empty string or when the first character is not a single byte.
This was tested in Rust 1.57+ but is likely to work in any version that supports slicing.
fn uppercase_first_letter(s: &str) -> String {
    s[0..1].to_uppercase() + &s[1..]
}
Here's a version that is a bit slower than @Shepmaster's improved version, but also more idiomatic:
fn capitalize_first(s: &str) -> String {
    let mut chars = s.chars();
    chars
        .next()
        .map(|first_letter| first_letter.to_uppercase())
        .into_iter()
        .flatten()
        .chain(chars)
        .collect()
}
This is how I solved the problem. Notice that I had to check whether self is ASCII before transforming it to uppercase; non-ASCII and empty strings are returned unchanged.
trait TitleCase {
    fn title(&self) -> String;
}
impl TitleCase for &str {
    fn title(&self) -> String {
        if !self.is_ascii() || self.is_empty() {
            return String::from(*self);
        }
        let (head, tail) = self.split_at(1);
        head.to_uppercase() + tail
    }
}
pub fn main() {
    println!("{}", "bruno".title());
    println!("{}", "b".title());
    println!("{}", "🦀".title());
    println!("{}", "ß".title());
    println!("{}", "".title());
    println!("{}", "བོད་སྐད་ལ".title());
}
Output
Bruno
B
🦀
ß
བོད་སྐད་ལ
Inspired by the get_mut examples, I wrote something like this:
fn make_capital(in_str : &str) -> String {
    let mut v = String::from(in_str);
    v.get_mut(0..1).map(|s| { s.make_ascii_uppercase(); &*s });
    v
}