Does ocamllex support non-greedy patterns? - ocamllex

It would be nice to be able to match everything inside a comment up to the end token "*/". Is this possible, without manually creating a string buffer?
Example:
| "/**" (_* as s) "*/" { DOCBLOCK_AS_STR s }

Related

How to match all whitespace? [duplicate]

Context: Rust has the match construct, which is really useful for writing a (possibly) exhaustive list of cases and their corresponding results. The problem is: how do I create a single case that covers a whole class of values?
Regarding my specific problem, I'm making a lexer which reads a string character-by-character and spits out tokens. Its main function looks like this:
(...)
fn new(input: &str) -> Lexer {
    let mut characters = input.chars();
    for c in characters {
        let mut token: Option<Token> = match c {
            '+' => Some(Token::Add),
            '-' => Some(Token::Minus),
            '*' => Some(Token::Mul),
            '/' => Some(Token::Div),
            'e' => Some(Token::EulersNum),
            'π' => Some(Token::Pi),
            '(' => Some(Token::LeftParen),
            ')' => Some(Token::RightParen),
            ' ' | '\t' | '\n' => continue, // Whitespace
            _ => None
        };
        if token == None {
            continue;
        }
    }
    todo!()
}
(...)
Now, the most important part, for the purposes of this question, is the one commented with 'Whitespace'. The problem with my handling of whitespace is that it may not correspond to how whitespace is actually defined in a given string format. Sure, I could handle all of the different kinds of ASCII whitespace, but what about Unicode? Making an exhaustive list of whitespace characters is not only cumbersome but also obfuscates the code. It should be left to the language, not to its users.
Is it possible to just match it with a 'Whitespace' expression, such as:
(...)
Whitespace => continue,
(...)
And if so, how do I do it?
You could use char::is_whitespace() in a match guard:
match c {
    '+' => Some(Token::Add),
    '-' => Some(Token::Minus),
    '*' => Some(Token::Mul),
    '/' => Some(Token::Div),
    c if c.is_whitespace() => Some(Token::Whitespace),
    _ => None,
};
Sure, I could handle all of the different kinds of ASCII whitespace, but what about Unicode?
Use a library that provides an is_whitespace() function for many string formats, or more complex matching capabilities, if you need that.
Otherwise, use functions like char::is_whitespace() if you just need to match against Unicode whitespace.
Making an exhaustive list of whitespace characters is not only cumbersome but also obfuscates the code. It should be left to the language, not to its users.
No: match (and pattern matching in general) is a general-purpose tool. Rust is not a language specialized for language or string processing, so adding dedicated "whitespace matching" support to pattern matching does not make sense.
Rust has basic ASCII, UTF-8, UTF-16, etc. support for practical reasons, but that's about it. Adding more complex machinery to the standard library is debatable.

Counting the vowels in a string solution

I recently wrote a method to count the vowels in a given string and was able to solve it fairly simply, but when my solution was compared against the best-practice solutions, this was the top one:
public class Vowels {
    public static int getCount(String str) {
        return str.replaceAll("(?i)[^aeiou]", "").length();
    }
}
...which is much more elegant than what I wrote, and I am trying to understand it. I don't get what exactly the "(?i)[^aeiou]" part is doing. I get that it is deleting all the characters that aren't vowels, but I don't understand what the operators are doing, or why they work inside quotes; shouldn't the program just see it as a string?
The first argument is a regular expression; Java's replaceAll treats it as a regex rather than a plain string, which is why the operators work inside quotes. (?i) makes the match case-insensitive, so the set [aeiou] also covers the capitals [AEIOU]. The ^ inside the brackets negates the set, so [^aeiou] matches every character that is not a vowel (regardless of case), and each such character is replaced with the empty string "".
(?i) - starts case-insensitive mode
(?-i) - turns off case-insensitive mode
[^...] - NOT ONE of these characters.
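For comparison, the same count can be computed without regular expressions by folding over the string. Here is a minimal OCaml sketch (the function name count_vowels is invented for illustration; String.fold_left requires OCaml 4.13 or later):

(* Count ASCII vowels case-insensitively by folding over the string. *)
let count_vowels (s : string) : int =
  String.fold_left
    (fun acc c ->
       match Char.lowercase_ascii c with
       | 'a' | 'e' | 'i' | 'o' | 'u' -> acc + 1
       | _ -> acc)
    0 s

For example, count_vowels "Education" evaluates to 5.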

Vim - change up to and including searched string

Assuming I have the following code:
bool myCopiedFunc() {
    Some code that I've written;
    The cursor is on this line; <<<<<<<<<<<<<<
    if (something) {
        bool aValue;
        some of this is inside braces;
        return aValue;
    }
    if (somethingElse) {
        this is also inside braces;
        bool anotherValue;
        {
            more braces;
        }
        return anotherValue;
    }
    return false;
}
I decide I want to rewrite the remainder of the function, from the line with the cursor on it.
To replace up to a char on the same line, I can use ct<char> e.g. ct;
To replace up to and including a char on the same line I can use cf<char> e.g. cf;
To replace up to a string across multiple lines, I can use c/<string> e.g. c/return false
To replace up to and including a string across multiple lines, I can use... ?? e.g. ??
I can't just search for a semicolon, as there are an unknown number of them between the cursor and the end of the function, and counting them would be slow.
I can't just search for a closing brace, as there are several blocks between the cursor and the end of the function, and counting all closing braces would be slow.
With the help of code highlighting, I can easily see that the unique string I can search for is return false.
Is there an elegant solution to delete or change up to and including a string pattern?
I've already looked at a couple of related questions.
Make Vim treat forward search as "up to and including" has an accepted answer which doesn't answer my question.
In my case, I settled for deleting up to the search string, then separately deleting up to the semicolon, but it felt inefficient, and like it would have been quicker to just reach for the mouse. #firstworldproblems
To replace up to and including a string across multiple lines, I can use... ?? e.g. ??
The / supports offsets.
In your case, you are gonna need the e offset, that is, c/foo/e.
You may want to know more details about "search offset":
:h offset
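Applied to the example above, and assuming return false occurs nowhere else between the cursor and the end of the function, that gives:

c/return false/e

which changes up to and including the final character of the match; d/return false/e deletes the same range without entering insert mode.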
If you want to replace up to the closing brace associated with your current scope, you have c]}.
If you're looking for the end of the function, even if that means crossing into an outer scope, you'll need a plugin when the function may not be indented at column 0, as is the case in C++, Java... See the related Q/A on vi.SE

Implementing String Interpolation in Flex/Bison

I'm currently writing an interpreter for a language I have designed.
The lexer/parser (GLR) is written in Flex/Bison and the main interpreter in D, and everything is working flawlessly so far.
The thing is, I want to also add string interpolation, that is, identify string literals that contain a specific pattern (e.g. "[some expression]") and convert the included expression. I think this should be done at the parser level, from within the corresponding grammar action.
My idea is to convert/treat the interpolated string as what it would look like with simple concatenation (as it works right now).
E.g.
print "this is the [result]. yay!"
to
print "this is the " + result + ". yay!"
However, I'm a bit confused as to how I could do that in Bison: basically, how do I tell it to re-parse a specific string (while constructing the main AST)?
Any ideas?
You could reparse the string, if you really wanted to, by generating a reentrant parser. You would probably want a reentrant scanner, as well, although I suppose you could kludge something together with a default scanner, using flex's buffer stack. Indeed, it's worth learning how to build reentrant parsers and scanners on the general principle of avoiding unnecessary globals, whether or not you need them for this particular purpose.
But you don't really need to reparse anything; you can do the entire parse in one pass. You just need enough smarts in your scanner so that it knows about nested interpolations.
The basic idea is to let the scanner split the string literal with interpolations into a series of tokens, which can easily be assembled into an appropriate AST by the parser. Since the scanner may return more than one token out of a single string literal, we'll need to introduce a start condition to keep track of whether the scan is currently inside a string literal or not. And since interpolations can, presumably, be nested, we'll use flex's optional start condition stack, enabled with %option stack, to keep track of the nested contexts.
So here's a rough sketch.
As mentioned, the scanner has two extra start conditions: SC_PROGRAM, the default, which is in effect while the scanner is scanning regular program text, and SC_STRING, in effect while the scanner is scanning a string. SC_PROGRAM is only needed because flex does not provide an official interface to check whether the start condition stack is empty; aside from nesting, it is identical to the INITIAL top-level start condition. The start condition stack is used to keep track of interpolation markers ([ and ] in this example), and it is needed because an interpolated expression might use brackets (as array subscripts, for example) or might even include a nested interpolated string. Since SC_PROGRAM is, with one exception, identical to INITIAL, we'll make it an inclusive start condition.
%option stack
%s SC_PROGRAM
%x SC_STRING
%%
Since we're using a separate start condition to analyse string literals, we can also normalise escape sequences as we parse. Not all applications will want to do this, but it's pretty common. But since that's not really the point of this answer, I've left out most of the details. More interesting is the way that embedded interpolation expressions are handled, particularly deeply nested ones.
The end result will be to turn string literals into a series of tokens, possibly representing a nested structure. In order to avoid actually parsing in the scanner, we don't make any attempt to create AST nodes or otherwise rewrite the string literal; instead, we just pass the quote characters themselves through to the parser, delimiting the sequence of string literal pieces:
["] { yy_push_state(SC_STRING); return '"'; }
<SC_STRING>["] { yy_pop_state(); return '"'; }
A very similar set of rules is used for interpolation markers:
<*>"[" { yy_push_state(SC_PROGRAM); return '['; }
<INITIAL>"]" { return ']'; }
<*>"]" { yy_pop_state(); return ']'; }
The second rule above avoids popping the start condition stack if it is empty (as it will be in the INITIAL state). It's not necessary to issue an error message in the scanner; we can just pass the unmatched close bracket through to the parser, which will then do whatever error recovery seems necessary.
To finish off the SC_STRING state, we need to return tokens for pieces of the string, possibly including escape sequences:
<SC_STRING>{
    [^[\\"]+           { yylval.str = strdup(yytext); return T_STRING; }
    \\n                { yylval.chr = '\n'; return T_CHAR; }
    \\t                { yylval.chr = '\t'; return T_CHAR; }
    /* ... Etc. */
    \\x[[:xdigit:]]{2} { yylval.chr = strtoul(yytext + 2, NULL, 16);
                         return T_CHAR; }
    \\.                { yylval.chr = yytext[1]; return T_CHAR; }
}
Returning escaped characters like that to the parser is probably not the best strategy; normally I would use an internal scanner buffer to accumulate the entire string. But it was simple for illustrative purposes. (Some error handling is omitted here; there are various corner cases, including newline handling and the annoying case where the last character in the program is a backslash inside an unterminated string literal.)
In the parser, we just need to insert a concatenation node for interpolated strings. The only complication is that we don't want to insert such a node for the common case of a string literal without any interpolations, so we use two syntax productions, one for a string with exactly one contained piece, and one for a string with two or more pieces:
string : '"' piece '"' { $$ = $2; }
| '"' piece piece_list '"' { $$ = make_concat_node(
prepend_to_list($2, $3));
}
piece : T_STRING { $$ = make_literal_node($1); }
| '[' expr ']' { $$ = $2; }
piece_list
: piece { $$ = new_list($1); }
| piece_list piece { $$ = append_to_list($1, $2); }
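To make the flow concrete, here is the token sequence the scanner emits for the question's example literal, assuming the surrounding program grammar scans result as some identifier token (written IDENT here for illustration):

"this is the [result]. yay!"
    =>  '"'  T_STRING("this is the ")  '['  IDENT(result)  ']'  T_STRING(". yay!")  '"'

The parser reduces this with the second string production: piece matches the first T_STRING, piece_list collects the '[' expr ']' interpolation and the trailing T_STRING, and make_concat_node wraps all three. Nothing is ever reparsed.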

How will I implement the lexing of strings using ocamllex?

I am new to the concept of lexing and am trying to write a lexer in ocaml to read the following example input:
(blue, 4, dog, 15)
Basically the input is a list of any random string or integer. I have found many examples for int based inputs as most of them model a calculator, but have not found any guidance through examples or the documentation for lexing strings. Here is what I have so far as my lexer:
(* File lexer.mll *)
{
  open Parser
}

rule lexer_main = parse
    [' ' '\r' '\t']   { lexer_main lexbuf }  (* skip blanks *)
  | ['0'-'9']+ as lxm { INT(int_of_string lxm) }
  | '('               { LPAREN }
  | ')'               { RPAREN }
  | ','               { COMMA }
  | eof               { EOF }
  | _                 { syntax_error "couldn't identify the token" }
As you can see, I am missing the ability to parse strings. I am aware that a string can be represented in the form ['a'-'z'], so would it be as simple as ['a'-'z'] { STRING }?
Thanks for your help.
The notation ['a'-'z'] represents a single character, not a string. So a string is more or less a sequence of one or more of those. I have a fear that this is an assignment, so I'll just say that you can extend a pattern for a single character into a pattern for a sequence of the same kind of character using the same technique you're using for INT.
However, I wonder whether you really want your strings to be so restrictive. Are they really required to consist of alphabetic characters only?
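For reference, a minimal sketch of that extension, assuming the parser declares a STRING token that carries a string payload (and keeping in mind the caveat above that purely alphabetic strings may be too restrictive):

| ['a'-'z']+ as lxm { STRING lxm }

Because ocamllex always takes the longest match, an input such as blue is then consumed as one STRING token rather than as four single-character matches.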

Resources