Parsec does not recognize block comments - haskell

I have a problem with Parsec recognizing comments when parsing mustache templates.
The various mustache tags all start with {{, including the block comment ({{!comment}}).
I have set commentStart and commentEnd to {{! and }} in my TokenParser.
Whenever I add comments to a template, Parsec complains that the comment is unexpected.
It expects a mustache variable instead, since that is the only token that matches {{.
When does Parsec remove comments? I thought it would happen before the source hits my parser?

Parsec doesn't remove comments. In a TokenParser, comments are subsumed under white space, so
whiteSpace tokenParser
skips comments and ordinary white space (blanks, tabs, newlines, ...).
Usually, you use the lexeme parsers to skip all white space following each lexeme. Then you only need one initial white-space skip for the top-level parser, to consume any leading white space in the source; afterward, all white space (including comments) is handled automatically by the TokenParser that makeTokenParser creates.
If you don't use lexeme and handle white space manually, you must take care with tokens/lexemes that are a prefix of the comment delimiter. If you try such a prefix first, it will succeed but consume only part of the comment delimiter, in this case the '{{', leaving the '!' for the variable parser, which then fails.
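As a minimal sketch of this setup (the {{name}} variable syntax is a stand-in, not full mustache):

import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok
import Text.Parsec.Language (emptyDef)

-- A TokenParser whose block comments are mustache comments.
tok :: Tok.TokenParser ()
tok = Tok.makeTokenParser emptyDef
  { Tok.commentStart = "{{!"
  , Tok.commentEnd   = "}}"
  }

-- A lexeme parser: after each tag, whiteSpace runs automatically and
-- skips blanks, newlines and {{!...}} comments.
variable :: Parser String
variable = Tok.lexeme tok (string "{{" *> many1 alphaNum <* string "}}")

-- Skip leading white space (and comments) once at the top level.
template :: Parser [String]
template = Tok.whiteSpace tok *> many variable <* eof

With this, parse template "" "{{!greeting}} {{name}}" skips the comment and yields Right ["name"].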

Related

ANTLR lexer token not being used

I have this small example, trying to parse a key:value type string (real examples may be more complex, but essentially I want a [a-zA-Z0-9]-style string, then a colon, then whatever else is on that line as the value, not including the colon):
https://gist.github.com/nmz787/4888cfadf707a575de0662f8a3914ce0
Unfortunately it isn't working: the INTERMEDIATE lexer token is never matched, and I just can't figure out why. This is a really simple example extracted from a more complex parser and lexer that I've been handed to extend with more features, so I hope it's sufficient for this forum.
Using ANTLR4 to parse a key/value store is pretty much overkill. All you need is to split your input into individual lines, split each line at the first colon, and trim the resulting strings; there you have it. No need for a parser at all.
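For illustration, a minimal sketch of that line-splitting approach (written in Haskell here; the helper names and sample input are made up, and one key: value pair per line is assumed):

import Data.Char (isSpace)

-- Drop leading and trailing white space.
trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Split a line at the FIRST colon; the value may contain further colons.
parseLine :: String -> Maybe (String, String)
parseLine s = case break (== ':') s of
  (key, ':' : value) -> Just (trim key, trim value)
  _                  -> Nothing

main :: IO ()
main = mapM_ (print . parseLine) (lines "host: example.com\nport: 8080")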

Why does Sublime consider <!------- (multiple dashes) a syntax error

I have a .html file that is working perfectly fine, but for some reason Sublime Text 3 decides that it has invalid code.
Any idea why that's happening and how to fix it without having to modify the code?
The HTML5 spec states (my emphasis):
Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not start with a single > (U+003E) character, nor start with a U+002D HYPHEN-MINUS character (-) followed by a > (U+003E) character,
nor contain two consecutive U+002D HYPHEN-MINUS characters (--),
nor end with a U+002D HYPHEN-MINUS character (-). Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).
So that's why it's complaining. As to how to fix it without changing the code, that's trickier.
Your contention that it works is really no different from C developers wondering why they need to worry about undefined behaviour when the code they wrote works fine. The fact that it works fine in one particular implementation is not relevant to portable code.
My advice is to actually change the code. It's not valid, after all, and any browser (current or future) would be well within its rights to simply reject it.
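For instance, if the run of hyphens is just a visual separator, one conforming alternative (the '=' characters are an arbitrary choice) is:

<!------- section ------->      (invalid: "--" appears inside the comment)
<!-- ===== section ===== -->    (valid: no "--" between the delimiters)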
As an aside, after some historical digging, it appears this is not allowed because SGML, on which HTML was based, had slightly different rules regarding comments.
On sensing the <!-- token, the parser was switched to a comment mode where > characters were actually allowed within the comment. If the -- sequence was encountered, it changed to a different mode where the > would end the comment.
In fact, it appears to have been a toggle switch between those two modes, so something like <!-- >>>>> -- xyzzy -- >>>>> --> was possible, but putting a > where the xyzzy would end the comment.
XML, for one, didn't adopt this behaviour and HTML has now modified it to follow the "don't use -- within comments at all" rule, the reason being that hardly anyone knew that the comments behaved in the SGML way, causing some pain :-)

Haskell - Alex lexer - handle whitespace and newlines as state

I'm writing a parser for a language in Haskell with Alex + Happy.
What I want to do is: in Alex, skip whitespace and newlines, but keep them as state, and then emit tokens which contain the newlines and the indent before the token.
I guess I could emit extra tokens for indent and newlines, and then collapse them later, but I'd prefer a cleaner approach.
Is there any way to wrap the token processing in Alex in a monad that carries the indent/newline information and is accessible inside the actions that emit tokens?
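For what it's worth, Alex's monadUserState wrapper is designed for exactly this kind of threading; below is a minimal sketch (the token shape, field names and word-only token set are illustrative, not taken from the question):

{
module Lexer where
}

%wrapper "monadUserState"

tokens :-

  \n [\ \t]*    { recordNewline }   -- remember each newline and its indent
  [\ \t]+       { skip }            -- interior white space
  [a-zA-Z]+     { mkWord }          -- emit a token carrying the saved state

{
-- Illustrative token: it knows how many newlines and how much indent
-- preceded it.
data Token
  = Word { newlinesBefore :: Int, indentBefore :: Int, text :: String }
  | EOF
  deriving Show

-- The state threaded through the Alex monad.
data AlexUserState = AlexUserState { pendingNewlines :: Int, pendingIndent :: Int }

alexInitUserState :: AlexUserState
alexInitUserState = AlexUserState 0 0

-- Record the newline and the indent that follows it, emit nothing, and
-- keep scanning.
recordNewline :: AlexInput -> Int -> Alex Token
recordNewline _ len = do
  AlexUserState n _ <- alexGetUserState
  alexSetUserState (AlexUserState (n + 1) (len - 1))  -- len includes the '\n'
  alexMonadScan

-- Emit a word token annotated with the pending state, then reset it.
mkWord :: AlexInput -> Int -> Alex Token
mkWord (_, _, _, s) len = do
  AlexUserState n i <- alexGetUserState
  alexSetUserState (AlexUserState 0 0)
  return (Word n i (take len s))

alexEOF :: Alex Token
alexEOF = return EOF
}

Tokens are then pulled with runAlex (looping on alexMonadScan until EOF), and no separate indent/newline tokens ever enter the stream.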

ANTLR4: pass text back to the parser from the lexer as a string, not individual characters

I have a grammar that needs to process comments starting with '{*' and ending with '*}' at any point in the input stream. It also needs to process template markers, which start with '{' followed by a '$' or an identifier and end with '}', and pass everything else through as text.
The only way to achieve this seems to be to pass anything that isn't a comment or a token back to the parser as individual characters and let the parser build the string. This is incredibly inefficient: the parser has to build a node for every character it receives, and then I have to walk the nodes and build a string from them. It would be a lot simpler and faster if the lexer could just return the text as one large string.
On an i7, running the program as a 32-bit C# program on a 90K text file with no tokens or comments, just text, it takes about 15 minutes before it crashes with an out-of-memory exception.
The grammar basically is
Parser:
text: ANY_CHAR+;
Lexer:
COMMENT: '{*' .*? '*}' -> skip;
... Token Definitions .....
ANY_CHAR: [ -~];
If I try to accumulate the text in the lexer, it swallows everything and doesn't recognize the comments or tokens, because something like ANY_CHAR+ matches everything and returns comments and template markers inside the string.
Does anybody know a way around this problem? At the moment it looks like I have to hand-write a lexer.
Yes, that is inefficient, but it's also not the way to do it. The solution lies completely in the lexer.
I understand that you want to detect comments, template markers and text. For this, you should use lexer modes. Every time you hit '{', go into some lexer mode, say MODE1, where you can detect only '*', '$' or (since I didn't understand what you meant by "'{' followed by a '$' or an identifier") something else, and depending on what you hit, go into MODE2 or MODE3. After that (in MODE2 or MODE3), wait for '}' and switch back to the default mode. Of course, you can add even more modes in between, depending on what you want to do, but for what I've just written (a sketch follows the list):
MODE1 would be the mode in which you determine whether you are now detecting a comment or a template marker. There are only two tokens in this mode: '*' and everything else. If it's '*', go to MODE2; if it's anything else, go to MODE3.
MODE2: there is only one token you need here, and that is COMMENT, but you also need to detect '*}' or '}' (depending on how you want to handle it).
MODE3: similarly to MODE2, detect what you need and have a token that switches back to the default mode.
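A rough sketch of that idea as an ANTLR4 lexer grammar (the grammar and token names are illustrative; this sketch folds the comment into a single default-mode rule instead of a separate MODE2, and adds a TEXT rule so each run of plain text becomes one token):

lexer grammar TemplateLexer;

// Default mode: comments are skipped outright, '{' enters the marker
// mode, and every run of non-'{' characters becomes ONE TEXT token.
COMMENT : '{*' .*? '*}' -> skip ;
LBRACE  : '{' -> pushMode(MARKER) ;
TEXT    : ~'{'+ ;

mode MARKER;
DOLLAR  : '$' ;
ID      : [a-zA-Z_] [a-zA-Z_0-9]* ;
RBRACE  : '}' -> popMode ;

On '{*' the COMMENT rule wins over LBRACE because ANTLR prefers the longest match, so comments are skipped wherever they occur, and the parser never has to assemble text character by character.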

When is it acceptable to not trim a user input string?

Can someone give me a real-world scenario of a method/function with a string argument which came from user input (e.g. form field, parsed data from file, etc.) where leading or trailing spaces SHOULD NOT have been trimmed?
I can't ever recall such a situation for myself.
EDIT: Mind you, I didn't say trimming any whitespace. I said trimming leading or trailing (only) spaces (or whitespace).
Search string in any "Find" dialog in an editor.
Password input boxes. There's lots of data out there where whitespace can genuinely be considered an important part of the string. Restricting the question to leading and trailing whitespace narrows things down a lot, but there are still many examples, such as stuff you pass through a PHP-style nl2br function.
If you are inputting code, there may be scenarios where whitespace at the beginning and end is necessary.
Also, look at Stack Overflow's markdown editor. Code examples are indented, so if you posted just a code example, it would require that leading and trailing white space not be trimmed.
Perhaps a Whitespace interpreter.
Python....
A Stack Overflow answer, or more generally any input written in markdown (four leading spaces -> code block).
A paragraph entry.
If the input is python code (say, for a pastebin kinda thing), you certainly can't trim leading white space; but you also can't trim trailing white space, because it could be a part of a multi-line string (triple quoted string).
I've used whitespace as a delimiter before, so there. Also, for anything that involves concatenating multiple inputs, removing leading/trailing whitespace can break formatting or possibly do worse. Aside from that, as Spencer said, for indented paragraphs you probably would not want to remove the leading whitespace.
Obviously passwords should not be trimmed. Passwords can contain leading or trailing whitespace that needs to be treated as valid characters.
