Haskell - Alex lexer - handle whitespace and newlines as state

I'm writing a parser for a language in Haskell with Alex + Happy.
What I want to do is: in Alex, skip whitespace and newlines, but keep them as state, and then emit tokens which contain the newlines and the indent before the token.
I guess I could emit extra tokens for indent and newlines, and then collapse them later, but I'd prefer a cleaner approach.
Is there any way to wrap the token processing in Alex in a monad that carries the indent/newline information and is accessible inside the actions that emit tokens?
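A minimal sketch of one way to do this, assuming Alex's monadUserState wrapper; the Token and LayoutState types and their fields are invented for illustration:

{
module Lexer where
}

%wrapper "monadUserState"

tokens :-

  \n [\ \t]*   { \_ len -> do             -- a newline plus the indent after it:
                   st <- alexGetUserState -- remember it in the user state,
                   alexSetUserState st { pendingNewlines = pendingNewlines st + 1
                                       , pendingIndent   = len - 1 }
                   alexMonadScan }        -- then keep scanning (i.e. skip it)
  [\ \t]+      ;                          -- plain inter-token whitespace
  [a-z]+       { \(_, _, _, s) len -> do  -- an example token rule:
                   st <- alexGetUserState -- attach the pending layout info
                   alexSetUserState st { pendingNewlines = 0, pendingIndent = 0 }
                   return (Token (pendingNewlines st) (pendingIndent st) (take len s)) }

{
-- Hypothetical token type carrying the layout seen before the token.
data Token = Token { newlinesBefore :: Int, indentBefore :: Int, tokText :: String }
           | EOF

type AlexUserState = LayoutState
data LayoutState = LayoutState { pendingNewlines :: Int, pendingIndent :: Int }

alexInitUserState :: AlexUserState
alexInitUserState = LayoutState 0 0

alexEOF :: Alex Token
alexEOF = return EOF
}

Inside any action you can reach the state with alexGetUserState / alexSetUserState, which is exactly the monad-carried information asked about.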

Related

ANTLR lexer token not being used

I have this small example, trying to parse a key:value type string. (Real examples may be more complex, but essentially I want a [a-zA-Z0-9]-style string, then a colon, then whatever else is on that line to be the value, not including the colon.)
https://gist.github.com/nmz787/4888cfadf707a575de0662f8a3914ce0
Unfortunately it isn't working: the INTERMEDIATE lexer token is not being found, and I just can't figure out why. This is a really simple example extracted from a more complex parser and lexer that I've been handed to add more features to, so I hope it's sufficient for this forum.
Using ANTLR4 to parse a key/value store is pretty much over-the-top. All you would need is to split your input into individual lines. Then split each line at the colon, trim the resulting strings and there you have it. No need for a parser at all.
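For illustration, the split-and-trim approach the answer describes, sketched in Haskell (function names are made up):

import Data.Char (isSpace)

-- Split each line at the first colon; the rest of the line is the value.
parseKeyValues :: String -> [(String, String)]
parseKeyValues = map splitLine . lines
  where
    splitLine l = let (k, rest) = break (== ':') l
                  in (trim k, trim (drop 1 rest))  -- drop 1 removes the colon
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

For example, parseKeyValues "key: some value" yields [("key","some value")].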

Why Ampersand should be escaped because of XSS injection

The five characters that OWASP recommends escaping to prevent XSS injection are
&, <, >, ", '.
Among them, I cannot understand why & (ampersand) should be escaped or how it can be used as a vector to inject script. Can somebody give an example where the other four characters are escaped but the ampersand is not, and an XSS injection vulnerability results?
I have checked the other question but that answer really does not make things any clearer.
The answer here addresses the issue only in a nested JavaScript context within an HTML attribute context, whereas your question asks specifically about pure HTML context escaping.
In that question, the escaping should be as per the OWASP recommendation for JavaScript:
Except for alphanumeric characters, escape all characters with the \uXXXX unicode escaping format (X = Integer).
Which will already handle & because it is not alphanumeric.
To answer your question,
from a practical point of view, why wouldn't you escape the ampersand?
The HTML representation of & is &amp;, so it makes a lot of sense to do that. If you didn't, any time a user entered &amp;, &lt;, or &gt; into your application, your application would render &, <, or > instead of the literal text &amp;, &lt;, or &gt;.
An edge case? Definitely. A security concern? It shouldn't be.
From the HTML5 syntax Character references section:
Character references must start with a U+0026 AMPERSAND character (&).
Following this, there are three possible kinds of character references:
Named character references
Decimal numeric character reference
Hexadecimal numeric character reference
When an & is encountered:
Switch to the data state.
Attempt to consume a character reference, with no additional allowed character.
If nothing is returned, emit a U+0026 AMPERSAND character (&) token.
Otherwise, emit the character tokens that were returned.
Therefore, anything after the & will cause either & to be output, or the character represented. As the following characters have to be alphanumeric or else they won't be consumed, there is no chance of an escape character (e.g. ', ", >, <) being consumed and ignored, so there is little security risk of an attacker changing the parsing context.
However, you never know if there is a browser bug that doesn't quite follow the standard properly, therefore I would always escape &. Internet Explorer had an issue where you could specify <% and it would be interpreted as <, allowing the .NET Request Validation to be bypassed for XSS attack vectors. Always better to be safe than sorry.
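For completeness, a minimal escaping helper covering the five OWASP characters, sketched in Haskell:

-- Escape the five characters OWASP recommends escaping in an HTML context.
escapeHtml :: String -> String
escapeHtml = concatMap esc
  where
    esc '&'  = "&amp;"
    esc '<'  = "&lt;"
    esc '>'  = "&gt;"
    esc '"'  = "&quot;"
    esc '\'' = "&#x27;"
    esc c    = [c]

Because each input character is translated exactly once, the ampersands introduced by the other replacements are never double-escaped.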

Antlr4 pass text back to parser from the lexer as a string not individual characters

I have a grammar that needs to process comments starting with '{*' and ending with '*}' at any point in the input stream. It also needs to process template markers, which start with '{' followed by a '$' or an identifier and end with a '}', and to pass everything else through as text.
The only way to achieve this seems to be to pass anything that isn't a comment or a token back to the parser as individual characters and let the parser build the string. This is incredibly inefficient, as the parser has to build a node for every character it receives, and then I have to walk the nodes and build a string from them. It would be a lot simpler and faster if the lexer could just return the text as a single large string.
On an i7, running the program as a 32-bit C# program on a 90K text file with no tokens or comments, just text, it takes about 15 minutes before it crashes with an out-of-memory exception.
The grammar basically is
Parser:
text: ANY_CHAR+;
Lexer:
COMMENT: '{*' .*? '*}' -> skip;
... Token Definitions .....
ANY_CHAR: [ -~];
If I try to accumulate the text in the lexer, it swallows everything and doesn't recognize the comments or tokens, because something like ANY_CHAR+ matches everything and returns comments and template markers inside the string.
Does anybody know a way around this problem? At the moment it looks like I have to hand write a lexer.
Yes, that is inefficient, but it's also not the way to do it. The solution lies entirely in the lexer.
I understand that you want to detect comments, template markers, and text. For this, you should use lexer modes. Every time you hit '{', go into some lexer mode, say MODE1, where you can detect only '*', '$', or (since I didn't understand what you meant by '{' followed by a '$' or an identifier) something else, and depending on what you hit, go into MODE2 or MODE3. After that (in MODE2 or MODE3), wait for '}' and switch back to the default mode. Of course, there is the possibility of adding even more modes in between, depending on what you want to do, but for what I've just written (a grammar sketch follows the list):
MODE1 would be where you determine whether you are now looking at a comment or a template marker. There are only two tokens in this mode: '*' and everything else. If it's '*', go to MODE2; if it's anything else, go to MODE3.
MODE2 needs only one token, COMMENT, but you also need to detect '*}' or '}' (depending on how you want to handle it).
MODE3 works similarly to MODE2: detect what you need and have a token that switches back to the default mode.
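A rough ANTLR4 sketch of this mode-based shape (rule and mode names are invented, and the comment rule keeps the non-greedy form from the question; the important part is that TEXT collects a whole run of ordinary characters into a single token instead of one token per character):

COMMENT      : '{*' .*? '*}' -> skip ;
MARKER_OPEN  : '{' -> pushMode(MARKER) ;
TEXT         : ~[{]+ ;

mode MARKER;
MARKER_CLOSE : '}' -> popMode ;
DOLLAR       : '$' ;
ID           : [a-zA-Z_] [a-zA-Z0-9_]* ;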

Parsec does not recognize block comments

I have a problem with Parsec recognizing comments when parsing mustache templates.
The various mustache tags all start with {{, including the block comment ({{!comment}}).
I have set commentStart and commentEnd to {{! and }} in my TokenParser.
Whenever I add comments to a template, Parsec complains that the comment is unexpected.
It expects a mustache variable instead, since that is the only token that matches {{.
When does Parsec remove comments? I thought it would happen before the source hits my parser?
Parsec doesn't remove comments. In a TokenParser, comments are subsumed under white space, so
whiteSpace tokenParser
skips comments and ordinary white space (blanks, tabs, newlines, ...).
Usually, you use the lexeme combinator to skip all white space following a lexeme. Then you only need one initial white-space skip in the top-level parser to consume any leading white space in the source; after that, all white space (including comments) is handled automatically (by the TokenParser that makeTokenParser creates).
If you don't use lexeme and handle white space manually, you must take care of tokens/lexemes that are a prefix of the comment delimiter. If you try the prefix first, that will succeed, but only consume part of the comment delimiter, in this case leaving the '!' for the variable parser, which then fails.
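A minimal sketch, assuming Text.Parsec.Token and a made-up variable parser, of wiring the mustache comment delimiters into a TokenParser so that whiteSpace and lexeme skip them:

import Text.Parsec
import Text.Parsec.String (Parser)
import Text.Parsec.Language (emptyDef)
import qualified Text.Parsec.Token as Tok

-- A token parser whose comments are mustache comment tags.
tok :: Tok.TokenParser ()
tok = Tok.makeTokenParser emptyDef
        { Tok.commentStart = "{{!"
        , Tok.commentEnd   = "}}"
        }

-- One leading whiteSpace, then lexeme skips trailing white space
-- (including comments) after every token.
template :: Parser [String]
template = Tok.whiteSpace tok *> many (Tok.lexeme tok variable)
  where
    variable = string "{{" *> manyTill anyChar (try (string "}}"))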

New lines in tab-delimited or comma-delimited output

I am looking for some best practices for handling CSV and tab-delimited files.
For CSV files I am already doing some formatting if a value contains a comma or double quote but what if the value contains a new line character? Should I leave the new line intact and encase the value in double quotes + escape any double quotes within the value?
Same question for tab delimited files. I assume the answer would be very similar if not the same.
Usually you keep \n unaltered, exploiting the fact that the newline char will be enclosed in a quoted ("...") string. This doesn't create ambiguities, but it's really ugly if you have to take a look at the file in a normal text editor.
But that is how you should do it, since you don't escape anything inside a quoted string in a CSV except for the double quote itself.
@Jack is right that your best bet is to keep the \n unaltered, since you'll expect it inside double quotes if that is the case.
As with most things, I think consistency here is key. As far as I know, your values only need to be double-quoted if they span multiple lines, contain commas, or contain double-quotes. In some implementations I've seen, all values are escaped and double-quoted, since it makes the parsing algorithm simpler (there's never a question of escaping and double-quoting, and the reverse on reading the CSV).
This isn't the most space-optimized solution, but makes reading and writing the file a trivial affair, for both your own library and others that may consume it in the future.
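A sketch of that quoting rule in Haskell: quote only when needed, and double any embedded quotes (quoting unconditionally, as mentioned above, just means dropping the guard):

-- Quote a CSV field if it contains a comma, double quote, or line break;
-- embedded double quotes are escaped by doubling them.
escapeCsvField :: String -> String
escapeCsvField s
  | any (`elem` ",\"\n\r") s = '"' : concatMap esc s ++ "\""
  | otherwise                = s
  where
    esc '"' = "\"\""
    esc c   = [c]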
For TSV, if you want lossless representation of values, the "Linear TSV" specification is worth considering: http://paulfitz.github.io/dataprotocols/linear-tsv/index.html
For obvious reasons, most such conventions adhere to the following at a minimum:
\n for newline,
\t for tab,
\r for carriage return,
\\ for backslash
Some tools add \0 for NUL.
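Under those conventions, a Linear-TSV-style field encoder is a direct transliteration (a sketch in Haskell):

-- Escape a TSV field per the conventions listed above.
escapeTsvField :: String -> String
escapeTsvField = concatMap esc
  where
    esc '\n' = "\\n"
    esc '\t' = "\\t"
    esc '\r' = "\\r"
    esc '\\' = "\\\\"
    esc c    = [c]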
