Can I parse antlr4 input one prefix at a time? - antlr4

I have a file containing many input patterns. My grammar is as follows:
start
: pattern+ EOF
;
pattern
: rule1 (rule2 rule3)*
;
A pattern is a multi-line string, and the file contains a great many patterns. I want to parse one pattern at a time, do some processing using the visitor, and then move on to the next pattern. I want to free most or all of the memory used while parsing the previous pattern; otherwise, memory may become a problem.
What are my options with ANTLR? I have presented the grammar in a very simplified form; it can take a lookahead of anywhere between 10 and 100 tokens to confirm that a pattern has been detected.
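With ANTLR itself, one commonly suggested route is to drop the `start` rule, build the parser once, and call the generated `pattern()` rule method in a loop until the token stream reaches EOF (optionally with an unbuffered token stream so tokens are not all retained). The sketch below is ANTLR-independent and only illustrates the shape of that loop: a generator yields one pattern at a time so the previous pattern's memory becomes collectable. The blank-line boundary is an assumption standing in for whatever lookahead actually delimits patterns.

```python
# Sketch (not ANTLR): parse one pattern at a time from a stream of lines,
# so memory for earlier patterns can be reclaimed between iterations.
# A "pattern" is assumed here to end at a blank line -- a stand-in for
# whatever token lookahead actually delimits patterns in the real grammar.

def iter_patterns(lines):
    """Yield one pattern (a list of its lines) at a time."""
    chunk = []
    for line in lines:
        if line.strip() == "":          # assumed pattern boundary
            if chunk:
                yield chunk
                chunk = []              # drop the reference; old pattern is collectable
        else:
            chunk.append(line)
    if chunk:
        yield chunk

text = "rule1 a\nrule2 b\n\nrule1 c\n"
patterns = list(iter_patterns(text.splitlines()))
# patterns -> [["rule1 a", "rule2 b"], ["rule1 c"]]
```

The caller processes each yielded pattern (e.g. runs a visitor over it) before asking for the next one, so at most one pattern's worth of data is live at a time.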

Related

Antlr4 Lexer with two continuous parentheses

My text is: Function(argument, function(argument)). When I use the g4 debugger to generate the tree, it works when there is a blank between the last two parentheses: Function(argument, function(argument) ). But the debugger reports an unexpected '))' when there is no blank. How should I revise my grammar to fix this?
It confuses me a lot.
(It will be much easier to confirm the answer to your question if you post the grammar, or at least a stripped down version that demonstrates the issue. Note: Often, stripping down to a minimal example to demonstrate the issue will result in you finding your own answer.)
That said, based upon your error message, I would guess that you have a Lexer rule that matches )). If you do, then ANTLR will match that rule and create that token rather than creating two ) tokens.
Most of the time this mistake originates from not understanding that the Lexer is completely independent of the parser. The Lexer recognizes a stream of characters and produces a stream of tokens, and then the parser matches rules against that stream of tokens. In this case, I would guess that the parser rule is looking to match a ) token, but finds the )) token instead. Those are two different tokens, so the parser rule will fail.
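The effect is easy to reproduce with a toy maximal-munch lexer: like ANTLR, it prefers the longest literal that matches at the current position, so a `))` rule swallows both characters whenever the two parentheses are adjacent. The rule names below are hypothetical.

```python
# Toy maximal-munch lexer: the longest matching literal wins, so a ')' ')'
# pair becomes one DOUBLE_RPAREN token when such a rule exists.

RULES = [("DOUBLE_RPAREN", "))"), ("RPAREN", ")"), ("LPAREN", "(")]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        # pick the rule with the longest literal that matches here
        best = None
        for name, lit in RULES:
            if text.startswith(lit, i) and (best is None or len(lit) > len(best[1])):
                best = (name, lit)
        if best is None:
            i += 1                      # skip everything else (letters, spaces)
            continue
        tokens.append(best[0])
        i += len(best[1])
    return tokens

tokenize("f(g(x))")    # -> ['LPAREN', 'LPAREN', 'DOUBLE_RPAREN']
tokenize("f(g(x) )")   # -> ['LPAREN', 'LPAREN', 'RPAREN', 'RPAREN']
```

With the blank, the two `)` characters are no longer adjacent, so the shorter rule matches twice — exactly the behavior the question describes. Removing the `))` lexer rule makes both inputs lex identically.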

Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

For a small compiler project, we are currently implementing a compiler for a subset of C, for which we decided to use Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a
new-line character is deleted, splicing physical source lines to form
logical source lines. Only the last backslash on any physical source
line shall be eligible for being part of such a splice.
(§5.1.1, ISO/IEC 9899:201x)
So far we came up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which a copy of the initial input is produced with every occurrence of \\\n removed. The big disadvantage we see in this approach is that we lose accurate error locations, which we need.
2.) Implement a special char' combinator that behaves like char but looks one extra character ahead and silently consumes any \\\n. This would give us correct positions. The disadvantage here is that we would need to replace every occurrence of char with char' in every parser, even in the megaparsec-provided ones like string, integer, whitespace, etc.
Most likely we are not the first people trying to parse a language with such a quirk using parsec/megaparsec, so I imagine there is a nicer way to do it. Does anyone have an idea?
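Approach 1 need not lose error locations if the pre-lexing pass also emits a position map from offsets in the spliced text back to (line, column) in the physical source. The sketch below (plain Python, not megaparsec) shows the idea; a real implementation would consult the map when rendering parse errors.

```python
# Sketch of approach 1 with positions preserved: delete each backslash-
# newline splice but keep a map from spliced-text offsets back to
# (line, column) in the physical source, so error locations stay accurate.

def splice(src):
    out, posmap = [], []            # posmap[i] = (line, col) of out[i]
    line, col, i = 1, 1, 0
    while i < len(src):
        if src[i] == "\\" and i + 1 < len(src) and src[i + 1] == "\n":
            i += 2                  # drop the splice entirely
            line, col = line + 1, 1
            continue
        out.append(src[i])
        posmap.append((line, col))
        if src[i] == "\n":
            line, col = line + 1, 1
        else:
            col += 1
        i += 1
    return "".join(out), posmap

text, posmap = splice("ab\\\ncd")
# text -> "abcd"; posmap[2] -> (2, 1): the 'c' came from line 2, column 1
```

An error reported at offset *i* of the logical text is then translated through `posmap[i]` to the physical location the user actually sees in their file.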

Lexer: Handling unterminated strings whilst tokenizing

I have started writing my own lexer and have run into a problem tokenising strings, since they have a start (") and an end (") character associated with them.
Does anyone know of a common technique whereby a lexer can cope with an unterminated string and continue lexing?
I think ANTLR can do this; is this handled by an ATN in ANTLR?
I can see two issues here, assuming that strings must terminate on a single line:
String termination occurs on a separate line -- in which case, warn the user that strings can only span a single line.
String termination never occurs -- then how do you know a valid point at which to continue? Use a heuristic: the next valid token after the newline.
i.e.
char *mystring = "my string which is unterminated....
int id = 20;
If your language prohibits newlines in string literals, then just terminating the string at the end of the line is likely to be acceptable. It is reasonably unlikely that there will be a declaration or keyword statement on the same line as the string literal (and there is no reason to encourage bad style by trying to compensate for it).
You might skip a useful close parenthesis:
printf("%s\n, line);
but you likely have recovery rules in place which can cope with that.
If string literals can contain newlines -- and there is ample evidence that this feature is often desired -- then recovery is rather more difficult and you might well find that the simplest solution all round is to just throw a syntax error which clearly states where the offending string started.
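The single-line recovery described above is small enough to sketch directly: when the scanner hits a newline before the closing quote, it emits the partial string together with an error and resumes lexing at the next line, so the `int id = 20;` on the following line still tokenizes normally.

```python
# Sketch: a string-literal scanner that, on hitting a newline before the
# closing quote, returns the partial lexeme plus an error and leaves the
# position at the newline so lexing can resume on the next line.

def scan_string(text, start):
    """text[start] is the opening quote. Return (lexeme, end, error)."""
    i = start + 1
    while i < len(text):
        c = text[i]
        if c == '"':
            return text[start:i + 1], i + 1, None
        if c == "\n":
            return text[start:i], i, "unterminated string starting at offset %d" % start
        i += 1
    return text[start:], len(text), "unterminated string starting at offset %d" % start

src = 'char *s = "oops\nint id = 20;\n'
lex, end, err = scan_string(src, src.index('"'))
# lex -> '"oops'; err is set; src[end:] still holds the next line intact
```

Reporting the offset where the offending string *started*, as the answer suggests, is what makes the resulting diagnostic useful to the user.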

Two Different Ways of Designing a Rule (Detecting Tokens Like ##<Text>##)

I'm designing a grammar for a markdown-based language, but without the context awareness.
For example, I want to detect tokens like ##<Text>##.
I found two different ways of designing rules for that, and I'm not quite sure which approach would be best.
The first way: Defining more complex tokens and a simple rule.
fragment
HEAD
: '#'
;
fragment
HEADING_TEXT
: (~[#]|'\\#')+?
;
SUBHEADLINE
: HEAD HEAD HEADING_TEXT HEAD HEAD
;
subheadline
: SUBHEADLINE
;
Because HEAD and HEADING_TEXT are fragments, they never reach the parser on their own; only SUBHEADLINE does. I'm prototyping within IntelliJ and the parsing works well. The error messages show something like "missing SUBHEADLINE", which is great for the main application (I think I can easily turn those errors into human-readable ones).
The second approach: Much simpler tokens and more complex rules for the parser.
HEAD
: '#'
;
HEADING_TEXT
: (~[#]|'\\#')+?
;
subheadline
: HEAD HEAD HEADING_TEXT HEAD HEAD
;
This works fine, too. The errors are more specific, and perhaps not as easy to transform into human-readable ones.
But overall I'm not sure which approach I should follow, and why. The more complex tokens are easier to write in this case, because there won't be any complex rules like those normal programming languages contain. But it doesn't feel like the correct way of doing it.
Both ways have their own behavior, and which one to use depends on what you need. Defining the subheadline in the lexer the way you did does not allow skipped/hidden tokens between the '#' characters, which is probably what you intend. Doing it in the parser instead allows input like # /*a comment*/ headline ##, which is probably not the intended behavior. Also, I would combine things that strictly belong together into one rule. For instance, HEADING_TEXT in your second variant may match input that you want matched in a different way. Instead, define the subheading exactly as the language dictates:
SUBHEADING: '##' .*? '##';
This is even more concise than your simpler variant, while still not allowing skipped input between the markers.

Get match length in parsec

Parsec's parse pattern "(some_input)" input returns the parsed data (as specified in pattern).
How can I find out how much of the input it has consumed (pattern is not anchored with eof)? I don't want to add length tracking throughout all of pattern's internals (it discards some parts of the input).
It is not easy with Parsec.
If you only need to skip a header, you can grab the rest of the input using getInput.
Other parser libraries may be able to do this.
(This answer is based on comments on the question.)
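The getInput trick amounts to comparing the input length before and after the parse, rather than threading a counter through the parser itself. The sketch below shows the idea in plain Python with a toy parser that returns its result plus the unconsumed remainder, which is the convention getInput lets you recover in Parsec.

```python
# Sketch of the getInput idea in plain Python: run a "parser" that returns
# (result, rest-of-input) and derive the consumed length by comparing
# lengths, instead of adding length tracking inside the parser.

def digits(s):
    """Toy parser: consume leading digits, return (value, rest)."""
    i = 0
    while i < len(s) and s[i].isdigit():
        i += 1
    return s[:i], s[i:]

def with_length(parser, s):
    """Run parser on s and also report how many characters it consumed."""
    result, rest = parser(s)
    return result, len(s) - len(rest)

value, n = with_length(digits, "123abc")
# value -> "123"; n -> 3 characters consumed
```

In Parsec the analogue is to call getInput after the pattern succeeds and compare it with the original input; the parser's internals stay untouched.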
