ANTLR Lexical Modes for complex grammar - antlr4

I am faced with syntax that allows for multiple formats of parameters, for example
OpenFile FILE=filename
OpenFile FILE=(filename)
OpenFile filename
Parsing the variants is a no-brainer, but ... (there is always a BUT, isn't there?)
"filename" can contain strange symbols, replacement tags, and so on, so it is difficult if not impossible to define a single, unambiguous lexer token for it; I always run into ambiguities. I therefore need to get that parameter as a raw string without further tokenizing.
While it wouldn't be too difficult to switch to a "raw" lexical mode and switch back on a newline or the closing paren for the first two formats, I can't think of a way to handle the third format: if I switch modes on the OpenFile token itself, FILE= etc. would be considered part of the raw string in the other formats.

Related

String literal in ANTLR4

I'm using antlr4 C++ runtime and I'd like to create a string literal in my lexer definition file. How can I do this?
What I have so far:
V_STRING : '"' ~('\\' | '"')* '"';
It doesn't work with
printf("string literal\n");
but works with
printf("string literal\\n");
I don't want to explicitly escape the new line character.
My assumption is that ANTLR interprets the newline character as a regular newline (when reading a file, for example).
Thanks in advance.
It's always a good idea to list out your token stream to see if your lexer rules really do what you expect. (Look into the -tokens option of the TestRig; also, some plugins will show you your tokens.)
In your case, your rule essentially says that a string is a " followed by zero or more characters that are not a \ or a ", and then a ".
So, when the lexer encounters your \, it has matched the ~('\\' | '"')* part of the rule as far as it can and then looks for a " (which it does not find, since the \ is followed by an n), so it won't recognize "string literal\n" as a V_STRING token. (It also fails to match "string literal\\n" here, so I'm not quite sure what's going on with the example that "works".)
Try:
V_STRING: '"' ~["]* '"';
Note: this is a very simple string rule, but it accepts your input. You probably want to examine grammars for other languages to see how you might want to handle strings in your language; there are several approaches (and many of them involve using lexer modes). You can find examples here.
If you want the "\n" to be treated as a newline, just understand that the parser won't do that for you; you'll just see the characters "\" and "n". It'll be up to you to handle decoding the escaped characters (and it's once you try to handle \" that it'll get more complicated and you'll need to look into lexer modes).
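Decoding the escapes after the lexer has matched the token could look roughly like the sketch below. It is written in JavaScript purely for illustration (the question uses the C++ runtime), decodeStringLiteral is a hypothetical helper, not part of ANTLR, and the set of supported escapes is an assumption.

```javascript
// Hypothetical helper (not an ANTLR API): turns the raw text of a matched
// string token, including its surrounding quotes, into the value it denotes.
function decodeStringLiteral(tokenText) {
  // Assumed escape set; extend it to whatever your language defines.
  const escapes = { n: '\n', t: '\t', r: '\r', '\\': '\\', '"': '"' };
  const body = tokenText.slice(1, -1); // drop the surrounding quotes
  // Replace each backslash + character pair; unknown escapes keep the character.
  return body.replace(/\\(.)/g, (_, ch) => (ch in escapes ? escapes[ch] : ch));
}
```

The token text itself still contains the two characters \ and n; only this post-processing step turns them into a real newline.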

Nodejs equivalent of c sscanf

I need a function that behaves similarly to sscanf.
For example, let's suppose we have a format string that looks like this (the function I'm looking for doesn't have to be exactly like this, but something similar)
"This is normal text that has to exactly match, but here is a ${var}"
and have it return or modify a variable so that it looks like
{'var': <whatever was there>}
After researching this for a while, the only thing I could actually find was scanf, but that takes input from stdin, not from a string.
I am aware that there is a regex solution for this, but I'm looking for a function that does this without the need for regex (regex is slow). However, if there is no other solution for this, I will accept a regex solution.
The normal solution for this in most languages that have regular expressions built-in is to use regular expressions.
If you're not used to or don't like regular expressions, I'm sorry. Most of the programming world has assumed that knowledge of regular expressions is mandatory.
In any case, the normal solution to this is String.prototype.match:
let text = get_string_to_scan();
let match = text.match(/This is normal text that has to exactly match, but here is a (.+)/);
if (match) { // match is null if no match is found
    // The result you want is in match[1]
    console.log('value of var is:', match[1]);
}
What pattern you put in your capture group (the (..) part) depends on what you want. The code above captures anything at all including spaces and special characters.
If you just want to capture a "word", that is, a run of letters, digits and underscores without spaces, then you can use (\w+):
text.match(/This is normal text that has to exactly match, but here is a (\w+)/)
If you want to capture a word with only letters but not numbers you can use ([a-zA-Z]+):
text.match(/This is normal text that has to exactly match, but here is a ([a-zA-Z]+)/)
The flexibility of regular expressions is why other methods of string scanning are usually not supported in languages that have had regular expressions built in since the beginning. But of course, flexibility comes with complexity.
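If you want something closer to the ${var} format string from the question, one option is to build the regular expression from the template. The sketch below is only one possible approach; scanTemplate is a hypothetical helper name, and it assumes that placeholder names consist of word characters and that each placeholder should capture as little text as possible.

```javascript
// Hypothetical helper: scan `text` against a template with ${name} placeholders.
// Returns an object of captured values, or null if the text does not match.
function scanTemplate(template, text) {
  const names = [];
  // Splitting on a capturing group puts literals at even indices
  // and placeholder names at odd indices.
  const parts = template.split(/\$\{(\w+)\}/);
  const source = parts
    .map((part, i) => {
      if (i % 2 === 1) {
        names.push(part);
        return '(.+?)'; // capture group for this placeholder
      }
      // Escape regex metacharacters so literal text matches exactly.
      return part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    })
    .join('');
  const match = new RegExp('^' + source + '$').exec(text);
  if (!match) return null;
  const result = {};
  names.forEach((name, i) => { result[name] = match[i + 1]; });
  return result;
}
```

Note that the template has to be passed as an ordinary quoted string, for example scanTemplate('x=${a}, y=${b}', 'x=1, y=2'); inside backticks JavaScript would try to interpolate ${a} itself.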
Do you mean to have the ${var} to act as a placeholder? If so you could do it by replacing the " with the backtick:
console.log(`This is normal text that has to exactly match, but here is a ${"whatever was there"}`)

ANTLR lexer token not being used

I have this small example, trying to parse a key:value type string. Real examples may be more complex, but I essentially want a [a-zA-Z0-9] style string, then a colon, then whatever else is on that line as the value (not including the colon).
https://gist.github.com/nmz787/4888cfadf707a575de0662f8a3914ce0
Unfortunately it isn't working, the INTERMEDIATE lexer token is not being found... I just can't figure it out. This is a really simple example extracted from a more complex parser and lexer that I've been handed to work on adding more features to. So I hope it's sufficient for this forum.
Using ANTLR4 to parse a key/value store is pretty much over-the-top. All you would need is to split your input into individual lines. Then split each line at the colon, trim the resulting strings and there you have it. No need for a parser at all.
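A minimal sketch of that approach, in JavaScript for illustration; parseKeyValues is a hypothetical name, and the sketch assumes that the first colon on a line separates key from value and that lines without a colon can be ignored.

```javascript
// Hypothetical helper: parse "key: value" lines into a plain object.
function parseKeyValues(input) {
  const result = {};
  for (const line of input.split(/\r?\n/)) {
    const colon = line.indexOf(':');
    if (colon === -1) continue; // assumption: lines without a colon are skipped
    const key = line.slice(0, colon).trim();
    const value = line.slice(colon + 1).trim(); // everything after the first colon
    result[key] = value;
  }
  return result;
}
```

Splitting at the first colon only means a value such as a URL can itself contain colons.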

Antlr4 pass text back to parser from the lexer as a string not individual characters

I have a grammar that needs to process comments starting with '{*' and ending at '*}' at any point in the input stream. It also needs to process template markers, which start with '{' followed by a '$' or an identifier and end on a '}', and pass everything else through as text.
The only way to achieve this seems to be to pass anything that isn't a comment or a token back to the parser as individual characters and let the parser build the string. This is incredibly inefficient, as the parser has to build a node for every character that it receives, and then I have to walk the nodes and build a string from them. It would be a lot simpler and faster if the lexer could just return the text as one large string.
On an i7, running the program as a 32-bit C# program on a 90K text file with no tokens or comments, just text, it takes about 15 minutes before it crashes with an out-of-memory exception.
The grammar basically is
Parser:
text: ANY_CHAR+;
Lexer:
COMMENT: '{*' .*? '*}' -> skip;
... Token Definitions .....
ANY_CHAR: [ -~];
If I try to accumulate the text in the lexer it swallows everything and doesn't recognize the comments or tokens because something like ANY_CHAR+ matches everything and returns comments and template markers in the string.
Does anybody know a way around this problem? At the moment it looks like I have to hand write a lexer.
Yes, that is inefficient, but it is also not the way to do it. The solution lies entirely in the lexer.
I understood that you want to detect comments, template markers and text. For this, you should use lexer modes. Every time you hit "{", go into some lexer mode, say MODE1, where you can detect only "*" or "$" or (since I didn't understand what you meant by '{' followed by a '$' or and identifier) something else, and depending on what you hit, go into MODE2 or MODE3. After that (in MODE2 or MODE3), wait for '}' and switch back to the default mode. Of course, it is possible to add even more modes in between, depending on what you want to do, but for what I've just written:
MODE1 is the mode in which you determine whether you are now detecting a comment or a template marker. There are only two tokens in this mode: '*' and everything else. If it's '*', go to MODE2; if it's anything else, go to MODE3.
MODE2: there is only one token here that you need, and that is COMMENT, but you also need to detect '*}' or '}' (depending on how you want to handle it).
MODE3: similar to MODE2; detect what you need and have a token that switches back to the default mode.
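To make the mode idea concrete outside of ANTLR, here is a small hand-written scanner sketch in JavaScript. It is not ANTLR code and not a definitive implementation; the function name, the mode names and the token shapes are all made up for illustration, and it assumes that any '{' not followed by '*' opens a template marker.

```javascript
// Sketch of a mode-switching scanner (illustrative names, not ANTLR code).
// Emits each run of plain text as ONE TEXT token, skips {* ... *} comments,
// and emits one MARKER token per { ... } template marker.
function scan(input) {
  const tokens = [];
  let mode = 'DEFAULT';
  let start = 0; // start of the token currently being accumulated
  let i = 0;
  while (i < input.length) {
    if (mode === 'DEFAULT') {
      if (input[i] === '{') {
        if (i > start) tokens.push({ type: 'TEXT', text: input.slice(start, i) });
        // The MODE1 decision: '*' opens a comment, anything else a marker.
        mode = input[i + 1] === '*' ? 'COMMENT' : 'MARKER';
        i += mode === 'COMMENT' ? 2 : 1;
        start = i;
      } else {
        i++;
      }
    } else if (mode === 'COMMENT') {
      if (input.startsWith('*}', i)) {
        i += 2; // comment is skipped, nothing is emitted
        start = i;
        mode = 'DEFAULT';
      } else {
        i++;
      }
    } else { // MARKER
      if (input[i] === '}') {
        tokens.push({ type: 'MARKER', text: input.slice(start, i) });
        i++;
        start = i;
        mode = 'DEFAULT';
      } else {
        i++;
      }
    }
  }
  // Trailing text; an unterminated comment or marker is silently dropped here.
  if (mode === 'DEFAULT' && i > start) tokens.push({ type: 'TEXT', text: input.slice(start) });
  return tokens;
}
```

The point is the same as with lexer modes: text is accumulated into a single token until a '{' is seen, so the parser never receives individual characters.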

Treat macro arguments in Common Lisp as (case-sensitive) strings

(This is one of those things that seems like it should be so simple that I imagine there may be a better approach altogether)
I'm trying to define a macro (for CLISP) that accepts a variable number of arguments as symbols (which are then converted to case-sensitive strings).
(defmacro symbols-to-words (&body body)
  `(join-words (mapcar #'symbol-name '(,@body))))
converts the symbols to uppercase strings, whereas
(defmacro symbols-to-words (&body body)
  `(join-words (mapcar #'symbol-name '(|,@body|))))
treats ,@body as a single symbol, with no expansion.
Any ideas? I'm thinking there's probably a much easier way altogether.
The symbol names are uppercased during the reader step, which occurs before macroexpansion, so there is nothing you can do with macros to affect that. You can globally set READTABLE-CASE, but that will affect all code; in particular, you will have to write all standard symbols in uppercase in your source. There is also a '-modern' option for CLISP, which provides lowercased versions of the names in the standard library and sets the reader to be case-preserving, but it is itself non-standard. I have never used it myself, so I am not sure what caveats actually apply.
The other way to control the reader is through reader macros. Common Lisp already has a reader macro implementing a syntax for case-sensitive strings: the double quote. It is hard to offer more advice without knowing why you are not just using it.
As Ramarren correctly says, the case of symbols is determined at read time, not at macro expansion time.
Common Lisp has a syntax for specifying symbols without changing the case:
|This is a symbol| - using the vertical bar as multiple escape character.
and there is also a backslash - a single escape character:
CL-USER > 'foo\bar
|FOObAR|
Other options are:
using a different global readtable case
using a read macro which reads and preserves case
using a read macro which uses its own reader
Also note that a syntax for something like |,@body| (where body is spliced in) does not exist in Common Lisp. Splicing only works for lists, not for symbol names. The vertical bar, |, surrounds the characters of a symbol name. The explanation in the Common Lisp HyperSpec is a bit cryptic: Multiple Escape Characters.
