redefinition token type in parser - antlr4

I need to implement syntax highlighting for COS (aka MUMPS). The language allows constructs of the form
new (new,set,kill)
set kill=new
where 'new', 'set', and 'kill' are commands, but can also be used as variable names.
grammar cos;
Command_KILL :( ('k'|'K') | ( ('k'|'K')('i'|'I')('l'|'L')('l'|'L') ) );
Command_NEW :( ('n'|'N') | ( ('n'|'N')('e'|'E')('w'|'W') ) );
Command_SET :( ('s'|'S') | ( ('s'|'S')('e'|'E')('t'|'T') ) );
INT : [0-9]+;
ID : [a-zA-Z][a-zA-Z0-9]*;
Space: ' ';
Equal: '=';
newCommand
: Command_NEW Space ID
;
setCommand
: Command_SET Space ID Space* Equal Space* INT
;
The problem is that an ID can have the same name as a command (NEW, SET, etc.).

According to the Wikipedia page, MUMPS doesn't have reserved words:
Reserved words: None. Since MUMPS interprets source code by context, there is no need for reserved words. You may use the names of language commands as variables.
Lexer rules like Command_KILL function exactly like reserved words: they're designed to make sure no other token is generated when input "kill" is encountered. So token type Command_KILL will always be produced on "kill", even if it's intended to be an identifier. You can keep the command lexer rules if you want, but you'll have to treat them like IDs as well because you just don't know what "kill" refers to based on the token alone.
Making a MUMPS implementation in ANTLR means focusing on token usage and context rather than token types. Consider this grammar:
grammar Example;
document : (expr (EOL|EOF))+;
expr : command=ID Space+ value (Space* COMMA Space* value)* #CallExpr
| command=ID Space+ name=ID Space* Equal Space* value #SetExpr
;
value : ID | INT;
INT : [0-9]+;
ID : [a-zA-Z][a-zA-Z0-9]*;
Space : ' ';
Equal : '=';
EOL : [\r\n]+;
COMMA : ',';
Parser rule expr knows when an ID token is a command based on the layout of the entire line.
If the input tokens are ID ID, then the input is a CallExpr: the first ID is a command name and the second ID is a regular identifier.
If the input tokens are ID ID Equal ID, then the input is a SetExpr: the first ID will be a command (either "set" or something like it), the second ID is the target identifier, and the third ID is the source identifier.
Here's a Java test application followed by a test case similar to the one mentioned in your question.
import java.util.List;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class ExampleTest {
    public static void main(String[] args) {
        ANTLRInputStream input = new ANTLRInputStream(
                "new new, set, kill\nset kill = new");
        ExampleLexer lexer = new ExampleLexer(input);
        ExampleParser parser = new ExampleParser(new CommonTokenStream(lexer));
        parser.addParseListener(new ExampleBaseListener() {
            @Override
            public void exitCallExpr(ExampleParser.CallExprContext ctx) {
                System.out.println("Call:");
                System.out.printf("\tcommand = %s%n", ctx.command.getText());
                List<ExampleParser.ValueContext> values = ctx.value();
                if (values != null) {
                    for (int i = 0, count = values.size(); i < count; ++i) {
                        ExampleParser.ValueContext value = values.get(i);
                        System.out.printf("\targ[%d] = %s%n", i,
                                value.getText());
                    }
                }
            }

            @Override
            public void exitSetExpr(ExampleParser.SetExprContext ctx) {
                System.out.println("Set:");
                System.out.printf("\tcommand = %s%n", ctx.command.getText());
                System.out.printf("\tname = %s%n", ctx.name.getText());
                System.out.printf("\tvalue = %s%n", ctx.value().getText());
            }
        });

        parser.document();
    }
}
Input
new new, set, kill
set kill = new
Output
Call:
    command = new
    arg[0] = new
    arg[1] = set
    arg[2] = kill
Set:
    command = set
    name = kill
    value = new
It's up to the calling code to determine whether a command is valid in a given context. The parser can't reasonably handle this because of MUMPS's loose approach to commands and identifiers. But it's not as bad as it may sound: you'll know which commands function like a call and which function like a set, so you'll be able to test the input from the Listener that ANTLR produces. In the code above, for example, it would be very easy to test whether "set" was the command passed to exitSetExpr.
Some MUMPS syntax may be more difficult to process than this, but the general approach will be the same: let the lexer treat commands and identifiers like IDs, and use the parser rules to determine whether an ID refers to a command or an identifier based on the context of the entire line.
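For example, here is a minimal sketch (my own addition, assuming the generated ExampleBaseListener above) of checking the command name inside the listener rather than in the grammar:

@Override
public void exitSetExpr(ExampleParser.SetExprContext ctx) {
    // ctx.command is the ID token labeled "command" in the SetExpr alternative
    String command = ctx.command.getText().toLowerCase();
    // in MUMPS, both the full command name and its one-letter abbreviation are legal
    if (!command.equals("set") && !command.equals("s")) {
        System.err.printf("line %d: '%s' is not a valid command for a set-style expression%n",
                ctx.command.getLine(), ctx.command.getText());
    }
}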

Related

How to set a String into newly created Context's children

I have a parser that calls a visitX method with an XContext that contains an expression that resolves to an ArrayNode and a FunctionName. I can retrieve the function's object and want to call invoke on it for each element in this array, but the function takes an ExpressionVisitor and a YContext. I can create a new YContext(XContext), and its children are empty as expected. I need to add my array.get(i) as a TerminalNode into the children list so the function receiving the YContext can check the number of children (1) and then get the value (e.g., ctx.exprValues().exprList().expr(0)) from the YContext.
TerminalNodeImpl can take a Token (which is an interface), and I haven't found a way to create a Token using the implementing classes that can take a JsonNode value (e.g., String, int, Object).
The YContext children is a List, but I am not sure which implementation of ParseTree I could construct from the JsonNode value.
I tried parsing the JsonNode value using code like this, but I can't get anything out of tokens that I could addAnyChild to my new context...
for (int i = 0; i < mapArray.size(); i++) {
    ANTLRInputStream input = new ANTLRInputStream(mapArray.get(i).asText());
    MappingExpressionLexer lexer = new MappingExpressionLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
I am sure I'm overlooking something simple. In other situations I've been able to push the value onto the stack but in this case the functions I can call all take the YContext so I need to put the value into the YContext.children somehow.
The solution is convoluted but necessary based on how the expression is defined. I traced the ctx being passed to the function so I could get its structure:
Function_callContext:
---------------------
TerminalNodeImpl $string 44 '$string'
ExprValuesContext
TerminalNodeImpl ( 2 '('
ExpressionListContext
NumberContext
TerminalNodeImpl 1 22 '1'
TerminalNodeImpl ) 3 ')'
TerminalNodeImpl $string 44 '$string'
Because the parser is looking for:
expr :
...
| VAR_ID exprValues # function_call
...
;
...
exprList : expr (',' expr)* ;
exprValues : '(' exprList ')' ;
VAR_ID : '$' ID ;
ID : [a-zA-Z] [a-zA-Z0-9_]*;
and I found that CommonTokenFactory lets me create the Token I could put in the TerminalNodeImpl to build up the correct context.
Here is the code. I have only implemented the NUMBER case at this point, but will add exceptions and other types later. My test was to transform an array of numbers into an array of strings using the $map([1..5], $string) example (where [1..5] is a sequence that becomes the array).
for (int i = 0; i < mapArray.size(); i++) {
    Function_callContext callCtx = new Function_callContext(ctx);
    // note: callCtx.children should be empty unless carrying an exception
    ExprListContext elc = new ExprListContext(callCtx.getParent(), callCtx.invokingState);
    ExprValuesContext evc = new ExprValuesContext(callCtx.getParent(), callCtx.invokingState);
    evc.addAnyChild(new TerminalNodeImpl(CommonTokenFactory.DEFAULT.create(MappingExpressionParser.T__1, "(")));
    CommonToken token = null;
    JsonNode element = mapArray.get(i);
    switch (element.getNodeType()) {
    case ARRAY: {
        break;
    }
    case BINARY:
        break;
    case BOOLEAN:
        break;
    case MISSING:
        break;
    case NULL:
        break;
    case NUMBER:
        // manufacture a NUMBER token for the array element and wrap it in a terminal node
        token = CommonTokenFactory.DEFAULT.create(MappingExpressionParser.NUMBER, element.asText());
        TerminalNodeImpl tn = new TerminalNodeImpl(token);
        NumberContext nc = new NumberContext(callCtx);
        nc.addAnyChild(tn);
        elc.addAnyChild(nc);
        evc.addAnyChild(elc);
        break;
    case OBJECT:
        break;
    case POJO:
        break;
    case STRING:
        break;
    default:
        break;
    }
    evc.addAnyChild(new TerminalNodeImpl(CommonTokenFactory.DEFAULT.create(MappingExpressionParser.T__1, ")")));
    callCtx.addAnyChild(var); // 'var' (defined earlier, not shown) holds the terminal node for the function name
    callCtx.addAnyChild(evc);
    result = function.invoke(this, callCtx);
    resultArray.add(result);
}
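The core pattern, restated as a stripped-down sketch (my own summary of the code above, assuming the same generated MappingExpressionParser and its NUMBER token type):

// manufacture a token with CommonTokenFactory, wrap it in a TerminalNodeImpl,
// and attach it to the context being assembled by hand
CommonToken numberToken = CommonTokenFactory.DEFAULT.create(MappingExpressionParser.NUMBER, "42");
TerminalNodeImpl numberNode = new TerminalNodeImpl(numberToken);
numberContext.addAnyChild(numberNode); // numberContext is the hand-built NumberContext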

Understanding Graph, Weighted method

Okay, so what does SET stand for in the second line? And why is the second String inside the <>?
public Weighted(In in, String delimiter) {
    st = new ST<String, SET<String>>();
    while (!in.isEmpty()) {
        String line = in.readLine();
        String[] names = line.split(delimiter);
        for (int i = 1; i < names.length; i++) {
            addEdge(names[0], names[i]);
        }
    }
}
With the little information you gave, I will assume that SET is a set abstract data type. A set stores values without any particular order and with no duplicates. By writing <String> after SET you are saying that you want to store Strings inside your SET.
You can learn more about SETs here: https://en.wikipedia.org/wiki/Set_(abstract_data_type)
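To illustrate the declaration ST<String, SET<String>> concretely, here is a minimal sketch using the standard java.util collections in place of the ST and SET classes (which appear to come from an external library not shown in the question): a symbol table that maps each vertex name to the set of its neighbors.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AdjacencyDemo {
    public static void main(String[] args) {
        // roughly the java.util equivalent of ST<String, SET<String>>
        Map<String, Set<String>> st = new HashMap<>();
        // addEdge("A", "B"): store each endpoint in the other's neighbor set
        st.computeIfAbsent("A", k -> new HashSet<>()).add("B");
        st.computeIfAbsent("B", k -> new HashSet<>()).add("A");
        System.out.println(st); // prints something like {A=[B], B=[A]}
    }
}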

ANTLR4 lexer rule with #init block

I have this lexer rule defined in my ANTLR v3 grammar file - it matches text in double quotes.
I need to convert it to ANTLR v4. The ANTLR compiler throws the error 'syntax error: mismatched input '@' expecting COLON while matching a lexer rule' (on the @init line). Can a lexer rule contain an @init block? How should this be rewritten?
DOUBLE_QUOTED_CHARACTERS
@init
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
: ('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
;
Sample data:
" " -> throws error "Illegal empty quotes!";
"asd -> throws error "Missing closing double quote!"
"text" -> returns text (valid input, content of "...")
I think this is the right way to do this.
DOUBLE_QUOTED_CHARACTERS
:
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
(
('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
)
;
There are some other errors in the above as well (like the WS .. => ... parts), but I am not correcting them as part of this answer, just to keep things simple. I took the hint from here.
Just to hedge against that link moving or becoming invalid after some time, I am quoting the text as is:
Lexer actions can appear anywhere as of 4.2, not just at the end of the outermost alternative. The lexer executes the actions at the appropriate input position, according to the placement of the action within the rule. To execute a single action for a rule that has multiple alternatives, you can enclose the alts in parentheses and put the action afterwards:
END : ('endif'|'end') {System.out.println("found an end");} ;
The action conforms to the syntax of the target language. ANTLR copies the action’s contents into the generated code verbatim; there is no translation of expressions like $x.y as there is in parser actions.
Only actions within the outermost token rule are executed. In other words, if STRING calls ESC_CHAR and ESC_CHAR has an action, that action is not executed when the lexer starts matching in STRING.
I encountered this problem when my .g4 grammar imported a lexer file. Importing grammar files seems to trigger lots of undocumented shortcomings in ANTLR4, so ultimately I had to stop using import.
In my case, once I merged the lexer grammar into the parser grammar (one single .g4 file), my @init and @after parsing errors vanished. I should submit a test case plus a bug report, at least to get this documented. I will update here once I do that.
I vaguely recall 2-3 issues with respect to importing lexer grammar into my parser that triggered undocumented behavior. Much is covered here on stackoverflow.
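For what it is worth, here is a minimal ANTLR4-flavored sketch of the rule (my own simplification, not a drop-in replacement: it drops the semicolon-recovery logic and swaps the v3-isms such as $channel, gated predicates, and options {greedy=false;} for v4 equivalents), with the actions placed inside the alternatives as the quoted text describes:

DOUBLE_QUOTED_CHARACTERS
    : '"' [ \t]* '"'
      {
          setChannel(HIDDEN);
          System.err.println("Illegal empty quotes \"\"!");
      }
    | '"' ~["\r\n]+ '"'
      {
          // strip the surrounding quotes
          setText(getText().substring(1, getText().length() - 1));
      }
    | '"' ~["\r\n]*
      {
          System.err.println("Missing closing double quote!");
      }
    ;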

java apps. string manipulation

I want to limit the number of characters that can be entered into a JTextField, because my database has Sex and Status columns (where only 1 character is allowed) and a Middle Initial column (where only 2 characters are allowed).
This is what I have in mind:
(for the Sex/Status columns)
String text = jTextField2.getText();
int count = text.length();
if (count > 1) {
    // (delete the next character that will be input)
}
(for the M.I. column)
String text = jTextField1.getText();
int count = text.length();
if (count > 2) {
    // (delete the next character that will be input)
}
Is this possible? Is there a method that will delete the extra characters, so the number of characters is acceptable for my database?
Sure. Just use String#substring. Note that Strings are immutable, so you need to keep the value substring returns:
String middleInitial = "JKL";
middleInitial = middleInitial.substring(0, 2);
System.out.println(middleInitial); // => JK
Similarly, you can use substring(0, 1) for sex.
It might be better if sex is an enum, though.
public enum Sex {
    MALE("m"), FEMALE("f");

    final String symbol;

    private Sex(String symbol) {
        this.symbol = symbol;
    }
}
Now you can use it like this:
String sex = "male";
Sex.valueOf(sex.toUpperCase());
Or directly
Sex.MALE;
Instead of a text field for sex, you might use a JComboBox so the user can only choose one of the two options. This way you're sure to have valid input.
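A minimal sketch of that combo-box idea (my own illustration, with made-up variable names):

// the user can only pick one of the listed values, so the stored value
// is always a single valid character for the Sex column
JComboBox<String> sexBox = new JComboBox<>(new String[] { "M", "F" });
String sex = (String) sexBox.getSelectedItem(); // always "M" or "F"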

ANTLR4: Parser for a Boolean expression

I am trying to parse a boolean expression of the following type
B1=p & A4=p | A6=p &(~A5=c)
I want a tree that I can use to evaluate the above expression. So I tried this in Antlr3 with the example in Antlr parser for and/or logic - how to get expressions between logic operators?
It worked in Antlr3. Now I want to do the same thing for Antlr 4. I came up with the grammar below and it compiles, but I am having trouble writing the Java code.
Start of Antlr4 grammar
grammar TestAntlr4;
options {
output = AST;
}
tokens { AND, OR, NOT }
AND : '&';
OR : '|';
NOT : '~';
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| ('+'|'-'|'*'|'/'|'_')
| '='
)+
;
I have written the Java Code (snippet below) for getting a tree for the expression "B1=p & A4=p | A6=p &(~A5=c)". I am expecting & with children B1=p and |. The child | operator will have children A4=p and A6=p &(~A5=c). And so on.
Here is that Java code but I am stuck trying to figure out how I will get the tree. I was able to do this in Antlr 3.
Java Code
String src = "B1=p & A4=p | A6=p &(~A5=c)";
CharStream stream = (CharStream) new ANTLRInputStream(src);
TestAntlr4Lexer lexer = new TestAntlr4Lexer(stream);
TestAntlr4Parser parser = new TestAntlr4Parser(new CommonTokenStream(lexer)); // parser construction added so the snippet compiles
parser.setBuildParseTree(true);
ParserRuleContext tree = parser.parse();
tree.inspect(parser);
if (tree.children.size() > 0) {
    System.out.println(" **************");
    test.getChildren(tree, parser);
}
The getChildren method is below, but it does not seem to extract any tokens.
public void getChildren(ParseTree tree, TestAntlr4Parser parser) {
    for (int i = 0; i < tree.getChildCount(); i++) {
        System.out.println(" Child i= " + i);
        System.out.println(" expression = <" + tree.toStringTree(parser) + ">");
        if (tree.getChild(i).getChildCount() != 0) {
            this.getChildren(tree.getChild(i), parser);
        }
    }
}
Could someone help me figure out how to write the parser in Java?
The output=AST option was removed in ANTLR 4, as well as the ^ and ! operators you used in the grammar. ANTLR 4 produces parse trees instead of ASTs, so the root of the tree produced by a rule is the rule itself. For example, given the following rule:
and : not (AND not)*;
You will end up with an AndContext tree containing NotContext and TerminalNode children for the not and AND references, respectively. To make it easier to work with the trees, AndContext will contain a generated method not() which returns a list of context objects returned by the invocations of the not rule (return type List<? extends NotContext>). It also contains a generated method AND which returns a list of the TerminalNode instances created for each AND token that was matched.
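As a small illustration of those generated accessors, here is a minimal sketch (my own addition, assuming the generated TestAntlr4Parser for a grammar containing the rule above, and a hypothetical evalNot helper) that walks an AndContext:

// ctx.not() returns the child NotContext objects, one per operand of the '&' chain
boolean evalAnd(TestAntlr4Parser.AndContext ctx) {
    boolean result = true;
    for (TestAntlr4Parser.NotContext notCtx : ctx.not()) {
        result = result && evalNot(notCtx); // evalNot is a hypothetical helper for the 'not' rule
    }
    return result;
}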
