Several questions on antlr4 have used lexer predicates which have not been mentioned in the book, eg 28730446 uses ahead(String), 42058127 uses getCharPositionInLine(), 23465358 uses _input.LA(1), etc. The _input.LA(1) is also used a few times in the book (eg on pages 212 and 228 of the 2014 edition) but there is no explanation on what it exactly does. Is there a list of available lexer predicates, and their documentation?
These are not lexer predicates. Rather, they are ordinary methods on run-time objects: Token#getCharPositionInLine() and CharStream#LA(int). The documentation is provided in the source code.
The Lexer class defines _input as
public CharStream _input;
Also, the ahead() method is custom defined in the #lexer::members block at the top of that particular grammar (and depends on the use of CharStream#LA(int)).
TDAR remains the best expositive documentation. The source code is internally well documented.
Related
I tried really hard to find answer to this question on google engine.
But I wonder how these high level programming languages are created in principle of automata or is automata theory not included in defining the languages?
Language design tends to have two important levels:
Lexical analysis - the definition of what tokens look like. What is a string literal, what is a number, what are valid names for variables, functions, etc.
Syntactic analysis - the definition of how tokens work together to make meaningful statements. Can you assign a value to a literal, what does a block look like, what does an if statement look like, etc.
The lexical analysis is done using regular languages, and generally tokens are defined using regular expressions. It's not that a DFA is used (most regex implementations are not DFAs in practice), but that regular expressions tend to line up well with what most languages consider tokens. If, for example, you wanted a language where all variable names had to be palindromes, then your language's token specification would have to be context-free instead.
The input to the lexing stage is the raw characters of the source code. The alphabet would therefore be ASCII or Unicode or whatever input your compiler is expecting. The output is a stream of tokens with metadata, such as string-literal (value: hello world) which might represent "hello world" in the source code.
The syntactic analysis is typically done using a subset of context-free languages called LL or LR parsers. This is because the implementation of CFG (PDAs) are nondeterministic. LL and LR parsing are ways to make deterministic decisions with respect to how to parse a given expression.
We use CFGs for code because this is the level on the Chomsky hierarchy where nesting occurs (where you can express the idea of "depth", such as with an if within an if). Higher or lower levels on the hierarchy are possible, but a regular syntax would not be able to express nesting easily, and context-sensitive syntax would probably cause confusion (but it's not unheard of).
The input to the syntactic analysis step is the token stream, and the output is some form of executable structure, typically a parse tree that is either executed immediately (as in interpretted languages) or stored for later optimization and/or execution (as in compiled languages) or something else (as in intermediate-compiled languages like Java). The alphabet of the CFG is therefore the possible tokens specified by the lexical analysis step.
So this whole thing is a long-winded way of saying that it's not so much the automata theory that's important, but rather the formal languages. We typically want to have the simplest language class that meets our needs. That typically means regular tokens and context-free syntax, but not always.
The implementation of the regular expression need not be an automaton, and the implementation of the CFG cannot be a PDA, because PDAs are nondeterministic, so we define deterministic parsers on reasonable subsets of the CFG class instead.
More generally we talk about Theory of computation.
What has happened through the history of programming languages is that it has been formally proven that higher-level constructs are equivalent to the constructs in the abstract machines of the theory.
We prefer the higher-level constructs in modern languages because they make programs easier to write, and easier to understand by other people. That in turn leads to easier peer-review and team-play, and thus better programs with less bugs.
The Wikipedia article about Structured programming tells part of the history.
As to Automata theory, it is still present in the implementation of regular expression engines, and in most programming situations in which a good solution consists in transitioning through a set of possible states.
I'm trying to write a lexer rule that would match following strings
a
aa
aaa
bbbb
the requirement here is all characters must be the same
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in antlr4. is it possible to come up with a pattern for this?
You can’t do that in an ANTLR lexer. At least, not without target specific code inside your grammar. And placing code in your grammar is something you should not do (it makes it hard to read, and the grammar is tied to that language). It is better to do those kind of checks/validations inside a listener or visitor.
Things like back-references and look-arounds are features that krept in regex-engines of programming languages. The regular expression syntax available in ANTLR (and all parser generators I know of) do not support those features, but are true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
Say I'd like to find instances of the expression while using the Java7 grammar:
FoobarClass.getInstanceOfType("Bazz");
Using a ParseTreeWalker and listening to exitExpression() calls sounded like a good first place to start. What surprised me was the level of manual traversal of the Java7Parser.ExpressionContext required to find expressions of this type.
What's the appropriate method to find matches to the above expression? At this point using a Regex in place of ANTLR4 yields simpler code, but this won't scale.
ANTLR 4 does not currently include feature allowing you to write concrete or abstract syntax queries. We hope to add something in the future to help with this type of application.
I've needed to write a few pattern recognition features for ANTLR 4 parse trees. I implemented the predicate itself with relative success by extending BaseMyParserVisitor<Boolean> (the parser in this example is called MyParser).
I'm having trouble articulating the difference between Chomsky type 2 (context free languages) and Chomsky type 3 (Regular languages).
Can someone out there give me an answer in plain English? I'm having trouble understanding the whole hierarchy thing.
A Type II grammar is a Type III grammar with a stack
A Type II grammar is basically a Type III grammar with nesting.
Type III grammar (Regular):
Use Case - CSV (Comma Separated Values)
Characteristics:
can be read with a using a FSM (Finite State Machine)
requires no intermediate storage
can be read with Regular Expressions
usually expressed using a 1D or 2D data structure
is flat, meaning no nesting or recursive properties
Ex:
this,is,,"an "" example",\r\n
"of, a",type,"III\n",grammar\r\n
As long as you can figure out all of the rules and edge cases for the above text you can parse CSV.
Type II grammar (Context Free):
Use Case - HTML (Hyper Text Markup Language) or SGML in general
Characteristics:
can be read using a DPDA (Deterministic Pushdown Automata)
will require a stack for intermediate storage
may be expressed as an AST (Abstract Syntax Tree)
may contain nesting and/or recursive properties
HTML could be expressed as a regular grammar:
<h1>Useless Example</h1>
<p>Some stuff written here</p>
<p>Isn't this fun</p>
But it's try parsing this using a FSM:
<body>
<div id=titlebar>
<h1>XHTML 1.0</h1>
<h2>W3C's failed attempt to enforce HTML as a context-free language</h2>
</div>
<p>Back when the web was still pretty boring, the W3C attempted to standardize away the quirkiness of HTML by introducing a strict specification</p
<p>Unfortunately, everybody ignored it.</p>
</body>
See the difference? Imagine you were writing a parser, you could start on an open tag and finish on a closing tag but what happens when you encounter a second opening tag before reaching the closing tag?
It's simple, you push the first opening tag onto a stack and start parsing the second tag. Repeat this process for as many levels of nesting that exist and if the syntax is well-structured, the stack can be un-rolled one layer at a time in the opposite level that it was built
Due to the strict nature of 'pure' context-free languages, they're relatively rare unless they're generated by a program. JSON, is a prime example.
The benefit of context-free languages is that, while very expressive, they're still relatively simple to parse.
But wait, didn't I just say HTML is context-free. Yep, if it is well-formed (ie XHTML).
While XHTML may be considered context-free, the looser-defined HTML would actually considered Type I (Ie Context Sensitive). The reason being, when the parser reaches poorly structured code it actually makes decisions about how to interpret the code based on the surrounding context. For example if an element is missing its closing tags, it would need to determine where that element exists in the hierarchy before it can decide where the closing tag should be placed.
Other features that could make a context-free language context-sensitive include, templates, imports, preprocessors, macros, etc.
In short, context-sensitive languages look a lot like context-free languages but the elements of a context-sensitive languages may be interpreted in different ways depending on the program state.
Disclaimer: I am not formally trained in CompSci so this answer may contain errors or assumptions. If you asked me the difference between a terminal and a non-terminal you'll earn yourself a blank stare. I learned this much by actually building a Type III (Regular) parser and by reading extensively about the rest.
The wikipedia page has a good picture and bullet points.
Roughly, the underlying machine that can describe a regular language does not need memory. It runs as a statemachine (DFA/NFA) on the input. Regular languages can also be expressed with regular expressions.
A language with the "next" level of complexity added to it is a context free language. The underlying machine describing this kind of language will need some memory to be able to represent the languages that are context free and not regular. Note that adding memory to your machine makes it a little more powerful, so it can still express languages (e.g. regular languages) that didn't need the memory to begin with. The underlying machine is typically a push-down automaton.
Type 3 grammars consist of a series of states. They cannot express embedding. For example, a Type 3 grammar cannot require matching parentheses because it has no way to show that the parentheses should be "wrapped around" their contents. This is because, as Derek points out, a Type 3 grammar does not "remember" anything about the previous states that it passed through to get to the current state.
Type 2 grammars consist of a set of "productions" (you can think of them as patterns) that can have other productions embedded within them. Thus, they are recursively defined. A production can only be defined in terms of what it contains, and cannot "see" outside of itself; this is what makes the grammar context-free.
Which do the concepts control flow, data type, statement, expression and operation belong to? Syntax or semantics?
What is the relation between control flow, data type, statement, expression, operation, function, ...? How a program is built from these primitives level by level?
I would like to understand these primitive concepts and their relations in order to figure out what aspects of a new language should one learn.
Thanks and regards!
All of those language elements have both syntax (how it is written) and semantics (how the way it is written corresponds to what it actually means). Control flow determines which statements are executed and when, expressions yield a value and can be made up of functions and other language elements (although the details depend on the programming language). An operation is usually a sequence of statements. The meaning of "function" varies from language to language; in some languages, any operation that can be invoked by name is a function. In other languages, a function is an operation that yields a result (as opposed to a procedure that does not report a result). Some languages also require that functions be non-mutating while procedures can be mutating, although this varies from language to language. Data types encapsulate both data and the operations/procedures/functions that can be operated on that data.
They belong to both worlds:
Syntax will describe which are the operators, which are primitive types (int, float), which are the keywords (return, for, while). So syntax decides which "words" you can use in the programming language. With word I mean every single possible token: = is a token, void is a token, varName12345 is a token that is considered as an identifier, 12.4 is a token considered as a float and so on..
Semantics will describe how these tokens can be combined together inside you language.
For example you will have that while semantics is something like:
WHILE ::= 'while' '(' CONDITION ')' '{' STATEMENTS '}'
CONDITION ::= CONDITION '&&' CONDITION | CONDITION '||' CONDITION | ...
STATEMENTS ::= STATEMENT ';' STATEMENTS | empty_rule
and so on. This is the grammar of the language that describes exactly how the language is structured. So it will be able to decide if a program is correct according to the language semantics.
Then there is a third aspect of the semantics, that is "what does that construct mean?". You can see it as a correspondence between, for example, a for loop and how it is translated into the lower level language needed to be executed.
This third aspect will decide if your program is correct with respect to the allowed operations. Usually you can make a compiler reject many of programs that have no meaning (because they violates the semantic) but to be able to find many different mistakes you will have to introduce a new tool: the type checker that will also check that whenever you do operations they are correct according to the types.
For example you grammar can allow doing varName = 12.4 but the typechecker will use the declaration of varName to understand if you can assign a float to it. (of course we're talking about static type checking)
Those concepts belong to both.
Statements, expressions, control flow operations, data types, etc. have their structure defined using the syntax. However, their meaning comes from the semantics.
When you have defined syntax and semantics for a programming language and its constructs, this basically provides you with a set of building blocks. The syntax is used to understand the structure in the code - usually represented using an abstract syntax tree, or AST. You can then traverse the tree and apply the semantics to each element to execute the program, or generate some instructions for some instruction set so you can execute the code later.