What is the correct way of writing a grammar in ANTLR for a language that is written right-to-left, such as Arabic or Hebrew?
Do I write the tokens and rules in the grammar left-to-right and then create an InputStream that feeds the characters to the Lexer right-to-left?
RTL reading order is only a matter of presentation; in memory (and that is what counts for the ANTLR4 lexer) the characters are stored in increasing memory-address order, just like for any other language. ANTLR4 is now fully Unicode aware, so you should be able to write your rules in any language supported by Unicode (both the grammar rule names and the lexer content).
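A quick illustration of that point (plain Python here, not ANTLR itself), showing that a Hebrew string is stored and iterated in logical order, first character first, regardless of how it is rendered on screen:

# "shalom" -- rendered right-to-left on screen, stored in logical order in memory
s = "שלום"
print(list(s))  # ['ש', 'ל', 'ו', 'ם'] -- logical (memory) order, first letter first
print(s[0])     # 'ש': the first code point in memory (the glyph that appears rightmost when displayed)

A lexer walks the characters in exactly this order, so the grammar is written the same way as for a left-to-right language.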
Related
I'm trying to write a lexer rule that would match the following strings:
a
aa
aaa
bbbb
The requirement here is that all characters in the string must be the same.
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can’t do that in an ANTLR lexer. At least, not without target-specific code inside your grammar, and placing code in your grammar is something you should avoid: it makes the grammar harder to read and ties it to a single target language. It is better to do those kinds of checks/validations inside a listener or visitor.
Things like back-references and look-arounds are features that crept into the regex engines of programming languages. The regular-expression syntax available in ANTLR (and in every parser generator I know of) does not support those features; it describes true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
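A minimal sketch of the recommended approach, assuming a general lexer rule such as REPEAT_CHARS : [a-z]+ ; and a Python target: let the lexer match any run of lowercase letters and validate the token text afterwards. The helper below is hypothetical; in practice it would be called from a listener or visitor method with the matched token's text.

# Hypothetical helper; in an ANTLR listener/visitor this would be called
# with the matched token's text, e.g. from the exit method of the rule
# that uses REPEAT_CHARS.
def all_chars_equal(text):
    """True for 'a', 'aa', 'bbbb'; False for 'ab', 'aab' or the empty string."""
    return len(text) > 0 and text == text[0] * len(text)

for tok in ["a", "aa", "aaa", "bbbb", "aab"]:
    print(tok, all_chars_equal(tok))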
The lexical grammar of most programming languages is fairly non-expressive in order to keep lexing fast. I'm not sure which category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#ᵏ"…"#ᵏ (using the superscript k as a count operator), then you might expect it to be context-free:
raw_string_literal
    : 'r' delimited_quoted_string
    ;
delimited_quoted_string
    : quoted_string
    | '#' delimited_quoted_string '#'
    ;
But that is not actually the correct syntax, because the quoted_string is not allowed to contain the terminating sequence "#ᵏ, although it can contain "#ʲ for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar, because it involves three (or more) uses of the k-repetition in a single production, and a pushdown automaton can only handle two. (The proof that the language is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free languages are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
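To make the "not particularly complicated" claim concrete, here is a hand-rolled sketch (not the actual rustc lexer, and with the string body left unvalidated) that recognizes a raw string literal simply by counting the opening '#' characters and searching for the matching closer:

def scan_raw_string(src, pos):
    """Scan a raw string literal starting at src[pos] (which must be 'r').
    Returns (literal_text, end_position); raises ValueError on malformed input."""
    assert src[pos] == 'r'
    i = pos + 1
    hashes = 0
    while i < len(src) and src[i] == '#':   # count the opening '#'s
        hashes += 1
        i += 1
    if i >= len(src) or src[i] != '"':
        raise ValueError("not a raw string literal")
    closer = '"' + '#' * hashes             # the only sequence that terminates the literal
    end = src.find(closer, i + 1)
    if end == -1:
        raise ValueError("unterminated raw string literal")
    return src[pos:end + len(closer)], end + len(closer)

literal, end = scan_raw_string('r##"Hi lovely "\\" and "#", welcome to Rust"##;', 0)
print(literal)  # r##"Hi lovely "\" and "#", welcome to Rust"##

The counter makes the scanner context-sensitive in the formal sense, but it is only a few lines of code.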
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
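A sketch of that one character of lookahead (hypothetical code, not Rust's actual lexer): when a digit run is followed by '.', the '.' is consumed as part of a float only if the character after it is not another '.':

def lex_number(src, pos):
    """Return (token_text, end) for a number starting at src[pos].
    Stops before '..' so that '2..3' yields the token '2', not '2.'."""
    i = pos
    while i < len(src) and src[i].isdigit():
        i += 1
    # one character of lookahead past the '.' decides whether it belongs to this token
    if i < len(src) and src[i] == '.' and not (i + 1 < len(src) and src[i + 1] == '.'):
        i += 1
        while i < len(src) and src[i].isdigit():
            i += 1
    return src[pos:i], i

print(lex_number("2..3", 0))  # ('2', 1)  -> tokens: 2 .. 3
print(lex_number("2.5", 0))   # ('2.5', 3)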
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's "tokens", or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
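A sketch of a maximal-munch tokeniser for that two-token language (Python, purely to illustrate the point): the amount of input it must buffer before emitting anything is unbounded, which is exactly what a finite-state transducer cannot do.

def tokenize(src):
    """Tokens: 'a' and a run of a's terminated by 'b' (the a*b token), maximal munch."""
    tokens, i = [], 0
    while i < len(src):
        if src[i] not in ('a', 'b'):
            raise ValueError("unexpected character: " + src[i])
        j = i
        while j < len(src) and src[j] == 'a':
            j += 1                          # buffer an arbitrarily long run of a's
        if j < len(src) and src[j] == 'b':
            tokens.append(src[i:j + 1])     # the whole run plus the 'b' is one a*b token
            i = j + 1
        else:
            tokens.extend(['a'] * (j - i))  # no 'b': fall back to one token per 'a'
            i = j
    return tokens

print(tokenize("aaab"))           # ['aaab']
print(tokenize("aaa"))            # ['a', 'a', 'a']
print(tokenize("aab" + "a" * 5))  # ['aab', 'a', 'a', 'a', 'a', 'a']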
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.
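For reference, that figure is just the number of strings of length 0 through 16 over an 88-character alphabet, which is easy to check:

# Count of possible C++ raw-string delimiters: all strings of length 0..16
# over an 88-character alphabet.
print(sum(88 ** k for k in range(17)))  # 13082362351752551144309757252761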
Say I have a grammar that has a parser rule to match one of several specific strings.
Is the proper thing to do in the grammar to make an alternate parser rule for each specific string, or to keep the parser rule general and decode the string in a visitor subclass?
If the specific strings are meaningful (e.g. a keyword in a DSL) it sounds like you want Tokens. Whatever rules you have in the grammar can reference the Tokens you created.
Generally, it's better to have your grammar do as much of the parser work as possible, rather than overly generalizing and having to write a bunch of extra code.
See the following: http://www.antlr.org/wiki/display/ANTLR4/Grammar+Structure
I wanted to write some educational code in Haskell with Unicode characters (non-Latin) in the identifiers. (So that the identifiers look nice and natural to speakers of a natural language other than English that does not use Latin characters in its writing.) So, I set out to find an appropriate Haskell implementation that would allow this.
But where is this feature specified in the language specification? How would I refer to this feature when looking for a conforming implementation? (And which Haskell implementations are known to actually support Unicode identifiers?)
It turned out that one Haskell implementation did accept my code with Unicode identifiers, whereas another one failed to accept it. I would like a way to formalize this requirement of my code, perhaps in the form of a language-feature switch, so that if I or someone else tries to run the code, it is immediately clear whether the implementation at hand is missing the required feature and another one should be sought. (There could also be a wiki page for this feature ("Unicode identifiers") listing which of the existing implementations support it, so that one would know where to go if one needs it.)
(BTW, I have put a "syntax" tag on this question, but I actually perceive this to be an issue at the level of lexing, a lower level than the syntax of a language. Is there a tag here for features of the lexing level of a language, rather than for features of the syntax specification of a language?)
The Online Report documents this under Lexemes. It also notes early on that "Haskell uses the Unicode character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell."
Actual compilers may or may not support Unicode identifiers. GHC does, but you need to keep in mind that Unicode codepoints must obey the same rules as ASCII characters: types must start with a codepoint which is classed as uppercase or titlecase, variables as lowercase (although de facto this is relaxed to alphabetic and not uppercase/titlecase; this might be worth asking for a clarification from the language committee), operators must be punctuation or symbol. (This means that you can't declare types in Arabic, for example, unless you prefix them with a character in some other script that is uppercase/titlecase.)
As to collecting Unicode support information: while I don't know of a single page that provides it, searching for "unicode" on the Haskell Wiki finds information about Unicode support in a number of Haskell compilers.
In the Dragon Book's exercise 3.3.1 the student should
Consult the language reference manuals to determine (i) the set of characters that form the input alphabet (excluding those that may only appear in character strings or comments [...] for each of the following languages: [...].
It makes no real sense to me to list literally all the characters, like a, b, /, for a language, even if it is an exercise for a compilers course. Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ...}?
And if you take "characters" in the basic sense of the word, is ??/ in C one character or three (or both)?
The alphabet of a language is the set of characters, not the words.
Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ...}?
No, the alphabet is the set of characters that are used to form words. When a language is specified, the alphabet must be given, otherwise you cannot distinguish a valid token from an invalid one.
Update
You are confusing the term "word" with "token". A word is not some part of a language or program; a word is a finite string of characters from the alphabet. It has nothing to do with a language construct like "int" or "while". For example, each C program is a word, because it is a finite string of characters from the alphabet. The set of all of these programs (words) forms the C programming language. Tokens like "void" or "int" are an entirely different thing.
To recap, you start by defining some set of characters you want to use. This is called the alphabet. Finite strings of these characters form words. A language is some subset of all possible words. To define a language, you specify which words belong to it, for example with a regular expression or a context-free grammar.
Wikipedia has a good page on formal languages.
http://en.wikipedia.org/wiki/Formal_language
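A toy illustration of those terms (hypothetical, just to make them concrete): take the alphabet {a, b}; words are arbitrary finite strings over it, and a regular expression singles out which of those words belong to the language.

import re

alphabet = {"a", "b"}
language = re.compile(r"a*b*")  # the language: all a's (if any) come before all b's

for word in ["ab", "aabb", "ba", "abc"]:
    over_alphabet = set(word) <= alphabet
    print(word, over_alphabet, bool(language.fullmatch(word)))
# "ba" is a word over the alphabet but not in the language;
# "abc" is not even a word over the alphabet (c is not in it).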
The confusion comes from theory defining alphabet as the set of symbols from which the strings in a language are formed. Note that the grammars for programming languages use tokens and not characters as terminal symbols.
Traditionally, from the perspective of language theory, programming languages involve two language definitions: 1) The one that has characters as the alphabet and tokens as the valid strings. 2) The one that has tokens as the alphabet and programs as the valid strings. That's why programming languages are usually specified in two parts, a lexical, and a syntactical analyzer.
It is not strictly necessary to have the two definitions to parse a programming language. A single grammar can specify a programming language using characters as the input alphabet. It's just that the characters-to-tokens part has been easier to specify with regular expressions, and the tokens-to-program part with grammars.
Modern compiler-compilers like ANTLR use grammar-specification languages that incorporate the expressive convenience of regular expressions, so a character-to-program definition can be done with a single grammar. Still, separating the lexical from the syntactical remains the most convenient way to parse a programming language, even with such tools.
Last-minute example: imagine that the grammar productions for an if-then-else-end construct had to deal, at the character level, with:
Whitespace.
Keywords within programming language strings: "Then, the end."
Variable names that contain keywords: 'tiff',
...
It can be done, but it would be extremely complicated.