I'm reading the book: Formal Syntax and Semantics of
Programming Languages. I don't understand this exercise:
Consider the following two grammars, each of which generates strings of
correctly balanced parentheses and brackets. Determine if either or both
is ambiguous. The Greek letter ε repreents an empty string.
<string> ::= <string> <string> | ( <string> ) |[ <string> ] | ε
<string> ::= ( <string> ) <string> | [ <string> ] <string> | ε
The first is ambiguous and the second is not. This is a question about how a context-free grammar (CFG) can be turned into a parse tree. In the first CFG, the first production is the source of the ambiguity. If I write the string "()()()" it is unclear which part of this string could match the left non-terminal and which could match the right non-terminal.
One valid parse tree for that string is that the first two characters "()" match the first non-terminal, which then matches the second production and the rest of the string "()()" matches the right non-terminal, which again matches the first production again.
Another valid parse tree is for the first four characters "()()" to match the left non-terminal and for the rest "()" to match the right non-terminal. Both are equally valid so there is an ambiguity. Parser tools like LR parsers call this a shift/reduce conflict.
This has absolutely no problem if you just want to see if a string belongs to a language. If any parse works, you're good. This has really problematic effects, however, if you're trying to create a parse tree to use as, for example, an abstract syntax tree for a programming language.
To show why this is a problem for parsing a language take a look at this example.
<expression> ::= <expression> <expression> | <expression> + <expression> | <expression> * <expression>
How do you parse "1+2*3"? Is it "(1+2)*3" or "1+(2*3)"? The grammar I gave has a shift/reduce conflict so it is not specified. Most LR parse tools will resolve this conflict for you automatically an arbitrarily. This is dangerous because if I'm writing a programming language there should be a well-defined understanding of which the programmer will get. Since this is a typical arithmetic expression we should probably follow the math convention and have the answer be "1+(2*3)".
The solution is to rewrite the grammar so that it's unambiguous or many parser tools also just allow us to explicitly specify the associativity and precedence of our lexical symbols, which is very convenient for keeping your grammar nice and readable.
Related
I am using the antlr4 c grammar as inspiration for my own grammar. I came over one thing I dont really get. Why is there Lexer rules for datatypes when they are not used? For example the rule Double : 'double'; is never used but the parser rule typeSpecifier:('double' | ... );(other datatypes has been removed to simplify) is used several places. Is there a reason why the parser rule typeSpecifier is not using the lexer rule Double?
All the grammars on that page are volunteer submissions and not part of ANTLR4. It's clearly a mistake, but the way lexer rules are matched, it won't make a difference in lexing. You can choose to implement either the explicit rule:
Double : 'double';
or the implicit one:
typeSpecifier
: ('void'
| 'char'
| 'short'
| 'int'
| 'long'
| 'float'
| 'double'
with no ill effects either way, even if you mix methods. In fact, if you take a more global look at that whole grammar, the author did the same thing with numerous other lexer rules, like Register for example. Makes no difference in actual practice.
Bottom line? Choose whichever method you like and apply it consistently. My personal preference is toward brevity, so I like the the implicit tokens so long as they are used in only one place in the grammar. As soon as a token might be used in two places, I prefer to make an explicit token out of it and update the two or more locations where it's used.
The lexical grammar of most programming languages is fairly non-expressive in order to quickly lex it. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#k"…"#k (using the superscript k as a count operator), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#k although it can contain "#j for any j<k
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the languages "tokens", or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite state transducer because it must examine any sequence of as (of any length) before deciding whether to fallback to the first a or to accept the token consisting of all the as and the following b (if present).
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.
I have written tokenizer and expression evaluator for a preprocessor language that I plan to use in my later projects. I started thinking that maybe I should describe the language with EBNF (Extended Backus–Naur Form) to keep the syntax more maintainable or even use it to generate later versions of a parser.
My first impression was that EBNF is used for tokenizing process and syntax validation. Later I discovered that it can also be used to describe operator precedence like in this post or in the Wikipedia article:
expression ::= equality-expression
equality-expression ::= additive-expression ( ( '==' | '!=' ) additive-expression ) *
additive-expression ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= primary ( ( '*' | '/' ) primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE | '-' primary
I can see how that allows generator to produce code with operator precedence built in but is this really how precedence should be expressed? Isn't operator precedence more about semantics and EBNF about syntax? If I decide to write description of my language in EBNF, should I write it with operator precedence taken into account or document that in a separate section?
Did a similar stuff for my collegue degree.
I suggest DO NOT use the operator precedence feature, even if looks easier like "syntact sugar".
Why ? Because most languages to be described by EBNF, use a lot of operators with different features, that are better to describe & update, with EBNF expressions, instead of operator precedence.
Some operators are unary prefix, some unary posfix, some are binary (a.k.a. "infix"), some binary are evaluated from left to right, & some are evaluated from right to left. Some symbols are operators in some context, and used as other tokens, in other context, example "+", "-", that can be binary operators ("x - y"), unary prefix operators ("x - -y"), or part of a literal ("x + -5").
In my experience its more "safe" to describe them with EBNF expressions. Unless the programming language you describe, is very small, with very few and similar syntax operators (example: all binary, or all prefix unary).
Just my 2 cents.
Most UNIX regular expressions have, besides the usual **,+,?* operators a backslash operator where \1,\2,... match whatever's in the last parentheses, so for example *L=(a*)b\1* matches the (non regular) language *a^n b a^n*.
On one hand, this seems to be pretty powerful since you can create (a*)b\1b\1 to match the language *a^n b a^n b a^n* which can't even be recognized by a stack automaton. On the other hand, I'm pretty sure *a^n b^n* cannot be expressed this way.
I have two questions:
Is there any literature on this family of languages (UNIX-y regular). In particular, is there a version of the pumping lemma for these?
Can someone prove, or disprove, that *a^n b^n* cannot be expressed this way?
You're probably looking for
Benjamin Carle and Paliath Narendran "On Extended Regular Expressions" LNCS 5457
DOI:10.1007/978-3-642-00982-2_24
PDF Extended Abstract at http://hal.archives-ouvertes.fr/docs/00/17/60/43/PDF/notes_on_extended_regexp.pdf
C. Campeanu, K. Salomaa, S. Yu: A formal study of practical regular expressions, International Journal of Foundations of Computer Science, Vol. 14 (2003) 1007 - 1018.
DOI:10.1142/S012905410300214X
and of course follow their citations forward and backward to find more literature on this subject.
a^n b^n is CFL. The grammar is
A -> aAb | e
you can use pumping lemma for RL to prove A is not RL
Ruby 1.9.1 supports the following regex:
regex = %r{ (?<foo> a\g<foo>a | b\g<foo>b | c) }x
p regex.match("aaacbbb")
# the result is #<MatchData "c" foo:"c">
"Fun with Ruby 1.9 Regular Expressions" has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:
sentence = %r{
(?<subject> cat | dog | gerbil ){0}
(?<verb> eats | drinks| generates ){0}
(?<object> water | bones | PDFs ){0}
(?<adjective> big | small | smelly ){0}
(?<opt_adj> (\g<adjective>\s)? ){0}
The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x
I think this means that at least Ruby 1.9.1's regex engine, which is the Oniguruma regex engine, is actually equivalent to a context-free grammar, though the capturing groups aren't as useful as an actual parser-generator.
This means that "Pumping lemma for context-free languages" should describe the class of languages recognizable by Ruby 1.9.1's regex engine.
EDIT: Whoops! I messed up, and didn't do an important test which actually makes my answer above totally wrong. I won't delete the answer, because it's useful information nonetheless.
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
#I added anchors for the beginning and end of the string
regex.match("aaacbbb")
#returns nil, indicating that no match is possible with recursive capturing groups.
EDIT: Coming back to this many months later, I just discovered that my test in the last edit was incorrect. "aaacbbb" shouldn't be expected to match regex even if regex does operate like a context-free grammar.
The correct test should be on a string like "aabcbaa", and that does match the regex:
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
Which do the concepts control flow, data type, statement, expression and operation belong to? Syntax or semantics?
What is the relation between control flow, data type, statement, expression, operation, function, ...? How a program is built from these primitives level by level?
I would like to understand these primitive concepts and their relations in order to figure out what aspects of a new language should one learn.
Thanks and regards!
All of those language elements have both syntax (how it is written) and semantics (how the way it is written corresponds to what it actually means). Control flow determines which statements are executed and when, expressions yield a value and can be made up of functions and other language elements (although the details depend on the programming language). An operation is usually a sequence of statements. The meaning of "function" varies from language to language; in some languages, any operation that can be invoked by name is a function. In other languages, a function is an operation that yields a result (as opposed to a procedure that does not report a result). Some languages also require that functions be non-mutating while procedures can be mutating, although this varies from language to language. Data types encapsulate both data and the operations/procedures/functions that can be operated on that data.
They belong to both worlds:
Syntax will describe which are the operators, which are primitive types (int, float), which are the keywords (return, for, while). So syntax decides which "words" you can use in the programming language. With word I mean every single possible token: = is a token, void is a token, varName12345 is a token that is considered as an identifier, 12.4 is a token considered as a float and so on..
Semantics will describe how these tokens can be combined together inside you language.
For example you will have that while semantics is something like:
WHILE ::= 'while' '(' CONDITION ')' '{' STATEMENTS '}'
CONDITION ::= CONDITION '&&' CONDITION | CONDITION '||' CONDITION | ...
STATEMENTS ::= STATEMENT ';' STATEMENTS | empty_rule
and so on. This is the grammar of the language that describes exactly how the language is structured. So it will be able to decide if a program is correct according to the language semantics.
Then there is a third aspect of the semantics, that is "what does that construct mean?". You can see it as a correspondence between, for example, a for loop and how it is translated into the lower level language needed to be executed.
This third aspect will decide if your program is correct with respect to the allowed operations. Usually you can make a compiler reject many of programs that have no meaning (because they violates the semantic) but to be able to find many different mistakes you will have to introduce a new tool: the type checker that will also check that whenever you do operations they are correct according to the types.
For example you grammar can allow doing varName = 12.4 but the typechecker will use the declaration of varName to understand if you can assign a float to it. (of course we're talking about static type checking)
Those concepts belong to both.
Statements, expressions, control flow operations, data types, etc. have their structure defined using the syntax. However, their meaning comes from the semantics.
When you have defined syntax and semantics for a programming language and its constructs, this basically provides you with a set of building blocks. The syntax is used to understand the structure in the code - usually represented using an abstract syntax tree, or AST. You can then traverse the tree and apply the semantics to each element to execute the program, or generate some instructions for some instruction set so you can execute the code later.