I have written a tokenizer and an expression evaluator for a preprocessor language that I plan to use in later projects. I started thinking that maybe I should describe the language in EBNF (Extended Backus–Naur Form) to keep the syntax maintainable, or even use the description to generate later versions of the parser.
My first impression was that EBNF is used for tokenizing and syntax validation. Later I discovered that it can also be used to describe operator precedence, as in this post or in the Wikipedia article:
expression ::= equality-expression
equality-expression ::= additive-expression ( ( '==' | '!=' ) additive-expression ) *
additive-expression ::= multiplicative-expression ( ( '+' | '-' ) multiplicative-expression ) *
multiplicative-expression ::= primary ( ( '*' | '/' ) primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE | '-' primary
I can see how that allows a generator to produce code with operator precedence built in, but is this really how precedence should be expressed? Isn't operator precedence more about semantics, and EBNF about syntax? If I decide to describe my language in EBNF, should I write the grammar with operator precedence taken into account, or document precedence in a separate section?
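To make the question concrete, here is a minimal sketch of a hand-written recursive-descent parser that follows the grammar above rule by rule, with one method per nonterminal. The token regex, the tuple-shaped tree, and the names tokenize, Parser and parse are assumptions made for illustration and are not part of the original tokenizer or evaluator.

```python
import re

# One token per match: NUMBER, VARIABLE, or an operator/parenthesis.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(==|!=|[-+*/()]))")

def tokenize(text):
    """Split the input into (kind, value) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        number, variable, op = m.groups()
        if number is not None:
            tokens.append(("NUMBER", float(number)))
        elif variable is not None:
            tokens.append(("VARIABLE", variable))
        else:
            tokens.append(("OP", op))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek_op(self, *ops):
        """Return the next token's value if it is one of the given operators."""
        if self.pos < len(self.tokens):
            kind, value = self.tokens[self.pos]
            if kind == "OP" and value in ops:
                return value
        return None

    def binary_level(self, operand, ops):
        # Shared shape of each level: rule ::= operand ( op operand ) *
        node = operand()
        while (op := self.peek_op(*ops)) is not None:
            self.pos += 1
            node = (op, node, operand())  # folds to the left
        return node

    def expression(self):
        return self.equality()

    def equality(self):
        return self.binary_level(self.additive, ("==", "!="))

    def additive(self):
        return self.binary_level(self.multiplicative, ("+", "-"))

    def multiplicative(self):
        return self.binary_level(self.primary, ("*", "/"))

    def primary(self):
        if self.peek_op("("):
            self.pos += 1
            node = self.expression()
            if not self.peek_op(")"):
                raise SyntaxError("expected ')'")
            self.pos += 1
            return node
        if self.peek_op("-"):
            self.pos += 1
            return ("neg", self.primary())
        if self.pos >= len(self.tokens):
            raise SyntaxError("unexpected end of input")
        kind, value = self.tokens[self.pos]
        if kind == "OP":
            raise SyntaxError(f"unexpected {value!r}")
        self.pos += 1
        return value

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.pos != len(parser.tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse("1 + 2 * 3"))  # ('+', 1.0, ('*', 2.0, 3.0))
print(parse("1 - 2 - 3"))  # ('-', ('-', 1.0, 2.0), 3.0)
```

In this sketch the precedence lives entirely in which method calls which, so the nesting of the grammar rules is the only place it is written down.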
I did something similar for my college degree.
I suggest you DO NOT use the operator precedence feature, even if it looks easier, like "syntactic sugar".
Why? Because most languages described with EBNF use a lot of operators with different characteristics, and those are easier to describe and update with EBNF expressions than with operator precedence.
Some operators are unary prefix, some unary postfix, and some binary (a.k.a. "infix"); some binary operators are evaluated from left to right, and some from right to left. Some symbols are operators in one context and other tokens in another. For example, "+" and "-" can be binary operators ("x - y"), unary prefix operators ("x - -y"), or part of a literal ("x + -5").
In my experience it's "safer" to describe them with EBNF expressions, unless the programming language you are describing is very small, with very few operators of similar syntax (for example, all binary or all prefix unary).
Just my 2 cents.
Related
Grouping operator ( ) in JavaScript
The grouping operator ( ) controls the precedence of evaluation in expressions.
Does the functionality ( ) in JavaScript itself differ from Haskell or any other programming languages?
In other words,
Is the functionality ( ) in programming languages itself affected by evaluation strategies ?
Perhaps we can share the code below:
a() * (b() + c())
to discuss the topic here, but not limited to the example.
Please feel free to use your own examples to illustrate. Thanks.
Grouping parentheses mean the same thing in Haskell as they do in high school mathematics. They group a sub-expression into a single term. This is also what they mean in JavaScript and most other programming languages [1], so you don't have to relearn this for Haskell coming from other common languages, if you have learnt it the right way.
Unfortunately, this grouping is often explained as meaning "the expression inside the parentheses must be evaluated before the outside". This comes from the order of steps you would follow to evaluate the expression in a strict language (like high school mathematics). However, the grouping really isn't about the order in which you evaluate things, even in that setting. Instead, it is used to determine what the expression actually is in the first place, which you need to know before you can do anything at all with the expression, let alone evaluate it. Grouping is generally resolved as part of parsing the language, totally separate from the order in which any runtime evaluation takes place.
Let's consider the OP's example, but I'm going to declare that function call syntax is f{} rather than f() just to avoid using the same symbol for two purposes. So in my newly-made-up syntax, the OP's example is:
a{} * (b{} + c{})
This means:
a is called on zero arguments
b is called on zero arguments
c is called on zero arguments
+ is called on two arguments; the left argument is the result of b{}, and the right argument is the result of c{}
* is called on two arguments: the left argument is the result of a{}, and the right argument is the result of b{} + c{}
Note I have not numbered these. This is just an unordered list of sub-expressions that are present, not an order in which we must evaluate them.
If our example had not used grouping parentheses, it would be a{} * b{} + c{}, and our list of sub-expressions would instead be:
a is called on zero arguments
b is called on zero arguments
c is called on zero arguments
+ is called on two arguments; the left argument is the result of a{} * b{}, and the right argument is the result of c{}
* is called on two arguments: the left argument is the result of a{}, and the right argument is the result of b{}
This is simply a different set of sub-expressions from the first (because the overall expression doesn't mean the same thing). That is all that grouping parentheses do; they allow you to specify which sub-expressions are being passed as arguments to other sub-expressions [2].
Now, in a strict language "what is being passed to what" does matter quite a bit to evaluation order. It is impossible in a strict language to call anything on "the result of a{} + b{}" without first having evaluated a{} + b{} (and we can't call + without evaluating a{} and b{}). But even though the grouping determines what is being passed to what, and that partially determines evaluation order [3], grouping isn't really "about" evaluation order. Evaluation order can change as a result of changing the grouping in our expression, but changing the grouping makes it a different expression, so almost anything can change as a result of changing grouping!
Non-strict languages like Haskell make it especially clear that grouping is not about order of evaluation, because in non-strict languages you can pass something like "the result of a{} + b{}" as an argument before you actually evaluate that result. So in my lists of subexpressions above, any order at all could potentially be possible. The grouping doesn't determine it at all.
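A rough way to see this without Haskell is to simulate non-strict arguments in Python with zero-argument functions ("thunks"). This is only a sketch: thunk, add and mul are hypothetical helpers, and mul deliberately forces its right operand first, purely to show that the grouping does not dictate the order.

```python
calls = []

def thunk(name, value):
    """A delayed zero-argument computation that logs when it is forced."""
    def force():
        calls.append(name)
        return value
    return force

def add(x, y):
    # Returns a delayed sum; neither operand is evaluated yet.
    return lambda: x() + y()

def mul(x, y):
    # Forces the right operand before the left one, an arbitrary choice.
    def force():
        right = y()
        return x() * right
    return force

a, b, c = thunk("a", 2), thunk("b", 3), thunk("c", 4)

# a{} * (b{} + c{}) is fully built here, but nothing has been evaluated.
expr = mul(a, add(b, c))
assert calls == []

value = expr()
print(value, calls)  # 14 ['b', 'c', 'a']
```

The expression passed to mul is the same grouped expression either way; only the evaluation strategy chosen inside mul decides when each call actually runs.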
A language needs other rules beyond just the grouping of sub-expressions to pin down evaluation order (if it wants to specify the order), whether it's strict or lazy. So since you need other rules to determine it anyway, it is best (in my opinion) to think of evaluation order as a totally separate concept from grouping. Mixing them up seems like a shortcut when you're learning high school mathematics, but it's just a handicap in more general settings.
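The same point can be checked mechanically. The sketch below uses Python's standard ast module to parse the OP's two expressions without running them, and prints each tree as nested tuples; the helper name shape is made up for this example.

```python
import ast

def shape(source):
    """Return the expression tree of `source` as nested tuples."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return (type(node.op).__name__, walk(node.left), walk(node.right))
        if isinstance(node, ast.Call):
            return node.func.id + "()"
        raise ValueError("unsupported node")
    # mode="eval" parses a single expression; nothing is executed.
    return walk(ast.parse(source, mode="eval").body)

grouped = shape("a() * (b() + c())")
ungrouped = shape("a() * b() + c()")
print(grouped)    # ('Mult', 'a()', ('Add', 'b()', 'c()'))
print(ungrouped)  # ('Add', ('Mult', 'a()', 'b()'), 'c()')
```

Neither a, b nor c is ever defined or called here, so the parentheses have done all of their work before any evaluation could begin.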
[1] In languages with roughly C-like syntax, parentheses are also used for calling functions, as in func(arg1, arg2, arg3). The OP themselves has assumed this syntax in their a() * (b() + c()) example, where this is presumably calling a, b, and c as functions (passing each of them zero arguments).
This usage is totally unrelated to grouping parentheses, and Haskell does not use parentheses for calling functions. But there can be some confusion because the necessity of using parentheses to call functions in C-like syntax sometimes avoids the need for grouping parentheses e.g. in func(2 + 3) * 6 it is unambiguous that 2 + 3 is being passed to func and the result is being multiplied by 6; in Haskell syntax you would need some grouping parentheses because func 2 + 3 * 6 without parentheses is interpreted as the same thing as (func 2) + (3 * 6), which is not func (2 + 3) * 6.
C-like syntax is not alone in using parentheses for two totally unrelated purposes; Haskell overloads parentheses too, just for different things in addition to grouping. Haskell also uses them as part of the syntax for writing tuples (e.g. (1, True, 'c')), and the unit type/value () which you may or may not want to regard as just an "empty tuple".
[2] Which is also what associativity and precedence rules for operators do. Without knowing that * is higher precedence than +, a * b + c is ambiguous; there would be no way to know what it means. With the precedence rules, we know that a * b + c means "add c to the result of multiplying a and b", but we now have no way to write down what we mean when we want "multiply a by the result of adding b and c" unless we also allow grouping parentheses.
[3] Even in a strict language the grouping only partially determines evaluation order. If you look at my "lists of sub-expressions" above it's clear that in a strict language we need to have evaluated a{}, b{}, and c{} early on, but nothing determines whether we evaluate a{} first and then b{} and then c{}, or c{} first, and then a{} and then b{}, or any other permutation. We could even evaluate only the two of them in the innermost +/* application (in either order), and then the operator application before evaluating the third named function call, etc.
Even in a strict language, the need to evaluate arguments before the call they are passed to does not fully determine evaluation order from the grouping. Grouping just provides some constraints.
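As one concrete strict language, Python fixes the remaining freedom with its own rule: operands are evaluated left to right. The sketch below logs the calls for both groupings; make is a hypothetical helper and the values 2, 3 and 4 are arbitrary.

```python
calls = []

def make(name, value):
    """Build a zero-argument function that logs its own call."""
    def f():
        calls.append(name)
        return value
    return f

a, b, c = make("a", 2), make("b", 3), make("c", 4)

grouped_value = a() * (b() + c())
grouped_calls = list(calls)
calls.clear()

ungrouped_value = a() * b() + c()
ungrouped_calls = list(calls)

print(grouped_value, grouped_calls)      # 14 ['a', 'b', 'c']
print(ungrouped_value, ungrouped_calls)  # 10 ['a', 'b', 'c']
```

The parentheses change the result but not the order of the three calls; a() still runs before anything inside the parentheses.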
[4] In a lazy language, evaluation of a given call generally happens a bit at a time, as it is needed, so all of the sub-evaluations in a given expression could be interleaved in a complicated fashion (not happening precisely one after the other) anyway.
To clarify the dependency graph:
This is my own answer (I am the questioner). I am open to having it examined, and I am still waiting for other answers (not opinion-based ones):
The grouping operator () in every language shares the common function of composing a dependency graph.
In mathematics, computer science and digital electronics, a dependency graph is a directed graph representing dependencies of several objects towards each other. It is possible to derive an evaluation order or the absence of an evaluation order that respects the given dependencies from the dependency graph.
[Figure: dependency graph 1]
[Figure: dependency graph 2]
The functionality of the grouping operator () itself is not affected by the evaluation strategy of any language.
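As a sketch of how an evaluation order can be derived from such a graph, the dependencies of a() * (b() + c()) can be written down by hand and handed to graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later). The node labels are just strings chosen for this example.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on.
deps = {
    "a()": set(),
    "b()": set(),
    "c()": set(),
    "b() + c()": {"b()", "c()"},
    "a() * (b() + c())": {"a()", "b() + c()"},
}

# static_order() yields one order that respects every dependency.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Several orders satisfy the graph (the three calls can come in any order), which is why no particular output is shown here; the graph only requires that each node come after the nodes it depends on.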
I am using ANTLR 4.6 (snapshot of 11/23/2016).
I have two rules which are each left-recursive. I expanded several of the alternatives to expose the left recursion. ANTLR4 handles this, because the left recursion is explicit. However, the two rules are also mutually left-recursive.
How do I resolve the mutual left recursion, and do so in a way that doesn't leave the rules a total mess? Right now I have nice comments showing what was expanded, and I moved that to primary2 and constant_primary2, which aren't involved in the mutual left recursion.
constant_primary :
      constant_primary2
    | primary '.' method_call_body
    | constant_primary '\'' '(' constant_expr ')'
    ;
primary :
      primary2
    | primary '.' method_call_body
    | constant_primary '\'' '(' expr ')'
    ;
One option is to switch to using my fork of ANTLR 4, which is available through Maven using the Group ID com.tunnelvisionlabs. This fork handles mutual left recursion while producing parse trees that match the form you actually wrote in the grammar.
Note that this feature is somewhat experimental. If you run into problems feel free to post an issue on the issue tracker for my fork.
I'm reading the book Formal Syntax and Semantics of Programming Languages. I don't understand this exercise:
Consider the following two grammars, each of which generates strings of correctly balanced parentheses and brackets. Determine if either or both is ambiguous. The Greek letter ε represents an empty string.
<string> ::= <string> <string> | ( <string> ) | [ <string> ] | ε
<string> ::= ( <string> ) <string> | [ <string> ] <string> | ε
The first is ambiguous and the second is not. This is a question about how a context-free grammar (CFG) can be turned into a parse tree. In the first CFG, the first production is the source of the ambiguity. If I write the string "()()()" it is unclear which part of this string could match the left non-terminal and which could match the right non-terminal.
One valid parse tree for that string has the first two characters "()" match the left non-terminal, which then expands by the second production, while the rest of the string, "()()", matches the right non-terminal, which expands by the first production again.
Another valid parse tree has the first four characters "()()" match the left non-terminal and the rest, "()", match the right non-terminal. Both are equally valid, so there is an ambiguity. LR parser generators typically report this kind of ambiguity as a shift/reduce conflict.
This is no problem at all if you just want to know whether a string belongs to a language: if any parse works, you're good. It is a real problem, however, if you're trying to build a parse tree to use as, for example, an abstract syntax tree for a programming language.
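The two trees described above can be counted directly. The sketch below counts parse trees for both grammars by brute force; the function names are made up, and there is one simplifying assumption: for the first grammar, the <string> <string> production is only counted when both halves are non-empty, because allowing an empty half would produce infinitely many trees for any string.

```python
from functools import lru_cache

PAIRS = {"(": ")", "[": "]"}

@lru_cache(maxsize=None)
def trees_g1(s):
    """Parse trees for s under  S ::= S S | ( S ) | [ S ] | ε ,
    counting only S S splits with two non-empty halves (assumption)."""
    if s == "":
        return 1
    total = 0
    # S ::= ( S )  or  S ::= [ S ]
    if len(s) >= 2 and PAIRS.get(s[0]) == s[-1]:
        total += trees_g1(s[1:-1])
    # S ::= S S, trying every split point
    for i in range(1, len(s)):
        total += trees_g1(s[:i]) * trees_g1(s[i:])
    return total

@lru_cache(maxsize=None)
def trees_g2(s):
    """Parse trees for s under  S ::= ( S ) S | [ S ] S | ε ."""
    if s == "":
        return 1
    if s[0] not in PAIRS:
        return 0
    total = 0
    # Try every position where the opening bracket could be closed.
    for j in range(1, len(s)):
        if s[j] == PAIRS[s[0]]:
            total += trees_g2(s[1:j]) * trees_g2(s[j + 1:])
    return total

print(trees_g1("()()()"), trees_g2("()()()"))  # 2 1
```

The count of 2 for the first grammar corresponds to the two splits described above, "()" + "()()" and "()()" + "()".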
To show why this is a problem for parsing a language take a look at this example.
<expression> ::= <expression> <expression> | <expression> + <expression> | <expression> * <expression> | <number>
How do you parse "1+2*3"? Is it "(1+2)*3" or "1+(2*3)"? The grammar I gave has a shift/reduce conflict, so it is not specified. Most LR parse tools will resolve this conflict for you automatically and arbitrarily. This is dangerous, because if I'm writing a programming language there should be a well-defined understanding of which one the programmer will get. Since this is a typical arithmetic expression, we should probably follow the math convention and have the answer be "1+(2*3)".
The solution is either to rewrite the grammar so that it's unambiguous or, since many parser tools allow it, to explicitly specify the associativity and precedence of our lexical symbols, which is very convenient for keeping your grammar nice and readable.
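The second option can be sketched without a parser generator. The code below is a small precedence-climbing parser (a hand-written technique, not what an LR tool generates) that keeps the grammar flat and reads precedence and associativity from a table. The table values, and the right-associative "^" operator, are assumptions added for illustration.

```python
import re

# Operator table: (precedence, associativity). Hypothetical values; "^" is
# included only to show a right-associative operator.
OPS = {
    "+": (1, "left"), "-": (1, "left"),
    "*": (2, "left"), "/": (2, "left"),
    "^": (3, "right"),
}

def tokenize(text):
    """Return integer literals, operators and parentheses as strings."""
    return re.findall(r"\d+|[-+*/^()]", text)

def parse(tokens, min_prec=1):
    """Precedence climbing; consumes tokens from the front of the list."""
    node = parse_atom(tokens)
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, assoc = OPS[op]
        # Left-associative operators must not swallow an equal-precedence
        # operator on their right; right-associative ones may.
        next_min = prec + 1 if assoc == "left" else prec
        node = (op, node, parse(tokens, next_min))
    return node

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        node = parse(tokens)
        tokens.pop(0)  # discard the closing ")"
        return node
    return int(tok)

print(parse(tokenize("1+2*3")))    # ('+', 1, ('*', 2, 3))
print(parse(tokenize("(1+2)*3")))  # ('*', ('+', 1, 2), 3)
print(parse(tokenize("2^3^2")))    # ('^', 2, ('^', 3, 2))
```

Changing how "1+2*3" groups is then a one-line edit to the table rather than a restructuring of the grammar.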
Most UNIX regular expressions have, besides the usual *, +, ? operators, a backslash operator where \1, \2, ... match whatever the corresponding parenthesized group matched, so for example L = (a*)b\1 matches the (non-regular) language a^n b a^n.
On one hand, this seems pretty powerful, since you can write (a*)b\1b\1 to match the language a^n b a^n b a^n, which can't even be recognized by a pushdown automaton. On the other hand, I'm pretty sure a^n b^n cannot be expressed this way.
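For readers who want to try these patterns, here is a small sketch using Python's standard re module, which supports the same \1 backreference syntax. fullmatch is used so that the whole string has to belong to the language.

```python
import re

# (a*) captures a run of a's; \1 must then repeat exactly that run.
one_b = re.compile(r"(a*)b\1")
two_b = re.compile(r"(a*)b\1b\1")

print(bool(one_b.fullmatch("aaabaaa")))   # True:  a^3 b a^3
print(bool(one_b.fullmatch("aaabaa")))    # False: the two runs differ
print(bool(two_b.fullmatch("aabaabaa")))  # True:  a^2 b a^2 b a^2
```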
I have two questions:
Is there any literature on this family of languages (UNIX-y regular)? In particular, is there a version of the pumping lemma for these?
Can someone prove or disprove that a^n b^n cannot be expressed this way?
You're probably looking for
Benjamin Carle and Paliath Narendran "On Extended Regular Expressions" LNCS 5457
DOI:10.1007/978-3-642-00982-2_24
PDF Extended Abstract at http://hal.archives-ouvertes.fr/docs/00/17/60/43/PDF/notes_on_extended_regexp.pdf
C. Campeanu, K. Salomaa, S. Yu: A formal study of practical regular expressions, International Journal of Foundations of Computer Science, Vol. 14 (2003) 1007 - 1018.
DOI:10.1142/S012905410300214X
and of course follow their citations forward and backward to find more literature on this subject.
a^n b^n is a context-free language (CFL). A grammar for it is
A -> aAb | ε
You can use the pumping lemma for regular languages to prove that this language is not regular.
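The grammar translates almost literally into a recursive recognizer. This is only a sketch; the function name derives_A is made up.

```python
def derives_A(s):
    """True if s can be derived from  A -> a A b | ε."""
    if s == "":
        return True                # A -> ε
    if s[0] == "a" and s[-1] == "b":
        return derives_A(s[1:-1])  # A -> a A b
    return False

print(derives_A("aaabbb"))  # True
print(derives_A("aabbb"))   # False
```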
Ruby 1.9.1 supports the following regex:
regex = %r{ (?<foo> a\g<foo>a | b\g<foo>b | c) }x
p regex.match("aaacbbb")
# the result is #<MatchData "c" foo:"c">
"Fun with Ruby 1.9 Regular Expressions" has an example in which the author arranges all the parts of a regex so that it looks like a context-free grammar, as follows:
sentence = %r{
(?<subject> cat | dog | gerbil ){0}
(?<verb> eats | drinks| generates ){0}
(?<object> water | bones | PDFs ){0}
(?<adjective> big | small | smelly ){0}
(?<opt_adj> (\g<adjective>\s)? ){0}
The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x
I think this means that at least Ruby 1.9.1's regex engine, which is the Oniguruma regex engine, is actually equivalent to a context-free grammar, though the capturing groups aren't as useful as an actual parser-generator.
This means that "Pumping lemma for context-free languages" should describe the class of languages recognizable by Ruby 1.9.1's regex engine.
EDIT: Whoops! I messed up, and didn't do an important test which actually makes my answer above totally wrong. I won't delete the answer, because it's useful information nonetheless.
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
# I added anchors for the beginning and end of the string
regex.match("aaacbbb")
# returns nil, indicating that no match is possible with recursive capturing groups.
EDIT: Coming back to this many months later, I just discovered that my test in the last edit was incorrect. "aaacbbb" shouldn't be expected to match regex even if regex does operate like a context-free grammar.
The correct test should be on a string like "aabcbaa", and that does match the regex:
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
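For comparison, the language that the anchored recursive group describes can be written as a plain recursive function. This is a hand-written sketch in Python, whose standard re module has no equivalent of \g<foo> recursion; the function name matches_foo is made up.

```python
def matches_foo(s):
    """True if s is derived from  foo -> a foo a | b foo b | c."""
    if s == "c":
        return True
    # The outer letters must be the same: both "a" or both "b".
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "ab":
        return matches_foo(s[1:-1])
    return False

print(matches_foo("aabcbaa"))  # True
print(matches_foo("aaacbbb"))  # False
```

This agrees with the corrected test: "aaacbbb" is simply not in the language, because each level of the rule wraps the same letter on both sides.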
To which do the concepts of control flow, data type, statement, expression and operation belong: syntax or semantics?
What is the relation between control flow, data type, statement, expression, operation, function, ...? How is a program built from these primitives, level by level?
I would like to understand these primitive concepts and their relations in order to figure out which aspects of a new language one should learn.
Thanks and regards!
All of those language elements have both syntax (how it is written) and semantics (how the way it is written corresponds to what it actually means). Control flow determines which statements are executed and when; expressions yield a value and can be made up of functions and other language elements (although the details depend on the programming language). An operation is usually a sequence of statements. The meaning of "function" varies from language to language; in some languages, any operation that can be invoked by name is a function. In other languages, a function is an operation that yields a result (as opposed to a procedure, which does not report a result). Some languages also require that functions be non-mutating while procedures can be mutating, although this varies from language to language. Data types encapsulate both data and the operations/procedures/functions that can be performed on that data.
They belong to both worlds:
Syntax will describe which operators exist, which primitive types exist (int, float), and which keywords exist (return, for, while). So syntax decides which "words" you can use in the programming language. By word I mean every single possible token: = is a token, void is a token, varName12345 is a token that is considered an identifier, 12.4 is a token considered a float, and so on.
Semantics will describe how these tokens can be combined together inside your language.
For example, the rule for while would be something like:
WHILE ::= 'while' '(' CONDITION ')' '{' STATEMENTS '}'
CONDITION ::= CONDITION '&&' CONDITION | CONDITION '||' CONDITION | ...
STATEMENTS ::= STATEMENT ';' STATEMENTS | empty_rule
and so on. This is the grammar of the language, which describes exactly how the language is structured. So it will be able to decide whether a program is correct according to the language semantics.
Then there is a third aspect of the semantics, which is "what does that construct mean?". You can see it as a correspondence between, for example, a for loop and how it is translated into the lower-level language needed to execute it.
This third aspect will decide whether your program is correct with respect to the allowed operations. Usually you can make a compiler reject many programs that have no meaning (because they violate the semantics), but to find many other kinds of mistakes you will have to introduce a new tool: the type checker, which will also check that whenever you perform operations they are correct according to the types.
For example, your grammar can allow varName = 12.4, but the type checker will use the declaration of varName to determine whether you can assign a float to it (of course, we're talking about static type checking).
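A toy version of that check might look like the sketch below. Everything in it is assumed for illustration: the declared table, the two type names, and the rule that an int literal may be assigned to a float variable but not the other way round.

```python
# Declared variable types, as a static checker would have collected them
# from the declarations (hypothetical example data).
declared = {"count": "int", "ratio": "float"}

def literal_type(text):
    """Classify a numeric literal token as 'int' or 'float'."""
    return "float" if "." in text else "int"

def check_assignment(name, literal):
    """Return None if `name = literal` is well typed, else an error message."""
    if name not in declared:
        return f"{name} is not declared"
    target, source = declared[name], literal_type(literal)
    # Assumed rule: int widens to float, but float does not narrow to int.
    if target == source or (target == "float" and source == "int"):
        return None
    return f"cannot assign {source} to {target} variable {name}"

print(check_assignment("ratio", "12.4"))  # None
print(check_assignment("count", "12.4"))  # cannot assign float to int variable count
```

Both assignments have the same grammatical shape, so the grammar accepts them alike; only the declared types tell them apart.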
Those concepts belong to both.
Statements, expressions, control flow operations, data types, etc. have their structure defined using the syntax. However, their meaning comes from the semantics.
When you have defined syntax and semantics for a programming language and its constructs, this basically provides you with a set of building blocks. The syntax is used to understand the structure in the code - usually represented using an abstract syntax tree, or AST. You can then traverse the tree and apply the semantics to each element to execute the program, or generate some instructions for some instruction set so you can execute the code later.
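As a rough sketch of those two uses of a tree, the code below defines a tiny AST with two node types, then walks it once to execute it and once to generate instructions for an imaginary stack machine. The node classes, the two supported operators, and the instruction names are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: float

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def evaluate(node):
    """Apply the semantics directly: walk the tree and compute a value."""
    if isinstance(node, Num):
        return node.value
    left, right = evaluate(node.left), evaluate(node.right)
    if node.op == "+":
        return left + right
    if node.op == "*":
        return left * right
    raise ValueError(f"unknown operator {node.op!r}")

def compile_to_stack(node):
    """Generate postfix instructions for a stack machine instead."""
    if isinstance(node, Num):
        return [("push", node.value)]
    return compile_to_stack(node.left) + compile_to_stack(node.right) + [(node.op,)]

# The tree a parser would build for 1 + 2 * 3.
tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))
print(evaluate(tree))          # 7
print(compile_to_stack(tree))  # [('push', 1), ('push', 2), ('push', 3), ('*',), ('+',)]
```

The syntax is fixed by the shape of the tree; the two functions are two different ways of giving that same tree a meaning.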