Is Rust's lexical grammar regular, context-free or context-sensitive? - rust

The lexical grammar of most programming languages is fairly non-expressive in order to quickly lex it. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?

The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#k"…"#k (using the superscript k as a count operator), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#k although it can contain "#j for any j<k
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the languages "tokens", or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite state transducer because it must examine any sequence of as (of any length) before deciding whether to fallback to the first a or to accept the token consisting of all the as and the following b (if present).
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.

Related

How computer languages are made using theory of automata concept?

I tried really hard to find answer to this question on google engine.
But I wonder how these high level programming languages are created in principle of automata or is automata theory not included in defining the languages?
Language design tends to have two important levels:
Lexical analysis - the definition of what tokens look like. What is a string literal, what is a number, what are valid names for variables, functions, etc.
Syntactic analysis - the definition of how tokens work together to make meaningful statements. Can you assign a value to a literal, what does a block look like, what does an if statement look like, etc.
The lexical analysis is done using regular languages, and generally tokens are defined using regular expressions. It's not that a DFA is used (most regex implementations are not DFAs in practice), but that regular expressions tend to line up well with what most languages consider tokens. If, for example, you wanted a language where all variable names had to be palindromes, then your language's token specification would have to be context-free instead.
The input to the lexing stage is the raw characters of the source code. The alphabet would therefore be ASCII or Unicode or whatever input your compiler is expecting. The output is a stream of tokens with metadata, such as string-literal (value: hello world) which might represent "hello world" in the source code.
The syntactic analysis is typically done using a subset of context-free languages called LL or LR parsers. This is because the implementation of CFG (PDAs) are nondeterministic. LL and LR parsing are ways to make deterministic decisions with respect to how to parse a given expression.
We use CFGs for code because this is the level on the Chomsky hierarchy where nesting occurs (where you can express the idea of "depth", such as with an if within an if). Higher or lower levels on the hierarchy are possible, but a regular syntax would not be able to express nesting easily, and context-sensitive syntax would probably cause confusion (but it's not unheard of).
The input to the syntactic analysis step is the token stream, and the output is some form of executable structure, typically a parse tree that is either executed immediately (as in interpretted languages) or stored for later optimization and/or execution (as in compiled languages) or something else (as in intermediate-compiled languages like Java). The alphabet of the CFG is therefore the possible tokens specified by the lexical analysis step.
So this whole thing is a long-winded way of saying that it's not so much the automata theory that's important, but rather the formal languages. We typically want to have the simplest language class that meets our needs. That typically means regular tokens and context-free syntax, but not always.
The implementation of the regular expression need not be an automaton, and the implementation of the CFG cannot be a PDA, because PDAs are nondeterministic, so we define deterministic parsers on reasonable subsets of the CFG class instead.
More generally we talk about Theory of computation.
What has happened through the history of programming languages is that it has been formally proven that higher-level constructs are equivalent to the constructs in the abstract machines of the theory.
We prefer the higher-level constructs in modern languages because they make programs easier to write, and easier to understand by other people. That in turn leads to easier peer-review and team-play, and thus better programs with less bugs.
The Wikipedia article about Structured programming tells part of the history.
As to Automata theory, it is still present in the implementation of regular expression engines, and in most programming situations in which a good solution consists in transitioning through a set of possible states.

Repeating Pattern Matching in antlr4

I'm trying to write a lexer rule that would match following strings
a
aa
aaa
bbbb
the requirement here is all characters must be the same
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in antlr4. is it possible to come up with a pattern for this?
You can’t do that in an ANTLR lexer. At least, not without target specific code inside your grammar. And placing code in your grammar is something you should not do (it makes it hard to read, and the grammar is tied to that language). It is better to do those kind of checks/validations inside a listener or visitor.
Things like back-references and look-arounds are features that krept in regex-engines of programming languages. The regular expression syntax available in ANTLR (and all parser generators I know of) do not support those features, but are true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages

If Non Terminals in Gene Expression Programming are mono-type functions, how to build complex Programs?

It just seemed to me studying GEP,and especially analyzing Karva expressions, that Non Terminals are most suitable for functions which type is a->a for some type a, in Haskell notation.
Like, with classic examples, Q+-*/ are all functions from 'some' Double to 'a' Double and they just change in arity.
Now, how can one coder use functions of heterogeneous signature in one Karva expressed gene?
Brief Introduction to GEP/Karva
Gene Expression Programming uses dense representations of a population of expressions and applies evolutionary pressure to make better ones to solve a given problem.
Karva notation represents an expression tree as a string, represented in a non-traditional traversal of level-at-a-time, left-to-right - read more here. Using Karva notation, it is simple and quick to combine (or mutate) expressions to create the next generation.
You can parse Karva notation in Haskell as per this answer with explanation of linear time or this answer that's the same code, but with more diagrams and no proof.
Terminals are the constants or variables in a Karva expression, so /+a*-3cb2 (meaning (a+(b*2))/(3-c)) has terminals [a,b,2,3,c]. A Karva expression with no terminals is thus a function of some arity.
My Question is then more related to how one would use different types of functions without breaking the gene.
What if one wants to use a Non Terminal like a > function? One can count on the fact that, for example, it can compare Doubles. But the result, in a strongly typed Language, would be a Bool. Now, assuming that the Non terminal encoding for > is interspersed in the gene, the parse of the k-expression would result in invalid code, because anything calling it would expect a Double.
One can then think of manually and silently sneak in a cast, as is done by Ms. Ferreira in her book, where she converts Bools into Ints like 0 and 1 for False and True.
Si it seems to me that k-expressed genes are for Non Terminals of any arity, that share the property of taking values of one type a, returning a type a.
In the end, has anyone any idea about how to overcome this?
I already now that one can use homeotic genes, providing some glue between different Sub Expression Trees, but that, IMHO, is somewhat rigid, because, again, you need to know in advance returned types.

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph:
Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.
and continues...
The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.
My questions are:
What makes NFC best for "general text." What defines "internal processing" and why is it best left to NFD? And finally, never minding what is "best," are the two forms interchangable as long as two strings are compared using the same normalization form?
The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.
In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.
For example, U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
There are risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä”, using the precomposed form only, then it will not detect the latter. But in many cases, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then there is a risk that the two representations result in somewhat different renderings.
It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.
NFC is the general common sense form that you should use, ä is 1 code point there and that makes sense.
NFD is good for certain internal processing - if you want to make accent-insensitive searches or sorting, having your string in NFD makes it much easier and faster. Another usage is making more robust slug titles. These are just the most obvious ones, I am sure there are plenty of more uses.
If two strings x and y are canonical equivalents, then
toNFC(x) = toNFC(y)
toNFD(x) = toNFD(y)
Is that what you meant?

"Alphabet of Programminglanguage X" means really the Characters or Words?

In the Dragonbook's exercise 3.3.1 the student should
Consult the language reference manuals
to determine (i) the set of characters
that form the input alphabet
(excluding those that may only appear
in character strings or comments [...]
for each of the following languages:
[...].
It makes no real sense to me to describe really all the characters like a, b, / for a language, even if it is an exercise for compilers. Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ... }?
And if you consider it really beeing "characters" in the basic idea of the word, is ??/ in C one or three charaters (or both)?
The alphabet of a language is the set of characters not the words.
Isn't the alphabet of a programming
language the set of possible words,
like {id, int, float, string, if, for,
... }?
No, the alphabet is the set of characters that are used to form words. When an language is specified, the alphabet must be given otherwise you cannot distinguish a valid token from an invalid token.
Update
You are confusing the term "word" with "token". A word is not some part of a language or program. A word is finite string of characters from the alphabet. It has nothing to do with a language construct like "int" or "while". For example, each C program is a word because it is a finite string of characters from the alphabet. The set of all of these programs (words) forms the C programming language. Tokens like "void" or "int" are entirely a different thing.
To recap, you start by defining the some set of characters you want to use. This is called the alphabet. Finite strings of these characters form words. A language is some subset of all possible words. To define a language, you define which words belong to the language. For example, with a regular expression or a context-free grammar.
Wikipedia has a good page on formal languages.
http://en.wikipedia.org/wiki/Formal_language
The confusion comes from theory defining alphabet as the set of symbols from which the strings in a language are formed. Note that the grammars for programming languages use tokens and not characters as terminal symbols.
Traditionally, from the perspective of language theory, programming languages involve two language definitions: 1) The one that has characters as the alphabet and tokens as the valid strings. 2) The one that has tokens as the alphabet and programs as the valid strings. That's why programming languages are usually specified in two parts, a lexical, and a syntactical analyzer.
It is not strictly necessary to have the two definitions to parse a programming language. A single grammar can be used to specify a programming language using characters as the input alphabet. It's just that the characters-to-token parts has been easier to specify with regular expressions, and the tokens-to-program part with grammars.
Modern compiler-compilers like ANTLR use grammar-specification languages that incorporate the expressive convenience of regular expressions, so a character-to-program definition can be done with a single grammar. Still, separating the lexical from the syntactical remains the most convenient way to parse a programming language, even with such tools.
Last minute example: imagine that the grammar productions for an if-then-else-end had to deal at the character-level with:
Whitespace.
Keywords within programming language strings: "Then, the end."
Variable names that contain keywords: 'tiff',
...
It can be done, but it would be extremely complicated.

Resources