In the Dragon Book's exercise 3.3.1, the student should

Consult the language reference manuals to determine (i) the set of characters that form the input alphabet (excluding those that may only appear in character strings or comments [...] for each of the following languages: [...].
It makes no real sense to me to describe literally all the characters, like a, b, /, for a language, even if it is an exercise for a compilers course. Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ...}?
And if you take "characters" in the basic sense of the word, is ??/ in C one character or three (or both)?
The alphabet of a language is the set of characters, not the words.
Isn't the alphabet of a programming language the set of possible words, like {id, int, float, string, if, for, ...}?
No, the alphabet is the set of characters that are used to form words. When a language is specified, the alphabet must be given; otherwise you cannot distinguish a valid token from an invalid one.
Update
You are confusing the term "word" with "token". A word is not some part of a language or program. A word is a finite string of characters from the alphabet. It has nothing to do with a language construct like "int" or "while". For example, each C program is a word, because it is a finite string of characters from the alphabet. The set of all of these programs (words) forms the C programming language. Tokens like "void" or "int" are an entirely different thing.
To recap: you start by defining some set of characters you want to use. This is called the alphabet. Finite strings of these characters form words. A language is some subset of all possible words. To define a language, you define which words belong to it, for example with a regular expression or a context-free grammar.
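To make this concrete, here is a minimal sketch in Haskell (all names are invented for illustration): an alphabet is just a set of characters, a word is any finite string over it, and a language is whichever subset of the words you choose to accept.

-- The alphabet: some set of characters we chose to use.
alphabet :: [Char]
alphabet = ['a', 'b']

-- A word is any finite string of characters from the alphabet.
isWord :: String -> Bool
isWord = all (`elem` alphabet)

-- A language is some subset of all words, given here by a membership
-- predicate: the (regular) language of words made of one or more 'a's.
inLanguage :: String -> Bool
inLanguage w = isWord w && not (null w) && all (== 'a') w

Here inLanguage "aaa" is True while inLanguage "ab" is False: both are words over the alphabet, but only the first belongs to this particular language.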
Wikipedia has a good page on formal languages.
http://en.wikipedia.org/wiki/Formal_language
The confusion comes from theory defining the alphabet as the set of symbols from which the strings in a language are formed. Note that grammars for programming languages use tokens, not characters, as terminal symbols.
Traditionally, from the perspective of language theory, programming languages involve two language definitions: 1) one with characters as the alphabet and tokens as the valid strings, and 2) one with tokens as the alphabet and programs as the valid strings. That's why programming languages are usually specified in two parts: a lexical analyzer and a syntactic analyzer.
It is not strictly necessary to have the two definitions to parse a programming language. A single grammar can be used to specify a programming language using characters as the input alphabet. It's just that the characters-to-tokens part has been easier to specify with regular expressions, and the tokens-to-program part with grammars.
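As a hedged sketch of that split, here is a toy two-phase pipeline in Haskell (token and function names are invented): a character-level tokenizer, which is essentially regular, feeding a token-level parser for the grammar expr -> num ('+' num)*.

import Data.Char (isDigit, isSpace)

data Token = TNum Int | TPlus deriving (Show, Eq)

-- Phase 1: characters -> tokens (the part regular expressions handle well).
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c:cs)
  | isSpace c = tokenize cs
  | c == '+'  = TPlus : tokenize cs
  | isDigit c = let (ds, rest) = span isDigit (c:cs)
                in TNum (read ds) : tokenize rest
  | otherwise = error ("unexpected character: " ++ [c])

-- Phase 2: tokens -> program (the part context-free grammars handle well).
parseExpr :: [Token] -> Maybe Int
parseExpr (TNum n : rest) = go n rest
  where
    go acc []                    = Just acc
    go acc (TPlus : TNum m : ts) = go (acc + m) ts
    go _   _                     = Nothing
parseExpr _ = Nothing

For example, parseExpr (tokenize "1 + 2 + 3") evaluates to Just 6. The tokens produced by the first phase are exactly the alphabet of the second definition.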
Modern compiler-compilers like ANTLR use grammar-specification languages that incorporate the expressive convenience of regular expressions, so a character-to-program definition can be done with a single grammar. Still, separating the lexical from the syntactical remains the most convenient way to parse a programming language, even with such tools.
Last-minute example: imagine that the grammar productions for an if-then-else-end construct had to deal, at the character level, with:
Whitespace.
Keywords occurring inside string literals: "Then, the end."
Variable names that contain keywords: 'tiff' (which contains 'if'),
...
It can be done, but it would be extremely complicated.
Related
I'm trying to write a lexer rule that would match the following strings:
a
aa
aaa
bbbb
The requirement here is that all characters must be the same.
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can't do that in an ANTLR lexer. At least, not without target-specific code inside your grammar, and placing code in your grammar is something you should not do: it makes the grammar hard to read and ties it to that one target language. It is better to do that kind of check/validation inside a listener or visitor.
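For illustration, the lexer rule itself would match loosely (something like REPEAT_CHARS : [a-z]+ ;) and the listener or visitor would then enforce the real constraint. The constraint is the same in any target language; here it is sketched in Haskell (the function name is invented):

-- True when every character of the token's text equals the first one.
allSameChar :: String -> Bool
allSameChar []     = False
allSameChar (c:cs) = all (== c) cs

A listener would run the equivalent of allSameChar over each REPEAT_CHARS token's text and report an error whenever it returns False.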
Things like back-references and look-arounds are features that crept into the regex engines of programming languages. The regular-expression syntax available in ANTLR (and in all parser generators I know of) does not support those features; it describes true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
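The "squares" language is a handy litmus test: no finite automaton can remember an unbounded first half, so it is not regular, yet ordinary code checks it trivially. A hypothetical helper in Haskell:

-- True when the string is some word repeated twice, e.g. "papa".
isSquare :: String -> Bool
isSquare s = even n && take h s == drop h s
  where
    n = length s
    h = n `div` 2

This is exactly the kind of predicate that a backreference like (\w+)\1 smuggles into "regular expressions", taking them beyond the regular languages.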
The lexical grammar of most programming languages is fairly non-expressive so that it can be lexed quickly. I'm not sure which category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context-free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#^k"…"#^k (writing ^k for k repetitions), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#^k, although it can contain "#^j for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar, because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free languages are closed under the application of a finite-state transducer.)
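To see the counting requirement concretely, here is a hedged Haskell sketch of a raw-string scanner (names invented): it must remember the exact number k of opening #s, which is unbounded in principle, and then search for a quote followed by exactly k #s. No fixed amount of state suffices for every k.

import Data.List (isPrefixOf)

-- Scan r#^k"body"#^k, returning (body, remaining input) on success.
lexRawString :: String -> Maybe (String, String)
lexRawString ('r':rest) = case afterHashes of
    '"':body -> go body
    _        -> Nothing
  where
    (hashes, afterHashes) = span (== '#') rest
    closing = '"' : hashes          -- the terminator: a quote plus k #s
    go s | closing `isPrefixOf` s = Just ("", drop (length closing) s)
    go (c:cs) = do (b, r) <- go cs
                   Just (c : b, r)
    go [] = Nothing                 -- unterminated literal
lexRawString _ = Nothing

Applied to the Rust snippet above, the scanner correctly skips over the embedded "# because only "## can terminate that particular literal.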
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' cannot be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
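A hedged sketch of that single character of lookahead, in Haskell (names invented): after the integer part, a '.' starts a fraction only if it is not immediately followed by another '.'.

import Data.Char (isDigit)

data NumTok = TInt String | TFloat String deriving Show

-- Lex a number so that "2..3" leaves ".." for the range operator.
lexNumber :: String -> (NumTok, String)
lexNumber s = case rest of
    ('.':'.':_) -> (TInt intPart, rest)   -- lookahead: leave ".." alone
    ('.':more)  -> let (frac, rest') = span isDigit more
                   in (TFloat (intPart ++ "." ++ frac), rest')
    _           -> (TInt intPart, rest)
  where
    (intPart, rest) = span isDigit s

So lexNumber "2..3" yields (TInt "2", "..3") while lexNumber "2.5" yields (TFloat "2.5", ""). One character of trailing context is all that is needed, which is why a DFA can still cope.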
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and it suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer, because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
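Here is a hedged Haskell sketch of the pathology (token names invented): the tokenizer must buffer an arbitrarily long run of a's before it can choose between emitting individual A tokens and emitting a single AB token, so no finite-state transducer suffices even though each token by itself is regular.

-- Tokens: A matches "a"; AB matches a*b.
data T = A | AB String deriving Show

tokenizeAB :: String -> Maybe [T]
tokenizeAB [] = Just []
tokenizeAB s@('a':_) =
    let (as, rest) = span (== 'a') s       -- unbounded buffering happens here
    in case rest of
         ('b':rest') -> (AB (as ++ "b") :) <$> tokenizeAB rest'
         _           -> (replicate (length as) A ++) <$> tokenizeAB rest
tokenizeAB ('b':rest) = (AB "b" :) <$> tokenizeAB rest
tokenizeAB _ = Nothing

For instance, tokenizeAB "aaab" is Just [AB "aaab"] while tokenizeAB "aaa" is Just [A,A,A]; the decision only becomes possible after the entire run of a's has been read.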
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.
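That count is simply the number of delimiter strings of length 0 through 16 over an 88-character alphabet; a quick GHCi one-liner confirms it:

GHCi> sum [88 ^ k | k <- [0 .. 16]]
13082362351752551144309757252761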
I wonder: if this
alph = ['a'..'z']
returns
"abcdefghijklmnopqrstuvwxyz"
how can I get the French alphabet? Can I somehow pass a locale?
Update:
Well, I know that English and French have the same letters. But my point is: what if they were not the same, but still started with A and ended with Z? It would be nice to have support for human-language alphabet ranges.
At least some languages come with localization support.
(just trying Haskell, reading a book)
Haskell Char values are not real characters; they are Unicode code points. In some other languages, the native character type may represent other things, like ASCII characters or "code page whatsitsnumber" characters, or even something selectable at runtime, but not in Haskell.
The range 'a'..'z' coincides with the English alphabet for historical reasons, both in Unicode and in ASCII, and also in character sets derived from ASCII such as ISO8859-X. There is no commonly supported coded character set where some contiguous range of codes coincides with the French alphabet. That is, if you count letters with diacritics as separate letters. The accepted practice seems to exclude letters with diacritics, so the French alphabet coincides with English, but this is not so for other Latin-derived alphabets.
In order to get most alphabets other than English, one needs to enumerate the characters explicitly by hand, not with a range expression. For some languages one cannot even use Char to represent all letters, as some of them need more than one code point, such as Hungarian "ly", Spanish "ll" (before 2010), or Dutch "ij" (according to some authorities; there is no single commonly accepted definition).
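For example, a hand-written French alphabet that does count diacritics and ligatures might look like this (the exact membership is a judgment call, as noted above):

-- One possible hand enumeration; which letters "count" is debatable.
frenchAlphabet :: [Char]
frenchAlphabet = ['a'..'z'] ++ "àâæçéèêëîïôœùûüÿ"

No range expression will produce this list, because the extra letters are scattered across the Unicode code space (œ, for instance, lives in a different block than the others).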
No language that I know supports arbitrary human alphabets as range expressions out of the box.
While programming languages usually support sorting by the current locale (just search for collate on Hackage), there is no library I know that provides a list of alphabetic characters by locale.
Modern (Unicode) systems that allow for localized characters try to also allow many non-Latin alphabets, and thus very many alphabetic characters.
Enumerating all alphabetic characters within Unicode gives over 40k characters:
GHCi> length $ filter Data.Char.isAlpha $ map Data.Char.chr [0..256*256]
48408
While I am aware of libraries for constructing alphabetic indices, I don't know of any Haskell binding for this feature.
Why are special characters (except underscore) not allowed in variable names in programming languages?
Is there any reason related to computer architecture or organization?
Most languages have long histories, using ASCII (or EBCDIC) character sets. Those languages tend to have simple identifier descriptions (e.g., starts with A-Z, followed by A-Z,0-9, maybe underscore; COBOL allows "-" as part of a name). When all you had was an 029 keypunch or a teletype, you didn't have many other characters, and most of them got used as operator syntax or punctuation.
On older machines, this did have the advantage that you could encode an identifier as a radix-37 (A-Z, 0-9, null) number [6 characters in 32 bits] or a radix-64 (A-Z, a-z, 0-9, underscore and null) number [6 characters in 36 bits, a common word size in earlier generations of machines] for small symbol tables. A consequence: many older languages had 6-character limits on identifier sizes (e.g., FORTRAN).
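A hedged reconstruction of that packing trick in Haskell (the exact symbol numbering is invented for illustration): with null as 0, A-Z as 1-26 and 0-9 as 27-36, six radix-37 digits span at most 37^6 = 2,565,726,409 values, which fits in 32 bits.

import Data.Char (ord, toUpper)

-- Null -> 0, 'A'..'Z' -> 1..26, '0'..'9' -> 27..36.
symbolCode :: Char -> Integer
symbolCode c
  | c == '\0'            = 0
  | c >= 'A' && c <= 'Z' = fromIntegral (ord c - ord 'A' + 1)
  | c >= '0' && c <= '9' = fromIntegral (ord c - ord '0' + 27)
  | otherwise            = error "not in the identifier alphabet"

-- Pack an identifier of up to six characters into one radix-37 number.
packIdentifier :: String -> Integer
packIdentifier name = foldl (\acc c -> acc * 37 + symbolCode c) 0 padded
  where padded = take 6 (map toUpper name ++ repeat '\0')

Since 2,565,726,409 < 2^32 = 4,294,967,296, the result always fits in one 32-bit word, which is one reason six-character identifier limits were so common.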
LISP languages have long been much more permissive; names can be anything but characters with special meaning to LISP, e.g., ( ) [ ] ' ` #, and usually there are ways to insert these characters into names using some kind of escape convention. Our PARLANSE language is like LISP; it uses "~" as an escape, so you can write ~(begin+~)end as a single identifier whose actual spelling is "(begin+end)".
More modern languages (Java, C#, Scala, ..., uh, even PARLANSE) grew up in the Unicode era and tend to allow most of Unicode in identifiers (actually, they tend to allow named Unicode subsets as parts of identifiers). An identifier made of Chinese characters is perfectly legal in such languages.
It's kind of a matter of taste in the Western hemisphere: most identifier names still tend to use just letters and digits (sometimes, Western European letters). I don't know what the Japanese and Chinese really use for identifier names when they have Unicode-capable character sets; what little Asian code I have seen tends to follow Western identifier conventions, but the comments tend to use much more of the local native and/or Unicode character set.
Fundamentally, it is because most special characters are used as operators or separators, so allowing them in names would introduce ambiguity.
Is there any reason related to computer architecture or organization?
No. The computer can't see the variable names. Only the compiler can. But it has to be able to distinguish a variable name from two variable names separated by an operator, and most language designers have adopted the principle that the meaning of a computer program should not be affected by white space.
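COBOL's "-" makes the ambiguity concrete: once an operator character may appear in names, maximal-munch lexing forces whitespace to carry meaning. A toy illustration in Haskell (names invented):

import Data.Char (isAlpha)

-- Suppose '-' were an identifier character, as in COBOL.
identChar :: Char -> Bool
identChar c = isAlpha c || c == '-'

-- Maximal munch: take the longest identifier prefix.
lexIdent :: String -> (String, String)
lexIdent = span identChar

-- lexIdent "a-b"   == ("a-b", "")     -- one name, not a subtraction
-- lexIdent "a - b" == ("a", " - b")   -- spaces restore the operator

That whitespace sensitivity is precisely what most language designers chose to avoid by keeping operator characters out of names.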
Every programming language I have ever seen has been based on the Latin alphabet; this is not surprising, considering I live in Canada...
But it only really makes sense that there would be programming languages based on other alphabets, or else bright computer scientists across the world would have to learn a new alphabet to go on in the field. I know for a fact that people in countries dominated by other alphabets develop languages based on the Latin alphabet (e.g. Ruby, from Japan), but just how common is it for programming languages to be based on other alphabets like Arabic or Cyrillic, or even writing systems which are not alphabetic but rather logographic in nature, such as Japanese kanji?
Also are any of these languages in active widespread use, or are they mainly used as teaching tools?
This is something that has bugged me since I started programming, and I have never run across someone who could think of a real answer.
Have you seen Perl?
APL is probably the most widely known. It even has a cool keyboard overlay (or was it a special keyboard you had to buy?).
In the non-alphabetic category, we also have programming languages like LabVIEW, which is mostly graphical. (You can label objects, and you can still do string manipulation, so there's some textual content.) LabVIEW has been used in data acquisition and automation for years, but gained a bit of popularity when it became the default platform for Lego Mindstorms.
There's a list on Wikipedia. I don't think any of them is really prevalent, though. Many programmers can learn to write programs with English keywords even if they don't otherwise understand English. Ruby is a good example: you'll still see Japanese identifiers and comments in some Ruby code.
Well, Brainf* uses no Latin characters, if you'll pardon the language... and the pun.
Many languages allow Unicode identifiers. It's part of standard Java, and both g++ (though you have to use \uNNNN escapes) and MSVC++ allow them (see also this question). Some languages also let you use #define (or something better) to rename control structures.
But in reality, people don't do this for the most part. See past questions such as Language of variable names?, Should all code be written in English?, etc.
Agda.
Sample Snippet:
mutual
data ωChain : Set where
_∷_,_ : ∀ (x : carrier) (xω : ∞ ωChain) (p : x ≼ xω) → ωChain
head : ωChain → carrier
head (x ∷ _ , _) = x
_≼_ : carrier → ∞ ωChain → Set
x ≼ xω = x ≤ head (♭ xω)
Well, there's always APL. That has its own Unicode characters, and I believe it used to require a special keyboard too.
There is one language used in a Russian ERP system, named after the company that developed it: 1C. But its identifiers and operators have English analogs.
Also, I know that Haskell supports Unicode identifiers, so you can write programs in any alphabet. But this is not that useful (my native language is Russian). It is quite enough that program messages and helpful comments have to be typed in the native alphabet.
Other people are answering with languages that use punctuation marks in addition to Latin letters. I wonder why no one mentioned digits 0 to 9 as well.
In some languages, and in some implementations of some languages, programmers can use a wide range of characters in identifiers, such as Arabic or Chinese characters. This doesn't mean that the language relies on them though.
In most languages, programmers can use a wide range of characters in string literals (in quotation marks) and in comments. Again this doesn't mean that the language relies on them.
In every programming language that I've seen, the language does rely on punctuation marks and digits. So this answers your question but not in the way you expect.
Now let's try to find something meaningful. Is there a programming language where keywords are chosen from non-Latin alphabets? I would guess not, except maybe for joke languages. What would be the point of inventing a programming language that makes it impossible for some programmers to even input a program?
EDIT: My guess is wrong. Besides APL's usage of various invented punctuation marks, it does depend on a few Greek keywords, where each keyword is one letter long, such as the letter rho.
I just found an interesting wiki for "esoteric programming languages".