In the ISO 14977 EBNF standard, section 4.7 describes the legal contents of an exception. I'm fairly certain that an exception may consist of any valid syntactic factor, as long as it doesn't contain any meta-identifiers. That would mean I could use a special sequence as an exception, like this:
syntax =
my rule - ? Something clever ? ;
Is this the case?
You can get the standard for free at ISO here.
No, that is not correct. You can use meta-identifiers in an exception as long as, when fully evaluated, they result in terminals. The example given in section 4.7 shows a meta-identifier defined in terms of itself, which can never fully resolve to a terminal. That is what they mean when they say:
...could equally be represented by a syntactic-factor containing no meta-identifiers.
There are lots of examples of this in the standard itself - check out section 8. For example:
gap free symbol = terminal character - (first quote symbol | second quote symbol)
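To make the distinction concrete, here is a small sketch of my own (not taken from the standard): the exception names a meta-identifier, but that meta-identifier resolves entirely to terminals, so it is allowed.
vowel = "a" | "e" | "i" | "o" | "u" ;
letter = "a" | "b" | "c" | "d" | "e" ;  (* abbreviated alphabet *)
consonant = letter - vowel ;
(* By contrast, the illegal example in section 4.7 uses a meta-identifier defined in terms of itself, which can never resolve to terminals. *)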
I have a file named 1_add.rs, and I tried to add it to lib.rs as a module. However, I got the following error during compilation:
error: expected identifier, found `1_add`
--> src/lib.rs:1:5
|
1 | mod 1_add;
| ^^^^^ expected identifier
It seems that an identifier starting with a digit is invalid. But why does Rust have this restriction? Is there any workaround if I want to indicate the ordering of different Rust files (for managing the exercise files)?
In your case (you want to name the files like 1_foo.rs) you can write
#[path="1_foo.rs"]
mod mod_1_foo;
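For example, here is a minimal sketch (mine, not from the original answer), assuming 1_foo.rs defines a public function named run:
// lib.rs
#[path = "1_foo.rs"]
mod mod_1_foo; // the module name is an ordinary identifier, even though the file name starts with a digit

pub fn first_exercise() {
    mod_1_foo::run(); // `run` is a hypothetical function defined in 1_foo.rs
}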
Allowing identifiers to start with digits can conflict with the type suffixes on numeric literals. E.g.
let foo = 1_u32;
sets the type to u32. It would be confusing if 1_u256 meant another variable.
But why does Rust have this restriction?
Not only Rust: almost every language I've ever written a line of code in has this restriction as well.
Food for thought:
let a = 1_2;
Is 1_2 a variable name, or is it a literal for the value 12? What if the variable 1_2 does not exist now but you add it later - does this token stop being a number literal?
While the Rust compiler probably could make it work, it's not worth all the confusion, IMHO.
Allowing identifiers to start with a digit would cause conflicts with many other token types. Here are a few examples:
1e1 is a floating point number.
0x0 is a hexadecimal integer.
8u8 is an integer with an explicit type suffix.
Most importantly, though, I believe allowing identifiers starting with a digit would hurt readability. Currently everything starting with a digit is some kind of number, which in my opinion helps when reading code.
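A tiny sketch of my own showing those tokens in action - each one already lexes as a numeric literal, which is exactly the space that digit-leading identifiers would collide with:
fn main() {
    let a = 1e1; // floating point literal: 10.0
    let b = 0x0; // hexadecimal integer literal: 0
    let c = 8u8; // integer literal with an explicit u8 type suffix
    println!("{} {} {}", a, b, c);
}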
An incomplete list of programming languages not allowing identifiers to start with a digit: Python, Java, JavaScript, C#, Ruby, C, C++, Pascal. I can't think of a language that does allow this (though one most likely exists).
Rust identifiers are based on Unicode® Standard Annex #31
(see The Rust RFC Book), which standardizes some common rules for identifiers in programming languages. It might also make it easier to parse text that could otherwise be ambiguous, like 1e10.
"Why?" cannot be reasoned here but by historical tales, the rules are as such. You cannot play against them.
If you really want your identifiers to start with a digit, at least for human readers, prepend an underscore, like this: _1_add.
Note: to make sure that sorting works well, also use as many leading zeroes as appropriate (e.g. _001_add if you expect more than 99 files).
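A sketch of how lib.rs might then look (the module and file names here are hypothetical):
mod _001_add;
mod _002_subtract;
// sorted listings of src/ now follow the exercise order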
Section 2.1.3 of the Python Language Reference says:
Comments are ignored by the syntax.
While I'm not entirely sure about this, I believe this means the Python Interpreter will ignore comments.
In contrast, section 2.1.4 says:
If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration.
This also seems to be a statement of fact about the Python Interpreter: that it does not ignore a comment in the first or second line of the script, as long as it matches the expression coding[=:]\s*([-\w.]+).
Source
Don't these two statements about the interpreter contradict each other? What the hell is going on?
You have valid points about the clarity of the documentation.
However, as with many other languages (HTML, XML, JSON pre-2017 standard*), the character encoding of a source file/document is determined prior to any lexical or syntactic processing. So it is correct to say "Comments are ignored by the syntax": once the character encoding is determined, processing restarts, and the syntactic processing ignores all comments.
In a sense, there are two languages: 1) one for expressing the character encoding; 2) one for expressing a Python script. The first is designed so that it is accepted by, but has no meaning to, the second.
* Subsequent standards for JSON reduced the set of allowable character encodings from UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE to simply UTF-8.
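As an illustration of my own (not from the answer), a script saved in Latin-1 can declare that encoding in a comment which the tokenizer itself later ignores:
# -*- coding: latin-1 -*-
# The line above matches coding[=:]\s*([-\w.]+), so it is used to decode the
# file; after decoding, both of these lines are treated as ordinary comments.
name = "café"
print(name)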
I am a scientist programming in Fortran, and I came across a strange behaviour. In one of my programs I have a string containing several "words", and I want to read all the words as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all the words, but instead the READ statement repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimalist sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
PROGRAM testread
  IMPLICIT NONE
  CHARACTER(LEN=1024):: string
  CHARACTER(LEN=16):: v1, v2, v3

  string="3*a b c"
  READ(string,*) v1, v2, v3
  PRINT*, v1, v2, v3
END PROGRAM testread
You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
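To see the difference, here is a minimal variation of the question's program (my own sketch) with the first word delimited by apostrophes:
PROGRAM testread_quoted
  IMPLICIT NONE
  CHARACTER(LEN=1024):: string
  CHARACTER(LEN=16):: v1, v2, v3

  ! The apostrophes make '3*a' a delimited string, so it is no longer
  ! interpreted as a repeat count.
  string = "'3*a' b c"
  READ(string,*) v1, v2, v3
  PRINT *, v1, v2, v3      ! prints: 3*a b c (each item padded to 16 characters)
END PROGRAM testread_quoted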
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
Following the answer of @SteveLionel, here is the relevant part of the reference on list-directed sequential READ statements (in this case, for Intel Fortran, but you can find the equivalent for your specific compiler and it won't be much different).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and all of the following are true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading character is not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are four forms of sequential READ statements in Fortran, and you may choose the option that best fits your needs:
Formatted Sequential Read:
To use this, you change the * to an actual format specifier. If you know the lengths of the strings in advance, this would be as easy as '(a3,a2,a2)'. Otherwise, you could come up with a format specifier that matches your data, but this generally requires knowing the length or layout of the input.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As already shown, this kind of I/O comes with a lot of magic and surprising behavior. What is hitting you is the n*cte form, which is interpreted as n repetitions of the literal cte.
As said by Steve Lionel, you could put quotation marks around the problematic word so it is parsed as one piece. Or, as proposed by @evets, you could split your string using the intrinsics index or scan. Another option would be changing your wildcard from an asterisk to anything else.
Formatted Namelist:
Well, that could be an option if your data was (or could be) presented in the namelist format, but I really think it's not your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of an I/O operation. There is no intrinsic for this, but you could come up with one without much trouble (see this thread for reference). As you may have noted already, manipulating strings in Fortran is... awkward, at least. There are some libraries out there (like this) that may be useful if you are doing lots of string work in Fortran.
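For example, here is a rough sketch of my own (not from the linked thread) that splits on blanks with the INDEX intrinsic instead of doing an internal READ:
PROGRAM testsplit
  IMPLICIT NONE
  CHARACTER(LEN=1024):: rest
  INTEGER:: pos

  rest = ADJUSTL("3*a b c")
  DO WHILE (LEN_TRIM(rest) > 0)
    pos = INDEX(rest, " ")            ! position of the first blank
    PRINT *, TRIM(rest(1:pos-1))      ! the word before that blank
    rest = ADJUSTL(rest(pos:))        ! drop the word and any leading blanks
  END DO
END PROGRAM testsplit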
I was playing around with strings and discovered that Haskell (correctly) disallows characters above Unicode code point 0x10ffff (i.e. one gets something like a "sequence out of range" error if one attempts to use something above this limit). Out of curiosity, I played around with the Unicode surrogate halves (0xd800 to 0xdfff) - invalid Unicode code points - and discovered that they seem to be permitted. I am curious as to why this is. Is it simply because being a bounded item means only defining a maximum and a minimum?
Disallowing the surrogate code units would indeed make Char a more correct type for Unicode code points. The Report says that Char is "an enumeration whose values represent Unicode characters", so probably this should be considered a GHC bug.
There's no specific notion of "a bounded item", but it would require extra checks in various places (right now chr just needs to make one comparison to check if its argument is valid, for instance) and possibly make some things behave more strangely (if people indirectly expect code points to be contiguous).
I don't know that there's an especially good rationale for it, though, or that the trade-off was even considered originally. In Haskell 1.4, Char was just a 16-bit type, so it would have been natural to extend it to 17*2^16 values without adding extra checks. This issue is occasionally brought up -- I've brought it up before -- but most people don't seem to worry about it very much. It's probably reasonable to file a GHC bug about it, though, to get a proper discussion going.
Note that Data.Text (which uses UTF-16 as its internal representation) does disallow the invalid code units (it has to).
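A small sketch of my own showing the current behaviour (the exact error wording for out-of-range arguments may differ between GHC versions):
import Data.Char (chr, ord)

main :: IO ()
main = do
  print (chr 0xD800)             -- '\55296': a surrogate, accepted by chr
  print (ord (maxBound :: Char)) -- 1114111 == 0x10FFFF, the top of the range
  -- chr 0x110000 fails at runtime with a "bad argument" error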
The Haskell 2010 Language Report says:
Haskell uses the Unicode [2] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.
Does this mean UTF-8?
In ghc-7.0.4/compiler/parser/Lexer.x.source:
$unispace = \x05 -- Trick Alex into handling Unicode. See alexGetChar.
$whitechar = [\ \n\r\f\v $unispace]
$white_no_nl = $whitechar # \n
$tab = \t
$ascdigit = 0-9
$unidigit = \x03 -- Trick Alex into handling Unicode. See alexGetChar.
$decdigit = $ascdigit -- for now, should really be $digit (ToDo)
$digit = [$ascdigit $unidigit]
$special = [\(\)\,\;\[\]\`\{\}]
$ascsymbol = [\!\#\$\%\&\*\+\.\/\<\=\>\?\#\\\^\|\-\~]
$unisymbol = \x04 -- Trick Alex into handling Unicode. See alexGetChar.
$symbol = [$ascsymbol $unisymbol] # [$special \_\:\"\']
$unilarge = \x01 -- Trick Alex into handling Unicode. See alexGetChar.
$asclarge = [A-Z]
$large = [$asclarge $unilarge]
$unismall = \x02 -- Trick Alex into handling Unicode. See alexGetChar.
$ascsmall = [a-z]
$small = [$ascsmall $unismall \_]
$unigraphic = \x06 -- Trick Alex into handling Unicode. See alexGetChar.
$graphic = [$small $large $symbol $digit $special $unigraphic \:\"\']
...I'm not sure what to make of this. alexGetChar wasn't really helpful.
There was a proposal to standardize on UTF-8 as the standard encoding of Haskell source files, but I'm not sure if it was accepted or not.
In practice, GHC assumes all input files are UTF-8, but it ignores malformed byte sequences in comments.
Unicode is a character set. UTF-8, UTF-16, etc. are concrete physical encodings of Unicode code points. Try reading here; the difference is explained pretty well there.
The cited part of the report just states that Haskell sources use the Unicode character set. It doesn't state which encoding should be used at all. In other words, it says which characters may appear in the sources, but it doesn't say how they are written in terms of plain bytes.
While the Haskell standard simply says Unicode is the set of possible characters (as opposed to e.g. ASCII or Latin-1), it doesn't specify which of the several different encodings (UTF-8, UTF-16, UTF-32, byte order) to use.
Alex, the lexer that comes with the Haskell Platform, requires its input to be UTF-8 encoded*, which is why you see the code you mention. In practice, I think all the major implementations of Haskell require source to be in UTF-8.
* This is actually a real problem, as GHC stores strings, and more importantly Data.Text, internally as UTF-16. It would be nice to be able to lex these directly rather than converting back and forth.
There is an important distinction between the data type (i.e. what “abstract” data you can work with) and its representation (i.e. how it is stored in the computer memory or on disk).
The Haskell Report says two things related to Unicode:
That the Char data type in Haskell represents a Unicode character (also known as code point). You should think of it as of an abstract data type that provides a certain interface (e.g. you can call isDigit or toLower on it), but you are not allowed to know how exactly it is represented internally. The specific implementation of Haskell (e.g. GHC) is free to represent it in memory in whatever way it wants and it doesn’t matter at all, as you can’t access the underlying raw bits anyway.
That a Haskell program is text, consisting of (abstract) Unicode code points, that is, essentially, a String. And then it goes on to explain how to parse this String. Once again, it is important to stress that it defines the syntax of Haskell in terms of sequences of abstract Unicode code points.
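To illustrate the first point with a sketch of my own: you only ever work with a Char through its interface, and nothing in the program says how it is stored:
import Data.Char (isDigit, ord, toLower)

main :: IO ()
main = do
  print (isDigit '7')   -- True
  print (toLower 'Λ')   -- '\955', i.e. the lowercase lambda
  print (ord 'λ')       -- 955; still says nothing about bytes on disk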
Now, to your question about Haskell source code. The Haskell Report does not specify how this Unicode text is encoded into zeroes and ones when stored in a file.
In fact, the Haskell Report does not specify how Haskell programs are stored at all! It doesn’t mention that Haskell source code is stored in files, that files have to be named after modules, and that the directory structure should follow the structure of module names – these all are considered to be compiler implementation details, and the idea is that this allows each compiler to store Haskell programs wherever and however they want: in files, in database tables, as jpeg photos of a blackboard with a program written on it with chalk. For this reason it does not specify the encoding either (it would make no sense to specify the encoding for a program written out on a blackboard 😕).
However, GHC, the de-facto standard Haskell compiler, assumes that Haskell programs are stored in files encoded as UTF-8, organised hierarchically, and named after module names.