In Python 3, why can encoding declarations be written as a comment?

Section 2.1.3 of the Python Language Reference says:
Comments are ignored by the syntax.
While I'm not entirely sure about this, I believe this means the Python interpreter will ignore comments.
In contrast, section 2.1.4 says:
If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration.
This also seems to be a statement of fact about the Python interpreter: that it does not ignore a comment if it's in the first or second line of the script, as long as it matches the expression coding[=:]\s*([-\w.]+).
Don't these two statements about the interpreter contradict each other? What the hell is going on?

You have valid points about the clarity of the documentation.
However, as with many other languages (HTML, XML, JSON pre-2017 standard*), the character encoding of a source file or document is determined before any lexical or syntactic processing of the language. So it is correct to say, "Comments are ignored by the syntax": once the character encoding is determined, processing restarts, and the syntactic processing ignores all comments.
In a sense, there are two languages: 1) for expressing the character encoding; 2) for expressing a Python script. The first one is designed so it is accepted by but has no meaning to the second.
* Subsequent standards for JSON reduce the set of allowable character encodings from UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE to simply UTF-8.
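The two passes described in this answer can be sketched in Python itself. This is only an illustration, not a description of CPython's internals: it uses the standard-library helper tokenize.detect_encoding for the first pass and the regular expression quoted in the question; the sample bytes and all variable names are made up for the example.

```python
import io
import re
import tokenize

# Regular expression quoted from section 2.1.4 of the language reference.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

# Raw bytes of a hypothetical two-line script saved as Latin-1.
raw_source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# Pass 1: look only at the first lines, as bytes, to find the encoding.
encoding, _ = tokenize.detect_encoding(io.BytesIO(raw_source).readline)

# Pass 2: decode the whole file. From here on, the first line is an
# ordinary comment that the parser skips like any other comment.
text = raw_source.decode(encoding)
declared = CODING_RE.search(text.splitlines()[0]).group(1)

namespace = {}
exec(compile(text, "<example>", "exec"), namespace)
```

Here declared holds the name as written in the comment, while encoding holds the normalized name that detect_encoding reports for it.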

Read substrings from a string containing multiplication [duplicate]

This question already has answers here:
'*' and '/' not recognized on input by a read statement
(2 answers)
Closed 4 years ago.
I am a scientist programming in Fortran, and I came across a strange behaviour. In one of my programs I have a string containing several "words", and I want to read all words as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all words, but instead, the READ statement repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimal sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
PROGRAM testread
  IMPLICIT NONE
  CHARACTER(LEN=1024) :: string
  CHARACTER(LEN=16) :: v1, v2, v3
  string = "3*a b c"
  READ(string,*) v1, v2, v3
  PRINT*, v1, v2, v3
END PROGRAM testread
You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
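To make the repeat-count rule concrete, here is a rough Python model of how the record is expanded before values are assigned. It is a deliberately simplified sketch, not a Fortran implementation: the function name expand_list_directed is hypothetical, and null values, the slash terminator, and forms such as r*'quoted string' are not modelled.

```python
import re

# Delimited strings ('...' or "...") or runs of non-separator characters.
TOKEN_RE = re.compile(r"""'[^']*'|"[^"]*"|[^\s,]+""")
# An undelimited item of the form r*c, where r is a string of digits.
REPEAT_RE = re.compile(r"(\d+)\*(.*)", re.DOTALL)


def expand_list_directed(record):
    """Return the values a simplified list-directed READ would see."""
    values = []
    for token in TOKEN_RE.findall(record):
        if token[0] in "'\"":
            # Delimited string: taken literally, no repeat-count processing.
            values.append(token[1:-1])
            continue
        match = REPEAT_RE.fullmatch(token)
        if match:
            # r*c stands for r copies of c.
            values.extend([match.group(2)] * int(match.group(1)))
        else:
            values.append(token)
    return values


# Only the first three values are assigned to v1, v2, v3.
unquoted = expand_list_directed("3*a b c")[:3]
quoted = expand_list_directed("'3*a' b c")[:3]
```

In this model, unquoted comes out as three copies of "a", matching the output in the question, while quoted keeps "3*a" as a single value.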
Following the answer of @SteveLionel, here is the relevant part of the reference on list-directed sequential READ statements (in this case, for Intel Fortran, but you could find it for your specific compiler and it won't be much different).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and the following is true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading character is not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are 4 forms of sequential read statements in Fortran, and you may choose the option that best fits your need:
Formatted Sequential Read:
To use this, you change the * to an actual format specifier. If you know the lengths of the strings in advance, this would be as easy as '(a3,a2,a2)'. Or you could come up with a format specifier that matches your data, but this generally requires knowing the length or format of the input.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As already shown, this kind of I/O comes with a lot of magic and surprising behavior. What is hitting you is the n*c form, which is interpreted as n repetitions of the constant c.
As said by Steve Lionel, you could put quotation marks around the problematic word, so it will be parsed as one piece. Or, as proposed by @evets, you could split or break your string using the intrinsics index or scan. Another option could be changing your wildcard from an asterisk to anything else.
Formatted Namelist:
Well, that could be an option if your data was (or could be) presented in the namelist format, but I really think it's not your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of an I/O operation. There is no intrinsic for this, but you could come up with one without much trouble (see this thread for reference). As you may have noted already, manipulating strings in Fortran is... awkward, at least. There are some libraries out there (like this) that may be useful if you are doing lots of string stuff in Fortran.
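Two of the alternatives in this answer, fixed-width fields and plain splitting without any I/O statement, can be sketched as follows. The sketch is in Python only to show the idea; the widths mirror the '(a3,a2,a2)' format mentioned above, and the variable names are made up for the example.

```python
record = "3*a b c"

# Analogue of the explicit format (a3,a2,a2): cut fixed-width fields.
widths = (3, 2, 2)
fields = []
start = 0
for width in widths:
    fields.append(record[start:start + width])
    start += width

# Analogue of splitting on blanks with a helper function instead of a READ.
words = record.split()
```

Note that the fixed-width fields keep the blank that precedes "b" and "c", whereas splitting on blanks drops it; neither approach gives the asterisk any special meaning.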

By means of what language design principle is a string instantiated either as a string or as a variable in GNU Octave?

Having an Octave script (in the sense of dynamic languages here) move.m defining function move(direction), it can be invoked from another script (alternatively from the command line) in different ways: move left, move('left') or move(left). While the first two will instantiate direction with the string 'left', the last one will consider left as a variable.
The question is about the formal principle in language definition behind this. I understand that in the first mode, the script is invoked as a command, considering that the rest of the command line is just data, not variables (pretty much as in a Linux prompt); while in the last two it is called as a function, interpreting what follows (between parentheses) as either data or variables. If this is a general design criterion among scripting languages, what is the principle behind it?
To answer your question: yes, this is by design, and it's syntactic sugar offered by MATLAB (and hence Octave) for running certain functions that expect only string arguments. Here is the relevant section in the MATLAB manual: https://uk.mathworks.com/help/matlab/matlab_prog/command-vs-function-syntax.html
I should clarify some misconceptions though. First, it's not "data" vs "variables". Any argument supplied in command syntax is simply interpreted as a string. So these two are equivalent:
fprintf("1")
fprintf 1
I.e., in fprintf 1, the 1 is not numeric data. It's a string.
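The same idea can be modelled in a few lines of Python. This is only a toy model of the rule "every word after the function name becomes a string argument"; Octave's real command-syntax parsing has more rules, and the names move, FUNCTIONS and run_command_syntax are hypothetical stand-ins.

```python
def move(direction):
    """Stand-in for the move function from the question."""
    return ("move", direction, type(direction).__name__)


# Hypothetical table of functions that can be called by name.
FUNCTIONS = {"move": move}


def run_command_syntax(line):
    """Treat every word after the function name as a string argument."""
    name, *words = line.split()
    return FUNCTIONS[name](*words)


left = 3  # a variable named left, seen only by the function-syntax call

as_command = run_command_syntax("move left")   # like: move left
as_function_with_string = move("left")         # like: move('left')
as_function_with_variable = move(left)         # like: move(left)
```

The first two calls receive the string "left"; only the last one receives the value of the variable.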
Secondly, not all m files are "scripts". You calling your m file a script caused me some confusion. Your particular file contains a function definition and nothing else, so it's a function, 100%.
The reason this is important here is that all functions can be called either via functional syntax or command syntax (as long as it makes sense in terms of the expected arguments being strings), whereas scripts take no arguments, so there is no functional / command syntax at play, and if you are passing 'arguments' to a script you're doing something wrong.
I understand that in the first mode, the script is invoked as a command [...]
As far as Octave goes, you are better off forgetting about that distinction. I'm not sure if a "command" ever existed, but it certainly does not exist now. The command syntax is just syntactic sugar in Octave. It makes interactive plot adjustment simpler, since the functions involved mainly take string arguments.

Contents of exceptions in ISO EBNF

In the ISO 14977 EBNF standard, section 4.7, the legal contents of an exception are described. I'm fairly certain that an exception may consist of any valid syntactic factor, as long as it doesn't contain any meta-identifiers, which means that I could use a special sequence as an exception like this:
syntax =
my rule - ? Something clever ? ;
Is this the case?
You can get the standard for free at ISO here.
No, that is not correct. You can use meta-identifiers in an exception as long as, when fully evaluated, they result in terminals. The example given in section 4.7 shows a meta-identifier defined in terms of itself, which can never fully resolve to a terminal. That is what they mean when they say:
...could equally be represented by a syntactic-factor containing no meta-identifiers.
There are lots of examples of this in the standard itself; check out section 8. For example:
gap free symbol = terminal character - (first quote symbol | second quote symbol)
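One way to read the rule above is as a set difference: everything the left-hand factor can match, minus everything the exception can match. A minimal Python sketch of that reading follows, assuming a deliberately tiny, hypothetical stand-in for terminal character (the real definition in the standard is much larger).

```python
# Both quote symbols resolve directly to terminals.
first_quote_symbol = {"'"}
second_quote_symbol = {'"'}

# Hypothetical, much reduced stand-in for "terminal character".
terminal_character = set("abc123") | first_quote_symbol | second_quote_symbol

# gap free symbol = terminal character - (first quote symbol | second quote symbol)
gap_free_symbol = terminal_character - (first_quote_symbol | second_quote_symbol)
```

The difference is well defined here because the meta-identifiers inside the exception resolve to a fixed set of terminals, which is the condition the answer describes.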

Where is it specified whether Unicode identifiers should be allowed in a Haskell implementation?

I wanted to write some educational code in Haskell with Unicode characters (non-Latin) in the identifiers. (So that the identifiers look nice and natural for speakers of a natural language other than English which is not using the Latin characters in its writing.) So, I set out for finding an appropriate Haskell implementation that would allow this.
But where is this feature specified in the language specification? How would I refer to this feature when looking for a conforming implementation? (And which Haskell implementations are known to actually support Unicode identifiers?)
It turned out that one Haskell implementation did accept my code with Unicode identifiers, whereas another one failed to accept it. I would like it if there were a way to formalize this requirement of my code, in a form of a language feature switch perhaps, so that if I or someone else tries to run my code, it would be immediately clear whether his implementation is missing the required feature and hence he should look for another one. (There could be also a wiki page for this feature--"Unicode identifiers", which would list which of the existing implementations support it, so that one would know where to go if one needs it.)
(BTW, I have put a "syntax" tag on this question, but I actually perceive it to be an issue of the level of lexing, a lower level than the syntax of a language. Is there a tag here for features of the lexing level of a language, rather than for features of the syntax specification of a language?)
The Online Report documents this under Lexemes. It also notes early on that "Haskell uses the Unicode character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell.".
Actual compilers may or may not support Unicode identifiers. GHC does, but you need to keep in mind that Unicode codepoints must obey the same rules as ASCII characters: types must start with a codepoint which is classed as uppercase or titlecase, variables as lowercase (although de facto this is relaxed to alphabetic and not uppercase/titlecase; this might be worth asking for a clarification from the language committee), operators must be punctuation or symbol. (This means that you can't declare types in Arabic, for example, unless you prefix them with a character in some other script that is uppercase/titlecase.)
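The character classes mentioned above can be inspected with Python's unicodedata module. The sketch below only applies the rule as this answer states it (uppercase or titlecase for types, any other alphabetic character for variables, following the relaxed reading); it is not GHC's lexer, and the function name leading_char_role is made up for the example.

```python
import unicodedata


def leading_char_role(ch):
    """Classify a leading character by the rule described in the answer."""
    category = unicodedata.category(ch)
    if category in ("Lu", "Lt"):
        # Uppercase or titlecase letter: may start a type or constructor.
        return "type or constructor"
    if ch == "_" or ch.isalpha():
        # Relaxed reading: any other alphabetic character may start a variable.
        return "variable"
    return "not an identifier start"


# Latin, Cyrillic and Arabic sample characters.
roles = {ch: leading_char_role(ch) for ch in ("T", "x", "Д", "д", "ع")}
arabic_category = unicodedata.category("ع")
```

The Arabic letter is in category Lo (a letter with no case), which is why, under this rule, it can start a variable name but not a type name.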
As to collecting Unicode support information: while I don't know of a single page that provides it, searching for "unicode" on the Haskell Wiki finds information about Unicode support in a number of Haskell compilers.

Native newline characters?

What's the best way to determine the native newline characters such as '\n' or '\r\n' in Haskell?
I see there is a "nativeNewline" function in GHC.IO.Handle, but assume that it is both a private API and, most of all, non-standard Haskell.
You should think of the newline representation as part of the encoding of a text file that is stored in the filesystem, just like UTF-8. A text file is normally decoded when you read it into your program, and encoded when written -- converting to and from the native newline representation is done as part of this encoding and decoding. Inside your Haskell program, just as characters are represented by their Unicode code points, the newline character is always \n.
To tell the I/O system about the newline encoding you want to use, see the section on Newline Conversion in the documentation for System.IO.
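The idea that the newline representation is part of a file's encoding is not specific to Haskell. As an analogy only, here is the same round trip using Python's built-in open and its newline parameter (this is Python's I/O layer, not Haskell's System.IO; the file name is arbitrary).

```python
import os
import tempfile

# Inside the program, a line break is always "\n".
text = "first\nsecond\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Encoding step: "\n" is written to disk as CRLF.
    with open(path, "w", encoding="utf-8", newline="\r\n") as handle:
        handle.write(text)

    # The bytes actually stored in the file.
    with open(path, "rb") as handle:
        raw = handle.read()

    # Decoding step: any newline convention is read back as "\n".
    with open(path, "r", encoding="utf-8", newline=None) as handle:
        decoded = handle.read()
```

The program never handles the CRLF sequence itself; the conversion happens at the boundary between the program and the file, as this answer describes.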
System.IO.nativeNewline is not private - you can access it to find out what GHC considers the native "newline" to be on the current platform.
Note that the type of this variable, System.IO.Newline, does not have a Show instance as of GHC 6.12.3. So you can't easily print its value. Instead, check to see if it is equal to System.IO.LF or System.IO.CRLF.
However, as Simon pointed out, you shouldn't need to know about the native newline sequence with normal usage of the text-oriented IO functions in GHC.
This variable, together with the rest of the new Unicode-aware capabilities of the IO system, is not yet part of the Haskell standard. It was not included in the Haskell 2010 report. However, since it is already implemented in GHC, and there is quite a widespread consensus that it is important and useful, expect it to be included in one of the upcoming yearly revisions of the standard.
