Why can't an identifier start with a number? - rust

I have a file named 1_add.rs, and I tried to add it to lib.rs, but I got the following error during compilation.
error: expected identifier, found `1_add`
--> src/lib.rs:1:5
|
1 | mod 1_add;
| ^^^^^ expected identifier
It seems that an identifier starting with a digit is invalid. But why does Rust have this restriction? Is there any workaround if I want to indicate the sequence of different Rust files (for managing the exercise files)?

In your case (you want to name the files like 1_foo.rs) you can write
#[path="1_foo.rs"]
mod mod_1_foo;
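For example, a minimal src/lib.rs might look like this (the second file name is hypothetical; the #[path] value is resolved relative to the directory containing lib.rs):
#[path = "1_add.rs"]      // the file from the question, src/1_add.rs
mod mod_1_add;

#[path = "2_sub.rs"]      // hypothetical second exercise file
mod mod_2_sub;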
Allowing identifiers to start with digits can conflict with type annotations. E.g.
let foo = 1_u32;
sets the type to u32. It would be confusing if 1_u256 could mean another variable instead.
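A minimal sketch of the ambiguity (u256 is hypothetical; it is not a built-in Rust integer type):
fn main() {
    let foo = 1_u32;      // integer literal 1 with an explicit u32 type suffix
    // let bar = 1_u256;  // today this is a literal with an (invalid) suffix;
    //                    // if identifiers could start with a digit, it could
    //                    // just as well be a variable named 1_u256
    println!("{}", foo);
}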

But why does Rust have this restriction?
Not only Rust: nearly every language I've written a line of code in has this restriction as well.
Food for thought:
let a = 1_2;
Is 1_2 a variable name, or is it a literal for the value 12? And if the variable 1_2 does not exist now but you add it later, does this token stop being a number literal?
While the Rust compiler probably could make it work, it's not worth all the confusion, IMHO.

Allowing identifiers to start with a digit would cause conflicts with many other token types. Here are a few examples:
1e1 is a floating point number.
0x0 is a hexadecimal integer.
8u8 is an integer with explicit type annotation.
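As a quick sketch, each of these compiles as a single literal token today:
fn main() {
    let a = 1e1;    // floating-point literal: 10.0
    let b = 0x0;    // hexadecimal integer literal: 0
    let c = 8u8;    // integer literal with an explicit type suffix: 8 of type u8
    println!("{} {} {}", a, b, c);
}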
Most importantly, though, I believe allowing identifiers to start with a digit would hurt readability. Currently everything starting with a digit is some kind of number, which in my opinion helps when reading code.
An incomplete list of programming languages that do not allow identifiers to start with a digit: Python, Java, JavaScript, C#, Ruby, C, C++, Pascal. I can't think of a language that does allow it (though one most likely exists).

Rust identifiers are based on Unicode® Standard Annex #31 (see The Rust RFC Book), which standardizes some common rules for identifiers in programming languages. It might also make it easier to parse text that could otherwise be ambiguous, such as 1e10.

"Why?" cannot be reasoned here but by historical tales, the rules are as such. You cannot play against them.
If you really want your identifiers to start with a digit, at least for human readers, prepend an underscore, like this: _1_add.
Note: to make sure that sorting works well, also use as many leading zeroes as appropriate (_001_add if you expect more than 99 files).
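With that scheme, src/lib.rs might look like this (file names beyond the first are hypothetical):
mod _001_add;   // src/_001_add.rs
mod _002_sub;   // hypothetical src/_002_sub.rs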

Related

allow(uncommon_codepoints) is ignored unless specified at crate level

I wrote this the other day:
let µ = ... some expression ...
(As it happens, the µ sign is easy to type on my keyboard, just AltGr+m. This is why I'm in the habit of using this letter quite often, especially when small values are involved.)
Now I got this:
identifier contains uncommon Unicode codepoints
`#[warn(uncommon_codepoints)]` on by default
No problem, I'll just allow it, I thought, and put this at the front:
#![allow(uncommon_codepoints)]
But no, it stubbornly objects to my Greek:
allow(uncommon_codepoints) is ignored unless specified at crate level
`#[warn(unused_attributes)]` on by default
I would think it is at least debatable what "uncommon" exactly is. But I'm not really interested in that discussion, as long as I can turn it off.
So please ... how exactly do I specify something at the crate level? I tried it in main.rs and lib.rs, but it won't accept it.
Edit
This really starts to become interesting:
I put the line
#![allow(uncommon_codepoints)]
as line 1 in main.rs, and it now stops complaining about the unused attribute. However, the "uncommon codepoints" warning still appears when compiling the file that contains it (i.e. with cargo build). I am on rustc 1.58.1 (stable, AFAIK).
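For clarity, this is the placement I mean, as a minimal single-file sketch:
#![allow(uncommon_codepoints)]   // crate-level attribute, first thing in main.rs

fn main() {
    let µ = 0.001;               // U+00B5 MICRO SIGN, typed with AltGr+m
    println!("{}", µ);
}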
I also found out that what my keyboard produces is not U+03BC GREEK SMALL LETTER MU but U+00B5 MICRO SIGN. It's still a letter, lowercase. Now, the interesting thing is: the uncommon-codepoint warning does not appear for a genuine Greek mu, but for the micro sign it does!
Is there any other place where I can turn off annoying and (from my point of view utterly useless) warnings? In general, I highly appreciate Rust's detailed and often helpful warnings (though lately I found myself creating an unused HashSet just to avoid the warnings about unused imports; hey, I know I will need this later, so please stop nagging), but this Unicode thing is a bit overdone. It's a valid variable name according to Rust's lexical syntax, and I really do want to use it. Period.

Are operations like `2*7` considered literals?

I just had a small question.
Are operations considered literals? Would 2*7, for example, be a literal? Is "hello, " + "world!" a literal?
I know the operands are literals, but the expression is not explicitly 14 or "hello, world!".
The question Is 2+3 considered as a literal?
asks basically what I am asking, but most answers weren't really helpful: all they do is break the variable declaration down or talk about what compilers do with them. I'm not looking for that, so I would like a more in-depth explanation.
Thank you
It will depend on the language and the compiler, sorry. But going just by the concept that a literal is a kind of token: no, the result is a compile-time constant, not a single token.
In C/C++, 2*7 will be optimised by the compiler into a new constant, but it isn't actually legally defined as a literal, though it can be treated as a compile-time constant.
Concatenating "hello" "world" (note no plus) is actually described as a preprocessing step in c++, so does generate a new literal constant string, but then in original C this didn't work.
But note that in C, a macro will treat the parameter phrase 2+7 as separate tokens: with #define STUPIDMUL3(val) 3 * val, the call STUPIDMUL3(2+7) expands to 3 * 2+7 and gives the answer 13, not 27. If you could force the macro to treat the argument as a single unit (for instance by writing (val) in the definition), you would get the expected result.
I would expect an interpreter to take longer to process 2*7 than 14, because it might interpret and evaluate the expression every time.
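To make the token point concrete, here is a small Rust sketch (purely illustrative; the same distinction applies in C/C++): 2 * 7 is a constant expression built from two literal tokens, and the compiler folds it to 14, but the source never contains a single token 14.
const PRODUCT: i32 = 2 * 7;   // constant expression, folded at compile time

fn main() {
    let n = 2 * 7;            // the tokens here are `2`, `*`, `7`; three
                              // tokens, two of them integer literals
    assert_eq!(n, PRODUCT);   // both are 14 at run time
    println!("{}", n);
}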

Haskell Char and unicode surrogates

I was playing around with strings and discovered that Haskell (correctly) disallows characters above Unicode code point 0x10ffff (i.e. one gets something like a "sequence out of range" error if one attempts to use something above this limit). Out of curiosity, I played around with the Unicode surrogate halves (0xd800 to 0xdfff), which are not valid as characters, and discovered that they seem to be permitted. I am curious as to why this is. Is it simply because being a Bounded type only means defining a maximum and a minimum?
Disallowing the surrogate code units would indeed make Char a more correct type for Unicode code points. The Report says that Char is "an enumeration whose values represent Unicode characters", so probably this should be considered a GHC bug.
There's no specific notion of "a bounded item", but it would require extra checks in various places (right now chr just needs to make one comparison to check if its argument is valid, for instance) and possibly make some things behave more strangely (if people indirectly expect code points to be contiguous).
I don't know that there's an especially good rationale for it, though, or that the trade-off was even considered originally. In Haskell 1.4, Char was just a 16-bit type, so it would have been natural to extend it to 17*2^16 values without adding extra checks. This issue is occasionally brought up -- I've brought it up before -- but most people don't seem to worry about it very much. It's probably reasonable to file a GHC bug about it, though, to get a proper discussion going.
Note that Data.Text (which uses UTF-16 as its internal representation) does disallow the invalid code units (it has to).

Does case sensitivity have anything to do with strongly typed languages (or loosely typed languages)?

(I admit this may be a n00b question - I know very little about CS theory, mostly a hands-on/hobby sort.)
I was googling for the official definition of "strongly-typed language", and one of the top links I found was from Yahoo Answers, which suggested that case sensitivity was part of whether a language is loosely/strongly typed.
I had always thought the simple answer to the difference between a strongly typed and a weakly typed language is that the former requires explicit type declarations, while the latter is more open, even "dynamic".
The two S/O threads (here and here) I found so far seem to suggest that (more or less), but they don't mention anything about case sensitivity. Is there any relation at all between case sensitivity and strong/weak typing?
A couple of clarifications:
Case sensitivity has nothing to do with strong vs. weak typing, static vs. dynamic typing or any other property of the type system. I don't know why the answer on yahoo answers has gotten its one upvote, but it's completely wrong. Just ignore it.
Strong typing isn't a well-defined term, but it is often used to refer to languages with few implicit type conversions, i.e. languages where it is an error to perform operations on types that do not support that operation.
As an example multiplying the strings "foo" and "bar" gives 0 as the result in perl, while it causes a type error in ruby, python, java, haskell, ml and many other languages. Thus those languages are more strongly typed than perl.
Strong typing is also sometimes used as a synonym for static typing.
A statically typed language is a language in which the types of variables, functions and expressions are known at compile time (or before runtime anyway - a statically typed language need not be compiled per se, though in practice it usually is). This means that if a statically typed program contains a type error, it will not run. If a dynamically typed program contains a type error it will run up to the point where the error happens and then crash.
Whether a language requires type annotations is (somewhat) independent of whether its type system is strong or weak, or static or dynamic. In theory, a dynamically typed language could require (or at least allow) type annotations and then throw runtime errors when those annotations are broken (though I don't know of any dynamically typed language that actually does this).
More importantly there are many statically and strongly typed languages (e.g. Haskell, ML) that don't require type annotations, but instead use type inference algorithms to infer the types.
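As a small illustration, here is a Rust sketch of those points (Rust is statically and fairly strongly typed, with type inference):
fn main() {
    let n = 1;          // no annotation needed: the type is inferred
    let s = "foo";      // inferred as a string slice

    // let x = s * n;   // rejected at compile time: strings do not support
    //                  // multiplication, and nothing is converted implicitly
    println!("{} {}", n, s);
}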
In theory, case sensitivity is completely unrelated to type strictness. Case sensitivity is about whether the identifiers foo, FOO, and fOo refer to the same variable, function, or what-have-you. Type strictness is about whether variables have types or just values do, how easy it is to convert among types, and so on.
In practice, there might be a correlation between case sensitivity and type strictness, but I can't think of enough case-insensitive languages right now to make an assessment. My impression is that most languages commonly used today are case sensitive — possibly because C was case sensitive and very influential, possibly because it was the only way to force people to stop PROGRAMMING IN ALL CAPS after a couple decades of FORTRAN, COBOL, and BASIC.
No - they're not connected. Strongly typed languages force you to specify the type of data that a variable may hold - such as a real number, an integer, a textual string, or some programmer-defined object. That way you can't accidentally assign another type of data into that variable unless it is implicitly convertible: for example, you can generally put an integer into a real number (i.e. double x = 3.14; x = 3; is OK, but int x = 3; x = 3.14; might not be, depending on how strongly typed the language is). Weakly typed languages just store whatever they're asked to, without doing these sanity checks. In strongly typed languages like C++, you can still create a type that can store data of any of a specific set of types (e.g. C++'s boost::variant), but such types are sometimes a bit more limited in what they can do and how convenient they are to use.
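A sketch of those assignments in Rust, which happens to be strict enough that even the integer-to-real direction must be spelled out:
fn main() {
    let mut x: f64 = 3.14;
    println!("{}", x);
    // x = 3;           // rejected at compile time: an integer literal is not
    //                  // implicitly converted to f64
    x = 3 as f64;       // accepted once the conversion is written explicitly
    println!("{}", x);
}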
Case insensitivity means that the uppercase and lowercase versions of the same letter are considered equivalent for some purposes, normally in a string comparison or regular expression match. It is unusual, but not unheard of, for modern computer languages to ignore the case of letters in variable names (identifiers).

Why do a lot of programming languages put the type *after* the variable name?

I just came across this question in the Go FAQ, and it reminded me of something that's been bugging me for a while. Unfortunately, I don't really see what the answer is getting at.
It seems like almost every non C-like language puts the type after the variable name, like so:
var : int
Just out of sheer curiosity, why is this? Are there advantages to choosing one or the other?
There is a parsing issue, as Keith Randall says, but it isn't what he describes. The "not knowing whether it is a declaration or an expression" simply doesn't matter - you don't care whether it's an expression or a declaration until you've parsed the whole thing anyway, at which point the ambiguity is resolved.
Using a context-free parser, it doesn't matter in the slightest whether the type comes before or after the variable name. What matters is that you don't need to look up user-defined type names to understand the type specification - you don't need to have understood everything that came before in order to understand the current token.
Pascal syntax is context-free - if not completely, at least WRT this issue. The fact that the variable name comes first is less important than details such as the colon separator and the syntax of type descriptions.
C syntax is context-sensitive. In order for the parser to determine where a type description ends and which token is the variable name, it needs to have already interpreted everything that came before so that it can determine whether a given identifier token is the variable name or just another token contributing to the type description.
Because C syntax is context-sensitive, it is very difficult (if not impossible) to parse using traditional parser-generator tools such as yacc/bison, whereas Pascal syntax is easy to parse using the same tools. That said, there are parser generators now that can cope with C and even C++ syntax. Although it's not properly documented or in a 1.? release etc, my personal favorite is Kelbt, which uses backtracking LR and supports semantic "undo" - basically undoing additions to the symbol table when speculative parses turn out to be wrong.
In practice, C and C++ parsers are usually hand-written, mixing recursive descent and precedence parsing. I assume the same applies to Java and C#.
Incidentally, similar issues with context sensitivity in C++ parsing have created a lot of nasties. The "Alternative Function Syntax" for C++0x is working around a similar issue by moving a type specification to the end and placing it after a separator - very much like the Pascal colon for function return types. It doesn't get rid of the context sensitivity, but adopting that Pascal-like convention does make it a bit more manageable.
The 'most other' languages you speak of are those that are more declarative. They aim to let you program more along the lines you think in (assuming you aren't boxed into imperative thinking).
Type-last reads as 'create a variable called NAME of type TYPE'.
This is of course the opposite of saying 'create a TYPE called NAME', but when you think about it, what the value is for is more important than its type; the type is merely a programmatic constraint on the data.
If the name of the variable starts at column 0, it's easier to find the name of the variable.
Compare
QHash<QString, QPair<int, QString> > hash;
and
hash : QHash<QString, QPair<int, QString> >;
Now imagine how much more readable your typical C++ header could be.
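The same comparison sketched in Rust, with HashMap and a tuple standing in for QHash and QPair:
use std::collections::HashMap;

fn main() {
    let hash: HashMap<String, (i32, String)> = HashMap::new();
    println!("{}", hash.len());    // the name still starts the line
}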
In formal language theory and type theory, it's almost always written as var: type. For instance, in the typed lambda calculus you'll see proofs containing statements such as:
x : A       y : B
-----------------
   \x.y : A->B
I don't think it really matters, but I think there are two justifications: one is that "x : A" is read "x is of type A", the other is that a type is like a set (e.g. int is the set of integers), and the notation is related to "x ε A".
Some of this stuff pre-dates the modern languages you're thinking of.
An increasing trend is to not state the type at all, or to state it optionally. This could be a dynamically typed language where there really is no type on the variable, or it could be a statically typed language which infers the type from the context.
If the type is sometimes given and sometimes inferred, then it's easier to read if the optional bit comes afterwards.
There are also trends related to whether a language regards itself as coming from the C school or the functional school or whatever, but these are a waste of time. The languages which improve on their predecessors and are worth learning are the ones that are willing to accept input from all different schools based on merit, not be picky about a feature's heritage.
"Those who cannot remember the past are condemned to repeat it."
Putting the type before the variable started innocuously enough with Fortran and Algol, but it got really ugly in C, where some type modifiers are applied before the variable, others after. That's why in C you have such beauties as
int (*p)[10];
or
void (*signal(int x, void (*f)(int)))(int)
together with a utility (cdecl) whose purpose is to decrypt such gibberish.
In Pascal, the type comes after the variable, so the first example becomes
p: pointer to array[10] of int
Contrast with
q: array[10] of pointer to int
which, in C, is
int *q[10]
In C, you need parentheses to distinguish this from int (*p)[10]. Parentheses are not required in Pascal, where only the order matters.
The signal function would be
signal: function(x: int, f: function(int) to void) to (function(int) to void)
Still a mouthful, but at least within the realm of human comprehension.
In fairness, the problem isn't that C put the types before the name, but that it perversely insists on putting bits and pieces before, and others after, the name.
But if you try to put everything before the name, the order is still unintuitive:
int [10] a // an int, ahem, ten of them, called a
int [10]* a // an int, no wait, ten, actually a pointer thereto, called a
So, the answer is: A sensibly designed programming language puts the variables before the types because the result is more readable for humans.
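For comparison, here is a sketch of the same declarations in Rust, another name-before-type language, where the order alone keeps p and q apart:
fn handler(_sig: i32) {}

// signal: takes an int and a handler, returns a handler
fn signal(_x: i32, f: fn(i32)) -> fn(i32) {
    f
}

fn main() {
    let arr = [0_i32; 10];
    let p: *const [i32; 10] = &arr;                          // pointer to an array of 10 ints
    let q: [*const i32; 10] = [&arr[0] as *const i32; 10];   // array of 10 pointers to int
    let _prev = signal(2, handler);
    println!("{:?} {}", p, q.len());
}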
I'm not sure, but I think it's got to do with the "name vs. noun" concept.
Essentially, if you put the type first (such as "int varname"), you're declaring an "integer named 'varname'"; that is, you're giving an instance of a type a name. However, if you put the name first, and then the type (such as "varname : int"), you're saying "this is 'varname'; it's an integer". In the first case, you're giving an instance of something a name; in the second, you're defining a noun and stating that it's an instance of something.
It's a bit like if you were defining a table as a piece of furniture; saying "this is furniture and I call it 'table'" (type first) is different from saying "a table is a kind of furniture" (type last).
It's just how the language was designed. Visual Basic has always been this way.
Most (if not all) curly brace languages put the type first. This is more intuitive to me, as the same position also specifies the return type of a method. So the inputs go into the parenthesis, and the output goes out the back of the method name.
I always thought the way C does it was slightly peculiar: instead of constructing types, the user has to declare them implicitly. It's not just before/after the variable name; in general, you may need to embed the variable name among the type attributes (or, in some usage, to embed an empty space where the name would be if you were actually declaring one).
As a weak form of pattern-matching, it is intelligible to some extent, but it doesn't seem to provide any particular advantages either. And trying to write (or read) a function pointer type can easily take you beyond the point of ready intelligibility. So overall this aspect of C is a disadvantage, and I'm happy to see that Go has left it behind.
Putting the type first helps in parsing. For instance, in C, if you declared variables like
x int;
When you parse just the x, then you don't know whether x is a declaration or an expression. In contrast, with
int x;
When you parse the int, you know you're in a declaration (types always start a declaration of some sort).
Given progress in parsing languages, this slight help isn't terribly useful nowadays.
Fortran puts the type first:
REAL*4 I,J,K
INTEGER*4 A,B,C
And yes, there's a (very feeble) joke there for those familiar with Fortran.
There is room to argue that this is easier than C, which puts the type information around the name when the type is complex enough (pointers to functions, for example).
What about dynamically (cheers #wcoenen) typed languages? You just use the variable.

Resources