Why are Nested Comments forbidden? - programming-languages

Why are nested comments forbidden in C++, Java inspite of the fact that nested comments are useful, neat, and elegant and can be used to comment out statements that have comments?

C and C++ do it for ease of parsing. This way, when they hit a comment start of /*, the parser can trivially scan to the end. Otherwise, it would have to set up and maintain a stack, and then report errors if the comment tokens are unmatched.
As to why Java does it, the answer is simple - Java's syntax was designed to emulate C and C++. If nested comments were allowed, it might trip up some C programmers, and many angry blog posts would be written!

At least for C++ that's only partly true, there's no problem with:
/*
//
//
*/
However if you already got an /* comment in a section you want to comment out you might be able to do it by surrounding it with an #if 0 which I think many compilers optimize away. As in:
#if 0
/*
*/
#endif

It is different for each language(-family) but in general they are not 'forbidden' but simply not supported. Supporting them is a design choice.
One reason for the choice (for older languages) may have been ease of parsing.
Note: I remember a C++ compiler where it was an option to let them nest. It was marked as 'non standard'.

Conformance/compatibility with C.

Consider the following example:
/* This is a comment /* Nested Comments */ are not allowed. */
Anything which lies between /* and */ is treated as a comment. In the above example from This up to Comments everything is treated as comments including /*.
Hence, are not allowed. */ does not lie in the comment.
This being an improper C statement error occurs.
While consider this example:
// This is an /* valid */ comment
Whatever lies in the line after // is treated as comment. Since /* valid */ lies in the same line after // , it is treated as part of the comment.

In languages which allow arbitrary-length comments, it is very common for programmers to accidentally comment out or, in some rare cases, "uncomment back in" more or less code than desired. In many cases, such mistakes will result in compilation errors, but in other cases they may result in code which is simply broken. With modern syntax-highlighting editors, a rule that each /* would increment a nested-comments counter and each */ which would decrement it might not cause too much trouble, but without syntax highlighting even trying to figure out what code is commented and what code wasn't could be a major headache.
If one were designing a language from scratch, a good approach might be to specify start-of-comment and end-of-comment marks that included an optional string, and have the language enforce a requirement that the string in an end-of-comment mark must match the string in its corresponding start-of-comment mark. Assuming the syntax was <|String| and |String|>, a compiler could then accept <|1| <|2| |2|> |1|>, but reject <|1| <|2| |1|> |2|>. Even if one favored the use of #if directives rather than comments, the ability to attach markers to ensure that each end-directive was matched up with the intended start directive would be useful.

The problem is that you can't simply "skip" blocks. blocks have unknown number of blocks inside of them, so in order to know when that block is complete, you can't use a regular language, you have to use a stack and walk through each block until the stack is empty, at which point you know that the block is done.
The reason why you can't skip blocks is that the regular expression for the block is only capable of matching the /* with either "the first */ it sees afterwards" or "the last */ it sees afterwards" or "nth */ it sees afterwards".
If we go with the first "*/" token we see, we may have code like
/*
/*
*/
the compiler matched /* with */ and this line and the next one will confuse it.
*/
If we go with the last "*/" token we see, we have code where the compiler will skip things between the first comment in the file and the last comment of the file's end. In this case, the following code would be parsed as an empty string.
/*
/* hello, world! */
*/
int main(int argc, char ** argv) {
return 0;
}
/* end of file */
We can't go with the third option of skip n inner blocks without forcing all comments to have exact number of depth, otherwise it will not find the inner comment and get confused.
Actually, there is a fourth option of explicitly stating that a comment can be a single comment, a 2-scope comment, a 3-scope comment, etc; But doing this is a ugly, and each level of depth requires a significantly longer expression, and this approach limits comments to a specific depth which may fit some people, and not fit others: What if you comment out a code that has been commented out 3 times?
The general solution can be implemented with a pre-preprocessor such as:
<?php
function stripComments($code) {
$stack = array();
$codeOut = '';
$stringStream = fopen('php://memory', 'r+');
fwrite($stringStream, $code);
rewind($stringStream);
while (!feof($stringStream)) {
$ch = fgetc($stringStream);
$nextChar = fgetc($stringStream);
if ($nextChar === false) {
break;
}
if ($ch == '/' && $nextChar == '*') {
array_push($stack, '/*');
} else if ($ch == '*' && $nextChar == '/') {
if (count($stack) > 0) {
array_pop($stack);
} else {
die('cannot pop from empty stack');
}
} else {
if (count($stack) == 0) {
$codeOut .= $ch;
fseek($stringStream, -1, SEEK_CUR);
}
}
$prevChar = $ch;
}
return $codeOut;
};
?>
Which is slightly more complicated than what C currently uses:
function stripComments($code) {
return preg_replace('/\/\*[^\*\/]*\*\//', '', $code);
}
This does not take into consideration /**/ blocks inside quotes, a slightly more complicated stack that can distinguish between "/*" scope and "\"" scope is required.
The benefit of having comments inside comments is that you can comment out blocks that contain comments without stripping the comments manually, which is particularly frustrating for blocks that have comment on every line.
Summary: It can be done, but most languages do not want to treat comments as if they were their own scopes, as it takes more effort to parse.

Related

How to know if returning an l-value when using `FALLBACK`?

How can I know if I actually need to return an l-value when using FALLBACK?
I'm using return-rw but I'd like to only use return where possible. I want to track if I've actually modified %!attrs or have only just read the value when FALLBACK was called.
Or (alternate plan B) can I attach a callback or something similar to my %!attrs to monitor for changes?
class Foo {
has %.attrs;
submethod BUILD { %!attrs{'bar'} = 'bar' }
# multi method FALLBACK(Str:D $name, *#rest) {
# say 'read-only';
# return %!attrs{$name} if %!attrs«$name»:exists;
# }
multi method FALLBACK(Str:D $name, *#rest) {
say 'read-write';
return-rw %!attrs{$name} if %!attrs«$name»:exists;
}
}
my $foo = Foo.new;
say $foo.bar;
$foo.bar = 'baz';
say $foo.bar;
This feels a bit like a X-Y question, so let's simplify the example, and see if that answers helps in your decisions.
First of all: if you return the "value" of a non-existing key in a hash, you are in fact returning a container that will auto-vivify the key in the hash when assigned to:
my %hash;
sub get($key) { return-rw %hash{$key} }
get("foo") = 42;
dd %hash; # Hash %hash = {:foo(42)}
Please note that you need to use return-rw here to ensure the actual container is returned, rather than just the value in the container. Alternately, you can use the is raw trait, which allows you to just set the last value:
my %hash;
sub get($key) is raw { %hash{$key} }
get("foo") = 42;
dd %hash; # Hash %hash = {:foo(42)}
Note that you should not use return in that case, as that will still de-containerize again.
To get back to your question:
I want to track if I've actually modified %!attrs or have only just read the value when FALLBACK was called.
class Foo {
has %!attrs;
has %!unexpected;
method TWEAK() { %!attrs<bar> = 'bar' }
method FALLBACK(Str:D $name, *#rest) is raw {
if %!attrs{$name}:exists {
%!attrs{$name}
}
else {
%!unexpected{$name}++;
Any
}
}
}
This would either return the container found in the hash, or record the access to the unknown key and return an immutable Any.
Regarding plan B, recording changes: for that you could use a Proxy object for that.
Hope this helps in your quest.
Liz's answer is full of useful info and you've accepted it but I thought the following might still be of interest.
How to know if returning an l-value ... ?
Let's start by ignoring the FALLBACK clause.
You would have to test the value. To deal with Scalars, you must test the .VAR of the value. (For non-Scalar values the .VAR acts like a "no op".) I think (but don't quote me) that Scalar|Array|Hash covers all the l-value super-types:
my \value = 42; # Int is an l-value is False
my \l-value-one = $; # Scalar is an l-value is True
my \l-value-too = #; # Array is an l-value is True
say "{.VAR.^name} is an l-value is {.VAR ~~ Scalar|Array|Hash}"
for value, l-value-one, l-value-too
How to know if returning an l-value when using FALLBACK?
Adding "when using FALLBACK" makes no difference to the answer.
How can I know if I actually need to return an l-value ... ?
Again, let's start by ignoring the FALLBACK clause.
This is a completely different question than "How to know if returning an l-value ... ?". I think it's the core of your question.
Afaik, the answer is, you need to anticipate how the returned value will be used. If there's any chance it'll be used as an l-value, and you want that usage to work, then you need to return an l-value. The language/compiler can't (or at least doesn't) help you make that decision.
Consider some related scenarios:
my $baz := foo.bar;
... (100s of lines of code) ...
$baz = 42;
Unless the first line returns an l-value, the second line will fail.
But the situation is actually much more immediate than that:
routine-foo = 42;
routine-foo is evaluated first, in its entirety, before the lhs = rhs expression is evaluated.
Unless the compiler's resolution of the routine-foo call somehow incorporated the fact that the very next thing to happen would be that the lhs will be assigned to, then there would be no way for a singly or multiply dispatched routine-foo to know whether it can safely return an r-value or must return an l-value.
And the compiler's resolution does not incorporate that. Thus, for example:
multi term:<bar> is rw { ... }
multi term:<bar> { ... }
bar = 99; # Ambiguous call to 'term:<bar>(...)'
I can imagine this one day (N years from now) being solved by a combination of allowing = to be an overloadable operator, robust macros that allow overloading of = being available, and routine resolution being modified so the above ambiguous call could do something equivalent to resolving to the is rw multi. But I doubt it will actually come to pass even with N=10. Perhaps there is another way but I can't think of one at the moment.
How can I know if I actually need to return an l-value when using FALLBACK?
Again, adding "when using FALLBACK" makes no difference to the answer.
I want to track if I've actually modified %!attrs or have only just read the value when FALLBACK was called.
When FALLBACK is called it doesn't know what context it's being called in -- r-value or l-value. Any modification comes after it has already returned.
In other words, whatever solution you come up with will being nothing to do per se with FALLBACK (even if you have to use it to implement some other aspect of whatever it is you're trying to do).
(Even if it were, I suspect trying to solve it via FALLBACK itself would just make matters worse. One can imagine writing two FALLBACK multis, one with an is rw trait, but, as explained above, my imagination doesn't stretch to that making any difference any time soon, if ever, and could only happen if the above imaginary things happened (the macros etc.) and the compiler was also modified to pay attention to the two FALLBACK multi variants, and I'm not at all meaning to suggest that that even makes sense.)
Plan B
Or (alternate plan B) can I attach a callback or something similar to my %!attrs to monitor for changes?
As Lizmat notes, that's the realm of Proxys. And thus your next SO question... :)

Warning function should have a snake case identifier on by default

I'm trying to figure out what this warning actually means. The program works perfectly but during compile I get this warning:
main.rs:6:1: 8:2 warning: function 'isMultiple' should have a snake case identifier,
#[warn(non_snake_case_functions)] on by default
the code is very simple:
/*
Find the sum of all multiples of 3 or 5 below 1000
*/
fn isMultiple(num: int) -> bool {
num % 5 == 0 || num % 3 == 0
}
fn main() {
let mut sum_of_multiples = 0;
//loop from 0..999
for i in range(0,1000) {
sum_of_multiples +=
if isMultiple(i) {
i
}else{
0
};
}
println!("Sum is {}", sum_of_multiples);
}
You can turn it off by including this line in your file. Check out this thread
#![allow(non_snake_case)]
Rust style is for functions with snake_case names, i.e. the compiler is recommending you write fn is_multiple(...).
My take on it is that programmers have been debating the case of names and other formatting over and over and over again, for decades. Everyone has their preferences, they are all different.
It's time we all grew up and realized that it's better we all use the same case and formatting. That makes code much easier to read for everyone in the long run. It's time to quit the selfish bickering over preferences. Just do as everyone else.
Personally I'm not much into underscores so the Rust standard upset me a bit. But I'm prepared to get use to it. Wouldn't it be great if we all did that and never had to waste time thinking and arguing about it again.
To that end I use cargo-fmt and clippy and I accept whatever they say to do. Job done, move on to next thing.
There is a method in the Rust convention:
Structs get camel case.
Variables get snake case.
Constants get all upper case.
Makes it easy to see what is what at a glance.
-ZiCog
Link to thread : https://users.rust-lang.org/t/is-snake-case-better-than-camelcase-when-writing-rust-code/36277/2

Token empty when matching grammar although rule matched

So my rule is
/* Addition and subtraction have the lowest precedence. */
additionExp returns [double value]
: m1=multiplyExp {$value = $m1.value;}
( op=AddOp m2=multiplyExp )* {
if($op != null){ // test if matched
if($op.text == "+" ){
$value += $m2.value;
}else{
$value -= $m2.value;
}
}
}
;
AddOp : '+' | '-' ;
My test ist 3 + 4 but op.text always returns NULL and never a char.
Does anyone know how I can test for the value of AddOp?
In the example from ANTLR4 Actions and Attributes it should work:
stat: ID '=' INT ';'
{
if ( !$block::symbols.contains($ID.text) ) {
System.err.println("undefined variable: "+$ID.text);
}
}
| block
;
Are you sure $op.text is always null? Your comparison appears to check for $op.text=="+" rather than checking for null.
I always start these answers with a suggestion that you migrate all of your action code to listeners and/or visitors when using ANTLR 4. It will clean up your grammar and greatly simplify long-term maintenance of your code.
This is probably the primary problem here: Comparing String objects in Java should be performed using equals: "+".equals($op.text). Notice that I used this ordering to guarantee that you never get a NullPointerException, even if $op.text is null.
I recommend removing the op= label and referencing $AddOp instead.
When you switch to using listeners and visitors, removing the explicit label will marginally reduce the size of the parse tree.
(Only relevant to advanced users) In some edge cases involving syntax errors, labels may not be assigned while the object still exists in the parse tree. In particular, this can happen when a label is assigned to a rule reference (your op label is assigned to a token reference), and an error appears within the labeled rule. If you reference the context object via the automatically generated methods in the listener/visitor, the instances will be available even when the labels weren't assigned, improving your ability to report details of some errors.

Simplest nested block parser

I want to write a simple parser for a nested block syntax, just hierarchical plain-text. For example:
Some regular text.
This is outputted as-is, foo{but THIS
is inside a foo block}.
bar{
Blocks can be multi-line
and baz{nested}
}
What's the simplest way to do this? I've already written 2 working implementations, but they are overly complex. I tried full-text regex matching, and streaming char-by-char analysis.
I have to teach the workings of it to people, so simplicity is paramount. I don't want to introduce a dependency on Lex/Yacc Flex/Bison (or PEGjs/Jison, actually, this is javascript).
The good choices probably boil down as follows:
Given your constaints, it's going to be recursive-descent. That's a fine way to go even without constraints.
you can either parse char-by-char (traditional) or write a lexical layer that uses the local string library to scan for { and }. Either way, you might want to return three terminal symbols plus EOF: BLOCK_OF_TEXT, LEFT_BRACE, and RIGHT_BRACE.
char c;
boolean ParseNestedBlocks(InputStream i)
{ if ParseStreamContent(i)
then { if c=="}" then return false
else return true
}
else return false;
boolean ParseSteamContent(InputStream i)
{ loop:
c = GetCharacter(i);
if c =="}" then return true;
if c== EOF then return true;
if c=="{"
{ if ParseStreamContent(i)
{ if c!="}" return false; }
else return false;
}
goto loop
}
Recently, I've been using parser combinators for some projects in pure Javascript. I pulled out the code into a separate project; you can find it here. This approach is similar to the recursive descent parsers that #DigitalRoss suggested, but with a more clear split between code that's specific to your parser and general parser-bookkeeping code.
A parser for your needs (if I understood your requirements correctly) would look something like this:
var open = literal("{"), // matches only '{'
close = literal("}"), // matches only '}'
normalChar = not1(alt(open, close)); // matches any char but '{' and '}'
var form = new Parser(function() {}); // forward declaration for mutual recursion
var block = node('block',
['open', open ],
['body', many0(form)],
['close', close ]);
form.parse = alt(normalChar, block).parse; // set 'form' to its actual value
var parser = many0(form);
and you'd use it like this:
// assuming 'parser' is the parser
var parseResult = parser.parse("abc{def{ghi{}oop}javascript}is great");
The parse result is a syntax tree.
In addition to backtracking, the library also helps you produce nice error messages and threads user state between parser calls. The latter two I've found very useful for generating brace error messages, reporting both the problem and the location of the offending brace tokens when: 1) there's an open brace but no close; 2) there's mismatched brace types -- i.e. (...] or {...); 3) a close brace without a matching open.

table of functions vs switch in golang

im am writing a simple emulator in go (should i? or should i go back to c?).
anyway, i am fetching the instruction and decoding it. at this point i have a byte like 0x81, and i have to execute the right function.
should i have something like this
func (sys *cpu) eval() {
switch opcode {
case 0x80:
sys.add(sys.b)
case 0x81:
sys.add(sys.c)
etc
}
}
or something like this
var fnTable = []func(*cpu) {
0x80: func(sys *cpu) {
sys.add(sys.b)
},
0x81: func(sys *cpu) {
sys.add(sys.c)
}
}
func (sys *cpu) eval() {
return fnTable[opcode](sys)
}
1.which one is better?
2.which one is faster?
also
3.can i declare a function inline?
4.i have a cpu struct in which i have the registers etc. would it be faster if i have the registers and all as globals? (without the struct)
thank you very much.
I did some benchmarks and the table version is faster than the switch version once you have more than about 4 cases.
I was surprised to discover that the Go compiler (gc, anyway; not sure about gccgo) doesn't seem to be smart enough to turn a dense switch into a jump table.
Update:
Ken Thompson posted on the Go mailing list describing the difficulties of optimizing switch.
The first version looks better to me, YMMV.
Benchmark it. Depends how good is the compiler at optimizing. The "jump table" version might be faster if the compiler doesn't try hard enough to optimize.
Depends on your definition of what is "to declare function inline". Go can declare and define functions/methods at the top level only. But functions are first class citizens in Go, so one can have variables/parameters/return values and structured types of function type. In all this places a function literal can [also] be assigned to the variable/field/element...
Possibly. Still I would suggest to not keep the cpu state in a global variable. Once you possibly decide to go emulating multicore, it will be welcome ;-)
If you have the ast of some expression, and you want to eval it for a big amount of data rows, then you may only once compile it into the tree of lambdas, and do not calculate any switches on each iteration at all;
For example, given such ast: {* (a, {+ (b, c)})}
Compile function (in very rough pseudo language) will be something like this:
func (e *evaluator) compile(brunch ast) {
switch brunch.type {
case binaryOperator:
switch brunch.op {
case *: return func() {compile(brunch.arg0) * compile(brunch.arg1)}
case +: return func() {compile(brunch.arg0) + compile(brunch.arg1)}
}
case BasicLit: return func() {return brunch.arg0}
case Ident: return func(){return e.GetIdent(brunch.arg0)}
}
}
So eventually compile returns the func, that must be called on each row of your data and there will be no switches or other calculation stuff at all.
There remains the question about operations with data of different types, that is for your own research ;)
This is an interesting approach, in situations, when there is no jump-table mechanism available :) but sure, func call is more complex operation then jump.

Resources