Which languages allow whitespace in identifiers?
Example:
int current index = 5
string body = fetch article(current index)
FORTRAN, and it was a bad design decision.
For example, replacing a , by a . can transform a DO loop into an assignment.
MSSQL, MSAccess, and Oracle, if you quote identifiers correctly (using [] or "" respectively)
Whitespace!
http://compsoc.dur.ac.uk/whitespace/
The problem with whitespace is that it's often used as a separator between tokens. So if you allow whitespace in identifiers, you have to combine several tokens into one.
But it is not impossible. Two adjacent identifiers with no other token between them are rare, so you can adapt the compiler to accept this.
On the other hand, you can get hard-to-read code:
int current index = 5
int current /* in between comment */ index = 5
int current
index = 5
So I don't think the advantages beat the disadvantages.
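To make the "combine several tokens into one" idea concrete, here is a minimal Python sketch. It is only an illustration for the toy language in the example: the keyword set and the token pattern are assumptions, not taken from any real compiler.

```python
import re

# Hypothetical keyword set for the toy language used in the example above.
KEYWORDS = {"int", "string"}

# Words, integer literals, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(source):
    """Split source into tokens, merging adjacent non-keyword words into one identifier."""
    tokens = []
    previous_was_word = False
    for match in TOKEN_RE.finditer(source):
        text = match.group()
        is_word = (text[0].isalpha() or text[0] == "_") and text not in KEYWORDS
        if is_word and previous_was_word:
            # Two words in a row: treat them as one identifier containing a space.
            tokens[-1] += " " + text
        else:
            tokens.append(text)
        previous_was_word = is_word
    return tokens

print(tokenize("int current index = 5"))
# ['int', 'current index', '=', '5']
print(tokenize("string body = fetch article(current index)"))
# ['string', 'body', '=', 'fetch article', '(', 'current index', ')']
```

Because the merge ignores how much whitespace separates the words, a line break between current and index produces the same identifier, which is exactly the readability problem shown above.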
I've got a file format that looks a little like this:
blockA {
uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
anotherUniqueName -> uniqueName23 aWord2
blockB {
thing -> anotherThing
}
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that, because there can be an arbitrary number of metadata items in a relationship, my parser is treating anotherUniqueName as a metadata item from the first relationship rather than the start of the second relationship.
You can see this in the image below. The parser is only recognising one relationshipDeclaration when a second should start with StringLiteral: anotherUniqueName
The parser looks a bit like this:
block
: BLOCK LBRACE relationshipDeclaration* RBRACE
;
relationshipDeclaration
: StringLiteral? ARROW StringLiteral StringLiteral*
;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
Would appreciate any ideas on what options I have. Is there a way to look ahead, spot the '->', for example?
Thanks a million.
Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parse rules so the parser can recognize the end.
As you've alluded to, you could potentially use -> to trigger a different Lexer Mode and generate different tokens for content between the -> and the NL and then use those tokens in your parse rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode would probably be easier to deal with than handling all the places you might need to allow for NL. This would be pretty simple as Lexer modes go.
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
: StringLiteral? ARROW RD_StringLiteral+
;
I don't think there's a third option for dealing with this.
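As a rough illustration of why the line break matters, here is a small Python sketch (not an ANTLR grammar) that treats each line containing -> as one relationship. The function name and the tuple layout are made up for this example, and it assumes well-formed lines with balanced quotes and a target after every arrow.

```python
import shlex

def parse_relationships(text):
    """Return (source, target, metadata) tuples, one per line that contains '->'."""
    relationships = []
    for line in text.splitlines():
        tokens = shlex.split(line)  # keeps "Some text" together as a single token
        if "->" not in tokens:
            continue  # block headers, closing braces, blank lines
        arrow = tokens.index("->")
        # The grammar makes the left-hand name optional.
        source = tokens[arrow - 1] if arrow > 0 else None
        target = tokens[arrow + 1]
        metadata = tokens[arrow + 2:]
        relationships.append((source, target, metadata))
    return relationships

sample = '''blockA {
uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
anotherUniqueName -> uniqueName23 aWord2
blockB {
thing -> anotherThing
}
}'''

for relationship in parse_relationships(sample):
    print(relationship)
```

Since the end of the line ends the metadata list, anotherUniqueName is never swallowed by the first relationship. A lexer mode entered at -> and left at NL gives the parser the same information.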
What would be the highest-performing implementation to determine if a string that represents an SQL alias needs to be wrapped in double-quotes?
Presently, in pg-promise I am wrapping every alias in double-quotes, to play it safe. I am looking to make the output SQL neater and shorter, if possible.
I am undecided about which approach is best:
to use a regular expression, somehow
to do a direct algorithm with strings
not to change it at all, if there are reasons for that
Basically, I am looking to improve function as.alias, if possible, not to wrap aliases into double quotes when it is not needed.
What have I tried so far...
I thought at first to do it only for the 99% of all cases - not to add double-quotes when your alias is the most typical one, just a simple word:
function skipQuotes(alias) {
const m = alias.match(/[A-Z]+|[a-z]+/);
return m && m[0] === alias;
}
This only checks it is a single word that uses either upper or lower case, but not the combination.
SOLUTION
Following the answer, I ended up with an implementation that should cover 99% of all practical use cases, which is what I was trying to achieve:
const m = alias.match(/[a-z_][a-z0-9_$]*|[A-Z_][A-Z0-9_$]*/);
if (m && m[0] === alias) {
// double quotes will be skipped
} else {
// double quotes will be added
}
i.e. the surrounding double quotes are not added when the alias uses a simple syntax:
it is a same-case single word, without spaces
it can contain underscores, and can start with one
it can contain digits and $, but cannot start with those
Removing double quotes is admirable -- it definitely makes queries easier to read. The rules are pretty simple. A "valid" identifier consists of:
Letters (including diacritical marks), numbers, underscore, and dollar sign.
Starts with a letter (including diacriticals) or underscore.
Is not a reserved word.
(I think I have this summarized correctly. The real rules are in the documentation.)
The first two are readily implemented using regular expressions. The last probably wants a reference table for lookup (and the list varies by Postgres release -- although less than you might imagine).
Otherwise, the identifier needs to be surrounded by escape characters. Postgres uses double quotes (which is ANSI standard).
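A minimal Python sketch of those rules, for illustration only. The reserved-word set here is a tiny hand-picked subset (a real check would need the full list for the target Postgres release), and the function names are made up. The extra lower-case test is an assumption added on top of the three rules: unquoted names are folded to lower case, so an alias containing upper-case letters is quoted to keep it as written.

```python
import re

# Tiny illustrative subset; the real list is in the Postgres documentation.
RESERVED_SUBSET = frozenset({"select", "from", "where", "table", "user", "order", "group"})

# Starts with a letter or underscore, then letters, digits, underscores or dollar signs.
# In Python 3, \w covers Unicode letters, so accented letters are accepted.
IDENTIFIER = re.compile(r"[^\W\d][\w$]*")

def needs_quotes(alias):
    """Return True when the alias cannot safely be emitted without double quotes."""
    if IDENTIFIER.fullmatch(alias) is None:
        return True
    if alias.lower() in RESERVED_SUBSET:
        return True
    # Assumption: quote anything that would change under lower-case folding.
    return alias != alias.lower()

def as_alias(alias):
    """Quote the alias only when needed, doubling any embedded double quotes."""
    if needs_quotes(alias):
        return '"' + alias.replace('"', '""') + '"'
    return alias

for name in ["total_2", "2total", "my alias", "select", "Xa"]:
    print(name, "->", as_alias(name))
```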
One reason you may want to do this is that Postgres converts unquoted identifiers to lower case for comparison. So, the following works fine:
select xa, Xa, xA, "xa"
from (select 1 as Xa) y
However, this does not work:
select Xa
from (select 1 as "Xa") y
Nor does:
select "Xa"
from (select 1 as Xa) y
In fact, there is no way to refer to "Xa" without using quotes (at least none that I can readily think of).
Enforcing the discipline of exact matches can be a good thing or a bad thing. I find that one discipline too many: I admit to often ignoring case when writing "casual" code; it is just simpler to type without capitalization (or using double quotes). For more formal code, I try to be consistent.
On the other hand, the rules do allow:
select "Xa", "aX", ax
from (select 1 as "Xa", 2 as "aX", 3 as AX) y
(This returns 1, 2, 3.)
I would be happy if this naming convention were not allowed.
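The matching behaviour in the queries above can be modelled in a few lines of Python. This is only a toy model of the folding rule as described here (unquoted names are lower-cased, quoted names are kept as written); fold is a made-up helper, not anything Postgres exposes.

```python
def fold(identifier):
    """Return the name as it would be compared under the folding rule described above."""
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        # Quoted: keep the case, and turn doubled quotes back into single ones.
        return identifier[1:-1].replace('""', '"')
    # Unquoted: folded to lower case.
    return identifier.lower()

# select xa, Xa, xA, "xa" from (select 1 as Xa) y  -- all four references match
print(all(fold(ref) == fold("Xa") for ref in ["xa", "Xa", "xA", '"xa"']))  # True

# select Xa from (select 1 as "Xa") y  -- no match
print(fold("Xa") == fold('"Xa"'))  # False

# select "Xa" from (select 1 as Xa) y  -- no match
print(fold('"Xa"') == fold("Xa"))  # False
```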
I'm currently teaching myself Lua for iOS game development, since I've heard lots of very good things about it. I'm really impressed by the level of documentation there is for the language, which makes learning it that much easier.
My problem is that I've found a Lua concept that nobody seems to have a "beginner's" explanation for: nested brackets for quotes. For example, I was taught that long strings with escaped single and double quotes like the following:
string_1 = "This is an \"escaped\" word and \"here\'s\" another."
could also be written without the overall surrounding quotes. Instead one would simply replace them with double brackets, like the following:
string_2 = [[This is an "escaped" word and "here's" another.]]
Those both make complete sense to me. But I can also write the string_2 line with "nested brackets," which include equal signs between both sets of the double brackets, as follows:
string_3 = [===[This is an "escaped" word and "here's" another.]===]
My question is simple. What is the point of the syntax used in string_3? It gives the same result as string_1 and string_2 when given as an input for print(), so I don't understand why nested brackets even exist. Can somebody please help a noob (me) gain some perspective?
It would be used if your string contains a substring that is equal to the delimiter. For example, the following would be invalid:
string_2 = [[This is an "escaped" word, the characters ]].]]
Therefore, in order for it to work as expected, you would need to use a different string delimiter, like in the following:
string_3 = [===[This is an "escaped" word, the characters ]].]===]
I think it's safe to say that not a lot of string literals contain the substring ]], in which case there may never be a reason to use the above syntax.
It helps to, well, nest them:
print [==[malucart[[bbbb]]]bbbb]==]
Will print:
malucart[[bbbb]]]bbbb
But if that's not useful enough, you can use them to put whole programs in a string:
loadstring([===[print "o m g"]===])()
Will print:
o m g
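I personally use them for my static/dynamic library implementation. If you don't know whether the program text already contains a closing bracket with the same number of =s, you can determine a safe level with something like this:
local c = 0
-- raise the level until the closing bracket no longer occurs in prog
-- (the last argument makes string.find do a plain-text search)
while string.find(prog, "]" .. string.rep("=", c) .. "]", 1, true) do
  c = c + 1
end
-- do stuff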
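The same idea, sketched in Python for a tool that generates Lua source. The function name is made up, and this is only a sketch: it also guards against text that ends in something like ] or ]=, which could merge with the closing bracket, and it emits a newline after the opening bracket because Lua skips a newline in that position. It does not try to preserve carriage returns.

```python
def lua_long_string(text):
    """Wrap text in a Lua long-bracket literal whose level does not clash with the text."""
    level = 0
    while True:
        closer = "]" + "=" * level + "]"
        # The first occurrence of the closer in text + closer must be the one we append.
        if (text + closer).find(closer) == len(text):
            break
        level += 1
    opener = "[" + "=" * level + "["
    # Lua drops a newline that directly follows the opener, so adding one is harmless
    # and protects text that itself starts with a newline.
    return opener + "\n" + text + closer

print(lua_long_string('print "o m g"'))
print(lua_long_string("the characters ]]."))
```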
Is there any programming language that allows names to include whitespace? (By names, I mean variables, methods, fields, etc.)
Scala does allow whitespace characters in identifier names (but for that to be possible, you need to surround the identifier with a pair of backticks).
Example (executed at Scala REPL):
Welcome to Scala version 2.8.0.final (Java HotSpot(TM) Client VM, Java 1.6.0_22).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val `lol! this works! :-D` = 4
lol! this works! :-D: Int = 4
scala> val `omg!!!` = 4
omg!!!: Int = 4
scala> `omg!!!` + `lol! this works! :-D`
res0: Int = 8
In SQL you can have spaces and other non-identifier characters in field names and such. You just have to quote them like [field name] or "field name".
Common Lisp can do it with variables, if you surround the variable name with pipes (|):
CL-USER> (setf |hello world| 42)
42
CL-USER> |hello world|
42
Worth noting is that "piped" variable names also are case sensitive (which variable names normally aren't in CL).
CL-USER> |Hello World|
The variable |Hello World| is unbound.
[Condition of type UNBOUND-VARIABLE]
CL-USER> (setf hello-world 99)
99
CL-USER> hello-world
99
CL-USER> HeLlO-WoRlD
99
PHP can: http://blog.riff.org/2008_05_11_spaces_php_variable_names
Perl also:
${'some var'} = 42;
print ${'some var'}, "\n";
${'my method'} = sub {
print "method called\n";
};
&${'my method'};
A more recent innovation is PogoScript, an experimental language that compiles to JavaScript: https://github.com/featurist/pogoscript/wiki
wind speed = 25
average temperature = 32
becomes
windSpeed = 25
averageTemperature = 32
behind the scenes. There are also flexible rules on where arguments can be placed within a name, so you can write:
y = compute some value from (z) and return it
md5 hash (read all text from file "sample.txt")
Becomes:
var y;
y = computeSomeValueFromAndReturnIt(z);
md5Hash(readAllTextFromFile("sample.txt"));
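The renaming shown in these examples can be approximated in a few lines of Python. This is just an illustration of the camel-casing, not PogoScript's actual compiler logic, and to_camel_case is a made-up name.

```python
def to_camel_case(name):
    """Join space-separated words into one camelCase identifier."""
    words = name.split()
    if not words:
        return ""
    # Keep the first word as written; capitalise the first letter of each later word.
    return words[0] + "".join(word[:1].upper() + word[1:] for word in words[1:])

print(to_camel_case("wind speed"))               # windSpeed
print(to_camel_case("average temperature"))      # averageTemperature
print(to_camel_case("read all text from file"))  # readAllTextFromFile
```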
In Ruby you can have symbols that are named as :"this has a space" but it is enclosed in double-quotes so I'm not sure if you count that.
If other languages allowed whitespace as a valid character in symbol names, then you would have to use some other character to separate them.
The problem with spaces in variable names is that it's subject to interpretation since whitespace normally means "ok, end of the current token, starting another." Exceptions to this rule must have some special indicator such as quotation marks in a string ("This is a test").
Our PARLANSE parallel programming language is one such. In fact, it allows any character in identifiers, although many of them, including spaces, have to be escaped (preceded by ~) to be included in the name. Here's an example:
~'Buffer~ Marker~'
This is used to let PARLANSE easily refer to arbitrary symbols from other languages (in particular, from EBNFs taken from arbitrary reference documents, where we can't control the punctuation used).
We don't use this feature a lot, but when it is needed it means we can stay true to tokens from other documents.
You might be able to find esoteric languages that don't separate expression elements with whitespaces on this website: http://99-bottles-of-beer.net
For example... whitespace :D
Some dialects of SQL allow databases, tables, and fields to have spaces in their names.
For example, in SQL Server, you can refer to a table with a space in its name, either by putting the table name in [square brackets] or (depending on connection options) in "double quotes".
There shouldn't be many problems creating languages that support whitespace in identifiers, as long as there are enough separating tokens that tell the parser where the identifiers end (such as operators, braces, commas and the infamous semicolon). It just doesn't improve the readability of the source code much.
I need to accept a list of file names in a query string. ie:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc
Do you have any recommendations on what delimiter to use?
Having query parameters multiple times is legal, and the only way to guarantee no parsing problems in all cases:
http://someSite/someApp/myUtil.ashx?file=file1.txt&file=file2.bmp&file=file3.doc
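For illustration, here is how the repeated-parameter form round-trips with Python's standard urllib.parse (the question is about an .ashx handler, so treat this as a sketch of the idea rather than the server code). The helper names build_query and parse_files are made up; the third file name is an invented example containing a reserved character.

```python
from urllib.parse import parse_qs, urlencode

def build_query(filenames):
    """Encode each file name as its own 'file' parameter."""
    return urlencode([("file", name) for name in filenames])

def parse_files(query):
    """Return every value of the repeated 'file' parameter, in order."""
    return parse_qs(query).get("file", [])

query = build_query(["file1.txt", "file2.bmp", "a;b.doc"])
print(query)               # file=file1.txt&file=file2.bmp&file=a%3Bb.doc
print(parse_files(query))  # ['file1.txt', 'file2.bmp', 'a;b.doc']
```

No custom delimiter is involved, so no file name can be mistaken for a separator: reserved characters inside a name are percent-encoded by the standard encoder.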
The semicolon ; must be URI encoded if part of a filename (turned to %3B), yet not if it is separating query parameters which is its reserved use.
See section 2.2 of RFC 3986:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
If they're filenames, a good choice would be a character which is disallowed in filenames. Suggestions so far included , | & which are generally allowed in filenames and therefore might lead to ambiguities. / on the other hand is generally not allowed, not even on Windows. It is allowed in URIs, and it has no special meaning in query strings.
Example:
http://someSite/someApp/myUtil.ashx?files=file1.txt|file2.bmp|file3.doc is bad because it may refer to the valid file file1.txt|file2.bmp.
http://someSite/someApp/myUtil.ashx?files=file1.txt/file2.bmp/file3.doc unambiguously refers to 3 files.
I would recommend making each file its own query parameter, i.e.
myUtil.ashx?file1=file1.txt&file2=file2.bmp&file3=file3.doc
This way you can just use standard query parsing and loop over the parameters.
Do you need to list the filenames as a string?
Many server-side languages and frameworks accept arrays in the query string, so you could write it like
http://someSite/someApp/myUtil.ashx?files[]=file1.txt&files[]=file2.bmp&files[]=file3.doc
If yours doesn't, or you can't use arrays for some other reason, you should stick to a delimiter that is either not allowed or unusual in a filename. Pipe (|) is a good one; otherwise you could URL-encode an invisible character, since those are easy to use in code but hard to actually include in a filename.
I usually use arrays when possible and pipe otherwise.
I've always used double pipes "||". I don't have any good evidence to back up why this is a good choice other than 10 years of web programming and it's never been an issue.
This is a common problem. How I handled it: I created a method that accepts a list of strings and finds a character that is not in any of them. (I did this by simply concatenating the strings, then testing for various characters.) Once such a character was found, I concatenated all the strings with it as the separator and also prepended the separator character to the result, so the receiver can read the delimiter from the first character. So for the given question, one example would be:
http://someSite/someApp/myUtil.ashx?files=|file1.txt|file2.bmp|file3.doc
and another would be:
http://someSite/someApp/myUtil.ashx?files=,file1.txt,file2.bmp,file3.doc
But since I actually use a method that guarantees my separator character is not in the rest of the strings, it is safe. It was a bit of work to create the first time, but I've used it MANY times in various applications.
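A minimal Python sketch of that approach, for illustration. The candidate list and the function names are assumptions, it assumes a non-empty list of non-empty names, and the resulting value would still need normal URL encoding before going into the query string.

```python
# Hypothetical candidate separators, tried in order.
CANDIDATES = "|,;:~!*"

def join_self_delimited(items, candidates=CANDIDATES):
    """Join items with the first candidate that appears in none of them, prefixed by it."""
    combined = "".join(items)
    for delimiter in candidates:
        if delimiter not in combined:
            return delimiter + delimiter.join(items)
    raise ValueError("every candidate delimiter occurs in the items")

def split_self_delimited(encoded):
    """Read the delimiter from the first character and split the rest on it."""
    if not encoded:
        return []
    delimiter, rest = encoded[0], encoded[1:]
    return rest.split(delimiter)

print(join_self_delimited(["file1.txt", "file2.bmp", "file3.doc"]))
# |file1.txt|file2.bmp|file3.doc
print(join_self_delimited(["file1.txt|file2.bmp", "file3.doc"]))
# ,file1.txt|file2.bmp,file3.doc
print(split_self_delimited(",file1.txt|file2.bmp,file3.doc"))
# ['file1.txt|file2.bmp', 'file3.doc']
```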
I think I would consider using commas or semicolons.
I would build on MSalters' answer by saying that, to generalize, the best delimiter is one that is invalid in the items of the list. For example, if your list is prices, a comma is a bad delimiter because it can be confused with the values. For that reason, as most of these answers suggest, I think a good general-purpose delimiter is probably "|", as it is rarely a valid value. "/" is maybe not the best delimiter in general, as it is sometimes valid in paths.