Exclude some characters in Unicode category

Exclude some characters in Unicode category - antlr4

I'm trying to implement a rule along the lines of "all characters in the Letter and Symbol Unicode categories except a few reserved characters." From the lexer rules, I know I can use \p{___} to match against Unicode categories, but I am unsure of how to handle excluding certain characters.
Looking at example grammars, I am led a few different directions. For example, the Java 9 grammar seems to use predicates in order to directly use Java's built in isJavaIdentifier() while others manually define every valid character.
How can I achieve this functionality?

Without target specific code, you will have to define the ranges yourself so that the chars you want to exclude are not part of these ranges. You cannot use \p{...} and then exclude certain characters from it.
With target specific code, you can do as in the Java 9 grammar:
#lexer::members {
boolean aCustomMethod(int character) {
// Your logic to see if 'character' is valid. You're sure
// that it's at least a char from \p{Letter} or \p{Symbol}
return true;
}
}
TOKEN
: [\p{Letter}\p{Symbol}] {aCustomMethod(_input.LA(-1))}?
;

Related

Using flex to identify variable name without repeating characters

I'm not fully sure how to word my question, so sorry for the rough title.
I am trying to create a pattern that can identify variable names with the following restraints:
Must begin with a letter
First letter may be followed by any combination of letters, numbers, and hyphens
First letter may be followed with nothing
The variable name must not be entirely X's ([xX]+ is a seperate identifier in this grammar)
So for example, these would all be valid:
Avariable123
Bee-keeper
Y
E-3
But the following would not be valid:
XXXX
X
3variable
5
I am able to meet the first three requirements with my current identifier, but I am really struggling to change it so that it doesn't pick up variables that are entirely the letter X.
Here is what I have so far: [a-z][a-z0-9\-]* {return (NAME);}
Can anyone suggest a way of editing this to avoid variables that are made up of just the letter X?

The easiest way to handle that sort of requirement is to have one pattern which matches the exceptional string and another pattern, which comes afterwards in the file, which matches all the strings:
[xX]+ { /* matches all-x tokens */ }
[[:alpha:]][[:alnum:]-]* { /* handle identifiers */ }
This works because lex (and almost all lex derivatives) select the first match if two patterns match the same longest token.
Of course, you need to know what you want to do with the exceptional symbol. If you just want to accept it as some token type, there's no problem; you just do that. If, on the other hand, the intention was to break it into subtokens, perhaps individual letters, then you'll have to use yyless(), and you might want to switch to a new lexing state in order to avoid repeatedly matching the same long sequence of Xs. But maybe that doesn't matter in your case.
See the flex manual for more details and examples.

How can I use arbitrary text as a function name in Rust?

Is there a way in Rust to use any text as a function name? Something like:
fn 'This is the name of the function' { ... }
I find it useful for test functions and it is is allowed by other languages.

There's no way. According to the official reference:
An identifier is any nonempty ASCII string of the following form:
Either
The first character is a letter.
The remaining characters are alphanumeric or _.
Or
The first character is _.
The identifier is more than one character. _ alone is not an identifier.
The remaining characters are alphanumeric or _.
A raw identifier is like a normal identifier, but prefixed by r#. (Note that
the r# prefix is not included as part of the actual identifier.)
Unlike a normal identifier, a raw identifier may be any strict or reserved
keyword except the ones listed above for RAW_IDENTIFIER.

You can't have spaces in function names (and this is true of most programming languages). Usual practice for function names in Rust is to replace spaces with underscores, so the following is allowed:
fn This_is_the_name_of_the_function { ... }
although usual practice would use a lower-case t

XML schema restriction pattern for not allowing specific string

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$ but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$ ?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question XSD restriction that negates a matching string covers a similar case, but in that case the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?

OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($v, 'FILENAME'))
ought to do the job.
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more occurrences of 'F' followed by a string that doesn't match 'ILENAME' and doesn't contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
$prefix || '([^F' || next-character-in-forbidden-string || ']'
|| '[^F]*'
Then we join all of those regular expressions with or-bars.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.

Can I put one check on a Lexial element instead for on a number of parser rules?

I,m trying to use antlr4 with the IDL.g4 grammar, to implement some checks that our idl-files shall follow. One rule is about names. The rule are like:
ID contains only letters, digits and signle underscores,
ID begin with a letter,
ID end with a letter or digit.
ID is not a reserved Word in ADA, C, C++, Java, IDL
One way to do this check is to write a function that check a string for these properties and call it in the exit listeners for every rule that has an ID. E.g(refering to IDL.g4) in exitConst_decl(), exitInit_decl(), exitSimple_declarator() and a lot of more places. Maybe that is the correct way to do it. But I was thinking about putting that check directly on the lexical element ID. But don't know how to do that, or if it is possible at all.

Validating this type of constraint in the lexer would make it significantly more difficult to provide usable error messages for invalid identifiers. However, you can create a new parser rule identifier, and replace all references to ID in various parser rules to reference identifier instead.
identifier
: ID
;
You can then place your identifier validation logic inside of the single method enterIdentifier instead of all of the various rules that currently reference ID.

Allowing only usernames using "reasonable" characters

A username for a website can contain the space character, and yet it cannot be composed only of space characters. It can contain some symbols (like underscore and dash), but starting with certain symbols would look weird. Non-latin letters should be allowed, preferably for all languages, but tab and newline characters shouldn't. And definitely no Zalgo.
The rules composing what should and shouldn't be allowed in a reasonable naming system are complicated, however they are virtually the same for every website. Reimplementing them is probably a bad idea. Where can I find an implementation? I'm using PHP.

You should validate the username entered by the new user against a regular expression that run a match against the allowed character set.
Example: The following allows only english alphanumeric characters and - and _.
function isNewUsernameValid ($name, $filter = "[^a-zA-Z0-9\-\_\.]"){
return preg_match("~" . $filter . "~iU", $name) ? false : true;
}
if ( !isNewUsernameValid ($name) ){
print "Not a valid name.";
}
For your particular case, you'll have to come up with and test the regular expression.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Exclude some characters in Unicode category - antlr4

Related

Using flex to identify variable name without repeating characters

How can I use arbitrary text as a function name in Rust?

XML schema restriction pattern for not allowing specific string

Can I put one check on a Lexial element instead for on a number of parser rules?

Allowing only usernames using "reasonable" characters

Categories

Resources