XML schema restriction pattern for not allowing specific string - xsd

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$ but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$ ?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question "XSD restriction that negates a matching string" covers a similar case, but there the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?

OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($value, 'FILENAME'))
ought to do the job.
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more segments, each consisting of an 'F' followed by a string that does not begin with 'ILENAME' and does not contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
$prefix || '([^F' || next-character-in-forbidden-string || ']'
|| '[^F]*)?'
Then we join all of those regular expressions with or-bars.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.
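The construction described above can be sketched in Python for experimentation. This is only an illustration, not part of the original answer: the names build_pattern and is_valid are hypothetical, and the sketch assumes that the first character of the forbidden string does not occur again later in it (true for 'FILENAME'). Python's re.fullmatch stands in for the implicit anchoring of XSD patterns.

```python
import re

def build_pattern(forbidden: str) -> str:
    """Build a pattern matching exactly the strings that do not contain `forbidden`.

    Assumption of this sketch: the forbidden string has at least two
    characters and its first character does not occur again later in it.
    """
    first = forbidden[0]
    if len(forbidden) < 2 or first in forbidden[1:]:
        raise ValueError("sketch only handles a non-repeating first character")
    f = re.escape(first)
    alternatives = []
    for i in range(1, len(forbidden)):
        prefix = re.escape(forbidden[:i])   # proper prefix: F, FI, FIL, ...
        nxt = re.escape(forbidden[i])       # next character of the forbidden string
        # The prefix, optionally followed by one "safe" character and a run
        # of characters that does not contain the first character.
        alternatives.append(f"({prefix}([^{f}{nxt}][^{f}]*)?)")
    return f"[^{f}]*({'|'.join(alternatives)})*"

PATTERN = build_pattern("FILENAME")

def is_valid(value: str) -> bool:
    # fullmatch mimics the implicit anchoring of XSD pattern facets.
    return re.fullmatch(PATTERN, value) is not None

if __name__ == "__main__":
    for sample in ["FILENAME", "ORIGINFILENAME", "FEEFIFILENAME", "FILENAM", "MYFILE.TXT"]:
        print(sample, is_valid(sample))
```

For 'FILENAME' the generated string is the same as the pattern shown above with the newlines removed, so it can be pasted into an xs:pattern facet as is.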

Related

Using flex to identify variable name without repeating characters

I'm not fully sure how to word my question, so sorry for the rough title.
I am trying to create a pattern that can identify variable names with the following constraints:
Must begin with a letter
First letter may be followed by any combination of letters, numbers, and hyphens
First letter may be followed with nothing
The variable name must not be entirely X's ([xX]+ is a separate identifier in this grammar)
So for example, these would all be valid:
Avariable123
Bee-keeper
Y
E-3
But the following would not be valid:
XXXX
X
3variable
5
I am able to meet the first three requirements with my current identifier, but I am really struggling to change it so that it doesn't pick up variables that are entirely the letter X.
Here is what I have so far: [a-z][a-z0-9\-]* {return (NAME);}
Can anyone suggest a way of editing this to avoid variables that are made up of just the letter X?
The easiest way to handle that sort of requirement is to have one pattern which matches the exceptional string and another pattern, which comes afterwards in the file, which matches all the strings:
[xX]+ { /* matches all-x tokens */ }
[[:alpha:]][[:alnum:]-]* { /* handle identifiers */ }
This works because lex (and almost all lex derivatives) selects the first pattern in the file when two patterns match the same longest token.
Of course, you need to know what you want to do with the exceptional symbol. If you just want to accept it as some token type, there's no problem; you just do that. If, on the other hand, the intention was to break it into subtokens, perhaps individual letters, then you'll have to use yyless(), and you might want to switch to a new lexing state in order to avoid repeatedly matching the same long sequence of Xs. But maybe that doesn't matter in your case.
See the flex manual for more details and examples.
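To see the rule-ordering idea in isolation, here is a minimal Python sketch that imitates the "longest match, earliest rule wins ties" behaviour. It is not flex itself: RULES and classify are hypothetical names chosen for illustration, and the character classes are ASCII approximations of [[:alpha:]] and [[:alnum:]].

```python
import re

# Rules in file order: the exceptional pattern comes first.
RULES = [
    ("ALL_X", re.compile(r"[xX]+")),
    ("NAME", re.compile(r"[A-Za-z][A-Za-z0-9-]*")),
]

def classify(text, pos=0):
    """Return (rule_name, lexeme) for the token starting at pos, or None."""
    best = None
    for name, rx in RULES:
        m = rx.match(text, pos)
        # Strictly longer matches replace the current best, so on a tie
        # the rule listed earlier keeps the token.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        return None
    return best[0], text[pos:best[1]]

if __name__ == "__main__":
    for word in ["XXXX", "X", "Xavier", "E-3", "3variable"]:
        print(word, classify(word))
```

Here "XXXX" is matched by both rules with the same length, so the earlier ALL_X rule claims it, while "Xavier" is longer under the NAME rule and is classified as a name.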

LUA -- gsub problems -- passing a variable to the match string isn't working [duplicate]

Been stuck on this for over a day.
I'm trying to use gsub to extract a portion of an input string. The exact pattern of the input varies in different cases, so I'm trying to use a variable to represent that pattern, so that the same routine - which is otherwise identical - can be used in all cases, rather than separately coding each.
So, I have something along the lines of:
newstring , n = oldstring:gsub(matchstring[i],"%1");
where matchstring[] is an indexed table of the different possible pattern matches, set up so that "%1" will match the target sequence in each matchstring[].
For instance, matchstring[1] might be
"\[User\] <code:%w*>([^<]*)<\\code>.*" -- extract user name from within the <code>...<\code>
while matchstring[2] could be
"\[World\] (%w)* .*" -- extract user name as first word after prefix '[World] '
and matchstring[3] could be
"<code:%w*>([^<]*)<\\code>.*" -- extract username from within <code>...<\code> at start
This does not work.
Yet when, debugging one of the cases, I replace matchstring[i] with the exact same string -- only now passed as a string literal rather than saved in a variable -- it works.
So.. I'm guessing there must be some 'processing' of the string - stripping out special characters or something - when it's sent as a variable rather than a string literal ... but for the life of me I can't figure out how to adjust the matchstring[] entries to compensate!
Help much appreciated...
FACEPALM
Thank you, Piglet, you got me on the right track.
Given how this particular platform processes and passes strings, anything within <...> needed the escape character \ for downstream use, but of course (duh) for Lua's gsub processing itself it needed the standard % escape character.
Much obliged.

SML Pattern Matching on char lists

I'm trying to pattern match on char lists in SML. I pass in a char list generated from a string as an argument to the helper function, but I get an error saying "non-constructor applied to argument in pattern". The error goes away if instead of
#"a"::#"b"::#"c"::#"d"::_::nil
I use:
#"a"::_::nil.
Any explanations regarding why this happens would be much appreciated, and work-arounds if any. I'm guessing I could use the substring function to check this specific substring in the original string, but I find pattern matching intriguing and wanted to take a shot. Also, I need specific information in the char list located somewhere later in the string, and I was wondering if my pattern could be:
#"some useless characters"::#"list of characters I want"::#"newline character"
I checked out How to do pattern matching on string in SML? but it didn't help.
fun somefunction(#"a"::#"b"::#"c"::#"d"::_::nil) = print("true\n")
| somefunction(_) = print("false\n")
If you add parentheses around the characters the problem goes away:
fun somefunction((#"a")::(#"b")::(#"c")::(#"d")::_::nil) = print("true\n")
| somefunction(_) = print("false\n")
Then somefunction (explode "abcde") prints true and somefunction (explode "abcdef") prints false.
I'm not quite sure why the SML parser had difficulties parsing the original definition. The error message suggests that it was interpreting # as a function which is applied to strings. The problem doesn't arise simply in pattern matching. SML also has difficulty with an expression like #"a"::#"b"::[]. At first it seems like a precedence problem (of # and ::) but that isn't the issue since #"a"::explode "bc" works as expected (matching your observation of how your definition worked when only one # appeared). I suspect that the problem traces to the fact that characters were added to the language with SML 97. The earlier SML 90 viewed characters as strings of length 1. Perhaps there is some sort of behind-the-scenes kludge with the way the symbol # as a part of character literals was grafted onto the language.

Find the maximal input string matching a regular expression

Given a regular expression re and an input string str, I want to find the maximal substring of str, which starts at the minimal position, which matches re.
Special case:
re = Regex("a+|[ax](bc)*"); str = "yyabcbcb"
Matching re against str should return the string "abcbc" (and not "a", as PCRE does). The result should also stay the same if the order of the alternatives is changed.
The options I found were:
POSIX extended RE - probably outdated, used by egrep ...
RE2 by Google - open source RE2 - C++ - also C-wrapper available
From my point of view, there are two problems with your question.
First, changing the order of the alternatives can change the result.
Each single 'a' in the string can match either "a+" or "[ax](bc)*".
So it is ambiguous which alternative of your regular expression an 'a' should be matched against.
Second, finding the maximal substring requires longest-match semantics. As far as I know, only RE2 provides such a feature, as mentioned by @Cosinus.
So my recommendation is to split "a+|[ax](bc)*" into two regexes, find the maximal substring for each of them, and then compare the positions and lengths of the two results.
To find the longest match, you can also refer to an earlier regex post. The main idea is to search for substrings starting from each string position from 0 to len(str), and to keep track of the length and position whenever a matching substring is found.
P.S. Some languages provide regex functions similar to "findall()". Be careful when using them, because they may return only non-overlapping matches, and non-overlapping matches do not necessarily include the longest matching substring.
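The position-by-position search described above can be sketched as follows. This is a brute-force illustration rather than an efficient implementation, and leftmost_longest is a hypothetical helper name. It uses Python's backtracking re module only to test whether a candidate substring matches in full, so the order of the alternatives no longer matters.

```python
import re

def leftmost_longest(pattern, text):
    """Return (start, substring) for the longest match at the earliest start.

    Brute force: for each start position, try the longest candidate end
    first and stop at the first substring that matches in full.
    """
    rx = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            if rx.fullmatch(text, start, end):
                return start, text[start:end]
    return None

if __name__ == "__main__":
    print(leftmost_longest(r"a+|[ax](bc)*", "yyabcbcb"))
    print(leftmost_longest(r"[ax](bc)*|a+", "yyabcbcb"))
    # For comparison, the ordinary ordered-choice search:
    print(re.search(r"a+|[ax](bc)*", "yyabcbcb").group())
```

The nested loops make this quadratic in the number of candidate substrings, so it is only suitable for short inputs; an engine with built-in longest-match semantics avoids that cost.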

string.gmatch to find a string included between two inequality signs

I'm using Lua and have already searched Google without success: I can't find a way to get the string between inequality signs (< >). Other brackets are easy to handle, but not these. Is it possible?
Target: How to grab "name" from string between inequality signs?
String: < name >: Message
If name does not contain >, then <(.-)> works.
You can use the (%b<>) pattern to capture matching <>. Then using that value, you can simply use string.sub to cut off the first and last char:
name,message=('< name<> > : Foo Bar!'):match('(%b<>)%s*:%s*(.*)')
name=name:sub(2,-2)
print(name,'sent message :',message)
As you can see, this also takes care of strings containing other, embedded <> signs.
