Vectors of a complex type - xsd

Is there a way to define the cardinality of a type at the place where that type is referenced?
<xs:complexType name="xyType">
<xs:choice maxOccurs="1" minOccurs="0">
<xs:element name="xy" maxOccurs="1">
<xs:complexType>
<xs:choice maxOccurs="unbounded" minOccurs="0">
...
</xs:choice>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
So for instance I have two types A and B that have elements that reference this type, but in one case I only allow one xy (like above) and another I would like to allow multiple xy (like if I change the maxOccurs above for xy to "unbounded").
I don't want to have two completely separate complexType definitions for xyType (single) and xyType (unbounded), because in reality the definition for this type is very long and complex.
If possible I would also like to not define too many types (like separating the inner complexType from the body and having two types referencing that type). This would also be very complex in my specific scenario (I have a complex class hierarchy that I try to define with a schema, so everything is bloated already).
So basically I'm looking for something where the type that references this type takes care of the cardinality, if that makes sense at all.

I would suggest that you modularize the parts of xyType as best as possible for sharing across two types, say xyType_A that allows only one xy and xyType_B that allows an unbounded number of xys. (Of course choose semantically appropriate names rather than these stand-ins.)
For example, xyType_A and xyType_B could differ in their definitions of xy's cardinality yet share the complex machinery defined in commonType:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="xyType_A">
<xs:sequence>
<xs:element name="xy" type="commonType" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="xyType_B">
<xs:sequence>
<xs:element name="xy" type="commonType" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="commonType">
<xs:choice maxOccurs="1" minOccurs="0">
<xs:sequence>
<xs:choice maxOccurs="unbounded" minOccurs="0">
<!-- further complicated structures continue here -->
</xs:choice>
<!-- and here or wherever -->
</xs:sequence>
</xs:choice>
</xs:complexType>
</xs:schema>
The principle (if not the magnitude of opportunity) would be the same if the elements of varying cardinality are deeper in the definitional hierarchy: Factor as much of the common definitional components as possible, and reuse those in the distinctly defined types.

This wouldn't work in XSD 1.0. You could use Schematron (on top of the XSD 1.0); it would work with no issues.
It is possible in XSD 1.1. It would require a bit of work, at least based on my understanding. The solution is to use assertions; however, they seem to be supported for complex and simple types only, which means you may still need to introduce two new types specific to element A and B; however, they would simply be extending xyType (100% reuse), for the purpose of providing a place to define the assertion specific to A and B.
If you're interested in either alternative, tag the question appropriately.
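To make the XSD 1.1 route concrete, here is a minimal sketch of the assertion idea. It is not tested against the asker's real schema: the body of xyType is a stand-in, the name xyTypeSingle is made up for illustration, and A and B are shown as plain elements for brevity. It needs an XSD 1.1 processor, since xs:assert does not exist in 1.0.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Stand-in for the long shared definition; it allows any number of xy. -->
  <xs:complexType name="xyType">
    <xs:sequence>
      <xs:element name="xy" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Hypothetical type for the single case: reuses xyType unchanged
       and exists only as a place to hang the assertion (XSD 1.1). -->
  <xs:complexType name="xyTypeSingle">
    <xs:complexContent>
      <xs:extension base="xyType">
        <xs:assert test="count(xy) le 1"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- A allows at most one xy, B allows any number. -->
  <xs:element name="A" type="xyTypeSingle"/>
  <xs:element name="B" type="xyType"/>

</xs:schema>
```

The content model is written once; only the small extension type differs per use.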

Related

Why can extensions only be placed in simpleContent and complexContent containers?

I'm having difficulty understanding some of the nuances of the format for defining type extensions and restrictions in XSD. According to the W3Schools reference:
simpleContent defines "extensions or restrictions on a text-only complex type or on a simple type as content and contains no elements"
complexContent defines "extensions or restrictions on a complex type that contains mixed content or elements only"
What isn't clear to me is why XSD requires extensions and restrictions to be contained in one of these containers, and furthermore, why only extensions and restrictions require it. It would make a little more sense to me if all 'content' had to be defined in the container, but this is not the case: with base types, the content (sequences, etc.) is defined as direct children of the complexType container.
Take this example, which to me seems overly verbose:
<xs:complexType name="fullpersoninfo">
<xs:complexContent>
<xs:extension base="personinfo">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Why is it not possible to write it like this instead?
<xs:complexType name="fullpersoninfo">
<xs:extension base="personinfo">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexType>
Or even like this?
<xs:complexType name="fullpersoninfo" extends="personinfo">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:complexType>
I'm assuming there must be some reason it was defined the way it was, but I can't find any clues as to why.
I don't think you're going to find any useful design rationales for the XML syntax of complex types. Suffice it to say that those designing the XML syntax managed, by means of the elements you mention, to solve some technical difficulties, and that no obviously better syntax commanded consensus in the working group. You may wonder what technical difficulties are solved by simpleContent and complexContent, and that's a reasonable question, but I doubt anyone is going to be willing to undertake the excursion into the design records of XSD that would be necessary to answer it.
One simple observation: the legal children of extension and restriction vary depending on whether the parent is simpleContent or complexContent. That is accomplished using declarations local to the types of simpleContent and complexContent and would not be possible without them -- at least, not without a very thorough redesign of the XML syntax.
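As an illustration of that observation, here is a small sketch showing xs:extension under each parent. The personinfo body and the price type are assumptions made up for the example; only the names personinfo and fullpersoninfo come from the question.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Assumed base type; the question does not show its definition. -->
  <xs:complexType name="personinfo">
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Under complexContent, extension may add a content model
       (sequence, choice, all, group) as well as attributes. -->
  <xs:complexType name="fullpersoninfo">
    <xs:complexContent>
      <xs:extension base="personinfo">
        <xs:sequence>
          <xs:element name="city" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="id" type="xs:ID"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Under simpleContent, extension may add attributes but no
       element content; the base here is a simple type. -->
  <xs:complexType name="price">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="xs:string"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

</xs:schema>
```

Putting an xs:sequence inside the second extension would make the schema invalid, which is the difference the wrapper elements encode.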
To build on C. M. Sperberg-McQueen's answer, I would think that at least some of it had to do with the limitations of the language (I guess that is the "technical difficulties" reference); since most grammars try to prove that they're good enough to define themselves, imagine how little could actually have been done in the "schema for schema", considering the limitations we still "enjoy" today in version 1.0.
Many people believe that they could truly validate an XSD by validating the XML that is XSD against the XMLSchema.xsd - it is not the case.
Many XML Schema designs raise the same question as yours; the answer is typically that the author wanted to maximize the constraining capabilities of their schema spec by working around limitations in the language.
Somehow I believe that if the features in 1.0 had been similar to those in 1.1, the syntax would have been different; the spec wouldn't have been easier to understand, though...
To make this richer, I would also explore other schema language specifications, such as RelaxNG or Schematron; maybe some argumentative discussions... A good read is probably Rick Jelliffe's take on XSD.

Is it preferred to define a separate plural complexType for multiple singular elements

Is there any established standard for inlining trivial plural complexTypes vs. defining them separately?
In detail: When defining some XML schemas I frequently encounter cases where I want one element to contain multiple child elements of the same single type. For example a schema which describes a table in a database has a fields element which can contain one or more field elements. I can either create an inline complexType within the definition of the plural fields element:
<xs:element name="fields" minOccurs="1" maxOccurs="1">
<xs:complexType>
<xs:sequence>
<xs:element name="field" type="table-field"
minOccurs="1" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
Or I can separately define a trivial fields type and use that:
<xs:element name="fields" type="table-field-collection" minOccurs="1" maxOccurs="1"/>
<!-- Elsewhere: -->
<xs:complexType name="table-field-collection">
<xs:sequence>
<xs:element name="field" type="table-field" minOccurs="1" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
The first approach creates slightly messier markup with anonymous types, while the second creates lots of extra trivial complexTypes. Is there a consensus on which approach is preferred?
There isn't really an established standard for this. There are really three choices:
"fields" must be defined as a complex type and reused (table-field-collection above)
"fields" is an element with an anonymous sub-type
There is no fields element. Instead, "field" simply repeats within the parent element.
I have specified modelling guidelines for a number of firms and used all of these patterns. More recently, I'm tending towards the third: the encapsulating fields element does not really have any semantic meaning, other than making a nice grouping when viewing documents in some graphical tools. If you were to process this using something like JAXB, you'd probably be annoyed that fields is there, as it is one more thing that can be null.
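For comparison with the two snippets in the question, the third pattern might look roughly like this. The enclosing table element and the body of table-field are assumptions, since the question shows neither.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Assumed stand-in for the real table-field type. -->
  <xs:complexType name="table-field">
    <xs:attribute name="name" type="xs:string" use="required"/>
  </xs:complexType>

  <!-- Hypothetical parent element: field repeats directly inside it,
       with no wrapping fields element. -->
  <xs:element name="table">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field" type="table-field" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
```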
If you want to ask yourself the one relevant question from a technical point of view, then it is this: do you want to be able to inherit from table-field-collection and override it using xsi:type, or reuse it? If yes, go for the complex type. If no, go for whatever you prefer style-wise.

Cannot figure out a way to create XML schema that matches random order items with conditions

We're trying to find a way to have a schema that would validate certain rules, but we've tried various combinations of xs:all, xs:choice, xs:group and xs:sequence with no success. The rules are basically this:
1. only one occurrence of the LICAPPIN01 element should occur
2. only one occurrence of the LICAPPIN99 element should occur
3. there should be the same number of LICAPPIN30 and LICAPPIN31
4. there should be the same number of LICAPPIN40 and LICAPPIN41
5. there needs to be at least one set of LICAPPIN30/31 or LICAPPIN40/41 (both can be there as well)
6. for all of the above, the order does not matter; any order is acceptable
The simplest schema we tried is this:
<?xml version="1.0" standalone="yes"?>
<xs:schema id="NewDataSet" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="NewDataSet">
<xs:complexType>
<xs:choice minOccurs="1" maxOccurs="unbounded">
<xs:element name="LICAPPIN01" minOccurs="1" maxOccurs="1">
</xs:element>
<xs:element name="LICAPPIN30" minOccurs="1" maxOccurs="unbounded">
</xs:element>
<xs:element name="LICAPPIN31" minOccurs="1" maxOccurs="unbounded">
</xs:element>
<xs:element name="LICAPPIN40" minOccurs="1" maxOccurs="unbounded">
</xs:element>
<xs:element name="LICAPPIN41" minOccurs="1" maxOccurs="unbounded">
</xs:element>
<xs:element name="LICAPPIN99" minOccurs="1" maxOccurs="1">
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>
This has a number of problems:
it allows multiple LICAPPIN01 and LICAPPIN99 (replacing with xs:all might fix this?)
it does not enforce rules 3 and 4
for rule 5, it seems to force both LICAPPIN30/31 and LICAPPIN40/41 when it should be possible to only have one of the two sets
We also tried a more complex approach with xs:group for LICAPPIN30/31 and for LICAPPIN40/41 but it broke rule 6.
Any idea if it is even possible to meet all of our basic rules in a relatively simple schema? In the example above, I removed all of the details within each LICAPPINnn element; they each contain complex types, and ideally we don't want to have to duplicate these in multiple places.
Thanks,
Denis
It's not easy to write a content model to meet all your requirements, but it's easy to meet all but the last.
If variation in the order of elements is essential to convey necessary information, then your best bet is to use assertions in XSD 1.1 or Schematron. If variation in the order of elements conveys no information, then you have the option of declaring that variation in order is not a requirement after all. The vocabulary design authorities I respect most highly say pretty consistently that if the sequence of children does not convey information, then there is no reason not to fix it.
Here is a content model that meets all the requirements you list except the last one:
<xs:complexType>
<xs:sequence>
<xs:element name="LICAPPIN01"/>
<xs:choice maxOccurs="unbounded">
<xs:sequence>
<xs:element name="LICAPPIN30"/>
<xs:element name="LICAPPIN31"/>
</xs:sequence>
<xs:sequence>
<xs:element name="LICAPPIN40"/>
<xs:element name="LICAPPIN41"/>
</xs:sequence>
</xs:choice>
<xs:element name="LICAPPIN99"/>
</xs:sequence>
</xs:complexType>
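If the free ordering really is required, the assertion route mentioned above could look something like the following. This is an untested sketch that assumes an XSD 1.1 processor; the element types are left out, as in the question.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="NewDataSet">
    <xs:complexType>
      <!-- Any of the six elements, in any order, any number of times. -->
      <xs:choice maxOccurs="unbounded">
        <xs:element name="LICAPPIN01"/>
        <xs:element name="LICAPPIN30"/>
        <xs:element name="LICAPPIN31"/>
        <xs:element name="LICAPPIN40"/>
        <xs:element name="LICAPPIN41"/>
        <xs:element name="LICAPPIN99"/>
      </xs:choice>
      <!-- The counting rules are then enforced by XSD 1.1 assertions. -->
      <xs:assert test="count(LICAPPIN01) eq 1"/>
      <xs:assert test="count(LICAPPIN99) eq 1"/>
      <xs:assert test="count(LICAPPIN30) eq count(LICAPPIN31)"/>
      <xs:assert test="count(LICAPPIN40) eq count(LICAPPIN41)"/>
      <xs:assert test="exists(LICAPPIN30) or exists(LICAPPIN40)"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Each element is declared once, so the complex types inside them would not need to be duplicated.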

Constraint or Restriction on xsi:type

This is a generalized example of what I am up against.
I have created derived types in my schema and want to create an element which is an unbounded list (sequence) with a restriction where only two of the three derived types are allowed.
Put at a high level: "I have events, and in one situation only two of the event types are allowed."
Here is how I have defined my events and a subsequent holder of the sequence. (This all works and is valid).
The abstract item is a complex type named "EventBase" and has a common attribute called Name:
<xs:complexType name="EventBase">
<xs:annotation><xs:documentation>***Abstract Event***</xs:documentation></xs:annotation>
<xs:attribute name="Name"/>
</xs:complexType>
Then I have three events derived from the abstract as follows
<xs:complexType name="DerivedEvent1">
<xs:complexContent>
<xs:extension base="EventBase">
<xs:attribute name="Alpha" type="xs:string"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="DerivedEvent2">
<xs:complexContent>
<xs:extension base="EventBase">
<xs:attribute name="Beta"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="DerivedEvent3">
<xs:complexContent>
<xs:extension base="EventBase">
<xs:attribute name="Gamma"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
To facilitate a complex object to hold the derived events, I create a concrete "generic" event derived from the abstract complex type:
<xs:element name="Event" type="EventBase">
<xs:annotation><xs:documentation>A generic event derived from abstract.</xs:documentation></xs:annotation>
</xs:element>
Then I want to be able to hold the events, so I create a new complex type that holds the "generic" event shown above; in practice the eventual consumer will put derived events in it.
<xs:complexType name="EventsCollectionType">
<xs:annotation><xs:documentation>Holds derived events</xs:documentation></xs:annotation>
<xs:sequence>
<xs:element ref="Event" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
Finally I create an element derived from the collection type which will hold actual events:
<xs:element name="Events"><xs:annotation><xs:documentation>Concrete holder of events.</xs:documentation></xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element ref="Event" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
The resulting xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Events xsi:noNamespaceSchemaLocation="file:///C:/StackOverflow.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Event xsi:type="DerivedEvent1" Name="D1" Alpha="Content1"/>
<Event xsi:type="DerivedEvent3" Name="D1" Gamma="Content3"/>
</Events>
So the question is, how can I create a final Event*s* element which will hold only specific xsi:typed items?
For example, with a restriction that allows only derived types 1 and 3, the document above would be valid, but one containing a derived type 2 would be invalid.
I have created a public GIST (Constraint or Restriction on xsi:type)
I may be wrong, but I don't think this is possible.
Within the Events collection you essentially want to have different structures but all with the same element name "Event". This goes against a fundamental constraint of schemas: http://www.w3.org/TR/xmlschema-1/#cos-element-consistent. Using xsi:type gives the schema processor a hint that will allow it to disambiguate this choice of structures thus avoiding violating this rule. It's essentially a work-around.
Could you not give each a different name, so that you have a collection of "event1"s and "event3"s, or an outer collection containing a sequence of optional "event1"s and "event3"s? It would be much easier to enforce the structure in the schema this way. You also wouldn't need to use xsi:type at all. I'm not sure if you are using xsi:type in your instances to try to work around this limitation or for another reason, but it may be easier for anybody using the schema not to have to worry about derived types.
Alternatively, you could potentially use another technology (e.g. Schematron) to help enforce this constraint.
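A rough sketch of the renaming suggestion, reusing the types from the question. The element names event1 and event3 are placeholders, and the types are abbreviated.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:complexType name="EventBase">
    <xs:attribute name="Name"/>
  </xs:complexType>

  <xs:complexType name="DerivedEvent1">
    <xs:complexContent>
      <xs:extension base="EventBase">
        <xs:attribute name="Alpha" type="xs:string"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:complexType name="DerivedEvent3">
    <xs:complexContent>
      <xs:extension base="EventBase">
        <xs:attribute name="Gamma"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Only event1 and event3 may appear, in any order and any number;
       the element name selects the type, so no xsi:type is needed. -->
  <xs:element name="Events">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="event1" type="DerivedEvent1"/>
        <xs:element name="event3" type="DerivedEvent3"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

A collection that allows a different subset would simply list different elements in its choice.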

How to rewrite this nondeterministic XML Schema to deterministic?

Why is this non-deterministic, and how can I fix it?
<xs:element name="activeyears">
<xs:complexType>
<xs:sequence minOccurs="0" maxOccurs="1">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element ref="from" minOccurs="1" maxOccurs="1"/>
<xs:element ref="till" minOccurs="1" maxOccurs="1"/>
</xs:sequence>
<xs:element ref="from" minOccurs="0" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
</xs:element>
It is supposed to mean that <activeyears> is either empty or contains a sequence of <from><till> which starts with <from> but can end with either.
A schema is non-deterministic when there are two branches that begin with the same element - so that you cannot tell which branch to take without looking ahead after that element. A simple example is ab|ac - when you see an a, you don't know which branch to take. For loops, the "branch" is whether to repeat the loop, or continue after it. An example of this is a*a - once you are in the loop, and you read an a, you don't know whether to repeat the loop, or continue.
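The ab|ac case translates to XSD roughly as follows. This is a sketch with made-up elements a, b and c; the non-deterministic form is kept inside a comment because a conforming processor should reject it.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="a" type="xs:string"/>
  <xs:element name="b" type="xs:string"/>
  <xs:element name="c" type="xs:string"/>

  <!-- Non-deterministic form (ab|ac): both branches begin with a.
  <xs:choice>
    <xs:sequence><xs:element ref="a"/><xs:element ref="b"/></xs:sequence>
    <xs:sequence><xs:element ref="a"/><xs:element ref="c"/></xs:sequence>
  </xs:choice>
  -->

  <!-- Deterministic equivalent a(b|c): the shared a is factored out,
       so the choice is only made once the next element is seen. -->
  <xs:complexType name="factored">
    <xs:sequence>
      <xs:element ref="a"/>
      <xs:choice>
        <xs:element ref="b"/>
        <xs:element ref="c"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```

Factoring out the common prefix works here because the two branches differ in their second element; the loop case a*a has no such simple rewrite in general.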
Looking at your example schema, imagine that it has just parsed a <till>, and now it needs to parse a <from>. You could parse it with the <from><till> loop or with the final <from>. You can't tell which branch to use, just by looking at that <from>. You can only tell with further looking-ahead.
Bad news: I think your example schema is one of the very rare ones that are impossible to express deterministically!
Here are the XML documents you want to accept (I'm using a single letter for each element, where a = <from>...</from> and b = <till>...</till>):
*empty*
a
ab
aba
abab
ababa
ababab
...
... you get the idea. The problem is that any letter can be the final letter in the sequence or it can be part of the loop. There is no way to tell which it will be, except by looking-ahead at the following letter. Since "deterministic" means that you don't do this lookahead (by definition), the language that you want cannot be expressed deterministically.
Simplifying your schema, it tries an approach similar to (ab)*a? - but both branches start with a. Another approach is a(ba)*b? - now both branches start with b. We can't win!
Technically, the set of all documents that a schema will accept is called that schema's language. If a deterministic schema exists that can express a language, the language is called "one-unambiguous"; yours is not.
For a theoretic discussion, see the series of papers by Bruggemann-Klein (e.g. Deterministic Regular Languages and One-Unambiguous Regular Languages).
She includes a formal test for one-unambiguous languages.
This is a simple edit of your code; I haven't tried it:
<xs:element name="activeyears">
<xs:complexType>
<xs:sequence minOccurs="0" maxOccurs="1">
<xs:element ref="from" minOccurs="1" maxOccurs="1"/>
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element ref="till" minOccurs="1" maxOccurs="1"/>
<xs:element ref="from" minOccurs="0" maxOccurs="1"/>
</xs:sequence>
</xs:sequence>
</xs:complexType>
</xs:element>
Some background: XML schema is a very simple grammar, and the schema processor is a parser that attempts to apply the rules of this grammar to the input file. Unlike the parsers used by traditional compilers, however, XML schema has no lookahead. So you can't have two rules that share the same initial set of tokens (element names).
So, the specific changes that I made:
I left your outer sequence unchanged; it controls the "empty or has specific content" requirement.
If there is content, it must start with "from"; so I made that the first element in the sequence, with an explicit occurrence count.
Since I used "from" as an explicit element, I had to reverse the order of the subsequence.
And unless you want to specify that every "till" must be followed by a "from", you need to relax the minOccurs in the subsequence.
The subsequence also handles the case of a single from/till. One caveat, as a commenter noted: with minOccurs='0' on the inner "from", this model also accepts two consecutive "till"s.
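If consecutive "till"s need to be ruled out, one option might be to keep the deterministic content model and add an XSD 1.1 assertion on top. This is an untested sketch; it assumes an XSD 1.1 processor, and the xs:gYear types for from and till are a guess, since the question does not show their declarations.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Assumed declarations; the question only references these elements. -->
  <xs:element name="from" type="xs:gYear"/>
  <xs:element name="till" type="xs:gYear"/>

  <xs:element name="activeyears">
    <xs:complexType>
      <xs:sequence minOccurs="0">
        <xs:element ref="from"/>
        <xs:sequence minOccurs="0" maxOccurs="unbounded">
          <xs:element ref="till"/>
          <xs:element ref="from" minOccurs="0"/>
        </xs:sequence>
      </xs:sequence>
      <!-- Reject any till whose immediately following sibling is also a till. -->
      <xs:assert test="not(till[following-sibling::*[1][self::till]])"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

The content model stays deterministic, and the assertion removes the extra documents it would otherwise let through.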

Resources