Why can extensions only be placed in simpleContent and complexContent containers?

I'm having difficulty understanding some of the nuances of the format for defining type extensions and restrictions in XSD. According to the W3Schools reference:
simpleContent defines "extensions or restrictions on a text-only complex type or on a simple type as content and contains no elements"
complexContent defines "extensions or restrictions on a complex type that contains mixed content or elements only"
What isn't clear to me is why XSD requires extensions and restrictions to be contained in one of these containers, and furthermore, why only extensions and restrictions require it. It would make a little more sense to me if all 'content' had to be defined in the container, but this is not the case - with base types, the content (sequences, etc.) are defined as direct children of the complexType container.
Take this example, which to me seems overly verbose:
<xs:complexType name="fullpersoninfo">
<xs:complexContent>
<xs:extension base="personinfo">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Why is it not possible to write it like this instead?
<xs:complexType name="fullpersoninfo">
<xs:extension base="personinfo">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexType>
Or even like this?
<xs:complexType name="fullpersoninfo" extends="personinfo">
<xs:sequence>
<xs:element name="address" type="xs:string"/>
<xs:element name="city" type="xs:string"/>
<xs:element name="country" type="xs:string"/>
</xs:sequence>
</xs:complexType>
I'm assuming there must be some reason it was defined the way it was, but I can't find any clues as to why.

I don't think you're going to find any useful design rationales for the XML syntax of complex types. Suffice it to say that those designing the XML syntax managed, by means of the elements you mention, to solve some technical difficulties, and that no obviously better syntax commanded consensus in the working group. You may wonder what technical difficulties are solved by simpleContent and complexContent, and that's a reasonable question, but I doubt anyone is going to be willing to undertake the excursion into the design records of XSD that would be necessary to answer it.
One simple observation: the legal children of extension and restriction vary depending on whether the parent is simpleContent or complexContent. That is accomplished using declarations local to the types of simpleContent and complexContent and would not be possible without them -- at least, not without a very thorough redesign of the XML syntax.
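A minimal sketch of that observation, assuming a hypothetical priceType and a stand-in definition of personinfo (neither is taken from the question's real schema): under simpleContent the extension can only add attributes, while under complexContent it can also add a model group such as a sequence.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- simpleContent: the base is a simple type (or a text-only complex type),
       so the extension may add attributes but no child elements. -->
  <xs:complexType name="priceType"> <!-- hypothetical name -->
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="xs:string"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <!-- Stand-in base type for the complexContent case (assumed content). -->
  <xs:complexType name="personinfo">
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- complexContent: the extension may add a model group as well as attributes. -->
  <xs:complexType name="fullpersoninfo">
    <xs:complexContent>
      <xs:extension base="personinfo">
        <xs:sequence>
          <xs:element name="city" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="id" type="xs:string"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

</xs:schema>
```

Moving the xs:sequence from the second extension into the first would make the schema document invalid, which is the kind of context-dependent rule the two wrapper elements make expressible.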

To build on C. M. Sperberg-McQueen's answer, I would think that some of it (if not more) had to do with the limitations of the language itself (I guess that is what the "technical difficulties" refers to). Most grammars try to prove that they are good enough to define themselves; imagine how little could actually have been done in the "schema for schemas", considering the limitations we still "enjoy" today in version 1.0.
Many people believe that they can fully validate an XSD by validating the XML document that is the XSD against XMLSchema.xsd; that is not the case.
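As an illustration (a made-up schema, not one from the question): every attribute value below is lexically fine, so validating this document as plain XML against the XSD 1.0 schema for schemas should not be expected to catch the problem, yet a schema processor should reject it because minOccurs is greater than maxOccurs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="list"> <!-- hypothetical element -->
    <xs:complexType>
      <xs:sequence>
        <!-- Each value is a legal number on its own; the rule that
             minOccurs must not exceed maxOccurs relates two attributes
             and is checked when the schema is compiled. -->
        <xs:element name="item" type="xs:string" minOccurs="3" maxOccurs="2"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```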
Many XML Schema designs raise the same question as yours; the answer is typically that the author wanted to maximize the constraining capabilities of their schema spec by working around limitations in the language.
Somehow I believe that if the features in 1.0 had been similar to those in 1.1, the syntax would have been different; the spec wouldn't have been easier to understand...
To make this richer, I would also explore other schema language specifications, such as RelaxNG or Schematron; maybe some argumentative discussions... A good read is probably Rick Jelliffe's take on XSD.

Related

When should a complex type be declared by directly naming the element, as opposed to using the type attribute?

http://www.w3schools.com/schema/schema_complex.asp
In the following snippet, why should the first way ever be used over the second?
We can define a complex element in an XML Schema two different ways:
A. The "employee" element can be declared directly by naming the element, like this:
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
If you use the method described above, only the "employee" element can
use the specified complex type. Note that the child elements,
"firstname" and "lastname", are surrounded by the <sequence>
indicator. This means that the child elements must appear in the same
order as they are declared. You will learn more about indicators in
the XSD Indicators chapter.
B. The "employee" element can have a type attribute that refers to the name of the complex type to use:
<xs:complexType name="personinfo">
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
If you use the method described above, several elements can refer to
the same complex type, like this:
<xs:element name="employee" type="personinfo"/>
<xs:element name="student" type="personinfo"/>
<xs:element name="member" type="personinfo"/>
<xs:complexType name="personinfo">
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
</xs:sequence>
</xs:complexType>
Why should the first way ever be used over the second?
Local types can be useful for elements which should not be reused outside of some specific context. It would make sense, for example, for elements representing table cells to be local to the types used for table rows, and for the declaration of table-row elements to be local to the type used for the element representing the table as a whole. (An element representing a table row does not -- on this account -- make any sense outside the context of a table. Making declarations local is a simple way of ensuring that elements which place particular demands on their contexts can only be used in those contexts.)
Local types in XSD can also (like local types in other languages) be useful in avoiding name collisions. If my vocabulary provides for letters to have a salutation tagged salutation, and also provides for database-like information about people in which their names, addresses, and the preferred form of address (tagged salutation) are recorded, the two elements named salutation are likely to be regarded as wholly unrelated to each other; making one or both of them local allows them both to exist within a vocabulary. (Namespaces can also be used for this purpose, but I have met few vocabulary designers who would want to put these two salutation elements into different namespaces, and even fewer XML users who would greet that prospect with anything but distaste.)
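A small sketch of the salutation case, with invented element names (letter, body, person, name) and an assumed enumeration; only the idea of two unrelated local salutation elements comes from the text above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="letter">
    <xs:complexType>
      <xs:sequence>
        <!-- Local declaration: free text such as "Dear Ms. Smith," -->
        <xs:element name="salutation" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <!-- A second, unrelated local declaration with the same name
             but a different (anonymous) type. -->
        <xs:element name="salutation">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="Dr."/>
              <xs:enumeration value="Ms."/>
              <xs:enumeration value="Mr."/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

Had both salutation elements been declared globally in the same target namespace, the two declarations would clash.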
If you're not interested in preventing re-use, stressing the semantic dependency of an element on its parent, or avoiding name collisions, then there isn't much reason to use local elements. (That said, many people do use them quite a lot, and perhaps they have reasons I don't understand. From where I sit, it just seems that many people overuse local declarations for no good reason at all.)
Some GUI XSD editors only support declaring complex types inline (directly within the element), although you can quite often create named types by hand if you do want to reuse them.
In that situation I would declare reusable complex types only where a type is actually reused, simply because it is easier not to declare them.

Vectors of a complex type

Is there a way to define the cardinality of a type at the place where that type is referenced?
<xs:complexType name="xyType">
<xs:choice maxOccurs="1" minOccurs="0">
<xs:element name="xy" maxOccurs="1">
<xs:complexType>
<xs:choice maxOccurs="unbounded" minOccurs="0">
...
</xs:choice>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
So for instance I have two types A and B that have elements that reference this type, but in one case I only allow one xy (like above) and another I would like to allow multiple xy (like if I change the maxOccurs above for xy to "unbounded").
I don't want to have to completely separate complexType definitions for xyType (single) and xyType (unbounded), because in reality the definition for this type is very long and complex.
If possible I would also like to not define too many types (like separating the inner complexType from the body and having two types referencing that type). This would also be very complex in my specific scenario (I have a complex class hierarchy that I try to define with a schema, so everything is bloated already).
So basically I'm looking for something where the type that is referencing this type is taking care about the cardinality if that makes sense at all.
I would suggest that you modularize the parts of xyType as best as possible for sharing across two types, say xyType_A that allows only one xy and xyType_B that allows an unbounded number of xys. (Of course choose semantically appropriate names rather than these stand-ins.)
For example, xyType_A and xyType_B could differ in their definitions of xy's cardinality yet share the complex machinery defined in commonType:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="xyType_A">
<xs:sequence>
<xs:element name="xy" type="commonType" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="xyType_B">
<xs:sequence>
<xs:element name="xy" type="commonType" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="commonType">
<xs:choice maxOccurs="1" minOccurs="0">
<xs:sequence>
<xs:choice maxOccurs="unbounded" minOccurs="0">
<!-- further complicated structures continue here -->
</xs:choice>
<!-- and here or wherever -->
</xs:sequence>
</xs:choice>
</xs:complexType>
</xs:schema>
The principle (if not the magnitude of opportunity) would be the same if the elements of varying cardinality are deeper in the definitional hierarchy: Factor as much of the common definitional components as possible, and reuse those in the distinctly defined types.
This wouldn't work in XSD 1.0. You could use Schematron (on top of the XSD 1.0); it would work with no issues.
It is possible in XSD 1.1. It would require a bit of work, at least based on my understanding. The solution is to use assertions; however, they seem to be supported for complex and simple types only, which means you may still need to introduce two new types specific to element A and B; however, they would simply be extending xyType (100% reuse), for the purpose of providing a place to define the assertion specific to A and B.
If you're interested in either alternative, tag the question appropriately.
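A rough sketch of the XSD 1.1 assertion idea, under several assumptions: the processor supports XSD 1.1, the names xyTypeForA, A and B are invented for illustration, and xs:string stands in for the long content of xy that the question elides. The shared xyType allows any number of xy elements, and the type used by A narrows that with an assertion.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Shared definition: any number of xy elements.
       xs:string is a placeholder for the real, complex content. -->
  <xs:complexType name="xyType">
    <xs:sequence>
      <xs:element name="xy" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Hypothetical type for A: reuses xyType unchanged and only adds
       an XSD 1.1 assertion limiting xy to at most one occurrence. -->
  <xs:complexType name="xyTypeForA">
    <xs:complexContent>
      <xs:extension base="xyType">
        <xs:assert test="count(xy) le 1"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="A" type="xyTypeForA"/>
  <!-- B keeps the unbounded cardinality of the shared type. -->
  <xs:element name="B" type="xyType"/>

</xs:schema>
```

An XSD 1.0 processor will not accept xs:assert, so this only applies where 1.1 support is available.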

Is it preferred to define a separate plural complexType for multiple singular elements

Is there any established standard for inlining trivial plural complexTypes vs. defining them separately?
In detail: When defining some XML schemas I frequently encounter cases where I want one element to contain multiple child elements of the same single type. For example a schema which describes a table in a database has a fields element which can contain one or more field elements. I can either create an inline complexType within the definition of the plural fields element:
<xs:element name="fields" minOccurs="1" maxOccurs="1">
<xs:complexType>
<xs:sequence>
<xs:element name="field" type="table-field"
minOccurs="1" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
Or I can separately define a trivial fields type and use that:
<xs:element name="fields" type="table-field-collection" minOccurs="1" maxOccurs="1"/>
<!-- Elsewhere: -->
<xs:complexType name="table-field-collection">
<xs:sequence>
<xs:element name="field" type="table-field" minOccurs="1" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
The first approach creates slightly messier markup with anonymous types, while the second creates lots of extra trivial complexTypes. Is there a consensus on which approach is preferred?
There isn't really an established standard for this. There are really three choices:
"fields" must be defined as a complex type and reused (table-field-collection above)
"fields" is an element with an anonymous sub-type
There is no fields element. Instead, "field" simply repeats within the parent element.
I have specified modelling guidelines for a number of firms and used all of these patterns. More recently, I'm tending towards the third: the encapsulating fields element does not really have any semantic meaning, other than making a nice grouping when viewing documents in some graphical tools. If you were to process this using something like JAXB, you'd probably be annoyed that fields is there; it is one more thing that can be null.
If you want to ask yourself the one relevant question from a technical point of view, then it is this: do you want to be able to inherit from table-field-collection and override it using xsi:type, or reuse it? If yes, go for the complex type. If no, go for whatever you prefer style-wise.
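For comparison with the two snippets in the question, the third option (no fields wrapper) might look like the sketch below. The parent element name table and the content of table-field are assumptions made only to keep the example self-contained.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Placeholder for the question's table-field type (assumed content). -->
  <xs:complexType name="table-field">
    <xs:attribute name="name" type="xs:string"/>
  </xs:complexType>

  <!-- Hypothetical parent: field repeats directly inside it,
       with no intermediate fields element. -->
  <xs:element name="table">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="field" type="table-field"
                    minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```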

Constraint or Restriction on xsi:type

This is a generalized example of what I am up against.
I have created derived types in my schema and want to create an element which is an unbounded list (sequence) with a restriction under which only two of the three derived types are allowed.
To say it from a top level view, "I have events where in one situation can only have two types of events".
Here is how I have defined my events and a subsequent holder of the sequence. (This all works and is valid).
The abstract item is a complex type named "EventBase" and has a common attribute called Name:
<xs:complexType name="EventBase">
<xs:annotation><xs:documentation>***Abstract Event***</xs:documentation></xs:annotation>
<xs:attribute name="Name"/>
</xs:complexType>
Then I have three events derived from the abstract as follows
<xs:complexType name="DerivedEvent1">
<xs:complexContent>
<xs:extension base="EventBase">
<xs:attribute name="Alpha" type="xs:string"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="DerivedEvent2">
<xs:complexContent>
<xs:extension base="EventBase">
<xs:attribute name="Beta"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="DerivedEvent3">
<xs:complexContent>
<xs:extension base="EventBase">
<xs:attribute name="Gamma"/>
</xs:extension>
</xs:complexContent>
</xs:complexType>
To facilitate a complex object to hold the derived events, I create a concrete "generic" event derived from the abstract complex type:
<xs:element name="Event" type="EventBase">
<xs:annotation><xs:documentation>A generic event derived from abstract.</xs:documentation></xs:annotation>
</xs:element>
Then I want to be able to hold the events, so I create a new complex object to hold the "generic" event shown above, but will actually hold derived events by the eventual consumer.
<xs:complexType name="EventsCollectionType">
<xs:annotation><xs:documentation>Holds derived events</xs:documentation></xs:annotation>
<xs:sequence>
<xs:element ref="Event" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
Finally I create an element derived from the collection type which will hold actual events:
<xs:element name="Events"><xs:annotation><xs:documentation>Concrete holder of events.</xs:documentation></xs:annotation>
<xs:complexType>
<xs:sequence>
<xs:element ref="Event" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
The resulting xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Events xsi:noNamespaceSchemaLocation="file:///C:/StackOverflow.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Event xsi:type="DerivedEvent1" Name="D1" Alpha="Content1"/>
<Event xsi:type="DerivedEvent3" Name="D1" Gamma="Content3"/>
</Events>
So the question is: how can I create a final Events element which will hold only items with specific xsi:type values?
For example, a restriction under which only derived types 1 and 3 are valid (as above), so that an Event of derived type 2 would be invalid.
I have created a public GIST (Constraint or Restriction on xsi:type)
I may be wrong, but I don't think this is possible.
Within the Events collection you essentially want to have different structures but all with the same element name "Event". This goes against a fundamental constraint of schemas: http://www.w3.org/TR/xmlschema-1/#cos-element-consistent. Using xsi:type gives the schema processor a hint that will allow it to disambiguate this choice of structures thus avoiding violating this rule. It's essentially a work-around.
Could you not give each variant a different name, so that you have a collection of "event1"s and "event3"s, or an outer collection containing a sequence of optional "event1"s and "event3"s? It would be much easier to enforce the structure in the schema this way, and you would not need xsi:type at all. I'm not sure whether you are using xsi:type in your instances to work around this limitation or for another reason, but it may be easier for anybody using the schema not to have to worry about derived types.
Alternatively, you could potentially use another technology (eg schematron) to help enforce this constraint.
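A sketch of the renaming suggestion, reusing the type names from the question; the element names Event1 and Event3 are invented here. Because each permitted kind of event has its own element name, the Events content model can simply omit DerivedEvent2.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:complexType name="EventBase">
    <xs:attribute name="Name"/>
  </xs:complexType>

  <xs:complexType name="DerivedEvent1">
    <xs:complexContent>
      <xs:extension base="EventBase">
        <xs:attribute name="Alpha" type="xs:string"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:complexType name="DerivedEvent3">
    <xs:complexContent>
      <xs:extension base="EventBase">
        <xs:attribute name="Gamma"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- Only the two permitted kinds of event can appear, in any order
       and any number; no xsi:type is needed in the instance. -->
  <xs:element name="Events">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element name="Event1" type="DerivedEvent1"/>
        <xs:element name="Event3" type="DerivedEvent3"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

An instance would then contain, for example, an Event1 element with Name and Alpha attributes followed by an Event3 element with Name and Gamma attributes.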

How to rewrite this nondeterministic XML Schema to deterministic?

Why is this non-deterministic, and how can I fix it?
<xs:element name="activeyears">
<xs:complexType>
<xs:sequence minOccurs="0" maxOccurs="1">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element ref="from" minOccurs="1" maxOccurs="1"/>
<xs:element ref="till" minOccurs="1" maxOccurs="1"/>
</xs:sequence>
<xs:element ref="from" minOccurs="0" maxOccurs="1"/>
</xs:sequence>
</xs:complexType>
</xs:element>
It is supposed to mean that <activeyears> is either empty or contains a sequence of <from><till> pairs which starts with <from> but can end with either.
A schema is non-deterministic when there are two branches that begin with the same element - so that you cannot tell which branch to take without looking ahead after that element. A simple example is ab|ac - when you see an a, you don't know which branch to take. For loops, the "branch" is whether to repeat the loop, or continue after it. An example of this is a*a - once you are in the loop, and you read an a, you don't know whether to repeat the loop, or continue.
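The ab|ac case translates to XSD roughly as follows; the element names a, b, c and root are placeholders. The rejected form is kept inside a comment so that the schema document itself stays valid, and the declared content model is the deterministic rewrite a(b|c), which accepts the same documents.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="a" type="xs:string"/>
  <xs:element name="b" type="xs:string"/>
  <xs:element name="c" type="xs:string"/>

  <!-- Non-deterministic form, shown for comparison only: both branches
       of the choice begin with a, so a processor cannot pick a branch
       on seeing a.
       <xs:choice>
         <xs:sequence><xs:element ref="a"/><xs:element ref="b"/></xs:sequence>
         <xs:sequence><xs:element ref="a"/><xs:element ref="c"/></xs:sequence>
       </xs:choice>
  -->

  <!-- Deterministic rewrite: read a once, then choose between b and c. -->
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="a"/>
        <xs:choice>
          <xs:element ref="b"/>
          <xs:element ref="c"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```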
Looking at your example schema, imagine that it has just parsed a <till>, and now it needs to parse a <from>. You could parse it with the <from><till> loop or with the final <from>. You can't tell which branch to use, just by looking at that <from>. You can only tell with further looking-ahead.
Bad news: I think your example schema is a very rare one, that it is impossible to express deterministically!
Here are the XML documents you want to accept (I'm using a single letter for each element, where a = <from>...</from> and b = <till>...</till>):
*empty*
a
ab
aba
abab
ababa
ababab
...
... you get the idea. The problem is that any letter can be the final letter in the sequence or it can be part of the loop. There is no way to tell which it will be, except by looking-ahead at the following letter. Since "deterministic" means that you don't do this lookahead (by definition), the language that you want cannot be expressed deterministically.
Simplifying your schema, it tries an approach similar to (ab)*a? - but both branches start with a. Another approach is a(ba)*b? - now both branches start with b. We can't win!
Technically, the set of all documents that a schema will accept is called that schema's language. If a deterministic schema exists that can express a language, the language is called "one-unambiguous"; the language you want is not one-unambiguous.
For a theoretic discussion, see the series of papers by Bruggemann-Klein (e.g. Deterministic Regular Languages and One-Unambiguous Regular Languages).
She includes a formal test for one-unambiguous languages.
This is a simple edit of your code; I haven't tried it:
<xs:element name="activeyears">
<xs:complexType>
<xs:sequence minOccurs="0" maxOccurs="1">
<xs:element ref="from" minOccurs="1" maxOccurs="1"/>
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element ref="till" minOccurs="1" maxOccurs="1"/>
<xs:element ref="from" minOccurs="0" maxOccurs="1"/>
</xs:sequence>
</xs:sequence>
</xs:complexType>
</xs:element>
Some background: XML schema is a very simple grammar, and the schema processor is a parser that attempts to apply the rules of this grammar to the input file. Unlike the parsers used by traditional compilers, however, XML schema has no lookahead. So you can't have two rules that share the same initial set of tokens (element names).
So, the specific changes that I made:
I left your outer sequence unchanged; it controls the "empty or has specific content" requirement.
If there is content, it must start with "from", so I made that the first element in the sequence, with an explicit occurrence count.
Since I used "from" as an explicit element, I had to reverse the order of the subsequence.
And unless you want to specify that every "till" must be followed by a "from", you need to relax the minOccurs in the subsequence.
The subsequence also handles the case of a single from/till -- as a commenter noted, my second edit with the minOccurs='0' allowed a terminating sequence of two "till"s.

Resources