Sorry for such a long question. I'll try to make it as simple as it gets.
We are working on a modular computational framework. The framework reads the configuration from xml file and we would like to validate the configuration against xsd. While adding new modules, we would like the xsd to be as simple as possible in the sense of maintainability.
The structure of framework is the following (in c#):
interface IModule { ... }
interface ISource : IModule { ... }
interface IFilter : IModule { ... }
interface IProcessor : IModule { ... }
and we have abstract implementation of various module interfaces and the implementation of modules (e.g. ObjLoader : AbstractSource). The structure of xml is following:
<Test>
<Sources>
<ObjLoader .../>
</Sources>
<Filters>
...
</Filters>
<Processors>
...
</Test>
Currently to avoid the necessity to heavily modify the xsd with every new module, we have the following test.xsd:
<xs:include schemaLocation="framework.xsd"/>
<xs:include ... modules />
<!-- adding includes for every module or group
of modules is the only modification -->
<xs:element name="Test">
<xs:complexType>
<xs:sequence>
<xs:element ref="this:Sources"
minOccurs="1"
maxOccurs="1"/>
...
</xs:sequence>
</xs:complexType>
</xs:element>
and common framework.xsd:
<xs:element name="Sources">
<xs:complexType>
<xs:choice maxOccurs="unbounded">
<xs:element ref="this:AbstractSource"/>
</xs:choice>
</xs:complexType>
</xs:element>
<xs:element name="AbstractSource"
abstract="true"/>
<xs:complexType name="ISource">
<xs:complexContent>
<xs:extension base="this:IModule">
...
</xs:extension>
</xs:complexContent>
</xs:complexType>
...
Each module (or group of modules) has its own xsd file with its definition and this file is included in the test.xsd:
<xs:include schemaLocation="framework.xsd"/>
<xs:element name="ObjLoader"
substitutionGroup="this:AbstractSource">
<xs:complexType>
<xs:complexContent>
<xs:extension base="this:ISource">
...
</xs:extension>
</xs:complexContent>
</xs:complexType>
</xs:element>
So far so good and everything works well. For each dll with modules we provide corresponding xsd and include it in test.xsd. But during the development, modules that are primarily of some kind, but can produce another operation as a side effect appeared. It is not a problem in the c# side as we can have the following class:
class GreatProcessingFilter : AbstractProcessor,
IFilter { ... }
Is there a way to modify our xsd to be able to handle such elements that can be a substitution of multiple abstract elements? Something like:
<xs:element name="GreatProcessingFilter"
substitutionGroup="AbstractProcessor"
substitutionGroup="AbstractFilter"> ...
XSD 1.1 does allow multiple QNames in the substitutionGroup attribute. It does not, however, cause the element declared as a member of multiple substitution groups to be given a type constructed dynamically from the types assigned to the various substitution-group heads. So it probably doesn't qualify as multiple inheritance in the sense you are looking for. You may find XML schemas with multiple inheritance of interest.
[Addition, responding to OP's comment] In XSD 1.1, a top-level element can be declared with multiple substitution groups; it can then be substituted for any of the elements named (unless blocked by the relevant final attribute on the substitution-group head). The constraints on substitution-group affiliations are as in 1.0: the type of the element being declared must be validly substitutable for the type of each substitution group head. If no type is given for the element, it defaults to having the same type as the first substitution-group head.
Related
I would like to validate a custom XML document against a schema.
This document would include a structure with any number of elements, each having a specific attribute. Something like this:
<Root xmlns="http://tns">
<Records>
<FirstRecord attr='whatever'>content for first record</FirstRecord>
<SecondRecord attr='whatever'>content for first record</SecondRecord>
...
<LastRecord attr='whatever'>content for first record</LastRecord>
</Records>
</Root>
The author of the XML document can include any number of records, each with an arbitrary name of his or her choosing. How is this possible to validate this against an XML Schema ?
I have tried to specify the appropriate structure type in a schema, but I do not know how to reference it in the appropriate location:
<xs:schema xmlns="http://tns" targetNamespace="http://tns" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="RecordType"> <!-- This is my record type -->
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="attr" type="xs:string" use="required" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
<xs:element name="Root">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" maxOccurs="1" name="Records">
<!-- This is where records should go -->
<xs:complexType />
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
What you describe is possible in XSD 1.1, and something very similar is possible in XSD 1.0, which is not to say it's an advisable design.
In XML vocabularies, the element type normally conveys relevant information about the type of information, and it is the name of the element type that is used to drive validation in most XML schema languages; the design you describe is (some would say) a little bit like asking if I can define an object class in Java, or a struct in C, which obeys the constraints that the members can have arbitrary names, as long as one of them is an integer with the value 42. That, or something like it, may well be possible, but most experienced designers will feel strongly that this is almost certainly not the right way to go about solving any normal problem.
On the other hand, doing unusual and awkward things with a system can sometimes help in learning how to use the system effectively. (You never know a system well, said a friend of mine once, until you have thoroughly abused it.) So my answer has two parts: how to come as close as possible to the design you specify in XSD, and alternatives you might consider instead.
The simplest way to specify the language you seem to want in XSD 1.1 is to define an assertion on the Records element which says (1) that every child of Records has an 'attr' attribute and (2) that no child of Records has any children. You'll have something like this:
...
<xs:element minOccurs="1" maxOccurs="1" name="Records">
<xs:complexType>
<xs:sequence>
<xs:any/>
</xs:sequence>
<xs:assert
test="every $child in * satisfies $child/#attr"/>
<xs:assert
test="not(*/*)"/>
</xs:complexType>
</xs:element>
...
As you can see, this is very similar to what InfantPro'Aravind' has described; it avoids the problems identified by InfantPro'Aravind' by using assertion, not type assignment, to impose the constraints you impose.
In XSD 1.0, assertion is not available, and the only way I can think of to come close to the design you describe is define an abstract element, which I'll call Record, as the child of Records, and to require that the elements which actually occur as children of Records be declared as being substitutable for this abstract type (which in turn requires that their types be derived from type RecordType). Your schema might say something like this:
<xs:element name="Root">
<xs:complexType>
<xs:sequence>
<xs:element name="Records">
<xs:complexType>
<xs:sequence>
<xs:element name="Record"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="RecordType">
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="attr"
type="xs:string"
use="required" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
<xs:element name="Record"
type="RecordType"
abstract="true"/>
Elsewhere in the schema (possibly in a separate schema document) you will need to declare FirstRecord, etc., and specify that they are substitutable for Record, thus:
<xs:element name="FirstRecord" substitutionGroup="Record"/>
<xs:element name="SecondRecord" substitutionGroup="Record"/>
<xs:element name="ThirdRecord" substitutionGroup="Record"/>
...
At some level, this matches your description, though I suspect you did not want to have to declare FirstRecord, SecondRecord, etc.
Having described ways in which XSD can do what you describe, I should also say that I wouldn't recommend either of these approaches. Instead, I'd design the XML vocabulary differently to work more naturally with XSD.
In the design as you specify it, every record appears to have the same type, but in addition to the content of the element they are allowed to convey a certain additional quantity of information by having a different name (FirstRecord, SecondRecord, etc.). This additional information could just as easily be conveyed in an attribute, which would allow you to specify Record as a concrete element, rather than an abstract element, giving it an extra "alternate-name" attribute. Your sample data would then take a form like this:
<Root xmlns="http://tns">
<Records>
<Record
alternate-name="FirstRecord"
attr='whatever'>content for first record</Record>
<Record
alternate-name="SecondRecord"
attr='whatever'>content for first record</Record>
...
<Record
alternate-name="LastRecord"
attr='whatever'>content for first record</Record>
</Records>
</Root>
This will be more or less acceptable depending on whether you or your data providers or tools in your tool chain attach some mystic or other significance to having the string "FirstRecord" be an element type name instead of an attribute value.
Alternatively, one could say that the point of the design is to allow Records to contain an arbitrary sequence of elements of arbitrary structure (on this account, the restriction to xs:string is just an artifact of your example and is not really desired in reality) as long as we have, for each record, the information recorded in the 'attr' attribute. Easy enough to specify this: define 'Record' as a concrete element with an 'attr' attribute, accepting one child which can be any XML element:
<xs:element name="Record">
<xs:complexType>
<xs:sequence>
<xs:any processContents="lax"/>
</xs:sequence>
<xs:attribute name="attr"
use="required"
type="xs:string"/>
</xs:complexType>
</xs:element>
The value of the 'processContents' attribute can be changed to 'strict' or 'skip' or kept at 'lax', depending on whether you want FirstRecord, SecondRecord, etc. to be validated (and declared) or not.
I guess this is not possible with XSD alone!
When you say
any number of records, each with an arbitrary name of his or her choosing.
That forces us to use <xs:any/> element but! Having an element declared as any doesn't allow you to validate attributes under it..
So .. the answer is NO!
I have the following in a schema:
<xs:element name="td">
<xs:complexType>
<xs:complexContent>
<xs:extension base="cell.type"/>
</xs:complexContent>
</xs:complexType>
</xs:element>
<xs:complexType name="cell.type" mixed="true">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element ref="p"/>
</xs:sequence>
</xs:complexType>
Some parsers allow PCDATA directly in the element, while others don't. There's something in the XSD recommendation (3.4.2) that says when a complex type has complex content, and neither has a mixed attribute, the effective mixed is false. That means the only way mixed content could be in effect is if the extension of cell.type causes the mixed="true" to be inherited.
Could someone more familiar with schemas comment on the correct interpretation?
(BTW: if I had control of the schema I would move the mixed="true" to the element definition, but that's not my call.)
Anyone reading my question might want to read this thread also (by Damien). It seems my answer isn't entirely right: parsers/validators don't handle mixed attribute declarations on base/derived elements the same way.
Concerning extended complex types, sub-section 1.4.3.2.2.1 of section 3.4.6 in part 1 of W3C's XML Schema specification says that
Both [derived and base] {content type}s must be mixed or both must be element-only.
So yes, it is inherited (or more like you cannot overwrite it—same thing in the end).
Basically, what you've described is the desired (and as far as I'm concerned) the most logical behavior.
I've created a simple schema to run a little test with Eclipse's XML tools.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="c">
<xs:complexType>
<xs:complexContent mixed="false">
<xs:extension base="a"/>
</xs:complexContent>
</xs:complexType>
</xs:element>
<xs:complexType name="a" mixed="true">
<xs:sequence minOccurs="0" maxOccurs="unbounded">
<xs:element name="b"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
The above schema is valid, in the sense that not Eclipse's nor W3C's "official" XML Schema validator notices any issues with it.
The following XML passes validation against the aforementioned schema.
<?xml version="1.0" encoding="UTF-8"?>
<c xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="test.xsd">
x
<b/>
y
</c>
So basically you cannot overwrite the mixedness of a complex base type. To support this statement further, try and swap the base and dervied types' mixedness. In that case the XML fails to validate, because the derived type won't be mixed as it (yet again) cannot overwrite the base's mixedness.
You've also said that
Some parsers allow PCDATA directly in the element, while others don't
It couldn't hurt to clarify which parsers are you talking about. A good parser shouldn't fail when it encounters mixed content. A validating parser, given the proper schema, will fail if it encounters mixed content when the schema does not allow it.
I would like to allow for an element to either contain an attribute OR define a more complex type.
Something like
<myElement someAttr="..."/>
or
<myElement>
<...>
</myElement>
That is, if someAttr exists then I do not want to allow sub elements and if it doesn't then I want to.
The reason for this is I want to have an "include" feature where I include a file which is essentially inserted into the element. But I don't want both. You can either include additional external xml code into the element or add your own BUT not both. (or also to have it inserted from a separate part of the xml)
This is mainly for simplifying a complex xml so that the structure is easily understood.
I doubt you will be able to express something like that in XML schema at this point.
You can make an attribute optional, e.g. it can be present or not. But you cannot express something like if the attribute is not present, then include other complex content with the current means.
You'll have to either check this programmatically yourself, or maybe investigate if other XML description languages like RelaxNG or Schematron might be able to help.
Perhaps with a static choice and you change the myElement name ?
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="root">
<xs:complexType>
<xs:choice>
<xs:element name="myElementWithAttrs">
<xs:complexType>
<xs:attribute name="someAttrs" type="xs:string"/>
</xs:complexType>
</xs:element>
<xs:element name="myElementWithoutAttrs">
<xs:complexType>
<xs:sequence>
<xs:any processContents="skip"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:complexType>
</xs:element>
</xs:schema>
This XSD portion was obtained from: http://www.iana.org/assignments/xml-registry/schema/netconf.xsd
<xs:complexType name="rpcType">
<xs:sequence>
<xs:element ref="rpcOperation"/>
</xs:sequence>
<xs:attribute name="message-id" type="messageIdType" use="required"/>
<xs:anyAttribute processContents="lax"/>
</xs:complexType>
<xs:element name="rpc" type="rpcType"/>
And is the core to function calls in NETCONF being the node of an XML document. I am curious as to why it is not something like:
<xs:element name="rpcType">
<xs:complexType>
<xs:sequence>
<xs:element ref="rpcOperation"/>
</xs:sequence>
<xs:attribute name="message-id" type="messageIdType" use="required"/>
<xs:anyAttribute processContents="lax"/>
</xs:complexType>
</xs:element>
The reasoning is that in #1 when trying to marshall a bean (in jaxb2) I get the exception:
[com.sun.istack.SAXException2: unable to marshal type "netconf.RpcType" as an element because it is missing an #XmlRootElement annotation]
I have been reading this article over and over again, and really cant get a hold of the difference, and why it would be #1 vs #2...
It's not obvious, I'll grant you. It comes down to the type vs element decision.
When you have something like
<xs:element name="rpcType">
<xs:complexType>
This is essentially an "anonymous type", and is a type which can never occur anywhere other than inside the element rpcType. Because of this certainty, XJC knows that that type will always have the name rpcType, and so generates an #XmlRootElement annotation for it, with the rpcType name.
On the other hand, when you have
<xs:complexType name="rpcType">
then this defines a re-usable type which could potentially be referred to by several different elements. The fact that in your schema it is only referred to by one element is irrelevant. Because of this uncertainty, XJC hedges its bets and does not generate an #XmlRootElement.
The JAXB Reference Implementation has a proprietary XJC flag called "simple binding mode" which, among other things, assumes that the schema you're compiling will never be extended or combined with another. This allows it to make certain assumptions, so if it sees a named complexType only being used by one element, then it will often generate #XmlRootElement for it.
The reality is rather more subtle and complex than that, but in 90% of cases, this is a sufficient explanation.
Quite an involved question. There are many reasons to design schemas using types rather than elements (this approach is called the "venetian blind" approach versus "salami slice" for using global elements). One of the reasons is that types can be sub-typed, and another that it may be useful to only have elements global that can be root elements.
See this article for some more details on the schema side.
Now, as for the JAXB question in particular. The problem is that you created a class corresponding to a type and tried to serialise it. That means JAXB knows its content model, but not what the element name should be. You need to attach your RpcType to an element (JAXBElement), for example:
marshaller.marshal(new ObjectFactory().createRpc(myRpcType));
The ObjectFactory was placed into the package created by JAXB for you.
Advantages of Named Types
The advantage of a schema using global/named types is that child/sub types can be created that extend the parent type.
<xs:complexType name="rpcType">
<xs:sequence>
<xs:element ref="rpcOperation"/>
</xs:sequence>
<xs:attribute name="message-id" type="messageIdType" use="required"/>
<xs:anyAttribute processContents="lax"/>
</xs:complexType>
<xs:element name="rpc" type="rpcType"/>
The above fragment would allow the following child type to be created:
<xs:complexType name="myRPCType">
<xs:complexContent>
<xs:extension base="rpcType">
<xs:sequence>
<xs:element name="childProperty" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Impact on JAXB
Another aspect of named types is that they may be used by multiple elements:
<xs:element name="FOO" type="rpcType"/>
<xs:element name="BAR" type="rpcType"/>
This means that the schema to Java compiler cannot simply just pick one of the possible elements to be the #XmlRootElement for the class corresponding to "rpcType".
I have an XML schema that includes multiple addresses:
<xs:element name="personal_address" maxOccurs="1">
<!-- address fields go here -->
</xs:element>
<xs:element name="business_address" maxOccurs="1">
<!-- address fields go here -->
</xs:element>
Within each address element, I include a "US State" enumeration:
<xs:simpleType name="state">
<xs:restriction base="xs:string">
<xs:enumeration value="AL" />
<xs:enumeration value="AK" />
<xs:enumeration value="AS" />
....
<xs:enumeration value="WY" />
</xs:restriction>
</xs:simpleType>
How do I go about writing the "US State" enumeration once and re-using it in each of my address elements? I apologize in advance if this is a n00b question -- I've never written an XSD before.
My initial stab at it is the following:
<xs:element name="business_address" maxOccurs="1">
<!-- address fields go here -->
<xs:element name="business_address_state" type="state" maxOccurs="1"></xs:element>
</xs:element>
I think you are on the right tracks. I think its more to do with XML namespaces. Try the following:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org/foo"
xmlns:tns="http://www.example.org/foo"
elementFormDefault="qualified">
<xs:element name="business_address">
<xs:complexType>
<xs:sequence>
<xs:element name="business_address_state"
type="tns:state" maxOccurs="1" />
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:simpleType name="state">
<xs:restriction base="xs:string">
<xs:enumeration value="AL" />
<xs:enumeration value="AK" />
<xs:enumeration value="AS" />
<xs:enumeration value="WY" />
</xs:restriction>
</xs:simpleType>
</xs:schema>
Note that the type is tns:state not just state
And then this is how you would use it:
<?xml version="1.0" encoding="UTF-8"?>
<business_address xmlns="http://www.example.org/foo"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.example.org/foo foo.xsd ">
<business_address_state>AL</business_address_state>
</business_address>
Notice that this XML uses a default namespace the same as the targetNamespace of the XSD
While namespaces help keep schemas organized and prevent conflicts,
it's not the namespace above that allows for the reuse,
it's the placement of the type as an immediate child of the <xs:schema> root
that makes it a global type. (Usable within the namespace w/o the namespace qualifier and from anywhere that the tns namespace is visible w/ the tns: qualifier.)
I prefer to construct my schemas following the "Garden of Eden" approach, which maximizes reuse of both elements and types (and can also facilitate external logical referencing of the carefully made unique type/element from, say, a data dictionary stored in a database.
Note that while the "Garden of Eden" schema pattern offers the maximum reuse, it also involves the most work. At the bottom of this post, I've provided links to the other patterns covered in the blog series.
• The Garden of Eden approach http://blogs.msdn.com/skaufman/archive/2005/05/10/416269.aspx
Uses a modular approach by defining all elements globally and like the Venetian Blind approach all type definitions are declared globally. Each element is globally defined as an immediate child of the node and its type attribute can be set to one of the named complex types.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="TargetNamespace" xmlns:TN="TargetNamespace" xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified" attributeFormDefault="unqualified">
<xs:element name="BookInformation" type="BookInformationType"/>
<xs:complexType name="BookInformationType">
<xs:sequence>
<xs:element ref="Title"/>
<xs:element ref="ISBN"/>
<xs:element ref="Publisher"/>
<xs:element ref="PeopleInvolved" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="PeopleInvolvedType">
<xs:sequence>
<xs:element name="Author"/>
</xs:sequence>
</xs:complexType>
<xs:element name="Title"/>
<xs:element name="ISBN"/>
<xs:element name="Publisher"/>
<xs:element name="PeopleInvolved" type="PeopleInvolvedType"/>
</xs:schema>
The advantage of this approach is that the schemas are reusable. Since both the elements and types are defined globally both are available for reuse. This approach offers the maximum amount of reusable content.
The disadvantages are the that the schema is verbose.
This would be an appropriate design when you are creating general libraries in which you can afford to make no assumptions about the scope of the schema elements and types and their use in other schemas particularly in reference to extensibility and modularity.
Since every distinct type and element has a single global definition, these canonical particles/components can be related one-to-one to identifiers in a database. And while it may at first glance seem like a tiresome ongoing manual task to maintain the associations between the textual XSD particles/components and the database, SQL Server 2005 can in fact generate canonical schema component identifiers via the statement
CREATE XML SCHEMA COLLECTION
http://technet.microsoft.com/en-us/library/ms179457.aspx
Conversely, to construct a schema from the canonical particles, SQL Server 2005 provides the
SELECT xml_schema_namespace function
http://technet.microsoft.com/en-us/library/ms191170.aspx
ca·non·i·cal
Related to Mathematics. (of an equation, coordinate, etc.)
"in simplest or standard form"
http://dictionary.reference.com/browse/canonical
Other, easier to construct, but less resuable/more "denormalized/redundant" schema patterns include
• The Russian Doll approach http://blogs.msdn.com/skaufman/archive/2005/04/21/410486.aspx
The schema has one single global element - the root element. All other elements and types are nested progressively deeper giving it the name due to each type fitting into the one above it. Since the elements in this design are declared locally they will not be reusable through the import or include statements.
• The the Salami Slice approach http://blogs.msdn.com/skaufman/archive/2005/04/25/411809.aspx
All elements are defined globally but the type definitions are defined locally. This way other schemas may reuse the elements. With this approach, a global element with its locally defined type provide a complete description of the elements content. This information 'slice' is declared individually and then aggregated back together and may also be pieced together to construct other schemas.
• The Venetian Blind approach http://blogs.msdn.com/skaufman/archive/2005/04/29/413491.aspx
Similar to the Russian Doll approach in that they both use a single global element. The Venetian Blind approach describes a modular approach by naming and defining all type definitions globally (as opposed to the Salami Slice approach which declares elements globally and types locally). Each globally defined type describes an individual "slat" and can be reused by other components. In addition, all the locally declared elements can be namespace qualified or namespace unqualified (the slats can be "opened" or "closed") depending on the elementFormDefault attribute setting at the top of the schema.