DFDL Schema for parsing delimited text message - xsd

I need a little help with DFDL. I need to parse the message below into something like an XML/tree structure. The elements are not fixed; they are dynamic, and other elements will sometimes appear.
The expected XML/tree output is something like the following:
<root>
<CLIENT_ID>DESKTOPCLIENT</CLIENT_ID>
<LOCALE>en-US</LOCALE>
<ENCODE/>
</root>

Something like this is a possible solution, tested in Daffodil:
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
<xs:include schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:format
ref="GeneralFormat"
lengthKind="delimited"
/>
</xs:appinfo>
</xs:annotation>
<xs:element name="root" dfdl:initiator="%ESC;" dfdl:terminator="%SUB;">
<xs:complexType>
<xs:sequence dfdl:separator="%CAN;" dfdl:separatorPosition="prefix" dfdl:sequenceKind="unordered">
<xs:element name="CLIENT_ID" type="xs:string" dfdl:initiator="CLIENT_ID%NAK;" />
<xs:element name="LOCALE" type="xs:string" dfdl:initiator="LOCALE%NAK;" />
<xs:element name="ENCODE" type="xs:string" dfdl:initiator="ENCODE%NAK;" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Note that this assumes the individual elements have fixed names and that all of them are present, though their order does not matter. If you know the fixed names but the elements may or may not be present, you can add minOccurs="0" to the elements in the unordered sequence.
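As a rough sketch of that variation, the sequence from the schema above could be written as follows. It reuses the element names from the question and assumes that the occurrence-related properties (such as occursCountKind) are supplied by the included GeneralFormat, which is worth checking against your DFDL processor.

```xml
<!-- Sketch only: same unordered sequence as above, with every child made optional -->
<xs:sequence dfdl:separator="%CAN;" dfdl:separatorPosition="prefix" dfdl:sequenceKind="unordered">
  <xs:element name="CLIENT_ID" type="xs:string" minOccurs="0" dfdl:initiator="CLIENT_ID%NAK;" />
  <xs:element name="LOCALE" type="xs:string" minOccurs="0" dfdl:initiator="LOCALE%NAK;" />
  <xs:element name="ENCODE" type="xs:string" minOccurs="0" dfdl:initiator="ENCODE%NAK;" />
</xs:sequence>
```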
However, DFDL does not allow dynamic element names, so if you don't know the names in advance, you need a slightly different schema. Instead, you need to describe the data as an unbounded number of name/value pairs, where the name and value are separated by %NAK;, for example:
<xs:element name="root" dfdl:initiator="%ESC;" dfdl:terminator="%SUB;">
<xs:complexType>
<xs:sequence dfdl:separator="%CAN;" dfdl:separatorPosition="prefix">
<xs:element name="element" maxOccurs="unbounded">
<xs:complexType>
<xs:sequence dfdl:separator="%NAK;" dfdl:separatorPosition="infix">
<xs:element name="name" type="xs:string" />
<xs:element name="value" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
This results in an infoset that looks something like this:
<root>
<element>
<name>CLIENT_ID</name>
<value>DESKTOPCLIENT</value>
</element>
<element>
<name>LOCALE</name>
<value>en-US</value>
</element>
<element>
<name>ENCODE</name>
<value></value>
</element>
</root>
If you need the XML tags to match the name fields like in your question, you would then need to transform the infoset. XSLT can do this kind of transformation without much trouble.
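A minimal XSLT 1.0 sketch of such a transformation is shown here. It assumes the infoset has exactly the root/element/name/value shape shown above and that every name value is a legal XML element name; anything else would need extra handling.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />
  <xsl:template match="/root">
    <root>
      <!-- One output element per name/value pair, named after the name field -->
      <xsl:for-each select="element">
        <xsl:element name="{name}">
          <xsl:value-of select="value" />
        </xsl:element>
      </xsl:for-each>
    </root>
  </xsl:template>
</xsl:stylesheet>
```

An empty or nil value simply yields an empty element, which matches the <ENCODE/> in the question.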
Edit: There seems to be an issue where IBM DFDL does not like the above solution. I'm not sure why, but it works with Apache Daffodil. Something about value being the empty string causes an issue. After some trial and error, I've found that IBM DFDL (and Apache Daffodil too) are okay with it if you specify that empty value elements should be treated as nil. So changing the value element to this works:
<xs:element name="value" type="xs:string" nillable="true"
dfdl:nilKind="literalValue" dfdl:nilValue="%ES;"
dfdl:useNilForDefault="no"/>
In that case, the infoset ends up with something like this:
<element>
<name>ENCODE</name>
<value xsi:nil="true"></value>
</element>
Edit2: The nillable properties are required because otherwise IBM DFDL treats an empty string value as absent rather than as present with an empty value, and the absent value is what triggers the error. Newer versions of the DFDL spec add a property, emptyElementParsePolicy, which controls whether empty strings are treated as absent or as empty strings. Daffodil implements this property as an extension and defaults to the treat-as-empty behavior, while IBM DFDL has the treat-as-absent behavior. When the property is set to treat as absent, Daffodil behaves similarly to IBM DFDL.

Related

Why does the validation of keyref depend on the ordering of the key element?

My document contains A elements with IDs and B Elements which reference the As, like this:
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="file:\\\refissue.xsd">
<A id="x"/>
<A id="y"/>
<B><Aref idref="x" /></B>
</root>
When I validate against my simple schema (see below) I get the following error:
cvc-identity-constraint.4.3: Key 'ref' with value 'x' not found for identity constraint of element 'root'.
If I change the ordering of the A element to
<A id="y"/>
<A id="x"/>
the document validates without any errors.
Why does the validation result depend on the ordering of the elements?
Is this a bug in the validator or in my schema?
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="root">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="A">
<xs:complexType>
<xs:attribute name="id" type="xs:ID" />
</xs:complexType>
<xs:key name="A.KEY">
<xs:selector xpath="." />
<xs:field xpath="@id" />
</xs:key>
</xs:element>
<xs:element maxOccurs="unbounded" name="B">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="0" maxOccurs="1" name="Aref">
<xs:complexType>
<xs:attribute name="idref" type="xs:IDREF" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:keyref name="ref" refer="A.KEY">
<xs:selector xpath="B/Aref" />
<xs:field xpath="@idref" />
</xs:keyref>
</xs:element>
</xs:schema>
I tried the validation with Eclipse (which uses Xerces, I think), xerces-c 3.1.1, xmlstarlet 1.5.0 and libxml2 2.7.8, and I get the error only with Eclipse and Xerces.
You're right, validity against an identity constraint should not depend on the order of elements in the input.
Here I think the problem is that the schema is not quite right, and Xerces is having trouble generating a useful diagnosis of the problem. (The fact that libxml doesn't report an error is just a consequence of its incomplete coverage of XSD.)
Your key constraint should be defined on the scope of the element within which the key values need to be unique -- so on the root element, not on the A element. (As defined, your A.KEY constraint requires that the string value of each A element be unique within that A element, which will always be the case. The fact that the id attribute is declared as being of type xs:ID does require uniqueness, of course. And similarly, the fact that the Aref idref attribute is declared as being of type xs:IDREF means that your key and keyref declarations are not actually doing much work here that's not already being done by ID and IDREF.)
Once you move the declaration of A.KEY to the declaration of the root element, Xerces and Saxon agree that the schema is OK and the document is valid.
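A sketch of the schema with A.KEY moved up to the root declaration might look like this. The selector xpath="A" is an assumption about how the relocated key would select the A children, and the field paths are written with the attribute axis (@id, @idref).

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="A">
          <xs:complexType>
            <xs:attribute name="id" type="xs:ID" />
          </xs:complexType>
        </xs:element>
        <xs:element maxOccurs="unbounded" name="B">
          <xs:complexType>
            <xs:sequence>
              <xs:element minOccurs="0" maxOccurs="1" name="Aref">
                <xs:complexType>
                  <xs:attribute name="idref" type="xs:IDREF" />
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- Key is now scoped to root, so ids must be unique across all A children -->
    <xs:key name="A.KEY">
      <xs:selector xpath="A" />
      <xs:field xpath="@id" />
    </xs:key>
    <!-- Keyref lives in the same scope as the key it refers to -->
    <xs:keyref name="ref" refer="A.KEY">
      <xs:selector xpath="B/Aref" />
      <xs:field xpath="@idref" />
    </xs:keyref>
  </xs:element>
</xs:schema>
```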
I had a similar problem in Eclipse until the xs:key and the xs:keyref were both explicitly set to the same type. In my case I set both to xs:string. (I was also using xs:unique and a keyref referring to the unique, but it seems to work the same way for key and keyref pairs.)
So for example if the key is based on an element that looks like this:
<xs:complexType name="elementTypeWithKey">
<xs:attribute name="theKey" type="xs:string"/>
</xs:complexType>
and the theKey attribute is explicitly xs:string, make sure that the attribute used as a keyRef is also explicitly xs:string:
<xs:complexType name="elementTypeWithKeyRef">
<xs:attribute name="theKeyRef" type="xs:string"/>
</xs:complexType>

xml schema maxOccurs = unbounded within xs:all

Is it possible to have a combination of xs:all and xs:sequence?
I have an XML structure with an element probenode which consists of the elements name, id, url, tags, priority, status_raw and active, plus a combination of device and group elements.
device and group can occur zero or more times.
The solution below doesn't work because maxOccurs="unbounded" is not allowed on an element within an all group.
<xs:complexType name="probenodetype">
<xs:all>
<xs:element name="name" type="xs:string" />
<xs:element name="id" type="xs:unsignedInt" />
<xs:element name="url" type="xs:string" />
<xs:element name="tags" />
<xs:element name="priority" type="xs:unsignedInt" />
<xs:element name="status_raw" type="xs:unsignedInt" />
<xs:element name="active" type="xs:boolean" />
<xs:element name="device" type="devicetype" minOccurs="0" maxOccurs="unbounded">
<!-- see devicetype -->
</xs:element>
<xs:element name="group" type="grouptype" minOccurs="0" maxOccurs="unbounded">
<!-- see grouptype -->
</xs:element>
</xs:all>
<xs:attribute name="noaccess" type="xs:integer" use="optional" />
</xs:complexType>
In XSD 1.0, the children of xs:all must have maxOccurs set to 1.
In XSD 1.1 this constraint is lifted.
So your alternatives appear to be:
Use an XSD 1.1 processor (Saxon or Xerces-J).
Use XSD 1.0 and impose an order on the children of probenodetype. This is a problem if the order in which the children appear carries information (so id followed by url is different from url followed by id ...).
In some simple cases it's feasible to write a content model that accepts precisely what you suggest you want, using only choice and sequence, but with seven required elements the resulting content model is likely to be too long and complex to be useful.
At this point some users give up and write a complex type with a repeatable OR-group and move the responsibility for checking that name, id, url, etc. all occur at least once and at most once into the application; that allows the generator of the XML not to have to worry about a fixed order (and opens a side channel for information leakage, which matters to some people) but also renders the schema somewhat less useful as documentation of the contract between data provider and data consumer.
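For illustration, the repeatable OR-group described in the last alternative could be sketched like this. It reuses the element and type names from the question; the schema then no longer enforces that name, id, url and the rest occur exactly once, so that check has to live in the application.

```xml
<xs:complexType name="probenodetype">
  <!-- Any child, in any order, any number of times -->
  <xs:choice minOccurs="0" maxOccurs="unbounded">
    <xs:element name="name" type="xs:string" />
    <xs:element name="id" type="xs:unsignedInt" />
    <xs:element name="url" type="xs:string" />
    <xs:element name="tags" />
    <xs:element name="priority" type="xs:unsignedInt" />
    <xs:element name="status_raw" type="xs:unsignedInt" />
    <xs:element name="active" type="xs:boolean" />
    <xs:element name="device" type="devicetype" />
    <xs:element name="group" type="grouptype" />
  </xs:choice>
  <xs:attribute name="noaccess" type="xs:integer" use="optional" />
</xs:complexType>
```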

minOccurs/maxOccurs in XML Schema

Given this XML Schema snippet:
<xs:element name="data">
<xs:complexType>
<xs:sequence>
<xs:element name="param" type="param" minOccurs="0" maxOccurs="unbounded" />
<xs:element name="format" type="format" minOccurs="0" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="name" type="xs:string" />
</xs:complexType>
</xs:element>
The intended result is that valid <data> elements may contain 0 or more <param> elements followed by 0 or more <format> elements. Have I added the minOccurs/maxOccurs attributes correctly, or should they be applied to the containing <xs:sequence>?
Correct or not, what would be the result of going one way or the other?
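You have done it right. minOccurs and maxOccurs are also allowed on xs:sequence, but there they apply to the whole group: minOccurs="0" maxOccurs="unbounded" on a sequence whose children keep the default occurrence of 1 would require param/format pairs (param, format, param, format, ...) rather than a run of param elements followed by a run of format elements. Using an XML editor that supports XML Schema might help you validate your assumptions when you are in doubt. Here is a good freeware one called XMLFox.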

xsd: How to extend a type with an unordered list of elements

This is a part of my xml schema
<xs:complexType name="Friend">
<xs:all>
<xs:element name="name" type="xs:string" />
<xs:element name="phone" type="xs:string" />
<xs:element name="address" type="xs:string" />
</xs:all>
</xs:complexType>
<xs:complexType name="Coworker">
<xs:all>
<xs:element name="name" type="xs:string" />
<xs:element name="phone" type="xs:string" />
<xs:element name="office" type="xs:string" />
</xs:all>
</xs:complexType>
For better maintainability, I would like to have the shared attributes in an (abstract) super type or something like that. But more important, I want that all elements are unordered and also optional.
Is this possible, and what is the best way to do it?
You have to limit yourself a little: some of the things you are trying to do are not possible in XML Schema.
Suppose you introduce a complex type called Person to be a super-type of Friend and Coworker. Here are your options:
Replace xs:all with xs:sequence, remove name and phone from the sub-types, add to the super-type, and add inheritance. Your elements now have to be ordered, but you can make them individually optional. It is illegal to use xs:all in type hierarchies in XML Schema, because the processor cannot tell where the parent content model stops and the child content model starts.
Replace xs:all with <xs:choice maxOccurs="unbounded"> in both types, and add your inheritance. Then your elements become unordered again, but they may repeat.
So in conclusion: given your type names up there, I would guess that your requirements will not be exactly met. I would go for the first option: insisting on arbitrary element order is often not as useful as it seems.
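A sketch of that first option in XSD 1.0, using the Person super-type suggested above (only Friend is shown; Coworker would follow the same pattern with office):

```xml
<xs:complexType name="Person" abstract="true">
  <xs:sequence>
    <!-- Shared elements: ordered, but individually optional -->
    <xs:element name="name" type="xs:string" minOccurs="0" />
    <xs:element name="phone" type="xs:string" minOccurs="0" />
  </xs:sequence>
</xs:complexType>
<xs:complexType name="Friend">
  <xs:complexContent>
    <xs:extension base="Person">
      <!-- Extension content is appended after the inherited sequence -->
      <xs:sequence>
        <xs:element name="address" type="xs:string" minOccurs="0" />
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
```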
A year and a half after this question and the accepted answer were posted, XSD 1.1 was published. In this version it is possible to specify what the OP asked for, because a number of restrictions on xs:all were lifted. One of them is that it is now possible to extend an xs:all.
Using XSD 1.1 you can specify the following:
<xs:complexType name="Person" abstract="true">
<xs:all>
<xs:element name="name" type="xs:string" minOccurs="0" />
<xs:element name="phone" type="xs:string" minOccurs="0" />
</xs:all>
</xs:complexType>
<xs:complexType name="Friend">
<xs:complexContent>
<xs:extension base="Person">
<xs:all>
<xs:element name="address" type="xs:string" minOccurs="0" />
</xs:all>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="Coworker">
<xs:complexContent>
<xs:extension base="Person">
<xs:all>
<xs:element name="office" type="xs:string" minOccurs="0" />
</xs:all>
</xs:extension>
</xs:complexContent>
</xs:complexType>
This defines the following types:
Person: an abstract type with optional unordered name and phone elements;
Friend: extends Person adding an optional address element to the list of unordered elements;
Coworker: extends Person adding an optional office element to the list of unordered elements.
Note that this solution does not work for every XML processor: even though 8 years have passed since the publication of XSD 1.1, a lot of processors still only support XSD 1.0.

Schema Issue: Can define element type OR add element attribute, but not both. I want both!

I've inherited the task of creating a schema for some XML which already exists - and IMHO is not the best that could have been done. The section giving me problems is the 'spectrum' element at the end of the 'scan-result' element.
The best I'm hoping for with regard to the data in the 'spectrum' element is to treat it as type="xs:string". I'll programmatically divide up the numeric pairs that constitute the data in the string later. (Even though this step would not be needed had the data been properly structured in the first place.)
Here's a similar piece of XML data to what I have to work with...
<scan-result>
<spectrum-index>0</spectrum-index>
<scan-index>2</scan-index>
<time-stamp>5609</time-stamp>
<tic>55510</tic>
<start-mass>22.0</start-mass>
<stop-mass>71.0</stop-mass>
<spectrum count="5">30,11352;31,360;32,16634;45,1161;46,26003</spectrum>
</scan-result>
The problem is, I can't seem to get a working definition for the 'spectrum' element that has the 'count' attribute and allows me to define the 'spectrum' element type as "xs:string".
What I would like is something like the following:
<xs:complexType name="ctypScanResult">
<xs:sequence>
<xs:element name="spectrum-index" type="xs:integer"/>
<xs:element name="scan-index" type="xs:integer"/>
<xs:element name="time-stamp" type="xs:integer"/>
<xs:element name="tic" type="xs:integer"/>
<xs:element name="start-mass" type="xs:float"/>
<xs:element name="stop-mass" type="xs:float"/>
<xs:element name="spectrum" type="xs:string">
<xs:complexType>
<xs:attribute name="count" type="xs:integer"/>
</xs:complexType>
</xs:element>
</xs:sequence>
<xs:attribute name="count" type="xs:integer"/>
</xs:complexType>
The problem is that I can define the type of the 'spectrum' element as "xs:string" XOR I can define the anonymous 'xs:complexType' in the 'spectrum' element, which allows me to insert the 'count' attribute. But I need to be able to express both.
Given that I'm kind of stuck with the XML as it was handed to me, is there a schema definition that will allow me to describe this data?
Sorry this is long, but thanks to any and all who respond,
AlarmTripper
Followup: I know why the error occurs...
Quoted from W3C:
3.3.3 Constraints on XML Representations of Element Declarations
Schema Representation Constraint: Element Declaration Representation OK
In addition to the conditions imposed on element information items by the schema for schemas: all of the following must be true:
1 default and fixed must not both be present.
2 If the item's parent is not <schema>, then all of the following must be true:
2.1 One of ref or name must be present, but not both.
2.2 If ref is present, then all of <complexType>, <simpleType>, <key>, <keyref>, <unique>, nillable, default, fixed, form, block and type must be absent, i.e. only minOccurs, maxOccurs, id are allowed in addition to ref, along with <annotation>.
3 type and either <simpleType> or <complexType> are mutually exclusive.
4 The corresponding particle and/or element declarations must satisfy the conditions set out in Constraints on Element Declaration Schema Components (§3.3.6) and Constraints on Particle Schema Components (§3.9.6).
But I'm still in the same fix I was before... How can I actually accomplish something that resembles my goal?
Thanks,
AlarmTripper
Let a tool do it for you! Try xsd.exe.
Or, if you must define by hand, at least check your hand-written-definition with an automatically generated one.
Here's what XSD.exe gave me for your input. I trimmed out some MS-NS cruft.
<xs:element name="spectrum">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="count" type="xs:string" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
You need to set the attribute mixed="true" on complexType:
<xs:element name="spectrum">
<xs:complexType mixed="true">
<xs:attribute name="count" type="xs:integer" />
</xs:complexType>
</xs:element>
EDIT: Okay, just read your comment, sorry. I believe the following should work instead:
<xs:element name="spectrum">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="count" type="xs:integer" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
<xs:element name="spectrum">
<xs:complexType>
<!-- ADD THIS NEXT LINE -->
<xs:complexContent mixed="true">
<xs:restriction base="xs:anyType">
<xs:attribute name="count" type="xs:integer"/>
</xs:restriction>
</xs:complexContent>
</xs:complexType>
</xs:element>

Resources