SGML DTD - how to define a section as CDATA or RCDATA? - xsd

I was told in this post that an SGML DTD could be the solution to my issue.
I have the XSD below. How do I convert this to an SGML DTD to have the "RawPayload" element tagged as CDATA in spawned blank/empty XML files?
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="TestCase">
<xs:complexType>
<xs:sequence>
<xs:element name="TestSuiteVersion" type="xs:int" minOccurs="1" />
<xs:element name="TestName" type="xs:string" minOccurs="1" />
<xs:element name="TestEnabled" type="xs:boolean" minOccurs="1" />
<xs:element name="TestURL" type="xs:anyURI" minOccurs="1" />
<xs:element name="RawPayload" type="xs:string" minOccurs="1" />
<xs:element name="ParsedOutput" type="xs:dateTime" minOccurs="1" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>

The following SGML markup declarations tell an SGML parser to treat RawPayload content as unparsed character data (CDATA declared content), so that the < and & characters, normally interpreted as the markup delimiter and the entity-reference open character respectively, can appear verbatim in content:
<!ELEMENT TestCase - -
(TestSuiteVersion,TestName,TestEnabled,
TestURL,RawPayload,ParsedOutput)>
<!ELEMENT TestSuiteVersion - - (#PCDATA)>
<!ELEMENT TestName - - (#PCDATA)>
<!ELEMENT TestEnabled - - (#PCDATA)>
<!ELEMENT TestURL - - (#PCDATA)>
<!ELEMENT RawPayload - - CDATA>
<!ELEMENT ParsedOutput - - (#PCDATA)>
However, since the context of your original question is to tunnel HTML or other markup specifically, rather than generic text content, through elements with declared content CDATA, it's worth noting that this won't work as expected: by the SGML spec (ISO 8879:1986), unparsed character data is terminated by any character sequence </X where X is a character that is valid as an (element) name start character. Thus, if you attempt to include any angle-bracket markup as content, an SGML parser will stop unparsed character data parsing mode on what looks like the first occurring end-element tag (and will immediately fail with our example DTD since end-element tag omission is not allowed for RawPayload).
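To make the termination rule concrete, here is a rough Java sketch (not an SGML parser) that locates where CDATA declared content would stop under the rule just described. It assumes that the name start characters are the ASCII letters, as in the reference concrete syntax; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: mimics the rule that "</" followed by a name start
// character ends CDATA declared content. Assumption: name start characters
// are the ASCII letters (reference concrete syntax).
class CdataEndFinder {
    private static final Pattern END_TAG_OPEN = Pattern.compile("</[A-Za-z]");

    // Returns the index at which CDATA content would stop, or -1 if the
    // text contains no end-tag-like sequence.
    static int cdataEnd(String content) {
        Matcher m = END_TAG_OPEN.matcher(content);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String payload = "<h2>Title</h2><p>text</p>";
        int end = cdataEnd(payload);
        // Only the part before the first "</h2" would be taken as character data.
        System.out.println(payload.substring(0, end));
    }
}
```

Any payload containing an HTML end tag is cut short at that tag, which is why CDATA declared content is a poor fit for embedded markup.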
Rather, in SGML, you can include regular HTML markup without any use of CDATA elements or CDATA marked sections by importing the parsing rules for HTML as an SGML DTD grammar. The following example shows a self-contained SGML document declaring your TestCase vocabulary that also imports (my) markup declarations for HTML:
<!DOCTYPE TestCase SYSTEM "http://sgmljs.net/schemas/sgml-cms/w3c/html5.dtd" [
<!ELEMENT TestCase - - (TestSuiteVersion,TestName,TestEnabled,TestURL,RawPayload,ParsedOutput)>
<!ELEMENT TestSuiteVersion - - (#PCDATA)>
<!ELEMENT TestName - - (#PCDATA)>
<!ELEMENT TestEnabled - - (#PCDATA)>
<!ELEMENT TestURL - - (#PCDATA)>
<!ELEMENT RawPayload - - ANY -(TestSuiteVersion|TestName|TestEnabled|TestURL|RawPayload|ParsedOutput)>
<!ELEMENT ParsedOutput - - (#PCDATA)>
<!ENTITY % no_entities "INCLUDE">
]>
<TestCase>
<TestSuiteVersion>1</TestSuiteVersion>
<TestName>Test1</TestName>
<TestEnabled>true</TestEnabled>
<TestURL>http://example.com</TestURL>
<RawPayload>
<h2>Description of whatever is supposed to happen</h2>
<p>Bla Blah bla</p>
</RawPayload>
<ParsedOutput>2021-12-20T19:32:52Z</ParsedOutput>
</TestCase>
By declaring RawPayload as having declared content ANY, this DTD admits any HTML 5 elements declared in html5.dtd. I've also specified the element exclusion
-(TestSuiteVersion|TestName|TestEnabled
|TestURL|RawPayload|ParsedOutput)
telling SGML that those elements must not occur anywhere within RawPayload content.
Depending on your app, it would generally be advisable to avoid handling HTML as black box CDATA content, thereby becoming prone to HTML injection attacks. Rather, if you eventually intend to display user content in a browser, you should scan/filter it for malicious content. Similar to what's shown here, you'd need to at least exclude script elements but also HTML event handler attributes containing script (or set CSP accordingly for your web app).
You can run this example document as-is using (my) sgmljs software (http://sgmljs.net), e.g. the sgmlproc command line utility. When run with OpenSP, you'd also need to provide an SGML declaration for HTML.

Related

Using XJC to compile an XSD with multiple schemas

I have an XSD of the format:
<?xml version="1.0" encoding="utf-16"?>
<root>
<xs:schema --->
..
..
</xs:schema>
<xs:schema -->
..
..
</xs:schema -->
<xs:schema -->
..
..
</xs:schema -->
</root>
When compiled using the XJC compiler, it gives an error at line 1: "Content is not allowed in prolog".
If I change the encoding to "ISO-8859-1", it gives the following error:
[ERROR] Unexpected <root> appears at line 2 column 10
line 2 of ****.xsd Failed to parse a schema.
If I remove the "root" tag, from the XSD, it starts giving the following error:
[ERROR] The markup in the document following the root element must be well-formed.
line 44 of file:****.xsd
Failed to parse a schema.
My question is whether we can use XJC to compile an XSD with more than one schema tag. I had tried this with the following file format:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="shiporder">
<xs:complexType>
<xs:sequence>
<xs:element name="abc" type="xs:string"/>
<xs:element name="cdf">
</xs:element>
</xs:sequence>
<xs:attribute name="orderid" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
</xs:schema>
It worked perfectly well for the above, creating classes appropriately.
Does it have something to do with the namespace declaration?
In principle, the XSD spec allows multiple xs:schema elements to be included in the same XML document, so what you are trying to do is not unreasonable. In practice, a lot of XSD software (perhaps most XSD software) is not prepared for schema documents in which the xs:schema element is not the outermost element in the XML document, and even when software does support other cases, different programs don't always agree on how to behave.
See this Stack Overflow question for further discussion, including a passionate argument from a misinformed party that there is no XSD software at all that supports input of the kind you describe.
With XJC, your best option appears to be to put each xs:schema element in a separate XML document and then either (a) use a single driver file to import or include each of them in turn, or (b) put them all in the same directory and hand XJC the name of the directory; it will scan the directory for schema files and compile them. You may also be able to do something with the -wsdl option.

XSD - Enforce that a sub-tag is present?

This is a pretty simple question, but my Google skills haven't gotten me the answer yet, so:
Can an XSD enforce that an element MUST BE PRESENT within a higher level element? I know that you can allow or disallow "explicit setting to nil" but that doesn't sound like the same thing.
For example:
<parentTag>
<childTag1>
... stuff
</childTag1>
<childTag2> <!-- FAIL VALIDATION IF a childTag2 isn't in parentTag!!! -->
... stuff
</childTag2>
</parentTag>
If so, what is the syntax?
Elements in an XSD are required to be present by default. If unspecified, the child element's minOccurs property is set to 1.
That is, you must explicitly make an element optional by setting minOccurs="0".
Example schema
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="parentTag">
<xs:complexType>
<xs:sequence>
<xs:element name="childTag1" minOccurs="0"/> <!-- This element is optional -->
<xs:element name="childTag2"/> <!-- This element is required -->
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Testing valid XML (using xmllint)
<?xml version="1.0"?>
<parentTag>
<childTag1>
<!-- ... stuff -->
</childTag1>
<childTag2>
<!-- ... stuff -->
</childTag2>
</parentTag>
testfile.xml validates
Testing invalid XML
<?xml version="1.0"?>
<parentTag>
<childTag1>
<!-- ... stuff -->
</childTag1>
</parentTag>
testfile.xml:2: element parentTag: Schemas validity error : Element 'parentTag': Missing child element(s). Expected is ( childTag2 ).
testfile.xml fails to validate
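The same check can be run from Java with the JDK's built-in validator instead of xmllint. This is only a sketch: the schema above is inlined as a string, and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class RequiredChildCheck {
    // The example schema from above: childTag1 optional, childTag2 required.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='parentTag'><xs:complexType><xs:sequence>"
      + "<xs:element name='childTag1' minOccurs='0'/>"
      + "<xs:element name='childTag2'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the document validates against XSD. Note that a schema
    // which itself fails to compile would also yield false.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation (or schema compilation) error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Both children present.
        System.out.println(isValid("<parentTag><childTag1/><childTag2/></parentTag>"));
        // Required childTag2 missing.
        System.out.println(isValid("<parentTag><childTag1/></parentTag>"));
    }
}
```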

Unmarshalling based on Concrete Instance

I am a newcomer to the JAXB world and I am facing a problem with unmarshalling stored XML content into a Java class object. The problem description is as follows. Let me know if this is solvable.
I have an XSD file which contains the following content (this is just an example):
Student info
<xs:complexType name="specialization" abstract="true">
</xs:complexType>
<xs:complexType name="Engineering">
<xs:complexContent>
<xs:extension base="specialization">
<xs:sequence>
<xs:element name="percentage" type="xs:int" minOccurs="0"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="Medical">
<xs:complexContent>
<xs:extension base="specialization">
<xs:sequence>
<xs:element name="grade" type="xs:string" minOccurs="0"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Now all the corresponding Java classes are generated by compiling the XSD. Let's assume that in my application I set the specialization attribute of Student info by constructing an Engineering class instance. After all the operations, when I save, the XML file that gets saved will have an entry like the one below:
<Student>
<Name>Name1</Name>
<Specialization>
<percentage>78</percentage>
</Specialization>
</Student>
Now when the above content goes for unmarshalling, unmarshalling fails, reporting an unexpected element. I guess this is because the Specialization element is of type specialization, so unmarshalling is invoked on that type itself rather than on the derived object that was stored.
I hope my explanation is clear. Is there any way to unmarshal based on the derived class instance type? The XSD and the bindings.xjb file are completely under my control, so I can add or modify any entries that tell the unmarshaller to use the derived class.
Thanks for your suggestion, but it is still not working for me.
Here is what I tried
Option #1 - xsi:type
My XSD looks the same as what is explained in the example, but xsi:type still doesn't appear in the resulting XML. Do I need to add any other setting while compiling? Which JAXB version should I use for this?
Option#2 - Substitution Groups
When I added the substitution group entries to my XSD, XSD compilation failed, reporting duplicate names "Engineering" and "Medical". I guess the compiler complains because the element names and type names are the same (Engineering, Medical and specialization are each used both as a type name and as an element name).
I can't modify the generated classes as we are using a model-driven architecture. The only thing I can change is the XSD; any modification to it is allowed. Ideally the first option should have worked, but I can't figure out why it doesn't. Let me know if you have any suggestion to narrow down the problem.
There are different ways of representing Java inheritance in XML when using JAXB:
Option #1 - xsi:type
In this representation an attribute is used to indicate the subtype being used to populate this element.
<Student>
<Name>Name1</Name>
<Specialization xsi:type="Engineering">
<percentage>78</percentage>
</Specialization>
</Student>
For a detailed example see:
http://blog.bdoughan.com/2010/11/jaxb-and-inheritance-using-xsitype.html
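Independently of JAXB, the xsi:type representation from Option #1 can be checked against the schema with the JDK's validator. The sketch below is not JAXB code; it only shows that an element of the abstract type is schema-valid once xsi:type names a concrete subtype. The Student element declaration is an assumption, since the question does not show it, and the class name is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class XsiTypeCheck {
    // Types taken from the question; the Student element is assumed.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:complexType name='specialization' abstract='true'/>"
      + "<xs:complexType name='Engineering'><xs:complexContent>"
      + "<xs:extension base='specialization'><xs:sequence>"
      + "<xs:element name='percentage' type='xs:int' minOccurs='0'/>"
      + "</xs:sequence></xs:extension></xs:complexContent></xs:complexType>"
      + "<xs:element name='Student'><xs:complexType><xs:sequence>"
      + "<xs:element name='Name' type='xs:string'/>"
      + "<xs:element name='Specialization' type='specialization'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    static final String WITH_XSI_TYPE =
        "<Student xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'>"
      + "<Name>Name1</Name>"
      + "<Specialization xsi:type='Engineering'><percentage>78</percentage></Specialization>"
      + "</Student>";

    static final String WITHOUT_XSI_TYPE =
        "<Student><Name>Name1</Name>"
      + "<Specialization><percentage>78</percentage></Specialization>"
      + "</Student>";

    // Returns true if the document validates against XSD.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation (or schema compilation) error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("with xsi:type:    " + isValid(WITH_XSI_TYPE));
        System.out.println("without xsi:type: " + isValid(WITHOUT_XSI_TYPE));
    }
}
```

Without xsi:type the validator only knows the abstract base type, which mirrors the "unexpected element" failure described in the question.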
Option #2 - Substitution Groups
Here an element name is used to indicate the subtype. This corresponds to the schema concept of substitution groups and leverages JAXB's @XmlElementRef annotation:
<Student>
<Name>Name1</Name>
<Engineering>
<percentage>78</percentage>
</Engineering>
</Student>
For a detailed example see:
http://blog.bdoughan.com/2010/11/jaxb-and-inheritance-using-substitution.html

Creating a valid XSD that is open using <all> and <any> elements

I need to specify an XSD for validating XML documents. The XSD will be used to generate Java bindings with JAXB.
My problem is specifying optional elements which I do not know the names of and which I in general am not interested in parsing.
The structure of the XML documents is like:
<TRADE>
<TIME>12:12</TIME>
<MJELLO>12345</MJELLO>
<OPTIONAL>12:12</OPTIONAL>
<DATE>25-10-2011</DATE>
<HELLO>hello should be ignored</HELLO>
</TRADE>
The important thing is, that:
I cannot assume any order, and the next XML document instance might have tags in a different order
I am only interested in parsing some of the tags, some are mandatory and some are optional
The XML documents can be extended with new elements which I am not interested in parsing
The structure of my XSD is like (not a valid xsd):
<?xml version="1.0" encoding="ISO-8859-1"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<!-- *********************************************** -->
<!-- Trade element definitions for the XML Documents -->
<!-- *********************************************** -->
<xs:complexType name="Trade">
<!-- Using the all construction ensures that the order does not matter -->
<xs:all>
<xs:element name="DATE" type="xs:string" minOccurs="1" maxOccurs="1" />
<xs:element name="TIME" type="xs:string" minOccurs="1" maxOccurs="1" />
<xs:element name="OPTIONAL" type="xs:string" minOccurs="0" maxOccurs="1" />
<xs:any minOccurs="0"/>
</xs:all>
</xs:complexType>
<!-- TRADE is the mandatory top-level tag -->
<xs:element name="TRADE" type="Trade"/>
</xs:schema>
So, in this example: DATE and TIME are mandatory (they must be in the XML exactly once), OPTIONAL might be present once and then I would like to specify, that all other tags are allowed. The order does not matter.
How do I specify a valid XSD for this?
This is a classic parser problem.
Basically, your BNF is:
Trade = whatever whatever*
whatever = "DATE" | "TIME" | anything
anything = a-z a-z*
But this is ambiguous. The string "DATE" can be accepted under the whatever rule both as "DATE" and as anything.
So if you have
<TRADE>
<TIME>12:12</TIME>
<DATE>25-10-2011</DATE>
<DATE>25-12-2011</DATE>
</TRADE>
it is unclear whether that should be accepted or not.
It could be interpreted as any one of
"TIME", "DATE", anything
anything, anything, "DATE"
anything, anything, anything
"TIME", "DATE", "DATE"
etc.
It all boils down to this: if you combine a wildcard with arbitrary ordering, you cannot meaningfully decide which token matches which rule.
It especially does not make sense to have optional elements together with a wildcard.
You have two options:
use xs:sequence instead of xs:all
do not use wildcard
As I understand it, both options are in conflict with your wishes.
Perhaps you can construct a wildcard that matches everything except DATE, TIME etc.
Is it a hard requirement to have JAXB bindings to your "known" elements?
If not, you can basically have just <any maxOccurs="unbounded" processContents="skip"/> as your XSD, and then pick out the elements you are interested in from the DOM tree.
(See here how to use JAXB without data binding.)
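As a rough illustration of the "pick out the elements from the DOM tree" approach, the following sketch uses only the JDK's DOM parser, with no schema and no JAXB binding. The class and method names are invented for the example, and the sample document is the one from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TradeReader {
    static final String SAMPLE =
        "<TRADE>"
      + "<TIME>12:12</TIME>"
      + "<MJELLO>12345</MJELLO>"
      + "<OPTIONAL>12:12</OPTIONAL>"
      + "<DATE>25-10-2011</DATE>"
      + "<HELLO>hello should be ignored</HELLO>"
      + "</TRADE>";

    // Returns the text of the first direct child of the root element with the
    // given name, or null if there is none. Unknown children are skipped, and
    // the order of children does not matter.
    static String childText(String xml, String name) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getDocumentElement().getFirstChild();
                 n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE
                        && n.getNodeName().equals(name)) {
                    return n.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Mandatory elements: the caller decides what a missing value means.
        System.out.println("DATE = " + childText(SAMPLE, "DATE"));
        System.out.println("TIME = " + childText(SAMPLE, "TIME"));
        // Optional element: null simply means "not present".
        System.out.println("OPTIONAL = " + childText(SAMPLE, "OPTIONAL"));
    }
}
```

With this approach the "mandatory" checks move from the schema into application code, which is the price of accepting arbitrary extra elements in any order.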

XSD: difference between Element and Attribute

I'm new to XSD, and I'm quite confused as to when to use attribute, and when to use element?
Why can't we specify minOccurs and maxOccurs in attribute?
Also, why is it we cannot specify use="required" in element?
An element is an XML element (an opening tag, some content, a closing tag). Elements are the building blocks of your XML document:
<test>someValue</test>
Here, "test" would be an element.
An attribute is additional information on a tag: an "add-on" to an element that can never exist alone:
<test id="5">somevalue</test>
"id" is an attribute.
You cannot have multiple attributes of the same name on a single tag, so minOccurs/maxOccurs make no sense there. You can define whether an attribute is required or not; anything else doesn't make sense.
Elements are defined by their occurrence inside complex types. For example, if you have a complex type with an <xs:sequence> inside, you are defining that all elements must be present and must be in this particular order:
<xs:complexType name="SomeType">
<xs:sequence>
<xs:element name="Element1" type="xs:string" />
<xs:element name="Element2" type="xs:string" />
</xs:sequence>
</xs:complexType>
Inside an element of that type, the sub-elements "Element1" and "Element2" are required and must appear in this order, so there's no need for "required" or not (as with attributes). Whether or not an element is required is defined by the use of minOccurs and maxOccurs; both are 1 by default, i.e. the element must occur, and can only occur once. By tweaking those settings, you can define an element to be optional (minOccurs="0"), or allow it to show up several times (maxOccurs greater than 1).
I'd strongly recommend you check out the W3Schools Tutorial on XML Schema and learn some more about XML schema.
Marc
Example: XSD Format
<xs:complexType name="contactInformation">
<xs:all>
<xs:element name="firstName" type="xs:string" minOccurs="0"/>
<xs:element name="workCountryId" type="xs:long" minOccurs="0"/>
</xs:all>
<xs:attribute name="id" type="xs:long"/>
</xs:complexType>
XML Format
<contactInformation id="100">
<firstName>VELU</firstName>
<workCountryId>120</workCountryId>
</contactInformation>
An attribute is optional by default. To specify that the attribute is required, use the use attribute:
e.g. <xs:attribute name="id" type="xs:long" use="required"/>
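To see the effect of use="required", here is a small sketch that validates documents against the contactInformation type with the JDK's validator. The global element declaration is an assumption (the example above only defines the type), and the class name is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class RequiredAttributeCheck {
    // The type from the example, with the id attribute made required.
    // Assumption: a global element named contactInformation uses that type.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:complexType name='contactInformation'><xs:all>"
      + "<xs:element name='firstName' type='xs:string' minOccurs='0'/>"
      + "<xs:element name='workCountryId' type='xs:long' minOccurs='0'/>"
      + "</xs:all>"
      + "<xs:attribute name='id' type='xs:long' use='required'/>"
      + "</xs:complexType>"
      + "<xs:element name='contactInformation' type='contactInformation'/>"
      + "</xs:schema>";

    // Returns true if the document validates against XSD.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation (or schema compilation) error
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Attribute present; child order is free because of xs:all.
        System.out.println(isValid(
            "<contactInformation id='100'><workCountryId>120</workCountryId>"
          + "<firstName>VELU</firstName></contactInformation>"));
        // Attribute missing.
        System.out.println(isValid(
            "<contactInformation><firstName>VELU</firstName></contactInformation>"));
    }
}
```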
More about attributes and elements.
A complexType element is an XML element that contains other elements and/or attributes.
The all element specifies that the child elements can appear in any order and that each child element can occur zero or one time.
maxOccurs Optional. Specifies the maximum number of times the element can occur. The value must be 1.
minOccurs Optional. Specifies the minimum number of times the element can occur. The value can be 0 or 1. Default value is 1
An element is an XML node - and it can contain other nodes, or attributes. It can be a simple type or a complex type. It is an XML entity.
An attribute is a descriptor. It can't contain anything and can only be a simple type.
Have a look at this. Of course, you can just google something like "XML element vs attribute"
<element myAttribute="value">
<subElement />
<subElement anotherAttribute="this is an attribute's value">Element value</subElement>
</element>
You can't have more than one attribute with the same name in XML, therefore you can't use minOccurs and maxOccurs for attributes.
You don't need use="required" for elements because you can have minOccurs="1" instead.
It is your choice when to use attributes and when to use elements. Here are some guidelines: http://www.ibm.com/developerworks/xml/library/x-eleatt.html

Resources