Using XSD in PySpark - azure

I am building a data warehouse in Azure Synapse where one of the sources is a set of about 20 different types of XML files (each with its own XSD schema) plus 1 base schema.
What I am looking for is to get all XML elements and store them in files (1 per type) in my data lake. For that I need to have unique names per element, for example the whole path as a name. I tried to define dicts per type with all element names, but this is quite some work. To automate this (XSDs are updated yearly), I tried to code this out in Excel and VBA, but the XSDs are quite complex with nested complex types etc.
Below is a snippet of the baseschema.xsd:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://www.website.org/typ/1/baseschema/schema" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:iwmo="http://www.website.org/typ/1/baseschema/schema">
<xs:complexType name="Complex_Address">
...
<xs:sequence>
<xs:element name="Home" type="iwmo:Complex_House" minOccurs="0">
...
</xs:element>
<xs:element name="Postalcode" type="iwmo:Simple_Postalcode" minOccurs="0">
...
</xs:element>
<xs:element name="Streetname" type="iwmo:Simple_Streetname" minOccurs="0">
...
</xs:element>
<xs:element name="Areaname" type="iwmo:Simple_Areaname" minOccurs="0">
...
</xs:element>
<xs:element name="CountryCode" type="iwmo:Simple_CountryCode" minOccurs="0">
...
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="Complex_House">
...
<xs:sequence>
<xs:element name="Housenumber" type="iwmo:Simple_Housenumber">
...
</xs:element>
<xs:element name="Houseletter" type="iwmo:Simple_Houseletter" minOccurs="0">
...
</xs:element>
<xs:element name="HousenumberAddition" type="iwmo:Simple_HousenumberAddition" minOccurs="0">
...
</xs:element>
<xs:element name="IndicationAddress" type="iwmo:Simple_IndicationAddress" minOccurs="0">
...
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="Complex_MessageIdentification">
...
<xs:sequence>
<xs:element name="Identification" type="iwmo:Simple_IdentificationMessage">
...
</xs:element>
<xs:element name="Date" type="iwmo:Simple_Date">
...
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="Complex_Product">
...
<xs:sequence>
<xs:element name="Categorie" type="iwmo:Simple_ProductCategory">
...
</xs:element>
<xs:element name="Code" type="iwmo:Simple_ProductCode" minOccurs="0">
...
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="Complex_XsdVersion">
<xs:sequence>
<xs:element name="BaseschemaXsdVersion" type="iwmo:Simple_Version">
</xs:element>
<xs:element name="MessageXsdVersion" type="iwmo:Simple_Version">
</xs:element>
</xs:sequence>
</xs:complexType>
And here is a snippet of the XSD of one of the message types:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:typ="http://www.website.org/typ/1/baseschema/schema" xmlns:type1="http://www.website.org/typ/1/type1/schema" targetNamespace="http://www.website.org/typ/1/type1/schema" elementFormDefault="qualified">
<xs:import namespace="http://www.website.org/typ/1/baseschema/schema" schemaLocation="baseschema.xsd"></xs:import>
<xs:element name="Message" type="type1:Root"></xs:element>
<xs:complexType name="Root">
...
<xs:sequence>
<xs:element name="Header" type="type1:Header"></xs:element>
<xs:element name="Client" type="type1:Client"></xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="Header">
<xs:sequence>
<xs:element name="Person" type="typ:Simple_SpecialCode">
...
</xs:element>
<xs:element name="MessageIdentification" type="typ:Complex_MessageIdentification">
...
</xs:element>
<xs:element name="XsdVersion" type="typ:Complex_XsdVersion">
...
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="Client">
...
<xs:sequence>
<xs:element name="AssignedProducts" type="type1:AssignedProducts"></xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="AssignedProducts">
<xs:sequence>
<xs:element name="AssignedProduct" type="type1:AssignedProduct" maxOccurs="unbounded"></xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="AssignedProduct">
...
<xs:sequence>
<xs:element name="ToewijzingNummer" type="typ:Simple_Nummer">
...
</xs:element>
<xs:element name="Product" type="typ:Complex_Product" minOccurs="0">
...
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:schema>
Then this would be the desired output:
Header_Person
Header_MessageIdentification_Identification
Header_MessageIdentification_Date
Header_XsdVersion_BaseschemaXsdVersion
Header_XsdVersion_MessageXsdVersion
Client_AssignedProduct_ToewijzingNummer
Client_AssignedProduct_Product_Category
Client_AssignedProduct_Product_Code
In the baseschema I also added a nested complex type, to show the complexity.
Is there a Python package that can help me achieve this? A tool that just writes this list of elements to a text file would also be great; I could then easily copy that into a variable.
I'm not sure whether my requirements are clear, or whether this is posted in the correct group with the correct tags, but I hope someone can point me to a good solution.
Ronald

I found a workaround after all, where I put all fields from the XSDs in variables. It's not ideal, but any other way would be too complex.
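For anyone who wants to automate the list instead, here is a minimal sketch using only Python's standard library. It is not the asker's method, just one possible approach under some assumptions: the two embedded schemas are cut-down stand-ins for baseschema.xsd and a message XSD (the `urn:example:...` namespaces are made up), every element has a `name` attribute, types are matched by local name only (so the namespace prefix is ignored), and no complex type refers to itself. The helper names `collect_complex_types` and `element_paths` are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Cut-down stand-ins for the real schemas (namespaces are made up).
BASE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:base" xmlns:b="urn:example:base">
  <xs:complexType name="Complex_Product">
    <xs:sequence>
      <xs:element name="Categorie" type="b:Simple_ProductCategory"/>
      <xs:element name="Code" type="b:Simple_ProductCode" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

MESSAGE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:type1"
    xmlns:typ="urn:example:base" xmlns:type1="urn:example:type1">
  <xs:element name="Message" type="type1:Root"/>
  <xs:complexType name="Root">
    <xs:sequence>
      <xs:element name="Header" type="type1:Header"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="Header">
    <xs:sequence>
      <xs:element name="Person" type="typ:Simple_SpecialCode"/>
      <xs:element name="Product" type="typ:Complex_Product" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def collect_complex_types(*xsd_texts):
    """Map the local name of every top-level complexType to its XML node."""
    types = {}
    for text in xsd_texts:
        root = ET.fromstring(text)
        for node in root.findall(XS + "complexType"):
            types[node.get("name")] = node
    return types


def element_paths(type_node, types, prefix=()):
    """Yield one underscore-joined path per leaf element under type_node."""
    for el in type_node.iter(XS + "element"):
        path = prefix + (el.get("name"),)
        # Assumption: matching on the local name is enough (prefix dropped).
        local_type = (el.get("type") or "").split(":")[-1]
        if local_type in types:
            # Element has a known complex type: descend into it.
            yield from element_paths(types[local_type], types, path)
        else:
            # Simple or unknown type: treat the element as a leaf.
            yield "_".join(path)


types = collect_complex_types(BASE_XSD, MESSAGE_XSD)
paths = list(element_paths(types["Root"], types))
print("\n".join(paths))
# Header_Person
# Header_Product_Categorie
# Header_Product_Code
```

With the real files you would read each XSD from disk and pass the text of both the base schema and the message schema to `collect_complex_types`. Anonymous (inline) complex types, `ref` elements and recursive types would need extra handling.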

Related

XSD Required Elements with specific child elements (Multiple Definitions with different types)

All, I have an XML doc that I don't control, for which I need to create an XSD to validate it. The XML doc has multiple transaction types, some of which are required a specific number of times and some of which aren't. The parent element is simply <Transaction>; the child element can be either a <ControlTransaction> or a <RetailTransaction>. The issue is that I need to require one <Transaction> to exist with a <ControlTransaction> whose <ReasonCode> element has the value "Register Open", and another whose <ReasonCode> has the value "Register Close", as follows:
<?xml version="1.0" encoding="UTF-8"?>
<RegisterDay xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:cp="urn:register">
<Transaction>
<SequenceNumber>1</SequenceNumber>
<ControlTransaction>
<ReasonCode>Register Open</ReasonCode>
</ControlTransaction>
</Transaction>
<Transaction>
<SequenceNumber>2</SequenceNumber>
<RetailTransaction>
...stuff..
<Total>9.99</Total>
</RetailTransaction>
</Transaction>
<Transaction>
<SequenceNumber>3</SequenceNumber>
<ControlTransaction>
<ReasonCode>Register Close</ReasonCode>
</ControlTransaction>
</Transaction>
</RegisterDay>
My best attempt is to use types in my schema, but I get "Elements with the same name and same scope must have the same type". I don't know how to get around this.
<?xml version="1.0"?>
<xs:schema
xmlns:cp="urn:register"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
attributeFormDefault="unqualified"
elementFormDefault="qualified">
<xs:element name="RegisterDay">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" maxOccurs="1" name="Transaction" type="TransactionRegisterOpen_type"/>
<xs:element minOccurs="1" maxOccurs="unbounded" name="Transaction" type="RetailTransaction_type"/>
<xs:element minOccurs="1" maxOccurs="1" name="Transaction" type="TransactionRegisterClose_type"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:simpleType name="RegisterOpen_type">
<xs:restriction base="xs:string">
<xs:pattern value="Register Open"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="RegisterClose_type">
<xs:restriction base="xs:string">
<xs:pattern value="Register Close"/>
</xs:restriction>
</xs:simpleType>
<xs:complexType name="TransactionRegisterOpen_type">
<xs:sequence>
<xs:element name="SequenceNumber" type="xs:unsignedShort"/>
<xs:element name="ControlTransaction">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" name="ReasonCode" type="RegisterOpen_type"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="TransactionRegisterClose_type">
<xs:sequence>
<xs:element name="SequenceNumber" type="xs:unsignedShort"/>
<xs:element name="ControlTransaction">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" name="ReasonCode" type="RegisterClose_type"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="RetailTransaction_type">
<xs:sequence>
<xs:element name="SequenceNumber" type="xs:unsignedShort"/>
<xs:element name="ControlTransaction">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" name="Total" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:schema>
Has anyone run into this and/or have any suggestions? I'm pretty much stumped.
Perhaps with an enumeration?
<?xml version="1.0"?>
<xs:schema
xmlns:cp="urn:register"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
attributeFormDefault="unqualified"
elementFormDefault="qualified"
targetNamespace="urn:register">
<xs:element name="RegisterDay">
<xs:complexType>
<xs:sequence>
<xs:element
minOccurs="1"
maxOccurs="unbounded"
name="Transaction"
type="cp:TypeTransaction"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="TypeTransaction">
<xs:sequence>
<xs:element name="SequenceNumber" type="xs:unsignedShort"/>
<xs:choice>
<xs:element name="RetailTransaction"/>
<xs:element name="ControlTransaction">
<xs:complexType>
<xs:sequence>
<xs:element name="ReasonCode">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="Register Open"/>
<xs:enumeration value="Register Close"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:choice>
</xs:sequence>
</xs:complexType>
</xs:schema>
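Note that the enumeration only restricts which values <ReasonCode> may take; it does not require that both values actually occur in a document. One option is to check that rule in code after schema validation. The following is a minimal sketch with Python's standard library, assuming the elements are not in a namespace (as in the sample document in the question); the function name `missing_reason_codes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the one in the question (no element namespace).
REGISTER_DAY = """<RegisterDay>
  <Transaction>
    <SequenceNumber>1</SequenceNumber>
    <ControlTransaction><ReasonCode>Register Open</ReasonCode></ControlTransaction>
  </Transaction>
  <Transaction>
    <SequenceNumber>2</SequenceNumber>
    <RetailTransaction><Total>9.99</Total></RetailTransaction>
  </Transaction>
  <Transaction>
    <SequenceNumber>3</SequenceNumber>
    <ControlTransaction><ReasonCode>Register Close</ReasonCode></ControlTransaction>
  </Transaction>
</RegisterDay>"""

REQUIRED = ("Register Open", "Register Close")


def missing_reason_codes(xml_text, required=REQUIRED):
    """Return the required reason codes that do not occur in the document."""
    root = ET.fromstring(xml_text)
    found = {
        node.text
        for node in root.findall("Transaction/ControlTransaction/ReasonCode")
    }
    return [code for code in required if code not in found]


missing = missing_reason_codes(REGISTER_DAY)
print(missing)  # [] because both codes are present
```

If the instance documents do use the urn:register namespace, the path passed to `findall` would need namespace-qualified names.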

Difference between group and sequence in XML Schema?

What is the difference between an xs:group and an xs:sequence in XML Schema? When would you use one or the other?
xs:sequence - together with xs:choice and xs:all - is used to define the valid sequences of XML elements in the target XML. E.g. the schema for this XML:
<mainElement>
<firstSubElement/>
<subElementA/>
<subElementB/>
</mainElement>
is something like:
<xs:element name='mainElement'>
<xs:complexType>
<xs:sequence>
<xs:element name="firstSubElement"/>
<xs:element name="subElementA"/>
<xs:element name="subElementB"/>
</xs:sequence>
</xs:complexType>
</xs:element>
xs:group is used to define a named group of XML elements following certain rules that can then be referenced in different parts of the schema. For example if the XML is:
<root>
<mainElementA>
<firstSubElement/>
<subElementA/>
<subElementB/>
</mainElementA>
<mainElementB>
<otherSubElement/>
<subElementA/>
<subElementB/>
</mainElementB>
</root>
you can define a group for the common sub-elements:
<xs:group name="subElements">
<xs:sequence>
<xs:element name="subElementA"/>
<xs:element name="subElementB"/>
</xs:sequence>
</xs:group>
and then use it:
<xs:element name="mainElementA">
<xs:complexType>
<xs:sequence>
<xs:element name="firstSubElement"/>
<xs:group ref="subElements"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="mainElementB">
<xs:complexType>
<xs:sequence>
<xs:element name="otherSubElement"/>
<xs:group ref="subElements"/>
</xs:sequence>
</xs:complexType>
</xs:element>
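To see that a group reference is just a reusable piece of content model, here is a small sketch in Python (standard library only) that expands each xs:group ref in place and lists the resulting child element names. The helper `child_names` is hypothetical and only handles the constructs used in this example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:group name="subElements">
    <xs:sequence>
      <xs:element name="subElementA"/>
      <xs:element name="subElementB"/>
    </xs:sequence>
  </xs:group>
  <xs:element name="mainElementA">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="firstSubElement"/>
        <xs:group ref="subElements"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="mainElementB">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="otherSubElement"/>
        <xs:group ref="subElements"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def child_names(node, groups):
    """List child element names in order, expanding xs:group references."""
    names = []
    for child in node:
        if child.tag == XS + "element":
            names.append(child.get("name"))
        elif child.tag == XS + "group" and child.get("ref"):
            # Replace the reference with the content of the named group.
            names.extend(child_names(groups[child.get("ref")], groups))
        else:
            # complexType, sequence, etc.: keep walking down.
            names.extend(child_names(child, groups))
    return names


root = ET.fromstring(SCHEMA)
groups = {g.get("name"): g for g in root.findall(XS + "group")}
expanded = {
    e.get("name"): child_names(e, groups) for e in root.findall(XS + "element")
}
print(expanded["mainElementA"])  # ['firstSubElement', 'subElementA', 'subElementB']
print(expanded["mainElementB"])  # ['otherSubElement', 'subElementA', 'subElementB']
```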

XSD schema - Either one or both

Is it possible to make a choice scenario like (A or B or both)? If yes, how can this be done with the following elements?
<xs:element name="a" type="typeA" />
<xs:element name="b" type="typeB" />
Hope you can help.
Regards,
Nima
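See the question "XSD 'one or both' choice construct leads to ambiguous content model". A choice between the sequence (a, optionally followed by b) and b on its own accepts a, b, or both: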
<xs:schema xmlns:xs="...">
<xs:element name="a" type="typeA" />
<xs:element name="b" type="typeB" />
<xs:element name="...">
<xs:complexType>
<xs:sequence>
<xs:choice>
<xs:sequence>
<xs:element ref="a"/>
<xs:element ref="b" minOccurs="0"/>
</xs:sequence>
<xs:element ref="b"/>
</xs:choice>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
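As a rough way to reason about which child sequences that content model accepts, it can be mirrored with a regular expression. This is only an illustration in Python, not a schema validator; the names `CONTENT_MODEL` and `is_valid` are made up for the example.

```python
import re

# Mirrors the content model: choice( sequence(a, b?), b )
CONTENT_MODEL = re.compile(r"a( b)?|b")


def is_valid(children):
    """True if the list of child element names matches the content model."""
    return CONTENT_MODEL.fullmatch(" ".join(children)) is not None


results = {
    tuple(case): is_valid(case)
    for case in (["a"], ["b"], ["a", "b"], ["b", "a"], [])
}
for case, ok in results.items():
    print(case, ok)
```

Only a, b, and a followed by b are accepted. Because the outer construct is a sequence-based model, b followed by a is rejected, and so is an empty element.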

JAXB customize bindings - skip generated classes from schema

I have the following schema:
<xs:element name="Invoice">
<xs:complexType>
<xs:sequence>
.....
<xs:element name="InvoiceLines" type="InvoiceLinesType">
</xs:element>
.....
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="InvoiceLinesType">
<xs:sequence>
<xs:element maxOccurs="unbounded" name="InvoiceLine" type="InvoiceLineType">
</xs:element>
</xs:sequence>
</xs:complexType>
<xs:complexType name="InvoiceLineType">
<xs:sequence>
.....
</xs:sequence>
</xs:complexType>
The problem is that it generates these classes:
Invoice - which contains a member of type InvoiceLinesType
InvoiceLinesType - which contains a collection of InvoiceLineType
InvoiceLineType
So there is one unnecessary class (InvoiceLinesType), and I would prefer the following:
Invoice - which contains a collection of InvoiceLineType
InvoiceLineType
Does anyone know how to tell the compiler not to generate this class (InvoiceLinesType)?
My current external binding file is here:
<jxb:bindings schemaLocation="invoice.xsd" node="/xs:schema">
<jxb:globalBindings>
.....
<xjc:simple/>
.....
</jxb:globalBindings>
</jxb:bindings>
Thank you for your response.
You would have to modify your schema - drop InvoiceLinesType and have InvoiceLineType as unbounded element in Invoice.
<xs:element name="Invoice">
<xs:complexType>
<xs:sequence>
.....
<xs:element maxOccurs="unbounded" name="InvoiceLine" type="InvoiceLineType">
</xs:element>
.....
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="InvoiceLineType">
<xs:sequence>
.....
</xs:sequence>
</xs:complexType>

Problem with xml schema elements hierarchy

What's wrong with this XML schema? It doesn't parse correctly, and I can't express the hierarchy cluster (element) -> host (element) -> Load (element).
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="cluster">
<xs:complexType>
<xs:sequence>
<xs:element ref="host"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="host">
<xs:complexType>
<xs:element ref="Load"/>
<xs:attribute name="name" type="xs:string" use="required"/>
</xs:complexType>
</xs:element>
<xs:element name="Load">
<xs:complexType>
<xs:attribute name="usedPhisicalMemory" type="xs:integer"/>
</xs:complexType>
</xs:element>
</xs:schema>
Thank you, Emilio
To allow something like this (I corrected the typo in "usedPhysicalMemory"):
<cluster>
<host name="foo">
<Load usedPhysicalMemory="500" />
</host>
<host name="bar">
<Load usedPhysicalMemory="500" />
</host>
</cluster>
This schema would do it:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="cluster">
<xs:complexType>
<xs:sequence>
<xs:element ref="host" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="host">
<xs:complexType>
<xs:sequence>
<xs:element ref="Load" />
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
<xs:element name="Load">
<xs:complexType>
<xs:attribute name="usedPhysicalMemory" type="xs:integer" />
</xs:complexType>
</xs:element>
</xs:schema>
From the MSDN on <xs:complexType> (because the spec makes my brain hurt):
If group, sequence, choice, or all is specified, the elements must
appear in the following order:
group | sequence | choice | all
attribute | attributeGroup
anyAttribute
Maybe someone else can point out the relevant section in the spec.
In the host element, the Load element cannot be a direct child of complexType; you must have a sequence (or choice, all, or group) in between.
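If you want to spot this mistake mechanically, a small check can list every xs:element that sits directly under an xs:complexType. This is only a sketch with Python's standard library, run here on a shortened copy of the broken schema; the function name `misplaced_elements` is hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Shortened copy of the broken schema from the question.
BROKEN = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="host">
    <xs:complexType>
      <xs:element ref="Load"/>
      <xs:attribute name="name" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="Load">
    <xs:complexType>
      <xs:attribute name="usedPhysicalMemory" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def misplaced_elements(xsd_text):
    """Return name/ref of each xs:element placed directly in a complexType."""
    root = ET.fromstring(xsd_text)
    problems = []
    for complex_type in root.iter(XS + "complexType"):
        for child in complex_type:
            if child.tag == XS + "element":
                problems.append(child.get("name") or child.get("ref"))
    return problems


problems = misplaced_elements(BROKEN)
print(problems)  # ['Load']
```

Wrapping the reported element in an xs:sequence, as in the corrected schema above, makes the list come back empty.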
