JAXB messing up encoding in Mule flow

I'm running a flow in Mule CE and have huge problems with encodings. No matter what I do, my files end up with messed-up non-English characters.
Before the jaxb-object-to-xml transformer my payload looks nice in the console and in the debugger, but after that it's all messed up.
......
<http:request>
<object-to-byte-array-transformer encoding="UTF-8" doc:name="Object to Byte Array"/>
<object-to-string-transformer doc:name="String" encoding="UTF-8"/>
<json:json-to-object-transformer returnClass="java.util.List" doc:name="JSON2ObjectList" encoding="UTF-8"/>
<collection-splitter doc:name="Collection Splitter"/>
<choice doc:name="Choice">
<when expression="....">
<custom-transformer returnClass="se.system.Order.SalesHeader" class="se.system.Transformer.Map2Order" doc:name="Map2Order" mimeType="application/xml" encoding="UTF-8"/>
<mulexml:jaxb-object-to-xml-transformer name="orderMarshaller" jaxbContext-ref="JAXB_Context" doc:name="orderMarshaller" mimeType="text/xml" encoding="UTF-8"/>
<object-to-string-transformer doc:name="XML2String" encoding="UTF-8"/>
<set-variable variableName="fileName" value="order-#[function:dateStamp].xml" doc:name="fileName" encoding="UTF-8"/>
<file:outbound-endpoint path="${file.ToOrder}" responseTimeout="10000" doc:name="File" outputPattern="#[fileName]" mimeType="text/xml" encoding="UTF-8"/>
After the jaxb transformer, non-English characters look like:
Deliveryinfo2="å ä ö Å Ä Ö & % è É"/
And the 010 editor claims it's ANSI DOS (with messed-up characters; I don't know if that one is to be trusted, though).
Have I missed something in the jaxb transformer? or somewhere else?
Is it possible to replace it with a Java component, initiate my very own JAXB context, get a marshaller and handle it myself?
No clues anymore...
Regards
EDIT: this one can handle non-english characters
<mulexml:object-to-xml-transformer doc:name="Object to XML" encoding="UTF-8" />
but it cannot handle GregorianCalendar types or my main object's List of other objects, so it's not an alternative.

This seems to be a bug caused by the JAXB transformer not respecting the given encoding, see source (line 64).
It is odd, though, because according to the JAXB documentation the default encoding should be UTF-8:
Encoding
By default, the Marshaller will use UTF-8 encoding when generating XML data to a java.io.OutputStream, or a java.io.Writer. Use the setProperty API to change the output encoding used during these marshal operations. Client applications are expected to supply a valid character encoding name as defined in the W3C XML 1.0 Recommendation and supported by your Java Platform.
The transformer should probably set the encoding on the marshaller, something like this:
final Marshaller m = jaxbContext.createMarshaller();
m.setProperty(Marshaller.JAXB_ENCODING, encoding);

Related

JAXB chokes on UTF-16 XML

My project parses XML from various sources using JAXB. This works for most sources, but I am having trouble parsing documents from one particular source. The only difference I have been able to find is that the offending document reports its encoding to be UTF-16, whereas others seem to be in UTF-8 as far as I can tell.
Here is the code:
InputStream inputStream = new FileInputStream(inputFile);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(inputStream);
This throws the following exception:
[Fatal Error] :1:40: Content is not allowed in prolog.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 40; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at ... (my code)
The offending document starts with
<?xml version="1.0" encoding="UTF-16"?>
followed directly by the opening tag for the root element. I examined the file with a hex editor; there are no other characters (not even BOMs or any nonprinting characters) before the opening tag.
If I change the encoding attribute to UTF-8, the code runs past that point (though it throws an unrelated exception further down the line).
Is JAXB incompatible with UTF-16? Or what else is the problem?
Running xmlstarlet fo on the document produced the following error:
/path/to/document.xml:1.38: Document labelled UTF-16 but has UTF-8 content
In conclusion, org.xml.sax.SAXParseException with the error message Content is not allowed in prolog is fairly unspecific and in some cases misleading.
While usually it indicates that illegal extra characters (including nonprinting ones) were encountered before the root element, it may also indicate something completely different – such as the XML prolog specifying an encoding which does not match the actual encoding.
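If the file cannot be fixed at the source, one possible workaround is to decode the bytes yourself and hand the parser a character stream. The following is a minimal sketch, assuming the content really is UTF-8 as xmlstarlet reported; the class and method names are made up for illustration. With a Reader as input, the JDK's bundled parser does not use the encoding named in the XML declaration to decode the data.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class MislabelledEncodingDemo {

    // Hypothetical helper: parses XML bytes using a charset chosen by the caller.
    // The parser receives characters instead of bytes, so the encoding named in
    // the XML declaration is not used to decode the input.
    public static Document parseWithCharset(byte[] xmlBytes, Charset actualCharset) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), actualCharset)) {
            return db.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // A document labelled UTF-16 whose bytes are actually UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root>\u00e5\u00e4\u00f6</root>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        Document doc = parseWithCharset(bytes, StandardCharsets.UTF_8);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

This only hides the mismatch on the reading side; the producer of the document still labels it incorrectly.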

base 64 Decode XML values using Groovy script

I will be receiving the following XML data in a variable.
<order>
<name>xyz</name>
<city>abc</city>
<string>aGVsbG8gd29ybGQgMQ==</string>
<string>aGVsbG8gd29ybGQgMg==</string>
<string>aGVsbG8gd29ybGQgMw==</string>
</order>
Output:
<order>
<name>xyz</name>
<city>abc</city>
<string>hello world 1</string>
<string>hello world 2</string>
<string>hello world 3</string>
</order>
I know how I can decode from base64 but the problem is some of the values are decoded already and some are encoded. What is the best approach to decode this data using groovy so that I get the output as shown?
Always: the values of the string tags will be encoded. All other tags and values will already be decoded.
Since there's no uncertainty about which nodes arrive encoded and which do not, there is no need to detect base64 encoding, and the way to do it is pretty simple:
Parse it. There are two preferred ways to do that in Groovy: XmlSlurper and XmlParser. They differ in how they compute and how much memory they consume, but both give you an object/structure representation in the end.
Work with that object structure: traverse all required elements and decode the content/attributes you need to decode.
Then either carry on working with the data in that form, or serialize it back to XML text, or both.
Articles to look at:
Load, modify, and write an XML document in Groovy
https://www.baeldung.com/groovy-xml
https://groovy-lang.org/processing-xml.html
and many, many more.
Another cheat sheet always useful for Groovy noobs: http://groovy-lang.org/groovy-dev-kit.html
Check out how to traverse the structures there, for instance.
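The parse, decode and serialize steps described above can be sketched as follows. This is an illustration only: it is written in Java against the JDK's DOM API (which can also be called from Groovy) rather than with XmlSlurper or XmlParser, the class and method names are invented, and it assumes the decoded bytes are UTF-8 text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DecodeStringElements {

    // Hypothetical helper: decodes the text of every <string> element and
    // returns the document serialized back to XML text.
    public static String decodeStringElements(String xml) throws Exception {
        // Step 1: parse the XML text into an object structure.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Step 2: traverse the elements known to be encoded and decode them.
        NodeList nodes = doc.getElementsByTagName("string");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            byte[] decoded = Base64.getDecoder().decode(el.getTextContent().trim());
            // Assumption: the decoded bytes are UTF-8 text.
            el.setTextContent(new String(decoded, StandardCharsets.UTF_8));
        }

        // Step 3: serialize the modified structure back to XML text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><name>xyz</name><city>abc</city>"
                + "<string>aGVsbG8gd29ybGQgMQ==</string>"
                + "<string>aGVsbG8gd29ybGQgMg==</string></order>";
        System.out.println(decodeStringElements(xml));
    }
}
```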

XML encoding of Attribute in KMIP

I'm analyzing KMIP in order to implement a prototype in Scala. I'm trying to understand all the concepts so that I can design an architecture that supports different encoding profiles (bytes, JSON, XML).
In section 5.4.1.6 (XML Element Encoding), the specification stipulates:
[...] structure values are encoded as nested xml elements, and non-structure
values are encoded using the ‘value’ attribute
With this example :
<ActivationDate type="DateTime" value="2001-01-01T10:00:00+10:00"/>
I don't understand this syntax, since Activation Date is an attribute. Section 2.1.1 (Attribute) describes an attribute as a structure containing Attribute Name, Attribute Index and Attribute Value.
So the XML representation of ActivationDate, or of any other attribute, should be:
<Attribute>
<AttributeName type="TextString" value="Activation Date"/>
<AttributeValue type="DateTime" value="2001-01-01T10:00:00+10:00"/>
</Attribute>
Moreover, the KMIP test case uses this second representation.
If the first representation is given as an example, it must be used somewhere. In which cases?
The KMIP specification is very vague on this point. BOTH forms of Attribute you described are considered valid KMIP and should be handled.
I strongly recommend the KMIP Additional Message Encodings document when implementing HTTP/JSON/XML encoding: https://docs.oasis-open.org/kmip/kmip-addtl-msg-enc/v1.0/os/kmip-addtl-msg-enc-v1.0-os.html
Section 6.1.6 describes yet another format that isn't covered in the main spec: <TTLV tag="0x420001" name="ActivationDate" type="DateTime" value="2001-01-01T10:00:00+10:00"/>
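A reader that accepts both of the forms from the question could look roughly like the sketch below. It is an illustration under assumptions, not something taken from the KMIP documents: the class and method names are invented, only non-structure values carried in the value attribute are handled, and reconciling "Activation Date" with "ActivationDate" is left to the caller.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class KmipAttributeReader {

    // Hypothetical helper: returns {name, type, value} for either XML form.
    public static String[] readAttribute(Element el) {
        if ("Attribute".equals(el.getTagName())) {
            // Wrapped form: <Attribute><AttributeName .../><AttributeValue .../></Attribute>
            Element name = (Element) el.getElementsByTagName("AttributeName").item(0);
            Element value = (Element) el.getElementsByTagName("AttributeValue").item(0);
            return new String[] {
                name.getAttribute("value"),
                value.getAttribute("type"),
                value.getAttribute("value")
            };
        }
        // Direct form: <ActivationDate type="DateTime" value="..."/>
        return new String[] { el.getTagName(), el.getAttribute("type"), el.getAttribute("value") };
    }

    // Parses an XML fragment and returns its root element.
    public static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        String direct = "<ActivationDate type=\"DateTime\" value=\"2001-01-01T10:00:00+10:00\"/>";
        String wrapped = "<Attribute>"
                + "<AttributeName type=\"TextString\" value=\"Activation Date\"/>"
                + "<AttributeValue type=\"DateTime\" value=\"2001-01-01T10:00:00+10:00\"/>"
                + "</Attribute>";
        System.out.println(String.join(" | ", readAttribute(parse(direct))));
        System.out.println(String.join(" | ", readAttribute(parse(wrapped))));
    }
}
```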

Lua XML extract from pattern

An application is sending my script a Stream like this one:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<aRootChildNode>
<anotherChildNode>
<?xml version="1.0">
<TheNodeImLookingFor>
... content ...
</TheNodeImLookingFor>
</anotherChildNode>
</aRootChildNode>
</root>
I want to extract the TheNodeImLookingFor section.
So far, got:
data = string.match(Stream, "^.+\<TheNodeImLookingFor\>.+\<\/TheNodeImLookingFor\>.+$")
Pattern is recognized in the Stream, but it doesn't extract the node and its content.
In general, it's not a good idea to use pattern matching (either Lua patterns or regex) to extract XML. Use an XML parser.
For this problem, you don't need to escape <, > or / (and even if you did, Lua patterns use %, not \, to escape magic characters). Use parentheses to capture the node and its content:
data = string.match(Stream, "^.+(<TheNodeImLookingFor>.+</TheNodeImLookingFor>).+$")
Or to get only the content:
data = string.match(Stream, "^.+<TheNodeImLookingFor>(.+)</TheNodeImLookingFor>.+$")

Unmarshalling with JAXB leads to : javax.xml.bind.UnmarshalException (invalid byte sequence)

Here's my problem : I've written a program that unmarshals an XML file given as input and it turns out that my program works just fine on my development environment BUT this same program will yield the following exception on my client's environment :
javax.xml.bind.UnmarshalException
- with linked exception:
[java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.]
The XML file given as input to my program is using UTF-8 as encoding type. The Unmarshaller object is using the default encoding type, that is UTF-8, since I did not set any property value to it. Besides, I did not set a schema to the unmarshaller, so, I am not even requesting an XML validation.
Does anyone have any idea, or has anyone already run into the same problem?
Thanks in advance
I have run into this error before. I changed my configuration to use ISO-8859-1 encoding:
marshaller.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
I can still put UTF-8 strings in the XML flow; they are marshalled and unmarshalled correctly even if the encoding is not defined as ISO-8859-1.
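One way this kind of error can arise, assuming the client's file was actually saved in a single-byte encoding such as ISO-8859-1 while being read as UTF-8 (the thread does not confirm this), is that an accented character becomes a byte sequence that is not valid UTF-8. A quick way to test a suspect file is to run its bytes through a strict UTF-8 decoder. The sketch below uses only the JDK; the class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Hypothetical helper: returns true only if the bytes form valid UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Caf\u00e9</name>";

        // Same text, two different byte encodings.
        byte[] asUtf8 = xml.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(asUtf8));   // true
        System.out.println(isValidUtf8(asLatin1)); // false: the lone 0xE9 byte is not valid UTF-8
    }
}
```

If the check fails on the client's file, the input is not really UTF-8, and fixing how the file is produced is likely more robust than changing the encoding on the marshaller side.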

Resources