My project parses XML from various sources using JAXB. This works for most sources, but I am having trouble parsing documents from one particular source. The only difference I have been able to find is that the offending document reports its encoding to be UTF-16, whereas the others seem to be in UTF-8 as far as I can tell.
Here is the code:
InputStream inputStream = new FileInputStream(inputFile);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(inputStream);
This throws the following exception:
[Fatal Error] :1:40: Content is not allowed in prolog.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 40; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at ... (my code)
The offending document starts with
<?xml version="1.0" encoding="UTF-16"?>
followed directly by the opening tag for the root element. I examined the file with a hex editor; there are no other characters (not even BOMs or any nonprinting characters) before the opening tag.
If I change the encoding attribute to UTF-8, the code runs past that point (though it throws an unrelated exception further down the line).
Is JAXB incompatible with UTF-16? Or what else is the problem?
Running xmlstarlet fo on the document produced the following error:
/path/to/document.xml:1.38: Document labelled UTF-16 but has UTF-8 content
In conclusion, org.xml.sax.SAXParseException with the error message Content is not allowed in prolog is fairly unspecific and in some cases misleading.
While usually it indicates that illegal extra characters (including nonprinting ones) were encountered before the root element, it may also indicate something completely different – such as the XML prolog specifying an encoding which does not match the actual encoding.
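If the file cannot be fixed at the source, one possible workaround is to decode the bytes yourself and hand the parser a character stream: per the `InputSource` documentation, a parser that is given a character stream reads it directly and disregards the encoding declared in the prolog. The following is a minimal sketch under that assumption; the class name `MislabelledEncodingDemo`, the helper `rootElementName`, and the sample document are made up for illustration, and it assumes you already know the real encoding (UTF-8 here, as xmlstarlet reported).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original project.
final class MislabelledEncodingDemo {

    // Decodes the bytes with the charset we pass in, so the parser never has
    // to trust the encoding pseudo-attribute in the XML prolog.
    static String rootElementName(byte[] xmlBytes, Charset actualCharset) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), actualCharset)) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // An InputSource wrapping a Reader is a character stream.
            Document doc = db.parse(new InputSource(reader));
            return doc.getDocumentElement().getLocalName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Sample document: labelled UTF-16, but the bytes are UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root><item>x</item></root>";
        byte[] utf8Bytes = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(rootElementName(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

For a real file, the `ByteArrayInputStream` would be replaced by the `FileInputStream` from the question. This only hides the mismatch; the cleaner fix is still to have the producer label the document correctly.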
Related
I'm trying to use ANTLR 4 to build an HTTP parser according to the grammar in RFC 7230. I generated the parser with the ANTLR tool and put it into my code. But when I feed it data from a browser (via a TCP server), I get an exception: UTF-8 string contains an illegal byte sequence.
If I understood the RFC correctly, I don't need any conversion to UTF-8, because the header data is in US-ASCII and the body is simply a sequence of bytes.
Can I disable the conversion of data to UTF-8 in ANTLR?
Thanks for advice
I have run into an issue on Windows where an encoded file is read and decoded using EncodingGroovyMethods#decodeBase64:
getClass().getResourceAsStream('/endoded_file').text.decodeBase64()
This gives me:
bad character in base64 value
The file itself has CRLF line endings, and the Groovy decodeBase64 implementation contains this snippet, with a comment:
} else if (sixBit == 66) {
// RFC 2045 says that I'm allowed to take the presence of
// these characters as evidence of data corruption
// So I will
throw new RuntimeException("bad character in base64 value"); // TODO: change this exception type
}
I looked up RFC 2045, and a CRLF pair is supposed to be legal. I tried the same with org.apache.commons.codec.binary.Base64#decodeBase64 and it works. Is this a bug in Groovy, or was this intentional?
I am using groovy 2.4.7.
This is not a bug, but a different way of handling corrupt data. Looking at the source code of Base64 in Apache Commons, you can see the documentation:
* Ignores all non-base64 characters. This is how chunked (e.g. 76 character) data is handled, since CR and LF are
* silently ignored, but has implications for other bytes, too. This method subscribes to the garbage-in,
* garbage-out philosophy: it will not check the provided data for validity.
So, while the Apache Base64 decoder silently ignores the corrupt data, the Groovy one will complain about it. The RFC documentation is a bit fuzzy about it:
In base64 data, characters other than those in Table 1, line breaks, and other
white space probably indicate a transmission error, about which a
warning message or even a message rejection might be appropriate
under some circumstances.
Since warning messages are hardly useful (who checks for warnings anyway?), the Groovy authors decided to go down the path of 'message rejection'.
TL;DR: they are both fine, just different ways of handling corrupt data. If you can, try to fix or reject the incorrect data.
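The same strict-versus-lenient split can be seen in the JDK's own java.util.Base64 (Java 8 and later), which is neither the Groovy nor the Commons Codec implementation discussed above: the basic decoder rejects any character outside the Base64 alphabet, while the MIME decoder ignores line separators and other non-alphabet characters. A minimal sketch, with the class name `Base64LineBreakDemo` and both helper names made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class contrasting the two JDK decoders.
final class Base64LineBreakDemo {

    // MIME decoder: skips CR, LF and other non-alphabet characters.
    static String decodeLenient(String encoded) {
        byte[] decoded = Base64.getMimeDecoder().decode(encoded);
        return new String(decoded, StandardCharsets.UTF_8);
    }

    // Basic decoder: throws IllegalArgumentException on CR or LF.
    static boolean strictDecoderAccepts(String encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "hello" encoded as Base64, split by a CRLF pair as in a Windows file.
        String withLineBreak = "aGVs\r\nbG8=";
        System.out.println(strictDecoderAccepts(withLineBreak));
        System.out.println(decodeLenient(withLineBreak));
    }
}
```

In Groovy, a comparable effect could be had by stripping whitespace from the text before calling decodeBase64(), at the cost of no longer noticing stray whitespace in the input.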
I'm running a flow in Mule CE and have huge problems with encodings. No matter what I do, my files end up with messed-up non-English characters.
Before the jaxb-object-to-xml transformer my payload looks nice in the console and in the debugger, but after that it's all messed up.
......
<http:request>
<object-to-byte-array-transformer encoding="UTF-8" doc:name="Object to Byte Array"/>
<object-to-string-transformer doc:name="String" encoding="UTF-8"/>
<json:json-to-object-transformer returnClass="java.util.List" doc:name="JSON2ObjectList" encoding="UTF-8"/>
<collection-splitter doc:name="Collection Splitter"/>
<choice doc:name="Choice">
<when expression="....">
<custom-transformer returnClass="se.system.Order.SalesHeader" class="se.system.Transformer.Map2Order" doc:name="Map2Order" mimeType="application/xml" encoding="UTF-8"/>
<mulexml:jaxb-object-to-xml-transformer name="orderMarshaller" jaxbContext-ref="JAXB_Context" doc:name="orderMarshaller" mimeType="text/xml" encoding="UTF-8"/>
<object-to-string-transformer doc:name="XML2String" encoding="UTF-8"/>
<set-variable variableName="fileName" value="order-#[function:dateStamp].xml" doc:name="fileName" encoding="UTF-8"/>
<file:outbound-endpoint path="${file.ToOrder}" responseTimeout="10000" doc:name="File" outputPattern="#[fileName]" mimeType="text/xml" encoding="UTF-8"/>
After the JAXB transformer, non-English characters look like:
Deliveryinfo2="å ä ö Å Ä Ö & % è É"/
And the 010 editor claims it's ANSI DOS (with messed-up characters; I don't know if that one is to be trusted, though).
Have I missed something in the jaxb transformer? or somewhere else?
Is it possible to replace it with a Java component, initiate my very own JAXB context, get a marshaller and handle it myself?
No clues anymore...
Regards
EDIT: this one can handle non-English characters
<mulexml:object-to-xml-transformer doc:name="Object to XML" encoding="UTF-8" />
but not GregorianCalendar types or my main object's List of other objects, so it's not an alternative.
This seems to be a bug caused by the JAXB transformer not respecting the given encoding, see source (line 64).
What is odd, however, is that according to the JAXB documentation the default encoding should be UTF-8.
Encoding
By default, the Marshaller will use UTF-8 encoding when generating XML data to a java.io.OutputStream, or a java.io.Writer. Use the setProperty API to change the output encoding used during these marshal operations. Client applications are expected to supply a valid character encoding name as defined in the W3C XML 1.0 Recommendation and supported by your Java Platform.
This should probably be something like this
final Marshaller m = jaxbContext.createMarshaller();
m.setProperty(Marshaller.JAXB_ENCODING, encoding);
I am marshalling a POJO using JAXB.
The POJO class contains a field of type String, and the value being set contains a currency symbol that depends on the java.util.Locale being passed.
My problem is that when passing Locale.US it works fine (e.g. $235.36), but when passing any other Locale, say Locale.CHINA, a junk character is prepended to the currency symbol (e.g. ï¿¥235.36).
Any suggestions, answers, and experiences related to such a scenario are most welcome. Thanks in advance.
By default a JAXB implementation will output to UTF-8. You can specify another encoding using the JAXB_ENCODING property (see: http://blog.bdoughan.com/2011/08/jaxb-and-java-io-files-streams-readers.html). Also note JAXB may be handling the character correctly but the viewer you are using to examine the XML may not.
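The viewer explanation is consistent with the junk in the question. Assuming the currency symbol is the fullwidth yen sign U+FFE5 (an assumption; the question does not say which character the locale produced), its UTF-8 encoding is the three bytes EF BF A5, and a viewer that reads those bytes as ISO-8859-1 shows exactly ï¿¥. A small sketch that reproduces this, with the class name `MojibakeDemo` and helper `misreadAsLatin1` made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates a viewer using the wrong charset.
final class MojibakeDemo {

    // Encodes the text as UTF-8 (what the marshaller writes by default)
    // and then decodes the same bytes as ISO-8859-1 (what the viewer does).
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String price = "\uFFE5235.36"; // fullwidth yen sign followed by the amount
        String seenInViewer = misreadAsLatin1(price);
        // One character became three: U+00EF, U+00BF, U+00A5.
        System.out.println(seenInViewer.length() - price.length());
        System.out.println(seenInViewer.equals("\u00EF\u00BF\u00A5235.36"));
    }
}
```

If this is what is happening, the marshalled bytes are fine and the fix lies in whatever reads them: open the file as UTF-8, or set JAXB_ENCODING to the charset the consumer expects.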
Here's my problem: I've written a program that unmarshals an XML file given as input. It works just fine in my development environment, but the same program yields the following exception in my client's environment:
javax.xml.bind.UnmarshalException
- with linked exception:
[java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.]
The XML file given as input to my program is using UTF-8 as encoding type. The Unmarshaller object is using the default encoding type, that is UTF-8, since I did not set any property value to it. Besides, I did not set a schema to the unmarshaller, so, I am not even requesting an XML validation.
Does anyone have any idea, or has anyone already run into the same problem?
Thanks in advance
I have run into this error before. I changed my configuration to use ISO-8859-1 encoding:
marshaller.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
I can still put UTF-8 strings into the XML stream; they are marshalled and unmarshalled correctly even though those strings are not ISO-8859-1.
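Before changing the encoding, it may be worth checking whether the client's file really is UTF-8, since "Invalid byte 2 of 2-byte UTF-8 sequence" is what a parser typically reports when it meets bytes from a single-byte encoding such as ISO-8859-1 in a document it is reading as UTF-8. The following sketch does a strict validation with the JDK's CharsetDecoder; the class name `Utf8Check` and the method `isValidUtf8` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte array is well-formed UTF-8.
final class Utf8Check {

    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u00C9cole"; // starts with a capital E with acute accent
        // In ISO-8859-1 that character is the single byte 0xC9, which UTF-8
        // treats as the start of a 2-byte sequence; the following 'c' is not
        // a valid second byte.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For a real file, the bytes could come from `java.nio.file.Files.readAllBytes`. If the check fails, the file is mislabelled, and the durable fix is to correct the declaration or re-encode the file rather than to change the marshaller settings.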