Unmarshalling with JAXB leads to javax.xml.bind.UnmarshalException (invalid byte sequence)

Here's my problem: I've written a program that unmarshals an XML file given as input. It works just fine in my development environment, but the same program throws the following exception in my client's environment:
javax.xml.bind.UnmarshalException
- with linked exception:
[java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.]
The XML file given as input to my program uses UTF-8 encoding. The Unmarshaller object uses the default encoding, UTF-8, since I did not set any property on it. Besides, I did not set a schema on the unmarshaller, so I am not even requesting XML validation.
Does anyone have any idea, or has anyone already run into the same problem?
Thanks in advance

I have run into this error before. I changed my configuration to use ISO-8859-1 encoding:
marshaller.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
I can still put UTF-8 strings into the XML stream; they are correctly marshalled and unmarshalled even though the declared encoding is ISO-8859-1. See the sketch below.
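A minimal sketch of that configuration (assuming a JAXB-annotated class Foo and an instance foo; both names are illustrative):

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;
import java.io.File;

// Minimal sketch, assuming a JAXB-annotated class Foo (illustrative name).
JAXBContext context = JAXBContext.newInstance(Foo.class);

// Marshal with an explicit ISO-8859-1 declaration, as described above.
Marshaller marshaller = context.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
marshaller.marshal(foo, new File("out.xml"));

// Unmarshal from the file directly (not via a Reader), so the parser
// detects the encoding from the XML prolog.
Unmarshaller unmarshaller = context.createUnmarshaller();
Foo roundTripped = (Foo) unmarshaller.unmarshal(new File("out.xml"));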

Related

ANTLR4 C++ target - UTF-8 conversion error in HTTP parser

I'm trying to use ANTLR4 to build an HTTP parser according to the grammar in RFC 7230. I generated the parser with the ANTLR tool and put it into my code. But when I feed it data from a browser (via a TCP server), I get an exception: UTF-8 string contains an illegal byte sequence.
If I understood the RFC correctly, I don't need any conversion to UTF-8, because the data in the headers are US-ASCII and the body is simply a set of bytes.
Can I disable the conversion of data to UTF-8 in ANTLR?
Thanks for any advice.

JAXB chokes on UTF-16 XML

My project parses XML from various sources using JAXB. This works for most sources, but I am having trouble parsing documents from one particular source. The only difference I have been able to find is that the offending document reports its encoding as UTF-16, whereas the others seem to be in UTF-8 as far as I can tell.
Here is the code:
InputStream inputStream = new FileInputStream(inputFile);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true); // JAXB requires a namespace-aware DOM
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(inputStream); // the exception is thrown here
This throws the following exception:
[Fatal Error] :1:40: Content is not allowed in prolog.
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 40; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at ... (my code)
The offending document starts with
<?xml version="1.0" encoding="UTF-16"?>
followed directly by the opening tag for the root element. I examined the file with a hex editor; there are no other characters (not even BOMs or any nonprinting characters) before the opening tag.
If I change the encoding attribute to UTF-8, the code runs past that point (though it throws an unrelated exception further down the line).
Is JAXB incompatible with UTF-16? Or what else is the problem?
Running xmlstarlet fo on the document produced the following error:
/path/to/document.xml:1.38: Document labelled UTF-16 but has UTF-8 content
In conclusion, org.xml.sax.SAXParseException with the error message Content is not allowed in prolog is fairly unspecific and in some cases misleading.
While usually it indicates that illegal extra characters (including nonprinting ones) were encountered before the root element, it may also indicate something completely different – such as the XML prolog specifying an encoding which does not match the actual encoding.
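Given xmlstarlet's diagnosis, one hedged workaround (a sketch reusing the variables from the snippet above, and assuming the bytes really are UTF-8) is to hand the parser the actual encoding explicitly; encoding information supplied from outside the document takes precedence over the encoding pseudo-attribute in the prolog:

import org.xml.sax.InputSource;

// Sketch: tell the parser the actual encoding, overriding the incorrect
// UTF-16 declaration in the prolog. Assumes the content is really UTF-8.
InputSource source = new InputSource(new FileInputStream(inputFile));
source.setEncoding("UTF-8");
Document doc = db.parse(source);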

How to detect encoding errors in a Node.js Buffer

I'm reading a file in Node.js into a Buffer object, and I'm decoding the UTF-8 content of the Buffer using Buffer.toString('utf8'). If there are encoding errors, I want to report a failure.
The toString() method handles decoding errors by substituting the replacement character U+FFFD, which I can detect by searching the result. But U+FFFD is a legal character in the input file, and I don't want to report an error if the U+FFFD was present and correctly encoded in the input.
Is there any way I can distinguish a Buffer that contains a legitimately encoded U+FFFD character from one that contains an encoding error?
The solution proposed by @eol in a comment on the question appears to meet the requirements.
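The comment itself is not quoted above, but the general technique is strict decoding: configure the decoder to throw on malformed input instead of substituting U+FFFD, so a genuine U+FFFD in the input is left alone while a real encoding error surfaces as an exception. A sketch of that idea, shown in Java for illustration (not the Node.js code from the question):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Strict UTF-8 decoding: REPORT makes the decoder throw instead of
// substituting U+FFFD, so errors are distinguishable from a real U+FFFD.
// "bytes" is a placeholder for the file's contents.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    String text = decoder.decode(ByteBuffer.wrap(bytes)).toString();
} catch (CharacterCodingException e) {
    // genuine encoding error in the input
}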

Decoding base64 while using GitHub API to Download a File

I am using the GitHub API to download a file from GitHub. I have been able to authenticate successfully, get a response from GitHub, and see a base64-encoded string representing the file contents.
Unfortunately, I get an unusual error (string length is not a multiple of 4) when decoding the base64 string.
The HTTP request is illustrated below:
GET /repos/:owner/:repo/contents/:path
The (partial) response is illustrated below:
{
  "name": "...",
  "download_url": "...",
  "type": "file",
  "content": "ewogICAgInN3YWdnZXIiOiAiM..."
}
The issue I am encountering is that the length of the string is 15263 bytes, and I get an error in decoding the string (string length is not a multiple of 4). I am using node.js and the 'base64-js' npm module to decode the string. Code to execute the decoding is illustrated below:
var base64 = require('base64-js');
var contents = base64.toByteArray(fileContent);
The decoding causes an exception:
Error: Invalid string. Length must be a multiple of 4
at placeHoldersCount (.../node_modules/base64-js/index.js:23:11)
at Object.toByteArray (...node_modules/base64-js/index.js:42:18)
    ...
I would think that the GitHub API is sending me the correct data, so I figure that is not the issue.
Am I performing the decoding improperly or is there another problem I am overlooking?
Any help is appreciated.
I experimented a bit and found a solution by using a different base64 decoding library as follows:
var base64 = require('js-base64').Base64;
var contents = base64.decode(res.content);
I am not sure if it is mandatory for an encoded string's length to be divisible by 4 (clearly my 15263-character string is not), but the alternate library decoded the string properly.
A second solution which I also found to work is specific to how to use the GitHub API. By adding the following to the GitHub API call header, I was also able to get the decoded file contents:
'accept': 'application/vnd.github.VERSION.raw'
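For example, a hypothetical request using Java's built-in HTTP client (OWNER, REPO, and PATH are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Request the raw file body instead of the JSON wrapper, so no base64
// decoding is needed at all. OWNER/REPO/PATH are placeholders.
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.github.com/repos/OWNER/REPO/contents/PATH"))
        .header("Accept", "application/vnd.github.VERSION.raw")
        .build();
HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
String fileContents = response.body();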
After much experimenting, I think I nailed down the difference between the working and broken base64 decoding.
It appears GitHub base64-encodes with:
UTF-8 charset
Base64 MIME encoder (RFC 2045)
as opposed to a "basic" (RFC 4648) Base64 encoder. Several languages seem to default to the basic encoder (including Java, which I was using). When I switched to a MIME decoder, I got the full contents of the file un-garbled. This would explain why switching libraries in some cases fixed the issue.
I will note that the content field contained newline characters; MIME decoders are supposed to ignore them, but strict decoders do not, so if you still get errors, you may need to strip them first (see the sketch below).
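A sketch of that fix in Java (contentFromApi is a placeholder for the value of the JSON "content" field):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

// The MIME decoder (RFC 2045) skips characters outside the base64
// alphabet, including the embedded newlines; the basic decoder
// (RFC 4648) rejects them. contentFromApi is a placeholder.
byte[] raw = Base64.getMimeDecoder().decode(contentFromApi);
String fileText = new String(raw, StandardCharsets.UTF_8);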
The media-type header will do the job better; however, in my case I am trying to use the API via a GitHub App, and at the time of writing GitHub requires a specific media type when doing that, and it returns the JSON response.
For some reason, the GitHub API's base64-encoded content doesn't decode properly in any of the online base64 decoders I've tried from the front page of Google, presumably because of the embedded newlines noted above.
Python works however:
import base64
base64.b64decode("ewogICAgInN3YWdnZXIiOiAiM...")

Marshalling a POJO using JAXB results in junk characters being displayed

I am marshalling a POJO using JAXB.
The POJO class contains a String field, and the value being set contains a currency symbol that depends on the java.util.Locale being passed.
My problem is that when passing Locale.US it works fine (e.g. $235.36), but when passing any other locale, say Locale.CHINA, junk characters appear in front of the currency symbol (e.g. ï¿¥235.36).
Any suggestions, answers, or experiences related to this scenario are most welcome. Thanks in advance.
By default a JAXB implementation will output UTF-8. You can specify another encoding using the JAXB_ENCODING property (see: http://blog.bdoughan.com/2011/08/jaxb-and-java-io-files-streams-readers.html). Also note that JAXB may be handling the character correctly, but the viewer you are using to examine the XML may not; ï¿¥ is exactly what the UTF-8 encoding of the fullwidth yen sign (U+FFE5) looks like when displayed as Latin-1.
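As a hedged illustration (Price and price are made-up names for the POJO and its instance):

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import java.io.FileOutputStream;
import java.io.OutputStream;

// Sketch: marshal with an explicit encoding, and make sure whatever
// displays the file uses the same charset. The UTF-8 bytes of U+FFE5
// (0xEF 0xBF 0xA5) render as "ï¿¥" when read as Latin-1, matching the
// junk characters described above.
JAXBContext context = JAXBContext.newInstance(Price.class);
Marshaller marshaller = context.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
try (OutputStream out = new FileOutputStream("price.xml")) {
    marshaller.marshal(price, out);
}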
