Camel exceptions unmarshalling JAXB ISO-8859-1 XML file

For my route I set:
String encoding = "iso-8859-1";
JaxbDataFormat jaxb = new JaxbDataFormat( Data.class.getPackage().getName() );
if( encoding != null) {
    jaxb.setEncoding( encoding );
}
from( "file://" + location + "?charset=" + encoding )
    .routeId(this.getClass().getSimpleName()) // Give a nice name
    . etc.
Then, when I provide a file in this ISO encoding, I get the following exception stack:
java.io.IOException: javax.xml.bind.UnmarshalException
- with linked exception:
[com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 start byte 0xfc (at char #3964, byte #127)]
    at org.apache.camel.converter.jaxb.JaxbDataFormat.unmarshal(JaxbDataFormat.java:153)
    at org.apache.camel.processor.UnmarshalProcessor.process(UnmarshalProcessor.java:57)
    at org.apache.camel.util.AsyncProcessorConverterHelper$ProcessorToAsyncProcessorBridge.process(AsyncProcessorConverterHelper.java:61)
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:73)
    at org.apache.camel.processor.DelegateAsyncProcessor.processNext(DelegateAsyncProcessor.java:99)
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:90)
    at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:73)
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:73)
    at org.apache.camel.processor.DelegateAsyncProcessor.processNext(DelegateAsyncProcessor.java:99)
    at org.apache.camel.processor.DelegateAsyncProcessor.process(DelegateAsyncProcessor.java:90)
    at org.apache.camel.processor.interceptor.TraceInterceptor.process(TraceInterceptor.java:91)
    at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:73)
What could I be doing wrong?

According to the Camel docs, jaxb.setEncoding will not help, as this parameter is only used when marshalling XML documents, not when unmarshalling them.
In an ideal world, the encoding declaration in the prolog (the first, magic line of the XML file) matches the actual encoding of the file:
<?xml version="1.0" encoding="ISO-8859-1"?>
This declaration is (or at least should be) used automatically by XML-reading frameworks such as JAXB.
0xfc is the ISO-8859-1 encoding of ü. In your case, check the encoding declaration in the prolog. If it doesn't say ISO-8859-1, it is wrong. Ask the producer of the file (I hope it wasn't you...) to set the declaration accordingly. Normally, this is done correctly by the XML marshalling framework.
If you cannot convince the producer of the file to set the correct declaration, then things get trickier. In that case, you must know or guess the encoding and set the Camel header accordingly in the route:
.setHeader(Exchange.CHARSET_NAME, constant("ISO-8859-1"))
According to the source code of JaxbDataFormat (here), this encoding is only taken into account if the filterNonXmlChars property of the JaxbDataFormat instance is set to true:
jaxb.setFilterNonXmlChars(true);
Alternatively, you may also set the Exchange.FILTER_NON_XML_CHARS property to true.
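Putting the pieces together, the route could look roughly like this (a minimal sketch, assuming the location variable and Data class from your snippet):

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.converter.jaxb.JaxbDataFormat;

public class IsoFileRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        String encoding = "ISO-8859-1";

        JaxbDataFormat jaxb = new JaxbDataFormat(Data.class.getPackage().getName());
        // Per the source code referenced above, the exchange charset is only
        // honoured while unmarshalling when filterNonXmlChars is enabled.
        jaxb.setFilterNonXmlChars(true);

        from("file://" + location + "?charset=" + encoding)
            .routeId(getClass().getSimpleName())
            // Declare the actual encoding of the payload for the unmarshal step.
            .setHeader(Exchange.CHARSET_NAME, constant(encoding))
            .unmarshal(jaxb);
    }
}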

Related

Fails to parse Hebrew text from pdf using iText 7 with .net

I am trying to read a PDF file with several pages, using iText 7 on .NET Core 2.1.
The following is my code:
Rectangle rect = new Rectangle(0, 0, 1100, 1100);
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
inputStr = PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
inputStr gets the following string:
"\u0011\v\u000e\u0012\u0011\v\f)(*).=*%'\f*).5?5.5*.\a \u0011\u0002\u001b\u0001!\u0016\u0012\u001a!\u0001\u0015\u001a \u0014\n\u0015\u0017\u0001(\u001b)\u0001)\u0016\u001c*\u0012\u0001\u001d\u001a \u0016* \u0015\u0001\u0017\u0016\u001b\u001a(\n,\u0002>&\u00...
and in the Text Visualizer, it looks like that:
)(*).=*%'*).5?5.5*. !!
())* * (
,>&2*06) 2.-=9 )=&,

2..*0.5<.?
.110
)<1,3
  2.3*1>?)10/6
 (& >(*,1=0>>*1?

  2.63)&*,..*0.5
  206)&13'?*9*<
  *-5=0>
?*&..,?)..*0.5
It looks like I am unable to resolve the encoding, or there is a specific, custom encoding at the PDF level that I cannot read/parse.
Looking at the Document Properties, under Fonts it says the following (screenshot not reproduced here):
Any ideas how I can parse the document correctly?
Thank you
Yaniv
Analysis of the shared files
file1_copyPasteWorks.pdf
The font definitions here have an invalid ToUnicode entry:
/ToUnicode/Identity-H
The ToUnicode value is specified as
A stream containing a CMap file that maps character codes to Unicode values
(ISO 32000-2, Table 119 — Entries in a Type 0 font dictionary)
Identity-H is a name, not a stream.
Nonetheless, Adobe Reader interprets this name, and for apparently any name starting with Identity- assumes the text encoding for the font to be UCS-2 (essentially UTF-16). As this indeed is the case for the character codes used in the document, copy&paste works, even if for the wrong reasons. (Without this ToUnicode value, Adobe Reader also returns nonsense.)
iText 7, on the other hand, first follows the Encoding value when mapping to Unicode, with unexpected results.
Thus, in this case Adobe Reader arrives at a better result by reading meaning into an invalid piece of data.
file2_copyPasteFails.pdf
The font definitions here have valid but incomplete ToUnicode maps which only contain entries for the used Western European characters but not for Hebrew ones. They don't have Encoding entries.
Both Adobe Reader and iText 7 here trust the ToUnicode map and, therefore, cannot map the Hebrew glyphs.
How to parse
file1_copyPasteWorks.pdf
In case of this file the "problem" is that iText 7 applies the Encoding map. Thus, for decoding the text one can temporarily replace the Encoding map with an identity map:
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    PdfPage page = pdfDocument.GetPage(i);
    PdfDictionary fontResources = page.GetResources().GetResource(PdfName.Font);
    foreach (PdfObject font in fontResources.Values(true))
    {
        // Replace each font's Encoding entry by the identity mapping, so that
        // iText no longer applies the broken Encoding during text extraction.
        if (font is PdfDictionary fontDict)
            fontDict.Put(PdfName.Encoding, PdfName.IdentityH);
    }
    string output = PdfTextExtractor.GetTextFromPage(page);
    // ... process output ...
}
This code shows the Hebrew characters for your file 1.
file2_copyPasteFails.pdf
Here I don't have a quick work-around. You may want to analyze multiple PDFs of that kind. If they all encode the Hebrew characters the same way, you can create your own ToUnicode map from that and inject it into the fonts like above.
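If it comes to that, injecting a hand-built map could look roughly as follows. This is only a sketch against the iText 7 Java API, reusing the fontDict from the loop above (shown there in C#); the single bfrange is a made-up placeholder that would have to be replaced by the code-to-Unicode mapping derived from your analysis:

import com.itextpdf.kernel.pdf.PdfDictionary;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfStream;
import java.nio.charset.StandardCharsets;

// A minimal ToUnicode CMap; the bfrange below is purely illustrative.
String cmap =
      "/CIDInit /ProcSet findresource begin\n"
    + "12 dict begin begincmap\n"
    + "/CMapName /Custom def\n"
    + "/CMapType 2 def\n"
    + "1 begincodespacerange <0000> <FFFF> endcodespacerange\n"
    + "1 beginbfrange <0021> <003A> <05D0> endbfrange\n" // hypothetical: codes 0x21-0x3A -> Hebrew block
    + "endcmap CMapName currentdict /CMap defineresource pop end end";

// Attach it to the font dictionary, analogous to the Encoding replacement above.
fontDict.put(PdfName.ToUnicode, new PdfStream(cmap.getBytes(StandardCharsets.ISO_8859_1)));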

Converting stream to string in node.js

I am reading a file which comes in as an attachment, as follows:
let content = fs.readFileSync(attachmentNames[index], {encoding: 'utf8'});
When I inspect content it looks OK (I can see the file contents), but when I try to assign it to some other variable:
attachmentXML = builder.create('ATTACHMENT', '', '', {headless: true})
    .ele('FILECONTENT', content).up()
I get the following error
Error: Invalid character in string: PK
There are a couple of rectangular boxes (special characters) after PK in the above message which are not getting displayed.
builder here refers to an instance of the xmlbuilder https://www.npmjs.com/package/xmlbuilder node module.
I fixed this by wrapping the string in the JS escape() function.
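As a side note, "PK" followed by control characters is the signature of a ZIP container (e.g. a .docx or .xlsx file), so the attachment is binary rather than UTF-8 text. A more robust alternative to escaping is to base64-encode binary attachments before embedding them in XML; a minimal sketch of that idea, in Java for illustration (file name hypothetical):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Read the attachment as raw bytes instead of decoding it as UTF-8 text...
byte[] raw = Files.readAllBytes(Paths.get("attachment.docx")); // hypothetical name
// ...and base64-encode it so the XML element only contains safe ASCII.
String fileContent = Base64.getEncoder().encodeToString(raw);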

Convert a text file to UTF8 in D

I'm attempting to use the Phobos standard library functions to read in any valid UTF file (UTF-8, UTF-16, or UTF-32) and get it back as a UTF-8 string (aka D's string). After looking through the docs, the most concise function I could think of to do so is
import std.file, std.utf;

string readToUTF8(in string filename)
{
    try {
        return readText(filename);
    }
    catch (UTFException) {
        try {
            return toUTF8(readText!wstring(filename));
        }
        catch (UTFException) {
            return toUTF8(readText!dstring(filename));
        }
    }
}
However, catching a cascading series of exceptions seems extremely hackish. Is there a "cleaner" way to go about it without relying on catching a series of exceptions?
Additionally, the above function seems to leave a BOM character at the start of the resulting string if the source file was UTF-16 or UTF-32, which I would like to omit given that the result is UTF-8. Is there a way to omit it besides explicitly stripping it?
One of your questions answers the other: the BOM allows you to identify the exact UTF encoding used in the file.
Ideally, readText would do this for you. Currently, it doesn't, so you'd have to implement it yourself.
I'd recommend using std.file.read, casting the returned void[] to a ubyte[], then looking at the first few bytes to see if they start with a BOM, then cast the result to the appropriate string type and convert it to a string (using toUTF8 or to!string).
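For illustration, here is that BOM-sniffing approach sketched in Java (in D, std.file.read plus a cast to ubyte[] plays the role of Files.readAllBytes; note the UTF-32 checks must come before the UTF-16 ones, because a UTF-32LE BOM starts with the UTF-16LE BOM bytes):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

static String readViaBom(String filename) throws IOException {
    byte[] d = Files.readAllBytes(Paths.get(filename));
    Charset cs = StandardCharsets.UTF_8; // no BOM: assume UTF-8
    int bom = 0;
    if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
        bom = 3; // UTF-8 BOM
    } else if (d.length >= 4 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE && d[2] == 0 && d[3] == 0) {
        cs = Charset.forName("UTF-32LE"); bom = 4;
    } else if (d.length >= 4 && d[0] == 0 && d[1] == 0 && (d[2] & 0xFF) == 0xFE && (d[3] & 0xFF) == 0xFF) {
        cs = Charset.forName("UTF-32BE"); bom = 4;
    } else if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
        cs = StandardCharsets.UTF_16LE; bom = 2;
    } else if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
        cs = StandardCharsets.UTF_16BE; bom = 2;
    }
    // Decode everything after the BOM, so it never shows up in the result.
    return new String(d, bom, d.length - bom, cs);
}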

c# uploading data error -> return "�" for space

I am using C# with an HTTP helper and a StreamReader to read text. But when I upload a text file containing this text:
"Look  exactly what I found on # eBay! Willy Lee LifeLike  Chatting Butler Prop Motion Sen"
the spaces are replaced by "�" when the text is used in the code.
The code for reading the text is:
List<string> list = new List<string>();
StreamReader reader = new StreamReader(filepath);
string text = "";
while ((text = reader.ReadLine()) != null)
{
    if (!string.IsNullOrEmpty(text))
    {
        list.Add(text);
    }
}
reader.Close();
return list;
list contains this data:
"Look��exactly�what�I�found�on�#�eBay!�Willy�Lee�LifeLike��Chatting�Butler�Prop�Motion�Sen"
Looks like an encoding problem. I have had such text problems when multibyte-encoded text is shown on a non-Unicode-based web page, e.g. one using Windows-1252 / CP-125X.
It looks like the same thing here: the text is UTF-8 encoded but displayed in ANSI mode. The spaces are "special" spaces like the ones MS Word sometimes inserts, which are multibyte in UTF-8, while the English characters are single-byte in UTF-8 (as are all characters below ASCII code 128) and therefore compatible with the ANSI code table and displayed correctly.
Or, option 2: if the text is written to a file without a BOM at the beginning, a text editor may not recognize that the content is Unicode and will open it in ANSI (regular ASCII) mode.
If you give more details about where the data is read from, and where it is saved and opened, I can give more concrete advice.
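If the file is indeed UTF-8, decoding it with an explicitly declared encoding avoids the replacement characters. In C# that means passing an Encoding to the StreamReader constructor; here is the same idea as a minimal Java sketch (filepath as in the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Read the file line by line, decoding explicitly as UTF-8 instead of
// relying on a platform default / ANSI code page.
List<String> list = new ArrayList<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get(filepath), StandardCharsets.UTF_8)) {
    String text;
    while ((text = reader.readLine()) != null) {
        if (!text.isEmpty()) {
            list.add(text);
        }
    }
}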

How to get the base64 data from an XML file to a string in Groovy

How can I reliably get the base64 data from an XML file into a byte[] and then compare that with a string? The following code fails; it seems the whitespace is causing the assert to fail. The goal is for the assert to pass.
Note that it is important that we have the data in the form of byte[] at some point, but not that the comparison be done via strings.
<Contents>VGVzdGluZyBURSBzZXNzaW9uIGNvbnRhaW5pbmcgQ29tcGxldGUgUGVyc29uIEEgYW5kIENvbXBs
ZXRlIEVxdWlwbWVudCBCLg0KDQpUZXN0IFRlc3QNCg0KUmVmZXJlbmNlcyBDb21wbGV0ZSBQbGFj
ZSBB
</Contents>
byte[] byteData = document.Contents.text()
assert 'VGVzdGluZyBURSBzZXNzaW9uIGNvbnRhaW5pbmcgQ29tcGxldGUgUGVyc29uIEEgYW5kIENvbXBs' +
       'ZXRlIEVxdWlwbWVudCBCLg0KDQpUZXN0IFRlc3QNCg0KUmVmZXJlbmNlcyBDb21wbGV0ZSBQbGFj' +
       'ZSBB' == new String(byteData)
Base64 is a special encoding of data into ASCII characters (historically, to be URL friendly).
EDIT thanks to the comment below: actually, base64 was devised to encode binary data for sending via email.
to extract text from your data, do this:
new String(
    'VGVzdGluZyBURSBzZXNzaW9uIGNvbnRhaW5pbmcgQ29tcGxldGUgUGVyc29uIEEgYW5kIENvbXBsZXRlIEVxdWlwbWVudCBCLg0KDQpUZXN0IFRlc3QNCg0KUmVmZXJlbmNlcyBDb21wbGV0ZSBQbGFjZSBB'
        .decodeBase64()
)
The result starts with 'Testing TE session containing Complete Person A and Complete Equipment B.'
from http://mrhaki.blogspot.fr/2009/11/groovy-goodness-base64-encoding.html
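To address the whitespace problem from the question directly: the assert fails because the element text still contains the line breaks from the XML file. A base64 decoder that tolerates line separators handles this; for illustration, a small Java sketch using the MIME decoder (the string is the element text from the question, line breaks included):

import java.nio.charset.StandardCharsets;
import java.util.Base64;

// The element text as it comes out of the XML, with its original line breaks.
String contents = "VGVzdGluZyBURSBzZXNzaW9uIGNvbnRhaW5pbmcgQ29tcGxldGUgUGVyc29uIEEgYW5kIENvbXBs\n"
                + "ZXRlIEVxdWlwbWVudCBCLg0KDQpUZXN0IFRlc3QNCg0KUmVmZXJlbmNlcyBDb21wbGV0ZSBQbGFj\n"
                + "ZSBB";

// The MIME decoder skips line separators, so the embedded newlines are harmless.
byte[] byteData = Base64.getMimeDecoder().decode(contents);
System.out.println(new String(byteData, StandardCharsets.US_ASCII));
// -> Testing TE session containing Complete Person A and Complete Equipment B. ...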
