Wrong text encoding when parsing JSON data - Linux

I am curling a website and writing the response to a .json file; this file is the input to my Java code, which parses it with a JSON library, and the necessary data is written out to a CSV file that I later store in a database.
Since data coming from a website can be in different encodings, I make sure that I read and write in UTF-8, but I still get wrong output.
For example, Østerriksk becomes �sterriksk.
I am doing all this on Linux. I think there is some encoding problem, because the same code runs fine on Windows but not on Unix/Linux.
I am quite sure my Java code is correct, but I am not able to find out what I'm doing wrong.

You're reading the data as ISO 8859-1 but the file is actually UTF-8. I think there's an argument (or setting) to the file reader that should solve that.
Also: curl isn't going to care about the encodings. It's really something in your Java code that's wrong.
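If the file is being read through a FileReader or an InputStreamReader without an explicit charset, the platform default is used, and that default differs between systems. A minimal sketch of passing UTF-8 explicitly on both the read and the write side (the class name and the data.json / out.csv file names are placeholders, not the asker's actual code):
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) throws IOException {
        // Read the curl output with an explicit UTF-8 charset instead of
        // the platform default, which differs between Windows and Linux.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream("data.json"), StandardCharsets.UTF_8));
             // Write the CSV with an explicit UTF-8 charset as well.
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                 new FileOutputStream("out.csv"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);   // real code would parse and transform here
            }
        }
    }
}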

What IDE are you using? This can happen, for example, if you are using Eclipse and have not set your default encoding to UTF-8 in the properties.

Related

How do I decompress the diagram data in a .drawio file with node.js and zlib?

Diagrams.net, previously and still more widely known as draw.io, is a popular tool for drawing diagrams of various kinds. It stores diagrams in an XML-based format that uses the file ending .drawio. The file content has the structure:
<mxfile {...}>
<diagram {...}>
{the-actual-diagram-content}
</diagram>
</mxfile>
According to the documentation page Extracting the XML from mxfiles, the string {the-actual-diagram-content} contains the actual diagram data in compressed format, "compressed with the standard deflate process". I'd like to decompress this data in my node.js app to parse and modify it.
I have found an older, similar question on Stack Overflow that asks for the same thing, but it uses the libraries "atob" and, later, "pako". I'd like to achieve the same with the more standard "zlib" node.js module, which - if this really is "the standard deflate process" - should be possible.
However, all my attempts to "inflate" the compressed string fail. I have mostly tried variations of the following code, with different encodings ('base64', 'utf8') and methods ('inflateSync', 'unzipSync', 'gunzipSync'):
zlib.inflateSync(Buffer.from(string, 'base64')).toString();
All attempts fail with the error "Error: incorrect header check". I read this as "dude, seriously, you're using the wrong unzip algorithm for this". However, I cannot figure out what the right algorithm or settings are.
The sample string I'd like to decode is the following. Using the jgraph inflate/deflate tool, it decompresses perfectly fine. However, the settings used there, "URL Encode", "Deflate", "Base64", sound exactly like what I am trying.
zVdbk6I6EP41Vp3z4BYXL/Ao3nV0VEYZfQsQITOBIEQu/voNAgrqrHtOzVbti5X+0t0kX/eXxJrYdeKhDzx7RkyIawJnxjWxVxOEZktmvymQZIAoNzLA8pGZQfwVUNEJ5iCXo0dkwqDiSAnBFHlV0CCuCw1awYDvk6jqtie4+lUPWPAOUA2A71ENmdTOUKnJXfERRJZdfJnn8hkHFM45ENjAJFEJEvs1sesTQrORE3chTrkreMniBl/MXhbmQ5f+TkCXT0gX48NHW1CSsVHXjta0nmcJAT7mG+7kq6VJQYFPjq4J0yxcTVQiG1GoesBIZyNWc4bZ1MHM4tkwoD75vFDFNqnkX4A+hfGXS+cvhLBGgsSB1E+YSxHQyjlMbuzoWhK+4NkulaPwA3kXWJfUV6LYIOfqP/Am3PGm/IW8ia3mX8abeMebSdIYesceNJkOcxNinUT9K6CcATaRsoOYWM8EAp92UsUz3CXu2c01b5AS42xygNLVn8sDMLJcNjYYtdBnAAY6xAowPq1zHbsEE/+ap1pbFuMn72Vjmxo/moXZi8uTvaSwYkTfi+WwcSmKWdeg1ChiMqJSdn7dFIxMcvQN+Fz8jDgL0mfN/mWT1bkfXFNqVxqtXjSVDzGgKKyu9VFX5ekXBLFdXHIL0o3wpZvGzPaYR5UPv5tEolhNJIg3iTISfpGocCT7fQArPmchXHj5/9po3GnjXhQYs4sPPj9OQOBlt+EexWmbPjhffEIBBfo5ddpY+Q1aQthVDsv2b51IX8v+voNKx9CjU+ibmqjOV2t/sf98SZvPS8qeBV46DCh0DYT/CTfzlR7MrF5PmC8SIz5MdfnN4UVrzlmdlql4q46i66/m3uq7mtdTvZ3jrFQ0Zp9S4EjXoqUgK+uX5cTtbS3TmDV36a4E5bhgxA0mW3u1w5PpSRuph+6SIW9HNIC0cewe9e0cxsJyA4VOe8v2qyznChq33wP8Ee3DE97iYWvIqXY74k/4oIKDGQta/LnmY9WBRjsaAaN90jSwWrNYHIZr5vGxZt8eeMTHLyCQArwGfZ2OY3Uh0tZouLKlYLya0FVfjjZyM3ZM5VVDn6d4ISvcECB5rYSOOEpCJA90I54dtp0X7Mubo77bACKoZiNgz0vFOxmKLr3tk7QLlNZorbYnQ6HhBtPe20Hy2QJUcR8vwlktvaUHvPc+VDYn1yFm0kY4lJfSXBxN9d3rsKmpO3VuyWa4kLr/PpfZQ9kAjEnUKd6e3I3U+MJGJL1v6vL5hDdROQOMPeAWl8udlP+kDMUHMhS/S4bVx0i98Q0yZOb1BZ25X/+GiP2f
What am I doing wrong?
Use zlib.inflateRawSync(). What you have there is a raw deflate stream, not a zlib stream.
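A minimal sketch of that, adding the Base64 and URL-decoding steps that the jgraph tool also performs (the helper name decodeDrawio is just for illustration):
const zlib = require('zlib');

// Usage: node decode.js <base64-string-from-the-diagram-element>
// data is the Base64 string stored inside the <diagram> element.
function decodeDrawio(data) {
    const compressed = Buffer.from(data, 'base64');        // undo the Base64 layer
    const inflated = zlib.inflateRawSync(compressed);      // raw deflate, no zlib header
    return decodeURIComponent(inflated.toString('utf8'));  // undo the URL encoding
}

if (process.argv[2]) {
    console.log(decodeDrawio(process.argv[2]));
}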

GraphicsMagick unable to process Unicode filenames

I have found that GraphicsMagick is unable to process my files with Chinese filenames. I did the same test with ImageMagick, and IM worked as expected.
I thought this might be a bug so I filed a bug report here: https://sourceforge.net/p/graphicsmagick/bugs/384/
Anyway, this is how to reproduce my situation:
Platform: Win10
Version: GraphicsMagick 1.3.20
Code: gm -identify 獅藝學會.jpg
This is the returned text from Command Prompt:
>gm -identify 獅藝學會.jpg
gm identify: Unable to open file (????.jpg) [Invalid argument].
gm identify: Request did not return an image.
Using IM worked:
identify 獅藝學會.jpg
ç?.è-?å-,æoƒ.jpg JPEG 3264x2448 3264x2448+0+0 8-bit sRGB 2.691MB 0.016u 0:00.004
Although the text returned is scrambled, converting the file to a .png still kept the same filename, apart from the different extension of course.
What happened
I found this problem while using the gm node.js library to batch process my images; the call originates from a UTF-8 webpage, so I assume the filename is passed in Unicode encoding.
I found no documentation related to this problem. The documentation does mention an -encoding option, but it is not recognized as a parameter on Windows, and I cannot find relevant solutions on Google.
Please help: is there any easy way around this problem while keeping the exact filename?
In case someone uses the C API:
(You can only pass (char *)-type filenames, and UTF-8 encoding does not work when using GraphicsMagick on Windows.)
You could do the following:
Open the file for input (or output) yourself (using fopen(), _wfopen(), etc.).
Then set the file handle within the ImageInfo structure for reading, or within the Image structure for writing, respectively (instead of setting the filename).
To have GraphicsMagick generate the right output file format, set magick within the Image structure.
For example:
// Reading
imageInfo->file = _wfopen(input_filename, L"rb");   // ImageInfo *imageInfo;
image = ReadImage(imageInfo, exception);             // Image *image;
// Writing
image->file = _wfopen(output_filename, L"wb");
strcpy(image->magick, "PNG");                        // desired output format
WriteImage(imageInfo, image);
GraphicsMagick automatically closes the file after writing/reading.
I have the same problem using GM in C++. UTF-8 filenames are not supported under Windows (not even in the API!).
My workaround is to get the short path name (8.3); you can do that both from the command line and via Win32. However, this doesn't work 100% of the time, and if you want to save a file you have to create an empty one first in order to be able to get its short name.
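For reference, a minimal Win32 sketch of the short-name lookup (this assumes 8.3 name generation is enabled on the volume; the wmain entry point is MSVC-specific):
#include <windows.h>
#include <stdio.h>

// Print the 8.3 short form of the path given on the command line.
// The short name contains only ASCII characters, so it can be passed
// to gm even on Windows.
int wmain(int argc, wchar_t *argv[])
{
    if (argc < 2) {
        fwprintf(stderr, L"usage: shortname <path>\n");
        return 1;
    }
    wchar_t shortPath[MAX_PATH];
    DWORD len = GetShortPathNameW(argv[1], shortPath, MAX_PATH);
    if (len == 0 || len >= MAX_PATH) {
        fwprintf(stderr, L"GetShortPathNameW failed\n");
        return 1;
    }
    wprintf(L"%ls\n", shortPath);
    return 0;
}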

Windows mobile 6.5 - best way to read and write from and to a config file

I have a handheld device running WM 6.5 and am trying to put together an application that prompts the user for some information (login, password) and saves it to a file for later use.
I have tried app.config files, but unfortunately that requires System::Configuration; I can add the DLL but can't get the code to run. It needs CLR support or something like that, which I can't configure since this is a mobile app: the required option is missing from the project/solution configuration section.
I am using Visual Studio 2008 C++
What's the best way to make this happen? Precisely, 1) write a string somewhere and 2) read it back later on.
TIA
Later edit:
I have tried using a binary file, like this
// write to config file
std::string s="helloworldhelloworldhelloworld";
ofstream ofile("test.txt",ios::binary);
ofile.write((char*)s.c_str(),strlen(s.c_str()));
ofile.close();
And then I have tried reading it back like this
// read config file
char read_str[60];
ifstream inf("test.txt",ios::binary);
inf.read(read_str,60);
inf.close();
LPCTSTR application_settings = CA2W(read_str);
What happens is that some garbage gets appended to the end of the string; if the string is longer there is less garbage, otherwise more.
Is there a way to sort out this conversion issue?
Turns out the project was using Unicode, and I had to use wifstream and wofstream to read the strings properly, rather than attempt to convert them from ANSI to Unicode.
This should be a reminder for me to stay away from strongly typed languages in the future. Too bad there's no other significant choice for Windows Mobile. I spent a bunch of hours on this; I could have used that time for something else.
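For anyone landing here later, a minimal sketch of the wifstream/wofstream approach described above (the config.txt file name is just a placeholder, and error handling is omitted):
#include <fstream>
#include <string>

// Save a setting using wide streams, matching a Unicode (wchar_t) project.
void save_setting(const std::wstring& value)
{
    std::wofstream out("config.txt", std::ios::binary);
    out << value << L'\n';
}

// Read it back into a std::wstring, so no ANSI-to-Unicode conversion
// and no fixed-size char buffer are needed.
std::wstring load_setting()
{
    std::wifstream in("config.txt", std::ios::binary);
    std::wstring value;
    std::getline(in, value);
    return value;
}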

Writing PDF binary file from stream yields malformed PDF

Dear Stack Overflow users,
I would appreciate your kind help with the following problem:
We have an Apache server functioning as a forward proxy, with ext_filter configured: whenever the response is of MIME type PDF, the filter (a Perl script) is called, and the PDF's content can be read from STDIN. We read the PDF from STDIN, write it to a file, and that's all. This almost always works well, but on one specific website the PDF is malformed when written in the following way:
my $input_file = shift;
binmode STDIN;
open(OUT, '>', $input_file) or die "Cannot open $input_file: $!";
binmode OUT;
foreach my $line (<STDIN>) {
    print OUT $line;
}
close OUT;
If we instead set the filter to use 'tee', the file is written correctly. Analyzing the malformed PDF shows that the xref table is broken in the PDF we write, and Adobe Reader fails to open it. We have already tried sysopen, sysread, etc., using ":raw", and several other ways to write a binary file properly (cut-and-paste code from the documentation for writing binary files), and nothing worked. Only when using the 'tee' utility on Linux as the filter was the file written correctly. This doesn't help us: we need to write it to a file from STDIN as part of the Perl script. Any suggestions? If there were a way to call 'tee' via a system call and give it the STDIN of the Perl program, that might work. Many thanks in advance.
Well, although the code was basically correct, putting it inside "eval" somehow ruined the PDF.
I still don't understand why, but deleting the eval solved the problem.
The Perl script is called from the context of Apache's ext_filter module.
I'll investigate this further and post an update when I find an explanation.
Thanks, everyone.

What is the standard way to handle users opening incorrect file types?

I hope my Q was clear ... I am curious about the typical way to code for someone clicking File|Open, and selecting a file that is inappropriate for the program--like someone using a word processing program and trying to open a binary file.
In my case, my files have multiple streams streamed together. I'm unsure how to have the code validate whether an improper file was selected before the app throws a stream read exception. (Or is the way to handle the situation to just write code to catch a stream read exception?)
Thanks, as always.
I think it's quite usual to have code that just tries to open the file and, if that fails, shows an error to the user. Most file formats have some kind of header with a "magic number", so the reader can tell very quickly, after reading the first few bytes, that it's not the right kind of file.
A magic number at the start of the file generally helps, if you have control of the file format.
Otherwise, yeah: catch the exception and put up a dialog.
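As an illustration of the magic-number check (Java here just as an example language; the MAGIC value and class name are made up for a hypothetical custom format):
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class MagicCheck {
    // Hypothetical magic number written as the first four bytes of the format.
    private static final int MAGIC = 0x4D59464D;

    public static boolean looksLikeOurFormat(String path) {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            return in.readInt() == MAGIC;   // EOFException if the file is too short
        } catch (IOException e) {
            return false;                   // unreadable or too short: not our file
        }
    }
}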
