I am maintaining a project where data has to be shared between Windows and Linux machines.
The program was developed in Delphi (Windows) in 2003, so there are a lot of legacy data files that will most likely have to be read by both systems in the future.
I have ported the program to Lazarus and it runs quite well on Linux.
But the data (in a proprietary format) stores strings as single-byte characters in the range #0-#255. Reading the data on a Linux machine produces a lot of '?' symbols instead of 'ñ,äöüß...' etc.
What I tried to solve the problem:
1.) I read the data on a windows machine - as usual.
2.) I saved the data with a modified version that encodes all strings with URLEncode() on saving.
3.) I also modified the routine that reads the data so that it uses URLDecode.
4.) I saved the data with the modified version.
5.) I compiled the modified version on Linux and copied the data over from the Windows machine.
6.) I opened the data in question ... and got question marks (?) instead of 'ñ,äöüß...' etc.
Well, the actual question is: how can the data be shared and maintained by both systems while preserving those characters when it is edited (on both sides)?
Thanks in advance
8-bit ANSI values between 128 and 255 are charset-specific. Whatever charset is used to save the data on Windows (assuming you are relying on the Windows default encoding, which depends on the user's locale), you have to use that same charset when loading the data on Linux, and vice versa. There are dozens, if not hundreds, of charsets used in the world, which makes portability of ANSI data difficult. This is exactly the kind of problem that Unicode was designed to address. You are best off saving your data in a portable encoding, such as UTF-8, and then converting to/from the system charset when loading/saving the data.
Consider using UTF-8 for all your text storage.
Or, if you are sure that your data will always have the same code page, you can use conversion from the original Windows code page to UTF-8, which is the default Linux/Lazarus encoding.
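The conversion described above can be sketched as follows. This is only an illustration in Java, not Lazarus code (Lazarus ships comparable helpers in its LConvEncoding unit), and it assumes the legacy files were written under windows-1252; the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LegacyTextConverter {
    // Assumption: the legacy files were written under the Western European
    // Windows code page. Substitute the code page that was actually used.
    static final Charset LEGACY = Charset.forName("windows-1252");

    /** Decodes bytes read from a legacy file into a Unicode string. */
    static String decodeLegacy(byte[] raw) {
        return new String(raw, LEGACY);
    }

    /** Encodes a string as UTF-8 for portable storage. */
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The bytes for "ñ", "ä" and "ß" in windows-1252.
        byte[] legacy = { (byte) 0xF1, (byte) 0xE4, (byte) 0xDF };
        String text = decodeLegacy(legacy);
        System.out.println(text + " -> " + encodeUtf8(text).length + " UTF-8 bytes");
    }
}
```

The important part is that the source charset is named explicitly instead of being taken from the platform default, which is what differs between the Windows and Linux machines.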
You had better not rely on any proprietary binary layout for your application file format if you want it to be cross-platform. You have just discovered the character encoding problem, but you potentially have other issues, like binary endianness. SQLite3 is a very good application file format. It is fast, reliable, cross-platform, stable and atomic.
Note that Lazarus always expects UTF-8 strings for the GUI. So even on Windows this probably wouldn't work without proper conversion to UTF-8.
Related
The output (CSV/JSON) of my newly created program (using .NET Framework 4.6) needs to be converted to a binary file in the IBM-1027 code page (to be imported into a Japanese client's IBM mainframe).
I've searched the internet and learned that Microsoft doesn't have an equivalent to the IBM-1027 code page.
So how can I output an IBM-1027 binary file if I have a UTF-8 CSV/JSON file in my hands?
I'm asking around for other solutions, but for now, I think I'm going to have to suggest you do the conversion manually; I assume whichever language you're using allows you to do a hex conversion, at worst. For mainframes, the codepage is usually implicit in the dataset, it isn't something that is included in the file header.
So, what you can do is build a conversion table from https://www.ibm.com/support/knowledgecenter/en/SSEQ5Y_5.9.0/com.ibm.pcomm.doc/reference/html/hcp_reference26.htm. Grab a character from your JSON/CSV file, look up the corresponding hex value, and write that value to a file. Repeat until EOF. (Note: write the actual byte values, not the ASCII representation of the hex digits.) Make sure that when the client transfers the file to their system, they perform a binary transfer.
If you wanted to get more complicated than that, you could look at enhancing/overriding part of the converter to CP500, which does exist on Microsoft Windows. One of the design points of EBCDIC was to make character conversions as simple as possible, so many of the CP500 characters' hex representations are the same as in CP1027, with the exception of the Kanji characters.
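A table-driven converter along those lines might look like the sketch below. It is written in Java purely for illustration (the question targets .NET, where the same structure applies), and the class name is hypothetical. Only the EBCDIC-invariant characters (space, comma, uppercase letters, digits) are filled in here; every other entry is deliberately left out and would have to be taken from IBM's code page table.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

class EbcdicTableEncoder {
    private static final Map<Character, Integer> TABLE = new HashMap<>();

    static {
        // Only EBCDIC-invariant positions are listed. All remaining
        // characters must be added from the IBM-1027 code page table.
        TABLE.put(' ', 0x40);
        TABLE.put(',', 0x6B);
        fill('A', 'I', 0xC1);
        fill('J', 'R', 0xD1);
        fill('S', 'Z', 0xE2);
        fill('0', '9', 0xF0);
    }

    private static void fill(char first, char last, int code) {
        for (char c = first; c <= last; c++) {
            TABLE.put(c, code++);
        }
    }

    /** Converts text to raw code page bytes; fails loudly on unmapped characters. */
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Integer code = TABLE.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No table entry for U+" + Integer.toHexString(c));
            }
            out.write(code); // writes the byte value itself, not its hex digits
        }
        return out.toByteArray();
    }
}
```

Throwing on an unmapped character, rather than silently substituting one, makes gaps in the table visible before the file reaches the mainframe.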
This is a separate answer, from a colleague; I don't have the ability to validate it, I'm afraid.
Transfer the file to the host in raw (binary) mode and just tag it as CCSID 1208.
For USS, export _BPXK_AUTOCVT=ALL.
oedit/obrowse handle it automatically.
I am trying to use TTFs for image rendering. I didn't have any on the Linux box the application sits on; I was at a loss and took a shot in the dark by SCPing the TTFs from my local machine to the server and pointing the application to them. I figured this wouldn't work, since my machine is Windows and the box is Linux. Alas, it didn't work. My question is: are TTFs specific to an OS or OS architecture?
No. They are plain data files, and data files are not OS specific (although their use may be).
The one single exception I can think of is that in the Bad Old Days, Apple's native file storage format on the Macintosh used two different disk objects: one for 'code' and one for 'data'. Without special software, only the 'data' parts could be seen on other computers, leading to a swift exorcism of this storage format when Apple realized the rest of the world had problems reading their files. Still, it's far from unusual to read messages from confused people who find that extracting an old Mac zip file results in lots of zero-byte files.
As for your problem: since the problem does not lie in the font file format (there is no reason TTF "cannot work" on your system), it should be either the software you are using (does it actually support TTF fonts?) or - and I consider this more likely - you made an error transferring the files and ended up with damaged fonts.
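One way to rule out a damaged transfer is to compare a checksum of the file on both machines and to check that it still starts with a known font signature. The Java sketch below assumes the file contents have already been read into a byte array (for example with Files.readAllBytes); the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FontFileCheck {
    /** Returns true if the data starts with a TrueType/OpenType signature. */
    static boolean looksLikeFont(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        int tag = ByteBuffer.wrap(data, 0, 4).getInt(); // big-endian by default
        return tag == 0x00010000   // TrueType outlines
            || tag == 0x74727565   // 'true' (Apple TrueType)
            || tag == 0x4F54544F   // 'OTTO' (OpenType with CFF outlines)
            || tag == 0x74746366;  // 'ttcf' (font collection)
    }

    /** Returns the SHA-256 digest of the data as lowercase hex. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the digests computed on the Windows machine and on the Linux box differ, the copy is damaged; if they match and the signature check passes, the rendering software is the more likely culprit.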
I have two projects, one on Windows and another one on Linux. I use the same database for both (Oracle 10g). I have an input file that consists of text including special characters (ÁTUL ÁD).
The program logic is as follows: read the input file data into the database. On Windows the data (including the special characters) is displayed correctly; on Linux other characters are displayed in place of the special characters. As I already said, I use the same database for both of them. Could you give me some help?
The program is complex; it uses the Spring Batch framework. Maybe the encoding causes the problem, but I have no idea how to solve it. I am using Linux for the first time.
Thanks in advance.
I found one solution that works for me: use UTF-8 encoding everywhere, on Windows, on Linux and in the database.
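On the Java side, "using UTF-8 everywhere" mostly means naming the charset explicitly wherever bytes become text, so that the result does not depend on the platform default. The following is a minimal sketch, assuming the input file really is saved as UTF-8; the class name is hypothetical and this is not Spring Batch configuration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8InputReader {
    /** Reads the whole stream as UTF-8 text, regardless of the platform default charset. */
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

A reader created without a charset argument may fall back to the platform default on older JVMs, which is one plausible reason the same file decodes differently on the two machines.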
I want to write a desktop program that prints Microsoft Office files (doc, docx, xls and xlsx) on a Linux machine. But I don't know how to print them without corrupting the output.
Is there a way to print the file, or convert it to another format, so that it looks 100% the same as it does in Microsoft Office?
The LibreOffice API might be a good place to start, particularly the examples:
http://api.libreoffice.org/
I haven't used the API myself but have used OpenOffice/LibreOffice as an alternative to Word for quite a while.
However, you say '100%' the same as in Office? I wouldn't be confident of that. Depending on the document it's likely to be fine, but there are some things which don't seem to convert well. If you're working on Linux, you're not likely to have the same fonts installed as whichever Windows/Mac machine made the document.
If the documents you're processing all share the same or a similar layout/template, and you're able to test a few first, it should be fine. But if you're processing arbitrary Word documents, some may not convert completely without a bit of human input. It depends on how much difference you can tolerate. If you want completely consistent printing across platforms, I guess that's what PDFs are for.
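If converting to PDF first is acceptable, one option besides the API is to call LibreOffice's command-line converter from the program. The sketch below builds and runs such a command in Java; it assumes that the `soffice` executable is installed and on the PATH, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class OfficeToPdf {
    /** Builds the LibreOffice command line for a headless PDF conversion. */
    static List<String> command(String inputFile, String outputDir) {
        return Arrays.asList(
            "soffice", "--headless",
            "--convert-to", "pdf",
            "--outdir", outputDir,
            inputFile);
    }

    /** Runs the conversion and returns the process exit code. */
    static int convert(String inputFile, String outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(inputFile, outputDir))
            .inheritIO()
            .start();
        return process.waitFor();
    }
}
```

The resulting PDF can then be checked by eye and sent to the printer; the font caveat above still applies, since the conversion uses whatever fonts are installed on the Linux machine.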
In our business, we are required to log every request/response that comes to our server.
At the moment, we are using XML as the standard implementation.
The log files are used when we need to debug or trace an error.
I am curious: if we switch to protocol buffers, which are binary, what is the best way to log requests/responses to a file?
For example:
// Writes the serialized (binary) message to the log file.
FileOutputStream output = new FileOutputStream("\\files\\log.txt");
request.build().writeTo(output);
output.close();
For anyone who has used protocol buffers in an application: how do you log your requests/responses in case they are needed for debugging?
TL;DR: write debugging logs in text, write long-term logs in binary.
There are at least two ways you can do this logging (and maybe, in fact, you should do both):
Writing your logs in text format. This is good for debugging and quickly checking for problems with your eyes.
Writing your logs in binary format - this will make future analysis much quicker since you can load the data using same protocol buffers code and do all kinds of things on them.
Quite honestly, this is more or less the way this is done at the place this technology came from.
We use the ShortDebugString() method on the C++ object to write down a human-readable version of all incoming and outgoing messages to a text-file. ShortDebugString() returns a one-line version of the same string returned by the toString() method in Java. Not sure how easy it is to accomplish the same thing in Java.
If you have competing needs for logging and performance then I suppose you could dump your binary data to the file as-is, with perhaps each record preceded by a tag containing a timestamp and a length value so you'll know where this particular bit of data ends. But I hasten to admit this is very ugly. You will need to write a utility to read and analyze this file, and will be helpless without that utility.
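The tagged binary layout described above could be sketched in Java like this. The record format (8-byte timestamp, 4-byte length, then the payload) and all names are assumptions made for the example; the payload would be the serialized message bytes. The protobuf Java library also offers writeDelimitedTo/parseDelimitedFrom for length-prefixed messages, though without a timestamp.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FramedLog {
    /** Appends one record: timestamp, payload length, payload bytes. */
    static void writeRecord(OutputStream out, long timestampMillis, byte[] payload) {
        try {
            DataOutputStream data = new DataOutputStream(out);
            data.writeLong(timestampMillis);
            data.writeInt(payload.length);
            data.write(payload);
            data.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads all payloads back; the timestamps are skipped in this sketch. */
    static List<byte[]> readPayloads(InputStream in) {
        List<byte[]> payloads = new ArrayList<>();
        try {
            DataInputStream data = new DataInputStream(in);
            while (true) {
                try {
                    data.readLong(); // timestamp, ignored here
                } catch (EOFException endOfLog) {
                    break; // clean end of file between records
                }
                byte[] payload = new byte[data.readInt()];
                data.readFully(payload);
                payloads.add(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return payloads;
    }
}
```

The reader is the "utility" mentioned above: without something like it, the file cannot be inspected at all.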
A more reasonable solution would be to dump your binary data in text form. I'm thinking of "lines" of text, again starting with whatever tagging information you find relevant, followed by some length information in decimal or hex, followed by as many hex bytes as needed to dump your buffer - thus you could end up with some fairly long lines. But since the file is line structured, you can use text-oriented tools (an editor in the simplest case) to work with it. Hex dumping essentially means you are using two bytes in the log to represent one byte of data (plus a bit of overhead). Heh, disk space is cheap these days.
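A line-oriented hex dump of that kind might look like the following Java sketch. The line layout (tag, decimal length, hex bytes, separated by single spaces) and the names are assumptions chosen for the example, and the tag is assumed not to need any escaping.

```java
class HexLogLine {
    /** Formats one log line: "<tag> <length> <hex bytes>". */
    static String format(String tag, byte[] payload) {
        StringBuilder line = new StringBuilder(tag)
            .append(' ').append(payload.length).append(' ');
        for (byte b : payload) {
            line.append(String.format("%02x", b & 0xff));
        }
        return line.toString();
    }

    /** Recovers the payload from a line produced by format(). */
    static byte[] parsePayload(String line) {
        String hex = line.substring(line.lastIndexOf(' ') + 1);
        byte[] payload = new byte[hex.length() / 2];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return payload;
    }
}
```

Because each record is a single line, tools such as grep or a plain editor can still be used to find and extract records, and the payload can be turned back into bytes for parsing when needed.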
If those binary buffers have a fairly consistent structure, you could even break out and label fields (or something like that) so your data becomes a little more human readable and, more importantly, better searchable. Of course it's up to you how much effort you want to sink into making your log records look pretty; but the time spent here may well pay off a little later in analysis.
If your messages contain non-ASCII character strings, simply logging them through an implicit or explicit call to toString will escape those characters.
"오늘은 무슨 요일입니까?" becomes "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
If you want to retain the non-ASCII characters, use TextFormat.printer().escapingNonAscii(false).printToString(message).
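To see where those backslash sequences come from, the sketch below reproduces the effect by hand: each byte of the UTF-8 encoding that is outside the ASCII range is written as a three-digit octal escape. This is only an illustration with made-up names, not the library's implementation, and it ignores the other escapes (quotes, backslashes, control characters) that the real text format applies.

```java
import java.nio.charset.StandardCharsets;

class OctalEscapeDemo {
    /** Writes every non-ASCII UTF-8 byte as a backslash plus three octal digits. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xff;
            if (value < 0x80) {
                out.append((char) value); // plain ASCII stays readable
            } else {
                out.append('\\').append(String.format("%03o", value));
            }
        }
        return out.toString();
    }
}
```

For the first character of the example above, "오" (U+C624), the UTF-8 bytes are EC 98 A4, which in octal are 354 230 244, matching the start of the escaped string.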