Special characters display differently on Windows and Linux - linux

I have two projects, one on Windows and one on Linux. Both use the same database (Oracle 10g). I have an input file containing text with special characters (ÁTUL ÁD).
The program logic is: read the input file data into the database. On Windows the data (including the special characters) is displayed correctly; on Linux the special characters are displayed as other characters. As I said, I use the same database for both. Could you give me some help?
The program is complex; it uses the Spring Batch framework. Maybe the encoding causes the problem, but I have no idea how to solve it. I am using Linux for the first time.
Thanks in advance.

One solution I found that works for me is to use UTF-8 encoding everywhere: on Windows, on Linux, and in the database.

Related

How to output an IBM-1027-codepage binary file?

The output (CSV/JSON) from my newly created program (using .NET Framework 4.6) needs to be converted to an IBM-1027-codepage binary file (to be imported into a Japanese client's IBM mainframe).
I've searched the internet and know that Microsoft doesn't have an equivalent to the IBM-1027 code page.
So how could I output an IBM-1027-codepage binary file if I have a UTF-8 CSV/JSON file in my hand?
I'm asking around for other solutions, but for now I think I'm going to have to suggest you do the conversion manually; I assume whichever language you're using allows you to do a hex conversion, at worst. For mainframes, the codepage is usually implicit in the dataset; it isn't something that is included in a file header.
So, what you can do is build a conversion table from https://www.ibm.com/support/knowledgecenter/en/SSEQ5Y_5.9.0/com.ibm.pcomm.doc/reference/html/hcp_reference26.htm. Grab a character from your JSON/CSV file, convert it to the appropriate byte value, and write that to a file. Repeat until EOF. (Note: write the actual binary data, not the ASCII representation of the hex digits.) Make sure that when the client transfers the file to their system, they perform a binary transfer.
If you wanted to get more complicated than that, you could look at enhancing/overriding part of the converter to CP500, which does exist on Microsoft Windows. One of the design points for EBCDIC was to make character conversions as simple as possible, so many of the CP500 hex representations are the same as in CP1027, with the exception of the Kanji characters.
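The manual table-driven conversion described above could be sketched like this in Python. The mapping values below are illustrative placeholders, not the real CP1027 table; fill it in from IBM's published chart:

```python
# Sketch of a table-driven conversion to an EBCDIC codepage.
# These few entries are placeholders -- the real table must be
# built from IBM's published CP1027 reference chart.
CP1027_TABLE = {
    "A": 0xC1,  # placeholder value
    "B": 0xC2,  # placeholder value
    " ": 0x40,  # placeholder value
}

def to_ebcdic(text, table):
    out = bytearray()
    for ch in text:
        if ch not in table:
            raise ValueError(f"no mapping for {ch!r}")
        out.append(table[ch])  # write the raw byte, not its hex text
    return bytes(out)
```

The result is then written with a binary file mode (`"wb"`), never a text mode, so no newline or encoding translation is applied.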
This is a separate answer, from a colleague; I don't have the ability to validate it, I'm afraid.
Transfer the file to the host in raw mode, and just tag it as CCSID 1208.
(edited)
For USS, export _BPXK_AUTOCVT=ALL.
oedit/obrowse then handle it automatically.

How can I open a .cat file?

I'm looking for a way to open a .cat file. I don't have a single clue how to do it (I've tried with Notepad and Sublime Text, without results). The only thing I know is that it's not corrupted (it's read by another program), but I need to see it with my own eyes to understand the structure of the content and create a similar file for my purposes.
Any hint is welcome.
If you can't make sense of it in a standard text editor, it's probably a binary format.
If so, you need to get yourself a program capable of doing hex dumps (such as od) and prepare for some detailed analysis.
A good start would be trying to find information about Advanced Disk Catalog somewhere on the web, assuming that's what it is.

Delphi/Windows and Linux/Lazarus sharing characters above #127

I am maintaining a project where data has to be shared between Windows and Linux machines.
The program was developed in Delphi (Windows) in 2003, so there is a lot of legacy data that must (at least probably) be readable by both systems in the future.
I have ported the program to Lazarus and it runs on Linux quite well.
But the data (in a proprietary format) stores strings as raw 8-bit characters from #0-#255. Reading the data on a Linux machine leads to a lot of '?' symbols instead of 'ñ,äöüß...' etc.
What I tried to solve the problem:
1.) I read the data on a Windows machine, as usual.
2.) I saved the data with a modified version that encodes all strings with URLEncode() on saving.
3.) I also modified the routine that reads the data to use URLDecode().
4.) I saved the data with the modified version.
5.) I compiled the modified version on Linux and copied the data over from the Windows machine.
6.) I opened the data in question ... and got question marks (?) instead of 'ñ,äöüß...' etc.
Well, the actual question is: how can both systems share the data and preserve those characters when the data is edited (on either side)?
Thanks in advance
8-bit ANSI values between 128 and 255 are charset-specific. Whatever charset is used to save the data on Windows (assuming you are relying on Windows' default encoding, which depends on the user's locale), you have to use that same charset when loading the data on Linux, and vice versa. There are dozens, if not hundreds, of charsets in use around the world, which makes portability of ANSI data difficult. This is exactly the kind of problem Unicode was designed to address. You are best off saving your data in a portable charset, such as UTF-8, and then converting to/from the system charset when loading/saving the data.
Consider using UTF-8 for all your text storage.
Or, if you are sure that your data will always have the same code page, you can use conversion from the original Windows code page to UTF-8, which is the default Linux/Lazarus encoding.
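Such a one-time conversion could be sketched like this in Python; cp1252 (Western European Windows) is an assumption here, since the actual codepage depends on the locale the original Delphi application ran under:

```python
# One-time conversion of legacy data from the original Windows
# codepage to UTF-8. cp1252 is an assumed default -- substitute
# whatever codepage the Delphi program actually used.
def ansi_to_utf8(data: bytes, source_codepage: str = "cp1252") -> bytes:
    text = data.decode(source_codepage)   # legacy bytes -> Unicode text
    return text.encode("utf-8")           # Unicode text -> UTF-8 bytes
```

After this migration, both the Windows and the Linux build read and write UTF-8 only, and the codepage dependency disappears.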
It is better not to rely on any proprietary binary layout for your application's file format if you want it to be cross-platform. You have just discovered the character-encoding problem, but you potentially have other issues, like binary endianness. SQLite3 is a very good application file format: it is fast, reliable, cross-platform, stable, and atomic.
Note that Lazarus always expects UTF-8 strings for the GUI, so even on Windows this probably wouldn't work without proper UTF-8 sanitation.

Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts

I have quite a big problem and can't find any help on the web:
I moved a page of a website from OS X to Linux (both systems running de_DE.UTF-8) and ran into a rather obscure problem:
Some of the files were no longer found, yet they obviously existed on the hard drive with (visibly) the same name. All those files contained German umlauts.
I took one sample image, copied the original request URI from the web page and called it directly: same error. After retyping the file name it worked. And no, I did not mistype it!
This surprised me and I took a look into the apache-log where I found these entries:
192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
That was something for me to investigate ... Here's what I found in the UTF8 chartable http://www.utf8-chartable.de/:
ö c3 b6 LATIN SMALL LETTER O WITH DIAERESIS
¨ cc 88 COMBINING DIAERESIS
I think you've already heard of dead keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article; it's quite interesting ;)
Does that mean that OS X saves all diacritics separately from the letter? Does it really save the character ö as o plus ¨ instead of the single combined character?
If yes, do you know a good script that I could use to rename these files? This won't be the last page I move from OS X to Linux ...
It's not quite the same thing as dead keys, but it's related. As you've worked out, U+00F6 and U+006F followed by U+0308 have the same visual result.
There are in fact Unicode rules for treating them as the same, based on decompositions. There's a decomposition table in the Unicode character database that tells us that U+00F6 canonically decomposes to U+006F followed by U+0308.
As well as canonical decompositions, there are compatibility decompositions. These lose some information; for example, ² ends up decomposed to 2. That is clearly a destructive change, but it is useful for searching when you want to be a bit fuzzy (it's how Google knows a search for fiſh should return results about fish).
If there is more than one combining character after a non-combining character, we can re-order them as long as we don't re-order those of the same combining class. This becomes clear when we consider that it doesn't matter whether we put a cedilla on a letter first and then an acute accent, or the acute first and then the cedilla; but if we put both an acute and an umlaut on a letter, it clearly matters which way around they go.
From this, we get four normalisation forms. Put strings into an appropriate normalisation form before doing comparisons, and you won't get tripped up.
NFD: Break everything apart by canonically decomposing it as much as possible. Reorder combining characters in order of their combining class, but keep any with the same class in the same order relative to each other.
NFC: First put everything into NFD. Then scan the combining characters in order and, where an equivalent precomposed code point exists (and no earlier combining character of the same class blocks the composition), replace the pair with that single character; re-do the scan to see whether anything composes further.
NFKD: Like NFD, but using compatibility decomposition (damaging change, but useful for comparisons as explained above).
NFKC: Do NFKD, then re-combine canonical equivalents only, as per NFC.
Some re-combinations are also banned from use in NFC, so that text that was valid NFC in one version of Unicode doesn't cease to be NFC when more characters are added to Unicode.
Of NFD and NFC, NFC is clearly the more concise. It's not the most concise possible, but it is one that is very concise and can be tested for and/or created in a very efficient streaming manner.
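For instance, Python's standard unicodedata module can produce any of these forms; here with the precomposed and decomposed spellings of ö from the file name above:

```python
import unicodedata

# The same visual string in precomposed (NFC) and decomposed (NFD)
# form. They compare unequal as raw code points, but equal once both
# are put into the same normalisation form.
nfc = "Sch\u00f6ne"    # 'ö' as a single code point, U+00F6
nfd = "Scho\u0308ne"   # 'o' followed by U+0308 COMBINING DIAERESIS

assert nfc != nfd                                  # raw comparison fails
assert unicodedata.normalize("NFC", nfd) == nfc    # normalise before comparing
assert unicodedata.normalize("NFD", nfc) == nfd
```

The compatibility forms behave as described: `unicodedata.normalize("NFKD", "²")` yields the plain digit `"2"`, which is destructive but useful for fuzzy matching.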
Mac OS X uses NFD for file names. Because they're weirdos. (Okay, there are better arguments than that; they just didn't convince me!)
The Web Character Model uses NFC.* As such, you should use NFC for web content as much as possible. There can, though, be security considerations in blindly converting text to NFC. But if the text originates with you, it should start out in NFC.
Any programming language that deals with text should have a nice way of normalising text into any of these forms. If yours doesn't, complain (or, if it's open source, contribute!).
See http://unicode.org/faq/normalization.html for more, or http://unicode.org/reports/tr15/ for the full gory details.
*For extra fun, if you inserted something beginning with a combining long solidus overlay (U+0338) at the start of an XML or HTML element's content, it would turn the > of the tag into ≯, turning well-formed XML into gibberish. For this reason the web character model insists that each entity must itself be NFC and not start with a combining character.
Thanks, Jon Hanna, for all the background information! It was important for getting to the full answer: a way to convert from one normalisation form to the other.
As my changes are in the filesystem (because of file uploads) that is linked from the database, I now have to update my database dump. The files already got renamed during the move (maybe by the FTP client ...).
Command-line tools to convert charsets on Linux are:
iconv - converts the content of a stream (e.g. a file)
convmv - converts the filenames in a directory
The charset utf-8-mac (described in http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), which I could use with iconv, seems to exist only on OS X systems, so I have to move my SQL dump to my Mac, convert it, and move it back. Another option would be to rename the files back to NFD using convmv, but I think that would hinder more than help in the future.
The tool convmv has a built-in (OS-independent) option to enforce NFC- or NFD-compatible filenames: http://www.j3e.de/linux/convmv/man/
PHP itself (the language my system, WordPress, is based on) supports a compatibility layer here:
"In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere?" Now that I have fixed this issue for myself, I will go and write some tests, and may also file bug reports with WordPress and the other systems I work with ;)
Linux distros treat filenames as binary strings, meaning no encoding is assumed, though the graphical shell (GNOME, KDE, etc.) might make some assumptions based on environment variables, locale, etc.
OS X, on the other hand, requires or enforces (I forget which) its own variant of UTF-8, with Unicode normalization that expands all diacritics into combining characters.
On Linux, when people do use Unicode in filenames, they tend to prefer UTF-8 with precomposed characters for the diacritics.

How to determine codepage of a file (that had some codepage transformation applied to it)

For example, if I know that some garbled sequence should actually read ć, how can I find out the codepage transformation that occurred there?
It would be nice if there were an online site for this, but any tool will do the job. The final goal is to reverse the codepage transformation (with iconv or recode, but the tools are not important; I'll take anything that works, including Python scripts).
EDIT:
Could you please be a little more verbose? Do you know for certain that some substring should be exactly that? Or do you know just the language? Or are you just guessing? And the transformation that was applied, was it correct (i.e. the result is valid in the other charset)? Or was it a single transformation from charset X to Y while the text was actually in Z, so it's now wrong? Or was it a series of such transformations?
Actually, ideally I am looking for a tool that will tell me what happened (or what possibly happened), so I can try to transform the text back to the proper encoding.
What (I presume) happened in the problem I am trying to fix now is what is described in this answer: a UTF-8 text file got opened as an ASCII text file and then exported as CSV.
It's extremely hard to do this in general. The main problem is that all the ASCII-based encodings (ISO-8859-*, DOS and Windows codepages) use the same range of code points, so no particular code point or set of code points will tell you which codepage the text is in.
There is one encoding that is easy to tell apart: if the text is valid UTF-8, then it's almost certainly not ISO-8859-* or any Windows codepage, because while all byte values are valid in those, the chance of a valid UTF-8 multi-byte sequence appearing in such text is almost zero.
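That check is trivial to implement: a strict UTF-8 decode either succeeds or raises, for example in Python:

```python
# A byte string is very unlikely to be valid UTF-8 by accident, so a
# successful strict decode is strong evidence the text really is UTF-8.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Pure ASCII also passes this check, which is correct: ASCII is a subset of UTF-8.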
Then it depends on which further encodings may be involved. A valid sequence in Shift-JIS or Big5 is also unlikely to be valid in any other encoding, while telling apart similar encodings like cp1250 and ISO-8859-2 requires spell-checking the words that contain the three or so characters that differ and seeing which interpretation yields fewer errors.
If you can limit the number of transformations that may have happened, it shouldn't be too hard to put together a Python script that tries them out, eliminates the obvious wrongs, and uses a spell-checker to pick the most likely. I don't know of any existing tool that does this.
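A rough sketch of such a script: re-interpret the damaged text under pairs of candidate encodings and keep the combinations that survive the round trip. The candidate list is just an example; a real tool would then rank the surviving results with a spell-checker or wordlist.

```python
# Candidate encodings to try -- an illustrative list, extend as needed.
CANDIDATES = ["utf-8", "cp1250", "cp1252", "iso-8859-2", "iso-8859-1"]

def undo_transformations(text):
    """Re-encode the mojibake as each 'wrong' charset and re-decode as
    each 'right' one; collect every combination that round-trips."""
    results = {}
    for wrong in CANDIDATES:        # charset the bytes were wrongly read as
        for right in CANDIDATES:    # charset the bytes actually were
            if wrong == right:
                continue
            try:
                fixed = text.encode(wrong).decode(right)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            results[(wrong, right)] = fixed
    return results
```

For instance, the classic "UTF-8 read as cp1252" damage turns ć into Ä‡, and the (cp1252, utf-8) entry of the result recovers the original character.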
Tools like that were quite popular a decade ago, but now it's quite rare to see damaged text.
As far as I know, this can be done effectively at least for a particular language. If you assume the text's language is, say, Russian, you could collect statistical information about characters or small groups of characters from a lot of sample texts. E.g., in English the combination "th" appears more often than "ht".
Then you could try the different encoding combinations and choose the one whose output has the more probable text statistics.
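A toy version of that statistical scoring in Python (the short reference string stands in for a large sample corpus of the expected language):

```python
from collections import Counter

# Count character bigrams in a reference corpus of the expected
# language; candidate decodings are then scored by how common their
# bigrams are in that reference.
def bigram_counts(text):
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def score(candidate, reference_counts):
    # higher = the candidate's bigrams are more common in the reference
    return sum(reference_counts[bg] for bg in bigram_counts(candidate))
```

Running every plausible decoding through `score` and keeping the highest-scoring one is the essence of the approach; real tools use much larger corpora and smoothing.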
