Difference between ISO-8859 and ISO-8859-1

Does ISO-8859 only support Latin characters? Do I need to use ISO-8859-1 in a Java program to read a file containing Chinese characters, and what is the difference between the two?

ISO-8859 is a standard for 8-bit character encodings. 8 bits give you 256 combinations, which is enough for most extensions of the Latin alphabet but not for Chinese characters.
ISO-8859-1 is one of the "versions" (parts) of ISO-8859 and supports most Western European languages (French, German, Spanish, ...). For Central European languages (Polish, Czech, Slovak, ...) you need ISO-8859-2, and so on.
One of the differences between ISO-8859-1 and ISO-8859-2 is the French letter è, which in ISO-8859-1 occupies the same code position as the Czech/Slovak letter č in ISO-8859-2. That is why you could not mix those two letters in a single text back then.
With Unicode it is now possible to mix them, and to include Chinese characters as well.
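As a quick illustration of that clash (a minimal sketch, assuming Python 3 is available), the very same byte value decodes to different letters depending on which ISO-8859 part you pick:
raw = bytes([0xE8])                 # one byte, value 0xE8
print(raw.decode("iso-8859-1"))     # è (French)
print(raw.decode("iso-8859-2"))     # č (Czech/Slovak)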

There are several encodings available for Chinese (covering both simplified and traditional characters). See
http://download.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html for a list.
The most common ones are GB2312 (aka EUC_CN) for simplified Chinese and Big5 for traditional Chinese. I have also seen Chinese documents encoded in UTF-8.
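For completeness, here is a hedged sketch of reading such a file, written in Python rather than Java and assuming the file really is GB2312-encoded (the Java equivalent would pass the charset name to an InputStreamReader):
# "chinese.txt" is a placeholder file name; swap "gb2312" for "big5" if the
# file is traditional Chinese, or "utf-8" if it was saved as Unicode.
with open("chinese.txt", encoding="gb2312") as f:
    text = f.read()
print(text)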

Related

Reading a text file with unicode characters - Python3

I am trying to read a text file which has Unicode markers (u) and other tags (\n, \u) in the text; here is an example:
(u'B9781437714227000962', u'Definition\u2014Human papillomavirus
(HPV)\u2013related proliferation of the vaginal mucosa that leads to
extensive, full-thickness loss of maturation of the vaginal
epithelium.\n')
How can I remove these Unicode tags using Python 3 on a Linux operating system?
To remove the Unicode escape sequences (or better, to translate them) in Python 3:
a.encode('utf-8').decode('unicode_escape')
The decode part translates the Unicode escape sequences into the corresponding Unicode characters. Unfortunately this (un-)escaping does not work on strings directly, so you need to encode the string to bytes first, before decoding it.
But as pointed out in the question comments, you have a serialized document. Try to deserialize it with the correct tools, and you will automatically get the Unicode "unescaping" as well.
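A short, hedged demo of that approach on a string like the one in the question (the text is shortened for the example; the doubled backslashes stand for the literal backslashes present in the file):
a = "Definition\\u2014Human papillomavirus (HPV)\\u2013related proliferation"
cleaned = a.encode("utf-8").decode("unicode_escape")
print(cleaned)   # Definition—Human papillomavirus (HPV)–related proliferation
# Caveat: non-ASCII characters already present in the text would not survive
# this encode/decode round trip intact, because unicode_escape decodes bytes
# with Latin-1 semantics.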

How do I get from Ã‰phÃ©mÃ¨re to Éphémère in Python3?

I've tried all kinds of combinations of encode/decode with the options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in AutoIt), but I know for a fact the information is in there, because at least one online UTF decoder got it right. On the online converter website, I specified the file as UTF-8 and the output as UTF-16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ at the place of É
Ã© at the place of é
Ã¨ at the place of è
Here is what is going on:
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character Ã (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings in the Windows-1252 code page table.
So, that's why your decoding renders Ã‰ instead of É. Idem for the other non-ASCII characters é and è.
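The whole round trip can be reproduced in a few lines of Python (a sketch, assuming Python 3):
good = "Éphémère"
mangled = good.encode("utf-8").decode("cp1252")      # decode with the wrong codec
print(mangled)                                        # Ã‰phÃ©mÃ¨re
print(mangled.encode("cp1252").decode("utf-8"))       # Éphémère (recovered)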
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')

Why is iconv on Linux not converting Spanish characters from UTF-8 to ISO-8859-1 correctly?

On Linux, I am converting a UTF-8 file to ISO-8859-1 using the following command:
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT input.txt > out.txt
After the conversion, when I open out.txt, ¿Quién Gómez has been translated to ¿Quien Gomez.
Why are é and ó and others not translated correctly?
There are (at least) two ways to represent the accented letter é in Unicode: as a single code point U+00E9, LATIN SMALL LETTER E WITH ACUTE, and as a two-character sequence e (U+0065) followed by U+0301, COMBINING ACUTE ACCENT.
Your input file uses the latter representation, which iconv apparently is unable to translate to Latin-1 (ISO-8859-1). With the //TRANSLIT suffix, it passes the unaccented e through unmodified and drops the combining character.
You'll probably need to convert the input so it doesn't use combining characters, replacing the sequence U+0065 U+0301 by a single code point U+00E9 (represented in 2 bytes). Either that, or arrange for whatever generates your input file to use that encoding in the first place.
So that's the problem; I don't currently know exactly how to correct it.
Keith, you are right. I found the answer from Sergiusz Wolicki on the Oracle Community. Here I am quoting his answer word for word, posting it for anybody else who may have this problem. "The problem is that your data is stored in the Unicode decomposed form, which is legal but seldom used for Western European languages. 'é' is stored as 'e' (0x65 = U+0065) plus combining acute accent (0xcc, 0x81 = U+0301). Most simple conversion tools, including standard Oracle client/server conversion, do not account for this and do not convert decomposed characters into a pre-composed character from ISO 8859-1. They try to convert each of the two codes independently, yielding 'e' plus some replacement for the accent character, which does not exist in ISO 8859-1. You see the result correctly in SQL Developer because there is no conversion involved and SQL Developer's rendering code is capable of combining the two codes into one character, as expected.
As 'é' and 'ó' have pre-composed forms available in both Unicode and ISO 8859-1, the workaround is to add the COMPOSE function to your query. Thus, set NLS_LANG as I advised previously and add COMPOSE around the column expressions in your query." Thank you very much, Keith
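For reference, outside the database the same pre-composition can be sketched with Python's unicodedata module (a sketch under assumptions; the string below stands in for the decomposed data in the file):
import unicodedata

decomposed = "\u00bfQuie\u0301n Go\u0301mez"           # decomposed (NFD) form, as in the input file
composed = unicodedata.normalize("NFC", decomposed)    # e + U+0301 -> é, o + U+0301 -> ó
latin1 = composed.encode("iso-8859-1")                 # now encodes cleanly to Latin-1
print(latin1.decode("iso-8859-1"))                     # ¿Quién Gómez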

How to substitute cp1250-specific characters with UTF-8 in Vim

I have some Central European characters in cp1250 encoding in Vim. When I change the encoding with set encoding=utf-8 they appear like <d0> and such. How can I substitute those characters, over the entire file, with what they should be, i.e. Đ in this case?
As sidyll said, you should really use iconv for the purpose. iconv knows stuff. It knows all the hairy encodings, obscure code points, katakana, denormalized and canonical forms, compositions, nonspacing characters and the rest.
:%!iconv --from-code cp1250 --to-code utf-8
or shorter
:%!iconv -f cp1250 -t utf-8
to filter the whole buffer. If you do
:he xxd
you'll get a sample of how to automatically encode on buffer load/save, if you want that.
iconv -l will list all the encodings it accepts/knows about (many: 1168 on my system).
Happy hacking!
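If iconv happens not to be installed, the same cp1250-to-UTF-8 conversion can be sketched outside Vim in Python (the file names below are placeholders):
with open("input_cp1250.txt", encoding="cp1250") as src, \
     open("output_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())   # re-encode the whole file as UTF-8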
The iconv() function may be useful:
iconv({expr}, {from}, {to}) *iconv()*
The result is a String, which is the text {expr} converted
from encoding {from} to encoding {to}.
When the conversion fails an empty string is returned.
The encoding names are whatever the iconv() library function
can accept, see ":!man 3 iconv".
Most conversions require Vim to be compiled with the |+iconv|
feature. Otherwise only UTF-8 to latin1 conversion and back
can be done.
This can be used to display messages with special characters,
no matter what 'encoding' is set to. Write the message in
UTF-8 and use:
echo iconv(utf8_str, "utf-8", &enc)
Note that Vim uses UTF-8 for all Unicode encodings, conversion
from/to UCS-2 is automatically changed to use UTF-8. You
cannot use UCS-2 in a string anyway, because of the NUL bytes.
{only available when compiled with the +multi_byte feature}
You can set encoding to the value of your file's encoding and termencoding to UTF-8. See the Vim mbyte documentation.

Bare-minimum text sanitation

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?
I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:
The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
The range 0x7F-0x9F (more control characters)
Ranges of characters that can safely be accepted would be even better to know.
There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.
See the W3C note Unicode in XML and other Markup Languages. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:
U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;
U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;
language override control codes that could also have scope outside of an element;
BOM.
Additionally, you'd want to filter/replace the characters that are not valid in Unicode at all (the noncharacters, U+FFFF et al.), and, if you are using a language that works in UTF-16 natively (e.g. Java, or Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
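Putting the above together, here is a minimal Python sketch of such a filter (the exact ranges are an assumption drawn from the lists in this answer and the question; tune them to your needs):
import re

# C0 controls except tab/LF/CR, DEL plus the C1 controls, line/paragraph
# separators, bidi controls, BOM, a noncharacter, and lone surrogates.
_BAD = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1f"
    "\x7f-\x9f"
    "\u2028\u2029\u202a-\u202e"
    "\ufeff\uffff"
    "\ud800-\udfff]"
)

def sanitize(text: str) -> str:
    return _BAD.sub("", text)

print(sanitize("hello\x00\u202eworld"))   # helloworld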
The range 0x00-0x19 (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)
And arguably (especially for a web application), lose CR as well, and turn tabs into spaces.
The range 0x7F-0x9F (more control characters)
Yep, away with those, except in cases where people might really mean them. (SO used to allow them, which let people post strings that had been mis-decoded; that was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
I suppose it depends on your purpose. You could limit the user to the keyboard characters if that is your whim, i.e. 9, 10, 13, and [32-126]. In UTF-8, a byte of 0x80 or above signals part of a multi-byte Unicode character, whereas in the legacy 8-bit "extended ASCII" code pages the upper range holds display/format characters that vary with the language of the locale.
Note that which characters users type depends on their location: users entering text in their native language will produce characters outside the 0x00-0x7F range whenever that language does not use an unaccented Latin script (Arabic, Chinese, Japanese, Greek, Cyrillic, etc.).
A UTF-8 character table will show you which characters will display.
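A sketch of that whitelist approach (keeping only tab, LF, CR, and printable ASCII; as noted above, this strips legitimate non-Latin input, so treat it as a last resort):
import re

def ascii_only(text: str) -> str:
    # Keep tab, LF, CR and the printable ASCII range 0x20-0x7E; drop everything else.
    return re.sub(r"[^\t\n\r\x20-\x7e]", "", text)

print(ascii_only("Héllo\x07 wörld"))   # Hllo wrld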
