> console.log("abc\xb2") //abc²
abc²
> console.log("abc\x80") //abc€
abc
I tested it in Chrome 34 and IE 11. Any ideas?
The character '€' is \x80 in Windows-1252, but \xE2\x82\xAC in UTF-8. I made a mistake by saving the file in Windows-1252.
This string was sent to the browser over a WebSocket, and the browser assumed it was UTF-8. So I didn't get '€': a lone 0x80 byte is not a valid UTF-8 sequence, and "\x80" in a JavaScript string is U+0080, an unprintable control character.
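The difference between the two encodings can be checked directly. The following is a minimal Python sketch rather than the original JavaScript; the variable names are illustrative only.

```python
import unicodedata

euro = "\u20ac"  # the euro sign

# The same character maps to different bytes in each encoding.
cp1252_bytes = euro.encode("cp1252")
utf8_bytes = euro.encode("utf-8")
print(cp1252_bytes)  # one byte in Windows-1252
print(utf8_bytes)    # three bytes in UTF-8

# "\x80" inside a string literal is the code point U+0080, a control character.
control_category = unicodedata.category("\x80")
print(control_category)

# A lone 0x80 byte is not valid UTF-8, so strict decoding fails.
try:
    b"abc\x80".decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)
```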
I am outputting a file from SSIS in UTF-8 Encoding.
This file is passed to a third party for import into their system.
They are having a problem importing this file. Although they requested UTF-8 encoding, it seems they convert the file to ISO-8859-1. They use this command to convert the file's encoding:
iconv -f UTF-8 -t ISO-8859-1 dweyr.inp
They are receiving this error:
illegal input sequence at position 11
The piece of text causing the issue is:
ark O’Dwy
I think it's the apostrophe, or whatever version of an apostrophe is used in this text.
The problem I face is that every text editor I try tells me the file is UTF-8 and renders it correctly.
The vendor is saying that this char is not UTF-8.
How can I confirm who is correct?
The error message by iconv is a bit misleading, but kind-of correct.
It doesn't tell you that the input isn't valid UTF-8, but that it cannot be converted to ISO-8859-1 in a lossless way. ISO-8859-1 does not have a way to encode the ’ character.
Verify that by executing this command:
echo "ark O’Dwy" | iconv -f UTF-8 -t UTF-7
This produces output that looks like "ark O+IBk-Dwy".
Here I'm outputting to UTF-7 (a very rarely used encoding that is useful for demonstration here, but little else).
In other words: the encoding is only "illegal" in the sense that it cannot be converted to ISO-8859-1, but it's a perfectly valid UTF-8 sequence.
If the third party claims to support UTF-8, then they may do so only very superficially. They might support any text that can be encoded in ISO-8859-1 as long as it's encoded in UTF-8 (which is an extremely low level of "UTF-8 support").
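The same check can be reproduced without iconv. This is a small Python sketch, assuming the apostrophe in the text is U+2019 (RIGHT SINGLE QUOTATION MARK), which is how it appears above; the variable names are made up for illustration.

```python
text = "ark O\u2019Dwy"  # assumption: the apostrophe is U+2019

# The text survives a UTF-8 round trip, so it is valid UTF-8.
utf8_bytes = text.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == text
print(round_trip_ok)

# ISO-8859-1 has no code for U+2019, so a lossless conversion is impossible.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# The UTF-7 form quoted above decodes back to the same text.
utf7_decoded = b"ark O+IBk-Dwy".decode("utf-7")
print(utf7_decoded == text)
```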
I've tried all kinds of combinations of encode/decode with the options 'surrogatepass' and 'surrogateescape', to no avail. I'm not sure what format this is in (it might even be a bug in AutoIt), but I know for a fact the information is in there, because at least one online UTF decoder got it right. On the online converter website, I specified the file as UTF-8 and the output as UTF-16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ at the place of É
Ã© at the place of é
Ã¨ at the place of è
The following table summarises what's going on:
Character | UTF-8 bytes (hex) | Read as Windows-1252
É | C3 89 | Ã‰
é | C3 A9 | Ã©
è | C3 A8 | Ã¨
All of É, é, and è are non-ASCII characters, and UTF-8 encodes each of them as a 2-byte sequence.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character Ã (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings in a Windows-1252 code page table.
So, that's why your decoding renders Ã‰ instead of É. The same goes for the other non-ASCII characters é and è.
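The effect described above is easy to reproduce. Below is a minimal sketch; garble and repair are hypothetical helper names, not part of any library. The output is printed with ASCII escapes so it does not depend on the console encoding.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the damage: re-encode as Windows-1252, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


results = {}
for original in ("\u00c9", "\u00e9", "\u00e8"):  # É, é, è
    garbled = garble(original)
    results[original] = garbled
    # Each 2-byte UTF-8 sequence turns into two Windows-1252 characters.
    print(f"{original!a} -> {garbled!a} -> {repair(garbled)!a}")
```

The repair only works while the garbled text has not been altered further, for example by a second wrong conversion.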
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
with open(filePath, 'r', encoding='utf-8') as f:
    text = f.read()
My URL (which I'm passing to twitter/share in a query string) contains %C2%BC%C3%BE, the encoding for 件, but the browser decodes it as two characters ¼þ. How can I let the browser know that it should decode it as a single character?
Your encoding is wrong. The browser is decoding it correctly as those two characters. The correct UTF-8 bytes for U+4EF6 are E4, BB, and B6.
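Both claims can be checked with Python's urllib.parse. The last step is only a guess at how the wrong value came about (GBK bytes misread as Latin-1 and then percent-encoded as UTF-8); the question itself does not say how the URL was built.

```python
from urllib.parse import quote, unquote

character = "\u4ef6"  # U+4EF6

# Percent-encoding the character as UTF-8 gives the three bytes E4 BB B6.
correct = quote(character)
print(correct)

# The value from the question really does decode to two characters.
decoded_wrong = unquote("%C2%BC%C3%BE")
print(f"{decoded_wrong!a}")

# Assumption: the character was first encoded as GBK (bytes BC FE),
# those bytes were read as Latin-1, and the result was percent-encoded.
suspected = quote(character.encode("gbk").decode("latin-1"))
print(suspected)
```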
This happens often if I open a plain text file in Vim. I see normal character text, but then � characters here and there, usually where there should just be a space. If I type :set encoding I see encoding=utf-8, and this is correct since I see smart quotes in the text where they should be. What are these � characters and how can I fix how they are displayed?
� is the Unicode replacement character (U+FFFD). Whenever you decode with any UTF encoding (UTF-8, UTF-16, UTF-32), byte sequences that are illegal in that encoding are typically shown as �. Other options are discarding the byte sequences or halting the decoding process completely at the first sign of trouble.
For example, the bytes for hellö in ISO-8859-1:
68 65 6c 6c f6
When decoded as UTF-8, this becomes hell�. The byte 0xf6 is never valid in UTF-8, but the other bytes are completely valid and "by accident" even decode to the same characters.
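The three behaviours mentioned above (replace, discard, halt) correspond to the errors argument of Python's bytes.decode. A short sketch using the same five bytes:

```python
raw = bytes.fromhex("68 65 6c 6c f6")  # "hellö" encoded as ISO-8859-1

# Replace the illegal byte with U+FFFD, the replacement character.
replaced = raw.decode("utf-8", errors="replace")

# Silently discard the illegal byte.
ignored = raw.decode("utf-8", errors="ignore")

# Halt at the first illegal byte (the default, errors="strict").
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with the encoding the bytes were written in recovers the text.
as_latin1 = raw.decode("iso-8859-1")

print(f"{replaced!a} {ignored!a} {strict_ok} {as_latin1!a}")
```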
I need to replace the characters å, ä, ö to browser friendly chars. For example ä should become %E4.
I tried weirdString = Uri.EscapeUriString(weirdString);
But it doesn't convert the åäö to the right sign. Help please?
Edit: Tried this:
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] asciicharacters = Encoding.UTF8.GetBytes("vägen");
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, asciicharacters);
string finalString = ascii.GetString(asciiArray);
string fixedAddrString = HttpUtility.HtmlEncode(finalString);
If you need the characters to display on the page and not part of a URL, you should use Server.HtmlEncode.
var encodedString = Server.HtmlEncode(myString);
HTML encoding makes sure that text is displayed correctly in the browser and not interpreted by the browser as HTML. For example, if a text string contains a less than sign (<) or greater than sign (>), the browser would interpret these characters as the opening or closing bracket of an HTML tag. When the characters are HTML encoded, they are converted to the strings &lt; and &gt;, which causes the browser to display the less than sign and greater than sign correctly.
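To illustrate what HTML encoding does to a string, here is a small sketch in Python. It uses the standard library's html.escape as a stand-in for Server.HtmlEncode; the two are not guaranteed to produce identical output for every input.

```python
import html

source = "if a < b and b > c"

# Escape the characters that a browser would otherwise treat as markup.
escaped = html.escape(source)
print(escaped)

# Unescaping restores the original text.
restored = html.unescape(escaped)
print(restored == source)
```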
Update:
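Since you are using UTF-8, these characters are escaped as UTF-8 bytes (ä becomes %C3%A4), not as single-byte values.
To get %E4, you need to encode the string as ISO-8859-1 (Latin-1) rather than UTF-8 before escaping, using the Encoding classes. Plain ASCII will not work here, because ASCII has no code for å, ä, or ö; %E4 is the ISO-8859-1 byte value for ä.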
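The difference between the encodings shows up clearly when percent-escaping the example word. This is a Python sketch of the idea, not the C# API from the question; it assumes the desired %E4 form is the ISO-8859-1 byte value of ä.

```python
from urllib.parse import quote

word = "v\u00e4gen"  # "vägen"

# Default behaviour: escape the UTF-8 bytes of each non-ASCII character.
utf8_escaped = quote(word)
print(utf8_escaped)

# Escape the ISO-8859-1 bytes instead, which yields the single-byte form.
latin1_escaped = quote(word, encoding="iso-8859-1")
print(latin1_escaped)

# ASCII cannot represent the character at all; it is replaced with "?".
ascii_bytes = word.encode("ascii", errors="replace")
print(ascii_bytes)
```

Whichever encoding is used for escaping, the receiving side has to assume the same one when it decodes the URL.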
Try HttpUtility.HtmlDecode.
If this doesn't work either, you can do a simple string.Replace for these chars.