Forcing URL to convert 4-part unicode string to Chinese character - browser

My URL (which I'm passing to twitter/share in a query string) contains %C2%BC%C3%BE, the encoding for 件, but the browser decodes it as two characters ¼þ. How can I let the browser know that it should decode it as a single character?

Your encoding is wrong. The browser is decoding it correctly as those two characters. The correct UTF-8 bytes for U+4EF6 are E4, BB, and B6.
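A minimal Python sketch of both points, using only the standard library. The guess that the bytes originally came from a GBK/GB2312 encoder is an assumption, not something the question states; it is suggested only because BC FE happens to be the GBK encoding of 件.

```python
from urllib.parse import quote, unquote

# Percent-encoding the character directly uses its UTF-8 bytes E4 BB B6.
correct = quote("件")

# What the browser sees in the question's URL: two separate characters.
decoded = unquote("%C2%BC%C3%BE")

# Assumption: the text was GBK-encoded (bytes BC FE), misread as Latin-1,
# and only then percent-encoded as UTF-8. Undoing the misreading recovers it.
recovered = decoded.encode("latin-1").decode("gbk")

print(correct, decoded, recovered)
```

If that assumption holds, the fix belongs wherever the URL is built: percent-encode the original string as UTF-8 rather than re-encoding already misread bytes.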

Related

Convert Unicode Escape to Hebrew text

I have the following text in a JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
I get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which, as bytes, is almost the correct text. If I were able to remove one backslash from each pair and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how I changed each double backslash to a single backslash) then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so and couldn't manage to write a piece of code that does it for me (rather than manually, as I just showed).
Any help is much appreciated.
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
.encode('latin-1')
.decode('raw_unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer will already have interpreted the escape sequences for you. That is, after the text element is loaded by the JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; the latin-1 step might not work properly any more. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII, as it would only make everything worse.
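A minimal sketch of the deserializer route, assuming the file content is exactly the quoted string from the question. The helper name repair_utf8_as_latin1 is hypothetical, and its fallback (returning the text unchanged when the round trip fails) is one possible way to handle strings that are already valid, not a general solution for mixed content.

```python
import json

def repair_utf8_as_latin1(text):
    """Undo 'UTF-8 bytes stored as Latin-1 code points'.

    Returns the text unchanged when it does not fit that pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Covers both failures: characters outside Latin-1 (already real text)
        # and byte sequences that are not valid UTF-8.
        return text

# The JSON document from the question; a raw string keeps the backslashes literal.
raw = r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"'

loaded = json.loads(raw)               # the deserializer interprets the \u escapes
fixed = repair_utf8_as_latin1(loaded)  # latin-1 -> utf-8 round trip
print(fixed)
```

Applying the helper per string value, after parsing, avoids touching the JSON file itself.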

How do I get from Ã‰phÃ©mÃ¨re to Éphémère in Python3?

I've tried all kinds of combinations of encode/decode with options 'surrogatepass' and 'surrogateescape' to no avail. I'm not sure what format this is in (it might even be a bug in Autoit), but I know for a fact the information is in there because at least one online utf decoder got it right. On the online converter website, I specified the file as utf8 and the output as utf16, and the output was as expected.
This issue is called mojibake, and your specific case occurs if you have a text stream that was encoded with UTF-8, and you decode it with Windows-1252 (which is a superset of ISO 8859-1).
So, as you have already found out, you have to decode this file with UTF-8, rather than with the default encoding of Python (which appears to be Windows-1252 in your case).
Let's see why these specific garbled characters appear in your example, namely:
Ã‰ at the place of É
Ã© at the place of é
Ã¨ at the place of è
The following table summarises what's going on:
All of É, é, and è are non-ASCII characters, and they are encoded with UTF-8 to 2-byte long codes.
For example, the UTF-8 code for É is:
11000011 10001001
On the other hand, Windows-1252 is an 8-bit encoding, that is, it encodes every character of its character set to 8 bits, i.e. one byte.
So, if you now decode the bit sequence 11000011 10001001 with Windows-1252, then Windows-1252 interprets this as two 1-byte codes, each representing a separate character, rather than a 2-byte code representing a single character:
The first byte 11000011 (C3 in hexadecimal) happens to be the Windows-1252 code of the character Ã (Unicode code point U+00C3).
The second byte 10001001 (89 in hexadecimal) happens to be the Windows-1252 code of the character ‰ (Unicode code point U+2030).
You can look up these mappings here.
So, that's why your decoding renders Ã‰ instead of É. The same goes for the other non-ASCII characters é and è.
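The byte-level reasoning above can be reproduced with a short Python sketch. The variable names are illustrative only; cp1252 is Python's codec name for Windows-1252.

```python
rows = {}
for ch in "Ééè":
    utf8 = ch.encode("utf-8")                        # two bytes per character
    hex_bytes = " ".join(f"{b:02X}" for b in utf8)   # e.g. C3 89
    bits = " ".join(f"{b:08b}" for b in utf8)        # the same bytes in binary
    misread = utf8.decode("cp1252")                  # one character per byte
    rows[ch] = (hex_bytes, bits, misread)
    print(ch, hex_bytes, bits, misread)
```

Each line of output shows the original character, its UTF-8 bytes in hexadecimal and binary, and the two characters that Windows-1252 produces from those bytes.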
My issue was during the file reading. I solved it by specifying encoding='utf-8' in the options for open().
open(filePath, 'r', encoding='utf-8')
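A small self-contained sketch of the same fix, assuming a throwaway file (sample.txt in a temporary directory, both hypothetical) in place of the real one. It also shows that text already read with the wrong codec can often be repaired by reversing the mistake, provided every byte had a Windows-1252 mapping.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "sample.txt")  # hypothetical stand-in file

    # Write the text as UTF-8, as the original file was.
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("Éphémère")

    # Reading with Windows-1252 reproduces the garbled text.
    with open(file_path, "r", encoding="cp1252") as f:
        garbled = f.read()

    # Reading with the matching encoding gives the intended text.
    with open(file_path, "r", encoding="utf-8") as f:
        correct = f.read()

# Repairing an already-garbled string: re-encode with the wrong codec,
# then decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
print(garbled, correct, repaired)
```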

jsf url encoding h:outputlink

I have a page which displays several links (h:outputLink) to files which have special characters in their filenames, e.g. "SPRÜCHE.txt".
When I want to browse (GET request) a link, I get the following error:
".../SPRÃœCHE.txt was not found on this server."
But when I replace the special char "Ü" with its ASCII equivalent "%DC", it works fine.
I cannot replace all special chars with ASCII, because the filenames also involve other charsets which cannot be encoded with ASCII (e.g. Chinese).
I already tried lots of encoding methods, like URLEncoder.encode("","UTF-8"); but this returns a Unicode representation which cannot be parsed correctly in the URL ("SPR%c3%9cCHE.txt cannot be found...").
Is there a function which makes a UTF-8 hyperlink / URL web-safe?
I use Tomcat 7.
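The symptoms described are consistent with the server decoding the request path with a single-byte charset instead of UTF-8. That is an assumption; the question does not show the server configuration. The Python sketch below only illustrates the byte arithmetic and is not a JSF or Tomcat fix.

```python
from urllib.parse import quote, unquote

name = "SPRÜCHE.txt"

utf8_link = quote(name)                        # Ü escaped as its two UTF-8 bytes
latin1_link = quote(name, encoding="latin-1")  # Ü escaped as one Latin-1 byte

# Assumption: the receiving side decodes each escaped byte on its own
# (cp1252 is used here because it matches the characters in the error message).
seen_by_server = unquote(utf8_link, encoding="cp1252")

print(utf8_link, latin1_link, seen_by_server)
```

If this reading is right, the UTF-8 link is well-formed and the mismatch lies in how the path is decoded on the receiving side.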

Replace characters in a string to fit browser, C#, From UTF8 encoding

I need to replace the characters å, ä, ö to browser friendly chars. For example ä should become %E4.
I tried weirdString = Uri.EscapeUriString(weirdString);
But it doesn't convert the åäö to the right sign. Help please?
Edit: Tried this:
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] asciicharacters = Encoding.UTF8.GetBytes("vägen");
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, asciicharacters);
string finalString = ascii.GetString(asciiArray);
string fixedAddrString = HttpUtility.HtmlEncode(finalString);
If you need the characters to display on the page and not part of a URL, you should use Server.HtmlEncode.
var encodedString = Server.HtmlEncode(myString);
HTML encoding makes sure that text is displayed correctly in the browser and not interpreted by the browser as HTML. For example, if a text string contains a less than sign (<) or greater than sign (>), the browser would interpret these characters as the opening or closing bracket of an HTML tag. When the characters are HTML encoded, they are converted to the strings &lt; and &gt;, which causes the browser to display the less than sign and greater than sign correctly.
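The same idea in a minimal Python sketch, with html.escape from the standard library standing in for Server.HtmlEncode. The two are analogous but not identical, so the exact output of the C# call may differ in detail.

```python
import html

original = "a < b > c & ä"

escaped = html.escape(original)    # markup characters become entities
restored = html.unescape(escaped)  # decoding reverses the escaping

print(escaped)
print(restored)
```

Note that this kind of encoding targets text shown inside a page; it does not produce the %XX form used in URLs.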
Update:
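Since you are using UTF-8, these characters are escaped as their UTF-8 bytes (ä becomes %C3%A4), not as single-byte values.
To get %E4 you need the ISO-8859-1 (Latin-1) byte for each character, so convert the string with the Encoding classes before you try to escape it. Note that ASCII itself cannot represent å, ä, or ö, so converting to Encoding.ASCII will not give you those values.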
See here.
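The difference between the two escapings can be sketched in Python, with urllib.parse.quote standing in for the C# escaping calls (an illustration of the byte values only, not of the .NET API).

```python
from urllib.parse import quote

word = "vägen"

as_utf8 = quote(word)                         # ä escaped as two UTF-8 bytes
as_latin1 = quote(word, encoding="latin-1")   # ä escaped as one Latin-1 byte

all_three = quote("åäö", encoding="latin-1")  # the characters from the question

print(as_utf8, as_latin1, all_three)
```

Which form is right depends on the charset the receiving server expects, which the question does not state.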
Try HttpUtility.HtmlDecode.
If this doesn't work either, you can do a simple string.Replace for these chars.

Cyrillic characters in browser address bar

When I put cyrillic symbols in address bar like this:
http://ru2.php.net/manual-lookup.php?pattern=привет
it switches to
http://ru2.php.net/manual-lookup.php?pattern=%EF%F0%E8%E2%E5%F2
What do those characters -- %EF%F0%E8%E2%E5%F2 -- mean? And why is it happening?
The characters are getting URL encoded. A URL may only contain a subset of ASCII characters, so anything outside plain alphanumeric and some special characters must be URL encoded.
Some browsers display non-ASCII characters as human readable characters, but that's entirely up to them. In protocols, URLs are always URL encoded.
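A short Python sketch of what the escaped sequence contains. The choice of Windows-1251 (cp1251) is an inference from the byte values, which match that encoding of привет; the answer above does not name the charset.

```python
from urllib.parse import quote, unquote

encoded = "%EF%F0%E8%E2%E5%F2"

# Assumption: the bytes are Windows-1251, one byte per Cyrillic letter.
word = unquote(encoded, encoding="cp1251")

# Encoding the same word in the other direction gives the original sequence.
round_trip = quote(word, encoding="cp1251")

# For comparison, the UTF-8 form uses two bytes per letter.
utf8_form = quote(word)

print(word, round_trip, utf8_form)
```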
