I need to replace the characters å, ä, ö with browser-friendly characters. For example, ä should become %E4.
I tried weirdString = Uri.EscapeUriString(weirdString);
But it doesn't convert åäö to the right escape sequences. Help please?
Edit: Tried this:
ASCIIEncoding ascii = new ASCIIEncoding();
byte[] asciicharacters = Encoding.UTF8.GetBytes("vägen");
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, asciicharacters);
string finalString = ascii.GetString(asciiArray);
string fixedAddrString = HttpUtility.HtmlEncode(finalString);
If you need the characters to display on the page and not part of a URL, you should use Server.HtmlEncode.
var encodedString = Server.HtmlEncode(myString);
HTML encoding makes sure that text is displayed correctly in the browser and not interpreted by the browser as HTML. For example, if a text string contains a less than sign (<) or greater than sign (>), the browser would interpret these characters as the opening or closing bracket of an HTML tag. When the characters are HTML encoded, they are converted to the strings &lt; and &gt;, which causes the browser to display the less than sign and greater than sign correctly.
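As a rough illustration of the same idea outside ASP.NET, here is a minimal Python sketch. It uses the standard library's html.escape, which is only an analogue of Server.HtmlEncode and not the same API; the sample string is made up for the example.

```python
import html

# A made-up string containing characters that HTML treats specially.
raw = "if a < b and b > c"

# html.escape replaces <, > and & (and quotes, by default) with entities.
encoded = html.escape(raw)
print(encoded)  # if a &lt; b and b &gt; c

# html.unescape reverses the transformation.
print(html.unescape(encoded) == raw)  # True
```

Note that this kind of encoding is for text placed inside an HTML page; it does not produce percent-escapes such as %E4 for use in a URL.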
Update:
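Since you are using UTF-8, these characters are escaped as UTF-8 bytes (ä becomes %C3%A4), not as single bytes.
You need to convert the string from UTF-8 to a single-byte encoding, using the Encoding classes, before you try to escape it. Note that ASCII itself has no å, ä, or ö; the value %E4 for ä is its ISO-8859-1 (Latin-1) byte. That is, if you do want the single-byte values to come up.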
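To make the difference between the two escapings concrete, here is a small Python sketch (not the C# API from the question). It assumes that the %E4 form the asker wants is the Latin-1 byte of ä, and uses urllib.parse.quote with an explicit encoding argument.

```python
from urllib.parse import quote

word = "vägen"

# quote() percent-encodes using UTF-8 by default: ä becomes two bytes.
as_utf8 = quote(word)
print(as_utf8)    # v%C3%A4gen

# With a single-byte encoding such as Latin-1, ä becomes the one byte E4.
as_latin1 = quote(word, encoding="latin-1")
print(as_latin1)  # v%E4gen
```

Which form is right depends on what the receiving server expects; UTF-8 is the more common choice today.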
Try HttpUtility.HtmlDecode.
If that doesn't work either, you can do a simple string.Replace for these characters.
I have the following text in a JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
which represents the text "אחוזת פולג" in Hebrew.
No matter which encoding/decoding I use, I don't seem to get it right with Python 3.
If, for example, I try:
text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')
print(text)
I get that text is:
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
which, as a bytes literal, is almost the correct text. If I were able to remove just one backslash from each pair and turn
b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'
into
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
(note how I changed each double backslash to a single backslash) then
text.decode('utf-8')
would yield the correct text in Hebrew.
But I am struggling to do so and couldn't manage to write a piece of code that does this for me (rather than manually, as I just showed).
Any help is much appreciated.
This string does not "represent" Hebrew text (at least not as unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.
It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.
Assuming that this is what literally is saved in your JSON file:
"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"
you can recover the Hebrew text as follows:
(jsonInput
.encode('latin-1')
.decode('raw_unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
For the above example, it gives:
'אחוזת פולג'
If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer already interprets the escape sequences for you. That is, after the text element is loaded by the JSON deserializer, it should be sufficient to just encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
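The deserializer variant described above can be sketched as follows. This is a minimal example that assumes the JSON document is just the single string from the question; the variable names are invented for the illustration.

```python
import json

# The JSON text exactly as stored in the file (a raw string keeps the backslashes).
raw_json = r'"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"'

# json.loads interprets the \uXXXX escapes, yielding one character per UTF-8 byte.
broken = json.loads(raw_json)

# latin-1 maps code points 0-255 back to the same byte values,
# and those bytes are then decoded as the UTF-8 they really are.
repaired = broken.encode("latin-1").decode("utf-8")
print(repaired)  # אחוזת פולג
```

If a value contains characters above U+00FF, the encode("latin-1") step raises UnicodeEncodeError, which is a useful signal that the value was not broken in this particular way.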
I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; in that case the latin-1 step may no longer work properly. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII, as it would only make everything worse.
I would like to convert the special characters in the "text" variable back to normal in VBA!
Dim text As String
text = "Cs\u00fct\u00f6rt\u00f6k"
text = Encoding.utf8.GetString(Encoding.ASCII.GetBytes(text))
MsgBox text
'Csütörtök would be the correct result
But with the above code, Excel 2013 gives me an error about the "Encoding" method; it can't parse it.
It should work just like this online converter if you put in the text value:
http://www.rapidmonkey.com/unicodeconverter/reverse.jsp
Is there any good solution for this problem? A one-line code maybe?
Thanks in advance!
The Encoding class does not decode escape sequences; you have to do that manually by parsing the string yourself. For that matter, VB strings use UTF-16, so you do not need to use the Encoding class at all. Simply replace characters 2-7 ("\u00fc", counting from zero) with a single &H00FC character, replace characters 9-14 ("\u00f6") with a single &H00F6 character, etc., and then you are done. Each \uXXXX sequence represents a single UTF-16 code unit, which for these characters is a single Unicode code point.
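The manual parsing described above can be sketched in Python (not VBA) to show the logic. The helper name decode_u_escapes is hypothetical, and the sketch only handles characters in the Basic Multilingual Plane; surrogate pairs would need extra handling.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(text: str) -> str:
    """Replace each \\uXXXX sequence with the character it names (BMP only)."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Raw string, so the backslashes are literal, as in the VBA variable.
escaped = r"Cs\u00fct\u00f6rt\u00f6k"
print(decode_u_escapes(escaped))  # Csütörtök
```

The same approach carries over to VBA: scan for "\u", read the next four hex digits, and substitute ChrW of that value.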
I have a page which displays several links to files which have special characters in their filenames, e.g. "SPRÜCHE.txt".
When I try to browse (GET request) a link, I get the following error:
".../SPRÃœCHE.txt was not found on this server."
But when I replace the special char "Ü" with its ASCII equivalent "%DC", it works fine.
I cannot replace all special chars with ASCII, because the filenames also involve other character sets which cannot be encoded that way (e.g. Chinese).
I already tried lots of encoding methods, like URLEncoder.encode("","UTF-8");, but this returns a Unicode representation which is not resolved correctly in the URL ("SPR%c3%9cCHE.txt cannot be found...").
Is there a function which makes a UTF-8 hyperlink / URL web-safe?
I use Tomcat 7.
My URL (which I'm passing to twitter/share in a query string) contains %C2%BC%C3%BE, the encoding for 件, but the browser decodes it as two characters ¼þ. How can I let the browser know that it should decode it as a single character?
Your encoding is wrong. The browser is decoding it correctly as those two characters. The correct UTF-8 bytes for U+4EF6 are E4, BB, and B6.
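A short Python sketch can confirm both halves of this answer. The last step is only a guess at how the wrong value came about: it assumes the character was first encoded as GBK and those bytes were then treated as Latin-1 text, which the original post does not state.

```python
from urllib.parse import quote, unquote

# The browser's reading of the submitted escape is correct UTF-8 decoding.
decoded = unquote("%C2%BC%C3%BE")
print(decoded)   # ¼þ

# Percent-encoding the intended character as UTF-8 gives three bytes.
correct = quote("件")
print(correct)   # %E4%BB%B6

# Assumption: GBK bytes of the character were misread as Latin-1 text,
# then percent-encoded as UTF-8. This reproduces the value from the question.
gbk_bytes = "件".encode("gbk")
reproduced = quote(gbk_bytes.decode("latin-1"))
print(reproduced)  # %C2%BC%C3%BE
```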
When I put cyrillic symbols in address bar like this:
http://ru2.php.net/manual-lookup.php?pattern=привет
it switches to
http://ru2.php.net/manual-lookup.php?pattern=%EF%F0%E8%E2%E5%F2
What do those characters -- %EF%F0%E8%E2%E5%F2 -- mean? And why is this happening?
The characters are getting URL encoded. A URL may only contain a subset of ASCII characters, so anything outside plain alphanumeric and some special characters must be URL encoded.
Some browsers display non-ASCII characters as human readable characters, but that's entirely up to them. In protocols, URLs are always URL encoded.
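To see what the escaped form stands for, here is a small Python sketch. The answer does not say which character encoding is involved; the assumption here is Windows-1251 (cp1251), which is consistent with the six bytes shown in the question.

```python
from urllib.parse import quote, unquote

encoded = "%EF%F0%E8%E2%E5%F2"

# Each %XX is one byte; interpreted as Windows-1251 the bytes spell the word.
decoded = unquote(encoded, encoding="cp1251")
print(decoded)  # привет

# Encoding the word again with the same charset gives the original escapes.
print(quote("привет", encoding="cp1251"))  # %EF%F0%E8%E2%E5%F2

# With UTF-8, the more common choice, each letter takes two bytes instead.
print(quote("привет"))  # %D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82
```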