How to decode diacritics in a string without allowing HTML using MVC5

Is there a framework supported way in MVC.net to decode diacritics in a textbox without just decoding the whole string and allowing HTML content?
If the text box contains:
<b>Café</b>
When re-rendering the page it becomes:
<b>Café</b>
What is the correct way to allow these diacritics to display properly? Using the built-in decode option results in:
Café
I could make a whitelist and convert these characters on the fly, but that seems clunky as I might miss some or allow potentially dangerous characters through by mistake. What is the recommended way to support these characters in MVC5?
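One way to picture the goal is to escape only the characters that carry meaning in HTML and leave letters such as é alone. The sketch below uses Python's standard library purely for illustration; it is not an MVC5 API, the function names are hypothetical, and it assumes the page is served as UTF-8 so that é can be sent as-is.

```python
import html


def escape_markup_only(text: str) -> str:
    """Escape only the HTML-significant characters (& < > " ')."""
    return html.escape(text, quote=True)


def normalize_entities(text: str) -> str:
    """Decode every entity, then re-escape only the HTML-significant characters.

    Assumption: the input was over-encoded, e.g. 'Caf&#233;' instead of 'Café'.
    """
    return html.escape(html.unescape(text), quote=True)


if __name__ == "__main__":
    # Markup stays inert, the diacritic stays readable.
    print(escape_markup_only("<b>Café</b>"))
    print(normalize_entities("&lt;b&gt;Caf&#233;&lt;/b&gt;"))
```

Both calls yield the same escaped markup with a literal é, which avoids a hand-maintained whitelist of accented characters.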

Related

Unicode character order problem when text is displayed

I am working on an application that converts text into other characters of the extended ASCII character set, which are then displayed in a custom font.
The program parses the input string using regex, locates the standard characters, converts them, and returns a string with the modified text, which displays correctly when viewed with the correct font.
Every now and again, the function returns a string where the characters are displayed in the wrong order, almost as if they were corrupted or some data were missing from the Unicode double-width spacing. I have examined the binary output and the hex data, and inspected the data in the function before I return it, and everything looks OK, but every once in a while something goes wrong and I can't quite put my finger on it.
To see an example of what I mean when I say the order is weird, take a look at the following piece of converted text output from the program and try to highlight it with your mouse. You will see that it doesn't highlight in the order you expect, despite how it appears.
Has anyone seen anything like this before, and do they have any ideas as to what is going on?
ך┼♫יἯ╡П♪דἰ

You are mixing various Unicode characters with different LTR/RTL characteristics.
LTR means "left-to-right" and is the direction in which English (and many other western languages) is written.
RTL is "right-to-left" and is used mostly by Arabic and Hebrew (as well as several other scripts).
By default, when rendering Unicode text, the engine will try to use the directionality of the characters to figure out which direction a given part of the text should go. Normally that works just fine, because Hebrew words contain only Hebrew letters and English words use only letters from the Latin alphabet, so for each chunk there's an easily guessable direction that makes sense.
But you are mixing letters from different scripts and with different directionality.
For example ך is U+05DA HEBREW LETTER FINAL KAF, but you also use two other Hebrew characters. You can use something like this page to list the Unicode characters you used.
You can either
not use "wrong" directionality letters or
make the direction explicit using a Left-to-right mark character.
Edit: Last but not least: I just realized that you said "custom font": if you expect displaying with a specific custom font, then you should really be using one of the private use areas in Unicode: they are explicitly reserved for private use like this (i.e. where the characters don't match the publicly defined glyphs for the codepoints). That would also avoid surprises like the ones you get, where some of the used characters have different rendering properties.
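To check which characters in a string carry right-to-left directionality, a small script can stand in for the lookup page mentioned above. This is a minimal sketch using Python's `unicodedata` module; the helper names are hypothetical. Note that `force_ltr` wraps the text in a left-to-right override (U+202D) and its terminator (U+202C), which is a stronger control than the single left-to-right mark named in the answer; whether the renderer honours it is an assumption to verify.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def describe(text):
    """Return (char, code point, name, bidi class) for every character."""
    return [
        (
            ch,
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.bidirectional(ch),
        )
        for ch in text
    ]


def has_mixed_direction(text):
    """True if the text mixes strong LTR ('L') and strong RTL ('R', 'AL') characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return "L" in classes and bool(classes & {"R", "AL"})


def force_ltr(text):
    """Wrap text so a bidi-aware renderer lays it out strictly left to right."""
    return f"{LRO}{text}{PDF}"


if __name__ == "__main__":
    sample = "\u05da\u041f"  # Hebrew final kaf followed by Cyrillic capital pe
    for row in describe(sample):
        print(row)
    print(has_mixed_direction(sample))
```

Any character reported with class `R` or `AL` is a candidate for replacement, for instance by a private use area code point as suggested above.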

Web pages served by local IIS showing black diamonds with question marks

I'm having an issue in a .NET application where pages served by local IIS display random characters (mostly black diamonds with white question marks in them). This happens in Chrome, Firefox, and Edge. IE displays the pages correctly for some reason.
The same pages in production and in lower pre-prod environments work in all my browsers. This is strictly a local issue.
Here's what I've tried:
Deleted code and re-cloned (also tried switching branches)
Disabled all browser extensions
Ran in incognito mode
Rebooted (you never know)
Deleted temporary ASP.NET files
Looked for corrupt fonts on machine but didn't find any
Other Information:
Running IIS 10.0.17134.1
.NET MVC application with Knockout
I realize there are several other posts regarding black diamonds with question marks, but none of them seem to address my issue.
Please let me know if you need more information.
Thanks for your help!

You are in luck. The explicit purpose of � is to indicate that character encodings are being misused. When users see that, they'll know that we've messed up and lost some of their text data, and we'll know that, at one or more points, our processing and/or configuration is wrong.
(Fonts are not at issue [unless there is no font available to render �]. When there is no font available for a character, it's usually rendered as a white-filled rectangle.)
Character encoding fundamentals are simple: use a sufficient character set (say Unicode), pick an appropriate encoding (say UTF-8), encode text with it to obtain bytes, tell every program and person that gets the bytes that they represent text and which encoding is used. The encoding might be understood from a standard, convention, or specification.
Your editor does the actual encoding.
If the file is part of a project or similar system, a project file might store the intended encoding for all or each text file in the project. If your editor is an IDE, it should understand how the project does that.
Your compiler needs to know the encoding of each text file you give it. A project system would communicate what it knows.
HTML provides an optional way to communicate the encoding. Example: <meta charset="utf-8">. An HTML-aware editor should not allow this indicator to be different than the encoding it uses when saving the file. An HTML-aware editor might discover this indicator when opening the file and use the specified encoding to read the file.
HTTP uses another optional way: Content-Type response header. The web server emits this either statically or in conjunction with code that it runs, such as ASP.NET.
Web browsers use the HTTP way if given.
XHR (AJAX, etc) uses HTTP along with JavaScript processing. If needed the JavaScript processing should apply the HTTP and HTML rules, as appropriate. Note: If the content is JSON, the current RFC requires the encoding to be UTF-8.
No one or thing should have to guess.
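The effect of a wrong guess can be reproduced in a few lines. The following Python sketch (the helper name is hypothetical) encodes a string with one encoding and decodes it with another; the decoder substitutes U+FFFD, the black diamond, for bytes it cannot interpret.

```python
REPLACEMENT = "\ufffd"  # the character browsers draw as a black diamond with a question mark


def decode_lossy(data: bytes, encoding: str) -> str:
    """Decode bytes, substituting U+FFFD for anything invalid in that encoding."""
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    text = "Café"
    # Writer and reader agree: the text survives.
    print(decode_lossy(text.encode("utf-8"), "utf-8"))
    # Writer used Latin-1, reader assumed UTF-8: the é is lost.
    damaged = decode_lossy(text.encode("latin-1"), "utf-8")
    print(damaged, REPLACEMENT in damaged)
```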
Diagnostics
Which character encoding did you intend to use? This century, UTF-8 is so much the norm that if you choose to use a different one, you should have a good reason and document it (for others and your future self).
Compare the bytes in the file with the text you expect it to represent. Does it use the intended encoding? Use an editor or tool that shows bytes in hex.
As suggested by @snakecharmerb, what does the server send? Use a web browser's F12 network tab.
What does the HTTP response header say, if anything?
What does the HTML meta tag say, if anything?
What is the HTML doctype, if any?
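Two of these checks are easy to script. The sketch below is an illustration with hypothetical helper names: one function prints the bytes a string has under a given encoding, for comparison with a hex dump of the file, and the other extracts the declared charset from a Content-Type header value using the standard library's `email.message.Message` parser.

```python
from email.message import Message


def hex_bytes(text: str, encoding: str) -> str:
    """Show the bytes a string occupies in the given encoding, as spaced hex."""
    return text.encode(encoding).hex(" ")


def charset_from_content_type(header_value: str):
    """Return the lowercased charset declared in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


if __name__ == "__main__":
    print(hex_bytes("Café", "utf-8"))    # é occupies two bytes
    print(hex_bytes("Café", "latin-1"))  # é occupies one byte
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

If the header reports no charset, or one that disagrees with the bytes in the file, that mismatch is the first thing to fix.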

How to convert chars (Â or Ã or â€) to ASCII codes while creating word document using C#?

I have the string " Single 63”x14” rear window". I am parsing this string into HTML and creating a Word document, applying styles using System.IO.File.ReadAllText(styleSheet).
In the document, the string comes out as "Single 63â€x14â€ rear window". I am doing this in C#.
How can I get the correct character to show up in Word?

You would have to find out the incoming encoding of the string
" Single 63”x14” rear window"
And also which encoding the word document allows.
It appears that the encoding characters for those funky quotes are not supported by Word. You could always create a nifty little string parser to search for characters outside the Word encoding range and replace them with either String.Empty, or search for specific supported characters that look similar.
E.g. String.Replace("”","\"");
(this probably wouldn't work without directly manipulating the encoding values, but you haven't provided those so can't give an exact example)

The encoding you are looking at appears to be UTF-8. It's actually probably exactly what you want, you just need to view it using a tool which supports UTF-8, and if you process it and put it on a web page, add the required HTML meta tag so that browsers will display it using the correct encoding.
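The "â€" pattern is what UTF-8 bytes look like when they are read as Windows-1252. The following Python sketch reproduces it; the function names are hypothetical, and Python is used only for illustration, not as the asker's C# code.

```python
def misread_as(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode with the wrong encoding (what the document shows)."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


def repair(mojibake: str, wrong_encoding: str = "cp1252") -> str:
    """Undo the misreading; works only if no bytes were lost along the way."""
    return mojibake.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    # The right double quote is three bytes in UTF-8 (e2 80 9d). The third byte
    # has no mapping in Python's cp1252 codec, so it becomes U+FFFD here.
    print(repr(misread_as("\u201d")))
    print(misread_as("é"), repair(misread_as("é")))
```

The more robust fix is to read the bytes with the correct encoding in the first place rather than repairing afterwards.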

Questions on Chinese Encoding

I'm trying to create a webpage in Chinese and I realized that while the text looks fine when I run it on browsers, once I change the Character Encoding, the text becomes gibberish. Here's what's happening:
I create my html file in Emacs, encoded in UTF-8.
I upload it to the server, and view it on my browsers (FF, IE, Chrome, Opera) - no problem.
I try to view the page in other encodings via FF > View > Character Encoding > All those different Chinese encoding systems, e.g. Chinese Simplified (HZ)
Apart from UTF-8, on every other encoding the text becomes gibberish.
I'm assuming this isn't a problem - i.e. browsers are smart enough to know which encoding the page is in, and parse the content accurately. What I'm wondering is why I can't read the Chinese text anymore once I change encoding - is it because I don't have Chinese fonts installed on my OS? Should I stick to UTF-8 if my audience are Chinese or should I choose among one of their many encoding systems?
Thanks in advance for your help/opinions.

UTF-8 isn't a 'catch-all' encoding. It's designed to contain international language character symbols for ease of use, but it is still an encoding, just like the other encodings you've selected. You would have to convert the text into each encoding for it to appear correctly when viewed with that encoding.

The viewer's encoding MUST match that of the file being read. Viewing UTF-8 as something else makes about as much sense as renaming .txt to .exe and trying to run it.
You should specify the correct encoding in the HTML. The option you're using in the web browser exists only for those rare occasions when a web developer screwed up and declared a different encoding than the one actually used, OR mixed up two different encodings on one page.

Of course changing the encoding in your browser will "break" the text! The browser takes the stream of UTF-8 bytes and tries to force another encoding on the raw data. Needless to say, the result ain't pretty. Changing the encoding in the browser is NOT the equivalent of converting.
As you correctly surmised, modern browsers usually guess correctly -- but not always. As Agent_L said, make sure to declare the encoding in the headers.
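The difference between reinterpreting bytes and converting them can be shown in a few lines of Python. This is a minimal sketch with hypothetical helper names; GB2312 is picked here only as an example of another Chinese encoding.

```python
def reinterpret(text: str, declared: str, actual: str = "utf-8") -> str:
    """What the browser does when the encoding menu is changed: same bytes, new rules."""
    return text.encode(actual).decode(declared, errors="replace")


def transcode(data: bytes, source: str, target: str) -> bytes:
    """A real conversion: decode with the true encoding, then encode with the new one."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    text = "中文"
    utf8_bytes = text.encode("utf-8")
    gb_bytes = transcode(utf8_bytes, "utf-8", "gb2312")
    print(utf8_bytes.hex(" "), "|", gb_bytes.hex(" "))  # different byte sequences
    print(gb_bytes.decode("gb2312") == text)             # converted text is intact
    print(reinterpret(text, "gb2312") == text)           # reinterpreted text is not
```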

How secure are XPS documents?

How secure are XPS documents? After looking inside an XPS document, I found the Unicode-string property. Could someone inject e.g. a script into the Unicode string property?
How does the XPS viewer treat the Unicode string property? As a collection of glyphs or what?
UPDATE: I added the following string as UnicodeText
!##$%^&*()_+
and the XPS viewer refused to open the file. This is how this question came into my mind

XPS documents, as opposed to (coughs) some other format cannot contain scripts or active content. They are only used as a high-fidelity pre-print format. That being said, it's not entirely impossible for XPS parsers to contain security vulnerabilities. And they can be exploited. So far I haven't heard of any such exploits, though.
But back to your point. If someone wants to put a script into a string in an XPS document he can surely do so. He just shouldn't expect it to be executed. If some software actually does that, then it's probably a security problem with the software and not with the file format.
Just because you can put malware into a text file (remember iloveyou.vbs?) that doesn't mean that text files themselves have a security vulnerability :-)
ETA: The UnicodeString attribute in question aids searching inside the XPS file:
The UnicodeString attribute holds the array of Unicode scalar values that are represented by the current element. Specifying a Unicode string is RECOMMENDED, as it supports searching, selection, and accessibility.
And while the string itself is expected to be in a certain format (also detailed in the standard on page 115), the reason why the viewer didn't want to accept your input is that it's not even well-formed XML since the ampersand (&) appears unescaped. I assume that it would work if you encode the ampersand as &amp; as required by XML. The spec also states that
The standard XML escaping mechanisms are used to specify XML-reserved characters.
But even with that in place, the relationship between the UnicodeString attribute and other parts of the document is quite intricate. They wrote over half a page on that and on which combinations are valid and which are not. So I'd suggest you read up on that first, before trying to play around further :-)
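The well-formedness point can be checked with any XML parser. The sketch below uses Python's standard library and a bare Glyphs element as a stand-in for the real XPS markup; the helper names are hypothetical, and it does not cover the other UnicodeString format rules mentioned above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def glyphs_element(unicode_string: str) -> str:
    """Build a minimal element with the attribute value XML-escaped and quoted."""
    return f"<Glyphs UnicodeString={quoteattr(unicode_string)} />"


def is_well_formed(xml_text: str) -> bool:
    """True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    raw = "!##$%^&*()_+"
    # A bare ampersand makes the document ill-formed.
    print(is_well_formed(f'<Glyphs UnicodeString="{raw}" />'))
    # Escaped, it parses and the attribute value round-trips.
    escaped = glyphs_element(raw)
    print(escaped, ET.fromstring(escaped).get("UnicodeString") == raw)
```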

p.95 of the XPS 1.0 spec: "The standard XML escaping mechanisms are used to specify XML-reserved characters."
The '&' might be causing troubles.
