How secure are XPS documents? After looking inside an XPS document, I found the Unicode-string property. Could someone inject, e.g., a script into the Unicode string property?
How does the XPS viewer treat the Unicode string property? As a collection of glyphs or what?
UPDATE: I added the following string as UnicodeText
!##$%^&*()_+
and the XPS viewer refused to open the file. This is how the question came to mind.
XPS documents, as opposed to (coughs) some other formats, cannot contain scripts or active content. They are only used as a high-fidelity pre-print format. That being said, it's not entirely impossible for XPS parsers to contain security vulnerabilities, and those can be exploited. So far I haven't heard of any such exploits, though.
But back to your point. If someone wants to put a script into a string in an XPS document, he can surely do so. He just shouldn't expect it to be executed. If some software actually does execute it, then that's probably a security problem with the software, not with the file format.
Just because you can put malware into a text file (remember iloveyou.vbs?) doesn't mean that text files themselves have a security vulnerability :-)
ETA: The UnicodeString attribute in question aids searching inside the XPS file:
The UnicodeString attribute holds the array of Unicode scalar values that are represented by the current element. Specifying a Unicode string is RECOMMENDED, as it supports searching, selection, and accessibility.
And while the string itself is expected to be in a certain format (also detailed in the standard on page 115), the reason the viewer didn't accept your input is that it's not even well-formed XML, since the ampersand (&) appears unescaped. I assume it would work if you encoded the ampersand as &amp;, as required by XML. The spec also states that
The standard XML escaping mechanisms are used to specify XML-reserved characters.
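If you are generating the markup from .NET, the framework can do that escaping for you. A minimal C# sketch (the input is the string from the update above; SecurityElement.Escape handles the XML-reserved characters):

using System;
using System.Security;

// Minimal sketch: escape XML-reserved characters before writing a value
// into an attribute such as UnicodeString.
class EscapeDemo
{
    static void Main()
    {
        string raw = "!##$%^&*()_+";                  // the string from the question
        string escaped = SecurityElement.Escape(raw); // escapes & < > " '
        Console.WriteLine(escaped);                   // !##$%^&amp;*()_+
    }
}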
But even with that in place, the relationship between the UnicodeString attribute and other parts of the document is quite intricate. The spec devotes over half a page to which combinations are valid and which are not, so I'd suggest you read up on that before playing around further :-)
p.95 of the XPS 1.0 spec: "The standard XML escaping mechanisms are used to specify XML-reserved characters."
The '&' might be causing trouble.
I have a PDF document with the following sample text (screenshot):
But when I copy and paste it into Word or other text editors, all I see is weird characters:
I am not quite sure why it gives me weird square boxes instead of pasting the clear, human-readable letters (as in the screenshot). Can someone help me get rid of this issue? Or at least, what should I do to identify the root cause?
================== Workaround found ==================
I tried converting the document's corrupted Unicode to standard ASCII/ANSI formats, but most of the online services couldn't recognize these garbage characters.
The issue could be resolved with some programming, but I didn't want to invest time in that approach and preferred something that works on the fly.
Finally, as suggested by the user 'mkl', converting the document with an OCR service such as "Sejda" or "Adobe OCR" resolved my issue.
I'm having an issue in a .NET application where pages served by local IIS display random characters (mostly black diamonds with white question marks in them). This happens in Chrome, Firefox, and Edge. IE displays the pages correctly for some reason.
The same pages in production and in lower pre-prod environments work in all my browsers. This is strictly a local issue.
Here's what I've tried:
Deleted code and re-cloned (also tried switching branches)
Disabled all browser extensions
Ran in incognito mode
Rebooted (you never know)
Deleted temporary ASP.NET files
Looked for corrupt fonts on the machine but didn't find any
Other Information:
Running IIS 10.0.17134.1
.NET MVC application with Knockout
I realize there are several other posts regarding black diamonds with question marks, but none of them seem to address my issue.
Please let me know if you need more information.
Thanks for your help!
You are in luck. The explicit purpose of � (U+FFFD REPLACEMENT CHARACTER) is to indicate that character encodings are being misused. When users see it, they'll know we've messed up and lost some of their text data, and we'll know that, at one or more points, our processing and/or configuration is wrong.
(Fonts are not at issue [unless there is no font available to render � itself]. When there is no font available for a character, it's usually rendered as a white-filled rectangle.)
Character encoding fundamentals are simple: use a sufficient character set (say Unicode), pick an appropriate encoding (say UTF-8), encode text with it to obtain bytes, tell every program and person that gets the bytes that they represent text and which encoding is used. The encoding might be understood from a standard, convention, or specification.
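To make that concrete, here is a minimal C# sketch of the round trip (the sample text is illustrative):

using System;
using System.Text;

// Minimal sketch of the round trip: text -> bytes -> text, with the
// encoding stated explicitly at both ends.
class RoundTrip
{
    static void Main()
    {
        string text = "Café ☕";
        byte[] bytes = Encoding.UTF8.GetBytes(text);   // encode: text -> bytes
        string back = Encoding.UTF8.GetString(bytes);  // decode: bytes -> text
        Console.WriteLine(back == text);               // True: same encoding both ways
    }
}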
Your editor does the actual encoding.
If the file is part of a project or similar system, a project file might store the intended encoding for all or each text file in the project. If your editor is an IDE, it should understand how the project does that.
Your compiler needs to know the encoding of each text file you give it. A project system would communicate what it knows.
HTML provides an optional way to communicate the encoding. Example: <meta charset="utf-8">. An HTML-aware editor should not allow this indicator to be different than the encoding it uses when saving the file. An HTML-aware editor might discover this indicator when opening the file and use the specified encoding to read the file.
HTTP uses another optional way: Content-Type response header. The web server emits this either statically or in conjunction with code that it runs, such as ASP.NET.
Web browsers use the HTTP way if given.
XHR (AJAX, etc.) uses HTTP along with JavaScript processing. If needed, the JavaScript processing should apply the HTTP and HTML rules, as appropriate. Note: if the content is JSON, the current RFC (RFC 8259) requires the encoding to be UTF-8.
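For the ASP.NET MVC case specifically, a hedged sketch of making the HTTP declaration explicit (controller and action names are hypothetical; a correctly configured app usually emits utf-8 already):

using System.Web.Mvc;

// Sketch: the Content-Type response header is what carries the
// encoding to the browser.
public class HomeController : Controller
{
    public ActionResult Index()
    {
        Response.ContentType = "text/html; charset=utf-8";
        return View();
    }
}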
No person or program should have to guess.
Diagnostics
Which character encoding did you intend to use? This century, UTF-8 is so much the norm that if you choose to use a different one, you should have a good reason and document it (for others and your future self).
Compare the bytes in the file with the text you expect them to represent. Does the file use the intended encoding? Use an editor or tool that shows bytes in hex (see the sketch after this list).
As suggested by @snakecharmerb, what does the server send? Use a web browser's F12 network tab.
What does the HTTP response header say, if anything?
What does the HTML meta tag say, if anything?
What is the HTML doctype, if any?
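Here is the sketch mentioned above: a minimal C# snippet that prints a file's first bytes in hex so you can compare them with the text you expect (the file name is hypothetical):

using System;
using System.IO;

// Print the first bytes of a file in hex to check the encoding on disk.
class HexPeek
{
    static void Main()
    {
        byte[] bytes = File.ReadAllBytes("page.html");
        for (int i = 0; i < Math.Min(bytes.Length, 64); i++)
            Console.Write("{0:X2} ", bytes[i]);
        Console.WriteLine();
        // For reference: "é" is C3 A9 in UTF-8 but E9 in Windows-1252.
    }
}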
Is there a framework-supported way in ASP.NET MVC to decode diacritics in a textbox without decoding the whole string and allowing HTML content through?
If the text box contains:
<b>Café</b>
When re-rendering the page it becomes:
&lt;b&gt;Caf&#233;&lt;/b&gt;
What is the correct way to allow these diacritics to display properly? Using the inbuilt decode option results in:
<b>Café</b>
I could make a whitelist and convert these characters on the fly, but that seems clunky as I might miss some or allow potentially dangerous characters through by mistake. What is the recommended way to support these characters in MVC5?
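For what it's worth, the "convert on the fly" idea can be sketched as decoding only numeric character references while leaving the markup entities encoded. This is an illustrative sketch, not a vetted framework feature, and it deliberately ignores named entities like &eacute; and hex references:

using System;
using System.Text.RegularExpressions;

// Illustrative sketch only: decode numeric character references
// (e.g. &#233; -> é) while leaving &lt; &gt; &amp; alone, so markup stays inert.
class SelectiveDecode
{
    static string DecodeNumericEntities(string encoded) =>
        Regex.Replace(encoded, @"&#(\d+);", m =>
            char.ConvertFromUtf32(int.Parse(m.Groups[1].Value)));

    static void Main()
    {
        Console.WriteLine(DecodeNumericEntities("&lt;b&gt;Caf&#233;&lt;/b&gt;"));
        // prints: &lt;b&gt;Café&lt;/b&gt;
    }
}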
I have built an Excel/VBA tool to validate csv files, to ensure the data they contain is valid. The csv can originate from anywhere (from a full-blown unix system or from a desktop user saving data out of Excel). The Excel tool is sent out to businesses so they can validate their csv files in their own environment, without the risk of their data leaving their systems. Thus, the solution needs to be in native VBA and must not link to external libraries.
So, using VBA, I need to automatically detect UTF-8 (with or without BOM) or ANSI file encodings and warn the user if these are not the encodings used for the csv.
I think this would involve reading a few bytes from the start of the file and determining the encoding based on the existence of a byte order mark.
Could you help me get me started on the right track?
Assuming you have the freedom to ask the user to choose the correct file type, you can make them responsible for what they choose ;)
That means you can create a form where users choose the filename and the encoding type, as in a file-open wizard.
Otherwise,
I suggest you use the FileSystemObject. It returns a TextStream, which can be used to determine the encoding. I doubt VBA supports other types of encoding, but please correct me if it does :) happy to hear it. :)
how to detect encoding type
msdn object library model
Here is a link for further consideration:
change encode type
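To make the BOM idea from the question concrete, here is the byte-level check sketched in C#. The question needs native VBA, but the logic translates directly to Open ... For Binary and byte comparisons; the file path is hypothetical:

using System;
using System.IO;

// Sketch of the BOM check described in the question.
// UTF-8 BOM: EF BB BF. UTF-16 LE: FF FE. UTF-16 BE: FE FF.
class BomSniffer
{
    static void Main()
    {
        byte[] b = new byte[3];
        using (FileStream fs = File.OpenRead("data.csv"))
            fs.Read(b, 0, 3);

        if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            Console.WriteLine("UTF-8 with BOM");
        else if (b[0] == 0xFF && b[1] == 0xFE)
            Console.WriteLine("UTF-16 LE (not UTF-8/ANSI)");
        else if (b[0] == 0xFE && b[1] == 0xFF)
            Console.WriteLine("UTF-16 BE (not UTF-8/ANSI)");
        else
            Console.WriteLine("No BOM: UTF-8 without BOM, or ANSI");
    }
}

Note that a BOM check alone cannot tell BOM-less UTF-8 from ANSI; for that you would have to scan the file for valid UTF-8 multi-byte sequences.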
I have the string " Single 63”x14” rear window" and am parsing it into HTML, then creating a Word document and applying styles using System.IO.File.ReadAllText(styleSheet).
In the document, I'm getting this string as "Single 63â€x14†rear window" in C#.
How can I get the correct characters to show up in Word?
You would have to find out the incoming encoding of the string
" Single 63”x14” rear window"
and also which encoding the Word document allows.
It appears that the encoding of those funky quotes is not supported by Word. You could always write a nifty little string parser that searches for characters outside Word's supported range and replaces them with either String.Empty or specific supported characters that look similar.
E.g. s.Replace("”", "\"");
(This probably wouldn't work without directly manipulating the encoding values, but you haven't provided those, so I can't give an exact example.)
The encoding you are looking at appears to be UTF-8. It's probably exactly what you want; you just need to view it with a tool that supports UTF-8, and if you process it and put it on a web page, add the required HTML meta tag so that browsers display it using the correct encoding.
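To see where the "â€" garbage comes from, here is a small C# demonstration, assuming the bytes were produced as UTF-8 and then decoded as Windows-1252 (the codepage registration line is only needed on .NET Core/.NET 5+; on .NET Framework codepage 1252 is built in):

using System;
using System.Text;

// Demonstrates the mojibake from the question: the UTF-8 bytes for ”
// (E2 80 9D) decoded as Windows-1252 come out as "â€" plus a stray byte.
class MojibakeDemo
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        string original = "Single 63\u201Dx14\u201D rear window";
        byte[] utf8 = Encoding.UTF8.GetBytes(original);

        // Wrong: decoding UTF-8 bytes as Windows-1252 produces the garbage
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(utf8));

        // Right: decode with the same encoding that produced the bytes
        Console.WriteLine(Encoding.UTF8.GetString(utf8));
    }
}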