Saving property with HTML - encode on entry, or on display?

I have a system which allows users to enter HTML-reserved characters into a text area, then post that to my application. That information is then saved to a database for later retrieval and display. Alarms are (should be) going off in your head. I need to make sure that I avoid XSS attacks, because I will display this data somewhere else in the application. Here are my options as I see it:
Encode before save to DB
I can HTML-encode the data on the way in to the database, so no HTML characters ever are entered in the database.
Pros:
Developers don't have to remember to HTML encode the data when it's displayed on the web page.
Cons:
The data now doesn't make sense for desktop-based applications (or anything other than HTML). Stuff shows up like &lt; &gt; &amp; etc.
Don't HTML encode before saving to DB
I can HTML encode the data whenever I need to display it on a web page.
Pros:
Feels right because it keeps the integrity of the data that was entered by the user.
Allows non-HTML based applications to just display this data without having to worry about HTML encoding.
Cons:
We might display this data in a lot of places, and we'll have to make sure that every developer knows that when you display this field, you'll need to HTML encode it.
People forget things. There WILL be at least one instance when we forget to HTML encode the data.
Scrub the data before saving to DB (don't HTML encode)
I can use a well-tested third-party library to remove potentially dangerous HTML and get a safe HTML fragment to save to the database, not HTML encoded.
Pros:
Preserves most of the original input so that display in a non-HTML format makes sense.
Less catastrophic if the developer forgets to HTML encode this information for display on a web page.
Cons:
Still messes with the data as the user originally entered it. If they really want to type a <script> or <object> tag, it won't make it, and we'll get support calls and emails because of that.
My question is: What is the best option, or if there is another way of going about this, what is it?

The right thing to do is not mangle/change user input.
So, do not encode before saving.
Yes, this puts the onus on the developers to remember and know that they need to encode anything coming out of the DB, but this is good practice regardless.

Related

Is there any way to hide or obfuscate schema json-ld?

On my webpage I have a standard JSON-LD schema that holds A LOT of data. Is there any way to prevent or make it harder to read for an average user in the console?

Remove spacing and new lines. It has to stay machine readable, which I think means you can't obfuscate the actual text or property names.
I guess you could have it stored in another obfuscated format and have JavaScript generate the readable version. But then, anyone checking the rendered html will see it as it is. And it will limit the systems that can read it.
Another idea is to detect if it's a normal user and not provide the structured data to them. They don't need it. But that's cloaking and may annoy Google.
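As a rough illustration of the first suggestion above (removing spacing and new lines), the JSON-LD can be parsed and re-serialized without whitespace. The sample payload and the function name minifyJsonLd are made up for this sketch; the output is still ordinary, machine-readable JSON.

```javascript
// Hypothetical JSON-LD payload, pretty-printed as it might sit in a template.
const prettyJsonLd = `{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Co"
}`;

// JSON.stringify without an indent argument emits no extra whitespace.
function minifyJsonLd(text) {
  return JSON.stringify(JSON.parse(text));
}

const minified = minifyJsonLd(prettyJsonLd);
console.log(minified);
// {"@context":"https://schema.org","@type":"Organization","name":"Example Co"}
```

This only makes the block harder to skim; anyone can pretty-print it again.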

Don’t mark up content that is not visible to readers of the page
One of Google's structured data quality guidelines is that users should be able to see the content you describe in your JSON-LD (so the idea of hiding this data, or making it harder to read for "normal users", does not make sense).
Don’t mark up content that is not visible to readers of the page. For
example, if the JSON-LD markup describes a performer, the HTML body
should describe that same performer. Google Quality guidelines
https://developers.google.com/search/docs/guides/sd-policies
By the way, "normal/average users" won't inspect your HTML source code (And developers have nothing to do with this specific JSON-LD information either).
Protect-javascript
If you insist, read the topics related to "protect-javascript" (this issue is not specific to schema JSON-LD):
How can I obfuscate (protect) JavaScript?
How do I protect javascript files?
Protect your JavaScripts from "view source"

Eliminate < > as accepted characters in a wordpress password?

Is it possible to eliminate these characters from a wordpress password? I have heard that it can open up scripts this way, that hackers can use to get in. Thank you.

Simple answer:
Your friend has misinformed you. Restricting these characters in a wordpress password is not something you need to worry about. But as they say "There is no smoke without fire".
More background information:
In your own web-application code, you should always be especially careful whenever you take any data from a user (whether from a form, a cookie, or a URL) or from another external computer system or application. The reason for this is that you want to avoid the values being interpreted as code and not just used as data.
The issue that has led your friend to worry about the <> characters is called Cross-Site Scripting and is a kind of attack that malicious users can perform to "inject" html or javascript content into your pages. If you accept information from the user that contains these html mark-up characters and re-display it on the same, or another page, then you can cause their html or javascript content to become part of your page. Any javascript content will run with access to the same data as the user that views the page.
Whenever outside data is read, it should always be
validated: i.e. checked that it looks like the kind of thing you are expecting, and rejected if it does not.
and encoded: i.e. when this data is displayed back to the user or sent to another part of the system, it is converted to be safe. The type of conversion always depends on how and where the data is being used.
Please note that the angle-bracket characters are not the only thing to worry about. Please also note that it is well proven that disallowing certain characters (also called "blacklisting") is never the best way to secure code. It is always safer to state what is allowed (also called "whitelisting").
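A minimal sketch of the "state what is allowed" approach, assuming two made-up fields (zip and username) with illustrative rules. The field names and patterns are assumptions for the example, not a complete policy.

```javascript
// Allowlist validation: describe exactly what is accepted, reject everything else.
// The rules below are illustrative assumptions.
const rules = new Map([
  ['zip', /^\d{5}(-\d{4})?$/],          // US ZIP or ZIP+4
  ['username', /^[A-Za-z0-9_]{3,20}$/], // letters, digits, underscore
]);

function validate(field, value) {
  const rule = rules.get(field);
  // Unknown fields and non-string values are rejected rather than guessed at.
  if (!rule || typeof value !== 'string') return false;
  return rule.test(value);
}

console.log(validate('zip', '90210'));                     // true
console.log(validate('zip', '<script>alert(1)</script>')); // false
console.log(validate('username', 'bob_42'));               // true
```

Validation of this kind complements encoding on output; it does not replace it.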

Do you HtmlEncode during input or output?

When do you call Microsoft.Security.Application.AntiXss.HtmlEncode? Do you do it when the user submits the information or do you do when you're displaying the information?
How about for basic stuff like First Name, Last Name, City, State, Zip?

You do it when you are displaying the information. Preserve the original as it was entered, convert it for display on a web page. Let's say you were displaying it in some other way, like exporting it into Excel. In that case, you'd want to export the preserved original.
Encode every single string.
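To make the "preserve the original, convert for display" idea concrete, here is a small sketch. The escapeHtml helper is a hand-rolled stand-in written for this example (it is not the AntiXss library), and the sample value is invented.

```javascript
// Minimal HTML encoder for text and quoted attribute values.
// Illustrative only; a vetted library is preferable in production.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// The stored value is kept exactly as the user typed it.
const storedLastName = "O'Brien <admin>";

// HTML output: encode at the moment of rendering.
const htmlCell = '<td>' + escapeHtml(storedLastName) + '</td>';

// CSV export for Excel: a different format with its own rule (double the quotes).
const csvCell = '"' + storedLastName.replace(/"/g, '""') + '"';

console.log(htmlCell); // <td>O&#39;Brien &lt;admin&gt;</td>
console.log(csvCell);  // "O'Brien <admin>"
```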

You should only encode or escape your data at the last possible moment, whether that's directly before you put it in the database, or display it on the screen. If you encode too soon, you run the risk of accidentally double encoding (you'll often see &amp; on newbies' websites - myself included).
If you do want to encode sooner than that, then take measures to avoid the double encoding. Joel wrote an article about good uses for Hungarian notation, where he advocated the use of prefixes to indicate what is stored in the variable, e.g. "us" for unsafe string, "ss" for safe string.
usFirstName = getUserInput('firstName');
ssFirstName = cleanString(usFirstName);
Also note that it doesn't matter what the type of information is (city, zip code, etc) - leaving any of these unchecked is asking for trouble.
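The double-encoding problem mentioned above is easy to reproduce. In this sketch escapeHtml is a hand-rolled helper assumed for the example, and the "us"/"ss" prefixes follow the naming convention described in the answer.

```javascript
// Illustrative HTML encoder (an assumption for this example, not a library API).
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const usCompany = 'Smith & Sons';            // "us": raw, unsafe user input
const ssCompany = escapeHtml(usCompany);     // "ss": encoded exactly once
const doubleEncoded = escapeHtml(ssCompany); // bug: encoding an "ss" value again

console.log(ssCompany);     // Smith &amp; Sons
console.log(doubleEncoded); // Smith &amp;amp; Sons
```

With the prefixes in place, a call like escapeHtml(ssCompany) looks wrong at a glance, which is the point of the convention.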

It depends on your situation. Where I work, for years the company did no HTML encoding, so when we started doing it, it would have been almost impossible to find every location within the system that user input could be displayed on the page.
Instead we chose to sanitize input on its way into the system since there were fewer input points than output points. We sanitize immediately before inputting data into the DB, although we don't use Microsoft's AntiXss library, we use a set of homebrew methods that whitelist ranges of HTML tags and characters depending on the type of input.
If you're designing the system from scratch, or you have a system that is small (or managed well) enough to encode output, follow Corey's suggestion. It's definitely the better way to do it.

Encoding is not a property of the data, it is a property of the transport mechanism. Therefore you should unencode data when you receive it, and encode it appropriately before transmission. The transport mechanism determines what sort of encoding is necessary.
This principle holds true whether your transport mechanism is HTML, HTTP, smoke signals, etc. The trick is knowing how to do the types of encoding manually, and when various frameworks do the steps for you automagically. For instance, ASP.NET will encode data assigned to a System.Web.UI.WebControls.Button's Text, but not text assigned to a System.Web.UI.WebControls.Literal's Text. jQuery will encode content you set with .text(), but not content you set with .html().
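A short sketch of the same raw value being encoded differently depending on where it is sent. The escapeHtml helper and the example.com URL are assumptions made up for the illustration; encodeURIComponent and JSON.stringify are standard JavaScript built-ins.

```javascript
// Illustrative HTML encoder (hand-rolled for this example).
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// One stored value, kept unencoded.
const raw = 'Tom & Jerry <3';

// Each transport gets its own encoding, applied just before sending.
const asHtml = '<p>' + escapeHtml(raw) + '</p>';
const asUrl = 'https://example.com/search?q=' + encodeURIComponent(raw);
const asJson = JSON.stringify({ name: raw });

console.log(asHtml); // <p>Tom &amp; Jerry &lt;3</p>
console.log(asUrl);  // https://example.com/search?q=Tom%20%26%20Jerry%20%3C3
console.log(asJson); // {"name":"Tom & Jerry <3"}
```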

How can we restrict the user from saving a web page?

How can we restrict a user from saving the page?
Please provide some tips to disable File->Save and View Source options
EDIT: Obviously it can't be done, and probably shouldn't be attempted. But possibly a more interesting variant on this question is how can we make it sufficiently hard for a user to save a page in a usable format such that it is not worth their while doing so? The question doesn't pose a value, but say we were protecting an article subscription site where the user is paying a few hundred dollars per annum for continued access to text.

Since the page has been sent to the client, there will always be a way to get that information. Trying to stop a user from doing this will only frustrate them.
The only way to have a user not be able to save a file is to not send it to them.

While the best answer is "Don't do this," there are ways to make it more difficult for them. And since the point of this site is to actually answer the question even if it's bad, here is the best way:
First you'll need to have the page open in a new window where you turn off the address bar and toolbar and everything else. That will make it so the user can't easily get to the File menu at all. To do this you'll need a "splash" page that the user loads to and then when they click a link, it opens the popup that serves the main content of your page. Details on how to create popups without things like the toolbar are here:
http://blazonry.com/javascript/windows.php
Then you'll want to add some javascript to each page that prevents the user from right clicking. Here is one method:
http://javascript.about.com/library/blnoright.htm
Finally, if it's your Javascript code that you don't want to be seen, then obfuscating your code is a pretty effective way to do that. They can still see the code if they have much know-how, but the obfuscated code would be a gigantic pain to actually interpret. There are lots of obfuscators out there; here is a free web-based one:
http://www.javascriptobfuscator.com/
This is far from foolproof. It will stop all "casual" users, but any power user will probably be able to easily figure out a way around it. Still if the idea is to at least prevent a good majority of it then this should suffice.
Update for updated question:
To address your new expanded question, I would say the best way to accomplish what you're saying is to use a format that supports DRM. Adobe Acrobat would probably be the best choice because almost everyone has the reader installed. You can prevent PDF files from being saved to the computer so that they can only be loaded from the webpage by a logged in user. The user could still do a screen capture of the document itself which I don't believe is preventable (unless Adobe Reader has some security in place for this, which they might) but it should be sufficient security for most uses.

Don't do it.
Seriously, if the user can see the page in their browser they can see the source code and/or save it to their computer.
You are fighting a losing battle here.

What about the browser's cache? It can be saved from there.
What about a print screen? That could also save the page.
The only way to prevent a user from saving something is to not show it to them in the first place.

It's really a waste of time and resources to try and do this in html as any method you use can be trivially circumvented.
Instead I would use some other technology to display the data. You can never get around a screen capture, but if, for instance, you're displaying text and you want to make it hard for the user to save that text for use elsewhere, then possible options include
PDF - which can disable save and print. There are extensions to most popular web languages that will write a PDF on the fly. Indeed, you might as well just go down the DRM route with Adobe and embed a document
Flash - most probably via Flex which could be used to write a general-purpose app to display text and images. The advantage of Flash is that it's easier to set up links than pdf.
Or something else, a custom java applet, or even a vrml plugin and display the text in 3D!
In all cases you could display text against a disruptive background to make OCR more difficult, and images could be watermarked. However nothing is going to stop a determined and resourceful viewer, although you can possibly make it sufficiently hard that it's not worth their time.

The least you can do is generate the content dynamically with JavaScript. That way, they cannot simply save it. Of course, in Firefox they can still view the generated code and then copy and paste it. However, normally people cannot save the page.

It shouldn't be an issue, but if you really don't want a user to see your code (JavaScript, CSS or HTML) for some reason, then you could use some obfuscation tool which makes the code less readable.

Try javascript "encoding" and obfuscation.
Something like
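if (document.location.hostname === 'mydomain.com') {
  // getAjax is a placeholder for whatever synchronous request helper you use
  var content = getAjax('mycontent.xml');
  // content holds comma-separated character codes, e.g. "72, 94, 81, 99, ..."
  var codes = content.split(',').map(Number);
  document.write(String.fromCharCode.apply(null, codes));
}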
It will always be possible to save the page, but for non-technical guys it will be harder to make it work.
There are 2 protections
domain name
converting ASCII
It's only pseudocode, but I think you get the idea.

Add these two snippets inside a script tag:
document.addEventListener('contextmenu', function (e) {
e.preventDefault();
});
document.onkeydown = function (e) {
return false;
};

I'd like to add one more method which, imho, is hard to circumvent: Ctrl+S! (for me, Apple+S)

how can we make it sufficiently hard for a user to save a page in a usable format such that it is not worth their while doing so
Nothing hard: add on every page: "Personal property of John Stealer, company Zetabeta, paid with credit card 756890987654, billing address ..., subscription expires 12/20".
This is an "extended text format" that I just invented... it has an amazing property: though it looks like a regular text, user is much less willing to print it out and give to others...

Will HTML Encoding prevent all kinds of XSS attacks?

I am not concerned about other kinds of attacks. Just want to know whether HTML Encode can prevent all kinds of XSS attacks.
Is there some way to do an XSS attack even if HTML Encode is used?

No.
Putting aside the subject of allowing some tags (not really the point of the question), HtmlEncode simply does NOT cover all XSS attacks.
For instance, consider server-generated client-side JavaScript: if the server dynamically outputs HTML-encoded values directly into the client-side JavaScript, HtmlEncode will not stop injected script from executing.
Next, consider the following pseudocode:
<input value=<%= HtmlEncode(somevar) %> id=textbox>
Now, in case it's not immediately obvious, if somevar (sent by the user, of course) is set for example to
a onclick=alert(document.cookie)
the resulting output is
<input value=a onclick=alert(document.cookie) id=textbox>
which would clearly work. Obviously, this can be (almost) any other script... and HtmlEncode would not help much.
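The same behaviour can be reproduced outside a template engine. In this sketch escapeHtml is a hand-rolled stand-in for HtmlEncode (an assumption for the example); the payload is the one from the answer.

```javascript
// Illustrative stand-in for HtmlEncode.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

const somevar = 'a onclick=alert(document.cookie)';

// The payload contains no HTML metacharacters, so encoding leaves it untouched.
const encoded = escapeHtml(somevar);
console.log(encoded === somevar); // true

// Unquoted attribute: the space ends the value and onclick becomes a new attribute.
const unquoted = '<input value=' + encoded + ' id=textbox>';

// Quoted attribute: the whole payload stays inside the value, and any
// double quote in the input would have been encoded as &quot;.
const quoted = '<input value="' + encoded + '" id="textbox">';

console.log(unquoted); // <input value=a onclick=alert(document.cookie) id=textbox>
console.log(quoted);   // <input value="a onclick=alert(document.cookie)" id="textbox">
```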
There are a few additional vectors to be considered... including the third flavor of XSS, called DOM-based XSS (wherein the malicious script is generated dynamically on the client, e.g. based on # values).
Also don't forget about UTF-7 type attacks - where the attack looks like
+ADw-script+AD4-alert(document.cookie)+ADw-/script+AD4-
Nothing much to encode there...
The solution, of course (in addition to proper and restrictive white-list input validation), is to perform context-sensitive encoding: HtmlEncoding is great IF your output context IS HTML, or maybe you need JavaScriptEncoding, or VBScriptEncoding, or AttributeValueEncoding, or... etc.
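As one example of a non-HTML context, here is a rough sketch of encoding a value for a JavaScript string literal inside an inline script block. The function name encodeForScript is made up for this illustration and is not the Anti-XSS Library's API; it relies only on the built-in JSON.stringify.

```javascript
// Encode a value as a JavaScript string literal for an inline <script> block.
// JSON.stringify handles quotes, backslashes and newlines; the extra replaces
// keep sequences such as "</script>" from ending the surrounding HTML element.
function encodeForScript(value) {
  return JSON.stringify(String(value))
    .replace(/</g, '\\u003c')
    .replace(/>/g, '\\u003e')
    .replace(/&/g, '\\u0026')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

// A payload that tries to close the string and the script element.
const name = '";alert(document.cookie);//</script>';
const line = 'var userName = ' + encodeForScript(name) + ';';
console.log(line);
```

The encoded literal still evaluates to the original string, so the data is preserved while the payload can no longer break out of its context.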
If you're using MS ASP.NET, you can use their Anti-XSS Library, which provides all of the necessary context-encoding methods.
Note that all encoding should not be restricted to user input, but also stored values from the database, text files, etc.
Oh, and don't forget to explicitly set the charset, both in the HTTP header AND the META tag, otherwise you'll still have UTF-7 vulnerabilities...
Some more information, and a pretty definitive list (constantly updated), check out RSnake's Cheat Sheet: http://ha.ckers.org/xss.html

If you systematically encode all user input before displaying it, you are still not 100% safe.
(See @Avid's post for more details)
In addition problems arise when you need to let some tags go unencoded so that you allow users to post images or bold text or any feature that requires user's input be processed as (or converted to) un-encoded markup.
You will have to set up a decision making system to decide which tags are allowed and which are not, and it is always possible that someone will figure out a way to let a non allowed tag to pass through.
It helps if you follow Joel's advice of Making Wrong Code Look Wrong or if your language helps you by warning/not compiling when you are outputting unprocessed user data (static-typing).

If you encode everything it will (depending on your platform and the implementation of htmlencode). But any useful web application is so complex that it's easy to forget to check every part of it. Or maybe a third-party component isn't safe. Or maybe some code path that you thought did encoding didn't do it, so you forgot it somewhere else.
So you might want to check things on the input side too. And you might want to check stuff you read from the database.

As mentioned by everyone else, you're safe as long as you encode all user input before displaying it. This includes all request parameters and data retrieved from the database that can be changed by user input.
As mentioned by Pat, you'll sometimes want to display some tags, just not all tags. One common way to do this is to use a markup language like Textile, Markdown, or BBCode. However, even markup languages can be vulnerable to XSS, so be careful.
# Markup example
[foo](javascript:alert\('bar'\);)
If you do decide to let "safe" tags through I would recommend finding some existing library to parse & sanitize your code before output. There are a lot of XSS vectors out there that you would have to detect before your sanitizer is fairly safe.
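For the markup-link vector shown above, one possible check is to accept only link targets whose scheme is on an explicit list. This is a sketch under assumptions: the allowed schemes, the placeholder base URL and the name isSafeLinkTarget are all invented for the example, and it covers only this one vector, not sanitizing in general.

```javascript
// Schemes accepted for user-supplied links (an assumed policy).
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'mailto:']);

function isSafeLinkTarget(href) {
  let url;
  try {
    // Resolve relative links against a placeholder base so they parse too.
    url = new URL(href, 'https://example.invalid/');
  } catch {
    return false; // unparseable targets are rejected
  }
  return ALLOWED_PROTOCOLS.has(url.protocol);
}

console.log(isSafeLinkTarget("javascript:alert('bar');")); // false
console.log(isSafeLinkTarget('https://example.com/page')); // true
console.log(isSafeLinkTarget('/relative/path'));           // true
```

Using the URL parser rather than a string prefix test means variations in letter case are normalized before the scheme is compared.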

I second metavida's advice to find a third-party library to handle output filtering. Neutralizing HTML characters is a good approach to stopping XSS attacks. However, the code you use to transform metacharacters can be vulnerable to evasion attacks; for instance, if it doesn't properly handle Unicode and internationalization.
A classic simple mistake homebrew output filters make is to catch only < and >, but miss things like ", which can break user-controlled output out into the attribute space of an HTML tag, where Javascript can be attached to the DOM.

No, just encoding common HTML tokens DOES NOT completely protect your site from XSS attacks. See, for example, this XSS vulnerability found in google.com:
http://www.securiteam.com/securitynews/6Z00L0AEUE.html
The important thing about this type of vulnerability is that the attacker is able to encode his XSS payload using UTF-7, and if you haven't specified a different character encoding on your page, a user's browser could interpret the UTF-7 payload and execute the attack script.

One other thing you need to check is where your input comes from. You can use the referrer string (most of the time) to check that it's from your own page, but putting in a hidden random number or something in your form and then checking it (with a session set variable maybe) also helps knowing that the input is coming from your own site and not some phishing site.
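The hidden random number idea could look roughly like this in Node.js. The session object is a plain stand-in for a real session store, and the function names are made up for the sketch; crypto.randomBytes and crypto.timingSafeEqual are Node's built-in crypto functions.

```javascript
const crypto = require('crypto');

// Create an unpredictable token; keep one copy in the user's session
// and emit the other as a hidden field in the form.
function createFormToken() {
  return crypto.randomBytes(32).toString('hex');
}

// Compare the submitted token with the session copy in constant time.
function isValidFormToken(sessionToken, submittedToken) {
  if (typeof sessionToken !== 'string' || typeof submittedToken !== 'string') {
    return false;
  }
  const a = Buffer.from(sessionToken);
  const b = Buffer.from(submittedToken);
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

// Plain object standing in for a real session store.
const session = { formToken: createFormToken() };

console.log(isValidFormToken(session.formToken, session.formToken)); // true
console.log(isValidFormToken(session.formToken, 'forged'));          // false
```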

I'd like to suggest HTML Purifier (http://htmlpurifier.org/). It doesn't just filter the HTML, it basically tokenizes and re-compiles it. It is truly industrial-strength.
It has the additional benefit of allowing you to ensure valid html/xhtml output.
Also n'thing Textile; it's a great tool and I use it all the time, but I'd run it through HTML Purifier too.
I don't think you understood what I meant re tokens. HTML Purifier doesn't just 'filter', it actually reconstructs the html. http://htmlpurifier.org/comparison.html

I don't believe so. HTML encoding converts all functional characters (characters which could be interpreted by the browser as code) into entity references, which the browser does not parse as markup and thus cannot execute.
&lt;script/&gt;
There is no way that the above can be executed by the browser.
(Unless there is a bug in the browser, of course.)

myString.replace(/<[^>]*>?/gm, '');
I use this successfully.
See: Strip HTML from Text JavaScript
