Is there any HTML sanitizer or cleanup methods available in any JSF utilities kit or libraries like PrimeFaces/OmniFaces?
I need to sanitize HTML input by user via p:editor and display safe HTML output using escape="true", following the stackexchange style. Before displaying the HTML I'm thinking to store sanitized input data to the database, so that it is ready to safe use with escape="true" and XSS is not a danger.
In order to achieve that, you basically need a standalone HTML parser. HTML parsing is rather complex and the task and responsibility of that is beyond the scope of JSF, PrimeFaces and OmniFaces. You're supposed to just grab one of the many existing HTML parsing libraries.
An example is Jsoup, it has even a separate method for the particular purpose of sanitizing HTML against a Safelist: Jsoup#clean(). For example, if you want to allow some basic HTML without images, use Safelist.basic():
String sanitizedHtml = Jsoup.clean(rawHtml, Safelist.basic());
A completely different alternative is to use a specific text formatting syntax, such as Markdown (which is also used here). Basically all of those parsers also sanitize HTML under the covers. An example is CommonMark. Perhaps this is what you actually meant when you said "stackexchange style".
As to saving in DB, you'd better save both the raw and parsed forms in 2 separate text columns. The raw form should be redisplayed during editing. The parsed form should be updated in background when the raw form has been edited. During display, obviously only show the parsed form with escape="false".
See also:
Markdown or HTML
Related
Is there any HTML sanitizer or cleanup methods available in any JSF utilities kit or libraries like PrimeFaces/OmniFaces?
I need to sanitize HTML input by user via p:editor and display safe HTML output using escape="true", following the stackexchange style. Before displaying the HTML I'm thinking to store sanitized input data to the database, so that it is ready to safe use with escape="true" and XSS is not a danger.
In order to achieve that, you basically need a standalone HTML parser. HTML parsing is rather complex and the task and responsibility of that is beyond the scope of JSF, PrimeFaces and OmniFaces. You're supposed to just grab one of the many existing HTML parsing libraries.
An example is Jsoup, it has even a separate method for the particular purpose of sanitizing HTML against a Safelist: Jsoup#clean(). For example, if you want to allow some basic HTML without images, use Safelist.basic():
String sanitizedHtml = Jsoup.clean(rawHtml, Safelist.basic());
A completely different alternative is to use a specific text formatting syntax, such as Markdown (which is also used here). Basically all of those parsers also sanitize HTML under the covers. An example is CommonMark. Perhaps this is what you actually meant when you said "stackexchange style".
As to saving in DB, you'd better save both the raw and parsed forms in 2 separate text columns. The raw form should be redisplayed during editing. The parsed form should be updated in background when the raw form has been edited. During display, obviously only show the parsed form with escape="false".
See also:
Markdown or HTML
I have some HTML entered by the user which is displayed in a Yesod template. I would like to transform this HTML, stripping out style attributes from it before it gets rendered, but i cannot find out how.
If my template contains #{ html } i can pass html as a value through a function simply writing #{ transform html }, if the transform function has a signature: transform :: Html -> Html where Html is the type defined by blaze-html here. The problem i see is that Blaze does not seem to expose functionalities useful in order to walk an HTML tree, or even just get the descendents of a given Html. So which strategies would you suggest? Should i try to get into the Blaze internals?
I am not sure whether this should be considered purely an issue with Blaze. Transforming Html elements is not one of the main use cases of Blaze, so this problem needs to be tackled in the context of rendering with Yesod
You have to render to Text or ByteString first, blaze provides no means of analyzing content. Then you can process the data with a library like html-conduit or tagsoup (which is what xss-sanitize does).
From the tutorial
But there's a problem! Our rendered comments look like this in the
browser: "<p>This is <em>another</em> comment</p>". We want those tags
to actually render as HTML.
That's React protecting you from an XSS attack. There's a way to get
around it but the framework warns you not to use it:
...
<span dangerouslySetInnerHTML={{__html: rawMarkup}} />
This is a special API that intentionally makes it difficult to insert raw HTML, but for Showdown we'll take advantage of this backdoor.
Remember: by using this feature you're relying on Showdown to be secure.
So there exists an API for inserting raw HTML, but the method name and the docs all warn against it. Is it safe to use this? For example, I have a chat app that takes Markdown comments and converts them to HTML strings. The HTML snippets are generated on the server by a Markdown converter. I trust the converter, but I'm not sure if there's any way for a user to carefully craft Markdown to exploit XSS. Is there anything else I should be doing to make sure this is safe?
Most Markdown processors (and I believe Showdown as well) allow the writer to use inline HTML. For example a user might enter:
This is _markdown_ with a <marquee>ghost from the past</marquee>. Or even **worse**:
<script>
alert("spam");
</script>
As such, you should have a whitelist of tags and strip all the other tags after converting from markdown to html. Only then use the aptly named dangerouslySetInnerHTML.
Note that this also what Stackoverflow does. The above Markdown renders as follows (without you getting an alert thrown in your face):
This is markdown with a ghost from the past. Or
even worse:
alert("spam");
There are three reasons it's best to avoid html:
security risks (xss, etc)
performance
event listeners
The security risks are largely mitigated by markdown, but you still have to decide what you consider valid, and ensure it's disallowed (maybe you don't allow images, for example).
The performance issue is only relevant when something will change in the markup. For example if you generated html with this: "Time: <b>" + new Date() + "</b>". React would normally decide to only update the textContent of the <b/> element, but instead replaces everything, and the browser must reparse the html. In larger chunks of html, this is more of a problem.
If you did want to know when someone clicks a link in the results, you've lost the ability to do so simply. You'd need to add an onClick listener to the closest react node, and figure out which element was clicked, delegating actions from there.
If you would like to use Markdown in React, I recommend a pure react renderer, e.g. vjeux/markdown-react.
I have questions on preventing XSS attacks.
1) Question:
I have an HTML template as Javascript string (trusted) and insert content coming from a server request (untrusted). I replace placeholders within that HTML template strings with that untrusted content and output it to the DOM using innerHTML/Text.
In particular I insert texts that I output in <div> and <p> tags that are already present in the template HTML string and form element values, i.e. texts in input tag's value attribute, select option and textarea tags.
Do I understand correctly that I can treat every inserted text mentioned above as HTML subcontext thus I only encode like so: encodeForJavascript( encodeForHTML( inserted_text ) ). Or do I have to encode the texts that I insert into value attributes of the input fields for the HTML Attribute subcontext?
After reading up on this issue on OWASP I am inclined to think that latter is only necessary in case I set the attribute with unstrusted content via Javascript like so: document.forms[ 0 ].elements[ 0 ].value = encodeForHTMLAttribute, is that correct?
2) Question:
What is the added value of server side encoding server responses that enter the client side via Ajax and get handled anyway (like in question 1). In addition, don't we risk problems when double encoding the content?
Thanks
You need to encode for the context in question, so to data inserted into html context needs to be encoded for html, and data inserted into html attributes, should be html attribute encoded. This is addition to the javascript encoding you mentioned.
I would javascript encode for transfer and then encode for the correct context client side, where I know which context is the right one.
At what stage does my input in the textarea change from being this raw text, and become HTML?
For example, say I indent 4 spaces
like this
Then the WMD Showdown.js will render it properly below this textarea I type in. But the text area still literally contains
like this
So is PHP server side responsible for translating all the same things the showdown.js does to permanently be HTML in the SoF Database?
There are some other posts here about this, but basically it works like this. Or at least this is how I do it on my website using WMD; see my profile if you're interested in checking out my WMD implementation.
User enters the Markdown on the client, and showdown.js runs in real time in the browser (pure client-side JavaScript; no AJAX or anything like that) to give the user the preview.
Then when the user posts to the server, WMD sends the Markdown (you have to configure WMD to do this though; by default WMD sends HTML).
Run showdown.js server-side to convert the Markdown to HTML. In theory you could use some other method but it makes sense to try to get the same transformation on the server that the user sees on the client, other than any HTML tag filtering you want to do server-side.
As just noted, you'll need to do appropriate HTML tag filtering to avoid cross-site scripting (XSS) issues. This is both important and nontrivial, so be careful.
Save both the Markdown and the HTML in the database—the Markdown because if users want to edit their posts, you want to give them the Markdown, and the HTML so you don't have to transform Markdown to HTML every time you display answers.
Here are some related posts.
Convert HTML back to Markdown for editing in wmd: Tells how to configure WMD to send Markdown to the server instead of HTML.
What HTML tags are allowed on Stack Overflow?: Useful for thinking about HTML tag filtering.
Well first of all StackOverflow is built on ASP.NET, but yes essentially the characters in the rich text box gets translated back and forth.