allow twig syntax in htmlpurifier and/or don't tidy

I am running some templates that my visitors write in Twig through HTML Purifier, but it keeps trying to fix the HTML code.
For example, I have this:
<ul>
{% for update in jobupdates %}
<li>
{{ update.comment|nl2br }}
</li>
{% endfor %}
</ul>
and it will turn that into:
<ul><li>
{% for update in jobupdates %}
</li><li>
{{ update.comment|nl2br }}
{% endfor %}
</li></ul>
which totally breaks everything.
I have tried setting the option 'HTML.TidyLevel' to 'none', but it still does it.
Is there a way to stop HTML Purifier from trying to fix the HTML code, or to make it ignore Twig syntax?

Background
Is HTML Purifier the right tool for the job you want to do?
Your problem essentially boils down to that HTML Purifier is designed to sanitise HTML, whereas you are feeding it Twig, a templating mark-up language. It contains some HTML, but that's not the same thing as being HTML (much like HTML can contain plain text, but is not the same thing as plain text).
Why is this happening?
The reason it's doing what you're observing is that much of HTML Purifier's strength in the sanitising department comes from being strict about the structure of the HTML that is fed to it. That way, exploits that depend on implementation details in browsers which lie outside the standard (such as, in this case, how to handle text in an unordered list (<ul>) that is not a list item (<li>)) are also taken care of, reducing the attack surface.
In this particular case, the chance that anything would break by allowing this constellation is so small as to be negligible, but one can imagine other constellations where it does matter (e.g. imagine someone writing <img>some payload here</img> - that makes no sense in HTML, and I know of no exploit in the wild right now that looks anything like this, but one could imagine a browser trying to get clever with it).
Either way, it's an integral part of HTML Purifier and you can't simply turn it off, as all of the sanitation rules HTML Purifier has essentially exist on top of having well-formed HTML, for the mentioned reason.
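To make the structural complaint concrete, here is a minimal sketch in Python. It uses the standard library's html.parser as a stand-in for a strict HTML parser; it is not HTML Purifier, and the class name UlChildChecker and the rendered sample are made up for illustration. It lists text that sits directly inside <ul>, which is how the Twig tags look to an HTML tool, and shows that the rendered output no longer has that problem.

```python
from html.parser import HTMLParser


class UlChildChecker(HTMLParser):
    """Collects non-whitespace text that sits directly inside a <ul>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.stray = []   # text found directly under <ul>

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore unmatched end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "ul" and data.strip():
            self.stray.append(data.strip())


def stray_text_in_ul(markup):
    checker = UlChildChecker()
    checker.feed(markup)
    checker.close()
    return checker.stray


# The template from the question: Twig tags are plain text to an HTML parser.
template = (
    "<ul>\n{% for update in jobupdates %}\n<li>\n"
    "{{ update.comment|nl2br }}\n</li>\n{% endfor %}\n</ul>"
)
# Hypothetical output after Twig has expanded the loop for two updates.
rendered = "<ul>\n<li>\nfirst\n</li>\n<li>\nsecond\n</li>\n</ul>"

print(stray_text_in_ul(template))
print(stray_text_in_ul(rendered))
```

The template yields two stray text nodes (the for and endfor tags), while the rendered list yields none, which is why purifying after rendering avoids the rewrite.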
Solutions
A: Question the use-case
Depending on what your ultimate use-case for sanitation is, the solution may be as simple as putting the purification after your Twig template has been turned into HTML, but before the result is displayed on the page. This has the added benefit of purifying e.g. the comments that are injected into your template.
That said, this may have no relation to what you're actually hoping to achieve.
B: Use a different tool
If all you want to do is tidy the HTML in your templates rather than sanitise it, you may want to look into a different tool. I have no experience with tools that just tidy HTML and they may have the same shortfalls (even just wanting to produce valid HTML is going to have that effect - but perhaps there are tools out there which only indent the tags and fix up obvious tag errors like removing a stray </img> somewhere).
If you want to sanitise your HTML, you can try a different tool as well. Take a look at http://htmlpurifier.org/comparison for some ideas.
C: Alter HTML Purifier's HTML definition
You can fork HTML Purifier and make changes to its understanding of HTML. This is really only feasible if the example in your post does not have many cousins, i.e. if there are not many completely different constellations where the insistence on well-formed HTML gets in the way. In the example you mentioned, this likely requires digging into the guts of HTMLPurifier_HTMLModule_List and HTMLPurifier_ChildDef_List, specifically into the else-block in validateChildren() from the latter class, but I have no proof-of-concept on hand right now.
Keep in mind what you'd be doing here is essentially turning the HTML definition that HTML Purifier works with into a rudimentary Twig definition. Not only is that potentially a lot of work (depending on how much you want to teach it), it's probably not actually what you want to do.
Conclusion
I'd recommend asking yourself a few questions and taking action based on the answers (the information in brackets exists as a guide to those actions, the thoughts there are not exhaustive):
Is it essential for you to have clean templates or clean output? (If templates, HTML Purifier can't help you, as it's not made for Twig; if output, HTML Purifier can help you.)
Do you want to prevent XSS attacks? (If you do, HTML Purifier can help you, but only after Twig has done its thing and constructed HTML for it to analyse.)
Do you want to catch invalid HTML declarations? (If you do, again HTML Purifier can help you, but also only after Twig has done its thing.)
Do you want to catch invalid Twig declarations? (If you do, HTML Purifier cannot help you - it might make sense to look for a Twig-specific validation tool.)
There are other questions you can ask yourself, but I hope those provide a useful starting point.

Related

What text processing tool is recommended for parsing screenplays?

I have some plain-text kinda-structured screenplays, formatted like the example at the end of this post. I would like to parse each into some format where:
It will be easy to pull up just stage directions that deal with a specific place.
It will be easy to pull up just dialogue belonging to a particular character.
The most obvious approach I can think of is using sed or perl or php to put div tags around each block, with classes representing character, location, and whether it's stage directions or dialogue. Then, open it up as a web-page and use jQuery to pull out whatever I'm interested in. But this sounds like a roundabout way to do it and maybe it only seems like a good idea because these are tools I'm accustomed to. But I'm sure this is a recurring problem that's been solved before, so can anybody recommend a more efficient workflow that can be used on a Linux box? Thanks.
Here is some sample input:
SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
Here is what sample output might look like:
<div class='scene somewhere_corporation'>
<div class='comment'>OPTIONAL COMMENT</div>
<div class='direction'>A guy named BOB is sitting at his computer.</div>
<div class='dialogue bob'>Mmmm. Stackoverflow. I like.</div>
<div class='direction'>Footsteps are heard approaching.</div>
<div class='dialogue alice'>Where's that report you said you'd have for me?</div>
<div class='direction'>Closeup of clock ticking.</div>
<div class='comment bob'>looking up</div>
<div class='dialogue bob'>Huh? What?</div>
<div class='dialogue alice'>Some more dialogue.</div>
<div class='direction'>Some more stage directions.</div>
</div>
I'm using DOM as an example, but again, only because that's something I understand. I'm open to whatever is considered a best practice for this type of text-processing task if, as I suspect, roll-your-own regexps and jQuery is not the best practice. Thanks.

You could use Celtx to import plain text scripts and export them to HTML (and RDF/XML for the metadata) (see this related thread and this blog post, which describes the file structure).
Other screenplay editors like Trelby might offer this feature, too.
There is also Fountain, a plain text markup language for screenwriting. The project offers libraries that you might be able to use for this (I did not check whether they cover importing and converting):
Fountain is free and open-source, with libraries that make it easy to add support in your apps.
Even if those projects can’t be used for your cause, you could at least reuse their format for your output.

If your input is not too noisy, i.e. if you can trust some regularities such as the indentation being larger for dialogue than for comments, I would use a simple context-free grammar. There are good implementations in all languages and you'll find a lot of information on SO.
If your input varies a lot, then take the machine learning route, but you'll need to have a big number of inputs with human-validated output for training, which might be a hassle.
In any case, I would never, ever use regular expressions for problems like that.
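As a rough illustration of relying on such regularities, here is a minimal line-based sketch in Python that uses only string methods. It is not a full grammar, and it bakes in several assumptions taken from the sample input only: a scene heading is an all-caps line containing " - ", a character cue is an all-caps name with an optional parenthetical, and dialogue is exactly the one line after a cue. The Block class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str            # "scene", "comment", "direction" or "dialogue"
    text: str
    character: str = ""  # set for dialogue and for parenthetical comments


def is_cue(line):
    """Assumption: a cue is an all-caps name, optionally followed by '(note)'."""
    name = line.partition("(")[0].strip()
    return bool(name) and name.isupper()


def parse(text):
    blocks = []
    speaker = None  # set right after a cue, cleared after one dialogue line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if speaker is not None:
            blocks.append(Block("dialogue", line, speaker))
            speaker = None
        elif line.isupper() and " - " in line:
            place, comment = line.split(" - ", 1)
            blocks.append(Block("scene", place))
            blocks.append(Block("comment", comment))
        elif is_cue(line):
            name, _, note = line.partition("(")
            speaker = name.strip().lower()
            if note:
                blocks.append(Block("comment", note.rstrip(") "), speaker))
        else:
            blocks.append(Block("direction", line))
    return blocks


SAMPLE = """SOMEWHERE CORPORATION - OPTIONAL COMMENT
A guy named BOB is sitting at his computer.
BOB
Mmmm. Stackoverflow. I like.
Footsteps are heard approaching.
ALICE
Where's that report you said you'd have for me?
Closeup of clock ticking.
BOB (looking up)
Huh? What?
ALICE
Some more dialogue.
Some more stage directions.
"""

blocks = parse(SAMPLE)
bob_lines = [b.text for b in blocks if b.kind == "dialogue" and b.character == "bob"]
directions = [b.text for b in blocks if b.kind == "direction"]
print(bob_lines)
print(directions)
```

Once the blocks are plain records like this, pulling out one character's dialogue or all stage directions is a list filter, and the same records could be written out as HTML, JSON or anything else.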

What will be the URLs of my expressionengine site, when using stash

I'm trying to understand how the use of the Stash plugin will affect the URLs of my site.
The traditional way:
I have a template group called site. Within the TG site I have the templates articles, about_us, etc.
The URL for a single entry will be
www.mysite.com/index.php/site/articles/title_of_respective_article
URL for the About-us-page:
www.mysite.com/index.php/site/about_us
Both will reflect the template_group/template structure and thus be SEO-friendly and give users a hint where they are on the site.
But when I use Stash I will have two wrappers (one for the homepage and one for the rest of the site).
Partials will be used for the header, main content and footer.
As far as I understand it, I'll use the template group layout for the wrappers and the template group partials for the main content.
The templating looks like this:
Two wrappers make up the TG "layout". Both are hidden, since they shouldn't be called directly.
layout
.homepage
.site
Three partials in the TG partials
partials
header
main_content
footer
And by the way, shouldn't those also be hidden, since they aren't complete HTML pages either?
This is what confuses me. How do I get my nice URLs back?
A URL like
www.mysite.com/index.php/site/about_us
will not match the TG/T concept anymore.
Any help?

To expand on both their answers above, and just to be specific to your www.mysite.com/index.php/site/about_us request:
You'd create a template group called "site" and then you may alternatively have something like this code in the /index template
{embed="layout/.site"}
{exp:channel:entries limit="1" disable="categories|member_data|pagination"}
{exp:stash:set name='title'}{title}{/exp:stash:set}
{exp:stash:set name='maincontent'}
<section>
<h1>{title}</h1>
<article>{content}</article>
</section>
{/exp:stash:set}
{/exp:channel:entries}
The embed calls the .site layout and the interior simply pulls your specific channel:entries data.
As you can see, it's still using the traditional templategroup/template ways of building URLs, it's just pulling data differently.

When using Stash and the template partials approach (which I don't use personally), the files you mention are all embedded. You still use the same template groups and template files as before.
The Stash-based approach is simply a different way of doing things within your existing templates - not a replacement for them.

Exactly as Derek says. The way to think about it is this: with the template partials approach, your templates contain mostly (if not only) the entries logic (the channel entries loop, its parameters, what custom fields are in play for that channel, etc). The outcome of the logic gets stored as Stash variables. The Stash variables then get called upon by your embedded layout templates to display the content you've stashed. So your URL structure remains the same, but you have considerably less duplication of effort, since the more you constrain your templates to logic (i.e. very little if any formatting/display markup), the cleaner they are and the easier it is to manage them.

Any way in Expression Engine to simulate Wordpress' shortcode functionality?

I'm relatively new to Expression Engine, and as I'm learning it I am seeing some stuff missing that WordPress has had for a while. A big one for me is shortcodes, since I will use these to allow CMS users to place more complex content in place with their other content.
I'm not seeing any real equivalent to this in EE, apart from a forthcoming plugin that's in private beta.
As an initial test I'm attempting to fake shortcodes by using delimited strings (e.g. #foo#) in the content field, then using a regex to pull those out and pass them to a function that can retrieve the content out of EE's database.
This brings me to a second question, which is that in looking at EE's API docs, there doesn't appear to be a simple means of retrieving the channel entries programmatically (thinking of something akin to WP's built-in get_posts function).
So my questions are:
a) Can this be done?
b) If so, is my method of approaching it reasonable? Or is there something stupidly obvious I'm missing in my approach?
To reiterate, my main objective here is to have some means of allowing people managing content to drop a code in place in their content that will be replaced with channel content.
Thanks for any advice or help you can give me.

Here's a simple example of the functionality you're looking for.
1) Start by installing Low Replace.
2) Create two Global Variables called gv_hello and gv_goodbye with the values "Hello" and "Goodbye" respectively.
3) Put this text into the body of an entry:
[say_hello]
Nice to see you.
[say_goodbye]
4) Put this into your template, wrapping the Low Replace tag around your body field.
{exp:low_replace
find="[say_hello]|[say_goodbye]"
replace="{gv_hello}|{gv_goodbye}"
multiple="yes"
}
{body}
{/exp:low_replace}
5) It should output this into your browser:
Hello
Nice to see you.
Goodbye
Obviously, this is a really simple example. You can put full blown HTML into your global variable. For example, we've used that to render a complex, interactive graphic that isn't editable but can be easily dropped into a page by any editor.
Unfortunately, due to parse order issues, EE tags won't work inside Global Variables. If you need EE tags in your short code output, you'll need to use Low Variables addon instead of Global Variables.
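The find-and-replace idea behind this answer can be sketched outside of ExpressionEngine. The Python below is only an illustration of the technique, not EE or Low Replace code: the snippets dictionary stands in for the Global Variables, and the names SHORTCODE and expand are made up.

```python
import re

# Matches tokens such as [say_hello]; the bracket syntax mirrors the example above.
SHORTCODE = re.compile(r"\[([a-z0-9_]+)\]")


def expand(body, snippets):
    """Replace each known [code] with its snippet; leave unknown codes untouched."""
    def substitute(match):
        return snippets.get(match.group(1), match.group(0))
    return SHORTCODE.sub(substitute, body)


snippets = {"say_hello": "Hello", "say_goodbye": "Goodbye"}
body = "[say_hello]\nNice to see you.\n[say_goodbye]"
print(expand(body, snippets))
```

Leaving unknown codes in place, rather than deleting them, makes a typo in an entry visible to the editor instead of silently dropping text.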

Continued from the comment:
Do you have examples of the kind of shortcodes you want to support or include? I doubt that controlling the page layout from a text field or WYSIWYG field is the way to go.
If you want editors to be able to adjust the layout or show and hide extra parts of the page, giving them access to some extra fields in the channel is (IMO) much more manageable and future-proof. For instance, some select fields, a relationship (or Playa) field, or a Matrix field would let them choose which parts to include or exclude on a page, or which entry from another channel to pull content from.
As said in the comment: I totally understand if you want to replace some #foo# tags with images or data from another field (see the other answers: nsm-transplant, low_replace). But giving an editor access to shortcodes and picking them out is like writing a template engine to generate EE template code for the EE template engine.
Using some custom fields to let editors pick and choose parts to embed is, I think, much more manageable.
That being said, you could write a plugin that parses the shortcodes out of a textarea's content, and then write a fair amount of code to fetch data from the other modules you want to support. For channel entries you could build on the Channel Data library by Objective HTML: https://github.com/objectivehtml/Channel-Data

I hear you, I too miss shortcodes from WP -- though the reason they work so easily there is the ubiquity of the_content(). With the great flexibility of EE comes fewer blanket solutions.
I'd suggest looking at NSM Transplant. It should fit the bill for you.

There is also a plugin called Shortcode, which you can find on Devot-ee.
A quote from the page:
Shortcode aims to allow for more dynamic use of content by authors and
editors, allowing for injection of reusable bits of content or even
whole pieces of functionality into any field in EE

BB Code versus restricted HTML

Are there any security risks in allowing (whitelist only) pure markup tags such as a, b, i, etc. in post submissions?
BB code seems like a heavy solution to the problem of code injection, and whitelisting "safe" HTML tags seems easier than going through all the parsing and conversion that BB code requires.
I have found that many BB code libraries have issues with nested elements (is this because they use an FSA or regex instead of a proper parser?), whereas blockquote or fieldset are properly parsed by the web browser.
Any and all opinions are greatly appreciated.

This is something everyone seems to get wrong, while it is so simple.
Use a parser
It doesn't matter whether you use markdown, html, bbcode, whatever.
Use a parser. A real parser. Not a bunch of regexes.
The parser gives you a syntaxtree. From the syntaxtree you derive the html (still as a tree of objects). Clean the tree (using a whitelist), print the html.
Using html as syntax is perfectly fine. Just don't try to clean it with regexes.

There is nothing wrong with using HTML as long as you:
Use a proper HTML parser to process the input.
Whitelist the tags so that only things you want get through.
Whitelist the attributes on the tags. This includes parsing and whitelisting things inside style attributes if you want to allow style (and, of course, using a real CSS parser for the style attributes).
Rewrite the HTML while you parse it.
The last point is mostly about getting consistent and correct HTML output. Your parser should take care of figuring out the usual confusion (such as incorrectly nested tags) that you find in hand written HTML.
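The four points above can be sketched with Python's standard html.parser. This is a minimal illustration under stated assumptions, not a production sanitizer: the ALLOWED table, the SAFE_SCHEMES tuple and the Sanitizer class are made up for this example, style attributes are simply not allowed, and a maintained library is still the better choice for real input.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tag name -> attribute names permitted on that tag.
ALLOWED = {"a": {"href"}, "b": set(), "i": set(), "blockquote": set()}
SAFE_SCHEMES = ("http://", "https://")


class Sanitizer(HTMLParser):
    """Parses input and rewrites it from scratch, keeping only whitelisted markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is still escaped
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close everything opened after the matching tag to fix bad nesting.
        if tag in self.open_tags:
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:  # close anything the author left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


print(sanitize('<a href="javascript:alert(1)" onclick="x()">x</a>'))
print(sanitize("<b><i>nested</b></i>"))
```

Because the output is rebuilt from the parsed events rather than edited in place, anything the whitelist does not name cannot survive by accident.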

Will HTML Encoding prevent all kinds of XSS attacks?

I am not concerned about other kinds of attacks. Just want to know whether HTML Encode can prevent all kinds of XSS attacks.
Is there some way to do an XSS attack even if HTML Encode is used?

No.
Putting aside the subject of allowing some tags (not really the point of the question), HtmlEncode simply does NOT cover all XSS attacks.
For instance, consider server-generated client-side JavaScript: if the server dynamically outputs HTML-encoded values directly into the client-side JavaScript, HtmlEncode will not stop injected script from executing.
Next, consider the following pseudocode:
<input value=<%= HtmlEncode(somevar) %> id=textbox>
Now, in case it's not immediately obvious, if somevar (sent by the user, of course) is set for example to
a onclick=alert(document.cookie)
the resulting output is
<input value=a onclick=alert(document.cookie) id=textbox>
which would clearly work. Obviously, this can be (almost) any other script... and HtmlEncode would not help much.
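The same effect can be reproduced with a short sketch in Python, where html.escape plays the role of HtmlEncode and html.parser stands in for the browser's attribute parsing (both are stand-ins chosen for illustration, and AttrCollector and parsed_attrs are made-up names). The payload contains no characters that HTML encoding touches, so only quoting the attribute keeps it inside value.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the attributes of the last start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs = dict(attrs)


def parsed_attrs(markup):
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs


somevar = "a onclick=alert(document.cookie)"

# Unquoted attribute: encoding changes nothing, the payload becomes a new attribute.
unquoted = f"<input value={escape(somevar)} id=textbox>"
# Quoted attribute: the whole payload stays inside the value.
quoted = f'<input value="{escape(somevar)}" id="textbox">'

print(parsed_attrs(unquoted))
print(parsed_attrs(quoted))
```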
There are a few additional vectors to be considered... including the third flavor of XSS, called DOM-based XSS (wherein the malicious script is generated dynamically on the client, e.g. based on # values).
Also don't forget about UTF-7 type attacks - where the attack looks like
+ADw-script+AD4-alert(document.cookie)+ADw-/script+AD4-
Nothing much to encode there...
The solution, of course (in addition to proper and restrictive white-list input validation), is to perform context-sensitive encoding: HtmlEncoding is great IF your output context IS HTML, or maybe you need JavaScriptEncoding, or VBScriptEncoding, or AttributeValueEncoding, or... etc.
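As a rough sketch of what context-sensitive encoding means in practice, here is one small encoder per output context using only the Python standard library. The function names are made up for illustration, and the set of contexts is not exhaustive.

```python
import json
from html import escape
from urllib.parse import quote


def for_html_text(value):
    """Text between tags: escape &, < and >."""
    return escape(value, quote=False)


def for_html_attr(value):
    """Attribute value: escape quotes too, and always emit the surrounding quotes."""
    return '"' + escape(value, quote=True) + '"'


def for_url_param(value):
    """A single query-string component: percent-encode everything reserved."""
    return quote(value, safe="")


def for_js_string(value):
    """A string literal inside an inline <script> block.

    json.dumps handles quotes and backslashes; <, > and & are additionally
    written as \\uXXXX escapes so the value cannot close the script element.
    """
    literal = json.dumps(value)
    return literal.replace("<", "\\u003c").replace(">", "\\u003e").replace("&", "\\u0026")


user_input = '</script><b>"x"&'
print(for_html_text(user_input))
print(for_html_attr(user_input))
print(for_url_param("a b&c"))
print(for_js_string(user_input))
```

The point of keeping them separate is that the right encoder is chosen by where the value lands in the page, not by where the value came from.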
If you're using MS ASP.NET, you can use their Anti-XSS Library, which provides all of the necessary context-encoding methods.
Note that all encoding should not be restricted to user input, but also stored values from the database, text files, etc.
Oh, and don't forget to explicitly set the charset, both in the HTTP header AND the META tag, otherwise you'll still have UTF-7 vulnerabilities...
For more information and a pretty definitive list (constantly updated), check out RSnake's Cheat Sheet: http://ha.ckers.org/xss.html

If you systematically encode all user input before displaying it, you are still not 100% safe.
(See @Avid's post for more details.)
In addition problems arise when you need to let some tags go unencoded so that you allow users to post images or bold text or any feature that requires user's input be processed as (or converted to) un-encoded markup.
You will have to set up a decision making system to decide which tags are allowed and which are not, and it is always possible that someone will figure out a way to let a non allowed tag to pass through.
It helps if you follow Joel's advice of Making Wrong Code Look Wrong or if your language helps you by warning/not compiling when you are outputting unprocessed user data (static-typing).

If you encode everything, it will (depending on your platform and the implementation of htmlencode). But any useful web application is so complex that it's easy to forget to check every part of it. Or maybe a third-party component isn't safe. Or maybe some code path that you thought did the encoding didn't, so you forgot it somewhere else.
So you might want to check things on the input side too. And you might want to check stuff you read from the database.

As mentioned by everyone else, you're safe as long as you encode all user input before displaying it. This includes all request parameters and data retrieved from the database that can be changed by user input.
As mentioned by Pat you'll sometimes want to display some tags, just not all tags. One common way to do this is to use a markup language like Textile, Markdown, or BBCode. However, even markup languages can be vulnerable to XSS, just be aware.
# Markup example
[foo](javascript:alert\('bar'\);)
If you do decide to let "safe" tags through I would recommend finding some existing library to parse & sanitize your code before output. There are a lot of XSS vectors out there that you would have to detect before your sanitizer is fairly safe.

I second metavida's advice to find a third-party library to handle output filtering. Neutralizing HTML characters is a good approach to stopping XSS attacks. However, the code you use to transform metacharacters can be vulnerable to evasion attacks; for instance, if it doesn't properly handle Unicode and internationalization.
A classic simple mistake homebrew output filters make is to catch only < and >, but miss things like ", which can break user-controlled output out into the attribute space of an HTML tag, where Javascript can be attached to the DOM.

No, just encoding common HTML tokens DOES NOT completely protect your site from XSS attacks. See, for example, this XSS vulnerability found in google.com:
http://www.securiteam.com/securitynews/6Z00L0AEUE.html
The important thing about this type of vulnerability is that the attacker is able to encode his XSS payload using UTF-7, and if you haven't specified a different character encoding on your page, a user's browser could interpret the UTF-7 payload and execute the attack script.

One other thing you need to check is where your input comes from. You can use the referrer string (most of the time) to check that it's from your own page, but putting in a hidden random number or something in your form and then checking it (with a session set variable maybe) also helps knowing that the input is coming from your own site and not some phishing site.
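The hidden random value described here could look roughly like the following Python sketch. The session is modelled as a plain dictionary and the function names are made up; a real framework's session object and form handling would take their place.

```python
import hmac
import secrets


def issue_token(session):
    """Create a random token, remember it in the session, and return it
    so it can be written into a hidden form field."""
    token = secrets.token_urlsafe(32)
    session["form_token"] = token
    return token


def is_valid(session, submitted):
    """Accept the form only if the submitted token matches the stored one."""
    expected = session.get("form_token")
    if not expected or not submitted:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), submitted.encode())


session = {}
token = issue_token(session)
print(is_valid(session, token))
print(is_valid(session, "forged"))
```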

I'd like to suggest HTML Purifier (http://htmlpurifier.org/) It doesn't just filter the html, it basically tokenizes and re-compiles it. It is truly industrial-strength.
It has the additional benefit of allowing you to ensure valid html/xhtml output.
Also seconding Textile: it's a great tool and I use it all the time, but I'd run its output through HTML Purifier too.
I don't think you understood what I meant re tokens. HTML Purifier doesn't just 'filter', it actually reconstructs the html. http://htmlpurifier.org/comparison.html

I don't believe so. Html Encode converts all functional characters (characters which could be interpreted by the browser as code) into entity references, which cannot be parsed by the browser and thus cannot be executed.
&lt;script/&gt;
There is no way that the above can be executed by the browser.
Unless there is a bug in the browser, of course.

myString.replace(/<[^>]*>?/gm, '');
I have used this successfully.
See: Strip HTML from Text JavaScript
