Preprocessing HTML data with PowerShell - string

I have some HTML source code of customer data that needs to be cleaned of HTML tags before it is deployed with a line-joining string split.
I want to be able to target specific types of information.
For example, suppose a customer has a list of categories on their page.
Each 'category' sits inside an easily distinguishable tag:
<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>
Would it be possible to remove everything that is not nested inside a similar HTML tag?
Let's say, for example, I want everything that occurs inside <span *>*</span>, so that every non-<span></span> tag and its contents would be removed. The contents of all the <span ***>***</span> elements would stay, without the tag.
Is that something I could do in PowerShell?
Let's avoid paste.exe and Cygwin-type tools. I'm looking for a standard native Windows approach (cmd or PowerShell).
Again, I want to remove all tags.
The contents that I keep should be limited to those found in a specific tag, like <span _ngcontent-jal-c68="" class="category-name">Shopping</span>:
everything that fits the <span *>*</span> profile.
Leave only the contents, no tag.
from: <span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>
to: Home and Graden
I'm really looking for an answer on how to do this in PowerShell without needing to install anything or make any significant changes to the OS (Windows 10).

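Instead of using delicate regular expressions, you can decode the string with the [System.Net.WebUtility]::HtmlDecode method, cast the result to [Xml], and read the text of every span element. This only works when the input is well-formed XML: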
$Html = '<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>'
([Xml][System.Net.WebUtility]::HtmlDecode($Html)).GetElementsByTagName('span').'#text'
Result:
Cryptocurrency

Please try to investigate the problem before asking on Stack Overflow. Did you know there is a -replace operator in PowerShell which allows you to use RegEx? Did you identify that RegEx might help you with your problem?
Anyway, here is one approach you could take.
$html = '<span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>'
if ($html -match '(<span.*>)(?<Category>.+)(</span>)') {
    $Matches.Category
}
Home and Graden
The -match operator tests a string against a RegEx. The RegEx (<span.*>)(?<Category>.+)(</span>) creates three groups, one of which is named Category. The category sits in between the span tags. For your input, you have to be sure that every category sits inside a span tag. Note that .* and .+ are greedy, so this pattern only behaves as expected when each input string contains a single span.
If -match returns true, the automatic variable $Matches is filled. Since we named the second group Category, we can easily access it as a property with $Matches.Category.
Alternatively, and for more complex HTML files even preferably, you can parse HTML with PowerShell; see Powershell Tip : Parsing HTML from a local File or a String

Related

NodeJS Jade (Pug) inline link in dynamic text

I have this NodeJS application, that uses Jade as template language. On one particular page, one text block is retrieved from the server, which reads the text from database.
The problem is, the returned text might contain line-breaks and links, and an operator might change this text at any time. How do I make these elements display correctly?
Most answers suggest using a new line:
p
  | this is the start of the para
  a(href='http://example.com') a link
  | and this is the rest of the paragraph
But I cannot do this, since I cannot know where the a element appears. I've solved how to get newlines right with this trick:
p
  each l in line.description.split(/\n/)
    = l
    br
But I cannot seem to solve how to get links to render correctly. Does anyone know?
Edit:
I am open to any kind of format for links in the database, whatever would solve the issue. For example, say database contains the following text:
Hello!
We would like you to visit [a("http://www.google.com")Google]
Then we would like that to output text that looks like this:
Hello!
We would like you to visit Google
Looks like what you're looking for is unescaped string interpolation. The link does not work in the output because Pug automatically escapes it. Wrap the content you want to insert with !{} and it should stop breaking links. (Disclaimer: Make sure you don't leave user input unescaped. This is only a viable option if you know for sure that the content of your DB does not have unwanted HTML/JS code in it.)
See this CodePen for illustration.
With this approach, you would need to use standard HTML tags (<a>) in your DB text. If you don't want that, you could have a look at Pug filters such as markdown-it (you will still need to un-escape the compilation output of that filter).
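If the database text uses the bracketed link format from the question rather than raw HTML, one option is to convert it to HTML yourself before handing it to !{}. The sketch below is only an illustration under stated assumptions, not part of Pug: escapeHtml and renderDescription are hypothetical helper names, and the pattern accepts only http/https URLs in the [a("url")label] form shown in the question. Everything outside a recognized link is escaped, which addresses the warning about unescaped input.

```javascript
// Escape the HTML-significant characters so raw DB text cannot inject markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical helper: turns [a("url")label] into an <a> element and
// newlines into <br>, escaping all other text.
function renderDescription(text) {
  const linkPattern = /\[a\("(https?:\/\/[^"\s]+)"\)([^\]]*)\]/g;
  let html = '';
  let lastIndex = 0;
  for (const match of text.matchAll(linkPattern)) {
    // Plain text before the link is escaped, the link itself is rebuilt.
    html += escapeHtml(text.slice(lastIndex, match.index));
    html += `<a href="${escapeHtml(match[1])}">${escapeHtml(match[2])}</a>`;
    lastIndex = match.index + match[0].length;
  }
  html += escapeHtml(text.slice(lastIndex));
  return html.replace(/\r?\n/g, '<br>');
}

const sample = 'Hello!\nWe would like you to visit [a("http://www.google.com")Google]';
console.log(renderDescription(sample));
// Hello!<br>We would like you to visit <a href="http://www.google.com">Google</a>
```

The returned string could then be passed to the template as a local and emitted with !{}; since newlines are already converted, the split-and-br loop from the question would no longer be needed.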

Get count of items in a velocity list

I'm creating a set of custom templates and structures for a Liferay site.
One structure provides for a repeatable section, which its matching template then iterates over.
However, for styling reasons, I need to know how many instances of the repeatable section are actually present, and I need to know before I loop.
So, the template code is something like this:
#foreach($thisChunk in $chunk.getSiblings())
[emit some HTML]
#end
I want to do some conditional logic before the foreach, and emit a different CSS classname on the containing element depending on how many $chunks there are.
Any ideas how to access the number of siblings without looping through them first?
Easy: $chunk.getSiblings().size()
How to find out? It's a plain old Java object (java.util.ArrayList in my quick test). You can find this out when you just temporarily debug your template with $chunk.getSiblings().getClass().getName() and then continue with the interface of that class.

MODX: Display multiple pages on one page - how to implement

I understand I am meant to use Ditto to do this but am unsure where to go from there.
Currently, I have a Template with all my TVs on it along with several pages using the template that are stored under a parent. The Ditto code I am using is:
[!Ditto? &parents=`173`&orderBy=`createdon ASC` &tpl=`showtemp` &display=`100` &total=`100`!]
However, when I view the page I get the error:
&tpl either does not contain any placeholders or is an invalid chunk name, code block, or filename. Please check it.
My chunk ('showtemp') looks like:
<div class="showmedia">
  [*showmedia*]
</div>
<div class="showright">
  <h2>[*showname*]</h2>
  <h2>[*showtime*]</h2>
</div>
As far as the set up goes I am not sure if I am going about it right.
Do I make a Chunk as if it were a normal template with TVs, then replicate it as a proper template, create the resources and go from there?
If someone could give me a step by step on how to do this correctly I would be very grateful! Thanks
You're getting that error message because your placeholder syntax is incorrect in this context.
[*templateVariable*] is correct for displaying the current resource's TVs, but in a chunk used within a snippet loop, such as Ditto's, you need to format them as placeholders like this: [+templateVariable+]
I would recommend going through each step in the following tutorial; it will help you understand all the MODX fundamentals:
http://codingpad.maryspad.com/2009/03/28/building-a-website-with-modx-for-newbies-part-1-introduction/

What bad things can a user do in a browser without the script tag?

I have an entry form where the user can type arbitrary HTML. What do I need to filter out besides script tags? Here's what I do:
userInput.replace(/<(script)/gi, "&lt;$1");
but the sanitizer of WMD (used here on SO) manages a white list of tags, and filters out (blanks) all other tags. Why?
I don't like white lists because I don't want to prevent the user from entering arbitrary tags if she so chooses; but I can use a more extensive black list, besides 'script', if needed. What do I need as a black list?
Short answer: anything they can do with the script tag.
The script tag is not required to run JavaScript. Script can be placed in almost every HTML tag. It can appear in a number of places in addition to the script tag, including, but not limited to, the src and href attributes that are used for URLs, event handlers, and the style attribute.
The ability for a user to put unwanted script into your page is a security vulnerability known as cross-site scripting. Read around this topic and read the XSS prevention cheat sheet.
You may not want to let users add HTML to your pages. If you need this feature, consider other formats such as Markdown, which allows you to disable the use of any embedded HTML; another, less secure option is to use a filtering library that tries to remove all script, such as HTMLPurifier. If you choose the filtering option, be sure to subscribe to announcements of new releases and always go back to your project to install the bug-fixed releases of the filter as new exploits are found and worked around.
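To make the point about black lists concrete, here is a small sketch that runs the filter from the question against a few inputs that carry script without any script tag. The function name blacklistScript and the sample payloads are illustrative assumptions; the list is nowhere near exhaustive.

```javascript
// The question's filter: neutralise only the opening of <script> tags.
function blacklistScript(userInput) {
  return userInput.replace(/<(script)/gi, '&lt;$1');
}

// Illustrative payloads that can run JavaScript without a <script> tag.
const payloads = [
  '<img src="x" onerror="alert(1)">',        // event handler attribute
  '<a href="javascript:alert(1)">click</a>', // script URL in href
  '<svg onload="alert(1)"></svg>',           // event handler on another tag
];

for (const payload of payloads) {
  const filtered = blacklistScript(payload);
  // Each payload passes through the filter unchanged.
  console.log(filtered === payload ? 'unchanged:' : 'changed:', filtered);
}
```

All three inputs come out of the filter exactly as they went in, which is why a black list limited to the script tag does not protect the page.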

ExpressionEngine show channel content outside of loop

I know this sounds crazy, but I need to show some post information outside of the loop in the expression engine channel module. Is this possible?
You could use EE's SQL Query template tags (if you know, or have access to the database table names and know what to look for in the database):
http://expressionengine.com/user_guide/modules/query/index.html
Basically, you'd output only what you need - it doesn't have to belong to a channel, or anything specific. The one kicker is that you'd have to know the basics of SQL syntax, but if you have a small working knowledge of it, you can do tons of additional things with it.
If you're not keen on SQL, you could simply embed a template within the template that you're working on. Here's a simple example that assumes you're editing the index and meta templates inside of a template group called 'news':
index template contents:
{exp:channel:entries channel="news"}
  <div class="entry">
    <h1>{title}</h1>
    <div class="content">{body}</div>
    {embed="news/meta" this_entry_id="{entry_id}"}
  </div>
{/exp:channel:entries}
meta template contents:
{exp:channel:entries channel="news" dynamic="no" limit="1" entry_id="{embed:this_entry_id}"}
  <div class="meta">
    <p>{entry_date}</p>
    <p>{author}</p>
  </div>
{/exp:channel:entries}
As you can see, the index template is embedding the meta template. Note that we're passing a parameter to the meta template so that it knows which entry ID to print information about. If you're unfamiliar with EE's template embedding feature, you can read more about it in the EE docs. Embedding templates in other templates is a great way to access the {exp:channel:entries} loop multiple times.
There's an add-on called MX Jumper that allows you to "set" a variable from inside your entries loop and then "get" it elsewhere in the template (before or after in the HTML loop doesn't matter because it parses later).
Alternatively, the approach that's all the rage now is to use the add-on Stash to store any and all elements you need to use distinctly as stash variables that you set and then get - similar to the above, except that once you set them, getting them has to happen at a later parsing stage. The beauty of this approach is stash will store the "set" variables for reuse either at a user or site level, and you can determine what the expiry period is - which then results in better performance. When you apply this broadly using the "template partials" mindset, you can store everything with stash, and then call them into a small number of wrapper templates. This makes it possible to use stash to set, for example, your entry title, then get it three separate times in the wrapper template without any additional load - no need for separate loops within your template - one loop to set the variable, and then you can call that variable as needed in your template - it's kind of like creating global variables on the fly.
I would also suggest looking at Stash.

Resources