Parsing XML in Groovy with namespace and entities

Parsing XML in Groovy should be a piece of cake, but I always run into problems.
I would like to parse a string like this:
<html>
<p>
This is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>
When I do it the standard way, new XmlSlurper().parseText(body), the parser complains about the &nbsp; entity. My secret weapon in cases like this is to use TagSoup:
def parser = new org.ccil.cowan.tagsoup.Parser()
def page = new XmlSlurper(parser).parseText(body)
But now the <ac:special> tag is closed immediately by the parser: the text "special" does not end up inside this tag in the resulting DOM. This happens even when I disable the namespace feature:
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
Another approach was to use the standard parser and to add a doctype like this one:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
This seems to work for most of my files, but it takes ages for the parser to fetch the DTD and process it.
Any good ideas on how to solve this?
PS: here is some sample code to play around with:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='0.9.7')
def processNode(node) {
def out = new StringBuilder("")
node.children().each {
if (it instanceof String) {
out << it
} else {
out << "<${it.name()}>${processNode(it)}</${it.name()}>"
}
}
return out.toString()
}
def body = """<html>
<p>
This is a <span>test</span> with <b>some</b> formattings.<br />
And this has a <ac:special>special</ac:special> formatting.
</p>
</html>"""
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature,false)
def page = new XmlSlurper(parser).parseText(body)
def out = new StringBuilder("")
page.childNodes().each {
out << processNode(it)
}
println out.toString()
""

You will have to decide whether you want parsing to conform to standards, going the DTD path, or accept just anything with a permissive parser.
TagSoup in my experience is fine for the latter and rarely creates any problems, so I was surprised to read your remark about its handling of "ac:special". A quick test also showed that I could not reproduce it: when running this command
java net.sf.saxon.Query -x:org.ccil.cowan.tagsoup.Parser -s:- -qs:. !encoding=ASCII !indent=yes
on your sample, I received this result
<?xml version="1.0" encoding="ASCII"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml">
<body>
<p>
This is a <span>test</span> with <b>some</b> formattings.<br clear="none"/>
And this has a <ac:special xmlns:ac="urn:x-prefix:ac">special</ac:special> formatting.
</p>
</body>
</html>
from both TagSoup 1.2 and 1.2.1. So for me it behaved as expected, with the text "special" appearing inside the "ac:special" tag.
As for the DTD variant, you could resolve the DTD through a caching proxy, refer to a local copy, or even reduce the DTD to the bare minimum that you need. The following should be sufficient to get you past the entity:
<!DOCTYPE DOC[<!ENTITY nbsp " ">]>
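To illustrate the reduced-DTD idea, here is a minimal sketch in plain Java (JAXP, so it can be called from Groovy as well). It assumes the body arrives as a string without its own DOCTYPE; the class name NbspEntityDemo and the helper parse are made up for this example. The entity is declared with the numeric character reference &#160;, and namespace awareness is left off so the undeclared ac: prefix is treated as part of the element name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NbspEntityDemo {
    // Internal DTD subset that declares only the one entity the body uses.
    // There is no external identifier, so nothing is fetched over the network.
    static final String DOCTYPE = "<!DOCTYPE html [<!ENTITY nbsp \"&#160;\">]>";

    // Hypothetical helper: prepends the DOCTYPE and parses the result into a DOM.
    static Document parse(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness off (also the JAXP default): "ac:special"
            // is accepted as an ordinary element name with a colon in it.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse body", e);
        }
    }

    public static void main(String[] args) {
        String body = "<html><p>A&nbsp;B <ac:special>special</ac:special></p></html>";
        Document doc = parse(body);
        // The text stays inside the ac:special element.
        System.out.println(doc.getElementsByTagName("ac:special").item(0).getTextContent());
    }
}
```

This only covers &nbsp;. Any other HTML entity in the input would need its own declaration in the internal subset.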

Related

Not able to use asciimath using AsciidoctorJ

I am trying to convert an AsciiDoc file containing math expressions to HTML using AsciidoctorJ, but have been unsuccessful so far.
This is the math.asciidoc that I am trying to convert.
= My Diabolical Mathmatical Opus
Jamie Moriarty
sample1
asciimath:[sqrt(4) = 2]
stem:[sqrt(4) = 2]
I am using the following configuration in AsciidoctorJ:
Attributes attributes = AttributesBuilder.attributes()
.math("asciimath")
.get();
Options options = OptionsBuilder.options()
.attributes(attributes)
.docType("article")
.safe(SafeMode.SERVER)
.backend("html5")
.get();
asciidoctor.convert(asciiDoc, options);
The output always shows something like this:
sample1
\$sqrt(4) = 2\$
\$sqrt(4) = 2\$
In the above generated HTML output, how do we render the mathematical equations?

Asciidoctor supports the asciimath and latexmath syntaxes, and the output produced for asciimath can be rendered in the browser using the JavaScript library from http://asciimath.org (other asciimath libraries can also be used).
AsciidoctorJ uses \$ as the delimiter for asciimath markup, so we need to configure MathJax as follows:
<html>
<head>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
asciimath2jax: {
delimiters: [['\\$','\\$'], ['`','`']]
}
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
...
</head>
<!-- rest of html -->
</html>
Once the above snippet is included in the <head> section of the HTML, asciimath rendering should work.
We can refer to this section of Asciidoctor documents for activating support of asciimath inside asciidocs: https://asciidoctor.org/docs/user-manual/#activating-stem-support
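If the HTML is generated in code and you do not want to edit it by hand, the snippet can be added after conversion. The following is only a sketch in plain Java: it assumes the converted page is available as a String that contains a </head> tag, and the names MathJaxHeadInjector and inject are made up for this example. The MathJax configuration is copied from the snippet above.

```java
public class MathJaxHeadInjector {
    // MathJax 2.x configuration and loader, copied from the snippet above.
    // "\\\\$" in Java source becomes \\$ in the emitted JavaScript.
    static final String MATHJAX_SNIPPET =
            "<script type=\"text/x-mathjax-config\">\n"
          + "MathJax.Hub.Config({\n"
          + "  asciimath2jax: { delimiters: [['\\\\$','\\\\$'], ['`','`']] }\n"
          + "});\n"
          + "</script>\n"
          + "<script type=\"text/javascript\" async "
          + "src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML\">"
          + "</script>\n";

    // Inserts the snippet just before the first </head>; returns the input
    // unchanged when there is no </head> to anchor on.
    static String inject(String html) {
        int headEnd = html.indexOf("</head>");
        if (headEnd < 0) {
            return html;
        }
        return html.substring(0, headEnd) + MATHJAX_SNIPPET + html.substring(headEnd);
    }

    public static void main(String[] args) {
        String page = "<html><head><title>t</title></head><body>\\$sqrt(4) = 2\\$</body></html>";
        System.out.println(inject(page));
    }
}
```

The string passed to inject would be whatever asciidoctor.convert(...) returned, provided the conversion was set up to emit a complete page rather than an embeddable fragment.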

Thymeleaf layout with multiple contents

I'm new to the Thymeleaf template engine, and I'm making an application with Spring Boot and Spring MVC. I'm working with just application.properties for the configuration.
I want to know how I can write only ONE layout but keep the contents in many files (for example content1.html, content2.html, etc.) and use the layout that already has the header and the footer.
If it is possible, how can I send from the controller the content file that will be replaced in the layout?

You could do something like this. Let's say you create a page where all other content will be embedded - main.html. It will look something like this:
<!doctype html>
<html xmlns:th="http://www.thymeleaf.org" xmlns:sec="http://www.thymeleaf.org/thymeleaf-extras-springsecurity3" xmlns="http://www.w3.org/1999/xhtml">
<div th:fragment="mainPage(page, fragment)">
<h4>Some header</h4>
<div th:include="${page} :: ${fragment}"></div>
<h4>Some footer</h4>
</div>
</html>
Then you want to create some page which will be embedded in your main.html page - some-page.html:
<!doctype html>
<html xmlns:th="http://www.thymeleaf.org" xmlns:sec="http://www.thymeleaf.org/thymeleaf-extras-springsecurity3" xmlns="http://www.w3.org/1999/xhtml">
<div th:fragment="somePage">
<h1 th:text="${title}"></h1>
</div>
</html>
The goal is to replace <div th:include="${page} :: ${fragment}"></div> within main.html with the content from some-page.html. In the controller, that will look like this:
@Controller
public class DemoController {
@RequestMapping
public String somePage(Model model) {
// Note that you can easily pass parameters to your "somePage" fragment
model.addAttribute("title", "Woa this works!");
return "main :: mainPage(page='some-page', fragment='somePage')";
}
}
And there you go! Whenever you want to swap the content in main.html, you just change the page and fragment parameters in the string returned by your controller.

grrovyTest$_run_closure1_closure3_closure4_closure5@4d1abd getting displayed under div while generating html

I get this extra value when I try to generate an HTML page using Groovy. Here is my code and the output:
code:
import groovy.xml.MarkupBuilder
println("let us try a HTML page..\n")
def mkp= new MarkupBuilder()
mkp.html{head{ title "bijoy's groovy"
body{
div{style:"color:red"}
{p "this is cool"}
}}}
The output has grrovyTest$_run_closure1_closure3_closure4_closure5@4d1abd as an extra value. How do I remove it?
<html>
<head>
<title>bijoy's groovy</title>
<body>
<div>grrovyTest$_run_closure1_closure3_closure4_closure5@4d1abd
<p>this is cool</p>
</div>
</body>
</head>
</html>

Attributes of an element are passed in parentheses as a map, as shown below for <div>.
import groovy.xml.MarkupBuilder
println("let us try a HTML page..\n")
def writer = new StringWriter()
def mkp = new MarkupBuilder(writer)
mkp.html{
head{
title "bijoy's groovy"
}
body{
div(style:"color:red"){
p "this is cool"
}
}
}
println writer
Also note that I fixed the nesting of head and body and added a writer. I suppose you do not want body inside the html head. :)

Extract URL from href-tag in groovy

I need to parse a malformed HTML-page and extract certain URLs from it as any kind of Collection.
I don't really care what kind of Collection, I just need to be able to iterate over it.
Let's say we have a structure like this:
<html>
<body>
<div class="outer">
<div class="inner">
<a href="http://google.com">Google-Link</a>
Blah blah
</div>
<div class="inner">
<a href="http://youtube.com">Youtube-Link</a>
Blah blah2
</div>
</div>
</body>
</html>
And here is what I do so far:
// tagsoup version 1.2 is under apache license 2.0
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
XmlSlurper slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser());
GPathResult nodes = slurper.parse("test.html");
def links = nodes."**".findAll { it.@class == "inner" }
println links
I want something like
["http://google.com", "http://youtube.com"]
but all I get is:
["Google-LinkBlah blah", "Youtube-LinkBlah blah2"]
To be more precise, I can't use all URLs, because the HTML document that I need to parse
is about 15 thousand lines long and has a lot of URLs that I don't need.
So I need the first URL in each "inner" block.

As The Trav says, you need to grab the href attribute from each matching <a> tag.
You've edited your question so the class bit in the findAll makes no sense, but with the current HTML example, this should work:
def links = nodes.'**'.findAll { it.name() == 'a' }*.@href*.text()
Edit
If (as you say after the edit) you just want the first <a> inside anything marked with class="inner", then try:
def links = nodes.'**'.findAll { it.@class?.text() == 'inner' }
.collect { d -> d.'**'.find { it.name() == 'a' }?.@href }
.findAll() // remove nulls if there are any

You're looking for @href on each of your nodes.

Groovy: Correct Syntax for XMLSlurper to find elements with a given attribute

Given a HTML file with the structure html -> body -> a bunch of divs what is the correct groovy statement to find all of the divs with a non blank tags attribute?
The following is not working:
def nodes = html.body.div.findAll { it.@tags != null }
because it finds all the nodes.

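it.@tags is never null, even when the attribute is missing; it yields an empty result instead, so test its text. Try the following (Groovy 1.5.6):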
def doc = """
<html>
<body>
<div tags="1">test1</div>
<div>test2</div>
<div tags="">test3</div>
<div tags="4">test4</div>
</body>
</html>
"""
def html = new XmlSlurper().parseText(doc)
html.body.div.findAll { it.@tags.text() }.each { div ->
println div.text()
}
This outputs:
test1
test4
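For comparison, the same filter can be sketched in plain Java with the DOM API, which has the same pitfall: getAttribute returns an empty string rather than null for a missing attribute. The class name NonBlankAttributeDemo and the method divsWithTags are made up for this example, and unlike html.body.div it collects every div in the document, not only direct children of body.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NonBlankAttributeDemo {
    // Returns the text of every <div> whose "tags" attribute is present and not blank.
    static List<String> divsWithTags(String xml) {
        try {
            NodeList divs = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("div");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                // getAttribute returns "" (not null) for a missing attribute,
                // so a null check would match every div here as well.
                if (!div.getAttribute("tags").trim().isEmpty()) {
                    result.add(div.getTextContent());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String doc = "<html><body>"
                + "<div tags=\"1\">test1</div>"
                + "<div>test2</div>"
                + "<div tags=\"\">test3</div>"
                + "<div tags=\"4\">test4</div>"
                + "</body></html>";
        System.out.println(divsWithTags(doc));
    }
}
```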
