XQuery locate attribute by node value - attributes

I have a bunch of nodes like this:
<root>
<books>
<book id="1">Book 1</book>
<book id="2">Book 2</book>
<book id="3">Book 3</book>
</books>
</root>
What I want is to get the id of the book with text node "Book 2". How do I do this? I tried this without any result ($doc is my document path):
let $b := $doc/root/books/book[book = "Book 2"]
return data($b/@id)
EDIT: I meant that $doc is the document node, not only the path.

Assuming that $doc is in fact a document node, and not a document path as you described it, you can use the following:
$doc/root/books/book[. = "Book 2"]/data(@id)
Simply put, . refers to the current context item, which is already the book element, because book is the last step of the path before the predicate. Your attempt, book[book = "Book 2"], looks for a book child inside each book instead, and no such child exists.

If $doc is your document path, you'll need to call fn:doc($doc) to get the document node:
fn:doc($doc)/root/books/book[. = "Book 2"]/data(@id)
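For readers who want to try the same lookup outside an XQuery processor, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree instead of XQuery. It assumes Python 3.7 or later, where the [.='text'] predicate is supported; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml_text = """<root>
<books>
<book id="1">Book 1</book>
<book id="2">Book 2</book>
<book id="3">Book 3</book>
</books>
</root>"""

# fromstring returns the <root> element, so the path starts below it.
root = ET.fromstring(xml_text)

# [.='Book 2'] keeps only the book elements whose text equals "Book 2",
# mirroring the [. = "Book 2"] predicate in the XQuery expression.
book = root.find("./books/book[.='Book 2']")
book_id = book.get("id") if book is not None else None
print(book_id)
```

As in the XQuery version, the predicate tests the book element itself rather than a child of it.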

Related

Element:Text and sub elements combined in PowerBI & XML

Having an XML file like this:
<?xml version="1.0" encoding="UTF-8"?><outer>
<inner>Some text.</inner>
<inner>More text.</inner>
</outer>
and the following PowerBI script
let
Table0 = Xml.Tables(File.Contents("simple1.xml")){0}[Table]
in
Table0
you get this
Element:Text
Some text.
More text.
Now I'd like to add sub-elements and still keep the Element:Text of inner:
<?xml version="1.0" encoding="UTF-8"?><outer>
<inner>Some text.<secret>Don't care.</secret></inner>
<inner>More text.<secret>You know.</secret></inner>
</outer>
Using the same PowerBI script as above you get
secret
Don't care.
You know.
I already tried this script
let
Table0 = Xml.Tables(File.Contents("simple2.xml")),
Table1 = Table.ExpandTableColumn(Table0, "Table", {"secret"})
in
Table1
but got this
Name     secret
inner    Don't care.
inner    You know.
But I'd like to get this:
Element:Text    secret.Element:Text
Some text.      Don't care.
More text.      You know.
My current workaround (which I'd like to avoid) is to use sed to wrap the element text of an inner entry in its own sub element:
<inner><text>Some text.</text><secret>Don't care.</secret></inner>

BeautifulSoup: insert a tag and children of this new tag with associated values

I need to update an XML file. Its structure is
<product sku="xyz">
...
<custom-attributes>
<custom-attribute name="attrib1">test</custom-attribute>
...
</custom-attributes>
</product>
I want to add a multi-valued custom-attribute, so that the required structure looks like this:
<custom-attributes>
<custom-attribute name="attrib1">test</custom-attribute>
...
<custom-attribute name="new1">
<value>word1</value>
<value>word2</value>
....
</custom-attribute>
</custom-attributes>
I wrote the following python code
precision = {"name": "new1"}
for sku in soup.find_all('product'):
    tagCustoms = sku.find('custom-attributes')
    mynewtag = soup.new_tag('custom-attribute', attrs=precision)
    tagCustoms.append(mynewtag)
    for word in words:  # words is a list
        mynewtag.insert(1, soup.new_tag('value'))
It works, except that I can't find how to set the content of each value tag. How do I assign each word from the words list within the same loop?
I am stuck with this result
<custom-attribute name="new1">
<value></value>
<value></value>
....
</custom-attribute>
</custom-attributes>
I tried this code
for sku in soup.find_all('product'):
    tagCustoms = sku.find('custom-attributes')
    mynewtag = soup.new_tag('custom-attribute', attrs=precision)
    tagCustoms.append(mynewtag)
    for word in words:  # words is a list
        mynewtag.insert(1, soup.new_tag('value'))
        mynewtag.value.string = word
but it only adds a word to the first value tag.
Many thanks in advance
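There are several ways to handle this. One is to create each value tag, set its string, and only then add it to the new tag. In your second attempt, mynewtag.value always returns the first value child, which is why the same tag was written to on every iteration.
Change your inner for loop to:
for word in words:
    ntag = soup.new_tag('value')
    ntag.string = word
    # append keeps the words in list order; insert(1, ntag) would put
    # every tag after the first at position 1 and scramble the order
    mynewtag.append(ntag)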

Reading CDATA with lxml, problem with end of line

Hello, I am parsing an XML document which contains a bunch of CDATA sections. It was working with no problems until now. I realised that when I read an element and get its text attribute, I get end-of-line characters at the beginning and at the end of the text.
The relevant piece of code is as follows:
for comments in self.xml.iter("Comments"):
    for comment in comments.iter("Comment"):
        description = comment.get('Description')
        if language == "Arab":
            tag = self.name + description
            text = comment.text
The problem is with the Comment element, which is written as follows:
<Comment>
<![CDATA[Usually made it with not reason]]>
When I read the text attribute, I get this:
\nUsually made it with not reason\n
I know that I could do a strip and so on, but I would like to fix the problem at its root cause; maybe there is some option to set before parsing with ElementTree.
When I parse the XML file, I do it like this:
tree = ET.parse(xml)
Minimal reproducible example
import xml.etree.ElementTree as ET
filename = "test.xml"  # Place the path to your test XML file here
tree = ET.parse(filename)
root = tree.getroot()
Description = root  # in the minimal file below, the root element is <Description> itself
text = Description.text
print(text)
Minimal xml file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>
You're getting newline characters because there are newline characters:
<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>
Why else would <![CDATA and </Comment start on new lines?
If you don't want newline characters, remove them:
<Comment><![CDATA[Usually made it with not reason]]></Comment>
Everything inside an element counts towards its string value.
<![CDATA[...]]> is not an element, it's a parser flag. It changes how the XML parser is reading the enclosed characters. You can have multiple CDATA sections in the same element, switching between "regular mode" and "cdata mode" at will:
<Comment>normal text <![CDATA[
CDATA mode, this may contain <unescaped> Characters!
]]> now normal text again
<![CDATA[more special text]]> now normal text again
</Comment>
Any newlines before and after a CDATA section count towards the "normal text" section. When the parser reads this, it will create one long string consisting of the individual parts:
normal text
CDATA mode, this may contain <unescaped> Characters!
now normal text again
more special text now normal text again
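The behaviour described above can be checked with a minimal sketch using Python's xml.etree.ElementTree, the parser from the question. The sample strings are taken from the question; the last one is a shortened, made-up mixed-content example.

```python
import xml.etree.ElementTree as ET

# Same content, with and without line breaks around the CDATA section.
with_newlines = "<Comment>\n<![CDATA[Usually made it with not reason]]>\n</Comment>"
without_newlines = "<Comment><![CDATA[Usually made it with not reason]]></Comment>"

text_a = ET.fromstring(with_newlines).text
text_b = ET.fromstring(without_newlines).text

# The line breaks are ordinary character data of <Comment>,
# so they are part of its text.
print(repr(text_a))
print(repr(text_b))

# Mixed content: normal text and CDATA are joined into one string.
mixed = ET.fromstring("<Comment>a <![CDATA[<b>]]> c</Comment>").text
print(repr(mixed))
```

If the file layout cannot be changed, calling strip() on the text removes the surrounding whitespace after parsing.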
I thought that CDATA sections in XML always came with an end of line before and after them, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>
But you can also have it like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description><![CDATA[Hello world]]></Description>
That is why we get end-of-line characters when parsing with the ElementTree library. It works correctly in both cases; you only have to strip or not strip, depending on how you want to process the data.
If you want to remove both '\n' characters, just add the following code:
text = Description.text
text = text.strip('\n')

How to extract CDATA without the GPath/node name

I'm trying to extract CDATA content from an XML document without using GPath or the node name. In short, I want to find and retrieve the inner text of a CDATA section from an XML document.
My XML look like:
def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<Test1>This node contains some innerText. Ignore This.</Test1>
<Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''
From the above XML, I want to get the CDATA content alone, without referring to its node name 'Test2', because the node name is not always the same in my scenario.
Also note that the XML can contain inner text in a few other nodes (Test1). I don't want to retrieve that. I just need the CDATA content out of the whole XML.
I want something like the code below (which is incorrect, though):
def parsedXML = new xmlSlurper().parseText(xml)
def cdataContent = parsedXML.depthFirst().findAll { it.text().startsWith('<![CDATA')}
My output should be :
this is the CDATA section i want to retrieve
As @daggett says, you can't do this with the Groovy slurper or parser, but it's not too bad to drop down and use the Java classes to get it.
Note you have to set the property for CDATA to become visible, as by default it's just treated as characters.
Here's the code:
import javax.xml.stream.*
def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<Test1>This node contains some innerText. Ignore This.</Test1>
<Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''
def factory = XMLInputFactory.newInstance()
factory.setProperty('http://java.sun.com/xml/stream/properties/report-cdata-event', true)
def reader = factory.createXMLStreamReader(new StringReader(xml))
while (reader.hasNext()) {
    if (reader.eventType in [XMLStreamConstants.CDATA]) {
        println reader.text
    }
    reader.next()
}
That will print this is the CDATA section i want to retrieve
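The same event-based idea can be sketched in Python rather than Groovy, using the standard library's xml.parsers.expat, which reports the start and end of each CDATA section through handlers. The function name extract_cdata and the variable names are illustrative, not part of any library.

```python
import xml.parsers.expat

xml_text = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<Test1>This node contains some innerText. Ignore This.</Test1>
<Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>"""


def extract_cdata(text):
    """Return the content of every CDATA section, ignoring element names."""
    sections = []   # completed CDATA sections
    current = None  # chunks of the section being read, None outside CDATA

    def start_cdata():
        nonlocal current
        current = []

    def end_cdata():
        nonlocal current
        sections.append("".join(current))
        current = None

    def char_data(data):
        # Character data is kept only while inside a CDATA section.
        if current is not None:
            current.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartCdataSectionHandler = start_cdata
    parser.EndCdataSectionHandler = end_cdata
    parser.CharacterDataHandler = char_data
    parser.Parse(text, True)
    return sections


cdata_sections = extract_cdata(xml_text)
print(cdata_sections)
```

Because the handlers never look at element names, this also collects several CDATA sections if the document contains more than one.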
Considering you have just one CDATA section in your XML, split can help here:
def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<Test1>This node contains some innerText. Ignore This.</Test1>
<Test2><![CDATA[this is the CDATA section i want to retrieve]]></Test2>
</root>'''
log.info xml.split("<!\\[CDATA\\[")[1].split("]]")[0]
In the above logic, we split the string on the CDATA start marker and pick the portion that comes after it:
xml.split("<!\\[CDATA\\[")[1]
Once we have that portion, we split again on the end marker and take the part before it:
.split("]]")[0]

obtain en-US title tag text

I'm trying to obtain the text of only the title@lang=en-US elements in an XML file.
This code obtains all the title text for all languages.
entries = root.xpath('//prefix:new-item', namespaces={'prefix': 'http://mynamespace'})
for entry in entries:
    all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
    for title in all_titles:
        print(title.text)
I tried this code to get the title@lang=en-US text, but it does not work.
all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
for title in all_titles:
    test = title.xpath("@lang='en-US'")
    print(test)
How do I obtain the text for only the English-language items?
The expression
//prefix:title[lang('en')]
will select all the English-language titles. Specifically:
title elements that have an xml:lang attribute identifying the title as English, for example <title xml:lang="en-US"> or <title xml:lang="en-GB">
title elements within some container that identifies all the contents as English, for example <section xml:lang="en-US"><title/></section>.
If you specifically want only US English titles, excluding other forms of English, then you can use the predicate [lang('en-US')].
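As a rough illustration, here is a sketch that uses only Python's standard library rather than lxml. ElementTree does not provide the lang() function, so the sketch reads the xml:lang attribute directly. The sample document is hypothetical, and unlike lang() this check ignores xml:lang values inherited from ancestor elements.

```python
import xml.etree.ElementTree as ET

NS = {"prefix": "http://mynamespace"}
# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document, made up for this sketch.
xml_text = """<feed xmlns="http://mynamespace">
<new-item>
<title xml:lang="en-US">Hello</title>
<title xml:lang="de-DE">Hallo</title>
<title xml:lang="en-GB">Hullo</title>
</new-item>
</feed>"""

root = ET.fromstring(xml_text)
titles = root.findall(".//prefix:title", NS)

# Only US English, comparable to [lang('en-US')].
us_titles = [t.text for t in titles if t.get(XML_LANG) == "en-US"]

# Any English variant, comparable to [lang('en')]: compare the primary subtag.
english_titles = [
    t.text for t in titles
    if (t.get(XML_LANG) or "").lower().split("-")[0] == "en"
]

print(us_titles)
print(english_titles)
```

The attribute in the question's failed attempt was written as lang, while the attribute lang() examines is xml:lang, which is why the lookup here goes through the XML namespace.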
