Python XML ElementTree cannot iter(), find() or findall() - python-3.x

I can parse an XML file and loop through the root, printing each child, but root.iter('tag'), root.find('tag') and root.findall('tag') will not work.
Here is a sample of the XML:
<?xml version='1.0' encoding='UTF-8'?>
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<generator>
<product_name>National Vulnerability Database (NVD)</product_name>
<product_version>4.4</product_version>
<schema_version>2.3</schema_version>
<timestamp>2021-05-21T03:50:31.204Z</timestamp>
</generator>
<cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
<title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
<references>
<reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
<reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
</cpe-item>
<cpe-item name="cpe:/a:%40thi.ng%2fegf_project:%40thi.ng%2fegf:-::~~~node.js~~">
<title xml:lang="en-US">@thi.ng/egf Project @thi.ng/egf for Node.js</title>
<references>
<reference href="https://github.com/thi-ng/umbrella/security/advisories/GHSA-rj44-gpjc-29r7">Advisory</reference>
<reference href="https://www.npmjs.com/package/@thi.ng/egf">Version</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\@thi.ng\/egf_project:\@thi.ng\/egf:-:*:*:*:*:node.js:*:*"/>
</cpe-item>
</cpe-list>
The following Python (3.7) code works:
import xml.etree.ElementTree as ET
infile = open(filename, "r")
xml = infile.read()
infile.close()
parser = ET.XMLParser(encoding="utf-8")
root = ET.fromstring(xml, parser=parser)
print(root.tag)
for child in root:
    print(child.tag)
Output:
{http://cpe.mitre.org/dictionary/2.0}cpe-list
{http://cpe.mitre.org/dictionary/2.0}cpe-item
{http://cpe.mitre.org/dictionary/2.0}cpe-item
{http://cpe.mitre.org/dictionary/2.0}cpe-item
{http://cpe.mitre.org/dictionary/2.0}cpe-item
...
But when I try:
for item in root.iter('cpe-item') or for item in root.iter('cpe-list'), nothing loops. When I try for item in root.findall('cpe-item') or for item in root.findall('cpe-list'), nothing loops either. If I try item = root.find('cpe-list'), item is None.
I don't work with XML very often, but this seems so strange to me, since I have example code from other projects where this works perfectly fine. Many other examples online show that this exact process is the correct one.
What am I doing wrong?
It seems odd to me that when I print(root.tag) or print(child.tag), something is printed before the tag name. I don't know why that is happening.

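You are getting entangled with namespaces: the {http://cpe.mitre.org/dictionary/2.0} part printed in front of each tag is the document's default namespace, and it is part of the tag name that iter(), find() and findall() match against. A lot has been written about this.
As for your specific example, the tl;dr is to disregard the namespaces altogether with the {*} wildcard (supported from Python 3.8 onward, so not on 3.7). For example:
for item in root.findall('.//{*}cpe-item'):
    print(item.tag)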
Another option is to bite the bullet and declare the namespaces:
ns = {"xx": "http://cpe.mitre.org/dictionary/2.0"}
for item in root.findall('.//xx:cpe-item', ns):
    print(item.tag)
Output:
{http://cpe.mitre.org/dictionary/2.0}cpe-item
{http://cpe.mitre.org/dictionary/2.0}cpe-item
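The plain names fail because ElementTree stores every tag in its fully qualified {namespace-uri}localname form, which is exactly what print(root.tag) shows. Here is a minimal sketch of the three kinds of lookup, assuming a cut-down stand-in for the sample document. The SAMPLE string, the NS mapping and its d/cpe23 prefix names are illustrative and not part of the original file. Unlike the {*} wildcard, these lookups should also work on Python 3.7.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the CPE dictionary in the question (assumption:
# only the default namespace and the cpe-23 namespace matter here).
SAMPLE = """<cpe-list xmlns="http://cpe.mitre.org/dictionary/2.0"
          xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3">
  <generator><product_version>4.4</product_version></generator>
  <cpe-item name="cpe:/a:example:one"><title>One</title>
    <cpe-23:cpe23-item name="cpe:2.3:a:example:one"/>
  </cpe-item>
  <cpe-item name="cpe:/a:example:two"><title>Two</title>
    <cpe-23:cpe23-item name="cpe:2.3:a:example:two"/>
  </cpe-item>
</cpe-list>"""

root = ET.fromstring(SAMPLE)

# Prefix names on the left are arbitrary; only the URIs must match the file.
NS = {
    "d": "http://cpe.mitre.org/dictionary/2.0",
    "cpe23": "http://scap.nist.gov/schema/cpe-extension/2.3",
}

# 1. Unqualified name: matches nothing, as in the question.
plain = root.findall("cpe-item")

# 2. Fully qualified name: works with iter(), find() and findall().
qualified = list(root.iter("{http://cpe.mitre.org/dictionary/2.0}cpe-item"))

# 3. Prefix map: the same lookup, easier to read.
names = [item.get("name") for item in root.findall("d:cpe-item", NS)]
cpe23_names = [e.get("name") for e in root.findall(".//cpe23:cpe23-item", NS)]

print(len(plain), len(qualified), names, cpe23_names)
```

Note that iter() takes no namespace map, so it needs the fully qualified form shown in step 2.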

Related

How to access xml text of child?

I have the following XML file:
<BioSampleSet>
<BioSample submission_date="2011-12-01T13:31:02.367" last_update="2014-11-08T01:40:24.717" publication_date="2012-02-16T10:49:52.970" access="public" id="761094" accession="SAMN00761094">
<Ids>
</Ids>
<Package display_name="Generic">Generic.1.0</Package>
<Attributes>
<Attribute attribute_name="Individual">PK314</Attribute>
<Attribute attribute_name="condition">healthy</Attribute>
<Attribute attribute_name="BioSampleModel">Generic</Attribute>
</Attributes>
<Status status="live" when="2014-11-08T00:27:24"/>
</BioSample>
</BioSampleSet>
And I need to access the text next to the attribute attribute_name in each child of Attributes.
I managed to access the values of attribute_name:
import xml.etree.ElementTree as ET
from Bio import Entrez, SeqIO
Entrez.email = '#'
handle = Entrez.efetch(db="biosample", id="SAMN00761094", retmode="xml", rettype="full")
tree = ET.parse(handle)
root = tree.getroot()  # the <BioSampleSet> element
for attr in root[0].iter('Attribute'):
    name = attr.get('attribute_name')
    print(name)
This prints:
Individual
condition
BioSampleModel
How do I create a dict of the values of attribute_name and the text next to it?
My desired output is
attributes = {'Individual': PK314, 'condition': healthy, 'BioSampleModel': Generic}
Based strictly on the xml sample in the question, try something along these lines:
bio = """[your xml sample]"""
doc = ET.fromstring(bio)
attributes = {}
for item in doc.findall('.//Attributes//Attribute'):
    attributes[item.attrib['attribute_name']] = item.text
attributes
Output:
{'Individual': 'PK314', 'condition': 'healthy', 'BioSampleModel': 'Generic'}
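The same mapping can also be built in a single expression. This is a small sketch, assuming the XML from the question is held in a string; the bio content below is a trimmed copy of that sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question.
bio = """<BioSampleSet>
  <BioSample id="761094" accession="SAMN00761094">
    <Attributes>
      <Attribute attribute_name="Individual">PK314</Attribute>
      <Attribute attribute_name="condition">healthy</Attribute>
      <Attribute attribute_name="BioSampleModel">Generic</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""

doc = ET.fromstring(bio)

# Map each attribute_name value to the text content of its element.
attributes = {
    item.get("attribute_name"): item.text
    for item in doc.iter("Attribute")
}
print(attributes)
```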

Find path to the node using ElementTree

With ElementTree, I can print every occurrence of a specific tag (in my case ExpertSettingsSg):
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
root = ET.parse('mydoc.xml').getroot()
for children in root:
    value = children.findall('.//ExpertSettingsSg')  # tag I'm looking for
    for settings in value:
        if settings.text is not None:
            print(settings.text)
But I didn't find a way to print the path of each occurrence. Because my XML file has many levels and ExpertSettingsSg can appear at almost any level, I need to know where each ExpertSettingsSg comes from. I'm looking for something like
Path to config xxxxxx = /root/xxx/aaaa/bbbb
If it's not possible with ElementTree, does any other library do the trick?
Thanks
If you already have the nodes, you can walk the tree and collect paths (borrowing the example from @valdi-bo):
from xml.etree import ElementTree as ET
txt ='''<main>
<x>
<a>
<ExpertSettingsSg id="1">x1</ExpertSettingsSg>
</a>
<b>
<dummy>xxxx</dummy>
</b>
</x>
<y>
<c>
<dummy>xxxx</dummy>
</c>
<d>
<ExpertSettingsSg id="2">x2</ExpertSettingsSg>
</d>
<e>
<ExpertSettingsSg id="3"/>
</e>
</y>
</main>'''
def node_walk(root: ET.Element):
    path_to_node = []
    node_stack = [root]
    while node_stack:
        node = node_stack[-1]
        if path_to_node and node is path_to_node[-1]:
            # Second visit: all children are done, so emit the node
            path_to_node.pop()
            node_stack.pop()
            yield (path_to_node, node)
        else:
            # First visit: record the node and queue its children
            path_to_node.append(node)
            for child in reversed(node):
                node_stack.append(child)
root = ET.ElementTree(ET.fromstring(txt))
for node in root.findall('.//ExpertSettingsSg'):
    for node_path, n in node_walk(root.getroot()):
        if n is node:
            xpath = "/".join(["."] + [n.tag for n in node_path[1:]] + [n.tag])
            print(xpath, node)
            # NOTE: Assert is to just show that the xpath is correct.
            assert root.getroot().find(xpath) == node
You would get output like this:
./x/a/ExpertSettingsSg <Element 'ExpertSettingsSg' at 0x102cf5b80>
./y/d/ExpertSettingsSg <Element 'ExpertSettingsSg' at 0x102cf5db0>
./y/e/ExpertSettingsSg <Element 'ExpertSettingsSg' at 0x102cf5e50>
Instead of walking multiple times, we can walk once and collect all relevant nodes with path, like this:
xpaths = []
for node_path, n in node_walk(root.getroot()):
    if n.tag == "ExpertSettingsSg":
        xpath = "/".join(["."] + [n.tag for n in node_path[1:]] + [n.tag])
        xpaths.append(xpath)
for xpath in xpaths:
    node = root.getroot().find(xpath)
    print(xpath, node)
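Another common way to get the same information is to build a child-to-parent map once and follow it upward from each match. The sketch below assumes a shortened version of the sample document; parent_of and path_to are illustrative names, not ElementTree APIs. If a third-party library is an option, lxml's tree objects provide a getpath() method that returns a comparable path directly.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document used above.
txt = """<main>
  <x><a><ExpertSettingsSg id="1">x1</ExpertSettingsSg></a></x>
  <y>
    <d><ExpertSettingsSg id="2">x2</ExpertSettingsSg></d>
    <e><ExpertSettingsSg id="3"/></e>
  </y>
</main>"""

root = ET.fromstring(txt)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_to(node):
    """Return a path such as /main/x/a/ExpertSettingsSg for the given node."""
    parts = []
    while node is not None:
        parts.append(node.tag)
        node = parent_of.get(node)  # None once the root is reached
    return "/" + "/".join(reversed(parts))


paths = [path_to(n) for n in root.iter("ExpertSettingsSg")]
for p in paths:
    print(p)
```

The map is built in one pass over the tree, so each lookup afterwards costs only as many steps as the node is deep.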

XML Element Tree - appending to existing elements and attributes with ET.SubElement()?

I have the following function which builds up a re-usable XML SOAP envelope:
def get_xml_soap_envelope():
    """
    Returns a generically re-usable SOAP envelope in the following format:
    <soapenv:Envelope>
        <soapenv:Header/>
        <soapenv:Body />
    </soapenv:Envelope>
    """
    soapenvEnvelope = ET.Element('soapenv:Envelope')
    soapenvHeader = ET.SubElement(soapenvEnvelope, 'soapenv:Header')
    soapenvBody = ET.SubElement(soapenvEnvelope, 'soapenv:Body')
    return soapenvEnvelope
Fairly simple stuff so far.
I was wondering now, would it be possible to append attributes (such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance") to the soapenv:Envelope element?
And if I also wanted to append the following XML:
<urn:{AAction} soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<AUserName>{AUserName}</AUserName>
<APassword>{APassword}</APassword>
</urn:{AAction}>
To the soapenv:Body such that I would have something like this:
if __name__ == "__main__":
    soapenvEnvelope = get_xml_soap_envelope()
    actions = {
        'AAction': 'UserLogin',
    }
    soapAAction = ET.Element('urn:{AAction}'.format(**actions))
    soapenvEnvelope.AppendElement(soapAAction, 'soapenv:Body')
So, I could specify a target node and the Element to append to?
Let's start with the bad news: your function to create the SOAP envelope
(get_xml_soap_envelope) is wrong, as it fails to specify at least
xmlns:soapenv="...".
In fact, all other namespaces that will be used should also be specified here.
A proper function creating the SOAP envelope should be something like this:
def get_xml_soap_env():
    """
    Returns a generically re-usable SOAP envelope in the following format:
    <soapenv:Envelope xmlns:soapenv="...", ...>
        <soapenv:Header/>
        <soapenv:Body />
    </soapenv:Envelope>
    """
    ns = {'xmlns:soapenv': 'http://schemas.xmlsoap.org/soap/envelope/',
          'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
          'xmlns:urn': 'http://dummy.urn'}
    env = ET.Element('soapenv:Envelope', ns)
    ET.SubElement(env, 'soapenv:Header')
    ET.SubElement(env, 'soapenv:Body')
    return env
Note that the ns dictionary also contains other namespaces that will be
needed later, among others the xsi namespace.
A possible alternative is to define ns outside of this function and pass it as
a parameter (your choice).
When I ran:
env = get_xml_soap_env()
print(ET.tostring(env, encoding='unicode', short_empty_elements=True))
the printout (reformatted by me for readability) was:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:urn="http://dummy.urn"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Header />
  <soapenv:Body />
</soapenv:Envelope>
Note that this time proper namespaces are included.
Then, to add the Action element and its children, define the following function:
def addAction(env, action, subelems):
    body = env.find('soapenv:Body')
    actn = ET.SubElement(body, f'soapenv:{action}')
    for k, v in subelems.items():
        child = ET.SubElement(actn, k)
        child.text = v
When I ran:
subelems = {'AUserName': 'Mark', 'APassword': 'Secret!'}
addAction(env, 'UserLogin', subelems)
and printed the whole XML tree again, the result was:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:urn="http://dummy.urn" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Header />
  <soapenv:Body>
    <soapenv:UserLogin>
      <AUserName>Mark</AUserName>
      <APassword>Secret!</APassword>
    </soapenv:UserLogin>
  </soapenv:Body>
</soapenv:Envelope>
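As an alternative sketch, ElementTree can manage the prefixes itself if the tags are written in {uri}name form and the prefixes are registered up front. This is a different technique from the one above, which writes the prefixes literally into the tag names. The helper names build_envelope and add_action are illustrative, http://dummy.urn is the same placeholder URI used above, and the action element is put in the urn namespace with the encodingStyle attribute, as in the XML the question asks for.

```python
import xml.etree.ElementTree as ET

SOAPENV = "http://schemas.xmlsoap.org/soap/envelope/"
URN = "http://dummy.urn"  # placeholder URI, not a real service namespace

# Tell the serializer which prefixes to emit for these URIs.
ET.register_namespace("soapenv", SOAPENV)
ET.register_namespace("urn", URN)


def build_envelope():
    """Create an envelope with an empty Header and Body."""
    env = ET.Element(f"{{{SOAPENV}}}Envelope")
    ET.SubElement(env, f"{{{SOAPENV}}}Header")
    ET.SubElement(env, f"{{{SOAPENV}}}Body")
    return env


def add_action(env, action, subelems):
    """Append <urn:ACTION> with plain child elements to the Body."""
    body = env.find(f"{{{SOAPENV}}}Body")
    actn = ET.SubElement(
        body,
        f"{{{URN}}}{action}",
        {f"{{{SOAPENV}}}encodingStyle": "http://schemas.xmlsoap.org/soap/encoding/"},
    )
    for name, value in subelems.items():
        ET.SubElement(actn, name).text = value
    return actn


env = build_envelope()
add_action(env, "UserLogin", {"AUserName": "Mark", "APassword": "Secret!"})
xml_text = ET.tostring(env, encoding="unicode")
print(xml_text)
```

With this approach the serializer writes the xmlns declarations for the namespaces that are actually used, and the resulting tree can be searched with the usual namespace-aware find() calls.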

IndexError: list index out of range in Django-Python application

I have a problem with a function which has an iteration for an array. Here is my function;
def create_new_product():
    tree = ET.parse('products.xml')
    root = tree.getroot()
    array = []
    appointments = root.getchildren()
    for appointment in appointments:
        appt_children = appointment.getchildren()
        array.clear()
        for appt_child in appt_children:
            temp = appt_child.text
            array.append(temp)
        new_product = Product(
            product_name = array[0],
            product_desc = array[1]
        )
        new_product.save()
    return new_product
When I call the function, it saves 2 products into database but gives an error on third one. This is the error;
product_name = array[0],
IndexError: list index out of range
Here is also the xml file. I only copied the first 3 products from xml. There are almost 2700 products in the xml file.
<?xml version="1.0" encoding="UTF-8"?>
<Products>
<Product>
<product_name>Example 1</product_name>
<product_desc>EX101</product_desc>
</Product>
<Product>
<product_name>Example 2</product_name>
<product_desc>EX102</product_desc>
</Product>
<Product>
<product_name>Example 3</product_name>
</Product>
</Products>
I don't understand why I am getting this error because it already works for the first two products in the xml file.
I have run a minimal version of your code on Python 3 (I assume it's 3 since you use array.clear()):
import xml.etree.ElementTree as ET
def create_new_product():
    tree = ET.parse('./products.xml')
    root = tree.getroot()
    array = []
    # list(element) returns the children; getchildren() was removed in Python 3.9
    appointments = list(root)
    for appointment in appointments:
        appt_children = list(appointment)
        array.clear()
        # skip this element and log a warning
        if len(appt_children) != 2:
            print('Warning: skipping element since it does not have exactly 2 children')
            continue
        for appt_child in appt_children:
            temp = appt_child.text
            array.append(temp)
        _arg = {
            'product_name': array[0],
            'product_desc': array[1]
        }
        print(_arg)
create_new_product()
Output:
{'product_name': 'Example 1', 'product_desc': 'EX101'}
{'product_name': 'Example 2', 'product_desc': 'EX102'}
Warning: skipping element since it does not have exactly 2 children
Edit: the OP has found that some products contain fewer children than expected, so I added a check on the number of child elements.
"List index out of range" is raised only when you access an index that does not exist in the list, so at the point of the error the array does not hold as many items as the code expects. Maybe try posting your XML file and we'll see if there's an error there.
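A way to avoid depending on the position of the children is to look each value up by tag name. This is a minimal sketch using a shortened copy of the XML from the question; read_products is an illustrative helper, and the empty-string fallback for a missing product_desc is an assumption about what the application would want.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question, including an incomplete product.
XML = """<Products>
  <Product>
    <product_name>Example 1</product_name>
    <product_desc>EX101</product_desc>
  </Product>
  <Product>
    <product_name>Example 3</product_name>
  </Product>
</Products>"""


def read_products(root):
    """Return one dict per <Product>, looking children up by tag name."""
    products = []
    for product in root.findall("Product"):
        products.append({
            "product_name": product.findtext("product_name"),
            # Assumed fallback: an empty string when the tag is missing.
            "product_desc": product.findtext("product_desc", default=""),
        })
    return products


rows = read_products(ET.fromstring(XML))
print(rows)
```

Each dict could then be handed to the model from the question, for instance as Product(**row), assuming its fields accept these values.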

How to get all the XML nodes and its values from a XML with groovy?

I have an XML like below.
xml1 = '''
<?xml version="1.0" encoding="UTF-8"?>
<soap>
<group1>
<g1node1>g1value1</g1node1>
<g1node2>g1value2</g1node2>
<g1node3>g1value3</g1node3>
</group1>
<group2 attr="attrvalue1">
<g2node1>g2value1</g2node1>
<g2node2>g2value2</g2node2>
<g2node3>g2value3</g2node3>
</group2>
</soap>
'''
Here, I need to get all the XML nodes and their values as output, either as a line-by-line result or as a list, using Groovy. The output should look like
g1node1 = g1value1
g1node2 = g1value2
... and so on...
or as a Groovy map like below
out = [g1node1 : "g1value1", g1node2 : "g1value2", ...and so on...]
Can anyone help me achieve this with Groovy code?
I would still like to know (as Tim mentioned) what you have tried so far, but here is my take:
// trim() drops the leading newline so the XML declaration starts the string
def result = new XmlSlurper().parseText( xml1.trim() )
result.'**'.collectEntries { !it.childNodes() ? [ it.name(), it.text() ] : [:] }
