parse xml whose attribute include double quote with lxml - python-3.x

I can't get an xpath research on the attribute "Frais de Services" with lxml:
I have an xml file whose content is the folowing:
<column caption='Choix Découpage' name='[Aujourd&apos;Hui Parameter (copy 2)]'>
<alias key='"Frais de Services"' value='Offline Fees' />
</column>
from lxml import etree
import sys
tree = etree.parse('test.xml')
root = tree.getroot()
print([node.attrib['key'] for node in root.xpath("//alias")]) # we get ['"Billetterie Ferroviaire"']
I tried many hack, none works (i can't understand why lxml change internally the original "Predefined entities"):
root.xpath('//alias[#key="\"Frais de Services\""]')
root.xpath('//alias[#key=""Frais de Services""]')

Related

How to parse XML namespaces in Python 3 and Beautiful Soup 4?

I am trying to parse XML with BS4 in Python 3.
For some reason, I am not able to parse namespaces. I tried to look for answers in this question, but it doesn't work for me and I don't get any error message either.
Why does the first part work, but the second does not?
import requests
from bs4 import BeautifulSoup
input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America & Caribbean </wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity />
<wb:longitude />
<wb:latitude />
</wb:country>
</wb:countries>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(input, 'xml')
# Working
for x in soup.find_all('wb:country'):
print(x.find('wb:name').text)
# Not working
for x in soup.find_all('item'):
print(x.find('itunes:subtitle').text)
It looks like non conform XML were you have two documents mixed togehter - A namespace is expected in strict mode of XML parser if it is defined - Use lxml instead to get your expected result in this wild mix:
soup = BeautifulSoup(xml_string, 'lxml')
# Working
for x in soup.find_all('wb:country'):
print(x.find('wb:name').text)
# also working
for x in soup.find_all('item'):
print(x.find('itunes:subtitle').text)
Note: Avoid using python reserved terms (keywords), this could have unwanted effects on the results of your code.
If you have second document separat use:
for x in soup.find_all('item'):
print(x.find('subtitle').text)
Example
from bs4 import BeautifulSoup
xml_string = """
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(input, 'xml')
# working
for x in soup.find_all('item'):
print(x.find('subtitle').text)
Else you have to define a namespace for your item and can still use XML parser:
<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
...
When a namespace is defined for an element, all child elements with
the same prefix are associated with the same namespace.

python xml write with using namespace

Are those two roots the same??
Output xml got changed about the order in the root.
If they are different each other, how could I fix it??
#Python 3.7
import xml.etree.ElementTree as ET
ET.register_namespace('xsi', "http://www.w3.org/2001/test")
ET.register_namespace('', "http://www.test.com/test/test/test")
tree = ET.parse('test.xml')
tree.write("test1.xml", encoding='utf-8', xml_declaration=True)
#input XML root
<root xmlns:xsi="http://www.w3.org/2001/test" schemaVersion="2.8" xmlns="http://www.test.com/test/test/test" labelVersion="1" xsi:schemaLocation="http://www.test.com/test/test/test ..\Schema\CLIFSchema.xsd" name="test.xml">
#output XML root
<root xmlns="http://www.test.com/test/test/test" xmlns:xsi="http://www.w3.org/2001/test" labelVersion="1" name="test.xml" schemaVersion="2.8" xsi:schemaLocation="http://www.test.com/test/test/test ..\Schema\CLIFSchema.xsd">

Python 2.7 to 3.6 Code porting issue --copying xml data to list

emphasized textHi , I am currently Porting a Piece of code from python 2.7 to python 3.6 , The Code involves dumping data from xml into a list the codes looks as below
import os
import xml.etree.ElementTree as ET
regs_list = []
tree = ET.parse("test.xml")
root = tree.getroot()
for reg in root:
regs_list.append(reg)
for i in range(len(regs_list)):
print (regs_list[i].attrib["name"])
if not (regs_list[i].find("field") == None):
for regs_list[i].field in regs_list[i]:
print (regs_list[i].field.attrib["first_bit"])
The XML looks like this
<register offset="0x4" width="4" defaultValue="0x100000" name="statuscommand" desc="STATUSCOMMAND- Status and Command ">
<field first_bit="30" last_bit="31" WH="ROOO" flask="0xc0000000" name="reserved0" desc=""/>
<field first_bit="29" last_bit="29" WH="1CWRH flask="0x20000000" name="rma" desc=""/>
<field first_bit="28" last_bit="28" WH="1CWRH flask="0x10000000" name="rta" desc=""/>
</register>
<register offset="0x8" width="4" defaultValue="0x8050100" name="reve" desc="REVCLASSCODE - Revision ID and Class Code">
<field first_bit="8" last_bit="31" WH="ROOO" flask="0xffffff00" name="class_codes" desc=""/>
<field first_bit="0" last_bit="7" WH="ROOO" flask="0xff" name="rid" desc=""/>
</register>
This Code works perfectly fine in python 2.7 , we are able to dump both parent (Register) and child (field) into the list regs_list , But in 3.6 we get an error as below
3.6 output "
for regs_list[i].field in regs_list[i]:
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'field'
2.7 output
Parent and child parsed and dumped in the list without error
Is there a difference in the way xml.etree.ElementTree.Element and lists work in 3.6 and 2.7 ??

XML delete tag based on the match

Need to delete the vrfName tag from the payload
<Test>
<Sub>
<vrfName>management</vrfName>
</Sub>
<sub1>
<vrfName>management</vrfName>
</sub1>
<host>
<vrfName>management</vrfName>
</host>
</Test>
import re
from xml.etree import ElementTree as ET
root = ET.parse("test.xml").getroot()
print(root)
regex=r'<vrfName>\w+</vrfName>'
sa=open('test.xml','r')
x=sa.readlines()
for i in x:`
match=re.findall(regex,i)
match=''.join(match)
print(match)

Parsing XML attribute with namespace python3

I have looked at the other question over Parsing XML with namespace in Python via 'ElementTree' and reviewed the xml.etree.ElementTree documentation. The issue I'm having is admittedly similar so feel free to tag this as duplicate, but I can't figure it out.
The line of code I'm having issues with is
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
My code is as follows:
import xml.etree.cElementTree as ET
tree = ET.parse('../../external_data/rss.xml')
root = tree.getroot()
instance_title = root.find('channel/title').text
instance_link = root.find('channel/link').text
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
instance_description = root.find('channel/description').text
instance_language = root.find('channel/language').text
instance_pubDate = root.find('channel/pubDate').text
instance_lastBuildDate = root.find('channel/lastBuildDate').text
The XML file:
<?xml version="1.0" encoding="windows-1252"?>
<rss version="2.0">
<channel>
<title>Filings containing financial statements tagged using the US GAAP or IFRS taxonomies.</title>
<link>http://www.example.com</link>
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
<description>This is a list of up to 200 of the latest filings containing financial statements tagged using the US GAAP or IFRS taxonomies, updated every 10 minutes.</description>
<language>en-us</language>
<pubDate>Mon, 20 Nov 2017 20:20:45 EST</pubDate>
<lastBuildDate>Mon, 20 Nov 2017 20:20:45 EST</lastBuildDate>
....
The attributes I'm trying to retrieve are in line 6; so 'href', 'type', etc.
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
Obviously, I've tried
instance_alink = root.find('{http://www.w3.org/2005/Atom}link').attrib
but that doesn't work cause it's type None. My thought is that it's looking for children but there are none. I can grab the attributes in the other lines in XML but not these for some reason. I've also played with ElementTree and lxml (but lxml won't load properly on Windows for whatever reason)
Any help is greatly appreciated cause the documentation seems sparse.
I was able to solve with
alink = root.find('channel/{http://www.w3.org/2005/Atom}link').attrib
the issue is that I was looking for the tag {http://www.w3.org/2005/Atom}link at the same level of <channel>, which, of course, didn't exist.

Resources