How to write dictionary into xml file with pretty print (python)? - python-3.x

d = {"a":"a1234","b":"b5678","c":"c4554545"}
I am trying to convert this dictionary d to XML, as below:
<?xml version="1.0" encoding="UTF-8" ?>
<test>
<a>a1234</a>
<b>b5678</b>
<c>c4554545</c>
</test>
Code:
from dicttoxml import dicttoxml
xml = dicttoxml(d, custom_root='test', attr_type=False)
# Above 'xml' is of type bytes here
xml = xml.decode("utf-8") # Converting bytes to string
print(xml) # prints, <?xml version="1.0" encoding="UTF-8" ?><test><a>a1234</a><b>b5678</b><c>c4554545</c></test>
I tried pretty-printing the XML output above, but ended up with the result below, which is missing the <?xml version ...?> declaration:
<test>
    <a>a1234</a>
    <b>b5678</b>
    <c>c4554545</c>
</test>
How can I pretty-print it with the declaration included, as below?
<?xml version="1.0" encoding="UTF-8" ?>
<test>
    <a>a1234</a>
    <b>b5678</b>
    <c>c4554545</c>
</test>

Related

How to parse XML namespaces in Python 3 and Beautiful Soup 4?

I am trying to parse XML with BS4 in Python 3.
For some reason, I am not able to parse namespaces. I tried to look for answers in this question, but it doesn't work for me and I don't get any error message either.
Why does the first part work, but the second does not?
import requests
from bs4 import BeautifulSoup
input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America &amp; Caribbean </wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity />
<wb:longitude />
<wb:latitude />
</wb:country>
</wb:countries>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(input, 'xml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
This looks like non-conforming XML in which two documents are mixed together. In strict mode, the XML parser expects every prefix that is used to have a namespace defined for it, and itunes has none here. Use lxml instead to get your expected result from this wild mix:
soup = BeautifulSoup(xml_string, 'lxml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
Note: Avoid shadowing Python built-in names such as input with your own variables; this can have unwanted effects on the rest of your code.
If you have the second document as a separate file, use:
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Example
from bs4 import BeautifulSoup
xml_string = """
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(xml_string, 'xml')
# working
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Otherwise, you have to define a namespace for your item; then you can still use the XML parser:
<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
...
When a namespace is defined for an element, all child elements with
the same prefix are associated with the same namespace.
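The effect of declaring the prefix can be sketched with the standard library's xml.etree.ElementTree, used here instead of Beautiful Soup so the example has no third-party dependencies. It is only an illustration of strict, namespace-aware parsing: the namespace URI is the placeholder from the answer above, and the two tiny documents are trimmed versions of the item element.

```python
import xml.etree.ElementTree as ET

# Placeholder URI reused from the answer above, not the real iTunes namespace.
ITUNES_NS = "http://www.w3.org/TR/html4/"

# Same element twice: once with the itunes prefix undeclared, once declared.
undeclared = "<item><itunes:subtitle>A subtitle</itunes:subtitle></item>"
declared = (
    f'<item xmlns:itunes="{ITUNES_NS}">'
    "<itunes:subtitle>A subtitle</itunes:subtitle>"
    "</item>"
)

# A strict parser refuses a prefix that was never bound to a namespace.
try:
    ET.fromstring(undeclared)
    undeclared_error = None
except ET.ParseError as exc:
    undeclared_error = str(exc)

# Once the prefix is declared, the element can be looked up through a
# prefix-to-URI mapping.
item = ET.fromstring(declared)
subtitle = item.find("itunes:subtitle", {"itunes": ITUNES_NS}).text

print(undeclared_error)
print(subtitle)
```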

Python3 migration xml write issue

Currently frustrated with code migration to Python3 (3.6.8)
out_fname is a .cproject file (xml format)
self.cproject_xml = ET.parse(self.CPROJ_NAME)
with open(out_fname, 'a') as cxml:
    cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
    cxml.write('<?fileVersion 4.0.0?>')
    self.cproject_xml.write(cxml, encoding='utf-8')
leads to:
File "/home/build/workspace/bismuth_build_nightly_py3#2/venv/lib/python3.6/site-packages/tinlane/cprojecttools.py", line 209, in export_cproject
self.cproject_xml.write(fxml)
snips..
File "/usr/lib64/python3.6/xml/etree/ElementTree.py", line 946, in _serialize_xml
write(_escape_cdata(elem.tail))
TypeError: write() argument must be str, not bytes
I have tried many different ways to make it work (note that I need the "a" mode when opening my file); what I posted is the original Python 2 code, not the alternatives. Usually I just add a "b" to the r, a, or w mode, which solves the problem. Here it doesn't work:
cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
TypeError: a bytes-like object is required, not 'str'
It fails even when I convert to bytes (which seems wrong to me).
Minimal Example to reproduce:
create 2 identical files (file1, file2) with the following content:
<note>
<to>minimal</to>
<from>xml</from>
<heading>file</heading>
<body>content</body>
</note>
and run this codeblock:
import xml.etree.ElementTree as ET
cproject_xml = ET.parse('file1')
fname = 'file2'
with open(fname, 'a') as cxml:
    cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
    cxml.write('<?fileVersion 4.0.0?>')
    cproject_xml.write(cxml, encoding='utf-8')
When run with python2, file2 becomes:
<note>
<to>minimal</to>
<from>xml</from>
<heading>file</heading>
<body>content</body>
</note>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?fileVersion 4.0.0?><note>
<to>minimal</to>
<from>xml</from>
<heading>file</heading>
<body>content</body>
</note>
Any ideas?
Thanks
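I may be missing something, but it doesn't make sense to write the tree (cproject_xml) through the open text-mode file handle (cxml): with encoding='utf-8' the tree is serialized to bytes, which a file opened in text mode rejects.
I think it would make more sense to serialize the tree to a string and write that directly to the open file.
Try changing: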
cproject_xml.write(cxml, encoding='utf-8')
to:
cxml.write(ET.tostring(cproject_xml.getroot()).decode())
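A self-contained version of that suggestion might look like the sketch below. It recreates file1 and file2 from the minimal example in a temporary directory (the temporary directory is an addition for the sake of a runnable example) and uses ET.tostring(..., encoding="unicode"), which returns a str directly and so avoids the separate .decode() call.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

NOTE = (
    "<note>\n<to>minimal</to>\n<from>xml</from>\n"
    "<heading>file</heading>\n<body>content</body>\n</note>\n"
)

# Recreate the two identical input files from the minimal example.
workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file1")
file2 = os.path.join(workdir, "file2")
for path in (file1, file2):
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(NOTE)

cproject_xml = ET.parse(file1)

# Text-mode append: everything written to cxml must be str, not bytes.
with open(file2, "a", encoding="utf-8") as cxml:
    cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
    cxml.write('<?fileVersion 4.0.0?>')
    # encoding="unicode" makes tostring() return str instead of bytes.
    cxml.write(ET.tostring(cproject_xml.getroot(), encoding="unicode"))

with open(file2, encoding="utf-8") as fh:
    result = fh.read()
print(result)
```

Passing encoding="unicode" to cproject_xml.write(cxml, ...) should have a similar effect, since that also makes ElementTree hand str to the file object.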

'list' object has no attribute 'get' Python3.8 while getting info from XML

So I'm trying to extract info from an XML file but I keep getting this error:
AttributeError: 'list' object has no attribute 'get'
My Code:
from xml.etree import ElementTree as ET
file = ET.parse('db1.xml')
drug = file.findall('drugbank/drug/products')
f = []
for x in drug:
    f.append(x.text)
return f
My XML:
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.1" exported-on="2019-07-02">
<drug type="biotech" created="2005-06-13" updated="2019-06-04">
<products>
<product>
<name>Refludan</name>
<labeller>Bayer</labeller>
<ndc-id/>
<ndc-product-code/>
<dpd-id>02240996</dpd-id>
<ema-product-code/>
<ema-ma-number/>
<started-marketing-on>2000-01-31</started-marketing-on>
<ended-marketing-on>2013-07-26</ended-marketing-on>
<dosage-form>Powder, for solution</dosage-form>
<strength>50 mg</strength>
<route>Intravenous</route>
<fda-application-number/>
<generic>false</generic>
<over-the-counter>false</over-the-counter>
<approved>true</approved>
<country>Canada</country>
<source>DPD</source>
</product>
</products>
</drug>
</drugbank>
I also tried using drug = file.findall('drugbank/drug/products/name') instead of drug = file.findall('drugbank/drug/products') but it still gave the same error.
I found the issue. Use this code to get the names of your products:
import xml.etree.ElementTree as ET
xml_str = '''<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.1" exported-on="2019-07-02">
<drug type="biotech" created="2005-06-13" updated="2019-06-04">
<products>
<product>
<name>Refludan</name>
<labeller>Bayer</labeller>
<ndc-id/>
<ndc-product-code/>
<dpd-id>02240996</dpd-id>
<ema-product-code/>
<ema-ma-number/>
<started-marketing-on>2000-01-31</started-marketing-on>
<ended-marketing-on>2013-07-26</ended-marketing-on>
<dosage-form>Powder, for solution</dosage-form>
<strength>50 mg</strength>
<route>Intravenous</route>
<fda-application-number/>
<generic>false</generic>
<over-the-counter>false</over-the-counter>
<approved>true</approved>
<country>Canada</country>
<source>DPD</source>
</product>
</products>
</drug>
</drugbank>
'''
root = ET.fromstring(xml_str)
# print(root.findall('{http://www.drugbank.ca}drug'))
ns = {'drug_bank': 'http://www.drugbank.ca'}
for drug in root.findall('drug_bank:drug', ns):
    for products in drug.findall('drug_bank:products', ns):
        for product in products.findall('drug_bank:product', ns):
            for nametag in product.findall('drug_bank:name', ns):
                print(nametag.text)
Output: Refludan
Explanation:
First I printed root and got this:
<Element '{http://www.drugbank.ca}drugbank' at 0x7f688ffc0770>
The tag is prefixed with a namespace URI, so I realised that the elements have to be looked up with namespace-qualified names.
This section of the documentation explains the topic: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
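The four nested loops can probably be collapsed into a single path expression, since findall() accepts a multi-step path together with the namespace mapping. The sketch below uses a trimmed copy of the document, and the prefix db is an arbitrary label chosen for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the DrugBank sample; only the elements on the path are kept.
xml_str = """<drugbank xmlns="http://www.drugbank.ca" version="5.1">
<drug type="biotech">
<products>
<product>
<name>Refludan</name>
<labeller>Bayer</labeller>
</product>
</products>
</drug>
</drugbank>"""

root = ET.fromstring(xml_str)

# Map an arbitrary prefix to the document's default namespace URI.
ns = {"db": "http://www.drugbank.ca"}

# Every step of the path needs the prefix because every element inherits
# the default namespace declared on the root.
names = [el.text for el in root.findall("db:drug/db:products/db:product/db:name", ns)]
print(names)
```

The path starts at drug rather than drugbank because root already is the drugbank element.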

XML-Parsing error AttributeError: 'NoneType' object has no attribute 'text'

There is probably a simple solution to my problem, but I am very new to python3 so please go easy on me;)
I have a simple script running, which already successfully parses information from an xml-file using this code
import xml.etree.ElementTree as ET
root = ET.fromstring(my_xml_file)
u = root.find(".//name").text.rstrip()
print("Name: %s\n" % u)
The xml I am parsing looks like this
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<example:world-data xmlns="http://www.example.org" xmlns:ops="http://example.oorg" xmlns:xlink="http://www.w3.oorg/1999/xlink">
<exchange-documents>
<exchange-document system="acb.org" family-id="543672" country="US" doc-number="95962" kind="B2">
<bibliographic-data>
<name>SomeName</name>
...and so on... and ends like this
</exchange-document>
</exchange-documents>
</example:world-data>
(Links are edited due to stackoverflow policy)
Output as expected
SomeName
However, if I try to parse another xml from the same api using the same python commands, I get this error-code
AttributeError: 'NoneType' object has no attribute 'text'
The second xml-file looks like this
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/pub-ftxt-claims.xsl"?>
<ops:world-data xmlns="http://www.example.org/exchange" xmlns:example="http://example.org" xmlns:xlink="http://www.example.org/1999/xlink">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<bibliographic-data>
<publication-reference data-format="docdb">
<document-id>
<country>EP</country>
<doc-number>10000</doc-number>
<kind>A</kind>
</document-id>
</publication-reference>
</bibliographic-data>
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</ops:world-data>
I tried again
root = ET.fromstring(usr_str)
u = root.find(".//claim-text").text.rstrip()
print("Abstract: %s\n" % u)
Expected output
1. Some text.
But it only prints the above mentioned error message.
Why can I parse the first xml but not the second one using these commands?
Any help is highly appreciated.
edit: code by Jack Fleeting works in python console, but unfortunately not in my PyCharm
from lxml import etree
root = etree.XML(my_xml.encode('ascii'))
root2 = etree.XML(my_xml2.encode('ascii'))
root.xpath('//*[local-name()="name"]/text()')
root2.xpath('//*[local-name()="claim-text"]/text()')
Could this be a bug in my PyCharm? My first mentioned code snippet still prints a correct result for name...
edit: Turns out I had to force the output using
a = root3.xpath('//*[local-name()="claim-text"]/text()')
print(a, flush=True)
A couple of problems here before we get to a possible solution. One, the first XML snippet you provided is not well-formed (for instance, <bibliographic-data> isn't closed). I realize it's just a snippet, but since this is what we have to work with, I modified the snippet below to fix that. Two, both snippets use a prefix on the root element that is never bound by an xmlns declaration (example:world-data in the first, and ops:world-data in the second). I had to remove these prefixes, too, for the rest to work.
Given these modifications, using the lxml library should work for you.
First modified snippet:
my_xml = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<world-data xmlns="http://www.example.org" xmlns:ops="http://example.oorg" xmlns:xlink="http://www.w3.oorg/1999/xlink">
<exchange-documents>
<exchange-document system="acb.org" family-id="543672" country="US" doc-number="95962" kind="B2">
<bibliographic-data>
<name>SomeName</name>
...and so on... and ends like this
</bibliographic-data>
</exchange-document>
</exchange-documents>
</world-data>"""
And:
my_xml2 = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/pub-ftxt-claims.xsl"?>
<world-data xmlns="http://www.example.org/exchange" xmlns:example="http://example.org" xmlns:xlink="http://www.example.org/1999/xlink">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<bibliographic-data>
<publication-reference data-format="docdb">
<document-id>
<country>EP</country>
<doc-number>10000</doc-number>
<kind>A</kind>
</document-id>
</publication-reference>
</bibliographic-data>
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>3. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</world-data>"""
And now to work:
from lxml import etree
root = etree.XML(my_xml.encode('ascii'))
root2 = etree.XML(my_xml2.encode('ascii'))
root.xpath('//*[local-name()="name"]/text()')
output:
['SomeName']
root2.xpath('//*[local-name()="claim-text"]/text()')
Output:
['1. Some text.', '2. Some text.', '3. Some text.']
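For comparison, the same lookup can be sketched with the standard library's ElementTree, assuming the modified second snippet above. It also shows where the original AttributeError comes from: claim-text inherits the default namespace declared on fulltext-documents, so an unqualified find() returns None. The prefix ft is an arbitrary label chosen here, and the document is trimmed to the relevant elements.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the modified second snippet.
my_xml2 = """<world-data xmlns="http://www.example.org/exchange">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>3. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</world-data>"""

root = ET.fromstring(my_xml2)

# Unqualified lookup: no element is named plain "claim-text", so this is None
# and calling .text on it raises the AttributeError from the question.
unqualified = root.find(".//claim-text")

# Qualified lookup against the default namespace that claim-text inherits.
ns = {"ft": "http://www.examp.org/fulltext"}
claims = [el.text for el in root.findall(".//ft:claim-text", ns)]

print(unqualified)
print(claims)
```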

How can I solve xml-parsing-error in python3.7 with BeautifulSoup?

Python Version: 3.7.2
Here is an XML file:
<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:ls="https://www.littlstar.com">
<channel>
<title><![CDATA[Name (46 videos)]]></title>
<description><![CDATA[Name]]></description>
<link>http://github.com/dylang/node-rss</link>
<image>
<url>http://1.1.1.1:3001/thumb\324bfb0834915ccc0edb73b5bf0b82c2.jpeg</url>
<title>bla bla</title>
<link></link>
</image>
<generator>RSS for Node</generator>
<lastBuildDate>Sat, 23 Feb 2019 11:32:08 +0000</lastBuildDate>
<category><![CDATA[Local]]></category>
....
And here is the source code; it's very simple:
f = open(xmlpath, 'r')
data = f.read()
f.close()
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())
And the result is:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ls="https://www.littlstar.com">
<channel>
<title><![CDATA[Name (46 videos)]]></title>
<description><![CDATA[Name]]></description>
<link/>
http://github.com/dylang/node-rss
<image/>
<url>http://192.168.1.142:3001/thumb\324bfb0834915ccc0edb73b5bf0b82c2.jpeg</url>
...
I lost the "link" and "image" tags (they come out empty).
How can I solve this problem?
I tried upgrading bs4 and using the lxml parsing module.
The .read() and .close() calls are not required here; BeautifulSoup accepts an open file object directly.
Simply use:
with open(xmlpath) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
print(soup.prettify())
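The emptied tags themselves seem to come from the parser choice rather than from how the file is read: html.parser applies HTML rules, under which <link> is a void element that cannot hold content. Beautiful Soup's 'xml' feature (which needs lxml installed) parses the input as XML instead. The sketch below makes the same point with the standard library's ElementTree so that it runs without third-party packages; the feed is a trimmed copy of the one in the question, with the <url> element left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the RSS feed from the question.
rss = """<rss version="2.0">
<channel>
<title><![CDATA[Name (46 videos)]]></title>
<link>http://github.com/dylang/node-rss</link>
<image>
<title>bla bla</title>
<link></link>
</image>
<generator>RSS for Node</generator>
</channel>
</rss>"""

root = ET.fromstring(rss)
channel = root.find("channel")

# An XML parser has no notion of void elements, so <link> keeps its text
# and <image> keeps its children.
link_text = channel.find("link").text
image_title = channel.find("image/title").text

print(link_text)
print(image_title)
```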
