d = {"a":"a1234","b":"b5678","c":"c4554545"}
I tried converting this dictionary d to XML, as shown below:
<?xml version="1.0" encoding="UTF-8" ?>
<test>
<a>a1234</a>
<b>b5678</b>
<c>c4554545</c>
</test>
Code:
from dicttoxml import dicttoxml
xml = dicttoxml(d, custom_root='test', attr_type=False)
# Above 'xml' is of type bytes here
xml = xml.decode("utf-8") # Converting bytes to string
print(xml) # prints, <?xml version="1.0" encoding="UTF-8" ?><test><a>a1234</a><b>b5678</b><c>c4554545</c></test>
I tried printing the above XML output with pretty print, but I end up with the output below (which excludes the <?xml version declaration):
<test>
<a>a1234</a>
<b>b5678</b>
<c>c4554545</c>
</test>
How can I pretty print it as below?
<?xml version="1.0" encoding="UTF-8" ?>
<test>
<a>a1234</a>
<b>b5678</b>
<c>c4554545</c>
</test>
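One common approach (a minimal sketch, assuming the decoded xml string from the snippet above) is to reparse it with xml.dom.minidom and let that module handle the indentation:
from xml.dom.minidom import parseString
# 'xml' is the decoded string produced by dicttoxml above
dom = parseString(xml)
print(dom.toprettyxml(indent="  "))
Note that toprettyxml() emits its own <?xml ... ?> declaration, which may be worded slightly differently from the one dicttoxml produced.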
I am trying to parse XML with BS4 in Python 3.
For some reason, I am not able to parse namespaces. I tried to look for answers in this question, but it doesn't work for me and I don't get any error message either.
Why does the first part work, but the second does not?
import requests
from bs4 import BeautifulSoup
input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America & Caribbean </wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity />
<wb:longitude />
<wb:latitude />
</wb:country>
</wb:countries>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(input, 'xml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
It looks like non-conforming XML where you have two documents mixed together. In strict XML parsing, a namespace prefix must be declared if it is used, so use lxml instead to get your expected result with this wild mix:
soup = BeautifulSoup(xml_string, 'lxml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
Note: Avoid using Python built-in names (like input) for your own variables; shadowing them can have unwanted effects on your code.
If you have the second document on its own, use:
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Example
from bs4 import BeautifulSoup
xml_string = """
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(xml_string, 'xml')
# working
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Otherwise, you have to define a namespace for your item and can still use the XML parser:
<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
...
When a namespace is defined for an element, all child elements with
the same prefix are associated with the same namespace.
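To illustrate, here is a minimal sketch (the namespace URI below is just the placeholder from the snippet above) showing that the 'xml' parser resolves the prefixed tag once the prefix is declared:
from bs4 import BeautifulSoup
xml_with_ns = """<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
    <title>Some string</title>
    <itunes:subtitle>A subtitle</itunes:subtitle>
</item>"""
soup = BeautifulSoup(xml_with_ns, 'xml')
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)  # prints: A subtitle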
I'm currently frustrated with a code migration to Python 3 (3.6.8).
out_fname is a .cproject file (XML format):
self.cproject_xml = ET.parse(self.CPROJ_NAME)
with open(out_fname, 'a') as cxml:
    cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
    cxml.write('<?fileVersion 4.0.0?>')
    self.cproject_xml.write(cxml, encoding='utf-8')
leads to:
File "/home/build/workspace/bismuth_build_nightly_py3#2/venv/lib/python3.6/site-packages/tinlane/cprojecttools.py", line 209, in export_cproject
self.cproject_xml.write(fxml)
snips..
File "/usr/lib64/python3.6/xml/etree/ElementTree.py", line 946, in _serialize_xml
write(_escape_cdata(elem.tail))
TypeError: write() argument must be str, not bytes
I have tried all sorts of ways to make this work (note that I need the "a" append mode when opening my file; I'm posting the original Python 2 code, not the alternatives). Usually I would just add a "b" to the r/a/w mode, which would solve the problem, but here it doesn't:
cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
TypeError: a bytes-like object is required, not 'str'
even when I convert to bytes (which is wrong in my opinion).
Minimal Example to reproduce:
Create two identical files (file1, file2) with the following content:
<note>
<to>minimal</to>
<from>xml</from>
<heading>file</heading>
<body>content</body>
</note>
and run this code block:
import xml.etree.ElementTree as ET
cproject_xml = ET.parse('file1')
fname = 'file2'
with open(fname, 'a') as cxml:
    cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
    cxml.write('<?fileVersion 4.0.0?>')
    cproject_xml.write(cxml, encoding='utf-8')
When run with Python 2, file2 becomes:
<note>
<to>minimal</to>
<from>xml</from>
<heading>file</heading>
<body>content</body>
</note>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?fileVersion 4.0.0?><note>
<to>minimal</to>
<from>xml</from>
<heading>file</heading>
<body>content</body>
</note>
Any ideas?
Thanks
I'm sure I'm missing something, but it doesn't make sense to try to write the tree (cproject_xml) to the open file handle (cxml).
I think it would make more sense to serialize the tree and write directly to the open file.
Try changing:
cproject_xml.write(cxml, encoding='utf-8')
to:
cxml.write(ET.tostring(cproject_xml.getroot()).decode())
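Applied to the minimal reproduction above, the whole block would look roughly like this (a sketch under the same assumptions, not tested against the original .cproject workflow):
import xml.etree.ElementTree as ET
cproject_xml = ET.parse('file1')
with open('file2', 'a') as cxml:  # text mode, so only str can be written
    cxml.write('<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n')
    cxml.write('<?fileVersion 4.0.0?>')
    # serialize the tree to a str and append it to the file
    cxml.write(ET.tostring(cproject_xml.getroot()).decode())
Alternatively, ET.tostring(cproject_xml.getroot(), encoding='unicode') returns a str directly, so the .decode() call can be dropped.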
So I'm trying to extract info from an XML file but I keep getting this error:
AttributeError: 'list' object has no attribute 'get'
My Code:
from xml.etree import ElementTree as ET
file = ET.parse('db1.xml')
drug = file.findall('drugbank/drug/products')
f = []
for x in drug:
    f.append(x.text)
return f
My XML:
<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.1" exported-on="2019-07-02">
<drug type="biotech" created="2005-06-13" updated="2019-06-04">
<products>
<product>
<name>Refludan</name>
<labeller>Bayer</labeller>
<ndc-id/>
<ndc-product-code/>
<dpd-id>02240996</dpd-id>
<ema-product-code/>
<ema-ma-number/>
<started-marketing-on>2000-01-31</started-marketing-on>
<ended-marketing-on>2013-07-26</ended-marketing-on>
<dosage-form>Powder, for solution</dosage-form>
<strength>50 mg</strength>
<route>Intravenous</route>
<fda-application-number/>
<generic>false</generic>
<over-the-counter>false</over-the-counter>
<approved>true</approved>
<country>Canada</country>
<source>DPD</source>
</product>
</products>
</drug>
</drugbank>
I also tried using drug = file.findall('drugbank/drug/products/name') instead of drug = file.findall('drugbank/drug/products') but it still gave the same error.
I found the issue. Use this code to get the names of your products:
import xml.etree.ElementTree as ET
xml_str = '''<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" version="5.1" exported-on="2019-07-02">
<drug type="biotech" created="2005-06-13" updated="2019-06-04">
<products>
<product>
<name>Refludan</name>
<labeller>Bayer</labeller>
<ndc-id/>
<ndc-product-code/>
<dpd-id>02240996</dpd-id>
<ema-product-code/>
<ema-ma-number/>
<started-marketing-on>2000-01-31</started-marketing-on>
<ended-marketing-on>2013-07-26</ended-marketing-on>
<dosage-form>Powder, for solution</dosage-form>
<strength>50 mg</strength>
<route>Intravenous</route>
<fda-application-number/>
<generic>false</generic>
<over-the-counter>false</over-the-counter>
<approved>true</approved>
<country>Canada</country>
<source>DPD</source>
</product>
</products>
</drug>
</drugbank>
'''
root = ET.fromstring(xml_str)
# print(root.findall('{http://www.drugbank.ca}drug'))
ns = {'drug_bank': 'http://www.drugbank.ca'}
for drug in root.findall('drug_bank:drug', ns):
    for products in drug.findall('drug_bank:products', ns):
        for product in products.findall('drug_bank:product', ns):
            for nametag in product.findall('drug_bank:name', ns):
                print(nametag.text)
Output: Refludan
Explanation:
First I printed root and got this:
<Element '{http://www.drugbank.ca}drugbank' at 0x7f688ffc0770>
So I realised that this XML uses namespaces and the namespace-aware pattern had to be used.
Here is a link to help you understand the topic: https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
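The nested loops can also be collapsed into a single namespaced path; a sketch, reusing the xml_str from above:
import xml.etree.ElementTree as ET
root = ET.fromstring(xml_str)  # same xml_str as above
ns = {'drug_bank': 'http://www.drugbank.ca'}
for nametag in root.findall('.//drug_bank:products/drug_bank:product/drug_bank:name', ns):
    print(nametag.text)  # Refludan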
There is probably a simple solution to my problem, but I am very new to Python 3, so please go easy on me ;)
I have a simple script running which already successfully parses information from an XML file using this code:
import xml.etree.ElementTree as ET
root = ET.fromstring(my_xml_file)
u = root.find(".//name").text.rstrip()
print("Name: %s\n" % u)
The XML I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<example:world-data xmlns="http://www.example.org" xmlns:ops="http://example.oorg" xmlns:xlink="http://www.w3.oorg/1999/xlink">
<exchange-documents>
<exchange-document system="acb.org" family-id="543672" country="US" doc-number="95962" kind="B2">
<bibliographic-data>
<name>SomeName</name>
...and so on... and ends like this
</exchange-document>
</exchange-documents>
</example:world-data>
(Links are edited due to Stack Overflow policy.)
Output as expected:
SomeName
However, if I try to parse another XML file from the same API using the same Python commands, I get this error:
AttributeError: 'NoneType' object has no attribute 'text'
The second XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/pub-ftxt-claims.xsl"?>
<ops:world-data xmlns="http://www.example.org/exchange" xmlns:example="http://example.org" xmlns:xlink="http://www.example.org/1999/xlink">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<bibliographic-data>
<publication-reference data-format="docdb">
<document-id>
<country>EP</country>
<doc-number>10000</doc-number>
<kind>A</kind>
</document-id>
</publication-reference>
</bibliographic-data>
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</ops:world-data>
I tried again
root = ET.fromstring(usr_str)
u = root.find(".//claim-text").text.rstrip()
print("Abstract: %s\n" % u)
Expected output
1. Some text.
But it only prints the above-mentioned error message.
Why can I parse the first XML but not the second one using these commands?
Any help is highly appreciated.
Edit: The code by Jack Fleeting works in the Python console, but unfortunately not in my PyCharm:
from lxml import etree
root = etree.XML(my_xml.encode('ascii'))
root2 = etree.XML(my_xml2.encode('ascii'))
root.xpath('//*[local-name()="name"]/text()')
root2.xpath('//*[local-name()="claim-text"]/text()')
Could this be a bug in my PyCharm? The first code snippet I mentioned still prints a correct result for name...
Edit: It turns out I had to force the output using
a = root3.xpath('//*[local-name()="claim-text"]/text()')
print(a, flush=True)
A couple of problems here before we get to a possible solution. One, the first XML snippet you provided is invalid (for instance, the <bibliographic-data> element isn't closed). I realize it's just a snippet, but since this is what we have to work with, I modified the snippet below to fix that. Two, both snippets use unbound (undeclared) namespace prefixes (example:world-data in the first, and ops:world-data in the second). I had to remove these prefixes, too, for the rest to work.
Given these modifications, using the lxml library should work for you.
First modified snippet:
my_xml = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<world-data xmlns="http://www.example.org" xmlns:ops="http://example.oorg" xmlns:xlink="http://www.w3.oorg/1999/xlink">
<exchange-documents>
<exchange-document system="acb.org" family-id="543672" country="US" doc-number="95962" kind="B2">
<bibliographic-data>
<name>SomeName</name>
...and so on... and ends like this
</bibliographic-data>
</exchange-document>
</exchange-documents>
</world-data>"""
And:
my_xml2 = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/pub-ftxt-claims.xsl"?>
<world-data xmlns="http://www.example.org/exchange" xmlns:example="http://example.org" xmlns:xlink="http://www.example.org/1999/xlink">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<bibliographic-data>
<publication-reference data-format="docdb">
<document-id>
<country>EP</country>
<doc-number>10000</doc-number>
<kind>A</kind>
</document-id>
</publication-reference>
</bibliographic-data>
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>3. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</world-data>"""
And now to work:
from lxml import etree
root = etree.XML(my_xml.encode('ascii'))
root2 = etree.XML(my_xml2.encode('ascii'))
root.xpath('//*[local-name()="name"]/text()')
Output:
['SomeName']
root2.xpath('//*[local-name()="claim-text"]/text()')
Output:
['1. Some text.', '2. Some text.', '3. Some text.']
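If you want to stay with the standard library, ElementTree can reach the same elements once you spell out the namespace URI in Clark notation; a sketch, assuming the modified my_xml2 above (claim-text sits in the default namespace declared on ftxt:fulltext-documents):
import xml.etree.ElementTree as ET
root2 = ET.fromstring(my_xml2)
for el in root2.findall('.//{http://www.examp.org/fulltext}claim-text'):
    print(el.text)  # 1. Some text. / 2. Some text. / 3. Some text.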
Python Version: 3.7.2
Here is an XML file:
<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:ls="https://www.littlstar.com">
<channel>
<title><![CDATA[Name (46 videos)]]></title>
<description><![CDATA[Name]]></description>
<link>http://github.com/dylang/node-rss</link>
<image>
<url>http://1.1.1.1:3001/thumb\324bfb0834915ccc0edb73b5bf0b82c2.jpeg</url>
<title>bla bla</title>
<link></link>
</image>
<generator>RSS for Node</generator>
<lastBuildDate>Sat, 23 Feb 2019 11:32:08 +0000</lastBuildDate>
<category><![CDATA[Local]]></category>
....
and here is the source code. It's very simple:
f = open(xmlpath, 'r')
data = f.read()
f.close()
soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())
and... the result is:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ls="https://www.littlstar.com">
<channel>
<title><![CDATA[Name (46 videos)]]></title>
<description><![CDATA[Name]]></description>
<link/>
http://github.com/dylang/node-rss
<image/>
<url>http://192.168.1.142:3001/thumb\324bfb0834915ccc0edb73b5bf0b82c2.jpeg</url>
...
I lost "link" and "image" tags....
How can I solve this problem?
I tried upgrade bs, and using lxml parsing module...
The .read() and .close() components are not required here.
simply
with open(xmlpath) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
print(soup.prettify())
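If link and image still come out mangled, the underlying cause is that html.parser applies HTML rules, under which link is a void element that cannot hold content. Parsing the feed as XML avoids that; a sketch, assuming lxml is installed so BeautifulSoup's 'xml' parser is available:
from bs4 import BeautifulSoup
with open(xmlpath) as fp:
    soup = BeautifulSoup(fp, 'xml')  # XML parser keeps <link> and <image> as ordinary elements
print(soup.prettify())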