I have looked at the other question about Parsing XML with namespace in Python via 'ElementTree' and reviewed the xml.etree.ElementTree documentation. The issue I'm having is admittedly similar, so feel free to tag this as a duplicate, but I can't figure it out.
The line of code I'm having issues with is
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
My code is as follows:
import xml.etree.cElementTree as ET
tree = ET.parse('../../external_data/rss.xml')
root = tree.getroot()
instance_title = root.find('channel/title').text
instance_link = root.find('channel/link').text
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
instance_description = root.find('channel/description').text
instance_language = root.find('channel/language').text
instance_pubDate = root.find('channel/pubDate').text
instance_lastBuildDate = root.find('channel/lastBuildDate').text
The XML file:
<?xml version="1.0" encoding="windows-1252"?>
<rss version="2.0">
<channel>
<title>Filings containing financial statements tagged using the US GAAP or IFRS taxonomies.</title>
<link>http://www.example.com</link>
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
<description>This is a list of up to 200 of the latest filings containing financial statements tagged using the US GAAP or IFRS taxonomies, updated every 10 minutes.</description>
<language>en-us</language>
<pubDate>Mon, 20 Nov 2017 20:20:45 EST</pubDate>
<lastBuildDate>Mon, 20 Nov 2017 20:20:45 EST</lastBuildDate>
....
The attributes I'm trying to retrieve are on line 6 of the XML, i.e. 'href', 'type', etc.
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
Obviously, I've tried
instance_alink = root.find('{http://www.w3.org/2005/Atom}link').attrib
but that doesn't work, because find() returns None, so there is no .attrib to read. My thought is that it's looking for children at the root level, but there are none there. I can grab the attributes on the other lines of the XML but not these, for some reason. I've also played with ElementTree and lxml (but lxml won't install properly on Windows for whatever reason).
Any help is greatly appreciated, because the documentation seems sparse.
I was able to solve it with
alink = root.find('channel/{http://www.w3.org/2005/Atom}link').attrib
The issue was that I was looking for the tag {http://www.w3.org/2005/Atom}link at the same level as <channel>, where, of course, it doesn't exist; <atom:link> is a child of <channel>, not a sibling.
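For anyone else hitting this: the namespace URI can also be kept out of the path by passing a namespace mapping to find(). A minimal sketch, assuming the rss.xml shown above (the atom key in the mapping is an arbitrary local name):
import xml.etree.ElementTree as ET

ns = {'atom': 'http://www.w3.org/2005/Atom'}
tree = ET.parse('../../external_data/rss.xml')
root = tree.getroot()

# <atom:link> is a child of <channel>, so the path must start there
alink = root.find('channel/atom:link', ns).attrib
print(alink['href'], alink['rel'], alink['type'])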
Related
I am trying to parse XML with BS4 in Python 3.
For some reason, I am not able to parse namespaces. I tried the answers in this question, but they don't work for me, and I don't get any error message either.
Why does the first part work, but the second does not?
import requests
from bs4 import BeautifulSoup
input = """
<?xml version="1.0" encoding="utf-8"?>
<wb:countries page="1" pages="6" per_page="50" total="299" xmlns:wb="http://www.worldbank.org">
<wb:country id="ABW">
<wb:iso2Code>AW</wb:iso2Code>
<wb:name>Aruba</wb:name>
<wb:region id="LCN" iso2code="ZJ">Latin America & Caribbean </wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="HIC" iso2code="XD">High income</wb:incomeLevel>
<wb:lendingType id="LNX" iso2code="XX">Not classified</wb:lendingType>
<wb:capitalCity>Oranjestad</wb:capitalCity>
<wb:longitude>-70.0167</wb:longitude>
<wb:latitude>12.5167</wb:latitude>
</wb:country>
<wb:country id="AFE">
<wb:iso2Code>ZH</wb:iso2Code>
<wb:name>Africa Eastern and Southern</wb:name>
<wb:region id="NA" iso2code="NA">Aggregates</wb:region>
<wb:adminregion id="" iso2code="" />
<wb:incomeLevel id="NA" iso2code="NA">Aggregates</wb:incomeLevel>
<wb:lendingType id="" iso2code="">Aggregates</wb:lendingType>
<wb:capitalCity />
<wb:longitude />
<wb:latitude />
</wb:country>
</wb:countries>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(input, 'xml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# Not working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
It looks like non-conforming XML where you have two documents mixed together. A strict XML parser expects a namespace prefix to be declared before it is used, and itunes never is. Use the lxml parser instead to get your expected result from this wild mix:
soup = BeautifulSoup(xml_string, 'lxml')
# Working
for x in soup.find_all('wb:country'):
    print(x.find('wb:name').text)
# also working
for x in soup.find_all('item'):
    print(x.find('itunes:subtitle').text)
Note: Avoid shadowing Python built-in names (here, input); this can have unwanted effects on the results of your code.
If you have the second document separately, use:
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Example
from bs4 import BeautifulSoup
xml_string = """
<?xml version="1.0" encoding="utf-8"?>
<item>
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
<enclosure length="0" type="audio/mpeg" url="https://assets.somesite.com/123.mp3"/>
<itunes:image href="https://somesite.com/img.jpg"/>
<itunes:duration>7845</itunes:duration>
<itunes:explicit>no</itunes:explicit>
<itunes:episodeType>Full</itunes:episodeType>
</item>
"""
soup = BeautifulSoup(xml_string, 'xml')
# working
for x in soup.find_all('item'):
    print(x.find('subtitle').text)
Otherwise, you have to declare a namespace for your item and can still use the XML parser:
<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
<title>Some string</title>
<pubDate>Wed, 01 Sep 2022 12:45:00 +0000</pubDate>
<guid isPermaLink="false">4574785</guid>
<link>https://somesite.com</link>
<itunes:subtitle>A subtitle</itunes:subtitle>
...
When a namespace is defined for an element, all child elements with
the same prefix are associated with the same namespace.
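Putting that together, a minimal sketch of the namespaced variant (the namespace URI is just the placeholder used above, not the real iTunes one):
from bs4 import BeautifulSoup

xml_string = """<?xml version="1.0" encoding="utf-8"?>
<item xmlns:itunes="http://www.w3.org/TR/html4/">
<title>Some string</title>
<itunes:subtitle>A subtitle</itunes:subtitle>
</item>"""

soup = BeautifulSoup(xml_string, 'xml')

# with the prefix declared, the strict XML parser can resolve itunes:subtitle
print(soup.find('itunes:subtitle').text)  # A subtitle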
Are those two roots the same?
The output XML changed the order of the attributes in the root element.
If they are different from each other, how can I fix it?
#Python 3.7
import xml.etree.ElementTree as ET
ET.register_namespace('xsi', "http://www.w3.org/2001/test")
ET.register_namespace('', "http://www.test.com/test/test/test")
tree = ET.parse('test.xml')
tree.write("test1.xml", encoding='utf-8', xml_declaration=True)
#input XML root
<root xmlns:xsi="http://www.w3.org/2001/test" schemaVersion="2.8" xmlns="http://www.test.com/test/test/test" labelVersion="1" xsi:schemaLocation="http://www.test.com/test/test/test ..\Schema\CLIFSchema.xsd" name="test.xml">
#output XML root
<root xmlns="http://www.test.com/test/test/test" xmlns:xsi="http://www.w3.org/2001/test" labelVersion="1" name="test.xml" schemaVersion="2.8" xsi:schemaLocation="http://www.test.com/test/test/test ..\Schema\CLIFSchema.xsd">
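For what it's worth, attribute order is not significant in XML, so those two roots are semantically the same document; ElementTree in Python 3.7 simply sorts attributes alphabetically when serializing, while Python 3.8+ preserves the original order. A sketch of checking the equivalence, assuming Python 3.8+ where ET.canonicalize is available:
import xml.etree.ElementTree as ET

# C14N serializes attributes in a defined order, so documents that
# differ only in attribute order canonicalize to the same string
c1 = ET.canonicalize(from_file='test.xml')
c2 = ET.canonicalize(from_file='test1.xml')
print(c1 == c2)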
I am working on text data in string format. I'd like to know how to extract parts of the string as below:
data = '<?xml version_history="1.0" encoding="utf-8"?><feed xml:base="https://dummydomain.facebook.com/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"><id>aad232-c2cc-42ca-ac1e-e1d1b4dd55de</id><title<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_h84_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>3. Contract Signed<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Amy, Jackson</d:LookupValue><d:Email>Amy.Jackson#doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>2.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>2. Active Discussion<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph #doe.com</d:Email></d:LookupValue><d:Email>Amy.Jackson#doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>1.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>1. Exploratory<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2019-07-15T10:20:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph #doe.com</d:Email>'
I want to extract all <d:VersionType>, <d:Name>, <d:Stage>, and <d:Modified m:type="Edm.DateTime"> values.
Expected outputs:
d:VersionType d:Name d:Stage d:Modified m:type="Edm.DateTime"
3.0 XYZ Company 3. Contract 2020-07-30T12:15:04Z
2.0 XYZ Company 2. Contract 2020-02-15T18:15:60Z
1.0 XYZ Company 1. Exploratory 2019-07-15T10:20:04Z
Thanks in advance for your help!
Try using Beautiful Soup, as it lets you parse XML, HTML, and other documents. Such files already have a specific structure, so you don't have to build a regex from scratch, which makes your job a lot easier.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
version_type = [item.text for item in soup.findAll('d:VersionType')] # gives ['3.0', '2.0', '1.0']
Replace d:VersionType with the other elements you want (d:Name, d:Stage, ...) to extract their contents as well.
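A sketch of pulling all four fields and printing them row by row; since the markup is malformed (mismatched tags such as <d:Name>...</d:Title>), how closely the recovered text matches the expected table depends on how the parser repairs it:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'xml')
fields = ['d:VersionType', 'd:Name', 'd:Stage', 'd:Modified']
# one list of texts per field, then transpose into rows
columns = [[item.text for item in soup.findAll(f)] for f in fields]
for row in zip(*columns):
    print(row)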
There is probably a simple solution to my problem, but I am very new to Python 3, so please go easy on me ;)
I have a simple script running which already successfully parses information from an XML file using this code:
import xml.etree.ElementTree as ET
root = ET.fromstring(my_xml_file)
u = root.find(".//name").text.rstrip()
print("Name: %s\n" % u)
The XML I am parsing looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<example:world-data xmlns="http://www.example.org" xmlns:ops="http://example.oorg" xmlns:xlink="http://www.w3.oorg/1999/xlink">
<exchange-documents>
<exchange-document system="acb.org" family-id="543672" country="US" doc-number="95962" kind="B2">
<bibliographic-data>
<name>SomeName</name>
...and so on... and ends like this
</exchange-document>
</exchange-documents>
</example:world-data>
(Links are edited due to stackoverflow policy)
Output as expected
SomeName
However, if I try to parse another XML file from the same API using the same Python commands, I get this error:
AttributeError: 'NoneType' object has no attribute 'text'
The second XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/pub-ftxt-claims.xsl"?>
<ops:world-data xmlns="http://www.example.org/exchange" xmlns:example="http://example.org" xmlns:xlink="http://www.example.org/1999/xlink">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<bibliographic-data>
<publication-reference data-format="docdb">
<document-id>
<country>EP</country>
<doc-number>10000</doc-number>
<kind>A</kind>
</document-id>
</publication-reference>
</bibliographic-data>
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</ops:world-data>
I tried again
root = ET.fromstring(usr_str)
u = root.find(".//claim-text").text.rstrip()
print("Abstract: %s\n" % u)
Expected output
1. Some text.
But it only prints the above-mentioned error message.
Why can I parse the first XML file but not the second one using these commands?
Any help is highly appreciated.
Edit: the code by Jack Fleeting works in the Python console, but unfortunately not in my PyCharm.
from lxml import etree
root = etree.XML(my_xml.encode('ascii'))
root2 = etree.XML(my_xml2.encode('ascii'))
root.xpath('//*[local-name()="name"]/text()')
root2.xpath('//*[local-name()="claim-text"]/text()')
Could this be a bug in PyCharm? My first code snippet still prints a correct result for name...
Edit: turns out I had to force the output using
a = root3.xpath('//*[local-name()="claim-text"]/text()')
print(a, flush=True)
A couple of problems here before we get to a possible solution. One, the first XML snippet you provided is invalid (for instance, the <bibliographic-data> isn't closed). I realize it's just a snippet, but since this is what we have to work with, I modified the snippet below to fix that. Two, both snippets have xmlns declarations with unbound (unused) prefixes (example:world-data in the first, and ops:world-data in the second). I had to remove these prefixes, too, for the rest to work.
Given these modifications, using the lxml library should work for you.
First modified snippet:
my_xml = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/exchange.xsl"?>
<world-data xmlns="http://www.example.org" xmlns:ops="http://example.oorg" xmlns:xlink="http://www.w3.oorg/1999/xlink">
<exchange-documents>
<exchange-document system="acb.org" family-id="543672" country="US" doc-number="95962" kind="B2">
<bibliographic-data>
<name>SomeName</name>
...and so on... and ends like this
</bibliographic-data>
</exchange-document>
</exchange-documents>
</world-data>"""
And:
my_xml2 = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/3.2/style/pub-ftxt-claims.xsl"?>
<world-data xmlns="http://www.example.org/exchange" xmlns:example="http://example.org" xmlns:xlink="http://www.example.org/1999/xlink">
<ftxt:fulltext-documents xmlns="http://www.examp.org/fulltext" xmlns:ftxt="ww.example/fulltext">
<ftxt:fulltext-document system="example.org" fulltext-format="text-only">
<bibliographic-data>
<publication-reference data-format="docdb">
<document-id>
<country>EP</country>
<doc-number>10000</doc-number>
<kind>A</kind>
</document-id>
</publication-reference>
</bibliographic-data>
<claims lang="EN">
<claim>
<claim-text>1. Some text.</claim-text>
<claim-text>2. Some text.</claim-text>
<claim-text>3. Some text.</claim-text>
</claim>
</claims>
</ftxt:fulltext-document>
</ftxt:fulltext-documents>
</world-data>"""
And now to work:
from lxml import etree
root = etree.XML(my_xml.encode('ascii'))
root2 = etree.XML(my_xml2.encode('ascii'))
root.xpath('//*[local-name()="name"]/text()')
output:
['SomeName']
root2.xpath('//*[local-name()="claim-text"]/text()')
Output:
['1. Some text.', '2. Some text.', '3. Some text.']
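For comparison, the same lookup also works with plain ElementTree once the default namespace is spelled out; .//claim-text found nothing earlier because the unprefixed elements in the second snippet live in the http://www.examp.org/fulltext default namespace. A sketch, assuming the modified my_xml2 above (ft is an arbitrary prefix chosen for that namespace):
import xml.etree.ElementTree as ET

root2 = ET.fromstring(my_xml2)
ns = {'ft': 'http://www.examp.org/fulltext'}
for el in root2.findall('.//ft:claim-text', ns):
    print(el.text)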
I'm working on a coreference-resolution system based on neural networks for my Bachelor's thesis, and I have a problem when I read the corpus.
The corpus is already preprocessed, and I only need to read it to do my stuff. I use Beautiful Soup 4 to read the XML files of each document that contains the data I need.
The files look like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/markable">
<markable id="markable_102" span="word_390" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="indefnp" type="" entity="NO" nb="UNK" def="INDEF" sentenceid="19" lemmata="premia" pos="nn" head_pos="word_390" wikipedia="" mmax_level="markable"/>
<markable id="markable_15" span="word_48..word_49" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="defnp" type="" entity="NO" nb="SG" def="DEF" sentenceid="3" lemmata="Grozni hegoalde" pos="nnp nn" head_pos="word_48" wikipedia="Grozny" mmax_level="markable"/>
<markable id="markable_101" span="word_389" grammatical_role="sbj" coref_set="set_21" coref_type="named entities" visual="none" rel_type="coreferential" sub_type="exact repetition" np_form="ne_o" type="enamex" entity="LOC" nb="SG" def="DEF" sentenceid="19" lemmata="Mosku" pos="nnp" head_pos="word_389" wikipedia="" mmax_level="markable"/>
...
I need to extract all the spans here, so I try to do it with this code (Python 3):
...
from bs4 import BeautifulSoup
...
file1 = markables+filename+"_markable_level.xml"
xml1 = open(file1) #markable
soup1 = BeautifulSoup(xml1, "html5lib") #markable
...
...
for markable in soup1.findAll('markable'):
    try:
        span = markable.contents[1]['span']
        print(span)
        spanA = span.split("..")[0]
        spanB = span.split("..")[-1]
        ...
(I omitted most of the code, as it is 500 lines.)
python3 aurreprozesaketaSTM.py
train
--- 28.329787254333496 seconds ---
&&&&&&&&&&&&&&&&&&&&&&&&& egun.06-1-p0002500.2000-06-01.europa
word_48..word_49
word_389
word_385..word_386
word_48..word_52
...
If you compare the XML file with the output, you can see that word_390 is missing.
I get almost all the data that I need, then preprocess everything, build the system with neural networks, and finally I get scores and all...
But as I lose the first word of each document, my system's accuracy is a bit lower than it should be.
Can anyone help me with this? Any idea where the problem is?
You are parsing XML with html5lib, which does not support parsing XML. Per the Beautiful Soup documentation:
lxml’s XML parser ... The only currently supported XML parser
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
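A minimal sketch of the same extraction with Beautiful Soup's XML parser instead of html5lib, assuming the markables file shown above (markables and filename are the variables from the question's snippet). Note that with a real XML parser, the span attribute sits directly on each markable tag, so the contents[1] workaround is no longer needed:
from bs4 import BeautifulSoup

file1 = markables + filename + "_markable_level.xml"
with open(file1) as xml1:
    soup1 = BeautifulSoup(xml1, 'xml')  # requires lxml to be installed

for markable in soup1.find_all('markable'):
    span = markable['span']
    spanA = span.split("..")[0]
    spanB = span.split("..")[-1]
    print(span)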