Beautiful Soup findAll() doesn't find the first one - python-3.x

I'm working on a coreference-resolution system based on Neural Networks for my Bachelor's Thesis, and i have a problem when i read the corpus.
The corpus is already preproccesed, and i only need to read it to do my stuff. I use Beautiful Soup 4 to read the xml files of each document that contains the data i need.
the files look like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/markable">
<markable id="markable_102" span="word_390" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="indefnp" type="" entity="NO" nb="UNK" def="INDEF" sentenceid="19" lemmata="premia" pos="nn" head_pos="word_390" wikipedia="" mmax_level="markable"/>
<markable id="markable_15" span="word_48..word_49" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="defnp" type="" entity="NO" nb="SG" def="DEF" sentenceid="3" lemmata="Grozni hegoalde" pos="nnp nn" head_pos="word_48" wikipedia="Grozny" mmax_level="markable"/>
<markable id="markable_101" span="word_389" grammatical_role="sbj" coref_set="set_21" coref_type="named entities" visual="none" rel_type="coreferential" sub_type="exact repetition" np_form="ne_o" type="enamex" entity="LOC" nb="SG" def="DEF" sentenceid="19" lemmata="Mosku" pos="nnp" head_pos="word_389" wikipedia="" mmax_level="markable"/>
...
i need to extract all the spans here, so try to do it with this code (python3):
...
from bs4 import BeautifulSoup
...
file1 = markables+filename+"_markable_level.xml"
xml1 = open(file1) #markable
soup1 = BeautifulSoup(xml1, "html5lib") #markable
...
...
for markable in soup1.findAll('markable'):
try:
span = markable.contents[1]['span']
print(span)
spanA = span.split("..")[0]
spanB = span.split("..")[-1]
...
(I ignored most of the code, as they are 500 lines)
python3 aurreprozesaketaSTM.py
train
--- 28.329787254333496 seconds ---
&&&&&&&&&&&&&&&&&&&&&&&&& egun.06-1-p0002500.2000-06-01.europa
word_48..word_49
word_389
word_385..word_386
word_48..word_52
...
if you conpare the xml file with the output, you can see that word_390 is missing.
I get almost all the data that i need, then preproccess everything, build the system with neural networks, and finally i get scores and all...
But as I loose the first word of each document, my systems accuracy is a bit lower than what should be.
Can anyone help me with this? Any idea where is the problem?

You are parsing XML with html5lib. It is not supported for parsing XML.
lxml’s XML parser ... The only currently supported XML parser
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

Related

How do I in FloPy Modflow6 output MAW head values for all timesteps?

I am creating a MAW well and want to use it as an observation well to compare it later to field data, it should be screened over multiple layers. However, I am only getting the head value in the well of the very last timestep in my output file. Any ideas on how to get all timesteps in the output?
The FloPy manual says something about it needing to be in Output Control, but I can't figure out how to do that:
print_head (boolean) – print_head (boolean) keyword to indicate that the list of multi-aquifer well heads will be printed to the listing file for every stress period in which “HEAD PRINT” is specified in Output Control. If there is no Output Control option and PRINT_HEAD is specified, then heads are printed for the last time step of each stress period.
In the MODFLOW6 manual I see that it is possible to make a continuous output:
modflow6
My MAW definition looks like this:
maw = flopy.mf6.ModflowGwfmaw(gwf,
nmawwells=1,
packagedata=[0, Rwell, minbot, wellhead,'MEAN',OBS1welllayers],
connectiondata=OBS1connectiondata,
perioddata=[(0,'STATUS','ACTIVE')],
flowing_wells=False,
save_flows=True,
mover=True,
flow_correction=True,
budget_filerecord='OBS1wellbudget',
print_flows=True,
print_head=True,
head_filerecord='OBS1wellhead',
)
My output control looks like this:
oc = flopy.mf6.ModflowGwfoc(gwf,
budget_filerecord=budget_file,
head_filerecord=head_file,
saverecord=[('HEAD', 'ALL'), ('BUDGET', 'ALL'), ],
)
Hope this is all clear and someone can help me, thanks!
You need to initialise the MAW observations file... it's not done in the OC package.
You can find the scripts for the three MAW examples in the MF6 documentation here:
https://github.com/MODFLOW-USGS/modflow6-examples/tree/master/notebooks
It looks something like this:
obs_file = "{}.maw.obs".format(name)
csv_file = obs_file + ".csv"
obs_dict = {csv_file: [
("head", "head", (0,)),
("Q1", "maw", (0,), (0,)),
("Q2", "maw", (0,), (1,)),
("Q3", "maw", (0,), (2,)),
]}
maw.obs.initialize(filename=obs_file, digits=10, print_input=True, continuous=obs_dict)

read couple of xml file and saved them as list of list

I have a 112 XML file, each contains a paragraph, like this: (this is one XML sample, we have 112 samples)
<?xml version='1.0' encoding='UTF-8'?>
<arggraph id="micro_b001" topic_id="waste_separation" stance="pro">
<edu id="e1"><![CDATA[Yes, it's annoying and cumbersome to separate your rubbish properly all the time.]]></edu>
<edu id="e2"><![CDATA[Three different bin bags stink away in the kitchen and have to be sorted into different wheelie bins.]]></edu>
<edu id="e3"><![CDATA[But still Germany produces way too much rubbish]]></edu>
<edu id="e4"><![CDATA[and too many resources are lost when what actually should be separated and recycled is burnt.]]></edu>
<edu id="e5"><![CDATA[We Berliners should take the chance and become pioneers in waste separation!]]></edu>
<adu id="a1" type="opp"/>
<adu id="a2" type="opp"/>
<adu id="a3" type="pro"/>
<adu id="a4" type="pro"/>
<adu id="a5" type="pro"/>
<edge id="c6" src="e1" trg="a1" type="seg"/>
<edge id="c7" src="e2" trg="a2" type="seg"/>
<edge id="c8" src="e3" trg="a3" type="seg"/>
<edge id="c9" src="e4" trg="a4" type="seg"/>
<edge id="c10" src="e5" trg="a5" type="seg"/>
<edge id="c1" src="a1" trg="a5" type="reb"/>
<edge id="c2" src="a2" trg="a1" type="sup"/>
<edge id="c3" src="a3" trg="c1" type="und"/>
<edge id="c4" src="a4" trg="c3" type="add"/>
</arggraph>
I want to read each of them in python and gather from each of the text that ends with "edu" ,and then saved them as
list of the list! like this
[[Yes, it's annoying and cumbersome to separate your rubbish properly all the time., Three different bin bags stink away in the kitchen and have to be sorted into different wheelie bins., But still Germany produces way too much rubbish
,and too many resources are lost when what actually should be separated and recycled is burnt , We Berliners should take the chance and become pioneers in waste separation!] , [
next XML content] ,[next, XML content],...
]]
I have tried this way
I have saved them all of them as list in myList
myList = []
myEdgesList=[]
#read the whole text from
for root, dirs, files in os.walk(path):
for file in files:
if file.endswith('.xml'):
with open(os.path.join(root, file), encoding="UTF-8") as content:
tree = ET.parse(content)
myList.append(tree)
then:
ParaList=[]
EduList=[]
for k in myList:
a=k.findall('.//edu')
for l in a:
EduList.append(l.text)
ParaList.append(EduList)
but the result only gives me a flat list of all sentences (576) and not a list of 112 paragraphs
can someone help me?
Assuming that myList is a list of parsed XML documents, moving ParaList.append(EduList) inside the main for loop should fix it for you. You also need to reset the EduList once per document, so also move EduList=[] inside the main loop:
ParaList=[]
for k in myList:
EduList=[]
a=k.findall('.//edu')
for l in a:
EduList.append(l.text)
ParaList.append(EduList)
Now the extracted content of each XML document is appended to ParaList once per document.
A better way to write this code is to use a list comprehension to process the matching lines:
ParaList=[]
for k in myList:
ParaList.append([l.text for l in k.findall('.//edu')])
Or you could even do it in one line using a nested list comprehension:
ParaList = [[l.text for l in k.findall('.//edu')] for k in myList]

Working xml data - Extract part of string in Python [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I am working on the text data in a string format. I'd like to know how to extract part of the string as below:
data = '<?xml version_history="1.0" encoding="utf-8"?><feed xml:base="https://dummydomain.facebook.com/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"><id>aad232-c2cc-42ca-ac1e-e1d1b4dd55de</id><title<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_h84_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>3. Contract Signed<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Amy, Jackson</d:LookupValue><d:Email>Amy.Jackson#doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>2.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>2. Active Discussion<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph #doe.com</d:Email></d:LookupValue><d:Email>Amy.Jackson#doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>1.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>1. Exploratory<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2019-07-15T10:20:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph #doe.com</d:Email>'
I want to extract all <d:VersionType>,<d:Name>,<d:Stage>,and <d:Modified m:type="Edm.DateTime">
Expected outputs:
d:VersionType d:Name d:Stage d:Modified m:type="Edm.DateTime"
3.0 XYZ Company 3. Contract 2020-07-30T12:15:04Z
2.0 XYZ Company 2. Contract 2020-02-15T18:15:60Z
1.0 XYZ Company 1. Exploratory 2019-07-15T10:20:04Z
Thanks in advance for your help!
Try using beautiful soup as it lets you parse xml, html and other documents. Such files are already in a specific structure, and you don't have to build a regex expression from scratch, making your job a lot easier.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
version_type = [item.text for item in soup.findAll('d:VersionType')] # gives ['3.0', '2.0', '1.0']
Replace d:VersionType with other elements you want (d:Name, d:Stage, ..) to extract their contents as well.

Parse data from .xml file and save it to .tsv file in Python

I have a dataset which looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
<Opinion target="place" category="RESTAURANT#GENERAL" polarity="negative" from="51" to="56"/>
</Opinions>
</sentence>
<sentence id="1004293:1">
<text>The food here is rather good, but only if you like to wait for it.</text>
<Opinions>
<Opinion target="food" category="FOOD#QUALITY" polarity="positive" from="4" to="8"/>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
...
How can I parse the data from this .xml file to .tsv file in the following format:
["negative", "Judging from previous posts this used to be a good place, but not any longer.", "RESTAURANT#GENERAL"]
["positive", "The food here is rather good, but only if you like to wait for it.","FOOD#QUALITY"]
["negative", "The food here is rather good, but only if you like to wait for it.","SERVICE#GENERAL"]
Thanks!
You can use elementtree package of python to get your desired output. Below is the code which will print your list. You can create a tsv by replacing the print and writing to a tsv file.
The sample.xml file must be present in the same directory where this code is present.
from xml.etree import ElementTree
file = 'sample.xml'
tree = ElementTree.parse(file)
root = tree.getroot()
for sentence in root.iter('sentence'):
# Loop all sentence in the xml
for opinion in sentence.iter('Opinion'):
# Loop all Opinion of a particular sentence.
print([opinion.attrib['polarity'], sentence.find('text').text, opinion.attrib['category']])
Output:
['negative', 'Judging from previous posts this used to be a good place, but not any longer.', 'RESTAURANT#GENERAL']
['positive', 'The food here is rather good, but only if you like to wait for it.', 'FOOD#QUALITY']
['negative', 'The food here is rather good, but only if you like to wait for it.', 'SERVICE#GENERAL']
sample.xml contains:
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
<Opinion target="place" category="RESTAURANT#GENERAL" polarity="negative" from="51" to="56"/>
</Opinions>
</sentence>
<sentence id="1004293:1">
<text>The food here is rather good, but only if you like to wait for it.</text>
<Opinions>
<Opinion target="food" category="FOOD#QUALITY" polarity="positive" from="4" to="8"/>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
</sentences>
</Review>

Parsing XML attribute with namespace python3

I have looked at the other question over Parsing XML with namespace in Python via 'ElementTree' and reviewed the xml.etree.ElementTree documentation. The issue I'm having is admittedly similar so feel free to tag this as duplicate, but I can't figure it out.
The line of code I'm having issues with is
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
My code is as follows:
import xml.etree.cElementTree as ET
tree = ET.parse('../../external_data/rss.xml')
root = tree.getroot()
instance_title = root.find('channel/title').text
instance_link = root.find('channel/link').text
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
instance_description = root.find('channel/description').text
instance_language = root.find('channel/language').text
instance_pubDate = root.find('channel/pubDate').text
instance_lastBuildDate = root.find('channel/lastBuildDate').text
The XML file:
<?xml version="1.0" encoding="windows-1252"?>
<rss version="2.0">
<channel>
<title>Filings containing financial statements tagged using the US GAAP or IFRS taxonomies.</title>
<link>http://www.example.com</link>
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
<description>This is a list of up to 200 of the latest filings containing financial statements tagged using the US GAAP or IFRS taxonomies, updated every 10 minutes.</description>
<language>en-us</language>
<pubDate>Mon, 20 Nov 2017 20:20:45 EST</pubDate>
<lastBuildDate>Mon, 20 Nov 2017 20:20:45 EST</lastBuildDate>
....
The attributes I'm trying to retrieve are in line 6; so 'href', 'type', etc.
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
Obviously, I've tried
instance_alink = root.find('{http://www.w3.org/2005/Atom}link').attrib
but that doesn't work cause it's type None. My thought is that it's looking for children but there are none. I can grab the attributes in the other lines in XML but not these for some reason. I've also played with ElementTree and lxml (but lxml won't load properly on Windows for whatever reason)
Any help is greatly appreciated cause the documentation seems sparse.
I was able to solve with
alink = root.find('channel/{http://www.w3.org/2005/Atom}link').attrib
the issue is that I was looking for the tag {http://www.w3.org/2005/Atom}link at the same level of <channel>, which, of course, didn't exist.

Resources