How to access xml text of child? - python-3.x

I have the following xml file (taken from here:
<BioSampleSet>
<BioSample submission_date="2011-12-01T13:31:02.367" last_update="2014-11-08T01:40:24.717" publication_date="2012-02-16T10:49:52.970" access="public" id="761094" accession="SAMN00761094">
<Ids>
</Ids>
<Package display_name="Generic">Generic.1.0</Package>
<Attributes>
<Attribute attribute_name="Individual">PK314</Attribute>
<Attribute attribute_name="condition">healthy</Attribute>
<Attribute attribute_name="BioSampleModel">Generic</Attribute>
</Attributes>
<Status status="live" when="2014-11-08T00:27:24"/>
</BioSample>
</BioSampleSet>
I need to access the text of each Attribute element under Attributes, i.e. the text that follows the attribute_name attribute.
I managed to access the values of attribute_name:
from Bio import Entrez,SeqIO
Entrez.email = '#'
import xml.etree.ElementTree as ET
handle = Entrez.efetch(db="biosample", id="SAMN00761094", retmode="xml", rettype="full")
tree = ET.parse(handle)
root = tree.getroot()  # root is used below, so it has to be taken from the parsed tree
for attr in root[0].iter('Attribute'):
    name = attr.get('attribute_name')
    print(name)
This prints:
Individual
condition
BioSampleModel
How do I create a dict of the values of attribute_name and the text next to it?
My desired output is
attributes = {'Individual': PK314, 'condition': healthy, 'BioSampleModel': Generic}

Based strictly on the xml sample in the question, try something along these lines:
bio = """[your xml sample]"""
doc = ET.fromstring(bio)
attributes = {}
for item in doc.findall('.//Attributes//Attribute'):
    attributes[item.attrib['attribute_name']] = item.text
attributes
Output:
{'Individual': 'PK314', 'condition': 'healthy', 'BioSampleModel': 'Generic'}
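The same mapping can also be written as a dict comprehension. The following is a minimal, self-contained sketch that assumes a trimmed copy of the XML sample from the question; with the Entrez handle, ET.parse(handle).getroot() would presumably take the place of ET.fromstring.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question (only the parts needed here).
BIO = """<BioSampleSet>
<BioSample id="761094" accession="SAMN00761094">
<Attributes>
<Attribute attribute_name="Individual">PK314</Attribute>
<Attribute attribute_name="condition">healthy</Attribute>
<Attribute attribute_name="BioSampleModel">Generic</Attribute>
</Attributes>
</BioSample>
</BioSampleSet>"""

root = ET.fromstring(BIO)

# Map each attribute_name value to the text content of its element.
attributes = {attr.get('attribute_name'): attr.text
              for attr in root.iter('Attribute')}
print(attributes)
```

If a file contains several BioSample elements, running the comprehension once per BioSample keeps their attributes apart.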

Related

DataType Conversion from number to integer in pandas while writing to XML file

I'm trying to convert the csv file to the desired output XML file.
Input CSV
Id,SubID,Rank,Size
1,123,1,0.1
1,234,2,0.2
2,123,1,0.1
2,456,2,0.2
Output XML File
<AA_ITEMS>
<Id ID="1">
<SubId ID="123.0">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="234.0">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
<Id ID="2">
<SubId ID="456.0">
<Rank>1</Rank>
<Size>0.1</Size>
</SubId>
<SubId ID="123.0">
<Rank>2</Rank>
<Size>0.2</Size>
</SubId>
</Id>
</AA_ITEMS>
My Code Snippet to achieve the above
root = ET.Element('AA_ITEMS')
for o_id, df_group in df.groupby('ID'):
    coupon_codes = ET.Element('ID', {'ID': str(o_id)})
    for index,row in df_group.iterrows():
        substituted_code = ET.Element('SubID', {'ID': str(row['SubID'])})
        rank_code = ET.Element('RANK')
        rank_code.text = str(row['RANK'])
        pack_size_code = ET.Element('REL_PACK_SIZE')
        pack_size_code.text = str(row['REL_PACK_SIZE'])
        substituted_code.append(rank_code)
        substituted_code.append(pack_size_code)
        coupon_codes.append(substituted_code)
    root.append(coupon_codes)
min_xml = ET.tostring(root, encoding='utf8')
with open('/tmp/{}'.format(self.blob_name.split('.')[0]+'.xml'), "wb") as file:
    file.write(min_xml)
However, the SubID value is written as a float ("123.0", "234.0", "456.0"), whereas it should match the input file, i.e. an integer ("123", "234", "456") with no decimal part.
Please advise on how to achieve this in my code.
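A likely cause, assuming the DataFrame was read with default dtypes: iterrows() returns each row as a Series with a single dtype, so the integer SubID is upcast to float alongside the float size column. One way around it is to normalise the value just before writing it. The sketch below uses a plain dict as a stand-in for such a row, and format_id is a hypothetical helper, not part of pandas or ElementTree.

```python
import xml.etree.ElementTree as ET

def format_id(value):
    """Render whole-number floats such as 123.0 as '123'; return str(value) otherwise."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return str(value)
    if number.is_integer():
        return str(int(number))
    return str(value)

# Stand-in for a row produced by iterrows(), where SubID has been upcast to float.
row = {'SubID': 123.0, 'Rank': 1.0, 'Size': 0.1}

sub = ET.Element('SubId', {'ID': format_id(row['SubID'])})
ET.SubElement(sub, 'Rank').text = format_id(row['Rank'])
ET.SubElement(sub, 'Size').text = format_id(row['Size'])

xml_text = ET.tostring(sub, encoding='unicode')
print(xml_text)
```

In the loop from the question, this would mean calling format_id(row['SubID']) where str(row['SubID']) is used now. A plain int(row['SubID']) should also work as long as the column never holds missing values.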

In lxml (Python 3), how to recursively get all the linked ids

I have an xml like this:
<library>
<content content-id="title001">
<content-links>
<content-link content-id="Number1" />
<content-link content-id="Number2" />
</content-links>
</content>
<content content-id="title002">
<content-links>
<content-link content-id="Number3" />
</content-links>
</content>
<content content-id="Number1">
<content-links>
<content-link content-id="Number1b" />
</content-links>
</content>
</library>
I would need to get all the content-id that are linked to specific content-id titles. For example, for this case I would need all the ids that are linked for title001 (I might need for more titles, so it would be a list of titles that need to be found). And all these ids be added to a list that would look like:
[title001, Number1, Number2, Number1b]
So I guess that I need to recursively check every content and then get the content-id from the content-link to go to the next content and check in this one all the content-link going to the next one until the xml is completely read.
I am not able to find the recursive solution to this.
Adding the code that I got until now for this:
from lxml import etree as et
def get_ids(content):
    """
    """
    content_links = content.findall('content-links/content-link')
    print(content_links)
    if content_links:
        for content_link in content_links:
            print(content_link,content_link.get('content-id'))
            cl = content_link.get('content-id')
            cont = x.find(f'content[@id="{cl}"]')
            if cont is not None:
                get_ids(cont)
if __name__ == '__main__':
    """
    """
    x = et.fromstring(xml)
    ids = ['title001']
    for id in ids:
        content = x.find(f'content[@id="{content-id}"]')
        get_ids(content)
Try the following code:
from lxml import etree as et
parser = et.XMLParser(remove_blank_text=True)
tree = et.parse('Input.xml', parser)
root = tree.getroot()
cidList = ['title001'] # Your source list
cidDct = { x: 0 for x in cidList }
for elem in root.iter('content'):
    cid = elem.attrib.get('content-id', '')
    # print(f'X {elem.tag:15} cid:{cid}')
    if cid in cidDct.keys():
        # print(f'** Found: {cid}')
        for elem2 in elem.iter():
            if elem2 is not elem:
                cid2 = elem2.attrib.get('content-id', '')
                # print(f'XX {elem2.tag:15} cid:{cid2}')
                if len(cid2) > 0:
                    # print(f'** Add: {cid2}')
                    cidDct[cid2] = 0
For testing, you may uncomment the printouts above.
Now when you print list(cidDct.keys()), you will get the wanted ids:
['title001', 'Number1', 'Number2', 'Number1b']
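The loop above makes a single pass in document order, so it finds a linked content element only if that element comes after the one that links to it. Below is a sketch of a breadth-first traversal that does not depend on element order. It uses the standard library's xml.etree.ElementTree instead of lxml to stay self-contained (findall and get behave the same way for these calls), and linked_ids is an illustrative name.

```python
import xml.etree.ElementTree as ET
from collections import deque

XML = """<library>
  <content content-id="title001">
    <content-links>
      <content-link content-id="Number1" />
      <content-link content-id="Number2" />
    </content-links>
  </content>
  <content content-id="title002">
    <content-links>
      <content-link content-id="Number3" />
    </content-links>
  </content>
  <content content-id="Number1">
    <content-links>
      <content-link content-id="Number1b" />
    </content-links>
  </content>
</library>"""

def linked_ids(root, start_ids):
    """Return start_ids plus every content-id reachable through content-link elements."""
    # Index the top-level <content> elements once so each lookup is a dict access.
    by_id = {c.get('content-id'): c for c in root.findall('content')}
    seen = list(start_ids)            # keeps discovery order
    queue = deque(start_ids)
    while queue:
        content = by_id.get(queue.popleft())
        if content is None:           # id is referenced but has no <content> of its own
            continue
        for link in content.findall('content-links/content-link'):
            cid = link.get('content-id')
            if cid and cid not in seen:   # this check also guards against cycles
                seen.append(cid)
                queue.append(cid)
    return seen

result = linked_ids(ET.fromstring(XML), ['title001'])
print(result)
```

Passing several titles in start_ids collects the links of all of them into one list.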

Modify Specific xml tags with iterparse

I'm working with open map data and need to be able to update specific tags based on their values. I have been able to read the tags and even print the specific tags that need to be updated to the console, but I have not been able to get them to update.
I am using ElementTree and lxml. Specifically: if the first word of the addr:street tag is a cardinal direction (i.e. North, South, East, West) and the last word of the addr:housenumber tag is NOT a cardinal direction, take the first word from the addr:street tag and move it to be the last word of the addr:housenumber tag.
Edited based on questions below.
Initially I was just calling the code with:
clean_data(OUTPUT_FILE)
I didn't realize that iterparse can't be used to print directly from within the method (which I believe is what you're saying). I had code from a different part of the project that I used earlier, so I adapted what you wrote to what I had before. Here's what I have:
Earlier in the file:
import xml.etree.cElementTree as ET
from collections import defaultdict
import pprint
import re
import codecs
import json
OSM_FILE = "Utah County Map.osm"
OUTPUT_FILE = "Utah County Extract.osm"
JSON_FILE = "JSON MAP DATA.json"
The code in this section of the project:
def clean_data(osm_file, tags = ('node', 'way')):
    context = iter(ET.iterparse(osm_file, events=('end',)))
    for event, elem in context:
        if elem.tag == 'node':
            streetTag, street = getVal(elem, 'addr:street')
            if street is None: # No "street"
                continue
            first_word = getWord(street, True)
            houseTag, houseNo = getVal(elem, 'addr:housenumber')
            if houseNo is None: # No "housenumber"
                continue
            last_word = getWord(houseNo, False)
            if first_word in direct_list and last_word not in direct_list:
                streetTag.attrib['v'] = street[len(first_word) + 1:]
                houseTag.attrib['v'] = houseNo + ' ' + first_word
for i, element in enumerate(clean_data(OUTPUT_FILE)):
    print(ET.tostring(context.root, encoding='unicode', pretty_print=True, with_tail=False))
When I run this now, I get an error:
TypeError: 'NoneType' object is not iterable
I tried adding in the output code I used earlier for another section of the project, but received the same error. Here's that code for reference as well. (Output file in this code refers to the output of the first stage of data cleaning where I removed multiple invalid nodes).
with open(CLEAN_DATA, 'w') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n ')
    for i, element in enumerate(clean_data(OUTPUT_FILE)):
        output.write(ET.tostring(element, encoding='unicode'))
    output.write('</osm>')
The initial edit was in response to Valdi_bo's question below. Here is a sample from my xml file for reference. Yes, I am using both ElementTree and lxml, since lxml seems to be a subset of ElementTree. Some of the functions I've called earlier in the program have only worked with one or the other, so I'm using both.
<?xml version="1.0" encoding="UTF-8"?>
<osm>
<node changeset="24687880" id="356682074" lat="40.2799548" lon="-111.6457549" timestamp="2014-08-11T20:33:35Z" uid="2253787" user="1000hikes" version="2">
<tag k="addr:city" v="Provo" />
<tag k="addr:housenumber" v="3570" />
<tag k="addr:postcode" v="84604" />
<tag k="addr:street" v="Timpview Drive" />
<tag k="building" v="school" />
<tag k="ele" v="1463" />
<tag k="gnis:county_id" v="049" />
<tag k="gnis:created" v="02/25/1989" />
<tag k="gnis:feature_id" v="1449106" />
<tag k="gnis:state_id" v="49" />
<tag k="name" v="Timpview High School" />
<tag k="operator" v="Provo School District" />
</node>
<node changeset="58421729" id="356685655" lat="40.2414325" lon="-111.6678877" timestamp="2018-04-25T20:23:33Z" uid="360392" user="maxerickson" version="4">
<tag k="addr:city" v="Provo" />
<tag k="addr:housenumber" v="585" />
<tag k="addr:postcode" v="84601" />
<tag k="addr:street" v="North 500 West" />
<tag k="amenity" v="doctors" />
<tag k="gnis:feature_id" v="2432255" />
<tag k="healthcare" v="doctor" />
<tag k="healthcare:speciality" v="gynecology;obstetrics" />
<tag k="name" v="Valley Obstetrics &amp; Gynecology" />
<tag k="old_name" v="Healthsouth Provo Surgical Center" />
<tag k="phone" v="+1 801 374 1801" />
<tag k="website" v="http://valleyobgynutah.com/location/provo-office-2/" />
</node>
</osm>
In this example the first node would remain unchanged. In the second block the addr:housenumber tag should be changed from 585 to 585 North and the addr:street tag should be changed from North 500 West to 500 West.
Try the following code:
Functions / global variables:
def getVal(nd, kVal):
    '''
    Get data from "tag" child node with required "k" attribute
    Parameters:
    nd - "starting" node,
    kVal - value of "k" attribute.
    Results:
    - the tag found,
    - its "v" attribute
    '''
    tg = nd.find(f'tag[@k="{kVal}"]')
    if tg is None:
        return (None, None)
    return (tg, tg.attrib.get('v'))
def getWord(txt, first):
    '''
    Get first / last word from "txt"
    '''
    pat = r'^\S+' if first else r'\S+$'
    mtch = re.search(pat, txt)
    return mtch.group() if mtch else ''
# Note the comma after "N.": without it Python joins "N." and "No" into "N.No"
direct_list = ["N", "N.", "No", "North", "S", "S.",
    "So", "South", "E", "E.", "East", "W", "W.", "West"]
And the main code:
for nd in tree.iter('node'):
    streetTag, street = getVal(nd, 'addr:street')
    if street is None: # No "street"
        continue
    first_word = getWord(street, True)
    houseTag, houseNo = getVal(nd, 'addr:housenumber')
    if houseNo is None: # No "housenumber"
        continue
    last_word = getWord(houseNo, False)
    if first_word in direct_list and last_word not in direct_list:
        streetTag.attrib['v'] = street[len(first_word) + 1:]
        houseTag.attrib['v'] = houseNo + ' ' + first_word
I assume that tree variable holds the entire XML tree.
Edit following the comment as of 22:36:33Z
My code works also in a loop based on iterparse.
Prepare e.g. input.xml file with some root tag and a couple of
node elements inside. Then try the following code (with necessary imports,
functions and global variables presented above):
context = iter(etree.iterparse('input.xml', events=('end',)))
for event, elem in context:
    if elem.tag == 'node':
        streetTag, street = getVal(elem, 'addr:street')
        if street is None: # No "street"
            continue
        first_word = getWord(street, True)
        houseTag, houseNo = getVal(elem, 'addr:housenumber')
        if houseNo is None: # No "housenumber"
            continue
        last_word = getWord(houseNo, False)
        if first_word in direct_list and last_word not in direct_list:
            streetTag.attrib['v'] = street[len(first_word) + 1:]
            houseTag.attrib['v'] = houseNo + ' ' + first_word
As iterparse is set up here to report only end events, you don't even need
and event == 'end' in the first if.
You also don't need the initial _, root = next(context) from your code,
as context.root points to the whole XML tree.
And now, having the constructed XML tree, you can print it, to see the result:
print(etree.tostring(context.root, encoding='unicode', pretty_print=True,
with_tail=False))
Notes:
- The above code has been written without yielding anything, but it builds a full XML tree, updated according to your needs.
- As the task is to construct an XML tree, this code does not clear anything. Calls to clear are needed only when you retrieve some data from processed elements, save it elsewhere, and don't need these elements any more.
- Now you can rework the above code into a "yielding" variant and use it in your environment (you didn't provide any details of how your code sample is called).
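The TypeError in the question follows from clean_data having no return or yield: it returns None, and enumerate(None) cannot iterate over it. Below is a minimal sketch of a "yielding" variant, assuming the standard library's xml.etree.ElementTree rather than lxml. The snake_case names and the in-memory SAMPLE are illustrative, and unlike the loop above it yields every matching element, changed or not, so that the caller can write all of them out.

```python
import io
import re
import xml.etree.ElementTree as ET

DIRECTIONS = {"N", "N.", "No", "North", "S", "S.", "So", "South",
              "E", "E.", "East", "W", "W.", "West"}

def get_val(node, key):
    """Return (tag element, its 'v' value) for the child <tag k=key>, or (None, None)."""
    tag = node.find(f'tag[@k="{key}"]')
    return (None, None) if tag is None else (tag, tag.get('v'))

def get_word(text, first):
    """Return the first (or last) whitespace-delimited word of text, or ''."""
    match = re.search(r'^\S+' if first else r'\S+$', text)
    return match.group() if match else ''

def clean_data(source, tags=('node', 'way')):
    """Yield each element whose tag is in `tags`, after fixing its address in place."""
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag not in tags:
            continue
        street_tag, street = get_val(elem, 'addr:street')
        house_tag, house_no = get_val(elem, 'addr:housenumber')
        if street is not None and house_no is not None:
            first_word = get_word(street, True)
            last_word = get_word(house_no, False)
            if first_word in DIRECTIONS and last_word not in DIRECTIONS:
                # Move the leading direction from the street to the house number.
                street_tag.set('v', street[len(first_word) + 1:])
                house_tag.set('v', f'{house_no} {first_word}')
        yield elem

# Cut-down version of the second node from the question's sample.
SAMPLE = b"""<osm>
<node id="356685655">
<tag k="addr:housenumber" v="585" />
<tag k="addr:street" v="North 500 West" />
</node>
</osm>"""

cleaned = [ET.tostring(e, encoding='unicode') for e in clean_data(io.BytesIO(SAMPLE))]
print(cleaned[0])
```

With a generator like this, the enumerate(clean_data(...)) loop from the question has something to iterate over, and each yielded element can be passed to ET.tostring and written to the output file.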

XML Element Tree - appending to existing elements and attributes with ET.SubElement()?

I have the following function which builds up a re-usable XML SOAP envelope:
def get_xml_soap_envelope():
    """
    Returns a generically re-usable SOAP envelope in the following format:
    <soapenv:Envelope>
    <soapenv:Header/>
    <soapenv:Body />
    </soapenv:Envelope>
    """
    soapenvEnvelope = ET.Element('soapenv:Envelope')
    soapenvHeader = ET.SubElement(soapenvEnvelope, 'soapenv:Header')
    soapenvBody = ET.SubElement(soapenvEnvelope, 'soapenv:Body')
    return soapenvEnvelope
Fairly simple stuff so far.
I was wondering now, would it be possible to append attributes (such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance") to the soapenv:Envelope element?
And if I also wanted to append the following XML:
<urn:{AAction} soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<AUserName>{AUserName}</AUserName>
<APassword>{APassword}</APassword>
</urn:{AAction}>
To the soapenv:Body such that I would have something like this:
if __name__ == "__main__":
    soapenvEnvelope = get_xml_soap_envelope()
    actions = {
        'AAction': 'UserLogin',
    }
    soapAAction = ET.Element('urn:{AAction}'.format(**actions))
    soapenvEnvelope.AppendElement(soapAAction, 'soapenv:Body')
So, I could specify a target node and the Element to append to?
Let's start with the bad news: your function to create the SOAP envelope
(get_xml_soap_envelope) is wrong, as it fails to specify at least
xmlns:soapenv="...".
Actually, all other namespaces to be used should also be specified here.
A proper function creating the SOAP envelope should be something like this:
def get_xml_soap_env():
    """
    Returns a generically re-usable SOAP envelope in the following format:
    <soapenv:Envelope xmlns:soapenv="...", ...>
    <soapenv:Header/>
    <soapenv:Body />
    </soapenv:Envelope>
    """
    ns = {'xmlns:soapenv': 'http://schemas.xmlsoap.org/soap/envelope/',
        'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
        'xmlns:urn': 'http://dummy.urn'}
    env = ET.Element('soapenv:Envelope', ns)
    ET.SubElement(env, 'soapenv:Header')
    ET.SubElement(env, 'soapenv:Body')
    return env
Note that the ns dictionary also contains other namespaces that will be
needed later, among others the xsi namespace.
A possible alternative is to define ns outside of this function and pass it as
a parameter (your choice).
When I ran:
env = get_xml_soap_env()
print(ET.tostring(env, encoding='unicode', short_empty_elements=True))
the printout (reformatted by me for readability) was:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="http://dummy.urn"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Header />
<soapenv:Body />
</soapenv:Envelope>
Note that this time proper namespaces are included.
Then, to add the Action element and its children, define the following function:
def addAction(env, action, subelems):
    body = env.find('soapenv:Body')
    actn = ET.SubElement(body, f'soapenv:{action}')
    for k, v in subelems.items():
        child = ET.SubElement(actn, k)
        child.text = v
When I ran:
subelems = {'AUserName': 'Mark', 'APassword': 'Secret!'}
addAction(env, 'UserLogin', subelems)
and printed the whole XML tree again, the result was:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:urn="http://dummy.urn" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<soapenv:Header />
<soapenv:Body>
<soapenv:UserLogin>
<AUserName>Mark</AUserName>
<APassword>Secret!</APassword>
</soapenv:UserLogin>
</soapenv:Body>
</soapenv:Envelope>
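An alternative sketch, assuming the standard library's xml.etree.ElementTree: rather than writing prefixes into tag names by hand, register each prefix once and use the {namespace-uri}tag form, so that the serializer emits the xmlns declarations itself. The urn namespace URI is the placeholder from the answer above, and build_envelope and add_action are illustrative names. Here the action element is placed in the urn namespace with a soapenv:encodingStyle attribute, as in the XML the question asks for.

```python
import xml.etree.ElementTree as ET

SOAPENV = 'http://schemas.xmlsoap.org/soap/envelope/'
URN = 'http://dummy.urn'  # placeholder URI, not a real service namespace

# Tell the serializer which prefix to use for each namespace URI.
ET.register_namespace('soapenv', SOAPENV)
ET.register_namespace('urn', URN)

def build_envelope():
    """Return an Envelope element with empty Header and Body children."""
    env = ET.Element(f'{{{SOAPENV}}}Envelope')
    ET.SubElement(env, f'{{{SOAPENV}}}Header')
    ET.SubElement(env, f'{{{SOAPENV}}}Body')
    return env

def add_action(env, action, fields):
    """Append <urn:action> with one child per entry in fields to the Body."""
    body = env.find(f'{{{SOAPENV}}}Body')
    node = ET.SubElement(
        body,
        f'{{{URN}}}{action}',
        {f'{{{SOAPENV}}}encodingStyle': 'http://schemas.xmlsoap.org/soap/encoding/'},
    )
    for name, value in fields.items():
        ET.SubElement(node, name).text = value
    return node

env = build_envelope()
add_action(env, 'UserLogin', {'AUserName': 'Mark', 'APassword': 'Secret!'})
xml_text = ET.tostring(env, encoding='unicode')
print(xml_text)
```

Because the tags carry real namespace URIs, lookups such as env.find() also work with the same {uri}tag form, without relying on how a literal 'soapenv:Body' string happens to be matched.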

IndexError: list index out of range in Django-Python application

I have a problem with a function which has an iteration for an array. Here is my function;
def create_new_product():
    tree = ET.parse('products.xml')
    root = tree.getroot()
    array = []
    appointments = root.getchildren()
    for appointment in appointments:
        appt_children = appointment.getchildren()
        array.clear()
        for appt_child in appt_children:
            temp = appt_child.text
            array.append(temp)
        new_product = Product(
            product_name = array[0],
            product_desc = array[1]
        )
        new_product.save()
    return new_product
When I call the function, it saves 2 products into database but gives an error on third one. This is the error;
product_name = array[0],
IndexError: list index out of range
Here is also the xml file. I only copied the first 3 products from xml. There are almost 2700 products in the xml file.
<?xml version="1.0" encoding="UTF-8"?>
<Products>
<Product>
<product_name>Example 1</product_name>
<product_desc>EX101</product_desc>
</Product>
<Product>
<product_name>Example 2</product_name>
<product_desc>EX102</product_desc>
</Product>
<Product>
<product_name>Example 3</product_name>
</Product>
</Products>
I don't understand why I am getting this error because it already works for the first two products in the xml file.
I have run a minimal version of your code on python 3 (I assume it's 3 since you use array.clear()):
import xml.etree.ElementTree as ET
def create_new_product():
    tree = ET.parse('./products.xml')
    root = tree.getroot()
    array = []
    # getchildren() was removed in Python 3.9; list(element) gives the same children
    appointments = list(root)
    for appointment in appointments:
        appt_children = list(appointment)
        array.clear()
        # skip this element and log a warning
        if len(appt_children) != 2:
            print ('Warning : skipping element since it has less children than 2')
            continue
        for appt_child in appt_children:
            temp = appt_child.text
            array.append(temp)
        _arg={
            'product_name' : array[0],
            'product_desc' : array[1]
        }
        print(_arg)
create_new_product()
Output :
{'product_name': 'Example 1', 'product_desc': 'EX101'}
{'product_name': 'Example 2', 'product_desc': 'EX102'}
Warning : skipping element since it has less children than 2
Edit: OP has found that some products contain fewer children than expected, so I added a check on the number of child elements.
List index out of range is only raised when an index does not exist in the list, so array[0] doesn't actually exist for that product. Maybe try posting your XML file and we'll see if there's an error there.
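Another way to avoid the IndexError, sketched here under the assumption that a missing field should become None rather than cause the product to be skipped, is to look each child up by tag name instead of by position. read_products is an illustrative name and the XML is the sample from the question, without the XML declaration.

```python
import xml.etree.ElementTree as ET

XML = """<Products>
<Product>
<product_name>Example 1</product_name>
<product_desc>EX101</product_desc>
</Product>
<Product>
<product_name>Example 2</product_name>
<product_desc>EX102</product_desc>
</Product>
<Product>
<product_name>Example 3</product_name>
</Product>
</Products>"""

def read_products(root):
    """Return one dict per <Product>; a missing child element yields None."""
    products = []
    for product in root.iter('Product'):
        products.append({
            # findtext returns None when the child element is absent.
            'product_name': product.findtext('product_name'),
            'product_desc': product.findtext('product_desc'),
        })
    return products

products = read_products(ET.fromstring(XML))
for item in products:
    print(item)
```

Each dict could then be passed to the Product model from the question, for example as Product(**item), if the model fields accept None.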
