In lxml (Python 3), how to recursively collect all the linked ids - python-3.x

I have an XML document like this:
<library>
<content content-id="title001">
<content-links>
<content-link content-id="Number1" />
<content-link content-id="Number2" />
</content-links>
</content>
<content content-id="title002">
<content-links>
<content-link content-id="Number3" />
</content-links>
</content>
<content content-id="Number1">
<content-links>
<content-link content-id="Number1b" />
</content-links>
</content>
</library>
I need to get all the content-id values that are linked to specific content-id titles. For example, in this case I need all the ids linked from title001 (I might need more titles, so the input would be a list of titles to look up). All these ids should be added to a list that looks like:
[title001, Number1, Number2, Number1b]
So I guess I need to recursively check every content element: take the content-id from each content-link, go to the content with that id, check its content-link elements in turn, and so on until the whole XML has been read.
I have not been able to find a recursive solution to this.
Here is the code I have so far:
from lxml import etree as et
def get_ids(content):
    """Print the ids linked from `content`, following links recursively."""
    content_links = content.findall('content-links/content-link')
    print(content_links)
    if content_links:
        for content_link in content_links:
            print(content_link, content_link.get('content-id'))
            cl = content_link.get('content-id')
            # `x` is the parsed document root created in the main block
            cont = x.find(f'content[@content-id="{cl}"]')
            if cont is not None:
                get_ids(cont)
if __name__ == '__main__':
    x = et.fromstring(xml)  # `xml` holds the document shown above
    ids = ['title001']
    for id in ids:
        content = x.find(f'content[@content-id="{id}"]')
        get_ids(content)

Try the following code:
from lxml import etree as et
parser = et.XMLParser(remove_blank_text=True)
tree = et.parse('Input.xml', parser)
root = tree.getroot()
cidList = ['title001']    # Your source list
cidDct = { x: 0 for x in cidList }
for elem in root.iter('content'):
    cid = elem.attrib.get('content-id', '')
    # print(f'X {elem.tag:15} cid:{cid}')
    if cid in cidDct.keys():
        # print(f'** Found: {cid}')
        for elem2 in elem.iter():
            if elem2 is not elem:
                cid2 = elem2.attrib.get('content-id', '')
                # print(f'XX {elem2.tag:15} cid:{cid2}')
                if len(cid2) > 0:
                    # print(f'** Add: {cid2}')
                    cidDct[cid2] = 0
To test it, you can uncomment the printouts above.
Now when you print list(cidDct.keys()), you will get the wanted ids:
['title001', 'Number1', 'Number2', 'Number1b']
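The single pass above finds a linked content element only if it appears later in the document than the element that links to it. As a hedged alternative, here is a minimal sketch that first indexes every content element by id and then follows the links breadth-first, so document order does not matter. It uses an iterative queue rather than recursion, and the standard library parser so that it runs without extra installs; the calls used here (fromstring, iter, findall, get) are also available in lxml.etree. The name collect_linked_ids is made up for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import deque

XML = """<library>
  <content content-id="title001">
    <content-links>
      <content-link content-id="Number1" />
      <content-link content-id="Number2" />
    </content-links>
  </content>
  <content content-id="title002">
    <content-links>
      <content-link content-id="Number3" />
    </content-links>
  </content>
  <content content-id="Number1">
    <content-links>
      <content-link content-id="Number1b" />
    </content-links>
  </content>
</library>"""

def collect_linked_ids(root, start_ids):
    """Return start_ids plus every id reachable through content-link elements."""
    # Index each <content> element by its content-id for direct lookup.
    by_id = {c.get("content-id"): c for c in root.iter("content")}
    ordered, seen = [], set()
    queue = deque(start_ids)
    while queue:
        cid = queue.popleft()
        if cid in seen:          # protects against cycles and duplicate links
            continue
        seen.add(cid)
        ordered.append(cid)
        content = by_id.get(cid)
        if content is None:      # the id is linked but has no <content> of its own
            continue
        for link in content.findall("content-links/content-link"):
            queue.append(link.get("content-id"))
    return ordered

root = ET.fromstring(XML)
print(collect_linked_ids(root, ["title001"]))
# ['title001', 'Number1', 'Number2', 'Number1b']
```

Passing several titles, e.g. ["title001", "title002"], returns one combined list without duplicates.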

Related

How can I subscribe to the data from a publisher, modify it, and publish it again? [ROS - Python]

First, I want to capture pose values by subscribing to a turtle driven by teleop_key. Then I want to change these captured values and publish them to a second turtle. The problem is that I couldn't capture the pose values as global variables, so I couldn't change them and publish the modified ones.
I think my code is almost finished, so I'll post all of it directly.
#!/usr/bin/env python3
from turtlesim.msg import Pose
from geometry_msgs.msg import Twist
import rospy as rp
global pos_l_x,pos_l_y,pos_l_z,pos_a_x,pos_a_y,pos_a_z
def pose_callback(msg):
    rp.loginfo("("+ str(msg.x) + "," + str(msg.y) + "," + str(msg.theta)+ ")")
    pos_l_x = msg.x
    pos_l_y = msg.y
    pos_a_z = msg.theta
if __name__ == '__main__':
    rp.init_node("turtle_inverse")
    while not rp.is_shutdown():
        sub = rp.Subscriber("/turtlesim1/turtle1/pose", Pose, callback= pose_callback)
        rate = rp.Rate(1)
        rp.loginfo("Node has been started")
        cmd = Twist()
        cmd.linear.x = -1*pos_l_x
        cmd.linear.y = -1*pos_l_y
        cmd.linear.z = 0
        cmd.angular.x = 0
        cmd.angular.y = 0
        cmd.angular.z = -1*pos_a_z
        pub = rp.Publisher("/turtlesim2/turtle1/cmd_vel", Twist, queue_size=10)
        try:
            pub.publish(cmd)
        except rp.ServiceException as e:
            rp.logwarn(e)
        rate.sleep()
    rp.spin()
And I set up the connection between turtle1 and turtle2 in the launch file below:
<?xml version="1.0"?>
<launch>
<group ns="turtlesim1">
<node pkg="turtlesim" type="turtlesim_node" name="turtle1">
<remap from="/turtle1/cmd_vel" to="vel_1"/>
</node>
<node pkg="turtlesim" type="turtle_teleop_key" name="Joyistic" output= "screen">
<remap from="/turtle1/cmd_vel" to="vel_1"/>
</node>
</group>
<group ns="turtlesim2">
<node pkg="turtlesim" type="turtlesim_node" name="turtle1">
</node>
</group>
<node pkg="turtlesim" type="mimic" name="mimic">
<remap from="input" to="turtlesim1/turtle1"/>
<remap from="output" to="turtlesim2/turtle1"/>
</node>
</launch>
And lastly, here is my package.xml:
<?xml version="1.0"?>
<package format="2">
<name>my_robot_controller</name>
<version>0.0.0</version>
<description>The my_robot_controller package</description>
<!-- One maintainer tag required, multiple allowed, one person per tag -->
<!-- Example: -->
<!-- <maintainer email="jane.doe@example.com">Jane Doe</maintainer> -->
<maintainer email="(I delete it for sharing)">enes</maintainer>
<!-- One license tag required, multiple allowed, one license per tag -->
<!-- Commonly used license strings: -->
<!-- BSD, MIT, Boost Software License, GPLv2, GPLv3, LGPLv2.1, LGPLv3 -->
<license>TODO</license>
<buildtool_depend>catkin</buildtool_depend>
<build_depend>rospy</build_depend>
<build_depend>turtlesim</build_depend>
<build_depend>geometry_msgs</build_depend>
<build_export_depend>rospy</build_export_depend>
<build_export_depend>turtlesim</build_export_depend>
<build_export_depend>geometry_msgs</build_export_depend>
<exec_depend>rospy</exec_depend>
<exec_depend>turtlesim</exec_depend>
<exec_depend>geometry_msgs</exec_depend>
<export>
<!-- Other tools can request additional information be placed here -->
</export>
</package>
Note: I work in a catkin workspace. The mistake can't be there, because I have run many other programs in it without trouble.
As one of the commenters pointed out, you need to declare your pos values as global inside your callback function. In Python, a variable must be declared global within the scope where it is assigned, i.e. the function scope. Otherwise the interpreter does not know you mean the global variable and simply creates a local one. Note that this applies only to assignment, so it is not needed where you merely read the values to publish them. Take the following example:
def pose_callback(msg):
    rp.loginfo("("+ str(msg.x) + "," + str(msg.y) + "," + str(msg.theta)+ ")")
    global pos_l_x, pos_l_y, pos_a_z
    pos_l_x = msg.x
    pos_l_y = msg.y
    pos_a_z = msg.theta
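The scoping rule can be checked without ROS. The following is a minimal sketch in plain Python; the names counter, set_without_global, set_with_global and read_only are made up for this illustration and are not part of the node above.

```python
counter = 0  # module-level (global) variable, initialized before any use

def set_without_global(value):
    # Assignment without a `global` declaration creates a new local name;
    # the module-level `counter` is left untouched.
    counter = value

def set_with_global(value):
    global counter   # the assignment below now targets the module-level name
    counter = value

def read_only():
    # Reading a global needs no declaration.
    return counter

set_without_global(5)
after_local = counter    # still 0
set_with_global(7)
after_global = counter   # now 7
print(after_local, after_global, read_only())  # 0 7 7
```

This mirrors the callback: without the global statement, pose_callback only fills local variables that vanish when it returns.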
As another note, this will most likely still break, because the global variables are not guaranteed to be assigned before they are first used. You should therefore initialize them at the very top of the file. Finally, you should not create the subscriber inside the main run loop; create it once, right after init_node.
It is done! Here is the working code:
#!/usr/bin/env python3
from turtlesim.msg import Pose
from geometry_msgs.msg import Twist
import rospy as rp
pos_l_x,pos_l_y,pos_l_z,pos_a_x,pos_a_y,pos_a_z = 0,0,0,0,0,0
def pose_callback(msg):
    rp.loginfo("("+ str(msg.linear.x) + "," + str(msg.linear.y) + "," + str(msg.angular.z)+ ")")
    global pos_l_x,pos_l_y,pos_l_z,pos_a_x,pos_a_y,pos_a_z
    pos_l_x = msg.linear.x
    pos_l_y = msg.linear.y
    pos_l_z = msg.linear.z
    pos_a_x = msg.angular.x
    pos_a_y = msg.angular.y
    pos_a_z = msg.angular.z
if __name__ == '__main__':
    rp.init_node("turtle_inverse")
    sub = rp.Subscriber("/turtlesim1/turtle1/cmd_vel", Twist, callback= pose_callback)
    rate = rp.Rate(1)
    rp.loginfo("Node has been started")
    while not rp.is_shutdown():
        cmd = Twist()
        cmd.linear.x = -1*pos_l_x
        cmd.linear.y = -1*pos_l_y
        cmd.linear.z = -1*pos_l_z
        cmd.angular.x = -1*pos_a_x
        cmd.angular.y = -1*pos_a_y
        cmd.angular.z = -1*pos_a_z
        pub = rp.Publisher("/turtlesim2/turtle1/cmd_vel", Twist, queue_size=10)
        try:
            pub.publish(cmd)
        except rp.ServiceException as e:
            pass
        pos_l_x,pos_l_y,pos_l_z,pos_a_x,pos_a_y,pos_a_z = 0,0,0,0,0,0
        rate.sleep()
    rp.spin()

Find path to the node using ElementTree

With ElementTree, I can print every occurrence of a specific tag (in my case ExpertSettingsSg):
#!/usr/bin/env python3
import xml.etree.ElementTree as ET
root = ET.parse('mydoc.xml').getroot()
for children in root:
    value = children.findall('.//ExpertSettingsSg')  # tag I'm looking for
    for settings in value:
        if settings.text is not None:
            print(settings.text)
But I couldn't find a way to print the path of each occurrence. Because my XML file has many levels, and ExpertSettingsSg can appear at almost any level, I need to know where each ExpertSettingsSg comes from. I'm looking for something like
Path to config xxxxxx = /root/xxx/aaaa/bbbb
If it's not possible with ElementTree, does any other library do the trick?
Thanks
If you already have the nodes, you can walk the tree and collect paths (borrowing the example from @valdi-bo):
from xml.etree import ElementTree as ET
txt ='''<main>
<x>
<a>
<ExpertSettingsSg id="1">x1</ExpertSettingsSg>
</a>
<b>
<dummy>xxxx</dummy>
</b>
</x>
<y>
<c>
<dummy>xxxx</dummy>
</c>
<d>
<ExpertSettingsSg id="2">x2</ExpertSettingsSg>
</d>
<e>
<ExpertSettingsSg id="3"/>
</e>
</y>
</main>'''
def node_walk(root: ET.Element):
    path_to_node = []
    node_stack = [root]
    while node_stack:
        node = node_stack[-1]
        if path_to_node and node is path_to_node[-1]:
            path_to_node.pop()
            node_stack.pop()
            yield (path_to_node, node)
        else:
            path_to_node.append(node)
            for child in reversed(node):
                node_stack.append(child)
root = ET.ElementTree(ET.fromstring(txt))
for node in root.findall('.//ExpertSettingsSg'):
    for node_path, n in node_walk(root.getroot()):
        if n is node:
            xpath = "/".join(["."] + [n.tag for n in node_path[1:]] + [n.tag])
            print(xpath, node)
            # NOTE: Assert is to just show that the xpath is correct.
            assert root.getroot().find(xpath) == node
You would get output like this:
./x/a/ExpertSettingsSg <Element 'ExpertSettingsSg' at 0x102cf5b80>
./y/d/ExpertSettingsSg <Element 'ExpertSettingsSg' at 0x102cf5db0>
./y/e/ExpertSettingsSg <Element 'ExpertSettingsSg' at 0x102cf5e50>
Instead of walking multiple times, we can walk once and collect all relevant nodes with path, like this:
xpaths = []
for node_path, n in node_walk(root.getroot()):
    if n.tag == "ExpertSettingsSg":
        xpath = "/".join(["."] + [n.tag for n in node_path[1:]] + [n.tag])
        xpaths.append(xpath)
for xpath in xpaths:
    node = root.getroot().find(xpath)
    print(xpath, node)
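If absolute paths in the form asked for in the question (/root/xxx/aaaa/bbbb) are enough, a recursive generator may be a simpler option than the explicit stack. This is only a sketch: iter_paths is a hypothetical helper name, the sample document is a shortened version of the one above, and the printed line format is an assumption based on the question.

```python
import xml.etree.ElementTree as ET

def iter_paths(elem, prefix=""):
    """Yield (path, element) for elem and every descendant, depth-first."""
    path = f"{prefix}/{elem.tag}"
    yield path, elem
    for child in elem:
        # Pass the current path down so each child extends it by its own tag.
        yield from iter_paths(child, path)

doc = ET.fromstring(
    "<main>"
    "<x><a><ExpertSettingsSg>x1</ExpertSettingsSg></a></x>"
    "<y><d><ExpertSettingsSg>x2</ExpertSettingsSg></d></y>"
    "</main>"
)

found = [(path, elem.text) for path, elem in iter_paths(doc)
         if elem.tag == "ExpertSettingsSg"]
for path, text in found:
    print(f"Path to config {text} = {path}")
```

Unlike the ./x/a/... form above, these paths include the root tag, so they are meant for display rather than for passing back to find().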

Getting a neighboring element in an XML file with Python ElementTree

I have a big problem managing data in XML files in Python. I need the value in the ValorConta1 tag, but I only have the value of NumeroConta, which is a child of PlanoConta.
<InfoFinaDFin>
<NumeroIdentificadorInfoFinaDFin>15501</NumeroIdentificadorInfoFinaDFin>
...
<PlanoConta>
<NumeroConta>2.02.01</NumeroConta>
</PlanoConta>
...
<ValorConta1>300</ValorConta1>
The code I wrote:
import xml.etree.ElementTree as ET
InfoDin = ET.parse('arquivos_xml/InfoFinaDFin.xml')
target_element_value = '2.01.01'
passivo = InfoDin.findall('.//PlanoConta[NumeroConta="' + target_element_value +'"]/../ValorConta1')
Try this.
from simplified_scrapy import SimplifiedDoc
html = '''
<InfoFinaDFin>
<NumeroIdentificadorInfoFinaDFin>15501</NumeroIdentificadorInfoFinaDFin>
...
<PlanoConta>
<NumeroConta>2.02.01</NumeroConta>
</PlanoConta>
...
<ValorConta1>300</ValorConta1>
</InfoFinaDFin>
'''
doc = SimplifiedDoc(html)
# print (doc.select('PlanoConta>NumeroConta>text()'))
# print (doc.select('ValorConta1>text()'))
ele = doc.NumeroConta.parent.getNext('ValorConta1')
# or
ele = doc.getElementByTag('ValorConta1',start='</NumeroConta>')
print (ele.text)
Result:
300
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
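For readers who prefer to stay with the standard library, here is a hedged sketch of the same lookup using only xml.etree.ElementTree. The helper name sibling_value is made up for this illustration, and the "..." placeholders from the sample are left out because their content is unknown. Note also that the code in the question searches for '2.01.01' while the sample contains '2.02.01', so that search would find nothing in this sample.

```python
import xml.etree.ElementTree as ET

XML = """<InfoFinaDFin>
  <NumeroIdentificadorInfoFinaDFin>15501</NumeroIdentificadorInfoFinaDFin>
  <PlanoConta>
    <NumeroConta>2.02.01</NumeroConta>
  </PlanoConta>
  <ValorConta1>300</ValorConta1>
</InfoFinaDFin>"""

def sibling_value(root, account, sibling_tag="ValorConta1"):
    """Return the text of sibling_tag next to the PlanoConta with this account."""
    # Visit every element as a potential parent of a matching PlanoConta.
    for parent in root.iter():
        for plano in parent.findall("PlanoConta"):
            if plano.findtext("NumeroConta") == account:
                # The wanted value is a child of the same parent.
                return parent.findtext(sibling_tag)
    return None

root = ET.fromstring(XML)
print(sibling_value(root, "2.02.01"))  # 300
print(sibling_value(root, "2.01.01"))  # None, no such account in the sample
```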

Modify Specific xml tags with iterparse

I'm working with open map data and need to be able to update specific tags based on their values. I have been able to read the tags and even print the specific tags that need to be updated to the console, but I have not been able to get them to update.
I am using ElementTree and lxml. Specifically: if the first word of the addr:street tag is a cardinal direction (i.e. North, South, East, West) and the last word of the addr:housenumber tag is NOT a cardinal direction, take the first word from the addr:street tag and move it to be the last word of the addr:housenumber tag.
Edited based on questions below.
Initially I was just calling the code with:
clean_data(OUTPUT_FILE)
I didn't realize that iterparse can't be used to print directly from within the method (which I believe is what you're saying). I had code from a different part of the project that I used earlier, so I adapted what you wrote to what I had before. Here's what I have:
Earlier in the file:
import xml.etree.cElementTree as ET
from collections import defaultdict
import pprint
import re
import codecs
import json
OSM_FILE = "Utah County Map.osm"
OUTPUT_FILE = "Utah County Extract.osm"
JSON_FILE = "JSON MAP DATA.json"
The code in this section of the project:
def clean_data(osm_file, tags = ('node', 'way')):
    context = iter(ET.iterparse(osm_file, events=('end',)))
    for event, elem in context:
        if elem.tag == 'node':
            streetTag, street = getVal(elem, 'addr:street')
            if street is None: # No "street"
                continue
            first_word = getWord(street, True)
            houseTag, houseNo = getVal(elem, 'addr:housenumber')
            if houseNo is None: # No "housenumber"
                continue
            last_word = getWord(houseNo, False)
            if first_word in direct_list and last_word not in direct_list:
                streetTag.attrib['v'] = street[len(first_word) + 1:]
                houseTag.attrib['v'] = houseNo + ' ' + first_word
for i, element in enumerate(clean_data(OUTPUT_FILE)):
    print(ET.tostring(context.root, encoding='unicode', pretty_print=True, with_tail=False))
When I run this right now, I get an error:
TypeError: 'NoneType' object is not iterable
I tried adding in the output code I used earlier for another section of the project, but received the same error. Here's that code for reference as well. (Output file in this code refers to the output of the first stage of data cleaning where I removed multiple invalid nodes).
with open(CLEAN_DATA, 'w') as output:
    output.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    output.write('<osm>\n ')
    for i, element in enumerate(clean_data(OUTPUT_FILE)):
        output.write(ET.tostring(element, encoding='unicode'))
    output.write('</osm>')
The initial edit was in response to Valdi_bo's question below. Here is a sample from my XML file for reference. Yes, I am using both ElementTree and lxml, since lxml seems to be a subset of ElementTree. Some of the functions I've called earlier in the program have only worked with one or the other, so I'm using both.
<?xml version="1.0" encoding="UTF-8"?>
<osm>
<node changeset="24687880" id="356682074" lat="40.2799548" lon="-111.6457549" timestamp="2014-08-11T20:33:35Z" uid="2253787" user="1000hikes" version="2">
<tag k="addr:city" v="Provo" />
<tag k="addr:housenumber" v="3570" />
<tag k="addr:postcode" v="84604" />
<tag k="addr:street" v="Timpview Drive" />
<tag k="building" v="school" />
<tag k="ele" v="1463" />
<tag k="gnis:county_id" v="049" />
<tag k="gnis:created" v="02/25/1989" />
<tag k="gnis:feature_id" v="1449106" />
<tag k="gnis:state_id" v="49" />
<tag k="name" v="Timpview High School" />
<tag k="operator" v="Provo School District" />
</node>
<node changeset="58421729" id="356685655" lat="40.2414325" lon="-111.6678877" timestamp="2018-04-25T20:23:33Z" uid="360392" user="maxerickson" version="4">
<tag k="addr:city" v="Provo" />
<tag k="addr:housenumber" v="585" />
<tag k="addr:postcode" v="84601" />
<tag k="addr:street" v="North 500 West" />
<tag k="amenity" v="doctors" />
<tag k="gnis:feature_id" v="2432255" />
<tag k="healthcare" v="doctor" />
<tag k="healthcare:speciality" v="gynecology;obstetrics" />
<tag k="name" v="Valley Obstetrics &amp; Gynecology" />
<tag k="old_name" v="Healthsouth Provo Surgical Center" />
<tag k="phone" v="+1 801 374 1801" />
<tag k="website" v="http://valleyobgynutah.com/location/provo-office-2/" />
</node>
</osm>
In this example the first node would remain unchanged. In the second block the addr:housenumber tag should be changed from 585 to 585 North and the addr:street tag should be changed from North 500 West to 500 West.
Try the following code:
Functions / global variables:
import re
def getVal(nd, kVal):
    '''
    Get data from "tag" child node with required "k" attribute
    Parameters:
    nd - "starting" node,
    kVal - value of "k" attribute.
    Results:
    - the tag found,
    - its "v" attribute
    '''
    tg = nd.find(f'tag[@k="{kVal}"]')
    if tg is None:
        return (None, None)
    return (tg, tg.attrib.get('v'))
def getWord(txt, first):
    '''
    Get first / last word from "txt"
    '''
    pat = r'^\S+' if first else r'\S+$'
    mtch = re.search(pat, txt)
    return mtch.group() if mtch else ''
direct_list = ["N", "N.", "No", "North", "S", "S.",
    "So", "South", "E", "E.", "East", "W", "W.", "West"]
And the main code:
for nd in tree.iter('node'):
    streetTag, street = getVal(nd, 'addr:street')
    if street is None: # No "street"
        continue
    first_word = getWord(street, True)
    houseTag, houseNo = getVal(nd, 'addr:housenumber')
    if houseNo is None: # No "housenumber"
        continue
    last_word = getWord(houseNo, False)
    if first_word in direct_list and last_word not in direct_list:
        streetTag.attrib['v'] = street[len(first_word) + 1:]
        houseTag.attrib['v'] = houseNo + ' ' + first_word
I assume that tree variable holds the entire XML tree.
Edit following the comment as of 22:36:33Z
My code also works in a loop based on iterparse.
Prepare, for example, an input.xml file with some root tag and a couple of node elements inside. Then try the following code (with the necessary imports, functions and global variables presented above):
context = iter(etree.iterparse('input.xml', events=('end',)))
for event, elem in context:
    if elem.tag == 'node':
        streetTag, street = getVal(elem, 'addr:street')
        if street is None: # No "street"
            continue
        first_word = getWord(street, True)
        houseTag, houseNo = getVal(elem, 'addr:housenumber')
        if houseNo is None: # No "housenumber"
            continue
        last_word = getWord(houseNo, False)
        if first_word in direct_list and last_word not in direct_list:
            streetTag.attrib['v'] = street[len(first_word) + 1:]
            houseTag.attrib['v'] = houseNo + ' ' + first_word
Since iterparse is asked for end events only, you don't even need and event == 'end' in the first if.
Nor do you need the initial _, root = next(context) from your code, because context.root points to the whole XML tree.
And now, having the constructed XML tree, you can print it, to see the result:
print(etree.tostring(context.root, encoding='unicode', pretty_print=True,
    with_tail=False))
Notes:
The above code has been written without yielding anything, but it builds a full XML tree, updated according to your needs.
As the task is to construct an XML tree, this code does not clear anything. Calls to clear are needed only when you:
- retrieve some data from processed elements and save it elsewhere,
- don't need these elements any more.
Now you can rework the above code into a "yielding" variant and use it in your environment (you didn't provide any details on how your code sample is called).
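The TypeError in the question arises because clean_data contains no yield or return, so calling it gives None, which enumerate cannot iterate. Below is a minimal sketch of a "yielding" variant. It is an illustration under several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, it reads from an in-memory sample instead of the OSM file, tag_with_key is a hypothetical helper, and a street consisting only of a direction word is left unchanged.

```python
import io
import xml.etree.ElementTree as ET

DIRECTIONS = {"N", "N.", "No", "North", "S", "S.", "So", "South",
              "E", "E.", "East", "W", "W.", "West"}

# Shortened version of the sample from the question.
SAMPLE = b"""<osm>
<node id="356682074">
<tag k="addr:housenumber" v="3570" />
<tag k="addr:street" v="Timpview Drive" />
</node>
<node id="356685655">
<tag k="addr:housenumber" v="585" />
<tag k="addr:street" v="North 500 West" />
</node>
</osm>"""

def tag_with_key(node, key):
    """Return the <tag> child whose k attribute equals key, or None."""
    for tag in node.findall("tag"):
        if tag.get("k") == key:
            return tag
    return None

def clean_data(source, tags=("node", "way")):
    """Yield each element in `tags` after moving a leading direction word
    from addr:street to the end of addr:housenumber."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in tags:
            continue
        street = tag_with_key(elem, "addr:street")
        house = tag_with_key(elem, "addr:housenumber")
        if street is not None and house is not None:
            first, _sep, rest = street.get("v", "").partition(" ")
            words = house.get("v", "").split()
            last = words[-1] if words else ""
            # `rest` must be non-empty so the street name is never emptied.
            if first in DIRECTIONS and last not in DIRECTIONS and rest:
                street.set("v", rest)
                house.set("v", f"{house.get('v')} {first}")
        yield elem  # the caller can now iterate over the cleaned elements

result = {
    elem.get("id"): {t.get("k"): t.get("v") for t in elem.findall("tag")}
    for elem in clean_data(io.BytesIO(SAMPLE))
}
print(result)
```

Because the function now yields elements, a loop such as for i, element in enumerate(clean_data(...)) has something to iterate over, and each element can be serialized with ET.tostring(element, encoding='unicode').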

IndexError: list index out of range in Django-Python application

I have a problem with a function that iterates over an array. Here is my function:
def create_new_product():
    tree = ET.parse('products.xml')
    root = tree.getroot()
    array = []
    appointments = root.getchildren()
    for appointment in appointments:
        appt_children = appointment.getchildren()
        array.clear()
        for appt_child in appt_children:
            temp = appt_child.text
            array.append(temp)
        new_product = Product(
            product_name = array[0],
            product_desc = array[1]
        )
        new_product.save()
    return new_product
When I call the function, it saves 2 products to the database but raises an error on the third one. This is the error:
product_name = array[0],
IndexError: list index out of range
Here is also the xml file. I only copied the first 3 products from xml. There are almost 2700 products in the xml file.
<?xml version="1.0" encoding="UTF-8"?>
<Products>
<Product>
<product_name>Example 1</product_name>
<product_desc>EX101</product_desc>
</Product>
<Product>
<product_name>Example 2</product_name>
<product_desc>EX102</product_desc>
</Product>
<Product>
<product_name>Example 3</product_name>
</Product>
</Products>
I don't understand why I am getting this error because it already works for the first two products in the xml file.
I have run a minimal version of your code on python 3 (I assume it's 3 since you use array.clear()):
import xml.etree.ElementTree as ET
def create_new_product():
    tree = ET.parse('./products.xml')
    root = tree.getroot()
    array = []
    # list(elem) replaces getchildren(), which was removed in Python 3.9
    appointments = list(root)
    for appointment in appointments:
        appt_children = list(appointment)
        array.clear()
        # skip this element and log a warning
        if len(appt_children) != 2:
            print ('Warning : skipping element since it has less children than 2')
            continue
        for appt_child in appt_children:
            temp = appt_child.text
            array.append(temp)
        _arg={
            'product_name' : array[0],
            'product_desc' : array[1]
        }
        print(_arg)
create_new_product()
Output :
{'product_name': 'Example 1', 'product_desc': 'EX101'}
{'product_name': 'Example 2', 'product_desc': 'EX102'}
Warning : skipping element since it has less children than 2
Edit: OP has found that the products sometimes contain fewer children than expected. I added a check on the number of elements.
"List index out of range" is raised only when the requested position in the list does not exist, so array[0] doesn't actually exist at that point. Maybe try posting your XML file and we'll see if there's an error there.
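As a further option, the children can be looked up by tag name instead of by position, which avoids the index error and does not depend on the order of the child elements. This is a sketch under assumptions: read_products is a hypothetical helper, the XML is the sample from the question held in a string, and the Django Product model is replaced by plain dictionaries so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

XML = """<Products>
  <Product>
    <product_name>Example 1</product_name>
    <product_desc>EX101</product_desc>
  </Product>
  <Product>
    <product_name>Example 2</product_name>
    <product_desc>EX102</product_desc>
  </Product>
  <Product>
    <product_name>Example 3</product_name>
  </Product>
</Products>"""

def read_products(root):
    """Return (complete products, number of skipped incomplete products)."""
    products, skipped = [], 0
    for product in root.findall("Product"):
        # findtext returns None when the child element is missing.
        name = product.findtext("product_name")
        desc = product.findtext("product_desc")
        if name is None or desc is None:
            skipped += 1
            continue
        products.append({"product_name": name, "product_desc": desc})
    return products, skipped

products, skipped = read_products(ET.fromstring(XML))
print(products)
print("skipped:", skipped)
```

In the Django code, each dictionary could be passed to the model as Product(**fields), assuming the model's field names match the dictionary keys.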

Resources