Parsing XML file to Nested DICT - python-3.x

I'm trying to save data from an XML file into a nested dict. In my XML file, shown below, I have multiple tags called DOCUMENT, and nested inside each one a variable number of tags called LINK. Inside the links there are URLs in ADDRESS tags:
<document>
  <description>blah, blah, blah</description>
  <link>
    <description>Document1</description>
    <address>url 1</address>
  </link>
  <link>
    <description>Document23</description>
    <address>url 2</address>
  </link>
  <link>
    <description>Document43</description>
    <address>url 3</address>
  </link>
  <regNum>201801289307</regNum>
  <order>3</order>
  <seqNum>24447778</seqNum>
  <codType>6</codType>
  <descType>Blah</descType>
</document>
I have created a dict like this:
op = {}
op['doc_dict'] = {"descriDoc":[], "orderDoc":[], "seqNum":[], "codType":[], "descType":[]}
op['doc_dict']['link_dict'] = {"seqNum":[], "linkUrl":[]}
I would like to end up with a dict where I can match each URL inside the LINK tags to its parent DOCUMENT using the value inside the seqNum tag:
{'doc_dict': {'descriDoc': ["blah, blah, blah"], 'orderDoc': ["4"], 'seqNum': ["24447779"],
'codType': ["6"], 'descType': ["Blah1"],
'link_dict': {'seqNum': ["24447779"], 'linkUrl': ["url 5", "url 7", "url 9"]}}}
Any idea on how to get the above DICT would be great. All my approaches failed.
Cheers,

I used a list comprehension and solved the question:
import xml.etree.ElementTree as ET

def edicao(filename):
    op = []
    tree = ET.parse(filename)  # read in the XML
    for item in tree.iter(tag='document'):
        doc = {}
        doc["descriDoc"] = item.find('description').text
        doc["orderDoc"] = item.find('order').text
        doc["seqNum"] = item.find('seqNum').text
        # one dict per <link>, each carrying the parent's seqNum
        doc["links"] = [{'seqNum': item.find('seqNum').text,
                         'descricaoDoc': e.find('description').text,
                         'url': e.find('address').text} for e in item.findall('link')]
        op.append(doc)
    return op
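For reference, calling it looks like this (a minimal sketch, assuming the XML above is saved as docs.xml; the filename is just an example):

docs = edicao('docs.xml')  # 'docs.xml' is a placeholder name
for doc in docs:
    print(doc['seqNum'], [link['url'] for link in doc['links']])
# 24447778 ['url 1', 'url 2', 'url 3']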
Cheers,

Related

filtering out elements found with beautiful soup based on a key word in any attribute

Here is an example of a URL:
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: any ordinary request header works here

url = 'https://rapaxray.com'
# logo
html_content = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_content, "lxml")
images_found = soup.findAll('img', {'src': re.compile(r'(jpe?g)|(png)|(svg)$')})
images_found
First I'm narrowing the list of elements down to the ones containing jpg, png or svg in the src attribute. In this case I only get 3 elements. Then I would like to filter those elements to show only the ones that have the keyword 'logo' in ANY attribute.
The element I'm looking for in this example looks like this:
'img alt="Radiology Associates, P.A." class="attachment-full size-full astra-logo-svg" loading="lazy" src="https://rapaxray.com/wp-content/uploads/2019/09/RAPA100.svg"/'
I want to pick this element out of all the elements, based on the condition that it has the keyword 'logo' in ANY of its attributes.
The challenge is that:
- I have thousands of URLs, and the keyword logo could sit in a different attribute for a different URL.
- The logic if 'logo' in any(attribute for attribute in list_of_possible_attributes_that_this_element_has) doesn't work the way a list comprehension does, because I couldn't find a way to access every possible attribute without using its specific name.
- Checking all the specific names is also problematic, because a particular attribute can exist in one element but not in another, which throws an error.
- The case above is extra challenging because the attribute value is a list, so we would need to flatten it to be able to check whether the keyword is in it.
- For most of the URLs the element I'm looking for is not returned as the top one like in this example, so picking the top one is not an option.
Is there a way of filtering out elements based on a key word in ANY of its attributes? (without prior knowledge of what the name of the attribute is?).
If I understood you correctly, you could use a filter function similar to this answer to search for all tags such that any tag attribute's value contains val:
def my_filter(tag, val):
    types = ['.jpg', '.jpeg', '.svg', '.png']
    if tag is not None and tag.name == "img" and tag.has_attr("src"):
        # skip images whose src is not one of the expected file types
        if all(y not in tag['src'] for y in types):
            return False
        for key in tag.attrs.keys():
            # multi-valued attributes (e.g. class) come back as lists
            if isinstance(tag[key], list):
                if any(val in entry for entry in tag[key]):
                    return True
            else:
                if val in tag[key]:
                    return True
    return False

res = soup.find_all(lambda tag: my_filter(tag, "logo"))
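To sanity-check the filter, here is a minimal sketch against a tiny hand-made snippet (the markup and URLs are made up for illustration):

from bs4 import BeautifulSoup

html = '''
<img class="astra-logo-svg" src="https://example.com/logo.svg"/>
<img class="photo" src="https://example.com/team.jpg"/>
'''
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(lambda tag: my_filter(tag, "logo")))
# only the first <img> is returned: "logo" appears in its class (and src)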

Python lxml.etree: how to add 'xml:lang="en-US"' as a namespace

I am trying to create an XML document whose root element is:
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
</speak>
I am able to add the first attributes with...
from lxml.etree import Element, SubElement, QName, tostring

root = Element('speak', version="1.0",
               xmlns="http://www.w3.org/2001/10/synthesis")
...but not the namespaced attribute xml:lang="en-US". Based on several tutorials/questions like this and this, I tried many solutions, but none worked.
For example, I tried this :
class XMLNamespaces:
    xml = 'http://www.w3.org/2001/10/synthesis'

root.attrib[QName(XMLNamespaces.xml, 'lang')] = "en-US"
But the output is
<speak xmlns:ns0="http://www.w3.org/2001/10/synthesis" version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ns0:lang="en-US">
How can I add xml:lang="en-US" to the root element of my XML?
The special xml: prefix is associated with the http://www.w3.org/XML/1998/namespace URI.
The following code adds xml:lang="en-US" to the root element:
root.attrib[QName("http://www.w3.org/XML/1998/namespace", "lang")] = "en-US"
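Putting it together, here is a minimal sketch that also registers the default SSML namespace via nsmap, so no artificial ns0 prefix shows up (namespace URIs are taken from the question):

from lxml.etree import Element, QName, tostring

SSML_NS = 'http://www.w3.org/2001/10/synthesis'
XML_NS = 'http://www.w3.org/XML/1998/namespace'  # reserved URI behind the xml: prefix

root = Element(QName(SSML_NS, 'speak'), version='1.0', nsmap={None: SSML_NS})
root.attrib[QName(XML_NS, 'lang')] = 'en-US'

print(tostring(root, pretty_print=True).decode())
# <speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.0" xml:lang="en-US"/>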

XML parsing with ElementTree

I was wondering if it's possible to use the existing text in a tag to get the text of the next tag in the XML tree, considering the following XML file:
...
<link>
  <description>document</description>
  <url>https://www.../doc/file.pdf</url>
</link>
<link>
  <description>document1</description>
  <url>https://www.../doc1/file1.pdf</url>
</link>
<link>
  <description>document2</description>
  <url>https://www.../doc2/file2.pdf</url>
</link>
...
for item in tree.findall('.//subChapter//document//link//'):
    if item.tag == 'description':
        if item.text == 'document':
            # THEN GET THE TEXT ON THE NEXT TAG <url>...</url>
            # e.g. https://www.../doc/file.pdf
            print(NEXT_TAG)
        elif item.text == 'document1':
            # THEN GET THE TEXT ON THE NEXT TAG <url>...</url>
            # e.g. https://www.../doc1/file1.pdf
            print(NEXT_TAG)
        elif item.text == 'document2':
            # THEN GET THE TEXT ON THE NEXT TAG <url>...</url>
            # e.g. https://www.../doc2/file2.pdf
            print(NEXT_TAG)
Thank you!
With the lxml parser this is doable with the getnext() function. With plain ElementTree it can be achieved by changing the loop:
# iterate over the link elements
for link in tree.findall('.//subChapter//document/link'):
    # keep a reference to the link's child elements
    children = list(link)
    for item in children:
        if item.tag == 'description':
            if item.text == 'document':
                # access the necessary link child by index
                next_tag = children[1]
                print(next_tag.text)
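For completeness, a rough sketch of the lxml route mentioned above (assuming the same file and structure; getnext() returns the element's next sibling, here the <url> right after <description>):

from lxml import etree

tree = etree.parse(filename)  # 'filename' stands in for your XML file
for desc in tree.findall('.//subChapter//document/link/description'):
    if desc.text == 'document':
        print(desc.getnext().text)  # text of the sibling <url> element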

Cannot get siblings from the start of an XPath query

I have the following structure of the XML file:
<body>
<content>
<text type="header">Header 1</text>
</content>
<content>
<text type="text">Text 1</text>
</content>
</body>
I need to get the first content tag and then, using that content element, get its sibling content tag in Node.js with the select function from the Node.js xpath npm package.
What I am trying to do (allDocumentString is a string representation of the whole XML file):
const headerContentTag = select("//content/text[@type = 'header']", allDocumentString);
const textContentTag = select('//following-sibling::content', headerContentTagString);
But it does not work.
I need to get exactly the first content tag and then, depending on that tag, the last one.
I know I can get the second tag on its own, without the first one, but I need the first one.
Strictly speaking, what you need is:
var xpath = require('xpath');
var dom = require('xmldom').DOMParser;

var doc = new dom().parseFromString(xml);
var nodes = xpath.select("//body/content[1][./text[@type='header']]/following-sibling::content[last()]", doc);
We first look for the first content element that is a child of body and contains a text element with a type='header' attribute. Then we get its last following sibling.

How to get href values from a class - Python - Selenium

<a class="link__f5415c25" href="/profiles/people/1515754-andrea-jung" title="Andrea Jung">
I have the above HTML element and tried using
driver.find_elements_by_class_name('link__f5415c25')
and
driver.get_attribute('href')
but it doesn't work at all. I expected to extract the values in href.
How can I do that? Thanks!
You have to locate the element first, then retrieve the href attribute, like so:
href = driver.find_element_by_class_name('link__f5415c25').get_attribute('href')
If there are multiple links associated with that class name, you can try something like:
eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))
for href in hrefList:
    print(href)
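As a side note, the find_element(s)_by_* helpers are deprecated in Selenium 4; a rough equivalent there (same class name assumed) would be:

from selenium.webdriver.common.by import By

eList = driver.find_elements(By.CLASS_NAME, 'link__f5415c25')
for e in eList:
    print(e.get_attribute('href'))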
