XML parsing with ElementTree - python-3.x

I was wondering if it's possible to use the existing text in a tag to get the text of the next tag in the XML tree, given the following XML file:
...
<link>
<description>document</description>
<url>https://www.../doc/file.pdf</url>
</link>
<link>
<description>document1</description>
<url>https://www.../doc1/file1.pdf</url>
</link>
<link>
<description>document2</description>
<url>https://www.../doc2/file2.pdf</url>
</link>
...
for item in tree.findall('.//subChapter//document//link//'):
    if item.tag == 'description':
        if item.text == 'document':
            **THEN GET THE TEXT ON THE NEXT TAG <url>...</url>**
            **e.g: https://www.../doc/file.pdf**
            print(NEXT TAG)
        elif item.text == 'document1':
            **THEN GET THE TEXT ON THE NEXT TAG <url>...</url>**
            **e.g: https://www.../doc1/file1.pdf**
            print(NEXT TAG)
        elif item.text == 'document2':
            **THEN GET THE TEXT ON THE NEXT TAG <url>...</url>**
            **e.g: https://www.../doc2/file2.pdf**
            print(NEXT TAG)
Thank you!

With the lxml parser this would be doable with the getnext() method. With ElementTree you can achieve the same by restructuring the loop:
# iterate over link elements
for link in tree.findall('.//subChapter//document/link'):
    # keep a reference to the link's child elements
    children = list(link)
    for item in children:
        if item.tag == 'description':
            if item.text == 'document':
                # access the necessary link child by index
                next_tag = children[1]
                print(next_tag.text)
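For reference, a minimal self-contained sketch of the same idea that uses link.find('url') instead of a positional index (the wrapping subChapter/document elements and the example.com URLs are assumptions, since the original file is only shown in part):

import xml.etree.ElementTree as ET

xml_data = """
<subChapter>
  <document>
    <link>
      <description>document</description>
      <url>https://www.example.com/doc/file.pdf</url>
    </link>
    <link>
      <description>document1</description>
      <url>https://www.example.com/doc1/file1.pdf</url>
    </link>
  </document>
</subChapter>
"""

root = ET.fromstring(xml_data)
for link in root.findall('.//document/link'):
    description = link.find('description')
    url = link.find('url')
    if description is not None and description.text == 'document1':
        # print the sibling <url> of the matching <description>
        print(url.text)  # https://www.example.com/doc1/file1.pdf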

Related

filtering out elements found with beautiful soup based on a key word in any attribute

Here is an example of a URL.
import re
import requests
from bs4 import BeautifulSoup

url = 'https://rapaxray.com'
# the headers definition is not shown in the question; any realistic User-Agent works
headers = {'User-Agent': 'Mozilla/5.0'}

# logo
html_content = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_content, "lxml")
images_found = soup.findAll('img', {'src': re.compile(r'(jpe?g)|(png)|(svg)$')})
images_found
First I'm narrowing down the list of elements to the ones containing jpg, png or svg in the src attribute. In this case I only get 3 elements. Then I would like to filter those elements down to only the ones that have the key word 'logo' in ANY attribute.
The element I'm looking for in this example looks like this:
<img alt="Radiology Associates, P.A." class="attachment-full size-full astra-logo-svg" loading="lazy" src="https://rapaxray.com/wp-content/uploads/2019/09/RAPA100.svg"/>
I want to filter this element out of all the elements, based on the condition that it has the key word 'logo' in ANY of its attributes.
The challenge is that:
I have thousands of URLs, and the key word 'logo' could be in a different attribute for different URLs.
The logic if 'logo' in any(attribute for attribute in list_of_possible_attributes_that_this_element_has) doesn't work the same way as a list comprehension, because I couldn't find a way to access every possible attribute without using its specific name.
Checking all specific names is also problematic, because a particular attribute could exist in one element but not in another, which throws an error.
The case above is extra challenging because the attribute value is a list, so we would need to flatten it to check whether the key word is in it.
For most of the URLs the element I'm looking for is not returned as the top one like in this example, so picking the first result is not an option.
Is there a way of filtering elements based on a key word in ANY of their attributes, without prior knowledge of what the attribute's name is?
If I understood you correctly, you could use a filter function similar to this answer to search for all tags such that any tag attribute's value contains val:
def my_filter(tag, val):
    types = ['.jpg', '.jpeg', '.svg', '.png']
    if tag is not None and tag.name == "img" and tag.has_attr("src"):
        if all(y not in tag['src'] for y in types):
            return False
        for key in tag.attrs.keys():
            if isinstance(tag[key], list):
                if any(val in entry for entry in tag[key]):
                    return True
            else:
                if val in tag[key]:
                    return True
    return False

res = soup.find_all(lambda tag: my_filter(tag, "logo"))
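For illustration, a small usage sketch with made-up markup (the tags and URLs below are invented for the example; my_filter is the function above):

from bs4 import BeautifulSoup

html = '''
<div>
  <img class="attachment-full astra-logo-svg" src="https://example.com/uploads/RAPA100.svg"/>
  <img class="gallery-item" src="https://example.com/uploads/photo.jpg"/>
</div>
'''
soup = BeautifulSoup(html, "lxml")
res = soup.find_all(lambda tag: my_filter(tag, "logo"))
print(res)  # only the first <img>: "logo" appears in its class value "astra-logo-svg"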

Parsing XML file to Nested DICT

I'm trying to save data from an XML file into a nested dict. In my XML file, shown below, I have multiple tags called DOCUMENT, and nested inside each of them I have a variable number of tags called LINK. Then, inside the links, I have some URLs inside ADDRESS tags.
<document>
<description>blah, blah, blah</description>
<link>
<description>Document1</description>
<address>url 1</address>
</link>
<link>
<description>Document23</description>
<address>url 2</address>
</link>
<link>
<description>Document43</description>
<address>url 3</address>
</link>
<regNum>201801289307</regNum>
<order>3</order>
<seqNum>24447778</seqNum>
<codType>6</codType>
<descType>Blah</descType>
</document>
I have created a dict like this:
op = {}
op['doc_dict'] = {"descriDoc":[], "orderDoc":[], "seqNum":[], "codType":[], "descType":[]}
op['doc_dict']['link_dict'] = {"seqNum":[], "linkUrl":[]}
I would like to build a DICT where I can match each URL inside the LINK tags to its parent DOCUMENT using the value inside the seqNum tag:
{'doc_dict': {'descriDoc': ["blah, blah, blah"], 'orderDoc': ["4"], 'seqNum': ["24447779"],
'codType': ["6"], 'descType': ["Blah1"],
'link_dict': {'seqNum': ["24447779"], 'linkUrl': ["url 5", "url 7", "url 9"]}}}
Any idea on how to get the above DICT would be great. All my approaches failed.
Cheers,
I have used a list comprehension and solved it myself:
import xml.etree.ElementTree as ET

def edicao(filename):
    op = []
    tree = ET.parse(filename)  # read in the XML
    for item in tree.iter(tag='document'):
        doc = {}
        doc["descriDoc"] = item.find('description').text
        doc["orderDoc"] = item.find('order').text
        doc["seqNum"] = item.find('seqNum').text
        doc["links"] = [{'seqNum': item.find('seqNum').text,
                         'descricaoDoc': e.find('description').text,
                         'url': e.find('address').text} for e in item.findall('link')]
        op.append(doc)
    return op
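For instance, calling the function on a file that contains the <document> snippet above, wrapped in a single root element (ET.parse needs one; the file name here is just an illustration), would return roughly:

docs = edicao('documents.xml')
# docs[0] would look roughly like:
# {'descriDoc': 'blah, blah, blah',
#  'orderDoc': '3',
#  'seqNum': '24447778',
#  'links': [{'seqNum': '24447778', 'descricaoDoc': 'Document1', 'url': 'url 1'},
#            {'seqNum': '24447778', 'descricaoDoc': 'Document23', 'url': 'url 2'},
#            {'seqNum': '24447778', 'descricaoDoc': 'Document43', 'url': 'url 3'}]}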
Cheers,

How to iterate over WebElements and get a new WebElement in Robot Framework

I am trying to get the href attribute from an HTML list using Robot Framework keywords. For example, suppose the HTML code:
<ul class="my-list">
  <li class="my-listitem"><a href="...">...</a></li>
  ...
  <li class="my-listitem"><a href="...">...</a></li>
</ul>
I have tried to use the keywords Get WebElement, Get WebElements and a FOR loop without success. How can I do it?
This is my MWE:
*** Test Cases ***
    @{a tags} =    Create List
    @{href attr} =    Create List
    @{li items} =    Get WebElements    class:my-listitem
    FOR    ${li}    IN    @{li items}
        ${a tag} =    Get WebElement    tag:a
        Append To List    @{a tags}    ${a tag}
    END
    FOR    ${a tag}    IN    @{a tags}
        ${attr} =    Get Element Attribute    css:my-listitem    href
        Append To List    @{href attr}    ${attr}
    END
Thanks in advance.
The href is an attribute of the a elements, not the li, thus you need to target them. Get a reference for all such elements, and then get their href in the loop:
${the a-s}=    Get WebElements    xpath=//li[@class='my-listitem']/a    # by targeting the correct element, the list is a reference to all such "a" elements
${all href}=    Create List
FOR    ${el}    IN    @{the a-s}    # loop over each of them
    ${value}=    Get Element Attribute    ${el}    href    # get the individual href
    Append To List    ${all href}    ${value}    # and store it in a result list
END
Log To Console    ${all href}
Here is a possible solution (not tested):
@{my_list}=    Get WebElements    xpath=//li[@class='my-listitem']/a
FOR    ${element}    IN    @{my_list}
    ${attr}=    Get Element Attribute    ${element}    href
    Log    ${attr}    html=True
END

I am doing some web scraping and during that I've run into an error which states that "'NoneType' object is not subscriptable"

I am using bs4 for web scraping. This is the HTML code that I am scraping.
items is a list of these multiple div tags, i.e. <div class="list_item odd" itemscope=""...>
The tag that I really want from each element in items is:
<p class="cert-runtime-genre">
<img title="R" alt="Certificate R" class="absmiddle certimage" src="https://m...>
<time datetime="PT119M">119 min</time>
-
<span>Drama</span>
<span class="ghost">|</span>
<span>War</span>
</p>
The main class of this list is saved in items. From that I want to scrape the img tag and then access the title attribute so that I can save all the certifications of the movies (i.e. R, PG, etc.) in a database. But when I apply a loop to items it gives an error that the item is not subscriptable. I tried list comprehensions, a simple for loop, and indexing items through a predefined integer array; nothing works and it still gives the same error. (items is not None and is subscriptable, i.e. it is a list.) But when I index it directly, e.g. items[0] or items[1], it works fine and gives the correct result for each corresponding element in the items list. The error line is below:
cert = [item.find(class_ = "absmiddle certimage")["title"] for item in items] or
cert = [item.find("img",{"class": "absmiddle certimage"})["title"] for item in items]
and this is what works fine: cert = items[0].find(class_ = "absmiddle certimage")["title"]
Any suggestion will be appreciated.
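The message typically means that find() returned None for at least one element of items (some entries apparently lack the img), and subscripting None with ["title"] fails; the direct items[0]/items[1] calls only happened to hit elements that do contain the image. A hedged sketch that skips such elements (the names are taken from the question):

cert = []
for item in items:
    img = item.find("img", {"class": "absmiddle certimage"})
    if img is not None and img.has_attr("title"):
        cert.append(img["title"])

# equivalent comprehension over the intermediate find() results
imgs = (item.find("img", {"class": "absmiddle certimage"}) for item in items)
cert = [img["title"] for img in imgs if img is not None]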

How to get href values from a class - Python - Selenium

<a class="link__f5415c25" href="/profiles/people/1515754-andrea-jung" title="Andrea Jung">
I have the above HTML element and tried using
driver.find_elements_by_class_name('link__f5415c25')
and
driver.get_attribute('href')
but it doesn't work at all. I expected to extract the href values.
How can I do that? Thanks!
You have to first locate the element, then retrieve the attribute href, like so:
href = driver.find_element_by_class_name('link__f5415c25').get_attribute('href')
If there are multiple links associated with that class name, you can try something like:
eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))

for href in hrefList:
    print(href)
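The same collection step can also be written as a list comprehension; a minimal equivalent sketch:

hrefList = [e.get_attribute('href') for e in driver.find_elements_by_class_name('link__f5415c25')]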
