Proper equality of GPathResults - groovy

I need to traverse an XML document and distinguish elements based on their parents. I use Groovy and XmlSlurper.
I know that GPathResult implements equals() as equality of text() nodes only. Sadly, that's not usable in my case.
Comparing via is() seems pointless, since you get a new result object every time. I'm new to Groovy, so I don't feel like overriding the equals() method.
In this case I'd like to distinguish between those elements by their parent(). Let's say I have the GPathResult for element 'b' stored in a variable. How can I get the particular element 'a' that has that stored element 'b' as its NEAREST parent?
def xml = ''' <root>
<a type="1"/>
<a type="2"/>
<b>
<a type="1"/>
</b>
</root>
'''.trim()
def slurper = new XmlSlurper(false, false).parseText(xml)
def myParticularB = slurper.b
def wantedA = slurper.depthFirst().find { seg ->
    seg.name() == 'a' && seg.@type == '1' && seg.parent() == myParticularB
}
assert (wantedA.parent().name() == 'b') == true
I'm sorry if I overlooked something obvious.
//A corner case
<root>
<a type="1"/>
<a type="2"/>
<b>
<a type="1"/>
<b>
<a type="1"/>
<b>
<a type="1"/>
</b>
</b>
</b>
</root>

Related

How to get data from a tag if it's present in HTML else Empty String if the tag is not present in web scraping Python

Picture contains HTML code for the situation
case 1:
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
case 2:
<li>
<a> some text: </a>
</li>
I want to scrape the identifier value when it is present, and use an empty string when there is no identifier.
I am using Scrapy, but a BeautifulSoup solution would help as well. I would really appreciate your help.
It's a little unclear what exactly you want, because your screenshot differs slightly from the example in your question. I suppose you want to search for the text "some text:" and then get the following value inside <strong> (or an empty string if there isn't any):
from bs4 import BeautifulSoup
txt = '''
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
'''
soup = BeautifulSoup(txt, 'html.parser')
for t in soup.find_all(lambda t: t.contents[0].strip() == 'some text:'):
    identifier = t.parent.find('strong')
    identifier = identifier.get_text(strip=True) if identifier else ''
    print('Found:', identifier)
Prints:
Found: 'identifier:''random words'
Found:
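If every <li> should yield exactly one value, an alternative is to loop over the list items and look for <strong> inside each one. The following is only a minimal sketch under that assumption; the helper name extract_identifiers is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<li>
<a> some text: </a><strong> 'identifier:''random words' </strong>
</li>
<li>
<a> some text: </a>
</li>
"""


def extract_identifiers(markup):
    """Return one string per <li>: the <strong> text, or '' when it is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for li in soup.find_all("li"):
        strong = li.find("strong")  # None when the <li> has no <strong>
        results.append(strong.get_text(strip=True) if strong else "")
    return results


identifiers = extract_identifiers(html)
print(identifiers)
```

This keeps the result list aligned with the <li> elements, so a missing identifier shows up as an empty string at the matching position.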

How can I get texts with certain criteria in python with selenium? (texts with certain siblings)

This is a really tricky one for me, so I'll describe the question in as much detail as possible.
First, let me show you an example of the HTML.
....
....
<div class="lawcon">
<p>
<span class="b1">
<label> No.1 </label>
</span>
</p>
<p>
"I want to get the 'No.1' label in the span if the div[@class='lawcon'] has an <a> tag with the "bb" title and the string 'Law' in its text."
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Law Power</a>
</p>
</div>
<div class="lawcon">
<p>
<span class="b1">
<label> No.2 </label>
</p>
<p>
"But I don't want to get the No.2 label because, although it has an <a> tag with the "bb" title, its text doesn't contain 'Law'."
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Just Power</a>
</p>
</div>
<div class="lawcon">
<p>
<span class="b1">
<label> No.3 </label>
</p>
<p>
"If there are multiple <a> tags with the right criteria in a single div, I want to get span(No.3) for each of those" <a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Lawyer</a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">By the Law</a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">But not this one</a>
...
...
...
So, here is the thing. I want to extract the label text (e.g. No.1) in div[@class='lawcon'] only if the div has an <a> tag with the "bb" title whose text contains the string 'Law'.
If the div contains no tag with the "bb" title, or none with the string "Law" in it, the span should not be collected.
What I tried was
div_list = [div.text for div in driver.find_elements_by_xpath('//span[following-sibling::a[@title="bb"]]')]
But the problem is that when a single div has multiple tags matching the criteria, it returns only one div.
What I want is a list (or tuple) pairing each matching tag's text with its location (the span number).
So it should be like
[[No.1 - Law Power], [No.3 - Lawyer], [No.3 - By the Law]]
I'm not sure I have explained enough. Thank you for your interest; I hope you can enlighten me with your knowledge! I really appreciate it in advance.
Here is a simple Python script to get your desired output.
links = driver.find_elements_by_xpath("//a[@title='bb' and contains(.,'Law')]")
linkData = []
for link in links:
    currentList = []
    currentList.append(link.find_element_by_xpath("./ancestor::div[@class='lawcon']//label").text + '-' + link.text)
    linkData.append(currentList)
print(linkData)
Output:
[['No.1-Law Power'], ['No.3-Lawyer'], ['No.3-By the Law']]
I am not sure why you want the output in that format. I would prefer the approach below, so that you know how many divs have matching links and can then access the links from the output per div. Just a thought.
divs = driver.find_elements_by_xpath("//a[@title='bb' and contains(.,'Law')]//ancestor::div[@class='lawcon']")
linkData = []
for div in divs:
    currentList = []
    for link in div.find_elements_by_xpath(".//a[@title='bb' and contains(.,'Law')]"):
        currentList.append(div.find_element_by_xpath(".//label").text + '-' + link.text)
    linkData.append(currentList)
print(linkData)
Output:
[['No.1-Law Power'], ['No.3-Lawyer', 'No.3-By the Law']]
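The XPath expressions themselves can be checked without a browser. The sketch below runs the same link and ancestor lookups with lxml against a static, cleaned-up copy of the markup. Using lxml here is an assumption made purely for illustration; the unclosed <span> tags from the question are closed and the onclick attributes dropped.

```python
from lxml import html

# Cleaned-up static copy of the markup from the question (assumption: tags closed).
page = """
<html><body>
<div class="lawcon">
  <p><span class="b1"><label> No.1 </label></span></p>
  <p><a title="bb" class="link" href="javascript:;">Law Power</a></p>
</div>
<div class="lawcon">
  <p><span class="b1"><label> No.2 </label></span></p>
  <p><a title="bb" class="link" href="javascript:;">Just Power</a></p>
</div>
<div class="lawcon">
  <p><span class="b1"><label> No.3 </label></span></p>
  <p>
    <a title="bb" class="link" href="javascript:;">Lawyer</a>
    <a title="bb" class="link" href="javascript:;">By the Law</a>
    <a title="bb" class="link" href="javascript:;">But not this one</a>
  </p>
</div>
</body></html>
"""

tree = html.fromstring(page)

pairs = []
# contains() is case-sensitive, so "But not this one" and "Just Power" are skipped.
for link in tree.xpath("//a[@title='bb' and contains(., 'Law')]"):
    # Walk up to the enclosing div, then down to its label.
    label = link.xpath("./ancestor::div[@class='lawcon']//label")[0]
    pairs.append([label.text_content().strip(), link.text_content().strip()])

print(pairs)
```

Each matching link produces its own pair, so a div with two matching links contributes two entries.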
As your requirement is to extract the texts No.1 and so on, which are within a <label> tag, you need to use WebDriverWait with visibility_of_all_elements_located(). You will get only 2 matches (against your expectation of 3), because the XPath returns each <label> once and No.3 is a single label. You can use the following locator strategy:
Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='lawcon']//a[@title='bb' and contains(.,'Law')]//preceding::label[1]")))])

Why does attribute splitting happen in BeautifulSoup?

I try to get the attribute of the parent element:
<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
print(span_autogoal.find_parent('div')['class'])
# print(span_autogoal.find_parent('div').get('class'))
Output:
<span class="note-name">(Autogoal)</span>
['detailMS__incidentRow', 'incidentRow--away', 'odd']
I know I can do something like this:
print(' '.join(span_autogoal.find_parent('div')['class']))
But I want to know why this is happening and whether there is a more correct way to do this.
The other answer is correct. However, if you want a multi-valued attribute returned as a single string, try re-parsing the parent element with the XML parser.
from bs4 import BeautifulSoup
data='''<div class="detailMS__incidentRow incidentRow--away odd">
<div class="time-box">45'</div>
<div class="icon-box soccer-ball-own"><span class="icon soccer-ball-own"> </span></div>
<span class=" note-name">(Autogoal)</span><span class="participant-name">
Reynaldo
</span>
</div>'''
soup=BeautifulSoup(data,'lxml')
span_autogoal = soup.find('span', class_='note-name')
print(span_autogoal)
parentdiv=span_autogoal.find_parent('div')
data=str(parentdiv)
soup=BeautifulSoup(data,'xml')
print(soup.div['class'])
Output on console:
<span class="note-name">(Autogoal)</span>
detailMS__incidentRow incidentRow--away odd
According to the BeautifulSoup documentation:
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is class (that is, a tag can have more than one
CSS class). Others include rel, rev, accept-charset, headers, and
accesskey. Beautiful Soup presents the value(s) of a multi-valued
attribute as a list:
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]
So in your case, in <div class="detailMS__incidentRow incidentRow--away odd">, the class attribute is multi-valued.
That's why span_autogoal.find_parent('div')['class'] gives you a list as output.
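Beautiful Soup also documents a constructor argument, multi_valued_attributes=None, that turns this splitting off. The sketch below compares the two behaviours on a shortened copy of the markup. It assumes a reasonably recent bs4 release, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

markup = (
    '<div class="detailMS__incidentRow incidentRow--away odd">'
    '<span class=" note-name">(Autogoal)</span>'
    '</div>'
)

# Default behaviour: class is treated as multi-valued and split into a list.
list_soup = BeautifulSoup(markup, "html.parser")
class_list = list_soup.find("span", class_="note-name").find_parent("div")["class"]

# With multi_valued_attributes=None the attribute stays a single string.
string_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
class_string = string_soup.div["class"]

print(class_list)
print(class_string)
```

Joining the list with ' '.join(...) as in the question gives the same string here, so that approach is fine when re-parsing is not desirable.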

conditional xpath statement

This is a piece of HTML from which I'd like to extract information:
<li>
<p><strong class="more-details-section-header">Provenance</strong></p>
<p>Galerie Max Hetzler, Berlin<br>Acquired from the above by the present owner</p>
</li>
I'd like an XPath expression that extracts the content of the 2nd <p> ... </p>, provided there is a preceding sibling <p> ... Provenance ... </p>.
This is what I have so far:
if "Provenance" in response.xpath('//strong[@class="more-details-section-header"]/text()').extract():
    print("provenance = yes")
But how do I get to Galerie Max Hetzler, Berlin<br>Acquired from the above by the present owner?
I tried
if "Provenance" in response.xpath('//strong[@class="more-details-section-header"]/text()').extract():
    print("provenance = yes ", response.xpath('//strong[@class="more-details-section-header"]/following-sibling::p').extract())
But I am getting []
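Your attempt returns [] because following-sibling::p looks for siblings of <strong>, while the second <p> is a sibling of the <p> that wraps <strong>. You should use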
//p[preceding-sibling::p[1]/strong='Provenance']/text()
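To try this expression outside Scrapy, here is a minimal sketch that evaluates it with lxml. Using lxml is an assumption for illustration, and the second <li> with "Condition" is a made-up entry added only to show that non-matching items are skipped.

```python
from lxml import html

snippet = """
<ul>
<li>
<p><strong class="more-details-section-header">Provenance</strong></p>
<p>Galerie Max Hetzler, Berlin<br>Acquired from the above by the present owner</p>
</li>
<li>
<p><strong class="more-details-section-header">Condition</strong></p>
<p>Not the paragraph we want</p>
</li>
</ul>
"""

tree = html.fromstring(snippet)

# Select text nodes of any <p> whose nearest preceding sibling <p>
# contains a <strong> with the text 'Provenance'.
lines = tree.xpath("//p[preceding-sibling::p[1]/strong='Provenance']/text()")
provenance = [line.strip() for line in lines if line.strip()]

print(provenance)
```

The <br> splits the paragraph into two text nodes, so the result is a list with one entry per line of the provenance.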

Nested GPath expressions with XmlSlurper and findAll

I'm trying to analyse an XML tree using XmlSlurper and GPath, and the behaviour of the findAll method confuses me.
Say, for example, that you have the following XML tree:
<html>
<body>
<ul>
<li class="odd"><span>Element 1</span></li>
<li class="even"><span>Element 2</span></li>
<li class="odd"><span>Element 3</span></li>
<li class="even"><span>Element 4</span></li>
<li class="odd"><span>Element 5</span></li>
</ul>
</body>
</html>
Assuming that xml has been initialised through one of XmlSlurper's parse methods, the following code executes as one would expect:
// Prints:
// odd
// odd
// odd
xml.body.ul.li.findAll {it.@class == 'odd'}.@class.each {println it.text()}
On the other hand:
// Doesn't print anything.
xml.body.ul.li.findAll {it.@class == 'odd'}.span.each {println it.text()}
I'm struggling to understand why I can use the special @ property (as well as others, such as **), but not 'normal' ones.
I've looked at the API code, and what confuses me even more is that the getProperty implementation (found in GPathResult) seems to support what I'm trying to do.
What am I missing?
You need to iterate over every span, so you can use the spread-dot operator:
xml.body.ul.li.findAll {it.@class == 'odd'}*.span.each {println it.text()}

Resources