python lxml xpath 1.0 : unique values for element's attribute - python-3.x

Here is a way to get unique values; it doesn't work when I want to get a unique attribute.
For example:
<a href = '11111'>sometext</a>
<a href = '11121'>sometext2</a>
<a href = '11111'>sometext3</a>
I want to get the unique hrefs, and I am restricted to XPath 1.0.
page_src.xpath('(//a[not(.=preceding::a)])')
page_src.xpath('//a/@href[not(.=preceding::a/@href)]')
Both return duplicates.
Is it possible to resolve this nightmare, given the absence of unique-values()?
UPD: it's not the function-like solution I wanted, but I wrote a Python function which iterates over parent elements and checks whether adding a parent tag filters the links down to the needed count.
Here is my example:
_x_item = (
    '//a[starts-with(@href, "%s")'
    ' and (not(@href="%s"))'
    ' and (not(starts-with(@href, "%s")))]'
    % (param1, param1, param2))
# rm double links
neededLinks = list(map(lambda vasa: vasa.get('href'), page_src.xpath(_x_item)))
if len(neededLinks) != len(set(neededLinks)):
    uniqLength = len(set(neededLinks))
    breakFlag = False
    for linkk in neededLinks:
        if neededLinks.count(linkk) > 1:
            dupLinks = page_src.xpath('//a[@href="%s"]' % (linkk))
            dupLinkParents = list(map(lambda vasa: vasa.getparent(), dupLinks))
            for dupParent in dupLinkParents:
                tempLinks = page_src.xpath(_x_item.replace('//', '//%s/' % (dupParent.tag)))
                tempLinks = list(map(lambda vasa: vasa.get('href'), tempLinks))
                if len(tempLinks) == len(set(neededLinks)):
                    breakFlag = True
                    _x_item = _x_item.replace('//', '//%s/' % (dupParent.tag))
                    break
        if breakFlag:
            break
This WILL work if the duplicate links have different parents but the same @href value.
As a result I will add a parent tag prefix, like //div/my_prev_x_item.
Plus, using Python, I can refine the result to //div[@key1="val1" and @key2="val2"]/my_prev_x_item by iterating over dupParent.items(). But this only works if the items are not located in the same parent element.
In the end I need only the XPath expression, so I can't just use list(set(myItems)).
I want an easier solution (like unique-values()), if it exists. Plus my solution does not work if the links' parent is the same.

You can extract all the hrefs and then find the unique ones:
all_hrefs = page_src.xpath('//a/#href')
unique_hrefs = list(set(all_hrefs))
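For instance, a minimal self-contained sketch using lxml.html and the example markup from the question. Note that set() loses document order; dict.fromkeys keeps the first occurrence of each href in order, which is usually what you want:

```python
from lxml import html

page_src = html.fromstring("""
<div>
  <a href="11111">sometext</a>
  <a href="11121">sometext2</a>
  <a href="11111">sometext3</a>
</div>
""")

all_hrefs = page_src.xpath('//a/@href')
# dict.fromkeys keeps the first occurrence of each href in document order
unique_hrefs = list(dict.fromkeys(all_hrefs))
print(unique_hrefs)  # ['11111', '11121']
```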

Related

How do I autopopulate a tkinter table using a loop

I'm trying to auto-populate a tkinter table with the names of folders in a directory and details about their properties
grpdata_name = listdir(r"PATH")
grpdata_path = r"PATH\{}".format(grpdata_name[0])
grpdata_groupcount = -1
for x in grpdata_name:
    grpdata_groupcount = grpdata_groupcount + 1
    grpdata_groupcurrent = 'grpdata_name{}{}{}'.format('[', grpdata_groupcount, ']')
    GUI_Table.insert(parent='', index='end', iid=0, text='',
                     values=('ID', grpdata_groupcurrent, 'TIME CREATED', 'TIME MODIFIED', 'DEVICES'))
My current method builds the element reference as a string, cycling through each part (grpdata_name[0], grpdata_name[1], etc.).
I can't figure out how to use the contents of grpdata_groupcurrent as a variable rather than a string.
This method isn't very efficient overall, so please let me know if there is a better way to do this.
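A sketch of a simpler approach, assuming GUI_Table is a ttk.Treeview as in the snippet above: enumerate supplies the index and the folder name directly, so there is no need to build a name like 'grpdata_name[0]' as a string and evaluate it (populate_table and its arguments are illustrative names):

```python
def populate_table(table, folder_names):
    # enumerate yields the index and the name directly, so no
    # string like 'grpdata_name[0]' has to be built and looked up
    for i, name in enumerate(folder_names):
        table.insert(parent='', index='end', iid=i, text='',
                     values=(i, name, 'TIME CREATED', 'TIME MODIFIED', 'DEVICES'))

# Usage (hypothetical): populate_table(GUI_Table, listdir(r"PATH"))
```

Using a distinct iid per row also fixes the duplicate-iid problem in the original loop, where every insert used iid=0.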

Selenium Python Iterate over li elements inside ol

I need to extract, using python and selenium, the text of multiple li elements, all of them, inside an ol element.
The list is like this:
/html/body/main/div/div/section/ol/li[3]/div/div/div[2]/div[1]/a[1]/h2
/html/body/main/div/div/section/ol/li[4]/div/div/div[2]/div[1]/a[1]/h2
/html/body/main/div/div/section/ol/li[5]/div/div/div[2]/div[1]/a[1]/h2
/html/body/main/div/div/section/ol/li[6]/div/div/div[2]/div[1]/a[1]/h2
...
So far, I'm using the following code, but it shows only one element, not the whole list.
What is wrong with it?
Thanks!
articles_list = r'/html/body/main/div/div/section/ol'
articles_elements = driver.find_elements_by_xpath(articles_list)
for article in articles_elements:
    title = article.find_element_by_xpath('.//div/div/div[2]/div[1]/a[1]/h2').text
    print('Title:' + title)
I see that the li tags have incrementing indices, like [3], then [4], and so on.
You can look up each web element by declaring a = 3 initially and then incrementing it programmatically:
a = 3
size_of_list = len(driver.find_elements_by_xpath("/html/body/main/div/div/section/ol/li"))
for i in range(size_of_list):
    title = driver.find_element_by_xpath(f"/html/body/main/div/div/section/ol/li[{a}]/div/div/div[2]/div[1]/a[1]/h2").text
    print('Title:', title)
    a = a + 1
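An alternative that avoids manual index arithmetic: locate every li once, then resolve the h2 relative to each one, so the numbering of the li elements no longer matters. A sketch under the same path assumptions as above (collect_titles is an illustrative name, written against the older find_elements_by_xpath API used in this thread):

```python
def collect_titles(driver):
    # Grab every li under the ol, then resolve the h2 relative to each li
    titles = []
    for li in driver.find_elements_by_xpath("/html/body/main/div/div/section/ol/li"):
        h2 = li.find_elements_by_xpath(".//div/div/div[2]/div[1]/a[1]/h2")
        if h2:  # some li entries may hold no article title
            titles.append(h2[0].text)
    return titles
```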

filtering out elements found with beautiful soup based on a key word in any attribute

Here is an example of a URL:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://rapaxray.com'
# logo
html_content = requests.get(url, headers=headers).text
soup = BeautifulSoup(html_content, "lxml")
images_found = soup.findAll('img', {'src': re.compile(r'(jpe?g)|(png)|(svg)$')})
images_found
First I'm narrowing down the list of elements to the ones containing jpg, png or svg in the src attribute. In this case I only get 3 elements. Then I would like to filter those elements to show me only the ones that have the key word 'logo' in ANY attribute.
The element I'm looking for in this example looks like this:
<img alt="Radiology Associates, P.A." class="attachment-full size-full astra-logo-svg" loading="lazy" src="https://rapaxray.com/wp-content/uploads/2019/09/RAPA100.svg"/>
I want to filter out this element out of all elements based on condition that it has a key word 'logo' in ANY of its attributes
The challenge is that:
I have thousands of URLs, and the key word 'logo' could be in a different attribute for each URL.
The logic if 'logo' in any(attribute for attribute in list_of_possible_attributes_that_this_element_has) doesn't work the way a list comprehension does, because I couldn't find a way to access every possible attribute without using its specific name.
Checking all specific names is also problematic, because a particular attribute could exist in one element but not another, which throws an error.
The case above is extra challenging because the attribute value is a list, so we would need to flatten it to check whether the key word is in it.
For most of the URLs the element I'm looking for is not returned as the top one like in this example, so picking the first one is not an option.
Is there a way of filtering out elements based on a key word in ANY of their attributes (without prior knowledge of what the attribute's name is)?
If I understood you correctly, you could use a filter function similar to this answer to search for all tags such that any tag attribute's value contains val:
def my_filter(tag, val):
    types = ['.jpg', '.jpeg', '.svg', '.png']
    if tag is not None and tag.name == "img" and tag.has_attr("src"):
        if all(y not in tag['src'] for y in types):
            return False
        for key in tag.attrs.keys():
            if isinstance(tag[key], list):
                if any(val in entry for entry in tag[key]):
                    return True
            else:
                if val in tag[key]:
                    return True
    return False

res = soup.find_all(lambda tag: my_filter(tag, "logo"))
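For example, applied to a small document (the filter is repeated here so the snippet is self-contained; the sample markup is a trimmed-down version of the element from the question):

```python
from bs4 import BeautifulSoup

def my_filter(tag, val):
    types = ['.jpg', '.jpeg', '.svg', '.png']
    if tag is not None and tag.name == "img" and tag.has_attr("src"):
        if all(y not in tag['src'] for y in types):
            return False
        for key in tag.attrs.keys():
            if isinstance(tag[key], list):  # e.g. class is parsed as a list
                if any(val in entry for entry in tag[key]):
                    return True
            else:
                if val in tag[key]:
                    return True
    return False

html = '''
<img class="attachment-full astra-logo-svg" src="/uploads/RAPA100.svg"/>
<img class="photo" src="/uploads/staff.jpg"/>
'''
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all(lambda tag: my_filter(tag, "logo"))
print([t['src'] for t in res])  # ['/uploads/RAPA100.svg']
```

Only the first img matches, because 'logo' appears in one of its class values; the second is filtered out even though its src passes the extension check.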

How to get href values from a class - Python - Selenium

<a class="link__f5415c25" href="/profiles/people/1515754-andrea-jung" title="Andrea Jung">
I have the above HTML element and tried using
driver.find_elements_by_class_name('link__f5415c25')
and
driver.get_attribute('href')
but it doesn't work at all. I expected to extract values in href.
How can I do that? Thanks!
You have to first locate the element, then retrieve the attribute href, like so:
href = driver.find_element_by_class_name('link__f5415c25').get_attribute('href')
if there are multiple links associated with that class name, you can try something like:
eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))
for href in hrefList:
    print(href)

How to generate FilterExpression without hard-coding the attribute names?

I know one can scan or query like this:
response = table.scan(
    FilterExpression=Attr('first_name').begins_with('J') &
                     Attr('account_type').eq('super_user'))
How do I do this without hard coding the attribute names?
To clarify, given such dictionary, I wish to table scan as such;
attr_dict = {"foo":42, "bar":52}
response = table.scan(
    FilterExpression=Attr('foo').eq(42) &
                     Attr('bar').eq(52))
If it were SQL, although one wouldn't do it, I could write
"select * from table where {} = {} and {} = {}".format(a, b, x, y).
The problem I'm having is how to AND the conditions together in a loop of some sort.
Just create the FilterExpression by looping over the entries in the dictionary. You can use boolean logic to combine individual expressions as follows (just build the filter expression string manually):
foo = 42 AND bar = 52
