XPath use when the exact attribute name is not known - python-3.x

I am trying to get list values in:
Though I can extract them with,
for r in g.find_elements(By.XPATH, '//ul//li[contains(@data-bib-id, "bib")]'):
    print(r.text)
the attribute (data-bib-id) is not always the same, and I am trying to make my scraping task as generic as possible. So, is there a way to extract the same info when the exact attribute name is not known? That is, li elements under a ul, ol, or div that have an attribute value containing the substring "bib" or "ref"?

If the li elements can appear under a ul, ol, or div, you can omit the parent and start the expression with //li.
Then add an or operator on the data-bib-id attribute value, so the XPath expression becomes:
for r in g.find_elements(By.XPATH, '//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")]'):
    print(r.text)
If you need to limit the search so that the li must be a descendant of a ul, ol, or div element, write one path per parent and combine them with the XPath union operator | (an or between whole paths would evaluate to a boolean, not a set of elements):
for r in g.find_elements(By.XPATH, '//ul//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")] | //ol//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")] | //div//li[contains(@data-bib-id, "bib") or contains(@data-bib-id, "ref")]'):
    print(r.text)
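The answer above still hard-codes the attribute name data-bib-id. A minimal sketch of one way to drop it, assuming XPath 1.0 as used by Selenium: @* selects every attribute of an element, so a nested predicate can test the values regardless of the attribute name. The helper name build_li_xpath and its parameters are hypothetical, and the keywords are assumed not to contain double quotes.

```python
def build_li_xpath(keywords=("bib", "ref"), parents=("ul", "ol", "div")):
    """Build an XPath for <li> elements that have any attribute whose value
    contains one of the keywords, under any of the given parent tags."""
    # @* selects every attribute of the element, whatever its name.
    value_test = " or ".join(f'contains(., "{kw}")' for kw in keywords)
    predicate = f"[@*[{value_test}]]"
    # One path per allowed ancestor, joined with the XPath union operator.
    return " | ".join(f"//{parent}//li{predicate}" for parent in parents)


xpath = build_li_xpath()
print(xpath)
```

The resulting string can be passed to g.find_elements(By.XPATH, xpath) as in the question. To match on the attribute name rather than its value, contains(name(), "bib") can be used inside the @*[...] predicate instead.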

Related

Using xpath for an array of WebElements

I need to scrape some data off tags in a page which further has more DOM elements.
The articles are repeated and they have an xpath as:
//*[@id="post_page"]/div/div[2]/main/div/div/div/div[2]/div[2]/div/div[3]/div/article[N]
where 'N' represents the Nth article.
And within each article, the xpath for the element I'm interested in is:
/div/div/div/div/div/div/div[3]/div[1]/button[1]/span
The first thing I did was to use
Elements = driver.find_elements(By.XPATH, <first_path>)
And it fetched me all the articles in the page. PS: I did not add [N] because that would only fetch a specific article, and I'm interested in all.
Then, for each element in the list, I used find_element using the second path as follows:
for elem in Elements:
    Required.append(elem.find_element(By.XPATH, <second_path>))
Where Required is a list in which I'll be storing the data. And this is where I got the element does not exist error.
I also tried adding a . before <second_path> but that didn't solve the issue either.
The complete xpath of the element is:
//*[@id="post_page"]/div/div[2]/main/div/div/div/div[2]/div[2]/div/div[3]/div/article[N]/div/div/div/div/div/div/div[3]/div[1]/button[1]/span
And the CSS Selector for the same is:
#post_page > div > div._UuSG.w77Za._21rSD._3SBW4 > main > div > div > div > div._UuSG._ayWa._3dGg1.Vlb1o._1vyTb > div._UuSG.qzupC._3cqkW > div > div:nth-child(3) > div > article:nth-child(N) > div > div > div > div > div > div > div._UuSG._3VzCT._2FoTG > div._UuSG._3dGg1._2VJFi._2h1-g > button:nth-child(1) > span
I also tried an approach using a loop where I increment a counter variable and use that as N for the whole xpath, but that didn't seem to work either. Got the same error.
Any help would be greatly appreciated.
EDIT[1]
The last span has the following class names:
<span class="_UuSG _3_54N a8-QN _2cSLK L4pn5 RiX17">Stuff I need</span>
Which are unique (collectively) in the page. This information might be relevant somehow.
I think I know your problem. When you do
Elements = driver.find_elements(By.XPATH, <first_path>)
you have already found all the elements you need here. So in your for loop, just use elem, no more "finding" is needed.
for elem in Elements:
    Required.append(elem)
I would use .// to select along the descendant-or-self axis starting from the current node (. means the current node).
You have already tried ./, which is pretty close.
See also:
xpath ".//span", what does the dot mean?
What is meaning of .// in XPath?
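A minimal sketch of the relative-path idea, using the standard library's xml.etree.ElementTree instead of Selenium so it runs without a browser. The PAGE markup is a hypothetical, heavily simplified stand-in for the real page; only the pattern (find all articles, then search from each one with a leading dot) carries over.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page structure in the question.
PAGE = """
<main>
  <article><div><button><span>first</span></button></div></article>
  <article><div><button><span>second</span></button></div></article>
</main>
"""

root = ET.fromstring(PAGE)
articles = root.findall(".//article")

required = []
for article in articles:
    # The leading "." anchors the search at this article, not at the document root.
    span = article.find(".//button/span")
    if span is not None:
        required.append(span.text)

print(required)  # ['first', 'second']
```

With Selenium the same shape would be elem.find_element(By.XPATH, ".//button[1]/span") inside the loop over Elements, where the exact relative path depends on the real page.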

How to get text which is inside the span tag using selenium webdriver?

I want to get the text which is inside the span. However, I am not able to achieve it. The text is inside ul > li > span > a > span. I am using Selenium with Python.
Below is the code which I tried:
departmentCategoryContent = driver.find_elements_by_class_name('a-list-item')
departmentCategory = departmentCategoryContent.find_elements_by_tag_name('span')
after this, I am just iterating departmentCategory and printing the text using .text i.e
[ print(x.text) for x in departmentCategory ]
However, this is generating an error: AttributeError: 'list' object has no attribute 'find_elements_by_tag_name'.
Can anyone tell me what I am doing wrong and how I can get the text?
Problem:
As far as I understand, departmentCategoryContent is a list, not a single WebElement, so it doesn't have the find_elements_by_tag_name() method.
Solution:
You can choose one of the two ways below:
1. Loop over the departmentCategoryContent list first, then call find_elements_by_tag_name() on each element.
2. Save time with a single statement, using find_elements_by_css_selector():
departmentCategory = driver.find_elements_by_css_selector('.a-spacing-micro.apb-browse-refinements-indent-2 .a-list-item span')
[ print(x.text) for x in departmentCategory ]
Explanation:
Your locator .a-list-item span will return all the span tags that belong to elements with class .a-list-item. That matches 88 items, including the unwanted tags.
So you need a more specific locator to exclude the other divs. In this case, I added some more classes: .a-spacing-micro.apb-browse-refinements-indent-2
You're looping over the wrong thing. You want to loop through the 'a-list-item' list and find a single span element that is a child of each WebElement. Try this:
departmentCategoryContent = driver.find_elements_by_class_name('a-list-item')
for x in departmentCategoryContent:
    print(x.find_element_by_tag_name('span').text)
Note that the second DOM search is find_element (not find_elements), which returns a single WebElement, not a list.
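The loop can also be sketched with the generic find_elements(by, value) call, which matters because the find_element_by_* helpers used above were removed in Selenium 4.3. This is an illustration under assumptions, not the answerers' code: the function name span_texts is hypothetical, and the locator strings "class name" and "tag name" are the string values behind By.CLASS_NAME and By.TAG_NAME, written out so the sketch has no import.

```python
def span_texts(driver):
    """Return the text of the first <span> inside every .a-list-item element."""
    texts = []
    # find_elements returns a list, so iterate over it instead of calling
    # element methods on the list itself.
    for item in driver.find_elements("class name", "a-list-item"):
        # Using find_elements here avoids an exception when an item has no span.
        spans = item.find_elements("tag name", "span")
        if spans:
            texts.append(spans[0].text)
    return texts
```

With a real driver, span_texts(driver) returns a list of strings; in application code, By.CLASS_NAME and By.TAG_NAME from selenium.webdriver.common.by are the more readable way to spell the locators.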

Using xpath to extract only the text being part of the parent node

How can we select and extract only the text that belongs directly to the parent node? Here is the HTML I am working on. I need to extract only the "$1,950" using XPath. When I select the parent node and extract its text content, I get the text content of its children as well, but I need the text content of the parent node only.
<span class="rentRollup">
<span class="longText">3 Bedrooms</span>
<span class="shortText">3 Beds</span>
$1,950
</span>
I have tried using XPath, but it prints the data of the whole parent node as well as its child nodes.
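import requests
from fake_useragent import UserAgent  # assumed source of UserAgent
from lxml import html

url = 'https://www.apartments.com/214-taylor-st-raleigh-nc/cr6tchd/'
# initializing request headers
ua = UserAgent()
header = {'User-Agent': str(ua.chrome)}
response = requests.get(url, headers=header)
print(response)
byte_data = response.content
source_code = html.fromstring(byte_data)
# selects the parent of the element whose text contains '3 Bedrooms'
name = source_code.xpath("//*[contains(text(), '3 Bedrooms')]/..")
# text_content() also includes the text of all child elements
name = name[0].text_content()
print(name)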
Try it this way: after print(response), replace everything with:
tree = html.fromstring(response.content)
name=tree.xpath("//span[@class='rentRollup']/text()")
name[2].strip()
Output:
'$1,950'
The following XPath expression
//*[contains(*/text(), '3 Bedrooms')]/text()
will select just the text nodes which are direct children of the parent node of interest. There is still whitespace noise in the result, which you need to strip.
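To illustrate which text counts as the parent's own, here is a minimal standard-library sketch. It swaps lxml for xml.etree.ElementTree, which has no text() step: the text before the first child lives in element.text, and the text after each child lives in that child's tail. The helper name own_text is hypothetical, and the snippet is the HTML from the question.

```python
import xml.etree.ElementTree as ET

SNIPPET = """
<span class="rentRollup">
  <span class="longText">3 Bedrooms</span>
  <span class="shortText">3 Beds</span>
  $1,950
</span>
"""


def own_text(element):
    """Join the text that sits directly inside element, skipping child text."""
    parts = [element.text or ""]
    # Text that follows a child element is stored on that child as .tail.
    parts.extend(child.tail or "" for child in element)
    # Drop the whitespace-only pieces left over from indentation.
    return " ".join(p.strip() for p in parts if p.strip())


root = ET.fromstring(SNIPPET)
price = own_text(root)
print(price)  # $1,950
```

The same pieces are what //span[@class='rentRollup']/text() returns in lxml, which is why the whitespace-only entries have to be filtered or indexed past.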

How to get content from div class using Selenium - Python?

I want to extract the contents on the left side using the div class <table__9d458b97>
I don't want to use xpath to do the job because some contents don't sit in the same position.
driver2 = webdriver.Chrome(r'XXXX\chromedriver.exe')
driver2.get("https://www.bloomberg.com/profiles/people/15103277-mark-elliot-zuckerberg")
Here is my code using the xpath (how can I use the class?):
boardmembership_table=driver2.find_elements_by_xpath('//*[@id="root"]/div/section/div[5]')[0]
boardmembership_table.text
Thanks for the help!
You could use a CSS selector instead.
You can use the following code:
from selenium.webdriver import Chrome
driver2 = Chrome()
driver2.get("https://www.bloomberg.com/profiles/people/15103277-mark-elliot-zuckerberg")
els = driver2.find_elements_by_css_selector('.table__9d458b97[role="table"]')
for el in els:
    print(el.text)
driver2.close()
Note that find_elements_by_css_selector returns a list of elements, or an empty list if none are found.
You can use the XPath below if you want to access the Board Memberships table:
//*[@id="root"]/div/section/div[h2[.='Board Memberships']]
You can also use the following-sibling axis to get the div next to the 'Board Membership' title, like this:
'//h2[contains(.,"Board Membership")]/following-sibling::div'

Using BeautifulSoup to separate the hrefs and the anchor text

I'm using Python3 with Beautiful Soup 4 to separate hrefs from the text itself. Like:
LINK
I want to (1) extract and print yoursite.com, and then (2) get LINK.
If anyone could help me that would be great!
Locate the a element by, say, class name; use dictionary-like access to attributes; .get_text() to get the link text:
a = soup.find("a", class_="sample-class") # or soup.select_one("a.sample-class")
print(a["href"])
print(a.get_text())
From the Beautiful Soup documentation:
A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:
tag['class']
# u'boldest'
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:
tag.string
# u'Extremely bold'
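For comparison, the href/text split can be sketched with only the standard library's html.parser, which may help when Beautiful Soup is not installed. This is a swapped-in technique, not the answer's method. The class name LinkCollector and the sample markup are hypothetical; the original HTML from the question is not available.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = LinkCollector()
parser.feed('<p>See <a href="http://yoursite.com">LINK</a> for details.</p>')
print(parser.links)  # [('http://yoursite.com', 'LINK')]
```

Beautiful Soup stays the shorter option for real pages: a["href"] and a.get_text() do the same job in two lines and tolerate messier markup.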
