find aria-label in html page using soup python - python-3.x

I have HTML pages with this code:
<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>
I want to find the aria-label, so I have tried this:
is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)
I want to get this result:
is_452 = page 452
but I'm getting:
is_452 = None
How can I do it?

The attribute value has a line break in it, so it doesn't match a single-line string. Try the following:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<span itemprop="title" data-andiallelmwithtext="15" aria-current="page" aria-label="you in page
number 452">page 452</span>'''
doc = SimplifiedDoc(html)
is_452 = doc.getElementByReg('aria-label="you in page[\s]*number 452"', tag="span")
print(is_452.text)

Possibly the desired element is a dynamic element, in which case you can use Selenium to extract the value of the aria-label attribute, inducing WebDriverWait for visibility_of_element_located(). You can use either of the following locator strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span[itemprop='title'][aria-current='page']"))).get_attribute("aria-label"))
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@itemprop='title' and @aria-current='page']"))).get_attribute("aria-label"))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

The reason soup fails here is the line break. I have a simpler solution which doesn't use any separate library, only BeautifulSoup. I know this question is old, but it has 1k views, so clearly many people are searching for this.
You can use triple-quote strings to take into account the newline.
This:
is_452 = soup.find("span", {"aria-label": "you in page number 452"})
print(is_452)
Would become:
search_label = """you in page
number 452"""
is_452 = soup.find("span", {"aria-label": search_label})
print(is_452)
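Another way to stay whitespace-proof, instead of reproducing the exact line break, is to normalize whitespace in a matcher function (BeautifulSoup accepts a callable as an attribute filter). A minimal stdlib sketch of the normalization idea; the `label_matches` helper name is hypothetical:

```python
import re

def label_matches(attr_value, wanted):
    # Collapse every run of whitespace (spaces, newlines, tabs) to one space
    return re.sub(r"\s+", " ", attr_value).strip() == wanted

# The attribute value as it appears in the HTML, with its line break:
attr = "you in page\nnumber 452"
print(label_matches(attr, "you in page number 452"))  # True
```

With bs4 this plugs in as soup.find("span", {"aria-label": lambda v: v and label_matches(v, "you in page number 452")}).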

Related

Can't locate elements from a website using selenium

Trying to scrape data from a business directory but I keep getting the data was not found
name = driver.find_elements_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[1]/div/div[1]/h4')[0].text
# Results in: IndexError: list index out of range
So I tried to use WebDriverWait to make the code wait for the data to load, but it doesn't find the elements, even though the data gets loaded on the website.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
url='https://www.dmcc.ae/business-search?directory=1&submissionGuid=2c8df029-a92e-4b5d-a014-7ef9948e664b'
driver = webdriver.Firefox()
driver.get(url)
wait=WebDriverWait(driver,50)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'searched-list ng-scope')))
name = driver.find_elements_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[1]/div/div[1]/h4')[0].text
print(name)
driver.switch_to.frame(driver.find_element_by_css_selector("#pym-0 iframe"))
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.searched-list.ng-scope')))
name = driver.find_elements_by_xpath('/html/body/div[3]/div/div/div[1]/div/div[1]/div/div[1]/h4')[0].text
It's inside an iframe; to interact with an element inside an iframe, switch to the iframe first. The iframe here doesn't have any unique identifier, so we used the parent div, which has a unique id, as a reference, and from that we found the child iframe.
Now, if you want to interact with elements outside the iframe, use:
driver.switch_to.default_content()
<iframe src="https://dmcc.secure.force.com/Business_directory_Page?initialWidth=987&childId=pym-0&parentTitle=List%20of%20Companies%20Registered%20in%20Dubai%2C%20DMCC%20Free%20Zone&parentUrl=https%3A%2F%2Fwww.dmcc.ae%2Fbusiness-search%3Fdirectory%3D1%26submissionGuid%3D2c8df029-a92e-4b5d-a014-7ef9948e664b" width="100%" scrolling="no" marginheight="0" frameborder="0" height="3657px"></iframe>
Switch to the iframe and handle the accept button.
driver.get('https://www.dmcc.ae/business-search?directory=1&submissionGuid=2c8df029-a92e-4b5d-a014-7ef9948e664b')
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#hs-eu-confirmation-button"))).click()
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,'#pym-0 > iframe')))
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,'.searched-list.ng-scope')))
name = driver.find_elements_by_xpath('//*[@id="directory_list"]/div/div/div/div[1]/h4')[0]
print(name.text)
Output:
1 BOXOFFICE DMCC

Python Selenium get value from certain class

I am trying to get the information in value=" " from this code.
As long as the class has the word orange in it, I want to store the value.
The expected result in this case is storing the value "JKK-LKK" in a variable.
<input type="text" readonly="" class="form-control nl-forms-wp-orange" value="JKK-LKK" style="cursor: pointer; border-left-style: none;">
I have tried using
text = driver.find_elements_by_xpath("//*[contains(text(), 'nl-forms-wp-orange')]").get_attribute("value")
But I get:
AttributeError: 'list' object has no attribute 'get_attribute'
I've also tried getText("value"), but I get "is not a valid XPath expression".
If I try only using
driver.find_elements_by_xpath("//*[contains(text(), 'nl-forms-wp-orange')]")
The list comes back empty, so I feel like I'm missing a few key pieces.
What might I be doing wrong?
To print the value of the value attribute, i.e. JKK-LKK, from the element whose class attribute contains nl-forms-wp-orange, you can use either of the following locator strategies:
Using css_selector:
print(driver.find_element_by_css_selector("input.nl-forms-wp-orange").get_attribute("value"))
Using xpath:
print(driver.find_element_by_xpath("//input[contains(@class, 'nl-forms-wp-orange')]").get_attribute("value"))
Ideally you need to induce WebDriverWait for the visibility_of_element_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input.nl-forms-wp-orange"))).get_attribute("value"))
Using XPATH:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[contains(@class, 'nl-forms-wp-orange')]"))).get_attribute("value"))
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
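A side note on the question's original attempt: the empty list from contains(text(), 'nl-forms-wp-orange') is expected, because text() matches an element's text content, while the class name lives in an attribute. A stdlib sketch of that distinction (xml.etree is used here only to show where the data sits; it is not part of the Selenium answer):

```python
import xml.etree.ElementTree as ET

el = ET.fromstring('<input class="form-control nl-forms-wp-orange" value="JKK-LKK"/>')
print(el.text)          # None -- the element has no text content to match
print(el.get("class"))  # form-control nl-forms-wp-orange
print(el.get("value"))  # JKK-LKK
```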

Attribute raises NoSuchElementException in selenium Python

I can get the element which attribute contains X but I can't get the attribute itself. Why?
This code does not raise any errors:
links = browser.find_elements_by_xpath('//div[@aria-label="Conversations"]//a[contains(@data-href, "https://www.messenger.com/t/")]')
This code raises NoSuchElementException error:
links = browser.find_elements_by_xpath('//div[@aria-label="Conversations"]//a[contains(@data-href, "https://www.messenger.com/t/")]/@data-href')
This code is extracted from Messenger main page with the chats. I would like to get all links in the ul list...
I don't get it... Any help please?
To get each link's data-href value you need to traverse the links elements and use get_attribute():
links = browser.find_elements_by_xpath('//div[@aria-label="Conversations"]//a[contains(@data-href, "https://www.messenger.com/t/")]')
ul_list=[link.get_attribute("data-href") for link in links]
print(ul_list)
wait = WebDriverWait(driver, 20)
lists = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@aria-label="Conversations"]//a[contains(@data-href, "https://www.messenger.com/t/")]')))
for element in lists:
    print(element.get_attribute("data-href"))
Note: Add below imports to your solution
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

How to extract all href from a class in Python Selenium?

I am trying to extract people's href from the URL https://www.dx3canada.com/agenda/speakers.
I tried:
elems = driver.find_elements_by_css_selector('.display-flex card vancouver')
href_output = []
for ele in elems:
    href_output.append(ele.get_attribute("href"))
print(href_output)
But the output list returns nothing...
The expected href is shown in the image below, and I hope to get the output as a list of hrefs:
I really appreciate the help!
To extract the people's href attributes from the URL https://www.dx3canada.com/agenda/speakers, as the desired elements are within an <iframe>, you have to:
Induce WebDriverWait for the desired frame to be available and switch to it.
Induce WebDriverWait for the visibility of all elements located.
You can use the following Locator Strategies:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.dx3canada.com/agenda/speakers')
WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"iframe#whovaIframeSpeaker")))
print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.display-flex.card.vancouver")))])
Console Output:
['https://whova.com/embedded/speaker_detail/dcrma_202003/9942778/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907682/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907688/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907676/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907696/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907690/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907670/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907693/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9942779/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9908087/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907671/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907681/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907673/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907678/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907689/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907674/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907684/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907685/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907686/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9942780/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907695/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907687/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907683/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907692/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907672/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907697/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907680/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907679/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907675/', 
'https://whova.com/embedded/speaker_detail/dcrma_202003/9907677/', 'https://whova.com/embedded/speaker_detail/dcrma_202003/9907694/']
Here you can find a relevant discussion on Ways to deal with #document under iframe
Your images are in an iframe, so you will need to switch to this before you can scrape the href attributes using frame_to_be_available_and_switch_to_it.
Then, to get the list of all href attributes, you may need to run some JavaScript to scroll each image into view, and handle the case where the images may be lazy-loading the href:
# first, switch to iframe
WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@id='whovaIframeSpeaker']")))
elements_list = driver.find_elements_by_xpath("//div[contains(@class, 'template-section-body')]/a[contains(@class, 'display-flex card vancouver')]")
for element in elements_list:
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    print(element.get_attribute("href"))
For your CSS selector, use .display-flex.card.vancouver instead.
elems = driver.find_elements_by_css_selector('.display-flex.card.vancouver')
Each word is a separate class, so you need to place a dot in front of each one.
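A quick way to see why the space breaks the selector: each space-separated word in the class attribute is its own class, and a compound CSS selector joins them with dots. A small sketch of building the selector from the attribute value:

```python
class_attr = "display-flex card vancouver"   # the element's class attribute value
selector = "." + ".".join(class_attr.split())
print(selector)  # .display-flex.card.vancouver
```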

How to get text in span with xpath. selenium python

Before asking, I searched for an answer on Google for two hours, but found nothing that works for me.
I use Selenium with Python.
I applied the answer from the question below to my code, but no text is printed.
XPath query to get nth instance of an element
What I want to get is "Can't select"
<li data-index="5" date="20190328" class="day dimmed">
<a href="#" onclick="return false;">
<span class="dayweek">Tuesday</span>
<span class="day">28</span>
<span class="sreader">Can't select</span>
</a>
</li>
I use XPath because I need to iterate. The HTML code above is a simplified version.
day_lists = driver.find_elements_by_xpath('//li')
Nothing is printed and there's no error:
for day_list in day_lists:
    print(day_list.find_element_by_xpath('//span[@class="sreader"]').text)
Update (2019-03-24 16:45 +09:00):
When I test with the code below:
print(day_list.find_element_by_xpath('.//span[@class="sreader"]/text()'))
an error comes out. Why is there no such element?
selenium.common.exceptions.NoSuchElementException:
Message: no such element: Unable to locate element:
{"method":"xpath","selector":".//span[@class="sreader"]/text()"}
If nothing is printed and there's no error, then the required text might be hidden or not generated yet.
For the first case you might need to use get_attribute('textContent'):
day_lists = driver.find_elements_by_tag_name('li')
for day_list in day_lists:
    print(day_list.find_element_by_xpath('.//span[@class="sreader"]').get_attribute('textContent'))
For the second case:
from selenium.webdriver.support.ui import WebDriverWait as wait
day_lists = driver.find_elements_by_tag_name('li')
for day_list in day_lists:
    print(wait(driver, 10).until(lambda driver: day_list.find_element_by_xpath('.//span[@class="sreader"]').text))
Note that in both cases you need to add leading dot in XPath:
'//span' --> './/span'
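The scoping effect of the leading dot can be seen with the standard library's xml.etree, which uses the same convention: .// searches under the current element, while without the dot Selenium restarts from the document root and keeps returning the first match. A sketch using hypothetical sample HTML:

```python
import xml.etree.ElementTree as ET

html = """<ul>
  <li><span class="sreader">first</span></li>
  <li><span class="sreader">second</span></li>
</ul>"""
root = ET.fromstring(html)
for li in root.findall("li"):
    # Leading dot: search only inside this particular <li>
    span = li.find(".//span[@class='sreader']")
    print(span.text)  # prints "first", then "second"
```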
To print the desired text, e.g. "Can't select" (선택불가), from the <span> tag you can write a function as follows:
def print_text(my_date):
    print(driver.find_element_by_xpath("//li[@class='day dimmed'][@date='" + my_date + "']//span[@class='sreader']").get_attribute("innerHTML"))
Now you can call the function with any date as follows:
print_text("20190328")
Here are the selectors. If this does not work, then I suspect the element might be present either in a separate window or in a frame.
CSS:
li[class='day dimmed'][date='20190328'] .sreader
XPath:
//li[@class='day dimmed'][@date='20190328']//span[@class='sreader']
To check if there are multiple windows, use:
print(len(driver.window_handles))
To check if there are multiple frames, use:
print(len(driver.find_elements_by_css_selector('iframe')))
Try the following options:
day_lists = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '//li[@class="day dimmed"]')))
for day_list in day_lists:
    print(day_list.find_element_by_xpath('.//span[@class="sreader"]').text)
OR
day_lists = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '//li[@class="day dimmed"]')))
for day_list in day_lists:
    print(driver.execute_script("return arguments[0].innerHTML;", day_list.find_element_by_xpath('.//span[@class="sreader"]')))
Please note that you need the following imports as well:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
