I'm trying to run some automated tests on a web application (Angular 2, HTTPS and WebSocket) with: Gherkin, behave 1.2.6, Selenium 3.141.0, Firefox and Python 3.9.4, geckodriver 0.29.1, allure-behave 2.8.40.
Configuration:
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0"
Locator:
select = (By.XPATH, '//div[contains(@class,"connexion-content")]//select')
Error:
Assertion Failed: Item "profilselect.select" not found in current page, cannot set value on it
Captured logging: WARNING:root:[ACTION] waitForClickable => Element at "//div[contains(@class,"connexion-content")]//select" is not clickable.
HTML option I want to select: note that "value" can't be used because I'm only interested in the text between the "option" tags; the option should only be selected if the given text corresponds to a specific value.
Since your element is a select, why don't you use Selenium's Select class, which lets you interact with it directly by visible text or by index?
Since you have not shown your code, I can only help with pseudocode:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions

we_sel = WebDriverWait(driver, 10).until(expected_conditions.visibility_of_element_located((By.ID, 'form-1-1')))
# The above statement is equivalent to the statement below.
# If you are not familiar with WebDriver waits, uncomment the statement below and comment out the one above.
# we_sel = driver.find_element_by_id('form-1-1')
sel = Select(we_sel)
sel.select_by_visible_text('replace with the option text you want to select')
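Applied to the locator from your question (assuming that XPath is correct for the page), a minimal sketch that also waits for the element to become clickable, which should address the waitForClickable warning, could look like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions

# "driver" is assumed to be your existing Firefox WebDriver instance.
select_locator = (By.XPATH, '//div[contains(@class,"connexion-content")]//select')

# Wait up to 10 seconds for the <select> to be clickable, then wrap it in Select.
element = WebDriverWait(driver, 10).until(
    expected_conditions.element_to_be_clickable(select_locator)
)
Select(element).select_by_visible_text('the option text you want')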
References (Webdriver API Docs):
For Select class:
https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.select.html#module-selenium.webdriver.support.select
For WebDriverWait class:
https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.wait.html#module-selenium.webdriver.support.wait
For expected_conditions module:
https://www.selenium.dev/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html#module-selenium.webdriver.support.expected_conditions
I'm trying to translate a webpage from Japanese to English. Below is a code snippet that used to work two weeks ago but is not working anymore.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Initializing Chrome reference
chrome_path = Service(r"C:\Python\Anaconda\chromedriver.exe")
custom_options = webdriver.ChromeOptions()
prefs = {
    "translate_whitelists": {"ja": "en"},
    "translate": {"enabled": "True"}
}
custom_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(service = chrome_path, options=custom_options)
I've also tried
custom_options.add_argument('--lang=en')
driver = webdriver.Chrome(service = chrome_path, options=custom_options)
None of these seem to work anymore, no other changes to this part of the code have been made.
The website I'm trying to translate is https://media.japanmetaldaily.com/market/list/
Any help is greatly appreciated.
I can't get that to work via Selenium/ChromeDriver either... You could try using a hidden Google Translate API: https://danpetrov.xyz/programming/2021/12/30/telegram-google-translate.html
I've put it into some Python code that translates that table for you using that hidden API, with source language "ja" and target language "en":
import urllib.parse
from bs4 import BeautifulSoup
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
}
jap_page = requests.get('https://media.japanmetaldaily.com/market/list/',headers=headers)
soup = BeautifulSoup(jap_page.text,'html.parser')
table = soup.find('table',class_='rateListTable')
for line in table.find_all('a'):
    endpoint = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=ja&tl=en&dt=t&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&kc=7&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&q="
    text = urllib.parse.quote_plus(line.text.strip())
    data = requests.get(endpoint + text).json()
    print(data[0][0][1] + ' = ' + data[0][0][0])
Try Chrome browser and ChromeDriver version 96; it is not working with version 97.
I'm trying to get the address in the 'From' column for the first-ever transaction of any token. Since new transactions come in so often, making this table dynamic, I'd like to be able to fetch this info at any time using a parsel Selector. Here's my attempted approach:
First step: Fetch the total number of pages
Second step: Insert that number into the URL to get to the earliest page number.
Third step: Loop through the 'From' column and extract the first address.
It returns an empty list. I can't figure out the source of the issue. Any advice will be greatly appreciated.
import requests
from parsel import Selector

contract_address = "0x431e17fb6c8231340ce4c91d623e5f6d38282936"
pg_num_url = f"https://bscscan.com/token/generic-tokentxns2?contractAddress={contract_address}&mode=&sid=066c697ef6a537ed95ccec0084a464ec&m=normal&p=1"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}
response = requests.get(pg_num_url, headers=headers)
sel = Selector(response.text)
pg_num = sel.xpath('//nav/ul/li[3]/span/strong[2]').get()  # Attempting to extract the total page number
url = f"https://bscscan.com/token/generic-tokentxns2?contractAddress={contract_address}&mode=&sid=066c697ef6a537ed95ccec0084a464ec&m=normal&p={pg_num}"  # page number inserted
response = requests.get(url, headers=headers)
sel = Selector(response.text)
addresses = []
for row in sel.css('tr'):
    addr = row.xpath('td[5]//a/@href').re('/token/([^#?]+)')[0][45:]
    addresses.append(addr)
print(addresses[-1])  # Desired address
It seems like the website is using server-side session tracking and a security token to make scraping a bit more difficult.
We can get around this by replicating their behaviour!
If you take a look at web inspector you can see that some cookies are being sent to us once we connect to the website for the first time:
Further when we click next page on one of the tables we see these cookies being sent back to the server:
Finally, the URL of the table page contains a parameter called sid; this often stands for something like "security id" and can be found in the body of the first page. If you inspect the page source you can find it hidden away in the JavaScript:
Now we need to put all of this together:
start a requests Session which will keep track of cookies
go to token homepage and receive cookies
find sid in token homepage
use cookies and sid token to scrape the table pages
I've modified your code and it ends up looking something like this:
import re
import requests
from parsel import Selector
contract_address = "0x431e17fb6c8231340ce4c91d623e5f6d38282936"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}
# we need to start a session to keep track of cookies
session = requests.session()
# first we make a request to homepage to pickup server-side session cookies
resp_homepage = session.get(
    f"https://bscscan.com/token/{contract_address}", headers=headers
)
# in the homepage we also need to find security token that is hidden in html body
# we can do this with simple regex pattern:
security_id = re.findall("sid = '(.+?)'", resp_homepage.text)[0]
# once we have cookies and security token we can build the pagination url
pg_num_url = (
    f"https://bscscan.com/token/generic-tokentxns2?"
    f"contractAddress={contract_address}&mode=&sid={security_id}&m=normal&p=2"
)
# finally get the page response and scrape the data:
resp_pagination = session.get(pg_num_url, headers=headers)
addresses = []
for row in Selector(resp_pagination.text).css("tr"):
    addr = row.xpath("td[5]//a/@href").get()
    if addr:
        addresses.append(addr)
print(addresses)  # Desired addresses
I am writing a spider to scrape a popular reviews website :-) This is my first attempt at writing a Scrapy spider.
The top level is a list of restaurants (I call this "top level"), which appear 30 at a time. My spider accesses each link and then "clicks next" to get the next 30, and so on. This part is working as my output does contain thousands of restaurants, not just the first 30.
I then want it to "click" on the link to each restaurant page ("restaurant level"), but this contains only truncated versions of the reviews, so I want it to then "click" down a further level (to "review level") and scrape the reviews from there, which appear 5 at a time with another "next" button. This is the only "level" from which I am extracting anything - the other levels just have links to access to get to the reviews and other info I want.
Most of this is working as I am getting all the information I want, but only for the first 5 reviews per restaurant. It is not "finding" the "next" button on the bottom "review level".
I have tried changing the order of commands within the parse method, but other than that I am coming up short of ideas! My XPaths are fine, so it must be something to do with the structure of the spider.
My spider looks thus:
import scrapy
from scrapy.http import Request

class TripSpider(scrapy.Spider):
    name = 'tripadvisor'
    allowed_domains = ['tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g187069-Manchester_Greater_Manchester_England.html']
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        # 'DEPTH_LIMIT': 3,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 0.5,
        'USER_AGENT': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        # 'DEPTH_PRIORITY': 1,
        # 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        # 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

    def scrape_review(self, response):
        restaurant_name_review = response.xpath('//div[@class="wrap"]//span[@class="taLnk "]//text()').extract()
        reviewer_name = response.xpath('//div[@class="username mo"]//text()').extract()
        review_rating = response.xpath('//div[@class="wrap"]/div[@class="rating reviewItemInline"]/span[starts-with(@class,"ui_bubble_rating")]').extract()
        review_title = response.xpath('//div[@class="wrap"]//span[@class="noQuotes"]//text()').extract()
        full_reviews = response.xpath('//div[@class="wrap"]/div[@class="prw_rup prw_reviews_text_summary_hsx"]/div[@class="entry"]/p').extract()
        review_date = response.xpath('//div[@class="prw_rup prw_reviews_stay_date_hsx"]/text()[not(parent::script)]').extract()
        restaurant_name = response.xpath('//div[@id="listing_main_sur"]//a[@class="HEADING"]//text()').extract() * len(full_reviews)
        restaurant_rating = response.xpath('//div[@class="userRating"]//@alt').extract() * len(full_reviews)
        restaurant_review_count = response.xpath('//div[@class="userRating"]//a//text()').extract() * len(full_reviews)
        for rvn, rvr, rvt, fr, rd, rn, rr, rvc in zip(reviewer_name, review_rating, review_title, full_reviews, review_date, restaurant_name, restaurant_rating, restaurant_review_count):
            reviews_dict = dict(zip(['reviewer_name', 'review_rating', 'review_title', 'full_reviews', 'review_date', 'restaurant_name', 'restaurant_rating', 'restaurant_review_count'], (rvn, rvr, rvt, fr, rd, rn, rr, rvc)))
            yield reviews_dict
            # print(reviews_dict)

    def parse(self, response):
        ### The parse method is what is actually being repeated / iterated
        for review in self.scrape_review(response):
            yield review
            # print(review)

        # access next page of restaurants
        next_page_restaurants = response.xpath('//a[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract_first()
        next_page_restaurants_url = response.urljoin(next_page_restaurants)
        yield Request(next_page_restaurants_url)
        print(next_page_restaurants_url)

        # access next page of reviews
        next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
        next_page_reviews_url = response.urljoin(next_page_reviews)
        yield Request(next_page_reviews_url)
        print(next_page_reviews_url)

        # access each restaurant page:
        url = response.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div/div/div/div/a[@target="_blank"]/@href').extract()
        for url_next in url:
            url_full = response.urljoin(url_next)
            yield Request(url_full)

        # access the first review to get to the full reviews (not the truncated versions)
        first_review = response.xpath('//a[@class="title "]/@href').extract_first()  # extract_first used as I only want to access one of the links on this page to get down to "review level"
        first_review_full = response.urljoin(first_review)
        yield Request(first_review_full)
        # print(first_review_full)
You are missing a space at the end of the class value:
Try this:
next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
Here are some tips on matching classes partially: https://docs.scrapy.org/en/latest/topics/selectors.html#when-querying-by-class-consider-using-css
On a side note, you can define separate parse functions to make it clearer what each one is responsible for: https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=callback#more-examples-and-patterns
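Putting those two tips together, a sketch of a dedicated callback for the review pages could look like the following; the parse_reviews name and the CSS selector are illustrative, not taken from your spider, and the method would go inside the TripSpider class:

    def parse_reviews(self, response):
        # Scrape the reviews on this "review level" page.
        for review in self.scrape_review(response):
            yield review

        # Match the "next" link by its individual classes rather than the exact
        # class string, so a trailing space in the class attribute no longer matters.
        next_page_reviews = response.css('a.nav.next.taLnk::attr(href)').extract_first()
        if next_page_reviews:
            yield Request(response.urljoin(next_page_reviews), callback=self.parse_reviews)

Requests for the review pages (for example the first_review_full request in parse) would then be yielded with callback=self.parse_reviews so the pagination is handled there.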
Goal: use Selenium and Python to search for a company name in LinkedIn's search bar, THEN click on the "Companies" button in the navigation to arrive at information about companies that are similar to the keyword (rather than individuals at that company). See below for an example. "CalSTRS" is the company I search for in the search bar. Then I want to click on the "Companies" navigation button.
My Helper Functions: I have defined the following helper functions (including here for reproducibility).
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from random import randint
from selenium.webdriver.common.action_chains import ActionChains

browser = webdriver.Chrome()

def li_bot_login(usrnm, pwrd):
    ##-----Log into linkedin and get to your feed-----
    browser.get('https://www.linkedin.com')
    ##-----Find the Search Bar-----
    u = browser.find_element_by_name('session_key')
    ##-----Enter Username and Password, Enter-----
    u.send_keys(usrnm)
    p = browser.find_element_by_name('session_password')
    p.send_keys(pwrd + Keys.ENTER)

def li_bot_search(search_term):
    #------Search for term in linkedin search box and land you at the search results page------
    search_box = browser.find_element_by_css_selector('artdeco-typeahead-input.ember-view > input')
    search_box.send_keys(str(search_term) + Keys.ENTER)

def li_bot_close():
    ##-----Close the Webdriver-----
    browser.close()

li_bot_login()
li_bot_search('calstrs')
time.sleep(5)
li_bot_close()
Here is the HTML of the "Companies" button element:
<button data-vertical="COMPANIES" data-ember-action="" data-ember-action-7255="7255">
Companies
</button>
And the XPath:
//*[#id="ember1202"]/div[5]/div[3]/div[1]/ul/li[5]/button
What I have tried: Admittedly, I am not very experienced with HTML and CSS so I am probably missing something obvious. Clearly, I am not selecting / interacting with the right element. So far, I have tried...
companies_btn = browser.find_element_by_link_text('Companies')
companies_btn.click()
which returns this traceback:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"link text","selector":"Companies"}
(Session info: chrome=63.0.3239.132)
(Driver info: chromedriver=2.35.528161 (5b82f2d2aae0ca24b877009200ced9065a772e73),platform=Windows NT 10.0.16299 x86_64)
and
companies_btn_xpath = '//*[@id="ember1202"]/div[5]/div[3]/div[1]/ul/li[5]/button'
browser.find_element_by_xpath(companies_btn_xpath).click()
with this traceback...
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[#id="ember1202"]/div[5]/div[3]/div[1]/ul/li[5]/button"}
(Session info: chrome=63.0.3239.132)
(Driver info: chromedriver=2.35.528161 (5b82f2d2aae0ca24b877009200ced9065a772e73),platform=Windows NT 10.0.16299 x86_64)
and
browser.find_element_by_css_selector('#ember1202 > div.application-outlet > div.authentication-outlet > div.neptune-grid.two-column > ul > li:nth-child(5) > button').click()
which returns this traceback...
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"#ember1202 > div.application-outlet > div.authentication-outlet > div.neptune-grid.two-column > ul > li:nth-child(5) > button"}
(Session info: chrome=63.0.3239.132)
(Driver info: chromedriver=2.35.528161 (5b82f2d2aae0ca24b877009200ced9065a772e73),platform=Windows NT 10.0.16299 x86_64)
It seems that you simply used incorrect selectors.
Note that:
The id of a div like "ember1002" is a dynamic value, so it will be different each time you visit the page: "ember1920", "ember1202", etc.
find_element_by_link_text() can be applied to links only, e.g. <a>Companies</a>, not to buttons.
Try to find the button by its text content:
browser.find_element_by_xpath('//button[normalize-space()="Companies"]').click()
With capybara-py (which can be used to drive Selenium), this is as easy as:
page.click_button("Companies")
Bonus: This will be resilient against changes in the implementation of the button, e.g., using <input type="submit" />, etc. It will also be resilient in the face of a delay before the button appears, as click_button() will wait for it to be visible and enabled.
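Staying with plain Selenium, a similar effect can be achieved with an explicit wait around the same text-based XPath; this is only a sketch, reusing the browser variable from the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

# Wait up to 10 seconds for the "Companies" button to become clickable, then click it.
# The text-based XPath avoids relying on the dynamic ember id.
companies_btn = WebDriverWait(browser, 10).until(
    expected_conditions.element_to_be_clickable(
        (By.XPATH, '//button[normalize-space()="Companies"]')
    )
)
companies_btn.click()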
I can't seem to log in to my university website using Python's requests.session(). I have tried retrieving all the headers and cookies needed to log in, but it does not successfully log in with my credentials. It does not show any error, but the source code I review after it is supposed to have logged in shows that it is still not logged in.
All my code is below. I fill the login and password with my credentials, but the rest is the exact code.
import requests

with requests.session() as r:
    url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"
    page = r.get(url)
    aspsessionid = r.cookies["ASPSESSIONID"]
    ouacapply1 = r.cookies["OUACApply1"]

    LOGIN = ""
    PASSWORD = ""
    login_data = dict(ASPSESSIONID=aspsessionid, OUACApply1=ouacapply1, login=LOGIN, password=PASSWORD)

    header = {"Referer": "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
    logged_in = r.post(url, data=login_data, headers=header)

    new_page = r.get(url="https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/profile/")
    plain_text = new_page.text
    print(plain_text)
You are missing two inputs which need to be posted:
name='submitButton', value='Log In'
name='csrf', value=''
The value of the second one keeps changing, so you need to fetch it dynamically.
If you want to see where this input is, go to the form's closing tag; just above it you will find a hidden input element.
So include these two values in your login_data and you will be able to log in.
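As a sketch of how that could look in the code from the question (using BeautifulSoup to pull the hidden csrf value out of the login page is my choice here, not something prescribed by the site; the input names come from the answer above):

import requests
from bs4 import BeautifulSoup

LOGIN = ""
PASSWORD = ""
url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"

with requests.session() as r:
    # Load the login page first; the session keeps its cookies automatically,
    # so they do not need to be re-posted in the form data.
    page = r.get(url)

    # Grab the hidden csrf input from the login form.
    soup = BeautifulSoup(page.text, "html.parser")
    csrf = soup.find("input", {"name": "csrf"})["value"]

    login_data = {
        "login": LOGIN,
        "password": PASSWORD,
        "csrf": csrf,
        "submitButton": "Log In",
    }
    header = {"Referer": url, "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
    logged_in = r.post(url, data=login_data, headers=header)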