My web crawler gets stuck on the Dell website. Is it fixable?

My web crawler gets stuck on the Dell website. Is it fixable? - python-3.x

First off, I'm new to coding and Python. This is the first project that I came up with to try. When I run this code there are no errors and I get to the point of opening the Dell website, inputting the service tag and pressing the enter key. At that point the Dell website gives me a pop up that says "Please wait while we validate this action. Once validated, please submit your request again." and gives a 30 second countdown.
import openpyxl as xl
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
wb = xl.load_workbook('/home/user/Code/learning/inventory.xlsx')
sheet = wb['Sheet1']
for row in range(2, sheet.max_row + 1):
service_tag_cell = sheet.cell(row, 4).value
warranty_cell: str
warranty_cell = sheet.cell(row, 5).value
if service_tag_cell != '': # and warranty_cell == '':
driver = webdriver.Firefox()
driver.get('https://www.dell.com/support/home/en-us')
# Find the search box, enter the service tag number and press the enter key
driver.find_element_by_id('inpEntrySelection').send_keys(service_tag_cell + Keys.ENTER)
# Find warranty field
warranty_date = driver.find_element_by_class_name('warrantyExpiringLabel')
warranty_cell = warranty_date.value_of_css_property
driver.close()
sheet.cell(row, 5).value = warranty_cell
wb.save('inventory2.xlsx')
I've tried searching google to understand what prompts this message from Dell. I get the sense it just doesn't want bots like mine searching their website. But is the message a result of my poor implementation that could be corrected? Or is my goal of taking a spreadsheet of service tags and returning the expiration date dead in water?

If they're using methods of detected automated actions then you'll be playing cat-and-mouse.
I can suggest that you try setting a random User-Agent with a library like Random User Agents:
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName
user_agent_rotator = UserAgent(software_names=[SoftwareName.CHROME.value], limit=100)
user_agent = user_agent_rotator.get_random_user_agent()
options.add_argument(f'user-agent={user_agent}')
driver = webdriver.Chrome(chrome_options=options)
But if that doesn't work, there's plenty of other ways in which Selenium can be detected. This and This have good information that may helpful.

Related

Find the Twitter text box element with python selenium

I made my own Twitter complaint bot that tweets at my ISP if the network drops.
Code works perfect, until it has to find the Twitter textbox to type the tweet.
Main error is:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
I have tried:
Adding time delays
Using Firefox Driver instead of Google
Adding page refreshes before the tweet_at_provider() looks for the textbox
Clicking the "Tweet" button to bring up the textbox to then try type in it
Using find.element_by_id but twitter changes id every pageload
When I comment out the first function call to test, it will find and type 6/10 times.
But when both functions are called the tweet_at_provider() always fails at grabbing the textbox and I get the StaleElement error.
import selenium, time, pyautogui
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException, StaleElementReferenceException
PROMISED_DOWN = 200
PROMISED_UP = 10
CHROME_DRIVER_PATH = "C:\Development\chromedriver.exe"
GECKODRIVER_PATH = "C:\\Users\\meeha\\Desktop\\geckodriver\\geckodriver.exe"
TWITTER_USERNAME = "my_username"
TWITTER_PASSWORD = "my_password"
class InternetSpeedTwitterBot():
def __init__(self, driver_path):
self.driver = webdriver.Chrome(executable_path=driver_path)
self.down = 0
self.up = 0
def get_internet_speed(self):
self.driver.get("https://www.speedtest.net/")
self.driver.maximize_window()
time.sleep(2)
go = self.driver.find_element_by_xpath("//*[#id='container']/div/div[3]/div/div/div/div[2]/div[3]/div[1]/a/span[4]")
go.click()
time.sleep(40)
self.down = self.driver.find_element_by_xpath("//*[#id='container']/div/div[3]/div/div/div/div[2]/div[3]/div[3]/div/div[3]/div/div/div[2]/div[1]/div[2]/div/div[2]/span")
self.up = self.driver.find_element_by_xpath("//*[#id='container']/div/div[3]/div/div/div/div[2]/div[3]/div[3]/div/div[3]/div/div/div[2]/div[1]/div[3]/div/div[2]/span")
print(f"Download Speed: {self.down.text} Mbps")
print(f"Upload Speed: {self.up.text} Mbps")
time.sleep(3)
def tweet_at_provider(self):
self.driver.get("https://twitter.com/login")
self.driver.maximize_window()
time.sleep(3)
username = self.driver.find_element_by_name("session[username_or_email]")
password = self.driver.find_element_by_name("session[password]")
username.send_keys(TWITTER_USERNAME)
password.send_keys(TWITTER_PASSWORD)
password.submit()
time.sleep(5)
tweet_compose = self.driver.find_element_by_xpath('//*[#id="react-root"]/div/div/div[2]/header/div/div/div/div[1]/div[3]/a/div/span/div/div/span/span')
tweet_compose.click()
time.sleep(2)
textbox = self.driver.find_element_by_xpath('//*[#id="layers"]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div[3]/div/div/div/div[1]/div/div/div/div/div[2]/div[1]/div/div/div/div/div/div/div/div/div/div[1]/div/div/div/div[2]/div/div/div/div')
textbox.send_keys(f"Hey #Ask_Spectrum, why is my internet speed {self.down.text} down / {self.up.text} up when I pay for {PROMISED_DOWN} down / {PROMISED_UP} up???")
bot = InternetSpeedTwitterBot(CHROME_DRIVER_PATH)
bot.get_internet_speed()
bot.tweet_at_provider()

I had the same error there and figured out that HTML tag was instantly changing as soon as I was typing something on the twitter text-box.
tackle this problem using XPATH of span tag that was showing up after typing space from my side. break tag is the initial tag when there is not any text prompted by you, only after you type anything turns into and that's when you have to copy XPATH and use it for your application

Stuck in loop <> Code doesn't want to pull anything except row 1

I am stuck in loop, I don't know what to change to make my code work normally...
problem is with CSV file, my file contains list of domains (freedommortgage.com, google.com, amd.com etc.) so when I run code, everything is fine at start, but then it keeps sending me same results all over:
the monthly total visits to freedommortgage.com is 1.10M
So here is my line:
import csv
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib
from captcha2upload import CaptchaUpload
import time
# setting the firefox driver
def init_driver():
driver = webdriver.Firefox(executable_path=r'C:\Users\muki\Desktop\similarweb_scrapper-master\geckodriver.exe')
driver.implicitly_wait(10)
return driver
# solving the captcha (with 2captcha.com)
def captcha_solver(driver):
captcha_src = driver.find_element_by_id('recaptcha_challenge_image').get_attribute("src")
urllib.urlretrieve(captcha_src, "captcha.jpg")
captcha = CaptchaUpload("4cfd308fd703d40291a7e250d743ca84") # 2captcha API KEY
captcha_answer = captcha.solve("captcha.jpg")
wait = WebDriverWait(driver, 10)
captcha_input_box = wait.until(
EC.presence_of_element_located((By.ID, "recaptcha_response_field")))
captcha_input_box.send_keys(captcha_answer)
driver.implicitly_wait(10)
captcha_input_box.submit()
# inputting the domain in similar web search box and finding necessary values
def lookup(driver, domain, short_method):
# short method - inputting the domain in the url
if short_method:
driver.get("https://www.similarweb.com/website/" + domain)
else:
driver.get("https://www.similarweb.com")
attempt = 0
# trying 3 times before quiting (due to second refresh by the website that clears the search box)
while attempt < 1:
try:
captcha_body_page = driver.find_elements_by_class_name("block-page")
driver.implicitly_wait(10)
if captcha_body_page:
print("Captcha ahead, solving the captcha, it may take a few seconds")
captcha_solver(driver)
print("Captcha solved! the program will continue shortly")
time.sleep(20) # to prevent second refresh affecting the upcoming elements finding after captcha solved
# for normal method, inputting the domain in the searchbox instead of url
if not short_method:
input_element = driver.find_element_by_id("js-swSearch-input")
input_element.click()
input_element.send_keys(domain)
input_element.submit()
wait = WebDriverWait(driver, 10)
time.sleep(10)
total_visits = wait.until(
EC.presence_of_element_located((By.XPATH, "//span[#class='engagementInfo-valueNumber js-countValue']")))
total_visits_line = "the monthly total visits to %s is %s" % (domain, total_visits.text)
time.sleep(10)
print('\n' + total_visits_line)
except TimeoutException:
print("Box or Button or Element not found in similarweb while checking %s" % domain)
attempt += 1
print("attempt number %d... trying again" % attempt)
# main
if __name__ == "__main__":
with open('bigdomains.csv', 'rt') as f:
reader = csv.reader(f)
driver = init_driver()
for row in reader:
domain = row[0]
lookup(driver, domain, True) # user need to give as a parameter True or False, True will activate the
# short method, False will take the normal method
(Sorry for the long line of code, but I have to present everything, even tho focus is on the LAST PART of the code)
My question is simple:
Why does it keep taking row number 1 domain, and ignoring the row2 row3 row4, etc...?
Time = delay has to be 10, or more, to avoid captcha issue on this website
if anyone would try to run this, you have to edit name of csv file, and to have few domains in it in format google.com (not www.google.com) of course.

Looks like you're always accessing the same index everytime with:
domain = row[0]
Index 0 is the first item, hence why you keep getting the same value.
This post explains an alternative way to use a for loop in Python.
Accessing the index in 'for' loops?

Clicking links on the website to get contents in the bubbles with selenium

I'm trying to get the course information on http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext.
In my code, I tried to first click on each course, next get the description in the bubble, and then close the bubble as it may overlay on top of other course links.
My problem is that I couldn't get the description in the bubble and some course links were still skipped though I tried to avoid it by closing the bubble.
Any idea about how to do this? Thanks in advance!
info = []
driver = webdriver.Chrome()
driver.get('http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext')
for i in range(1,3):
for j in range(2, 46):
try:
driver.find_element_by_xpath('//*[#id="programrequirementstextcontainer"]/table['+str(i)+']/tbody/tr['+str(j)+']/td[1]/a').click()
info.append(driver.find_elements_by_xpath('/html/body/div[8]/div[3]/div/div')[0].text)
driver.find_element_by_xpath('//*[#id="lfjsbubbleclose"]').click()
time.sleep(3)
except: pass
[1]: http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext

Not sure why you have put static range in for loop even though all the combinations of i and j index count in your xpath doesn't find any element on your application.
I would suggest better to go with finding all element on your webpage using single locator and loop trough to get descriptions from bubble.
Use below code:
course_list = driver.find_elements_by_css_selector("table.sc_courselist a.bubblelink.code")
wait = WebDriverWait(driver, 20)
for course in course_list:
try:
print("grabbing info of course : ", course.text)
course.click()
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.courseblockdesc")))
info.append(driver.find_element_by_css_selector('div.courseblockdesc>p').text)
wait.until(EC.visibility_of_element_located((By.ID, "lfjsbubbleclose")))
driver.find_element_by_id('lfjsbubbleclose').click()
except:
print("error while grabbing info")
print(info)
As it require some time to load the content in bubble so you should introduce explicit wait in your script until bubble content get completely visible and then grab it.
import below package for using wait in above code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Please note, this code grab all the courses description from bubble. Let me know if you are looking for some specific not all.

To load the bubble, the website makes an ajax call.
import requests
from bs4 import BeautifulSoup
def course(course_code):
data = {"page":"getcourse.rjs","code":course_code}
res = requests.get("http://bulletin.iit.edu/ribbit/index.cgi", data=data)
soup = BeautifulSoup(res.text,"lxml")
result = {}
result["description"] = soup.find("div", class_="courseblockdesc").text.strip()
result["title"] = soup.find("div", class_="coursetitle").text.strip()
return result
Output for course("CS 522")
{'description': 'Continued exploration of data mining algorithms. More sophisticated algorithms such as support vector machines will be studied in detail. Students will continuously study new contributions to the field. A large project will be required that encourages students to push the limits of existing data mining techniques.',
'title': 'Advanced Data Mining'}```

Button click works inconsistently (hangs) on Chrome Webdriver while executing from Python Selenium

I am using Selenium to automate a few browser actions on a particular website and I am using the below set of tools to achieve it.
Python 3.8
Selenium
Chrome Web Driver 79.0
Chrome 79.0
The tasks that I do are just fill up a form and then click on the submit button in the form. And this works most of the time except that sometimes it just won't no matter what! The filling up of form is very smooth and then, when the click on the submit button happens then Chrome just hangs there forever. There's no error on the console whatsoever and googling this issue I see this is such a common occurrence and almost all of the solution out there are just workarounds and not actual fixes. And I have tried almost all of them but to no avail. There's an issue on the selenium GitHub page as well which the maintainers weren't interested in much and closed it. How do I even go about resolving this issue. At this moment I am out of ideas actually. Any help would be appreciated. Thanks.
Below is the source code that I am trying to execute.
import time
from selenium import webdriver
import os
import csv
from datetime import datetime
url = 'https://www.nseindia.com/products/content/equities/equities/eq_security.htm'
xpath_get_data_button = '//*[#id="get"]'
xpath_download_link = '/html/body/div[2]/div[3]/div[2]/div[1]/div[3]/div/div[3]/div[1]/span[2]/a'
nse_list_file = 'nse_list.csv'
wait_time = 5
wait_time_long = 10
start_year = 2000
stop_year = 2019
curr_year = start_year
browser = webdriver.Chrome("E:/software/chromedriver_win32/chromedriver.exe")
browser.get(url)
time.sleep(wait_time)
with open(nse_list_file, 'r') as file:
reader = csv.reader(file)
for row in reader:
scrip = row[0]
year_registered = datetime.strptime(row[1], '%d-%m-%Y').year
if year_registered > start_year:
curr_year = year_registered
else:
curr_year = start_year
try:
browser.find_element_by_class_name('reporttitle1').clear()
browser.find_element_by_class_name('reporttitle1').send_keys(scrip)
browser.find_element_by_id('rdDateToDate').click()
while curr_year <= stop_year:
from_date = '01-01-' + str(curr_year)
to_date = '31-12-' + str(curr_year)
browser.find_element_by_id('fromDate').clear()
browser.find_element_by_id('fromDate').send_keys(from_date)
browser.find_element_by_id('toDate').clear()
browser.find_element_by_id('toDate').send_keys(to_date)
time.sleep(wait_time)
browser.find_element_by_xpath(xpath_get_data_button).click()
time.sleep(wait_time_long)
download_link_element = browser.find_element_by_xpath(xpath_download_link).click()
curr_year = curr_year + 1
except Exception as ex:
print('Could not find download link')
print(str(ex))
if os.path.isfile("stop_loading"):
break
browser.quit()
print('DONE')

selenium - clicking a button

I am trying to pull out the names of all courses offered by Lynda.com together with the subject so that it appears on my list as '2D Drawing -- Project Soane: Recover a Lost Monument with BIM with Paul F. Aubin'. So I am trying to write a script that will go to each subject on http://www.lynda.com/sitemap/categories and pull out the list of courses. I already managed to get Selenium to go from one subject to another and pull the courses. My only problem is that there is a button 'See X more courses' to see the rest of the courses. Sometimes you have to click it couple of times that´s why I used while loop. But selenium doesn´t seem to execute this click. Does anyone know why?
This is my code:
from selenium import webdriver
url = 'http://www.lynda.com/sitemap/categories'
mydriver = webdriver.Chrome()
mydriver.get(url)
course_list = []
for a in [1,2,3]:
for b in range(1,73):
mydriver.find_element_by_xpath('//*[#id="main-content"]/div[2]/div[3]/div[%d]/ul/li[%d]/a' % (a,b)).click()
while True:
#click the button 'See more results' as long as it´s available
try:
mydriver.find_element_by_xpath('//*[#id="main-content"]/div[1]/div[3]/button').click()
except:
break
subject = mydriver.find_element_by_tag_name('h1') # pull out the subject
courses = mydriver.find_elements_by_tag_name('h3') # pull out the courses
for course in courses:
course_list.append(str(subject.text)+" -- " + str(course.text))
# go back to the initial site
mydriver.get(url)

Scroll to element before clicking:
see_more_results = browser.find_element_by_css_selector('button[class*=see-more-results]')
browser.execute_script('return arguments[0].scrollIntoView()', see_more_results)
see_more_results.click()
One solution how to repeat these actions could be:
def get_number_of_courses():
return len(browser.find_elements_by_css_selector('.course-list > li'))
number_of_courses = get_number_of_courses()
while True:
try:
button = browser.find_element_by_css_selector(CSS_SELECTOR)
browser.execute_script('return arguments[0].scrollIntoView()', button)
button.click()
while True:
new_number_of_courses = get_number_of_courses()
if (new_number_of_courses > number_of_courses):
number_of_courses = new_number_of_courses
break
except:
break
Caveat: it's always better to use build-in explicit wait than while True:
http://www.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-waits

The problem is that you're calling a method to find element by class name, but you're passing a xpath. if you're sure this is the correct xpath you'll simply need to change to method to 'find_element_by_xpath'.
A recommendation if you allow: Try to stay away from these long xpaths and go through some tutorials on how to write efficient xpath for example.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

My web crawler gets stuck on the Dell website. Is it fixable? - python-3.x

Related

Find the Twitter text box element with python selenium

Stuck in loop <> Code doesn't want to pull anything except row 1

Clicking links on the website to get contents in the bubbles with selenium

Button click works inconsistently (hangs) on Chrome Webdriver while executing from Python Selenium

selenium - clicking a button

Categories

Resources