Python, Selenium, Pandas DataFrame and Excel

I am having trouble piecing together the last part of a puzzle. The entire code is shown below, which includes a non-essential username and password to a site where I am scraping data.
After looping through part numbers from an Excel file loaded with
pd.read_excel()
Selenium is used to scrape various items from the website in question; the code then prints these values to the output window successfully.
Instead of writing the data to the output window, I aim to write it back to the same Excel file I am pulling data from, into the appropriate columns.
In the final for loop of the code, I initially tried to write the variables (which were printing to the screen) to Excel by appending
.to_excel('filePathHere')
to the variable in question. As an example, I attempted
description.to_excel('pathToFile/output.xlsx')
which yielded the error EOL while scanning string literal (<string>, line 1).
I then thought this variable might need to be converted to a DataFrame first, so I tried
description_DataFrame = pd.DataFrame(description)
description_DataFrame.to_excel('pathToFile/output.xlsx')
which resulted in the same error message.
I am not even sure if this is the correct logic to write each item to the existing (or new) file. If it is, I found an explanation on how to deal with long strings here: StackOverFlow EOL Error, but none of my data consists of long strings, so I can't see how that applies.
I then started to think I might need to create a dictionary and then append to it.
So I then removed any attempts from above and tried:
description = []
description.append(mfg_part)
mfg_part.to_excel('pathToFile/output.xlsx')
which still gives me the same EOL error.
I am not too sure what is wrong, or why I can't write the variables mfg_part, mfg_OEM, and description to their respective columns in the loaded Excel file.
Any hints / tips would be greatly appreciated.
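For clarity, the end result I am after is roughly this pattern: collect one row per part inside the scraping loop, then build a DataFrame and write it to Excel once at the end. A minimal sketch of what I mean (the column names and the output path are only placeholders):
import pandas as pd
rows = []
# inside the scraping loop, after each part is scraped:
# rows.append({'Part #': mfg_id, 'Company': mfg_OEM,
#              'Description': description, 'Price': price})
# after the loop, build a DataFrame and write it out once
results = pd.DataFrame(rows, columns=['Part #', 'Company', 'Description', 'Price'])
results.to_excel('pathToFile/output.xlsx', index=False)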
The complete working code, printing to the screen, is as follows:
import time
#Need Selenium for interacting with web elements
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
#Need numpy/pandas to interact with large datasets
import numpy as np
import pandas as pd
import itertools
# load in manufacture part number from a collection of components, via an Excel file
mfg_id_list = pd.read_excel("C:/Users/James/Documents/Python Scripts/jupyterNoteBooks/ScrapingData/MasterQuoteTemplate.xls")['Model']
# Create a dictionary to store product and price
# While the below works just fine, we want to create an empty pandas dataframe, so we can output to Excel later
productInfo = {}
chrome_path = r"C:\Users\James\Documents\Python Scripts\jupyterNoteBooks\ScrapingData\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.maximize_window()
driver.get("https://www.tessco.com/login")
userName = "FirstName.SurName321123@gmail.com"
password = "PasswordForThis123"
#Set a wait, for elements to load into the DOM
wait10 = WebDriverWait(driver, 10)
wait20 = WebDriverWait(driver, 20)
wait30 = WebDriverWait(driver, 30)
elem = wait10.until(EC.element_to_be_clickable((By.ID, "userID")))
elem.send_keys(userName)
elem = wait10.until(EC.element_to_be_clickable((By.ID, "password")))
elem.send_keys(password)
#Press the login button
driver.find_element_by_xpath("/html/body/account-login/div/div[1]/form/div[6]/div/button").click()
for i in mfg_id_list:
    #Expand the search bar
    searchBar = wait10.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#searchBar input")))
    #Enter information into the search bar
    #If cell is not blank
    if len(str(i)) != 0:
        searchBar.send_keys(Keys.CONTROL, 'a')
        searchBar.send_keys(i)
        driver.find_element_by_css_selector('a.inputButton').click()
        time.sleep(5)
        try:
            # wait for the products information to be loaded
            products = wait10.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='CoveoResult']")))
            #isProductsThere = driver.find_element_by_xpath("//div[@class='CoveoResult']")
            if products:
                # iterate through all products in the search result and add details to dictionary
                for product in products:
                    # get product info such as OEM, Description and Part Number
                    productDescr = product.find_element_by_xpath(".//a[@class='productName CoveoResultLink hidden-xs']").text
                    mfgPart = product.find_element_by_xpath(".//ul[@class='unlisted info']").text.split('\n')[3]
                    mfgName = product.find_element_by_tag_name("img").get_attribute("alt")
                    # There are multiple classes, some are "class sale" or else.
                    #We will locate by CSS
                    price = product.find_element_by_css_selector("div.price").text.split('\n')[1]
                    # add details to dictionary
                    productInfo[mfgPart, mfgName, productDescr] = price
                # prints the searched products information
                for (mfg_part, mfg_OEM, description), price in productInfo.items():
                    mfg_id = mfg_part.split(': ')[1]
                    if mfg_id == i:
                        #Here is where I would write to an Excel file
                        #And where I made attempts as described above
                        print('________________________________________________')
                        print('Part #:', mfg_id)
                        print('Company:', mfg_OEM)
                        print('Description:', description)
                        print('Price:', price)
                        print('________________________________________________')
                        #time.sleep(5)
                        #driver.close()
            else:
                mfg_id = "Not on Tessco"
                mfg_OEM = "Not on Tessco"
                description = "Not on Tessco"
                price = "Not on Tessco"
                #driver.close()
                print("Item was not found on Tessco.com")
        except Exception as e:
            print('________________________________________________')
            print(e)
            mfg_id = "Not on Tessco"
            mfg_OEM = "Not on Tessco"
            description = "Not on Tessco"
            price = "Not on Tessco"
            #driver.close()
            print("Item was not found on Tessco.com")
            print('________________________________________________')
driver.close()

Related

Stuck in loop <> Code doesn't want to pull anything except row 1

I am stuck in a loop and I don't know what to change to make my code work normally.
The problem is with the CSV file: my file contains a list of domains (freedommortgage.com, google.com, amd.com, etc.), so when I run the code everything is fine at the start, but then it keeps sending me the same result over and over:
the monthly total visits to freedommortgage.com is 1.10M
So here is my code:
import csv
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib
from captcha2upload import CaptchaUpload
import time
# setting the firefox driver
# setting the firefox driver
def init_driver():
    driver = webdriver.Firefox(executable_path=r'C:\Users\muki\Desktop\similarweb_scrapper-master\geckodriver.exe')
    driver.implicitly_wait(10)
    return driver
# solving the captcha (with 2captcha.com)
def captcha_solver(driver):
    captcha_src = driver.find_element_by_id('recaptcha_challenge_image').get_attribute("src")
    urllib.urlretrieve(captcha_src, "captcha.jpg")
    captcha = CaptchaUpload("4cfd308fd703d40291a7e250d743ca84") # 2captcha API KEY
    captcha_answer = captcha.solve("captcha.jpg")
    wait = WebDriverWait(driver, 10)
    captcha_input_box = wait.until(
        EC.presence_of_element_located((By.ID, "recaptcha_response_field")))
    captcha_input_box.send_keys(captcha_answer)
    driver.implicitly_wait(10)
    captcha_input_box.submit()
# inputting the domain in similar web search box and finding necessary values
def lookup(driver, domain, short_method):
    # short method - inputting the domain in the url
    if short_method:
        driver.get("https://www.similarweb.com/website/" + domain)
    else:
        driver.get("https://www.similarweb.com")
    attempt = 0
    # trying 3 times before quitting (due to second refresh by the website that clears the search box)
    while attempt < 1:
        try:
            captcha_body_page = driver.find_elements_by_class_name("block-page")
            driver.implicitly_wait(10)
            if captcha_body_page:
                print("Captcha ahead, solving the captcha, it may take a few seconds")
                captcha_solver(driver)
                print("Captcha solved! the program will continue shortly")
                time.sleep(20) # to prevent second refresh affecting the upcoming elements finding after captcha solved
            # for normal method, inputting the domain in the searchbox instead of url
            if not short_method:
                input_element = driver.find_element_by_id("js-swSearch-input")
                input_element.click()
                input_element.send_keys(domain)
                input_element.submit()
            wait = WebDriverWait(driver, 10)
            time.sleep(10)
            total_visits = wait.until(
                EC.presence_of_element_located((By.XPATH, "//span[@class='engagementInfo-valueNumber js-countValue']")))
            total_visits_line = "the monthly total visits to %s is %s" % (domain, total_visits.text)
            time.sleep(10)
            print('\n' + total_visits_line)
        except TimeoutException:
            print("Box or Button or Element not found in similarweb while checking %s" % domain)
            attempt += 1
            print("attempt number %d... trying again" % attempt)
# main
if __name__ == "__main__":
    with open('bigdomains.csv', 'rt') as f:
        reader = csv.reader(f)
        driver = init_driver()
        for row in reader:
            domain = row[0]
            lookup(driver, domain, True) # user need to give as a parameter True or False, True will activate the
                                         # short method, False will take the normal method
(Sorry for the long block of code, but I have to present everything, even though the focus is on the LAST PART of the code.)
My question is simple:
Why does it keep taking the domain from row 1 and ignoring row 2, row 3, row 4, etc.?
The time delay has to be 10 or more to avoid the captcha issue on this website.
If anyone wants to try running this, you have to edit the name of the CSV file and put a few domains in it, in the format google.com (not www.google.com), of course.
Looks like you're always accessing the same index every time with:
domain = row[0]
Index 0 is the first item, hence why you keep getting the same value.
This post explains an alternative way to use a for loop in Python.
Accessing the index in 'for' loops?
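If the domains are comma-separated on a single CSV line (which the question's description suggests), reader yields only one row and row[0] is always the first domain. A small sketch that covers both layouts by walking every cell of every row, reusing init_driver and lookup from the question:
import csv
with open('bigdomains.csv', 'rt') as f:
    reader = csv.reader(f)
    driver = init_driver()
    for row in reader:        # one iteration per CSV line
        for domain in row:    # one iteration per cell on that line
            domain = domain.strip()
            if domain:        # skip empty trailing cells
                lookup(driver, domain, True)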

Web Scraping Python fails to load the url on button.click()

The CSV file contains the names of the countries used. However, after Argentina, it fails to retrieve the URL and returns an empty string.
country,country_url
Afghanistan,https://openaq.org/#/locations?parameters=pm25&countries=AF&_k=tomib2
Algeria,https://openaq.org/#/locations?parameters=pm25&countries=DZ&_k=dcc8ra
Andorra,https://openaq.org/#/locations?parameters=pm25&countries=AD&_k=crspt2
Antigua and Barbuda,https://openaq.org/#/locations?parameters=pm25&countries=AG&_k=l5x5he
Argentina,https://openaq.org/#/locations?parameters=pm25&countries=AR&_k=962zxt
Australia,
Austria,
Bahrain,
Bangladesh,
The country.csv looks like this:
Afghanistan,Algeria,Andorra,Antigua and Barbuda,Argentina,Australia,Austria,Bahrain,Bangladesh,Belgium,Bermuda,Bosnia and Herzegovina,Brazil,
The code used is:
driver = webdriver.Chrome(options = options, executable_path = driver_path)
url = 'https://openaq.org/#/locations?parameters=pm25&_k=ggmrvm'
driver.get(url)
time.sleep(2)
# This function opens .csv file that we created at the first stage
# .csv file includes names of countries
with open('1Countries.csv', newline='') as f:
    reader = csv.reader(f)
    list_of_countries = list(reader)
    list_of_countries = list_of_countries[0]
    print(list_of_countries) # printing a list of countries
# Let's create Data Frame of the country & country_url
df = pd.DataFrame(columns=['country', 'country_url'])
# With this function we are generating urls for each country page
for country in list_of_countries[:92]:
    try:
        path = ('//span[contains(text(),' + '\"' + country + '\"' + ')]')
        # "path" is used to filter each country on the website by
        # iterating country names.
        next_button = driver.find_element_by_xpath(path)
        next_button.click()
        # Using "button.click" we are get on the page of next country
        time.sleep(2)
        country_url = (driver.current_url)
        # "country_url" is used to get the url of the current page
        next_button.click()
    except:
        country_url = None
    d = [{'country': country, 'country_url': country_url}]
    df = df.append(d)
I've tried increasing the sleep time, but I'm not sure what is leading to this.
The challenge you face is that the country list is scrollable:
It's no coincidence that your code stops working once the countries are no longer displayed.
It's a relatively easy solution - you need to scroll the element into view. I've made a quick test with your code to confirm it's working. I removed the CSV part, hard-coded a country that's further down the list, and added the parts that make it scroll into view:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
def ScrollIntoView(element):
    actions = ActionChains(driver)
    actions.move_to_element(element).perform()
url = 'https://openaq.org/#/locations?parameters=pm25&_k=ggmrvm'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(10)
country = 'Bermuda'
path = ('//span[contains(text(),' + '\"' + country + '\"' + ')]')
next_button = driver.find_element_by_xpath(path)
ScrollIntoView(next_button) # added this
next_button.click()
time.sleep(2)
country_url = (driver.current_url)
print(country_url) # added this
next_button.click()
This is the output from the print:
https://openaq.org/#/locations?parameters=pm25&countries=BM&_k=7sp499
Are you happy to merge that into your solution? (Just say if you need more support.)
If it helps, a reason you didn't notice this for yourself is that the try was masking a NotInteractableException. Have a look at how to handle errors here.
try statements are great and useful - but it's also good to track when they occur so you can fix them later. Borrowing some code from that link, you can try something like this in your catch:
except:
    print("Unexpected error:", sys.exc_info()[0])  # requires "import sys" at the top of the script

Clicking links on the website to get contents in the bubbles with selenium

I'm trying to get the course information on http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext.
In my code, I tried to first click on each course, next get the description in the bubble, and then close the bubble as it may overlay on top of other course links.
My problem is that I couldn't get the description in the bubble, and some course links were still skipped even though I tried to avoid that by closing the bubble.
Any idea about how to do this? Thanks in advance!
info = []
driver = webdriver.Chrome()
driver.get('http://bulletin.iit.edu/graduate/colleges/science/applied-mathematics/master-data-science/#programrequirementstext')
for i in range(1,3):
    for j in range(2, 46):
        try:
            driver.find_element_by_xpath('//*[@id="programrequirementstextcontainer"]/table['+str(i)+']/tbody/tr['+str(j)+']/td[1]/a').click()
            info.append(driver.find_elements_by_xpath('/html/body/div[8]/div[3]/div/div')[0].text)
            driver.find_element_by_xpath('//*[@id="lfjsbubbleclose"]').click()
            time.sleep(3)
        except:
            pass
I'm not sure why you have put static ranges in the for loops, especially since not every combination of the i and j indices in your XPath finds an element on the page.
I would suggest finding all the course elements on the page with a single locator and looping through them to get the descriptions from the bubbles.
Use the code below:
course_list = driver.find_elements_by_css_selector("table.sc_courselist a.bubblelink.code")
wait = WebDriverWait(driver, 20)
for course in course_list:
    try:
        print("grabbing info of course : ", course.text)
        course.click()
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.courseblockdesc")))
        info.append(driver.find_element_by_css_selector('div.courseblockdesc>p').text)
        wait.until(EC.visibility_of_element_located((By.ID, "lfjsbubbleclose")))
        driver.find_element_by_id('lfjsbubbleclose').click()
    except:
        print("error while grabbing info")
print(info)
As it requires some time to load the content in the bubble, you should introduce an explicit wait in your script until the bubble content is completely visible, and then grab it.
Import the packages below to use the wait in the above code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Please note, this code grabs all of the course descriptions from the bubbles. Let me know if you are looking for something specific rather than all of them.
To load the bubble, the website makes an AJAX call, which you can reproduce directly with requests and parse with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
def course(course_code):
    data = {"page":"getcourse.rjs","code":course_code}
    res = requests.get("http://bulletin.iit.edu/ribbit/index.cgi", data=data)
    soup = BeautifulSoup(res.text,"lxml")
    result = {}
    result["description"] = soup.find("div", class_="courseblockdesc").text.strip()
    result["title"] = soup.find("div", class_="coursetitle").text.strip()
    return result
Output for course("CS 522")
{'description': 'Continued exploration of data mining algorithms. More sophisticated algorithms such as support vector machines will be studied in detail. Students will continuously study new contributions to the field. A large project will be required that encourages students to push the limits of existing data mining techniques.',
'title': 'Advanced Data Mining'}

Is it possible to move through an HTML Table and grab the data within w/ BeautifulSoup4?

So for a project, I'm working on creating an API to interface with my school's course finder, and I'm struggling to grab the data from the HTML table it is stored in without using Selenium. I was able to pull the HTML data initially using Selenium, but my instructor says he would prefer I use the BeautifulSoup4 and MechanicalSoup libraries. I got as far as submitting a search and grabbing the HTML table the data is stored in. I'm not sure how to iterate through the data stored in the HTML table as I did with my Selenium code below.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
Chrome_Options = Options()
Chrome_Options.add_argument("--headless") #allows program to run without opening a chrome window
driver = webdriver.Chrome()
driver.get("https://winnet.wartburg.edu/coursefinder/") #sets the Selenium driver
select = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Term"))
term_options = select.options
#for index in range(0, len(term_options) - 1):
#    select.select_by_index(index)
lst = []
DeptSelect = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Department"))
DeptSelect.select_by_visible_text("History") #finds the desired department
search = driver.find_element_by_name("ctl00$ContentPlaceHolder1$FormView1$Button_FindNow")
search.click() #sends query
table_id = driver.find_element_by_id("ctl00_ContentPlaceHolder1_GridView1")
rows = table_id.find_elements_by_tag_name("tr")
for row in rows: #creates a list of lists containing our data
    col_lst = []
    col = row.find_elements_by_tag_name("td")
    for data in col:
        lst.append(data.text)
def chunk(l, n): #generator that partitions our lists neatly
    print("chunking...")
    for i in range(0, len(l), n):
        yield l[i:i + n]
n = 16 #each list contains 16 items regardless of contents or search
uberlist = list(chunk(lst, n)) #call chunk fn to partition list
with open('class_data.txt', 'w') as handler: #output of scraped data
    print("writing file...")
    for listitem in uberlist:
        handler.write('%s\n' % listitem)
driver.close() #ends and closes Selenium control over the browser
This is my Soup code, and I'm wondering how I can take the data from the HTML in a similar way to what I did above with Selenium.
import mechanicalsoup
import requests
from lxml import html
from lxml import etree
import pandas as pd
def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')
#This Will Use Mechanical Soup to grab the Form, Submit it and find the Data Table
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
response1 = browser.submit_selected() #This Progresses to Second Form
dataURL = browser.get_url() #Get URL of Second Form w/ Data
dataURL2 = 'https://winnet.wartburg.edu/coursefinder/Results.aspx'
pageContent=requests.get(dataURL2)
tree = html.fromstring(pageContent.content)
dataTable = tree.xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView1"]')
rows = [] #initialize a collection of rows
for row in dataTable[0].xpath(".//tr")[1:]: #add new rows to the collection
    rows.append([cell.text_content().strip() for cell in row.xpath(".//td")])
df = pd.DataFrame(rows) #load the collection to a dataframe
print(df)
#XPath to Table
#//*[@id="ctl00_ContentPlaceHolder1_GridView1"]
#//*[@id="ctl00_ContentPlaceHolder1_GridView1"]/tbody
It turns out I was passing the wrong thing when using MechanicalSoup. I was able to pass the new page's contents to a variable called table and have it use .find('table') to retrieve the table HTML rather than the full page's HTML. From there I just used table.get_text().split('\n') to make essentially a giant list of all of the rows.
I also dabbled with setting form filters, which worked as well.
import mechanicalsoup
from bs4 import BeautifulSoup
#Sets StatefulBrowser Object to winnet then it grabs the form
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()
#Selects submit button and has filter options listed.
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$TextBox_keyword', "") #Keyword Searches by Class Title. Inputting string will search by that string ignoring any stored nonsense in the page.
#ACxxx Course Codes have 3 spaces after them, THIS IS REQUIRED. Except the All value for not searching by a Department does not.
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Department", 'All') #For Department List, it takes the CourseCodes as inputs and displays as the Full Name
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Term", "2020 Winter Term") # Term Dropdown takes a value that is a string. String is Exactly the Term date.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_MeetingTime', 'all') #Takes the Week Class Time as a String. Need to Retrieve list of options from pages
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_EssentialEd', 'none') #takes a small string signalling the EE req or 'all' or 'none'. 'none' doesn't select an option and 'all' selects all courses w/ an EE
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_CulturalDiversity', 'none')# Cultural Diversity, Takes none, C, D or all
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_WritingIntensive', 'none') # options are none or WI
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_PassFail', 'none')# Pass/Faill takes 'none' or 'PF'
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$CheckBox_OpenCourses', False) #Check Box, It's True or False
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_Instructor', '0')# 0 is for None Selected otherwise it is a string of numbers (Instructor ID?)
#Submits Page, Grabs results and then launches a browser for test purposes.
browser.submit_selected()# Submits Form. Retrieves Results.
table = browser.get_current_page().find('table') #Finds Result Table
print(type(table))
rows = table.get_text().split('\n') # List of all Class Rows split by \n.
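If you'd rather keep the rows and cells structured instead of one flat newline-split list, BeautifulSoup can walk the same table tag directly; a minimal sketch, assuming the table variable found above:
parsed_rows = []
for tr in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
    if cells:
        parsed_rows.append(cells)  # one list of cell strings per table row
for r in parsed_rows[:5]:  # preview the first few parsed rows
    print(r)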

Trouble getting around list index error

I've written a script to scrape Name and Price from Craigslist. It works smoothly until it finds that either of the values is None. As soon as it gets any None value it breaks, displaying: "list index out of range". How do I deal with that?
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
rows = tree.xpath('//li[@class="result-row"]')
for row in rows:
    link = row.xpath('.//a[contains(@class,"hdrlnk")]/text()')[0]
    price = row.xpath('.//span[@class="result-price"]/text()')[0]
    print (link,price)
This is by far the most efficient technique I've come across to avoid such errors:
import requests
from lxml import html
page = requests.get('http://bangalore.craigslist.co.in/search/rea?s=120').text
tree = html.fromstring(page)
def if_exist(row,xpath):
    docs=row.xpath(xpath)
    if docs:
        return docs[0]
    return ""
for row in tree.xpath('//li[@class="result-row"]'):
    link = if_exist(row,'.//a[contains(@class,"hdrlnk")]/text()')
    price = if_exist(row,'.//span[@class="result-price"]/text()')
    print (link,price)
