Scraping a dynamic website (CSS?) with Python - python-3.x

I want to know if an item is available in the local library. I can see this in the catalog as a green icon (available) or a red icon (loaned out/not available).
First I tried plain BeautifulSoup; this is the Python code I tried:
try:
    import urllib.request as urllib2  # Python 3
except ImportError:
    import urllib2  # Python 2 fallback
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://zoeken.mol.bibliotheek.be/?itemid=|library/marc/vlacc|9394694&undup=false")
soup = BeautifulSoup(page, "html.parser")
bal = soup.find(class_="avail-icon")
print(bal)
But while inspecting the element in Firefox gives:
<span class="avail-icon">
<i class="circle-icon avail-icon-none"></i>
<span class="hidden-text"></span>
</span>
class="circle-icon avail-icon-none" means the item is available (shows green icon on webpage),
class="circle-icon avail-icon-loanedout" means the item is loaned out (shows red icon on webpage).
I got:
<span class="avail-icon">
<i class="circle-icon avail-icon-loading"></i>
<span class="hidden-text">Toon beschikbaarheid voor</span>
</span>
class="circle-icon avail-icon-loading" means dynamic, I asume, so after some searching I found Selenium.
I tried the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://zoeken.mol.bibliotheek.be/?itemid=|library/marc/vlacc|9394694&undup=false")
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
bal = soup.find(class_="avail-icon")
print(bal)
Sadly, this gives me:
<span class="avail-icon">
<i class="circle-icon avail-icon-unknown"></i>
<span class="hidden-text">Toon beschikbaarheid voor</span>
</span>
Maybe I wasn't waiting long enough before Selenium grabbed the page source, so after some searching I changed the code to:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.implicitly_wait(10) # seconds
driver.get("http://zoeken.mol.bibliotheek.be/?itemid=|library/marc/vlacc|9394694&undup=false")
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
bal = soup.find(class_="avail-icon")
print(bal)
Still the same result: class="circle-icon avail-icon-unknown" isn't what I'm looking for, and I'm now out of ideas. Can someone throw me a hint?
PS: Maybe an idea, but I don't know how to do it:
In Firefox's element inspector, the right pane has a column called rules (Dutch: regels). The red and green icons are loaded from a single .png sprite (icon-sprite.png); select the red/green icon to see what I mean.
background-position: -48px -16px; means available (green icon)
background-position: 0px -32px; means not available (red icon)
Can I somehow test for this? (One possible way is sketched right after PS2 below.)
PS2: I'm a novice programmer (skill level = low).
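One possible way to test the background-position idea directly, sketched here with no guarantee it works against this site (the XPath is an assumption based on the markup quoted above): Selenium exposes computed CSS through value_of_css_property.
import time
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://zoeken.mol.bibliotheek.be/?itemid=|library/marc/vlacc|9394694&undup=false")
time.sleep(10)  # crude wait; the explicit wait in the answer below is better
# XPath is an assumption based on the markup quoted above
icon = driver.find_element_by_xpath('//span[@class="avail-icon"]/i')
# Per the PS: "-48px -16px" would mean available (green), "0px -32px" not available (red)
print(icon.value_of_css_property("background-position"))
driver.quit()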

You can do this entirely with Selenium:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://zoeken.mol.bibliotheek.be/?itemid=|library/marc/vlacc|9394694&undup=false")
# Wait until the icon is loaded...
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="availabilityStatic"]/div/div/ul/li/ul/li/span/i')))
circle_icon = driver.find_element_by_xpath('//*[@id="availabilityStatic"]/div/div/ul/li/ul/li/span/i')
icon_class = circle_icon.get_attribute("class")
if "loanedout" in icon_class:
    print("item not available")
else:
    print("item available")
driver.quit()
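One caveat worth hedging: presence_of_element_located can succeed while the icon is still in its avail-icon-loading or avail-icon-unknown state, because the <i> element exists in the DOM from the start. A sketch of a stricter wait using a custom condition (the XPath is an assumption based on the question's markup):
def icon_resolved(driver):
    # Truthy only once the class has settled on a final state
    icon = driver.find_element_by_xpath('//span[@class="avail-icon"]/i')
    cls = icon.get_attribute("class")
    return icon if "loading" not in cls and "unknown" not in cls else False

icon = WebDriverWait(driver, 20).until(icon_resolved)
print(icon.get_attribute("class"))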

Related

How to get the whole source code from a web page with selenium / webdriver?

I use this Python program (usually) successfully for web scraping.
It gives me not only the page's static source but also the content generated by JavaScript.
However, it does not work as desired on this particular website: information is missing.
It does not seem to be a timing problem.
from selenium import webdriver
url = "https://www.youbet.dk/sport/fodbold/"
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(executable_path='D:/Programme/chromedriver_win32/chromedriver.exe',options=options)
driver.get(url)
After execution, driver.page_source contains the code.
I am interested in the text on the buttons (team name and a number).
Right-clicking and inspecting a button in Chrome gives me something like the following code, which contains the information I am looking for (here "Villarreal" and "1.51"):
<button class="rj-ev-list__bet-btn rj-ev-list__selection-0ML54283820_1" data-uat="button-ev-list-bet-btn"><div class="rj-ev-list__bet-btn__inner " data-uat="div-ev-list-bet-btn-inner"><div class="rj-ev-list__bet-btn__row" data-uat="div-ev-list-bet-btn-row"><span class="rj-ev-list__bet-btn__content rj-ev-list__bet-btn__text" data-uat="ev-list-ev-list-bet-btn-text">Villarreal</span></div><div class="rj-ev-list__bet-btn__row" data-uat="div-ev-list-bet-btn-row"><span class="rj-ev-list__bet-btn__content rj-ev-list__bet-btn__odd" data-uat="ev-list-ev-list-bet-btn-odd">1.51</span></div></div><span class="rj-ev-list__bet-btn__arrow-up"></span><span class="rj-ev-list__bet-btn__arrow-down"></span></button>
But this does not show up in driver.page_source.
How can I access this information using python and selenium?
These did not help:
* Adding time.sleep(10)
* Adding driver.implicitly_wait(10)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
url = "https://www.youbet.dk/sport/fodbold/"
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)
driver.get(url)
WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'rj-ev-list__bet-btn__inner')))
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
mydivs = soup.find_all("button", {"class": "rj-ev-list__bet-btn"})
alldata = [[div.find("span", {"class": "rj-ev-list__bet-btn__content"}).text,
            div.find("span", {"class": "rj-ev-list__bet-btn__odd"}).text]
           for div in mydivs]
print(alldata)
driver.quit()
# [['Fulham', '2.55'], ['Uafgjort', '3.40'], ['Leeds', '2.80'], ['Real Betis', '1.83'], ['Uafgjort', '3.65'], ['Levante', '4.55'], ['Parma', '2.40'], ['Uafgjort', '3.10'], ['Genoa', '3.35']]
The problem with your approach:
You were close. The issue with the delays you added is that they didn't relate directly to the visibility of the element (and maybe ten seconds of wait time wasn't enough). To combat this, the code above uses the more specific explicit wait, WebDriverWait (additional resource: https://www.geeksforgeeks.org/explicit-waits-in-selenium-python/):
WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located(
(By.CLASS_NAME, 'rj-ev-list__bet-btn__inner')))
to wait for the presence of all the elements of that class. This solution worked for me. Tell me in the comments if you have any issues.
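If presence alone ever proves insufficient (an element can be in the DOM before it is rendered), a visibility-based condition is a drop-in swap — a sketch, not something this page was shown to need:
WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located(
    (By.CLASS_NAME, 'rj-ev-list__bet-btn__inner')))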

Trying to create a web scraper for Dell drivers using Python 3 and Beautiful Soup

I am trying to create a web scraper to grab info about Dell drivers from their website. Apparently the site uses JavaScript to load the driver data into the page, and I am having difficulty getting the driver info out of it. This is what I have cobbled together so far:
from bs4 import BeautifulSoup
import urllib.request
import json
resp = urllib.request.urlopen("https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
So far, none of these attempts to get the driver data have worked:
data = json.loads(soup.find('script', type='text/preloaded').text)
data = json.loads(soup.find('script', type='application/x-suppress').text)
data = json.loads(soup.find('script', type='text/javascript').text)
data = json.loads(soup.find('script', type='application/ld+json').text)
I am not very skilled at Python; I have been looking all over, trying to cobble something together that works. Any assistance to help me get a little further in my endeavor would be greatly appreciated.
You can use Selenium:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers')
time.sleep(3)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html5lib')
I was able to get Sushil's answer working on my machine with some minor changes:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/temp/chromedriver_win32/chromedriver.exe')
driver.get('https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers')
time.sleep(3)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser')
results = soup.find(id='downloads-table')
results2 = results.find_all(class_='dl-desk-view')
results3 = results.find_all(class_='details-control sorting_1')
results4 = results.find_all(class_='details-control')
results5 = results.find_all(class_='btn-download-lg btn btn-sm no-break text-decoration-none dellmetrics-driverdownloads btn-outline-primary')
The problem, though, is that this still only gets me 10 of the 79 drivers.
I need a way to list all of the available drivers.
I got it figured out
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/temp/chromedriver_win32/chromedriver.exe')
driver.get('https://www.dell.com/support/home/en-us/product-support/product/precision-15-5520-laptop/drivers')
time.sleep(3)
driver.find_element_by_xpath("//button[contains(.,'Show all')]").click()
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser')
results = soup.find(id='downloads-table')
results2 = results.find_all(class_='dl-desk-view')
results3 = results.find_all(class_='details-control sorting_1')
results4 = results.find_all(class_='details-control')
results5 = results.find_all(class_='btn-download-lg btn btn-sm no-break text-decoration-none dellmetrics-driverdownloads btn-outline-primary')
# zip stops at the shortest of the four lists
for r2, r3, r4, r5 in zip(results2, results3, results4, results5):
    print(r2, r3, r4, r5)
I was able to pull the JSON file that has the driver information. That saves a lot of hassle compared with driving a browser or other tricks.
Example for a Dell Precision 7760 with Windows 10:
https://www.dell.com/support/driver/en-us/ips/api/driverlist/fetchdriversbyproduct?productcode=precision-17-7760-laptop&oscode=WT64A
(Note: "productcode" and "oscode" parameters.)
In order for this to work, you must send the request header "X-Requested-With" with the value "XMLHttpRequest". If you do not, you will get a "no content" result.
Format the resulting JSON and you should easily see the structure of the results including all of the driver data that you see on the support website.
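A minimal sketch of that call with the requests library (the header is the one named above; the exact JSON layout is something to inspect, not something documented here):
import requests

url = ("https://www.dell.com/support/driver/en-us/ips/api/driverlist/"
       "fetchdriversbyproduct?productcode=precision-17-7760-laptop&oscode=WT64A")
headers = {"X-Requested-With": "XMLHttpRequest"}  # without this you get "no content"
resp = requests.get(url, headers=headers)
data = resp.json()
print(data.keys())  # inspect the structure to locate the driver list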

How to get Javascript generated content

I am currently having an issue getting the HTML of JavaScript-generated content on this webpage: https://aca3.accela.com/MILARA/GeneralProperty/PropertyLookUp.aspx . The page generates the content with JavaScript on the page itself. I was wondering what I am doing wrong. The code I'm using is this:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path="/Users/MrPete/Downloads/chromedriver_win32/chromedriver")
driver.get('https://aca3.accela.com/MILARA/GeneralProperty/PropertyLookUp.aspx')
profession = Select(driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_refLicenseeSearchForm_ddlLicenseType"]'))
profession.select_by_value("Pharmacist")
time.sleep(5)  # Let the user actually see something!
lName = driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_refLicenseeSearchForm_txtLastName"]')
lName.send_keys('roy')
search = driver.find_element_by_xpath('//*[@id="ctl00_PlaceHolderMain_btnNewSearch"]')
search.click()
time.sleep(5)
html = driver.execute_script("return document.getElementsByTagName('table')[38].innerHTML")
print(html)
Now, it's not that I'm getting no output; the output I'm getting is:
<tbody><tr>
<td style="white-space:nowrap;"><span class="ACA_SmLabel ACA_SmLabel_FontSize"> Showing 1-13 of 13 </span></td>
</tr>
</tbody>
Which is (kind of) the title of the table I'm trying to get. What I want as output is the HTML of the entire table (the original question included a screenshot of the JavaScript-generated table). What I'm currently getting is the small caption at the top, the 'Showing 1-13 of 13'; what I want is the entire table.
Try changing
html = driver.execute_script("return document.getElementsByTagName('table')[38].innerHTML")
print(html)
To:
target = driver.find_element_by_xpath('//table[@class="ACA_GridView ACA_Grid_Caption"]')
print(target.text)
Output:
Showing 1-13 of 13
License Type
License Number
First Name
Middle Initial
Last Name
Organization Name
DBA/Trade Name
License Status
License Expiration Date
Pharmacist
5302017621
Arthur
James
etc.
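If the rows are needed as structured data rather than one text blob, a sketch building on the same target element (using the older find_elements_by_* API the rest of this thread uses):
rows = target.find_elements_by_xpath(".//tr")
for row in rows:
    cells = [td.text for td in row.find_elements_by_xpath(".//td")]
    if cells:  # skip rows without <td> cells, e.g. header rows
        print(cells)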

Scrape data from a website using selenium

I am still quite an amateur at Python. I am trying to scrape data from a website using Selenium.
<small class="fxs_price_ohl"> <span>Open 1.29814</span> <span>High 1.29828</span> <span>Low 1.29775</span> </small> </div> </div> </li> <script type="application/ld+json">
I am trying to obtain the values Open 1.29814, High 1.29828 and Low 1.29775 from the HTML code above.
count_element = browser.find_element_by_xpath("//small[@class='fxs_price_ohl']//span")
print(count_element.text)
I'm using Selenium with Python; the code above is mine. But count_element.text prints empty. How do I get the values Open 1.29814, High 1.29828 and Low 1.29775?
Use "find_elements_by_xpath" if you want to retrieve multiple elements:
count_elements = browser.find_elements_by_xpath("//small[@class='fxs_price_ohl']//span")
for ele in count_elements:
    print(ele.text)
You can also use a CSS selector: the parent's class followed by a descendant combinator and a type selector for the child spans. You also need a wait condition, because the page is slow to load.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
browser.get('https://www.fxstreet.com/rates-charts/gbpusd')
before_text = ''
while True:  # this could be improved with a timeout, see the sketch below
    elements = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, ".fxs_chart_cag_cont .fxs_price_ohl span")))
    elem = elements[-1]
    if elem.text != before_text:
        break
print([elem.text for elem in elements])
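One way to add the timeout that the comment mentions — a sketch that swaps the while True for a deadline (the 30-second budget is an assumption, not something from the answer):
import time

deadline = time.time() + 30  # assumed budget; tune to the page
while time.time() < deadline:
    elements = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, ".fxs_chart_cag_cont .fxs_price_ohl span")))
    if elements[-1].text != before_text:
        break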

Using Python, Selenium, and BeautifulSoup to scrape for content of a tag?

Relative beginner here. There are similar topics to this, but I can't quite get my solution working; I just need help connecting these last few dots. I'd like to scrape follower counts from Instagram without using the API. Here's what I have so far, on Python 3.7.0:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
> DevTools listening on ws://.......
driver.get("https://www.instagram.com/cocacola")
soup = BeautifulSoup(driver.page_source, "html.parser")
elements = soup.find_all(attrs={"class":"g47SY "})
# Note the full class is 'g47SY lOXF2' but I can't get this to work
for element in elements:
    print(element)
>[<span class="g47SY ">667</span>,
<span class="g47SY " title="2,598,456">2.5m</span>, # Need what's in title, 2,598,456
<span class="g47SY ">582</span>]
for element in elements:
    t = element.get('title')
    if t:
        count = t.replace(",", "")

print(int(count))
>2598456 # Success
Is there an easier or quicker way to get to the 2,598,456 number? My original hope was that I could just use the class 'g47SY lOXF2', but BS4 treats class as a multi-valued attribute, so a name with a space can't be matched as a single string (a workaround is sketched below). Just want to make sure this code is succinct and functional.
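For what it's worth, a multi-class match is possible with a CSS selector, since chained "." class selectors must all be present on the element — a sketch against the soup object from the question:
# Matches spans carrying BOTH classes, regardless of attribute order
spans = soup.select('span.g47SY.lOXF2')
for span in spans:
    print(span.get('title') or span.text)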
I had to use the headless option and added executable_path for testing. You can remove those.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="chromedriver.exe",chrome_options=options)
driver.get('https://www.instagram.com/cocacola')
soup = BeautifulSoup(driver.page_source,'lxml')
# 'a > span[title]' finds the first span with a title attribute directly inside
# an <a> tag; the follower count lives in that title attribute.
followers = soup.select_one('a > span[title]')['title'].replace(',', '')
print(followers)
#Output 2598552
You could use a regular expression to get the number.
Try this:
import re

# Capture the digits and thousands separators inside the title attribute
followerRegex = re.compile(r'title="([\d,]+)"')
followerCount = followerRegex.search(str(elements))
result = followerCount.group(1).replace(',', '')
