Scraping & parsing multiple HTML tables with the same URL (BeautifulSoup & Selenium) - python-3.x

I'm trying to scrape the full HTML table from this site:
https://www.iscc-system.org/certificates/all-certificates/
My code is as follows:
from selenium import webdriver
import time
import pandas as pd
url = 'https://www.iscc-system.org/certificates/all-certificates/'
browser = webdriver.Chrome('/home/giuseppe/bin/chromedriver')
browser.get(url)
csvfile = open('Scrape_certificates', 'a')
dfs = pd.read_html('https://www.iscc-system.org/certificates/all-certificates/', header=0)
for i in range(1,10):
    for df in dfs:
        df.to_csv(csvfile, header=False)
    link_next_page = browser.find_element_by_id('table_1_next')
    link_next_page.click()
    time.sleep(4)
    dfs = pd.read_html(browser.current_url)
csvfile.close()
The above code is only for the first 10 pages of the full table as an example.
The problem is that the output is always the same first table repeated 10 times. Although clicking the 'next table' button does update the table on the webpage (at least as far as I can see), I'm unable to get the new data from the following pages; I always get the same data from the first table.

Firstly, you are reading the URL with pandas, not the page source. That fetches the page anew rather than reading the Selenium-generated source. Secondly, you want to limit the reading to the table with the id table_1. Try this:
from selenium import webdriver
import time
import pandas as pd
url = 'https://www.iscc-system.org/certificates/all-certificates/'
browser = webdriver.Chrome('/home/giuseppe/bin/chromedriver')
browser.get(url)
csvfile = open('Scrape_certificates', 'a')
for i in range(1,10):
    dfs = pd.read_html(browser.page_source, attrs = {'id': 'table_1'})
    for df in dfs:
        df.to_csv(csvfile, header=False)
    link_next_page = browser.find_element_by_id('table_1_next')
    link_next_page.click()
    time.sleep(4)
csvfile.close()
You will need to remove or filter out line 10 from each result as it is the navigation.
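For example, a minimal way to drop that row before writing (this sketch assumes the navigation text always lands in the last row of each scraped table, so check it against your own output first):
for df in dfs:
    df = df.iloc[:-1]  # drop the trailing navigation/pagination row
    df.to_csv(csvfile, header=False)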

Related

How to scrape multiple pages with requests in Python

I recently started getting into web scraping and I have managed OK, but now I'm stuck and I can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page
import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
#finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]
#finds the right paragraph to grab
reading = soup.find_all('p')[0].text
print (heading, reading)
import csv
from datetime import datetime
# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])
The problem occurs when I try to scrape multiple pages at the same time.
The pages are all the same, only the pagination changes, e.g.
https://www.example.com/page.aspx?sign=1
https://www.example.com/page.aspx?sign=2
https://www.example.com/page.aspx?sign=3
https://www.example.com/page.aspx?sign=4 etc
Instead of writing the same code 20 times, how do I put all the data in a tuple or an array and export it to CSV?
Many thanks in advance.
Just try it with a loop that runs until no page is available (the request is not OK). Should be easy to do.
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
results = []
page_number = 1
while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.content, 'html.parser')  # parse the current response, not the old `page`
    # finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    # finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # write a list
    # results.append([heading, reading, datetime.now()])
    # or a tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)

Iterate over all pages, crawl each table's elements and save them as a dataframe in Python

I need to loop over all the entries on all the pages from this link, then click the 查看 (view) link in the part marked in red (please see the image below) to enter the detail page of each entry:
The objective is to crawl the info from pages such as the image below, saving the left part as column names and the right part as rows:
The code I used:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')
table = soup.find('table', {'class': 'gridview'})
df = pd.read_html(str(table))[0]
print(df.head(5))
Out:
序号 工程名称 ... 发证日期 详细信息
0 NaN 假日万恒社区卫生服务站装饰装修工程 ... 2020-07-07 查看
The code for entering the detailed pages:
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308891&t=toDetail&GCBM=202006202001'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')
table = soup.find("table", attrs={"class":"detailview"}).findAll("tr")
for elements in table:
    inner_elements = elements.findAll("td", attrs={"class":"label"})
    for text_for_elements in inner_elements:
        print(text_for_elements.text)
Out:
工程名称:
施工许可证号:
所在区县:
建设单位:
工程规模(平方米):
发证日期:
建设地址:
施工单位:
监理单位:
设计单位:
行政相对人代码:
法定代表人姓名:
许可机关:
As you can see, I only get the column names; no entries have been successfully extracted.
In order to loop over all the pages, I think we need to use POST requests, but I don't know how to get the headers.
Thanks for your help in advance.
This script will go through all the pages, collect the data into a DataFrame, and save it to data.csv.
(!!! Warning !!! there are 2405 pages in total, so it takes a long time to get them all; see the note after the code for a way to cap the run while testing):
import requests
import pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
payload = {'currentPage': 1, 'pageSize':15}
def scrape_page(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return {td.get_text(strip=True).replace(':', ''): td.find_next('td').get_text(strip=True) for td in soup.select('td.label')}
all_data = []
current_page = 1
while True:
    print('Page {}...'.format(current_page))
    payload['currentPage'] = current_page
    soup = BeautifulSoup(requests.post(url, data=payload).content, 'html.parser')
    for a in soup.select('a:contains("查看")'):
        u = 'http://bjjs.zjw.beijing.gov.cn' + a['href']
        d = scrape_page(u)
        all_data.append(d)
        pprint(d)
    page_next = soup.select_one('a:contains("下一页")[onclick]')
    if not page_next:
        break
    current_page += 1
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
Prints the data to the screen and saves data.csv (screenshot of the result in LibreOffice not reproduced here).
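If you only want to test the approach without crawling all 2405 pages, one option (my addition, not part of the original answer) is to cap the loop, e.g. by adding this inside the while loop right after the for loop that scrapes the current page:
    if current_page >= 5:  # hypothetical cap: stop after the first 5 pages while testing
        break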

Python unable to refresh the execution of script

from urllib.request import urlopen
from selenium import webdriver
from bs4 import BeautifulSoup as BSoup
import requests
import pandas as pd
from requests_html import HTMLSession
import time
import xlsxwriter
import re
import os
urlpage = 'https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2019/07/14&Racecourse=ST&RaceNo=1'
# Setup selenium
driver = webdriver.Firefox(executable_path = 'geckodriver path')
# get web page
driver.get(urlpage)
time.sleep(10)
bs_obj = BSoup(driver.page_source, 'html.parser')
# Scrape table content
table = bs_obj.find('table', {"f_tac table_bd draggable"})
rows = table.find_all('tr')
table_content = []
for row in rows[1:]:
    cell_row = []
    for cell in row.find_all('td'):
        cell_row.append(cell.text.replace(" ", "").replace("\n\n", " ").replace("\n", ""))
    table_content.append(cell_row)
header_content = []
for cell in rows[0].find_all('td'):
    header_content.append(cell.text)
driver.close()
race_writer = pd.ExcelWriter('export path', engine='xlsxwriter')
df = pd.DataFrame(table_content, columns=header_content)
df.to_excel(race_writer, sheet_name='game1')
Hi All, I am trying to scrape the racing results from HKJC. When I execute the code above, one of the problems below happens:
No Excel file is created
The df is not written to the Excel file, i.e. an empty Excel file is created
Also, if I successfully scrape the result of game 1 and then amend the script to continue with game 2, it still gives me the result of game 1.
I'd appreciate it if anyone could help.
I changed your script to the one below. The approach followed is to click through each of the relevant "Sha Tin" buttons (see range(1, len(shatin)-1)) and collect the race table data. Race tables are added to a list called "races". Finally, write each of the race tables to individual sheets in Excel (note you no longer need BeautifulSoup).
Add these to your list of imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Then:
urlpage = 'https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2019/07/14&Racecourse=ST&RaceNo=1'
# Setup selenium
driver = webdriver.Firefox(executable_path = 'geckodriver path')
# get web page
driver.get(urlpage)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//table[@class='f_fs12 f_fr js_racecard']")))
shatin = driver.find_elements_by_xpath("//table[@class='f_fs12 f_fr js_racecard']/tbody/tr/td")
races = []
for i in range(1, len(shatin)-1):
    shatin = driver.find_elements_by_xpath("//table[@class='f_fs12 f_fr js_racecard']/tbody/tr/td")
    #time.sleep(3)
    #WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='performance']")))
    shatin[i].click()
    table = pd.read_html(driver.find_element_by_xpath("//div[@class='performance']").get_attribute('outerHTML'))[0]
    races.append(table)
with pd.ExcelWriter('races.xlsx') as writer:
    for i, race in enumerate(races):
        race.to_excel(writer, sheet_name=f'game{i+1}', index=False)
    # the context manager saves the workbook on exit, so no explicit writer.save() call is needed
driver.quit()
Output:

How to find text of a specific cell from a html table (on a url) using python?

from selenium import webdriver
driver = webdriver.Chrome(executable_path="D:\chromedriver.exe")
#url = 'https://www.dcrustedp.in/show_chart.php'
driver.get('https://www.dcrustedp.in/show_chart.php')
rows = 2
cols = 5
for r in range(5,rows+1):
    for c in range(6,cols+1):
        value = driver.find_element_by_xpath("/html/body/center/table/tbody/tr["+str(r)+"]/td["+str(c)+"]").text
        print(value)
This is my code. I want to extract the result date of B.Tech - Computer Science and Engineering, 5th Semester. It is in the first row of the table. The date is 24-02-2020. I want to print the date from that particular cell only.
The below code works:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
webpage = 'https://www.dcrustedp.in/show_chart.php'
driver = webdriver.Chrome(executable_path='Your/path/to/chromedriver.exe')
driver.get(webpage)
time.sleep(15)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
pagehits=driver.find_element_by_xpath("/html/body/center/table/tbody/tr[3]/td[5]")
print(pagehits.text)
driver.quit()
Without Selenium, we can use the requests library to fetch the table and then pick the respective element:
import requests
import pandas as pd
url = 'https://www.dcrustedp.in/show_chart.php'
html = requests.get(url, verify=False).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df.iat[0,4])
To extract the result date of the 5th Semester for any of the Prg. Titles, you have to induce WebDriverWait for visibility_of_element_located(), and you can use the following locator strategy:
xpath:
driver.get('https://www.dcrustedp.in/show_chart.php')
prg_title = "B.Tech - Computer Science and Engineering"
# prg_title = "B.Tech - Electrical Engineering"
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//td[contains(., '"+prg_title+"')]//following-sibling::td[3]"))).get_attribute("innerHTML"))
Console Output:
24-02-2020
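For reference, that WebDriverWait snippet assumes the standard Selenium wait imports (the same ones listed in the HKJC answer further up):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By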

Loading scraped data into list

I was able to successfully scrape some text from a website and I'm now trying to load the text into a list so I can later convert it to a Pandas DataFrame.
The site supplied the data in a scsv format so it was quick to grab.
The following is my code:
import requests
from bs4 import BeautifulSoup
#Specify the url:url
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
#Extract the response:html_doc
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
#Find the tags associated with the data you need, in this case
# it's the "pre" tags
for data in soup.find_all("pre"):
    print(data.text)
Sample Output
Week;Year;GID;Name;Pos;Team;h/a;Oppt;DK points;DK salary
1;2017;1254;Smith, Alex;QB;kan;a;nwe;34.02;5400
1;2017;1344;Bradford, Sam;QB;min;h;nor;28.54;5900
Use the open function to write a csv file:
import requests
from bs4 import BeautifulSoup
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
r = requests.get(url)
html_doc = r.content
soup = BeautifulSoup(html_doc,"html.parser")
file = open("data.csv", "w")
for data in soup.find("pre").text.split('\n'):
    file.write(data.replace(';', ',') + '\n')  # re-add the newline stripped by split
file.close()
Here's one thing you can do, although it's possible that someone who knows pandas better than I can suggest something better.
You have r.text. Put that into a convenient text file, let me call it temp.csv. Now you can use pandas' read_csv method to get the data into a dataframe.
>>> df = pandas.read_csv('temp.csv', sep=';')
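If you'd rather script that intermediate step than paste it by hand, here is a minimal sketch (it assumes the scsv block is the content of the <pre> tag, as in the question's code, and that soup is the BeautifulSoup object built there):
with open('temp.csv', 'w') as f:
    f.write(soup.find('pre').text)  # dump the semicolon-separated block to a file
df = pandas.read_csv('temp.csv', sep=';')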
Addendum:
Suppose results were like this.
>>> results = [['a', 'b', 'c'], [1,2,3], [4,5,6]]
Then you could put them in a dataframe in this way.
>>> df = pandas.DataFrame(results[1:], columns=results[0])
>>> df
   a  b  c
0  1  2  3
1  4  5  6
If you want to convert your existing output into a list, using the split method might do the job, and then you can use pandas to convert it into a dataframe (see the sketch after the code below).
import requests
from bs4 import BeautifulSoup
#Specify the url:url
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
#Extract the response:html_doc
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
#Find the tags associated with the data you need, in this case
# it's the "pre" tags
for data in soup.find_all("pre"):
    print(data.text.split(";"))
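To finish the dataframe step mentioned above, a minimal sketch (my addition: it splits the <pre> text into lines first, then each line on semicolons, and assumes the first row holds the column names as in the sample output):
import pandas as pd
rows = [line.split(';') for line in soup.find('pre').text.splitlines() if line.strip()]
df = pd.DataFrame(rows[1:], columns=rows[0])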
