Loading scraped data into list - python-3.x

I was able to successfully scrape some text from a website and I'm now trying to load the text into a list so I can later convert it to a Pandas DataFrame.
The site supplies the data in a semicolon-separated (scsv) format, so it was quick to grab.
The following is my code:
import requests
from bs4 import BeautifulSoup
#Specify the url:url
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
#Extract the response:html_doc
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
#Find the tags associated with the data you need, in this case
# it's the "pre" tags
for data in soup.find_all("pre"):
    print(data.text)
Sample Output
Week;Year;GID;Name;Pos;Team;h/a;Oppt;DK points;DK salary
1;2017;1254;Smith, Alex;QB;kan;a;nwe;34.02;5400
1;2017;1344;Bradford, Sam;QB;min;h;nor;28.54;5900

Use the open function to write a CSV file:
import requests
from bs4 import BeautifulSoup
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
r = requests.get(url)
html_doc = r.content
soup = BeautifulSoup(html_doc,"html.parser")
file = open("data.csv", "w")
for data in soup.find("pre").text.split('\n'):
    # Replace the semicolons with commas and restore the newline that split removed
    file.write(data.replace(';', ',') + '\n')
file.close()
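Once data.csv is written, the original goal of a DataFrame is one call away (a minimal sketch, reading back the file produced above):
import pandas as pd

# The first line of the scraped text becomes the header row
df = pd.read_csv("data.csv")
print(df.head())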

Here's one thing you can do, although it's possible that someone who knows pandas better than I can suggest something better.
You have the scsv text (the contents of the <pre> tag pulled from r.text). Put that into a convenient text file, call it temp.csv. Now you can use pandas' read_csv method to get the data into a DataFrame.
>>> df = pandas.read_csv('temp.csv', sep=';')
Addendum:
Suppose results were like this.
>>> results = [['a', 'b', 'c'], [1,2,3], [4,5,6]]
Then you could put them in a dataframe in this way.
>>> df = pandas.DataFrame(results[1:], columns=results[0])
>>> df
   a  b  c
0  1  2  3
1  4  5  6
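Coming back to the scraped data: if you'd rather skip the temporary file, read_csv also accepts an in-memory buffer, so you can feed it the <pre> text directly (a minimal sketch, assuming the semicolon-separated block sits in the first <pre> tag as in the question):
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# The semicolon-separated data is the text of the first <pre> tag
scsv_text = soup.find("pre").text
# read_csv accepts any file-like object, so wrap the string in StringIO
df = pd.read_csv(io.StringIO(scsv_text), sep=";")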

If you want to load the scraped text into a list, the split method will do the job; then use pandas to convert it into a DataFrame, as shown in the sketch after the code below.
import requests
from bs4 import BeautifulSoup
#Specify the url:url
url = "http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2017&game=dk&scsv=1"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
#Extract the response:html_doc
html_doc = r.text
soup = BeautifulSoup(html_doc,"html.parser")
#Find the tags associated with the data you need, in this case
# it's the "pre" tags
for data in soup.find_all("pre"):
    print(data.text.split(";"))
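To finish the DataFrame step this answer mentions, collect the split rows into a list of lists and pass the first row to pandas as the column names (a sketch building on the loop above):
import pandas as pd

rows = []
for data in soup.find_all("pre"):
    # One sub-list per line; the first line holds the column names
    for line in data.text.strip().split("\n"):
        rows.append(line.split(";"))

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())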

Related

Script is not returning proper output when trying to retrieve data from a newsletter

I am trying to write a script that retrieves the album title and band name from a music store newsletter. The band name and album title sit in h3 and h4 elements. When executing the script I get a blank output in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
    band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
I think the error is in the attrs part, not sure how to fix it properly. Thanks in advance!
Looking at your code I agree that the error lies in the attrs part. The problem you are facing is that the site you are trying to scrape does not contain 'a' elements with the 'row' class. Thus find_all returns an empty list. There are plenty of 'div' elements with the 'row' class, maybe you meant to look for those?
You had the right idea by looking for 'td' elements and extracting their 'h3' and 'h4' elements, but since albums is an empty list, there are no elements to find.
I changed your code slightly to look for 'td' elements directly and extract their 'h3' and 'h4' elements. With these small changes your code found 29 albums.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3')
    band_name_element = album.find('h4')
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)
I also took the liberty of adding index=False to the last line of your code, so each row doesn't start with an index column (a leading ,).
Hope this helps.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3', attrs={'class': 'header'})
    band_name_element = album.find('h4', attrs={'class': 'header'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
Thanks to the anonymous hero for helping out!

Issue webscraping a linked header with Beautiful Soup

I am running into an issue pulling in the human-readable header names from a table in an HTML document. I can pull in the id, but my trouble comes when trying to pull in the readable header text between the tags. I am not sure what I need to do in this instance. Below is my code; it all runs except for the last for loop.
# Import libraries
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import numpy as np
# Pull the HTML link into a local file or buffer
# and then parse with the BeautifulSoup library
# ------------------------------------------------
url = 'https://web.dsa.missouri.edu/static/mirror_sites/factfinder.census.gov/bkmk/table/1.0/en/GEP/2014/00A4/0100000US.html'
r = requests.get(url)
#print('Status: ' + str(r.status_code))
#print(requests.status_codes._codes[200])
soup = BeautifulSoup(r.content, "html")
table = soup.find(id='data')
#print(table)
# Convert the data into a list of dictionaries
# or some other structure you can convert into
# pandas Data Frame
# ------------------------------------------------
trs = table.find_all('tr')
#print(trs)
header_row = trs[0]
#print(header_row)
names = []
for column in header_row.find_all('th'):
    names.append(column.attrs['id'])
#print(names)
db_names = []
for column in header_row.find_all('a'):
    db_names.append(column.attrs['data-vo-id']) # ISSUE ARISES HERE!!!
print(db_names)
Let pandas read_html do the work for you, and simply specify the table id to find:
from pandas import read_html as rh
table = rh('https://web.dsa.missouri.edu/static/mirror_sites/factfinder.census.gov/bkmk/table/1.0/en/GEP/2014/00A4/0100000US.html', attrs = {'id': 'data'})[0]
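read_html returns a list of DataFrames (one per matching table), which is why the [0] above picks out the result; the human-readable header names should then show up directly as the DataFrame's columns:
# Quick check of the parsed headers and first rows
print(table.columns.tolist())
print(table.head())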
Hey, you can try something like this:
soup = BeautifulSoup(r.content, "html")
table = soup.findAll('table', {'id':'data'})
trs = table[0].find_all('tr')
#print(trs)
names = []
for row in trs[:1]:
    # Header cells can be th or td; take the text of each one
    cells = row.find_all(['th', 'td'])
    data_row_txt_list = [cell.text.strip() for cell in cells]
    header_row = data_row_txt_list
    for column in header_row:
        names.append(column)
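If the end goal is just the human-readable header text rather than the 'data-vo-id' attribute, you can read the text of each 'th' in the header row directly; get_text covers cells whether or not their label is wrapped in an 'a' tag, and Tag.get() returns None instead of raising when an attribute is missing (a sketch against the trs[0] header row used above):
readable_names = []
for column in trs[0].find_all('th'):
    # Visible header text, whether plain or wrapped in an <a> tag
    readable_names.append(column.get_text(strip=True))
print(readable_names)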

How to scrape multiple pages with requests in python

I recently started getting into web scraping and have managed OK, but now I'm stuck and can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page
import requests
page = requests.get("https://www.example.com/page.aspx?sign=1")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
#finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]
#finds the right paragraph to grab
reading = soup.find_all('p')[0].text
print (heading, reading)
import csv
from datetime import datetime
# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])
The problem occurs when I try to scrape multiple pages at the same time.
They are all the same, just the pagination changes, e.g.
https://www.example.com/page.aspx?sign=1
https://www.example.com/page.aspx?sign=2
https://www.example.com/page.aspx?sign=3
https://www.example.com/page.aspx?sign=4 etc
Instead of writing the same code 20 times, how do I stick all the data in a tuple or an array and export it to CSV?
Many thanks in advance.
Just try it with a loop that runs until no page is available (the request is not OK). Should be easy enough:
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime
results = []
page_number = 1
while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break
    # parse the response we just fetched, not the old `page` variable
    soup = BeautifulSoup(response.content, 'html.parser')
    #finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]
    #finds the right paragraph to grab
    reading = soup.find_all('p')[0].text
    # write a list
    # results.append([heading, reading, datetime.now()])
    # or tuple.. your call
    results.append((heading, reading, datetime.now()))
    page_number = page_number + 1
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
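Since the rest of this thread leans on pandas anyway, the collected tuples also drop straight into a DataFrame if you prefer that over the csv module (a sketch; the column names are made up for illustration):
import pandas as pd

# Column names here are illustrative, not required by the data
df = pd.DataFrame(results, columns=['heading', 'reading', 'scraped_at'])
df.to_csv('index.csv', index=False)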

Iterate all pages and crawler table's elements save as dataframe in Python

I need to loop over all the entries on all the pages from this link, then click the menu item marked in red (please see the image below) to enter the detail page of each entry:
The objective is to crawl the info from pages such as the image below, and save the left part as column names and the right part as rows:
The code I used:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')
table = soup.find('table', {'class': 'gridview'})
df = pd.read_html(str(table))[0]
print(df.head(5))
Out:
序号 工程名称 ... 发证日期 详细信息
0 NaN 假日万恒社区卫生服务站装饰装修工程 ... 2020-07-07 查看
The code for entering the detailed pages:
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308891&t=toDetail&GCBM=202006202001'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')
table = soup.find("table", attrs={"class":"detailview"}).findAll("tr")
for elements in table:
inner_elements = elements.findAll("td", attrs={"class":"label"})
for text_for_elements in inner_elements:
print(text_for_elements.text)
Out:
工程名称:
施工许可证号:
所在区县:
建设单位:
工程规模(平方米):
发证日期:
建设地址:
施工单位:
监理单位:
设计单位:
行政相对人代码:
法定代表人姓名:
许可机关:
As you can see, I only get the column names; no entries have been successfully extracted.
In order to loop over all the pages, I think we need to use POST requests, but I don't know how to get the headers.
Thanks for your help in advance.
This script goes through all the pages, collects the data into a DataFrame, and saves it to data.csv.
(!!! Warning !!! there are 2405 pages total, so it takes a long time to get them all):
import requests
import pandas as pd
from pprint import pprint
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=425000'
payload = {'currentPage': 1, 'pageSize':15}
def scrape_page(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return {td.get_text(strip=True).replace(':', ''): td.find_next('td').get_text(strip=True) for td in soup.select('td.label')}

all_data = []
current_page = 1
while True:
    print('Page {}...'.format(current_page))
    payload['currentPage'] = current_page
    soup = BeautifulSoup(requests.post(url, data=payload).content, 'html.parser')
    for a in soup.select('a:contains("查看")'):
        u = 'http://bjjs.zjw.beijing.gov.cn' + a['href']
        d = scrape_page(u)
        all_data.append(d)
        pprint(d)
    page_next = soup.select_one('a:contains("下一页")[onclick]')
    if not page_next:
        break
    current_page += 1
df = pd.DataFrame(all_data)
df.to_csv('data.csv')
Prints the data to screen and saves data.csv (screenshot from LibreOffice):

Grabbing Data from Web Page using python 3

I'm performing the same web scraping pattern that I just learned from a post; however, I'm unable to scrape the data using the script below. I keep getting an empty return and I know the tags are there. I want to find_all "mubox" and then pull the values for O/U and goalie information. This is so weird, what am I missing?
from bs4 import BeautifulSoup
import requests
import pandas as pd
page_link = 'https://www.thespread.com/nhl-scores-matchups'
page_response = requests.get(page_link, timeout=10)
# here, we fetch the content from the url, using the requests library
page_content = BeautifulSoup(page_response.content, "html.parser")
# Take out the <div> of name and get its value
tables = page_content.find_all("div", class_="mubox")
print (tables)
# Iterate through rows
rows = []
This site loads its data from an internal API before rendering. The API returns an XML file (the URL used in the code below), which contains all the match information. You can parse it using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
page_link = 'https://www.thespread.com/matchups/NHL/matchup-list_20181030.xml'
page_response = requests.get(page_link, timeout=10)
body = BeautifulSoup(page_response.content, "lxml")
data = [
    (
        t.find("road").text,
        t.find("roadgoalie").text,
        t.find("home").text,
        t.find("homegoalie").text,
        float(t.find("ot").text),
        float(t.find("otmoney").text),
        float(t.find("ft").text),
        float(t.find("ftmoney").text)
    )
    for t in body.find_all('event')
]
print(data)
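Since the question already imports pandas, the list of tuples converts straight into a DataFrame; the column names below are illustrative labels for the tuple fields, not names taken from the feed:
import pandas as pd

# Illustrative labels matching the order of the tuple fields above
columns = ['road', 'road_goalie', 'home', 'home_goalie',
           'ot', 'ot_money', 'ft', 'ft_money']
df = pd.DataFrame(data, columns=columns)
print(df.head())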
