Scraping table data with BeautifulSoup or Pandas - python-3.x

I'm fairly new to Python and have been given a task that requires scraping data from a table. I don't know much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples use what appears to be a simpler page layout than the one I'm dealing with. I've tried several methods, but none of them let me select the table data that I need.
How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?
url = https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657
I've tried using the methods in the following links and others not shown here:
Beautiful Soup Scraping table
Scrape table with BeautifulSoup
Web scraping with BeautifulSoup
Some of the script I've tried:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table") # {"class": "xxxx"})
I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the webpage that has the basic well information:
import pandas as pd
df_list = pd.read_html(url)
df_list
Unfortunately, the data I need doesn't even show up when I run this script, and the table I'm trying to select doesn't have a class that I could use to select it instead of the table of basic well information. I've inspected the webpage but can't find a way to get to the correct table.
As for the final result, I need to export it as a CSV or a pandas data frame so that I can graph it against modeled groundwater data for comparison. Any suggestions would be greatly appreciated!

Try the approach below, which uses Python requests: it is simple, straightforward, reliable, and fast, and it needs less code. I found the API URL on the website itself by inspecting the Network section of Google Chrome's developer tools.
What the script below does:
1. It takes the API URL and sends a GET request with the dynamic parameters (in CAPS). You can change the well number and the start and end dates to get the desired result.
2. After receiving the data, it parses the JSON with json.loads.
3. It iterates over the list of daily water level data and builds a list of data points that can be written to a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.
4. Finally, it writes the headers and data to the CSV file. (Important: make sure to set the file path in the file_path variable.)
import csv
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
# Suppress the warning that verify=False would otherwise trigger
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
def scrap_daily_water_level():
    file_path = '' #Input File path here
    file_name = 'daily_water_level_data.csv' #File name
    #CSV headers
    csv_headers = ['Line #','GW Log Id','GW Site Id', 'Land Surface Elevation', 'Record Date','Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
    list_of_water_readings = []
    #Dynamic Params
    WELL_NO = 'HARN0052657'
    START_DATE = '1/1/1905'
    END_DATE = '12/30/2050'
    #API URL
    URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'
    response = requests.get(URL,verify=False) #GET API call
    json_result = json.loads(response.text) #JSON loads to parse JSON data
    print('Daily water level data count ',json_result['feature_count']) # Prints no. of data counts
    extracted_data = json_result['feature_list'] #Extracted data in JSON form
    for idx, item in enumerate(extracted_data): #Iterate over the list of extracted data
        list_of_water_readings.append({ #append and create list of data with headers for further usage
            'Line #': idx + 1,
            'GW Log Id' : item['gw_logid'],
            'GW Site Id': item['gw_site_id'],
            'Land Surface Elevation': item['land_surface_elevation'],
            'Record Date': item['record_date'],
            'Restrict to OWRD only': item['restrict_to_owrd_only'],
            'Reviewed Status':item['reviewed_status'],
            'Reviewed Status Description': item['reviewed_status_description'],
            'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
            'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
        })
    #Create CSV and write data in to it.
    #Open in 'w' mode (not 'a+') so that re-running the script does not append a second header row
    with open(file_path + file_name, 'w', newline='') as daily_water_level_data_CSV:
        csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n',fieldnames=csv_headers)
        print('Writing CSV header now...')
        csvwriter.writeheader() #Write headers in CSV file
        for item in list_of_water_readings: #iterate over the appended data and save them in to the CSV file.
            print('Writing data rows now..')
            print(item)
            csvwriter.writerow(item)
scrap_daily_water_level()
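As a hedged variation on the script above, the query string can be built with urllib.parse.urlencode instead of string concatenation, and the rows can be written to any file-like object. This is only a sketch: the helper names build_api_url and write_readings are made up for illustration, the sample payload is invented, and only the URL path and field names are taken from the script above. Note that urlencode percent-encodes the slashes in the dates; that the server accepts them in this form is an assumption.

```python
import csv
import io
from urllib.parse import urlencode

# Base path copied from the answer above
API_BASE = "https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/"

# JSON keys used in the answer above (a subset, for brevity)
FIELDS = [
    "gw_logid",
    "record_date",
    "waterlevel_ft_above_mean_sea_level",
    "waterlevel_ft_below_land_surface",
]

def build_api_url(well_no, start_date, end_date):
    """Hypothetical helper: assemble the daily-mean URL for one well."""
    query = urlencode({
        "start_date": start_date,
        "end_date": end_date,
        "reviewed_status": "",
        "restrict_to_owrd_only": "n",
    })
    return f"{API_BASE}{well_no}/gw_recorder_water_level_daily_mean_public/?{query}"

def write_readings(feature_list, out):
    """Hypothetical helper: write the selected fields of each reading as CSV."""
    writer = csv.DictWriter(
        out, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(feature_list)

# Invented sample shaped like json_result['feature_list'] (values are not real data)
sample = [{
    "gw_logid": "HARN0052657",
    "record_date": "2020-01-01T00:00:00",
    "waterlevel_ft_above_mean_sea_level": 4100.5,
    "waterlevel_ft_below_land_surface": 12.3,
    "reviewed_status": "W",  # not in FIELDS, so it is ignored
}]

url = build_api_url("HARN0052657", "1/1/1905", "12/30/2050")
buffer = io.StringIO()
write_readings(sample, buffer)
csv_text = buffer.getvalue()
print(url)
print(csv_text)
```

Passing an open file instead of the StringIO buffer writes the same rows to disk.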

Related

How do I put my scraped data into a data frame

Please, I need help. I am having trouble putting my scraped data into a data frame with 3 columns, i.e. date, source, and the keywords extracted from each scraped website, for further text analysis. My code is borrowed from https://stackoverflow.com/users/12229253/foreverlearning and is given below:
from newspaper import Article
import nltk
nltk.download('punkt')
urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']
results = {}
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    results[url] = article
for url in urls:
    print(url)
    article = results[url]
    print(article.authors)
    print(article.publish_date)
    print(article.keywords)
I played around with it, and here is how you can put it into a data frame, assuming that you wanted to use pandas in the first place:
import nltk
import pandas as pd
from newspaper import Article
nltk.download('punkt')
urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']
# create a data frame with the needed columns
saved_data = pd.DataFrame(columns=['Date', 'Source', 'KeyWords'])
# put into a data frame that has 3 columns i.e. date, source and keywords
def add_data_to_df(urls, saved_data):
    records = []
    for url in urls: # process each url separately
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        # create a row with the data you need using attributes
        records.append({'Date': article.publish_date, 'Source': url, 'KeyWords': article.keywords})
    # DataFrame.append was removed in pandas 2.0, so add all new rows with a single pd.concat
    saved_data = pd.concat([saved_data, pd.DataFrame(records)], ignore_index=True)
    return saved_data
Now, when you run this function with add_data_to_df(urls, saved_data), you should see a data frame with contents similar to the ones I got below during testing:
   Date                       Source                                             KeyWords
0  2022-02-02 00:00:00        https://dailypost.ng/2022/02/02/securing-niger...  [nigeria, securing, oyo, state, senator, prote...
1  2021-09-30 04:25:24+00:00  https://guardian.ng/news/declare-bandits-as-te...  [shutdown, terrorists, nigeria, guardian, decl...
2  2021-10-24 00:00:00        https://www.thisdaylive.com/index.php/2021/10/...  [terrorists, nigerian, declare, state, militar...
3  2021-10-05 14:41:48+00:00  https://punchng.com/rep-wants-buhari-to-name-l...  [president, buhari, house, national, lawmaker,...
4  2021-07-31 00:30:47+00:00  https://punchng.com/national-assembly-plans-to...  [plans, congress, deal, nigeria, nigerian, rig...
(Sorry for the format, I am showing the output as plain text since I am not allowed to attach screenshots, but you will have a nice pandas format)
Edit: adding a function to save the data frame to a csv file. Note that this is one of the shortest ways of doing this, and it will save the file to the current working directory, i.e., where you are executing your code:
# this function saves given data to csv
def save_to_csv(saved_data):
    saved_data.to_csv('output.csv', index=False, sep=',')
# process the articles and create a csv
save_to_csv(add_data_to_df(urls, saved_data))
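A minimal sketch of the same idea without any network access, assuming the per-article values have already been collected as plain dictionaries. The sample records are invented, and joining the keyword list into a single string is an optional choice of this sketch, not something the answer above does.

```python
import io
import pandas as pd

# Invented records standing in for (publish_date, url, keywords) of parsed articles
records = [
    {"Date": "2022-02-02", "Source": "https://example.com/a", "KeyWords": ["nigeria", "securing"]},
    {"Date": "2021-10-24", "Source": "https://example.com/b", "KeyWords": ["terrorists", "declare"]},
]

# Build the frame once from the list of rows rather than growing it row by row
df = pd.DataFrame(records, columns=["Date", "Source", "KeyWords"])

# Optional: store the keywords as one delimited string so a CSV cell is easy to split later
df["KeyWords"] = df["KeyWords"].apply("; ".join)

# Write to an in-memory buffer here; pass a file name to to_csv to write to disk instead
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Building the frame from a list of records in one call avoids copying the whole frame for every new row.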

Import Balance Sheet in an automatic organized manner from SEC to Dataframe

I am looking at getting the Balance Sheet data automatically and properly organized for any company using Beautiful Soup.
I am not planning on getting each variable, but rather the whole balance sheet. Originally, I tried writing various scripts to extract the URL for a particular company of my choice.
For Example, suppose I want to get the Balance Sheet data from the following URL:
URL1:'https://www.sec.gov/Archives/edgar/data/1418121/000118518520000213/aple20191231_10k.htm'
or from
URL2:'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000046/form8-k03312020earnings.htm'
I am trying to write a function (call it get_balancesheet(URL)) such that, regardless of the URL, you get a DataFrame that contains the balance sheet in an organized manner.
# Import libraries
import requests
import re
from bs4 import BeautifulSoup
I wrote the following function that needs a lot of improvement
def Get_Data_Balance_Sheet(url):
    page = requests.get(url)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(page.content)
    futures1 = soup.find_all(text=re.compile('CONSOLIDATED BALANCE SHEETS'))
    Table=[]
    for future in futures1:
        for row in future.find_next("table").find_all("tr"):
            t1=[cell.get_text(strip=True) for cell in row.find_all("td")]
            Table.append(t1)
    # Remove list from list of lists if list is empty
    Table = [x for x in Table if x != []]
    return Table
Then I execute the following
url='https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm'
Tab=Get_Data_Balance_Sheet(url)
Tab
Note that this is not the result I am after. The goal is not simply to put the output in a dataframe; the function needs to change so that we get the balance sheet regardless of which URL is used.
Well, this being EDGAR it's not going to be simple, but it's doable.
First things first - with the CIK you can extract specific filings of specific types made by the CIK filer during a specific period. So let's say you are interested in Forms 10-K and 10-Q, original or amended (as in "FORM 10-K/A", for example), filed by this CIK filer from 2019 through 2020.
start = 2019
end = 2020
cik = 220000320193 #Apple's CIK is 0000320193; the leading 22 completes the %22 (an encoded quote) in the url below
short_cik = str(cik)[-6:] #we will need it later to form urls
First we need to get a list of filings meeting these criteria and load it into beautifulsoup:
import requests
from bs4 import BeautifulSoup as bs
url = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"
req = requests.get(url)
soup = bs(req.text,'lxml')
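The leading 22 in cik above is not part of the company identifier: together with the % in front of {cik} it forms %22, the URL-encoded double quote. A minimal sketch of an equivalent construction that keeps the CIK as a plain ten-digit string follows. The helper name build_search_url is made up for illustration, and the query format is copied from the answer rather than from any EDGAR documentation.

```python
def build_search_url(cik, start, end):
    """Hypothetical helper: same search URL as above, with both quotes written out."""
    return (
        "https://www.sec.gov/cgi-bin/srch-edgar"
        f"?text=cik%3D%22{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)"
        f"&first={start}&last={end}"
    )

cik = "0000320193"           # ten-digit CIK kept as a string so the leading zeros survive
short_cik = cik.lstrip("0")  # "320193", the form used in the archive paths

url = build_search_url(cik, 2019, 2020)

# What the original f-string produces with cik = 220000320193
original = (
    "https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%220000320193%22"
    "+AND+form-type%3D(10-q*+OR+10-k*)&first=2019&last=2020"
)
print(url == original)
```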
There are 8 filings meeting the criteria: two Form 10-K and 6 Form 10-Q. Each of these filings has an accession number. The accession number is hiding in the url of each of these filings and we need to extract it to get to the actual target - the Excel file which contains the financial statements which are attached to each specific filing.
acc_nums = []
for link in soup.select('td>a[href]'):
    target = link['href'].split(short_cik,1)
    if len(target)>1:
        acc_num = target[1].split('/')[1]
        if acc_num not in acc_nums: #we need this filter because each filing has two forms: text and html, with the same accession number
            acc_nums.append(acc_num)
At this point, acc_nums contains the accession number for each of these 8 filings. We can now download the target Excel file. Obviously, you can loop through acc_nums and download all 8, but let's say you are only looking for (randomly) the Excel file attached to the third filing:
fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_nums[2]}/Financial_Report.xlsx"
fs = requests.get(fs_url)
with open('random_edgar.xlsx', 'wb') as output:
    output.write(fs.content)
And there you'll have more than you'll ever want to know about Apple's financials at that point in time...
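The accession-number step can be tried in isolation, without contacting EDGAR. The sketch below repeats the splitting logic from the loop above on a handful of invented hrefs. The function name is hypothetical, and the /Archives/edgar/data/<cik>/<accession>/<file> layout is an assumption inferred from the URLs used in this answer.

```python
def extract_accession_numbers(hrefs, short_cik):
    """Hypothetical helper: collect unique accession numbers from filing links."""
    acc_nums = []
    for href in hrefs:
        # Split once on the CIK; links that do not contain it yield a single part
        parts = href.split(short_cik, 1)
        if len(parts) > 1:
            acc_num = parts[1].split("/")[1]
            # Text and HTML versions of a filing share one accession number
            if acc_num not in acc_nums:
                acc_nums.append(acc_num)
    return acc_nums

# Invented hrefs (the accession numbers are not real filings)
hrefs = [
    "/Archives/edgar/data/320193/000111111120000001/form10q.htm",
    "/Archives/edgar/data/320193/000111111120000001/filing.txt",
    "/Archives/edgar/data/320193/000111111119000002/form10k.htm",
    "/cgi-bin/browse-edgar?action=getcompany",
]

acc_nums = extract_accession_numbers(hrefs, "320193")
print(acc_nums)
```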

python3 - how to scrape the data from span

I try to use python3 and BeautifulSoup.
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.binance.com/pl"
#get the data
data = requests.get(url);
soup = BeautifulSoup(data.text,'lxml')
print(soup)
If I open the HTML code (in the browser) I can see:
[screenshot: html code in browser]
But in my data (printed in the console) I can't see the BTC price:
[screenshot: what data I can't see in the console]
Could you give me some advice on how to scrape this data?
Use .findAll() to find all the rows, and then you can use it to find all the cells in a given row. You have to look at how the page is structured. It's not a standard row, but a bunch of divs made to look like a table. So you have to look at the role of each div to get to the data you want.
I'm assuming that you're going to want to look at specific rows, so my example uses the Para column to find those rows. Since the star is in its own little cell, the Para column is the second cell, or index 1. With that, it's just a question of which cells you want to export.
You could take out the filter if you want to get everything. You can also modify it to see if the value of a cell is above a certain price point.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
# Ignore the insecure warning
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Set options and which rows you want to look at
url = "https://www.binance.com/pl"
desired_rows = ['ADA/BTC', 'ADX/BTC']
# Get the page and convert it into beautiful soup
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all table rows
rows = soup.findAll('div', {'role':'row'})
# Process all the rows in the table
for row in rows:
    try:
        # Get the cells for the given row
        cells = row.findAll('div', {'role':'gridcell'})
        # Convert them to just the values of the cell, ignoring attributes
        cell_values = [c.text for c in cells]
        # see if the row is one you want
        if cell_values[1] in desired_rows:
            # Output the data however you'd like
            print(cell_values[1], cell_values[-1])
    except IndexError: # there was a row without cells
        pass
This resulted in the following output:
ADA/BTC 1,646.39204255
ADX/BTC 35.29384873
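The threshold variation mentioned above could look roughly like the sketch below. It works on lists of cell values like cell_values in the loop, so it runs without fetching the page. The helper names are hypothetical, the middle cells and the header-like row are invented, and treating the last cell as the number to compare is an assumption carried over from the print call above.

```python
def parse_number(text):
    """Convert a displayed value such as '1,646.39204255' to a float."""
    return float(text.replace(",", ""))

def rows_above(rows, threshold, pair_index=1, value_index=-1):
    """Hypothetical helper: keep (pair, value) for rows whose value exceeds threshold."""
    selected = []
    for cell_values in rows:
        try:
            value = parse_number(cell_values[value_index])
            pair = cell_values[pair_index]
        except (IndexError, ValueError):
            # Skip rows without cells and rows whose last cell is not numeric
            continue
        if value > threshold:
            selected.append((pair, value))
    return selected

# Cell values shaped like cell_values above; only the first and last columns matter here
rows = [
    ["", "ADA/BTC", "0.00001", "1,646.39204255"],
    ["", "ADX/BTC", "0.00002", "35.29384873"],
    ["", "Para", "Cena", "Wolumen"],  # header-like row, skipped via ValueError
    [],                               # row without cells, skipped via IndexError
]

result = rows_above(rows, 100)
print(result)
```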

How to extract data from multiple dt and dd tags in tabled form (within a looped statement) using python v3 beautiful soup v4?

Source:
I’ve only chosen one year for simplicity but my intention is for all years (n=117).
https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/
(2018 only)
https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/2018/
Resources:
I’ve found 2 blogs and 2 stack overflow forums that have steered my attempts to replicate their work but my lack of experience and the uniqueness of the website and task has proven difficult. I’ve tried next_siblings a little but to no success.
Blog #1: Extract tabled data as a table:
https://journalistsresource.org/tip-sheets/research/python-scrape-website-data-criminal-justice
https://gist.github.com/phillipsm/404780e419c49a5b62a8
Blog #2: Extract data from tags into a table
https://www.dataquest.io/blog/web-scraping-beautifulsoup/
Stack overflow forum #1:
Using BeautifulSoup to extract specific dl and dd list elements
Stack overflow forum #2:
Use BeautifulSoup to get a value after a specific tag
Problems encountered:
1. Each year’s publications have different “Additional Publication Details”. To help with this, I ran the code I have and compiled (not in tabled form) the unique dt tag text headers to make sure all are captured for 2018 (pasted below). But doing this for all years would take time…right? I'll add them in a comment if necessary.
2. For statements…I find I keep having to nest “for” statements to reach the final webpage where the publication details live (a minimum of 2 links). This seems to restrict what and how I can return data, and unless I limit the repeated returns ([:1]), my code can very easily fail (whether because of the source server or something else).
3. I have to first extract dt element text, then extract dd element text.
Code:
(commented out dt element grab and print statements are only for my record keeping of what’s being done. Again, I compiled unique dt element text headers for reference…see comment above. Apologize upfront if my code is ‘dizzying’…)
import requests
from bs4 import BeautifulSoup
import csv
import re
import time
url = 'https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report'
url2 = 'https://pubs.er.usgs.gov'
response = requests.get(url)
data = response.text
pubti_links = []
soup = BeautifulSoup(data, "html.parser")
type(soup)
year_containers = soup.findAll('li',{'class':'pubs-browse-list-theme'})
for year in year_containers[:1]:
    for a in soup.findAll('a'):
        if '/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/2018' in a['href']:
            link_containers = a.get('href')
            #print (link_containers)
            pubti_links = url2 + link_containers
            #print (pubti_links)
            for pubti_link in pubti_links[:1]:
                response2 = requests.get(pubti_links)
                soup2 = BeautifulSoup(response2.text, "html.parser")
                time.sleep(2)
                for elm in soup2.find_all('li',{'class':'pubs-browse-list-theme'}):
                    for a_elm in elm.findAll('a'):
                        #print(a.get('href'))
                        pub_containers = a_elm.get('href')
                        pubdetails_links = url2 + pub_containers
                        response3 = requests.get(pubdetails_links)
                        soup3 = BeautifulSoup(response3.text, "html.parser")
                        pubdetail_containers = soup3.findAll('dd',{'class':["" "","dark"]})
                        dd_data = soup3.findAll('dd',{'class':["" "","dark"]})
                        #dt_data = soup3.findAll('dt',{'class':["" "","dark"]})
                        for dd_item in dd_data:
                            print(dd_item.string)
                        #for dt_item in dt_data:
                            #print (dt_item.string)
Desired result (the goal is to create a table of all USGS publication for each year):
[image: Output Table example]
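One possible way to pair each dt with its dd, sketched under assumptions: the HTML snippets below are invented and only imitate a dl of publication details, and the helper name details_to_row is hypothetical. Collecting each publication as a dict and taking the union of all headers lets csv.DictWriter cope with publications whose sets of details differ, without compiling the headers by hand first.

```python
import csv
import io
from bs4 import BeautifulSoup

def details_to_row(html):
    """Hypothetical helper: map each <dt> label to the text of the <dd> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    row = {}
    for dt in soup.find_all("dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            row[dt.get_text(strip=True)] = dd.get_text(" ", strip=True)
    return row

# Invented markup imitating two publication detail pages with different fields
page_a = (
    "<dl><dt>Publication type</dt><dd>Report</dd>"
    "<dt>Year Published</dt><dd class='dark'>2018</dd></dl>"
)
page_b = (
    "<dl><dt>Publication type</dt><dd>Report</dd>"
    "<dt>DOI</dt><dd>10.0000/example</dd></dl>"
)

rows = [details_to_row(page_a), details_to_row(page_b)]

# Union of all headers in first-seen order, so no column is lost
headers = []
for row in rows:
    for key in row:
        if key not in headers:
            headers.append(key)

# restval fills in the cells a publication has no value for
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

In a full scraper, details_to_row would be called once per publication page and the rows accumulated across years before the headers are computed.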

Scrape data from webpage under 'popup' box

I'm trying to scrape data from a website. The problem is the data is only visible when the mouse pointer is hovering over it...
On the following page, I would like to extract the historical congestion levels (bottom right, shown when hovering over e.g. 2011):
https://www.tomtom.com/en_gb/trafficindex/city/mexico-city
I'm somewhat familiar with Beautiful Soup. Any ideas on how to tackle this, if it is possible at all?
Many thanks and sorry for the high level question, but I wanted to check feasibility before diving into it.
I think the best approach here is to request the json file (/en_gb/trafficindex/data.json) directly.
The file contains a list of 390 items, one for each city. You could create a dictionary from this list with 'cityCode' as keys and 'congestionHistory' as values, and access the data by city code.
An example with requests:
import requests
url = "https://www.tomtom.com/en_gb/trafficindex/data.json"
r = requests.get(url)
data = r.json()
congestion_data = {
    i['cityTraffic']['cityCode']: i['cityTraffic']['congestionHistory']
    for i in data
}
print(congestion_data['MEX'])
[{'year': 2010, 'congestion': 57}, {'year': 2011, 'congestion': 59}, ...
And saving it as a csv file:
import csv
with open('my_file.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['city_code', 'congestion_history'])
    for k,v in congestion_data.items():
        w.writerow((k, ', '.join('{1}:{0}'.format(*i.values()) for i in v)))
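The format call above relies on every history entry having exactly the keys year and congestion, in that order. A hedged alternative sketch writes one row per city and year and reads the keys by name, which is also easier to load for plotting. The function name is hypothetical, and the sample is invented apart from the two MEX entries shown in the output above.

```python
import csv
import io

def write_long_format(congestion_data, out):
    """Hypothetical helper: write one CSV row per (city_code, year)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["city_code", "year", "congestion"])
    for city_code, history in congestion_data.items():
        for entry in history:
            writer.writerow([city_code, entry["year"], entry["congestion"]])

# Sample in the shape of congestion_data above; the XYZ city is invented
sample = {
    "MEX": [{"year": 2010, "congestion": 57}, {"year": 2011, "congestion": 59}],
    "XYZ": [{"year": 2011, "congestion": 30}],
}

buffer = io.StringIO()
write_long_format(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```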

Resources