python3 - how to scrape the data from span - python-3.x

I am trying to use Python 3 and BeautifulSoup.
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.binance.com/pl"
#get the data
data = requests.get(url)
soup = BeautifulSoup(data.text,'lxml')
print(soup)
If I open the HTML code in the browser, I can see:
[screenshot: HTML code in browser]
But in my data (printed to the console) I can't see the BTC price:
[screenshot: the data I can't see in the console]
Could you give me some advice on how to scrape this data?

Use .findAll() to find all the rows, and then use it again to find all the cells in a given row. You have to look at how the page is structured. It's not a standard table, but a bunch of divs made to look like one. So you have to look at the role of each div to get to the data you want.
I'm assuming that you're going to want to look at specific rows, so my example uses the Para column to find those rows. Since the star is in its own little cell, the Para column is the second cell, or index 1. With that, it's just a question of which cells you want to export.
You could take out the filter if you want to get everything. You can also modify it to see if the value of a cell is above a certain price point.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
# Ignore the insecure warning
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Set options and which rows you want to look at
url = "https://www.binance.com/pl"
desired_rows = ['ADA/BTC', 'ADX/BTC']
# Get the page and convert it into beautiful soup
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all table rows
rows = soup.findAll('div', {'role':'row'})
# Process all the rows in the table
for row in rows:
    try:
        # Get the cells for the given row
        cells = row.findAll('div', {'role':'gridcell'})
        # Convert them to just the values of the cell, ignoring attributes
        cell_values = [c.text for c in cells]
        # see if the row is one you want
        if cell_values[1] in desired_rows:
            # Output the data however you'd like
            print(cell_values[1], cell_values[-1])
    except IndexError: # there was a row without cells
        pass
This resulted in the following output:
ADA/BTC 1,646.39204255
ADX/BTC 35.29384873
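The price-threshold variant mentioned above could look something like the sketch below. It assumes each row has already been reduced to a cell_values list as in the loop above, and that the price is the last cell and uses thousands separators. The sample rows, the helper names parse_price and rows_above, and the threshold are illustrative and not part of the original answer.

```python
# Hypothetical cell values, shaped like the cell_values lists built above
# (star cell, pair, price). Real rows may have more columns.
sample_rows = [
    ['', 'ADA/BTC', '1,646.39204255'],
    ['', 'ADX/BTC', '35.29384873'],
    [],  # a row without cells, which the answer also has to tolerate
]


def parse_price(text):
    """Convert a string such as '1,646.39' to a float."""
    return float(text.replace(',', ''))


def rows_above(rows, threshold):
    """Return (pair, price) tuples whose last cell exceeds the threshold."""
    matches = []
    for cell_values in rows:
        if len(cell_values) < 2:
            continue  # skip empty or header-like rows
        try:
            price = parse_price(cell_values[-1])
        except ValueError:
            continue  # last cell was not numeric
        if price > threshold:
            matches.append((cell_values[1], price))
    return matches


print(rows_above(sample_rows, 100))
```

Stripping the commas before calling float() matters here, because float('1,646.39204255') raises a ValueError.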

Related

Import Balance Sheet in an automatic organized manner from SEC to Dataframe

I am looking at getting the Balance Sheet data automatically and properly organized for any company using Beautiful Soup.
I am not planning on getting each variable but rather the whole Balance sheet. Originally, I was trying to do many codes to extract the URL for a particular company of my choice.
For Example, suppose I want to get the Balance Sheet data from the following URL:
URL1:'https://www.sec.gov/Archives/edgar/data/1418121/000118518520000213/aple20191231_10k.htm'
or from
URL2:'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000046/form8-k03312020earnings.htm'
I am trying to write a function (suppose it is known as get_balancesheet(URL) ) such that regardless of the URL you will get the Dataframe that contains the balance sheet in an organized manner.
# Import libraries
import requests
import re
from bs4 import BeautifulSoup
I wrote the following function that needs a lot of improvement
def Get_Data_Balance_Sheet(url):
    page = requests.get(url)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(page.content)
    futures1 = soup.find_all(text=re.compile('CONSOLIDATED BALANCE SHEETS'))
    Table=[]
    for future in futures1:
        for row in future.find_next("table").find_all("tr"):
            t1=[cell.get_text(strip=True) for cell in row.find_all("td")]
            Table.append(t1)
    # Remove list from list of lists if list is empty
    Table = [x for x in Table if x != []]
    return Table
Then I execute the following
url='https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm'
Tab=Get_Data_Balance_Sheet(url)
Tab
Note that this is not what I ultimately want. The goal is not simply to put the result in a DataFrame; the function needs to change so that it returns the balance sheet regardless of which URL is given.
Well, this being EDGAR it's not going to be simple, but it's doable.
First things first - with the CIK you can extract specific filings of specific types made by the CIK filer during a specific period. So let's say you are interested in Forms 10-K and 10-Q, original or amended (as in "FORM 10-K/A", for example), filed by this CIK filer from 2019 through 2020.
start = 2019
end = 2020
cik = 220000320193
short_cik = str(cik)[-6:] #we will need it later to form urls
First we need to get a list of filings meeting these criteria and load it into beautifulsoup:
import requests
from bs4 import BeautifulSoup as bs
url = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"
req = requests.get(url)
soup = bs(req.text,'lxml')
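The cik value above is easy to misread. Judging only from how the f-string is assembled, the leading 22 joins the % in the URL to form %22 (a URL-encoded double quote), so the search term is the quoted CIK 0000320193. The sketch below spells that out and checks that both forms produce the same string. build_search_url is a hypothetical helper name, and no request is made.

```python
def build_search_url(cik, start, end):
    """Build the search URL used above from a plain integer CIK.

    The CIK is zero-padded to 10 digits and wrapped in %22
    (URL-encoded double quotes) explicitly.
    """
    padded = f"{cik:010d}"
    return (
        "https://www.sec.gov/cgi-bin/srch-edgar"
        f"?text=cik%3D%22{padded}%22+AND+form-type%3D(10-q*+OR+10-k*)"
        f"&first={start}&last={end}"
    )


# The URL exactly as the answer builds it, for comparison
start = 2019
end = 2020
cik = 220000320193
original = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"

explicit = build_search_url(320193, start, end)
print(explicit == original)
```

With the explicit form, short_cik can be taken directly as str(320193) instead of slicing the last six characters.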
There are 8 filings meeting the criteria: two Form 10-K and 6 Form 10-Q. Each of these filings has an accession number. The accession number is hiding in the url of each of these filings and we need to extract it to get to the actual target - the Excel file which contains the financial statements which are attached to each specific filing.
acc_nums = []
for link in soup.select('td>a[href]'):
    target = link['href'].split(short_cik,1)
    if len(target)>1:
        acc_num = target[1].split('/')[1]
        if acc_num not in acc_nums: #we need this filter because each filing has two forms: text and html, with the same accession number
            acc_nums.append(acc_num)
At this point, acc_nums contains the accession number for each of these 8 filings. We can now download the target Excel file. Obviously, you can loop through acc_nums and download all 8, but let's say you are only looking for (randomly) the Excel file attached to the third filing:
fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_nums[2]}/Financial_Report.xlsx"
fs = requests.get(fs_url)
with open('random_edgar.xlsx', 'wb') as output:
    output.write(fs.content)
And there you'll have more than you'll ever want to know about Apple's financials at that point in time...
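Looping over every filing, as suggested above, might be sketched like this. The hrefs are made-up examples shaped like EDGAR archive links, and extract_accession_numbers and report_urls are hypothetical helper names. The Financial_Report.xlsx path follows the pattern used in the answer. Nothing is downloaded here; the sketch only builds the URLs.

```python
def extract_accession_numbers(hrefs, short_cik):
    """Collect unique accession numbers, preserving first-seen order."""
    acc_nums = []
    for href in hrefs:
        target = href.split(short_cik, 1)
        if len(target) > 1:
            acc_num = target[1].split('/')[1]
            if acc_num not in acc_nums:
                acc_nums.append(acc_num)
    return acc_nums


def report_urls(short_cik, acc_nums):
    """Build one Financial_Report.xlsx URL per accession number."""
    base = "https://www.sec.gov/Archives/edgar/data"
    return [f"{base}/{short_cik}/{acc}/Financial_Report.xlsx" for acc in acc_nums]


# Invented hrefs: two documents for the first filing, one for the second,
# and one unrelated link that does not contain the CIK.
sample_hrefs = [
    '/Archives/edgar/data/320193/000000000020000001/doc.htm',
    '/Archives/edgar/data/320193/000000000020000001/doc.txt',
    '/Archives/edgar/data/320193/000000000020000002/doc.htm',
    '/cgi-bin/browse-edgar?action=getcompany',
]

acc_nums = extract_accession_numbers(sample_hrefs, '320193')
urls = report_urls('320193', acc_nums)
for url in urls:
    print(url)
```

Each URL could then be passed to requests.get() and written to its own file, in the same way as the single download above.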

Scraping table data with BeautifulSoup or Pandas

I'm somewhat new to Python and I've been given a task that requires scraping data from a table. I do not know much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples use what appears to be a simpler webpage layout than the one I'm dealing with. I've tried quite a few methods, but none of them allow me to select the table data that I need.
How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?
url = "https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657"
I've tried using the methods in the following links and others not shown here:
Beautiful Soup Scraping table
Scrape table with BeautifulSoup
Web scraping with BeautifulSoup
Some of the script I've tried:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table") # {"class": "xxxx"})
I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the webpage that has the basic well information:
import pandas as pd
df_list = pd.read_html(url)
df_list
Unfortunately the data I need doesn't even show up when I run this script and the table I'm trying to select doesn't have a class that I can use to select only that table and not the table of basic well information. I've inspected the webpage, but can't seem to find a way to get to the correct table.
As far as the final result would look, I would need to export it as a csv or as a pandas data frame so that I can then graph it with modeled groundwater data for comparison purposes. Any suggestions would be greatly appreciated!
Try the approach below using Python requests. It is simple, fast, and needs little code. I found the API URL on the website itself by inspecting the Network tab of the Google Chrome developer tools.
What the script below does:
First, it takes the API URL and makes a GET request with the dynamic parameters (in caps). You can change the well number and the start and end dates to get the desired result.
After getting the data, the script parses the JSON using json.loads.
It then iterates over the list of daily water level data and builds a list of data points that can be used to create a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.
Finally, it writes the headers and data to the CSV file. (Important: make sure to set the file path in the file_path variable.)
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
def scrap_daily_water_level():
    file_path = '' #Input File path here
    file_name = 'daily_water_level_data.csv' #File name
    #CSV headers
    csv_headers = ['Line #','GW Log Id','GW Site Id', 'Land Surface Elevation', 'Record Date','Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
    list_of_water_readings = []
    #Dynamic Params
    WELL_NO = 'HARN0052657'
    START_DATE = '1/1/1905'
    END_DATE = '12/30/2050'
    #API URL
    URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'
    response = requests.get(URL,verify=False) #GET API call
    json_result = json.loads(response.text) #JSON loads to parse JSON data
    print('Daily water level data count ',json_result['feature_count']) # Prints no. of data counts
    extracted_data = json_result['feature_list'] #Extracted data in JSON form
    for idx, item in enumerate(extracted_data): #Iterate over the list of extracted data
        list_of_water_readings.append({ #append and create list of data with headers for further usage
            'Line #': idx + 1,
            'GW Log Id' : item['gw_logid'],
            'GW Site Id': item['gw_site_id'],
            'Land Surface Elevation': item['land_surface_elevation'],
            'Record Date': item['record_date'],
            'Restrict to OWRD only': item['restrict_to_owrd_only'],
            'Reviewed Status':item['reviewed_status'],
            'Reviewed Status Description': item['reviewed_status_description'],
            'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
            'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
        })
    #Create CSV and write data in to it.
    with open(file_path + file_name ,'a+') as daily_water_level_data_CSV: #Open file in a+ mode
        csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n',fieldnames=csv_headers)
        print('Writing CSV header now...')
        csvwriter.writeheader() #Write headers in CSV file
        for item in list_of_water_readings: #iterate over the appended data and save them in to the CSV file.
            print('Writing data rows now..')
            print(item)
            csvwriter.writerow(item)
scrap_daily_water_level()
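Since the goal was a CSV or a pandas DataFrame, the conversion step can be tried on its own without the network. The sketch below assumes the response has the feature_list key and the field names used above. The payload values are invented for illustration, and readings_to_csv is a hypothetical helper.

```python
import csv
import io
import json

# Invented payload with the same shape the script above expects
sample_response = json.dumps({
    "feature_count": 2,
    "feature_list": [
        {"gw_logid": "HARN0052657", "record_date": "2020-01-01T00:00:00",
         "waterlevel_ft_below_land_surface": 12.5},
        {"gw_logid": "HARN0052657", "record_date": "2020-01-02T00:00:00",
         "waterlevel_ft_below_land_surface": 12.7},
    ],
})


def readings_to_csv(json_text, fields):
    """Return CSV text with one row per item in feature_list."""
    items = json.loads(json_text)["feature_list"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n",
                            extrasaction="ignore")  # drop keys not in fields
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


fields = ["gw_logid", "record_date", "waterlevel_ft_below_land_surface"]
csv_text = readings_to_csv(sample_response, fields)
print(csv_text)
```

The resulting text can be written to a file, or passed to pandas.read_csv(io.StringIO(csv_text)) if a DataFrame is preferred.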

Python, Beautifulsoup - Extracting strings from tags based on items in list

I am trying to scrape the site https://www.livechart.me/winter-2019/tv to get the number of episodes that have currently aired for certain shows this season. I do this by extracting the "episode-countdown" tag data which gives something like "EP11:" then a timestamp after it, and then I slice that string to only give the number (in this case the "11") and then subtract by 1 to get how many episodes have currently aired (as the timestamp is for when EP11 will air).
I have a list of the different shows I am watching this season in order to filter what shows I extract the episode-countdown strings for instead of extracting the countdown for every show airing. The big problem I am having is that the "episode-countdown" strings are not in the same order as my list of shows I am watching. For example, if my list is [show1, show2, show3, show4], I might get the "episodes-countdown" string tag in the order of show3, show4, show1, show2 if they are listed in that order on the website.
My current code is as follows:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
def countdown():
    html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(html)
    soup = BeautifulSoup(page, 'html.parser')
    shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
    for tag in soup.find_all('article', attrs={'class': 'anime'}):
        if any(x in tag['data-romaji'] for x in shows):
            rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
            r2 = rlist[:rlist.index(":")][2:]
            print('{} has aired {} episodes so far'.format(tag['data-romaji'], int(r2)-1))
Each show listed on the website is inside of an "article" tag so for every show in the soup.find_all() statement, if the "data-romaji" (the name of the show listed on the website) matches a show in my "shows" list, then I extract the "episode-countdown" string and then slice the string to just the number as previously explained and then print to make sure I did it correctly.
If you go to the website, the order that the shows are listed are "Yakusoku no Neverland", "Mob Psycho", "Dororo", and "Jojo" which is the order that you get the episode-countdown strings in if you run the code. What I want to do is have it in order of my "shows" list so that I have a list of shows and a list of episodes aired that match each other. I want to add the episodes aired list as a column in a pandas dataframe I am currently building so having it not match the "shows" column would be a problem.
Is there a way for me to extract the "episode-countdown" string based on the order of my "shows" list instead of the order used on the website (if that makes sense)?
Is this what you're looking for?
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(html)
soup = BeautifulSoup(page, 'html.parser')
shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
master = []
for show in shows:
    for tag in soup.find_all('article', attrs={'class': 'anime'}):
        show_info = []
        if show in tag['data-romaji']:
            show_info.append(tag['data-romaji'])
            rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
            r2 = rlist[:rlist.index(":")][2:]
            show_info.append(r2)
            master.append(show_info)
df=pd.DataFrame(master,columns=['Show','Episodes'])
df
Output:
   Show                                     Episodes
0  Jojo no Kimyou na Bouken: Ougon no Kaze  23
1  Dororo                                   11
2  Mob Psycho 100 II                        11
3  Yakusoku no Neverland                    11
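The Episodes column above holds the number of the upcoming episode; the question subtracts 1 to get the episodes already aired. An alternative sketch walks the scraped data once, stores the results in a dictionary, and then reads them back in the order of shows. The scraped pairs stand in for (tag['data-romaji'], countdown text) and their countdown strings are invented. aired_episodes is a hypothetical helper, and titles are matched exactly rather than by substring as in the code above.

```python
import re

shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo',
         'Mob Psycho 100 II', 'Yakusoku no Neverland']

# Invented (title, countdown text) pairs, in website order
scraped = [
    ('Yakusoku no Neverland', 'EP11: 2d 03h 10m 05s'),
    ('Mob Psycho 100 II', 'EP11: 4d 01h 00m 00s'),
    ('Some Other Show', 'EP5: 1d 00h 00m 00s'),
    ('Dororo', 'EP11: 4d 05h 30m 00s'),
    ('Jojo no Kimyou na Bouken: Ougon no Kaze', 'EP23: 0d 12h 00m 00s'),
]


def aired_episodes(countdown_text):
    """'EP11: ...' means episode 11 is next, so 10 have aired."""
    match = re.match(r'EP(\d+):', countdown_text)
    return int(match.group(1)) - 1 if match else None


# One pass over the scraped data, keyed by title
aired_by_title = {title: aired_episodes(text) for title, text in scraped}

# Read the results back in the order of the shows list
episodes = [aired_by_title.get(show) for show in shows]
print(list(zip(shows, episodes)))
```

Because episodes is built by iterating over shows, it lines up with that list and can be added directly as a DataFrame column. A show missing from the page yields None instead of shifting the other rows.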

Taking Average of List of Integers

I'm scraping a list of daily stock volume numbers, and I'm wanting to take an average of the first 20 results in the volume column of the page. My code looks like:
from bs4 import BeautifulSoup
import re, csv, random, time
import pandas as pd
import os
import requests
page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.select('table[class="W(100%) M(0)"] tr')
for row in rows[1:20]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    print(numbers2)
avg20vol = sum(numbers2(1,20))/len(numbers2)
...but I'm getting stuck when trying to take the average of the returned numbers2. I receive either "TypeError: 'int' object is not callable" or "TypeError: 'int' object is not iterable" with the solutions I've tried. How do I handle taking an average of a list? Does it involve turning it into a dataframe first? Thanks!
UPDATE
Here's a working example of the applicable code segment:
numberslist=[]
for row in rows[1:21]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    numberslist.append(numbers2)
    print(numbers2)
average = sum(numberslist)/len(numberslist)
print('Average = ',average)
When scraping, actually create a list of numbers, like so:
# stuff before
number_list = [] # empty list
for row in rows[1:20]:
    # get the number
    number_list.append(int(number_as_string)) # add the new number at the end of the list
average = sum(number_list)/len(number_list)
You can also .append() the string forms and then convert them to ints with list(map(int, list_of_strings)) or [int(x) for x in list_of_strings].
Note: rows[1:20] leaves out the first item, which in your case, as you stated, is the header row. It also stops before index 20, so it yields only 19 rows; use rows[1:21] to get 20 data rows after the header. Use rows[:20] to get the first 20 items in general.
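Both conversion routes can be checked on a few sample strings. This is a minimal sketch; the volume strings below are illustrative values in the comma-separated format the page uses, not freshly scraped data.

```python
# Sample volume strings with thousands separators
volume_strings = ['650,100', '370,500', '374,700']

# Strip the commas first; int('650,100') would raise a ValueError
cleaned = [s.replace(',', '') for s in volume_strings]

# Two equivalent ways to convert the strings to integers
volumes = [int(x) for x in cleaned]
volumes_via_map = list(map(int, cleaned))

average = sum(volumes) / len(volumes)
print(average)
```

The original error came from calling sum() and len() on a single int; both need the list that collects every value.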
Your CSS selector is also wrong and gave me an error.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find('table',class_="W(100%) M(0)").find_all('tr')
numbers=[]
for row in rows[1:20]:
    col = row.find_all("td")
    print(col[6].text)
    number = col[6].text.replace(',', '')
    number = int(number)
    numbers.append(number)
avg20vol =sum(numbers)/len(numbers)
print("Average: ",avg20vol)
Output
650,100
370,500
374,700
500,700
452,500
1,401,800
2,071,200
1,005,800
441,500
757,000
901,200
563,400
1,457,000
637,100
692,700
725,000
709,000
1,155,500
496,400
Average: 808584.2105263158

Python & Beautiful Soup - Searching result strings

I am using Beautiful Soup to parse an HTML table.
Python version 3.2
Beautiful Soup version 4.1.3
I am running into an issue when trying to use the findAll method to find the columns within my rows. I get an error that says list object has no attribute findAll. I found this method through another post on stack exchange and this was not an issue there. (BeautifulSoup HTML table parsing)
I realize that findAll is a method of BeautifulSoup, not of Python lists. The weird part is that the findAll method works when I find the rows within the table (I only need the 2nd table on the page), but fails when I attempt to find the columns in the rows list.
Here's my code:
from urllib.request import URLopener
from bs4 import BeautifulSoup
opener = URLopener() #Open the URL Connection
page = opener.open("http://www.labormarketinfo.edd.ca.gov/majorer/countymajorer.asp?CountyCode=000001") #Open the page
soup = BeautifulSoup(page)
table = soup.findAll('table')[1] #Get the 2nd table (index 1)
rows = table.findAll('tr') #findAll works here
cols = rows.findAll('td') #findAll fails here
print(cols)
findAll() returns a result list. You need to loop over it, or pick one item, to get to a contained element with its own findAll() method:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td') # call findAll on the single row, not on the list
    print(cols)
or pick one row:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
cols = rows[0].findAll('td') # columns of the *first* row.
print(cols)
Note that findAll is deprecated; you should use find_all() instead.
