Scraping Yahoo finance historical stock prices - python-3.x

I am attempting to parse Yahoo Finance's historical stock price tables for various stocks using BeautifulSoup with Python. Here is the code:
import requests
import pandas as pd
import urllib
from bs4 import BeautifulSoup
tickers = ['HSBA.L', 'RDSA.L', 'RIO.L', 'BP.L', 'GSK.L', 'DGE.L', 'AZN.L', 'VOD.L', 'GLEN.L', 'ULVR.L']
url = 'https://uk.finance.yahoo.com/quote/HSBA.L/history?period1=1478647619&period2=1510183619&interval=1d&filter=history&frequency=1d'
request = requests.get(url)
soup = BeautifulSoup(request.text, 'lxml')
table = soup.find_all('table')[0]
n_rows = 0
n_columns = 0
column_name = []
for row in table.find_all('tr'):
    data = row.find_all('td')
    if len(data) > 0:
        n_rows += 1
        if n_columns == 0:
            n_columns = len(data)
    headers = row.find_all('th')
    if len(headers) > 0 and len(column_name) == 0:
        for header_names in headers:
            column_name.append(header_names.get_text())
new_table = pd.DataFrame(columns = column_name, index = range(0,n_rows))
row_index = 0
for row in table.find_all('tr'):
    column_index = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_index, column_index] = column.get_text()
        column_index += 1
    if len(columns) > 0:
        row_index += 1
The first time I ran the code, I had the interval set to exactly two years from November the 7th 2015 (with weekly prices). The issue is that the resulting data frame is 101 rows long, but I know for a fact it should be longer (106 rows). Then I tried changing the interval to the default one that the page opens with (which is daily), but I still got the same 101 rows, whereas the actual data is much larger. Is there anything wrong with the code, or is it something Yahoo Finance is doing?
Any help is appreciated, I'm really stuck here.

AFAIK, the API was shut down in May of 2017. Can you use Google Finance? If you can accept Excel as a solution, here is a link to a file that you can use to download all kinds of historical time series data:
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
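For anyone adjusting the date range by hand: the period1 and period2 values in the question's URL are Unix timestamps (seconds since 1970-01-01 UTC), and 1510183619 - 1478647619 is exactly 365 days. Below is a minimal sketch of building such a URL from calendar dates with the standard library only. The helper names to_unix and history_url are made up for illustration, and the sketch only reproduces the query parameters seen in the question; it does not guarantee that Yahoo still serves this URL layout or returns all rows for it.

```python
from datetime import date, datetime, timezone
from urllib.parse import urlencode

def to_unix(day):
    """Return the Unix timestamp (UTC midnight) for a calendar date."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return int(midnight.timestamp())

def history_url(ticker, start, end, interval="1d"):
    """Build a history URL with the same query parameters as the question's URL.

    Hypothetical helper: the parameter names are copied from the question.
    """
    query = urlencode({
        "period1": to_unix(start),
        "period2": to_unix(end),
        "interval": interval,
        "filter": "history",
        "frequency": interval,
    })
    return "https://uk.finance.yahoo.com/quote/{}/history?{}".format(ticker, query)

print(history_url("HSBA.L", date(2016, 11, 8), date(2017, 11, 8)))
```

Working from dates rather than raw numbers makes it easier to check that the requested window really is the one you expect before blaming the parser for missing rows.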

Related

Why are player numbers not in a different column?

I have created a script that collects the information on a website and writes it to a spreadsheet. I'm getting acquainted with Python scraping and would like some help: I would like the player numbers to be in a separate column.
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import xlsxwriter
import xlwt
from xlwt import Workbook
# Workbook is created
wb = Workbook()
# add_sheet is used to create sheet.
sheet1 = wb.add_sheet('Sheet 1')
#send request
#url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
url = 'https://www.fcf.cat/acta/2422183'
page = requests.get(url,timeout=5, verify=False)
soup = BeautifulSoup(page.text,'html.parser')
#read acta
#acta_text = []
#acta_text_element = soup.find_all(class_='acta-table')
#for item in acta_text_element:
#    acta_text.append(item.text)
i = 0
acta = []
for tr in soup.find_all('tr'):
    values = [td.text.strip() for td in tr.find_all('td') ]
    print(values)
    acta.append(values)
    i = 1 + i
    sheet1.write(i,0,values)
wb.save('xlwt example.xls')
print(acta)
Thanks,
Two things to consider:
You can separate the first element in the list by using values[0] then use values[1:] for the remaining items
Use isnumeric to check if a string value is a number
Try this code:
for tr in soup.find_all('tr'):
    values = [td.text.strip() for td in tr.find_all('td') ]
    print(values)
    acta.append(values)
    i = 1 + i
    if len(values) and values[0].isnumeric(): # if first element is number
        sheet1.write(i,0,values[0]) # number in column 1
        sheet1.write(i,1,values[1:]) # rest of list in column 2
    else:
        sheet1.write(i,0,values) # all values in column 1
Excel output (truncated)
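The two suggestions above (split off values[0], test it with isnumeric) can be tried without the website or a workbook. This is a minimal sketch; the helper name split_shirt_number and the sample rows are made up for illustration. It also joins the remaining cells into one string, on the assumption that a worksheet cell should receive a single value rather than a list.

```python
def split_shirt_number(values):
    """Split a scraped row into (number, rest) when the first cell is numeric.

    Rows without a leading number come back as ("", values).
    """
    if values and values[0].isnumeric():
        return values[0], values[1:]
    return "", values

# Made-up sample rows standing in for the scraped <tr> contents
rows = [
    ["1", "TORNER ENCINAS, GONZALO"],
    ["Titulars"],
    [],
]
for row in rows:
    number, rest = split_shirt_number(row)
    # Join the remaining cells so each spreadsheet cell gets one string
    print(repr(number), repr(" ".join(rest)))
```

Inside the scraping loop, number would go to column 0 and the joined string to column 1.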
To take the team on the left, for example, try this:
tables = soup.select('table')
players = []
columns = ["Player","Shirt"]
titulars = [item for item in tables[1].text.strip().split('\n') if len(item)>0]
#tables[1] is where the data for the first team is; the other team is in tables[8]
#names sit at positions 2, 4, ... and shirt numbers at positions 1, 3, ...
for name, num in zip(titulars[2::2],titulars[1::2]):
    players.append([name,num])
pd.DataFrame(players,columns=columns)
Output:
Player Shirt
0 TORNER ENCINAS, GONZALO 1
1 MACHUCA LOVERA, OSMAR SILVESTRE 3
2 JARA MARTIN, BLAI 4
3 AGUILAR LUQUE, DANIEL 5
4 FONT MURILLO, JOAQUIN 6
5 MARTÍNEZ ELVIR, RICHARD ADRIAN 7
6 MARQUEZ RODRIGUEZ, GERARD 8
7 PATUEL BATLLE, GERARD 10
8 EL MAHI ZAROUALI, BILAL 11
9 JAUME MORERA, ADRIA 14
10 DEL VALLE ESCANCIANO, MARTI 15

scrape a table in a website with python (no table tag)

I'm trying to scrape the daily stock value of a product from this page: https://funds.ddns.net/f.php?isin=ES0110407097. This is the code I'm trying:
import pandas as pd
from bs4 import BeautifulSoup
html_string = 'https://funds.ddns.net/f.php?isin=ES0110407097'
soup = BeautifulSoup(html_string, 'lxml')
new_table = pd.DataFrame(columns=range(0,2), index = [0])
row_marker = 0
column_marker = 0
for row in soup.find_all('tr'):
    columns = soup.find_all('td')
    for column in columns:
        new_table.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
print(new_table)
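I would like to get in Python the same format I can see on the web page, both the date and the number. How can I do that, please?
The code in the question passes the URL string itself to BeautifulSoup rather than the downloaded page, so there are no tr elements to find. There's a simpler way for that particular page: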
import requests
import pandas as pd
url = 'https://funds.ddns.net/f.php?isin=ES0110407097'
resp = requests.get(url)
new_table = pd.read_html(resp.text)[0]
print(new_table.head(5))
Output:
0 1
0 FECHA VL:EUR
1 2019-12-20 120170000
2 2019-12-19 119600000
3 2019-12-18 119420000
4 2019-12-17 119390000

EmptyDataError: No columns to parse from file. (Generate files with "for" in Python)

The following code obtains specific data from an internet financial portal (Morningstar). I obtain data from different companies, in this case from Dutch companies. Each one is represented by a ticker.
import pandas as pd
import numpy as np
def financials_download(ticker,report,frequency):
    if frequency == "A" or frequency == "a":
        frequency = "12"
    elif frequency == "Q" or frequency == "q":
        frequency = "3"
    url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t='+ticker+'&region=usa&culture=en-US&cur=USD&reportType='+report+'&period='+frequency+'&dataType=R&order=desc&columnYear=5&rounding=3&view=raw&r=640081&denominatorView=raw&number=3'
    df = pd.read_csv(url, skiprows=1, index_col=0)
    return df
def ratios_download(ticker):
    url = 'http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t='+ticker+'&region=usa&culture=en-US&cur=USD&order=desc'
    df = pd.read_csv(url, skiprows=2, index_col=0)
    return df
holland=("AALBF","ABN","AEGOF", "AHODF", "AKZO","ALLVF","AMSYF","ASML","KKWFF","KDSKF","GLPG","GTOFF","HINKF","INGVF","KPN","NN","LIGHT","RANJF","RDLSF","RDS.A","SBFFF", "UNBLF", "UNLVF", "VOPKF", "WOLTF")
def finance(country):
    for ticker in country:
        frequency = "a"
        df1 = financials_download(ticker,'bs',frequency)
        df2 = financials_download(ticker,'is',frequency)
        df3 = ratios_download(ticker)
        d1 = df1.loc['Total assets']
        if np.any("EBITDA" in df2.index) == True:
            d2 = df2.loc["EBITDA"]
        else:
            d2 = None
        if np.any("Revenue USD Mil" in df3.index) == True:
            d3 = df3.loc["Revenue USD Mil"]
        else:
            d3 = df3.loc["Revenue EUR Mil"]
        d4 = df3.loc["Operating Margin %"]
        d5 = df3.loc["Return on Assets %"]
        d6 = df3.loc["Return on Equity %"]
        d7 = df3.loc["EBT Margin"]
        d8 = df3.loc["Net Margin %"]
        d9 = df3.loc["Free Cash Flow/Sales %"]
        if d2 is not None:
            d1=d1.to_frame().T
            d2=d2.to_frame().T
            d3=d3.to_frame().T
            d4=d4.to_frame().T
            d5=d5.to_frame().T
            d6=d6.to_frame().T
            d7=d7.to_frame().T
            d8=d8.to_frame().T
            d9=d9.to_frame().T
            df_new=pd.concat([d1,d2,d3,d4,d5,d6,d7,d8,d9])
        else:
            d1=d1.to_frame().T
            d3=d3.to_frame().T
            d4=d4.to_frame().T
            d5=d5.to_frame().T
            d6=d6.to_frame().T
            d7=d7.to_frame().T
            d8=d8.to_frame().T
            d9=d9.to_frame().T
            df_new=pd.concat([d1,d3,d4,d5,d6,d7,d8,d9])
        df_new.to_csv(ticker+'.csv')
The problem is that when I use a for loop so that it goes through all the tickers of the variable holland and generates a csv document for each of them, it returns the following error:
File "pandas/_libs/parsers.pyx", line 565, in
pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:6260)
EmptyDataError: No columns to parse from file
On the other hand, it runs without error, if I just select one company ticker after the other.
I'd really appreciate it if you could help me.
When you run your script several times, it fails on different tickers and different calls. This indicates that the problem is not associated with a specific ticker, but rather that the request made by the CSV reader sometimes doesn't return anything that can be read into the data frame. You can address this problem by using Python's error handling, e.g. in your financials_download function:
    df = ""
    i = 0
    #some data in df?
    while len(df) == 0:
        #try to download data and load them into df
        try:
            df = pd.read_csv(url, skiprows=1, index_col=0)
        #not successful? Count failed attempts
        except:
            i += 1
            print("Trial", i, "failed")
            #five attempts failed? Unlikely that this server will respond
            if i == 5:
                print("ticker", ticker, ": server is down")
                break
    #print("downloaded", ticker)
    #print("financial download data frame:")
    #print(df)
This tries five times to retrieve the data from the ticker and if this fails, it prints a message that it was not successful. But now you have to deal with this situation in your main program and adjust it, because some of the data frames are empty.
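The same retry idea can be pulled out of financials_download into a small reusable function. This is only a sketch under stated assumptions: the name download_with_retries, the delay parameter, and the pause between attempts are not part of the code above, and the flaky stand-in below replaces the real pd.read_csv call so the example runs without network access.

```python
import time

def download_with_retries(fetch, attempts=5, delay=1.0):
    """Call fetch() until it succeeds or the attempts run out.

    Returns whatever fetch() returns, or None if every attempt raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # e.g. pandas.errors.EmptyDataError
            print("Trial", attempt, "failed:", exc)
            if attempt < attempts:
                time.sleep(delay)  # give the server a moment before retrying
    return None

# Stand-in for pd.read_csv(url, skiprows=1, index_col=0):
# raises on the first two calls, then returns a value.
outcomes = iter([0, 0, 1])
print(download_with_retries(lambda: 1 // next(outcomes), delay=0))
```

In the main loop the call might look like df1 = download_with_retries(lambda: financials_download(ticker, 'bs', frequency)), followed by a check such as "if df1 is None: continue" so that a ticker whose download failed is skipped instead of crashing on an empty data frame.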
For this kind of basic debugging, I would like to point you to a blog post.

BeautifulSoup4 Returning Empty List when Attempting to Scrape a Table

I'm trying to pull the data from this url: https://www.winstonslab.com/players/player.php?id=98 and I keep getting the same error when I try to access the tables.
My scraping code is below. I run this, then hp = HTMLTableParser() and table = hp.parse_url('https://www.winstonslab.com/players/player.php?id=98')[0][1] returns the error 'index 0 is out of bounds for axis 0 with size 0'
import requests
import pandas as pd
from bs4 import BeautifulSoup
class HTMLTableParser:
    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        return [(table['id'],self.parse_html_table(table))\
                for table in soup.find_all('table')]
    def parse_html_table(self, table):
        n_columns = 0
        n_rows=0
        column_names = []
        # Find number of rows and columns
        # we also find the column titles if we can
        for row in table.find_all('tr'):
            # Determine the number of rows in the table
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows+=1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)
            # Handle column names if we find them
            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())
        # Safeguard on Column Titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")
        columns = column_names if len(column_names) > 0 else range(0,n_columns)
        df = pd.DataFrame(columns = columns,
                          index= range(0,n_rows))
        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker,column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1
        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass
        return df
If the data that you need is just the table, you can get it with the pandas.read_html() function.

Socket Error Exceptions in Python when Scraping

I am trying to learn scraping.
I use exceptions lower down in the code to pass over errors, because they don't affect the writing of data to CSV.
I keep getting a "socket.gaierror", but while that is being handled there is a "urllib.error.URLError", and while that is being handled I get "NameError: name 'socket' is not defined", which seems circuitous.
I kind of understand that using these exceptions may not be the best way to run the code, but I can't seem to get past these errors and I don't know a way around them or how to fix them.
If you have any suggestions beyond fixing the exception handling, those would be greatly appreciated as well.
import csv
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
base_url = 'http://www.fangraphs.com/' # used in line 27 for concatenation
years = ['2017','2016','2015'] # for enough data to run tests
#Getting Links for letters
player_urls = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    if link.has_attr('href'):
        player_urls.append(base_url + link['href'])
#Getting Alphabet Links
test_for_playerlinks = 'players.aspx?letter='
player_alpha_links = []
for i in player_urls:
    if test_for_playerlinks in i:
        player_alpha_links.append(i)
# Getting Player Links
ind_player_urls = []
for l in player_alpha_links:
    data = urlopen(l)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            ind_player_urls.append(link['href'])
#Player Links
jan = 'statss.aspx?playerid'
players = []
for j in ind_player_urls:
    if jan in j:
        players.append(j)
# Building Pitcher List
pitcher = 'position=P'
pitchers = []
pos_players = []
for i in players:
    if pitcher in i:
        pitchers.append(i)
    else:
        pos_players.append(i)
# Individual Links to Different Tables Sorted by Base URL differences
splits = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs = 'http://www.fangraphs.com/statsd.aspx?'
split_pp = []
gamel = []
years = ['2017','2016','2015']
for i in pos_players:
    for year in years:
        split_pp.append(splits + i[12:]+'&season='+ year)
        gamel.append(game_logs+ i[12:] + '&type=&gds=&gde=&season=' + year)
split_pitcher = []
gl_pitcher = []
for i in pitchers:
    for year in years:
        split_pitcher.append(splits + i[12:]+'&season=' + year)
        gl_pitcher.append(game_logs + i[12:] + '&type=&gds=&gde=&season=' + year)
# Splits for Pitcher Data
row_sp = []
rows_sp = []
try:
    for i in split_pitcher:
        sauce = urlopen(i)
        soup = BeautifulSoup(sauce, "html.parser")
        table1 = soup.find_all('strong', {"style":"font-size:15pt;"})
        row_sp = []
        for name in table1:
            nam = name.get_text()
            row_sp.append(nam)
        table = soup.find_all('table', {"class":"rgMasterTable"})
        for h in table:
            he = h.find_all('tr')
            for i in he:
                td = i.find_all('td')
                for j in td:
                    row_sp.append(j.get_text())
        rows_sp.append(row_sp)
except(RuntimeError, TypeError, NameError, URLError, socket.gaierror):
    pass
try:
    with open('SplitsPitchingData2.csv', 'w') as fp:
        writer = csv.writer(fp)
        writer.writerows(rows_sp)
except(RuntimeError, TypeError, NameError):
    pass
I'm guessing your main problem was that you queried the site, without any sleep whatsoever, for a huge number of invalid URLs (you create 3 URLs for the years 2015-2017 for 22880 pitchers in total, but most of these players do not fall within that range, so you end up with tens of thousands of queries that return errors).
I'm surprised your IP wasn't banned by the site admin. That said, it would be better to do some filtering so you avoid all those error queries.
The filter I applied is not perfect. It checks whether any year in the list appears at the start or the end of the years given on the site (e.g. '2004 - 2015'). This still creates some error links, but nowhere near as many as the original script did.
In code it could look like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep
import csv
base_url = 'http://www.fangraphs.com/'
years = ['2017','2016','2015']
# Getting Links for letters
letter_links = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    try:
        link = base_url + link['href']
        if 'players.aspx?letter=' in link:
            letter_links.append(link)
    except:
        pass
print("[*] Retrieved {} links. Now fetching content for each...".format(len(letter_links)))
# the data resides in two different base_urls:
splits_url = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs_url = 'http://www.fangraphs.com/statsd.aspx?'
# we need (for some reason) pitchers in two lists - pitchers_split and pitchers_game_log - and the rest of the players in two other lists, pos_players_split and pos_players_game_log
pos_players_split = []
pos_players_game_log = []
pitchers_split = []
pitchers_game_log = []
# and if we wanted to do something with the data from the letter queries, let's put that in a list for safe keeping:
ind_player_urls = []
current_letter_count = 0
for link in letter_links:
    current_letter_count +=1
    data = urlopen(link)
    soup = BeautifulSoup(data, "html.parser")
    trs = soup.find('div', class_='search').find_all('tr')
    for player in trs:
        player_data = [tr.text for tr in player.find_all('td')]
        # To prevent tons of queries to fangraphs with invalid years - check if elements from the years list exist in the player stat:
        if any(year in player_data[1] for year in years if player_data[1].startswith(year) or player_data[1].endswith(year)):
            href = player.a['href']
            player_data.append(base_url + href)
            # player_data now looks like this:
            # ['David Aardsma', '2004 - 2015', 'P', 'http://www.fangraphs.com/statss.aspx?playerid=1902&position=P']
            ind_player_urls.append(player_data)
            # build the links for game_log and split
            for year in years:
                split = '{}{}&season={}'.format(splits_url,href[12:],year)
                game_log = '{}{}&type=&gds=&gde=&season={}'.format(game_logs_url, href[12:], year)
                # checking if the player is a pitcher or not. We append both name (player_data[0]) and link, so we don't need to extract the name later on
                if 'P' in player_data[2]:
                    pitchers_split.append([player_data[0],split])
                    pitchers_game_log.append([player_data[0],game_log])
                else:
                    pos_players_split.append([player_data[0],split])
                    pos_players_game_log.append([player_data[0],game_log])
    print("[*] Done extracting data for players for letter {} out of {}".format(current_letter_count, len(letter_links)))
    sleep(2)
# CONSIDER INSERTING CSV-PART HERE....
# Extracting and writing pitcher data to file
with open('SplitsPitchingData2.csv', 'a') as fp:
    writer = csv.writer(fp)
    for i in pitchers_split:
        try:
            row_sp = []
            rows_sp = []
            # all elements in pitchers_split are lists. Player name is i[0], link is i[1]
            data = urlopen(i[1])
            soup = BeautifulSoup(data, "html.parser")
            # append name to row_sp from pitchers_split
            row_sp.append(i[0])
            # the page has 3 tables with the class rgMasterTable: the first is Standard, the second Advanced, the third Batted Ball
            # we're only grabbing Standard
            table_standard = soup.find_all('table', {"class":"rgMasterTable"})[0]
            trs = table_standard.find_all('tr')
            for tr in trs:
                td = tr.find_all('td')
                for content in td:
                    row_sp.append(content.get_text())
            rows_sp.append(row_sp)
            writer.writerows(rows_sp)
            sleep(2)
        except Exception as e:
            print(e)
            pass
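The year filter used in the letter loop can be tested on its own, which also shows where it is imperfect. A minimal sketch follows; the function name active_in_years and the sample spans other than '2004 - 2015' are made up for illustration, and the logic is meant to mirror the any(...) condition above rather than improve on it.

```python
def active_in_years(span, years):
    """Return True if a career span such as '2004 - 2015' starts or ends
    in one of the given years."""
    span = span.strip()
    return any(span.startswith(year) or span.endswith(year) for year in years)

years = ['2017', '2016', '2015']
print(active_in_years('2004 - 2015', years))  # span ends in a listed year
print(active_in_years('1998 - 2003', years))  # neither end is a listed year
print(active_in_years('2010 - 2018', years))  # skipped, although it covers 2015-2017
```

The last case is the kind of player this filter misses: someone whose career covers the wanted seasons without starting or ending in one of them. Parsing both ends of the span as integers and comparing ranges would catch that, at the cost of a little more code.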
Since I'm not sure precisely how you wanted the output formatted, you will need to do some work on that yourself.
If you want to avoid waiting for all letter_links to be extracted before you retrieve the actual pitcher stats (and fine tune your output) you can move the csv writer part up, so it runs as a part of the letter loop. If you do this don't forget to empty the pitchers_split list before grabbing another letter_link...
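As for the NameError itself: the script in the question names URLError and socket.gaierror in its except clause, but it only imports HTTPError from urllib.error and never imports socket, so Python cannot resolve those names once an exception actually has to be matched. A minimal sketch of a fetch helper with the imports in place is shown here; the name fetch and the opener parameter (there so the function can be exercised without a network connection) are assumptions for illustration, not part of the original script.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen

def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed."""
    try:
        with opener(url) as response:
            return response.read()
    except (URLError, socket.gaierror) as exc:
        # Both names resolve because they are imported at the top.
        # HTTPError is a subclass of URLError, so it is caught here too.
        print("skipping", url, "->", exc)
        return None
```

A caller can then write "page = fetch(link)" and skip the link when the result is None, instead of wrapping the whole scraping loop in a single try block that ends the loop on the first failure.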
