BeautifulSoup4 Returning Empty List when Attempting to Scrape a Table - python-3.x

I'm trying to pull the data from this URL: https://www.winstonslab.com/players/player.php?id=98, but I keep getting the same error whenever I try to access the tables.
My scraping code is below. I run it, then hp = HTMLTableParser() followed by table = hp.parse_url('https://www.winstonslab.com/players/player.php?id=98')[0][1] raises the error 'index 0 is out of bounds for axis 0 with size 0'.
import requests
import pandas as pd
from bs4 import BeautifulSoup

class HTMLTableParser:

    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        return [(table['id'], self.parse_html_table(table))
                for table in soup.find_all('table')]

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        # Find number of rows and columns
        # we also find the column titles if we can
        for row in table.find_all('tr'):
            # Determine the number of rows in the table
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)
            # Handle column names if we find them
            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())

        # Safeguard on column titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns=columns, index=range(0, n_rows))

        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker, column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1

        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df

If the data that you need is just the table, you can accomplish that with the pandas.read_html() function.
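As a minimal sketch, assuming the page serves its tables as static HTML (pandas.read_html also needs lxml or html5lib installed, and raises ValueError if it finds no tables):

import pandas as pd

url = 'https://www.winstonslab.com/players/player.php?id=98'
tables = pd.read_html(url)   # one DataFrame per <table> on the page
print(len(tables))           # how many tables were found
print(tables[0].head())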

Related

extracting column contents only so that all columns for each row are in the same row using Python's BeautifulSoup

I have the following Python snippet in Jupyter Notebooks that works.
The challenge is to extract just the rows of columnar data.
Here's the snippet:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

page = requests.get("http://lib.stat.cmu.edu/datasets/boston")
soup = bs(page.content, "html.parser")
allrows = soup.find_all("p")
print(allrows)
I'm a little unclear on what you are after, but I think it's each individual row of data from the URL provided.
I couldn't find a way to use BeautifulSoup alone to parse the data you are after, but I did find a way to separate the rows using .split():
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

page = requests.get("http://lib.stat.cmu.edu/datasets/boston")
soup = bs(page.content, "html.parser")
allrows = soup.find_all("p")

text = soup.text                 # turn soup into text
text_split = text.split('\n\n')  # split the page into 3 sections
data = text_split[2]             # rows of data

# create df column titles using the variable titles on the page
col_titles = text_split[1].split('\n')
df = pd.DataFrame(columns=range(14))
df.columns = col_titles[1:]

# 'try/except' to catch the end of the index;
# loop through the text data, building complete rows
try:
    complete_row = []
    n1 = 0  # used to track index
    n2 = 1
    rows = data.split('\n')
    for el in range(len(rows)):
        full_row = rows[n1] + rows[n2]
        complete_row.append(full_row)
        n1 = n1 + 2
        n2 = n2 + 2
except IndexError:
    print('end of loop')

# loop through the rows of data, clean whitespace and append to df
for row in complete_row:
    elem = row.split(' ')
    df.loc[len(df)] = [el for el in elem if el]

# finished dataframe
df
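A shorter alternative, as a hedged sketch: since each record in this file is whitespace-delimited and wrapped across two physical lines, pandas can read it directly and the halves can be re-joined (the skiprows=22 offset for the descriptive header block is an assumption about the file layout):

import numpy as np
import pandas as pd

url = "http://lib.stat.cmu.edu/datasets/boston"
# skip the descriptive header, then read whitespace-separated values
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
# each record is split across two lines: 11 values, then 3
values = np.hstack([raw.values[::2, :], raw.values[1::2, :3]])
df = pd.DataFrame(values)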

How can I find out which row of data from this Excel sheet is duplicated the most

I am trying to find out which row (street name) has the most crimes in an Excel spreadsheet. I have found the sum for the highest number of crimes; I just can't find the actual row that generated that many occurrences.
import os
import csv

def main():
    # create and save the path to the file...
    fileToRead = "C:/Users/zacoli4407/Documents/Intro_To_Scipting/Crime_Data_Set.csv"
    highestNumberOfCrimes = 0
    data = []
    rowNumber = 0
    count = 0
    with open(fileToRead, 'r') as dataToRead:  # open the access to the file
        reader = csv.reader(dataToRead)  # gives the ability to read from the file
        for row in reader:
            if row[4].isnumeric():
                if int(row[4]) > highestNumberOfCrimes:
                    highestNumberOfCrimes = int(row[4])
                    rowNumber = count
                data.append([row[2], row[3], row[4], row[5]])  # row[3] has the street name I am trying to acquire
            count += 1
    print(highestNumberOfCrimes)
    with open("crime.txt", "w") as outputFile:
        outputFile.write("The highest number of crimes is: \n")
        outputFile.write(str(highestNumberOfCrimes))

main()
You could do the following:
import csv
from collections import defaultdict

result = defaultdict(float)
with open(fileToRead, 'r') as dataToRead:
    reader = csv.reader(dataToRead)
    header = next(reader)
    for row in reader:
        result[row[3]] += float(row[4])

# Now to get the street with the maximum number of crimes
mx = max(result, key=result.get)
print(mx)

# to get the maximum number of crimes
print(result[mx])
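If pandas is available, the same aggregation can be written as a short sketch; the column positions (3 for the street name, 4 for the crime count) follow the indices used above, and the real header names of the CSV are unknown here:

import pandas as pd

df = pd.read_csv(fileToRead)
# group by the street-name column (position 3) and sum the counts (position 4)
totals = df.groupby(df.columns[3])[df.columns[4]].sum()
print(totals.idxmax(), totals.max())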

Python Program ValueError: invalid literal for int() with base 10:

I am a Python beginner trying to write a program: "Corona Virus Live Updates for India – Using Python".
I am getting this error after running the program:
performance.append(int(row[2]) + int(row[3]))
ValueError: invalid literal for int() with base 10:
What can I do to fix this problem?
The Code:
import requests
from bs4 import BeautifulSoup
import numpy as np
from tabulate import tabulate

extract_contents = lambda row: [x.text.replace('\n', '') for x in row]
URL = 'https://www.mohfw.gov.in/'
SHORT_HEADERS = ['SNo', 'State', 'Indian-Confirmed',
                 'Foreign-Confirmed', 'Cured', 'Death']
response = requests.get(URL).content
soup = BeautifulSoup(response, 'html.parser')
header = extract_contents(soup.tr.find_all('th'))
stats = []
all_rows = soup.find_all('tr')
for row in all_rows:
    stat = extract_contents(row.find_all('td'))
    if stat:
        if len(stat) == 5:
            # last row
            stat = ['', *stat]
            stats.append(stat)
        elif len(stat) == 6:
            stats.append(stat)
stats[-1][1] = "Total Cases"
stats.remove(stats[-1])

# Step #3:
objects = []
for row in stats:
    objects.append(row[1])
y_pos = np.arange(len(objects))

performance = []
for row in stats:
    performance.append(int(row[2]) + int(row[3]))

table = tabulate(stats, headers=SHORT_HEADERS)
print(table)
I just changed the line performance.append(int(row[2]) + int(row[3])) to performance.append(row[2] + str(int(float(row[3])))), since the scraped cells are strings that are not always valid integer literals.
Full Code:
import requests
from bs4 import BeautifulSoup
import numpy as np
from tabulate import tabulate

extract_contents = lambda row: [x.text.replace('\n', '') for x in row]
URL = 'https://www.mohfw.gov.in/'
SHORT_HEADERS = ['SNo', 'State', 'Indian-Confirmed', 'Foreign-Confirmed', 'Cured', 'Death']
response = requests.get(URL).content
soup = BeautifulSoup(response, 'html.parser')
header = extract_contents(soup.tr.find_all('th'))
stats = []
all_rows = soup.find_all('tr')
for row in all_rows:
    stat = extract_contents(row.find_all('td'))
    if stat:
        if len(stat) == 5:
            # last row
            stat = ['', *stat]
            stats.append(stat)
        elif len(stat) == 6:
            stats.append(stat)
stats[-1][1] = "Total Cases"
stats.remove(stats[-1])

# Step #3:
objects = [row[1] for row in stats]
y_pos = np.arange(len(objects))
performance = [row[2] + str(int(float(row[3]))) for row in stats]
table = tabulate(stats, headers=SHORT_HEADERS)
print(table)
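More generally, this ValueError means one of the scraped cells is empty or not an integer literal. A defensive conversion helper is a common fix; this is only a sketch, and to_int is a hypothetical name, not part of the original code:

def to_int(cell):
    # convert a scraped table cell to int, tolerating blanks and float strings
    cell = cell.strip()
    return int(float(cell)) if cell else 0

# hypothetical usage with the row layout above
performance = [to_int(row[2]) + to_int(row[3]) for row in stats]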

How to apply a function quickly to a list of DataFrames in Python?

I have a list of DataFrames with the same numbers of columns and rows but different values, such as
data = [df1, df2, df3, ..., dfn].
How can I apply a function to each DataFrame in the list data? I used the following code, but it does not work:
data = [df1, df2, df3, ..., dfn]

def maxloc(data):
    data['loc_max'] = np.zeros(len(data))
    for i in range(1, len(data) - 1):  # from the second value on
        if data['q_value'][i] >= data['q_value'][i-1] and data['q_value'][i] >= data['q_value'][i+1]:
            data['loc_max'][i] = 1
    return data

df_list = [df.pipe(maxloc) for df in data]
Seems to me the problem is in your maxloc() function, as this code works.
I also added the maximum value to the return of maxloc:
from random import randrange
import pandas as pd

def maxloc(data_frame):
    max_index = data_frame['Value'].idxmax(0)
    maximum = data_frame['Value'][max_index]
    return max_index, maximum

# create a test list of data-frames
data = []
for i in range(5):
    temp = []
    for j in range(10):
        temp.append(randrange(100))
    df = pd.DataFrame({'Value': temp}, index=range(10))
    data.append(df)

df_list = [df.pipe(maxloc) for df in data]
for i, (index, value) in enumerate(df_list):
    print(f"Data-frame {i:02d}: maximum = {value} at position {index}")

Scraping Yahoo Finance historical stock prices

I am attempting to parse Yahoo Finance's historical stock price tables for various stocks using BeautifulSoup with Python. Here is the code:
import requests
import pandas as pd
import urllib
from bs4 import BeautifulSoup

tickers = ['HSBA.L', 'RDSA.L', 'RIO.L', 'BP.L', 'GSK.L', 'DGE.L', 'AZN.L', 'VOD.L', 'GLEN.L', 'ULVR.L']
url = 'https://uk.finance.yahoo.com/quote/HSBA.L/history?period1=1478647619&period2=1510183619&interval=1d&filter=history&frequency=1d'
request = requests.get(url)
soup = BeautifulSoup(request.text, 'lxml')
table = soup.find_all('table')[0]

n_rows = 0
n_columns = 0
column_name = []
for row in table.find_all('tr'):
    data = row.find_all('td')
    if len(data) > 0:
        n_rows += 1
        if n_columns == 0:
            n_columns = len(data)
    headers = row.find_all('th')
    if len(headers) > 0 and len(column_name) == 0:
        for header_names in headers:
            column_name.append(header_names.get_text())

new_table = pd.DataFrame(columns=column_name, index=range(0, n_rows))
row_index = 0
for row in table.find_all('tr'):
    column_index = 0
    columns = row.find_all('td')
    for column in columns:
        new_table.iat[row_index, column_index] = column.get_text()
        column_index += 1
    if len(columns) > 0:
        row_index += 1
The first time I ran the code, I had the interval set to exactly two years from November 7th 2015 (with weekly prices). The issue is that the resulting data frame is 101 rows long, but I know for a fact it should be longer (106). Then I changed the interval to the default one used when the page opens (daily), but I still got the same 101 rows, whereas the actual data is much larger. Is there anything wrong with the code, or is it something Yahoo Finance is doing?
Any help is appreciated; I'm really stuck here.
AFAIK, the API was shut down in May of 2017. Can you use Google Finance? If you can accept Excel as a solution, here is a link to a file you can download that fetches all kinds of historical time series data.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
