Better Web Scraping Methodology? - python-3.x

I was just curious if someone could help me out with a bit of web scraping. I currently start my scrape at this link -- https://www.basketball-reference.com/leagues/NBA_1981_games-october.html. I scrape the "schedule" table for each month and then move on to the next year. I can successfully scrape from 1989 to 2001 (every month) and put the data into the format I want, but my code is fragile. Is there a better methodology someone could explain to me, rather than pulling in the schedule table as one massive piece of text and then slicing it to fit my needs? For example, here is my code:
from selenium import webdriver as wd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd
import os

chromeDriverPath = r'path of chromeDriver used by Selenium'
browser = wd.Chrome(executable_path=chromeDriverPath)

# Create the links needed
link = "https://www.basketball-reference.com/leagues/NBA_"
years = range(1989, 2018, 1)
months = ['october', 'november', 'december', 'january', 'february', 'march',
          'april', 'may', 'june', 'july', 'august', 'september']
hour_list = ['1:00', '1:30', '1:40', '2:00', '2:30', '3:00', '3:30', '4:00', '4:30', '5:00',
             '5:30', '6:00', '6:30', '7:00', '7:30', '8:00', '8:30', '9:00', '9:30',
             '10:00', '10:30', '11:00', '11:30', '12:00', '12:30', '12:40']
ampm = ['pm', 'am']
def scrape(url):
    try:
        browser.get(url)
        schedule = WebDriverWait(browser, 5).until(
            EC.presence_of_all_elements_located((By.ID, "schedule")))
    except TimeoutException:
        print(str(url) + ' does not exist!')
        return
    # Flatten the table text into one line per game
    o_players = [schedule[i].text for i in range(0, len(schedule))]
    o_players = ''.join(o_players)
    o_players = o_players.splitlines()
    o_players = o_players[1:]  # drop the header line
    o_players = [x.replace(',', '') for x in o_players]
    o_players = [x.split(' ') for x in o_players]
    l0 = []
    l1 = []
    l2 = []
    for x in o_players:
        if "at" in x:
            l1.append(x[:x.index("at")])
        elif 'Game' in x:
            l0.append(x[:x.index("Game")])
        else:
            l2.append(x)
    l3 = l1 + l2 + l0
    # Strip tip-off times and am/pm markers
    for x in l3:
        for y in x:
            if y in hour_list:
                x.remove(y)
        for t in x:
            if t in ampm:
                x.remove(t)
    ot = ['OT', '2OT', '3OT', '4OT', '5OT']
    for x in l3:
        x.insert(0, 'N/A')
        if x[-1] != 'Score' and x[-1] not in ot:
            x.insert(1, x[-1])
        else:
            x.insert(1, 'N/A')
        for y in ot:
            if y in x:
                x.remove('N/A')
                x.remove(y)
                x.insert(0, y)
    l3 = [t for t in l3 if 'Playoffs' not in t]
    # Re-assemble team names and scores based on the row length
    for x in l3:
        if len(x) == 17:
            x.insert(0, ' '.join(x[6:9]))
            x.insert(1, ' '.join(x[11:14]))
            x.insert(1, x[11])
            x.insert(3, x[16])
        if len(x) == 16 and x[-1] != 'Score':
            if x[8].isdigit():
                x.insert(0, ' '.join(x[6:8]))
                x.insert(1, ' '.join(x[10:13]))
                x.insert(1, x[10])
                x.insert(3, x[15])
            else:
                x.insert(0, ' '.join(x[6:9]))
                x.insert(1, ' '.join(x[11:13]))
                x.insert(1, x[11])
                x.insert(3, x[15])
        if len(x) == 16 and x[-1] == 'Score':
            x.insert(0, ' '.join(x[6:9]))
            x.insert(1, ' '.join(x[11:14]))
            x.insert(1, x[11])
            x.insert(3, x[16])
        if len(x) == 15 and x[-1] != 'Score':
            x.insert(0, ' '.join(x[6:8]))
            x.insert(1, ' '.join(x[10:12]))
            x.insert(1, x[10])
            x.insert(3, x[14])
        if len(x) == 15 and x[-1] == 'Score':
            if x[8].isdigit():
                x.insert(0, ' '.join(x[6:8]))
                x.insert(1, ' '.join(x[10:13]))
                x.insert(1, x[10])
                x.insert(3, x[15])
            else:
                x.insert(0, ' '.join(x[6:9]))
                x.insert(1, ' '.join(x[11:13]))
                x.insert(1, x[11])
                x.insert(3, x[15])
        if len(x) == 14:
            x.insert(0, ' '.join(x[6:8]))
            x.insert(1, ' '.join(x[10:12]))
            x.insert(1, x[10])
            x.insert(3, x[14])
    l4 = []
    for x in l3:
        x = x[:10]
        l4.append(x)
    # Working with pandas to standardize the data
    df = pd.DataFrame(l4)
    df['Date'] = df[7] + ' ' + df[8] + ', ' + df[9]
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.sort_values(by=['Date'])
    headers = ['Visitor', 'Visitor Points', 'Home', 'Home Points', 'OT',
               'Attendance', 'Weekday', 'Month', 'Day', 'Year', 'Date']
    headers_order = ['Date', 'Weekday', 'Day', 'Month', 'Year', 'Visitor', 'Visitor Points',
                     'Home', 'Home Points', 'OT', 'Attendance']
    df.columns = headers
    df = df[headers_order]
    file_exists = os.path.isfile("NBA_Scrape.csv")
    if not file_exists:
        df.to_csv('NBA_Scrape.csv', mode='a', header=True, index=False)
    else:
        df.to_csv('NBA_Scrape.csv', mode='a', header=False, index=False)
for x in years:
    link0 = link + str(x) + '_games-'
    for y in months:
        final_links = link0 + str(y) + '.html'
        scrape(final_links)
My code starts to return errors at year 2001, I believe. I would like to scrape through to the present. Please help me scrape better. I imagine there is a much more proficient way, like looping through each element in the "schedule" table and appending each one to a different list or a different column in pandas? Please lend me a hand.
Thanks,
Joe

Your target is perfectly static, so there is no need to run Selenium. I would suggest using the Scrapy Python library; it was designed to fit all web scraping needs and is an incredibly fast and flexible tool. You can use XPath to pull each element from the page separately, instead of treating the page as one huge piece of text.
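To make this concrete, here is a minimal sketch of that approach (not a drop-in replacement for the code above). The data-stat attribute names are my assumption about basketball-reference's markup; verify them in your browser's inspector before relying on them:
import scrapy

class NbaScheduleSpider(scrapy.Spider):
    name = 'nba_schedule'
    # Same URL scheme the question builds by hand
    start_urls = [
        f'https://www.basketball-reference.com/leagues/NBA_{year}_games-{month}.html'
        for year in range(1989, 2018)
        for month in ['october', 'november', 'december', 'january', 'february', 'march',
                      'april', 'may', 'june', 'july', 'august', 'september']
    ]

    def parse(self, response):
        # One item per table row, one field per cell (no text splicing).
        for row in response.xpath('//table[@id="schedule"]/tbody/tr[not(@class)]'):
            yield {
                'date': row.xpath('.//th[@data-stat="date_game"]//text()').get(),
                'visitor': row.xpath('.//td[@data-stat="visitor_team_name"]//text()').get(),
                'visitor_pts': row.xpath('.//td[@data-stat="visitor_pts"]/text()').get(),
                'home': row.xpath('.//td[@data-stat="home_team_name"]//text()').get(),
                'home_pts': row.xpath('.//td[@data-stat="home_pts"]/text()').get(),
                'ot': row.xpath('.//td[@data-stat="overtimes"]/text()').get(),
                'attendance': row.xpath('.//td[@data-stat="attendance"]/text()').get(),
            }
Running it with scrapy runspider nba_schedule.py -o NBA_Scrape.csv writes every yielded item straight to CSV, and months that do not exist come back as 404s, which Scrapy skips by default, so no timeout handling is needed.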

Related

weighted average acquisition cost pandas based on buy and sell

Thanks, everyone, for your time.
I have a dataframe like the one below:
import pandas as pd
import numpy as np

raw_data = {'Date': ['04-23-2020', '05-05-2020', '05-05-2020', '05-11-2020', '05-11-2020',
                     '05-12-2020', '05-12-2020', '05-27-2020', '06-03-2020'],
            'Type': ['Buy', 'Buy', 'Sell', 'Buy', 'Sell', 'Buy', 'Buy',
                     'Buy', 'Sell'],
            'Ticker': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA',
                       'BBB', 'CCC', 'CCC', 'CCC'],
            'Quantity': [60000, 12000, -30000, 49000, -30000, 2000, 10000, 28500, -10000],
            'Price': [60.78, 82.20, 0, 100.00, 0, 545.00, 141.00, 172.00, 0]
            }
df = pd.DataFrame(raw_data, columns=['Date', 'Type', 'Ticker', 'Quantity', 'Price']).sort_values(['Date', 'Ticker']).reset_index(drop=True)
My objective is to calculate the weighted average price whenever there is a sell transaction; please see my expected outcome below. I tried a for loop for this but was unable to get the results.
My current code:
df['Pur_Value'] = df['Quantity'] * df['Price']
df['TotalQty'] = df.groupby('Ticker')['Quantity'].cumsum()
grpl = df.groupby(by=['Ticker'])
df1 = pd.DataFrame()
finaldf = pd.DataFrame()
for key in grpl.groups.keys():
    df1 = grpl.get_group(key).loc[:, ['Type', 'Ticker', 'Date', 'Quantity', 'Price', 'Pur_Value']]
    df1.sort_values(by=['Ticker', 'Date'], inplace=True)
    df1.reset_index(drop=True, inplace=True)
    Cum_Value = 0
    Cum_Shares = 0
    for Index, Tic in df1.iterrows():
        if Tic['Type'] == "Buy":
            Cum_Value += Tic['Pur_Value']
            Cum_Shares += Tic['Quantity']
        else:
            df1['sold_price'] = Cum_Value / Cum_Shares
    finaldf = finaldf.append(df1)
The expected result is a column with the weighted average price for the sold shares, like below.
I was able to solve my own issue with the code below. Thanks.
Ticker, Dates, sold_price = [], [], []  # these result lists must be initialized before the loop
for key in grpl.groups.keys():
    df1 = grpl.get_group(key).loc[
        :, ["Type", "Ticker", "Date", "Quantity", "Price", "Pur_Value"]
    ]
    df1.sort_values(by=["Ticker", "Date"], inplace=True)
    df1.reset_index(drop=True, inplace=True)
    Cum_Value = 0
    Cum_Shares = 0
    Cum_price = 0
    sold_value = 0
    for Index, Tic in df1.iterrows():
        if Tic["Type"] == "Buy":
            Cum_Value += Tic["Pur_Value"]
            Cum_Shares += Tic["Quantity"]
            Cum_price = Cum_Value / Cum_Shares
        else:
            sold_value = Cum_price * Tic["Quantity"]
            Cum_Shares += Tic["Quantity"]
            Cum_Value += sold_value
            Cum_price = Cum_Value / Cum_Shares
            Ticker.append(Tic["Ticker"])
            Dates.append(Tic["Date"])
            sold_price.append(Cum_price)
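To turn those three lists into the final frame, something like this sketch would work (the column names are illustrative, not from the original post):
# Assemble the per-sell results collected above; column names are illustrative.
result = pd.DataFrame({'Ticker': Ticker, 'Date': Dates, 'Sold_Price': sold_price})
print(result)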

Why am I getting this error when trying to get results for specific attributes?

I am trying to get results for specific attributes: extracting the month and day of the week from Start Time to create new columns, then filtering to get a new data frame.
import pandas as pd

CITY_DATA = {'chicago': 'chicago.csv',
             'new york city': 'new_york_city.csv',
             'washington': 'washington.csv'}

def load_data(city, month, day):
    df = pd.read_csv(CITY_DATA[city])
    df['Start Time'] = pd.to_datetime(df['Start Time'])
    df['month'] = df['Start Time'].dt.month
    df['day_of_week'] = df['Start Time'].dt.weekday_name
    if month != 'all':
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1
        df = df[df['month'] == month]
    if day != 'all':
        df = df[df['day_of_week'] == day.title()]
    return df

df = load_data('chicago', 'march', 'friday')
print(df.head())
AttributeError: 'DatetimeProperties' object has no attribute 'weekday_name'
Your problem is the following line:
df['day_of_week'] = df['Start Time'].dt.weekday_name
Change it to:
df['day_of_week'] = df['Start Time'].dt.day_name()
weekday_name was deprecated in pandas 0.23 and removed in pandas 1.0; day_name() is its replacement.
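A quick sanity check of the replacement (the date is made up, just for illustration):
import pandas as pd

s = pd.to_datetime(pd.Series(['2020-03-06']))  # 2020-03-06 is a Friday
print(s.dt.day_name())
# 0    Friday
# dtype: object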

How to specifically name each value in a list of tuples in python?

The Tuple_list prints out something like "[(2000, 1, 1, 1, 135), (2000, 1, 1, 2, 136), etc...]" and I can't figure out how to assign "year, month, day, hour, height" to every tuple in the list.
def read_file(filename):
    with open(filename, "r") as read:
        pre_list = list()
        for line in read.readlines():
            remove_symbols = line.strip()
            make_list = remove_symbols.replace(" ", ", ")
            pre_list += make_list.split('()')
        Tuple_list = [tuple(map(int, each.split(', '))) for each in pre_list]
        for n in Tuple_list:
            year, month, day, hour, height = Tuple_list[n][0], Tuple_list[
                n][1], Tuple_list[n][2], Tuple_list[n][3], Tuple_list[n][4]
            print(month)
        return Tuple_list

swag = read_file("VIK_sealevel_2000.txt")
Maybe "Named tuples" is what you are looking for.
In [1]: from collections import namedtuple
In [2]: Measure = namedtuple('Measure', ['year', 'month', 'day', 'hour', 'height'])
In [3]: m1 = Measure(2005,1,1,1,4444)
In [4]: m1
Out[4]: Measure(year=2005, month=1, day=1, hour=1, height=4444)
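Applied to the list from the question, each plain tuple can be converted into a Measure, assuming every tuple has exactly five fields:
# Wrap each 5-field tuple in the named tuple so fields are addressable by name.
Tuple_list = [Measure(*t) for t in Tuple_list]
for m in Tuple_list:
    print(m.month)  # no more positional indexing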

Sorted insert on a python list

Good morning everybody,
Here is my function, which is supposed to do a recursive sorted insert of a pair of data:
def sorted_insert(w_i, sim, neighbors):
    if neighbors == []:
        neighbors.append((w_i, sim))
    elif neighbors[0][1] < sim:
        neighbors.insert(0, (w_i, sim))
    else:
        sorted_insert(w_i, sim, neighbors[1:])
    return neighbors
The problem is that this function doesn't insert values in the middle. Here is a series of insertions:
>>> n=[]
>>> n=sorted_insert("w1",0.6,n)
>>> n=sorted_insert("w1",0.3,n)
>>> n=sorted_insert("w1",0.5,n)
>>> n=sorted_insert("w1",0.8,n)
>>> n=sorted_insert("w1",0.7,n)
>>> n
[('w1', 0.8), ('w1', 0.6)]
Can someone correct my function?
Thanks in advance.
This should work. The recursive call in your version operates on the slice neighbors[1:], which is a copy of the list, so any insert into it never makes it back into the original; passing an index down instead avoids copying.
def sorted_insert(w_i, sim, neighbors, i=0):
    if len(neighbors) == i or sim > neighbors[i][1]:
        neighbors.insert(i, (w_i, sim))
    else:
        sorted_insert(w_i, sim, neighbors, i + 1)

n = []
sorted_insert("w1", 0.6, n)
sorted_insert("w1", 0.3, n)
sorted_insert("w1", 0.5, n)
sorted_insert("w1", 0.8, n)
sorted_insert("w1", 0.7, n)
print(n)
# [('w1', 0.8), ('w1', 0.7), ('w1', 0.6), ('w1', 0.5), ('w1', 0.3)]
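As an aside, the standard library can do this without recursion. A sketch using bisect, storing the negated similarity so the list stays sorted from highest to lowest:
import bisect

def sorted_insert_bisect(w_i, sim, neighbors):
    # insort keeps the list ascending, so negate sim to sort descending by sim
    bisect.insort(neighbors, (-sim, w_i))

n = []
for w, s in [("w1", 0.6), ("w1", 0.3), ("w1", 0.5), ("w1", 0.8), ("w1", 0.7)]:
    sorted_insert_bisect(w, s, n)
print([(w, -neg) for neg, w in n])
# [('w1', 0.8), ('w1', 0.7), ('w1', 0.6), ('w1', 0.5), ('w1', 0.3)]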

Pandas: Check whether a column exists in another column

I am new to Python and pandas. I have a dataset with the following structure (it is a pandas DataFrame):
city time1 time2
a [1991, 1992, 1993] [1993, 1994, 1995]
time1 and time2 represent the coverage of the data in two sources. I would like to create a new column that indicates whether time1 and time2 have any intersection: if so, return True, otherwise False. The task sounds very straightforward. I was thinking about using set operations on the two columns, but it did not work as expected. Would anyone help me figure this out?
Thanks!
I appreciate your help.
You can iterate through all the columns, change the lists to sets, and see if there are any values in the intersection.
df1 = df.applymap(lambda x: set(x) if type(x) == list else set([x]))
df1.apply(lambda x: bool(x.time1 & x.time2), axis=1)
This is a semi-vectorized way that should make it run much faster.
df1 = df[['time1', 'time2']].applymap(lambda x: set(x) if type(x) == list else set([x]))
(df1.time1.values & df1.time2.values).astype(bool)
And even a bit faster
change_to_set = lambda x: set(x) if type(x) == list else set([x])
time1_set = df.time1.map(change_to_set).values
time2_set = df.time2.map(change_to_set).values
(time1_set & time2_set).astype(bool)
Here is a kind of ugly, but vectorized approach:
In [37]: df
Out[37]:
city time1 time2
0 a [1970] [1980]
1 b [1991, 1992, 1993] [1993, 1994, 1995]
2 c [2000, 2001, 2002] [2010, 2011]
3 d [2015, 2016] [2016]
In [38]: df['x'] = df.index.isin(
...: pd.DataFrame(df.time1.tolist())
...: .stack().reset_index(name='x')
...: .merge(pd.DataFrame(df.time2.tolist())
...: .stack().reset_index(name='x'),
...: on=['level_0','x'])['level_0'])
...:
In [39]: df
Out[39]:
city time1 time2 x
0 a [1970] [1980] False
1 b [1991, 1992, 1993] [1993, 1994, 1995] True
2 c [2000, 2001, 2002] [2010, 2011] False
3 d [2015, 2016] [2016] True
Timing:
In [54]: df = pd.concat([df] * 10**4, ignore_index=True)
In [55]: df.shape
Out[55]: (40000, 3)
In [56]: %%timeit
...: df.index.isin(
...: pd.DataFrame(df.time1.tolist())
...: .stack().reset_index(name='x')
...: .merge(pd.DataFrame(df.time2.tolist())
...: .stack().reset_index(name='x'),
...: on=['level_0','x'])['level_0'])
...:
1 loop, best of 3: 253 ms per loop
In [57]: %timeit df.apply(lambda x: bool(set(x.time1) & set(x.time2)), axis=1)
1 loop, best of 3: 5.36 s per loop
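For completeness, a plain list comprehension over the two columns is also a simple option and is often competitive with the apply-based versions (a sketch, not timed here):
# Build the intersection flags row by row without apply.
df['x'] = [bool(set(t1) & set(t2)) for t1, t2 in zip(df.time1, df.time2)]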
