Parallelizing a for loop that uses BeautifulSoup in Python - multithreading

I am trying to optimize the following loop:
all_a = []
for i in range(0, len(final_all)):
    soup = BeautifulSoup(final_all[i], 'html.parser')
    for t in soup.select('table[width="100%"]'):
        t.extract()
    for row in soup.select('tr'):
        name = row.get_text(strip=True, separator=' ').split('—', maxsplit=1)
        if name not in all_a:
            all_a.append(name)
where final_all is a list of 30,000 HTML documents that look like the .html from this question.
Parsing one HTML document takes less than one second.
I was wondering whether there is a smart way to combine the two loops that use soup.select() into a single loop. I also tried using sets, without success.
I also tried multiprocessing with only 30 observations, but I am clearly making a mistake:
%%time
all_a = []

def worker(data):
    for i in range(0, len(data)):
        start = time.time()
        soup = BeautifulSoup(data[i], 'html.parser')
        for t in soup.select('table[width="100%"]'):
            t.extract()
        for row in soup.select('tr'):
            name = row.get_text(strip=True, separator=' ').split('—', maxsplit=1)
            if name not in all_a:
                all_a.append(name)

test = final_all[0:30]

if __name__ == '__main__':
    pool = mp.Pool(8)  # os.cpu_count*2
    start = time.time()
    final = worker(test)
CPU times: user 1min 50s, sys: 2.91 s, total: 1min 53s
Wall time: 1min 48s
Compared to the times when I do not use multiprocessing:
CPU times: user 1min 39s, sys: 1.78 s, total: 1min 41s
Wall time: 1min 39s

Try this.
all_a = []
for i in range(0, len(final_all)):
    soup = BeautifulSoup(final_all[i], 'html.parser')
    for t in soup.select('table[width="100%"]'):
        t.extract()
    for row in soup.select('tr'):  # Out of the loop above
        # store a tuple rather than a list so it is hashable and set() can deduplicate it
        name = tuple(row.get_text(strip=True, separator=' ').split('—', maxsplit=1))
        all_a.append(name)
all_a = list(set(all_a))
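Since the question also asks about multiprocessing, here is a minimal sketch (my own, not tested against your data) of how the parsing could be parallelised: give each worker a single document from final_all, return that document's names, and deduplicate in the parent process. Appending to a module-level all_a inside the worker, as in the question, does not work because each worker process gets its own copy of the list.
from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_one(html):
    """Parse a single document and return its names."""
    soup = BeautifulSoup(html, 'html.parser')
    for t in soup.select('table[width="100%"]'):
        t.extract()
    names = []
    for row in soup.select('tr'):
        # tuples are hashable, so the parent process can deduplicate with a set
        names.append(tuple(row.get_text(strip=True, separator=' ').split('—', maxsplit=1)))
    return names

if __name__ == '__main__':
    all_a = set()
    with Pool(8) as pool:
        # chunksize cuts down on inter-process overhead for 30,000 small tasks
        for names in pool.imap_unordered(parse_one, final_all, chunksize=100):
            all_a.update(names)
    all_a = list(all_a)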

Related

Time efficient filtering of list in python

I have a database table called 'do_not_call', which contains information about files that hold ranges of 10-digit phone numbers in increasing order. The column 'filename' holds the name of the file that contains the range of numbers from 'first_phone' to 'last_phone'. There are about 2500 records in the 'do_not_call' table.
And I have a list of sqlalchemy records. I need to find which file holds the 'phone' field of each of these records. So I have created a function that takes in the sqlalchemy records and returns a dictionary where the key is the name of a file and the value is a list of phone numbers from the sqlalchemy records that fall in the range of first and last phone numbers contained in that file.
def get_file_mappings(dbcursor, import_records):
    start_time = datetime.now()
    phone_list = [int(rec.phone) for rec in import_records]
    dnc_sql = "SELECT * from do_not_call;"
    dbcursor.execute(dnc_sql)
    dnc_result = dbcursor.fetchall()
    file_mappings = {}
    for file_info in dnc_result:
        first_phone = int(file_info.get('first_phone'))
        last_phone = int(file_info.get('last_phone'))
        phone_ranges = list(filter(lambda phone: phone in range(first_phone, last_phone), phone_list))
        if phone_ranges:
            file_mappings.update({file_info.get('filename'): phone_ranges})
            phone_list = list(set(phone_list) - set(phone_ranges))
    # print(file_mappings)
    print("Time = ", datetime.now() - start_time)
    return file_mappings
For example if the phone_list is
[2023143300, 2024393100, 2027981539, 2022760321, 2026416368, 2027585911], the file_mappings returned will be
{'1500000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2023143300, 2022760321],
'1700000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2024393100],
'1900000_2020-9-24_Global_45A62481-17A2-4E45-82D6-DDF8B58B1BF8.txt': [2027981539, 2026416368, 2027585911]}
The problem here is that it takes a lot of time to execute: on average about 1.5 seconds per 1000 records. Is there a better approach/algorithm to solve this problem? Any help is appreciated.
This is a very inefficient approach to binning things into a sorted list. You are not taking advantage of the fact that your bins are sorted (or could easily be sorted if they were not). You are making a big nested loop here by testing phone numbers with the lambda statement.
You could make some marginal improvements by being consistent with set use (see below). But in the end, you could/should just find each phone's place in the listing with an efficient search, like bisection. See the example below with timing of the original, the set implementation, and bisection insertion.
If your phone_list is just massive, then other approaches may be advantageous, such as finding where the cutoff bins fit into a sorted copy of the phone list... but the approach below is 500x faster than what you have now for 1,000 or 10,000 records.
# phone sorter
import random
import bisect
import time
from collections import defaultdict

# make some fake data of representative size
low_phone = 200_000_0000
data = []  # [file, low_phone, high_phone]
for idx in range(2500):
    row = []
    row.append(f'file_{idx}')
    row.append(low_phone + idx * 20000000)
    row.append(low_phone + (idx + 1) * 20000000 - 20)  # some gap
    data.append(row)
high_phone = data[-1][-1]

# generate some random phone numbers in range
num_phones = 10000
phone_list_orig = [random.randint(low_phone, high_phone) for t in range(num_phones)]

# orig method...
phone_list = phone_list_orig[:]
tic = time.time()
results = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = list(filter(lambda phone: phone in range(low, high), phone_list))
    if phone_ranges:
        results.update({row[0]: phone_ranges})
        phone_list = list(set(phone_list) - set(phone_ranges))
toc = time.time()
print(f'orig time: {toc-tic:.3f}')

# with sets across the board...
phone_list = set(phone_list_orig)
tic = time.time()
results2 = {}
for row in data:
    low = row[1]
    high = row[2]
    phone_ranges = set(filter(lambda phone: phone in range(low, high), phone_list))
    if phone_ranges:
        results2.update({row[0]: phone_ranges})
        phone_list = phone_list - phone_ranges
toc = time.time()
print(f'using sets time: {toc-tic:.3f}')

# using bisection search
phone_list = set(phone_list_orig)
tic = time.time()
results3 = defaultdict(list)
lows = [t[1] for t in data]
for phone in phone_list:
    location = bisect.bisect(lows, phone) - 1
    if phone <= data[location][2]:  # it is within the high limit of the bin
        results3[data[location][0]].append(phone)
toc = time.time()
print(f'using bisection sort time: {toc-tic:.3f}')

# for k in sorted(results3):
#     print(k, ':', results.get(k))
# normalize value types (list vs. set) before comparing the three result dicts
assert ({k: set(v) for k, v in results.items()}
        == {k: set(v) for k, v in results2.items()}
        == {k: set(v) for k, v in results3.items()})
results:
orig time: 5.236
using sets time: 4.597
using bisection sort time: 0.012
[Finished in 9.9s]
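For reference, here is a rough sketch (my own adaptation, untested) of how the bisection idea might be folded back into the original get_file_mappings(); it assumes the same dbcursor rows and .get() access pattern shown in the question.
import bisect
from collections import defaultdict

def get_file_mappings_bisect(dbcursor, import_records):
    phone_list = [int(rec.phone) for rec in import_records]
    dbcursor.execute("SELECT * from do_not_call;")
    # sort the ranges by their lower bound so bisect can be used
    ranges = sorted(
        (int(r.get('first_phone')), int(r.get('last_phone')), r.get('filename'))
        for r in dbcursor.fetchall()
    )
    lows = [r[0] for r in ranges]
    file_mappings = defaultdict(list)
    for phone in phone_list:
        idx = bisect.bisect(lows, phone) - 1
        if idx >= 0 and phone <= ranges[idx][1]:  # phone is within the high limit of that bin
            file_mappings[ranges[idx][2]].append(phone)
    return dict(file_mappings)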

Stripping list into 3 columns Problem. BeautifulSoup, Requests, Pandas, Itertools

Python novice back again! I got a lot of great help on this but am now stumped. The code below scrapes soccer match data and scores from the Lehigh University soccer website. I am trying to split the scores format ['T', '0-0(2 OT)'] into 3 columns: 'T', '0-0', and '2 OT', but I am running into problems. The issue lies in this part of the code:
=> for result in soup.findAll("div", {'class': 'sidearm-schedule-game-result'}):
=> result = result.get_text(strip=True).split(',')
I tried .split(',') but that did not work, as it created ['T', '0-0(2 OT)']. Is there a way to split that into 3 columns: 1) T, 2) 0-0, and 3) 2 OT?
All help much appreciated.
Thanks
import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import zip_longest

d = []
n = []
res = []
op = []
yr = []
with requests.Session() as req:
    for year in range(2003, 2020):
        print(f"Extracting Year# {year}")
        r = req.get(
            f"https://lehighsports.com/sports/mens-soccer/schedule/{year}")
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'html.parser')
            for date in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-date flex-item-1'}):
                d.append(date.get_text(strip=True, separator=" "))
            for name in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-name'}):
                n.append(name.get_text(strip=True))
            for result in soup.findAll("div", {'class': 'sidearm-schedule-game-result'}):
                result = result.get_text(strip=True)
                #result = result.get_text(strip=True).split(',')
                res.append(result)
            if len(d) != len(res):
                res.append("None")
            for opp in soup.findAll("div", {'class': 'sidearm-schedule-game-opponent-text'}):
                op.append(opp.get_text(strip=True, separator=' '))
                yr.append(year)

data = []
for items in zip_longest(yr, d, n, op, res):
    data.append(items)
df = pd.DataFrame(data, columns=['Year', 'Date', 'Name', 'opponent', 'Result']).to_excel('lehigh.xlsx', index=False)
I'm going to focus here only on splitting the res list into three columns, and you can incorporate it into your code as you see fit. So let's say you have this:
res1='T, 0-0(2 OT)'
res2='W,2-1OT'
res3='T,2-2Game called '
res4='W,2-0'
scores = [res1,res2,res3,res4]
We split them like this:
print("result","score","extra")
for score in scores:
n_str = score.split(',')
target = n_str[1].strip()
print(n_str[0].strip(),' ',target[:3],' ',target[3:])
Output:
result score extra
T 0-0 (2 OT)
W 2-1 OT
T 2-2 Game called
W 2-0
Note that this assumes that no game ends with double digits scores (say, 11-2, or whatever); so this should work for your typical soccer game, but will fail with basketball :D
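If you do need to cover double-digit scores, a regex split is one option; this is my own sketch, not part of the answer above, and it reuses the scores list defined earlier:
import re

# result letter, then the score (any number of digits), then whatever is left over
pattern = re.compile(r'^\s*([WLT])\s*,\s*(\d+-\d+)\s*(.*)$')

for score in scores:
    m = pattern.match(score)
    if m:
        result, points, extra = m.groups()
        print(result, points, extra.strip())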

How can I get the arrays to all be the same length in Pandas?

I am able to scrape data from multiple web pages of a website using BeautifulSoup, and I am using pandas to make a table of the data. The problem is that I cannot get all of the arrays to be the same length, and I get:
ValueError: arrays must all be same length
Here is the code I have tried:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

# Lists to store the scraped data in
addresses = []
geographies = []
rents = []
units = []
availabilities = []

# Scraping all pages
pages_url = requests.get('https://www.rent.com/new-york/tuckahoe-apartments')
pages_soup = BeautifulSoup(pages_url.text, 'html.parser')
list_nums = pages_soup.find('div', class_='_1y05u').text
print(list_nums)

pages = [str(i) for i in range(0,6)]
for page in pages:
    response = requests.get('https://www.rent.com/new-york/tuckahoe-apartments?page=' + page).text
    html_soup = BeautifulSoup(response, 'html.parser')
    # Extract data from individual listing containers
    listing_containers = html_soup.find_all('div', class_='_3PdAH')
    print(type(listing_containers))
    print(len(listing_containers))
    print("Page " + str(page))
    for container in listing_containers:
        address = container.a
        if address is not None:
            addresses.append(address.text)
        elif address is None:
            addresses.append('None')
        else:
            address.append(np.nan)
        geography = container.find('div', class_='_1dhrl')
        if geography is not None:
            geographies.append(geography.text)
        elif geography is None:
            geographies.append('None')
        else:
            geographies.append(np.nan)
        rent = container.find('div', class_='_3e12V')
        if rent is None:
            rents.append('None')
        elif rent is not None:
            rents.append(rent.text)
        else:
            rents.append(np.nan)
        unit = container.find('div', class_='_2tApa')
        if unit is None:
            rents.append('None')
        elif rent is not None:
            units.append(unit.text)
        else:
            rents.append(np.nan)
        availability = container.find('div', class_='_2P6xE')
        if availability is None:
            availabilities.append('None')
        elif availability is not None:
            availabilities.append(availability.text)
        else:
            availabilities.append(np.nan)
    print(len(addresses))
    print(len(geographies))
    print(len(rents))
    print(len(units))
    print(len(availabilities))
    minlen = min(len(addresses), len(geographies), len(rents), len(units), len(availabilities))
    print('Minimum Array Length on this Page = ' + str(minlen))

test_df = pd.DataFrame({'Street': addresses,
                        'City-State-Zip': geographies,
                        'Rent': rents,
                        'BR/BA': units,
                        'Units Available': availabilities
                        })
print(test_df)
Here is the output with error, and I have printed the length of each array for each web page to show that the problem first occurs on "Page 5":
236 Properties
<class 'bs4.element.ResultSet'>
30
Page 0
30
30
30
30
30
Minimum Array Length on this Page = 30
<class 'bs4.element.ResultSet'>
30
Page 1
60
60
60
60
60
Minimum Array Length on this Page = 60
<class 'bs4.element.ResultSet'>
30
Page 2
90
90
90
90
90
Minimum Array Length on this Page = 90
<class 'bs4.element.ResultSet'>
30
Page 3
120
120
120
120
120
Minimum Array Length on this Page = 120
<class 'bs4.element.ResultSet'>
30
Page 4
150
150
150
150
150
Minimum Array Length on this Page = 150
<class 'bs4.element.ResultSet'>
30
Page 5
180
180
188
172
180
Minimum Array Length on this Page = 172
Traceback (most recent call last):
File "renttucktabletest.py", line 103, in <module>
'Units Available' : availabilities
...
ValueError: arrays must all be same length
For the result, I either want to cut the arrays short so they all stop at the minimum length and are equal (in this case, the minimum = 172), or to fill in the shorter arrays with NaN or 'None' up to the maximum array length so they are all equal (in this case, the maximum = 188).
I would prefer to find a solution that does not include more advanced coding than BeautifulSoup and pandas.
One quick fix is to wrap each list in a pandas Series before building the DataFrame; when a DataFrame is built from Series of different lengths, the shorter ones are padded with NaN instead of raising ValueError:
d = {'Street': addresses,
     'City-State-Zip': geographies,
     'Rent': rents,
     'BR/BA': units,
     'Units Available': availabilities
     }
test_df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in d.items()]))
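A tiny illustration of that padding behaviour (my own example, not part of the original answer):
import pandas as pd

# Series of unequal length are aligned on their index; missing positions become NaN
demo = pd.DataFrame({'a': pd.Series([1, 2, 3]), 'b': pd.Series([10, 20])})
print(demo)
#    a     b
# 0  1  10.0
# 1  2  20.0
# 2  3   NaN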
When scraping, it is better to put the record generated in each iteration into a temporary dict first, then append that dict to a list, as demonstrated below:
import numpy as np
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Scraping all pages
pages_url = requests.get("https://www.rent.com/new-york/tuckahoe-apartments")
pages_soup = BeautifulSoup(pages_url.text, "html.parser")
list_nums = pages_soup.find("div", class_="_1y05u").text

pages = [str(i) for i in range(0, 6)]
records = []
for page in pages:
    response = requests.get(
        "https://www.rent.com/new-york/tuckahoe-apartments?page=" + page
    ).text
    html_soup = BeautifulSoup(response, "html.parser")
    # Extract data from individual listing containers
    listing_containers = html_soup.find_all("div", class_="_3PdAH")
    print("Scraping page " + str(page))
    for container in listing_containers:
        # Dict to hold one record
        result = {}
        address = container.a
        if address is None:
            result["Street"] = np.nan
        else:
            result["Street"] = address.text
        geography = container.find("div", class_="_1dhrl")
        if geography is None:
            result["City-State-Zip"] = np.nan
        else:
            result["City-State-Zip"] = geography.text
        rent = container.find("div", class_="_3e12V")
        if rent is None:
            result["Rent"] = np.nan
        else:
            result["Rent"] = rent.text
        unit = container.find("div", class_="_2tApa")
        if unit is None:
            result["BR/BA"] = np.nan
        else:
            result["BR/BA"] = unit.text
        availability = container.find("div", class_="_2P6xE")
        if availability is None:
            result["Units Available"] = np.nan
        else:
            result["Units Available"] = availability.text
        print("Record: ", result)
        records.append(result)

test_df = pd.DataFrame(records)
print(test_df)

Socket Error Exceptions in Python when Scraping

I am trying to learn scraping.
I use exceptions lower down in the code to pass over errors, because they don't affect the writing of data to the csv.
I keep getting a socket.gaierror; while handling that I get a urllib.error.URLError, and while handling that I get NameError: name 'socket' is not defined, which seems circuitous.
I kind of understand that using these exceptions may not be the best way to run the code, but I can't seem to get past these errors and I don't know a way around them or how to fix them.
If you have any suggestions outside of fixing the error exceptions, that would be greatly appreciated as well.
import csv
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

base_url = 'http://www.fangraphs.com/'  # used in line 27 for concatenation
years = ['2017','2016','2015']  # for enough data to run tests

# Getting Links for letters
player_urls = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    if link.has_attr('href'):
        player_urls.append(base_url + link['href'])

# Getting Alphabet Links
test_for_playerlinks = 'players.aspx?letter='
player_alpha_links = []
for i in player_urls:
    if test_for_playerlinks in i:
        player_alpha_links.append(i)

# Getting Player Links
ind_player_urls = []
for l in player_alpha_links:
    data = urlopen(l)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            ind_player_urls.append(link['href'])

# Player Links
jan = 'statss.aspx?playerid'
players = []
for j in ind_player_urls:
    if jan in j:
        players.append(j)

# Building Pitcher List
pitcher = 'position=P'
pitchers = []
pos_players = []
for i in players:
    if pitcher in i:
        pitchers.append(i)
    else:
        pos_players.append(i)

# Individual Links to Different Tables Sorted by Base URL differences
splits = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs = 'http://www.fangraphs.com/statsd.aspx?'
split_pp = []
gamel = []
years = ['2017','2016','2015']
for i in pos_players:
    for year in years:
        split_pp.append(splits + i[12:] + '&season=' + year)
        gamel.append(game_logs + i[12:] + '&type=&gds=&gde=&season=' + year)

split_pitcher = []
gl_pitcher = []
for i in pitchers:
    for year in years:
        split_pitcher.append(splits + i[12:] + '&season=' + year)
        gl_pitcher.append(game_logs + i[12:] + '&type=&gds=&gde=&season=' + year)

# Splits for Pitcher Data
row_sp = []
rows_sp = []
try:
    for i in split_pitcher:
        sauce = urlopen(i)
        soup = BeautifulSoup(sauce, "html.parser")
        table1 = soup.find_all('strong', {"style":"font-size:15pt;"})
        row_sp = []
        for name in table1:
            nam = name.get_text()
            row_sp.append(nam)
        table = soup.find_all('table', {"class":"rgMasterTable"})
        for h in table:
            he = h.find_all('tr')
            for i in he:
                td = i.find_all('td')
                for j in td:
                    row_sp.append(j.get_text())
        rows_sp.append(row_sp)
except(RuntimeError, TypeError, NameError, URLError, socket.gaierror):
    pass

try:
    with open('SplitsPitchingData2.csv', 'w') as fp:
        writer = csv.writer(fp)
        writer.writerows(rows_sp)
except(RuntimeError, TypeError, NameError):
    pass
I'm guessing your main problem was that you, without any sleep whatsoever, queried the site for a huge number of invalid URLs (you create 3 URLs for the years 2015-2017 for 22,880 pitchers in total, but most of these do not fall within that scope, so you have tens of thousands of queries that return errors).
I'm surprised your IP wasn't banned by the site admin. That said: it would be better to do some filtering so you avoid all those error queries...
The filter I applied is not perfect. It checks whether any of the years in the list appears at either the start or the end of the year range given on the site (e.g. '2004 - 2015'). This still creates some error links, but nowhere near the amount the original script did.
In code it could look like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep
import csv

base_url = 'http://www.fangraphs.com/'
years = ['2017','2016','2015']

# Getting Links for letters
letter_links = []
data = urlopen('http://www.fangraphs.com/players.aspx')
soup = BeautifulSoup(data, "html.parser")
for link in soup.find_all('a'):
    try:
        link = base_url + link['href']
        if 'players.aspx?letter=' in link:
            letter_links.append(link)
    except:
        pass
print("[*] Retrieved {} links. Now fetching content for each...".format(len(letter_links)))

# the data resides in two different base_urls:
splits_url = 'http://www.fangraphs.com/statsplits.aspx?'
game_logs_url = 'http://www.fangraphs.com/statsd.aspx?'

# we need (for some reason) pitchers in two lists - pitchers_split and pitchers_game_log -
# and the rest of the players in two others, pos_players_split and pos_players_game_log
pos_players_split = []
pos_players_game_log = []
pitchers_split = []
pitchers_game_log = []

# and if we wanted to do something with the data from the letter queries, let's put that in a list for safe keeping:
ind_player_urls = []
current_letter_count = 0
for link in letter_links:
    current_letter_count += 1
    data = urlopen(link)
    soup = BeautifulSoup(data, "html.parser")
    trs = soup.find('div', class_='search').find_all('tr')
    for player in trs:
        player_data = [tr.text for tr in player.find_all('td')]
        # To prevent tons of queries to fangraphs with invalid years - check if elements from the years list exist in the player stat:
        if any(year in player_data[1] for year in years if player_data[1].startswith(year) or player_data[1].endswith(year)):
            href = player.a['href']
            player_data.append(base_url + href)
            # player_data now looks like this:
            # ['David Aardsma', '2004 - 2015', 'P', 'http://www.fangraphs.com/statss.aspx?playerid=1902&position=P']
            ind_player_urls.append(player_data)
            # build the links for game_log and split
            for year in years:
                split = '{}{}&season={}'.format(splits_url, href[12:], year)
                game_log = '{}{}&type=&gds=&gde=&season={}'.format(game_logs_url, href[12:], year)
                # checking if the player is a pitcher or not. We append both name (player_data[0]) and link, so we don't need to extract the name later on
                if 'P' in player_data[2]:
                    pitchers_split.append([player_data[0], split])
                    pitchers_game_log.append([player_data[0], game_log])
                else:
                    pos_players_split.append([player_data[0], split])
                    pos_players_game_log.append([player_data[0], game_log])
    print("[*] Done extracting data for players for letter {} out of {}".format(current_letter_count, len(letter_links)))
    sleep(2)

# CONSIDER INSERTING CSV-PART HERE....

# Extracting and writing pitcher data to file
with open('SplitsPitchingData2.csv', 'a') as fp:
    writer = csv.writer(fp)
    for i in pitchers_split:
        try:
            row_sp = []
            rows_sp = []
            # all elements in pitchers_split are lists: player name is i[0], link is i[1]
            data = urlopen(i[1])
            soup = BeautifulSoup(data, "html.parser")
            # append name to row_sp from pitchers_split
            row_sp.append(i[0])
            # the page has 3 tables with the class rgMasterTable: the first is Standard, the second Advanced, the third Batted Ball
            # we're only grabbing Standard
            table_standard = soup.find_all('table', {"class":"rgMasterTable"})[0]
            trs = table_standard.find_all('tr')
            for tr in trs:
                td = tr.find_all('td')
                for content in td:
                    row_sp.append(content.get_text())
            rows_sp.append(row_sp)
            writer.writerows(rows_sp)
            sleep(2)
        except Exception as e:
            print(e)
            pass
Since I'm not sure precisely how you want the data formatted on output, you will need to do some work on that yourself.
If you want to avoid waiting for all letter_links to be extracted before you retrieve the actual pitcher stats (and fine-tune your output), you can move the csv writer part up so it runs as part of the letter loop. If you do this, don't forget to empty the pitchers_split list before grabbing another letter_link...
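As a side note on the original traceback: the immediate cause of the NameError is that socket (and URLError) are referenced in the except clause but never imported. A minimal sketch of how those errors could be caught per request, assuming you keep the urlopen-based approach:
import socket
from urllib.error import URLError, HTTPError
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the page body, or None if the request fails."""
    try:
        return urlopen(url, timeout=timeout).read()
    except (HTTPError, URLError, socket.gaierror, socket.timeout) as e:
        print("skipping", url, "->", e)
        return None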

Python Multiprocessing throwing out results based on previous values

I am trying to learn how to use multiprocessing and have managed to get the code below to work. The goal is to work through every combination of the variables within the CostlyFunction by setting n equal to some number (right now it is 100 so the first 100 combinations are tested). I was hoping I could manipulate w as each process returned its list (CostlyFunction returns a list of 7 values) and only keep the results in a given range. Right now, w holds all 100 lists and then lets me manipulate those lists but, when I use n=10MM, w becomes huge and costly to hold. Is there a way to evaluate CostlyFunction's output as the workers return values and then 'throw out' values I don't need?
if __name__ == "__main__":
    import csv
    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    #width = -36000000/1000
    #fronteir = [None]*1000
    currtime = time()
    n = 100
    po = Pool()
    res = po.map_async(CostlyFunction, ((i,) for i in range(n)))
    w = res.get()
    spamwriter = csv.writer(csvFile, delimiter=',')
    spamwriter.writerows(w)
    print(('2: parallel: time elapsed:', time() - currtime))
    csvFile.close()
Unfortunately, Pool doesn't have a 'filter' method; otherwise, you might've been able to prune your results before they're returned. Pool.imap is probably the best solution you'll find for dealing with your memory issue: it returns an iterator over the results from CostlyFunction.
For sorting through the results, I made a simple list-based class called TopList that stores a fixed number of items. All of its items are the highest-ranked according to a key function.
from collections import UserList

def keyfunc(a):
    return a[5]  # This would be the sixth item in a result from CostlyFunction

class TopList(UserList):
    def __init__(self, key, *args, cap=10):  # cap is the largest number of results you want to store
        super().__init__(*args)
        self.cap = cap
        self.key = key
    def add(self, item):
        self.data.append(item)
        self.data.sort(key=self.key, reverse=True)
        if len(self.data) > self.cap:  # drop the lowest-ranked item once the cap is exceeded
            self.data.pop()
Here's how your code might look:
if __name__ == "__main__":
    import csv
    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    n = 100
    currtime = time()
    po = Pool()
    best = TopList(keyfunc)
    result_iter = po.imap(CostlyFunction, ((i,) for i in range(n)))
    for result in result_iter:
        best.add(result)
    spamwriter = csv.writer(csvFile, delimiter=',')
    spamwriter.writerows(best)  # write only the retained top results
    print(('2: parallel: time elapsed:', time() - currtime))
    csvFile.close()
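If keeping only the top results is all that's needed, heapq.nlargest is an alternative to the TopList class that avoids re-sorting on every insertion; this sketch is my own and reuses keyfunc, CostlyFunction, po, and n from the code above:
import heapq

# consumes the imap iterator lazily and keeps only the 10 best rows in memory
best_rows = heapq.nlargest(10, po.imap(CostlyFunction, ((i,) for i in range(n))), key=keyfunc)
spamwriter.writerows(best_rows)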
