I've got a question regarding the performance of ThreadPoolExecutor versus the plain Thread class, and it seems I'm missing something fundamental.
I have a web scraper split into two functions: the first parses the links for each image on a website's homepage, and the second loads an image from a parsed link:
import threading
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup as bs
import os
from concurrent.futures import ThreadPoolExecutor

path = r'C:\Users\MyDocuments\Pythom\Networking\bbc_images_scraper_test'
url = 'https://www.bbc.co.uk'

# Function to parse link anchors for images
def img_links_parser(url, links_list):
    res = urllib.request.urlopen(url)
    soup = bs(res, 'lxml')
    content = soup.findAll('div', {'class': 'top-story__image'})
    for i in content:
        try:
            link = i.attrs['style']
            # Pulling the anchor from parentheses
            link = link[link.find('(')+1 : link.find(')')]
            # Putting the anchor in the list of links
            links_list.append(link)
        except:
            # Links might be under the 'data-lazy' attribute w/o parentheses
            links_list.append(i.attrs['data-lazy'])

# Function to load images from links
def img_loader(base_url, links_list, path_location):
    for link in links_list:
        try:
            # Pulling the last element off the link, which is name.jpg
            file_name = link.split('/')[-1]
            # Following the link and saving the content in the given directory
            urllib.request.urlretrieve(urllib.parse.urljoin(base_url, link),
                                       os.path.join(path_location, file_name))
        except:
            print('Error on {}'.format(urllib.parse.urljoin(base_url, link)))
The following code is split into two cases:
Case 1: I'm using multiple threads:
threads = []
t1 = threading.Thread(target=img_loader, args=(url, links[:10], path))
t2 = threading.Thread(target=img_loader, args=(url, links[10:20], path))
t3 = threading.Thread(target=img_loader, args=(url, links[20:30], path))
t4 = threading.Thread(target=img_loader, args=(url, links[30:40], path))
t5 = threading.Thread(target=img_loader, args=(url, links[40:50], path))
t6 = threading.Thread(target=img_loader, args=(url, links[50:], path))
threads.extend([t1, t2, t3, t4, t5, t6])

for t in threads:
    t.start()
for t in threads:
    t.join()
The above code completes in about 10 seconds on my machine.
Case 2: I'm using ThreadPoolExecutor
with ThreadPoolExecutor(50) as exec:
    results = exec.submit(img_loader, url, links, path)
The above code takes 18 seconds.
My understanding was that ThreadPoolExecutor creates a thread for each worker, so setting max_workers to 50 would give 50 threads and should therefore complete the job faster.
Can someone please explain what I am missing here? I admit I'm probably making a silly mistake, but I just don't get it.
Many thanks!
In Case 2 you're sending all the links to one worker. Instead of
exec.submit(img_loader, url, links, path)
you'd need to:
for link in links:
    exec.submit(img_loader, url, [link], path)
I didn't try it out myself; that's just from reading the documentation of ThreadPoolExecutor.
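For example, a corrected Case 2 might look like the following sketch (untested, reusing the img_loader, url, links and path names from the question):
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=50) as executor:
    # One task per link, so the 50 workers actually share the work
    futures = [executor.submit(img_loader, url, [link], path) for link in links]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from the worker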
Following the explanation from this link, you could also use the executor.map function instead of running a for loop, as suggested by hansaplast.
from functools import partial

with ThreadPoolExecutor() as executor:
    # Create a new partially applied function that stores the directory
    # argument.
    #
    # This allows the download_link function that normally takes two
    # arguments to work with the map function that expects a function of a
    # single argument.
    fn = partial(download_link, download_dir)

    # Executes fn concurrently using threads on the links iterable. The
    # timeout is for the entire process, not a single call, so downloading
    # all images must complete within 30 seconds.
    executor.map(fn, links, timeout=30)
I think it can be easily adapted to your needs.
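For instance, a rough, untested adaptation to the code in the question could look like this (it wraps img_loader so each call handles a single link, reusing the url, links and path names from above):
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def load_one(base_url, path_location, link):
    # img_loader expects a list of links, so wrap the single link in a list
    img_loader(base_url, [link], path_location)

with ThreadPoolExecutor(max_workers=50) as executor:
    # Bind the fixed arguments, letting map supply one link per call
    fn = partial(load_one, url, path)
    list(executor.map(fn, links, timeout=30))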
And to answer the question about ThreadPoolExecutor: I'm not an expert on it, but from the documentation I've read so far, ThreadPoolExecutor is simply an easier way to create a dynamic pool of workers than managing threading.Thread yourself.
Related
I'm trying to do some web scraping, as a learning exercise, using a predefined number of workers.
I'm using None as a sentinel to break out of the while loop and stop the worker.
The problem is that the speed of each worker varies, and all the workers are closed before the last URL is passed to gather_search_links to get the links.
I tried to use asyncio.Queue, but I had less control than with deque.
async def gather_search_links(html_sources, detail_urls):
    while True:
        if not html_sources:
            await asyncio.sleep(0)
            continue
        data = html_sources.pop()
        if data is None:
            html_sources.appendleft(None)
            break
        data = BeautifulSoup(data, "html.parser")
        result = data.find_all("div", {"data-component": "search-result"})
        for record in result:
            atag = record.h2.a
            url = f'{domain_url}{atag.get("href")}'
            detail_urls.appendleft(url)
        print("appended data", len(detail_urls))
        await asyncio.sleep(0)


async def get_page_source(urls, html_sources):
    client = httpx.AsyncClient()
    while True:
        if not urls:
            await asyncio.sleep(0)
            continue
        url = urls.pop()
        print("url", url)
        if url is None:
            urls.appendleft(None)
            break
        response = await client.get(url)
        html_sources.appendleft(response.text)
        await asyncio.sleep(8)
    html_sources.appendleft(None)


async def navigate(urls):
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
        await asyncio.sleep(0)
    nav_urls.appendleft(None)
loop = asyncio.get_event_loop()
nav_html = deque()
nav_urls = deque()
products_url = deque()
navigate_workers = [asyncio.ensure_future(navigate(nav_urls)) for _ in range(1)]
page_source_workers = [asyncio.ensure_future(get_page_source(nav_urls, nav_html)) for _ in range(2)]
product_urls_workers = [asyncio.ensure_future(gather_search_links(nav_html, products_url)) for _ in range(1)]
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
loop.run_until_complete(workers)
I'm a bit of a newbie, so this could be as wrong as can be, but I believe the issue is that all three of the functions (navigate(), gather_search_links(), and get_page_source()) are asynchronous tasks that can be completed in any order. However, your checks for empty deques and your use of appendleft to ensure None is the leftmost item in your deques look like they would appropriately prevent this. For all intents and purposes, the code looks like it should run correctly.
I think the issue arises at this line:
workers = asyncio.wait([*navigate_workers, *page_source_workers, *product_urls_workers])
According to this post, the asyncio.wait function does not run these tasks in the order they're written above; instead it fires them according to I/O as coroutines. Again, your checks at the beginning of gather_search_links and get_page_source ensure that one function runs after the other, so this code should work if there is only a single worker for each function. If there are multiple workers for each function, I can see issues arising where None doesn't end up being the leftmost item in your deques. Perhaps a print statement at the end of each function showing the contents of your deques would be useful in troubleshooting this.
I guess my major question would be: why do these tasks asynchronously if you're going to write extra code because the steps must be completed synchronously? In order to get the HTML you must first have the URL. In order to scrape the HTML you must first have the HTML. What benefit does asyncio provide here? All three of these make more sense to me as synchronous tasks: get the URL, get the HTML, scrape the HTML, in that order.
EDIT: It occurred to me that the main benefit of asynchronous code here is that you don't want to wait on each individual URL to respond synchronously when you fetch the HTML. What I would do in this situation is gather my URLs synchronously first, and then combine the get and scrape functions into a single asynchronous function, which would be your only asynchronous function. Then you don't need a sentinel, a check for a "None" value, or any of that extra code, and you get the full value of the asynchronous fetch. You could then store your scraped data in a list (or deque or whatever) of futures. This would simplify your code and give you the fastest possible scrape time.
LAST EDIT:
Here's my quick and dirty rewrite. I liked your code, so I decided to do my own spin on it. I have no idea if it works; I'm not a Python person.
import asyncio
from collections import deque

import httpx
from bs4 import BeautifulSoup

# Get or build URLs from config
def navigate():
    urls = deque()
    for i in range(2, 7):
        url = f"https://www.example.com/?page={i}"
        urls.appendleft(url)
    return urls

# Asynchronously fetch and parse data for a single URL
async def fetchHTMLandParse(url):
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
    data = BeautifulSoup(response.text, "html.parser")
    result = data.find_all("div", {"data-component": "search-result"})
    for record in result:
        atag = record.h2.a
        # domain_url was defined elsewhere
        url = f'{domain_url}{atag.get("href")}'
        products_urls.appendleft(url)

loop = asyncio.get_event_loop()
products_urls = deque()
nav_urls = navigate()
fetch_and_parse_workers = [asyncio.ensure_future(fetchHTMLandParse(url)) for url in nav_urls]
workers = asyncio.wait([*fetch_and_parse_workers])
loop.run_until_complete(workers)
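A slightly more compact way to drive the same coroutines on Python 3.7+, assuming the fetchHTMLandParse and navigate definitions and the products_urls deque from the snippet above; asyncio.run replaces the manual event-loop handling:
import asyncio

async def scrape_all():
    # One task per URL, all running concurrently
    await asyncio.gather(*(fetchHTMLandParse(url) for url in navigate()))

asyncio.run(scrape_all())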
I have a large number of links I need to scrape from a website. I have ~70 base links and, from them, over 700 links that need to be scraped. So in order to speed up this process, which takes about 2-3 hours without threading/async, I decided to try threads/async.
My problem is that I need to render some JavaScript in order to get the links in the first place. I have been using requests-html to do this, as its html.render() method is very reliable. However, when I try to run this using threading or async I run into a host of problems. I tried AsyncHTMLSession due to this GitHub PR but have been unable to get it to work. I was wondering if anyone had any ideas or links they could point me to that might help.
Here is some example code:
from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession

links = (tuple of links)

n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    session = AsyncHTMLSession()
    results = []
    for l in batch_link:
        print(l)
        r = session.get(l)
        r.html.arender()
        tmp_next = r.html.xpath('//a[contains(@href, "/matches/")]')
    return tmp_next

pool = ThreadPool(processes=2)
output = pool.map(link_processor, batch)
pool.close()
pool.join()

print(output)
Output:
RuntimeError: There is no current event loop in thread 'Thread-1'.
I was able to fix this with some help from the learnpython subreddit. It turns out requests-html probably uses threads in some way, so threading on top of those threads causes an issue; simply using a multiprocessing pool works.
FIXED CODE:
from multiprocessing import Pool
from requests_html import HTMLSession
.....
pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)
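For completeness, here is a rough, untested sketch of what the full multiprocessing version could look like, based on the snippets above. The XPath is adjusted to return the href strings themselves so the results can be pickled back to the parent process, and the real tuple of links is represented by a placeholder:
from multiprocessing import Pool
from requests_html import HTMLSession

links = ("https://example.com/page1", "https://example.com/page2")  # placeholder for the real tuple of links
n = 5
batch = [links[i:i + n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    session = HTMLSession()              # one session per worker process
    found = []
    for l in batch_link:
        r = session.get(l)
        r.html.render()                  # synchronous JavaScript rendering
        # Return plain href strings so they pickle cleanly back to the parent
        found.extend(r.html.xpath('//a[contains(@href, "/matches/")]/@href'))
    return found

if __name__ == "__main__":
    pool = Pool(processes=3)
    output = pool.map(link_processor, batch)
    pool.close()
    pool.join()
    print(output)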
I'm new to Python threading, and I've gone through multiple posts, but I really did not understand how to use it. However, I tried to complete my task, and I want to check whether I'm using the right approach.
The task is:
Read a big CSV containing around 20K records, fetch the id from each record, and fire an HTTP API call for each record of the CSV.
t1 = time.time()
file_data_obj = csv.DictReader(open(file_path, 'rU'))
threads = []
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    thread = threading.Thread(target=requests.get, args=(apiurl,))
    thread.start()
    threads.append(thread)

t2 = time.time()

for thread in threads:
    thread.join()

print("Total time required to process a file - {} Secs".format(t2 - t1))
As there are 20K records, will it start 20K threads? Or will the OS/Python handle it? If so, can we restrict the number of threads?
How can I collect the responses returned by requests.get?
Would t2 - t1 really give me the time required to process the whole file?
As there are 20K records, will it start 20K threads? Or will the OS/Python handle it? If so, can we restrict the number of threads?
Yes, it will start a thread for each iteration. The maximum number of threads depends on your OS.
How can I grab the response returned by requests.get?
If you want to use the threading module only, you'll have to use a Queue. Threads return None by design, so you have to implement a line of communication between the threads and your main loop yourself.
import requests
from queue import Queue
from threading import Thread
import time

q = Queue()
threads = []

# A worker function that produces data
def return_get(q, apiurl):
    q.put(requests.get(apiurl))

for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    t = Thread(target=return_get, args=(q, apiurl))
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

while not q.empty():
    r = q.get()  # Fetches the first item on the queue
    print(r.text)
An alternative is to use a worker pool.
import requests
from concurrent.futures import ThreadPoolExecutor

threads = []
pool = ThreadPoolExecutor(10)

# Submit work to the pool
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    t = pool.submit(requests.get, apiurl)
    threads.append(t)

for t in threads:
    print(t.result())
You can use ThreadPoolExecutor
Retrieve a single page and report the URL and contents
import concurrent.futures
import urllib.request

def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()
Create a pool executor with N workers
with concurrent.futures.ThreadPoolExecutor(max_workers=N_workers) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
I am executing a Python script with threading where, given a "query" term that I put in the Queue, I create the URL with the query parameters, set the cookies, and parse the webpage to return the products and the URLs of those products. Here's the script.
Task: for a given set of queries, store the top 20 product ids in a file, or a lower number if the query returns fewer results.
I remember reading that Selenium is not thread safe. I just want to make sure this problem occurs because of that limitation, and whether there is a way to make it work in concurrent threads. The main problem is that the script is I/O bound, so it is very slow for scraping about 3000 URL fetches.
from pyvirtualdisplay import Display
from data_mining.scraping import scraping_conf as sf  # custom file with rules for scraping
import Queue
import threading
import logging
import urllib
import urllib2
import time
import pandas as pd

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

num_threads = 5
queue = Queue.Queue()
out_queue = Queue.Queue()

COOKIES = sf.__MERCHANT_PARAMS[merchant_domain]['COOKIES']
query_args = sf.__MERCHANT_PARAMS[merchant_domain]['QUERY_ARGS']
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""

    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def url_from_query(self, query):
        for key, val in query_args.items():
            if query_args[key] == 'query':
                query_args[key] = query
        print "query", query
        try:
            url = base_url + urllib.urlencode(query_args)
            print "url"
            return url
        except Exception as e:
            log()
            return None

    def init_driver_and_scrape(self, base_url, query, url):
        # Will use Pyvirtual display later
        #display = Display(visible=0, size=(1024, 768))
        #display.start()
        fp = webdriver.FirefoxProfile()
        fp.set_preference("browser.download.folderList", 2)
        fp.set_preference("javascript.enabled", True)
        driver = webdriver.Firefox(firefox_profile=fp)
        driver.delete_all_cookies()
        driver.get(base_url)
        for key, val in COOKIES[exp].items():
            driver.add_cookie({'name': key, 'value': val, 'path': '/',
                               'domain': merchant_domain, 'secure': False, 'expiry': None})
        print "printing cookie name & value"
        for cookie in driver.get_cookies():
            if cookie['name'] in COOKIES[exp].keys():
                print cookie['name'], "-->", cookie['value']
        driver.get(base_url + 'search=junk')  # To counter any refresh issues
        driver.implicitly_wait(20)
        driver.execute_script("window.scrollTo(0, 2000)")
        print "url inside scrape", url
        if url is not None:
            flag = True
            i = -1
            row_data, row_res = (), ()
            while flag:
                i = i + 1
                try:
                    driver.get(url)
                    key = sf.__MERCHANT_PARAMS[merchant_domain]['GET_ITEM_BY_ID'] + str(i)
                    print key
                    item = driver.find_element_by_id(key)
                    href = item.get_attribute("href")
                    prod_id = eval(sf.__MERCHANT_PARAMS[merchant_domain]['PRODUCTID_EVAL_FUNC'])
                    row_res = row_res + (prod_id,)
                    print url, row_res
                except Exception as e:
                    log()
                    flag = False
            driver.delete_all_cookies()
            driver.close()
            return query + "|" + str(row_res) + "\n"  # row_data, row_res
        else:
            return [query + "|" + "None"] + "\n"

    def run(self):
        while True:
            # grabs host from queue
            query = self.queue.get()
            url = self.url_from_query(query)
            print "query, url", query, url
            data = self.init_driver_and_scrape(base_url, query, url)
            self.out_queue.put(data)
            # signals to queue job is done
            self.queue.task_done()
class DatamineThread(threading.Thread):
    """Threaded Url Grab"""

    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            data = self.out_queue.get()
            fh.write(str(data) + "\n")
            # signals to queue job is done
            self.out_queue.task_done()
start = time.time()

def log():
    logging_hndl = logging.getLogger("get_results_url")
    logging_hndl.exception("Stacktrace from " + "get_results_url")

df = pd.read_csv(fh_query, sep='|', skiprows=0, header=0, usecols=None, error_bad_lines=False)  # read all queries
query_list = list(df['query'].values)[0:3]

def main():
    exp = "Control"

    # spawn a pool of threads, and pass them queue instance
    for i in range(num_threads):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # populate queue with data
    print query_list
    for query in query_list:
        queue.put(query)

    for i in range(num_threads):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
While I should be getting all search results from each URL page, I get only the first (i=0) search card, and this doesn't execute for all queries/URLs. What am I doing wrong?
What I expect -
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
url inside scrape http://<masked>/search=red+tops
searchResultsItem0
url inside scrape http://<masked>/search=halloween+costumes
searchResultsItem0
and more searchResultsItem(s) , like searchResultsItem1,searchResultsItem2 and so on..
What I get
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
The skeleton code was taken from
http://www.ibm.com/developerworks/aix/library/au-threadingpython/
Additionally, when I use pyvirtualdisplay, will that work with threading as well? I also used processes with the same Selenium code, and it gave the same error. Essentially it opens up 3 Firefox browsers with exactly the same URLs, while it should be opening them for different items from the queue. Here I stored the rules in a file imported as sf, which has all the custom attributes of a base domain.
Since setting the cookies is an integral part of my script, I can't use dryscrape.
EDIT :
I tried to localize the error, and here's what I found:
In the custom rules file (the one I call "sf" above), I had defined QUERY_ARGS as
__MERCHANT_PARAMS = {
    "some_domain.com": {
        COOKIES: {
            # <a dict of dict, masked here>
        },
        # ... more such rules
        QUERY_ARGS: {'search': 'query'},
    },
}
So what is really happening is that on calling
query_args = sf.__MERCHANT_PARAMS[merchant_domain]['QUERY_ARGS']
which should return the dict {'search': 'query'}, it instead raises
AttributeError: 'module' object has no attribute
'_ThreadUrl__MERCHANT_PARAMS'
This is where I don't understand how the thread is adding the '_ThreadUrl__' prefix. I also tried re-initializing query_args inside the url_from_query method, but this doesn't work.
Any pointers on what I am doing wrong?
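For what it's worth, the '_ThreadUrl__' prefix is Python's name mangling: inside a class body, any attribute written with two leading underscores is rewritten to _ClassName__attribute before the lookup happens, which is why sf.__MERCHANT_PARAMS works at module level but fails inside ThreadUrl. A minimal illustration, with hypothetical names that are not part of the scraper:
class Demo(object):
    def read(self, module):
        # Written as module.__SETTINGS, but compiled as module._Demo__SETTINGS,
        # so this raises AttributeError unless that mangled name exists.
        return module.__SETTINGS

    def read_ok(self, module):
        # getattr with an explicit string is not subject to name mangling.
        return getattr(module, '__SETTINGS')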
I may be replying pretty late to this. However, I tested it on Python 2.7 and both options, multithreading and multiprocessing, work with Selenium, and it opens two separate browsers.
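As a rough, untested sketch of the pattern I mean (one WebDriver instance created and owned by each thread, nothing shared between them; the URLs are placeholders):
import threading
from selenium import webdriver

def fetch_title(url, results, index):
    driver = webdriver.Firefox()   # each thread gets its own browser
    try:
        driver.get(url)
        results[index] = driver.title
    finally:
        driver.quit()

urls = ["http://example.com", "http://example.org"]   # placeholder URLs
results = [None] * len(urls)
workers = [threading.Thread(target=fetch_title, args=(u, results, i))
           for i, u in enumerate(urls)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(results)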
I have a CSV file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the URLs in the CSV, scrapes the information, and records/saves it in another CSV file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link, it takes about 1 s to crawl and save the info into the CSV, which is too slow for the magnitude of the project. So I have incorporated the multithreading module, and to my surprise it doesn't speed things up at all; it still takes 1 s per link. Did I do something wrong? Is there another way to speed up the processing?
Without multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading
def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")
With multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading
def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
You're not parallelizing this properly. What you actually want is for the work being done inside your for loop to happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will actually slightly hurt it).
Here's an example that uses a ThreadPool to parallelize the network operations and the parsing. It's not safe to write to the CSV file from many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.
import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool # This is a thread-based Pool
from multiprocessing import cpu_count
def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)

    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
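As a quick sketch of that swap (everything else in the example above stays the same; crawlToCSV must remain a module-level function so it can be pickled for the worker processes):
# Thread-based pool (what the example above uses):
# from multiprocessing.dummy import Pool

# Process-based pool (sidesteps the GIL for the BeautifulSoup parsing):
from multiprocessing import Pool
from multiprocessing import cpu_count

pool = Pool(cpu_count())  # worker processes instead of threads; usage is identical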
Edit:
If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:
imap(func, iterable[, chunksize])
A lazier version of map().
The chunksize argument is the same as the one used by the map()
method. For very long iterables using a large value for chunksize can
make the job complete much faster than using the default value of 1.
if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // NUM_WORKERS * 4  # Try to get a good chunksize. You're probably going to have to tweak this, though. Try smaller and larger values and see how performance changes.
    pool = Pool(NUM_WORKERS)

    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize)
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in result_iter:  # lazily iterate over results.
            writeFile.writerow(result)
With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.