Python - Speeding up Web Scraping using multiprocessing - python-3.x

I have the following function to scrape a webpage.
def parse(link: str, list_of_samples: list, index: int) -> None:
    # Some code to scrape the webpage (link is given)
    # The code will generate a list of strings, say sample
    list_of_samples[index] = sample
I have another script that calls the above script for all URLs present in a list
def call_that_guy(URLs: list) -> list:
    samples = [None for i in range(len(URLs))]
    for i in range(len(URLs)):
        parse(URLs[i], samples, i)
    return samples
Some other function that calls the above function
def caller() -> None:
    URLs = [url_1, url_2, url_3, ..., url_n]
    # n will not exceed 12
    samples = call_that_guy(URLs)
    print(samples)
    # Prints the list of samples, but is taking too much time
One thing I noticed is that the parse function is taking around 10s to parse a single webpage (I am using Selenium). So, parsing all the URLs present in the list, it is taking around 2 minutes. I want to speed it up, probably using multithreading.
I tried doing the following instead.
import threading
def call_that_guy(URLs: list) -> list:
    threads = [None for i in range(len(URLs))]
    samples = [None for i in range(len(URLs))]
    for i in range(len(URLs)):
        threads[i] = threading.Thread(target = parse, args = (URLs[i], samples, i))
        threads[i].start()
    return samples
But, when I printed the returned value, all of its contents were None.
What am I trying to Achieve:
I want to asynchronously Scrape a list of URLs and populate the list of samples. Once the list is populated, I have some other statements to execute (they should execute only after samples is populated, else they'll cause Exceptions). I want to scrape the list of URLs faster (asynchronously is allowed) instead of scraping them one after another.
(I can explain something more clearly with image)

Why don't you use concurrent.futures module?
Here is a simple example using concurrent.futures that runs the scraping calls in a pool of threads:
import concurrent.futures
def scrape_url(url):
    print(f'Scraping {url}...')
    scraped_content = '<html>scraped content!</html>'
    return scraped_content
urls = ['https://www.google.com', 'https://www.facebook.com', 'https://www.youtube.com']
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(scrape_url, urls)
print(list(results))
# Expected output:
# ['<html>scraped content!</html>', '<html>scraped content!</html>', '<html>scraped content!</html>']
If you want to learn threading, I recommend watching this short tutorial: https://www.youtube.com/watch?v=IEEhzQoKtQU
Also note that this is not multiprocessing, this is multithreading and the two are not the same. If you want to know more about the difference, you can read this article: https://realpython.com/python-concurrency/
Hope this solves your problem.
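For reference, the threaded attempt in the question returns a list of None because call_that_guy returns right after starting the threads, without waiting for them to finish. Below is a minimal sketch of that same version with join added. The parse function here is only a stand-in that sleeps instead of driving Selenium, and the URL strings are placeholders.

```python
import threading
import time


def parse(link: str, list_of_samples: list, index: int) -> None:
    # Stand-in for the real Selenium scraping: sleep briefly, then store a result.
    time.sleep(0.1)
    list_of_samples[index] = [f"sample from {link}"]


def call_that_guy(urls: list) -> list:
    samples = [None] * len(urls)
    threads = [
        threading.Thread(target=parse, args=(url, samples, i))
        for i, url in enumerate(urls)
    ]
    for t in threads:
        t.start()
    # Wait for every thread to finish before returning the results.
    for t in threads:
        t.join()
    return samples


samples = call_that_guy(["url_1", "url_2", "url_3"])
print(samples)
```

Each thread writes to its own index, so the results come back in the same order as the input URLs. ThreadPoolExecutor does the same waiting for you when its with block exits.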

Related

RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited

I have always resisted using asyncio within my code, but using it might help with some performance issues that I'm having.
Here is my scenario:
An end user provides a list of news sites to scrape
Each element is passed to an Article Class
A valid article is passed to an Extraction Class
The Extraction Class passes data to a NewsExtraction Class
90% of the time this flow is flawless, but occasionally one of the 12 functions in the NewsExtraction Class fails to extract data that exists in the HTML provided. It seems that my code is "stepping on itself," which causes the data element not to be parsed. When I rerun the code, all the elements are parsed correctly.
The NewsExtraction Class has this function get_article_data_elements, which is called from the Extraction Class.
The function get_article_data_elements calls these items:
published_date = self._extract_article_published_date()
modified_date = self._extract_article_modified_date()
title = self._extract_article_title()
description = self._extract_article_description()
keywords = self._extract_article_key_words()
tags = self._extract_article_tags()
authors = self._extract_article_author()
top_image = self._extract_top_image()
language = self._extract_article_language()
categories = self._extract_article_category()
text = self._extract_textual_content()
url = self._extract_article_url()
Each of these data elements is used to populate a Python Dictionary, which is eventually passed back to the End User.
I have been trying to add asyncio code to the NewsExtraction Class, but I kept getting this error message:
RuntimeWarning: coroutine 'NewsExtraction.get_article_data_elements' was never awaited
I have spent the last 3 days trying to figure this issue out. I have looked at dozens of questions on Stack Overflow on this error RuntimeWarning: coroutine never awaited. I have also looked at numerous articles on using asyncio, but I cannot figure out how to use asyncio with my NewsExtraction Class, which is called from the Extraction Class.
Can someone provide me some pointers to solve my issue?
class NewsExtraction(object):
    """
    This class is used to extract common data elements from a news article
    on xyz
    """
    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup
truncated...
    async def _extract_article_published_date(self):
        """
        This function is designed to extract the publish date for the article being parsed.
        :return: date article was originally published
        :rtype: string
        """
        json_date_published = JSONExtraction(self._url, self._raw_soup).extract_article_published_date()
        if json_date_published is not None:
            if len(json_date_published) != 0:
                return json_date_published
            else:
                return None
        elif json_date_published is None:
            if self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")}):
                date_published = self._raw_soup.find(name='div', attrs={'class': regex.compile("--publishDate")})
                if len(date_published) != 0:
                    return date_published.text
            else:
                logger.info('The HTML tag to extract the publish date for the following article was not found.')
                logger.info(f'Article URL -- {self._url}')
                return None
truncated...
    async def get_article_data_elements(self):
        """
        This function is designed to extract all the common data elements from a
        news article on xyz.
        :return: dictionary of data elements related to the article
        :rtype: dict
        """
        article_data_elements = {}
        # I have tried this:
        published_date = self._extract_article_published_date().__await__()
        # and this
        published_date = self.task(self._extract_article_published_date())
        await published_date
truncated...
I have also tried to use:
if __name__ == "__main__":
    asyncio.run(NewsExtraction.get_article_data_elements())
    # asyncio.run(self.get_article_data_elements())
I'm really banging my head on the wall with using asyncio in my news extraction code.
If this question is off base, I will be happy to delete it and keep reading about how to use asyncio correctly.
Can someone provide me some pointers to solve my issue?
Thanks in advance for any guidance on using asyncio
You are defining _extract_article_published_date and get_article_data_elements as coroutines, and these coroutines must be awaited in your code to get the result of their execution asynchronously.
You can do this by creating an instance of NewsExtraction and calling these methods with the await keyword in front. The await hands execution to other tasks in the event loop until the awaited task completes. Note that no threads or processes are involved in this task execution; control is handed over only while the task is not using CPU time (awaiting I/O operations or sleeping).
if __name__ == '__main__':
    extractor = NewsExtraction(...)
    # this creates the event loop and runs the coroutine
    asyncio.run(extractor.get_article_data_elements())
Inside _extract_article_published_date you must also await any coroutines that perform requests over the network. If you are using a library for the scraping, make sure it uses async/await behind the scenes; otherwise asyncio will not give you a real performance gain.
async def get_article_data_elements(self):
    article_data_elements = {}
    # note here that the instance is self
    published_date = await self._extract_article_published_date()
truncated...
You must dive into the asyncio documentation to get a better understanding of these features of Python 3.7+.
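To make the pattern concrete, here is a minimal, self-contained sketch. It is not the asker's real class: the soup argument is replaced by a plain dict, only two hypothetical extractor methods are shown, and asyncio.sleep(0) stands in for real asynchronous I/O. It shows the two points above: every coroutine is awaited (here through asyncio.gather), and asyncio.run is given a coroutine created from an instance.

```python
import asyncio


class NewsExtraction:
    def __init__(self, url, soup):
        self._url = url
        self._raw_soup = soup  # a plain dict here, standing in for parsed HTML

    async def _extract_article_title(self):
        # Placeholder for real asynchronous I/O.
        await asyncio.sleep(0)
        return self._raw_soup.get("title")

    async def _extract_article_language(self):
        await asyncio.sleep(0)
        return self._raw_soup.get("language")

    async def get_article_data_elements(self):
        # gather awaits both coroutines and returns their results in order.
        title, language = await asyncio.gather(
            self._extract_article_title(),
            self._extract_article_language(),
        )
        return {"title": title, "language": language, "url": self._url}


extractor = NewsExtraction(
    "https://example.com/article", {"title": "Hello", "language": "en"}
)
result = asyncio.run(extractor.get_article_data_elements())
print(result)
```

If the extractor methods only parse HTML that is already in memory, they do no I/O to wait on, so making them coroutines will not speed anything up.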

How to loop through api using Python to pull records from specific group of pages

I wrote a Python script that pulls json data from api. The api has 10k records.
As I set up my script, I was only pulling 10 pages, (each page contains 25 items), so it worked fine, dump everything in a .csv and also put everything into a mysql db.
When I ran what I thought would be my last test and pulled (or should I say, attempted to pull) the data from all 500 pages, I got an internal server error. I researched that and think it is because I am pulling all this data at once. The API documentation is kind of crappy and I can't find any rate limit info. Anyway...
Since this is not my API, I thought a quick solution would be just to run my script and let it pull the data from the first 10 pages, then the second 10 pages, then the third 10 pages, etc.
For obvious reasons I can't show all the code, but below are the basics/snippets. It is pretty simple: just grab the URL, manipulate it a bit so I can add the page number, then count the number of pages, then loop through and grab the data content.
Could someone help by explaining/showing how I can loop through my URL, get the content from pages 1-10, then on the next loop get the content from pages 11-20, and so on?
Any insight, suggestions, examples would be greatly appreciated.
import requests
import json
import pandas as pd
import time
from datetime import datetime, timedelta
time_now = time.strftime("%Y%m%d-%H%M%S")
# Make our Request
def main_request(baseurl, x, endpoint, headers):
    r = requests.get(baseurl + f'{x}' + endpoint, headers=headers)
    return r.json()
# determine how many pages are needed to loop through, use for pagination
def get_pages(response):
    # return response['page']['size']
    return 2
def parse_json(response):
    animal_list = []
    for item in response['_embedded']['results']:
        animal_details = {
            'animal type': item['_type'],
            'animal title': item['title'],
            'animal phase': item['type']['value']
        }
        animal_list.append(animal_details)
    return animal_list
animal_main_list = []
# baseurl, endpoint and headers are defined in the part of the script not shown
animal_data = main_request(baseurl, 1, endpoint, headers)
for x in range(1, get_pages(animal_data) + 1):
    print(x)
    animal_main_list.extend(parse_json(main_request(baseurl, x, endpoint, headers)))
Got it working by using the range function and f'{x}' in my URL. I set the range in a for loop, which updates the URL on each iteration, and used time.sleep to slow the retrieval down.
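The batching described above could look roughly like the sketch below. It is an illustration under assumptions, not the poster's actual code: fetch_page is a hypothetical stand-in for main_request that returns 25 fake items per page, and the batch size and pause length are arbitrary.

```python
import time


def fetch_page(page):
    # Hypothetical stand-in for main_request(): 25 fake items per page.
    return [f"item-{page}-{i}" for i in range(25)]


def fetch_in_batches(total_pages, batch_size=10, pause_seconds=0.0):
    items = []
    # start takes the values 1, 11, 21, ... so each batch covers batch_size pages.
    for start in range(1, total_pages + 1, batch_size):
        stop = min(start + batch_size, total_pages + 1)
        for page in range(start, stop):
            items.extend(fetch_page(page))
        # Pause between batches (but not after the last one).
        if stop <= total_pages:
            time.sleep(pause_seconds)
    return items


items = fetch_in_batches(total_pages=25, batch_size=10, pause_seconds=0.0)
print(len(items))
```

With a real API, pause_seconds would be set to a value the server tolerates; since the documentation gives no rate limit, that value has to be found by trial.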

Multiprocessing a function that tests a given dataset against a list of distributions. Returning function values from each iteration through list

I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one with the parameters generated for said distribution.
Currently, I have code that does this in serial thanks to this answer https://stackoverflow.com/a/37616966. Of course, it is going to take entirely too long to process my full dataset. I have been playing around with multiprocessing, but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of sum of square error. Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.
def fit_dist(distribution, data=data, bins=200, ax=None):
    # Block of code that tests the distribution and generates params
    return (distribution.name, best_params, sse)
if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()
I need some help with how to actually make use of the return values on each of the iterations in the multiprocessing to compare those values. I'm really new to python especially multiprocessing so please be patient with me and explain as much as possible.
The problem I'm having is it's giving me an "UnboundLocalError" on the variables that I'm trying to return from my fit_dist function. The DISTRIBUTIONS list is 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of fit_dist?
With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted it to. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question I did the following.
from multiprocessing import Pool
def fit_dist(distribution, data=data, bins=200, ax=None):
    # put this return under the right section of this method
    return [distribution.name, params, sse]
if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()
    '''filter out the None object results. Due to the nature of the distribution fitting,
    some distributions are so far off that they result in None objects'''
    res = list(filter(None, result))
    # start above any real SSE so the first valid result replaces it
    best_sse = float('inf')
    # iterates over nested list storing the lowest sum of squared errors in best_sse
    for dist in res:
        if best_sse > dist[2] > 0:
            best_sse = dist[2]
        else:
            continue
    '''iterates over list pulling out sublist of distribution with best sse.
    The sublists are made up of a string, tuple with parameters,
    and float value for sse so that's why sse is always index 2.'''
    for dist in res:
        if dist[2] == best_sse:
            best_dist_list = dist
        else:
            continue
The rest of the code simply consists of me using that list to construct charts and plots with that best distribution overtop of a histogram of my raw data.
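As a side note, the two loops that find the lowest SSE can be collapsed into a single min call with a key function. The sketch below assumes each result has the shape (name, params, sse) or is None, as described above. The fit_dist here is a hypothetical stand-in that looks up canned results rather than fitting anything.

```python
from multiprocessing import Pool

# Canned results standing in for real fits: (name, params, sse) or None on failure.
FAKE_RESULTS = {
    "norm": ("norm", (0.0, 1.0), 0.5),
    "expon": None,
    "gamma": ("gamma", (2.0,), 0.1),
}


def fit_dist(name):
    # Hypothetical stand-in for the real fitting function.
    return FAKE_RESULTS.get(name)


def pick_best(results):
    # Drop failed fits (None) and non-positive SSE values, then take the lowest SSE.
    valid = [r for r in results if r is not None and r[2] > 0]
    return min(valid, key=lambda r: r[2])


if __name__ == "__main__":
    # Pool.map returns one result per input, in input order.
    with Pool() as pool:
        results = pool.map(fit_dist, ["norm", "expon", "gamma"])
    print(pick_best(results))
```

pick_best raises ValueError if every fit failed, so a real script may want to handle the empty case explicitly.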

Threading/Async in Requests-html

I have a large number of links I need to scrape from a website: ~70 base links, and from those over 700 links that need to be scraped. Without threading/async this takes about 2-3 hours, so to speed it up I decided to try threading/async.
My problem is that I need to render some javascript in order to get the links in the first place. I have been using requests-html to do this, as its html.render() method is very reliable. However, when I try to run this using threading or async I run into a host of problems. I tried AsyncHTMLSession due to this Github PR but have been unable to get it to work. I was wondering if anyone had any ideas or links they could point me to that might help.
Here is some example code:
from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession
links = (tuple of links)
n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]
def link_processor(batch_link):
    session = AsyncHTMLSession()
    results = []
    for l in batch_link:
        print(l)
        r = session.get(l)
        r.html.arender()
        tmp_next = r.html.xpath('//a[contains(@href, "/matches/")]')
    return tmp_next
pool = ThreadPool(processes=2)
output = pool.map(link_processor, batch)
pool.close()
pool.join()
print(output)
Output:
RuntimeError: There is no current event loop in thread 'Thread-1'.
Was able to fix this with some help from the learnpython subreddit. Turns out requests-html probably uses threads in some way and so threading the threads has an issue so simply using multiprocessing pool works.
FIXED CODE:
from multiprocessing import Pool
from requests_html import HTMLSession
.....
pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)

Python avoiding large array allocation multiple times

I have to compute a function many many times.
To compute this function the elements of an array must be computed.
The array is quite large.
How can I avoid allocating the array in every function call?
The code I have tried goes something like this:
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it
        Let's say that we do
        self.data = data
        """
    def function(self, point):
        return numpy.sum(numpy.array([somecomputations(item) for item in self.data]))
Well, maybe my concern is unfounded, so I have first this question.
Question: Is it true that the array [somecomputations(item) for item in data] is being allocated and deallocated for every call to function?
Thinking that that is the case I have tried
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it
        Let's say that we do
        self.data = data
        """
        self.number_of_data = range(0, len(data))
        self.my_array = numpy.zeros(len(data))
    def function(self, point):
        for i in self.number_of_data:
            self.my_array[i] = somecomputations(self.data[i])
        return numpy.sum(self.my_array)
This is slower than the previous version. I assume that the list comprehension in the first version can be run entirely in C, while in the second version only smaller parts of the script can be translated into optimized C code.
I have very little idea of how Python works inside.
Question: Is there a good way to skip the array allocation in every function call and at the same time take advantage of a well optimized loop on the array?
I am using Python3.5
Looping over the array in Python is unnecessary and crosses from Python into C many times, hence the slowdown. The beauty of numpy arrays is that functions work on them element by element. I think the fastest would be:
return numpy.sum(somecomputations(self.data))
somecomputations may need a bit of modification, but often it will work right off the bat. Also, you're not using point, among other things.
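On the allocation question itself: in the first version, every call builds a new Python list and then a new numpy array from it. If the computation can be expressed with numpy ufuncs, a scratch buffer can be allocated once and reused through the out argument. The sketch below assumes a made-up computation (squared distance of each item from point), since somecomputations is not shown in the question.

```python
import numpy as np


class FunctionCalculator:
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        # Scratch buffer allocated once and reused by every call.
        self._scratch = np.empty_like(self.data)

    def function(self, point):
        # Assumed computation: sum of (item - point) ** 2, done in place.
        np.subtract(self.data, point, out=self._scratch)
        np.square(self._scratch, out=self._scratch)
        return float(self._scratch.sum())


calc = FunctionCalculator([1.0, 2.0, 3.0])
total = calc.function(1.0)
print(total)
```

For small arrays the saved allocation is unlikely to be measurable; the larger win usually comes from replacing the Python-level loop with vectorized calls.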
