multiprocessing pool with a dictionary as one of the arguments? - python-3.x

Is it possible to use Pool.map() on a function that takes an empty dictionary as one of its arguments? I am new to multiprocessing and want to parallelize a web-scraping function. I tried following the example from this site, but it doesn't include a dictionary as one of the arguments. The multiprocessing function works (it prints out the search result), but it never appends to the dictionary; after the pool finishes, the dictionary is still empty. It looks like I have to use Manager(), but I don't know how to implement it (see use of Manager()). Thanks for any help.
from functools import partial
from multiprocessing import Pool
from urllib.request import urlopen as ureq  # missing from the original snippet
from bs4 import BeautifulSoup as soup

count = 1
outerDict = dict()
emptyList = []
lstOfItems = ['Valsartan', 'Estrace', 'Norvasc', 'Combivent',
              'Fluvirin', 'Kariva', 'Natrl', 'Foxamax', 'Vilanterol', 'Catapres']

def process_search(item, soupPage_, outerDict, count, emptyList):
    '''a function that scrapes a site; the outerDict and emptyList will
    become populated as it scrapes the site for each item'''

def callSrch(item, outerDict, emptyList, count):
    searchlink = 'http://www.asite.com'
    uClient = ureq(searchlink + item)
    pagehtml = uClient.read()
    soupPage_ = soup(pagehtml, 'html.parser')
    process_search(item, soupPage_, outerDict, count, emptyList)

with Pool() as p:
    prfx = partial(callSrch, outerDict=outerDict, emptyList=emptyList, count=count)
    p.map(prfx, lstOfItems)
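Since each worker runs in its own process, it gets a copy of outerDict, and any writes stay in that copy; to see them in the parent you can pass proxies from multiprocessing.Manager instead. A minimal sketch of that idea (the scraping itself is stubbed out, so treat the body of callSrch as a placeholder rather than working scraper code):

from functools import partial
from multiprocessing import Pool, Manager

lstOfItems = ['Valsartan', 'Estrace', 'Norvasc']

def callSrch(item, outerDict, emptyList):
    # stand-in for the real scraping; store whatever process_search would produce
    outerDict[item] = 'result for ' + item
    emptyList.append(item)

if __name__ == '__main__':
    with Manager() as mgr:
        outerDict = mgr.dict()   # proxy objects: writes made in workers reach the parent
        emptyList = mgr.list()
        prfx = partial(callSrch, outerDict=outerDict, emptyList=emptyList)
        with Pool() as p:
            p.map(prfx, lstOfItems)
        print(dict(outerDict))   # copy the data out before the manager shuts down
        print(list(emptyList))

An alternative that avoids shared state entirely is to have callSrch return its results and build the dictionary in the parent from the list that p.map returns.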

Related

Extract item for each spider in scrapy project

I have over a dozen spiders in a scrapy project, with a variety of items being extracted from different sources. For one element in particular I have to copy the same regex code over and over again in each spider, for example
item['element'] = re.findall('my_regex', response.text)
I use this regex to get the same element, which is defined in the scrapy items. Is there a way to avoid copying it? Where do I put this in the project so that I don't have to repeat it in each spider and only add the expressions that differ?
My project structure is the default one.
Any help is appreciated, thanks in advance.
So if I understand your question correctly, you want to use the same regular expression across multiple spiders.
You can do this:
create a python module called something like regex_to_use
inside that module place your regular expression.
example:
# regex_to_use.py
regex_one = 'test'
You can access this expression in your spiders.
# spider.py
import regex_to_use
import re as regex
find_string = regex.search(regex_to_use.regex_one, ' this is a test')
print(find_string)
# output
<re.Match object; span=(11, 15), match='test'>
You could also do something like this in your regex_to_use module
# regex_to_use.py
import re as regex

class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self):
        find_xyx = regex.search('test', self._text)
        return find_xyx
and you would call it this way in your spiders:
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text()
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you have multiple regular expressions you could do something like this:
# regex_to_use.py
import re as regex

class CustomRegularExpressions(object):
    def __init__(self, text):
        """
        :param text: string containing the variable to search for
        """
        self._text = text

    def search_text(self, regex_to_use):
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
        find_xyx = regex.search(expression, self._text)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
find_word = CustomRegularExpressions('this is a test').search_text('regex_one')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
You can also use a staticmethod in the class CustomRegularExpressions
# regex_to_use.py
import re as regex

class CustomRegularExpressions:
    @staticmethod
    def search_text(regex_to_use, text_to_search):
        regular_expressions = {"regex_one": 'test', "regex_two": 'test_2'}
        expression = ''.join([v for k, v in regular_expressions.items() if k == regex_to_use])
        find_xyx = regex.search(expression, text_to_search)
        return find_xyx
# spider.py
from regex_to_use import CustomRegularExpressions
# find_word would be replaced with item['element']
# this is a test would be replaced with response.text
find_word = CustomRegularExpressions.search_text('regex_one', 'this is a test')
print(find_word)
# output
<re.Match object; span=(10, 14), match='test'>
If you add a docstring to the function search_text(), callers can also see which regular expressions the Python dictionary provides.
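For instance, a small sketch of what that docstring might look like (the keys mirror the example dictionary above; help() just prints the docstring):

# regex_to_use.py (excerpt)
class CustomRegularExpressions:
    @staticmethod
    def search_text(regex_to_use, text_to_search):
        """
        Search text_to_search with a named expression.

        :param regex_to_use: "regex_one" -> 'test', "regex_two" -> 'test_2'
        :param text_to_search: string to run the expression against
        """
        ...

# spider.py
from regex_to_use import CustomRegularExpressions
help(CustomRegularExpressions.search_text)  # prints the available keys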
Showing how all this works...
This is a Python project that I wrote and published. Take a look at the folder utilities: in this folder I have functions that I can use throughout my code without having to copy and paste the same code over and over.
There is a lot of common data that you typically use across multiple spiders, like regexes or even XPath expressions.
It's a good idea to isolate them.
You can use something like this:
/project
    /site_data
        handle_responses.py
        ...
    /spiders
        your_spider.py
        ...
Isolate functionalities with a common purpose.
# handle_responses.py
# imports ...
from re import search

def get_specific_commom_data(text: str):
    # probably a good idea to handle predictable errors here (`try`/`except`)
    return search('your_regex', text)
And just use it wherever that functionality is needed.
# your_spider.py
# imports ...
import scrapy
from site_data.handle_responses import get_specific_commom_data

class YourSpider(scrapy.Spider):
    # ... previous code

    def your_method(self, response):
        # ... previous code
        item['element'] = get_specific_commom_data(response.text)
Try to keep it simple and do what you need to solve your problem.
I could copy the regex into multiple spiders instead of importing an object from other .py files; I understand these approaches have their use cases, but here I don't want to add anything to any of the spiders and still want the element in the result.
There are some good answers to this, but they don't really solve the problem, so after searching for days I have come to this solution. I hope it's useful for others looking for a similar answer.
# middlewares.py
import re
from yourproject.items import YourItem

# find the function and add your element
def process_spider_output(self, response, result, spider):
    item = YourItem()
    item['element'] = re.findall('my_regex', response.text)
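For context, a hedged sketch of how that method might sit inside the middleware class that settings.py references below (the class name matches the settings entry; the pass-through of result is the usual spider-middleware boilerplate, assumed here rather than taken from the answer):

# middlewares.py (sketch)
import re
from yourproject.items import YourItem

class YoursprojectMiddleware:
    def process_spider_output(self, response, result, spider):
        # emit one extra item carrying the shared regex result...
        item = YourItem()
        item['element'] = re.findall('my_regex', response.text)
        yield item
        # ...and pass through everything the spider itself returned
        for i in result:
            yield i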
Now uncomment the middleware in settings.py:
# settings.py
SPIDER_MIDDLEWARES = {
    'yourproject.middlewares.YoursprojectMiddleware': 543,
}
For each spider you will get the element in the result data. I am still searching for a better solution and will update the answer, because this one slows the spider down.

how to run one function alone in parallel that depends on another big function

I have a requirement like below
from multiprocessing import Pool
import pandas as pd
import time

def test():
    print("Parent")

def opt_by_region(a, b, c, d):
    print("inside process")
    time.sleep(1)
    return b

def opt():
    pool = Pool(processes=4)
    df = pd.DataFrame([1, 2])
    res = [pool.apply_async(fun, args=(r, df, 3, 4)) for r in range(5)]
    pool.close()
    pool.join()
This is the sample structure of the code I am working on. Here I need to run "opt_by_region" in parallel for each region, but the region and the other variables come from the function "opt" (which is not running in parallel).
So how can I solve this? How can I make "opt_by_region" wait to be triggered with all the variables from the function "opt"? Could anyone please suggest ideas, it would be appreciated.
First, note that pool.apply_async schedules a single call and returns an AsyncResult handle; the list comprehension gives you five handles, but you never call .get() on any of them, so the return values are lost. If you just want to map one function over a list of argument tuples, pool.starmap does that in a single call and hands back the results directly.
Next, I think you have a typo, and should be supplying the function you actually want to run in parallel (fun <-> opt_by_region):
args = [(r, df, 3, 4) for r in range(5)]
res = pool.starmap(opt_by_region, args)
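If you prefer to keep the apply_async comprehension from the question, a minimal sketch of opt (same opt_by_region, df and pool as above; the only assumption is that you want the return values collected into a list):

def opt():
    pool = Pool(processes=4)
    df = pd.DataFrame([1, 2])
    # schedule one asynchronous call per region...
    handles = [pool.apply_async(opt_by_region, args=(r, df, 3, 4)) for r in range(5)]
    # ...then block until each one has finished and gather the return values
    results = [h.get() for h in handles]
    pool.close()
    pool.join()
    return results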

Threading/Async in Requests-html

I have a large number of links I need to scrape from a website: ~70 base links, and from those starting 70 over 700 more links that need to be scraped. So, in order to speed up this process, which takes about 2-3 hours without threading/async, I decided to try threading/async.
My problem is that I need to render some javascript in order to get the links in the first place. I have been using requests-html to do this, as its html.render() method is very reliable. However, when I try to run this using threading or async I run into a host of problems. I tried AsyncHTMLSession due to this Github PR but have been unable to get it to work. I was wondering if anyone had any ideas or links they could point me to that might help.
Here is some example code:
from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession

links = (tuple of links)
n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    session = AsyncHTMLSession()
    results = []
    for l in batch_link:
        print(l)
        r = session.get(l)
        r.html.arender()
        tmp_next = r.html.xpath('//a[contains(@href, "/matches/")]')
    return tmp_next

pool = ThreadPool(processes=2)
output = pool.map(link_processor, batch)
pool.close()
pool.join()
print(output)
Output:
RuntimeError: There is no current event loop in thread 'Thread-1'.
I was able to fix this with some help from the learnpython subreddit. It turns out requests-html probably uses threads in some way, so threading on top of its threads causes an issue, and simply using a multiprocessing pool works.
FIXED CODE:
from multiprocessing import Pool
from requests_html import HTMLSession
.....
pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)
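For reference, a hedged sketch of what the reworked link_processor might look like with the synchronous HTMLSession (this is my guess at the elided part above: the batching, XPath and names are carried over from the question, and I return the href strings so the results pickle cleanly back to the parent process):

from multiprocessing import Pool
from requests_html import HTMLSession

links = ('https://example.com/a', 'https://example.com/b')  # placeholder tuple of links
n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]

def link_processor(batch_link):
    session = HTMLSession()
    results = []
    for l in batch_link:
        r = session.get(l)
        r.html.render()  # synchronous render, no event-loop juggling inside the worker
        results.extend(r.html.xpath('//a[contains(@href, "/matches/")]/@href'))
    return results

if __name__ == '__main__':
    pool = Pool(processes=3)
    output = pool.map(link_processor, batch)
    pool.close()
    pool.join()
    print(output)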

Iterate through list while parsing

I am trying to download the worksheets for this workout; all the workouts are split across different days. All that needs to be done is to add a new number at the end of the link. Here is my code.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os

theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = []
count = 1
while count < 29:
    urls.append(theurl + str(count))
    count += 1
print(urls)

for url in urls:
    thepage = urllib
    thepage = urllib.request.urlopen(urls)
    soup = BeautifulSoup(thepage, "html.parser")
    init_data = open('/Users/paribaker/Desktop/scrapping/workout/4weekdata.txt', 'a')
    workout = []
    for data_all in soup.findAll('div', {'class': "b-workout-program-day-exercises"}):
        try:
            for item in data_all.findAll('div', {'class': "b-workout-part--item"}):
                for desc in item.findAll('div', {'class': "b-workout-part--description"}):
                    workout.append(desc.find('h4', {'class': "b-workout-part--exercise-count"}).text.strip("\n") + ",\t")
                    workout.append(desc.find('strong', {'class': "b-workout-part--promo-title"}).text + ",\t")
                    workout.append(desc.find('span', {'class': "b-workout-part--equipment"}).text + ",\t")
                for instr in item.findAll('div', {'class': "b-workout-part--instructions"}):
                    workout.append(instr.find('div', {'class': "b-workout-part--instructions--item workouts-sets"}).text.strip("\n") + ",\t")
                    workout.append(instr.find('div', {'class': "b-workout-part--instructions--item workouts-reps"}).text.strip("\n") + ",\t")
                    workout.append(instr.find('div', {'class': "b-workout-part--instructions--item workouts-rest"}).text.strip("\n"))
                    workout.append("\n*3")
        except AttributeError:
            pass
    init_data.write("".join(map(lambda x: str(x), workout)))
    init_data.close
The problem is that the server times out. I'm assuming it's not iterating through the list properly, or it's adding characters I do not need and crashing the server parser.
I have also tried to write another script that grabs all the links and puts them in a text document, then reopens the text file in this script and iterates through it, but that also gave me the same error. What are your thoughts?
There's a typo here:
thepage = urllib.request.urlopen(urls)
You probably wanted:
thepage = urllib.request.urlopen(url)
Otherwise you are trying to open an array of urls rather than a single one.
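In context, the loop body with that one-character fix applied (just an excerpt of the question's code):

for url in urls:
    thepage = urllib.request.urlopen(url)  # open one url per iteration
    soup = BeautifulSoup(thepage, "html.parser")
    # ... rest of the parsing as before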

Multithreading in Python/BeautifulSoup scraping doesn't speed up at all

I have a csv file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the urls in the csv, scrapes the information, and records/saves it in another csv file ("Output.csv"). However, since I am planning to do this for a large portion of the site (for >10,000,000 pages), speed is important. For each link, it takes about 1s to crawl and save the info into the csv, which is too slow for the magnitude of the project. So I have incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1s per link. Did I do something wrong? Is there another way to speed up the processing?
Without multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()
            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")
            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)
            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")
With multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()
            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")
            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)
            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"
if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).
Here's an example that uses a ThreadPool to parallelize the network operation and parsing. It's not safe to try to write to the csv file across many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.
import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()
    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")
    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)
    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
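To make that swap concrete (nothing else in the script needs to change; the pool interface is the same either way):

# I/O-bound work (network requests): a pool of threads.
from multiprocessing.dummy import Pool

# CPU-bound work (heavy parsing): comment the line above and use processes instead.
# from multiprocessing import Pool

pool = Pool(4)  # same constructor and .map()/.imap() API in both cases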
Edit:
If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:
imap(func, iterable[, chunksize])
A lazier version of map().
The chunksize argument is the same as the one used by the map()
method. For very long iterables using a large value for chunksize can
make the job complete much faster than using the default value of 1.
if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // NUM_WORKERS * 4  # Try to get a good chunksize. You're probably going to have to tweak this, though. Try smaller and larger values and see how performance changes.
    pool = Pool(NUM_WORKERS)
    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize=chunksize)
        with open("Output.csv", "ab") as out:
            writeFile = csv.writer(out)
            for result in result_iter:  # lazily iterate over results.
                writeFile.writerow(result)
With imap, we never put all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
