Python Multiprocessing How can I make script faster? - multithreading

Python 3.6
I am writing a script to automate me checking to make sure all the links on a website for work.
I have a version of it but it runs slow because the python interpreter is only running one request at a time.
I imported selenium to pull the links down in a list. I started with a list of 41000 links. I got rid of the duplicates now I am down to 7300 links in my list. I am using the request module to just check for the response code. I know multiprocessing is the answer just see a bunch of different methods. Which is the best for my needs?
The only thing I need to keep in mind I can't run to many threads at once so I don't send our webserver threads on our server sky high with request.
Here is the function that checks the links with the python requests module that I am trying to speed up:
def check_links(y):
for ii in y:
try:
r = requests.get(ii.get_attribute('href'))
rc = r.status_code
print(ii.get_attribute('href'), ' ', rc)
except Exception as e:
logf.write(str(e))
finally:
pass

If you just need to apply the same function to all the items in a list, you just need to use a process pool, and map over you inputs. Here is a simple example:
from multiprocessing import pool
def square(x):
return {x: x**2}
p = pool.Pool()
results = p.imap_unordered(square, range(10))
for r in results:
print(r)
In the example I use imap_unordered, but also look at map and imap. You should choose the one that matches your needs the best.

Related

Python multiprocessing manager showing error when used in flask API

I am pretty confused about the best way to do what I am trying to do.
What do I want?
API call to the flask application
Flask route starts 4-5 multiprocess using Process module and combine results(on a sliced pandas dataframe) using a shared Managers().list()
Return computed results back to the client.
My implementation:
pos_iter_list = get_chunking_iter_list(len(position_records), 10000)
manager = Manager()
data_dict = manager.list()
processes = []
for i in range(len(pos_iter_list) - 1):
temp_list = data_dict[pos_iter_list[i]:pos_iter_list[i + 1]]
p = Process(
target=transpose_dataset,
args=(temp_list, name_space, align_namespace, measure_master_id, df_searchable, products,
channels, all_cols, potential_col, adoption_col, final_segment, col_map, product_segments,
data_dict)
)
p.start()
processes.append(p)
for p in processes:
p.join()
My directory structure:
- main.py(flask entry point)
- helper.py(contains function where above code is executed & calls transpose_dataset function)
Error that i am getting while running the same?
RuntimeError: No root path can be found for the provided module "mp_main". This can happen because the module came from an import hook that does not provide file name information or because it's a namespace package. In this case the root path needs to be explicitly provided.
Not sure what went wong here, manager list works fine when called from a sample.py file using if __name__ == '__main__':
Update: The same piece of code is working fine on my MacBook and not on windows os.
A sample flask API call:
#app.route(PREFIX + "ping", methods=['GET'])
def ping():
man = mp.Manager()
data = man.list()
processes = []
for i in range(0,5):
pr = mp.Process(target=test_func, args=(data, i))
pr.start()
processes.append(pr)
for pr in processes:
pr.join()
return json.dumps(list(data))
Stack has an ongoing bug preventing me from commenting, so I'll just write up an answer..
Python has 2 (main) ways to start a new process: "spawn", and "fork". Fork is a system command only available in *nix (read: linux or macos), and therefore spawn is the only option in windows. After 3.8 spawn will be the default on MacOS, but fork is still available. The big difference is that fork basically makes a copy of the existing process while spawn starts a whole new process (like just opening a new cmd window). There's a lot of nuance to why and how, but in order to be able to run the function you want the child process to run using spawn, the child has to import the main file. Importing a file is tantamount to just executing that file and then typically binding its namespace to a variable: import flask will run the flask/__ini__.py file, and bind its global namespace to the variable flask. There's often code however that is only used by the main process, and doesn't need to be imported / executed in the child process. In some cases running that code again actually breaks things, so instead you need to prevent it from running outside of the main process. This is taken into account in that the "magic" variable __name__ is only equal to "__main__" in the main file (and not in child processes or when importing modules).
In your specific case, you're creating a new app = Flask(__name__), which does some amount of validation and checks before you ever run the server. It's one of these setup/validation steps that it's tripping over when run from the child process. Fixing it by not letting it run at all is imao the cleaner solution, but you can also fix it by giving it a value that it won't trip over, then just never start that secondary server (again by protecting it with if __name__ == "__main__":)

Python 3.7, Feedparser module cannot parse BBC weather feed

When I parse the example rss link provided by BBC weather it gives only an empty feed, the example link is: "https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123"
Ive tried using the feedparser module in python, I would like to do this in either python or c++ but python seemed easier. Ive also tried rewriting the URL without https:// and with .xml and it still doesn't work.
import feedparser
d = feedparser.parse('https://weather-broker-cdn.api.bbci.co.uk/en/forecast/rss/3day/2643123')
print(d)
Should give a result similar to the RSS feed which is on the link, but it just gets an empty feed
First, I know you got no result - not an error like me. Perhaps you are running a different version. As I mentioned, it yields a result on an older version in Python 2, using a program that has been running solidly every night for about 5 years, but it throws an exception on a freshly installed feedparser 5.2.1 on Python 3.7.4 64 bit.
I'm not entirely sure what is going on, but the function called _gen_georss_coords which is throwing a StopIteration on the first call. I have noted some references to this error due to the implementation of PEP479. It is written as a generator, but for your rss feed it only has to return 1 tuple. Here is the offending function.
def _gen_georss_coords(value, swap=True, dims=2):
# A generator of (lon, lat) pairs from a string of encoded GeoRSS
# coordinates. Converts to floats and swaps order.
latlons = map(float, value.strip().replace(',', ' ').split())
nxt = latlons.__next__
while True:
t = [nxt(), nxt()][::swap and -1 or 1]
if dims == 3:
t.append(nxt())
yield tuple(t)
There is something curious going on, perhaps to do with PEP479 and the fact that there are two separate generators happening in the same function, that is causing StopIteration to bubble up to the calling function. Anyway, I rewrote it is a somewhat more straightforward way.
def _gen_georss_coords(value, swap=True, dims=2):
# A generator of (lon, lat) pairs from a string of encoded GeoRSS
# coordinates. Converts to floats and swaps order.
latlons = list(map(float, value.strip().replace(',', ' ').split()))
for i in range(0, len(latlons), 3):
t = [latlons[i], latlons[i+1]][::swap and -1 or 1]
if dims == 3:
t.append(latlons[i+2])
yield tuple(t)
You can define the above new function in your code, then execute the following to patch it into feedparser
saveit, feedparser._gen_georss_coords = (feedparser._gen_georss_coords, _gen_georss_coords)
Once you're done with it, you can restore feedparser to its previous state with
feedparser._gen_georss_coords, _gen_georss_coords = (saveit, feedparser._gen_georss_coords)
Or if you're confident that this is solid, you can modify feedparser itself. Anyway I did this trick and your rss feed suddenly started working. Perhaps in your case it will also result in some improvement.

Why Can't Jupyter Notebooks Handle Multiprocessing on Windows?

On my windows 10 machine (and seemingly other people's as well), Jupyter Notebook can't seem to handle some basic multiprocessing functions like pool.map(). I can't seem to figure out why this might be, even though a solution has been suggested to call the function to be mapped as a script from another file. My question, though is why does this not work? Is there a better way to do this kind of thing beyond saving in another file?
Note that the solution was suggested in a similar question here. But I'm left wondering why this bug occurs, and whether there is another easier fix. To show what goes wrong, I'm including below a very simple version that hangs on my computer where the same function runs with no problems when the built-in function map is used.
import multiprocessing as mp
# create a grid
iterable = [3, 5, 10]
def add_3(iterable):
a = iterable + 3
return a
# Below runs no problem
results = list(map(add_3, iterable))
print(results)
# multiprocessing attempt (hangs)
def main():
pool = mp.Pool(2)
results = pool.map(add_3, iterable)
return results
if __name__ == "__main__": #Required not to spawn deviant children
results = main()
Edit: I've just tried this in Spyder and I've managed to get it to work. Unsurprisingly running the following didn't work.
if __name__ == "__main__": #Required not to spawn deviant children
results = main()
print(results)
But running it as the following does work because map uses the yield command and isn't evaluated until called which gives the typical problem.
if __name__ == "__main__": #Required not to spawn deviant children
results = main()
print(results)
edit edit:
From what I've read on the issue, it turns out that the issue is largely because of the ipython shell that jupyter uses. I think there might be an issue setting name. Either way using spyder or a different ide solved the problem, as long as you're not still running the multiprocessing function in an iPython shell.
I faced a similar problem like this. I can't use multiprocessing with function on the same script. The solution that works is to put the function on different notebook file and import it use ipynb:
from ipynb.fs.full.script_name import function_name
pool = Pool()
result = pool.map(function_name,iterable_argument)

Scrapy: Running one spider, then using information gathered to run another spider

In the Scrapy docs, the example they give for running multiple spiders is something like this:
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()
However, the problem is that I want to run Spider1, parse the data, and then use the extracted data to run Spider2. If I do something like:
process.crawl(MySpider1)
process.start()
parse_data_from_spider1()
pass_data_to_spider2_class()
process2.crawl(MySpider2)
process2.start()
It gives me the dreaded ReactorNotRestartable error. Could someone guide me on how to do what I'm trying to achieve here?
The code you're using from the docs runs multiple spiders in the same process using the internal API, so that's an issue if you need to wait for the first spider to finish before starting the second.
If this is the entire scope of the issue, my suggestions would be to store the data from the the first spider a place where the second one can consume it (database, csv, jsonlines), and bring that data into the second spider run, either in the spider definition (where name is defined, or if you've got subclasses of scrapy.Spider, maybe in the __init__) or in the start_requests() method.
Then you'll have to run the spiders sequentially, you can see the CrawlerRunner() chaining deferred method in the common practices section of the docs.
configure_logging()
runner = CrawlerRunner()
#defer.inlineCallbacks
def crawl():
yield runner.crawl(MySpider1)
yield runner.crawl(MySpider2)
reactor.stop()
crawl()
reactor.run()

Web-scraping. How make it faster ?

I have to extract some attributes(in my example there is only one: a text description of apps) from webpages. The problem is the time !
Using the following code, indeed, to go on an page, extract one part of HTML and save it, takes about 1.2-1.8 sec per page. A lot of time. Is there a way to make it faster ? I have a lot of pages, x could be also 200000.
I'm using jupiter.
Description=[]
for x in range(len(M)):
response = http.request('GET',M[x] )
soup = BeautifulSoup(response.data,"lxml")
t=str(soup.find("div",attrs={"class":"section__description"}))
Description.append(t)
Thank you
You should consider inspecting the page a bit. If page relies on the Rest API, you could scrape contents that you need by directly getting them from the API. This is much more efficient way than getting the contents from HTML.
To consume it you should check out Requests library for Python.
I would try carving this up into multiple processes per my comment. So you can put your code into a function and use multiprocessing like this
from multiprocessing import Pool
def web_scrape(url):
response = http.request('GET',url )
soup = BeautifulSoup(response.data,"lxml")
t=str(soup.find("div",attrs={"class":"section__description"}))
return t
if __name__ == '__main__':
# M is your list of urls
M=["https:// ... , ... , ... ]
p = Pool(5) # 5 or how many processes you think is appropriate (start with how many cores you have, maybe)
description=p.map(web_scrape, M))
p.close()
p.join()
description=list(description) # if you need it to be a list
What is happening is that your list of urls is getting distributed to multiple processes that run your scrape function. The results all then get consolidated in the end and wind up in description. This should be much faster than if you processed each url one at a time like you are doing currently.
For more details: https://docs.python.org/2/library/multiprocessing.html

Resources