Python multiprocessing manager showing error when used in flask API - python-3.x

I am pretty confused about the best way to do what I am trying to do.
What do I want?
An API call reaches the Flask application.
The Flask route starts 4-5 processes with the Process class and combines their results (each computed on a slice of a pandas DataFrame) through a shared Manager().list().
The computed results are returned to the client.
My implementation:
pos_iter_list = get_chunking_iter_list(len(position_records), 10000)
manager = Manager()
data_dict = manager.list()
processes = []
for i in range(len(pos_iter_list) - 1):
    temp_list = data_dict[pos_iter_list[i]:pos_iter_list[i + 1]]
    p = Process(
        target=transpose_dataset,
        args=(temp_list, name_space, align_namespace, measure_master_id, df_searchable, products,
              channels, all_cols, potential_col, adoption_col, final_segment, col_map, product_segments,
              data_dict)
    )
    p.start()
    processes.append(p)
for p in processes:
    p.join()
My directory structure:
- main.py(flask entry point)
- helper.py(contains function where above code is executed & calls transpose_dataset function)
The error I get when running it:
RuntimeError: No root path can be found for the provided module "__mp_main__". This can happen because the module came from an import hook that does not provide file name information or because it's a namespace package. In this case the root path needs to be explicitly provided.
Not sure what went wrong here; the manager list works fine when called from a sample.py file under if __name__ == '__main__':
Update: The same piece of code works fine on my MacBook but not on Windows.
A sample flask API call:
@app.route(PREFIX + "ping", methods=['GET'])
def ping():
    man = mp.Manager()
    data = man.list()
    processes = []
    for i in range(0,5):
        pr = mp.Process(target=test_func, args=(data, i))
        pr.start()
        processes.append(pr)
    for pr in processes:
        pr.join()
    return json.dumps(list(data))

Stack has an ongoing bug preventing me from commenting, so I'll just write up an answer..
Python has 2 (main) ways to start a new process: "spawn" and "fork". Fork is a system call only available on *nix (read: Linux or macOS), so spawn is the only option on Windows. Since Python 3.8 spawn is also the default on macOS, but fork is still available there. The big difference is that fork basically makes a copy of the existing process, while spawn starts a whole new process (like just opening a new cmd window). There's a lot of nuance to why and how, but in order to run the function you want the child process to run under spawn, the child has to import the main file.
Importing a file is tantamount to executing that file and then typically binding its namespace to a variable: import flask will run the flask/__init__.py file and bind its global namespace to the variable flask. There's often code, however, that is only used by the main process and doesn't need to be imported / executed in the child process. In some cases running that code again actually breaks things, so you need to prevent it from running outside of the main process. This is what the "magic" variable __name__ is for: it is only equal to "__main__" in the main file (and not in child processes or when importing modules).
In your specific case, you're creating a new app = Flask(__name__), which does some amount of validation and checks before you ever run the server. In the child process the main file is imported under a different name ("__mp_main__" rather than "__main__"), which is the module named in your error message, and it's one of these setup/validation steps that trips over it. Fixing it by not letting that code run at all is imo the cleaner solution, but you can also fix it by giving Flask a value that it won't trip over, then just never starting that secondary server (again by protecting it with if __name__ == "__main__":).
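To see the renaming described above for yourself, a minimal sketch like the following may help. It does not use Flask; report_name is a hypothetical helper that records the value of __name__ seen by whichever process runs it, and the shared list mirrors the Manager().list() from the question.

```python
import multiprocessing as mp


def report_name(shared):
    # Record the module name that this function sees in the process running it.
    shared.append(__name__)


if __name__ == "__main__":
    # Everything in this block runs only in the parent process.
    print("start method:", mp.get_start_method())
    print("parent sees __name__ =", __name__)

    with mp.Manager() as manager:
        seen = manager.list()
        child = mp.Process(target=report_name, args=(seen,))
        child.start()
        child.join()
        # Under spawn the child re-imports this file under another name;
        # under fork it is a copy of the parent and keeps the parent's name.
        print("child saw __name__ =", list(seen))
```

Anything placed at module level in the main file (such as app = Flask(__name__)) is executed again in a spawned child with that other name, whereas code inside the guard is not.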

Related

How to force os.stat re-read file stats by same path

I have code that is architecturally close to what is posted below (unfortunately I can't post the full version because it's proprietary). I have a self-updating executable and I'm trying to test this feature. Assume that the full path to this file is in A.some_path after the input call. My problem is that the assertion fails, because on the second call os.stat still returns the previous file stats (I suppose it thinks nothing could have changed, so re-reading is unnecessary). When I launch the update manually, self-updating works completely fine: the file really is removed and recreated, and its stats change. Is there any guaranteed way to force os.stat to re-read the file stats for the same path, or an alternative that makes this work (other than recreating the A object)?
from pathlib import Path
import unittest
import os
class A:
    some_path = Path()
    def __init__(self, _some_path):
        self.some_path = Path(_some_path)
    def get_path(self):
        return self.some_path
class TestKit(unittest.TestCase):
    def setUp(self):
        pass
    def check_body(self, a):
        some_path = a.get_path()
        modification_time = os.stat(some_path).st_mtime
        # Launching self-updating executable
        self.assertTrue(modification_time < os.stat(some_path).st_mtime)
    def check(self):
        a = A(input('Enter the file path\n'))
        self.check_body(a)
def Tests():
    suite = unittest.TestSuite()
    suite.addTest(TestKit('check'))
    return suite
def main():
    tests_suite = Tests()
    unittest.TextTestRunner().run(tests_suite)
if __name__ == "__main__":
    main()
I have found the origin of the problem: I launched the self-update via os.system, which waits until the process is done. But first, during self-updating we launch several detached processes and should actually wait until all of them have ended; and second, even the signal that a process has ended doesn't mean the OS has completely released the file, and it looks like at assertTrue we are not yet done with all our routines. For my task I simply used sleep, but a proper solution should inspect the existing processes in the system and wait for them to finish, or at least make several attempts with a wait between them.
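The "several attempts with a wait between them" idea could be sketched as a small polling helper. This is only an illustration: wait_for_newer_mtime, its timeout and its interval are hypothetical names and values, not part of the original test code.

```python
import os
import time


def wait_for_newer_mtime(path, old_mtime, timeout=30.0, interval=0.5):
    """Poll os.stat until st_mtime exceeds old_mtime and return the new value.

    Raises TimeoutError if the file has not changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            new_mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            # The updater may have removed the file and not recreated it yet.
            new_mtime = None
        if new_mtime is not None and new_mtime > old_mtime:
            return new_mtime
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} was not updated within {timeout} s")
        time.sleep(interval)
```

In check_body, the second os.stat call would then be replaced by a call to this helper with the modification time recorded before the update was launched, so the assertion only fails once the timeout has really elapsed.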

I can implement Python multiprocessing with Spyder Windows PC, but why?

I'm curious about this and need some advice on how this can happen. Yesterday I tried to use multiprocessing in a Python script running in Spyder on a Windows PC. Here is the code I first tried.
import multiprocessing
import time
start = time.perf_counter()
def do_something():
    print('Sleeping 1 second...')
    time.sleep(1)
    print('Done sleeping')
p1 = multiprocessing.Process(target=do_something)
p2 = multiprocessing.Process(target=do_something)
p1.start()
p2.start()
p1.join()
p2.join()
finish = time.perf_counter()
print(f'Finished in {round(finish-start,2)} second(s)')
It returns an error.
AttributeError: Can't get attribute 'do_something' on <module '__main__' (built-in)>
Then my boss and I searched for a way out of this problem and found this suggestion:
Python's multiprocessing doesn't work in Spyder IDE
So I followed it, installed PyCharm, and ran the code there. It seemed to work in that I didn't get the AttributeError, but I got this new error instead:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
I googled again and finally found this:
RuntimeError on windows trying python multiprocessing
What I have to do is add this one line
if __name__ == '__main__':
before starting multiprocessing.
import multiprocessing
import time
start = time.perf_counter()
def do_something():
    print('Sleeping 1 second...')
    time.sleep(1)
    print('Done sleeping')
if __name__ == '__main__':
    p1 = multiprocessing.Process(target=do_something)
    p2 = multiprocessing.Process(target=do_something)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
And it works now. Moreover, it works not only in PyCharm: I can now run this code in Spyder too. That is why I am so curious: how come Spyder works as well? The behaviour is consistent, because I also ran this code on my other PC, which runs Windows Server 2016 with Spyder, and it does the same thing there.
Can anyone help explain what happens here and why it works?
Thank you.
There's a lot to unpack here, so I'll just give a brief overview. There's also some missing information like how you have spyder/pycharm configured, and what operating system you use, so I'll have to make some assumptions...
Based on the error messages you are probably using MacOS or Windows which means the default way python creates a child process is called spawn. This means it will start a completely new process from the python executable ("python.exe" on windows for example). It will then send a message to the new process telling it what function to execute (target), and optionally what arguments to call that function with. The new process will have to import the main file to have access to that function however, so if you are running the python interpreter in interactive mode, there is no "main" file to import, and you get the first error message: AttributeError.
The second error is also related to the importing of the "main" file. When you import a file, it basically just runs the file like any other python script. If you were to create a new child process during import that child would then also create a new child when it imports the same file. You would end up recursively creating infinite child processes until the computer crashed, so python disallows creating additional child processes during the import phase of a child process hence the RuntimeError.
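The "message" mentioned above is a pickle, and functions are pickled by reference rather than by value. A small sketch, using json.dumps from the standard library purely as a stand-in for a target function, shows that only the module name and function name travel to the child:

```python
import json
import pickle

# Pickle a module-level function, as multiprocessing does with `target`
# when it has to send work to a spawned child.
payload = pickle.dumps(json.dumps)

# The payload holds the module and function names, not the function's code.
print(payload)

# Unpickling imports the module and looks the name up again, which is why
# the child must be able to import whatever module defines the target.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

When the target lives in an interactive session, there is no file behind __main__ for the child to import, so the lookup of do_something fails in the child.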

file watcher in python 3.5 using library watchgod

Hi everyone, I am trying to build a file watcher in Python 3.5 using watchgod. I want to continuously watch a directory and, whenever files are added, send a list of the added files to another program that performs a series of tasks. The following is my Python code:
print("execution of main file begins !!!!")
import os
from watchgod import watch
#changes gives a set object when watch finds any kind of changes in directory
for changes in watch(r'C:\Users\Rajat.Malik\Desktop\Requests'):
    fileStatus = [obj[0] for obj in list(changes) ] #converting set to list which gives file status as added, changed or modified
    fileLocation = [obj[1] for obj in list(changes) ] #similarly getting list of location of files added
    var2 = 0
    for var1 in fileLocation:
        if fileStatus[var2] == 1: #if file is added then passing all files to another code which will work on the list of files added
            os.system('python split_thread_module.py '+var1) #now this code will start executing
        var2 = var2 + 1
The problem I am having is that while split_thread_module.py is executing, the watcher is not watching the directory. Any file that arrives while split_thread_module.py is executing is not reflected in changes. How can I watch for changes in the directory and pass them to the other program on the fly, even while the other program is executing? I am not a Python programmer. Can anyone help me in this regard?
Thanks in advance !!!!
Sorry for the delayed reply; I'm the developer of watchgod. I've added a python-watchgod tag to your question which I'll watch (no pun intended) in future so I can answer such questions more quickly.
To answer your question, watchgod will not miss changes that occur in the filesystem while other code is running. They'll just be reported as changes the next time watch iterates.
More generally, the best approach would be to run the other code asynchronously so the main process can get back to watching the directory.
A few other hints for neater Python:
- no need to call list(changes) in the comprehension
- os.system is discouraged in favour of the subprocess module; better to use subprocess.run
- since split_thread_module.py is also Python, do you really need to run it in a separate process? Even if you do, you might have more luck with Python multiprocessing than with starting a new process through the system's process initiation.
Overall you might try something like:
from concurrent.futures import ProcessPoolExecutor
from time import sleep
from watchgod import watch
def slow_job(status, location):
    print(f'status: {status}, location: {location}, starting...')
    sleep(10)
    print(f'status: {status}, location: {location}, done')
# the guard stops worker processes from re-running the watch loop on import (required on Windows)
if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        for changes in watch('./tests'):
            for status, location in changes:
                executor.submit(slow_job, status, location)
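If the job does have to stay a separate script, the subprocess.run hint above might look like the sketch below. run_script is a hypothetical helper name; the script name and file path would come from the watcher loop in the question.

```python
import subprocess
import sys


def run_script(script, *args):
    """Run a Python script in a child interpreter and wait for it to finish."""
    # Passing the arguments as a list avoids the quoting problems that
    # string concatenation has with paths containing spaces.
    command = [sys.executable, script, *args]
    return subprocess.run(command, check=False)


# Usage inside the watcher loop (var1 is the added file's location):
#     result = run_script('split_thread_module.py', var1)
#     if result.returncode != 0:
#         print('job failed for', var1)
```

Note that subprocess.run still blocks until the script exits, so on its own it tidies up the call rather than making the watcher responsive; it could be submitted to an executor in the same way as slow_job.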

Python Multiprocessing How can I make script faster?

Python 3.6
I am writing a script to automate checking that all the links on a website work.
I have a version of it, but it runs slowly because the Python interpreter only runs one request at a time.
I used selenium to pull the links into a list. I started with a list of 41000 links; after removing the duplicates I am down to 7300. I am using the requests module just to check the response code. I know multiprocessing is the answer, but I see a bunch of different methods. Which is best for my needs?
The only thing I need to keep in mind is that I can't run too many threads at once, so that I don't send the thread count on our web server sky high with requests.
Here is the function that checks the links with the python requests module that I am trying to speed up:
def check_links(y):
    for ii in y:
        try:
            r = requests.get(ii.get_attribute('href'))
            rc = r.status_code
            print(ii.get_attribute('href'), ' ', rc)
        except Exception as e:
            logf.write(str(e))
        finally:
            pass
If you just need to apply the same function to all the items in a list, you just need to use a process pool and map over your inputs. Here is a simple example:
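from multiprocessing import Pool
def square(x):
    return {x: x**2}
# the guard is needed wherever child processes are spawned (e.g. on Windows)
if __name__ == '__main__':
    with Pool() as p:
        results = p.imap_unordered(square, range(10))
        for r in results:
            print(r)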
In the example I use imap_unordered, but also look at map and imap. You should choose the one that matches your needs the best.
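Applied to the link checker, the same pattern might look like the sketch below. Two things here are assumptions rather than part of the answer: it uses urllib from the standard library instead of requests so that it has no third-party dependency, and it expects plain URL strings, since anything sent to a pool has to be picklable and Selenium elements are tied to a live browser session. The processes argument caps how many requests run at once; 4 is an arbitrary value.

```python
import sys
from multiprocessing import Pool
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def check_link(url):
    """Return (url, HTTP status code) or (url, error text) for one URL."""
    try:
        with urlopen(url, timeout=10) as response:
            return url, response.status
    except HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return url, exc.code
    except (URLError, ValueError, OSError) as exc:
        # Malformed URL, DNS failure, refused connection, timeout, ...
        return url, str(exc)


def check_all(urls, workers=4):
    """Check every URL using at most `workers` processes at a time."""
    with Pool(processes=workers) as pool:
        return dict(pool.imap_unordered(check_link, urls))


if __name__ == "__main__":
    # URLs are taken from the command line: python check.py URL [URL ...]
    for url, outcome in check_all(sys.argv[1:]).items():
        print(url, outcome)
```

With Selenium, the list of strings could be built first with something like [e.get_attribute('href') for e in elements] and then handed to check_all.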

restart python (or reload modules) in py.test tests

I have a (python3) package that behaves completely differently depending on how it's init()ed (perhaps not the best design, but rewriting is not an option). The module can only be init()ed once; a second time gives an error. I want to test this package (both behaviours) using py.test.
Note: the nature of the package makes the two behaviours mutually exclusive; there is no possible reason to ever want both in a single program.
I have several test_xxx.py modules in my test directory. Each module will init the package in the way it needs (using fixtures). Since py.test starts the python interpreter once, running all test modules in one py.test run fails.
Monkey-patching the package to allow a second init() is not something I want to do, since there is internal caching etc. that might result in unexplained behaviour.
Is it possible to tell py.test to run each test module in a separate python process (thereby not being influenced by inits in another test module)?
Is there a way to reliably reload a package (including all sub-dependencies, etc)?
Is there another solution (I'm thinking of importing and then unimporting the package in a fixture, but this seems excessive)?
To reload a module, try using reload() from the importlib library.
Example:
from importlib import reload
import some_lib
#do something
reload(some_lib)
Also, launching each test in a new process is viable, but multiprocessed code is kind of painful to debug.
Example
import some_test
from multiprocessing import Manager, Process
#create new return value holder, in this case a list
manager = Manager()
return_value = manager.list()
#create new process
process = Process(target=some_test.some_function, args=(arg, return_value))
#execute process
process.start()
#finish and return process
process.join()
#you can now use your return value as if it were a normal list,
#as long as it was assigned in your subprocess
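Another way to get one interpreter per test module, closer to what the question asks, is to start a separate pytest run for each file. The following is only a sketch of that idea: the helper names are hypothetical, "tests" is an assumed directory name, and it relies on pytest being installed for the interpreter that runs the script.

```python
import subprocess
import sys
from pathlib import Path


def find_test_modules(test_dir):
    """Return the test_*.py files directly inside test_dir, sorted by name."""
    return sorted(Path(test_dir).glob("test_*.py"))


def pytest_command(test_file):
    """Build the command that runs one test module in a fresh interpreter."""
    return [sys.executable, "-m", "pytest", str(test_file)]


def run_isolated(test_dir):
    """Run every test module in its own process; return the failing modules."""
    failed = []
    for test_file in find_test_modules(test_dir):
        if subprocess.run(pytest_command(test_file)).returncode != 0:
            failed.append(test_file)
    return failed


if __name__ == "__main__":
    failures = run_isolated("tests")  # assumed location of the test modules
    print("failing modules:", [str(path) for path in failures])
```

Each module then inits the package in a process of its own, at the cost of paying interpreter start-up once per file and getting one report per module instead of a combined one.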
Delete all of your package's modules, and the test modules that import them, from sys.modules:
import sys
for key in list(sys.modules.keys()):
    if key.startswith("your_package_name") or key.startswith("test"):
        del sys.modules[key]
You can use this as a fixture by defining one in your conftest.py file with the @pytest.fixture decorator.
Once I had similar problem, quite bad design though..
import importlib
import sys
import pytest
@pytest.fixture()
def module_type1():
    mod = importlib.import_module('example')
    mod._init(10)
    yield mod
    del sys.modules['example']
@pytest.fixture()
def module_type2():
    mod = importlib.import_module('example')
    mod._init(20)
    yield mod
    del sys.modules['example']
def test1(module_type1):
    pass
def test2(module_type2):
    pass
The example/__init__.py had something like this
import logging
logger = logging.getLogger(__name__)
def _init(val):
    if 'sample' in globals():
        logger.info(f'example already imported, val{sample}' )
    else:
        globals()['sample'] = val
        logger.info(f'importing example with val : {val}')
output:
importing example with val : 10
importing example with val : 20
No clue as to how complex your package is, but if it's just global variables, then this probably helps.
I have the same problem, and found three solutions:
- reload(some_lib)
- patch the SUT: the imported method is a key and value in the SUT, so you can patch it there. For example, if you use f2 of m2 in m1, you can patch m1.f2 instead of m2.f2
- import the module, and use module.function.
