I have a MATLAB processing script in the middle of a long processing pipeline running on Linux.
The MATLAB script applies the same operation to N datasets D_i (i=1,2,...,N) in parallel on 8 cores via parfor.
Usually, processing the whole dataset takes about 2 hours (on 8 cores).
Unfortunately, from time to time, one of the MATLAB worker processes seems to crash at random. This makes the job impossible to complete (and the pipeline can't finish).
I am sure this does not depend on the data: if I reprocess the specific D_i on which the process crashed, it runs without problems. Moreover, I have already processed thousands of these datasets.
How I deal with the problem now (manually):
After I start the MATLAB job, I periodically check the process list on the machine (with a simple top); whenever a MATLAB process is still alive after two hours of work, I know for sure that it has crashed. Then I simply kill it and reprocess the part of the dataset that has not been analyzed.
Question:
I am looking for suggestions on how to time out ALL the running MATLAB processes and kill them whenever they have used more than, e.g., 2 hours of CPU time.
You should be able to do this by restructuring your code to use PARFEVAL instead of PARFOR. There's a simple example in this entry on Loren's blog: http://blogs.mathworks.com/loren/2013/12/09/getting-data-from-a-web-api-in-parallel/ which shows how you can stop waiting for work after a given amount of time.
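If restructuring the MATLAB code isn't an option, you can also automate the manual kill loop from outside MATLAB. Below is a minimal watchdog sketch in Python using the third-party psutil package; the 2-hour CPU-time threshold and the process name to match ("MATLAB" here) are assumptions you would adapt to whatever top shows on your box.

import time
import psutil  # third-party: pip install psutil

CPU_LIMIT_SECONDS = 2 * 60 * 60  # assumed 2-hour CPU-time threshold

def kill_stuck_matlab_workers():
    for proc in psutil.process_iter(['name']):
        try:
            if 'MATLAB' not in (proc.info['name'] or ''):
                continue
            t = proc.cpu_times()
            if t.user + t.system > CPU_LIMIT_SECONDS:
                proc.kill()  # same effect as killing it from top
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # the process exited, or it isn't ours; skip it

while True:
    kill_stuck_matlab_workers()
    time.sleep(60)  # poll once a minute

Run this next to the pipeline; killing a stuck worker still leaves you to reprocess the unfinished D_i, so the PARFEVAL restructuring above is the cleaner fix.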
Related
I have a function that creates some results for a list of tasks. I would like to save the results on the fly, to 1) release memory, compared with appending everything to a results_list, and 2) have the results of the first part in case of errors.
Here is a very short code sample:
for task in task_list:
    result = do_awesome_stuff_to_task(task)
    save_nice_results_to_db(result)  # Send this job to another process and let the main process continue
Is there a way for the main process to create results for each task in task_list and, each time a result is created, send it to another process/thread to save it, so the main loop can continue without waiting for the slow saving process?
I have looked at multiprocessing, but that seems mostly aimed at speeding up the loop over task_list rather than letting a secondary subprocess do other parts of the work. I have also looked into asyncio, but that seems mostly meant for I/O.
All in all, I am looking for a way to have a main process looping over task_list. For each finished task I would like to send the result to another subprocess to save it. Note that do_awesome_stuff_to_task is much faster than the saving process, so the main loop will have gone through multiple tasks before the first result is saved. I have thought of two ways of tackling this:
Use multiple subprocesses to save
Save every xx iterations - the saving scales okay, so perhaps the save process can save xx iterations at a time while the main loop continues?
Is this possible to do in Python? Where should I look, and what key considerations should I take into account?
All help is appreciated.
It's hard to know what will be faster in your case without testing, but here are some thoughts on how to choose what to do.
If save_nice_results_to_db is slow because it's writing data to disk or network, make sure you aren't already at the maximum write speed of your hardware. Depending on the server at the other end, network traffic can sometimes benefit greatly from opening multiple connections at once to read/write, so long as you stay within the total transfer speed of your machine's network interface as well as your ISP. SSDs can see some limited benefit from initiating multiple reads/writes at once, but too many will hurt performance. HDDs are almost universally slower when trying to do more than one thing at once. Everything is more efficient reading/writing larger chunks at a time.
multiprocessing must typically transfer data between the parent and child processes using pickle, because they don't share memory. This has a high overhead, so if result is a large object, you may waste more time on the added overhead of sending the data to a child process than you could save by any sort of concurrency (emphasis on may; always test for yourself). As of Python 3.8 the shared_memory module was added, which may be somewhat more efficient, but it is much less flexible and easy to use.
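If you do go the multiprocessing route, the usual shape is a single dedicated saver process fed through a queue, so the main loop never waits on the database. Here is a minimal sketch under that assumption; do_awesome_stuff_to_task, save_nice_results_to_db and task_list are the hypothetical names from the question.

import multiprocessing as mp

def saver(queue):
    # Drain results until the sentinel None arrives.
    while True:
        result = queue.get()
        if result is None:
            break
        save_nice_results_to_db(result)

if __name__ == '__main__':
    queue = mp.Queue()
    worker = mp.Process(target=saver, args=(queue,))
    worker.start()
    for task in task_list:
        result = do_awesome_stuff_to_task(task)
        queue.put(result)  # pickled and shipped to the saver process
    queue.put(None)        # tell the saver to finish
    worker.join()

Every queue.put pays the pickling cost described above, which is exactly why this only wins when result is cheap to serialize relative to the save time.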
threading benefits from all threads sharing memory, so there is zero transfer overhead to "send" data between threads. Python threads, however, cannot execute bytecode concurrently due to the GIL (global interpreter lock), so multiple CPU cores cannot be leveraged to increase computation speed. This is because Python itself has many parts which are not thread-safe. Specific functions written in C may release this lock to get around the issue and leverage multiple CPU cores, but once execution returns to the Python interpreter, that lock is held again. Typically, functions involving network access or file IO can release the GIL, as the interpreter is waiting on an operating system call which is usually thread-safe. Other popular libraries like NumPy also make an effort to release the GIL while doing complex math operations on large arrays. You can only release the GIL from C/C++ code, however, and not from Python itself.
asyncio deserves a special mention here, as it's designed specifically with concurrent network/file operations in mind. It uses coroutines instead of threads (coroutines have even lower overhead than threads, which are themselves much lower overhead than processes) to queue up a bunch of operations, then uses an operating system call to wait for any of them to finish (the event loop). Using this would also require do_awesome_stuff_to_task to run in a coroutine for it to overlap with save_nice_results_to_db.
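For completeness, here is roughly what that looks like if you keep the existing blocking functions and let asyncio push the saves onto worker threads; this sketch assumes Python 3.9+ for asyncio.to_thread, and the function names are again the hypothetical ones from the question.

import asyncio

async def main():
    pending = []
    for task in task_list:
        result = do_awesome_stuff_to_task(task)
        # asyncio.to_thread runs the blocking save in a worker thread.
        pending.append(asyncio.create_task(
            asyncio.to_thread(save_nice_results_to_db, result)))
        await asyncio.sleep(0)  # yield so the queued saves actually start
    await asyncio.gather(*pending)  # wait for the remaining saves

asyncio.run(main())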
A trivial example of firing each result off to a thread to be processed:
import threading

for task in task_list:
    result = do_awesome_stuff_to_task(task)
    # Fire the save off to a background thread and let the main loop continue.
    threading.Thread(target=save_nice_results_to_db, args=(result,)).start()
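Note that this spawns one thread per result; since do_awesome_stuff_to_task outpaces the saves, threads (and the memory held by their result objects) can pile up without bound. A single long-lived worker thread draining a queue.Queue - the same shape as the multiprocessing sketch above, with threading.Thread and queue.Queue swapped in - keeps that bounded, and giving the queue a maxsize makes put() block, which applies backpressure to the main loop.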
I have an embedded system with multiple user processes that run simultaneously; since they are interdependent, they communicate via POSIX message queues. The issue is that one of the processes takes a bit more time to complete a task (I don't know which process, or which section of code), and because of this the other processes are delayed in completing their tasks.
How can I figure out which process is taking more time, and in which section of code? The system is a measuring device, so it cannot tolerate any delay or spikes in processing time. I tried changing the data rate of the entire system, but that does not help: the spikes still appear.
Is there any way in Linux to trigger a system call when a process is scheduled in the same section of code and its scheduling duration exceeds a certain threshold?
Suppose I have a multi-core laptop.
I write some code in Python and run it;
then, while my Python code is running, I open MATLAB and run some other code.
What is going on underneath? Will these two processes be run in parallel on multiple cores automatically?
Or does the computer wait for one to finish and then process the other?
Thank you!
P.S. The two programs I am referring to can be considered the simplest in nature, e.g. calculating 1+2+3+...+10000000.
The answer is... it depends!
Your operating system is constantly switching which processes are running. There are tons of processes always running in the background - refreshing the screen, posting sound to the speakers, checking for updates, polling the mouse, etc. - and those processes can only actually execute if they get some amount of processor time. If you have many cores, the OS will use some sort of heuristics to figure out which processes should get some time on the cores. You have the illusion that everything is running at the same time because (1) in some sense, things are running at the same time because you have multiple cores, and (2) the switching happens so fast that you can't notice it happen.
The reason I'm bringing this up is that if you run both Python and MATLAB at the same time, while in principle they could easily run at the same time, it's not guaranteed that that happens because you may have a ton of other things going on as well. It might be that both Python and MATLAB run for a bit concurrently, then both temporarily get paused to allow some program that's playing music to load the next sound clip to be played into memory, then one pauses while the OS pages in some memory from disk and another takes over, etc.
Can you assume that they'll run in parallel? Sure! Most reasonable OSes will figure that out and do it correctly. Can you assume that they alone are running in parallel and nothing else is? Not necessarily.
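If you'd rather see it than take it on faith, time a CPU-bound job run once versus two copies run in separate processes; on a multi-core machine that isn't busy, the two-copy wall time stays close to the single-copy time. A quick sketch, using the 1+2+...+10000000 workload from the question:

import time
from multiprocessing import Process

def work():
    # The simple workload from the question: 1 + 2 + ... + 10000000.
    sum(range(1, 10_000_001))

if __name__ == '__main__':
    start = time.perf_counter()
    work()
    print('one copy :', time.perf_counter() - start)

    start = time.perf_counter()
    procs = [Process(target=work) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('two copies:', time.perf_counter() - start)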
On Linux, is it possible to record the running processes (just which process is running when) for some period of time? It would be like getting a log from top. The reason I want to do this is that I have performance issues with my process, and the box I am working on does not provide any facility to analyze which processes are running and when. More specifically, my process has a response time of anywhere between 3.2s and 3.9s, but this is not random: there are periods of time at 3.9s and periods at 3.2s. Having ruled out blocking I/O, we want to see if some other process is running during the 3.9s periods.
I am coding a 5-state process model (new, ready, running, blocked, exit), and for this I created a LinkedList which contains the processes ready to run. For example, if I have processes 1, 2, 3, 4, 5, it runs the 1st, then the 2nd, and while the 3rd is running the user pushes a button and blocks that process for 5 seconds. In the meantime the following process (the 4th) runs (it doesn't wait until the 3rd process is unblocked). The problem I have is that I don't know whether I should use two threads for this - one for the processes that are running and another for the blocked process - or whether it is possible to use only one thread?
You could use only a single thread if you use cooperative multitasking, where your process code periodically yields to permit other processes to run, or if you want each task to run to completion or until it blocks before letting another process in or back in.
If it's important that the 3rd process resume after exactly 5 seconds, and if it's okay for it to run in parallel with another process, you might want to use two - or more - threads.
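Here is a minimal sketch of the two-thread option in Python (the process numbers and the 5-second block come from your example; run_process is a hypothetical stand-in for actually executing a process). The main thread plays the scheduler, and a threading.Timer supplies the second thread that wakes the blocked process:

import threading
import time
from collections import deque

ready = deque([1, 2, 3, 4, 5])  # the ready queue from the question
blocked = set()                 # pids currently blocked
blocked_once = set()            # pids that already had their 5-second block
lock = threading.Lock()

def run_process(pid):
    print('running process', pid)
    time.sleep(1)               # stand-in for the real work

def unblock(pid):
    # Runs on the timer thread after 5 seconds: back to the ready queue.
    with lock:
        blocked.discard(pid)
        ready.append(pid)

while True:
    with lock:
        if not ready and not blocked:
            break               # nothing left to run or to wake up
        pid = ready.popleft() if ready else None
    if pid is None:
        time.sleep(0.1)         # idle until a blocked process wakes up
        continue
    if pid == 3 and pid not in blocked_once:
        # The button press from the example: block process 3 for 5 seconds.
        blocked_once.add(pid)
        with lock:
            blocked.add(pid)
        threading.Timer(5, unblock, args=(pid,)).start()
        continue                # the 4th process runs in the meantime
    run_process(pid)

With cooperative multitasking you would instead fold the timer check into the single scheduling loop and just skip process 3 until the 5 seconds have elapsed.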