Parallelization slows down execution in MATLAB - multithreading

I am using MATLAB on Mac OS X, running on a Pentium processor with 4 physical cores.
I want to analyse magnetic resonance images (MRI) and fit the signal from these images using optimisation. For every pixel I have 35 values (i.e. the same image acquired 35 times under different conditions), and I want to fit these values to some function.
Below, I have stripped my code down to the very basic loop that calls the fitting function:
ticid1 = tic;
for x = a:1:b
    % Fit the 35 values of pixel (y, x); outputs renamed so they do not
    % overwrite the loop bounds a and b.
    [p1, p2, p3, p4] = FitSignal(Volume(y, x, :));
end
toc(ticid1);
Here Volume is a 3D matrix, about 9 MB in size, holding all MRI images. FitSignal thus receives an array of 35 values for a specific pixel, and the optimisation finds the best fit. In this case the loop runs 120 times (b - a = 120), i.e. once for every pixel on a horizontal line in the image.
Timed with tic and toc, the entire loop takes about 50 seconds.
I thought executing the code in parallel might provide some speedup, so I opened 3 workers and ran the loop with parfor, but found only a marginal (20-30%) speedup.
Then I reduced the number of workers to 1. Running the code with parfor now took about 90 seconds, so with 1 worker the code is approximately twice as slow as without parallelization. This is consistent with the small benefit seen with 3 workers.
I then tried timing inside the function FitSignal and found that without parallelization it takes approximately 0.4 seconds per call, while with parallelization it takes 0.7 seconds.
I understand that parallelization comes with overhead, but in this case it seems excessive to me. Besides, once inside the function FitSignal, and when there is only one worker, it should not matter whether the function runs on the main process or within a worker - right? Yet, running inside a sole worker, the function runs considerably slower!
Can anyone tell me what is wrong, and more importantly, how to change the code to take advantage of any possible speedup from parallel execution?
Thanks in advance
PS: I have checked my system. Memory pressure is low; I even issued "purge" in the terminal to free memory. CPU usage does not exceed 15% during the run.

When running on a single machine, MATLAB automatically parallelises vector operations... except when you are running explicit parallelisation, like parfor.
So, based on your numbers, what is happening here is that when you run in normal (non-parfor) mode you are getting roughly a 100% speedup from the implicitly parallelised vector operations.
When you run in parfor mode, you lose the vector-operations boost but gain the parallelisation from parfor: each iteration runs at half the speed of normal processing, but the iterations are split over three workers, so the loop takes about two thirds of the time.
The above is a rough estimate based on the numbers in the question; naturally, for other problems these relative speedups will vary due to a number of factors, such as the proportion of vectorised code and the overhead of parfor.
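One way to sanity-check this explanation (a minimal sketch only, assuming FitSignal spends most of its time in MATLAB's built-in vectorised operations, and reusing the variable names from the question) is to cap the implicit multithreading with maxNumCompThreads and re-time the plain for loop. If the ~50 seconds roughly doubles, the implicit parallelisation is indeed what parfor is giving up:
    % Limit MATLAB's implicit (built-in) multithreading to one thread and
    % remember the previous setting so it can be restored afterwards.
    nPrev = maxNumCompThreads(1);

    ticid2 = tic;
    for x = a:1:b                                   % same loop as in the question
        [p1, p2, p3, p4] = FitSignal(Volume(y, x, :));
    end
    tOneThread = toc(ticid2);

    maxNumCompThreads(nPrev);                       % restore the original thread count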

Related

Is there a way to improve the priority of API calls from VBA, given unpredictable latency/response times of up to 5 milliseconds?

I am looking for advice on how to reduce the unpredictable, horrible latency/response times of API calls from VBA. I did some statistical analysis of Excel VBA API calls to QueryPerformanceCounter and GetSystemTimePreciseAsFileTime.
On my machine (8 cores, 5.2 GHz max frequency, Windows 10, Office 2019) both of these have a 100-nanosecond single-tick resolution. They both need a minimum of 6 elapsed ticks to get a response back, with a mode of 7 ticks and an average of 8+ ticks, which I can live with.
But there are serious outliers in the distribution: 0.2% of the time they need at least 100 ticks (10 microseconds), and on very rare occasions as much as 5 milliseconds to get a response back to VBA. If I unplug the power supply from this laptop, the delays of course increase: they skyrocket to an average of >11 ticks, and ~0.2% of the time >20 microseconds. I surmise this is some sort of queueing issue, but I have failed to find any discussion of it.
Is there a way to improve priority for the API calls? Maybe something crazy like assigning two or three cores exclusively to Excel and the API, everything else to the other 5-6 cores?
Excel/VBA only uses a maximum of ~30% CPU time according to Task Manager, so there would probably be no hit on the code's execution speed.
VBA does not support multi-threading; it can only use one core of your computer. So if you need multi-threading, switch to a real programming language, e.g. Python (there are libraries for handling Excel data with Python), or use one of the alternatives mentioned here: Multi-threading in VBA
In VBA, Excel will always wait for one command to finish before it can start the next one (single-threading).
By the way, the time between VBA issuing the call to the API and the API returning the result cannot be influenced by VBA (or any other solution). This is the API's own calculation time. So it is not Excel's fault that it takes long; it is the API that needs that time to compute the result (and how long that takes depends on the values you pass to the API).
I have continued to research my question. I think the bottom line is that you are at the total mercy of Windows' unpredictable prioritization of events. You cannot force prioritization, even if you set affinity to one CPU core and priority to high; I have tried both and still see these timing outliers. See, for example, hints of this in this thread about getting the frequency for timing calculations.
So I have to be aware that any single result has a 2+ percent probability of a timing error of 2+ microseconds, at least on my PC running in turbo mode.

I/O or CPU bound? How to check if running concurrently?

I’m new to Python and I'm struggling to understand some things in multiprocessing/threading. I want to speed up a function and have been trying different approaches from the multiprocessing module, but I can’t get it to run any faster. It’s possible it won’t run any faster, but I wanted to be sure this is the case before giving up. This isn’t a full description, but the most time-consuming activities are:
- repeatedly generating random data (10,000 rows and 10 columns)
- using a pre-fit model to predict an outcome for each row, and
- comparing each predicted value to an initial value.
It performs this multiple times depending on how many of the predicted values equal the initial value, updating the parameters of the distribution each time. The output of the function is a single numeric value.
I want to loop over several of these initial values and end up with a list of the output values. I was hoping to get multiple iterations to run concurrently (but I’m open to anything that could make it faster). I’ve been ignorantly attempting pool.apply, starmap and Process but haven’t seen a change in time.
My questions are:
Based on the description of what I’m doing, is my program I/O or CPU bound? (Is it possible to tell from that? Is this even the right question to be asking?)
Should I be using multithreading or multiprocessing?
How can I determine if the iterations are running concurrently or not?
Given that you didn't mention anything about drives, I'm going to assume it's not very I/O bound (although that's still possible). Are you using multiple threads/processes yet? If not, that's definitely your issue.
I'd probably look at Python's Thread library and, because of the loop that creates the data, maybe the pool library. You just need all of your workers running that random-generation function at the same time.
EDIT: I forgot to mention: if you open Task Manager/System Monitor, you should be able to see the load per CPU/thread. If only one is maxed out at any given time, you aren't running concurrently.
Example: I wrote a quick example to help with the pool. Your 10,000-item list with 10 columns was not even noticeable on my i7. I increased the columns to 10,000 and it used 4 GB of RAM and probably 30 seconds of 100% CPU @ 3.4 GHz.
from multiprocessing import Pool, Array
import random

def thread_function(_):
    """Return a list of 10,000 random integers (the input item is ignored)."""
    l = []
    for _ in range(10000):
        l.append(random.randint(0, 10000))
    return l

if __name__ == '__main__':
    # Shared integer array, used here simply as an iterable of 10,000 work items.
    rand_list = Array('i', range(10000))
    with Pool() as pool:
        # Each worker process runs thread_function once per item.
        rand_list = pool.map(thread_function, rand_list)
    print(len(rand_list))

Estimating WCET of a task on Linux

I want to approximate the Worst-Case Execution Time (WCET) for a set of tasks on Linux. Most professional tools are either expensive (thousands of dollars) or don't support my processor architecture.
Since I don't need a tight bound, my line of thought is that I:
- disable frequency scaling
- disable unnecessary background services and tasks
- set the program's affinity so it runs on a specified core
- run the program 50,000 times with various inputs
- profile it and store the total number of cycles it took to execute
Given the largest clock-cycle count and the core frequency, I can get an estimate.
Is this a sound, practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set (40 tasks) in parallel, with each task randomly assigned to a core, and do the same thing 50,000 times.
Once I get the estimate, a 10% safety margin will be added to account for unforeseeable interference and untested paths. This 10% margin is suggested in the paper "Approximation of Worst Case Execution Time in Preemptive Multitasking Systems" by Corti, Brega and Gross.
Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and the links off it (a small sketch of how to do this follows these comments). I would be looking for isolated points showing that a few runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot the time taken against different parameters from the input data to check that there isn't any pattern there.
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.
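As a minimal sketch of points 3) and 4), here is one way to do this in MATLAB, assuming the Statistics and Machine Learning Toolbox is available and that the 50K measured cycle counts have been collected into a vector cycles; the 95th-percentile threshold is purely an illustrative choice, not a recommendation:
    % Normal probability plot: suspiciously slow or fast runs show up as
    % isolated points that leave the straight line.
    figure;
    normplot(cycles);

    % Peaks-over-threshold fit of a Generalized Pareto distribution to the tail.
    u = quantile(cycles, 0.95);              % high threshold (illustrative)
    exceedances = cycles(cycles > u) - u;    % exceedances over the threshold
    parmhat = gpfit(exceedances);            % parmhat = [shape k, scale sigma]
    fprintf('GPD shape k = %.3f, scale sigma = %.1f cycles\n', parmhat(1), parmhat(2));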

Bi-threaded processing in MATLAB

I have a large-scale gradient descent optimization problem that I am running in MATLAB. The code has two parts:
A sequential update part that fires every iteration and updates the parameter vector.
A validation-error computation part that fires every 10 iterations or so, using the parameter values at the end of the iteration in which it fires.
The way I am running this now is to do (1) and (2) sequentially. But (2) takes a lot of time and is not the core part of my routine; I added it just to check the progress and plot the error of my model. Is it possible in MATLAB to run (2) in parallel with (1)? Please note that (1) cannot be run in parallel since it performs a sequential update, so a simple parfor is not a solution, unless there is a really smart way of doing that.
I don't think MATLAB has any way of multi-threading outside of the (rather restricted) Parallel Computing Toolbox. There is a workaround which may help you, though:
Open 2 sessions of MATLAB, sessions A and B (or instances, or workspaces, however you call them).
MATLAB session A:
- Calculates 10 iterations of your sequential process (1).
- Saves the result in a file (adequately and uniquely named).
- Goes back to the top of this loop and calculates the next 10 iterations.
In parallel:
MATLAB session B:
- Checks periodically for the existence of a file written by session A (define a timer that does this at a time interval that makes sense for your process: a few seconds or a few minutes ...).
- If a file exists, loads it, runs the validation computation (your process (2)), and displays/reports the results.
Note: this only works if process (1) doesn't need the result of process (2) to run its iterations; if it did, I don't know how you could parallelise this anyway. A rough sketch of the two sessions is given below.
If you have multiple cores on your machine, this should run smoothly; if you have a single core, the 2 sessions will have to share it and you will see a performance impact.
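A minimal sketch of this file-exchange scheme, assuming process (1) is wrapped in a hypothetical updateParameters function, process (2) in a hypothetical computeValidationError function, and that the checkpoint file names and the 30-second polling period are arbitrary choices:
    % --- Session A (first MATLAB instance): sequential updates,
    %     checkpointed to disk every 10 iterations ---
    params = initialParams;                      % assumed starting point
    for iter = 1:maxIter
        params = updateParameters(params);       % hypothetical process (1)
        if mod(iter, 10) == 0
            save(sprintf('checkpoint_%06d.mat', iter), 'params', 'iter');
        end
    end

    % --- Session B (second MATLAB instance): a timer polls every 30 seconds
    %     for new checkpoint files ---
    t = timer('ExecutionMode', 'fixedSpacing', 'Period', 30, ...
              'TimerFcn', @(~, ~) checkForCheckpoint());
    start(t);

    % Saved as checkForCheckpoint.m so the timer callback can find it.
    function checkForCheckpoint()
    files = dir('checkpoint_*.mat');
    for k = 1:numel(files)
        s = load(files(k).name);
        err = computeValidationError(s.params);  % hypothetical process (2)
        fprintf('Iteration %d: validation error %.4g\n', s.iter, err);
        delete(files(k).name);                   % avoid re-processing the same file
    end
    end
In a real setup, session A would probably write to a temporary name and rename the file once the save completes, so that session B never reads a half-written checkpoint.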

How to measure multithreaded process time in a multitasking environment?

Since I am running performance-evaluation tests of my multithreaded program in a (preemptive) multitasking, multicore environment, the process can get swapped out periodically. I want to compute the latency, i.e., only the duration during which the process was active. This will allow me to extrapolate how the performance would be in a non-multitasking environment, i.e., where only one program is running (most of the time), or under different workloads.
Usually two kinds of time are measured:
The wall-clock time (i.e., the time since the process started), but this includes the time during which the process was swapped out.
The processor time (i.e., the total CPU time used by all threads), but this is not useful for computing the latency of the process.
I believe what I need is the makespan of the individual threads' execution times, which can differ from the maximum CPU time used by any single thread because of the dependency structure among the threads. For example, in a process with 2 threads, thread 1 is heavily loaded in the first two-thirds of the runtime (using CPU time t), while thread 2 is loaded in the last two-thirds of the runtime (again using CPU time t). In this case:
wall-clock time would return 3t/2 + context switch time + time used by other processes in between,
max CPU time of all threads would return a value close to t, and
total CPU time is close to 2t.
What I hope to get as the output of the measurement is the makespan, i.e., 3t/2.
Furthermore, multi-threading brings indeterminacy of its own. This issue can probably be taken care of by running the test multiple times and summarizing the results.
Moreover, the latency also depends on how the OS schedules the threads; things get more complicated if some threads of a process wait for the CPU while others run. But let's forget about this.
Is there an efficient way to compute/approximate this makespan time? For code examples, please use any programming language, but preferably C or C++ on Linux.
PS: I understand this definition of makespan is different from what is used in scheduling problems. The definition used in scheduling problems is similar to wall-clock time.
Reformulation of the Question
I have written a multi-threaded application which takes X seconds to execute on my K-core machine.
How do I estimate how long the program will take to run on a single-core computer?
Empirically
The obvious solution is to get a computer with one core, run your application, and use wall-clock time and/or CPU time as you wish.
...Oh, wait, your computer already has one core (it also has some others, but we won't need to use them).
How to do this will depend on the operating system, but one of the first results I found on Google explains a few approaches for Windows XP and Vista:
http://masolution.blogspot.com/2008/01/how-to-use-only-one-core-of-multi-core.html
Following that you could:
Assign your application's process affinity to a single core (you can also do this in your code).
Start your operating system with only one of your cores enabled (and then switch back afterwards).
Independent Parallelism
Estimating this analytically requires knowledge about your program, the method of parallelism, etc.
As a simple example, suppose I write a multi-threaded program that calculates the ten-billionth decimal digit of pi and the ten-billionth decimal digit of e.
My code looks like:
public static void Main()
{
    // Two independent tasks: neither needs the other's result.
    Task t1 = new Task( calculatePiDigit );
    Task t2 = new Task( calculateEDigit );
    t1.Start();
    t2.Start();
    Task.WaitAll( t1, t2 );
}
The happens-before graph for this code consists of two independent branches; clearly these tasks are independent.
In this case:
Time calculatePiDigit() by itself.
Time calculateEDigit() by itself.
Add the times together.
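For instance (with hypothetical numbers), if calculatePiDigit() takes 40 seconds on its own and calculateEDigit() takes 30 seconds, the single-core estimate is simply 40 + 30 = 70 seconds.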
2-Stage Pipeline
When the tasks are not independent, you won't be able to just add the individual times together.
In this next example, I create a multi-threaded application to take 10 images, convert them to grayscale, and then run a line-detection algorithm. For some external reason, the images are not allowed to be processed out of order. Because of this, I create a pipeline pattern.
My code looks something like this:
ConcurrentQueue<Image> originalImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> grayscaledImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> completedImages = new ConcurrentQueue<Image>();

public static void Main()
{
    // Each pipeline stage reads from one queue, applies a function,
    // and writes its output to the next queue.
    PipeLineStage p1 = new PipeLineStage(originalImages, grayScale, grayscaledImages);
    PipeLineStage p2 = new PipeLineStage(grayscaledImages, lineDetect, completedImages);
    p1.Start();
    p2.Start();
    originalImages.add( image1 );
    originalImages.add( image2 );
    //...
    originalImages.add( image10 );
    originalImages.add( CancellationToken );  // sentinel: no more input is coming
    Task.WaitAll( p1, p2 );
}
A data-centric happens-before graph would show each image flowing through the grayscale stage and then the line-detection stage.
If this program had been designed as a sequential program to begin with, for cache reasons it would be more efficient to process each image all the way to completion before moving on to the next image.
Anyway, we know that GrayScale() will be called 10 times and LineDetection() will be called 10 times, so we can just time each independently and then multiply them by 10.
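As a hypothetical worked example, if GrayScale() averages 50 ms per image and LineDetection() averages 80 ms, the sequential estimate is 10 × (50 ms + 80 ms) = 1.3 seconds.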
But what about the costs of pushing/popping/polling the ConcurrentQueues?
Assuming the images are large, that time will be negligible.
If there are millions of small images, with many consumers at each stage, then you will probably find that the overhead of waiting on locks, mutexes, etc, is very small when a program is run sequentially (assuming that the amount of work performed in the critical sections is small, such as inside the concurrent queue).
Costs of Context Switching?
Take a look at this question:
How to estimate the thread context switching overhead?
Basically, you will have context switches in multi-core environments and in single-core environments.
The overhead of a single context switch is quite small, but context switches occur many times per second.
The danger is that the cache gets fully disrupted between context switches.
For example, ideally:
image1 gets loaded into the cache as a result of doing GrayScale
LineDetection will run much faster on image1, since it is in the cache
However, this could happen:
image1 gets loaded into the cache as a result of doing GrayScale
image2 gets loaded into the cache as a result of doing GrayScale
now pipeline stage 2 runs LineDetection on image1, but image1 isn't in the cache anymore.
Conclusion
Nothing beats timing on the same environment it will be run in.
Next best is to simulate that environment as well as you can.
Regardless, understanding your program's design should give you an idea of what to expect in a new environment.
