I have a large-scale gradient descent optimization problem that I am running in Matlab. The code has two parts:
(1) A sequential update part that fires every iteration and updates the parameter vector.
(2) A validation error computation part that fires every 10 iterations or so, using the parameter value at the end of the iteration in which it is fired.
The way I am running this now is to do (1) and (2) sequentially. But (2) takes a lot of time and it is not the core part of my routine; I wrote it just to check progress and plot the error of my model. Is it possible in Matlab to run (2) in parallel with (1)? Please note that (1) cannot be run in parallel since it performs a sequential update. So a simple 'parfor' is not a solution, unless there is a really smart way of doing that.
I don't think Matlab has any way of multi-threading outside of the (rather restricted) Parallel Computing Toolbox. There is a workaround which may help you though:
Open 2 sessions of Matlab, sessions A and B (or instances, or workspaces, whatever you call them).
Matlab session A:
Calculates 10 iterations of your sequential process (1)
Saves the result in a file (adequately and uniquely named)
Goes on to calculate the next 10 iterations (basically back to the top of this loop)
In parallel:
Matlab session B:
Checks periodically for the existence of the file written by session A (define a timer that does this at an interval that makes sense for your process, a few seconds or a few minutes ...)
If the file exists, loads it, then does the validation computation (your process (2)) and displays/reports the results.
Note: this only works if process (1) doesn't need the result of process (2) to run its iterations; if it does, I don't know how you could parallelise it anyway.
If you have multiple cores on your machine, this should run smoothly; if you have a single core, the two sessions will have to share it and you will see a performance impact.
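The file hand-off between the two sessions can be sketched as follows. This is a minimal illustration in Python rather than Matlab, and the names `save_checkpoint`, `poll_checkpoints` and the `params_*.pkl` naming scheme are assumptions made for the example, not part of the original suggestion. The one detail worth copying is writing to a temporary file and renaming it, so that session B never loads a half-written file.

```python
import os
import pickle
import tempfile


def save_checkpoint(directory, iteration, params):
    """Session A: store the parameters under a unique name, atomically."""
    final_path = os.path.join(directory, f"params_{iteration:08d}.pkl")
    # Write to a temporary file first, then rename it into place, so the
    # reader can never observe a partially written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(params, handle)
    os.replace(tmp_path, final_path)
    return final_path


def poll_checkpoints(directory, already_seen):
    """Session B: return (iteration, params) for checkpoints not yet processed."""
    fresh = []
    for name in sorted(os.listdir(directory)):
        is_checkpoint = name.startswith("params_") and name.endswith(".pkl")
        if is_checkpoint and name not in already_seen:
            with open(os.path.join(directory, name), "rb") as handle:
                params = pickle.load(handle)
            iteration = int(name[len("params_"):-len(".pkl")])
            fresh.append((iteration, params))
            already_seen.add(name)
    return fresh
```

Session A would call `save_checkpoint` every 10 iterations, while session B calls `poll_checkpoints` from its timer and runs the validation computation on whatever it returns.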
I want to approximate the Worst Case Execution Time (WCET) for a set of tasks on Linux. Most professional tools are either expensive (thousands of dollars) or don't support my processor architecture.
Since I don't need a tight bound, my line of thought is to:
disable frequency scaling
disable unnecessary background services and tasks
set the program's affinity so that it runs on a specified core
run the program 50,000 times with various inputs
profile each run and store the total number of cycles it took to execute
Given the largest cycle count and the core frequency, I can get an estimate.
Is this a sound practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set (40 tasks) in parallel, each randomly assigned to a core, and repeat the same measurement 50,000 times.
Once I get the estimate, a 10% safety margin will be added to account for unforeseeable interference and untested paths. This 10% margin has been suggested in the paper "Approximation of Worst Case Execution time in Preemptive Multitasking Systems" by Corti, Brega and Gross.
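The arithmetic behind this estimate is simple enough to write down. The sketch below is only an illustration in Python; the function name `estimate_wcet_seconds` and its parameters are hypothetical, and it assumes the cycle counts were collected with frequency scaling disabled so that the core frequency is constant.

```python
def estimate_wcet_seconds(cycle_counts, core_hz, margin=0.10):
    """Largest observed cycle count converted to seconds, plus a safety margin."""
    if not cycle_counts:
        raise ValueError("need at least one measurement")
    if core_hz <= 0:
        raise ValueError("core frequency must be positive")
    worst_cycles = max(cycle_counts)
    # cycles / (cycles per second) = seconds, then inflate by the margin.
    return worst_cycles / core_hz * (1.0 + margin)
```

Note that this is a measurement-based estimate, not a guaranteed bound: it can only be as good as the inputs and interference patterns that were actually exercised.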
Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and links off it. I would be looking for isolated points that show that in a few cases some runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot time taken against different parameters from the input data to check that there wasn't any pattern there.
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.
I am using Matlab on Mac OS X, running on a Pentium processor with 4 physical cores.
I want to analyse magnetic resonance images (MRI) and fit the signal from these images using optimisation. For every pixel I have 35 values (i.e. the same image acquired 35 times under different conditions) and I want to fit these values to some function.
Below, I have stripped my code down to the very basic loop that calls the fitting function:
ticid1 = tic;
for x = a:1:b
    [a, b, c, d] = FitSignal(Volume(y,x,:));
end
toc(ticid1);
Here Volume is a 3D matrix holding all MRI images, about 9 MB in size. FitSignal thus gets an array holding 35 values for a specific pixel, and the optimisation finds the best fit. In this case the loop runs 120 times (b-a = 120), once for every pixel on a horizontal line in the image.
Timing the above code using tic and toc, the entire loop takes about 50 seconds.
I thought executing the code in parallel might provide some speedup. So I opened 3 workers and ran the loop with parfor, but found only a marginal (20-30%) speedup.
Then I reduced the number of workers to 1. Now running the code with parfor took about 90 seconds. So with 1 worker the code is approximately twice as slow as when running without parallelization. This is consistent with the small benefit seen with 3 workers.
I then tried timing inside the function FitSignal and found that without parallelization it takes approximately 0.4 seconds, while with parallelization it takes 0.7 seconds.
I understand that parallelization comes with overhead, but in this case it seems excessive to me. Besides, once inside the function FitSignal, and when there is only one worker, it should not matter whether the function runs on the main process or within a worker, right? However, running inside a sole worker, the function runs considerably slower!
Can anyone tell me what is wrong and, importantly, how to change the code to take advantage of any possible speedup from parallel execution?
Thanks in advance
PS: I have checked my system. Memory pressure low, I even issued "purge" in terminal to free memory. CPU does not exceed 15% during run.
When running on a single machine, Matlab automatically parallelises vector operations (1)... except when you are running explicit parallelisation, like parfor (2).
So when you run in normal, non-parfor mode, you are getting roughly a 100% speedup from parallelised vector operations, based on your numbers.
When you run in parfor mode, you lose the vector-operations boost but gain the parallelisation from parfor: each worker runs at about half the speed of normal processing, but the work is split over three cores, so it takes about two thirds of the time.
The above is a rough estimate based on the numbers in the question; naturally for other problems these relative speedups will vary due to a number of factors, such as differing amounts of vectorized code and overheads of parfor.
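The rough estimate can be reproduced from the timings quoted in the question (50 s for the plain loop, 90 s for parfor with one worker, 3 workers). The Python snippet below is only a back-of-the-envelope model; `parfor_estimate` is a hypothetical helper and it deliberately ignores parfor's communication and scheduling overheads.

```python
def parfor_estimate(serial_time, one_worker_time, workers):
    """Rough parfor timing model that ignores parfor overheads.

    serial_time: measured time of the plain loop (implicit multithreading on).
    one_worker_time: measured time of the parfor loop with a single worker.
    """
    implicit_boost = one_worker_time / serial_time  # gain that is lost inside a worker
    ideal_time = one_worker_time / workers          # perfect split over the workers
    return implicit_boost, ideal_time, ideal_time / serial_time


boost, ideal, fraction = parfor_estimate(50.0, 90.0, 3)
print(f"implicit boost ~{boost:.1f}x, ideal parfor time {ideal:.0f} s, "
      f"{fraction:.0%} of the serial time")
```

With these numbers the model gives an implicit boost of about 1.8x and an ideal three-worker time of 30 s, i.e. 60% of the serial time, which is in the same ballpark as the "two thirds" figure; real overheads push the measured time higher.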
I am currently working with three Matlab functions that I want to run nearly simultaneously in a single Matlab session (as far as I know, Matlab is single-threaded). Each function has its own task; it is difficult to explain every detail of each function here, but I will try to include as much information as possible.
They are the CONTROL, CAMERA and DATA_DISPLAY tasks. My approach is to create timer objects so that each function is called back continuously, each with a different callback period.
CONTROL sends and receives data over Wi-Fi through a UDP port; it checks for available packets and executes its callback constantly.
CAMERA receives camera frames continuously over TCP and displays them; one timer object, T1, refreshes the captured frame.
DATA_DISPLAY displays all the received data and refreshes continuously, so another timer, T2, refreshes the display.
However, I noticed that timer T2 blocks timer T1 while it executes, slowing down the whole process. I am working on a system with a multi-core CPU, and I would expect MATLAB to be able to execute both timer objects in parallel, taking advantage of the cores.
From reading about the Parallel Computing Toolbox, it does not seem able to deal with an infinite loop or continuous callbacks, since the code never finishes and displays nothing when executed; but I am not sure I know how to use this toolbox properly.
Can anyone suggest a good way of restructuring the code into a more efficient structure?
Many thanks
I see a problem with using the Parallel Computing Toolbox here. Its design implies that the jobs are controlled via your primary Matlab instance. Besides this, the primary instance is the only one with a GUI, which would require your DATA_DISPLAY task to control everything. I don't know if this is possible, but it would result in a very strange architecture. Also, inter-process communication is not the best idea when processing large amounts of data.
To solve the issue, I would use Java to display your data and implement the DATA_DISPLAY part. The connection to Java is very fast and simple to use. You will have to write a small Java GUI with an appendframe function that allows your CAMERA job to push new data. Obviously, updating the GUI should be done in parallel, without blocking.
If a memoized function is called in parallel from two jobs, what happens? Is one call's result saved and the other retrieved, or do both run without using each other's results? Or is this case not supported at all?
I couldn't find a reference to this in the documentation.
If a result has already been computed and saved (by the same process or by a concurrent process) it is reused.
If 2 concurrent processes compute the same result for the first time, the first process to complete saves the result on the drive for later reuse; the second process uses its own computed result the first time and can reuse the cached result later.
Also the cache is preserved on the hard drive after a Python program ends so that it can be reused when the same script / program is restarted later.
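To make the described behaviour concrete, here is a minimal, standard-library-only sketch of a disk-backed memoizer. It is not the implementation of any particular caching library; `disk_memoize` and its on-disk layout are assumptions made purely for illustration. It shares the properties described above: a stored result is reused by any process, two processes that miss at the same time both compute and return their own result, and the cache survives the end of the program. One difference from the description is that in this sketch the later finisher simply overwrites the file with an identical result.

```python
import hashlib
import os
import pickle
import tempfile


def disk_memoize(cache_dir):
    """Decorator that caches results on disk, keyed by function name and arguments."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        def wrapper(*args):
            key = hashlib.sha256(pickle.dumps((func.__name__, args))).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            try:
                # Cache hit: reuse a result saved by this or any other process.
                with open(path, "rb") as handle:
                    return pickle.load(handle)
            except FileNotFoundError:
                pass
            # Cache miss: compute locally and return our own result.
            result = func(*args)
            # Publish atomically so readers never see a partial file.
            fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
            with os.fdopen(fd, "wb") as handle:
                pickle.dump(result, handle)
            os.replace(tmp_path, path)
            return result

        return wrapper

    return decorator
```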
Since I am running performance evaluation tests of my multithreaded program on a (preemptive) multitasking, multicore environment, the process can get swapped out periodically. I want to compute the latency, i.e., only the duration when the process was active. This will allow me to extrapolate how the performance would be on a non-multitasking environment, i.e., where only one program is running (most of the time), or on different workloads.
Usually two kinds of time are measured:
The wall-clock time (i.e., the time since the process started) but this includes the time when the process was swapped out.
The processor time (i.e., sum total of CPU time used by all threads) but this is not useful to compute the latency of the process.
I believe what I need is the makespan of the times of the individual threads, which can differ from the maximum CPU time used by any thread because of the task dependency structure among the threads. For example, in a process with 2 threads, thread 1 is heavily loaded in the first two-thirds of the runtime (for CPU time t) while thread 2 is loaded in the last two-thirds of the runtime of the process (again, for CPU time t). In this case:
wall-clock time would return 3t/2 + context switch time + time used by other processes in between,
max CPU time of all threads would return a value close to t, and
total CPU time is close to 2t.
What I hope to get as the output of the measurement is the makespan, i.e., 3t/2.
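In other words, the quantity being asked for is the length of the union of the intervals during which at least one thread was running. The sketch below illustrates the definition in Python; it assumes the per-thread busy intervals are already known (obtaining them is the hard part and is not addressed here), and the function name `makespan` is just a label for this example.

```python
def makespan(intervals):
    """Total length of the union of (start, end) busy intervals."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

For the two-thread example with t = 2, thread 1 is busy on (0, 2) and thread 2 on (1, 3); the union has length 3 = 3t/2, while the maximum per-thread time is 2 = t and the total CPU time is 4 = 2t.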
Furthermore, multi-threading brings indeterminacy of its own. This issue can probably be taken care of by running the test multiple times and summarizing the results.
Moreover, the latency also depends on how the OS schedules the threads; things get more complicated if some threads of a process wait for the CPU while others run. But let's forget about this.
Is there an efficient way to compute/approximate this makespan? For code examples, please use any programming language, but preferably C or C++ on Linux.
PS: I understand this definition of makespan is different from what is used in scheduling problems. The definition used in scheduling problems is similar to wall-clock time.
Reformulation of the Question
I have written a multi-threaded application which takes X seconds to execute on my K-core machine.
How do I estimate how long the program will take to run on a single-core computer?
Empirically
The obvious solution is to get a computer with one core, and run your application, and use Wall-Clock time and/or CPU time as you wish.
...Oh, wait, your computer already has one core (it also has some others, but we won't need to use them).
How to do this will depend on the Operating System, but one of the first results I found from Google explains a few approaches for Windows XP and Vista.
http://masolution.blogspot.com/2008/01/how-to-use-only-one-core-of-multi-core.html
Following that you could:
Assign your Application's process to a single core's affinity. (you can also do this in your code).
Start your operating system only knowing about one of your cores. (and then switch back afterwards)
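On Linux, setting the affinity from code can be sketched as below. This is an illustrative Python example using `os.sched_setaffinity`, which is only available on platforms that support it (Linux in practice); the helper name `pin_to_single_core` is an assumption for this example.

```python
import os


def pin_to_single_core(core=None):
    """Restrict the calling process to one core and return the core chosen."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    allowed = sorted(os.sched_getaffinity(0))
    if core is None:
        core = allowed[0]  # default to the first core we are allowed to use
    # pid 0 means "the calling process"; all of its threads share this mask.
    os.sched_setaffinity(0, {core})
    return core
```

Calling `pin_to_single_core()` at the start of the program, before any threads are created, makes every thread compete for the same core. From a Linux shell, running the unmodified program under `taskset -c 0` has a similar effect.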
Independent Parallelism
Estimating this analytically requires knowledge about your program, the method of parallelism, etc.
As a simple example, suppose I write a multi-threaded program that calculates the ten billionth decimal digit of pi and the ten billionth decimal digit of e.
My code looks like:
public static void Main()
{
    // Each task wraps one independent computation.
    Task t1 = new Task( calculatePiDigit );
    Task t2 = new Task( calculateEDigit );
    t1.Start();
    t2.Start();
    // Block until both tasks have finished.
    Task.WaitAll( t1, t2 );
}
And the happens-before graph looks like:
Clearly these are independent.
In this case
Time calculatePiDigit() by itself.
Time calculateEDigit() by itself.
Add the times together.
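These three steps amount to a few lines of code. The Python sketch below is only illustrative; `time_call` and `estimate_sequential_time` are hypothetical helpers, and the callables passed in stand for the two independent computations.

```python
import time


def time_call(func, *args):
    """Wall-clock seconds taken by a single call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def estimate_sequential_time(tasks):
    """Sum of the individual run times of independent, zero-argument tasks."""
    return sum(time_call(task) for task in tasks)
```

This is only valid when the tasks really are independent; shared state or ordering constraints between them would make the sum misleading.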
2-Stage Pipeline
When the tasks are not independent, you won't be able to just add the individual times together.
In this next example, I create a multi-threaded application to take 10 images, convert them to grayscale, and then run a line detection algorithm. For some external reason, the images are not allowed to be processed out of order. Because of this, I create a pipeline pattern.
My code looks something like this:
ConcurrentQueue<Image> originalImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> grayscaledImages = new ConcurrentQueue<Image>();
ConcurrentQueue<Image> completedImages = new ConcurrentQueue<Image>();

public static void Main()
{
    // Stage 1 feeds stage 2 through the grayscaledImages queue.
    PipeLineStage p1 = new PipeLineStage(originalImages, grayScale, grayscaledImages);
    PipeLineStage p2 = new PipeLineStage(grayscaledImages, lineDetect, completedImages);
    p1.Start();
    p2.Start();
    originalImages.Enqueue( image1 );
    originalImages.Enqueue( image2 );
    //...
    originalImages.Enqueue( image10 );
    // Sentinel that tells the stages no more images are coming.
    originalImages.Enqueue( CancellationToken );
    Task.WaitAll( p1, p2 );
}
A data centric happens-before graph:
If this program had been designed as a sequential program to begin with, for cache reasons it would be more efficient to take each image one at a time and move them to completed, before moving to the next image.
Anyway, we know that GrayScale() will be called 10 times and LineDetection() will be called 10 times, so we can just time each independently and then multiply them by 10.
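Written out, the single-core estimate for the pipeline is the number of images times the sum of the per-image stage times. The Python sketch below just encodes that arithmetic; `estimate_pipeline_time` is a hypothetical helper, and the per-call times are assumed to have been measured separately for each stage.

```python
def estimate_pipeline_time(per_call_times, item_count):
    """Sequential estimate: every item passes through every stage once."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    return item_count * sum(per_call_times)
```

For instance, if a grayscale conversion were measured at 0.2 s per image and line detection at 0.5 s per image (made-up figures), 10 images would come to roughly 7 s, before accounting for queue overhead or cache effects.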
But what about the costs of pushing/popping/polling the ConcurrentQueues?
Assuming the images are large, that time will be negligible.
If there are millions of small images, with many consumers at each stage, then you will probably find that the overhead of waiting on locks, mutexes, etc, is very small when a program is run sequentially (assuming that the amount of work performed in the critical sections is small, such as inside the concurrent queue).
Costs of Context Switching?
Take a look at this question:
How to estimate the thread context switching overhead?
Basically, you will have context switches in multi-core environments and in single-core environments.
The overhead of a single context switch is quite small, but context switches occur many times per second.
The danger is that the cache gets fully disrupted between context switches.
For example, ideally:
image1 gets loaded into the cache as a result of doing GrayScale
LineDetection will run much faster on image1, since it is in the cache
However, this could happen:
image1 gets loaded into the cache as a result of doing GrayScale
image2 gets loaded into the cache as a result of doing GrayScale
now pipeline stage 2 runs LineDetection on image1, but image1 isn't in the cache anymore.
Conclusion
Nothing beats timing in the same environment the program will be run in.
Next best is to simulate that environment as well as you can.
Regardless, understanding your program's design should give you an idea of what to expect in a new environment.