Priority Queue cost? - priority-queue

I wrote a program to calculate the median of a running sequence using two priority queues min and max. By comparing the input with the current median, it is put on a queue depending on its value. By comparing queue sizes the median can is determined.
I have to hypothesize and test the execution time of the program. From testing and plotting the graph for different input sizes I can see the execution time is indeed linear.
This surprises me, the reason being is I thought the bigger the queues the longer it takes to place the correct value at the root ? or is the size of the queue irrelevant to placing the correct value at the root (as it seems in my results).
If someone could elaborate on the cost of priority queue that would be very great.

Related

I/O or CPU bound? How to check if running concurrently?

I’m new to Python and I'm struggling to understand some things in multiprocessing/threading. I want to speed up a function and have been trying different approaches from the multiprocessing module, but I can’t get it to run any faster. It’s possible it won’t run any faster, but I wanted to be sure this is the case before giving up. This isn’t a full description, but the most time-consuming activities are:
-repeatedly generating random data (10,000 rows and 10 columns)
-using a pre-fit model to predict an outcome for each row and
-comparing each predicted value to an initial value.
It performs this multiple times depending on how many of the predicted values equal the initial value, updating the parameters of the distribution each time. The output of the function is a single numeric value.
I want to loop over several of these initial values and end up with a list of the output values. I was hoping to get multiple iterations to run concurrently (but I’m open to anything that could make it faster). I’ve been ignorantly attempting pool.apply, starmap and Process but haven’t seen a change in time.
My questions are:
Based on the description of what I’m doing, is my program I/O or CPU bound? (Is it possible to tell from that? Is this even the right question to be asking?)
Should I be using multithreading or multiprocessing?
How can I determine if the iterations are running concurrently or not?
Given you didn't mention anything about drives, I'm going to assume it's not very IO bound (although still possible). Are you using multiple threads/processes yet? If not, that's definitely your issue.
I'd probably look at Pythons Thread library and because of the loop to create data, maybe the thread pool library. You just need all of your threads running that rand function at the same time.
EDIT: I forgot to mention. If you open Task Manager/System Monitor, you should be able to see load per CPU/Thread. If only one is maxed at any given time, you aren't concurrent.
Example: I wrote a quick example to help with the thread pool. Your 10,000 item list with 10 columns was not even noticeable on my i7. I increased the columns to 10,000 and it used 4GB of RAM and probably 30 seconds of 100% CPU # 3.4GHz.
from multiprocessing import Pool, Array
import random
def thread_function(_):
"""Return a random number."""
l = []
for _ in range(10000):
l.append(random.randint(0, 10000))
return l
if __name__ == '__main__':
rand_list = Array('i', range(10000))
with Pool() as pool:
rand_list = pool.map(thread_function, rand_list)
print(len(rand_list))

Estimating WCET of a task on Linux

I want to approximate the Worst Case Execution Time (WCET) for a set of tasks on linux. Most professional tools are either expensive (1000s $), or don't support my processor architecture.
Since, I don't need a tight bound, my line of thought is that I :
disable frequency scaling
disbale unnecesary background services and tasks
set the program affinity to run on a specified core
run the program for 50,000 times with various inputs
Profiling it and storing the total number of cycles it had completed to
execute.
Given the largest clock cycle count and knowing the core frequency, I can get an estimate
Is this is a sound Practical approach?
Secondly, to account for interference from other tasks, I will run the whole task set (40) tasks in parallel with each randomly assigned a core and do the same thing for 50,000 times.
Once I get the estimate, a 10% safe margin will be added to account for unforseeble interference and untested path. This 10% margin has been suggested in the paper "Approximation of Worst Case Execution time in Preepmtive Multitasking Systems" by Corti, Brega and Gross
Some comments:
1) Even attempting to compute worst case bounds in this way means making assumptions that there aren't uncommon inputs that cause tasks to take much more or even much less time. An extreme example would be a bug that causes one of the tasks to go into an infinite loop, or that causes the whole thing to deadlock. You need something like a code review to establish that the time taken will always be pretty much the same, regardless of input.
2) It is possible that the input data does influence the time taken to some extent. Even if this isn't apparent to you, it could happen because of the details of the implementation of some library function that you call. So you need to run your tests on a representative selection of real life data.
3) When you have got your 50K test results, I would draw some sort of probability plot - see e.g. http://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm and links off it. I would be looking for isolated points that show that in a few cases some runs were suspiciously slow or suspiciously fast, because the code review from (1) said there shouldn't be runs like this. I would also want to check that adding 10% to the maximum seen takes me a good distance away from the points I have plotted. You could also plot time taken against different parameters from the input data to check that there wasn't any pattern there.
4) If you want to try a very sophisticated approach, you could try fitting a statistical distribution to the values you have found - see e.g. https://en.wikipedia.org/wiki/Generalized_Pareto_distribution. But plotting the data and looking at it is probably the most important thing to do.

Parallelization slows down execution in MatLab

I am using Matlab on a Mac OS X running on a Pentium processor with 4 real cores.
I want to analyse Magnetic resonance images (MRI) and fit the signal from these images using optimisation. For every pixel I have 35 values (i.e. the same image acquired 35 times during different conditions) and I want to fit these values to some function
Below, I have stripped my code down to the very basic loop that calls the fitting function:
ticid1 = tic;
for x= a:1:b
[a, b, c, d] = FitSignal(Volume(y,x,:));
end;
toc(ticid1);
Here Volume is a 3D matrix holding all MRI images about 9 MB in size. FitSignal thus gets an array holding 35 values for a specific pixel and the optimisation finds the best fit. The loop runs in this case 120 times (b-a = 120) which is once for every pixel that are on a horizontal line in the image.
Timing the above code using tic and toc, the entire loop takes about 50 seconds
I thought executing the code in parallel may provide some speed up. So I opened 3 workers and ran the loop with parfor but found only marginal (20-30%) speedup.
Then I reduced the number of workers to 1. Now running the code with parfor took about 90 seconds. So with 1 worker the code is app. twice as slow as when running without parallelization. This is consistent with the small benefit seen with 3 workers.
I then tried timing inside the function FitSignal and found that without parallelization it takes app. 0.4 seconds while with parallelization it takes 0.7 seconds.
I understand that parallelization comes with overhead but in this case it seems excessive to me. Besides, once inside the function FitSignal, and when there is only one worker, it should not matter if the function runs on the main process or within a worker - right ? However, running inside a sole worker, the function runs quite slower!
Can anyone tell me what is wrong? and importantly, how to change the code to take advantage of any possible speedup with parallel execution ?
Thanks in advance
PS: I have checked my system. Memory pressure low, I even issued "purge" in terminal to free memory. CPU does not exceed 15% during run.
When running on a single machine, Matlab automatically parallelises vector operations (1)... except when you are running explicit parallelisation, like parfor (2).
So, what is happening here is that when you run in normal, not parfor mode you are getting a 100% speedup from parallelised vector operations, based on your numbers.
When you run in parfor mode, you loose the vector operations boost, but gain the parallelisation from parfor, so half the speed of normal processing, but split over three cores, so taking about two thirds of the time.
The above is a rough estimate based on the numbers in the question; naturally for other problems these relative speedups will vary due to a number of factors, such as differing amounts of vectorized code and overheads of parfor.

Performance issue while using Parallel.foreach() with MaximumDegreeOfParallelism set as ProcessorCount

I wanted to process records from a database concurrently and within minimum time. So I thought of using parallel.foreach() loop to process the records with the value of MaximumDegreeOfParallelism set as ProcessorCount.
ParallelOptions po = new ParallelOptions
{
};
po.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.ForEach(listUsers, po, (user) =>
{
//Parallel processing
ProcessEachUser(user);
});
But to my surprise, the CPU utilization was not even close to 20%. When I dig into the issue and read the MSDN article on this(http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism(v=vs.110).aspx), I tried using a specific value of MaximumDegreeOfParallelism as -1. As said in the article thet this value removes the limit on the number of concurrently running processes, the performance of my program improved to a high extent.
But that also doesn't met my requirement for the maximum time taken to process all the records in the database. So I further analyzed it more and found that there are two terms as MinThreads and MaxThreads in the threadpool. By default the values of Min Thread and MaxThread are 10 and 1000 respectively. And on start only 10 threads are created and this number keeps on increasing to a max of 1000 with every new user unless a previous thread has finished its execution.
So I set the initial value of MinThread to 900 in place of 10 using
System.Threading.ThreadPool.SetMinThreads(100, 100);
so that just from the start only minimum of 900 threads are created and thought that it will improve the performance significantly. This did create 900 threads, but it also increased the number of failure on processing each user very much. So I did not achieve much using this logic. So I changed the value of MinThreads to 100 only and found that the performance was much better now.
But I wanted to improve more as my requirement of time boundation was still not met as it was still exceeding the time limit to process all the records. As you may think I was using all the best possible things to get the maximum performance in parallel processing, I was also thinking the same.
But to meet the time limit I thought of giving a shot in the dark. Now I created two different executable files(Slaves) in place of only one and assigned them each half of the users from DB. Both the executable were doing the same thing and were executing concurrently. I created another Master program to start these two Slaves at the same time.
To my surprise, it reduced the time taken to process all the records nearly to the half.
Now my question is as simple as that I do not understand the logic behind Master Slave thing giving better performance compared to a single EXE with all the logic same in both the Slaves and the previous EXE. So I would highly appreciate if someone will explain his in detail.
But to my surprise, the CPU utilization was not even close to 20%.
…
It uses the Http Requests to some Web API's hosted in other networks.
This means that CPU utilization is entirely the wrong thing to look at. When using the network, it's your network connection that's going to be the limiting factor, or possibly some network-related limit, certainly not CPU.
Now I created two different executable files … To my surprise, it reduced the time taken to process all the records nearly to the half.
This points to an artificial, per process limit, most likely ServicePointManager.DefaultConnectionLimit. Try setting it to a larger value than the default at the start of your program and see if it helps.

Why are niceness values inversely related to process priority?

The niceness of a process decreases with increasing process priority.
Extract from Beginning Linux Programming 4th Edition, Pg 169 :
The default priority is 0. Positive priorities are used for background
tasks that run when no other higher priority task is ready to run.
Negative priorities cause a program to run more frequently, taking a
larger share of the available CPU time. The range of valid priorities
is -20 to +20. This is often confusing because the higher the
numerical value, the lower the execution precedence.
Is there any special reason for negative values corresponding to higher process priority (as opposed to increasing priority for higher niceness valued processes) ?
#Ewald's answer is correct, as is confirmed by Jerry Peek et al. in Unix Power Tools (O'Reilly, 2007, p. 507):
This is why the nice number is usually called niceness: a job with a high niceness is very kind to the users of your system (i.e., it runs at low priority), while a job with little niceness hogs the CPU. The term "niceness" is awkward, like the priority system itself. Unfortunately, it's the only term that is both accurate (nice numbers are used to compute the priorities but are not the priorities themselves) and avoids horrible circumlocutions ("increasing the priority means lowering the priority...").
Nice has had this meaning since at least V6 Unix, but the V6 manual never explains this explicitly. The range of allowed values was -220 through +20, with negative numbers reserved for the superuser. The range was changed to -20 through +20 in V7.
Hysterical reasons - I mean historical... I'm pretty sure it started with numbers going up from 0 .. 20, and the lowest available was taken first. Then someone came to the conclusion that "Hmm, what if we need to make some MORE important" - well we have to go negative.
You want priority to be a sortable value, so if you start with "default is zero", you have to either make higher priority a higher number (but "priority 1" in daily speak is higher then "priority 2" - when your boss says "Make this your number 1 priority", it does mean it's important, right?). Being a computer, clearly priority 0 is higher than priority 1, and priority -1 is higher than priority 0.
In the end, it's an arbitrary choice. Maybe Ken Thomson, Dennis Ritchie or one of those guys will be able to say for sure why they choose just that sequence, and not 0..255, for example.
First of all the answer is a little bit long but it is only for clarification.
As in the linux kernel every conventional process may have the priorities which are called static priority are from 100(highest) to 139(lowest). so there are basically 40 priorities which could be assigned to the process.
so when any process is created it gets the priority of it's parent but if the user wants to change it's priority then it could be done with the help of nice(nice_value) system call.
& the reason for your question is that every process wants base time quantum which is used as how much time the process will get the CPU for its execution in milliseconds and this is calculated as
time={
if static_priority<120
(140-static_priority)*20
if static_priority>=120
(140-static_priority)*5
so The sys_nice( )service routine handles the nice( )system call. Although the nice_value may have any value, absolute values larger than 40 are trimmed down to 40. Traditionally, negative values correspond to requests for priority increments and require superuser privileges, while positive ones correspond to requests for priority decreases. In the case of a negative nice_value, the function
invokes the capable( ) function to verify whether the process has a CAP_SYS_NICE capability. Moreover, the function invokes the security_task_setnice( )security hook. so in the end the nice_value is used to calculate the static priority & then this static priority is used for calculation of base time quantum.
so it's clear that -ve values are used for increment the priority so needs super user access & +ve values are used for decrease the priority so no need of super user access.
Yes - it gets NICER as the number goes up and MEANER as the number goes down. So the process is seen as "friendlier" when it's not taking up all the resources and "nasty" as it gets greedier with resources.
Think of it as "nice" points - the nicer you are to others, the more points you have.

Resources