Change from O(n^2) complexity to O(n) complexity or lower for a piece of code used in a multithreaded application

I have the piece of code below and want to improve its time complexity.
It runs as a thread, and up to 2000 threads can execute this function at the same time.
On top of that, I wait on a pollset for file descriptors that are ready. MAX_RTP_SESSIONS is also large (5000 or more), so the inner for loop is long and I can see performance suffering. [When MAX_RTP_SESSIONS is reduced to just 500, I see a huge improvement in performance.]
But I have to use 2000 threads and 5000 sessions. I would like to reduce the time complexity from O(n^2) to at least O(n) or better. Any ideas are really appreciated!
//..
retVal = epoll_wait(epfd, pollset, EPOLL_MAX_EVENTS, mSecTimeout);
//..
sem_wait(&sem_sessions);
for (i = 0; i < retVal; i++) {
    for (j = 0; j < MAX_RTP_SESSIONS; j++) {
        if ((g_rtp_sessions[j].destroy == FALSE) &&
            (g_rtp_sessions[j].used != FALSE) &&
            (g_rtp_sessions[j].p_rtp->rtp_socket->fd == pollset[i].data.fd))
        {
            if (0 < rtp_recv_data(....)) {
                rtp_update(...);
            }
        }
    }
}
sem_post(&sem_sessions);
//..

Sort pollset by file descriptor; then, for each session, you can do a binary search on it. With m ready descriptors and n sessions, that costs O((m + n) log m) instead of O(m * n).
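The sort-then-search suggestion above could be sketched roughly as follows. This is a minimal illustration, not the asker's code: it works on plain arrays of file descriptors, and the names match_ready_sessions, ready_fds, session_fds and the "-1 means unused slot" convention are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort/bsearch on plain ints. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * ready_fds:    fds reported ready by the poll call (sorted in place)
 * session_fds:  fd of each session slot, -1 for an unused slot (assumption)
 * out_sessions: receives the indices of sessions whose fd is ready
 * Returns the number of indices written to out_sessions.
 */
static size_t match_ready_sessions(int *ready_fds, size_t n_ready,
                                   const int *session_fds, size_t n_sessions,
                                   size_t *out_sessions)
{
    assert(ready_fds != NULL && session_fds != NULL && out_sessions != NULL);
    size_t n_out = 0;

    /* O(m log m): sort the ready descriptors once. */
    qsort(ready_fds, n_ready, sizeof ready_fds[0], cmp_int);

    /* O(n log m): one binary search per session slot. */
    for (size_t j = 0; j < n_sessions; j++) {
        if (session_fds[j] < 0)
            continue;   /* unused or destroyed slot */
        if (bsearch(&session_fds[j], ready_fds, n_ready,
                    sizeof ready_fds[0], cmp_int) != NULL)
            out_sessions[n_out++] = j;
    }
    return n_out;
}
```

In the original loop, the indices returned here would be the sessions on which rtp_recv_data is called. If the session index can be stored in the epoll event's data field at registration time, the search may be avoidable altogether, but that depends on how the sessions are registered.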

Most efficient way to spawn n pthreads with the same parameters in C

I have 32 threads whose input parameters I know ahead of time; nothing changes inside the function (other than the memory buffer that each thread interacts with).
In pseudo C code this is my design pattern:
// declare 32 pthreads as global variables
void dispatch_32_threads() {
    for (int i = 0; i < 32; i++) {
        pthread_create(&thread_id[i], NULL, thread_function, (void*) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
int main(void) {
    // init 32 pthreads here
    for (int n = 0; n < 4000; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                dispatch_32_threads();
                // modify buffers here
            }
        }
    }
}
I am calling dispatch_32_threads 100*100*4000 = 40000000 times. thread_function and (void*) thread_params[i] do not change. I think pthread_create keeps creating and destroying threads. I have 32 cores, and none of them is at 100% utilization; it hovers around 12%. Moreover, when I reduce the number of threads to 10, all 32 cores remain at 5-7% utilization and I see no slowdown in runtime. Running fewer than 10 threads slows things down.
Running 1 thread, however, is extremely slow, so multithreading is helping. I profiled my code: I know it's thread_function that is slow, and thread_function is parallelizable. This leads me to believe that pthread_create keeps spawning and destroying threads on different cores, and that beyond 10 threads I lose efficiency: the work in thread_function is in essence cheaper than spawning more than 10 threads.
Is this assessment true? What is the best way to utilize 100% of all cores?
Thread creation is expensive. The cost depends on several parameters, but it is rarely below 1000 cycles. Thread synchronisation and destruction cost about the same. If the amount of work in your thread_function is not very high, this overhead will largely dominate the computation time.
It is rarely a good idea to create threads in inner loops. Probably the best option is to create threads that each process a share of the outer loop's iterations. Depending on your program and on what thread_function does, there may be dependencies between iterations, and this may require some rewriting, but a solution could be:
int outer = 4000;
int nthreads = 32;
int perthread = outer / nthreads;  // 125 outer iterations per thread
// add an integer thread_id to the thread_params struct
void *thread_func(void *arg) {
    whatisrequired *thread_param = arg;
    // runs perthread iterations of the outer loop, beginning at start
    int start = thread_param->thread_id * perthread;
    for (int n = start; n < start + perthread; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                // do the work
            }
        }
    }
    return NULL;
}
int main() {
    for (int i = 0; i < 32; i++) {
        thread_params[i]->thread_id = i;
        pthread_create(&thread_id[i], NULL, thread_func,
                       (void*) thread_params[i]);
    }
    // wait until all 32 threads are finished
    for (int j = 0; j < 32; j++) {
        pthread_join(thread_id[j], NULL);
    }
}
With this kind of parallelization, you can consider using OpenMP. The parallel for directive lets you experiment easily to find the best parallelization scheme.
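As a rough illustration of the OpenMP route, the outer loop could be parallelized as below. This is only a sketch under the assumption that iterations are independent; do_work is a hypothetical stand-in for the real thread_function body, and the sum exists only so the result can be checked. The pragma is ignored if the compiler is not run with OpenMP enabled (for example -fopenmp with GCC), in which case the loop simply runs serially.

```c
#include <assert.h>

/* Hypothetical stand-in for the real per-cell work. */
static long long do_work(int n, int x, int y)
{
    return (long long)n + x + y;
}

/* Sums do_work over the n_outer x 100 x 100 iteration space. */
static long long run_all(int n_outer)
{
    assert(n_outer >= 0);
    long long total = 0;

    /* Threads are created once for the whole loop nest; each thread
       gets a contiguous share of the outer iterations. */
    #pragma omp parallel for reduction(+:total) schedule(static)
    for (int n = 0; n < n_outer; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                total += do_work(n, x, y);
            }
        }
    }
    return total;
}
```

Changing schedule(static) to schedule(dynamic) or adding a chunk size is a one-line experiment, which is the main convenience over hand-written pthread code.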
If there are dependencies and such an obvious parallelization is not possible, you can create the threads at program start and hand them work through a thread pool. Managing queues is less expensive than creating threads (although atomic accesses do have a cost).
Edit: Alternatively, you can
1. put all your loops in the thread function
2. at the start (or the end) of the inner loop, add a barrier to synchronize your threads. This ensures that all threads have finished their job.
3. in main, create all the threads and wait for completion.
Barriers are less expensive than thread creation, and the result will be identical.

ways to express concurrency without thread

I am wondering about how concurrency can be expressed without an explicit thread object, not the implementation, which probably would use threads or thread pools, but the language design related issues.
Q1: I wonder what would be lost if there was no thread object, what couldn't be done in such a language?
Q2: I also wonder about how this would be expressed, what ways were proposed or implemented as alternatives or complements to threads?
One possibility is the MPI programming model (GPUs work similarly).
Let's say you have the following code:
for (int i = 0; i < 100; i++) {
    work(i);
}
The "normal" thread-based way would be to split the iteration range into multiple subsets, something like this:
Thread-1:
for (int i = 0; i < 50; i++) {
    work(i);
}
Thread-2:
for (int i = 50; i < 100; i++) {
    work(i);
}
However, in MPI/GPU programming you do something different.
The idea is that every core executes the same (GPU) or at least
a similar (MPI) program. The difference is that each core uses
a different ID, which changes the behavior of the code.
MPI-style (not exactly MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int subset = 100 / size;
for (int i = rank * subset; i < (rank + 1) * subset; i++) {
    // each core will use a different range for i
    work(i);
}
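Note that the split above drops the leftover iterations whenever 100 is not divisible by the number of cores. A small sketch of a block partition that also distributes the remainder is shown below; block_range and struct range are hypothetical helper names, not part of MPI.

```c
#include <assert.h>

/* Half-open range [begin, end) of iterations owned by one rank. */
struct range {
    int begin;
    int end;
};

/* Splits `total` iterations over `size` ranks as evenly as possible. */
static struct range block_range(int rank, int size, int total)
{
    assert(size > 0 && rank >= 0 && rank < size && total >= 0);
    int base = total / size;    /* every rank gets at least this many */
    int extra = total % size;   /* the first `extra` ranks get one more */
    struct range r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

Each core would then loop from r.begin to r.end using its own rank, exactly as in the loop above.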
The next big thing is communication. With threads you normally have to handle all of the synchronization manually. MPI is message-based, which means it is not perfectly suited to classical shared-memory models (where every core has access to the same memory), but in a cluster system (many cores connected by a network) it works excellently. This is not limited to supercomputers (which use almost exclusively MPI-style programming): in recent years a new type of core architecture (manycore) has been developed. These chips have a local so-called Network-on-Chip, so each core can send and receive messages without running into synchronization problems.
MPI offers not only simple messages but also higher-level constructs that automatically scatter data to, and gather data from, every core.
Example (again, not MPI syntax):
int rank = get_core_id();
int size = get_num_core();
int data[100];
int result;
int results[size];
if (rank == 0) { //master-core only
fill_with_stuff(data);
}
scatter(0, data); //core-0 will send the data-content to all other cores
result = work(rank, data); // every core works on the same data
gather(0,result,results); //get all local results and store them in
//the results-array of core-0
Another solution is the OpenMP library.
Here you declare parallel blocks; the whole thread part is handled by the library itself.
Example:
// this will split the for-loop automatically across 4 threads
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 100; i++) {
    work(i);
}
The big advantage is that it is fast to write; that's it.
You may get better performance by writing the threads on your own,
but that takes a lot more time and knowledge about synchronization.

OpenMP: for loop with changing number of iterations

I would like to use OpenMP to make my program run faster. Unfortunately, the opposite is the case. My code looks something like this:
const int max_iterations = 10000;
int num_interation = std::numeric_limits<int>::max();
#pragma omp parallel for
for(int i = 0; i < std::min(num_interation, max_iterations); i++)
{
    // do sth.
    // update the number of required iterations
    // num_interation can only become smaller over time
    num_interation = update_iterations(...);
}
For some reason, many more iterations are processed than required. Without OpenMP, it takes 500 iterations on average. However, even when setting the number of threads to one (set_num_threads(1)), it computes more than one thousand iterations. The same happens if I use multiple threads, and also when using a write lock when updating num_interation.
I would assume that it has something to do with memory bandwidth or a race condition. But those problems should not appear in the case of set_num_threads(1).
Therefore, I assume that it could have something to do with the scheduling and the chunk size. However, I am really not sure about this.
Can somebody give me a hint?
A quick answer for the behaviour you experience is given by the OpenMP standard page 56:
The iteration count for each associated loop is computed before entry
to the outermost loop. If execution of any associated loop changes any
of the values used to compute any of the iteration counts, then the
behavior is unspecified.
In essence, this means that you cannot modify the bounds of your loop once you have entered it. Although according to the standard the behaviour is "unspecified", in your case what happens is quite clear: as soon as you switch OpenMP on for your code, you compute the number of iterations you had specified initially.
So you have to take another approach to this problem.
This is one possible solution (among many others), which I hope scales OK. It has the drawback of potentially allowing more iterations to happen than the number you intended (up to OMP_NUM_THREADS-1 more iterations than expected, assuming that //do sth. is balanced, and many more if not). Also, it assumes that update_iterations(...) is thread safe and can be called in parallel without unwanted side effects. This is a very strong assumption, which you had better enforce!
num_interation = std::min(num_interation, max_iterations);
#pragma omp parallel
{
    int i = omp_get_thread_num();
    const int nbth = omp_get_num_threads();
    while ( i < num_interation ) {
        // do sth.
        // update the number of required iterations
        // num_interation can only become smaller over time
        int new_num_interation = update_iterations(...);
        #pragma omp critical
        num_interation = std::min(num_interation, new_num_interation);
        i += nbth;
    }
}
A more synchronised solution, if the //do sth. isn't so balanced and not doing too many extra iterations is important, could be:
num_interation = std::min(num_interation, max_iterations);
int nb_it_done = 0;
#pragma omp parallel
{
    int i = omp_get_thread_num();
    const int nbth = omp_get_num_threads();
    while ( nb_it_done < num_interation ) {
        // do sth.
        // update the number of required iterations
        // num_interation can only become smaller over time
        int new_num_interation = update_iterations(i);
        #pragma omp critical
        num_interation = std::min(num_interation, new_num_interation);
        i += nbth;
        #pragma omp single
        nb_it_done += nbth;
    }
}
Another open point is that, since you didn't show what i is used for, it isn't clear whether iterating through the domain in a somewhat random order is a problem. If it isn't, the first solution should work well, even for unbalanced //do sth. But if it is a problem, then you had better stick with the second solution (and potentially even tighten the synchronisation).
But at the end of the day, there is no way (that I can think of, and with decent parallelism) to avoid potential extra work being done, since the number of iterations can change along the way.

How can i measure the overhead due to task migration/load balancing on linux with the real time patch?

I am trying to measure the overhead due to task migration; by overhead I mean the latency involved in such an activity. I know there are separate run queues for each core, and that the kernel periodically checks the run queues for an imbalance and wakes up a kernel thread (perhaps a higher-priority one) that does the migration.
Could anyone provide me with pointers to the kernel source code where I can insert time stamps to measure this value?
Is there any other performance metric I could investigate to get at such an overhead?
I remember an earlier post that discussed this topic, where someone also posted some code showing how to get the system overhead.
I see you want to add code to insert time stamps; do you think that is feasible, given that task scheduling is so frequent? I think you can follow the earlier post.
I saved the source code from that post; thanks to the author!
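#include <stdio.h>

// Counters from the previous call; the first call reports utilization since boot.
static unsigned long long lastTotalUser, lastTotalUserLow, lastTotalSys, lastTotalIdle;

// Returns overall CPU utilization in percent since the previous call, or -1.0 on error.
double getCurrentValue() {
    double percent;
    FILE* file;
    unsigned long long totalUser, totalUserLow, totalSys, totalIdle, total;
    file = fopen("/proc/stat", "r");
    if (file == NULL) {
        return -1.0;
    }
    // First line of /proc/stat: cpu user nice system idle ...
    int fields = fscanf(file, "cpu %llu %llu %llu %llu", &totalUser, &totalUserLow,
                        &totalSys, &totalIdle);
    fclose(file);
    if (fields != 4) {
        return -1.0;
    }
    if (totalUser < lastTotalUser || totalUserLow < lastTotalUserLow ||
        totalSys < lastTotalSys || totalIdle < lastTotalIdle) {
        // Overflow detection. Just skip this value.
        percent = -1.0;
    }
    else {
        total = (totalUser - lastTotalUser) + (totalUserLow - lastTotalUserLow) +
                (totalSys - lastTotalSys);
        percent = total;
        total += (totalIdle - lastTotalIdle);
        percent /= total;
        percent *= 100;
    }
    lastTotalUser = totalUser;
    lastTotalUserLow = totalUserLow;
    lastTotalSys = totalSys;
    lastTotalIdle = totalIdle;
    return percent;
}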

How can I make this prime finder operate in parallel

I know prime finding is well studied, and there are a lot of different implementations. My question is, using the provided method (code sample), how can I go about breaking up the work? The machine it will be running on has 4 quad core hyperthreaded processors and 16GB of ram. I realize that there are some improvements that could be made, particularly in the IsPrime method. I also know that problems will occur once the list has more than int.MaxValue items in it. I don't care about any of those improvements. The only thing I care about is how to break up the work.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Prime
{
    class Program
    {
        static List<ulong> primes = new List<ulong>() { 2 };
        static void Main(string[] args)
        {
            ulong reportValue = 10;
            for (ulong possible = 3; possible <= ulong.MaxValue; possible += 2)
            {
                if (possible > reportValue)
                {
                    Console.WriteLine(String.Format("\nThere are {0} primes less than {1}.", primes.Count, reportValue));
                    try
                    {
                        checked
                        {
                            reportValue *= 10;
                        }
                    }
                    catch (OverflowException)
                    {
                        reportValue = ulong.MaxValue;
                    }
                }
                if (IsPrime(possible))
                {
                    primes.Add(possible);
                    Console.Write("\r" + possible);
                }
            }
            Console.WriteLine(primes[primes.Count - 1]);
            Console.ReadLine();
        }
        static bool IsPrime(ulong value)
        {
            foreach (ulong prime in primes)
            {
                if (value % prime == 0) return false;
                if (prime * prime > value) break;
            }
            return true;
        }
    }
}
There are 2 basic schemes I see: 1) using all threads to test a single number, which is probably great for higher primes but I cannot really think of how to implement it, or 2) using each thread to test a single possible prime, which can cause a non-continuous string of primes to be found and run into unused resources problems when the next number to be tested is greater than the square of the highest prime found.
To me it feels like both of these situations are challenging only in the early stages of building the list of primes, but I'm not entirely sure. This is being done for a personal exercise in breaking this kind of work.
If you want, you can parallelize both operations: the checking of a single prime, and the checking of multiple primes at once. I'm not sure this would help, though. To be honest, I'd consider removing the threading in main().
I've tried to stay faithful to your algorithm, but to speed it up a lot I've used x*x instead of reportValue; this is something you could easily revert if you wish.
To further improve on my core splitting, you could work out the number of computations required to perform the divisions based on the size of the numbers, and split the list that way (smaller numbers take less time to divide by, so make the first partitions larger).
Also, my concept of a thread pool may not exist in the form I want to use it.
Here's my go at it (pseudo-ish code):
List<int> primes = {2};
List<int> nextPrimes = {};
int cores = 4;
main()
{
    for (int x = 3; x < MAX; x=x*x){
        int localmax = x*x;
        for(int y = x; y < localmax; y+=2){
            thread{primecheck(y);}
        }
        "wait for all threads to be executed"
        primes.add(nextPrimes);
        nextPrimes = {};
    }
}
void primecheck(int y)
{
    bool primality = true; // assume prime until some partition finds a divisor
    threadpool? pool;
    for(int x = 0; x < cores; x++){
        pool.add(thread{
            if (!smallcheck(x*primes.length/cores,(x+1)*primes.length/cores ,y)){
                primality = false;
                pool.kill();
            }
        });
    }
    "wait for all threads to be executed or killed"
    if (primality)
        nextPrimes.add(y);
}
bool smallcheck(int a, int b, int y){
    foreach (int div in primes[a to b])
        if (y%div == 0)
            return false;
    return true;
}
E: I added what I think pooling should look like, look at revision if you want to see it without.
Use the sieve of Eratosthenes instead. It's not worthwhile to parallelize unless you use a good algorithm in the first place.
Separate the space to sieve into large regions and sieve each in its own thread. Or better use some workqueue concept for large regions.
Use a bit array to represent the prime numbers, it takes less space than representing them explicitly.
See also this answer for a good implementation of a sieve (in Java, no split into regions).
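The region-based sieve with a bit array described above could be sketched in C along these lines. This is an illustration under stated assumptions rather than a tuned implementation: SEGMENT_SIZE is an arbitrary choice, the function names are invented for the example, and the OpenMP pragma stands in for "one thread per region or a work queue" (it is ignored if OpenMP is not enabled, and the code then runs serially).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SEGMENT_SIZE 32768u   /* numbers per region; an arbitrary choice */

/* One bit per number in the region: a set bit means "composite". */
static void set_bit(uint8_t *bits, uint32_t i) { bits[i >> 3] |= (uint8_t)(1u << (i & 7)); }
static int get_bit(const uint8_t *bits, uint32_t i) { return (bits[i >> 3] >> (i & 7)) & 1; }

/* Counts primes in [lo, hi), given every prime p with p*p < hi in base[]. */
static uint32_t sieve_region(uint32_t lo, uint32_t hi,
                             const uint32_t *base, uint32_t nbase)
{
    uint32_t len = hi - lo, count = 0;
    uint8_t *bits = calloc((len + 7) / 8, 1);
    assert(bits != NULL);
    for (uint32_t k = 0; k < nbase; k++) {
        uint64_t p = base[k];
        if (p * p >= hi)
            break;
        /* First multiple of p inside the region, but never p itself. */
        uint64_t start = ((lo + p - 1) / p) * p;
        if (start < p * p)
            start = p * p;
        for (uint64_t m = start; m < hi; m += p)
            set_bit(bits, (uint32_t)(m - lo));
    }
    for (uint32_t i = 0; i < len; i++)
        if (lo + i >= 2 && !get_bit(bits, i))
            count++;
    free(bits);
    return count;
}

/* Counts the primes strictly below limit. */
static uint32_t count_primes_below(uint32_t limit)
{
    if (limit < 3)
        return 0;
    /* root = largest r with r*r < limit; base primes are the primes <= root. */
    uint32_t root = 1;
    while ((uint64_t)(root + 1) * (root + 1) < limit)
        root++;
    uint8_t *small = calloc(root + 1, 1);
    uint32_t *base = malloc((root + 1) * sizeof *base);
    assert(small != NULL && base != NULL);
    uint32_t nbase = 0;
    for (uint32_t i = 2; i <= root; i++) {
        if (small[i])
            continue;
        base[nbase++] = i;
        for (uint32_t m = i * i; m <= root; m += i)
            small[m] = 1;
    }
    uint32_t total = 0;
    long nseg = (long)(((uint64_t)limit + SEGMENT_SIZE - 1) / SEGMENT_SIZE);
    /* Regions are independent, so this loop is where the work is split. */
    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (long s = 0; s < nseg; s++) {
        uint64_t lo = (uint64_t)s * SEGMENT_SIZE;
        uint64_t hi = lo + SEGMENT_SIZE;
        if (hi > limit)
            hi = limit;
        total += sieve_region((uint32_t)lo, (uint32_t)hi, base, nbase);
    }
    free(small);
    free(base);
    return total;
}
```

Only the small list of base primes is shared between regions, and it is read-only once built, so the regions need no locking.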
