I do not want to use a busy-sleep on a high-resolution clock, because busy-waiting consumes CPU.
I also do not want to depend on wake-ups from a condition variable (which can be spurious, or driven by some event), since that does not guarantee periodicity.
Please suggest whether something along these lines is possible; I can't find any solution online. I need to implement an eviction policy in a thread that tries to wake up every x seconds and cleans up a queue if it grows beyond a particular size.
What operating system are you running? Because on Linux with GCC 5.4 at least, the std::this_thread::sleep_for function does not busy-wait. As an example:
$ cat test.cpp
#include <chrono>
#include <cstdio>
#include <thread>

int main(int argc, char** argv)
{
    printf("Start\n");
    std::this_thread::sleep_for(std::chrono::seconds(5));
    printf("Stop\n");
    return 0;
}
$ g++ -std=c++14 -o test test.cpp
$ time ./test
Start
Stop
real 0m5.003s
user 0m0.002s
sys 0m0.001s
See how the program took 5 seconds of real time, but almost no user (CPU) time?
As to your second point, no consumer operating system will be able to guarantee periodicity, since it is a time-sharing operating system and the underlying scheduler does not guarantee a maximum wait-time after a process becomes runnable. For that you would need to look at a real-time operating system. However, for most applications you'll probably be fine with the jitter introduced by the scheduler when using condition_variable waits.
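If you do end up using a condition_variable, you can still keep the period fixed by waiting on an absolute deadline and advancing it by one period each cycle: a spurious wakeup simply re-enters the wait, and the schedule does not drift. Here is a minimal sketch of the eviction thread described in the question; the Evictor class name, the queue type, and the constants are illustrative, not a standard API:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Sketch of a periodic eviction worker. Every `period` it trims `q`
// down to `max_size`. cv_.wait_until() sleeps without burning CPU;
// a spurious wakeup simply re-enters the wait until `next_tick`.
class Evictor {
public:
    Evictor(std::deque<int>& q, std::size_t max_size,
            std::chrono::milliseconds period)
        : q_(q), max_size_(max_size), period_(period),
          worker_([this] { run(); }) {}

    ~Evictor() {
        {
            std::lock_guard<std::mutex> lk(m_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        auto next_tick = std::chrono::steady_clock::now() + period_;
        while (!stop_) {
            // Absolute deadline => fixed rate, no drift from early wakeups.
            cv_.wait_until(lk, next_tick, [this] { return stop_; });
            if (stop_) break;
            while (q_.size() > max_size_)
                q_.pop_front();
            next_tick += period_;
        }
    }

    std::deque<int>& q_;      // producers must push under the same mutex
    std::size_t max_size_;
    std::chrono::milliseconds period_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread worker_;      // declared last: starts after members are ready
};
```

Note the design choice: waiting until `next_tick` (an absolute time) rather than waiting *for* a duration means an early or spurious wakeup does not shorten or lengthen the cycle.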
Desired behaviour: run a multi-threaded Linux program on a set of cores which have been isolated using isolcpus.
Here's a small program we can use as an example multi-threaded program:
#include <stdio.h>
#include <pthread.h>
#include <err.h>
#include <unistd.h>
#include <stdlib.h>

#define NTHR 16
#define TIME 60 * 5

void *
do_stuff(void *arg)
{
    int i = 0;

    (void) arg;

    while (1) {
        i += i;
        usleep(10000); /* don't dominate CPU */
    }
}
int
main(void)
{
    pthread_t threads[NTHR];
    int rv, i;

    for (i = 0; i < NTHR; i++) {
        rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
        if (rv) {
            perror("pthread_create");
            return (EXIT_FAILURE);
        }
    }

    sleep(TIME);
    exit(EXIT_SUCCESS);
}
If I compile and run this on a kernel with no isolated CPUs, then the threads are spread out over my 4 CPUs. Good!
Now if I add isolcpus=2,3 to the kernel command line and reboot:
Running the program without taskset distributes threads over cores 0 and 1. This is expected as the default affinity mask now excludes cores 2 and 3.
Running with taskset -c 0,1 has the same effect. Good.
Running with taskset -c 2,3 causes all threads to go onto the same core (either core 2 or 3). This is undesired. Threads should distribute over cores 2 and 3. Right?
This post describes a similar issue (although the example given is farther away from the pthreads API). The OP was happy to workaround this by using a different scheduler. I'm not certain this is ideal for my use-case however.
Is there a way to have the threads distributed over the isolated cores using the default scheduler?
Is this a kernel bug which I should report?
EDIT:
The right thing does indeed happen if you use a real-time scheduler like the fifo scheduler. See man sched and man chrt for details.
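A thread can also request SCHED_FIFO programmatically rather than via chrt. A minimal sketch using pthread_setschedparam; the request_fifo name and the priority value are illustrative, and note that without root or CAP_SYS_NICE the call typically fails with EPERM:

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Ask for SCHED_FIFO for the calling thread. Returns 0 on success,
 * or the pthreads error code; without root/CAP_SYS_NICE this is
 * typically EPERM. */
int request_fifo(int priority)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```

The shell equivalent is running the program under chrt with a FIFO priority, as the man pages referenced above describe.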
From the Linux Kernel Parameter Doc:
This option can be used to specify one or more CPUs to isolate from
the general SMP balancing and scheduling algorithms.
So this option effectively prevents the scheduler from migrating threads from one core to another, less contended core (SMP balancing). Typically, isolcpus is used together with pthread affinity control to pin threads, with knowledge of the CPU layout, to gain predictable performance.
https://www.kernel.org/doc/Documentation/kernel-parameters.txt
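That pinning can be sketched with glibc's non-portable pthread_setaffinity_np; the pin_to_cpu helper name is mine:

```c
#define _GNU_SOURCE /* pthread_setaffinity_np and CPU_SET are GNU extensions */
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU. Returns 0 on success. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Each worker thread would call this once at startup with a core number taken from the isolated set.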
--Edit--
Ok, I see why you are confused. Personally, I would also expect consistent behaviour from this option. The problem lies in two functions, select_task_rq_fair and select_task_rq_rt, which are responsible for selecting a new run queue (essentially, selecting which CPU to run on next). I did a quick trace (SystemTap) of both functions: for CFS it always returned the same first core in the mask; for RT, it returned other cores. I haven't had a chance to look into the logic of each selection algorithm, but you could send an email to the maintainer on the Linux kernel development mailing list for a fix.
I have 4 executables that perform some very complex tasks. Each of these programs alone might take nearly 100% of a single core of a quad-core CPU, thus consuming almost 25% of total CPU power. Since all of these programs use hardware resources that can't be shared between multiple processes, I wish to run a single executable that spawns 3 child processes which, in turn, occupy the other three cores. I'm on Linux and I'm using C++11. Most of the complex code runs in its own class, and the hardest part runs in a function that I usually call Process(), so I have 4 objects, each with its own Process() that, when running, takes 100% of a single core.
I tried using OpenMP, but I don't think it's the best solution as I have no control over CPU affinity. Using std::thread is also not a good idea, because threads inherit the main process's CPU affinity mask. On Linux I think I can do this with fork(), but I have no idea how the whole structure would look.
This might be related to my other question that was partly left unanswered, maybe because I was trying the wrong approach that works in some cases but not in my case.
An example of pseudo-code could be this:
int main()
{
    // ...init everything...

    // This alone takes 100% of a single core
    float out1 = object1->Process();

    // This should be spawned as a child process running on another core
    float out2 = object2->Process();

    // on another core...
    float out3 ...

    // yet another core...
    float out4 ...

    // This should still run in the parent process
    float total_output = out1 + out2 + out3 + out4;
}
You can use std::thread, which is a front-end to pthread_create(). Then set the thread's affinity with sched_setaffinity(), called from within the thread itself. As you asked, here is a working stub:
#include <sched.h>
#include <cstdlib>
#include <list>
#include <memory>
#include <thread>

void thread_func(int cpu_index) {
    cpu_set_t cpuSet;
    CPU_ZERO(&cpuSet);
    CPU_SET(cpu_index, &cpuSet);
    sched_setaffinity(0, sizeof(cpu_set_t), &cpuSet); // 0 == calling thread

    /* the rest of the thread body here */
}

using namespace std;

int main(int argc, char **argv) {
    if (argc != 2) exit(1);
    int n_cpus = atoi(argv[1]);

    list<shared_ptr<thread>> lot;
    for (int i = 0; i < n_cpus; ++i) {
        lot.push_back(make_shared<thread>(thread_func, i));
    }
    for (auto tptr = lot.begin(); tptr != lot.end(); ++tptr) {
        (*tptr)->join();
    }
}
Note that for optimal behaviour it's important that each thread initialises its memory (that is, constructs its objects) in the thread body. This matters if you want your code to perform well on multi-processor NUMA systems, because memory pages are allocated on memory close to the CPU that first uses them.
For example, you can have a look at this blog.
However, this is not an issue in your specific case, since you are dealing with a single-processor system, or more specifically a system with just one NUMA node (many current AMD processors do contain two NUMA nodes, even within a single physical package), and all the memory banks are attached there.
The final effect of using sched_setaffinity() in this context will just be to pin each thread down to a specific core.
You don't have to program anything. The command taskset changes the CPU affinity of a currently running process or creates and sets it for a new process.
Running a single executable that spawns other programs is no different than executing the programs directly, except for the common initialization implied by the comments in your stub.
taskset 1 program_for_cpu_0 arg1 arg2 arg3...
taskset 2 program_for_cpu_1 arg1 arg2 arg3...
taskset 4 program_for_cpu_2 arg1 arg2 arg3...
taskset 8 program_for_cpu_3 arg1 arg2 arg3...
I am suspicious of setting CPU affinities. I have yet to find an actual use for doing so (other than satisfying some inner need for control).
Unless the CPUs are not identical in some way, there should be no need to restrict a process to a particular CPU.
The Linux kernel normally keeps a process on the same CPU unless it enters an extended wait for i/o, a semaphore, etc.
Switching a process from one CPU to another does not incur any particular overhead except in NUMA configurations with local memory per CPU. AFAIK, ARM implementations do not do that.
If a process exhibits non-CPU-bound behavior, giving the kernel scheduler the flexibility to reassign it to a newly available CPU should improve system performance; a process bound to one CPU cannot take advantage of other CPUs becoming free.
A friend and I disagree on how synchronization is handled at userspace level (in the pthread library).
a. I think that during a pthread_mutex_lock, the thread actively waits. That is, the Linux scheduler runs this thread and lets it execute code that looks something like:
while (mutex_resource->locked);
Then another thread is scheduled, which potentially frees the locked field, and so on.
So this means the scheduler waits for the thread to use up its timeslice before switching to the next one, no matter what the thread is doing.
b. My friend thinks that the waiting thread somehow tells the kernel "Hey, I'm asleep, don't wait for me at all".
In this case, the kernel would schedule the next thread right away, without waiting for the current thread to use up its timeslice, being aware that this thread is sleeping.
From what I see in the pthread code, there seems to be a loop handling the lock. But maybe I missed something.
In embedded systems, it could make sense to prevent the kernel from waiting. So he may be right (but I hope he is not :D).
Thanks!
a. I think that during a pthread_mutex_lock, the thread actively waits.
Yes, glibc's NPTL pthread_mutex_lock does have an active wait (spinning),
BUT the spinning is used only for a very short time and only for some mutex types. After that, pthread_mutex_lock goes to sleep by calling the Linux futex syscall with the WAIT operation.
Only mutexes of type PTHREAD_MUTEX_ADAPTIVE_NP will spin; the default, PTHREAD_MUTEX_TIMED_NP (a normal mutex), does not spin. (Check MAX_ADAPTIVE_COUNT in the __pthread_mutex_lock sources.)
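For completeness, the adaptive (briefly spinning) type is requested through a mutex attribute. A minimal sketch, with the helper name mine; PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable glibc extension, so the sketch falls back to the default type where it is unavailable:

```c
#define _GNU_SOURCE /* PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension */
#include <pthread.h>

/* Build a mutex of the adaptive type, which spins briefly before
 * sleeping in futex(WAIT). Falls back to the default type if the
 * extension is unavailable. Returns 0 on success. */
int make_adaptive_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rv = pthread_mutexattr_init(&attr);
    if (rv != 0)
        return rv;
#ifdef PTHREAD_MUTEX_ADAPTIVE_NP
    rv = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
#endif
    if (rv == 0)
        rv = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rv;
}
```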
If you want infinite spinning (active waiting), use the pthread_spin_lock function with pthread_spinlock_t-typed locks.
I'll answer the rest of your question as if you were using pthread_spin_lock:
Then another thread is scheduled, which potentially frees the locked field, etc. So this means the scheduler waits for the thread to use up its timeslice before switching to the next one, no matter what the thread is doing.
Yes, if there is contention for CPU cores, your actively spinning thread may prevent another thread from executing, even if that other thread is the one that will unlock the mutex (spinlock) your thread is waiting for.
But if there is no contention (no thread oversubscription), and the threads are scheduled on different cores (by coincidence, or by manually setting CPU affinity with sched_setaffinity or pthread_setaffinity_np), spinning will let you proceed faster than using the OS-based futex.
b. My friend thinks that the waiting thread somehow tells the kernel "Hey, I'm asleep, don't wait for me at all". In this case, the kernel would schedule the next thread right away, without waiting for the current thread to complete...
Yes, he is right.
futex is the modern way to tell the OS that this thread is waiting for some value in memory (for some mutex to open); in the current implementation, futex also puts the thread to sleep. There is no need to wake it up to spin if the kernel knows when to wake the thread. How does it know? The lock owner, when calling pthread_mutex_unlock, checks whether any other threads are sleeping on this mutex. If there are, the lock owner calls futex with FUTEX_WAKE, telling the OS to wake some thread registered as a sleeper on this mutex.
There is no need to spin if the thread registers itself with the OS as a waiter.
Some debugging with gdb of this test program:
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;

void* thr_func(void *arg)
{
    pthread_mutex_lock(&x); /* blocks forever: main never unlocks x */
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t thr;
    pthread_mutex_lock(&x);
    pthread_create(&thr, NULL, thr_func, NULL);
    pthread_join(thr, NULL);
    return 0;
}
shows that a call to pthread_mutex_lock on a locked mutex results in a call to the futex system call with the op parameter set to FUTEX_WAIT (http://man7.org/linux/man-pages/man2/futex.2.html)
And this is the description of FUTEX_WAIT:
FUTEX_WAIT
This operation atomically verifies that the futex address uaddr still contains the value val, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents describe the maximum duration of the wait, which is otherwise infinite. The arguments uaddr2 and val3 are ignored.
So from this description I can say that if a mutex is locked, a thread will sleep rather than actively wait, and it will sleep until futex is called with op equal to FUTEX_WAKE.
Is there any mechanism through which I can wake up a thread in another process without going through the kernel? The waiting thread might spin in a loop, no problem (each thread is pegged to a separate core), but in my case the sending thread has to be quick, and can't afford to go through the kernel to wake up the waiting thread.
No, not if the other thread is sleeping (not on a CPU). To wake such a thread you need to change its state to "RUNNING" by calling the scheduler, which is part of the kernel.
Yes, you can synchronize two threads or processes if both are running on different CPUs and there is shared memory between them. You should bind all threads to different CPUs. Then you may use a spinlock: the pthread_spin_lock and pthread_spin_unlock functions from the optional part of POSIX threads ("(ADVANCED REALTIME THREADS)"; [THR SPI]), or any custom spinlock. A custom spinlock will most likely use atomic operations and/or memory barriers.
The sending thread changes a value in memory, which the receiving thread checks in a loop.
E.g.
init:

pthread_spinlock_t lock;
pthread_spin_init(&lock, PTHREAD_PROCESS_SHARED); // for two processes, the lock
                                                  // must live in shared memory
pthread_spin_lock(&lock); // close the "mutex"
then start threads.
waiting thread:
{
    pthread_spin_lock(&lock); // wait for event
    work();
}
main thread:
{
    do_smth();
    pthread_spin_unlock(&lock); // open the mutex; the other thread will see this
                                // change in ~150 CPU ticks (checked on Pentium 4
                                // and Intel Core 2 single-socket systems); the
                                // unlock itself takes about the same time, though
                                // I didn't measure it exactly.
    continue_work();
}
To signal to another process that it should continue, without forcing the sender to spend time in a kernel call, one mechanism comes to mind right away. Without kernel calls, all a process can do is modify memory; so the solution is inter-process shared memory. Once the sender writes to shared memory, the receiver should see the change without any explicit kernel calls, and naive polling by the receiver should work fine.
One cheap (but maybe not cheap enough) alternative is delegating the send to a helper thread in the same process, and having the helper thread make a proper inter-process "semaphore release" or pipe write call.
I understand that you want to avoid using the kernel in order to avoid kernel-related overheads. Most such overhead is context-switch related. Here is a demonstration of one way to accomplish what you need using signals, without spinning and without context switches:
#include <signal.h>
#include <unistd.h>
#include <pthread.h>
#include <iostream>
#include <thread>

using namespace std;

void sigRtHandler(int sig) {
    // Note: iostream is not async-signal-safe; fine for a demo only.
    cout << "Received signal" << endl;
}

int main() {
    constexpr static int kIter = 100000;
    thread t([]() {
        signal(SIGRTMIN, sigRtHandler);
        for (int i = 0; i < kIter; ++i) {
            usleep(1000);
        }
        cout << "Done" << endl;
    });
    usleep(1000);  // Give the child time to set up the signal handler.
    auto handle = t.native_handle();
    for (int i = 0; i < kIter; ++i)
        pthread_kill(handle, SIGRTMIN);
    t.join();
    return 0;
}
If you run this code, you'll see that the child thread keeps receiving SIGRTMIN. While the process is running, if you look at the files /proc/(PID)/task/*/status for this process, you'll see that the parent thread does not incur context switches from calling pthread_kill().
The advantage of this approach is that the waiting thread doesn't need to spin. If the waiting thread's job is not time-sensitive, this approach allows you to save CPU.
A prof once told us in class that Windows, Linux, OS X and UNIX scale on threads and not processes, so threads would likely benefit your application even on a single processor because your application would be getting more time on the CPU.
I tried with the following code on my machine (which only has one CPU).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

pthread_t xs[10];

void *nop(void *ptr) {
    unsigned long n = 1UL << 30UL;
    while (n--);
    return NULL;
}

void test_one() {
    size_t len = (sizeof xs) / (sizeof *xs);
    while (len--)
        if (pthread_create(xs + len, NULL, nop, NULL))
            exit(EXIT_FAILURE);
    len = (sizeof xs) / (sizeof *xs);
    while (len--)
        if (pthread_join(xs[len], NULL))
            exit(EXIT_FAILURE);
}

void test_two() {
    size_t len = (sizeof xs) / (sizeof *xs);
    while (len--) nop(NULL);
}

int main(int argc, char *argv[]) {
    test_one();
    // test_two();
    printf("done\n");
    return 0;
}
Both tests were identical in terms of speed.
real 0m49.783s
user 0m48.023s
sys 0m0.224s
real 0m49.792s
user 0m49.275s
sys 0m0.192s
This made me think, "Wow, threads suck". But repeating the test on a university server with four processors close to quadrupled the speed.
real 0m7.800s
user 0m30.170s
sys 0m0.006s
real 0m30.190s
user 0m30.165s
sys 0m0.004s
Am I overlooking something when interpreting the results from my home machine?
To understand what happens in the bowels of tasks/threads, let's look at this toy kernel code:
struct regs {
    int eax, ebx, ecx, edx, es, ds, gs, fs, cs, ip, flags;
    struct tss *task_sel;
};

struct thread {
    struct regs *regs;
    int parent_id;
    struct thread *next;
};

struct task {
    struct regs *regs;
    int *phys_mem_begin;
    int *phys_mem_end;
    int *filehandles;
    int priority;
    int *num_threads;
    int quantum;
    int duration;
    int start_time, end_time;
    int parent_id;
    struct thread *task_thread;
    /* ... */
    struct task *next;
};
Imagine the kernel allocates memory for that task structure, which forms a linked list. Look closely at the quantum field: that is the timeslice of processor time, based on the priority field. There will always be a task with id 0, which never sleeps but just idles, perhaps issuing nops (No OPerationS). The scheduler spins around ad nauseam until infinity (that is, until the power gets unplugged). If the quantum field says the task runs for 20ms, the kernel sets start_time and end_time to now + 20ms; when that end_time is up, the kernel saves the state of the CPU registers into the regs pointer, goes on to the next task in the chain, loads the CPU registers from that task's regs pointer, and jumps to the saved instruction, setting its quantum and duration. When the duration reaches zero, it goes on to the next task... effectively context switching. This is what gives the illusion of tasks running simultaneously on a single CPU.
Now look at the thread struct, which is a linked list of threads within that task structure. The kernel allocates threads for a given task, sets up the CPU state for each thread, and jumps into it. Now the kernel has to manage the threads as well as the tasks themselves, again context switching between a task and its threads.
Moving on to a multi-CPU machine: the kernel is set up to be scalable, and what the scheduler does is load one task onto one CPU and another onto another CPU (dual core), and both jump to where their instruction pointers are pointing. Now the kernel is genuinely running both tasks simultaneously, on both CPUs. Scale up to 4-way, same thing: additional tasks are loaded onto each CPU. Scale up again, to n-way... you get the drift.
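The quantum-driven rotation described above can be sketched as a tiny simulation. It only models the rotation and counts timeslices; the register save/restore is elided, and all names are illustrative:

```c
/* Toy round-robin: each runnable task gets at most its quantum, then
 * the scheduler "context-switches" to the next task in the list. */
struct task {
    int id;
    int quantum;    /* timeslice, e.g. in ms */
    int remaining;  /* total CPU work left, in the same units */
};

/* Run the simulation until every task finishes; returns the number of
 * timeslices (i.e. context switches) consumed. */
int run_scheduler(struct task *tasks, int n)
{
    int switches = 0, done = 0, i = 0;
    while (done < n) {
        struct task *t = &tasks[i];
        if (t->remaining > 0) {
            int slice = t->quantum < t->remaining ? t->quantum
                                                  : t->remaining;
            t->remaining -= slice;    /* the task "runs" for its slice */
            if (t->remaining == 0)
                done++;
            switches++;               /* save regs, pick the next task */
        }
        i = (i + 1) % n;              /* next task in the circular list */
    }
    return switches;
}
```

Two tasks needing 50ms of work each, with a 20ms quantum, finish after six slices: each runs 20 + 20 + 10 ms, interleaved.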
As you can see, this is why threads might not be perceived as scalable: the kernel has, quite frankly, a mammoth job keeping track of which CPU is running what, and, on top of that, which task is running which threads. That fundamentally explains why I think threads are not exactly scalable: threads consume a lot of resources.
If you really want to see what is happening, take a look at the source code for Linux, specifically the scheduler. No, forget about the 2.6.x kernel releases; look at the prehistoric version 0.99. The scheduler there is simpler to understand and easier to read. Sure, it's a bit old, but worth looking at; it will help you understand why (and hopefully support my answer about why) threads are not scalable, and it shows how the toy OS above uses time division based on processes. I have strived not to get into the technical aspects of modern-day CPUs, which can do more than just what I have described.
Hope this helps.
A prof once told us in class that Windows, Linux, OS X and UNIX scale on threads and not processes, so threads would likely benefit your application even on a single processor because your application would be getting more time on the CPU.
Not necessarily. If your app is the only CPU-intensive thing running, more threads won't magically make more CPU time available - all that will result is more CPU time wasted in context switches.
This made me think, "Wow, threads suck". But, repeating the test on a university server with four processors close to quadrupled the speed.
That's because with four threads, it can use all four processors.
I'm not sure exactly what you're asking, but here is an answer which may help.
Under Linux, processes and threads are essentially the same. The scheduler deals with things called "tasks", and it doesn't really care whether they share an address space or not. What is shared really depends on how they were created.
Whether to use threads or processes is a key design decision and should not be taken lightly, but the performance of the scheduler is probably not a factor (of course, things like IPC requirements will vary the design wildly).
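The "depends on how they were created" point can be demonstrated directly: a pthread shares the address space, so the parent sees its write, while a fork()ed child writes only to a copy-on-write copy. The helper names below are illustrative:

```c
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global both "tasks" will write to. */
static int value;

static void *set_value(void *arg)
{
    (void)arg;
    value = 1;
    return NULL;
}

int thread_sees_write(void)
{
    pthread_t t;
    value = 0;
    pthread_create(&t, NULL, set_value, NULL);
    pthread_join(t, NULL);
    return value;               /* 1: thread shares the address space */
}

int parent_sees_child_write(void)
{
    value = 0;
    pid_t pid = fork();
    if (pid == 0) {             /* child writes only to its own copy */
        value = 1;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return value;               /* 0: separate address spaces */
}
```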