All,
The code below comes from "Advanced Programming in the UNIX Environment"; it creates a new thread and prints the process ID and thread ID for the main and new threads.
In the book, it says that on Linux the output of this code would show that the two threads have different process IDs, because pthreads uses lightweight processes to emulate threads. But when I ran this code on Ubuntu 12.04 (kernel 3.2), it printed the same PID for both threads.
So, did a newer Linux kernel change the internal implementation of pthreads?
#include "apue.h"
#include <pthread.h>
pthread_t ntid;
void printids(const char *s) {
pid_t pid;
pthread_t tid;
pid = getpid();
tid = pthread_self();
printf("%s pid %u tid %u (0x%x)\n",
s, (unsigned int)pid, (unsigned int)tid, (unsigned int)tid);
}
void *thread_fn(void* arg) {
printids("new thread: ");
return (void *)0;
}
int main(void) {
int err;
err = pthread_create(&ntid, NULL, thread_fn, NULL);
if (err != 0)
err_quit("can't create thread: %s\n", strerror(err));
printids("main thread: ");
sleep(1);
return 0;
}
On Linux, pthreads uses the clone syscall with the special flag CLONE_THREAD.
See the documentation of the clone syscall:
CLONE_THREAD (since Linux 2.4.0-test8)
If CLONE_THREAD is set, the child is placed in the same thread group as the calling process. To make the remainder of the discussion of CLONE_THREAD more readable, the term "thread" is used to refer to the processes within a thread group.
Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller.
And in fact, Linux did change its thread implementation, since POSIX.1 requires that threads share the same process ID:
In the obsolete LinuxThreads implementation, each of the threads in a process
has a different process ID. This is in violation of the POSIX threads
specification, and is the source of many other nonconformances to the
standard; see pthreads(7).
Linux has had two implementations of pthreads: LinuxThreads and the Native POSIX Thread Library (NPTL), although the former is now obsolete. Kernels from 2.6 onward provide NPTL, which offers much closer conformance to SUSv3 and performs better, especially when there are many threads.
You can query which implementation of pthreads is in use from the shell with the command:
getconf GNU_LIBPTHREAD_VERSION
You can also find a more detailed discussion of the implementation differences in The Linux Programming Interface.
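If you want the same information from within a program, glibc also exposes it via confstr(); here is a minimal sketch (the printed string, e.g. "NPTL 2.31", depends on your glibc):
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[64];
    /* _CS_GNU_LIBPTHREAD_VERSION reports the same string as
     * `getconf GNU_LIBPTHREAD_VERSION`. */
    if (confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof buf) > 0)
        printf("%s\n", buf);   /* e.g. "NPTL 2.31" */
    return 0;
}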
Desired behaviour: run a multi-threaded Linux program on a set of cores which have been isolated using isolcpus.
Here's a small program we can use as an example multi-threaded program:
#include <stdio.h>
#include <pthread.h>
#include <err.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define NTHR 16
#define TIME (60 * 5)

void *
do_stuff(void *arg)
{
    int i = 0;

    (void) arg;

    while (1) {
        i += i;
        usleep(10000); /* don't dominate CPU */
    }
}

int
main(void)
{
    pthread_t threads[NTHR];
    int rv, i;

    for (i = 0; i < NTHR; i++) {
        rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
        /* pthread_create returns an error number rather than setting errno */
        if (rv)
            errx(EXIT_FAILURE, "pthread_create: %s", strerror(rv));
    }

    sleep(TIME);
    exit(EXIT_SUCCESS);
}
If I compile and run this on a kernel with no isolated CPUs, then the threads are spread out over my 4 CPUs. Good!
Now if I add isolcpus=2,3 to the kernel command line and reboot:
Running the program without taskset distributes threads over cores 0 and 1. This is expected as the default affinity mask now excludes cores 2 and 3.
Running with taskset -c 0,1 has the same effect. Good.
Running with taskset -c 2,3 causes all threads to go onto the same core (either core 2 or 3). This is undesired. Threads should distribute over cores 2 and 3. Right?
This post describes a similar issue (although the example given is further removed from the pthreads API). The OP was happy to work around this by using a different scheduler. I'm not certain this is ideal for my use case, however.
Is there a way to have the threads distributed over the isolated cores using the default scheduler?
Is this a kernel bug which I should report?
EDIT:
The right thing does indeed happen if you use a real-time scheduling policy such as SCHED_FIFO. See man sched and man chrt for details.
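For completeness, here is a minimal sketch of switching the calling thread to SCHED_FIFO from inside the program (my own addition, not from the original post; it needs root or a suitable RLIMIT_RTPRIO):
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Switch the calling thread to the SCHED_FIFO policy.
 * priority must lie between sched_get_priority_min/max for SCHED_FIFO. */
static int make_fifo(int priority)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}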
From the Linux Kernel Parameter Doc:
This option can be used to specify one or more CPUs to isolate from
the general SMP balancing and scheduling algorithms.
So this option effectively prevents the scheduler from migrating threads from one core to another, less contended core (SMP balancing). Typically, isolcpus is used together with pthread affinity control: threads are pinned explicitly, with knowledge of the CPU layout, to gain predictable performance (a minimal sketch of such pinning follows below).
https://www.kernel.org/doc/Documentation/kernel-parameters.txt
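Here is a minimal sketch of that explicit pinning (my own addition, not from the kernel docs); each worker thread would call it with the index of one of the isolated cores:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single CPU, e.g. one of the isolated cores. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
    return rc;
}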
--Edit--
Ok, I see why you are confused. Personally, I would also expect consistent behavior from this option. The problem lies in two functions, select_task_rq_fair and select_task_rq_rt, which are responsible for selecting the new run queue (essentially, which CPU to run on next). I did a quick trace (SystemTap) of both functions: for CFS it always returned the same first core in the mask; for RT it returned other cores as well. I haven't had a chance to look into the logic of each selection algorithm, but you can send an email to the maintainer on the Linux kernel mailing list to ask for a fix.
I have 4 executables that perform some very complex tasks; each of these programs alone might take nearly 100% of the power of a single core of a quad-core CPU, i.e. almost 25% of the total CPU power. Since all of these programs use hardware resources that can't be shared between multiple processes, I wish to run a single executable that spawns 3 child processes which, in turn, occupy the other three cores. I'm on Linux and I'm using C++11. Most of the complex code runs in its own class, and the hardest part runs in a function that I usually call Process(), so I have 4 objects, each with its own Process() that, when running, takes 100% of a single core.
I tried using OpenMP, but I don't think it's the best solution, as I have no control over CPU affinity. Using std::thread also doesn't seem like a good idea, because threads inherit the main process's CPU affinity. On Linux I think I can do this with fork(), but I have no idea how to structure the whole thing.
This might be related to my other question that was partly left unanswered, maybe because I was trying the wrong approach that works in some cases but not in my case.
An example of pseudo-code could be this:
int main()
{
    // ...init everything...

    // This alone takes 100% of a single core
    float out1 = object1->Process();

    // This should be spawned as a child process running on another core
    float out2 = object2->Process();

    // on another core...
    float out3 ...

    // yet another core...
    float out4 ...

    // This should still run in the parent process
    float total_output = out1 + out2 + out3 + out4;
}
You can use std::thread; on Linux it is a front end to pthread_create().
Then set the thread's affinity with sched_setaffinity(), called from within the thread itself.
As you asked, here is a working stub:
#include <sched.h>
#include <cstdlib>
#include <list>
#include <memory>
#include <thread>

void thread_func(int cpu_index) {
    cpu_set_t cpuSet;
    CPU_ZERO(&cpuSet);
    CPU_SET(cpu_index, &cpuSet);
    // pid 0 means "the calling thread" here, so each thread pins itself
    sched_setaffinity(0, sizeof(cpu_set_t), &cpuSet);
    /* the rest of the thread body here */
}

using namespace std;

int main(int argc, char **argv) {
    if (argc != 2) exit(1);
    int n_cpus = atoi(argv[1]);

    list< shared_ptr< thread > > lot;
    for (int i = 0; i < n_cpus; ++i) {
        lot.push_back( shared_ptr<thread>(new thread(thread_func, i)) );
    }
    for (auto tptr = lot.begin(); tptr != lot.end(); ++tptr) {
        (*tptr)->join();
    }
}
Note that for optimal behaviour it's important that each thread initialises its memory (that is, constructs its objects) in the thread body if you want your code to be optimized on multi-processor machines as well: on a NUMA system, memory pages are allocated in memory close to the CPU that uses them.
For example, you can have a look at this blog.
However, this is not an issue in your specific case, since you are dealing with a single-processor system, or more specifically a system with just one NUMA node (many current AMD processors do contain two NUMA nodes, even within a single physical package), and all the memory banks are attached there.
The final effect of using sched_setaffinity() in this context will be just to pin down each thread to a specific core.
You don't have to program anything. The command taskset changes the CPU affinity of a currently running process or creates and sets it for a new process.
Running a single executable that spawns other programs is no different than executing the programs directly, except for the common initialization implied by the comments in your stub.
taskset 1 program_for_cpu_0 arg1 arg2 arg3...
taskset 2 program_for_cpu_1 arg1 arg2 arg3...
taskset 4 program_for_cpu_2 arg1 arg2 arg3...
taskset 8 program_for_cpu_3 arg1 arg2 arg3...
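If you'd rather do the equivalent from your own launcher process, here is a rough sketch (my own, with placeholder program arguments): fork(), set the child's affinity, then exec the program.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Rough hand-rolled equivalent of `taskset <mask> prog args...`:
 * fork, pin the child to one CPU, then exec the program.
 * argv[] is a placeholder for your real executable and its arguments. */
static pid_t spawn_on_cpu(int cpu, char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {                     /* child */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof set, &set) == -1)
            perror("sched_setaffinity");
        execvp(argv[0], argv);
        perror("execvp");               /* only reached if exec failed */
        _exit(127);
    }
    return pid;                         /* parent: child pid, or -1 on error */
}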
I am suspicious of setting CPU affinities. I have yet to find an actual use for doing so (other than satisfying some inner need for control).
Unless the CPUs are not identical in some way, there should be no need to restrict a process to a particular CPU.
The Linux kernel normally keeps a process on the same CPU unless it enters an extended wait for i/o, a semaphore, etc.
Switching a process from one CPU to another does not incur any particular overhead except in NUMA configurations with local memory per CPU. AFAIK, ARM implementations do not do that.
If a process exhibits non-CPU-bound behavior, allowing the kernel scheduler the flexibility to reassign it to a now-available CPU should improve system performance. Processes bound to a CPU cannot take advantage of other CPUs that become available.
I have to implement a wrapper function that serves as pthread_self() to get a pthread ID, but I've been searching and haven't found which syscall does this. From another Stack Overflow post I know clone() is used to create threads, and that I can trace the syscalls with ptrace(), but before tracing it by hand... does someone know which syscall it is?
There are 3 different IDs for a Linux process thread: pid, pthread id, and tid.
The 'pid' is global: it is the ID of the containing process, and is easily obtained with getpid(). This value is unique, but only for the lifetime of the process it is assigned to; it may be 'recycled' for a new process after the old one terminates. It is the same across all threads within a process, and it is what you'll see in top, htop, 'ps -ef', and pidstat.
The 'pthread id' is reported by pthread_create() and pthread_self(). This value is unique only within the process, and only for the lifetime of the associated thread; it may be 'recycled' as threads are terminated and spawned. It is not unique across the system, nor across threads that have been terminated and started. It is NOT visible outside of the program, and it is opaque: depending on the platform it may be a pointer or a structure.
The 'tid' (thread ID) is reported by gettid(). This was introduced in Linux 2.4 and does not appear to be available on other platforms. This value is unique within the process and across the system. I'm not 100% certain, but I suspect it can be 'recycled' as processes are terminated and spawned. This is the value that appears in the Linux tools top, htop, 'pidstat -t', and 'ps -efL' when showing threads.
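A quick sketch printing all three side by side (my own example; it calls syscall(SYS_gettid) directly in case the glibc wrapper is missing, and prints the opaque pthread_t as an unsigned long, which is glibc-specific):
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *show_ids(void *arg)
{
    (void)arg;
    printf("pid=%ld  pthread_self=%lu  tid=%ld\n",
           (long)getpid(),
           (unsigned long)pthread_self(),
           (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t;
    show_ids(NULL);                            /* main thread: pid == tid */
    pthread_create(&t, NULL, show_ids, NULL);  /* same pid, different tid */
    pthread_join(t, NULL);
    return 0;
}
Running it should show the same pid on both lines, a distinct tid per thread, and pid == tid only in the main thread.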
Documentation for gettid: linux.die.net/gettid
You can obtain 'gettid()' through:
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
My CentOS 6.5 is not properly set up and is missing the gettid prototype, though the documentation says it should be present through the above #includes. Here is a macro that mimics gettid():
#ifndef gettid
// equivalent to: pid_t gettid(void)
#define gettid() syscall(SYS_gettid)
#endif
Be aware that since this is a syscall, you'll gain efficiency by caching the result rather than invoking the syscall repeatedly.
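For example, a minimal sketch of caching it in a thread-local variable (using the GCC/Clang __thread extension; my_gettid is a hypothetical helper name):
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Issue the gettid syscall only once per thread; return the cached value afterwards. */
static pid_t my_gettid(void)
{
    static __thread pid_t cached_tid;   /* zero-initialised, one copy per thread */
    if (cached_tid == 0)
        cached_tid = (pid_t)syscall(SYS_gettid);
    return cached_tid;
}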
How about syscall 0xe0, gettid()?
gettid() returns the caller's thread ID (TID). In a single-threaded process, the thread ID is equal to the process ID (PID, as returned by getpid(2)). In a multithreaded process, all threads have the same PID, but each one has a unique TID. For further details, see the discussion of CLONE_THREAD in clone(2).
In glibc, pthread_self() does not make a system call; it returns a pointer to the thread's struct pthread, which glibc keeps in the thread's TLS area (the thread control block).
This might be helpful:
pid_t tid = syscall(SYS_gettid);
Is there any mechanism through which I can wake up a thread in another process without going through the kernel? The waiting thread might spin in a loop, no problem (each thread is pegged to a separate core), but in my case the sending thread has to be quick, and can't afford to go through the kernel to wake up the waiting thread.
No, not if the other thread is sleeping (not on a CPU). To wake up such a thread you need to change its state to runnable by invoking the scheduler, which is part of the kernel.
Yes, you can synchronize two threads or processes without the kernel if both are running on different CPUs and there is shared memory between them. You should bind all threads to different CPUs. Then you may use a spinlock: the pthread_spin_lock and pthread_spin_unlock functions from the optional part of POSIX threads ('(ADVANCED REALTIME THREADS)'; [THR SPI]), or any custom spinlock. A custom spinlock will most likely use atomic operations and/or memory barriers.
The sending thread changes a value in memory, which the receiving thread checks in a loop.
E.g.
init:
pthread_spinlock_t lock;
pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
pthread_spin_lock(&lock); // take the lock so the waiter will block ("close the mutex")
then start threads.
waiting thread:
{
    pthread_spin_lock(&lock); // wait for event;
    work();
}
main thread:
{
    do_smth();
    pthread_spin_unlock(&lock); // open the mutex; other thread will see this change
    // in ~150 CPU ticks (checked on Pentium4 and Intel Core2 single socket systems);
    // time of the operation itself is of the same order; didn't measure it.
    continue_work();
}
To signal to another process that it should continue, without forcing the sender to spend time in a kernel call, one mechanism comes to mind right away. Without kernel calls, all a process can do is modify memory; so the solution is inter-process shared memory. Once the sender writes to shared memory, the receiver should see the change without any explicit kernel calls, and naive polling by the receiver should work fine.
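A minimal runnable sketch of that idea (my own example, using an anonymous shared mapping and fork(); unrelated processes would set up the shared memory with shm_open() instead):
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct { _Atomic int go; } shared_flag;

int main(void)
{
    /* An anonymous shared mapping survives fork(), so parent and child see
     * the same flag. */
    shared_flag *f = mmap(NULL, sizeof *f, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (f == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    atomic_store(&f->go, 0);

    if (fork() == 0) {                /* receiver: pinned to its own core in the real setup */
        while (!atomic_load_explicit(&f->go, memory_order_acquire))
            ;                         /* busy-poll; no kernel call */
        printf("receiver: woken up\n");
        return 0;
    }

    /* sender: the "wake-up" is just a store to shared memory, no kernel call */
    atomic_store_explicit(&f->go, 1, memory_order_release);
    wait(NULL);
    return 0;
}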
One cheap (but maybe not cheap enough) alternative is delegating the sending to a helper thread in the same process, and having the helper thread make a proper inter-process "semaphore release" or pipe write call.
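A rough sketch of that delegation (my own example; ipc_sem and the function names are placeholders, and a real helper would sleep or back off instead of spinning when idle):
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

static _Atomic unsigned pending;   /* events not yet forwarded to the kernel */
static sem_t *ipc_sem;             /* placeholder: a process-shared semaphore, e.g. from sem_open() */

/* Fast path: the latency-sensitive thread only does an atomic increment. */
void fast_notify(void)
{
    atomic_fetch_add_explicit(&pending, 1, memory_order_release);
}

/* Helper thread: turns pending events into real sem_post() calls,
 * paying the kernel-call cost on behalf of the sender. */
void *helper_thread(void *arg)
{
    (void)arg;
    for (;;) {
        unsigned n = atomic_exchange_explicit(&pending, 0, memory_order_acquire);
        while (n--)
            sem_post(ipc_sem);
        /* a real implementation would sleep or back off here when n was 0 */
    }
}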
I understand that you want to avoid using the kernel in order to avoid kernel-related overheads. Most of such overheads are context-switch related. Here is a demonstration of one way to accomplish what you need using signals without spinning, and without context switches:
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <iostream>
#include <thread>

using namespace std;

void sigRtHandler(int sig) {
  cout << "Received signal" << endl;  // (not async-signal-safe, but fine for a demo)
}

int main() {
  constexpr static int kIter = 100000;
  thread t([]() {
    signal(SIGRTMIN, sigRtHandler);
    for (int i = 0; i < kIter; ++i) {
      usleep(1000);
    }
    cout << "Done" << endl;
  });
  usleep(1000);  // Give the child time to set up the signal handler.
  auto handle = t.native_handle();
  for (int i = 0; i < kIter; ++i)
    pthread_kill(handle, SIGRTMIN);
  t.join();
  return 0;
}
If you run this code, you'll see that the child thread keeps receiving SIGRTMIN. While the process is running, if you look at the files /proc/(PID)/task/*/status for this process, you'll see that the parent thread does not incur context switches from calling pthread_kill().
The advantage of this approach is that the waiting thread doesn't need to spin. If the waiting thread's job is not time-sensitive, this approach allows you to save CPU.
A prof once told us in class that Windows, Linux, OS X and UNIX scale on threads and not processes, so threads would likely benefit your application even on a single processor because your application would be getting more time on the CPU.
I tried with the following code on my machine (which only has one CPU).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

pthread_t xs[10];

void *nop(void *ptr) {
    unsigned long n = 1UL << 30UL;
    while (n--);
    return NULL;
}

void test_one() {
    size_t len = (sizeof xs) / (sizeof *xs);
    while (len--)
        if (pthread_create(xs + len, NULL, nop, NULL))
            exit(EXIT_FAILURE);
    len = (sizeof xs) / (sizeof *xs);
    while (len--)
        if (pthread_join(xs[len], NULL))
            exit(EXIT_FAILURE);
}

void test_two() {
    size_t len = (sizeof xs) / (sizeof *xs);
    while (len--) nop(NULL);
}

int main(int argc, char *argv[]) {
    test_one();
    // test_two();
    printf("done\n");
    return 0;
}
Both tests were identical in terms of speed.
real 0m49.783s
user 0m48.023s
sys 0m0.224s
real 0m49.792s
user 0m49.275s
sys 0m0.192s
This made me think, "Wow, threads suck". But repeating the test on a university server with four processors nearly quadrupled the speed.
real 0m7.800s
user 0m30.170s
sys 0m0.006s
real 0m30.190s
user 0m30.165s
sys 0m0.004s
Am I overlooking something when interpreting the results from my home machine?
In order to understand what goes on in the bowels of tasks/threads, let's look at this toy kernel code:
struct regs {
    int eax, ebx, ecx, edx, es, ds, gs, fs, cs, ip, flags;
    struct tss *task_sel;
};

struct thread {
    struct regs *regs;
    int parent_id;
    struct thread *next;
};

struct task {
    struct regs *regs;
    int *phys_mem_begin;
    int *phys_mem_end;
    int *filehandles;
    int priority;
    int *num_threads;
    int quantum;
    int duration;
    int start_time, end_time;
    int parent_id;
    struct thread *task_thread;
    /* ... */
    struct task *next;
};
Imagine the kernel allocates memory for that task structure, which forms a linked list. Look closely at the quantum field: that is the timeslice of processor time, based on the priority field. There will always be a task with id 0, which never sleeps and just idles, perhaps issuing nops (No OPerationS). The scheduler spins around ad nauseam until infinity (that is, until the power gets unplugged). If the quantum field says the task runs for 20ms, the scheduler sets start_time and end_time to now + 20ms; when that end_time is up, the kernel saves the state of the CPU registers into the regs pointer, goes on to the next task in the chain, loads the CPU registers from its regs pointer, jumps to its instruction pointer, and sets its quantum and duration. When the duration reaches zero, it goes on to the next task... effectively context switching. This is what gives the illusion that tasks are running simultaneously on a single CPU.
Now look at the thread struct, which is a linked list of threads within that task structure. The kernel allocates threads for the task, sets up the CPU state for each thread, and jumps into it. Now the kernel has to manage the threads as well as the tasks themselves, again context switching between a task and its threads...
Move on to a multi-CPU machine: the kernel has to be set up to be scalable, and what the scheduler does is load one task onto one CPU and another onto another CPU (dual core), and both jump to where their instruction pointers are pointing. Now the kernel is genuinely running both tasks simultaneously on both CPUs. Scale up to 4-way: same thing, additional tasks loaded onto each CPU. Scale up again, to n-way... you get the drift.
As you can see, this is why threads might not be perceived as scalable: the kernel has, quite frankly, a mammoth job keeping track of which CPU is running what and, on top of that, which task is running which threads, which fundamentally explains why I think threads are not exactly scalable... threads consume a lot of resources...
If you really want to see what is happening, take a look at the source code for Linux, specifically the scheduler. No, hang on, forget about the 2.6.x kernel releases; look at the prehistoric version 0.99. The scheduler there is simpler to understand and easier to read. Sure, it's a bit old, but it's worth looking at; it will help you understand why (and hopefully my answer as well) threads are not scalable, and it shows how the toy OS uses time division based on processes. I have strived not to get into the technical aspects of modern-day CPUs that can do more than just what I have described...
Hope this helps.
A prof once told us in class that Windows, Linux, OS X and UNIX scale on threads and not processes, so threads would likely benefit your application even on a single processor because your application would be getting more time on the CPU.
Not necessarily. If your app is the only CPU-intensive thing running, more threads won't magically make more CPU time available - all that will result is more CPU time wasted in context switches.
This made me think, "Wow, threads suck". But repeating the test on a university server with four processors nearly quadrupled the speed.
That's because with four threads, it can use all four processors.
I'm not sure exactly what you're asking, but here is an answer which may help.
Under Linux, processes and threads are essentially the same thing. The scheduler deals with things called "tasks", and it doesn't really care whether they share an address space or not. What is shared or not shared really depends on how they were created.
Whether to use threads or processes is a key design decision and should not be taken lightly, but the performance of the scheduler is probably not a factor (of course, things like IPC requirements will affect the design wildly).