Linux on quad core: single executable, 4 processes

I have 4 executables that each do some very complex tasks. Each of these programs alone can take nearly 100% of the power of a single core of a quad-core CPU, i.e. roughly 25% of total CPU power. Since all of these programs use hardware resources that can't be shared between multiple processes, I wish to run a single executable that spawns 3 child processes which, in turn, occupy the other three cores. I'm on Linux and I'm using C++11. Most of the complex code runs in its own class, and the hardest part runs in a function that I usually call Process(), so I have 4 objects, each with its own Process() that, when running, takes 100% of a single core.
I tried using OpenMP, but I don't think it's the best solution, as I have no control over CPU affinity. Using std::thread also doesn't seem like a good idea, because threads inherit the main process's CPU affinity. On Linux I think I can do this with fork(), but I have no idea how the whole structure should be laid out.
This might be related to my other question, which was partly left unanswered, maybe because I was trying an approach that works in some cases but not in mine.
An example of pseudo-code could be this:
int main()
{
    // ...init everything...

    // This alone takes 100% of a single core
    float out1 = object1->Process();

    // This should be spawned as a child process running on another core
    float out2 = object2->Process();
    // ...on another core...
    float out3 = object3->Process();
    // ...yet another core...
    float out4 = object4->Process();

    // This should still run in the parent process
    float total_output = out1 + out2 + out3 + out4;
}

You can use std::thread, which on Linux is a front-end to pthread_create().
Then set each thread's affinity with sched_setaffinity(), called from within the thread itself.
As you asked, here is a working stub:
#include <sched.h>
#include <cstdlib>
#include <list>
#include <memory>
#include <thread>

void thread_func(int cpu_index) {
    // Pin the calling thread to the requested core.
    cpu_set_t cpuSet;
    CPU_ZERO(&cpuSet);
    CPU_SET(cpu_index, &cpuSet);
    sched_setaffinity(0, sizeof(cpu_set_t), &cpuSet);
    /* the rest of the thread body here */
}

using namespace std;

int main(int argc, char **argv) {
    if (argc != 2) exit(1);
    int n_cpus = atoi(argv[1]);

    // One thread per requested CPU; each thread pins itself in thread_func().
    list< shared_ptr< thread > > lot;
    for (int i = 0; i < n_cpus; ++i) {
        lot.push_back(shared_ptr<thread>(new thread(thread_func, i)));
    }
    for (auto tptr = lot.begin(); tptr != lot.end(); ++tptr) {
        (*tptr)->join();
    }
}
Note that for optimal behaviour each thread should initialise its memory (that is, construct its objects) inside the thread body. This matters if you also want the code to run well on multi-processor machines: on a NUMA system, memory pages are allocated on the memory node closest to the CPU that first touches them.
For example, you can have a look at this blog.
However, this is not an issue in your specific case, since you are dealing with a single-processor system, or more precisely a system with just one NUMA node (many current AMD processors do contain two NUMA nodes even within a single physical package), and all the memory banks are attached there.
The only effect of sched_setaffinity() in this context is to pin each thread to a specific core.
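Adapting the stub to the pseudo-code in the question, a minimal sketch could look like the one below. Worker is a hypothetical stand-in for the real classes; each object is constructed inside the thread body (see the NUMA remark above) and the partial results are collected into a vector. Compile with g++ -std=c++11 -pthread.
#include <sched.h>
#include <thread>
#include <vector>

// Hypothetical stand-in for object1..object4's class.
struct Worker {
    float Process() { return 1.0f; /* the heavy computation goes here */ }
};

int main() {
    const int n_cpus = 4;
    std::vector<float> results(n_cpus);
    std::vector<std::thread> threads;

    for (int i = 0; i < n_cpus; ++i) {
        threads.emplace_back([i, &results] {
            // Pin this thread to core i.
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(i, &set);
            sched_setaffinity(0, sizeof(cpu_set_t), &set);

            Worker w;                       // construct inside the thread body
            results[i] = w.Process();       // 100% of one core, in parallel
        });
    }
    for (auto &t : threads) t.join();

    float total_output = results[0] + results[1] + results[2] + results[3];
    (void) total_output;
}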

You don't have to program anything. The taskset command changes the CPU affinity of an already-running process, or launches a new process with a given affinity.
Running a single executable that spawns other programs is no different from executing the programs directly, except for the common initialization implied by the comments in your stub.
taskset 1 program_for_cpu_0 arg1 arg2 arg3...
taskset 2 program_for_cpu_1 arg1 arg2 arg3...
taskset 4 program_for_cpu_2 arg1 arg2 arg3...
taskset 8 program_for_cpu_3 arg1 arg2 arg3...
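That said, if you do want the single parent executable the question describes (for example, to run the common initialization once), a minimal sketch equivalent to the taskset commands above could fork, pin and exec each program; the program paths here are placeholders:
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    // ...common initialization here...
    const char *programs[4] = { "./program_for_cpu_0", "./program_for_cpu_1",
                                "./program_for_cpu_2", "./program_for_cpu_3" };
    for (int cpu = 0; cpu < 4; ++cpu) {
        pid_t pid = fork();
        if (pid == 0) {                                     // child
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            sched_setaffinity(0, sizeof(cpu_set_t), &set);  // pin before exec
            execl(programs[cpu], programs[cpu], (char *) NULL);
            std::perror("execl");                           // only reached if exec fails
            _exit(EXIT_FAILURE);
        }
    }
    for (int i = 0; i < 4; ++i) wait(NULL);                 // reap all children
    return 0;
}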
I am suspicious of setting CPU affinities. I have yet to find an actual use for doing so (other than satisfying some inner need for control).
Unless the CPUs are not identical in some way, there should be no need to restrict a process to a particular CPU.
The Linux kernel normally keeps a process on the same CPU unless it enters an extended wait for i/o, a semaphore, etc.
Switching a process from one CPU to another does not incur any particular overhead, except in NUMA configurations where each CPU has its own local memory; as far as I know, ARM implementations are not built that way.
If a process exhibits behaviour that is not purely CPU-bound, allowing the kernel scheduler the flexibility to reassign it to a newly available CPU should improve system performance; a process bound to one CPU cannot take advantage of other CPUs when they become free.

Related

Accessing a 64-bit variable in different threads without synchronization or atomicity

I have two threads sharing a uint64_t variable. The first thread just reads the variable while the other thread just writes to it. If I don't synchronize them using a mutex/spinlock/atomic operations etc., is there any possibility of reading a value other than the ones the writing thread wrote? Reading a stale value that the writing thread wrote earlier is not a problem.
As an example, the writing thread increments the variable between 0 and 100, and the reading thread prints the value. Is there any possibility of seeing a value on the screen outside the [0-100] range? Currently I don't see any such value, but I'm not sure whether a race condition could cause one.
Thanks in advance.
On a 64-bit processor, data transfers are 64 bits at a time, so (assuming the variable is naturally aligned) you will see logically consistent values, i.e. you won't see 32 bits from before the write combined with 32 bits from after it. This is obviously not true of 32-bit processors.
The kind of issues you will see are things like this: if the two threads run on different cores, the reading thread will not see changes made by the writing thread until the writing thread's core flushes its cache. Also, optimisation may make either thread not bother to read memory at all inside the loop. For example, if you have:
#include <stdint.h>

uint64_t x = 0;

void increment()
{
    for (int i = 0; i < 100; ++i)
    {
        x++;
    }
}
It is possible that the compiler will generate code that reads x into a register at the start of the loop and not write it back to memory until the loop exits. You need things like volatile and memory barriers.
All kinds of bad things can happen if you have a race condition on such a variable.
The correct tool for this in modern C is atomics. Just declare your variable:
uint64_t _Atomic counter;
Then, all your operations (load, store, increment...) will be atomic, that is indivisible, uninterruptible and linearizable. No mutex or other protection mechanism is necessary.
This was introduced in C11, and recent C compilers, e.g. gcc and clang, support it out of the box.
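Since the main question in this thread uses C++11, the equivalent tool there is std::atomic. A minimal sketch of the writer/reader pair described above, with purely illustrative loop bounds:
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<uint64_t> counter{0};

int main() {
    std::thread writer([] {
        for (uint64_t i = 0; i <= 100; ++i)
            counter.store(i);                 // atomic store: no torn writes
    });
    std::thread reader([] {
        for (int i = 0; i < 100; ++i) {
            uint64_t v = counter.load();      // always some value the writer actually stored
            std::printf("%llu\n", static_cast<unsigned long long>(v));
        }
    });
    writer.join();
    reader.join();
}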

When can't computation speed be improved by parallelization?

Similar questions have been asked before but I couldn't find an answer that was more about the low level mechanics of threads themselves.
Problem
I have a physical modeling project in which I need to apply a function to 160 billion data points.
for (i = 0; i < N; i++) {            // N = 160,000,000,000
    physicalModal(input[i]);         // Linear function, just additions and subtractions
}

function physicalModal(x) {
    A*x + B*x + C*x + D*x + ...      // An oversimplification, but you get the idea: a linear function
}
Given the nature of this problem, am I correct in thinking that a single thread on a single core, or one thread per core, would be the fastest way to solve this, and that using extra threads beyond the number of cores would not help me here?
My Logic (Please correct where my assumptions are wrong)
Threads on a single core don't really work independently; they just share processor time, which can be beneficial when one thread is waiting on, say, a socket response while other threads are processing requests. In the example I posted above, I figure the CPU could go to 100% on one thread, so using multiple threads would just disturb the computation. Is this correct?
What then determines when threading is useful?
If my above assumption is correct, what is the key factor in determining when other threads would be useful? My guess would be simultaneous operations that have varying completion times, waiting, etc. - but that's based on my initial premise, which may be incorrect.
I need to apply a function to 160 billion data points.
I assume that your function has no side effects (no writes to global/static variables, no disk/network access, no serving of many remote users) and just does some arithmetic on its input - either a single input point or a few neighbouring points, as in a stencil kernel:
for (long long i = 0; i < 160000000000LL; i++) {
    // Linear function, just additions and subtractions
    output[i] = physicalModel(input[i] /* possibly also input[i-1], input[i+1], ... */);
}
Then you have to check how efficiently this function works on a single CPU. Can you (or your compiler) unroll your loop and convert it to SIMD parallelism?
for (long long i = 1; i < 160000000000LL - 1; i++) {
    output[i] = A*input[i-1] + B*input[i] + C*input[i+1];
}

// Unrolled 4 times; if input is float, the compiler may load 4 floats
// into a single SSE2 register and do 4 operations with one asm instruction.
for (long long i = 4; i < 160000000000LL - 4; i += 4) {
    output[i+0] = A*input[i-1] + B*input[i+0] + C*input[i+1];
    output[i+1] = A*input[i+0] + B*input[i+1] + C*input[i+2];
    output[i+2] = A*input[i+1] + B*input[i+2] + C*input[i+3];
    output[i+3] = A*input[i+2] + B*input[i+3] + C*input[i+4];
}
When your function has good single-threaded performance, you can add thread or process parallelism (using OpenMP, MPI or another method). Under our assumptions there are no threads blocking on an external resource like reading from an HDD or from the network, so every thread you start can run at any time. In that case you should start no more than 1 thread per CPU core. If you start more, each thread will run for some amount of time, then be displaced by another, and overall performance will be lower than with 1 thread per core.
In C/C++, adding OpenMP thread-level parallelism (https://en.wikipedia.org/wiki/OpenMP, http://www.openmp.org/) can be as easy as adding one line just before your for loop (plus the -fopenmp/-openmp option to your compilation); the compiler and runtime library will split the loop into parts and distribute them between threads ([0..N/4], [N/4..N/2], [N/2..3N/4], [3N/4..N] for 4 threads, or another split scheme; you can give hints with the schedule clause):
#pragma omp parallel for
for (long long i = 1; i < 160000000000LL - 1; i++) {
    output[i] = physicalModel(input[i]);
}
The thread count will be determined at runtime by the OpenMP library (gomp in gcc - https://gcc.gnu.org/onlinedocs/libgomp/index.html). By default it uses one thread per logical CPU core. You can change the number of threads with the OMP_NUM_THREADS environment variable (export OMP_NUM_THREADS=5; ./program).
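The thread count can also be set from code with omp_set_num_threads() from <omp.h>. A minimal sketch (compile with -fopenmp):
#include <omp.h>
#include <cstdio>

int main() {
    omp_set_num_threads(4);                 // overrides OMP_NUM_THREADS for this program
    #pragma omp parallel
    {
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}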
On a CPU with hardware multithreading within each core (Intel HT and other variants of SMT: you have 4 physical cores but 8 "logical" ones), in some cases you should use 1 thread per logical core, and in other cases 1 thread per physical core (with correct thread binding), because some resources (e.g. FPU units) are shared between the logical cores. Do some experiments if your code will be run several (many) times.
If your threads (your model) are limited by memory speed (memory bound: they load a lot of data from memory and do only a very simple operation on every float), you may want to run fewer threads than there are CPU cores, since additional threads will not get additional memory bandwidth.
If your threads do a lot of computation per element loaded from memory (compute bound), use better SIMD and more threads. When you have very good and wide SIMD (full-width AVX), you will get no speedup from using HT, as the full-width AVX unit is shared between the logical cores (but every physical core has one, so use it); in this case the CPU frequency will also be lower, as the full-width AVX unit runs very hot under full load.
An illustration of memory-limited versus compute-limited applications (the Roofline model): https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/ (figure: https://crd.lbl.gov/assets/Uploads/FTG/Projects/Roofline/_resampled/ResizedImage600300-rooflineai.png)

Why does using taskset to run a multi-threaded Linux program on a set of isolated cores cause all threads to run on one core?

Desired behaviour: run a multi-threaded Linux program on a set of cores which have been isolated using isolcpus.
Here's a small program we can use as an example multi-threaded program:
#include <stdio.h>
#include <pthread.h>
#include <err.h>
#include <unistd.h>
#include <stdlib.h>

#define NTHR 16
#define TIME 60 * 5

void *
do_stuff(void *arg)
{
    int i = 0;

    (void) arg;

    while (1) {
        i += i;
        usleep(10000); /* don't dominate CPU */
    }
}

int
main(void)
{
    pthread_t threads[NTHR];
    int rv, i;

    for (i = 0; i < NTHR; i++) {
        rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
        if (rv) {
            perror("pthread_create");
            return (EXIT_FAILURE);
        }
    }

    sleep(TIME);
    exit(EXIT_SUCCESS);
}
If I compile and run this on a kernel with no isolated CPUs, then the threads are spread out over my 4 CPUs. Good!
Now if I add isolcpus=2,3 to the kernel command line and reboot:
Running the program without taskset distributes threads over cores 0 and 1. This is expected as the default affinity mask now excludes cores 2 and 3.
Running with taskset -c 0,1 has the same effect. Good.
Running with taskset -c 2,3 causes all threads to go onto the same core (either core 2 or 3). This is undesired. Threads should distribute over cores 2 and 3. Right?
This post describes a similar issue (although the example given is further from the pthreads API). The OP there was happy to work around it by using a different scheduler, but I'm not certain that's ideal for my use case.
Is there a way to have the threads distributed over the isolated cores using the default scheduler?
Is this a kernel bug which I should report?
EDIT:
The right thing does indeed happen if you use a real-time scheduler like the fifo scheduler. See man sched and man chrt for details.
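For reference, the switch to SCHED_FIFO can also be done programmatically (instead of via chrt) with sched_setscheduler(); a sketch, which needs root or CAP_SYS_NICE:
#include <sched.h>
#include <cstdio>

int main() {
    struct sched_param param;
    param.sched_priority = 10;                      // 1..99 for SCHED_FIFO
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
        std::perror("sched_setscheduler");          // typically EPERM without privileges
    // ...create the pthreads here; with the default inherit-sched attribute
    // they start with the same FIFO policy...
    return 0;
}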
From the Linux Kernel Parameter Doc:
This option can be used to specify one or more CPUs to isolate from
the general SMP balancing and scheduling algorithms.
So this option effectively prevents the scheduler from migrating threads from one core to another less-contended core (SMP balancing). Typically, isolcpus is used together with pthread affinity control to pin threads, with knowledge of the CPU layout, to gain predictable performance.
https://www.kernel.org/doc/Documentation/kernel-parameters.txt
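For example, pinning each worker thread explicitly, so that the isolated cores are actually used, could be sketched as follows (error handling omitted; call this at the top of each thread body):
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for pthread_setaffinity_np() */
#endif
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core. */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}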
--Edit--
OK, I see why you are confused. Personally, I would also expect consistent behaviour from this option. The problem lies in two functions, select_task_rq_fair and select_task_rq_rt, which are responsible for selecting a new run_queue (essentially, selecting which next_cpu to run on). I did a quick trace (SystemTap) of both functions: for CFS it always returns the same first core in the mask; for RT, it returns other cores. I haven't had a chance to look into the selection logic of each algorithm, but you can send an email to the maintainer on the Linux kernel development mailing list to ask for a fix.

Linux 2.6.31 Scheduler and Multithreaded Jobs

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems that when even a single single-threaded job that isn't mine is running, my 24-thread jobs (which I run at high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, in Linux notation) sit idle. Can anyone explain this behaviour and what I can do about it to get all of the 23 cores that aren't being used by someone else?
Notes:
In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head.
The CPU architecture is x64. Could it be relevant that my 24-core jobs are 32-bit while the other jobs I'm competing with are 64-bit?
Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.
It is possible that this is caused by the scheduler trying to keep each of your tasks running on the same CPU that it was previously running on (it does this because the task has likely brought its working set into that CPU's cache - it's "cache hot").
Here are a few ideas you can try:
Run twice as many threads as you have cores;
Run one or two less threads than you have cores;
Reduce the value of /proc/sys/kernel/sched_migration_cost (perhaps down to zero);
Reduce the value of /proc/sys/kernel/sched_domain/.../imbalance_pct down closer to 100.
Do your threads have to synchronize? If so, you might have the following problem:
Assume you have a 4-cpu system, and a 4-thread job. When run alone, threads fan out to use all 4 cores and total usage is near perfect (We'll call this 400%).
If you add one single-threaded interfering job, the scheduler might place 2 of your threads on the same cpu. This means that 2 of your threads are now running at effectively half their normal pace (dramatic simplification), and if your threads need to synchronize periodically, the progress of your job can be limited by the slowest thread, which in this case is running at half normal speed. You would see utilization of only 200% (from your job running 4x 50%) plus 100% (the interfering job) = 300%.
Similarly, if you assume that the interfering job only uses 25% of one processor's time, you might see one of your threads and the interferer on the same CPU. In that case the slowest thread is running at 3/4 normal speed, causing the total utilization to be 300% (4x 75%) + 25% = 325%. Play with these numbers and it's not hard to come up with something similar to what you're seeing.
If that's the problem, you can certainly play with priorities to give unwelcome tasks only tiny fractions of available CPU (I'm assuming I/O delays aren't a factor). Or, as you've found, try to increase threads so that each CPU has, say, 2 threads, minus a few to allow for system tasks. In this way a 24 core system might run best with, say, 46 threads (which always leaves half of 2 cores' time available for system tasks).
Do your threads communicate with each other?
Try manually binding every thread to a CPU, with sched_setaffinity or pthread_setaffinity_np. The scheduler can be rather dumb when working with a lot of related threads.
It might be worthwhile to use mpstat (part of the sysstat package) to figure out if you have entire CPUs sitting idle while others are fully utilized. It should give you a more detailed view of the utilization than top or vmstat: run mpstat -P ALL to see 1 line per CPU.
As an experiment, you might try setting the CPU affinity on each thread such that each is bound to an individual CPU; this would let you see what performance is like if you don't let the kernel scheduler decide which CPU a task is scheduled on. It's not a good permanent solution, but if it helps a lot it gives you an idea of where the scheduler is falling short.
Do you think the bottleneck is in your application or the kernel's scheduling algorithm? Before you start tweaking scheduling parameters, I suggest you try running a simple multi-threaded application to see if it exhibits the same behavior as your application.
// COMPILE WITH: gcc threads.c -lpthread -o thread
#include <pthread.h>

#define NUM_CORES 24

// Each thread spins forever, so every core it lands on shows 100% usage.
void* loop_forever(void* argument) {
    volatile int a = 0;   /* volatile so the busy loop is not optimised away */
    (void) argument;
    while (1) a++;
    return NULL;
}

int main(void) {
    int i;
    pthread_t threads[NUM_CORES];

    for (i = 0; i < NUM_CORES; i++)
        pthread_create(&threads[i], NULL, loop_forever, NULL);

    for (i = 0; i < NUM_CORES; i++)
        pthread_join(threads[i], NULL);

    return 0;
}

Does Linux Time Division Processes Or Threads

A prof once told us in class that Windows, Linux, OS X and UNIX scale on threads and not processes, so threads would likely benefit your application even on a single processor because your application would be getting more time on the CPU.
I tried with the following code on my machine (which only has one CPU).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

pthread_t xs[10];

/* Burn CPU: count down from 2^30. */
void *nop(void *ptr) {
    unsigned long n = 1UL << 30UL;
    while (n--);
    return NULL;
}

/* Run the workload in 10 threads. */
void test_one() {
    size_t len = (sizeof xs) / (sizeof *xs);
    while (len--)
        if (pthread_create(xs + len, NULL, nop, NULL))
            exit(EXIT_FAILURE);
    len = (sizeof xs) / (sizeof *xs);
    while (len--)
        if (pthread_join(xs[len], NULL))
            exit(EXIT_FAILURE);
}

/* Run the same workload 10 times sequentially in one thread. */
void test_two() {
    size_t len = (sizeof xs) / (sizeof *xs);
    while (len--) nop(NULL);
}

int main(int argc, char *argv[]) {
    test_one();
    // test_two();
    printf("done\n");
    return 0;
}
Both tests were identical in terms of speed.
real 0m49.783s
user 0m48.023s
sys 0m0.224s
real 0m49.792s
user 0m49.275s
sys 0m0.192s
This made me think, "Wow, threads suck". But repeating the test on a university server with four processors nearly quadrupled the speed.
real 0m7.800s
user 0m30.170s
sys 0m0.006s
real 0m30.190s
user 0m30.165s
sys 0m0.004s
Am I overlooking something when interpreting the results from my home machine?
To understand what goes on in the bowels of tasks and threads, let's look at this toy kernel code:
struct regs {
    int eax, ebx, ecx, edx, es, ds, gs, fs, cs, ip, flags;
    struct tss *task_sel;
};

struct thread {
    struct regs *regs;
    int parent_id;
    struct thread *next;
};

struct task {
    struct regs *regs;
    int *phys_mem_begin;
    int *phys_mem_end;
    int *filehandles;
    int priority;
    int num_threads;
    int quantum;
    int duration;
    int start_time, end_time;
    int parent_id;
    struct thread *task_thread;
    /* ... */
    struct task *next;
};
Imagine the kernel allocates memory for that task structure, which is a linked list. Look closely at the quantum field: that is the timeslice of processor time, determined by the priority field. There will always be a task with id 0 that never sleeps and just idles, perhaps issuing NOPs (No OPerationS). The scheduler spins around ad nauseam until infinity (that is, until the power gets unplugged). If the quantum field says the task runs for 20 ms, the kernel sets start_time and end_time to now and now + 20 ms; when that end_time is up, the kernel saves the state of the CPU registers into the regs pointer, moves on to the next task in the chain, loads the CPU registers from that task's regs pointer, jumps to its saved instruction pointer, sets its quantum and duration, and when that duration reaches zero moves on to the next... effectively context switching. This is what gives the illusion that tasks are running simultaneously on a single CPU.
Now look at the thread struct, which is a linked list of threads hanging off that task structure. The kernel allocates threads for the task, sets up the CPU state for each thread and jumps into it... now the kernel has to manage the threads as well as the tasks themselves, again context switching between a task and its threads.
Move on to a multi-CPU machine: the kernel is set up to be scalable, and what the scheduler does is load one task onto one CPU and another task onto another CPU (dual core), and both jump to wherever their instruction pointers are pointing... now the kernel is genuinely running both tasks simultaneously, on both CPUs. Scale up to 4-way, same thing, with additional tasks loaded onto each CPU; scale up again to n-way... you get the drift.
As you can see, this is why threads can be perceived as not scalable: the kernel has, quite frankly, a mammoth job keeping track of which CPU is running what and, on top of that, which task is running which threads, which is fundamentally why I think threads are not exactly scalable... threads consume a lot of resources...
If you really want to see what is happening, take a look at the Linux source code, specifically the scheduler. Actually, forget the 2.6.x kernel releases and look at the prehistoric version 0.99: the scheduler there is simpler to understand and easier to read. Sure, it's a bit old, but it is worth looking at; it will help you understand this answer and why threads are not exactly scalable, and it shows how the toy OS above uses time division based on processes. I have tried not to get into the technical aspects of modern-day CPUs, which can do more than what I have described...
Hope this helps.
A prof once told us in class that Windows, Linux, OS X and UNIX scale on threads and not processes, so threads would likely benefit your application even on a single processor because your application would be getting more time on the CPU.
Not necessarily. If your app is the only CPU-intensive thing running, more threads won't magically make more CPU time available - all that will result is more CPU time wasted in context switches.
This made me think, "Wow, threads suck". But repeating the test on a university server with four processors nearly quadrupled the speed.
That's because with multiple threads, the work can spread across all four processors.
I'm not sure exactly what you're asking, but here is an answer which may help.
Under Linux, processes and threads are essentially the same thing. The scheduler deals with "tasks", and it doesn't really care whether they share an address space or not. What is shared or not shared depends on how the task was created.
Whether to use threads or processes is a key design decision and should not be taken lightly, but the performance of the scheduler is probably not a factor in it (of course, things like IPC requirements will vary the design wildly).
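To illustrate "depends on how the task was created": a minimal clone() sketch (glibc, with the downward-growing stack of x86/ARM assumed), where the flags alone decide whether the new task behaves like a thread or like a separate process:
#include <sched.h>          // clone() and CLONE_* flags (glibc extension; g++ defines _GNU_SOURCE)
#include <csignal>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

static int child_fn(void *) {
    std::printf("child task, pid=%d\n", (int) getpid());
    return 0;
}

int main() {
    const size_t stack_size = 1024 * 1024;
    char *stack = new char[stack_size];

    // CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND: the new task shares
    // address space, filesystem info, file descriptors and signal handlers,
    // i.e. it is thread-like. Drop these flags and it is fork()-like.
    pid_t pid = clone(child_fn, stack + stack_size,   // stack grows downwards
                      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                      NULL);
    if (pid == -1) { std::perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    delete[] stack;
    return 0;
}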
