SetPriorityClass equivalent on Linux - linux

I have a daemon-like application that does some disk-intensive processing at initialization. To avoid slowing down other tasks I do something like this on Windows:
SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN);
// initialization tasks
SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_END);
// daemon is ready and running at normal priority
AFAIK, on Unices I can call nice or setpriority and lower the process priority but I can't raise it back to what it was at process creation (i.e. there's no equivalent to the second SetPriorityClass invocation) unless I have superuser privileges. Is there by any chance another way of doing it that I'm missing? (I know I can create an initialization thread that runs at low priority and wait for it to complete on the main thread, but I'd rather prefer avoiding it)
edit: Bonus points for the equivalent SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN); and SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_END);

You've said that your processing is disk intensive, so solutions using nice won't work. nice handles the priority of CPU access, not I/O access. PROCESS_MODE_BACKGROUND_BEGIN lowers I/O priority as well as CPU priority, and requires kernel features that don't exist in XP and older.
Controlling I/O priority is not portable across Unices, but there is a solution on modern Linux kernels. You'll need CAP_SYS_ADMIN to lower I/O priority to IO_PRIO_CLASS_IDLE, but it is possible to lower and raise priority within the best effort class without this.
The key function call is ioprio_set, which you'll have to call via a syscall wrapper:
static int ioprio_set(int which, int who, int ioprio)
{
return syscall(SYS_ioprio_set, which, who, ioprio);
}
For full example source, see here.
Depending on permissions, your entry to background mode is either IOPRIO_PRIO_VALUE(IO_PRIO_CLASS_IDLE,0) or IOPRIO_PRIO_VALUE(IO_PRIO_CLASS_BE,7). The sequence should then be:
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data)
ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IO_PRIO_CLASS_BE,7));
// Do work
ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IO_PRIO_CLASS_BE,4));
Note that you many not have permission to return to your original io priority, so you'll need to return to another best effort value.

Actually, if you have a reasonably recent Linux kernel there might be a solution. Here's what TLPI says:
In Linux kernels before 2.6.12, an
unprivileged process may use
setpriority() only to (irreversibly)
lower its own or another process’s
nice value.
Since kernel 2.6.12, Linux provides
the RLIMIT_NICE resource limit, which
permits unprivileged processes to
increase nice values. An unprivileged
process can raise its own nice value
to the maximum specified by the
formula 20 – rlim_cur, where rlim_cur
is the current RLIMIT_NICE soft
resource limit.
So basically you have to:
Use ulimit -e to set RLIMIT_NICE
Use setpriority as usual
Here is an example
Edit /etc/security/limits.conf. Add
cnicutar - nice -10
Verify using ulimit
cnicutar#aiur:~$ ulimit -e
30
We like that limit so we don't change it.
nice ls
cnicutar#aiur:~$ nice -n -10 ls tmp
cnicutar#aiur:~$
cnicutar#aiur:~$ nice -n -11 ls tmp
nice: cannot set niceness: Permission denied
setpriority example
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>
int main()
{
int rc;
printf("We are being nice!\n");
/* set our nice to 10 */
rc = setpriority(PRIO_PROCESS, 0, 10);
if (0 != rc) {
perror("setpriority");
}
sleep(1);
printf("Stop being nice\n");
/* set our nice to -10 */
rc = setpriority(PRIO_PROCESS, 0, -10);
if (0 != rc) {
perror("setpriority");
}
return 0;
}
Test program
cnicutar#aiur:~$ ./nnice
We are being nice!
Stop being nice
cnicutar#aiur:~$
The only drawback to this is that it's not portable to other Unixes (or is it Unices ?).

To workaroud lowering priority and then bringing it back you can:
fork()
CHILD: lower its priority
PARENT: wait for the child (keeping original parent's priority)
CHILD: do the job (in lower priority)
PARENT: continue with original priority after child is finished.
This should be UNIX-portable solution.

Related

perf stat time elapsed for sys and user [duplicate]

$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?
Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by Real (which usually occurs). Note that in the output these figures include the User and Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by time (1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait (2) (POSIX) or times (2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday (2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc orfread/fwrite will invoke these kernel functions and that then will count as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user' and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might also use malloc and the like in the background, which will again have some time in 'sys' then.
To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s
• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as
multiple processors may work in parallel.
Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep syscall
Non-busy sleep as done by the sleep syscall only counts in real, but not for user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>
int main(void) {
sleep(1);
return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
printf("%c\n", getchar());
return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second, it outputs just like the sleep example something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
uint64_t niters;
void* my_thread(void *arg) {
uint64_t *argument, i, result;
argument = (uint64_t *)arg;
result = *argument;
for (i = 0; i < niters; ++i) {
result = (result * result) - (3 * result) + 1;
}
*argument = result;
return NULL;
}
int main(int argc, char **argv) {
size_t nthreads;
pthread_t *threads;
uint64_t rc, i, *thread_args;
/* CLI args. */
if (argc > 1) {
niters = strtoll(argv[1], NULL, 0);
} else {
niters = 1000000000;
}
if (argc > 2) {
nthreads = strtoll(argv[2], NULL, 0);
} else {
nthreads = 1;
}
threads = malloc(nthreads * sizeof(*threads));
thread_args = malloc(nthreads * sizeof(*thread_args));
/* Create all threads */
for (i = 0; i < nthreads; ++i) {
thread_args[i] = i;
rc = pthread_create(
&threads[i],
NULL,
my_thread,
(void*)&thread_args[i]
);
assert(rc == 0);
}
/* Wait for all threads to complete */
for (i = 0; i < nthreads; ++i) {
rc = pthread_join(threads[i], NULL);
assert(rc == 0);
printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
}
free(threads);
free(thread_args);
return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads for a fixed 10^10 iterations on my 8 hyperthread CPU:
Plot data.
From the graph, we see that:
for a CPU intensive single core application, wall and user are about the same
for 2 cores, user is about 2x wall, which means that the user time is counted across all threads.
user basically doubled, and while wall stayed the same.
this continues up to 8 threads, which matches my number of hyperthreads in my computer.
After 8, wall starts to increase as well, because we don't have any extra CPUs to put more work in a given amount of time!
The ratio plateaus at this point.
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory bound, then we would get a fall in performance much earlier with less cores because the memory accesses would be a bottleneck as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer that ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use the sendfile, which does a file copy operation on kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy will be a CPU intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char **argv) {
char *source_path, *dest_path;
int source, dest;
struct stat stat_source;
if (argc > 1) {
source_path = argv[1];
} else {
source_path = "sendfile.in.tmp";
}
if (argc > 2) {
dest_path = argv[2];
} else {
dest_path = "sendfile.out.tmp";
}
source = open(source_path, O_RDONLY);
assert(source != -1);
dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
assert(dest != -1);
assert(fstat(source, &stat_source) != -1);
assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
assert(close(source) != -1);
assert(close(dest) != -1);
return EXIT_SUCCESS;
}
GitHub upstream.
which gives basically mostly system time as expected:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for both as for a single process, but the wall time is larger because the processes are competing for disk read access likely.
So it seems that it does in fact account for which process started a given kernel work.
Bash source code
When you do just time <cmd> on Ubuntu, it use the Bash keyword as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep source in the Bash 4.19 source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
gettimeofday() and getrusage() if both are available
times() otherwise
all of which are Linux system calls and POSIX functions.
GNU Coreutils source code
If we call it as:
/usr/bin/time
then it uses the GNU Coreutils implementation.
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
a non-POSIX BSD wait3 call if that is available
times and gettimeofday otherwise
1: https://i.stack.imgur.com/qAfEe.png**Minimal runnable POSIX C examples**
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep
Non-busy sleep does not count in either user or sys, only real.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>
int main(void) {
sleep(1);
return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
printf("%c\n", getchar());
return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second, it outputs just like the sleep example something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Real shows total turn-around time for a process;
while User shows the execution time for user-defined instructions
and Sys is for time for executing system calls!
Real time includes the waiting time also (the waiting time for I/O etc.)
In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, ect.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > usr+sys).
And finally, sometimes the reverse is true (usr+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but usr = 30m.
I want to mention some other scenario when the real-time is much much bigger than user + sys. I've created a simple server which respondes after a long time
real 4.784
user 0.01s
sys 0.01s
the issue is that in this scenario the process waits for the response which is not on the user site nor in the system.
Something similar happens when you run the find command. In that case, the time is spent mostly on requesting and getting a response from SSD.
Must mention that at least on my AMD Ryzen CPU, the user is always large than real in multi-threaded program(or single threaded program compiled with -O3).
eg.
real 0m5.815s
user 0m8.213s
sys 0m0.473s

How to change the watchdog timer in linux embedded

I have to use the linux watchdog driver (/dev/watchdog). It works great, I write an character like this:
echo 1 > /dev/watchdog
And the watchdog start and after an about 1 minute, the system reboot.
The question is, how can I change the timeout? I have to change the time interval in the driver?
Please read the Linux documentation. The standard method of changing the timeout from user space is to use an ioctl().
int timeout = 45; /* a time in seconds */
int fd;
fd = open("/dev/watchdog");
ioctl(fd, WDIOC_SETTIMEOUT, &timeout); /* Send time request to the driver. */
Each watchdog device may have an upper (and possibly lower) limit on that the hardware supports, so you can not set the timeout arbitrarily high. So after setting a timeout, it is good to read back the timeout.
ioctl(fd, WDIOC_GETTIMEOUT, &timeout); /* Update timeout with driver value. */
Now, the re-read timeout can be used as a kick frequency.
assert(timeout > 2);
while (1) {
ioctl(fd, WDIOC_KEEPALIVE, 0);
sleep(timeout-2);
}
You can write your own kicking routine in a script/shell command,
while [ 1 ] ; do sleep 1; echo V > /dev/watchdog; done
However, the userspace watchdog program is usually used. This should take care of all the esoteric features. You can nice the user space program to a minimum priority and then the system will reset if user space becomes hung up. BusyBox includes a watchdog applet.
Each watchdog driver has separate module parameters and most include a mechanism to set the timeout; use either the kernel command line or module parameter setting mechanism. However, the infra-structure ioctl timeout is more portable if you do not have specific knowledge of your watchdog hardware. The ioctl is probably more future proof, in that your hardware may change.
Sample user space code is included in the Linux samples directory.

In general, on ucLinux, is ioctl faster than writing to /sys filesystem?

I have an embedded system I'm working with, and it currently uses the sysfs to control certain features.
However, there is function that we would like to speed up, if possible.
I discovered that this subsystem also supports and ioctl interface, but before rewriting the code, I decided to search to see which is a faster interface (on ucLinux) in general: sysfs or ioctl.
Does anybody understand both implementations well enough to give me a rough idea of the difference in overhead for each? I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls". Or "they are roughly the same because sysfs has a very simple interface".
Update 10/24/2013:
The specific case I'm currently doing is as follows:
int fd = open("/sys/power/state",O_WRONLY);
write( fd, "standby", 7 );
close( fd );
In kernel/power/main.c, the code that handles this write looks like:
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
suspend_state_t state = PM_SUSPEND_STANDBY;
const char * const *s;
#endif
char *p;
int len;
int error = -EINVAL;
p = memchr(buf, '\n', n);
len = p ? p - buf : n;
/* First, check if we are requested to hibernate */
if (len == 7 && !strncmp(buf, "standby", len)) {
error = enter_standby();
goto Exit;
((( snip )))
Can this be sped up by moving to a custom ioctl() where the code to handle the ioctl call looks something like:
case SNAPSHOT_STANDBY:
if (!data->frozen) {
error = -EPERM;
break;
}
error = enter_standby();
break;
(so the ioctl() calls the same low-level function that the sysfs function did).
If by sysfs you mean the sysfs() library call, notice this in man 2 sysfs:
NOTES
This System-V derived system call is obsolete; don't use it. On systems with /proc, the same information can be obtained via
/proc/filesystems; use that interface instead.
I can't recall noticing stuff that had an ioctl() and a sysfs interface, but probably they exist. I'd use the proc or sys handle anyway, since that tends to be less cryptic and more flexible.
If by sysfs you mean accessing files in /sys, that's the preferred method.
I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls".
Accessing procfs or sysfs files does not entail an I/O bottleneck because they are not real files -- they are kernel interfaces. So no, accessing this stuff through "the file layer" does not affect performance. This is a not uncommon misconception in linux systems programming, I think. Programmers can be squeamish about system calls that aren't well, system calls, and paranoid that opening a file will be somehow slower. Of course, file I/O in the ABI is just system calls anyway. What makes a normal (disk) file read slow is not the calls to open, read, write, whatever, it's the hardware bottleneck.
I always use low level descriptor based functions (open(), read()) instead of high level streams when doing this because at some point some experience led me to believe they were more reliable for this specifically (reading from /proc). I can't say whether that's definitively true.
So, the question was interesting, I built a couple of modules, one for ioctl and one for sysfs, the ioctl implementing only a 4 bytes copy_from_user and nothing more, and the sysfs having nothing in its write interface.
Then a couple of userspace test up to 1 million iterations, here the results:
time ./sysfs /sys/kernel/kobject_example/bar
real 0m0.427s
user 0m0.056s
sys 0m0.368s
time ./ioctl /run/temp
real 0m0.236s
user 0m0.060s
sys 0m0.172s
edit
I agree with #goldilocks answer, HW is the real bottleneck, in a Linux environment with a well written driver choosing ioctl or sysfs doesn't make a big difference, but if you are using uClinux probably in your HW even few cpu cycles can make a difference.
The test I've done is for Linux not uClinux and it never wanted to be an absolute reference profiling the two interfaces, my point is that you can write a book about how fast is one or another but only testing will let you know, took me few minutes to setup the thing.

Can we get core id on which the process is running in MPI?

I have developed MPI program which can execute matrix multiplication on different cores in distributed environment and I can demonstrate the execution on different nodes by getting hostname of the node. But when we are running program on single node can I get core id which demonstrate the execution on multiple node sample code is given below
#include"stdio.h"
#include"stdlib.h"
#include"mpi.h"
int main(int argc , char **argv)
{
int size,rank;
int a,b,c;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if(rank==0)
{
for(i=0;i<size;i++)
{
printf("insert a and b");
scanf("%d",&b);
scanf("%d",&c);
MPI_Send(&b,1,MPI_INT,i+1,6,MPI_COMM_WORLD);
MPI_Send(&c,1,MPI_INT,i+1,6,MPI_COMM_WORLD);
}
}
if(rank!=0)
{
MPI_Recv(&b,1,MPI_INT,0,6,MPI_COMM_WORLD,&s);
MPI_Recv(&c,1,MPI_INT,0,6,MPI_COMM_WORLD,&s);
a=b*c;
printf("Mul = %d\n",a);
//Print name of core on which my process is running
}
MPI_Finalize();
return 0;
}
While it is possible to obtain the ID of the logical processor that currently executes the code, this often makes no sense unless you enable MPI process binding, also known as process pinning (in Intel's parlance). Binding (or pinning) restricts the CPU affinity set of each MPI process, i.e. the set of CPUs on which the process is allowed to execute. If the affinity set includes only a single logical CPU, then the process would only execute on that logical CPU. A logical CPU usually corresponds to a hardware thread on CPUs with SMT/hyperthreading or to a CPU core on non-SMT/non-hyperthreaded CPUs. Given affinity sets that include more than one logical CPUs, the scheduler is allowed to migrate the process around in order to keep the CPUs in the set equally busy. The default affinity set usually includes all available logical CPUs, that is the process could be scheduled for execution on any core or hardware thread.
Only when MPI process binding is in place and each process is bound to a single logical CPU it makes sense to actually query the OS for the location of the process. You have to consult your MPI implementation manual on how to enable it. For example, with Open MPI you would do something like:
mpiexec --bind-to-core --bycore -n 120 ...
--bind-to-core tells Open MPI to bind each process to a single CPU core and --bycore tells it to allocate cores consecutively on multisocket machines (that is, first to allocate all cores in the first socket, then in the second socket, etc.) With Intel MPI the binding (called pinning by Intel) is enabled by setting the environment variable I_MPI_PIN to 1. The process placement strategy is controlled by the value of I_MPI_PIN_DOMAIN. To achieve the same as the above shown Open MPI command line, one would to the following with Intel MPI:
mpiexec -n 120 -env I_MPI_PIN 1 -env I_MPI_PIN_DOMAIN "core:compact" ...
To obtain the location of your processes in a platform-independent way, you could use hwloc_get_last_cpu_location() from the hwloc library. It is developed as part of the Open MPI project but can be used as a stand-alone library. It provides an abstract interface to query the system topology and to manipulate the affinity of processes and threads. hwloc supports Linux, Windows and many other OSes.
We can get core id by using the sched_getcpu() function present in utmpx.h header file
sample program from net is shown below
#include <stdio.h>
#include<mpi.h>
#include <utmpx.h>
int main( int argc , char **argv )
{
MPI_Init(&argc, &argv);
printf( "cpu = %d\n", sched_getcpu() );
MPI_Finalize();
return 0;
}
You could use the Linux specific getcpu(2) syscall (or, as Krishna answered, the Linux specific sched_getcpu(3) function wrapping it). Read carefully the man page (that getcpu syscall does not have any libc wrapper!). Notice that it may gives you some obsolete information (because the kernel can -and does- migrate processes or tasks -e.g. threads- at any time from one CPU core to another).
Otherwise, your MPI implementation is probably using threads or processes. You could query them with gettid(2) (which needs some wrapper) for threads, or getpid(2) for processes.
You may want to code:
#include <unistd.h>
#include <asm/unistd.h>
#include <sys/syscall.h>
static inline long my_gettid(void)
{ return syscall(SYS_gettid); }
and perhaps something similar for getcpu....
You could also use proc(5) e.g. query /proc/self/stat (the 39th field gives the processor number)... Perhaps just displaying all of it is the simplest way:
{
char line[128];
FILE *fs = fopen("/proc/self/stat","r");
if (!fs) { perror("/proc/self/stat"); exit(EXIT_FAILURE); };
while (!feof(fs)) {
memset(line, 0, sizeof(line));
fgets(line, sizeof(line), fs);
fputs(line, stdout);
};
fclose(fs);
}
Remember that the Linux kernel (its scheduler) is migrating tasks (i.e. processes or threads) from one CPU core to another one at any time. So querying that is next to useless (the task could have migrated between the moment you query it and the moment you display it).

Can a program assign the memory directly?

Is there any really low level programming language that can get access the memory variable directly? For example, if I have a program have a variable i. Can anyone access the memory to change my program variable i to another value?
As an example of how to change the variable in a program from “the outside”, consider the use of a debugger. Example program:
$ cat print_i.c
#include <stdio.h>
#include <unistd.h>
int main (void) {
int i = 42;
for (;;) { (void) printf("i = %d\n", i); (void) sleep(3); }
return 0;
}
$ gcc -g -o print_i print_i.c
$ ./print_i
i = 42
i = 42
i = 42
…
(The program prints the value of i every 3 seconds.)
In another terminal, find the process id of the running program and attach the gdb debugger to it:
$ ps | grep print_i
1779 p1 S+ 0:00.01 ./print_i
$ gdb print_i 1779
…
(gdb) bt
#0 0x90040df8 in mach_wait_until ()
#1 0x90040bc4 in nanosleep ()
#2 0x900409f0 in sleep ()
#3 0x00002b8c in main () at print_i.c:6
(gdb) up 3
#3 0x00002b8c in main () at print_i.c:6
6 for (;;) { (void) printf("i = %d\n", i); (void) sleep(3); }
(gdb) set variable i = 666
(gdb) continue
Now the output of the program changes:
…
i = 42
i = 42
i = 666
So, yes, it's possible to change the variable of a program from the “outside” if you have access to its memory. There are plenty of caveats here, e.g. one needs to locate where and how the variable is stored. Here it was easy because I compiled the program with debugging symbols. For an arbitrary program in an arbitrary language it's much more difficult, but still theoretically possible. Of course, if I weren't the owner of the running process, then a well-behaved operating system would not let me access its memory (without “hacking”), but that's a whole another question.
Sure, unless of course the operating system protects that memory on your behalf. Machine language (the lowest level programming language) always "accesses memory directly", and it's pretty easy to achieve in C (by casting some kind of integer to pointer, for example). Point is, unless this code's running in your process (or the kernel), whatever language it's written in, the OS would normally be protecting your process from such interference (by mapping the memory in various ways for different processes, for example).
If another process has sufficient permissions, then it can change your process's memory. On Linux, it's as simple as reading and writing the pseudo-file /proc/{pid}/mem. This is how many exploits work, though they do rely on some vulnerability that allows them to run with very high privileges (root on Unix).
Short answer: yes. Long answer: it depends on a whole lot of factors including your hardware (memory management?), your OS (protected virtual address spaces? features to circumvent these protections?) and the detailed knowledge your opponent may or may not have of both your language's architecture and your application structure.
It depends. In general, one of the functions of an operating system is called segmentation -- that means keeping programs out of each other's memory. If I write a program that tries to access memory that belongs to your program, the OS should crash me, since I'm committing something called a segmentation fault.
But there are situations where I can get around that. For example, if I have root privileges on the system, I may be able to access your memory. Or worse -- I can run your program inside a virtual machine, then sit outside that VM and do whatever I want to its memory.
So in general, you should assume that a malicious person can reach in and fiddle with your program's memory if they try hard enough.

Resources