In Perl module Proc::ProccessTable, why does pctcpu sometimes return 'inf', 'nan', or a value greater than 100? - linux

The Perl module Proc::ProcessTable occasionally observes that the pctcpu attribute as 'inf', 'nan', or a value greater then 100. Why does it do this? And are there any guidelines on how to deal with this kind of information?
We have observed this on various platforms including Linux 2.4 running on 8 logical processors.
I would guess that 'inf' or 'nan' is the result of some impossibly large value or a divide by zero.
For values greater then 100, could this possibly mean that more then one processor was used?
And for dealing with this information, is the best practice merely marking the data point as untrustworthy and normalizing to 100%?

I do not know why that happens and I cannot stress test the module right now trying to generate such cases.
However, a principle I have followed all my research is not to replace data I know to be non-sense with something that looks reasonable. You basically have missing observations and you should treat them as such. I would not attach a numerical value at all so as not to pretend I have information when I in fact do not.
Then, your statistics for the non-missing points will be meaningful and you can look at any patterns in the missing observations separately.
UPDATE: Looking at the calc_prec() function in the source code:
/* calc_prec()
*
* calculate the two cpu/memory precentage values
*/
static void calc_prec(char *format_str, struct procstat *prs, struct obstack *mem_pool)
{
float pctcpu = 100.0f * (prs->utime / 1e6) / (time(NULL) - prs->start_time);
/* calculate pctcpu - NOTE: This assumes the cpu time is in microsecond units! */
sprintf(prs->pctcpu, "%3.2f", pctcpu);
field_enable(format_str, F_PCTCPU);
/* calculate pctmem */
if (system_memory > 0) {
sprintf(prs->pctmem, "%3.2f", (float) prs->rss / system_memory * 100.f);
field_enable(format_str, F_PCTMEM);
}
}
First, IMHO, it would be better to just divide by 1e4 rather than multiplying by 100.0f after the division. Second, it is possible (if polled immediately after process start) for the time delta to be 0. Third, I would have just done the whole thing in double.
As an aside, this function looks like a good example of why you should not have comments in code.
#include <stdio.h>
#include <time.h>
volatile float calc_percent(
unsigned long utime,
time_t now,
time_t start
) {
return 100.0f * ( utime / 1e6) / (now - start);
}
int main(void) {
printf("%3.2f\n", calc_percent(1e6, time(NULL), time(NULL)));
printf("%3.2f\n", calc_percent(0, time(NULL), time(NULL)));
return 0;
}
This outputs inf in the first case and nan in the second case when compiled with Cygwin gcc-4 on Windows. I do not know if this behavior is standard or just what happens with this particular combination of OS+compiler.

Related

Was: How does BPF calculate number of CPU for PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second best approach is to pick a buffer for reading in that is "large enough." Here, the problem is similar: what is that number and is there way to learn it at runtime?
The question comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
int cpus = libbpf_num_possible_cpus();
if (cpus < 0) {
p_err("Can't get # of possible cpus: %s", strerror(-cpus));
exit(-1);
}
return cpus;
}
int libbpf_num_possible_cpus(void)
{
static const char *fcpu = "/sys/devices/system/cpu/possible";
static int cpus;
int err, n, i, tmp_cpus;
bool *mask;
/* ---8<--- snip */
}
So that's how they do it!

Difference between 'Niceness' and 'Goodness' in linux kernel 2.4.27

I am trying to understand linux kernel code for task scheduler.
I am not able to figure out what is goodness value and how it differs from niceness?
Also, how each of them contribute to scheduling?
AFAIK: Each process in the run queue is assigned a "goodness" value which determines how good a process is for running. This value is calculated by the goodness() function.
The process with higher goodness will be the next process to run. If no process is available for running, then the operating system selects a special idle task.
A first-approximation of "goodness" is calculated according to the number of ticks left in the process quantum. This means that the goodness of a process decreases with time, until it becomes zero when the time quantum of the task expires.
The final goodness is calculated in function of the niceness of the process. So basically goodness of a process is a combination of the Time slice left logic and nice value.
static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
{
int weight;
/*
* select the current process after every other
* runnable process, but before the idle thread.
* Also, dont trigger a counter recalculation.
*/
weight = -1;
if (p->policy & SCHED_YIELD)
goto out;
if (p->policy == SCHED_OTHER) {
/*
* Give the process a first-approximation goodness value
* according to the number of clock-ticks it has left.
*
* Don't do any other calculations if the time slice is
* over..
*/
weight = p->counter;
if (!weight)
goto out;
...
weight += 20 - p->nice;
goto out;
}
/* code for real-time processes goes here */
out:
return weight;
}
TO UNDERSTAND and REMEMBER NICE value: remember this.
The etymology of Nice value is that a "nicer" process allows for more time for other processes to run, hence lower nice value translates to higher priority.
So,
Higher the goodness value -> More likely to Run
Higher the nice value -> Less likely to run.
Also Goodness value is calculated using the nice value as
20 - p->nice
Since nice value ranges from -20 ( highest priority) to 19 ( lowest priority)
lets assume a nice value of -20 ( EXTREMELY NOT NICE). hence 20 - (-20) = 40
Which means the goodness value has increased, hence this process will be chosen.

time.h regarding doubts please explain

Could someone clear my following doubts regarding this code:
1) What is this clock() ? How does it work?
2)I have never seen ; after for statement when do i need to use these semicolons after for
and will it loop though the only next line of code int tick= clock();?
3)Is he coverting tick to float by doing this: (float)tick ? Can i do this thing with
every variable i initialize first as short, integer, long, long long and then change it to float by just doing float(variable_name)?
Thanks in advance
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
int main()
{
int i;
for(i=0; i<10000000; i++);
int tick= clock();
printf("%f", (float)tick/ CLOCKS_PER_SEC);
return 0;
}
1) What is this clock() ? How does it work?
It reads the "CPU" clock of your program (how much time your program has taken so far). It is called RTC for Real Time Clock. The precision is very high, it should be close to one CPU cycle.
2) I have never seen ; after for statement when do i need to use these semicolons after for and will it loop though the only next line of code int tick= clock();?
No. The ';' means "loop by yourself." This is wrong though because such a loop can be optimized out if it does nothing. So the number of cycles that loop takes could drop to zero.
So "he" [hopes he] counts from 0 to 10 million.
3) Is he converting tick to float by doing this: (float)tick ? Can i do this thing with every variable i initialize first as short, integer, long, long long and then change it to float by just doing float(variable_name)?
Yes. Three problems though:
a) he uses int for tick which is wrong, he should use clock_t.
b) clock_t is likely larger than 32 bits and a float cannot even accommodate 32 bit integers.
c) printf() anyway takes double so converting to float is not too good, since it will right afterward convert that float to a double.
So this would be better:
clock_t tick = clock();
printf("%f", (double)tick / CLOCKS_PER_SEC);
Now, this is good if you want to know the total time your program took to run. If you'd like to know how long a certain piece of code takes, you want to read clock() before and another time after your code, then the difference gives you the amount of time your code took to run.

Parallel processing a prime finder with openMP

I am trying to construct a prime finder for a bit of C practice. I've got the algorithm down and I've done a bunch of optimisations to make it faster, I then decided to try to parallelize it because, hey why not! Turns out to be harder than I thought. I can either get all threads running the same process (with same args) or a single thread will run if I try and supply different args to each process. I really have no idea what I'm doing here but you can see some experimental values I'm using in this code:
// gcc -std=c99 -o multithread multithread.c -fopenmp -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
int pf(unsigned int start, unsigned int limit, unsigned int q);
int main(int argc, char *argv[])
{
printf("prime finder\n");
int j, slimits[4] = {1,10000000,20000000,30000000}, elimits[4] = {10000000,20000000,30000000,40000000};
double startTime = omp_get_wtime();
#pragma omp parallel shared(slimits, elimits primes)
{
#pragma omp for
for (j = 0; j < 4; j++)
{
primes += pf(slimits[j], elimits[j], atoi(argv[2]));
}
}
printf("%d prime numbers found in %.2f seconds.\n\n", primes, omp_get_wtime() - startTime);
return 0;
}
I havn't included the pf function as it is quite large but it works on its own, it returns the number of primes found. Im sure the issue is here somewhere.
Any help would be greatly appreciated!
You have made at least one obvious (to me) and serious mistake. You've declared primes shared and allowed all the threads in the program to update it. You have, thereby, programmed a data race. Nothing in OpenMP (nor in C if I recall correctly) guarantees that += will be implemented atomically. You haven't actually specified what the problem with your program is, or what the problems are, but this must surely be one of them.
I'll tell you how to fix this later but I think there is a more serious underlying design problem you should address first. You seem to have decided that you would have 4 threads running and that you should divide the range of integers to test for primality into 4 and pass one chunk to each thread. Sure, you can make that work but it's not a smart approach to using OpenMP. Nor is it a smart approach to dividing the work of primality testing.
A smarter approach to OpenMP program design is to start off by making no assumptions about the number of threads that will be available to the executing program. Design for any number of threads, do not design a program whose behaviour depends on the number of threads it gets at run-time. Use OpenMP's facilities, specifically the schedule clause, to distribute the workload at run time.
Turning to primality testing. Draw, or at least think about, a scatter plot of points (i,t(i)), where i is an integer and t(i) is the time it takes to determine whether or not i is prime. The pattern in this plot is about as difficult to discern as the pattern in the plot of the occurrence of primes in the integers. In other words, the time to determine the primality of an integer is very unpredictable. It does tend to rise as the integers increase (well, excluding large even integers which I'm sure your test doesn't consider anyway).
One implication of this unpredictability is that if you divide a range of integers into N sub-ranges and give one sub-range to each of N threads you are not giving the threads the same amount of work to do. Indeed, in the range of integers 1..m (any m) there is one integer which takes much longer to test than any other integer in the range, and this time is the irreducible minimum that your program will take. A naive distribution of the range will produce a seriously unbalanced workload.
Here's what I think you should do to fix your program.
First, write a function which tests the primality of a single integer. This will be the basic task for your computation. Call this is_prime. Next, study the schedule clause for the parallel for construct. OpenMP provides a number of task scheduling options, I won't explain them here, you will find plenty of good documentation online. Finally, study also the reduction clause; this provides the solution to the data race you have programmed.
Applying all this I suggest you change
#pragma omp parallel shared(slimits, elimits primes)
{
#pragma omp for
for (j = 0; j < 4; j++)
{
primes += pf(slimits[j], elimits[j], atoi(argv[2]));
}
}
to
#pragma omp parallel shared(slimits, elimits, max_int_to_test)
{
#pragma omp for reduction(+:primes) schedule (dynamic, 10)
for (j = 3; j < max_int_to_test; j += 2)
{
primes += is_prime(j);
}
}
With any luck my rudimentary C hasn't screwed up the syntax too much.

Thread-safe random number generation for Monte-Carlo integration

Im trying to write something which very quickly calculates random numbers and can be applied on multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#define N 1000000000 /* As lareg as possible for increased accuracy */
double random_function(void);
int main(void)
{
int i = 0;
double X, Y;
double count_inside_temp = 0.0, count_inside = 0.0;
unsigned int th_id = omp_get_thread_num();
#pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
{
srand(th_id);
#pragma omp for schedule(static)
for (i = 0; i <= N; i++) {
X = 2.0 * random_function() - 1.0;
Y = 2.0 * random_function() - 1.0;
if ((X * X) + (Y * Y) < 1.0) {
count_inside_temp += 1.0;
}
}
#pragma omp atomic
count_inside += count_inside_temp;
}
printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0)/ N);
return 0;
}
double random_function(void)
{
return ((double) rand() / (double) RAND_MAX);
}
This works but from observing a resource manager I know its not using all the threads. Does rand() work for multithreaded code? And if not is there a good alternative? Many Thanks. Jack
Is rand() thread safe? Maybe, maybe not:
The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test and good learning exercise would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which take an integer as input and return an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a staring number (the seed) use the output value in your application and as the input for the next call to the random number generator. There are two common approaches to designing an API:
Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in sort of global variable. Generating a new random number either will not be thread safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreded code (you end up with some serialization around the state value).
A interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state, get_prng(state), which returns a random number and changes the state variable, and destroy_peng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread safe and run in parallel with no locking (because you are in charge of managing the (now thread local) state variable.
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNG's in C which expose the state to your control. PRNG looks good and using this (with initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this libraries often have something like get_prng_array(state) functions which give you back an array full of random numbers as if you put get_prng in a loop filling the array elements - they just do it more quickly. This would be a second optimization (and would need an added for loop inside the parallel for loop. Obviously, you don't want to run out of per-thread stack space doing this!

Resources