libgtop function glibtop_get_cpu() information breaks down if a processor is disabled - linux

I am pretty much a first-timer at this, so please feel free to tell me where I have not followed correct procedure. I will do better next time.
My claim: the libgtop function glibtop_get_cpu() returns broken per-CPU information if a processor is disabled.
My environment: I have disabled processor #1 (of 0,1,2,3) because of a hardware issue with the motherboard. Since that time, and presumably as a result, gnome-system-monitor reports the machine as having 3 CPUs (which is correct) and labels them CPU1, CPU2 and CPU3 (I am not wild about those labels, but that is a discussion for another time). The more important problem is that the values for CPU2 and CPU3 are always zero. When I compare gnome-system-monitor's CPU figures to ‘top’ (using the ‘1’ key to show individual processors), they don't match: ‘top’ shows non-zero values while gnome-system-monitor shows zero.
‘top’ reports %Cpu0, %Cpu2 and %Cpu3, with no sign of CPU 1. More importantly, the numeric values for these labels are non-zero, and when I use the ‘stress’ command the values move around as expected. ‘top’ indicates the individual processors are at 100% while gnome-system-monitor says 0.
Summary so far: ‘top’ gives plausible per-CPU figures while gnome-system-monitor does not. On my system I have disabled CPU 1 (0-based index) and see that CPU2 and CPU3 (gnome-system-monitor's 1-based labels) always show zero.
I have been reading and modifying the gnome-system-monitor code to explore where these values come from, and I have determined that there is nothing ‘wrong’ with the gnome-system-monitor program per se, at least as far as the numeric CPU values are concerned. The data gnome-system-monitor uses comes from the libgtop library, specifically the glibtop_get_cpu() function, and the data returned by glibtop_get_cpu() is zero for every index of 1 or greater (0-based, as in the C++ code below).
It seems to me I need to see how glibtop_get_cpu() works, but I have had no luck finding the source for glibtop_get_cpu(). What should I do next? The library I am using is 2.38.0-2ubuntu0.18.04.1 on Ubuntu 18.04.1. I am happy to try any suggestions; I probably won't know how to do what you suggest, but I can learn.
Should I raise a bug? I would like to go deeper than that on this first pass if possible. I was hoping to look at the problem and propose a fix, but at the moment I am stuck.
Edit! (improvements suggested to the original question)
Incorrect output:
# echo 1 > /sys/devices/system/cpu/cpu1/online // bring all cpu online for the base case
$ ./test_get_cpu
glibtop_sysinfo()->ncpu is 4
xcpu_total[0] is 485898
xcpu_total[1] is 1532
xcpu_total[2] is 484263
xcpu_total[3] is 487052
$
# echo 0 > /sys/devices/system/cpu/cpu1/online // take cpu1 offline again
$ ./test_get_cpu
glibtop_sysinfo()->ncpu is 3 // ncpu is correct
xcpu_total[0] is 501416
$
# echo 1 > /sys/devices/system/cpu/cpu1/online // bring cpu1 online
# echo 0 > /sys/devices/system/cpu/cpu2/online // … and take cpu2 offline
$ ./test_get_cpu
glibtop_sysinfo()->ncpu is 3
xcpu_total[0] is 508264
xcpu_total[1] is 5416
$
Interpretation: as anticipated, taking 'cpu2' offline means we can't see 'cpu3' in the glibtop_get_cpu() result. By (risky) induction, I think that if we take 'cpuN' offline, we will not get any statistics for 'cpuN' or any higher-numbered CPU.
That is my evidence for something wrong with glibtop_get_cpu().
My Code:
#include <iostream>
using namespace std;
#include <glibtop/cpu.h>
#include <glibtop/sysinfo.h>
int main() {
    const glibtop_sysinfo * sysinfo = glibtop_get_sysinfo();
    glibtop_cpu cpu;
    glibtop_get_cpu(&cpu);
    cout << "glibtop_sysinfo()->ncpu is " << sysinfo->ncpu << endl;
    //for (int i = 0; i < sysinfo->ncpu; ++i) {  // e.g. ncpu might be 3 if one processor is disabled on a quad core
    for (int i = 0; i < GLIBTOP_NCPU; ++i) {     // alternatively, look through all 1024 slots
        if (cpu.xcpu_total[i] != 0) {
            cout << "xcpu_total[" << i << "] is " << cpu.xcpu_total[i] << endl;
        }
    }
    return 0;
}

I have found the code I was looking for at https://github.com/GNOME/libgtop
Sorry to have wasted anyone's time. I don't know precisely how that code works; for example, I don't know how/if/where glibtop_get_cpu_l() is defined. But I can see enough in the code to realize that, as it stands, it reads /proc/stat, and if a specific "cpu" entry isn't found, that is treated as a 'warning' and logged somewhere (I don't know where yet) and the rest of the CPUs are skipped. I will do more work on this in my own time.
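For illustration only (this is not libgtop's code), a more tolerant approach might key each statistic off the cpuN label actually present in /proc/stat rather than assuming consecutive indexes. A minimal sketch:

#include <cctype>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream stat("/proc/stat");
    std::string line;
    while (std::getline(stat, line)) {
        // Per-CPU lines look like "cpuN user nice system idle ...".
        // Offline CPUs simply do not appear, so key on the label that is
        // actually there instead of counting 0..ncpu-1.
        if (line.compare(0, 3, "cpu") != 0 || line.size() < 4 ||
            !std::isdigit(static_cast<unsigned char>(line[3])))
            continue;
        std::istringstream in(line);
        std::string label;
        unsigned long long user, nice, system, idle;
        in >> label >> user >> nice >> system >> idle;
        std::cout << label << " total (user+nice+system+idle): "
                  << (user + nice + system + idle) << std::endl;
    }
    return 0;
}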

Related

How to get delta percentage from /proc/schedstat

I am trying to get the node's CFS scheduler throttling in percent. For that I read two values twice (ignoring timeslices) from /proc/schedstat, which has the following format:
$ cat /proc/schedstat
version 15
timestamp 4297299139
cpu0 0 0 0 0 0 0 1145287047860 105917480368 8608857
(the 7th and 8th fields are what I call CpuTime and RunqTime below)
So I read the file, sleep for some time, read it again, calculate the elapsed time and the delta between the values, and then calculate the percentage using the following code:
cputTime := float64(delta.CpuTime) / delta.TimeDelta / 10000000
runqTime := float64(delta.RunqTime) / delta.TimeDelta / 10000000
percent := runqTime
The trick is that the percentage can come out to something like 2000%.
I assumed that runqtime is cumulative and expressed in nanoseconds, so I divided it by 10^7 (to bring it into a 0-100% range), and timedelta is the difference between measurements in seconds. What is wrong with this? How do I do it properly?
I, for one, do not know how to interpret the output of /proc/schedstat.
You do quote an answer to a unix.stackexchange question, with a link to a mail in LKML that mentions a possible patch to the documentation.
However, "schedstat" is a term which is suspiciously missing from my local man proc page, and from the copies of man proc I could find on the internet. Actually, when searching for schedstat on Google, the results I get either do not mention the word "schedstat" (for example : I get links to copies of the man page, which mentions "sched" and "stat"), or non authoritative comments (fun fact : some of them quote that answer on stackexchange as a reference ...)
So at the moment: if I really had to understand what is in that output, I think I would read the scheduler code for my version of the kernel.
As far as "how do you compute delta ?", I understand what you intend to do, I had in mind something more like "what code have you written to do it ?".
By running cat /proc/schedstat; sleep 1 in a loop on my machine, I see that the "timestamp" entry increases by ~250 units on each iteration (so I honestly can't say what the underlying unit for that field is ...).
To compute delta.TimeDelta: do you use that field, or do you take two instances of time.Now()?
The other deltas are less ambiguous; I imagine you took the difference between the counters you see :)
Do note that, on my mostly idle machine, I sometimes see increments of more than 10^9 over one second on these counters. So again: I do not know how to interpret these numbers.
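Leaving the units aside, the two-sample mechanics described in the question could be sketched like this. It pulls out the 7th and 8th fields of the cpu0 line (as labelled in the question) and reports the raw deltas over a measured wall-clock interval, without assuming what those fields mean:

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Read the 7th and 8th numeric fields of the "cpu0" line in /proc/schedstat.
static bool read_cpu0(unsigned long long &f7, unsigned long long &f8) {
    std::ifstream in("/proc/schedstat");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string label;
        fields >> label;
        if (label != "cpu0")
            continue;
        std::vector<unsigned long long> v;
        unsigned long long x;
        while (fields >> x)
            v.push_back(x);
        if (v.size() < 8)
            return false;
        f7 = v[6];  // what the question calls CpuTime
        f8 = v[7];  // what the question calls RunqTime
        return true;
    }
    return false;
}

int main() {
    unsigned long long a7, a8, b7, b8;
    auto t0 = std::chrono::steady_clock::now();
    if (!read_cpu0(a7, a8)) return 1;
    std::this_thread::sleep_for(std::chrono::seconds(1));
    if (!read_cpu0(b7, b8)) return 1;
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::cout << "elapsed " << secs << " s, delta field7 " << (b7 - a7)
              << ", delta field8 " << (b8 - a8) << std::endl;
    return 0;
}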

malloc/realloc/free capacity optimization

When you have a dynamically allocated buffer whose size varies at runtime in unpredictable ways (for example a vector or a string), one way to optimize its allocation is to only resize its backing store at powers of 2 (or some other set of boundaries/thresholds) and leave the extra space unused. This amortizes the cost of finding new free memory and copying the data across, at the expense of a little extra memory use. For example, the interface specification (reserve vs resize vs trim) of many C++ STL containers has such a scheme in mind.
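As a quick illustration of that container-side behaviour (the growth factor is implementation-defined, commonly 1.5x or 2x), a small test shows capacity() jumping in steps rather than one element at a time:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    std::size_t last_capacity = 0;
    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_capacity) {
            // capacity grows by a constant factor, not by one element
            std::cout << "size " << v.size() << " -> capacity " << v.capacity() << '\n';
            last_capacity = v.capacity();
        }
    }
    return 0;
}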
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?
From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests (>= 128 kB by default, IIRC):
if (chunk_is_mmapped(oldp))
{
    void* newmem;

#if HAVE_MREMAP
    newp = mremap_chunk(oldp, nb);
    if(newp) return chunk2mem(newp);
#endif
    /* Note the extra SIZE_SZ overhead. */
    if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
    /* Must alloc, copy, free. */
    newmem = public_mALLOc(bytes);
    if (newmem == 0) return 0; /* propagate failure */
    MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
    munmap_chunk(oldp);
    return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant-factor increase that e.g. the C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundary, plus alignment and so on).
I suppose the idea is that if the user wants this factor-of-2 size increase (or any other constant-factor increase, in order to keep the number of reallocations logarithmic when growing repeatedly), then the user can implement it himself on top of the facility provided by the C library.
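A minimal sketch of what such a user-side wrapper could look like (illustrative only; it grows the capacity geometrically on top of realloc, so repeated appends cause only O(log n) reallocations):

#include <cstdlib>
#include <cstring>

// Hypothetical growable buffer with doubling capacity.
struct GrowBuf {
    char  *data     = nullptr;
    size_t size     = 0;
    size_t capacity = 0;
};

bool grow_to(GrowBuf &b, size_t needed) {
    if (needed <= b.capacity)
        return true;
    size_t new_cap = b.capacity ? b.capacity : 16;
    while (new_cap < needed)
        new_cap *= 2;                      // constant-factor growth
    void *p = std::realloc(b.data, new_cap);
    if (!p)
        return false;                      // old buffer still valid on failure
    b.data = static_cast<char *>(p);
    b.capacity = new_cap;
    return true;
}

bool append(GrowBuf &b, const char *bytes, size_t n) {
    if (!grow_to(b, b.size + n))
        return false;
    std::memcpy(b.data + b.size, bytes, n);
    b.size += n;
    return true;
}

int main() {
    GrowBuf b;
    for (int i = 0; i < 1000; ++i)
        append(b, "x", 1);                 // few reallocations despite 1000 appends
    std::free(b.data);
    return 0;
}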
Perhaps you can use malloc_usable_size (google for it) to find the answer experimentally. That function, however, seems to be undocumented, so you will need to check whether it is available on your platform.
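For example, an experiment along those lines might look like this (assuming malloc_usable_size is declared in <malloc.h> on your platform; the output is specific to your allocator build, and error handling is omitted for brevity):

#include <cstdio>
#include <cstdlib>
#include <malloc.h>   // malloc_usable_size (glibc-specific)

int main() {
    for (size_t n = 16; n <= (1u << 20); n *= 2) {
        void *p = std::malloc(n);
        size_t usable = malloc_usable_size(p);      // slack already granted by the allocator
        void *q = std::realloc(p, n + n / 2);       // grow the block by 50%
        std::printf("N=%zu usable=%zu realloc moved: %s\n",
                    n, usable, (q == p) ? "no" : "yes");
        std::free(q);
    }
    return 0;
}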
See also How to find how much space is allocated by a call to malloc()?

independent searches on GPU -- how to synchronize its finish?

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.
I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs) until at least 100 trues are found. The estimate for p is then 100/n, where n is the total number of times generateRandomNumbersAndTestThem() was executed.
For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I'm looking to do this on GPUs. But I'm struggling a bit with how to implement the stop condition properly. My idea was to have something along these lines as the kernel:
__kernel void sampleKernel(all_the_input, __global unsigned long *totAttempts) {
    int gid = get_global_id(0);
    //here code that localizes all_the_input for faster access
    while (lessThan100truesFound) {
        totAttempts[gid]++;
        if (generateRandomNumbersAndTestThem())
            reportTrue();
    }
}
How should I implement this without severe performance loss, given that:
triggering the "if" will be a very rare event, so it is not a problem if all threads have to wait while reportTrue() is executed;
lessThan100truesFound has to be modified only once (from true to false), when reportTrue() is called for the 100th time (so I don't even know if a boolean is the right way);
the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD 7970s. But it would be nice if I could test it on my current HD 5450.
I assume something similar to Java's "synchronized" modifier is possible, but I fail to find the exact way to do it. What is the "right" way to do this, i.e. any way that works without severe performance loss?
I'd suggest not using a global flag to stop the kernel, but rather running the kernel for a certain number of attempts, checking on the host whether you have accumulated enough 'successes', and repeating if necessary. Using a loop of undefined length in the kernel is a bad idea, since a long-running kernel can be killed by the GPU driver's watchdog timer. Besides, checking a global variable on every iteration would certainly hurt kernel performance.
This way, reportTrue could be implemented as an atomic_inc on a counter residing in global memory.
__kernel void sampleKernel(all_the_input, __global unsigned long *successes) {
    int gid = get_global_id(0);
    //here code that localizes all_the_input for faster access
    for (int i = 0; i < ATT_PER_THREAD; ++i) {
        if (generateRandomNumbersAndTestThem())
            atomic_inc(successes);
    }
}
ATT_PER_THREAD should be adjusted depending on how long it takes to execute generateRandomNumbersAndTestThem(). Kernel launch overhead is pretty small, so there is usually no need to make a single kernel run for more than 0.1-1 second.
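A rough sketch of the host-side loop this implies, written against the plain OpenCL C API. The platform/context/queue/kernel/buffer setup is assumed to exist elsewhere, the successes buffer is assumed to already be set as a kernel argument, and error checking is omitted, so treat this as an outline rather than production code:

#include <CL/cl.h>
#include <cstdio>

unsigned long long run_until_enough(cl_command_queue queue,
                                    cl_kernel kernel,
                                    cl_mem successes_buf,      // holds one cl_ulong
                                    size_t global_size,
                                    cl_ulong att_per_thread,
                                    cl_ulong wanted_successes) {
    cl_ulong zero = 0;
    unsigned long long total_attempts = 0, total_successes = 0;

    while (total_successes < wanted_successes) {
        cl_ulong batch = 0;

        // Reset the counter, run one bounded batch, read the result back.
        clEnqueueWriteBuffer(queue, successes_buf, CL_TRUE, 0,
                             sizeof(zero), &zero, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, successes_buf, CL_TRUE, 0,
                            sizeof(batch), &batch, 0, NULL, NULL);

        total_successes += batch;
        total_attempts  += (unsigned long long)global_size * att_per_thread;
        std::printf("successes so far: %llu\n", (unsigned long long)total_successes);
    }
    // The estimate of p is then wanted_successes / (double)total_attempts.
    return total_attempts;
}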

Is there a way to check whether the processor cache has been flushed recently?

On i386 Linux. Preferably in C (the C/POSIX standard libraries) or via /proc if possible. If not, is there any piece of assembly or third-party library that can do this?
Edit: I'm trying to develop a test of whether a kernel module clears a cache line or the whole processor cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible.
Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>
inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

volatile int i;

inline void
test()
{
    uint64_t start, end;
    volatile int j;

    start = rdtsc();
    j = i;
    end = rdtsc();
    printf("took %lu ticks\n", end - start);
}

int
main(int ac, char **av)
{
    test();
    test();
    printf("flush: ");
    clflush(&i);
    test();
    test();
    return 0;
}
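Regarding the calibration remark above, a rough way to measure the overhead of rdtsc itself (the same rdtsc trick is repeated here so the snippet stands alone) is to time back-to-back reads and take the minimum over many runs:

#include <cstdint>
#include <cstdio>

// Minimum of many back-to-back readings approximates the fixed cost of the
// measurement itself, which you can subtract from the timings above.
static inline uint64_t rdtsc()
{
    unsigned long a, d;
    asm volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

int main()
{
    uint64_t best = ~(uint64_t)0;
    for (int n = 0; n < 1000; ++n) {
        uint64_t start = rdtsc();
        uint64_t end = rdtsc();
        if (end - start < best)
            best = end - start;
    }
    printf("rdtsc overhead ~ %llu ticks\n", (unsigned long long)best);
    return 0;
}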
I don't know of any generic command to get the cache state, but there are some approaches:
1. I guess this is the easiest: if you have the kernel module, just disassemble it and look for cache invalidation / flushing instructions (three come to mind: WBINVD, CLFLUSH, INVD).
2. You said it is for i386, but I guess you don't mean an actual 80386. The problem is that there are many different processors with different extensions and features. For example, the newest Intel series includes performance/profiling registers for the cache system, which you can use to evaluate cache misses/hits/number of transfers and similar.
3. Similar to 2, and very dependent on the system you have: in a multiprocessor configuration you could watch the first processor's cache coherence protocol (MESI) traffic from the second.
You mentioned WBINVD - AFAIK that will always flush the complete cache, i.e. all cache lines.
It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

When 2 ints are stored in Visual Studio, the difference between their locations comes out to be 12 bytes. Is there a reason for this?

When I run the following program in VC++ 2008 Express, I get the difference in location between two consecutively declared integers as '12' instead of the expected '4'. On other compilers, the answer comes out to be '4'. Is there a particular reason why it is '12'?
#include <iostream>
using namespace std;
int main()
{
    int num1, num2;
    cin >> num1 >> num2;
    cout << &num1 << endl << &num2 << endl;
    cout << int(&num1) - int(&num2) << endl; //Here it shows the difference as 12.
    cout << sizeof(num1);                    //Here it shows the size as 4.
    return 0;
}
I'm going to make a wild guess and say that you built it in debug mode. Try building it in release mode and see what you get. I know the C++ runtime will place memory guards around allocated memory in debug mode to catch buffer overflows; I don't know if it does something similar with variables on the stack.
You could be developing code for a computer in China, or it may be that there is a small and rare deficiency in the specific hardware you are using. One old model had difficulty with large numbers where the top bits become set: if the variables were in contiguous memory locations, it was found that a buildup of charge in the core memory could have a cross-effect on adjacent memory locations and alter their contents. Other possibilities are spare memory locations for detecting overflows and underflows, or that you are running 32-bit software mapped onto a 48-bit hardware architecture that was brought forward as a new model with the spare bits and bytes left unused.
