Debugging in threading building Blocks - multithreading

I would like to program in threading building blocks with tasks. But how does one do the debugging in practice?
In general the print method is a solid technique for debugging programs.
In my experience with MPI parallelization, the right way to do logging is that each thread print its debugging information in its own file (say "debug_irank" with irank the rank in the MPI_COMM_WORLD) so that the logical errors can be found.
How can something similar be achieved with TBB? It is not clear how to access the thread number in the thread pool as this is obviously something internal to tbb.
Alternatively, one could add an additional index specifying the rank when a task is generated but this makes the code rather complicated since the whole program has to take care of that.

First, get the program working with 1 thread. To do this, construct a task_scheduler_init as the first thing in main, like this:
#include "tbb/tbb.h"
int main() {
tbb::task_scheduler_init init(1);
...
}
Be sure to compile with the macro TBB_USE_DEBUG set to 1 so that TBB's checking will be enabled.
If the single-threaded version works, but the multi-threaded version does not, consider using Intel Inspector to spot race conditions. Be sure to compile with TBB_USE_THREADING_TOOLS so that Inspector gets enough information.
Otherwise, I usually first start by adding assertions, because the machine can check assertions much faster than I can read log messages. If I am really puzzled about why an assertion is failing, I use printfs and task ids (not thread ids). Easiest way to create a task id is to allocate one by post-incrementing a tbb::atomic<size_t> and storing the result in the task.
If I'm having a really bad day and the printfs are changing program behavior so that the error does not show up, I use "delayed printfs". Stuff the printf arguments in a circular buffer, and run printf on the records later after the failure is detected. Typically for the buffer, I use an array of structs containing the format string and a few word-size values, and make the array size a power of two. Then an atomic increment and mask suffices to allocate slots. E.g., something like this:
const size_t bufSize = 1024;
struct record {
const char* format;
void *arg0, *arg1;
};
tbb::atomic<size_t> head;
record buf[bufSize];
void recf(const char* fmt, void* a, void* b) {
record* r = &buf[head++ & bufSize-1];
r->format = fmt;
r->arg0 = a;
r->arg1 = b;
}
void recf(const char* fmt, int a, int b) {
record* r = &buf[head++ & bufSize-1];
r->format = fmt;
r->arg0 = (void*)a;
r->arg1 = (void*)b;
}
The two recf routines record the format and the values. The casting is somewhat abusive, but on most architectures you can print the record correctly in practice with printf(r->format, r->arg0, r->arg1) even if the the 2nd overload of recf created the record.
~
~

Related

Was: How does BPF calculate number of CPU for PERCPU_ARRAY?

I have encountered an interesting issue where a PERCPU_ARRAY created on one system with 2 processors creates an array with 2 per-CPU elements and on another system with 2 processors, an array with 128 per-CPU elements. The latter was rather unexpected to me!
The way I discovered this behavior is that a program that allocated an array for the number of CPUs (using get_nprocs_conf(3)) and then read in the PERCPU_ARRAY into it (using bpf_map_lookup_elem()) ended up writing past the end of the array and crashing.
I would like to find out what is the proper way to determine in a program that reads BPF maps the number of elements in a PERCPU_ARRAY used on a system.
Failing that, I think the second best approach is to pick a buffer for reading in that is "large enough." Here, the problem is similar: what is that number and is there way to learn it at runtime?
The question comes from reading the source of bpftool, which figures this out:
unsigned int get_possible_cpus(void)
{
int cpus = libbpf_num_possible_cpus();
if (cpus < 0) {
p_err("Can't get # of possible cpus: %s", strerror(-cpus));
exit(-1);
}
return cpus;
}
int libbpf_num_possible_cpus(void)
{
static const char *fcpu = "/sys/devices/system/cpu/possible";
static int cpus;
int err, n, i, tmp_cpus;
bool *mask;
/* ---8<--- snip */
}
So that's how they do it!

atomic_or not atomically performing operation

I have a kernel which uses a global uint array, and I want to access and change entries in that array from all threads using the atom_or function in OpenCL.
The code:
void SetMove(int c, uint m, volatile __global uint *prevmove)
{
uint idx = c >> 4;
uint mask = m << (2 * (c & 15));
atom_or(&prevmove[idx], mask);
}
I am implementing a BFS algorithm using OpenCL. This is done in 'waves', where in every wave, an array in of states is analyzed. Each GPU thread evaluates a single state, finds it's successors, and places them in a different array out. At the end of the wave the contents of out are places in in.
The prevmove array is to keep track of what action you should do in a certain state to find the goal state. The SetMove function updates prevmove when a thread finds a new state.
Now the problem is:
The results are always non-deterministic. If I would chance the atom_or operation with a normal, non atomic |= operator, the kernel behaves exactly the same. I have tested this by checking how many entries in prevmove are nonzero when the algorithm terminates. This varies everytime I run my program.
Is cause of my problem due to not implementing the atom_or correctly or is it something else?
(I have #pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable in my kernel).

Confusing result from counting page fault in linux

I was writing programs to count the time of page faults in a linux system. More precisely, the time kernel execute the function __do_page_fault.
And somehow I wrote two global variables, named pfcount_at_beg and pfcount_at_end, which increase once when the function __do_page_fault is executed at different locations of the function.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;
static void __kprobes
__do_page_fault(...)
{
struct vm_area_sruct *vma;
... // VARIABLES DEFINITION
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
pfcount_at_beg++; // I add THIS
...
...
// ORIGINAL CODE OF THE FUNCTION
...
pfcount_at_end++; // I add THIS
}
I expected that the value of pfcount_at_end is smaller than the value of pfcount_at_beg.
Because, I think, every time kernel executes the instructions of code pfcount_at_end++, it must have executed pfcount_at_beg++(Every function starts at the very beginning of the code).
On the other hand, as there are many conditional return between these two lines of code.
However, the result turns out oppositely. The value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables through a self-defined syscall. And I wrote the user level program to call the system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
printk( KERN_INFO "total pf_at_beg%lu\ntotal pf_at_end%lu\n", pfcount_at_beg, pfcount_at_end)
return 0;
}
// user-level program
#include<linux/unistd.h>
#include<sys/syscall.h>
#define __NR_mysyscall 223
int main()
{
syscall(__NR_mysyscall);
return 0;
}
Is there anybody who knows what exactly happened during this?
Just now I modified the code, to make pfcount_at_beg and pfcount_at_end static. However the result did not change, i.e. the value of pfcount_at_end is larger than the value of pfcount_at_beg.
So possibly it might be caused by in-atomic operation of increment. Would it be better if I use read-write lock?
The ++ operator is not garanteed to be atomic, so your counters may suffer concurrent access and have incorrect values. You should protect your increment as a critical section, or use the atomic_t type defined in <asm/atomic.h>, and its related atomic_set() and atomic_add() functions (and a lot more).
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).

Strange behavior of printk in linux kernel module

I am writing a code for linux kernel module and experiencing a strange behavior in it.
Here is my code:
int data = 0;
void threadfn1()
{
int j;
for( j = 0; j < 10; j++ )
printk(KERN_INFO "I AM THREAD 1 %d\n",j);
data++;
}
void threadfn2()
{
int j;
for( j = 0; j < 10; j++ )
printk(KERN_INFO "I AM THREAD 2 %d\n",j);
data++;
}
static int __init abc_init(void)
{
struct task_struct *t1 = kthread_run(threadfn1, NULL, "thread1");
struct task_struct *t2 = kthread_run(threadfn2, NULL, "thread2");
while( 1 )
{
printk("debug\n"); // runs ok
if( data >= 2 )
{
kthread_stop(t1);
kthread_stop(t2);
break;
}
}
printk(KERN_INFO "HELLO WORLD\n");
}
Basically I was trying to wait for threads to finish and then print something after that.
The above code does achieve that target but WITH "printk("debug\n");" not commented. As soon as I comment out printk("debug\n"); to run the code without debugging and load the module through insmod command, the module hangs on and it seems like it gets lost in recursion. I dont why printk effects my code in such a big way?
Any help would be appreciated.
regards.
You're not synchronizing the access to the data-variable. What happens is, that the compiler will generate a infinite loop. Here is why:
while( 1 )
{
if( data >= 2 )
{
kthread_stop(t1);
kthread_stop(t2);
break;
}
}
The compiler can detect that the value of data never changes within the while loop. Therefore it can completely move the check out of the loop and you'll end up with a simple
while (1) {}
If you insert printk the compiler has to assume that the global variable data may change (after all - the compiler has no idea what printk does in detail) therefore your code will start to work again (in a undefined behavior kind of way..)
How to fix this:
Use proper thread synchronization primitives. If you wrap the access to data into a code section protected by a mutex the code will work. You could also replace the variable data and use a counted semaphore instead.
Edit:
This link explains how locking in the linux-kernel works:
http://www.linuxgrill.com/anonymous/fire/netfilter/kernel-hacking-HOWTO-5.html
With the call to printk() removed the compiler is optimising the loop into while (1);. When you add the call to printk() the compiler is not sure that data isn't changed and so checks the value each time through the loop.
You can insert a barrier into the loop, which forces the compiler to reevaluate data on each iteration. eg:
while (1) {
if (data >= 2) {
kthread_stop(t1);
kthread_stop(t2);
break;
}
barrier();
}
Maybe data should be declared volatile? It could be that the compiler is not going to memory to get data in the loop.
Nils Pipenbrinck's answer is spot on. I'll just add some pointers.
Rusty's Unreliable Guide to Kernel Locking (every kernel hacker should read this one).
Goodbye semaphores?, The mutex API (lwn.net articles on the new mutex API introduced in early 2006, before that the Linux kernel used semaphores as mutexes).
Also, since your shared data is a simple counter, you can just use the atomic API (basically, declare your counter as atomic_t and access it using atomic_* functions).
Volatile might not always be "bad idea". One needs to separate out
the case of when volatile is needed and when mutual exclusion
mechanism is needed. It is non optimal when one uses or misuses
one mechanism for the other. In the above case. I would suggest
for optimal solution, that both mechanisms are needed: mutex to
provide mutual exclusion, volatile to indicate to compiler that
"info" must be read fresh from hardware. Otherwise, in some
situation (optimization -O2, -O3), compilers might inadvertently
leave out the needed codes.

Two structs, one references another

typedef struct Radios_Frequencia {
char tipo_radio[3];
int qt_radio;
int frequencia;
}Radiof;
typedef struct Radio_Cidade {
char nome_cidade[30];
char nome_radio[30];
char dono_radio[3];
int numero_horas;
int audiencia;
Radiof *fre;
}R_cidade;
void Cadastrar_Radio(R_cidade**q){
printf("%d\n",i);
q[0]=(R_cidade*)malloc(sizeof(R_cidade));
printf("informa a frequencia da radio\n");
scanf("%d",&q[0]->fre->frequencia); //problem here
printf("%d\n",q[0]->fre->frequencia); // problem here
}
i want to know why this function void Cadastrar_Radio(R_cidade**q) does not print the data
You allocated storage for your primary structure but not the secondary one. Change
q[0]=(R_cidade*)malloc(sizeof(R_cidade));
to:
q[0]=(R_cidade*)malloc(sizeof(R_cidade));
q[0]->fre = malloc(sizeof(Radiof));
which will allocate both. Without that, there's a very good chance that fre will point off into never-never land (as in "you can never never tell what's going to happen since it's undefined behaviour).
You've allocated some storage, but you've not properly initialized any of it.
You won't get anything reliable to print until you put reliable values into the structures.
Additionally, as PaxDiablo also pointed out, you've allocated the space for the R_cidade structure, but not for the Radiof component of it. You're using scanf() to read a value into space that has not been allocated; that is not reliable - undefined behaviour at best, but most usually core dump time.
Note that although the two types are linked, the C compiler most certainly doesn't do any allocation of Radiof simply because R_cidade mentions it. It can't tell whether the pointer in R_cidade is meant to be to a single structure or the start of an array of structures, for example, so it cannot tell how much space to allocate. Besides, you might not want to initialize that structure every time - you might be happy to have left pointing nowhere (a null pointer) except in some special circumstances known only to you.
You should also verify that the memory allocation succeeded, or use a memory allocator that guarantees never to return a null or invalid pointer. Classically, that might be a cover function for the standard malloc() function:
#undef NDEBUG
#include <assert.h>
void *emalloc(size_t nbytes)
{
void *space = malloc(nbytes);
assert(space != 0);
return(space);
}
That's crude but effective. I use non-crashing error reporting routines of my own devising in place of the assert:
#include "stderr.h"
void *emalloc(size_t nbytes)
{
void *space = malloc(nbytes);
if (space == 0)
err_error("Out of memory\n");
return space;
}

Resources