According to the proc manual:
/proc/[pid]/stack (since Linux 2.6.29)
This file provides a symbolic trace of the function calls in
this process's kernel stack. This file is provided only if
the kernel was built with the CONFIG_STACKTRACE configuration
option.
So I wrote a program to test this:
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>
#include <pthread.h>
void *thread_func(void *p_arg)
{
pid_t pid = fork();
if (pid > 0) {
wait(NULL);
return 0;
} else if (pid == 0) {
sleep(1000);
return 0;
}
return NULL;
}
int main(void)
{
pthread_t t1, t2;
pthread_create(&t1, NULL, thread_func, "Thread 1");
pthread_create(&t2, NULL, thread_func, "Thread 2");
sleep(1000);
return 0;
}
After running it, I used pstack to check the threads of the process:
linux-uibj:~ # pstack 24976
Thread 3 (Thread 0x7fd6e4ed5700 (LWP 24977)):
#0 0x00007fd6e528d3f4 in wait () from /lib64/libpthread.so.0
#1 0x0000000000400744 in thread_func ()
#2 0x00007fd6e52860a4 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fd6e4fbb7fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fd6e46d4700 (LWP 24978)):
#0 0x00007fd6e528d3f4 in wait () from /lib64/libpthread.so.0
#1 0x0000000000400744 in thread_func ()
#2 0x00007fd6e52860a4 in start_thread () from /lib64/libpthread.so.0
#3 0x00007fd6e4fbb7fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fd6e569f700 (LWP 24976)):
#0 0x00007fd6e4f8d6cd in nanosleep () from /lib64/libc.so.6
#1 0x00007fd6e4f8d564 in sleep () from /lib64/libc.so.6
#2 0x00000000004007b1 in main ()
At the same time, check /proc/24976/stack:
linux-uibj:~ # cat /proc/24976/stack
[<ffffffff804ba1a7>] system_call_fastpath+0x16/0x1b
[<00007fd6e4f8d6cd>] 0x7fd6e4f8d6cd
[<ffffffffffffffff>] 0xffffffffffffffff
Process 24976 has 3 threads, and all of them are blocked in system calls (nanosleep and wait), so all 3 threads are now executing in kernel space and have effectively become kernel threads, right? If so, there should be 3 stacks in the /proc/[pid]/stack file, but there seems to be only 1.
How should I understand /proc/[pid]/stack?
Taken from the man pages for proc:
There are additional helpful pseudo-paths:
[stack]
The initial process's (also known as the main thread's) stack.
Just below this, you can find:
[stack:[tid]] (since Linux 3.4)
A thread's stack (where the [tid] is a thread ID).
It corresponds to the /proc/[pid]/task/[tid]/ path.
Which seems to be what you are looking for.
Nan Xiao is right.
The kernel-mode stack of each thread is under /proc/[PID]/task/[TID]/stack.
You are checking /proc/[PID]/stack, which is the main thread's stack, so you see only one. The others are under the task directory.
That works for sleeping locks. For spin locks, which show up as high system time, you might also look at perf with -g.
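As a rough illustration of the per-thread layout described above, here is a minimal sketch that walks /proc/[pid]/task and prints each thread's kernel stack file. The function names list_thread_ids and print_kernel_stacks are hypothetical helpers, not part of any library, and the sketch assumes a Linux kernel built with CONFIG_STACKTRACE; reading the stack files usually requires root.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), for callers inspecting their own process */

/* Collect up to max thread IDs of process pid from /proc/<pid>/task.
 * Returns the number of IDs stored, or -1 if the directory cannot be opened. */
int list_thread_ids(pid_t pid, pid_t *tids, int max)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;
    int n = 0;

    snprintf(path, sizeof path, "/proc/%ld/task", (long)pid);
    dir = opendir(path);
    if (!dir)
        return -1;
    while (n < max && (ent = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(ent->d_name, &end, 10);
        /* Keep purely numeric names only; this skips "." and "..". */
        if (end != ent->d_name && *end == '\0')
            tids[n++] = (pid_t)tid;
    }
    closedir(dir);
    return n;
}

/* Print /proc/<pid>/task/<tid>/stack for every thread of pid. */
void print_kernel_stacks(pid_t pid)
{
    pid_t tids[256];
    int n = list_thread_ids(pid, tids, 256);

    for (int i = 0; i < n; i++) {
        char path[96], line[256];
        FILE *f;

        snprintf(path, sizeof path, "/proc/%ld/task/%ld/stack",
                 (long)pid, (long)tids[i]);
        printf("== %s ==\n", path);
        f = fopen(path, "r");
        if (!f) {
            perror(path);   /* typically EACCES when not running as root */
            continue;
        }
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
    }
}
```

Calling print_kernel_stacks(24976) as root for the test program above should show one stack per LWP (24976, 24977 and 24978) rather than the single stack in /proc/24976/stack.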
I have this piece of code, which works perfectly in the normal case. However, sometimes a thread gets into an uncancelable sleep state.
That is, from the state of the process, I see this thread ending up in https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/nanosleep_nocancel.c.html#__nanosleep_nocancel
struct timespec convertticktotimespec(unsigned long numticks)
{
struct timespec tm;
/* separate the integer and decimal portions */
long nanoseconds =
((numticks / (float)sysconf(_SC_CLK_TCK)) - floor(numticks / (float)sysconf(_SC_CLK_TCK))) *
NANOSEC_MULTIPLIER;
tm.tv_sec = numticks / sysconf(_SC_CLK_TCK);
tm.tv_nsec = nanoseconds;
return tm;
}
void *thread(void *args)
{
struct_S *s = (struct_S *)args;
while(1)
{
s->var = 1;
struct timespec tm = convertticktotimespec(sysClkRateGet() * 13);
if ( 0 !=nanosleep(&tm, NULL) ) {
perror("nanosleep");
}
}
}
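A side note on the conversion helper: a float carries only about 24 bits of mantissa, so the fractional part can lose precision for large tick counts. Below is a minimal sketch of the same conversion using integer arithmetic only. The name ticks_to_timespec is hypothetical, and passing the tick rate as a parameter (instead of calling sysconf(_SC_CLK_TCK) inside) is an assumption made so the function can be checked in isolation. It is not claimed to be the cause of the behaviour described in this question.

```c
#include <time.h>

/* Convert a tick count to a struct timespec using integer arithmetic.
 * Precondition: ticks_per_sec > 0 (for example the value returned by
 * sysconf(_SC_CLK_TCK)). The result always has 0 <= tv_nsec < 1000000000. */
struct timespec ticks_to_timespec(unsigned long ticks, long ticks_per_sec)
{
    struct timespec ts;
    unsigned long rate = (unsigned long)ticks_per_sec;

    ts.tv_sec = (time_t)(ticks / rate);                 /* whole seconds */
    ts.tv_nsec = (long)((unsigned long long)(ticks % rate)
                        * 1000000000ULL / rate);        /* leftover ticks in ns */
    return ts;
}
```

With a rate of 100 ticks per second, 150 ticks becomes 1 second plus 500000000 nanoseconds.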
The stack trace looks like this:
Thread 19 (Thread 0x7f225a043700 (LWP 16023)):
#0 0x00007f225b8913ed in __accept_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000000000 in ?? ()
Thread 18 (Thread 0x7f225a076700 (LWP 15952)):
#0 0x00007f225b89126d in __close_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000000000 in ?? ()
Thread 14 (Thread 0x7f225a021700 (LWP 16035)):
#0 0x00007f225b8917dd in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000000000 in ?? ()
Thread 13 (Thread 0x7f225a032700 (LWP 16034)):
#0 0x00007f225b8917dd in __nanosleep_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000000000 in ?? ()
Thread 3 (Thread 0x7f225bbb3700 (LWP 15950)):
#0 0x00007f225ab1e3f3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x7f225a010700 (LWP 16036)):
#0 0x00007f225b8911ad in __write_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000000000 in ?? ()
Somehow this thread gets into an uncancelable sleep state at random. I can't find a clear definition of this state anywhere on the internet, so I assume the thread goes to sleep forever and cannot be interrupted, and hence stays inactive forever.
I have no clue why this is happening, given that the thread is responsible for executing only a few lines of code.
From code.woboq.org, I found that this gets called from the mutex lock code:
https://code.woboq.org/userspace/glibc/nptl/pthread_mutex_timedlock.c.html#416, but the thread is not using any mutex.
The only thing I suspect here is that the structure struct_S is allocated in shared memory. This variable is also accessed and assigned by another thread from another process. Does the thread get into this state internally, depending on the priority of the threads?
cudaMalloc seemed to have spawned a thread when it was called, even though it's asynchronous. This was observed during debugging using cuda-gdb.
It also took a while to return.
The same thread exited, although as a different LWP, at the end of the program.
Can someone explain this behaviour?
The thread is not specifically spawned by cudaMalloc. The user side CUDA driver API library seems to spawn threads at some stage during lazy context setup which have the lifetime of the CUDA context. The exact processes are not publicly documented.
You see this associated with cudaMalloc because, I would guess, it is the first API call to trigger whatever setup/callbacks need to be done to make the userspace driver support work. You should notice that only the first call spawns a thread. Subsequent calls do not. The threads stay alive for the lifetime of the CUDA context, after which they are terminated. You can trigger explicit thread destruction by calling cudaDeviceReset at any point in program execution.
Here is a trivial example which demonstrates cudaMemcpyToSymbol triggering the thread spawning from the driver API library, rather than cudaMalloc:
__device__ float someconstant;
int main()
{
cudaSetDevice(0);
const float x = 3.14159f;
cudaMemcpyToSymbol(someconstant, &x, sizeof(float));
for(int i=0; i<10; i++) {
int *x;
cudaMalloc((void **)&x, size_t(1024));
cudaMemset(x, 0, 1024);
cudaFree(x);
}
return int(cudaDeviceReset());
}
In gdb I see this:
(gdb) tbreak main
Temporary breakpoint 1 at 0x40254f: file gdb_threads.cu, line 5.
(gdb) run
Starting program: /home/talonmies/SO/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Temporary breakpoint 1, main () at gdb_threads.cu:5
5 cudaSetDevice(0);
(gdb) next
6 const float x = 3.14159f;
(gdb) next
7 cudaMemcpyToSymbol(someconstant, &x, sizeof(float));
(gdb) next
[New Thread 0x7ffff5eb5700 (LWP 14282)]
[New Thread 0x7fffed3ff700 (LWP 14283)]
8 for(int i=0; i<10; i++) {
(gdb) info threads
Id Target Id Frame
3 Thread 0x7fffed3ff700 (LWP 14283) "a.out" pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
2 Thread 0x7ffff5eb5700 (LWP 14282) "a.out" 0x00007ffff74d812d in poll () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7ffff7fd1740 (LWP 14259) "a.out" main () at gdb_threads.cu:8
(gdb) thread apply all bt
Thread 3 (Thread 0x7fffed3ff700 (LWP 14283)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007ffff65cad97 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffff659582d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff65ca4d8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ffff79bc182 in start_thread (arg=0x7fffed3ff700) at pthread_create.c:312
#5 0x00007ffff74e547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 2 (Thread 0x7ffff5eb5700 (LWP 14282)):
#0 0x00007ffff74d812d in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007ffff65c9953 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffff66571ae in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff65ca4d8 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ffff79bc182 in start_thread (arg=0x7ffff5eb5700) at pthread_create.c:312
#5 0x00007ffff74e547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 1 (Thread 0x7ffff7fd1740 (LWP 14259)):
#0 main () at gdb_threads.cu:8
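To observe the same thing without a debugger, one option is to read the thread count of the process before and after the first CUDA call. The following is a minimal host-side sketch in plain C; count_threads_in_process is a hypothetical helper name, and it relies on the Linux-specific "Threads:" field of /proc/self/status.

```c
#include <stdio.h>

/* Return the number of threads in the calling process, or -1 on error.
 * Reads the "Threads:" field of /proc/self/status (Linux-specific). */
int count_threads_in_process(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int n = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* Only the "Threads:" line matches; n is untouched otherwise. */
        if (sscanf(line, "Threads: %d", &n) == 1)
            break;
    }
    fclose(f);
    return n;
}
```

Printing the return value before cudaMemcpyToSymbol, after it, and after cudaDeviceReset in the example above would be one way to check that the extra threads appear once and disappear on reset.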
I'm working with libexpect, but if the read times out (expected return code EXP_TIMEOUT) I instead get a crash as follows.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f1366275bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f1366275bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f1366278fc8 in __GI_abort () at abort.c:89
#2 0x00007f13662b2e14 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f13663bf06b "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007f136634a7dc in __GI___fortify_fail (msg=<optimized out>) at fortify_fail.c:37
#4 0x00007f136634a6ed in ____longjmp_chk () at ../sysdeps/unix/sysv/linux/x86_64/____longjmp_chk.S:100
#5 0x00007f136634a649 in __longjmp_chk (env=0x1, val=1) at ../setjmp/longjmp.c:38
#6 0x00007f1366ed2a95 in ?? () from /usr/lib/libexpect.so.5.45
#7 <signal handler called>
#8 0x00007f1367334b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#9 0x000000000044cc13 in main (argc=3, argv=0x7fffca4013b8) at main_thread.c:6750
(gdb)
As you can see, I'm using nanosleep, which, unlike usleep and sleep, is supposed not to interact with signals (http://linux.die.net/man/2/nanosleep). As I understand it, libexpect uses SIGALRM to time out, but it's unclear to me how the two threads are interacting. If I had to guess, the expect call is raising a SIGALRM that interrupts the nanosleep call, but beyond that I don't know what's going on.
Thread 1:
while (stuff)
{
//dothings
struct timespec time;
time.tv_sec = 0; /* whole seconds (integer type) */
time.tv_nsec = 250000000;
nanosleep(&time, NULL);
}
Thread 2:
switch(exp_expectl(fd, exp_glob, (char*)user_prompt, OK, exp_end))
{
case OK:
DG_LOG_DEBUG("Received user prompt");
break;
case EXP_TIMEOUT:
DG_LOG_DEBUG("Expect timed out");
goto error;
default:
DG_LOG_DEBUG("Expect failed for unknown reasons");
goto error;
}
I have done some reading about signals and sleep, but I've used sleep in multiple threads on many occasions and had no difficulties until now. What am I missing?
edit: misc version info
ubuntu 14.04 3.13.0-44-generic
/usr/lib/libexpect.so.5.45
code is in C
compiler is gcc (-lexpect -ltcl)
include <tcl8.6/expect.h>
The gdb output for my problem:
Program received signal SIGINT, Interrupt.
0x00007ffff7bcb86b in __lll_lock_wait_private () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007ffff7bcb86b in __lll_lock_wait_private () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007ffff7bc8bf7 in _L_lock_21 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007ffff7bc8a6e in pthread_cond_destroy@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#3 0x0000000000400ab5 in control_destroy (mycontrol=0x6020c0) at control.c:20
#4 0x0000000000400f36 in cleanup_structs () at workcrew.c:160
#5 0x0000000000401027 in main () at workcrew.c:201
Note: the program runs successfully under Cygwin, but on Ubuntu Linux it deadlocks.
All the sub-threads had already been joined before the deadlock.
The source code is from the web: http://www.ibm.com/developerworks/cn/linux/thread/posix_thread3/thread-3.tar.gz
The bug is in control.c:
int control_destroy(data_control *mycontrol) {
int mystatus;
if (pthread_cond_destroy(&(mycontrol->cond)))
return 1;
if (pthread_cond_destroy(&(mycontrol->cond)))
return 1;
mycontrol->active=0;
return 0;
}
This was, presumably, supposed to destroy the mutex and the condition variable. But instead it destroys the condition variable twice.
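A minimal sketch of what the function was presumably meant to do follows. The layout of data_control (a mutex, a condition variable and an active flag) and the control_init helper are assumptions made so the example is self-contained; check them against the real control.h before relying on this.

```c
#include <pthread.h>

/* Assumed layout of data_control; the real definition lives in control.h. */
typedef struct data_control {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int active;
} data_control;

/* Hypothetical counterpart to control_destroy, included for completeness. */
int control_init(data_control *mycontrol)
{
    if (pthread_mutex_init(&mycontrol->mutex, NULL))
        return 1;
    if (pthread_cond_init(&mycontrol->cond, NULL))
        return 1;
    mycontrol->active = 0;
    return 0;
}

/* Destroy the condition variable once, then the mutex. */
int control_destroy(data_control *mycontrol)
{
    if (pthread_cond_destroy(&mycontrol->cond))
        return 1;
    if (pthread_mutex_destroy(&mycontrol->mutex))
        return 1;
    mycontrol->active = 0;
    return 0;
}
```

Destroying an already destroyed condition variable is undefined behaviour, which would explain why the original happens to work on one platform and hangs on another.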
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
static timer_t timer;
void timer_handle(union sigval sig)
{
printf("pthread=%lu ptr=%p\n", (unsigned long)pthread_self(), sig.sival_ptr);
}
void x_add_timer(void)
{
struct sigevent event;
struct itimerspec ts = {{0, 0}, {0, 10000}};
memset(&event, 0, sizeof(event));
event.sigev_notify = SIGEV_THREAD;
event.sigev_notify_function = timer_handle;
timer_create(CLOCK_MONOTONIC, &event, &timer);
timer_settime(timer, 0, &ts, NULL);
}
void x_del_timer(void)
{
timer_delete(timer);
}
int main()
{
int i;
struct timespec t = {0, 8000};
for (i = 0; i < 100; i++) {
x_add_timer();
nanosleep(&t, NULL);
x_del_timer();
}
return 0;
}
I am new to Linux programming and am learning the glibc timer API, but I have run into a strange problem.
I wrote the code above and compiled it with mips64-octeon-linux-gnu-gcc,
but a segmentation fault sometimes occurs when it runs on the device.
Is there anything wrong with the code?
Thanks a lot.
The core dump is:
Program terminated with signal 11, Segmentation fault.
[New process 16487]
[New process 16443]
[New process 16444]
#0 0x0000005558155568 in main_arena () from /lib64/libc.so.6
The full backtrace is:
Thread 3 (process 16444):
#0 0x00000055580c839c in clone () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000005558176d38 in do_clone () from /lib64/libpthread.so.0
No symbol table info available.
#2 0x0000005558177260 in pthread_create@@GLIBC_2.2 ()
from /lib64/libpthread.so.0
No symbol table info available.
#3 0x0000005557fb8cdc in timer_helper_thread () from /lib64/librt.so.1
No symbol table info available.
#4 0x0000005558177cec in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#5 0x00000055580c83ec in __thread_start () from /lib64/libc.so.6
No symbol table info available.
Thread 2 (process 16443):
#0 0x000000555808f0e4 in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x000000555808edfc in sleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000120000f54 in main () at hello.c:53
i = 100
t = {tv_sec = 0, tv_nsec = 8000}
Thread 1 (process 16487):
#0 0x0000005558155568 in main_arena () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000005557fb8d5c in timer_sigev_thread () from /lib64/librt.so.1
No symbol table info available.
#2 0x0000005558177cec in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00000055580c83ec in __thread_start () from /lib64/libc.so.6
No symbol table info available.
You have a clear race condition there.
With intervals this short (the timer is armed for 10 us), your
nanosleep(&t, NULL);
where t is set to timespec {0, 8000} (8 us), can return before the timer fires.
Hence x_del_timer() and timer_handle() can run in the wrong order.
Increase the nanosleep() time to reduce the probability of this happening.
I wouldn't be surprised if glibc support for OCTEON/mips64 has corner-case issues with timer APIs.