Mapping thread id from top to gdb - multithreading

I am using top to see per-thread CPU usage:
top -H -p `pgrep app.out`
It shows a PID for each thread, for example:
4015
4016
I have attached gdb to the application using the gdb attach command.
Now I want to switch to thread 4015, which appears in the top output.
How can I do that?
If I run thread 4015, gdb says there is no such thread, because gdb expects its own thread number.
So how can I map a top thread ID to a gdb thread ID?

You should be able to match the LWP number displayed by GDB with the thread ID shown by top.
In my quick tests with Firefox, top -H -p shows:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6492 kevin 20 0 1242m 386m 31m S 0.3 4.9 0:09.00 firefox
6470 kevin 20 0 1242m 386m 31m S 5.7 4.9 5:04.89 firefox
and GDB's info threads shows:
22 Thread 0x7fe3d2393700 (LWP 6492) "firefox" pthread_cond_timedwait...
...
* 1 Thread 0x7fe3dd868740 (LWP 6470) "firefox" __GI___poll ()...
EDIT: here is a brand-new gdb command, lwp_to_id <lwp>:
import gdb


class lwp_to_id(gdb.Command):
    """lwp_to_id LWP: show which gdb thread number corresponds to a kernel LWP."""

    def __init__(self):
        super(lwp_to_id, self).__init__("lwp_to_id", gdb.COMMAND_OBSCURE)

    def invoke(self, args, from_tty):
        lwp = int(args)
        # thr.ptid is a (pid, lwp, tid) tuple; element 1 is the LWP that top -H shows
        for thr in gdb.selected_inferior().threads():
            if thr.ptid[1] == lwp:
                print("LWP %s maps to thread #%d" % (lwp, thr.num))
                return
        print("LWP %s doesn't match any threads in the current inferior." % lwp)


lwp_to_id()  # instantiating the class registers the command with gdb
(This works at least on the trunk version of GDB; I am not sure about the official releases.)
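Where top -H gets its numbers: on Linux every thread is a kernel task listed under /proc/<pid>/task, and the directory name is the thread ID (LWP) that top shows in its PID column and that gdb prints as "LWP n". A minimal sketch, assuming Linux, Python 3.8+ (for threading.get_native_id) and a readable /proc; thread_ids is an illustrative name, not part of gdb:

```python
import os
import threading
from typing import List


def thread_ids(pid: int) -> List[int]:
    """Return the kernel thread IDs (LWPs) of a process, read from /proc.

    These are the numbers top -H -p PID lists, and the "LWP n" values
    that gdb prints in info threads.
    """
    return sorted(int(tid) for tid in os.listdir(f"/proc/{pid}/task"))


if __name__ == "__main__":
    # The main thread's LWP equals the process ID; other threads get their own.
    print("PID:", os.getpid(), "main thread LWP:", threading.get_native_id())
    print("all LWPs:", thread_ids(os.getpid()))
```

Any LWP from this list can then be looked up in info threads (or with the lwp_to_id command) to find gdb's own thread number.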

Do a
ps xjf
This will give you a tree of all processes with their pid and parent pid.
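The tree ps xjf prints is built from each process's parent PID. A minimal sketch of the same idea, assuming Linux, where field 4 of /proc/<pid>/stat is the PPID; read_ppid, children_map and print_tree are illustrative names:

```python
import os
from collections import defaultdict
from typing import Dict, List


def read_ppid(pid: int) -> int:
    """Parent PID from /proc/<pid>/stat.

    The command name (field 2) is in parentheses and may itself contain
    spaces or ')', so parse from the *last* ')'.
    """
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] the PPID


def children_map() -> Dict[int, List[int]]:
    """Map each PID to the PIDs of its direct children."""
    tree: Dict[int, List[int]] = defaultdict(list)
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            tree[read_ppid(int(entry))].append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return tree


def print_tree(tree: Dict[int, List[int]], pid: int = 0, depth: int = 0) -> None:
    """Print the descendants of pid as an indented tree (0 = everything)."""
    for child in sorted(tree.get(pid, [])):
        print("    " * depth + str(child))
        print_tree(tree, child, depth + 1)


if __name__ == "__main__":
    print_tree(children_map())
```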

Related

How to kill/stop a process that continuously refreshes its PID?

I recently installed Graylog2 onto my Ubuntu server for log monitoring. Soon after, I got an alert stating that my CPUs were reaching capacity. I then logged into my server over SSH and ran top. What I saw confused me and made it difficult to kill the process.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2462 graylog2 20 0 2103292 42684 16424 S 19.3 1.1 0:00.58 java
2470 graylog+ 20 0 2295612 46368 16032 S 13.0 1.1 0:00.39 java
1971 www-data 20 0 354808 36140 19392 S 10.0 0.9 0:00.61 php5
Every time top refreshes, the Graylog PIDs have increased, so I'm unable to kill it by PID.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16937 www-data 20 0 357988 52140 34244 S 45.3 1.3 0:07.45 php5-fpm
24588 graylog2 20 0 2079236 35464 15576 S 9.7 0.9 0:00.29 java
24547 graylog+ 20 0 2295612 37148 15640 S 8.0 0.9 0:00.24 java
What is the proper way to kill/stop a process that continuously re-instantiates itself like that?
I don't know Graylog, but perhaps killall can help you: it selects processes by name.
http://linux.die.net/man/1/killall
Please read the man page before using it.
I don't use it often, so I don't know its disadvantages (if there are any).
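Roughly how name-based killing works on Linux: scan /proc, compare each process's short command name, and signal the matches. A minimal sketch, assuming Linux, where /proc/<pid>/comm holds the command name truncated to 15 characters; pids_by_name and kill_by_name are illustrative names and this is not killall's actual implementation:

```python
import os
import signal
from typing import List


def pids_by_name(name: str) -> List[int]:
    """Find PIDs whose command name (/proc/<pid>/comm) equals name."""
    wanted = name[:15]  # the kernel truncates comm to 15 characters
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == wanted:
                    matches.append(int(entry))
        except OSError:
            continue  # process vanished or is not readable
    return matches


def kill_by_name(name: str, sig: int = signal.SIGTERM) -> List[int]:
    """Send sig to every process called name; return the PIDs signalled."""
    signalled = []
    for pid in pids_by_name(name):
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or owned by another user
    return signalled
```

Because it matches by name, kill_by_name("java") would hit every Java process you are allowed to signal, not only Graylog's, which is worth keeping in mind before using either this or killall.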

What could be the reason time info in ps command don't change while the process is active

I'm running a Java process (doing some database manipulation), and I ran ps -lp 5631232:
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
202001 A 205 5631232 263213 0 60 20 3f46b46120 70156 * pts/6 1:09 java
The TIME column has not changed for a long while. The status is A (active), so I don't think it has halted.
How can I find out what is going wrong? Can anyone tell me how to detect the problem and/or what the problem could be?
I'm using AIX system.

Process in a polling state?

Given a process ID, how can I tell if that process is currently blocked in a polling state? i.e. it has called poll() with a negative timeout, and is waiting for input to become ready.
On UNIX-like systems the command line utility 'ps' provides this information. There are many flavors of ps depending on the OS, so read the man page.
On a BSD-like system (mac):
ps -eo pid,user,cpu,state,comm
PID USER CPU STAT COMM
1 root 0 Ss /sbin/launchd
15 root 0 Ss /usr/libexec/kextd
90710 root 0 R+ ps
83804 joe 0 Ss /bin/bash
89631 joe 0 S+ ssh
where STAT is the process state. S means interruptible sleep. s (lower case) means session leader. '+' means it's in the foreground process group. R means running, or runnable (on run queue). There are many more possible states.
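On Linux the same state letter can be read directly from /proc, together with wchan, the kernel function a sleeping task is waiting in. A minimal sketch, assuming Linux (not the BSD/macOS ps shown above); proc_state is an illustrative name. Note that S alone only says "interruptible sleep": a process blocked in poll() shows S, but so does one blocked in read() or sleep(), so wchan is the better hint, and its exact contents vary by kernel (it may also read as "0" or be unavailable):

```python
import os
from typing import Optional, Tuple


def proc_state(pid: int) -> Tuple[str, Optional[str]]:
    """Return (state, wchan) for a Linux process.

    state is the letter ps shows in STAT (R, S, D, Z, T, ...).
    wchan is None when the kernel does not expose it.
    """
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Field 3 (state) follows the parenthesised command name.
    state = data[data.rindex(")") + 2:].split()[0]
    try:
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip() or None
    except OSError:
        wchan = None
    return state, wchan


if __name__ == "__main__":
    print(proc_state(os.getpid()))
```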

How to monitor number of syscalls executed by kernel?

I need to monitor the number of system calls executed by Linux.
I'm aware that vmstat has the ability to show this on BSD and AIX systems, but on Linux it can't (according to the man page).
Is there any counter in /proc? Or is there any other way to monitor it?
I wrote a simple SystemTap script (based on syscalls_by_pid.stp).
It produces output like this:
ProcessName #SysCalls
munin-graph 38609
munin-cron 8160
fping 4502
check_http_demo 2584
check_nrpe 2045
sh 1836
nagios 886
sendmail 747
smokeping 649
check_http 571
check_nt 376
pcscd 216
ping 108
check_ping 100
crond 87
stapio 69
init 56
syslog-ng 27
sshd 17
ntpd 9
hp-asrd 8
hald-addon-stor 7
automount 6
httpd 4
stap 3
flow-capture 2
gam_server 2
Total 61686
The script itself:
#! /usr/bin/env stap
#
# Print the system call count by process name in descending order.
#
global syscalls

probe begin {
    print ("Collecting data... Type Ctrl-C to exit and display results\n")
}

probe syscall.* {
    syscalls[execname()]++
}

probe end {
    printf ("%-20s %-s\n\n", "ProcessName", "#SysCalls")
    summary = 0
    foreach (procname in syscalls-) {
        printf("%-20s %-10d\n", procname, syscalls[procname])
        summary = summary + syscalls[procname]
    }
    printf ("\n%-20s %-d\n", "Total", summary)
}
As Jeff Foster said, you can use ptrace to trace system calls.
Also, you can use strace and ltrace
strace - trace system calls and signals
ltrace - A library call tracer
You can use ptrace to monitor all syscalls (see here)
I believe OProfile can do this.
I am not aware of a centralized way to monitor syscalls throughout the entire OS. Maybe do a ptrace on the init process and follow all children? But I don't know if that will work.
Your best bet is to write a patch to the kernel itself to do this. The closest thing to this that I've seen is a cgroup implementation for enforcing permissions on what syscalls can be executed at runtime. You can find the patch here:
https://github.com/luksow/syscalls-cgroup
It shouldn't be too much more work to throw a counter in there, from a kernel programming perspective.

How is it possible that kill -9 for a process on Linux has no effect?

I'm writing a plugin to highlight text strings automatically as you visit a web site. It's like highlighting search results, but automatic and for many words; for example, people with allergies could use it to make certain words really stand out when they browse a food site.
But I have a problem. When I try to close an empty, fresh FF window, it somehow blocks the whole process. When I kill the process, all the windows vanish, but the Firefox process stays alive (parent PID is 1, doesn't listen to any signals, has lots of resources open, still eats CPU, but won't budge).
So two questions:
How is it even possible for a process not to listen to kill -9 (neither as user nor as root)?
Is there anything I can do but a reboot?
[EDIT] This is the offending process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
digulla 16688 4.3 4.2 784476 345464 pts/14 D Mar28 75:02 /opt/firefox-3.0/firefox-bin
Same with ps -ef | grep firefox
UID PID PPID C STIME TTY TIME CMD
digulla 16688 1 4 Mar28 pts/14 01:15:02 /opt/firefox-3.0/firefox-bin
It's the only process left. As you can see, it's not a zombie, it's running! It doesn't listen to kill -9, no matter if I kill by PID or name! If I try to connect with strace, then the strace also hangs and can't be killed. There is no output, either. My guess is that FF hangs in some kernel routine but which?
[EDIT2] Based on feedback by sigjuice:
ps axopid,comm,wchan
can show you in which kernel routine a process hangs. In my case, the offending plugin was the Beagle Indexer (openSUSE 11.1). After disabling the plugin, FF was a quick and happy fox again.
As noted in comments to the OP, a process status (STAT) of D indicates that the process is in an "uninterruptible sleep" state. In real-world terms, this generally means that it's waiting on I/O and can't/won't do anything - including dying - until that I/O operation completes.
Processes in a D state will normally only be there for a fraction of a second before the operation completes and they return to R/S. In my experience, if a process gets stuck in D, it's most often trying to communicate with an unreachable NFS or other remote filesystem, trying to access a failing hard drive, or making use of some piece of hardware by way of a flaky device driver. In such cases, the only way to recover and allow the process to die is to either get the fs/drive/hardware back up and running so the I/O can complete or to give up and reboot the system. In the specific case of NFS, the mount may also eventually time out and return from the I/O operation (with a failure code), but this is dependent on the mount options and it's very common for NFS mounts to be set to wait forever.
This is distinct from a zombie process, which will have a status of Z.
Double-check that the parent ID is really 1. If not, and this is Firefox, first try sudo killall -9 firefox-bin. After that, try killing the specific process IDs individually with sudo kill -9 [process-id].
How is it even possible for a process not to listen to kill -9 (neither as user nor as root)?
If a process has gone <defunct> and then becomes a zombie with a parent of 1, you can't kill it manually; only init can. Zombie processes are already dead and gone - they've lost the ability to be killed as they are no longer processes, only a process table entry and its associated exit code, waiting to be collected. You need to kill the parent, and you can't kill init for obvious reasons.
But see here for more general information. A reboot will kill everything, naturally.
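The difference between "dead but still listed" and "not dying at all" can be seen from a parent's point of view. A minimal sketch, assuming Linux /proc (state_of and kill9_then_reap are illustrative names): SIGKILL ends the child at once, but until the parent collects its exit status the kernel keeps a process table entry in state Z. A process stuck in D, as described above, would instead stay in D and never reach Z until its I/O completes.

```python
import os
import signal
import subprocess
import sys
import time
from typing import Tuple


def state_of(pid: int) -> str:
    """Process state letter from /proc/<pid>/stat, as in ps's STAT column."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    return data[data.rindex(")") + 2:].split()[0]


def kill9_then_reap() -> Tuple[str, int]:
    """SIGKILL a child; return its state before reaping and its exit status."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    os.kill(child.pid, signal.SIGKILL)  # cannot be caught, blocked or ignored
    time.sleep(0.5)                     # give the kernel time to tear it down
    state = state_of(child.pid)         # expected 'Z': dead, not yet collected
    child.wait()                        # the parent collects the exit status
    return state, child.returncode      # a negative value is the killing signal


if __name__ == "__main__":
    print(kill9_then_reap())
```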
Is it possible that this process is restarted (for example by init) just as you kill it?
You can check this easily. If the PID is the same after kill -9 PID then the process wasn't killed, but if it has changed the process has been restarted.
I recently fell into the double-fork pitfall and landed on this page before finally finding my answer. The symptoms are identical even though the problem is not the same:
WYKINWYT: What You Kill Is Not What You Thought
Below is minimal test code, based on an example for an SNMP daemon:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char* argv[])
{
    // We omit the -f option (do not fork) to reproduce the problem
    char *options[] = {"/usr/local/sbin/snmpd", /*"-f",*/ "-d", "--master=agentx",
                       "-Dagentx", "--agentXSocket=tcp:localhost:1706", "udp:10161",
                       (char *) NULL};
    pid_t pid = fork();
    if (0 > pid) return -1;
    switch (pid)
    {
        case 0:
        {   // Child launches the SNMP daemon
            execv(options[0], options);
            exit(-2);              // only reached if execv() fails
        }
        default:
        {
            sleep(10);             // Simulate "long" activity
            kill(pid, SIGTERM);    // Kill what should be the child,
                                   // i.e. the SNMP daemon, I assume
            printf("Signal sent to %d\n", (int) pid);
            sleep(10);             // Simulate "long" operation before closing
            waitpid(pid, NULL, 0); // Reap the child we forked
            printf("SNMP should be now down\n");
            getchar();             // Blocking (for observation only)
            break;
        }
    }
    printf("Bye!\n");
    return 0;
}
During the first phase, the main process (7699) launches the SNMP daemon (7700), but we can see that this one is already defunct (a zombie). Besides, we can see another process (7702), whose parent is 1, with the options we specified:
[nils#localhost ~]$ ps -ef | tail
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7699 2832 0 23:11 pts/0 00:00:00 ./main
nils 7700 7699 0 23:11 pts/0 00:00:00 [snmpd] <defunct>
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7727 3706 0 23:11 pts/1 00:00:00 ps -ef
nils 7728 3706 0 23:11 pts/1 00:00:00 tail
After the simulated 10 seconds, we try to kill the only process we know about (7700), and it finally disappears once waitpid() reaps it. But process 7702 is still there:
[nils#localhost ~]$ ps -ef | tail
root 7431 2 0 23:00 ? 00:00:00 [kworker/u256:1]
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7699 2832 0 23:11 pts/0 00:00:00 ./main
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7751 3706 0 23:12 pts/1 00:00:00 ps -ef
nils 7752 3706 0 23:12 pts/1 00:00:00 tail
After we type a character for getchar(), our main process terminates, but the SNMP daemon with PID 7702 is still there:
[nils#localhost ~]$ ps -ef | tail
postfix 7399 1511 0 22:58 ? 00:00:00 pickup -l -t unix -u
root 7431 2 0 23:00 ? 00:00:00 [kworker/u256:1]
root 7439 2 0 23:00 ? 00:00:00 [kworker/1:0]
root 7494 2 0 23:03 ? 00:00:00 [kworker/0:1]
root 7544 2 0 23:08 ? 00:00:00 [kworker/0:2]
root 7605 2 0 23:10 ? 00:00:00 [kworker/1:2]
root 7698 729 0 23:11 ? 00:00:00 sleep 60
nils 7702 1 0 23:11 ? 00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils 7765 3706 0 23:12 pts/1 00:00:00 ps -ef
nils 7766 3706 0 23:12 pts/1 00:00:00 tail
Conclusion
Because we ignored the double-fork mechanism, we thought the kill had not succeeded. In fact, we simply killed the wrong process!
By adding the -f option (do not (double) fork), everything works as expected.
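The same trap can be reproduced without snmpd. A minimal sketch, assuming a Unix system with os.fork; double_fork_daemon and is_alive are illustrative names, and time.sleep stands in for the daemon's real work. The PID that fork() hands back belongs to an intermediate child that exits almost immediately, while the long-lived "daemon" is a grandchild with a different PID:

```python
import os
import signal
import time
from typing import Tuple


def double_fork_daemon(seconds: int = 30) -> Tuple[int, int]:
    """Start a toy daemon the way a double-forking server does.

    Returns (child_pid, daemon_pid): only child_pid came from our fork().
    """
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # First child: fork again, report the grandchild's PID, then exit.
        os.close(read_fd)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(write_fd)
            os.setsid()          # detach into a new session, like a daemon
            time.sleep(seconds)  # stand-in for the daemon's real work
            os._exit(0)
        os.write(write_fd, str(grandchild).encode())
        os._exit(0)              # the grandchild is orphaned and reparented
    os.close(write_fd)
    daemon = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(child, 0)         # reap the intermediate child
    return child, daemon


def is_alive(pid: int) -> bool:
    """Signal 0 checks that a PID exists without delivering anything."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


if __name__ == "__main__":
    child, daemon = double_fork_daemon()
    print(f"fork() returned {child}; the daemon runs as {daemon}")
    print(f"child alive: {is_alive(child)}, daemon alive: {is_alive(daemon)}")
    os.kill(daemon, signal.SIGTERM)  # signal the process that actually matters
```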
ps -ef | grep firefox
You will see three processes; kill them all:
sudo killall -9 firefox
Should work
EDIT: [PID] changed to firefox
You can also do a pstree and kill the parent. This makes sure that you get the entire offending process tree and not just the leaf.
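A related technique, not from the answer above, is to start the program in its own process group so the whole tree can be signalled at once instead of hunting for the parent. A minimal sketch using subprocess's start_new_session and os.killpg; start_in_own_group and kill_group are illustrative names. One caveat ties back to the double-fork answer: a daemon that calls setsid() itself leaves the group and will not receive the signal.

```python
import os
import signal
import subprocess
from typing import List


def start_in_own_group(argv: List[str]) -> subprocess.Popen:
    """Start argv in a new session, which also makes it a process-group leader."""
    return subprocess.Popen(argv, start_new_session=True)


def kill_group(leader: subprocess.Popen, sig: int = signal.SIGTERM) -> None:
    """Signal every process still in the leader's process group."""
    # After setsid(), the process-group ID equals the leader's PID.
    os.killpg(leader.pid, sig)


if __name__ == "__main__":
    p = start_in_own_group(["sh", "-c", "sleep 30 & sleep 30 & wait"])
    kill_group(p)
    print("exit status:", p.wait())
```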
