Do Linux kernel processes multithread?

For Linux user-space processes it seems pretty easy to determine which ones are multithreaded. You can use ps -eLf and look at the NLWP value for the number of threads, which also corresponds to the 'Threads:' value in /proc/$pid/status. Apparently, back in the days of LinuxThreads, the implementation was not POSIX compliant, but this stackoverflow answer says "POSIX.1 requires threads share a same process ID", which was rectified in NPTL. So NPTL allows nifty displays of threads with commands like ps -eLf, because the threads all share the same PID, and you can verify that under /proc/$pid/task/ by looking at the thread subdirectories belonging to that "parent" process.
I can't find similar thread groupings under a "parent" process for the kernel processes spawned by kthreadd, and I suspect an implementation difference, since a comment under this answer says "You can not use POSIX Threads in kernel-space" and the nifty thread grouping is a POSIX feature. Accordingly, with ps -eLf I never see multiple threads listed for kernel processes created by kthreadd, whose names appear in square brackets, like [ksoftirqd/0] or [nfsd], unlike the user-space processes created by init.
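For what it's worth, the grouping that ps -eLf displays can also be read straight out of /proc; here is a minimal sketch (the file name count_threads.c and the program itself are just an illustration, pass the PID to inspect as the argument) that simply counts the entries under /proc/<pid>/task:
/* count_threads.c - count the thread directories under /proc/<pid>/task */
#include <dirent.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    char path[64];
    struct dirent *de;
    int count = 0;
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
    DIR *dir = opendir(path);
    if (!dir) {
        perror(path);
        return 1;
    }
    /* every entry except "." and ".." is one thread (TID) */
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] != '.')
            count++;
    }
    closedir(dir);
    printf("PID %s has %d thread(s)\n", argv[1], count);
    return 0;
}
For any of the bracketed kernel processes, such as [nfsd] or [ksoftirqd/0], this reports exactly one task entry, matching the NLWP of 1 that ps shows for them.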
From the man page for pthreads (the user-space threading API):
A single process can contain multiple threads, all of which are
executing the same program. These threads share the same global
memory (data and heap segments), but each thread has its own stack
(automatic variables).
This, however, is precisely what I do not see for kernel "threads": one process containing multiple threads.
In short, I never see any of the processes listed by ps that are children of kthreadd with an NLWP (Threads) value higher than one, which makes me wonder whether any kernel process forks/parallelizes and multithreads the way user-space programs do (with pthreads). Where do the implementations differ?
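For comparison, this is the user-space model the man page describes; a minimal pthreads sketch (not taken from any real program) in which two threads share a global variable while each has its own stack:
/* shared_global.c - two threads share a global, each has its own stack.
   Build with: gcc shared_global.c -pthread */
#include <pthread.h>
#include <stdio.h>
int shared_counter = 0;                            /* data segment, shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void *worker(void *arg)
{
    int local = 0;                                 /* automatic variable, on this thread's own stack */
    for (int i = 0; i < 100000; i++) {
        local++;
        pthread_mutex_lock(&lock);
        shared_counter++;                          /* visible to every thread */
        pthread_mutex_unlock(&lock);
    }
    printf("worker %ld: local = %d\n", (long)arg, local);
    return NULL;
}
int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_counter = %d\n", shared_counter);   /* 200000 */
    return 0;
}
While this runs, ps -eLf shows a single PID with NLWP 3 (the main thread plus the two workers), and three TID entries appear under /proc/<pid>/task, which is exactly the grouping I never see for kthreadd's children.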
Practical example:
Output from ps auxf for the NFS processes.
root 2 0.0 0.0 0 0 ? S Jan12 0:00 [kthreadd]
root 1546 0.0 0.0 0 0 ? S Jan12 0:00 \_ [lockd]
root 1547 0.0 0.0 0 0 ? S Jan12 0:00 \_ [nfsd4]
root 1548 0.0 0.0 0 0 ? S Jan12 0:00 \_ [nfsd4_callbacks]
root 1549 0.0 0.0 0 0 ? S Jan12 2:40 \_ [nfsd]
root 1550 0.0 0.0 0 0 ? S Jan12 2:39 \_ [nfsd]
root 1551 0.0 0.0 0 0 ? S Jan12 2:40 \_ [nfsd]
root 1552 0.0 0.0 0 0 ? S Jan12 2:47 \_ [nfsd]
root 1553 0.0 0.0 0 0 ? S Jan12 2:34 \_ [nfsd]
root 1554 0.0 0.0 0 0 ? S Jan12 2:39 \_ [nfsd]
root 1555 0.0 0.0 0 0 ? S Jan12 2:57 \_ [nfsd]
root 1556 0.0 0.0 0 0 ? S Jan12 2:41 \_ [nfsd]
By default, starting the rpc.nfsd service (at least with the init.d service script) spawns 8 processes, or at least 8 entries that each have their own PID. If I wanted to write a multithreaded version of NFS, which is implemented as a kernel module, with those nfsd "processes" as a front end, why couldn't I group the default 8 nfsd workers under one PID and have 8 threads running under it, instead of (as shown above, and unlike user-space processes) 8 different PIDs?
By contrast, nslcd is an example of a user-space program that uses multithreading:
UID PID PPID LWP C NLWP STIME TTY TIME CMD
nslcd 1424 1 1424 0 6 Jan12 ? 00:00:00 /usr/sbin/nslcd
nslcd 1424 1 1425 0 6 Jan12 ? 00:00:28 /usr/sbin/nslcd
nslcd 1424 1 1426 0 6 Jan12 ? 00:00:28 /usr/sbin/nslcd
nslcd 1424 1 1427 0 6 Jan12 ? 00:00:27 /usr/sbin/nslcd
nslcd 1424 1 1428 0 6 Jan12 ? 00:00:28 /usr/sbin/nslcd
nslcd 1424 1 1429 0 6 Jan12 ? 00:00:28 /usr/sbin/nslcd
The PID is the same but the LWP is unique per thread.
Update on the function of kthreadd
From this stackoverflow answer:
kthreadd is a daemon thread that runs in kernel space. The reason is that the kernel sometimes needs to create threads, but creating a thread in the kernel is very tricky. Hence kthreadd is a thread that the kernel uses to spawn new threads when required. This thread can also access the userspace address space, but should not do so. It is managed by the kernel...
And this one:
kthreadd() is the main function (and main loop) of the daemon kthreadd, which is the kernel thread daemon, the parent of all other kernel threads. So in the code quoted, a request is created for the kthreadd daemon; to fulfill this request, kthreadd will read it and start a kernel thread.

There is no concept of a process in the kernel, so your question doesn't really make sense. The Linux kernel can and does create threads that run completely in kernel context, but all of these threads run in the same address space. There's no grouping of similar threads by PID, although related threads usually have related names.
If multiple kernel threads are working on the same task or otherwise sharing data, then they need to coordinate access to that data via locking or other concurrent algorithms. Of course the pthreads API isn't available in the kernel, but one can use kernel mutexes, wait queues, etc to get the same capabilities as pthread mutexes, condition variables, etc.
Calling these contexts of execution "kernel threads" is a reasonably good name, since they are closely analogous to multiple threads in a userspace process. They all share the (kernel's) address space, but have their own execution context (stack, program counter, etc) and are each scheduled independently and run in parallel. On the other hand, the kernel is what actually implements all the nice POSIX API abstractions (with help from the C library in userspace), so internal to that implementation we don't have the full abstraction.
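To make that concrete, here is a hypothetical module sketch of the kernel-side analogue of a multithreaded user-space program: two kernel threads, created through kthreadd via kthread_run(), sharing data in the kernel's single address space and serializing access with a kernel mutex instead of a pthread mutex. The APIs (kthread_run, kthread_should_stop, kthread_stop, DEFINE_MUTEX) are real; the module and its names are made up for illustration.
/* demo_kthreads.c - hypothetical module: two kernel threads sharing a counter */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/mutex.h>
#include <linux/delay.h>
#include <linux/err.h>
static struct task_struct *workers[2];
static unsigned long shared_counter;        /* shared: there is only one kernel address space */
static DEFINE_MUTEX(counter_lock);          /* kernel analogue of a pthread mutex */
static int worker_fn(void *data)
{
    while (!kthread_should_stop()) {        /* loop until kthread_stop() is called */
        mutex_lock(&counter_lock);
        shared_counter++;
        mutex_unlock(&counter_lock);
        msleep(1000);
    }
    return 0;
}
static int __init demo_init(void)
{
    int i;
    for (i = 0; i < 2; i++) {
        /* the creation request is handed to kthreadd, which spawns the thread */
        workers[i] = kthread_run(worker_fn, NULL, "demo_worker/%d", i);
        if (IS_ERR(workers[i])) {
            int err = PTR_ERR(workers[i]);
            while (i-- > 0)
                kthread_stop(workers[i]);
            return err;
        }
    }
    return 0;
}
static void __exit demo_exit(void)
{
    int i;
    for (i = 0; i < 2; i++)
        kthread_stop(workers[i]);
    pr_info("demo_kthreads: final count %lu\n", shared_counter);
}
module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
Loading something like this would add [demo_worker/0] and [demo_worker/1] entries under kthreadd in ps auxf, each with its own PID and an NLWP of 1, which is exactly the pattern the question observes for nfsd.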

Related

Does a Linux process with VSZ equal to 0 mean it is a kernel-space application?

I notice that some processes always have a VSZ of 0:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 19356 1400 ? Ss Jun13 0:00 /sbin/init
root 2 0.0 0.0 0 0 ? S Jun13 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Jun13 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S Jun13 0:01 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S Jun13 0:00 [stopper/0]
root 6 0.0 0.0 0 0 ? S Jun13 0:03 [watchdog/0]
root 7 0.0 0.0 0 0 ? S Jun13 0:00 [migration/1]
How should I understand why they have a VSZ of 0?
VSZ is the Virtual Memory Size. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated but not used, and memory from shared libraries.
So the output you shared, where the VSZ values equal 0, means that those processes have no user-space virtual memory at all.
NOTE: They are kernel threads, and memory statistics are irrelevant for them, as they use kernel memory. To visualize kernel processes, press c while top is running and it will show all the [bracketed] entries in the last column, named COMMAND.
You can get more details on VSZ and learn about its counterpart RSS (Resident Set Size) from here.
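The same thing is visible directly in /proc: the first field of /proc/<pid>/statm is the total virtual size in pages, and for kernel threads it is simply 0 because they have no user-space memory map. A minimal sketch (program name and structure are just an illustration) that prints it:
/* vsz.c - print a process's VSZ straight from /proc/<pid>/statm */
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    char path[64];
    unsigned long size_pages = 0;
    FILE *f;
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/statm", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    /* first field of statm: total program size, in pages */
    fscanf(f, "%lu", &size_pages);
    fclose(f);
    printf("VSZ = %lu kB\n", size_pages * sysconf(_SC_PAGESIZE) / 1024);
    return 0;
}
Run it against PID 2 ([kthreadd]) or any other bracketed entry and it prints VSZ = 0 kB, matching the ps output above.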

How to check if isolcpus is configured & working?

I am using RHEL and I have configured isolcpus= in the /boot/grub/grub.conf file so that I can isolate some CPUs from the OS scheduler. Now I want to check whether those CPUs are isolated or whether they are still subject to the normal OS scheduling algorithm.
The machine has two 5690 processors running in hyper-threaded mode.
So there are 24 logical cores in total.
I want to isolate 6 cores for an application.
However, when I run top, I find that there are some system processes running on those cores. I am pasting what is on the supposedly isolated 12th core (CPU 11):
100 root rt 0 0 0 0 S 0.0 0.0 0:00.01 migration/11 11
101 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/11 11
102 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/11:0 11
103 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/11:0H 11
What is a sure-fire way of checking for isolated CPUs in Linux?
I was able to resolve it: the system is only using the CPUs that are not isolated.
I ran a stress test on it, and it only used the non-isolated CPUs. The only change I made was to the config file /boot/grub/grub.cfg, followed by a reboot of the system.
You can run a stress test and check whether it uses the isolated cores or not.
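Besides a stress test, there are a couple of more direct checks: /proc/cmdline confirms the parameter actually reached the kernel, newer kernels expose the isolated set in /sys/devices/system/cpu/isolated, and the affinity mask of an ordinary process is a good indicator, since with isolcpus in effect PID 1 (and therefore everything it forks) typically starts with the isolated cores excluded. A minimal sketch along those lines (the sysfs file may not exist on older kernels, and the exact behaviour varies by kernel version):
/* check_isol.c - show which CPUs this process may run on, plus the kernel's
   isolated set if the kernel exposes it in sysfs */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
int main(void)
{
    cpu_set_t mask;
    char buf[256];
    FILE *f;
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("this process may run on CPUs:");
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf(" %d", cpu);
        printf("\n");
    }
    /* newer kernels list the isolcpus set here; the file is absent on older ones */
    f = fopen("/sys/devices/system/cpu/isolated", "r");
    if (f) {
        if (fgets(buf, sizeof(buf), f))
            printf("isolated CPUs: %s", buf);
        fclose(f);
    }
    return 0;
}
Also note that per-CPU kernel threads such as migration/11, ksoftirqd/11 and kworker/11:* exist on every CPU, including isolated ones, so seeing them on CPU 11 in top does not by itself mean the isolation is broken.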

Why is the CPU time different in the other threads?

I ran top -H -p on a process, which gave me a few threads with their LWPs.
But when I sort the results with the smallest PID first, I notice that the time of the first thread stays constant while the other threads' times keep changing. Why is TIME+ different?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16989 root 20 0 106m 28m 2448 S 0.0 0.2 0:22.31 glusterfs
16990 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16992 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16993 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16997 root 20 0 106m 28m 2448 S 0.0 0.2 0:11.71 glusterfs
17010 root 20 0 106m 28m 2448 S 0.0 0.2 0:21.07 glusterfs
17061 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
Why is TIME+ different?
Because different threads are doing different percentages of the work. There could be a number of reasons for this [1], but the most likely is that the application (glusterfs) is not attempting to distribute work evenly across the worker threads.
It is not something to worry about. It doesn't matter which thread does the work if the work level (see the %CPU) is negligible.
[1] If someone had the time and inclination, they could look at the source code of glusterfs to try to understand its behavior. However, I don't think the effort is warranted.
Because the time column refers to the CPU time consumed by a process, when a thread's time does not change it probably means that the thread is "sleeping" or simply waiting for another thread to finish, but there could be many more reasons.
http://linux.about.com/od/commands/l/blcmdl1_top.htm
TIME:
Total CPU time the task has used since it started. If cumulative mode
is on, this also includes the CPU time used by the process's children
which have died. You can set cumulative mode with the S command line
option or toggle it with the interactive command S. The header line
will then be changed to CTIME.
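To see how per-thread TIME+ diverges in practice, here is a minimal pthreads sketch (unrelated to glusterfs): one thread burns CPU while the other sleeps, and each reports its own CPU time via its per-thread CPU clock, which is the same per-LWP accounting that top -H displays.
/* thread_time.c - one busy thread, one sleeping thread; each reports its own CPU time.
   Build with: gcc thread_time.c -pthread   (older glibc may also need -lrt) */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
static void *busy(void *arg)
{
    volatile unsigned long x = 0;
    unsigned long i;
    for (i = 0; i < 2000000000UL; i++)
        x += i;                              /* burns CPU time */
    return NULL;
}
static void *idle_fn(void *arg)
{
    sleep(3);                                /* uses almost no CPU time */
    return NULL;
}
static void report(pthread_t t, const char *name)
{
    clockid_t cid;
    struct timespec ts;
    pthread_getcpuclockid(t, &cid);          /* per-thread CPU-time clock */
    clock_gettime(cid, &ts);
    printf("%s: %ld.%03ld s of CPU\n", name, (long)ts.tv_sec, ts.tv_nsec / 1000000);
}
int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, busy, NULL);
    pthread_create(&t2, NULL, idle_fn, NULL);
    sleep(1);                                /* sample while both threads are still alive */
    report(t1, "busy thread");
    report(t2, "sleeping thread");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
Running top -H -p against this program shows two LWPs under one PID: the busy one with TIME+ climbing and the sleeping one stuck near 0:00.00, for the same reason as the glusterfs threads above.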

How to identify what is stalling the system in Linux?

I have an embedded system; when I perform user I/O operations, the system just stalls and only completes the action after a long time. The system is quite complex and has many processes running. My question is how I can identify what is making the system stall: it literally does nothing for 5 minutes, and only after 5 minutes do I see the outcome. I really don't know what is stalling the system. Any input on how to debug this issue? I have run top on the system, but it doesn't point to any issue. As shown below, jup_render is only taking 30% of the CPU, which is not enough to stall the system, so I am not sure whether top is useful here or not.
~ # top
top - 12:01:05 up 21 min, 1 user, load average: 1.49, 1.26, 0.87
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
Cpu(s): 44.4%us, 13.9%sy, 0.0%ni, 40.3%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st
Mem: 822572k total, 389640k used, 432932k free, 1980k buffers
Swap: 0k total, 0k used, 0k free, 227324k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
850 root 20 0 309m 32m 16m S 30 4.0 3:10.88 jup_render
870 root 20 0 221m 13m 10m S 27 1.7 2:28.78 jup_render
688 root 20 0 1156m 4092 3688 S 11 0.5 1:25.49 rxserver
9 root 20 0 0 0 0 S 2 0.0 0:06.81 ksoftirqd/1
16 root 20 0 0 0 0 S 1 0.0 0:06.87 ksoftirqd/3
9294 root 20 0 1904 616 508 R 1 0.1 0:00.10 top
812 root 20 0 865m 85m 46m S 1 10.7 1:21.17 lippo_main
13 root 20 0 0 0 0 S 1 0.0 0:06.59 ksoftirqd/2
800 root 20 0 223m 8316 6268 S 1 1.0 0:08.30 rat-cadaemon
3 root 20 0 0 0 0 S 1 0.0 0:05.94 ksoftirqd/0
1456 root 20 0 80060 10m 8208 S 1 1.2 0:04.82 jup_render
1330 root 20 0 202m 10m 8456 S 0 1.3 0:06.08 jup_render
8905 root 20 0 1868 556 424 S 0 0.1 0:02.91 dropbear
1561 root 20 0 80084 10m 8204 S 0 1.2 0:04.92 jup_render
753 root 20 0 61500 7376 6184 S 0 0.9 0:04.06 ale_app
1329 root 20 0 79908 9m 8208 S 0 1.2 0:04.77 jup_render
631 dbus 20 0 3248 1636 676 S 0 0.2 0:13.10 dbus-daemon
1654 root 20 0 80068 10m 8204 S 0 1.2 0:04.82 jup_render
760 root 20 0 116m 15m 12m S 0 1.9 0:10.19 jup_server
8 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/1:0
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
7 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
170 root 0 -20 0 0 0 S 0 0.0 0:00.00 kblockd
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
167 root 20 0 0 0 0 S 0 0.0 0:00.00 sync_supers
281 root 0 -20 0 0 0 S 0 0.0 0:00.00 nfsiod
For an embedded system with many processes running, there can be a multitude of reasons. You may need to investigate from every perspective.
Check the code for race conditions and deadlocks. The kernel might be busy-looping under a certain condition. There can be scenarios where your application is waiting on a select call, has exhausted its CPU resources (this is ruled out based on the top output you shared), or is blocked on a read.
If a process performs a blocking I/O operation, it is put on a wait queue and only moves back to the execution path (the run queue) after the request completes. That is, it is moved out of the scheduler's run queue and marked with a special state; it is put back into the run queue only when it wakes from its sleep, i.e. when the resource it was waiting for becomes available.
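For what it's worth, that wait-queue mechanism looks roughly like this on the kernel side; a hypothetical fragment (not a complete module, and the names are made up) in which one context sleeps until another wakes it:
/* hypothetical driver fragment: a reader sleeps on a wait queue until data
   arrives; the producer (e.g. an interrupt handler) wakes it up */
#include <linux/wait.h>
#include <linux/sched.h>
static DECLARE_WAIT_QUEUE_HEAD(data_wq);
static int data_ready;
/* consumer side: the task leaves the run queue (TASK_INTERRUPTIBLE)
   until the condition becomes true */
static int wait_for_data(void)
{
    return wait_event_interruptible(data_wq, data_ready != 0);
}
/* producer side: make the condition true and move any sleepers
   back onto the run queue */
static void data_arrived(void)
{
    data_ready = 1;
    wake_up_interruptible(&data_wq);
}
While a task sleeps like this it shows up in ps/top in state S (or D for uninterruptible waits), which is often the first clue about where a stalled process is stuck.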
An immediate step would be to try strace. It intercepts and records the system calls made by a process and the signals it receives, and it shows the order of events and the return/resumption path of each call. This can take you much closer to the problem area.
There are many other handy tools that can be tried, depending on your development environment/setup. Key tools are listed below:
'iotop' - provides a table of current I/O usage by processes or threads on the system, based on the I/O usage information reported by the kernel.
'LTTng' - makes tracing of race conditions and interrupt cascades possible. It is the successor to LTT and combines kprobes, tracepoint and perf functionality.
'Ftrace' - a Linux kernel internal tracer with which you can analyze/debug latency and performance-related issues.
If your system is based on a TI processor, CCS (Trace Analyzer) provides the capability to perform non-intrusive debugging and analysis of system activity. Depending on your setup, you may need to use whichever tool is relevant.
A few more ideas:
The magic SysRq key is another option in Linux. If a driver is stuck, the SysRq p command can take you to the exact routine that is causing the problem.
Profiling can tell you exactly where the kernel is spending its time. There are a couple of tools, such as readprofile and OProfile. OProfile can be enabled by configuring the kernel with CONFIG_PROFILING and CONFIG_OPROFILE. Another option is to rebuild the kernel with profiling enabled, boot with profile=2 on the command line, and read the profile counters with the readprofile utility.
mpstat can report "the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request" via its iowait column.
You said you ran top. Did you find out which program gets the most CPU time, and what percentage it uses?
When you run top there is more information on the screen, but you neither provided nor mentioned the CPU load percentage (or other relevant information).
I advise you to include whatever you find interesting, relevant or suspicious in the top output. If you have already done that, point it out more distinctly in your question, because right now it is not obvious what the maximum CPU load is.

Memory utilization by a Linux process

I have a lot of instances of a process running on my host, each seemingly consuming a large amount of memory.
ps aux on the processes gives me the following information:
blah1 18634 0.0 0.4 131852 31188 pts/15 Ssl+ 00:27 0:00 myPgm
blah2 18859 0.0 0.3 131292 30656 pts/32 Sl+ 01:17 0:00 myPgm
blah3 19813 0.0 0.4 131960 31220 pts/44 Ssl+ 01:17 0:00 myPgm
blah4 20228 0.1 0.3 131728 31036 pts/54 Ssl+ 01:41 0:00 myPgm
blah5 20238 0.0 0.3 131688 30932 pts/20 Sl+ Nov15 0:00 myPgm
blah6 21181 0.0 0.3 131304 30632 pts/25 Sl+ Nov15 0:00 myPgm
blah7 21278 0.0 0.3 131824 31096 pts/61 Ssl+ Nov15 0:00 myPgm
blah8 21821 0.0 0.3 131444 30808 pts/7 Sl+ 00:54 0:00 myPgm
So VSZ is always around 130 MB and RSS around 30 MB. pmap for the processes shows the following data:
For 18634:
mapped: 131852K writeable/private: 59692K shared: 28K
For 21181:
mapped: 131304K writeable/private: 59144K shared: 28K
and similar values for the other processes as well. The host has 7 GB of physical memory. At times I have around 700 to 800 instances of the same process running on the host. I am trying to understand how much memory each process consumes in reality. If I take "writeable/private" as the actual memory usage of each process, then 58 MB per process would add up to about 45 GB (for 800 processes), which is crazy. Can anyone explain whether I am doing this wrong and how the calculation should be done?
Also free -k gives
total used free shared buffers cached
Mem: 7782580 4802104 2980476 0 380192 1931708
-/+ buffers/cache: 2490204 5292376
Swap: 1048568 32 1048536
It looks like not much swap is being used, so where does the memory for each process come from? Thanks.
You don't know what VSZ is. You think you know, but evidence suggests otherwise, therefore you need to find out what it is.
VSZ is the Virtual Memory Size, and that is all the memory required by the process, including shared memory. That's why you can't just sum(VSZ) and expect to get less than the physical amount of memory + swap.
The "mapped" figure in pmap corresponds to VSZ, while "writeable/private" is, as far as I can tell, memory that is private to the process and writable (its heap, stacks and copy-on-write data), i.e. memory that is not shared with other processes.
To understand this, you need to understand how memory allocation and access work, which is difficult. This article seems to explain it in some detail, though I have only read it cursorily: http://emilics.com/blog/article/mconsumption.html
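If the goal is a realistic per-process number, a useful metric is Pss (proportional set size) from /proc/<pid>/smaps: each resident shared page is charged fractionally to the processes sharing it, so Pss can be summed over all 700 to 800 instances without the double counting you get from RSS or writeable/private. A minimal sketch, assuming a kernel recent enough to expose Pss in smaps:
/* pss.c - sum the Pss: lines of /proc/<pid>/smaps for one process */
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
    char path[64], line[256];
    unsigned long kb, total_kb = 0;
    FILE *f;
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Pss: %lu kB", &kb) == 1)
            total_kb += kb;                 /* one Pss line per mapping */
    }
    fclose(f);
    printf("PID %s: Pss total = %lu kB\n", argv[1], total_kb);
    return 0;
}
You generally need to be root (or the owner of the process) to read another process's smaps. The totals here should come out well below the 58 MB writeable/private figure, because much of that address space is allocated but never touched and therefore not resident at all.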
