How to check if isolcpus is configured & working? - linux

I am using RHEL and I have configured isolcpus= in the /boot/grub/grub.conf file so that I can isolate some CPUs from OS process scheduling. Now I want to check whether those CPUs are actually isolated or whether the OS scheduler is still using them.
The machine has twin 5690 processors in hyper-threaded mode, for a total of 24 logical cores.
I want to isolate 6 cores for an application.
However, when I run "top", I find some system processes running on those cores. Below is the output for the supposedly isolated 12th core (CPU 11):
100 root rt 0 0 0 0 S 0.0 0.0 0:00.01 migration/11 11
101 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/11 11
102 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/11:0 11
103 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/11:0H 11
What is a sure-shot way of checking isolated CPUs in Linux?

I was able to resolve it: the system is using only those CPUs that are not isolated.
I ran a stress test on it, and it used only the non-isolated CPUs. The only change I made was to the config file "/boot/grub/grub.cfg", followed by a reboot.

You can run a stress test and check whether it uses the isolated cores or not.
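Besides a stress test, there are a few direct checks; a sketch (the isolcpus value in the test string below is just an example — yours will differ):

```shell
# 1. Confirm the running kernel actually booted with the parameter
#    (grub edits only take effect after a reboot):
tr ' ' '\n' < /proc/cmdline | grep '^isolcpus='

# 2. Newer kernels export the isolated set directly via sysfs:
cat /sys/devices/system/cpu/isolated

# 3. The default affinity mask of an ordinary process (init, PID 1)
#    should exclude the isolated CPUs:
taskset -cp 1
```

If step 1 prints nothing, the parameter never reached the kernel and the isolation is not in effect, regardless of what the grub config file says.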

Related

Why are SCHED_FIFO threads assigned to the same physical CPU even though idle CPUs are available?

While debugging a performance issue in an app I'm working on, I noticed some weird kernel scheduler behaviour:
busy SCHED_FIFO tasks tend to be scheduled on logical cores from the same physical CPU even though there are idle physical CPUs in the system.
8624 root -81 0 97.0g 49g 326m R 100 52.7 48:13.06 26 Worker0 <-- CPU 6 and 26
8629 root -81 0 97.0g 49g 326m R 100 52.7 44:56.26 6 Worker5 <-- the same physical core
8625 root -81 0 97.0g 49g 326m R 82 52.7 58:20.65 23 Worker1
8627 root -81 0 97.0g 49g 326m R 67 52.7 55:28.86 27 Worker3
8626 root -81 0 97.0g 49g 326m R 67 52.7 46:04.55 32 Worker2
8628 root -81 0 97.0g 49g 326m R 59 52.7 44:23.11 5 Worker4
Initially the threads shuffle between cores, but at some point the most CPU-intensive threads end up locked on the same physical core and don't seem to move from there. There is no affinity set for the Worker threads.
I tried to reproduce it with a synthetic load by running 12 instances of:
chrt -f 10 yes > /dev/null &
And here is what I got:
25668 root -11 0 2876 752 656 R 100 0.0 0:17.86 20 yes
25663 root -11 0 2876 744 656 R 100 0.0 0:19.10 25 yes
25664 root -11 0 2876 752 656 R 100 0.0 0:18.79 6 yes
25665 root -11 0 2876 804 716 R 100 0.0 0:18.54 7 yes
25666 root -11 0 2876 748 656 R 100 0.0 0:18.31 8 yes
25667 root -11 0 2876 812 720 R 100 0.0 0:18.08 29 yes <--- core9
25669 root -11 0 2876 744 656 R 100 0.0 0:17.62 9 yes <--- core9
25670 root -11 0 2876 808 720 R 100 0.0 0:17.37 2 yes
25671 root -11 0 2876 748 656 R 100 0.0 0:17.15 23 yes <--- core3
25672 root -11 0 2876 804 712 R 100 0.0 0:16.94 4 yes
25674 root -11 0 2876 748 656 R 100 0.0 0:16.35 3 yes <--- core3
25673 root -11 0 2876 812 716 R 100 0.0 0:16.68 1 yes
This is a server with 20 physical cores, so there are 8 idle cores remaining, yet threads are still scheduled on the same physical core. This is reproducible and persistent. It doesn't seem to happen for non-SCHED_FIFO threads. It also only started after migrating past kernel 4.19.
Is this correct behaviour for SCHED_FIFO threads? Is there any flag or config option that can change this scheduler behaviour?
If I'm understanding correctly, you're trying to use SCHED_FIFO with hyperthreading ("HT") enabled, which results in multiple logical processors per physical core. My understanding is that HT-awareness within the Linux kernel is mainly through the load balancing and scheduler domains within CFS (the default scheduler these days). See https://stackoverflow.com/a/29587579/2530418 for more info.
Using SCHED_FIFO or SCHED_RR would then essentially bypass HT handling, since RT scheduling doesn't really go through CFS.
My approach to dealing with this in the past has been to disable hyperthreading. For cases where you actually need real-time behavior, this is usually the right latency/performance tradeoff to make anyway (see https://rt.wiki.kernel.org/index.php/HOWTO:_Build_an_RT-application#Hyper_threading). Whether this is appropriate really depends on what problem you're trying to solve.
Aside: I suspect that if you actually need SCHED_FIFO behavior then disabling HT is what you'll want to do, but it's also common for people to think they need SCHED_FIFO when it's the wrong tool for the job. My suspicion is that there may be a better option than SCHED_FIFO since you're describing running on a conventional server rather than an embedded system, but that's an over-generalizing guess. Hard to say without more specifics about the issue.
The problem was caused by this particular change:
https://lkml.iu.edu/hypermail/linux/kernel/1806.0/04887.html
The per-CPU watchdog threads were removed:
watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
Previously, they ran every 4 seconds, and because they had the absolute highest priority, they caused periodic rescheduling. With them gone, there is nothing left that can pre-empt SCHED_FIFO threads and migrate them to a "better" core.
So this was all just a side effect of the watchdog implementation. In general, there is no mechanism in the kernel that rebalances runaway RT threads.
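Since nothing in the kernel will rebalance RT threads for you, one workaround is to set explicit affinities so that HT siblings are never shared. A sketch; the CPU and thread IDs are taken from the question's output and are purely illustrative:

```shell
# Which logical CPUs share a physical core (for cpu6 this might
# print "6,26" on the 20-core HT box from the question):
cat /sys/devices/system/cpu/cpu6/topology/thread_siblings_list

# Pin one SCHED_FIFO worker per physical core; 8624 is the Worker0
# TID from the question's top output, used here only as an example:
taskset -cp 6 8624
```

Doing this once at launch (e.g. `taskset -c 6 chrt -f 80 ./worker`) means each RT worker owns a physical core and sibling contention can't occur, regardless of scheduler behaviour.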

Does a Linux process VSZ of 0 mean a kernel-space application?

I notice some processes always have a VSZ of 0:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 19356 1400 ? Ss Jun13 0:00 /sbin/init
root 2 0.0 0.0 0 0 ? S Jun13 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S Jun13 0:00 [migration/0]
root 4 0.0 0.0 0 0 ? S Jun13 0:01 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S Jun13 0:00 [stopper/0]
root 6 0.0 0.0 0 0 ? S Jun13 0:03 [watchdog/0]
root 7 0.0 0.0 0 0 ? S Jun13 0:00 [migration/1]
How can these processes have a VSZ of 0?
VSZ is the Virtual Memory Size. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries.
So, the ps output you shared showing VSZ values of 0 means that those processes use no user-space virtual memory at all.
NOTE: They are kernel threads, and user-space memory statistics are irrelevant for them because they run entirely in kernel memory. To spot kernel threads visually, press c while top is running and they will show up as [bracketed] entries in the last column, COMMAND.
You can get more details on VSZ and learn about its counterpart RSS (Resident Set Size) from here.
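One way to list the kernel threads directly is to filter on VSZ == 0; a sketch using ps and awk:

```shell
# Kernel threads report VSZ 0 (and, equivalently, have an empty
# /proc/<pid>/cmdline); list their PIDs and names:
ps -eo pid,vsz,comm --no-headers | awk '$2 == 0 { print $1, $3 }'
```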

Linux "top" command - want to aggregate resource usage to the process group or user name, especially for postgres

An important topic in software development/programming is to assess the size of the product, and to match the application footprint to the system where it is running. One may need to optimize the product, and/or one may need to add more memory, use a faster processor, etc. In the case of virtual machines, it is important to make sure the application will work effectively, perhaps by making the VM memory size larger, or by allowing a product to get more resources from the hypervisor when needed and available.
The linux top(1) command is great, with its ability to sort by different fields, add optional fields, highlight sort criteria on-screen, and switch sort field with < and >. On most systems though, there are very many processes running, making "at-a-glance" examination a little difficult. Consider:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ PPID SWAP nFLT COMMAND
2181 root 20 0 7565m 3.2g 7028 S 2.7 58.3 86:41.17 1 317m 10k java
1751 root 20 0 137m 2492 1056 S 0.0 0.0 0:02.57 1 5104 76 munin-node
11598 postgres 20 0 146m 23m 11m S 0.0 0.4 7:51.63 2143 3600 28 postmaster
1470 root 20 0 243m 1792 820 S 0.0 0.0 0:01.89 1 2396 23 rsyslogd
3107 postgres 20 0 146m 26m 11m S 0.0 0.5 7:40.61 2143 936 58 postmaster
3168 postgres 20 0 132m 14m 11m S 0.0 0.2 8:27.27 2143 904 53 postmaster
3057 postgres 20 0 138m 19m 11m S 0.0 0.3 6:55.63 2143 856 36 postmaster
3128 root 20 0 85376 900 896 S 0.0 0.0 0:00.11 1636 852 2 sshd
1728 root 20 0 80860 1080 952 S 0.0 0.0 0:00.61 1 776 0 master
3130 manager 20 0 85532 844 672 S 0.0 0.0 0:01.03 3128 712 36 sshd
436 root 16 -4 11052 264 260 S 0.0 0.0 0:00.01 1 688 0 udevd
2211 root 18 -2 11048 220 216 S 0.0 0.0 0:00.00 436 684 0 udevd
2212 root 18 -2 11048 220 216 S 0.0 0.0 0:00.00 436 684 0 udevd
1636 root 20 0 66176 524 436 S 0.0 0.0 0:00.12 1 620 25 sshd
1486 root 20 0 229m 2000 1648 S 0.0 0.0 0:00.79 1485 596 116 sssd_be
2306 postgres 20 0 131m 11m 9m S 0.0 0.2 0:01.21 2143 572 64 postmaster
3055 postgres 20 0 135m 16m 11m S 0.0 0.3 10:18.88 2143 560 36 postmaster
...etc... This shows about 20 processes, but there are well over 100 running.
In this example I was sorting by SWAP field.
I would like to be able to aggregate related processes based on the "process group" of which they are a part, or based on the USER running the process, or based on the COMMAND being run. Essentially I want to:
Aggregate by PPID, or
Aggregate by USER, or
Aggregate by COMMAND, or
Turn off aggregation
This would allow me to see more quickly what is going on. The expectation is that all the postgres processes would show up together, as a single line, with the process group leader (2143, not captured in the snippet) displaying aggregated metrics. Generally the aggregation would be a sum (VIRT, RES, SHR, %CPU, %MEM, TIME+, SWAP, nFLT), but sometimes not (as for PR and NI, which might be shown as just --).
For processes whose PPID is 1, it would be nice to have an option of toggling between aggregating them all together, or of leaving them listed individually.
Aggregation by the name of the process (java vs. munin-node vs. postmaster vs. chrome) would also be a nice option. The COMMAND arguments would not be used when aggregating by command name.
This would be very valuable when tuning an application. How can I do this, aggregating top data for at-a-glance viewing in larger scale systems? Has anyone written an app, perhaps that uses top in batch mode, to create a summary view like I'm discussing?
FYI, I'm specifically interested in something for CentOS, but this would be helpful on any OS variant.
Thanks!
...Alan
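top itself can't aggregate, but a rough approximation of the per-command roll-up described above can be scripted with ps and awk; a sketch that sums RSS and %CPU per command name, ignoring arguments:

```shell
# Sum resident memory (KiB) and %CPU per command name, largest RSS first:
ps -eo comm,rss,pcpu --no-headers \
  | awk '{ rss[$1] += $2; cpu[$1] += $3 }
         END { for (c in rss) printf "%s %d %.1f\n", c, rss[c], cpu[c] }' \
  | sort -k2 -rn
```

Swapping `comm` for `user` or `ppid` in the `-eo` list gives the other groupings listed above; wrapping the pipeline in `watch` gives a crude live view.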

Why is the CPU time different in other threads?

I ran top -H -p for a process, which gave me a few threads with their LWPs.
But when I sort the results with the smallest PID first, I notice the time of the first thread stays constant while the other threads' times keep changing. Why is TIME+ different?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16989 root 20 0 106m 28m 2448 S 0.0 0.2 0:22.31 glusterfs
16990 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16992 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16993 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
16997 root 20 0 106m 28m 2448 S 0.0 0.2 0:11.71 glusterfs
17010 root 20 0 106m 28m 2448 S 0.0 0.2 0:21.07 glusterfs
17061 root 20 0 106m 28m 2448 S 0.0 0.2 0:00.00 glusterfs
Why is TIME+ different?
Because different threads are doing different percentages of the work. There could be a number of reasons for this1, but the most likely is that the application (glusterfs) is not attempting to distribute work evenly across the worker threads.
It is not something to worry about. It doesn't matter which thread does the work if the work level (see the %CPU) is negligible.
1 - If someone had the time and inclination, they could look at the source code of glusterfs to try to understand its behavior. However, I don't think the effort is warranted.
Because the time column refers to the CPU time consumed by a process, so when a process's time does not change, it probably means that this process is "sleeping" or simply waiting for another process to finish, though there could be many more reasons.
http://linux.about.com/od/commands/l/blcmdl1_top.htm
TIME:
Total CPU time the task has used since it started. If cumulative mode
is on, this also includes the CPU time used by the process's children
which have died. You can set cumulative mode with the S command line
option or toggle it with the interactive command S. The header line
will then be changed to CTIME.
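The same per-thread times can also be read straight from procfs; a sketch (PID 16989 is taken from the question's output; utime and stime are fields 14 and 15 of each thread's stat file, in clock ticks):

```shell
# Print TID and cumulative CPU time (utime + stime) for every thread
# of PID 16989; threads that never ran, like 16990 above, show 0:
for stat in /proc/16989/task/*/stat; do
  awk '{ print $1, $14 + $15 }' "$stat"
done
```

Note the simple field splitting assumes the comm field (in parentheses) contains no spaces, which holds for names like glusterfs.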

How to identify what is stalling the system in Linux?

I have an embedded system; when I perform user I/O operations, the system just stalls and performs the action only after a long time. The system is quite complex and has many processes running. My question is: how can I identify what is making the system stall? It does literally nothing for 5 minutes, and only after that do I see the outcome. I really don't know what is stalling it. Any input on how to debug this issue? I have run top on the system, but it doesn't point to any issue. As you can see below, jup_render is only taking 30% of the CPU, which is not enough to stall the system, so I am not sure whether top is useful here or not.
~ # top
top - 12:01:05 up 21 min, 1 user, load average: 1.49, 1.26, 0.87
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
Cpu(s): 44.4%us, 13.9%sy, 0.0%ni, 40.3%id, 0.0%wa, 0.0%hi, 1.4%si, 0.0%st
Mem: 822572k total, 389640k used, 432932k free, 1980k buffers
Swap: 0k total, 0k used, 0k free, 227324k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
850 root 20 0 309m 32m 16m S 30 4.0 3:10.88 jup_render
870 root 20 0 221m 13m 10m S 27 1.7 2:28.78 jup_render
688 root 20 0 1156m 4092 3688 S 11 0.5 1:25.49 rxserver
9 root 20 0 0 0 0 S 2 0.0 0:06.81 ksoftirqd/1
16 root 20 0 0 0 0 S 1 0.0 0:06.87 ksoftirqd/3
9294 root 20 0 1904 616 508 R 1 0.1 0:00.10 top
812 root 20 0 865m 85m 46m S 1 10.7 1:21.17 lippo_main
13 root 20 0 0 0 0 S 1 0.0 0:06.59 ksoftirqd/2
800 root 20 0 223m 8316 6268 S 1 1.0 0:08.30 rat-cadaemon
3 root 20 0 0 0 0 S 1 0.0 0:05.94 ksoftirqd/0
1456 root 20 0 80060 10m 8208 S 1 1.2 0:04.82 jup_render
1330 root 20 0 202m 10m 8456 S 0 1.3 0:06.08 jup_render
8905 root 20 0 1868 556 424 S 0 0.1 0:02.91 dropbear
1561 root 20 0 80084 10m 8204 S 0 1.2 0:04.92 jup_render
753 root 20 0 61500 7376 6184 S 0 0.9 0:04.06 ale_app
1329 root 20 0 79908 9m 8208 S 0 1.2 0:04.77 jup_render
631 dbus 20 0 3248 1636 676 S 0 0.2 0:13.10 dbus-daemon
1654 root 20 0 80068 10m 8204 S 0 1.2 0:04.82 jup_render
760 root 20 0 116m 15m 12m S 0 1.9 0:10.19 jup_server
8 root 20 0 0 0 0 S 0 0.0 0:00.00 kworker/1:0
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
7 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/1
170 root 0 -20 0 0 0 S 0 0.0 0:00.00 kblockd
6 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
167 root 20 0 0 0 0 S 0 0.0 0:00.00 sync_supers
281 root 0 -20 0 0 0 S 0 0.0 0:00.00 nfsiod
For an embedded system with many processes running, there can be a multitude of reasons. You may need to investigate from every angle.
Check the code for race conditions and deadlocks. The kernel might be busy-looping under a certain condition. Your application could be waiting on a select call, blocked on a read, or starved of CPU (CPU starvation is ruled out by the top output you shared).
If you are performing a blocking I/O operation, the process is put on a wait queue and moves back to the execution path (ready queue) only after the request completes. That is, it is moved out of the scheduler's run queue and marked with a special state, and it is put back on the run queue only when it wakes from sleep or the resource it is waiting for becomes available.
An immediate step would be to try 'strace'. It intercepts and records the system calls made by a process and the signals it receives, showing the order of events and all return/resumption paths of the calls. This can take you very close to the problem area.
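A sketch of that strace approach (PID 850 is the jup_render process from the top output, used purely as an example):

```shell
# Attach and timestamp every syscall; -T appends each call's duration,
# so a 5-minute stall shows up as one call with a huge <...> time:
strace -f -tt -T -p 850 -o /tmp/stall.trace

# Afterwards, list the slowest calls first:
sort -t'<' -k2 -rn /tmp/stall.trace | head
```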
There are many other handy tools you can try depending on your development environment/setup. Key tools are listed below:
'iotop' - Provides a table of current I/O usage by processes or threads on the system, based on the I/O usage information reported by the kernel.
'LTTng' - Makes it possible to trace race conditions and interrupt cascades. It is the successor to LTT and combines kprobes, tracepoint and perf functionality.
'Ftrace' - A Linux kernel internal tracer with which you can analyze/debug latency and performance issues.
If your system is based on a TI processor, CCS (Trace Analyzer) provides the capability to perform non-intrusive debug and analysis of system activity. So, depending on your setup, you may also need to use the relevant vendor tool.
A few more ideas:
The magic SysRq key is another option on Linux. If a driver is stuck, the command SysRq + p can take you to the exact routine that is causing the problem.
Profiling can tell you exactly where the kernel is spending its time. There are a couple of tools, such as readprofile and OProfile. OProfile can be enabled by configuring the kernel with CONFIG_PROFILING and CONFIG_OPROFILE. Another option is to rebuild the kernel with the profiling option enabled, boot with profile=2 on the command line, and read the profile counters with the readprofile utility.
mpstat can report 'the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request' via its %iowait column.
You said you ran top. Did you find out which program gets the most CPU time, and what percentage that is?
When you run top you should also see the summary area at the top of the screen, which you neither provided nor mentioned; it shows the overall CPU load percentage (and other relevant info).
I advise you to include whatever you find interesting, relevant or suspicious in the top output. If you have already done this, state it more clearly in your question, because right now it is not obvious what the maximum CPU load is.
