/proc/[pid]/stat refresh period - linux

Hi, I am a Linux programmer.
I have been asked to monitor per-process CPU usage, so I use fields 14 and 15 of /proc/[pid]/stat. Those values are called utime and stime.
Example /proc/[pid]/stat contents:
30182 (TTTTest) R 30124 30182 30124 34845 30182 4218880 142 0 0 0 5274 0 0 0 20 0 1 0 55611251 17408000 386 18446744073709551615 4194304 4260634 140733397159392 140733397158504 4203154 0 0 0 0 0 0 0 17 2 0 0 0 0 0 6360520 6361584 33239040 140733397167447 140733397167457 140733397167457 140733397168110 0
The same file 5 seconds later:
30182 (TTTTest) R 30124 30182 30124 34845 30182 4218880 142 0 0 0 5440 0 0 0 20 0 1 0 55611251 17408000 386 18446744073709551615 4194304 4260634 140733397159392 140733397158504 4203154 0 0 0 0 0 0 0 17 2 0 0 0 0 0 6360520 6361584 33239040 140733397167447 140733397167457 140733397167457 140733397168110 0
In the test environment this file refreshed every 1~2 seconds, so I assumed the system updates it at least that often.
So I use this calculation:
process_cpu_usage = ((utime - old_utime) + (stime - old_stime)) / period
With the values above:
33.2 = ((5440 - 5274) + (0 - 0)) / 5
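For reference, here is a minimal Python sketch of that measurement (the PID 30182 and the 5-second period are simply the values from the example above; converting to a percentage of one core assumes the clock-tick rate reported by sysconf, normally 100 ticks per second):

import os
import time

def read_cpu_ticks(pid):
    """Return (utime, stime) in clock ticks for the given PID."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The comm field (field 2) may contain spaces, so split after the closing ')'.
    fields = data.rsplit(")", 1)[1].split()
    return int(fields[11]), int(fields[12])    # fields 14 (utime) and 15 (stime)

def cpu_usage_percent(pid, period=5.0):
    """Sample twice, `period` seconds apart; return usage as a percentage of one core."""
    hz = os.sysconf("SC_CLK_TCK")              # clock ticks per second, normally 100
    old_utime, old_stime = read_cpu_ticks(pid)
    time.sleep(period)
    utime, stime = read_cpu_ticks(pid)
    delta = (utime - old_utime) + (stime - old_stime)
    return 100.0 * delta / (hz * period)

if __name__ == "__main__":
    print(cpu_usage_percent(30182))            # 30182 is just the PID from the example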
But in a commercial server environment, where processes run under high load (CPU and file I/O), the /proc/[pid]/stat update period grows, sometimes to 20~60 seconds!
As a result, top/htop cannot measure the process usage correctly either.
Why is this phenomenon occurring?
Our system is [CentOS Linux release 7.1.1503 (Core)]

Most (if not all) files in the /proc filesystem are special files: their content at any given moment reflects the actual OS/kernel data at that very moment; they are not files whose contents are periodically updated. See the /proc filesystem documentation.
In particular, the content of /proc/[pid]/stat changes whenever the respective process's state changes (for example, after every scheduling event). For processes that are mostly sleeping the file will appear to be "updated" at a slower rate, while for active/running processes on a lightly loaded system it will appear to update at a higher rate. Compare, for example, the corresponding files for an idle shell process and for a browser process playing a video stream.
On heavily loaded systems with many processes in the ready state (like the one mentioned in this Q&A, for example), there can be scheduling delays that make the file content "updates" appear less often, even though the processes are ready/active. Such conditions seem to be encountered more often in commercial/enterprise environments (debatable, I agree).
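If you want to observe this directly, here is a rough Python sketch (the PID is hypothetical) that polls /proc/[pid]/stat rapidly and prints how much time passes between changes of utime+stime; for a busy process on an idle system the interval stays short, while under heavy scheduling pressure it can stretch out:

import time

def total_ticks(pid):
    """utime + stime for the process, in clock ticks."""
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])

pid = 30182                                    # hypothetical PID to watch
last_val = total_ticks(pid)
last_change = time.monotonic()
while True:
    time.sleep(0.05)                           # poll every 50 ms
    val = total_ticks(pid)
    if val != last_val:
        now = time.monotonic()
        print(f"+{val - last_val} ticks after {now - last_change:.2f} s")
        last_val, last_change = val, now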

Related

Advice on stopping compaction to reduce slowness

I am seeing high CPU and memory usage from Cassandra on the seed node. Is it advisable to stop compaction (nodetool stop) and re-enable it during off-peak hours? Should I do manual compaction or enable autocompaction? I see a lot of Native-Transport-Requests. I have three seed nodes; this is the first seed node.
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 54255 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 2 2566 352765 0 0
MutationStage 0 0 2659921760 0 0
MemtableReclaimMemory 0 0 180958 0 0
PendingRangeCalculator 0 0 21 0 0
GossipStage 0 0 338375 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 63 0 0
RequestResponseStage 0 1 1684328696 0 0
Native-Transport-Requests 4 0 1538523706 0 47006391
ReadRepairStage 0 0 2197 0 0
CounterMutationStage 0 0 0 0 0
MigrationStage 0 0 0 0 0
MemtablePostFlush 1 1 216220 0 0
PerDiskMemtableFlushWriter_0 1 1 180958 0 0
ValidationExecutor 0 0 33250 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 1 1 180958 0 0
InternalResponseStage 0 0 141677 0 0
ViewMutationStage 0 0 0 0 0
AntiEntropyStage 0 0 166254 0 0
CacheCleanupExecutor 0 0 0 0 0
Repair#9 0 0 5719 0 0
I do see a high number of compactions. Is it advisable to stop compactions using nodetool stop?
$ nodetool info
ID : ebeda774-cea8-40bb-9322-69c6fcded5a9
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 535.37 GiB
Generation No : 1636316595
Uptime (seconds) : 73152
Heap Memory (MB) : 19542.18 / 32168.00
Off Heap Memory (MB) : 1337.98
Data Center : us-west2
Rack : a
Exceptions : 15
Key Cache : entries 152283, size 23.07 MiB, capacity 100 MiB, 23835 hits, 280738 requests, 0.085 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 6782, size 423.88 MiB, capacity 480 MiB, 23947952 misses, 24381819 requests, 0.018 recent hit rate, 250.977 microseconds miss latency
Percent Repaired : 0.49796724500672584%
Token : (invoke with -T/--tokens to see all 256 tokens)
$ free -h
total used free shared buff/cache available
Mem: 62G 53G 658M 1.0M 8.5G 8.5G
Swap: 0B 0B 0B
~$ nodetool compactionstats
pending tasks: 197
....
id compaction type keyspace table completed total unit progress
5e555610-40b2-11ec-9b5a-27bc920e6e55 Compaction mykeyspace table1 27299674 89930474 bytes 30.36%
5e55f251-40b2-11ec-9b5a-27bc920e6e55 Compaction mykeyspace table2 13922048 74426264 bytes 18.71%
Active compaction remaining time : 0h00m02s
I would definitely not run compaction manually. Many of the compaction thresholds are file-size based, which means that forcing it creates files sized outside of the normal progression. The result is that the chances of compaction running on that table again are extremely slim. Basically, once you start down that path, you'll be running manual compactions forever.
I would also say that compaction is a good thing. You want it to happen, as compacted files are necessary to keep reads performing well. Of course, that's not much of a consolation when the compaction process is affecting operational activity.
tl;dr:
One thing I have done in the past is to lower compaction throughput during the day. I'm not sure what throughput you're running with currently, but you can find out by running nodetool getcompactionthroughput:
% bin/nodetool getcompactionthroughput
Current compaction throughput: 64 MB/s
So at the times when customer/operational traffic is high, you can reduce that significantly:
% bin/nodetool setcompactionthroughput 1
% bin/nodetool getcompactionthroughput
Current compaction throughput: 1 MB/s
1 MB/second is the lowest value that compaction throughput can be set to. If you set it to zero, it is "un-throttled," which means it will consume all the resources it can get. Setting it to 1 brings its resource use (and speed) down to a trickle.
Once the busy daily traffic subsides, that setting can be turned back up:
% bin/nodetool setcompactionthroughput 256
Current compaction throughput: 256 MB/s
This can be accomplished with a scheduled job for each command.
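As an illustration of what such a job could look like, here is a small Python wrapper around nodetool, meant to be invoked from cron or any other scheduler; the script name, install path, and schedule times in the comments are just placeholders, and the 1 and 256 MB/s values are the ones used above:

#!/usr/bin/env python3
"""Set Cassandra compaction throughput; intended to be called from a scheduler.

Example crontab entries (times and path are illustrative):
  0 8  * * *  /usr/local/bin/set_compaction_throughput.py 1      # throttle during busy hours
  0 20 * * *  /usr/local/bin/set_compaction_throughput.py 256    # open it back up at night
"""
import subprocess
import sys

def set_throughput(mb_per_sec: int) -> None:
    # Apply the new limit, then echo back what the node reports.
    subprocess.run(["nodetool", "setcompactionthroughput", str(mb_per_sec)], check=True)
    result = subprocess.run(["nodetool", "getcompactionthroughput"],
                            capture_output=True, text=True, check=True)
    print(result.stdout.strip())

if __name__ == "__main__":
    set_throughput(int(sys.argv[1]))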

CounterMutationStage and ViewMutationStage metrics are missing in Cassandra 4.0

When invoking nodetool tpstats on Cassandra 4.0, here is what I got (see the nodetool result screenshot).
But no CounterMutationStage or ViewMutationStage can be found. Where are they?
Those metrics are still there. The issue, though, is that they expose their data "lazily," which basically means they won't show up at all while their value is zero. Once you start writing to counters or views, those metrics perform their "lazy initialization," and only then are they exposed. I tested this out using Cassandra 4.0 beta4.
Running a baseline nodetool tpstats | head -n 4:
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 1 0 0
ReadStage 0 0 27 0 0
CompactionExecutor 0 0 41 0 0
Next, I'll create a simple counter table.
CREATE TABLE games_popularity (game text PRIMARY KEY, popularity counter);
I'll increment the counter a few times and SELECT it.
aploetz#cqlsh> SELECT * FROM stackoverflow.games_popularity ;
game | popularity
----------------+------------
Cyberpunk 2077 | 3
(1 rows)
Now rerunning nodetool tpstats | head -n 4 indeed shows CounterMutationStage:
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 12 0 0
CounterMutationStage 0 0 3 0 0
ReadStage 0 0 96 0 0
Note that in 4.0 these metrics are also exposed in the system_views.thread_pools virtual table, which you can view with SELECT * FROM system_views.thread_pools;
Thanks to the good work done by the Cassandra developers, these metrics are now lazily initialised to improve performance.
The best way to "wake up" all the lazy metrics is:
nodetool getconcurrency
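If you just want to check whether those lazily-initialised pools have appeared yet on a node, a quick sketch along these lines works (it only assumes nodetool is on the PATH):

import subprocess

# Pools that Cassandra 4.0 initialises lazily and therefore hides while unused.
LAZY_POOLS = ["CounterMutationStage", "ViewMutationStage"]

tpstats = subprocess.run(["nodetool", "tpstats"],
                         capture_output=True, text=True, check=True).stdout
for pool in LAZY_POOLS:
    state = "initialised" if pool in tpstats else "not initialised yet (no traffic)"
    print(f"{pool}: {state}")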

"vmstat" and "perf stat -a" show different numbers for context-switching

I'm trying to understand the context-switching rate on my system (running on AWS EC2) and where the switches are coming from. Just getting the number is already confusing: the two tools I know of that report such a metric give me different results. Here's the output from vmstat:
$ vmstat -w 2
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 443888 492304 8632452 0 0 0 1 0 0 14 2 84 0 0
37 0 0 444820 492304 8632456 0 0 0 20 131602 155911 43 5 52 0 0
8 0 0 445040 492304 8632460 0 0 0 42 131117 147812 46 4 50 0 0
13 0 0 446572 492304 8632464 0 0 0 34 129154 142260 49 4 46 0 0
The number is ~140k-160k/sec.
But perf tells something else:
$ sudo perf stat -a
Performance counter stats for 'system wide':
2980794.013800 cpu-clock (msec) # 35.997 CPUs utilized
12,335,935 context-switches # 0.004 M/sec
2,086,162 cpu-migrations # 0.700 K/sec
11,617 page-faults # 0.004 K/sec
...
0.004 M/sec is apparently 4k/sec.
Why is there a disparity between the two tools? Am I misinterpreting something in either of them, or are their CS metrics somehow different?
FWIW, I've tried doing the same on a machine running a different workload, and the discrepancy there is even larger, roughly twice as big.
Environment:
AWS EC2 c5.9xlarge instance
Amazon Linux, kernel 4.14.94-73.73.amzn1.x86_64
The service runs on Docker 18.06.1-ce
Some recent versions of perf have a unit-scaling bug in the printing code. Manually compute 12.3M / wall-time and see whether that's sane. (Spoiler alert: it is, according to the OP's comment.)
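For the numbers in the question, that sanity check looks roughly like this (treating cpu-clock divided by the 35.997 "CPUs utilized" as the wall time, which is an approximation):

cpu_clock_ms = 2980794.013800              # total CPU time across all CPUs, from perf
cpus = 35.997                              # "CPUs utilized"
context_switches = 12_335_935

wall_time_s = cpu_clock_ms / cpus / 1000.0          # ~82.8 s of wall time
rate = context_switches / wall_time_s               # ~149,000 switches/sec
print(f"{rate:,.0f} context switches per second")   # in line with vmstat's ~140k-160k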
https://lore.kernel.org/patchwork/patch/1025968/
Commit 0aa802a79469 ("perf stat: Get rid of extra clock display function") introduced the bug in mainline Linux 4.19-rc1 or so:
Thus, perf_stat__update_shadow_stats() now saves scaled values of clock events in msecs, instead of the original nsecs. But while calculating values of shadow stats we still consider clock event values in nsecs. This results in wrong shadow stat values.
Commit 57ddf09173c1 (Mon, 17 Dec 2018) fixed it for 5.0-rc1, and the fix was eventually released with perf upstream version 5.0.
Vendor kernel trees that cherry-pick commits for their stable kernels might still have the bug, or might have fixed it earlier.

Azure SQL Full Text Index initial population slow

I have a table with approximately 4.7 million records, and I created a full-text index on it.
I am experiencing slow initial population of the full-text index.
The initial pricing tier I had was S1; I upgraded it to S3 but did not get better performance.
DTU and CPU usage are not high (usually staying around 0%), and the current velocity is about 175,000 records per hour.
What can I do to speed this up?
Thanks in advance.
Later edit:
I tried the same operation on a local installation of SQL Server 2014 and had no problems with indexing the data.
Update 14.11.2016:
Output of sys.dm_exec_requests:
session_id request_id start_time status command sql_handle statement_start_offset statement_end_offset plan_handle database_id user_id connection_id blocking_session_id wait_type wait_time last_wait_type wait_resource open_transaction_count open_resultset_count transaction_id context_info percent_complete estimated_completion_time cpu_time total_elapsed_time scheduler_id task_address reads writes logical_reads text_size language date_format date_first quoted_identifier arithabort ansi_null_dflt_on ansi_defaults ansi_warnings ansi_padding ansi_nulls concat_null_yields_null transaction_isolation_level lock_timeout deadlock_priority row_count prev_error nest_level granted_query_memory executing_managed_code group_id query_hash query_plan_hash statement_sql_handle statement_context_id dop parallel_worker_count external_script_request_id
90 0 57:45.2 running SELECT 0x020000004D4F6005A3E8119F3DD3297095832ABE63E312F20000000000000000000000000000000000000000 0 66 0x060005004D4F6005D04F998A6E00000001000000000000000000000000000000000000000000000000000000 5 1 70A61674-396D-47EB-82C7-F3C13DAA2AD0 0 NULL 0 MEMORY_ALLOCATION_EXT 0 1 141037 0x380035003100450039003200350032002D0045003700450032002D0034003600320041002D0039004200390041002D003200310037004400300036003700430032004100360039 0 0 1 1 0 0x7A218C885C2F7437 0 0 228 2147483647 us_english mdy 7 1 1 1 0 1 1 1 1 2 -1 0 1 0 0 0 0 2000000026 0xC1681A4180C2C052 0x63AD167562BDAE5D 0x0900A3E8119F3DD3297095832ABE63E312F20000000000000000000000000000000000000000000000000000 7 1 NULL NULL
As far as I can see, this runs much faster on P1, which is strange because P1 is not much more powerful than S3.
I will mark this as solved because it seems to be an issue related to service tier levels.
If you bump up the service tier of the Azure SQL database, the full-text indexing runs much faster than at the Standard level.
I could not sense a difference between S1 and S3, but P1 is much faster than S3.
I do not know the reasoning behind this, even though the difference in DTUs is only 25 (S3: 100 DTU, P1: 125 DTU).

Is the Linux Scheduler aware of hardware interrupts (Scheduler Jitter)

If a process is interrupted by a hardware interrupt (First-Level Interrupt Handler), does the CPU scheduler become aware of that? For example, does the scheduler account the execution time of hardware interrupts separately from that of the interrupted process?
More details:
I am trying to troubleshoot an issue where the CPU utilization shown in htop is far too low for the packet encryption task at hand: the CPU is at <10% while encrypting packets at 400 Mbps, even though the raw encryption speed is only 1.6 Gbps (so encryption alone should account for roughly a quarter of a core, and packet encryption certainly cannot go faster than the raw encryption speed).
Explanation:
My hypothesis is that the packet encapsulation happens in hardware interrupt context, giving me the illusion of low CPU usage in htop. Usually FLIHs are implemented so that they finish their work as quickly as possible and defer the rest to SLIHs (Second-Level Interrupt Handlers, which I guess are executed on behalf of ksoftirqd/X). But what happens if a FLIH interrupts a process for a very long time? Does that introduce some kind of OS jitter?
I am using Ubuntu 10.04.1 on x86-64 platform.
Additional debugging info:
while [ 1 ]; do cat /proc/stat | grep "cpu "; sleep 1; done;
cpu 288 1 1677 356408 1145 0 20863 0 0
cpu 288 1 1677 356772 1145 0 20899 0 0
cpu 288 1 1677 357108 1145 0 20968 0 0
cpu 288 1 1677 357392 1145 0 21083 0 0
cpu 288 1 1677 357620 1145 0 21259 0 0
cpu 288 1 1677 357972 1145 0 21310 0 0
cpu 288 1 1677 358289 1145 0 21398 0 0
cpu 288 1 1677 358517 1145 0 21525 0 0
cpu 288 1 1678 358838 1145 0 21652 0 0
cpu 289 1 1678 359141 1145 0 21704 0 0
cpu 289 1 1678 359563 1145 0 21729 0 0
cpu 290 1 1678 359886 1145 0 21758 0 0
cpu 290 1 1678 360296 1145 0 21801 0 0
The seventh column (or sixth number column) here is, I guess, the time spent inside hardware interrupt handlers (htop uses this proc file to get its statistics). I am wondering whether this will turn out to be a bug in Linux or in the driver. When I took these /proc/stat snapshots, the traffic was running at 500 Mbps in and 500 Mbps out.
The time spent in interrupt handlers is accounted for.
htop shows it as "si" (soft interrupt) and "hi" (hard interrupt); "ni" is nice time and "wa" is I/O wait.
Edit:
From man proc, for the cpu line in /proc/stat:
the sixth column is hardware irq time,
the seventh column is softirq time,
the eighth is steal (stolen) time,
the ninth is guest time.
The latter two are only meaningful for virtualized systems.
Do you have a kernel built with the CONFIG_IRQ_TIME_ACCOUNTING (Processor type and features/Fine granularity task level IRQ time accounting) option set?
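Independently of htop, you can watch those /proc/stat columns yourself; here is a rough Python sketch that samples the aggregate cpu line twice and reports what fraction of the interval went to each state, including irq ("hi") and softirq ("si"):

import time

def cpu_times():
    """Return the list of numbers from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)
after = cpu_times()

delta = [b - a for a, b in zip(before, after)]    # ticks spent in each state over 5 s
total = sum(delta) or 1
names = ["user", "nice", "system", "idle", "iowait",
         "irq (hi)", "softirq (si)", "steal", "guest"]
for name, d in zip(names, delta):
    print(f"{name:>12}: {100.0 * d / total:5.1f}%")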

Resources