process-exporter with alert rules in prometheus when process using too much CPU - prometheus-alertmanager

I am using process-exporter to monitor processes, and I want to alert when a process uses too much CPU.
This is my CPU monitoring query in the Prometheus dashboard:
sum(rate(namedprocess_namegroup_cpu_seconds_total{groupname=~"$processes",instance="$host", mode=~"system|user"}[20s])) by (groupname, instance)
I have tried to write an alert with this (testing with a 10% CPU threshold first):
- name: process
  rules:
  - alert: CPUProcess
    expr: sum(rate(namedprocess_namegroup_cpu_seconds_total[20s])) by (groupname, instance) > 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) uses too much CPU"
      description: "Process (instance {{ $labels.groupname }}) uses high CPU"
But it seems it doesn't work (other alerts work normally). Can you give me some advice? Thank you.

Fixed by changing the expression to namedprocess_namegroup_cpu_seconds_total{groupname=~".+", mode=~"system"} > 10
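For what it's worth, rate() over a *_cpu_seconds_total counter returns cores of CPU used per second, so the original > 10 threshold corresponds to 1000% of one core and would almost never fire. A rule that alerts above 10% of a core might look like the sketch below; the 0.1 threshold, the [1m] window, and the mode filter are illustrative assumptions, not taken from the original post:

- name: process
  rules:
  - alert: CPUProcess
    # rate() of a cpu_seconds_total counter yields cores in use,
    # so 0.1 corresponds to 10% of a single core
    expr: sum(rate(namedprocess_namegroup_cpu_seconds_total{mode=~"system|user"}[1m])) by (groupname, instance) > 0.1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Process {{ $labels.groupname }} on {{ $labels.instance }} uses too much CPU"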

Related

kubernetes : High memory usage by daemonset-pod when using hostPath volume

I have consumer applications which read (but never write) a database of size ~4GiB and perform some tasks. To make sure the same database is not duplicated across applications, I've stored it on all node machines of the k8s cluster.
DaemonSet
I've used a DaemonSet with a "hostPath" volume. The DaemonSet pod extracts the database onto each node machine (/var/lib/DATABASE).
For the health check of the DaemonSet pod, I've written a shell script which checks the modification time of the database file (using the date command).
Database extraction requires approximately 300MiB of memory, and 50MiB is more than sufficient for the health check. Hence I've set the memory request to 100MiB and the memory limit to 1.5GiB.
When I run the DaemonSet, memory usage is high (~300MiB) for the first 10 seconds (while the database is extracted) and then drops to ~30MiB. The DaemonSet works fine, as per my expectation.
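For reference, the relevant fragment of such a DaemonSet spec would look roughly like this; the container name and image are illustrative placeholders, not from the original setup:

containers:
- name: db-extractor          # illustrative name
  image: db-extractor:v1      # illustrative image
  resources:
    requests:
      memory: 100Mi
    limits:
      memory: 1.5Gi
  volumeMounts:
  - name: database
    mountPath: /var/lib/DATABASE
volumes:
- name: database
  hostPath:
    path: /var/lib/DATABASE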
Consumer Application
Now, the consumer application pods (written in Go) use the same "hostPath" volume (/var/lib/DATABASE) and read the database from that location. The consumer applications do not perform any write operations on the /var/lib/DATABASE directory.
However, when I deploy this consumer application on k8s, I see a huge increase in the memory usage of the DaemonSet pod, from 30MiB to 1.5GiB. The memory usage of the DaemonSet pod is almost the same as its memory limit.
I am not able to understand this behaviour: why does the consumer application drive up the memory usage of the DaemonSet pod?
Any help/suggestions/troubleshooting steps would be of great help!
Note: I'm using the "kubectl top" command to measure memory (working-set-bytes).
I've found this link (Kubernetes: in-memory shared cache between pods), which says:
hostPath by itself poses a security risk, and when used, should be scoped to only the required file or directory, and mounted as ReadOnly. It also comes with the caveat of not knowing who will get "charged" for the memory, so every pod has to be provisioned to be able to absorb it, depending how it is written. It also might "leak" up to the root namespace and be charged to nobody but appear as "overhead"
However, I did not find any reference in the official k8s documentation. It would be helpful if someone could elaborate on it.
Following are the contents of the memory.stat file from the DaemonSet pod:
cat /sys/fs/cgroup/memory/memory.stat
cache 1562779648
rss 1916928
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 96346371
pgpgout 95965640
pgfault 224070825
pgmajfault 0
inactive_anon 0
active_anon 581632
inactive_file 37675008
active_file 1522688000
unevictable 0
hierarchical_memory_limit 1610612736
hierarchical_memsw_limit 1610612736
total_cache 1562779648
total_rss 1916928
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 96346371
total_pgpgout 95965640
total_pgfault 224070825
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 581632
total_inactive_file 37675008
total_active_file 1522688000
total_unevictable 0
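Note that in this dump nearly all of the charged memory is page cache (total_cache ≈ 1.5GiB, total_active_file ≈ 1.45GiB) rather than process memory (total_rss ≈ 2MiB), which is consistent with the consumer's file reads being charged to whichever cgroup gets accounted for the hostPath pages. A minimal Go sketch that breaks a cgroup v1 memory.stat down this way (path as shown above):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Parse the cgroup v1 memory.stat file dumped above.
	f, err := os.Open("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	stats := map[string]uint64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			v, _ := strconv.ParseUint(fields[1], 10, 64)
			stats[fields[0]] = v
		}
	}

	// total_rss is anonymous process memory; total_cache is page cache from
	// file I/O. The working-set metric reported by kubectl top subtracts only
	// inactive_file from usage, so active page cache still counts against the pod.
	for _, k := range []string{"total_rss", "total_cache", "total_active_file", "total_inactive_file"} {
		fmt.Printf("%-20s %6d MiB\n", k, stats[k]/(1<<20))
	}
}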

DB2 Create Database takes long time

I'm new to Db2 and just installed v11.5 on Ubuntu 18.04.
I referred to these two links for setup purposes: IBM and DCON.
I'm using the DB2 CLI to create a database. On typing create database <database_name> and pressing enter, it just sits there; there's no output.
I checked db2diag.log as well; it stops at this:
2021-05-18-15.41.46.618309+330 E653248E505 LEVEL: Event
PID : 29136 TID : 140104312022784 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SOURCE
APPHDL : 0-7 APPID: *LOCAL.db2inst1.210518101139
AUTHID : DB2INST1 HOSTNAME: Host
EDUID : 23 EDUNAME: db2agent (instance) 0
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::FirstConnect, probe:1000
START : DATABASE: SOURCE : ACTIVATED: NO
Tried 3-4 times; on one occasion I just let it run and it took around 30-40 minutes, but it did create the database. I'm not sure if I'm missing an initialization step.
Kindly guide.
System spec:
RAM: 16GB, CPU(s): 8, Model name: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
The time it takes to create your skeleton Db2-LUW database is mainly determined by your I/O and logging configuration, and whether or not you get swapping/paging. The CPU speed is less important than the I/O throughput for the create database action.
As an example, on my Db2-LUW v11.5 on Ubuntu 18.04 and 20.04, create database completes in the following times as reported by the time tool (with zero paging and no containerization/virtualization):
with NVMe (SSD): around 2 minutes
with spinning disk, ext4, 4K sector size, 256MB cache, SATA3, 3.5-inch 7200rpm: around 4 minutes
with spinning disk, ext4, 512-byte sector size, 64MB disk cache, SATA3, 2.5-inch 5400rpm: around 10 minutes
If you have more than one physical drive and/or controller, it can help to put the transaction logs on a different drive / controller to the tablespaces.
So you can see that the performance varies greatly with the I/O configuration. You get what you pay for, and how you configure it.
For creation of other objects in the skeleton database (tablespaces, tables, views, indexes, MQTs, routines, etc.), the performance will vary further, again depending on your I/O and logging configuration.
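If you want to measure this on your own system, you can time the bare create from the instance owner's shell; the database name TEST below is just an example:

time db2 "create database TEST"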

Self-hosted WCF service maxing CPU load

I am looking into an issue at work with a Windows service that is taking 100% CPU on a machine with 16 CPUs.
The service is hosting a self-hosted .NET WCF service.
I have received a crash dump, which I have loaded into WinDbg in order to look for clues.
So here is what I have tried:
!threads:
ThreadCount: 646
UnstartedThread: 0
BackgroundThread: 643
PendingThread: 0
DeadThread: 2
Hosted Runtime: no
642 of the threads were thread pool workers, as follows:
8 29 2a34 000000002068b510 3029220 Preemptive 0000000000000000:0000000000000000 0000000000563f50 0 MTA (Threadpool Worker)
~29s -> !CLRStack
000000003c66eb70 00000000770512fa [GCFrame: 000000003c66eb70]
000000003c66ec40 00000000770512fa [GCFrame: 000000003c66ec40]
000000003c66ec78 00000000770512fa [HelperMethodFrame: 000000003c66ec78] System.Threading.Monitor.Enter(System.Object)
000000003c66ed70 000007fef7af1c9c System.Threading.TimerQueueTimer.Fire()
000000003c66ede0 000007fef7a6c2f3 System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
000000003c66ee30 000007fef7a6c92a System.Threading.ThreadPoolWorkQueue.Dispatch()
000000003c66f388 000007fef8d57d33 [DebuggerU2MCatchHandlerFrame: 000000003c66f388]
~29s -> K
000000003c66e858 000007fefd7010dc ntdll!NtWaitForSingleObject+0xa
000000003c66e860 000007fef8d049bf KERNELBASE!WaitForSingleObjectEx+0x79
000000003c66e900 000007fef8d04977 clr!CLREventBase::WaitEx+0x16c
000000003c66e940 000007fef8d048f8 clr!CLREventBase::WaitEx+0x103
000000003c66e9a0 000007fef8e9c5de clr!CLREventBase::WaitEx+0x70
000000003c66ea30 000007fef8dc5a34 clr!WKS::GCHeap::WaitUntilGCComplete+0x2b
000000003c66ea60 000007fef8d0c4f4 clr!Thread::RareDisablePreemptiveGC+0x176
000000003c66eaf0 000007fef8dd1f3d clr!GCCoop::GCCoop+0x3d
000000003c66eb20 000007fef8e898cf clr!AwareLock::Contention+0x137
000000003c66ebe0 000007fef7af1c9c clr!JITutil_MonContention+0xaf
000000003c66ed70 000007fef7a6c2f3 mscorlib_ni+0x521c9c
000000003c66ede0 000007fef7a6c92a mscorlib_ni+0x49c2f3
000000003c66ee30 000007fef8d57d33 mscorlib_ni+0x49c92a
000000003c66eef0 000007fef8d556e6 clr!CallDescrWorkerInternal+0x83
000000003c66ef30 000007fef8d557af clr!CallDescrWorkerWithHandler+0x4a
000000003c66ef70 000007fef8eda2c9 clr!MethodDescCallSite::CallTargetWorker+0x2e6
000000003c66f120 000007fef8ee51b0 clr!QueueUserWorkItemManagedCallback+0x2a
000000003c66f200 000007fef8ee513e clr!DebuggerU2MCatchHandlerFrame::DebuggerU2MCatchHandlerFrame+0xa0
000000003c66f240 000007fef8ee50b5 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x38e
000000003c66f340 000007fef8ee51eb clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x2bd
000000003c66f3d0 000007fef8eda224 clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0x23b
000000003c66f430 000007fef8ee6baf clr!ManagedPerAppDomainTPCount::DispatchWorkItem+0xb4
000000003c66f5c0 000007fef8ee6ab3 clr!ThreadpoolMgr::ExecuteWorkRequest+0x4c
000000003c66f5f0 000007fef8eda8a6 clr!ThreadpoolMgr::WorkerThreadStart+0xf3
000000003c66f6b0 0000000076c9652d clr!Thread::intermediateThreadProc+0x7d
000000003c66f7f0 000000007702c541 kernel32!BaseThreadInitThunk+0xd
000000003c66f820 0000000000000000 ntdll!RtlUserThreadStart+0x1d
I'm having a hard time interpreting the stack traces, since they don't hit any of my application code.
Are they all just idle thread pool workers, waiting for work?
Threads with WaitForSingleObject are not critical, since they are waiting and not consuming CPU time. But be aware that your dump is only a snapshot and you might have had bad luck when taking the snapshot.
For a performance analysis with WinDbg you'd need several dumps during high CPU and compare them. If they all have similar stack traces, that's fine and you can conclude something. If they are all very different, it's almost useless.
The command !runaway seems more interesting here, since it lists CPU times consumed per thread, so you can identify the one(s) which are on high CPU. Again: having two snapshots that you can compare is helpful, because the main thread may still have consumed more CPU time in total than some short-living 100% threads.
If you can't use a performance profiler, SysInternals Procdump can generate a series of dumps (-n) for you on high CPU (-c). Use -s to set the time between dumps. For .NET, don't forget -ma for full memory.
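For example, a command along these lines (the process name and thresholds are illustrative) writes three consecutive full dumps whenever CPU stays above 90% for 5 seconds:

procdump -ma -c 90 -s 5 -n 3 MyService.exe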
Other than that, 646 threads sounds a lot to me. The OS itself could be quite busy scheduling them.
Sounds like the issue could be related to GC. Since this is a self-hosted service, it will use the workstation GC by default, unless you enable server GC manually:
http://msdn.microsoft.com/en-us/library/ms229357(v=vs.110).aspx
Have you tried that to see if it makes any difference?
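Server GC is enabled in the service's app.config; a minimal sketch:

<configuration>
  <runtime>
    <!-- use server GC instead of the default workstation GC -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>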
Perfview from Microsoft may be helpful. From the link:
http://blogs.msdn.com/b/dotnet/archive/2012/10/09/improving-your-app-s-performance-with-perfview.aspx
"Late last year, Vance Morrison, who is currently an architect on the .NET Framework Performance team, released PerfView, which is a new performance tool for .NET developers. PerfView helps you discover and investigate performance hotspots in .NET Framework apps, and enables you to deliver consistently high-performance apps to your customers.
Using PerfView, you can perform complex CPU performance analyses to solve hard-to-detect performance problems. PerfView's revolutionary grouping and folding features are what makes it possible to grasp and solve these difficult problems."
Use WPRUI.exe to capture a trace and analyze the CPU usage with WPA.exe.
Microsoft explained how to analyze the created trace in the following video:
Defrag Tools: #42 - WPT - CPU Analysis
http://channel9.msdn.com/Shows/Defrag-Tools/Defrag-Tools-42-WPT-CPU-Analysis
Collect ETW with PerfView and follow the big % numbers.
Try running ~*e !clrstack in WinDbg to get the call stacks of all threads, and look for repeated code.

SharePoint Search Component has multiple processes and uses lots of memory?

I'm using SharePoint 2013 + Windows 2012. I noticed that the SP search component has 5 processes in Task Manager, each using about 400-500 MB of memory. Is this normal? I also tried
Set-SPEnterpriseSearchService -PerformanceLevel Reduced
But it did not change anything. Should I restart the server?
I never noticed this on other SP servers I worked on before. Just curious: is it because of SP 2013 and some default settings?
Thanks
user3211586's link worked for me. Basically, the article says:
Quick and Dirty
Kill the noderunner.exe (Microsoft Sharepoint Search Component) process via TaskManager
This will obviously break everything related to Search on the site
Production
Change the Search Service Performance Level with PowerShell:
Get-SPEnterpriseSearchService | Set-SPEnterpriseSearchService -PerformanceLevel "PartlyReduced"
Performance Level Explained:
Reduced: Total number of threads = number of processors, Max Threads/host = number of processors
PartlyReduced: Total number of threads = 4 times the number of processors , Max Threads/host = 16 times the number of processors
Maximum: Total number of threads = 4 times the number of processors , Max Threads/host = 16 times the number of processors (threads are created at HIGH priority)
For the setting to take effect do an IISReset or restart the Search Service in Central Admin
I had the same issue as the OP, and running Set-SPEnterpriseSearchService -PerformanceLevel "PartlyReduced" followed by IISRESET /noforce resolved the issue for me.
Please check the article below:
http://social.technet.microsoft.com/wiki/contents/articles/20413.sharepoint-2013-performance-issues-with-search-service-application.aspx
When I tried this method and changed the config setting from 0 to a value between 1 and 500, it did reduce the memory usage, but search stopped working. After I reverted the config setting to 0, memory usage increased but search started working again.
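For context, the config setting that article refers to is, as far as I can tell (treat the path and element as an assumption), the node runner memory cap in NodeRunner.exe.config:

<!-- C:\Program Files\Microsoft Office Servers\15.0\Search\Runtime\1.0\NodeRunner.exe.config -->
<!-- 0 means unlimited; the article suggests setting a value in MB -->
<nodeRunnerSettings memoryLimitMegabytes="0" />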

How to monitor disk I/O with monit?

Currently I'm using M/Monit to monitor a lot of instances at once. But I would also like to know if anyone has tried to monitor disk I/O with monit. I don't have much knowledge about disks, so if someone could point me in the right direction or share a shell script, that would be great!
You should be looking at CPU wait, since that is your biggest indicator of I/O wait issues:
check system $HOST
if loadavg (15min) > 6 then alert
if memory usage > 90% then alert
if swap usage > 5% then alert
if cpu usage (user) > 70% then alert
if cpu usage (system) > 30% then alert
if cpu usage (wait) > 30% then alert
group system_resources
I wonder if this is what you're looking for:
check filesystem datafs with path /dev/sdb1
group server
start program = "/bin/mount /data"
stop program = "/bin/umount /data"
if failed permission 660 then unmonitor
if failed uid root then unmonitor
if failed gid disk then unmonitor
if space usage > 80 % then alert
if space usage > 94 % then stop
if inode usage > 80 % then alert
if inode usage > 94 % then stop
alert root@localhost
Taken from: http://mmonit.com/monit/documentation/monit.html#examples
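If your monit version has no built-in I/O test, one option is to wrap a script of your own with monit's check program statement; the script path, name, and threshold below are illustrative assumptions:

check program disk_io with path "/usr/local/bin/check_iowait.sh"
  if status != 0 then alert

And the script itself could be as simple as (Linux, using vmstat's "wa" column):

#!/bin/sh
# Exit non-zero when iowait exceeds 30% (column 16, "wa", of the
# second vmstat sample); the 30% threshold is an arbitrary example.
WA=$(vmstat 1 2 | tail -1 | awk '{print $16}')
[ "$WA" -lt 30 ]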
