100% CPU usage in profiler output: what could cause this, based on our profile log? (Node.js)

We have a massively scaled Node.js project (~1M+ users) that is suddenly taking a massive beating on our CPU (EPYC, 24 cores, 2 GHz).
We've been trying to debug what's eating all our CPU with a profiler (I can show you the output down below), and whatever it is, it's behaving really weirdly.
We have a master process that spawns 48 cluster workers; after they're all loaded, CPU usage slowly climbs to the maximum. Killing a single worker doesn't make the load average dip at all, but after killing the master process everything goes back to normal.
The master process obviously isn't maxing out all the threads on its own, and killing a worker really should do the trick, shouldn't it?
We even stopped all user input to the application, and shut one worker down entirely, and it didn't reduce CPU usage at all.
We've got plenty of log files we could send if you want them.
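For reference, the process layout is roughly the following (a minimal sketch, not our actual code; the worker logic is omitted):
// Minimal sketch of the layout described above, assuming Node's cluster module.
const cluster = require('cluster');

if (cluster.isPrimary) {            // cluster.isMaster on older Node versions
  for (let i = 0; i < 48; i++) {
    cluster.fork();                 // one worker per hardware thread (24 cores / 48 threads)
  }
} else {
  // Worker: load the application and start serving traffic (not shown here).
}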

Based on the profile, it looks like the code is spending a lot of time getting the current time from the system. Do you maybe have Date.now() (or old-school, extra-inefficient +new Date()) calls around a bunch of frequently used, relatively quick operations? Try removing those; you should see a speedup (or a drop in CPU utilization, respectively).
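As a purely hypothetical illustration (not taken from your code): a timestamp inside a hot loop is paid once per element, while hoisting it out is paid once per batch.
// Hypothetical hot path: one system-time lookup per item.
function tagAll(items) {
  for (const item of items) {
    item.seenAt = Date.now();
  }
}

// Cheaper: take the timestamp once per batch; the values differ by at most the
// loop's own run time, which is usually acceptable.
function tagAllOnce(items) {
  const now = Date.now();
  for (const item of items) {
    item.seenAt = now;
  }
}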
As for stopping user input not reducing CPU load: are you maybe scheduling callbacks, promises, or other async work? It's not difficult to write a program that only needs to be kicked off once and then keeps the CPU busy forever on its own.
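As a sketch of that failure mode (hypothetical code, not something visible in the posted profile), a callback that unconditionally reschedules itself will keep a worker busy even when no requests are coming in:
// Hypothetical: work that kicks itself off again forever.
function processPendingJobs() {
  // Stand-in for real queue handling; usually finds nothing to do.
}
function drainQueue() {
  processPendingJobs();
  setImmediate(drainQueue);   // reschedules immediately, even when idle,
                              // so the event loop never gets a quiet moment
}
drainQueue();
// A timer is far gentler on the CPU when the queue is usually empty:
// setInterval(processPendingJobs, 50);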
Beyond these rough guesses, there isn't enough information here to dig deeper. Is there anything other than time-related stuff on the profile, further down? In particular, any of your own code? What does the bottom-up profile say?
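(If it helps to regenerate the data: Node's built-in sampling profiler produces that view. Start a worker with node --prof, then run node --prof-process on the isolate-*.log file it writes; the bottom-up, "heavy" section near the end of that output is the part I'm asking about.)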

Related

Why would a process use over twice as many CPU resources on a machine with double the cores?

I'm hoping someone could point me in the right direction here so I can learn more about this.
We have a process running on our iMac Pro 8-core machine which utilises ~78% CPU.
When we run the same process on our new Mac Pro 16-core machine it utilises ~176% CPU.
What reasons for this could there be? We were hoping the extra cores would allow us to run more processes simultaneously; however, if each one uses over double the CPU resources, surely that means we will be able to run fewer processes on the new machine?
There must be something obvious I'm missing about architecture. Could someone please help? I understand I haven't included any code examples, I'm asking in a more general sense about scenarios that could lead to this.
I suspect that the CPU thread manager tries to use as much CPU as it can/needs. If there are more processes needing CPU time, then the cycles will be shared out more sparingly to each. Presumably your task runs correspondingly faster on the new Mac?
The higher CPU utilization just indicates that it's able to make use of more hardware. That's fine. You should expect it to use that hardware for a shorter period, and so more things should get done in the same overall time.
As to why, it completely depends on the code. Some code decides how many threads to use based on the number of cores. If there are non-CPU bottlenecks (the hard drive or GPU for example), then a faster system may allow the process to spend more time computing and less time waiting for non-CPU resources, which will show up as higher CPU utilization, and also faster throughput.
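As a hypothetical sketch of the "threads based on core count" case (written here in Node.js, though the same pattern shows up in many runtimes), a program like this naturally reports roughly twice the CPU utilization on a machine with twice the cores; crunch.js is a made-up file name standing in for the CPU-bound work.
// Hypothetical: parallelism sized from the visible core count.
const os = require('os');
const { Worker } = require('worker_threads');

const poolSize = os.cpus().length;   // logical CPUs: about twice as many on the 16-core Mac Pro
for (let i = 0; i < poolSize; i++) {
  new Worker('./crunch.js');         // './crunch.js' stands in for whatever CPU-bound work is done
}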
If your actual goal is to have more processes rather than more throughput (which may be a very reasonable goal), then you will need to tune the process to use fewer resources even when they are available. How you do that completely depends on the code. Whether you even need to do that will depend on testing how the system behaves when there is contention between many processes. In many systems it will take care of itself. If there's no problem, there's no problem. A higher or lower CPU utilization number is not in itself a problem. It depends on the system, where your bottlenecks are, and what you're trying to optimize for.

How do I profile multithreading problems?

This is the first time I am trying to profile a multi-threaded program.
I suspect the problem is that it is waiting for something, but I have no clue what; the program never reaches 100% CPU, GPU, RAM, or I/O use.
Until recently, I've only worked on projects with single-threading, or where the threads were very simple (for example, an extra thread just to keep the UI from locking up while the program works; or once I made a game engine with a separate thread to handle .XM and .IT music files, so that the main thread could do everything else while the other thread, on another core, took care of decoding those files).
This program has several threads, and they don't do parallel work on the same tasks, each thread has its own completely separate purpose (for example one thread is dedicated to handling all sound-related API calls to the OS).
I downloaded the Microsoft performance tools; there is a blog by an ex-Valve employee that explains how to use them for this, and although I managed to make some profiles and whatnot, I don't really understand what I am seeing. It is only a bunch of pretty graphs to me (except the CPU use graph, which I already knew from doing sample-based profiling on single-threaded apps). So, how do I find out why the program is waiting on something? Or how do I find what it is waiting for? How do I find which thread is blocking the others?
I look at it as an alternation between two things:
a) measuring overall time, for which all you need is some kind of timer, and
b) finding speedups, which does not mean measuring, in spite of what a lot of people have been told.
Each time you find a speedup, you time the results and do it again.
That's the alternation.
To find speedups, the method I and many people use is random pausing.
The idea is, you get the program running under a debugger and manually interrupt it, several times.
Each time, you examine the state of every thread, including the call stack.
It is very crude, and it is very effective.
The reason this works is that the only way the program can go faster is if it is doing an activity that you can remove, and if that saves a certain fraction of time, you are at least that likely to see it on every pause.
This works whether it is doing I/O, waiting for something, or computing.
It sees things that profilers do not expose, because they make summaries from which speedups can easily hide.
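To put a number on "at least that likely": if a removable activity accounts for a fraction f of the run time, each pause lands in it with probability at least f, so n pauses catch it at least once with probability 1 - (1 - f)^n. Something eating 20% of the time, for instance, shows up in 10 pauses with roughly 89% probability.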
The Performance Wizard in the Visual Studio Performance and Diagnostics Hub has a "Resource contention data" profiling mode which lets you analyze concurrency contention among threads, i.e. how the overall performance of a program is impacted by threads waiting on other threads. Please refer to this blog post for more details.
PerfView is an extremely powerful profiling tool which lets you analyze the impact of service threads and tasks on the overall performance of a program. There is a PerfView tutorial available.

Excessive amount of system calls when using `threadDelay`

I have a couple of Haskell processes running in production on a system with 12 cores. All processes are compiled with -threaded and run with 12 capabilities. One library they all use is resource-pool, which keeps a pool of database connections.
What's interesting is that even though all processes are practically idle, they consume around 2% CPU time. Inspecting one of these processes with strace -p $(pgrep processname) -f reveals that the process is making an unreasonable number of system calls even though it should not really be doing anything. To put things into perspective:
Running strace on a process with -N2 for 5 seconds produces a 66K log file.
Running it with an (unreasonable) -N64 yields a 60-megabyte log.
So the number of capabilities drastically increases the number of system calls being issued.
Digging deeper we find that resource-pool is running a reaper thread which fires every second to inspect if it can clean up some resources. We can simulate the same behavior with this trivial program.
module Main where

import Control.Concurrent (threadDelay)
import Control.Monad (forever)

-- Wake up once a second and do nothing, mimicking resource-pool's reaper thread.
main :: IO ()
main = forever $ do
  threadDelay (10 ^ 6)  -- threadDelay takes microseconds, so this sleeps for one second
If I pass -B to the runtime system I get audio feedback whenever a GC is issued, which in this case is every 60 seconds.
So when I suppress these GC cycles by passing -I0 to the RTS, running the strace command on the process only yields log files of around 70K. Since the process is also running a scotty server, GC is triggered when requests come in, so collections seem to happen when I actually need them.
Since we are going to increase the number of Haskell processes on this machine by a large amount over the course of the next year, I was wondering how to keep their idle-time cost at a reasonable level. Apparently passing -I0 is a rather bad idea (?). Another idea would be to just decrease the number of capabilities from 12 to maybe something like 4. Is there any other way to tweak the RTS so that I can keep the processes from burning too many CPU cycles while idling?
The way GHC's memory management is structured, in order to keep memory usage under control, a 'major GC' is periodically needed, during the running of the program. This is a relatively expensive operation, and it 'stops the world' - the program makes no progress whilst this is occurring.
Obviously, it is undesirable for this to happen at any crucial point of program execution. Therefore by default, whenever a GHC-compiled program goes idle, a major GC is performed. This is usually an unobtrusive way of keeping the garbage level down and overall memory efficiency and performance up, without interrupting program interaction. This is known as 'idle GC'.
However, this can become a problem in scenarios like this: many concurrent processes, each of them woken frequently, running for a short amount of time, then going back to idle. This is a common scenario for server processes. In this case, when idle GC kicks in, it doesn't obstruct the process it is running in, which has completed its work, but it does steal resources from other processes running on the system. Since the program frequently idles, it is not necessary for the overhead of a major GC to be incurred on every single idle.
The 'brute force' approach would be to pass the RTS option -I0 to the program, disabling idle GC entirely. This will solve the problem in the short run, but it misses opportunities to collect garbage, which could then accumulate and cause a GC to kick in at an inopportune moment.
Partly in response to this question, the flag -Iw was added to the GHC runtime system. This establishes a minimum interval between which idle GCs are allowed to run. For example, -Iw5 will not run idle GC until 5 seconds have elapsed since the last GC, even if the program idles several times. This should solve the problem.
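For example (assuming the program was linked with -rtsopts and the binary is called myserver, a made-up name), you would start it as ./myserver +RTS -Iw5 -RTS; alternatively, the flag can be baked in at build time with -with-rtsopts=-Iw5.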
Just keep in mind the caveat in the GHC User's Guide:
This is an experimental feature, please let us know if it causes problems and/or could benefit from further tuning.
Happy Haskelling!

Consistent use of CPU by Java Process

I am running a Java program which does heavy work and needs lots of memory and CPU attention.
I took a snapshot of Task Manager while the program was running, and this is what it looked like.
Clearly the program is making use of all 8 cores available on my machine, but if you look at the CPU usage graph you can see dips in the usage, and these dips are consistent across all cores.
My question is: is there some way of avoiding these dips? Can I make sure that all my cores are used consistently, without any dips, and come to rest only after my program has finished?
This looks so familiar. Obviously, your threads are blocking for some reason. Here are my suggestions:
Check to see if you have any thread blocking (synchronization). Thread synchronization is easy to do wrong and can stop computation for extended periods of time.
Make sure you aren't waiting on I/O (file, network, devices, etc). Often the default for network or other I/O is to block.
Don't block on message passing or remote procedure calls.
Use a more sophisticated profiler to get a better look. I use Intel VTune, but then I have access to it. There are other low-level profiling tools that are just as capable but more difficult to use.
Check for other processes that might be using the system. I've had situations where that other process doesn't use the processor (blocks) but doesn't give the context up (doesn't swap out and allow another process to run).
When I say "don't block", I don't mean that you should poll. That's even worse as it consumes processing without doing anything useful. Restructure your algorithm to hide latency. Use a new algorithm that permits more latency hiding. Find alternate ways of thread synchronization that minimizes or eliminates blocking.
My two cents.

Process & thread scheduling overhead

There are a few things I don't quite understand when it come to scheduling:
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
Does the OS schedule by process or by individual kernel thread? It would make more sense to schedule each process and, within that time window, run whatever threads that process has available. That way process context switching is minimized. Is my understanding correct?
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
If I'm reading a socket (blocking), my thread will get swapped out until data is available; then a hardware interrupt will be triggered and the data will be moved to RAM (either by the CPU or by the NIC, if it supports DMA). Am I correct to assume that the thread will not necessarily be swapped back in at that point to handle the incoming data?
I'm asking primarily about Linux, but I would imagine the info would also be applicable to Windows as well.
I realize it's a bunch of different questions, I'm trying to clear up my understanding on this topic.
I assume each process/thread, as long as it is CPU bound, is given a time window. Once the window is over, it's swapped out and another process/thread is run. Is that assumption correct? Are there any ballpark numbers for how long that window is on a modern PC? I'm assuming around 100 ms? What's the overhead of swapping out like? A few milliseconds or so?
No. Pretty much all modern operating systems use pre-emption, allowing interactive processes that suddenly need to do work (because the user hit a key, data was read from the disk, or a network packet was received) to interrupt CPU bound tasks.
Does the OS schedule by process or by individual kernel thread? It would make more sense to schedule each process and, within that time window, run whatever threads that process has available. That way process context switching is minimized. Is my understanding correct?
That's a complex optimization decision. The cost of blowing out the instruction and data caches is typically large compared to the cost of changing the address space, so this isn't as significant as you might think. Typically, picking which thread to schedule of all the ready-to-run threads is done first and process stickiness may be an optimization affecting which core to schedule on.
How does the time each thread runs compare to other system times, such as RAM access, network access, HD I/O etc?
Obviously, a thread has to run for a very large number of RAM accesses' worth of time, because switching threads itself requires a large number of such accesses. Hard drive and network I/O are generally slow enough that a thread that's waiting for such a thing is descheduled.
Fast SSDs change things a bit. One thing I'm seeing a lot of lately is long-treasured optimizations that use a lot of CPU to try to avoid disk accesses can be worse than just doing the disk access on some modern machines!
