I'm a little confused with what I'm seeing with a node process that I have running. docker stats on the host is showing that the container is using over 100% CPU. This makes me think that the node process is maxing out the CPU. This is confirmed when I run top on the host and see that the node process is using over 100% CPU.
When I jump into the docker container I see that node is only using 54% of the CPU and that the processing is split between the two cores. I was expecting to see one core maxed out and the other at 0 since Node is single-threaded.
I found this Q&A and it looks like the OS could be moving the process between the cores (news to me): Is This Single Node.JS App Using Multiple Cores?
Can you help me interpret the results? Is node pretty much maxed out? Or, since the process in the container is showing 54% usage, can that go up to 100%? Why is top in the node container showing 54% usage for node but 45% + 46% for both cores? Nothing is running in the container but the single node process. I'm not using clustering, although maybe a package I have included is.
I'm asking all this as I'm trying to understand if I should be scaling this ECS instance out or if node can handle more.
Node.JS: 15.1.0
EC2 Instance: c5.large
NestJS: 7.3.1
Different tops
What you're seeing is (likely) due to a difference in flavors of top.
I'm going to take a wild guess and say that your Docker image is perhaps based on Alpine? The top command in Alpine is busybox. It reports the per-process CPU usage as a percentage of the TOTAL number of CPUs available (nCPUs * 100%).
This differs from most other flavors of top, which report the per-process CPU usage as a percentage of a SINGLE CPU.
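If you want to confirm which flavour of top a container has (the container name below is a placeholder), checking what the binary resolves to is usually enough; on Alpine it points at busybox:
# On Alpine/busybox-based images, top is just a symlink to the busybox binary
docker exec -it my-node-container sh -c 'which top && ls -l "$(which top)"'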
Both tops show the same thing: ~50% usage on each CPU
The two top screenshots are actually showing the same thing: the node process is using about 50% of each of the 2 CPUs.
Testing theory
We can test this with the following:
# This will max out 1 cpu of the system
docker run --name stress --rm -d alpine sh -c 'apk add stress-ng && stress-ng --cpu 1'
# This shows the busybox top with usage as ratio of total CPUs
# press 'c' in top to see the per-CPU info at the top
docker exec -it stress top
# This will install and run procps top, with usage as a ratio of single CPU
docker exec -it stress sh -c 'apk add procps && /usr/bin/top'
In the screenshot above, we can see two different flavors of top. They are reporting the same CPU usage, but the upper one reports this as "100% CPU" (as a percentage of a single core), while the lower one reports this as 6% (1/16 cores = 6.25%).
What does this tell us about node's CPU usage?
Node is single-threaded, and cannot use more than 100% of a CPU. ...sort of. Under the hood, Node uses libuv, which runs a small pool of worker threads. This is how Node receives asynchronous events for I/O operations, for example. These threads do use CPU and can push your CPU usage over 100%. Some packages are also written as add-ons to Node, and these also use threads.
The environment variable UV_THREADPOOL_SIZE limits the maximum number of libuv-controlled threads which may run simultaneously. Setting this to a larger number (default is 4) before running node may remove a bottleneck.
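As a minimal sketch (the entry point app.js is a placeholder, and it assumes procps is installed so the full top is available, as in the test above):
# Start node with a larger libuv thread pool (the default is 4)
UV_THREADPOOL_SIZE=16 node app.js &

# List the threads of the node process: you should see the main JS thread
# plus the libuv/V8 worker threads, each with its own CPU usage
top -H -p "$(pgrep -n node)"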
If you are doing some CPU-intensive operations, consider using cluster, Worker Threads, writing your own add-on or spawning separate processes to do the computation.
Related
I'm trying to understand what the stress command actually does in Linux, in particular the -c option. My background is Physics, so I'm struggling with some concepts.
Does stress -c 3 launch 3 processes that consume 100% of 3 bounded CPU cores (for example cores 0, 1, 3)? The output of htop is confusing since I don't see 3 CPU cores at 100% all the time. Note: by bounded, I mean that these processes cannot run on other CPU cores (in this case 4 to N).
For example, after running stress -c 3, sometimes I see this (which makes sense to me):
But most of the time I'm seeing something like this (which doesn't make sense to me, because there aren't 3 CPU cores at 100%):
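For what it's worth, one way to see that the workers are not bound to specific cores by default is to compare a plain run with one that is explicitly pinned via taskset (a sketch, assuming the sysstat package provides mpstat):
# Default: the scheduler is free to migrate the 3 workers across all cores,
# so htop may show the load spread around rather than 3 fixed cores at 100%
stress -c 3 --timeout 10 &
mpstat -P ALL 1 10     # per-core utilisation, one line per core per second
wait

# Pinned: restrict the workers to cores 0-2; those three cores should now
# sit near 100% while the others stay idle
taskset -c 0-2 stress -c 3 --timeout 10 &
mpstat -P ALL 1 10
wait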
I know you can do something like
docker build -c 2 .
to give the container 2 cores, but can you do something like give the container 50% of the memory and 50% of the CPU?
Your example, docker build -c 2 ., doesn't actually do what you think it does. The -c flag assigns cpu-shares, which is a relative weighting with a default of 1024. So if another container is running with the default weighting and CPU usage is maxed out, your build container will only get 2/1026 of the CPU. If you want to use this mechanism to allocate CPU, you will need to do some maths based on the number of running containers and their existing weightings (e.g. if there are two containers running with the default weighting, and you give a third container a weighting of 2048, it will get 2048/(2048+1024+1024) or 50% of the CPU).
You can also use the --cpuset-cpus argument to control which cores the container runs on, which I think is what you're thinking of, but that will only help you if you set it for all containers.
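To make the relative weighting visible (image and container names are illustrative), you can pin two competing containers to the same core and compare their shares; they should settle near a 1:2 split:
# Two CPU hogs pinned to core 0 with different cpu-shares weightings
docker run -d --rm --name low  --cpu-shares=1024 --cpuset-cpus=0 alpine sh -c 'apk add stress-ng && stress-ng --cpu 1'
docker run -d --rm --name high --cpu-shares=2048 --cpuset-cpus=0 alpine sh -c 'apk add stress-ng && stress-ng --cpu 1'

# Compare the CPU % columns: roughly 33% vs 67% of the shared core
docker stats low high

# Clean up
docker rm -f low high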
I think what you're actually after is the --cpu-quota setting, which uses the Completely Fair Scheduler in the Linux kernel. The period is 100000 (100ms) by default, meaning the argument --cpu-quota=50000 should give the container 50% of 1 CPU.
Regarding memory, you can only set a maximum usage for each container; you can't allocate a percentage slice.
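For example (image and limits are illustrative), capping a container at half of one CPU and 512 MB of memory:
# --cpu-period defaults to 100000 (100ms), so a quota of 50000 means 50% of one CPU.
# -m sets an absolute memory ceiling; there is no "50% of host RAM" option,
# so you have to work out the absolute value yourself.
docker run --rm --cpu-period=100000 --cpu-quota=50000 -m 512m \
    alpine sh -c 'apk add stress-ng && stress-ng --cpu 1 --timeout 30'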
For full details on all of this, see https://docs.docker.com/reference/run/#runtime-constraints-on-resources
I want to know if my program was run in parallel over multiple cores. I can get the perf tool to report how many cores were used in the computation, but not if they were used at the same time (in parallel).
How can this be done?
You can try using the command
top
in another terminal while the program is running. It will show the usage of all the cores on your machine (press 1 in top to toggle the per-core view).
A few possible solutions:
Use htop on another terminal as your program is being executed. htop shows the load on each CPU separately, so on an otherwise idle system you'd be able to tell if more than one core is involved in executing your program.
It is also able to show each thread separately, and the overall CPU usage of a program is aggregated, which means that parallel programs will often show CPU usage percentages over 100%.
Execute your program using the time command or shell builtin. For example, under bash on my system:
$ dd if=/dev/zero bs=1M count=100 2>/dev/null | time -p xz -T0 > /dev/null
real 0.85
user 2.74
sys 0.14
It is obvious that the total CPU time (user+sys) is significantly higher than the elapsed wall-clock time (real). That indicates the parallel use of multiple cores. Keep in mind, however, that a program that is either inefficient or I/O-bound could have a low overall CPU usage despite using multiple cores at the same time.
Use top and monitor the CPU usage percentage. This method is even less specific than time and has the same weakness regarding parallel programs that do not make full use of the available processing power.
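If the sysstat tools are available, pidstat offers a more direct check than eyeballing top (the program name below is a placeholder): sampling the program's threads once per second shows whether several of them are on-CPU in the same interval.
# Sample per-thread CPU usage of a running program once per second.
# Several threads showing non-trivial %CPU in the same interval means the
# work really is running in parallel, not just migrating between cores.
pidstat -t -p "$(pgrep -o my_program)" 1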
I submit jobs using headless NetLogo to a HPC server by the following code:
#!/bin/bash
#$ -N r20p
#$ -q all.q
#$ -pe mpi 24
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/corrected-rk4-20presults.nlogo \
--experiment test \
--table /home/abhishekb/csvresults/corrected-rk4-20presults.csv
Below is the snapshot of a cluster queue using:
qstat -g c
I wish to know whether I can increase the CQLOAD for my simulations, and what it signifies. I couldn't find a clear explanation online.
CPU USAGE CHECK:
qhost -u abhishekb
When I run the BehaviorSpace experiment on my PC through the GUI, assigning high priority to the task makes it use nearly 99% of the CPU, which makes it run faster. I wish to accomplish the same here.
A typical HPC environment is designed to run only one MPI process (or OpenMP thread) per CPU core, which therefore has access to 100% of the CPU time, and this cannot be increased further. In contrast, on a classical desktop/server machine, a number of processes compete for CPU time, and it is indeed possible to increase the performance of one of them by setting the appropriate priorities with nice.
It appears that CQLOAD is the mean load average for that computing queue. If you are not using all the CPU cores in it, it is not a useful indicator. Besides, even the load average per core for your runs just reflects the efficiency of the code on this HPC cluster. For instance, a value of 0.7 per core would mean that the code spends 70% of its time doing calculations, while the remaining 30% is probably spent waiting to communicate with the other computing nodes (which is also necessary).
Bottom line, the only way you can increase the CPU percentage use on an HPC cluster is by optimising your code. Normally though, people are more concerned about the parallel scaling (i.e. how the time to solution decreases with the number of CPU cores) than with the CPU percentage use.
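Measuring that scaling is mostly a matter of timing the same experiment at different slot counts. A rough sketch, assuming the submission script above is saved as run_netlogo.sh with the #$ -pe line removed so the slot count can be set per submission:
# Submit the same experiment with different slot counts, then compare
# wall-clock times from the scheduler's accounting records
for n in 4 8 16 24; do
  qsub -N "scaling_${n}" -q all.q -pe mpi "$n" run_netlogo.sh
done

# Once a job has finished, its accounting record includes the wall-clock time
qacct -j scaling_4 | grep -E 'jobname|slots|ru_wallclock'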
1. CPU percentage load
I agree with @rth's answer regarding trying to use Linux job priority / renice to increase the CPU percentage - it's
almost certainly not going to work
and, (as you've found)
you're unlikely to be able to do it, as you won't have superuser privileges on the nodes (it's pretty unlikely you can even log into the worker nodes - probably only the head node)
The CPU usage of your model as it runs is mainly a function of your code structure - if it runs at 100% CPU locally, it will probably run like that on the node while it's running.
Here are some answers to the more specific parts of your question:
2. CQLOAD
You ask
CQLOAD (what does it signify?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that the scheduling engine for it is Sun's Grid Engine. The man pages are here (you can access them locally too - in particular by typing man qstat).
If you search through for qstat -g c, you will see the outputs described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to reflect each host's different significance the number of configured slots is used as a weighting factor when determining cluster queue load. Please note that only hosts with a np_load_value are considered for this value. When queue selection is applied only data about selected queues is considered in this formula. If the load value is not available at any of the hosts '-NA-' is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors are in the queue. Your output screenshot above shows 0.84, so the average load on the (in-use) processors in all.q is 84%. This doesn't seem too low.
3. Number of nodes reserved
In a related question, you state colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder if the real problem here is that you're reserving a lot of nodes (even if just for a short time) for a job that they can see could work with fewer.
You might want to experiment with using fewer nodes (unless your runs then become too slow) - that is achieved by altering the line #$ -pe mpi 24 - maybe take the number 24 down. You can work out how many nodes you need (roughly) by timing how long 1 model run takes on your computer and then using:
N = ((time to run 1 job) * number of runs in experiment) / (time you want the run to take)
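For example (illustrative numbers only): if one run takes 10 minutes on your machine, the experiment has 48 runs, and you want the results back in about 2 hours, then N = (10 * 48) / 120 = 4.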
So you want to make your program run faster on Linux by giving it a higher priority than all other processes?
In that case you have to modify something called the program's niceness. This is normally done by invoking the command nice when you first start the program, or the command renice while the program is already running. A process can have a niceness from -20 to 19 (inclusive), where lower values give the process a higher priority. For security reasons, you can only decrease a process's niceness if you are the super user (root).
So if you want to make a process run with higher priority then from within bash do
[abhishekb@hpc ~]$ start_process &
[abhishekb@hpc ~]$ jobs -x sudo renice -n -20 -p %+
Or just use the last command and replace the %+ with the process id of the process you want to increase the priority for.
I hope to find some help here.
We are using node, mongodb, supertest, mocha and spawn in our test environment.
We've tried to improve our mocha test environment to run tests in parallel, because our test cases now take almost 5 minutes to run (600 cases).
We spawn, for example, 4 processes and run the tests in parallel. This is very successful, but only on Linux.
On my Mac the tests still run very slowly. It seems like the different processes are not really running in parallel.
test times:
macosx:
- running 9 tests in parallel: 37s
- running 9 tests not in parallel: 41s
linux:
- running 9 tests in parallel: 16s
- running 9 tests not in parallel: 25s
macosx early 2011:
10.9.2
16gb ram
core i7 2.2ghz
physical processors: 1
cores: 4
threads: 8
linux dell:
ubuntu
8gb ram
core i5-2520M 2.5ghz
physical processors: 1
cores: 2
threads: 4
my questions are:
are there any tips for improving process performance on macosx?
(apart from ulimit, launchctl (maxfiles)?)
why do the tests run so much faster on linux?
thanks,
kate
I copied my comment here, as it describes most of the answer.
Given the specs of your machines, I doubt the extra 8 gigs of ram and processing power affect things much, especially given Node's single-process model and that you're only launching 4 processes. I doubt that for the Linux machine 8 gigs, 2.5 GHz, and 4 threads is a bottleneck at all. As such, I would actually expect the time the processor spends running your tests to be roughly equivalent for both machines. I'd be more interested in your disk I/O performance, given you're running mongo. Your disk I/O has the most potential to slow stuff down. What are the specs there?
Your specs:
macosx: Toshiba 5400RPM 8MB
linux: Seagate 7200 rpm 16mb
Your Linux drive is significantly faster (1.33x) than your Mac drive, as well as having a significantly larger cache. For database-based applications, hard drive performance is crucial. Most of the time spent in your application will be waiting for I/O, particularly with Node's single-process method of doing work. I would suggest this as the culprit for 90% of the performance difference, and chalk the rest up to the fact that Linux probably has less going on in the background, further exacerbating your Mac disk drive's performance issues.
Furthermore, launching multiple node processes isn't likely to help this. Since processor time isn't your bottleneck, launching too many processes will just slow your disk down. Another sign that this is the problem is that the performance of multiple processes on Linux is proportionally better than the performance of multiple processes on the Mac. One process is nearly maxing out the performance of your 5400 rpm drive, so you don't see a significant performance increase from running multiple processes, whereas the multiple Linux node processes use the disk to its full potential. You would likely see diminishing returns on the Linux OS if you were to launch many more processes, unless of course you were to upgrade to an SSD.
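One way to sanity-check that hypothesis (Linux shown, assuming the sysstat package; the macOS iostat output differs, and the test command is a placeholder for however you run the suite) is to watch disk utilisation while the tests run:
# In one terminal, watch per-device utilisation once per second.
# If %util sits near 100% while CPU stays low, the disk is the bottleneck.
iostat -x 1

# In another terminal, run the test suite as usual, e.g.
npm test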