I am implementing a parallel Node.js application to compute spatial joins. I am running on a MacBook Pro with an i7 processor with 4 cores (8 logical after hyperthreading).
To make a fair comparison, I am running the exact same operation on all threads. I am attaching Activity Monitor screenshots for reference.
One process
Completion time: 11.25 seconds
Two processes
Max completion time: 11.53 seconds
Four processes
Max completion time: 14.08 seconds
Eight processes
Max completion time: 24.98 seconds
My question is: given that in none of the cases memory is full, why do the extra spawned processes take such a huge performance hit when the machine is technically equipped to run 8 processes independently?
Thanks in advance.
Related
I'm trying to understand the usage of CPU cores with regard to concurrent threads and processes. Please see the below questions:
Assume I have 2 CPU cores. When there are 2 processes running, each process has only 1 thread. Are the two processes using the 2 cores?
Assume I have 2 CPU cores. When there is 1 process running, which has 2 threads. Are the two threads using the 2 cores?
Assume I have 2 CPU cores. When there are 2 processes running, each process has 2 threads. How are the two cores used by those processes and threads?
How do I calculate the maximum real concurrent execution given the CPU cores? What other factors should I take into account?
1, 2: Quite likely, but not definitely. A portion of the system software determines what runs where. It would be unlikely to keep a process or thread waiting for CPU attention when there is a core that is otherwise idle, but it isn't absolute.
Most processing involves some sort of transfer to and from a device, the network, etc. That typically forces a period of inactivity waiting for the transfer to complete, and during this inactivity another process or thread can run on that CPU. So, if a given process is 30% CPU time and 70% I/O time, then each run only needs the CPU about a third of the time, and I can run roughly 1 / 0.3 ≈ 3 of them concurrently on a single CPU without degrading performance.
3, 4: As the paragraph above implies, depending upon the workload, there could be any distribution of the threads among the CPUs. If the threads were all compute bound (100% CPU), most operating systems switch between them at a granularity small enough that all remain lively, and large enough that the switching has minimal impact on them.
This scheduling may take other notions into consideration, such as data affinity. Recently touched data is likely to remain in the CPU cache when a thread relinquishes it, so the next time that thread is scheduled, it is best placed back on that same CPU, to preserve the effort already spent warming the cache for it. The scheduler might also assume that two threads of one process (address space) are more likely to share data, and so prefer the same CPU for them.
4: Depending upon your system, there are likely to be many performance analysis tools available. Top, on UNIX-inspired systems, is a simple tool which gives system-wide utilization information, and the simple tool time shows how much time a process spent on a CPU versus real-world (wall-clock) time. If you run each of your tasks sequentially, noting the CPU time each takes, and then time them running concurrently, the ratio between these CPU times indicates the scaling factor of your concurrent app. Note that real-world time can be misleading because of I/O overlap.
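As a rough sketch of that procedure in bash (./task and its arguments are placeholders for whatever program is being measured):
# CPU time of a single run by itself: note the user + sys figures
time ./task arg1
# Four runs at once (bash's time keyword can time the whole subshell)
time ( ./task arg1 & ./task arg2 & ./task arg3 & ./task arg4 & wait )
# The concurrent user+sys is the total over all four tasks. If it is close to
# 4x the single-run figure and the real (wall-clock) time stays near the
# single-run real time, the tasks scale well; if user+sys inflates well beyond
# 4x, or real time grows toward 4x, the tasks are contending for a shared
# resource (cores, caches, memory bandwidth).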
I have encountered a weird behavior with my algorithm/cpu, I was wondering what could be causing this.
CPU that I am using: AMD 2990WX 32c/64t, OS: Ubuntu 18.04LTS with 4.15.0-64-generic kernel.
The algorithm (Julia 1.0.3):
using Distributed  # provides @distributed (Julia >= 0.7)

@sync @distributed for var in range(0.1, step=0.1, stop=10.0)
    res = do_heavy_stuff(var)  # solves a differential equation;
                               # basically, multiplying 200x200 matrices many times
    save(filename, "RES", res)
end
The function do_heavy_stuff(var) takes ~3 hours to solve on a single CPU core.
When I launch it in parallel with 10 processes (julia -p 10 my_code.jl) it takes ~4 hours for each parallel loop, meaning every 4 hours I get 10 files saved. The slowdown is expected, as the CPU frequency goes down from 4.1 GHz to 3.4 GHz.
If I launch 3 separate instances with 10 processes each, so that total CPU utilization is 30 cores, it still takes ~4 hours for one loop cycle, meaning I get 30 runs completed and saved every 4 hours.
However, if I run 2 instances (one with a nice value of 0, another with a nice value of +10) with 30 processes each at once (julia -p 30 my_code.jl), I see (using htop) that CPU utilization covers 60(+) threads, but the algorithm becomes extremely slow (after 20 hours, still zero files saved). Furthermore, I see that the CPU temperature is abnormally low (~45C instead of the expected 65C).
From this information I can guess that using (almost) all threads of my CPU makes it do something useless that eats up CPU cycles, while no floating point operations are being done. I see no I/O to the SSD, and I use only half of my RAM.
I launched mpstat (mpstat -A: https://pastebin.com/c19nycsT) and I can see that all of my cores are just chilling in the idle state, which explains the low temperature. However, I still don't understand what exactly the bottleneck is. How do I troubleshoot from here? Is there any way to see (without touching hardware) whether the problem is RAM bandwidth or something else?
EDIT: It came to my attention that I was using mpstat wrong. Apparently mpstat -A gives CPU stats since the computer was booted, while what I needed were short-interval results, which can be obtained with mpstat -P ALL 2. Unfortunately, I only learned this after I killed the code in question, so I have no real mpstat data. However, I am still interested: how would one troubleshoot such a situation, where the cores seem to be doing something but results are not showing up? How do I find the bottleneck?
Since you are using multiprocessing, there are 2 most likely reasons for the observed behavior:
Long delays on I/O. When you are processing lots of disk data or reading data from the network, your processes naturally stall. In this case CPU utilization can be low combined with long execution times.
High variance of execution time for do_heavy_stuff. This variance could arise from unstable I/O or from different model parameters resulting in different execution times. To see why this is a problem, you need to understand how @distributed shares the workload among worker processes: each worker gets an equal share of the for loop's iterations. For example, if you have 4 workers, the first one gets var in the range 0.1:0.1:2.5, the second one 2.6:0.1:5.0, and so on. Now, if some of the var values result in heavy tasks, the first worker might get 5 hours of work while the other workers get 1 hour each. This means that @sync completes after 5 hours with only one CPU actually working the whole time.
Looking at your post I would strongly bet on the second reason.
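To check which of these is happening, it helps to sample per-core activity over a short window rather than since boot (a minimal sketch, assuming the sysstat tools are installed; <pid> is a placeholder for the Julia master process):
# Per-core utilization, 2-second samples, 5 reports; look for cores stuck in %idle or %iowait
mpstat -P ALL 2 5
# Per-thread CPU usage of the suspect process, sampled every 2 seconds
pidstat -t -p <pid> 2 5
# Run queue length (r), blocked processes (b), and swap activity
vmstat 2 5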
I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one special application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100 ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing their (relatively short) computation, and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each; it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100 ms, each doing only a short computation (well below 1 ms of CPU time each).
Now coming to the strange effect: if I look at the CPU percentage in top, it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts to this process, so 250% would mean the process uses 2.5 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process onto a single virtual CPU, the CPU percentage drops to 80%. The application's internal accounting tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses fewer CPU resources. How can that be? Can I really trust the kernel's CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU percentages.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: the process is memory bound. Hence the limiting resource is not the CPU but memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly has the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just quite slowly. All my other observations are consistent with this.
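For reference, the same memory-bound signature can also be checked without VTune, using Linux perf (a sketch; <pid> is a placeholder for the process being measured):
# Sample hardware counters of the running process for 10 seconds.
# instructions / cycles gives the IPC; a low IPC together with a high
# cache-miss rate is the usual signature of a memory-bound workload.
perf stat -e cycles,instructions,cache-references,cache-misses -p <pid> -- sleep 10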
Summary
I am trying to understand the limits of my compute resources when performing multiple simulations. My task is trivial in terms of parallelisation - I need to run a large number of simple independent simulations, i.e. each simulation program does not rely on another for information. Each simulation has roughly the same running time. For this purpose I have created an experiment that is detailed below.
Details
I have two shell scripts located in the same directory.
First script called simple:
#!/bin/bash
# Simple Script
echo "Running sleep with arg= $1 "
sleep 5s
echo "Finished sleeping with arg= $1"
Second script called runall:
#!/bin/bash
export PATH="$PATH:./"
# Fork off a new process for each program by running it in the background.
# Run N processes at a time and wait until all of them have finished
# before executing the next batch. This is sub-optimal if the running
# time of each process varies significantly.
# Note: if the total number of processes is not divisible by the allotted pool size, something weird happens.
echo "Executing runall script..."
for ARG in $(seq 600); do
    simple $ARG &
    NPROC=$(($NPROC+1))
    if [ "$NPROC" -ge 300 ]; then
        wait
        echo "New batch"
        NPROC=0
    fi
done
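(As an aside, a rolling pool avoids the stall noted in the comments; a sketch that needs bash 4.3 or newer for wait -n, which is newer than the /bin/bash shipped with macOS:)
#!/bin/bash
# Rolling-pool variant: start a new job as soon as any running one exits,
# instead of draining a whole batch before refilling.
MAXJOBS=8
for ARG in $(seq 600); do
    ./simple "$ARG" &
    while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
        wait -n   # block until some background job finishes
    done
done
wait   # let the last few jobs finish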
Here are some specs on my computer (MAC OS X):
$ ulimit -u
709
$ sysctl hw.ncpu
hw.ncpu: 8
$ sysctl hw.physicalcpu
hw.physicalcpu: 4
From this I interpret that I have 709 processes at my disposal and 8 processor cores available.
However when I execute $ ./runall I eventually end up with:
...
Running sleep with arg= 253
Running sleep with arg= 254
Running sleep with arg= 255
Running sleep with arg= 256
Running sleep with arg= 257
Running sleep with arg= 258
./runall: fork: Resource temporarily unavailable
Running sleep with arg= 259
./simple: fork: Resource temporarily unavailable
Running sleep with arg= 260
$ Running sleep with arg= 261
Finished sleeping with arg= 5
Finished sleeping with arg= 7
Finished sleeping with arg= 4
Finished sleeping with arg= 8
Finished sleeping with arg= 3
...
SO:
Question 1
Does this mean that out of the 709 processes available, only 258 can be dedicated to my runall program, with the rest presumably being used by other processes on my computer?
Question 2
I substituted the simple script with something that does more than just sleep (it reads a file and processes the data in the file to create a graph), and now I start to notice some differences. Using $ time ./runall I can get the total run time. Before, when calling simple for up to 258 processes, I always got a run time of about 5 s:
real 0m5.071s
user 0m0.184s
sys 0m0.263s
i.e., running many simulations in parallel gives the same runtime as a single simulation. However, now that I am calling a more complex program instead of simple, I get a longer total run time than the single-simulation time (a single simulation takes 1.5 s, whereas 20 simulations in parallel take about 8.5 s). How do I explain this behavior?
Question 3
I'm not sure how the number of processor cores is related to parallel performance. Since I have 8 cores at my disposal, I thought I would be able to run 8 programs in parallel in the same time it would take me to run just one. I'm not sure about my reasoning on this...
If you have 8 cpu threads available, and your programs consume 100% of a single CPU, it does not make sense to run more than 8 programs at a time.
If your programs are multi-threaded, then you may want to have fewer than 8 processes running at a time. If your programs occasionally use less than 100% of a single CPU (perhaps if they're waiting for IO), then you may want to run more than 8 processes at a time.
Even if the process limit for your user is extremely high, other resources could be exhausted much sooner - for instance, RAM. If you launch 200 processes and they exhaust RAM, then the operating system will respond by satisfying requests for RAM by swapping out some other process's RAM to disk; and now the computer needlessly crawls to a halt because 200 processes are waiting on IO to get their memory back from disk, only to have it be written out again because some other process wants to run. This is called thrashing.
If your goal is to perform some batch computation, it does not make sense to load the computer any more than enough processes to keep all CPU cores at 100% utilization. Anything more is waste.
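A simple way to hold the line at the core count, reusing the simple script from the question (a sketch; both the BSD xargs on macOS and GNU xargs support -P):
# Run the 600 jobs at most 8 at a time (8 logical CPUs, per sysctl hw.ncpu);
# xargs starts a new job as soon as one finishes, so there is no batch stall.
seq 600 | xargs -n 1 -P "$(sysctl -n hw.ncpu)" ./simple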
Edit - Clarification on terminology.
A single computer can have more than one CPU socket.
A single CPU can have more than one CPU core.
A single CPU core can support simultaneous execution of more than one stream of instructions. Hyperthreading is an example of this.
A stream of instructions is what we typically call a "thread", either in the context of the operating system, processes, or in the CPU.
So I could have a computer with 2 sockets, with each socket containing a 4-core CPU, where each of those CPUs supports hyperthreading and thus supports two threads per core.
Such a computer could execute 2 * 4 * 2 = 16 threads simultaneously.
A single process can have as many threads as it wants, until some resource is exhausted - raw RAM, internal operating system data structures, etc. Each process has at least one thread.
It's important to note that tricks like hyperthreading may not scale performance linearly. An unhyperthreaded CPU core contains enough parts to execute a single stream of instructions all by itself; aside from memory access, it doesn't share anything with the other cores, and so performance can scale linearly.
However, each core has a lot of parts, and during some types of computation some of those parts are inactive while others are active; during other types of computation, the opposite can be true. Doing a lot of floating-point math? Well, then the integer math unit in the core might be idle. Doing a lot of integer math? Well, then the floating-point math unit might be idle.
Hyperthreading seeks to increase performance, even if only a little bit, by exploiting these temporarily unused units within a core: while the floating-point unit is busy, schedule something that can use the integer unit.
...
What matters to the operating system when it comes to scheduling is how many threads across all processes are runnable. If I have one process with 3 runnable threads, a second process with one runnable thread, and a third process with 10 runnable threads, then the OS wants to run a total of 3 + 1 + 10 = 14 threads.
If there are more runnable program threads than there are CPU execution threads, then the operating system will run as many as it can, and the others will sit there doing nothing, waiting. Meanwhile, those programs and those threads may have allocated a bunch of memory.
Let's say I have a computer with 128 GB of RAM and CPU resources such that the hardware can execute a total of 16 threads at the same time. I have a program that uses 2 GB of memory to perform a simple simulation, that program creates only one thread for its execution, and each instance needs 100 s of CPU time to finish. What would happen if I were to try to run 16 instances of that program at the same time?
The programs would together allocate 2 GB * 16 = 32 GB of RAM to hold their state, and then begin performing their calculations. Since each program creates a single thread, and there are 16 CPU execution threads available, every program can run on the CPU without competing for CPU time. The total time we'd need to wait for the whole batch to finish would be 100 s: 16 processes / 16 CPU execution threads * 100 s.
Now what if I increase that to 32 programs running at the same time? Well, we'll allocate a total of 64GB of RAM, and at any one point in time, only 16 of them will be running. This is fine, nothing bad will happen because we've not exhausted RAM (and presumably any other resource), and the programs will all run efficiently and eventually finish. Runtime will be approximately twice as long at 200s.
Ok, now what happens if we try to run 128 programs at the same time? We'll run out of memory: 128 * 2 = 256 GB of RAM, more than double what the hardware has. The operating system will respond by swapping memory to disk and reading it back in as needed, but it'll have to do this very frequently, and it'll have to wait for the disk.
If you had enough ram, this would run in 800s (128 / 16 * 100). Since you don't, it's very possible it could take an order of magnitude longer.
Your questions are a little confusing. But here's an attempt to explain some of it:
Question 1 Does this mean that out of the 709 processes available, only 258 can be dedicated to my runall program, the rest remaining probably being used by other processes on my computer?
As the ulimit manpage explains, -u tells you how many processes you can start as a user. As you know, every process on Unix has a uid (there are some nitty-gritty details here like euid, setuid, etc.) which refers to the user on the system that owns that process. What -u tells you is the number of processes you (since you are logged in and executing the ulimit command) can start and simultaneously run on the computer. Note that once a process with pid p exits, the OS is free to recycle that number p for some other process.
Question 2
The answer to question 2 (which seems to be your main confusion) can only be given when we understand what the time command actually reports. Understanding the output of the time command needs some experimentation. For instance, when I run your experiment (on a comparable Mac) with 100 processes (i.e. $(seq 100)), I get:
./runall.sh 0.01s user 0.02s system 39% cpu 0.087 total
This means that, on average, only 39% of one CPU was busy over the 0.087 s of wall-clock time. Roughly speaking, the wall-clock time multiplied by the CPU utilization gives the running time (the user time that your code needs plus the system time that system calls need to execute). Your simple script is rather too simple: it doesn't make the CPUs do any work, it just makes the sleep system call!
Compare this with a more realistic example: finding a subset of a given set with a given sum. This (Java) program, on the same computer, produces the following times:
java SubsetSum 38.25s user 1.09s system 510% cpu 7.702 total
This means that the total wall-clock time is about 7.7 seconds, but all the available cores are stressed extremely heavily to execute this program. On a 4-CPU (8 logical CPU) machine, I get over 500% CPU utilization! (And you can see that the wall-clock time (7.7) multiplied by the CPU utilization (5.1), i.e. 39.27, is roughly equal to the total CPU time (38.25 + 1.09 = 39.34).)
Question 3
Well, the way to parallelize your programs is to find the parallelizable activity in solving the problem. You have 8 cores available, and the OS decides how to allocate them to the processes that ask for them. But what if a process goes into a BLOCKED state (blocked on I/O)? Then the OS schedules that process out and schedules something else in. A simplistic view like "8 cores => 8 programs at the same time" is hardly true when you take into account the way scheduling works.
I understand that only the threads in the running state actually consume CPU, but as shown below by top on the QNX platform, the total CPU usage is 99.3%, which is the cumulative usage of four threads, of which only one is in the running state.
Any idea why more CPU is reported as consumed than what the running threads account for?
CPU states: 99.3% user, 0.6% kernel
CPU 0 Idle: 0.0%
CPU 1 Idle: 0.0%
Memory: 0 total, 1G avail, page size 4K
PID TID PRI STATE HH:MM:SS CPU COMMAND
704585 11 10 Run 0:01:52 24.82% App
704585 10 10 Rdy 0:01:52 24.68% App
704585 13 10 Rdy 0:01:52 24.53% App
704585 16 10 Rdy 0:01:49 24.19% App
The threads that are ready were in a running state when they consumed the CPU. Given the very similar CPU values, I'll bet that all of those threads are always either ready to run or running.
Threads in a RUNNING state are the only ones currently consuming CPU at the current instant, but those in a READY state are those that are eligible to consume CPU over an interval of time.
Your processor has two cores, so up to two threads can be RUNNING at once. Any number can be READY (i.e. unblocked and runnable, but not necessarily currently executing on a core), and those will be run according to priority and the scheduling method that applies.
Since you are querying the process manager for thread states, one of those two cores will, at that instant, obviously be running a thread in the process manager. The other core will still be running an available READY thread from amongst the set of unblocked threads in the system, again based on priority and the scheduling algorithm. This is why just one of your four threads shows as running, while the others are merely READY. That the other three threads are READY means that, assuming they are at the same priority as your currently running thread, the scheduler will run those threads on the available cores according to the scheduling algorithm you are using, as long as no higher-priority threads are or become READY.
The thread state reflects the instantaneous state at the moment the process manager is asked to provide thread-state information from the kernel, while the usage statistic reflects activity over time, not an instantaneous state. Over a brief period of time, if you have four threads all READY at the same priority and scheduled round-robin, you will see close to 25% utilization attributable to each of the four threads. But only two can be RUNNING at any one instant if you have only two cores, and if you are busy actually getting the information about thread states, then one of those two available cores is busy grabbing that info, so you will only ever see up to one other thread in a RUNNING state.
If you are using QNX I suggest you read and memorize the System Architecture manual (http://www.qnx.com/download/feature.html?programid=26183). Chapter 2's discussion of the thread life cycle and scheduling addresses this question.
Hope that helps.