The hotspots view (CPU view) shows incorrect time units for inherent times. I tried profiling an application that copies a physical file 200 times concurrently. The application completed in 1.2 seconds, while the JProfiler snapshot shows a particular method taking 122 seconds. That's strange.
Has anyone worked with JProfiler?
This looks OK. JProfiler shows elapsed times, not CPU times. By default, the CPU views cumulate all threads, so with 200 concurrent threads the displayed times can be up to around 200 times the measurements for a single thread.
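To put rough numbers on it: with 200 copy threads active during a 1.2-second run, the cumulated elapsed time over all threads can reach up to about 200 * 1.2 s = 240 s, so 122 s of inherent time attributed to a single method is entirely plausible.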
You can use the thread selector at the top to switch to a single thread, then you will see times that correspond to the total run time.
Related
I have encountered some weird behavior with my algorithm/CPU, and I was wondering what could be causing it.
The CPU I am using: AMD 2990WX 32c/64t; OS: Ubuntu 18.04 LTS with the 4.15.0-64-generic kernel.
The algorithm (Julia 1.0.3):
using Distributed   # needed for @sync / @distributed in Julia 1.x

@sync @distributed for var in range(0.1, step=0.1, stop=10.0)
    res = do_heavy_stuff(var)   # solves a differential equation,
                                # basically multiplying 200x200 matrices many times
    save(filename, "RES", res)
end
Function do_heavy_stuff(var) takes ~3 hours to solve on a single CPU core.
When I launch it in parallel with 10 processes (julia -p 10 my_code.jl), it takes ~4 hours for each parallel loop, meaning every 4 hours I get 10 files saved. The slowdown is expected, as the CPU frequency goes down from 4.1 GHz to 3.4 GHz.
If I launch 3 separate instances with 10 processes each, for a total utilization of 30 cores, it still takes ~4 hours for one loop cycle, meaning I get 30 runs completed and saved every 4 hours.
However, if I run 2 instances at once with 30 processes each (julia -p 30 my_code.jl), one with a nice value of 0 and the other with a nice value of +10, I see (using htop) that CPU utilization is 60(+) threads, but the algorithm becomes extremely slow (after 20 hours, still zero files saved). Furthermore, I see that the CPU temperature is abnormally low (~45 °C instead of the expected ~65 °C).
From this information I can guess that using (almost) all threads of my CPU makes it do something useless that eats up CPU cycles, but no floating-point operations are being done. I see no I/O to the SSD, and only half of the RAM is in use.
I launched mpstat (mpstat -A): https://pastebin.com/c19nycsT and I can see that all of my cores are just chilling in the idle state, which explains the low temperature. However, I still don't understand what exactly the bottleneck is. How do I troubleshoot from here? Is there any way to see (without touching hardware) whether the problem is RAM bandwidth or something else?
EDIT: It came to my attention that I was using mpstat wrong. Apparently mpstat -A gives CPU stats accumulated since boot, while what I needed were short-interval results, which can be obtained with mpstat -P ALL 2. Unfortunately, I only learned this after I had killed the code in question, so I have no real mpstat data. However, I am still interested: how would one troubleshoot such a situation, where the cores seem to be doing something but results are not appearing? How do I find the bottleneck?
Since you are using multiprocessing, there are two likely reasons for the observed behavior:
Long delays on I/O. When you are processing lots of disk data or reading data from the network, your processes are naturally stalled. In this case, CPU utilization can be low combined with long execution times.
High variance of execution time for do_heavy_stuff. This variance could arise from unstable I/O or from different model parameters resulting in different execution times. Why this is a problem requires understanding how @distributed shares the workload among worker processes: each worker gets an equal share of the for loop's range. For example, if you have 4 workers, the first one gets var in the range 0.1:0.1:2.5, the second one 2.6:0.1:5.0, and so on. Now, if some of the var values result in heavy tasks, the first worker might get 5 h of work while the other workers get 1 h of work. This means that @sync only completes after 5 hours, with only one CPU actually working for most of that time.
Looking at your post, I would strongly bet on the second reason. If that is the case, one possible work-around is sketched below.
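The sketch uses pmap, which hands out one var at a time to whichever worker is free, instead of pre-splitting the range into equal chunks the way @distributed does. This is only a minimal sketch: the body of do_heavy_stuff below is a placeholder (I don't know your actual solver), and addprocs(10) just mirrors julia -p 10.

using Distributed
addprocs(10)                     # same worker count as `julia -p 10`

@everywhere function do_heavy_stuff(var)
    # placeholder for the real solver: a few 200x200 matrix multiplications
    A = rand(200, 200)
    for _ in 1:10
        A = A * rand(200, 200)
    end
    return var * sum(A)
end

# one `var` per free worker, so a handful of expensive parameter values
# no longer pin a whole pre-assigned chunk onto a single worker
results = pmap(range(0.1, step=0.1, stop=10.0)) do var
    do_heavy_stuff(var)
end

A cheap way to test the hypothesis first is to log the start and end times of do_heavy_stuff on each worker and check whether one worker ends up with all the long runs.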
I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100 ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing their (relatively short) computation, and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each; it's an i7-3612QE), only 8 threads can really run at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100 ms, each doing only a short computation (way below 1 ms of CPU time each).
Now to the strange effect: if I look at the CPU percentage in top, it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process onto a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses fewer CPU resources. How can that be? Can I really trust the kernel's CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU percentages.
Not sure if this plays a role: I have used the profiler VTune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it is memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: the process is memory bound, so the limited resource is not the CPU but the memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly has the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just very slowly. All my other observations are consistent with this.
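As a rough illustration using the numbers above (the exact figures are just for the sake of argument): if the memory system can only feed roughly 0.8 cores' worth of useful work, pinning the process to one virtual CPU shows ~80% CPU, while letting the threads spread over three cores shows ~250%, because each of those cores now spends most of its cycles stalled on RAM and those stall cycles are still accounted as CPU time for the running thread. The work completed per trigger stays the same either way, which matches the internal accounting.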
I am trying to find simple examples of what exactly the wait time and execution time are when determining the size of a thread pool. According to Brian Goetz:
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
I really didn't understand what he meant by input/output. Who is doing the I/O tasks?
Imagine a task that reads some data from disk. What actually happens:
Open file.
Wait for the (spinning) disk to wake from sleep, position the head at the right spot, and let the desired blocks pass underneath the head until all bytes have arrived in a buffer.
Read from the buffer.
The whole task takes 0.1 s to complete. Of this, 10 percent is spent on steps 1 and 3 and the remaining 90 percent on step 2. So 0.01 s is "working time" (service time) and 0.09 s is "wait time" spent waiting for the disk.
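Plugging that example into Goetz's formula: WT/ST = 0.09 s / 0.01 s = 9, so on, say, a 4-processor machine (the processor count here is just an assumption for illustration) you would size the pool at roughly N * (1 + WT/ST) = 4 * (1 + 9) = 40 threads. At any moment about 36 of them are waiting for the disk while the remaining 4 keep all processors busy with the actual work.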
VTune's hotspots analysis reports that my program's execution time (elapsed time) was 60 seconds, of which only 10 seconds are reported as "CPU Time". I'm trying to figure out where the remaining 50 seconds were spent. Using Windows Process Monitor's File System Activity view, I see my program spent 5 seconds doing disk I/O. This still leaves 45 seconds unaccounted for.
There are two threads in my program and, according to VTune, one of those threads consumed 99% of the CPU time. I don't see how these two threads, given their execution profiles, could explain the lost time.
Any thoughts?
I am implementing a parallel NodeJS application to compute spatial joins. I am running on a MacBook Pro with an i7 processor with 4 cores (8 logical threads after hyperthreading).
To make a fair comparison, I am running the exact same operation on all threads. I am attaching Activity Monitor screenshots for reference.
One process: completion time 11.25 seconds
Two processes: max completion time 11.53 seconds
Four processes: max completion time 14.08 seconds
Eight processes: max completion time 24.98 seconds
My question is: given that memory is not full in any of these cases, why is there such a huge performance hit for the extra spawned processes, when the machine is technically equipped to run 8 processes independently?
Thanks in advance.