What is the load on the system - Linux

I have a Red Hat server where I can see that the load average on the system is 23 24 23 (1 min, 5 min, 15 min) using the top command. I can also see in /proc/cpuinfo that there are 24 processor entries (0-23). But in each processor entry the cpu cores value is 6, and the physical id is either 0 or 1.
I want to know if my system is overloaded. Can anyone please tell me?

It seems you have a system with two processors, each with 6 cores. Each core likely supports hyper-threading, so 2 x 6 x 2 = 24. In /proc/cpuinfo, top, etc., you will see each hyperthread: that's the number of processes or threads that your hardware can run in parallel.
The quick answer is that your system is probably not overloaded and that it processes a relatively stable amount of work over time (as the 1, 5 and 15 minute values are about the same). A rule of thumb is that the load average value should stay below the number of hyperthreads -- this is not, however, exactly true.
You'll find a more in-depth discussion here:
https://unix.stackexchange.com/questions/303699/how-is-the-load-average-interpreted-in-top-output-is-it-the-same-for-all-di
and here:
https://superuser.com/questions/23498/what-does-load-average-mean-on-unix-linux
and perhaps here:
https://linuxhint.com/load_average_linux/
However, please keep in mind that load average does not tell you everything about your system, though it is usually a pretty good indicator in my experience. You'd have to check many other factors to determine whether the system is overloaded, such as memory pressure, I/O wait times and I/O bandwidth utilization. It also depends on what kind of processing the system is performing.
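To make that rule of thumb concrete, here is a minimal Python sketch (my own illustration, not part of the original answer) that reads /proc/loadavg on a Linux box and divides each value by the number of logical CPUs (hyperthreads) reported by the OS; ratios staying under 1.0 roughly correspond to "not overloaded" in this rule-of-thumb sense:

```python
# Minimal sketch, assuming a Linux system with /proc mounted.
import os

def load_per_logical_cpu():
    with open("/proc/loadavg") as f:
        one, five, fifteen = (float(x) for x in f.read().split()[:3])
    ncpu = os.cpu_count()  # logical CPUs (hyperthreads), e.g. 24 in the question
    return {"1min": one / ncpu, "5min": five / ncpu, "15min": fifteen / ncpu}

if __name__ == "__main__":
    for window, ratio in load_per_logical_cpu().items():
        status = "over" if ratio > 1.0 else "within"
        print(f"{window}: {ratio:.2f} of logical-CPU capacity ({status} the rule-of-thumb limit)")
```

On the 24-hyperthread machine in the question, a steady load of about 23 sits right at that threshold, which is exactly why the other indicators mentioned above are worth checking as well.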

Related

Increasing number of threads decreases CPU performance

I have written a program that tries to do some calculations. It can be run on any number of cores.
Below is a breakdown of how many calculations are performed when it is run on 1, 2, 3 or 4 cores (on a laptop with 4 logical processors). The numbers in parentheses show calculations per thread/core. My question is: why does the per-thread performance drop so rapidly as the number of threads/cores increases? I don't expect the performance to double, but the improvement is significantly lower than that. I also observe the same issue when running 4 instances of the same program, each set up to run on one thread, so I know it's not an issue with the program itself.
The greatest improvement is going from 1 thread to 2; why is that?
# Threads | calculations/sec
1 | 87000
2 | 129000 (65000,64000)
3 | 135000 (46000,45000,44000)
4 | 140000 (34000,34000,34000,32000)
One interesting thing is that I can see exactly the same issue on Google Cloud Platform in Compute Engine: when I run the program with 16 threads on a 16-virtual-core machine, the performance of each core drops to around 8K states per second. However, if I run 16 instances of the same program, each instance does around 100K states/s. When I do the same test on 4 cores on my home laptop, I still see the same drop in performance whether I run 4 separate .exe instances or one with 4 threads. I was expecting to see the same behavior on GCP, but that's not the case. Is it because of virtual cores? Do they behave differently?
Example code that reproduces this issue: just paste it into your console application, but you need to refresh the stats about 20 times for the performance to stabilize (not sure why it fluctuates so much). In the code example you can see that running the app with 4 threads gives a significant performance hit compared to 4 instances with 1 thread each. I was hoping that enabling gcServer would solve the problem, but I did not see any improvement.
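No answer is included here, but as a generic illustration of how this kind of scaling can be measured, here is a hedged Python sketch (my own, not the poster's C# code; it uses processes rather than threads so the interpreter's GIL does not distort the numbers). The shape of the per-worker throughput curve is what matters, not the absolute figures:

```python
# Sketch: measure per-worker throughput of a CPU-bound loop as the number of
# worker processes grows, mirroring the table in the question.
import multiprocessing as mp
import time

def count_iterations(duration, result_queue):
    """Spin on trivial arithmetic for `duration` seconds and report the count."""
    end = time.perf_counter() + duration
    n, x = 0, 0.0
    while time.perf_counter() < end:
        x = (x + 1.0) * 0.5  # arbitrary work to keep a core busy
        n += 1
    result_queue.put(n)

def benchmark(workers, duration=2.0):
    queue = mp.Queue()
    procs = [mp.Process(target=count_iterations, args=(duration, queue))
             for _ in range(workers)]
    for p in procs:
        p.start()
    per_worker = [queue.get() / duration for _ in range(workers)]
    for p in procs:
        p.join()
    return sum(per_worker), per_worker

if __name__ == "__main__":
    for workers in (1, 2, 3, 4):
        total, per_worker = benchmark(workers)
        print(f"{workers} workers: {total:,.0f} iterations/s total; per worker:",
              [f"{c:,.0f}" for c in per_worker])
```

On a typical laptop, a flattening curve like the one in the question usually means the logical-processor count overstates the number of truly independent execution units (hyperthreads share a physical core), and that shared resources such as caches, memory bandwidth and thermal/turbo headroom are being divided among the workers.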

Estimate Core capacity required based on load?

I have a quad-core Ubuntu system. Say I see a load average of 60 over the last 15 minutes during peak time; the load average goes up to 150 as well.
This load generally happens only during peak time. Basically, I want to know whether there is any standard formula to derive the number of cores ideally required to handle a given load.
Objective:
If the load is 60, does that mean 60 tasks were in the queue on average at any point in time during the last 15 minutes? Adding CPUs could help me serve requests faster or save the system from hanging or crashing.
Linux load average (as printed by uptime or top) includes tasks in I/O wait, so it can have very little to do with CPU time that could potentially be used in parallel.
If all the tasks were purely CPU bound, then yes 150 sustained load average would mean that potentially 150 cores could be useful. (But if it's not sustained, then it might just be a temporary long queue that wouldn't get that long if you had better CPU throughput.)
If you're getting crashes, that's a huge problem that isn't explainable by high loads. (Unless it's from the out-of-memory killer kicking in.)
It might help to use vmstat or dstat to see how much CPU time is spent in user/kernel space when your load avg. is building up, or if it's probably mostly I/O.
Or of course you probably know what tasks are running on your machine, and whether one single task is I/O bound or CPU bound on an otherwise-idle machine. I/O throughput usually scales a bit positively with queue depth, except on magnetic hard drives, where that turns sequential reads/writes into seek-heavy workloads.
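Following up on the vmstat/dstat suggestion, here is a rough Python sketch (my own, offered as one way you might check this, not a known tool) that samples the aggregate cpu line in /proc/stat twice and shows whether the time is going to user/system CPU or to iowait while the load average builds up:

```python
# Rough sketch, assuming a Linux kernel exposing the usual /proc/stat fields.
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def read_cpu_ticks():
    with open("/proc/stat") as f:
        parts = f.readline().split()  # the aggregate "cpu" line
    return dict(zip(FIELDS, (int(x) for x in parts[1:1 + len(FIELDS)])))

def cpu_breakdown(interval=5.0):
    before = read_cpu_ticks()
    time.sleep(interval)
    after = read_cpu_ticks()
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values()) or 1
    return {k: 100.0 * v / total for k, v in deltas.items()}

if __name__ == "__main__":
    for name, pct in cpu_breakdown().items():
        print(f"{name:>8}: {pct:5.1f}%")
```

A high load average with most of the delta in iowait points at storage rather than a shortage of cores; mostly user/system time is the case where adding (or faster) CPUs could plausibly help.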

Linux acceptable load average

I have a dedicated Linux server machine (8 cores, 8 GB RAM) where I run some crawler PHP scripts. The load on the system ends up being around 200, which sounds like a lot. Since I am not using the machine to host content, what could be the side effects of such a high level of load for the purposes stated above?
Machines were made to work, so there is no issue with a high load average per se. However, a high load average is often an indicator of a performance issue. Such investigation is usually application specific, but here is a very general guideline:
Since load average is a combined metric (CPU, I/O, etc.), you want to examine each component separately. I would start by making sure the machine is not thrashing, by checking swap usage (vmstat comes in handy) and disk performance (using iostat). You may also check whether your operations are CPU-intensive.
You should read the load average as a three-component value (the 1-minute, 5-minute and 15-minute load, respectively).
Take a look at this example taken from Wikipedia:
For example, one can interpret a load average of "1.73 0.60 7.98" on a single-CPU system as:
during the last minute, the system was overloaded by 73% on average (1.73 runnable processes, so that 0.73 processes had to wait for a turn for a single CPU system on average).
during the last 5 minutes, the CPU was idling 40% of the time on average.
during the last 15 minutes, the system was overloaded 698% on average (7.98 runnable processes, so that 6.98 processes had to wait for a turn for a single CPU system on average).
Full article
Please note that this value depends on the resources of your machine.
Cheers!
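As a small illustration of the interpretation quoted above (my own sketch, not part of the answer), normalising the load by the number of CPUs makes the comparison explicit; the last line plugs in the 8-core crawler machine from this question:

```python
# Sketch: express a load-average sample relative to the number of CPUs.
def load_vs_capacity(load, ncpu):
    ratio = load / ncpu
    if ratio <= 1.0:
        return f"load {load} on {ncpu} CPU(s): about {(1 - ratio) * 100:.0f}% idle headroom"
    return f"load {load} on {ncpu} CPU(s): about {(ratio - 1) * 100:.0f}% over capacity"

print(load_vs_capacity(1.73, 1))  # the 1-minute value from the Wikipedia example
print(load_vs_capacity(0.60, 1))  # the 5-minute value: ~40% idle
print(load_vs_capacity(7.98, 1))  # the 15-minute value: ~698% over capacity
print(load_vs_capacity(200, 8))   # roughly 25 runnable/waiting tasks per core
```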

What Would Happen if CPU Load Average is High

I read some articles about the CPU load average. They talked about the definition, the differences from CPU usage, and the optimal value (roughly equal to the number of cores). They also mentioned that if the number is high you will be in trouble (waking up at midnight, etc.), but what would actually be happening if the number is high?
For example, I have been running 4, 6 and 8 sessions on a 4-core Linux server. Although the times it took to finish the tasks were different (4 fastest, 8 slowest), the results seem OK. The CPU load averages were roughly 4, 8 and 10. I understand that 10 might not be a good number, but then what?
It's just that: if you run at absurdly high load averages, overall efficiency will suffer and CPU processing power will go to waste.
This is caused by several factors, the most immediate being the extra CPU time needed to schedule the competing tasks. One far from insignificant factor is that the competing processes will also overuse the CPU cache; each task switch effectively throws out the cache contents and replaces them with new ones. Further choke points come in the form of bottlenecks in memory and storage bandwidth.
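A hedged way to see this yourself (a toy Python sketch of my own, not from the original answer) is to keep the total amount of CPU-bound work fixed and split it across progressively more worker processes than there are cores; wall-clock time creeps up as scheduling and cache churn eat into useful work. The effect is modest for coarse, cache-friendly loops like this one and grows with cache-heavy workloads:

```python
# Toy oversubscription demo: same total work, more competing workers.
import multiprocessing as mp
import os
import time

TOTAL_ITERATIONS = 100_000_000  # total work, kept constant across runs

def burn(iterations):
    x = 0.0
    for _ in range(iterations):
        x = (x + 1.0) * 0.5
    return x

def timed_map(pool, workers):
    chunk = TOTAL_ITERATIONS // workers
    start = time.perf_counter()
    pool.map(burn, [chunk] * workers)  # all workers runnable at once
    return time.perf_counter() - start

if __name__ == "__main__":
    ncpu = os.cpu_count()
    for workers in (ncpu, 4 * ncpu, 16 * ncpu):
        with mp.Pool(workers) as pool:  # pool creation kept outside the timer
            elapsed = timed_map(pool, workers)
        print(f"{workers:4d} workers (load roughly {workers}): {elapsed:.2f}s")
```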

Comparing CPU speed likely improvements for business hardware upgrade justification

I have a C# console app, a Monte Carlo simulation that is entirely CPU bound; execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores and threads).
It currently runs daily on:
AMD Opteron 275 @ 2.21 GHz (4 cores)
The app is multithreaded, using 3 threads; the 4th thread is left for another Process Controller app.
It takes 15 hours per day to run.
I need to estimate as best I can how long the same work would take to run on a system configured with the following CPU's:
http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540
and compare the cases; I will recode it to use the available threads. I want to justify that we need a server with 2 x X5570 CPUs over the cheaper X5540s (they support 2 CPUs on a single motherboard). This should make 8 cores, i.e. 16 threads (that's how the Nehalem chips work, I believe), available to the operating system. So for my app that's 15 threads for the Monte Carlo simulation.
Any ideas how to do this? Is there a website where I can see benchmark data for all 3 CPUs involved on a single-threaded benchmark? I could then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on, if necessary.
Note that the business is also dictating that the workload for this app will increase about 20-fold over the next 3 months and needs to complete within a 24-hour window.
Any help much appreciated.
I have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 in the hope that they can better explain their benchmarking, so I can effectively get a score per core, which would be much more helpful.
Have you considered recreating the algorithm in CUDA? It uses current-day GPUs to speed up calculations like these 10-100 fold. That way you would just need to buy a beefy video card.
Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp
The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml
Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?
You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.
It sounds like you've spent a lot of time optimizing the calculation thread, but is every calculation being performed actually important to the final result?
Is there a way to shortcut calculations anywhere?
Is there a way to identify items which have negligible effects on the end result, and skip those calculations?
Can a lower resolution model be used for early iterations with detail added in progressive iterations?
Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.
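As a toy illustration of the sampling trade-off mentioned above (a generic Monte Carlo estimate of pi, not the poster's simulation), run time grows linearly with the sample count while the error shrinks only as roughly 1/sqrt(N), which is what makes a lower-resolution early pass attractive:

```python
# Generic Monte Carlo toy: accuracy vs. run time as the sample count grows.
import math
import random
import time

def estimate_pi(samples):
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

if __name__ == "__main__":
    for samples in (10_000, 100_000, 1_000_000):
        start = time.perf_counter()
        estimate = estimate_pi(samples)
        elapsed = time.perf_counter() - start
        print(f"N={samples:>9,}: pi ~ {estimate:.5f}, "
              f"error {abs(estimate - math.pi):.5f}, {elapsed:.3f}s")
```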
tomshardware.com contains a comprehensive list of CPU benchmarks. However, you can't just divide them; you need to find as close to an apples-to-apples comparison as you can get, and you won't quite get it because the result depends on the mix of instructions in your workload.
I would guess (please don't take this as official; you need real data for this) that you're probably looking at a 1.5x - 1.75x single-threaded speedup if the work is CPU bound and not highly vectorized.
You also need to take into account that you are:
1) You are using C# and the CLR; unless you've taken steps to prevent it, the GC may kick in and serialize you.
2) The Nehalems have hyperthreads, so you won't see a perfect 16x speedup; more likely you'll see an 8x to 12x speedup depending on how optimized your code is. Be optimistic here, though (just don't expect 16x).
3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads. There may be dragons here (and there usually are).
I would envelope calc this as:
15 hours * 3 threads / 1.5x = 30 hours of single-threaded work time on a Nehalem.
30 / 12 = 2.5 hours (best case)
30 / 8 = 3.75 hours (worst case)
implies a parallel run time if there is truly a 20x increase:
2.5 hours * 20 = 50 hours (best case)
3.75 hours * 20 = 75 hours (worst case)
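The same envelope calculation written out as a small Python sketch (my restatement; the 1.5x per-thread speedup and the 8 to 12 effective threads are the assumptions stated above, not measurements):

```python
# Envelope estimate: convert to single-threaded-equivalent hours, then divide
# by the effective thread count of the new box.
def estimate_runtime_hours(current_hours, current_threads,
                           per_thread_speedup, effective_new_threads,
                           workload_multiplier=1.0):
    single_threaded_hours = current_hours * current_threads / per_thread_speedup
    return single_threaded_hours * workload_multiplier / effective_new_threads

print(estimate_runtime_hours(15, 3, 1.5, 12))      # ~2.5 h, best case today
print(estimate_runtime_hours(15, 3, 1.5, 8))       # ~3.75 h, worst case today
print(estimate_runtime_hours(15, 3, 1.5, 12, 20))  # ~50 h with the 20x workload
print(estimate_runtime_hours(15, 3, 1.5, 8, 20))   # ~75 h with the 20x workload
```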
How much have you profiled? Can you squeeze 2x out of the app? One server may be enough, but it likely won't be.
And for gosh sakes, try out the Task Parallel Library in .NET 4.0 or the .NET 3.5 CTP; it's supposed to help with this sort of thing.
-Rick
I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:
Current Workload
3 cores * 15 real-world-hours = 45 cpu-time-hours
Proposed 20X Workload
45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores
Thus, you would need the equivalent of 45 2.2 GHz Opteron cores to achieve your goal (despite increasing the processing window from 15 hours to 20 hours per day), assuming completely linear scaling of performance. Even if the Nehalem CPUs are 3X faster per thread, you will still be at the outside edge of your performance envelope, with no room to grow. That also assumes that hyper-threading will even work for your application.
The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.
Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm
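The same simple math, written as a reusable sketch (my generalization; the 2X and 3X per-core factors plugged in below are this answer's rough estimates, not measured numbers):

```python
# Cores needed = cpu-hours of work per day, scaled for workload growth,
# divided by the wall-clock hours available and the per-core speedup.
def cores_needed(cpu_hours_per_day, workload_multiplier,
                 wall_hours_available, per_core_speedup=1.0):
    return (cpu_hours_per_day * workload_multiplier
            / (wall_hours_available * per_core_speedup))

print(cores_needed(45, 20, 20))       # 45 Opteron-class cores, as above
print(cores_needed(45, 20, 20, 2.0))  # ~22.5 cores at the ~2X estimate
print(cores_needed(45, 20, 20, 3.0))  # ~15 cores even with a 3X-per-core assumption
```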
It'd be swinging a big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.
I'd also be wary of other bottlenecks such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but ramping up one resource might reveal some other bottleneck.
