Process memory and CPU limitations on 64-bit Linux? [duplicate] - linux

Possible Duplicate:
Practical limitations of JVM memory and CPU usage?
Let's say money was not a limiting factor, and I wanted to write a Java program that ran on a single powerful machine.
The goal would be to make the Java program run as fast as possible without having to swap or go to disk for anything.
Let's say that this computer has:
1 TB of RAM (64 16GB DIMMs)
64 processor cores (8 8-core processors)
running 64-bit Ubuntu
Could a single instance of a Java program running in a JVM take advantage of this much RAM and this many processors?
Are there any practical considerations that might limit the usage and efficiency?
Hardware limitations (e.g. can CPUs work together on a TB of memory)?
OS process (memory & threads) limitations?
JVM memory/heap limitations?
JVM thread limitations?
Could a regular executable (i.e. a C program) take advantage of the above specs?
Thanks,
Galen

Yes, a JVM can take advantage of that much RAM and that many processors. In fact, I have a machine with a configuration nearly identical to the one you describe, except it's running Red Hat. Keep in mind that your application would need to be multithreaded to use all the processor cores.
As for hardware limitations, there are always caching issues, which can get pretty complicated, but a rule of thumb is that if all your threads are randomly accessing many different areas of RAM, you're likely to get poor cache hit rates. If they're hitting roughly the same areas of RAM, you'll likely get better cache hits.
For specifics, it would be nice to know what sort of applications you're planning on writing.
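To the last sub-question: yes, a regular 64-bit C executable gets the same enormous virtual address space. Purely as an illustration, here is a minimal sketch (my own, not from the thread; the 64 GiB region size and 64-thread count are placeholders to scale to the machine at hand) of a program faulting in memory far beyond any 32-bit limit from many threads at once:

    /* Build: cc -O2 -pthread big_touch.c
     * Illustrative only: region size and thread count are placeholders. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define NTHREADS 64
    #define REGION_SIZE (64ULL << 30)   /* 64 GiB of anonymous memory */

    static char *region;

    static void *worker(void *arg) {
        long id = (long)arg;
        size_t slice = REGION_SIZE / NTHREADS;
        /* Touch every byte of this thread's slice so the kernel
         * actually faults the pages in. */
        memset(region + (size_t)id * slice, (int)id, slice);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);

        printf("touched %llu GiB across %d threads\n",
               REGION_SIZE >> 30, NTHREADS);
        return 0;
    }

On a NUMA box like the one described, where each thread's slice lands (first-touch policy) matters as much as how big the region is, which ties back to the caching point above.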

Related

openmp: performance decreases with multiple threads on my desktop, but the opposite over my server

I've read quite a few Stack Overflow threads about performance decreasing as the thread count increases with OpenMP. Mostly the cause was false sharing. My situation is quite different, because my two machines show opposite results.
I'm running STREAM benchmark, and information about my machines is written below:
Intel Xeon Gold-6148. 20 cores (40 threads), 2.4 GHz, 27.5MB LLC
Intel Core i5-9400. 6 cores (6 threads), 2.9 GHz, 9MB LLC
Memory information is similar between the two machines so I will omit it.
I ran the benchmark many times and checked that the variance between runs is small. The results are quite interesting.
The Gold 6148 (a.k.a. the server) gets better results as I increase the thread count via OMP_NUM_THREADS. However, the i5-9400 (a.k.a. the desktop) gets worse results with more threads.
I set STREAM_ARRAY_SIZE to 20M and double-checked with various sizes, so the array size doesn't explain the result.
I also suspected a difference in the glibc/gomp libraries between the two machines, but found no difference.
Any idea why this is happening? I just keep absentmindedly staring at the charts below again and again...
[Chart: i5-9400 results]
[Chart: Gold 6148 results]
Memory information is similar between the two machines so I will omit it.
You cannot simply omit the most important information when talking about the STREAM benchmark. The Xeon Gold 6148 has six DDR4-2666 memory channels, split over two separate memory controllers, while the i5-9400 has only two DDR4-2666 memory channels on a single memory controller. Therefore, a 6148 with memory modules installed on all six channels can deliver 3x the memory bandwidth of an i5-9400 with the same type of modules on both of its channels. It can also handle more simultaneous memory requests and therefore achieves better memory utilisation with more than one thread. So the actual memory configuration is quite important.
Interpreting the STREAM results requires deep understanding of the underlying CPU architecture. There is a nice article by Georg Hager on that topic.
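If you want to see the effect without the full benchmark, the STREAM "triad" kernel is enough. A minimal sketch (my own, not the official STREAM source; the 20M-element arrays mirror the question; build with e.g. cc -O2 -fopenmp):

    /* Triad kernel a[i] = b[i] + scalar*c[i]. With arrays far larger
     * than the LLC, throughput is bounded by memory bandwidth, so
     * extra threads only help while idle memory channels remain. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 20000000   /* ~20M doubles per array, as in the question */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double scalar = 3.0;

        /* First-touch initialisation in parallel, so pages land on the
         * same NUMA nodes from which the timed loop will access them. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) {
            a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
        }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
        double t1 = omp_get_wtime();

        /* three arrays of 8-byte doubles move per iteration */
        double gb = 3.0 * N * sizeof(double) / 1e9;
        printf("%d threads: %.2f GB/s\n",
               omp_get_max_threads(), gb / (t1 - t0));
        free(a); free(b); free(c);
        return 0;
    }

Run with OMP_NUM_THREADS=1,2,4,... and the curve should flatten roughly where the memory channels saturate: early on the two-channel desktop, much later on the six-channel server.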

When to consider a Linux kernel overloaded

I am currently working on a Linux system, and yesterday I noticed that the system was slow to answer my HTTP requests. I opened top and found a situation in which memory usage appeared to be at 95-99%.
Since the CPU load seems low and the swap file is mostly free, I am wondering when I should consider a Linux system overloaded and when not. I know that Linux handles memory differently, right? Maybe this memory load is not related to the poor responsiveness of the HTTP server at all (it could be the network layer or something else, not memory)?
Thank you.
The term "Linux kernel overloaded" doesn't quite line up with reality. You overload a specific resource: the HDD is overloaded, the CPU is overloaded, RAM is full and you are swapping.
You should check all of these, not just CPU load and memory usage. What about iotop (maybe your HDD is overloaded?) or jnettop (network?)?
In your case I suspect you are simply using too much RAM and have started swapping (820 MB is in swap already). Swapping means using the swap partition (usually on an HDD, but it depends on your configuration) as a kind of extension of RAM (similar to the Windows pagefile). But since HDDs are vastly slower than RAM, the system takes a big performance hit.
Another suspicious thing is the CPU usage of 23%. How many cores (including hyperthreading) does your system have? Is it possible that your application is not using threads? On a four-core machine, an overall CPU usage of ~25% can mean a single core running at 100% (overloaded) while the other three sit idle, i.e. a single-process/single-thread application saturating one core. A quick way to check per-core usage is sketched below.
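top can show per-core figures (press '1'), or you can sample /proc/stat yourself. A minimal C sketch (my own, illustrative; it samples twice, one second apart, and prints per-core busy percentages):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_CPUS 256

    /* Read per-core jiffy counters from /proc/stat:
     * user nice system idle iowait irq softirq. */
    static int read_cpus(unsigned long long busy[], unsigned long long total[]) {
        FILE *f = fopen("/proc/stat", "r");
        char line[512];
        int n = 0;
        if (!f) return -1;
        while (fgets(line, sizeof line, f) && n < MAX_CPUS) {
            unsigned long long u, ni, s, id, io, irq, sirq;
            /* line[3] != ' ' skips the aggregate "cpu " line */
            if (strncmp(line, "cpu", 3) == 0 && line[3] != ' ' &&
                sscanf(line, "cpu%*d %llu %llu %llu %llu %llu %llu %llu",
                       &u, &ni, &s, &id, &io, &irq, &sirq) == 7) {
                busy[n]  = u + ni + s + irq + sirq;
                total[n] = busy[n] + id + io;
                n++;
            }
        }
        fclose(f);
        return n;
    }

    int main(void) {
        unsigned long long b1[MAX_CPUS], t1[MAX_CPUS];
        unsigned long long b2[MAX_CPUS], t2[MAX_CPUS];
        read_cpus(b1, t1);
        sleep(1);
        int n = read_cpus(b2, t2);
        for (int i = 0; i < n; i++)
            printf("cpu%d: %5.1f%% busy\n", i,
                   100.0 * (b2[i] - b1[i]) / (double)(t2[i] - t1[i]));
        return 0;
    }

One core near 100% with the rest idle points to a single-threaded bottleneck rather than a genuinely overloaded machine.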

lock contention in memory allocation - multi-threaded vs. multi-process

We have developed a big C++ application that runs satisfactorily at several sites on big Linux and Solaris boxes (up to 160 CPU cores or even more). It's a heavily multi-threaded (1000+ threads), single-process architecture consuming huge amounts of memory (200 GB+). We are LD_PRELOADing Google Perftools' tcmalloc (or libumem/mtmalloc on Solaris) to avoid memory allocation performance bottlenecks, with generally good results. However, we are starting to see adverse effects of lock contention during memory allocation/deallocation on some bigger installations, especially after the process has been running for a while (which hints at aging/fragmentation effects in the allocator).
We are considering changing to a multi-process/shared memory architecture (the heavy allocation/deallocation will not happen in shared memory, rather on the regular heap).
So, finally, here's our question: can we assume that the virtual memory manager of modern Linux kernels is capable of efficiently handing out memory to hundreds of concurrent processes? Or do we have to expect running into the same kind of memory allocation contention that we see in our single-process/multi-threading environment? I tend to hope for better overall system performance, as we would no longer be limited to a single address space, and having several independent address spaces would require less locking on the part of the virtual memory manager. Does anyone have actual experience or performance data comparing multi-threaded vs. multi-process memory allocation?
I tend to hope for better overall system performance, as we would no longer be limited to a single address space, and having several independent address spaces would require less locking on the part of the virtual memory manager.
There is no reason to expect this. Unless your code is so badly designed that it constantly goes back to the OS to allocate memory, it won't make any significant difference. Your application should only need to go back to the OS's virtual memory manager when it needs more virtual memory, which should not occur significantly once the process reaches its stable size.
If you are constantly allocating and freeing all the way back to the OS, you should stop doing that. If you're not, then you can keep multiple pools of already-allocated memory that can be used by multiple threads without contention, as in the sketch below. And, as a benefit, your context switches will be cheaper because TLBs don't have to be flushed.
Only if you can't reduce the frequency of address space changes (for example, if you must map and unmap files) or if you have to change other shared resources (like file descriptors) should you look at multiprocess options.
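For concreteness, here is a minimal sketch of the "multiple pools, no contention" idea (an assumed design for illustration, not the poster's code): each thread recycles fixed-size blocks through a thread-local free list and only falls back to malloc() when its own list is empty, so the hot path takes no lock at all.

    #include <stdlib.h>

    #define BLOCK_SIZE 256   /* must be >= sizeof(struct block) */

    struct block { struct block *next; };

    /* One free list per thread; no locking needed to touch it. */
    static __thread struct block *free_list = NULL;

    void *pool_alloc(void) {
        struct block *b = free_list;
        if (b) {                      /* hot path: reuse without malloc */
            free_list = b->next;
            return b;
        }
        return malloc(BLOCK_SIZE);    /* cold path: go to the allocator */
    }

    void pool_free(void *p) {
        struct block *b = p;          /* return to this thread's list, */
        b->next = free_list;          /* never back to the OS          */
        free_list = b;
    }

One caveat of this design: blocks freed by a different thread than the one that allocated them simply migrate to the freeing thread's list, which is fine as long as allocations and frees are roughly balanced per thread. This is essentially what tcmalloc's per-thread caches do, just without the cross-thread rebalancing.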

Erlang garbage collection

I need your help investigating an issue with Erlang memory consumption. How typical, right?
We have two different deployment schemes.
In the first scheme, we run many identical nodes on small virtual machines (in Amazon AWS),
one node per machine. Each machine has 4 GB of RAM.
In the second scheme, we run these nodes on big bare-metal machines (with 64 GB of RAM), many nodes per machine. In this deployment the nodes are isolated in Docker containers (with the memory limit set to 4 GB).
I've noticed that the process heaps in dockerized nodes hog up to three times more RAM than the heaps in non-dockerized nodes under identical load. I suspect that garbage collection in the non-dockerized nodes is more aggressive.
Unfortunately, I don't have any garbage collection statistics yet, but I would like to obtain them ASAP.
To give more information: we are using HiPE R17.1 on Ubuntu 14.04 with the stock kernel. In both schemes we run 8 schedulers per node and use the default fullsweep_after flag.
My blind guess is that Erlang's default garbage collection relies (somehow) on /proc/meminfo (which is not accurate in a dockerized environment).
I am not a C guy and not familiar with the emulator internals, so could someone point me to the places in the Erlang sources that are responsible for garbage collection, and to emulator options I can use to tweak this behavior?
Unfortunately, VMs often try to be smarter about memory management than necessary, and that does not always play nicely with the Erlang memory management model. Erlang tends to allocate and release a large number of small chunks of memory, which is very different from typical applications, which allocate and release a small number of big chunks.
One of those technologies is Transparent Huge Pages (THP), which some OSes enable by default and which causes Erlang nodes running in such VMs to grow (until they crash).
https://access.redhat.com/solutions/46111
https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/
https://docs.mongodb.org/manual/tutorial/transparent-huge-pages/
So, ensuring THP is switched off is the first thing you can check.
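System-wide, that is done via the sysfs switch described in the Red Hat link above. For completeness, a process can also opt individual mappings out of THP itself; a minimal C sketch of that mechanism (illustrative only, and not something you can easily retrofit onto a running Erlang VM):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64 << 20;  /* 64 MiB, illustrative */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ask the kernel never to back this region with huge pages. */
        if (madvise(p, len, MADV_NOHUGEPAGE) != 0)
            perror("madvise");   /* e.g. kernel built without THP */

        munmap(p, len);
        return 0;
    }

For a stock Erlang node, though, the practical fix is the system-wide switch.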
The other is trying to tweak the memory options used when starting the Erlang VM itself, for example see this post:
Erlang: discrepancy of memory usage figures
Resulting options that worked for us:
-MBas aobf -MBlmbcs 512 -MEas aobf -MElmbcs 512
Some more theory about memory allocators:
http://www.erlang-factory.com/static/upload/media/139454517145429lukaslarsson.pdf
And more detailed description of memory allocator flags:
http://erlang.org/doc/man/erts_alloc.html
The first thing to know is that garbage collection in Erlang is per process. Each process is garbage-collected in its own time, independently of the others. So garbage collection in your system depends only on the data in your processes, not on the operating system itself.
That said, there can be differences between memory consumption from the Erlang point of view and from the system point of view. That's why comparing erlang:memory to what your system reports is always a good idea (it can reveal binary leaks or other memory problems).
If you would like to understand a little more about Erlang internals, I recommend these two talks:
https://www.youtube.com/watch?v=QbzH0L_0pxI
https://www.youtube.com/watch?v=YuPaX11vZyI
And for better debugging of your memory management, I recommend starting with http://ferd.github.io/recon/

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores. I can run it with 1, 2, 3, etc. threads and get linear speedup, up to about 5.5x on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high-end Sun Fire X4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads...
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads and they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, and 16 threads all run at just 1.9x net! I can see all the threads are running (not stalled or sleeping); they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
- My program does not access the disk or network. It's CPU-limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads; 6 threads is effectively a 6x speedup.
- My program never runs faster than a 2x speedup on this 16-core Sun Fire Xeon box, for any number of threads from 2-16.
- Running 16 single-threaded copies of my program works perfectly, all 16 running at once at full speed.
- top shows 1600% of CPU allocated, and /proc/cpuinfo shows all 16 cores running at the full 2.9 GHz speed (not the low-frequency idle speed of 1.6 GHz).
- There's 48 GB of RAM free; the machine is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels in both tests.
It seems unlikely that the motherboard itself would peak at utilizing 2 CPUs, and you have seen better scaling on another multi-core machine. Do you have hyperthreading turned on on the new machine (and how does that compare to the old machine)? You're not, by chance, running in a virtualized environment?
Overall, your evidence points to a ludicrously slow bottleneck somewhere. As you said, you're not I/O-bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down the possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user account you're running under has some Red Hat-default or admin-set resource limits in place; the sketch below prints the limits the process actually sees.
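A minimal sketch of such a check (my own, illustrative; `ulimit -a` in the shell shows the same information):

    #include <stdio.h>
    #include <sys/resource.h>

    static void show(const char *name, int resource) {
        struct rlimit rl;
        if (getrlimit(resource, &rl) == 0)
            printf("%-12s soft=%-20llu hard=%llu\n", name,
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
    }

    int main(void) {
        show("RLIMIT_CPU",   RLIMIT_CPU);    /* CPU seconds */
        show("RLIMIT_AS",    RLIMIT_AS);     /* address space */
        show("RLIMIT_NPROC", RLIMIT_NPROC);  /* processes/threads */
        show("RLIMIT_DATA",  RLIMIT_DATA);   /* data segment */
        return 0;
    }

Very large values are RLIM_INFINITY, i.e. unlimited; anything small and suspicious here would be the admin-set limit to chase.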
When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means the memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single-CPU case likely doesn't need to hit main memory for locking operations at all; everything is likely handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if scaling up to 4 threads improves when the process affinity is set to lock it to a single physical CPU (see the sketch below). Even better would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html
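For reference, `taskset -c 0-3 ./yourprog` does this from the shell; programmatically it is a few lines with sched_setaffinity. A minimal sketch (my own; the assumption that cores 0-3 share one physical package should be checked against the "physical id" field in /proc/cpuinfo):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 4; cpu++)   /* cores 0-3: assumed one package */
            CPU_SET(cpu, &set);

        /* pid 0 = this process; threads created after this inherit the mask. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to cores 0-3; spawn the 4 worker threads from here\n");
        return 0;
    }

If 4 threads pinned to one package scale linearly while 4 threads spread across packages do not, cross-socket coherency traffic is the prime suspect.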
