I have read this official post on Hazelcast High-Density Memory.
Am I right in assuming that this HD memory still consumes memory from the same JVM process in which the application is running, rather than from a separate JVM started on the server solely for the Hazelcast instance?
And that the only difference with this native memory configuration is that the memory is allocated off-heap rather than on-heap, as it is by default?
HDMS (Hazelcast High-Density Memory Store) allocates memory in the same process space as the Java heap. That means the process still owns all of the memory, but the Hazelcast-allocated space (off-heap, i.e. outside the Java heap) is independent of the Java heap and is not subject to garbage collection. Values are serialized and the resulting byte stream is copied into native memory; when reading, the bytes are copied back into the Java heap and returned to the requester.
Imagine HDMS as a fancy malloc implementation :)
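To make the "serialize, copy out, copy back" flow concrete, here is a toy analogy in Python. It is not Hazelcast's API and assumes nothing about its internals: the class name `ToyOffHeapStore` is hypothetical, `pickle` stands in for Hazelcast serialization, and an anonymous `mmap` region stands in for native memory. Python has no separate Java-style heap, so this only illustrates the idea that stored values live as raw bytes the garbage collector never traverses, and that every read returns a freshly deserialized copy.

```python
import mmap
import pickle


class ToyOffHeapStore:
    """Hypothetical toy analogy of an off-heap store (not Hazelcast's API)."""

    def __init__(self, capacity: int) -> None:
        # Anonymous, process-private mapping: same process, but the bytes
        # stored here are not Python objects, so the GC never scans them.
        self._region = mmap.mmap(-1, capacity)
        self._next = 0      # bump-pointer offset (this toy never frees space)
        self._index = {}    # key -> (offset, length); metadata stays on the heap

    def put(self, key, value) -> None:
        data = pickle.dumps(value)               # serialize to a byte stream
        end = self._next + len(data)
        if end > len(self._region):
            raise MemoryError("toy region is full")
        self._region[self._next:end] = data      # copy the bytes into the region
        self._index[key] = (self._next, len(data))
        self._next = end

    def get(self, key):
        offset, length = self._index[key]
        data = self._region[offset:offset + length]  # copy the bytes back out
        return pickle.loads(data)                    # deserialize into a new object

    def close(self) -> None:
        self._region.close()
```

Because `get` deserializes on every call, callers always receive a copy; mutating it does not change what is stored until it is `put` again.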
HDMS (High-Density Memory Store) is part of the Hazelcast Enterprise HD offering. HDMS is a way for Java software to access multiple terabytes of memory per node without struggling with long and unpredictable garbage collection pauses. This memory store provides the benefits of "off-heap" memory using many high-performance memory management techniques. HDMS works around garbage collection limitations so that applications can use hardware memory more efficiently without the need for extra clusters. It is designed as a pluggable memory manager that enables multiple memory stores for different data structures such as IMap and JCache.
I have a NodeJS server running on a small VM with 256MB of RAM and I notice the memory usage keeps growing as the server receives new requests. I read that an issue in small environments is that Node doesn't know about the memory constraints and therefore doesn't try to garbage collect until much later (so, for instance, it might only start garbage collecting once it reaches 512MB of used RAM). Is that really the case?
I also tried various flags such as --max-old-space-size but didn't see much change, so I'm not sure whether I have an actual memory leak or whether Node just doesn't GC as early as possible.
This might not be a complete answer, but it comes from experience and might provide some pointers. Memory leaks in NodeJS are among the most challenging bugs most developers will ever face.
But before we talk about memory leaks, to answer your question: unless you explicitly configure --max-old-space-size, default memory limits apply. Since certain phases of garbage collection in Node are expensive (and sometimes blocking), Node delays some of the expensive GC cycles (e.g. mark-sweep collection) depending on how much memory is available to it. I have seen that on a machine with 16 GB of memory it would easily let memory usage climb as high as 800 MB before significant garbage collections happened. I am sure that doesn't make ~800 MB any kind of special limit, though; it really depends on how much memory is available and what kind of application you are running. For example, it is entirely possible that complex computations, caches (e.g. big DB connection pools), or buggy logging libraries will themselves always take a lot of memory.
If you are monitoring your NodeJS memory footprint: shortly after the server starts up, everything starts to warm up (Express loads all the modules and creates some startup objects, caches warm up, and all of your memory-hungry modules become active). It might appear as if there is a memory leak because memory keeps climbing, sometimes as high as ~1 GB. Then you would see it stabilize (this limit used to be lower in versions before v8).
But sometimes there are actual memory leaks (which might be hard to spot if there is no specific pattern to it).
In your case, 256 MB seems to meet only the bare minimum RAM requirement for NodeJS and might not really be enough. Before you start worrying about a memory leak, you might want to bump it up to 1.5 GB and then monitor everything.
Some good resources on NodeJS's memory model and memory leaks:
Node.js Under the Hood
Memory Leaks in NodeJS
Can garbage collection happen while the main thread is busy?
Understanding and Debugging Memory Leaks in Your Node.js Applications
Some debugging tools to help spot memory leaks:
Node inspector | Chrome
llnode
gcore
I can't find a clear description of the performance of the so-called constant memory referred to in the Numba documentation:
https://numba.pydata.org/numba-doc/dev/cuda/memory.html#constant-memory
I am curious what the size limits for this memory are, how fast or slow it is compared to other memory types, and whether there are any pitfalls in using it.
Thank you!
This is really a more general question about constant memory on CUDA-capable devices. You can find information in the official CUDA programming guide and here, where it says:
There is a total of 64 KB constant memory on a device. The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. Accesses to different addresses by threads within a warp are serialized, thus the cost scales linearly with the number of unique addresses read by all threads within a warp. As such, the constant cache is best when threads in the same warp accesses only a few distinct locations. If all threads of a warp access the same location, then constant memory can be as fast as a register access.
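The quoted cost rule can be expressed as a tiny model. This sketch assumes a warp of 32 threads, ignores cache misses entirely, and simply counts the distinct constant-memory addresses a warp touches in one load, since the guide says those accesses are serialized. The function names are hypothetical; this is not a Numba or CUDA API.

```python
WARP_SIZE = 32


def constant_read_cost(addresses):
    """Serialized constant-cache reads for one warp-wide load (simplified model).

    Cost = number of unique addresses the warp's threads read; cache misses
    and other hardware details are deliberately ignored.
    """
    if len(addresses) > WARP_SIZE:
        raise ValueError("a warp has at most 32 threads")
    return len(set(addresses))


def block_read_cost(thread_addresses):
    """Sum the per-warp cost after grouping a block's threads into warps of 32."""
    return sum(
        constant_read_cost(thread_addresses[i:i + WARP_SIZE])
        for i in range(0, len(thread_addresses), WARP_SIZE)
    )
```

Under this model, a broadcast (every thread reads index 0) costs one read per warp, while having each thread read its own index costs 32 serialized reads per warp, which is the pitfall to avoid when choosing constant memory.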
Regarding how this compares to other memory types, here is my short answer. You may want to read this page for further details:
Registers: Thread private on-chip read + write memory which can be considered as the fastest memory space on a GPU.
Local memory: Thread-private off-chip read + write memory which, despite its misleading name, is physically located in the same place as global memory; hence its high latency.
Global memory: The largest memory space, with high latency and global scope; it is also off-chip, with read + write access.
Constant memory: Off-chip cached read-only memory limited to 64 KB which could be accessed by threads as fast as registers, if all threads of a warp access the same location.
Shared memory: On-chip, low-latency, read + write with limited space per multiprocessor (48 KB to 164 KB depending on the compute capability of your device).
Texture memory: Read-only memory cached on-chip, optimized for 2D spatial locality, that supports unique features such as hardware filtering.
Pinned (page-locked) memory: Not explicit device memory. It is accessible directly by both CPU and GPU code and is used to speed up and overlap data transfers between CPU and GPU.
These memories have different scopes, life-times and usages. The Numba page that you have mentioned in your question explains the basics but the official CUDA programming guide has a lot more details. At the end of the day, the answer to the question of when to use each memory is to a large degree application-dependent.
I have some doubts about how the JVM garbage collector works with different values of Xmx and Xms and different machine memory sizes.
How would the garbage collector work in the following scenarios?
1. Machine memory size = 7.5 GB
Xmx = 1024 MB
Number of processes = 16
Xms = 512 MB
I know 16 × 512 MB already exceeds the machine memory size. How would the garbage collector work in this scenario? I think memory usage would reach the entire 7.5 GB in this case. Would the processes be able to do anything? Or would they all be stuck?
2. Machine memory size = 7.5 GB
Xmx = 320 MB
Xms is not defined.
Number of processes = 16
In this case, 16 × 320 MB should be less than 7.5 GB, but memory usage again reaches 7.5 GB. Is that possible, or do I probably have a memory leak in my application?
So, basically, I want to understand when the garbage collector runs. Does it run whenever the memory used by the application reaches exactly the Xmx value? Or are they not related at all?
There are a couple of things to understand here and then consider in your situation.
Each JVM process has its own virtual address space, which the operating system protects from other processes. The OS maps physical ranges of addresses (called pages) into the virtual address space of each process. When more physical pages are required than are available, pages that have not been used for a while are written to disk (called paging) and can then be reused. When the data in those saved pages is needed again, it is read back into the same or a different physical page. This is how you can easily run 16 or more JVMs, each with a 1 GB heap, on a machine with 8 GB of physical memory. The problem is that the more paging to disk is required, the more your applications' performance degrades, since disk I/O is orders of magnitude slower than RAM access. This is also why the heap of a single JVM should not be bigger than physical memory.
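The "pages that have not been used for a while are written to disk" policy can be simulated in a few lines. This is a simplified sketch that assumes exact LRU replacement (real kernels typically use cheaper approximations of LRU); `count_page_faults` is a hypothetical name. It shows why paging is fine while the combined working set fits in RAM and turns into constant disk I/O once it does not.

```python
from collections import OrderedDict


def count_page_faults(references, num_frames):
    """Simulate demand paging with LRU replacement and count page faults.

    `references` is a sequence of (process_id, virtual_page) pairs and
    physical memory holds `num_frames` pages. When memory is full, the least
    recently used page is evicted (paged out) to make room.
    """
    frames = OrderedDict()  # (pid, page) -> None, ordered from least to most recent
    faults = 0
    for ref in references:
        if ref in frames:
            frames.move_to_end(ref)         # hit: mark as most recently used
            continue
        faults += 1                         # miss: page must be read in from disk
        if len(frames) >= num_frames:
            frames.popitem(last=False)      # evict the least recently used page
        frames[ref] = None
    return faults
```

With two processes each cycling over two pages, four physical frames give only the four unavoidable cold-start faults, but three frames make every single access fault, which is the thrashing behaviour that degrades performance when too many JVMs compete for RAM.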
The -Xms and -Xmx options let you specify the initial and maximum size of the heap. As your application runs and needs more heap space, the JVM can grow the heap within these bounds. Often these values are set to the same value to eliminate the overhead of resizing the heap while the application is running. Most operating systems only allocate physical pages when they are actually needed, so in your situation making -Xms small won't change the amount of paging that occurs.
The key point here is it's the virtual memory system of the operating system that makes it possible to appear to be using more memory than you physically have in your machine.
As I understand it, when processes are swapped-out of main memory and then back in, they can occupy different regions of physical memory. Is this ability shared by all three of segmentation, paging, and partitioning memory management systems? If not, what are the differences and why?
Thanks.
You are mixing a lot of different concepts. Segmentation is an obsolete system for managing memory. In ye olde days, when a large system had 1–2 MB of memory and 16-bit addressing, a process could only access a fraction of the system memory (64 KB) at a time. Segment registers were used to reach larger address ranges (at different times). Segmentation could be used to support multiple processes, or to increase the memory available to a single process: while the process was limited to 64 KB at any one time, manipulating segment registers allowed it to have more than 64 KB of memory in total. This was a common practice on PDP-11s.
Partitioning and segmenting are essentially the same and are equally obsolete. I described the PDP as using segments. Others describe it as using partitions. There are multiple versions of partitions.
Intel kept segmentation alive in its processors (and still keeps it in 32-bit mode) long after it should have died out.
Swapping is an obsolete system for implementing multi-processing. The entire process gets moved to disk. In the days of 64KB processes this did not have the overhead that moving a 32-bit address space to disk would have.
Modern systems use paging for memory management. In virtual memory systems, individual pages are moved to secondary storage; not entire processes (although it is possible for an entire process to be paged out of memory).
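On the original question of landing in a different physical region after swap-in: assuming the hardware translates addresses through a base register (segments/partitions) or a page table (paging), relocation only requires the OS to update that mapping, while the process keeps using the same addresses. The sketch below is a simplified model with hypothetical function names and a 4 KB page size chosen for illustration. Note the difference in granularity: base-and-limit needs one contiguous physical region per segment, whereas paging lets every page land in any free frame.

```python
PAGE_SIZE = 4096  # illustrative page size


def translate_segment(offset, base, limit):
    """Base-and-limit translation (segment or partition): physical = base + offset."""
    if not 0 <= offset < limit:
        raise ValueError("offset outside the segment (protection fault)")
    return base + offset


def translate_paged(vaddr, page_table):
    """Paged translation: map the virtual page to a frame, keep the in-page offset."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:
        raise LookupError(f"page fault on virtual page {page}")
    return frame * PAGE_SIZE + offset
```

For example, the same segment offset 100 maps to `0x10064` with base `0x10000` and to `0x40064` after the OS reloads the segment at base `0x40000`; likewise, a page table rewritten from `{0: 5, 1: 2}` to `{0: 9, 1: 3}` moves virtual page 1 from frame 2 to frame 3 without the process noticing.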
After reading the two books "Understanding Linux Network Internals" and "Understanding the Linux Kernel", as well as other references, I am quite confused and need some clarification about the "memory cache" and "memory pool" techniques.
1) Are they the same or different techniques?
2) If not the same, what makes the difference, or the distinct goals?
3) Also, where does the slab allocator come in?
Regarding the slab allocator:
So imagine memory is flat, that is, you have a 4 GB block of contiguous memory. Then one of your programs requests 256 bytes of memory, so the memory allocator has to choose a suitable 256-byte block from those 4 GB. Now your memory looks something like
<============256bytes=======================>
(each = is a contiguous block of memory). Some time passes, and many programs working with memory request more 256-byte blocks, or larger or smaller ones, so in the end your memory might look like:
<==256==256=256=86=68=121===>
so it gets fragmented and there is no trace left of your beautiful 4 GB block of memory: this is fragmentation. Now, what the slab allocator does is keep track of allocated objects, and once they are no longer used it reports the memory as free while in fact retaining it in some sort of list (you might want to read about free lists).
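The fragmentation described above can be reproduced with a toy first-fit allocator over a flat region. This is a sketch, not how Linux allocates pages: `first_fit` and `free_block` are hypothetical names, and the free list is a sorted list of `(start, length)` holes that merges adjacent holes on free. Freeing every other block leaves plenty of free memory in total but no single hole big enough for a larger request.

```python
def first_fit(holes, size):
    """Allocate `size` bytes from the first (start, length) hole that fits.

    Returns the start address, or None if no single hole is large enough.
    """
    for i, (start, length) in enumerate(holes):
        if length >= size:
            if length == size:
                holes.pop(i)                           # hole used up exactly
            else:
                holes[i] = (start + size, length - size)  # shrink the hole
            return start
    return None


def free_block(holes, start, size):
    """Return a block to the free list and merge it with adjacent holes."""
    holes.append((start, size))
    holes.sort()
    merged = []
    for s, n in holes:
        if merged and merged[-1][0] + merged[-1][1] == s:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, prev_len + n)    # coalesce with previous hole
        else:
            merged.append((s, n))
    holes[:] = merged
```

Usage: start with `holes = [(0, 1024)]`, allocate eight 128-byte blocks, then free the blocks at 0, 256, 512 and 768. Half the region (512 bytes) is free, yet `first_fit(holes, 256)` returns `None` because every hole is only 128 bytes.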
So now imagine that the first program relinquishes the 256 bytes it allocated, and then another program wants 256 bytes. Instead of allocating a new chunk of main memory, the allocator can reuse the most recently freed 256 bytes without the burden of searching physical memory for a suitable contiguous block. This is essentially how you implement a memory cache. It is done to reduce overall memory fragmentation, because otherwise you might end up with memory so fragmented that it is unusable and the memory manager has to do some magic to get you a block of the right size. Using a slab allocator proactively combats (but doesn't eliminate) the problem.
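The reuse scheme just described, where freed objects go onto a free list and the next same-size request takes the most recently freed one, can be sketched as below. `ObjectCache` is a hypothetical illustration that only loosely mirrors the kernel's `kmem_cache_alloc`/`kmem_cache_free` idea; real slabs also carve objects out of whole pages and track per-CPU state, none of which is modeled here.

```python
class ObjectCache:
    """Hypothetical slab-style cache of fixed-size buffers (illustration only)."""

    def __init__(self, name, obj_size):
        self.name = name
        self.obj_size = obj_size
        self._free = []              # free list of previously released buffers
        self.fresh_allocations = 0   # times we had to go to the backing allocator

    def alloc(self):
        if self._free:
            return self._free.pop()       # reuse the most recently freed buffer
        self.fresh_allocations += 1
        return bytearray(self.obj_size)   # otherwise allocate a new buffer

    def free(self, buf):
        if len(buf) != self.obj_size:
            raise ValueError("buffer does not belong to this cache")
        self._free.append(buf)            # keep it for reuse instead of releasing it
```

Because the free list is LIFO, a free followed by an alloc hands back the very same buffer (with its old contents, as the cache does not clear it), and the backing allocator is only consulted when the free list is empty.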
The Linux memory allocator, a.k.a. the slab allocator, maintains frequently used lists/pools of memory objects of similar or approximately equal size. The slab allocator gives the programmer extra flexibility to create their own pool of frequently used memory objects of the same size, label it however they want, allocate from it, deallocate to it, and finally destroy it. This cache is known to your driver and private to it. But there is a problem: under memory pressure there is a high chance of allocation failures, which may not be acceptable in some drivers. So what to do? Better to always keep some memory reserved so that we never feel the memory crunch. Since the kmem cache is a more generic pool mechanism, we need something that always maintains the minimum required memory, and that's our buddy, the memory pool.
Lookaside caches: the cache manager in the Linux kernel is sometimes called the slab allocator. You might end up allocating many objects of the same size over and over; with this mechanism you can allocate many objects of the same size up front and then use them later, without having to allocate objects over and over.
A memory pool is just a form of lookaside cache that tries to always keep a list of memory around for use in emergencies. When the memory pool is created, the allocation functions (slab allocators) create a pool of preallocated objects so you can acquire them when you need them.
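The reserve behaviour described in both answers can be sketched as follows. `MemoryPool` is a hypothetical class loosely modeled on the kernel's `mempool_t` idea: it preallocates `min_nr` elements at creation, tries the normal allocator first, dips into the reserve only when that fails, and refills the reserve first on free. `backing_alloc` is an assumed callable that returns `None` under memory pressure. This is simplified: the real `mempool_alloc` can wait for an element to be freed instead of failing when it is allowed to sleep.

```python
class MemoryPool:
    """Hypothetical memory pool with a guaranteed reserve (illustration only)."""

    def __init__(self, min_nr, backing_alloc):
        self._backing_alloc = backing_alloc  # assumed: returns an element or None
        self.min_nr = min_nr
        self._reserve = []
        for _ in range(min_nr):              # preallocate the emergency reserve
            elem = backing_alloc()
            if elem is None:
                raise MemoryError("could not fill the reserve at creation time")
            self._reserve.append(elem)

    def alloc(self):
        elem = self._backing_alloc()         # normal path first
        if elem is not None:
            return elem
        if self._reserve:
            return self._reserve.pop()       # emergency: take from the reserve
        return None                          # reserve exhausted (simplified)

    def free(self, elem):
        if len(self._reserve) < self.min_nr:
            self._reserve.append(elem)       # refill the reserve before anything else
        # otherwise drop it; a real pool would hand it back to its allocator
```

The design choice is that the reserve is a last resort rather than the first source of memory, so under normal conditions the pool costs only the `min_nr` preallocated elements, and under pressure a driver can still make progress.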