Used and Cached Memory In Spark - apache-spark

I would like to know whether Spark uses the Linux cached memory or the Linux used memory when we call the cache/persist method.
I'm asking because we have a cluster and we see that the machines sit at only 50% used memory and 50% cached memory, even during long jobs.
Thank you in advance,

Cached/buffered memory is memory that Linux uses for disk caching. When you read a file, it is always read into the page cache first. You can treat cached memory as effectively free memory. The JVM process of a Spark executor does not directly consume cached memory, so if only 50% of memory shows as used on your machines, the Spark executors are definitely not taking more than 50% of memory. You can use top or ps to see how much memory a Spark executor actually takes; it is usually a little more than the current size of its heap.
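For illustration, a minimal PySpark sketch (assuming the pyspark package and a local Spark installation; the app name and memory sizes are arbitrary). Data persisted with cache()/persist() lives inside the executor JVM heap, i.e. in "used" memory, not in the Linux page cache:
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cache-demo")                   # arbitrary name
         .config("spark.executor.memory", "2g")   # executor JVM heap (-Xmx)
         .getOrCreate())

df = spark.range(10_000_000)
df.persist(StorageLevel.MEMORY_ONLY)   # stored on the executor heap
df.count()                             # materialize the cached data
# To see what the executor process really occupies, check its RSS, e.g.
#   ps -o pid,rss,comm -C java
# The RSS is typically a bit larger than the current heap size.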

Related

Can h2o allow to allocate more memory to standalone cluster?

I want to increase the h2o cluster memory up to 64GB. Can I do that? If not, does it have to be less than or equal to my system memory? And if I can, how much can I allocate?
import h2o
h2o.init(nthreads=-1,max_mem_size='16g')
Thanks
The max_mem_size parameter goes straight to the Xmx parameter for the Java heap allocated to the h2o backend process.
Because Java is a garbage-collected language, you never want to make the Java heap larger than about 90% of physical memory, or you run the risk of uncontrollable swapping.
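As a rough sketch of that sizing rule (assuming the psutil package is available to read total physical memory; the 0.9 factor is just the guideline above):
import h2o
import psutil

total_gb = psutil.virtual_memory().total / 2**30
heap_gb = int(total_gb * 0.9)                      # stay below ~90% of physical RAM
h2o.init(nthreads=-1, max_mem_size=f"{heap_gb}g")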

How garbage collector works with Xmx and Xms values

I have some doubts about how the JVM garbage collector works with different values of Xmx and Xms and different machine memory sizes.
How would the garbage collector work in the following scenarios:
1. Machine memory size = 7.5GB
Xmx = 1024MB
Number of processes = 16
Xms = 512MB
I know 16*512MB already exceeds the machine memory size. How would the garbage collector work in this scenario? I think memory usage would be the entire 7.5GB in this case. Will the processes be able to do anything at all, or will they all be stuck?
2. Machine memory size = 7.5GB
Xmx = 320MB
Xms is not defined.
Number of processes = 16
Here, 16*320MB should be less than 7.5GB. But in my case, memory usage again reaches 7.5GB. Is that possible, or do I probably have a memory leak in my application?
So, basically, I want to understand when the garbage collector runs. Does it run whenever the memory used by the application reaches exactly the Xmx value, or are the two not related at all?
There's a couple of things to understand here and then consider in your situation.
Each JVM process has its own virtual address space, which is protected from other processes by the operating system. The OS maps physical ranges of addresses (called pages) to the virtual address space of each process. When more physical pages are required than are available, pages that have not been used for a while will be written to disk (called paging) and can then be reused. When the data of these saved pages is required again they are read back to the same or a different physical page. By doing this you can easily run 16 or more JVMs all with a heap of 1GB on a machine with 8GB of physical memory. The problem is that the more paging to disk that is required, the more you are going to degrade the performance of your applications, since disk IO is orders of magnitude slower than RAM access. This is also the reason that the heap space of a single JVM should not be bigger than physical memory.
The reason for having the -Xms and -Xmx options is so you can specify the initial and maximum size of the heap. As your application runs and requires more heap space, the JVM is able to increase the heap size within these bounds. A lot of the time these values are set to be the same, to eliminate the overhead of resizing the heap while the application is running. Most operating systems only allocate physical pages when they're required, so in your situation making -Xms small won't change the amount of paging that occurs.
The key point here is it's the virtual memory system of the operating system that makes it possible to appear to be using more memory than you physically have in your machine.
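A quick back-of-the-envelope sketch of the two scenarios above (numbers taken from the question; this is arithmetic, not a measurement):
physical_gb = 7.5

# Scenario 1: 16 JVMs, -Xms512m -Xmx1024m
print(16 * 512 / 1024)    # 8.0 GB committed at start, already > 7.5 GB physical
print(16 * 1024 / 1024)   # 16.0 GB worst case, made possible by paging to disk

# Scenario 2: 16 JVMs, -Xmx320m
print(16 * 320 / 1024)    # 5.0 GB worst-case heap, well under 7.5 GB physical
# The heap is only part of a JVM's footprint (metaspace, thread stacks,
# code cache, direct buffers), so total usage can still approach 7.5 GB.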

What is the difference between heap and swap memory?

What is the difference between heap and swap memory in Ubuntu (or any OS)? How does this affect choosing Cassandra?
Heap memory is what the JVM uses; swap is what the OS uses to push rarely-used pages out to disk and free up memory. It is strongly recommended to disable swap on C* hosts: old-gen objects in the JVM may get pushed out to disk, and when a GC occurs and touches them it will be very slow. If it can, C* will pin its memory to prevent itself from being swapped, but you should disable swap anyway.
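A small, hedged check (Linux only) for whether swap is enabled on a node, reading /proc/meminfo the same way free(1) does:
def swap_total_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("SwapTotal:"):
                return int(line.split()[1])   # reported in kB
    return 0

if swap_total_kb() > 0:
    print("Swap is enabled; consider disabling it on Cassandra hosts (swapoff -a).")
else:
    print("Swap is disabled.")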

What is Hazelcast HD Memory? - on/off heap?

I have read this official post on the Hazelcast High Density Memory.
Am I right in assuming that this HD memory is still consumed by the same JVM process in which the application is running, i.e. it does not create another JVM on the server used solely for the hz instance?
And is the only difference with this native memory configuration that the memory is allocated off-heap rather than with the default on-heap allocation?
HDMS, or Hazelcast High Density Memory Store, allocates memory in the same process space as the Java heap. That means the process still owns all the memory, but the Java heap is otherwise independent and the Hazelcast-allocated space (off-heap, i.e. non-Java-heap) is not subject to garbage collection. Values are serialized and the resulting bytestream is copied into native memory; when reading, it is copied back into the Java heap and sent to the requestor.
Imagine HDMS as a fancy malloc implementation :)
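Purely as a conceptual sketch (not Hazelcast's API): the round trip described above can be mimicked in Python, with pickle standing in for Java serialization and an anonymous mmap buffer standing in for the native, non-GC-managed region:
import mmap
import pickle

native = mmap.mmap(-1, 1 << 20)            # stand-in for the off-heap region

def put(value):
    data = pickle.dumps(value)              # serialize on the "heap" side
    native.seek(0)
    native.write(len(data).to_bytes(4, "little"))
    native.write(data)                      # copy the bytestream out of the heap

def get():
    native.seek(0)
    size = int.from_bytes(native.read(4), "little")
    return pickle.loads(native.read(size))  # copy back in and deserialize

put({"key": "value"})
print(get())                                # {'key': 'value'}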
HDMS, or High Density Memory Store, is part of the Hazelcast Enterprise HD offering. HDMS is a way for Java software to access multiple terabytes of memory per node without struggling with long and unpredictable garbage collection pauses. This memory store provides the benefits of "off-heap" memory using a number of high-performance memory management techniques. HDMS works around garbage collection limitations so that applications can utilize hardware memory more efficiently without needing extra clusters. It is designed as a pluggable memory manager that enables multiple memory stores for different data structures such as IMap and JCache.

Shared memory marked as virtual memory?

I run a program which allocates 64MB of shared memory for IPC. pmap shows that the 64MB chunk is allocated. However, top shows the RES memory of the process is only about 40MB! I conclude the shared memory is only counted as VIRT. But why? Linux still has more than 1GB of RAM available.
Have you actually used any of that 64MB yet? Linux defers allocation.
cf. Does malloc lazily create the backing pages for an allocation on Linux (and other platforms)?
Linux doesn't map all the memory a process has obtained into RAM; it brings a block of memory into RAM only when your program actually refers to it. Here "memory" means both private and shared memory.
I haven't done any experiments to verify this myself, but I have seen it stated in many places (including SO) and I believe it is correct. Just FYI.
Shared memory is, like most if not all of the memory userland programs deal with, virtual. Only active pages need to be mapped to physical (i.e. resident) memory. Doing otherwise would be a waste of resources.
The only exception is when the process specifically locks the pages in RAM with mlock.
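A minimal demonstration (Linux only): map 64MB of anonymous shared memory and observe that RSS grows only once the pages are actually touched:
import mmap
import os

def rss_kb():
    with open(f"/proc/{os.getpid()}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])   # reported in kB

buf = mmap.mmap(-1, 64 * 1024 * 1024)       # 64MB anonymous shared mapping
print("RSS after mmap :", rss_kb(), "kB")    # barely changes; pages not resident yet

for off in range(0, len(buf), 4096):         # touch one byte per 4kB page
    buf[off] = 1
print("RSS after touch:", rss_kb(), "kB")    # now roughly 64MB higher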
