How to analyse memory distribution in a process? - linux

I have a running (private) server which uses about 1.1G of virtual memory (1.0G physical memory). I have the source code of the server, but I want to figure out a better way to get a big picture of the memory distribution among the objects in the server, something like this:
HashTable: 50%, 500M
PlayerCache: 20%, 200M
OtherA: 10%, 100M
...
where an object may contain pointers to dynamically allocated memory.

On Linux, you could use proc(5) and pmap(1) (BTW, pmap(1), top(1), and ps(1) are all built on /proc/ and are all useful to you).
So if your server process has pid 1234, you could try pmap 1234 and cat /proc/1234/maps in your terminal to understand the virtual address space of your process. Try also cat /proc/1234/status.
Remember that processes run in virtual memory and each of them has its own virtual address space, and that physical RAM is a resource managed by the kernel. You might be interested in the RSS (resident set size).
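For instance, here is a minimal C++ sketch (assuming nothing beyond /proc being mounted, as it is on any normal Linux) that makes a process print its own RSS:

    // Minimal sketch: print this process's resident set size by scanning
    // /proc/self/status. The kernel reports VmRSS in kB.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0)   // line starts with "VmRSS:"
                std::cout << line << '\n';
        }
        return 0;
    }

For another process, open /proc/1234/status instead.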
You won't get a report detailing dynamic memory usage per type or per variable, because that does not make sense in general: in C, for example, a given malloc-ed memory zone could be dereferenced through casts to different types and be accessible, perhaps indirectly, through several variables. You might use malloc_stats(3) (it gives statistics about the sizes of allocated memory zones) if your program uses C dynamic memory allocation.
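For example, a minimal sketch assuming glibc (malloc_stats is a GNU extension, declared in <malloc.h>):

    // Sketch: dump glibc allocator statistics to stderr with malloc_stats(3).
    // Only memory that went through malloc/free shows up here.
    #include <cstdlib>
    #include <malloc.h>   // glibc-specific: malloc_stats()

    int main() {
        void *p = std::malloc(10 * 1024 * 1024);   // allocate 10 MiB
        malloc_stats();                            // per-arena statistics
        std::free(p);
        return 0;
    }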
You could use valgrind; it is very useful for hunting memory leaks.
You could modify the source code of your server to do some accounting, in your own way, of the heap allocations it makes, as sketched below. For example, if you code in C++, you could modify the constructors and destructors of your classes to increment and decrement some static or global counters, or provide some class-specific operator new and operator delete. Heap memory consumption is a whole-program property (and is not tied, in general, to a specific type or variable). Some programs have their own (type-specific) allocators or use arena-based or region-based allocation.
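For example, here is a small C++ sketch of such class-specific accounting; the class name PlayerCache is just a placeholder borrowed from the question, and the counter covers only the objects themselves, not any memory they point to:

    // Sketch: per-class heap accounting via class-specific operator new
    // and operator delete maintaining an atomic byte counter.
    #include <atomic>
    #include <cstdlib>
    #include <new>

    struct PlayerCache {
        static std::atomic<std::size_t> bytes_in_use;

        static void* operator new(std::size_t sz) {
            bytes_in_use += sz;                 // account before allocating
            void* p = std::malloc(sz);
            if (!p) throw std::bad_alloc{};
            return p;
        }
        static void operator delete(void* p, std::size_t sz) noexcept {
            bytes_in_use -= sz;                 // sized delete hands sz back
            std::free(p);
        }
        // ... actual members of the cache ...
    };
    std::atomic<std::size_t> PlayerCache::bytes_in_use{0};

Every new PlayerCache and delete now updates PlayerCache::bytes_in_use, which your server can report periodically.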
Read also something about garbage collection, e.g. the GC handbook, to pick up useful concepts, terminology, and techniques (e.g. tracing GC, reference counting, weak references, circular references, smart pointers, etc.). They matter even for manual memory management.
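As a tiny C++ illustration of two of those concepts, here is a weak reference breaking a reference-counting cycle:

    // Sketch: pure reference counting leaks on cycles; std::weak_ptr
    // breaks the cycle so both nodes really get destroyed.
    #include <memory>

    struct Node {
        std::shared_ptr<Node> next;   // owning (strong) reference
        std::weak_ptr<Node> prev;     // non-owning (weak) back-reference
    };

    int main() {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;   // a owns b
        b->prev = a;   // weak: does not keep a alive
        return 0;
    }  // both nodes are freed here; with two shared_ptrs they would leak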
Memory management and allocation is specific to every program, particularly in languages like C or C++ (or Rust) with manual memory management, where conventions matter a lot and differ between programs. Study the source code of various existing free-software servers (e.g. Apache, Lighttpd, PostgreSQL, Xorg, Unison, Git, ...) and you'll find that each of them has different conventions about memory management.

Related

Large physically contiguous memory area

For my M.Sc. thesis, I have to reverse-engineer the hash function Intel uses inside its CPUs to spread data among Last Level Cache slices in Sandy Bridge and newer generations. To this end, I am developing an application on Linux which needs a physically contiguous memory area for my tests. The idea is to read data from this area so that it is cached, probe whether older data has been evicted (through delay measurements or LLC miss counters) in order to find colliding memory addresses, and finally discover the hash function by comparing these colliding addresses.
The same procedure has already been used in Windows by a researcher, and proved to work.
To do this, I need to allocate an area that is large (64 MB or more) and fully cacheable, i.e. without the uncached page attributes used for DMA buffers. How can I perform this allocation?
To have full control over the allocation (i.e., for it to be really physically contiguous), my idea was to write a Linux kernel module, export a device, and mmap() it from userspace, but I do not know how to allocate that much contiguous memory inside the kernel.
I heard about the Linux Contiguous Memory Allocator (CMA), but I don't know how it works.
Applications don't see physical memory; a process has its own address space in virtual memory. Read about the MMU (what is contiguous in virtual space might not really be physically contiguous, and vice versa).
You might perhaps want to lock some memory using mlock(2)
But your application will be scheduled, and other processes (or scheduled tasks) will dirty your CPU cache. See also sched_setaffinity(2).
(and even kernel code might perhaps be preempted)
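Putting the two calls together, here is a minimal userspace sketch (compile with g++, which defines _GNU_SOURCE as needed for sched_setaffinity). Note that mlock(2) only keeps the pages resident; it does not make them physically contiguous, and it may require CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK:

    // Sketch: pin this process to CPU 0 and lock a 64 MB buffer in RAM.
    #include <sched.h>      // sched_setaffinity, CPU_ZERO, CPU_SET
    #include <sys/mman.h>   // mlock
    #include <cstdio>
    #include <cstdlib>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                            // run only on CPU 0
        if (sched_setaffinity(0, sizeof set, &set) != 0)
            std::perror("sched_setaffinity");

        const std::size_t len = 64UL * 1024 * 1024;  // 64 MB, as in the question
        void *buf = std::malloc(len);
        if (buf && mlock(buf, len) != 0)             // keep the pages resident
            std::perror("mlock");
        // ... run the cache-probing experiment over buf here ...
        return 0;
    }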
This page on Kernel Newbies has some ideas about memory allocation. But the maximum for get_free_pages looks like 8MiB (probably a compile-time constraint: the buddy allocator's MAX_ORDER).
Since this would be all-custom, you could explore the mem= boot parameter of the Linux kernel. This limits the amount of memory the kernel uses, and you can party all over the remaining physical memory without anyone knowing. Heck, if you boot up a busybox system, you could probably do mem=32M, but even mem=256M should work if you're not booting a GUI.
You will also want to look into the Offline Scheduler. It "unplugs" a CPU from Linux so you can have full control over ALL code running on it. (Some parts of this are already in the mainline kernel, and maybe all of it is.)

Why are Semaphores limited in Linux

We just ran out of semaphores on our Linux box, due to the use of too many WebSphere Message Broker instances or somesuch.
A colleague and I got to wondering why this is even limited - it's just a bit of memory, right?
I thoroughly googled and found nothing.
Anyone know why this is?
cheers
Semaphores, when being used, require frequent access with very, very low overhead.
Having an expandable system, where memory for each newly requested semaphore structure is allocated on the fly, would introduce complexity that slows down access: the kernel would first have to look up where the particular semaphore is currently stored, then fetch that memory and check the value. It is easier and faster to keep them in one compact block of fixed memory that is readily at hand.
Having them dispersed throughout memory via dynamic allocation would also make it more difficult to efficiently use memory pages that are locked (that is, not subject to being swapped out when there are high demands on memory). The use of "locked in" memory pages for kernel data is especially important for time-sensitive and/or critical kernel functions.
Having the limit be a tunable parameter (see links in the comments of original question) allows it to be increased at runtime if needed via an "expensive" reallocation and relocation of the block. But typically this is done one time at system initialization before anything much is even using semaphores.
That said, the amount of memory used by a semaphore set is rather tiny. With modern systems having many gigabytes of memory, the original default limits on the number of them might seem a bit stingy. But keep in mind that on many systems semaphores are rarely used by user-space processes, and the Linux kernel finds its way into lots of small embedded systems with rather limited memory, so setting the default limit arbitrarily high just in case seems wasteful.
The few software packages that do depend on having many semaphores available, such as the Oracle database, typically recommend increasing the system limits in their installation and/or system-tuning advice.
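To make the failure mode concrete, here is a small sketch using the System V API; when the SEMMNI/SEMMNS limits (visible in /proc/sys/kernel/sem) are exhausted, semget(2) fails with ENOSPC:

    // Sketch: create a private System V semaphore set; on hitting the
    // system-wide limits, semget fails with errno == ENOSPC.
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <cerrno>
    #include <cstdio>

    int main() {
        int id = semget(IPC_PRIVATE, 8, IPC_CREAT | 0600);  // set of 8 semaphores
        if (id == -1) {
            if (errno == ENOSPC)
                std::puts("semaphore limit reached - raise kernel.sem via sysctl");
            else
                std::perror("semget");
            return 1;
        }
        std::printf("got semaphore set id %d\n", id);
        semctl(id, 0, IPC_RMID);   // clean up the set
        return 0;
    }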

RAM analysis on Linux

I want to get the map of allocated memory in RAM running on Linux.
I am looking to find the utilization of memory at a given time, as well as the specifics of allocation in terms of user processes, kernel modules, and the kernel itself.
This is very, very hard to get right, because of shared memory, caching, and on-demand paging.
You can indeed use /proc/PID/maps, as the previous answer stated, to learn that glibc, as an example, occupies a certain range of virtual memory in a certain process, but part of that is shared between all processes mapping that library (the code section) and part isn't (the dynamic linker tables, for example). Part of this memory space might not be in RAM at all (paged out to disk), and copy-on-write semantics might mean that the memory picture at the next moment is very different.
Add to that the sophisticated use Linux makes of caching (the page and buffer caches being the major ones), parts of which can be evicted at the kernel's whim (clean I/O buffers) while some cannot (e.g. tmpfs pages), and it gets really hairy really quickly.
In short, there is no single good answer that gives a true view of what uses RAM, and for what, on a Linux system. The best answer I know of is pagemap and its related tools; read all about it here: http://lwn.net/Articles/230975/
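As a small illustration of pagemap: each virtual page of a process has one 64-bit entry in /proc/PID/pagemap, and bit 63 of that entry says whether the page is currently present in RAM (since Linux 4.0 the physical frame number bits are hidden from unprivileged readers, but the present bit remains visible). A sketch:

    // Sketch: ask /proc/self/pagemap whether the page holding an address
    // is currently resident in RAM.
    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    bool page_in_ram(const void *addr) {
        long psize = sysconf(_SC_PAGESIZE);
        uint64_t vpn = (uintptr_t)addr / psize;      // virtual page number
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return false;
        uint64_t entry = 0;                          // one 8-byte entry per page
        pread(fd, &entry, sizeof entry, vpn * sizeof entry);
        close(fd);
        return entry & (1ULL << 63);                 // bit 63: page present
    }

    int main() {
        int x = 42;                                  // lives on a resident stack page
        std::printf("stack page in RAM: %d\n", (int)page_in_ram(&x));
        return 0;
    }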
You can find it out by checking every process's memory mappings:
cat /proc/<PID>/maps
and for the overall memory state
cat /proc/meminfo

Is fork() copy-on-write a stable exposed behavior that can be used to implement read-only shared memory?

The man page for fork() states that it does not copy data pages; it maps them into the child process and sets a copy-on-write flag. Is that behavior:
consistent between flavors of Linux?
considered an implementation detail and therefore likely to change?
I'm wondering if I can use fork() as a means to get a shared read-only memory block on the cheap. If the memory is physically copied, it would be rather expensive - there's a lot of forking going on, and the data area is big enough - but I'm hoping not...
Linux running on machines without a MMU (memory management unit) will copy all process memory on fork().
However, those systems are usually very small and embedded and you probably don't have to worry about them.
Many services, such as Apache with its fork model, use the initialize-then-fork() approach to share initialized data structures.
You should be aware that if you are using languages like Perl and Python that use reference-counted variables, or C++ shared_ptrs, this model will not work: as the reference counts are adjusted up and down, the pages holding them become unshared and get copied.
This causes huge amounts of memory usage in Perl daemons like SpamAssassin that attempt to use an initialize and fork model.
Yes, you can certainly rely on it on MMU-Linux kernels; that is almost every machine.
However, the page size isn't the same everywhere.
It is possible to explicitly make a shared memory area for forked processes, by using mmap() to create an anonymous map - one which is not backed by a physical file. On fork, this area will always remain shared (provided the child doesn't unmap it, or map something else in at the same address). You can mprotect() it to be read-only if you want.
Memory allocated with (for example) malloc can easily end up sharing a page with something that isn't read-only, which means it gets copied anyway when that other structure is modified. This includes internal bookkeeping structures used by the malloc implementation. So you might want to mmap a specific area for this purpose and allocate from that.
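A minimal sketch of that technique (error handling mostly omitted):

    // Sketch: an anonymous MAP_SHARED mapping stays shared across fork(),
    // so parent and child see the very same physical pages - no CoW copies.
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const size_t len = 4096;
        int *shared = (int *)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) return 1;
        *shared = 0;
        if (fork() == 0) {        // child
            *shared = 42;         // visible to the parent: same pages
            _exit(0);
        }
        wait(nullptr);
        std::printf("parent sees %d\n", *shared);   // prints 42
        // mprotect(shared, len, PROT_READ);        // optionally make it read-only
        return 0;
    }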
Can you rely on all Linux flavors doing it this way? No. But you can rely on the fact that any that don't will be using an even faster method.
Therefore you should use the feature, rely on it, and revisit your decision if you ever hit a performance problem.
The success of this approach depends on how well you stick to your self-imposed "read-only" limitation. Both parent and child have to obey this stricture, else the memory gets copied.
This may not be the catastrophe you're envisioning, however. The kernel can copy as little as a single page (typically 4 KB) to implement CoW semantics. A typical Linux server will use something more complex, some sort of slab allocator, so the copied region could be much larger.
The main point is that this is decoupled from your program's conception of its memory use. If you malloc() 1 GB of RAM, fork off a child, and the child changes just the first byte of that memory block, the entire 1 GB block isn't copied. Perhaps as little as one page is copied, up to the slab size containing that first byte.
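If you want to see this for yourself, here is a rough sketch: the child dirties a single byte of a 1 GiB private mapping and then sleeps, so you can watch its RSS (with ps or top) and confirm that almost nothing was copied:

    // Sketch: demonstrate CoW granularity after fork().
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstring>

    int main() {
        const size_t len = 1UL << 30;   // 1 GiB, as in the example above
        char *big = (char *)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (big == MAP_FAILED) return 1;
        std::memset(big, 1, len);       // parent touches every page
        if (fork() == 0) {              // child shares all pages CoW
            big[0] = 2;                 // copies just one page, not 1 GiB
            sleep(30);                  // time to inspect the child's RSS
            _exit(0);
        }
        wait(nullptr);
        return 0;
    }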
Yes
All the Linux distros use the same kernel, albeit with slightly different versions and releases of it.
It's unlikely that another underlying fork(2) implementation will be faster any time soon, so it's a safe bet that copy-on-write will continue to be the mechanism. Perhaps it won't be forever, but for years, definitely.
Certainly some major software systems (for example, Phusion Passenger) use fork(2) in the same way that you want to, so you would not be the only one taking advantage of CoW.

Dynamic memory management under Linux

I know that under Windows, there are API functions like GlobalAlloc() and such, which allocate memory and return a handle; this handle can then be locked to obtain a pointer, and unlocked again. While unlocked, the system can move the piece of memory around when it runs low on space, optimising memory usage.
My question is: is there something similar under Linux, and if not, how does Linux optimize its memory usage?
Those Windows functions come from a time when all programs ran in the same address space in real mode. Linux, and modern versions of Windows, run programs in separate address spaces, so the OS can move them about in RAM by remapping which physical address a particular virtual address resolves to in the page tables. There is no need to burden the programmer with such low-level details.
Even on Windows, it's no longer necessary to use such functions except when interacting with a small number of old APIs. I believe Raymond Chen's blog and book have some discussions of the topic if you are interested in more detail. E.g., here's part 4 of a series on the history of GlobalLock.
Not sure what the Linux equivalent is, but AT&T UNIX had "scatter/gather" memory-management functions in the memory manager of the core OS. In a virtual memory environment there are no absolute addresses, so applications have no equivalent function. The executable loader (which loads an executable file into memory, where it becomes a process) obtains addresses from the memory manager, all tracked as virtual memory blocks in its page table (which contains the physical memory addresses). The bottom line is that your application's physical memory layout is almost certainly not linear and never directly accessible.