I'm trying to configure a lightweight, full-featured JavaScript engine so that I can run tens of thousands of independent contexts simultaneously. Each context does very little (mostly event processing, light string manipulation, custom timers, etc.) and doesn't require much heap storage, but it needs to be independent from the others. Using Duktape, if I allocate 20,000 contexts on x64, I see upwards of 1.6GB of memory used before doing much processing, or about 80KB each. As another data point, with SpiderMonkey 1.7.0, 20,000 contexts run me about 1.4GB, or about 70KB... nearly the same. I've played with several of the optimizations Duktape has to offer, but they don't seem to affect this usage.
So the question is: is there a way to get the memory utilization down to the 4KB (or less) range per context?
Note: yes, I know SpiderMonkey 1.7.0 isn't really full-featured, but it is for the sake of what I'm trying to do, and it doesn't have the JIT complexity that I don't want and don't need from later engines, V8, etc. Hence the look at Duktape as an alternative.
Thanks!
The minimum startup cost for a new global environment is almost entirely caused by built-in objects and their properties: there are roughly 70 built-in objects with 250 function properties and 90 value properties. You can reduce this by deleting unnecessary built-ins and/or built-in properties.
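For illustration, here is a rough sketch (not taken from the Duktape docs; the property names are just examples) of deleting unused global bindings right after creating each context. Whether the underlying objects actually become collectable depends on whether Duktape still references them internally, so measure the effect:

    #include "duktape.h"

    /* Sketch: create a context and delete global bindings the scripts never use.
     * The property names are examples only. */
    static duk_context *create_slim_context(void) {
        duk_context *ctx = duk_create_heap_default();
        if (ctx == NULL) {
            return NULL;
        }

        duk_push_global_object(ctx);
        duk_del_prop_string(ctx, -1, "RegExp");  /* example: scripts don't use regexps */
        duk_del_prop_string(ctx, -1, "JSON");    /* example: scripts don't use JSON */
        duk_pop(ctx);

        return ctx;
    }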
One thing you can do is enable DUK_OPT_LIGHTFUNC_BUILTINS, which replaces most built-in functions with a more lightweight function representation, reducing built-in object count. This has some side effects such as built-in functions having a less readable autogenerated "name" property.
If your contexts are small, another thing you can do is use "pointer compression", which causes Duktape to represent heap pointers as 16-bit values. You need to provide macros to encode/decode pointers to/from this representation. This approach only works if the maximum size of an individual Duktape heap is ~256kB (assuming align-by-4 allocations). The feature was developed for low-memory embedded 32-bit platforms, though, so it may not work ideally on 64-bit environments (the master branch has a few fixes for pointer compression on 64-bit platforms, so use master if you try this).
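A minimal sketch of what the encode/decode pair can look like, assuming every allocation of a given heap comes from one fixed pool of at most ~256kB; the DUK_USE_HEAPPTR16 / ENC16 / DEC16 macro names follow Duktape's low-memory configuration, but verify them against the duk_config.h of the version you use:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: compress heap pointers into 16-bit offsets within a single pool.
     * duk_pool_base is a hypothetical symbol pointing at the start of that pool;
     * offsets are divided by 4 because allocations are 4-byte aligned. */
    extern uint8_t *duk_pool_base;

    static uint16_t my_enc16(void *udata, void *p) {
        (void) udata;
        if (p == NULL) {
            return 0;  /* 0 is reserved for NULL, so never hand out the pool's first bytes */
        }
        return (uint16_t) (((size_t) ((uint8_t *) p - duk_pool_base)) >> 2);
    }

    static void *my_dec16(void *udata, uint16_t x) {
        (void) udata;
        if (x == 0) {
            return NULL;
        }
        return (void *) (duk_pool_base + (((size_t) x) << 2));
    }

    /* Wired into the Duktape configuration: */
    #define DUK_USE_HEAPPTR16
    #define DUK_USE_HEAPPTR_ENC16(udata,p)  my_enc16((udata), (p))
    #define DUK_USE_HEAPPTR_DEC16(udata,x)  my_dec16((udata), (x))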
Reaching 4 kB/context won't be possible with any of these measures - there are simply too many built-in objects and properties for that. Getting to that memory amount per context would only be possible if you shared global objects for multiple scripts, which may or may not be possible depending on the isolation and threading needs of your scripts.
As a quick update on this question: Duktape 1.5.0 will have a config option to place built-in strings and objects into ROM (read only data section): https://github.com/svaarala/duktape/pull/559. The same read only strings and objects will be shared across all Duktape heaps and contexts without using any RAM. Once the feature is finalized it'll also be possible to put your own strings and builtins into the read only data section so that they won't consume any RAM per heap/context.
With this change it's possible to reach ~4 kB startup RAM usage on a 32-bit target and ~8 kB on a 64-bit target.
I observe that each ffmpeg instance doing audio decoding takes about 50 MB of memory. If I record 100 stations, that's 5 GB of RAM.
Now, they all more or less use the same amount of RAM; I suspect they contain the same information over and over again because they are spawned as new processes rather than forked.
Is there a way to avoid this duplication?
I am using Ubuntu 20.04, x64.
Now, they all more or less use the same amount of RAM; I suspect they contain the same information over and over again because they are spawned as new processes rather than forked.
Have you considered that the processes may use about the same amount of RAM because they are performing roughly the same computation, with similar parameters?
Have you considered that whatever means you are using to compute memory usage may be insensitive to whether the memory used is attributed uniquely to the process vs. already being shared with other processes?
Is there a way to avoid this duplication?
Programs that rely on shared libraries already share those libraries' executable code among them, saving memory.
Of course, each program does need its own copy of any writable data belonging to the library, some of which may turn out to be unused by a particular client program, and programs typically have memory requirements separate from those of any libraries they use, too. Whatever amount of that 50 MB per process is in fact additive across processes is going to come from these sources. Possibly you could reduce the memory load by changing program parameters (or by changing programs), but there's no special way to run the same number of instances of the program you're running now, with the same options and inputs, that reduces the amount of memory they use.
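One way to see how much of the reported 50 MB is really private to each process is to compare RSS with PSS (proportional set size), which splits shared pages among the processes sharing them. A minimal sketch, assuming /proc/<pid>/smaps_rollup is available (it is on Ubuntu 20.04):

    /* Sketch: print the Rss and Pss lines for a given pid.
     * Build: cc -o pss pss.c ; run: ./pss <pid> */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        char path[64];
        char line[256];
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/smaps_rollup", argv[1]);
        f = fopen(path, "r");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "Rss:", 4) == 0 || strncmp(line, "Pss:", 4) == 0) {
                fputs(line, stdout);  /* e.g. "Pss:   12345 kB" */
            }
        }
        fclose(f);
        return 0;
    }

If the PSS summed across all 100 ffmpeg processes is far below 100 times the RSS of one of them, most of the apparent duplication is shared pages being counted once per process.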
Specifically, in node-opencv, OpenCV Matrix objects are represented as a JavaScript object wrapping a C++ OpenCV Matrix.
However, if you don't .release() them manually, the V8 engine does not seem to know how big they are, and the NodeJS memory footprint can grow far beyond any limits you try to set on the command line; i.e. it only seems to run the GC when it approaches the set memory limits, but because it does not see the objects as large, this does not happen until it's too late.
Is there something we can add to the objects which will allow V8 to see them as large objects?
Illustrating this, you can create and 'forget' large 1MB buffers all day on a NodeJS process set to limit its memory to 256MB.
But if you do the same with 1MB OpenCV Matrices, NodeJS will quickly use much more than the 256MB limit - unless you either run GC manually or release the Matrices manually.
(Caveat: a C++ OpenCV matrix is a reference to memory; i.e. more than one Matrix object can point to the same data - but it would be a start to have V8 see ALL references to the same memory as being the size of that memory for the purposes of GC; that's safer than seeing them all as very small.)
Circumstances: on an RPi3, we have a limited memory footprint, and processing live video (which uses about 4MB of Mat objects per frame) can soon exhaust all memory.
Also, the environment I'm working in (a Node-RED node) is designed for 'public' use, so it is difficult to ensure that all users completely understand the need to manually .release() images; hence this question is about how to bring this large data under the GC's control.
You can inform V8 about your external memory usage with AdjustAmountOfExternalAllocatedMemory(int64_t delta). There are wrappers for this function in N-API and NAN.
By the way, "large objects" has a special meaning in V8: objects large enough to be created in large object space and never moved. External memory is off-heap memory, which I think is what you're referring to.
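As a minimal sketch of the N-API route (napi_adjust_external_memory); the native_matrix struct and its sizes are placeholders, not node-opencv's actual types, and error handling is omitted:

    #include <stdlib.h>
    #include <stdint.h>
    #include <node_api.h>

    /* Placeholder for the native object a JS wrapper owns. */
    typedef struct {
        void  *data;   /* native pixel buffer */
        size_t size;   /* size of that buffer in bytes */
    } native_matrix;

    /* Finalizer: runs when the JS wrapper is garbage collected. */
    static void matrix_finalize(napi_env env, void *finalize_data, void *hint) {
        native_matrix *m = (native_matrix *) finalize_data;
        int64_t adjusted;
        (void) hint;
        /* Tell V8 the external memory is gone. */
        napi_adjust_external_memory(env, -(int64_t) m->size, &adjusted);
        free(m->data);
        free(m);
    }

    /* Wrap a native matrix and credit its size to V8 so GC pressure reflects it. */
    static napi_value wrap_matrix(napi_env env, native_matrix *m) {
        napi_value js_obj;
        int64_t adjusted;
        napi_create_object(env, &js_obj);
        napi_wrap(env, js_obj, m, matrix_finalize, NULL, NULL);
        napi_adjust_external_memory(env, (int64_t) m->size, &adjusted);
        return js_obj;
    }

With the external size registered, V8 will trigger GC sooner instead of letting the native allocations grow far past the configured heap limit unnoticed.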
I would like to know what techniques I should look into for optimizing a given algorithm for a given architecture. How do I improve performance through better use of the caches? How do I reduce cache-coherency overhead, and what access patterns should I avoid in my algorithm/program so that cache coherency doesn't impact my performance?
I understand a few standard techniques for reusing recently cached data in L1, but how would I effectively use data in a shared cache (say L2) on a multi-core, and thereby avoid a main-memory access, which is even more costly?
Basically, I am interested in what data access patterns I should try to exploit or avoid for a better mapping to my given architecture: what data structures I could use, and in what scenarios, for architectures with different levels of private and shared cache, to improve performance. Thanks.
What techniques should I look into for optimizing a given algorithm for a given architecture?
Micro-architectures vary, so learn the details of your specific processor. Intel provides good documentation in their optimization guide. If you are using an Intel processor you'll want to read sections 8.3 and 8.6:
8.3 OPTIMIZATION GUIDELINES
This section summarizes optimization guidelines for tuning multithreaded applications. Five areas are listed (in order of importance):
Thread synchronization
Bus utilization
Memory optimization
Front end optimization
Execution resource optimization
Practices associated with each area are listed in this section. Guidelines for each area are discussed in greater depth in sections that follow. Most of the coding recommendations improve performance scaling with processor cores; and scaling due to HT Technology. Techniques that apply to only one environment are noted.
8.6 MEMORY OPTIMIZATION
Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
Cache blocking
Shared memory optimization
Eliminating 64-KByte aliased data accesses
Preventing excessive evictions in first-level cache
What data access patterns should I try to exploit or avoid for a better mapping to my given architecture?
Exploit
When the cache is full and an access misses, the cache must evict something to make room for the new data or code; what is evicted is usually chosen by an approximation of least-recently used (LRU). If possible, your code should have strong locality of reference:
Try to pack data that is used close in time in the algorithm such that it is close in space (address)
Pack data tightly, don't use a 64-bit integer when a 32-bit integer will do, for example
Sometimes the alignment of an "object" (related data) relative to a cache line matters. For example, if there is an array of 64-byte objects and they are accessed randomly, then aligning them on a 64-byte boundary improves cache efficiency by not bringing in data that is not used. If an object isn't aligned, every object touched brings in two cache lines, but only 64 bytes are needed, so 50% of the data transferred isn't used (this assumes cache lines are 64 bytes).
As #PaulA.Clayton pointed out in the comments, pre-fetching data is very important, as it hides part or all of the memory latency. "Also, exploiting stride-based hardware prefetching can be quite beneficial. (Software prefetching can also be useful in some cases.) Getting pointers early helps increase memory-level parallelism."
In order to facilitate the hardware pre-fetcher and to increase the utilization of the data that is brought into the cache pay careful attention to how matrices and other large structures are stored and accessed... see Wikipedia article on row-major order.
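For example, a minimal sketch in C (row-major storage), where only the loop order differs:

    #include <stddef.h>

    #define N 1024

    /* Cache-friendly: the inner loop walks consecutive addresses within a row,
     * so each cache line brought in is fully used and the prefetcher sees a
     * simple stride. */
    double sum_row_major(double a[N][N]) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++) {
            for (size_t j = 0; j < N; j++) {
                s += a[i][j];
            }
        }
        return s;
    }

    /* Cache-hostile: consecutive accesses are N*sizeof(double) bytes apart,
     * so each access touches a different cache line and uses only 8 bytes of it. */
    double sum_col_major(double a[N][N]) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++) {
            for (size_t i = 0; i < N; i++) {
                s += a[i][j];
            }
        }
        return s;
    }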
Avoid
Data that you don't use often shouldn't be close to data that you use frequently
Avoid false sharing. If two or more threads access the same cache line but are not sharing the same data within the cache line, and at least one of them is a writer, you have false sharing: there will be unnecessary coherency traffic and a latency hit as the line bounces between cores (see the padded-counter sketch after this list).
Try not to use new data until you are done with the older data
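A minimal sketch of avoiding false sharing with per-thread counters, assuming 64-byte cache lines:

    #include <stdalign.h>
    #include <stdint.h>

    #define NUM_THREADS 4
    #define CACHE_LINE  64

    /* Bad: all counters share one cache line, so an increment by any thread
     * invalidates the line in every other core's cache. */
    uint64_t counters_shared[NUM_THREADS];

    /* Better: each counter is padded out to its own cache line. */
    typedef struct {
        alignas(CACHE_LINE) uint64_t value;  /* struct size rounds up to 64 bytes */
    } padded_counter;

    padded_counter counters_padded[NUM_THREADS];

    /* Each thread increments only its own slot; with the padded layout there is
     * no coherency traffic between threads. */
    void bump(int thread_id) {
        counters_padded[thread_id].value++;
    }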
Measure
As Andrei Alexandrescu said in this talk - when it comes to performance tuning the only intuition that is right is "I should measure this." Familiarize yourself with cache performance monitoring tools, for example:
perf
Cachegrind
The key principle is locality: when you have the choice, process nearby data first (avoid sparse accesses), and perform data reuse as soon as possible (regroup successive passes over the same data).
For multithreaded programs, the principle is separate locality: ensure that the threads work on disjoint data sets (use distinct copies if necessary/possible).
Unless you have very good reasons to do so, stay away from the peculiarities of the hardware.
It should be mentioned that code is cached in the same way as data. Small, dense code with a lot of inlining and few jumps/calls puts less strain on the L1 instruction cache and, ultimately, on L2, L3 and RAM, where collisions with data fetches occur.
If you are using hyperthreading, there appears to be evidence that running two hyperthreads in a core at a lower optimization level (-O1) can get more total work done than a single, highly optimized (-O2 and higher) thread.
We just ran out of semaphores on our Linux box, due to the use of too many WebSphere Message Broker instances or some such.
A colleague and I got to wondering why this is even limited - it's just a bit of memory, right?
I thoroughly googled and found nothing.
Anyone know why this is?
cheers
Semaphores, when being used, require frequent access with very, very low overhead.
Having an expandable system where memory for each newly requested semaphore structure is allocated on the fly would introduce complexity that slows down access: the kernel would first have to look up where the particular semaphore in question is stored, then fetch that memory and check the value. It is easier and faster to keep them in one compact block of fixed memory that is readily at hand.
Having them dispersed throughout memory via dynamic allocation would also make it more difficult to efficiently use memory pages that are locked (that is, not subject to being swapped out when there are high demands on memory). The use of "locked in" memory pages for kernel data is especially important for time-sensitive and/or critical kernel functions.
Having the limit be a tunable parameter (see links in the comments of original question) allows it to be increased at runtime if needed via an "expensive" reallocation and relocation of the block. But typically this is done one time at system initialization before anything much is even using semaphores.
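For reference, a minimal sketch that reads the four SysV semaphore limits the kernel exposes through the kernel.sem sysctl (SEMMSL, SEMMNS, SEMOPM, SEMMNI):

    #include <stdio.h>

    /* Sketch: print the current SysV semaphore limits from /proc/sys/kernel/sem. */
    int main(void) {
        unsigned long semmsl, semmns, semopm, semmni;
        FILE *f = fopen("/proc/sys/kernel/sem", "r");

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%lu %lu %lu %lu", &semmsl, &semmns, &semopm, &semmni) == 4) {
            printf("SEMMSL (max semaphores per set):     %lu\n", semmsl);
            printf("SEMMNS (max semaphores system-wide): %lu\n", semmns);
            printf("SEMOPM (max ops per semop() call):   %lu\n", semopm);
            printf("SEMMNI (max semaphore sets):         %lu\n", semmni);
        }
        fclose(f);
        return 0;
    }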
That said, the amount of memory used by a semaphore set is rather tiny. With modern systems having many gigabytes of memory, the original default limits on the number of them might seem a bit stingy. But keep in mind that on many systems semaphores are rarely used by user-space processes, and the Linux kernel finds its way into lots of small embedded systems with rather limited memory, so setting the default limit arbitrarily high just in case seems wasteful.
The few software packages that do depend on having many semaphores available, such as the Oracle database, typically recommend increasing the system limits in their installation and/or system tuning advice.
The GLib docs recommend use of the GLib Slice Allocator over malloc:
"For newly written code it is recommended to use the new g_slice API instead of g_malloc() and friends, as long as objects are not resized during their lifetime and the object size used at allocation time is still available when freeing."
-- http://developer.gnome.org/glib/unstable/glib-Memory-Slices.html
But in practice, is g_slice enough faster than Windows/Linux malloc (fast enough to warrant the extra trouble of handling sizes and GLib's preprocessor hacks like g_slice_new)? I'm planning to use GLib in my C++ program to handle INI-ish configuration (GKeyFile) and to get access to data structures not available in C++, like GHashTable, so the GLib dependency doesn't matter anyway.
Whether it's enough faster to be worth it sort of depends on your app. But it should be faster.
There is another issue besides speed, which is memory fragmentation and per-block overhead. GSlice leaves malloc to deal with large or variable-size allocations while handling small known-size objects more space-efficiently.
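For concreteness, a minimal sketch of the intended usage (the Node struct is just an example); note that the type, and therefore the size, must be known both at allocation and at free time:

    #include <glib.h>

    /* Example fixed-size object allocated from the slice allocator. */
    typedef struct {
        int    key;
        double value;
    } Node;

    int main(void) {
        /* g_slice_new() allocates sizeof(Node) from a slab of same-sized blocks. */
        Node *n = g_slice_new(Node);
        n->key = 42;
        n->value = 3.14;

        /* ... use the node ... */

        /* Freeing with the same type lets GSlice return the block to its slab
         * for reuse instead of handing it straight back to malloc. */
        g_slice_free(Node, n);
        return 0;
    }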
The slice API borrows heavily from research conducted at Sun Microsystems in the early 1990s, where it was called slab allocation. I could not find the original research paper, but there is a Wikipedia page about it, or you can just google "slab allocation".
Essentially it eliminates expensive allocation/deallocation operations by facilitating reuse of memory blocks. It also reduces or eliminates memory fragmentation. So it is not all about speed, even though it should improve it as well.
Whether you should use it or not - it depends... Look at Havoc's answer; he summarized it pretty well.
Update 1:
Note that modern Linux kernels include the SLAB allocator as one of the options, and it is often the default. So the difference between g_slice() and malloc() may be unnoticeable in that case. However, the purpose of GLib is cross-platform compatibility, so using the slice API may somewhat guarantee consistent performance across different platforms.
Update 2:
As a commenter pointed out, my first update is incorrect: SLAB allocation is used inside the kernel, but user-space malloc() uses an unrelated mechanism, so the claim that malloc() is equivalent to g_slice() on Linux is invalid. Also see this answer for more details.