How does Node.js/V8 know how big a native object is?

Specifically, in node-opencv, OpenCV Matrix objects are represented as JavaScript objects wrapping a C++ OpenCV Matrix.
However, if you don't .release() them manually, the V8 engine does not seem to know how big they are, and the Node.js memory footprint can grow far beyond any limits you set on the command line. That is, V8 only seems to run the GC when it approaches the configured memory limits, but because it does not see these objects as large, that does not happen until it is too late.
Is there something we can add to the objects which will allow V8 to see them as large objects?
Illustrating this, you can create and 'forget' large 1 MB buffers all day in a Node.js process whose memory is limited to 256 MB.
But if you do the same with 1 MB OpenCV Matrices, Node.js will quickly use much more than the 256 MB limit - unless you either run the GC manually or release the Matrices manually.
(Caveat: a C++ OpenCV Matrix is a reference to memory, i.e. more than one Matrix object can point to the same data - but it would be a start to have V8 see ALL references to the same memory as being the size of that memory for GC purposes; that is safer than seeing them all as very small.)
Circumstances: on an RPi3 we have a limited memory footprint, and processing live video (using about 4 MB of Mat objects per frame) can soon exhaust all memory.
Also, the environment I'm working in (a Node-RED node) is designed for 'public' use, so it is difficult to ensure that all users fully understand the need to manually .release() images; hence this question is about how to bring this large data under the GC's control.

You can inform V8 about your external memory usage with AdjustAmountOfExternalAllocatedMemory(int64_t delta). There are wrappers for this function in N-API and NAN.
By the way, "large objects" has a special meaning in V8: objects large enough to be created in large-object space and never moved. External memory is off-heap memory, which I think is what you're referring to.

Related

What memory leaks can occur outside the view of GHC's heap profiler

I have a program that exhibits the behavior of a memory leak. It gradually takes up all of the system's memory until it fills all swap space, and then the operating system kills it. This happens once every several days.
I have extensively profiled the heap in a number of ways (-hy, -hm, -hc), tried limiting the heap size (-M128M), and tweaked the number of generations (-G1), but no matter what I do the heap size always appears roughly constant and low (measured in kB, not MB or GB). Yet when I observe the program in htop, its resident memory steadily climbs.
What this indicates to me is that the memory leak is coming from somewhere other than the GHC heap. My program makes use of dependencies, in particular Haskell's yaml library, which wraps the C library libyaml; it is possible that the leak is in the foreign pointers it holds to objects allocated by libyaml.
My question is threefold:
What places besides the GHC heap can memory leak from in a Haskell program?
What tools can I use to track these down?
What changes to my source code need to be made to avoid these types of leaks, as they seem to differ from the more commonly experienced space leaks in Haskell?
This certainly sounds like foreign pointers aren't being finalized properly. There are several possible reasons for this:
The underlying C library doesn't free memory properly.
The Haskell library doesn't set up finalization properly.
The ForeignPtr objects aren't being freed.
I think there's actually a decent chance that it's option 3. If the RTS consistently finds enough memory in the first GC generation, then it just won't bother running a major collection. Fortunately, this is the easiest to diagnose. Just have your program run System.Mem.performGC every so often. If that fixes it, you've found the bug and can tweak just how often you want to do that.
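A minimal sketch of that workaround (the 30-second interval is an arbitrary choice):

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)
import System.Mem (performGC)

-- Periodically force a major GC so that unreachable ForeignPtrs actually
-- get finalized instead of lingering until the RTS decides on its own
-- that a major collection is worthwhile.
startPeriodicGC :: IO ()
startPeriodicGC = do
  _ <- forkIO . forever $ do
         threadDelay (30 * 1000000)  -- 30 seconds (argument is microseconds)
         performGC
  return ()
```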
Another possible issue is that you could have foreign pointers lying around in long-lived thunks or other closures. Make sure you don't.
One particularly strong possibility when working with a wrapped C library is that the wrapper functions will return ByteStrings whose underlying arrays were allocated by C code. So any ByteStrings you get back from yaml could potentially be off-heap.
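If that turns out to be the culprit, one possible mitigation (sketched here; whether it helps depends on how the wrapper allocates its buffers) is to copy such ByteStrings so your program does not keep referencing the original buffer:

```haskell
import qualified Data.ByteString as B

-- Sketch: Data.ByteString.copy allocates a fresh buffer and copies the
-- payload into it, so holding on to the result does not keep the
-- original (possibly C-allocated) buffer alive.
detach :: B.ByteString -> B.ByteString
detach = B.copy
```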

Does RAM affect the time taken to sort an array?

I have an array of 500k to a million items to be sorted. Would a configuration with more RAM be beneficial or not, say going from 8 GB to 32 GB or above? I'm using a Node.js/MongoDB environment.
Adding RAM for an operation like that would only make a difference if you have filled up the available memory with everything that was running on your computer and the OS was swapping data out to disk to make room for your sort operation. Chances are, if that was happening, you would know because your computer would become pretty sluggish.
So, you just need enough memory for the working set of whatever applications you're running and then enough memory to hold the data you are sorting. Adding additional memory beyond that will not make any difference.
If you had an array of a million numbers to be sorted in Javascript, that array would likely take (1,000,000 * 8 bytes per number) + some overhead for a JS data structure = ~8MB. If your array values were larger than 8 bytes, then you'd have to account for that in the calculation, but hopefully you can see that this isn't a ton of memory in a modern computer.
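If you want to sanity-check that estimate yourself, a rough sketch in Node.js might look like this (the exact numbers you see will vary with the V8 version and element representation):

```js
// Rough sanity check: heap growth for a million-number array.
const n = 1_000_000;
const before = process.memoryUsage().heapUsed;

const arr = new Array(n);
for (let i = 0; i < n; i++) arr[i] = Math.random();
arr.sort((a, b) => a - b);

const after = process.memoryUsage().heapUsed;
console.log(`approx heap growth: ${((after - before) / 1048576).toFixed(1)} MB`);
```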
If you have only an 8GB system and you have a lot of services and other things configured in it and are perhaps running a few other applications at the time, then it's possible that by the time you run nodejs, you don't have much free memory. You should be able to look at some system diagnostics to see how much free memory you have. As long as you have some free memory and are not causing the system to do disk swapping, adding more memory will not increase performance of the sort.
Now, if the data is stored in a database and you're doing some major database operation (such as creating a new index), then it's possible that the database may adjust how much memory it can use based on how much memory is available and it might be able to go faster by using more RAM. But, for a Javascript array which is already all in memory and is using a fixed algorithm for the sort, this would not be the case.

GC taking 32% of runtime expected?

I'm currently working on optimizing a library for speed. I've already reduced execution time drastically, using V8 CPU and memory profiling through Webstorm. This was achieved mainly by changing the core method from recursive to iterative.
Now the self-time distribution breaks down as shown in the profiler screenshot (not reproduced here).
I'm assuming the first entry "node" is timing internal function calls, which is great. The other entries also make sense. I'm new to Node.js profiling, but 31.6% for GC seems high, so I've decided to investigate.
I've now created a heap dump through Webstorm, but unfortunately that doesn't give me much information.
These seem to be system internal memory references mainly. Stepping through the core iteration code logic again, there also don't seem to be a lot of places where memory is explicitly allocated (using this as a reference).
Question
Can the GC overhead be reduced?
Is this amount of allocation just expected here?
Is it possible to get better memory profiling information?
Setup Instructions
In case someone wants to try debugging this, I'm including setup instructions.
Download or clone object-scan and run
yarn install --frozen-lockfile
yarn run test-simple --verbose
Now create a file test.js in the project root containing this content and run node --trace_gc test.js or run it through Webstorm for advanced profiling.
In JavaScript, and in V8 (Node.js) in particular, the amount of time spent on garbage collection depends on the amount of data stored in the heap, but that's only one of many factors.
In the V8 engine there are two main "types" of GC: minor (scavenge) and major (mark-sweep/mark-compact). You can see which GC types happen during your tests in the console with --trace-gc enabled. In different cases one type can "eat" more time than the other, and vice versa. So before optimizing you should determine which GC takes more time.
There are not a lot of options for optimizing major GC, because it is mostly affected by the amount of data that stays in memory for a "long" period (in this case "long" means that an object survives a scavenge GC). Such data is stored in the so-called "old space" of the heap. Major GC works on this space: it has to scan all of that memory and mark objects that no longer have any references so they can be cleared.
In your case the test data you load goes to old space. As a result it affects major GC during the whole test, and major GC will not clear much here, because you are still using your test object, but it still takes time to scan the entire old space. So you may consider preventing V8 from doing that by launching node with GC-specific flags such as --nouse-idle-notification --expose-gc --gc_interval=100500 (where 100500 is an allocation count; it can be set to a high value so that GC does not run before the whole test has finished), which lets you trigger garbage collection manually. Test your code using this approach and see how major GC affects it, and try tests with different amounts of data passed to the function. If the impact is quite high, you may try to refactor your code to minimize long-lived variables, closures, etc.
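For example, with --expose-gc you can trigger a collection yourself at a known point instead of letting it happen in the middle of the measured code (sketch; the function being timed is a placeholder):

```js
// Run with: node --expose-gc --trace-gc test.js
function measure(label, fn) {
  if (typeof global.gc === 'function') {
    global.gc();                      // collect *before* the measured section
  }
  const start = process.hrtime.bigint();
  fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)} ms`);
}
```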
If you discover that major GC doesn't have much impact on performance, then scavenge GC is taking most of the time. Unlike major GC, it operates on the so-called "new space" of the heap. This is the space where all new objects are stored. If those objects survive a scavenge, they are moved to old space. New space is much smaller than old space (you can control its size with --max_semi_space_size; note: new space size = 2 * semi space size), and the more new objects and variables you allocate, the more scavenge GC runs will happen. If this GC hurts performance too much, you may consider refactoring your code to make fewer new allocations. But if you reuse variables, that may also slow things down: those objects will go to old space and may become the problem described in the "major GC" section above.
Also, the V8 GC doesn't always work in the same thread that your program runs in. It does some work in the background too, but I don't know what Webstorm shows in your case. If it counts just the total time spent in GC, maybe it just doesn't have that much impact.
You may find more details on v8 GC in this blog post.
TL;DR:
Can the GC overhead be reduced?
Yes, but first you should discover what should be optimized by following steps above.
Is this amount of allocation just expected here?
That can only be discovered by comparing different approaches. There's no absolute number that separates a "good" amount from a "bad" one, because it depends on lots of factors, including the amount of input data.
Is it possible to get better memory profiling information?
You may find some good tools here, but in general you can use the Chrome dev tools, which provide a bit more detail than Webstorm does.

Reduced memory options for lots of Duktape contexts

I'm trying to configure a light-weight full-featured JavaScript engine such that I can have tens of thousands of independent contexts simultaneously. Each context is doing very little (mostly event processing, light string manipulation, custom timers, etc.) and doesn't require much heap storage, but needs to be independent from the others. Using Duktape, if I allocate 20,000 contexts in x64, I get upwards of 1.6GB of memory utilized before doing much processing, or about 80KB each. As another data point, if I use SpiderMonkey 1.7.0, 20,000 runs me about 1.4GB or about 70KB... nearly the same. I've played with several of the optimizations Duktape has to offer but it doesn't seem to impact this usage.
So the question is, is there a way to get the per-context memory utilization down to the 4KB (or less) range per context?
Note: yes I know SpiderMonkey 1.7.0 isn't really full-featured, but it is for the sake of what I'm trying to do and doesn't have the JIT complexity that I don't want and don't need from later engines, V8, etc. Hence the look at Duktape as an alternative.
Thanks!
The minimum startup cost for a new global environment is almost entirely caused by built-in objects and their properties: there are roughly 70 built-in objects with 250 function properties and 90 value properties. You can reduce this by deleting unnecessary built-ins and/or built-in properties.
One thing you can do is enable DUK_OPT_LIGHTFUNC_BUILTINS, which replaces most built-in functions with a more lightweight function representation, reducing built-in object count. This has some side effects such as built-in functions having a less readable autogenerated "name" property.
If your contexts are small, another thing you can do is to use "pointer compression" which causes Duktape to represent heap pointers with 16-bit values. You need to provide macros to encode/decode pointers to/from this representation. This approach only works if the maximum size of an individual Duktape heap is ~256kB (assuming align-by-4 allocations). The feature was developed for embedded low memory 32-bit platforms though, so it may not work ideally on 64-bit environments (master branch has a few fixes for pointer compression and 64-bit platforms, so use master if you try this).
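The exact hook/macro names depend on the Duktape version (they are documented in doc/low-memory.rst), so the following is only an illustration of what such encode/decode helpers do, assuming a single 4-byte-aligned pool of at most 256 kB:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustration only: map heap pointers to 16-bit values within one pool.
 * The first 4 bytes of the pool are reserved so that the value 0 can
 * represent NULL; the allocator must never hand them out. */
static char heap_pool[256 * 1024];

static uint16_t ptr_enc16(void *p) {
  if (p == NULL) return 0;
  return (uint16_t) (((char *) p - heap_pool) >> 2);   /* align-by-4 */
}

static void *ptr_dec16(uint16_t v) {
  if (v == 0) return NULL;
  return (void *) (heap_pool + ((size_t) v << 2));
}
```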
Reaching 4 kB/context won't be possible with any of these measures - there are simply too many built-in objects and properties for that. Getting to that memory amount per context would only be possible if you shared global objects for multiple scripts, which may or may not be possible depending on the isolation and threading needs of your scripts.
As a quick update on this question: Duktape 1.5.0 will have a config option to place built-in strings and objects into ROM (read only data section): https://github.com/svaarala/duktape/pull/559. The same read only strings and objects will be shared across all Duktape heaps and contexts without using any RAM. Once the feature is finalized it'll also be possible to put your own strings and builtins into the read only data section so that they won't consume any RAM per heap/context.
With this change it's possible to reach ~4 kB startup RAM usage on a 32-bit target and ~8 kB on a 64-bit target.

1GB Vector, will Vector.Unboxed give trouble, will Vector.Storable give trouble?

We need to store about 1 GB of contiguous bytes in memory for long periods of time (weeks to months), and are trying to choose a Vector/Array library. I have two concerns that I can't find the answer to.
Vector.Unboxed seems to store the underlying bytes on the GHC heap, where the GC can move them around at will. Periodically moving 1 GB of data is something I would like to avoid.
Vector.Storable solves this problem by storing the underlying bytes in the C heap. But everything I've read seems to indicate that this is really only meant for communicating with other languages (primarily C). Is there some reason I should avoid using Vector.Storable for internal Haskell usage?
I'm open to a third option if it makes sense!
My first thought was the mmap package, which allows you to "memory-map" a file into memory, using the virtual memory system to manage paging. I don't know if this is appropriate for your use case (in particular, I don't know if you're loading or computing this 1GB of data), but it may be worth looking at.
In particular, I think this prevents the GC moving the data around (since it's not on the Haskell heap, it's managed by the OS virtual memory subsystem). On the other hand, this interface handles only raw bytes; you couldn't have, say, an array of Customer objects or something.
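A minimal sketch with the mmap package (the file path is a placeholder, and this assumes the data already exists on disk rather than being computed in memory):

```haskell
import qualified Data.ByteString as B
import System.IO.MMap (mmapFileByteString)

-- Sketch: map the whole file into memory. The bytes live outside the
-- GHC heap, so the GC never copies them; the OS pages them in on demand.
loadBlob :: IO B.ByteString
loadBlob = mmapFileByteString "blob.dat" Nothing  -- Nothing = map the entire file
```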

Resources