When using Caliper, I get the error
ERROR: GC occurred during timing.
because some garbage gets produced in my benchmark, which I can't avoid. I guess giving more memory to the target JVM could help, as there isn't that much garbage. I'm aware of the -D and -J options, but somehow they don't work for me.
Firstly, I see in this question that multiple arguments passed via -Jmemory=-Xmx512M,-Xmx16M get used separately, i.e., each comma-separated argument leads to a new run. But I'd like to pass multiple arguments to be used together, something like -Xmx16G -XX:NewSize=12G, so that GC gets postponed as much as possible (and ideally doesn't happen at all, because the run finishes first). How can I do that?
Secondly, what are the best arguments for postponing GC as much as possible? I mean: give the JVM a lot of memory (-Xmx), use it all for Eden, and don't care how full it gets.
I have a simple Node app with one function that defines 1000+ functions inside it (without running them).
When I call this function (the wrapper) around 200 times, the RSS memory of the process spikes from 100MB to 1000MB and then immediately goes back down. (The spike only happens after roughly 200 calls; none of the calls before that cause a spike, and none of the calls after it do either.)
This issue is happening to us in our Node server in production, and I was able to reproduce it in a simple Node app here:
https://github.com/gileck/node-v8-memory-issue
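To give a rough idea of the shape of the code in question, here is a minimal hypothetical sketch (not the actual code from the repo linked above; all names are made up):

// Hypothetical sketch: a single wrapper function that defines many inner
// functions (closures over its locals) without calling them.
function handleRequest(context: { userId: string }) {
  const helperA = () => context.userId.toUpperCase();
  const helperB = () => context.userId.length;
  // ... 1000+ more definitions like these, making the wrapper itself huge ...
  return { helperA, helperB };
}

// Calling the wrapper repeatedly, as a server would once per request.
for (let i = 0; i < 300; i++) {
  handleRequest({ userId: `user-${i}` });
}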
When I use --jitless or --no-opt the issue does not happen (no spikes), but obviously we do not want to disable all the V8 optimizations in production.
This issue must be caused by some specific V8 optimization. I tried a few other V8 flags, but none of them fix the issue (only --jitless and --no-opt do).
Does anyone know which V8 optimization could cause this?
Update:
We found that --no-concurrent-recompilation fixes this issue (no memory spikes at all), but we still can't explain it.
We are not sure why it happens or which code changes might fix it (without the flag).
As one of the answers suggests, moving all 1000+ function definitions out of the main function will solve it, but then those functions will not be able to access the context of the main function, which is why they are defined inside it in the first place.
Imagine that you have a server and you want to handle a request.
Obviously, the request handler is going to run many times, as the server gets a lot of requests from the client.
Would you define functions inside the request handler (so you can access the request context in those functions) or define them outside of the request handler and pass the request context as a parameter to all of them? We chose the first option... what do you think?
Does anyone know which V8 optimization could cause this?
Load Elimination.
I guess it's fair to say that any optimization could cause lots of memory consumption in pathological cases (such as: a nearly 14 MB monster of a function as input, wow!), but Load Elimination is what causes it in this particular case.
You can see for yourself when you run with --turbo-stats (and optionally --turbo-filter=foo to zoom in on just that function).
You can disable Load Elimination if you feel that you must. A preferable approach would probably be to reorganize your code somewhat: defining 2,000 functions is totally fine, but the function that defines all these other functions probably doesn't need to run so often that it gets optimized? You'll avoid not only this particular issue, but also get better efficiency in general, if you define each function only once.
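A minimal sketch of that kind of reorganization (made-up names, assuming the helpers only need the request context) would be to define each helper once at module scope and pass the context in as a parameter, so the frequently-called handler itself stays small:

// Hypothetical sketch: each helper is defined exactly once at module scope
// and takes the context as an explicit parameter instead of closing over it.
interface RequestContext {
  userId: string;
}

function helperA(ctx: RequestContext): string {
  return ctx.userId.toUpperCase();
}

function helperB(ctx: RequestContext): number {
  return ctx.userId.length;
}
// ... the remaining helpers, each defined only once ...

// The hot request handler stays small, so optimizing it stays cheap.
function handleRequest(ctx: RequestContext) {
  return { upper: helperA(ctx), length: helperB(ctx) };
}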
There may or may not be room for improving Load Elimination in Turbofan to be more efficient for huge inputs; that's a longer investigation and I'm not sure it's worth it (compared to working on other things that likely show up more frequently in practice).
I do want to emphasize for any future readers of this that disabling optimization(s) is not generally a good rule of thumb for improving performance (or anything else), on the contrary; nor are any other "secret" flags needed to unlock "secret" performance: the default configuration is very carefully optimized to give you what's (usually) best. It's a very rare special case that a particular optimization pass interacts badly with a particular code pattern in an input function.
In the process of upgrading Node (16.1.x => 16.5.0), I observed that I'm getting OOM issues from Jest. In troubleshooting, I'm periodically taking heap snapshots. I'm regularly seeing entries in "string" for module source (same shallow/retained size). In this example screenshot, you can see that the exact same module (React) is listed 2x. Sometimes, the module string is listed as many as 4x for any given source module.
Upon expansion, it says "system / Map", which suggests to me (I think?) that there's some V8-wide reference to this module string. That makes sense, maybe: Node has a require cache, Jest has a module cache, and I'd assume V8 and Node share module references? The strings and compiled-code buckets do increase regularly, but I expect them to get GC'd. In fact, I can see that many do: expanding the items shows the refs belonging to GC roots. But I suspect something is holding on to these module references, and I fear it's not at the user level but at the tooling level. This is somewhat evidenced by the observation that only the Node.js upgrade induces the OOM failure mode.
Why would my Jest test have multiple instances of the same module? (I am using --runInBand, so I don't expect multiple workers.)
What tips would you offer to diagnose further?
I do see multiple VM contexts, which I think makes sense; I suppose Jest is running some test suites in some sort of isolation.
I do not have a reproduction; I am looking for discussion, best-known methods, and diagnostic ideas.
I can offer some thoughts:
"system / Map" does not mean "some v8 wide reference". "Map" is the internal name for "hidden class", which you may have heard of. The details don't even matter here; TL;DR: some internal thing, totally normal, not a sign of a problem.
Having several copies of the same string on the heap is also quite normal, because strings don't get deduplicated by default. So if you run some string-producing operation twice (such as reading an external file), you'll get two copies of the string. I have no idea what Jest does under the hood, but it's totally conceivable that running tests in parallel in mostly-isolated environments has a side effect of creating duplicate strings. That may be inefficient in a sense, but as long as they get GC'd after a while, it's not really a problem.
If the specific hypothesis implied above (there are several tests in each file, and Jest creates an in-memory copy of the entire file for each executing test) holds, then a possible mitigation might be to split your test files into smaller chunks (1.8MB is quite a lot for a single file). I don't have much confidence in this, but maybe it'd be easy for you to try it and see.
More generally: in the screenshot, there are 36MB of memory used by strings. That's far from being an OOM reason.
It might be insightful to measure the memory consumption of both Node versions. If, for example, it used to consume 4GB and now crashes when it reaches 2GB, that would indicate that the limit has changed. If it used to consume 2GB and now crashes when it reaches 4GB, that would imply that something major has changed. If it used to consume 1.98GB and now crashes when it reaches 2.0GB, then chances are something tiny has changed and you just happened to get lucky with the old version.
Until contradicting evidence turns up, I would operate under the assumption that the resource consumption is normal and simply must be accommodated. You could try giving Node more memory, or reducing the number of parallel test executions.
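As a rough illustration of how one might take that measurement (this snippet is not from the original question; it just logs standard Node/V8 numbers), the following prints the process RSS, the JS heap usage, and the heap limit V8 is running with. Running it on both Node versions, or calling it from a Jest setup file, shows whether the limit changed or whether actual consumption grew:

import process from "node:process";
import v8 from "node:v8";

// Log RSS, JS heap usage, and the heap size limit V8 is running with.
function logMemory(label: string): void {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const { heap_size_limit } = v8.getHeapStatistics();
  const mb = (n: number) => `${(n / 1024 / 1024).toFixed(1)} MB`;
  console.log(
    `${label}: rss=${mb(rss)} heapUsed=${mb(heapUsed)} heapTotal=${mb(heapTotal)} limit=${mb(heap_size_limit)}`
  );
}

logMemory("baseline");

If it turns out the heap is simply hitting its limit, that limit can be raised with the --max-old-space-size flag (value in megabytes), for example NODE_OPTIONS=--max-old-space-size=4096, which is one way of giving Node more memory.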
This seems to be a known issue with Jest on Node.js v16.11.0+ and has already been reported on GitHub.
I am learning Haskell, have read a few references, and am working on various challenges (mainly Codewars). However, at times I will attempt to generate an infinite list for some math algorithm and then pick from it (e.g. get the first n numbers that match some pattern).
But because my syntax isn't perfect, I often mix things up: while I mean to ask Haskell to define a (lazy) infinite list and pick the first 5 elements (or whatever), I end up actually asking it to do something with the full infinite list, and when I build and test it, the program just hangs.
I managed (once) to bring up the Windows Task Manager, and what is happening is that when Visual Studio Code builds and runs the executable, it grows extremely fast, absorbing all memory and CPU until the computer becomes unresponsive.
Is there some kind of compiler flag that can prevent this?
As noted in the comments, you can run your executable with the -M runtime (RTS) switch, which allows you to specify a maximum heap size (the default is unlimited), e.g. ./your-program +RTS -M512m -RTS; you may need to compile with -rtsopts for the runtime to accept the option. That way, if your program tries to use more than that amount of memory, it will crash with an exception rather than just consume all available RAM.
Note that this won't help if your program is doing lots of processing but not trying to actually keep the results in RAM. E.g., if you try to print out the first item that matches a condition, but no item will ever match the condition, in all likelihood your program will loop forever, but not actually consume any RAM. In that case, you'll just have to kill it.
You might also try running your code in GHCi, where you can just jab Ctrl+C to halt your code without killing GHCi itself.
I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code, there are some
// omissions around handling of tasks that do not run to completion that should be in production code
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory and the program being unable to cope (generally evidenced by an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly, notably the someMagicNumber multiplier, which, although tuned via profiling, may not be as optimal as it could be and isn't resilient to changes in the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes). The consequences of setting it too low are that the program might be slowed down. As long as you still have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit quicker than the task that follows it, so even a fairly small buffer is likely to ensure the task following it always has some work. The buffer size needed to get 90% of the benefits of a buffer is usually going to be quite small (a few dozen items, maybe), whereas the size needed to hit an OOM error is likely 6+ orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in) you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
Are there any languages/extensions that allow the programmer to define general runtime behavior of a program during specific code segments?
Some garbage-collected languages let you modify the behavior of the GC at runtime. In Lua, for example, the collectgarbage function lets you do this. So you can stop the GC when you want to be sure that no CPU time is spent on garbage collection during a critical section of code (after which you start the GC again).
I'm looking for a general way to specify the intended behavior of a program without resorting to specific GC tweaks. I'm interested even in an on-paper sort of specification method (i.e. something a programmer would code toward, but not program syntax that would actually implement that behavior). The point is that this could be used to specify critical sections of code that shouldn't be interrupted (latency-dependent activity) or other intended attributes of certain code paths (maximum time between an output and an input or two outputs, average running time, etc.).
For example, this syntax might specify that the maximum time latencyDependentStuff should take is 5 milliseconds:
requireMaxTime(5) {
    latencyDependentStuff();
}
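For concreteness, a rough runtime approximation of what the hypothetical requireMaxTime block above is meant to express could look like the following sketch (the point of the question is a declarative way to state such constraints, not this kind of manual check):

// Sketch: approximate the requireMaxTime idea as a runtime check that
// measures a section and reports when it exceeds its time budget.
function requireMaxTime(maxMillis: number, section: () => void): void {
  const start = Date.now();
  section();
  const elapsed = Date.now() - start;
  if (elapsed > maxMillis) {
    // A real mechanism would enforce or schedule around the budget,
    // not merely report that it was exceeded.
    console.warn(`Section exceeded its ${maxMillis} ms budget (took ${elapsed} ms)`);
  }
}

// Hypothetical latency-dependent work.
function latencyDependentStuff(): void {
  /* ... */
}

requireMaxTime(5, latencyDependentStuff);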
Has anyone seen anything like this anywhere before?