Specifying runtime behavior of a program - garbage-collection

Are there any languages/extensions that allow the programmer to define general runtime behavior of a program during specific code segments?
Some garbage-collected languages let you modify the behavior of the GC at runtime. In Lua, for example, the collectgarbage function lets you do this: you can stop the GC when you want to be sure that CPU resources aren't spent on garbage collection during a critical section of code, and restart it afterwards.
I'm looking for a general way to specify the intended behavior of the program without resorting to specific GC tweaks. I'm interested even in an on-paper sort of specification method (i.e. something a programmer would code toward, but not program syntax that would actually implement that behavior). The point is that this could be used to mark critical sections of code that shouldn't be interrupted (latency-dependent activity) or to state other intended attributes of certain code paths (maximum time between an output and an input or between two outputs, average running time, etc.).
For example, this syntax might state that the maximum time latencyDependentStuff should take is 5 milliseconds:
requireMaxTime(5) {
    latencyDependentStuff();
}
Has anyone seen anything like this anywhere before?
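To make the idea a bit more concrete, here is a minimal Python sketch of what such a construct could look like as a library helper. The names require_max_time and DeadlineExceeded are hypothetical, invented for this illustration. It pauses Python's cyclic garbage collector for the duration of the block and checks the elapsed time afterwards, so it can only report an overrun after the fact; it cannot prevent one.

```python
import gc
import time
from contextlib import contextmanager


class DeadlineExceeded(RuntimeError):
    """Raised when a timed section overruns its budget (hypothetical name)."""


@contextmanager
def require_max_time(budget_ms):
    # Hypothetical helper: pause the cyclic GC for the section, then
    # compare the elapsed wall-clock time against the budget.
    gc_was_enabled = gc.isenabled()
    gc.disable()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if gc_was_enabled:
            gc.enable()  # restore the collector even if the body raised
    # Only reached when the body finished without raising.
    if elapsed_ms > budget_ms:
        raise DeadlineExceeded(
            "section took %.3f ms, budget was %.3f ms" % (elapsed_ms, budget_ms)
        )
```

Usage would be `with require_max_time(5): latency_dependent_stuff()`. A real specification mechanism would presumably be checked by a scheduler or a static analysis rather than by a runtime assertion like this one.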

Related

v8 memory spike (rss) when defining more than 1000 functions (does not reproduce when using --jitless)

I have a simple node app with 1 function that defines 1000+ functions inside it (without running them).
When I call this function (the wrapper) around 200 times, the RSS memory of the process spikes from 100MB to 1000MB and immediately goes back down. (The spike only happens after about 200 calls; the calls before that do not cause a memory spike, and neither do the calls after it.)
This issue is happening to us in our node server in production, and I was able to reproduce it in a simple node app here:
https://github.com/gileck/node-v8-memory-issue
When I use --jitless or --no-opt the issue does not happen (no spikes), but obviously we do not want to remove all the v8 optimizations in production.
This issue must be caused by some specific v8 optimization. I tried a few other v8 flags but none of them fixes the issue (only --jitless and --no-opt fix it).
Does anyone know which v8 optimization could cause this?
Update:
We found that --no-concurrent-recompilation fixes this issue (no memory spikes at all),
but we still can't explain it.
We are not sure why it happens and which code changes might fix it (without the flag).
As one of the answers suggests, moving all the 1000+ function definitions out of the main function will solve it, but then those functions will not be able to access the context of the main function which is why they are defined inside it.
Imagine that you have a server and you want to handle a request.
Obviously, the request handler is going to run many times as the server gets a lot of requests from the client.
Would you define functions inside the request handler (so you can access the request context in those functions) or define them outside of the request handler and pass the request context as a parameter to all of them? We chose the first option... what do you think?
anyone knows which v8 optimization could cause this?
Load Elimination.
I guess it's fair to say that any optimization could cause lots of memory consumption in pathological cases (such as: a nearly 14 MB monster of a function as input, wow!), but Load Elimination is what causes it in this particular case.
You can see for yourself when you run with --turbo-stats (and optionally --turbo-filter=foo to zoom in on just that function).
You can disable Load Elimination if you feel that you must. A preferable approach would probably be to reorganize your code somewhat: defining 2,000 functions is totally fine, but the function defining all these other functions probably doesn't need to be run in a loop long enough until it gets optimized? You'll avoid not only this particular issue, but get better efficiency in general, if you define functions only once each.
There may or may not be room for improving Load Elimination in Turbofan to be more efficient for huge inputs; that's a longer investigation and I'm not sure it's worth it (compared to working on other things that likely show up more frequently in practice).
I do want to emphasize for any future readers of this that disabling optimization(s) is not generally a good rule of thumb for improving performance (or anything else), on the contrary; nor are any other "secret" flags needed to unlock "secret" performance: the default configuration is very carefully optimized to give you what's (usually) best. It's a very rare special case that a particular optimization pass interacts badly with a particular code pattern in an input function.
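For what it's worth, the two layouts discussed in the question can be sketched as follows. This is Python rather than JavaScript, so it only shows the structure (helpers re-created on every call versus defined once and handed the context explicitly); it says nothing about how V8's optimizer treats either form. All names are made up for illustration.

```python
# Option 1: the helper is defined inside the handler, so a new function
# object is created on every call. It reaches the request via its closure.
def handle_request_nested(request):
    def greeting():
        return "hello " + request["user"]

    # The helper is returned alongside the result only so that a caller
    # can observe that each call produced a distinct function object.
    return greeting(), greeting


# Option 2: the helper is defined once at module level and receives the
# request context as an explicit parameter.
def greeting_for(ctx):
    return "hello " + ctx["user"]


def handle_request_hoisted(request):
    ctx = {"user": request["user"]}
    return greeting_for(ctx)
```

Both handlers compute the same result; the difference is that the second one defines its helper only once, at the cost of threading the context through every call.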

How to extend GHC's Thread State Object

I'd like to add two extra fields of type StgWord32 to the thread state object (TSO). Based on the information I found on the GHC-Wiki and from looking at the source code, I have extended the struct in /includes/rts/storage/TSO.h and changed the program that creates different offsets (creating DerivedConstants.h). The compiler, the rts, and a simple application re-compile, but at the end of the execution (in hs_exit_) the garbage collector complains:
internal error: scavenge_stack: weird activation record found on stack: 45
I guess it has to do with cmm and/or the STG implementation details (the offsets are generated since the structs are not visible at cmm level, correct me if I'm wrong). Is the order of fields significant? Have I missed a file that should be changed?
I use a debug build of the compiler and RTS and a rather dated ghc 6.12.3 on a 64-bit architecture. Any hints to relevant documentation and comments on the difference between ghc 6 and 7 regarding TSO handling are welcome, too.
The error that you are getting comes from: ghc/rts/sm/Scav.c. Specifically at line 1917:
default:
    barf("scavenge_stack: weird activation record found on stack: %d", (int)(info->i.type));
It looks like you need to also modify ClosureTypes.h, which you can find in ghc/includes/rts/storage. This file seems to contain the different kinds of headers that can appear in a heap object. I've also run into some strange bootstrapping errors, where if I try to rebuild using the stage-1 compiler, I get the error you mentioned, but if I do a clean build, then it compiles just fine.
A workaround that turned out good enough for me was to introduce a separate data structure for each Capability that holds the additional information for each lightweight thread. I used a HashTable (see rts/Hash.h and .c) mapping from thread id to the custom info struct. The entries were added when the threads were created from sparks (in scheduleActivateSpark).
Timing the creation, insertion, lookup and destruction of the entries and the table showed negligible overhead for small programs. The main overhead results from the actual usage of the information and should ideally be kept outside of the innermost scheduler loop. For the THREADED_RTS build one needs to ensure that other Capabilities don't access tables that are not their own (or use a mutex if such access is required, which is a potential source of additional overhead).

Limiting work in progress of parallel operations of a streamed resource

I've found myself recently using the SemaphoreSlim class to limit the work in progress of a parallelisable operation on a (large) streamed resource:
// The below code is an example of the structure of the code, there are some
// omissions around handling of tasks that do not run to completion that should be in production code
SemaphoreSlim semaphore = new SemaphoreSlim(Environment.ProcessorCount * someMagicNumber);
foreach (var result in StreamResults())
{
    semaphore.Wait();
    var task = DoWorkAsync(result).ContinueWith(t => semaphore.Release());
    ...
}
This is to avoid bringing too many results into memory and the program being unable to cope (generally evidenced by an OutOfMemoryException). Though the code works and is reasonably performant, it still feels ungainly. Notably, the someMagicNumber multiplier, although tuned via profiling, may not be as optimal as it could be and isn't resilient to changes in the implementation of DoWorkAsync.
In the same way that thread pooling can overcome the obstacle of scheduling many things for execution, I would like something that can overcome the obstacle of scheduling many things to be loaded into memory based on the resources that are available.
Since it is deterministically impossible to decide whether an OutOfMemoryException will occur, I appreciate that what I'm looking for may only be achievable via statistical means or even not at all, but I hope that I'm missing something.
Here I'd say that you're probably overthinking this problem. The consequences of overshooting are rather high (the program crashes). The consequences of being too low are that the program might be slowed down. As long as you still have some buffer beyond a minimum value, further increases to the buffer will generally have little to no effect, unless the processing time of that task in the pipe is extraordinarily volatile.
If your buffer is constantly filling up, it generally means that the task before it in the pipe executes quite a bit quicker than the task that follows it, so even a fairly small buffer is likely to ensure that the task following it always has some work. The buffer size needed to get 90% of the benefits of a buffer is usually going to be quite small (a few dozen items maybe), whereas the size needed to get an OOM error is something like 6+ orders of magnitude higher. As long as you're somewhere in between those two numbers (and that's a pretty big range to land in) you'll be just fine.
Just run your static tests, pick a static number, maybe add a few percent extra for "just in case" and you should be good. At most, I'd move some of the magic numbers to a config file so that they can be altered without a recompile in the event that the input data or the machine specs change radically.
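As a rough illustration of that advice, here is the same bounded-window pattern as a minimal Python asyncio sketch, with the limit read from a setting instead of being compiled in. The environment variable name MAX_IN_FLIGHT, its default of 32, and the function names are assumptions made for this example. As in the question's snippet, handling of work items that fail is left out.

```python
import asyncio
import os

# Hypothetical setting name; the default of 32 is an arbitrary placeholder
# that would be replaced by whatever profiling suggests.
MAX_IN_FLIGHT = int(os.environ.get("MAX_IN_FLIGHT", "32"))


async def process_stream(items, do_work, limit=MAX_IN_FLIGHT):
    """Run do_work(item) for every item with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)
    pending = set()

    async def run_one(item):
        try:
            await do_work(item)
        finally:
            semaphore.release()  # free the slot even if do_work raised

    for item in items:
        # Suspends the producer while the window is full, so no further
        # items are pulled from the stream until a slot frees up.
        await semaphore.acquire()
        task = asyncio.create_task(run_one(item))
        pending.add(task)
        task.add_done_callback(pending.discard)  # forget finished tasks

    # Wait for whatever is still running once the stream is exhausted.
    await asyncio.gather(*pending)
```

With this shape the limit can be changed by setting the variable before starting the process, without touching the code.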

Threads - Message Passing

I was trying to find some resources on getting the best performance and scaling with message passing. I have heard that passing messages by value instead of by reference can give better scalability, as it works well with NUMA-style setups and reduces contention for a given memory address.
I would assume value based message passing only works with "smaller" messages. What would "smaller" be defined as? At what point would references be better? Would one do stream processing this way?
I'm looking for some helpful tips or resources for these kinds of questions.
Thanks :-)
P.S. I work in C#, but I don't think that matters so much for these kinds of design questions.
Some factors to add to the excellent advice of Jeremy:
1) Passing by value only works efficiently for small messages. If the data has a [cache-line-size] unused area at the start to avoid false sharing, you are already approaching the size where passing by reference is more efficient.
2) Wider queues mean more space taken up by the queues, impacting memory use.
3) Copying data into/out of wide queue structures takes time. Apart from the actual CPU use while moving data, the queue remains locked during the copying. This increases contention on the queue and leads to an overall performance hit that is queue-width dependent. If there is any deadlock potential in your code, keeping locks for extended periods will not help matters.
4) Passing by value tends to lead to code that is specific to the data size, i.e. is fixed at compile-time. Apart from a nasty infestation of templates, this makes it very difficult to tune buffer sizes etc. at run-time.
5) If the messages are passed by reference and malloced/freed/newed/disposed/GC'd, this can lead to excessive contention on the memory-manager and frequent, wasteful GC. I usually use fixed pools of messages, allocated at startup, specifically to avoid this.
6) Handling byte-streams can be awkward when passing by reference. If a byte-stream is characterized by frequent delivery of single bytes, pass-by-reference is only sensible if the bytes are chunked-up. This can lead to the need for timeouts to ensure that partially-filled messages are dispatched to the next thread in a timely manner. This introduces complication and latency.
7) Pass-by-reference designs are inherently more likely to leak. This can lead to extended test times and overdosing on valgrind - a particularly painful addiction, (another reason I use fixed-size message object pools).
8) Complex messages, eg. those that contain references to other objects, can cause horrendous problems with ownership and lifetime-management if passed by value. Example - a server socket object has a reference to a buffer-list object that contains an array of buffer-instances of varying size, (real example from IOCP server). Try passing that by value..
9) Many OS calls cannot handle anything but a pointer. You cannot PostMessage (that's a Windows API, for all you happy-feet) even a 256-byte structure by value with one call (you have just the 2 wParam, lParam integers). Calls that set up asynchronous callbacks often allow 'context data' to be sent to the callback - almost always just one pointer. Any app that is going to use such OS functionality is almost forced to resort to pass by reference.
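The fixed message pool mentioned in points 5 and 7 could look roughly like this. It is a minimal Python sketch, not the code from the answer; MessagePool and run_pipeline are hypothetical names. Buffers are allocated once at startup, handed to the consumer by reference through a queue, and returned to the pool when the consumer is done with them.

```python
import queue
import threading


class MessagePool:
    """Fixed set of reusable buffers, allocated once at startup."""

    def __init__(self, count, size):
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def acquire(self):
        return self._free.get()  # blocks while every buffer is in use

    def release(self, buf):
        self._free.put(buf)

    def available(self):
        return self._free.qsize()


def run_pipeline(pool, payloads):
    """Send payloads to one consumer thread by reference via pooled buffers.

    Assumes each payload fits into a pool buffer.
    """
    inbox = queue.Queue()
    received = []

    def consumer():
        while True:
            msg = inbox.get()
            if msg is None:  # sentinel: producer is finished
                break
            buf, length = msg
            received.append(bytes(buf[:length]))
            pool.release(buf)  # hand the buffer back for reuse

    worker = threading.Thread(target=consumer)
    worker.start()
    for payload in payloads:
        buf = pool.acquire()
        buf[: len(payload)] = payload  # fill the pooled buffer in place
        inbox.put((buf, len(payload)))  # only the reference is queued
    inbox.put(None)
    worker.join()
    return received
```

Because acquire blocks when the pool is empty, the pool size also caps how far the producer can run ahead of the consumer, and no buffer is allocated or freed after startup.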
Jeremy Friesner's comment seems to be the best as this is a new area, although Martin James's points are also good. I know Microsoft is looking into message passing for their future kernels as we gain more cores.
There is a framework that deals with message passing and claims much better performance than the current .Net producer/consumer generics. I'm not sure how it will compare to .Net's Dataflow in 4.5:
https://github.com/odeheurles/Disruptor-net

How to solve memory leak caused by System.Diagnostics.PerformanceCounter

Summary
I have written a process monitor command-line application that takes as parameters:
The process name or process ID
A CPU Threshold percent.
The program watches all processes with the given name or PID, and if their CPU usage goes over the threshold percentage, it kills them.
I have two classes:
ProcessMonitor and ProcessMonitorList
The former, wraps around System.Diagnostics.PerformanceCounter
The latter is an IEnumerable that provides a list-like structure of the former.
The problem
The program itself works fine, however if I watch the Memory Usage on Task Manager, it grows in increments of about 20kB per second. Note: the program polls the CPU counter through PerformanceCounter every second.
This program needs to be running on a heavily used server, and there are a great number of processes it is watching. (20-30).
Investigation So far
I have used PerfMon to monitor the Private Bytes of the process versus the Total number of Bytes in all Heaps and according to the logic presented in the article referenced below, my results indicate that while fluctuating, the value remains bounded within an acceptable range, and hence there is no memory leak:
Article
I have also used FxCop to analyze my code, and it did not come up with anything relevant.
The Plot Thickens
Not being comfortable with just saying, Oh then there's no memory leak, I investigated further, and found (through debugging) that the following lines of code demonstrate where the leak is occurring, with the arrow showing the exact line.
_pc = new PerformanceCounter("Process", "% Processor Time", processName);
The above is where _pc is initialized, and is in the constructor of my ProcessMonitor class.
The below is the method that is causing the memory leak. This method is being called every second from my main.
public float NextValue()
{
    if (HasExited()) return PROCESS_ENDED;
    if (_pc != null)
    {
        _lastSample = _pc.NextValue(); //<-----------------------
        return _lastSample;
    }
    else return -1;
}
This indicates to me that the leak exists inside the NextValue() method, which is inside the System.Diagnostics.PerformanceCounter class.
My Questions:
Is this a known problem, and how do I get around it?
Is my assumption that the task manager's memory usage increasing implies that there is indeed a memory leak correct?
Are there any better ways to monitor multiple instances of a specific process and shut them down if they go over a specific threshold CPU usage, and then send an email?
So I think I figured it out.
Using the Reflector tool, I was able to examine the code inside System.Diagnostics.
It appears that the NextValue method calls
GC.SuppressFinalize();
This means (I think, and please correct me if I am wrong) that I needed to explicitly call Dispose() on all my classes.
So, what I did is implement IDisposable on all of my classes, especially the one that wrapped around PerformanceCounter.
I wrote more explicit cleanup of my IList<ProcessMonitor>, and the internals,
and voilà, the memory behavior changed.
It oscillates, but the memory usage is clearly bounded between an acceptable range over a long period of time.
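The shape of that fix, sketched in Python rather than C# (a context manager with close() playing the role of IDisposable and Dispose()): FakeCounter is a made-up stand-in for PerformanceCounter, and the two monitor classes only mirror the names from the question, so treat this as an outline of the ownership chain, not as the actual code.

```python
class FakeCounter:
    """Made-up stand-in for a handle-owning object like PerformanceCounter."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def next_value(self):
        return 0.0

    def close(self):
        self.closed = True


class ProcessMonitor:
    """Owns one counter and releases it explicitly."""

    def __init__(self, process_name):
        self._pc = FakeCounter(process_name)

    def next_value(self):
        if self._pc is None:
            return -1  # mirrors the "else return -1" branch in the question
        return self._pc.next_value()

    def close(self):
        if self._pc is not None:
            self._pc.close()
            self._pc = None  # safe to call close() more than once

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False


class ProcessMonitorList:
    """Owns the monitors and closes each of them when it is closed."""

    def __init__(self, names):
        self._monitors = [ProcessMonitor(n) for n in names]

    def __iter__(self):
        return iter(self._monitors)

    def close(self):
        for monitor in self._monitors:
            monitor.close()
        self._monitors.clear()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False
```

The point is simply that every object owning a counter releases it deterministically, and the list does the same for every monitor it holds, instead of leaving cleanup to the garbage collector.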

Resources