Why would multiple executions of IronPython code cause OutOfMemoryException? - memory-leaks

I have some IronPython code in a C# class that has the following two lines in a function:
void test(MyType request) {
    compiledCode.DefaultScope.SetVariable("target", request);
    compiledCode.Execute();
}
I was writing some code to stress-test and performance-test this. The request objects that I'm passing in come from a List<MyType> containing one million MyType instances. So I iterate over this list:
foreach (MyType thing in myThings) {
    test(thing);
}
And that works fine - runs in about 6 seconds. When I run
for (int x = 0; x < 1000; x++) {
    foreach (MyType thing in myThings) {
        test(thing);
    }
}
On the 19th iteration of the loop I get an OutOfMemoryException.
I tried doing Python.CreateEngine(AppDomain.CurrentDomain) instead, which didn't seem to have any effect, and I tried
...
compiledCode.Execute();
compiledCode.DefaultScope.RemoveVariable("target");
Hoping that somehow IronPython was keeping the reference around. That roughly doubled the amount of time it took, but it still blew up at the same point.
Why is my code causing a memory leak, and how do I fix it?

IronPython has a memory leak somewhere that prevents objects from getting GC'd. (Let this be a lesson, kids: managed languages are not immune to memory issues; they just have different ones.) I haven't been able to figure out where, though.
You could create an issue and attach a self-contained reproduction, but even better would be running your app against a memory profiler to see what is holding onto your MyType objects.

So, it turns out I wasn't thinking at all.
The target object contains a StringBuilder that I'm adding lines to, about 10 lines per iteration, so apparently I can hold about 190 million lines of text in a StringBuilder before I run out of memory ;)
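For anyone hitting the same symptom, here is a minimal sketch of the failure mode and the trivial fix, written in Java (whose StringBuilder behaves analogously to .NET's here); the class and method names are invented for illustration:
// A long-lived builder that is appended to on every request grows without bound.
public class BuilderGrowthDemo {
    private final StringBuilder log = new StringBuilder();

    // Appends ~10 lines per request and never clears them: memory use grows
    // without limit, exactly like the ~190 million lines described above.
    void handleLeaky(int requestId) {
        for (int i = 0; i < 10; i++) {
            log.append("request ").append(requestId).append(" line ").append(i).append('\n');
        }
    }

    // The fix: consume the accumulated text, then reset the builder so the
    // buffer stops growing across iterations.
    void handleBounded(int requestId) {
        handleLeaky(requestId);
        String chunk = log.toString();   // hand the text off (write it out, etc.)
        log.setLength(0);                // reuse the buffer instead of letting it grow forever
    }
}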

Related

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.
I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores; I see the execution time flattening out from about 10 problems already.
There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)
The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.
Am I right to be disappointed?
Is there anything I can do to get closer to my theoretical expectation?
int split_bigtask(Inputs *inputs, Outputs *results)
{
    for (int k = 0; k < MAXNO; k++)
        results->solved[k].value = 0;

    #pragma omp parallel shared(inputs, results)
    {
        #pragma omp for schedule(dynamic)
        for (int k = 0; k < inputs->no; k++)
        {
            /* res is declared inside the loop so it is private to each thread */
            int res = bigtask(inputs->values[k],
                              results->solved[k],
                              omp_get_thread_num());
        }
    }
    return TRUE;
}
I assume that there is no synchronization done within bigtask() (obvious, but I'd still check that first).
It's possible that you run into a "dirty cache" problem: if you manipulate data that lie close together (e.g. on the same cache line!) from multiple cores, each manipulation marks the cache line as dirty, which means the processor has to signal this to all the other processors, which in turn involves synchronization again...
Or it's possible that you create too many threads: allocating a thread is quite an overhead, so creating one thread per core is a lot more efficient than creating many more threads than you have cores.
I personally would assume that you have the dirty-cache case (the "Big Global Array").
Solution to the problem (if it's indeed the dirty-cache case):
Write the results to a local array which is merged into the "Big Global Array" by the main thread after the work is done (see the sketch after this list)
Split the global array into several smaller arrays (and give each thread one of these arrays)
Ensure that the records within the structure align on cache-line boundaries (this is a bit of a hack, since cache-line boundaries may change on future processors)
You may want to try to create a local copy of the array for each thread (at least for the results)
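Here is a minimal sketch of the first suggestion (thread-local result arrays merged by the main thread). It is written in Java rather than OpenMP purely for illustration, and solveOne(), the problem count, and the static striding are placeholder assumptions:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each worker fills its own local array; only the main thread touches the
// shared result array, so no two cores ever write to neighbouring elements.
public class LocalResultsDemo {
    // Placeholder for the real per-problem computation.
    static double solveOne(int k) { return Math.sqrt(k); }

    public static void main(String[] args) throws Exception {
        final int problems = 20;
        final int threads = Runtime.getRuntime().availableProcessors();
        final double[] global = new double[problems];   // the "Big Global Array"

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<double[]>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int tid = t;
            futures.add(pool.submit(() -> {
                double[] local = new double[problems];  // private to this worker
                for (int k = tid; k < problems; k += threads) {
                    local[k] = solveOne(k);
                }
                return local;
            }));
        }

        // Merge step: each index was produced by exactly one worker, so a
        // simple sum recovers the full result without any locking.
        for (Future<double[]> f : futures) {
            double[] local = f.get();
            for (int k = 0; k < problems; k++) {
                global[k] += local[k];
            }
        }
        pool.shutdown();
        System.out.println(java.util.Arrays.toString(global));
    }
}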

Delphi [volatile] and InterlockedCompareExchange not reliable?

I wrote a simple lock-free node stack (Delphi XE4, Win7-64, 32-bit app) where I can have multiple 'stacks' and pop/push nodes between them from various threads simultaneously.
It works 99.999% of the time but eventually glitches under a stress test using all CPU cores.
Stripped-down, it comes down to this (not real/compiled code):
Nodes are:
type
  POBNode = ^TOBNode;
  [volatile] TOBNode = record
    [volatile] Next: POBNode;
    Data: Int64;
  end;
Simplified stack:
type
  TOBStack = class
  private
    [volatile] Head: POBNode;
    function Pop: POBNode;
    procedure Push(NewNode: POBNode);
  end;
procedure TOBStack.Push(NewNode: POBNode);
var
  zTmp: POBNode;
begin
  repeat
    zTmp := InterlockedCompareExchangePointer(Pointer(Head), nil, nil); (*memory-fenced read*)
    NewNode.Next := zTmp;
    if InterlockedCompareExchangePointer(Pointer(Head), NewNode, zTmp) = zTmp then
      break (*success*)
    else
      continue;
  until False;
end;
function TOBStack.Pop: POBNode;
var
  NewHead: POBNode;
begin
  repeat
    Result := InterlockedCompareExchangePointer(Pointer(Head), nil, nil); (*memory-fenced read*)
    if Result = nil then
      exit;
    NewHead := Result.Next;
    if InterlockedCompareExchangePointer(Pointer(Head), NewHead, Result) = Result then
      break (*success*)
    else
      continue; (*fail, try again*)
  until False;
end;
I have tried many variations on this but cannot get it to be stable.
If I create a few threads that each have a stack and they all push/pop to/from a global stack, it eventually glitches, but not quickly: only when I stress it for minutes on end from multiple threads, in tight loops.
I cannot afford hidden bugs in my code, so I need more than the answer to one specific question to get this running 100% error-free, 24/7.
Does the code above look fine for a lock-free thread-safe stack?
What else can I look at? This is impossible to debug normally, as the errors occur in various places, telling me there is pointer or RAM corruption happening somewhere. I also get duplicate entries, meaning that a node that was popped off one stack, and later returned to that same stack, is still on top of the old stack... which should be impossible according to my algorithm. This led me to believe it's possible to defeat the Delphi/Windows InterlockedCompareExchange routines, or that there is some hidden knowledge yet to be revealed to me. :) (I also tried TInterlocked.)
I have made a complete test case that can be copied from ftp.true.co.za.
In there I run 8 threads doing 400 000 push/pops each, and it usually crashes (safely, due to checks/raised exceptions) after a few cycles of these tests; sometimes many, many test cycles complete before one suddenly crashes.
Any advice would be appreciated.
Regards,
Anton E
At first I was skeptical of this being an ABA problem, as indicated by gabr. It seemed to me that if one thread looks at the current Head, and another thread pushes then pops, you're happy to still operate on the same Head in the same way.
However, consider this from your Pop method:
NewHead:=Result.Next;
if InterlockedCompareExchangePointer(Pointer(Head),NewHead,Result)=Result
If the thread is swapped out after the first line:
A value for NewHead is stored in a local variable.
Then another thread successfully pops the node this thread was targeting.
It also manages to push the same node back, but with a different value for Next, before the first thread resumes.
The second line then passes the comparand check, allowing Head to receive the NewHead value from the local variable.
However, that stored NewHead value is now stale, thereby corrupting your stack.
There's a subtle variation on this problem that isn't even exercised by your test app, because you aren't destroying any nodes until the end of your test:
Your thread looks at the current Head.
Another thread pops some nodes.
Those nodes are destroyed, and new nodes are created and pushed.
By the time your "looking" thread is active again, it could be looking at an entirely different node that coincidentally sits at the same address.
If you're popping, you might assign a garbage pointer to Head.
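For comparison, the standard way to defeat ABA is to pair the pointer with a modification counter, so a reused address no longer passes the compare. Java exposes this as AtomicStampedReference; the sketch below (node type and names invented for illustration) shows the idea, not a drop-in Delphi fix:
import java.util.concurrent.atomic.AtomicStampedReference;

// Minimal Treiber stack that is immune to ABA because every successful
// push/pop bumps a stamp alongside the head pointer.
public class StampedStack<T> {
    private static class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicStampedReference<Node<T>> head =
            new AtomicStampedReference<>(null, 0);

    public void push(T value) {
        Node<T> newHead = new Node<>(value);
        int[] stamp = new int[1];
        while (true) {
            Node<T> oldHead = head.get(stamp);
            newHead.next = oldHead;
            // CAS succeeds only if both the pointer AND the stamp are unchanged.
            if (head.compareAndSet(oldHead, newHead, stamp[0], stamp[0] + 1)) return;
        }
    }

    public T pop() {
        int[] stamp = new int[1];
        while (true) {
            Node<T> oldHead = head.get(stamp);
            if (oldHead == null) return null;
            Node<T> newHead = oldHead.next;
            if (head.compareAndSet(oldHead, newHead, stamp[0], stamp[0] + 1)) {
                return oldHead.value;
            }
        }
    }
}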
Apart from the above...
Looking at your test app there's also some really dodgy code. E.g.
You generate a "random number": J:=GetTickCount and 7;(*Get a 'random' number 0..7*).
Do you realise how fast computers are?
Do you realise that GetTickCount will generate reams of duplicates in a tight loop?
I.e. the numbers you generate will be nothing like random.
And when comments don't agree with code, my spidey-sense kicks in.
You're allocating memory of a hard-coded size: GetMem(zTmp,12);(*Allocate new node*).
Why aren't you using SizeOf?
You're using a multi-platform compiler.... The size of that structure can vary.
There is absolutely zero reason to hard-code the size of the structure.
Right now, given these two examples, I wouldn't be entirely confident that there isn't also an error in your test code.

How to get Java stacks when JVM can't reach a safepoint

We recently had a situation where one of our production JVMs would randomly freeze. The Java process was burning CPU, but all visible activity would cease: no log output, nothing written to the GC log, no response to any network request, etc. The process would persist in this state until restarted.
It turned out that the org.mozilla.javascript.DToA class, when invoked on certain inputs, will get confused and call BigInteger.pow with enormous values (e.g. 5^2147483647), which triggers the JVM freeze. My guess is that some large loop, perhaps in java.math.BigInteger.multiplyToLen, was JIT'ed without a safepoint check inside the loop. The next time the JVM needed to pause for garbage collection, it would freeze, because the thread running the BigInteger code wouldn't be reaching a safepoint for a very long time.
My question: in the future, how can I diagnose a safepoint problem like this? kill -3 didn't produce any output; I presume it relies on safepoints to generate accurate stacks. Is there any production-safe tool which can extract stacks from a running JVM without waiting for a safepoint? (In this case, I got lucky and managed to grab a set of stack traces just after BigInteger.pow was invoked, but before it worked its way up to a sufficiently large input to completely wedge the JVM. Without that stroke of luck, I'm not sure how we would ever have diagnosed the problem.)
Edit: the following code illustrates the problem.
// Spawn a background thread to compute an enormous number.
new Thread() {
    @Override public void run() {
        try {
            Thread.sleep(5000);
        } catch (InterruptedException ex) {
        }
        BigInteger.valueOf(5).pow(100000000);
    }
}.start();

// Loop, allocating memory and periodically logging progress, to illustrate GC pause times.
byte[] b;
for (int outer = 0; ; outer++) {
    long startMs = System.currentTimeMillis();
    for (int inner = 0; inner < 100000; inner++) {
        b = new byte[1000];
    }
    System.out.println("Iteration " + outer + " took " + (System.currentTimeMillis() - startMs) + " ms");
}
This launches a background thread which waits 5 seconds and then starts an enormous BigInteger computation. In the foreground, it then repeatedly allocates a series of 100,000 1K blocks, logging the elapsed time for each 100MB series. During the 5 second period, each 100MB series runs in about 20 milliseconds on my MacBook Pro. Once the BigInteger computation begins, we begin to see long pauses interleaved. In one test, the pauses were successively 175ms, 997ms, 2927ms, 4222ms, and 22617ms (at which point I aborted the test). This is consistent with BigInteger.pow() invoking a series of ever-larger multiply operations, each taking successively longer to reach a safepoint.
Your problem interested me very much. You were right about the JIT. First I tried to play with GC types, but this did not have any effect. Then I tried disabling the JIT, and everything worked fine:
java -Djava.compiler=NONE Tests
Then I printed out the JIT compilations:
java -XX:+PrintCompilation Tests
I noticed that the problem starts after certain compilations in the BigInteger class, so I excluded methods from compilation one by one and finally found the cause:
java -XX:CompileCommand=exclude,java/math/BigInteger,multiplyToLen -XX:+PrintCompilation Tests
For large arrays this method can run for a long time, and the problem might really be with safepoints. For some reason they are not inserted, although they should be present even in compiled code. It looks like a bug. The next step would be to analyze the assembly code, which I have not done yet.
It's not a bug, it's a performance feature. The JVM eliminates the safepoint check from counted loops, making them run faster. It expects that either:
you care about STW pauses and don't have extremely long loops,
or you have extremely long loops but are fine with safepoints being eventual.
If that doesn't fit your case, it can be switched off with this flag: -XX:+UseCountedLoopSafepoints
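For example, assuming the same Tests class from the question, the flag is passed like the other options shown above:
java -XX:+UseCountedLoopSafepoints Tests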
And to answer the title question: you can still stop and explore the program with gdb, but the stack traces won't be as nice.
Perhaps that's what the "-F" option of jstack is good for:
OPTIONS
-F
Force a stack dump when 'jstack [-l] pid' does not respond.
I always wondered when and why that could help.

multithread search design

Dear Community, I'd like help understanding a small task that should improve the performance of my application.
I have an array of dictionaries in a singleton, with NSDictionary objects containing the keys:
code
country
specific
I have to retrieve the country and specific values from this array.
My first version of the application used a predicate, but I later found a lot of memory leaks and performance issues with that approach. The application was too slow and did not release memory quickly enough, growing to around 1 GB and then crashing.
My second version was a little more complicated. I filled the array in the singleton with one object per code and used the function you can see below.
- (void)codeIsSame:(NSArray *)codeForCheck;
{
    //@synchronized(self) {
    NSString *code = [codeForCheck objectAtIndex:0];
    if ([_code isEqualToString:code])
    {
        code = nil;
        NSUInteger queneNumberInt = [[codeForCheck objectAtIndex:1] intValue];
        NSLog(@"We match code:%@ country:%@ specific:%@ quene:%lu", _code, _country, _specific, queneNumberInt);
        [[ProjectArrays sharedProjectArrays].arrayDictionaryesForCountryCodesResult insertObject:_result atIndex:queneNumberInt];
    }
    code = nil;
    //}
    return;
}
The way to retrieve the necessary values is:
SEL selector = @selector(codeIsSame:);
[[ProjectArrays sharedProjectArrays].myCountrySpecificCodeListWithClass makeObjectsPerformSelector:selector withObject:codePlusQueueNumber];
This version works much better: no memory leaks, and it is very fast, but it is too hard to debug. Sometimes I receive an empty result; I tried to synchronize the thread jobs, but it still does not work stably. The main problem with this approach is that, for some strange reason, I sometimes don't get a result in my singleton array. I tried to debug it, using the array index for different threads, and found that the class simply misses an answer.
Core Data doesn't allow me to make a copy of the main MOC, and I can't use it in a multithreaded design (lock and unlock is not a good idea, and it produced too many errors in the lock/unlock part of the code).
Can anybody suggest what I could do better in this case? I need an approach that works stably and is easy to code and understand.
My current solution uses an NSDictionary where the keys are the codes, and under each code there is a dictionary with the country/specific values. It works fine as well, but it doesn't solve the main task: using Core Data when you need access to the same data from many threads.

examples of garbage collection bottlenecks

I remember someone telling me a good one, but I cannot recall it. I spent the last 20 minutes with Google trying to learn more.
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
From an old Sun tech tip -- sometimes it helps to explicitly nullify references in order to make them eligible for garbage collection earlier:
public class Stack {
    private static final int MAXLEN = 10;
    private Object stk[] = new Object[MAXLEN];
    private int stkp = -1;

    public void push(Object p) { stk[++stkp] = p; }

    public Object pop() { return stk[stkp--]; }
}
Rewriting the pop method in this way helps ensure that garbage collection gets done in a timely fashion:
public Object pop() {
    Object p = stk[stkp];
    stk[stkp--] = null;
    return p;
}
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
The following will be inefficient when using a generational garbage collector:
Mutating references in the heap because write barriers are significantly more expensive than pointer writes. Consider replacing heap allocation and references with an array of value types and an integer index into the array, respectively.
Creating long-lived temporaries. When they survive the nursery generation they must be marked, copied and all pointers to them updated. If it is possible to coalesce updates in order to reuse an old version of a collection, do so.
Complicated heap topologies. Again, consider replacing many references with indices.
Deep thread stacks. Try to keep stacks shallow to make it easier for the GC to collate the global roots.
However, I would not call these "bad" because there is nothing objectively wrong with them. They are only inefficient when used with this kind of garbage collector. With manual memory management, none of the issues arise (although many are replaced with equivalent issues, e.g. performance of malloc vs pool allocators). With other kinds of GC some of these issues disappear, e.g. some GCs don't have a write barrier, mark-region GCs should handle long-lived temporaries better, not all VMs need thread stacks.
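As a concrete illustration of the "indices instead of references" suggestion above, here is a hedged Java sketch; the domain (customers and orders) and the field names are invented for the example:
// Reference-heavy layout: re-pointing an order at a different customer is a
// heap reference mutation, which goes through the GC write barrier and adds
// edges the collector must trace.
class Customer { String name; }
class Order { Customer customer; long amount; }

// Index-based layout: customers sit in flat arrays and orders refer to them
// by int index. Re-pointing an order is a plain integer store, and the heap
// topology seen by the collector is far simpler.
class OrderTable {
    String[] customerNames;
    int[] orderCustomer;   // index into customerNames instead of a reference
    long[] orderAmount;

    void reassign(int order, int newCustomer) {
        orderCustomer[order] = newCustomer;   // no write barrier for an int store
    }
}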
When you have a loop that creates new object instances: if the number of iterations is very high, you produce a lot of garbage, causing the garbage collector to run more frequently and thus decreasing performance.
One example would be object references that are kept in member variables or static variables. Here is an example:
class Something {
    static HugeInstance instance = new HugeInstance();
}
The problem is that the garbage collector has no way of knowing when this instance is no longer needed. So it's usually better to keep things in local variables and have small functions.
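A small sketch of that suggestion, reusing the hypothetical HugeInstance from the snippet above (the use() method is a made-up placeholder):
class Something {
    void doWork() {
        // Scoped to the call: the instance becomes unreachable, and therefore
        // collectable, as soon as doWork() returns.
        HugeInstance instance = new HugeInstance();
        instance.use();   // placeholder for whatever the instance is needed for
    }
}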
String foo = new String("a" + "b" + "c");
I understand Java is better about this now, but in the early days that would involve the creation and destruction of 3 or 4 string objects.
I can give you an example that will work with the .Net CLR GC:
If you override the Finalize method of a class and do not call the base class Finalize method, such as:
protected override void Finalize() {
    Console.WriteLine("Im done");
    //base.Finalize(); => you should call it!
}
When you resurrect an object by accident:
protected override void Finalize() {
    Application.ObjHolder = this;
}

class Application {
    static public object ObjHolder;
}
When you use an object that has a finalizer, it takes two GC collections to get rid of the data, and in either of the code samples above you won't get rid of it.
frequent memory allocations
lack of memory reuse (when dealing with large memory chunks)
keeping objects longer than needed (keeping references to obsolete objects)
In most modern collectors, any use of finalization will slow the collector down. And not just for the objects that have finalizers.
Your custom service does not have a load limiter on it (a sketch of a simple limiter follows this list), so:
A lot of requests come in at the same time for some reason (say, everyone logs on in the morning).
The service takes longer to process each request, as it now has hundreds of threads (one per request).
Yet more partly processed requests build up due to the longer processing time.
Each partly processed request has created lots of objects that live until the end of processing that request.
The garbage collector spends lots of time trying to free memory, but it can't, due to the above.
Yet more partly processed requests build up due to the longer processing time... (including the time spent in GC).
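A very small sketch of such a limiter in Java (the limit of 32 and the Runnable request type are placeholder assumptions): callers beyond the limit wait instead of each holding half-processed objects alive.
import java.util.concurrent.Semaphore;

class LimitedService {
    // Allow at most 32 requests in flight; excess callers block here instead
    // of spawning more threads and piling up live, half-processed objects.
    private final Semaphore inFlight = new Semaphore(32);

    void serve(Runnable request) throws InterruptedException {
        inFlight.acquire();
        try {
            request.run();   // the real request processing would go here
        } finally {
            inFlight.release();
        }
    }
}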
I encountered a nice example while doing some parallel cell-based simulation in Python. Cells are initialized and sent to worker processes after pickling. If you have too many cells alive at any one time, the master node runs out of RAM. The trick is to make a limited number of cells, pack them, and send them off to cluster nodes before making more, remembering to set the objects already sent to None. This allows you to perform large simulations using the total RAM of the cluster in addition to its computing power.
The application here was a cell-based fire simulation; only the cells actively burning were kept as objects at any one time.
