Strange memory usage when using CacheEntryProcessor and modifying cache entry

I wonder if anybody can explain what is going wrong with my code.
I have an IgniteCache of Long->Object[] which is a kind of batching mechanism.
The cache is on-heap, partitioned, and has one backup configured.
I want to modify some of the objects within the cache entry value array.
So I wrote an implementation of CacheEntryProcessor:
@Override
public Object process(MutableEntry<Long, Object[]> entry, Object... arguments)
        throws EntryProcessorException {
    boolean updated = false;
    int key = (int) arguments[0];
    Set<Long> someIds = Ignition.ignite().cluster().nodeLocalMap().get(key);
    Object[] values = entry.getValue();
    for (int i = 0; i < values.length; i++) {
        Person p = (Person) values[i];
        if (someIds.contains(p.getId())) {
            p.modify();
            if (!updated) {
                updated = true;
            }
        }
    }
    if (updated) {
        entry.setValue(values);
    }
    return null;
}
When the cluster is loaded with data each node consumes around 20GB of heap.
When I run the processor with cache.invokeAll on a multi-node cluster, memory behaves strangely: while the processor is running I see heap usage climb to 48GB or more, eventually causing nodes to be dropped from the cluster because GC pauses take too long.
However, if I remove the entry.setValue(values) line, which stores the modified array back into the cache, everything is fine, apart from the fact that the update is not replicated since the cache is not aware of the change; it is only visible on the primary node :(
Can anybody tell me how to make it work? What is wrong with this approach?

First of all, I would not recommend allocating such large heaps. That will very likely cause long GC pauses even when everything is working properly: the JVM does not clean up memory until usage reaches a certain threshold, and by the time it does, there is too much garbage to collect at once. Try switching to off-heap storage or starting more Ignite nodes.
The fact that more garbage is generated when you update the entry makes perfect sense: each update replaces the old value with a new one, and the old value becomes garbage.
If none of this helps, grab a heap dump and check what is occupying the memory.
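If you do want to try off-heap storage, a rough sketch of the cache configuration could look like the following. This assumes the pre-2.0 CacheMemoryMode API (in Ignite 2.x the equivalent knob is DataRegionConfiguration on the data storage configuration); the cache name and the 40GB cap are illustrative, not taken from your setup.
import org.apache.ignite.cache.CacheMemoryMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class BatchCacheConfig {
    // Illustrative sketch: cache name and memory cap are placeholders.
    public static CacheConfiguration<Long, Object[]> offHeapBatches() {
        CacheConfiguration<Long, Object[]> ccfg = new CacheConfiguration<>("batches");
        ccfg.setCacheMode(CacheMode.PARTITIONED);
        ccfg.setBackups(1);
        ccfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED);   // keep entry data off the Java heap
        ccfg.setOffHeapMaxMemory(40L * 1024 * 1024 * 1024);   // cap off-heap usage at ~40GB
        return ccfg;
    }
}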

Related

Memory leak with nodejs

I load my calendar from Google, but every time I do it, Node uses about 2 MB more memory, even if I delete the module. I need to reload it every 5 or 10 minutes so I can see whether there are changes. This is my code.
google-calender.js module
module.exports = {
    loadCalendars: function(acces, res) {
        gcal = require('google-calendar');
        google_calendar = new gcal.GoogleCalendar(acces);
        google_calendar.calendarList.list(function(err, calendarList) {
            toLoadCalenders = calendarList.items.length;
            loaded = 0;
            data = [];
            for (var i = 0; i < toLoadCalenders; i++) {
                google_calendar.events.list(calendarList.items[i].id, function(err, calendarList) {
                    loaded++;
                    data.push(calendarList);
                    if (loaded == toLoadCalenders) {
                        res.send(data);
                    }
                });
            }
        });
    }
}
main.js
app.get('/google-calender', function (req, res) {
    google = require('./google-calender');
    google.loadCalendars(acces, res);
    setTimeout(function() {
        delete google;
    }, 500);
});
Does anyone know how I can prevent memory leak here?
Well, memory leaks are a tough area for any developer. First of all you need to know whether you actually have a memory leak or not. I recommend using node-inspector and doing the following:
1. Run your node app with node-inspector on.
2. Take a heap snapshot on a fresh start so you know the initial memory size used by your app.
3. Make some requests (you may use a benchmarking tool) and take a second snapshot in the meantime.
4. Compare snapshot 1 with snapshot 2, find where the increase is happening, and note it.
5. Stop making requests and wait a little so the garbage collector can finish its work, then take a third snapshot.
6. Compare snapshot 3 with snapshot 2; you may see that snapshot 3 has freed memory relative to snapshot 2.
You can run this test many times; if the last snapshot always shows more allocated memory than its predecessors, you probably have a memory leak.
How do you fix a memory leak?
Well, you need to be familiar with the cases in JavaScript where memory leaks happen; you can read this article and match similar cases in your code.
Then read the details of the snapshot comparisons, find the part that grows linearly, and figure out where in your code you have such data: arrays, objects, or even module code that is repeatedly required and never disposed.
Actually, by coincidence, we had such a case today and had to go through these troubleshooting steps to get our hands on the problem.
Good luck my friend.

Java - ThreadFactory memory leak?

This code:
while (true) {
    new ThreadFactory() {
        @Override
        public Thread newThread(Runnable r) {
            return null;
        }
    };
}
makes the JVM run out of memory very quickly.
Why?
I have tried to run this code with JRE 1.7.0_60 x86_64 on Windows 7 with default options and here are the results:
The author's code, run as is, doesn't seem to perform any allocation at all, most likely because the JIT detects the unused references and eliminates the allocation.
A modified version of the code that prints the created ThreadFactory instances to System.out shows a saw-like heap usage pattern, which means that both allocation and garbage collection take place.
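The modified version looked roughly like this (the wrapper class and main method are reconstructed for illustration; printing each created factory is the relevant change, since it makes the instance observable and the JIT can no longer elide the allocation):
import java.util.concurrent.ThreadFactory;

public class ThreadFactoryAllocation {
    public static void main(String[] args) {
        while (true) {
            ThreadFactory f = new ThreadFactory() {
                @Override
                public Thread newThread(Runnable r) {
                    return null;
                }
            };
            System.out.println(f); // the instance is now observable, so the allocation cannot be removed
        }
    }
}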
Back to your question: I think you either omitted some significant part of the code, set -Xmx to an extremely low value, or something else is going on. The code you posted is fine, though.

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. a linked list, tree, etc.); how can you determine the right time to release the pointed-to structure?
The problem is that there are separate operations that can't be executed atomically without a lock. For example, one thread reads a pointer to some memory and then increments the use count for that memory to prevent it from being freed while the thread is using the data, which might take a long time. Even if it doesn't, it's a race condition: what prevents another thread from reading the pointer, decrementing the use count, deciding that the memory is no longer used, and freeing it before the first thread has incremented the count?
The main issue is that current CPUs only have a single-word CAS (compare and swap). Alternatively, the problem is that I'm clueless about wait-free algorithms and data structures, and after reading some papers I'm still not seeing the light.
IMHO garbage collection can't be the answer: either the GC would have to be prevented from running while any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again), or the problem is simply pushed onto the GC, in which case please explain how the GC would figure out that the data is in this awkward state (a pointer has been read, e.g. stored in a local variable, but the use count hasn't been incremented yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all, if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that stores temporary references to objects in registers (invisible to other threads) right before the usage counter is incremented, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++; rather, the solution should give a set of primitives that allow the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
The solutions I found are: 1) hazard pointers, 2) quiescence-period-based reclamation (used by the Linux kernel in its RCU implementation), 3) reference counting techniques, 4) other approaches, and 5) combinations of the above.
Hazard pointers work by saving the currently active references in a well-known per-thread location, so any thread deciding to free memory (when the counter appears to be zero) can check whether the memory is still in use by anyone. An interesting improvement is to buffer requests to release memory in a small array and free them in a batch when the array is full. The advantage of hazard pointers is that they can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that they place an extra burden on the reader.
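To make the idea concrete, here is a minimal sketch along those lines (not from any particular paper): one hazard slot per thread, a fixed thread count, and a per-thread retire list freed in batches. The class name, the single slot per thread, and the reclaim() hook are illustrative simplifications.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative sketch, not production code.
final class HazardPointers<T> {
    private final AtomicReferenceArray<T> hazards;   // published "I am using this" slots
    private final ThreadLocal<List<T>> retired = ThreadLocal.withInitial(ArrayList::new);
    private static final int BATCH = 64;              // free in batches, as suggested above

    HazardPointers(int maxThreads) { hazards = new AtomicReferenceArray<>(maxThreads); }

    // Reader: publish the pointer it is about to dereference, then re-check the source
    // to make sure the publication happened before the node could have been retired.
    T protect(int threadId, AtomicReference<T> src) {
        T v;
        do {
            v = src.get();
            hazards.set(threadId, v);
        } while (src.get() != v);
        return v;
    }

    void clear(int threadId) { hazards.set(threadId, null); }

    // Writer: never free directly; defer until no thread has the node published as hazardous.
    void retire(T node) {
        List<T> list = retired.get();
        list.add(node);
        if (list.size() >= BATCH) scan(list);
    }

    private void scan(List<T> list) {
        list.removeIf(node -> {
            for (int i = 0; i < hazards.length(); i++)
                if (hazards.get(i) == node) return false;  // still protected somewhere
            reclaim(node);                                  // nobody has it published: safe to free
            return true;
        });
    }

    private void reclaim(T node) { /* free native memory or simply drop the reference */ }
}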
Quiescence-period-based reclamation works by delaying the actual release of the memory until it is known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check whether each thread has passed through a quiescent period (i.e. was not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch; in a user-space application it would be the end of a critical section. This can be achieved with a simple per-thread counter: while the counter is even the thread is not in a critical section (reading shared data), and while it is odd the thread is inside one; to enter or leave a critical section the thread only needs to atomically increment the number. Based on this, the "garbage collector" can determine whether each thread has had a chance to finish. There are several approaches; a simple one is to queue up the requests to free memory (e.g. in a linked list or an array), each tagged with the current generation (managed by the collector). When the collector runs it checks the state of the threads (their counters) to see whether each has passed into the next generation (its counter is higher than last time, or is the same and even); any memory can then be reclaimed one generation after it was freed. The advantage of this approach is that it places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound on the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section while the data keeps changing and memory isn't released), but in practice it works out all right.
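A minimal sketch of that even/odd counter scheme, using two generations of retired objects, might look like this. The thread ids, the synchronized reclaimer, and the reclaim() hook are simplifications of mine, not part of any production RCU implementation.
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch, not production code.
final class QuiescentReclaimer {
    private final AtomicLongArray state;      // even = outside a critical section, odd = inside
    private final long[] snapshot;            // per-thread counters seen at the last pass
    private ArrayDeque<Object> current = new ArrayDeque<>();   // retired this generation
    private ArrayDeque<Object> previous = new ArrayDeque<>();  // retired one generation ago

    QuiescentReclaimer(int threads) {
        state = new AtomicLongArray(threads);
        snapshot = new long[threads];
    }

    // Reader side: wrap every access to the shared structure.
    void enterCriticalSection(int tid) { state.incrementAndGet(tid); }  // counter becomes odd
    void exitCriticalSection(int tid)  { state.incrementAndGet(tid); }  // counter becomes even

    // Writer side: after unlinking a node, hand it over instead of freeing it immediately.
    synchronized void retire(Object node) { current.add(node); }

    // "GC" pass, run on a timer or after N retirements.
    synchronized void tryReclaim() {
        for (int i = 0; i < state.length(); i++) {
            long s = state.get(i);
            // A thread blocks reclamation only if it is still inside the same critical
            // section it was in when the last snapshot was taken (odd and unchanged).
            if (s % 2 == 1 && s == snapshot[i]) return;
        }
        while (!previous.isEmpty()) reclaim(previous.poll());  // one full generation old: safe
        ArrayDeque<Object> tmp = previous; previous = current; current = tmp;
        for (int i = 0; i < state.length(); i++) snapshot[i] = state.get(i);
    }

    private void reclaim(Object node) { /* free native memory or simply drop the reference */ }
}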
There are a number of reference counting solutions; many of them require a double compare-and-swap, which some CPUs don't support, so they can't be relied upon. The key problem remains, though: taking a reference before updating the counter. I didn't find enough information explaining how this can be done simply and reliably. So .....
There are of course a number of "other" solutions; it's a very active topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined; for example, hazard pointers can solve the problems of reference counting. But there is a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom while not hurting performance in practice. Another tidbit I found in my research: it is theoretically not possible to implement wait-free algorithms using compare-and-swap, because in theory (purely in theory) a CAS-based update might keep failing for a non-deterministic, excessive amount of time (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality, however, it rarely fails more than a few times (I suspect this is because the CPUs spend more clocks away from the CAS than there are CPUs, though if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem; then again, who knows, I don't have a hundred-core machine to try this). Another result of my research is that designing and implementing wait-free algorithms and data structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector as in Java), and they might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in c#:
class MyResource
{
    // Counter is initially zero. Resource will not be disposed until it has
    // been acquired and released.
    private int _counter;

    public bool Acquire()
    {
        // Atomically increment counter.
        int c = Interlocked.Increment(ref _counter);
        // Resource is available if the resulting value is greater than zero.
        return c > 0;
    }

    public bool Release()
    {
        // Atomically decrement counter.
        int c = Interlocked.Decrement(ref _counter);
        // We should never reach a negative value.
        Debug.Assert(c >= 0, "Resource was released without being acquired");
        // Dispose when we reach zero.
        if (c == 0)
        {
            // Mark as disposed by setting the counter to its minimum value.
            // Only do this if the counter remains at zero: atomic compare-and-swap operation.
            if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
            {
                // TODO: Run dispose code (free stuff)
                return true; // tell caller that resource is disposed
            }
        }
        return false; // released but still in use
    }
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use a usage counter per item,
for example:
struct _data
{
    DWORD cnt;       // usage counter
    bool deleted;
    // here add your data
    _data() { cnt=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when an item starts to be used somewhere then
// start use of data[i]
data[i].cnt++;
after it is no longer used then
// stop use of data[i]
data[i].cnt--;
if you want to add a new item to the list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
    if (data[i].deleted)
    {
        data[i].deleted=false;
        data[i].cnt=0;
        // copy/set your data
        break;
    }
and now in the background, once in a while (on a timer or whatever),
scan data[] and mark all undeleted items with cnt == 0 as deleted (and free their dynamic memory if they have any)
[Note]
to avoid multi-threaded access problems, implement a single global lock per data list
and program it so you cannot scan the data while any data[i].cnt is changing;
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now wrap any change of data[i].cnt like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify the delete scan like this:
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
    for (i=0;i<MAX;i++) // here scan for items to delete ...
        if (!data[i].cnt)
            if (!data[i].deleted)
            {
                data[i].deleted=true;
                data[i].cnt=0;
                // release your dynamic data ...
            }
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suit your needs;
lock-free algorithm sleep times sometimes depend on the OS task scheduler.
this is not really a lock-free implementation,
because while the GC is at work everything is locked,
but other than that, concurrent accesses do not block each other,
so if you do not run the GC too often you are fine

Why would multiple executions of IronPython code cause OutOfMemoryException?

I have some IronPython code in a C# class that has the following two lines in a function:
void test(MyType request) {
    compiledCode.DefaultScope.SetVariable("target", request);
    compiledCode.Execute();
}
I was writing some code to stress-test this/performance test it. The request object that I'm passing in is a List<MyType> with one million MyType's. So I iterate over this list:
foreach (MyType thing in myThings) {
    test(thing);
}
And that works fine - runs in about 6 seconds. When I run
for (int x = 0; x < 1000; x++) {
    foreach (MyType thing in myThings) {
        test(thing);
    }
}
On the 19th iteration of the loop I get an OutOfMemoryException.
I tried doing Python.CreateEngine(AppDomain.CurrentDomain) instead, which didn't seem to have any effect, and I tried
...
compiledCode.Execute();
compiledCode.DefaultScope.RemoveVariable("target");
Hoping that somehow IronPython was keeping the reference around. That roughly doubled the amount of time it took, but it still blew up at the same point.
Why is my code causing a memory leak, and how do I fix it?
IronPython has a memory leak somewhere that prevents objects from getting GC'd. (Let this be a lesson, kids: managed languages are not immune to memory issues; they just have different ones.) I haven't been able to figure out where, though.
You could create an issue and attach a self-contained reproduction, but even better would be running your app against a memory profiler and see what is holding onto your MyType objects.
So, it turns out I wasn't thinking at all.
The target object contains a StringBuilder that I keep adding lines to, about 10 lines per iteration, so apparently I can hold about 190 million lines of text in a StringBuilder before I run out of memory ;)

examples of garbage collection bottlenecks

I remember someone telling me a good one, but I cannot recall it. I spent the last 20 minutes on Google trying to learn more.
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
From an old Sun tech tip: sometimes it helps to explicitly nullify references in order to make them eligible for garbage collection earlier:
public class Stack {
    private static final int MAXLEN = 10;
    private Object stk[] = new Object[MAXLEN];
    private int stkp = -1;

    public void push(Object p) { stk[++stkp] = p; }
    public Object pop() { return stk[stkp--]; }
}
Rewriting the pop method in this way helps ensure that garbage collection gets done in a timely fashion:
public Object pop() {
    Object p = stk[stkp];
    stk[stkp--] = null;
    return p;
}
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
The following will be inefficient when using a generational garbage collector:
Mutating references in the heap, because write barriers are significantly more expensive than pointer writes. Consider replacing heap allocation and references with an array of value types and an integer index into the array, respectively (a small sketch of this idea appears below).
Creating long-lived temporaries. When they survive the nursery generation they must be marked, copied and all pointers to them updated. If it is possible to coalesce updates in order to reuse an old version of a collection, do so.
Complicated heap topologies. Again, consider replacing many references with indices.
Deep thread stacks. Try to keep stacks shallow to make it easier for the GC to collate the global roots.
However, I would not call these "bad" because there is nothing objectively wrong with them. They are only inefficient when used with this kind of garbage collector. With manual memory management, none of the issues arise (although many are replaced with equivalent issues, e.g. performance of malloc vs pool allocators). With other kinds of GC some of these issues disappear, e.g. some GCs don't have a write barrier, mark-region GCs should handle long-lived temporaries better, not all VMs need thread stacks.
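Here is a small sketch of that first suggestion, replacing references with indices. The class and field names are illustrative, not from any particular library:
// Illustrative sketch: payload lives in flat arrays, "links" are plain int indices.
final class IntNodePool {
    static final int NIL = -1;

    final long[] value;    // payload stored as primitives, nothing for the GC to trace
    final int[] next;      // the "pointer" is just an index into the pool
    private int free = 0;

    IntNodePool(int capacity) {
        value = new long[capacity];
        next = new int[capacity];
    }

    // Prepend a node to a list identified by its head index; returns the new head.
    int prepend(long v, int head) {
        int i = free++;
        value[i] = v;
        next[i] = head;    // plain int store: no GC write barrier, no extra object header
        return i;
    }
}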
When you have some loop involving the creation of new object instances: if the number of iterations is very high you produce a lot of garbage, causing the garbage collector to run more frequently and so decreasing performance.
One example would be object references that are kept in member variables or static variables. Here is an example:
class Something {
    static HugeInstance instance = new HugeInstance();
}
The problem is that the garbage collector has no way of knowing when this instance is no longer needed. So it is usually better to keep things in local variables and have small functions, as in the sketch below.
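For contrast, a minimal sketch of the local-variable version (the process() method is hypothetical, just to give the instance something to do):
class SomethingBetter {
    void doWork() {
        // The instance becomes unreachable as soon as doWork() returns,
        // so the collector can reclaim it without any help from us.
        HugeInstance instance = new HugeInstance();
        instance.process(); // process() is a hypothetical method, for illustration only
    }
}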
String foo = new String("a" + "b" + "c");
I understand Java is better about this now, but in the early days that would involve the creation and destruction of 3 or 4 string objects.
I can give you an example that applies to the .NET CLR GC:
If you override the Finalize method of a class and do not call the base class Finalize method, such as
protected override void Finalize() {
    Console.WriteLine("Im done");
    //base.Finalize(); => you should call it!!!!!
}
When you resurrect an object by accident:
protected override void Finalize() {
    Application.ObjHolder = this;
}

class Application {
    static public object ObjHolder;
}
When you use an object that has a finalizer, it takes two GC collections to get rid of the data, and in either of the cases above you will never delete it.
frequent memory allocations
lack of memory reuse (when dealing with large memory chunks)
keeping objects longer than needed (keeping references to obsolete objects)
In most modern collectors, any use of finalization will slow the collector down. And not just for the objects that have finalizers.
Your custom service does not have a load limiter on it (a minimal sketch of one follows after this list), so:
A lot of requests come in at the same time for some reason (everyone logs on in the morning, say).
The service takes longer to process each request as it now has hundreds of threads (one per request).
Yet more partly-processed requests build up due to the longer processing time.
Each partly-processed request has created lots of objects that live until the end of processing that request.
The garbage collector spends lots of time trying to free memory, but it can't due to the above.
Yet more partly-processed requests build up due to the longer processing time... (including time spent in GC).
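A minimal sketch of such a load limiter, assuming a hypothetical handle() entry point and an arbitrary limit of 100 concurrent requests:
import java.util.concurrent.Semaphore;

// Illustrative sketch: cap how many requests may be in flight, and therefore how many
// per-request object graphs exist, at any one time.
final class BoundedService {
    private final Semaphore permits = new Semaphore(100); // limit of 100 is arbitrary

    void handle(Runnable request) throws InterruptedException { // handle() is hypothetical
        permits.acquire();     // excess requests wait here instead of piling up live objects
        try {
            request.run();     // per-request allocations now live for a bounded time
        } finally {
            permits.release();
        }
    }
}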
I encountered a nice example while doing some parallel cell-based simulation in Python. Cells are initialized and sent to worker processes after pickling for running. If you have too many cells at any one time, the master node runs out of RAM. The trick is to create a limited number of cells, pack them and send them off to cluster nodes before making more, remembering to set the objects already sent off to None. This allows you to perform large simulations using the total RAM of the cluster in addition to its computing power.
The application here was a cell-based fire simulation; only the cells actively burning were kept as objects at any one time.
