GC pauses causing performance issues - garbage-collection

I just started working on a project (I'm new, not the project) that, as a performance optimization, loads 32 GB of graph data (nodes, edges, etc.) into memory and keeps it there. This is a long-running service, so the data is meant to remain in memory for the lifetime of the service. When the CLR triggers a Gen 2 collection there are large pauses (of course) which hurt performance while the GC scans Gen 2 and marks reachable objects.
What I'd like to know is: are there strategies available for managed applications that must keep large amounts of data in memory? What are the best ways to prevent Gen 2 collections from ever running?

There are a few general things you can do in your implementation to make it more GC-friendly. A relatively easy one is to reduce the number of object references in your object graph. For example, replace:
class Graph {
    List<Node> roots;
    // ...
}
class Node {
    Node[] outwardEdges;
    // ...
}
With indirect references through Node identifiers:
class Graph {
    List<Node> roots;
    Node[] allNodes;
    // ...
}
class Node {
    int[] outwardEdges;
    // ...
}
or something similar that fits your design. This reduces the number of pointers in the object graph that the collector has to walk.
Shifting the data onto the native heap is another possibility: write a small native interop library (P/Invoke here; JNI is the equivalent on the Java side) to give you the interface to perform the operations you need. This can pay off in other ways: the last time I had a similar problem to solve, we made substantial space savings through this approach, because we had largely Western textual data in the data set, which occupied far less space encoded as UTF-8. As long as the cost of your graph search is non-trivial, the overhead of the native call is not likely to be an issue.
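In .NET you can also keep data off the managed heap without writing a separate native library at all. Below is a minimal, illustrative sketch (not the answer's original approach) using Marshal.AllocHGlobal; the class name and layout are assumptions, and it must be compiled with /unsafe:
using System;
using System.Runtime.InteropServices;

// Sketch: an edge array kept in unmanaged memory, so the GC never
// traces or moves it. Freeing it is now our responsibility.
class NativeEdgeStore : IDisposable
{
    private IntPtr _buffer;   // unmanaged block, invisible to the GC

    public NativeEdgeStore(int edgeCount)
    {
        _buffer = Marshal.AllocHGlobal(edgeCount * sizeof(int));
    }

    public unsafe int this[int i]
    {
        get { return ((int*)_buffer)[i]; }
        set { ((int*)_buffer)[i] = value; }
    }

    public void Dispose()
    {
        if (_buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_buffer);
            _buffer = IntPtr.Zero;
        }
    }
}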

Related

Extreme slowdown: does OpenMP see non-existent race conditions?

My code using OpenMP gets very slow when I add (*pRandomTrial)++; after generating a random number. In g_iRandomTrials[32] I store the number of rand() calls from each thread. Each thread writes to a different index of this array, there are no race conditions, and the results are OK, but this very simple counter makes the program almost 10 times slower than without the counter. Is there some keyword I can use in this case? I tried some setups with firstprivate(g_iRandomTrials), but I was never successful. When I create an int counter in the Simulate() function and touch the pointer only twice, at the start and at the end of the function, the code runs much faster, but this seems like a somewhat ugly solution, as it doesn't do anything about the underlying problem...
int g_iRandomTrials[32];
...
#pragma omp parallel
{
    do
    {
        ...
        Simulate();
        ...
    }
}

void Simulate(void)
{
    ...
    int id = omp_get_thread_num();
    int *pRandomTrial = g_iRandomTrials + id;
    ...
    while (used[index])
    {
        index = rand() % 50;
        (*pRandomTrial)++;
    }
}
The reason for the slowdown is false sharing. The answer is padding.
In computer science, false sharing is a performance-degrading usage
pattern that can arise in systems with distributed, coherent caches at
the size of the smallest resource block managed by the caching
mechanism. When a system participant attempts to periodically access
data that will never be altered by another party, but that data shares
a cache block with data that is altered, the caching protocol may
force the first participant to reload the whole unit despite a lack of
logical necessity. The caching system is unaware of activity within
this block and forces the first participant to bear the caching system
overhead required by true shared access of a resource.
https://en.wikipedia.org/wiki/False_sharing
CPUs manage their memory in units called cache lines, which tend to be 64 bytes long. When one core writes to a variable, it takes exclusive ownership of the entire cache line holding it, fetching it from memory if needed; other cores cannot access that line until ownership is released.
The answer is to pad and align your randomTrials array in such a way that no value is within 64 bytes of another. Keep in mind that 64 bytes is the most common line size, but there are architectures where this differs.
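Here is what that padding can look like, expressed in C# to match the other examples in this thread (the idea is identical in the C/OpenMP code above: pad g_iRandomTrials so each thread's slot sits on its own line). The 64-byte line size is an assumption:
using System.Runtime.InteropServices;
using System.Threading.Tasks;

// Each counter is padded out to 64 bytes (the assumed cache-line size),
// so no two threads' counters ever share a line.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

class FalseSharingDemo
{
    static PaddedCounter[] trials = new PaddedCounter[32];

    static void Main()
    {
        Parallel.For(0, 4, id =>
        {
            for (int i = 0; i < 10000000; i++)
                trials[id].Value++;   // each thread writes only its own line
        });
    }
}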

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. a linked list, tree, etc.): how can the right time to release a data structure be determined?
The problem is this: there are separate operations that can't be executed atomically without a lock. For example, one thread reads the pointer to some memory and increments the use count for that memory to prevent it from being freed while this thread is using the data, which might take a long time (and even if it doesn't, it's a race condition). What prevents another thread from reading the pointer, decrementing the use count, determining that the memory is no longer used, and freeing it before the first thread has incremented the use count?
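To make the window concrete, here is the problematic interleaving sketched as C#-style pseudocode (purely illustrative; sharedPtr and Count are hypothetical names):
// Thread A                              Thread B
// --------                              --------
// var p = sharedPtr;                    var p = sharedPtr;
//                                       Interlocked.Decrement(ref p.Count);
//                                       // count hits 0, so B frees p
// Interlocked.Increment(ref p.Count);   // A touches freed memory:
//                                       // use-after-free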
The main issue is that current CPUs only have a single-word CAS (compare and swap). Alternatively, the problem is that I'm clueless about wait-free algorithms and data structures, and after reading some papers I'm still not seeing the light.
IMHO garbage collection can't be the answer, because either the GC would have to be prevented from running while any thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again), or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out whether the data is in the silly state (a pointer has been read [e.g. stored in a local variable] but the use count hasn't been incremented yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all, if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that stores temporary references to objects in registers (invisible to other threads) right before the usage-counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++; rather, the solution should give a set of primitives that allow the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm could be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve, and there are several solutions, each with advantages and disadvantages. The complexity comes from inter-CPU synchronization issues: if the synchronization is not done right, it might appear to work correctly 99.9% of the time (which isn't enough), or it might fail under load.
The solutions I found are: 1) hazard pointers, 2) quiescence-period-based reclamation (used by the Linux kernel in its RCU implementation), 3) reference-counting techniques, 4) others, and 5) combinations of the above.
Hazard pointers work by saving the currently active references in a well-known per-thread location, so any thread deciding to free memory (when the counter appears to be zero) can check whether the memory is still in use by anyone. An interesting improvement is to buffer requests to release memory in a small array and free them in a batch when the array is full. The advantage of using hazard pointers is that they can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that they place an extra burden on the reader.
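A minimal sketch of the reader-side hazard-pointer protocol, in C# for consistency with the other examples in this thread (C# is garbage-collected, so this only illustrates the publish-and-recheck handshake; in C or C++ the reclaimer would consult these slots before calling free). The class shape and the MaxThreads bound are assumptions:
using System.Linq;
using System.Threading;

class HazardPointers<T> where T : class
{
    private const int MaxThreads = 64;             // assumed upper bound
    private readonly T[] _hazards = new T[MaxThreads];

    // Reader: publish the reference, then re-read the source. If the
    // source still matches, the hazard slot was visible to any reclaimer
    // before it could have freed the node.
    public T Protect(int threadId, ref T source)
    {
        while (true)
        {
            T p = Volatile.Read(ref source);
            Volatile.Write(ref _hazards[threadId], p);
            if (ReferenceEquals(Volatile.Read(ref source), p))
                return p;                          // safely protected
        }
    }

    public void Clear(int threadId)
    {
        Volatile.Write(ref _hazards[threadId], default(T));
    }

    // Reclaimer: a retired node may only be freed when no slot holds it.
    public bool IsSafeToReclaim(T node)
    {
        return !_hazards.Any(h => ReferenceEquals(h, node));
    }
}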
Quiescence-period-based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check whether each thread has passed through a quiescent period (a time when it is not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user-space application it would be the end of a critical section. This can be achieved with a simple per-thread counter: whenever the counter is even the thread is not in a critical section (reading shared data), and whenever it is odd the thread is inside one; to move into or out of a critical section, all the thread needs to do is atomically increment the counter. Based on this, the "garbage collector" can determine whether each thread has had a chance to finish. There are several approaches; a simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each tagged with the current generation (managed by the GC). When the GC runs, it checks the state of the threads (their state counters) to see whether each has passed into the next generation (its counter is higher than last time, or is the same and even); any memory can then be reclaimed one generation after it was freed. The advantage of this approach is that it places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound on the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section while the data keeps changing and memory isn't released), but in practice it works out all right.
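A sketch of just the even/odd counter bookkeeping described above, in C# (names are illustrative, and a real implementation also needs the per-generation retire list described in the text):
using System.Threading;

// Even counter value = thread is outside any critical section;
// odd = inside one. A reclaimer snapshots all counters and frees
// retired memory once every thread has either left its critical
// section or moved on to a new one.
class QuiescenceState
{
    private readonly long[] _epochs;   // one slot per reader thread

    public QuiescenceState(int threads)
    {
        _epochs = new long[threads];
    }

    public void EnterCriticalSection(int tid)
    {
        Interlocked.Increment(ref _epochs[tid]);   // now odd
    }

    public void ExitCriticalSection(int tid)
    {
        Interlocked.Increment(ref _epochs[tid]);   // now even
    }

    public long[] Snapshot()
    {
        var s = new long[_epochs.Length];
        for (int i = 0; i < s.Length; i++)
            s[i] = Interlocked.Read(ref _epochs[i]);
        return s;
    }

    // Memory retired before the snapshot may be freed once this is true.
    public bool AllQuiesced(long[] snapshot)
    {
        for (int i = 0; i < snapshot.Length; i++)
        {
            bool wasOutside = (snapshot[i] & 1) == 0;
            bool hasMovedOn = Interlocked.Read(ref _epochs[i]) != snapshot[i];
            if (!wasOutside && !hasMovedOn)
                return false;   // still inside the same critical section
        }
        return true;
    }
}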
There are a number of reference-counting solutions; many of them require a double compare-and-swap, which some CPUs don't support, so they can't be relied upon everywhere. The key problem remains, though: taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably. So.....
There are of course a number of "other" solutions; it's a very active topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined; for example, hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom yet not hurt performance in practice. That's somewhat like another tidbit I found in my research: in theory (purely in theory) it's not possible to implement wait-free algorithms using compare-and-swap, because a CAS-based update might keep failing for a non-deterministically long time (imagine a million threads on a million cores, each trying to increment and decrement the same counter using CAS). In reality, however, it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from the CAS than there are CPUs, though I think that if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores, there could be a chance of a major problem; then again, who knows, I don't have a hundred-core machine to try this on). Another result of my research is that designing and implementing wait-free algorithms and data structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. in Java]), and they might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here is an example in C#:
using System.Diagnostics;
using System.Threading;

class MyResource
{
    // Counter is initially zero. Resource will not be disposed until it has
    // been acquired and released.
    private int _counter;

    public bool Acquire()
    {
        // Atomically increment counter.
        int c = Interlocked.Increment(ref _counter);
        // Resource is available if the resulting value is greater than zero.
        return c > 0;
    }

    public bool Release()
    {
        // Atomically decrement counter.
        int c = Interlocked.Decrement(ref _counter);
        // We should never reach a negative value.
        Debug.Assert(c >= 0, "Resource was released without being acquired");
        // Dispose when we reach zero.
        if (c == 0)
        {
            // Mark as disposed by setting the counter to its minimum value.
            // Only do this if the counter remains at zero: atomic compare-and-swap.
            if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
            {
                // TODO: Run dispose code (free stuff)
                return true; // tell caller that resource is disposed
            }
        }
        return false; // released but still in use
    }
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
    if (acquired = r.Acquire())
    {
        // TODO: Use resource
    }
}
finally
{
    if (acquired)
    {
        if (r.Release())
        {
            // Resource was disposed.
            // TODO: Nullify variable or similar to let GC collect it.
        }
    }
}
I know this is not the best way, but it works for me:
for shared dynamic data-structure lists I use a usage counter per item,
for example:
struct _data
{
    DWORD cnt;       // usage counter
    bool deleted;
    // here add your data
    _data() { cnt = 0; deleted = true; }
};

const int MAX = 1024;
_data data[MAX];
now when an item starts being used somewhere:
// start use of data[i]
data[i].cnt++;
and after it is no longer used:
// stop use of data[i]
data[i].cnt--;
if you want to add a new item to the list:
// add item
for (i = 0; i < MAX; i++)   // find first deleted item
    if (data[i].deleted)
    {
        data[i].deleted = false;
        data[i].cnt = 0;
        // copy/set your data
        break;
    }
and now, in the background, once in a while (on a timer or whatever),
scan data[] and set all undeleted items with cnt == 0 as deleted (+ free their dynamic memory if they have any)
[Note]
to avoid multi-threaded access problems, implement a single global lock per data list,
and program it so you cannot scan data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now wrap any change of data[i].cnt like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify the delete scan like this:
while (data_cnt) Sleep(1);
data_cnt_locked = true;
Sleep(1);
if (data_cnt == 0)              // just to be sure
    for (i = 0; i < MAX; i++)   // here scan for items to delete ...
        if (!data[i].cnt)
            if (!data[i].deleted)
            {
                data[i].deleted = true;
                data[i].cnt = 0;
                // release your dynamic data ...
            }
data_cnt_locked = false;
PS.
do not forget to play with the sleep times a little to suit your needs
lock-free algorithm sleep times are sometimes dependent on the OS task scheduler
this is not really a lock-free implementation,
because while the "GC" is at work everything is locked,
but other than that, concurrent accesses do not block each other,
so if you do not run the GC too often you are fine

D programming without the garbage collector

I've been looking at D today and on the surface it looks quite amazing. I like how it includes many higher-level constructs directly in the language, so silly hacks or terse methods don't have to be used. One thing that really worries me is the GC. I know this is a big issue and have read many discussions about it.
My own simple tests, sprouted from a question here, show that the GC is extremely slow: over 10 times slower than straight C++ doing the same thing. (Obviously the test does not directly translate to the real world, but the performance hit is extreme and would slow down real-world applications that behave similarly, i.e. allocating many small objects quickly.)
I'm looking into writing a real-time, low-latency audio application, and it is possible that the GC will degrade its performance enough to make it nearly useless. In a sense, if it has any issues it will ruin the real-time audio aspect, which is much more crucial since, unlike graphics, audio runs at a much higher rate (44,000+ samples per second vs 30-60 frames per second). (Due to its low latency, it is more crucial than a standard audio player, which can buffer significant amounts of data.)
Disabling the GC improved the results to within about 20% of the C++ code. This is significant. I'll give the code at the end for analysis.
My questions are:
How difficult is it to replace D's GC with a standard smart-pointer implementation so that libraries that rely on the GC can still be used? If I remove the GC completely I'll lose a lot of grunt work, as D already has limited libraries compared to C++.
Does GC.disable only halt garbage collection temporarily (preventing the GC thread from running), and does GC.enable pick back up where it left off? If so, I could potentially disable the GC during high-CPU-usage moments to prevent latency issues.
Is there any way to enforce a pattern of not using the GC consistently? (This is because I've not programmed in D before, and when I start writing my classes that do not use the GC I would like to be sure I don't forget to implement their own cleanup.)
Is it possible to replace the GC in D easily? (Not that I want to, but it might be fun to play around with different methods of GC one day... this is similar to 1, I suppose.)
What I'd like to do is trade memory for speed. I do not need the GC to run every few seconds. In fact, if I can properly implement my own memory management for my data structures, then chances are it will not need to run very often at all. I might need to run it only when memory becomes scarce. From what I've read, though, the longer you wait to call it, the slower it will be. Since there will generally be times in my application where I can get away with calling it without issues, this will help alleviate some of the pressure (but then again, there might be hours when I won't be able to call it).
I am not worried about memory constraints as much. I'd prefer to "waste" memory over speed(up to a point, of course). First and foremost is the latency issues.
From what I've read, I can, at the very least, go the route of C/C++ as long as I don't use any libraries or language constructs that rely on the GC. The problem is, I do not know which ones do. I've seen strings, new, etc. mentioned, but does that mean I can't use the built-in strings if I don't enable the GC?
I've read in some bug reports that the GC might be really buggy and that could explain its performance problems?
Also, D uses a bit more memory; in fact, D runs out of memory before the C++ program does. I guess it is about 15% more or so in this case. I suppose that is due to the GC.
I realize the following code is not representative of your average program, but what it shows is that when programs instantiate a lot of objects (say, at startup), they will be much slower (10 times is a large factor). If the GC could be "paused" at startup, then it wouldn't necessarily be an issue.
What would really be nice is if I could somehow have the compiler automatically GC a local object if I do not specifically deallocate it. This would almost give the best of both worlds.
e.g.,
{
    Foo f = new Foo();
    ....
    dispose f;  // Causes f to be disposed of immediately and treats f outside the GC.
                // If left out, then f is passed to the GC.
                // I suppose this might actually end up creating two kinds of Foo
                // behind the scenes.
    Foo g = new manualGC!Foo();  // Maybe something like this will keep the GC's hands off
                                 // g and allow it to be manually disposed of.
}
In fact, it might be nice to actually be able to associate different types of GC's with different types of data with each GC being completely self contained. This way I could tailor the performance of the GC to my types.
Code:
module main;

import std.stdio, std.conv, core.memory;
import core.stdc.time;

class Foo {
    int x;
    this(int _x) { x = _x; }
}

void main(string[] args)
{
    clock_t start, end;
    double cpu_time_used;
    //GC.disable();
    start = clock();
    //int n = to!int(args[1]);
    int n = 10000000;
    Foo[] m = new Foo[n];
    foreach (i; 0 .. n)
    //for (int i = 0; i < n; i++)
    {
        m[i] = new Foo(i);
    }
    end = clock();
    cpu_time_used = (end - start);
    cpu_time_used = cpu_time_used / 1000.0;
    writeln(cpu_time_used);
    getchar();
}
C++ code
#include <cstdlib>
#include <iostream>
#include <time.h>
#include <math.h>
#include <stdio.h>
using namespace std;

class Foo {
public:
    int x;
    Foo(int _x);
};

Foo::Foo(int _x) {
    x = _x;
}

int main(int argc, char** argv) {
    int n = 120000000;
    clock_t start, end;
    double cpu_time_used;
    start = clock();
    Foo** gx = new Foo*[n];
    for (int i = 0; i < n; i++) {
        gx[i] = new Foo(i);
    }
    end = clock();
    cpu_time_used = (end - start);
    cpu_time_used = cpu_time_used / 1000.0;
    cout << cpu_time_used;
    std::cin.get();
    return 0;
}
D can use pretty much any C library, just define the functions needed. D can also use C++ libraries, but D does not understand certain C++ constructs. So... D can use almost as many libraries as C++. They just aren't native D libs.
From D's Library reference.
Core.memory:
static nothrow void disable();
Disables automatic garbage collections performed to minimize the process footprint. Collections may continue to occur in instances where the implementation deems necessary for correct program behavior, such as during an out of memory condition. This function is reentrant, but enable must be called once for each call to disable.
static pure nothrow void free(void* p);
Deallocates the memory referenced by p. If p is null, no action occurs. If p references memory not originally allocated by this garbage collector, or if it points to the interior of a memory block, no action will be taken. The block will not be finalized regardless of whether the FINALIZE attribute is set. If finalization is desired, use delete instead.
static pure nothrow void* malloc(size_t sz, uint ba = 0);
Requests an aligned block of managed memory from the garbage collector. This memory may be deleted at will with a call to free, or it may be discarded and cleaned up automatically during a collection run. If allocation fails, this function will call onOutOfMemory which is expected to throw an OutOfMemoryError.
So yes. Read more here: http://dlang.org/garbage.html
And here: http://dlang.org/memory.html
If you really need classes, look at this: http://dlang.org/memory.html#newdelete
delete has been deprecated, but I believe you can still free() it.
Don't use classes, use structs. Structs are stack allocated, classes are heap. Unless you need polymorphism or other things classes support, they are overhead for what you are doing. You can use malloc and free if you want to.
More or less... fill out the function definitions here: https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d . There's a GC proxy system set up to allow you to customize the GC. So it's not like it is something that the designers do not want you to do.
Little GC knowledge here:
The garbage collector is not guaranteed to run the destructor for all unreferenced objects. Furthermore, the order in which the garbage collector calls destructors for unreferenced objects is not specified. This means that when the garbage collector calls a destructor for an object of a class that has members that are references to garbage-collected objects, those references may no longer be valid. This means that destructors cannot reference sub-objects. This rule does not apply to auto objects or objects deleted with the DeleteExpression, as the destructor is not being run by the garbage collector, meaning all references are valid.
import std.c.stdlib; that should have malloc and free.
import core.memory; this has GC.malloc, GC.free, GC.addRoot, // add external memory to the GC...
strings require the GC because they are dynamic arrays of immutable chars. ( immutable(char)[] ) Dynamic arrays require GC, static do not.
If you want manual management, go ahead.
import std.c.stdlib;
import core.memory;

char* one = cast(char*) GC.malloc(char.sizeof * 8);
GC.free(one); // pardon me, I'm not used to manual memory management.
              // I am *asking* you to edit this to fix it, if it needs it.
Why create a wrapper class for an int? You are doing nothing more than slowing things down and wasting memory.
class Foo { int n; this(int _n){ n = _n; } }
writeln(Foo.sizeof); // it's 8 bytes, btw
writeln(int.sizeof); // it's *half* the size of Foo: 4 bytes.
Foo[] m;// = new Foo[n]; //8 sec
m.length=n; //7 sec minor optimization. at least on my machine.
foreach(i; 0..n)
m[i] = new Foo(i);
int[] m;
m.length=n; //nice formatting. and default initialized to 0
//Ooops! forgot this...
foreach(i; 0..n)
m[i] = i;//.145 sec
If you really need to, then write the Time-sensitive function in C, and call it from D.
Heck, if time is really that big of a deal, use D's inline assembly to optimize everything.
I suggest you read this article: http://3d.benjamin-thaut.de/?p=20
There you will find a version of the standard library that does own memory management and completely avoids garbage collection.
D's GC simply isn't as sophisticated as others like Java's. It's open-source so anyone can try to improve it.
There is an experimental concurrent GC named CDGC and there is a current GSoC project to remove the global lock: http://www.google-melange.com/gsoc/project/google/gsoc2012/avtuunainen/17001
Make sure to use LDC or GDC for compilation to get better optimized code.
The XomB project also uses a custom runtime but it's D version 1 I think.
http://wiki.xomb.org/index.php?title=Main_Page
You can also just allocate all memory blocks you need then use a memory pool to get blocks without the GC.
And by the way, it's not as slow as you mentioned. And GC.disable() doesn't really disable it.
We might look at the problem from a slightly different angle. The suboptimal performance of allocating many little objects, which you mention as the rationale for the question, has little to do with the GC alone. Rather, it's a matter of balance between general-purpose (but suboptimal) and highly performant (but task-specialized) memory-management tools. The idea is: the presence of a GC doesn't prevent you from writing a real-time app; you just have to use more specific tools (say, object pools) for special cases.
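For instance, a minimal object-pool sketch (in C# for consistency with the other examples in this thread; the same idea carries over to D). The class shape is an assumption, not a standard API:
using System.Collections.Concurrent;

// Rent/Return instead of new/collect: the hot path recycles instances,
// so the GC sees no new allocations there.
class ObjectPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();

    public T Rent()
    {
        T item;
        return _items.TryTake(out item) ? item : new T();
    }

    public void Return(T item)
    {
        _items.Add(item);
    }
}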
Since this hasn't been closed yet, recent versions of D have the std.container library which contains an Array data structure that is significantly more efficient with respect to memory than the built-in arrays. I can't confirm that the other data structures in the library are also efficient, but it may be worth looking into if you need to be more memory conscious without having to resort to manually creating data structures that don't require garbage collection.
D is constantly evolving. Most of the answers here are 9+ years old, so I figured I'd answer these questions again for anyone curious what the current situation is.
(...) replace D's GC with a standard smart pointers implementation so that libraries that rely on the GC can still be used. (...)
Replacing the GC itself with smart pointers is not something I've looked into (i.e. where new creates a smart pointer). There are several D libraries that add smart pointers. You can interface with any C library. Interfacing with C++ and even Objective-C is also supported to some degree, so that should cover you pretty well.
Does GC.disable only halt the garbage collection temporarily (preventing the GC thread from running) and GC.enable pick back up where it left off. (...)
"Collections may continue to occur in instances where the implementation deems necessary for correct program behaviour, such as during an out of memory condition."
[source]
So mostly, yes. You can also manually invoke collection during down-time.
Is there any way to enforce a pattern to not use GC consistently. (...) when I start writing my classes that do not use the GC I would like to (...)
Classes are always allocated on the GC heap and are reference types. Structs should be used instead. However, keep in mind that structs are value types, so by default they're copied when being moved. You can @disable the copy constructor if you don't like this behaviour, but then your struct won't be POD.
What you're probably looking for is @nogc, which is a function attribute that stops a function from using the GC. You can't mark a struct type as @nogc, but you can mark each of its methods as @nogc. Just keep in mind that @nogc code can't call GC code. There's also nothrow.
If you intend to never use the GC, you ought to look into BetterC. It's a D language setting that removes all of D's runtime, standard library (Phobos), GC and all GC-reliant features (namely associative arrays and exceptions) in favour of using C's runtime and the C standard library.
Is it possible to replace the GC in D (...)
Yes it is: https://dlang.org/spec/garbage.html#gc_registry
And you can configure the pre-existing GC to better suit your needs if you don't want to make your own GC.

Is Parallel.ForEach in ConcurrentBag<T> thread safe

Description of ConcurrentBag on MSDN is not clear:
Bags are useful for storing objects when ordering doesn't matter, and unlike sets, bags support duplicates. ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.
My question is: is it thread-safe, and is it good practice to use a ConcurrentBag in Parallel.ForEach?
For Instance:
private List<XYZ> MyMethod(List<MyData> myData)
{
    var data = new ConcurrentBag<XYZ>();
    Parallel.ForEach(myData, item =>
    {
        // Some data manipulation
        data.Add(new XYZ(/* constructor parameters */));
    });
    return data.ToList();
}
This way I don't have to use synchronization locking in Parallel.ForEach as I would with a regular List.
Thanks a lot.
That looks fine to me. The way you're using it is thread-safe.
If you could return an IEnumerable<XYZ>, it could be made more efficient by not copying to a List<T> when you're done.
ConcurrentBag with Parallel.ForEach seems fine to me. But if you use these types in scenarios with high-volume, multi-user access, they can drive CPU usage up to levels that can bring down your web server. Furthermore, this implementation starts N tasks (threads) to execute the iterations, so be careful when choosing these classes and implementations. I recently found myself in this situation and had to take a memory dump to analyze what was happening inside my web application's core. So be careful: ConcurrentBag is thread-safe, but in web scenarios it is not always the best way.

examples of garbage collection bottlenecks

I remember someone telling me a good one, but I cannot remember it. I spent the last 20 minutes with Google trying to learn more.
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
From an old Sun tech tip: sometimes it helps to explicitly nullify references in order to make them eligible for garbage collection earlier:
public class Stack {
    private static final int MAXLEN = 10;
    private Object stk[] = new Object[MAXLEN];
    private int stkp = -1;

    public void push(Object p) { stk[++stkp] = p; }

    public Object pop() { return stk[stkp--]; }
}
rewriting the pop method in this way helps ensure that garbage collection gets done in a timely fashion:
public Object pop() {
    Object p = stk[stkp];
    stk[stkp--] = null;
    return p;
}
What are examples of bad/not great code that causes a performance hit due to garbage collection ?
The following will be inefficient when using a generational garbage collector:
Mutating references in the heap, because write barriers are significantly more expensive than plain pointer writes. Consider replacing heap allocation and references with an array of value types and an integer index into the array, respectively (see the sketch below).
Creating long-lived temporaries. When they survive the nursery generation, they must be marked, copied, and all pointers to them updated. If it is possible to coalesce updates in order to reuse an old version of a collection, do so.
Complicated heap topologies. Again, consider replacing many references with indices.
Deep thread stacks. Try to keep stacks shallow to make it easier for the GC to collate the global roots.
However, I would not call these "bad" because there is nothing objectively wrong with them. They are only inefficient when used with this kind of garbage collector. With manual memory management, none of the issues arise (although many are replaced with equivalent issues, e.g. performance of malloc vs pool allocators). With other kinds of GC some of these issues disappear, e.g. some GCs don't have a write barrier, mark-region GCs should handle long-lived temporaries better, not all VMs need thread stacks.
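As a sketch of the first point, in C# (the advice applies to any generational collector with a write barrier, the CLR's included; all names below are illustrative):
// A linked list flattened into an array of value types: each "pointer"
// is an int index, so the only heap object the GC must trace is the
// array itself, and updating NextIndex involves no write barrier.
struct NodeData
{
    public int Value;
    public int NextIndex;   // -1 terminates the chain; replaces a reference
}

class IndexedList
{
    private NodeData[] _nodes = new NodeData[1024];
    private int _head = -1;   // index of the first node, -1 when empty
    private int _free = 0;    // next unused slot (no free-list; sketch only)

    public void Push(int value)
    {
        _nodes[_free] = new NodeData { Value = value, NextIndex = _head };
        _head = _free++;
    }
}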
When you have some loop involving the creation of new object instances: if the number of iterations is very high, you produce a lot of garbage, causing the garbage collector to run more frequently and so decreasing performance.
One example would be object references that are kept in member variables or static variables. Here is an example:
class Something {
    static HugeInstance instance = new HugeInstance();
}
The problem is that the garbage collector has no way of knowing when this instance is no longer needed. So it's usually better to keep things in local variables and have small functions.
String foo = new String("a" + "b" + "c");
I understand Java is better about this now, but in the early days that would involve the creation and destruction of 3 or 4 string objects.
I can give you an example that will work with the .NET CLR GC: if you override an object's Finalize method and do not call the base class's Finalize, such as (note that in C# you would normally write this as a destructor, ~MyClass(), which calls the base Finalize for you; the explicit override is shown for illustration):
protected override void Finalize()
{
    Console.WriteLine("I'm done");
    //base.Finalize(); // => you should call it!
}
When you resurrect an object by accident:
protected override void Finalize()
{
    Application.ObjHolder = this;
}

class Application
{
    public static object ObjHolder;
}
When you use an object that has a finalizer, it takes two GC collections to get rid of the data, and with either of the snippets above you will never delete it.
frequent memory allocations
lack of memory reuse (when dealing with large memory chunks)
keeping objects longer than needed (keeping references to obsolete objects)
In most modern collectors, any use of finalization will slow the collector down. And not just for the objects that have finalizers.
Your custom service does not have a load limiter on it, so:
A lot of requests come in at the same time for some reason (say, everyone logs on in the morning).
The service takes longer to process each request, as it now has hundreds of threads (one per request).
Yet more partly processed requests build up due to the longer processing time.
Each partly processed request has created lots of objects that live until the end of processing that request.
The garbage collector spends lots of time trying to free memory, but it can't, due to the above.
Yet more partly processed requests build up due to the longer processing time... (including time spent in GC).
I encountered a nice example while doing some parallel cell-based simulation in Python. Cells are initialized and sent to worker processes after pickling for running. If you have too many cells at any one time, the master node runs out of RAM. The trick is to make a limited number of cells, pack them, and send them off to cluster nodes before making more, remembering to set the objects already sent to None. This allows you to perform large simulations using the total RAM of the cluster in addition to its computing power.
The application here was cell-based fire simulation; only the cells actively burning were kept as objects at any one time.
