Understanding Cpython garbage collection generations - python-3.x

I am trying to improve my understanding of how memory management works in python and I am confused about the concept of generations in pythons garbage collector module.
My understanding is as follows:
All objects created start in generation 0, once a threshold (700 by default for gen 0) is reached python will run a collection on that generation and any surviving objects age into the next gen.
Given the above, I cant understand the below output.
import gc
import sys
x = 1
print(gc.get_count())
gc.collect()
print(gc.get_count())
Output
(64, 1, 1)
(0, 0, 0)
Firstly, I've only run 1 line of code and Ive already got objects in gen 1 and 2 implying that garbage collection has already occurred at-least twice, how is this possible? Is there anyway to find out what objects are in each generation?
Secondly,
Why do I have 0 references in all generations after collection? I can still run the command print(x) and not get an error. Doesn't this mean there is still a reference to x and so it should exist in one of the generations?

gc.get_count() shows you the counter for each generation, towards the threshold.
It is not the amount of objects in each generation, but the counter, that when it reaches the threadshold, the collection will occur for that generation.
For example, if I start with (0,0,0) on the counter, running x = [[] for i in range(100)] will set the counter to (101,0,0).
Running y = [[] for i in range(600)] will cause the counter to flip to (0,1,0) and gen0 collection will run. At this point all of my lists will move to gen1 as they survived a gen0 collection.
When counter reaches (699,699,0) and another object is allocated, gen0 and gen1 collection will happen and the counter will go to (0,0,1). When counter reaches (699,699,699), and an object is allocated, or you use gc.collect() (which runs gen2 collection), counter will reset back to (0,0,0).
To get the objects in each generation use gc.get_objects(gen).
Regarding the garbage collection before your code runs - when Python starts, it creates lots of objects, before even loading your script. For example, you can see the modules that were loaded by running sys.modules. When those objects are created, the garbage collector runs automatically.

Related

Array assignment using multiprocessing

I have a uniform 2D coordinate grid stored in a numpy array. The values of this array are assigned by a function that looks roughly like the following:
def update_grid(grid):
n, m = grid.shape
for i in range(n):
for j in range(m):
#assignment
Calling this function takes 5-10 seconds for a 100x100 grid, and it needs to be called several hundred times during the execution of my main program. This function is the rate limiting step in my program, so I want to reduce the process time as much as possible.
I believe that the assignment expression inside can be split up in a manner which accommodates multiprocessing. The value at each gridpoint is independent of the others, so the assignments can be split something like this:
def update_grid(grid):
n, m = grid.shape
for i in range (n):
for j in range (m):
p = Process(target=#assignment)
p.start()
So my questions are:
Does the above loop structure ensure each process will only operate
on a single gridpoint? Do I need anything else to allow each
process to write to the same array, even if they're writing to
different placing in that array?
The assignment expression requires a set of parameters. These are
constant, but each process will be reading at the same time. Is this okay?
To explicitly write the code I've structured above, I would need to
define my assignment expression as another function inside of
update_grid, correct?
Is this actually worthwhile?
Thanks in advance.
Edit:
I still haven't figured out how to speed up the assignment, but I was able to avoid the problem by changing my main program. It no longer needs to update the entire grid with each iteration, and instead tracks changes and only updates what has changed. This cut the execution time of my program down from an hour to less than a minute.

Python multiprocessing taking the brakes off OSX

I have a program that randomly selects 13 cards from a full pack and analyses the hands for shape, point count and some other features important to the game of bridge. The program will select and analyse 10**7 hands in about 5 minutes. Checking the Activity Monitor shows that during execution the CPU (which s a 6 Core processor) is devoting about 9% of its time to the program and ~90% of its time it is idle. So it looks like a prime candidate for multiprocessing and I created a multiprocessing version using a Queue to pass information from each process back to the main program. Having navigated the problems of IDLE not working will multiprocessing (I now run it using PyCharm) and that doing a join on a process before it has finished freezes the program, I got it to work.
However, it doesn’t matter how many processes I use 5,10, 25 or 50 the result is always the same. The CPU devotes about 18% of its time to the program and has ~75% of its time idle and the execution time is slightly more than double at a bit over 10 minutes.
Can anyone explain how I can get the processes to take up more of the CPU time and how I can get the execution time to reflect this? Below are the relevant sections fo the program:
import random
import collections
import datetime
import time
from math import log10
from multiprocessing import Process, Queue
NUM_OF_HANDS = 10**6
NUM_OF_PROCESSES = 25
def analyse_hands(numofhands, q):
#code remove as not relevant to the problem
q.put((distribution, points, notrumps))
if __name__ == '__main__':
processlist = []
q = Queue()
handsperprocess = NUM_OF_HANDS // NUM_OF_PROCESSES
print(handsperprocess)
# Set up the processes and get them to do their stuff
start_time = time.time()
for _ in range(NUM_OF_PROCESSES):
p = Process(target=analyse_hands, args=((handsperprocess, q)))
processlist.append(p)
p.start()
# Allow q to get a few items
time.sleep(.05)
while not q.empty():
while not q.empty():
#code remove as not relevant to the problem
# Allow q to be refreshed so allowing all processes to finish before
# doing a join. It seems that doing a join before a process is
# finished will cause the program to lock
time.sleep(.05)
counter['empty'] += 1
for p in processlist:
p.join()
while not q.empty():
# This is never executed as all the processes have finished and q
# emptied before the join command above.
#code remove as not relevant to the problem
finish_time = time.time()
I have no answer to the reason why IDLE will not run a multiprocessor start instruction correctly but I believe the answer to the doubling of the execution times lies in the type of problem I am dealing with. Perhaps others can comment but it seems to me that the overhead involved with adding and removing items to and from the Queue is quite high so that performance improvements will be best achieved when the amount of data being passed via the Queue is small compared with the amount of processing required to obtain that data.
In my program I am creating and passing 10**7 items of data and I suppose it is the overhead of passing this number of items via the Queue that kills any performance improvement from getting the data via separate Processes. By using a map it seems all 10^7 items of data will need to be stored in the map before any further processing can be done. This might improve performance depending on the overhead of using the map and dealing with that amount of data but for the time being I will stick with my original vanilla, single processed code.

How garbage collection takes place in Python?

How garbage collection takes place in Python. Some people say that it happens automatically. But what is the proper process of it ?
Introduction to Python memory management
Python's memory allocation and deallocation method is automatic. The user does not have to preallocate or deallocate memory by hand as one has to when using dynamic memory allocation in languages such as C or C++. Python uses two strategies for memory allocation reference counting and garbage collection.
Prior to Python version 2.0, the Python interpreter only used reference counting for memory management. Reference counting works by counting the number of times an object is referenced by other objects in the system. When references to an object are removed, the reference count for an object is decremented. When the reference count becomes zero the object is deallocated.
Reference counting is extremely efficient but it does have some caveats. One such caveat is that it cannot handle reference cycles. A reference cycle is when there is no way to reach an object but its reference count is still greater than zero. The easiest way to create a reference cycle is to create an object which refers to itself as in the example below:
def make_cycle():
1 = [ ]
1.append(l)
make_cycle()
Because make_cycle() creates an object 1 which refers to itself, the object 1 will not automatically be freed when the function returns. This will cause the memory that 1 is using to be held onto until the Python garbage collector is invoked.
Automatic garbage collection of cycles
Because reference cycles are take computational work to discover, garbage collection must be a scheduled activity. Python schedules garbage collection based upon a threshold of object allocations and object deallocations. When the number of allocations minus the number of deallocations are greater than the threshold number, the garbage collector is run. One can inspect the threshold for new objects (objects in Python known as generation 0 objects) by loading the gc module and asking for garbage collection thresholds:
import gc
print "Garbage collection thresholds: %r" % gc.get_threshold()
Garbage collection thresholds: (700, 10, 10)
Here we can see that the default threshold on the above system is 700. This means when the number of allocations vs. the number of deallocations is greater than 700 the automatic garbage collector will run.
Automatic garbage collection will not run if your Python device is running out of memory; instead your application will throw exceptions, which must be handled or your application crashes. This is aggravated by the fact that the automatic garbage collection places high weight upon the NUMBER of free objects, not on how large they are. Thus any portion of your code which frees up large blocks of memory is a good candidate for running manual garbage collection.
Manual garbage collection
For some programs, especially long running server applications or embedded applications running on a Digi Device automatic garbage collection may not be sufficient. Although an application should be written to be as free of reference cycles as possible, it is a good idea to have a strategy for how to deal with them. Invoking the garbage collector manually during opportune times of program execution can be a good idea on how to handle memory being consumed by reference cycles.
The garbage collection can be invoked manually in the following way:
import gc
gc.collect()
gc.collect() returns the number of objects it has collected and deallocated. You can print this information in the following way:
import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % (collected)
If we create a few cycles, we can see manual collection work:
import sys, gc
def make_cycle():
1 = { }
1[0] = 1
def main():
collected = gc.collect()
print "Garbage collector: collected %d objects." % (collected)
print "Creating cycles..."
for i in range(10):
make_cycle()
collected = gc.collect()
print "Garbage collector: collected %d objects." % (collected)
if __name__ == "__main__":
ret = main()
sys.exit(ret)
In general there are two recommended strategies for performing manual garbage collection: time-based and event-based garbage collection. Time-based garbage collection is simple: the garbage collector is called on a fixed time interval. Event-based garbage collection calls the garbage collector on an event. For example, when a user disconnects from the application or when the application is known to enter an idle state.
Reference:
Python garbage collection
Understand the internals of python garbage collection and how Instagram engineers achieved a performance improvement of 10% by tweaking it.
do-you-know-how-python-cleanses-itself

Is reading/writing to different elements of a module array thread-safe?

As long as a program does not allow simultaneous writes to the same elements of a shared data structure that is stored in a module, is it thread-safe? I know this is a noob question, but couldn't find it explicitly addressed anywhere. Here's the situation:
At the beginning of a program, data is initialized and stored in a module-level allocatable array (FIELDVARS) which then becomes accessible to any subroutine where the module is referenced by a USE statement.
Suppose now that the program enters a multi-threaded and/or multi-core computational phase, and FIELDVARS is accessed for "read/write" operations during repeated multiple simultaneous calls to subroutine (COMPUTE).
Once the computational phase is complete, the program returns to a single-threaded phase and FIELDVARS must be used in a subsequent subroutine (POST). However, FIELDVARS cannot be added to the input args of COMPUTE or POST because these are called from a closed-source main program. Therefore the module-level array is used to pass the addt'l data between subroutines.
Assume that FIELDVARS and COMPUTE have been designed so that each call to COMPUTE will always give access to a set of unique elements of FIELDVARS, which are guaranteed to be different than for any other call, so that simultaneous "write" operations on the same elements will never occur. For example:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... ] <-- FIELDVARS
^---call 1---^ ^---call 2---^ ... <-- Each call to COMPUTE is guaranteed to access a specific set of elements of FIELDVARS.
Question: Is this scenario considered "thread-safe", "conditionally safe", or "not thread-safe"? If it is not safe, what is the specific danger and what would you suggest to handle it?
Other relevant details:
The main program controls threading.
The source code of the main program is not available and cannot be changed.
The main program controls how/when COMPUTE, POST, and other subroutines are called, as well as what args can be passed in. This is why the module-level array is used to pass data between different subroutines rather than as an arg.
! DEMO MODULE W/ ALLOCATABLE INTEGER ARRAY
module DATA_MODULE
integer, dimension(:), allocatable :: FIELDVARS !<-- allocated/populated elsewhere, prior to calling COMPUTE
end module DATA_MODULE
! DEMO COMPUTE SUBROUTINE (THREADED PHASE W/ MULTIPLE SIMULTANEOUS CALLS)
subroutine COMPUTE(x, y, x_idx, y_idx, flag)
use DATA_MODULE
logical :: flag
integer :: x,y,x_idx,y_idx !<-- different for every call to COMPUTE
if (flag == .false.) then !<-- read data only
...
x = FIELDVARS(x_idx)
y = FIELDVARS(y_idx)
...
else if (flag == .true.) then !<-- write data
...
FIELDVARS(x_idx) = 0
FIELDVARS(y_idx) = 0
...
endif
end subroutine COMPUTE
It is fine and many programs depend on that fact. In OpenMP you often loop over arrays and different threads may easily work with elements which are close to each other in memory, especially on the boundaries of the blocks assigned to each thread.
At modern CPUs this is a non-issue. See also https://en.wikipedia.org/wiki/Cache_coherence
What is a real problem is False sharing. Two or more threads working with elements of memory belonging to the same chache line will be compiting for a shared resource and it may be very slow.

How to keep very big elements on memory without exhausting the garbage collector?

In Haskell, I created a Vector of 1000000 IntMaps. I then used Gloss to render a picture in a way that accesses random intmaps from that vector.
That is, I had keep every single one of them in memory. The rendering function itself is very lightweight, so the performance was supposed to be good.
Yet, the program was running at 4fps. Upon profiling, I noticed 95% of the time was spent on GC. Fair enough:
The GC is crazily scanning my vector, even though it never changes.
Is there any way to tell GHC "this big value is needed and will not change - do not try to collect anything inside it".
Edit: the program below is sufficient to replicate the issue.
import qualified Data.IntMap as Map
import qualified Data.Vector as Vec
import Graphics.Gloss
import Graphics.Gloss.Interface.IO.Animate
import System.Random
main = do
let size = 10000000
let gen i = Map.fromList $ zip [mod i 10..0] [0..mod i 10]
let vec = Vec.fromList $ map gen [0..size]
let draw t = do
rnd <- randomIO :: IO Int
let empty = Map.null $ vec Vec.! mod rnd size
let rad = if empty then 10 else 50
return $ translate (20 * cos t) (20 * sin t) (circle rad)
animateIO (InWindow "hi" (256,256) (1,1)) white draw
This accesses a random map on a huge vector and draws a rotating circle whose radius depend on whether the map is empty.
Despite that logic being very simple, the program struggles at around 1 FPS here.
gloss is the culprit here.
First, a little background on GHC's garbage collector. GHC uses (by default) a generational, copying garbage collector. This means that the heap consists of several memory areas called generations. Objects are allocated into the youngest generation. When a generation becomes full, it is scanned for live objects and the live objects are copied into the next older generation, and then the generation that was scanned is marked as empty. When the oldest generation becomes full, the live objects are instead copied into a new version of the oldest generation.
An important fact to take away from this is that the GC only ever examines live objects. Dead objects are never touched at all. This is great when collecting generations that are mostly garbage, as often happens in the youngest generation. It's not good if long-lived data undergoes many GCs, as it will be copied repeatedly. (It can also be counterintuitive to those used to malloc/free-style memory management, where allocation and deallocation are both quite expensive, but leaving objects allocated for a long time has no direct cost.)
Now, the "generational hypothesis" is that most objects are either short-lived or long-lived. The long-lived objects will quickly end up in the oldest generation since they are alive at every collection. Meanwhile, most of the short-lived objects that are allocated will never survive the youngest generation; only those that happen to be alive when it is collected will be promoted to the next generation. Similarly, most of those short-lived objects that do get promoted will not survive to the third generation. As a result, the oldest generation that holds the long-lived objects should fill up very slowly, and its expensive collections that have to copy all the long-lived objects should occur rarely.
Now, all of this is actually true in your program, except for one problem:
let displayFun backendRef = do
-- extract the current time from the state
timeS <- animateSR `getsIORef` AN.stateAnimateTime
-- call the user action to get the animation frame
picture <- frameOp (double2Float timeS)
renderS <- readIORef renderSR
portS <- viewStateViewPort <$> readIORef viewSR
windowSize <- getWindowDimensions backendRef
-- render the frame
displayPicture
windowSize
backColor
renderS
(viewPortScale portS)
(applyViewPortToPicture portS picture)
-- perform GC every frame to try and avoid long pauses
performGC
gloss tells the GC to collect the oldest generation every frame!
This might be a good idea if those collections are expected to take less time than the delay between frames anyways, but it's clearly not a good idea for your program. If you remove that performGC call from gloss, then your program runs quite quickly. Presumably if you let it run for long enough, then the oldest generation will eventually fill up and you might get a delay of a few tenths of a second as the GC copies all your long-lived data, but that's much better than paying that cost every frame.
All that said, there is a ticket #9052 about adding a stable generation, which would also suit your needs nicely. See there for more details.
I would try compiling -with-rtsopts and then playing with the heap (-H) and/or allocator (-A) options. Those greatly influence how the GC works.
More info here: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime-control.html
To add to Reid's answer, I've found performMinorGC (added in https://ghc.haskell.org/trac/ghc/ticket/8257) to be the best of both worlds here.
Without any explicit GC scheduling, I still get frequent collection-related frame drops as the nursery becomes exhausted. But performGC indeed becomes performance-killing as soon as there is any significant long-lived memory usage.
performMinorGC does what we want, ignoring long-term memory and cleaning up the garbage from each frame predictably - especially if you tune -H and -A to ensure that the per-frame garbage fits in the nursery.

Resources