Python - How to remove reference to an object without copying it

I have the following problem:
In Ray, all references to an object must be dropped before the memory that the object occupies can be released. So I need something like:
    import copy
    import ray

    def some_func():
        # results is a very complex object with arrays, lists, dictionaries
        results = ray.get(hash_for_given_job)
        results_copy = copy.deepcopy(results)
        del results
        return results_copy
Nevertheless, this means the object occupies twice its memory at some point in time, increasing RAM usage. How can I remove a reference and return the object without copying it?

I strongly suspect you don't actually need to deep copy results. Objects in the Ray object store are immutable. When you call ray.get() you're already creating a copy of the object, so if you do something like
    results = ray.get(hash_for_given_job)
    results_copy = ray.get(hash_for_given_job)
    results.a = "hello"
    assert results.a != results_copy.a
This would hold true regardless of whether results and results_copy are in the same function, different tasks, different machines, etc.
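Put differently, the deepcopy/del dance in the question can simply be dropped. A minimal sketch (keeping the asker's hash_for_given_job name for the ObjectRef; not verified against any particular Ray version):

    import ray

    def some_func(hash_for_given_job):
        # ray.get() already materializes its own copy of the immutable
        # stored object in this process, so returning it directly neither
        # keeps an extra reference nor doubles memory usage
        return ray.get(hash_for_given_job)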

Related

Which SWIG functions require %newobject to avoid a memory leak in Python

If we return a value created by SWIG_NewPointerObj without flags, it will have a reference count of 1 and will never be garbage collected. To avoid this I have to use %newobject in the .i file.
However, there are quite a few SWIG primitives that could be used to return a value into Python. I suspect that some of them should also be annotated with %newobject.
Is there a list, or should I blindly annotate all functions returning anything into Python?
Is there a way to apply %newobject globally or per file?
For SWIG_NewPointerObj, Python will track the reference if needed; set the own flag. If that return value isn't assigned to a variable, its reference count will be decremented. %newobject isn't intended for Python objects.
%newobject is used to indicate that a function returns a C/C++-allocated object that isn't reference counted, such as return new char[50]; or return (char*)malloc(1000);. In the previous example SWIG will create a Python string by copying the data, but it doesn't know to free the returned pointer without %newobject. You may also need to write a %typemap(newfree) if SWIG doesn't already have one for the type.
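For illustration, a small interface-file sketch of the pattern described above (the module and function names are invented; the %typemap(newfree) for char * mirrors the one in the SWIG documentation):

    /* example.i */
    %module example

    %newobject make_label;                /* wrapper frees the result... */
    %typemap(newfree) char * "free($1);"  /* ...using free()             */

    %{
    extern char *make_label(void);        /* returns malloc'd memory */
    %}
    extern char *make_label(void);

As for applying it more broadly, the SWIG manual shows wildcard forms such as %newobject *::copy; that match many declarations at once.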
References:
11.3.3 Using %newobject to release memory
14.2 Object ownership and %newobject

How to explicitly release object after creating it with `ray.put`?

I'm trying to get rid of an object pinned in shared memory using ray.put.
Here is a code sample:

    import ray

    <create obj>
    for ...:
        obj_id = ray.put(obj)
        <do stuff with obj_id on ray Actors using ray.get(obj_id)>
        del obj_id
After this is finished, I look at the Ray dashboard and see that all the obj_ids are still in Ray shared memory with reference type LOCAL_REFERENCE.
The official docs do not elaborate on whether there is any way of explicitly controlling object lifetime. As far as I understood, they basically suggest waiting until all memory is used and then relying on Ray to clean things up.
Question: how do I explicitly purge an object from Ray shared memory?
Note: I'm using Jupyter; could the object still be alive because of that?
The function ray.internal.internal_api.free() does exactly this. I can't find any documentation for it in the Ray docs, but it has a good docstring you can find here, which I've copy-pasted below.
    Free a list of IDs from the in-process and plasma object stores.

    This function is a low-level API which should be used in restricted
    scenarios.

    If local_only is false, the request will be sent to all object stores.

    This method will not return any value to indicate whether the deletion is
    successful or not. This function is an instruction to the object store. If
    some of the objects are in use, the object stores will delete them later
    when the ref count is down to 0.

    Examples:
        >>> x_id = f.remote()
        >>> ray.get(x_id)  # wait for x to be created first
        >>> free([x_id])  # unpin & delete x globally

    Args:
        object_refs (List[ObjectRef]): List of object refs to delete.
        local_only (bool): Whether only deleting the list of objects in local
            object store or all object stores.
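A hedged usage sketch (free() is a private, undocumented API, so the import path below may differ between Ray versions):

    import ray
    from ray.internal.internal_api import free  # private API; path may vary

    ray.init()

    obj_ref = ray.put(list(range(10**6)))  # pins the object in the store
    # ... hand obj_ref to actors/tasks ...
    free([obj_ref])  # instruct the object store(s) to drop it
    del obj_ref      # and drop the local reference too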

Array assignment using multiprocessing

I have a uniform 2D coordinate grid stored in a numpy array. The values of this array are assigned by a function that looks roughly like the following:
    def update_grid(grid):
        n, m = grid.shape
        for i in range(n):
            for j in range(m):
                # assignment
Calling this function takes 5-10 seconds for a 100x100 grid, and it needs to be called several hundred times during the execution of my main program. It is the rate-limiting step, so I want to reduce its processing time as much as possible.
I believe that the assignment expression inside can be split up in a manner which accommodates multiprocessing. The value at each gridpoint is independent of the others, so the assignments can be split something like this:
    def update_grid(grid):
        n, m = grid.shape
        for i in range(n):
            for j in range(m):
                p = Process(target=#assignment)
                p.start()
So my questions are:

1. Does the above loop structure ensure each process will only operate on a single gridpoint? Do I need anything else to allow each process to write to the same array, even if they're writing to different places in that array?
2. The assignment expression requires a set of parameters. These are constant, but each process will be reading them at the same time. Is this okay?
3. To explicitly write the code I've structured above, I would need to define my assignment expression as another function inside update_grid, correct?
4. Is this actually worthwhile?

Thanks in advance.
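Not the asker's code, but a sketch of the usual shape of this approach: a worker pool with one task per gridpoint (batched via chunksize), with results written back in the parent so no shared array is needed. some_expression is a hypothetical stand-in for the real assignment expression:

    from multiprocessing import Pool

    def some_expression(i, j, params):
        # hypothetical stand-in for the asker's assignment expression
        return (i * j) % params

    def compute_cell(args):
        i, j, params = args  # params is constant; read-only sharing is fine
        return i, j, some_expression(i, j, params)

    def update_grid(grid, params):
        n, m = grid.shape
        tasks = ((i, j, params) for i in range(n) for j in range(m))
        with Pool() as pool:
            # chunksize batches gridpoints per worker; a bare Process per
            # gridpoint would drown in startup and IPC overhead
            for i, j, value in pool.imap_unordered(compute_cell, tasks,
                                                   chunksize=m):
                grid[i, j] = value  # written in the parent process
        return grid

Whether this is worthwhile depends on how expensive the per-cell expression is; for cheap expressions, vectorizing with NumPy usually beats multiprocessing.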
Edit:
I still haven't figured out how to speed up the assignment, but I was able to avoid the problem by changing my main program. It no longer needs to update the entire grid with each iteration, and instead tracks changes and only updates what has changed. This cut the execution time of my program down from an hour to less than a minute.
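A sketch of that change-tracking idea (names are invented; some_expression as in the sketch above):

    def update_changed(grid, dirty, params):
        # 'dirty' is a set of (i, j) indices whose inputs changed since
        # the last call; only those cells are recomputed
        for i, j in dirty:
            grid[i, j] = some_expression(i, j, params)
        dirty.clear()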

Should the packets generated from the SystemVerilog generator module be destroyed to save simulation time?

Every time I call generate_directed_packet a new object is created. Should I worry about deleting the packet object before creating the next one? If so, how should I go about deleting it?
    function void generate_directed_packet();
        packet = new();
        void'(packet.randomize());
    endfunction : generate_directed_packet
SystemVerilog has automatic memory management. That means it keeps an object around only as long as there is a class variable containing a handle to it. The simulator "deletes" the object once no class variable holds a handle to it. The "delete" is in quotes because you have no knowledge of when the object is deleted; more likely the simulator keeps the space around until another new() of the same class and reclaims it then.
If you are using the UVM, the typical case is you are generating packets in a sequence, and sending them to a driver. What you are really doing is creating a handle to a new object in the sequence, and then copying the handle from variables in the sequence to variables in the driver.
As you copy the handle from one variable to another, you are erasing the reference to the older object. So as long as you are not accumulating handles to your objects in a data structure that grows as you add more of them, the space from the old objects gets reclaimed.
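Concretely, reusing a single handle is all the "cleanup" that is needed; a sketch (class and field names invented):

    class packet;
        rand bit [7:0] payload;
    endclass

    packet pkt;  // one class variable, reused on every call

    function void generate_directed_packet();
        // the previous object loses its last handle on this assignment
        // and becomes eligible for reclamation; SystemVerilog has no
        // manual delete for class objects
        pkt = new();
        void'(pkt.randomize());
    endfunction : generate_directed_packet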

Design of a high-performance sorted data structure read by many threads and written by few

I have an interesting data structure design problem that is beyond my current expertise. I'm seeking data structure or algorithm suggestions for tackling it.
The requirements:
- Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location.
- In a highly threaded application, many threads will look up values, to see whether a specific pointer falls within one of the (address, size) pairs - that is, treating each pair as a memory range, whether the pointer lies within any range in the list. Threads will much more rarely add or remove entries.
- Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second.
- Adding or removing values, i.e. mutating the list, happens much more rarely; performance there is not as important.
- It is acceptable but not ideal for the list contents to be out of date, i.e. for a thread's lookup to miss an entry that should exist, so long as the entry exists at some point.
I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?
Tagged with Delphi since I am using that language for this task; language-agnostic answers are very welcome. However, I probably cannot use any of the standard libraries in any language without a lot of care. The reason is that memory access (allocation, freeing, etc. of objects and their internal memory, e.g. tree nodes) is strictly controlled and must go through my own functions. My current code elsewhere in the same program uses red/black trees and a bit trie, and I've written these myself. Object and node allocation runs through custom memory allocation routines. It's beyond the scope of the question, but is mentioned here to avoid an answer like 'use STL structure foo.' I'm keen for an algorithmic or structure answer that, so long as I have the right references or textbooks, I can implement myself.
I would use a TDictionary<Pointer, Integer> (from Generics.Collections) combined with a TMREWSync (from SysUtils) for multi-read exclusive-write access. TMREWSync allows multiple readers simultaneous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.
If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.
EDIT: Just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory address) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1), though.
Exploring the replication idea a bit...
From the correctness point of view, reader/writer locks will do the job. However, in practice, while readers may be able to proceed concurrently and in parallel when accessing the structure, they will create huge contention on the lock, for the obvious reason that locking even for read access involves writing to the lock itself. This will kill performance on a multi-core system, and even more so on a multi-socket system.
The reason for the low performance is the cache-line invalidation/transfer traffic between cores/sockets. (As a side note, here's a very recent and very interesting study on the subject: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.)
Naturally, we can avoid the inter-core cache transfers triggered by readers by making a copy of the structure on each core and restricting reader threads to accessing only the copy local to the core they are currently executing on. This requires some mechanism for a thread to obtain its current core id. It also relies on the operating system scheduler not moving threads gratuitously across cores, i.e. maintaining core affinity to some extent. AFAICT, most current operating systems do.
As for the writers, their job would be to update all the existing replicas, obtaining each lock for writing. Updating one tree (apparently the structure should be some tree) at a time does mean a temporary inconsistency between replicas. From the problem description this seems acceptable. When a writer works, it will block readers on a single core, but not all readers. The drawback is that a writer has to perform the same work many times - as many times as there are cores or sockets in the system.
PS.
Maybe, just maybe, another alternative is some RCU-like approach, but I don't know
this well, so I'll just stop after mentioning it :)
With replication you could have:

- one copy of your data structure (list with binary search, the interval tree mentioned, ...) - say, the "original" one - that is used only for lookups (read access);
- a second copy, the "update" one, created when the data is to be altered (write access), so the write is made to the update copy.

Once writing completes, switch some "current" pointer from the "original" to the "update" version. With an access counter on the "original" copy, it can be destroyed once the counter has decremented back to zero readers.
In pseudo-code:
// read:
data = get4Read();
... do the lookup
release4Read(data);
// write
data = get4Write();
... alter the data
release4Write(data);
// implementation:
// current is the datat structure + a 'readers' counter, initially set to '0'
get4Read() {
lock(current_lock) { // exclusive access to current
current.readers++; // one more reader
return current;
}
}
release4Read(copy) {
lock(current_lock) { // exclusive access to current
if(0 == --copy.readers) { // last reader
if(copy != current) { // it was the old, "original" one
delete(copy); // destroy it
}
}
}
}
get4Write() {
aquire_writelock(update_lock); // blocks concurrent writers!
var copy_from = get4Read();
var copy_to = deep_copy(copy_from);
copy_to.readers = 0;
return copy_to;
}
release4Write(data) {
lock(current_lock) { // exclusive access to current
var copy_from = current;
current = data;
}
release4Read(copy_from);
release_writelock(update_lock); // next write can come
}
To complete the answer regarding the actual data structure to use: given the fixed size of the data entries (a tuple of two integers), which is also quite small, I would use an array for storage and binary search for the lookup. (An alternative would be the balanced tree mentioned in the comments.)
Talking about performance: as I understand it, 'address' and 'size' define ranges, so looking up whether a given address lies within such a range would involve the addition 'address' + 'size' (to compare the queried address against the range's upper bound) over and over again. It may be more performant to store start and end addresses explicitly, instead of start address and size, to avoid this repeated addition.
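A minimal single-threaded sketch of that layout (Python for brevity, given the "language-agnostic answers welcome" note; it assumes the ranges never overlap and leaves out the synchronization discussed above):

    import bisect

    # parallel arrays sorted by start address; storing end addresses rather
    # than sizes avoids re-adding 'size' on every probe
    starts = []
    ends = []

    def add_range(start, size):
        i = bisect.bisect_left(starts, start)
        starts.insert(i, start)
        ends.insert(i, start + size)  # the one addition happens at insert time

    def contains(address):
        # index of the rightmost range starting at or before 'address'
        i = bisect.bisect_right(starts, address) - 1
        return i >= 0 and address < ends[i]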
Read the LMDB design papers at http://symas.com/mdb/ . An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy, writes may optionally be zero-copy as well. Can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)
As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers
This is just to answer a comment by @fast; the comment space is not big enough...
@chill: Where do you see the need to place any 'memory barriers'?

Everywhere where you access shared storage from two different cores. For example, a writer comes, makes a copy of the data and then calls release4Write. Inside release4Write, the writer does the assignment current = data to update the shared pointer with the location of the new data, decrements the counter of the old copy to zero, and proceeds with deleting it. Now a reader intervenes and calls get4Read. Inside get4Read it does copy = current. Since there's no memory barrier, this happens to read the old value of current. For all we know, the write may be reordered after the delete call, or the new value of current may still reside in the writer's store queue, or the reader may not yet have seen and processed a corresponding cache-invalidation request, and whatnot... Now the reader happily proceeds to search in the copy of the data that the writer is deleting or has just deleted. Oops!
But, wait, there's more! :D
With proper use of the get..() and release..() functions, where do you see the problems of accessing deleted data or multiple deletion?

See the following interleaving of reader and writer operations.
    Reader                        Shared data     Writer
    ======                        ===========     ======
                                  current = A:0
    data = get4Read():
      var copy = A:0
      copy.readers++;             current = A:1
      return A:1
    data = A:1
    ... do the lookup
    release4Read(copy == A:1):
      --copy.readers              current = A:0
      0 == copy.readers -> true
                                                  data = get4Write():
                                                    acquire_writelock(update_lock)
                                                    var copy_from = get4Read():
                                                      var copy = A:0
                                                      copy.readers++;
                                  current = A:1
                                                      return A:1
                                                    copy_from == A:1
                                                    var copy_to = deep_copy(A:1);
                                                    copy_to == B:1
                                                    return B:1
                                                  data == B:1
                                                  ... alter the data
                                                  release4Write(data = B:1):
                                                    var copy_from = current;
                                                    copy_from == A:1
                                  current = B:1     current = B:1
      A:1 != B:1 -> true
      delete A:1
                                                    !!! release4Read(A:1) !!!
And the writer accesses deleted data and then tries to delete it again. Double oops!
