Unexpected memory usage of List<T> - c#-4.0

I always thought that the default constructor for List would initialize a list with a capacity of 4 and that the capacity would be doubled when adding the 5th element, etc...
In my application I make a lot of lists (tree like structure where each node can have many children), some of these nodes won't have any children and since my application was fast but was also using a bit much memory I decided to use the constructor where I can specify the capacity and have set this at 1.
The strange thing now is that the memory usage when I start with a capacity of 1 is about 15% higher then when I use the default constructor. It can't be because of a better fit with 4 since the doubling would be 1,2,4. So why this extra increase in memory usage? As an extra test I've tried to start with a capacity of 4. This time again the memory usage was 15% higher then when using no specified capacity.
Now this really isn't a problem, but it bothers me that a pretty simple data structure that I've used for years has some extra logic that I didn't know about yet. Does anyone have an idea of the inner workings of List in this aspect?

That's because if you use the default constructor the internal storage array is set to an empty array, but if you use the constructor with a set size an array of the correct size gets set immediately, instead of being generated on the first call to Add.
You can see this using a decompiler like JustDecompile:
public List(int capacity)
{
if (capacity < 0)
{
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.capacity, ExceptionResource.ArgumentOutOfRange_NeedNonNegNum);
}
this._items = new T[capacity];
}
public List()
{
this._items = List<T>._emptyArray;
}
If you look at the Add function it calls EnsureCapacity, which will enlarge the internal storage array if required. Obviously, if the array is set to an empty array initially the first add will create the default size array.

Related

Why GBuffers need to be created for each frame in D3D12?

I have experience with D3D11 and want to learn D3D12. I am reading the official D3D12 multithread example and don't understand why the shadow map (generated in the first pass as a DSV, consumed in the second pass as SRV) is created for each frame (actually only 2 copies, as the FrameResource is reused every 2 frames).
The code that creates the shadow map resource is here, in the FrameResource class, instances of which is created here.
There is actually another resource that is created for each frame, the constant buffer. I kind of understand the constant buffer. Because it is written by CPU (D3D11 dynamic usage) and need to remain unchanged until the GPU finish using it, so there need to be 2 copies. However, I don't understand why the shadow map needs to do the same, because it is only modified by GPU (D3D11 default usage), and there are fence commands to separate reading and writing to that texture anyway. As long as the GPU follows the fence, a single texture should be enough for the GPU to work correctly. Where am I wrong?
Thanks in advance.
EDIT
According to the comment below, the "fence" I mentioned above should more accurately be called "resource barrier".
The key issue is that you don't want to stall the GPU for best performance. Double-buffering is a minimal requirement, but typically triple-buffering is better for smoothing out frame-to-frame rendering spikes, etc.
FWIW, the default behavior of DXGI Present is to stall only after you have submitted THREE frames of work, not two.
Of course, there's a trade-off between triple-buffering and input responsiveness, but if you are maintaining 60 Hz or better than it's likely not noticeable.
With all that said, typically you don't need to double-buffered depth/stencil buffers for rendering, although if you wanted to make the initial write of the depth-buffer overlap with the read of the previous depth-buffer passes then you would want distinct buffers per frame for performance and correctness.
The 'writes' are all complete before the 'reads' in DX12 because of the injection of the 'Resource Barrier' into the command-list:
void FrameResource::SwapBarriers()
{
// Transition the shadow map from writeable to readable.
m_commandLists[CommandListMid]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_DEPTH_WRITE, D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE));
}
void FrameResource::Finish()
{
m_commandLists[CommandListPost]->ResourceBarrier(1, &CD3DX12_RESOURCE_BARRIER::Transition(m_shadowTexture.Get(), D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE, D3D12_RESOURCE_STATE_DEPTH_WRITE));
}
Note that this sample is a port/rewrite of the older legacy DirectX SDK sample MultithreadedRendering11, so it may be just an artifact of convenience to have two shadow buffers instead of just one.

Is Static Dynamic Array not allowed in RPGLE sub-procedure

SEQ is throwing a RNF3772 error if I try to declare a Static Dynamic array in a RPGLE sub-procedure. Is static dynamic array not allowed in sub-procedure?
Below is an example of what I entered in SEQ. The error that I got is "The keyword is not allowed following keyword STATIC; keyword is ignored."
P proc1 B
D pi
D myArray s 10 dim(1000) static based(myArray_p)
P E
static means the memory is kept (allocated) locally between calls
based means that no memory is allocated locally
So yeah, the two are mutually exclusive...
Unless you're %alloc() memory yourself, there's no dynamic arrays in RPG...I think even the new "dynamic arrays" in 7.4 actually just allocate the max memory. What's nice is they keep track of how many elements are used automatically.
edit2 As Barbara called out, if you are doing %Alloc()/%Realloc() yourself, then all you need is the basing pointer declared static I'd include a parm to indicate the memory should be cleaned up.
P proc1 B
D pi
d cleanUp n value
D myArray s 10 dim(1000) based(myArray_p)
d myArray_p s * static
if cleanUp;
dealloc(myArray_p);
return;
endif;
P E
Just use static. Same memory requirements as if you'd used a global variable, but hidden inside the procedure.
If you really want dynamic arrays, you could build your own routines in a *SRVPGM to use. Or you could make use of some open source.
RPG Next Gen - Vector
RPG Array List/Linked List
RPGMap
Dynamic Array using a user space
With an actual dynamic array, you'd likely end up with a pointer (or maybe an integer) variable in your procedure that you'd want defined as STATIC so that it remains between calls.
You'll also need to consider how to clean up the memory when you are done.
To define your array as based, but have it retain its values between calls, you have to define the basing pointer as static. It will probably be impossible to free the allocated storage except by reclaiming the activation group, unless your procedure has a way of knowing the array is no longer needed for future calls.

OpenCL: What if i have more tasks then available work items?

Let's make an example:
i want vector dot product made concurrently (it's not my case, this is only an example) so i have 2 large input vectors and a large output vector with the same size. the work items aviable are less then the sizes of these vectors. How can i make this dot product in opencl if the work items are less then the size of the vectors? Is this possible? Or i have just to make some tricks?
Something like:
for(i = 0; i < n; i++){
output[i] = input1[i]*input2[i];
}
with n > available work items
If by "available work items" you mean you're running into the maximum given by CL_DEVICE_MAX_WORK_ITEM_SIZES, you can always enqueue your kernel multiple times for different ranges of the array.
Depending on your actual workload, it may be more sensible to make each work item perform more work though. In the simplest case, you can use the SIMD types such as float4, float8, float16, etc. and operate on large chunks like that in one go. As always though, there is no replacement for trying different approaches and measuring the performance of each.
Divide and conquer data. If you keep workgroup size as an integer divident of global work size, then you can have N workgroup launches perhaps k of them at once per kernel launch. So you should just launch N/k kernels each with k*workgroup_size workitems and proper addressing of buffers inside kernels.
When you have per-workgroup partial sums of partial dot products(with multiple in-group reduction steps), you can simply sum them on CPU or on whichever device that data is going to.

Objective C For loops with #autorelease and ARC

As part of an app that allows auditors to create findings and associate photos to them (Saved as Base64 strings due to a limitation on the web service) I have to loop through all findings and their photos within an audit and set their sync value to true.
Whilst I perform this loop I see a memory spike jumping from around 40MB up to 500MB (for roughly 350 photos and 255 findings) and this number never goes down. On average our users are creating around 1000 findings and 500-700 photos before attempting to use this feature. I have attempted to use #autorelease pools to keep the memory down but it never seems to get released.
for (Finding * __autoreleasing f in self.audit.findings){
#autoreleasepool {
[f setToSync:#YES];
NSLog(#"%#", f.idFinding);
for (FindingPhoto * __autoreleasing p in f.photos){
#autoreleasepool {
[p setToSync:#YES];
p = nil;
}
}
f = nil;
}
}
The relationships and retain cycles look like this
Audit has a strong reference to Finding
Finding has a weak reference to Audit and a strong reference to FindingPhoto
FindingPhoto has a weak reference to Finding
What am I missing in terms of being able to effectively loop through these objects and set their properties without causing such a huge spike in memory. I'm assuming it's got something to do with so many Base64 strings being loaded into memory when looping through but never being released.
So, first, make sure you have a batch size set on the fetch request. Choose a relatively small number, but not too small because this isn't for UI processing. You want to batch a reasonable number of objects into memory to reduce loading overhead while keeping memory usage down. Try 50 or 100 and see how it goes, then consider upping the batch size a little.
If all of the objects you're loading are managed objects then the correct way to evict them during processing is to turn them into faults. That's done by calling refreshObject:mergeChanges: on the context. BUT - that discards any changes, and your loop is specifically there to make changes.
So, what you should really be doing is batch saving the objects you've modified and then turning those objects back into faults to remove the data from memory.
So, in your loop, keep a counter of how many you've modified and save the context each time you hit that count and refresh all the objects that were processed so far. The batch on the fetch and the batch size to save should be the same number.
There's probably a big difference in size between your "Finding" objects and the associated images. So your primary aim should be to redesign your database in a way so that unfaulting (loading) a Finding object does not automatically load the base64 encoded image.
That's actually one of the major strengths of Code Data: Loading part of an object hierarchy. Just try to move the base64 encoded data to an own (managed) object so that Core Data does not load it. It will still be loaded as needed when the reference is touched.

How should one use Disruptor (Disruptor Pattern) to build real-world message systems?

As the RingBuffer up-front allocates objects of a given type, how can you use a single ring buffer to process messages of various different types?
You can't create new object instances to insert into the ringBuffer and that would defeat the purpose of up-front allocation.
So you could have 3 messages in an async messaging pattern:
NewOrderRequest
NewOrderCreated
NewOrderRejected
So my question is how are you meant to use the Disruptor pattern for real-world messageing systems?
Thanks
Links:
http://code.google.com/p/disruptor-net/wiki/CodeExamples
http://code.google.com/p/disruptor-net
http://code.google.com/p/disruptor
One approach (our most common pattern) is to store the message in its marshalled form, i.e. as a byte array. For incoming requests e.g. Fix messages, binary message, are quickly pulled of the network and placed in the ring buffer. The unmarshalling and dispatch of different types of messages are handled by EventProcessors (Consumers) on that ring buffer. For outbound requests, the message is serialised into the preallocated byte array that forms the entry in the ring buffer.
If you are using some fixed size byte array as the preallocated entry, some additional logic is required to handle overflow for larger messages. I.e. pick a reasonable default size and if it is exceeded allocate a temporary array that is bigger. Then discard it when the entry is reused or consumed (depending on your use case) reverting back to the original preallocated byte array.
If you have different consumers for different message types you could quickly identify if your consumer is interested in the specific message either by knowing an offset into the byte array that carries the type information or by passing a discriminator value through on the entry.
Also there is no rule against creating object instances and passing references (we do this in a couple of places too). You do lose the benefits of object preallocation, however one of the design goals of the disruptor was to allow the user the choice of the most appropriate form of storage.
There is a library called Javolution (http://javolution.org/) that let's you defined objects as structs with fixed-length fields like string[40] etc. that rely on byte-buffers internally instead of variable size objects... that allows the token ring to be initialized with fixed size objects and thus (hopefully) contiguous blocks of memory that allow the cache to work more efficiently.
We are using that for passing events / messages and use standard strings etc. for our business-logic.
Back to object pools.
The following is an hypothesis.
If you will have 3 types of messages (A,B,C), you can make 3 arrays of those pre-allocated. That will create 3 memory zones A, B, C.
It's not like there is only one cache line, there are many and they don't have to be contiguous. Some cache lines will refer to something in zone A, other B and other C.
So the ring buffer entry can have 1 reference to a common ancestor or interface of A & B & C.
The problem is to select the instance in the pools; the simplest is to have the same array length as the ring buffer length. This implies a lot of wasted pooled objects since only one of the 3 is ever used at any entry, ex: ring buffer entry 1234 might be using message B[1234] but A[1234] and C[1234] are not used and unusable by anyone.
You could also make a super-entry with all 3 A+B+C instance inlined and indicate the type with some byte or enum. Just as wasteful on memory size, but looks a bit worse because of the fatness of the entry. For example a reader only working on C messages will have less cache locality.
I hope I'm not too wrong with this hypothesis.

Resources