d3dBuffer created with various sizes not released - memory-leaks

I noticed a problem with the DirectX api (driver from AMD).
If I create a d3d buffer using CreateBuffer() with incremental sizes and release it in a for loop, the memory diagnostic tools show that the process's private bytes are constantly increasing. I think it might be because GPU-mapped system memory is never released. Note that the CPU heap size is stable.
For 1000 iterations, the buffer size goes from 1 MB up to 1000 MB.
CreateBuffer using
D3D11_USAGE_STAGING/D3D11_USAGE_DYNAMIC
& D3D11_CPU_ACCESS_WRITE
d3dbuffer.Release() and d3dbuffer = nullptr
context ClearState() and Flush() to synchronously release the d3d buffer
const unsigned long long MB = 1024ULL * 1024ULL;
for (unsigned long long totalSize = 1 * MB; totalSize <= 1000 * MB; totalSize += 1 * MB)
{
    // create
    CComPtr<ID3D11Buffer> pd3dBuffer;
    D3D11_BUFFER_DESC bufferDesc;
    ZeroMemory(&bufferDesc, sizeof(bufferDesc));
    bufferDesc.ByteWidth = static_cast<UINT>(totalSize);
    bufferDesc.Usage = D3D11_USAGE_DYNAMIC;
    bufferDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    bufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    bufferDesc.MiscFlags = 0;
    bufferDesc.StructureByteStride = 0;

    HRESULT hr = pd3dDevice->CreateBuffer(&bufferDesc, NULL, &pd3dBuffer);
    if (FAILED(hr)) break;

    // release
    pd3dBuffer.Release();
    pd3dDeviceContext->ClearState();
    pd3dDeviceContext->Flush();
}
The process memory usage keeps going up, eventually reaches my 16 GB physical memory limit, and the process crashes. This is weird, as I synchronously release each buffer right after creation; the process memory usage should be relatively stable.
Can anyone explain how DirectX memory management works here?

After some investigation, this weird behavior turned out to be caused by the DirectX implementation. Basically, Microsoft, as an OS company, only defines the interfaces/specs for the DirectX APIs; 3rd-party GPU vendors have to provide their own implementations of these APIs.
Nvidia provides one, AMD provides one, and if suddenly Qualcomm wants to support DirectX, it will have to write one as well.
The buffer allocation and release mechanism is not fully defined in the DirectX specs, so it is up to the vendors to provide optimal memory management algorithms. Because of a bug in the vendor's implementation that fails to recycle released memory handles, the system committed memory associated with the buffer is never released, which causes this leak.


Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to set the SO_ZEROCOPY option on the socket and pass the MSG_ZEROCOPY flag to send():
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
perror("send");
return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
.iov_base = buf,
.iov_len = len
};
pipe(pipes);
sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
perror("vmsplice");
return -1;
}
sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
perror("splice");
return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() to a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything is working just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
int res;
...
for (i = 0; i < (1 << page_order); ++i) {
if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
break;
}
}
vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
...
return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < (1 << page_order); ++i) {
__free_page(&page[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than the single-buffer calls above.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)
Maybe this will help you understand why alloc_pages requires a power-of-2 page count.
To optimize the page allocation process (and decrease external fragmentation), which is engaged frequently, the Linux kernel developed the per-CPU page cache and the buddy allocator to allocate memory (there is another allocator, slab, that serves allocations smaller than a page).
The per-CPU page cache serves single-page allocation requests, while the buddy allocator keeps 11 lists, containing 2^0 through 2^10 physical pages respectively. These lists perform well when allocating and freeing pages, and of course the premise is that you are requesting a power-of-2-sized buffer.

OpenCL - how to effectively distribute work items to different devices

I'm writing an OpenCL application where I have N work items that I want to distribute to D devices, where N > D, and in turn each device can process the elements of its own work item in parallel, thus achieving a sort of "double" parallelism.
Here is the code I have written already to try and achieve this.
First I create an event for each of my devices and set them all to complete:
cl_int err;
cl_event *events = new cl_event[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
events[i] = clCreateUserEvent(context, &err);
err = clSetUserEventStatus(events[i], CL_COMPLETE);
}
Each device also has its own command queue and its own "instance" of a kernel.
Then I enter into my "main loop" for distributing work items. The code finds the first available device and enqueues it with the work item.
/*---Loop over all available jobs---*/
for(int i = 0; i < numWorkItems; i++)
{
WorkItem item = workItems[i];
bool found = false; //Check for device availability
int index = -1; //Index of found device
while(!found) //Continuously loop until free device is found.
{
for(int j = 0; j < deviceCount; j++) //Total number of CPUs + GPUs
{
cl_int status;
err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &status, NULL);
if(status == CL_COMPLETE) /*Current device has completed all of its tasks*/
{
found = true; //Exit infinite loop
index = j; //Choose current device
break; //Break out of inner loop
}
}
}
//Enqueue my kernel
clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &item);
clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL, &glob, &loc, 0, NULL, &events[index]);
clFlush(queues[index]);
}
And then finally I wrap up by calling clFinish on all my devices:
/*---Wait For Completion---*/
for(int i = 0; i < deviceCount; i++)
{
clFinish(queues[i]);
}
This approach has a few problems however:
1) It doesn't distribute the work to all my devices. On my current computer I have 3 devices. My algorithm above only distributes the work to devices 1 and 2. Device 3 always gets left out because devices 1 and 2 finish so quickly that they can snatch up more work items before 3 gets a chance.
2) Even with devices 1 and 2 running together, I only see a very mild speed increase. For instance, if I were to assign all work items to device 1 it might take 10 seconds to complete, and if I assign them all to device 2 it might take 11 seconds, but if I split the work between them it takes 8-9 seconds combined, when what I would hope for is 4-5 seconds. I get the feeling they are not really running in parallel with each other the way I want.
How do I fix these issues?
You have to be careful with the sizes and the memory location. Typically these factors are not considered when dealing with GPU devices. I would ask you:
What are the kernel sizes?
How fast do they finish?
If the kernel sizes are small and they finish quickly, then the overhead of launching them will be high, so the finer granularity of distributing them across many devices does not overcome the extra overhead. In that case it is better to directly increase the work size and use 1 device only.
Are the kernels independent? Do they use different buffers?
Another important thing is to have completely different memory for each device; otherwise the memory thrashing between devices will delay the kernel launches, and in that case a single device (holding all the memory buffers locally) will perform better.
OpenCL will copy to a device all the buffers a kernel uses, and will "block" all the kernels (even on other devices) that use buffers the kernel is writing to; it will wait for that kernel to finish and then copy the buffer back to the other device.
Is the host a bottleneck?
The host is sometimes not as fast as you may think, and sometimes the kernels run so fast that the host becomes a big bottleneck scheduling jobs to them.
If you use the CPU as a CL device, it cannot do both tasks (act as host and run kernels). You should always prefer GPU devices over CPU devices when scheduling kernels.
Never let a device go empty
Waiting until a device has finished execution before queuing more work is typically a very bad idea. You should preemptively queue kernels in advance (1 or 2), even before the current kernel has finished. Otherwise, device utilization will not even reach 80%, since there is a big amount of time between the kernel finishing and the host realizing it, and an even bigger amount until the host queues more data to the device (typically >2ms; for a 10ms kernel, that's 33% wasted).
I would do:
Change this line to also accept submitted jobs: if(status >= CL_SUBMITTED)
Ensure the devices are ordered GPU -> CPU, so the GPUs are devices 0 and 1 and the CPU is device 2.
Try removing the CPU device (using only the GPUs). Maybe the speed is even better.

DMA memcpy operation in Linux

I want to do DMA using the dma_async_memcpy_buf_to_buf function, which is in dmaengine.c (linux/drivers/dma). For this, I added a function to dmatest.c (linux/drivers/dma) as follows:
void foo(void)
{
    int index = 0;
    dma_cookie_t cookie;
    size_t len = 0x20000;
    ktime_t start, end;
    s64 actual_time;
    u16 *dest;
    u16 *src;

    dest = kmalloc(len, GFP_KERNEL);
    src = kmalloc(len, GFP_KERNEL);
    for (index = 0; index < len / 2; index++) {
        dest[index] = 0xAA55;
        src[index] = 0xDEAD;
    }

    start = ktime_get();
    cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
        dma_sync_wait(chan, cookie);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution dma: %lld\n", (long long)actual_time);

    memset(dest, 0, len);

    start = ktime_get();
    memcpy(dest, src, len);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution non-dma: %lld\n", (long long)actual_time);
}
There are some issues with DMA:
Interestingly, the memcpy execution time is less than that of dma_async_memcpy_buf_to_buf. Maybe it's related to a ktime_get() problem.
Is my approach with the foo function correct for performing a DMA operation? I'm not sure about this.
How can I measure the tick counts of memcpy and dma_async_memcpy_buf_to_buf in terms of CPU usage?
Finally, is a DMA operation possible at the application level? So far I have used it at the kernel level, as you can see above (dmatest.c is inserted as a kernel module).
There are multiple issues in your question, which makes it kind of hard to answer exactly what you're asking:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference in using plain memcpy and DMA operations for copying memory is not getting direct performance gains, but (a) performance gains due to sustaining CPU cache / prefetcher state when using DMA operation (which is likely would be garbled when using plain old memcpy, executed on CPU itself), and (b) true background operation which leaves CPU available to do other stuff.
Given (a), it's kind of pointless to use DMA operations on anything less than CPU cache size, i.e. dozens of megabytes. Typically it's done for purposes of fast off-CPU stream processing, i.e. moving data that would be anyhow produced/consumed by external devices, such as fast networking cards, video streaming / capturing / encoding hardware, etc.
Comparing async and sync operations in terms of wall clock elapsed time is wrong. There might be hundreds of threads / processes running, and no one guarantees that you'd get scheduled the next tick and not several thousand ticks later.
Using ktime_get for benchmarking purposes is wrong - it's fairly imprecise, especially for given such short jobs. Profiling kernel code in fact is a pretty hard and complex task which is well beyond the scope of this question. A quick recommendation here would be to refrain at all from such micro-benchmarks and profile a much bigger and more complete job - similar to what you're ultimately trying to achieve.
Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.
Using DMA copy operations at the application level is fairly pointless - at least I can't come up with a single viable scenario, off the top of my head, where it would be worth the trouble. It's not innately faster, and, more importantly, I seriously doubt that your application's performance bottleneck is memory copying. For that to be the case, you generally should be doing everything else faster than regular memory copying, and I can't really think of anything at the application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not application level.
Generally, memory copy performance is usually limited by memory speed, i.e. clock frequency and timings. You aren't going to get any miracle boosts over regular memcpy in direct performance, just because memcpy executed on the CPU is fast enough, as the CPU usually runs at 3x-10x higher clock frequencies than memory.

Converting EFI memory Map to E820 map

I am new to Linux and learning how Linux comes to know about the available physical memory. I came to know there is a BIOS system call, int 0x15, which gives you the E820 memory map.
Now I found a piece of code that says it is a definition for converting an EFI memory map to an E820 memory map. What does that mean?
Does it mean the underlying motherboard firmware is EFI-based, but since this code runs on x86 we need to convert it to an E820 memory map?
If so, does x86 only know about E820 memory maps?
What is the difference between E820 and EFI memory maps?
Looking forward to a detailed answer on this.
In both cases, what you have is your firmware (BIOS or EFI) which is responsible for detecting what memory (and how much) is actually physically plugged in, and the operating system which needs to know this information in some format.
Does it mean the underlying motherboard firmware is EFI-based, but since this code runs on x86 we need to convert it to an E820 memory map?
Your confusion here is that EFI and x86 are incompatible - they aren't. EFI firmware has its own mechanisms for reporting available memory - specifically, you can use the GetMemoryMap boot service (before you invoke ExitBootServices) to retrieve the memory map from the firmware. However, critically, this memory map is in the format the EFI firmware wishes to report (EFI_MEMORY_DESCRIPTOR) rather than E820. In this scenario, you would not also attempt int 15h, since you already have the information you need.
I suspect what the Linux kernel does is use the E820 format as its internal representation of memory on the x86 architecture. However, when booting via EFI, the kernel must use the EFI firmware boot services, but chooses to convert the answer it gets back into the E820 format.
This is not a necessary thing for a kernel you are writing to do. You simply need to know how memory is mapped.
It is also the case that some bootloaders will provide this information for you, for example GRUB. Part of the multiboot specification allows you to instruct the bootloader that it must provide this information to your kernel.
For more on this, the ever-useful osdev wiki has code samples etc. The relevant sections for getting memory maps from grub are here.
Further points:
The OS needs to understand what memory is mapped where for several reasons. One is to avoid using physical memory where firmware services reside; the other is communication with devices that share memory with the CPU. The video buffer is a common example of this.
Secondly, listing the memory map in EFI is not too difficult. If you haven't already discovered it, the UEFI shell that comes with some firmware has a memmap command to display the memory map. If you want to implement this yourself, a quick and dirty way to do that looks like this:
EFI_STATUS EFIAPI PrintMemoryMap(EFI_SYSTEM_TABLE* SystemTable)
{
    EFI_STATUS status = EFI_SUCCESS;
    UINTN MemMapSize = sizeof(EFI_MEMORY_DESCRIPTOR) * 16;
    UINTN MemMapSizeOut = MemMapSize;
    UINTN MemMapKey = 0;
    UINTN MemMapDescriptorSize = 0;
    UINT32 MemMapDescriptorVersion = 0;
    UINTN DescriptorCount = 0;
    UINTN i = 0;
    UINT8* buffer = NULL;
    EFI_MEMORY_DESCRIPTOR* MemoryDescriptorPtr = NULL;

    do
    {
        buffer = AllocatePool(MemMapSize);
        if (buffer == NULL) break;
        MemMapSizeOut = MemMapSize; /* reset each attempt; GetMemoryMap overwrites it */
        status = gBS->GetMemoryMap(&MemMapSizeOut, (EFI_MEMORY_DESCRIPTOR*)buffer,
                                   &MemMapKey, &MemMapDescriptorSize,
                                   &MemMapDescriptorVersion);
        Print(L"MemoryMap: Status %x\n", status);
        if (status != EFI_SUCCESS)
        {
            FreePool(buffer);
            MemMapSize += sizeof(EFI_MEMORY_DESCRIPTOR) * 16;
        }
    } while (status != EFI_SUCCESS);

    if (buffer != NULL)
    {
        DescriptorCount = MemMapSizeOut / MemMapDescriptorSize;
        MemoryDescriptorPtr = (EFI_MEMORY_DESCRIPTOR*)buffer;
        Print(L"MemoryMap: DescriptorCount %d\n", DescriptorCount);
        for (i = 0; i < DescriptorCount; i++)
        {
            MemoryDescriptorPtr = (EFI_MEMORY_DESCRIPTOR*)(buffer + (i * MemMapDescriptorSize));
            Print(L"Type: %d PhysicalStart: %lx VirtualStart: %lx NumberOfPages: %d Attribute: %lx\n",
                  MemoryDescriptorPtr->Type, MemoryDescriptorPtr->PhysicalStart,
                  MemoryDescriptorPtr->VirtualStart, MemoryDescriptorPtr->NumberOfPages,
                  MemoryDescriptorPtr->Attribute);
        }
        FreePool(buffer);
    }
    return status;
}
This is a reasonably straightforward function. GetMemoryMap complains bitterly if you don't pass in a large enough buffer, so we keep incrementing the buffer size until we have enough space. Then we loop and print. Be aware that sizeof(EFI_MEMORY_DESCRIPTOR) is in fact not the difference between structs in the output buffer - use the returned size calculation shown above, or you'll end up with a much larger table than you really have (and the address spaces will all look wrong).
It wouldn't be massively difficult to decide on a common format with E820 from this table.

malloc/realloc/free capacity optimization

When you have a dynamically allocated buffer that varies its size at runtime in unpredictable ways (for example a vector or a string), one way to optimize its allocation is to only resize its backing store on powers of 2 (or some other set of boundaries/thresholds) and leave the extra space unused. This helps to amortize the cost of searching for new free memory and copying the data across, at the expense of a little extra memory use. For example, the interface specification (reserve vs resize vs trim) of many C++ STL containers has such a scheme in mind.
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?
From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests, >= 128 kB by default IIRC:
if (chunk_is_mmapped(oldp))
{
void* newmem;
#if HAVE_MREMAP
newp = mremap_chunk(oldp, nb);
if(newp) return chunk2mem(newp);
#endif
/* Note the extra SIZE_SZ overhead. */
if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
/* Must alloc, copy, free. */
newmem = public_mALLOc(bytes);
if (newmem == 0) return 0; /* propagate failure */
MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
munmap_chunk(oldp);
return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant-factor increase that e.g. C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundary, plus alignment and so on).
I suppose the idea is that if the user wants this factor of 2 size increase (or any other constant factor increase in order to guarantee logarithmic efficiency when resizing multiple times), then the user can implement it himself on top of the facility provided by the C library.
Perhaps you can use malloc_usable_size (Google for it) to find the answer experimentally. This function, however, seems undocumented, so you will need to check whether it is still available on your platform.
See also How to find how much space is allocated by a call to malloc()?
