I am developing a tool that draws primitives with DX9 on 32-bit Windows XP.
When I create vertex and index buffers, creation can sometimes fail.
The return code can be D3DERR_OUTOFVIDEOMEMORY or E_OUTOFMEMORY.
I am not sure what the difference between them is.
I used the VideoMemory tool from the DirectX samples to check the memory, and it reports 1024 MB.
Does that mean that if I create more than 1024 MB of managed resources, it will report D3DERR_OUTOFVIDEOMEMORY?
And if there is no more free virtual address space in the process and malloc fails, will DX9 report E_OUTOFMEMORY?
E_OUTOFMEMORY means that DirectX was unable to allocate (i.e. through malloc or new) the block of system memory you requested.
D3DERR_OUTOFVIDEOMEMORY means that DirectX was unable to allocate the block you requested out of the video memory pool (either on the graphics card or reserved for it in main memory).
Caveat: Drivers might lie.
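For illustration, here is a minimal sketch of checking both codes when creating a managed vertex buffer. It assumes you already have a valid IDirect3DDevice9* called device; the helper name, FVF and usage flags are made up for the example.

    #include <windows.h>
    #include <d3d9.h>

    // Hypothetical helper: try to create a managed vertex buffer and report
    // which of the two failure codes came back.
    HRESULT CreateManagedVB(IDirect3DDevice9* device, UINT bytes,
                            IDirect3DVertexBuffer9** vb)
    {
        HRESULT hr = device->CreateVertexBuffer(
            bytes,               // size in bytes
            D3DUSAGE_WRITEONLY,  // usage (illustrative)
            D3DFVF_XYZ,          // FVF (illustrative)
            D3DPOOL_MANAGED,     // managed pool: runtime keeps a system-memory copy
            vb,
            NULL);

        if (hr == D3DERR_OUTOFVIDEOMEMORY)
        {
            // The runtime/driver could not find room in the video-memory pool
            // (VRAM or the AGP/PCIe aperture it manages).
            OutputDebugStringA("CreateVertexBuffer: D3DERR_OUTOFVIDEOMEMORY\n");
        }
        else if (hr == E_OUTOFMEMORY)
        {
            // The runtime could not allocate ordinary system memory,
            // i.e. the process heap / address space is exhausted.
            OutputDebugStringA("CreateVertexBuffer: E_OUTOFMEMORY\n");
        }
        return hr;
    }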
D3DERR_OUTOFVIDEOMEMORY is a DirectX memory error. It is not necessarily about the video RAM chips alone; it can be any memory the runtime sets aside for holding a scene or drawing an image. As you have found out, if there is not enough memory for your process you will get E_OUTOFMEMORY instead; the latter refers to the memory assigned to your process being exhausted. You did not say much about your hardware spec, so your best bet would be to look into a system memory upgrade if you are running low on resources.
Edit: Some laptops/netbooks have a graphics adapter that is 'fitted with system memory'. These graphics adapters are not serious contenders for the likes of 'Beyond Call of Duty' and other top-end games; the graphics controller actually steals some memory from the main board, inflating the amount of RAM available to the graphics controller. They are fine if you are doing word processing, e-mail and so on, but that comes at the cost of the system RAM gobbled up by such an 'integrated graphics controller'.
Hope this helps,
Best regards,
Tom.
Related
Suppose I have a game that does a lot of graphics through OpenGL, and a desktop running 32-bit Linux with 4 GB of RAM and a 1 GB Nvidia graphics card. What does my game application's virtual address space look like? Is graphics card memory mapped in this virtual address space?
Also, is there some relation between RAM and graphics card memory? Does Linux allocate an equal amount of RAM for the graphics card which cannot be used by any process? Does that mean only 3 GB of RAM is then available to my game process?
What does my game application's virtual address space look like?
Impossible to tell. OpenGL leaves this detail completely open to the vendor implementation. Anything that satisfies the specification is allowed.
Is graphics card memory mapped in this virtual address space?
Maybe, maybe not. That depends on the actual implementation.
Also, is there some relation between RAM and graphics card memory?
Usually yes. As far as the majority of OpenGL implementations are concerned, the graphics card's RAM is essentially a cache for things that actually live in system memory (CPU RAM + swap space + stuff memory-mapped from storage). However, this is not pinned down by the specification, and anything that satisfies the OpenGL specification is allowed.
Does Linux allocate an equal amount of RAM for the graphics card which cannot be used by any process?
No, because Linux (the kernel) is not concerned with these things. Your graphics card's driver is, though. And the driver may do it any way it sees fit. It can either map OpenGL context data into a separate address space through Physical Address Extension (PAE) or place it in a different process or keep it in your game's address space, or…, or…, or…. There's no written down scheme on this.
Does that mean only 3 GB of RAM is then available to my game process?
If so, then it's more like (4GB - 1GB) - x where 0 < x, because the top 1GB of your process' address space is reserved for the kernel, and of course your program's text (the binary executed by the CPU) and the text of the libraries it uses take up some address space as well.
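If you are curious what your particular driver does, a low-tech way to peek (a sketch, nothing OpenGL-specific) is to dump the process' own memory map. On common driver stacks GPU-related mappings may show up as entries backed by files such as /dev/nvidia* or /dev/dri/*, but whether they appear at all is implementation-defined.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Print this process' virtual memory map (Linux only).
    // Run it from inside the game process, or read /proc/<pid>/maps externally.
    int main()
    {
        std::ifstream maps("/proc/self/maps");
        std::string line;
        while (std::getline(maps, line))
            std::cout << line << '\n';
        return 0;
    }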
For my M.Sc. thesis, I have to reverse-engineer the hash function Intel uses inside its CPUs to spread data among Last Level Cache slices in Sandy Bridge and newer generations. To this aim, I am developing an application in Linux, which needs a physically contiguous memory area in order to make my tests. The idea is to read data from this area, so that they are cached, probe if older data have been evicted (through delay measures or LLC miss counters) in order to find colliding memory addresses and finally discover the hash function by comparing these colliding addresses.
The same procedure has already been used in Windows by a researcher, and proved to work.
To do this, I need to allocate an area that is large (64 MB or more) and fully cacheable, i.e. without the non-cacheable mapping attributes used for DMA-friendly buffers. How can I perform this allocation?
To have full control over the allocation (i.e., for it to be really physically contiguous), my idea was to write a Linux module, export a device and mmap() it from userspace, but I do not know how to allocate that much contiguous memory inside the kernel.
I have heard about the Linux Contiguous Memory Allocator (CMA), but I don't know how it works.
Applications don't see physical memory; a process has its own virtual address space. Read about the MMU (what is contiguous in virtual space might not really be physically contiguous, and vice versa).
You might perhaps want to lock some memory using mlock(2)
But your application will be scheduled, and other processes (or scheduled tasks) would dirty your CPU cache. See also sched_setaffinity(2)
(and even kernel code might be perhaps preempted)
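A rough userspace sketch of those two suggestions follows (the 64 MB size matches the question, the CPU number is arbitrary; note that this only pins the pages and the thread, it does not make the buffer physically contiguous):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE          // needed for CPU_SET/sched_setaffinity with glibc
    #endif
    #include <sched.h>
    #include <sys/mman.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main()
    {
        const size_t size = 64UL * 1024 * 1024;    // 64 MB buffer for the experiment
        void* buf = std::malloc(size);
        if (!buf) { std::perror("malloc"); return 1; }
        std::memset(buf, 1, size);                 // touch the pages so they are backed

        // Keep the pages resident in RAM (they can still be physically scattered).
        if (mlock(buf, size) != 0)
            std::perror("mlock");                  // may need CAP_IPC_LOCK / ulimit -l

        // Pin this thread to CPU 2 (arbitrary) so migrations don't pollute the cache;
        // other tasks scheduled on that CPU can still dirty it.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            std::perror("sched_setaffinity");

        /* ... run the cache-probing measurements on buf here ... */

        munlock(buf, size);
        std::free(buf);
        return 0;
    }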
This page on Kernel Newbies has some ideas about memory allocation. But the maximum for get_free_pages looks like 8 MiB. (Perhaps that's a compile-time constraint?)
Since this would be all-custom, you could explore the mem= boot parameter of the linux kernel. This will limit the amount of memory used, and you can party all over the remaining memory without anyone knowing. Heck, if you boot up a busybox system, you could probably do mem=32M, but even mem=256M should work if you're not booting a GUI.
You will also want to look into the Offline Scheduler. It "unplugs" the CPU from Linux so you can have full control over ALL the code running on it. (Some parts of this are already in the mainline kernel, and maybe all of it is.)
Can I allocate one large, guaranteed contiguous range of physical memory (100 MB, consecutive without breaks) on Linux, and if so, how can I do this?
I need to map this contiguous block of memory through a PCI Express BAR from one CPU (CPU1) to another (CPU2) located behind a PCIe Non-Transparent Bridge.
You don't allocate physical memory in user applications (physical memory only makes sense inside the kernel).
I don't understand whether you are coding a kernel module or a Linux application (e.g. a numerical finite-element code).
Inside applications, you can allocate virtual memory with e.g. mmap(2) (and then you can allocate a big contiguous segment of address space).
I guess that some GPU cards give access to a large amount of GPU memory through mmap, so I believe it is possible to do what you want.
You might be interested in the numa(7) man page. Probably the numa(3) library should give you what you want. Did you also consider Open MPI? See also msync(2) and mlock(2).
From user space there is no guarantee; it depends on your luck.
If you compile your driver into the kernel, you can use mmap and allocate the required amount of memory.
If the memory is needed as storage or for some other purpose not specifically tied to a driver, then you should set the memmap parameter on the boot command line,
e.g. memmap=200M$1700M
This reserves 200 MB of memory starting at the 1700 MB physical address, so the kernel will leave it alone.
Later it can be used as a filesystem as well ;)
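A hedged sketch of how a region reserved that way could then be reached from userspace through /dev/mem (addresses copied from the example above; this usually needs root and a kernel built without CONFIG_STRICT_DEVMEM, otherwise you need a small driver exposing the range):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        const off_t  phys_base = 1700ULL * 1024 * 1024;  // the $1700M address above
        const size_t length    = 200ULL  * 1024 * 1024;  // the 200M reserved via memmap=

        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        void* p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys_base);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        // 'p' now refers to a physically contiguous 200 MB range that the kernel
        // was told to leave alone; it can be handed to the NTB/BAR setup.
        munmap(p, length);
        close(fd);
        return 0;
    }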
I am trying to hunt down a possible memory leak in my Sharpdx / DirectX application.
I am getting the following information from Process Explorer, which I do not know how to interpret.
What is Dedicated GPU Memory?
What is System GPU Memory?
What is Committed GPU Memory?
Dedicated GPU memory is basically the VRAM on board the GPU.
System GPU memory is system memory that the graphics driver maps for GPU use via the GART (Graphics Address Remapping Table); AGP and PCI Express both provide regions of memory set aside for this purpose (sometimes referred to as aperture segments).
Committed GPU memory refers to the amount of memory mapped into a display device's address space by the display driver. It is a difficult concept to explain, and this number typically does not represent anything worthwhile to anyone but driver developers.
I suggest you look at the following documentation on MSDN as well as this overview of GPU address space segmentation; while they are somewhat technical, they give a general overview of what is going on.
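If you also want those numbers at runtime instead of reading them off Process Explorer, here is a rough native C++ sketch using DXGI 1.4; it assumes Windows 10 or later, which goes beyond the original setup (SharpDX has a corresponding DXGI wrapper for this interface):

    #include <dxgi1_4.h>
    #include <cstdio>
    #pragma comment(lib, "dxgi.lib")

    int main()
    {
        IDXGIFactory1* factory = nullptr;
        if (FAILED(CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory)))
            return 1;

        IDXGIAdapter1* adapter = nullptr;
        if (FAILED(factory->EnumAdapters1(0, &adapter)))   // first adapter, for illustration
        {
            factory->Release();
            return 1;
        }

        IDXGIAdapter3* adapter3 = nullptr;
        if (SUCCEEDED(adapter->QueryInterface(__uuidof(IDXGIAdapter3), (void**)&adapter3)))
        {
            DXGI_QUERY_VIDEO_MEMORY_INFO local = {}, nonLocal = {};
            adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &local);
            adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_NON_LOCAL, &nonLocal);

            // Roughly: LOCAL ~ dedicated VRAM, NON_LOCAL ~ system memory visible to the GPU.
            // CurrentUsage is this process' usage, Budget is what the OS will let it have.
            printf("local     usage/budget: %llu / %llu bytes\n",
                   (unsigned long long)local.CurrentUsage, (unsigned long long)local.Budget);
            printf("non-local usage/budget: %llu / %llu bytes\n",
                   (unsigned long long)nonLocal.CurrentUsage, (unsigned long long)nonLocal.Budget);
            adapter3->Release();
        }
        adapter->Release();
        factory->Release();
        return 0;
    }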
When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) I am trying to understand this problem, and I realize I have a couple of questions related to CUDA memory management.
Is there a virtual memory concept in CUDA?
If only one kernel is allowed to run on CUDA at a time, will all of the memory it used or allocated be released after it terminates? If not, when does this memory get released?
If more than one kernel is allowed to run on CUDA, how can they make sure that the memory they use does not overlap?
Can anyone help me answer these questions? Thanks
Edit 1: operating system: x86_64 GNU/Linux
CUDA version: 4.0
Device: GeForce 200. It is one of the GPUs attached to the machine, and I don't think it is a display device.
Edit 2: The following is what I got after doing some research. Feel free to correct me.
CUDA will create one context for each host thread. This context keeps information such as which portion of memory (pre-allocated or dynamically allocated) has been reserved for this application, so that other applications cannot write to it. When this application terminates (not just the kernel), this portion of memory will be released.
CUDA memory is maintained in a linked list. When an application needs to allocate memory, it walks this list to see whether there is a contiguous memory chunk available for the allocation. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total free memory is greater than the requested size. That is the problem related to memory fragmentation.
cuMemGetInfo will tell you how much memory is free, but not necessarily how much you can obtain in a single largest allocation, due to memory fragmentation.
On Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory and WDDM will manage swapping data back to main memory.
New questions:
1. If the memory reserved by the context is fully released after the application terminates, memory fragmentation should not exist. There must be some kind of data left in memory.
2. Is there any way to restructure the GPU memory?
The device memory available to your code at runtime is basically calculated as
Free memory = total memory
- display driver reservations
- CUDA driver reservations
- CUDA context static allocations (local memory, constant memory, device code)
- CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
- CUDA context user allocations (global memory, textures)
If you are getting an out-of-memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to allocate memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory that any kernel in the context will consume, for the maximum number of threads each multiprocessor can run simultaneously, for every multiprocessor on the device. This can run into hundreds of MB of memory if a local-memory-heavy kernel is loaded on a device with many multiprocessors.
The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call; that will give you the amount of memory your context is using. That might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting a failure on the first cudaMalloc call.
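A minimal sketch of that diagnostic host program (CUDA runtime API only, no kernels; build with nvcc or link against cudart; the 20 MB trial allocation just mirrors the size from the question):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        // cudaFree(0) is a common way to force context creation without allocating anything.
        cudaError_t err = cudaFree(0);
        if (err != cudaSuccess)
        {
            printf("context creation failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);
        printf("free: %zu MB, total: %zu MB (minimal context overhead)\n",
               freeBytes >> 20, totalBytes >> 20);

        // Optional: check whether a single allocation of the size that fails in the
        // real application succeeds here (~20 MB as in the question).
        void* p = nullptr;
        err = cudaMalloc(&p, 20u << 20);
        printf("cudaMalloc(20 MB): %s\n", cudaGetErrorString(err));
        if (err == cudaSuccess) cudaFree(p);
        return 0;
    }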
GPU off-chip memory is separated into global, local and constant memory. These three memory types are a virtual memory concept. Global memory is free for all threads, local memory is for one thread only (mostly used for register spilling) and constant memory is cached global memory (writable only from host code). Have a look at section 5.3.2 of the CUDA C Programming Guide.
EDIT: removed
Memory allocated via cudaMalloc never overlaps. For the memory a kernel allocates at runtime, enough memory should be available. If you are out of memory and try to launch a kernel (only a guess from me), you should get the "unknown error" message, because the driver was then unable to start and/or execute the kernel.