Graphics card memory and virtual address space of a process - linux

Suppose I have a game that does a lot of OpenGL rendering, running on a desktop with 32-bit Linux, 4GB of RAM and a 1GB Nvidia graphics card. What does my game application's virtual address space look like? Is the graphics card's memory mapped into this virtual address space?
Also, is there some relation between RAM and graphics card memory? Does Linux set aside an equal amount of RAM for the graphics card that cannot be used by any process? If so, does that leave only 3GB of RAM available to my game process?

What does my game application's virtual address space look like?
Impossible to tell. OpenGL leaves this detail completely open to the vendor implementation. Anything that satisfies the specification is allowed.
Is graphics card memory mapped in this virtual address space?
Maybe, maybe not. That depends on the actual implementation.
Also, is there some relation between RAM and graphics card memory?
Usually yes. As far as the majority of OpenGL implementations are concerned, the graphics card's RAM is essentially a cache for things that actually live in system memory (CPU RAM + swap space + stuff memory mapped from storage). However, this is not pinned down by the specification, and anything that satisfies the OpenGL specification is allowed.
Does Linux allocate equal RAM for graphics card which can not be used by any process?
No, because Linux (the kernel) is not concerned with these things. Your graphics card's driver is, though, and the driver may do it any way it sees fit. It can map OpenGL context data into a separate address space through Physical Address Extension (PAE), place it in a different process, keep it in your game's address space, or…, or…, or…. There's no written-down scheme for this.
If so, does that leave only 3GB of RAM available to my game process?
If so, then more like (3GB - 1GB) - x, where 0 < x. The 3GB is what remains of the 4GB address space once the top 1GB is reserved for the kernel, and of course your program's text (the binary executed by the CPU) and the text of the libraries it uses take up some address space as well.
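The arithmetic in that last answer can be written out as a small sketch. It is about address space, not RAM, and every number in it is an assumption for illustration: a 3GB/1GB user/kernel split, a driver that maps the whole 1GB of VRAM into the process, and a made-up 64MiB for program and library text.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def usable_user_address_space(total_va=4 * GIB, kernel_reserved=1 * GIB,
                              vram_mapped=1 * GIB, other_mappings=0):
    """Upper bound on the address space left for the game's own allocations.

    All defaults are illustrative assumptions, not values mandated by
    Linux or by any OpenGL driver.
    """
    return total_va - kernel_reserved - vram_mapped - other_mappings


# Case from the answer: the driver maps all of VRAM into the process.
with_vram = usable_user_address_space(other_mappings=64 * MIB)
# Alternative: the driver maps nothing into the process.
without_vram = usable_user_address_space(vram_mapped=0, other_mappings=64 * MIB)

print(with_vram / GIB, without_vram / GIB)
```

Which of the two cases applies, or something in between, depends entirely on the driver, as the answer says.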

Related

Memory-mapped IO vs DMA?

Based on what I have learned from the comments and answers (thanks everyone!), I edited the question to be more targeted:
DMA:
Before the first DMA, the CPU has to set things up, such as the RAM address range reserved for the device to use for DMA. Once the setup is done, can the device initiate transfers at will, basically owning that part of RAM, or does it still have to get some sort of permission from the CPU before every single DMA transfer?
MMIO:
The CPU accessing device memory via MMIO is more expensive than the CPU accessing RAM, but I can see on my desktop that PCI devices reserve hundreds of megabytes for MMIO. What is an example where this can be used efficiently (as opposed to copying the data back to RAM using DMA and then accessing it there)?
Look at it from the device's perspective. The device can:
directly access memory itself (using DMA itself)
wait for the CPU to transfer data to it (by providing memory mapped IO for CPU to use)
So the question is, if the CPU can access the PCIe memory by memory-maps, why does it have to do DMAs?
The CPU doesn't use DMA at all. The entire point of DMA is to allow the CPU to do other things (or nothing) while the device does the DMA. The end result is a significant performance increase for the system as a whole - e.g. CPU/s doing lots of other work, while lots of devices (hard drive controller, video card, audio card, network card, ...) are also using DMA to transfer data around.
CPU can access this memory as if it is DRAM, by memory mapped IO.
You're misusing terminology.
Instead of "DRAM" you should be using the term "main memory", aka system memory or RAM.
On modern computers main/system memory is implemented by some type of SDRAM (synchronous dynamic RAM).
Conflating the functional term (e.g. main memory) with the hardware implementation (e.g. DDR3 SDRAM) seems harmless, but can lead to the false syllogism that "RAM is volatile" or other misunderstandings.
Memory mapping can put the memory/memories of a PCIe device in the same address space as main memory.
CPU can transfer a chunk of data from this PCIe device's memory into real physical memory, via DMA. And then CPU can access the physical memory freely.
"Real physical memory" is redundant. What other types of "physical memory" are there?
There's no "fake physical memory".
You seem to be referring to the use of a buffer in main memory as "DMA".
That is misguided.
DMA is not required in order to employ or copy data to a buffer in main memory.
So the question is, if the CPU can access the PCIe memory by memory-maps, why does it have to do DMAs?
You seem to be misusing terminology.
You might want to study this article on PCIe.
Is it because PCIe bus is slow for random access?
Accessing data from a PCIe device is very slow compared to main/system memory.
This has nothing to do with "random access".
Information transfer (e.g. data retrieval) over the PCIe bus is accomplished with (high-speed) packets, even when the PCIe memory is mapped into the processor's address space.
And if so, DMA is basically a single dump to speedup frequent random access, and memory-mapped IO is for occasional access?
You're misusing terminology.
If the software is written inefficiently or only needs to use the data just once, then it might access the PCIe memory.
But if the software is going to access the data more than once or deems a "local" copy to be more efficient, then the software could allocate a buffer in main/system memory and copy the data from PCIe memory to main/system memory using either PIO (programmed I/O by the CPU) or DMA (direct memory access by a PCIe bus master or system DMA controller).
The use of buffers is widespread in computers.
A large part of "computing time" is spent on buffering and copying and moving data around.
I/O is almost always performed between a device and a buffer in main memory, even if direct device-to-device transfer is possible.
Do not mislabel the use of a buffer as "DMA".
For some info on DMA, see "Why driver need to map DMA buffers when dma-engine is in device?" and "dma vs interrupt-driven i/o".
DMA is usually set up by the CPU programming registers on the device, which are mapped into MMIO regions. It wouldn't make sense to map an entire hard drive into the physical address space: that would quickly use up the available physical address space, which is often limited to as low as 39 bits on modern chipsets. Instead, only the host controller (xHCI, AHCI etc.) registers are mapped into MMIO space. Mapping the whole drive would also mean the CPU would be using mov instructions to copy the data to the hard drive for the entire transfer, which occupies CPU bandwidth. DMA, by contrast, is asynchronous: the CPU issues a command to the device, and the device, PCIe bus and DRAM controller get on with it while the CPU is free.
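To put numbers on the address-space argument, here is a rough sketch. The 39-bit width comes from the answer above; the drive size, RAM size and register-block size are assumptions picked only for illustration.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def fits_in_physical_address_space(mapped_bytes, ram_bytes, phys_addr_bits=39):
    """True if RAM plus an MMIO mapping of `mapped_bytes` fits in the
    physical address space (ignores all other devices and alignment)."""
    return ram_bytes + mapped_bytes <= 2 ** phys_addr_bits


ram = 16 * GIB                     # assumed amount of installed RAM
whole_drive = 1 * TIB              # assumed drive size
controller_registers = 8 * 1024    # assumed size of a host controller register block

print(2 ** 39 // GIB)              # size of a 39-bit space in GiB
print(fits_in_physical_address_space(whole_drive, ram))
print(fits_in_physical_address_space(controller_registers, ram))
```

A 39-bit space is 512GiB, so a single 1TiB drive alone would overflow it, while a few KiB of controller registers is negligible.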
With an IGPU without dedicated VRAM, the VRAM lives in DRAM (GFX stolen memory), which is reserved for the IGPU and is of course accessible by both the IGPU and the CPU. There is also a GTT page table in DRAM that the IGPU uses to translate internal virtual addresses to physical pages, which it then accesses via DMA over the ring bus. The CPU renders there and programs the IGPU to perform DMA to read it in to the IGPU.
On a discrete GPU with VRAM, the CPU writes to DRAM, inserts the address of the allocation into the GTT table in VRAM via the VRAM aperture, and then programs the GPU to copy from the GART aperture address that corresponds mathematically to that GTT entry. The aperture is a contiguous GPU device-local address space separate from VRAM. The GPU then reads from the aperture space, which results in it indexing into the GTT to acquire the real system address of the data, and then initiates a DMA transfer from that system memory address to an arbitrary address in the 256MiB VRAM aperture. There is also the option of using PCIe BARs or resizable BARs to expose a VRAM aperture to which the CPU can write directly, without the need for a copy. Another advantage of this is that a CPU core could interleave several transfers, or several cores could work on different transfers, whereas with DMA the GPU can likely only perform one DMA transfer at a time, sequentially/synchronously, with no concurrency or parallelism.
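The "corresponds mathematically" step can be pictured as a one-level page-table lookup. The sketch below is a toy model only: it assumes 4KiB pages and a GTT that is a flat list of page-aligned system addresses, whereas real GTT entries carry flag bits and have vendor-specific formats.

```python
PAGE_SIZE = 4096  # assumed GTT page size


def aperture_to_system_address(aperture_offset, gtt):
    """Translate an offset inside the GART aperture into a system address.

    `gtt` is a hypothetical table: gtt[i] is the page-aligned system
    address backing page i of the aperture.
    """
    index, offset_in_page = divmod(aperture_offset, PAGE_SIZE)
    return gtt[index] + offset_in_page


# Two scattered system pages appear contiguous through the aperture.
gtt = [0x12345000, 0x0ABCD000]
print(hex(aperture_to_system_address(0x0010, gtt)))
print(hex(aperture_to_system_address(0x1010, gtt)))
```

The point of the indirection is that the GPU sees one contiguous range even though the backing system pages are not contiguous.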

What is the rationale for the Linux kernel mapping as much RAM as possible in the direct-mapping (linear mapping) area?

The discussion below applies to 32-bit ARM Linux.
Suppose there is 512MB of physical RAM in my system. In common configurations, all 512MB will be mapped by the kernel via direct mapping (0xC000 0000 to 0xE000 0000).
The question is: the kernel itself only uses part of this RAM; most of it will be allocated to user space. Why bother mapping all 512MB of physical RAM into the kernel's virtual space (0xC000 0000 to 0xE000 0000)? Why doesn't the kernel map just the part it needs for its own use (say 64MB)?
If physical RAM is greater than 1GB, things get a little more complicated. Let's say the directly mapped area is 768MB in size. The result would be 768MB out of 1GB being directly mapped into the kernel's virtual space. I guess the rest of the RAM (256MB) goes to one of two places: either the high memory area or allocation by the kernel to user space. But I still don't see any advantage in mapping so much physical RAM into the kernel's virtual space.
Actually this question can be reduced to:
what are the drawbacks if kernel only directly maps a small part of physical RAM(say 64MB out of 512MB)?
Before further discussion, it helps to know two things:
Once the MMU is turned on, every address issued by the CPU is a virtual address.
If the kernel wants to access ANY address in RAM, a mapping must be set up before the actual access happens.
If the kernel directly maps only a small part of physical RAM, the cost is that every time it needs to access another part of RAM, it has to set up a temporary mapping before the access and tear that mapping down afterwards, which is tedious and inefficient.
If that mapping is set up in advance and is always there, it saves quite a lot of trouble for kernel.
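What makes the permanent direct mapping cheap is that translation is a fixed offset. A minimal sketch, using the 0xC000 0000 base and 512MB size from the question and assuming (this is not stated in the question) that RAM starts at physical address 0:

```python
PAGE_OFFSET = 0xC0000000        # start of the direct-mapped area (from the question)
PHYS_OFFSET = 0x00000000        # assumption: RAM starts at physical address 0
RAM_SIZE = 512 * 1024 * 1024    # 512MB, as in the question


def phys_to_virt(phys):
    """Kernel virtual address of a directly mapped physical address."""
    if not PHYS_OFFSET <= phys < PHYS_OFFSET + RAM_SIZE:
        raise ValueError("physical address is outside the direct mapping")
    return phys - PHYS_OFFSET + PAGE_OFFSET


def virt_to_phys(virt):
    """Inverse of phys_to_virt for addresses inside the direct mapping."""
    if not PAGE_OFFSET <= virt < PAGE_OFFSET + RAM_SIZE:
        raise ValueError("virtual address is outside the direct mapping")
    return virt - PAGE_OFFSET + PHYS_OFFSET


print(hex(phys_to_virt(0x00100000)))
print(hex(phys_to_virt(RAM_SIZE - 1)))
```

Because the translation is plain arithmetic, no page table has to be edited on each access, which is exactly the cost the temporary-mapping approach would add.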

Interpreting GPU information from Process Explorer

I am trying to hunt down a possible memory leak in my SharpDX / DirectX application.
I am getting the following information from Process Explorer, which I do not know how to interpret.
What is Dedicated GPU Memory?
What is System GPU Memory?
What is Committed GPU Memory?
Dedicated GPU memory is basically the VRAM on board the GPU.
System GPU memory is system memory in which the graphics driver stores resources, using the GART (Graphics Address Remapping Table) to make it addressable by the GPU. AGP and PCI Express both provide regions of memory set aside for this purpose (sometimes referred to as aperture segments).
Committed GPU memory refers to the amount of memory mapped into a display device's address space by the display driver. It is a difficult concept to explain, and the number typically does not mean much to anyone but driver developers.
I suggest you look into the following documentation on MSDN as well as this overview of GPU address space segmentation. They are somewhat technical, but they give a general overview of what is going on.

What's the difference between D3DERR_OUTOFVIDEOMEMORY and E_OUTOFMEMORY

I am developing a tool that draws primitives with DX9 on 32-bit XP.
When I create vertex buffers and index buffers, the creation sometimes fails.
The return code can be D3DERR_OUTOFVIDEOMEMORY or E_OUTOFMEMORY.
I am not sure what the difference between them is.
I used the VideoMemory tool from the DX samples to check the memory, and it reports 1024MB.
Does that mean that if I create more than 1024MB of managed resources, it will report D3DERR_OUTOFVIDEOMEMORY?
And if there is no free virtual address space left in the process and malloc fails, will DX9 report E_OUTOFMEMORY?
E_OUTOFMEMORY means that DirectX was unable to allocate (i.e. through malloc or new) the block of memory you requested.
D3DERR_OUTOFVIDEOMEMORY means that DirectX was unable to allocate (i.e. out of the pool of memory, either on the gfx card or reserved in main memory) the block of memory you requested.
Caveat: Drivers might lie.
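As a rough illustration of telling the two apart in a log, the sketch below classifies a raw HRESULT value. The two constants are the values defined in the Windows SDK headers (winerror.h and d3d9.h); the function name and the wording of the messages are hypothetical.

```python
E_OUTOFMEMORY = 0x8007000E            # winerror.h
D3DERR_OUTOFVIDEOMEMORY = 0x8876017C  # d3d9.h, MAKE_D3DHRESULT(380)


def describe_allocation_failure(hr):
    """Map an HRESULT (signed or unsigned 32-bit) to a short description."""
    hr &= 0xFFFFFFFF  # HRESULTs are often logged as negative signed ints
    if hr == E_OUTOFMEMORY:
        return "process heap / address space exhausted"
    if hr == D3DERR_OUTOFVIDEOMEMORY:
        return "video memory pool exhausted"
    return "other failure: 0x%08X" % hr


print(describe_allocation_failure(D3DERR_OUTOFVIDEOMEMORY))
print(describe_allocation_failure(E_OUTOFMEMORY - 2 ** 32))
```

Given the caveat above, the classification only tells you what the runtime or driver reported, not necessarily which pool was really exhausted.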
D3DERR_OUTOFVIDEOMEMORY is a DirectX memory error, not necessarily related to video memory; it could be memory used for holding a scene or drawing an image. As you have found out, if there is not enough memory for your process you will get E_OUTOFMEMORY: that one refers to the memory assigned to your process being exhausted. You did not say what operating system or hardware spec you have; your best bet would be to look into a system memory upgrade if you're running low on resources.
Edit: Some laptops and netbooks have a graphics adapter that is 'fitted with system memory'. These graphics adapters are not serious hardware for the likes of 'Beyond Call of Duty' and other top-end games: they actually steal some memory from the main board, inflating the amount of RAM available to the graphics controller. They are fine for word processing, email and so on, but at the cost of the system RAM that is gobbled up by the controller, a la 'integrated graphics controller'.
Hope this helps,
Best regards,
Tom.

why do we need zone_highmem on x86?

In the Linux kernel, mem_map is the array that holds all "struct page" descriptors. Those pages include the 128MiB of memory in lowmem used for dynamically mapping highmem.
Since the lowmem size is 1GiB, the mem_map array has only 1GiB/4KiB = 256Ki entries. If each entry is 32 bytes, then mem_map occupies 8MiB. If we used mem_map to map all 4GiB of physical memory (assuming that much is available on x86-32), the mem_map array would occupy 32MiB, which is not a lot of kernel memory (or am I wrong?).
So my question is: why do we need to use that 128MiB in lowmem for indirect highmem mapping in the first place? Or put another way, why not map all of the up to 4GiB of physical memory (if available) into the kernel space directly?
Note: if my understanding of the kernel source above is wrong, please correct me. Thanks!
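The mem_map arithmetic in the question can be checked with a few lines. The 4KiB page size and 32-byte descriptor are the question's own figures; the real size of struct page depends on kernel version and configuration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def mem_map_bytes(covered_bytes, page_size=4 * KIB, struct_page_size=32):
    """Size of an array holding one descriptor per page of `covered_bytes`."""
    entries = covered_bytes // page_size
    return entries * struct_page_size


print(1 * GIB // (4 * KIB))            # number of entries for 1GiB
print(mem_map_bytes(1 * GIB) // MIB)   # MiB of descriptors for 1GiB
print(mem_map_bytes(4 * GIB) // MIB)   # MiB of descriptors for 4GiB
```

So the descriptor array itself is indeed small; the answers below argue that the constraint lies elsewhere, in the limited kernel virtual address space.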
Look Here: http://www.xml.com/ldd/chapter/book/ch13.html
Kernel low memory is the 'real' memory map, addressed with 32-bit pointers on x86.
Kernel high memory is the 'virtual' memory map, addressed with virtual structures on x86.
You don't want to map it all into the kernel address space, because you can't always address all of it, and you need most of your memory for virtual memory segments (virtual, page-mapped process space.)
At least, that's how I read it. Wow, that's a complicated question you asked.
To add to the confusion, chapter 13 talks about some PCI devices not being able to address the full 32-bit space, which was the genesis of my previous comment:
On x86, some kernel memory usage is limited to the first gigabyte of memory because of DMA addressing concerns. I'm not 100% familiar with the topic, but there's a compatibility mode for DMA on the PCI bus. That may be what you are looking at.
3.6 GB is not the ceiling when using physical address extension, which is commonly needed on most modern x86 boards, especially with memory hotplug.
Or put another way, why not to map all those max 4GiB physical memory(if available) in the kernel space directly?
One reason is userspace: every userspace process has its own virtual address space. Suppose you have 4GB of RAM on x86. If the kernel owns 1GB of memory (~800MB directly mapped + ~200MB vmalloc), all the other ~3GB has to be dynamically distributed between the processes running in user space. So how can you map your 4GB directly when you have several address spaces?
why do we need zone_highmem on x86?
The reason is the same. The kernel reserves only ~800MB for low mem. All other memory is allocated and tied to a particular virtual address only on demand. For example, if you execute a binary, a new virtual address space is created and some pages are allocated to store your binary's code and data (heap, stack ...). So the key attribute of high mem is to serve dynamic memory allocation requests; you never know in advance what will be triggered by userspace...
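A simplified sketch of the resulting split, assuming a 1GiB kernel address range of which 128MiB (the figure from the question) is kept back for vmalloc and temporary highmem mappings. It ignores ZONE_DMA and any memory holes.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def split_low_high(ram_bytes, kernel_va=1 * GIB, reserved_va=128 * MIB):
    """Return (lowmem, highmem) in bytes for a given amount of RAM.

    Lowmem is the part of RAM that fits in the kernel's permanent direct
    mapping; whatever does not fit becomes highmem.
    """
    lowmem_limit = kernel_va - reserved_va
    lowmem = min(ram_bytes, lowmem_limit)
    return lowmem, ram_bytes - lowmem


for ram in (512 * MIB, 1 * GIB, 4 * GIB):
    low, high = split_low_high(ram)
    print(ram // MIB, low // MIB, high // MIB)
```

With 512MiB of RAM everything is lowmem; with 4GiB, most of the memory can only be reached by the kernel through on-demand mappings.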

Resources