GPUs and VPUs need contiguous memory.
CMA and static (boot-time) memory allocation are two examples of contiguous memory allocation.
Why is contiguous memory required here?
Contiguous memory allocation (in Linux, provided by the Contiguous Memory Allocator, CMA) is needed for I/O devices that can only work with contiguous ranges of physical memory. Such devices are built that way to simplify their design.
On systems with an I/O memory management unit (IOMMU), this would not be an issue because a buffer that is contiguous in the device address space can be mapped by the IOMMU to non-contiguous regions of physical memory. Also some devices can do scatter/gather DMA (i.e., can read/write from/to multiple non-contiguous buffers). Ideally, all I/O devices should be designed to either work behind an IOMMU or should be capable of scatter/gather DMA. Unfortunately, this is not the case and there are devices that require physically contiguous buffers. There are two ways for a device driver to allocate a contiguous buffer:
The device driver can allocate a chunk of physical memory at boot-time. This is reliable because most of the physical memory would be available at boot-time. However, if the I/O device is not used, then the allocated physical memory is just wasted.
A chunk of physical memory can be allocated on demand, but it may be difficult to find a contiguous free range of the required size. The advantage, though, is that memory is only allocated when needed.
CMA solves this exact problem by providing the advantages of both of these approaches with none of their downsides. The basic idea is to make it possible to migrate allocated physical pages to create enough space for a contiguous buffer. More information on how CMA works can be found here.
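The migration idea can be illustrated with a toy model. This is only a sketch of the concept, not how the kernel implements CMA: the frame list, the "movable"/"pinned" labels and both function names are assumptions made up for the illustration.

```python
def find_free_run(frames, n):
    """Return the start index of the first run of n free frames, or None."""
    run = 0
    for i, f in enumerate(frames):
        run = run + 1 if f is None else 0
        if run == n:
            return i - n + 1
    return None


def alloc_contiguous(frames, n):
    """Toy CMA-style allocation over a list of page frames.

    frames[i] is None (free), "movable" (a page that may be migrated)
    or "pinned" (a page that must stay where it is).
    Returns the start index of the n-frame buffer, or None on failure.
    """
    start = find_free_run(frames, n)
    if start is None:
        # No free run exists: look for a window without pinned pages and
        # migrate its movable pages to free frames outside the window.
        for start in range(len(frames) - n + 1):
            window = range(start, start + n)
            if any(frames[i] == "pinned" for i in window):
                continue
            movers = [i for i in window if frames[i] == "movable"]
            spare = [i for i, f in enumerate(frames)
                     if f is None and i not in window]
            if len(spare) >= len(movers):
                for src, dst in zip(movers, spare):
                    frames[dst] = "movable"  # "copy" the page contents
                    frames[src] = None       # the old frame becomes free
                break
        else:
            return None  # every window is blocked
    for i in range(start, start + n):
        frames[i] = "buffer"
    return start


# Fragmented memory: five frames are free, but never four in a row.
frames = [None, "movable", None, "movable", None, "pinned", None, None]
start = alloc_contiguous(frames, 4)
print(start, frames)
```

In this model a plain on-demand allocation of four frames would fail because of fragmentation, while migrating the two movable pages out of the way makes the request succeed without reserving anything at boot.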
Based on what I have learned from the comments and answers (thanks everyone!), I edited the question to be more targeted:
DMA:
Before the first DMA transfer, the CPU has to set up things like the RAM address range reserved for the device to use for DMA. Once the setup is done, can the device initiate transfers at will, basically owning that part of RAM, or does it still have to get some sort of permission from the CPU before every single DMA transfer?
MMIO:
CPU access to device memory via MMIO is more expensive than CPU access to RAM, but I can see on my desktop that PCI devices reserve hundreds of megabytes for MMIO. What is an example where this can be used efficiently (as opposed to copying the data back to RAM using DMA and then accessing it there)?
Look at it from the device's perspective. The device can:
directly access memory itself (using DMA itself)
wait for the CPU to transfer data to it (by providing memory mapped IO for CPU to use)
So the question is, if the CPU can access the PCIe memory by memory-maps, why does it have to do DMAs?
The CPU doesn't use DMA at all. The entire point of DMA is to allow the CPU to do other things (or nothing) while the device does the DMA. The end result is a significant performance increase for the system as a whole - e.g. CPU/s doing lots of other work, while lots of devices (hard drive controller, video card, audio card, network card, ...) are also using DMA to transfer data around.
CPU can access this memory as if it is DRAM, by memory mapped IO.
You're misusing terminology.
Instead of "DRAM" you should be using the term "main memory", aka system memory or RAM.
On modern computers main/system memory is implemented by some type of SDRAM (synchronous dynamic RAM).
Conflating the functional term (e.g. main memory) with the hardware implementation (e.g. DDR3 SDRAM) seems harmless, but can lead to the false syllogism that "RAM is volatile" or other misunderstandings.
Memory mapping can put the memory/memories of a PCIe device in the same address space as main memory.
CPU can transfer a chunk of data from this PCIe device's memory into real physical memory, via DMA. And then CPU can access the physical memory freely.
"Real physical memory" is redundant. What other types of "physical memory" are there?
There's no "fake physical memory".
You seem to be referring to the use of a buffer in main memory as "DMA".
That is misguided.
DMA is not required in order to employ or copy data to a buffer in main memory.
So the question is, if the CPU can access the PCIe memory by memory-maps, why does it have to do DMAs?
You seem to be misusing terminology.
You might want to study this article on PCIe.
Is it because PCIe bus is slow for random access?
Accessing data from a PCIe device is very slow compared to main/system memory.
This has nothing to do with "random access".
Information transfer (e.g. data retrieval) over the PCIe bus is accomplished with (high-speed) packets (even when the PCIe memory is mapped into processor address space).
And if so, DMA is basically a single dump to speedup frequent random access, and memory-mapped IO is for occasional access?
You're misusing terminology.
If the software is written inefficiently or only needs to use the data just once, then it might access the PCIe memory.
But if the software is going to access the data more than once or deems a "local" copy to be more efficient, then the software could allocate a buffer in main/system memory and copy the data from PCIe memory to main/system memory using either PIO (programmed I/O by the CPU) or DMA (direct memory access by a PCIe bus master or system DMA controller).
The use of buffers is widespread in computers.
A large part of "computing time" is spent on buffering and copying and moving data around.
I/O is almost always performed between a device and a buffer in main memory, even if direct device-to-device transfer is possible.
Do not mislabel the use of a buffer as "DMA".
For some info on DMA, see "Why driver need to map DMA buffers when dma-engine is in device?" and "dma vs interrupt-driven i/o".
DMA is usually set up by the CPU programming registers on the device, which are mapped to MMIO regions. It wouldn't make sense to map an entire hard drive into the physical address space; doing so would quickly use up the available physical address space, which is often limited to as few as 39 bits on modern chipsets. Instead, only the host controller (xHCI, AHCI etc.) registers are mapped into the MMIO space. Mapping the whole drive would also mean that the CPU would be using mov instructions to copy the data to the hard drive for the entire transfer, which occupies CPU bandwidth. DMA, by contrast, is asynchronous: the CPU issues a command to the device, and the device, PCIe bus and DRAM controller get on with it while the CPU is free.
With an IGPU without dedicated VRAM, the VRAM lives in DRAM (GFX stolen memory), which is reserved for the IGPU and is of course accessible by both the IGPU and the CPU. There is also a GTT page table in DRAM that the IGPU uses to translate internal virtual addresses to physical pages, which it then accesses via DMA over the ring bus. The CPU renders there and programs the IGPU to perform DMA to read the data into the IGPU.
On a discrete GPU with VRAM, the CPU writes to DRAM and then inserts the address of the allocation into the GTT table in VRAM via the VRAM aperture, and then programs the GPU to copy from the equivalent GART aperture address that corresponds mathematically to that GTT entry. The aperture is a contiguous GPU device-local address space separate from VRAM. The GPU then reads from the aperture space, which results in it indexing into the GTT to acquire the real system address of the data, and then initiates a DMA transfer from the real system memory address to an arbitrary address in the 256MiB VRAM aperture. There is also the option of using PCIe BARs or resizable BARs to expose a VRAM aperture to which the CPU can write directly without the need for a copy. Another advantage of this is that a CPU core could interleave several transfers, or several cores could work on different transfers, whereas with DMA the GPU can likely only perform one transfer at a time, sequentially, with no concurrency or parallelism.
What's the benefit of allocating a chunk of contiguous physical memory?
Is accessing contiguous physical addresses faster than accessing virtual addresses? And why?
All memory accesses from the CPU go through the MMU; the speed does not depend on the actual location of the pages in physical memory.
Physically contiguous memory is needed for other devices that access memory but are not able to remap pages.
In that case, the contiguous allocation is needed to make the device work to begin with, and is not a question of speed.
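To make "not able to remap pages" concrete, here is a rough sketch of how a buffer's physical pages translate into the address ranges a device would have to be given. The 4 KiB page size and both function names are assumptions for illustration only; real drivers use the kernel's DMA mapping API for this.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def build_sg_list(page_frames):
    """Coalesce a buffer's page frame numbers into (phys_addr, length) segments."""
    segments = []
    for pfn in page_frames:
        addr = pfn * PAGE_SIZE
        if segments and segments[-1][0] + segments[-1][1] == addr:
            # Physically adjacent to the previous segment: extend it.
            segments[-1] = (segments[-1][0], segments[-1][1] + PAGE_SIZE)
        else:
            segments.append((addr, PAGE_SIZE))
    return segments


def usable_without_remapping(page_frames):
    """A device with no IOMMU and no scatter/gather needs exactly one segment."""
    return len(build_sg_list(page_frames)) == 1


print(build_sg_list([10, 11, 12]))  # one segment: physically contiguous
print(build_sg_list([10, 11, 20]))  # two segments: needs scatter/gather or an IOMMU
```

A buffer that is contiguous in virtual memory can be backed by either kind of page list; only the first kind can be handed to a device that accepts a single base address and length.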
Can I allocate one large, guaranteed contiguous range of physical memory (100 MB consecutive without breaks) on Linux, and if I can, then how can I do this?
I need to map this contiguous block of memory through a PCI-Express BAR from one CPU (CPU1) to another (CPU2) located behind a PCIe Non-Transparent Bridge.
You don't allocate physical memory in user applications (physical memory only makes sense inside the kernel).
It is not clear to me whether you are coding a kernel module or some Linux application (e.g. a numerical finite-element code).
Inside applications, you can allocate virtual memory with e.g. mmap(2) (and then you can allocate a big contiguous segment of address space).
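As a minimal user-space sketch of that point, the following uses Python's standard mmap module, which wraps mmap(2) on Linux. The 100 MiB size is taken from the question; note that the result is contiguous only in virtual address space, and nothing here says anything about the physical pages behind it.

```python
import mmap

SIZE = 100 * 1024 * 1024  # 100 MiB of virtual address space

# Anonymous mapping: a fileno of -1 means "not backed by a file".
buf = mmap.mmap(-1, SIZE)

# The mapping is one contiguous range of *virtual* addresses, so the
# first and last bytes are reachable through plain indexing.
buf[0] = 0x41
buf[SIZE - 1] = 0x5A
first, last = buf[0], buf[SIZE - 1]
buf.close()
```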
I guess that some GPU cards give access to a large amount of GPU memory thru mmap so I believe it is possible to do what you want.
You might be interested in the numa(7) man page. The numa(3) library should probably give you what you want. Did you also consider Open MPI? See also msync(2) and mlock(2).
From user space there is no guarantee; it depends on your luck.
If you compile your driver into the kernel, you can use mmap and allocate the required amount of memory.
If the memory is needed as storage or for some other purpose not specific to a driver, then you should set the memmap parameter on the boot command line.
e.g. memmap=200M$1700M
It will reserve 200 MB of memory starting at physical address 1700M.
Later it can be used as a filesystem as well ;)
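To see which physical range that example covers, here is a small sketch that decodes a memmap=SIZE$START argument. The helper names are made up for this illustration, and it only handles decimal numbers with K/M/G suffixes; the kernel's own parser accepts more forms.

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a decimal number with an optional K/M/G suffix into bytes."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if m is None:
        raise ValueError(f"bad size: {text!r}")
    return int(m.group(1)) * UNITS[m.group(2)]


def parse_memmap_reserved(arg):
    """Return (start, end) in bytes for a 'memmap=SIZE$START' argument."""
    key, _, value = arg.partition("=")
    size_text, sep, start_text = value.partition("$")
    if key != "memmap" or sep != "$":
        raise ValueError(f"not a memmap=SIZE$START argument: {arg!r}")
    start = parse_size(start_text)
    return start, start + parse_size(size_text)


start, end = parse_memmap_reserved("memmap=200M$1700M")
print(hex(start), hex(end))  # reserved range is [start, end)
```

When the option is written in a bootloader configuration file, the $ character may need escaping so that it reaches the kernel unchanged.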
The address space for a 32-bit system is 0x00000000 to 0xffffffff. From what I understand, this address space will be split among the system memory (RAM), ROM and memory-mapped peripherals. If the entire address space were used to address 4 GB of RAM, all RAM bytes would be accessible. But since the address space is shared with memory-mapped peripherals, does this mean that some RAM will be unaddressable/unutilized?
Here is the memory map of a typical x86 system. As you can see, the lower ranges of memory are riddled with BIOS and ROM data with small gaps in between. There's a substantial portion reserved for memory-mapped devices in the upper ranges. All of these details may vary between platforms. It's nothing short of a nightmare to detect which memory areas can be safely used.
The kernel also typically reserves a large portion of the available memory for its internals, buffers and cache.
With the advent of virtual addressing, the kernel can advertise the address space as one consistent and gapless memory range, while that is not necessarily true behind the scenes.
In the Linux kernel, mem_map is the array which holds all "struct page" descriptors. Those pages include the 128MiB of memory in lowmem used for dynamically mapping highmem.
Since the lowmem size is 1GiB, the mem_map array has only 1GiB/4KiB = 256Ki (262,144) entries. If each entry is 32 bytes, then the mem_map memory size is 8MiB. But if we could use mem_map to map all 4GiB of physical memory (if we have that much physical memory available on x86-32), then the mem_map array would occupy 32MiB, which is not a lot of kernel memory (or am I wrong?).
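The arithmetic can be checked with a few lines. The 32-byte descriptor size is the figure assumed in the question, not a value read from any particular kernel build, and mem_map_bytes is a name invented for this sketch.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30
PAGE_SIZE = 4 * KIB
STRUCT_PAGE_SIZE = 32  # bytes per descriptor, as assumed in the question


def mem_map_bytes(covered_memory):
    """Size of a mem_map array describing `covered_memory` bytes of RAM."""
    return (covered_memory // PAGE_SIZE) * STRUCT_PAGE_SIZE


lowmem_entries = GIB // PAGE_SIZE
print(lowmem_entries)                 # number of descriptors for 1 GiB
print(mem_map_bytes(GIB) // MIB)      # MiB of mem_map for 1 GiB
print(mem_map_bytes(4 * GIB) // MIB)  # MiB of mem_map for 4 GiB
```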
So my question is: why do we need to use that 128MiB in lowmem for indirect highmem mapping in the first place? Or put another way, why not map all of that max 4GiB physical memory (if available) in the kernel space directly?
Note: if my understanding of the kernel source above is wrong, please correct. Thanks!
Look here: http://www.xml.com/ldd/chapter/book/ch13.html
Kernel low memory is the 'real' memory map, addressed with 32-bit pointers on x86.
Kernel high memory is the 'virtual' memory map, addressed with virtual structures on x86.
You don't want to map it all into the kernel address space, because you can't always address all of it, and you need most of your memory for virtual memory segments (virtual, page-mapped process space.)
At least, that's how I read it. Wow, that's a complicated question you asked.
To add to the confusion, chapter 13 talks about some PCI devices not being able to address the full 32-bit space, which was the genesis of my previous comment:
On x86, some kernel memory usage is limited to the first gigabyte of memory because of DMA addressing concerns. I'm not 100% familiar with the topic, but there's a compatibility mode for DMA on the PCI bus. That may be what you are looking at.
3.6 GB is not the ceiling when using physical address extension, which is commonly needed on most modern x86 boards, especially with memory hotplug.
Or put another way, why not map all of that max 4GiB physical memory (if available) in the kernel space directly?
One reason is userspace: every userspace process has its own virtual address space. Suppose you have 4 GB of RAM on x86. If the kernel owns 1 GB of memory (~800 MB directly mapped + ~200 MB vmalloc), all of the other ~3 GB should be dynamically distributed between processes running in user space. So how can you map your 4 GB directly when you have several address spaces?
why do we need zone_highmem on x86?
The reason is the same. The kernel reserves only ~800 MB for lowmem. All other memory is allocated and associated with a particular virtual address only on demand. For example, if you execute a binary, a new virtual address space is created and some pages are allocated to store its code and data (heap, stack, ...). So the key attribute of highmem is to serve dynamic memory allocation requests; you never know in advance what will be triggered by userspace...
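A rough sketch of how RAM ends up split between zones under these constraints is shown below. The limits are assumptions stated here rather than values from the answer: a 16 MiB ZONE_DMA and an 896 MiB lowmem limit (the usual default for a 3G/1G split, matching the "~800 MB" above). The function name is invented for the illustration, and holes in the physical memory map are ignored.

```python
MIB = 1 << 20
DMA_LIMIT = 16 * MIB      # assumed: ZONE_DMA covers the first 16 MiB
LOWMEM_LIMIT = 896 * MIB  # assumed: default lowmem limit with a 3G/1G split


def zone_sizes(ram_bytes):
    """Split a RAM size into x86-32 zones, ignoring holes in the memory map."""
    dma = min(ram_bytes, DMA_LIMIT)
    normal = max(0, min(ram_bytes, LOWMEM_LIMIT) - DMA_LIMIT)
    highmem = max(0, ram_bytes - LOWMEM_LIMIT)
    return {"ZONE_DMA": dma, "ZONE_NORMAL": normal, "ZONE_HIGHMEM": highmem}


for ram_mib in (512, 4096):
    sizes = zone_sizes(ram_mib * MIB)
    print(ram_mib, {zone: size // MIB for zone, size in sizes.items()})
```

With 512 MiB of RAM everything fits in lowmem and the highmem zone is empty; with 4 GiB, most of the memory can only be reached by the kernel through temporary mappings.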