Mmap DMA memory uncached: "map pfn ram range req uncached-minus got write-back" - linux

I am mapping DMA coherent memory from kernel to user space. At user level I use mmap() and in kernel driver I use dma_alloc_coherent() and afterwards remap_pfn_range() to remap the pages. This basically works as I can write data to the mapped area in my app and verify it in my kernel driver.
However, despite using dma_alloc_coherent (which should alloc uncached memory) and pgprot_noncached() the kernel informs me with this dmesg output:
map pfn ram range req uncached-minus for [mem 0xABC-0xCBA], got write-back
In my understanding, write-back is cached memory. But I need uncached memory for the DMA operation.
The Code (only showing the important parts):
User App
fd = open(dev_fn, O_RDWR | O_SYNC);
if (fd >= 0)
{
    mem = mmap(NULL, mmap_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
For testing purposes I used mmap_len = getpagesize(), which is 4096.
Kernel Driver
typedef struct
{
size_t mem_size;
dma_addr_t dma_addr;
void *cpu_addr;
} Dma_Priv;
fops_mmap()
{
    dma_priv->mem_size = vma->vm_end - vma->vm_start;
    dma_priv->cpu_addr = dma_alloc_coherent(&gen_dev, dma_priv->mem_size,
                                            &dma_priv->dma_addr, GFP_KERNEL);
    if (dma_priv->cpu_addr != NULL)
    {
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        remap_pfn_range(vma, vma->vm_start,
                        virt_to_phys(dma_priv->cpu_addr) >> PAGE_SHIFT,
                        dma_priv->mem_size, vma->vm_page_prot);
    }
}
Useful information I've found
PATting Linux:
Page 7 --> mmap with O_SYNC (uncached):
Applications can open /dev/mem with the O_SYNC flag and then do mmap
on it. With that, applications will be accessing that address with an
uncached memory type. mmap will succeed only if there is no other
conflicting mappings to the same region.
I used the flag, but it doesn't help.
Page 7 --> mmap without O_SYNC (uncached-minus):
mmap without O_SYNC, no existing mapping, and not a write-back region:
For an mmap that comes under this category, we use uncached-minus type
mapping. In the absence of any MTRR for this region, the effective
type will be uncached. But in cases where there is an MTRR, making
this region write-combine, then the effective type will be
write-combine.
pgprot_noncached()
In /arch/x86/include/asm/pgtable.h I found this:
#define pgprot_noncached(prot) \
((boot_cpu_data.x86 > 3) \
? (__pgprot(pgprot_val(prot) | \
cachemode2protval(_PAGE_CACHE_MODE_UC_MINUS))) \
: (prot))
Is it possible that x86 always sets a noncached request to UC_MINUS, which results in combination with MTRR in a cached write-back?
I am using Ubuntu 16.04.1, Kernel: 4.10.0-40-generic.

https://www.kernel.org/doc/Documentation/x86/pat.txt
Drivers wanting to export some pages to userspace do it by using mmap
interface and a combination of 1) pgprot_noncached() 2)
io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()
With PAT support, a new API pgprot_writecombine is being added. So,
drivers can continue to use the above sequence, with either
pgprot_noncached() or pgprot_writecombine() in step 1, followed by
step 2.
In addition, step 2 internally tracks the region as UC or WC in
memtype list in order to ensure no conflicting mapping.
Note that this set of APIs only works with IO (non RAM) regions. If
driver wants to export a RAM region, it has to do set_memory_uc() or
set_memory_wc() as step 0 above and also track the usage of those
pages and use set_memory_wb() before the page is freed to free pool.
I added set_memory_uc() before pgprot_noncached() and that did the trick:
if (dma_priv->cpu_addr != NULL)
{
    set_memory_uc((unsigned long)dma_priv->cpu_addr,
                  dma_priv->mem_size / PAGE_SIZE);
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    remap_pfn_range(vma, vma->vm_start,
                    virt_to_phys(dma_priv->cpu_addr) >> PAGE_SHIFT,
                    dma_priv->mem_size, vma->vm_page_prot);
}
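Per the PAT documentation quoted above, pages switched to UC with set_memory_uc() must be tracked and returned to write-back with set_memory_wb() before they are freed back to the pool. A minimal sketch of the matching release path, reusing the gen_dev and Dma_Priv names from the snippets above (the function name is illustrative):

```c
/* Sketch only: undo set_memory_uc() before freeing, as pat.txt requires.
 * gen_dev and Dma_Priv are the names used in the snippets above. */
static void dma_buf_release(Dma_Priv *dma_priv)
{
    if (dma_priv->cpu_addr != NULL)
    {
        /* Restore the pages to write-back before they go back to the
         * free pool, otherwise later users inherit the UC attribute. */
        set_memory_wb((unsigned long)dma_priv->cpu_addr,
                      dma_priv->mem_size / PAGE_SIZE);
        dma_free_coherent(&gen_dev, dma_priv->mem_size,
                          dma_priv->cpu_addr, dma_priv->dma_addr);
        dma_priv->cpu_addr = NULL;
    }
}
```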
This answer was posted as an edit to the question Mmap DMA memory uncached: "map pfn ram range req uncached-minus got write-back" by the OP Gbo under CC BY-SA 4.0.

Related

How to read /proc/<pid>/pagemap in a kernel driver?

I am trying to read /proc/&lt;pid&gt;/pagemap in a kernel driver like this:
uint64_t page;
uint64_t va = 0x7FFD1BF46530;
loff_t pos = va / PAGE_SIZE * sizeof(uint64_t);
struct file *filp = filp_open("/proc/19030/pagemap", O_RDONLY, 0);
ssize_t nread = kernel_read(filp, &page, sizeof(page), &pos);
I get error -22 in nread (EINVAL, invalid argument) and
"kernel read not supported for file /19030/pagemap (pid: 19030 comm: tester)" in dmesg.
0x7FFD1BF46530 is a virtual address in a user space process pid 19030 (tester). I assume that pos is the offset into the file like in lseek64.
Doing the precise same thing as sudo with same values in a user space process, i.e. reading /proc/19030/pagemap works fine and produces a correct physical address.
The actual thing I am trying to do here is to find the physical address of a user space virtual address. I need the physical address for a device DMA transfer operation and a user space app needs to access this memory. This app allocates 1GB DMA memory with anonymous mmap from THP (Transparent Huge Pages). And I am trying to avoid the need for sudo by reading /proc//pagemap in a kernel driver via ioctl instead.
I would be happy to allocate huge page DMA memory in the driver but don't know how to do that. dma_alloc_coherent is limited to max 4MB allocations. Is there a way to get those allocated as contiguous physical memory? I need hundreds of MB or many GB of DMA memory.
The problem with anonymous mmap is that it can only allocate at most a 1GB huge page as physically contiguous memory. Allocating more works, but the memory is not physically contiguous and is unusable for DMA.
Any good ideas or alternative ways of allocating huge pages as DMA memory?
Tried reading the file /proc/&lt;pid&gt;/pagemap in a kernel driver. Expected the same results as when reading the file in a user space application, which works ok.
"kernel read not supported for file …"
Indeed, as we see in __kernel_read()
if (unlikely(!file->f_op->read_iter || file->f_op->read))
return warn_unsupported(file, "read");
it fails if f_op->read_iter is not wired up (implemented) or if f_op->read is; both are the case for a pagemap file.
You could try pagemap_read() instead. – not feasible for reasons in the comments
When I had the problem of getting the physical address for a virtual address in a driver, I included and copied some kernel code (not that I recommend this, but I saw no other solution); here's an extract.
static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr
, unsigned long sz)
{ return NULL; }
void p4d_clear_bad(p4d_t *p4d) { p4d_ERROR(*p4d); p4d_clear(p4d); }
#include "mm/pagewalk.c"
static int pte(pte_t *pte, unsigned long addr
, unsigned long next, struct mm_walk *walk)
{
*(pte_t **)walk->private = pte;
return 1;
}
/* Scan the real Linux page tables and return a PTE pointer for
* a virtual address in a context.
* Returns true (1) if PTE was found, zero otherwise. The pointer to
* the PTE pointer is unmodified if PTE is not found.
*/
int
get_pteptr(struct mm_struct *mm, unsigned long addr, pte_t **ptep, pmd_t **pmdp)
{
struct mm_walk walk = { .pte_entry = pte, .mm = mm, .private = ptep };
return walk_page_range(addr, addr+PAGE_SIZE, &walk);
}
/* Find physical address for this virtual address. Normally used by
* I/O functions, but anyone can call it.
*/
static inline unsigned long iopa(unsigned long addr)
{
unsigned long pa;
/* I don't know why this won't work on PMacs or CHRP. It
* appears there is some bug, or there is some implicit
* mapping done not properly represented by BATs or in page
* tables.......I am actively working on resolving this, but
* can't hold up other stuff. -- Dan
*/
pte_t *pte;
struct mm_struct *mm;
#if 0
/* Check the BATs */
phys_addr_t v_mapped_by_bats(unsigned long va);
pa = v_mapped_by_bats(addr);
if (pa)
return pa;
#endif
/* Allow mapping of user addresses (within the thread)
* for DMA if necessary.
*/
if (addr < TASK_SIZE)
mm = current->mm;
else
mm = &init_mm;
/* ATTENTION: I needed the current address space.
 * You'd have to use mm = file->private_data instead. */
pa = 0;
if (get_pteptr(mm, addr, &pte, NULL))
pa = (pte_val(*pte) & PAGE_MASK) | (addr & ~PAGE_MASK);
return(pa);
}

Map multiple kernel buffer into contiguous userspace buffer?

I have allocated multiple kernel accessible buffers using dma_alloc_coherent, each 4MiB in size. The goal is to map these buffers into a contiguous userspace virtual memory. The issue is that remap_pfn_range doesn't seem to be working, as the userspace memory sometimes works and sometimes doesn't, or sometimes duplicates the page mappings of the buffers.
// in probe() function
dma_alloc_coherent(&pcie->dev, BUF_SIZE, &bus_addr0, GFP_KERNEL);
dma_alloc_coherent(&pcie->dev, BUF_SIZE, &bus_addr1, GFP_KERNEL);
// ...
// in mmap() function
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = dma_to_phys(&pcie->dev, bus_addr0) >> PAGE_SHIFT;
remap_pfn_range(vma, vma->vm_start + 0, pfn, BUF_SIZE, vma->vm_page_prot);
pfn = dma_to_phys(&pcie->dev, bus_addr1) >> PAGE_SHIFT;
remap_pfn_range(vma, vma->vm_start + BUF_SIZE, pfn, BUF_SIZE, vma->vm_page_prot);
I'm not really sure of the best way to map multiple kernel buffers to contiguous userspace memory, but I have a feeling I am doing it wrong. Thanks in advance.
I have no idea why there isn't a better interface to map multiple buffers contiguously into user space. In theory you can use multiple calls to remap_pfn_range() but getting the correct pfn for memory allocated by dma_alloc_coherent() is essentially impossible on some platforms (e.g. ARM).
I have come up with a solution to this problem that might not be considered "good" but seems to work well enough in my usage on multiple platforms (x86_64, and various ARM). The solution is to temporarily modify the start and end addresses in the struct vm_area_struct while calling dma_mmap_coherent() multiple times, once for each buffer. As long as you reset the VMA start and end addresses to their original values, everything seems to work okay (see my prior disclaimer).
Here is an example:
static int mmap(struct file *file, struct vm_area_struct *vma)
{
. . .
int rc = 0;
unsigned long vm_start_orig = vma->vm_start;
unsigned long vm_end_orig = vma->vm_end;
for (int idx = 0; idx < buffer_list_size; idx++) {
buffer_entry = &buffer_list[idx];
/* Temporarily modify VMA start and end addresses */
if (idx > 0) {
vma->vm_start = vma->vm_end;
}
vma->vm_end = vma->vm_start + buffer_entry->size;
rc = dma_mmap_coherent(dev, vma,
buffer_entry->virt_address,
buffer_entry->phys_addr,
buffer_entry->size);
if (rc != 0) {
pr_err("dma_mmap_coherent: %d (IDX = %d)\n", rc, idx);
return -EAGAIN;
}
}
/* Restore VMA addresses */
vma->vm_start = vm_start_orig;
vma->vm_end = vm_end_orig;
return rc;
}
Unfortunately, the only currently supported method for mmap()ing DMA coherent memory is the macro dma_mmap_coherent() or the function dma_mmap_attrs() (which dma_mmap_coherent() calls), and that does not support splitting a single VMA across multiple, individually allocated blocks of DMA coherent memory.
(I wish there was a supported way to split the mmap()ing of a VMA across multiple allocations of DMA coherent memory because it affects the buffer allocation in a kernel subsystem that I help maintain. I had to change it to allocate the buffer as a single block of DMA coherent memory instead of many page-sized blocks.)

Mapping of dmam_alloc_coherent allocated memory to the user space via remap_pfn_range gives pointer to wrong area of memory

I am preparing an application running on an Intel Cyclone V SoC (ARM).
I need to map the DMA coherent memory buffer to the user space.
The buffer is allocated in the device driver with:
buf_addr = dmam_alloc_coherent(&pdev->dev, size, &dma_addr, GFP_KERNEL);
The mapping is done correctly, and I have verified, that the buffer accessed by the hardware via dma_addr HW address is visible for the kernel via buf_addr pointer.
Then in the mmap function of the device driver I do:
unsigned long physical = virt_to_phys(buf_addr);
unsigned long vsize = vma->vm_end - vma->vm_start;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap_pfn_range(vma,vma->vm_start, physical >> PAGE_SHIFT , vsize, vma->vm_page_prot);
The application mmaps the buffer with:
buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE, MAP_SHARED, dev_file, 0);
I do not get any error from remap_pfn_range function. Also the application is able to access the mmapped memory, but it is not the buffer allocated with dmam_alloc_coherent.
I have found the macro dma_mmap_coherent that seems to be dedicated particularly for that purpose.
I have verified that the following modification in the mmap function ensures proper operation:
unsigned long vsize = vma->vm_end - vma->vm_start;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
remap = dma_mmap_coherent(&my_pdev->dev, vma, fdata, dma_addr, vsize);
Because the pdev pointer is not directly delivered to the mmap function, it is passed from the probe function via the global variable my_pdev. In a driver supporting multiple devices, it should be stored in the device context instead.
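One common way to avoid the global (an assumption on my part, not from the original answer) is to stash the per-device structure in file->private_data in open() and read it back in mmap(). A hypothetical sketch with illustrative names (my_priv, my_open, my_mmap; buf_addr and dma_addr come from the allocation above):

```c
/* Hypothetical sketch: recover the device from the file instead of a
 * global. The cdev is embedded in the per-device structure. */
struct my_priv {
    struct cdev cdev;
    struct device *dev;
};

static int my_open(struct inode *inode, struct file *file)
{
    struct my_priv *priv = container_of(inode->i_cdev, struct my_priv, cdev);
    file->private_data = priv;
    return 0;
}

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct my_priv *priv = file->private_data;
    unsigned long vsize = vma->vm_end - vma->vm_start;

    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return dma_mmap_coherent(priv->dev, vma, buf_addr, dma_addr, vsize);
}
```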

Accessing large memory (32 GB) using /dev/zero

I want to use /dev/zero for storing lots of temporary data (32 GB or around that). I am doing this:
fd = open("/dev/zero", O_RDWR );
// <Exit on error>
vbase = (uint64_t*) mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
// <Exit on error>
ftruncate(fd, (off_t) MEMSIZE);
I am changing MEMSIZE from 1GB to 32 GB (performing a memtest) to see if I can really access all that range. I am running out of memory at 1 GB.
Is there something I am missing ? Am I mmap'ing correctly ?
Or am I running into some system limit ? How can I check if this is happening ?
P.S: I run many programs that generate many gigs of data within a single file, so I dont know if there is an artificial upper limit, just that I seem to be running into something.
I have to admit I'm confused about what you're actually trying to do. Anyway, a couple of reasons why what you do might not work:
From the mmap(2) manpage: "MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero. The fd and offset arguments are ignored;"
From the null(4) manpage: "Data written to a null or zero special file is discarded."
So anyway, before MAP_ANONYMOUS existed, mmap'ing /dev/zero was sometimes used to get anonymous (i.e. not backed by any file) memory. There is no need to do both. In either case, actually writing to all that memory implies that you need some kind of backing store for it, either physical memory or swap space. If you cannot guarantee that, maybe it's better to mmap() a real file on a filesystem with enough space?
Look into the Linux kernel mmap implementation:
vm_mmap → vm_mmap_pgoff → do_mmap_pgoff → mmap_region → file->f_op->mmap(file, vma)
In the function do_mmap_pgoff, it checks the max_map_count
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
root> sysctl -a | grep map_count
vm.max_map_count = 65530
In the function mmap_region, it checks the process virtual address limit (whether it is unlimited).
int may_expand_vm(struct mm_struct *mm, unsigned long npages)
{
unsigned long cur = mm->total_vm; /* pages */
unsigned long lim;
lim = rlimit(RLIMIT_AS) >> PAGE_SHIFT;
if (cur + npages > lim)
return 0;
return 1;
}
root> ulimit -a | grep virtual
virtual memory (kbytes, -v) unlimited
In linux kernel, init task has the rlimit setting by default.
[RLIMIT_AS] = { RLIM_INFINITY, RLIM_INFINITY }, \
#ifndef RLIM_INFINITY
# define RLIM_INFINITY (~0UL)
#endif
In order to prove it, use the test_mem program
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
struct rlimit rl;
int ret;
ret = getrlimit(RLIMIT_AS, &rl);
if (ret == 0) {
printf("RLIMIT_AS limit got successfully:\n");
printf("soft_limit=%lld, hard_limit=%lld\n", (long long)rl.rlim_cur, (long long)rl.rlim_max);
}
That means "unlimited" is reported as 0xFFFFFFFF for a 32-bit app on a 64-bit OS. If you change the shell's virtual address limit, the new value is reflected correctly:
root> ulimit -v 1024000
tmp> ./test_mem
RLIMIT_AS limit got successfully:
soft_limit=1048576000, hard_limit=1048576000
RLIMIT_DATA limit got successfully:
soft_limit=4294967295, hard_limit=4294967295
In mmap_region, there is an accountable check
accountable_mapping → security_vm_enough_memory_mm → cap_vm_enough_memory → __vm_enough_memory → overcommit/swap/admin and user reserve handling.
Check these three limits (max_map_count, RLIMIT_AS, and the overcommit accounting) to see which one you are hitting.

Poor memcpy performance in user space for mmap'ed physical memory in Linux

Of 192GB RAM installed on my computer, I have 188GB RAM above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel modules accumulates data into this large area used as a ring buffer using DMA. A user space application mmap's this ring buffer into user space, then copies blocks from the ring buffer at the current location for processing once they are ready.
Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code for a kernel module which implements the mmap file operation:
module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma) {
remap_pfn_range(vma, vma->vm_start,
resmem_hwaddr >> PAGE_SHIFT,
resmem_length, vma->vm_page_prot);
return 0;
}
and a test application, which does in essence (with the checks removed):
#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
char* destination = new char[BLOCKSIZE];
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
std::cout << "memcpy from mmap'ed to malloc'ed: " << time << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;
I have carried out memcpy tests of a 16MB data block for the different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:
| | 1GB | 4GB | 16GB | 64GB | 128GB | 188GB
|run 1 | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) | 9.326ms (1798.97MB/s) | 213.892ms ( 78.43MB/s) | 206.476ms ( 81.25MB/s)
|run 2 | 4.255ms (3942.94MB/s) | 4.249ms (3948.51MB/s) | 4.257ms (3941.09MB/s) | 4.298ms (3903.49MB/s) | 208.269ms ( 80.55MB/s) | 200.627ms ( 83.62MB/s)
My observations are:
From the first to the second run, memcpy from the mmap'ed to the malloc'ed area seems to benefit from the contents already being cached somewhere.
There is a significant performance degradation beyond 64GB, which can be noticed when using memcpy.
I would like to understand why that is so. Perhaps somebody in the Linux kernel developers group thought: 64GB should be enough for anybody (does this ring a bell?)
Kind regards,
peter
Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.
If I lock my application to CPU0, I can observe different memcpy speeds depending on what memory area was reserved and consequently mmap'ed. If the reserved memory area is off-CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to and from the "remote" area consumes more time (data block size = 16MB):
resmem=64G$4G (inside CPU0 realm): 3949MB/s
resmem=64G$96G (outside CPU0 realm): 82MB/s
resmem=64G$128G (outside CPU0 realm): 3948MB/s
resmem=92G$4G (inside CPU0 realm): 3966MB/s
resmem=92G$100G (outside CPU0 realm): 57MB/s
It nearly makes sense. Only the third case, 64G$128G, i.e. the uppermost 64GB, also yields good results, which somewhat contradicts the theory.
Regards,
peter
Your CPU probably doesn't have enough cache to deal with it efficiently. Either use lower memory, or get a CPU with a bigger cache.
