Say we do an mmap() system call and map some PCIe device memory (like GPU memory) into user space; the application can then access that memory region on the device without any OS overhead. Data can be copied from the file system buffer directly into device memory without any additional copy.
The statement above must be wrong... Can anyone tell me where the flaw is? Thanks!
For a normal device, what you have said is correct. If the GPU memory behaves differently for reads and writes, the driver might do this. We should look at some documentation for cudaMemcpy().
From Nvidia's basics of CUDA, page 22:
direction specifies locations (host or device) of src and dst
Blocks CPU thread: returns after the copy is complete.
Doesn't start copying until previous CUDA calls complete
It seems pretty clear that cudaMemcpy() is synchronized with prior GPU register writes, which may have caused the mmap()'d memory to be updated. Since the GPU executes commands as a pipeline, previously issued commands may not have completed yet when cudaMemcpy() is issued from the CPU.
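As a rough illustration of those two quoted points, here is a minimal sketch against the CUDA runtime API (not from the original discussion; error checking omitted):

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int host_in = 42, host_out = 0;
        int *dev = NULL;

        cudaMalloc((void **)&dev, sizeof(int));

        /* Asynchronous copy on the default stream: may still be in flight
         * when the call returns to the CPU. */
        cudaMemcpyAsync(dev, &host_in, sizeof(int), cudaMemcpyHostToDevice, 0);

        /* Synchronous copy: does not start until the prior call on the
         * default stream has completed, and blocks the CPU thread until
         * the copy itself is done. */
        cudaMemcpy(&host_out, dev, sizeof(int), cudaMemcpyDeviceToHost);

        printf("%d\n", host_out);   /* guaranteed to print 42 */
        cudaFree(dev);
        return 0;
    }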
I'm working on a research project that requires me to perform a memory capture from custom hardware. I am working with a Zedboard SoC (dual-core ARM Cortex-A9 with FPGA fabric attached). I have designed a device driver that allows me to perform virtual memory captures and physical memory captures (using an AXI4-Lite peripheral that controls the Xilinx AXI DMA IP).
My goal is to capture all mapped pages, so I check /proc/pid/maps for mapped regions, then obtain PFNs from /proc/pid/pagemap, pass the physical addresses into my device driver, and then pass them to my custom hardware (which invokes the Xilinx AXI DMA to obtain the contents from physical memory).
NOTE: I am using Xilinx's PetaLinux distribution, which is built on Linux version 4.14.
My device driver implements the following procedure through a series of IOCTL calls (a rough user-space sketch follows the list):
Stop the target process.
Perform virtual memory capture (using the access_process_vm() function).
Flush the cache (using the flush_user_range() function).
Perform physical memory capture.
Resume the target process.
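For concreteness, here is a hedged sketch of what the user-space side of that sequence might look like; the device node /dev/memcap and the IOCTL names are hypothetical placeholders, not the driver's real interface:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical IOCTL numbers; the real driver defines its own. */
    #define MEMCAP_STOP_PROC     _IOW('M', 0, pid_t)
    #define MEMCAP_CAPTURE_VIRT  _IO('M', 1)
    #define MEMCAP_FLUSH_CACHE   _IO('M', 2)
    #define MEMCAP_CAPTURE_PHYS  _IO('M', 3)
    #define MEMCAP_RESUME_PROC   _IOW('M', 4, pid_t)

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
        pid_t target = (pid_t)atoi(argv[1]);

        int fd = open("/dev/memcap", O_RDWR);        /* hypothetical node */
        if (fd < 0) { perror("open"); return 1; }

        ioctl(fd, MEMCAP_STOP_PROC, &target);   /* 1. stop the target process   */
        ioctl(fd, MEMCAP_CAPTURE_VIRT);         /* 2. access_process_vm() pass  */
        ioctl(fd, MEMCAP_FLUSH_CACHE);          /* 3. flush_user_range() pass   */
        ioctl(fd, MEMCAP_CAPTURE_PHYS);         /* 4. AXI DMA physical capture  */
        ioctl(fd, MEMCAP_RESUME_PROC, &target); /* 5. resume the target process */

        close(fd);
        return 0;
    }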
What I'm noticing, however, is that the virtual memory capture and the physical memory capture differ in the [heap] section (which is the first section that extends past one page). The first page matches, but none of the other pages are even close. The [stack] section does not match at all. I should note that for the first two memory sections, .text and .rodata, the captures match exactly. The conclusion for now is that data that does not change during runtime matches between the virtual and physical captures, while data that does change during runtime does not match.
So this leaves me wondering: am I using the correct function to ensure coherency between the cache and the RAM? If not, what is the proper function to use to force a cache flush to RAM? It is necessary that the data in RAM is up-to-date to the point when the target process is stopped because I cannot access the caches from the custom hardware.
Edit 1:
In regard to this question being marked as a possible duplicate of this question, I am using a function from the accepted answer to initiate a cache flush. However, from my perspective, it doesn't seem to be working, as the physical memory does not match the virtual memory the way I would expect it to if a cache flush were occurring.
For anyone coming across this question in the future, the problem was not what I thought it was. The flush_user_range() function I mentioned is the correct function to use to push pages back to main memory from cache.
However, what I didn't think about at the time was that pages that are virtually contiguous are not necessarily (and are very often not) physically contiguous. In my kernel code, I passed the length of the mapped region to my hardware, which requested that length of data from the AXI DMA. What I should have done was a virtual-to-physical translation to get the physical address of each page in each region, then requested one page length of data from main memory, repeating that process for each page in each mapped region.
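To make that concrete, here is a hedged sketch of the per-page translation done from user space via /proc/<pid>/pagemap (bit layout as documented in the kernel's pagemap documentation; recent kernels require root to see PFNs, and capture_page() is only a placeholder for the call into the driver/DMA):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE_BYTES 4096UL   /* use sysconf(_SC_PAGESIZE) in real code */

    /* Return the physical address backing one virtual address of the given
     * pid, or 0 if the page is not present in RAM. */
    static uint64_t virt_to_phys_user(pid_t pid, uint64_t vaddr)
    {
        char path[64];
        uint64_t entry;

        snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return 0;

        /* One 64-bit entry per virtual page: bit 63 = present, bits 0-54 = PFN. */
        off_t offset = (off_t)(vaddr / PAGE_SIZE_BYTES) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
            close(fd);
            return 0;
        }
        close(fd);

        if (!(entry & (1ULL << 63)))           /* page not present */
            return 0;

        uint64_t pfn = entry & ((1ULL << 55) - 1);
        return pfn * PAGE_SIZE_BYTES + (vaddr % PAGE_SIZE_BYTES);
    }

    /* Walk one mapped region page by page and hand each page's physical
     * address to the capture hardware; capture_page() stands in for the
     * real driver/DMA call. */
    static void capture_region(pid_t pid, uint64_t start, uint64_t len,
                               void (*capture_page)(uint64_t phys, uint64_t len))
    {
        for (uint64_t va = start; va < start + len; va += PAGE_SIZE_BYTES) {
            uint64_t pa = virt_to_phys_user(pid, va);
            if (pa)
                capture_page(pa, PAGE_SIZE_BYTES);   /* one page at a time */
        }
    }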
I realize this is a pretty specific problem that likely won't help anyone else doing the same thing I was doing, but hopefully the learned lesson can help someone in the future: Linux allocates pages in physical memory (that are usually 4kB in size, though you shouldn't assume that's the case), and a collection of physical pages is contained in one mapped region. If you're working on any code that requires you to check physical memory, be sure to be wary of where your data may cross physical page boundaries and act accordingly.
I am trying to map DMA-coherent memory, which I allocate in my kernel driver, into user space. In user space I use mmap(), and in the kernel driver I use dma_alloc_coherent() followed by remap_pfn_range() to remap the pages.
The purpose of mapping the DMA memory into user space is to minimize ioctl() access to the kernel. The host must perform quite a high number of accesses to the DMA-coherent memory, and I want to access it directly from user space instead of wasting time on countless ioctl() operations.
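For context, a minimal sketch of the user-space side under these assumptions (the device node name /dev/mydma and the mapping length are placeholders):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_LEN 4096UL   /* matches the 4k test allocation mentioned below */

    int main(void)
    {
        int fd = open("/dev/mydma", O_RDWR);   /* hypothetical device node */
        if (fd < 0) { perror("open"); return 1; }

        /* This is the call that fails with EPERM in the question. */
        volatile uint32_t *buf = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        buf[0] = 0xdeadbeef;   /* direct access, no ioctl() round trip */

        munmap((void *)buf, MAP_LEN);
        close(fd);
        return 0;
    }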
mmap() returns EPERM (1) - Operation not permitted.
I found this post: mmap: Operation not permitted
Answer:
It sounds like the kernel has been compiled with CONFIG_STRICT_DEVMEM
enabled. This is a security feature to prevent user space access to
(possibly sensitive) physical memory above 1MB (IIRC). You might be
able to disable this with sysctl dev.mem.restricted.
That is the only useful info I've found. However, I see 2 issues:
1) For test purposes I've allocated only 4k. According to the statement above, only physical memory above 1MB should be a problem, yet I still can't mmap(). (For the final driver I would need a lot more DMA memory anyway, but recompiling the kernel can't be the solution to my problem.) Which leads me to 2)
2) Furthermore, re-compiling the kernel is not an option as the driver should work without tweaking the kernel in a specific way.
Any ideas on this one? I appreciate the help.
I am using Ubuntu 16.04.1, Kernel: 4.10.0-40-generic
EDIT: SOLVED
I made a copy-paste mistake that resulted in ret = -1, so the .mmap function in the kernel driver, which calls remap_pfn_range(), returned -1 instead of 0. This caused the mmap() in user space to fail.
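For reference, a minimal sketch of the fixed kernel-side .mmap handler under the same dma_alloc_coherent() + remap_pfn_range() approach (names are illustrative; using dma_handle >> PAGE_SHIFT as the PFN assumes the DMA address equals the physical address, and dma_mmap_coherent() is the more portable helper):

    #include <linux/dma-mapping.h>
    #include <linux/fs.h>
    #include <linux/mm.h>

    static void *cpu_addr;          /* filled in by dma_alloc_coherent() at init */
    static dma_addr_t dma_handle;   /* DMA/bus address returned alongside cpu_addr */

    static int mydma_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* remap_pfn_range() returns 0 on success and a negative errno on
         * failure; the handler must propagate that, not a hard-coded -1. */
        if (remap_pfn_range(vma, vma->vm_start,
                            dma_handle >> PAGE_SHIFT,
                            size, vma->vm_page_prot))
            return -EAGAIN;

        return 0;   /* mmap() in user space only succeeds if this is 0 */
    }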
I am using mmap to open /dev/mem for read/write into UART registers.
It works well, but my question is:
After a write, is the msync() system call with the MS_SYNC flag really needed?
From my understanding, /dev/mem is a virtual device that provides access to physical memory regions (UART registers in my case) by translating virtual memory addresses, and so gives user space access to some physical memory.
This is not a regular file, and I guess that modifications to the registers are not buffered/cached. I would actually like to avoid this system call for performance reasons.
Thanks
My understanding is that msync() is needed to update the data in a normal file that is modified through a mapping created with mmap().
But when you use mmap() on /dev/mem you are not mapping a normal file on disk; you are just mapping the desired hardware memory range directly into your process's virtual address space, so msync() is beside the point: it will do nothing.
The only thing that lies between your writes into the mmap()'d virtual space and the hardware device is the CPU cache. You can force a cache flush (__clear_cache(), maybe?), but that is usually unnecessary because the kernel identifies the memory-mapped device registers and disables caching for that range. On x86 CPUs that is usually done with MTRRs; for ARM I don't know the details...
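For illustration, a minimal sketch of mapping a register block through /dev/mem (the UART base address and register offsets are placeholders; opening with O_SYNC requests an uncached mapping on most architectures, though the exact behavior is architecture- and kernel-dependent):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define UART_PHYS_BASE 0x10000000UL   /* placeholder physical base address */
    #define MAP_SIZE       0x1000UL       /* one page */

    int main(void)
    {
        /* O_SYNC asks the kernel for an uncached mapping of the register
         * range, so writes reach the device without msync() or flushes. */
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *regs = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, UART_PHYS_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        uint32_t status = regs[1];   /* read a register (offset is illustrative) */
        regs[0] = 'A';               /* write a register; no msync(MS_SYNC) needed */
        (void)status;

        munmap((void *)regs, MAP_SIZE);
        close(fd);
        return 0;
    }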
I've inherited support for some Linux kernel drivers (an area in which my experience is very limited). My question is as follows. It's an embedded environment, and the hardware has 512MB of physical memory. However, the boot parameters passed to the kernel limit the memory to 256MB via the variable linuxMem=mem=256M. From my research into this environment variable, I am of the understanding that
this limits the amount of memory that the kernel can manage to 256MB.
Yet in some application code that runs on my target, I see an open() of /dev/mem followed by an mmap() of the returned file descriptor, and the offset parameter of the mmap() call lies in the upper 256MB of physical memory.
And things seem to be working fine. So my question is "why does it work if the kernel supposedly does not know about the upper 256MB?"
Strictly speaking, mem=256M is a kernel parameter, not an environment variable. This parameter only tells the kernel to use that much memory, but it does not make the system completely blind to the physical chips installed in the machine. It can be used to simulate a system with less physical memory than is actually available, but it is not fully equivalent to opening up your box and pulling out one of the memory chips.
Looking at the docs for this parameter, you can explicitly see that addresses outside the limited range can still be used in some situations, which is why they also recommend using memmap= in some cases. So you can't allocate memory for your app above the limit, but you can look at what is found at a given physical address, and it seems some device drivers make use of this possibility.
mmap() returns virtual addresses, not physical ones.
It's perfectly possible for a device to only have 64MB of memory and for mmap() to map something around 1GB.
I'm using Slackware 12.2 on an x86 machine. I'm trying to debug/figure out things by dumping specific parts of memory. Unfortunately my knowledge on the Linux kernel is quite limited to what I need for programming/pentesting.
So here's my question: Is there a way to access any point in memory? I tried doing this with a char pointer so that it would only be a byte long. However, the program crashed and spat out something to the effect of: "can't access memory location". I was pointing at location 0x00000000, which is where the system stores its interrupt vectors (unless that changed), which shouldn't really matter.
Now, my understanding is that the kernel allocates memory (data, stack, heap, etc.) to a program, and that program cannot go anywhere else. So I was thinking of using NASM to tell the CPU to fetch what I need directly, but I'm unsure whether that would work (and I would need to figure out how to translate MASM to NASM).
Alright, well, there's my long-winded monologue. Essentially my question is: "Is there a way to achieve this?"
Anyway...
If your program is running in user-mode, then memory outside of your process memory won't be accessible, by hook or by crook. Using asm will not help, nor will any other method. This is simply impossible, and is a core security/stability feature of any OS that runs in protected mode (i.e. all of them, for the past 20+ years). Here's a brief overview of Linux kernel memory management.
The only way you can explore the entire memory space of your computer is by using a kernel debugger, which will allow you to access any physical address. However, even that won't let you look at the memory of every process at the same time, since some processes will have been swapped out of main memory. Furthermore, even in kernel mode, physical addresses are not necessarily the same as the addresses visible to the process.
Take a look at /dev/mem or /dev/kmem (man mem)
If you have root access you should be able to see your memory there. This is a mechanism used by kernel debuggers.
Note the warning: Examining and patching is likely to lead to unexpected results when read-only or write-only bits are present.
From the man page:
mem is a character device file that is an image of the main memory of the computer. It may be used, for example, to examine (and even patch) the system.
Byte addresses in mem are interpreted as physical memory addresses. References to nonexistent locations cause errors to be returned.
...
The file kmem is the same as mem, except that the kernel virtual memory rather than physical memory is accessed.
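As a rough illustration (run as root; the physical address below is only an example, and a kernel built with strict /dev/mem restrictions may still refuse the access), peeking at a physical address through /dev/mem can look like this:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        off_t phys_addr = 0x000F0000;   /* example address (BIOS area on x86) */
        uint8_t byte;

        int fd = open("/dev/mem", O_RDONLY);   /* needs root */
        if (fd < 0) { perror("open"); return 1; }

        /* For /dev/mem the file offset is interpreted as a physical address. */
        if (pread(fd, &byte, 1, phys_addr) != 1) {
            perror("pread");
            close(fd);
            return 1;
        }

        printf("byte at 0x%llx = 0x%02x\n",
               (unsigned long long)phys_addr, byte);
        close(fd);
        return 0;
    }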