I was believing mremap would have a realloc-like behavior until debugging things like the following lines of code in C.
#define PAGESIZE 0x1000
void *p = mmap(0, PAGESIZE, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
void *p2 = mremap(p, PAGESIZE, PAGESIZE * 2, MREMAP_MAYMOVE);
// then any future access to the 2nd page in p2 would generate a nice SIGBUS
After viewing several old threads in some mailing-lists I know mmap was originally designed for 'pure' file mapping and folks who designed mremap seem not caring about codes like that above.
I know a shared memory object would do for this. But shm_open/shm_unlink require filenames and I don't want to deal with strings in this very project. And, I am not sure, maybe shared memory objects would more or less reduce performance of my application.
I'm just wondering if it's possible to make mremap work fine(fine here means no SIGBUS when expanding) with anonymously mapped memory, or if there is some similar methods simple&fast?
thx to all in advance :-)
Related
I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
perror("send");
return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
.iov_base = buf,
.iov_len = len
};
pipe(pipes);
sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
perror("vmsplice");
return -1;
}
sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
perror("splice");
return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() to a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything is working just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
int res;
...
for (i = 0; i < 2**page_order; ++i) {
if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
break;
}
}
vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
...
return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < 2**page_order; ++i) {
__free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)
Maybe this will help you to understand why alloc_pages requires a power-of-2 page number.
To optimize the page allocation process(and decrease external fragmentations), which is frequently engaged, Linux kernel developed per-cpu page cache and buddy-allocator to allocate memory(there is another allocator, slab, to serve memory allocations that are smaller than a page).
Per-cpu page cache serve the one-page allocation request, while buddy-allocator keeps 11 lists, each containing 2^{0-10} physical pages respectively. These lists perform well when allocate and free pages, and of course, the premise is you are requesting a power-of-2-sized buffer.
what is the main difference between File-backed mapping & Anonymous
mapping.
How can we choose between File-backed mapping or Anonymous
mapping, when we need an IPC between processes.
What is the advantage,disadvantage of using these.?
mmap() system call allows you to go for either file-backed mapping or anonymous mapping.
void *mmap(void *addr, size_t lengthint " prot ", int " flags ,int fd,
off_t offset)
File-backed mapping- In linux , there exists a file /dev/zero which is an infinite source of 0 bytes. You just open this file, and pass its descriptor to the mmap() call with appropriate flag, i.e., MAP_SHARED if you want the memory to be shared by other process or MAP_PRIVATE if you don't want sharing.
Ex-
.
.
if ((fd = open("/dev/zero", O_RDWR)) < 0)
printf("open error");
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,fd, 0)) == MAP_FAILED)
{
printf("Error in memory mapping");
exit(1);
}
close(fd); //close the file because memory is mapped
//create child process
.
.
Quoting the man-page of mmap() :-
The contents of a file mapping (as opposed to an anonymous mapping;
see MAP_ANONYMOUS below), are initialized using length bytes starting
at offset offset in the file (or other object) referred to by the file
descriptor fd. offset must be a multiple of the page size as returned
by sysconf(_SC_PAGE_SIZE).
In our case, it has been initialized with zeroes(0s).
Quoting the text from the book Advanced Programming in the UNIX Environment by W. Richard Stevens, Stephen A. Rago II Edition
The advantage of using /dev/zero in the manner that we've shown is
that an actual file need not exist before we call mmap to create the
mapped region. Mapping /dev/zero automatically creates a mapped region
of the specified size. The disadvantage of this technique is that it
works only between related processes. With related processes, however,
it is probably simpler and more efficient to use threads (Chapters 11
and 12). Note that regardless of which technique is used, we still
need to synchronize access to the shared data
After the call to mmap() succeeds, we create a child process which will be able to see the writes to the mapped region(as we specified MAP_SHARED flag).
Anonymous mapping - The similar thing that we did above can be done using anonymous mapping.For anonymous mapping, we specify the MAP_ANON flag to mmap and specify the file descriptor as -1.
The resulting region is anonymous (since it's not associated with a pathname through a file descriptor) and creates a memory region that can be shared with descendant processes.
The advantage is that we don't need any file for mapping the memory, the overhead of opening and closing file is also avoided.
if ((area = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0)) == MAP_FAILED)
printf("Error in anonymous memory mapping");
So, these file-backed mapping and anonymous mapping necessarily work only with related processes.
If you need this between unrelated processes, then you probably need to create named shared memory by using shm_open() and then you can pass the returned file descriptor to mmap().
I need to write in embedded Linux(2.6.37) as fast as possible incoming DMA buffers to HD partition as raw device /dev/sda1. Buffers are aligned as required and are of equal 512KB length. The process may continue for a very long time and fill as much as, for example, 256GB of data.
I need to use the memory-mapped file technique (O_DIRECT not applicable), but can't understand the exact way how to do this.
So, in pseudo code "normal" writing:
fd=open(/dev/sda1",O_WRONLY);
while(1) {
p = GetVirtualPointerToNewBuffer();
if (InputStopped())
break;
write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx the latest working test code looks like following:
#define TSIZE (64*KB)
void* TBuf;
int main(int argc, char **argv) {
int fdi=open("input.dat", O_RDONLY);
//int fdo=open("/dev/sdb2", O_RDWR);
int fdo=open("output.dat", O_RDWR);
int i, offs=0;
void* addr;
i = posix_memalign(&TBuf, TSIZE, TSIZE);
if ((fdo < 1) || (fdi < 1)) {
printf("Error in files\n");
return -1; }
while(1) {
addr = mmap((void*)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
if ((unsigned int)addr == 0xFFFFFFFFUL) {
printf("Error MMAP=%d, %s\n", errno, strerror(errno));
return -1; }
i = read(fdi, TBuf, TSIZE);
if (i != TSIZE) {
printf("End of data\n");
return 0; }
i = munmap(addr, TSIZE);
offs += TSIZE;
sleep(1);
};
}
UPDATE3:
1. To precisely imitate the DMA work, I need to move read() call before mmp(), because when the DMA finishes it provides me with the address where it has put data. So, in pseudo code:
while(1) {
read(fdi, TBuf, TSIZE);
addr = mmap((void*)TBuf, TSIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fdo, offs);
munmap(addr, TSIZE);
offs += TSIZE; }
This variant fails after(!) the first loop - read() says BAD ADDRESS on TBuf.
Without understanding exactly what I do, I substituted munmap() with msync(). This worked perfectly.
So, the question here - why unmapping the addr influenced on TBuf?
2.With the previous example working I went to the real system with the DMA. The same loop, just instead of read() call is the call which waits for a DMA buffer to be ready and its virtual address provided.
There are no error, the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync() a thing.
To test this, I eliminated in the working example the read() call - and yes, nothing was recorded too.
So, the question here - how can I tell Linux that the mapped region contains new data, please, flush it!
Thanks a lot!!!
If I correctly understand, it makes sense if You mmap() file (not sure if it You can mmap() raw partition/block-device) and data via DMA is written directly to this memory region.
For this to work You need to be able to control p (where new buffer is placed) or address where file is maped. If You don't - You'll have to copy memory contents (and will lose some benefits of mmap).
So psudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is first step only and I'm not completely sure if it will work with DMA (have no such experience).
I assume it is 32bit system, but even then 1GB mapped file size may be too big (if Your RAM is smaller You'll be swaping).
If this setup will work, next step would be to make loop to map regions of file at different offsets and unmap already filled ones.
Most likely You'll need to align addr to 4KB boundary.
When You'll unmap region, it's data will be synced to disk. So You'll need some testing to select appropriate mapped region size (while next region is filled by DMA, there must be enough time to unmap/write previous one).
UPDATE:
What exactly happens when You fill mmap'ed region via DMA I simply don't know (not sure how exactly dirty pages are detected: what is done by hardware, and what must be done by software).
UPDATE2: To my best knowledge:
DMA works the following way:
CPU arranges DMA transfer (address where to write transfered data in RAM);
DMA controller does the actual work, while CPU can do it's own work in parallel;
once DMA transfer is complete - DMA controller signals CPU via IRQ line (interrupt), so CPU can handle the result.
This seems simple while virtual memory is not involved: DMA should work independently from runing process (actual VM table in use by CPU). Yet it should be some mehanism to invalidate CPU cache for modified by DMA physical RAM pages (don't know if CPU needs to do something, or it is done authomatically by hardware).
mmap() forks the following way:
after successfull call of mmap(), file on disk is attached to process memory range (most likely some data structure is filled in OS kernel to hold this info);
I/O (reading or writing) from mmaped range triggers pagefault, which is handled by kernel loading appropriate blocks from atached file;
writes to mmaped range are handled by hardware (don't know how exactly: maybe writes to previously unmodified pages triger some fault, which is handled by kernel marking these pages dirty; or maybe this marking is done entirely in hardware and this info is available to kernel when it needs to flush modified pages to disk).
modified (dirty) pages are written to disk by OS (as it sees appropriate) or can be forced via msync() or munmap()
In theory it should be possible to do DMA transfers to mmaped range, but You need to find out, how exactly pages ar marked dirty (if You need to do something to inform kernel which pages need to be written to disk).
UPDATE3:
Even if modified by DMA pages are not marked dirty, You should be able to triger marking by rewriting (reading ant then writing the same) at least one value in each page (most likely each 4KB) transfered. Just make sure this rewriting is not removed (optimised out) by compiler.
UPDATE4:
It seems file opened O_WRONLY can't be mmap'ed (see question comments, my experimets confirm this too). It is logical conclusion of mmap() workings described above. The same is confirmed here (with reference to POSIX standart requirement to ensure file is readable regardless of maping protection flags).
Unless there is some way around, it actually means that by using mmap() You can't avoid reading of results file (unnecessary step in Your case).
Regarding DMA transfers to mapped range, I think it will be a requirement to ensure maped pages are preloalocated before DMA starts (so there is real memory asigned to both DMA and maped region). On Linux there is MAP_POPULATE mmap flag, but from manual it seams it works with MAP_PRIVATE mapings only (changes are not writen to disk), so most likely it is usuitable. Likely You'll have to triger pagefaults manually by accessing each maped page. This should triger reading of results file.
If You still wish to use mmap and DMA together, but avoid reading of results file, You'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling trigered pages, instead of reading them from disk).
I need reserve 256-512 Mb of continuous physical memory and have access to this memory from the user space.
I decided to use CMA for memory reserving.
Here are the steps on my idea that must be performed:
Reservation required amount of memory by CMA during system booting.
Parsing of CMA patch output which looks like for example: "CMA: reserved 256 MiB at 27400000" and saving two parameters: size of CMA area = 256*1024*1024 bytes and phys address of CMA area = 0x27400000.
Mapping of CMA area at /dev/mem file with offset = 0x27400000 using mmap(). (Of course, CONFIG_STRICT_DEVMEM is disabled)
It would let me to read data directly from phys memory from user space.
But the next code make segmentation fault(there size = 1Mb):
int file;
void* start;
file=open("/dev/mem", O_RDWR | O_SYNC);
if ( (start = mmap(0, 1024*1024, PROT_READ | PROT_WRITE, MAP_SHARED, file, 0x27400000)) == MAP_FAILED ){
perror("mmap");
}
for (int offs = 0; offs<50; offs++){
cout<<((char *)start)[offs];
}
Output of this code: "mmap: Invalid argument".
When I changed offset = 0x27400000 on 0, this code worked fine and program displayed trash. It also work for alot of offsets which I looked at /proc/iomem.
According to information from /proc/iomem, phys addr of CMA area (0x27400000 on my system) always situated in System RAM.
Does anyone have any ideas, how to mmap CMA area on /dev/mem? What am I doing wrong?
Thanks alot for any help!
Solution to this problem was suggested to me by Jeff Haran in kernelnewbies mailing list.
It was necessary to disable CONFIG_x86_PAT in .config and mmap() has started to work!
If CONFIG_X86_PAT is configured you will have problems mapping memory to user space. It basically implements the same restrictions as CONFIG_STRICT_DEVMEM.
Jeff Haran
Now I can mmap /dev/mem on any physical address I want.
But necessary to be careful:
Word of caution. CONFIG_X86_PAT was likely introduced for a reason. There may be some performance penalties for turning this off, though in my testing so far turning if off doesn’t seem to break anything.
Jeff Haran
I want to share memory between two process.
After mmap(), I get a address mapStart, then I add offset to mapStart and get mapAddr, and make sure mapAddr will not exceed maped PAGE_SIZE.
When I write to mapAddr by
memcpy((void *)mapAddr, data, size);
everything is OK.
But when I read from mapAddr by
memcpy( &data, (void *)mapAddr, size);`
that will case system crash.
Who know Why?
The similar problem is here
Add some Info: #Tony Delroy, #J-16 SDiZ
mmap function is:
mapStart = (void volatile *)mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, memfd, pa_base);
system crash: have no any OS error message, Console print some MCA info
the detail described in here
Just some idea.
Is your mmap() spanning over memory regions with different attribute? This is illegal.
Older kernel (you said 2.6.18) allowed this, but crash when you write to some of it.
See this post for some starting point. If it is possible, try a newer kernel.
There are at least two possible issues:
After mmap(), I get a address mapStart, then I add offset to mapStart and get mapAddr, and make sure mapAddr will not exceed maped PAGE_SIZE.
Not mapAddr must be made sure not to exceed the mapped size, but mapAddr+size. You are trying to touch size bytes, not just one.
memcpy((void *)mapAddr, data, size);
memcpy( &data, (void *)mapAddr, size);
Assuming data is not a array (which is a plausible assumption since you use it without address operator in the first line), the second line copies not from the location pointed to by data, but starting with data. This is quite possibly some unallocated memory, or some location on the stack, or whatever. If there is not a lot on the stack, it might as well read beyond the end of the stack into the text segment, or... something else.
(If data is indeed an array, it is of course equivalent, but then your code style would be inconsistent.)