How much locked memory does io_uring_setup need? - linux

io_uring_queue_init calls io_uring_setup, which returns ENOMEM when the process has an insufficient amount of locked memory available.
A strace will look something like:
[pid 37480] io_uring_setup(2048, {flags=0, sq_thread_cpu=0, sq_thread_idle=0}) = -1 ENOMEM (Cannot allocate memory)
What is the formula for how much locked memory is required per entry (the first argument)? And, if possible, how does it depend on sq_entries/cq_entries in the params structure? Kernel code references for the particularly keen are welcome. Please don't expand the kernel page size in the formula, as I do want this to be an architecture-dependent answer (if it is).
I don't want a dodgy "just set ulimit -l to unlimited" as an answer. There's an outstanding feature request that would help when implemented.

Thanks to Jens Axboe, the following two liburing library calls were added (>= liburing-2.1), returning the size in bytes, 0 if no locked memory is required, or -errno on error.
ssize_t io_uring_mlock_size(unsigned entries, unsigned flags);
ssize_t io_uring_mlock_size_params(unsigned entries, struct io_uring_params *p);
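For example, a minimal sketch (assuming liburing >= 2.1; build with -luring) that compares the ring's requirement for the 2048 entries from the strace above against the current RLIMIT_MEMLOCK soft limit:
#include <liburing.h>
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* same entry count and flags as the io_uring_setup call in the strace */
    ssize_t need = io_uring_mlock_size(2048, 0);
    if (need < 0) {
        fprintf(stderr, "io_uring_mlock_size: %zd\n", need);
        return 1;
    }
    struct rlimit rl;
    getrlimit(RLIMIT_MEMLOCK, &rl);
    /* 0 means the kernel does not charge locked memory for these rings */
    printf("ring needs %zd locked bytes, RLIMIT_MEMLOCK soft limit is %llu\n",
           need, (unsigned long long)rl.rlim_cur);
    return 0;
}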

Related

realloc function gives SIGABRT due to limited heap size

I am trying to reproduce a problem.
My C code gives SIGABRT; I traced it back to line 3174 here:
https://elixir.bootlin.com/glibc/glibc-2.27/source/malloc/malloc.c
  /* Little security check which won't hurt performance: the allocator
     never wrapps around at the end of the address space.  Therefore
     we can exclude some size values which might appear here by
     accident or by "design" from some intruder.  We need to bypass
     this check for dumped fake mmap chunks from the old main arena
     because the new malloc may provide additional alignment.  */
  if ((__builtin_expect ((uintptr_t) oldp > (uintptr_t) -oldsize, 0)
       || __builtin_expect (misaligned_chunk (oldp), 0))
      && !DUMPED_MAIN_ARENA_CHUNK (oldp))
    malloc_printerr ("realloc(): invalid pointer");
My understanding is that memory gets allocated when I call calloc; then, when I call realloc and try to grow the memory area, the heap is not available for some reason, giving SIGABRT.
My other question is: how can I limit the heap area to a small number of bytes, say 10, to replicate the problem? On Stack Overflow, RLIMIT and setrlimit are mentioned, but no sample code is given. Can you provide sample code where the heap size is 10 bytes?
How can I limit the heap area to some bytes say, 10 bytes
Can you provide sample code where heap size is 10 Bytes ?
From How to limit heap size for a c code in linux, you could do:
You could use (inside your program) setrlimit(2), probably with RLIMIT_AS (as cited by Ouah's answer).
#include <sys/resource.h>
int main() {
    /* limit the address space (and hence the heap) to 10 bytes */
    setrlimit(RLIMIT_AS, &(struct rlimit){10, 10});
}
Better yet, make your shell do it. With bash it is the ulimit builtin.
$ ulimit -v 10
$ ./your_program.out
to replicate the problem
Most probably, limiting the heap size will just produce a different problem, one caused by the limit itself and unrelated to your original bug, so it will not help you debug it. Instead, I would suggest researching AddressSanitizer and Valgrind.
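For completeness, a minimal sketch (my own, not from the linked answer) combining setrlimit() with the calloc/realloc sequence from the question. A literal 10-byte limit prevents even the C runtime from starting, so a small-but-survivable 1 MB limit is used here; with it in place, realloc is expected to return NULL with ENOMEM rather than abort:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>

int main(void) {
    /* limit the whole address space to 1 MB */
    struct rlimit rl = { 1 << 20, 1 << 20 };
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    char *p = calloc(1, 4096);        /* may still succeed from the existing heap */
    char *q = realloc(p, 16 << 20);   /* asks for far more than the limit allows */
    printf("realloc: %p (%s)\n", (void *)q, q ? "ok" : strerror(errno));
    return 0;
}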

Writing out DMA buffers into memory mapped file

On embedded Linux (2.6.37) I need to write incoming DMA buffers as fast as possible to an HD partition used as a raw device, /dev/sda1. The buffers are aligned as required and are each 512 KB long. The process may continue for a very long time and fill as much as, for example, 256 GB of data.
I need to use the memory-mapped file technique (O_DIRECT is not applicable), but I can't figure out the exact way to do this.
So, in pseudo code "normal" writing:
fd = open("/dev/sda1", O_WRONLY);
while (1) {
    p = GetVirtualPointerToNewBuffer();
    if (InputStopped())
        break;
    write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx, the latest working test code looks like the following:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define KB 1024
#define TSIZE (64*KB)

void *TBuf;

int main(int argc, char **argv) {
    int fdi = open("input.dat", O_RDONLY);
    //int fdo = open("/dev/sdb2", O_RDWR);
    int fdo = open("output.dat", O_RDWR);
    int i, offs = 0;
    void *addr;

    i = posix_memalign(&TBuf, TSIZE, TSIZE);
    if ((fdo < 0) || (fdi < 0)) {
        printf("Error in files\n");
        return -1;
    }
    while (1) {
        /* map the next TSIZE chunk of the output file at TBuf */
        addr = mmap(TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
        if (addr == MAP_FAILED) {
            printf("Error MMAP=%d, %s\n", errno, strerror(errno));
            return -1;
        }
        i = read(fdi, TBuf, TSIZE);
        if (i != TSIZE) {
            printf("End of data\n");
            return 0;
        }
        i = munmap(addr, TSIZE);
        offs += TSIZE;
        sleep(1);
    }
}
UPDATE3:
1. To precisely imitate the DMA's behavior, I need to move the read() call before mmap(), because when the DMA finishes it provides me with the address where it has put the data. So, in pseudo code:
while (1) {
    read(fdi, TBuf, TSIZE);
    addr = mmap((void *)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED, fdo, offs);
    munmap(addr, TSIZE);
    offs += TSIZE;
}
This variant fails after(!) the first loop iteration: read() reports Bad Address (EFAULT) on TBuf.
Without understanding exactly what I was doing, I replaced munmap() with msync(). This worked perfectly.
So, the question here: why does unmapping addr affect TBuf?
2. With the previous example working, I moved to the real system with the DMA. The loop is the same, except that instead of the read() call there is a call which waits for a DMA buffer to be ready and provides its virtual address.
There are no errors, the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync a thing.
To test this, I removed the read() call from the working example, and indeed nothing was recorded there either.
So, the question here: how can I tell Linux that the mapped region contains new data and must be flushed?
Thanks a lot!!!
If I understand correctly, this makes sense if you mmap() a file (I'm not sure whether you can mmap() a raw partition/block device) and the data is written via DMA directly to this memory region.
For this to work, you need to be able to control p (where the new buffer is placed) or the address where the file is mapped. If you can't, you'll have to copy the memory contents (and will lose some of the benefits of mmap).
So the pseudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is a first step only, and I'm not completely sure it will work with DMA (I have no such experience).
I assume it is a 32-bit system, but even then a 1 GB mapped file size may be too big (if your RAM is smaller, you'll be swapping).
If this setup works, the next step would be a loop that maps regions of the file at different offsets and unmaps the already filled ones, as sketched below.
Most likely you'll need to align addr to a 4 KB boundary.
When you unmap a region, its data will be synced to disk, so you'll need some testing to select an appropriate mapped region size (while the next region is being filled by DMA, there must be enough time to unmap/write the previous one).
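A sketch of that sliding-window loop (WINDOW, FillViaDMA() and InputStopped() are placeholders; this is untested):
off_t offs = 0;
while (!InputStopped()) {
    void *adr = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offs);
    if (adr == MAP_FAILED)
        break;
    FillViaDMA(adr, WINDOW);   /* DMA fills the mapped window */
    munmap(adr, WINDOW);       /* per the note above, lets the kernel write the dirty pages back */
    offs += WINDOW;
}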
UPDATE:
What exactly happens when you fill an mmap'ed region via DMA I simply don't know (I'm not sure how exactly dirty pages are detected: what is done by hardware, and what must be done by software).
UPDATE2: To my best knowledge:
DMA works the following way:
the CPU arranges the DMA transfer (the address in RAM where the transferred data should be written);
the DMA controller does the actual work, while the CPU can do its own work in parallel;
once the DMA transfer is complete, the DMA controller signals the CPU via an IRQ line (interrupt), so the CPU can handle the result.
This seems simple as long as virtual memory is not involved: DMA should work independently of the running process (the actual VM table in use by the CPU). Yet there should be some mechanism to invalidate the CPU cache for physical RAM pages modified by DMA (I don't know whether the CPU needs to do something, or it is done automatically by hardware).
mmap() works the following way:
after a successful call to mmap(), the file on disk is attached to the process's memory range (most likely some data structure is filled in the OS kernel to hold this info);
I/O (reading or writing) in the mmaped range triggers a page fault, which is handled by the kernel loading the appropriate blocks from the attached file;
writes to the mmaped range are handled by hardware (I don't know how exactly: maybe writes to previously unmodified pages trigger some fault, which is handled by the kernel marking these pages dirty; or maybe this marking is done entirely in hardware and the info is available to the kernel when it needs to flush modified pages to disk);
modified (dirty) pages are written to disk by the OS (as it sees appropriate) or can be forced via msync() or munmap().
In theory it should be possible to do DMA transfers to an mmaped range, but you need to find out how exactly pages are marked dirty (whether you need to do something to inform the kernel which pages need to be written to disk).
UPDATE3:
Even if the pages modified by DMA are not marked dirty, you should be able to trigger the marking by rewriting (reading and then writing back) at least one value in each page (most likely each 4 KB) transferred. Just make sure this rewriting is not removed (optimized out) by the compiler.
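A sketch of that rewrite trick (the 4 KB page size is an assumption; volatile keeps the compiler from optimizing the read/write pair away):
#include <stddef.h>

#define PAGE_SZ 4096   /* assumed page size */

static void mark_pages_dirty(void *buf, size_t len)
{
    volatile char *p = buf;
    size_t off;
    for (off = 0; off < len; off += PAGE_SZ)
        p[off] = p[off];   /* read one byte and write the same value back */
}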
UPDATE4:
It seems a file opened O_WRONLY can't be mmap'ed (see the question comments; my experiments confirm this too). It is a logical consequence of the mmap() workings described above. The same is confirmed here (with reference to the POSIX standard's requirement that the file be readable regardless of the mapping protection flags).
Unless there is some way around it, this actually means that by using mmap() you can't avoid reading the results file (an unnecessary step in your case).
Regarding DMA transfers to a mapped range, I think it will be a requirement to ensure the mapped pages are preallocated before the DMA starts (so there is real memory assigned to both the DMA and the mapped region). On Linux there is the MAP_POPULATE mmap flag, but from the manual it seems to work with MAP_PRIVATE mappings only (changes are not written to disk), so most likely it is unsuitable. Likely you'll have to trigger the page faults manually by accessing each mapped page; this should trigger reading of the results file.
If you still wish to use mmap and DMA together but avoid reading the results file, you'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling the faulted-in pages instead of reading them from disk).

linux write(): does it try to write as many bytes as possible?

If I use write in this way: write(fd, buf, 10000000 /* 10MB */), where fd is a socket using blocking I/O, will the kernel try to flush as many bytes as possible so that only one call is enough? Or do I have to call write several times according to its return value? And if that happens, does it mean something is wrong with fd?
============================== EDITED ================================
Thanks for all the answers. Furthermore, if I put fd into poll and it returns successfully with POLLOUT, does that mean the call to write cannot block and will write all the data, unless something is wrong with fd?
In blocking mode, write(2) will only return once the specified number of bytes has been written. If it cannot write, it will wait.
In non-blocking (O_NONBLOCK) mode it will not wait; it returns right away. If it can write everything, the call succeeds; otherwise it sets errno accordingly. You then have to check errno: if it is EWOULDBLOCK or EAGAIN, you have to invoke the same write again.
From manual of write(2)
The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)
So yes, there can be something wrong with fd.
Also note this
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
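Regardless of mode, the usual defensive pattern is a retry loop. A minimal sketch (not from the man page; it just illustrates why the return value must always be checked):
#include <errno.h>
#include <unistd.h>

static ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before writing anything: retry */
            return -1;      /* real error; inspect errno (e.g. EAGAIN in non-blocking mode) */
        }
        p += n;             /* short write: advance past what was accepted */
        left -= n;
    }
    return (ssize_t)len;
}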
/etc/sysctl.conf is used in Linux to set parameters for the TCP protocol, which is what I assume you mean by a socket. There may be a lot of parameters there, but when you dig through it, basically there is a limit to the amount of data the TCP buffers can hold at one time.
So if you tried to write 10 MB of data in one go, write would return an ssize_t value equal to however much the system accepted. Always check the return value of the write() system call. If the system allowed 10 MB, then write would return that value.
The value is
net.core.wmem_max = [some number]
If you change [some number] to a value large enough to allow 10 MB, you can write that much. DON'T do that! You could cause other problems. Research settings before you change anything; changing settings can decrease performance. Be careful.
http://linux.die.net/man/7/tcp
has basic C information for TCP settings. Also check out /proc/sys/net on your box.
One other point: TCP is a two-way door, so just because you can send a zillion bytes at one time does not mean the other side can read it or even handle it. Your socket may just block for a while, and your write() return value may be less than you hoped for.

How to find out the amount of free physical memory under linux (in c)

Assume I want to cache certain calculations, but provoking the cache to be paged out to disk would incur an I/O penalty that would more than defeat the whole purpose of caching.
This means I need to be able to find out how much physical RAM is left (including cached memory, assuming I can push that out, and allowing for some slack should buffering increase). I looked into /proc/meminfo and know how to read it; I am just not sure how to combine the numbers to get what I want.
I will not have root on the box it needs to run on, but the box should be reasonably quiet otherwise: no large amount of disk I/O, no other processes claiming a lot of memory in a burst. The OS is a rather recent Linux with overcommitting turned on. This obviously needs to work without triggering the OOM killer.
The numbers don't need to be exact down to the megabyte. I assume it'll be roughly in the 1 to 7 GiB range depending on the box, but getting within about 100 MB would be great.
It'd definitely be preferable if the estimate were to err on the small side.
Unices have the standard sysconf() function (OpenGroups man page, Linux man page).
Using this function, you can get the amount of available physical memory (use _SC_PHYS_PAGES instead of _SC_AVPHYS_PAGES if you want the total):
unsigned long long ps = sysconf(_SC_PAGESIZE);     /* page size in bytes */
unsigned long long pn = sysconf(_SC_AVPHYS_PAGES); /* available physical pages */
unsigned long long availMem = ps * pn;
As an alternative to the answer of H2CO3, you can read from /proc/meminfo.
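For example, a minimal sketch that pulls MemAvailable out of it (the field exists since kernel 3.14; on older kernels a rough fallback is MemFree + Cached; values are reported in kB):
#include <stdio.h>

static long long mem_available_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];
    long long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "MemAvailable: %lld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;   /* -1 if the file or the field was not found */
}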
For me, statfs worked well.
#include <sys/vfs.h>

struct statfs buf;
size_t available_mem;

if (statfs("/", &buf) == -1)
    available_mem = 0;
else
    available_mem = buf.f_bsize * buf.f_bfree;

After mmap(), write to returned address is OK, but read causes system crash. Why?

I want to share memory between two processes.
After mmap(), I get an address mapStart; then I add an offset to mapStart to get mapAddr, making sure mapAddr does not exceed the mapped PAGE_SIZE.
When I write to mapAddr by
memcpy((void *)mapAddr, data, size);
everything is OK.
But when I read from mapAddr by
memcpy(&data, (void *)mapAddr, size);
that will crash the system.
Who knows why?
A similar problem is here.
Adding some info for Tony Delroy and J-16 SDiZ:
The mmap function is:
mapStart = (void volatile *)mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, memfd, pa_base);
System crash: there is no OS error message; the console prints some MCA info.
The details are described here.
Just some ideas.
Is your mmap() spanning memory regions with different attributes? This is illegal.
Older kernels (you said 2.6.18) allowed this, but crash when you write to some of it.
See this post for a starting point. If it is possible, try a newer kernel.
There are at least two possible issues:
After mmap(), I get an address mapStart; then I add an offset to mapStart to get mapAddr, and make sure mapAddr will not exceed the mapped PAGE_SIZE.
It is not mapAddr that must be kept within the mapped size, but mapAddr + size: you are trying to touch size bytes, not just one.
memcpy((void *)mapAddr, data, size);
memcpy( &data, (void *)mapAddr, size);
Assuming data is not an array (a plausible assumption, since you use it without the address operator in the first line), the second line copies not from the location pointed to by data, but starting at data itself. That is quite possibly some unallocated memory, some location on the stack, or whatever. If there is not a lot on the stack, it might as well read beyond the end of the stack into the text segment, or... something else.
(If data is indeed an array, the two are of course equivalent, but then your code style would be inconsistent.)
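To illustrate the distinction (SIZE and GetBuffer() are hypothetical):
char  arr[SIZE];            /* array: memcpy(&arr, ...) and memcpy(arr, ...) target the same bytes */
char *data = GetBuffer();   /* pointer: memcpy(&data, ...) overwrites the pointer variable itself */

memcpy((void *)mapAddr, data, size);    /* OK: copies from the buffer data points to */
memcpy(&data, (void *)mapAddr, size);   /* wrong: clobbers data and whatever follows it in memory */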
