malloc large memory never returns NULL

malloc large memory never returns NULL - malloc

when I run this, it seems to have no problem with keep allocating memory with cnt going over thousands. I don't understand why -- aren't I supposed to get a NULL at some point? Thanks!
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
int main(void)
{
long C = pow(10, 9);
int cnt = 0;
int conversion = 8 * 1024 * 1024;
int *p;
while (1)
{
p = (int *)malloc(C * sizeof(int));
if (p != NULL)
cnt++;
else break;
if (cnt % 10 == 0)
printf("number of successful malloc is %d with %ld Mb\n", cnt, cnt * C / conversion);
}
return 0;
}

Are you running this on Linux? Linux has a highly surprising feature known as overcommit. It doesn't actually allocate memory when you call malloc(), but rather when you actually use that memory. malloc() will happily let you allocate as much memory as your heart desires, never returning a NULL pointer.
It's only when you actually access the memory that Linux takes you seriously and goes out searching for free memory to give you. Of course there may not actually be enough memory to meet the promise it gave your program. You say, "Give me 8GB," and malloc() says, "Sure." Then you try to write to your pointer and Linux says, "Oops! I lied. How bout I just kill off processes (probably yours) until I I free up enough memory?"

You're allocating virtual memory. On a 64-bit OS, virtual memory is available in almost unlimited supply.

Related

What is zeroed page in Linux kernel?

In Linux kernel, what does 'zeroed page' actually mean?
I have tried corelate it with free pages but it does not make a lot of sense.

Zeroed pages are pages where all bits in the page are set to 0. Not all zeroed pages are free, nor are all free pages necessarily zeroed (implementation specific):
A free page doesn't necessarily mean it is zeroed. It could be a page that is set to invalid (not in use by any process), but has old data from when it was last used. For security reasons, the OS should zero out pages before giving it to another program.
A zeroed page also doesn't mean it's a free page. When a process uses malloc() and then does a read (tested in Ubuntu 20.04), the memory that is allocated is all zeros, but, of course, at this point the page is not free. I wrote this C program to verify:
#include <stdio.h>
#include <stdlib.h>
#define PAGE_SIZE 4096
int num_pages = 32;
int main(){
int i;
int bytes = num_pages * PAGE_SIZE;
char * test = (char *)malloc(bytes);
if (test == NULL){
printf("Malloc failed.\n");
return -1;
}
for(i =0; i < bytes; i++){
// A zeroed page will have all (char) zeros in it
if (test[i] != (char) 0)
printf("Not (char) 0: %c\n", test[i]);
}
return 0;
}
As pointed out in the comments by #0andriy, my original example using calloc is implemented using the "Zero page", a single page filled with zeroes that all callocs can return using the copy-on-write optimization described here.

munmap() failure with ENOMEM with private anonymous mapping

I have recently discovered that Linux does not guarantee that memory allocated with mmap can be freed with munmap if this leads to situation when number of VMA (Virtual Memory Area) structures exceed vm.max_map_count. Manpage states this (almost) clearly:
ENOMEM The process's maximum number of mappings would have been exceeded.
This error can also occur for munmap(), when unmapping a region
in the middle of an existing mapping, since this results in two
smaller mappings on either side of the region being unmapped.
The problem is that Linux kernel always tries to merge VMA structures if possible, making munmap fail even for separately created mappings. I was able to write a small program to confirm this behavior:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
// value of vm.max_map_count
#define VM_MAX_MAP_COUNT (65530)
// number of vma for the empty process linked against libc - /proc/<id>/maps
#define VMA_PREMAPPED (15)
#define VMA_SIZE (4096)
#define VMA_COUNT ((VM_MAX_MAP_COUNT - VMA_PREMAPPED) * 2)
int main(void)
{
static void *vma[VMA_COUNT];
for (int i = 0; i < VMA_COUNT; i++) {
vma[i] = mmap(0, VMA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (vma[i] == MAP_FAILED) {
printf("mmap() failed at %d\n", i);
return 1;
}
}
for (int i = 0; i < VMA_COUNT; i += 2) {
if (munmap(vma[i], VMA_SIZE) != 0) {
printf("munmap() failed at %d (%p): %m\n", i, vma[i]);
}
}
}
It allocates a large number of pages (twice the default allowed maximum) using mmap, then munmaps every second page to create separate VMA structure for each remaining page. On my machine the last munmap call always fails with ENOMEM.
Initially I thought that munmap never fails if used with the same values for address and size that were used to create mapping. Apparently this is not the case on Linux and I was not able to find information about similar behavior on other systems.
At the same time in my opinion partial unmapping applied to the middle of a mapped region is expected to fail on any OS for every sane implementation, but I haven't found any documentation that says such failure is possible.
I would usually consider this a bug in the kernel, but knowing how Linux deals with memory overcommit and OOM I am almost sure this is a "feature" that exists to improve performance and decrease memory consumption.
Other information I was able to find:
Similar APIs on Windows do not have this "feature" due to their design (see MapViewOfFile, UnmapViewOfFile, VirtualAlloc, VirtualFree) - they simply do not support partial unmapping.
glibc malloc implementation does not create more than 65535 mappings, backing off to sbrk when this limit is reached: https://code.woboq.org/userspace/glibc/malloc/malloc.c.html. This looks like a workaround for this issue, but it is still possible to make free silently leak memory.
jemalloc had trouble with this and tried to avoid using mmap/munmap because of this issue (I don't know how it ended for them).
Do other OS's really guarantee deallocation of memory mappings? I know Windows does this, but what about other Unix-like operating systems? FreeBSD? QNX?
EDIT: I am adding example that shows how glibc's free can leak memory when internal munmap call fails with ENOMEM. Use strace to see that munmap fails:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
// value of vm.max_map_count
#define VM_MAX_MAP_COUNT (65530)
#define VMA_MMAP_SIZE (4096)
#define VMA_MMAP_COUNT (VM_MAX_MAP_COUNT)
// glibc's malloc default mmap_threshold is 128 KiB
#define VMA_MALLOC_SIZE (128 * 1024)
#define VMA_MALLOC_COUNT (VM_MAX_MAP_COUNT)
int main(void)
{
static void *mmap_vma[VMA_MMAP_COUNT];
for (int i = 0; i < VMA_MMAP_COUNT; i++) {
mmap_vma[i] = mmap(0, VMA_MMAP_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (mmap_vma[i] == MAP_FAILED) {
printf("mmap() failed at %d\n", i);
return 1;
}
}
for (int i = 0; i < VMA_MMAP_COUNT; i += 2) {
if (munmap(mmap_vma[i], VMA_MMAP_SIZE) != 0) {
printf("munmap() failed at %d (%p): %m\n", i, mmap_vma[i]);
return 1;
}
}
static void *malloc_vma[VMA_MALLOC_COUNT];
for (int i = 0; i < VMA_MALLOC_COUNT; i++) {
malloc_vma[i] = malloc(VMA_MALLOC_SIZE);
if (malloc_vma[i] == NULL) {
printf("malloc() failed at %d\n", i);
return 1;
}
}
for (int i = 0; i < VMA_MALLOC_COUNT; i += 2) {
free(malloc_vma[i]);
}
}

One way to work around this problem on Linux is to mmap more that 1 page at once (e.g. 1 MB at a time), and also map a separator page after it. So, you actually call mmap on 257 pages of memory, then remap the last page with PROT_NONE, so that it cannot be accessed. This should defeat the VMA merging optimization in the kernel. Since you are allocating many pages at once, you should not run into the max mapping limit. The downside is you have to manually manage how you want to slice the large mmap.
As to your questions:
System calls can fail on any system for a variety of reasons. Documentation is not always complete.
You are allowed to munmap a part of a mmapd region as long as the address passed in lies on a page boundary, and the length argument is rounded up to the next multiple of the page size.

max thread per process in linux

I wrote a simple program to calculate the maximum number of threads that a process can have in linux (Centos 5). here is the code:
int main()
{
pthread_t thrd[400];
for(int i=0;i<400;i++)
{
int err=pthread_create(&thrd[i],NULL,thread,(void*)i);
if(err!=0)
cout << "thread creation failed: " << i <<" error code: " << err << endl;
}
return 0;
}
void * thread(void* i)
{
sleep(100);//make the thread still alive
return 0;
}
I figured out that max number for threads is only 300!? What if i need more than that?
I have to mention that pthread_create returns 12 as error code.
Thanks before

There is a thread limit for linux and it can be modified runtime by writing desired limit to /proc/sys/kernel/threads-max. The default value is computed from the available system memory. In addition to that limit, there's also another limit: /proc/sys/vm/max_map_count which limits the maximum mmapped segments and at least recent kernels will mmap memory per thread. It should be safe to increase that limit a lot if you hit it.
However, the limit you're hitting is lack of virtual memory in 32bit operating system. Install a 64 bit linux if your hardware supports it and you'll be fine. I can easily start 30000 threads with a stack size of 8MB. The system has a single Core 2 Duo + 8 GB of system memory (I'm using 5 GB for other stuff in the same time) and it's running 64 bit Ubuntu with kernel 2.6.32. Note that memory overcommit (/proc/sys/vm/overcommit_memory) must be allowed because otherwise system would need at least 240 GB of committable memory (sum of real memory and swap space).
If you need lots of threads and cannot use 64 bit system your only choice is to minimize the memory usage per thread to conserve virtual memory. Start with requesting as little stack as you can live with.

Your system limits may not be allowing you to map the stacks of all the threads you require. Look at /proc/sys/vm/max_map_count, and see this answer. I'm not 100% sure this is your problem, because most people run into problems at much larger thread counts.

I had also encountered the same problem when my number of threads crosses some threshold.
It was because of the user level limit (number of process a user can run at a time) set to 1024 in /etc/security/limits.conf .
so check your /etc/security/limits.conf and look for entry:-
username -/soft/hard -nproc 1024
change it to some larger values to something 100k(requires sudo privileges/root) and it should work for you.
To learn more about security policy ,see http://linux.die.net/man/5/limits.conf.

check the stack size per thread with ulimit, in my case Redhat Linux 2.6:
ulimit -a
...
stack size (kbytes, -s) 10240
Each of your threads will get this amount of memory (10MB) assigned for it's stack. With a 32bit program and a maximum address space of 4GB, that is a maximum of only 4096MB / 10MB = 409 threads !!! Minus program code, minus heap-space will probably lead to your observed max. of 300 threads.
You should be able to raise this by compiling a 64bit application or setting ulimit -s 8192 or even ulimit -s 4096. But if this is advisable is another discussion...

You will run out of memory too unless u shrink the default thread stack size. Its 10MB on our version of linux.
EDIT:
Error code 12 = out of memory, so I think the 1mb stack is still too big for you. Compiled for 32 bit, I can get a 100k stack to give me 30k threads. Beyond 30k threads I get Error code 11 which means no more threads allowed. A 1MB stack gives me about 4k threads before error code 12. 10MB gives me 427 threads. 100MB gives me 42 threads. 1 GB gives me 4... We have 64 bit OS with 64 GB ram. Is your OS 32 bit? When I compile for 64bit, I can use any stack size I want and get the limit of threads.
Also I noticed if i turn the profiling stuff (Tools|Profiling) on for netbeans and run from the ide...I only can get 400 threads. Weird. Netbeans also dies if you use up all the threads.
Here is a test app you can run:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <signal.h>
// this prevents the compiler from reordering code over this COMPILER_BARRIER
// this doesnt do anything
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
sigset_t _fSigSet;
volatile int _cActive = 0;
pthread_t thrd[1000000];
void * thread(void *i)
{
int nSig, cActive;
cActive = __sync_fetch_and_add(&_cActive, 1);
COMPILER_BARRIER(); // make sure the active count is incremented before sigwait
// sigwait is a handy way to sleep a thread and wake it on command
sigwait(&_fSigSet, &nSig); //make the thread still alive
COMPILER_BARRIER(); // make sure the active count is decrimented after sigwait
cActive = __sync_fetch_and_add(&_cActive, -1);
//printf("%d(%d) ", i, cActive);
return 0;
}
int main(int argc, char** argv)
{
pthread_attr_t attr;
int cThreadRequest, cThreads, i, err, cActive, cbStack;
cbStack = (argc > 1) ? atoi(argv[1]) : 0x100000;
cThreadRequest = (argc > 2) ? atoi(argv[2]) : 30000;
sigemptyset(&_fSigSet);
sigaddset(&_fSigSet, SIGUSR1);
sigaddset(&_fSigSet, SIGSEGV);
printf("Start\n");
pthread_attr_init(&attr);
if ((err = pthread_attr_setstacksize(&attr, cbStack)) != 0)
printf("pthread_attr_setstacksize failed: err: %d %s\n", err, strerror(err));
for (i = 0; i < cThreadRequest; i++)
{
if ((err = pthread_create(&thrd[i], &attr, thread, (void*)i)) != 0)
{
printf("pthread_create failed on thread %d, error code: %d %s\n",
i, err, strerror(err));
break;
}
}
cThreads = i;
printf("\n");
// wait for threads to all be created, although we might not wait for
// all threads to make it through sigwait
while (1)
{
cActive = _cActive;
if (cActive == cThreads)
break;
printf("Waiting A %d/%d,", cActive, cThreads);
sched_yield();
}
// wake em all up so they exit
for (i = 0; i < cThreads; i++)
pthread_kill(thrd[i], SIGUSR1);
// wait for them all to exit, although we might be able to exit before
// the last thread returns
while (1)
{
cActive = _cActive;
if (!cActive)
break;
printf("Waiting B %d/%d,", cActive, cThreads);
sched_yield();
}
printf("\nDone. Threads requested: %d. Threads created: %d. StackSize=%lfmb\n",
cThreadRequest, cThreads, (double)cbStack/0x100000);
return 0;
}

Zero bytes lost in Valgrind

What does it mean when Valgrind reports o bytes lost, like here:
==27752== 0 bytes in 1 blocks are definitely lost in loss record 2 of 1,532
I suspect it is just an artifact from creative use of malloc, but it is good to be sure (-;
EDIT: Of course the real question is whether it can be ignored or it is an effective leak that should be fixed by freeing those buffers.

Yes, this is a real leak, and it should be fixed.
When you malloc(0), malloc may either give you NULL, or an address that is guaranteed to be different from that of any other object.
Since you are likely on Linux, you get the second. There is no space wasted for the allocated buffer itself, but libc has to do some housekeeping, and that does waste space, so you can't go on doing malloc(0) indefinitely.
You can observe it with:
#include <stdio.h>
#include <stdlib.h>
int main() {
unsigned long i;
for (i = 0; i < (size_t)-1; ++i) {
void *p = malloc(0);
if (p == NULL) {
fprintf(stderr, "Ran out of memory on %ld iteration\n", i);
break;
}
}
return 0;
}
gcc t.c && bash -c 'ulimit -v 10240 && ./a.out'
Ran out of memory on 202751 iteration

It looks like you allocated a block with 0 size and then didn't subsequently free it.

Direct Memory Access in Linux

I'm trying to access physical memory directly for an embedded Linux project, but I'm not sure how I can best designate memory for my use.
If I boot my device regularly, and access /dev/mem, I can easily read and write to just about anywhere I want. However, in this, I'm accessing memory that can easily be allocated to any process; which I don't want to do
My code for /dev/mem is (all error checking, etc. removed):
mem_fd = open("/dev/mem", O_RDWR));
mem_p = malloc(SIZE + (PAGE_SIZE - 1));
if ((unsigned long) mem_p % PAGE_SIZE) {
mem_p += PAGE_SIZE - ((unsigned long) mem_p % PAGE_SIZE);
}
mem_p = (unsigned char *) mmap(mem_p, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, mem_fd, BASE_ADDRESS);
And this works. However, I'd like to be using memory that no one else will touch. I've tried limiting the amount of memory that the kernel sees by booting with mem=XXXm, and then setting BASE_ADDRESS to something above that (but below the physical memory), but it doesn't seem to be accessing the same memory consistently.
Based on what I've seen online, I suspect I may need a kernel module (which is OK) which uses either ioremap() or remap_pfn_range() (or both???), but I have absolutely no idea how; can anyone help?
EDIT:
What I want is a way to always access the same physical memory (say, 1.5MB worth), and set that memory aside so that the kernel will not allocate it to any other process.
I'm trying to reproduce a system we had in other OSes (with no memory management) whereby I could allocate a space in memory via the linker, and access it using something like
*(unsigned char *)0x12345678
EDIT2:
I guess I should provide some more detail. This memory space will be used for a RAM buffer for a high performance logging solution for an embedded application. In the systems we have, there's nothing that clears or scrambles physical memory during a soft reboot. Thus, if I write a bit to a physical address X, and reboot the system, the same bit will still be set after the reboot. This has been tested on the exact same hardware running VxWorks (this logic also works nicely in Nucleus RTOS and OS20 on different platforms, FWIW). My idea was to try the same thing in Linux by addressing physical memory directly; therefore, it's essential that I get the same addresses each boot.
I should probably clarify that this is for kernel 2.6.12 and newer.
EDIT3:
Here's my code, first for the kernel module, then for the userspace application.
To use it, I boot with mem=95m, then insmod foo-module.ko, then mknod mknod /dev/foo c 32 0, then run foo-user , where it dies. Running under gdb shows that it dies at the assignment, although within gdb, I cannot dereference the address I get from mmap (although printf can)
foo-module.c
#include <linux/module.h>
#include <linux/config.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <asm/io.h>
#define VERSION_STR "1.0.0"
#define FOO_BUFFER_SIZE (1u*1024u*1024u)
#define FOO_BUFFER_OFFSET (95u*1024u*1024u)
#define FOO_MAJOR 32
#define FOO_NAME "foo"
static const char *foo_version = "#(#) foo Support version " VERSION_STR " " __DATE__ " " __TIME__;
static void *pt = NULL;
static int foo_release(struct inode *inode, struct file *file);
static int foo_open(struct inode *inode, struct file *file);
static int foo_mmap(struct file *filp, struct vm_area_struct *vma);
struct file_operations foo_fops = {
.owner = THIS_MODULE,
.llseek = NULL,
.read = NULL,
.write = NULL,
.readdir = NULL,
.poll = NULL,
.ioctl = NULL,
.mmap = foo_mmap,
.open = foo_open,
.flush = NULL,
.release = foo_release,
.fsync = NULL,
.fasync = NULL,
.lock = NULL,
.readv = NULL,
.writev = NULL,
};
static int __init foo_init(void)
{
int i;
printk(KERN_NOTICE "Loading foo support module\n");
printk(KERN_INFO "Version %s\n", foo_version);
printk(KERN_INFO "Preparing device /dev/foo\n");
i = register_chrdev(FOO_MAJOR, FOO_NAME, &foo_fops);
if (i != 0) {
return -EIO;
printk(KERN_ERR "Device couldn't be registered!");
}
printk(KERN_NOTICE "Device ready.\n");
printk(KERN_NOTICE "Make sure to run mknod /dev/foo c %d 0\n", FOO_MAJOR);
printk(KERN_INFO "Allocating memory\n");
pt = ioremap(FOO_BUFFER_OFFSET, FOO_BUFFER_SIZE);
if (pt == NULL) {
printk(KERN_ERR "Unable to remap memory\n");
return 1;
}
printk(KERN_INFO "ioremap returned %p\n", pt);
return 0;
}
static void __exit foo_exit(void)
{
printk(KERN_NOTICE "Unloading foo support module\n");
unregister_chrdev(FOO_MAJOR, FOO_NAME);
if (pt != NULL) {
printk(KERN_INFO "Unmapping memory at %p\n", pt);
iounmap(pt);
} else {
printk(KERN_WARNING "No memory to unmap!\n");
}
return;
}
static int foo_open(struct inode *inode, struct file *file)
{
printk("foo_open\n");
return 0;
}
static int foo_release(struct inode *inode, struct file *file)
{
printk("foo_release\n");
return 0;
}
static int foo_mmap(struct file *filp, struct vm_area_struct *vma)
{
int ret;
if (pt == NULL) {
printk(KERN_ERR "Memory not mapped!\n");
return -EAGAIN;
}
if ((vma->vm_end - vma->vm_start) != FOO_BUFFER_SIZE) {
printk(KERN_ERR "Error: sizes don't match (buffer size = %d, requested size = %lu)\n", FOO_BUFFER_SIZE, vma->vm_end - vma->vm_start);
return -EAGAIN;
}
ret = remap_pfn_range(vma, vma->vm_start, (unsigned long) pt, vma->vm_end - vma->vm_start, PAGE_SHARED);
if (ret != 0) {
printk(KERN_ERR "Error in calling remap_pfn_range: returned %d\n", ret);
return -EAGAIN;
}
return 0;
}
module_init(foo_init);
module_exit(foo_exit);
MODULE_AUTHOR("Mike Miller");
MODULE_LICENSE("NONE");
MODULE_VERSION(VERSION_STR);
MODULE_DESCRIPTION("Provides support for foo to access direct memory");
foo-user.c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void)
{
int fd;
char *mptr;
fd = open("/dev/foo", O_RDWR | O_SYNC);
if (fd == -1) {
printf("open error...\n");
return 1;
}
mptr = mmap(0, 1 * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 4096);
printf("On start, mptr points to 0x%lX.\n",(unsigned long) mptr);
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr, *mptr);
mptr[0] = 'a';
mptr[1] = 'b';
printf("mptr points to 0x%lX. *mptr = 0x%X\n", (unsigned long) mptr, *mptr);
close(fd);
return 0;
}

I think you can find a lot of documentation about the kmalloc + mmap part.
However, I am not sure that you can kmalloc so much memory in a contiguous way, and have it always at the same place. Sure, if everything is always the same, then you might get a constant address. However, each time you change the kernel code, you will get a different address, so I would not go with the kmalloc solution.
I think you should reserve some memory at boot time, ie reserve some physical memory so that is is not touched by the kernel. Then you can ioremap this memory which will give you
a kernel virtual address, and then you can mmap it and write a nice device driver.
This take us back to linux device drivers in PDF format. Have a look at chapter 15, it is describing this technique on page 443
Edit : ioremap and mmap.
I think this might be easier to debug doing things in two step : first get the ioremap
right, and test it using a character device operation, ie read/write. Once you know you can safely have access to the whole ioremapped memory using read / write, then you try to mmap the whole ioremapped range.
And if you get in trouble may be post another question about mmaping
Edit : remap_pfn_range
ioremap returns a virtual_adress, which you must convert to a pfn for remap_pfn_ranges.
Now, I don't understand exactly what a pfn (Page Frame Number) is, but I think you can get one calling
virt_to_phys(pt) >> PAGE_SHIFT
This probably is not the Right Way (tm) to do it, but you should try it
You should also check that FOO_MEM_OFFSET is the physical address of your RAM block. Ie before anything happens with the mmu, your memory is available at 0 in the memory map of your processor.

Sorry to answer but not quite answer, I noticed that you have already edited the question. Please note that SO does not notify us when you edit the question. I'm giving a generic answer here, when you update the question please leave a comment, then I'll edit my answer.
Yes, you're going to need to write a module. What it comes down to is the use of kmalloc() (allocating a region in kernel space) or vmalloc() (allocating a region in userspace).
Exposing the prior is easy, exposing the latter can be a pain in the rear with the kind of interface that you are describing as needed. You noted 1.5 MB is a rough estimate of how much you actually need to reserve, is that iron clad? I.e are you comfortable taking that from kernel space? Can you adequately deal with ENOMEM or EIO from userspace (or even disk sleep)? IOW, what's going into this region?
Also, is concurrency going to be an issue with this? If so, are you going to be using a futex? If the answer to either is 'yes' (especially the latter), its likely that you'll have to bite the bullet and go with vmalloc() (or risk kernel rot from within). Also, if you are even THINKING about an ioctl() interface to the char device (especially for some ad-hoc locking idea), you really want to go with vmalloc().
Also, have you read this? Plus we aren't even touching on what grsec / selinux is going to think of this (if in use).

/dev/mem is okay for simple register peeks and pokes, but once you cross into interrupts and DMA territory, you really should write a kernel-mode driver. What you did for your previous memory-management-less OSes simply doesn't graft well to an General Purpose OS like Linux.
You've already thought about the DMA buffer allocation issue. Now, think about the "DMA done" interrupt from your device. How are you going to install an Interrupt Service Routine?
Besides, /dev/mem is typically locked out for non-root users, so it's not very practical for general use. Sure, you could chmod it, but then you've opened a big security hole in the system.
If you are trying to keep the driver code base similar between the OSes, you should consider refactoring it into separate user & kernel mode layers with an IOCTL-like interface in-between. If you write the user-mode portion as a generic library of C code, it should be easy to port between Linux and other OSes. The OS-specific part is the kernel-mode code. (We use this kind of approach for our drivers.)
It seems like you have already concluded that it's time to write a kernel-driver, so you're on the right track. The only advice I can add is to read these books cover-to-cover.
Linux Device Drivers
Understanding the Linux Kernel
(Keep in mind that these books are circa-2005, so the information is a bit dated.)

I am by far no expert on these matters, so this will be a question to you rather than an answer. Is there any reason you can't just make a small ram disk partition and use it only for your application? Would that not give you guaranteed access to the same chunk of memory? I'm not sure of there would be any I/O performance issues, or additional overhead associated with doing that. This also assumes that you can tell the kernel to partition a specific address range in memory, not sure if that is possible.
I apologize for the newb question, but I found your question interesting, and am curious if ram disk could be used in such a way.

Have you looked at the 'memmap' kernel parameter? On i386 and X64_64, you can use the memmap parameter to define how the kernel will hand very specific blocks of memory (see the Linux kernel parameter documentation). In your case, you'd want to mark memory as 'reserved' so that Linux doesn't touch it at all. Then you can write your code to use that absolute address and size (woe be unto you if you step outside that space).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string