change page order in kernel space - linux

I have a kernel module that works on data that is:
allocated by the kernel
page aligned
mapped in an arbitrary order
I allocate the memory in kernel space with kvmalloc(). For the userspace representation I use vm_insert_page() to create the correctly ordered mapping. But I could not find a method with which I can "insert", "remap" or "reorder" a page mapping within kernel space. Is there a method that does the same as vm_insert_page() for kernel-space mappings?
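For context, a minimal sketch of the userspace-facing side described above, assuming a character-device mmap handler; the names my_mmap, my_buf and my_order are hypothetical, not from the original post:

/* Hypothetical example: my_buf is the kvmalloc()'d, page-aligned buffer;
 * my_order[] chooses which buffer page backs each userspace page, in an
 * arbitrary order. */
static void *my_buf;
static const size_t my_order[] = { 0, 6, 2, 1, 8 };

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long uaddr = vma->vm_start;
    size_t i;

    /* assume userspace mmap()s exactly ARRAY_SIZE(my_order) pages */
    for (i = 0; i < ARRAY_SIZE(my_order); i++) {
        void *kaddr = my_buf + my_order[i] * PAGE_SIZE;
        struct page *page = is_vmalloc_addr(kaddr) ?
                    vmalloc_to_page(kaddr) : virt_to_page(kaddr);
        int ret = vm_insert_page(vma, uaddr, page);

        if (ret)
            return ret;
        uaddr += PAGE_SIZE;
    }
    return 0;
}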

OK, this seems to work:
static int __init test_init_fs(void)
{
    int rv = 0;
    size_t size = 5 * 1024 * 1024; /* 5 MiB */
    void *mem = vzalloc(size);
    struct page **pages = kcalloc(5, sizeof(struct page *), GFP_KERNEL);

    pr_info("alloced\n");
    /* collect the backing pages in an arbitrary order */
    pages[0] = vmalloc_to_page(mem + 0 * PAGE_SIZE);
    pages[1] = vmalloc_to_page(mem + 6 * PAGE_SIZE);
    pages[2] = vmalloc_to_page(mem + 2 * PAGE_SIZE);
    pages[3] = vmalloc_to_page(mem + 1 * PAGE_SIZE);
    pages[4] = vmalloc_to_page(mem + 8 * PAGE_SIZE);
    pr_info("got all pages\n");

    /* build a new, virtually contiguous kernel mapping over those pages */
    void *new_mapping = vmap(pages, 5, VM_MAP, PAGE_KERNEL);
    pr_info("new mapping created\n");

    void *buffer = vzalloc(5 * PAGE_SIZE);
    memcpy(buffer, new_mapping, 5 * PAGE_SIZE);

    vunmap(new_mapping);
    pr_info("unmapped\n");
    vfree(buffer);
    kfree(pages);
    vfree(mem);
    return rv;
}
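(vmap() here is effectively the kernel-space counterpart of vm_insert_page(): it builds a new, virtually contiguous kernel mapping over an arbitrary array of struct page pointers and consumes vmalloc address space of its own, so it must be released with vunmap() before the backing buffer is freed with vfree(), as the example above does.)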

Related

mmap() an NVMe CMB resource file and memcpy() to it: writes still succeed with the SSD unplugged

I mapped the CMB resource file and wrote data to it. When I unplugged the SSD, I expected a SIGBUS to be generated, but the program kept running. Is this normal behavior?
Linux kernel: v5.15.88
OS: CentOS 7.4
CPU: Intel(R) Xeon(R) Silver 4114
My code looks like this:
size_t resource_size = 268435456; /* 256 MiB */
size_t remain = resource_size;
size_t offset = 0;
size_t range = 128;
char tmp;
int fd = open("/sys/bus/pci/devices/0000:5e:00.0/resource4_wc", O_RDWR);
void *map_addr = mmap(NULL, resource_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* buf is the source buffer, allocated elsewhere; the loop restarts from the
 * beginning of the mapping once the whole resource has been written */
while (remain > 0) {
    memcpy(map_addr + offset, buf + offset, range); /* write 128 bytes to the CMB */
    memcpy(&tmp, map_addr + offset, 1);             /* read one byte back */
    remain -= range;
    offset += range;
    if (remain == 0) {
        remain = resource_size;
        offset = 0;
    }
}
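No answer is recorded here, but one way to check whether the kernel ever delivers a SIGBUS for accesses to the dead mapping is to install a handler before the copy loop. A minimal sketch, with hypothetical names (on_sigbus, install_sigbus_handler) of my own:

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: report if the kernel ever raises SIGBUS for an
 * access to the unplugged device's mapping, then exit. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info; (void)ctx;
    static const char msg[] = "got SIGBUS\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1); /* async-signal-safe */
    _exit(EXIT_FAILURE);
}

/* call once before the memcpy() loop */
static void install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);
}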

Why did my Linux NOT do lazy memory allocation?

I'm practising using the lazy allocation and demand paging policies of Linux.
I want a buffer allocated with mmap() to occupy NO physical memory until I actually write something to it.
Furthermore, I want it to grow gradually (use more physical memory) in steps of the Linux page size (e.g. 4 KiB) as I write continuously from its head to its tail.
According to some docs and searches, it should NOT grow if there is only read access to it, but what I observed in an experiment is NOT like this.
To test this, I wrote the following program and watched the memory status with the top command while it was running.
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <sys/mman.h>

using namespace std::chrono_literals;

constexpr size_t BUF_SIZE = 1024 * 1024 * 1024;

int main( int argc, char** argv ) {
    auto shm_pt = mmap( NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0 );
    if( shm_pt == MAP_FAILED ) {
        std::cerr << "mmap error:" << shm_pt;
        exit( EXIT_FAILURE );
    };
    bool full_zero = true;
    uint8_t* pc = reinterpret_cast<uint8_t*>( shm_pt );
    constexpr size_t STEP_SIZE = 1024 * 1024;
    // read pass: touch every byte, 1 MiB per step
    for( size_t j = 0; j < BUF_SIZE / STEP_SIZE; ++j ) {
        std::this_thread::sleep_for( 100ms );
        size_t base = j * STEP_SIZE;
        std::cerr << "Reading from " << base / 1024 / 1024 << "M..." << std::endl;
        for( size_t i = 0; i < STEP_SIZE; ++i )
            full_zero = full_zero && pc[ base + i ] == 0;
    }
    if( !full_zero )
        std::cerr << "The buffer has not been initialized with full zeros!";
    // write pass: fill every byte, 1 MiB per step
    for( size_t j = 0; j < BUF_SIZE / STEP_SIZE; ++j ) {
        std::this_thread::sleep_for( 100ms );
        size_t base = j * STEP_SIZE;
        std::cerr << "Writing to " << base / 1024 / 1024 << "M..." << std::endl;
        for( size_t i = 0; i < STEP_SIZE; ++i )
            pc[ base + i ] = 'c';
    }
    munmap( shm_pt, BUF_SIZE );
    return EXIT_SUCCESS;
};
What I observed is that the physical memory used by my app grows gradually during the reading pass, not during the writing pass!
Perhaps my understanding is wrong?
I got it!
In the material I had found, that person passed the MAP_PRIVATE flag to mmap(), while I used MAP_SHARED.
It looks like a private anonymous mapping satisfies read faults from the shared zero page, so reading allocates nothing, whereas a MAP_SHARED | MAP_ANONYMOUS mapping is backed by shmem, so even a READ access results in real memory allocation!
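For illustration, here is a small C sketch of my own (not from the original post) that touches both kinds of anonymous mapping read-only and prints the process RSS from /proc/self/statm after each pass; per the explanation above, the MAP_PRIVATE pass should leave RSS nearly unchanged while the MAP_SHARED pass should grow it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define BUF_SIZE (256UL * 1024 * 1024)

/* print resident set size in pages, read from /proc/self/statm */
static void print_rss(const char *label)
{
    unsigned long size, rss;
    FILE *f = fopen("/proc/self/statm", "r");

    if (f && fscanf(f, "%lu %lu", &size, &rss) == 2)
        printf("%s: RSS = %lu pages\n", label, rss);
    if (f)
        fclose(f);
}

static void read_all(const volatile unsigned char *p)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < BUF_SIZE; i += 4096)
        sum += p[i];               /* touch one byte per page, read-only */
    (void)sum;
}

int main(void)
{
    unsigned char *priv = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned char *shar = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (priv == MAP_FAILED || shar == MAP_FAILED)
        return EXIT_FAILURE;

    print_rss("baseline");
    read_all(priv);
    print_rss("after reading MAP_PRIVATE"); /* zero page reused, RSS ~unchanged */
    read_all(shar);
    print_rss("after reading MAP_SHARED");  /* shmem pages allocated, RSS grows */
    return EXIT_SUCCESS;
}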

Adding a field to msgbuf

I got an assignment to create a message queue in Linux. I need to use the msgsnd() and msgrcv() functions. Everything works if my message structure has two fields, mtype and mtext[], but I need to add one more field, an int mpid. But when I read the value of mpid it is just garbage from memory. I searched for answers and examples but only found structures with two fields. Can I even add more?
struct myBuff{
    long mtype;
    char mtext[255];
    int mpid;
};
code for the sender
void add_message(int id, struct myBuff buff){
    int size = strlen(buff.mtext) + 1 + sizeof(int);
    if (size > 255 + sizeof(int))
        exit(EXIT_FAILURE);
    msgsnd(id, (struct msgbuf*)&buff, size, 0 | MSG_NOERROR);
}
code for the receiver
void check_message(int id, struct myBuff* buff)
{
    msgrcv(id, (struct msgbuf*)buff, 255 + sizeof(int), buff->mtype, 0 | MSG_NOERROR);
}
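No answer is recorded here, but a likely cause (my reading of the code, not from the original thread) is the size passed to msgsnd(): mpid sits after the full 255-byte mtext array in the message payload, so sending only strlen(buff.mtext) + 1 + sizeof(int) bytes never copies the bytes at mpid's offset, and the receiver sees whatever garbage was in its buffer. A minimal sketch that simply sends and receives the whole payload:

#include <sys/msg.h>

struct myBuff {
    long mtype;
    char mtext[255];
    int mpid;
};

/* payload = everything after mtype, including mpid and any padding */
#define MYBUFF_PAYLOAD (sizeof(struct myBuff) - sizeof(long))

void add_message(int id, struct myBuff buff)
{
    msgsnd(id, &buff, MYBUFF_PAYLOAD, 0);
}

void check_message(int id, struct myBuff *buff)
{
    msgrcv(id, buff, MYBUFF_PAYLOAD, buff->mtype, MSG_NOERROR);
}

An alternative would be to declare mpid before mtext, so that a variable-length send of sizeof(int) + strlen(mtext) + 1 bytes still covers it.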

How do limits on shared memory work on Linux?

I was looking into the Linux kernel limits on shared memory.
/proc/sys/kernel/shmall
specifies the maximum number of pages that can be allocated. Calling this number x and the page size p, I assume that x * p bytes is the limit on system-wide shared memory.
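(For concreteness, a tiny sketch of my own that prints that x * p computation for the running system:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long shmall;
    FILE *f = fopen("/proc/sys/kernel/shmall", "r");

    if (f && fscanf(f, "%lu", &shmall) == 1) {
        unsigned long p = (unsigned long)sysconf(_SC_PAGESIZE);
        printf("shmall: %lu pages x %lu bytes/page", shmall, p);
        /* the default shmall is enormous, so guard the multiplication */
        if (shmall <= (unsigned long)-1 / p)
            printf(" = %lu bytes", shmall * p);
        printf("\n");
    }
    if (f)
        fclose(f);
    return 0;
}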
Now I wrote a small program to create a shared memory segment, and I attached to that segment twice, as below:
shm_id = shmget(IPC_PRIVATE, 4 * sizeof(int), IPC_CREAT | 0666);
if (shm_id < 0) {
    printf("shmget error\n");
    exit(1);
}
printf("\n The shared memory created is %d", shm_id);
ptr = shmat(shm_id, NULL, 0);
ptr_info = shmat(shm_id, NULL, 0);
In the above program ptr and ptr_info were different. So the shared memory is mapped to 2 virtual addresses in my process address space.
When I do an ipcs it looks like this
...
0x00000000 1638416 sun 666 16000000 2
...
Now coming to the shmall limit (x * p) noted above in my question: does this limit apply to the sum of all the virtual memory mapped for every shared memory segment, or does it apply to physical memory?
There is only one piece of physical memory here (the shared memory segment), and from the program above, when I do two shmat()s, twice the amount of memory is mapped into my process address space. So will this limit be hit soon if I keep doing shmat()s on a single shared memory segment?
The limit only applies to physical memory, that is, to the real shared memory allocated for all segments, because shmat() just maps an already allocated segment into the process address space.
You can trace it in the kernel: there is only one place where this limit is checked, in the newseg() function that allocates new segments (the ns->shm_ctlall comparison). The shmat() implementation is busy with a lot of other things, but doesn't care at all about the shmall limit, so you can map one segment as many times as you want (well, the address space is also limited, but in practice you rarely hit that limit).
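For reference, the check in newseg() (ipc/shm.c) looks roughly like this; this is paraphrased from memory and the exact form may differ between kernel versions:

/* in newseg(): the only place the system-wide shmall limit is enforced */
if (ns->shm_tot + numpages < ns->shm_tot ||
    ns->shm_tot + numpages > ns->shm_ctlall)
    return -ENOSPC;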
You can also run a test from userspace with a simple program like this one:
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

unsigned long int get_shmall() {
    FILE *f = NULL;
    char buf[512];
    unsigned long int value = 0;

    if ((f = fopen("/proc/sys/kernel/shmall", "r")) != NULL) {
        if (fgets(buf, sizeof(buf), f) != NULL)
            value = strtoul(buf, NULL, 10); // no proper checks
        fclose(f); // no return value check
    }
    return value;
}

int set_shmall(unsigned long int value) {
    FILE *f = NULL;
    char buf[512];
    int retval = 0;

    if ((f = fopen("/proc/sys/kernel/shmall", "w")) != NULL) {
        if (snprintf(buf, sizeof(buf), "%lu\n", value) >= sizeof(buf) ||
            fwrite(buf, 1, strlen(buf), f) != strlen(buf))
            retval = -1;
        fclose(f); // fingers crossed
    } else
        retval = -1;
    return retval;
}

int main()
{
    int shm_id1 = -1, shm_id2 = -1;
    unsigned long int shmall = 0, shmused, newshmall;
    void *ptr1, *ptr2;
    struct shm_info shminf;

    if ((shmall = get_shmall()) == 0) {
        printf("can't get shmall\n");
        goto out;
    }
    printf("original shmall: %lu pages\n", shmall);
    if (shmctl(0, SHM_INFO, (struct shmid_ds *)&shminf) < 0) {
        printf("can't get SHM_INFO\n");
        goto out;
    }
    shmused = shminf.shm_tot * getpagesize();
    printf("shmused: %lu pages (%lu bytes)\n", shminf.shm_tot, shmused);
    newshmall = shminf.shm_tot + 1;
    if (set_shmall(newshmall) != 0) {
        printf("can't set shmall\n");
        goto out;
    }
    if (get_shmall() != newshmall) {
        printf("something went wrong with shmall setting\n");
        goto out;
    }
    printf("new shmall: %lu pages (%lu bytes)\n", newshmall, newshmall * getpagesize());
    printf("shmget() for %u bytes: ", (unsigned int) getpagesize());
    shm_id1 = shmget(IPC_PRIVATE, (size_t)getpagesize(), IPC_CREAT | 0666);
    if (shm_id1 < 0) {
        printf("failed: %s\n", strerror(errno));
        goto out;
    }
    printf("ok\nshmat 1: ");
    ptr1 = shmat(shm_id1, NULL, 0);
    if (ptr1 == (void *) -1) { /* shmat() returns (void *) -1 on error */
        printf("failed\n");
        goto out;
    }
    printf("ok\nshmat 2: ");
    ptr2 = shmat(shm_id1, NULL, 0);
    if (ptr2 == (void *) -1) {
        printf("failed\n");
        goto out;
    }
    printf("ok\n");
    if (ptr1 == ptr2) {
        printf("ptr1 and ptr2 are the same with shm_id1\n");
        goto out;
    }
    printf("shmget() for %u bytes: ", (unsigned int) getpagesize());
    shm_id2 = shmget(IPC_PRIVATE, (size_t)getpagesize(), IPC_CREAT | 0666);
    if (shm_id2 < 0)
        printf("failed: %s\n", strerror(errno));
    else
        printf("ok, although it's wrong\n");
out:
    if (shmall != 0 && set_shmall(shmall) != 0)
        printf("failed to restore shmall\n");
    if (shm_id1 >= 0 && shmctl(shm_id1, IPC_RMID, NULL) < 0)
        printf("failed to remove shm_id1\n");
    if (shm_id2 >= 0 && shmctl(shm_id2, IPC_RMID, NULL) < 0)
        printf("failed to remove shm_id2\n");
    return 0;
}
What it does is set the shmall limit to just one page above what is currently used by the system, then get a new page-sized segment and map it twice (all successfully), then try to get one more page-sized segment, which fails (run the program as superuser because it writes to /proc/sys/kernel/shmall):
$ sudo ./a.out
original shmall: 18446744073708503040 pages
shmused: 21053 pages (86233088 bytes)
new shmall: 21054 pages (86237184 bytes)
shmget() for 4096 bytes: ok
shmat 1: ok
shmat 2: ok
shmget() for 4096 bytes: failed: No space left on device
I did not find any physical memory allocation in the do_shmat() function (linux/ipc/shm.c):
https://github.com/torvalds/linux/blob/5469dc270cd44c451590d40c031e6a71c1f637e8/ipc/shm.c
So shmat() consumes only virtual memory (your process address space);
the main work of shmat() is essentially an mmap().

Linux file descriptor table and vmalloc

I see that the Linux kernel uses vmalloc to allocate memory for the fdtable when it is bigger than a certain threshold. I would like to know when this happens and to get some clearer information.
static void *alloc_fdmem(size_t size)
{
    /*
     * Very large allocations can stress page reclaim, so fall back to
     * vmalloc() if the allocation size will be considered "large" by the VM.
     */
    if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
        void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
        if (data != NULL)
            return data;
    }
    return vmalloc(size);
}
alloc_fdmem() is called from alloc_fdtable(), and that function is in turn called from expand_fdtable().
I wrote this code to print the threshold:
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define PAGE_SIZE 4096

int main(){
    printf("\t%d\n", PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER);
}
Output
./printo
32768
So, how many open files does it take for the kernel to switch to vmalloc when allocating the fdtable?
So PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER is 32768
This is called like:
data = alloc_fdmem(nr * sizeof(struct file *));
i.e. it's used to store struct file pointers.
If your pointers are 4 bytes, the switch happens once you need more than 32768/4 = 8192 open files; if your pointers are 8 bytes, it happens above 32768/8 = 4096 open files.
