mmap() an NVMe CMB resource file and memcpy() to it: writes still succeed after the SSD is unplugged - linux

I mapped the CMB resource file and wrote data to it. When I unplugged the SSD, I expected a SIGBUS to be generated, but the program kept running. Is this normal behavior?
Linux kernel: v5.15.88
OS: CentOS 7.4
CPU: Intel(R) Xeon(R) Silver 4114
My code looks like this:
size_t resource_size = 268435456;
size_t remain = resource_size;
size_t offset = 0;
size_t range = 128;
char tmp;
int fd = open("/sys/bus/pci/devices/0000:5e:00.0/resource4_wc", O_RDWR);
void * map_addr = mmap(NULL, resource_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
while (remain > 0) {
memcpy(map_addr + offset, buf + offset, range);
memcpy(&tmp, map_addr + offset, 0);
remain -= range;
offset += range;
if (remain == 0) {
remain = resource_size;
offset = 0;
}
}
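For comparison, here is a minimal sketch of how a program might notice the removal without relying on SIGBUS. It rests on two assumptions that are not confirmed for this platform: stores to the mapped BAR are posted writes, so a store to a missing device can be dropped without any error reaching the CPU, and loads from a removed PCIe device commonly complete with all-ones data. Note also that the `memcpy(&tmp, map_addr + offset, 0)` in the loop above copies zero bytes, so it never actually reads from the device. The helper name `window_reads_all_ones` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if each of the first `words` 32-bit
 * words of the window reads back as all ones. That pattern is what a
 * read from a surprise-removed PCIe device commonly returns (assumption),
 * but valid data can look the same, so treat the result as a hint only.
 */
static bool window_reads_all_ones(const volatile uint32_t *window, size_t words)
{
    if (words == 0)
        return false;              /* nothing read, nothing to conclude */
    for (size_t i = 0; i < words; i++) {
        if (window[i] != UINT32_MAX)
            return false;          /* some real data came back */
    }
    return true;
}
```

In the loop above this could be called on `map_addr` every so often, after first writing a pattern that is known not to be all ones.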

Related

change page order in kernel space

I have a kernel module that works on data that is:
allocated by the kernel
page aligned
the data "mapping" is arbitrary
I allocate the memory in kernel space with kvmalloc(). For the userspace representation I use vm_insert_page() to create the correctly ordered mapping. But I could not find a way to "insert", "remap" or "reorder" page mappings within kernel space. Is there something that does the same as vm_insert_page() for kernel-space mappings?
OK, this seems to work:
static int __init test_init_fs(void)
{
int rv = 0;
size_t size = 5*1024*1024; /* 5 MiB*/
void* mem = vzalloc(size);
struct page **pages = kcalloc(5, sizeof(struct page *), GFP_KERNEL);
pr_info("alloced\n");
pages[0] = vmalloc_to_page(mem + 0 * PAGE_SIZE);
pages[1] = vmalloc_to_page(mem + 6 * PAGE_SIZE);
pages[2] = vmalloc_to_page(mem + 2 * PAGE_SIZE);
pages[3] = vmalloc_to_page(mem + 1 * PAGE_SIZE);
pages[4] = vmalloc_to_page(mem + 8 * PAGE_SIZE);
pr_info("got all pages\n");
void* new_mapping = vmap(pages,5, VM_MAP, PAGE_KERNEL);
pr_info("new mapping created\n");
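void* buffer = vzalloc(5*PAGE_SIZE);
memcpy(buffer,new_mapping,5*PAGE_SIZE);
vunmap(new_mapping);
pr_info("unmapped\n");
vfree(buffer); /* release the copy as well, otherwise it leaks */
kfree(pages); /* the page array is not needed once the mapping is gone */
vfree(mem);
return rv;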
}

observe physical address content

I compiled and ran pagemap_dump.c from here:
/proc/[pid]/pagemaps and /proc/[pid]/maps | linux
Several other applications attach an existing shared memory segment, and every time I run pagemap_dump the shared memory's physical address is the same (the virtual address differs for each process):
./pagemap_dump.exe 2135 <== 2135 is pid
./pagemap_dump.exe 5864 <== 5864 is pid
./pagemap_dump.exe 5110 <== 5110 is pid
the output :
7fabb37e1000 54edde 0 1 0 1 /dev/shm/memtest.shared
7f0040fa1000 54edde 0 1 0 1 /dev/shm/memtest.shared
7ffbd2869000 54edde 0 1 0 1 /dev/shm/memtest.shared
These 3 applications attach memtest.shared, and the physical address reported for all of them is 0x54edde on my Linux system.
I then tried to read the contents at this physical address through /dev/mem, using the code from:
How to access physical addresses from user space in Linux?
int main(int argc, char** argv)
{
if (argc < 3) {
printf("Usage: %s <phys_addr> <len>\n", argv[0]);
return 0;
}
off_t offset = strtoul(argv[1], NULL, 0);
size_t len = strtoul(argv[2], NULL, 0);
// Truncate offset to a multiple of the page size, or mmap will fail.
size_t pagesize = sysconf(_SC_PAGE_SIZE);
off_t page_base = (offset / pagesize) * pagesize;
off_t page_offset = offset - page_base;
int fd = open("/dev/mem", O_RDONLY|O_SYNC);
unsigned char *mem = (unsigned char *) mmap(NULL, page_offset + len, PROT_READ , MAP_SHARED, fd, page_base);
if (mem == MAP_FAILED) {
perror("Can't map memory");
return -1;
}
size_t i;
for (i = 0; i < len; ++i)
printf("%02x ", (int)mem[page_offset + i]);
return 0;
} // main
chown root testz1.exe
chmod 4775 testz1.exe
./testz1.exe 5565918 16 <=== 0x54edde = 5565918
I got: Can't map memory: Operation not permitted
./testz1.exe 54edde 16
This printed something, but not the "hello world" that I strcpy'd into /dev/shm/memtest.shared when I created it.
Since I seem to have the physical address of /dev/shm/memtest.shared, what is the right way to read the contents at this physical address from /dev/mem?
Edit:
After downloading devmem2.c and running:
./devmem2.exe 0x54edde w
I get the following message:
/dev/mem opened.
Error at line 75, file devmem2.c (1) [Operation not permitted]
It looks like 0x54edde is not the physical address I expected, but why? pagemap_dump.c shows 0x54edde as the physical address of /dev/shm/memtest.shared for every process that attaches it, so why can't devmem2.c access it?
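One thing worth checking, sketched below: in the kernel's pagemap format, bits 0-54 of an entry hold the page frame number (PFN), not a byte address. If the second column printed by pagemap_dump is that PFN (an assumption about that tool's output), the byte address would be the PFN multiplied by the page size. The 4096-byte page size and the function names here are assumptions for illustration. Separately, many kernels restrict /dev/mem access to system RAM (CONFIG_STRICT_DEVMEM), which may be what produces "Operation not permitted" regardless of the address; whether that applies to this system is not known.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of one 64-bit /proc/[pid]/pagemap entry. */
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)   /* bits 0-54: PFN if present */
#define PM_PRESENT  (UINT64_C(1) << 63)         /* bit 63: page is in RAM */

/* True if the entry describes a page that is currently resident in RAM. */
static bool pagemap_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

/* Extract the page frame number from a raw pagemap entry. */
static uint64_t pagemap_pfn(uint64_t entry)
{
    return entry & PM_PFN_MASK;
}

/* Convert a PFN plus an offset inside the page to a physical byte address. */
static uint64_t pfn_to_phys(uint64_t pfn, uint64_t page_size, uint64_t offset_in_page)
{
    assert(page_size != 0 && offset_in_page < page_size);
    return pfn * page_size + offset_in_page;
}
```

Under these assumptions, PFN 0x54edde with 4096-byte pages corresponds to byte address 0x54edde000, which is the value that would be passed to a /dev/mem reader instead of 0x54edde.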

Registering Mapped Linux Character Device Memory with cudaHostRegister Results in Invalid Argument

I'm trying to speed up DMA<->CPU<->GPU data transfer by:
1. Mapping memory allocated in the Linux kernel by my (proprietary) device driver to user space
2. Registering the latter (the mapped memory) with CUDA through the cudaHostRegister API function.
Mapping memory allocated in user space to my device's DMA and then registering it with cudaHostRegister works just fine, but trying to register "kmalloced" memory makes cudaHostRegister return an "Invalid Argument" error.
At first I thought the problem was alignment or my device driver's complicated memory pool management, so I wrote a minimal character device that implements .mmap(), in which a kzalloced 10 KB buffer is remapped with remap_pfn_range. The problem still occurs.
Unfortunately, I did not find any similar questions on the Net, so I sincerely hope I'll find an answer here.
Some system info, the kernel driver and user space app code, and runtime log info:
CUDA : 8.0
OS Dist : Ubuntu 14.04
Kernel : 3.16.0-31-generic
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 770 Off | 0000:83:00.0 N/A | N/A |
| 26% 32C P8 N/A / N/A | 79MiB / 1997MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
Character device mmap() code:
#define MEM_CHUNK_SIZE (4 * _K) /* parenthesized so the macros expand safely inside larger expressions */
#define MEM_POOL_SIZE (10 * _K)
/**/
static int chdv_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned int pages_per_buf = ( MEM_CHUNK_SIZE >> PAGE_SHIFT ) ;
unsigned long pfn, vsize;
/*make sure the buffer is allocated*/
if((NULL == g_membuff) &&
(NULL == (g_membuff = kzalloc(MEM_POOL_SIZE , GFP_KERNEL))))
{
kdbgprintln("Error: Not enough memory");
return -ENOMEM;
}
vsize = vma->vm_end - vma->vm_start ;
kdbgprintln("MEM_CHUNK_SIZE %u, pages_per_buf %u, vsize %lu vma->vm_pgoff %lu",
MEM_CHUNK_SIZE,
pages_per_buf,
vsize,
vma->vm_pgoff);
if(vsize > MEM_POOL_SIZE)
{
kdbgprintln("Error: vsize %lu > MEM_POOL_SIZE %u", vsize, MEM_POOL_SIZE);
return -EINVAL;
}
/* We allow only mapping of one whole buffer so offset must be multiple
* of pages_per_buf and size must be equal to dma_buf_size.
*/
if( vma->vm_pgoff % pages_per_buf )
{
kdbgprintln("Error:Mapping DMA buffers is allowed only from beginning");
return -EINVAL ;
}
vma->vm_flags = vma->vm_flags | (VM_DONTEXPAND | VM_LOCKED | VM_IO);
/*Get the PFN for remap*/
pfn = page_to_pfn(virt_to_page((unsigned char *)g_membuff));
kdbgprintln("PFN : %lu", pfn);
if(remap_pfn_range(vma, vma->vm_start, pfn, vsize, vma->vm_page_prot))
{
kdbgprintln("Error:Failed to remap memory");
return -EINVAL;
}
/*Sealing data header & footer*/
*((unsigned long *)g_membuff) = 0xCDFFFFFFFFFFFFAB;
*((unsigned long *)g_membuff + 1) = 0xAB000000000000EF;
*(unsigned long *)((unsigned char *)g_membuff + vsize - sizeof(unsigned long)) = 0xEF0000000C0000AA;
kdbgprintln("Mapped 'kzalloced' buffer" \
"\n\t\tFirst 8 bytes: %lX" \
"\n\t\tSecond 8 bytes: %lX" \
"\n\t\tLast 8 bytes: %lX",
*((unsigned long *)g_membuff),
*((unsigned long *)g_membuff + 1),
*(unsigned long *)((unsigned char *)g_membuff + vsize - sizeof(unsigned long)));
return 0;
}
Test Application code:
static unsigned long map_mem_size;
int main(int argc, char** argv)
{
int fd;
const char dev_name[] = "/dev/chardev";
void * address = NULL;
long page_off = 0;
cudaError_t cudarc;
switch(argc)
{
case 2:
page_off = atoi(argv[1]) * getpagesize();
break;
default:
page_off = 0;
break;
}
map_mem_size = 2 * getpagesize();
printf("Opening %s file\n", dev_name);
errno = 0;
if(0 > (fd = open(dev_name, O_RDWR) ))
{
printf("Error %d - %s\n", errno, strerror(errno));
}
else
{
printf("About to map %lu bytes of %s device memory\n", map_mem_size, dev_name);
errno = 0;
if(MAP_FAILED == (address = mmap(NULL, map_mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, page_off)))
{
printf("Error %d - %s\n", errno, strerror(errno));
}
else
{
printf("mapped %s driver 'kmalloc' memory" \
"\n\t\tFirst 8 bytes : %lX" \
"\n\t\tSecond 8 bytes: %lX" \
"\n\t\tLast 8 bytes: %lX\n",
dev_name,
*((unsigned long *)address),
*((unsigned long *)address + 1),
*(unsigned long *)((unsigned char *)address + map_mem_size - sizeof(unsigned long)));
if (cudaSuccess != (cudarc = cudaHostRegister(address, map_mem_size, cudaHostRegisterDefault)))
{
printf("Error: Failed cudaHostRegister: %s, address %p\n", cudaGetErrorString(cudarc), address);
}
}
}
/*Release resources block*/
return EXIT_SUCCESS;
}
Run time debug information:
User space:
./chrdev_test
Opening /dev/chardev file
About to map 8192 bytes of /dev/chardev device memory
mapped /dev/chardev driver 'kmalloc' memory
First 8 bytes : CDFFFFFFFFFFFFAB
Second 8 bytes: AB000000000000EF
Last 8 bytes: EF0000000C0000AA
Error: Failed cudaHostRegister: invalid argument
Unmapping /dev/chardev file
Closing /dev/chardev file
Kernel space (tail -f /var/log/syslog):
[ 4814.119537] [chardev] chardev.c, chdv_mmap, line 292:MEM_CHUNK_SIZE 4096, pages_per_buf 1, vsize 8192 vma->vm_pgoff 0
[ 4814.119538] [chardev] chardev.c, chdv_mmap, line 311:PFN : 16306184
[ 4814.119543] [chardev] chardev.c, chdv_mmap, line 330:Mapped 'kzalloced' buffer
[ 4814.119543] First 8 bytes: CDFFFFFFFFFFFFAB
[ 4814.119543] Second 8 bytes: AB000000000000EF
[ 4814.119543] Last 8 bytes: EF0000000C0000AA
Thanks in advance.
Made it work!
The full answer may be found in:
https://devtalk.nvidia.com/default/topic/1014391/cuda-programming-and-performance/registering-mapped-linux-character-device-memory-with-cudahostregister-results-in-invalid-argument/?offset=3#5174771
There is a problem with memory chunks longer than 2 pages (> 8K) when working with CUDA.
Thanks,
Yoel.

How do limits on shared memory work on Linux

I was looking into the Linux kernel limits on shared memory.
/proc/sys/kernel/shmall
specifies the maximum number of pages that can be allocated. Taking this number as x and the page size as p, I assume that "x * p" bytes is the system-wide limit on shared memory.
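As a quick illustration of the "x * p" arithmetic, here is a minimal sketch. The function name `shmall_limit_bytes` is hypothetical, and the saturation on overflow is a choice made for this example, since very large shmall values multiplied by the page size do not fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: byte limit implied by shmall (in pages) and the
 * page size (in bytes). Saturates at UINT64_MAX instead of wrapping
 * around when the product would overflow 64 bits.
 */
static uint64_t shmall_limit_bytes(uint64_t shmall_pages, uint64_t page_size)
{
    assert(page_size != 0);
    if (shmall_pages > UINT64_MAX / page_size)
        return UINT64_MAX;         /* effectively "no limit" */
    return shmall_pages * page_size;
}
```

For example, with 4096-byte pages a shmall of 21054 pages corresponds to 86237184 bytes. In a real program the inputs would come from reading /proc/sys/kernel/shmall and calling getpagesize().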
I then wrote a small program that creates a shared memory segment and attaches to it twice, as below:
shm_id = shmget(IPC_PRIVATE, 4*sizeof(int), IPC_CREAT | 0666);
if (shm_id < 0) {
printf("shmget error\n");
exit(1);
}
printf("\n The shared memory created is %d",shm_id);
ptr = shmat(shm_id,NULL,0);
ptr_info = shmat(shm_id,NULL,0);
In the above program ptr and ptr_info were different, so the shared memory is mapped at 2 virtual addresses in my process address space.
When I run ipcs, it looks like this:
...
0x00000000 1638416 sun 666 16000000 2
...
Now, back to the shmall limit x * p noted above. Does this limit apply to the sum of all the virtual memory allocated for every shared memory segment, or does it apply to physical memory?
There is only one piece of physical memory here (the shared memory segment), yet after the 2 shmat calls in the program above, twice that amount is mapped in my process address space. So will this limit be hit quickly if I keep calling shmat on a single shared memory segment?
The limit applies only to physical memory, that is, the real shared memory allocated for all segments, because shmat() just maps an already allocated segment into the process address space.
You can trace it in the kernel: there is only one place where this limit is checked, in the newseg() function that allocates new segments (the ns->shm_ctlall comparison). The shmat() implementation is busy with a lot of stuff but doesn't care about the shmall limit at all, so you can map one segment as many times as you want (well, address space is also limited, but in practice you rarely hit that limit).
You can also try a test from userspace with a simple program like this one:
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>
unsigned long int get_shmall() {
FILE *f = NULL;
char buf[512];
unsigned long int value = 0;
if ((f = fopen("/proc/sys/kernel/shmall", "r")) != NULL) {
if (fgets(buf, sizeof(buf), f) != NULL)
value = strtoul(buf, NULL, 10); // no proper checks
fclose(f); // no return value check
}
return value;
}
int set_shmall(unsigned long int value) {
FILE *f = NULL;
char buf[512];
int retval = 0;
if ((f = fopen("/proc/sys/kernel/shmall", "w")) != NULL) {
if (snprintf(buf, sizeof(buf), "%lu\n", value) >= sizeof(buf) ||
fwrite(buf, 1, strlen(buf), f) != strlen(buf))
retval = -1;
fclose(f); // fingers crossed
} else
retval = -1;
return retval;
}
int main()
{
int shm_id1 = -1, shm_id2 = -1;
unsigned long int shmall = 0, shmused, newshmall;
void *ptr1, *ptr2;
struct shm_info shminf;
if ((shmall = get_shmall()) == 0) {
printf("can't get shmall\n");
goto out;
}
printf("original shmall: %lu pages\n", shmall);
if (shmctl(0, SHM_INFO, (struct shmid_ds *)&shminf) < 0) {
printf("can't get SHM_INFO\n");
goto out;
}
shmused = shminf.shm_tot * getpagesize();
printf("shmused: %lu pages (%lu bytes)\n", shminf.shm_tot, shmused);
newshmall = shminf.shm_tot + 1;
if (set_shmall(newshmall) != 0) {
printf("can't set shmall\n");
goto out;
}
if (get_shmall() != newshmall) {
printf("something went wrong with shmall setting\n");
goto out;
}
printf("new shmall: %lu pages (%lu bytes)\n", newshmall, newshmall * getpagesize());
printf("shmget() for %u bytes: ", (unsigned int) getpagesize());
shm_id1 = shmget(IPC_PRIVATE, (size_t)getpagesize(), IPC_CREAT | 0666);
if (shm_id1 < 0) {
printf("failed: %s\n", strerror(errno));
goto out;
}
printf("ok\nshmat 1: ");
ptr1 = shmat(shm_id1, NULL, 0);
if (ptr1 == (void *)-1) { /* shmat() returns (void *)-1 on failure, not NULL */
printf("failed\n");
goto out;
}
printf("ok\nshmat 2: ");
ptr2 = shmat(shm_id1, NULL, 0);
if (ptr2 == (void *)-1) {
printf("failed\n");
goto out;
}
printf("ok\n");
if (ptr1 == ptr2) {
printf("ptr1 and ptr2 are the same with shm_id1\n");
goto out;
}
printf("shmget() for %u bytes: ", (unsigned int) getpagesize());
shm_id2 = shmget(IPC_PRIVATE, (size_t)getpagesize(), IPC_CREAT | 0666);
if (shm_id2 < 0)
printf("failed: %s\n", strerror(errno));
else
printf("ok, although it's wrong\n");
out:
if (shmall != 0 && set_shmall(shmall) != 0)
printf("failed to restore shmall\n");
if (shm_id1 >= 0 && shmctl(shm_id1, IPC_RMID, NULL) < 0)
printf("failed to remove shm_id1\n");
if (shm_id2 >= 0 && shmctl(shm_id2, IPC_RMID, NULL) < 0)
printf("failed to remove shm_id2\n");
return 0;
}
What it does: it sets the shmall limit just one page above what the system currently uses, then gets a new page-sized segment and maps it twice (both succeed), then tries to get one more page-sized segment and fails (run the program as superuser, because it writes to /proc/sys/kernel/shmall):
$ sudo ./a.out
original shmall: 18446744073708503040 pages
shmused: 21053 pages (86233088 bytes)
new shmall: 21054 pages (86237184 bytes)
shmget() for 4096 bytes: ok
shmat 1: ok
shmat 2: ok
shmget() for 4096 bytes: failed: No space left on device
I did not find any physical memory allocation in the do_shmat function (linux/ipc/shm.c):
https://github.com/torvalds/linux/blob/5469dc270cd44c451590d40c031e6a71c1f637e8/ipc/shm.c
so shmat consumes only virtual memory (your process address space);
what shmat mainly does is an mmap.

Actual physical RAM used by process

How can I determine actual physical RAM used by some process?
I can look at VmRSS in /proc/PID/status (or the RES column in top). However, this number is incorrect for processes that use multiple mappings backed by the same file. For example, the following piece of code maps several areas onto a small physical memory window.
size_t window_size = ...; // e.g. 128 MiB
size_t total_size = ...; // e.g. 4 TiB
char path[] = "/dev/shm/window-XXXXXX";
int fd = mkstemp(path);
ftruncate(fd, (off_t)window_size);
void *data = mmap(NULL, total_size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
for(ptrdiff_t offset = 0; offset < (ptrdiff_t)total_size; offset += window_size)
{
mmap( (void *)( (uintptr_t)data + offset ), window_size, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED|MAP_NORESERVE, fd, 0);
}
Now, if I look into /proc/PID/status, the kernel reports VmRSS as the sum of all the windows above, even though that number is higher than the total physical memory size.
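One candidate to compare against VmRSS is the Pss (proportional set size) figure in /proc/PID/smaps or /proc/PID/smaps_rollup, which divides each page by the number of times it is mapped. Whether Pss gives the wanted number for this window-mapping pattern is an assumption to verify on the target kernel. The sketch below only shows pulling a field out of smaps-style text; the helper name `smaps_field_kb` is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find a line starting with `key` (for example
 * "Pss:") in smaps-style text and return the number that follows it,
 * in kB. Returns -1 if the key is absent or no number follows it.
 */
static long smaps_field_kb(const char *text, const char *key)
{
    size_t keylen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, keylen) == 0) {
            long kb;
            if (sscanf(line + keylen, "%ld", &kb) == 1)
                return kb;
            return -1;             /* key found but value is malformed */
        }
        line = strchr(line, '\n'); /* advance to the start of the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

In practice the text would be read from /proc/PID/smaps_rollup into a buffer first. Including the trailing colon in the key keeps "Pss:" from matching lines such as "Pss_Anon:".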
