Actual physical RAM used by process - linux

How can I determine actual physical RAM used by some process?
I can look into /proc/PID/status on VmRSS (or into top's RES column). However, this number is incorrect for the processes which use multiple mappings backed by the same file. For example, the following piece of code maps several areas into a small physical memory window.
size_t window_size = ...; // e.g. 128 MiB
size_t total_size = ...; // e.g. 4 TiB
char path[] = "/dev/shm/window-XXXXXX";
int fd = mkstemp(path);
ftruncate(fd, (off_t)window_size)
void *data = mmap(NULL, total_size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
for(ptrdiff_t offset = 0; offset < (ptrdiff_t)total_size; offset += window_size)
{
mmap( (void *)( (uintptr_t)data + offset ), window_size, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED|MAP_NORESERVE, fd, 0);
}
Now, if I look into /proc/PID/status, the kernel reports VmRSS as a sum of all the windows above. Although this number is even higher than the total physical memory size.

Related

mmap() NVMe cmb resource file and memcpy() to it, with the unplugged SSD still could write

I mapped cmb resource file and wrote data to it. When I unplugged the SSD, I thought that a SIGBUS should be generated but the program was still running. So was this a normal behavior?
Linux kernel: v5.15.88
OS: Centos 7.4
CPU: Intel(R) Xeon(R) Silver 4114
My code like this:
size_t resource_size = 268435456;
size_t remain = resource_size;
size_t offset = 0;
size_t range = 128;
char tmp;
int fd = open("/sys/bus/pci/devices/0000:5e:00.0/resource4_wc", O_RDWR);
void * map_addr = mmap(NULL, resource_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
while (remain > 0) {
memcpy(map_addr + offset, buf + offset, range);
memcpy(&tmp, map_addr + offset, 0);
remain -= range;
offset += range;
if (remain == 0) {
remain = resource_size;
offset = 0;
}
}

Linux direct IO latency difference between reading 4GB and 32MB file under pread in NVME

My test will randomly read 4K pages (random 4k page at time in a tight loop) in both 4GB and 32MB files using direct IO (O_DIRECT) and pread on NVME disk. The latency on 4GB file is about 41 microsecond per page and the latency is 79 microsecond for small 32MB file. Is there a rational explanation for such difference?
b_fd = open("large.txt", O_RDONLY | O_DIRECT);
s_fd = open("small.txt", O_RDONLY | O_DIRECT);
std::srand(std::time(nullptr));
void *buf;
int ps=getpagesize();
posix_memalign(&buf, ps, page_size);
long long nano_seconds = 0;
// number of random pages to read
int iter = 256 * 100;
for (int i = 0; i < iter; i++) {
int page_index = std::rand() % big_file_pages;
auto start = std::chrono::steady_clock::now();
pread (b_fd, buf, page_size, page_index * page_size);
auto end = std::chrono::steady_clock::now();
nano_seconds += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
std::cout<<"large file average random 1 page direct IO read in nanoseconds is:"<<nano_seconds/iter<<"\n";```

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for the bus-mastering DMA PCIe device implemented in FPGA.
The buffers should not be reserved at the boot time. Therefore, the buffer may be not contiguous.
The device supports scatter-gather (SG) operation, but for performance reasons, the addresses and lengths of consecutive contiguous segments of the buffer are stored inside the FPGA.
Therefore, usage of standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).
The right solution should allocate the buffer consisting of 2MB hugepages in the user space (reducing the maximum number of segments by factor of 512).
The virtual address of the buffer should be transferred to the kernel driver via ioctl. Then the addresses and the length of the segments should be calculated and written to the FPGA.
In theory, I could use get_user_pages to create the list of the pages, and then call sg_alloc_table_from_pages to obtain the SG list suitable to program the DMA engine in FPGA.
Unfortunately, in this approach I must prepare the intermediate list of page structures with length of 262144 pages per 1GB of the buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but anyway it would be good to avoid it.
In fact I don't need to keep the pages maped for the kernel, as the hugepages are protected against swapping out, and they are mapped for the user space application that will process the received data.
So what I'm looking for is a function sg_alloc_table_from_user_hugepages, that could take such a user-space address of the hugepages-based memory buffer, and transfer it directly into the right scatterlist, without performing unnecessary and memory-consuming mapping for the kernel.
Of course such a function should verify that the buffer indeed consists of hugepages.
I have found and read these posts: (A), (B), but couldn't find a good answer.
Is there any official method to do it in the current Linux kernel?
At the moment I have a very inefficient solution based on get_user_pages_fast:
int sgt_prepare(const char __user *buf, size_t count,
struct sg_table * sgt, struct page *** a_pages,
int * a_n_pages)
{
int res = 0;
int n_pages;
struct page ** pages = NULL;
const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE-1);
//Calculate number of pages
n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
printk(KERN_ALERT "n_pages: %d",n_pages);
//Allocate the table for pages
pages = vzalloc(sizeof(* pages) * n_pages);
printk(KERN_ALERT "pages: %p",pages);
if(pages == NULL) {
res = -ENOMEM;
goto sglm_err1;
}
//Now pin the pages
res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
printk(KERN_ALERT "gupf: %d",res);
if(res < n_pages) {
int i;
for(i=0; i<res; i++)
put_page(pages[i]);
res = -ENOMEM;
goto sglm_err1;
}
//Now create the sg-list
res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
printk(KERN_ALERT "satf: %d",res);
if(res < 0)
goto sglm_err2;
*a_pages = pages;
*a_n_pages = n_pages;
return res;
sglm_err2:
//Here we jump if we know that the pages are pinned
{
int i;
for(i=0; i<n_pages; i++)
put_page(pages[i]);
}
sglm_err1:
if(sgt) sg_free_table(sgt);
if(pages) kfree(pages);
* a_pages = NULL;
* a_n_pages = 0;
return res;
}
void sgt_destroy(struct sg_table * sgt, struct page ** pages, int n_pages)
{
int i;
//Free the sg list
if(sgt->sgl)
sg_free_table(sgt);
//Unpin pages
for(i=0; i < n_pages; i++) {
set_page_dirty(pages[i]);
put_page(pages[i]);
}
}
The sgt_prepare function builds the sg_table sgt structure that i can use to create the DMA mapping. I have verified that it contains the number of entries equal to the number of hugepages used.
Unfortunately, it requires that the list of the pages is created (allocated and returned via the a_pages pointer argument), and kept as long as the buffer is used.
Therefore, I really dislike that solution. Now I have 256 2MB hugepages used as a DMA buffer. It means that I have to create and keeep unnecessary 128*1024 page structures. I also waste 512 MB of kernel address space for unnecessary kernel mapping.
The interesting question is if the a_pages may be kept only temporarily (until the sg-list is created)? In theory it should be possible, as the pages are still locked...

Write from mmapped buffer to `O_DIRECT` output file

I have a device which writes to a video buffer. This buffer is allocated in system memory using CMA and I want to implement streaming write from this buffer to a block device. My application opens video buffer with mmap and I would like to use O_DIRECT write to avoid page cache related overhead. Basically, the pseudo-code of the application looks like this:
f_in = open("/dev/videobuf", O_RDONLY);
f_mmap = mmap(0, BUFFER_SIZE, PROT_READ, MAP_SHARED, f_in, 0);
f_out = open("/dev/sda", O_WRONLY | O_DIRECT);
write(f_out, f_mmap, BLOCK_SIZE);
where BLOCK_SIZE is sector aligned value. f_out is opened without any errors, but write results in EFAULT. I tried to track down this issue and it turned out that mmap implementation in video buffer's driver uses remap_pfn_range(), which sets VM_IO and VM_PFNMAP flags for VMA. The O_DIRECT path in block device drivers checks these flags and returns EFAULT. As far as I understand, O_DIRECT writes need to pin the memory pages, but VMA flags indicate the absence of struct page for underlying memory which causes an error. Am I right here?
And the main question is how to correctly implement O_DIRECT write from mmapped buffer? I have video buffer driver and can modify it appropriately.
I found similar question, but these is no clear answer there.
remap_pfn_range will set your vma as special via pte_mkspecial and add VM_IO/VM_PFNMAP to vma, so you cannot pass the following checks when do Direct I/O.
You say your memory comes from CMA, that's good because cma memory already has struct page support, so you can just use vm_insert_pages as the following steps:
declare cma region from kernel argument or dts
get struct pages from cma:
dma_page = dma_alloc_contiguous(&pdev->dev, size, GFP_KERNEL);
if (!dma_page) {
pr_err("%s %d, dma_alloc_contiguous fail\n", __func__, __LINE__);
return -ENOMEM;
}
nr_pages = DIV_ROUND_UP(size, PAGE_SIZE);
pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
for (i = 0; i < nr_pages; i++)
pages[i] = &dma_page[i];
insert pages to vma when mmap
int your_mmap(struct file *file, struct vm_area_struct *vma) {
int ret = 0;
unsigned long temp_nr_pages;
if (vma->vm_end - vma->vm_start > size)
return -EINVAL;
/* duplicitate nr_pages in that vm_insert_pages can change nr_pages */
temp_nr_pages = nr_pages;
ret = vm_insert_pages(vma, vma->vm_start, pages, &temp_nr_pages);
if (ret < 0)
pr_err("%s vm_insert_pages fail, error is %d\n", __func__, ret);
return ret;
}
export dma_alloc_contiguous(the only mm code change, but not so bad).
modified kernel/dma/contiguous.c
## -332,6 +332,7 ## struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
return cma_alloc_aligned(dma_contiguous_default_area, size, gfp); }
+EXPORT_SYMBOL(dma_alloc_contiguous);
/**
* dma_free_contiguous() - release allocated pages

How does limits on the shared memory work on Linux

I was looking into the Linux kernel limits on the shared memory
/proc/sys/kernel/shmall
specifies the maximum amount of pages that can be allocated. Considering this number as x and the page size as p. I assume that "x * p" bytes is the limit on the system wide shared memory.
Now I wrote a small program to create a shared memory segment and i attached to that shared memory segment twice as below
shm_id = shmget(IPC_PRIVATE, 4*sizeof(int), IPC_CREAT | 0666);
if (shm_id < 0) {
printf("shmget error\n");
exit(1);
}
printf("\n The shared memory created is %d",shm_id);
ptr = shmat(shm_id,NULL,0);
ptr_info = shmat(shm_id,NULL,0);
In the above program ptr and ptr_info were different. So the shared memory is mapped to 2 virtual addresses in my process address space.
When I do an ipcs it looks like this
...
0x00000000 1638416 sun 666 16000000 2
...
Now coming to the shmall limit x * p noted above in my question. Is this limit applicable on the sum of all the virtual memory allocated for every shared memory segment? or does this limit apply on the physical memory?
Physical memory is only one here (shared memory) and from the program above when I do 2 shmat's there is twice the amount of memory allocated in my process address space. So this limit will hit soon if do continuous shmat's on a single shared memory segment?
The limit only applies to physical memory, that is the real shared memory allocated for all segments, because shmat() just maps that allocated segment into process address space.
You can trace it in the kernel, there is only one place where this limit is checked — in the newseg() function that allocates new segments (ns->shm_ctlall comparison). shmat() implementation is busy with a lot of stuff, but doesn't care at all about shmall limit, so you can map one segment as many times as you want to (well, address space is also limited, but in practice you rarely care about this limit).
You can also try some test from userspace with a simple program like this one:
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>
unsigned long int get_shmall() {
FILE *f = NULL;
char buf[512];
unsigned long int value = 0;
if ((f = fopen("/proc/sys/kernel/shmall", "r")) != NULL) {
if (fgets(buf, sizeof(buf), f) != NULL)
value = strtoul(buf, NULL, 10); // no proper checks
fclose(f); // no return value check
}
return value;
}
int set_shmall(unsigned long int value) {
FILE *f = NULL;
char buf[512];
int retval = 0;
if ((f = fopen("/proc/sys/kernel/shmall", "w")) != NULL) {
if (snprintf(buf, sizeof(buf), "%lu\n", value) >= sizeof(buf) ||
fwrite(buf, 1, strlen(buf), f) != strlen(buf))
retval = -1;
fclose(f); // fingers crossed
} else
retval = -1;
return retval;
}
int main()
{
int shm_id1 = -1, shm_id2 = -1;
unsigned long int shmall = 0, shmused, newshmall;
void *ptr1, *ptr2;
struct shm_info shminf;
if ((shmall = get_shmall()) == 0) {
printf("can't get shmall\n");
goto out;
}
printf("original shmall: %lu pages\n", shmall);
if (shmctl(0, SHM_INFO, (struct shmid_ds *)&shminf) < 0) {
printf("can't get SHM_INFO\n");
goto out;
}
shmused = shminf.shm_tot * getpagesize();
printf("shmused: %lu pages (%lu bytes)\n", shminf.shm_tot, shmused);
newshmall = shminf.shm_tot + 1;
if (set_shmall(newshmall) != 0) {
printf("can't set shmall\n");
goto out;
}
if (get_shmall() != newshmall) {
printf("something went wrong with shmall setting\n");
goto out;
}
printf("new shmall: %lu pages (%lu bytes)\n", newshmall, newshmall * getpagesize());
printf("shmget() for %u bytes: ", (unsigned int) getpagesize());
shm_id1 = shmget(IPC_PRIVATE, (size_t)getpagesize(), IPC_CREAT | 0666);
if (shm_id1 < 0) {
printf("failed: %s\n", strerror(errno));
goto out;
}
printf("ok\nshmat 1: ");
ptr1 = shmat(shm_id1, NULL, 0);
if (ptr1 == 0) {
printf("failed\n");
goto out;
}
printf("ok\nshmat 2: ");
ptr2 = shmat(shm_id1, NULL, 0);
if (ptr2 == 0) {
printf("failed\n");
goto out;
}
printf("ok\n");
if (ptr1 == ptr2) {
printf("ptr1 and ptr2 are the same with shm_id1\n");
goto out;
}
printf("shmget() for %u bytes: ", (unsigned int) getpagesize());
shm_id2 = shmget(IPC_PRIVATE, (size_t)getpagesize(), IPC_CREAT | 0666);
if (shm_id2 < 0)
printf("failed: %s\n", strerror(errno));
else
printf("ok, although it's wrong\n");
out:
if (shmall != 0 && set_shmall(shmall) != 0)
printf("failed to restrore shmall\n");
if (shm_id1 >= 0 && shmctl(shm_id1, IPC_RMID, NULL) < 0)
printf("failed to remove shm_id1\n");
if (shm_id2 >= 0 && shmctl(shm_id2, IPC_RMID, NULL) < 0)
printf("failed to remove shm_id2\n");
return 0;
}
What is does is it sets the shmall limit just one page above what is currently used by the system, then tries to get page-sized new segment and map it twice (all successfully), then tries to get one more page-sized segment and fails to do that (execute the program as superuser because it writes to /proc/sys/kernel/shmall):
$ sudo ./a.out
original shmall: 18446744073708503040 pages
shmused: 21053 pages (86233088 bytes)
new shmall: 21054 pages (86237184 bytes)
shmget() for 4096 bytes: ok
shmat 1: ok
shmat 2: ok
shmget() for 4096 bytes: failed: No space left on device
I did not find any Physical memory allocation at do_shmat function (linux/ipc/shm.c)
https://github.com/torvalds/linux/blob/5469dc270cd44c451590d40c031e6a71c1f637e8/ipc/shm.c
so shmat consumes only vm (your process address space),
the main function of shmat is mmap

Resources