One physical page allocated when malloc() is called - Linux

I was trying to find the virtual set size (VSS) and resident set size (RSS) of a C program. I wrote a kernel module that traverses the vm_areas and calculates VSS and RSS, and a small C program to validate the changes in both.
// sample test program
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h> // for syscall()
#define N 10000000
int main() {
    // setup...
    int *arg1 = malloc(5 * sizeof(int));
    char *arg2 = malloc(sizeof(char) * 1024);
    pid_t _pid = getpid();
    int rss, prev_rss, vss, prev_vss;
    printf("pid of this process = %d\n", _pid);
    //...
    // first observation
    arg1[0] = (int)(_pid);
    long res = syscall(333, arg1, &arg2);
    vss = prev_vss = arg1[1]; // arg1[1] stores the vss from the kernel module
    rss = prev_rss = arg1[2]; // arg1[2] stores the rss from the kernel module
    printf("vss = %d rss = %d\n", vss, rss);

    unsigned int *ptr = malloc(1 << 21); // 2 MB
    printf("ptr = %p\n", ptr);

    // second observation
    arg1[0] = (int)(_pid);
    res = syscall(333, arg1, &arg2);
    vss = arg1[1];
    rss = arg1[2];
    printf("vss = %d rss = %d\n", vss, rss);
    if (vss - prev_vss > 0) {
        printf("change in vss = %d\n", vss - prev_vss);
    }
    if (rss - prev_rss > 0) {
        printf("change in rss = %d\n", rss - prev_rss);
    }
    prev_vss = vss;
    prev_rss = rss;
    // ...
    return 0;
}
The output of the above program:
pid of this process = 12964
vss = 4332 rss = 1308
ptr = 0x7f4077464010
vss = 6384 rss = 1312
change in vss = 2052
change in rss = 4
Here is the dmesg output:
First observation:
[11374.065527] 1 = [0000000000400000-0000000000401000] RSS=4KB sample
[11374.065529] 2 = [0000000000600000-0000000000601000] RSS=4KB sample
[11374.065530] 3 = [0000000000601000-0000000000602000] RSS=4KB sample
[11374.065532] 4 = [0000000000c94000-0000000000cb5000] RSS=4KB
[11374.065539] 5 = [00007f4077665000-00007f407781f000] RSS=1064KB libc-2.19.so
[11374.065546] 6 = [00007f407781f000-00007f4077a1f000] RSS=0KB libc-2.19.so
[11374.065547] 7 = [00007f4077a1f000-00007f4077a23000] RSS=16KB libc-2.19.so
[11374.065549] 8 = [00007f4077a23000-00007f4077a25000] RSS=8KB libc-2.19.so
[11374.065551] 9 = [00007f4077a25000-00007f4077a2a000] RSS=16KB
[11374.065553] 10 = [00007f4077a2a000-00007f4077a4d000] RSS=140KB ld-2.19.so
[11374.065554] 11 = [00007f4077c33000-00007f4077c36000] RSS=12KB
[11374.065556] 12 = [00007f4077c49000-00007f4077c4c000] RSS=12KB
[11374.065557] 13 = [00007f4077c4c000-00007f4077c4d000] RSS=4KB ld-2.19.so
[11374.065559] 14 = [00007f4077c4d000-00007f4077c4e000] RSS=4KB ld-2.19.so
[11374.065561] 15 = [00007f4077c4e000-00007f4077c4f000] RSS=4KB
[11374.065563] 16 = [00007ffcdf974000-00007ffcdf995000] RSS=8KB
[11374.065565] 17 = [00007ffcdf9c3000-00007ffcdf9c6000] RSS=0KB
[11374.065566] 18 = [00007ffcdf9c6000-00007ffcdf9c8000] RSS=4KB
Second observation:
[11374.065655] 1 = [0000000000400000-0000000000401000] RSS=4KB sample
[11374.065657] 2 = [0000000000600000-0000000000601000] RSS=4KB sample
[11374.065658] 3 = [0000000000601000-0000000000602000] RSS=4KB sample
[11374.065660] 4 = [0000000000c94000-0000000000cb5000] RSS=4KB
[11374.065667] 5 = [00007f4077464000-00007f4077665000] RSS=4KB
[11374.065673] 6 = [00007f4077665000-00007f407781f000] RSS=1064KB libc-2.19.so
[11374.065679] 7 = [00007f407781f000-00007f4077a1f000] RSS=0KB libc-2.19.so
[11374.065681] 8 = [00007f4077a1f000-00007f4077a23000] RSS=16KB libc-2.19.so
[11374.065683] 9 = [00007f4077a23000-00007f4077a25000] RSS=8KB libc-2.19.so
[11374.065685] 10 = [00007f4077a25000-00007f4077a2a000] RSS=16KB
[11374.065687] 11 = [00007f4077a2a000-00007f4077a4d000] RSS=140KB ld-2.19.so
[11374.065688] 12 = [00007f4077c33000-00007f4077c36000] RSS=12KB
[11374.065690] 13 = [00007f4077c49000-00007f4077c4c000] RSS=12KB
[11374.065691] 14 = [00007f4077c4c000-00007f4077c4d000] RSS=4KB ld-2.19.so
[11374.065693] 15 = [00007f4077c4d000-00007f4077c4e000] RSS=4KB ld-2.19.so
[11374.065695] 16 = [00007f4077c4e000-00007f4077c4f000] RSS=4KB
[11374.065697] 17 = [00007ffcdf974000-00007ffcdf995000] RSS=8KB
[11374.065699] 18 = [00007ffcdf9c3000-00007ffcdf9c6000] RSS=0KB
[11374.065701] 19 = [00007ffcdf9c6000-00007ffcdf9c8000] RSS=4KB
The virtual address held in ptr, 0x7f4077464010, falls inside the 5th vm_area of the second observation:
[00007f4077464000-00007f4077665000] VSS=2052KB // shown from the VSS outputs
My questions are:
Why is there a difference between the requested malloc size (2048 KB) and the VSS of the 5th vm_area (2052 KB)?
We have not accessed the memory region pointed to by ptr yet, so why is one physical page already allocated, as shown in the RSS for the 5th vm_area in the second observation? (Is it possibly because of the new vm_area_struct?)
Thank You !

malloc(xxx) does not allocate exactly xxx bytes of memory. malloc is not a system call, but a library function.
In general, malloc takes the following steps:
extend the heap space via brk (if needed), or use mmap for large requests like this 2 MB one
map the new virtual address range into the process (physical frames are only assigned lazily, on first access)
write some metadata for managing the heap space (usually a linked list of chunk headers)
In step 3, one page of the new mapping is written to. That means one physical page is faulted in, which increases RSS by 4 KB (one page).

Related

Linux direct IO latency difference between reading 4GB and 32MB file under pread in NVME

My test randomly reads 4K pages (one random 4K page at a time in a tight loop) from both a 4 GB and a 32 MB file, using direct I/O (O_DIRECT) and pread on an NVMe disk. The latency is about 41 microseconds per page for the 4 GB file and 79 microseconds per page for the small 32 MB file. Is there a rational explanation for such a difference?
int b_fd = open("large.txt", O_RDONLY | O_DIRECT);
int s_fd = open("small.txt", O_RDONLY | O_DIRECT);
std::srand(std::time(nullptr));
void *buf;
int page_size = getpagesize();
posix_memalign(&buf, page_size, page_size);
long long nano_seconds = 0;
// number of random pages to read
int iter = 256 * 100;
int big_file_pages = (4LL << 30) / page_size; // pages in the 4 GB file
for (int i = 0; i < iter; i++) {
    int page_index = std::rand() % big_file_pages;
    auto start = std::chrono::steady_clock::now();
    pread(b_fd, buf, page_size, (off_t)page_index * page_size);
    auto end = std::chrono::steady_clock::now();
    nano_seconds += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
std::cout << "large file average random 1 page direct IO read in nanoseconds is: "
          << nano_seconds / iter << "\n";
// (the 32 MB file is measured the same way with s_fd and its own page count)

change page order in kernel space

I have a kernel module that works on data that is:
allocated by the kernel
page aligned
mapped in an arbitrary page order
I allocate the memory in kernel space with kvmalloc(). For the userspace representation I use vm_insert_page() to create the correctly ordered mapping. But I could not find a method with which I can "insert", "remap" or "reorder" a page mapping within kernel space. Are there methods that do the same as vm_insert_page() for kernel-space mappings?
OK, this seems to work:
static int __init test_init_fs(void)
{
    int rv = 0;
    size_t size = 5 * 1024 * 1024; /* 5 MiB */
    void *mem = vzalloc(size);
    struct page **pages = kcalloc(5, sizeof(struct page *), GFP_KERNEL);
    pr_info("alloced\n");
    /* pick the backing pages in the desired (arbitrary) order */
    pages[0] = vmalloc_to_page(mem + 0 * PAGE_SIZE);
    pages[1] = vmalloc_to_page(mem + 6 * PAGE_SIZE);
    pages[2] = vmalloc_to_page(mem + 2 * PAGE_SIZE);
    pages[3] = vmalloc_to_page(mem + 1 * PAGE_SIZE);
    pages[4] = vmalloc_to_page(mem + 8 * PAGE_SIZE);
    pr_info("got all pages\n");
    /* create a new, contiguous kernel mapping over those pages */
    void *new_mapping = vmap(pages, 5, VM_MAP, PAGE_KERNEL);
    pr_info("new mapping created\n");
    void *buffer = vzalloc(5 * PAGE_SIZE);
    memcpy(buffer, new_mapping, 5 * PAGE_SIZE);
    vunmap(new_mapping);
    pr_info("unmapped\n");
    vfree(buffer);  /* free the scratch buffer */
    kfree(pages);   /* free the page array */
    vfree(mem);
    return rv;
}

Need help to detect extra malloc - CS50 Pset5

Valgrind says 0 bytes lost, but also reports one fewer free than mallocs.
Because I have used malloc only once, I'm posting just these segments and not all 3 files.
When loading a dictionary.txt file into a hash table:
bool load(const char *dictionary)
{
    FILE *dict_file = fopen(dictionary, "r"); // dictionary.c:54
    if (dict_file == NULL)
        return false;
    int key;
    node *n = NULL;
    int mallocs = 0;
    while (1)
    {
        n = malloc(sizeof(node));
        printf("malloced: %i\n", ++mallocs);
        if (fscanf(dict_file, "%s", n->word) == -1)
        {
            printf("malloc freed\n");
            free(n);
            break;
        }
        key = hash(n->word);
        n->next = table[key];
        table[key] = n;
        words++;
    }
    return true;
}
And the unloading part:
bool unload(void)
{
    int deleted = 0;
    node *n;
    for (int i = 0; i < N; i++)
    {
        n = table[i];
        while (n != NULL)
        {
            n = n->next;
            free(table[i]);
            table[i] = n;
            deleted++;
        }
    }
    printf("DELETED: %i", deleted);
    return true;
}
Check50 says there are memory leaks, but I can't understand where.
Command: ./speller dictionaries/small texts/cat.txt
==4215==
malloced: 1
malloced: 2
malloced: 3
malloced: 4
malloc freed
DELETED: 3
WORDS MISSPELLED: 2
WORDS IN DICTIONARY: 3
WORDS IN TEXT: 6
TIME IN load: 0.03
TIME IN check: 0.00
TIME IN size: 0.00
TIME IN unload: 0.00
TIME IN TOTAL: 0.03
==4215==
==4215== HEAP SUMMARY:
==4215== in use at exit: 552 bytes in 1 blocks
==4215== total heap usage: 9 allocs, 8 frees, 10,544 bytes allocated
==4215==
==4215== 552 bytes in 1 blocks are still reachable in loss record 1 of 1
==4215== at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4215== by 0x525AF29: __fopen_internal (iofopen.c:65)
==4215== by 0x525AF29: fopen@@GLIBC_2.2.5 (iofopen.c:89)
==4215== by 0x40114E: load (dictionary.c:54)
==4215== by 0x40095E: main (speller.c:40)
==4215==
==4215== LEAK SUMMARY:
==4215== definitely lost: 0 bytes in 0 blocks
.
.
.
==4215== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
speller.c has distribution code. I hope the rest of the question is clear and understandable.
The stream opened with fopen (dict_file) is never closed; that is the still-reachable 552-byte FILE block in the valgrind trace. Close it with fclose. See man fclose.

very interesting behaviour using CUDA 4.2 and driver 295.41

I witnessed very interesting behaviour when using CUDA 4.2 with driver 295.41 on Linux.
The code does nothing more than fill a matrix with random values and mark each position holding the maximum value with 1.
#include <stdio.h>
#include <stdlib.h>

const int MAX = 8;

static __global__ void position(int *d, int len)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len)
        d[idx] = (d[idx] == MAX) ? 1 : 0;
}

int main(int argc, const char **argv)
{
    int colNum = 16 * 512, rowNum = 1024;
    int len = rowNum * colNum;
    int *h = (int *)malloc(len * sizeof(int));
    int *d = NULL;
    cudaMalloc((void **)&d, len * sizeof(int));
    // fill a random matrix
    for (int i = 0; i < len; i++) {
        h[i] = rand() % (MAX + 1);
    }
    // launch kernel
    int threads = 128;
    cudaMemcpy(d, h, len * sizeof(int), cudaMemcpyHostToDevice);
    position<<<(len - 1) / threads + 1, threads>>>(d, len);
    cudaMemcpy(h, d, len * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);
    return 0;
}
When I set rowNum = 1024, the code does not work at all, as if the kernel had never been launched.
With rowNum = 1023, everything works fine.
This threshold is coupled to the block size (128 in this example): if I change the block size to 512, the behaviour changes between rowNum = 4095 and 4096.
I'm not quite sure if this is a bug or whether I missed something.
You should always check for errors after calling CUDA functions. In your code, an invalid configuration argument error occurs at kernel launch.
This usually means that the grid or block dimensions are invalid.
With colNum = 16*512 and rowNum = 1024 you are attempting to launch 65536 blocks x 128 threads, exceeding the maximum grid dimension (65535 blocks in one dimension for GPUs with compute capability 1.x and 2.x; not sure about 3.x).
If you need to run more threads, you can either increase the block size (you have already tried that, with some effect) or use a 2D/3D grid (3D is available only on devices with compute capability 2.0 or higher).

CUDA performance test

I'm writing a simple CUDA program as a performance test.
It is not related to vector calculation, just a simple (parallel) string conversion.
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

#define UCHAR unsigned char
#define UINT32 unsigned long int
#define DOCU_SIZE 4096
#define TOTAL 100000
#define BLOCK_SIZE 500

UCHAR pH_TXT[DOCU_SIZE * TOTAL];
UCHAR pH_ENC[DOCU_SIZE * TOTAL];
UCHAR *pD_TXT;
UCHAR *pD_ENC;

__global__
void TEST_Encode(UCHAR *a_input, UCHAR *a_output)
{
    // note: blockIdx is ignored here, so every block converts
    // the same first blockDim.x documents
    UCHAR *input = &(a_input[threadIdx.x * DOCU_SIZE]);
    UCHAR *output = &(a_output[threadIdx.x * DOCU_SIZE]);
    for (int i = 0; i < 30; i++) {
        if ((input[i] >= 'a') && (input[i] <= 'z')) {
            output[i] = input[i] - 'a' + 'A';
        } else {
            output[i] = input[i];
        }
    }
}

int main(int argc, char **argv)
{
    struct cudaDeviceProp xCUDEV;
    cudaGetDeviceProperties(&xCUDEV, 0);
    // Prepare source
    memset(pH_TXT, 0x00, DOCU_SIZE * TOTAL);
    for (int i = 0; i < TOTAL; i++) {
        strcpy((char *)pH_TXT + (i * DOCU_SIZE), "hello world, i need an apple.");
    }
    // Allocate vectors in device memory
    cudaMalloc((void **)&pD_TXT, DOCU_SIZE * TOTAL);
    cudaMalloc((void **)&pD_ENC, DOCU_SIZE * TOTAL);
    // Copy vectors from host memory to device memory
    cudaMemcpy(pD_TXT, pH_TXT, DOCU_SIZE * TOTAL, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = BLOCK_SIZE;
    int blocksPerGrid = (TOTAL + threadsPerBlock - 1) / threadsPerBlock;
    printf("Total Task is %d\n", TOTAL);
    printf("block size is %d\n", threadsPerBlock);
    printf("repeat cnt is %d\n", blocksPerGrid);
    TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);
    cudaMemcpy(pH_ENC, pD_ENC, DOCU_SIZE * TOTAL, cudaMemcpyDeviceToHost);
    // Free device memory
    if (pD_TXT) cudaFree(pD_TXT);
    if (pD_ENC) cudaFree(pD_ENC);
    cudaDeviceReset();
}
When I change the BLOCK_SIZE value from 2 to 1000, I get the following durations (from NVIDIA Visual Profiler):
TOTAL BLOCKS BLOCK_SIZE Duration(ms)
100000 50000 2 28.22
100000 10000 10 22.223
100000 2000 50 12.3
100000 1000 100 9.624
100000 500 200 10.755
100000 250 400 29.824
100000 200 500 39.67
100000 100 1000 81.268
My GPU is a GeForce GT520 and its maximum threadsPerBlock is 1024, so I predicted I would get the best performance with a block size of 1000, but the table above shows a different result.
I can't understand why the duration is not linear, or how I can fix this (or how I can find the optimal block size, i.e. the minimum duration).
It seems block sizes of 2, 10 or 50 threads don't utilize the capabilities of the GPU, since it is designed to run many more threads.
Your card has compute capability 2.1:
Maximum number of resident threads per multiprocessor = 1536
Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8
Warp size = 32
There are two issues:
1. If you use too much register memory per thread, registers will definitely be spilled to slow local memory as your block size increases.
2. Perform your tests with multiples of 32, since this is the warp size of your card and many memory operations are optimized for thread counts that are multiples of the warp size.
So if you use around 1024 threads per block (1000 in your case), 33% of your GPU is idle, since only 1 block can be resident per SM.
What happens if you use the following 100% occupancy sizes?
128 = 12 blocks -> since only 8 can be resident per SM, block execution is serialized
192 = 8 resident blocks per SM
256 = 6 resident blocks per SM
512 = 3 resident blocks per SM
