RSS (resident set size) didn't decrease after object destroyed on Linux

I'm trying to use RSS to estimate the memory usage of my application on Linux.
for (int i = 0; i < 100; ++i) {
    std::cout << "loading map " << i << std::endl;
    {
        process_mem_usage();
        MyApplication app;
        // do things
        process_mem_usage();
    }
}
process_mem_usage basically monitors vm_size and RSS using the approach from How to get memory usage at runtime using C++?
Running this small benchmark, I only see RSS increase the first time; after that it stays the same.
From this I can claim that there is no memory leak (otherwise RSS would keep increasing). The only explanation is that the process didn't return the memory to the system (even though that memory is currently not used by my application). Is there any way to force the process to return the memory to the system? (I tried sleeping for a long time, but that didn't work.)
Another example:
process_mem_usage();
{
    std::shared_ptr<char> tmp((char *)operator new(500 * 1024 * 1024));
    std::memset(tmp.get(), 1, 500 * 1024 * 1024);
    process_mem_usage();
}
Running this example, I can see RSS increase and then decrease immediately after the shared_ptr is destroyed.
So it's hard to explain what's going on under the hood.
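A likely piece of the puzzle, assuming glibc's allocator: very large requests (like the 500 MB block above) are typically served via mmap and unmapped as soon as they are freed, while smaller heap allocations stay in the arena for reuse, which would match both observations. If you want to actively push free heap pages back to the kernel, glibc also provides malloc_trim. A minimal sketch, assuming a glibc-based system (malloc_trim is a glibc extension, not standard C++):
#include <malloc.h>   // glibc extension header providing malloc_trim
#include <iostream>

int main() {
    // ... allocate and then free a large number of small objects here ...

    // Ask glibc's allocator to return unused heap pages to the kernel.
    // Returns 1 if some memory was released back to the system, 0 otherwise.
    int released = malloc_trim(0);
    std::cout << "malloc_trim released memory: " << released << std::endl;
    return 0;
}
Calling process_mem_usage() before and after the trim should then show whether RSS actually drops.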

Related

Memory not be freed on Mac when vector push_back string

Code is below. I found that when a vector push_back's a string in a Mac demo app, the memory is not freed. I thought stack variables would be freed when they go out of function scope; am I wrong? Thanks for any tips.
in model.h:
#pragma once
namespace NS {
    const uint8_t kModel[8779041] = {4,0,188,250,....};
}
in ViewController.mm:
- (void)start {
    std::vector<std::string> params = {};
    std::string strModel(reinterpret_cast<const char *>(NS::kModel), sizeof(NS::kModel));
    params.push_back(strModel);
}
The answer to your question depends on your understanding of "free" memory. The behaviour you are observing can be reproduced with as little as a couple of lines of code:
void myFunc() {
    const auto *ptr = new uint8_t[8779041]{};
    delete[] ptr;
}
Let's run this function and see how the memory consumption graph changes:
int main() {
    myFunc(); // 1 MB
    std::cout << "Check point" << std::endl; // 9.4 MB
    return 0;
}
If you put one breakpoint right at the line with the myFunc() invocation and another at the line with the "Check point" console output, you will see how memory consumption for the process jumps by about 8 MB (for my system and machine configuration, Xcode shows a sudden jump from 1 MB to 9.4 MB). But wait, isn't it supposed to be 1 MB again after the function, since the allocated memory is freed at the end of it? Well, not exactly. The system doesn't reclaim this memory right away, because releasing it is not a cheap operation to begin with, and if your process requests the same amount of memory one CPU cycle later, that work would have been redundant. Thus the system usually doesn't bother shrinking memory dedicated to a process until it's needed for another process or until it runs out of available resources (it can also be driven by some kind of fixed timer, but overall I would say this is implementation-defined). Another common reason the memory is not freed is that you often observe it in debug mode, where memory remains dedicated to the process in order to track some tricky scenarios (like NSZombie objects, whose addresses need to remain accessible to the process in order to report use-after-free occurrences).
The most important point here is that, internally, the process can differentiate between "deleted" and "occupied" memory pages, so it can re-occupy memory which was already deleted. As a result, no matter how many times you call the same function, the memory dedicated to the process remains the same:
int main() {
    myFunc(); // 1 MB
    std::cout << "Check point" << std::endl; // 9.4 MB
    for (int i = 0; i < 10000; ++i) {
        myFunc();
    }
    std::cout << "Another point" << std::endl; // 9.4 MB
    return 0;
}

system: Resource temporarily unavailable, which one?

I searched for an answer and so far haven't found a clear one.
I am running a test which launches many threads calling system(), like below.
for (int i = 0; i < 3000; ++i)
    pthread_create(&thread[i], NULL, thread_func, NULL);
for (int i = 0; i < 3000; ++i)
    pthread_join(thread[i], NULL);
...
void* thread_func(void* arg)
{
    if (system("test.sh") == -1)
    {
        perror("system");
        exit(1);
    }
    pthread_exit(NULL);
}
test.sh
#!/bin/bash
sleep 100
When I run the program, at a certain point it will display:
system: Resource temporarily unavailable
Is there a way to know which resource? I already fixed the max-processes issue, so I think it may be due to something else.
This error means that some system call invoked by the system library function returned EAGAIN. Most likely it is the fork call, which can fail with EAGAIN for a number of reasons. From the fork(2) man page:
EAGAIN A system-imposed limit on the number of threads was encountered. There are a number of limits that may trigger this error:
* the RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which limits the number of processes and threads for a real user ID, was reached;
* the kernel's system-wide limit on the number of processes and threads, /proc/sys/kernel/threads-max, was reached (see proc(5));
* the maximum number of PIDs, /proc/sys/kernel/pid_max, was reached (see proc(5)); or
* the PID limit (pids.max) imposed by the cgroup "process number" (PIDs) controller was reached.
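To narrow down which of these you are hitting, it can help to print the per-user limit the kernel enforces for your process. A minimal sketch, assuming a POSIX system with getrlimit(2) (the output format is just for illustration):
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    // RLIMIT_NPROC is the per-real-user-ID limit on processes/threads,
    // one of the limits fork() checks before failing with EAGAIN.
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}
Comparing that value against the 3000 threads plus the extra shell process each system() call forks gives a hint whether RLIMIT_NPROC is the bottleneck; the system-wide counterparts can be read directly from /proc/sys/kernel/threads-max and /proc/sys/kernel/pid_max.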

Parallel ray tracing in 16x16 chunks

My ray tracer is currently multithreaded: I'm basically dividing the image into as many chunks as the system has hardware threads and rendering them in parallel. However, not all chunks have the same rendering time, so for roughly half of the run time CPU usage is only around 50%.
Code
std::shared_ptr<bitmap_image> image = std::make_shared<bitmap_image>(WIDTH, HEIGHT);
auto nThreads = std::thread::hardware_concurrency();
std::cout << "Resolution: " << WIDTH << "x" << HEIGHT << std::endl;
std::cout << "Supersampling: " << SUPERSAMPLING << std::endl;
std::cout << "Ray depth: " << DEPTH << std::endl;
std::cout << "Threads: " << nThreads << std::endl;
std::vector<RenderThread> renderThreads(nThreads);
std::vector<std::thread> tt;
auto size = WIDTH*HEIGHT;
auto chunk = size / nThreads;
auto rem = size % nThreads;
// launch threads
for (unsigned i = 0; i < nThreads - 1; i++)
{
    tt.emplace_back(std::thread(&RenderThread::LaunchThread, &renderThreads[i], i * chunk, (i + 1) * chunk, image));
}
tt.emplace_back(std::thread(&RenderThread::LaunchThread, &renderThreads[nThreads - 1], (nThreads - 1) * chunk, nThreads * chunk + rem, image));
for (auto& t : tt)
    t.join();
I would like to divide the image into 16x16 chunks (or something similar) and render them in parallel, so that after each chunk gets rendered, the thread switches to the next one, and so on... This would greatly improve CPU utilization and reduce run time.
How do I set up my ray tracer to render these 16x16 chunks in a multithreaded manner?
I assume the question is "How to distribute the blocks to the various threads?"
In your current solution, you're figuring out the regions ahead of time and assigning them to the threads. The trick is to turn this idea on its head. Make the threads ask for what to do next whenever they finish a chunk of work.
Here's an outline of what the threads will do:
void WorkerThread(Manager *manager) {
    while (auto task = manager->GetTask()) {
        task->Execute();
    }
}
So you create a Manager object that returns a chunk of work (in the form of a Task) each time a thread calls its GetTask method. Since that method will be called from multiple threads, you have to be sure it uses appropriate synchronization.
std::unique_ptr<Task> Manager::GetTask() {
    std::lock_guard guard(mutex);
    std::unique_ptr<Task> t;
    if (next_row < HEIGHT) {
        t = std::make_unique<Task>(next_row);
        ++next_row;
    }
    return t;
}
In this example, the manager creates a new task to ray trace the next row. (You could use 16x16 blocks instead of rows if you like.) When all the tasks have been issued, it just returns an empty pointer, which essentially tells the calling thread that there's nothing left to do, so the calling thread will then exit.
If you made all the Tasks in advance and had the manager dole them out as they are requested, this would be a typical "work queue" solution. (General work queues also allow new Tasks to be added on the fly, but you don't need that feature for this particular problem.)
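For the 16x16 variant, the manager can hand out block coordinates instead of row numbers. A minimal sketch of such a manager, not the answer's exact code: the Block struct, the std::optional return type, and the width/height/block-size parameters are illustrative choices here:
#include <algorithm>
#include <mutex>
#include <optional>

struct Block { int x0, y0, x1, y1; }; // pixel rectangle to render

class BlockManager {
public:
    BlockManager(int width, int height, int block = 16)
        : width_(width), height_(height), block_(block) {}

    // Called from many worker threads: hand out the next block,
    // or nothing once the whole image has been issued.
    std::optional<Block> GetTask() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (next_y_ >= height_) return std::nullopt;
        Block b{next_x_, next_y_,
                std::min(next_x_ + block_, width_),
                std::min(next_y_ + block_, height_)};
        next_x_ += block_;
        if (next_x_ >= width_) { next_x_ = 0; next_y_ += block_; }
        return b;
    }

private:
    std::mutex mutex_;
    int width_, height_, block_;
    int next_x_ = 0, next_y_ = 0;
};
Each worker then loops on GetTask() and renders the returned rectangle, exactly like the WorkerThread outline above; blocks at the right and bottom edges are simply clipped to the image size.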
I do this a bit differently:
obtain the number of CPUs and/or cores
You did not specify the OS, so you need to use your OS API for this; search for system affinity mask.
divide the screen between threads
I divide the screen by lines instead of 16x16 blocks so I do not need a queue or anything similar. Simply create a thread for each CPU/core that will process only its own horizontal lines of rays. Each thread gets an ID number counting from zero, and with the number of CPUs/cores n, the lines belonging to each thread are:
y = ID + i*n
where i = {0,1,2,3,...}; once y is greater than or equal to the screen resolution, stop (a minimal sketch of this interleaving is shown after this answer). This type of access has its advantages; for example, accessing the screen buffer via ScanLines will not conflict between threads, as each thread accesses only its own lines...
I also set the affinity mask for each thread so it uses only its own CPU/core. It gave me a small boost, since there is less process switching (but that was on older OS versions; hard to say what it does now).
synchronize threads
basically you should wait until all threads are finished; if they are, render the result on screen. Your threads can either stop (and you create new ones on the next frame) or spin in Sleep loops until rendering is forced again...
I use the latter approach so I do not need to create and configure the threads over and over again, but beware: Sleep(1) can sleep a lot more than just 1 ms.
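Here is the minimal sketch of that line-interleaved assignment referenced above; the HEIGHT constant and the render_line stub are placeholders for whatever your renderer already provides:
#include <algorithm>
#include <thread>
#include <vector>

constexpr int HEIGHT = 1080;              // placeholder screen height

void render_line(int y) { /* trace all rays of scanline y */ }

void render_lines(int id, int n)          // thread id in [0, n)
{
    for (int y = id; y < HEIGHT; y += n)  // y = id, id+n, id+2n, ...
        render_line(y);
}

int main()
{
    const int n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (int id = 0; id < n; ++id)
        threads.emplace_back(render_lines, id, n);
    for (auto &t : threads)               // wait until all threads finish
        t.join();
}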

How can i measure the overhead due to task migration/load balancing on linux with the real time patch?

I am trying to measure the overhead due to task migration. By overhead I mean the latency involved in such an activity. I know there are separate run queues for each core, and the kernel periodically checks the run queues for an imbalance and wakes up a kernel thread (perhaps a higher-priority one) that does the migration.
Could anyone provide me with pointers to the kernel source code where I can insert time stamps to measure this value?
Is there any other performance metric I could investigate to get such an overhead?
I remember there was a post before that discussed this topic, and someone also posted some code showing how to get the system overhead.
Since task scheduling is so frequent, do you think inserting time stamps is actually feasible? I think you can follow the topic posted before.
I saved the source code from that post; thanks to the author!
#include <cstdio>

// previous readings, kept between calls
static unsigned long long lastTotalUser, lastTotalUserLow, lastTotalSys, lastTotalIdle;

double getCurrentValue() {
    double percent;
    FILE* file;
    unsigned long long totalUser, totalUserLow, totalSys, totalIdle, total;

    file = fopen("/proc/stat", "r");
    fscanf(file, "cpu %llu %llu %llu %llu", &totalUser, &totalUserLow,
           &totalSys, &totalIdle);
    fclose(file);

    if (totalUser < lastTotalUser || totalUserLow < lastTotalUserLow ||
        totalSys < lastTotalSys || totalIdle < lastTotalIdle) {
        // Overflow detection. Just skip this value.
        percent = -1.0;
    }
    else {
        total = (totalUser - lastTotalUser) + (totalUserLow - lastTotalUserLow) +
                (totalSys - lastTotalSys);
        percent = total;
        total += (totalIdle - lastTotalIdle);
        percent /= total;
        percent *= 100;
    }

    lastTotalUser = totalUser;
    lastTotalUserLow = totalUserLow;
    lastTotalSys = totalSys;
    lastTotalIdle = totalIdle;

    return percent;
}
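A minimal driver for that function, just to show how it is meant to be called (assuming it is compiled in the same file as getCurrentValue; the one-second sampling interval is arbitrary):
#include <cstdio>
#include <unistd.h>

int main() {
    getCurrentValue();                 // prime the last* counters with a first reading
    for (int i = 0; i < 10; ++i) {
        sleep(1);                      // let some CPU time accumulate
        printf("CPU usage: %.1f%%\n", getCurrentValue());
    }
    return 0;
}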

max thread per process in linux

I wrote a simple program to calculate the maximum number of threads that a process can have on Linux (CentOS 5). Here is the code:
#include <pthread.h>
#include <unistd.h>
#include <iostream>
using namespace std;

void* thread(void* i); // forward declaration

int main()
{
    pthread_t thrd[400];
    for (int i = 0; i < 400; i++)
    {
        int err = pthread_create(&thrd[i], NULL, thread, (void*)(long)i);
        if (err != 0)
            cout << "thread creation failed: " << i << " error code: " << err << endl;
    }
    return 0;
}

void* thread(void* i)
{
    sleep(100); // keep the thread alive
    return 0;
}
I found that the maximum number of threads is only 300!? What if I need more than that?
I should mention that pthread_create returns 12 (ENOMEM) as the error code.
Thanks in advance.
There is a thread limit on Linux and it can be modified at runtime by writing the desired limit to /proc/sys/kernel/threads-max. The default value is computed from the available system memory. In addition to that limit, there's also another one: /proc/sys/vm/max_map_count, which limits the maximum number of mmapped segments, and at least recent kernels will mmap memory per thread. It should be safe to increase that limit a lot if you hit it.
However, the limit you're hitting is the lack of virtual memory on a 32-bit operating system. Install a 64-bit Linux if your hardware supports it and you'll be fine. I can easily start 30000 threads with a stack size of 8 MB. The system has a single Core 2 Duo + 8 GB of system memory (I'm using 5 GB for other stuff at the same time) and it's running 64-bit Ubuntu with kernel 2.6.32. Note that memory overcommit (/proc/sys/vm/overcommit_memory) must be allowed, because otherwise the system would need at least 240 GB of committable memory (the sum of real memory and swap space).
If you need lots of threads and cannot use a 64-bit system, your only choice is to minimize the memory usage per thread to conserve virtual memory. Start by requesting as little stack as you can live with.
Your system limits may not be allowing you to map the stacks of all the threads you require. Look at /proc/sys/vm/max_map_count, and see this answer. I'm not 100% sure this is your problem, because most people run into problems at much larger thread counts.
I also encountered the same problem when my number of threads crossed some threshold.
It was because of the user-level limit (the number of processes a user can run at a time), set to 1024 in /etc/security/limits.conf.
So check your /etc/security/limits.conf and look for an entry like:
username soft/hard nproc 1024
Change it to a larger value, something like 100k (this requires sudo/root privileges), and it should work for you.
To learn more about the security policy, see http://linux.die.net/man/5/limits.conf.
Check the stack size per thread with ulimit; in my case, Red Hat Linux with kernel 2.6:
ulimit -a
...
stack size (kbytes, -s) 10240
Each of your threads will get this amount of memory (10 MB) assigned for its stack. With a 32-bit program and a maximum address space of 4 GB, that is a maximum of only 4096 MB / 10 MB = 409 threads! Minus program code, minus heap space, that probably leads to your observed maximum of about 300 threads.
You should be able to raise this by compiling a 64-bit application or by setting ulimit -s 8192 or even ulimit -s 4096. But whether this is advisable is another discussion...
You will run out of memory too unless you shrink the default thread stack size. It's 10 MB on our version of Linux.
EDIT:
Error code 12 = out of memory, so I think the 1 MB stack is still too big for you. Compiled for 32 bit, I can get a 100 KB stack to give me 30k threads. Beyond 30k threads I get error code 11, which means no more threads allowed. A 1 MB stack gives me about 4k threads before error code 12. 10 MB gives me 427 threads. 100 MB gives me 42 threads. 1 GB gives me 4... We have a 64-bit OS with 64 GB of RAM. Is your OS 32 bit? When I compile for 64 bit, I can use any stack size I want and get the full limit of threads.
Also I noticed that if I turn the profiling stuff (Tools|Profiling) on in NetBeans and run from the IDE... I can only get 400 threads. Weird. NetBeans also dies if you use up all the threads.
Here is a test app you can run:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // strerror
#include <pthread.h>
#include <signal.h>
#include <sched.h>    // sched_yield
// this prevents the compiler from reordering code over this COMPILER_BARRIER
// (the empty asm statement itself emits no instructions)
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
sigset_t _fSigSet;
volatile int _cActive = 0;
pthread_t thrd[1000000];
void * thread(void *i)
{
int nSig, cActive;
cActive = __sync_fetch_and_add(&_cActive, 1);
COMPILER_BARRIER(); // make sure the active count is incremented before sigwait
// sigwait is a handy way to sleep a thread and wake it on command
sigwait(&_fSigSet, &nSig); //make the thread still alive
COMPILER_BARRIER(); // make sure the active count is decremented after sigwait
cActive = __sync_fetch_and_add(&_cActive, -1);
//printf("%d(%d) ", i, cActive);
return 0;
}
int main(int argc, char** argv)
{
pthread_attr_t attr;
int cThreadRequest, cThreads, i, err, cActive, cbStack;
cbStack = (argc > 1) ? atoi(argv[1]) : 0x100000;
cThreadRequest = (argc > 2) ? atoi(argv[2]) : 30000;
sigemptyset(&_fSigSet);
sigaddset(&_fSigSet, SIGUSR1);
sigaddset(&_fSigSet, SIGSEGV);
printf("Start\n");
pthread_attr_init(&attr);
if ((err = pthread_attr_setstacksize(&attr, cbStack)) != 0)
printf("pthread_attr_setstacksize failed: err: %d %s\n", err, strerror(err));
for (i = 0; i < cThreadRequest; i++)
{
if ((err = pthread_create(&thrd[i], &attr, thread, (void*)i)) != 0)
{
printf("pthread_create failed on thread %d, error code: %d %s\n",
i, err, strerror(err));
break;
}
}
cThreads = i;
printf("\n");
// wait for threads to all be created, although we might not wait for
// all threads to make it through sigwait
while (1)
{
cActive = _cActive;
if (cActive == cThreads)
break;
printf("Waiting A %d/%d,", cActive, cThreads);
sched_yield();
}
// wake em all up so they exit
for (i = 0; i < cThreads; i++)
pthread_kill(thrd[i], SIGUSR1);
// wait for them all to exit, although we might be able to exit before
// the last thread returns
while (1)
{
cActive = _cActive;
if (!cActive)
break;
printf("Waiting B %d/%d,", cActive, cThreads);
sched_yield();
}
printf("\nDone. Threads requested: %d. Threads created: %d. StackSize=%lfmb\n",
cThreadRequest, cThreads, (double)cbStack/0x100000);
return 0;
}
