We know that in OpenCL, calling cl::CreateBuffer() creates a buffer on the device, which allocates memory there. My question is whether the buffer is freed when the program terminates, or whether there is a function we should call to free the memory and prevent a memory leak on the device.
The destructor for the cl::Buffer object returned by cl::CreateBuffer() will release the buffer, which will also free any memory allocated on-device. This is the main mechanism you should be relying upon. If you use the C API instead (clCreateBuffer()), the matching explicit call is clReleaseMemObject().
Process death for any reason (crash, clean exit) even with resources allocated will also destroy the process's context handle in the device driver, which will cause the driver to perform the cleanup.
Of course, bugs at any level of the stack could prevent this from happening correctly in all cases, but in general, once your process dies, everything should be reset.
Related
I am writing a program that retrieves images from a camera and processes them with CUDA. In order to gain the best performance, I'm passing a CUDA unified memory buffer to the image acquisition library, which writes to the buffer in another thread.
This causes all sorts of weird results where the program hangs in library code that I do not have access to. If I use a normal memory buffer and then copy to CUDA, the problem is fixed. So I became suspicious that writing from another thread might not be allowed, but google as I might, I could not find a definitive answer.
So is accessing the unified memory buffer from another CPU thread allowed or not?
There should be no problem writing to a unified memory buffer from multiple threads.
However, keep in mind the restrictions imposed when the concurrentManagedAccess device property is not true. In that case, when you have a managed buffer, and you launch a kernel, no CPU/host thread access of any kind is allowed, to that buffer, or any other managed buffer, until you perform a cudaDeviceSynchronize() after the kernel call.
In a multithreaded environment, this might take some explicit effort to enforce.
I think this is similar to this recital if that is also your posting. Note that TX2 should have this property set to false.
Note that this general rule in the non-concurrent case can be modified through careful use of streams. However the restrictions still apply to buffers attached to streams that have a kernel launched in them (or buffers not explicitly attached to any stream): when the property mentioned above is false, access by any CPU thread is not possible.
The motivation for this behavior is roughly as follows. The CUDA runtime does not know the relationship between managed buffers, regardless of where those buffers were created. A buffer created in one thread could easily have objects in it with embedded pointers, and there is nothing to prevent or restrict those pointers from pointing to data in another managed buffer. Even a buffer that was created later. Even a buffer that was created in another thread. The safe assumption is that any linkages could be possible, and therefore, without any other negotiation, the managed memory subsystem in the CUDA runtime must move all managed buffers to the GPU, when a kernel is launched. This makes all managed buffers, without exception, inaccessible to CPU threads (any thread, anywhere). In the normal program flow, access is restored at the next occurrence of a cudaDeviceSynchronize() call. Once the CPU thread that issues that call completes the call and moves on, then managed buffers are once again visible to (all) CPU threads. Another kernel launch (anywhere) repeats the process, and interrupts the accessibility. To repeat, this is the mechanism that is in effect when the concurrentManagedAccess property on the GPU is not true, and this behavior can be somewhat modified via the aforementioned stream attach mechanism.
I detected that my service process is leaking memory on a Linux server. It takes 1.2G of physical memory and keeps consuming more.
While I was looking through the code for the memory leak, I noticed that the process had been restarted (this process is managed by supervisord, so it is restarted if killed). There is no error or panic in the process's log, so my guess is that it was killed by the kernel.
When does the kernel kill a process that is leaking memory? When it consumes too much memory, or when it allocates memory too fast?
Memory leaks can cause system memory to run low. If memory gets very low, the OOM (out-of-memory) killer is invoked to try to recover from the low-memory state. The OOM killer terminates one or more processes that consume the most memory and are of least importance (low priority). Normally, the OOM killer is invoked when no user address space is available or no free page is available.
The OOM killer uses select_bad_process() and badness() to choose the process to kill. These functions assign a score to every process based on various factors such as the VM size of the process, the VM size of its children, uptime, priority, whether it does any hardware access, and whether it is the swapper, init, or a kernel thread. The process with the highest score (badness) gets killed.
Also, check whether the kernel's overcommit behaviour (/proc/sys/vm/overcommit_memory, /proc/sys/vm/overcommit_ratio) and the limit on the address space for the processes are appropriate.
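To see what the kernel currently thinks of a process, you can read its score and the overcommit setting straight from /proc. Below is a minimal sketch in C, assuming a Linux system that exposes /proc/self/oom_score; the helper name read_long_from_file() is made up for this example and is not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one integer from a text file such as
 * /proc/self/oom_score or /proc/sys/vm/overcommit_memory.
 * Returns 0 and stores the value in *out on success, -1 on any failure. */
int read_long_from_file(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* file missing or not readable */

    long value = 0;
    int parsed = (fscanf(f, "%ld", &value) == 1);
    fclose(f);

    if (!parsed)
        return -1;              /* file did not start with an integer */
    *out = value;
    return 0;
}
```

Calling read_long_from_file("/proc/self/oom_score", &v) from the leaking service (or using the path /proc/<pid>/oom_score from another process) shows how attractive that process is to the OOM killer at that moment; a higher value means it is more likely to be picked. The same helper works for /proc/sys/vm/overcommit_memory.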
Valgrind is a very handy tool in such scenarios in identifying memory leaks.
I have installed a handler (say, crashHandler()) which does a bit of file output. It runs in a Linux thread that registers crashHandler() for SIGSEGV. File writing is required, as the handler stores the stack trace to persistent storage.
It works most of the time. But in one specific scenario, crashHandler() executes only partly (I can see logs) and then the device reboots. Can someone help me with a way to deal with this?
The first question to ask here is why the device rebooted. Normally having an ordinary application crash won't cause a kernel-level or hardware-level reboot. Most likely, you're either hitting a watchdog timer before the crash handler completes (in which case you should extend the watchdog timeout - do NOT reset the timer from within the crash handler though, as then you're risking problems in the crash handler itself preventing a reboot), or this is pid 1 and it's crashing within the SIGSEGV handler, causing a kernel panic due to pid 1 (init) dying.
If it's the latter, you need to be more careful with what you do in that crash handler. Remember, you just crashed. You know memory is corrupt, but you don't know how it's corrupt. It may be corrupt in ways that affect the crash handler itself - e.g. if you corrupt the heap metadata, you may be unable to allocate memory without crashing for real this time. You should keep what you do in that handler to a bare minimum - in particular, avoid calling any library functions that are not documented as being async-signal-safe and avoid using any complex (pointer-containing) data structures or dynamically allocated memory. For the highest level of safety, limit yourself to just fork() and exec()ing another process that will use debugger APIs (ptrace() and /proc/$PID/mem) to perform memory dumps or whatever else you might need.
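As a rough illustration of "bare minimum", here is a sketch of a handler that only calls functions POSIX lists as async-signal-safe (write() and raise()). It is not the asker's crashHandler(); the names install_crash_handler() and crash_in_child() are invented for this example, and the log file is opened ahead of time so the handler itself never has to open or allocate anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Descriptor opened before any crash, so the handler never calls open(). */
static volatile sig_atomic_t crash_fd = -1;

/* Handler restricted to async-signal-safe calls: write() and raise(). */
static void crash_handler(int sig)
{
    static const char msg[] = "crash: caught fatal signal\n";
    if (crash_fd >= 0) {
        ssize_t n = write(crash_fd, msg, sizeof msg - 1);
        (void)n;                 /* nothing safe to do if the write fails */
    }
    /* SA_RESETHAND already restored the default action, so re-raising
     * terminates the process with the original signal. */
    raise(sig);
}

/* Hypothetical helper: open the log and register the handler. 0 on success. */
int install_crash_handler(const char *log_path)
{
    assert(log_path != NULL);
    int fd = open(log_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    crash_fd = fd;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sa.sa_flags = SA_RESETHAND;  /* one-shot: back to SIG_DFL on entry */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Demo only: crash a child process and report which signal killed it. */
int crash_in_child(const char *log_path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_crash_handler(log_path) != 0)
            _exit(2);
        raise(SIGSEGV);          /* stand-in for a real invalid access */
        _exit(3);                /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```

Anything richer than a fixed message (a stack trace, a memory dump) is better produced by a separate process, as described above, because the crashed process can no longer trust its own heap or data structures.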
I'm writing a device driver that, among other things, allocates a block of memory with kmalloc. This memory is freed when the user program closes the file. In one of my experiments, the user program crashed without closing the file.
Would anything have freed this memory?
In another experiment, I moved the kfree() from the close() function to the module_exit() function. When I ran the user program twice consecutively, I called kmalloc again with the same pointer as before, without freeing it first. Thus, I lost a pointer to that memory, and cannot free it.
Is this memory lost to the system until I reboot, or will it be freed when I unload the driver?
Kernel memory is never freed automatically. This includes kmalloc.
All memory related to an open file descriptor should be released when the file is closed.
When a process exits, for any reason whatsoever (including kill -9), all open file descriptors are closed, and the driver's close function is called. So if you free there, nothing the process can do will make the memory stay after the process dies.
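The "descriptors are closed no matter how the process dies" part can be observed from user space without writing a driver. The sketch below uses a pipe as a stand-in for the device file (an assumption made only to keep the example self-contained): the child holds the write end open, is killed with SIGKILL, and the parent then sees end-of-file, which only happens once the kernel has closed the child's descriptor. The function name fd_closed_after_kill() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the kernel closed the killed child's descriptor,
 * 0 if it did not, -1 if the demo could not be set up. */
int fd_closed_after_kill(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        for (;;)
            pause();            /* hold the write end; never close it */
    }

    close(fds[1]);              /* parent keeps only the read end */
    kill(pid, SIGKILL);         /* the child gets no chance to clean up */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        close(fds[0]);
        return -1;
    }
    assert(WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL);

    char byte;
    ssize_t n = read(fds[0], &byte, 1);   /* 0 means EOF: no writer left */
    close(fds[0]);
    return n == 0 ? 1 : 0;
}
```

For a character device, the same teardown is what ends up invoking the driver's close (release) function, so a kfree() placed there runs even after a crash or kill -9.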
Please don't carry your user-space experience over to kernel programming.
What do I mean by this?
Normal processes are cleaned up once they exit. That's not the case with kernel modules, because they're not really processes.
Technically, when you load a module and then call kmalloc, you are asking the kernel to allocate some memory for you in kernel space. That memory belongs to the kernel as a whole, so even if you unload your module, the allocation stays there unless it is explicitly freed.
In simple terms answering your question:
Every kmalloc needs a kfree, else the memory will remain there as long as the system is up.
Could anyone tell me what exactly is done in both situations? What is the main cost of each of them?
The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch.
Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch.
A fuzzier cost is that a context switch messes with the processor's caching mechanisms. Basically, when you context switch, all of the memory addresses that the processor "remembers" in its cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor's Translation Lookaside Buffer (TLB) or equivalent gets flushed, making memory accesses much more expensive for a while. This does not happen during a thread switch.
Process context switching involves switching the memory address space. This includes memory addresses, mappings, page tables, and kernel resources—a relatively expensive operation. On some architectures, it even means flushing various processor caches that aren't sharable across address spaces. For example, x86 has to flush the TLB and some ARM processors have to flush the entirety of the L1 cache!
Thread switching is context switching from one thread to another in the same process (switching from thread to thread across processes is just process switching). Switching processor state (such as the program counter and register contents) is generally very efficient.
First, the operating system brings the outgoing thread into kernel mode if it is not already there, because a thread switch can be performed only between threads that are running in kernel mode. Then the scheduler is invoked to decide which thread to switch to. Once the decision is made, the kernel saves the part of the thread context that lives in the CPU (the CPU registers) to a dedicated place in memory (frequently the top of the outgoing thread's kernel stack). The kernel then switches from the kernel stack of the outgoing thread to the kernel stack of the incoming thread. After that, it loads the previously stored context of the incoming thread from memory into the CPU registers. Finally, it returns control to user mode, but in the user mode of the new thread.
If the OS has determined that the incoming thread runs in another process, the kernel performs one additional step: it sets the new active virtual address space.
The main cost in both scenarios is related to cache pollution. In most cases, the working set used by the outgoing thread differs significantly from the working set used by the incoming thread. As a result, the incoming thread starts its life with an avalanche of cache misses, flushing old and useless data from the caches and loading the new data from memory. The same is true for the TLB (Translation Lookaside Buffer, which is on the CPU). When the virtual address space is reset (the threads run in different processes), the penalty is even worse, because the reset flushes the entire TLB, even if the new thread actually needs to load only a few new entries. As a result, the new thread starts its time quantum with lots of TLB misses and frequent page walks. The direct cost of a thread switch is also not negligible (from ~250 up to ~1500-2000 cycles) and depends on the CPU complexity, the states of both threads, and the sets of registers they actually use.
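If you want a rough number for your own machine, a common trick is to bounce a single byte between two processes over a pair of pipes and time the round trips. The sketch below does that; the function name pingpong_ns_per_round_trip() is invented for this example. Treat the result as an upper-bound estimate only: it includes the pipe system calls, and on a multi-core machine the two processes may run on different cores unless you pin them to one CPU, which this sketch does not do.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Bounce one byte between parent and child `rounds` times.
 * Returns the average nanoseconds per round trip, or -1.0 on failure. */
double pingpong_ns_per_round_trip(int rounds)
{
    assert(rounds > 0);

    int to_child[2], to_parent[2];
    if (pipe(to_child) != 0)
        return -1.0;
    if (pipe(to_parent) != 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1.0;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]);  close(to_child[1]);
        close(to_parent[0]); close(to_parent[1]);
        return -1.0;
    }
    if (pid == 0) {
        /* Child: echo every byte back until the parent closes its end. */
        char b;
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &b, 1) == 1)
            if (write(to_parent[1], &b, 1) != 1)
                break;
        _exit(0);
    }

    close(to_child[0]);
    close(to_parent[1]);

    struct timespec t0, t1;
    char b = 'x';
    int ok = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < rounds && ok; i++)
        ok = write(to_child[1], &b, 1) == 1 &&
             read(to_parent[0], &b, 1) == 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(to_child[1]);          /* EOF ends the child's echo loop */
    close(to_parent[0]);
    waitpid(pid, NULL, 0);
    if (!ok)
        return -1.0;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / rounds;
}
```

Each round trip forces the parent to block and the child to run, then the reverse, so when both share one core it contains at least two process switches. Replacing fork() with two threads in the same process would give the thread-switch counterpart for comparison.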
P.S.: Good post about context switch overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
Process switching: a transition between two memory-resident processes in a multiprogramming environment.
Context switching: a change of context from an executing program to an interrupt service routine (ISR).
In a thread context switch, the virtual memory space remains the same, while in a process context switch it does not. Also, a process context switch is costlier than a thread context switch.
I think the main difference is in the call to switch_mm(), which handles the memory descriptors of the old and new task. In the case of threads, the virtual memory address space is unchanged (threads share virtual memory), so very little has to be done, and the switch is therefore less costly.
Though a thread context switch needs to change the execution context (registers, stack pointer, program counter), it doesn't need to change the address space as a process context switch does. There's an additional cost when you switch address spaces: more memory accesses (paging, segmentation, etc.), and you have to flush the TLB when entering or leaving a process.
In short, a thread context switch does not move to a brand new set of memory and pid; it uses the same ones as the parent, since it stays within the same process. A process context switch moves to a different process, which has its own memory and pid.
There is a loooooot more to it. They have written books on it.
As for cost, a process context switch >>>> thread as you have to reset all of the stack counters etc.
Assuming that the CPU the OS runs on has some high-latency devices attached,
it makes sense to run another thread of the process's address space while waiting for the high-latency device to respond.
But if the high-latency device responds faster than the time needed to set up the tables and the virtual-to-physical translations for a new process, then it is questionable whether a switch is needed at all.
Also, a hot cache (where the data needed to run the process/thread is reachable in less time) is the better choice.