pinned memory in multiple GPUs - multithreading

I am using 4 GPUs, and to speed up memory transfers I am trying to use pinned memory allocated with cudaHostAlloc().
The main UI thread (MFC-based) creates 4 worker threads, and each thread calls cudaSetDevice(nDeviceID).
Here is my question: can I call cudaHostAlloc() in the main thread and pass the pointer as lParam, or do I have to call it in each branch thread after calling cudaSetDevice(nDeviceID)?
Here is the pseudocode.
1) Calling cudaHostAlloc in the main thread
Main thread:
float* h_arrfBuf = nullptr;
cudaHostAlloc((void**)&h_arrfBuf, size * sizeof(float), cudaHostAllocDefault);
AcqBuf(h_arrfBuf, size);
for i = 1:4
    ST_Param* pstParam = new ST_Param(i, size / 4, h_arrfBuf);
    AfxBeginThread(Calc, pstParam);
Branch thread:
UINT Calc(LPVOID lParam)
    ST_Param* pstParam = reinterpret_cast<ST_Param*>(lParam);
    cudaSetDevice(pstParam->nDeviceID);
    Cudafunc(pstParam->size, pstParam->h_arrfBuf + (pstParam->nDeviceID - 1) * pstParam->size);
2) Calling cudaHostAlloc in the branch threads
Main thread:
AcqBuf(arrfRaw, size);
for i = 1:4
    ST_Param* pstParam = new ST_Param(i, size / 4, arrfRaw + (i - 1) * size / 4);
    AfxBeginThread(Calc, pstParam);
Branch thread:
UINT Calc(LPVOID lParam)
    ST_Param* pstParam = reinterpret_cast<ST_Param*>(lParam);
    cudaSetDevice(pstParam->nDeviceID);
    float* h_arrfBuf = nullptr;
    cudaHostAlloc((void**)&h_arrfBuf, pstParam->size * sizeof(float), cudaHostAllocDefault);
    memcpy(h_arrfBuf, pstParam->arrfRaw, pstParam->size * sizeof(float));
    Cudafunc(pstParam->size, h_arrfBuf);
What I am basically curious about is whether pinned memory is device-specific or not.

Since CUDA 4.0, the runtime API is intrinsically thread-safe, and a context on any given GPU is automatically shared among all host threads within a given application (see here).
Further, quoting from the relevant documentation:
When the application is run as a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. As a consequence:
....
Allocations via cudaHostAlloc() are automatically portable (see Portable Memory) across all the devices for which the unified address space is used, and pointers returned by cudaHostAlloc() can be used directly from within kernels running on these devices (i.e., there is no need to obtain a device pointer via cudaHostGetDevicePointer() as described in Mapped Memory).
So if your GPUs and platform support unified virtual addressing, then pinned/mapped host memory is automatically portable to all devices within that address space, and each GPU's context is automatically shared among all host threads. You should therefore be safe doing the complete pinned-memory setup from a single host thread, given the constraints described above.
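For what it's worth, here is a minimal sketch of option 1 using POSIX threads instead of AfxBeginThread, just to keep it self-contained (ProcessChunk, ChunkArg, the device count and the buffer size are illustrative stand-ins, not the asker's names): the pinned buffer is allocated once in the main thread, and each worker only selects its device and consumes its slice of it.

/* Sketch only: build with something like "nvcc multi_pinned.cu -o multi_pinned".
   Devices are numbered 0..NDEV-1 here, unlike the 1-based nDeviceID above. */
#include <cuda_runtime.h>
#include <pthread.h>
#include <stddef.h>

typedef struct { int dev; float* h_chunk; size_t n; } ChunkArg;

static void* ProcessChunk(void* p)
{
    ChunkArg* a = (ChunkArg*)p;
    float* d_buf = NULL;
    cudaSetDevice(a->dev);                               /* select this worker's GPU */
    cudaMalloc((void**)&d_buf, a->n * sizeof(float));
    /* Because the source buffer is pinned, this asynchronous copy can overlap
       with the copies and kernels running on the other devices. */
    cudaMemcpyAsync(d_buf, a->h_chunk, a->n * sizeof(float),
                    cudaMemcpyHostToDevice, 0);
    /* ... launch kernels that consume d_buf here ... */
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return NULL;
}

int main(void)
{
    enum { NDEV = 4 };
    const size_t size = 1 << 20;                         /* total floats (example value) */
    const size_t chunk = size / NDEV;
    float* h_arrfBuf = NULL;
    pthread_t tid[NDEV];
    ChunkArg arg[NDEV];
    int i;

    /* One pinned allocation, made in the main thread; under UVA it is portable
       to every device and visible to every host thread. */
    cudaHostAlloc((void**)&h_arrfBuf, size * sizeof(float), cudaHostAllocDefault);

    for (i = 0; i < NDEV; ++i) {
        arg[i].dev = i;
        arg[i].h_chunk = h_arrfBuf + (size_t)i * chunk;
        arg[i].n = chunk;
        pthread_create(&tid[i], NULL, ProcessChunk, &arg[i]);
    }
    for (i = 0; i < NDEV; ++i)
        pthread_join(tid[i], NULL);

    cudaFreeHost(h_arrfBuf);
    return 0;
}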

Related

Writing to ram - LINUX [duplicate]

I know that for malloc, sbrk is the system call invoked. Similarly, what is the system call invoked when I write to malloc'ed memory (heap memory)?
#include <stdlib.h>

int main(void)
{
    /* 5 bytes of heap memory allocated */
    char *ptr = malloc(5);
    ptr[0] = 10; /* What is the system call invoked for writing into this heap memory? */
    return 0;
}
There is no system call involved in this case. Ask your compiler to generate assembly so that you can see that there are only some MOV instructions there. Or you can use a debugger to see the assembly.
Accessing memory does not require a system call. On the contrary, accessing memory is what most of your code does most of the time! On a modern OS, you have a flat view of a contiguous range of virtual memory, and you typically only need a system call to mark a particular region (a "page") of that memory as valid; other times, contiguously growing memory ranges such as the call stack don't even require any action on your program's part. It's solely the job of your operating system's memory manager to intercept accesses to memory that isn't mapped to physical memory (via a page fault), do some kernel magic to bring the desired memory into physical space and return control to your program.
The only reason malloc occasionally needs to perform a system call is because it asks the operating system for a random piece of virtual memory somewhere in the middle. If your program were to only function with global and local variables (but no dynamic allocation), you wouldn't need any system calls for memory management.
"operating system doesn't see every write that occurs: a write to memory corresponds simply to a STORE assembly instruction, not a system call. It is the hardware that takes care of the STORE and the necessary address translation. The only time the OS will see a memory write is when the address translation in the page tables fails, causing a trap to the OS. "
Please read the below link for details
http://pages.cs.wisc.edu/~dusseau/Classes/CS537-F04/Questions/sol12.html
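If you want to see this for yourself, here is a hedged sketch along the lines of the first answer's suggestion (not from the original posts): compile the snippet to assembly and look at the store.

/* write_heap.c: inspect the generated code with  gcc -O0 -S write_heap.c -o -
   The malloc call shows up as "call malloc" (the C library may issue brk/mmap
   system calls inside it), but the write itself compiles to a single store such
   as "movb $10, (%rax)" on x86-64; no system call. Running the program under
   strace shows the same picture: brk/mmap from malloc, nothing for the store. */
#include <stdlib.h>

int main(void)
{
    char *ptr = malloc(5);
    ptr[0] = 10;   /* a plain MOV/STORE, handled entirely by the CPU and MMU */
    return 0;
}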

Where are the stacks for the other threads located in a process virtual address space?

The following image shows where the sections of a process are laid out in the process's virtual address space (in Linux):
You can see that there is only one stack section (since this process only has one thread I assume).
But what if this process has another thread? Where will the stack for this second thread be located? Will it be located immediately below the first stack?
Stack space for a new thread is created by the parent thread with mmap(MAP_ANONYMOUS|MAP_STACK). So they're in the "memory map segment", as your diagram labels it. It can end up anywhere that a large malloc() could go. (glibc malloc(3) uses mmap(MAP_ANONYMOUS) for large allocations.)
(MAP_STACK is currently a no-op, and exists in case some future architecture needs special handling).
You pass a pointer to the new thread's stack space to the clone(2) system call which actually creates the thread. (Try using strace -f on a multi-threaded process sometime). See also this blog post about creating a thread using raw Linux syscalls.
See this answer on a related question for some more details about mmaping stacks. e.g. MAP_GROWSDOWN doesn't prevent another mmap() from picking the address right below the thread stack, so you can't depend on it to dynamically grow a small stack the way you can for the main thread's stack (where the kernel reserves the address space even though it's not mapped yet).
So even though mmap(MAP_GROWSDOWN) was designed for allocating stacks, it's so bad that Ulrich Drepper proposed removing it in 2.6.29.
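As a hedged illustration of the mmap()ed stacks described above (using the portable pthreads API rather than raw clone(2); the 1 MiB size is an arbitrary example), the parent thread can perform the mmap() itself and hand the mapping to the new thread, so the stack ends up in the memory-map segment like any other anonymous mapping. glibc normally does this mmap() for you; it is done by hand here only to make it visible.

/* Sketch only: compile with "cc -pthread thread_stack.c". */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

static void* worker(void* arg)
{
    int local;
    /* The address of a local variable shows roughly where this thread's stack lives. */
    printf("new thread's stack is near %p\n", (void*)&local);
    return arg;
}

int main(void)
{
    const size_t stacksize = 1024 * 1024;    /* 1 MiB, an arbitrary example */
    pthread_t tid;
    pthread_attr_t attr;

    /* Allocate the new thread's stack the way the answer describes:
       an anonymous private mapping, tagged with MAP_STACK. */
    void* stack = mmap(NULL, stacksize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return 1;

    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr, stack, stacksize);   /* hand the mapping to pthreads */
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);

    munmap(stack, stacksize);
    return 0;
}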
Also, note that your memory-map diagram is for a 32-bit kernel. A 64-bit kernel doesn't have to reserve any user virtual-address space for mapping kernel memory, so a 32-bit process running on an amd64 kernel can use the full 4GB of virtual address space. (Except for the low 64k by default (sysctl vm.mmap_min_addr = 65536), so NULL-pointer dereferences do actually fault. And the top page is also reserved for error codes, not valid pointers.)
Related:
See Relation between stack limit and threads for more about stack-size for pthreads. getrlimit(RLIMIT_STACK) is the main thread's stack size. Linux pthreads uses RLIMIT_STACK as the stack size for new threads, too.

How can kernel threads ask pages for only themselves?

As you know, all kernel threads share one kernel memory space. The mm field of the task_struct describing a kernel thread is NULL; it borrows the mm of the previously running ('prev') task via active_mm.
I think this lets any kernel thread access another kernel thread's private memory region. For example, a device driver may allocate a 4 KB page for its own buffer, but there is no way to prevent other threads from accessing it, because all kernel threads share one memory address space.
So, I have a question: is there any way to mark pages as private to one kernel thread?
Is there any way to mark pages as private to one kernel thread?
No, any process executing kernel code has access to everything in the operating system.
It is up to the operating system and its policies to prevent malicious drivers from being loaded into the kernel, if it needs some security guarantees.

why kernel needs virtual addressing?

In Linux each process has its own virtual address space (e.g. 4 GB on a 32-bit system, of which 3 GB is reserved for the process and 1 GB for the kernel). This virtual addressing mechanism helps isolate the address space of each process. This is understandable for processes, since there are many of them. But since there is only one kernel, why do we need virtual addressing for the kernel?
The reason the kernel is "virtual" is not to deal with paging as such; it is because the processor can only run in one mode at a time. Once you turn on paged memory mapping (bit 31 in CR0 on x86), the processor expects ALL memory accesses to go through the page-mapping mechanism. So, since we still want to access the kernel after paging (virtual memory) has been enabled, it needs to exist somewhere in the virtual address space.
The "reserving" of memory is more about having an easy way to determine whether an address belongs to the kernel or to user space than anything else. It would be perfectly possible to put a little bit of the kernel at addresses 12345-34121, another bit at 101900-102400, and some other bit at 40000000-40001000. But that would make life difficult for every aspect of the kernel and userspace: there would be gaps/holes to deal with [there already are such holes/gaps, but having more wouldn't exactly help things]. By setting a fixed limit, "userspace is from here to here, kernel is from the end of userspace to X", life becomes much easier in that respect. We can just say kernel = 0; if (address > max_userspace) kernel = 1; in some code, as sketched below.
Of course, the kernel only takes up as much PHYSICAL memory as it will actually use, so the common thinking that "it's a waste to take up a whole gigabyte for the kernel" is wrong: the kernel itself is a few megabytes (a dozen or so for a very "big" kernel). Loaded modules can easily add several more megabytes, and graphics drivers from ATI and nVidia easily contribute another few megabytes just for their kernel modules. The kernel also uses some memory to store "kernel data" such as tasks, queues, semaphores, files, and other "stuff" the kernel has to deal with; a few megabytes go to this as well.
Virtual memory management is the feature of Linux that enables multitasking without any limit on the number of tasks or the amount of memory used by each task. The Linux memory manager subsystem (along with the MMU hardware) provides VMM support, where memory and memory-mapped devices are accessed through virtual addresses. Within Linux everything, both kernel and user components, works with virtual addresses, except when dealing with real hardware; that is when the memory manager steps in, performs the virtual-to-physical address translation, and points to the physical memory/device location.
A process is an abstract entity, defined by the kernel, to which system resources are allocated in order to execute a program. In Linux process management the kernel is an integral part of a process's memory map. A process has two main regions, like two faces of one coin:
User-space view - contains the user program sections (code, data, stack, heap, etc.) used by the process
Kernel-space view - contains the kernel data structures that maintain information (PID, state, file descriptors, resource usage, etc.) about the process
Every process in a Linux system has a unique and separate user-space region. This feature of Linux VMM isolates each process's program sections from one another. But all processes in the system share the common kernel-space region. When a process needs a service from the kernel it must execute the kernel code in this region; in other words, the kernel performs work on behalf of the user process's request.
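To make the user-space view concrete, here is a small illustrative program (not part of the original answer) that prints its own memory map on Linux; the output shows the code, data, [heap], shared-library and [stack] regions described above:

#include <stdio.h>

int main(void)
{
    char line[512];
    FILE* f = fopen("/proc/self/maps", "r");   /* one line per mapping in this process */
    if (!f)
        return 1;
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}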

No address space for Linux Kernel threads

Why do Linux kernel threads not have an address space? For any task to execute, it should have a memory region, right? Where do the text and data of kernel threads go?
Kernel threads do have an address space. It's just that they all share the same one. This does not prevent them from each having a different stack.
Text and data are laid out in the kernel address space (the one that is shared by all the threads), depending on how and when they were allocated and what they are used for.
The Linux MM site has a lot of documentation about this aspect of Linux. Head over there.
I don't know the precise answer, because I'm not a Linux architect.
But in general, so-called kernel threads do have an address space: it is the address space which contains the kernel. It might not need to be explicitly represented for each kernel thread, since it is shared among many threads.
I'd expect any real thread implementation to have a machine context block containing register values (stack pointer, etc.) and a pointer to the address space in which the thread is supposed to run. Then a scheduler, when starting a ready thread, can easily determine whether the memory management unit is set up to enable access to that address space (and if not, set it up) so the thread can run in its desired space.
