Efficiently transfer large file (up to 2GB) to CUDA GPU?

Efficiently transfer large file (up to 2GB) to CUDA GPU? - io

I'm working on an a GPU accelerated program that requires the reading of an entire file of variable size. My question, what is the optimal number of bytes to read from a file and transfer to a coprocessor (CUDA device)?
These files could be as large as 2GiB, so creating a buffer of that size doesn't seem like the best idea.

You can cudaMalloc a buffer of the maximum size you can on your device. After this, copy over chunks of your input data of this size from host to device, process it, copy back the results and continue.
// Your input data on host
int hostBufNum = 5600000;
int* hostBuf = ...;
// Assume this is largest device buffer you can allocate
int devBufNum = 1000000;
int* devBuf;
cudaMalloc( &devBuf, sizeof( int ) * devBufNum );
int* hostChunk = hostBuf;
int hostLeft = hostBufNum;
int chunkNum = ( hostLeft < devBufNum ) ? hostLeft : devBufNum;
do
{
cudaMemcpy( devBuf, hostChunk, chunkNum * sizeof( int ) , cudaMemcpyHostToDevice);
doSomethingKernel<<< >>>( devBuf, chunkNum );
hostChunk = hostChunk + chunkNum;
hostLeft = hostBufNum - ( hostChunk - hostBuf );
} while( hostLeft > 0 );

If you can split your function up so you can work on chunks on the card, you should look into using streams (cudaStream_t).
If you schedule loads and kernel executions in several streams, you can have one stream load data while another executes a kernel on the card, thereby hiding some of the transfer time of your data in the execution of a kernel.
You need to declare a buffer of whatever your chunk size is times however many streams you declare (up to 16, for compute capability 1.x as far as I know).

Related

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for the bus-mastering DMA PCIe device implemented in FPGA.
The buffers should not be reserved at the boot time. Therefore, the buffer may be not contiguous.
The device supports scatter-gather (SG) operation, but for performance reasons, the addresses and lengths of consecutive contiguous segments of the buffer are stored inside the FPGA.
Therefore, usage of standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).
The right solution should allocate the buffer consisting of 2MB hugepages in the user space (reducing the maximum number of segments by factor of 512).
The virtual address of the buffer should be transferred to the kernel driver via ioctl. Then the addresses and the length of the segments should be calculated and written to the FPGA.
In theory, I could use get_user_pages to create the list of the pages, and then call sg_alloc_table_from_pages to obtain the SG list suitable to program the DMA engine in FPGA.
Unfortunately, in this approach I must prepare the intermediate list of page structures with length of 262144 pages per 1GB of the buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but anyway it would be good to avoid it.
In fact I don't need to keep the pages maped for the kernel, as the hugepages are protected against swapping out, and they are mapped for the user space application that will process the received data.
So what I'm looking for is a function sg_alloc_table_from_user_hugepages, that could take such a user-space address of the hugepages-based memory buffer, and transfer it directly into the right scatterlist, without performing unnecessary and memory-consuming mapping for the kernel.
Of course such a function should verify that the buffer indeed consists of hugepages.
I have found and read these posts: (A), (B), but couldn't find a good answer.
Is there any official method to do it in the current Linux kernel?

At the moment I have a very inefficient solution based on get_user_pages_fast:
int sgt_prepare(const char __user *buf, size_t count,
struct sg_table * sgt, struct page *** a_pages,
int * a_n_pages)
{
int res = 0;
int n_pages;
struct page ** pages = NULL;
const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE-1);
//Calculate number of pages
n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
printk(KERN_ALERT "n_pages: %d",n_pages);
//Allocate the table for pages
pages = vzalloc(sizeof(* pages) * n_pages);
printk(KERN_ALERT "pages: %p",pages);
if(pages == NULL) {
res = -ENOMEM;
goto sglm_err1;
}
//Now pin the pages
res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
printk(KERN_ALERT "gupf: %d",res);
if(res < n_pages) {
int i;
for(i=0; i<res; i++)
put_page(pages[i]);
res = -ENOMEM;
goto sglm_err1;
}
//Now create the sg-list
res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
printk(KERN_ALERT "satf: %d",res);
if(res < 0)
goto sglm_err2;
*a_pages = pages;
*a_n_pages = n_pages;
return res;
sglm_err2:
//Here we jump if we know that the pages are pinned
{
int i;
for(i=0; i<n_pages; i++)
put_page(pages[i]);
}
sglm_err1:
if(sgt) sg_free_table(sgt);
if(pages) kfree(pages);
* a_pages = NULL;
* a_n_pages = 0;
return res;
}
void sgt_destroy(struct sg_table * sgt, struct page ** pages, int n_pages)
{
int i;
//Free the sg list
if(sgt->sgl)
sg_free_table(sgt);
//Unpin pages
for(i=0; i < n_pages; i++) {
set_page_dirty(pages[i]);
put_page(pages[i]);
}
}
The sgt_prepare function builds the sg_table sgt structure that i can use to create the DMA mapping. I have verified that it contains the number of entries equal to the number of hugepages used.
Unfortunately, it requires that the list of the pages is created (allocated and returned via the a_pages pointer argument), and kept as long as the buffer is used.
Therefore, I really dislike that solution. Now I have 256 2MB hugepages used as a DMA buffer. It means that I have to create and keeep unnecessary 128*1024 page structures. I also waste 512 MB of kernel address space for unnecessary kernel mapping.
The interesting question is if the a_pages may be kept only temporarily (until the sg-list is created)? In theory it should be possible, as the pages are still locked...

Write data to an SD card through a buffer without a race condition

I am writing firmware for a data logging device. It reads data from sensors at 20 Hz and writes data to an SD card. However, the time to write data to SD card is not consistent (about 200-300 ms). Thus one solution is writing data to a buffer at a consistent rate (using a timer interrupt), and have a second thread that writes data to the SD card when the buffer is full.
Here is my naive implementation:
#define N 64
char buffer[N];
int count;
ISR() {
if (count < N) {
char a = analogRead(A0);
buffer[count] = a;
count = count + 1;
}
}
void loop() {
if (count == N) {
myFile.open("data.csv", FILE_WRITE);
int i = 0;
for (i = 0; i < N; i++) {
myFile.print(buffer[i]);
}
myFile.close();
count = 0;
}
}
The code has the following problems:
Writing data to the SD card is blocking reading when the buffer is full
It might have a race conditions.
What is the best way to solve this problem? Using a circular buffer, or double buffering? How do I ensure that a race condition does not happen?

You have rather answered your own question; you should use either double buffering or a circular buffer. Double-buffering is probably simpler to implement and appropriate for devices such as an SD card for which block-writes are generally more efficient.
Buffer length selection may need some consideration; generally you would make the buffer the same as the SD sector buffer size (typically 512 bytes), but that may not be practical, and with a sample rate as low as 20 sps, optimising SD write performance is perhaps not an issue.
Another consideration is that you need to match your sample rate to the file-system latency by selecting an appropriate buffer size. In this case the 64 sample buffer buffer will fill in a little more than three seconds, but the block write takes only up-to 300 ms - so you could use a much smaller buffer if required - 8 samples would be sufficient - although be careful, you may have observed latency of 300 ms, but it may be larger when specific boundaries are crossed in the physical flash memory - I have seen significant latency on some cards at 1 Mbyte boundaries - moreover card performance varies significantly between sizes and manufacturers.
An adaptation of your implementation with double-buffering is below. I have reduced the buffer length to 32 samples, but with double-buffering the total is unchanged at 64, but the write lag is reduced to 1.6 seconds.
// Double buffer and its management data/constants
static volatile char buffer[2][32];
static const int BUFFLEN = sizeof(buffer[0]);
static const unsigned char EMPTY = 0xff;
static volatile unsigned char inbuffer = 0;
static volatile unsigned char outbuffer = EMPTY;
ISR()
{
static int count = 0;
// Write to current buffer
char a = analogRead(A0);
buffer[inbuffer][count] = a;
count++ ;
// If buffer full...
if( count >= BUFFLEN )
{
// Signal to loop() that data available (not EMPTY)
outbuffer = inbuffer;
// Toggle input buffer
inbuffer = inbuffer == 0 ? 1 : 0;
count = 0;
}
}
void loop()
{
// If buffer available...
if( outbuffer != EMPTY )
{
// Write buffer
myFile.open("data.csv", FILE_WRITE);
for( int i = 0; i < BUFFLEN; i++)
{
myFile.print(buffer[outbuffer][i]);
}
myFile.close();
// Set the buffer to empty
outbuffer = EMPTY;
}
}
Note the use of volatile and unsigned char for the shared data. It is important that data shared between concurrent execution contexts is accessed explicitly and atomically; access to an int on 8-bit AVR based Arduino requires multiple machine instructions and the interrupt may occur part way through a read/write in loop() and cause an incorrect value to be read.

Linux: How to mmap a sequence of physically contiguous areas into user space?

In my driver I have certain number of physically contiguous DMA buffers (e.g. 4MB long each) to receive data from a device. They are handled by hardware using the SG list. As the received data will be subjected to intensive processing, I don't want to switch off cache and I will use dma_sync_single_for_cpu after each buffer is filled by DMA.
To simplify data processing, I want those buffers to appear as a single huge, contiguous, circular buffer in the user space.
In case of a single buffer I simply use remap_pfn_range or dma_mmap_coherent. However, I can't use those functions multiple times to map consecutive buffers.
Of course, I can implement the fault operation in the vm_operations so that it finds the pfn of the corresponding page in the right buffer, and inserts it into the vma with vm_insert_pfn.
The acquisition will be really fast, so I can't handle mapping when the real data arrive. But this can be solved easily. To have all mapping ready before the data acquisition starts, I can simply read the whole mmapped buffer in my application before starting the acquisition, so that all pages are already inserted when the first data arrive.
Tha fault based trick should work, but maybe there is something more elegant? Just a single function, that may be called multiple times to build the whole mapping incrementally?
Additional difficulty is that the solution should be applicable (with minimal adjustments) to kernels starting from 2.6.32 to the newest one.
PS. I have seen that annoying post. Is there a danger that if the application attempts to write something to the mmapped buffer (just doing the in place processing of data), my carefully built mapping will be destroyed by COW?

Below is my solution that works for buffers allocated with dmam_alloc_noncoherent.
Allocation of the buffers:
[...]
for(i=0;i<DMA_NOFBUFS;i++) {
ext->buf_addr[i] = dmam_alloc_noncoherent(&my_dev->dev, DMA_BUFLEN, &my_dev->buf_dma_t[i],GFP_USER);
if(my_dev->buf_addr[i] == NULL) {
res = -ENOMEM;
goto err1;
}
//Make buffer ready for filling by the device
dma_sync_single_range_for_device(&my_dev->dev, my_dev->buf_dma_t[i],0,DMA_BUFLEN,DMA_FROM_DEVICE);
}
[...]
Mapping of the buffers
void swz_mmap_open(struct vm_area_struct *vma)
{
}
void swz_mmap_close(struct vm_area_struct *vma)
{
}
static int swz_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
long offset;
char * buffer = NULL;
int buf_num = 0;
//Calculate the offset (according to info in https://lxr.missinglinkelectronics.com/linux+v2.6.32/drivers/gpu/drm/i915/i915_gem.c#L1195 it is better not ot use the vmf->pgoff )
offset = (unsigned long)(vmf->virtual_address - vma->vm_start);
buf_num = offset/DMA_BUFLEN;
if(buf_num > DMA_NOFBUFS) {
printk(KERN_ERR "Access outside the buffer\n");
return -EFAULT;
}
offset = offset - buf_num * DMA_BUFLEN;
buffer = my_dev->buf_addr[buf_num];
vm_insert_pfn(vma,(unsigned long)(vmf->virtual_address),virt_to_phys(&buffer[offset]) >> PAGE_SHIFT);
return VM_FAULT_NOPAGE;
}
struct vm_operations_struct swz_mmap_vm_ops =
{
.open = swz_mmap_open,
.close = swz_mmap_close,
.fault = swz_mmap_fault,
};
static int char_sgdma_wz_mmap(struct file *file, struct vm_area_struct *vma)
{
vma->vm_ops = &swz_mmap_vm_ops;
vma->vm_flags |= VM_IO | VM_RESERVED | VM_CAN_NONLINEAR | VM_PFNMAP;
swz_mmap_open(vma);
return 0;
}

Computing on variable length arrays in OpenCL

I am using OpenCL (Xcode, Intel GPU), and I am trying to implement a kernel that calculates moving averages and deviations. I want to pass several double arrays of varying lengths to the kernel. Is this possible to implement, or do I need to pad smaller arrays with zeroes so all the arrays are the same size?
I am new to OpenCL and GPGPU, so please forgive my ignorance of any nomenclature.

You can pass to the kernel any buffer, the kernel does not need to use it all.
For example, if your kernel reduces a buffer you can query at run time the amount of work items (items to reduce) using get_global_size(0). And then call the kernel with the proper parameters.
An example (unoptimized):
__kernel reduce_step(__global float* data)
{
int id = get_global_id(0);
int size = get_global_size(0);
int size2 = size/2;
int size2p = (size+1)/2;
if(id<size2) //Only reduce up to size2, the odd element will remain in place
data[id] += data[id+size2p];
}
Then you can call it like this.
void reduce_me(std::vector<cl_float>& data){
size_t size = data.size();
//Copy to a buffer already created, equal or bigger size than data size
// ... TODO, check sizes of buffer or change the buffer set to the kernel args.
queue.enqueueWriteBuffer(buffer,CL_FALSE,0,sizeof(cl_float)*size,data.data());
//Reduce until 1024
while(size > 1024){
queue.enqueueNDRangeKernel(reduce_kernel,cl::NullRange,cl::NDRange(size),cl::NullRange);
size /= 2;
}
//Read out and trim
queue.enqueueReadBuffer(buffer,CL_TRUE,0,sizeof(cl_float)*size,data.data());
data.resize(size);
}

Why would a VC++ program that is storing 5MB of data consume 64MB of system memory?

I have been working on trying to figure out why my program is consuming so much system RAM. I'm loading a file from disk into a vector of structs of several dynamically allocated arrays. A 16MB file ends up consuming 280MB of system RAM according to task manager. The types in the file are mostly chars with some shorts and a few longs. There are 331,000 records in the file containing on average about 5 fields. I converted the vector to a struct and that reduced the memory to about 255MB but that still seems very high. With the vector taking up so much memory the program is running out of memory so I need to find a way to get the memory usage more reasonable.
I wrote a simple program to just stuff a vector (or array) with 1,000,000 char pointers. I would expect it to allocate 4+1 bytes for each giving 5MB of memory required for storage, but in fact it is using 64MB (array version) or 67MB (vector version). When the program first starts up it only consumes 400K so why is there an additional 59MB for array or 62MB for vectors being allocated? This extra memory seems to be for each container, so if I create a size_check2 and copy everything and run it the program uses up 135MB for 10MB worth of pointers and data.
Thanks in advance,
size_check.h
#pragma once
#include <vector>
class size_check
{
public:
size_check(void);
~size_check(void);
typedef unsigned long size_type;
void stuff_me( unsigned int howMany );
private:
size_type** package;
// std::vector<size_type*> package;
size_type* me;
};
size_check.cpp
#include "size_check.h"
size_check::size_check(void)
{
}
size_check::~size_check(void)
{
}
void size_check::stuff_me( unsigned int howMany )
{
package = new size_type*[howMany];
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type;
*me = 33;
package[i] = me;
// package.push_back( me );
}
}
main.cpp
#include "size_check.h"
int main( int argc, char * argv[ ] )
{
const unsigned int buckets = 20;
const unsigned int size = 50000;
size_check* me[buckets];
for( unsigned int i = 0; i < buckets; ++i )
{
me[i] = new size_check();
me[i]->stuff_me( size );
}
printf( "done.\n" );
}

In my test using VS2010, a debug build had a working set size of 52,500KB. But a release build had a working set
size of 20,944KB.
Debug builds will usually use more memory than optimized builds due to the debug heap manager doing things like creating memory fences.
In release builds, I suspect that the heap manager reserves more memory than you are actually using as a performance optimization.

Memory Leak
package = new size_type[howMany]; // instantiate 50,000 size_type's
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type; // Leak: results in an extra 50k size_type's being instantiated
*me = 33;
package[i] = *me; // Set a non-pointer to what is at the address of pointer "me"
// Would package[i] = 33; not suffice?
}
Furthermore, make sure you've compiled in release mode

There might be a couple reasons why you're seeing such a large memory footprint from your test program. Inside your
void size_check::stuff_me( unsigned int howMany )
{
This method is always getting called with howMany = 50000.
package = new size_type[howMany];
Assuming this is on a 32-bit setup the above statement will allocate 50,000 * 4 bytes.
for( unsigned int i = 0; i < howMany; ++i )
{
size_type *me = new size_type;
The above will allocate new storage on each iteration of the loop. Since this loops 50,000 and the allocation never gets deleted that effectively takes up another 50,000 * 4 bytes upon loop completion.
*me = 33;
package[i] = *me;
}
}
Lastly, since stuff_me() gets called 20 times from main() your program would have allocated at least ~8Mbytes upon completion. If this is on a 64-bit system than the footprint will likely double since sizeof(long) == 8bytes.
The increase in memory consumption could have something to do with the way VS implements dynamic allocation. For performance reasons, it's possible that due to the multiple calls to new your program is reserving extra memory so as to avoid hitting up the OS everytime it needs more.
FYI, when I ran your test program on mingw-gcc 4.5.2, the memory consumption was ~20Mbytes -- much lower than what you were seeing but still a substantial amount. If I changed the stuff_me method to this:
void size_check::stuff_me( unsigned int howMany )
{
package = new size_type[howMany];
size_type *me = new size_type;
for( unsigned int i = 0; i < howMany; ++i )
{
*me = 33;
package[i] = *me;
}
delete me;
}
memory consumption goes down quite a bit down to ~4-5mbytes.

I think I found the answer by delving into the new statement. In debug builds there are two items that are created when you do a new. One is _CrtMemBlockHeader which is 32 bytes in length. The other is noMansLand (a memory fence) with a size of 4 bytes which gives us an overhead of 36 bytes for each new. In my case each individual new for a char was costing me 37 bytes. In release builds the memory usage is reduced to about 1/2 but I can't tell exactly how much is allocated for each new as I can't get to the new/malloc routine.
So my work around is to allocate a large block of memory to hold the file in memory. Then parse the memory image filling in a vector of pointers to the beginning of each of the records. Then on demand, I build a record from the memory image using the pointer to the beginning of the selected record. Doing this reduced the memory footprint to <25MB.
Thanks for all your help and suggestions.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string