mbind: how to uniformly interleave existing segment on all nodes? - linux

Using mbind, one can set the memory policy for a given mapped memory segment.
Q: How can I tell mbind to interleave a segment on all nodes?
If done after allocation but before usage, MPOL_INTERLEAVE on all nodes will do what we expect -- memory will be allocated uniformly on all nodes.
However, if the segment has already been written to and is allocated in e.g. node zero, there is no way to tell the kernel to uniformly interleave it on all NUMA nodes.
The operation simply becomes a no-op, as the kernel interprets it as "please place this segment on this set of nodes". Since we're passing the set of all NUMA nodes, no memory is allocated outside that set, so nothing needs to be moved.
Minimal, Complete, and Verifiable example
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <sys/syscall.h>
#include <numaif.h>
#include <numa.h>

#define N ((1<<29) / sizeof(int))
#define PAGE_SIZE sysconf(_SC_PAGESIZE)
#define PAGE_MASK (~(PAGE_SIZE - 1))

void print_command(char *cmd) {
    FILE *fp;
    char buf[1024];
    if ((fp = popen(cmd, "r")) == NULL) {
        perror("popen");
        exit(-1);
    }
    while (fgets(buf, sizeof(buf), fp) != NULL) {
        printf("%s", buf);
    }
    if (pclose(fp)) {
        perror("pclose");
        exit(-1);
    }
}

void print_node_allocations() {
    char buf[1024];
    snprintf(buf, sizeof(buf), "numastat -c %d", getpid());
    printf("\x1B[32m");
    print_command(buf);
    printf("\x1B[0m");
}

int main(int argc, char **argv) {
    int *a = numa_alloc_local(N * sizeof(int));
    size_t len = (N * sizeof(int)) & PAGE_MASK;
    unsigned long mymask = *numa_get_mems_allowed()->maskp;
    unsigned long maxnode = numa_get_mems_allowed()->size;

    // pin thread to core zero
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(syscall(SYS_gettid), sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        exit(-1);
    }

    // initialize array
    printf("\n\n(1) array allocated on local node\n");
    a[0] = 997;
    for (size_t i=1; i < N; i++) {
        a[i] = a[i-1] * a[i-1] % 1000000000;
    }
    print_node_allocations();

    // attempt to get it to be uniformly interleaved on all nodes
    printf("\n\n(2) array interleaved on all nodes\n");
    if (mbind(a, len, MPOL_INTERLEAVE, &mymask, maxnode, MPOL_MF_MOVE_ALL | MPOL_MF_STRICT) == -1) {
        perror("mbind failed");
        exit(-1);
    }
    print_node_allocations();

    // what if we interleave on all but the local node?
    printf("\n\n(3) array interleaved on all nodes (except local node)\n");
    mymask -= 0x01;
    if (mbind(a, len, MPOL_INTERLEAVE, &mymask, maxnode, MPOL_MF_MOVE_ALL | MPOL_MF_STRICT) == -1) {
        perror("mbind failed");
        exit(-1);
    }
    print_node_allocations();

    return 0;
}
Compiling and running with gcc -o interleave_all interleave_all.c -lnuma && sudo ./interleave_all yields:
(1) array allocated on local node
Per-node process memory usage (in MBs) for PID 20636 (interleave_all)
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ -----
Huge 0 0 0 0 0
Heap 0 0 0 0 0
Stack 0 0 0 0 0
Private 514 0 0 0 514
------- ------ ------ ------ ------ -----
Total 514 0 0 0 514
(2) array interleaved on all nodes
Per-node process memory usage (in MBs) for PID 20636 (interleave_all)
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ -----
Huge 0 0 0 0 0
Heap 0 0 0 0 0
Stack 0 0 0 0 0
Private 514 0 0 0 514
------- ------ ------ ------ ------ -----
Total 514 0 0 0 514
(3) array interleaved on all nodes (except local node)
Per-node process memory usage (in MBs) for PID 20636 (interleave_all)
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ -----
Huge 0 0 0 0 0
Heap 0 0 0 0 0
Stack 0 0 0 0 0
Private 2 171 171 171 514
------- ------ ------ ------ ------ -----
Total 2 171 171 171 514
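A hedged workaround sketch (not part of the original question): instead of relying on mbind, the already-touched pages can be migrated explicitly with numa_move_pages(3) (move_pages(2) under the hood), assigning target nodes round-robin. The helper name interleave_existing is just for illustration; the sketch assumes the a and len variables and the includes from the example above. MPOL_MF_MOVE only moves pages exclusive to the process; MPOL_MF_MOVE_ALL would also move shared pages but requires CAP_SYS_NICE, like the mbind calls above.

/* Sketch: migrate each page of an already-populated segment round-robin
 * across all configured NUMA nodes. */
static int interleave_existing(void *addr, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int nnodes = numa_num_configured_nodes();
    unsigned long count = len / page_size;

    void **pages = malloc(count * sizeof(void *));
    int *nodes   = malloc(count * sizeof(int));
    int *status  = malloc(count * sizeof(int));
    if (!pages || !nodes || !status)
        return -1;

    for (unsigned long i = 0; i < count; i++) {
        pages[i] = (char *)addr + i * page_size;  /* one entry per page */
        nodes[i] = i % nnodes;                    /* round-robin target node */
    }

    /* 0 == current process; MPOL_MF_MOVE_ALL would need CAP_SYS_NICE */
    if (numa_move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("numa_move_pages");

    free(pages);
    free(nodes);
    free(status);
    return 0;
}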

Related

Need help to detect extra malloc - CS50 Pset5

Valgrind says 0 bytes are lost, but it also shows one fewer free than mallocs.
Because I have used malloc in only one place, I'm posting only those segments and not all three files.
When loading a dictionary.txt file into a hash table:
bool load(const char *dictionary)
{
    FILE *dict_file = fopen(dictionary, "r");   // dictionary.c:54 (the line referenced by Valgrind below)
    if (dict_file == NULL)
        return false;
    int key;
    node *n = NULL;
    int mallocs = 0;
    while (1)
    {
        n = malloc(sizeof(node));
        printf("malloced: %i\n", ++mallocs);
        if (fscanf(dict_file, "%s", n->word) == -1)
        {
            printf("malloc freed\n");
            free(n);
            break;
        }
        key = hash(n->word);
        n->next = table[key];
        table[key] = n;
        words++;
    }
    return true;
}
And the Unloading part:
bool unload(void)
{
    int deleted = 0;
    node *n;
    for (int i = 0; i < N; i++)
    {
        n = table[i];
        while (n != NULL)
        {
            n = n->next;
            free(table[i]);
            table[i] = n;
            deleted++;
        }
    }
    printf("DELETED: %i", deleted);
    return true;
}
Check50 says there are memory leaks, but I can't understand where.
Command: ./speller dictionaries/small texts/cat.txt
==4215==
malloced: 1
malloced: 2
malloced: 3
malloced: 4
malloc freed
DELETED: 3
WORDS MISSPELLED: 2
WORDS IN DICTIONARY: 3
WORDS IN TEXT: 6
TIME IN load: 0.03
TIME IN check: 0.00
TIME IN size: 0.00
TIME IN unload: 0.00
TIME IN TOTAL: 0.03
==4215==
==4215== HEAP SUMMARY:
==4215== in use at exit: 552 bytes in 1 blocks
==4215== total heap usage: 9 allocs, 8 frees, 10,544 bytes allocated
==4215==
==4215== 552 bytes in 1 blocks are still reachable in loss record 1 of 1
==4215== at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4215== by 0x525AF29: __fopen_internal (iofopen.c:65)
==4215== by 0x525AF29: fopen@@GLIBC_2.2.5 (iofopen.c:89)
==4215== by 0x40114E: load (dictionary.c:54)
==4215== by 0x40095E: main (speller.c:40)
==4215==
==4215== LEAK SUMMARY:
==4215== definitely lost: 0 bytes in 0 blocks
.
.
.
==4215== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
speller.c has distribution code. I hope the rest of the question is clear and understandable.
The stream opened on the dictionary file (dict_file) needs to be closed. See man fclose.
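A minimal sketch of that fix, based only on the code shown above: close the stream once the dictionary has been read, just before load() returns.

bool load(const char *dictionary)
{
    FILE *dict_file = fopen(dictionary, "r");
    if (dict_file == NULL)
        return false;
    /* ... build the hash table exactly as before ... */
    fclose(dict_file);   /* releases the 552 bytes Valgrind reports as still reachable */
    return true;
}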

How to properly use if else statements and while loops with a child process in C

I'm new to C and I've been trying to create a program that takes a user-input integer and builds a sequence depending on whether the number is even or odd:
n / 2 if n is even
3 * n + 1 if n is odd
A new number will be computed until the sequence reaches 1. For example if a user inputs 35:
35, 106, 53, 160, 80, 40, 20, 10, 5, 16, 8, 4, 2, 1
For some reason my code doesn't work after the scanf statement in the child process. My code and sample output are below:
Code:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main()
{
    pid_t pid;
    int i = 0;
    int j = 0;

    /* fork a child process */
    pid = fork();

    if (pid < 0) { /* error occurred */
        fprintf(stderr, "Fork Failed\n");
        return 1;
    }
    else if (pid == 0) { /* child process */
        printf("I am the child %d\n", pid);
        printf("Enter a value: \n");
        scanf("%d", i);
        while (i < 0) {
            printf("%d is not a positive integer. Please try again.\n", i);
            printf("Enter a value: \n");
            scanf("%d", i);
        }
        // can add a print i here
        while (i != 1) {
            if (i % 2 == 0) { // if the inputted number is even
                j = i / 2;
            }
            else {
                j = 3 * i + 1;
            }
            printf("%d", j);
        }
    }
    else { /* parent process */
        /* parent will wait for the child to complete */
        printf("I am the parent %d\n", pid);
        wait(NULL); // wait(NULL) will wait for the child process to complete and takes the status code of the child process as a parameter
        printf("Child Complete\n");
    }
    return 0;
}
Output I'm getting on terminal in Linux (Debian):
oscreader@OSC:~/osc9e-src/ch3$ gcc newproc-posix.c
oscreader@OSC:~/osc9e-src/ch3$ ./a.out
I am the parent 16040
I am the child 0
Enter a value:
10
Child Complete
oscreader@OSC:~/osc9e-src/ch3$
Transferring comments into a semi-coherent answer.
Your calls to scanf() require a pointer argument; you give it an integer argument. Use scanf("%d", &i); — and it would be a good idea to check that scanf() returns 1 before testing the result.
My compiler told me about your bug. Why didn't your compiler do so too? Make sure you enable every warning you can! Your comment indicates that you're using gcc (or perhaps clang) — I routinely compile with:
gcc -std=c11 -O3 -g -Werror -Wall -Wextra -Wstrict-prototypes …
Indeed, for code from SO, I add -Wold-style-declaration -Wold-style-definition to make sure functions are declared and defined properly. It's often a good idea to add -pedantic to avoid accidental use of GCC extensions.
In the loop, you don't need j — you should be changing and printing i instead.
cz17.c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int i = 0;
    pid_t pid = fork();
    if (pid < 0)
    {
        fprintf(stderr, "Fork Failed\n");
        return 1;
    }
    else if (pid == 0)
    {
        printf("I am the child %d\n", pid);
        printf("Enter a value: \n");
        if (scanf("%d", &i) != 1)
        {
            fprintf(stderr, "failed to read an integer\n");
            return 1;
        }
        while (i <= 0 || i > 1000000)
        {
            printf("value %d out of range 1..1000000. Try again.\n", i);
            printf("Enter a value: \n");
            if (scanf("%d", &i) != 1)
            {
                fprintf(stderr, "failed to read an integer\n");
                return 1;
            }
        }
        while (i != 1)
        {
            if (i % 2 == 0)
            {
                i = i / 2;
            }
            else
            {
                i = 3 * i + 1;
            }
            printf(" %d", i);
            fflush(stdout);
        }
        putchar('\n');
    }
    else
    {
        printf("I am the parent of %d\n", pid);
        int status;
        int corpse = wait(&status);
        printf("Child Complete (%d - 0x%.4X)\n", corpse, status);
    }
    return 0;
}
Compilation:
gcc -O3 -g -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes cz17.c -o cz17
Sample output:
$ cz17
I am the parent of 41838
I am the child 0
Enter a value:
2346
1173 3520 1760 880 440 220 110 55 166 83 250 125 376 188 94 47 142 71 214 107 322 161 484 242 121 364 182 91 274 137 412 206 103 310 155 466 233 700 350 175 526 263 790 395 1186 593 1780 890 445 1336 668 334 167 502 251 754 377 1132 566 283 850 425 1276 638 319 958 479 1438 719 2158 1079 3238 1619 4858 2429 7288 3644 1822 911 2734 1367 4102 2051 6154 3077 9232 4616 2308 1154 577 1732 866 433 1300 650 325 976 488 244 122 61 184 92 46 23 70 35 106 53 160 80 40 20 10 5 16 8 4 2 1
Child Complete (41838 - 0x0000)
$

Registering Mapped Linux Character Device Memory with cudaHostRegister Results in Invalid Argument

I'm trying to boost DMA<->CPU<->GPU data transfer by:
1. Mapping memory allocated in the Linux kernel by my (proprietary) device driver to user space
2. Registering the latter (the mapped memory) with CUDA via the cudaHostRegister API function.
While user-space-allocated memory that is mapped for my device's DMA and then registered with CUDA via cudaHostRegister works just fine, trying to register "kmalloc"ed memory results in an "invalid argument" error returned by cudaHostRegister.
At first I thought the problem was alignment or my device driver's complicated memory pool management, so I wrote the simplest possible character device, which implements .mmap() and remaps a kzalloc'ed 10 KB buffer with remap_pfn_range, and the problem still stands.
Unfortunately, I did not find any similar questions on the net, so I sincerely hope I'll find an answer here.
Some system info and Kernel driver <-> user space app code + runtime log info:
CUDA : 8.0
OS Dist : Ubuntu 14.04
Kernel : 3.16.0-31-generic
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 770 Off | 0000:83:00.0 N/A | N/A |
| 26% 32C P8 N/A / N/A | 79MiB / 1997MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
Character device mmap() code:
#define MEM_CHUNK_SIZE 4 * _K
#define MEM_POOL_SIZE 10 * _K
/**/
static int chdv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned int pages_per_buf = ( MEM_CHUNK_SIZE >> PAGE_SHIFT ) ;
    unsigned long pfn, vsize;

    /*make sure the buffer is allocated*/
    if((NULL == g_membuff) &&
       (NULL == (g_membuff = kzalloc(MEM_POOL_SIZE , GFP_KERNEL))))
    {
        kdbgprintln("Error: Not enough memory");
        return -ENOMEM;
    }

    vsize = vma->vm_end - vma->vm_start ;
    kdbgprintln("MEM_CHUNK_SIZE %u, pages_per_buf %u, vsize %lu vma->vm_pgoff %lu",
                MEM_CHUNK_SIZE,
                pages_per_buf,
                vsize,
                vma->vm_pgoff);

    if(vsize > MEM_POOL_SIZE)
    {
        kdbgprintln("Error: vsize %lu > MEM_POOL_SIZE %u", vsize, MEM_POOL_SIZE);
        return -EINVAL;
    }

    /* We allow only mapping of one whole buffer so offset must be multiple
     * of pages_per_buf and size must be equal to dma_buf_size.
     */
    if( vma->vm_pgoff % pages_per_buf )
    {
        kdbgprintln("Error:Mapping DMA buffers is allowed only from beginning");
        return -EINVAL ;
    }

    vma->vm_flags = vma->vm_flags | (VM_DONTEXPAND | VM_LOCKED | VM_IO);

    /*Get the PFN for remap*/
    pfn = page_to_pfn(virt_to_page((unsigned char *)g_membuff));
    kdbgprintln("PFN : %lu", pfn);

    if(remap_pfn_range(vma, vma->vm_start, pfn, vsize, vma->vm_page_prot))
    {
        kdbgprintln("Error:Failed to remap memory");
        return -EINVAL;
    }

    /*Sealing data header & footer*/
    *((unsigned long *)g_membuff) = 0xCDFFFFFFFFFFFFAB;
    *((unsigned long *)g_membuff + 1) = 0xAB000000000000EF;
    *(unsigned long *)((unsigned char *)g_membuff + vsize - sizeof(unsigned long)) = 0xEF0000000C0000AA;

    kdbgprintln("Mapped 'kalloc' buffer" \
                "\n\t\tFirst 8 bytes: %lX" \
                "\n\t\tSecond 8 bytes: %lX" \
                "\n\t\tLast 8 bytes: %lX",
                *((unsigned long *)g_membuff),
                *((unsigned long *)g_membuff + 1),
                *(unsigned long *)((unsigned char *)g_membuff + vsize - sizeof(unsigned long)));

    return 0;
}
Test Application code:
static unsigned long map_mem_size;
int main(int argc, char** argv)
{
int fd;
const char dev_name[] = "/dev/chardev";
void * address = NULL;
long page_off = 0;
cudaError_t cudarc;
switch(argc)
{
case 2:
page_off = atoi(argv[1]) * getpagesize();
break;
default:
page_off = 0;
break;
}
map_mem_size = 2 * getpagesize();
printf("Opening %s file\n", dev_name);
errno = 0;
if(0 > (fd = open(dev_name, O_RDWR) ))
{
printf("Error %d - %s\n", errno, strerror(errno));
}
else
{
printf("About to map %lu bytes of %s device memory\n", map_mem_size, dev_name);
errno = 0;
if(MAP_FAILED == (address = mmap(NULL, map_mem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, page_off)))
{
printf("Error %d - %s\n", errno, strerror(errno));
}
else
{
printf("mapped %s driver 'kmalloc' memory" \
"\n\t\tFirst 8 bytes : %lX" \
"\n\t\tSecond 8 bytes: %lX" \
"\n\t\tLast 8 bytes: %lX\n",
dev_name,
*((unsigned long *)address),
*((unsigned long *)address + 1),
*(unsigned long *)((unsigned char *)address + map_mem_size - sizeof(unsigned long)));
if (cudaSuccess != (cudarc = cudaHostRegister(address, map_mem_size, cudaHostRegisterDefault)))
{
printf("Error: Failed cudaHostRegister: %s, address %p\n", cudaGetErrorString(cudarc), address);
}
}
}
/*Release resources block*/
return EXIT_SUCCESS;
}
Run time debug information:
User space:
./chrdev_test
Opening /dev/chardev file
About to map 8192 bytes of /dev/chardev device memory
mapped /dev/chardev driver 'kmalloc' memory
First 8 bytes : CDFFFFFFFFFFFFAB
Second 8 bytes: AB000000000000EF
Last 8 bytes: EF0000000C0000AA
Error: Failed cudaHostRegister: invalid argument
Unmapping /dev/chardev file
Closing /dev/chardev file
Kernel space (tail -f /var/log/syslog):
[ 4814.119537] [chardev] chardev.c, chdv_mmap, line 292:MEM_CHUNK_SIZE 4096, pages_per_buf 1, vsize 8192 vma->vm_pgoff 0
[ 4814.119538] [chardev] chardev.c, chdv_mmap, line 311:PFN : 16306184
[ 4814.119543] [chardev] chardev.c, chdv_mmap, line 330:Mapped 'kzalloced' buffer
[ 4814.119543] First 8 bytes: CDFFFFFFFFFFFFAB
[ 4814.119543] Second 8 bytes: AB000000000000EF
[ 4814.119543] Last 8 bytes: EF0000000C0000AA
Thanks ahead.
Made it work!
The full answer may be found in:
https://devtalk.nvidia.com/default/topic/1014391/cuda-programming-and-performance/registering-mapped-linux-character-device-memory-with-cudahostregister-results-in-invalid-argument/?offset=3#5174771
There is a problem with memory chunks longer than 2 pages (> 8 KB) working with CUDA...
Thanks,
Yoel.

MPI Reading from a text file

I am learning to program in MPI and I came across this question. Let's say I have a .txt file with 100,000 rows/lines; how do I chunk them for processing by 4 processors? I.e., I want processor 0 to take care of processing lines 0-25000, processor 1 to take care of lines 25001-50000, and so on. I did some searching and came across MPI_File_seek, but I am not sure whether it works on a .txt file and supports fscanf afterwards.
Text isn't a great format for parallel processing exactly because you don't know ahead of time where (say) line 25001 begins. So these sorts of problems are often dealt with ahead of time through some preprocessing step, either building an index or partitioning the file into the appropriate number of chunks for each process to read.
If you really want to do it through MPI, I'd suggest using MPI-IO to read overlapping chunks of the text file onto the various processors, where the overlap is much longer than you expect your longest line to be, and then have each processor agree on where to start; e.g., you could say that the first (or last) newline in the overlap region shared by processes N and N+1 is where process N leaves off and N+1 starts.
To follow this up with some code,
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

void parprocess(MPI_File *in, MPI_File *out, const int rank, const int size, const int overlap) {
    MPI_Offset globalstart;
    int mysize;
    char *chunk;

    /* read in relevant chunk of file into "chunk",
     * which starts at location in the file globalstart
     * and has size mysize
     */
    {
        MPI_Offset globalend;
        MPI_Offset filesize;

        /* figure out who reads what */
        MPI_File_get_size(*in, &filesize);
        filesize--;  /* get rid of text file eof */
        mysize = filesize/size;
        globalstart = rank * mysize;
        globalend   = globalstart + mysize - 1;
        if (rank == size-1) globalend = filesize-1;

        /* add overlap to the end of everyone's chunk except last proc... */
        if (rank != size-1)
            globalend += overlap;

        mysize = globalend - globalstart + 1;

        /* allocate memory */
        chunk = malloc( (mysize + 1)*sizeof(char));

        /* everyone reads in their part */
        MPI_File_read_at_all(*in, globalstart, chunk, mysize, MPI_CHAR, MPI_STATUS_IGNORE);
        chunk[mysize] = '\0';
    }

    /*
     * everyone calculate what their start and end *really* are by going
     * from the first newline after start to the first newline after the
     * overlap region starts (eg, after end - overlap + 1)
     */
    int locstart=0, locend=mysize-1;
    if (rank != 0) {
        while(chunk[locstart] != '\n') locstart++;
        locstart++;
    }
    if (rank != size-1) {
        locend-=overlap;
        while(chunk[locend] != '\n') locend++;
    }
    mysize = locend-locstart+1;

    /* "Process" our chunk by replacing non-space characters with '1' for
     * rank 1, '2' for rank 2, etc...
     */
    for (int i=locstart; i<=locend; i++) {
        char c = chunk[i];
        chunk[i] = ( isspace(c) ? c : '1' + (char)rank );
    }

    /* output the processed file */
    MPI_File_write_at_all(*out, (MPI_Offset)(globalstart+(MPI_Offset)locstart), &(chunk[locstart]), mysize, MPI_CHAR, MPI_STATUS_IGNORE);

    return;
}
int main(int argc, char **argv) {
    MPI_File in, out;
    int rank, size;
    int ierr;
    const int overlap = 100;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (argc != 3) {
        if (rank == 0) fprintf(stderr, "Usage: %s infilename outfilename\n", argv[0]);
        MPI_Finalize();
        exit(1);
    }

    ierr = MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDONLY, MPI_INFO_NULL, &in);
    if (ierr) {
        if (rank == 0) fprintf(stderr, "%s: Couldn't open file %s\n", argv[0], argv[1]);
        MPI_Finalize();
        exit(2);
    }

    ierr = MPI_File_open(MPI_COMM_WORLD, argv[2], MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &out);
    if (ierr) {
        if (rank == 0) fprintf(stderr, "%s: Couldn't open output file %s\n", argv[0], argv[2]);
        MPI_Finalize();
        exit(3);
    }

    parprocess(&in, &out, rank, size, overlap);

    MPI_File_close(&in);
    MPI_File_close(&out);

    MPI_Finalize();
    return 0;
}
Running this on a narrow version of the text of the question, we get
$ mpirun -n 3 ./textio foo.in foo.out
$ paste foo.in foo.out
Hi guys I am learning to 11 1111 1 11 11111111 11
program in MPI and I came 1111111 11 111 111 1 1111
across this question. Lets 111111 1111 111111111 1111
say I have a .txt file with 111 1 1111 1 1111 1111 1111
100,000 rows/lines, how do 1111111 11111111111 111 11
I chunk them for processing 1 11111 1111 111 1111111111
by 4 processors? i.e. I want 22 2 22222222222 2222 2 2222
to let processor 0 take care 22 222 222222222 2 2222 2222
of the processing for lines 22 222 2222222222 222 22222
0-25000, processor 1 to take 22222222 222222222 2 22 2222
care of 25001-50000 and so 2222 22 22222222222 222 22
on. I did some searching and 333 3 333 3333 333333333 333
did came across MPI_File_seek 333 3333 333333 3333333333333
but I am not sure can it work 333 3 33 333 3333 333 33 3333
on .txt and supports fscanf 33 3333 333 33333333 333333
afterwards. 33333333333

CUDA performance test

I'm writing a simple CUDA program for performance test.
This is not related to vector calculation, but just for a simple (parallel) string conversion.
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

#define UCHAR       unsigned char
#define UINT32      unsigned long int

#define CTX_SIZE    sizeof(aes_context)
#define DOCU_SIZE   4096
#define TOTAL       100000
#define BLOCK_SIZE  500

UCHAR pH_TXT[DOCU_SIZE * TOTAL];
UCHAR pH_ENC[DOCU_SIZE * TOTAL];
UCHAR* pD_TXT;
UCHAR* pD_ENC;

__global__
void TEST_Encode( UCHAR *a_input, UCHAR *a_output )
{
    UCHAR *input;
    UCHAR *output;

    input  = &(a_input[threadIdx.x * DOCU_SIZE]);
    output = &(a_output[threadIdx.x * DOCU_SIZE]);

    for ( int i = 0 ; i < 30 ; i++ ) {
        if ( (input[i] >= 'a') && (input[i] <= 'z') ) {
            output[i] = input[i] - 'a' + 'A';
        }
        else {
            output[i] = input[i];
        }
    }
}

int main(int argc, char** argv)
{
    struct cudaDeviceProp xCUDEV;
    cudaGetDeviceProperties(&xCUDEV, 0);

    // Prepare Source
    memset(pH_TXT, 0x00, DOCU_SIZE * TOTAL);
    for ( int i = 0 ; i < TOTAL ; i++ ) {
        strcpy((char*)pH_TXT + (i * DOCU_SIZE), "hello world, i need an apple.");
    }

    // Allocate vectors in device memory
    cudaMalloc((void**)&pD_TXT, DOCU_SIZE * TOTAL);
    cudaMalloc((void**)&pD_ENC, DOCU_SIZE * TOTAL);

    // Copy vectors from host memory to device memory
    cudaMemcpy(pD_TXT, pH_TXT, DOCU_SIZE * TOTAL, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = BLOCK_SIZE;
    int blocksPerGrid = (TOTAL + threadsPerBlock - 1) / threadsPerBlock;

    printf("Total Task is %d\n", TOTAL);
    printf("block size is %d\n", threadsPerBlock);
    printf("repeat cnt is %d\n", blocksPerGrid);

    TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);

    cudaMemcpy(pH_ENC, pD_ENC, DOCU_SIZE * TOTAL, cudaMemcpyDeviceToHost);

    // Free device memory
    if (pD_TXT) cudaFree(pD_TXT);
    if (pD_ENC) cudaFree(pD_ENC);

    cudaDeviceReset();
}
And when I change the BLOCK_SIZE value from 2 to 1000, I get the following duration times (from the NVIDIA Visual Profiler):
TOTAL    BLOCKS   BLOCK_SIZE   Duration (ms)
100000   50000    2            28.22
100000   10000    10           22.223
100000   2000     50           12.3
100000   1000     100          9.624
100000   500      200          10.755
100000   250      400          29.824
100000   200      500          39.67
100000   100      1000         81.268
My GPU is a GeForce GT 520 and its maximum threadsPerBlock value is 1024, so I predicted that I would get the best performance when BLOCK_SIZE is 1000, but the table above shows a different result.
I can't understand why the duration is not linear, and how I can fix this problem (or how I can find the optimal block size, i.e. the one with minimum duration).
Block sizes of 2, 10, or 50 threads don't utilize the capabilities of the GPU, since it is designed to run many more threads.
Your card has compute capability 2.1.
Maximum number of resident threads per multiprocessor = 1536
Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8
Warp size = 32
There are two issues:
1. You use so much register memory per thread that it will definitely be spilled to slow local memory space as your block size increases.
2. Perform your tests with multiples of 32, since this is the warp size of your card and many memory operations are optimized for thread counts that are multiples of the warp size.
So if you use only around 1024 (1000 in your case) threads per block, 33% of your GPU is idle, since only 1 block can be assigned per SM.
What happens if you use the following 100% occupancy sizes?
128 = 12 blocks -> since only 8 can be resident per SM, block execution is serialized
192 = 8 resident blocks per SM
256 = 6 resident blocks per SM
512 = 3 resident blocks per SM
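As a hedged follow-up (not part of the original answer): newer CUDA toolkits can suggest an occupancy-maximizing block size for a given kernel via cudaOccupancyMaxPotentialBlockSize, which is one way to find an "optimized block value" without sweeping by hand. A minimal sketch, assuming the TEST_Encode kernel, TOTAL, pD_TXT, and pD_ENC from the question:

// Ask the runtime for a block size that maximizes occupancy for TEST_Encode.
int minGridSize = 0;
int blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, TEST_Encode, 0, 0);
printf("suggested block size: %d (min grid size for full occupancy: %d)\n",
       blockSize, minGridSize);

// Launch with the suggested block size, covering all TOTAL documents.
int blocksPerGrid = (TOTAL + blockSize - 1) / blockSize;
TEST_Encode<<<blocksPerGrid, blockSize>>>(pD_TXT, pD_ENC);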
