CUDA equivalent of pragma omp task - multithreading

I am working on a problem where the amount of work per thread can vary drastically: one thread may handle 1,000,000 elements while another handles only 1 or 2. I stumbled upon this question, where the answer solves the unbalanced workload by using OpenMP tasks on the CPU, so my question is: can I achieve the same on CUDA?
In case you want more context:
The problem I'm trying to solve is this: I have n tuples, each with a starting point, an ending point, and a value.
(0, 3, 1), (3, 6, 2), (6, 10, 3), ...
For each tuple, I want to write its value to every position between the starting point and the ending point of an (initially empty) output array.
1, 1, 1, 2, 2, 2, 3, 3, 3, 3, ...
It is guaranteed that the ranges do not overlap.
My current approach is one thread per tuple, but since the start and end points can vary a lot, an imbalanced workload between threads might become a bottleneck for the program; it's probably rare, but it could very well happen.

The most common thread strategy I can think of in CUDA is to assign one thread per output point, and then have each thread do the work necessary to populate its output point.
For your stated objective (have each thread do roughly equal work) this is a useful strategy.
I will suggest using thrust for this. The basic idea is to:
1) determine the necessary size of the output based on the input
2) spin up a set of threads equal to the output size, where each thread determines its "insert index" in the output array by using a vectorized binary search on the input
3) with the insert index, insert the appropriate value in the output array.
I have used your data, the only change is that I changed the insert values from 1,2,3 to 5,2,7:
$ cat t1871.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef thrust::tuple<int,int,int> mt;
// returns selected item from tuple
struct my_cpy_functor1
{
    __host__ __device__ int operator()(mt d){ return thrust::get<1>(d); }
};

struct my_cpy_functor2
{
    __host__ __device__ int operator()(mt d){ return thrust::get<2>(d); }
};

int main(){

    mt my_data[] = {{0, 3, 5}, {3, 6, 2}, {6, 10, 7}};
    int ds = sizeof(my_data)/sizeof(my_data[0]); // determine data size
    int os = thrust::get<1>(my_data[ds-1]) - thrust::get<0>(my_data[0]); // and output size
    thrust::device_vector<mt> d_data(my_data, my_data+ds); // transfer data to device
    thrust::device_vector<int> d_idx(ds+1); // create index array for searching of insertion points
    thrust::transform(d_data.begin(), d_data.end(), d_idx.begin()+1, my_cpy_functor1()); // set index array
    thrust::device_vector<int> d_ins(os); // create array to hold insertion points
    thrust::upper_bound(d_idx.begin(), d_idx.end(), thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(os), d_ins.begin()); // identify insertion points
    thrust::transform(thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.begin(), _1 -1)), thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.end(), _1 -1)), d_ins.begin(), my_cpy_functor2()); // insert
    thrust::copy(d_ins.begin(), d_ins.end(), std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -o t1871 t1871.cu -std=c++14
$ ./t1871
5,5,5,2,2,2,7,7,7,7,
$
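For comparison, here is a minimal plain-CUDA sketch of the same one-thread-per-output-element idea without thrust. It assumes the tuple start points already sit in a sorted array (starts) with the values in a parallel array (vals); those names, the kernel, and the simple device-side binary search are my own illustration, not code from the answer above.
#include <cstdio>

// Each thread owns one output element: it binary-searches the sorted start
// offsets for the last start <= its position, then writes that tuple's value.
__global__ void expand(const int *starts, const int *vals, int n, int *out, int os)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= os) return;
    int lo = 0, hi = n;
    while (lo + 1 < hi) {
        int mid = (lo + hi) / 2;
        if (starts[mid] <= i) lo = mid;
        else                  hi = mid;
    }
    out[i] = vals[lo];
}

int main()
{
    const int n = 3, os = 10;
    int h_starts[n] = {0, 3, 6};
    int h_vals[n]   = {5, 2, 7};
    int *d_starts, *d_vals, *d_out;
    cudaMalloc((void **)&d_starts, sizeof(h_starts));
    cudaMalloc((void **)&d_vals,   sizeof(h_vals));
    cudaMalloc((void **)&d_out,    os * sizeof(int));
    cudaMemcpy(d_starts, h_starts, sizeof(h_starts), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals,   h_vals,   sizeof(h_vals),   cudaMemcpyHostToDevice);
    expand<<<(os + 255) / 256, 256>>>(d_starts, d_vals, n, d_out, os);
    int h_out[os];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < os; i++) printf("%d,", h_out[i]);
    printf("\n");
    return 0;
}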

Related

Logical Error in C++ Hexadecimal Converter Code

I've been working on this hexadecimal converter and there seems to be a logical error somewhere in the program. I've run it on Ubuntu using g++, and every time I run the program it gives me a massive heap of garbage values. I can't figure out the source of the garbage values, nor can I find the source of the logical error. I'm a newbie at programming, so please help me figure out my mistake.
#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    int bin[20],finhex[10],num,bc=0,i,j,k,l=0,r=10,n=1,binset=0,m=0;
    int hex[16]= {0000,0001,0010,0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111};
    char hexalph='A';
    cout<<"\nEnter your Number: ";
    cin>>num;
    while(num>0)
    {
        bin[bc]=num%2;
        num=num/2;
        bc++;
    }
    if(bc%4!=0)
        bc++;
    for(j=0;j<bc/4;j++)
        for(i=0;i<4;i++)
        {
            binset=binset+(bin[m]*pow(10,i));
            m++;
        }
        for(k=0;k<16;k++)
        {
            if(hex[k]==binset)
            {
                if(k<=9)
                    finhex[l]=k;
                else
                    while(n>0)
                    {
                        if(k==r)
                        {
                            finhex[l]=hexalph;
                            break;
                        }
                        else
                        {
                            hexalph++;
                            r++;
                        }
                    }
                l++;
                r=10;
                binset=0;
                hexalph='A';
                break;
            }
        }
        while(l>=0)
        {
            cout<<"\n"<<finhex[l];
            l--;
        }
    return 0;
}
int hex[16]= {0000,0001,0010,0011,0100,0101,0110,0111,1000,1001,1010,1011,1100,1101,1110,1111};
Allow me to translate those values into decimal for you:
int hex[16] = {0, 1, 8, 9, 64, 65, 72, 73, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111};
If you want them to be considered binary literals then you need to either specify them as such or put them in some other form that the compiler understands:
int hex[16] = {0b0000, 0b0001, 0b0010, 0b0011, 0b0100, 0b0101, 0b0110, 0b0111, 0b1000, 0b1001, 0b1010, 0b1011, 0b1100, 0b1101, 0b1110, 0b1111};
int hex[16] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};
While Ignacio Vazquez-Abrams rightly hinted at the fact that some of your initializers of hex are (due to the 0 prefix) octal constants, he overlooked that you chose the unusual, but possible, way of representing binary values as decimal constants containing only the digits 0 and 1. Thus, you only have to remove the 0 prefix from all constants greater than 7:
int hex[16] = {0000,0001,10,11,100,101,110,111,1000,1001,1010,1011,1100,1101,1110,1111};
Then, you stored the characters 'A' etc. in the int array finhex[] and output them with cout<<"\n"<<finhex[l] - but this way it is not A that gets printed, but its character code value, e.g. 65 in ASCII. In order to really output the character A etc., we could change the finhex array element type to char:
char bin[20],finhex[10]; int num,bc=0,i,j,k,l=0,r=10,n=1,binset=0,m=0;
- but consequently we also have to store the digits 0 to 9 as their character code values:
if (k<=9)
finhex[l]='0'+k;
Furthermore, with the lines
if(bc%4!=0)
bc++;
you rightly recognized the need to have a multiple of 4 bits for the conversion, but you overlooked that more than one bit could be missing, and also that the additional elements of bin[] are uninitialized, so change this to:
while (bc%4!=0) bin[bc++] = 0;
Besides, you omitted block braces around the (appropriately indented) two inner for loops; since C++ is not Python, the indentation has no significance, and without surrounding braces only the first of the indented for loops is nested inside the outer for loop.
The final while loop should be outdented and moved outside the big for loop. There's also an indexing error in it: the finhex array is indexed with an l that is one too high. You could change it to:
while (l--) cout<<finhex[l];
cout<<"\n";
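Putting these fixes together, a consolidated sketch of the corrected program could look as follows. This is my own assembly of the changes above, not code from the answer; it also replaces pow() with an integer place multiplier (to avoid float rounding) and collapses the while(n>0) letter search into a direct 'A' + (k - 10):
#include <iostream>
using namespace std;

int main()
{
    char bin[20], finhex[10];
    int num, bc = 0, i, j, k, l = 0, binset = 0, m = 0;
    // binary patterns written as decimal numbers containing only 0/1 digits
    int hex[16] = {0, 1, 10, 11, 100, 101, 110, 111,
                   1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111};

    cout << "\nEnter your Number: ";
    cin >> num;

    while (num > 0)                      // collect bits, least significant first
    {
        bin[bc++] = num % 2;
        num /= 2;
    }
    while (bc % 4 != 0) bin[bc++] = 0;   // pad with zero bits to a multiple of 4

    for (j = 0; j < bc / 4; j++)
    {
        int place = 1;
        for (i = 0; i < 4; i++)          // build the 0/1-digit pattern of this nibble
        {
            binset = binset + bin[m] * place;
            place *= 10;
            m++;
        }
        for (k = 0; k < 16; k++)
        {
            if (hex[k] == binset)
            {
                finhex[l++] = (k <= 9) ? '0' + k : 'A' + (k - 10);
                binset = 0;
                break;
            }
        }
    }
    while (l--) cout << finhex[l];       // most significant nibble was stored last
    cout << "\n";
    return 0;                            // (num == 0 still prints nothing, as in the original)
}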

How to sort a variable-length string array with radix sort?

I know that radix sort can sort same-length string arrays, but is it possible to do so with variable-length strings? If it is, what is the C-family code or pseudo-code to implement this?
It might not be a fast algorithm for variable-length strings, but radix sort is easy to implement, so it's useful if a sort needs to be coded quickly.
I'm not quite sure what you mean by "variable-length strings", but you can perform a binary MSB radix sort in place, so the length of the strings doesn't matter since there are no intermediate buckets.
#include <stdio.h>
#include <algorithm>

static void display(const char *str, int *data, int size)
{
    printf("%s: ", str);
    for (int v = 0; v < size; v++) {
        printf("%d ", data[v]);
    }
    printf("\n");
}

// In-place binary MSB radix sort: partition the range on the given bit,
// then recurse on both halves with the next lower bit.
static void sort(int *data, int size, int bit)
{
    if (bit < 0)   // stop only after bit 0 has been processed
        return;
    int b = 0;
    int e = size;
    if (size > 0) {
        while (b != e) {
            if (data[b] & (1u << bit)) {
                std::swap(data[b], data[--e]);
            }
            else {
                b++;
            }
        }
        sort(data, e, bit - 1);
        sort(data + b, size - b, bit - 1);
    }
}

int main()
{
    int data[] = { 13, 12, 22, 20, 3, 4, 14, 92, 11 };
    int size = sizeof(data) / sizeof(data[0]);
    display("Before", data, size);
    sort(data, size, sizeof(int)*8 - 1);
    display("After", data, size);
}
You can do a MSB-first radix sort on variable-length strings.
There are a couple non-obvious details:
Pass #N will partition (scatter) strings from the input vector into 256 partitions, according to strvec[i][N]. It then will scan the partitions in order, and put (reinsert) strings back into the input vector.
Now the slightly complicated bit...
When you reach the end of a string, it is in its final position, and should never be touched again. That splits the strings before and after it into separate RANGES. The result of each pass is a set of ranges of yet-unsorted rows.
That means that pass #N, after the first, scans the strings in each range, and stores the source range id (index) along with the string, in the partition. In the "reinsert" step, it puts the string back into its source range; and again, it generates a new set of unsorted-row ranges.
You keep the stable-sort bonus of radix sort, if you forward-scan the input ranges and then backward-scan the partitions and reinsert starting at the back of each source range.
You can also use recursion (doing a complete sort from scratch on any subrange) but the above saves on setup and is faster.
There are more details ... quicksort falls through to doing an insertion sort for tiny ranges (e.g. up to 16); radix sort benefits from doing the same.
Using multiple bytes as a partition index is possible. One approach for that is described in Radix Sort (Mischa Sandberg, 2010); there are other approaches.
Sorry I can't post code; it's now proprietary.
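Since that answer can't share its code, here is a small, independent sketch of the simpler recursive variant it mentions: one bucket per byte value, plus a bucket for "string ended here", and a fall-through to a comparison sort for tiny ranges. The function name and cutoff are illustrative, and it uses per-call buckets rather than the in-vector range bookkeeping described above.
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

// MSD radix sort of v[lo, hi) on the character at position `pos`.
// Bucket 0 holds strings that end at `pos` (they are already in final order);
// buckets 1..256 hold strings by their byte value at `pos`.
static void msd_radix_sort(std::vector<std::string> &v,
                           std::size_t lo, std::size_t hi, std::size_t pos)
{
    if (hi - lo < 2)
        return;
    if (hi - lo <= 16) {                      // tiny range: plain comparison sort
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::vector<std::vector<std::string>> buckets(257);
    for (std::size_t i = lo; i < hi; ++i) {
        std::size_t b = (pos < v[i].size()) ? (unsigned char)v[i][pos] + 1 : 0;
        buckets[b].push_back(std::move(v[i]));
    }
    std::size_t out = lo;
    for (std::size_t b = 0; b < buckets.size(); ++b) {
        std::size_t start = out;
        for (auto &s : buckets[b])
            v[out++] = std::move(s);
        if (b > 0)                            // strings that ended at pos need no more work
            msd_radix_sort(v, start, out, pos + 1);
    }
}

int main()
{
    std::vector<std::string> words = {"banana", "band", "a", "apple", "ban", "bandana"};
    msd_radix_sort(words, 0, words.size(), 0);
    for (const auto &w : words)
        std::cout << w << "\n";
}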

Can I assign a per-thread index, using pthreads?

I'm optimizing some instrumentation for my project (Linux, ICC, pthreads), and would like some feedback on this technique for assigning a unique index to a thread, so I can use it to index into an array of per-thread data.
The old technique uses a std::map based on pthread id, but I'd like to avoid locks and a map lookup if possible (it is creating a significant amount of overhead).
Here is my new technique:
static PerThreadInfo info[MAX_THREADS]; // shared, each index is per thread
// Allow each thread a unique sequential index, used for indexing into per
// thread data.
1:static size_t GetThreadIndex()
2:{
3: static size_t threadCount = 0;
4: __thread static size_t myThreadIndex = threadCount++;
5: return myThreadIndex;
6:}
later in the code:
// add some info per thread, so it can be aggregated globally
info[ GetThreadIndex() ] = MyNewInfo();
So:
1) It looks like line 4 could be a race condition if two threads were created at exactly the same time. If so, how can I avoid this (preferably without locks)? I can't see how an atomic increment would help here.
2) Is there a better way to create a per-thread index somehow? Maybe by pre-generating the TLS index on thread creation somehow?
1) An atomic increment would help here, actually: the possible race is two threads reading the counter and assigning the same ID to themselves, so making sure the read-modify-write (read number, add 1, store number) happens atomically fixes that race condition. On Intel a lock-prefixed xadd (fetch-and-add), which returns the previous value, would do the trick, or whatever your platform offers (like InterlockedIncrement() for Windows, for example); see the sketch after point 2.
2) Well, you could actually make the whole info thread-local ("__thread static PerThreadInfo info;"), provided your only aim is to be able to access the data per-thread easily and under a common name. If you actually want it to be a globally accessible array, then saving the index as you do using TLS is a very straightforward and efficient way to do this. You could also pre-compute the indexes and pass them along as arguments at thread creation, as Kromey noted in his post.
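For illustration, a minimal sketch of the fetch-and-add idea from point 1, using the GCC/ICC __sync_fetch_and_add builtin (my choice of builtin; C++11 std::atomic would work equally well). Since __thread variables cannot have dynamic initializers, the per-thread slot is filled in lazily via a sentinel value:
#include <stddef.h>

// Sketch only: a shared counter bumped atomically, cached in a per-thread slot.
static size_t GetThreadIndex()
{
    static size_t threadCount = 0;                      // shared across threads
    static __thread size_t myThreadIndex = (size_t)-1;  // per-thread; -1 = not yet assigned
    if (myThreadIndex == (size_t)-1)
        myThreadIndex = __sync_fetch_and_add(&threadCount, 1);  // returns the old value
    return myThreadIndex;
}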
Why so averse to using locks? Solving race conditions is exactly what they're designed for...
At any rate, you can use the 4th argument of pthread_create() to pass an argument to your threads' start routine; in this way, you could use your master process to generate an incrementing counter as it launches the threads, and pass this counter into each thread as it is created, giving you a unique index for each thread.
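For completeness, a minimal sketch of that pthread_create() approach (the worker function and array names are mine, purely for illustration; build with -pthread):
#include <pthread.h>
#include <stdio.h>

#define MAX_THREADS 4

static int info[MAX_THREADS];                  // per-thread slots, one per index

static void *worker(void *arg)
{
    int myIndex = (int)(long)arg;              // index passed by value through the void* argument
    info[myIndex] = myIndex * 10;              // each thread writes only its own slot
    printf("thread %d wrote %d\n", myIndex, info[myIndex]);
    return NULL;
}

int main(void)
{
    pthread_t tids[MAX_THREADS];
    for (long i = 0; i < MAX_THREADS; ++i)
        pthread_create(&tids[i], NULL, worker, (void *)i);   // 4th argument carries the index
    for (int i = 0; i < MAX_THREADS; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}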
I know you tagged this [pthreads], but you also mentioned the "old technique" of using std::map. This leads me to believe that you're programming in C++. In C++11 you have std::thread, and you can pass out unique indexes (ids) to your threads at thread creation time through an ordinary function parameter.
Below is an example HelloWorld that creates N threads, assigning each an index of 0 through N-1. Each thread does nothing but say "hi" and give its index:
#include <iostream>
#include <thread>
#include <mutex>
#include <vector>

inline void sub_print() {}

template <class A0, class ...Args>
void
sub_print(const A0& a0, const Args& ...args)
{
    std::cout << a0;
    sub_print(args...);
}

std::mutex&
cout_mut()
{
    static std::mutex m;
    return m;
}

template <class ...Args>
void
print(const Args& ...args)
{
    std::lock_guard<std::mutex> _(cout_mut());
    sub_print(args...);
}

void f(int id)
{
    print("This is thread ", id, "\n");
}

int main()
{
    const int N = 10;
    std::vector<std::thread> threads;
    for (int i = 0; i < N; ++i)
        threads.push_back(std::thread(f, i));
    for (auto i = threads.begin(), e = threads.end(); i != e; ++i)
        i->join();
}
My output:
This is thread 0
This is thread 1
This is thread 4
This is thread 3
This is thread 5
This is thread 7
This is thread 6
This is thread 2
This is thread 9
This is thread 8

Thread-safe random number generation for Monte-Carlo integration

I'm trying to write something which very quickly calculates random numbers and can be used on multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>

#define N 1000000000 /* As large as possible for increased accuracy */

double random_function(void);

int main(void)
{
    int i = 0;
    double X, Y;
    double count_inside_temp = 0.0, count_inside = 0.0;
    #pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
    {
        unsigned int th_id = omp_get_thread_num(); /* per-thread id, taken inside the parallel region */
        srand(th_id);
        #pragma omp for schedule(static)
        for (i = 0; i < N; i++) {
            X = 2.0 * random_function() - 1.0;
            Y = 2.0 * random_function() - 1.0;
            if ((X * X) + (Y * Y) < 1.0) {
                count_inside_temp += 1.0;
            }
        }
        #pragma omp atomic
        count_inside += count_inside_temp;
    }
    printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0) / N);
    return 0;
}

double random_function(void)
{
    return ((double) rand() / (double) RAND_MAX);
}
This works, but from observing a resource manager I know it's not using all the threads. Does rand() work for multithreaded code? And if not, is there a good alternative? Many thanks. Jack
Is rand() thread safe? Maybe, maybe not:
"The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test and good learning exercise would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which takes an integer as input and returns an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a starting number (the seed), use the output value in your application, and also feed it back in as the input for the next call to the random number generator. There are two common approaches to designing an API:
Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in a sort of global variable. Generating a new random number either will not be thread safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreaded code (you end up with some serialization around the state value).
An interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state; get_prng(state), which returns a random number and changes the state variable; and destroy_prng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread safe and run in parallel with no locking (because you are in charge of managing the (now thread-local) state variable).
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNGs in C which expose the state to your control. PRNG looks good, and using this (with initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this, libraries often have something like get_prng_array(state) functions which give you back an array full of random numbers, as if you had put get_prng in a loop filling the array elements - they just do it more quickly. This would be a second optimization (and would need an added for loop inside the parallel for loop). Obviously, you don't want to run out of per-thread stack space doing this!
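As a concrete illustration of the second, state-exposed style (my own example, not from the answer above): POSIX rand_r() takes its state as an explicit argument, so each OpenMP thread can keep a private seed and draw numbers with no shared state and no locking (compile with -fopenmp):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Each thread owns its seed, so rand_r() involves no shared state or locking. */
int main(void)
{
    const long trials = 10000000;
    double count_inside = 0.0;
    #pragma omp parallel reduction(+ : count_inside)
    {
        unsigned int seed = 1234u + omp_get_thread_num();  /* private per-thread state */
        #pragma omp for
        for (long i = 0; i < trials; i++) {
            double x = 2.0 * ((double) rand_r(&seed) / RAND_MAX) - 1.0;
            double y = 2.0 * ((double) rand_r(&seed) / RAND_MAX) - 1.0;
            if (x * x + y * y < 1.0)
                count_inside += 1.0;
        }
    }
    printf("Approximation to PI is = %.10f\n", count_inside * 4.0 / trials);
    return 0;
}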

In Perl module Proc::ProcessTable, why does pctcpu sometimes return 'inf', 'nan', or a value greater than 100?

The Perl module Proc::ProcessTable occasionally reports the pctcpu attribute as 'inf', 'nan', or a value greater than 100. Why does it do this? And are there any guidelines on how to deal with this kind of information?
We have observed this on various platforms including Linux 2.4 running on 8 logical processors.
I would guess that 'inf' or 'nan' is the result of some impossibly large value or a divide by zero.
For values greater than 100, could this possibly mean that more than one processor was used?
And for dealing with this information, is the best practice merely marking the data point as untrustworthy and normalizing to 100%?
I do not know why that happens and I cannot stress test the module right now trying to generate such cases.
However, a principle I have followed throughout my research is not to replace data I know to be nonsense with something that looks reasonable. You basically have missing observations and you should treat them as such. I would not attach a numerical value at all, so as not to pretend I have information when in fact I do not.
Then, your statistics for the non-missing points will be meaningful and you can look at any patterns in the missing observations separately.
UPDATE: Looking at the calc_prec() function in the source code:
/* calc_prec()
 *
 * calculate the two cpu/memory precentage values
 */
static void calc_prec(char *format_str, struct procstat *prs, struct obstack *mem_pool)
{
    float pctcpu = 100.0f * (prs->utime / 1e6) / (time(NULL) - prs->start_time);

    /* calculate pctcpu - NOTE: This assumes the cpu time is in microsecond units! */
    sprintf(prs->pctcpu, "%3.2f", pctcpu);
    field_enable(format_str, F_PCTCPU);

    /* calculate pctmem */
    if (system_memory > 0) {
        sprintf(prs->pctmem, "%3.2f", (float) prs->rss / system_memory * 100.f);
        field_enable(format_str, F_PCTMEM);
    }
}
First, IMHO, it would be better to just divide by 1e4 rather than multiplying by 100.0f after the division. Second, it is possible (if polled immediately after process start) for the time delta to be 0. Third, I would have just done the whole thing in double.
As an aside, this function looks like a good example of why you should not have comments in code.
#include <stdio.h>
#include <time.h>

volatile float calc_percent(
    unsigned long utime,
    time_t now,
    time_t start
) {
    return 100.0f * ( utime / 1e6) / (now - start);
}

int main(void) {
    printf("%3.2f\n", calc_percent(1e6, time(NULL), time(NULL)));
    printf("%3.2f\n", calc_percent(0, time(NULL), time(NULL)));
    return 0;
}
This outputs inf in the first case and nan in the second case when compiled with Cygwin gcc-4 on Windows. I do not know if this behavior is standard or just what happens with this particular combination of OS+compiler.
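For what it's worth, a guarded version along the lines suggested above (arithmetic in double, zero time delta handled explicitly) might look like this; the function name and the choice to report 0.0 for a zero delta are mine, not the module's:
#include <stdio.h>
#include <time.h>

/* CPU percentage computed in double, with the zero-elapsed-time case handled
   explicitly instead of producing inf/nan. */
static double pct_cpu(unsigned long utime_us, time_t now, time_t start)
{
    double elapsed = difftime(now, start);
    if (elapsed <= 0.0)
        return 0.0;                         /* polled right after process start */
    return (utime_us / 1e6) / elapsed * 100.0;
}

int main(void) {
    printf("%3.2f\n", pct_cpu(1000000, time(NULL), time(NULL)));  /* 0.00, not inf */
    printf("%3.2f\n", pct_cpu(0, time(NULL), time(NULL)));        /* 0.00, not nan */
    return 0;
}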
