Meaning of following syntax of cuda Kernel

Meaning of following syntax of cuda Kernel - visual-c++

What is meaning of following syntax:
Kernel_fun<<<256, 128, 2056>>>(arg1, arg2, arg3);
Which value indicates workgroup and which value indicates thread.

From the CUDA Programming Guide, appendix B.22 (as of May 2019):
The execution configuration is specified by inserting an expression of
the form <<< Dg, Db, Ns, S >>> between the function name and the
parenthesized argument list, where:
Dg is of type dim3 (see Section B.3.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number
of blocks being launched; Dg.z must be equal to 1 for devices of
compute capability 1.x;
Db is of type dim3 (see Section B.3.2) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the
number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in
addition to the statically allocated memory; this dynamically
allocated memory is used by any of the variables declared as an
external array as mentioned in Section B.2.3; Ns is an optional
argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
In short:
<<< number of blocks, number of threads, dynamic memory per block, associated stream >>>

Related

How to calculate space for number of records

I am trying to calculate space required by a dataset using below formula, but I am getting wrong somewhere when I cross check it with the existing dataset in the system. Please help me
1st Dataset:
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 32760
Number of records....: 51560
Using below formula to calculate
optimal block length (OBL) = 32760/record length = 32760/449 = 73
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 73*2 = 146
Find number of physical records (PR) = Number of records/TOBL = 51560/146 = 354
Number of tracks = PR/2 = 354/2 = 177
But I can below in the dataset information
Current Allocation
Allocated tracks . : 100
Allocated extents . : 1
Current Utilization
Used tracks . . . . : 100
Used extents . . . : 1
2nd Dataset :
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 27998
Number of Records....: 127,252
Using below formula to calculate
optimal block length (OBL) = 27998/record length = 27998/449 = 63
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 63*2 = 126
Find number of physical records (PR) = Number of records/TOBL = 127252/126 = 1010
Number of tracks = PR/2 = 1010/2 = 505
Number of Cylinders = 505/15 = 34
But I can below in the dataset information
Current Allocation
Allocated cylinders : 69
Allocated extents . : 1
Current Utilization
Used cylinders . . : 69
Used extents . . . : 1

A few observations on your approach.
First, since your dealing with records that are variable length it would be helpful to know the "average" record length as that would help to formulate a more accurate prediction of storage. Your approach assumes a worst case scenario of all records being at maximum which is fine for planning purposes but in reality you'll likely see the actual allocation would be lower if the average of the record lengths is lower than the maximum.
The approach you are taking is reasonable but consider that you can inform z/OS of the space requirements in blocks, records, DASD geometry or let DFSMS perform the calculation on your behalf. Refer to this article to get some additional information on options.
Back to your calculations:
You Optimum Block Length (OBL) is really a records per block (RPB) number. Block size divided maximum record length yields the number of records at full length that can be stored in the block. If your average record length is less then you can store more records per block.
The assumption of two blocks per track may be true for your situation but it depends on the actual device type that will be used for the underlying allocation. Here is a link to some of the geometries for supported DASD devices and their geometries.
Your assumption of two blocks per track depends on the device is not correct for 3390's as you would need 64k for two blocks on a track but as you can see the 3390's max out at 56k so you would only get one block per track on the device.
Also, it looks like you did factor in the RDW by adding 4 bytes but someone looking at the question might be confused if they are not familiar with V records on z/OS.In the case of your calculation that would be 61 records per block at 27998 (which is the "optimal block length" so two blocks can fit comfortable on a track).
I'll use the following values:
MaximumRecordLength = RecordLength + 4 for RDW
TotalRecords = Total Records at Maximum Length (worst case)
BlockSize = modeled blocksize
RecordsPerBlock = number of records that can fit in a block (worst case)
BlocksNeeded = number of blocks needed to contain estimated records (worst case)
BlocksPerTrack = from IBM device geometry information
TracksNeeded = TotalRecords / RecordsPerBlock / BlocksPerTrack
Cylinders = Device Tracks per cylinder (15 for most devices)
Example 1:
Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 * 1 = 717
Cylinders = 717 / 15 = 47.8 (48)
Example 2:
Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
Now, as to the actual allocation. It depends on how you allocated the space, the size of the records. Assuming it was in JCL you could use the RLSE subparameter of the SPACE= to release space when the is created and closed. This should release unused resources.
Given that the records are Variable the estimates are worst case and you would need to know more about the average record lengths to understand the actual allocation in terms of actual space used.
Final thought, all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe that most people today would specify a BLKSIZE=0 and let DFSMS do all of the hard work because that component has more information about where a file will go, what the underlying devices are and the most efficient way of doing the allocation. The days of disk geometry and allocation are more of a campfire story unless your environment has not been administered to do these things for you.

Instead of trying to calculate tracks or cylinders, go for MBs, or KBs. z/OS (DFSMS) will calculate for you, how many tracks or cylinders are needed.
In JCL it is not straight forward but also not too complicated, once you got it.
There is a DD statement parameter called AVGREC=, which is the trigger. Let me do an example for your first case above:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// REFCM=VB,LRECL=445,
// SPACE=(445,(51560,1000)),AVGREC=U
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
Parameter AVGREC=U (4) tells the system three things:
Firstly, the first subparameter in SPACE= (1) shall be interpreted as an average record length. (Note that this value is completely independend of the value specified in LRECL=.)
Secondly, it tells the system, that the second (2), and third (3) SPACE= subparameter are the number of records of average length (1) that the data set shall be able to store.
Thirdly, it tells the system that numbers (2), and (3) are in records (AVGREC=U). Alternatives are thousands (AVGREC=M), and millions (AVGREC=M).
So, this DD statement will allocate enough space to hold the estimated number of records. You don't have to care for track capacity, block capacity, device geometry, etc.
Given the number of records you expect and the (average) record length, you can easily calculate the number of kilobytes or megabytes you need. Unfortunately, you cannot directly specify KB, or MB in JCL, but there is a way using AVGREC= as follows.
Your first data set will get 51560 records of (maximum) length 445, i.e. 22'944'200 bytes, or ~22'945 KB, or ~23 MB. The JCL for an allocation in KB looks like this:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// REFCM=VB,LRECL=445,
// SPACE=(1,(22945,10000)),AVGREC=K
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
You want the system to allocate primary space for 22945 (2) thousands (4) records of length 1 byte (1), which is 22945 KB, and secondary space for 10'000 (3) thousands (4) records of length 1 byte (1), i.e. 10'000 KB.
Now the same alloation specifying MB:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// REFCM=VB,LRECL=445,
// SPACE=(1,(23,10)),AVGREC=M
//* | | | |
//* V V V V
//* (1) (2)(3) (4)
You want the system to allocate primary space for 23 (2) millions (4) records of length 1 byte (1), which is 23 MB, and secondary space for 10 (3) millions (4) records of length 1 byte (1), i.e. 10 MB.
I rarely use anything other than the latter.
In ISPF, it is even easier: Data Set Allocation (3.2) allows KB, and MB as space units (amongst all the old ones).

A useful and usually simpler alternative to using SPACE and AVGREC etc is to simply use a DATACLAS for space if your site has appropriate sized ones defined. If you look at ISMF Option 4 you can list available DATACLAS's and see what space values etc they provide. You'd expect to see a number of ranges in size, and some with or without Extended Format and/or Compression. Even if a DATACLAS overallocates a bit then it is likely the overallocated space will be released by the MGMTCLAS assigned to the dataset at close or during space management. And you do have an option to code DATACLAS AND SPACE in which case any coded space (or other) value will override the DATACLAS, which helps with exceptions. It still depends how your Storage Admin's have coded the ACS routines but generally Users are allowed to specify a DATACLAS and it will be honored by the ACS routines.
For basic dataset size calculation I just use LRECL times the expected Max Record Count divided by 1000 a couple of times to get a rough MB figure. Obviously variable records/blks add 4bytes each for RDW and/or BDW but unless the number of records is massive or DASD is extremely tight for space wise it shouldn't be significant enough to matter.
e.g.
=(51560*445)/1000/1000 shows as ~23MB
Also, don't expect your allocation to be exactly what you requested because the minimum allocation on Z/OS is 1 track or ~56k. The BLKSIZE also comes into effect by adding interblock gaps of ~32bytes per block. With SDB (system Determined Blocksize) invoked by omitting BLKSIZE or coding BLKSIZE=0, it will always try to provide half track blocking as close to 28k as possible so two blocks per track which is the most space efficient. That does matter, a BLKSIZE of 80bytes wastes ~80% of a track with interblock gaps. The BLKSIZE is also the unit of transfer when doing read/write to disk so generally the larger the better with some exceptions such as KSDS's being randomly access by key for example which might result in more data transfer than desired in an OLTP transaction.

how to make a non-initialised variable in Spin?

It seems that Promela initialises each variable (by default, to 0, or to the value that is given in the declaration).
How can I declare a variable that is initialised by an unknown value?
The documentation suggests if :: p = 0 :: p = 1 fi but I don't think
that it works: Spin still verifies this claim
bit p
init { if :: p = 0 :: p = 1 fi }
ltl { ! p }
(and falsifies p)
So what exactly is the semantics of init? There still is some
"pre-initial" state? How can I work around this - and not confuse my students?

This is an interesting question.
The documentation says that each and every variable is initialised to 0, unless the model specifies otherwise.
As with all variable declarations, an explicit initialization field is optional. The default initial value for all variables is zero. This applies both to scalar variables and to array variables, and it applies to both global and to local variables.
In your model, you don't initialise the variable when you declare it, therefore it is subsequently assigned to the value 0 in the initial state, which is located before your assignment:
bit p
init {
// THE INITIAL STATE IS HERE
if
:: p = 0
:: p = 1
fi
}
ltl { ! p }
Some Experiment.
A "naive" idea for dodging this limitation would be to modify the c source code of pan.c that is generated by spin when you invoke ~$ spin -a test.pml, so that the variable is initialised at random.
Instead of this initialisation function:
void
iniglobals(int calling_pid)
{
now.p = 0;
#ifdef VAR_RANGES
logval("p", now.p);
#endif
}
one could try writing this:
void
iniglobals(int calling_pid)
{
srand(time(NULL));
now.p = rand() % 2;
#ifdef VAR_RANGES
logval("p", now.p);
#endif
}
and adding an #include <time.h> in the header part.
However, once you compile that into a verifier with gcc pan.c, and you attempt to run it, you obtain non-deterministic behaviour depending on the initialization value of the variable p.
It can both determine that the property is violated:
~$ ./a.out -a
pan:1: assertion violated !( !( !(p))) (at depth 0)
pan: wrote test.pml.trail
(Spin Version 6.4.3 -- 16 December 2014)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim + (ltl_0)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 28 byte, depth reached 0, errors: 1
1 states, stored
0 states, matched
1 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
pan: elapsed time 0 seconds
or print that the property is satisfied:
~$ ./a.out -a
(Spin Version 6.4.3 -- 16 December 2014)
+ Partial Order Reduction
Full statespace search for:
never claim + (ltl_0)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 28 byte, depth reached 0, errors: 0
1 states, stored
0 states, matched
1 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in init
test.pml:8, state 5, "-end-"
(1 of 5 states)
unreached in claim ltl_0
_spin_nvr.tmp:8, state 8, "-end-"
(1 of 8 states)
pan: elapsed time 0 seconds
Clearly, the initial state of a promela model verified by spin is assumed to be unique. Afterall, that's a reasonable assumption, since it would needlessly complicate things: you can always replace N different initial states S_i with an initial state S s.t. S allows to reach each S_i with an epsilon-transition. In this context, what you get is not truly an epsilon-transition, but in practice it makes little difference.
EDIT (from comments):
In principle, it is possible to make this work by modifying pan.c a little bit further:
transform the initial state initialiser into a generator of initial states
modify the verification routine to take into account that more than one initial state might exist, and that the property must hold for each initial state
Having said this, it is likely not worth the hassle, unless this is done by patching Spin's source code.
Workaround.
If you want to state that something is true in the initial state, or starting from the initial state, and take into account some non-deterministic behaviour, then you should write something as follows:
bit p
bool init_state = false
init {
if
:: p = 0
:: p = 1
fi
init_state = true // TARGET STATE
init_state = false
}
ltl { init_state & ! p }
with which you get:
~$ ./a.out -a
pan:1: assertion violated !( !((initialised& !(p)))) (at depth 0)
pan: wrote 2.pml.trail
(Spin Version 6.4.3 -- 16 December 2014)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim + (ltl_0)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 28 byte, depth reached 0, errors: 1
1 states, stored
0 states, matched
1 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.291 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
pan: elapsed time 0 seconds
Init Semantics.
Init is simply guaranteed to be the first process to spawn, and is meant to be used for spawning other processes when, for-example, the other routines take as input some parameters, e.g. some resources are shared. More info here.
I believe that this fragment of documentation is a bit misleading:
The init process is most commonly used to initialize global variables,
and to instantiate other processes, through the use of the run
operator, before system execution starts. Any process, not just the
init process, can do so, though
Since it is possible to guarantee that the init process executes all of its code before any other process using the atomic { } statement, one could say that it can be used to initialize variables before they are used by other processes from the programming point of view. But that is just a rough approximation, because the init process does not correspond to a unique state in the execution model, but rather to the tree of states at the root and the root itself is given only by the global environment as it is before any process starts.

sizeof a structure vs the offset of last element of the structure

I have a C structure comprising of a couple of elements. When i type offsetof() on the last element of the structure, it shows 144 and sizeof() the last element is 4. So, I was assuming the size of the structure is 148. However, running sizeof() on the structure itself returns a value of 152. Is there something I am misinterpreting about the offset? Shouldn't the size of the structure be 148 instead of 152? Is there some sort of padding that is being applied to fit the byte? I am running on a 64bit platform on Ubuntu 14.
Struct A{
// few elements
};
Struct B{
Struct A;
<type> C;
<type> D;
// few more elements
};
element C is at a offset of 148(may be because padding is not applied), but the sizeof struct A is 152(quite possibly due to padding) and hence, when memcpy is performed, the value assigned to element C gets zeroed out.
UPDATE : Just verified sizeof() element C is 4 bytes and hence, it makes sense it gets included immediately after 148th offset.

There is no padding, al least not before or after the structure. However if you use different datatypes within the same structute sometimes an alignment is added for performance. This is a compiler settings and if you want you can disable it.

cmm call format for foreign primop (integer-gmp example)

I have been checking out integer-gmp source code to understand how foreign primops can be implemented in terms of cmm as documented on GHC Primops page. I am aware of techniques to implement them using llvm hack or fvia-C/gcc - this is more of a learning experience for me to understand this third approach that interger-gmp library uses.
So, I looked up CMM tutorial on MSFT page (pdf link), went through GHC CMM page, and still there are some unanswered questions (hard to keep all those concepts in head without digging into CMM which is what I am doing now). There is this code fragment from integer-bmp cmm file:
integer_cmm_int2Integerzh (W_ val)
{
W_ s, p; /* to avoid aliasing */
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);
p = Hp - SIZEOF_StgArrWords;
SET_HDR(p, stg_ARR_WORDS_info, CCCS);
StgArrWords_bytes(p) = SIZEOF_W;
/* mpz_set_si is inlined here, makes things simpler */
if (%lt(val,0)) {
s = -1;
Hp(0) = -val;
} else {
if (%gt(val,0)) {
s = 1;
Hp(0) = val;
} else {
s = 0;
}
}
/* returns (# size :: Int#,
data :: ByteArray#
#)
*/
return (s,p);
}
As defined in ghc cmm header:
W_ is alias for word.
ALLOC_PRIM_N is a function for allocating memory on the heap for primitive object.
Sp(n) and Hp(n) are defined as below (comments are mine):
#define WDS(n) ((n)*SIZEOF_W) //WDS(n) calculates n*sizeof(Word)
#define Sp(n) W_[Sp + WDS(n)]//Sp(n) points to Stackpointer + n word offset?
#define Hp(n) W_[Hp + WDS(n)]//Hp(n) points to Heap pointer + n word offset?
I don't understand lines 5-9 (line 1 is the start in case you have 1/0 confusion). More specifically:
Why is the function call format of ALLOC_PRIM_N (bytes,fun,arg) that way?
Why is p manipulated that way?
The function as I understand it (from looking at function signature in Prim.hs) takes an int, and returns a (int, byte array) (stored in s,p respectively in the code).
For anyone who is wondering about inline call in if block, it is cmm implementation of gmp mpz_init_si function. My guess is if you call a function defined in object file through ccall, it can't be inlined (which makes sense since it is object-code, not intermediate code - LLVM approach seems more suitable for inlining through LLVM IR). So, the optimization was to define a cmm representation of the function to be inlined. Please correct me if this guess is wrong.
Explanation of lines 5-9 will be very much appreciated. I have more questions about other macros defined in integer-gmp file, but it might be too much to ask in one post. If you can answer the question with a Haskell wiki page or a blog (you can post the link as answer), that would be much appreciated (and if you do, I would also appreciate step-by-step walk-through of an integer-gmp cmm macro such as GMP_TAKE2_RET1).

Those lines allocate a new ByteArray# on the Haskell heap, so to understand them you first need to know a bit about how GHC's heap is managed.
Each capability (= OS thread that executes Haskell code) has its own dedicated nursery, an area of the heap into which it makes normal, small allocations like this one. Objects are simply allocated sequentially into this area from low addresses to high addresses until the capability tries to make an allocation which exceeds the remaining space in the nursery, which triggers the garbage collector.
All heap objects are aligned to a multiple of the word size, i.e., 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.
The Cmm-level register Hp points to (the beginning of) the last word which has been allocated in the nursery. HpLim points to the last word which can be allocated in the nursery. (HpLim can also be set to 0 by another thread to stop the world for GC, or to send an asynchronous exception.)
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects has information on the layout of individual heap objects. Notably each heap object begins with an info pointer, which (among other things) identifies what sort of heap object it is.
The Haskell type ByteArray# is implemented with the heap object type ARR_WORDS. An ARR_WORDS object just consists of (an info pointer followed by) a size (in bytes) followed by arbitrary data (the payload). The payload is not interpreted by the GC, so it can't store pointers to Haskell heap objects, but it can store anything else. SIZEOF_StgArrWords is the size of the header common to all ARR_WORDS heap objects, and in this case the payload is just a single word, so SIZEOF_StgArrWords + WDS(1) is the amount of space we need to allocate.
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val) expands to something like
Hp = Hp + (SIZEOF_StgArrWords + WDS(1));
if (Hp > HpLim) {
HpAlloc = SIZEOF_StgArrWords + WDS(1);
goto stg_gc_prim_n(integer_cmm_int2Integerzh, val);
}
First line increases Hp by the amount to be allocated. Second line checks for heap overflow. Third line records the amount that we tried to allocate, so the GC can undo it. The fourth line calls the GC.
The fourth line is the most interesting. The arguments tell the GC how to restart the thread once garbage collection is done: it should reinvoke integer_cmm_int2Integerzh with argument val. The "_n" in stg_gc_prim_n (and the "_N" in ALLOC_PRIM_N) means that val is a non-pointer argument (in this case an Int#). If val were a pointer to a Haskell heap object, the GC needs to know that it is live (so it doesn't get collected) and to reinvoke our function with the new address of the object. In that case we'd use the _p variant. There are also variants like _pp for multiple pointer arguments, _d for Double# arguments, etc.
After line 5, we've successfully allocated a block of SIZEOF_StgArrWords + WDS(1) bytes and, remember, Hp points to its last word. So, p = Hp - SIZEOF_StgArrWords sets p to the beginning of this block. Lines 8 fills in the info pointer of p, identifying the newly-created heap object as ARR_WORDS. CCCS is the current cost-center stack, used only for profiling. When profiling is enabled each heap object contains an extra field that basically identifies who is responsible for its allocation. In non-profiling builds, there is no CCCS and SET_HDR just sets the info pointer. Finally, line 9 fills in the size field of the ByteArray#. The rest of the function fills in the payload and return the sign value and the ByteArray# object pointer.
So, this ended up being more about the GHC heap than about the Cmm language, but I hope it helps.

Required knowledge
In order to do arithmetic and logical operations computers have digital circuit called ALU (Arithmetic Logic Unit) in their CPU (Central Processing Unit). An ALU loads data from input registers. Processor register is memory storage in L1 cache (data requests within 3 CPU clock ticks) implemented in SRAM(Static Random-Access Memory) located in CPU chip. A processor often contains several kinds of registers, usually differentiated by the number of bits they can hold.
Numbers are expressed in discrete bits can hold finite number of values. Typically numbers have following primitive types exposed by the programming language (in Haskell):
8 bit numbers = 256 unique representable values
16 bit numbers = 65 536 unique representable values
32 bit numbers = 4 294 967 296 unique representable values
64 bit numbers = 18 446 744 073 709 551 616 unique representable values
Fixed-precision arithmetic for those types has been implemented in hardware. Word size refers to the number of bits that can be processed by a computer's CPU in one go. For x86 architecture this is 32 bits and x64 this is 64 bits.
IEEE 754 defines floating point numbers standard for {16, 32, 64, 128} bit numbers. For example 32 bit point number (with 4 294 967 296 unique values) can hold approximate values [-3.402823e38 to 3.402823e38] with accuracy of at least 7 floating point digits.
In addition
Acronym GMP means GNU Multiple Precision Arithmetic Library and adds support for software emulated arbitrary-precision arithmetic's. Glasgow Haskell Compiler Integer implementation uses this.
GMP aims to be faster than any other bignum library for all operand
sizes. Some important factors in doing this are:
Using full words as the basic arithmetic type.
Using different algorithms for different operand sizes; algorithms that are faster for very big numbers are usually slower for small
numbers.
Highly optimized assembly language code for the most important inner loops, specialized for different processors.
Answer
For some Haskell might have slightly hard to comprehend syntax so here is javascript version
var integer_cmm_int2Integerzh = function(word) {
return WORDSIZE == 32
? goog.math.Integer.fromInt(word))
: goog.math.Integer.fromBits([word.getLowBits(), word.getHighBits()]);
};
Where goog is Google Closure library class used is located in Math.Integer. Called functions :
goog.math.Integer.fromInt = function(value) {
if (-128 <= value && value < 128) {
var cachedObj = goog.math.Integer.IntCache_[value];
if (cachedObj) {
return cachedObj;
}
}
var obj = new goog.math.Integer([value | 0], value < 0 ? -1 : 0);
if (-128 <= value && value < 128) {
goog.math.Integer.IntCache_[value] = obj;
}
return obj;
};
goog.math.Integer.fromBits = function(bits) {
var high = bits[bits.length - 1];
return new goog.math.Integer(bits, high & (1 << 31) ? -1 : 0);
};
That is not totally correct as return type should be return (s,p); where
s is value
p is sign
In order to fix this GMP wrapper should be created. This has been done in Haskell to JavaScript compiler project (source link).
Lines 5-9
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);
p = Hp - SIZEOF_StgArrWords;
SET_HDR(p, stg_ARR_WORDS_info, CCCS);
StgArrWords_bytes(p) = SIZEOF_W;
Are as follows
allocates space as new word
creates pointer to it
set pointer value
set pointer type size

How can I determine if this write access is coalesced?

How can I determine if the following memory access is coalesced or not:
// Thread-ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset:
int offset = gridDim.x * blockDim.x;
while ( idx < NUMELEMENTS )
{
// Do Something
// ....
// Write to Array which contains results of calculations
results[ idx ] = df2;
// Next Element
idx += offset;
}
NUMELEMENTS is the complete number of single dataelements to process. The array results is passed as pointer to the kernel function and allocated before in global memory.
My Question: Is the write access in the line results[ idx ] = df2; coalesced?
I believe it is as each thread processes consecutive indexed items but I'm not completely sure about it & I don't know how to tell.
Thanks!

Depends if the length of the lines of your matrix is a multiple of half the warp size for devices of compute capability 1.x or a multiple of the warp size for devices of compute capability 2.x. If it is not you can use padding to make it fully coalesced. The function cudaMallocPitch can be used for this purpose.
edit:
Sorry for the confusion. You write 'offset' elements at a time which I interpreted as lines of a matrix.
What I mean is, after each iteration of your cycle you increase the idx by offset. If offset is a multiple of half the warp size for devices of compute capability 1.x or a multiple of the warp size for devices of compute capability 2.x then you it is coalesced, if not then you need padding to make it so.
Probably it is already coalesced because you should choose the number of threads per block and thus the blockDim as a multiple of the warp size.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string