cmm call format for foreign primop (integer-gmp example) - haskell

I have been checking out integer-gmp source code to understand how foreign primops can be implemented in terms of cmm as documented on GHC Primops page. I am aware of techniques to implement them using llvm hack or fvia-C/gcc - this is more of a learning experience for me to understand this third approach that interger-gmp library uses.
So, I looked up CMM tutorial on MSFT page (pdf link), went through GHC CMM page, and still there are some unanswered questions (hard to keep all those concepts in head without digging into CMM which is what I am doing now). There is this code fragment from integer-bmp cmm file:
integer_cmm_int2Integerzh (W_ val)
{
W_ s, p; /* to avoid aliasing */
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);
p = Hp - SIZEOF_StgArrWords;
SET_HDR(p, stg_ARR_WORDS_info, CCCS);
StgArrWords_bytes(p) = SIZEOF_W;
/* mpz_set_si is inlined here, makes things simpler */
if (%lt(val,0)) {
s = -1;
Hp(0) = -val;
} else {
if (%gt(val,0)) {
s = 1;
Hp(0) = val;
} else {
s = 0;
}
}
/* returns (# size :: Int#,
data :: ByteArray#
#)
*/
return (s,p);
}
As defined in ghc cmm header:
W_ is alias for word.
ALLOC_PRIM_N is a function for allocating memory on the heap for primitive object.
Sp(n) and Hp(n) are defined as below (comments are mine):
#define WDS(n) ((n)*SIZEOF_W) //WDS(n) calculates n*sizeof(Word)
#define Sp(n) W_[Sp + WDS(n)]//Sp(n) points to Stackpointer + n word offset?
#define Hp(n) W_[Hp + WDS(n)]//Hp(n) points to Heap pointer + n word offset?
I don't understand lines 5-9 (line 1 is the start in case you have 1/0 confusion). More specifically:
Why is the function call format of ALLOC_PRIM_N (bytes,fun,arg) that way?
Why is p manipulated that way?
The function as I understand it (from looking at function signature in Prim.hs) takes an int, and returns a (int, byte array) (stored in s,p respectively in the code).
For anyone who is wondering about inline call in if block, it is cmm implementation of gmp mpz_init_si function. My guess is if you call a function defined in object file through ccall, it can't be inlined (which makes sense since it is object-code, not intermediate code - LLVM approach seems more suitable for inlining through LLVM IR). So, the optimization was to define a cmm representation of the function to be inlined. Please correct me if this guess is wrong.
Explanation of lines 5-9 will be very much appreciated. I have more questions about other macros defined in integer-gmp file, but it might be too much to ask in one post. If you can answer the question with a Haskell wiki page or a blog (you can post the link as answer), that would be much appreciated (and if you do, I would also appreciate step-by-step walk-through of an integer-gmp cmm macro such as GMP_TAKE2_RET1).

Those lines allocate a new ByteArray# on the Haskell heap, so to understand them you first need to know a bit about how GHC's heap is managed.
Each capability (= OS thread that executes Haskell code) has its own dedicated nursery, an area of the heap into which it makes normal, small allocations like this one. Objects are simply allocated sequentially into this area from low addresses to high addresses until the capability tries to make an allocation which exceeds the remaining space in the nursery, which triggers the garbage collector.
All heap objects are aligned to a multiple of the word size, i.e., 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.
The Cmm-level register Hp points to (the beginning of) the last word which has been allocated in the nursery. HpLim points to the last word which can be allocated in the nursery. (HpLim can also be set to 0 by another thread to stop the world for GC, or to send an asynchronous exception.)
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects has information on the layout of individual heap objects. Notably each heap object begins with an info pointer, which (among other things) identifies what sort of heap object it is.
The Haskell type ByteArray# is implemented with the heap object type ARR_WORDS. An ARR_WORDS object just consists of (an info pointer followed by) a size (in bytes) followed by arbitrary data (the payload). The payload is not interpreted by the GC, so it can't store pointers to Haskell heap objects, but it can store anything else. SIZEOF_StgArrWords is the size of the header common to all ARR_WORDS heap objects, and in this case the payload is just a single word, so SIZEOF_StgArrWords + WDS(1) is the amount of space we need to allocate.
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val) expands to something like
Hp = Hp + (SIZEOF_StgArrWords + WDS(1));
if (Hp > HpLim) {
HpAlloc = SIZEOF_StgArrWords + WDS(1);
goto stg_gc_prim_n(integer_cmm_int2Integerzh, val);
}
First line increases Hp by the amount to be allocated. Second line checks for heap overflow. Third line records the amount that we tried to allocate, so the GC can undo it. The fourth line calls the GC.
The fourth line is the most interesting. The arguments tell the GC how to restart the thread once garbage collection is done: it should reinvoke integer_cmm_int2Integerzh with argument val. The "_n" in stg_gc_prim_n (and the "_N" in ALLOC_PRIM_N) means that val is a non-pointer argument (in this case an Int#). If val were a pointer to a Haskell heap object, the GC needs to know that it is live (so it doesn't get collected) and to reinvoke our function with the new address of the object. In that case we'd use the _p variant. There are also variants like _pp for multiple pointer arguments, _d for Double# arguments, etc.
After line 5, we've successfully allocated a block of SIZEOF_StgArrWords + WDS(1) bytes and, remember, Hp points to its last word. So, p = Hp - SIZEOF_StgArrWords sets p to the beginning of this block. Lines 8 fills in the info pointer of p, identifying the newly-created heap object as ARR_WORDS. CCCS is the current cost-center stack, used only for profiling. When profiling is enabled each heap object contains an extra field that basically identifies who is responsible for its allocation. In non-profiling builds, there is no CCCS and SET_HDR just sets the info pointer. Finally, line 9 fills in the size field of the ByteArray#. The rest of the function fills in the payload and return the sign value and the ByteArray# object pointer.
So, this ended up being more about the GHC heap than about the Cmm language, but I hope it helps.

Required knowledge
In order to do arithmetic and logical operations computers have digital circuit called ALU (Arithmetic Logic Unit) in their CPU (Central Processing Unit). An ALU loads data from input registers. Processor register is memory storage in L1 cache (data requests within 3 CPU clock ticks) implemented in SRAM(Static Random-Access Memory) located in CPU chip. A processor often contains several kinds of registers, usually differentiated by the number of bits they can hold.
Numbers are expressed in discrete bits can hold finite number of values. Typically numbers have following primitive types exposed by the programming language (in Haskell):
8 bit numbers = 256 unique representable values
16 bit numbers = 65 536 unique representable values
32 bit numbers = 4 294 967 296 unique representable values
64 bit numbers = 18 446 744 073 709 551 616 unique representable values
Fixed-precision arithmetic for those types has been implemented in hardware. Word size refers to the number of bits that can be processed by a computer's CPU in one go. For x86 architecture this is 32 bits and x64 this is 64 bits.
IEEE 754 defines floating point numbers standard for {16, 32, 64, 128} bit numbers. For example 32 bit point number (with 4 294 967 296 unique values) can hold approximate values [-3.402823e38 to 3.402823e38] with accuracy of at least 7 floating point digits.
In addition
Acronym GMP means GNU Multiple Precision Arithmetic Library and adds support for software emulated arbitrary-precision arithmetic's. Glasgow Haskell Compiler Integer implementation uses this.
GMP aims to be faster than any other bignum library for all operand
sizes. Some important factors in doing this are:
Using full words as the basic arithmetic type.
Using different algorithms for different operand sizes; algorithms that are faster for very big numbers are usually slower for small
numbers.
Highly optimized assembly language code for the most important inner loops, specialized for different processors.
Answer
For some Haskell might have slightly hard to comprehend syntax so here is javascript version
var integer_cmm_int2Integerzh = function(word) {
return WORDSIZE == 32
? goog.math.Integer.fromInt(word))
: goog.math.Integer.fromBits([word.getLowBits(), word.getHighBits()]);
};
Where goog is Google Closure library class used is located in Math.Integer. Called functions :
goog.math.Integer.fromInt = function(value) {
if (-128 <= value && value < 128) {
var cachedObj = goog.math.Integer.IntCache_[value];
if (cachedObj) {
return cachedObj;
}
}
var obj = new goog.math.Integer([value | 0], value < 0 ? -1 : 0);
if (-128 <= value && value < 128) {
goog.math.Integer.IntCache_[value] = obj;
}
return obj;
};
goog.math.Integer.fromBits = function(bits) {
var high = bits[bits.length - 1];
return new goog.math.Integer(bits, high & (1 << 31) ? -1 : 0);
};
That is not totally correct as return type should be return (s,p); where
s is value
p is sign
In order to fix this GMP wrapper should be created. This has been done in Haskell to JavaScript compiler project (source link).
Lines 5-9
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);
p = Hp - SIZEOF_StgArrWords;
SET_HDR(p, stg_ARR_WORDS_info, CCCS);
StgArrWords_bytes(p) = SIZEOF_W;
Are as follows
allocates space as new word
creates pointer to it
set pointer value
set pointer type size

Related

Optimizing find_first_not_of with SSE4.2 or earlier

I am writing a textual packet analyzer for a protocol and in optimizing it I found that a great bottleneck is the find_first_not_of call.
In essence, I need to find if a packet is valid if it contains only valid characters, faster than the default C++ function.
For instance, if all allowed characters are f, h, o, t, and w, in C++ I would just call s.find_first_not_of("fhotw"), but in SSEx I have no clue after loading the string in a set of __m128i variables.
Apparently, the _mm_cmpXstrY functions documentation is not really helping me in this. (e.g. _mm_cmpistri). I could at first subtract with _mm_sub_epi8, but I don't think it would be a great idea.
Moreover, I am stuck with SSE (any version).
This article by Wojciech Muła describes a SSSE3 algorithm to accept/reject any given byte value.
(Contrary to the article, saturated arithmetic should be used to conduct range checks, but we don't have ranges.)
SSE4.2 string functions are often slower** than hand-crafted alternatives. For example, 3 uops, 3 cycle throughput on Skylake for pcmpistri, the fastest of the SSE4.2 string instructions. vs. 1 shuffle and 1 pcmpeqb per 16 bytes of input with this, with SIMD AND and movemask to combine results. Plus some load and register-copy instructions, but still very likely faster than 1 vector per 3 cycles. Doesn't quite as easily handle short 0-terminated strings, though; SSE4.2 is worth considering if you also need to worry about that, instead of known-size blocks that are a multiple of the vector width.
For "fhotw" specifically, try:
#include <tmmintrin.h> // pshufb
bool is_valid_64bytes (uint8_t* src) {
const __m128i tab = _mm_set_epi8('o','_','_','_','_','_','_','h',
'w','f','_','t','_','_','_','_');
__m128i src0 = _mm_loadu_si128((__m128i*)&src[0]);
__m128i src1 = _mm_loadu_si128((__m128i*)&src[16]);
__m128i src2 = _mm_loadu_si128((__m128i*)&src[32]);
__m128i src3 = _mm_loadu_si128((__m128i*)&src[48]);
__m128i acc;
acc = _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src0), src0);
acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src1), src1));
acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src2), src2));
acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src3), src3));
return !!(((unsigned)_mm_movemask_epi8(acc)) == 0xFFFF);
}
Using the low 4 bits of the data, we can select a byte from our set that has that low nibble value. e.g. 'o' (0x6f) goes in the high byte of the table so input bytes of the form 0x?f try to match against it. i.e. it's the first element for _mm_set_epi8, which goes from high to low.
See the full article for variations on this technique for other special / more general cases.
**If the search is very simple (doesn't need the functionality of string instructions) or very complex (needs at least two string instructions) then it doesn't make much sense to use the string functions. Also the string instructions don't scale to the 256-bit width of AVX2.

What does flatten_parameters() do?

I saw many Pytorch examples using flatten_parameters in the forward function of the RNN
self.rnn.flatten_parameters()
I saw this RNNBase and it is written that it
Resets parameter data pointer so that they can use faster code paths
What does that mean?
It may not be a full answer to your question. But, if you give a look at the flatten_parameters's source code , you will notice that it calls _cudnn_rnn_flatten_weight in
...
NoGradGuard no_grad;
torch::_cudnn_rnn_flatten_weight(...)
...
is the function that does the job. You will find that what it actually does is copying the model's weights into a vector<Tensor> (check the params_arr declaration) in:
// Slice off views into weight_buf
std::vector<Tensor> params_arr;
size_t params_stride0;
std::tie(params_arr, params_stride0) = get_parameters(handle, rnn, rnn_desc, x_desc, w_desc, weight_buf);
MatrixRef<Tensor> weight{weight_arr, static_cast<size_t>(weight_stride0)},
params{params_arr, params_stride0};
And the weights copying in
// Copy weights
_copyParams(weight, params);
Also note that they update (or Reset as they explicitly say in docs) the original pointers of weights with the new pointers of params by doing an in-place operation .set_ (_ is their notation for the in-place operations) in orig_param.set_(new_param.view_as(orig_param));
// Update the storage
for (size_t i = 0; i < weight.size(0); i++) {
for (auto orig_param_it = weight[i].begin(), new_param_it = params[i].begin();
orig_param_it != weight[i].end() && new_param_it != params[i].end();
orig_param_it++, new_param_it++) {
auto orig_param = *orig_param_it, new_param = *new_param_it;
orig_param.set_(new_param.view_as(orig_param));
}
}
And according to n2798 (draft of C++0x)
©ISO/IECN3092
23.3.6 Class template vector
A vector is a sequence container that supports random access iterators. In addition, it supports (amortized)constant time insert and erase operations at the end; insert and erase in the middle take linear time. Storage management is handled automatically, though hints can be given to improve efficiency. The elements of a vector are stored contiguously, meaning that if v is a vector <T, Allocator> where T is some type other than bool, then it obeys the identity&v[n] == &v[0] + n for all 0 <= n < v.size().
In some situations
UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greately increasing memory usage. To compact weights again call flatten_parameters().
They explicitly advise people in code warnings to have a contiguous chunk of memory.

Node: Generate 6 digits random number using crypto.randomBytes

What is the correct way to generate exact value from 0 to 999999 randomly since 1000000 is not a power of 2?
This is my approach:
use crypto.randomBytes to generate 3 bytes and convert to hex
use the first 5 characters to convert to integer (max is fffff == 1048575 > 999999)
if the result > 999999, start from step 1 again
It will somehow create a recursive function. Is it logically correct and will it cause a concern of performance?
There are several way to extract random numbers in a range from random bits. Some common ones are described in NIST Special Publication 800-90A revision 1: Recommendation for Random Number Generation Using Deterministic Random Bit Generators
Although this standard is about deterministic random bit generations there is a helpful appendix called A.5 Converting Random Bits into a Random Number which describes three useful methods.
The methods described are:
A.5.1 The Simple Discard Method
A.5.2 The Complex Discard Method
A.5.3 The Simple Modular Method
The first two of them are not deterministic with regards to running time but generate a number with no bias at all. They are based on rejection sampling.
The complex discard method discusses a more optimal scheme for generating large quantities of random numbers in a range. I think it is too complex for almost any normal use; I would look at the Optimized Simple Discard method described below if you require additional efficiency instead.
The Simple Modular Method is time constant and deterministic but has non-zero (but negligible) bias. It requires a relatively large amount of additional randomness to achieve the negligible bias though; basically to have a bias of one out of 2^128 you need 128 bits on top of the bit size of the range required. This is probably not the method to choose for smaller numbers.
Your algorithm is clearly a version of the Simple Discard Method (more generally called "rejection sampling"), so it is fine.
I've myself thought of a very efficient algorithm based on the Simple Discard Method called the "Optimized Simple Discard Method" or RNG-BC where "BC" stands for "binary compare". It is based on the observation that comparison only looks at the most significant bits, which means that the least significant bits should still be considered random and can therefore be reused. Beware that this method has not been officially peer reviewed; I do present an informal proof of equivalence with the Simple Discard Method.
Of course you should rather use a generic method that is efficient given any value of N. In that case the Complex Discard Method or Simple Modular Method should be considered over the Simple Discard Method. There are other, much more complex algorithms that are even more efficient, but generally you're fine when using either of these two.
Note that it is often beneficial to first check if N is a power of two when generating a random in the range [0, N). If N is a power of two then there is no need to use any of these possibly expensive computations; just use the bits you need from the random bit or byte generator.
It's a correct algorithm (https://en.wikipedia.org/wiki/Rejection_sampling), though you could consider using bitwise operations instead of converting to hex. It can run forever if the random number generator is malfunctioning -- you could consider trying a fixed number of times and then throwing an exception instead of looping forever.
The main possible performance problem is that on some platforms, crypto.randomBytes can block if it runs out of entropy. So you don't want to waste any randomness if you're using it.
Therefore instead of your string comparison I would use the following integer operation.
if (random_bytes < 16700000) {
return random_bytes = random_bytes - 100000 * Math.floor(random_bytes/100000);
}
This has about a 99.54% chance of producing an answer from the first 3 bytes, as opposed to around 76% odds with your approach.
I would suggest the following approach:
private generateCode(): string {
let code: string = "";
do {
code += randomBytes(3).readUIntBE(0, 3);
// code += Number.parseInt(randomBytes(3).toString("hex"), 16);
} while (code.length < 6);
return code.slice(0, 6);
}
This returns the numeric code as string, but if it is necessary to get it as a number, then change to return Number.parseInt(code.slice(0, 6))
I call it the random_6d algo. Worst case just a single additional loop.
var random_6d = function(n2){
var n1 = crypto.randomBytes(3).readUIntLE(0, 3) >>> 4;
if(n1 < 1000000)
return n1;
if(typeof n2 === 'undefined')
return random_6d(n1);
return Math.abs(n1 - n2);
};
loop version:
var random_6d = function(){
var n1, n2;
while(true){
n1 = crypto.randomBytes(3).readUIntLE(0, 3) >>> 4;
if(n1 < 1000000)
return n1;
if(typeof n2 === 'undefined')
n2 = n1;
else
return Math.abs(n1 - n2);
};
};

Haskell and laziness: Store often used values as members or just calculate on the fly?

In Haskell, regarding its lazy nature, is it better to store frequently calculated values as data members, or is it safe, in standard, non-extended Haskell, to not do so.
To get a better grasp of what I mean, an example:
data Image = Image { size :: Int, inverse_size :: RealFrac }
new_image size = Image { size = size, inverse_size = 1.0 / fromIntegral size }
or shall I just declare it where it is used, assuming that function is called very often:
data Image = Image { size :: Int } -- no tainting
something (Image size) =
let inverse_size = 1.0 / fromIntegral size -- possibly make a function of it
in ...
Being a lazy language, and coming from C++ template metaprogramming, I am not sure how much Haskell will memoize such computations anyways.
Does this make any difference, runtime-speed- and storage-wise?
Denormalization - performance trade-off is not a Haskell-specific thing. If the Image type you have given as an example is close to reality, and you operate with large amount of Images, you would hardly improve performance by means of precalculating inverse_size, because you are saving 1 operation of putting value on FPU stack and 1 floating-point division at the cost of 1 extra load from memory (if size is used in something function along with inverse_size), and x2-3 memory usage increase. It is not a great achievement.
Any way, to give a chance to inverse_size you should make sure this field is strict:
data Image = Image { size :: !Int, inverse_size :: !Float }
Otherwise such precalculation would be certainly pointless.
If you are wondering if GHC compiler optimizes such things itself, the answer is: GHC is able to float repetitive computation within one function (or several (nested by calling) functions if they are inlined), but of cause it won't make a new field in your data structure itself.

OpenJDK's rehashing mechanism

Found this code on http://www.docjar.com/html/api/java/util/HashMap.java.html after searching for a HashMap implementation.
264 static int hash(int h) {
265 // This function ensures that hashCodes that differ only by
266 // constant multiples at each bit position have a bounded
267 // number of collisions (approximately 8 at default load factor).
268 h ^= (h >>> 20) ^ (h >>> 12);
269 return h ^ (h >>> 7) ^ (h >>> 4);
270 }
Can someone shed some light on this? The comment tells us why this code is here but I would like to understand how this improves a bad hash value and how it guarantees that the positions have bounded number of collisions. What do these magic numbers mean?
In order for it to make any sense it has to be combined with an understanding of how HashMap allocates things in to buckets. This is the trivial function by which a bucket index is chosen:
static int indexFor(int h, int length) {
return h & (length-1);
}
So you can see, that with a default table size of 16, only the 4 least significant bits of the hash actually matter for allocating buckets! (16 - 1 = 15, which masks the hash by 1111b)
This could clearly be bad news if your hashCode function returned:
10101100110101010101111010111111
01111100010111011001111010111111
11000000010100000001111010111111
//etc etc etc
Such a hash function would not likely be "bad" in any way that is visible to its author. But if you combine it with the way the map allocates buckets, boom, MapFail(tm).
If you keep in mind that h is a 32 bit number, those are not magic numbers at all. It is systematically xoring the most significant bits of the number rightward into the least significant bits. The purpose is so that "differences" in the number that occur anywhere "across" it when viewed in binary become visible down in the least significant bits.
Collisions become bounded because the number of different numbers that have the same relevant LSBs is now significantly bounded because any differences that occur anywhere in the binary representation are compressed into the bits that matter for bucket-ing.

Resources