I am new to Rust and SIMD, and the following code snippet is the bottleneck of my program. I am wondering whether I can take advantage of the compiler's autovectorization here:
fn is_subset(a: Vec<i64>, b: Vec<i64>) -> bool {
    for i in 0..a.len() {
        if (a[i] & !b[i]) != 0 {
            return false;
        }
    }
    true
}
I also have an alternative way of writing this with iterators (so the trip count is known up front). Would this version get autovectorized instead?
fn is_subset(a: Vec<i64>, b: Vec<i64>) -> bool {
    a.iter().zip(b.iter()).all(|(x, y)| x & y == *x)
}

LLVM (and GCC) don't know how to auto-vectorize loops whose trip-count can't be calculated up front. This rules out search loops like this.
ICC classic can auto-vectorize such loops, but it's a C/C++ compiler, without a front-end for Rust.
Probably your only hope would be to manually loop over 2, 4, or 8-element chunks of the arrays, branchlessly calculating your condition based on all those elements. If you're lucky, LLVM might turn that into operations on one SIMD vector. So using that inner loop inside a larger loop could result in getting the compiler to make vectorized asm, for example using AVX vptest (which sets CF according to bitwise a AND (not b) having any non-zero bits).
i.e. manually express the "unrolling" of SIMD elements in your source, for a specific vector width.
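As a rough sketch of that idea (whether LLVM actually turns the inner loop into vector instructions has to be checked in the generated asm for your target), assuming slices of equal length and a made-up chunk width `LANES` of 8:

```rust
/// Returns true when every bit set in `a[i]` is also set in `b[i]`.
/// Assumption: both slices have the same length; extra elements are ignored.
pub fn is_subset_chunked(a: &[i64], b: &[i64]) -> bool {
    const LANES: usize = 8; // hypothetical choice: 8 x i64 = 64 bytes per chunk
    let n = a.len().min(b.len());
    let mut a_chunks = a[..n].chunks_exact(LANES);
    let mut b_chunks = b[..n].chunks_exact(LANES);

    for (ca, cb) in a_chunks.by_ref().zip(b_chunks.by_ref()) {
        // Fixed trip count and no branch inside: OR all "a AND NOT b" results.
        let mut leftover = 0i64;
        for i in 0..LANES {
            leftover |= ca[i] & !cb[i];
        }
        // Early exit only once per chunk, not once per element.
        if leftover != 0 {
            return false;
        }
    }

    // Scalar tail for the last n % LANES elements.
    a_chunks
        .remainder()
        .iter()
        .zip(b_chunks.remainder().iter())
        .all(|(x, y)| (*x & !*y) == 0)
}
```

The early exit is kept, but only at chunk granularity, so the inner loop stays a plain reduction.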
Related re: getting compilers to auto-vectorize with x86 ptest:
Getting GCC to generate a PTEST instruction when using vector extensions
how to auto vectorization array comparison function
GCC C vector extension: How to check if result of ANY element-wise comparison is true, and which?
If the whole array is small enough, a branchless reduction over the whole array (ORing boolean results together) could be something a compiler is willing to vectorize. If compiling for x86, you'd be hoping for asm like pandn / por in the loop, and a horizontal reduction at the end, to see if any bits were set in the intersection of a and not b.
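A minimal sketch of such a whole-array reduction, assuming both slices have the same length (the name `is_subset_reduce` is made up for this example). There is no early exit, so the trip count is known before the loop starts:

```rust
/// Branchless subset check: ORs together `a[i] & !b[i]` over the whole input.
/// Assumption: `a` and `b` have equal length (zip stops at the shorter one).
pub fn is_subset_reduce(a: &[i64], b: &[i64]) -> bool {
    let leftover = a
        .iter()
        .zip(b.iter())
        .fold(0i64, |acc, (x, y)| acc | (*x & !*y));
    // Any surviving bit means some element of `a` has a bit that `b` lacks.
    leftover == 0
}
```

This always reads every element, so it only pays off when a mismatch is rare or the arrays are short.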
i64 is only 2 elements per 16-byte vector, so the compiler would have to see a good strategy for it to be profitable to auto-vectorize, especially on a 64-bit machine. Having 32-byte vectors available makes it more attractive.


Optimizing find_first_not_of with SSE4.2 or earlier

I am writing a textual packet analyzer for a protocol and in optimizing it I found that a great bottleneck is the find_first_not_of call.
In essence, I need to check whether a packet is valid, i.e. whether it contains only valid characters, and do so faster than the default C++ function.
For instance, if all allowed characters are f, h, o, t, and w, in C++ I would just call s.find_first_not_of("fhotw"), but in SSEx I have no clue after loading the string in a set of __m128i variables.
Apparently, the _mm_cmpXstrY functions documentation is not really helping me in this. (e.g. _mm_cmpistri). I could at first subtract with _mm_sub_epi8, but I don't think it would be a great idea.
Moreover, I am stuck with SSE (any version).
This article by Wojciech Muła describes a SSSE3 algorithm to accept/reject any given byte value.
(Contrary to the article, saturated arithmetic should be used to conduct range checks, but we don't have ranges.)
SSE4.2 string functions are often slower** than hand-crafted alternatives. For example, 3 uops, 3 cycle throughput on Skylake for pcmpistri, the fastest of the SSE4.2 string instructions. vs. 1 shuffle and 1 pcmpeqb per 16 bytes of input with this, with SIMD AND and movemask to combine results. Plus some load and register-copy instructions, but still very likely faster than 1 vector per 3 cycles. Doesn't quite as easily handle short 0-terminated strings, though; SSE4.2 is worth considering if you also need to worry about that, instead of known-size blocks that are a multiple of the vector width.
For "fhotw" specifically, try:
#include <stdint.h>
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8 (pshufb)

// Returns true if all 64 bytes at src are in the set "fhotw".
bool is_valid_64bytes (uint8_t* src) {
    // Table indexed by low nibble, listed from byte 15 down to byte 0:
    // 0xf -> 'o', 0x8 -> 'h', 0x7 -> 'w', 0x6 -> 'f', 0x4 -> 't'
    const __m128i tab = _mm_set_epi8('o','_','_','_','_','_','_','h',
                                     'w','f','_','t','_','_','_','_');
    __m128i src0 = _mm_loadu_si128((__m128i*)&src[0]);
    __m128i src1 = _mm_loadu_si128((__m128i*)&src[16]);
    __m128i src2 = _mm_loadu_si128((__m128i*)&src[32]);
    __m128i src3 = _mm_loadu_si128((__m128i*)&src[48]);
    __m128i acc;
    acc = _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src0), src0);
    acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src1), src1));
    acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src2), src2));
    acc = _mm_and_si128(acc, _mm_cmpeq_epi8(_mm_shuffle_epi8(tab, src3), src3));
    return !!(((unsigned)_mm_movemask_epi8(acc)) == 0xFFFF);
}
Using the low 4 bits of the data, we can select a byte from our set that has that low nibble value. e.g. 'o' (0x6f) goes in the high byte of the table so input bytes of the form 0x?f try to match against it. i.e. it's the first element for _mm_set_epi8, which goes from high to low.
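To make the lookup easier to follow, here is a scalar model of what each byte lane does, written in Rust purely as an illustration (it is not the SIMD code itself, and the names `TAB`, `pshufb_lane` and `all_valid` are made up). It assumes the table layout described above, with '_' as a filler that can never match itself because its low nibble slot (0xf) holds 'o':

```rust
/// Lookup table indexed by the low nibble of the input byte (index 0 first).
/// 't' = 0x74 -> slot 4, 'f' = 0x66 -> slot 6, 'w' = 0x77 -> slot 7,
/// 'h' = 0x68 -> slot 8, 'o' = 0x6f -> slot 15.
const TAB: [u8; 16] = [
    b'_', b'_', b'_', b'_', b't', b'_', b'f', b'w',
    b'h', b'_', b'_', b'_', b'_', b'_', b'_', b'o',
];

/// One lane of `pshufb`: a set high bit in the index yields 0,
/// otherwise the low 4 bits select a table byte.
fn pshufb_lane(tab: &[u8; 16], idx: u8) -> u8 {
    if idx & 0x80 != 0 {
        0
    } else {
        tab[(idx & 0x0f) as usize]
    }
}

/// A byte is valid when the table entry chosen by its low nibble equals it.
pub fn all_valid(src: &[u8]) -> bool {
    src.iter().all(|&b| pshufb_lane(&TAB, b) == b)
}
```

The SIMD version performs the same select-and-compare on 16 bytes at once and ANDs the comparison masks together.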
See the full article for variations on this technique for other special / more general cases.
**If the search is very simple (doesn't need the functionality of string instructions) or very complex (needs at least two string instructions) then it doesn't make much sense to use the string functions. Also the string instructions don't scale to the 256-bit width of AVX2.

Node: Generate 6 digits random number using crypto.randomBytes

What is the correct way to generate exact value from 0 to 999999 randomly since 1000000 is not a power of 2?
This is my approach:
1. Use crypto.randomBytes to generate 3 bytes and convert them to hex.
2. Use the first 5 characters to convert to an integer (the max is fffff == 1048575 > 999999).
3. If the result > 999999, start from step 1 again.
It will somehow create a recursive function. Is it logically correct and will it cause a concern of performance?
There are several ways to extract random numbers in a range from random bits. Some common ones are described in NIST Special Publication 800-90A revision 1: Recommendation for Random Number Generation Using Deterministic Random Bit Generators
Although this standard is about deterministic random bit generators, there is a helpful appendix called A.5 Converting Random Bits into a Random Number which describes three useful methods.
The methods described are:
A.5.1 The Simple Discard Method
A.5.2 The Complex Discard Method
A.5.3 The Simple Modular Method
The first two of them are not deterministic with regards to running time but generate a number with no bias at all. They are based on rejection sampling.
The complex discard method discusses a more optimal scheme for generating large quantities of random numbers in a range. I think it is too complex for almost any normal use; I would look at the Optimized Simple Discard method described below if you require additional efficiency instead.
The Simple Modular Method is time constant and deterministic but has non-zero (but negligible) bias. It requires a relatively large amount of additional randomness to achieve the negligible bias though; basically to have a bias of one out of 2^128 you need 128 bits on top of the bit size of the range required. This is probably not the method to choose for smaller numbers.
Your algorithm is clearly a version of the Simple Discard Method (more generally called "rejection sampling"), so it is fine.
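For illustration, the discard loop can be sketched as below. This is written in Rust rather than Node to keep the example self-contained: the random source is passed in as a closure (a stand-in for something like crypto.randomBytes(3)), and the function name and the `max_tries` cap are assumptions of this sketch, not part of the NIST text.

```rust
/// Simple discard (rejection sampling) for a value in 0..=999_999.
/// `next_bytes` is a caller-supplied source of 3 random bytes per call.
/// Returns `None` if no candidate was accepted within `max_tries` draws.
pub fn six_digit_code<F>(mut next_bytes: F, max_tries: u32) -> Option<u32>
where
    F: FnMut() -> [u8; 3],
{
    for _ in 0..max_tries {
        let [b0, b1, b2] = next_bytes();
        // Assemble 24 bits, then keep the top 20 (range 0..=1_048_575).
        let v = ((u32::from(b0) << 16) | (u32::from(b1) << 8) | u32::from(b2)) >> 4;
        if v < 1_000_000 {
            return Some(v); // accepted: uniform over 0..=999_999
        }
        // rejected: discard the candidate and draw again
    }
    None
}
```

Each draw is accepted with probability 1000000/1048576 (about 95.4%), so the loop almost always ends after one or two iterations.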
I've myself thought of a very efficient algorithm based on the Simple Discard Method called the "Optimized Simple Discard Method" or RNG-BC where "BC" stands for "binary compare". It is based on the observation that comparison only looks at the most significant bits, which means that the least significant bits should still be considered random and can therefore be reused. Beware that this method has not been officially peer reviewed; I do present an informal proof of equivalence with the Simple Discard Method.
Of course you should rather use a generic method that is efficient given any value of N. In that case the Complex Discard Method or Simple Modular Method should be considered over the Simple Discard Method. There are other, much more complex algorithms that are even more efficient, but generally you're fine when using either of these two.
Note that it is often beneficial to first check if N is a power of two when generating a random in the range [0, N). If N is a power of two then there is no need to use any of these possibly expensive computations; just use the bits you need from the random bit or byte generator.
It's a correct algorithm, though you could consider using bitwise operations instead of converting to hex. It can run forever if the random number generator is malfunctioning; you could consider trying a fixed number of times and then throwing an exception instead of looping forever.
The main possible performance problem is that on some platforms, crypto.randomBytes can block if it runs out of entropy. So you don't want to waste any randomness if you're using it.
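Therefore instead of your string comparison I would use the following integer operation.
if (random_bytes < 16700000) {
    return random_bytes - 100000 * Math.floor(random_bytes / 100000);
}
This has about a 99.54% chance (16700000 out of 16777216) of producing an answer from the first 3 bytes, as opposed to around 76% odds with your approach. Note that the modulus 100000 yields values from 0 to 99999; for the full 0 to 999999 range, use 16000000 and 1000000 instead, which accepts about 95.4% of draws.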
I would suggest the following approach:
// assumes: import { randomBytes } from "crypto";
private generateCode(): string {
    let code: string = "";
    do {
        code += randomBytes(3).readUIntBE(0, 3);
        // code += Number.parseInt(randomBytes(3).toString("hex"), 16);
    } while (code.length < 6);
    return code.slice(0, 6);
}
This returns the numeric code as string, but if it is necessary to get it as a number, then change to return Number.parseInt(code.slice(0, 6))
I call it the random_6d algo. Worst case just a single additional loop.
var random_6d = function(n2){
    var n1 = crypto.randomBytes(3).readUIntLE(0, 3) >>> 4;
    if(n1 < 1000000)
        return n1;
    if(typeof n2 === 'undefined')
        return random_6d(n1);
    return Math.abs(n1 - n2);
};
loop version:
var random_6d = function(){
    var n1, n2;
    while(true){
        n1 = crypto.randomBytes(3).readUIntLE(0, 3) >>> 4;
        if(n1 < 1000000)
            return n1;
        if(typeof n2 === 'undefined')
            n2 = n1;
        else
            return Math.abs(n1 - n2);
    }
};

Comparison of two floats in Rust to arbitrary level of precision

How can I do a comparison at an arbitrary level of precision such that I can see that two numbers are the same? In Python, I would use a function like round(), so I am looking for something equivalent in Rust.
For example I have:
let x = 1.45555454;
let y = 1.45556766;
In my case, they are similar up to 2 decimal places, so x and y would both become 1.46 for the purposes of comparison. I could format these, but that is surely slow. What is the best Rust method to check equivalence, so that:
if x == y { /* called when we match to 2 decimal places */ }
To further elucidate the problem and give some context: this is really for dollars-and-cents accuracy, so normally in Python I would use the round() function, with all its problems. Yes, I am aware of the limitations of floating point representations. There are two functions that compute amounts; I compute in dollars and need to handle the cents part to the nearest penny.
The reason to ask the community is that I suspect that if I roll my own, it could hurt performance, and performance is why I'm employing Rust, so here I am. Plus I saw something called round() in the Rust documentation, but it seems to take zero parameters, unlike Python's version.
From the Python documentation:
Note The behavior of round() for floats can be surprising: for example, round(2.675, 2) gives 2.67 instead of the expected 2.68. This is not a bug: it’s a result of the fact that most decimal fractions can’t be represented exactly as a float.
For more information, check out What Every Programmer Should Know About Floating-Point Arithmetic.
If you don't understand how computers treat floating points, don't use this code. If you know what trouble you are getting yourself into:
fn approx_equal(a: f64, b: f64, decimal_places: u8) -> bool {
    let factor = 10.0f64.powi(decimal_places as i32);
    let a = (a * factor).trunc();
    let b = (b * factor).trunc();
    a == b
}

fn main() {
    assert!( approx_equal(1.234, 1.235, 1));
    assert!( approx_equal(1.234, 1.235, 2));
    assert!(!approx_equal(1.234, 1.235, 3));
}
A non-exhaustive list of things that are known (or likely) to be broken with this code:
Sufficiently large floating point numbers and/or number of decimal points
Denormalized numbers
Values near zero (approx_equal(0.09, -0.09, 1))
A potential alternative is to use either a fixed-point or arbitrary-precision type, either of which are going to be slower but more logically consistent to the majority of humans.
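One way to sketch the fixed-point idea for the dollars-and-cents case is to convert each amount to an integer number of cents once and compare integers from then on. The function names here are made up for illustration, and rounding half away from zero (what `f64::round` does) is an assumption about the desired rounding rule:

```rust
/// Convert a dollar amount to whole cents, rounding to the nearest cent.
/// Assumption: amounts are small enough that `dollars * 100.0` fits in an i64.
pub fn to_cents(dollars: f64) -> i64 {
    (dollars * 100.0).round() as i64
}

/// Two amounts are "the same" when they round to the same number of cents.
pub fn same_to_the_cent(a: f64, b: f64) -> bool {
    to_cents(a) == to_cents(b)
}
```

With the values from the question, both 1.45555454 and 1.45556766 become 146 cents. Unlike truncation, rounding also keeps 0.09 and -0.09 apart (9 versus -9 cents).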
This one seems to work pretty well for me.
fn approx_equal (a: f64, b: f64, dp: u8) -> bool {
    let p = 10f64.powi(-(dp as i32));
    (a-b).abs() < p
}

Multithreaded sparse matrix multiplication in Matlab

I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). N is large, as is M; on the order of several thousands. I am running Matlab 2013a.
Now, usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically.
This appears NOT to be the case if either of the matrices are sparse (see e.g. this StackOverflow discussion - with no answer for the intended question - and this largely unanswered MathWorks thread).
This is a rather unhappy surprise for me.
We can verify that multithreading has no effects for sparse matrix operations by the following code:
clc; clear all;
N = 5000; % set matrix sizes
M = 3000;
A = randn(N,M); % create dense random matrices
B = sprand(N,N,0.015); % create sparse random matrix
Bf = full(B); % create a dense form of the otherwise sparse matrix B
for i=1:3 % test for 1, 2, and 4 threads
    m(i) = 2^(i-1);
    maxNumCompThreads(m(i)); % set the thread count available to Matlab
    tic % starts timer
    y = B*A;
    walltime(i) = toc; % wall clock time
    speedup(i) = walltime(1)/walltime(i);
end
% display number of threads vs. speed up relative to just a single thread
[m',speedup']
This produces the following output, which illustrates that there is no difference between using 1, 2, and 4 threads for sparse operations:
threads speedup
1.0000 1.0000
2.0000 0.9950
4.0000 1.0155
If, on the other hand, I replace B by its dense form, referred to as Bf above, I get significant speedup:
threads speedup
1.0000 1.0000
2.0000 1.8894
4.0000 3.4841
(illustrating that matrix operations for dense matrices in Matlab are indeed implicitly parallelized)
So, my question: is there any way at all to access a parallelized/threaded version of matrix operations for sparse matrices (in Matlab) without converting them to dense form?
I found one old suggestion involving .mex files at MathWorks, but it seems the links are dead and not very well documented/no feedback? Any alternatives?
It seems to be a rather severe restriction of the implicit parallelism functionality, since sparse matrices abound in computationally heavy problems, and multithreaded functionality is highly desirable in these cases.
MATLAB already uses SuiteSparse by Tim Davis for many of its operations on sparse matrices (for example see here), but I don't believe those routines are multithreaded.
Usually computations on sparse matrices are memory-bound rather than CPU-bound. So even if you use a multithreaded library, I doubt you will see huge benefits in terms of performance, at least not comparable to those of libraries specialized in dense matrices...
After all, sparse matrices are designed with different goals in mind than regular dense matrices: efficient memory storage is often more important.
I did a quick search online, and found a few implementations out there:
sparse BLAS, spBLAS, PSBLAS. For instance, Intel MKL and AMD ACML do have some support for sparse matrices
cuSPARSE, CUSP, VexCL, ViennaCL, etc.. that run on the GPU.
I ended up writing my own mex file with OpenMP for multithreading. Code as follows. Don't forget to use -largeArrayDims and /openmp (or -fopenmp) flags when compiling.
#include <omp.h>
#include "mex.h"
#include "matrix.h"

#define ll long long

/* C (m x n, dense) += A (m x p, sparse CSC: values A, row indices irs,
   column pointers jcs) * B (p x n, dense) */
void omp_smm(double* A, double*B, double* C, ll m, ll p, ll n, ll* irs, ll* jcs)
{
    for (ll j=0; j<p; ++j)
    {
        ll istart = jcs[j];
        ll iend = jcs[j+1];
        /* row indices within one column are distinct, so each thread
           writes to different rows of C */
        #pragma omp parallel for
        for (ll ii=istart; ii<iend; ++ii)
        {
            ll i = irs[ii];
            double aa = A[ii];
            for (ll k=0; k<n; ++k)
            {
                C[i+k*m] += B[j+k*p]*aa;
            }
        }
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *A, *B, *C; /* pointers to input & output matrices*/
    size_t m,n,p; /* matrix dimensions */
    A = mxGetPr(prhs[0]); /* first sparse matrix */
    B = mxGetPr(prhs[1]); /* second full matrix */
    mwIndex * irs = mxGetIr(prhs[0]);
    mwIndex * jcs = mxGetJc(prhs[0]);
    m = mxGetM(prhs[0]);
    p = mxGetN(prhs[0]);
    n = mxGetN(prhs[1]);
    /* create output matrix C */
    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
    C = mxGetPr(plhs[0]);
    omp_smm(A,B,C, m, p, n, (ll*)irs, (ll*)jcs);
}
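For comparison, here is a sketch of the same compressed-sparse-column kernel with a different split of the work: each task owns a block of output columns, so no two threads ever write the same element. It is written in Rust with only the standard library as an illustration, not as a drop-in mex replacement; the function name and the `threads` parameter are made up, and the arrays are assumed to follow the same values / row-index / column-pointer layout as above.

```rust
use std::thread;

/// C = S * B, where S is m x p in CSC form and B is p x n, column-major.
/// Returns C as an m x n column-major vector.
pub fn csc_times_dense(
    m: usize, p: usize, n: usize,
    vals: &[f64], row_idx: &[usize], col_ptr: &[usize],
    b: &[f64], threads: usize,
) -> Vec<f64> {
    let mut c = vec![0.0; m * n];
    if m == 0 || n == 0 {
        return c;
    }
    let t = threads.max(1);
    let cols_per_task = ((n + t - 1) / t).max(1);
    thread::scope(|s| {
        // Each task gets a disjoint block of whole output columns.
        for (task, c_block) in c.chunks_mut(m * cols_per_task).enumerate() {
            let first_col = task * cols_per_task;
            s.spawn(move || {
                for (dk, c_col) in c_block.chunks_mut(m).enumerate() {
                    let k = first_col + dk;
                    let b_col = &b[k * p..(k + 1) * p];
                    for j in 0..p {
                        // Nonzeros of sparse column j scaled by B(j, k).
                        for ii in col_ptr[j]..col_ptr[j + 1] {
                            c_col[row_idx[ii]] += vals[ii] * b_col[j];
                        }
                    }
                }
            });
        }
    });
    c
}
```

Splitting by output column means every task re-reads the whole sparse matrix, so this is likely to be just as memory-bound as the other answers suggest.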
On matlab central the same question was asked, and this answer was given:
I believe the sparse matrix code is implemented by a few specialized TMW engineers rather than an external library like BLAS/LAPACK/LINPACK/etc...
Which basically means, that you are out of luck.
However I can think of some tricks to achieve faster computations:
If you need to do several multiplications: do multiple multiplications at once and process them in parallel?
If you just want to do one multiplication: Cut the matrix into pieces (for example top half and bottom half), do the calculations of the parts in parallel and combine the results afterwards
Probably these solutions will not turn out to be as fast as properly implemented multithreading, but hopefully you can still get a speedup.

cmm call format for foreign primop (integer-gmp example)

I have been checking out the integer-gmp source code to understand how foreign primops can be implemented in terms of Cmm, as documented on the GHC Primops page. I am aware of techniques to implement them using the LLVM hack or -fvia-C/gcc; this is more of a learning experience for me to understand this third approach that the integer-gmp library uses.
So, I looked up the CMM tutorial on the MSFT page (pdf link), went through the GHC CMM page, and still there are some unanswered questions (it is hard to keep all those concepts in my head without digging into CMM, which is what I am doing now). There is this code fragment from the integer-gmp cmm file:
integer_cmm_int2Integerzh (W_ val)
{
   W_ s, p; /* to avoid aliasing */

   ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);

   p = Hp - SIZEOF_StgArrWords;
   SET_HDR(p, stg_ARR_WORDS_info, CCCS);
   StgArrWords_bytes(p) = SIZEOF_W;

   /* mpz_set_si is inlined here, makes things simpler */
   if (%lt(val,0)) {
        s = -1;
        Hp(0) = -val;
   } else {
     if (%gt(val,0)) {
        s = 1;
        Hp(0) = val;
     } else {
        s = 0;
     }
   }

   /* returns (# size :: Int#,
                 data :: ByteArray#
               #)
    */
   return (s,p);
}
As defined in ghc cmm header:
W_ is alias for word.
ALLOC_PRIM_N is a function for allocating memory on the heap for primitive object.
Sp(n) and Hp(n) are defined as below (comments are mine):
#define WDS(n) ((n)*SIZEOF_W) //WDS(n) calculates n*sizeof(Word)
#define Sp(n) W_[Sp + WDS(n)]//Sp(n) points to Stackpointer + n word offset?
#define Hp(n) W_[Hp + WDS(n)]//Hp(n) points to Heap pointer + n word offset?
I don't understand lines 5-9 (line 1 is the start in case you have 1/0 confusion). More specifically:
Why is the function call format of ALLOC_PRIM_N (bytes,fun,arg) that way?
Why is p manipulated that way?
The function as I understand it (from looking at function signature in Prim.hs) takes an int, and returns a (int, byte array) (stored in s,p respectively in the code).
For anyone who is wondering about the inlined code in the if block, it is a Cmm implementation of GMP's mpz_set_si function. My guess is that if you call a function defined in an object file through ccall, it can't be inlined (which makes sense since it is object code, not intermediate code; the LLVM approach seems more suitable for inlining through LLVM IR). So, the optimization was to define a Cmm representation of the function to be inlined. Please correct me if this guess is wrong.
Explanation of lines 5-9 will be very much appreciated. I have more questions about other macros defined in integer-gmp file, but it might be too much to ask in one post. If you can answer the question with a Haskell wiki page or a blog (you can post the link as answer), that would be much appreciated (and if you do, I would also appreciate step-by-step walk-through of an integer-gmp cmm macro such as GMP_TAKE2_RET1).
Those lines allocate a new ByteArray# on the Haskell heap, so to understand them you first need to know a bit about how GHC's heap is managed.
Each capability (= OS thread that executes Haskell code) has its own dedicated nursery, an area of the heap into which it makes normal, small allocations like this one. Objects are simply allocated sequentially into this area from low addresses to high addresses until the capability tries to make an allocation which exceeds the remaining space in the nursery, which triggers the garbage collector.
All heap objects are aligned to a multiple of the word size, i.e., 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.
The Cmm-level register Hp points to (the beginning of) the last word which has been allocated in the nursery. HpLim points to the last word which can be allocated in the nursery. (HpLim can also be set to 0 by another thread to stop the world for GC, or to send an asynchronous exception.)
The GHC Commentary has information on the layout of individual heap objects. Notably each heap object begins with an info pointer, which (among other things) identifies what sort of heap object it is.
The Haskell type ByteArray# is implemented with the heap object type ARR_WORDS. An ARR_WORDS object just consists of (an info pointer followed by) a size (in bytes) followed by arbitrary data (the payload). The payload is not interpreted by the GC, so it can't store pointers to Haskell heap objects, but it can store anything else. SIZEOF_StgArrWords is the size of the header common to all ARR_WORDS heap objects, and in this case the payload is just a single word, so SIZEOF_StgArrWords + WDS(1) is the amount of space we need to allocate.
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val) expands to something like
Hp = Hp + (SIZEOF_StgArrWords + WDS(1));
if (Hp > HpLim) {
    HpAlloc = SIZEOF_StgArrWords + WDS(1);
    goto stg_gc_prim_n(integer_cmm_int2Integerzh, val);
}
The first line increases Hp by the amount to be allocated. The second line checks for heap overflow. The third line records the amount that we tried to allocate, so the GC can undo it. The fourth line calls the GC.
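The bump-then-check pattern in those four lines can be modelled with a toy allocator. This is only an illustration in Rust, not GHC code: it uses word indices instead of addresses, the struct and method names are made up, and where the real code jumps to the GC this model simply undoes the bump and reports failure.

```rust
/// Toy model of a nursery; all fields count words, not bytes.
pub struct Nursery {
    /// Index of the last allocated word (-1 when empty), like `Hp`.
    pub hp: isize,
    /// Index of the last word that may be allocated, like `HpLim`.
    pub hp_lim: isize,
    /// Size of the most recent failed request, like `HpAlloc`.
    pub hp_alloc: isize,
}

impl Nursery {
    pub fn new(words: isize) -> Self {
        Nursery { hp: -1, hp_lim: words - 1, hp_alloc: 0 }
    }

    /// Returns the index of the first word of the new object, or `None`
    /// when the request does not fit (where GHC would call the GC).
    pub fn alloc(&mut self, words: isize) -> Option<isize> {
        self.hp += words; // bump first
        if self.hp > self.hp_lim {
            self.hp_alloc = words; // remember how much was requested
            self.hp -= self.hp_alloc; // this model undoes the bump right away
            return None;
        }
        // hp points at the last word, so the block starts words - 1 earlier.
        Some(self.hp - words + 1)
    }
}
```

Because hp points at the last allocated word rather than one past it, the start of a block is found by subtracting all but one word of its size, which is the same arithmetic as p = Hp - SIZEOF_StgArrWords for a header plus one payload word.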
The fourth line is the most interesting. The arguments tell the GC how to restart the thread once garbage collection is done: it should reinvoke integer_cmm_int2Integerzh with argument val. The "_n" in stg_gc_prim_n (and the "_N" in ALLOC_PRIM_N) means that val is a non-pointer argument (in this case an Int#). If val were a pointer to a Haskell heap object, the GC needs to know that it is live (so it doesn't get collected) and to reinvoke our function with the new address of the object. In that case we'd use the _p variant. There are also variants like _pp for multiple pointer arguments, _d for Double# arguments, etc.
After line 5, we've successfully allocated a block of SIZEOF_StgArrWords + WDS(1) bytes and, remember, Hp points to its last word. So, p = Hp - SIZEOF_StgArrWords sets p to the beginning of this block. Line 8 fills in the info pointer of p, identifying the newly-created heap object as ARR_WORDS. CCCS is the current cost-center stack, used only for profiling. When profiling is enabled each heap object contains an extra field that basically identifies who is responsible for its allocation. In non-profiling builds, there is no CCCS and SET_HDR just sets the info pointer. Finally, line 9 fills in the size field of the ByteArray#. The rest of the function fills in the payload and returns the sign value and the ByteArray# object pointer.
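What the rest of the function computes can be modelled in a few lines. This is a sketch in Rust of the sign/magnitude split only, with a made-up function name; it leaves out the heap object entirely, and it returns a magnitude of 0 for zero, whereas the Cmm code leaves the payload word unwritten in that case.

```rust
/// Model of the value part of int2Integer: a machine integer becomes
/// a sign (-1, 0 or 1, like `s`) and a one-word magnitude (like `Hp(0)`).
pub fn int_to_sign_magnitude(val: i64) -> (i64, u64) {
    let sign = val.signum();
    // unsigned_abs avoids overflow for i64::MIN, whose magnitude is 2^63.
    let magnitude = val.unsigned_abs();
    (sign, magnitude)
}
```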
So, this ended up being more about the GHC heap than about the Cmm language, but I hope it helps.
Required knowledge
In order to do arithmetic and logical operations, computers have a digital circuit called the ALU (Arithmetic Logic Unit) in their CPU (Central Processing Unit). An ALU loads data from input registers. A processor register is a small, fast storage location on the CPU chip itself, quicker to access than the L1 cache (which answers data requests within a few CPU clock ticks). A processor often contains several kinds of registers, usually differentiated by the number of bits they can hold.
Numbers expressed in a fixed number of bits can hold only a finite number of values. Typically the programming language (here Haskell) exposes the following primitive sizes:
8 bit numbers = 256 unique representable values
16 bit numbers = 65 536 unique representable values
32 bit numbers = 4 294 967 296 unique representable values
64 bit numbers = 18 446 744 073 709 551 616 unique representable values
Fixed-precision arithmetic for those types has been implemented in hardware. Word size refers to the number of bits that can be processed by a computer's CPU in one go. For the x86 architecture this is 32 bits, and for x64 it is 64 bits.
IEEE 754 defines the floating point number standard for {16, 32, 64, 128} bit numbers. For example, a 32-bit floating point number (with 4 294 967 296 unique bit patterns) can hold approximate values in [-3.402823e38, 3.402823e38] with a precision of about 7 significant decimal digits.
In addition
The acronym GMP stands for GNU Multiple Precision Arithmetic Library; it adds support for software-emulated arbitrary-precision arithmetic. The Glasgow Haskell Compiler's Integer implementation uses it.
GMP aims to be faster than any other bignum library for all operand sizes. Some important factors in doing this are:
Using full words as the basic arithmetic type.
Using different algorithms for different operand sizes; algorithms that are faster for very big numbers are usually slower for small numbers.
Highly optimized assembly language code for the most important inner loops, specialized for different processors.
For some, Haskell might have a slightly hard to comprehend syntax, so here is a JavaScript version:
var integer_cmm_int2Integerzh = function(word) {
    return WORDSIZE == 32
        ? goog.math.Integer.fromInt(word)
        : goog.math.Integer.fromBits([word.getLowBits(), word.getHighBits()]);
};
Here goog is the Google Closure Library, and the class used is goog.math.Integer. The called functions:
goog.math.Integer.fromInt = function(value) {
    if (-128 <= value && value < 128) {
        var cachedObj = goog.math.Integer.IntCache_[value];
        if (cachedObj) {
            return cachedObj;
        }
    }
    var obj = new goog.math.Integer([value | 0], value < 0 ? -1 : 0);
    if (-128 <= value && value < 128) {
        goog.math.Integer.IntCache_[value] = obj;
    }
    return obj;
};
goog.math.Integer.fromBits = function(bits) {
    var high = bits[bits.length - 1];
    return new goog.math.Integer(bits, high & (1 << 31) ? -1 : 0);
};
That is not totally correct, as the Cmm function ends with return (s,p); where
s is the sign of the value (-1, 0 or 1)
p is a pointer to the ByteArray# holding the magnitude
In order to fix this GMP wrapper should be created. This has been done in Haskell to JavaScript compiler project (source link).
Lines 5-9
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);
p = Hp - SIZEOF_StgArrWords;
SET_HDR(p, stg_ARR_WORDS_info, CCCS);
StgArrWords_bytes(p) = SIZEOF_W;
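These do the following, one line each:
allocates heap space for the new object (an array header plus one word)
sets p to point to the start of that object
sets the object's header (its info pointer)
sets the object's size field, in bytes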
