Pattern matching benchmarking: Compile-time lookup vs runtime lookup in D - metaprogramming

I need advice on my first D project. I have uploaded it at:
https://bitbucket.org/mrjohns/matcher/downloads
IDEA: Benchmark three runtime algorithms and compare them to their compile-time variants. The only difference between these is that for the compile-time ones, the lookup tables (i.e. the arrays bmBc, bmGs, and suffixes) must be computed at compile time (I currently rely on CTFE), while for the runtime ones the lookup tables are computed at runtime.
NB: The pattern matching algorithms themselves need not be executed at compile time, only the lookup tables. Having stated this, the algorithms which run on known (compile-time computed) tables should be faster than the ones which have to compute them at runtime.
My results seem to show something different: only the first pair (BM_Runtime and BM_Compile-time) yields admissible results; the other two pairs give higher execution times for the compile-time variants. I think I am missing something here. Please help.
Current results for the pattern="GCAGAGAG" are as below:
**BM_Runtime** = 366 hnsecs position = 513
**BM_Compile-time** = 294 hnsecs position = 513
**BMH_Runtime** = 174 hnsecs position = 513
**BMH_Compile-time** = 261 hnsecs position = 513
**AG_Run-time** = 258 hnsecs position = 513
**AG_Compile-time** = 268 hnsecs position = 513
Running the code: dmd -J. matcher.d inputs.d rtime_pre.d ctime_pre.d && numactl --physcpubind=0 ./matcher
I would appreciate your suggestions.
Thank you in advance.

Any performance test without compiler optimizations enabled is not useful. You should add dmd -release -inline -O -boundscheck=off. Also, performance tests usually repeat the measured calculation many times in a loop; otherwise you may get incorrect results.

Related

Python erasure coding library that can handle larger strings?

I'm looking for a Python erasure coding library that works for larger inputs. So far I've checked out:
unireedsolomon: fails for 256-byte inputs, unmaintained
reedsolo/reedsolomon: fails silently for a 300-byte input; clearly a learning project, bug tracker disabled
pyeclib: fails for 100-byte inputs using Reed-Solomon encoding, and doesn't seem to provide any documentation on valid parameters, so I couldn't figure out how to test other algorithms (nor does liberasurecode)
I want something that can handle n=10,000 k=2,000 or so, ideally larger.
Only the field polynomial has to be prime or primitive, not the generating polynomial. If you wanted an RS(10000, 8000, 2000) code (n = 10000, k = 8000, n-k = 2000), GF(2^16) with the primitive reducing polynomial x^16 + x^12 + x^3 + x + 1 could be used. The generating polynomial would be of degree 2000. Assuming the first consecutive root is 2, the generating polynomial = (x-2)(x-4)(x-8) ... (x-2^2000) (all of this math is done in GF(2^16), where + and - are both XOR). Correction would involve generating 2000 syndromes and using a Berlekamp-Massey or Sugiyama's extended Euclid decoder. I don't know if there are Python libraries that support GF(2^16).
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Berlekamp%E2%80%93Massey_decoder
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Euclidean_decoder
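To make the construction above concrete, here is a rough Python sketch (not an existing package; every name here is made up) that builds GF(2^16) arithmetic from the primitive polynomial x^16 + x^12 + x^3 + x + 1 and multiplies out the generator polynomial from the consecutive roots 2, 4, 8, ...:
PRIM_POLY = 0x1100B   # x^16 + x^12 + x^3 + x + 1
FIELD_BITS = 16

def gf_mul(a, b):
    """Multiply a and b in GF(2^16), reducing modulo PRIM_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << FIELD_BITS):
            a ^= PRIM_POLY
    return result

def poly_mul_linear(g, root):
    """Multiply polynomial g (coefficients, lowest degree first) by (x - root).
    In GF(2^m) subtraction is XOR, so (x - root) == (x + root)."""
    h = [0] * (len(g) + 1)
    for i, c in enumerate(g):
        h[i + 1] ^= c               # c * x
        h[i] ^= gf_mul(c, root)     # c * root
    return h

def generator_poly(nroots):
    """g(x) = (x - 2)(x - 2^2)...(x - 2^nroots), first consecutive root = 2."""
    g = [1]
    root = 1
    for _ in range(nroots):
        root = gf_mul(root, 2)      # next consecutive root: 2, 4, 8, ...
        g = poly_mul_linear(g, root)
    return g

# generator_poly(2000) yields the degree-2000 generator polynomial for the code
# described above (slow in pure Python; log/antilog tables would make it practical).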
Large n and k can be avoided by interleaving. Tape drives like LTO treat large data blocks as matrices, interleaved across rows (called C1) and down columns (called C2), using GF(2^8). LTO-8 uses RS(249,237,13), which I assume is the ECC used down columns to correct rows. With 32 read/write heads, there's an interleave of 32, probably across rows. I don't know what the RS() code across rows is, or what the interleave down columns is.
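As a rough illustration of the interleaving idea (not taken from any specific library; rs_encode here is a placeholder for whatever small-codeword encoder you end up using):
def interleave_encode(data: bytes, depth: int, rs_encode):
    """Split data into `depth` interleaved streams and encode each with a small
    RS code instead of one huge codeword. Stream i takes bytes i, i+depth, ..."""
    streams = [data[i::depth] for i in range(depth)]
    return [rs_encode(s) for s in streams]

def deinterleave(decoded_streams):
    """Rebuild the original byte order from the decoded streams."""
    out = bytearray()
    for i in range(max(len(s) for s in decoded_streams)):
        for s in decoded_streams:
            if i < len(s):
                out.append(s[i])
    return bytes(out)

# With depth=32 and, say, a GF(2^8) RS(255, 223) encoder as rs_encode, a burst of
# errors in the raw data is spread across 32 independent codewords.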

Pyomo: define objective Rule based on condition

In a transport problem, I'm trying to insert the following rule into the objective function:
If the supply of BC is < 19,000 tons, then we will have a penalty of $125/MT.
I added a constraint to check the condition but would like to apply the penalty in the objective function.
I was able to do this in Excel Solver, but the values do not match. I've already checked both and debugged the code, but I could not figure out what's wrong.
Here is the constraint:
def bc_rule(model):
    return sum(model.x[supplier, market] for supplier in model.suppliers
               for market in model.markets
               if 'BC' in supplier) >= 19000

model.bc_rules = Constraint(rule=bc_rule, doc='Minimum production')
The problem is in the objective rule:
def objective_rule(model):
    PENALTY_THRESHOLD = 19000
    PENALTY_COST = 125
    cost = sum(model.costs[supplier, market] * model.x[supplier, market]
               for supplier in model.suppliers for market in model.markets)
    # what is the problem here?
    bc = sum(model.x[supplier, market] for supplier in model.suppliers
             for market in model.markets
             if 'BC' in supplier)
    if bc < PENALTY_THRESHOLD:
        cost += (PENALTY_THRESHOLD - bc) * PENALTY_COST
    return cost

model.objective = Objective(rule=objective_rule, sense=minimize, doc='Define objective function')
I'm getting a much lower value than found in Excel Solver.
Your condition (if) depends on a variable of your model.
Normally, ifs should never be used in a mathematical model, and that is not only true for Pyomo. Even in Excel, if statements in formulas are simply converted to a scalar value before optimization, so I would be very careful about calling that the real optimal value.
The good news is that if statements are easily converted into mathematical constraints.
For that, you need to add a binary variable (0/1) to your model. It will take the value 1 if bc is below PENALTY_THRESHOLD. Let's call this variable y; it is defined as model.y = Var(domain=Binary).
You will add model.y * PENALTY_COST as a term of your objective function to include the penalty cost.
Then, for the constraint, add the following piece of code:
def y_big_M(model):
    # bigM should be a big number, big enough that it will be bigger than any number
    # in your model, but small enough that it stays around the same order of magnitude.
    # Avoid utterly big numbers like 1e12 and up if you don't need to, since numbers
    # that are too large cause numerical problems.
    bigM = 10000
    PENALTY_THRESHOLD = 19000
    return PENALTY_THRESHOLD - sum(
        model.x[supplier, market]
        for supplier in model.suppliers
        for market in model.markets
        if 'BC' in supplier
    ) <= model.y * bigM

model.y_big_M = Constraint(rule=y_big_M)
The previous constraint ensures that y will take a value greater than 0 (i.e. 1) when the sum that calculates bc is smaller than PENALTY_THRESHOLD. Any positive value of this difference forces the model to set the variable y to 1, because with y = 1 the right-hand side of the constraint becomes 1 * bigM, a number big enough that the left-hand side will always be smaller than it.
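Putting it together, a minimal sketch of the rewritten objective (reusing model.y, model.costs and model.x from above; the if block disappears entirely) could look like this:
def objective_rule(model):
    PENALTY_COST = 125
    cost = sum(model.costs[supplier, market] * model.x[supplier, market]
               for supplier in model.suppliers
               for market in model.markets)
    # model.y is forced to 1 by the big-M constraint whenever the BC supply is
    # below the threshold, so the penalty is charged exactly in that case
    return cost + model.y * PENALTY_COST

model.objective = Objective(rule=objective_rule, sense=minimize, doc='Define objective function')
If the penalty really should scale with the shortfall ($125 per ton below 19,000) rather than being a flat charge, the same linearization idea works with a nonnegative continuous shortfall variable in place of the binary one.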
Please also check your Excel model to see if your if statements really work during the solver computations. Last time I checked, Excel Solver does not convert if statements into big-M constraints. The modeling technique I showed you works for absolutely all mathematical programming methods, even in Excel.

HashMap performance Java 9 25% less than Java 8?

Note: This is not about a performance issue. I only observe a difference in performance that I cannot explain / understand.
While benchmarking some newly developed code targeted at Java 9, I discovered something weird. A (very) simple benchmark of a HashMap with 5 keys reveals that Java 9 is much slower than Java 8. Can this be explained, or is my (benchmark) code just plain wrong?
Code:
@Fork(
    jvmArgsAppend = {"-Xmx512M", "-disablesystemassertions"}
)
public class JsonBenchmark {

    @State(Scope.Thread)
    public static class Data {

        final static Locale RUSSIAN = new Locale("ru");
        final static Locale DUTCH = new Locale("nl");

        final Map<Locale, String> hashmap = new HashMap<>();

        public Data() {
            hashmap.put(Locale.ENGLISH, "Flat flashing adjustable for flat angled roof with swivel");
            hashmap.put(Locale.FRENCH, "Solin pour toit plat inclinée");
            hashmap.put(Locale.GERMAN, "Flachdachkragen Flach Schrägdach");
            hashmap.put(DUTCH, "Plakplaat vlak/hellend dak inclusief glijschaal");
            hashmap.put(RUSSIAN, "Проход через плоскую кровлю регулир. для накл. кровли");
        }
    }

    @Benchmark
    public int bmHashMap(JsonBenchmark.Data data) {
        final Map<Locale, String> m = data.hashmap;
        int sum = 0;
        sum += m.get(Data.RUSSIAN).length();
        sum += m.get(Locale.FRENCH).length();
        sum += m.get(Data.DUTCH).length();
        sum += m.get(Locale.ENGLISH).length();
        sum += m.get(Locale.GERMAN).length();
        return sum;
    }
}
Results:
Java 8_151: JsonBenchmark.bmHashMap thrpt 40 47948546.439 ± 560763.711 ops/s
Java 9_181: JsonBenchmark.bmHashMap thrpt 40 34962904.479 ± 276045.691 ops/s (-/- 27%!)
UPDATE
Thanks for the answers and great comments.
Suggestion by @Holger. My first reaction was: that must be the explanation. However, if I only benchmark the String#length() function, there is no significant difference in performance. And when I only benchmark the HashMap#get() methods (as suggested by @Eugene), there is still a difference of about 10 - 12%.
Suggestion by @Eugene. I changed the parameters (more warmup iterations, more memory), but I am not able to reproduce your outcome. I increased the heap to 4G, however. But this cannot explain the difference, can it?
Suggestion by @Alan Bateman. Yes, this improves the performance! However, there is still a difference of around 20%.
You are testing more than just HashMap. You are not only calling HashMap.get, you are implicitly calling Locale.hashCode and Locale.equals. Further, you are calling String.length.
Now, all four could have changed their performance characteristics, so you would need far more tests to reason about which method(s) exhibit(s) different performance.
But the hottest candidate is String.length. In Java 9, the String class does not use a char[] array anymore, but a byte[] array, to encode Latin 1 strings using only one byte per character, dramatically reducing the memory footprint of typical applications. This, however, implies that the length is not always identical to the array length anymore. So the complexity of this operation has changed.
But keep in mind that your result amounts to a difference of roughly 7.7 ns per operation in a microbenchmark. This is not enough to estimate the impact on a real application…
I had a hint this was more about the JMH set-up than about HashMap. As already noted, you are measuring a lot more than simply HashMap::get here. But even so, I had doubts that Java 9 would be that much slower, so I measured myself (latest jmh build from sources, java-8 and java-9).
I haven't changed your code - just added way more heap (10GB) and way more warm-ups, thus reducing the "error" you see after ±.
Using java-8:
Benchmark Mode Cnt Score Error Units
SOExample.bmHashMap avgt 25 22.059 ± 0.276 ns/op
Using java-9:
Benchmark Mode Cnt Score Error Units
SOExample.bmHashMap avgt 25 23.954 ± 0.383 ns/op
The results are pretty much on par, without a noticeable difference (these are nanoseconds, after all), as you can see. Also, if you really want to test just HashMap::get, then your method could simply return the invocation of that, like this:
@Benchmark
@Fork(5)
public int bmHashMap(SOExample.Data data) {
    return data.hashmap.get(data.key); // where key is a randomly generated possible key
}

J Primes Enumeration

J will answer the n-th prime via p:n.
If I ask for the 100 millionth prime I get an almost instant answer. I cannot imagine J is sieving for that prime that quickly, but it can't be looking it up in a table either, as that table would be around 1 GB in size.
There are equations giving approximations to the number of primes below a bound, but they are only approximations.
How is J finding the answer so quickly?
J uses a table to start, then calculates
NOTE! This is speculation, based on benchmarks (shown below).
If you want to quickly try for yourself, try the following:
p:1e8 NB. near-instant
p:1e8-1 NB. noticeable pause
The low points on the graph are where J looks up the prime in a table. After that, J calculates the value from a particular starting point so it doesn't have to calculate the entire thing. So some lucky primes will be constant time (a simple table lookup), but generally there's first a table lookup and then a calculation. Happily, it calculates starting from the previous table lookup instead of computing the entire value.
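To illustrate that speculation, here is a small Python sketch of the checkpoint idea (the names, the STEP size, and the use of sympy are all mine, not anything taken from J's source):
from sympy import prime, nextprime

STEP = 100_000
# hypothetical precomputed table: every STEP-th prime up to some bound
TABLE = {i: prime(i) for i in range(STEP, 1_000_001, STEP)}

def nth_prime(n):
    base = (n // STEP) * STEP        # nearest checkpoint at or below n
    if base:
        p = TABLE[base]              # "lucky" n: constant-time lookup
    else:
        p = 1                        # below the first checkpoint: start from scratch
    for _ in range(n - base):        # walk forward from the checkpoint
        p = nextprime(p)
    return p

# nth_prime(500_000) is essentially a pure lookup, while nth_prime(499_999) has to
# step forward almost STEP primes from the previous checkpoint - the same
# lookup-vs-pause behaviour seen with p:1e8 and p:1e8-1.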
Benchmarks
I did some benchmarking to see how p: performs on my machine (iMac i5, 16G RAM). I'm using J803. The results are interesting. I'm guessing the sawtooth pattern in the time plots (visible on the 'up to 2e5' plot) is lookup table related, while the overall log-ish shape (visible on the 'up to 1e7' plot) is CPU related.
NB. my test script
ts=:3 : 0
b=.''                       NB. accumulator for boxed (time;space) results
a=.y
while. a do.
c=.timespacex 'p:(1e4*a)'   NB. time and space to find the (1e4*a)-th prime
a=.<:a
b=.c;b
end.
}:b
)
a =: ts 200
require'plot'
plot >0{each a NB. time
plot >1{each a NB. space
[Plots omitted: time and space for p: up to 2e5, and time and space for p: up to 1e7.]
During these runs one core was hovering around 100%.
Also, the voc page states:
Currently, arguments larger than 2^31 are tested to be prime according to a probabilistic algorithm (Miller-Rabin).
And in addition to a prime lookup table, as @Mauris points out, v2.c contains this function:
static F1(jtdetmr){A z;B*zv;I d,h,i,n,wn,*wv;
 RZ(w=vi(w));
 wn=AN(w); wv=AV(w);
 GA(z,B01,wn,AR(w),AS(w)); zv=BAV(z);
 for(i=0;i<wn;++i){
  n=*wv++;
  if(1>=n||!(1&n)||0==n%3||0==n%5){*zv++=0; continue;}
  h=0; d=n-1; while(!(1&d)){++h; d>>=1;}
  if     (n<  9080191)*zv++=spspd(31,n,d,h)&&spspd(73,n,d,h);
  else if(n< 94906266)*zv++=spspd( 2,n,d,h)&&spspd( 7,n,d,h)&&spspd(61,n,d,h);
  else                *zv++=spspx( 2,n,d,h)&&spspx( 7,n,d,h)&&spspx(61,n,d,h);
 }
 RE(0); R z;
}    /* deterministic Miller-Rabin */
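For readers who don't speak J's C dialect, here is the same witness-base logic as a Python sketch (my own translation, not code from J; note that, like the C above, it rejects 2, 3 and 5 themselves, which the prime table presumably covers):
def is_sprp(a, n, d, h):
    """Strong probable prime test of n to base a, with n - 1 = d * 2**h, d odd."""
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(h - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False

def detmr(n):
    """Mirror of jtdetmr: screen out small factors, then Miller-Rabin with the
    same fixed witness bases (31 and 73 below 9080191; otherwise 2, 7, 61)."""
    if n <= 1 or n % 2 == 0 or n % 3 == 0 or n % 5 == 0:
        return False
    d, h = n - 1, 0
    while d % 2 == 0:
        d //= 2
        h += 1
    bases = (31, 73) if n < 9080191 else (2, 7, 61)
    return all(is_sprp(a, n, d, h) for a in bases)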

cmm call format for foreign primop (integer-gmp example)

I have been checking out the integer-gmp source code to understand how foreign primops can be implemented in terms of Cmm, as documented on the GHC Primops page. I am aware of techniques to implement them using the LLVM hack or fvia-C/gcc - this is more of a learning experience for me, to understand the third approach that the integer-gmp library uses.
So, I looked up the CMM tutorial on the MSFT page (pdf link), went through the GHC Cmm page, and still there are some unanswered questions (it's hard to keep all those concepts in my head without digging into Cmm, which is what I am doing now). There is this code fragment from the integer-gmp cmm file:
integer_cmm_int2Integerzh (W_ val)
{
   W_ s, p; /* to avoid aliasing */

   ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);

   p = Hp - SIZEOF_StgArrWords;
   SET_HDR(p, stg_ARR_WORDS_info, CCCS);
   StgArrWords_bytes(p) = SIZEOF_W;

   /* mpz_set_si is inlined here, makes things simpler */
   if (%lt(val,0)) {
      s = -1;
      Hp(0) = -val;
   } else {
      if (%gt(val,0)) {
         s = 1;
         Hp(0) = val;
      } else {
         s = 0;
      }
   }

   /* returns (# size :: Int#,
                data :: ByteArray#
              #)
   */
   return (s,p);
}
As defined in the GHC Cmm header:
W_ is an alias for word.
ALLOC_PRIM_N is a function for allocating memory on the heap for a primitive object.
Sp(n) and Hp(n) are defined as below (comments are mine):
#define WDS(n) ((n)*SIZEOF_W) //WDS(n) calculates n*sizeof(Word)
#define Sp(n) W_[Sp + WDS(n)]//Sp(n) points to Stackpointer + n word offset?
#define Hp(n) W_[Hp + WDS(n)]//Hp(n) points to Heap pointer + n word offset?
I don't understand lines 5-9 (line 1 is the start in case you have 1/0 confusion). More specifically:
Why is the function call format of ALLOC_PRIM_N (bytes,fun,arg) that way?
Why is p manipulated that way?
The function as I understand it (from looking at the function signature in Prim.hs) takes an int and returns an (int, byte array) pair (stored in s and p respectively in the code).
For anyone who is wondering about the inline call in the if block, it is the Cmm implementation of the gmp mpz_init_si function. My guess is that if you call a function defined in an object file through ccall, it can't be inlined (which makes sense since it is object code, not intermediate code - the LLVM approach seems more suitable for inlining, through LLVM IR). So, the optimization was to define a Cmm representation of the function to be inlined. Please correct me if this guess is wrong.
An explanation of lines 5-9 will be very much appreciated. I have more questions about other macros defined in the integer-gmp file, but it might be too much to ask in one post. If you can answer the question with a Haskell wiki page or a blog post (you can post the link as an answer), that would be much appreciated (and if you do, I would also appreciate a step-by-step walk-through of an integer-gmp cmm macro such as GMP_TAKE2_RET1).
Those lines allocate a new ByteArray# on the Haskell heap, so to understand them you first need to know a bit about how GHC's heap is managed.
Each capability (= OS thread that executes Haskell code) has its own dedicated nursery, an area of the heap into which it makes normal, small allocations like this one. Objects are simply allocated sequentially into this area from low addresses to high addresses until the capability tries to make an allocation which exceeds the remaining space in the nursery, which triggers the garbage collector.
All heap objects are aligned to a multiple of the word size, i.e., 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.
The Cmm-level register Hp points to (the beginning of) the last word which has been allocated in the nursery. HpLim points to the last word which can be allocated in the nursery. (HpLim can also be set to 0 by another thread to stop the world for GC, or to send an asynchronous exception.)
https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects has information on the layout of individual heap objects. Notably each heap object begins with an info pointer, which (among other things) identifies what sort of heap object it is.
The Haskell type ByteArray# is implemented with the heap object type ARR_WORDS. An ARR_WORDS object just consists of (an info pointer followed by) a size (in bytes) followed by arbitrary data (the payload). The payload is not interpreted by the GC, so it can't store pointers to Haskell heap objects, but it can store anything else. SIZEOF_StgArrWords is the size of the header common to all ARR_WORDS heap objects, and in this case the payload is just a single word, so SIZEOF_StgArrWords + WDS(1) is the amount of space we need to allocate.
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val) expands to something like
Hp = Hp + (SIZEOF_StgArrWords + WDS(1));
if (Hp > HpLim) {
HpAlloc = SIZEOF_StgArrWords + WDS(1);
goto stg_gc_prim_n(integer_cmm_int2Integerzh, val);
}
First line increases Hp by the amount to be allocated. Second line checks for heap overflow. Third line records the amount that we tried to allocate, so the GC can undo it. The fourth line calls the GC.
The fourth line is the most interesting. The arguments tell the GC how to restart the thread once garbage collection is done: it should reinvoke integer_cmm_int2Integerzh with argument val. The "_n" in stg_gc_prim_n (and the "_N" in ALLOC_PRIM_N) means that val is a non-pointer argument (in this case an Int#). If val were a pointer to a Haskell heap object, the GC needs to know that it is live (so it doesn't get collected) and to reinvoke our function with the new address of the object. In that case we'd use the _p variant. There are also variants like _pp for multiple pointer arguments, _d for Double# arguments, etc.
After line 5, we've successfully allocated a block of SIZEOF_StgArrWords + WDS(1) bytes and, remember, Hp points to its last word. So, p = Hp - SIZEOF_StgArrWords sets p to the beginning of this block. Line 8 fills in the info pointer of p, identifying the newly-created heap object as ARR_WORDS. CCCS is the current cost-center stack, used only for profiling. When profiling is enabled, each heap object contains an extra field that basically identifies who is responsible for its allocation; in non-profiling builds, there is no CCCS and SET_HDR just sets the info pointer. Finally, line 9 fills in the size field of the ByteArray#. The rest of the function fills in the payload and returns the sign value and the ByteArray# object pointer.
So, this ended up being more about the GHC heap than about the Cmm language, but I hope it helps.
Required knowledge
In order to do arithmetic and logical operations, computers have a digital circuit called an ALU (Arithmetic Logic Unit) in their CPU (Central Processing Unit). An ALU loads data from input registers. A processor register is a small amount of storage implemented in SRAM (Static Random-Access Memory) on the CPU chip, even faster to access than the L1 cache (which serves data requests within a few CPU clock ticks). A processor often contains several kinds of registers, usually differentiated by the number of bits they can hold.
Numbers are expressed in discrete bits and can therefore hold only a finite number of values. Typically, the following primitive number types are exposed by the programming language (here, Haskell):
8 bit numbers = 256 unique representable values
16 bit numbers = 65 536 unique representable values
32 bit numbers = 4 294 967 296 unique representable values
64 bit numbers = 18 446 744 073 709 551 616 unique representable values
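The counts above are simply 2^bits; a quick check, and a reminder of why anything beyond the largest fixed size needs arbitrary-precision arithmetic (which GMP provides, and which Python integers, used here just for illustration, do natively):
for bits in (8, 16, 32, 64):
    print(bits, 2 ** bits)    # 256, 65536, 4294967296, 18446744073709551616

print(2 ** 64 * 12345)        # fine with arbitrary precision, impossible in one 64-bit register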
Fixed-precision arithmetic for those types is implemented in hardware. Word size refers to the number of bits a computer's CPU can process in one go; for the x86 architecture this is 32 bits, and for x64 it is 64 bits.
IEEE 754 defines the floating point standard for {16, 32, 64, 128} bit numbers. For example, a 32 bit floating point number (with 4 294 967 296 unique bit patterns) can hold approximate values in [-3.402823e38, 3.402823e38] with an accuracy of about 7 decimal digits.
In addition
The acronym GMP means GNU Multiple Precision Arithmetic Library; it adds support for software-emulated arbitrary-precision arithmetic. The Glasgow Haskell Compiler's Integer implementation uses it.
GMP aims to be faster than any other bignum library for all operand sizes. Some important factors in doing this are:
Using full words as the basic arithmetic type.
Using different algorithms for different operand sizes; algorithms that are faster for very big numbers are usually slower for small numbers.
Highly optimized assembly language code for the most important inner loops, specialized for different processors.
Answer
For some, Haskell might have slightly hard-to-comprehend syntax, so here is a JavaScript version:
var integer_cmm_int2Integerzh = function(word) {
  return WORDSIZE == 32
      ? goog.math.Integer.fromInt(word)
      : goog.math.Integer.fromBits([word.getLowBits(), word.getHighBits()]);
};
Here goog is the Google Closure library; the class used is goog.math.Integer. The called functions:
goog.math.Integer.fromInt = function(value) {
  if (-128 <= value && value < 128) {
    var cachedObj = goog.math.Integer.IntCache_[value];
    if (cachedObj) {
      return cachedObj;
    }
  }
  var obj = new goog.math.Integer([value | 0], value < 0 ? -1 : 0);
  if (-128 <= value && value < 128) {
    goog.math.Integer.IntCache_[value] = obj;
  }
  return obj;
};

goog.math.Integer.fromBits = function(bits) {
  var high = bits[bits.length - 1];
  return new goog.math.Integer(bits, high & (1 << 31) ? -1 : 0);
};
That is not totally correct, as the return type should be return (s,p); where
s is the sign (the size field)
p is the ByteArray# holding the magnitude (the value)
In order to fix this, a GMP wrapper should be created. This has been done in the Haskell to JavaScript compiler project (source link).
Lines 5-9
ALLOC_PRIM_N (SIZEOF_StgArrWords + WDS(1), integer_cmm_int2Integerzh, val);
p = Hp - SIZEOF_StgArrWords;
SET_HDR(p, stg_ARR_WORDS_info, CCCS);
StgArrWords_bytes(p) = SIZEOF_W;
are as follows:
allocate heap space for the new ARR_WORDS object
compute a pointer p to the start of that object
set the object's header (info pointer)
set the object's size field (in bytes)
