How to divide a 128-bit dividend by a 64-bit divisor, where the dividend's bits are all 1's, and where I only need the 64 LSBs of the quotient?

I need to calculate (2^128 - 1) / x. The divisor, x, is an unsigned 64-bit number. The dividend is composed of two unsigned 64-bit numbers (high and low), where both numbers are UINT64_MAX. I can only use 64-bit arithmetic and need it to be portable (no use of GNU's __int128, MSVC's _udiv128, assembly, or anything like that). I don't need the high part of the quotient; I only need the lower 64 bits.
How can I do this operation?
Also: x >= 3, x is not a power of 2.
Edit: I created my own solution (answer below). But I welcome any other solution that performs better :)

I am not aware of any optimizations that apply to integer division with a constant dividend. To double check, I tried a test case with an all-ones dividend with Compiler Explorer. Using gcc, icc, and clang, with highest optimization level specified, the generated code showed no optimizations being applied to the division.
It is certainly possible to build high-performance 128-bit division routines, but from personal experience I know that this is quite error-prone, and very sophisticated testing is needed to achieve good test coverage including corner cases, as exhaustive testing is not possible at this operand size. The effort for design and test easily exceeds what seems reasonable for an answer on Stack Overflow by two decimal orders of magnitude.
An easy way to perform integer division is to use the algorithm we all learned in grade school, only in binary. This makes the decision about the next quotient bit particularly easy: It is 1 when the current partial remainder is greater than, or equal to, the divisor, and 0 otherwise. Using longhand binary division, the only integer operations we need are additions and subtractions.
We can build portable primitives for performing these on operands of any bit length by mimicking the way a processor's machine instructions are used to effect operations on multi-word integers: ADD with carry-out, ADD with carry-in, ADD with carry-in and carry-out; analogous for SUB. In the code below I am using simple C macros for that; certainly more sophisticated approaches are possible.
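As one possible alternative to the macros, the same carry and borrow primitives can be sketched as small functions. The names addc64 and subb64 are hypothetical and are not used by the code in this answer; the sketch only illustrates how carry-out is derived from unsigned wrap-around.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a + b + carry_in (mod 2^64) and stores the
   carry-out (0 or 1) in *carry_out. carry_in must be 0 or 1. */
uint64_t addc64 (uint64_t a, uint64_t b, unsigned carry_in, unsigned *carry_out)
{
    assert (carry_in <= 1);
    uint64_t s = a + b;          /* may wrap around */
    unsigned c = s < a;          /* 1 if a + b wrapped */
    uint64_t r = s + carry_in;
    c += r < s;                  /* 1 if adding carry_in wrapped */
    *carry_out = c;              /* both wraps cannot happen together */
    return r;
}

/* Hypothetical helper: returns a - b - borrow_in (mod 2^64) and stores the
   borrow-out (0 or 1) in *borrow_out. borrow_in must be 0 or 1. */
uint64_t subb64 (uint64_t a, uint64_t b, unsigned borrow_in, unsigned *borrow_out)
{
    assert (borrow_in <= 1);
    uint64_t d = a - b;          /* may wrap around */
    unsigned c = a < b;          /* 1 if a - b wrapped */
    uint64_t r = d - borrow_in;
    c += d < borrow_in;          /* 1 if subtracting borrow_in wrapped */
    *borrow_out = c;
    return r;
}
```

A two-word addition would then chain the calls: add the low words with carry_in = 0, and pass the resulting carry-out as carry_in when adding the high words.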
Since the system I am working on right now does not have support for 128-bit integers, I prototyped and tested this approach for 64-bit integers. The 128-bit version then was an exercise in simple mechanical renaming. On a modern 64-bit processor I would expect this 128-bit division function to execute in roughly 3000 cycles.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#define SUBCcc(a,b,cy,t0,t1,t2) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t2=t1<t0, cy=cy+t2, t1-t0)
#define SUBcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), cy=t1<t0, t1-t0)
#define SUBC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t1-t0)
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef struct {
uint64_t l;
uint64_t h;
} my_uint128;
my_uint128 bitwise_division_128 (my_uint128 dvnd, my_uint128 dvsr)
{
my_uint128 quot, rem, tmp;
uint64_t cy, t0, t1, t2;
int bits_left = CHAR_BIT * sizeof (my_uint128);
quot.h = dvnd.h;
quot.l = dvnd.l;
rem.h = 0;
rem.l = 0;
do {
quot.l = ADDcc (quot.l, quot.l, cy, t0, t1);
quot.h = ADDCcc (quot.h, quot.h, cy, t0, t1);
rem.l = ADDCcc (rem.l, rem.l, cy, t0, t1);
rem.h = ADDC (rem.h, rem.h, cy, t0, t1);
tmp.l = SUBcc (rem.l, dvsr.l, cy, t0, t1);
tmp.h = SUBCcc (rem.h, dvsr.h, cy, t0, t1, t2);
if (!cy) { // remainder >= divisor
rem.l = tmp.l;
rem.h = tmp.h;
quot.l = quot.l | 1;
}
bits_left--;
} while (bits_left);
return quot;
}
typedef struct {
uint32_t l;
uint32_t h;
} my_uint64;
my_uint64 bitwise_division_64 (my_uint64 dvnd, my_uint64 dvsr)
{
my_uint64 quot, rem, tmp;
uint32_t cy, t0, t1, t2;
int bits_left = CHAR_BIT * sizeof (my_uint64);
quot.h = dvnd.h;
quot.l = dvnd.l;
rem.h = 0;
rem.l = 0;
do {
quot.l = ADDcc (quot.l, quot.l, cy, t0, t1);
quot.h = ADDCcc (quot.h, quot.h, cy, t0, t1);
rem.l = ADDCcc (rem.l, rem.l, cy, t0, t1);
rem.h = ADDC (rem.h, rem.h, cy, t0, t1);
tmp.l = SUBcc (rem.l, dvsr.l, cy, t0, t1);
tmp.h = SUBCcc (rem.h, dvsr.h, cy, t0, t1, t2);
if (!cy) { // remainder >= divisor
rem.l = tmp.l;
rem.h = tmp.h;
quot.l = quot.l | 1;
}
bits_left--;
} while (bits_left);
return quot;
}
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
uint64_t a, b, res, ref;
my_uint64 aa, bb, rr;
do {
a = KISS64;
b = KISS64;
ref = a / b;
aa.l = (uint32_t)a;
aa.h = (uint32_t)(a >> 32);
bb.l = (uint32_t)b;
bb.h = (uint32_t)(b >> 32);
rr = bitwise_division_64 (aa, bb);
res = (((uint64_t)rr.h) << 32) + rr.l;
if (ref != res) {
printf ("a=%016llx b=%016llx res=%016llx ref=%016llx\n", a, b, res, ref);
return EXIT_FAILURE;
}
} while (a);
return EXIT_SUCCESS;
}
A faster approach than bit-wise computation is to compute the reciprocal of the divisor, multiply by the dividend resulting in a preliminary quotient, then compute the remainder to precisely adjust the quotient. The entire computation can be accomplished in fixed-point arithmetic. However, on modern processors with fast floating-point units it is more convenient to generate a starting approximation for the reciprocal with a double-precision division. A single Halley iteration with cubic convergence then results in a full-precision reciprocal.
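To illustrate the cubic convergence, here is a small floating-point sketch of a single Halley step for the reciprocal, x' = x * (1 + e + e^2) with e = 1 - d * x. It works on double purely for readability and is not the fixed-point code used in this answer; halley_recip_step is a hypothetical name.

```c
#include <assert.h>

/* One Halley step for 1/d, starting from the approximation x.
   With e = 1 - d*x, the result satisfies d*x' = 1 - e^3 (up to rounding),
   so the relative error is cubed by each step. */
double halley_recip_step (double d, double x)
{
    assert (d != 0.0);
    double e = 1.0 - d * x;          /* relative error of the current estimate */
    return x + x * (e + e * e);      /* x * (1 + e + e^2) */
}
```

Starting from x = 0.333 as an estimate of 1/3, the relative error is about 1e-3 before the step and about 1e-9 after it, which is why one step suffices once the starting approximation carries roughly a third of the required bits.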
The Halley iteration for the reciprocal is very integer-multiplication intensive, with a 64x64-bit multiply with 128-bit result (umul64wide() in the code below) being the building block crucial to performance. On modern 64-bit architectures this is typically a single machine instruction executing in a few cycles; however, this instruction is not accessible to portable code. Portable code emulating the instruction requires about 15 to 20 instructions depending on architecture and compiler.
The entire 128-bit division should take roughly 300 cycles, or ten times as fast as the simple bit-wise computation. Because the code is fairly complex, it requires a significant amount of testing to ensure correct operation. In the framework below I am using pattern-based and random tests for moderately intensive testing, using the straightforward bit-wise implementation as a reference.
The implementation of udiv128() below assumes that the programming environment uses IEEE-754 compliant floating-point arithmetic, that the double type is mapped to IEEE-754's binary64 type, and that division of double operands is correctly rounded.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>
typedef struct {
uint64_t l;
uint64_t h;
} my_uint128;
my_uint128 make_my_uint128 (uint64_t h, uint64_t l);
my_uint128 add128 (my_uint128 a, my_uint128 b);
my_uint128 sub128 (my_uint128 a, my_uint128 b);
my_uint128 lsl128 (my_uint128 a, int sh);
my_uint128 lsr128 (my_uint128 a, int sh);
my_uint128 not128 (my_uint128 a);
my_uint128 umul128lo (my_uint128 a, my_uint128 b);
my_uint128 umul128hi (my_uint128 a, my_uint128 b);
double my_uint128_to_double (my_uint128 a);
int lt128 (my_uint128 a, my_uint128 b);
int eq128 (my_uint128 a, my_uint128 b);
uint64_t double_as_uint64 (double a);
double uint64_as_double (uint64_t a);
#define FP64_EXPO_BIAS (1023)
#define FP64_MANT_BITS (53)
#define FP64_MANT_IBIT (0x0010000000000000ULL)
#define FP64_MANT_MASK (0x000fffffffffffffULL)
#define FP64_INC_EXP_128 (0x0800000000000000ULL)
#define FP64_MANT_ADJ (2) // adjustment to ensure underestimate
my_uint128 udiv128 (my_uint128 dividend, my_uint128 divisor)
{
const my_uint128 zero = make_my_uint128 (0ULL, 0ULL);
const my_uint128 one = make_my_uint128 (0ULL, 1ULL);
const my_uint128 two = make_my_uint128 (0ULL, 2ULL);
my_uint128 recip, temp, quo, rem;
my_uint128 neg_divisor = sub128 (zero, divisor);
double r;
/* compute initial approximation for reciprocal; must be underestimate! */
r = 1.0 / my_uint128_to_double (divisor);
uint64_t i = double_as_uint64 (r) - FP64_MANT_ADJ + FP64_INC_EXP_128;
temp = make_my_uint128 (0ULL, (i & FP64_MANT_MASK) | FP64_MANT_IBIT);
int sh = (i >> (FP64_MANT_BITS-1)) - FP64_EXPO_BIAS - (FP64_MANT_BITS-1);
recip = (sh < 0) ? lsr128 (temp, -sh) : lsl128 (temp, sh);
/* perform Halley iteration with cubic convergence to refine reciprocal */
temp = umul128lo (neg_divisor, recip);
temp = add128 (umul128hi (temp, temp), temp);
recip = add128 (umul128hi (recip, temp), recip);
/* compute preliminary quotient and remainder */
quo = umul128hi (dividend, recip);
rem = sub128 (dividend, umul128lo (divisor, quo));
/* adjust quotient if too small; quotient off by 2 at most */
if (! lt128 (rem, divisor)) {
quo = add128 (quo, lt128 (sub128 (rem, divisor), divisor) ? one : two);
}
/* handle division by zero */
if (eq128 (divisor, zero)) quo = not128 (zero);
return quo;
}
#define SUBCcc(a,b,cy,t0,t1,t2) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t2=t1<t0, cy=cy+t2, t1-t0)
#define SUBcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), cy=t1<t0, t1-t0)
#define SUBC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t1-t0)
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double uint64_as_double (uint64_t a)
{
double r;
memcpy (&r, &a, sizeof r);
return r;
}
my_uint128 add128 (my_uint128 a, my_uint128 b)
{
uint64_t cy, t0, t1;
a.l = ADDcc (a.l, b.l, cy, t0, t1);
a.h = ADDC (a.h, b.h, cy, t0, t1);
return a;
}
my_uint128 sub128 (my_uint128 a, my_uint128 b)
{
uint64_t cy, t0, t1;
a.l = SUBcc (a.l, b.l, cy, t0, t1);
a.h = SUBC (a.h, b.h, cy, t0, t1);
return a;
}
my_uint128 lsl128 (my_uint128 a, int sh)
{
if (sh >= 64) {
a.h = a.l << (sh - 64);
a.l = 0ULL;
} else if (sh) {
a.h = (a.h << sh) + (a.l >> (64 - sh));
a.l = a.l << sh;
}
return a;
}
my_uint128 lsr128 (my_uint128 a, int sh)
{
if (sh >= 64) {
a.l = a.h >> (sh - 64);
a.h = 0ULL;
} else if (sh) {
a.l = (a.l >> sh) + (a.h << (64 - sh));
a.h = a.h >> sh;
}
return a;
}
my_uint128 not128 (my_uint128 a)
{
a.l = ~a.l;
a.h = ~a.h;
return a;
}
int lt128 (my_uint128 a, my_uint128 b)
{
uint64_t cy, t0, t1, t2;
a.l = SUBcc (a.l, b.l, cy, t0, t1);
a.h = SUBCcc (a.h, b.h, cy, t0, t1, t2);
return cy;
}
int eq128 (my_uint128 a, my_uint128 b)
{
return (a.l == b.l) && (a.h == b.h);
}
// derived from Hacker's Delight 2nd ed. figure 8-2
my_uint128 umul64wide (uint64_t u, uint64_t v)
{
my_uint128 r;
uint64_t u0, v0, u1, v1, w0, w1, w2, t;
u0 = (uint32_t)u; u1 = u >> 32;
v0 = (uint32_t)v; v1 = v >> 32;
w0 = u0 * v0;
t = u1 * v0 + (w0 >> 32);
w1 = (uint32_t)t;
w2 = t >> 32;
w1 = u0 * v1 + w1;
r.h = u1 * v1 + w2 + (w1 >> 32);
r.l = (w1 << 32) + (uint32_t)w0;
return r;
}
my_uint128 make_my_uint128 (uint64_t h, uint64_t l)
{
my_uint128 r;
r.h = h;
r.l = l;
return r;
}
my_uint128 umul128lo (my_uint128 a, my_uint128 b)
{
my_uint128 r;
r = umul64wide (a.l, b.l);
r.h = r.h + a.l * b.h + a.h * b.l;
return r;
}
my_uint128 umul128hi (my_uint128 a, my_uint128 b)
{
my_uint128 t0, t1, t2, t3;
t0 = umul64wide (a.l, b.l);
t3 = add128 (umul64wide (a.h, b.l), make_my_uint128 (0ULL, t0.h));
t1 = make_my_uint128 (0ULL, t3.l);
t2 = make_my_uint128 (0ULL, t3.h);
t1 = add128 (umul64wide (a.l, b.h), t1);
return add128 (add128 (umul64wide (a.h, b.h), t2), make_my_uint128 (0ULL, t1.h));
}
double my_uint128_to_double (my_uint128 a)
{
const int intbits = sizeof (a) * CHAR_BIT;
const my_uint128 zero = make_my_uint128 (0ULL, 0ULL);
my_uint128 rnd, i = a;
uint64_t j;
int sh = 0;
double r;
// normalize integer so MSB is set
if (lt128 (i, make_my_uint128(0x0000000000000001ULL, 0))) {i = lsl128 (i,64); sh += 64; }
if (lt128 (i, make_my_uint128(0x0000000100000000ULL, 0))) {i = lsl128 (i,32); sh += 32; }
if (lt128 (i, make_my_uint128(0x0001000000000000ULL, 0))) {i = lsl128 (i,16); sh += 16; }
if (lt128 (i, make_my_uint128(0x0100000000000000ULL, 0))) {i = lsl128 (i, 8); sh += 8; }
if (lt128 (i, make_my_uint128(0x1000000000000000ULL, 0))) {i = lsl128 (i, 4); sh += 4; }
if (lt128 (i, make_my_uint128(0x4000000000000000ULL, 0))) {i = lsl128 (i, 2); sh += 2; }
if (lt128 (i, make_my_uint128(0x8000000000000000ULL, 0))) {i = lsl128 (i, 1); sh += 1; }
// form mantissa with explicit integer bit
rnd = lsl128 (i, FP64_MANT_BITS);
i = lsr128 (i, intbits - FP64_MANT_BITS);
j = i.l;
// add in exponent, taking into account integer bit of mantissa
if (! eq128 (a, zero)) {
j += (uint64_t)(FP64_EXPO_BIAS + (intbits-1) - 1 - sh) << (FP64_MANT_BITS-1);
}
// round to nearest or even
rnd.h = rnd.h | (rnd.l != 0);
if ((rnd.h > 0x8000000000000000ULL) ||
((rnd.h == 0x8000000000000000ULL) && (j & 1))) j++;
// reinterpret bit pattern as IEEE-754 'binary64'
r = uint64_as_double (j);
return r;
}
my_uint128 bitwise_division_128 (my_uint128 dvnd, my_uint128 dvsr)
{
my_uint128 quot, rem, tmp;
uint64_t cy, t0, t1, t2;
int bits_left = CHAR_BIT * sizeof (dvsr);
quot.h = dvnd.h;
quot.l = dvnd.l;
rem.h = 0;
rem.l = 0;
do {
quot.l = ADDcc (quot.l, quot.l, cy, t0, t1);
quot.h = ADDCcc (quot.h, quot.h, cy, t0, t1);
rem.l = ADDCcc (rem.l, rem.l, cy, t0, t1);
rem.h = ADDC (rem.h, rem.h, cy, t0, t1);
tmp.l = SUBcc (rem.l, dvsr.l, cy, t0, t1);
tmp.h = SUBCcc (rem.h, dvsr.h, cy, t0, t1, t2);
if (!cy) { // remainder >= divisor
rem.l = tmp.l;
rem.h = tmp.h;
quot.l = quot.l | 1;
}
bits_left--;
} while (bits_left);
return quot;
}
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
my_uint128 v[100000]; /* FIXME: size appropriately */
int main (void)
{
const my_uint128 zero = make_my_uint128 (0ULL, 0ULL);
const my_uint128 one = make_my_uint128 (0ULL, 1ULL);
my_uint128 dividend, divisor, quot, ref;
int i, j, patterns, idx = 0, nbrBits = sizeof (v[0]) * CHAR_BIT;
int patterns_done = 0;
/* pattern class 1: 2**i */
for (i = 0; i < nbrBits; i++) {
v [idx] = lsl128 (one, i);
idx++;
}
/* pattern class 2: 2**i-1 */
for (i = 0; i < nbrBits; i++) {
v [idx] = sub128 (lsl128 (one, i), one);
idx++;
}
/* pattern class 3: 2**i+1 */
for (i = 0; i < nbrBits; i++) {
v [idx] = add128 (lsl128 (one, i), one);
idx++;
}
/* pattern class 4: 2**i + 2**j */
for (i = 0; i < nbrBits; i++) {
for (j = 0; j < nbrBits; j++) {
v [idx] = add128 (lsl128 (one, i), lsl128 (one, j));
idx++;
}
}
/* pattern class 5: 2**i - 2**j */
for (i = 0; i < nbrBits; i++) {
for (j = 0; j < nbrBits; j++) {
v [idx] = sub128 (lsl128 (one, i), lsl128 (one, j));
idx++;
}
}
patterns = idx;
/* pattern class 6: one's complement of pattern classes 1 through 5 */
for (i = 0; i < patterns; i++) {
v [idx] = not128 (v [i]);
idx++;
}
/* pattern class 7: two's complement of pattern classes 1 through 5 */
for (i = 0; i < patterns; i++) {
v [idx] = sub128 (zero, v[i]);
idx++;
}
patterns = idx;
printf ("Starting pattern-based tests. Number of patterns: %d\n", patterns);
for (long long int k = 0; k < 100000000000LL; k++) {
if (k < (long long)patterns * patterns) { /* widen first: patterns * patterns overflows int */
dividend = v [k / patterns];
divisor = v [k % patterns];
} else {
if (!patterns_done) {
printf ("Starting random tests\n");
patterns_done = 1;
}
dividend.l = KISS64;
dividend.h = KISS64;
divisor.h = KISS64;
divisor.l = KISS64;
}
/* exclude cases with undefined results: division by zero */
if (! eq128 (divisor, zero)) {
quot = udiv128 (dividend, divisor);
ref = bitwise_division_128 (dividend, divisor);
if (! eq128 (quot, ref)) {
printf ("# (%016llx_%016llx, %016llx_%016llx): quot = %016llx_%016llx ref=%016llx_%016llx\n",
dividend.h, dividend.l, divisor.h, divisor.l,
quot.h, quot.l, ref.h, ref.l);
return EXIT_FAILURE;
}
}
}
printf ("unsigned 128-bit division: tests passed\n");
return EXIT_SUCCESS;
}

This is what I ended up coding. I'm sure there are much faster alternatives, but at least this is functional.
Based on: https://en.wikipedia.org/wiki/Division_algorithm#Integer_division_(unsigned)_with_remainder. Adapted for this particular use case.
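#include <stdint.h>
// q = (2^128 - 1) / d, where q is the 64 LSBs of the quotient
uint64_t two_pow_128_minus_1_div_d(uint64_t d) {
    uint64_t q = 0, r_hi = 0, r_lo = 0;
    for (int i = 127; i >= 0; --i) {
        // shift the 128-bit remainder left and bring down the next dividend bit (always 1)
        r_hi = (r_hi << 1) | (r_lo >> 63);
        r_lo <<= 1;
        r_lo |= 1;
        if (r_hi || r_lo >= d) {
            // remainder >= divisor: subtract and record a quotient bit
            const uint64_t borrow = d > r_lo;
            r_lo -= d;
            r_hi -= borrow;
            if (i < 64)
                q |= (uint64_t)1 << i; // 1UL may be only 32 bits wide (e.g. on 64-bit Windows)
        }
    }
    return q;
}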
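One possible speed-up, sketched under the assumption that only the low 64 quotient bits are needed: the first 64 iterations of the loop above only consume the high half of the all-ones dividend, so the remainder after them equals UINT64_MAX % d and can be obtained with one native operation. That leaves 64 iterations instead of 128. The function name low_quotient_all_ones is hypothetical, and d must be nonzero.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of (2^128 - 1) / d, using only 64-bit arithmetic. */
uint64_t low_quotient_all_ones (uint64_t d)
{
    assert (d != 0);
    uint64_t r = UINT64_MAX % d;   /* remainder after the high 64 one-bits */
    uint64_t q = 0;
    for (int i = 63; i >= 0; --i) {
        uint64_t carry = r >> 63;  /* bit shifted out of the 64-bit remainder */
        r = (r << 1) | 1;          /* bring down the next dividend bit (always 1) */
        if (carry || r >= d) {
            r -= d;                /* wraps to the correct value when carry is set */
            q |= (uint64_t)1 << i;
        }
    }
    return q;
}
```

The subtraction is safe because the remainder before the shift is below d, so the shifted value is below 2*d and the difference always fits in 64 bits.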

Related

Execution time of very short function

I am required to display the execution time of some searching algorithms. However, when I use start/end_t = clock(), it always displays 0.00000 due to low precision (even when the result is stored in a double).
Please tell me how to display those running times.
int LinearSearch (int M[], int target, int size)
{
    int k = 0;
    for (k = 0; k < size; k++)
    {
        if (M[k] == target)
        {
            return k;
        }
    }
    return -1; // not found; without this the function falls off the end
}
// Note: M must have room for size + 1 elements, because the sentinel is stored at M[size].
int LinearSentinelSearch (int M[], int target, int size)
{
    int k = 0;
    M[size] = target;
    while (M[k] != target)
        k++;
    return k; // k == size means the target was not found
}
int binSearch(int List[], int Target, int Size)
{
    int Mid;
    int low = 0;
    int high = Size - 1;
    while (low <= high)
    {
        Mid = low + (high - low) / 2; // avoids overflow of low + high
        if (List[Mid] == Target) return Mid;
        else if (Target < List[Mid])
            high = Mid - 1;
        else
            low = Mid + 1;
    }
    return -1;
}
You can estimate the mean execution time by running the algorithm N times and dividing the total time by N. Using your binSearch as an example:
int i;
clock_t start, end;
start = clock();
for (i = 0 ; i < 1000 ; i++) {
binSearch(/* your actual parameters here */);
}
end = clock();
printf("Mean ticks to execute binSearch: %f\n", (end - start) / 1000.0);
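The same idea can be wrapped in a small helper that also converts clock ticks to seconds. This is a minimal sketch; mean_seconds and time_mean are hypothetical names, and the function being timed is assumed to take no arguments so that it can be passed as a pointer.

```c
#include <assert.h>
#include <time.h>

/* Convert an interval measured with clock() into mean seconds per repetition. */
double mean_seconds (clock_t start, clock_t end, long reps)
{
    assert (reps > 0);
    return (double)(end - start) / CLOCKS_PER_SEC / (double)reps;
}

/* Call fn reps times and return the mean CPU time per call in seconds. */
double time_mean (void (*fn)(void), long reps)
{
    clock_t start = clock ();
    for (long i = 0; i < reps; i++) {
        fn ();
    }
    clock_t end = clock ();
    return mean_seconds (start, end, reps);
}
```

If the result is still zero, increase reps until the total interval spans many clock ticks. With optimization enabled, the compiler may also remove calls whose results are unused, so it can help to accumulate the return values into a variable that is printed afterwards.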

"Program has triggered a breakpoint" while free memory malloced

When I free the memory, the error "Program has triggered a breakpoint" occurs. The code is below; where is the mistake?
int SSavep(char *visited, int t, int n, int m)
{
int* map = (int*)malloc(m*n * sizeof(int));
int* q = (int*)malloc(m*n * sizeof(int));
int count = 0, cur = 0;
int begin = 0, end = 0;
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
//set value for map
}
}
..........
if (t >= map[end]) {
free(map);
free(q);
return 0;
}
else{
free(map);
free(q);
return -1;
}
}
The entire code is below:
static int dir[4][2] = { {0,1},{0,-1},{1,0},{-1,0} };
int SSavep(char *visited, int t, int n, int m)
{
int* map = (int*)malloc(m*n * sizeof(int));
int* q = (int*)malloc(m*n * sizeof(int));
int count = 0, cur = 0;
int begin = 0, end = 0;
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
if (visited[i*n + j] == '.')
map[i*n + j] = 0;
else if (visited[i*n + j] == '*')
map[i*n + j] = -1;
else if (visited[i*n + j] == 'p') {
map[i*n + j] = -12;
end = i*n + j;
}
else {
map[i*n + j] = -9;
begin = i*n + j;
}
}
}
q[count++] = begin;
while (cur < count && q[cur] != end) {
int i = q[cur] / n;
int j = q[cur] % n;
for (int k = 0; k < 4; k++) {
int ni = i + dir[k][0];
int nj = j + dir[k][1];
if (ni < 0 || ni >= m || nj < 0 || nj >= n || map[ni*n + nj]>0 || map[ni*n + nj] == -1)
continue;
map[ni*n + nj] = map[i*n + j] + 1;
q[count++] = ni*n + nj;
}
cur++;
}
if (map[end] > 0 && t >= map[end]) {
free(map);
free(q);
return 0;
}
else{
free(map);
free(q);
return -1;
}
}
The error is reported at free(q), but the damage is done earlier.
For m = n = 4,
int* q = (int*)malloc(m*n * sizeof(int));
allocates m*n*sizeof(int) = 4*4*4 = 64 bytes (assuming a 4-byte int), i.e. room for 16 ints.
The error occurs because you wrote beyond the memory allocated for q.
Check the value of count before free(q). I got 1208 when calling the function like this:
char* visited = new char[100 * 100];
memset(visited, 0, 10000);
int res = SSavep(visited, 0, 4, 4);
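One way to catch this kind of overrun early is to route every write to the queue through a bounds-checked helper. This is a minimal sketch; queue_push and its parameters are hypothetical names and are not part of the code in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Append value to q if there is room. Returns 1 on success, or 0 if the
   queue already holds capacity elements (in which case nothing is written). */
int queue_push (int *q, size_t *count, size_t capacity, int value)
{
    assert (q != NULL && count != NULL);
    if (*count >= capacity) {
        return 0;
    }
    q[(*count)++] = value;
    return 1;
}
```

In SSavep this would stand in for the q[count++] = ... statements, with capacity set to m*n. A return value of 0 would then indicate that cells are being enqueued more than once, which points at the condition that decides whether a neighbor has already been visited.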
By the way, this algorithm looks a lot like path finding that examines neighboring cells on a map and assigns weights, right? If so, there are many open-source solutions; why not use one of them instead of reinventing the wheel? The Wikipedia page on pathfinding links to several:
https://en.wikipedia.org/wiki/Pathfinding
Check the bottom of the page for the links.

Getting segmentation fault with subset dp problem

Given a set of numbers, check whether it can be partitioned into two subsets such that the sum of the elements in both subsets is the same.
I am getting a segmentation fault in C++ (g++ 5.4) with this problem.
This is where I submitted my solution in C++:
https://practice.geeksforgeeks.org/problems/subset-sum-problem/0
I am checking if the array can be divided into two parts with equal sum. So I am just checking if there exists a subset with sum equal to half the sum of the array
I have implemented the below logic with dynamic programming
Let dp[i][j] denote whether a subset with sum j can be formed from the elements in the range [0, i] (both inclusive), where i is a 0-based index. I have done nothing new with this traditional problem, but I am getting a segmentation fault. The program gives correct output for small test cases. What mistake have I made?
I haven't used any comments because I have done nothing new. Hope it is understandable.
#include <iostream>
#include <bits/stdc++.h>
#include<cstdio>
#define ll long long int
using namespace std;
bool isVowel(char c){
return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}
bool isLower(char c){
return 97 <= c && c <= 122;
}
int main() {
ios_base::sync_with_stdio(false);
cin.tie(NULL);
cout.tie(NULL);
cout << setprecision(10);
ll t, n;
cin >> t;
while (t--) {
cin >> n;
ll a[n];
ll sum = 0;
for (ll i = 0; i < n; i++) {
cin >> a[i];
sum += a[i];
}
if (sum % 2) {
cout << "NO" << '\n';
continue;
}
sum /= 2;
ll dp[n][sum + 1];
for (ll i = 0; i < n; i++) {
for(ll j = 0; j < sum + 1; j++) {
dp[i][j] = 0;
}
}
for (ll i = 0; i < n; i++) {
dp[i][a[i]] = 1;
dp[i][0] = 1;
}
for (ll i = 1; i < n; i++) {
for (ll j = 1; j < sum + 1; j++){
if (j - a[i] > 0) {
dp[i][j] = dp[i - 1][j - a[i]];
}
dp[i][j] |= dp[i - 1][j];
}
}
cout << (dp[n - 1][sum] ? "YES" : "NO") << '\n';
}
}
The segmentation fault is due to
ll dp[n][sum + 1];
Even though the constraints say 1 <= N <= 100 and 0 <= arr[i] <= 1000, the test cases used are probably much larger, so ll dp[n][sum + 1] ends up taking a large amount of stack memory. Use
bool dp[n][sum + 1];
It should work fine.
On a side note, avoid using ll indiscriminately; choose types according to the constraints.
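Going one step further than switching to bool, the table can be reduced to a single heap-allocated row, which removes the stack usage entirely. The following is a sketch in C rather than C++, under the assumption that all elements are non-negative; can_partition is a hypothetical name. It also skips elements larger than the target sum, so it never indexes past the end of the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns true if a[0..n-1] can be split into two subsets of equal sum. */
bool can_partition (const int *a, size_t n)
{
    assert (a != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        assert (a[i] >= 0);              /* assumption: non-negative input */
        total += (size_t)a[i];
    }
    if (total % 2 != 0) return false;    /* odd total cannot be halved */
    size_t half = total / 2;
    bool *reach = calloc (half + 1, sizeof *reach);  /* reach[j]: sum j is attainable */
    if (reach == NULL) return false;     /* allocation failure treated as "no" here */
    reach[0] = true;
    for (size_t i = 0; i < n; i++) {
        size_t v = (size_t)a[i];
        if (v == 0 || v > half) continue;    /* no effect, or would index out of range */
        for (size_t j = half; j >= v; j--) { /* descending, so each element is used once */
            if (reach[j - v]) reach[j] = true;
        }
    }
    bool ok = reach[half];
    free (reach);
    return ok;
}
```

The memory use is half + 1 bytes (where bool is one byte) instead of n * (sum + 1) elements, and the descending inner loop is what allows one row to replace the two-dimensional table.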

Eigen3 Matrix-Matrix Multiplication 30 times faster than own openmp parallelized code

I compiled the code below with Visual Studio 2017 (Visual C++) using /openmp /O2 /arch:AVX.
When running with 8 threads the output is:
dt_loops = 1562ms
dt_eigen = 26 ms
I expected A * B to be faster than my own handmade loops, but I did not expect such a large difference. Is there anything wrong with my code? And if not, how can Eigen3 do it so much faster?
I'm not very experienced with OpenMP or any other parallelization method. I tried different loop orders, but the one below is the fastest.
#include <iostream>
#include <chrono>
#include <Eigen/Dense>
int main() {
std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2;
int n = 1000;
Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
Eigen::MatrixXd C = Eigen::MatrixXd::Zero(n, n);
start1 = std::chrono::system_clock::now();
int i, j, k;
#pragma omp parallel for private(i, j, k)
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
for (k = 0; k < n; ++k) {
C(i, j) += A(i, k) * B(k, j);
}
}
}
end1 = std::chrono::system_clock::now();
std::cout << "dt_loops = " << std::chrono::duration_cast<std::chrono::milliseconds>(end1-start1).count() << " ms" << std::endl;
Eigen::MatrixXd D = Eigen::MatrixXd::Zero(n, n);
start2 = std::chrono::system_clock::now();
D = A * B;
end2 = std::chrono::system_clock::now();
std::cout << "dt_eigen = " << std::chrono::duration_cast<std::chrono::milliseconds>(end2-start2).count() << " ms" << std::endl;
}

How to use calcCovarMatrix on IplImage?

There's a single-channel grayscale IplImage whose covariance matrix is to be calculated. There does happen to be a similar question on SO but no one has answered it and the code is significantly different.
Here's the code that's throwing an "unhandled exception":
int calcCovar( IplImage *src, float* dst, int w )
{
// Input matrix size
int rows = w;
int cols = w;
int i,j;
CvScalar se;
float *a;
a = (float*)malloc( w * w * sizeof(float) );
long int k=0;
//image pixels into 1D array
for(i = 0; i < w; ++i)
{
for(j = 0; j < w; ++j)
{
se = cvGet2D(src, i, j);
a[k++] = (float)se.val[0];
}
}
CvMat input = cvMat(w,w,CV_32FC1, a); //Is this the right way to format input pixels??
// Covariance matrix is N x N,
// where N is input matrix column size
const int n = w;
// Output variables passed by reference
CvMat* output = cvCreateMat(n, n, CV_32FC1);
CvMat* meanvec = cvCreateMat(1, rows, CV_32FC1);
// Calculate covariance matrix - error is here!!
cvCalcCovarMatrix((const void **) &input, rows, output, meanvec, CV_COVAR_NORMAL);
k = 0;
//Show result
cout << "Covariance matrix:" << endl;
for(i=0; i<n; i++) {
for(j=0; j<n; j++) {
cout << "(" << i << "," << j << "): ";
printf ("%f ", cvGetReal2D(output,i,j) / (rows - 1));
dst[k++] = cvGetReal2D(output,i,j) / (rows - 1);
//cout << "\t";
}
cout << endl;
}
return(0);
}
Read the manual entry for this function. If you store your values in the single matrix 'input', the second function parameter must not be 'rows'. The second parameter says how many matrices you pass in the first parameter. You have passed just one matrix, 'input', so the segfault is no surprise.
