I have several number-crunching operations that account for a good portion of the CPU time. One example of such operations is this function:
import Data.Number.Erf
import Math.Gamma
import Math.GaussianQuadratureIntegration as GQI
-- Kummer's "1F1", a.k.a. M(a,b,z), confluent hypergeometric function
-- Approximation by the Gaussian Quadrature method from 128 up to 1024 points of resolution
kummer :: Double -> Double -> Double -> Double -> Double
kummer a b z err = gammaFactor * integralPart
  where
    gammaFactor = (gamma b) / (gamma a * gamma (b-a))
    integralPart = (integrator err) fun 0 1
    fun = (\t -> (e ** (z * t)) * (1-t) ** (b-a-1) * t ** (a-1))
    e = exp 1
    integrator err
      | err > 0.1 = GQI.nIntegrate128
      | err > 0.01 = GQI.nIntegrate256
      | err > 0.001 = GQI.nIntegrate512
      | otherwise = GQI.nIntegrate1024
So, I was wondering whether there are rules to follow about when a function should be marked INLINE to improve performance. The Repa authors suggest:
Add INLINE pragmas to all leaf-functions in your code, especially ones
that compute numeric results. Non-inlined lazy function calls can cost
upwards of 50 cycles each, while each numeric operator only costs one
(or less). Inlining leaf functions also ensures they are specialized
at the appropriate numeric types.
Do these recommendations also apply to other numerical calculations, or only to array computations? Or is there a more general guide for deciding when a function should be inlined?
Note that this post, Is there any reason not to use the INLINABLE pragma for a function?, does not directly address whether the hints provided by the programmer truly help the compiler to optimize the code.
For safe prime p, prime q = (p - 1)/2, and generator g = 2, we have a distinct sequence (mod p):
g^0, g^1, ..., g^(q-1)
Then the sequence repeats,
g^q (mod p) = g^0 (mod p)
The largest power needed is g^(q-1), but it can't be computed. Running npx ts-node Test.ts on:
const crypto = require('crypto');
const dh = crypto.createDiffieHellman(1024);
const p = BigInt(`0x${dh.getPrime().toString('hex')}`);
const g = BigInt(`0x${dh.getGenerator().toString('hex')}`);
const q = (p - 1n)/2n;
console.log(g ** 0n % p)
console.log(g ** 1n % p)
console.log(g ** (q-1n) % p)
As expected, 1n and 2n are output, then:
console.log(g ** (q-1n) % p)
^
RangeError: Maximum BigInt size exceeded
What's going wrong?
g ** (q-1n) % p is too big because the reduction mod p is done only at the end and g ** (q-1n) is indeed too big for the standard representation as a BigInt: it has more hex digits than there are atoms in the universe.
This "intermediate overflow" would not happen if you reduced mod p after each multiplication (this does not change the result):
function powermod(base, exp, p) {
  var result = 1n;
  while (exp !== 0n) {
    if (exp % 2n === 1n) result = result * base % p;
    base = base * base % p;
    exp >>= 1n;
  }
  return result;
}
console.log(powermod(g, q-1n, p));
(This is not necessarily the fastest way to compute the power, see TAOCP, section 4.6.3.)
powermod is a ternary operation which cannot be expressed in terms of the binary operations ** and %, unless a JavaScript engine is able to recognize an expression like base ** exp % p and treat it as ternary. I don't know if such JavaScript engines exist. But there are npm packages for "finite fields" that implement this operation.
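As a cross-check outside JavaScript, here is a minimal sketch of the same square-and-multiply loop in Python, compared against Python's built-in three-argument pow, which is one example of a language treating modular exponentiation as a ternary operation. The small safe prime p = 23 (with q = 11, g = 2) is an illustrative assumption, not the 1024-bit prime from the question.

```python
def powermod(base: int, exp: int, p: int) -> int:
    """Square-and-multiply, reducing mod p after every multiplication."""
    result = 1
    base %= p
    while exp != 0:
        if exp & 1:                      # lowest bit of the exponent is set
            result = result * base % p
        base = base * base % p           # square for the next bit
        exp >>= 1
    return result

# Illustrative small safe prime: p = 23, q = (p - 1) // 2 = 11, g = 2.
p, g = 23, 2
q = (p - 1) // 2

print(powermod(g, q - 1, p))   # 12, i.e. 2**10 % 23
print(powermod(g, q, p))       # 1, the sequence repeats at g^q
print(pow(g, q - 1, p))        # built-in ternary pow agrees: 12
```

No intermediate value ever exceeds p squared, so the size of the exponent does not matter.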
Building upon an excellent answer in an attempt to answer my own question:
BigInt is a best effort to "represent numeric values which are too large to be represented by the number primitive" and to define "the following operators": + * - % ** (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/BigInt).
Unfortunately, this best effort fails: For instance, large exponents produce numbers which can't be handled.
Such limitations can be overcome, e.g., large exponents modulo some divisor can be computed by other means, without using the exponentiation operator.
One limitation (seemingly) stems from operator precedence (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Operator_Precedence): exponentiation (**) takes precedence over remainder (%), whereas these operations need to be considered in conjunction to avoid the error.
Presumably—similarly to Number—there exist maximum and minimum numeric values after which RangeError will arise.
TL;DR: BigInt is a best effort; computations may produce numbers BigInt can't handle and RangeError will arise.
I want to prove the goal below.
n_n: ℕ
n_ih: n_n * (n_n + 1) / 2 = arith_sum n_n
⊢ (n_n + 1) * (n_n + 1 + 1) / 2 = n_n + 1 + n_n * (n_n + 1) / 2
ring, simp, and linarith are not working.
I also tried calc, but it is too long.
Is there any automatic command to cancel a common constant in an equation?
I would say that you were asking the wrong question. Your hypothesis and goal contain / but this is not mathematical division, this is a pathological function which computer scientists use, which takes as input two natural numbers and is forced to return a natural number, so often can't return the right answer. For example 5 / 2 = 2 with the division you're using. Computer scientists call it "division with remainder" and I call it "broken and should never be used". When I'm doing this sort of exercise with my class I always coerce everything to the rationals before I do the division, so the division is mathematical division rather than this pathological function which does not satisfy things like (a / b) * b = a. The fact that this division doesn't obey the rules of normal division is why you can't get the tactics to work. If you coerce everything to the rationals before doing the division then you won't get into this mess and ring will work fine.
If you do want to persevere down the natural division road then one approach would be to start by proving that n(n+1) is always even, so that you can deduce (n(n+1)/2)*2 = n(n+1). Alternatively you could avoid this by observing that to show A/2=B/2 it suffices to prove that A=B. But either way you'll have to do a few lines of manual fiddling because you're not using mathematical functions, you're using computer science approximations, so mathematical tactics don't work with them.
Here's what my approach looks like:
import algebra.big_operators
open_locale big_operators
open finset
def arith_sum (n : ℕ) := ∑ i in range n, (i : ℚ) -- answer is rational
example (n : ℕ) : arith_sum n = n*(n-1)/2 :=
begin
  unfold arith_sum,
  induction n with d hd,
  { simp },
  { rw [finset.sum_range_succ, hd, nat.succ_eq_add_one],
    push_cast,
    ring, -- works now
  }
end
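For reference, the algebra that ring has to close in the inductive step, once everything lives in the rationals, can be written out as follows. Here d is the induction variable from the proof above and the sum runs over range (d+1), that is, over 0, ..., d.

```latex
\sum_{i=0}^{d} i
  \;=\; \sum_{i=0}^{d-1} i \;+\; d
  \;=\; \frac{d(d-1)}{2} + d
  \;=\; \frac{d(d+1)}{2}
  \;=\; \frac{(d+1)\bigl((d+1)-1\bigr)}{2}
```

Every step is an identity in a field, which is why ring succeeds here but not with natural-number division.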
I have a custom (discrete) probability distribution defined somewhat in the form: f(x)/(sum(f(x')) for x' in a given discrete set X). Also, 0<=x<=1.
So I have been trying to implement it in Python 3.8.2, and the problem is that the numerator and denominator both come out to be really small, so Python's floating point representation just takes them as 0.0.
After calculating these probabilities, I need to sample a random element from an array, whose each index may be selected with the corresponding probability in the distribution. So if my distribution is [p1,p2,p3,p4], and my array is [a1,a2,a3,a4], then probability of selecting a2 is p2 and so on.
So how can I implement this in an elegant and efficient way?
Is there any way I could use the np.random.beta() in this case? Since the difference between the beta distribution and my actual distribution is only that the normalization constant differs and the domain is restricted to a few points.
Note: The Probability Mass function defined above is actually in the form given by the Bayes theorem and f(x)=x^s*(1-x)^f, where s and f are fixed numbers for a given iteration. So the exact problem is that, when s or f become really large, this thing goes to 0.
You could well compute things by working with logs. The point is that while both the numerator and denominator might underflow to 0, their logs won't unless your numbers are really astonishingly small.
You say
f(x) = x^s*(1-x)^t
so
logf (x) = s*log(x) + t*log(1-x)
and you want to compute, say
p = f(x) / Sum{ y in X | f(y)}
so
p = exp( logf(x) - log sum { y in X | f(y)})
  = exp( logf(x) - log sum { y in X | exp( logf( y))})
The only difficulty is in computing the second term, but this is a common problem, usually known as log-sum-exp.
On the other hand computing logsumexp is easy enough to do by hand.
We want
S = log( sum{ i | exp(l[i])})
if L is the maximum of the l[i] then
S = log( exp(L)*sum{ i | exp(l[i]-L)})
= L + log( sum{ i | exp( l[i]-L)})
The last sum can be computed as written, because each term is now between 0 and 1 so there is no danger of overflow, and one of the terms (the one for which l[i]==L) is 1, and so if other terms underflow, that is harmless.
This may however lose a little accuracy. A refinement would be to recognize the set A of indices where
l[i]>=L-eps (eps a user set parameter, eg 1)
And then compute
N = Sum{ i in A | exp(l[i]-L)}
B = log1p( Sum{ i not in A | exp(l[i]-L)}/N)
S = L + log( N) + B
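The basic (unrefined) version of this recipe can be sketched in Python as below, using only the standard library. The helper names log_f, logsumexp and probabilities, the grid xs, and the values of s and t are illustrative assumptions; the sketch also assumes 0 < x < 1 strictly, since log(0) is undefined at the endpoints.

```python
import math
import random

def log_f(x, s, t):
    # log of f(x) = x**s * (1 - x)**t, assuming 0 < x < 1
    return s * math.log(x) + t * math.log1p(-x)

def logsumexp(ls):
    # S = L + log(sum(exp(l[i] - L))), with L the maximum of the l[i]
    big = max(ls)
    return big + math.log(sum(math.exp(l - big) for l in ls))

def probabilities(xs, s, t):
    # p(x) = exp(logf(x) - log sum_y exp(logf(y)))
    ls = [log_f(x, s, t) for x in xs]
    total = logsumexp(ls)
    return [math.exp(l - total) for l in ls]

xs = [0.1, 0.3, 0.5, 0.7, 0.9]          # hypothetical discrete set X
probs = probabilities(xs, s=4000, t=6000)

# Draw one element of xs with the corresponding probabilities.
sample = random.choices(xs, weights=probs, k=1)[0]

print(probs)
print(0.5 ** 4000 * 0.5 ** 6000)        # the naive numerator underflows to 0.0
```

Terms that are negligibly small relative to the largest one may still underflow to 0.0, which is harmless because the largest term contributes exactly 1 to the inner sum.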
Suppose I have a vector x of normally distributed variables with mean m and standard deviation s.
Is there an efficient (explicit) function f(x, m, s) transforming x to a uniformly distributed vector?
By explicit I mean that the function only uses standard mathematical operations like +, -, *, /, pow(), exp(), but no for loops. So actually I'm looking for a transformation function approximating the cumulative distribution function of the normal distribution.
In this paper I found a solution for a normal distribution with mean=0 and std=1. To make it applicable to any mean and std, one has to subtract the mean m and divide by the std s:
x = (x - m) / s
x_uni = 1. / (exp(-(358. * x)/23. + 111. * arctan(37. * x / 294.)) + 1)
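One way to sanity-check the formula is to compare it with the exact normal CDF written in terms of the error function. The following is a minimal sketch using only Python's standard library; the function names, the comparison grid from -6 to 6, and the overflow guard are assumptions added for illustration.

```python
import math

def normal_cdf_approx(x, m=0.0, s=1.0):
    # Standardize, then apply the closed-form approximation quoted above.
    z = (x - m) / s
    exponent = -358.0 * z / 23.0 + 111.0 * math.atan(37.0 * z / 294.0)
    if exponent > 700.0:
        # math.exp would overflow; the CDF is effectively 0 this far left.
        return 0.0
    return 1.0 / (math.exp(exponent) + 1.0)

def normal_cdf_exact(x, m=0.0, s=1.0):
    # Reference value via the error function.
    return 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))

grid = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(normal_cdf_approx(x) - normal_cdf_exact(x)) for x in grid)
print(max_err)
```

For a vectorized version the same expression can be applied elementwise, since it uses only arithmetic, exp and arctan.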
Purely Functional Data Structures has the following exercise:
-- 2.5 Sharing can be useful within a single object, not just between objects.
-- For example, if the two subtrees of a given node are identical, then they can
-- be represented by the same tree.
-- Part a: make a `complete a Int` function that creates a tree of
-- depth Int, putting a in every leaf of the tree.
complete :: a -> Integer -> Maybe (Tree a)
complete x depth
  | depth < 0 = Nothing
  | otherwise = Just $ complete' depth
  where complete' d
          | d == 0 = Empty
          | otherwise = let copiedTree = complete' (d-1)
                        in Node x copiedTree copiedTree
Does this implementation run in O(d) time? Could you please say why or why not?
The interesting part of the code is the complete' function:
complete' d
  | d == 0 = Empty
  | otherwise = let copiedTree = complete' (d-1)
                in Node x copiedTree copiedTree
As Cirdec's answer suggests, we should be careful to analyze each part of the implementation to make sure our assumptions are valid. As a general rule, we can assume that the following take 1 unit of time each*:
Using a data constructor to construct a value (e.g., using Empty to make an empty tree or using Node to turn a value and two trees into a tree).
Pattern matching on a value to see what data constructor it was built from and what values the data constructor was applied to.
Guards and if/then/else expressions (which are implemented internally using pattern matching).
Comparing an Integer to 0.
Cirdec mentions that the operation of subtracting 1 from an Integer is logarithmic in the size of the integer. As they say, this is essentially an artifact of the way Integer is implemented. It is possible to implement integers so that it takes only one step to compare them to 0 and also takes only one step to decrement them by 1. To keep things very general, it's safe to assume that there is some function c such that the cost of decrementing an Integer is c(depth).
Now that we've taken care of those preliminaries, let's get down to work! As is generally the case, we need to set up a system of equations and solve it. Let f(d) be the number of steps needed to calculate complete' d. Then the first equation is very simple:
f(0) = 2
That's because it costs 1 step to compare d to 0, and another step to check that the result is True.
The other equation is the interesting part. Think about what happens when d > 0:
We calculate d == 0.
We check if that is True (it's not).
We calculate d-1 (let's call the result dm1)
We calculate complete' dm1, saving the result as copiedTree.
We apply a Node constructor to x, copiedTree, and copiedTree.
The first and second parts take 1 step each. The third part takes c(depth) steps, and the fifth part takes 1 step. What about the fourth part? Well, that takes f(d-1) steps, so this will be a recursive definition.
f(0) = 2
f(d) = (3+c(depth)) + f(d-1) when d > 0
OK, now we're cooking with gas! Let's calculate the first few values of f:
f(0) = 2
f(1) = (3+c(depth)) + f(0) = (3+c(depth)) + 2
f(2) = (3+c(depth)) + f(1)
= (3+c(depth)) + ((3+c(depth)) + 2)
= 2*(3+c(depth)) + 2
f(3) = (3+c(depth)) + f(2)
= (3+c(depth)) + (2*(3+c(depth)) + 2)
= 3*(3+c(depth)) + 2
You should be starting to see a pattern by now:
f(d) = d*(3+c(depth)) + 2
We generally prove things about recursive functions using mathematical induction.
Base case:
The claim holds for d=0 because 0*(3+c(depth))+2=0+2=2=f(0).
Inductive step:
Suppose that the claim holds for d=D. Then
f(D+1) = (3+c(depth)) + f(D)
= (3+c(depth)) + (D*(3+c(depth))+2)
= (D+1)*(3+c(depth))+2
So the claim holds for D+1 as well. Thus by induction, it holds for all natural numbers d. As a reminder, this gives the conclusion that complete' d takes
f(d) = d*(3+c(depth))+2
time. Now how do we express that in big O terms? Well, big O doesn't care about the constant coefficients of any of the terms, and only cares about the highest-order terms. We can safely assume that c(depth)>=1, so we get
f(d) ∈ O(d*c(depth))
Zooming out to complete, this looks like O(depth*c(depth))
If you use the real cost of Integer decrement, this gives you O(depth*log(depth)). If you pretend that Integer decrement is O(1), this gives you O(depth).
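The closed form can also be checked numerically. The sketch below is in Python rather than Haskell and simply evaluates the recurrence f(0) = 2, f(d) = (3 + c) + f(d-1) step by step, treating c(depth) as a fixed constant c for a given run; the function names are illustrative assumptions.

```python
def steps(d, c):
    """Evaluate f(0) = 2, f(d) = (3 + c) + f(d - 1) iteratively."""
    total = 2                  # f(0): compare d to 0, then check the result
    for _ in range(d):
        total += 3 + c         # compare, check, decrement (cost c), build Node
    return total

def closed_form(d, c):
    """The claimed solution f(d) = d * (3 + c) + 2."""
    return d * (3 + c) + 2

# The recurrence and the closed form agree on every depth tried.
for d in range(50):
    assert steps(d, c=1) == closed_form(d, c=1)

print(steps(3, c=1))           # 14 = 3 * (3 + 1) + 2
```

This only confirms the algebra of the cost model; it does not measure the actual running time of the Haskell code.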
Side note: As you continue to work through Okasaki, you will eventually reach section 10.2.1, where you will see a way to implement natural numbers supporting O(1) decrement and O(1) addition (but not efficient subtraction).
* Haskell's lazy evaluation keeps this from being precisely true, but if you pretend that everything is evaluated strictly, you will get an upper bound for the true value, which will be good enough in this case. If you want to learn how to analyze data structures that use laziness to get good asymptotic bounds, you should keep reading Okasaki.
Theoretical Answer
No, it does not run in O(d) time. Its asymptotic performance is dominated by the Integer subtraction d-1, which takes O(log d) time. This is repeated O(d) times, giving an asymptotic upper bound on time of O(d log d).
This upper bound can improve if you use an Integer representation with an asymptotically optimal O(1) decrement. In practice we don't, since the asymptotically optimal Integer implementations are slower even for unimaginably large values.
Practically the Integer arithmetic will be a small part of the running time of the program. For practical "large" depths (smaller than a machine word) the program's running time will be dominated by allocating and populating memory. For larger depths you will exhaust the resources of the computer.
Practical Answer
Ask the run time system's profiler.
In order to profile your code, we first need to make sure it is run. Haskell is lazily evaluated, so, unless we do something to cause the tree to be completely evaluated, it might not be. Unfortunately, completely exploring the tree will take O(2^d) steps. We could avoid forcing nodes we had already visited if we kept track of their StableNames. Fortunately, traversing a structure and keeping track of visited nodes by their memory locations is already provided by the data-reify package. Since we will be using it for profiling, we need to install it with profiling enabled (-p).
cabal install -p data-reify
Using Data.Reify requires the TypeFamilies extension and Control.Applicative.
{-# LANGUAGE TypeFamilies #-}
import Data.Reify
import Control.Applicative
We reproduce your Tree code.
data Tree a = Empty | Node a (Tree a) (Tree a)
complete :: a -> Integer -> Maybe (Tree a)
complete x depth
  | depth < 0 = Nothing
  | otherwise = Just $ complete' depth
  where complete' d
          | d == 0 = Empty
          | otherwise = let copiedTree = complete' (d-1)
                        in Node x copiedTree copiedTree
Converting data to a graph with data-reify requires that we have a base functor for the data type. The base functor is a representation of the type with explicit recursion removed. The base functor for Tree is TreeF. An additional type parameter is added for the representation of recursive occurrence of the type, and each recursive occurrence is replaced by the new parameter.
data TreeF a x = EmptyF | NodeF a x x
deriving (Show)
The MuRef instance required by reifyGraph requires that we provide a mapDeRef to traverse the structure with an Applicative and convert it to the base functor. The first argument provided to mapDeRef, which I have named deRef, is how we convert the recursive occurrences of the structure.
instance MuRef (Tree a) where
  type DeRef (Tree a) = TreeF a
  mapDeRef deRef Empty = pure EmptyF
  mapDeRef deRef (Node a l r) = NodeF a <$> deRef l <*> deRef r
We can make a little program to run to test the complete function. When the graph is small, we'll print it out to see what's going on. When the graph gets big, we'll only print out how many nodes it has.
main = do
  d <- getLine
  let (Just tree) = complete 0 (read d)
  graph@(Graph nodes _) <- reifyGraph tree
  if length nodes < 30
    then print graph
    else print (length nodes)
I put this code in a file named profileSymmetricTree.hs. To compile it, we need to enable profiling with -prof and enable run-time system options with -rtsopts.
ghc -fforce-recomp -O2 -prof -fprof-auto -rtsopts profileSymmetricTree.hs
When we run it, we'll enable the time profile with the +RTS option -p. We'll give it the depth input 3 for the first run.
profileSymmetricTree +RTS -p
3
let [(1,NodeF 0 2 2),(2,NodeF 0 3 3),(3,NodeF 0 4 4),(4,EmptyF)] in 1
We can already see from the graph that the nodes are being shared between the left and right sides of the tree.
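The same sharing effect can be illustrated outside Haskell. The following Python sketch is only an analogue, not the data-reify approach: it builds the tree by reusing one subtree object for both children and then counts distinct node objects by identity. The Node class, the use of None for Empty, and the helper names are assumptions made for this illustration.

```python
class Node:
    def __init__(self, value, left, right):
        self.value, self.left, self.right = value, left, right

def complete(x, depth):
    """Build a complete tree of the given depth; None stands for Empty."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    tree = None
    for _ in range(depth):
        tree = Node(x, tree, tree)   # both children are the same object
    return tree

def count_distinct_nodes(tree):
    """Count Node objects reachable from tree, visiting each identity once."""
    seen = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        stack.append(node.left)
        stack.append(node.right)
    return len(seen)

print(count_distinct_nodes(complete(0, 3)))   # 3 Node objects, plus Empty
```

With sharing, the number of distinct nodes grows linearly with the depth even though the tree, read as an unshared structure, has 2^depth - 1 nodes.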
The profiler makes a file, profileSymmetricTree.prof.
                                                     individual      inherited
COST CENTRE                    MODULE  no.  entries  %time %alloc   %time %alloc
MAIN                           MAIN     43        0    0.0    0.7   100.0  100.0
main                           Main     87        0  100.0   21.6   100.0   32.5
...
main.(...)                     Main     88        1    0.0    4.8     0.0    5.1
complete                       Main     90        1    0.0    0.0     0.0    0.3
complete.complete'             Main     92        4    0.0    0.2     0.0    0.3
complete.complete'.copiedTree  Main     94        3    0.0    0.1     0.0    0.1
It shows in the entries column that complete.complete' was executed 4 times, and the complete.complete'.copiedTree was evaluated 3 times.
If you repeat this experiment with different depths, and plot the results, you should get a good idea what the practical asymptotic performance of complete is.
Here are the profiling results for a much greater depth, 300000.
                                                     individual      inherited
COST CENTRE                    MODULE  no.  entries  %time %alloc   %time %alloc
MAIN                           MAIN     43        0    0.0    0.0   100.0  100.0
main                           Main     87        0    2.0    0.0    99.9  100.0
...
main.(...)                     Main     88        1    0.0    0.0     2.1    5.6
complete                       Main     90        1    0.0    0.0     2.1    5.6
complete.complete'             Main     92   300001    1.3    4.4     2.1    5.6
complete.complete'.copiedTree  Main     94   300000    0.8    1.3     0.8    1.3