How spark reduce works here - apache-spark

How does spark reduce work for this example?
val num = sc.parallelize(List(1,2,3))
val result = num.reduce((x, y) => x + y)
res: Int = 6
val result = num.reduce((x, y) => x + (y * 10))
res: Int = 321
I understand the 1st result (1 + 2 + 3 = 6). For the 2nd result, I thought the result would be 60 but it's not. Can someone explain?
Step1 : 0 + (1 * 10) = 10
Step2 : 10 + (2 * 10) = 30
Step3 : 30 + (3 * 10) = 60
Update:
As per Spark documentation:
The function should be commutative and associative so that it can be
computed correctly in parallel.
https://spark.apache.org/docs/latest/rdd-programming-guide.html

(2,3) -> 2 + 3 * 10 = 32
(1,(2,3)) -> (1,32) -> 1 + 32 * 10 = 321
A reducer (in general, not just Spark) takes a pair of elements, applies the reduce function, then applies the function again to that result and another element, until all elements have been consumed. The order is implementation-specific (or even nondeterministic when run in parallel), so the function should be commutative and associative for the order not to affect the end result. Here x + (y * 10) is neither, which is why the outcome depends on how the elements happen to be paired.
Check also this https://stackoverflow.com/a/31660532/290036
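To make the grouping effect concrete, here is a minimal sketch in plain Python rather than Spark. It uses functools.reduce, which always folds strictly left to right, as a stand-in for one possible evaluation order, and spells out the (1,(2,3)) grouping from the answer by hand. The name f is only illustrative. Note that neither order starts from 0: reduce has no initial value, so the first element is used as-is.

```python
from functools import reduce


def f(x, y):
    """The non-associative, non-commutative function from the question."""
    return x + y * 10


nums = [1, 2, 3]

# Strict left-to-right fold: f(f(1, 2), 3)
left_to_right = reduce(f, nums)

# Grouping described in the answer: f(1, f(2, 3))
right_grouped = f(nums[0], f(nums[1], nums[2]))

# An associative and commutative function gives the same result either way.
add_left = reduce(lambda x, y: x + y, nums)
add_right = nums[0] + (nums[1] + nums[2])

print(left_to_right, right_grouped)  # 51 321
print(add_left, add_right)           # 6 6
```

The two groupings disagree for f but agree for plain addition, which is the property the Spark documentation asks for.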

Related

How to handle n not being a multiple of P in worker processes for matrix multiplication?

I am working on a problem regarding pseudocode for matrix multiplication using worker processes. w is the index of the worker, P is the number of processors (one worker per processor) and n is the dimension of the n x n matrices.
The pseudocode calculates the result matrix by dividing its n rows into P strips of n/P rows each.
process worker[w = 1 to P] {
    int first = (w-1) * n/P;
    int last = first + n/P - 1;
    for [i = first to last] {
        for [j = 0 to n-1] {
            c[i,j] = 0.0;
            for [k = 0 to n-1]
                c[i,j] = c[i,j] + a[i,k]*b[k,j];
        }
    }
}
My question is: how would I handle the case where n is not a multiple of P, which can happen often?
The simplest solution is to give the last worker all the remaining rows (they won't be more than P-1):
if w == P {
    last += n mod P
}
n mod P is the remainder of the division of n by P.
Or change the calculation of first and last like this:
int first = ((w-1) * n)/P
int last = (w * n)/P - 1
This automatically takes care of the case when n is not divisible by P. The parentheses are not really necessary in most languages, where * and / have the same precedence and are left-associative. The point is that the multiplication by n should happen before the (integer) division by P.
Example: n = 11, P = 3:
w = 1: first = 0, last = 2 (3 rows)
w = 2: first = 3, last = 6 (4 rows)
w = 3: first = 7, last = 10 (4 rows)
This is a better solution as it spreads the remainder of the division evenly among the workers.
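Both strategies can be compared with a short Python sketch. The function names are hypothetical, // is integer division, and the first function assumes the original pseudocode means (w-1) * (n/P) with an integer n/P.

```python
def strip_last_takes_rest(w, n, P):
    """Strips of n // P rows; the last worker also takes the n % P leftover rows."""
    first = (w - 1) * (n // P)
    last = first + n // P - 1
    if w == P:
        last += n % P
    return first, last


def strip_balanced(w, n, P):
    """Multiply by n before dividing by P, so the remainder is spread out."""
    first = ((w - 1) * n) // P
    last = (w * n) // P - 1
    return first, last


n, P = 11, 3
rest = [strip_last_takes_rest(w, n, P) for w in range(1, P + 1)]
balanced = [strip_balanced(w, n, P) for w in range(1, P + 1)]

print(rest)      # [(0, 2), (3, 5), (6, 10)]  -> 3, 3 and 5 rows
print(balanced)  # [(0, 2), (3, 6), (7, 10)]  -> 3, 4 and 4 rows
```

In both cases every row from 0 to n-1 is assigned to exactly one worker; the balanced version keeps the strip sizes within one row of each other.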

Mathematics - Distribute a list of numbers over an interval

My problem is simple.
I am looking for a mathematical function that distributes numbers over an interval.
For example, I have this list:
[2; 4; 9; 14]
And in my case I wish
2 -> 1 = f(2)
14 -> 20 = f(14)
4 -> f(4) = ?
9 -> f(9) = ?
This is just an example; I am looking for f(x).
Does anyone have an idea?
Thanks in advance! :)
If you want a linear function, then:
f(x) = lowerFunc + (x - lowerX) * (upperFunc - lowerFunc) / (upperX - lowerX),
where:
lowerFunc: function value at the lower end
upperFunc: function value at the upper end
lowerX: x parameter at the lower end
upperX: x parameter at the upper end.
For your example:
f(x) = 1 + (x - 2) * (20 - 1) / (14 - 2)
= 1 + (x - 2) * 19/12
f(2) = 1
f(4) = 4.1667
f(9) = 12.0833
f(14) = 20
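The formula above can be checked with a few lines of Python. This is only a sketch; linear_map and its parameter names are illustrative and simply mirror lowerX, upperX, lowerFunc and upperFunc.

```python
def linear_map(x, lower_x, upper_x, lower_func, upper_func):
    """Linearly map x from [lower_x, upper_x] onto [lower_func, upper_func]."""
    return lower_func + (x - lower_x) * (upper_func - lower_func) / (upper_x - lower_x)


values = [2, 4, 9, 14]
mapped = [linear_map(x, 2, 14, 1, 20) for x in values]

for x, fx in zip(values, mapped):
    print(f"f({x}) = {fx:.4f}")
```

The endpoints 2 and 14 land exactly on 1 and 20, and the values in between keep their relative spacing.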

Python Bit Summation Algorithm

I am trying to implement a function that will be used to judge whether a generator's output is continuous. The method I am gravitating towards is to iterate through the generator. For each value, I right justify the bits of the value (disregarding the 0b), count the number of ones, and shift the number of ones.
#!/usr/bin/python3
from typing import Tuple


def find_bit_sum(top: int, pad_length: int) -> int:
    """Total number of bits when every value in [0, top] is padded to pad_length bits."""
    return pad_length * (top + 1)


def find_pad_length(top: int) -> int:
    """Number of binary digits needed to write top."""
    return len(bin(top)) - 2  # -"0b"


def guess_certain(top: int, pad_length: int) -> Tuple[int, int, int]:
    """Brute-force reference: count the one bits of every value in [1, top]."""
    _both: int = find_bit_sum(top, pad_length)
    _ones: int = sum(sum(int(_i_in) for _i_in in bin(_i_out)[2:]) for _i_out in range(1, top + 1))
    return _both - _ones, _ones, _both  # zeros, ones, sum


def guess(top: int, pad_length: int) -> Tuple[int, int, int]:  # zeros then ones then sum
    """Attempted closed-form count; does not yet agree with guess_certain()."""
    _bit_sum: int = find_bit_sum(top, pad_length)  # number of bits in total
    _zeros: int = _bit_sum  # ones are deducted
    _ones: int = 0  # _bit_sum - _zeros
    # detect ones
    for _indexed in range(pad_length):
        _ones_found: int = int(top // (2 ** (_indexed + 1)))  # HELP!!!
        _zeros -= _ones_found
        _ones += _ones_found
    #
    return _zeros, _ones, _bit_sum


def test_the_guess(max_value: int) -> bool:  # the range is int [0, max_value + 1)
    """Compare guess() against the brute-force guess_certain()."""
    pad: int = find_pad_length(max_value)
    _zeros0, _ones0, _total0 = guess_certain(max_value, pad)
    _zeros1, _ones1, _total1 = guess(max_value, pad)
    return all((
        _zeros0 == _zeros1,
        _ones0 == _ones1,
        _total0 == _total1
    ))


if __name__ == '__main__':  # should produce a lot of True
    for x in range(3000):
        print(test_the_guess(x))
For the life of me, I cannot make guess() agree with guess_certain(). The time complexity of guess_certain() is my problem: it works for small ranges [0, top], but one can forget 256-bit numbers (tops). The find_bit_sum() function works perfectly. The find_pad_length() function also works.
top // (2 ** (_indexed + 1))
I've tried 40 or 50 variations of the guess() function. It has thoroughly frustrated me. The guess() function is probabilistic. In its finished state: if it returns False, then the Generator definitely isn't producing every value in range(top + 1); however, if it returns True, then the Generator could be. We already know that the generator range(top + 1) is continuous because it does produce each number between 0 and top inclusively; so, test_the_guess() should be returning True.
I sincerely apologise for the chaotic explanation. If you have any questions, please don't hesitate to ask.
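I adjusted your _ones_found assignment so that, for each bit position, it counts the ones contributed by every complete block of 2 ** (_indexed + 1) values (there are top // (2 ** (_indexed + 1)) such blocks, each containing 2 ** _indexed ones), as well as the additional "rollover" ones that occur in the final partial block before the next power of two. Here is the resulting statement: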
_ones_found: int = int(top // (2 ** (_indexed + 1))) * (2 ** (_indexed)) + max(0, (top % (2 ** (_indexed + 1))) - (2 ** _indexed) + 1)
I also took the liberty of converting the statement to bitwise operators for both clarity and speed, as shown below:
_ones_found: int = ((top >> _indexed + 1) << _indexed) + max(0, (top & (1 << _indexed + 1) - 1) - (1 << _indexed) + 1)
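As a self-contained check, the per-bit formula can be wrapped in a function and compared against a brute-force count. This is a sketch only: count_ones_up_to and count_ones_brute are hypothetical names, not part of the original code, and the explicit parentheses are added for readability.

```python
def count_ones_up_to(top: int) -> int:
    """Count the one bits in all integers from 0 to top, one bit position at a time."""
    total = 0
    for i in range(top.bit_length()):
        full_blocks = top >> (i + 1)             # complete blocks of 2**(i+1) values
        remainder = top & ((1 << (i + 1)) - 1)   # position inside the last, partial block
        rollover = max(0, remainder - (1 << i) + 1)
        total += (full_blocks << i) + rollover
    return total


def count_ones_brute(top: int) -> int:
    """Reference implementation: inspect every value in [0, top]."""
    return sum(bin(value).count("1") for value in range(top + 1))


# The closed form agrees with the brute-force count on a small range...
assert all(count_ones_up_to(t) == count_ones_brute(t) for t in range(2000))

# ...and stays cheap for 256-bit tops, where brute force is hopeless.
big = count_ones_up_to(2**256 - 1)
```

The loop runs once per bit of top, so the cost grows with the bit length rather than with top itself.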

Is there a limit to the size of a BigInt or BigUint in Rust?

Is there no limit to the size of a BigInt or BigUint from the num crate in Rust? I see that in Java its length is bounded by the upper limit of an integer Integer.MAX_VALUE as it is stored as an array of int.
I did go through the documentation for it but could not really deduce my answer from
A BigUint-typed value BigUint { data: vec!(a, b, c) } represents a
number (a + b * big_digit::BASE + c * big_digit::BASE^2).
big_digit::BASE being defined as
pub const BASE: DoubleBigDigit = 1 << BITS
BITS in turn is 32
So is the BigInt being represented as (a + b * 64 + c * 64^2) internally?
TL;DR: the maximum number that can be represented is roughly:
3.079 x 10^22212093154093428519
I doubt that anything useful needs such a big number to be represented. You can be confident that num_bigint will do the job, whatever you use it for.
In theory, there is no limit to the size of the num big integers, since the documentation says nothing about it (version 0.1.44). However, there is a concrete limit that we can calculate:
BigUint is a Vec<BigDigit>, and BigDigit is a u32. As far as I know, Rust does not define a max size for a Vec, but since the maximum possible allocated size is isize::MAX bytes, the maximum number of BigDigit (aka u32) elements is:
MAX_LEN = isize::MAX / sizeof(u32)
With this information, we can deduce that the maximum of a num::BigUint (and a num::BigInt as well) in the current implementation is:
(u32::MAX + 1)^MAX_LEN - 1 = (2^32)^MAX_LEN - 1 = 2^(32 * MAX_LEN) - 1
To have this formula, we mimic the way we calculate u8::MAX, for example:
bit::MAX is 1,
the length is 8,
so the maximum is (bit::MAX + 1) ^ 8 - 1 = 255
Here is the full demonstration from the formula given by the num documentation:
a + b * big_digit::BASE + c * big_digit::BASE^2 + ...
If we are taking the max value, a == b == c == u32::MAX. Let's name it a. Let's name big_digit::BASE b for convenience. So the max number is:
sum(a * b^n) where n is from 0 to (MAX_LEN - 1)
If we factor out a, we get:
a * sum(b^n) where n is from 0 to (MAX_LEN - 1)
The sum of x^k for k from 0 to n is (x^(n + 1) - 1) / (x - 1). Here n is MAX_LEN - 1, so the result is:
a * (b^(MAX_LEN - 1 + 1) - 1) / (b - 1)
We replace a and b with their values, and the biggest representable number is:
u32::MAX * ((2^32)^MAX_LEN - 1) / (2^32 - 1)
u32::MAX is 2^32 - 1, so this can be simplified into:
(2^32)^MAX_LEN - 1 = 2^(32 * MAX_LEN) - 1
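The derivation can be sanity-checked with Python's arbitrary-precision integers. This is only a sketch under two assumptions: digits are 32 bits wide, as in the answer, and the target is 64-bit, so isize::MAX is 2^63 - 1. The names max_biguint, ISIZE_MAX and MAX_LEN are illustrative and are not part of the num crate.

```python
from decimal import Decimal, getcontext

BASE = 1 << 32  # big_digit::BASE for 32-bit digits


def max_biguint(length: int) -> int:
    """Largest value with `length` digits in base 2^32, summed term by term."""
    return sum((BASE - 1) * BASE**k for k in range(length))


# The term-by-term sum matches the closed form (2^32)^length - 1 for small lengths.
for length in range(1, 6):
    assert max_biguint(length) == BASE**length - 1 == 2 ** (32 * length) - 1

# Assumption: 64-bit target, so isize::MAX = 2^63 - 1 and sizeof(u32) = 4.
ISIZE_MAX = 2**63 - 1
MAX_LEN = ISIZE_MAX // 4
total_bits = 32 * MAX_LEN

# Decimal order of magnitude of 2^total_bits, using extra precision
# because total_bits does not fit exactly in a float.
getcontext().prec = 60
decimal_exponent = int(total_bits * Decimal(2).log10())
print(total_bits, decimal_exponent)
```

The printed exponent is the power of ten in the TL;DR estimate; the number itself is far too large to materialise.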

Compile error of erfc function

I tried to use ErfC, but I get an "Argument not optional" compile error.
An example is given below:
For j = 0 To 150
    f = 1
    For m = 1 To j
        f = f * m
    Next
    Application.WorksheetFunction.Erf = (Application.WorksheetFunction.Erf) + (-1) ^ j * b ^ (2 * j + 1) / ((2 * j + 1) * f)
Next
Application.WorksheetFunction.ErfC = 1 - 2 / Sqr(3.14) * Application.WorksheetFunction.Erf
MsgBox (Application.WorksheetFunction.ErfC)
xf1 = (wa + 2 * sp) * q / (4 * cl ^ 2 * 3.14 * hf)
xf2 = Exp(b ^ 2) * Application.WorksheetFunction.ErfC
xf3 = 2 * b / Sqr(3.14) - 1
xf = xf1 * (xf2 + xf3)
According to the MSDN documentation at least one parameter needs to be passed to the Erf method:
Name  Required/Optional  Data Type  Description
Arg1  Required           Variant    Lower_limit - the lower bound for integrating ERF.
Arg2  Optional           Variant    Upper_limit - the upper bound for integrating ERF. If omitted, ERF integrates between zero and lower_limit.
Therefore, calling it with zero parameters (e.g. Application.WorksheetFunction.Erf) will give you an "argument not optional" error.
You also won't be able to set Erf to a value, i.e.
Application.WorksheetFunction.Erf = ...
is invalid.
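The same idea, accumulating the series in an ordinary local variable instead of assigning to a library function, can be sketched in Python (not VBA). erfc_series is a hypothetical name; math.erfc is the standard-library function, used here only for comparison, and, like the worksheet function, it must be called with an argument.

```python
import math


def erfc_series(b: float, terms: int = 60) -> float:
    """Approximate erfc(b) from the Maclaurin series of erf, as in the question."""
    total = 0.0        # running sum, kept in a plain local variable
    factorial = 1.0    # j!
    for j in range(terms):
        if j > 0:
            factorial *= j
        total += (-1) ** j * b ** (2 * j + 1) / ((2 * j + 1) * factorial)
    # math.pi is used instead of the rounded 3.14 from the question.
    return 1.0 - 2.0 / math.sqrt(math.pi) * total


b = 0.5
approx = erfc_series(b)
exact = math.erfc(b)  # the library function takes the argument explicitly
print(approx, exact)
```

For moderate values of b the series and the library function agree closely, so in practice the library call with an explicit argument is all that is needed.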
