Apache Spark AverageByKey and CollectByKey Explanation - apache-spark

I am trying to understand the AverageByKey and CollectByKey APIs of Spark.
I read this article
http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/
but I don't know if it's just me... I don't understand how these APIs work.
The most confusing part is (x[0] + y[0], x[1] + y[1]).
My understanding was that x is the sum and y is the count, so why are we adding the sum and the count together?

Instead of:
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda x, value: (x[0] + value, x[1] + 1),
                             lambda x, y: (x[0] + y[0], x[1] + y[1]))
you can write this (x becomes a tuple of total and count; note that tuple unpacking in lambda parameters works only in Python 2):
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda (total, count), value: (total + value, count + 1),
                             lambda (total1, count1), (total2, count2): (total1 + total2, count1 + count2))
However, if you need to compute the average, DoubleRDD may help.
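For completeness, here is a minimal sketch (not from the linked article; names are illustrative) of how the sumCount RDD built above is typically turned into per-key averages with mapValues:

# sumCount contains (key, (total, count)) pairs after combineByKey
averageByKey = sumCount.mapValues(lambda v: v[0] / float(v[1]))
# e.g. data = sc.parallelize([("a", 2), ("a", 4), ("b", 3)])
# averageByKey.collectAsMap()  ->  {'a': 3.0, 'b': 3.0}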

Related

Numpy tensor implementation slower than loop

I have two functions that compute the same metric. One uses a list comprehension to loop through the calculation, the other uses only numpy tensor operations. The functions take an (N, 3) array, where N is the number of points in 3D space. When N <~ 3000 the tensor function is faster; when N >~ 3000 the list-comprehension version is faster. Both appear to have linear time complexity in N, i.e. the two time-versus-N lines cross at roughly N = 3000.
import numpy as np

def approximate_area_loop(section, num_area_divisions):
    # get_section_interp_ is a helper defined elsewhere (not shown here)
    n_a_d = num_area_divisions
    interp_vectors = get_section_interp_(section)
    a1 = section[:-1]
    b1 = section[1:]
    a2 = interp_vectors[:-1]
    b2 = interp_vectors[1:]
    c = lambda u: (1 - u) * a1 + u * a2
    d = lambda u: (1 - u) * b1 + u * b2
    x = lambda u, v: (1 - v) * c(u) + v * d(u)
    area = np.sum([np.linalg.norm(np.cross((x((i + 1)/n_a_d, j/n_a_d) - x(i/n_a_d, j/n_a_d)),
                                           (x(i/n_a_d, (j + 1)/n_a_d) - x(i/n_a_d, j/n_a_d))), axis=1)
                   for i in range(n_a_d) for j in range(n_a_d)])
    Dt = section[-1, 0] - section[0, 0]
    return area, Dt
def approximate_area_tensor(section, num_area_divisions):
    divisors = np.linspace(0, 1, num_area_divisions + 1)
    interp_vectors = get_section_interp_(section)
    a1 = section[:-1]
    b1 = section[1:]
    a2 = interp_vectors[:-1]
    b2 = interp_vectors[1:]
    c = np.multiply.outer(a1, (1 - divisors)) + np.multiply.outer(a2, divisors)  # c_areas_vecs_divs
    d = np.multiply.outer(b1, (1 - divisors)) + np.multiply.outer(b2, divisors)  # d_areas_vecs_divs
    x = np.multiply.outer(c, (1 - divisors)) + np.multiply.outer(d, divisors)    # x_areas_vecs_Divs_divs
    u = x[:, :, 1:, :-1] - x[:, :, :-1, :-1]  # u_areas_vecs_Divs_divs
    v = x[:, :, :-1, 1:] - x[:, :, :-1, :-1]  # v_areas_vecs_Divs_divs
    sub_area_norm_vecs = np.cross(u, v, axis=1)             # areas_crosses_Divs_divs
    sub_areas = np.linalg.norm(sub_area_norm_vecs, axis=1)  # areas_Divs_divs (values are now sub-areas)
    area = np.sum(sub_areas)
    Dt = section[-1, 0] - section[0, 0]
    return area, Dt
Why does the list-comprehension version run faster at large N? Surely the tensor version should be faster? I'm wondering whether the intermediate arrays simply become too big to fit in cache. Please ask if I haven't included enough information; I'd really like to get to the bottom of this.
The bottleneck in the fully vectorized function was indeed in np.linalg.norm, as @hpaulj's comment suggested.
The norm was used only to get the magnitude of all the vectors contained in axis 1. A much simpler and faster method was to just use:
sub_areas = np.sqrt((sub_area_norm_vecs * sub_area_norm_vecs).sum(axis=1))
This gives exactly the same results and made the code up to 25 times faster than the loop implementation (even when the loop doesn't use linalg.norm either).
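If you want to convince yourself that the replacement is equivalent, here is a small self-contained check (my own sketch; the array shape is arbitrary) comparing np.linalg.norm along axis 1 with the manual square-root-of-sum-of-squares:

import numpy as np

vecs = np.random.default_rng(0).normal(size=(10, 3, 8, 8))  # hypothetical stack of 3D vectors on axis 1
via_norm = np.linalg.norm(vecs, axis=1)
manual = np.sqrt((vecs * vecs).sum(axis=1))
print(np.allclose(via_norm, manual))  # True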

SymPy: summing and integrating an indexed variable

The exercise I am trying to recreate is the first one in these lattice notes.
I am attempting this in SymPy with Python 3. My attempt is:
import sympy
from sympy.abc import a, m
from sympy import IndexedBase, Idx, oo, symbols
# from ipdb import set_trace as st

integrated_path = sympy.Symbol('I')

def V(sym_a, sym_x):
    return (sym_x**2)/sym_a

N, j, j_primed = symbols('N j, j_primed', integer=True)
x = IndexedBase('x')
j_idx = Idx(j)
S = sympy.summation(((m/(2*a)) * (x[j_idx+1] - x[j_idx]**2) + a*V(a, x[j_idx])),
                    (j_idx, 0, N-1))
print("The action ", S)
integrand = sympy.exp(-S)
j_primed_idx = Idx(j_primed, (0, N))
integrated_path = sympy.integrate(integrand, (x[j_primed_idx], -oo, oo))
print("The integrated path is ", integrated_path)
subbed_path = integrated_path.subs({a: 0.5, N: 8, m: 1})
print("The subbed path is ", subbed_path)
However, the integration is not recognising x[j+1] as being one of the x[j], so it is not integrating over it. The output I'm getting is:
The action Sum(x[j]**2 + m*(x[j + 1] - x[j]**2)/(2*a), (j, 0, N - 1))
The integrated path is oo*sign(exp(-Sum(x[j]**2, (j, 0, N - 1)) - m*Sum(x[j + 1], (j, 0, N - 1))/(2*a) + m*Sum(x[j]**2, (j, 0, N - 1))/(2*a)))
The subbed path is oo*sign(exp(-1.0*Sum(x[j + 1], (j, 0, 7))))
All of the x values should have been integrated out, yet one of them remains, so I think I'm using indexed variables incorrectly. Short of hard-coding N, what is the correct way to do this?
SymPy's handling of indexed objects is not nearly sophisticated enough to handle this computation as a human would. In particular, it is not going to understand integration over (x[j_primed_idx], -oo, oo) as "integrate over all indexed x". This looks like a single integration to SymPy, and over a variable that is distinct from any x[j] because the indices don't look the same. In short, SymPy doesn't really understand how indices work in mathematics.
You'll need to declare the value of N upfront to get anything done, and you need to fix the typo in (x[j_idx+1] - x[j_idx]**2): it should be (x[j_idx+1] - x[j_idx])**2. Even then the computation will take forever if N is large while a and m stay symbolic, because the integrator has to work through case after case depending on the relative sizes of a and m. Here is a working version with N, a, and m all specified upfront, which helps the integrator a lot. Note the use of Rational(1, 2) instead of the float 0.5, by the way; this matters to SymPy.
import sympy
from sympy import oo, symbols

N = 8
a = sympy.Rational(1, 2)
m = 1

def V(sym_a, sym_x):
    return (sym_x**2)/sym_a

x = symbols('x0:{}'.format(N))
S = sympy.Add(*[((m/(2*a)) * (x[j_idx+1] - x[j_idx])**2 + a*V(a, x[j_idx])) for j_idx in range(N-1)])
print("The action ", S)
integrand = sympy.exp(-S)
integrated_path = sympy.integrate(integrand, *[(x[j_primed_idx], -oo, oo) for j_primed_idx in range(N)], conds='none')
print("The integrated and subbed path is ", integrated_path)
Output:
The action x0**2 + x1**2 + x2**2 + x3**2 + x4**2 + x5**2 + x6**2 + (-x0 + x1)**2 + (-x1 + x2)**2 + (-x2 + x3)**2 + (-x3 + x4)**2 + (-x4 + x5)**2 + (-x5 + x6)**2 + (-x6 + x7)**2
The integrated and subbed path is sqrt(377)*pi**4/377
And this is how far I can push it with symbolic a and m: N=2 here.
import sympy
from sympy import IndexedBase, Idx, oo, symbols

a, m = symbols('a m', positive=True)
N = 2

def V(sym_a, sym_x):
    return (sym_x**2)/sym_a

j, j_primed = symbols('j, j_primed', integer=True)
x = symbols('x0:{}'.format(N))
S = sympy.Add(*[((m/(2*a)) * (x[j_idx+1] - x[j_idx])**2 + a*V(a, x[j_idx])) for j_idx in range(N-1)])
print("The action ", S)
integrand = sympy.exp(-S)
integrated_path = sympy.integrate(integrand, *[(x[j_primed_idx], -oo, oo) for j_primed_idx in range(N)], conds='none')
print("The integrated path is ", integrated_path)
subbed_path = integrated_path.subs({a: sympy.Rational(1, 2), m: 1})
print("The subbed path is ", subbed_path)
Output:
The action x0**2 + m*(-x0 + x1)**2/(2*a)
The integrated path is -I*pi*sqrt(a)*sqrt(4*a**2 + 2*a*m)*Piecewise((I/sqrt(-1 + (4*a**2 + 2*a*m)/(2*a*m)), (4*a**2 + 2*a*m)/(2*a*m) > 1), (1/sqrt(1 - (4*a**2 + 2*a*m)/(2*a*m)), True))/(m*sqrt(a + m/2))
The subbed path is pi

Pyspark - Max / Min Parameter

I have a query. In PySpark, when we need to get the total (SUM) of the values per key, the query reads like:
RDD1 = RDD.reduceByKey(lambda x, y: x + y)
whereas when we need to find the MAX / MIN value per key, the query reads like:
RDD1 = RDD.reduceByKey(lambda x, y: x if x[1] >= y[1] else y)
Why do we not use x[1], y[1] when we sum the data, whereas we do for MAX / MIN? Please clarify this doubt.
You're wrong and you've taken this code out of context. In both cases x and y refer to values.
lambda x , y: x if x[1] >= y[1] else y
is equivalent to:
lambda x, y: max(x, y, key=lambda x: x[1])
It compares values by their second element, which means that each value:
Is indexable (implements __getitem__).
Has at least two elements.
Example
sc.parallelize([(1, ("a", -3)), (1, ("b", 3))]) \
    .reduceByKey(lambda x, y: x if x[1] >= y[1] else y).first()
will be (1, ('b', 3)) because 3 is larger than -3.
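For contrast, here is a minimal sketch (my own, not part of the original answer) of the sum case, where the values are plain numbers and can be added directly, which is why no indexing is needed there:

sc.parallelize([(1, 2), (1, 3), (2, 5)]) \
    .reduceByKey(lambda x, y: x + y).collect()
# [(1, 5), (2, 5)]  (order may vary)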

Add and Multiplication of Polynomials in Python

I want to add and multiply two polynomials. A function takes two arguments, e.g. add([(4,3),(3,0)], [(-4,3),(2,1)]), so the polynomials look like
4x^3 + 3 and -4x^3 + 2x
I want to add and multiply these two polynomials without using any library.
I have created a simplified version for both addition and multiplication by building a blank list that can store the coefficients from the constant term up to the coefficient of the highest exponent. The logic is simply to accumulate the coefficients and then create a list of tuple pairs in the format (coefficient, exponent).
def add(p1, p2):
    # assumes terms are sorted by descending exponent, so p[0][1] is the degree
    x = [0] * (max(p1[0][1], p2[0][1]) + 1)
    for i in p1 + p2:
        x[i[1]] += i[0]
    res = [(x[i], i) for i in range(len(x)) if x[i] != 0]
    res.sort(key=lambda r: r[1], reverse=True)
    return res

def mul(p1, p2):
    # the degree of the product is the sum of the two degrees
    x = [0] * (p1[0][1] + p2[0][1] + 1)
    for i in p1:
        for j in p2:
            x[i[1] + j[1]] += i[0] * j[0]
    res = [(x[i], i) for i in range(len(x)) if x[i] != 0]
    res.sort(key=lambda r: r[1], reverse=True)
    return res
Please note that this code works only for non-negative exponents.
Addition and multiplication of the polynomials you referred to in the question yield the following results:
add([(4,3),(3,0)], [(-4,3),(2,1)]) = [(2, 1), (3, 0)]
mul([(4,3),(3,0)], [(-4,3),(2,1)]) = [(-16, 6), (8, 4), (-12, 3), (6, 1)]
For addition I have written a method:
def poly_add(x, y):
    r = []
    min_len = min(len(x), len(y))
    for i in range(min_len):
        if x[i][1] == y[i][1]:
            # matching exponents: add the coefficients, keep the term if non-zero
            m = x[i][0] + y[i][0]
            if m != 0:
                r.append((m, x[i][1]))
        if x[i][1] != y[i][1]:
            # differing exponents: keep both terms as they are
            r.append(y[i])
            r.append(x[i])
    return r
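Note that poly_add above only walks the first min_len terms, so any leftover terms from the longer polynomial are dropped. As a rough alternative sketch (my own, using the same (coefficient, exponent) representation from the question), a dictionary keyed by exponent avoids that limitation:

def poly_add_dict(p1, p2):
    # accumulate coefficients per exponent, then rebuild sorted (coefficient, exponent) pairs
    acc = {}
    for coeff, exp in p1 + p2:
        acc[exp] = acc.get(exp, 0) + coeff
    return sorted(((c, e) for e, c in acc.items() if c != 0),
                  key=lambda t: t[1], reverse=True)

# poly_add_dict([(4, 3), (3, 0)], [(-4, 3), (2, 1)])  ->  [(2, 1), (3, 0)]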

Lambdas and sums Python

def summation(calc_termo, linf, prox, lsup):
    soma = 0
    while linf <= lsup:
        soma = soma + calc_termo(linf)
        linf = prox(linf)
    return soma

summation(lambda x: summation(lambda x: x, 1, lambda x: x + 1, x), 1, lambda x: x + 1, 5)
I got this code as an exercise from my university and I'm having trouble understanding how it works.
It seems to be the sum of the numbers from 1 to 5, but I can't understand what summation(lambda x: x, 1, lambda x: x + 1, x) does.
I'd start by taking those arguments apart:
lambda x: summation(lambda x: x, 1, lambda x: x + 1, x)
Substitute those variables back into the original function and simplify it:
def inner_function(x):
    soma = 0
    linf = 1
    while linf <= x:
        soma += linf    # calc_termo is the identity, so this just adds linf
        linf += 1       # prox advances linf by 1
    return soma
Simplify that a little more:
def inner_function(x):
    soma = 0
    for linf in range(1, x + 1):
        soma += linf
    return soma
And a little more:
inner_function = lambda x: sum(range(1, x + 1))
And some more:
inner_function = lambda x: x * (x + 1) / 2
Now your original function becomes:
def summation(calc_termo, linf, prox, lsup):
    soma = 0
    while linf <= lsup:
        soma = soma + calc_termo(linf)
        linf = prox(linf)
    return soma

summation(inner_function, 1, lambda x: x + 1, 5)
Or:
def summation(linf, prox, lsup):
    soma = 0
    while linf <= lsup:
        soma = soma + linf * (linf + 1) / 2
        linf = prox(linf)
    return soma

summation(1, lambda x: x + 1, 5)
You can take it from there. I got:
summation = lambda: sum(n * (n + 1) / 2 for n in range(6))
Which is equal to:
sum(sum(range(n + 1)) for n in range(6))
The last line that you had trouble with could better be stated as:
summation(lambda x: summation(lambda y: y, 1, lambda z: z + 1, x), 1, lambda w: w + 1, 5)
The lambdas don't all interfere with each other, if that's what you were confused about.
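As a quick sanity check (my own addition), the original nested call works out to the sum of the first five triangular numbers, 1 + 3 + 6 + 10 + 15 = 35, which matches both closed forms above:

print(sum(n * (n + 1) // 2 for n in range(1, 6)))   # 35
print(sum(sum(range(n + 1)) for n in range(6)))     # 35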
