PySpark - Max / Min Parameter

I have a question. In PySpark, when we need the total (SUM) per key in a (key, value) RDD, the query reads like:
RDD1 = RDD.reduceByKey(lambda x, y: x + y)
whereas when we need the MAX / MIN value per key, the query reads like:
RDD1 = RDD.reduceByKey(lambda x, y: x if x[1] >= y[1] else y)
Why do we not use x[1] and y[1] when summing, but we do for MAX / MIN? Please clarify.

You've taken this code out of context. In both cases x and y refer to the values, not to (key, value) pairs; the difference is what those values are. In the SUM case each value is a plain number, so x + y works directly. In the MAX / MIN case each value is itself a tuple, and the expression
lambda x , y: x if x[1] >= y[1] else y
is equivalent to:
lambda x, y: max(x, y, key=lambda x: x[1])
It compares values by their second element, which means that each value must:
- be indexable (implement __getitem__), and
- have at least two elements.
Example
sc.parallelize([(1, ("a", -3)), (1, ("b", 3))]) \
    .reduceByKey(lambda x, y: x if x[1] >= y[1] else y).first()
will be (1, ('b', 3)) because 3 is larger than -3.
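For comparison, here is a minimal sketch of both patterns side by side (assuming a running SparkContext sc): in the SUM case the values are plain numbers, in the MAX case they are (name, score) tuples.
# values are plain numbers: x and y are numbers, so add them directly
sums = sc.parallelize([("a", 1), ("a", 2), ("b", 5)]) \
    .reduceByKey(lambda x, y: x + y)
# sums.collect() -> [('a', 3), ('b', 5)]

# values are tuples: x and y are tuples, so compare their second elements
maxes = sc.parallelize([("a", ("u", 1)), ("a", ("v", 2))]) \
    .reduceByKey(lambda x, y: x if x[1] >= y[1] else y)
# maxes.collect() -> [('a', ('v', 2))]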

Related

Numpy Vectorization for Nested 'for' loop

I was trying to write a program which plots the level set of any given function.
import numpy as np

rmin = -5.0
rmax = 5.0
c = 4.0
x = np.arange(rmin, rmax, 0.1)
y = np.arange(rmin, rmax, 0.1)
x, y = np.meshgrid(x, y)
f = lambda x, y: y**2.0 - 4*x
realplots = []
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        if abs(f(x[i, j], y[i, j]) - c) < 1e-4:
            realplots.append([x[i, j], y[i, j]])
But being a nested for loop, it is taking a lot of time. Any help in vectorizing the above code, or a new method of plotting the level set, is highly appreciated. (Note: the function f will be changed at run time, so the vectorization must be done without relying on the function's properties.)
I tried vectorizing it with
ans = np.where(abs(f(x, y) - c) < 1e-4, np.array([x, y]), [0, 0])
but it gave me: operands could not be broadcast together with shapes (100,100) (2,100,100) (2,)
I was adding [0, 0] as an escape from the else condition of np.where, which is indeed wrong.
Since you want the values rather than the indexes, you don't really need np.where. You can directly use the boolean mask to index x and y; see the "Boolean array indexing" section of the NumPy documentation.
It is straightforward:
def vectorized(x, y, c, f, threshold):
    # boolean mask of grid points where f(x, y) is within threshold of c
    mask = np.abs(f(x, y) - c) < threshold
    x, y = x[mask], y[mask]
    # pair the surviving coordinates as (x, y) rows
    return np.stack([x, y], axis=-1)
Your function for reference:
def op(x, y, c, f, threshold):
    res = []
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if abs(f(x[i, j], y[i, j]) - c) < threshold:
                res.append([x[i, j], y[i, j]])
    return res
Tests:
rmin, rmax = -5.0, +5.0
c = 4.0
threshold = 1e-4
x = np.arange(rmin, rmax, 0.1)
y = np.arange(rmin, rmax, 0.1)
x, y = np.meshgrid(x, y)
f = lambda x, y: y**2 - 4 * x
res_op = op(x, y, c, f, threshold)
res_vec = vectorized(x, y, c, f, threshold)
assert np.allclose(res_op, res_vec)
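Since the original goal was to plot the level set, a minimal follow-up sketch (assuming matplotlib is available) scatters the returned points:
import matplotlib.pyplot as plt

pts = vectorized(x, y, c, f, threshold)
# each row of pts is an (x, y) point where f is within threshold of c
plt.scatter(pts[:, 0], pts[:, 1], s=2)
plt.show()
Alternatively, matplotlib's plt.contour(x, y, f(x, y), levels=[c]) draws the level set directly, without any thresholding.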

If element in list is in range of another element from another list

Below is a question that is an extension of a question I asked a month ago.
Find if item in list a in range of items in sublist of list b
Let's suppose I have two lists:
x = ['2_12_20','2_40_60','4_45_70']
y = ['2_16','2_18','4_60','3_400']
In a biological context, these numbers refer to chromosome positions. For example, in list x, '2_12_20' refers to chromosome 2 between positions 12 and 20.
Similarly, in list y, '2_16' refers to chromosome 2 at position 16.
What I would like to do is determine which chromosome position pairs in y fall within the range given by each element of list x.
This is the code I have written so far:
x_new = list(map(lambda z: tuple(map(int, z.split('_'))), x))
y_new = list(map(lambda z: tuple(map(int, z.split('_'))), y))

def check_in_range(number):
    for i in y_new:
        if number[0] == i[0]:  # if chromosomes match
            if number[1] <= i[1] and i[1] <= number[2]:  # if position falls in range
                return i
        else:
            continue  # if chromosomes do not match, move on to next

answer = dict(zip(x_new, map(check_in_range, x_new)))
I would like my output to be a dictionary, where the elements of x are the keys and the values are all matching elements of y.
My answer should be
{(2, 12, 20): [(2, 16),(2,18)], (2, 40, 60): None, (4, 45, 70): (4, 60)}
But I am getting
{(2, 12, 20): (2, 16), (2, 40, 60): None, (4, 45, 70): (4, 60)}
How do I alter my code so that it updates the dictionary if a key-value pair is already present?
I believe I figured it out.
x_new = list(map(lambda z: tuple(map(int, z.split('_'))), x))
y_new = list(map(lambda z: tuple(map(int, z.split('_'))), y))

def check_in_range(number):
    list_a = []
    for i in y_new:
        if number[0] == i[0]:  # if chromosomes match
            if number[1] <= i[1] and i[1] <= number[2]:  # if position falls in range
                list_a.append(i)
        else:
            continue  # if chromosomes do not match, move on to next
    return list_a

answer = dict(zip(x_new, map(check_in_range, x_new)))
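For the sample lists above this yields (note that keys with no match now get an empty list rather than None):
answer == {(2, 12, 20): [(2, 16), (2, 18)], (2, 40, 60): [], (4, 45, 70): [(4, 60)]}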

Python reduce function not working as expected

I have a very simple use case: I have a list of names and I have to calculate the total length of all the words in the names list. Below is my code, but it does not work the way I expect:
In [13]: names = ['John', 'Arya', 'Maya', 'Mary']
In [14]: from functools import reduce
In [15]: check = reduce(lambda x, y: len(x) + len(y), names)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-39802d43150a> in <module>
----> 1 check = reduce(lambda x, y: len(x) + len(y), names)
<ipython-input-15-39802d43150a> in <lambda>(x, y)
----> 1 check = reduce(lambda x, y: len(x) + len(y), names)
TypeError: object of type 'int' has no len()
Can someone please point out where I am going wrong?
Just use a generator expression with sum; reduce will only sometimes be better or clearer for specific use cases.
names = ['John', 'Arya', 'Maya', 'Mary']
total_length = sum(len(name) for name in names)
For completeness, map offers a more functional approach:
total_length = sum(map(len, names))
If you do want to use reduce, note that the first parameter is the accumulated value and the second is the next element of the list, so after the first step x is an int and len(x) fails. You'll need to provide a starting value and only call len on your y value:
total_length = reduce(lambda x, y: x + len(y), names, 0)
Here's a pure-python implementation of reduce():
>>> def reduce(fun, seq, initial=None):
...     it = iter(seq)
...     # with no initial value, the first element seeds the accumulator,
...     # just as functools.reduce does
...     acc = next(it) if initial is None else initial
...     for item in it:
...         acc = fun(acc, item)
...     return acc
We can see that fun() receives the accumulator and the current value from seq. This is apparent when you trace the execution:
>>> def foo(x, y):
... print("foo(%s, %s) -> %s" % (x, y, x+y))
... return x+y
...
>>> reduce(foo, range(6))
foo(0, 1) -> 1
foo(1, 2) -> 3
foo(3, 3) -> 6
foo(6, 4) -> 10
foo(10, 5) -> 15
15
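With this implementation (or functools.reduce), the corrected call from above gives the expected total, since len('John') + len('Arya') + len('Maya') + len('Mary') = 16:
>>> reduce(lambda x, y: x + len(y), ['John', 'Arya', 'Maya', 'Mary'], 0)
16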

Add and Multiplication of Polynomials in Python

I want to add and multiply two polynomials. A function takes two arguments, like add([(4,3),(3,0)],[(-4,3),(2,1)]). So the polynomials look like
4x^3 + 3 and -4x^3 + 2x
I want to add and multiply these two polynomials without using any library.
I have created a simplified version of both addition and multiplication by creating a blank list that stores the coefficients from the constant term up to the coefficient of the highest exponent. The logic is simply to update the coefficients and then build a list of tuple pairs of the format (coeff, exponent).
def add(p1, p2):
    # assumes each polynomial is sorted by descending exponent,
    # so p[0][1] is its highest exponent
    x = [0] * (max(p1[0][1], p2[0][1]) + 1)
    for i in p1 + p2:
        x[i[1]] += i[0]
    res = [(x[i], i) for i in range(len(x)) if x[i] != 0]
    res.sort(key=lambda r: r[1], reverse=True)
    return res
def mul(p1, p2):
    # the highest exponent of the product is the SUM of the two highest
    # exponents, not their product
    x = [0] * (p1[0][1] + p2[0][1] + 1)
    for i in p1:
        for j in p2:
            x[i[1] + j[1]] += i[0] * j[0]
    res = [(x[i], i) for i in range(len(x)) if x[i] != 0]
    res.sort(key=lambda r: r[1], reverse=True)
    return res
Please note that this code works only for non-negative exponents and for input sorted by descending exponent.
Addition and multiplication of the polynomials you referred to in the question yield the following results:
add([(4,3),(3,0)],[(-4,3),(2,1)]) = [(2, 1), (3, 0)]
mul([(4,3),(3,0)],[(-4,3),(2,1)]) = [(-16, 6), (8, 4), (-12, 3), (6, 1)]
For addition I have written a method that merges the two lists (both sorted by descending exponent):
def poly_add(x, y):
    r = []
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i][1] == y[j][1]:
            # same exponent: add coefficients, drop zero terms
            m = x[i][0] + y[j][0]
            if m != 0:
                r.append((m, x[i][1]))
            i += 1
            j += 1
        elif x[i][1] > y[j][1]:
            r.append(x[i])
            i += 1
        else:
            r.append(y[j])
            j += 1
    # append whatever terms remain in either polynomial
    r.extend(x[i:])
    r.extend(y[j:])
    return r
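Run on the question's example, this agrees with the other answer:
poly_add([(4, 3), (3, 0)], [(-4, 3), (2, 1)])  # -> [(2, 1), (3, 0)]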

Apache Spark AverageByKey and CollectByKey Explanation

I am trying to understand the AverageByKey and CollectByKey APIs of Spark.
I read this article
http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/
but I don't know if it's just me... I don't understand how these APIs work.
The most confusing part is (x[0] + y[0], x[1] + y[1]).
My understanding was that x is the sum and y is the count. Then why are we adding the sum and count?
Instead of:
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda x, value: (x[0] + value, x[1] + 1),
                             lambda x, y: (x[0] + y[0], x[1] + y[1]))
The three arguments are combineByKey's createCombiner, mergeValue, and mergeCombiners functions. In the last one, both x and y are already (sum, count) pairs built on different partitions, which is why their elements are added pairwise. You can make this explicit by renaming (note that tuple-unpacking parameters are Python 2 only syntax; they were removed in Python 3):
sumCount = data.combineByKey(lambda value: (value, 1),
                             lambda (total, count), value: (total + value, count + 1),
                             lambda (total1, count1), (total2, count2): (total1 + total2, count1 + count2))
However, if you need to compute an average, DoubleRDD may help.
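To finish the average computation, a common follow-up pattern (a minimal sketch, assuming the sumCount RDD from above) divides the total by the count per key:
averages = sumCount.mapValues(lambda total_count: total_count[0] / total_count[1])
# averages.collect() -> one (key, mean) pair per key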
