Find maximum element in an RDD in PySpark by using map/filter

a = sc.parallelize((1,9,3,10))
I want to find the maximum element in a without using any max function.
I tried
a.filter( lambda x,y: x if x>y else y)
I am not able to compare elements in the RDD. How do I use a for loop or an if/else condition properly in the map/filter function? Is it possible?
Thank you.
I was trying to post this as a separate question, but wasn't able to.
a = sc.parallelize((11,7,20,10,1,7))
I want to sort the elements in increasing order without using sort() function.
I tried:
def srt(a,b):
    if a>b:
        i=a
        a=b
        b=i
final = a.map(lambda x,y: srt(x,y))
I am not getting the required result.
I want to get
(1,7,7,10,11,20)
thank you.

You cannot find the max/min using filter: its predicate sees one element at a time and must return a boolean, so there is no way to compare pairs of elements with it. You can achieve this using a comparison in a reduce operation:
a = sc.parallelize([1,9,3,10])
max_val = a.reduce(lambda a, b: a if a > b else b)
The lambda just compares the two values and returns the bigger one.
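The same reduce pattern gives the minimum; a minimal sketch:
min_val = a.reduce(lambda a, b: a if a < b else b)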

Related

What is the efficient way to right-rotate a list circularly in Python without an inbuilt function

def circularArrayRotation(a, k, queries):
    temp=a+a
    indexToCountFrom=len(a)-k
    for val in queries:
        print(temp[indexToCountFrom+val])
I have this code to perform the rotation.
This function takes the list as a, the number of times it needs to be rotated as k, and finally queries, a list containing the indices whose values are needed after all the rotations.
My code works for all the cases except some bigger ones.
Where am I going wrong?
link: https://www.hackerrank.com/challenges/circular-array-rotation/problem
You'll probably run into a timeout when you concatenate large lists with temp = a + a.
Instead, don't create a new list, but use the modulo operator in your loop:
print(a[(indexToCountFrom+val) % len(a)])
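Putting that together, a sketch of the adjusted function under the same names:
def circularArrayRotation(a, k, queries):
    # index into the original list with modulo arithmetic
    # instead of building a doubled copy of it
    indexToCountFrom = len(a) - k
    for val in queries:
        print(a[(indexToCountFrom + val) % len(a)])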

On a dataset made up of dictionaries, how do I multiply the elements of each dictionary with Python?

I started coding in Python 4 days ago, so I'm a complete newbie. I have a dataset that comprises an undefined number of dictionaries. Each dictionary holds the x and y coordinates of a point.
I'm trying to compute the sum of x*y by nesting the loop that multiplies x by y within the loop that sums the products.
However, I haven't been able to figure out how to multiply the values of the two keys in each dictionary (so far I only got as far as multiplying all the x*y).
So far I've got this:
If my data set were to be d= [{'x':0, 'y':0}, {'x':1, 'y':1}, {'x':2, 'y':3}]
I've got the code for the function that calculates the product of each pair of x and y:
def product_xy (product_x_per_y):
    prod_xy =[]
    n = 0
    for i in range (len(d)):
        result = d[n]['x']*d[n]['y']
        prod_xy.append(result)
        n+1
    return prod_xy
I also have the function to add up the elements of a list (like prod_xy):
def total_xy_prod (sum_prod):
    all = 0
    for s in sum_prod:
        all+= s
    return all
I've been trying to find a way to nest this two functions so that I can iterate through the multiplication of each x*y and then add up all the products.
Make sure your code works as expected
First, your functions have a few mistakes. For example, in product_xy, you assign n=0, and later do n + 1; you probably meant to do n += 1 instead of n + 1. But n is also completely unnecessary; you can simply use the i from the range iteration to replace n like so: result = d[i]['x']*d[i]['y']
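With those fixes applied, product_xy might look like this (a sketch; the unused parameter is repurposed to take the list itself):
def product_xy(d):
    prod_xy = []
    for i in range(len(d)):
        # use the loop index i directly; no separate counter needed
        prod_xy.append(d[i]['x'] * d[i]['y'])
    return prod_xy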
Nesting these two functions: part 1
To answer your question, it's fairly straightforward to get the sum of the products of the elements from your current code:
coord_sum = total_xy_prod(product_xy(d))
Nesting these two functions: part 2
However, there is a much shorter and more efficient way to tackle this problem. For one, Python provides the built-in function sum() to sum the elements of a list (and other iterables), so there's no need to create total_xy_prod. Our code could at this point read as follows:
coord_sum = sum(product_xy(d))
But product_xy is also unnecessarily long and inefficient, and we could replace it entirely with a shorter expression. In this case, the shortening comes from generator expressions, which are basically compact for-loops. The Python docs give some of the basic details of how the syntax works at list comprehensions, which are distinct but closely related to generator expressions. For the purposes of answering this question, I will simply present the final, most simplified form of your desired result:
coord_sum = sum(e['x'] * e['y'] for e in d)
Here, the generator expression iterates through every element in d (using for e in d), multiplies the numbers stored in the dictionary keys 'x' and 'y' of each element (using e['x'] * e['y']), and then sums each of those products from the entire sequence.
There is also some documentation on generator expressions, but it's a bit technical, so it's probably not approachable for the Python beginner.
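For the sample data above, a quick check:
d = [{'x': 0, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 3}]
coord_sum = sum(e['x'] * e['y'] for e in d)
print(coord_sum)  # 0*0 + 1*1 + 2*3 = 7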

Connect string value to a corresponding variable name

This question is somewhat related to an earlier post of mine. See here: overlap-of-nested-lists-creates-unwanted-gap
I think I have found a solution, but I can't figure out how to implement it.
First, the relevant code, since I think it is easier to explain my problem that way. I have prepared a fiddle to show the code:
PYFiddle here
Each iteration fills a nested list in ag depending on the axis. The next iteration is supposed to fill the next nested list in ag but depending on the length of the list filled before.
The general idea to realise this is as follows:
First, I would assign each nested list within the top for-loop to a variable, like this:
x = ag[0]
y = ag[1]
z = ag[2]
In order to identify that first list I need to access data_j like this (I think the access would work that way):
data_j[i-1]['axis']
data_j[i-1]['axis'] returns either 'x', 'y' or 'z' as a string.
Now I need to get the length of the list which corresponds to the axis returned from data_j[i-1]['axis'].
The problem is: how do I connect the "value" of data_j[i-1]['axis'] with its corresponding x = ag[0], y = ag[1] or z = ag[2]?
Since eval() and globals() are bad practice, I would need a push in the right direction. I couldn't find a solution.
EDIT:
I think I figured out a way. Instead of taking the detour of using the actual axis name, I will try to use the iterator i of the parent loop (see the fiddle): since it increases for each element from data_j, it effectively creates an id that I can use as the index of the nest to address the correct list.
I managed to solve it using the iterator i. See the fiddle from my original post in order to comprehend what I did with the following piece of code:
if i < 0:
    cond = 0
else:
    cond = i
pred_axis = data_j[cond]['axis']
if pred_axis == 'x':
    g = 0
elif pred_axis == 'y':
    g = 1
elif pred_axis == 'z':
    g = 2
calc_size = len(ag[g])
n_offset = calc_size+offset
I haven't figured out yet why cond must be i and not i-1, but it works. As soon as I figure out the logic behind it I will post it.
EDIT: It doesn't work for i, it works for i-1. My indices for the relevant list start at 1; ag[0] is reserved for a constant which can be added if necessary for further calculations. So since the relevant indices are already shifted up by 1 from the beginning, I don't need to decrease the iterator in each run.
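As an aside, the string-to-index mapping that the if/elif chain implements can also be written as a plain dictionary lookup, which avoids eval()/globals() entirely (a minimal sketch using the same names):
axis_index = {'x': 0, 'y': 1, 'z': 2}
g = axis_index[pred_axis]  # replaces the if/elif chain
calc_size = len(ag[g])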

How to correctly use enumerate with two inputs and three expected outputs in python spark

I've been trying to replicate the code in http://www.data-intuitive.com/2015/01/transposing-a-spark-rdd/ to transpose an RDD in pyspark. I am able to load my RDD correctly and apply the zipWithIndex method to it as follows:
m1.rdd.zipWithIndex().collect()
[(Row(c1_1=1, c1_2=2, c1_3=3), 0),
(Row(c1_1=4, c1_2=5, c1_3=6), 1),
(Row(c1_1=7, c1_2=8, c1_3=9), 2)]
But when I want to apply a flatMap with a lambda enumerating that array, either the syntax is invalid:
m1.rdd.zipWithIndex().flatMap(lambda (x,i): [(i,j,e) for (j,e) in enumerate(x)]).take(1)
Or the positional argument i is reported as missing:
m1.rdd.zipWithIndex().flatMap(lambda x,i: [(i,j,e) for (j,e) in enumerate(x)]).take(1)
When I run the lambda in plain Python, it needs the index as an extra parameter:
aa = m1.rdd.zipWithIndex().collect()
g = lambda x,i: [(i,j,e) for (j,e) in enumerate(x)]
g(aa,3) #extra parameter
Which seems to me unnecessary as the index has been calculated previously.
I'm quite an amateur in python and spark and I would like to know what is the issue with the indexes and why neither spark nor python are catching them. Thank you.
First let's take a look at the signature of RDD.flatMap (preservesPartitioning parameter removed for clarity):
flatMap(self: RDD[T], f: Callable[[T], Iterable[U]]) -> RDD[U]: ...
As you can see, flatMap expects a unary function.
Going back to your code:
lambda x, i: ... is a binary function, so clearly it won't work.
lambda (x, i): ... used to be the syntax for a unary function with tuple argument unpacking. It used structural matching to destructure (unpack, in Python nomenclature) a single input argument (here Tuple[Any, Any]). This syntax was brittle and has been removed in Python 3. A correct way to achieve the same result in Python 3 is indexing:
lambda xi: ((xi[1], j, e) for j, e in enumerate(xi[0]))
If you prefer structural matching just use standard function:
def flatten(xsi):
    xs, i = xsi
    for j, x in enumerate(xs):
        yield i, j, x

rdd.flatMap(flatten)
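Applied to the example above, either version yields one (i, j, value) triple per cell; a quick check (assuming m1 as before):
m1.rdd.zipWithIndex().flatMap(flatten).take(3)
# [(0, 0, 1), (0, 1, 2), (0, 2, 3)]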

Two parameters in a predicate function

Is there a way that I can write a predicate function that will compare two strings and see which one is greater? Right now I have
def helper1(x, y):
    return x > y
However, I'm trying to use the function in this way,
new_tuple = divide((helper1(some_value, l[0]),l[1:])
Please note that the above function call is probably wrong because my helper1 is incomplete. But the gist is that I'm trying to compare two items to see if one is greater than the other, and the items being compared are l[1:] against l[0].
Divide is a function that, given a predicate and a list, divides that list into a tuple that has two lists, based on what the predicate comes out as. Divide is very long, so I don't think I should post it on here.
So given that a predicate should only take one parameter, how should I write it so that it will take one parameter?
You should write a closure.
def helper(x):
    def cmp(y):
        return x > y
    return cmp
...
new_tuple = divide(helper(l[0]), l[1:])
...
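For illustration, the returned predicate remembers x and takes only y (divide itself isn't shown here):
pred = helper("pear")
print(pred("apple"))  # True, since "pear" > "apple"
print(pred("zebra"))  # False, since "pear" > "zebra" does not hold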
