How to get the max value in a Spark RDD and remove it? - apache-spark

There is an RDD object:
// some data in an RDD[(Int, Int)] object
(1, 2)
(3, 2)
(2, 3)
(5, 4)
(2, 7)
(5, 2)
(5, 7)
I want to find the max key and remove all pairs with it. The max key is 5, so the result I want is:
// a new RDD object, RDD[(Int, Int)]
(1, 2)
(3, 2)
(2, 3)
(2, 7)
Could you help me? Thank you!

Use RDD.max() to find the highest key, then use filter to keep only the pairs whose key is not that highest key; no prior sorting is needed.
Alternatively, you can register the RDD as a DataFrame and run a simple SQL query to get the same result.
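A minimal PySpark sketch of the filter approach (assuming a live SparkContext sc; the same idea carries over to Scala):
# sample data matching the question
rdd = sc.parallelize([(1, 2), (3, 2), (2, 3), (5, 4), (2, 7), (5, 2), (5, 7)])
# find the highest key, then keep every pair whose key differs from it
max_key = rdd.max(key=lambda kv: kv[0])[0]
result = rdd.filter(lambda kv: kv[0] != max_key)
result.collect()
# [(1, 2), (3, 2), (2, 3), (2, 7)]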

Related

Spark - difference between aggregateByKey() and reduceByKey()

I'm new to Spark. Could somebody explain in more detail the statement "aggregateByKey() lets you return a result of a different type than the input value type, while reduceByKey() returns the same type as the input"?
If I use reduceByKey() I can also get a different type of value in the output:
>>> rdd = sc.parallelize([(1,3),(2,3),(1,2),(2,5)])
>>> rdd.collect()
[(1, 3), (2, 3), (1, 2), (2, 5)]
>>> rdd.reduceByKey(lambda x,y: str(x)+str(y)).collect()
[(2, '35'), (1, '32')]
As we can see, the input is int and the output is str.
Or do I not understand this difference correctly? What is the point?
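For context, a hedged PySpark sketch (assuming the same sc and rdd as above): aggregateByKey lets the output type differ from the value type by design, because the zero value, seqOp, and combOp are separate, whereas reduceByKey's single function must also accept its own output as input, so the string trick above only works because Python is dynamically typed.
# aggregateByKey: the result type (list) can differ from the value type (int)
lists = rdd.aggregateByKey(
    [],                          # zero value of the output type
    lambda acc, v: acc + [v],    # seqOp: (list, int) -> list
    lambda a, b: a + b           # combOp: (list, list) -> list
)
lists.collect()
# e.g. [(1, [3, 2]), (2, [3, 5])]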

Pyspark- Convert an RDD into a key value pair RDD, with the keys in ascending order

I have an RDD1 in this form: ['once','upon','a','time',...,'the','end']. I want to convert it into a key/value pair RDD such that the strings are the values and the keys are in ascending order. The expected RDD2 should be as follows: [(1,'once'),(2,'upon'),(3,'a'),(4,'time'),...,(RDD1.count()-1,'the'),(RDD1.count(),'end')]
Any hints?
Thanks
Use PySpark's own zip function. This might help:
rdd1 = sc.parallelize(['once','upon','a','time','the','end'])
# build 1-based indices, then zip them with the words so the index becomes the key
nums = sc.parallelize(range(rdd1.count())).map(lambda x: x+1)
zippedRdds = nums.zip(rdd1)
rdd2 = zippedRdds.sortByKey()
rdd2.collect()
will give:
[(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]
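Note that zip assumes the two RDDs have the same number of partitions and the same number of elements in each partition. A hedged alternative sketch using zipWithIndex avoids building the separate index RDD:
# pair each element with its 0-based index, shift to 1-based, and swap so the index becomes the key
rdd2 = rdd1.zipWithIndex().map(lambda kv: (kv[1] + 1, kv[0]))
rdd2.collect()
# [(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]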

Selecting sublists of a list of lists to define a relation

If I happen to have the following list of lists:
L=[[(1,3)],[(1,3),(2,4)],[(1,3),(1,4)],[(1,2)],[(1,2),(1,3)],[(1,3),(2,4),(1,2)]]
and what I wish to do, is to create a relation between lists in the following way:
I wish to say that
[(1,3)] and [(1,3),(1,4)]
are related, because the first is a sublist of the second, but then I would like to add this relation into a list as:
Relations=[([(1,3)],[(1,3),(1,4)])]
but, we can also see that:
[(1,3)] and [(1,3),(2,4)]
are related, because the first is a sublist of the second, so I would want this to also be a relation added into my Relations list:
Relations=[([(1,3)],[(1,3),(1,4)]),([(1,3)],[(1,3),(2,4)])]
The only thing I wish to be careful with is that I consider a list to be a sublist of another only if they differ by ONE element. So in other words, we cannot have:
([(1,3)],[(1,3),(2,4),(1,2)])
as an element of my Relations list, but we SHOULD have:
([(1,3),(2,4)],[(1,3),(2,4),(1,2)])
as an element in my Relations list.
I hope there is an optimal way to do this, since in the original context I have to deal with a much bigger list of lists.
Any help given is much appreciated.
You really haven't provided enough information, so I can't tell whether you need itertools.combinations() or itertools.permutations(). Your examples work with itertools.combinations(), so I will use that.
If x and y are two elements of the list, then you just want all pairs where set(x).issubset(y) and the size of the set difference is at most one, i.e. len(set(y) - set(x)) <= 1, e.g.:
In []:
import itertools as it
[[x, y] for x, y in it.combinations(L, r=2) if set(x).issubset(y) and len(set(y)-set(x)) <= 1]
Out[]:
[[[(1, 3)], [(1, 3), (2, 4)]],
[[(1, 3)], [(1, 3), (1, 4)]],
[[(1, 3)], [(1, 2), (1, 3)]],
[[(1, 3), (2, 4)], [(1, 3), (2, 4), (1, 2)]],
[[(1, 2)], [(1, 2), (1, 3)]],
[[(1, 2), (1, 3)], [(1, 3), (2, 4), (1, 2)]]]

Spark RDD: Vary the size of each partition

I need executors to finish processing data at different times.
I think the easiest way is to make the RDD partitions have non-uniform sizes. How can I do this?
Not sure what you are trying to achieve, but you can partition the RDD any way you like using partitionBy, e.g.:
(sc.parallelize(xrange(10))
   .zipWithIndex()
   .partitionBy(2, lambda x: 0 if x < 2 else 1)
   .glom().collect())
[[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9)]]
Note that partitionBy works on a (k, v) RDD and the partitioning function takes only the key as a parameter.
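As a hedged follow-up for the original goal of skewed work per executor (assuming Python 3 and a SparkContext sc), you can verify the resulting partition sizes directly with glom():
# send keys < 2 to partition 0 and everything else to partition 1, then count elements per partition
sizes = (sc.parallelize(range(10))
           .zipWithIndex()
           .partitionBy(2, lambda k: 0 if k < 2 else 1)
           .glom()
           .map(len)
           .collect())
sizes
# [2, 8]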

Pruned Dynamic Programming

Currently, I'm working on a string alignment comparison. I'm confused about how to optimize DP by pruning.
DP can be represented as a matrix/table. The start point is (0, 0). For example, say the element at (3, 4) is pruned and its value is marked as -1 or null. But when I compute locations (4, 4), (3, 5), and (4, 5), I still need an if-statement to check whether the value at (3, 4) is invalid (pruned) or valid (not pruned). Can this implementation actually save time, given that the pruning check adds extra running time?
