Max values for each key RDD - apache-spark

I have this data and want to get the record with the maximum value for each key. The key is the first element (9, 14, 26).
(('14', '51600', 'Fashion Week'), 1)
(('9', '61577', 'Guiding Light'), 7)
(('9', '6856', 'Adlina Marie'), 22)
(('14', '120850', 'People Say (feat. Redman)'), 5)
(('26', '155571', "Thinking 'Bout You"), 30)
(('26', '156532', "Hello"), 8)
The final format will be:
(9, '6856', 'Adlina Marie', 22)
(14, '120850', 'People Say (feat. Redman)', 5)
(26, '155571', "Thinking 'Bout You", 30)
How do I select the first element as the key and the last element as the value, and then find the maximum of the values? I tried
groupByKey(lambda x: int(x[0][0])).mapValues(lambda x: max(x))
but it takes the second column as the value when computing the max.

You could use map before and after the aggregation, moving the count to the front of the value tuple so that max, which compares tuples element by element, keeps the record with the highest count:
rdd = rdd.map(lambda x: (x[0][0],(x[1], x[0][1], x[0][2])))
rdd = rdd.groupByKey().mapValues(max)
rdd = rdd.map(lambda x: (x[0], x[1][1], x[1][2], x[1][0]))
Full example:
sc = spark.sparkContext
data = [(('14', '51600', 'Fashion Week'), 1),
(('9', '61577', 'Guiding Light'), 7),
(('9', '6856', 'Adlina Marie'), 22),
(('14', '120850', 'People Say (feat. Redman)'), 5),
(('26', '155571', "Thinking 'Bout You"), 30),
(('26', '156532', "Hello"), 8)]
rdd = sc.parallelize(data)
rdd = rdd.map(lambda x: (x[0][0],(x[1], x[0][1], x[0][2])))
print(rdd.collect())
# [('14', (1, '51600', 'Fashion Week')), ('9', (7, '61577', 'Guiding Light')), ('9', (22, '6856', 'Adlina Marie')), ('14', (5, '120850', 'People Say (feat. Redman)')), ('26', (30, '155571', "Thinking 'Bout You")), ('26', (8, '156532', 'Hello'))]
rdd = rdd.groupByKey().mapValues(max)
print(rdd.collect())
# [('14', (5, '120850', 'People Say (feat. Redman)')), ('9', (22, '6856', 'Adlina Marie')), ('26', (30, '155571', "Thinking 'Bout You"))]
rdd = rdd.map(lambda x: (x[0], x[1][1], x[1][2], x[1][0]))
print(rdd.collect())
# [('14', '120850', 'People Say (feat. Redman)', 5), ('9', '6856', 'Adlina Marie', 22), ('26', '155571', "Thinking 'Bout You", 30)]
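A variation on the same idea, shown only as a sketch: since max can also merge two value tuples at a time, reduceByKey can replace groupByKey here and avoid collecting whole groups per key.
# Sketch reusing data and sc from above: reduceByKey merges value tuples
# pairwise with max, so full groups are never materialized.
rdd_alt = sc.parallelize(data)\
    .map(lambda x: (x[0][0], (x[1], x[0][1], x[0][2])))\
    .reduceByKey(max)\
    .map(lambda x: (x[0], x[1][1], x[1][2], x[1][0]))
print(rdd_alt.collect())
# same records as above, order may vary:
# [('14', '120850', 'People Say (feat. Redman)', 5), ('9', '6856', 'Adlina Marie', 22), ('26', '155571', "Thinking 'Bout You", 30)]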

If working with RDDs is not a restriction, here is another approach using a Spark DataFrame with a window function:
df = spark.createDataFrame(
    [
        (('14', '51600', 'Fashion Week'), 1),
        (('9', '61577', 'Guiding Light'), 7),
        (('9', '6856', 'Adlina Marie'), 22),
        (('14', '120850', 'People Say (feat. Redman)'), 5),
        (('26', '155571', "Thinking 'Bout You"), 30),
        (('26', '156532', "Hello"), 8)
    ], ['key', 'value']
)
from pyspark.sql import functions as F
from pyspark.sql import Window
df\
    .select(F.col('key._1').alias('key_1'),
            F.col('key._2').alias('key_2'),
            F.col('key._3').alias('key_3'),
            F.col('value'))\
    .withColumn('max', F.max(F.col('value')).over(Window.partitionBy('key_1')))\
    .filter(F.col('value') == F.col('max'))\
    .select('key_1', 'key_2', 'key_3', 'value')\
    .show()
+-----+------+--------------------+-----+
|key_1| key_2|               key_3|value|
+-----+------+--------------------+-----+
|   14|120850|People Say (feat....|    5|
|   26|155571|  Thinking 'Bout You|   30|
|    9|  6856|        Adlina Marie|   22|
+-----+------+--------------------+-----+
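If you are on Spark 3.3 or later (an assumption, not part of the original answer), a rough sketch without a window is also possible using max_by on a struct of the non-key columns:
result = df\
    .select(F.col('key._1').alias('key_1'),
            F.col('key._2').alias('key_2'),
            F.col('key._3').alias('key_3'),
            F.col('value'))\
    .groupBy('key_1')\
    .agg(F.max_by(F.struct('key_2', 'key_3', 'value'), 'value').alias('top'))\
    .select('key_1', 'top.key_2', 'top.key_3', 'top.value')
result.show()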

Related

spark: complex join optimization

Given two RDDs:
rdd1 = sc.parallelize([("cat", [1,2,3,4])])
rdd2 = sc.parallelize([(1, 100), (2, 201), (3, 350), (4, 400)])
What would be the most efficient way of getting:
rdd_expected = sc.parallelize([("cat", [1,2,3,4], [100, 201, 350, 400])])
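No answer is recorded here, but one possible approach, sketched under the assumption that rdd2 is small enough to collect and broadcast (and that sc is your SparkContext), is a map-side lookup instead of a join:
# Sketch only: turn rdd2 into a dict, broadcast it, and look the ids up per record.
lookup = sc.broadcast(dict(rdd2.collect()))
rdd_result = rdd1.map(lambda kv: (kv[0], kv[1], [lookup.value[i] for i in kv[1]]))
print(rdd_result.collect())
# [('cat', [1, 2, 3, 4], [100, 201, 350, 400])]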

How to divide the content of an RDD

I have an RDD whose contents I would like to divide, returning a list of tuples.
rdd_to_divide = [('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))]
result_rdd = [('Nate', 1.2/1.2), ('Mike', 5/10), ('Ben', 3/7), ('Chad', 12/20)]
Thanks in advance
Use a lambda function to map the RDD as below:
>>> rdd_to_divide = sc.parallelize([('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))])
>>> result_rdd = rdd_to_divide.map(lambda x: (x[0], x[1][0]/x[1][1]))
>>> result_rdd.take(5)
[('Nate', 1.0), ('Mike', 0.5), ('Ben', 0.42857142857142855), ('Chad', 0.6)]

Sorting tuples in python and keeping the relative order

Input = [("M", 19), ("H", 19), ("A", 25)]
Output =[("A", 25), ("M" ,19), ("H", 19)]
It should sort alphabetically, but when the second values are equal the tuples should keep their relative order.
Here M and H both have the value 19, so that part is already sorted.
IIUC, you can group the tuples by their second element. First order them with sorted on the second element (Python's sort is stable, so tuples with equal second values keep their relative order; this step may even be unnecessary depending on your data), then collect them into lists with itertools.groupby on that same element.
import itertools
Input = [('M', 19), ('H', 19), ('A', 25)]
sort1 = sorted(Input, key=lambda x: x[1])
grouped = []
for _, g in itertools.groupby(sort1, lambda x: x[1]):
    grouped.append(list(g))
Then sort these grouped lists based on the first letter and finally "unlist" them.
sort2 = sorted(grouped, key=lambda x: x[0][0])
Output = [tup for sublist in sort2 for tup in sublist]
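For the sample input above this reproduces the desired result:
print(Output)
# [('A', 25), ('M', 19), ('H', 19)]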
You could group the items by the second value of each tuple using itertools.groupby, sort groupings by the first item in each group, then flatten the result with itertools.chain.from_iterable:
from operator import itemgetter
from itertools import groupby, chain
def relative_sort(Input):
    return list(
        chain.from_iterable(
            sorted(
                (
                    tuple(g)
                    for _, g in groupby(
                        sorted(Input, key=itemgetter(1)), key=itemgetter(1)
                    )
                ),
                key=itemgetter(0),
            )
        )
    )
Output:
>>> relative_sort([("M", 19), ("H", 19), ("A", 25)])
[('A', 25), ('M', 19), ('H', 19)]
>>> relative_sort([("B", 19), ("B", 25), ("M", 19), ("H", 19), ("A", 25)])
[('B', 19), ('M', 19), ('H', 19), ('B', 25), ('A', 25)]
>>> relative_sort([("A", 19), ("B", 25), ("M", 19), ("J", 30), ("H", 19)])
[('A', 19), ('M', 19), ('H', 19), ('B', 25), ('J', 30)]

Get top values based on compound key for each partition in Spark RDD

I want to use the following rdd
rdd = sc.parallelize([("K1", "e", 9), ("K1", "aaa", 9), ("K1", "ccc", 3), ("K1", "ddd", 9),
("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1)])
to get the output
[('K1', 'aaa', 9),
('K1', 'ddd', 9),
('K1', 'e', 9),
('B1', 'iop', 8),
('B1', 'rty', 7),
('B1', 'qwe', 4)]
I referred to Get Top 3 values for every key in a RDD in Spark and used the following code
from heapq import nlargest
rdd.groupBy(
lambda x: x[0]
).flatMap(
lambda g: nlargest(3, g[1], key=lambda x: (x[2],x[1]))
).collect()
However, I only get
[('K1', 'e', 9),
('K1', 'ddd', 9),
('K1', 'aaa', 9),
('B1', 'iop', 8),
('B1', 'qwe', 7),
('B1', 'rty', 4)]
How should I do this?
This is actually a sorting problem, but sorting is computationally expensive because of the shuffling involved. Still, you can try:
rdd2 = rdd.groupBy(
lambda x: x[0]
).flatMap(
lambda g: nlargest(3, g[1], key=lambda x: (x[2],x[1]))
)
rdd2.sortBy(lambda x: (x[1], x[2])).collect()
# [('K1', 'aaa', 9), ('K1', 'ddd', 9), ('K1', 'e', 9), ('B1', 'iop', 8), ('B1', 'qwe', 4), ('B1', 'rty', 7)]
Here the sort uses the second and third elements of each tuple as the key.
Also note that 'q' comes before 'r' alphabetically, so the expected output you posted is off and misleading.
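If collecting whole groups with groupBy is a concern, here is a rough alternative sketch (not from the original answer) that keeps only the top 3 per key while aggregating, using aggregateByKey:
from heapq import nlargest

# Sketch reusing rdd from above: each partition keeps at most 3 (value, name)
# pairs per key, and partial results are merged the same way.
top3 = rdd.map(lambda x: (x[0], (x[2], x[1])))\
    .aggregateByKey([],
                    lambda acc, v: nlargest(3, acc + [v]),
                    lambda a, b: nlargest(3, a + b))\
    .flatMap(lambda kv: [(kv[0], name, val) for val, name in kv[1]])
print(top3.collect())
# e.g. [('K1', 'e', 9), ('K1', 'ddd', 9), ('K1', 'aaa', 9), ('B1', 'iop', 8), ('B1', 'rty', 7), ('B1', 'qwe', 4)]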
If you are open to using a DataFrame, you can use a window function with rank.
Inspired from here
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql import Window
spark = SparkSession.builder.appName('test').master("local[*]").getOrCreate()
df = spark.createDataFrame([
("K1", "e", 9),
("K1", "aaa", 9),
("K1", "ccc", 3),
("K1", "ddd", 9),
("B1", "qwe", 4),
("B1", "rty", 7),
("B1", "iop", 8),
("B1", "zxc", 1)], ['A', 'B', 'C']
)
w = Window.partitionBy('A').orderBy(df.C.desc())
df.select('*', F.rank().over(w).alias('rank')).filter("rank<4").drop('rank').show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
| B1|iop|  8|
| B1|rty|  7|
| B1|qwe|  4|
| K1|  e|  9|
| K1|aaa|  9|
| K1|ddd|  9|
+---+---+---+
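Note that rank keeps all ties (here the three K1 rows tied at 9). If you wanted at most three rows per key even when values tie, row_number is a possible swap, sketched below using the same window w:
df.select('*', F.row_number().over(w).alias('rn')).filter("rn <= 3").drop('rn').show()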

ReKeying an RDD

I have a key-value RDD with one key and multiple values. How can I create a new RDD in which one of the values becomes the key and the key becomes a value?
For example, the existing RDD is
(16, (1002, 'US')), (9, (1001, 'MX')), (1, (1004, 'MX')), (17, (1004, 'MX'))
and the desired new RDD is
(1002, (16, 'US')), (1001, (9, 'MX')), (1004, (1, 'MX')), (1004, (17, 'MX'))
rdd.map(lambda x: (x[1][0],(x[0],x[1][1])))
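As a quick check, applied to the sample above (a minimal sketch, assuming sc is the active SparkContext):
rdd = sc.parallelize([(16, (1002, 'US')), (9, (1001, 'MX')), (1, (1004, 'MX')), (17, (1004, 'MX'))])
print(rdd.map(lambda x: (x[1][0], (x[0], x[1][1]))).collect())
# [(1002, (16, 'US')), (1001, (9, 'MX')), (1004, (1, 'MX')), (1004, (17, 'MX'))]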
