Spark: complex join optimization

Given two RDDs:
rdd1 = sc.parallelize([("cat", [1,2,3,4])])
rdd2 = sc.parallelize([(1, 100), (2, 201), (3, 350), (4, 400)])
What would be the most efficient way of getting:
rdd_expected = sc.parallelize([("cat", [1,2,3,4], [100, 201, 350, 400])])
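One possible approach (a sketch, not necessarily the most optimized): if rdd2 is small enough to fit in memory, collect it into a dict and broadcast it, so the lookup happens map-side with no shuffle of rdd1 (variable names here are illustrative):
lookup = sc.broadcast(dict(rdd2.collect()))  # {1: 100, 2: 201, 3: 350, 4: 400}
rdd_result = rdd1.map(lambda kv: (kv[0], kv[1], [lookup.value[i] for i in kv[1]]))
rdd_result.collect()
# [('cat', [1, 2, 3, 4], [100, 201, 350, 400])]
If rdd2 is too large to broadcast, you would instead flatMap rdd1 into (index, key) pairs, join with rdd2, and regroup, which does involve a shuffle.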

Related

Max values for each key RDD

I have this data and want to get each key's max value. The key will be the first element (9,14,26).
(('14', '51600', 'Fashion Week'), 1)
(('9', '61577', 'Guiding Light'), 7)
(('9', '6856', 'Adlina Marie'), 22)
(('14', '120850', 'People Say (feat. Redman)'), 5)
(('26', '155571', "Thinking 'Bout You"), 30)
(('26', '156532', "Hello"), 8)
The final format will be:
(9, '6856', 'Adlina Marie', 22)
(14, '120850', 'People Say (feat. Redman)', 5)
(26, '155571', "Thinking 'Bout You", 30)
How do I select the first element as the key and the last element as the value, and then find the maximum value per key? I tried
groupbykey(lambda x:int(x[0][0])).mapValues(lambda x: max(x))
but it takes the second column as the value to find the max.
You could use map before and after the aggregation:
rdd = rdd.map(lambda x: (x[0][0],(x[1], x[0][1], x[0][2])))
rdd = rdd.groupByKey().mapValues(max)
rdd = rdd.map(lambda x: (x[0], x[1][1], x[1][2], x[1][0]))
Full example:
sc = spark.sparkContext
data = [(('14', '51600', 'Fashion Week'), 1),
        (('9', '61577', 'Guiding Light'), 7),
        (('9', '6856', 'Adlina Marie'), 22),
        (('14', '120850', 'People Say (feat. Redman)'), 5),
        (('26', '155571', "Thinking 'Bout You"), 30),
        (('26', '156532', "Hello"), 8)]
rdd = sc.parallelize(data)
rdd = rdd.map(lambda x: (x[0][0],(x[1], x[0][1], x[0][2])))
print(rdd.collect())
# [('14', (1, '51600', 'Fashion Week')), ('9', (7, '61577', 'Guiding Light')), ('9', (22, '6856', 'Adlina Marie')), ('14', (5, '120850', 'People Say (feat. Redman)')), ('26', (30, '155571', "Thinking 'Bout You")), ('26', (8, '156532', 'Hello'))]
rdd = rdd.groupByKey().mapValues(max)
print(rdd.collect())
# [('14', (5, '120850', 'People Say (feat. Redman)')), ('9', (22, '6856', 'Adlina Marie')), ('26', (30, '155571', "Thinking 'Bout You"))]
rdd = rdd.map(lambda x: (x[0], x[1][1], x[1][2], x[1][0]))
print(rdd.collect())
# [('14', '120850', 'People Say (feat. Redman)', 5), ('9', '6856', 'Adlina Marie', 22), ('26', '155571', "Thinking 'Bout You", 30)]
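A minor variant of the same idea (my sketch, not part of the original answer): reduceByKey(max) keeps only a running maximum per key instead of materializing each group, which is usually cheaper than groupByKey:
rdd_max = (sc.parallelize(data)
    .map(lambda x: (x[0][0], (x[1], x[0][1], x[0][2])))
    .reduceByKey(max)
    .map(lambda x: (x[0], x[1][1], x[1][2], x[1][0])))
print(rdd_max.collect())
# [('14', '120850', 'People Say (feat. Redman)', 5), ('9', '6856', 'Adlina Marie', 22), ('26', '155571', "Thinking 'Bout You", 30)]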
If working with RDDs is not a restriction, here is another approach using a Spark DataFrame with a window function:
df = spark.createDataFrame(
    [
        (('14', '51600', 'Fashion Week'), 1),
        (('9', '61577', 'Guiding Light'), 7),
        (('9', '6856', 'Adlina Marie'), 22),
        (('14', '120850', 'People Say (feat. Redman)'), 5),
        (('26', '155571', "Thinking 'Bout You"), 30),
        (('26', '156532', "Hello"), 8)
    ], ['key', 'value']
)
from pyspark.sql import functions as F
from pyspark.sql import Window
df\
    .select(F.col('key._1').alias('key_1'),
            F.col('key._2').alias('key_2'),
            F.col('key._3').alias('key_3'),
            F.col('value'))\
    .withColumn('max', F.max(F.col('value')).over(Window.partitionBy('key_1')))\
    .filter(F.col('value') == F.col('max'))\
    .select('key_1', 'key_2', 'key_3', 'value')\
    .show()
+-----+------+--------------------+-----+
|key_1| key_2| key_3|value|
+-----+------+--------------------+-----+
| 14|120850|People Say (feat....| 5|
| 26|155571| Thinking 'Bout You| 30|
| 9| 6856| Adlina Marie| 22|
+-----+------+--------------------+-----+
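If ties on the maximum are possible and only one row per key should survive, row_number() can replace the max + filter combination (a variant added here, not part of the original answer; w is an illustrative name):
w = Window.partitionBy('key_1').orderBy(F.col('value').desc())
df\
    .select(F.col('key._1').alias('key_1'),
            F.col('key._2').alias('key_2'),
            F.col('key._3').alias('key_3'),
            F.col('value'))\
    .withColumn('rn', F.row_number().over(w))\
    .filter(F.col('rn') == 1)\
    .drop('rn')\
    .show()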

How to make calculations between RDD rows?

I have a Spark RDD like this:
[(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)]
And I want to calculate the increase (as a percentage) between sequential rows. For example, from row 1 to row 2 the value grows to 110.7% ((3.1/2.8)*100), and so on.
Any suggestions on how to make calculations between rows?
You can join the RDD with a copy of itself whose keys are shifted by 1:
rdd = sc.parallelize([(1, '02-01-1950', 2.8), (2, '03-01-1950', 3.1), (3, '04-01-1950', 3.2)])
rdd2 = rdd.map(lambda x: (x[0], x[2]))
rdd3 = rdd.map(lambda x: (x[0]+1, x[2]))
rdd4 = rdd2.join(rdd3).mapValues(lambda r: r[0]/r[1]*100)
rdd4.collect()
# [(2, 110.71428571428572), (3, 103.2258064516129)]
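The shifted-key join assumes the first field is already a consecutive integer index. If it is not, zipWithIndex can supply one first (a sketch under that assumption; variable names are illustrative):
indexed = rdd.zipWithIndex().map(lambda x: (x[1], x[0][2]))  # (position, value)
shifted = indexed.map(lambda x: (x[0] + 1, x[1]))            # positions shifted by 1
indexed.join(shifted).mapValues(lambda r: r[0] / r[1] * 100).collect()
# [(1, 110.71428571428572), (2, 103.2258064516129)]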

How to divide the content of an RDD

I have an RDD in which I would like to divide the two numbers in each value and return a list of tuples.
rdd_to_divide = [('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))]
result_rdd = [('Nate', 1.2/1.2), ('Mike', 5/10), ('Ben', 3/7), ('Chad', 12/20)]
Thanks in advance
Use a lambda function to map the RDD as below:
>>> rdd_to_divide = sc.parallelize([('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))])
>>> result_rdd = rdd_to_divide.map(lambda x: (x[0], x[1][0]/x[1][1]))
>>> result_rdd.take(5)
[('Nate', 1.0), ('Mike', 0.5), ('Ben', 0.42857142857142855), ('Chad', 0.6)]
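If some denominators can be zero, a guarded map avoids a ZeroDivisionError (a defensive variant, not part of the original answer):
safe_rdd = rdd_to_divide.map(
    lambda x: (x[0], x[1][0] / x[1][1] if x[1][1] != 0 else None))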

Take top N elements from each group in PySpark RDD (without using groupByKey)

I have an RDD like the following
dataSource = sc.parallelize( [("user1", (3, "blue")), ("user1", (4, "black")), ("user2", (5, "white")), ("user2", (3, "black")), ("user2", (6, "red")), ("user1", (1, "red"))] )
I want to use reduceByKey to find Top 2 colors for each user so the output would be an RDD like:
sc.parallelize([("user1", ["black", "blue"]), ("user2", ["red", "white"])])
So I need to reduce by key, sort each key's values, i.e. the (number, color) pairs, by number, and return the top n colors.
I don't want to use groupBy. If there is anything better than reduceByKey, other than groupBy, that would be great :)
You can for example use a heap queue. Required imports:
import heapq
from functools import partial
Helper functions:
def zero_value(n):
    """Initialize the accumulator: a heap of n placeholder entries. If n is large,
    it could be more efficient to track the number of elements on the heap as
    (cnt, heap) and switch between heappush and heappushpop once we exceed n.
    I leave this as an exercise for the reader."""
    return [(float("-inf"), None) for _ in range(n)]

def seq_func(acc, x):
    # Push x and pop the smallest entry, keeping the heap at a fixed size n.
    heapq.heappushpop(acc, x)
    return acc

def merge_func(acc1, acc2, n):
    # Merge two heaps and keep only the n largest entries.
    return heapq.nlargest(n, heapq.merge(acc1, acc2))

def finalize(kvs):
    # Drop placeholder entries and keep only the values (colors).
    return [v for (k, v) in kvs if k != float("-inf")]
Data:
rdd = sc.parallelize([
    ("user1", (3, "blue")), ("user1", (4, "black")),
    ("user2", (5, "white")), ("user2", (3, "black")),
    ("user2", (6, "red")), ("user1", (1, "red"))])
Solution:
(rdd
.aggregateByKey(zero_value(2), seq_func, partial(merge_func, n=2))
.mapValues(finalize)
.collect())
Result:
[('user2', ['red', 'white']), ('user1', ['black', 'blue'])]
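A more compact variant under the same constraint (no groupByKey), using heapq.nlargest directly as the combiner; this is a sketch added here, not the original answer, and it rebuilds a small list per record, which is fine for small n:
top2 = (rdd
    .aggregateByKey([],
                    lambda acc, x: heapq.nlargest(2, acc + [x]),
                    lambda a, b: heapq.nlargest(2, a + b))
    .mapValues(lambda pairs: [color for (_, color) in pairs]))
top2.collect()
# [('user2', ['red', 'white']), ('user1', ['black', 'blue'])]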

Repartitioning partitioned data

I'm working on a skewed data problem, such that my smallest partitions are below 64MB and my largest partitions can be greater than 1GB. I've been contemplating a strategy to map a few small partitions to the same partition key, thus creating a partition comprised of partitions. This is all in the hope of reducing variance in task size as well as number of files stored on disk.
At one point in my Spark application, I need to operate on the (non-grouped) original partitions and to do so, will need to repartition by the original key. This brings me to my question:
Suppose I have two data sets as seen below. Each row is a tuple of the form (partition_key, (original_key, data)). In data0, you can see that original_key = 0 is on its own node, whereas original_key = 4 and original_key = 5 are together on the node containing partition_key = 3. In data1, things are not as organized.
If data0 is partitioned by partition_key, and then partitioned by original_key, will a shuffle occur? In other words, does it matter during the second partitionBy call that data0 is more organized than data1?
data0 = [
(0, (0, 'a')),
(0, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(1, (1, 'e')),
(1, (2, 'f')),
(1, (2, 'g')),
(2, (3, 'h')),
(2, (3, 'i')),
(2, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(3, (5, 'm')),
(3, (5, 'n')),
(3, (5, 'o')),
]
data1 = [
(0, (0, 'a')),
(1, (0, 'b')),
(0, (0, 'c')),
(1, (1, 'd')),
(2, (1, 'e')),
(1, (2, 'f')),
(3, (2, 'g')),
(2, (3, 'h')),
(0, (3, 'i')),
(3, (3, 'j')),
(3, (4, 'k')),
(3, (4, 'l')),
(1, (5, 'm')),
(2, (5, 'n')),
(3, (5, 'o')),
]
rdd0 = sc.parallelize(data0, 3).cache()
partitioned0 = rdd0.partitionBy(4)
partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
rdd1 = sc.parallelize(data1, 3).cache()
partitioned1 = rdd1.partitionBy(4)
partitioned1.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
When you call partitionBy, a shuffle kicks in.
How much data actually gets shuffled depends on the layout of the original RDD.
As a side note: when you do sc.parallelize(data0, 3), the 3 is merely a guideline. If the default parallelism is <= 3, your rdd0 will have 3 partitions; if your data lived on more HDFS blocks (as with file-based RDDs), the requested partition number would have no effect.
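One detail worth illustrating (a sketch added here, not part of the answer above): the map() that re-keys the rows drops the partitioner, so by the time the second partitionBy runs, Spark no longer knows how well placed the data already is and plans a shuffle in both cases. The name remapped0 is illustrative:
print(partitioned0.partitioner is not None)   # True: set by partitionBy(4)
remapped0 = partitioned0.map(lambda row: (row[1][0], row[1]))
print(remapped0.partitioner)                  # None: map() discards the partitioner
print(remapped0.partitionBy(6).glom().map(len).collect())  # rows per new partition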
