Spark reduceByKey() to return a compound value - apache-spark

I am new to Spark and stumbled upon the following (probably simple) problem.
I have an RDD of key-value elements, each value being a (string, number) pair.
For instance, a key-value pair is ('A', ('02', 43)).
I want to reduce this RDD so that, for each key, only the element (the key and the whole value) with the maximum number is kept.
reduceByKey() seems relevant, so I went with this MWE:
sc = spark.sparkContext
rdd = sc.parallelize([
    ('A', ('02', 43)),
    ('A', ('02', 36)),
    ('B', ('02', 306)),
    ('C', ('10', 185))])
rdd.reduceByKey(lambda a, b: max(a[1], b[1])).collect()
which produces
[('C', ('10', 185)), ('A', 43), ('B', ('02', 306))]
My problem here is that I would like to get:
[('C', ('10', 185)), ('A', ('02', 43)), ('B', ('02', 306))]
i.e., I don't see how to return ('A', ('02', 43)) and not simply ('A', 43).

I found a solution to this simple problem myself: define a named function instead of an inline lambda for reduceByKey().
This is:
def max_compound(a, b):
    if max(a[1], b[1]) == a[1]:
        return a
    else:
        return b
and call:
rdd.reduceByKey(max_compound).collect()
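For reference, the same comparison can also be written inline; a minimal sketch equivalent to max_compound (not part of the original post):
rdd.reduceByKey(lambda a, b: a if a[1] >= b[1] else b).collect()
# [('C', ('10', 185)), ('A', ('02', 43)), ('B', ('02', 306))]  (ordering may vary)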

The following code is in Scala; I hope you can convert the same logic into PySpark.
val rdd = sparkSession.sparkContext.parallelize(Array(('A', (2, 43)), ('A', (2, 36)), ('B', (2, 306)), ('C', (10, 185))))
val rdd2 = rdd.reduceByKey((a, b) => (Math.max(a._1, b._1), Math.max(a._2, b._2)))
rdd2.collect().foreach(println)
output:
(B,(2,306))
(A,(2,43))
(C,(10,185))
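A rough PySpark translation of the same logic, taking the element-wise max of the two tuple components (my sketch, using integers as in the Scala example, not code from the original answer):
rdd = sc.parallelize([('A', (2, 43)), ('A', (2, 36)), ('B', (2, 306)), ('C', (10, 185))])
rdd2 = rdd.reduceByKey(lambda a, b: (max(a[0], b[0]), max(a[1], b[1])))
rdd2.collect()
# e.g. [('B', (2, 306)), ('A', (2, 43)), ('C', (10, 185))]  (ordering may vary)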

Related

Pyspark- Convert an RDD into a key value pair RDD, with the keys in ascending order

I have an RDD1 in this form: ['once','upon','a','time',...,'the','end']. I want to convert it into a key/value pair RDD such that the strings are the values and the keys are in ascending order. The expected RDD2 should be as follows: [(1,'once'),(2,'upon'),(3,'a'),(4,'time'),...,(RDD1.count()-1,'the'),(RDD1.count(),'end')]
Any hints?
Thanks
Use PySpark's own zip function. This might help:
rdd1 = sc.parallelize(['once','upon','a','time','the','end'])
nums = sc.parallelize(range(rdd1.count())).map(lambda x: x+1)
zippedRdds = nums.zip(rdd1)
rdd2 = zippedRdds.sortByKey()
rdd2.collect()
will give:
[(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]
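Note that zip() requires both RDDs to have the same number of partitions and the same number of elements per partition, so the count-based approach can be fragile. A sketch of an alternative using zipWithIndex() (my addition, not from the original answer):
rdd2 = rdd1.zipWithIndex().map(lambda x: (x[1] + 1, x[0]))
rdd2.collect()
# [(1, 'once'), (2, 'upon'), (3, 'a'), (4, 'time'), (5, 'the'), (6, 'end')]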

How to use combineByKey in pyspark [duplicate]

This question already has answers here:
Who can give a clear explanation for `combineByKey` in Spark?
(3 answers)
Closed 3 years ago.
I got a question from homework.
We have sample data like this:
data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
The combineByKey code is like this:
>>> rdd = sc.parallelize( data )
>>> rdd2 = rdd.combineByKey
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
I got a result like this:
>>> myoutput = rdd2.collect()
>>> myoutput
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
Since we are supposed to write out the answer manually instead of just running the code to get the result:
after the first lambda, is it correct that I get this result: ('B', (2,4,1)), ('A', (1,3,1)), ('A', (4,6,1)), ('B', (2,4,1)), ('B', (3,5,1))? But I don't quite understand the "x[1] + value*value" part of the second lambda. How do we get the middle values of 17 and 9 for B and A?
Can anyone help explain this to me? Thank you!
As explained in the link by cricket_007, when using combineByKey, values are merged into one value per key within each partition, and then the per-partition values are merged into a single value.
Let's first look at the number of partitions and what each partition contains after we parallelize the data.
>>> data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
>>> rdd = sc.parallelize( data )
>>> rdd.collect()
[('B', 2), ('A', 1), ('A', 4), ('B', 2), ('B', 3)]
Number of partitions (by default):
>>> num_partitions = rdd.getNumPartitions()
>>> print(num_partitions)
4
Contents of each partition:
>>> partitions = rdd.glom().collect()
>>> for num,partition in enumerate(partitions):
... print(f'Partitions {num} -> {partition}')
Partitions 0 -> [('B', 2)]
Partitions 1 -> [('A', 1)]
Partitions 2 -> [('A', 4)]
Partitions 3 -> [('B', 2), ('B', 3)]
combineByKey is defined as
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
The three functions that combineByKey takes as arguments:
createCombiner : lambda value: (value, value+2, 1)
This is called the first time a key is seen within a partition.
mergeValue : lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1)
This is called when the key has already been seen within that partition.
mergeCombiners : lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])
This is called to merge the combined values of the same key across different partitions.
partitioner : Beyond the scope of this answer.
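To make the roles of the three functions concrete, here is a minimal sketch (my own example, not part of the homework) that uses combineByKey to compute a per-key average:
pairs = sc.parallelize([("A", 1), ("A", 4), ("B", 2)])
sum_count = pairs.combineByKey(
    lambda v: (v, 1),                          # createCombiner: start a (sum, count) for a new key
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue: fold another value into the partition-local (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners: merge (sum, count) pairs from different partitions
sum_count.mapValues(lambda t: t[0] / t[1]).collect()
# e.g. [('B', 2.0), ('A', 2.5)]  (ordering may vary)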
Now let's work out what happens:
Partition 0: [('B', 2)]
createCombiner
('B', 2) -> Unseen Key -> ('B', (2, 2+2, 1))
-> ('B', (2,4,1))
# Same createCombiner for partition 1,2,3
Partition 1: [('A',1)]
createCombiner
('A',1) -> Unseen Key -> ('A', (1,3,1))
Partition 2: [('A',4)]
createCombiner
('A',4) -> Unseen Key -> ('A', (4,6,1))
Partition 3: [('B',2), ('B',3)]
createCombiner
('B',2) -> Unseen Key -> ('B',(2,4,1))
('B',3) -> Seen Key -> mergeValue ('B',(2,4,1)) with ('B',3)
-> ('B', (2 + 3, 4 + (3*3), 1 + 1))
-> ('B', (5,13,2))
Partition 0 and Partition 3:
mergeCombiners ('B', (2,4,1)) and ('B', (5,13,2))
-> ('B', (2+5,4+13,1+2))
-> ('B', (7,17,3))
Partition 1 and 2:
mergeCombiners ('A', (1,3,1)) and ('A', (4,6,1))
-> ('A', (1+4, 3+6, 1+1))
-> ('A', (5,9,2))
So the final answer that we get is:
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
>>> rdd2.collect()
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
I hope this explains what's going on.
Additional Clarification as asked in comments:
How does spark set the number of partitions?
From the docs: Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
How does spark partition the data?
A partition (aka split) is a logical chunk of a large distributed data set.
Spark has three different partitioning schemes, namely
hashPartitioner : the default. Keys with the same hash (modulo the number of partitions) end up in the same partition.
customPartitioner : example below.
rangePartitioner : elements with keys in the same range end up in the same partition.
To quote Learning Spark by Karau et al. (p. 61): Spark does not give you explicit control over which key goes to which partition, but it ensures that a given set of keys will appear together on some node. If you want to control which partition specific keys land in, you can use a custom partitioner like so:
>>> def customPartitioner(key):
... if key == 'A':
... return 0
... if key == 'B':
... return 1
>>> num_partitions = 2
>>> rdd = sc.parallelize( data ).partitionBy(num_partitions,customPartitioner)
>>> partitions = rdd.glom().collect()
>>> for num,partition in enumerate(partitions):
... print(f'Partition {num} -> {partition}')
Partition 0 -> [('A', 1), ('A', 4)]
Partition 1 -> [('B', 2), ('B', 2), ('B', 3)]
I encourage you to read the book to learn more.

Reducing values in lists of (key, val) RDD's, given these lists are values in another list of (key, val) RDD's

I've been racking my brain over this for a while and would really appreciate any suggestions!
Sorry for the long title; I hope the short example I construct below explains it much better.
Let's say we have an RDD of the below form:
data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
(2,[('k3',1),('k3',8),('k1',6)])])
data.collect()
Output:
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]),
(2, [('k3', 1), ('k3', 8), ('k1', 6)])]
I am looking to apply the following to the innermost lists of (key, val) pairs:
.reduceByKey(lambda a, b: a + b)
(i.e. reduce the values of these inner pairs by key to get the per-key sum, while keeping the result mapped to the key of the initial, higher-level RDD, which would produce the following output):
[(1, [('k1', 6), ('k2', 3)]),
(2, [('k3', 9), ('k1', 6)])]
I'm relatively new to PySpark and am probably missing something basic here. I've tried a lot of different approaches, but I essentially cannot find a way to access and reduceByKey the (key, val) pairs in a list that is itself a value of another RDD.
Many thanks in advance!
Denys
What you are trying to do is: your value (in the input K, V) is an iterable over which you want to sum by inner key, returning the result as
(outer_key (e.g. 1, 2) -> List((inner_key (e.g. "K1", "K2"), summed_value)))
Since the sum is calculated on the inner key-value pairs, we can achieve this by:
=> first peeling the elements out of each list item
=> making a new key of (outer_key, inner_key)
=> summing over (outer_key, inner_key) -> value
=> changing the data format back to (outer_key -> (inner_key, summed_value))
=> finally grouping again on the outer key
I am not sure about the Python version, but I believe just replacing the Scala collection syntax with Python's would suffice; here is the solution.
SCALA VERSION
scala> val keySeq = Seq((1,List(("K1",4),("K2",3),("K1",2))),
| (2,List(("K3",1),("K3",8),("K1",6))))
keySeq: Seq[(Int, List[(String, Int)])] = List((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
scala> val inRdd = sc.parallelize(keySeq)
inRdd: org.apache.spark.rdd.RDD[(Int, List[(String, Int)])] = ParallelCollectionRDD[111] at parallelize at <console>:26
scala> inRdd.take(10)
res64: Array[(Int, List[(String, Int)])] = Array((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
// And solution :
scala> inRdd.flatMap { case (i,l) => l.map(l => ((i,l._1),l._2)) }.reduceByKey(_+_).map(x => (x._1._1 ->(x._1._2,x._2))).groupByKey.map(x => (x._1,x._2.toList.sortBy(x =>x))).collect()
// RESULT ::
res65: Array[(Int, List[(String, Int)])] = Array((1,List((K1,6), (K2,3))), (2,List((K1,6), (K3,9))))
UPDATE => Python Solution
>>> data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
... (2,[('k3',1),('k3',8),('k1',6)])])
>>> data.collect()
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]), (2, [('k3', 1), ('k3', 8), ('k1', 6)])]
# Similar operation
>>> data.flatMap(lambda x : [ ((x[0],y[0]),y[1]) for y in x[1]]).reduceByKey(lambda a,b : (a+b)).map(lambda x : [x[0][0],(x[0][1],x[1])]).groupByKey().mapValues(list).collect()
# RESULT
[(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]
You should .map your dataset instead of reducing it, because the number of rows in your expected output is the same as in the source dataset; inside the map you can reduce the values as a plain Python list, as in the sketch below.
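A minimal sketch of that map-based approach (the helper name reduce_inner is mine, not from the answer):
from collections import defaultdict

def reduce_inner(pairs):
    # sum the inner values per inner key, keeping the result as a list of tuples
    totals = defaultdict(int)
    for k, v in pairs:
        totals[k] += v
    return list(totals.items())

data.map(lambda kv: (kv[0], reduce_inner(kv[1]))).collect()
# [(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]  (inner ordering may vary)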
use mapValues() + itertools.groupby():
from itertools import groupby
data.mapValues(lambda x: [ (k, sum(f[1] for f in g)) for (k,g) in groupby(sorted(x), key=lambda d: d[0]) ]) \
.collect()
#[(1, [('k1', 6), ('k2', 3)]), (2, [('k1', 6), ('k3', 9)])]
With itertools.groupby, we use the first item of each tuple as the group key k and sum the second items of the tuples in each group g.
Edit: for a large data set, the sorting required by itertools.groupby is expensive, so just write a function without sorting that does the same:
def merge_tuples(x):
    d = {}
    for (k, v) in x:
        d[k] = d.get(k, 0) + v
    return list(d.items())
data.mapValues(merge_tuples).collect()
#[(1, [('k2', 3), ('k1', 6)]), (2, [('k3', 9), ('k1', 6)])]

Pyspark: Applying reduce by key to the values of an rdd

After some transformations I have ended up with an rdd with the following format:
[(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
 (1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]
I can't figure out how to essentially "reduceByKey()" on the values portion of this rdd.
This is what I'd like to achieve:
[(0, [('a', 1), ('b', 3)]),
 (1, [('c', 1), ('d', 1), ('h', 2)])]
I was originally using .values() and then applying reduceByKey to the result, but then I end up losing my original key (in this case 0 or 1).
You lose the original key because .values() keeps only the value part of each key-value row. You should instead sum over the tuples within each row.
from collections import defaultdict
def sum_row(row):
    result = defaultdict(int)
    for key, val in row[1]:
        result[key] += val
    return (row[0], list(result.items()))
data_rdd = data_rdd.map(sum_row)
print(data_rdd.collect())
# [(0, [('a', 1), ('b', 3)]), (1, [('h', 2), ('c', 1), ('d', 1)])]
Though .values() gives an RDD, reduceByKey works across all the values in that RDD, not row-wise.
You can also use groupby (sorting is required) to achieve the same:
from itertools import groupby
distdata.map(lambda x: (x[0], [(a, sum(c[1] for c in b)) for a,b in groupby(sorted(x[1]), key=lambda p: p[0]) ])).collect()

Spark: Sort an RDD by multiple values in a tuple / columns

So I have an RDD as follows
RDD[(String, Int, String)]
And as an example
('b', 1, 'a')
('a', 1, 'b')
('a', 0, 'b')
('a', 0, 'a')
The final result should look something like
('a', 0, 'a')
('a', 0, 'b')
('a', 1, 'b')
('b', 1, 'a')
How would I do something like this?
Try this:
rdd.sortBy(r => r)
If you wanted to change which fields are compared first, you could do this:
rdd.sortBy(r => (r._3, r._1, r._2))
For reverse order:
rdd.sortBy(r => r, false)
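In PySpark the same idea would look like this (a sketch, assuming the rows are plain Python tuples):
rdd = sc.parallelize([('b', 1, 'a'), ('a', 1, 'b'), ('a', 0, 'b'), ('a', 0, 'a')])
rdd.sortBy(lambda r: r).collect()
# [('a', 0, 'a'), ('a', 0, 'b'), ('a', 1, 'b'), ('b', 1, 'a')]
rdd.sortBy(lambda r: r, ascending=False).collect()   # reverse order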
