How to use combineByKey in pyspark [duplicate] - apache-spark

This question already has answers here:
Who can give a clear explanation for `combineByKey` in Spark?
(3 answers)
Closed 3 years ago.
I got this question from homework:
We have sample data like this---
data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
the combineByKey code is like this---
>>> rdd = sc.parallelize( data )
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
I got a result like this:
>>> myoutput = rdd2.collect()
>>> myoutput
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
since we are supposed to manually write out the answer instead of just running the code to get the result.
After the first lambda, is it correct that I get this result: ('B', (2,4,1)), ('A', (1,3,1)), ('A', (4,6,1)), ('B', (2,4,1)), ('B', (3,5,1))? But I don't quite understand the "x[1] + value*value" part of the second lambda. How do the middle values of 17 and 9 come about for B and A?
Can anyone help explain this to me? Thank you!

As explained in the link by cricket_007:
When using combineByKey, values are merged into one value within each partition, and then the per-partition values are merged into a single value.
Let's first look at the number of partitions and what each partition contains after we parallelize the data.
>>> data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
>>> rdd = sc.parallelize( data )
>>> rdd.collect()
[('B', 2), ('A', 1), ('A', 4), ('B', 2), ('B', 3)]
Number of partitions (by default):
>>> num_partitions = rdd.getNumPartitions()
>>> print(num_partitions)
4
Contents of each partition:
>>> partitions = rdd.glom().collect()
>>> for num, partition in enumerate(partitions):
...     print(f'Partition {num} -> {partition}')
Partition 0 -> [('B', 2)]
Partition 1 -> [('A', 1)]
Partition 2 -> [('A', 4)]
Partition 3 -> [('B', 2), ('B', 3)]
combineByKey is defined as
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
The three functions that combineByKey takes as arguments:
createCombiner : lambda value: (value, value+2, 1)
This is called the first time a key is seen in a partition.
mergeValue : lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1)
This is called when the key has already been seen in that partition.
mergeCombiners : lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])
This is called to merge the combined values for a key across different partitions.
partitioner : Beyond the scope of this answer.
Now let's work out what happens:
Partition 0: [('B', 2)]
createCombiner
('B', 2) -> Unseen Key -> ('B', (2, 2+2, 1))
-> ('B', (2,4,1))
# Same createCombiner for partitions 1, 2, 3
Partition 1: [('A',1)]
createCombiner
('A',1) -> Unseen Key -> ('A', (1,3,1))
Partition 2: [('A',4)]
createCombiner
('A',4) -> Unseen Key -> ('A', (4,6,1))
Partition 3: [('B',2), ('B',3)]
createCombiner
('B',2) -> Unseen Key -> ('B',(2,4,1))
('B',3) -> Seen Key -> mergeValue ('B',(2,4,1)) with ('B',3)
-> ('B', (2 + 3, 4+(3*3), 1+1))
-> ('B', (5,13,2))
Partition 0 and Partition 3:
mergeCombiners ('B', (2,4,1)) and ('B', (5,13,2))
-> ('B', (2+5, 4+13, 1+2))
-> ('B', (7,17,3))
Partition 1 and 2:
mergeCombiners ('A', (1,3,1)) and ('A', (4,6,1))
-> ('A', (1+4, 3+6, 1+1))
-> ('A', (5,9,2))
So the final answer that we get is:
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
>>> rdd2.collect()
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
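If you want to double-check the hand trace without running Spark, here is a small plain-Python simulation of what combineByKey does (a sketch that assumes the same four-partition layout shown above; the variable names are just for illustration):
def createCombiner(value):
    return (value, value + 2, 1)

def mergeValue(x, value):
    return (x[0] + value, x[1] + value * value, x[2] + 1)

def mergeCombiners(x, y):
    return (x[0] + y[0], x[1] + y[1], x[2] + y[2])

partitions = [[('B', 2)], [('A', 1)], [('A', 4)], [('B', 2), ('B', 3)]]

combined = {}                                  # key -> combiner merged across partitions
for part in partitions:
    local = {}                                 # per-partition combiners
    for k, v in part:
        local[k] = mergeValue(local[k], v) if k in local else createCombiner(v)
    for k, c in local.items():
        combined[k] = mergeCombiners(combined[k], c) if k in combined else c

print(combined)                                # {'B': (7, 17, 3), 'A': (5, 9, 2)}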
I hope this explains what's going on.
Additional clarification, as asked in the comments:
How does spark set the number of partitions?
From the docs: Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
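For example (a quick check, not from the original answer):
>>> sc.parallelize(data, 2).getNumPartitions()
2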
How does spark partition the data?
A partition (aka split) is a logical chunk of a large distributed data set.
Spark has three different partitioning schemes, namely
hashPartitioner : The default. Keys with the same hash (modulo the number of partitions) end up on the same node.
customPartitioner : Example below.
rangePartitioner : Elements with keys in the same range appear on the same node.
To paraphrase Learning Spark by Karau et al. (p. 61): Spark does not give you explicit control over which partition each key goes to, but it ensures that a set of keys will appear together on some node. If you want all records with the same key to appear together in the same partition, you can use a custom partitioner like so.
>>> def customPartitioner(key):
...     if key == 'A':
...         return 0
...     if key == 'B':
...         return 1
>>> num_partitions = 2
>>> rdd = sc.parallelize( data ).partitionBy(num_partitions,customPartitioner)
>>> partitions = rdd.glom().collect()
>>> for num, partition in enumerate(partitions):
...     print(f'Partition {num} -> {partition}')
Partition 0 -> [('A', 1), ('A', 4)]
Partition 1 -> [('B', 2), ('B', 2), ('B', 3)]
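As a side note (a sketch, not part of the original answer): if you re-run the same combineByKey lambdas on this custom-partitioned RDD, the middle value for 'A' changes, because createCombiner (value+2) now runs only once per key per partition, and the second 'A' goes through mergeValue (value*value) instead:
>>> rdd.combineByKey(lambda value: (value, value+2, 1),
...                  lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
...                  lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])).collect()
Working this out by hand exactly as above gives ('A', (5, 19, 2)) and ('B', (7, 17, 3)).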
I encourage you to read the book to learn more.

Related

PySpark Combining Strings into a tuple-pair based on second value as key

I am new to pyspark and am still trying to understand how map and reduce work.
I have a dataset read in as an RDD; its contents are attached below as learn.txt.
Based on the second value (the numeric one), I want to see which two letters have the same value, and how many times.
My current code's output:
[(('b', 'c'), 1),
('c', 1),
('d', 1),
('a', 1),
(('a', 'b'), 2),
((('a', 'b'), 'c'), 1)]
What I want as the output:
[(('b','a'),3),
(('a','b'),3),
(('b','c'),2),
(('c','b'),2),
(('a','c'),1),
(('c','a'),1)]
That is, pairs only, in both orderings, counted once for every value they share (even a single match).
I don't believe my code will be too helpful, but this is what I've got:
from pyspark import RDD, SparkContext
from pyspark.sql import DataFrame, SparkSession
sc = SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
df = sc.textFile("learn.txt")
mapped = df.map(lambda x: [a for a in x.split(',')])
remapped = mapped.map(lambda x: (x[1], x[0]))
reduced = remapped.reduceByKey(lambda x,y: (x,y))
threemapped = reduced.map(lambda x: (x[1], 1))
output = threemapped.reduceByKey(lambda x, y: x+y)
output.collect()
Where learn.txt:
a,1
a,2
a,3
a,4
b,2
b,3
b,4
b,6
c,2
c,5
c,6
d,7
With .reduceByKey(lambda x,y: (x,y)) you create intertwined tuples of tuples of tuples... You are not going to be able to do anything with that.
Since you are looking for pairs of values that share a key, we can use a join like this:
# same code as you
vals = df\
.map(lambda x: [a for a in x.split(',')])\
.map(lambda x: (x[1], x[0]))
# but then you can join vals with itself and use reduceByKey to count occurrences
result = vals.join(vals)\
.filter(lambda x: x[1][0] != x[1][1])\
.map(lambda x: ((x[1][1], x[1][0]), 1))\
.reduceByKey(lambda a, b: a+b)\
.collect()
which yields:
[(('b', 'a'), 3), (('c', 'a'), 1), (('c', 'b'), 2),
(('b', 'c'), 2), (('a', 'b'), 3), (('a', 'c'), 1)]
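If you want to sanity-check the join logic on small data, here is an equivalent plain-Python version (a sketch that assumes learn.txt as shown in the question; itertools.permutations plays the role of the self-join plus the filter):
from collections import defaultdict
from itertools import permutations

# group letters by their numeric value
groups = defaultdict(list)
with open('learn.txt') as f:
    for line in f:
        letter, number = line.strip().split(',')
        groups[number].append(letter)

# every ordered pair of distinct letters sharing a number counts once
counts = defaultdict(int)
for letters in groups.values():
    for a, b in permutations(letters, 2):
        counts[(a, b)] += 1

print(dict(counts))
# ('a','b') and ('b','a') -> 3, ('b','c') and ('c','b') -> 2, ('a','c') and ('c','a') -> 1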

Spark reduceByKey() to return a compound value

I am new to Spark and stumbled upon the following (probably simple) problem.
I have an RDD with key-value elements, each value being a (string, number) pair.
For instance, the key-value pair is ('A', ('02', 43)).
I want to reduce this RDD by keeping the element (the key and the whole value) with the maximum number when elements share the same key.
reduceByKey() seems relevant, and I went with this MWE.
sc = spark.sparkContext
rdd = sc.parallelize([
    ('A', ('02', 43)),
    ('A', ('02', 36)),
    ('B', ('02', 306)),
    ('C', ('10', 185))])
rdd.reduceByKey(lambda a,b : max(a[1],b[1])).collect()
which produces
[('C', ('10', 185)), ('A', 43), ('B', ('02', 306))]
My problem here is that I would like to get:
[('C', ('10', 185)), ('A', ('02', 43)), ('B', ('02', 306))]
i.e., I don't see how to return ('A', ('02', 43)) and not simply ('A', 43).
I found a solution to this simple problem myself.
Define a named function instead of using an inline lambda for reduceByKey().
This is:
def max_compound(a, b):
    if max(a[1], b[1]) == a[1]:
        return a
    else:
        return b
and call:
rdd.reduceByKey(max_compound).collect()
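As a side note (not from the original answer), the same thing can be written inline by giving Python's built-in max a key function:
rdd.reduceByKey(lambda a, b: max(a, b, key=lambda t: t[1])).collect()
# keeps the whole tuple (e.g. ('02', 43)) with the larger number for each key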
The following code is in Scala; I hope you can convert the same logic into PySpark.
val rdd = sparkSession.sparkContext.parallelize(Array(('A', (2, 43)), ('A', (2, 36)), ('B', (2, 306)), ('C', (10, 185))))
val rdd2 = rdd.reduceByKey((a, b) => (Math.max(a._1, b._1), Math.max(a._2, b._2)))
rdd2.collect().foreach(println)
output:
(B,(2,306))
(A,(2,43))
(C,(10,185))

Reducing values in lists of (key, val) RDD's, given these lists are values in another list of (key, val) RDD's

I've been racking my brain over this for a while; I would really appreciate any suggestions!
Sorry for the long title; I hope the short example I construct below will explain this much better.
Let's say we have an RDD of the below form:
data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
(2,[('k3',1),('k3',8),('k1',6)])])
data.collect()
Output:
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]),
(2, [('k3', 1), ('k3', 8), ('k1', 6)])]
I am looking to do the following with the innermost lists of (key, val) pairs:
.reduceByKey(lambda a, b: a + b)
(i.e. reduce the values of these pairs by key to get the per-key sum, while keeping the result mapped to the keys of the initial higher-level RDD, which would produce the following output):
[(1, [('k1', 6), ('k2', 3)]),
(2, [('k3', 9), ('k1', 6)])]
I'm relatively new to PySpark and am probably missing something basic here. I've tried a lot of different approaches, but essentially I cannot find a way to apply reduceByKey to the (key, val) pairs inside a list that is itself a value of another RDD.
Many thanks in advance!
Denys
What you are trying to do is: your value (in the input K,V) is an iterable on which you want to sum by inner key and return the result as
(outer_key (e.g. 1, 2) -> List((inner_key (e.g. "K1", "K2"), summed_value)))
As you can see, the sum is calculated over the inner key-value pairs.
We can achieve this by:
first peeling out the elements from each list item
=> making a new key as (outer_key, inner_key)
=> summing on (outer_key, inner_key) -> value
=> changing the data format back to (outer_key -> (inner_key, summed_value))
=> finally grouping again on the outer key
I am not sure about the Python version, but I believe just replacing the Scala collection syntax with Python's would suffice; here is the solution.
SCALA VERSION
scala> val keySeq = Seq((1,List(("K1",4),("K2",3),("K1",2))),
| (2,List(("K3",1),("K3",8),("K1",6))))
keySeq: Seq[(Int, List[(String, Int)])] = List((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
scala> val inRdd = sc.parallelize(keySeq)
inRdd: org.apache.spark.rdd.RDD[(Int, List[(String, Int)])] = ParallelCollectionRDD[111] at parallelize at <console>:26
scala> inRdd.take(10)
res64: Array[(Int, List[(String, Int)])] = Array((1,List((K1,4), (K2,3), (K1,2))), (2,List((K3,1), (K3,8), (K1,6))))
// And solution :
scala> inRdd.flatMap { case (i,l) => l.map(l => ((i,l._1),l._2)) }.reduceByKey(_+_).map(x => (x._1._1 ->(x._1._2,x._2))).groupByKey.map(x => (x._1,x._2.toList.sortBy(x =>x))).collect()
// RESULT ::
res65: Array[(Int, List[(String, Int)])] = Array((1,List((K1,6), (K2,3))), (2,List((K1,6), (K3,9))))
UPDATE => Python Solution
>>> data = sc.parallelize([(1,[('k1',4),('k2',3),('k1',2)]),\
... (2,[('k3',1),('k3',8),('k1',6)])])
>>> data.collect()
[(1, [('k1', 4), ('k2', 3), ('k1', 2)]), (2, [('k3', 1), ('k3', 8), ('k1', 6)])]
# Similar operation
>>> data.flatMap(lambda x : [ ((x[0],y[0]),y[1]) for y in x[1]]).reduceByKey(lambda a,b : (a+b)).map(lambda x : [x[0][0],(x[0][1],x[1])]).groupByKey().mapValues(list).collect()
# RESULT
[(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]
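For readability, the same chain as the one-liner above can be split into named steps (equivalent logic; the intermediate variable names are just for illustration):
pairs = data.flatMap(lambda x: [((x[0], y[0]), y[1]) for y in x[1]])  # ((outer, inner), value)
summed = pairs.reduceByKey(lambda a, b: a + b)                        # sum per (outer, inner)
regrouped = summed.map(lambda x: (x[0][0], (x[0][1], x[1])))          # outer -> (inner, sum)
regrouped.groupByKey().mapValues(list).collect()
# [(1, [('k1', 6), ('k2', 3)]), (2, [('k3', 9), ('k1', 6)])]  (order within lists may vary)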
You should .map your dataset instead of reducing it, because the number of rows in your example is the same as in the source dataset; inside the map you can reduce the values as a Python list.
use mapValues() + itertools.groupby():
from itertools import groupby
data.mapValues(lambda x: [ (k, sum(f[1] for f in g)) for (k,g) in groupby(sorted(x), key=lambda d: d[0]) ]) \
.collect()
#[(1, [('k1', 6), ('k2', 3)]), (2, [('k1', 6), ('k3', 9)])]
With itertools.groupby, we use the first item of each tuple as the group key k and sum the second items of the tuples in each group g.
Edit: for a large data set, the sorting required by itertools.groupby is expensive; just write a function without sorting to do the same:
def merge_tuples(x):
    d = {}
    for (k, v) in x:
        d[k] = d.get(k, 0) + v
    return list(d.items())   # return a list (not a dict_items view) so Spark can serialize it
data.mapValues(merge_tuples).collect()
#[(1, [('k2', 3), ('k1', 6)]), (2, [('k3', 9), ('k1', 6)])]

Pyspark: Applying reduce by key to the values of an rdd

After some transformations I have ended up with an rdd with the following format:
[(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
(1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]
I can't figure out how to essentially "reduceByKey()" on the values portion of this rdd.
This is what I'd like to achieve:
[(0, [('a', 1), ('b', 3)]),
(1, [('c', 1), ('d', 1), ('h', 2)])]
I was originally using .values() and then applying reduceByKey to the result, but then I end up losing my original key (in this case 0 or 1).
You lose the original key because .values() only keeps the value part of each (key, value) row. You should instead sum the tuples within each row.
from collections import defaultdict
def sum_row(row):
    result = defaultdict(int)
    for key, val in row[1]:
        result[key] += val
    return (row[0], list(result.items()))
data_rdd = data_rdd.map(sum_row)
print(data_rdd.collect())
# [(0, [('a', 1), ('b', 3)]), (1, [('h', 2), ('c', 1), ('d', 1)])]
Although .values() gives an RDD, reduceByKey works across all the values of that RDD, not row by row.
You can also use groupby (sorting is required) to achieve the same:
from itertools import groupby
distdata.map(lambda x: (x[0], [(a, sum(c[1] for c in b)) for a,b in groupby(sorted(x[1]), key=lambda p: p[0]) ])).collect()

Get degree of each node in a graph by Networkx in python

Suppose I have a data set like below that shows an undirected graph:
1 2
1 3
1 4
3 5
3 6
7 8
8 9
10 11
I have a Python script like this:
for s in ActorGraph.degree():
    print(s)
which gives key/value pairs where the keys are node names and the values are the degrees of the nodes:
('9', 1)
('5', 1)
('11', 1)
('8', 2)
('6', 1)
('4', 1)
('10', 1)
('7', 1)
('2', 1)
('3', 3)
('1', 3)
The networkx documentation suggests using values() to get the node degrees.
Now I would like to have just the degree values, but when I use this part of the script it doesn't work and says the object has no attribute 'values':
for s in ActorGraph.degree():
    print(s.values())
how can I do it?
You are using version 2.0 of networkx, which changed G.degree() from returning a dict to returning a dict-like (but not dict) DegreeView. See this guide.
To have the degrees in a list you can use a list-comprehension:
degrees = [val for (node, val) in G.degree()]
I'd like to add the following: if you're initializing the undirected graph with nx.Graph() and adding the edges afterwards, beware that networkx doesn't guarantee that the order of nodes will be preserved -- this also applies to degree(). This means that if you use the list-comprehension approach and then try to access a degree by list index, the index may not correspond to the right node. If you'd like them to correspond, you can instead do:
degrees = [val for (node, val) in sorted(G.degree(), key=lambda pair: pair[0])]
Here's a simple example to illustrate this:
>>> edges = [(0, 1), (0, 3), (0, 5), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
>>> g = nx.Graph()
>>> g.add_edges_from(edges)
>>> print(g.degree())
[(0, 3), (1, 4), (3, 3), (5, 2), (2, 4), (4, 2)]
>>> print([val for (node, val) in g.degree()])
[3, 4, 3, 2, 4, 2]
>>> print([val for (node, val) in sorted(g.degree(), key=lambda pair: pair[0])])
[3, 4, 4, 3, 2, 2]
You can also use a dict comprehension to get an actual dictionary:
degrees = {node:val for (node, val) in G.degree()}
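Since the DegreeView iterates over (node, degree) pairs, you can also convert it directly (an equivalent small sketch):
degrees = dict(G.degree())   # degrees[node] gives that node's degree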
