Pyspark: Applying reduce by key to the values of an rdd - apache-spark

After some transformations I have ended up with an rdd with the following format:
[(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
 (1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]
I can't figure out how to essentially "reduceByKey()" on the values portion of this rdd.
This is what I'd like to achieve:
[(0, [('a', 1), ('b', 3)]),
 (1, [('c', 1), ('d', 1), ('h', 2)])]
I was originally using .values() and then applying reduceByKey to the result, but then I end up losing my original key (in this case 0 or 1).

You lose the original key because .values() extracts only the value part of each key-value pair. Instead, sum up the tuples inside each row's value list:
from collections import defaultdict

def sum_row(row):
    # row is (key, [(letter, count), ...]); sum the counts per letter
    result = defaultdict(int)
    for key, val in row[1]:
        result[key] += val
    return (row[0], list(result.items()))

data_rdd = data_rdd.map(sum_row)
print(data_rdd.collect())
# [(0, [('a', 1), ('b', 3)]), (1, [('h', 2), ('c', 1), ('d', 1)])]
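A more compact sketch of the same idea (assuming the same data_rdd layout as above) is to use mapValues with collections.Counter, since only the value lists need to change:

from collections import Counter

def sum_pairs(pairs):
    # sum the counts per letter within one row's value list
    counts = Counter()
    for key, val in pairs:
        counts[key] += val
    return list(counts.items())

# mapValues leaves the original key (0 or 1) untouched
data_rdd = data_rdd.mapValues(sum_pairs)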

Although .values() returns an RDD, reduceByKey operates across the whole RDD, not row by row.
You can also use itertools.groupby to achieve the same result (sorting first is required, because groupby only groups consecutive equal keys):
from itertools import groupby
distdata.map(lambda x: (x[0], [(a, sum(c[1] for c in b)) for a, b in groupby(sorted(x[1]), key=lambda p: p[0])])).collect()
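Assuming distdata holds the input shown at the top, this should give the same result, e.g. [(0, [('a', 1), ('b', 3)]), (1, [('c', 1), ('d', 1), ('h', 2)])].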

Related

Reduce key, value pair based on similarity of their value in PySpark

I am a beginner in PySpark.
I want to find the pairs of letters that have the same number as their value, and then find out which pairs of letters appear most often.
Here is my data:
data = sc.parallelize([('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)])
data.collect()
[('a', 1), ('b', 4), ('c', 10), ('d', 4), ('e', 4), ('f', 1), ('b', 5), ('d', 5)]
The result I want would look like this:
1: a,f
4: b, d
4: b, e
4: d, e
10: c
5: b, d
I have tried the following:
data1 = data.map(lambda y: (y[1], y[0]))
data1.collect()
[(1, 'a'), (4, 'b'), (10, 'c'), (4, 'd'), (4, 'e'), (1, 'f'), (5, 'b'), (5, 'd')]
data1.groupByKey().mapValues(list).collect()
[(10, ['c']), (4, ['b', 'd', 'e']), (1, ['a', 'f']), (5, ['b', 'd'])]
As I said, I am very new to PySpark and tried to search for the right command but was not successful. Could anyone please help me with this?
You can use flatMap with Python's itertools.combinations to get combinations of 2 from the grouped values. Also, prefer reduceByKey over groupByKey:
from itertools import combinations
result = data.map(lambda x: (x[1], [x[0]])) \
    .reduceByKey(lambda a, b: a + b) \
    .flatMap(lambda x: [(x[0], p) for p in combinations(x[1], 2 if (len(x[1]) > 1) else 1)])
result.collect()
#[(1, ('a', 'f')), (10, ('c',)), (4, ('b', 'd')), (4, ('b', 'e')), (4, ('d', 'e')), (5, ('b', 'd'))]
If you want to get None when a group has only one element, you can use this instead:
.flatMap(lambda x: [(x[0], p) for p in combinations(x[1] if len(x[1]) > 1 else x[1] + [None], 2)])
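With this variant, a single-element group becomes a pair with None, e.g. (10, ('c', None)) instead of (10, ('c',)).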

How to sort and remove tuples with same first element and only keeping the first occurrence

Suppose we have a list of tuples ListeT:
ListeT = [('a', 1), ('x', 1), ('b', 1), ('b', 1), ('a', 2), ('a', 3), ('c', 6), ('c', 5), ('e', 6), ('d', 7), ('b', 2)]
and I want to get the following result:
Result = [('a', 1), ('x', 1), ('b', 1), ('c', 5), ('c', 6), ('e', 6), ('d', 7)]
1- I want to order this list according to the second element of the tuples.
2- I want to remove duplicates and keep only the first occurrence among tuples sharing the same first element: for instance, if I have ('a', 1) and ('a', 2), I want to keep only ('a', 1).
For the sorting I used res = sorted(ListeT, key=lambda x: x[1], reverse=True) and it worked.
But for the duplicates I couldn't find a good solution:
I can remove duplicate elements by converting the list to a set (set(ListeT)) or by using numpy.unique(ListeT, axis=0). However, this only removes exact duplicate tuples, whereas I also want to remove tuples that share the same first element, keeping only the first occurrence.
Thank you.
A dict can take care of the uniqueness if you feed it the tuples in reverse sorted order.
I am not sure whether you need the outer sort; if not, you can just replace it with list().
sorted(dict(sorted(ListeT, key=lambda x: x[1], reverse=True)).items(), key=lambda x: x[1])
[('a', 1), ('b', 1), ('x', 1), ('c', 5), ('e', 6), ('d', 7)]
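Broken into steps, the same idea looks like this (a sketch, using the ListeT from the question):

# 1. Sort by the second element in descending order, so that when the dict is
#    built, a smaller value overwrites a larger one for the same first element.
by_value_desc = sorted(ListeT, key=lambda x: x[1], reverse=True)

# 2. dict() keeps exactly one entry per first element (the last one seen,
#    i.e. the one with the smallest second element).
unique = dict(by_value_desc)

# 3. Re-sort the surviving pairs by their second element.
result = sorted(unique.items(), key=lambda x: x[1])
# [('a', 1), ('b', 1), ('x', 1), ('c', 5), ('e', 6), ('d', 7)]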

Sorting tuples in python and keeping the relative order

Input = [("M", 19), ("H", 19), ("A", 25)]
Output =[("A", 25), ("M" ,19), ("H", 19)]
It should sort alphabetically, but when the second values are equal, those tuples should keep their relative order.
Here M and H both have 19 as their value, so they are already in order.
IIUC, you can group the tuples by their second element. First gather them together using sorted on the second element, then collect them into lists with groupby on the second element. Since Python's sort is stable, this preserves the order you already have (and the sort may be unnecessary, depending on your data).
import itertools

Input = [('M', 19), ('H', 19), ('A', 25)]
sort1 = sorted(Input, key=lambda x: x[1])

grouped = []
for _, g in itertools.groupby(sort1, lambda x: x[1]):
    grouped.append(list(g))
Then sort these grouped lists based on the first letter of their first tuple, and finally flatten ("unlist") them.
sort2 = sorted(grouped, key=lambda x: x[0][0])
Output = [tup for sublist in sort2 for tup in sublist]
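With the example Input above, Output comes out as [('A', 25), ('M', 19), ('H', 19)], matching the expected result.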
You could group the items by the second value of each tuple using itertools.groupby, sort groupings by the first item in each group, then flatten the result with itertools.chain.from_iterable:
from operator import itemgetter
from itertools import groupby, chain
def relative_sort(Input):
    return list(
        chain.from_iterable(
            sorted(
                (
                    tuple(g)
                    for _, g in groupby(
                        sorted(Input, key=itemgetter(1)), key=itemgetter(1)
                    )
                ),
                key=itemgetter(0),
            )
        )
    )
Output:
>>> relative_sort([("M", 19), ("H", 19), ("A", 25)])
[('A', 25), ('M', 19), ('H', 19)]
>>> relative_sort([("B", 19), ("B", 25), ("M", 19), ("H", 19), ("A", 25)])
[('B', 19), ('M', 19), ('H', 19), ('B', 25), ('A', 25)]
>>> relative_sort([("A", 19), ("B", 25), ("M", 19), ("J", 30), ("H", 19)])
[('A', 19), ('M', 19), ('H', 19), ('B', 25), ('J', 30)]

Spark reduceByKey() to return a compound value

I am new to Spark and stumbled upon the following (probably simple) problem.
I have an RDD of key-value elements, each value being a (string, number) pair.
For instance, a key-value pair is ('A', ('02', 43)).
I want to reduce this RDD so that, when elements share the same key, only the element (key and the whole value) with the maximum number is kept.
reduceByKey() seems relevant, and I went with this MWE.
sc = spark.sparkContext
rdd = sc.parallelize([
    ('A', ('02', 43)),
    ('A', ('02', 36)),
    ('B', ('02', 306)),
    ('C', ('10', 185))])
rdd.reduceByKey(lambda a, b: max(a[1], b[1])).collect()
which produces
[('C', ('10', 185)), ('A', 43), ('B', ('02', 306))]
My problem here is that I would like to get:
[('C', ('10', 185)), ('A', ('02', 43)), ('B', ('02', 306))]
i.e., I don't see how to return ('A', ('02', 43)) rather than simply ('A', 43).
I found a solution to this simple problem myself: define a named function instead of an inline lambda for reduceByKey().
This is:
def max_compound(a, b):
    # return whichever whole value tuple has the larger number
    if max(a[1], b[1]) == a[1]:
        return a
    else:
        return b
and call:
rdd.reduceByKey(max_compound).collect()
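This should give the desired result, e.g. [('C', ('10', 185)), ('A', ('02', 43)), ('B', ('02', 306))] (ordering may differ). If you prefer to keep it inline, an equivalent lambda sketch would be:

rdd.reduceByKey(lambda a, b: a if a[1] >= b[1] else b).collect()
# e.g. [('C', ('10', 185)), ('A', ('02', 43)), ('B', ('02', 306))]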
The following code is in Scala; hopefully you can convert the same logic into PySpark.
val rdd = sparkSession.sparkContext.parallelize(Array(('A', (2, 43)), ('A', (2, 36)), ('B', (2, 306)), ('C', (10, 185))))
val rdd2 = rdd.reduceByKey((a, b) => (Math.max(a._1, b._1), Math.max(a._2, b._2)))
rdd2.collect().foreach(println)
output:
(B,(2,306))
(A,(2,43))
(C,(10,185))
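A rough PySpark equivalent of that Scala snippet (a sketch; note that it takes the element-wise maximum of both tuple positions, which is not quite the same as keeping one whole winning tuple):

rdd2 = rdd.reduceByKey(lambda a, b: (max(a[0], b[0]), max(a[1], b[1])))
rdd2.collect()
# e.g. [('B', ('02', 306)), ('A', ('02', 43)), ('C', ('10', 185))]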

How to re-order a list of tuples according to the order of the elements in another list

There are two lists:
list1 = [('A', 1), ('B', 2), ('C', 3)]
list2 = ['C', 'A', 'B']
How can I reorganize the tuples in list1 so that the first elements of all the tuples are in the same order as those in list2?
i.e., the expected result is
list1 = [('C', 3), ('A', 1), ('B', 2)]
You could use a list comprehension as follows:
d = dict(list1)
[(i,d.get(i)) for i in list2]
[('C', 3), ('A', 1), ('B', 2)]
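Note that d.get(i) returns None for any letter in list2 that has no match in list1; use d[i] instead if a missing key should raise a KeyError.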
