PySpark cogroup on multiple keys - apache-spark

Given two lists, I want to group them based on the co-occurrence of the first two keys:
x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
desired output:
z=[(1,2,('cat','hairBall')),(4,5,('dog','woof'))]
what I have tried so far:
sc=SparkContext()
xs=sc.parallelize(x)
ys=sc.parallelize(y)
zs_temp=xs.cogroup(ys)
this results in:
zs_temp.collect()=[(1, [[(2, 'cat')], [(2, 'hairBall')]]), (4, [[(5, 'dog')], [(5, 'woof')]])]
attempted solution:
zs_temp.map(lambda f: f[1].cogroup(f[1]) ).collect()
but I get the error:
AttributeError: 'tuple' object has no attribute 'cogroup'

Test Data:
x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
xs=sc.parallelize(x)
ys=sc.parallelize(y)
Function to change the keys
def reKey(r):
    return ((r[0], r[1][0]), r[1][1])
Change keys
xs2 = xs.map(reKey)
ys2 = ys.map(reKey)
Join Data, collect results
results = ys2.join(xs2)
results.collect()
[((1, 2), ('hairBall', 'cat')), ((4, 5), ('woof', 'dog'))]
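If you want the exact shape asked for in the question (a flat key with the x value first), one option is to join in the other order and unpack the composite key. This is a minimal sketch reusing the xs2/ys2 RDDs defined above; the order of collect() output is not guaranteed:
flat = xs2.join(ys2).map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
flat.collect()
[(1, 2, ('cat', 'hairBall')), (4, 5, ('dog', 'woof'))]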

Related

PySpark sort values

I have a data:
[(u'ab', u'cd'),
(u'ef', u'gh'),
(u'cd', u'ab'),
(u'ab', u'gh'),
(u'ab', u'cd')]
I would like to do a mapreduce on this data to find out how often the same pairs appear.
As a result I get:
[((u'ab', u'cd'), 2),
((u'cd', u'ab'), 1),
((u'ab', u'gh'), 1),
((u'ef', u'gh'), 1)]
As you can see it is not quite right, as (u'ab', u'cd') should be 3 instead of 2 because (u'cd', u'ab') is the same pair.
My question is: how can I make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair? I was thinking about sorting the values in each row but could not find a solution for this.
You can sort the values then use reduceByKey to count the pairs:
rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1)) \
          .reduceByKey(lambda a, b: a + b)
rdd1.collect()
# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]
Alternatively, you can key by the sorted pair and count by key:
result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()
print(result)
# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})
To convert the result into a list, you can do:
result2 = sorted(result.items())
print(result2)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]
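Note that countByKey returns a plain Python dict to the driver, so it is best suited to small result sets; the reduceByKey version above keeps the counts in an RDD. A small sketch to get the same sorted list from the reduceByKey result (reusing rdd1 from above):
result3 = sorted(rdd1.collect())
print(result3)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]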

Not getting the right combination using itertools

I am trying to use itertools to get combinations, but I am not getting them in the format that I want.
import itertools
dict = {(1,):1, (2,):3, (3,):1}
combo = list(itertools.combinations(dict.keys(),2))
print(combo)
Output:
[((1,), (2,)), ((1,), (3,)), ((2,), (3,))]
Output I want:
[(1, 2), (1, 3), (2, 3)]
The output is expected as your dict keys are tuples. You may try the following:
>>> combo = list(itertools.combinations((key[0] for key in dict.keys()), 2))
>>> print(combo)
[(1, 2), (1, 3), (2, 3)]
This will extract the first ([0]) element from each dict key.
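If you prefer a spelled-out version, an equivalent sketch first flattens the keys into a plain list (same dict as above; note that naming a variable dict shadows the builtin):
keys = [k[0] for k in dict]  # unpack each single-element tuple key
combo = list(itertools.combinations(keys, 2))
print(combo)
# [(1, 2), (1, 3), (2, 3)]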

Join RDD and get min value

I have multiple RDDs and want to get the common words by joining them and taking the minimum count. So I join them with the code below:
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y).map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda (x,y,z) : (x,y) if y<=z else (x,z))
final = joined.collect()
print "Join RDD -> %s" % (final)
But this throws below error:
TypeError: int() argument must be a string or a number, not 'tuple'
So I am passing a tuple instead of a number. Not sure what is causing it. Any help is appreciated.
x.join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
Therefore you have a tuple as second element:
In [2]: x.join(y).collect()
Out[2]: [('spark', (1, 2)), ('hadoop', (4, 5))]
Solution:
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.map(lambda x: (x[0], min(x[1])))
final.collect()
# [('spark', 1), ('hadoop', 4)]
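If there are more than two RDDs and you only want words common to all of them, you can chain joins and take the minimum as you go. A minimal sketch with a hypothetical third RDD z:
z = sc.parallelize([("spark", 7), ("hadoop", 3)])
common_min = x.join(y).mapValues(min).join(z).mapValues(min)
common_min.collect()
# [('spark', 1), ('hadoop', 3)]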

Print the content of ResultIterable object

How can I print the content of a pyspark.resultiterable.ResultIterable object that has a list of rows and columns?
Is there a built-in function for that?
I would like something like dataframe.show()
I was facing the same issue and solved it eventually, so let me share my way of doing it...
Let us assume we have two RDDs.
rdd1 = sc.parallelize([(1,'A'),(2,'B'),(3,'C')])
rdd2 = sc.parallelize([(1,'a'),(2,'b'),(3,'c')])
Let us cogroup those RDDs in order to get ResultIterables.
cogrouped = rdd1.cogroup(rdd2)
for t in cogrouped.collect():
    print t
>>
(1, (<pyspark.resultiterable.ResultIterable object at 0x107c49450>, <pyspark.resultiterable.ResultIterable object at 0x107c95690>))
(2, (<pyspark.resultiterable.ResultIterable object at 0x107c95710>, <pyspark.resultiterable.ResultIterable object at 0x107c95790>))
(3, (<pyspark.resultiterable.ResultIterable object at 0x107c957d0>, <pyspark.resultiterable.ResultIterable object at 0x107c95810>))
Now we want to see what is inside of those ResultIterables.
We can do it like this:
def iterate(iterable):
    r = []
    for v1_iterable in iterable:
        for v2 in v1_iterable:
            r.append(v2)
    return tuple(r)
x = cogrouped.mapValues(iterate)
for e in x.collect():
    print e
or like this
def iterate2(iterable):
    r = []
    for x in iterable.__iter__():
        for y in x.__iter__():
            r.append(y)
    return tuple(r)
y = cogrouped.mapValues(iterate2)
for e in y.collect():
    print e
In both cases we will get the same result:
(1, ('A', 'a'))
(2, ('B', 'b'))
(3, ('C', 'c'))
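A more compact equivalent (a sketch on the same cogrouped RDD, converting each ResultIterable to a tuple inline with a nested comprehension):
z = cogrouped.mapValues(lambda vs: tuple(v for it in vs for v in it))
for e in z.collect():
    print e
which produces the same three tuples as above.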
Hopefully, this will help somebody in future.

How to add a pair to a dictionary in python?

I created the following dictionary: d = {}
I add to it a new key and a pair of integers as the corresponding value: d['Key1'] = (1, 2)
If I print the dictionary, I get:
{ 'Key1': (1, 2) }
How can I add more pairs of integers under the same key, in order to have something like
{ 'Key1': (1, 2), (1, 3), (1, 4) }
Is it possible to do this in Python? If yes, how can I do it?
Can I also count the number of values that correspond to a specific key?
Yes, it's possible, but you need to use a container for your values.
For example, you can use a list. The dict.setdefault method is handy for the assignment when a key will hold multiple values, and it also works when you want to add another key with multiple values:
>>> d = {}
>>> d.setdefault('Key1',[]).append((1,2))
>>> d
{'Key1': [(1, 2)]}
>>> d['Key1'].append((1, 3))
>>> d
{'Key1': [(1, 2), (1, 3)]}
>>> d.setdefault('Key2',[]).append((4,2))
>>> d
{'Key2': [(4, 2)], 'Key1': [(1, 2), (1, 3)]}
setdefault(key[, default])
If key is in the dictionary, return its value. If not, insert key with a value of default and return default. default defaults to None.
That's only possible if you make the initial entry a list.
D['key'] = []
D['key'].append((1,2))
D['key'].append((3,4))
{'key': [(1, 2), (3, 4)]}
You can't do it directly; in other words, a dict can't hold multiple values associated with the same key. You would need to have a dict of lists.
If the set of keys isn't known in advance, you might want to use a defaultdict, which will automatically create a list (or whatever) each time you access a key which doesn't already exist in the dictionary.
import collections
d = collections.defaultdict(list)
To add elements, you would use
d['Key1'].append((1, 2))
instead of
d['Key1'] = (1, 2)
This lets you avoid writing special code to handle the first insertion with a given key.
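For the follow-up question about counting, once the values for a key live in a list you can simply take its length; a minimal sketch with the defaultdict above:
d['Key1'].append((1, 2))
d['Key1'].append((1, 3))
print(len(d['Key1']))
# 2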
