Join RDD and get min value - apache-spark

I have multiple RDDs and want to get the common words by joining them and taking the minimum count. So I join them with the code below:
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y).map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda (x,y,z) : (x,y) if y<=z else (x,z))
final = joined.collect()
print "Join RDD -> %s" % (final)
But this throws the error below:
TypeError: int() argument must be a string or a number, not 'tuple'
So I am passing a tuple where a number is expected, but I am not sure what is causing it. Any help is appreciated.

x.join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in C{self} and C{other}. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in C{self} and (k, v2) is in C{other}.
Therefore you have a tuple as the second element:
In [2]: x.join(y).collect()
Out[2]: [('spark', (1, 2)), ('hadoop', (4, 5))]
Solution:
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.map(lambda x: (x[0], min(x[1])))
final.collect()
>>> [('spark', 1), ('hadoop', 4)]
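If you prefer not to touch the keys at all, mapValues should give the same result (a minimal variant of the same idea, assuming the joined RDD from above):
final = joined.mapValues(min)  # min over the (v1, v2) tuple produced by join
final.collect()
>>> [('spark', 1), ('hadoop', 4)]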

Related

PySpark sort values

I have this data:
[(u'ab', u'cd'),
(u'ef', u'gh'),
(u'cd', u'ab'),
(u'ab', u'gh'),
(u'ab', u'cd')]
I would like to do a map-reduce on this data to find out how often the same pairs appear.
As a result I get:
[((u'ab', u'cd'), 2),
((u'cd', u'ab'), 1),
((u'ab', u'gh'), 1),
((u'ef', u'gh'), 1)]
As you can see, this is not quite right: (u'ab', u'cd') should be 3 instead of 2, because (u'cd', u'ab') is the same pair.
My question is: how can I make the program count (u'cd', u'ab') and (u'ab', u'cd') as the same pair? I was thinking about sorting the values in each row but could not find a solution for this.
You can sort the values then use reduceByKey to count the pairs:
rdd1 = rdd.map(lambda x: (tuple(sorted(x)), 1)) \
          .reduceByKey(lambda a, b: a + b)
rdd1.collect()
# [(('ab', 'gh'), 1), (('ef', 'gh'), 1), (('ab', 'cd'), 3)]
Alternatively, you can key by the sorted pair and count by key:
result = rdd.keyBy(lambda x: tuple(sorted(x))).countByKey()
print(result)
# defaultdict(<class 'int'>, {('ab', 'cd'): 3, ('ef', 'gh'): 1, ('ab', 'gh'): 1})
To convert the result into a list, you can do:
result2 = sorted(result.items())
print(result2)
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]
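Keep in mind that countByKey, unlike reduceByKey, returns the counts to the driver as a plain Python dict. If you need them back as an RDD for further distributed processing, a minimal sketch (reusing the result2 list from above) is:
counts_rdd = sc.parallelize(result2)  # back to an RDD of (pair, count) tuples
counts_rdd.collect()
# [(('ab', 'cd'), 3), (('ab', 'gh'), 1), (('ef', 'gh'), 1)]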

How to associate unique id with text in word counting with spark

I have an RDD that is populated as
id txt
1 A B C
2 A B C
1 A B C
The result of my word count (PySpark) should be keyed by the id combined with each word from the text. Example:
[(u'1_A',2), (u'1_B',2), (u'1_C',2),(u'2_A',1),(u'2_B',1),(u'2_C',1)]
I tried using a user-defined function to combine the id with the words split from the text. However, it complains that the append function is unavailable in this context.
I would appreciate any code samples that will set me in the right direction.
Here is an alternative solution using a PySpark DataFrame. The code uses explode and split to break the txt column into one row per word, then groupby and count to count the (id, word) pairs.
import pyspark.sql.functions as func
rdd = spark.sparkContext.parallelize([(1,'A B C'), (2, 'A B C'), (1,'A B C')])
df = rdd.toDF(['id', 'txt'])
df_agg = df.select('id', func.explode(func.split('txt', ' '))).\
    groupby(['id', 'col']).\
    count().\
    sort(['id', 'col'], ascending=True)
df_agg.rdd.map(lambda x: (str(x['id']) + '_' + x['col'], x['count'])).collect()
Output
[('1_A', 2), ('1_B', 2), ('1_C', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
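If you would rather stay in the DataFrame API instead of dropping back to the RDD for the final key, a sketch using concat_ws (assuming the same df_agg as above; the explicit cast of id to string is just for safety) could look like:
df_keyed = df_agg.select(func.concat_ws('_', func.col('id').cast('string'), 'col').alias('key'), 'count')
df_keyed.show()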
The snippet below should also work:
rdd = sc.parallelize([(1,'A B C'), (2, 'A B C'), (1,'A B C')])
result = rdd \
    .map(lambda x: (x[0], x[1].split(' '))) \
    .flatMap(lambda x: ['%s_%s' % (x[0], y) for y in x[1]]) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y)
result.collect()
Output
[('1_C', 2), ('1_B', 2), ('1_A', 2), ('2_A', 1), ('2_B', 1), ('2_C', 1)]
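The split and the key construction can also be fused into a single flatMap that emits (key, 1) pairs directly; a minimal equivalent sketch, assuming the same rdd as above:
result = rdd.flatMap(lambda x: [('%s_%s' % (x[0], w), 1) for w in x[1].split(' ')]) \
            .reduceByKey(lambda a, b: a + b)
result.collect()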

Print the content of ResultIterable object

How can I print the content of a pyspark.resultiterable.ResultIterable object that has a list of rows and columns?
Is there a built-in function for that?
I would like something like dataframe.show()
I was facing the same issue and solved it eventually, so let me share my way of doing it...
Let us assume we have two RDDs.
rdd1 = sc.parallelize([(1,'A'),(2,'B'),(3,'C')])
rdd2 = sc.parallelize([(1,'a'),(2,'b'),(3,'c')])
Let us cogroup those RDDs in order to get ResultIterable.
cogrouped = rdd1.cogroup(rdd2)
for t in cogrouped.collect():
    print t
>>
(1, (<pyspark.resultiterable.ResultIterable object at 0x107c49450>, <pyspark.resultiterable.ResultIterable object at 0x107c95690>))
(2, (<pyspark.resultiterable.ResultIterable object at 0x107c95710>, <pyspark.resultiterable.ResultIterable object at 0x107c95790>))
(3, (<pyspark.resultiterable.ResultIterable object at 0x107c957d0>, <pyspark.resultiterable.ResultIterable object at 0x107c95810>))
Now we want to see what is inside of those ResultIterables.
We can do it like this:
def iterate(iterable):
    r = []
    for v1_iterable in iterable:
        for v2 in v1_iterable:
            r.append(v2)
    return tuple(r)

x = cogrouped.mapValues(iterate)
for e in x.collect():
    print e
or like this
def iterate2(iterable):
    r = []
    for x in iterable.__iter__():
        for y in x.__iter__():
            r.append(y)
    return tuple(r)

y = cogrouped.mapValues(iterate2)
for e in y.collect():
    print e
In both cases we will get the same result:
(1, ('A', 'a'))
(2, ('B', 'b'))
(3, ('C', 'c'))
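For reference, the same unpacking can be written as a one-line mapValues with a nested generator (a compact equivalent of iterate above):
z = cogrouped.mapValues(lambda v: tuple(e for it in v for e in it))
for e in z.collect():
    print e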
Hopefully, this will help somebody in the future.

PySpark cogroup on multiple keys

Given two lists, I want to group them based on the co-occurrence of the first two keys:
x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
desired output:
z=[(1,2,('cat','hairBall')),(4,5,('dog','woof'))]
what I have tried so far:
sc=SparkContext()
xs=sc.parallelize(x)
ys=sc.parallelize(y)
zs_temp=xs.cogroup(ys)
this results in:
zs_temp.collect()=[(1, [[(2, 'cat')], [(2, 'hairBall')]]), (4, [[(5, 'dog')], [(5, 'woof')]])]
attempted solution:
zs_temp.map(lambda f: f[1].cogroup(f[1]) ).collect()
but I get the error:
AttributeError: 'tuple' object has no attribute 'cogroup'
Test Data:
x=[(1,(2,'cat')),(4,(5,'dog'))]
y=[(1,(2,'hairBall')),(4,(5,'woof'))]
xs=sc.parallelize(x)
ys=sc.parallelize(y)
A function to change the keys:
def reKey(r):
    return ((r[0], r[1][0]), r[1][1])
Change the keys:
xs2 = xs.map(reKey)
ys2 = ys.map(reKey)
Join the data and collect the results:
results = ys2.join(xs2)
results.collect()
[((1, 2), ('hairBall', 'cat')), ((4, 5), ('woof', 'dog'))]
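If you need the exact shape from the question, i.e. (1, 2, ('cat', 'hairBall')), one more map reshapes the joined pairs; a sketch, joining xs2 with ys2 so the value order matches the desired output:
z = xs2.join(ys2).map(lambda kv: (kv[0][0], kv[0][1], kv[1]))
z.collect()
# [(1, 2, ('cat', 'hairBall')), (4, 5, ('dog', 'woof'))]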

Spark Multiple Joins

Using a Spark context, I would like to perform multiple joins between RDDs, where the number of RDDs to be joined is dynamic.
I would like the result to be unfolded, for example:
val rdd1 = sc.parallelize(List((1,1.0),(11,11.0), (111,111.0)))
val rdd2 = sc.parallelize(List((1,2.0),(11,12.0), (111,112.0)))
val rdd3 = sc.parallelize(List((1,3.0),(11,13.0), (111,113.0)))
val rdd4 = sc.parallelize(List((1,4.0),(11,14.0), (111,114.0)))
val rdd11 = rdd1.join(rdd2).join(rdd3).join(rdd4)
  .foreach(println)
generates the following output:
(11,(((11.0,12.0),13.0),14.0))
(111,(((111.0,112.0),113.0),114.0))
(1,(((1.0,2.0),3.0),4.0))
I would like to:
1. Unfold the values, e.g. the first line should read (11, 11.0, 12.0, 13.0, 14.0).
2. Do it dynamically, so that it works on any number of RDDs to be joined.
Any ideas would be appreciated,
Eli.
Instead of using join, I would use union followed by groupByKey to achieve what you desire.
Here is what I would do -
val emptyRdd = sc.emptyRDD[(Int, Double)]
val listRdds = List(rdd1, rdd2, rdd3, rdd4) // satisfy your dynamic number of rdds
val unioned = listRdds.fold(emptyRdd)(_.union(_))
val grouped = unioned.groupByKey
grouped.collect().foreach(println(_))
This will yield the result:
(1,CompactBuffer(1.0, 2.0, 3.0, 4.0))
(11,CompactBuffer(11.0, 12.0, 13.0, 14.0))
(111,CompactBuffer(111.0, 112.0, 113.0, 114.0))
Updated:
If you would still like to use join, this is how to do it with a somewhat more complicated foldLeft -
val joined = listRdds match {
  case head :: tail => tail.foldLeft(head.mapValues(Array(_)))(_.join(_).mapValues {
    case (arr: Array[Double], d: Double) => arr :+ d
  })
  case Nil => sc.emptyRDD[(Int, Array[Double])]
}
And joined.collect will yield
res14: Array[(Int, Array[Double])] = Array((1,Array(1.0, 2.0, 3.0, 4.0)), (11,Array(11.0, 12.0, 13.0, 14.0)), (111,Array(111.0, 112.0, 113.0, 114.0)))
Others with this problem may find groupWith helpful. From the docs:
>>> w = sc.parallelize([("a", 5), ("b", 6)])
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> z = sc.parallelize([("b", 42)])
>>> [(x, tuple(map(list, y))) for x, y in sorted(list(w.groupWith(x, y, z).collect()))]
[('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
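To get the unfolded shape the original question asks for (the key followed by all values in one flat tuple), a PySpark sketch on top of groupWith, assuming the w, x, y, z RDDs from the docs example above, could be:
flat = w.groupWith(x, y, z).map(lambda kv: (kv[0],) + tuple(v for it in kv[1] for v in it))
sorted(flat.collect())
# [('a', 5, 1, 2), ('b', 6, 4, 42)]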
