Spark Multiple Joins - apache-spark

Using the Spark context, I would like to perform multiple joins between
RDDs, where the number of RDDs to be joined should be dynamic.
I would like the result to be unfolded, for example:
val rdd1 = sc.parallelize(List((1,1.0),(11,11.0), (111,111.0)))
val rdd2 = sc.parallelize(List((1,2.0),(11,12.0), (111,112.0)))
val rdd3 = sc.parallelize(List((1,3.0),(11,13.0), (111,113.0)))
val rdd4 = sc.parallelize(List((1,4.0),(11,14.0), (111,114.0)))
val rdd11 = rdd1.join(rdd2).join(rdd3).join(rdd4)
rdd11.foreach(println)
This generates the following output:
(11,(((11.0,12.0),13.0),14.0))
(111,(((111.0,112.0),113.0),114.0))
(1,(((1.0,2.0),3.0),4.0))
I would like to:
Unfold the values, e.g. the first line should read:
(11, 11.0, 12.0, 13.0, 14.0).
Do it dynamically so that it works for any number of RDDs to be joined.
Any ideas would be appreciated,
Eli.

Instead of using join, I would use union followed by groupByKey to achieve what you desire.
Here is what I would do -
val emptyRdd = sc.emptyRDD[(Int, Double)]
val listRdds = List(rdd1, rdd2, rdd3, rdd4) // build this list for your dynamic number of RDDs
val unioned = listRdds.fold(emptyRdd)(_.union(_))
val grouped = unioned.groupByKey
grouped.collect().foreach(println(_))
This yields the following result:
(1,CompactBuffer(1.0, 2.0, 3.0, 4.0))
(11,CompactBuffer(11.0, 12.0, 13.0, 14.0))
(111,CompactBuffer(111.0, 112.0, 113.0, 114.0))
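If you also want the flat form asked for in the question, the grouped values can be stitched back together per key. A minimal sketch using the grouped RDD above (note that groupByKey does not guarantee the order of the values per key, whereas the chained joins do):
// Format each key and its values as one flat record.
val unfolded = grouped.map { case (k, values) =>
  (k +: values.toSeq).mkString("(", ", ", ")")
}
unfolded.collect().foreach(println)
// e.g. (11, 11.0, 12.0, 13.0, 14.0)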
Updated:
If you would still like to use join, here is how to do it with a somewhat more involved foldLeft:
val joined = listRdds match {
  case head :: tail =>
    tail.foldLeft(head.mapValues(Array(_)))(_.join(_).mapValues {
      case (arr: Array[Double], d: Double) => arr :+ d
    })
  case Nil => sc.emptyRDD[(Int, Array[Double])]
}
And joined.collect will yield
res14: Array[(Int, Array[Double])] = Array((1,Array(1.0, 2.0, 3.0, 4.0)), (11,Array(11.0, 12.0, 13.0, 14.0)), (111,Array(111.0, 112.0, 113.0, 114.0)))

Others with this problem may find groupWith helpful. From the docs:
>>> w = sc.parallelize([("a", 5), ("b", 6)])
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> z = sc.parallelize([("b", 42)])
>>> [(x, tuple(map(list, y))) for x, y in sorted(list(w.groupWith(x, y, z).collect()))]
[('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
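In Scala the same operation is available as cogroup (groupWith is an alias for it), and it accepts up to three other RDDs. A rough sketch with rdd1..rdd4 from the question:
// One Iterable per input RDD; keys missing from an input get an empty Iterable.
val cogrouped = rdd1.cogroup(rdd2, rdd3, rdd4)
cogrouped.collect().foreach(println)
// e.g. (1,(CompactBuffer(1.0),CompactBuffer(2.0),CompactBuffer(3.0),CompactBuffer(4.0)))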

Related

Spark RDD find ratio for key-value pairs

My rdd contains key-value pairs such as this:
(key1, 5),
(key2, 10),
(key3, 20),
I want to perform a map operation that associates each key with its ratio of the total across the entire RDD, like this:
(key1, 5/35),
(key2, 10/35),
(key3, 20/35),
I am struggling to find a way to do this using standard functions; any help would be appreciated.
You can calculate the sum and divide each value by the sum:
from operator import add
rdd = sc.parallelize([('key1', 5), ('key2', 10), ('key3', 20)])
total = rdd.values().reduce(add)
rdd2 = rdd.mapValues(lambda x: x/total)
rdd2.collect()
# [('key1', 0.14285714285714285), ('key2', 0.2857142857142857), ('key3', 0.5714285714285714)]
In Scala it would be
val rdd = sc.parallelize(List(("key1", 5), ("key2", 10), ("key3", 20)))
val total = rdd.values.reduce(_+_)
val rdd2 = rdd.mapValues(1.0*_/total)
rdd2.collect
// Array[(String, Double)] = Array((key1,0.14285714285714285), (key2,0.2857142857142857), (key3,0.5714285714285714))
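One small note: the RDD is traversed twice (once for the sum, once for mapValues), so if it is expensive to recompute it may be worth caching it first. A sketch on the Scala version above:
val cached = rdd.cache()                          // keep the RDD in memory across both passes
val grandTotal = cached.values.reduce(_ + _)
val ratios = cached.mapValues(_.toDouble / grandTotal)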

Join RDD and get min value

I have multiple RDDs and want to get the common words by joining them and getting the minimum count. So I join and compute it with the code below:
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y).map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda (x,y,z) : (x,y) if y<=z else (x,z))
final = joined.collect()
print "Join RDD -> %s" % (final)
But this throws below error:
TypeError: int() argument must be a string or a number, not 'tuple'
So I am passing a tuple instead of a number. I am not sure what is causing it. Any help is appreciated.
x.join(other, numPartitions=None): Return an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
Therefore you have a tuple as the second element:
In [2]: x.join(y).collect()
Out[2]: [('spark', (1, 2)), ('hadoop', (4, 5))]
Solution :
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.map(lambda x: (x[0], min(x[1])))
final.collect()
# [('spark', 1), ('hadoop', 4)]
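If there are more than two RDDs to intersect, the joins can be chained and the minimum taken over the flattened counts. A sketch in Scala (z is a made-up third RDD, purely for illustration):
val x = sc.parallelize(Seq(("spark", 1), ("hadoop", 4)))
val y = sc.parallelize(Seq(("spark", 2), ("hadoop", 5)))
val z = sc.parallelize(Seq(("spark", 7), ("hadoop", 3))) // hypothetical third RDD
// Only keys present in all three RDDs survive the chained joins.
val minCounts = x.join(y).join(z).mapValues { case ((a, b), c) => Seq(a, b, c).min }
minCounts.collect().foreach(println)
// (spark,1)
// (hadoop,3)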

Efficient way to reduceByKey ignoring keys not in another RDD?

I have a large collection of data in pyspark. The format is key-value pairs, and I need to do a reduceByKey operation, but ignore all data whose key isn't in an RDD of 'interesting' keys that I also have.
I found a solution on SO that uses the subtractByKey operation to achieve this. It works, but crashes due to low memory on my cluster. I have not been able to fix this by tweaking the settings, so I'm hoping there's a more efficient solution.
Here's my solution that works on smaller datasets:
# The keys I'm interested in
edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "d")])
# Data containing both interesting and uninteresting stuff
data1 = sc.parallelize([(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])])
data2 = sc.parallelize([(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])])
all_data = [data1, data2]
mask = edges.map(lambda t: (tuple(t), None))
rdds = []
for datum in all_data:
    combined = datum.reduceByKey(lambda a, b: a + b)
    unwanted = combined.subtractByKey(mask)
    wanted = combined.subtractByKey(unwanted)
    rdds.append(wanted)
edge_alltimes = sc.union(rdds).reduceByKey(lambda a,b: a+b)
edge_alltimes.collect()
As desired, this outputs [(('a', 'd'), [13, 37]), (('a', 'b'), [42, 43]), (('b', 'c'), [23, 24])]
(i.e. data for the 'interesting' key tuples have been combined and the rest has been dropped).
The reason I have the data in several RDDs is to mimic behavior on my cluster where I can't load all the data simultaneously due to its size.
Any help would be great.
Here is an example with join. A small drawback is that you need an RDD of pairs before the join, and you need to strip the extra data after it.
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val goodKeys = sc.parallelize(Seq(1, 2))
    val allData = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val goodPairs = goodKeys.map(v => (v, 0))
    val goodData = allData.join(goodPairs).mapValues(p => p._1)
    goodData.collect().foreach(println)
  }
}
Output:
(1,a)
(2,b)
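If the set of interesting keys is small enough to collect on the driver, another option (not part of the original answer, just a sketch) is to broadcast it and filter before reducing, which avoids shuffling the large RDD for a join:
// Assumes goodKeys and allData from the example above.
val keySet = sc.broadcast(goodKeys.collect().toSet)
val goodDataNoJoin = allData.filter { case (k, _) => keySet.value.contains(k) }
goodDataNoJoin.collect().foreach(println)
// (1,a)
// (2,b)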

Checking if an RDD(K,V) value is contained in another RDD(K,V) value

I have two RDD[(K, V)]s; in Spark, nesting one map inside another is not allowed.
val x = sc.parallelize(List((1,"abc"),(2,"efg")))
val y = sc.parallelize(List((1,"ab"),(2,"ef"), (3,"tag")))
I want to check whether "abc" contains "ab", regardless of how large the RDDs are.
Assuming you want to select a value from RDD x when a substring of it is present in RDD y, this code should work.
import org.apache.spark.rdd.RDD

def main(args: Array[String]): Unit = {
  val x = spark.sparkContext.parallelize(List((1, "abc"), (2, "efg")))
  val y = spark.sparkContext.parallelize(List((1, "ab"), (2, "ef"), (3, "tag")))

  // This RDD is filtered: we select elements from x only if a substring of
  // the value is present in the RDD y.
  val filteredRDD = filterRDD(x, y)

  // Now we map the filteredRDD to our result list.
  val resultArray = filteredRDD.map(x => x._2).collect()
}

def filterRDD(x: RDD[(Int, String)], y: RDD[(Int, String)]): RDD[(Int, String)] = {
  // Broadcast the y RDD to all Spark nodes, since we collect it beforehand.
  // The reason we collect the y RDD is to avoid calling collect in the filter logic.
  val y_bc = spark.sparkContext.broadcast(y.collect.toSet)
  x.filter(m => y_bc.value.exists(n => m._2.contains(n._2)))
}
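If y were too large to collect and broadcast, a fallback (much more expensive, sketched here only for completeness) is a cartesian product followed by a filter:
// Assumes x and y as defined above; keeps each x pair whose value contains some y value.
val matches = x.cartesian(y)
  .filter { case ((_, xv), (_, yv)) => xv.contains(yv) }
  .map { case (xPair, _) => xPair }
  .distinct()
matches.collect().foreach(println)
// (1,abc)
// (2,efg)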

Spark zipPartitions on the same RDD

I'm quite a newbie with Spark and I have some problems doing something like a cartesian, but only within the same partition. Maybe an example can show clearly what I want to do: let's suppose we have an RDD made with sc.parallelize(1,2,3,4,5,6) and this RDD is partitioned into three partitions which contain, respectively: (1,2) ; (3,4) ; (5,6). Then I would like to obtain the following result: ((1,1),(1,2),(2,1),(2,2)) ; ((3,3),(3,4),(4,3),(4,4)) ; ((5,5),(5,6),(6,5),(6,6)).
What I have tried so far is doing:
partitionedData.zipPartitions(partitionedData)((aiter, biter) => {
  var res = new ListBuffer[(Double, Double)]()
  while (aiter.hasNext) {
    val a = aiter.next()
    while (biter.hasNext) {
      val b = biter.next()
      res += ((a, b))
    }
  }
  res.iterator
})
but it doesn't work as aiter and biter are the same iterator...so I get only the first line of the result.
Can someone help me?
Thanks.
Use RDD.mapPartitions:
val rdd = sc.parallelize(1 to 6, 3)
val res = rdd.mapPartitions { iter =>
  val seq = iter.toSeq
  val res = for (a <- seq; b <- seq) yield (a, b)
  res.iterator
}
res.collect
Prints:
res0: Array[(Int, Int)] = Array((1,1), (1,2), (2,1), (2,2), (3,3), (3,4), (4,3), (4,4), (5,5), (5,6), (6,5), (6,6))
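If you specifically want to keep zipPartitions, the same idea works as long as one side is materialized before the nested loop, so its iterator is not exhausted after the first outer element. A sketch, assuming the rdd defined above:
val selfZipped = rdd.zipPartitions(rdd) { (aiter, biter) =>
  val bs = biter.toSeq                      // materialize one side once
  aiter.flatMap(a => bs.map(b => (a, b)))   // pair every a with every b in the partition
}
selfZipped.collect()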
