How to join two RDDs in Spark with Python? - apache-spark

Suppose
rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).
Want to generate
( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).
Are there any easy methods?
I think this is different from a cross join, but I can't find a good solution.
My solution is
(rdd1
 .cartesian(rdd2)
 .filter(lambda kv: kv[0][0] == kv[1][0])
 .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))

You are just looking for a simple join, e.g.
rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]

Related

How to add a new element in an RDD

I have a Spark pair RDD of (key, count) as below:
Array[(String, Int)] = Array((a,1), (b,2), (c,3), (d,4))
I want to add a new max element to the RDD:
Array[(String, Int)] = Array((a,1,4), (b,2,4), (c,3,4), (d,4,4))
In the definition, you are saying:
(a, 1) -> (a, 1, 4)
(b, 2) -> (b, 2, 4)
(c, 1) -> (c, 3, 4) where is the 3 coming from now?
(d, 3) -> (d, 4, 4) where is the 4 coming from now?
If your new max is the maximum value in the RDD plus one, you can sort descending by value and take the value of the first element:
val max = df1.sortBy(_._2, ascending = false).first()._2 + 1
val df2 = df1.map(r => (r._1, r._2, max))
This gives:
(a,1,4)
(b,2,4)
(c,1,4)
(d,3,4)
which should be what you want.
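A more direct alternative (a sketch, assuming df1 is the same RDD[(String, Int)] as above) is to take the maximum of the values instead of sorting the whole RDD:
// Sketch: compute the max of the values directly rather than sorting the RDD
val max = df1.map(_._2).max() + 1
val df2 = df1.map(r => (r._1, r._2, max))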

Apache Spark: GraphFrame and random walk implementation

Suppose we have a directed graph where each node has a boolean flag.
The main goal is to "jump" from node to node until we find a node with flag=true, within a limited (configurable) number of steps, for each vertex.
P.S. It's okay to go back to the initial node.
For example:
Vertex A (flag=false) -> [B, C, D, G]
Vertex B (flag=false) -> [A, E, F, G]
Vertex C (flag=false) -> [A, E, G]
Vertex D (flag=false) -> [A, F, G]
Vertex E (flag=true) -> [B, C, G]
Vertex F (flag=false) -> [B, D]
Vertex G (flag=false) -> [A, B, C, D, E]
Vertex H (flag=true) -> [G]
stepsCount = 10
Algorithm:
vertex A has flag=false, so we randomly select one of its neighbours, say node B, which also has flag=false
here we again randomly select the next node, let's say node F, where flag=false
we keep "jumping" to the next node while stepsCount < 10
if no node with flag=true was reached, mark vertex A with 0, else with 1
// Vertex DataFrame
val v = spark.createDataFrame(List(
("A", false),
("B", false),
("C", false),
("D", false),
("E", true),
("F", false),
("G", false),
("H", true)
)).toDF("id", "flag")
// Edge DataFrame
val e = spark.createDataFrame(List(
("A", "B", 0),
("B", "A", 0),
("A", "C", 0),
("C", "A", 0),
("A", "D", 0),
("D", "A", 0),
("F", "B", 0),
("E", "B", 0),
("F", "D", 0),
("D", "F", 0),
("F", "D", 0),
("A", "G", 0),
("B", "G", 0),
("C", "G", 0),
("D", "G", 0),
("E", "G", 0),
("G", "H", 0)
)).toDF("src", "dst", "mark")
val g = GraphFrame(v, e)
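A minimal sketch of the walk described above, assuming the graph is small enough to collect to the driver and using plain Scala rather than GraphFrame motifs or Pregel (adjacency, flags, and walkFrom are illustrative names, not part of the GraphFrames API):
// Sketch only: collect edges and flags to the driver and simulate the walk locally.
import scala.util.Random
val adjacency: Map[String, Seq[String]] = e.select("src", "dst")
  .collect()
  .map(r => (r.getString(0), r.getString(1)))
  .groupBy(_._1)
  .mapValues(_.map(_._2).toSeq)
  .toMap
val flags: Map[String, Boolean] = v.collect()
  .map(r => (r.getString(0), r.getBoolean(1)))
  .toMap
// Returns 1 if a node with flag=true is reached within stepsCount random jumps, else 0.
def walkFrom(start: String, stepsCount: Int): Int = {
  var current = start
  var steps = 0
  while (steps < stepsCount) {
    val neighbours = adjacency.getOrElse(current, Seq.empty)
    if (neighbours.isEmpty) return 0
    current = neighbours(Random.nextInt(neighbours.size))
    if (flags.getOrElse(current, false)) return 1
    steps += 1
  }
  0
}
val result = flags.keys.toSeq.map(id => (id, walkFrom(id, stepsCount = 10)))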

Haskell filtering data

I want to be able to filter the data below to find specific items. For example, if I wanted to find only the apple entries, the output would look similar to this: [("apple","crate",6),("apple","box",3)]
fruit :: [(String, String, Int)]
fruit = [("apple", "crate", 6), ("pear", "crate", 5), ("mango", "box", 4),
("apple", "box", 3), ("banana", "box", 5), ("pear", "box", 10), ("apricot",
"box", 4), ("peach", "box", 5), ("walnut", "box", 4), ("blueberry", "tray", 10),
("blackberry", "tray", 4), ("watermelon", "piece", 8), ("marrow", "piece", 7),
("hazelnut", "sack", 2), ("walnut", "sack", 4)]
first :: (a, b, c) -> a
first (x, _, _) = x
second :: (a, b, c) -> b
second (_, y, _) = y
third :: (a, b, c) -> c
third (_, _, z) = z
A couple of alternatives:
filter ((=="apple") . first) fruit
[ f | f@("apple",_,_) <- fruit ]
The first one exploits your first projection, checking whether its result is equal to "apple".
The second one instead exploits list comprehensions, where elements that fail to pattern match are discarded.
Perhaps an even more basic approach is using a lambda abstraction and equality.
filter (\(s,_,_) -> s == "apple") fruit

Group an RDD by key in java

I am trying to group an RDD using groupBy. Most of the docs suggest not to use groupBy because of how it works internally to group keys. Is there another way to achieve this? I cannot use reduceByKey because I am not doing a reduction operation here.
Ex-
Entry - long id, string name;
JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
.flatMap(x -> someOp(x))
.values()
.filter()
aggregateByKey [Pair]
Works like the aggregate function except the aggregation is applied to the values with the same key. Also unlike the aggregate function the initial value is not applied to the second reduce.
Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example :
val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// lets have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
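Since the question says there is no reduction to apply, the same operation can also just collect all values per key into a collection; a sketch, reusing the pairRDD from the example above:
// Sketch: accumulate the values per key into a List instead of reducing them
val grouped = pairRDD.aggregateByKey(List.empty[Int])(
  (acc, v) => v :: acc,   // seqOp: prepend a value to the per-partition list
  (l1, l2) => l1 ::: l2   // combOp: concatenate lists from different partitions
)
grouped.collect
// e.g. Array((dog,List(12)), (cat,List(5, 2, 12)), (mouse,List(4, 2))); ordering within the lists may vary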

Recursive function

Let S be the subset of the set of ordered pairs of integers defined recursively by
Basis step: (0, 0) ∈ S.
Recursive step: If (a,b) ∈ S,
then (a,b + 1) ∈ S, (a + 1, b + 1) ∈ S, and (a + 2, b + 1) ∈ S.
List the elements of S produced by the first four applications of the recursive definition.
def subset(a,b):
    base=[]
    if base == []:
        base.append((a,b))
        return base
    elif (a,b) in base:
        base.append(subset(a,b+1))
        base.append(subset(a+1,b+1))
        base.append(subset(a+2,b+1))
        return base

for number in range(0,5):
    for number2 in range(0,5):
        print(*subset(number,number2))
The output is
(0, 0)
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(1, 0)
(1, 1)
(1, 2)
(1, 3)
(1, 4)
(2, 0)
(2, 1)
(2, 2)
(2, 3)
(2, 4)
(3, 0)
(3, 1)
(3, 2)
(3, 3)
(3, 4)
(4, 0)
(4, 1)
(4, 2)
(4, 3)
(4, 4)
But the correct answer is more than what I got.
(0, 1), (1, 1), and (2, 1) are all in S. If we apply the recursive step to these we add (0, 2), (1, 2), (2, 2), (3, 2), and (4, 2). The next round gives us (0, 3), (1, 3), (2, 3), (3, 3), (4, 3), (5, 3), and (6, 3). And a fourth set of applications adds (0,4), (1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (7,4), and (8,4).
What did I do wrong with my code?
The problem with your code is that base is reset to an empty list on every call, so the elif branch is never reached and no recursion ever happens; each call just returns [(a, b)]. Is this what you want? Based on the result you wanted:
def subset(a):
    # Returns a list that contains all (i, a + 1) from i = 0 to i = (a + 1) * 2
    return [(i, a + 1) for i in range((a + 1) * 2 + 1)]

setList = []
for i in range(4):
    setList += subset(i)  # Fill a list with subset results for i = 0 -> 3
print(setList)
If you want to use a recursive function, you can do that too:
def subset(a, b):
    if a < b * 2:
        # Returns a list that contains (a, b) and the unzipped result of subset(a + 1, b)
        # Otherwise it would add a list in the list
        return [(a, b), *subset(a + 1, b)]
    else:
        return [(a, b)]  # If deepest element of recursion, just return [(a, b)]

setList = []
for i in range(5):
    setList += subset(0, i)
print(setList)
