Spark convert PairRDD to RDD - apache-spark

What is the best way to convert a PairRDD into an RDD with K and V merged (in Java)?
For example, the PairRDD contains K as a string and V as a JSON object. I want to add this K to the value JSON and produce an RDD.
Input PairRDD
("abc", {"x:"100", "y":"200"})
("def", {"x":"400", "y":"500")
Output should be and RDD as follows
({"x:"100", "y":"200","z":"abc"})
({"x":"400", "y":"500","z":"def"})

You can use map to translate between the two; consider:
scala> pairrdd.foreach(println)
(def,Map(x -> 400, y -> 500))
(abc,Map(x -> 100, y -> 200))
(I think that's what your sample is meant to represent)
scala> val newrdd = pairrdd.map(x => x._2 ++ Map("z" -> x._1))
scala> newrdd.foreach(println)
Map(x -> 100, y -> 200, z -> abc)
Map(x -> 400, y -> 500, z -> def)
You'll have to change the val newrdd to Java syntax, but the right side of the expression (I believe) will stay the same.

Related

how to add a new element in RDD

I have a Spark pair RDD (key, count) as below:
Array[(String, Int)] = Array((a,1), (b,2), (c,3), (d,4))
I want to add a new max element to each tuple in the RDD:
Array[(String, Int)] = Array((a,1,4), (b,2,4), (c,3,4), (d,4,4))
In the definition, you are saying:
(a, 1) -> (a, 1, 4)
(b, 2) -> (b, 2, 4)
(c, 1) -> (c, 3, 4) where is the 3 coming from now?
(d, 3) -> (d, 4, 4) where is the 4 coming from now?
If your new max is the maximum value in the RDD plus one, you can sort descending by value and take the first element:
val max = df1.sortBy(_._2, ascending = false).first()._2 + 1
val df2 = df1.map(r => (r._1, r._2, max))
This gives:
(a,1,4)
(b,2,4)
(c,1,4)
(d,3,4)
which should be what you want.
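As a hedged alternative (a sketch, assuming df1 is an RDD[(String, Int)]), you can skip the sort and take the maximum value directly:
val max = df1.map(_._2).max() + 1                  // largest count plus one
val df2 = df1.map { case (k, v) => (k, v, max) }   // append it to every tuple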

Convert Spark DataFrame Map into Array of Maps of `{"Key": key, "Value": value}`

How can I take a Spark DataFrame structured like this:
val sourcedf = spark.createDataFrame(
  List(
    Row(Map("AL" -> "Alabama", "AK" -> "Alaska").asJava),
    Row(Map("TX" -> "Texas", "FL" -> "Florida", "NJ" -> "New Jersey").asJava)
  ).asJava, StructType(
    StructField("my_map", MapType(StringType, StringType, false)) ::
    Nil))
or in a text form, sourcedf.show(false) shows:
+----------------------------------------------+
|my_map |
+----------------------------------------------+
|[AL -> Alabama, AK -> Alaska] |
|[TX -> Texas, FL -> Florida, NJ -> New Jersey]|
+----------------------------------------------+
and programmatically transform to this structure:
val targetdf = spark.createDataFrame(
  List(
    Row(List(Map("Key" -> "AL", "Value" -> "Alabama"), Map("Key" -> "AK", "Value" -> "Alaska")).asJava),
    Row(List(Map("Key" -> "TX", "Value" -> "Texas"), Map("Key" -> "FL", "Value" -> "Florida"), Map("Key" -> "NJ", "Value" -> "New Jersey")).asJava)
  ).asJava, StructType(
    StructField("my_list", ArrayType(MapType(StringType, StringType, false), false)) ::
    Nil))
or in a text form, targetdf.show(false) shows:
+----------------------------------------------------------------------------------------------+
|my_list |
+----------------------------------------------------------------------------------------------+
|[[Key -> AL, Value -> Alabama], [Key -> AK, Value -> Alaska]] |
|[[Key -> TX, Value -> Texas], [Key -> FL, Value -> Florida], [Key -> NJ, Value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+
Whilst using Scala, I couldn't figure out how to handle a java.util.Map with the provided Encoders; I probably would have had to write one myself, and I figured it was too much work.
However, I can see two ways to do this by avoiding java.util.Map and using scala.collection.immutable.Map instead.
You could convert into a Dataset[Obj] and flatMap.
case class Foo(my_map: Map[String, String])
case class Bar(my_list: List[Map[String, String]])
implicit val encoder = ExpressionEncoder[List[Map[String, String]]]
val ds: Dataset[Foo] = sourcedf.as[Foo]
val output: Dataset[Bar] = ds.map(x => Bar(x.my_map.flatMap({case (k, v) => List(Map("key" -> k, "value" -> v))}).toList))
output.show(false)
Or you can use a UDF
val mapToList: Map[String, String] => List[Map[String, String]] = {
x => x.flatMap({case (k, v) => List(Map("key" -> k, "value" -> v))}).toList
}
val mapToListUdf: UserDefinedFunction = udf(mapToList)
val output: Dataset[Row] = sourcedf.select(mapToListUdf($"my_map").as("my_list"))
output.show(false)
Both output
+----------------------------------------------------------------------------------------------+
|my_list |
+----------------------------------------------------------------------------------------------+
|[[key -> AL, value -> Alabama], [key -> AK, value -> Alaska]] |
|[[key -> TX, value -> Texas], [key -> FL, value -> Florida], [key -> NJ, value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+
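If the exact array-of-maps shape is not required, a hedged alternative (assuming Spark 2.4 or later, where the map_entries SQL function is available) is to let Spark build an array of key/value structs instead of an array of maps:
import org.apache.spark.sql.functions.expr

// map_entries turns a MapType column into an array of structs with "key" and "value" fields
val entriesdf = sourcedf.select(expr("map_entries(my_map)").as("my_list"))
entriesdf.show(false)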

Calculate multiple step connections in GraphX on Spark

I have been looking at the GraphX on Spark documentation and I am trying to work out how to calculate all the 2-step (and potentially further) connections in the graph.
If I have the following structure
A -> b
b -> C
b -> D
Then A is connected to C and D via b (A -> b -> C and A -> b -> D).
I was having a look at the connected components functions, but I am not sure how you would extend them to this. In reality b will be a different vertex type, but I am not sure whether this has an effect or not.
Any suggestions would be greatly appreciated; I am pretty new to GraphX.
It seems you just need to use the collectNeighborIds action and then join the result with a reversed copy of itself. I wrote some code:
val graph: Graph[Int, Int] = ...
// (vertex, its direct out-neighbour ids)
val bros = graph.collectNeighborIds(EdgeDirection.Out)
// invert to (neighbour, vertex)
val flat = bros.flatMap(x => x._2.map(y => (y, x._1)))
// join back on the neighbour to reach the vertices two steps away, then group per starting vertex
val brosofbros: RDD[(VertexId, Array[VertexId])] = flat.join(bros)
  .map(x => (x._2._1, x._2._2))
  .reduceByKey(_ ++ _)
Finally, 'brosofbros' contains each vertex id together with all of its second-step neighbours; in your example it would be (A, Array(C, D)) (note that B itself is not included in the array).
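For reference, a hedged, self-contained sketch of the same idea on the A -> b -> {C, D} example (assuming a SparkContext sc is in scope; the numeric vertex ids 1 = A, 2 = b, 3 = C, 4 = D are purely illustrative):
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}

val edges = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0), Edge(2L, 4L, 0)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

val bros = graph.collectNeighborIds(EdgeDirection.Out)
val flat = bros.flatMap(x => x._2.map(y => (y, x._1)))
val twoStep = flat.join(bros).map(x => (x._2._1, x._2._2)).reduceByKey(_ ++ _)

twoStep.collect().foreach { case (v, ns) => println(s"$v -> ${ns.mkString(", ")}") }
// expected to include: 1 -> 3, 4   (i.e. A -> [C, D])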

Scala parallel frequency calculation using aggregate doesn't work

I'm learning Scala by working the exercises from the book "Scala for the Impatient". Please see the following question and my answer and code. I'd like to know if my answer is correct. Also the code doesn't work (all frequencies are 1). Where's the bug?
Q10: Harry Hacker reads a file into a string and wants to use a
parallel collection to update the letter frequencies concurrently on
portions of the string. He uses the following code:
val frequencies = new scala.collection.mutable.HashMap[Char, Int]
for (c <- str.par) frequencies(c) = frequencies.getOrElse(c, 0) + 1
Why is this a terrible idea? How can he really parallelize the
computation?
My answer:
It is not a good idea because if 2 threads are concurrently updating the same frequency, the result is undefined.
My code:
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) }, _ ++ _)
}
Unit test:
"Method parFrequency" should "return the frequency of each character in a string" in {
val freq = parFrequency("harry hacker")
freq should have size 8
freq('h') should be(2) // fails
freq('a') should be(2)
freq('r') should be(3)
freq('y') should be(1)
freq(' ') should be(1)
freq('c') should be(1)
freq('k') should be(1)
freq('e') should be(1)
}
Edit:
After reading this thread, I updated the code. Now the test works if run alone, but fails if run as a suite.
def parFrequency(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), (m1, m2) => m1.merged(m2)({
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  }))
}
Edit 2:
See my solution below.
++ does not combine the values of identical keys. So when you merge the maps, you get (for shared keys) one of the values (which in this case is always 1), not the sum of the values.
This works:
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) },
    (a, b) => b.foldLeft(a) { case (acc, (k, v)) => acc updated (k, acc.getOrElse(k, 0) + v) })
}
val freq = parFrequency("harry hacker")
//> Map(e -> 1, y -> 1, a -> 2, -> 1, c -> 1, h -> 2, r -> 3, k -> 1)
The foldLeft iterates over one of the maps, updating the other map with the key/values found.
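For example, on two small made-up maps:
val a = Map('h' -> 1, 'r' -> 2)
val b = Map('r' -> 1, 'y' -> 1)
b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }
// Map(h -> 1, r -> 3, y -> 1)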
Your trouble in the first case, as you detected yourself, was the ++ operator, which just concatenates the maps, keeping only one of the values for a duplicate key.
Now in the second case you have (_, c) => ImmutableHashMap(c -> 1), which throws away everything the map has accumulated so far at the seqop stage.
My suggestion is to extend the Map type with a special combination operation that works like merged on HashMap, and to keep the accumulation from the first example at the seqop stage:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
  def unionWith[V1 >: V](m2: Map[K, V1])(f: (V1, V1) => V1): Map[K, V1] = {
    val kv1 = m1.filterKeys(!m2.contains(_))
    val kv2 = m2.filterKeys(!m1.contains(_))
    val common = (m1.keySet & m2.keySet).toSeq map (k => (k, f(m1(k), m2(k))))
    (common ++ kv1 ++ kv2).toMap
  }
}

def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) }, (m1, m2) => (m1 unionWith m2)(_ + _))
}
Or you can use the fold solution from Paul's answer, but for better performance choose the smaller map to traverse on each merge:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
  def unionWith(m2: Map[K, V])(f: (V, V) => V): Map[K, V] =
    if (m2.size > m1.size) m2.unionWith(m1)(f)
    else m2.foldLeft(m1) {
      case (acc, (k, v)) => acc + (k -> acc.get(k).fold(v)(f(v, _)))
    }
}
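A quick usage sketch of the helper on two made-up maps:
val m1 = Map('a' -> 2, 'r' -> 1)
val m2 = Map('r' -> 2, 'k' -> 1)
(m1 unionWith m2)(_ + _) // Map(a -> 2, r -> 3, k -> 1)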
This seems to work. I like it better than the other solutions proposed here because:
It's a lot less code than an implicit class and slightly less code than using getOrElse with foldLeft.
It uses the merged function from the API, which is intended to do what I want.
It's my own solution :)
def parFrequency(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), _.merged(_) {
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  })
}
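For completeness, here is a hedged variant (a sketch, not battle-tested; the parFrequencyMerged name is just illustrative) that also keeps the per-chunk accumulation the earlier answers recommend, in case aggregate folds several characters into one chunk with the same seqop:
def parFrequencyMerged(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  val add: ((Char, Int), (Char, Int)) => (Char, Int) = {
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  }
  str.par.aggregate(freq)(
    (m, c) => m.merged(ImmutableHashMap(c -> 1))(add), // accumulate within a chunk
    (m1, m2) => m1.merged(m2)(add))                    // sum counts across chunks
}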
Thanks for taking the time to help me out.

Returning Minimum Value in Binary Tree

I'm trying to write code that, given a tree, will go through the tree and return the minimum value in it; if the tree is empty, it will return val. What I have right now compiles but will not run. Any help?
minValue :: Ord a => a -> BTree a -> a
minValue val Empty = val
minValue val (BNode v left Empty) = minimum [minValue v left]
minValue val (BNode v Empty right) = minimum [minValue v right]
minValue val (BNode v left right) = minimum ([minValue v left]++[minValue v right])
I'm assuming that BTree is defined as
data BTree a = Empty | BNode a (BTree a) (BTree a) deriving (Eq, Show)
Although for future reference, please include data type definitions in your question.
The key to the solution here is that the minimum of a node is the minimum of its value and the mins of each branch:
minValue :: Ord a => a -> BTree a -> a
minValue val Empty = val
minValue val (BNode v left right) =
  let leftMin  = minValue val left
      rightMin = minValue val right
  in ???
Instead of worrying about whether the left or right is Empty, just trust the recursion to handle it. If left is Empty, then minValue val left will just be val, and similarly for right. Then you have four values in scope that you want to determine the minimum of: val, v, leftMin, and rightMin. How might you do that?
