Group an RDD by key in java - apache-spark

I am trying to group an RDD using groupBy. Most of the docs suggest not to use groupBy because of how it works internally to group keys. Is there another way to achieve this? I cannot use reduceByKey because I am not doing a reduction operation here.
Example:
// Entry has: long id, String name
JavaRDD<Entry> entries = rdd.groupBy(Entry::getId)
    .flatMap(x -> someOp(x))
    .values()
    .filter(...)

aggregateByKey [Pair]
Works like the aggregate function, except the aggregation is applied to the values with the same key. Also, unlike the aggregate function, the initial value is not applied to the second reduce (combOp).
Listing Variants
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Example:
val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), 2)
// let's have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.map(x => "[partID:" + index + ", val: " + x + "]")
}
pairRDD.mapPartitionsWithIndex(myfunc).collect
res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])
pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))
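Applied back to the original Java question, a rough sketch of the same idea might look like the code below. This is only an illustration: Entry and someOp come from the question, and it assumes someOp can accept a List<Entry> of all entries sharing an id and return an Iterable of results (SomeResult is a placeholder type).
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Key the entries by id.
JavaPairRDD<Long, Entry> keyed = rdd.mapToPair(e -> new Tuple2<>(e.getId(), e));

// Build one List<Entry> per key: seqOp adds an entry to the per-partition list,
// combOp merges the partial lists coming from different partitions.
JavaPairRDD<Long, List<Entry>> grouped = keyed.aggregateByKey(
        new ArrayList<Entry>(),
        (list, entry) -> { list.add(entry); return list; },
        (left, right) -> { left.addAll(right); return left; });

// Apply someOp to each group, as in the original snippet.
JavaRDD<SomeResult> results = grouped.values()
        .flatMap(group -> someOp(group).iterator());
Note that collecting whole lists per key still shuffles every value, much like groupByKey; aggregateByKey only really pays off when seqOp and combOp can shrink the data (for example, folding each group down to whatever part of someOp's result you actually need) before the shuffle.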

Related

how to add a new element in RDD

I have a Spark pair RDD (key, count) as below:
Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))
I want to add a new max element to each record:
Array[(String, Int, Int)] = Array((a,1,4), (b,2,4), (c,3,4), (d,4,4))
In the definition, you are saying:
(a, 1) -> (a, 1, 4)
(b, 2) -> (b, 2, 4)
(c, 1) -> (c, 3, 4) where is the 3 coming from now?
(d, 3) -> (d, 4, 4) where is the 4 coming from now?
If your new max is the maximum value in your RDD plus one, then you can sort descending by value, take the first element's value, and add one:
val max = df1.sortBy(_._2, ascending = false).collect()(0)._2 + 1
val df2 = df1.map(r => (r._1, r._2, max))
This gives:
(a,1,4)
(b,2,4)
(c,1,4)
(d,3,4)
which should be what you want.

Combine 2 map columns in spark sql

Hello, I am using the combine UDF from brickhouse to combine two maps, as in the example below:
select combine(map('a',1,'b',0),map('a',0,'c',1))
It does combine the maps, but I want to keep the highest value for each key while combining them. Is that possible?
You can use a UDF to concatenate the maps, keeping the maximum value for each key, as below:
import scala.collection.mutable

val mapConcat = udf((map1: Map[String, Int], map2: Map[String, Int]) => {
  val finalMap = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
  map1.foreach { case (key, value) =>
    if (finalMap.contains(key)) finalMap(key) += value
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  map2.foreach { case (key, value) =>
    if (finalMap.contains(key)) finalMap(key) += value
    else finalMap.put(key, mutable.ArrayBuffer(value))
  }
  // keep only the highest value seen for each key
  finalMap.mapValues(_.max).toMap
})
spark.udf.register("my_map_concat", mapConcat)
spark.range(2).selectExpr("map('a',1,'b',0)", "map('a',0,'c',1)",
    "my_map_concat(map('a',1,'b',0),map('a',0,'c',1))")
  .show(false)
Output-
+----------------+----------------+-------------------------------------+
|map(a, 1, b, 0) |map(a, 0, c, 1) |UDF(map(a, 1, b, 0), map(a, 0, c, 1))|
+----------------+----------------+-------------------------------------+
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
|[a -> 1, b -> 0]|[a -> 0, c -> 1]|[b -> 0, a -> 1, c -> 1] |
+----------------+----------------+-------------------------------------+

Convert Spark DataFrame Map into Array of Maps of `{"Key": key, "Value": value}`

How can I take a Spark DataFrame structured like this:
val sourcedf = spark.createDataFrame(
  List(
    Row(Map("AL" -> "Alabama", "AK" -> "Alaska").asJava),
    Row(Map("TX" -> "Texas", "FL" -> "Florida", "NJ" -> "New Jersey").asJava)
  ).asJava, StructType(
    StructField("my_map", MapType(StringType, StringType, false)) ::
      Nil))
or in a text form, sourcedf.show(false) shows:
+----------------------------------------------+
|my_map |
+----------------------------------------------+
|[AL -> Alabama, AK -> Alaska] |
|[TX -> Texas, FL -> Florida, NJ -> New Jersey]|
+----------------------------------------------+
and programmatically transform to this structure:
val targetdf = spark.createDataFrame(
  List(
    Row(List(Map("Key" -> "AL", "Value" -> "Alabama"), Map("Key" -> "AK", "Value" -> "Alaska")).asJava),
    Row(List(Map("Key" -> "TX", "Value" -> "Texas"), Map("Key" -> "FL", "Value" -> "Florida"), Map("Key" -> "NJ", "Value" -> "New Jersey")).asJava)
  ).asJava, StructType(
    StructField("my_list", ArrayType(MapType(StringType, StringType, false), false)) ::
      Nil))
or in a text form, targetdf.show(false) shows:
+----------------------------------------------------------------------------------------------+
|my_list |
+----------------------------------------------------------------------------------------------+
|[[Key -> AL, Value -> Alabama], [Key -> AK, Value -> Alaska]] |
|[[Key -> TX, Value -> Texas], [Key -> FL, Value -> Florida], [Key -> NJ, Value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+
So, whilst using Scala, I couldn't figure out how to handle a java.util.Map with the provided Encoders; I probably would have had to write one myself, and I figured that was too much work.
However, I can see two ways to do this by working with scala.collection.immutable.Map instead of java.util.Map.
You could convert into a Dataset[Obj] and flatMap.
case class Foo(my_map: Map[String, String])
case class Bar(my_list: List[Map[String, String]])
implicit val encoder = ExpressionEncoder[List[Map[String, String]]]
val ds: Dataset[Foo] = sourcedf.as[Foo]
val output: Dataset[Bar] = ds.map(x => Bar(x.my_map.flatMap({case (k, v) => List(Map("key" -> k, "value" -> v))}).toList))
output.show(false)
Or you can use a UDF
val mapToList: Map[String, String] => List[Map[String, String]] = {
  x => x.flatMap({ case (k, v) => List(Map("key" -> k, "value" -> v)) }).toList
}
val mapToListUdf: UserDefinedFunction = udf(mapToList)
val output: Dataset[Row] = sourcedf.select(mapToListUdf($"my_map").as("my_list"))
output.show(false)
Both output
+----------------------------------------------------------------------------------------------+
|my_list |
+----------------------------------------------------------------------------------------------+
|[[key -> AL, value -> Alabama], [key -> AK, value -> Alaska]] |
|[[key -> TX, value -> Texas], [key -> FL, value -> Florida], [key -> NJ, value -> New Jersey]]|
+----------------------------------------------------------------------------------------------+

Spark convert PairRDD to RDD

What is the best way to convert a PairRDD into an RDD where K and V are merged (in Java)?
For example, the PairRDD contains K as some string and V as a JSON object. I want to add this K to the value JSON and produce an RDD.
Input PairRDD
("abc", {"x":"100", "y":"200"})
("def", {"x":"400", "y":"500"})
Output should be an RDD as follows
({"x":"100", "y":"200", "z":"abc"})
({"x":"400", "y":"500", "z":"def"})
You can use map to translate between the two. Consider:
scala> pairrdd.foreach(println)
(def,Map(x -> 400, y -> 500))
(abc,Map(x -> 100, y -> 200))
(I think that's what your sample is meant to represent)
scala> val newrdd = pairrdd.map(X => X._2 ++ Map("z" -> X._1))
scala> newrdd.foreach(println)
Map(x -> 100, y -> 200, z -> abc)
Map(x -> 400, y -> 500, z -> def)
You'll have to change the val newrdd to Java syntax, but the right-hand side of the expression (I believe) will stay much the same.
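For reference, a rough Java equivalent of that map call, assuming the pair RDD is a JavaPairRDD<String, Map<String, String>> named pairRdd (i.e. the JSON value is already held as a map rather than a raw string):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Map<String, String>> newRdd = pairRdd.map(pair -> {
    // Copy the value map so the original tuple is left untouched,
    // then fold the key in under "z".
    Map<String, String> merged = new HashMap<>(pair._2());
    merged.put("z", pair._1());
    return merged;
});
If the value really is a JSON string, the same map call applies, but you would parse the string, add the "z" field, and re-serialize it inside the lambda instead.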

Scala parallel frequency calculation using aggregate doesn't work

I'm learning Scala by working through the exercises from the book "Scala for the Impatient". Please see the following question, along with my answer and code. I'd like to know if my answer is correct. Also, the code doesn't work (all frequencies are 1). Where's the bug?
Q10: Harry Hacker reads a file into a string and wants to use a
parallel collection to update the letter frequencies concurrently on
portions of the string. He uses the following code:
val frequencies = new scala.collection.mutable.HashMap[Char, Int]
for (c <- str.par) frequencies(c) = frequencies.getOrElse(c, 0) + 1
Why is this a terrible idea? How can he really parallelize the
computation?
My answer:
It is not a good idea because scala.collection.mutable.HashMap is not thread-safe: if two threads concurrently update the count for the same character, the result is undefined.
My code:
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) }, _ ++ _)
}
Unit test:
"Method parFrequency" should "return the frequency of each character in a string" in {
val freq = parFrequency("harry hacker")
freq should have size 8
freq('h') should be(2) // fails
freq('a') should be(2)
freq('r') should be(3)
freq('y') should be(1)
freq(' ') should be(1)
freq('c') should be(1)
freq('k') should be(1)
freq('e') should be(1)
}
Edit:
After reading this thread, I updated the code. Now the test passes if run alone, but fails if run as a suite.
import scala.collection.immutable.{HashMap => ImmutableHashMap}

def parFrequency(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  str.par.aggregate(freq)((_, c) => ImmutableHashMap(c -> 1), (m1, m2) => m1.merged(m2)({
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  }))
}
Edit 2:
See my solution below.
++ does not combine the values of identical keys. So when you merge the maps, you get (for shared keys) one of the values (which in this case is always 1), not the sum of the values.
This works:
def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) },
    (a, b) => b.foldLeft(a) { case (acc, (k, v)) => acc updated (k, acc.getOrElse(k, 0) + v) })
}
val freq = parFrequency("harry hacker")
//> Map(e -> 1, y -> 1, a -> 2, -> 1, c -> 1, h -> 2, r -> 3, k -> 1)
The foldLeft iterates over one of the maps, updating the other map with the key/values found.
Your trouble in the first case, as you detected yourself, was the ++ operator, which just concatenates the maps and keeps only one value per shared key instead of summing them.
In the second case, (_, c) => ImmutableHashMap(c -> 1) throws away the map accumulated so far in the seqop stage, keeping only the current character.
My suggestion is to extend the Map type with a special combination operation that works like merged in HashMap, while preserving the accumulation from the first example at the seqop stage:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
  def unionWith[V1 >: V](m2: Map[K, V1])(f: (V1, V1) => V1): Map[K, V1] = {
    val kv1 = m1.filterKeys(!m2.contains(_))
    val kv2 = m2.filterKeys(!m1.contains(_))
    val common = (m1.keySet & m2.keySet).toSeq map (k => (k, f(m1(k), m2(k))))
    (common ++ kv1 ++ kv2).toMap
  }
}

def parFrequency(str: String) = {
  str.par.aggregate(Map[Char, Int]())((m, c) => { m + (c -> (m.getOrElse(c, 0) + 1)) },
    (m1, m2) => (m1 unionWith m2)(_ + _))
}
Or you can use the fold solution from Paul's answer, but for better performance choose the smaller map to traverse on each merge:
implicit class MapUnionOps[K, V](m1: Map[K, V]) {
  def unionWith(m2: Map[K, V])(f: (V, V) => V): Map[K, V] =
    if (m2.size > m1.size) m2.unionWith(m1)(f)
    else m2.foldLeft(m1) {
      case (acc, (k, v)) => acc + (k -> acc.get(k).fold(v)(f(v, _)))
    }
}
This seems to work. I like it better than the other solutions proposed here because:
It's a lot less code than an implicit class and slightly less code than using getOrElse with foldLeft.
It uses the merged function from the API, which is intended to do what I want.
It's my own solution :)
def parFrequency(str: String) = {
  val freq = ImmutableHashMap[Char, Int]()
  // the seqop must keep the map accumulated so far, otherwise counts within a chunk are lost
  str.par.aggregate(freq)((m, c) => m.merged(ImmutableHashMap(c -> 1)) {
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  }, _.merged(_) {
    case ((k, v1), (_, v2)) => (k, v1 + v2)
  })
}
Thanks for taking the time to help me out.
