Spark group-by aggregation in a dataset containing a map - apache-spark

I have a java POJO
class MyObj {
    String id;
    Map<KeyObj, ValueObj> mapValues;
    // getters and setters (omitted)
}
I have a spark dataset
Dataset<MyObj> myDs = .....
My dataset has a list of values, but there are duplicate ids. How do I combine the duplicate ids and aggregate their key-value pairs into one map per id using Spark's groupBy?
Thanks for your help.
So I have:
ID. Map
----------------------------------
1000 [(w -> wer), (D -> dfr)]
1000 [(g -> gde)]
1001 [(k -> khg), (v -> vsa)]
And I need this:
ID. Map
----------------------------------
1000 [(w -> wer), (D -> dfr), (g -> gde)]
1001 [(k -> khg), (v -> vsa)]

You can explode the original maps so that each entry of each map is a row of its own. Then you can group over the id column and restore the maps with map_from_arrays:
myDs.select(col("id"), explode(col("mapValues"))) // 1
    .groupBy("id")
    .agg(collect_list("key").as("keys"), collect_list("value").as("values")) // 2
    .withColumn("map", map_from_arrays(col("keys"), col("values"))) // 3
    .drop("keys", "values") // 4
    .show(false);
1. explode the maps into single rows. The new columns will be named key and value.
2. When grouping by id, collect all keys and all values into arrays, giving one array of keys and one array of values per id.
3. Use map_from_arrays to turn the keys and values arrays back into a single map.
4. Drop the intermediate columns.
The result is
+----+------------------------------+
|id |map |
+----+------------------------------+
|1000|[D -> dfr, w -> wer, g -> gde]|
|1001|[v -> vsa, k -> khg] |
+----+------------------------------+
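For reference, here is a minimal Scala sketch of the same pipeline, assuming a spark-shell session (i.e. an existing SparkSession named spark) and Spark 2.4+ for map_from_arrays; the sample data is recreated inline:
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named spark (e.g. in spark-shell)

// Recreate the sample data as a DataFrame with columns id and mapValues
val df = Seq(
  (1000, Map("w" -> "wer", "D" -> "dfr")),
  (1000, Map("g" -> "gde")),
  (1001, Map("k" -> "khg", "v" -> "vsa"))
).toDF("id", "mapValues")

df.select(col("id"), explode(col("mapValues")))                              // 1: one row per map entry
  .groupBy("id")
  .agg(collect_list("key").as("keys"), collect_list("value").as("values"))   // 2: arrays of keys and values per id
  .withColumn("map", map_from_arrays(col("keys"), col("values")))            // 3: rebuild the map
  .drop("keys", "values")                                                    // 4: drop intermediates
  .show(false)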

Related

Sort by key in map type column for each row in spark dataframe

I have a spark dataframe in the below format:
Name LD_Value
A37 Map(10 -> 0.20,5 -> 0.30,17 -> 0.25)
A39 Map(11 -> 0.40,6 -> 0.67,24 -> 0.45)
I need to sort based on keys in LD_Value column for each record in descending order.
Expected output:
Name LD_Value
A37 Map(17 -> 0.25,10 -> 0.20,5 -> 0.30)
A39 Map(24 -> 0.45,11 -> 0.40,6 -> 0.67)
Is it possible to do sorting on map type column in spark dataframe?
I looked into spark higher-order functions but no luck.
You can first get the keys of the map using the map_keys function, sort that array of keys, then use transform to look up the corresponding value for each key in the original map, and finally rebuild the map column from the two arrays using the map_from_arrays function.
For Spark 3+, you can sort the array of keys in descending order by passing a comparator function as the second argument to the array_sort function:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "LD_Value_keys",
    F.expr("array_sort(map_keys(LD_Value), (x, y) -> case when x > y then -1 when x < y then 1 else 0 end)")
).withColumn("LD_Value_values", F.expr("transform(LD_Value_keys, x -> LD_Value[x])")) \
 .withColumn("LD_Value", F.map_from_arrays(F.col("LD_Value_keys"), F.col("LD_Value_values"))) \
 .drop("LD_Value_keys", "LD_Value_values")
df1.show()
#+----+----------------------------------+
#|Name|LD_Value |
#+----+----------------------------------+
#|A37 |[17 -> 0.25, 10 -> 0.2, 5 -> 0.3] |
#|A39 |[24 -> 0.45, 11 -> 0.4, 6 -> 0.67]|
#+----+----------------------------------+
For Spark < 3, you can sort an array in descending order using this UDF:
from pyspark.sql.types import ArrayType, StringType

# array_sort_udf(array, reverse): if reverse=True, sort in descending order
array_sort_udf = F.udf(lambda arr, r: sorted(arr, reverse=r), ArrayType(StringType()))
And use it like this:
df.withColumn("LD_Value_keys", array_sort_udf(F.map_keys(F.col("LD_Value")), F.lit(True)))

Scala count chars in a string logical error

here is the code:
val a = "abcabca"
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
here is the result I want:
scala.collection.immutable.Map[Int,String] = Map(2 -> b, 2 -> c, 3 -> a)
but the result I get is
scala.collection.immutable.Map[Int,String] = Map(2 -> bcbc, 3 -> aaa)
why?
thank you.
Write an expression like
"abcabca".groupBy(identity).collect{
case (k,v) => (k,v.length)
}
which will give output as
res0: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 3, c -> 2)
Let's dissect your initial attempt:
a.groupBy((c: Char) => a.count( (d:Char) => d == c))
So, what are you grouping by? The result of a.count(...), so the key of your Map will be an Int. For the char a we get 3; for the chars b and c we get 2.
Now the original String is traversed and the results are accumulated, char by char.
So after traversing the first "ab", the current state is "3 -> a, 2 -> b". (Note that .count() is called for each char in the string, which is a wasteful O(n²) algorithm, but anyway.)
The string is progressively traversed, and at the end the accumulated results are shown. As it turns out, the three a's have been sent under the key 3, and the b's and c's under the key 2, in the order the string was traversed, which is left to right.
Now, a usual groupBy on a list returns something like Map[T, List[T]], so you may have expected a List[Char] somewhere. That doesn't happen (because the Repr for String is String), so your list of chars is effectively recombined into a String and given to you as such.
Hence your final result!
Your question title reads "Scala count chars in a string logical error", but you are using a Map with the counts as keys. Duplicate keys are not allowed in a Map, so entries with equal keys get collapsed, keeping just one. What you probably want is a sequence of (count, char) tuples, i.e. a List[(Int, Char)]. Try this:
val x = "abcabca"
x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
In the Scala REPL:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}
res13: List[(Int, Char)] = List((2,b), (3,a), (2,c))
The above gives the counts and their respective chars as a list of tuples, which is probably what you really wanted.
If you try converting this to a Map:
scala> x.groupBy(identity).mapValues(_.size).toList.map{case (x,y)=>(y,x)}.toMap
res14: scala.collection.immutable.Map[Int,Char] = Map(2 -> c, 3 -> a)
So this is obviously not what you want.
Even more concisely use:
x.distinct.map(v=>(x.filter(_==v).size,v))
scala> x.distinct.map(v=>(x.filter(_==v).size,v))
res19: scala.collection.immutable.IndexedSeq[(Int, Char)] = Vector((3,a), (2,b), (2,c))
The problem with your approach is that you are mapping counts to characters. In the case of
val str = "abcabca"
while traversing the string, a has count 3 while b and c each have count 2, so when the Map is created (by groupBy) all characters sharing the same count end up in the value for that key:
Map(3 -> aaa, 2 -> bcbc)
That is the reason you are getting such output from your program.
As you can see in the definition of the groupBy function:
def groupBy[K](f: (A) ⇒ K): immutable.Map[K, Repr]
Partitions this traversable collection into a map of traversable collections according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new traversable collection.
K: the type of keys returned by the discriminator function.
f: the discriminator function.
returns: A map from keys to traversable collections such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a traversable collection of those elements x for which f(x) equals k.
So groupBy returns a Map that satisfies the invariant
(xs groupBy f)(k) = xs filter (x => f(x) == k)
which means that, for each key, it returns the collection of elements that map to that key.
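As a concrete illustration of that invariant on the string from the question (entry order in the resulting Map may differ):
"abcabca".groupBy(identity)
// Map(b -> bb, a -> aaa, c -> cc): each key (char) maps to the substring of chars equal to it,
// i.e. ("abcabca" groupBy identity)(k) == "abcabca" filter (_ == k)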

How to find all two pairs of sets and elements in a collection using MapReduce in Spark?

I have a collection of sets, and each set contains many items. I want to retrieve all pairs of sets and elements using Spark, where each pair after the reduce step will contain two items and two sets.
for example:
If I have this list of sets
Set A = {1, 2, 3, 4}
Set B = {1, 2, 4, 5}
Set C = {2, 3, 5, 6}
The map process will be:
(A,1)
(A,2)
(A,3)
(A,4)
(B,1)
(B,2)
(B,4)
(B,5)
(C,2)
(C,3)
(C,5)
(C,6)
The target result after reduce is:
(A B, 1 2) // since 1 2 exist in both A and B
(A B, 1 4)
(A B, 2 4)
(A C,2 3)
(B C,2 5)
Here (A B, 1 3) is not in the result because the pair 1 3 does not exist in both sets (3 is not in B).
Could you help me solve this problem in Spark with one map and one reduce function, in any language (Python, Scala, or Java)?
Let's break this problem into multiple parts. I consider the transformation from the input lists to the map output trivial, so let us start from there: you have an RDD of (String, Int) pairs looking like
("A", 1)
("A", 2)
....
Let's forget for now that you need 2 integer elements in the result set, and first solve for the intersection set between any 2 keys from the mapped output.
The result for your input would look like
(AB, Set(1,2,4))
(BC, Set(2,5))
(AC, Set(2,3))
To do this, first extract all keys from your mapped output (mappedOutput, an RDD of (String, Int)), convert them to a set, and get all combinations of 2 elements (I am using a naive method here; a way to do this that scales would be a proper combination generator):
val combinations = mappedOutput.map(x => x._1).collect.toSet
.subsets.filter(x => x.size == 2).toList
.map(x => x.mkString(""))
The output would be List(AB, AC, BC); these combination codes will serve as the keys to be joined.
Next, convert the mapped output into pairs of key (A, B, C) => set of elements:
val step1 = mappedOutput.groupByKey().map(x => (x._1, x._2.toSet))
Attach the combination codes as keys to step1:
val step2 = step1.map(x => combinations.filter(y => y.contains(x._1)).map(y => (y, x))).flatMap(x => x)
The output would be (AB, (A, set of elements in A)), (AC, (A, set of elements in A)), etc. Because of the filter, we do not attach the combination code BC to set A.
Now obtain the result we want using a reduce:
val result = step2.reduceByKey((a, b) => ("", a._2.intersect(b._2))).map(x => (x._1, x._2._2))
Now we have the output mentioned at the start. What is left is to transform this result into what you need, which is very simple to do:
val transformed = result.map(x => x._2.subsets.filter(s => s.size == 2).map(y => (x._1, y.mkString(" ")))).flatMap(x => x)
end :)
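For completeness, here is a hedged end-to-end sketch of the steps above on the sample sets, runnable in spark-shell (the parallelize call recreates the mapped output, and the printed pairs are illustrative; element and row ordering may vary):
// Build the mapped output for sets A, B and C from the question
val mappedOutput = sc.parallelize(Seq(
  ("A", 1), ("A", 2), ("A", 3), ("A", 4),
  ("B", 1), ("B", 2), ("B", 4), ("B", 5),
  ("C", 2), ("C", 3), ("C", 5), ("C", 6)))

// All 2-key combination codes: List(AB, AC, BC)
val combinations = mappedOutput.map(_._1).collect.toSet
  .subsets.filter(_.size == 2).toList.map(_.mkString(""))

val step1 = mappedOutput.groupByKey().map(x => (x._1, x._2.toSet))
val step2 = step1.map(x => combinations.filter(y => y.contains(x._1)).map(y => (y, x))).flatMap(x => x)
val result = step2.reduceByKey((a, b) => ("", a._2.intersect(b._2))).map(x => (x._1, x._2._2))
val transformed = result.map(x => x._2.subsets.filter(s => s.size == 2).map(y => (x._1, y.mkString(" ")))).flatMap(x => x)

transformed.collect.foreach(println)
// expected pairs (ordering may vary):
// (AB,1 2)  (AB,1 4)  (AB,2 4)  (AC,2 3)  (BC,2 5)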

Scala: How to count occurrences of unique items in a certain index?

I have a list that is formatted like the lists below:
List(List(21, Georgetown, Male),List(29, Medford, Male),List(18, Manchester, Male),List(27, Georgetown, Female))
And I need to count the occurrences of each unique town name, then return the town name and the number of times it was counted. But I only want to return the one town that had the most occurrences. So if I applied the function to the list above, I would get
(Georgetown, 2)
I'm coming from Java, so I know how to do this process in a longer way, but I want to utilize some of Scala's built in methods.
scala> val towns = List(
| List(21, "Georgetown", "Male"),
| List(29, "Medford", "Male"),
| List(18, "Manchester", "Male"),
| List(27, "Georgetown", "Female"))
towns: List[List[Any]] = ...
scala> towns.map({ case List(a, b, c) => (b, c) }).groupBy(_._1).mapValues(_.length).maxBy(_._2)
res0: (Any, Int) = (Georgetown,2)
This is a pretty weird structure, but a way to do it would be with:
val items : List[List[Any]] = List(
List(List(21, "Georgetown", "Male")),
List(List(29, "Medford", "Male")),
List(List(18, "Manchester", "Male")),
List(List(27, "Georgetown", "Female"))).map(_.flatten)
val results = items.foldLeft(Map[String, Int]()) { (acc, item) =>
  val key = item(1).asInstanceOf[String]
  val count = acc.getOrElse(key, 0)
  acc + (key -> (count + 1))
}
println(results)
Which produces:
Map(Georgetown -> 2, Medford -> 1, Manchester -> 1)
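If, as in the question, only the most frequent town is needed, the accumulated Map can be reduced one step further (a small follow-up sketch):
// Pick the entry with the highest count
val top = results.maxBy(_._2)
println(top)  // (Georgetown,2)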

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem, but I don't know how to solve it. My data in an RDD is as follows:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 over the elements, and zipping each window with an integer index. We then output tuples keyed by the original key with the window index appended, whose values are the sliding windows. We use flatMap so that each window becomes its own record.
When run in spark-shell, I'm seeing the following output on your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note with this is that the output key (e.g. 10, 11) will have 3 digits once you have 11 or more elements. For example, for the input key 1 you would get the output key 110 on the 11th element. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Based on your output key scheme, I would actually suggest something slightly different, like adding a hyphen between the key and the element index. This prevents collisions later, as you will see 2-10 and 21-0 instead of 210 for both keys.
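A brief sketch of that hyphen variant, assuming the same rdd1 as in the shell session above (keys become strings like "1-0", "1-1", "2-0"):
val rdd2 = rdd1.flatMap { case (key, values) =>
  // one record per sliding window, keyed by "<originalKey>-<windowIndex>"
  for ((window, idx) <- values.sliding(2).zipWithIndex)
    yield (s"$key-$idx", window)
}
// e.g. (1-0,List(A, B)), (1-1,List(B, C)), (1-2,List(C, D)), (2-0,List(E, F)), (2-1,List(F, G))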
