Spark: flatMapValues query

I'm reading the Learning Spark book and couldn't understand the following pair RDD transformation:
rdd.flatMapValues(x => (x to 5))
It is applied to the RDD {(1,2),(3,4),(3,6)}, and the output of the transformation is {(1,2),(1,3),(1,4),(1,5),(3,4),(3,5)}.
Can someone please explain this?

The flatMapValues method is a combination of flatMap and mapValues.
Let's start with the given rdd.
val sampleRDD = sc.parallelize(Array((1,2),(3,4),(3,6)))
mapValues maps the values while keeping the keys.
For example, sampleRDD.mapValues(x => x to 5) returns
Array((1,Range(2, 3, 4, 5)), (3,Range(4, 5)), (3,Range()))
Notice that for the key-value pair (3, 6), it produces (3,Range()), since 6 to 5 is an empty range.
flatMap "breaks down" collections into the elements of the collection. More precise descriptions of flatMap are easy to find online.
For example,
given val rdd2 = sampleRDD.mapValues(x => x to 5),
if we do rdd2.flatMap { case (k, vs) => vs.map(v => (k, v)) }, we get
Array((1,2),(1,3),(1,4),(1,5),(3,4),(3,5)).
That is, for every element in the collection under each key, we create a (key, element) pair.
Also notice that (3, Range()) does not produce any key-element pair, since the sequence is empty.
Combining mapValues and flatMap in this way gives you flatMapValues.
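The two steps above can be sketched without Spark at all. The following is a plain-Python model of the semantics only, not Spark code: map_values, flat_map_values and to_five are hypothetical helper names made up for this illustration, and ordinary lists stand in for the RDD.

```python
# Plain-Python model of the pair-RDD operations (no Spark required).
# All function names here are illustrative, not part of any Spark API.

def map_values(pairs, f):
    """Apply f to each value, keeping the key (models mapValues)."""
    return [(k, f(v)) for k, v in pairs]

def flat_map_values(pairs, f):
    """Apply f to each value and emit one (key, element) pair per
    element of the result (models flatMapValues)."""
    return [(k, element) for k, v in pairs for element in f(v)]

def to_five(x):
    """Mirror Scala's inclusive range `x to 5`; empty when x > 5."""
    return list(range(x, 6))

sample = [(1, 2), (3, 4), (3, 6)]

# Step 1: mapValues keeps the keys and maps each value to a range.
step1 = map_values(sample, to_five)

# Step 2: flatten each (key, collection) into (key, element) pairs.
step2 = [(k, element) for k, values in step1 for element in values]

# flatMapValues does both steps at once.
result = flat_map_values(sample, to_five)

print(step1)
print(result)
```

In this model the pair (3, 6) maps to an empty list in step 1, so it contributes nothing to the flattened result.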

flatMapValues works on each value associated with a key. In the case above, x to 5 expands each value x into the range of integers from x up to and including 5.
Take the first pair, (1,2): the key is 1 and the value is 2, so applying the transformation produces {(1,2),(1,3),(1,4),(1,5)}.
Hope this helps.

Related

Something strange about groupBy on spark

I'm learning about the groupBy function in Spark. I create a list with 2 partitions, then use groupBy to separate the odd and even numbers. I found that if I define
val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
val result = rdd.groupBy(_ % 2)
each group goes to its own partition. But if I define
val result = rdd.groupBy(_ % 2 == 0)
both groups end up in one partition. Could anybody explain why?
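It's just the hashing that the groupBy shuffle applies to the keys: by default, each key is sent to the partition given by its hashCode modulo the number of partitions.
val rdd = sc.makeRDD(List(1, 2, 3, 4), 5)
With 5 partitions you see 2 partitions used and 3 empty. It is just how the hash partitioning works.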
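To see why the Boolean keys collide with 2 partitions, here is a small plain-Python sketch of hash partitioning. It assumes the default hash partitioner (key hash code modulo the partition count) and the JVM hash codes: an Int hashes to itself, while Boolean true hashes to 1231 and false to 1237. The function names are made up for the illustration.

```python
# Model of hash partitioning: partition = hashCode mod numPartitions.
# Assumption: JVM hash codes (Int -> itself, true -> 1231, false -> 1237).

INT_KEY_HASHES = [0, 1]          # keys produced by _ % 2
BOOL_KEY_HASHES = [1231, 1237]   # keys produced by _ % 2 == 0

def partition_for(hash_code, num_partitions):
    """Pick the target partition for a key with the given hash code."""
    return hash_code % num_partitions

def used_partitions(hash_codes, num_partitions):
    """Return the sorted set of partitions that receive at least one key."""
    return sorted({partition_for(h, num_partitions) for h in hash_codes})

int_2 = used_partitions(INT_KEY_HASHES, 2)
bool_2 = used_partitions(BOOL_KEY_HASHES, 2)
bool_5 = used_partitions(BOOL_KEY_HASHES, 5)

print(int_2, bool_2, bool_5)
```

Both Boolean hash codes are odd, so with 2 partitions they land in the same partition, while the Int keys 0 and 1 are spread over both. With 5 partitions the Boolean keys are separated again.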

Spark SQL – how to group by or aggregate with dynamically generated keys?

I have a DataFrame of overlapping ordered arrays.
[1,2,3]
[2,3,4]
[7,8,9]
Using Spark SQL I would like to group those that overlap, like below:
Key Values
1 [1,2,3], [2,3,4]
2 [7,8,9]
I was looking into UDAF functions, but I can't work out how to generate a new key for the rows that match my merging criteria.
Currently, I have implemented it on the driver side, like this:
Order the collection of arrays by their first and last elements.
In a loop, if the first element of the array is less than or equal to the last element of the previous array, put them in the same bucket.
This works, but to do that I need to collect all the data on the driver side and I'm looking for a more efficient way to do that.
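For reference, the driver-side procedure described above might look roughly like this in plain Python. This is only a sketch of the idea, not the asker's actual code: the function name group_overlapping is made up, the arrays are assumed to be non-empty and sorted, and the sketch compares against the largest last element seen so far in the bucket rather than only the immediately preceding array.

```python
def group_overlapping(arrays):
    """Bucket sorted, non-empty arrays whose value ranges overlap.

    Returns a dict mapping a generated key (1, 2, ...) to the arrays
    in that bucket.
    """
    # Order by first element, then by last element.
    ordered = sorted(arrays, key=lambda a: (a[0], a[-1]))

    groups = {}
    key = 0
    bucket_last = None  # largest last element seen in the current bucket
    for arr in ordered:
        if bucket_last is None or arr[0] > bucket_last:
            # No overlap with the current bucket: start a new one.
            key += 1
            groups[key] = []
            bucket_last = arr[-1]
        else:
            # Overlap: stay in the bucket and extend its upper bound.
            bucket_last = max(bucket_last, arr[-1])
        groups[key].append(arr)
    return groups

groups = group_overlapping([[1, 2, 3], [2, 3, 4], [7, 8, 9]])
print(groups)
```

Because each decision depends on the previous bucket's upper bound, the loop is inherently sequential, which is why it needs all the data in one place.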
This is what I could put together to help with the situation.
Explanation:
First find, for each array, all subsets with more than one element (the code below calls these "permutations")
Explode the array of subsets
Group by subset and collect the original arrays that contain it
Take the distinct lists of original arrays
import org.apache.spark.sql.functions._
val y = sc.parallelize(Seq(Seq(1,2,3),Seq(2,3,4),Seq(7,8,9))).toDF("arr")
val x = (s:Seq[Int]) => s.toSet[Int].subsets.filter(_.size>1).map(_.toList).toList
val permutations = udf(x)
val a = y.select($"arr", permutations($"arr").as("permutations"))
a.select($"arr", explode($"permutations").as("permutations"))
  .groupBy("permutations")
  .agg(collect_set($"arr").as("groups"))
  .select($"groups")
  .distinct()
  .select(monotonically_increasing_id(), $"groups") // replaces the deprecated monotonicallyIncreasingId
  .show(false)
//+-----------------------------+----------------------+
//|monotonically_increasing_id()|groups                |
//+-----------------------------+----------------------+
//|214748364800                 |[[1, 2, 3], [2, 3, 4]]|
//|412316860416                 |[[7, 8, 9]]           |
//|884763262976                 |[[1, 2, 3]]           |
//|1056561954816                |[[2, 3, 4]]           |
//+-----------------------------+----------------------+
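I hope this gets you started. Note that the result still contains the single-array groups [[1, 2, 3]] and [[2, 3, 4]]; there are plenty of details like that left to handle, and I will leave those to you.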

Map each element of a list in Spark

I'm working with an RDD whose pairs are structured this way: [Int, List[Int]]. My goal is to pair each item of the list with its key. For example, I need to do this:
RDD1:[Int, List[Int]]
<1><[2, 3]>
<2><[3, 5, 8]>
RDD2:[Int, Int]
<1><2>
<1><3>
<2><3>
<2><5>
<2><8>
I can't work out what kind of transformation is needed to get to RDD2. The list of transformations can be found here. Any ideas? Is this the wrong approach?
You can use flatMap:
val rdd1 = sc.parallelize(Seq((1, List(2, 3)), (2, List(3, 5, 8))))
val rdd2 = rdd1.flatMap(x => x._2.map(y => (x._1, y)))
// or:
val rdd2 = rdd1.flatMap{case (key, list) => list.map(nr => (key, nr))}
// print result:
rdd2.collect().foreach(println)
Gives result:
(1,2)
(1,3)
(2,3)
(2,5)
(2,8)
flatMap creates several output objects from one input object.
In your case, the inner map inside flatMap maps a tuple (Int, List[Int]) to a List[(Int, Int)]: the key is the same as in the input tuple, and one output tuple is created for each element of the input list. flatMap then makes each element of that List a separate row in the RDD.
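The difference between map and flatMap here can be seen in a small plain-Python model (no Spark involved). The helper name expand is made up for the illustration, and a list of tuples stands in for the RDD.

```python
from itertools import chain

# Stand-in for RDD1: (key, list of values) pairs.
rdd1 = [(1, [2, 3]), (2, [3, 5, 8])]

def expand(pair):
    """Turn one (key, values) pair into a list of (key, value) pairs."""
    key, values = pair
    return [(key, value) for value in values]

# A plain map keeps one output per input, so the result is nested.
mapped = [expand(pair) for pair in rdd1]

# flatMap additionally flattens the per-input lists into one sequence.
flat_mapped = list(chain.from_iterable(mapped))

print(mapped)
print(flat_mapped)
```

The nested result has 2 entries (one per input pair), while the flattened one has 5 (one per list element), matching RDD2 in the question.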

argmax in Spark DataFrames: how to retrieve the row with the maximum value

Given a Spark DataFrame df, I want to find the maximum value in a certain numeric column 'values', and obtain the row(s) where that value was reached. I can of course do this:
# it doesn't matter if I use scala or python,
# since I hope I get this done with DataFrame API
import pyspark.sql.functions as F
max_value = df.select(F.max('values')).collect()[0][0]
df.filter(df.values == max_value).show()
but this is inefficient since it requires two passes through df.
pandas.Series/DataFrame and numpy.array have argmax/idxmax methods that do this efficiently (in one pass). So does standard python (built-in function max accepts a key parameter, so it can be used to find the index of the highest value).
What is the right approach in Spark? Note that I don't mind whether I get all the rows where the maximum value is reached, or just some arbitrary (non-empty!) subset of those rows.
If the schema is orderable (it contains only atomics, arrays of atomics, or recursively orderable structs), you can use a simple aggregation:
Python:
df.select(F.max(
    F.struct("values", *(x for x in df.columns if x != "values"))
)).first()
Scala:
df.select(max(struct(
  $"values" +: df.columns.collect {case x if x != "values" => col(x)}: _*
))).first
Otherwise you can reduce over Dataset (Scala only) but it requires additional deserialization:
type T = ???
df.reduce((a, b) => if (a.getAs[T]("values") > b.getAs[T]("values")) a else b)
You can also orderBy and limit(1) / take(1):
Scala:
df.orderBy(desc("values")).limit(1)
// or
df.orderBy(desc("values")).take(1)
Python:
df.orderBy(F.desc('values')).limit(1)
# or
df.orderBy(F.desc("values")).take(1)
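The struct-based aggregation and the reduce above both boil down to a single pass that compares whole rows with the 'values' field deciding the order. A plain-Python sketch of that idea follows; it is not Spark code, the sample rows and the helper name as_struct are invented for the illustration, and tuples stand in for structs because they also compare field by field from left to right.

```python
from functools import reduce

# Hypothetical sample rows; only the 'values' column name comes from the question.
rows = [
    {"values": 100, "name": "a"},
    {"values": 1000, "name": "c"},
    {"values": 120, "name": "b"},
]

def as_struct(row):
    """Put 'values' first so that tuple comparison is decided by it,
    with the remaining columns acting only as tie-breakers."""
    others = tuple(row[col] for col in sorted(row) if col != "values")
    return (row["values"],) + others

# Struct trick: the maximum "struct" carries the whole row with it.
best_struct = max(as_struct(row) for row in rows)

# Reduce variant: keep whichever row has the larger 'values'.
best_row = reduce(lambda a, b: a if a["values"] > b["values"] else b, rows)

print(best_struct)
print(best_row)
```

Either way each row is visited once, and the winning row comes out with all of its columns, so no second filtering pass is needed.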
Maybe it's an incomplete answer but you can use DataFrame's internal RDD, apply the max method and get the maximum record using a determined key.
a = sc.parallelize([
    ("a", 1, 100),
    ("b", 2, 120),
    ("c", 10, 1000),
    ("d", 14, 1000)
]).toDF(["name", "id", "salary"])
a.rdd.max(key=lambda x: x["salary"]) # Row(name=u'c', id=10, salary=1000)

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem, but I don't know how to solve it. My data in the RDD is as follows:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 to the elements, zipping the windows with an integer index, and finally outputting a list of tuples keyed by the original key with the window index appended (the values are the windows themselves). We also use flatMap here in order to flatten the results into their own records.
When run in spark-shell, I'm seeing the following output on your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note on this: the output key (e.g. 10, 11) will have 3 digits once a list yields 11 or more windows, that is, 12 or more elements. For example, for the input key 1, you will get the output key 110 for the 11th window. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Given your output key scheme, I would actually suggest something different, such as adding a hyphen between the key and the window index. This prevents collisions later, as you'll see 2-10 and 21-0 instead of 210 for both keys.
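The collision and the hyphen fix can be checked with a small plain-Python sketch (no Spark involved). The function name adjacent_pairs and its sep parameter are made up for this illustration. Unlike Scala's sliding(2), this version emits nothing for a list with fewer than 2 elements.

```python
def adjacent_pairs(records, sep="-"):
    """For each (key, values) record, emit (new_key, [v_i, v_i+1]) for
    every pair of adjacent elements; new_key is key + sep + window index."""
    out = []
    for key, values in records:
        for i in range(len(values) - 1):
            out.append((f"{key}{sep}{i}", values[i:i + 2]))
    return out

# The example from the question, with hyphenated keys.
pairs = adjacent_pairs([(1, ["A", "B", "C", "D"]), (2, ["E", "F", "G"])])
print(pairs)

# Collision check: key 2 with 12 elements (window index 10) vs key 21 (window index 0).
tricky = [(2, list(range(12))), (21, [0, 1])]
plain_keys = [k for k, _ in adjacent_pairs(tricky, sep="")]
hyphen_keys = [k for k, _ in adjacent_pairs(tricky, sep="-")]
print(plain_keys.count("210"), len(set(hyphen_keys)) == len(hyphen_keys))
```

Without a separator the key 210 appears twice in the second example, while the hyphenated keys 2-10 and 21-0 stay distinct.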
