Iterating over a grouped dataset in Spark 1.6 - apache-spark

In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key.
To give some context, I have simplified my problem to the following statement:
In Spark, I need to aggregate strings, grouped by key, until a user stops
"shouting" (that is, until the 2nd char in a string is not uppercase).
Dataset example:
ID, text, timestamps
1, "OMG I like bananas", 123
1, "Bananas are the best", 234
1, "MAN I love banana", 1235
2, "ORLY? I'm more into grapes", 123565
2, "BUT I like apples too", 999
2, "unless you count veggies", 9999
2, "THEN don't forget tomatoes", 999999
The expected result would be:
1, "OMG I like bananas Bananas are the best"
2, "ORLY? I'm more into grapes BUT I like apples too unless you count veggies"
With groupBy and agg, I can't seem to set a condition to stop once a text that is not "shouted" is found.

This only works in Spark 2.1 or above
What you want to do is possible, but it may be very expensive.
First, let's create some test data. As general advice, when you ask something on Stackoverflow please provide something similar to this so people have somewhere to start.
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = List(
(1, "OMG I like bananas", 1),
(1, "Bananas are the best", 2),
(1, "MAN I love banana", 3),
(2, "ORLY? I'm more into grapes", 1),
(2, "BUT I like apples too", 2),
(2, "unless you count veggies", 3),
(2, "THEN don't forget tomatoes", 4)
).toDF("ID", "text", "timestamps")
In order to get a column with the collected texts in order, we need to add a new column using a window function.
Using the spark shell:
scala> val df2 = df.withColumn("coll", collect_list("text").over(Window.partitionBy("id").orderBy("timestamps")))
df2: org.apache.spark.sql.DataFrame = [ID: int, text: string ... 2 more fields]
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> x.collect.foreach(println)
[1,WrappedArray(OMG I like bananas, Bananas are the best, MAN I love banana)]
[2,WrappedArray(ORLY? I'm more into grapes, BUT I like apples too, unless you count veggies, THEN don't forget tomatoes)]
To get the actual text we may need a UDF. Here's mine (I'm far from an expert in Scala, so bear with me)
import scala.collection.mutable
val aggText: Seq[String] => String = (list: Seq[String]) => {
  // Collect texts in order; stop after the first one that is not "shouted"
  def tex(arr: Seq[String], accum: Seq[String]): Seq[String] = arr match {
    case Seq() => accum
    case Seq(single) => accum :+ single
    case Seq(str, xs @ _*) =>
      if (str.length >= 2 && !(str.charAt(0).isUpper && str.charAt(1).isUpper))
        tex(Nil, accum :+ str)
      else
        tex(xs, accum :+ str)
  }
  val res = tex(list, Seq())
  res.mkString(" ")
}
val textUDF = udf(aggText(_: mutable.WrappedArray[String]))
So, we have a dataframe with the collected texts in the proper order, and a Scala function (wrapped as a UDF). Let's piece it together:
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> val y = x.select($"ID", textUDF($"texts"))
y: org.apache.spark.sql.DataFrame = [ID: int, UDF(texts): string]
scala> y.collect.foreach(println)
[1,OMG I like bananas Bananas are the best]
[2,ORLY? I'm more into grapes BUT I like apples too unless you count veggies]
I think this is the result you want.
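For readers who want to check the stopping rule without a Spark cluster, here is a minimal plain-Python sketch of the same logic. It is not Spark code and assumes the data fits in memory; `collect_until_calm` and `aggregate_by_id` are hypothetical names introduced only for this illustration, and the rows are the test data from the answer above.

```python
from itertools import groupby
from operator import itemgetter

# Test rows from the answer above: (ID, text, timestamp).
rows = [
    (1, "OMG I like bananas", 1),
    (1, "Bananas are the best", 2),
    (1, "MAN I love banana", 3),
    (2, "ORLY? I'm more into grapes", 1),
    (2, "BUT I like apples too", 2),
    (2, "unless you count veggies", 3),
    (2, "THEN don't forget tomatoes", 4),
]


def collect_until_calm(texts):
    """Join texts in order, stopping after the first one that is not shouted."""
    collected = []
    for text in texts:
        collected.append(text)
        # Same test as the Scala UDF: stop once a text of length >= 2
        # does not start with two uppercase characters.
        if len(text) >= 2 and not (text[0].isupper() and text[1].isupper()):
            break
    return " ".join(collected)


def aggregate_by_id(rows):
    """Group rows by ID, order each group by timestamp, apply the rule."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    return {
        key: collect_until_calm(text for _, text, _ in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


result = aggregate_by_id(rows)
for key, text in result.items():
    print(key, text)
```

The sort by (ID, timestamp) plays the role of the window's partitionBy and orderBy, and the early `break` is what groupBy and agg alone cannot express.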

Related

How can I use reduceByKey from spark to sum integers in a list?

I have a (key, value) whose value is equal to a list of integers inside a list. I mean:
(Key, Value) = ("aaa", [ [1,2,3],[1,1,1] ])
I want reducebykey summing each value of the same position as below:
("aaa", [1+1,2+1,3+1])
What is the best way to do this using reduceBykey function?
Thank u!
I am not sure why you need reduceByKey here, but here is my solution based on my understanding.
import sparkSession.implicits._
def col2sum(x:Array[Int],y:Array[Int]):Array[Int] = {
x.zipAll(y,0,0).map(pair=>pair._1+pair._2)
}
val kvData = sparkSession.sparkContext.parallelize(Seq(("aaa", Array(Array(1, 2, 3), Array(1, 1, 1)))))
val output = kvData.map(data => (data._1, data._2.reduce(col2sum)))
Converting as DataFrame to check results:
output.toDF("field_1", "field_2").show()
+-------+---------+
|field_1|  field_2|
+-------+---------+
|    aaa|[2, 3, 4]|
+-------+---------+
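The element-wise sum itself can be checked without Spark. The following is a plain-Python sketch of the same idea, assuming the nested lists fit in memory; it reuses the name `col2sum` from the answer, while `kv_data` and `output` are local stand-ins for the RDDs.

```python
from functools import reduce
from itertools import zip_longest


def col2sum(x, y):
    """Add two lists position by position, padding the shorter one with 0."""
    return [a + b for a, b in zip_longest(x, y, fillvalue=0)]


# One (key, list-of-lists) pair, as in the question.
kv_data = [("aaa", [[1, 2, 3], [1, 1, 1]])]

# Reduce the inner lists of each key down to a single summed list.
output = [(key, reduce(col2sum, lists)) for key, lists in kv_data]
print(output)
```

Because `col2sum` is associative and commutative, the same function should also be usable with reduceByKey once each inner list is its own (key, list) record, but that variant is not tested here.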

Building derived column using Spark transformations

I got a table record as stated below.
Id Indicator Date
1 R 2018-01-20
1 R 2018-10-21
1 P 2019-01-22
2 R 2018-02-28
2 P 2018-05-22
2 P 2019-03-05
I need to pick the Ids that had more than one R indicator in the last year and derive a new column called Marked_Flag, set to Y for those Ids and N otherwise. So the expected output should look like this:
Id Marked_Flag
1 Y
2 N
So what I did so far, I took the records in a dataset and then again build another dataset from that. The code looks like below.
Dataset<Row> getIndicators = spark.sql("select id, count(indicator) as indi_count from source where indicator = 'R' group by id");
Dataset<Row> getFlag = spark.sql("select id, case when indi_count > 1 then 'Y' else 'N' end as Marked_Flag from getIndicators");
But my lead wants this to be done using a single dataset and Spark transformations. I am pretty new to Spark, so any guidance or code snippet in this regard would be very helpful.
Try out the following. Note that I am using pyspark DataFrame here
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
[1, "R", "2018-01-20"],
[1, "R", "2018-10-21"],
[1, "P", "2019-01-22"],
[2, "R", "2018-02-28"],
[2, "P", "2018-05-22"],
[2, "P", "2019-03-05"]], ["Id", "Indicator","Date"])
gr = df.filter(F.col("Indicator")=="R").groupBy("Id").agg(F.count("Indicator"))
gr = gr.withColumn("Marked_Flag", F.when(F.col("count(Indicator)") > 1, "Y").otherwise('N')).drop("count(Indicator)")
gr.show()
# +---+-----------+
# | Id|Marked_Flag|
# +---+-----------+
# | 1| Y|
# | 2| N|
# +---+-----------+
#
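One thing to keep in mind with the filter-then-count approach is that an Id with no R rows at all disappears from the result instead of being flagged N. The plain-Python sketch below (not Spark code; `records`, `r_counts` and `marked` are hypothetical names) shows a conditional count that keeps every Id. Like the answer above, it ignores the "last year" date restriction.

```python
from collections import Counter

# Sample records from the question: (Id, Indicator, Date).
records = [
    (1, "R", "2018-01-20"),
    (1, "R", "2018-10-21"),
    (1, "P", "2019-01-22"),
    (2, "R", "2018-02-28"),
    (2, "P", "2018-05-22"),
    (2, "P", "2019-03-05"),
]

# Count only the "R" rows per Id; a Counter returns 0 for Ids it never saw.
r_counts = Counter(rec_id for rec_id, indicator, _ in records if indicator == "R")

# Flag every Id present in the data, including those without any "R" row.
all_ids = sorted({rec_id for rec_id, _, _ in records})
marked = {rec_id: ("Y" if r_counts[rec_id] > 1 else "N") for rec_id in all_ids}
print(marked)
```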

Efficient way to reduceByKey ignoring keys not in another RDD?

I have a large collection of data in pyspark. The format is key-value pairs, and I need to do a reducebykey operation, but ignoring all data whose key isn't in an RDD of 'interesting' keys that I also have.
I found a solution on SO that utilizes the subtractbykey operation to achieve this. It works, but crashes due to low memory on my cluster. I have not been able to change this with tweaking the settings, so I'm hoping there's a more efficient solution.
Here's my solution that works on smaller datasets:
# The keys I'm interested in
edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "d")])
# Data containing both interesting and uninteresting stuff
data1 = sc.parallelize([(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])])
data2 = sc.parallelize([(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])])
all_data = [data1, data2]
mask = edges.map(lambda t: (tuple(t), None))
rdds = []
for datum in all_data:
    combined = datum.reduceByKey(lambda a, b: a+b)
    unwanted = combined.subtractByKey(mask)
    wanted = combined.subtractByKey(unwanted)
    rdds.append(wanted)
edge_alltimes = sc.union(rdds).reduceByKey(lambda a,b: a+b)
edge_alltimes.collect()
As desired, this outputs [(('a', 'd'), [13, 37]), (('a', 'b'), [42, 43]), (('b', 'c'), [23, 24])]
(i.e. data for the 'interesting' key tuples have been combined and the rest has been dropped).
The reason I have the data in several RDDs is to mimic behavior on my cluster where I can't load all the data simultaneously due to its size.
Any help would be great.
Here is an example with join. A small drawback is that you need an RDD of pairs before the join, and you need to strip the extra data after the join.
import org.apache.spark.{SparkConf, SparkContext}
object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)
  def main(args: Array[String]): Unit = {
    val goodKeys = sc.parallelize(Seq(1, 2))
    val allData = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    // Turn the keys into pairs so that they can be joined
    val goodPairs = goodKeys.map(v => (v, 0))
    // The join keeps only matching keys; mapValues strips the dummy value
    val goodData = allData.join(goodPairs).mapValues(p => p._1)
    goodData.collect().foreach(println)
  }
}
Output:
(1,a)
(2,b)
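If the set of interesting keys is small enough to fit in memory, another option is to drop unwanted records before any reduce step, so they never take part in the aggregation. The sketch below shows that idea in plain Python on the question's sample data. It is not Spark code, and `reduce_interesting` is a hypothetical helper; in PySpark the membership test would presumably sit in an `rdd.filter(...)` against a broadcast set, which is an untested assumption here.

```python
from collections import defaultdict

# The keys we care about, held as an in-memory set for fast membership tests.
edges = {("a", "b"), ("b", "c"), ("a", "d")}

# Two batches of data, mimicking data that cannot be loaded all at once.
data1 = [(("a", "b"), [42]), (("a", "c"), [60]), (("a", "d"), [13, 37])]
data2 = [(("a", "b"), [43]), (("b", "c"), [23, 24]), (("a", "c"), [13, 37])]


def reduce_interesting(batches, wanted_keys):
    """Concatenate values per key, skipping keys that are not wanted."""
    combined = defaultdict(list)
    for batch in batches:
        for key, values in batch:
            if key in wanted_keys:  # filter first, reduce second
                combined[key].extend(values)
    return dict(combined)


edge_alltimes = reduce_interesting([data1, data2], edges)
print(edge_alltimes)
```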

how to select by max(date) with spark dataframe API

Given the following dataset
id v date
1 a1 1
1 a2 2
2 b1 3
2 b2 4
I want to select only the last value (regarding the date) for each id.
I've come up with this code:
scala> val df = sc.parallelize(List((1,"a1",1), (1, "a2", 2), (2, "b1", 3), (2, "b2", 4))).toDF("id", "v", "date")
df: org.apache.spark.sql.DataFrame = [id: int, v: string, date: int]
scala> val agg = df.groupBy("id").max("date")
agg: org.apache.spark.sql.DataFrame = [id: int, max(date): int]
scala> val res = df.join(agg, df("id") === agg("id") && df("date") === agg("max(date)"))
16/11/14 22:25:01 WARN sql.Column: Constructing trivially true equals predicate, 'id#3 = id#3'. Perhaps you need to use aliases.
res: org.apache.spark.sql.DataFrame = [id: int, v: string, date: int, id: int, max(date): int]
Is there a better (more idiomatic) way?
Bonus: how can I perform a max on a date column and avoid the error "Aggregation function can only be applied on a numeric column"?
You can try agg() with the max function :
import static org.apache.spark.sql.functions.*;
df.groupBy("id").agg(max("date"));
For me it only worked that way:
df = df.groupBy('CPF').agg({'DATA': 'max'})
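Note that a max aggregation alone returns only the latest date per id, not the value that goes with it. The plain-Python sketch below spells out the "latest row per id" logic that the join in the question expresses. It is not Spark code; `rows` and `latest` are hypothetical names, and the data is the question's sample dataset.

```python
# Sample dataset from the question: (id, v, date).
rows = [(1, "a1", 1), (1, "a2", 2), (2, "b1", 3), (2, "b2", 4)]

# Keep, for each id, the (v, date) pair with the largest date seen so far.
latest = {}
for row_id, value, date in rows:
    if row_id not in latest or date > latest[row_id][1]:
        latest[row_id] = (value, date)

print(latest)
```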

What is the difference between map and flatMap and a good use case for each?

Can someone explain to me the difference between map and flatMap and what is a good use case for each?
What does "flatten the results" mean?
What is it good for?
Here is an example of the difference, as a spark-shell session:
First, some data - two lines of text:
val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines
rdd.collect
res0: Array[String] = Array("Roses are red", "Violets are blue")
Now, map transforms an RDD of length N into another RDD of length N.
For example, it maps from two lines into two line-lengths:
rdd.map(_.length).collect
res1: Array[Int] = Array(13, 16)
But flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.
rdd.flatMap(_.split(" ")).collect
res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")
We have multiple words per line, and multiple lines, but we end up with a single output array of words
Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like:
["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]
The input and output RDDs will therefore typically be of different sizes for flatMap.
If we had tried to use map with our split function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]) because we have to have exactly one result per input:
rdd.map(_.split(" ")).collect
res3: Array[Array[String]] = Array(
Array(Roses, are, red),
Array(Violets, are, blue)
)
Finally, one useful special case is mapping with a function which might not return an answer, and so returns an Option. We can use flatMap to filter out the elements that return None and extract the values from those that return a Some:
val rdd = sc.parallelize(Seq(1,2,3,4))
def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None
rdd.flatMap(myfn).collect
res4: Array[Int] = Array(10,20)
(noting here that an Option behaves rather like a list that has either one element, or zero elements)
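The same "zero or one result" idea can be mimicked in plain Python, which has no Option type. In this sketch (not Spark code) an empty list stands in for None and a one-element list for Some, and `itertools.chain.from_iterable` plays the role of the flattening step.

```python
from itertools import chain


def myfn(x):
    """Return a one-element list for small inputs, an empty list otherwise."""
    return [x * 10] if x <= 2 else []


numbers = [1, 2, 3, 4]

# map-like: exactly one result per input, so empty lists are kept.
mapped = [myfn(x) for x in numbers]

# flatMap-like: the per-element lists are concatenated, so empties vanish.
flat_mapped = list(chain.from_iterable(myfn(x) for x in numbers))

print(mapped)
print(flat_mapped)
```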
The word count example is commonly used in Hadoop. I will take the same use case and apply both map and flatMap to see how each processes the data.
Below is the sample data file.
hadoop is fast
hive is sql on hdfs
spark is superfast
spark is awesome
The above file will be parsed using map and flatMap.
Using map
>>> wc = data.map(lambda line:line.split(" "));
>>> wc.collect()
[[u'hadoop', u'is', u'fast'], [u'hive', u'is', u'sql', u'on', u'hdfs'], [u'spark', u'is', u'superfast'], [u'spark', u'is', u'awesome']]
Input has 4 lines and output size is 4 as well, i.e., N elements ==> N elements.
Using flatMap
>>> fm = data.flatMap(lambda line:line.split(" "));
>>> fm.collect()
[u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs', u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']
The output is different from map.
Let's assign 1 as value for each key to get the word count.
fm: RDD created by using flatMap
wc: RDD created using map
>>> fm.map(lambda word : (word,1)).collect()
[(u'hadoop', 1), (u'is', 1), (u'fast', 1), (u'hive', 1), (u'is', 1), (u'sql', 1), (u'on', 1), (u'hdfs', 1), (u'spark', 1), (u'is', 1), (u'superfast', 1), (u'spark', 1), (u'is', 1), (u'awesome', 1)]
Whereas flatMap on RDD wc will give the below undesired output:
>>> wc.flatMap(lambda word : (word,1)).collect()
[[u'hadoop', u'is', u'fast'], 1, [u'hive', u'is', u'sql', u'on', u'hdfs'], 1, [u'spark', u'is', u'superfast'], 1, [u'spark', u'is', u'awesome'], 1]
You can't get the word count if map is used instead of flatMap.
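To finish the word count end to end, the (word, 1) pairs still have to be summed per word. The sketch below walks through all three steps in plain Python on the sample file's lines. It is not Spark code, and the final loop stands in for what a reduceByKey with addition would do.

```python
from collections import Counter
from itertools import chain

# The four lines of the sample data file.
lines = [
    "hadoop is fast",
    "hive is sql on hdfs",
    "spark is superfast",
    "spark is awesome",
]

# flatMap-like step: split every line and flatten into one list of words.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map-like step: pair each word with a count of 1.
pairs = [(word, 1) for word in words]

# Reduce step: add up the 1s for each distinct word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts.most_common(2))
```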
As per the definition, difference between map and flatMap is:
map: It returns a new RDD by applying given function to each element
of the RDD. Function in map returns only one item.
flatMap: Similar to map, it returns a new RDD by applying a function
to each element of the RDD, but output is flattened.
It boils down to your initial question: what is meant by flattening?
When you use flatMap, a "multi-dimensional" collection becomes a "one-dimensional" collection.
val array1d = Array("1,2,3", "4,5,6", "7,8,9")
// array1d is an array of strings
val array2d = array1d.map(x => x.split(","))
// array2d will be: Array(Array("1","2","3"), Array("4","5","6"), Array("7","8","9"))
val flatArray = array1d.flatMap(x => x.split(","))
// flatArray will be: Array("1","2","3","4","5","6","7","8","9")
You want to use flatMap when
your map function would create multi-layered structures,
but all you want is a simple, flat, one-dimensional structure with ALL the internal groupings removed.
All the examples above are good. Here is a summary, source courtesy of DataFlair's Spark training.
Map: map is a transformation operation in Apache Spark. It applies to each element of an RDD and returns the result as a new RDD. In the map operation, the developer can define custom business logic, and the same logic is applied to all the elements of the RDD.
The Spark RDD map function takes one element as input, processes it according to custom code (specified by the developer), and returns one element at a time. map transforms an RDD of length N into another RDD of length N, so the input and output RDDs have the same number of records.
Example of map using scala :
val x = spark.sparkContext.parallelize(List("spark", "map", "example", "sample", "example"), 3)
val y = x.map(x => (x, 1))
y.collect
// res0: Array[(String, Int)] =
// Array((spark,1), (map,1), (example,1), (sample,1), (example,1))
// RDD y can be rewritten with a shorter syntax in Scala as
val y = x.map((_, 1))
y.collect
// res1: Array[(String, Int)] =
// Array((spark,1), (map,1), (example,1), (sample,1), (example,1))
// Another example: making a tuple of each string and its length
val y = x.map(x => (x, x.length))
y.collect
// res3: Array[(String, Int)] =
// Array((spark,5), (map,3), (example,7), (sample,6), (example,7))
flatMap:
flatMap is a transformation operation. It applies to each element of an RDD and returns the result as a new RDD. It is similar to map, but flatMap allows the function to return 0, 1 or more elements. In the flatMap operation, the developer can define custom business logic, and the same logic is applied to all the elements of the RDD.
What does "flatten the results" mean?
A flatMap function takes one element as input, processes it according to custom code (specified by the developer), and returns 0 or more elements at a time. flatMap() transforms an RDD of length N into another RDD of length M.
Example of flatMap using scala :
val x = spark.sparkContext.parallelize(List("spark flatmap example", "sample example"), 2)
// map operation will return Array of Arrays in following case : check type of res0
val y = x.map(x => x.split(" ")) // split(" ") returns an array of words
y.collect
// res0: Array[Array[String]] =
// Array(Array(spark, flatmap, example), Array(sample, example))
// flatMap operation will return Array of words in following case : Check type of res1
val y = x.flatMap(x => x.split(" "))
y.collect
//res1: Array[String] =
// Array(spark, flatmap, example, sample, example)
// RDD y can be re written with shorter syntax in scala as
val y = x.flatMap(_.split(" "))
y.collect
//res2: Array[String] =
// Array(spark, flatmap, example, sample, example)
If you are asking about the difference between RDD.map and RDD.flatMap in Spark, map transforms an RDD of size N into another one of size N, e.g.
myRDD.map(x => x*2)
if, for example, myRDD is composed of Doubles.
flatMap, on the other hand, can transform the RDD into another one of a different size, e.g.
myRDD.flatMap(x => Seq(2*x, 3*x))
which will return an RDD of size 2*N,
or
myRDD.flatMap(x => if (x < 10) Seq(2*x, 3*x) else Seq(x))
Use test.md as an example:
➜ spark-1.6.1 cat test.md
This is the first line;
This is the second line;
This is the last line.
scala> val textFile = sc.textFile("test.md")
scala> textFile.map(line => line.split(" ")).count()
res2: Long = 3
scala> textFile.flatMap(line => line.split(" ")).count()
res3: Long = 15
scala> textFile.map(line => line.split(" ")).collect()
res0: Array[Array[String]] = Array(Array(This, is, the, first, line;), Array(This, is, the, second, line;), Array(This, is, the, last, line.))
scala> textFile.flatMap(line => line.split(" ")).collect()
res1: Array[String] = Array(This, is, the, first, line;, This, is, the, second, line;, This, is, the, last, line.)
With map, count() returns the number of lines in test.md (3); with flatMap, it returns the number of words (15).
map and flatMap are similar in that both return a new RDD. map is typically used to transform each element into exactly one new element, while flatMap is typically used for tasks such as splitting lines into words.
map and flatMap are similar in the sense that they take an element (here, a line) from the input RDD and apply a function to it. They differ in that the function in map returns exactly one element, while the function in flatMap can return a list of elements (0 or more) as an iterator.
Also, the output of flatMap is flattened. Although the function in flatMap returns a list of elements, flatMap returns an RDD that contains all the elements from those lists in a flat way (not a list of lists).
map returns an RDD with the same number of elements as its input, while flatMap may not.
An example use case for flatMap: filtering out missing or incorrect data.
An example use case for map: the wide variety of cases where the number of input and output elements is the same.
number.csv
1
2
3
-
4
-
5
map.py adds all the numbers in number.csv, treating missing values as 0.
from operator import add
def f(row):
    try:
        return float(row)
    except Exception:
        # Missing or malformed value: count it as 0
        return 0
rdd = sc.textFile('number.csv').map(f)
print(rdd.count()) # 7
print(rdd.reduce(add)) # 15.0
flatMap.py uses flatMap to filter out missing data before the addition. Fewer numbers are added than in the previous version.
from operator import add
def f(row):
    try:
        return [float(row)]
    except Exception:
        # Missing or malformed value: emit nothing
        return []
rdd = sc.textFile('number.csv').flatMap(f)
print(rdd.count()) # 5
print(rdd.reduce(add)) # 15.0
The difference can be seen from below sample pyspark code:
rdd = sc.parallelize([2, 3, 4])
rdd.flatMap(lambda x: range(1, x)).collect()
Output:
[1, 1, 2, 1, 2, 3]
rdd.map(lambda x: range(1, x)).collect()
Output:
[[1], [1, 2], [1, 2, 3]]
map: It returns a new RDD by applying a function to each element of the RDD. Function in .map can return only one item.
flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.
Also, function in flatMap can return a list of elements (0 or more)
For Example:
sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()
Output: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()
Output (note that it is flattened into a single list): [1, 2, 1, 2, 3, 1, 2, 3, 4]
Source:https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey/
RDD.map returns the elements as an array of arrays (one inner array per input line).
RDD.flatMap returns all the elements in a single flat array.
Let's assume we have the following text in a text.txt file:
Spark is an expressive framework
This text is to understand map and flatMap functions of Spark RDD
Using map
val text=sc.textFile("text.txt").map(_.split(" ")).collect
output:
text: Array[Array[String]] = Array(Array(Spark, is, an, expressive, framework), Array(This, text, is, to, understand, map, and, flatMap, functions, of, Spark, RDD))
Using flatMap
val text=sc.textFile("text.txt").flatMap(_.split(" ")).collect
output:
text: Array[String] = Array(Spark, is, an, expressive, framework, This, text, is, to, understand, map, and, flatMap, functions, of, Spark, RDD)
flatMap and map both transform the collection.
Difference:
map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.
flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
The transformation function:
map: One element in -> one element out.
flatMap: One element in -> 0 or more elements out (a collection).
For those who want a PySpark example:
Example transformation: flatMap
>>> a="hello what are you doing"
>>> a.split()
['hello', 'what', 'are', 'you', 'doing']
>>> b=["hello what are you doing","this is rak"]
>>> b.split()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'split'
>>> rline=sc.parallelize(b)
>>> type(rline)
<class 'pyspark.rdd.RDD'>
>>> def fwords(x):
...     return x.split()
...
>>> rword=rline.map(fwords)
>>> rword.collect()
[['hello', 'what', 'are', 'you', 'doing'], ['this', 'is', 'rak']]
>>> rwordflat=rline.flatMap(fwords)
>>> rwordflat.collect()
['hello', 'what', 'are', 'you', 'doing', 'this', 'is', 'rak']
Hope it helps :)
map: a higher-order method that takes a function as input and applies it to each element in the source RDD.
flatMap: a higher-order method and transformation operation that takes an input function.
Source: http://commandstech.com/difference-between-map-and-flatmap-in-spark-what-is-map-and-flatmap-with-examples/
map
Return a new RDD by applying a function to each element of this RDD.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.map(lambda x: [(x, x), (x, x)]).collect())
[[(2, 2), (2, 2)], [(3, 3), (3, 3)], [(4, 4), (4, 4)]]
flatMap
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
Here, transforming one element into many elements is possible.
>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
[(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
map(func): returns a new distributed dataset formed by passing each element of the source through the function func, so map() yields a single item per input,
while
flatMap(func): similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence rather than a single item.
Difference in output of map and flatMap:
1.flatMap
val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect()
Output:
1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
2.map:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.map(_.length).collect()
Output:
3 6 6 3 8
