Spark GroupBy Aggregate functions - apache-spark

case class Step (Id : Long,
stepNum : Long,
stepId : Int,
stepTime: java.sql.Timestamp
)
I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" col.
My output should look like Dataset[(Long, List[Step])]. How do I do this?
Let's say the variable "inquiryStepMap" is of type Dataset[Step]; with RDDs we can do this as follows:
val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)

It seems you need groupByKey:
Sample:
import java.sql.Timestamp
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0)
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()
groupByKey and then mapGroups:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]
And the result looks like:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()
+---+--------------------+
| _1| _2|
+---+--------------------+
| 1|[[1,21,1,3917-06-...|
| 2|[[2,10,3,3917-06-...|
+---+--------------------+
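The year 3917 in the output above is not a display glitch. The sample builds its timestamp with the deprecated Timestamp(int, int, ...) constructor, which expects the year minus 1900 and a zero-based month, so new Timestamp(2017, 5, 1, ...) lands in June 3917. A minimal sketch of a safer way to build the sample value, using java.sql.Timestamp.valueOf (the names t and tFromLocal are only for illustration):

```scala
import java.sql.Timestamp
import java.time.LocalDateTime

// Parse a JDBC-style "yyyy-mm-dd hh:mm:ss" string; no year-minus-1900 offset is involved.
val t: Timestamp = Timestamp.valueOf("2017-05-01 00:00:00")

// Equivalent construction from java.time, where the month is 1-based.
val tFromLocal: Timestamp = Timestamp.valueOf(LocalDateTime.of(2017, 5, 1, 0, 0, 0))

println(t) // 2017-05-01 00:00:00.0
```

With a timestamp built this way, the grouped rows would carry 2017-05-01 instead of a date in 3917.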

Related

Need to get fields from lookup file with calculated values using spark dataframe

I have a products file as shown below, and the formulas for cost price and discount are in a lookup file. I need to compute the cost price and discount fields using the formulas from the lookup file.
product_id,product_name,marked_price,selling_price,profit
101,AAAA,5500,5400,500
102,ABCS,7000,6500,1000
103,GHMA,6500,5600,700
104,PSNLA,8450,8000,800
105,GNBC,1250,1200,600
lookup file:
key,value
cost_price,(selling_price+profit)
discount,(marked_price-selling_price)
Final output:
product_id,product_name,marked_price,selling_price,profit,cost_price,discount
101,AAAA,5500,5400,500,5900,100
102,ABCS,7000,6500,1000,7500,500
103,GHMA,6500,5600,700,6300,900
104,PSNLA,8450,8000,800,8800,450
105,GNBC,1250,1200,600,1800,50
First, you should make a Map out of your lookup file, then you can add the columns using expr:
import org.apache.spark.sql.functions.expr
val lookup = Map(
"cost_price" -> "selling_price+profit",
"discount" -> "(marked_price-selling_price)"
)
val df = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000)
)
.toDF("product_id", "product_name", "marked_price", "selling_price", "profit")
df
.withColumn("cost_price",expr(lookup("cost_price")))
.withColumn("discount",expr(lookup("discount")))
.show()
gives :
+----------+------------+------------+-------------+------+----------+--------+
|product_id|product_name|marked_price|selling_price|profit|cost_price|discount|
+----------+------------+------------+-------------+------+----------+--------+
| 101| AAAA| 5500| 5400| 500| 5900| 100|
| 102| ABCS| 7000| 6500| 1000| 7500| 500|
+----------+------------+------------+-------------+------+----------+--------+
You can also iterate over your lookup:
val finalDF = lookup.foldLeft(df){case (df,(k,v)) => df.withColumn(k,expr(v))}
import spark.implicits._
import org.apache.spark.sql.functions.expr
val dataheader = Seq("product_id", "product_name", "marked_price", "selling_price", "profit")
val data = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000),
(103, "GHMA", 6500, 5600, 700),
(104, "PSNLA", 8450, 8000, 800),
(105, "GNBC", 1250, 1200, 600))
val dataDF = data.toDF(dataheader: _*)
val lookupMap = Seq(
("cost_price", "(selling_price+profit)"),
("discount", "(marked_price-selling_price)")) // You can also read the lookup file into a DataFrame and build the map from it
.toDF("key", "value").select("key", "value").as[(String, String)].collect.toMap
lookupMap.foldLeft(dataDF)((df, look) => {
df.withColumn(look._1, expr(look._2))
}).show()
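Both answers hard-code the lookup. If it really lives in a file with a key,value header as in the question, one way to build the Map is to split each line on the first comma. A minimal sketch on plain Scala collections; the helper parseLookup and the in-memory lookupLines are assumptions for illustration (in practice the lines might come from scala.io.Source or from a Spark read of the file):

```scala
// Hypothetical helper: turn "key,value" lines into a Map of column name -> SQL expression.
def parseLookup(lines: Seq[String]): Map[String, String] =
  lines
    .map(_.trim)
    .filter(_.nonEmpty)
    .drop(1) // skip the "key,value" header row
    .map { line =>
      // Split on the first comma only, so an expression containing commas stays intact.
      val parts = line.split(",", 2)
      parts(0).trim -> parts(1).trim
    }
    .toMap

// Stand-in for the contents of the lookup file from the question.
val lookupLines = Seq(
  "key,value",
  "cost_price,(selling_price+profit)",
  "discount,(marked_price-selling_price)"
)

val lookupFromFile: Map[String, String] = parseLookup(lookupLines)
```

The resulting map has the same shape as the lookup maps in the answers, so it can be folded over the DataFrame with withColumn and expr in the same way.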

Loading Data into Spark Dataframe without delimiters in source

I have a dataset with no delimiters:
111222333444
555666777888
Desired output:
|_c1_|_c2_|_c3_|_c4_|
|111 |222 |333 |444 |
|555 |666 |777 |888 |
I have tried this to get the output:
val myDF = spark.sparkContext.textFile("myFile").toDF()
val myNewDF = myDF.withColumn("c1", substring(col("value"), 0, 3))
.withColumn("c2", substring(col("value"), 3, 6))
.withColumn("c3", substring(col("value"), 6, 9)
.withColumn("c4", substring(col("value"), 9, 12))
.drop("value")
.show()
But I need to manipulate c4 (multiply it by 100), and its datatype is string, not double.
Update: I ran into the following scenario.
When I execute this,
val myNewDF = myDF.withColumn("c1", expr("substring(value, 0, 3)"))
.withColumn("c2", expr("substring(value, 3, 6"))
.withColumn("c3", expr("substring(value, 6, 9)"))
.withColumn("c4", (expr("substring(value, 9, 12)").cast("double") * 100))
.drop("value")
myNewDF.show(5,false) // it only shows "value" column (which i dropped) and "c1" column
myNewDF.printSchema // only showing 2 rows. why is it not showing all the newly created 4 columns?
Create test dataframe:
scala> var df = Seq(("111222333444"),("555666777888")).toDF("s")
df: org.apache.spark.sql.DataFrame = [s: string]
Split column s into an array of 3-character chunks:
scala> var res = df.withColumn("temp",split(col("s"),"(?<=\\G...)"))
res: org.apache.spark.sql.DataFrame = [s: string, temp: array<string>]
Map array elements to new columns:
scala> res = res.select((1 until 5).map(i => col("temp").getItem(i-1).as("c"+i)):_*)
res: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
scala> res.show(false)
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
Leaving a little for you to work out yourself, such as reading the file and naming your Dataset / DataFrame columns explicitly, this simulated approach with an RDD should help you on your way:
val rdd = sc.parallelize(Seq(("111222333444"),
("555666777888")
)
)
val df = rdd.map(x => (x.slice(0,3), x.slice(3,6), x.slice(6,9), x.slice(9,12))).toDF()
df.show(false)
returns:
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
Or, using DataFrames:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("111222333444"),
("555666777888"))
).toDF()
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)")).withColumn("c4", expr("substring(value, 10, 3)"))
df2.show(false)
returns:
+------------+---+---+---+---+
|value |c1 |c2 |c3 |c4 |
+------------+---+---+---+---+
|111222333444|111|222|333|444|
|555666777888|555|666|777|888|
+------------+---+---+---+---+
You can drop the value column; I leave that up to you.
This is like the answer above, but it gets complicated if the chunks are not all of size 3.
For your updated question (cast to double and multiply by 100):
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)"))
.withColumn("c4", (expr("substring(value, 10, 3)").cast("double") * 100))
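The answers use substring(value, 1, 3), substring(value, 4, 3) and so on because Spark SQL's substring takes a 1-based start position and a length, not a start and end offset as in the question's attempt. The same fixed-width split, plus the numeric conversion of the last field, can be sketched without Spark. Here splitFixed is a hypothetical helper and the width of 3 is taken from the sample data:

```scala
// Hypothetical helper: cut a line into consecutive fields of the given width.
def splitFixed(line: String, width: Int): List[String] =
  line.grouped(width).toList

val rows = Seq("111222333444", "555666777888")

// Keep c1..c3 as strings; convert c4 to Double and multiply by 100.
val parsed = rows.map { line =>
  val fields = splitFixed(line, 3)
  (fields(0), fields(1), fields(2), fields(3).toDouble * 100)
}

parsed.foreach(println)
```

If the fields had different widths, the helper would need a list of widths instead of a single value, which is the complication the answer above alludes to.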

In spark dataframe for map column how to update values with a constant for all keys

I have a Spark DataFrame with two columns, of type Integer and Map, and I want to know the best way to update the values for all keys of the map column.
With the help of a UDF, I am able to update the values:
def modifyValues = (map_data: Map[String, Int]) => {
val divideWith = 10
map_data.mapValues( _ / divideWith)
}
val modifyMapValues = udf(modifyValues)
ds.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> ds.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
val ds = Seq(
(1, Map("foo" -> 100, "bar" -> 200)),
(2, Map("foo" -> 200)),
(3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+-----------------------+
|id |data_map |
+---+-----------------------+
|1 |[foo -> 10, bar -> 20] |
|2 |[foo -> 20] |
|3 |[bar -> 20] |
+---+-----------------------+
Is there any way to do this without a UDF?
One possible way to do it (without a UDF) is this:
extract keys using map_keys to an array
extract values using map_values to an array
transform extracted values using TRANSFORM (available since Spark 2.4)
create back the map using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}
ds
.withColumn("values", map_values($"data_map"))
.withColumn("keys", map_keys($"data_map"))
.withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
.withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
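To make the four steps concrete, here is the same keys / values / transform / rebuild sequence on an ordinary Scala Map. This is only an analogy for what the DataFrame functions do per row, not Spark code; divideWith mirrors the constant from the question. Note that this sketch uses integer division, whereas v/10 inside a Spark SQL expression yields a double.

```scala
val dataMap = Map("foo" -> 100, "bar" -> 200)
val divideWith = 10

val keys = dataMap.keys.toSeq                               // analogous to map_keys
val values = dataMap.values.toSeq                           // analogous to map_values
val valuesTransformed = values.map(_ / divideWith)          // analogous to TRANSFORM(values, v -> ...)
val dataMapTransformed = keys.zip(valuesTransformed).toMap  // analogous to map_from_arrays
```

The rebuild step relies on keys and values being listed in the same order, which holds here because both come from the same immutable map.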
import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, **{'new_key':77}}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
my_update_udf = sp.udf(lambda x: {k: v / 77 for k, v in x.items() if v is not None and k == 'my_key'}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
There is another way available in Spark 3:
Seq(
Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
.withColumn(
"mapColumn",
map_concat(
map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
)
)
.show(false)
output:
+----------------------------------------+
|mapColumn |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps
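The map_concat / map_filter combination above amounts to "remove the old entry, then add the recomputed one". A plain-Scala analogy of that idea (not Spark code; updateKey is a hypothetical helper, and in this sketch a key that is missing from the map is simply left absent):

```scala
// Hypothetical helper mirroring the filter-then-concatenate idea on a plain Map.
def updateKey(m: Map[String, Int], key: String)(f: Int => Int): Map[String, Int] = {
  val rest = m.filter { case (k, _) => k != key }          // like map_filter: everything except the key
  val updated = m.get(key).map(v => key -> f(v)).toMap     // the recomputed entry, if the key exists
  rest ++ updated                                          // like map_concat: union of the two maps
}

val updatedMap = updateKey(Map("keyToUpdate" -> 11, "someOtherKey" -> 12), "keyToUpdate")(_ * 10)
```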

DataFrame and DataSet - converting values to <k,v> pair

Sample Input (black coloured text) and Output (red coloured text)
I have a DataFrame (one in black), how can I transform it to one like in red?
(column number, value)
[Image is attached]
val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("file:/home/hduser/Desktop/Demo.csv")
case class Employee(EmpId: String, Experience: Double, Salary: Double)
val ds = df.as[Employee]
I need the solution in both DataFrame and DataSet way.
Thank you in advance! :-)
I believe a struct is what you want when you say pair. Check whether the code below gives your expected output.
With DataFrame:
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
val data = Seq(("111",5,50000),("222",6,60000),("333",7,60000))
val df = data.toDF("EmpId","Experience","Salary")
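// Wrap each column in a (key, value) struct; the key is the column number.
val newdf = df.withColumn("EmpId", struct(lit("1").as("key"),col("EmpId").as("value")))
.withColumn("Experience", struct(lit("2").as("key"),col("Experience").as("value")))
.withColumn("Salary", struct(lit("3").as("key"),col("Salary").as("value")))
// Call show separately so that newdf stays a DataFrame instead of Unit.
newdf.show(false)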
output:
+--------+----------+----------+
|EmpId |Experience|Salary |
+--------+----------+----------+
|[1, 111]|[2, 5] |[3, 50000]|
|[1, 222]|[2, 6] |[3, 60000]|
|[1, 333]|[2, 7] |[3, 60000]|
+--------+----------+----------+
With Dataset:
First you need to define case classes for the new structure; otherwise you can't create a Dataset:
case class EmpData(key: String, value: String)
case class Employee2(EmpId: EmpData, Experience: EmpData, Salary: EmpData)
val ds = df.as[Employee]
val newDS = ds.map(rec=>{
(EmpData("1",rec.EmpId), EmpData("2",rec.Experience.toString),EmpData("3",rec.Salary.toString))
})
val finalDS = newDS.toDF("EmpId","Experience","Salary").as[Employee2]
finalDS.show(false)
Output:
+--------+--------+------------+
|EmpId |Experience|Salary |
+--------+--------+------------+
|[1, 111]|[2, 5] |[3, 50000] |
|[1, 222]|[2, 6] |[3, 60000] |
|[1, 333]|[2, 7] |[3, 60000] |
+--------+--------+------------+
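If the goal is literally (column number, value) for every field, a case class already exposes its fields in declaration order through productIterator. The sketch below shows that idea on a single record outside Spark; toPairs is a hypothetical helper, and values are rendered as strings, as in the EmpData class above.

```scala
case class Employee(EmpId: String, Experience: Double, Salary: Double)

// Hypothetical helper: pair each field with its 1-based column number.
def toPairs(p: Product): List[(Int, String)] =
  p.productIterator.zipWithIndex.map { case (v, i) => (i + 1, v.toString) }.toList

val pairs = toPairs(Employee("111", 5.0, 50000.0))
// Double fields keep their decimal point when converted with toString, e.g. (2,5.0).
```

This avoids hard-coding "1", "2", "3" per column, at the cost of losing the original field types.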
Thanks

Flatten rows in sliding window using Spark

I'm processing a large number of rows from either a database or a file using Apache Spark. Part of the processing creates a sliding window of 3 rows, where the rows need to be flattened and additional calculations performed on the flattened rows. Below is a simplified example of what I am trying to do.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions.Window
object Main extends App {
val ss = SparkSession.builder().appName("DataSet Test")
.master("local[*]").getOrCreate()
import ss.implicits._
case class Foo(a:Int, b:String )
// rows from database or file
val foos = Seq(Foo(-18, "Z"),
Foo(-11, "G"),
Foo(-8, "A"),
Foo(-4, "C"),
Foo(-1, "F")).toDS()
// work on 3 rows
val sliding_window_spec = Window.orderBy(desc("a")).rowsBetween( -2, 0)
// flattened object with example computations
case class FooResult(a1:Int, b1:String, a2:Int, b2:String, a3:Int, b3:String, computation1:Int, computation2:String )
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
// expected results
val fooResults = Seq(FooResult( -1, "F", -4, "C", -8, "A", -5, "FCA" ),
FooResult( -4, "C", -8, "A", -11, "G", -12, "CAG" ),
FooResult( -8, "A", -11, "G", -18, "Z", -19, "AGZ" )).toDS()
ss.stop()
}
How can I convert the foos into the fooResults? I'm using Apache Spark 2.3.0
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
You can use the built-in collect_list function over the window you have already defined, and then do the computation and the flattening in a udf function. Finally, filter out the incomplete windows and expand the struct column to get the desired result (this needs udf, collect_list and col from org.apache.spark.sql.functions):
def slidingUdf = udf((list1: Seq[Int], list2:Seq[String])=> {
if(list1.size < 3) null
else {
val zipped = list1.zip(list2)
FooResult(zipped(0)._1, zipped(0)._2, zipped(1)._1, zipped(1)._2, zipped(2)._1, zipped(2)._2, zipped(0)._1+zipped(1)._1, zipped(0)._2+zipped(1)._2+zipped(2)._2)
}
})
foos.select(slidingUdf(collect_list("a").over(sliding_window_spec), collect_list("b").over(sliding_window_spec)).as("test"))
.filter(col("test").isNotNull)
.select(col("test.*"))
.show(false)
which should give you
+---+---+---+---+---+---+------------+------------+
|a1 |b1 |a2 |b2 |a3 |b3 |computation1|computation2|
+---+---+---+---+---+---+------------+------------+
|-1 |F |-4 |C |-8 |A |-5 |FCA |
|-4 |C |-8 |A |-11|G |-12 |CAG |
|-8 |A |-11|G |-18|Z |-19 |AGZ |
+---+---+---+---+---+---+------------+------------+
Note: Remember that the case classes should be defined at the top level, outside the object or method in which the Spark session is created and used.
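For checking the expected rows on a small sample without a cluster, the same window logic can be reproduced with sliding(3) on a local collection. This is a plain-Scala sketch, not a replacement for the windowed UDF; it assumes the data fits in memory and reuses the Foo and FooResult shapes from the question.

```scala
case class Foo(a: Int, b: String)
case class FooResult(a1: Int, b1: String, a2: Int, b2: String, a3: Int, b3: String,
                     computation1: Int, computation2: String)

val foos = Seq(Foo(-18, "Z"), Foo(-11, "G"), Foo(-8, "A"), Foo(-4, "C"), Foo(-1, "F"))

// Order by a descending, then walk over every run of 3 consecutive rows.
val fooResults: List[FooResult] = foos
  .sortBy(f => -f.a)
  .sliding(3)
  .collect { case Seq(x, y, z) => // skips a trailing window with fewer than 3 rows
    FooResult(x.a, x.b, y.a, y.b, z.a, z.b, x.a + y.a, x.b + y.b + z.b)
  }
  .toList

fooResults.foreach(println)
```

The two computations are the same ones the UDF performs: the sum of the first two a values and the concatenation of all three b values.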
