Spark GroupBy Aggregate functions

case class Step (Id : Long,
stepNum : Long,
stepId : Int,
stepTime: java.sql.Timestamp
)
I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" column.
My output should look like Dataset[(Long, List[Step])]. How do I do this?
Let's say the variable "inquiryStepMap" is of type Dataset[Step]; with RDDs this can be done as follows:
val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)

It seems you need groupByKey:
Sample:
import java.sql.Timestamp
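// Caution: this deprecated constructor expects (year - 1900) and a 0-based month,
// so it yields 3917-06-01 (see the output below); Timestamp.valueOf("2017-05-01 00:00:00") avoids that.
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0)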
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()
groupByKey and then mapGroups:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]
And the result looks like:
ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()
+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|[[1,21,1,3917-06-...|
|  2|[[2,10,3,3917-06-...|
+---+--------------------+
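For intuition, the same grouping can be sketched on a local collection without Spark. This is only a minimal sketch, not the Dataset API: the `GroupingSketch` object and its `steps` and `grouped` names are hypothetical, and the timestamp is built with `Timestamp.valueOf`, which takes the year and month as written.

```scala
import java.sql.Timestamp

object GroupingSketch {
  // Same shape as the Step case class from the question.
  final case class Step(Id: Long, stepNum: Long, stepId: Int, stepTime: Timestamp)

  // Timestamp.valueOf parses "yyyy-mm-dd hh:mm:ss" without the 1900 year offset.
  val t: Timestamp = Timestamp.valueOf("2017-05-01 00:00:00")

  val steps: List[Step] =
    List(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t))

  // Local analogue of ds.groupByKey(_.Id).mapGroups((id, vals) => (id, vals.toList)).
  val grouped: Map[Long, List[Step]] = steps.groupBy(_.Id)
}
```

On a real Dataset, calling `toList` inside `mapGroups` materializes each group's values in memory, so very large groups may be a concern.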

Related

Need to get fields from lookup file with calculated values using spark dataframe

I have a products file as below, and the formulas for cost price and discount are in the lookup file. We need to get the discount and cost price fields from the lookup file.
product_id,product_name,marked_price,selling_price,profit
101,AAAA,5500,5400,500
102,ABCS,7000,6500,1000
103,GHMA,6500,5600,700
104,PSNLA,8450,8000,800
105,GNBC,1250,1200,600
lookup file:
key,value
cost_price,(selling_price+profit)
discount,(marked_price-selling_price)
Final output:
product_id,product_name,marked_price,selling_price,profit,cost_price,discount
101,AAAA,5500,5400,500,5900,100
102,ABCS,7000,6500,1000,7500,500
103,GHMA,6500,5600,700,6300,900
104,PSNLA,8450,8000,800,8800,450
105,GNBC,1250,1200,600,1800,50

First, you should make a Map out of your lookup-file, then you can add the columns using expr:
val lookup = Map(
"cost_price" -> "selling_price+profit",
"discount" -> "(marked_price-selling_price)"
)
val df = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000)
)
.toDF("product_id", "product_name", "marked_price", "selling_price", "profit")
df
.withColumn("cost_price",expr(lookup("cost_price")))
.withColumn("discount",expr(lookup("discount")))
.show()
gives:
+----------+------------+------------+-------------+------+----------+--------+
|product_id|product_name|marked_price|selling_price|profit|cost_price|discount|
+----------+------------+------------+-------------+------+----------+--------+
|       101|        AAAA|        5500|         5400|   500|      5900|     100|
|       102|        ABCS|        7000|         6500|  1000|      7500|     500|
+----------+------------+------------+-------------+------+----------+--------+
You can also iterate over your lookup:
val finalDF = lookup.foldLeft(df){case (df,(k,v)) => df.withColumn(k,expr(v))}
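The fold above can be sketched without Spark to show how the accumulator is threaded through the lookup entries. This is a minimal sketch under stated assumptions: rows are plain `Map[String, Int]` values, the formulas are Scala functions rather than strings parsed by `expr`, and `LookupFoldSketch`, `enrich` and `product101` are hypothetical names.

```scala
object LookupFoldSketch {
  type Row = Map[String, Int]

  // Hypothetical stand-in for the lookup file: column name -> formula over a row.
  // A Seq keeps the order explicit, which matters if one formula uses another's result.
  val lookup: Seq[(String, Row => Int)] = Seq(
    "cost_price" -> ((r: Row) => r("selling_price") + r("profit")),
    "discount"   -> ((r: Row) => r("marked_price") - r("selling_price"))
  )

  // foldLeft threads the growing row through every lookup entry.
  def enrich(row: Row): Row =
    lookup.foldLeft(row) { case (acc, (name, formula)) => acc + (name -> formula(acc)) }

  // First product from the question's sample file.
  val product101: Row =
    Map("marked_price" -> 5500, "selling_price" -> 5400, "profit" -> 500)
}
```

For example, `LookupFoldSketch.enrich(LookupFoldSketch.product101)` adds `cost_price` (5900) and `discount` (100) to the row.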

import spark.implicits._
import org.apache.spark.sql.functions.expr
val dataheader = Seq("product_id", "product_name", "marked_price", "selling_price", "profit")
val data = Seq(
(101, "AAAA", 5500, 5400, 500),
(102, "ABCS", 7000, 6500, 1000),
(103, "GHMA", 6500, 5600, 700),
(104, "PSNLA", 8450, 8000, 800),
(105, "GNBC", 1250, 1200, 600))
val dataDF = data.toDF(dataheader: _*)
val lookupMap = Seq(
("cost_price", "(selling_price+profit)"),
("discount", "(marked_price-selling_price)")) //You can read your data from look up file and construct a dataframe
.toDF("key", "value").select("key", "value").as[(String, String)].collect.toMap
lookupMap.foldLeft(dataDF)((df, look) => {
df.withColumn(look._1, expr(look._2))
}).show()

Loading Data into Spark Dataframe without delimiters in source

I have a dataset with no delimiters:
111222333444
555666777888
Desired output:
|_c1_|_c2_|_c3_|_c4_|
|111 |222 |333 |444 |
|555 |666 |777 |888 |
I have tried this to get that output:
val myDF = spark.sparkContext.textFile("myFile").toDF()
val myNewDF = myDF.withColumn("c1", substring(col("value"), 0, 3))
.withColumn("c2", substring(col("value"), 3, 6))
.withColumn("c3", substring(col("value"), 6, 9)
.withColumn("c4", substring(col("value"), 9, 12))
.drop("value")
.show()
But I need to manipulate c4 (multiply it by 100), and its datatype is string, not double.
Update: I ran into a problem when I execute this:
val myNewDF = myDF.withColumn("c1", expr("substring(value, 0, 3)"))
.withColumn("c2", expr("substring(value, 3, 6"))
.withColumn("c3", expr("substring(value, 6, 9)"))
.withColumn("c4", (expr("substring(value, 9, 12)").cast("double") * 100))
.drop("value")
myNewDF.show(5,false) // only shows the "value" column (which I dropped) and the "c1" column
myNewDF.printSchema // only shows 2 rows. Why is it not showing all 4 newly created columns?

Create test dataframe:
scala> var df = Seq(("111222333444"),("555666777888")).toDF("s")
df: org.apache.spark.sql.DataFrame = [s: string]
Split column s into an array of 3-character chunks:
scala> var res = df.withColumn("temp",split(col("s"),"(?<=\\G...)"))
res: org.apache.spark.sql.DataFrame = [s: string, temp: array<string>]
Map array elements to new columns:
scala> res = res.select((1 until 5).map(i => col("temp").getItem(i-1).as("c"+i)):_*)
res: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
scala> res.show(false)
+---+---+---+---+
|c1 |c2 |c3 |c4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+

Leaving a little for you to work out yourself, such as reading the file and naming your dataset / dataframe columns explicitly, this simulated approach with an RDD should help you on your way:
val rdd = sc.parallelize(Seq(("111222333444"),
("555666777888")
)
)
val df = rdd.map(x => (x.slice(0,3), x.slice(3,6), x.slice(6,9), x.slice(9,12))).toDF()
df.show(false)
returns:
+---+---+---+---+
|_1 |_2 |_3 |_4 |
+---+---+---+---+
|111|222|333|444|
|555|666|777|888|
+---+---+---+---+
Or, using DataFrames:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("111222333444"),
("555666777888"))
).toDF()
val df2 = df
  .withColumn("c1", expr("substring(value, 1, 3)"))
  .withColumn("c2", expr("substring(value, 4, 3)"))
  .withColumn("c3", expr("substring(value, 7, 3)"))
  .withColumn("c4", expr("substring(value, 10, 3)"))
df2.show(false)
returns:
+------------+---+---+---+---+
|value       |c1 |c2 |c3 |c4 |
+------------+---+---+---+---+
|111222333444|111|222|333|444|
|555666777888|555|666|777|888|
+------------+---+---+---+---+
You can drop the value column; I leave that up to you.
This is similar to the answer above, but that approach gets complicated if not all chunks are 3 characters long.
For your updated question (cast to double and multiply by 100):
val df2 = df.withColumn("c1", expr("substring(value, 1, 3)")).withColumn("c2", expr("substring(value, 4, 3)")).withColumn("c3", expr("substring(value, 7, 3)"))
.withColumn("c4", (expr("substring(value, 10, 3)").cast("double") * 100))
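Note that the working calls above pass a 1-based start position and a length to `substring`, whereas the question's attempt passed what look like start and end offsets. The same fixed-width parsing, including the conversion of the last chunk to a double multiplied by 100, can be sketched on a plain string. This is a minimal sketch without Spark; `FixedWidthSketch`, `Record` and `parse` are hypothetical names, and it assumes exactly four chunks of equal width.

```scala
import scala.util.Try

object FixedWidthSketch {
  final case class Record(c1: String, c2: String, c3: String, c4: Double)

  // Split a line into four fixed-width chunks; the last chunk becomes a Double times 100.
  // Lines of the wrong length or with a non-numeric last chunk give None.
  def parse(line: String, width: Int = 3): Option[Record] =
    if (line.length != 4 * width) None
    else line.grouped(width).toList match {
      case List(a, b, c, d) =>
        Try(d.toDouble).toOption.map(n => Record(a, b, c, n * 100))
      case _ => None
    }
}
```

For example, `FixedWidthSketch.parse("111222333444")` gives `Some(Record("111", "222", "333", 44400.0))`.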

In spark dataframe for map column how to update values with a constant for all keys

I have a Spark dataframe with two columns, of type Integer and Map. I want to know the best way to update the values for all the keys of the map column.
With the help of a UDF, I am able to update the values:
def modifyValues = (map_data: Map[String, Int]) => {
val divideWith = 10
map_data.mapValues( _ / divideWith)
}
val modifyMapValues = udf(modifyValues)
df.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> dF.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
val ds = Seq(
(1, Map("foo" -> 100, "bar" -> 200)),
(2, Map("foo" -> 200)),
(3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+-----------------------+
|id |data_map               |
+---+-----------------------+
|1  |[foo -> 10, bar -> 20] |
|2  |[foo -> 20]            |
|3  |[bar -> 20]            |
+---+-----------------------+
Is there any way to do this without a UDF?

One possible way to do it (without a UDF) is this:
1. extract the keys to an array using map_keys
2. extract the values to an array using map_values
3. transform the extracted values using TRANSFORM (available since Spark 2.4)
4. create the map again using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}
ds
.withColumn("values", map_values($"data_map"))
.withColumn("keys", map_keys($"data_map"))
.withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
.withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
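To check the expected values independently of Spark, the per-row transformation can be sketched on local collections. This is a minimal sketch: `MapValuesSketch`, `divideValues`, `sample` and `updated` are hypothetical names, and it uses Scala's integer division, whereas Spark SQL's `/` generally produces a fractional result, so the values from the TRANSFORM expression above may come back as doubles.

```scala
object MapValuesSketch {
  // Strict analogue of the question's UDF body: divide every value, keep every key.
  // Int / Int is integer division in Scala.
  def divideValues(m: Map[String, Int], divideWith: Int = 10): Map[String, Int] =
    m.map { case (k, v) => k -> v / divideWith }

  // Sample data from the question as (id, data_map) pairs.
  val sample: Seq[(Int, Map[String, Int])] = Seq(
    1 -> Map("foo" -> 100, "bar" -> 200),
    2 -> Map("foo" -> 200),
    3 -> Map("bar" -> 200)
  )

  val updated: Seq[(Int, Map[String, Int])] =
    sample.map { case (id, m) => id -> divideValues(m) }
}
```

Using `map` with a pattern match returns a strict `Map`, unlike `mapValues`, which returns a lazy view in some Scala versions.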

import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, **{'new_key':77}}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
my_update_udf = sp.udf(lambda x: {k: v / 77 for k, v in x.items() if v is not None and k == 'my_key'}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))

There is another way available in Spark 3:
Seq(
Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
.withColumn(
"mapColumn",
map_concat(
map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
)
)
.show(false)
output:
+----------------------------------------+
|mapColumn                               |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps

DataFrame and DataSet - converting values to <k,v> pair

Sample Input (black coloured text) and Output (red coloured text)
I have a DataFrame (one in black), how can I transform it to one like in red?
(column number, value)
[Image is attached]
val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("file:/home/hduser/Desktop/Demo.csv")
case class Employee(EmpId: String, Experience: Double, Salary: Double)
val ds = df.as[Employee]
I need the solution in both DataFrame and DataSet way.
Thank you in advance! :-)

I believe it's a structure you want when you say pair. Check whether the code below gives your expected output.
With DataFrame:
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
val data = Seq(("111",5,50000),("222",6,60000),("333",7,60000))
val df = data.toDF("EmpId","Experience","Salary")
val newdf = df.withColumn("EmpId", struct(lit("1").as("key"),col("EmpId").as("value")))
  .withColumn("Experience", struct(lit("2").as("key"),col("Experience").as("value")))
  .withColumn("Salary", struct(lit("3").as("key"),col("Salary").as("value")))
// show() returns Unit, so call it separately to keep newdf a DataFrame
newdf.show(false)
output:
+--------+----------+----------+
|EmpId   |Experience|Salary    |
+--------+----------+----------+
|[1, 111]|[2, 5]    |[3, 50000]|
|[1, 222]|[2, 6]    |[3, 60000]|
|[1, 333]|[2, 7]    |[3, 60000]|
+--------+----------+----------+
With Dataset:
First you need to define case classes for the new structure; otherwise you can't create a Dataset:
case class EmpData(key: String,value:String)
case class Employee2(EmpId: EmpData, Experience: EmpData, Salary: EmpData)
val ds = df.as[Employee]
val newDS = ds.map(rec=>{
(EmpData("1",rec.EmpId), EmpData("2",rec.Experience.toString),EmpData("3",rec.Salary.toString))
})
val finalDS = newDS.toDF("EmpId","Experience","Salary").as[Employee2]
finalDS.show(false)
Output:
+--------+----------+------------+
|EmpId   |Experience|Salary      |
+--------+----------+------------+
|[1, 111]|[2, 5.0]  |[3, 50000.0]|
|[1, 222]|[2, 6.0]  |[3, 60000.0]|
|[1, 333]|[2, 7.0]  |[3, 60000.0]|
+--------+----------+------------+
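The pairing of each field with its column number can also be sketched on a plain case class instance, which makes the string rendering of the Double fields easy to inspect. This is a minimal sketch without Spark; `KeyValueSketch` and `toPairs` are hypothetical names, and it relies on `productIterator` returning the fields in declaration order.

```scala
object KeyValueSketch {
  final case class Employee(EmpId: String, Experience: Double, Salary: Double)
  final case class EmpData(key: String, value: String)

  // Pair each field with its 1-based column number, rendering the value as a string.
  def toPairs(e: Employee): List[EmpData] =
    e.productIterator.zipWithIndex.map { case (v, i) =>
      EmpData((i + 1).toString, v.toString)
    }.toList
}
```

Because `Experience` and `Salary` are declared as Double, `toString` renders them with a decimal point, for example "5.0" rather than "5".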
Thanks

Flatten rows in sliding window using Spark

I'm processing a large number of rows from either a database or a file using Apache Spark. Part of the processing creates a sliding window of 3 rows, where the rows need to be flattened and additional calculations performed on the flattened rows. Below is a simplified example of what I am trying to do.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions.Window
object Main extends App {
val ss = SparkSession.builder().appName("DataSet Test")
.master("local[*]").getOrCreate()
import ss.implicits._
case class Foo(a:Int, b:String )
// rows from database or file
val foos = Seq(Foo(-18, "Z"),
Foo(-11, "G"),
Foo(-8, "A"),
Foo(-4, "C"),
Foo(-1, "F")).toDS()
// work on 3 rows
val sliding_window_spec = Window.orderBy(desc("a")).rowsBetween( -2, 0)
// flattened object with example computations
case class FooResult(a1:Int, b1:String, a2:Int, b2:String, a3:Int, b3:String, computation1:Int, computation2:String )
// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
// expected results
val fooResults = Seq(FooResult( -1, "F", -4, "C", -8, "A", -5, "FCA" ),
FooResult( -4, "C", -8, "A", -11, "G", -12, "CAG" ),
FooResult( -8, "A", -11, "G", -18, "Z", -19, "AGZ" )).toDS()
ss.stop()
}
How can I convert the foos into the fooResults? I'm using Apache Spark 2.3.0

// how to convert foo to fooResult???
// flatten 3 rows into 1 and do additional computations on flattened rows
You can simply use the built-in collect_list function over the window you've already defined, and then do the computation and the flattening in a udf function. Finally, you can filter and expand the struct column to get your desired result:
def slidingUdf = udf((list1: Seq[Int], list2:Seq[String])=> {
if(list1.size < 3) null
else {
val zipped = list1.zip(list2)
FooResult(zipped(0)._1, zipped(0)._2, zipped(1)._1, zipped(1)._2, zipped(2)._1, zipped(2)._2, zipped(0)._1+zipped(1)._1, zipped(0)._2+zipped(1)._2+zipped(2)._2)
}
})
foos.select(slidingUdf(collect_list("a").over(sliding_window_spec), collect_list("b").over(sliding_window_spec)).as("test"))
.filter(col("test").isNotNull)
.select(col("test.*"))
.show(false)
which should give you
+---+---+---+---+---+---+------------+------------+
|a1 |b1 |a2 |b2 |a3 |b3 |computation1|computation2|
+---+---+---+---+---+---+------------+------------+
|-1 |F |-4 |C |-8 |A |-5 |FCA |
|-4 |C |-8 |A |-11|G |-12 |CAG |
|-8 |A |-11|G |-18|Z |-19 |AGZ |
+---+---+---+---+---+---+------------+------------+
Note: Remember that the case classes should be defined outside the scope of the current session
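The expected rows can be cross-checked on a local collection with `sliding(3)`. This is a minimal sketch without Spark, not a replacement for the window-based answer above; the `SlidingSketch` object and its `results` value are hypothetical names.

```scala
object SlidingSketch {
  final case class Foo(a: Int, b: String)
  final case class FooResult(a1: Int, b1: String, a2: Int, b2: String, a3: Int, b3: String,
                             computation1: Int, computation2: String)

  val foos: List[Foo] =
    List(Foo(-18, "Z"), Foo(-11, "G"), Foo(-8, "A"), Foo(-4, "C"), Foo(-1, "F"))

  // Sort descending by a, then take every run of 3 consecutive rows.
  // collect drops any window shorter than 3 (only possible with fewer than 3 rows).
  val results: List[FooResult] =
    foos.sortBy(-_.a).sliding(3).collect {
      case List(x, y, z) =>
        FooResult(x.a, x.b, y.a, y.b, z.a, z.b, x.a + y.a, x.b + y.b + z.b)
    }.toList
}
```

In Spark itself, a window ordered with `Window.orderBy` and no `partitionBy` moves all rows to a single partition, which may matter for a large number of rows.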
