Spark show cost based optimizer statistics - apache-spark

I have tried to enable the Spark CBO by setting the following property in spark-shell:
spark.conf.set("spark.sql.cbo.enabled", true)
I then run spark.sql("ANALYZE TABLE events COMPUTE STATISTICS").show
but running the following query still doesn't show me any statistics:
spark.sql("select * from events where eventID=1").explain(true)
I am running this on Spark 2.2.1.
scala> spark.sql("select * from events where eventID=1").explain()
== Physical Plan ==
*Project [buyDetails.capacity#923, buyDetails.clearingNumber#924, buyDetails.leavesQty#925L, buyDetails.liquidityCode#926, buyDetails.orderID#927, buyDetails.side#928, cancelQty#929L, capacity#930, clearingNumber#931, contraClearingNumber#932, desiredLeavesQty#933L, displayPrice#934, displayQty#935L, eventID#936, eventTimestamp#937L, exchange#938, executionCodes#939, fillID#940, handlingInstructions#941, initiator#942, leavesQty#943L, nbbPrice#944, nbbQty#945L, nboPrice#946, ... 29 more fields]
+- *Filter (isnotnull(eventID#936) && (cast(eventID#936 as int) = 1))
+- *FileScan parquet default.events[buyDetails.capacity#923,buyDetails.clearingNumber#924,buyDetails.leavesQty#925L,buyDetails.liquidityCode#926,buyDetails.orderID#927,buyDetails.side#928,cancelQty#929L,capacity#930,clearingNumber#931,contraClearingNumber#932,desiredLeavesQty#933L,displayPrice#934,displayQty#935L,eventID#936,eventTimestamp#937L,exchange#938,executionCodes#939,fillID#940,handlingInstructions#941,initiator#942,leavesQty#943L,nbbPrice#944,nbbQty#945L,nboPrice#946,... 29 more fields] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/home/asehgal/data/events], PartitionFilters: [], PushedFilters: [IsNotNull(eventID)], ReadSchema: struct<buyDetails.capacity:string,buyDetails.clearingNumber:string,buyDetails.leavesQty:bigint,bu...

For me, the stats are not visible in df.explain(true) either. I played around a bit and could print the statistics using println(df.queryExecution.stringWithStats). Full example:
import org.apache.spark.sql.SparkSession

val ss = SparkSession
  .builder()
  .master("local[*]")
  .appName("TestCBO")
  .config("spark.sql.cbo.enabled", true)
  .getOrCreate()

import ss.implicits._

val df1 = ss.range(10000L).toDF("i")
df1.write.mode("overwrite").saveAsTable("table1")

val df2 = ss.range(100000L).toDF("i")
df2.write.mode("overwrite").saveAsTable("table2")

ss.sql("ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS i")
ss.sql("ANALYZE TABLE table2 COMPUTE STATISTICS FOR COLUMNS i")

val df = ss.table("table1").join(ss.table("table2"), "i")
  .where($"i" > 1000)

println(df.queryExecution.stringWithStats)
gives
== Optimized Logical Plan ==
Project [i#2554L], Statistics(sizeInBytes=147.2 KB, rowCount=9.42E+3, hints=none)
+- Join Inner, (i#2554L = i#2557L), Statistics(sizeInBytes=220.8 KB, rowCount=9.42E+3, hints=none)
:- Filter (isnotnull(i#2554L) && (i#2554L > 1000)), Statistics(sizeInBytes=140.6 KB, rowCount=9.00E+3, hints=none)
: +- Relation[i#2554L] parquet, Statistics(sizeInBytes=156.3 KB, rowCount=1.00E+4, hints=none)
+- Filter ((i#2557L > 1000) && isnotnull(i#2557L)), Statistics(sizeInBytes=1546.9 KB, rowCount=9.90E+4, hints=none)
+- Relation[i#2557L] parquet, Statistics(sizeInBytes=1562.5 KB, rowCount=1.00E+5, hints=none)
This is not shown in the standard df.explain, because that fires (in Dataset.scala):
ExplainCommand(queryExecution.logical, extended = true) // cost = false in this constructor
To enable the output of costs, we can invoke this ExplainCommand ourselves:
import org.apache.spark.sql.execution.command.ExplainCommand
val explain = ExplainCommand(df.queryExecution.logical, extended = true, cost = true)
ss.sessionState.executePlan(explain).executedPlan.executeCollect().foreach {
  r => println(r.getString(0))
}
Here you could also enable the output of the generated code (set codegen = true)
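For instance, a minimal sketch of that variant (assuming the Spark 2.2 ExplainCommand signature used above, where codegen is an optional flag just like cost):
import org.apache.spark.sql.execution.command.ExplainCommand

// Print the generated whole-stage code instead of the cost statistics
val explainCodegen = ExplainCommand(df.queryExecution.logical, codegen = true)
ss.sessionState.executePlan(explainCodegen).executedPlan.executeCollect().foreach {
  r => println(r.getString(0))
}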
Alternatively, this gives a similar output:
df // the join of the two DataFrames and the filter
  .createOrReplaceTempView("tmp")
ss.sql("EXPLAIN COST select * from tmp").show(false)
To see the statistics in the Spark UI, go to the SQL tab and select the corresponding query (in this case, the one triggered by df.show()).

Related

Does Spark bring the entire Hive table into memory?

I am in the process of learning how Apache Spark works and have some basic questions. Let's say I have a Spark application running which connects to a Hive table.
My Hive table is as follows:
+----+---+-----+
|Name|Age|Marks|
+----+---+-----+
|   A| 50|  100|
|   B| 50|  100|
|   C| 75|  200|
+----+---+-----+
When I run the following code snippets, which rows and columns will be loaded into memory during execution? Will the filtering of rows/columns be done only after the entire table has been loaded into memory?
1. spark_session.sql("SELECT name, age from table").collect()
2. spark_session.sql("SELECT * from table WHERE age=50").collect()
3. spark_session.sql("SELECT * from table").select("name", "age").collect()
4. spark_session.sql("SELECT * from table").filter("age = 50").collect()
If the data source supports predicate pushdown, then Spark will not load the entire data into memory while filtering it.
Let's check the Spark plan for a Hive table stored in the Parquet file format:
>>> df = spark.createDataFrame([('A', 25, 100),('B', 30, 100)], ['name', 'age', 'marks'])
>>> df.write.saveAsTable('table')
>>> spark.sql('select * from table where age=25').explain(True)
== Physical Plan ==
*(1) Filter (isnotnull(age#1389L) AND (age#1389L = 25))
+- *(1) ColumnarToRow
+- FileScan parquet default.table[name#1388,age#1389L,marks#1390L] Batched: true, DataFilters: [isnotnull(age#1389L), (age#1389L = 25)],
Format: Parquet, Location: InMemoryFileIndex[file:/Users/mohan/spark-warehouse/table],
PartitionFilters: [], PushedFilters: [IsNotNull(age), EqualTo(age,25)], ReadSchema: struct<name:string,age:bigint,marks:bigint>
You can verify whether the filter was pushed down to the underlying storage by looking at PushedFilters: [IsNotNull(age), EqualTo(age,25)].
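Column pruning answers the other half of the question (snippets 1 and 3): only the selected columns appear in the scan's ReadSchema, so only those columns are read from the Parquet files. A minimal sketch (in Scala, against the same example table as above):
// Both plans should show ReadSchema: struct<name:string,age:bigint> on the FileScan,
// i.e. the marks column is never read from disk.
spark.sql("SELECT name, age FROM table").explain(true)

// The select() applied after the full SELECT * is pushed into the scan by the
// optimizer, so the pruned ReadSchema should be the same here as well.
spark.sql("SELECT * FROM table").select("name", "age").explain(true)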

Does Spark respect Kudu's hash partitioning, similar to bucketed joins on Parquet tables?

I'm trying out Kudu with Spark. I want to join two tables with the following schemas:
# This table has around 1 million records
TABLE dimensions (
    id INT32 NOT NULL,
    PRIMARY KEY (id)
)
HASH (id) PARTITIONS 32,
RANGE (id) (
    PARTITION UNBOUNDED
)
OWNER root
REPLICAS 1

# This table has 500 million records
TABLE facts (
    id INT32 NOT NULL,
    date DATE NOT NULL,
    PRIMARY KEY (id, date)
)
HASH (id) PARTITIONS 32,
RANGE (id, date) (
    PARTITION UNBOUNDED
)
OWNER root
REPLICAS 1
I inserted data into these tables using the following script:
// Load the data into a Spark DataFrame
val dimensions_raw = spark.sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/root/dimensions.csv")
dimensions_raw.printSchema
dimensions_raw.createOrReplaceTempView("dimensions_raw")

// Set the primary key columns to non-nullable
import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame

def setNotNull(df: DataFrame, columns: Seq[String]): DataFrame = {
  val schema = df.schema
  // Modify the StructField for the specified columns.
  val newSchema = StructType(schema.map {
    case StructField(c, t, _, m) if columns.contains(c) => StructField(c, t, nullable = false, m)
    case y: StructField => y
  })
  // Apply the new schema to the DataFrame
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}

val primaryKeyCols = Seq("id") // for the `facts` table this is Seq("id", "date")
val dimensions_prep = setNotNull(dimensions_raw, primaryKeyCols)
dimensions_prep.printSchema
dimensions_prep.createOrReplaceTempView("dimensions_prep")

// Create a Kudu table
import collection.JavaConverters._
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu._

val kuduContext = new KuduContext("localhost:7051", spark.sparkContext)

// Delete the table if it already exists.
if (kuduContext.tableExists("dimensions")) {
  kuduContext.deleteTable("dimensions")
}

kuduContext.createTable("dimensions", dimensions_prep.schema,
  /* primary key */ primaryKeyCols,
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("id").asJava, 32))

// Load the Kudu table from the Spark DataFrame
kuduContext.insertRows(dimensions_prep, "dimensions")

// Create a DataFrame that points to the Kudu table we want to query.
val dimensions = spark.read
  .option("kudu.master", "localhost:7051")
  .option("kudu.table", "dimensions")
  .format("kudu").load
dimensions.createOrReplaceTempView("dimensions")
I ran the above script for the facts table as well.
I want to join facts with the dimensions table on id. I tried the following in Spark:
val query = facts.join(dimensions, facts.col("id") === dimensions.col("id"))
query.show()
// And I get the following Physical plan-
== Physical Plan ==
*(5) SortMergeJoin [id#0], [id#14], Inner
:- *(2) Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200), true, [id=#43]
: +- *(1) Scan Kudu facts [id#0,date#1] PushedFilters: [], ReadSchema: struct<id:int,date:date...
+- *(4) Sort [id#14 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#14, 200), true, [id=#49]
+- *(3) Scan Kudu dimensions [id#14] PushedFilters: [], ReadSchema: struct<id:int>
My question is: how do I tell Spark that the tables are already sorted on id (the join key), so that there is no need to sort again?
Moreover, the Exchange hashpartitioning should not be needed, as the tables are already hash-partitioned on id.
The join query takes just under 100 seconds on a single machine with a single master and tablet server running.
Am I doing something wrong here, or is this the expected speed with Kudu for this kind of query?
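For comparison, here is a minimal sketch (factsDF, dimensionsDF, and the table names are hypothetical, not from the post) of the Spark-native bucketed-join behaviour on Parquet tables that the title alludes to; whether the Kudu data source reports its hash partitioning to Spark in a way that removes the Exchange is exactly what the question asks:
// Write both sides bucketed (and sorted) on the join key with the same bucket
// count as Kudu's HASH (id) PARTITIONS 32.
factsDF.write.bucketBy(32, "id").sortBy("id").saveAsTable("facts_bucketed")
dimensionsDF.write.bucketBy(32, "id").sortBy("id").saveAsTable("dims_bucketed")

// With spark.sql.sources.bucketing.enabled (true by default), this join is planned
// as a SortMergeJoin without an Exchange on either side; the Sort can also be
// dropped when each bucket maps to a single sorted file.
spark.table("facts_bucketed")
  .join(spark.table("dims_bucketed"), "id")
  .explain()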

How to view pushed and partition filters in Spark 3

How can you view the partition filters and pushed filters in Spark 3 (3.0.0-preview2)?
The explain method outputted detail like this in Spark 2:
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
+- FileScan csv [first_name#12,last_name#13,country#14]
Batched: false,
Format: CSV,
Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
PartitionFilters: [],
PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
ReadSchema: struct
This would easily let you identify the PartitionFilters and PushedFilters.
In Spark 3, the explain output is a lot less detailed, even when the extended argument is set:
val path = new java.io.File("./src/test/resources/person_data.csv").getCanonicalPath
val df = spark.read.option("header", "true").csv(path)
df
.filter(col("person_country") === "Cuba")
.explain("extended")
Here's the output:
== Parsed Logical Plan ==
'Filter ('person_country = Cuba)
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv
== Analyzed Logical Plan ==
person_name: string, person_country: string
Filter (person_country#116 = Cuba)
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv
== Optimized Logical Plan ==
Filter (isnotnull(person_country#116) AND (person_country#116 = Cuba))
+- RelationV2[person_name#115, person_country#116] csv file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/resources/person_data.csv
== Physical Plan ==
*(1) Project [person_name#115, person_country#116]
+- *(1) Filter (isnotnull(person_country#116) AND (person_country#116 = Cuba))
+- BatchScan[person_name#115, person_country#116] CSVScan Location: InMemoryFileIndex[file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark3/src/test/re..., ReadSchema: struct<person_name:string,person_country:string>
Is there any way to see the partition filters and pushed filters in Spark 3?
This looks like it was a bug that was fixed towards the end of April. The JIRA for the predicate pushdown is SPARK-30475 and for the partition pushdown is SPARK-30428.
Can you check whether your version of Spark includes these fixes?
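As a quick sanity check, a minimal sketch (assuming the same df and column name as above): print the running version and re-run the extended explain; on a build that contains the two fixes, the scan node should list the filters again.
import org.apache.spark.sql.functions.col

// Check the running version against the fix versions reported on the JIRAs above
println(spark.version)

// Re-run the same query; on a patched build the physical plan's scan node should
// again report the pushed filters (and partition filters where applicable).
df.filter(col("person_country") === "Cuba").explain("extended")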

pyspark - getting the latest partition from a Hive partitioned column

I am new to PySpark.
I am trying to get the latest partition (a date partition) of a Hive table using PySpark DataFrames, and have done it as shown below. But I am sure there is a better way to do it using DataFrame functions (not by writing SQL). Could you please share inputs on better ways?
This solution scans through the entire data of the Hive table to get it.
df_1 = sqlContext.table("dbname.tablename");
df_1_dates = df_1.select('partitioned_date_column').distinct().orderBy(df_1['partitioned_date_column'].desc())
lat_date_dict=df_1_dates.first().asDict()
lat_dt=lat_date_dict['partitioned_date_column']
I agree with what @philantrovert mentioned in the comment. You can use the below approach for partition pruning to limit the number of partitions scanned for your Hive table.
>>> spark.sql("""show partitions test_dev_db.newpartitiontable""").show();
+--------------------+
| partition|
+--------------------+
|tran_date=2009-01-01|
|tran_date=2009-02-01|
|tran_date=2009-03-01|
|tran_date=2009-04-01|
|tran_date=2009-05-01|
|tran_date=2009-06-01|
|tran_date=2009-07-01|
|tran_date=2009-08-01|
|tran_date=2009-09-01|
|tran_date=2009-10-01|
|tran_date=2009-11-01|
|tran_date=2009-12-01|
+--------------------+
>>> max_date=spark.sql("""show partitions test_dev_db.newpartitiontable""").rdd.flatMap(lambda x:x).map(lambda x : x.replace("tran_date=","")).max()
>>> print max_date
2009-12-01
>>> query = "select city,state,country from test_dev_db.newpartitiontable where tran_date ='{}'".format(max_date)
>>> spark.sql(query).show();
+--------------------+----------------+--------------+
| city| state| country|
+--------------------+----------------+--------------+
| Southampton| England|United Kingdom|
|W Lebanon ...| NH| United States|
| Comox|British Columbia| Canada|
| Gasperich| Luxembourg| Luxembourg|
+--------------------+----------------+--------------+
>>> spark.sql(query).explain(True)
== Parsed Logical Plan ==
'Project ['city, 'state, 'country]
+- 'Filter ('tran_date = 2009-12-01)
+- 'UnresolvedRelation `test_dev_db`.`newpartitiontable`
== Analyzed Logical Plan ==
city: string, state: string, country: string
Project [city#9, state#10, country#11]
+- Filter (tran_date#12 = 2009-12-01)
+- SubqueryAlias newpartitiontable
+- Relation[city#9,state#10,country#11,tran_date#12] orc
== Optimized Logical Plan ==
Project [city#9, state#10, country#11]
+- Filter (isnotnull(tran_date#12) && (tran_date#12 = 2009-12-01))
+- Relation[city#9,state#10,country#11,tran_date#12] orc
== Physical Plan ==
*(1) Project [city#9, state#10, country#11]
+- *(1) FileScan orc test_dev_db.newpartitiontable[city#9,state#10,country#11,tran_date#12] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(tran_date#12), (tran_date#12 = 2009-12-01)], PushedFilters: [], ReadSchema: struct<city:string,state:string,country:string>
You can see PartitionCount: 1 in the above plan; it has scanned only one of the 12 available partitions.
Building on Vikrant's answer, here is a more general way of extracting partition column values directly from the table metadata, which avoids Spark scanning through all the files in the table.
First, if your data isn't already registered in a catalog, you'll want to do that so Spark can see the partition details. Here, I'm registering a new table named data.
spark.catalog.createTable(
    'data',
    path='/path/to/the/data',
    source='parquet',
)
spark.catalog.recoverPartitions('data')
partitions = spark.sql('show partitions data')
To show a self-contained answer, however, I'll manually create the partitions DataFrame so you can see what it would look like, along with the solution for extracting a specific column value from it.
from pyspark.sql.functions import (
    col,
    regexp_extract,
)

partitions = (
    spark.createDataFrame(
        [
            ('/country=usa/region=ri/',),
            ('/country=usa/region=ma/',),
            ('/country=russia/region=siberia/',),
        ],
        schema=['partition'],
    )
)

partition_name = 'country'

(
    partitions
    .select(
        'partition',
        regexp_extract(
            col('partition'),
            pattern=r'(\/|^){}=(\S+?)(\/|$)'.format(partition_name),
            idx=2,
        ).alias(partition_name),
    )
    .show(truncate=False)
)
The output of this query is:
+-------------------------------+-------+
|partition |country|
+-------------------------------+-------+
|/country=usa/region=ri/ |usa |
|/country=usa/region=ma/ |usa |
|/country=russia/region=siberia/|russia |
+-------------------------------+-------+
The solution in Scala will look very similar to this, except the call to regexp_extract() will look slightly different:
.select(
  regexp_extract(
    col("partition"),
    exp = s"(\\/|^)${partitionName}=(\\S+?)(\\/|$$)",
    groupIdx = 2
  ).alias(partitionName).as[String]
)
Again, the benefit of querying partition values in this way is that Spark will not scan all the files in the table to get you the answer. If you have a table with tens or hundreds of thousands of files in it, your time savings will be significant.
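To tie this back to the original question (finding the latest date partition), here is a minimal Scala sketch that builds on the extraction above, reusing the test_dev_db.newpartitiontable example from the first answer:
import org.apache.spark.sql.functions.{col, max, regexp_extract}

val partitionName = "tran_date"

// Extract the partition value from each `show partitions` row and take the max,
// without touching the table's data files.
val latest = spark.sql("show partitions test_dev_db.newpartitiontable")
  .select(regexp_extract(col("partition"), s"$partitionName=(\\S+)", 1).alias(partitionName))
  .agg(max(col(partitionName)))
  .head()
  .getString(0)

// Filter on the partition column so only that single partition is scanned.
spark.table("test_dev_db.newpartitiontable")
  .where(col(partitionName) === latest)
  .show()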

How to load only the data of the last partition

I have some data partitioned this way:
/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1
When using this data, I'd like to load the last version only of each month.
A simple way to do this is to do load("/data/year=2016/month=11/version=1") instead of load("/data").
The drawback of this solution is the loss of partitioning information such as year and month, which means it would not be possible to apply operations based on the year or the month anymore.
Is it possible to ask Spark to load the last version only of each month? How would you go about this?
Well, Spark supports predicate push-down, so if you provide a filter following the load, it will only read in the data fulfilling the criteria in the filter. Like this:
spark.read.option("basePath", "/data").load("/data").filter('version === 3)
And you get to keep the partitioning information :)
Just an addition to the previous answers, for reference.
I have the below ORC-format table in Hive, which is partitioned on the year, month, and day columns.
hive (default)> show partitions test_dev_db.partition_date_table;
OK
year=2019/month=08/day=07
year=2019/month=08/day=08
year=2019/month=08/day=09
If I set the below properties, I can read just the requested partition's data in Spark SQL, as shown below:
spark.sql("SET spark.sql.orc.enabled=true");
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
spark.sql("SET spark.sql.orc.filterPushdown=true")
spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day='07' """).explain(True)
We can see PartitionCount: 1 in the plan, which shows that it has scanned only the requested partition.
== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#212,emp_name#213,emp_salary#214,emp_date#215,year#216,month#217,day#218] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(year#216), isnotnull(month#217), isnotnull(day#218), (year#216 = 2019), (month#217 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
Whereas the same will not work if I use the below query.
Even if we create a DataFrame using spark.read.format("orc").load(<HDFS absolute path of the table>), create a temporary view, and run Spark SQL on that, it will still scan all the partitions available for that table unless we use a specific filter condition on top of it.
spark.sql("""select * from test_dev_db.partition_date_table where year ='2019' and month='08' and day in (select max(day) from test_dev_db.partition_date_table)""").explain(True)
It has still scanned all three partitions; here PartitionCount: 3.
== Physical Plan ==
*(2) BroadcastHashJoin [day#282], [max(day)#291], LeftSemi, BuildRight
:- *(2) FileScan orc test_dev_db.partition_date_table[emp_id#276,emp_name#277,emp_salary#278,emp_date#279,year#280,month#281,day#282] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 3, PartitionFilters: [isnotnull(year#280), isnotnull(month#281), (year#280 = 2019), (month#281 = 08)], PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
To filter the data based on the max partition using Spark SQL, we can use the below approach. This partition-pruning technique limits the number of files and partitions that Spark reads when querying the Hive ORC table data.
rdd=spark.sql("""show partitions test_dev_db.partition_date_table""").rdd.flatMap(lambda x:x)
newrdd=rdd.map(lambda x : x.replace("/","")).map(lambda x : x.replace("year=","")).map(lambda x : x.replace("month=","-")).map(lambda x : x.replace("day=","-")).map(lambda x : x.split('-'))
max_year=newrdd.map(lambda x : (x[0])).max()
max_month=newrdd.map(lambda x : x[1]).max()
max_day=newrdd.map(lambda x : x[2]).max()
Prepare your query to filter the Hive partitioned table using these max values.
query = "select * from test_dev_db.partition_date_table where year ='{0}' and month='{1}' and day ='{2}'".format(max_year,max_month,max_day)
>>> spark.sql(query).show();
+------+--------+----------+----------+----+-----+---+
|emp_id|emp_name|emp_salary| emp_date|year|month|day|
+------+--------+----------+----------+----+-----+---+
| 3| Govind| 810000|2019-08-09|2019| 08| 09|
| 4| Vikash| 5500|2019-08-09|2019| 08| 09|
+------+--------+----------+----------+----+-----+---+
spark.sql(query).explain(True)
If you look at the plan of this query, you can see that it has scanned only one partition of the given Hive table: here PartitionCount is 1.
== Optimized Logical Plan ==
Filter (((((isnotnull(day#397) && isnotnull(month#396)) && isnotnull(year#395)) && (year#395 = 2019)) && (month#396 = 08)) && (day#397 = 09))
+- Relation[emp_id#391,emp_name#392,emp_salary#393,emp_date#394,year#395,month#396,day#397] orc
== Physical Plan ==
*(1) FileScan orc test_dev_db.partition_date_table[emp_id#391,emp_name#392,emp_salary#393,emp_date#394,year#395,month#396,day#397] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/database/test_dev..., PartitionCount: 1, PartitionFilters: [isnotnull(day#397), isnotnull(month#396), isnotnull(year#395), (year#395 = 2019), (month#396 = 0..., PushedFilters: [], ReadSchema: struct<emp_id:int,emp_name:string,emp_salary:int,emp_date:date>
I think you have to use Spark's window functions to find the latest version and then keep only those rows.
import org.apache.spark.sql.functions.{col, first}
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("year", "month").orderBy(col("version").desc)

spark.read.load("/data")
  .withColumn("maxVersion", first("version").over(windowSpec))
  .select("*")
  .filter(col("maxVersion") === col("version"))
  .drop("maxVersion")
Let me know if this works for you.
Here's a Scala general function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, col, row_number}

/**
 * Given a DataFrame, use keys (e.g. last modified time) to keep only the most up-to-date record.
 *
 * @param dF          DataFrame to be parsed
 * @param groupByKeys The columns you would like to group by and expect to be duplicated,
 *                    hence why you're trying to obtain records according to the latest value of keys.
 * @param keys        The sequence of keys used to rank the records in the table
 * @return DataFrame with the records that have rank 1, i.e. the most up-to-date version of those records
 */
def getLastUpdatedRecords(dF: DataFrame, groupByKeys: Seq[String], keys: Seq[String]): DataFrame = {
  val part = Window.partitionBy(groupByKeys.head, groupByKeys.tail: _*)
    .orderBy(array(keys.head, keys.tail: _*).desc)
  val rowDF = dF.withColumn("rn", row_number().over(part))
  val res = rowDF.filter(col("rn") === 1).drop("rn")
  res
}
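As a hypothetical usage example for the layout from the question (the path and column names are assumptions, not part of the original answer):
// Keep only the rows belonging to the highest version within each (year, month)
val latestPerMonth = getLastUpdatedRecords(
  spark.read.option("basePath", "/data").load("/data"),
  groupByKeys = Seq("year", "month"),
  keys = Seq("version")
)
latestPerMonth.show()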
