Given a table built in this way with Spark SQL (2.4.*):
scala> spark.sql("with some_data (values ('A',1),('B',2)) select * from some_data").show()
+----+----+
|col1|col2|
+----+----+
| A| 1|
| B| 2|
+----+----+
I wasn't able to set the column names (they default to col1 and col2). Is there a way to rename those columns, for example to label and value?
Either modify your query as below -
spark.sql("with some_data (values ('A',1),('B',2) T(label, value)) select * from some_data").show()
/**
* +-----+-----+
* |label|value|
* +-----+-----+
* | A| 1|
* | B| 2|
* +-----+-----+
*/
or use this example for reference -
val df = spark.sql(
"""
|select Class_Name, Customer, Date_Time, Median_Percentage
|from values
| ('ClassA', 'A', '6/13/20', 64550),
| ('ClassA', 'B', '6/6/20', 40200),
| ('ClassB', 'F', '6/20/20', 26800),
| ('ClassB', 'G', '6/20/20', 18100)
| T(Class_Name, Customer, Date_Time, Median_Percentage)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +----------+--------+---------+-----------------+
* |Class_Name|Customer|Date_Time|Median_Percentage|
* +----------+--------+---------+-----------------+
* |ClassA |A |6/13/20 |64550 |
* |ClassA |B |6/6/20 |40200 |
* |ClassB |F |6/20/20 |26800 |
* |ClassB |G |6/20/20 |18100 |
* +----------+--------+---------+-----------------+
*
* root
* |-- Class_Name: string (nullable = false)
* |-- Customer: string (nullable = false)
* |-- Date_Time: string (nullable = false)
* |-- Median_Percentage: integer (nullable = false)
*/
Note the T(Class_Name, Customer, Date_Time, Median_Percentage) alias, which provides the column names as required.
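Alternatively, if you would rather not change the SQL, the default names can be replaced on the DataFrame side. A minimal sketch (the variable name renamed is just illustrative):
val renamed = spark
  .sql("with some_data (values ('A',1),('B',2)) select * from some_data")
  .toDF("label", "value") // positional rename of the default col1/col2

renamed.show() // prints the same two rows, now with label/value headers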
I would like to transform some columns in my dataframe based on configuration represented by Scala maps.
I have two cases:
Receiving a map Map[String, Seq[String]] and columns col1, col2: transform col3 if there is an entry in the map with key = col1 and col2 is in that entry's value list.
Receiving a map Map[String, (Long, Long)] and columns col1, col2: transform col3 if there is an entry in the map with key = col1 and col2 is in the range described by the tuple of Longs as (start, end).
examples:
Case 1:
having this table and a map Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22"))
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | v1 |
+------+------+------+
| u2 | w2 | v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
I would like to add an "x-" prefix to col3, but only if the row matches the map (i.e. col2 is in the list for key col1):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | x-v1 |
+------+------+------+
| u2 | w2 | x-v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
Case 2:
This table and the map Map("u1" -> (1,5), "u2" -> (2,4))
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
expected output should be:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | x-v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | x-v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
This can easily be done with UDFs, but for performance reasons I would prefer not to use them.
Is there a way to achieve this without UDFs in Spark 2.4.2?
Thanks
Check below code.
Note -
I have changed your second case's map values to Map("u1" -> Seq(1,5), "u2" -> Seq(2,4)).
The approach: convert the Scala map to a JSON map, add the JSON map as a column to the DataFrame, then apply the logic on the DataFrame.
If possible, you can write the JSON map directly so that you avoid converting the Scala map to JSON.
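For example, a minimal sketch of what that direct JSON literal would look like (the value shown matches the Case-1 map used below):
// assuming org.apache.spark.sql.functions.lit is in scope (see imports below)
val caseOneJsonMap = lit("""{"u1":["w1","w11"],"u2":["w2","w22"]}""")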
Import required libraries.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
Case-1 Logic
scala> val caseOneDF = Seq(("u1","w1","v1"),("u2","w2","v2"),("u3","w3","v3")).toDF("col1","col2","col3")
caseOneDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
scala> val caseOneMap = Map("u1" -> Seq("w1","w11"),"u2" -> Seq("w2","w22"))
caseOneMap: scala.collection.immutable.Map[String,Seq[String]] = Map(u1 -> List(w1, w11), u2 -> List(w2, w22))
scala> val caseOneJsonMap = lit(compact(render(caseOneMap)))
caseOneJsonMap: org.apache.spark.sql.Column = {"u1":["w1","w11"],"u2":["w2","w22"]}
scala> val caseOneSchema = MapType(StringType,ArrayType(StringType))
caseOneSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(StringType,true),true)
scala> val caseOneExpr = from_json(caseOneJsonMap,caseOneSchema)
caseOneExpr: org.apache.spark.sql.Column = entries
Case-1 Final Output
scala> caseOneDF
.withColumn("data",caseOneExpr)
.withColumn("col3",when(expr("array_contains(data[col1],col2)"),concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |w1 |x-v1|
|u2 |w2 |x-v2|
|u3 |w3 |v3 |
+----+----+----+
Case-2 Logic
scala> val caseTwoDF = Seq(("u1",2,"v1"),("u1",6,"v11"),("u2",3,"v3"),("u3",4,"v3")).toDF("col1","col2","col3")
caseTwoDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> val caseTwoMap = Map("u1" -> Seq(1,5),"u2" -> Seq(2,4))
caseTwoMap: scala.collection.immutable.Map[String,Seq[Int]] = Map(u1 -> List(1, 5), u2 -> List(2, 4))
scala> val caseTwoJsonMap = lit(compact(render(caseTwoMap)))
caseTwoJsonMap: org.apache.spark.sql.Column = {"u1":[1,5],"u2":[2,4]}
scala> val caseTwoSchema = MapType(StringType,ArrayType(IntegerType))
caseTwoSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(IntegerType,true),true)
scala> val caseTwoExpr = from_json(caseTwoJsonMap,caseTwoSchema)
caseTwoExpr: org.apache.spark.sql.Column = entries
Case-2 Final Output
scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |2 |x-v1|
|u1 |6 |v11 |
|u2 |3 |x-v3|
|u3 |4 |v3 |
+----+----+----+
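Instead of materializing the range with sequence and then calling array_contains, the same check could be written as a between on the array bounds. A sketch under the same caseTwoDF/caseTwoExpr definitions above (when data[col1] is missing, the between evaluates to null and the otherwise branch keeps col3 unchanged):
caseTwoDF
 .withColumn("data",caseTwoExpr)
 .withColumn("col3",when(expr("col2 between data[col1][0] and data[col1][1]"), concat(lit("x-"),$"col3")).otherwise($"col3"))
 .drop("data")
 .show(false)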
Another alternative -
import org.apache.spark.sql.functions.typedLit
Case-1
df1.show(false)
df1.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |w1 |v1 |
* |u2 |w2 |v2 |
* |u3 |w3 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
*/
val case1 = Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22"))
val p1 = df1.withColumn("case1", typedLit(case1))
.withColumn("col3",
when(array_contains(expr("case1[col1]"), $"col2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p1.show(false)
p1.printSchema()
/**
* +----+----+----+----------------------------------+
* |col1|col2|col3|case1 |
* +----+----+----+----------------------------------+
* |u1 |w1 |x-v1|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u2 |w2 |x-v2|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u3 |w3 |v3 |[u1 -> [w1, w11], u2 -> [w2, w22]]|
* +----+----+----+----------------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
* |-- case1: map (nullable = false)
* | |-- key: string
* | |-- value: array (valueContainsNull = true)
* | | |-- element: string (containsNull = true)
*/
Case-2
df2.show(false)
df2.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |2 |v1 |
* |u1 |6 |v11 |
* |u2 |3 |v3 |
* |u3 |4 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
*/
val case2 = Map("u1" -> (1,5), "u2" -> (2, 4))
val p = df2.withColumn("case2", typedLit(case2))
.withColumn("col3",
when(expr("col2 between case2[col1]._1 and case2[col1]._2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p.show(false)
p.printSchema()
/**
* +----+----+----+----------------------------+
* |col1|col2|col3|case2 |
* +----+----+----+----------------------------+
* |u1 |2 |x-v1|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u1 |6 |v11 |[u1 -> [1, 5], u2 -> [2, 4]]|
* |u2 |3 |x-v3|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u3 |4 |v3 |[u1 -> [1, 5], u2 -> [2, 4]]|
* +----+----+----+----------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
* |-- case2: map (nullable = false)
* | |-- key: string
* | |-- value: struct (valueContainsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: integer (nullable = false)
*/
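The typedLit variant keeps the lookup data in native Spark map/struct types, so no JSON round trip is needed; in both approaches the map ends up as a literal in the query plan, so no UDF is involved.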
I need to select all non-null columns from a Hive table and insert them into HBase. For example, consider the table below:
Name     | Place       | Department | Experience
=================================================
Ram      | Ramgarh     | Sales      | 14
Lakshman | Lakshmanpur | Operations |
Sita     | Sitapur     |            | 14
Ravan    |             |            | 25
I have to write all the non-null columns from the above table to HBase. So I wrote logic to collect the non-null column names into one column of the dataframe, as below. The Name column is always mandatory.
Name     | Place       | Department | Experience | Not_null_columns
=====================================================================
Ram      | Ramgarh     | Sales      | 14         | Name, Place, Department, Experience
Lakshman | Lakshmanpur | Operations |            | Name, Place, Department
Sita     | Sitapur     |            | 14         | Name, Place, Experience
Ravan    |             |            | 25         | Name, Experience
Now my requirement is to create a column in the dataframe containing the names and values of all non-null columns in a single column, as shown below.
Name     | Place       | Department | Experience | Not_null_columns_values
============================================================================
Ram      | Ramgarh     | Sales      | 14         | Name: Ram, Place: Ramgarh, Department: Sales, Experience: 14
Lakshman | Lakshmanpur | Operations |            | Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita     | Sitapur     |            | 14         | Name: Sita, Place: Sitapur, Experience: 14
Ravan    |             |            | 25         | Name: Ravan, Experience: 25
Once I get the above df I will write it to HBase, with Name as the key and the last column as the value.
Please let me know if there is a better approach to do this.
Try this-
Load the test data provided
val data =
"""
|Name | Place | Department | Experience
|
|Ram | Ramgarh | Sales | 14
|
|Lakshman | Lakshmanpur |Operations |
|
|Sita | Sitapur | | 14
|
|Ravan | | | 25
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
// .option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+-----------+----------+----------+
* |Name |Place |Department|Experience|
* +--------+-----------+----------+----------+
* |Ram |Ramgarh |Sales |14 |
* |Lakshman|Lakshmanpur|Operations|null |
* |Sita |Sitapur |null |14 |
* |Ravan |null |null |25 |
* +--------+-----------+----------+----------+
*
* root
* |-- Name: string (nullable = true)
* |-- Place: string (nullable = true)
* |-- Department: string (nullable = true)
* |-- Experience: integer (nullable = true)
*/
Convert the columns to a struct and then to JSON. Note that to_json omits null fields, which is why only the non-null columns appear in the output.
val x = df.withColumn("Not_null_columns_values",
to_json(struct(df.columns.map(col): _*)))
x.show(false)
x.printSchema()
/**
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Name |Place |Department|Experience|Not_null_columns_values |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Ram |Ramgarh |Sales |14 |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14}|
* |Lakshman|Lakshmanpur|Operations|null |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"} |
* |Sita |Sitapur |null |14 |{"Name":"Sita","Place":"Sitapur","Experience":14} |
* |Ravan |null |null |25 |{"Name":"Ravan","Experience":25} |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
*/
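If the exact "Name: Ram, Place: Ramgarh, ..." format from the question is needed instead of JSON, a minimal sketch of the same idea using concat_ws (which skips nulls) might look like this; the names kvCols and y are just illustrative:
import org.apache.spark.sql.functions._

// One "column_name: value" expression per column; null when the column value is null
val kvCols = df.columns.map(c =>
  when(col(c).isNotNull, concat(lit(s"$c: "), col(c).cast("string"))))

// concat_ws drops null inputs, so only the non-null columns show up in the result
val y = df.withColumn("Not_null_columns_values", concat_ws(", ", kvCols: _*))
y.show(false)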
I have a column that looks like this:
Class
A
AA
BB
AAAA
ABA
AAAAA
What I'd like to do is filter this column so that only the rows containing A's and nothing else remain. So the result would be something like this:
Class
A
AA
AAAA
AAAAA
Is there a way to do this in Spark?
Check below code.
scala> val df = Seq("A","AA","BB","AAAA","ABA","AAAAA","BAB").toDF("Class")
df: org.apache.spark.sql.DataFrame = [Class: string]
scala> df.filter(!col("Class").rlike("[^A]+")).show
+-----+
|Class|
+-----+
| A|
| AA|
| AAAA|
|AAAAA|
+-----+
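rlike performs a partial (substring) match, so col("Class").rlike("[^A]+") is true for any value containing at least one non-A character; negating it keeps only the rows made up entirely of A's.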
Try this using the rlike function -
val data1 =
"""
|Class
|A
|AA
|BB
|AAAA
|ABA
|AAAAA
""".stripMargin
val stringDS1 = data1.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df1 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +-----+
* |Class|
* +-----+
* |A |
* |AA |
* |BB |
* |AAAA |
* |ABA |
* |AAAAA|
* +-----+
*
* root
* |-- Class: string (nullable = true)
*/
df1.filter(col("Class").rlike("""^A+$"""))
.show(false)
/**
* +-----+
* |Class|
* +-----+
* |A |
* |AA |
* |AAAA |
* |AAAAA|
* +-----+
*/
It seems like the Spark SQL Window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, with Spark version 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, then take the single record with the maximum seq_id and maximum service_id.
I see that in the raw data there are some records with the same data_rfe_id and the same seq_id, so I applied row_number using a Window function so that I can filter the records with row_num === 1.
But it seems it's not working on huge datasets: I see that the same row number is applied to every row.
Why is this happening?
Do I need to reshuffle before I apply the window function on the dataframe?
I am expecting a unique row number within each data_rfe_id group.
I want to use only a Window function to achieve this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
.....
scala> df.printSchema
root
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc,df("service_id").desc)
val rankDF =df.withColumn("row_num",rowNumber.over(windowFunction))
rankDF.select("data_rfe_id","seq_id","service_id","row_num").show(200,false)
Expected result :
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual result I got with the above code:
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain me why I am getting these unexpected results? and How do I resolve that ?
Basically you want to rank, with seq_id and service_id in descending order. Add rangeBetween with the range you need; rank may work for you. The following is a snippet of code:
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc, df("service_id").desc).rangeBetween(-MAXNUMBER, MAXNUMBER)
val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work or not. There is a known issue with windowSpec in that version.
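For reference, a minimal self-contained sketch of the rank-based approach, under two assumptions: the rangeBetween clause is dropped because rank only depends on the ordering, and seq_id/service_id are cast to long because they are strings in the schema above and would otherwise sort lexicographically. If a strictly unique number per row is required, as in the expected output, row_number over the same window is the direct equivalent of rank here.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Partition by data_rfe_id and order by the numeric values of seq_id and service_id, descending
val w = Window
  .partitionBy(df("data_rfe_id"))
  .orderBy(df("seq_id").cast("long").desc, df("service_id").cast("long").desc)

// Rank rows within each data_rfe_id and keep only the top-ranked record per group
val rankedDF = df.withColumn("rank", rank().over(w))
rankedDF.filter(rankedDF("rank") === 1).show(200, false)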