Get all not-null columns of a Spark dataframe in one column - apache-spark

I need to select all not-null columns from a Hive table and insert them into HBase. For example, consider the table below:
Name     | Place       | Department | Experience
=========|=============|============|===========
Ram      | Ramgarh     | Sales      | 14
Lakshman | Lakshmanpur | Operations |
Sita     | Sitapur     |            | 14
Ravan    |             |            | 25
I have to write all the not-null columns from the above table to HBase, so I wrote logic to collect the not-null column names into one column of the dataframe, as shown below. The Name column is always mandatory.
Name     | Place       | Department | Experience | Not_null_columns
=========|=============|============|============|====================================
Ram      | Ramgarh     | Sales      | 14         | Name, Place, Department, Experience
Lakshman | Lakshmanpur | Operations |            | Name, Place, Department
Sita     | Sitapur     |            | 14         | Name, Place, Experience
Ravan    |             |            | 25         | Name, Experience
Now my requirement is to add a column to the dataframe that holds the names and values of all not-null columns in a single field, as shown below.
Name     | Place       | Department | Experience | Not_null_columns_values
=========|=============|============|============|=============================================================
Ram      | Ramgarh     | Sales      | 14         | Name: Ram, Place: Ramgarh, Department: Sales, Experience: 14
Lakshman | Lakshmanpur | Operations |            | Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita     | Sitapur     |            | 14         | Name: Sita, Place: Sitapur, Experience: 14
Ravan    |             |            | 25         | Name: Ravan, Experience: 25
Once I have the above dataframe I will write it to HBase with Name as the key and the last column as the value.
Please let me know if there is a better approach to do this.

Try this-
Load the test data provided
import org.apache.spark.sql.functions._
import spark.implicits._

val data =
  """
    |Name | Place | Department | Experience
    |
    |Ram | Ramgarh | Sales | 14
    |
    |Lakshman | Lakshmanpur |Operations |
    |
    |Sita | Sitapur | | 14
    |
    |Ravan | | | 25
  """.stripMargin

val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()

val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  // .option("nullValue", "null")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+-----------+----------+----------+
* |Name |Place |Department|Experience|
* +--------+-----------+----------+----------+
* |Ram |Ramgarh |Sales |14 |
* |Lakshman|Lakshmanpur|Operations|null |
* |Sita |Sitapur |null |14 |
* |Ravan |null |null |25 |
* +--------+-----------+----------+----------+
*
* root
* |-- Name: string (nullable = true)
* |-- Place: string (nullable = true)
* |-- Department: string (nullable = true)
* |-- Experience: integer (nullable = true)
*/
Convert all columns to a struct and then to JSON; to_json drops null fields by default, so only the not-null columns appear in the result:
val x = df.withColumn("Not_null_columns_values",
  to_json(struct(df.columns.map(col): _*)))
x.show(false)
x.printSchema()
/**
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Name |Place |Department|Experience|Not_null_columns_values |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Ram |Ramgarh |Sales |14 |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14}|
* |Lakshman|Lakshmanpur|Operations|null |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"} |
* |Sita |Sitapur |null |14 |{"Name":"Sita","Place":"Sitapur","Experience":14} |
* |Ravan |null |null |25 |{"Name":"Ravan","Experience":25} |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
*/
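Since the stated plan is to write Name as the row key and the JSON column as the value, here is a minimal preparation sketch (the HBase write itself depends on the connector you use, e.g. the hbase-spark module or shc, and is not shown):
// Keep only the row key and the JSON payload for the HBase write.
val hbaseReady = x.select($"Name".as("rowkey"), $"Not_null_columns_values".as("value"))
hbaseReady.show(false)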

Related

How to rename columns in Spark SQL inside WITH and VALUES?

Given a table built in this way with Spark SQL (2.4.*):
scala> spark.sql("with some_data (values ('A',1),('B',2)) select * from some_data").show()
+----+----+
|col1|col2|
+----+----+
| A| 1|
| B| 2|
+----+----+
I wasn't able to set the column names (they default to col1 and col2). Is there a way to rename those columns, for example to label and value?
Either modify your query as follows:
spark.sql("with some_data (values ('A',1),('B',2) T(label, value)) select * from some_data").show()
/**
* +-----+-----+
* |label|value|
* +-----+-----+
* | A| 1|
* | B| 2|
* +-----+-----+
*/
or use this example for reference:
val df = spark.sql(
"""
|select Class_Name, Customer, Date_Time, Median_Percentage
|from values
| ('ClassA', 'A', '6/13/20', 64550),
| ('ClassA', 'B', '6/6/20', 40200),
| ('ClassB', 'F', '6/20/20', 26800),
| ('ClassB', 'G', '6/20/20', 18100)
| T(Class_Name, Customer, Date_Time, Median_Percentage)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +----------+--------+---------+-----------------+
* |Class_Name|Customer|Date_Time|Median_Percentage|
* +----------+--------+---------+-----------------+
* |ClassA |A |6/13/20 |64550 |
* |ClassA |B |6/6/20 |40200 |
* |ClassB |F |6/20/20 |26800 |
* |ClassB |G |6/20/20 |18100 |
* +----------+--------+---------+-----------------+
*
* root
* |-- Class_Name: string (nullable = false)
* |-- Customer: string (nullable = false)
* |-- Date_Time: string (nullable = false)
* |-- Median_Percentage: integer (nullable = false)
*/
Note the alias T(Class_Name, Customer, Date_Time, Median_Percentage), which assigns the required column names.
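If you prefer the DataFrame API over rewriting the SQL, a small sketch of renaming the columns positionally after the query runs (toDF is standard Spark; the query itself is the one from the question):
// Rename col1/col2 to label/value without touching the SQL.
val renamed = spark.sql("with some_data (values ('A',1),('B',2)) select * from some_data")
  .toDF("label", "value")
renamed.show()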

Transform columns in Spark DataFrame based on map without using UDFs

I would like to transform some columns in my dataframe based on configuration represented by Scala maps.
I have 2 cases:
Receiving a map Map[String, Seq[String]] and columns col1, col2: transform col3 if the map has an entry with key = col1 and col2 is in that entry's value list.
Receiving a map Map[String, (Long, Long)] and columns col1, col2: transform col3 if
the map has an entry with key = col1 and col2 is in the range described by the tuple of Longs as (start, end).
Examples:
Case 1
Given this table and the map Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22")):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | v1 |
+------+------+------+
| u2 | w2 | v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
I would like to add an "x-" prefix to col3 only if it matches the map entry:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | x-v1 |
+------+------+------+
| u2 | w2 | x-v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
Case 2:
Given this table and the map Map("u1" -> (1,5), "u2" -> (2,4)):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
expected output should be:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | x-v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | x-v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
This could easily be done with UDFs, but for performance reasons I would like to avoid them.
Is there a way to achieve this without UDFs in Spark 2.4.2?
Thanks
Check the code below.
Note:
I have changed your second case's map values to Map("u1" -> Seq(1,5), "u2" -> Seq(2,4)).
The approach converts the map to a JSON map, adds the JSON map as a column value to the DataFrame, and then applies the logic on the DataFrame.
If possible, build the JSON map directly so you can avoid the map-to-JSON conversion.
Import required libraries.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
Case-1 Logic
scala> val caseOneDF = Seq(("u1","w1","v1"),("u2","w2","v2"),("u3","w3","v3")).toDF("col1","col2","col3")
caseOneDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
scala> val caseOneMap = Map("u1" -> Seq("w1","w11"),"u2" -> Seq("w2","w22"))
caseOneMap: scala.collection.immutable.Map[String,Seq[String]] = Map(u1 -> List(w1, w11), u2 -> List(w2, w22))
scala> val caseOneJsonMap = lit(compact(render(caseOneMap)))
caseOneJsonMap: org.apache.spark.sql.Column = {"u1":["w1","w11"],"u2":["w2","w22"]}
scala> val caseOneSchema = MapType(StringType,ArrayType(StringType))
caseOneSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(StringType,true),true)
scala> val caseOneExpr = from_json(caseOneJsonMap,caseOneSchema)
caseOneExpr: org.apache.spark.sql.Column = entries
Case-1 Final Output
scala> caseOneDF
.withColumn("data",caseOneExpr)
.withColumn("col3",when(expr("array_contains(data[col1],col2)"),concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |w1 |x-v1|
|u2 |w2 |x-v2|
|u3 |w3 |v3 |
+----+----+----+
Case-2 Logic
scala> val caseTwoDF = Seq(("u1",2,"v1"),("u1",6,"v11"),("u2",3,"v3"),("u3",4,"v3")).toDF("col1","col2","col3")
caseTwoDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> val caseTwoMap = Map("u1" -> Seq(1,5),"u2" -> Seq(2,4))
caseTwoMap: scala.collection.immutable.Map[String,Seq[Int]] = Map(u1 -> List(1, 5), u2 -> List(2, 4))
scala> val caseTwoJsonMap = lit(compact(render(caseTwoMap)))
caseTwoJsonMap: org.apache.spark.sql.Column = {"u1":[1,5],"u2":[2,4]}
scala> val caseTwoSchema = MapType(StringType,ArrayType(IntegerType))
caseTwoSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(IntegerType,true),true)
scala> val caseTwoExpr = from_json(caseTwoJsonMap,caseTwoSchema)
caseTwoExpr: org.apache.spark.sql.Column = entries
Case-2 Final Output
scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |2 |x-v1|
|u1 |6 |v11 |
|u2 |3 |x-v3|
|u3 |4 |v3 |
+----+----+----+
Another alternative -
import org.apache.spark.sql.functions.typedLit
Case-1
df1.show(false)
df1.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |w1 |v1 |
* |u2 |w2 |v2 |
* |u3 |w3 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
*/
val case1 = Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22"))
val p1 = df1.withColumn("case1", typedLit(case1))
.withColumn("col3",
when(array_contains(expr("case1[col1]"), $"col2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p1.show(false)
p1.printSchema()
/**
* +----+----+----+----------------------------------+
* |col1|col2|col3|case1 |
* +----+----+----+----------------------------------+
* |u1 |w1 |x-v1|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u2 |w2 |x-v2|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u3 |w3 |v3 |[u1 -> [w1, w11], u2 -> [w2, w22]]|
* +----+----+----+----------------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
* |-- case1: map (nullable = false)
* | |-- key: string
* | |-- value: array (valueContainsNull = true)
* | | |-- element: string (containsNull = true)
*/
Case-2
df2.show(false)
df2.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |2 |v1 |
* |u1 |6 |v11 |
* |u2 |3 |v3 |
* |u3 |4 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
*/
val case2 = Map("u1" -> (1,5), "u2" -> (2, 4))
val p = df2.withColumn("case2", typedLit(case2))
.withColumn("col3",
when(expr("col2 between case2[col1]._1 and case2[col1]._2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p.show(false)
p.printSchema()
/**
* +----+----+----+----------------------------+
* |col1|col2|col3|case2 |
* +----+----+----+----------------------------+
* |u1 |2 |x-v1|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u1 |6 |v11 |[u1 -> [1, 5], u2 -> [2, 4]]|
* |u2 |3 |x-v3|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u3 |4 |v3 |[u1 -> [1, 5], u2 -> [2, 4]]|
* +----+----+----+----------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
* |-- case2: map (nullable = false)
* | |-- key: string
* | |-- value: struct (valueContainsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: integer (nullable = false)
*/
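In both variants the map column is only a lookup helper; if you don't want it in the final output, drop it afterwards, mirroring the .drop("data") used in the first approach. A one-line sketch:
// Cleanup step: remove the helper map columns once col3 has been rewritten.
val p1Final = p1.drop("case1")
val p2Final = p.drop("case2")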

SparkSQL - Extract multiple regex matches (using SQL only)

I have a dataset of SQL queries in raw text and another with a regular expression of all the possible table names:
# queries
+-----+----------------------------------------------+
| id | query |
+-----+----------------------------------------------+
| 1 | select * from table_a, table_b |
| 2 | select * from table_c join table_d... |
+-----+----------------------------------------------+
# regexp
'table_a|table_b|table_c|table_d'
And I wanted the following result:
# expected result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | [table_a, table_b] |
| 2 | [table_c, table_d] |
+-----+----------------------------------------------+
But using the following SQL in Spark, all I get is the first match...
select
id,
regexp_extract(query, 'table_a|table_b|table_c|table_d') as tables
from queries
# actual result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 2 | table_c |
+-----+----------------------------------------------+
Is there any way to do this using only Spark SQL? This is the function I am using https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#regexp_extract
EDIT
I would also accept a solution that returned the following:
# alternative solution
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 1 | table_b |
| 2 | table_c |
| 2 | table_d |
+-----+----------------------------------------------+
SOLUTION
#chlebek solved this below. I reformatted his SQL using CTEs for better readability:
with
split_queries as (
select
id,
explode(split(query, ' ')) as col
from queries
),
extracted_tables as (
select
id,
regexp_extract(col, 'table_a|table_b|table_c|table_d', 0) as rx
from split_queries
)
select
id,
collect_set(rx) as tables
from extracted_tables
where rx != ''
group by id
Bear in mind that the split(query, ' ') part of the query will split your SQL only by spaces. If you have other things such as tabs, line breaks, comments, etc., you should deal with these before or when splitting.
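For example, a sketch of the same CTE query with a broader delimiter class (whitespace, commas, semicolons, parentheses, dots and equals signs), assuming those are the only separators that matter for your queries and that the input is registered as the temp view queries:
spark.sql("""
  with split_queries as (
    select id, explode(split(query, '[\\s,;().=]+')) as col
    from queries
  ),
  extracted_tables as (
    select id, regexp_extract(col, 'table_a|table_b|table_c|table_d', 0) as rx
    from split_queries
  )
  select id, collect_set(rx) as tables
  from extracted_tables
  where rx != ''
  group by id
""").show(false)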
If you have only a few values to check, you can use the contains function instead of a regexp:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
val names = Seq("table_a", "table_b", "table_c", "table_d")
def c(col: Column) = names.map(n => when(col.contains(n), n).otherwise(""))
df.select('id, array_remove(array(c('query): _*), "").as("result")).show(false)
but using a regexp it will look like below (DataFrame API):
df.select('id, explode(split('query, " ")))
  .select('id, regexp_extract('col, "table_a|table_b|table_c|table_d", 0).as("rx"))
  .filter('rx =!= "")
  .groupBy('id)
  .agg(collect_list('rx))
and it can be translated to the SQL query below:
select id, collect_list(rx) from
(select id, regexp_extract(col,'table_a|table_b|table_c|table_d',0) as rx from
(select id, explode(split(query,' ')) as col from df) q1
) q2
where rx != '' group by id
so output will be:
+---+------------------+
| id| collect_list(rx)|
+---+------------------+
| 1|[table_a, table_b]|
| 2|[table_c, table_d]|
+---+------------------+
As you are using Spark SQL, you can use the SQL parser and it will do the job for you.
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

def getTables(query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on
a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
You can register getTables as a UDF and use it in a query, as sketched below.
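A sketch of that registration, assuming the getTables function above is in scope and the input sits in a temp view named queries; note that the helper relies on the driver-side session's parser, so verify it behaves as expected in your deployment mode:
// Register the parser-based helper and call it from SQL.
spark.udf.register("get_tables", (q: String) => getTables(q))
spark.sql("select id, get_tables(query) as tables from queries").show(false)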
You can use another SQL function available in Spark called collect_list (https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#collect_list). Another sample is at https://mungingdata.com/apache-spark/arraytype-columns/.
Basically, applied to your code it should be:
val df = spark.sql("select 1 id, 'select * from table_a, table_b' query" )
val df1 = spark.sql("select 2 id, 'select * from table_c join table_d' query" )
val df3 = df.union(df1)
df3.createOrReplaceTempView("tabla")
spark.sql("""
select id, collect_list(tables) from (
select id, explode(split(query, ' ')) as tables
from tabla)
where tables like 'table%' group by id""").show
The output will be
+---+--------------------+
| id|collect_list(tables)|
+---+--------------------+
| 1| [table_a,, table_b]|
| 2| [table_c, table_d]|
+---+--------------------+
Hope this helps
If you are on Spark >= 2.4 then you can avoid the explode-and-collect round trip by using higher-order functions on the array, without any subqueries.
Load the test data
import spark.implicits._

val data =
  """
    |id | query
    |1 | select * from table_a, table_b
    |2 | select * from table_c join table_d on table_c.id=table_d.id
  """.stripMargin

val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(";"))
  .toSeq.toDS()

val df = spark.read
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- id: integer (nullable = true)
* |-- query: string (nullable = true)
*
* +---+-----------------------------------------------------------+
* |id |query |
* +---+-----------------------------------------------------------+
* |1 |select * from table_a, table_b |
* |2 |select * from table_c join table_d on table_c.id=table_d.id|
* +---+-----------------------------------------------------------+
*/
Extract the tables from query
// spark >= 2.4.0
df.createOrReplaceTempView("queries")
spark.sql(
"""
|select id,
| array_distinct(
| FILTER(
| split(query, '\\.|=|\\s+|,'), x -> x rlike 'table_a|table_b|table_c|table_d'
| )
| )as tables
|FROM
| queries
""".stripMargin)
.show(false)
/**
* +---+------------------+
* |id |tables |
* +---+------------------+
* |1 |[table_a, table_b]|
* |2 |[table_c, table_d]|
* +---+------------------+
*/
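The same Spark >= 2.4 higher-order-function approach can also be expressed through the DataFrame API with expr; a sketch using the same split pattern and table-name regex as above:
df.withColumn("tables",
    expr("""array_distinct(filter(split(query, '\\.|=|\\s+|,'), x -> x rlike 'table_a|table_b|table_c|table_d'))"""))
  .show(false)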

Issues creating Spark table from EXTERNALLY partitioned data

CSV Data is stored daily on AWS S3, as follows:
/data/year=2020/month=5/day=5/<data-part-1.csv, data-part-2.csv,...data-part-K.csv>
The query I would like to get working:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
Outcome: the table is empty.
I attempted specifying the location more precisely as ".../data/year=*/month=*/day=*" instead of ".../data/".
I also attempted the suggestion to run this command, which did not work:
spark.sql("msck repair table database_name.table_name")
The version below is able to load data, but I need the year/month/day columns; the idea is to filter by those to make queries faster:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
Outcome: loads table as expected, but queries are very slow.
This version also loads a table; however, the year/month/day columns are null:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT, year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
I am assuming the first query is the correct way to load this data, based on the documentation. Looking at the resulting schema, that also seems to be correct; however, I cannot get it to actually load any data.
Does anyone know what I am doing wrong?
Check if this is helpful.
Please note the SparkSession is created without Hive support.
1. Create a dummy test dataframe and store it as CSV partitioned by year, month and day
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(1).withColumn("date",
    explode(sequence(to_date(lit("2020-06-09")), to_date(lit("2020-06-20")), expr("interval 1 day")))
  ).withColumn("year", year($"date"))
  .withColumn("month", month($"date"))
  .withColumn("day", dayofmonth($"date"))
df.show(false)
df.printSchema()
/**
* +---+----------+----+-----+---+
* |id |date |year|month|day|
* +---+----------+----+-----+---+
* |0 |2020-06-09|2020|6 |9 |
* |0 |2020-06-10|2020|6 |10 |
* |0 |2020-06-11|2020|6 |11 |
* |0 |2020-06-12|2020|6 |12 |
* |0 |2020-06-13|2020|6 |13 |
* |0 |2020-06-14|2020|6 |14 |
* |0 |2020-06-15|2020|6 |15 |
* |0 |2020-06-16|2020|6 |16 |
* |0 |2020-06-17|2020|6 |17 |
* |0 |2020-06-18|2020|6 |18 |
* |0 |2020-06-19|2020|6 |19 |
* |0 |2020-06-20|2020|6 |20 |
* +---+----------+----+-----+---+
*
* root
* |-- id: long (nullable = false)
* |-- date: date (nullable = false)
* |-- year: integer (nullable = false)
* |-- month: integer (nullable = false)
* |-- day: integer (nullable = false)
*/
df.repartition(2).write.partitionBy("year", "month", "day")
.option("header", true)
.mode(SaveMode.Overwrite)
.csv("/Users/sokale/models/hive_table")
File structure
/**
* File structure - /Users/sokale/models/hive_table
* ---------------
* year=2020
* year=2020/month=6
* year=2020/month=6/day=10
* |- part...csv files (same part files for all the below directories)
* year=2020/month=6/day=11
* year=2020/month=6/day=12
* year=2020/month=6/day=13
* year=2020/month=6/day=14
* year=2020/month=6/day=15
* year=2020/month=6/day=16
* year=2020/month=6/day=17
* year=2020/month=6/day=18
* year=2020/month=6/day=19
* year=2020/month=6/day=20
* year=2020/month=6/day=9
*/
Read the partitioned table
val csvDF = spark.read.option("header", true)
.csv("/Users/sokale/models/hive_table")
csvDF.show(false)
csvDF.printSchema()
/**
* +---+----------+----+-----+---+
* |id |date |year|month|day|
* +---+----------+----+-----+---+
* |0 |2020-06-20|2020|6 |20 |
* |0 |2020-06-19|2020|6 |19 |
* |0 |2020-06-09|2020|6 |9 |
* |0 |2020-06-12|2020|6 |12 |
* |0 |2020-06-10|2020|6 |10 |
* |0 |2020-06-15|2020|6 |15 |
* |0 |2020-06-16|2020|6 |16 |
* |0 |2020-06-17|2020|6 |17 |
* |0 |2020-06-13|2020|6 |13 |
* |0 |2020-06-18|2020|6 |18 |
* |0 |2020-06-14|2020|6 |14 |
* |0 |2020-06-11|2020|6 |11 |
* +---+----------+----+-----+---+
*
* root
* |-- id: string (nullable = true)
* |-- date: string (nullable = true)
* |-- year: integer (nullable = true)
* |-- month: integer (nullable = true)
* |-- day: integer (nullable = true)
*/
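Because year, month and day come back as partition columns, filters on them prune the matching directories at read time; a small sketch of such a query (path and values are from the example above):
csvDF.createOrReplaceTempView("daily_data")
// Only the year=2020/month=6/day=12 directory should be scanned for this query.
spark.sql("select * from daily_data where year = 2020 and month = 6 and day = 12").show(false)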

Getting unexpected results from Spark sql Windows Functions

It seems like the Spark SQL Window function is not working properly.
I am running a Spark job on a Hadoop cluster where the HDFS block size is 128 MB, with
Spark version 1.5 on CDH 5.5.
My requirement:
If there are multiple records with the same data_rfe_id, take the single record with the maximum seq_id and maximum service_id.
I see that in the raw data there are some records with the same data_rfe_id and the same seq_id, hence I applied row_number using a Window function so that I can filter the records with row_num === 1.
But it seems this is not working on huge datasets; I see that the same rowNumber is applied.
Why is this happening?
Do I need to reshuffle before I apply the window function on the dataframe?
I am expecting a unique rank number for each data_rfe_id.
I want to use a Window function only to achieve this.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
.....
scala> df.printSchema
root
|-- transitional_key: string (nullable = true)
|-- seq_id: string (nullable = true)
|-- data_rfe_id: string (nullable = true)
|-- service_id: string (nullable = true)
|-- event_start_date_time: string (nullable = true)
|-- event_id: string (nullable = true)
val windowFunction = Window.partitionBy(df("data_rfe_id")).orderBy(df("seq_id").desc,df("service_id").desc)
val rankDF =df.withColumn("row_num",rowNumber.over(windowFunction))
rankDF.select("data_rfe_id","seq_id","service_id","row_num").show(200,false)
Expected result :
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |2 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |2 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |3 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |4 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |5 |
Actual Result I got as per above code :
+------------------------------------+-----------------+-----------+-------+
|data_rfe_id |seq_id |service_id|row_num|
+------------------------------------+-----------------+-----------+-------+
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695826 |4039 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695821 |3356 |1 |
|9ih67fshs-de11-4f80-a66d-b52a12c14b0e|1695802 |1857 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2156 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2103 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2083 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2082 |1 |
|23sds222-9669-429e-a95b-bc984ccf0fb0 |1695541 |2076 |1 |
Could someone explain why I am getting these unexpected results, and how do I resolve this?
Basically you want to rank, having seq_id and service_id in descending order. Add rangeBetween with the range you need. rank may work for you. The following is a snippet of code:
val windowFunction = Window.partitionBy(df("data_rfe_id"))
  .orderBy(df("seq_id").desc, df("service_id").desc)
  .rangeBetween(-MAXNUMBER, MAXNUMBER)
val rankDF = df.withColumn("rank", rank().over(windowFunction))
As you are using an older version of Spark, I don't know whether it will work or not. There is an issue with windowSpec; here is a reference.
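For reference, a sketch of the asker's original row_number approach with the ordering fixed so that each partition is numbered 1, 2, 3, ... and the top record can be kept (in Spark 1.5 the function is named rowNumber, as imported in the question; newer versions use row_number):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy(df("data_rfe_id"))
  .orderBy(df("seq_id").desc, df("service_id").desc)

df.withColumn("row_num", row_number().over(w))
  .filter(col("row_num") === 1)   // keep the record with max seq_id, then max service_id
  .show(200, false)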
