I would like to transform some columns in my dataframe based on a configuration represented by Scala maps.
I have two cases:
Receiving a map Map[String, Seq[String]] and columns col1, col2: transform col3 if there is an entry in the map with key = col1 and col2 is in that entry's value list.
Receiving a map Map[String, (Long, Long)] and columns col1, col2: transform col3 if there is an entry in the map with key = col1 and col2 is in the range described by the tuple of Longs as (start, end).
Examples:
Case 1:
Given this table and the map Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22")):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | v1 |
+------+------+------+
| u2 | w2 | v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
I would like to add an "x-" prefix to col3, but only for the rows that match:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | x-v1 |
+------+------+------+
| u2 | w2 | x-v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
Case 2:
Given this table and the map Map("u1" -> (1,5), "u2" -> (2,4)):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
The expected output should be:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | x-v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | x-v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
This can easily be done with UDFs, but for performance reasons I would like to avoid them.
Is there a way to achieve this without UDFs in Spark 2.4.2?
Thanks
Check the code below.
Note:
I have changed your second case's map value to Map("u1" -> Seq(1,5), "u2" -> Seq(2,4)).
The idea is to convert the map values to a JSON map, add the JSON map as a literal column on the DataFrame, and then apply the logic on the DataFrame.
If possible, you can supply the values as a JSON map directly, so that you avoid the map-to-JSON conversion.
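For example, case 1's configuration could be supplied straight to lit as a JSON string (a minimal sketch; it yields the same column as the json4s conversion shown below):
// Sketch: if the config is already available as JSON, skip json4s and pass the string to lit()
val caseOneJsonMap = lit("""{"u1":["w1","w11"],"u2":["w2","w22"]}""")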
Import required libraries.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
Case-1 Logic
scala> val caseOneDF = Seq(("u1","w1","v1"),("u2","w2","v2"),("u3","w3","v3")).toDF("col1","col2","col3")
caseOneDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
scala> val caseOneMap = Map("u1" -> Seq("w1","w11"),"u2" -> Seq("w2","w22"))
caseOneMap: scala.collection.immutable.Map[String,Seq[String]] = Map(u1 -> List(w1, w11), u2 -> List(w2, w22))
scala> val caseOneJsonMap = lit(compact(render(caseOneMap)))
caseOneJsonMap: org.apache.spark.sql.Column = {"u1":["w1","w11"],"u2":["w2","w22"]}
scala> val caseOneSchema = MapType(StringType,ArrayType(StringType))
caseOneSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(StringType,true),true)
scala> val caseOneExpr = from_json(caseOneJsonMap,caseOneSchema)
caseOneExpr: org.apache.spark.sql.Column = entries
Case-1 Final Output
scala> caseOneDF
.withColumn("data",caseOneExpr)
.withColumn("col3",when(expr("array_contains(data[col1],col2)"),concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |w1 |x-v1|
|u2 |w2 |x-v2|
|u3 |w3 |v3 |
+----+----+----+
Case-2 Logic
scala> val caseTwoDF = Seq(("u1",2,"v1"),("u1",6,"v11"),("u2",3,"v3"),("u3",4,"v3")).toDF("col1","col2","col3")
caseTwoDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> val caseTwoMap = Map("u1" -> Seq(1,5),"u2" -> Seq(2,4))
caseTwoMap: scala.collection.immutable.Map[String,Seq[Int]] = Map(u1 -> List(1, 5), u2 -> List(2, 4))
scala> val caseTwoJsonMap = lit(compact(render(caseTwoMap)))
caseTwoJsonMap: org.apache.spark.sql.Column = {"u1":[1,5],"u2":[2,4]}
scala> val caseTwoSchema = MapType(StringType,ArrayType(IntegerType))
caseTwoSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(IntegerType,true),true)
scala> val caseTwoExpr = from_json(caseTwoJsonMap,caseTwoSchema)
caseTwoExpr: org.apache.spark.sql.Column = entries
Case-2 Final Output
scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |2 |x-v1|
|u1 |6 |v11 |
|u2 |3 |x-v3|
|u3 |4 |v3 |
+----+----+----+
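Note that sequence(start, stop) materializes the whole range as an array just to test membership; if the ranges can be wide, comparing against the bounds directly avoids that (a small variation on the expression above):
// Variation: compare col2 against the bounds instead of building the range array
caseTwoDF
  .withColumn("data", caseTwoExpr)
  .withColumn("col3",
    when(expr("col2 between data[col1][0] and data[col1][1]"), concat(lit("x-"), $"col3"))
      .otherwise($"col3"))
  .drop("data")
  .show(false)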
Another alternative:
import org.apache.spark.sql.functions.typedLit
Case-1
df1.show(false)
df1.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |w1 |v1 |
* |u2 |w2 |v2 |
* |u3 |w3 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
*/
val case1 = Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22"))
val p1 = df1.withColumn("case1", typedLit(case1))
.withColumn("col3",
when(array_contains(expr("case1[col1]"), $"col2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p1.show(false)
p1.printSchema()
/**
* +----+----+----+----------------------------------+
* |col1|col2|col3|case1 |
* +----+----+----+----------------------------------+
* |u1 |w1 |x-v1|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u2 |w2 |x-v2|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u3 |w3 |v3 |[u1 -> [w1, w11], u2 -> [w2, w22]]|
* +----+----+----+----------------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
* |-- case1: map (nullable = false)
* | |-- key: string
* | |-- value: array (valueContainsNull = true)
* | | |-- element: string (containsNull = true)
*/
Case-2
df2.show(false)
df2.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |2 |v1 |
* |u1 |6 |v11 |
* |u2 |3 |v3 |
* |u3 |4 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
*/
val case2 = Map("u1" -> (1,5), "u2" -> (2, 4))
val p = df2.withColumn("case2", typedLit(case2))
.withColumn("col3",
when(expr("col2 between case2[col1]._1 and case2[col1]._2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p.show(false)
p.printSchema()
/**
* +----+----+----+----------------------------+
* |col1|col2|col3|case2 |
* +----+----+----+----------------------------+
* |u1 |2 |x-v1|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u1 |6 |v11 |[u1 -> [1, 5], u2 -> [2, 4]]|
* |u2 |3 |x-v3|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u3 |4 |v3 |[u1 -> [1, 5], u2 -> [2, 4]]|
* +----+----+----+----------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
* |-- case2: map (nullable = false)
* | |-- key: string
* | |-- value: struct (valueContainsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: integer (nullable = false)
*/
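As with the data column in the JSON-map version, the helper column can be dropped once col3 has been rewritten, for example:
// Drop the broadcast configuration columns after use
p1.drop("case1").show(false)
p.drop("case2").show(false)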
Related
Given a table built in this way with Spark SQL (2.4.*):
scala> spark.sql("with some_data (values ('A',1),('B',2)) select * from some_data").show()
+----+----+
|col1|col2|
+----+----+
| A| 1|
| B| 2|
+----+----+
I wasn't able to set the column names (they default to col1 and col2). Is there a way to rename those columns, for example to label and value?
Either modify your query as follows:
spark.sql("with some_data (values ('A',1),('B',2) T(label, value)) select * from some_data").show()
/**
* +-----+-----+
* |label|value|
* +-----+-----+
* | A| 1|
* | B| 2|
* +-----+-----+
*/
Or use this example for reference:
val df = spark.sql(
"""
|select Class_Name, Customer, Date_Time, Median_Percentage
|from values
| ('ClassA', 'A', '6/13/20', 64550),
| ('ClassA', 'B', '6/6/20', 40200),
| ('ClassB', 'F', '6/20/20', 26800),
| ('ClassB', 'G', '6/20/20', 18100)
| T(Class_Name, Customer, Date_Time, Median_Percentage)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +----------+--------+---------+-----------------+
* |Class_Name|Customer|Date_Time|Median_Percentage|
* +----------+--------+---------+-----------------+
* |ClassA |A |6/13/20 |64550 |
* |ClassA |B |6/6/20 |40200 |
* |ClassB |F |6/20/20 |26800 |
* |ClassB |G |6/20/20 |18100 |
* +----------+--------+---------+-----------------+
*
* root
* |-- Class_Name: string (nullable = false)
* |-- Customer: string (nullable = false)
* |-- Date_Time: string (nullable = false)
* |-- Median_Percentage: integer (nullable = false)
*/
Note the T(Class_Name, Customer, Date_Time, Median_Percentage) clause, which provides the column names as required.
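Alternatively, if you prefer to leave the SQL untouched, you can rename the columns afterwards with toDF, e.g.:
// Rename the default col1/col2 after running the original query
spark.sql("with some_data (values ('A',1),('B',2)) select * from some_data")
  .toDF("label", "value")
  .show()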
CSV Data is stored daily on AWS S3, as follows:
/data/year=2020/month=5/day=5/<data-part-1.csv, data-part-2.csv,...data-part-K.csv>
The query I would like to work:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
Outcome: the table is empty.
I attempted specifying the location more precisely as ".../data/year=*/month=*/day=*" instead of ".../data/".
I also attempted the suggested fix of running this command, which did not work:
spark.sql("msck repair table database_name.table_name")
The version below is able to load data, but I need the year/month/day columns; the idea is to filter by those to make queries faster:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
Outcome: the table loads as expected, but queries are very slow.
This version also loads a table; however, the year/month/day columns are null:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT, year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
I am assuming the first query is the correct way to load this data, based on the documentation. Looking at the resulting schema, that also seems correct; however, I cannot get it to actually load any data.
Does anyone know what I am doing wrong?
Check if this is helpful.
Please note that the SparkSession is created without Hive support.
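For context, a minimal sketch of the kind of session this was run with (the master and app name are assumptions):
import org.apache.spark.sql.SparkSession

// Plain Spark session -- note the absence of .enableHiveSupport()
val spark = SparkSession.builder()
  .master("local[*]")               // assumption: local run
  .appName("partitioned-csv-demo")  // hypothetical app name
  .getOrCreate()

import spark.implicits._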
1. Create a dummy test DataFrame and store it as CSV partitioned by year, month, and day
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

val df = spark.range(1).withColumn("date",
explode(sequence(to_date(lit("2020-06-09")), to_date(lit("2020-06-20")), expr("interval 1 day")))
).withColumn("year", year($"date"))
.withColumn("month", month($"date"))
.withColumn("day", dayofmonth($"date"))
df.show(false)
df.printSchema()
/**
* +---+----------+----+-----+---+
* |id |date |year|month|day|
* +---+----------+----+-----+---+
* |0 |2020-06-09|2020|6 |9 |
* |0 |2020-06-10|2020|6 |10 |
* |0 |2020-06-11|2020|6 |11 |
* |0 |2020-06-12|2020|6 |12 |
* |0 |2020-06-13|2020|6 |13 |
* |0 |2020-06-14|2020|6 |14 |
* |0 |2020-06-15|2020|6 |15 |
* |0 |2020-06-16|2020|6 |16 |
* |0 |2020-06-17|2020|6 |17 |
* |0 |2020-06-18|2020|6 |18 |
* |0 |2020-06-19|2020|6 |19 |
* |0 |2020-06-20|2020|6 |20 |
* +---+----------+----+-----+---+
*
* root
* |-- id: long (nullable = false)
* |-- date: date (nullable = false)
* |-- year: integer (nullable = false)
* |-- month: integer (nullable = false)
* |-- day: integer (nullable = false)
*/
df.repartition(2).write.partitionBy("year", "month", "day")
.option("header", true)
.mode(SaveMode.Overwrite)
.csv("/Users/sokale/models/hive_table")
File structure
/**
* File structure - /Users/sokale/models/hive_table
* ---------------
* year=2020
* year=2020/month=6
* year=2020/month=6/day=10
* |- part...csv files (same part files for all the below directories)
* year=2020/month=6/day=11
* year=2020/month=6/day=12
* year=2020/month=6/day=13
* year=2020/month=6/day=14
* year=2020/month=6/day=15
* year=2020/month=6/day=16
* year=2020/month=6/day=17
* year=2020/month=6/day=18
* year=2020/month=6/day=19
* year=2020/month=6/day=20
* year=2020/month=6/day=9
*/
2. Read the partitioned table
val csvDF = spark.read.option("header", true)
.csv("/Users/sokale/models/hive_table")
csvDF.show(false)
csvDF.printSchema()
/**
* +---+----------+----+-----+---+
* |id |date |year|month|day|
* +---+----------+----+-----+---+
* |0 |2020-06-20|2020|6 |20 |
* |0 |2020-06-19|2020|6 |19 |
* |0 |2020-06-09|2020|6 |9 |
* |0 |2020-06-12|2020|6 |12 |
* |0 |2020-06-10|2020|6 |10 |
* |0 |2020-06-15|2020|6 |15 |
* |0 |2020-06-16|2020|6 |16 |
* |0 |2020-06-17|2020|6 |17 |
* |0 |2020-06-13|2020|6 |13 |
* |0 |2020-06-18|2020|6 |18 |
* |0 |2020-06-14|2020|6 |14 |
* |0 |2020-06-11|2020|6 |11 |
* +---+----------+----+-----+---+
*
* root
* |-- id: string (nullable = true)
* |-- date: string (nullable = true)
* |-- year: integer (nullable = true)
* |-- month: integer (nullable = true)
* |-- day: integer (nullable = true)
*/
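Since the goal in the question is to filter by year/month/day for speed, note that filtering on these directory-derived columns lets Spark prune partitions and scan only the matching files, e.g.:
// Partition pruning: only the year=2020/month=6/day=20 directory is read
csvDF.filter($"year" === 2020 && $"month" === 6 && $"day" === 20).show(false)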
I have a schema:
root (original)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = true)
How can I flatten it?
root (derived)
|-- col1: string (nullable = false)
|-- col2: string (nullable = true)
|-- col3: string (nullable = false)
|-- col4: string (nullable = true)
|-- ...
where the names of col1...n come from [col1 in the original] and the values of col1...n come from [col2 in the original].
Example:
+--------------------------------------------+
|entries |
+--------------------------------------------+
|[[a1, 1], [a2, P], [a4, N]]                 |
|[[a1, 1], [a2, O], [a3, F], [a4, 1], [a5, 1]]|
+--------------------------------------------+
I want to create the next dataset:
+-------------------------+
| a1 | a2 | a3 | a4 | a5 |
+-------------------------+
| 1 | P | null| N | null|
| 1 | O | F | 1 | 1 |
+-------------------------+
You can do it with a combination of explode and pivot. To do so, you first need to create a row_id:
val df = Seq(
Seq(("a1", "1"), ("a2", "P"), ("a4", "N")),
Seq(("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1"))
).toDF("arr")
.select($"arr".cast("array<struct<col1:string,col2:string>>"))
df
.withColumn("row_id", monotonically_increasing_id())
.select($"row_id", explode($"arr"))
.select($"row_id", $"col.*")
.groupBy($"row_id").pivot($"col1").agg(first($"col2"))
.drop($"row_id")
.show()
gives:
+---+---+----+---+----+
| a1| a2| a3| a4| a5|
+---+---+----+---+----+
| 1| P|null| N|null|
| 1| O| F| 1| 1|
+---+---+----+---+----+
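If the set of keys is known up front, you can also pass them to pivot explicitly; this skips the extra job Spark otherwise runs to collect the distinct pivot values (the key list below is taken from the example data):
// Same pipeline, but with the pivot values supplied explicitly
df
  .withColumn("row_id", monotonically_increasing_id())
  .select($"row_id", explode($"arr"))
  .select($"row_id", $"col.*")
  .groupBy($"row_id")
  .pivot($"col1", Seq("a1", "a2", "a3", "a4", "a5"))  // explicit pivot values
  .agg(first($"col2"))
  .drop($"row_id")
  .show()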
Data schema,
root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|id|col1 |col2 |
|1 |["x","y","z"]|[123,"null","null"]|
From the above data I want to filter rows where "x" exists in col1 and get the corresponding value for "x" from col2.
(The values of col1 and col2 are aligned by position: if "x" is at index 2 in col1, its value is at index 2 in col2.)
Result (col1 and col2 need to be of array type):
|id |col1 |col2 |
|1 |["x"]|[123]|
If "x" is not present in col1, then I need a result like:
|id| col1 |col2 |
|1 |["null"] |["null"]|
I tried:
val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))
The trick is to transform your data from plain string columns into a more usable data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests they should be), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.
To start, use trim and split to convert col1 and col2 to arrays:
scala> val df = Seq(
| ("1", """["x","y","z"]""","""[123,"null","null"]"""),
| ("2", """["a","y","z"]""","""[123,"null","null"]""")
| ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]
scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
.withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_array.show(false)
+---+---------+-----------------+
|id |col1 |col2 |
+---+---------+-----------------+
|1 |[x, y, z]|[123, null, null]|
|2 |[a, y, z]|[123, null, null]|
+---+---------+-----------------+
scala> df_array.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
From here, you could use array_position to find the index of 'x' (if any) in col1 and retrieve the matching data from col2 (a sketch of that variant appears after the final output below). However, converting the two arrays into a map first makes it easier to see what the code is doing:
scala> val df_map = df_array.select(
$"id",
map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
)
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]
scala> df_map.show(false)
+---+--------------------------------+
|id |col_map |
+---+--------------------------------+
|1 |[x -> 123, y -> null, z -> null]|
|2 |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+
scala> val df_final = df_map.select(
$"id",
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(lit("x")))
.as("col1"),
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(element_at($"col_map", "x")))
.as("col2")
)
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_final.show
+---+------+------+
| id| col1| col2|
+---+------+------+
| 1| [x]| [123]|
| 2|[null]|[null]|
+---+------+------+
scala> df_final.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = false)
| |-- element: string (containsNull = false)
|-- col2: array (nullable = false)
| |-- element: string (containsNull = true)
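For completeness, the array_position route mentioned earlier might look roughly like this (a sketch only, building on df_array; element_at needs the 1-based index cast to int):
// Sketch: look up "x" by position instead of going through a map
val df_pos = df_array
  .withColumn("pos", array_position($"col1", "x"))   // 1-based position, 0 if "x" is absent
  .select(
    $"id",
    when($"pos" > 0, array(lit("x"))).otherwise(array(lit("null"))).as("col1"),
    when($"pos" > 0, array(element_at($"col2", $"pos".cast("int"))))
      .otherwise(array(lit("null"))).as("col2")
  )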
Dataframe schema:
root
|-- ID: decimal(15,0) (nullable = true)
|-- COL1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- COL3: array (nullable = true)
| |-- element: string (containsNull = true)
Sample data
+--------------------+--------------------+--------------------+
| COL1 | COL2 | COL3 |
+--------------------+--------------------+--------------------+
|[A, B, C, A] |[101, 102, 103, 104]|[P, Q, R, S] |
+--------------------+--------------------+--------------------+
I want to apply nested conditions on array elements.
For example:
Find the COL3 elements where the corresponding COL1 element is "A" and the COL2 element is even.
Expected output: [S]
I looked at various functions, e.g. array_position, but it returns only the first occurrence.
Is there any straightforward way, or do I have to explode the arrays?
Assuming your condition applies to array elements that share the same index: it has been possible to filter arrays with lambda functions in SQL since Spark 2.4.0, but this is still not exposed via the other language APIs, so you need to use expr(). Simply zip the three arrays and then filter the resulting array of structs:
scala> df.show()
+---+------------+--------------------+------------+
| ID| COL1| COL2| COL3|
+---+------------+--------------------+------------+
| 1|[A, B, C, A]|[101, 102, 103, 104]|[P, Q, R, S]|
+---+------------+--------------------+------------+
scala> df.select($"ID", expr(s"""
| filter(
| arrays_zip(COL1, COL2, COL3),
| e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
| ).COL3 AS result
| """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+
Since this uses expr() to supply an SQL expression as a column, it also works with PySpark:
>>> from pyspark.sql.functions import expr
>>> df.select(df.ID, expr("""
... filter(
... arrays_zip(COL1, COL2, COL3),
... e -> e.COL1 == "A" AND CAST(e.COL2 AS integer) % 2 == 0
... ).COL3 AS result
... """)).show()
+---+------+
| ID|result|
+---+------+
| 1| [S]|
+---+------+