How to rename columns in Spark SQL inside WITH and VALUES? - apache-spark

Given a table built in this way with Spark SQL (2.4.*):
scala> spark.sql("with some_data (values ('A',1),('B',2)) select * from some_data").show()
+----+----+
|col1|col2|
+----+----+
| A| 1|
| B| 2|
+----+----+
I wasn't able to set the column names (they default to col1 and col2). Is there a way to rename those columns, for example to label and value?

Either modify your query as below -
spark.sql("with some_data (values ('A',1),('B',2) T(label, value)) select * from some_data").show()
/**
* +-----+-----+
* |label|value|
* +-----+-----+
* | A| 1|
* | B| 2|
* +-----+-----+
*/
or use this example for reference -
val df = spark.sql(
"""
|select Class_Name, Customer, Date_Time, Median_Percentage
|from values
| ('ClassA', 'A', '6/13/20', 64550),
| ('ClassA', 'B', '6/6/20', 40200),
| ('ClassB', 'F', '6/20/20', 26800),
| ('ClassB', 'G', '6/20/20', 18100)
| T(Class_Name, Customer, Date_Time, Median_Percentage)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +----------+--------+---------+-----------------+
* |Class_Name|Customer|Date_Time|Median_Percentage|
* +----------+--------+---------+-----------------+
* |ClassA |A |6/13/20 |64550 |
* |ClassA |B |6/6/20 |40200 |
* |ClassB |F |6/20/20 |26800 |
* |ClassB |G |6/20/20 |18100 |
* +----------+--------+---------+-----------------+
*
* root
* |-- Class_Name: string (nullable = false)
* |-- Customer: string (nullable = false)
* |-- Date_Time: string (nullable = false)
* |-- Median_Percentage: integer (nullable = false)
*/
Note the inline table alias T(Class_Name, Customer, Date_Time, Median_Percentage), which assigns the required column names.
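As a simpler alternative (an assumption, not part of the original answer), you can also keep the original CTE unchanged and just alias the default columns in the outer SELECT:
spark.sql("with some_data (values ('A',1),('B',2)) select col1 as label, col2 as value from some_data").show()
/**
 * +-----+-----+
 * |label|value|
 * +-----+-----+
 * |    A|    1|
 * |    B|    2|
 * +-----+-----+
 */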

Related

Transform columns in Spark DataFrame based on map without using UDFs

I would like to transform some columns in my dataframe based on configuration represented by Scala maps.
I have 2 cases:
Receiving a map Map[String, Seq[String]] and columns col1, col2: transform col3 if there is an entry in the map with key = col1 and col2 is in that entry's value list.
Receiving a map Map[String, (Long, Long)] and columns col1, col2: transform col3 if there is an entry in the map with key = col1 and col2 is in the range described by the tuple of Longs as (start, end).
Examples:
Case 1:
Having this table and the map Map(u1 -> Seq(w1, w11), u2 -> Seq(w2, w22)):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | v1 |
+------+------+------+
| u2 | w2 | v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
I would like to add an "x-" prefix to col3, only if it matches:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | w1 | x-v1 |
+------+------+------+
| u2 | w2 | x-v2 |
+------+------+------+
| u3 | w3 | v3 |
+------+------+------+
Case 2:
This table and the map Map("u1" -> (1, 5), "u2" -> (2, 4)):
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
expected output should be:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| u1 | 2 | x-v1 |
+------+------+------+
| u1 | 6 | v11 |
+------+------+------+
| u2 | 3 | x-v3 |
+------+------+------+
| u3 | 4 | v3 |
+------+------+------+
This can easily be done with UDFs, but for performance reasons I would prefer not to use them.
Is there a way to achieve this without them in Spark 2.4.2?
Thanks
Check the code below.
Note -
I have changed your second case's map value to Map("u1" -> Seq(1,5), "u2" -> Seq(2,4)).
The approach converts the map values to a JSON map, adds the JSON map as a column to the DataFrame, and then applies the logic on the DataFrame.
If possible, you can write the values as a JSON map directly and avoid the map-to-JSON conversion (a short sketch of this follows the Case-1 output below).
Import required libraries.
import org.apache.spark.sql.functions._   // for lit, from_json, when, etc. (spark-shell imports this automatically)
import org.apache.spark.sql.types._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
Case-1 Logic
scala> val caseOneDF = Seq(("u1","w1","v1"),("u2","w2","v2"),("u3","w3","v3")).toDF("col1","col2","col3")
caseOneDF: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more field]
scala> val caseOneMap = Map("u1" -> Seq("w1","w11"),"u2" -> Seq("w2","w22"))
caseOneMap: scala.collection.immutable.Map[String,Seq[String]] = Map(u1 -> List(w1, w11), u2 -> List(w2, w22))
scala> val caseOneJsonMap = lit(compact(render(caseOneMap)))
caseOneJsonMap: org.apache.spark.sql.Column = {"u1":["w1","w11"],"u2":["w2","w22"]}
scala> val caseOneSchema = MapType(StringType,ArrayType(StringType))
caseOneSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(StringType,true),true)
scala> val caseOneExpr = from_json(caseOneJsonMap,caseOneSchema)
caseOneExpr: org.apache.spark.sql.Column = entries
Case-1 Final Output
scala> caseOneDF
.withColumn("data",caseOneExpr)
.withColumn("col3",when(expr("array_contains(data[col1],col2)"),concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |w1 |x-v1|
|u2 |w2 |x-v2|
|u3 |w3 |v3 |
+----+----+----+
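As noted above, the json4s conversion can be skipped entirely by writing the JSON map as a string literal; a minimal sketch under that assumption (reusing caseOneSchema from above):
// hand-written JSON map instead of compact(render(caseOneMap))
val caseOneJsonMapDirect = lit("""{"u1":["w1","w11"],"u2":["w2","w22"]}""")
val caseOneExprDirect = from_json(caseOneJsonMapDirect, caseOneSchema)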
Case-2 Logic
scala> val caseTwoDF = Seq(("u1",2,"v1"),("u1",6,"v11"),("u2",3,"v3"),("u3",4,"v3")).toDF("col1","col2","col3")
caseTwoDF: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> val caseTwoMap = Map("u1" -> Seq(1,5),"u2" -> Seq(2,4))
caseTwoMap: scala.collection.immutable.Map[String,Seq[Int]] = Map(u1 -> List(1, 5), u2 -> List(2, 4))
scala> val caseTwoJsonMap = lit(compact(render(caseTwoMap)))
caseTwoJsonMap: org.apache.spark.sql.Column = {"u1":[1,5],"u2":[2,4]}
scala> val caseTwoSchema = MapType(StringType,ArrayType(IntegerType))
caseTwoSchema: org.apache.spark.sql.types.MapType = MapType(StringType,ArrayType(IntegerType,true),true)
scala> val caseTwoExpr = from_json(caseTwoJsonMap,caseTwoSchema)
caseTwoExpr: org.apache.spark.sql.Column = entries
Case-2 Final Output
scala> caseTwoDF
.withColumn("data",caseTwoExpr)
.withColumn("col3",when(expr("array_contains(sequence(data[col1][0],data[col1][1]),col2)"), concat(lit("x-"),$"col3")).otherwise($"col3"))
.drop("data")
.show(false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|u1 |2 |x-v1|
|u1 |6 |v11 |
|u2 |3 |x-v3|
|u3 |4 |v3 |
+----+----+----+
Another alternative -
import org.apache.spark.sql.functions.typedLit
Case-1
// df1 is the Case-1 input DataFrame (the same data as caseOneDF above)
df1.show(false)
df1.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |w1 |v1 |
* |u2 |w2 |v2 |
* |u3 |w3 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
*/
val case1 = Map("u1" -> Seq("w1","w11"), "u2" -> Seq("w2","w22"))
val p1 = df1.withColumn("case1", typedLit(case1))
.withColumn("col3",
when(array_contains(expr("case1[col1]"), $"col2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p1.show(false)
p1.printSchema()
/**
* +----+----+----+----------------------------------+
* |col1|col2|col3|case1 |
* +----+----+----+----------------------------------+
* |u1 |w1 |x-v1|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u2 |w2 |x-v2|[u1 -> [w1, w11], u2 -> [w2, w22]]|
* |u3 |w3 |v3 |[u1 -> [w1, w11], u2 -> [w2, w22]]|
* +----+----+----+----------------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: string (nullable = true)
* |-- col3: string (nullable = true)
* |-- case1: map (nullable = false)
* | |-- key: string
* | |-- value: array (valueContainsNull = true)
* | | |-- element: string (containsNull = true)
*/
Case-2
// df2 is the Case-2 input DataFrame (the same data as caseTwoDF above)
df2.show(false)
df2.printSchema()
/**
* +----+----+----+
* |col1|col2|col3|
* +----+----+----+
* |u1 |2 |v1 |
* |u1 |6 |v11 |
* |u2 |3 |v3 |
* |u3 |4 |v3 |
* +----+----+----+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
*/
val case2 = Map("u1" -> (1,5), "u2" -> (2, 4))
val p = df2.withColumn("case2", typedLit(case2))
.withColumn("col3",
when(expr("col2 between case2[col1]._1 and case2[col1]._2"), concat(lit("x-"), $"col3"))
.otherwise($"col3")
)
p.show(false)
p.printSchema()
/**
* +----+----+----+----------------------------+
* |col1|col2|col3|case2 |
* +----+----+----+----------------------------+
* |u1 |2 |x-v1|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u1 |6 |v11 |[u1 -> [1, 5], u2 -> [2, 4]]|
* |u2 |3 |x-v3|[u1 -> [1, 5], u2 -> [2, 4]]|
* |u3 |4 |v3 |[u1 -> [1, 5], u2 -> [2, 4]]|
* +----+----+----+----------------------------+
*
* root
* |-- col1: string (nullable = true)
* |-- col2: integer (nullable = true)
* |-- col3: string (nullable = true)
* |-- case2: map (nullable = false)
* | |-- key: string
* | |-- value: struct (valueContainsNull = true)
* | | |-- _1: integer (nullable = false)
* | | |-- _2: integer (nullable = false)
*/

Get all Not null columns of spark dataframe in one Column

I need to select all not-null columns from a Hive table and insert them into HBase. For example, consider the table below:
Name Place Department Experience
==============================================
Ram | Ramgarh | Sales | 14
Lakshman | Lakshmanpur |Operations |
Sita | Sitapur | | 14
Ravan | | | 25
I have to write all the not-null columns from the above table to HBase. So I wrote logic to collect the not-null column names in one column of the dataframe, as below. The Name column is always mandatory.
Name Place Department Experience Not_null_columns
================================================================================
Ram Ramgarh Sales 14 Name, Place, Department, Experience
Lakshman Lakshmanpur Operations Name, Place, Department
Sita Sitapur 14 Name, Place, Experience
Ravan 25 Name, Experience
Now my requirement is to create a column in the dataframe holding the names and values of all not-null columns in a single column, as shown below.
Name Place Department Experience Not_null_columns_values
Ram Ramgarh Sales 14 Name: Ram, Place: Ramgarh, Department: Sales, Experience: 14
Lakshman Lakshmanpur Operations Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita Sitapur 14 Name: Sita, Place: Sitapur, Experience: 14
Ravan 25 Name: Ravan, Experience: 25
Once I get the above df I will write it to HBase with Name as the key and the last column as the value.
Please let me know if there could have been a better approach to do this.
Try this-
Load the test data provided
val data =
"""
|Name | Place | Department | Experience
|
|Ram | Ramgarh | Sales | 14
|
|Lakshman | Lakshmanpur |Operations |
|
|Sita | Sitapur | | 14
|
|Ravan | | | 25
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
// .option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+-----------+----------+----------+
* |Name |Place |Department|Experience|
* +--------+-----------+----------+----------+
* |Ram |Ramgarh |Sales |14 |
* |Lakshman|Lakshmanpur|Operations|null |
* |Sita |Sitapur |null |14 |
* |Ravan |null |null |25 |
* +--------+-----------+----------+----------+
*
* root
* |-- Name: string (nullable = true)
* |-- Place: string (nullable = true)
* |-- Department: string (nullable = true)
* |-- Experience: integer (nullable = true)
*/
Convert the columns to a struct and then to JSON (to_json drops null fields):
val x = df.withColumn("Not_null_columns_values",
to_json(struct(df.columns.map(col): _*)))
x.show(false)
x.printSchema()
/**
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Name |Place |Department|Experience|Not_null_columns_values |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Ram |Ramgarh |Sales |14 |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14}|
* |Lakshman|Lakshmanpur|Operations|null |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"} |
* |Sita |Sitapur |null |14 |{"Name":"Sita","Place":"Sitapur","Experience":14} |
* |Ravan |null |null |25 |{"Name":"Ravan","Experience":25} |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
*/
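If you need the exact Name: Ram, Place: Ramgarh, ... format from the question rather than JSON, here is a minimal sketch (an addition, not part of the original answer) using when + concat_ws, which skips null entries:
// build "<column>: <value>" fragments that are null when the value is null
val kvCols = df.columns.map(c =>
  when(col(c).isNotNull, concat(lit(s"$c: "), col(c).cast("string"))))
// concat_ws silently drops null fragments, keeping only the not-null columns
val y = df.withColumn("Not_null_columns_values", concat_ws(", ", kvCols: _*))
y.show(false)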

Issues creating Spark table from EXTERNALLY partitioned data

CSV Data is stored daily on AWS S3, as follows:
/data/year=2020/month=5/day=5/<data-part-1.csv, data-part-2.csv,...data-part-K.csv>
The query I would like to work:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
Outcome: the table is empty.
I attempted specifying the location more precisely as ".../data/year=*/month=*/day=*" instead of ".../data/".
I also attempted the suggestion to run this command, which did not work:
spark.sql("msck repair table database_name.table_name")
The version below is able to load data, but I need the year/month/day columns; the idea is to filter on them to make queries faster:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
Outcome: the table loads as expected, but queries are very slow.
This version also loads the table; however, the year/month/day columns are null:
CREATE EXTERNAL TABLE {table_name} (data1 INT, data2 INT, year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '{file_location}'
TBLPROPERTIES ('skip.header.line.count' = '1')
I am assuming the first query is the correct way to load this data, based on documentation. Looking at the resultant schema, that also seems to be correct - however I cannot get it to actually load any data.
Does anyone know what I am doing wrong?
Check if this is helpful -
Please note that the SparkSession here is created without Hive support.
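(An aside, not covered by the steps below: the Hive-style CREATE EXTERNAL TABLE / MSCK REPAIR route from the question generally needs a SparkSession built with Hive support; a minimal sketch, with a placeholder app name:)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-external-table")   // placeholder name
  .enableHiveSupport()                     // required for Hive DDL such as CREATE EXTERNAL TABLE
  .getOrCreate()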
1. Create a dummy test dataframe and store it as CSV partitioned by year, month & day
val df = spark.range(1).withColumn("date",
explode(sequence(to_date(lit("2020-06-09")), to_date(lit("2020-06-20")), expr("interval 1 day")))
).withColumn("year", year($"date"))
.withColumn("month", month($"date"))
.withColumn("day", dayofmonth($"date"))
df.show(false)
df.printSchema()
/**
* +---+----------+----+-----+---+
* |id |date |year|month|day|
* +---+----------+----+-----+---+
* |0 |2020-06-09|2020|6 |9 |
* |0 |2020-06-10|2020|6 |10 |
* |0 |2020-06-11|2020|6 |11 |
* |0 |2020-06-12|2020|6 |12 |
* |0 |2020-06-13|2020|6 |13 |
* |0 |2020-06-14|2020|6 |14 |
* |0 |2020-06-15|2020|6 |15 |
* |0 |2020-06-16|2020|6 |16 |
* |0 |2020-06-17|2020|6 |17 |
* |0 |2020-06-18|2020|6 |18 |
* |0 |2020-06-19|2020|6 |19 |
* |0 |2020-06-20|2020|6 |20 |
* +---+----------+----+-----+---+
*
* root
* |-- id: long (nullable = false)
* |-- date: date (nullable = false)
* |-- year: integer (nullable = false)
* |-- month: integer (nullable = false)
* |-- day: integer (nullable = false)
*/
df.repartition(2).write.partitionBy("year", "month", "day")
.option("header", true)
.mode(SaveMode.Overwrite)
.csv("/Users/sokale/models/hive_table")
File structure
/**
* File structure - /Users/sokale/models/hive_table
* ---------------
* year=2020
* year=2020/month=6
* year=2020/month=6/day=10
* |- part...csv files (same part files for all the below directories)
* year=2020/month=6/day=11
* year=2020/month=6/day=12
* year=2020/month=6/day=13
* year=2020/month=6/day=14
* year=2020/month=6/day=15
* year=2020/month=6/day=16
* year=2020/month=6/day=17
* year=2020/month=6/day=18
* year=2020/month=6/day=19
* year=2020/month=6/day=20
* year=2020/month=6/day=9
*/
Read the partitioned table
val csvDF = spark.read.option("header", true)
.csv("/Users/sokale/models/hive_table")
csvDF.show(false)
csvDF.printSchema()
/**
* +---+----------+----+-----+---+
* |id |date |year|month|day|
* +---+----------+----+-----+---+
* |0 |2020-06-20|2020|6 |20 |
* |0 |2020-06-19|2020|6 |19 |
* |0 |2020-06-09|2020|6 |9 |
* |0 |2020-06-12|2020|6 |12 |
* |0 |2020-06-10|2020|6 |10 |
* |0 |2020-06-15|2020|6 |15 |
* |0 |2020-06-16|2020|6 |16 |
* |0 |2020-06-17|2020|6 |17 |
* |0 |2020-06-13|2020|6 |13 |
* |0 |2020-06-18|2020|6 |18 |
* |0 |2020-06-14|2020|6 |14 |
* |0 |2020-06-11|2020|6 |11 |
* +---+----------+----+-----+---+
*
* root
* |-- id: string (nullable = true)
* |-- date: string (nullable = true)
* |-- year: integer (nullable = true)
* |-- month: integer (nullable = true)
* |-- day: integer (nullable = true)
*/
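A possible follow-up (an assumption, not part of the original answer): register the partition-discovered DataFrame as a temp view so the year/month/day columns can be used as filters in SQL; "csv_table" is a placeholder name.
csvDF.createOrReplaceTempView("csv_table")
spark.sql("select * from csv_table where year = 2020 and month = 6 and day = 12").show(false)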

Finding variable length strings in column

I have a column that looks like this:
Class
A
AA
BB
AAAA
ABA
AAAAA
What I'd like to do is filter this column down to the rows that contain only A's and nothing else. So the result would be something like this:
Class
A
AA
AAAA
AAAAA
Is there a way to do this in Spark?
Check the code below.
scala> val df = Seq("A","AA","BB","AAAA","ABA","AAAAA","BAB").toDF("Class")
df: org.apache.spark.sql.DataFrame = [Class: string]
scala> df.filter(!col("Class").rlike("[^A]+")).show
+-----+
|Class|
+-----+
| A|
| AA|
| AAAA|
|AAAAA|
+-----+
Try this using the rlike function -
val data1 =
"""
|Class
|A
|AA
|BB
|AAAA
|ABA
|AAAAA
""".stripMargin
val stringDS1 = data1.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df1 = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +-----+
* |Class|
* +-----+
* |A |
* |AA |
* |BB |
* |AAAA |
* |ABA |
* |AAAAA|
* +-----+
*
* root
* |-- Class: string (nullable = true)
*/
df1.filter(col("Class").rlike("""^A+$"""))
.show(false)
/**
* +-----+
* |Class|
* +-----+
* |A |
* |AA |
* |AAAA |
* |AAAAA|
* +-----+
*/
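For reference, the same filter can be expressed in plain SQL via RLIKE; a small sketch assuming the DataFrame is registered under a hypothetical view name classes:
df1.createOrReplaceTempView("classes")   // hypothetical view name
spark.sql("select Class from classes where Class rlike '^A+$'").show(false)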

pyspark nested columns in a string

I am working with PySpark. I have a DataFrame loaded from csv that contains the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the 'users' string broken out into one row per element. I would like something similar to
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining a schema such as the following and creating a new df by applying the schema using from_json:
userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])
user_df = in_csv.select('id',from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values, e.g.
user_df.select('test.user_id').show()
returns:
+-------+
|user_id|
+-------+
| null|
| null|
+-------+
Maybe I shouldn't be using from_json, as the users string is not pure JSON. Any suggestions on an approach I could take?
The schema should conform to the shape of the data. Unfortunately, from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a UserDefinedFunction:
import json
from pyspark.sql.functions import explode, udf
df = spark.createDataFrame([
    (1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
    (2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
    ("id", "date", "users")
)
#udf("map<string, array<string>>")
def parse(s):
try:
return json.loads(s)
except:
pass
(df
    .select("id", "date",
            explode(parse("users")).alias("user_id", "user_product"))
    .withColumn("user_product", explode("user_product"))
    .show())
# +---+----------+-------+------------+
# | id| date|user_id|user_product|
# +---+----------+-------+------------+
# | 1|2017-12-03| 1| xxx|
# | 1|2017-12-03| 1| yyy|
# | 1|2017-12-03| 1| zzz|
# | 1|2017-12-03| 2| aaa|
# | 1|2017-12-03| 2| bbb|
# | 2|2017-12-04| 1| uuu|
# | 2|2017-12-04| 1| yyy|
# | 2|2017-12-04| 1| zzz|
# | 2|2017-12-04| 2| aaa|
# +---+----------+-------+------------+
You don't need to use from_json. You have to explode twice, once for user_id and once for users.
import pyspark.sql.functions as F
df = sql.createDataFrame([
    (1, '2017-12-03', {"1": ["xxx","yyy","zzz"], "2": ["aaa","bbb"], "3": []}),
    (2, '2017-12-04', {"1": ["uuu","yyy","zzz"], "2": ["aaa"], "3": []})],
    ['id', 'date', 'users']
)
df = df.select('id', 'date', F.explode('users').alias('user_id', 'users'))\
       .select('id', 'date', 'user_id', F.explode('users').alias('users'))
df.show()
+---+----------+-------+-----+
| id| date|user_id|users|
+---+----------+-------+-----+
| 1|2017-12-03| 1| xxx|
| 1|2017-12-03| 1| yyy|
| 1|2017-12-03| 1| zzz|
| 1|2017-12-03| 2| aaa|
| 1|2017-12-03| 2| bbb|
| 2|2017-12-04| 1| uuu|
| 2|2017-12-04| 1| yyy|
| 2|2017-12-04| 1| zzz|
| 2|2017-12-04| 2| aaa|
+---+----------+-------+-----+
