I have the following DF:
+----+------+--------+----+----+----+----+----------+
| ID | Name | Vl1 | Vl2| Vl3| Vl4|Vl5 | Vl6 |
+----+------+--------+----+----+----+----+----------+
|1 |John | 1.5 |null|null|null| A|2022-01-01|
|1 |John | 1 |null|null|null| A|2022-01-01|
|1 |John | 3 |null|1 |null| A|2022-01-01|
|1 |John | 4 |null|1 |null| A|2022-01-01|
|2 |Ana | 2.5 |null|null|null| A|2022-01-01|
|2 |Ana | 0 |null|null|null| A|2022-01-01|
|2 |Ana | null|null|null|null| A|2022-01-01|
|2 |Ana | 2 |null|null|null| A|2022-01-01|
|2 |Ana | 2 |2 |null|null| A|2022-01-01|
|2 |Ana | 1 |null|null|null| A|2022-01-01|
|3 |Paul | 5 |null|null|null| A|2022-01-01|
|3 |Paul | null|2 |null|null| A|2022-01-01|
|3 |Paul | 2.5 |null|2 |null| A|2022-01-01|
|3 |Paul | null|null|3 |null| A|2022-01-01|
+----+------+--------+----+----+----+----+----------+
How can I generate the following nested structure:
|-- Title: string (nullable = true)
|-- Company: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: string (nullable = true)
| | | | |-- Detail: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: date (nullable = true)
So far, I have used withColumn to generate the columns Title and Company. Then I used groupBy to group by client and collect_list to generate the first level of the nested structure. How can I generate the other levels of the structure (Prch, Detail, Bs)?
For context, my DF has 1 billion rows. I would like to know the best way to generate this structure. My Spark environment has 4 workers with 5 cores per worker.
MVCE:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

data = [
("1","John",1.5,None,None,None,"A", "2022-01-01"),
("1","John",1.0,None,None,None,"A", "2022-01-01"),
("1","John",3.0,None,1.0,None,"A", "2022-01-01"),
("1","John",4.0,None,1.0,None,"A", "2022-01-01"),
("2","Ana",2.5,None,None,None,"A", "2022-01-01"),
("2","Ana",0.0,None,None,None,"A", "2022-01-01"),
("2","Ana",None,None,None,None,"A", "2022-01-01"),
("2","Ana",2.0,None,None,None,"A", "2022-01-01"),
("2","Ana",2.0,2.0,None,None,"A", "2022-01-01"),
("2","Ana",1.0,None,None,None,"A", "2022-01-01"),
("3","Paul",5.0,None,None,None,"A", "2022-01-01"),
("3","Paul",None,2.0,None,None,"A", "2022-01-01"),
("3","Paul",2.5,None,2.0,None,"A", "2022-01-01"),
("3","Paul",None,None,3.0,None,"A", "2022-01-01")
]
schema = StructType([
StructField("Id", StringType(),True),
StructField("Name", StringType(),True),
StructField("Vl1", DoubleType(),True),
StructField("Vl2", DoubleType(), True),
StructField("Vl3", DoubleType(), True),
StructField("Vl4", DateType(), True),
StructField("Vl5", StringType(), True),
StructField("Vl6", StringType(), True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show()
You have to group them sequentially, once per nesting level, starting from the innermost level and moving outwards.
The first groupBy below creates the Detail and Bs arrays for each Vl3 value, the second creates the Prch column, and the third creates the Client column.
(You obviously don't have to create a separate DataFrame for each step; this is just for the sake of explanation.)
import pyspark.sql.functions as fn

# Innermost level: one Detail array and one Bs array per (Id, Name, Vl1, Vl2, Vl3)
df1 = df.groupBy("Id", "Name", "Vl1", "Vl2", "Vl3").agg(
    fn.collect_list(fn.struct(fn.col("Vl4"))).alias("Detail"),
    fn.collect_list(fn.struct(fn.col("Vl5"), fn.col("Vl6"))).alias("Bs"),
)
# Middle level: collect the (Vl3, Detail, Bs) structs into Prch
df2 = df1.groupBy("Id", "Name", "Vl1", "Vl2").agg(
    fn.collect_list(fn.struct(fn.col("Vl3"), fn.col("Detail"), fn.col("Bs"))).alias("Prch")
)
# Outermost level: collect the (Vl1, Vl2, Prch) structs into Client
df3 = df2.groupBy("Id", "Name").agg(
    fn.collect_list(fn.struct(fn.col("Vl1"), fn.col("Vl2"), fn.col("Prch"))).alias("Client")
)
Which gives you this schema:
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: double (nullable = true)
| | | | |-- Detail: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: string (nullable = true)
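To see what the three grouping steps compute without starting Spark, here is a minimal plain-Python model of the same inner-to-outer idea. It is only a sketch: dictionaries stand in for structs, lists stand in for collect_list, and the second row is a hypothetical addition (not in the question's data) so that one innermost group contains more than one element.

```python
from collections import defaultdict

# Columns: (Id, Name, Vl1, Vl2, Vl3, Vl4, Vl5, Vl6)
rows = [
    ("1", "John", 3.0, None, 1.0, None, "A", "2022-01-01"),
    ("1", "John", 3.0, None, 1.0, None, "B", "2022-01-02"),  # hypothetical extra row
    ("1", "John", 4.0, None, 1.0, None, "A", "2022-01-01"),
    ("3", "Paul", None, 2.0, None, None, "A", "2022-01-01"),
    ("3", "Paul", 2.5, None, 2.0, None, "A", "2022-01-01"),
]

# Step 1 (innermost): group by (Id, Name, Vl1, Vl2, Vl3), collect Detail and Bs.
level1 = defaultdict(lambda: {"Detail": [], "Bs": []})
for id_, name, vl1, vl2, vl3, vl4, vl5, vl6 in rows:
    group = level1[(id_, name, vl1, vl2, vl3)]
    group["Detail"].append({"Vl4": vl4})
    group["Bs"].append({"Vl5": vl5, "Vl6": vl6})

# Step 2: group by (Id, Name, Vl1, Vl2), collect (Vl3, Detail, Bs) into Prch.
level2 = defaultdict(list)
for (id_, name, vl1, vl2, vl3), group in level1.items():
    level2[(id_, name, vl1, vl2)].append({"Vl3": vl3, **group})

# Step 3 (outermost): group by (Id, Name), collect (Vl1, Vl2, Prch) into Client.
level3 = defaultdict(list)
for (id_, name, vl1, vl2), prch in level2.items():
    level3[(id_, name)].append({"Vl1": vl1, "Vl2": vl2, "Prch": prch})

nested = [{"Id": i, "Name": n, "Client": c} for (i, n), c in level3.items()]
print(nested[0])
```

Each step drops one grouping key and wraps what the previous step produced, which is why the order has to be innermost first. One difference to keep in mind: this model preserves input order, whereas collect_list after a shuffle does not guarantee any particular element order.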
I have an input PySpark df:
+---------+-------+--------+----------+----------+
|timestamp|user_id|results |event_name|product_id|
+---------+-------+--------+----------+----------+
|1000 |user_1 |result 1|Click |1 |
|1001 |user_1 |result 1|View |1 |
|1002 |user_1 |result 2|Click |3 |
|1003 |user_1 |result 2|View |4 |
|1004 |user_1 |result 2|View |5 |
+---------+-------+--------+----------+----------+
root
|-- timestamp: timestamp (nullable = true)
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- event_name: string (nullable = true)
|-- product_id: string (nullable = true)
I'd like to convert this to the following, keeping the unique combinations of user_id and results and aggregating the product_ids by event_name, like this:
+-------+--------+---------------+---------------+
|user_id|results |product_clicked|products_viewed|
+-------+--------+---------------+---------------+
|user_1 |result 1|[1] |[1] |
|user_1 |result 2|[3] |[4,5] |
+-------+--------+---------------+---------------+
root
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- product_clicked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- products_viewed: array (nullable = true)
| |-- element: string (containsNull = true)
I have looked into pivot. It is close, but I do not need its aggregation part; instead, I need to build arrays in columns that are created from the event_name column. I cannot figure out how to do it.
NOTE: The order in product_clicked and product_viewed columns above is important and is based on timestamp column of input dataframe.
You can use collect_list during the pivot aggregation:
import pyspark.sql.functions as F
df2 = (df.groupBy('user_id', 'results')
.pivot('event_name')
.agg(F.collect_list('product_id'))
.selectExpr("user_id", "results", "Click as product_clicked", "View as product_viewed")
)
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
To ensure ordering, you can collect a list of structs containing the timestamp, sort the list, and transform the list to only keep the product_id:
df2 = (df.groupBy('user_id', 'results')
.pivot('event_name')
.agg(F.sort_array(F.collect_list(F.struct('timestamp', 'product_id'))))
.selectExpr("user_id", "results", "transform(Click, x -> x.product_id) as product_clicked", "transform(View, x -> x.product_id) as product_viewed")
)
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
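The ordering trick works because sorting an array of structs compares the first field first, so putting timestamp ahead of product_id sorts by time. Below is a plain-Python model of that step, offered as a sketch only: tuples stand in for structs, sorted stands in for sort_array, the list comprehension stands in for transform, and integer timestamps are an assumption made for brevity. The rows are the question's, deliberately shuffled.

```python
from collections import defaultdict

# Columns: (timestamp, user_id, results, event_name, product_id)
events = [
    (1004, "user_1", "result 2", "View", "5"),
    (1000, "user_1", "result 1", "Click", "1"),
    (1003, "user_1", "result 2", "View", "4"),
    (1001, "user_1", "result 1", "View", "1"),
    (1002, "user_1", "result 2", "Click", "3"),
]

# groupBy(user_id, results) + pivot(event_name) + collect_list(struct(timestamp, product_id))
collected = defaultdict(lambda: defaultdict(list))
for ts, user, result, event, product in events:
    collected[(user, result)][event].append((ts, product))

# Sort each list of (timestamp, product_id) pairs, then keep only product_id.
output = {
    key: {
        event: [product for _, product in sorted(pairs)]
        for event, pairs in by_event.items()
    }
    for key, by_event in collected.items()
}

print(output[("user_1", "result 2")])
```

If product_id came first in the struct, the lists would be sorted by product_id instead, so the field order inside the struct matters.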
I am new to PySpark and I need to explode my array of values in such a way that each value gets assigned to a new column. I tried using explode but I couldn't get the desired output. Below is my output:
+---------------+----------+------------------+----------+---------+------------+--------------------+
|account_balance|account_id|credit_Card_Number|first_name|last_name|phone_number| transactions|
+---------------+----------+------------------+----------+---------+------------+--------------------+
| 100000| 12345| 12345| abc| xyz| 1234567890|[1000, 01/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[1100, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[6146, 02/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[253, 03/06/2020,...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[4521, 04/06/2020...|
| 100000| 12345| 12345| abc| xyz| 1234567890|[955, 05/06/2020,...|
+---------------+----------+------------------+----------+---------+------------+--------------------+
Below is the schema of the DataFrame:
root
|-- account_balance: long (nullable = true)
|-- account_id: long (nullable = true)
|-- credit_Card_Number: long (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- phone_number: long (nullable = true)
|-- transactions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- amount: long (nullable = true)
| | |-- date: string (nullable = true)
| | |-- shop: string (nullable = true)
| | |-- transaction_code: string (nullable = true)
I want an output that has the additional columns amount, date, shop, and transaction_code with their respective values:
amount date shop transaction_code
1000 01/06/2020 amazon buy
1100 02/06/2020 amazon sell
6146 02/06/2020 ebay buy
253 03/06/2020 ebay buy
4521 04/06/2020 amazon buy
955 05/06/2020 amazon buy
Use explode, then expand the struct fields into columns, and finally drop the newly exploded column and the transactions array column.
Example:
from pyspark.sql.functions import *
# only some of the columns from the JSON are used in this example
df.printSchema()
#root
# |-- account_balance: long (nullable = true)
# |-- transactions: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- amount: long (nullable = true)
# | | |-- date: string (nullable = true)
df.selectExpr("*","explode(transactions)").select("*","col.*").drop(*['col','transactions']).show()
#+---------------+------+--------+
#|account_balance|amount| date|
#+---------------+------+--------+
#| 10| 1000|20200202|
#+---------------+------+--------+
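As a rough mental model of what the explode, col.* and drop chain does, here is a plain-Python sketch. Dictionaries stand in for rows and structs, and the sample values come from the question, arranged into a single account row with two transactions purely for illustration.

```python
accounts = [
    {
        "account_id": 12345,
        "account_balance": 100000,
        "transactions": [
            {"amount": 1000, "date": "01/06/2020", "shop": "amazon", "transaction_code": "buy"},
            {"amount": 1100, "date": "02/06/2020", "shop": "amazon", "transaction_code": "sell"},
        ],
    },
]

flat = []
for account in accounts:
    # Every column except the array is repeated on each output row.
    scalars = {k: v for k, v in account.items() if k != "transactions"}
    # explode: one output row per array element.
    for tx in account["transactions"]:
        # "col.*" then drop: the struct fields become top-level columns.
        flat.append({**scalars, **tx})

for row in flat:
    print(row)
```

Note that an account with an empty transactions list yields no rows in this model. Spark's explode behaves the same way; explode_outer is the variant that keeps such rows with nulls.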
I have a schema:
root (original)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = true)
How can I flatten it?
root (derived)
|-- col1: string (nullable = false)
|-- col2: string (nullable = true)
|-- col3: string (nullable = false)
|-- col4: string (nullable = true)
|-- ...
where the derived column names come from the col1 values of the original entries, and each derived column holds the matching col2 value from the original.
Example:
+---------------------------------------------+
|entries                                      |
+---------------------------------------------+
|[[a1, 1], [a2, P], [a4, N]]                  |
|[[a1, 1], [a2, O], [a3, F], [a4, 1], [a5, 1]]|
+---------------------------------------------+
I want to create the next dataset:
+-------------------------+
| a1 | a2 | a3 | a4 | a5 |
+-------------------------+
| 1 | P | null| N | null|
| 1 | O | F | 1 | 1 |
+-------------------------+
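You can do it with a combination of explode and pivot. To do so, you first need to create a row_id:
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(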
Seq(("a1", "1"), ("a2", "P"), ("a4", "N")),
Seq(("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1"))
).toDF("arr")
.select($"arr".cast("array<struct<col1:string,col2:string>>"))
df
.withColumn("row_id", monotonically_increasing_id())
.select($"row_id", explode($"arr"))
.select($"row_id", $"col.*")
.groupBy($"row_id").pivot($"col1").agg(first($"col2"))
.drop($"row_id")
.show()
gives:
+---+---+----+---+----+
| a1| a2| a3| a4| a5|
+---+---+----+---+----+
| 1| P|null| N|null|
| 1| O| F| 1| 1|
+---+---+----+---+----+
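The row_id is needed because, once the array is exploded, nothing else ties an element back to the row it came from. The following plain-Python sketch models the same explode, group and pivot sequence on the example data; enumerate stands in for monotonically_increasing_id (which yields unique but not necessarily consecutive ids), and the column names are sorted here simply to match the column order of the output above.

```python
entries = [
    [("a1", "1"), ("a2", "P"), ("a4", "N")],
    [("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1")],
]

# explode with a row id: one (row_id, col1, col2) triple per array element.
exploded = [
    (row_id, col1, col2)
    for row_id, row in enumerate(entries)
    for col1, col2 in row
]

# pivot: the distinct col1 values become the output column names.
columns = sorted({col1 for _, col1, _ in exploded})

# groupBy(row_id) + first(col2): keep the first value seen per (row_id, col1).
pivoted = {}
for row_id, col1, col2 in exploded:
    pivoted.setdefault(row_id, {}).setdefault(col1, col2)

# Missing keys become None, like the nulls in the pivoted DataFrame.
table = [[pivoted[r].get(c) for c in columns] for r in sorted(pivoted)]

print(columns)
for line in table:
    print(line)
```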