Pyspark DF Pivot and Create Arrays columns - apache-spark

I have an input PySpark df:
+---------+-------+--------+----------+----------+
|timestamp|user_id|results |event_name|product_id|
+---------+-------+--------+----------+----------+
|1000 |user_1 |result 1|Click |1 |
|1001 |user_1 |result 1|View |1 |
|1002 |user_1 |result 2|Click |3 |
|1003 |user_1 |result 2|View |4 |
|1004 |user_1 |result 2|View |5 |
+---------+-------+--------+----------+----------+
root
|-- timestamp: timestamp (nullable = true)
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- event_name: string (nullable = true)
|-- product_id: string (nullable = true)
I'd like to convert this to the following, keeping unique combinations of user_id and results and aggregating product_ids by event_name:
+-------+--------+---------------+---------------+
|user_id|results |product_clicked|products_viewed|
+-------+--------+---------------+---------------+
|user_1 |result 1|[1] |[1] |
|user_1 |result 2|[3] |[4,5] |
+-------+--------+---------------+---------------+
root
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- product_clicked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- products_viewed: array (nullable = true)
| |-- element: string (containsNull = true)
I have looked into pivot. It is close, but I do not need its aggregation; instead I need to build arrays in the columns created from the event_name column. I cannot figure out how to do it.
NOTE: The order within the product_clicked and product_viewed columns above is important and is based on the timestamp column of the input dataframe.

You can use collect_list during the pivot aggregation:
import pyspark.sql.functions as F
df2 = (df.groupBy('user_id', 'results')
.pivot('event_name')
.agg(F.collect_list('product_id'))
.selectExpr("user_id", "results", "Click as product_clicked", "View as product_viewed")
)
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
To ensure ordering, you can collect a list of structs containing the timestamp, sort the list, and transform the list to only keep the product_id:
df2 = (df.groupBy('user_id', 'results')
.pivot('event_name')
.agg(F.sort_array(F.collect_list(F.struct('timestamp', 'product_id'))))
.selectExpr("user_id", "results", "transform(Click, x -> x.product_id) as product_clicked", "transform(View, x -> x.product_id) as product_viewed")
)
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
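The sorted variant works because sort_array compares structs field by field, so putting timestamp first yields timestamp order. The sketch below is a plain-Python model of that logic, not Spark code: tuples stand in for structs, the rows mirror the question's input (deliberately shuffled), and the helper name pivot_collect_sorted is hypothetical.

```python
from collections import defaultdict

# Rows mirror the question's input, shuffled on purpose:
# (timestamp, user_id, results, event_name, product_id)
rows = [
    (1003, "user_1", "result 2", "View", "4"),
    (1000, "user_1", "result 1", "Click", "1"),
    (1004, "user_1", "result 2", "View", "5"),
    (1001, "user_1", "result 1", "View", "1"),
    (1002, "user_1", "result 2", "Click", "3"),
]

def pivot_collect_sorted(rows):
    """Model of groupBy(user_id, results).pivot(event_name) with a sorted collect_list."""
    cells = defaultdict(list)
    for ts, user_id, results, event_name, product_id in rows:
        # collect_list(struct(timestamp, product_id)) per (group, pivot value)
        cells[(user_id, results, event_name)].append((ts, product_id))
    out = defaultdict(dict)
    for (user_id, results, event_name), pairs in cells.items():
        # sort_array: tuples, like structs, compare on the first field first;
        # the comprehension plays the role of transform(..., x -> x.product_id)
        out[(user_id, results)][event_name] = [pid for _, pid in sorted(pairs)]
    return dict(out)

table = pivot_collect_sorted(rows)
print(table)
```

Even though the input rows arrive out of order, each list comes back ordered by timestamp, which is the guarantee a plain collect_list does not give.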

Related

Generate a nested structure in pyspark

I have the following DF:
+----+------+--------+----+----+----+----+----------+
| ID | Name | Vl1 | Vl2| Vl3| Vl4|Vl5 | Vl6 |
+----+------+--------+----+----+----+----+----------+
|1 |John | 1.5 |null|null|null| A|2022-01-01|
|1 |John | 1 |null|null|null| A|2022-01-01|
|1 |John | 3 |null|1 |null| A|2022-01-01|
|1 |John | 4 |null|1 |null| A|2022-01-01|
|2 |Ana | 2.5 |null|null|null| A|2022-01-01|
|2 |Ana | 0 |null|null|null| A|2022-01-01|
|2 |Ana | null|null|null|null| A|2022-01-01|
|2 |Ana | 2 |null|null|null| A|2022-01-01|
|2 |Ana | 2 |2 |null|null| A|2022-01-01|
|2 |Ana | 1 |null|null|null| A|2022-01-01|
|3 |Paul | 5 |null|null|null| A|2022-01-01|
|3 |Paul | null|2 |null|null| A|2022-01-01|
|3 |Paul | 2.5 |null|2 |null| A|2022-01-01|
|3 |Paul | null|null|3 |null| A|2022-01-01|
+----+------+--------+----+----+----+----+----------+
How can I generate the following nested structure:
|-- Title: string (nullable = true)
|-- Company: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: string (nullable = true)
| | | | |-- Detail: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: date (nullable = true)
So far, I have used withColumn to generate the Title and Company columns. Then I used groupBy to group by client and collect_list to generate the first level of the nested structure. How can I generate the other levels of the structure (Prch, Detail, Bs)?
For context, my DF has 1 billion rows, so I would like to know the best way to generate this structure. My Spark environment has 4 workers with 5 cores per worker.
MVCE:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType
data = [
("1","John",1.5,None,None,None,"A", "2022-01-01"),
("1","John",1.0,None,None,None,"A", "2022-01-01"),
("1","John",3.0,None,1.0,None,"A", "2022-01-01"),
("1","John",4.0,None,1.0,None,"A", "2022-01-01"),
("2","Ana",2.5,None,None,None,"A", "2022-01-01"),
("2","Ana",0.0,None,None,None,"A", "2022-01-01"),
("2","Ana",None,None,None,None,"A", "2022-01-01"),
("2","Ana",2.0,None,None,None,"A", "2022-01-01"),
("2","Ana",2.0,2.0,None,None,"A", "2022-01-01"),
("2","Ana",1.0,None,None,None,"A", "2022-01-01"),
("3","Paul",5.0,None,None,None,"A", "2022-01-01"),
("3","Paul",None,2.0,None,None,"A", "2022-01-01"),
("3","Paul",2.5,None,2.0,None,"A", "2022-01-01"),
("3","Paul",None,None,3.0,None,"A", "2022-01-01")
]
schema = StructType([
StructField("Id", StringType(),True),
StructField("Name", StringType(),True),
StructField("Vl1", DoubleType(),True),
StructField("Vl2", DoubleType(), True),
StructField("Vl3", DoubleType(), True),
StructField("Vl4", DateType(), True),
StructField("Vl5", StringType(), True),
StructField("Vl6", StringType(), True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()
df.show()
Group sequentially, one groupBy per level of nesting, starting from the innermost level and moving outwards.
The first line below creates the Detail and Bs arrays for each Vl3, the second line creates the Prch column, and the third creates the Client column.
(You obviously don't have to create a separate data frame for each step; this is just for the sake of explanation.)
import pyspark.sql.functions as fn
df1 = df.groupBy("Id", "Name", "Vl1", "Vl2", "Vl3").agg(fn.collect_list(fn.struct(fn.col("Vl4"))).alias("Detail"), fn.collect_list(fn.struct(fn.col("Vl5"), fn.col("Vl6"))).alias("Bs"))
df2 = df1.groupBy("Id", "Name", "Vl1", "Vl2").agg(fn.collect_list(fn.struct(fn.col("Vl3"), fn.col("Detail"), fn.col("Bs"))).alias("Prch"))
df3 = df2.groupBy("Id", "Name").agg(fn.collect_list(fn.struct(fn.col("Vl1"), fn.col("Vl2"), fn.col("Prch"))).alias("Client"))
Which gives you this schema:
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: double (nullable = true)
| | | | |-- Detail: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: string (nullable = true)
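To make the inside-out order concrete, here is a plain-Python model of the three grouping steps. It is only a sketch of the logic, not Spark code: the helper nest is hypothetical, rows are plain dicts, and only three of the sample rows are used.

```python
from collections import defaultdict

def nest(rows, keys, children):
    """Group dict rows by `keys`; for each child name, collect the listed fields of every member."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    result = []
    for key, members in groups.items():
        out = dict(zip(keys, key))
        for child, fields in children.items():
            # plays the role of collect_list(struct(fields...)).alias(child)
            out[child] = [{f: m[f] for f in fields} for m in members]
        result.append(out)
    return result

cols = ["Id", "Name", "Vl1", "Vl2", "Vl3", "Vl4", "Vl5", "Vl6"]
data = [
    ("1", "John", 3.0, None, 1.0, None, "A", "2022-01-01"),
    ("1", "John", 4.0, None, 1.0, None, "A", "2022-01-01"),
    ("3", "Paul", 5.0, None, None, None, "A", "2022-01-01"),
]
rows = [dict(zip(cols, r)) for r in data]

# Innermost level first, then one level outwards per step
level1 = nest(rows, ["Id", "Name", "Vl1", "Vl2", "Vl3"], {"Detail": ["Vl4"], "Bs": ["Vl5", "Vl6"]})
level2 = nest(level1, ["Id", "Name", "Vl1", "Vl2"], {"Prch": ["Vl3", "Detail", "Bs"]})
level3 = nest(level2, ["Id", "Name"], {"Client": ["Vl1", "Vl2", "Prch"]})

by_id = {r["Id"]: r for r in level3}
print(by_id["1"])
```

Each step drops one grouping key and wraps the previous level's columns into a new array, which is why the grouping keys shrink from five columns to two across the three calls.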

Extract Nested JSON columns using Explode function

I have a data structure like below
data: struct (nullable = true)
| |-- event_Id: long (nullable = true)
| |-- data_nested: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- item1: string (nullable = true)
| | | |-- item2: string (nullable = true)
| | | |-- item3: string (nullable = true)
| | | | |-- item3_item1: long (containsNull = true)
| |-- other_elemets: array (nullable = true)
I want to take out data like
+-----+-------+-----+
|item1|item2  |item3|
+-----+-------+-----+
|A    |Android|null |
|B    |Android|null |
|C    |iOS    |null |
|D    |iOS    |null |
+-----+-------+-----+
I have got the data using functions like
df = spark.read.json(['s3://<data_location>']).select("data.data_nested").persist()
df.select(explode(col('element')).alias('item1'))
which gives the item1 column. Now I would like to know how to use explode to get item2 and item3 most efficiently.

pyspark split array type column to multiple columns

After running the ALS algorithm in pyspark over a dataset, I ended up with a final dataframe which looks like the following
The recommendation column is an array type. I want to split this column so that my final dataframe looks like this
Can anyone suggest which pyspark function can be used to form this dataframe?
Schema of the dataframe
root
|-- person: string (nullable = false)
|-- recommendation: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- rating: float (nullable = true)
Assuming ID doesn't duplicate in each array, you can try the following:
import pyspark.sql.functions as f
df.withColumn('recommendation', f.explode('recommendation'))\
.withColumn('ID', f.col('recommendation').getItem('ID'))\
.withColumn('rating', f.col('recommendation').getItem('rating'))\
.groupby('person')\
.pivot('ID')\
.agg(f.first('rating')).show()
+------+---+---+---+
|person| a| b| c|
+------+---+---+---+
| xyz|0.4|0.3|0.3|
| abc|0.5|0.3|0.2|
| def|0.3|0.2|0.5|
+------+---+---+---+
Or transform with RDD:
from pyspark.sql import Row
df.rdd.map(lambda r: Row(
person=r.person, **{s.ID: s.rating for s in r.recommendation})
).toDF().show()
+------+-------------------+-------------------+-------------------+
|person| a| b| c|
+------+-------------------+-------------------+-------------------+
| abc| 0.5|0.30000001192092896|0.20000000298023224|
| def|0.30000001192092896|0.20000000298023224| 0.5|
| xyz| 0.4000000059604645|0.30000001192092896|0.30000001192092896|
+------+-------------------+-------------------+-------------------+
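The long decimals in the RDD output are most likely not a different result. The schema declares rating as a 32-bit float, while Python only has 64-bit floats, so each value is widened when the rows become Python objects. The sketch below reproduces the effect with the standard struct module; as_float32 is a hypothetical helper name.

```python
import struct

def as_float32(x):
    """Round-trip x through 4-byte IEEE 754 storage and return a Python (64-bit) float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 0.5 is exactly representable in binary; 0.4, 0.3 and 0.2 are not
widened = [as_float32(v) for v in (0.5, 0.4, 0.3, 0.2)]
print(widened)
```

The widened values match the digits in the RDD output above, so both approaches carry the same ratings and differ only in how the values are stored and displayed.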

How to convert an array of structs into multiple columns?

I have a schema:
root (original)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = true)
How can I flatten it?
root (derived)
|-- col1: string (nullable = false)
|-- col2: string (nullable = true)
|-- col3: string (nullable = false)
|-- col4: string (nullable = true)
|-- ...
where col1...n is [col1 from original] and value for col1...n is value from [col2 from original]
Example:
+--------------------------------------------+
|entries |
+--------------------------------------------+
|[[a1, 1], [a2, P], [a4, N] |
|[[a1, 1], [a2, O], [a3, F], [a4, 1], [a5, 1]|
+--------------------------------------------+
I want to create the next dataset:
+-------------------------+
| a1 | a2 | a3 | a4 | a5 |
+-------------------------+
| 1 | P | null| N | null|
| 1 | O | F | 1 | 1 |
+-------------------------+
You can do it with a combination of explode and pivot. To do so, you first need to create a row_id:
val df = Seq(
Seq(("a1", "1"), ("a2", "P"), ("a4", "N")),
Seq(("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1"))
).toDF("arr")
.select($"arr".cast("array<struct<col1:string,col2:string>>"))
df
.withColumn("row_id", monotonically_increasing_id())
.select($"row_id", explode($"arr"))
.select($"row_id", $"col.*")
.groupBy($"row_id").pivot($"col1").agg(first($"col2"))
.drop($"row_id")
.show()
gives:
+---+---+----+---+----+
| a1| a2| a3| a4| a5|
+---+---+----+---+----+
| 1| P|null| N|null|
| 1| O| F| 1| 1|
+---+---+----+---+----+
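The row_id matters because explode scatters each array over several records, and without an id there would be nothing to regroup them by. A plain-Python model of the same three steps (explode, pivot, first) is sketched below; it is not Spark code, enumerate stands in for monotonically_increasing_id, and the variable names are illustrative.

```python
entries = [
    [("a1", "1"), ("a2", "P"), ("a4", "N")],
    [("a1", "1"), ("a2", "O"), ("a3", "F"), ("a4", "1"), ("a5", "1")],
]

# explode: one (row_id, col1, col2) record per array element
exploded = [(row_id, k, v) for row_id, arr in enumerate(entries) for k, v in arr]

# pivot: the distinct col1 values become column names (sorted here for a stable order)
columns = sorted({k for _, k, _ in exploded})

# groupBy(row_id) + first(col2): rebuild one wide row per original row
wide = {}
for row_id, key, value in exploded:
    row = wide.setdefault(row_id, dict.fromkeys(columns))
    if row[key] is None:  # keep the first value seen for each cell
        row[key] = value

result = [wide[i] for i in sorted(wide)]
print(result)
```

Keys that never occur in a given row stay None, which corresponds to the null cells in the output above.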

spark dataframe select values from multiple columns based on condition

Data schema,
root
|-- id: string (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|id|col1 |col2 |
|1 |["x","y","z"]|[123,"null","null"]|
From the above data, I want to keep x where it exists in col1, along with the corresponding value for x from col2.
(The values of col1 and col2 are aligned by position: if x is at index 2 in col1, its value is also at index 2 in col2.)
Result (col1 and col2 need to be of array type):
|id |col1 |col2 |
|1 |["x"]|[123]|
If x is not present in col1, I need a result like:
|id| col1 |col2 |
|1 |["null"] |["null"]|
I tried:
val df1 = df.withColumn("result",when($"col1".contains("x"),"X").otherwise("null"))
The trick is to transform your data from dumb string columns into a more usable data structure. Once col1 and col2 are rebuilt as arrays (or as a map, as your desired output suggests they should be), you can use Spark's built-in functions rather than a messy UDF as suggested by @baitmbarek.
To start, use trim and split to convert col1 and col2 to arrays:
scala> val df = Seq(
| ("1", """["x","y","z"]""","""[123,"null","null"]"""),
| ("2", """["a","y","z"]""","""[123,"null","null"]""")
| ).toDF("id","col1","col2")
df: org.apache.spark.sql.DataFrame = [id: string, col1: string ... 1 more field]
scala> val df_array = df.withColumn("col1", split(trim($"col1", "[\"]"), "\"?,\"?"))
.withColumn("col2", split(trim($"col2", "[\"]"), "\"?,\"?"))
df_array: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_array.show(false)
+---+---------+-----------------+
|id |col1 |col2 |
+---+---------+-----------------+
|1 |[x, y, z]|[123, null, null]|
|2 |[a, y, z]|[123, null, null]|
+---+---------+-----------------+
scala> df_array.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
From here, you should be able to achieve what you want using array_position to find the index of 'x' (if any) in col1 and retrieve the matching data from col2. However, converting the two arrays into a map first should make it clearer what your code is doing:
scala> val df_map = df_array.select(
$"id",
map_from_entries(arrays_zip($"col1", $"col2")).as("col_map")
)
df_map: org.apache.spark.sql.DataFrame = [id: string, col_map: map<string,string>]
scala> df_map.show(false)
+---+--------------------------------+
|id |col_map |
+---+--------------------------------+
|1 |[x -> 123, y -> null, z -> null]|
|2 |[a -> 123, y -> null, z -> null]|
+---+--------------------------------+
scala> val df_final = df_map.select(
$"id",
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(lit("x")))
.as("col1"),
when(isnull(element_at($"col_map", "x")),
array(lit("null")))
.otherwise(
array(element_at($"col_map", "x")))
.as("col2")
)
df_final: org.apache.spark.sql.DataFrame = [id: string, col1: array<string> ... 1 more field]
scala> df_final.show
+---+------+------+
| id| col1| col2|
+---+------+------+
| 1| [x]| [123]|
| 2|[null]|[null]|
+---+------+------+
scala> df_final.printSchema
root
|-- id: string (nullable = true)
|-- col1: array (nullable = false)
| |-- element: string (containsNull = false)
|-- col2: array (nullable = false)
| |-- element: string (containsNull = true)
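For comparison, the positional lookup (the array_position idea) and the map lookup used in this answer can be modelled side by side in plain Python. This is a sketch of the logic only, not Spark code; the function names by_position and by_map are hypothetical.

```python
def by_position(col1, col2, key="x"):
    """Positional lookup: find key in col1 and take the value at the same index in col2."""
    if key in col1:
        # list.index is 0-based; Spark's array_position is 1-based and returns 0 when not found
        i = col1.index(key)
        return [key], [col2[i]]
    return ["null"], ["null"]

def by_map(col1, col2, key="x"):
    """Map lookup: zip the arrays into a dict (like map_from_entries(arrays_zip(...)))."""
    col_map = dict(zip(col1, col2))
    value = col_map.get(key)  # None when key is absent, like element_at on a map
    if value is None:
        return ["null"], ["null"]
    return [key], [value]

row1 = (["x", "y", "z"], ["123", "null", "null"])
row2 = (["a", "y", "z"], ["123", "null", "null"])
print(by_position(*row1), by_map(*row2))
```

Both give the same result on this data. One design caveat: the map version cannot tell a missing key from a key whose value is a real null, whereas the positional version can.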

Resources