Generate a nested nested structure in pyspark - apache-spark

I have the following DF:
+----+------+--------+----+----+----+----+----------+
| ID | Name | Vl1 | Vl2| Vl3| Vl4|Vl5 | Vl6 |
+----+------+--------+----+----+----+----+----------+
|1 |John | 1.5 |null|null|null| A|2022-01-01|
|1 |John | 1 |null|null|null| A|2022-01-01|
|1 |John | 3 |null|1 |null| A|2022-01-01|
|1 |John | 4 |null|1 |null| A|2022-01-01|
|2 |Ana | 2.5 |null|null|null| A|2022-01-01|
|2 |Ana | 0 |null|null|null| A|2022-01-01|
|2 |Ana | null|null|null|null| A|2022-01-01|
|2 |Ana | 2 |null|null|null| A|2022-01-01|
|2 |Ana | 2 |2 |null|null| A|2022-01-01|
|2 |Ana | 1 |null|null|null| A|2022-01-01|
|3 |Paul | 5 |null|null|null| A|2022-01-01|
|3 |Paul | null|2 |null|null| A|2022-01-01|
|3 |Paul | 2.5 |null|2 |null| A|2022-01-01|
|3 |Paul | null|null|3 |null| A|2022-01-01|
+----+------+--------+----+----+----+----+----------+
How can I generate the following nested structure:
|-- Title: string (nullable = true)
|-- Company: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: string (nullable = true)
| | | | |-- Detail: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = true)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: date (nullable = true)
Until now, I used withColumn to generate the columns Title and Company. Then I used groupBy to group by clients and collect_list to generate the first level of the nested structure, but how can I generate the other levels of the structure (Prch, Detail, Bs)?
Just so you know, my DF has 1 billion rows, so I would like to know the best way to generate this structure. My Spark environment has 4 workers with 5 cores per worker.
MVCE:
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, DateType)

data = [
    ("1", "John", 1.5, None, None, None, "A", "2022-01-01"),
    ("1", "John", 1.0, None, None, None, "A", "2022-01-01"),
    ("1", "John", 3.0, None, 1.0, None, "A", "2022-01-01"),
    ("1", "John", 4.0, None, 1.0, None, "A", "2022-01-01"),
    ("2", "Ana", 2.5, None, None, None, "A", "2022-01-01"),
    ("2", "Ana", 0.0, None, None, None, "A", "2022-01-01"),
    ("2", "Ana", None, None, None, None, "A", "2022-01-01"),
    ("2", "Ana", 2.0, None, None, None, "A", "2022-01-01"),
    ("2", "Ana", 2.0, 2.0, None, None, "A", "2022-01-01"),
    ("2", "Ana", 1.0, None, None, None, "A", "2022-01-01"),
    ("3", "Paul", 5.0, None, None, None, "A", "2022-01-01"),
    ("3", "Paul", None, 2.0, None, None, "A", "2022-01-01"),
    ("3", "Paul", 2.5, None, 2.0, None, "A", "2022-01-01"),
    ("3", "Paul", None, None, 3.0, None, "A", "2022-01-01")
]
schema = StructType([
    StructField("Id", StringType(), True),
    StructField("Name", StringType(), True),
    StructField("Vl1", DoubleType(), True),
    StructField("Vl2", DoubleType(), True),
    StructField("Vl3", DoubleType(), True),
    StructField("Vl4", DateType(), True),
    StructField("Vl5", StringType(), True),
    StructField("Vl6", StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show()
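For completeness, the Title and Company columns from the target schema can be added with withColumn, as the question mentions; the literals below are placeholders only, since their real source isn't given in the question:
import pyspark.sql.functions as F

# Placeholder values: the question does not specify how Title/Company are derived.
df = (df.withColumn("Title", F.lit("some title"))
        .withColumn("Company", F.lit("some company")))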

You have to group them sequentially, one groupBy per level of nesting, starting from the innermost level and moving outwards.
The first statement below creates the Detail and Bs columns (grouped down to the Vl3 level), the second creates the Prch column, and the third creates the Client column.
(You obviously don't have to create multiple data frames for each step; this is just for the sake of explanation.)
import pyspark.sql.functions as fn

df1 = df.groupBy("Id", "Name", "Vl1", "Vl2", "Vl3").agg(
    fn.collect_list(fn.struct(fn.col("Vl4"))).alias("Detail"),
    fn.collect_list(fn.struct(fn.col("Vl5"), fn.col("Vl6"))).alias("Bs"))
df2 = df1.groupBy("Id", "Name", "Vl1", "Vl2").agg(
    fn.collect_list(fn.struct(fn.col("Vl3"), fn.col("Detail"), fn.col("Bs"))).alias("Prch"))
df3 = df2.groupBy("Id", "Name").agg(
    fn.collect_list(fn.struct(fn.col("Vl1"), fn.col("Vl2"), fn.col("Prch"))).alias("Client"))
Which gives you this schema:
root
|-- Id: string (nullable = true)
|-- Name: string (nullable = true)
|-- Client: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Vl1: double (nullable = true)
| | |-- Vl2: double (nullable = true)
| | |-- Prch: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- Vl3: double (nullable = true)
| | | | |-- Detail: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl4: date (nullable = true)
| | | | |-- Bs: array (nullable = false)
| | | | | |-- element: struct (containsNull = false)
| | | | | | |-- Vl5: string (nullable = true)
| | | | | | |-- Vl6: string (nullable = true)
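If you prefer a single expression, the three groupBy steps can also be chained directly (a sketch equivalent to df1/df2/df3 above, not a separate method):
import pyspark.sql.functions as fn

# Same three aggregations, chained from the innermost level outwards.
result = (
    df.groupBy("Id", "Name", "Vl1", "Vl2", "Vl3")
      .agg(fn.collect_list(fn.struct("Vl4")).alias("Detail"),
           fn.collect_list(fn.struct("Vl5", "Vl6")).alias("Bs"))
      .groupBy("Id", "Name", "Vl1", "Vl2")
      .agg(fn.collect_list(fn.struct("Vl3", "Detail", "Bs")).alias("Prch"))
      .groupBy("Id", "Name")
      .agg(fn.collect_list(fn.struct("Vl1", "Vl2", "Prch")).alias("Client"))
)
result.printSchema()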

Related

Flatten nested array pyspark columns

I have the schema below in my dataframe structure. I was just wondering what would be the best way of flattening all of these so I can have a fully viewable dataframe without arrays.
root
|-- id: string (nullable = true)
|-- InsuranceProvider: string (nullable = true)
|-- Type: struct (nullable = true)
| |-- Client: struct (nullable = true)
| | |-- PaidIn: struct (nullable = true)
| | | |-- Insuranceid: string (nullable = true)
| | | |-- Insurancedesc: string (nullable = true)
| | | |-- purchaseditems: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- InsuranceNumber: string (nullable = true)
| | | | | |-- InsuranceLabel: string (nullable = true)
| | | | | |-- Insurancequantity: integer (nullable = true)
| | | | | |-- Insuranceprice: integer (nullable = true)
| | | | | |-- discountsreceived: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- amount: integer (nullable = true)
| | | | | | | |-- description: string (nullable = true)
| | | | | |-- childItems: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- InsuranceNumber: string (nullable = true)
| | | | | | | |-- InsuranceLabel: string (nullable = true)
| | | | | | | |-- Insurancequantity: double (nullable = true)
| | | | | | | |-- Insuranceprice: integer (nullable = true)
| | | | | | | |-- discountsreceived: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- amount: integer (nullable = true)
| | | | | | | | | |-- description: string (nullable = true)
|-- eventTime: string (nullable = true)
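One common way to approach this (a rough sketch against the schema above, not a tested drop-in solution) is to pull struct fields out with dotted paths and explode one array level per select:
import pyspark.sql.functions as F

# Pull scalar struct fields out with dotted paths, then explode one array level per select.
step1 = df.select(
    "id", "InsuranceProvider", "eventTime",
    F.col("Type.Client.PaidIn.Insuranceid").alias("Insuranceid"),
    F.col("Type.Client.PaidIn.Insurancedesc").alias("Insurancedesc"),
    F.explode_outer("Type.Client.PaidIn.purchaseditems").alias("item"))

step2 = step1.select(
    "id", "InsuranceProvider", "eventTime", "Insuranceid", "Insurancedesc",
    "item.InsuranceNumber", "item.InsuranceLabel",
    "item.Insurancequantity", "item.Insuranceprice",
    F.explode_outer("item.discountsreceived").alias("discount"),
    "item.childItems")

# childItems (and its own discountsreceived) would be unpacked the same way on the next pass.
step2.select("*", "discount.amount", "discount.description").drop("discount").printSchema()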

Extract Nested JSON columns using Explode function

I have a data structure like below
data: struct (nullable = true)
| |-- event_Id: long (nullable = true)
| |-- data_nested: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- item1: string (nullable = true)
| | | |-- item2: string (nullable = true)
| | | |-- item3: string (nullable = true)
| | | | |-- item3_item1: long (containsNull = true)
| |-- other_elemets: array (nullable = true)
I want to take out data like
+-----+-------+-----+
|item1|item2  |item3|
+-----+-------+-----+
|A    |Android|null |
|B    |Android|null |
|C    |iOS    |null |
|D    |iOS    |null |
+-----+-------+-----+
I have got the data using functions like
df = spark.read.json(['s3://<data_location>']).select("data.data_nested").persist()
df.select(explode(col('element')).alias('item1'))
which gives the item1 column. Now I want to know how to use explode to get item2 and item3 most efficiently.
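One usual pattern here (a sketch, assuming the schema shown above) is to explode the array of structs once and then select all three items from the exploded column, rather than exploding per field:
from pyspark.sql.functions import col, explode

# Explode the array of structs once, then read every item from the exploded struct.
# The S3 path is the placeholder from the question; "spark" is the active session.
df = spark.read.json(['s3://<data_location>']).select("data.data_nested")

exploded = df.select(explode(col("data_nested")).alias("d"))
exploded.select("d.item1", "d.item2", "d.item3").show()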

Pyspark DF Pivot and Create Arrays columns

I have an input PySpark df:
+---------+-------+--------+----------+----------+
|timestamp|user_id|results |event_name|product_id|
+---------+-------+--------+----------+----------+
|1000 |user_1 |result 1|Click |1 |
|1001 |user_1 |result 1|View |1 |
|1002 |user_1 |result 2|Click |3 |
|1003 |user_1 |result 2|View |4 |
|1004 |user_1 |result 2|View |5 |
+---------+-------+--------+----------+----------+
root
|-- timestamp: timestamp (nullable = true)
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- event_name: string (nullable = true)
|-- product_id: string (nullable = true)
I'd like to convert this to the following, making sure that I keep unique combinations of user_id and results, and aggregate product_ids based on the given event_name, like this:
+-------+--------+---------------+---------------+
|user_id|results |product_clicked|products_viewed|
+-------+--------+---------------+---------------+
|user_1 |result 1|[1] |[1] |
|user_1 |result 2|[4,5] |[3] |
+-------+--------+---------------+---------------+
root
|-- user_id: string (nullable = true)
|-- results: string (nullable = true)
|-- product_clicked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- products_viewed: array (nullable = true)
| |-- element: string (containsNull = true)
I have looked into pivot; it's close, but I do not need the aggregation part of it. Instead I need array creation in columns which are created based on the event_name column. I cannot figure out how to do it.
NOTE: The order in the product_clicked and product_viewed columns above is important and is based on the timestamp column of the input dataframe.
You can use collect_list during the pivot aggregation:
import pyspark.sql.functions as F
df2 = (df.groupBy('user_id', 'results')
         .pivot('event_name')
         .agg(F.collect_list('product_id'))
         .selectExpr("user_id", "results", "Click as product_clicked", "View as product_viewed")
      )
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+
To ensure ordering, you can collect a list of structs containing the timestamp, sort the list, and transform the list to only keep the product_id:
df2 = (df.groupBy('user_id', 'results')
         .pivot('event_name')
         .agg(F.sort_array(F.collect_list(F.struct('timestamp', 'product_id'))))
         .selectExpr("user_id", "results", "transform(Click, x -> x.product_id) as product_clicked", "transform(View, x -> x.product_id) as product_viewed")
      )
df2.show()
+-------+-------+---------------+--------------+
|user_id|results|product_clicked|product_viewed|
+-------+-------+---------------+--------------+
| user_1|result2| [3]| [4, 5]|
| user_1|result1| [1]| [1]|
+-------+-------+---------------+--------------+

Reading nested Json structure in PySpark

I am new to PySpark. I am trying to read the values of one of the nested columns of my JSON data. Here is my JSON structure:
-- _index: string (nullable = true)
|-- _score: string (nullable = true)
|-- _source: struct (nullable = true)
| |-- layers: struct (nullable = true)
| | |-- R1.TEST6: struct (nullable = true)
| | | |-- R1.TEST1: struct (nullable = true)
| | | | |-- R1.TEST1.idx: string (nullable = true)
| | | | |-- R1.TEST1.ide: string (nullable = true)
| | | |-- R1.TEST3: struct (nullable = true)
| | | | |-- R1.TEST3.PDU: string (nullable = true)
| | | | |-- R1.TEST3.pdu: string (nullable = true)
| | | | |-- R1.TEST4: struct (nullable = true)
| | | | | |-- R1.TEST2: struct (nullable = true)
| | | | | | |-- R1.TEST2.agg: string (nullable = true)
| | | | | | |-- R1.TEST2.size: string (nullable = true)
| | | | | | |-- R1.TEST2.start: string (nullable = true)
| | | | | | |-- R1.TEST2.beam: string (nullable = true)
| | | | | | |-- R1.TEST2.startIndex: string (nullable = true)
| | | | | | |-- R1.TEST2.regType: string (nullable = true)
| | | | | | |-- R1.TEST2.coreSetType: string (nullable = true)
| | | | | | |-- R1.TEST2.cpType: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column1: string (nullable = true)
| | | | | | |-- R1.TEST2.column3: string (nullable = true)
As mentioned in this article, https://stackoverflow.com/questions/57811415/reading-a-nested-json-file-in-pyspark, I tried the following:
df2 = df.select(F.array(F.expr("_source.*")).alias("Source"))
Now my requirement is to access the values that are under the R1.TEST6 tag.
But the code below is not working:
df2.withColumn("source_data", F.explode(F.arrays_zip("Source"))).select("source_data.Source.R1.TEST6.R1.TEST1.idx").show()
Can someone please help me with how to access all the fields of this nested JSON and create a table? There are multiple levels of nesting present under _source.R1.TEST6, so how can explode be used across that many levels?
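One common way to reach fields whose names contain literal dots (a sketch against the schema above, not a tested solution) is to wrap each dotted name in backticks; since everything under _source here is a struct rather than an array, no explode should be needed:
import pyspark.sql.functions as F

# Every path segment that contains a literal dot (e.g. "R1.TEST6") has to be
# wrapped in backticks so Spark does not treat the dot as a struct separator.
df.select(
    F.col("_source.layers.`R1.TEST6`.`R1.TEST1`.`R1.TEST1.idx`").alias("idx"),
    F.col("_source.layers.`R1.TEST6`.`R1.TEST1`.`R1.TEST1.ide`").alias("ide"),
).show()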

Spark for Json Data

I am processing a nested complex Json and below is the schema for it.
root
|-- businessEntity: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- payGroup: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reportingPeriod: struct (nullable = true)
| | | | | |-- worker: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- category: string (nullable = true)
| | | | | | | |-- person: struct (nullable = true)
| | | | | | | |-- tax: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- code: string (nullable = true)
| | | | | | | | | |-- qtdAmount: double (nullable = true)
| | | | | | | | | |-- ytdAmount: double (nullable = true)
My requirement is to create a hashmap with the code concatenated with qtdAmount as the key and the value of qtdAmount as the value, i.e. Map.put(code + "qtdAmount", qtdAmount). How can I do this with Spark?
I tried with the below shell commands.
import org.apache.spark.sql._
val sqlcontext = new SQLContext(sc)
val cdm = sqlcontext.read.json("/user/edureka/CDM/cdm.json")
val spark = SparkSession.builder().appName("SQL").config("spark.some.config.option","some-vale").getOrCreate()
cdm.createOrReplaceTempView("CDM")
val sqlDF = spark.sql("SELECT businessEntity[0].payGroup[0] from CDM").show()
val address = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].person.address from CDM as address")
val worker = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker from CDM")
val tax = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
val tax = sqlcontext.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
tax.select("tax.code")
val codes = tax.select(explode(tax("code")))
scala> val codes = tax.withColumn("code",explode(tax("tax.code"))).withColumn("qtdAmount",explode(tax("tax.qtdAmount"))).withColumn("ytdAmount",explode(tax("tax.ytdAmount")))
I am trying to get all the codes and qtdAmount values into a map, but I am not getting it. Using multiple explode statements for a single DF produces a Cartesian product of the elements.
Could someone please help with how to parse JSON this complex in Spark?
You can get code and qtdAmount in this way.
import sqlcontext.implicits._
cdm.select(
$"businessEntity.element.payGroup.element.reportingPeriod.worker.element.tax.element.code".as("code"),
$"businessEntity.element.payGroup.element.reportingPeriod.worker.element.tax.element.qtdAmount".as("qtdAmount")
).show
For detailed information, check this
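To get from those columns to the hashmap the question asks for, one possible last step (a rough PySpark sketch of the same idea, not part of the answer above, with the paths assumed from the schema shown) is:
import pyspark.sql.functions as F

cdm = spark.read.json("/user/edureka/CDM/cdm.json")

# Explode each array level down to the individual tax entries.
taxes = (cdm
         .select(F.explode("businessEntity").alias("be"))
         .select(F.explode("be.payGroup").alias("pg"))
         .select(F.explode("pg.reportingPeriod.worker").alias("w"))
         .select(F.explode("w.tax").alias("t"))
         # Key built literally as in the question's Map.put(code + "qtdAmount", qtdAmount).
         .select(F.concat(F.col("t.code"), F.lit("qtdAmount")).alias("key"),
                 F.col("t.qtdAmount").alias("value")))

# Collect into a plain dict on the driver (only sensible for small results).
tax_map = {row["key"]: row["value"] for row in taxes.collect()}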
