Nested json and dataframe - retrieving data from array - apache-spark

I am trying to retrieve data from a JSON file.
df=spark.read.json('/home/data/activities.json',multiLine=True)
The content looks like this (I include only one "data" row; there are 94):
{"meta.count":"94",
"data":[
{"id":"f67de4f6-d23e-49a7-b9dd-63d68df533a3",
"name.fi":"Purjehdus 1700-luvun tyyliin tykkisluuppi Dianalla","name.en":"Cannon sloop Diana\u00B4s sailings","name.sv":"Seglingar med kanonslupen Diana",
"name.zh":null,
"source_type.id":3,
"source_type.name":"MyHelsinki",
"info_url":"https:\/\/www.suomenlinnatours.com\/en\/cannon-sloop-diana?johku_product=7",
"modified_at":"2019-12-28T15:10:33.145Z",
"location.lat":60.145111083984375,
"location.lon":24.987560272216797,
"location.address.street_address":"Suomenlinna Tykist\u00F6lahti Pier",
"location.address.postal_code":null,
"location.address.locality":"Helsinki",
"description.intro":null,
"description.body":"<p>Luvassa on amiraali Chapamin tahdittama mielenkiintoinen laivamatka valistusajan Viaporiin... /p>\r\n",
"description.images":[
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana.jpg",
"copyright_holder":"",
"license_type.id":1,
"license_type.name":"All rights reserved."},
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana_2.jpg",
"copyright_holder":"",
"license_type.id":1,
"license_type.name":"All rights reserved."},
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana_3.jpg",
"copyright_holder":"",
"license_type.id":1,"license_type.name":"All rights reserved."}],
"tags":[
{"id":"myhelsinki:45","name":"sea"},
{"id":"myhelsinki:836","name":"suomenlinna"},
{"id":"myhelsinki:793","name":"history"}],
"where_when_duration.where_and_when":"Suomenlinna, kes\u00E4kuusta elokuuhun",
"where_when_duration.duration":"N. 1h 45min"}],
"tags":[
{"id":"myhelsinki:453","name":"nature"},
{"id":"myhelsinki:747","name":"canoeing"},
{"id":"myhelsinki:342","name":"guidance"},
{"id":"myhelsinki:399","name":"outdoor recreation"}],
"where_when_duration.where_and_when":"Toukokuussa joka sunnuntai, kes\u00E4-elokuussa joka keskiviikko ja sunnuntai, syyskuussa joka sunnuntai",
"where_when_duration.duration":"4,5 tuntia sis\u00E4lt\u00E4en kuljetuksen"}],
"tags.myhelsinki:10":"sauna",
"tags.myhelsinki:1016":"heavy rock",
"tags.myhelsinki:1749":"national parks",
"tags.myhelsinki:1822":"schools (educational institutions)",
"tags.myhelsinki:2":"food"
}
The schema looks as follows:
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description.body: string (nullable = true)
| | |-- description.images: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- copyright_holder: string (nullable = true)
| | | | |-- license_type.id: long (nullable = true)
| | | | |-- license_type.name: string (nullable = true)
| | | | |-- url: string (nullable = true)
| | |-- description.intro: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- info_url: string (nullable = true)
| | |-- location.address.locality: string (nullable = true)
| | |-- location.address.postal_code: string (nullable = true)
| | |-- location.address.street_address: string (nullable = true)
| | |-- location.lat: double (nullable = true)
| | |-- location.lon: double (nullable = true)
| | |-- modified_at: string (nullable = true)
| | |-- name.en: string (nullable = true)
| | |-- name.fi: string (nullable = true)
| | |-- name.sv: string (nullable = true)
| | |-- name.zh: string (nullable = true)
| | |-- source_type.id: long (nullable = true)
| | |-- source_type.name: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | |-- where_when_duration.duration: string (nullable = true)
| | |-- where_when_duration.where_and_when: string (nullable = true)
|-- meta.count: string (nullable = true)
|-- tags.myhelsinki:10: string (nullable = true)
|-- tags.myhelsinki:1016: string (nullable = true)
|-- tags.myhelsinki:1749: string (nullable = true)
|-- tags.myhelsinki:1822: string (nullable = true)
|-- tags.myhelsinki:2: string (nullable = true)
I am interested in the "data" array, including the nested array "tags". I would like to skip "meta.count" and the "tags.myhelsinki:..." columns.
I tried this:
df.withColumn("expl_data", explode_outer(col("tags"))).select("expl_data.data.name.en").show(10)
and I get this error message:
AnalysisException: "cannot resolve '`tags`' given input columns: [tags.myhelsinki:10, tags.myhelsinki:453, tags.myhelsinki:226, tags.myhelsinki:1016, tags.myhelsinki:342, tags.myhelsinki:531, tags.myhelsinki:364, tags.myhelsinki:836, tags.myhelsinki:346,...
I get the same error when I try to explode the "tags.name" or "description.images" arrays.
Could anyone please help? My goal is to retrieve a set of selected fields from this structure (tags are very important).
Many thanks in advance!
Alicia

You have to explode the data array first so that you can access the nested fields:
df = df.select(explode_outer(col("data")).alias("data")).select(col("data.*"))
Now, you can explode the inner arrays tags and description.images like this:
df = df.withColumn("tags", explode_outer(col("tags")))\
.withColumn("description.images", explode_outer(col("`description.images`")))
And finally select the columns you want:
df.select("id", "tags.name", "`name.en`").show()
+--------------------+-----------+--------------------+
| id| name| name.en|
+--------------------+-----------+--------------------+
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
+--------------------+-----------+--------------------+
Notice that some column names contain a dot; wrap them in backticks to select them.
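As an aside, exploding two arrays in the same record multiplies rows, which is why each tag appears three times in the output above (one copy per image). A plain-Python sketch of that cross-product behavior, using a hypothetical minimal record shaped like the data (not the full dataset):

```python
from itertools import product

# Hypothetical minimal record mirroring one element of the "data" array
record = {
    "id": "f67de4f6-d23e-49a7-b9dd-63d68df533a3",
    "name.en": "Cannon sloop Diana sailings",
    "tags": [{"name": "sea"}, {"name": "suomenlinna"}, {"name": "history"}],
    "description.images": [{"url": "img1"}, {"url": "img2"}, {"url": "img3"}],
}

# Each explode emits one output row per array element, so exploding both
# arrays yields the cross product: 3 tags x 3 images = 9 rows
rows = [
    (record["id"], tag["name"], record["name.en"])
    for tag, img in product(record["tags"], record["description.images"])
]
print(len(rows))  # 9
```

If you only need the tag names and not the images, exploding just the tags array avoids this duplication.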

Related

How to drop values in nested struct if data type is null

I have a data structure like the one shown below. What I want is to drop certain values within field3 (a struct) if their data type is null, and also drop field3.ZZZ completely if ZZZ's data type is null.
root
|-- field1: string (nullable = true)
|-- field2: struct (nullable = true)
|-- field3: struct (nullable = true)
| |-- XXX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- YYY: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- ZZZ: array (nullable = true)
| | |-- element: null (containsNull = true)
I did a lot of research but haven't found a clear way to solve this. Can anyone help me with this?

json file into pyspark dataFrame

I've downloaded a JSON file that I'm trying to load into a DataFrame for some analysis.
raw_constructors = spark.read.json("/constructors.json")
When I call raw_constructors.show(), I only get one column and one row.
+--------------------+
| MRData|
+--------------------+
|{{[{adams, Adams,...|
+--------------------+
When I ask for the schema of the JSON file with raw_constructors.printSchema(), I get this:
root
|-- MRData: struct (nullable = true)
| |-- ConstructorTable: struct (nullable = true)
| | |-- Constructors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- constructorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- nationality: string (nullable = true)
| | | | |-- url: string (nullable = true)
| |-- limit: string (nullable = true)
| |-- offset: string (nullable = true)
| |-- series: string (nullable = true)
| |-- total: string (nullable = true)
| |-- url: string (nullable = true)
| |-- xmlns: string (nullable = true)
I'm using pyspark.
How can I get a DataFrame with the four columns constructorId, name, nationality, and url, with one row per item?
Thank you!
You can simply use explode to break the array down into multiple rows:
from pyspark.sql import functions as F
(df
.select(F.explode('MRData.ConstructorTable.Constructors').alias('tmp'))
.select('tmp.*')
.show()
)
+-------------+----+-----------+---+
|constructorId|name|nationality|url|
+-------------+----+-----------+---+
| i1| n1| y1| u1|
| i2| n2| y2| u2|
+-------------+----+-----------+---+
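For intuition, the explode plus select('tmp.*') steps are equivalent to this plain-Python flattening (using a hypothetical two-constructor payload shaped like the schema above):

```python
# Hypothetical payload matching the MRData schema
data = {
    "MRData": {
        "ConstructorTable": {
            "Constructors": [
                {"constructorId": "i1", "name": "n1", "nationality": "y1", "url": "u1"},
                {"constructorId": "i2", "name": "n2", "nationality": "y2", "url": "u2"},
            ]
        }
    }
}

# explode: one row per array element; select('tmp.*'): struct fields become columns
rows = [
    (c["constructorId"], c["name"], c["nationality"], c["url"])
    for c in data["MRData"]["ConstructorTable"]["Constructors"]
]
print(rows)
```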

SparkSQL: Use SQL inside Transform Lambda

I am not able to access the lambda variable in an inline SQL command inside a Spark SQL transform function.
I do not want to explode and perform a join; I was hoping to perform an inline lookup to fetch the value for the email.
Error:
AnalysisException: cannot resolve '`y.email`' given input columns: [lkp.id, lkp.display_name, lkp.email_address, lkp.uuid]
Code:
transform (
x.license_admin,
y -> struct (
get_uuid(lower(trim(y.email))) as email,
( select coalesce(lkp.uuid, get_uuid(lower(trim(y.email))))
from lkp_uuid_email_map lkp
where lkp.email_address = lower(trim(y.email))
) as email_02,
generic_hash(y.first_name) as first_name
)
) as license_admin,
Column Schema
|-- order_line_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- additional_notes: string (nullable = true)
| | |-- amount: double (nullable = true)
| | |-- discounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- discount_amount: double (nullable = true)
| | | | |-- discount_percentage: string (nullable = true)
| | | | |-- discount_reason: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- license_admin: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- email: string (nullable = true)
| | | | |-- first_name: string (nullable = true)
| | | | |-- last_name: string (nullable = true)
| | | | |-- license_admin_id: string (nullable = true)
Thanks in advance.

how to explode a structs array

I have a schema like this
root
|-- CaseNumber: string (nullable = true)
|-- Interactions: struct (nullable = true)
| |-- EmailInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- PhoneInteractions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Contact: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- CreatedOn: string (nullable = true)
| | | |-- Direction: string (nullable = true)
| |-- WebInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
and I would like to explode the three arrays (EmailInteractions, PhoneInteractions, WebInteractions), group by CaseNumber, create three tables, and execute this SQL query:
Select casenumber,CreatedOn
from EmailInteration
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from PhoneInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from WebInteraction
where Direction = 'Outgoing'
The code used to produce this schema:
val dl = spark.read.format("com.databricks.spark.avro").load("adl://power.azuredatalakestore.net/Se/eventhubspace/eventhub/0_2020_01_20_*_*_*.avro")
val dl1=dl.select($"body".cast("string")).map(_.toString())
val dl2=spark.read.json(dl1)
val dl3=dl2.select($"Content.*",$"RawProperties.UserProperties.*")
I am new to Databricks; any help would be appreciated. Thanks in advance.

How to convert string column to ArrayType in pyspark

I have a requirement where I need to mask data stored in Cassandra tables using pyspark. I have a frozen data set in Cassandra which I receive as an array in pyspark. I converted it to a string to mask it; now I want to convert it back to the array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a DataFrame and converted it to a string to perform the masking. I tried to convert it back to an array, but I am not able to maintain the original structure.
table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('Select networkinfos, pid, eid, sid From tmp')
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf',regexp_replace(networkinfos_df.networkinfos.cast(StringType()),r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))).drop('networkinfos') \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
.drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30,False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I get it converted to original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task, provided your regexp_replace operations do not break the JSON data.
First, find the schema of the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
Once you have field_schema, you can use from_json to restore the original schema from the modified JSON strings:
dfn1 = networkinfos_df \
.withColumn('networkinfos', to_json('networkinfos')) \
.withColumn('networkinfos', regexp_replace('networkinfos',...)) \
.....\
.withColumn('networkinfos', from_json('networkinfos', field_schema))
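This round trip works because the struct shape survives serialization as long as each replacement leaves valid JSON behind. A plain-Python sketch of the same idea, with a hypothetical one-element networkinfos array and the faker output replaced by a fixed address so the example is deterministic:

```python
import json
import re

# Hypothetical one-element networkinfos array
networkinfos = [{"ipv4": "192.168.1.10", "mac_address": "aa:bb:cc:dd:ee:ff"}]

# to_json: serialize the array of structs to a JSON string
s = json.dumps(networkinfos)

# regexp_replace: mask the IPv4 address inside the string; the replacement
# must itself be plain text so the JSON stays parseable
s = re.sub(r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
           r'(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
           '10.0.0.1', s)

# from_json: parse back; the array-of-struct shape is preserved
masked = json.loads(s)
print(masked[0]["ipv4"])  # 10.0.0.1
```

If a replacement value ever contained an unescaped quote or backslash, the final from_json/json.loads step would fail or return null, so constraining the masked values to plain alphanumerics and dots is the safe choice.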