SparkSQL: Use SQL inside Transform Lambda - apache-spark

I am not able to access the lambda variable in an inline SQL subquery inside a Spark SQL transform function.
I do not want to explode and perform a join; I was hoping to perform an inline join to fetch the value for the email.
Error.
AnalysisException: cannot resolve '`y.email`' given input columns: [lkp.id, lkp.display_name, lkp.email_address, lkp.uuid]
Code.
transform (
  x.license_admin,
  y -> struct (
    get_uuid(lower(trim(y.email))) as email,
    ( select coalesce(lkp.uuid, get_uuid(lower(trim(y.email))))
        from lkp_uuid_email_map lkp
       where lkp.email_address = lower(trim(y.email))
    ) as email_02,
    generic_hash(y.first_name) as first_name
  )
) as license_admin,
Column Schema
|-- order_line_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- additional_notes: string (nullable = true)
| | |-- amount: double (nullable = true)
| | |-- discounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- discount_amount: double (nullable = true)
| | | | |-- discount_percentage: string (nullable = true)
| | | | |-- discount_reason: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- license_admin: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- email: string (nullable = true)
| | | | |-- first_name: string (nullable = true)
| | | | |-- last_name: string (nullable = true)
| | | | |-- license_admin_id: string (nullable = true)
Thanks in advance.
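A note on the error: Spark's analyzer cannot resolve a lambda variable such as y from inside a correlated subquery, which is what the AnalysisException points to. One possible workaround (a sketch only, reusing the names from the question; the surrounding query is elided with ...) is to pre-aggregate the lookup table into a map and use element_at inside the lambda instead of a subquery:
with lkp_map as (
  select map_from_entries(collect_list(struct(email_address, uuid))) as email_map
  from lkp_uuid_email_map
)
...
transform (
  x.license_admin,
  y -> struct (
    get_uuid(lower(trim(y.email))) as email,
    coalesce(element_at(lkp_map.email_map, lower(trim(y.email))),
             get_uuid(lower(trim(y.email)))) as email_02,
    generic_hash(y.first_name) as first_name
  )
) as license_admin,
...
The query would need to cross join the lkp_map CTE so that email_map is in scope, and this assumes the lookup table is small enough to collect into a single map.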

Related

How to drop values in nested struct if data type is null

I have a data structure like the one shown below. What I want is to drop certain values within field3 (a struct) if their data type is null, and also to drop field3.ZZZ completely if the ZZZ data type is null.
root
|-- field1: string (nullable = true)
|-- field2: struct (nullable = true)
|-- field3: struct (nullable = true)
| |-- XXX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- YYY: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- ZZZ: array (nullable = true)
| | |-- element: null (containsNull = true)
I have done a lot of research but haven't found a clear way to solve this. Can anyone help me with this?
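For reference, on Spark 3.1+ the Column.dropFields method can remove a named field from a struct column; a minimal sketch assuming the schema above and that ZZZ is the field to drop (detecting null-typed fields automatically would still require walking df.schema):
from pyspark.sql import functions as F

# Drop field3.ZZZ entirely; the remaining fields of field3 are untouched (Spark 3.1+ only).
df = df.withColumn("field3", F.col("field3").dropFields("ZZZ"))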

json file into pyspark dataFrame

I've downloaded a json file that I'm trying to load into a DataFrame for some analysis.
raw_constructors = spark.read.json("/constructors.json")
When I call raw_constructors.show(), I only get one column and one row.
+--------------------+
| MRData|
+--------------------+
|{{[{adams, Adams,...|
+--------------------+
So when I ask for the schema of the json file with raw_constructors.printSchema(), I get this:
root
|-- MRData: struct (nullable = true)
| |-- ConstructorTable: struct (nullable = true)
| | |-- Constructors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- constructorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- nationality: string (nullable = true)
| | | | |-- url: string (nullable = true)
| |-- limit: string (nullable = true)
| |-- offset: string (nullable = true)
| |-- series: string (nullable = true)
| |-- total: string (nullable = true)
| |-- url: string (nullable = true)
| |-- xmlns: string (nullable = true)
I'm using pyspark.
How can I get a DataFrame with the 4 columns constructorId, name, nationality, and url, and one row per item?
Thank you!
You can simply use explode to break the array down into multiple rows:
from pyspark.sql import functions as F

(df
    .select(F.explode('MRData.ConstructorTable.Constructors').alias('tmp'))
    .select('tmp.*')
    .show()
)
+-------------+----+-----------+---+
|constructorId|name|nationality|url|
+-------------+----+-----------+---+
| i1| n1| y1| u1|
| i2| n2| y2| u2|
+-------------+----+-----------+---+
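As a side note, the same flattening can be done in one step with the SQL inline generator, which turns an array of structs into one row per element and one column per struct field:
# inline() explodes the array of structs into rows, one column per field
df.selectExpr("inline(MRData.ConstructorTable.Constructors)").show()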

Working with a StructType column in PySpark UDF

I have the following schema for one of columns that I'm processing,
|-- time_to_resolution_remainingTime: struct (nullable = true)
| |-- _links: struct (nullable = true)
| | |-- self: string (nullable = true)
| |-- completedCycles: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- breached: boolean (nullable = true)
| | | |-- elapsedTime: struct (nullable = true)
| | | | |-- friendly: string (nullable = true)
| | | | |-- millis: long (nullable = true)
| | | |-- goalDuration: struct (nullable = true)
| | | | |-- friendly: string (nullable = true)
| | | | |-- millis: long (nullable = true)
| | | |-- remainingTime: struct (nullable = true)
| | | | |-- friendly: string (nullable = true)
| | | | |-- millis: long (nullable = true)
| | | |-- startTime: struct (nullable = true)
| | | | |-- epochMillis: long (nullable = true)
| | | | |-- friendly: string (nullable = true)
| | | | |-- iso8601: string (nullable = true)
| | | | |-- jira: string (nullable = true)
| | | |-- stopTime: struct (nullable = true)
| | | | |-- epochMillis: long (nullable = true)
| | | | |-- friendly: string (nullable = true)
| | | | |-- iso8601: string (nullable = true)
| | | | |-- jira: string (nullable = true)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- ongoingCycle: struct (nullable = true)
| | |-- breachTime: struct (nullable = true)
| | | |-- epochMillis: long (nullable = true)
| | | |-- friendly: string (nullable = true)
| | | |-- iso8601: string (nullable = true)
| | | |-- jira: string (nullable = true)
| | |-- breached: boolean (nullable = true)
| | |-- elapsedTime: struct (nullable = true)
| | | |-- friendly: string (nullable = true)
| | | |-- millis: long (nullable = true)
| | |-- goalDuration: struct (nullable = true)
| | | |-- friendly: string (nullable = true)
| | | |-- millis: long (nullable = true)
| | |-- paused: boolean (nullable = true)
| | |-- remainingTime: struct (nullable = true)
| | | |-- friendly: string (nullable = true)
| | | |-- millis: long (nullable = true)
| | |-- startTime: struct (nullable = true)
| | | |-- epochMillis: long (nullable = true)
| | | |-- friendly: string (nullable = true)
| | | |-- iso8601: string (nullable = true)
| | | |-- jira: string (nullable = true)
| | |-- withinCalendarHours: boolean (nullable = true)
I'm interested in getting the time fields (e.g completedCycles[x].elapsedTime, ongoingCycle.remainingTime) etc, based on certain conditions. The UDF I'm using is:
@udf("string")
def extract_time(s, field):
    # Return ongoing cycle field
    if has_column(s, 'ongoingCycle'):
        field = 'ongoingCycle.{}'.format(field)
        return s[field]
    # return last element of completed cycles
    s = s.get(size(s) - 1)
    return s[field]
cl = 'time_to_resolution_remainingTime'
df = df.withColumn(cl, extract_time(cl, lit("elapsedTime.friendly"))).select(cl)
display(df)
This results in an error:
SparkException: Job aborted due to stage failure: Task 0 in stage 549.0 failed 4 times, most recent failure: Lost task 0.3 in stage 549.0 (TID 1597, 10.155.239.76, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/sql/types.py", line 1514, in __getitem__
idx = self.__fields__.index(item)
ValueError: 'ongoingCycle.elapsedTime.friendly' is not in list
I'm obviously doing something terribly wrong here, but I'm unable to resolve this. Is it possible to convert the s value in the UDF to a Python dictionary and perform calculations on that, or is there a much better way to do this?
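To the question about converting s to a dictionary: a struct column arrives in a Python UDF as a Row, and Row.asDict(recursive=True) turns it (and any nested Rows) into plain dicts. A minimal sketch of that idea, using the field names from the schema above (error handling kept to a minimum):
from pyspark.sql.functions import udf

@udf("string")
def extract_remaining_time(s):
    if s is None:
        return None
    d = s.asDict(recursive=True)  # nested Rows become plain dicts
    ongoing = d.get("ongoingCycle")
    if ongoing is not None:
        return ongoing["remainingTime"]["friendly"]
    completed = d.get("completedCycles") or []
    if completed:
        return completed[-1]["remainingTime"]["friendly"]
    return None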
Edit:
Sample Data
{
  "_links": {
    "self": "https:///...."
  },
  "completedCycles": [],
  "id": "630",
  "name": "Time to resolution",
  "ongoingCycle": {
    "breachTime": {
      "epochMillis": 1605583651354,
      "friendly": "17/Nov/20 3:27 PM +12:00",
      "iso8601": "2020-11-17T15:27:31+1200",
      "jira": "2020-11-17T15:27:31.354+1200"
    },
    "breached": true,
    "elapsedTime": {
      "friendly": "57h 32m",
      "millis": 207148646
    },
    "goalDuration": {
      "friendly": "4h",
      "millis": 14400000
    },
    "paused": false,
    "remainingTime": {
      "friendly": "-53h 32m",
      "millis": -192748646
    },
    "startTime": {
      "epochMillis": 1605511651354,
      "friendly": "16/Nov/20 7:27 PM +12:00",
      "iso8601": "2020-11-16T19:27:31+1200",
      "jira": "2020-11-16T19:27:31.354+1200"
    },
    "withinCalendarHours": false
  }
}
Expected output: -53h 32m
With completed cycles but no ongoing cycle
{
  "_links": {
    "self": "https://...."
  },
  "completedCycles": [
    {
      "breached": true,
      "elapsedTime": {
        "friendly": "72h 43m",
        "millis": 261818073
      },
      "goalDuration": {
        "friendly": "4h",
        "millis": 14400000
      },
      "remainingTime": {
        "friendly": "-68h 43m",
        "millis": -247418073
      },
      "startTime": {
        "epochMillis": 1605156449463,
        "friendly": "12/Nov/20 4:47 PM +12:00",
        "iso8601": "2020-11-12T16:47:29+1200",
        "jira": "2020-11-12T16:47:29.463+1200"
      },
      "stopTime": {
        "epochMillis": 1606282267536,
        "friendly": "Today 5:31 PM +12:00",
        "iso8601": "2020-11-25T17:31:07+1200",
        "jira": "2020-11-25T17:31:07.536+1200"
      }
    }
  ],
  "id": "630",
  "name": "Time to resolution",
  "ongoingCycle": null
}
Expected output: -68h 43m
I got this code to work but not sure if it's the best way to go about solving this:
@udf("string")
def extract_time(s, field):
    if s is None:
        return None
    # Return ongoing cycle field
    if has_column(s, 'ongoingCycle'):
        if s['ongoingCycle'] is not None:
            return s['ongoingCycle']['remainingTime']['friendly']
    # Get the last completed cycles' remaining time
    s_completed = s['completedCycles']
    if len(s_completed) > 0:
        return s_completed[-1]['remainingTime']['friendly']
    return None
Use the when function (a CASE expression) to implement the same logic as in your UDF.
Check the code below.
df.show()
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_links |completedCycles |id |name |ongoingCycle |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[https:///....]|[] |630|Time to resolution|[[1605583651354, 17/Nov/20 3:27 PM +12:00, 2020-11-17T15:27:31+1200, 2020-11-17T15:27:31.354+1200], true, [57h 32m, 207148646], [4h, 14400000], false, [-53h 32m, -192748646], [1605511651354, 16/Nov/20 7:27 PM +12:00, 2020-11-16T19:27:31+1200, 2020-11-16T19:27:31.354+1200], false]|
|[https://....] |[[true, [72h 43m, 261818073], [4h, 14400000], [-68h 43m, -247418073], [1605156449463, 12/Nov/20 4:47 PM +12:00, 2020-11-12T16:47:29+1200, 2020-11-12T16:47:29.463+1200], [1606282267536, Today 5:31 PM +12:00, 2020-11-25T17:31:07+1200, 2020-11-25T17:31:07.536+1200]]]|630|Time to resolution|null |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df.withColumn("time_to_resolution_remainingTime",F.expr("CASE WHEN ongoingCycle IS NOT NULL THEN ongoingCycle.elapsedTime.friendly WHEN size(completedCycles) > 0 THEN completedCycles[size(completedCycles)-1].remainingTime.friendly ELSE null END"))\
.select("time_to_resolution_remainingTime")\
.show(false)
+--------------------------------+
|time_to_resolution_remainingTime|
+--------------------------------+
|57h 32m |
|-68h 43m |
+--------------------------------+
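The same logic can also be written with the Column API instead of a SQL string; a sketch of the equivalent expression (element_at with a negative index picks the last completed cycle):
from pyspark.sql import functions as F

remaining = (
    F.when(F.col("ongoingCycle").isNotNull(),
           F.col("ongoingCycle.elapsedTime.friendly"))
     .when(F.size("completedCycles") > 0,
           F.element_at("completedCycles", -1)["remainingTime"]["friendly"])
     .otherwise(F.lit(None))
)

df.withColumn("time_to_resolution_remainingTime", remaining) \
  .select("time_to_resolution_remainingTime") \
  .show(truncate=False)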

Nested json and dataframe - retrieving data from array

I am trying to retrieve data from a json file.
df=spark.read.json('/home/data/activities.json',multiLine=True)
The content looks like this (I include only 1 "data" row; there are 94):
{"meta.count":"94",
"data":[
{"id":"f67de4f6-d23e-49a7-b9dd-63d68df533a3",
"name.fi":"Purjehdus 1700-luvun tyyliin tykkisluuppi Dianalla","name.en":"Cannon sloop Diana\u00B4s sailings","name.sv":"Seglingar med kanonslupen Diana",
"name.zh":null,
"source_type.id":3,
"source_type.name":"MyHelsinki",
"info_url":"https:\/\/www.suomenlinnatours.com\/en\/cannon-sloop-diana?johku_product=7",
"modified_at":"2019-12-28T15:10:33.145Z",
"location.lat":60.145111083984375,
"location.lon":24.987560272216797,
"location.address.street_address":"Suomenlinna Tykist\u00F6lahti Pier",
"location.address.postal_code":null,
"location.address.locality":"Helsinki",
"description.intro":null,
"description.body":"<p>Luvassa on amiraali Chapamin tahdittama mielenkiintoinen laivamatka valistusajan Viaporiin... /p>\r\n",
"description.images":[
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana.jpg",
"copyright_holder":"",
"license_type.id":1,
"license_type.name":"All rights reserved."},
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana_2.jpg",
"copyright_holder":"",
"license_type.id":1,
"license_type.name":"All rights reserved."},
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana_3.jpg",
"copyright_holder":"",
"license_type.id":1,"license_type.name":"All rights reserved."}],
"tags":[
{"id":"myhelsinki:45","name":"sea"},
{"id":"myhelsinki:836","name":"suomenlinna"},
{"id":"myhelsinki:793","name":"history"}],
"where_when_duration.where_and_when":"Suomenlinna, kes\u00E4kuusta elokuuhun",
"where_when_duration.duration":"N. 1h 45min"}],
"tags":[
{"id":"myhelsinki:453","name":"nature"},
{"id":"myhelsinki:747","name":"canoeing"},
{"id":"myhelsinki:342","name":"guidance"},
{"id":"myhelsinki:399","name":"outdoor recreation"}],
"where_when_duration.where_and_when":"Toukokuussa joka sunnuntai, kes\u00E4-elokuussa joka keskiviikko ja sunnuntai, syyskuussa joka sunnuntai",
"where_when_duration.duration":"4,5 tuntia sis\u00E4lt\u00E4en kuljetuksen"}],
"tags.myhelsinki:10":"sauna",
"tags.myhelsinki:1016":"heavy rock",
"tags.myhelsinki:1749":"national parks",
"tags.myhelsinki:1822":"schools (educational institutions)",
"tags.myhelsinki:2":"food"
}
The schema looks as follows:
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description.body: string (nullable = true)
| | |-- description.images: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- copyright_holder: string (nullable = true)
| | | | |-- license_type.id: long (nullable = true)
| | | | |-- license_type.name: string (nullable = true)
| | | | |-- url: string (nullable = true)
| | |-- description.intro: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- info_url: string (nullable = true)
| | |-- location.address.locality: string (nullable = true)
| | |-- location.address.postal_code: string (nullable = true)
| | |-- location.address.street_address: string (nullable = true)
| | |-- location.lat: double (nullable = true)
| | |-- location.lon: double (nullable = true)
| | |-- modified_at: string (nullable = true)
| | |-- name.en: string (nullable = true)
| | |-- name.fi: string (nullable = true)
| | |-- name.sv: string (nullable = true)
| | |-- name.zh: string (nullable = true)
| | |-- source_type.id: long (nullable = true)
| | |-- source_type.name: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | |-- where_when_duration.duration: string (nullable = true)
| | |-- where_when_duration.where_and_when: string (nullable = true)
|-- meta.count: string (nullable = true)
|-- tags.myhelsinki:10: string (nullable = true)
|-- tags.myhelsinki:1016: string (nullable = true)
|-- tags.myhelsinki:1749: string (nullable = true)
|-- tags.myhelsinki:1822: string (nullable = true)
|-- tags.myhelsinki:2: string (nullable = true)
I am interested in "data" array, including nested array "tags". I would like to skip "meta.count" and "tags.myhelsinki:..."
I tried this:
df.withColumn("expl_data", explode_outer(col("tags"))).select("expl_data.data.name.en").show(10)
and I get this error message:
AnalysisException: "cannot resolve '`tags`' given input columns: [tags.myhelsinki:10, tags.myhelsinki:453, tags.myhelsinki:226, tags.myhelsinki:1016, tags.myhelsinki:342, tags.myhelsinki:531, tags.myhelsinki:364, tags.myhelsinki:836, tags.myhelsinki:346,...
I get the same error when I try to explode the "tags.name" or "description.images" arrays.
Could anyone please help? My goal is to retrieve a set of selected fields from this structure (tags are very important).
Many thanks in advance!
Alicia
You have to explode the data array first so that you can access the nested fields:
df = df.select(explode_outer(col("data")).alias("data")).select(col("data.*"))
Now, you can explode the inner arrays tags and description.images like this:
df = df.withColumn("tags", explode_outer(col("tags")))\
.withColumn("description.images", explode_outer(col("`description.images`")))
And finally select the columns you want:
df.select("id", "tags.name", "`name.en`").show()
+--------------------+-----------+--------------------+
| id| name| name.en|
+--------------------+-----------+--------------------+
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
+--------------------+-----------+--------------------+
Notice that some columns contain a dot in their name; use backticks to select them.
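Putting the steps together (file path and column names taken from the question; adjust the selected fields as needed):
from pyspark.sql.functions import col, explode_outer

df = spark.read.json('/home/data/activities.json', multiLine=True)

flattened = (df
    .select(explode_outer(col("data")).alias("data"))   # one row per data element
    .select("data.*")
    .withColumn("tags", explode_outer(col("tags")))     # one row per tag
    .select("id", col("tags.name").alias("tag"), "`name.en`"))

flattened.show(10, truncate=False)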

How to convert string column to ArrayType in pyspark

I have a requirement where I need to mask data stored in Cassandra tables using pyspark. I have a frozen data set in Cassandra which I get as an array in pyspark. I converted it to a string to mask it. Now, I want to convert it back to the array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a data frame and converted it to a string to perform the masking. I tried to convert it back to an array; however, I am not able to maintain the original structure.
table_df.createOrReplaceTempView("tmp")
networkinfos_df= sqlContext.sql('Select networkinfos , pid, eid, s sid From tmp ')
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf',regexp_replace(networkinfos_df.networkinfos.cast(StringType()),r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))).drop('networkinfos') \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
.drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30,False)
The original structure it as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I get it converted to original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
After you have field_schema, you can use from_json to convert the modified JSON strings back to their original schema:
dfn1 = networkinfos_df \
.withColumn('networkinfos', to_json('networkinfos')) \
.withColumn('networkinfos', regexp_replace('networkinfos',...)) \
.....\
.withColumn('networkinfos', from_json('networkinfos', field_schema))
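Note that the same schema object can also be obtained directly from the DataFrame schema, without the JSON round trip:
# Equivalent: look up the field by name and take its data type (an ArrayType)
field_schema = df.schema["networkinfos"].dataType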
