json file into pyspark dataFrame - apache-spark

I've downloaded a JSON file that I'm trying to load into a DataFrame so I can do some analysis.
raw_constructors = spark.read.json("/constructors.json")
When I call raw_constructors.show(), I only get one column and one row.
+--------------------+
| MRData|
+--------------------+
|{{[{adams, Adams,...|
+--------------------+
When I ask for the schema of the JSON file with raw_constructors.printSchema(), I get this:
root
|-- MRData: struct (nullable = true)
| |-- ConstructorTable: struct (nullable = true)
| | |-- Constructors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- constructorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- nationality: string (nullable = true)
| | | | |-- url: string (nullable = true)
| |-- limit: string (nullable = true)
| |-- offset: string (nullable = true)
| |-- series: string (nullable = true)
| |-- total: string (nullable = true)
| |-- url: string (nullable = true)
| |-- xmlns: string (nullable = true)
I'm using pyspark.
How can I get a DataFrame with the four columns constructorId, name, nationality, and url, with one row per item?
Thank you!

You can simply use explode to break the array down into multiple rows:
from pyspark.sql import functions as F
(df
.select(F.explode('MRData.ConstructorTable.Constructors').alias('tmp'))
.select('tmp.*')
.show()
)
+-------------+----+-----------+---+
|constructorId|name|nationality|url|
+-------------+----+-----------+---+
| i1| n1| y1| u1|
| i2| n2| y2| u2|
+-------------+----+-----------+---+
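Alternatively, Spark SQL's inline function expands an array of structs into one row per element and one column per struct field, so the same result can be obtained in a single expression (a sketch against the schema above):
# inline() explodes an array of structs into rows and columns in one step
raw_constructors.selectExpr("inline(MRData.ConstructorTable.Constructors)").show()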

Related

Change value in nested struct, array, struct in a Spark dataframe using PySpark

So I have this Spark DataFrame with the following schema:
root
|-- id: string (nullable = true)
|-- elements: struct (nullable = true)
| |-- created: string (nullable = true)
| |-- id: string (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- field: string (nullable = true)
| | | |-- fieldId: string (nullable = true)
| | | |-- fieldtype: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- fromString: string (nullable = true)
| | | |-- tmpFromAccountId: string (nullable = true)
| | | |-- tmpToAccountId: string (nullable = true)
| | | |-- to: string (nullable = true)
| | | |-- toString: string (nullable = true)
In this case, I want to change the values inside the "items" elements (field, fieldId, etc.) to a defined value ("Issue"), without caring whether they are empty or already filled. So it should go from:
+--------+--------------------------------------------------------------------------------+
| id | elements |
+--------+--------------------------------------------------------------------------------+
|ABCD-123|[2023-01-16T20:25:30.875+0700, 5388402, [[field, , status,,,,, 23456, Yes]]] |
+--------+--------------------------------------------------------------------------------+
To:
+--------+----------------------------------------------------------------------------------------------------------+
| id | elements |
+--------+----------------------------------------------------------------------------------------------------------+
|ABCD-123|[2023-01-16T20:25:30.875+0700, 5388402, [[Issue, Issue, Issue, Issue, Issue, Issue, Issue, Issue, Issue]]]|
+--------+----------------------------------------------------------------------------------------------------------+
I already tried using this script in a Python file, but it didn't work:
replace_list = ['field', 'fieldtype', 'fieldId', 'from', 'fromString', 'to', 'toString', 'tmpFromAccountId', 'tmpToAccountId']
# Didn't work 1
for col_name in replace_list: df = df.withColumn(f"items.element.{col_name}", lit("Issue"))
# Didn't work 2
for col_name in replace_list: df = df.withColumn("elements.items.element", struct(col(f"elements.items.element.*"), lit("Issue").alias(f"{col_name}")))
In this case, I'm using Spark version 2.4.8. I don't want to use the explode method since I want to avoid joining DataFrames. Is it possible to perform this kind of operation directly in Spark? Thank you.
I simplified your example a bit; let's assume your dataset is called df and contains this data:
+---------------------------------------------+
|elements |
+---------------------------------------------+
|{1, [{field, fieldId, from, fromString}]} |
|{2, [{field2, fieldId2, from2, fromString2}]}|
+---------------------------------------------+
and this schema:
root
|-- elements: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- field: string (nullable = true)
| | | |-- fieldId: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- fromString: string (nullable = true)
One approach to get your desired result is to extract the keys of all elements inside the array, do the same for the values, and then combine them into maps, as below:
from pyspark.sql.functions import expr

result = (df
    .withColumn("keys", expr("transform(elements.items, x -> json_object_keys(to_json(x)))"))
    .withColumn("values", expr("transform(keys, x -> transform(x, y -> 'Issue'))"))
    .withColumn("final", expr("transform(keys, (x, i) -> map_from_arrays(keys[i], values[i]))")))
The final output looks like:
+---------------------------------------------+------------------------------------------------------------------------+
|elements |final |
+---------------------------------------------+------------------------------------------------------------------------+
|{1, [{field, fieldId, from, fromString}]} |[{field -> Issue, fieldId -> Issue, from -> Issue, fromString -> Issue}]|
|{2, [{field2, fieldId2, from2, fromString2}]}|[{field -> Issue, fieldId -> Issue, from -> Issue, fromString -> Issue}]|
+---------------------------------------------+------------------------------------------------------------------------+
Hopefully this is what you want; it's a bit hard to help without a reproducible example. Good luck!
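If you need the result to keep the original array-of-structs type rather than maps (for example, to write it back with the same schema), one option is to rebuild each element explicitly. A minimal sketch against the simplified schema above; it relies only on named_struct and the higher-order transform function, so it should also work on Spark 2.4:

from pyspark.sql.functions import expr

df_fixed = df.withColumn(
    "elements",
    expr("""
        named_struct(
            'id', elements.id,
            'items', transform(elements.items, x -> named_struct(
                'field', 'Issue',
                'fieldId', 'Issue',
                'from', 'Issue',
                'fromString', 'Issue'))
        )
    """))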

SparkSQL: Use SQL inside Transform Lambda

I am not able to access the lambda variable in an inline SQL subquery inside a Spark SQL transform function.
I do not want to explode and perform a join; I was hoping to perform an inline join to fetch the value for the email.
Error.
AnalysisException: cannot resolve '`y.email`' given input columns: [lkp.id, lkp.display_name, lkp.email_address, lkp.uuid]
Code.
transform (
x.license_admin,
y -> struct (
get_uuid(lower(trim(y.email))) as email,
( select coalesce(lkp.uuid, get_uuid(lower(trim(y.email))))
from lkp_uuid_email_map lkp
where lkp.email_address = lower(trim(y.email))
) as email_02,
generic_hash(y.first_name) as first_name
)
) as license_admin,
Column Schema
|-- order_line_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- additional_notes: string (nullable = true)
| | |-- amount: double (nullable = true)
| | |-- discounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- discount_amount: double (nullable = true)
| | | | |-- discount_percentage: string (nullable = true)
| | | | |-- discount_reason: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- license_admin: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- email: string (nullable = true)
| | | | |-- first_name: string (nullable = true)
| | | | |-- last_name: string (nullable = true)
| | | | |-- license_admin_id: string (nullable = true)
Thanks in advance.
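Spark generally cannot resolve the lambda variable of a higher-order function such as transform inside a correlated subquery, which is why y.email cannot be resolved there. A commonly used workaround is to fold the lookup table into a single map column once and probe it with element_at inside the lambda. The following is a sketch only, reusing the asker's get_uuid and generic_hash UDFs; it assumes lkp_uuid_email_map is small enough to aggregate into one map, and orders is a hypothetical stand-in for whatever x ranges over in the full query:

result = spark.sql("""
    WITH lkp_map AS (
        -- collapse the lookup table into a single map: email -> uuid
        SELECT map_from_entries(
                   collect_list(struct(lower(trim(email_address)), uuid))
               ) AS email_to_uuid
        FROM lkp_uuid_email_map
    )
    SELECT
        transform(
            x.license_admin,
            y -> struct(
                get_uuid(lower(trim(y.email))) AS email,
                coalesce(element_at(m.email_to_uuid, lower(trim(y.email))),
                         get_uuid(lower(trim(y.email)))) AS email_02,
                generic_hash(y.first_name) AS first_name
            )
        ) AS license_admin
    FROM orders x
    CROSS JOIN lkp_map m
""")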

Nested json and dataframe - retrieving data from array

I am trying to retrieve data from a JSON file.
df=spark.read.json('/home/data/activities.json',multiLine=True)
The content looks like this (I include only one "data" element; there are 94):
{"meta.count":"94",
"data":[
{"id":"f67de4f6-d23e-49a7-b9dd-63d68df533a3",
"name.fi":"Purjehdus 1700-luvun tyyliin tykkisluuppi Dianalla","name.en":"Cannon sloop Diana\u00B4s sailings","name.sv":"Seglingar med kanonslupen Diana",
"name.zh":null,
"source_type.id":3,
"source_type.name":"MyHelsinki",
"info_url":"https:\/\/www.suomenlinnatours.com\/en\/cannon-sloop-diana?johku_product=7",
"modified_at":"2019-12-28T15:10:33.145Z",
"location.lat":60.145111083984375,
"location.lon":24.987560272216797,
"location.address.street_address":"Suomenlinna Tykist\u00F6lahti Pier",
"location.address.postal_code":null,
"location.address.locality":"Helsinki",
"description.intro":null,
"description.body":"<p>Luvassa on amiraali Chapamin tahdittama mielenkiintoinen laivamatka valistusajan Viaporiin... /p>\r\n",
"description.images":[
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana.jpg",
"copyright_holder":"",
"license_type.id":1,
"license_type.name":"All rights reserved."},
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana_2.jpg",
"copyright_holder":"",
"license_type.id":1,
"license_type.name":"All rights reserved."},
{"url":"https:\/\/edit.myhelsinki.fi\/sites\/default\/files\/2017-06\/Tykkisluuppi_Diana_3.jpg",
"copyright_holder":"",
"license_type.id":1,"license_type.name":"All rights reserved."}],
"tags":[
{"id":"myhelsinki:45","name":"sea"},
{"id":"myhelsinki:836","name":"suomenlinna"},
{"id":"myhelsinki:793","name":"history"}],
"where_when_duration.where_and_when":"Suomenlinna, kes\u00E4kuusta elokuuhun",
"where_when_duration.duration":"N. 1h 45min"}],
"tags":[
{"id":"myhelsinki:453","name":"nature"},
{"id":"myhelsinki:747","name":"canoeing"},
{"id":"myhelsinki:342","name":"guidance"},
{"id":"myhelsinki:399","name":"outdoor recreation"}],
"where_when_duration.where_and_when":"Toukokuussa joka sunnuntai, kes\u00E4-elokuussa joka keskiviikko ja sunnuntai, syyskuussa joka sunnuntai",
"where_when_duration.duration":"4,5 tuntia sis\u00E4lt\u00E4en kuljetuksen"}],
"tags.myhelsinki:10":"sauna",
"tags.myhelsinki:1016":"heavy rock",
"tags.myhelsinki:1749":"national parks",
"tags.myhelsinki:1822":"schools (educational institutions)",
"tags.myhelsinki:2":"food"
}
The schema looks as follows:
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- description.body: string (nullable = true)
| | |-- description.images: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- copyright_holder: string (nullable = true)
| | | | |-- license_type.id: long (nullable = true)
| | | | |-- license_type.name: string (nullable = true)
| | | | |-- url: string (nullable = true)
| | |-- description.intro: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- info_url: string (nullable = true)
| | |-- location.address.locality: string (nullable = true)
| | |-- location.address.postal_code: string (nullable = true)
| | |-- location.address.street_address: string (nullable = true)
| | |-- location.lat: double (nullable = true)
| | |-- location.lon: double (nullable = true)
| | |-- modified_at: string (nullable = true)
| | |-- name.en: string (nullable = true)
| | |-- name.fi: string (nullable = true)
| | |-- name.sv: string (nullable = true)
| | |-- name.zh: string (nullable = true)
| | |-- source_type.id: long (nullable = true)
| | |-- source_type.name: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | |-- where_when_duration.duration: string (nullable = true)
| | |-- where_when_duration.where_and_when: string (nullable = true)
|-- meta.count: string (nullable = true)
|-- tags.myhelsinki:10: string (nullable = true)
|-- tags.myhelsinki:1016: string (nullable = true)
|-- tags.myhelsinki:1749: string (nullable = true)
|-- tags.myhelsinki:1822: string (nullable = true)
|-- tags.myhelsinki:2: string (nullable = true)
I am interested in the "data" array, including the nested array "tags". I would like to skip "meta.count" and "tags.myhelsinki:..."
I tried this:
df.withColumn("expl_data", explode_outer(col("tags"))).select("expl_data.data.name.en").show(10)
and I get error message:
AnalysisException: "cannot resolve '`tags`' given input columns: [tags.myhelsinki:10, tags.myhelsinki:453, tags.myhelsinki:226, tags.myhelsinki:1016, tags.myhelsinki:342, tags.myhelsinki:531, tags.myhelsinki:364, tags.myhelsinki:836, tags.myhelsinki:346,...
I have the same error when I try to explode "tags.name" or "description.images" arrays.
Could anyone please help? My goal is to retrieve a set of selected fields from this structure (tags are very important).
Many thanks in advance!
Alicia
You have to explode the data array first so that you can access the nested fields:
from pyspark.sql.functions import col, explode_outer

df = df.select(explode_outer(col("data")).alias("data")).select(col("data.*"))
Now you can explode the inner arrays tags and description.images like this:
df = df.withColumn("tags", explode_outer(col("tags")))\
.withColumn("description.images", explode_outer(col("`description.images`")))
And finally select the columns you want:
df.select("id", "tags.name", "`name.en`").show()
+--------------------+-----------+--------------------+
| id| name| name.en|
+--------------------+-----------+--------------------+
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| sea|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...|suomenlinna|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
|f67de4f6-d23e-49a...| history|Cannon sloop Dian...|
+--------------------+-----------+--------------------+
Notice that some columns contain a dot in their name; wrap such names in backticks to select them.
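One more thing to note: the sample row appears nine times in the output above because both tags and description.images were exploded, producing a 3 x 3 cross product. If you only need the tags, explode just that array; a small sketch applied to the DataFrame right after the data explode (before the second pair of explodes):

from pyspark.sql.functions import col, explode_outer

(df
    .select(col("id"), explode_outer(col("tags")).alias("tag"), col("`name.en`"))
    .select("id", "tag.name", "`name.en`")
    .show())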

How to convert string column to ArrayType in pyspark

I have a requirement where I need to mask data stored in Cassandra tables using PySpark. I have a frozen data set in Cassandra which I get as an array in PySpark. I converted it to a string in order to mask it. Now I want to convert it back to the array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a DataFrame and converted it to a string to perform the masking. I tried to convert it back to an array; however, I am not able to maintain the original structure.
table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('Select networkinfos, pid, eid, sid From tmp')
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf',regexp_replace(networkinfos_df.networkinfos.cast(StringType()),r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))).drop('networkinfos') \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
.drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30,False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I convert it back to the original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
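Alternatively, the same ArrayType can be read straight off the DataFrame schema without the JSON detour (a small sketch, equivalent to the two lines above):

# StructType supports lookup by field name; .dataType is the column's ArrayType
field_schema = df.schema['networkinfos'].dataType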
Once you have field_schema, you can use from_json to convert the modified JSON strings back to the original schema:
dfn1 = networkinfos_df \
.withColumn('networkinfos', to_json('networkinfos')) \
.withColumn('networkinfos', regexp_replace('networkinfos',...)) \
.....\
.withColumn('networkinfos', from_json('networkinfos', field_schema))
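To illustrate the round trip end to end, here is a self-contained toy example; the column names (nets standing in for networkinfos) and the data are made up for illustration, not taken from the asker's tables:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, regexp_replace, from_json

spark = SparkSession.builder.getOrCreate()

# a toy array-of-structs column standing in for `networkinfos`
toy = spark.createDataFrame(
    [(1, [("10.0.0.1", "aa:bb:cc:dd:ee:ff")])],
    "id int, nets array<struct<ipv4:string,mac:string>>")

nets_schema = toy.schema['nets'].dataType   # the original ArrayType

masked = (toy
    .withColumn('nets', to_json('nets'))                                        # structs -> JSON string
    .withColumn('nets', regexp_replace('nets', r'10\.0\.0\.1', '192.168.0.1'))  # mask inside the string
    .withColumn('nets', from_json('nets', nets_schema)))                        # back to array<struct<...>>

masked.printSchema()   # nets is array<struct<ipv4,mac>> again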

Parsing hierarchical json to dataFrame in spark

I have a JSON file stored in HDFS. I am trying to read the JSON file into my Spark context. The JSON file format is as follows:
{"Request": {"TrancheList": {"Tranche": [{"Id": "123","OwnedAmt": "26500000", "Currency": "USD" }, { "Id": "456", "OwnedAmt": "41000000","Currency": "USD"}]},"FxRatesList": {"FxRatesContract": [{"Currency": "CHF","FxRate": "0.97919983706115"},{"Currency": "AUD", "FxRate": "1.2966804979253"},{ "Currency": "USD","FxRate": "1"},{"Currency": "SEK","FxRate": "8.1561012531034"},{"Currency": "NOK", "FxRate": "8.2454981641398"}]},"isExcludeDeals": "true","baseCurrency": "USD"}}
val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json")
inputdf.printSchema
The printSchema shows me the following output:
root
|-- Request: struct (nullable = true)
| |-- FxRatesList: struct (nullable = true)
| | |-- FxRatesContract: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Currency: string (nullable = true)
| | | | |-- FxRate: string (nullable = true)
| |-- TrancheList: struct (nullable = true)
| | |-- Tranche: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Currency: string (nullable = true)
| | | | |-- OwnedAmt: string (nullable = true)
| | | | |-- Id: string (nullable = true)
| |-- baseCurrency: string (nullable = true)
| |-- isExcludeDeals: string (nullable = true)
What would be the best way of creating a DataFrame/RDD from the TrancheList section of the JSON so that it gives me a distinct list of Ids with their OwnedAmt and Currency, looking like the following table?
Id OwnedAmt Currency
123 26500000 USD
456 41000000 USD
Any help would be great.
Thanks
Here is another way of getting this data:
val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json").select("Request.TrancheList.Tranche");
val dataDF = inputdf.select(explode(inputdf("Tranche"))).toDF("Tranche").select("Tranche.Id", "Tranche.OwnedAmt","Tranche.Currency")
dataDF.show
You should be able to access the columns within the hierarchy of your DataFrame using the dot notation.
In this example, the query would be something like:
// Spark 2.0 example; use registerTempTable for Spark 1.6
inputdf.createOrReplaceTempView("inputdf")
spark.sql("select Request.TrancheList.Tranche.Id, Request.TrancheList.Tranche.OwnedAmt, Request.TrancheList.Tranche.Currency from inputdf")
