I have a JSON file stored in HDFS. I am trying to read the JSON file into my Spark context. The JSON file format is as follows:
{"Request": {"TrancheList": {"Tranche": [{"Id": "123","OwnedAmt": "26500000", "Currency": "USD" }, { "Id": "456", "OwnedAmt": "41000000","Currency": "USD"}]},"FxRatesList": {"FxRatesContract": [{"Currency": "CHF","FxRate": "0.97919983706115"},{"Currency": "AUD", "FxRate": "1.2966804979253"},{ "Currency": "USD","FxRate": "1"},{"Currency": "SEK","FxRate": "8.1561012531034"},{"Currency": "NOK", "FxRate": "8.2454981641398"}]},"isExcludeDeals": "true","baseCurrency": "USD"}}
val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json")
inputdf.printSchema
The printSchema shows me the following output:
root
|-- Request: struct (nullable = true)
| |-- FxRatesList: struct (nullable = true)
| | |-- FxRatesContract: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Currency: string (nullable = true)
| | | | |-- FxRate: string (nullable = true)
| |-- TrancheList: struct (nullable = true)
| | |-- Tranche: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Currency: string (nullable = true)
| | | | |-- OwnedAmt: string (nullable = true)
| | | | |-- Id: string (nullable = true)
| |-- baseCurrency: string (nullable = true)
| |-- isExcludeDeals: string (nullable = true)
What would be the best way of creating a DataFrame/RDD from the TrancheList section of the JSON so that it gives me a distinct list of Ids with their OwnedAmt and Currency, like the following table:
Id OwnedAmt Currency
123 26500000 USD
456 41000000 USD
Any help would be great.
Thanks
Here is another way of getting this data.
val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json").select("Request.TrancheList.Tranche");
val dataDF = inputdf.select(explode(inputdf("Tranche"))).toDF("Tranche").select("Tranche.Id", "Tranche.OwnedAmt","Tranche.Currency")
dataDF.show
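With the sample request above, dataDF.show should print something like:
+---+--------+--------+
| Id|OwnedAmt|Currency|
+---+--------+--------+
|123|26500000|     USD|
|456|41000000|     USD|
+---+--------+--------+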
You should be able to access columns within the hierarchy of your DataFrame using dot notation.
In this example, the query would be something like:
// Spark 2.0 example; use registerTempTable for Spark 1.6
inputdf.createOrReplaceTempView("inputdf")
spark.sql("select Request.TrancheList.Tranche.Id, Request.TrancheList.Tranche.OwnedAmt, Request.TrancheList.Tranche.Currency from inputdf")
So I have this Spark DataFrame with the following schema:
root
|-- id: string (nullable = true)
|-- elements: struct (nullable = true)
| |-- created: string (nullable = true)
| |-- id: string (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- field: string (nullable = true)
| | | |-- fieldId: string (nullable = true)
| | | |-- fieldtype: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- fromString: string (nullable = true)
| | | |-- tmpFromAccountId: string (nullable = true)
| | | |-- tmpToAccountId: string (nullable = true)
| | | |-- to: string (nullable = true)
| | | |-- toString: string (nullable = true)
In this case, I want to change the values inside the "items" elements (field, fieldId, etc.) to a defined value ("Issue"), without caring whether they are empty or already filled. So it should go from:
+--------+--------------------------------------------------------------------------------+
| id | elements |
+--------+--------------------------------------------------------------------------------+
|ABCD-123|[2023-01-16T20:25:30.875+0700, 5388402, [[field, , status,,,,, 23456, Yes]]] |
+--------+--------------------------------------------------------------------------------+
To:
+--------+----------------------------------------------------------------------------------------------------------+
| id | elements |
+--------+----------------------------------------------------------------------------------------------------------+
|ABCD-123|[2023-01-16T20:25:30.875+0700, 5388402, [[Issue, Issue, Issue, Issue, Issue, Issue, Issue, Issue, Issue]]]|
+--------+----------------------------------------------------------------------------------------------------------+
I already tried using this script in a Python file, but it didn't work:
replace_list = ['field', 'fieldtype', 'fieldId', 'from', 'fromString', 'to', 'toString', 'tmpFromAccountId', 'tmpToAccountId']
# Didn't work 1
for col_name in replace_list: df = df.withColumn(f"items.element.{col_name}", lit("Issue"))
# Didn't work 2
for col_name in replace_list: df = df.withColumn("elements.items.element", struct(col(f"elements.items.element.*"), lit("Issue").alias(f"{col_name}")))
In this case, I'm using Spark version 2.4.8. I don't want to use the explode method since I want to avoid joining DataFrames. Is it possible to perform this kind of operation directly in Spark? Thank you.
I simplified your example a bit; let's assume your dataset is called df and contains this data:
+---------------------------------------------+
|elements |
+---------------------------------------------+
|{1, [{field, fieldId, from, fromString}]} |
|{2, [{field2, fieldId2, from2, fromString2}]}|
+---------------------------------------------+
and this schema:
root
|-- elements: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- field: string (nullable = true)
| | | |-- fieldId: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- fromString: string (nullable = true)
An approach to get to your desired result is to extract the keys of all elements inside the array, do the same for the values, and then zip them back together into maps, as below:
df
.withColumn("keys", expr("transform(elements.items, x -> json_object_keys(to_json(x)))"))
.withColumn("values", expr("transform(keys, x -> transform(x, y -> 'Issue'))"))
.withColumn("final", expr("transform(keys, (x, i) -> map_from_arrays(keys[i], values[i]))"))
The final output looks like:
+---------------------------------------------+------------------------------------------------------------------------+
|elements |final |
+---------------------------------------------+------------------------------------------------------------------------+
|{1, [{field, fieldId, from, fromString}]} |[{field -> Issue, fieldId -> Issue, from -> Issue, fromString -> Issue}]|
|{2, [{field2, fieldId2, from2, fromString2}]}|[{field -> Issue, fieldId -> Issue, from -> Issue, fromString -> Issue}]|
+---------------------------------------------+------------------------------------------------------------------------+
Hopefully this is what you want, since it's a bit hard to help without a reproducible example. Good luck!
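If you are stuck on Spark 2.4, where some of the JSON helper functions used above may not be available, a rougher sketch for the simplified schema is to rebuild the struct explicitly with transform and named_struct (the field names have to be spelled out by hand):
from pyspark.sql.functions import expr

# sketch for the simplified df above: rebuild "elements" with every item field
# overwritten by the literal 'Issue' (transform/named_struct are available in Spark 2.4)
df2 = df.withColumn(
    "elements",
    expr("""
        named_struct(
            'id', elements.id,
            'items', transform(elements.items, x -> named_struct(
                'field', 'Issue',
                'fieldId', 'Issue',
                'from', 'Issue',
                'fromString', 'Issue')))
    """))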
I've downloaded a JSON file that I'm trying to get into a DataFrame for some analysis.
raw_constructors = spark.read.json("/constructors.json")
When I call raw_constructors.show(), I only get one column and one row.
+--------------------+
| MRData|
+--------------------+
|{{[{adams, Adams,...|
+--------------------+
So when I ask for the schema of the JSON file with raw_constructors.printSchema(), I get this:
root
|-- MRData: struct (nullable = true)
| |-- ConstructorTable: struct (nullable = true)
| | |-- Constructors: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- constructorId: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- nationality: string (nullable = true)
| | | | |-- url: string (nullable = true)
| |-- limit: string (nullable = true)
| |-- offset: string (nullable = true)
| |-- series: string (nullable = true)
| |-- total: string (nullable = true)
| |-- url: string (nullable = true)
| |-- xmlns: string (nullable = true)
I'm using pyspark.
How can I get a DataFrame with the 4 columns (constructorId, name, nationality, url) and one row per item?
Thank you!
You can simply use explode to break the array down into multiple rows:
from pyspark.sql import functions as F
(df
.select(F.explode('MRData.ConstructorTable.Constructors').alias('tmp'))
.select('tmp.*')
.show()
)
+-------------+----+-----------+---+
|constructorId|name|nationality|url|
+-------------+----+-----------+---+
| i1| n1| y1| u1|
| i2| n2| y2| u2|
+-------------+----+-----------+---+
I am not able to access the lambda variable in an inline SQL subquery inside a Spark SQL transform function.
I do not want to explode and perform a join. I was hoping to perform an inline join to fetch the value for the email.
Error.
AnalysisException: cannot resolve '`y.email`' given input columns: [lkp.id, lkp.display_name, lkp.email_address, lkp.uuid]
Code.
transform (
x.license_admin,
y -> struct (
get_uuid(lower(trim(y.email))) as email,
( select coalesce(lkp.uuid, get_uuid(lower(trim(y.email))))
from lkp_uuid_email_map lkp
where lkp.email_address = lower(trim(y.email))
) as email_02,
generic_hash(y.first_name) as first_name
)
) as license_admin,
Column Schema
|-- order_line_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- additional_notes: string (nullable = true)
| | |-- amount: double (nullable = true)
| | |-- discounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- discount_amount: double (nullable = true)
| | | | |-- discount_percentage: string (nullable = true)
| | | | |-- discount_reason: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- license_admin: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- email: string (nullable = true)
| | | | |-- first_name: string (nullable = true)
| | | | |-- last_name: string (nullable = true)
| | | | |-- license_admin_id: string (nullable = true)
Thanks in advance.
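One workaround to sketch here (assuming lkp_uuid_email_map is small enough to be collapsed into a single map column, and that get_uuid and generic_hash are your own registered functions): a correlated subquery cannot see the lambda variable y, but a map column cross joined onto the driving table can be indexed inside the lambda.
-- sketch: build the lookup map once and cross join it onto the driving table, e.g.
--   cross join (select map_from_entries(collect_list(struct(email_address, uuid))) as email_to_uuid
--               from lkp_uuid_email_map)
transform (
  x.license_admin,
  y -> struct (
    get_uuid(lower(trim(y.email))) as email,
    coalesce(email_to_uuid[lower(trim(y.email))],
             get_uuid(lower(trim(y.email)))) as email_02,
    generic_hash(y.first_name) as first_name
  )
) as license_admin,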
I have a schema like this
root
|-- CaseNumber: string (nullable = true)
|-- Interactions: struct (nullable = true)
| |-- EmailInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- PhoneInteractions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Contact: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | |-- CreatedOn: string (nullable = true)
| | | |-- Direction: string (nullable = true)
| |-- WebInteractions: array (nullable = true)
| | |-- element: string (containsNull = true)
and I would like to explode the three arrays (EmailInteractions, PhoneInteractions, WebInteractions), group them with CaseNumber, create three tables, and execute this SQL query:
Select casenumber,CreatedOn
from EmailInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from PhoneInteraction
where Direction = 'Outgoing'
union all
Select casenumber,CreatedOn
from WebInteraction
where Direction = 'Outgoing'
The code to retrieve the schema:
val dl = spark.read.format("com.databricks.spark.avro").load("adl://power.azuredatalakestore.net/Se/eventhubspace/eventhub/0_2020_01_20_*_*_*.avro")
val dl1=dl.select($"body".cast("string")).map(_.toString())
val dl2=spark.read.json(dl1)
val dl3=dl2.select($"Content.*",$"RawProperties.UserProperties.*")
I am new to Databricks; any help would be appreciated. Thanks in advance.
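For the PhoneInteractions part, here is a sketch of one option (assuming dl3 is the DataFrame with the schema shown above; note that EmailInteractions and WebInteractions are arrays of plain strings in this schema, so they have no CreatedOn or Direction fields to select): register the DataFrame as a temp view and explode with LATERAL VIEW.
dl3.createOrReplaceTempView("cases")
spark.sql("select CaseNumber, pi.CreatedOn from cases lateral view explode(Interactions.PhoneInteractions) p as pi where pi.Direction = 'Outgoing'").show()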
I have a requirement where I need to mask the data stored in Cassandra tables using pyspark. I have a frozen data set in Cassandra which I get as an Array in pyspark. I converted it to a String to mask it. Now, I want to convert it back to an array type.
I am using Spark 2.3.2 to mask data from a Cassandra table. I copied the data to a data frame and converted it to a string to perform the masking. I tried to convert it back to an array; however, I am not able to maintain the original structure.
table_df.createOrReplaceTempView("tmp")
networkinfos_df = sqlContext.sql('Select networkinfos, pid, eid, sid From tmp')
dfn1 = networkinfos_df.withColumn('networkinfos_ntdf',regexp_replace(networkinfos_df.networkinfos.cast(StringType()),r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', faker.ipv4_private(network=False, address_class=None))).drop('networkinfos') \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'([a-fA-F0-9]{2}[:|\-]?){6}', faker.mac_address())) \
.withColumn('networkinfos_ntdf',regexp_replace('networkinfos_ntdf',r'(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))', faker.ipv6(network=False))) \
.drop('networkinfos')
dfn2 = dfn1.withColumn("networkinfos_ntdf", array(dfn1["networkinfos_ntdf"]))
dfn2.show(30,False)
The original structure is as follows:
|-- networkinfos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- vendor: string (nullable = true)
| | |-- product: string (nullable = true)
| | |-- dhcp_enabled: boolean (nullable = true)
| | |-- dhcp_server: string (nullable = true)
| | |-- dns_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv4: string (nullable = true)
| | |-- ipv6: string (nullable = true)
| | |-- subnet_mask_obsolete: string (nullable = true)
| | |-- default_ip_gateway: string (nullable = true)
| | |-- mac_address: string (nullable = true)
| | |-- logical_name: string (nullable = true)
| | |-- dhcp_lease_obtained: timestamp (nullable = true)
| | |-- dhcp_lease_expires: timestamp (nullable = true)
| | |-- ip_enabled: boolean (nullable = true)
| | |-- ipv4_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- ipv6_list: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- subnet_masks_obsolete: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- default_ip_gateways: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- wins_primary_server: string (nullable = true)
| | |-- wins_secondary_server: string (nullable = true)
| | |-- subnet_mask: string (nullable = true)
| | |-- subnet_masks: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- interface_index: integer (nullable = true)
| | |-- speed: long (nullable = true)
| | |-- dhcp_servers: array (nullable = true)
| | | |-- element: string (containsNull = true)
What I am getting is:
root
|-- pid: string (nullable = true)
|-- eid: string (nullable = true)
|-- sid: string (nullable = true)
|-- networkinfos_ntdf: array (nullable = false)
| |-- element: string (containsNull = true)
How can I convert it back to the original structure?
You can try using pyspark.sql.functions.to_json() and pyspark.sql.functions.from_json() to handle your task if your regexp_replace operations do not break the JSON data:
First find the schema for the field networkinfos:
from pyspark.sql.types import ArrayType
from pyspark.sql.functions import regexp_replace, from_json, to_json
# get the schema of the array field `networkinfos` in JSON
schema_data = df.select('networkinfos').schema.jsonValue()['fields'][0]['type']
# convert it into pyspark.sql.types.ArrayType:
field_schema = ArrayType.fromJson(schema_data)
Once you have field_schema, you can use from_json to restore the original schema from the modified JSON strings:
dfn1 = networkinfos_df \
.withColumn('networkinfos', to_json('networkinfos')) \
.withColumn('networkinfos', regexp_replace('networkinfos',...)) \
.....\
.withColumn('networkinfos', from_json('networkinfos', field_schema))
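As a quick sanity check (a sketch, reusing the dfn1 from above), printing the schema of the restored column should show the original array of structs again:
dfn1.select('networkinfos').printSchema()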