pyspark dataframes: Why can I select some nested fields but not others? - python-3.x

I'm trying to write some code to un-nest JSON into Dataframes using pyspark (3.0.1) in Python 3.9.1.
I have some dummy data with a schema as follows:
data.printSchema()
root
|-- recordID: string (nullable = true)
|-- customerDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- dob: string (nullable = true)
|-- familyMembers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- relationship: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- contactNumbers: struct (nullable = true)
| | | |-- work: string (nullable = true)
| | | |-- home: string (nullable = true)
| | |-- addressDetails: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- addressType: string (nullable = true)
| | | | |-- address: string (nullable = true)
When I select fields from familyMembers I get the following results as expected:
data.select('familyMembers.contactNumbers.work').show(truncate=False)
+------------------------------------------------+
|work |
+------------------------------------------------+
|[(07) 4612 3880, (03) 5855 2377, (07) 4979 1871]|
|[(07) 4612 3880, (03) 5855 2377] |
+------------------------------------------------+
data.select('familyMembers.name').show(truncate=False)
+------------------------------------+
|name |
+------------------------------------+
|[Jane Smith, Bob Smith, Simon Smith]|
|[Jackie Sacamano, Simon Sacamano] |
+------------------------------------+
Yet when I try to select fields from the addressDetails ArrayType (beneath familyMembers) I get an error:
>>> data.select('familyMembers.addressDetails.address').show(truncate=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1421, in select
jdf = self._jdf.select(self._jcols(*cols))
File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/usr/local/lib/python3.9/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '`familyMembers`.`addressDetails`['address']' due to data type mismatch: argument 2 requires integral type, however, ''address'' is of string type.;;
'Project [familyMembers#71.addressDetails[address] AS address#277]
+- LogicalRDD [recordID#69, customerDetails#70, familyMembers#71], false
I'm confused. Both familyMembers and addressDetails are ArrayTypes, yet selecting from one works but not the other. Is there an explanation for this, or something I've missed? Is it because one is nested within the other?
Code to reproduce (with just 1 record):
from pyspark.sql.types import StructType
from pyspark.sql import SparkSession, DataFrame
import json
rawdata = [{"recordID":"abc-123","customerDetails":{"name":"John Smith","dob":"1980-04-23"},"familyMembers":[{"relationship":"mother","name":"Jane Smith","contactNumbers":{"work":"(07) 4612 3880","home":"(08) 8271 1577"},"addressDetails":[{"addressType":"residential","address":"29 Commonwealth St, Clifton, QLD 4361 "},{"addressType":"work","address":"20 A Yeo Ave, Highgate, SA 5063 "}]},{"relationship":"father","name":"Bob Smith","contactNumbers":{"work":"(03) 5855 2377","home":"(03) 9773 2483"},"addressDetails":[{"addressType":"residential","address":"1735 Fenaughty Rd, Kyabram South, VIC 3620"},{"addressType":"work","address":"12 Haldane St, Bonbeach, VIC 3196 "}]},{"relationship":"brother","name":"Simon Smith","contactNumbers":{"work":"(07) 4979 1871","home":"(08) 9862 6017"},"addressDetails":[{"addressType":"residential","address":"6 Darren St, Sun Valley, QLD 4680"},{"addressType":"work","address":"Arthur River, WA 6315"}]}]},]
strschema = '{"fields":[{"metadata":{},"name":"recordID","nullable":true,"type":"string"},{"metadata":{},"name":"customerDetails","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"dob","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"familyMembers","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"relationship","nullable":true,"type":"string"},{"metadata":{},"name":"name","nullable":true,"type":"string"},{"metadata":{},"name":"contactNumbers","nullable":true,"type":{"fields":[{"metadata":{},"name":"work","nullable":true,"type":"string"},{"metadata":{},"name":"home","nullable":true,"type":"string"}],"type":"struct"}},{"metadata":{},"name":"addressDetails","nullable":true,"type":{"containsNull":true,"elementType":{"fields":[{"metadata":{},"name":"addressType","nullable":true,"type":"string"},{"metadata":{},"name":"address","nullable":true,"type":"string"}],"type":"struct"},"type":"array"}}],"type":"struct"},"type":"array"}}],"type":"struct"}'
spark = SparkSession.builder.appName("json-un-nester").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
schema = StructType.fromJson(json.loads(strschema))
datardd = sc.parallelize(rawdata)
data = spark.createDataFrame(datardd, schema=schema)
data.show()
data.select('familyMembers.name').show(truncate=False)
data.select('familyMembers.addressDetails.address').show(truncate=False)

To understand this, you can print the schema of the selected column:
data.select('familyMembers.addressDetails').printSchema()
#root
# |-- familyMembers.addressDetails: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- addressType: string (nullable = true)
# | | | |-- address: string (nullable = true)
Here you have an array of arrays of structs, which is different from the initial schema: selecting familyMembers.addressDetails pulls the addressDetails array out of every struct in the familyMembers array, so the result is nested one level deeper. That is why you can't directly access address from the root, but you can select the first element of the nested array and then access the struct field address:
data.selectExpr("familyMembers.addressDetails[0].address").show(truncate=False)
#+--------------------------------------------------------------------------+
#|familyMembers.addressDetails AS addressDetails#29[0].address |
#+--------------------------------------------------------------------------+
#|[29 Commonwealth St, Clifton, QLD 4361 , 20 A Yeo Ave, Highgate, SA 5063 ]|
#+--------------------------------------------------------------------------+
Or, using the Column API (this needs import pyspark.sql.functions as F):
data.select(F.col('familyMembers.addressDetails').getItem(0).getItem("address"))
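If you want all the addresses rather than just the first family member's, one option (a sketch, using flatten, which is available since Spark 2.4) is to flatten the nested array and explode it into one row per address:
from pyspark.sql import functions as F

# flatten array<array<struct>> into array<struct>, then explode to one row per address
data.select(
    F.explode(F.flatten("familyMembers.addressDetails")).alias("addr")
).select("addr.addressType", "addr.address").show(truncate=False)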

Along with the answer that @blackbishop provided, you can also use a combination of select and expr (with from pyspark.sql.functions import expr, explode) to get the output as below:
data.select(expr('familyMembers.addressDetails[0].address')).show(truncate=False)
This gives the same addresses as the selectExpr example above.
You can also use explode to get all the addresses if you want, as below. Exploding familyMembers.addressDetails produces one row per family member, each holding that member's array of addresses:
data.select(explode('familyMembers.addressDetails')).select("col.address").show(truncate=False)

Related

How to Convert Map of Struct type to Json in Spark2

I have a map field in a dataset with the below schema:
|-- party: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- partyName: string (nullable = true)
| | |-- cdrId: string (nullable = true)
| | |-- legalEntityId: string (nullable = true)
| | |-- customPartyId: string (nullable = true)
| | |-- partyIdScheme: string (nullable = true)
| | |-- customPartyIdScheme: string (nullable = true)
| | |-- bdrId: string (nullable = true)
I need to convert it to JSON. Please suggest how to do it. Thanks in advance.
Spark provides the to_json function for DataFrame operations:
import org.apache.spark.sql.functions._
import spark.implicits._
val df =
  List(
    ("key1", "party01", "cdrId01"),
    ("key2", "party02", "cdrId02"),
  )
    .toDF("key", "partyName", "cdrId")
    .select(struct($"key", struct($"partyName", $"cdrId")).as("col1"))
    .agg(map_from_entries(collect_set($"col1")).as("map_col"))
    .select($"map_col", to_json($"map_col").as("json_col"))

Pyspark: Selecting a value after exploding an array

I am new to pyspark and am trying to parse telecom.value where telecom.system = "fax|phone", but I am getting the error below. I understand that filter() would return me a struct and I am selecting a column from that. How do I select the column value after calling filter()?
File "", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Resolved attribute(s) telecom#27,telecom#33 missing from name#3,telecom#5,address#7 in operator !Project [name#3.family AS Practitioner_LastName#23, name#3.suffix AS Practitioner_NameSuffix#24, name#3.given[0] AS Practitioner_FirstName#25, telecom#27.value AS telecom.value#42, telecom#33.value AS telecom.value#43, address#7.city AS PractitionerCity#38, address#7.line[0] AS PractitionerAddress_1#39, address#7.postalCode AS PractitionerZip#40, address#7.state AS PractitionerState#41]. Attribute(s) with the same name appear in the operation: telecom,telecom. Please check if the right attribute(s) are used.
root
|-- resource: struct (nullable = true)
| |-- address: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- city: string (nullable = true)
| | | |-- country: string (nullable = true)
| | | |-- line: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- postalCode: string (nullable = true)
| | | |-- state: string (nullable = true)
| | | |-- use: string (nullable = true)
| |-- id: string (nullable = true)
| |-- identifier: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: struct (nullable = true)
| | | | |-- coding: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- code: string (nullable = true)
| | | | | | |-- system: string (nullable = true)
| | | |-- use: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- name: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- family: string (nullable = true)
| | | |-- given: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- suffix: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- use: string (nullable = true)
| |-- resourceType: string (nullable = true)
| |-- telecom: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- system: string (nullable = true)
| | | |-- use: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- text: struct (nullable = true)
| | |-- div: string (nullable = true)
| | |-- status: string (nullable = true)
import sys
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"
spark = SparkSession.builder.appName(appName).master(master).getOrCreate()
json_file_path = 'C:\\Users\\M\\Documents\\Practitioner.json'
source_df = spark.read.json(json_file_path, multiLine=True)
source_df.printSchema()
output = source_df.select(source_df["resource.name"][0].alias("name"),
                          source_df["resource.telecom"].alias("telecom"),
                          source_df["resource.address"][0].alias("address"))
output.printSchema()
practitioner = output.select(
    output.name.family.alias("Practitioner_LastName"),
    output.name.suffix.alias("Practitioner_NameSuffix"),
    output.name.given[0].alias("Practitioner_FirstName"),
    output.withColumn("telecom", f.explode(f.col("telecom"))).filter(f.col("telecom.system") == "phone").telecom.value,
    output.withColumn("telecom", f.explode(f.col("telecom"))).filter(f.col("telecom.system") == "fax").telecom.value,
    output.address.city.alias("PractitionerCity"),
    output.address.line[0].alias("PractitionerAddress_1"),
    output.address.postalCode.alias("PractitionerZip"),
    output.address.state.alias("PractitionerState")
)
practitioner.printSchema()
practitioner.show()
My json is:
{"resource":{"resourceType":"Practitioner","id":"scm-ambqa1821624401190","text":{"status":"generated","div":""},"identifier":[{"use":"official","type":{"coding":[{"system":"http:\/\/hl7.org\/fhir\/v2\/0203","code":"NPI"}]},"value":"1548206097"},{"use":"official","type":{"coding":[{"system":"http:\/\/hl7.org\/fhir\/v2\/0203","code":"DEA"}]},"value":"HB1548206"}],"name":[{"use":"official","family":"BERNSTEIN","given":["HELENE","B"],"suffix":["MD"]}],"telecom":[{"system":"phone","value":"6106547854","use":"work"},{"system":"email","value":"sachin.belhekar#allscripts.com","use":"work"},{"system":"fax","value":"7106547895","use":"work"}],"address":[{"use":"work","line":["West Street 1","West Street 2"],"city":"Michigan","state":"MI","postalCode":"49036","country":"USA"}]}}
The data structure is a bit complex, so I will use a UDF to parse it:
import pyspark.sql.functions as f
import pyspark.sql.types as t

@f.udf(t.StringType())
def phone_parser(row):
    for item in row:
        if item['system'] == 'phone':
            return item['value']

@f.udf(t.StringType())
def fax_parser(row):
    for item in row:
        if item['system'] == 'fax':
            return item['value']

output.select(phone_parser('telecom'), fax_parser('telecom'))
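A UDF-free alternative (a sketch, assuming Spark 2.4+ where the filter higher-order SQL function is available) is to filter the telecom array in an SQL expression and take the first match:
import pyspark.sql.functions as f

# keep only the entries whose system matches, then take the first match's value
output.select(
    f.expr("filter(telecom, x -> x.system = 'phone')[0].value").alias("phone"),
    f.expr("filter(telecom, x -> x.system = 'fax')[0].value").alias("fax"),
).show()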

Exploding struct type large number of columns to two columns of each column having keys and values in Pyspark

I have a pyspark df whose schema looks like this:
root
|-- company: struct (nullable = true)
| |-- A: long (nullable = true)
| |-- A_timestamp: long (nullable = true)
| |-- B: long (nullable = true)
| |-- B_timestamp: long (nullable = true)
| |-- C: long (nullable = true)
| |-- C_timestamp: long (nullable = true)
| |-- D: long (nullable = true)
| |-- D_timestamp: long (nullable = true)
| |-- E: long (nullable = true)
| |-- E_timestamp: long (nullable = true)
I want the final format of this dataframe with 4 columns to look like this
Please help me to solve this using Pyspark.
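The target layout isn't reproduced above, but one possible approach (a hypothetical sketch, assuming the goal is one row per metric with its name, value, and timestamp) is to unpivot the struct with stack:
# df is assumed to be the DataFrame with the company struct column.
# build a stack() expression: each metric contributes its name, value, and timestamp to one row
metrics = ["A", "B", "C", "D", "E"]
stack_args = ", ".join(
    "'{m}', company.{m}, company.{m}_timestamp".format(m=m) for m in metrics
)
unpivoted = df.selectExpr(
    "stack({}, {}) as (metric, value, value_timestamp)".format(len(metrics), stack_args)
)
unpivoted.show()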

How to create a list json using Pyspark?

I am trying to create a JSON file with the below structure using Pyspark.
Target Output:
[{
"Loaded_data": [{
"Loaded_numeric_columns": ["id", "val"],
"Loaded_category_columns": ["name", "branch"]
}],
"enriched_data": [{
"enriched_category_columns": ["country__4"],
"enriched_index_columns": ["id__1", "val__3"]
}]
}]
I was able to create a list for each section; please refer to the code below. I am kind of stuck here, could you please help?
Sample data:
input_data = spark.read.csv("/tmp/test234.csv", header=True, inferSchema=True)

def is_numeric(data_type):
    return data_type not in ('date', 'string', 'boolean')

def is_nonnumeric(data_type):
    return data_type in ('string',)

sub = "__"
Loaded_numeric_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub not in name)]
print(Loaded_numeric_columns)
Loaded_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub not in name)]
print(Loaded_category_columns)
enriched_category_columns = [name for name, data_type in input_data.dtypes if is_nonnumeric(data_type) and (sub in name)]
print(enriched_category_columns)
enriched_index_columns = [name for name, data_type in input_data.dtypes if is_numeric(data_type) and (sub in name)]
print(enriched_index_columns)
You can just build the new columns with struct and array:
from pyspark.sql import functions as F
df.show()
+---+-----+-------+------+----------+-----+-------+
| id| val| name|branch|country__4|id__1| val__3|
+---+-----+-------+------+----------+-----+-------+
| 1|67.87|Shankar| a| 1|67.87|Shankar|
+---+-----+-------+------+----------+-----+-------+
df.select(
    F.struct(
        F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
        F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
    ).alias("Loaded_data"),
    F.struct(
        F.array(F.col("country__4")).alias("enriched_category_columns"),
        F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
    ).alias("enriched_data"),
).printSchema()
root
|-- Loaded_data: struct (nullable = false)
| |-- Loaded_numeric_columns: array (nullable = false)
| | |-- element: double (containsNull = true)
| |-- Loaded_category_columns: array (nullable = false)
| | |-- element: string (containsNull = true)
|-- enriched_data: struct (nullable = false)
| |-- enriched_category_columns: array (nullable = false)
| | |-- element: long (containsNull = true)
| |-- enriched_index_columns: array (nullable = false)
| | |-- element: string (containsNull = true)
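To actually get the JSON out of that DataFrame, a minimal sketch (assuming the select above is kept in a variable, here called built, instead of calling printSchema() on it) is to use toJSON or write.json:
built = df.select(
    F.struct(
        F.array(F.col("id"), F.col("val")).alias("Loaded_numeric_columns"),
        F.array(F.col("name"), F.col("branch")).alias("Loaded_category_columns"),
    ).alias("Loaded_data"),
    F.struct(
        F.array(F.col("country__4")).alias("enriched_category_columns"),
        F.array(F.col("id__1"), F.col("val__3")).alias("enriched_index_columns"),
    ).alias("enriched_data"),
)
print(built.toJSON().first())       # one JSON string per row
built.write.json("/tmp/list_json")  # hypothetical output path, written as JSON Lines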

Parse JSON in Spark containing reserve character

I have a JSON input.txt file with data as follows:
2018-05-30.txt:{"Message":{"eUuid":"6e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1527539376,"id":"XYZ","location":{"dim":{"x":2,"y":-7},"towards":121.0},"source":"a","UniqueId":"test123","code":"del","signature":"xyz","":{},"vel":{"ground":15},"height":{},"next":{"dim":{}},"sub":"del1"}}
2018-05-30.txt:{"Message":{"eUuid":"5e7d4890-9279-491a-ae4d-70416ef9d42d","schemaVersion":"1.0-AB1","timestamp":1627539376,"id":"ABC","location":{"dim":{"x":1,"y":-8},"towards":132.0},"source":"b","UniqueId":"hello123","code":"fra","signature":"abc","":{},"vel":{"ground":16},"height":{},"next":{"dim":{}},"sub":"fra1"}}
.
.
I tried to load the JSON into a DataFrame as follows:
val df = spark.read.json("<full path of input.txt file>")
I am receiving a _corrupt_record dataframe.
I am aware that the JSON contains "." (2018-05-30.txt) as a reserved character, which is causing the issue. How may I resolve this?
val rdd = sc.textFile("/Users/kishore/abc.json")
val jsonRdd= rdd.map(x=>x.split("txt:")(1))
scala> df.show
+--------------------+
| Message|
+--------------------+
|[test123,del,6e7d...|
|[hello123,fra,5e7...|
+--------------------+
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// val df = sqlContext.read.json(jsonRdd)
// df.show(false)
val df = sqlContext.read.json(jsonRdd).withColumn("eUuid", $"Message"("eUuid"))
.withColumn("schemaVersion", $"Message"("schemaVersion"))
.withColumn("timestamp", $"Message"("timestamp"))
.withColumn("id", $"Message"("id"))
.withColumn("source", $"Message"("source"))
.withColumn("UniqueId", $"Message"("UniqueId"))
.withColumn("location", $"Message"("location"))
.withColumn("dim", $"location"("dim"))
.withColumn("x", $"dim"("x"))
.withColumn("y", $"dim"("y"))
.drop("dim")
.withColumn("vel", $"Message"("vel"))
.withColumn("ground", $"vel"("ground"))
.withColumn("sub", $"Message"("sub"))
.drop("Message")
df.show()
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
| eUuid|schemaVersion| timestamp| id|source|UniqueId| location| x| y| vel|ground| sub|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
|6e7d4890-9279-491...| 1.0-AB1|1527539376|XYZ| a| test123|[[2,-7],121]| 2| -7|[15]| 15|del1|
+--------------------+-------------+----------+---+------+--------+------------+---+---+----+------+----+
The problem is not a reserved character; it is that the file does not contain valid JSON (each line is prefixed with the file name),
so you can strip that prefix before parsing:
val df = spark.read.textFile(...)
val json = spark.read.json(df.map(v => v.drop(15)))
json.printSchema()
root
|-- Message: struct (nullable = true)
| |-- UniqueId: string (nullable = true)
| |-- code: string (nullable = true)
| |-- eUuid: string (nullable = true)
| |-- id: string (nullable = true)
| |-- location: struct (nullable = true)
| | |-- dim: struct (nullable = true)
| | | |-- x: long (nullable = true)
| | | |-- y: long (nullable = true)
| | |-- towards: double (nullable = true)
| |-- schemaVersion: string (nullable = true)
| |-- signature: string (nullable = true)
| |-- source: string (nullable = true)
| |-- sub: string (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- vel: struct (nullable = true)
| | |-- ground: long (nullable = true)
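The same idea in PySpark (a sketch, assuming a SparkSession spark and SparkContext sc, and splitting each line on the "txt:" prefix before handing the remainder to the JSON reader):
# read the file as plain text, drop the "2018-05-30.txt:" prefix, then parse the JSON
rdd = sc.textFile("<full path of input.txt file>").map(lambda line: line.split("txt:", 1)[1])
json_df = spark.read.json(rdd)
json_df.printSchema()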
