How to write a schema for the nested JSON below in PySpark

How do I write a schema for the JSON below?
"place_results": {
"title": "W2A Architects",
"place_id": "ChIJ4SUGuHw5xIkRAl0856nZrBM",
"data_id": "0x89c4397cb80625e1:0x13acd9a9e73c5d02",
"data_cid": "1417747306467056898",
"reviews_link": "httpshl=en",
"photos_link": "https=en",
"gps_coordinates": {
"latitude": 40.6027801,
"longitude": -75.4701499
},
"place_id_search": "http",
"rating": 3.7,
I am getting nulls with the schema below. How do I know the correct data type to use?
StructField('place_results', StructType([
    StructField('address', StringType(), True),
    StructField('data_cid', StringType(), True),
    StructField('data_id', StringType(), True),
    StructField('gps_coordinates', StringType(), True),
    StructField('open_state', StringType(), True),
    StructField('phone', StringType(), True),
    StructField('website', StringType(), True)
])),

This should work; note that gps_coordinates is a nested object, so it gets its own StructType:
StructType([
    StructField('place_results',
        StructType([
            StructField('data_cid', StringType(), True),
            StructField('data_id', StringType(), True),
            StructField('gps_coordinates', StructType([
                StructField('latitude', DoubleType(), True),
                StructField('longitude', DoubleType(), True)]), True),
            StructField('photos_link', StringType(), True),
            StructField('place_id', StringType(), True),
            StructField('place_id_search', StringType(), True),
            StructField('rating', DoubleType(), True),
            StructField('reviews_link', StringType(), True),
            StructField('title', StringType(), True)]), True)
])
I got this schema by letting Spark infer it from the file:
spark.read.option("multiLine", True).json("dbfs:/test/sample.json").schema
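To see how the data types follow from the JSON itself, here is a minimal pure-Python sketch that walks a parsed sample and prints a type for each field, using the same type names that appear in Spark's JSON schema form. It only illustrates the mapping (object to struct, decimal number to double, and so on); it is not Spark's inference, the helper name infer_type is hypothetical, and the sample is an abridged copy of the JSON in the question.

```python
import json

def infer_type(value):
    """Map a parsed JSON value to a Spark-style schema type (simplified sketch)."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):      # nested object -> struct with its own fields
        return {
            "type": "struct",
            "fields": [
                {"name": k, "type": infer_type(v), "nullable": True, "metadata": {}}
                for k, v in sorted(value.items())
            ],
        }
    if isinstance(value, list):      # array type is taken from the first element
        element = infer_type(value[0]) if value else "string"
        return {"type": "array", "elementType": element, "containsNull": True}
    return "string"                  # strings and nulls fall back to string here

sample = json.loads("""
{"place_results": {
    "title": "W2A Architects",
    "data_cid": "1417747306467056898",
    "gps_coordinates": {"latitude": 40.6027801, "longitude": -75.4701499},
    "rating": 3.7
}}
""")

schema = infer_type(sample)
place = schema["fields"][0]["type"]
for field in place["fields"]:
    ftype = field["type"]
    kind = ftype if isinstance(ftype, str) else ftype["type"]
    print(field["name"], "->", kind)
```

With this sample, gps_coordinates comes out as a struct and rating as a double, which matches the schema Spark inferred above.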

Related

write delta lake in Databricks error: HttpRequest 409 err PathAlreadyExist

Sometimes I get this error when a Databricks job writes to Azure Data Lake:
HttpRequest: 409,err=PathAlreadyExists,appendpos=,cid=f448-0832-41ac-a2ab-8821453ef3c8,rid=7d4-101f-005a-578c-f82000000,connMs=0,sendMs=0,recvMs=38,sent=0,recv=168,method=PUT,url=https://awutmp.dfs.core.windows.net/bronze/app/_delta_log/_last_checkpoint?resource=file&timeout=90
My code reads from Blob Storage using Auto Loader and writes to Azure Data Lake:
Schemas:
val binarySchema = StructType(List(
  StructField("path", StringType, true),
  StructField("modificationTime", TimestampType, true),
  StructField("length", LongType, true),
  StructField("content", BinaryType, true)
))
val jsonSchema = StructType(List(
  StructField("EquipmentId", StringType, true),
  StructField("EquipmentName", StringType, true),
  StructField("EquipmentType", StringType, true),
  StructField("Name", StringType, true),
  StructField("Value", StringType, true),
  StructField("ValueType", StringType, true),
  StructField("LastSourceTimeStamp", StringType, true),
  StructField("LastReprocessDate", StringType, true),
  StructField("LastStateDuration", StringType, true),
  StructField("MessageId", StringType, true)
))
Create delta table if not exists:
val sinkPath = "abfss://bronze@awutmp.dfs.core.windows.net/app"
val tableSQL =
s"""
CREATE TABLE IF NOT EXISTS bronze.awutmpapp(
path STRING,
file_modification_time TIMESTAMP,
file_length LONG,
value STRING,
json struct<EquipmentId STRING, EquipmentName STRING, EquipmentType STRING, Name STRING, Value STRING,ValueType STRING, LastSourceTimeStamp STRING, LastReprocessDate STRING, LastStateDuration STRING, MessageId STRING>,
job_name STRING,
job_version STRING,
schema STRING,
schema_version STRING,
timestamp_etl_process TIMESTAMP,
year INT GENERATED ALWAYS AS (YEAR(file_modification_time)) COMMENT 'generated from file_modification_time',
month INT GENERATED ALWAYS AS (MONTH(file_modification_time)) COMMENT 'generated from file_modification_time',
day INT GENERATED ALWAYS AS (DAY(file_modification_time)) COMMENT 'generated from file_modification_time'
)
USING DELTA
PARTITIONED BY (year, month, day)
LOCATION '${sinkPath}'
"""
spark.sql(tableSQL)
Options:
val options = Map[String, String](
  "cloudFiles.format" -> "BinaryFile",
  "cloudFiles.useNotifications" -> "true",
  "cloudFiles.queueName" -> queue,
  "cloudFiles.connectionString" -> queueConnString,
  "cloudFiles.validateOptions" -> "true",
  "cloudFiles.allowOverwrites" -> "true",
  "cloudFiles.includeExistingFiles" -> "true",
  "recursiveFileLookup" -> "true",
  "modifiedAfter" -> "2022-01-01T00:00:00.000+0000",
  "pathGlobFilter" -> "*.json.gz",
  "ignoreCorruptFiles" -> "true",
  "ignoreMissingFiles" -> "true"
)
Method to process each micro-batch:
def decompress(compressed: Array[Byte]): Option[String] =
  Try {
    val inputStream = new GZIPInputStream(new ByteArrayInputStream(compressed))
    scala.io.Source.fromInputStream(inputStream).mkString
  }.toOption

def binaryToStringUDF: UserDefinedFunction = {
  udf { (data: Array[Byte]) => decompress(data).orNull }
}

def processMicroBatch: (DataFrame, Long) => Unit = (df: DataFrame, id: Long) => {
  val resultDF = df
    .withColumn("content_string", binaryToStringUDF(col("content")))
    .withColumn("array_value", split(col("content_string"), "\n"))
    .withColumn("array_noempty_values", expr("filter(array_value, value -> value <> '')"))
    .withColumn("value", explode(col("array_noempty_values")))
    .withColumn("json", from_json(col("value"), jsonSchema))
    .withColumnRenamed("length", "file_length")
    .withColumnRenamed("modificationTime", "file_modification_time")
    .withColumn("job_name", lit("jobName"))
    .withColumn("job_version", lit("1.0"))
    .withColumn("schema", lit(schema.toString))
    .withColumn("schema_version", lit("1.0"))
    .withColumn("timestamp_etl_process", current_timestamp())
    .withColumn("timestamp_tz", expr("current_timezone()"))
    .withColumn("timestamp_etl_process",
      to_utc_timestamp(col("timestamp_etl_process"), col("timestamp_tz")))
    .drop("timestamp_tz", "array_value", "array_noempty_values", "content", "content_string")
  resultDF
    .write
    .format("delta")
    .mode("append")
    .option("path", sinkPath)
    .save()
}
val storagePath = "wasbs://signal@externalaccount.blob.core.windows.net/"
val checkpointPath = "/checkpoint/signal/autoloader"
spark
  .readStream
  .format("cloudFiles")
  .options(options)
  .schema(binarySchema)
  .load(storagePath)
  .writeStream
  .format("delta")
  .outputMode("append")
  .foreachBatch(processMicroBatch)
  .option("checkpointLocation", checkpointPath)
  .trigger(Trigger.AvailableNow)
  .start()
  .awaitTermination()
This is additional information I have seen in Azure Log Analytics:
How can I solve this error?

Kafka JSON Data with Schema is Null in PySpark Structured Streaming. Mismatched input on new schema

I am trying to read JSON messages from Kafka in Spark Structured Streaming. An example message is as follows:
{
"_id": {
"$oid": "5eb292531c7d910b8c98dbce"
},
"Id": 37,
"Timestamp": {
"$date": 1582889068616
},
"TTNR": "R902170286",
"SNR": 91177446,
"State": 0,
"I_A1": "FALSE",
"I_B1": "FALSE",
"I1": 0.0037385,
"Mabs": -20.9814753,
"p_HD1": 31.0069236,
"pG": 27.640614,
"pT": 1.7169713,
"pXZ": 3.4712914,
"T3": 25.2174444,
"nan": 179.3099976,
"Q1": 0,
"a_01X": [
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925,
62.7839925
]
}
After reading the stream in Kafka, the value field as string looks like this:
|value |

|
{"_id":{"$oid":"5eb292531c7d910b8c98dbce"},"Id":37,"Timestamp":{"$date":1582889068616},"TTNR":"R902170286","SNR":91177446,"State":0,"I_A1":"FALSE","I_B1":"FALSE","I1":0.0037385,"Mabs":-20.9814753,"p_HD1":31.0069236,"pG":27.640614,"pT":1.7169713,"pXZ":3.4712914,"T3":25.2174444,"nan":179.3099976,"Q1":0,"a_01X":[62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}
|

A schema has been defined to select some fields as follows:
json_schema=StructType([ \
StructField("_id",StructField("$oid",StringType())), \
StructField("Id", DoubleType()), \
StructField('Timestamp', StructField("$date", LongType())), \
StructField("TTNR", StringType()), \
StructField("SNR", DoubleType()), \
StructField("State", LongType()), \
StructField("I_A1", StringType()), \
StructField("I_B1", StringType()), \
StructField("I1", DoubleType()), \
StructField("Mabs", DoubleType()), \
StructField("p_HD1", DoubleType()), \
StructField("pG", DoubleType()), \
StructField("pT", DoubleType()), \
StructField("pXZ", DoubleType()), \
StructField("T3", DoubleType()), \
StructField("nan", DoubleType()), \
StructField("Q1", LongType()), \
StructField("a_01X", ArrayType(DoubleType()))
])
(Solved with parsing error) But after trying to print to the console, I get null values:
data_stream_json = data_stream_value.select(from_json(col("value"), json_schema).alias("json_detail"))
data_stream_output = data_stream_json \
.writeStream \
.outputMode("append") \
.format("console") \
.start()
+----+----+----+----+
| Id|TTNR| SNR| Q1|
+----+----+----+----+
|null|null|null|null|
+----+----+----+----+
(New Error) After changing the schema, there is a new problem parsing the string:
pyspark.sql.utils.ParseException: u'\nmismatched input \'{\' expecting {\'SELECT\', \'FROM\', \'ADD\', \'AS\', \'ALL\', \'ANY\', \'DISTINCT\', \'WHERE\', \'GROUP\', \'BY\', \'GROUPING\', \'SETS\', \'CUBE\', \'ROLLUP\', \'ORDER\', \'HAVING\', \'LIMIT\', \'AT\', \'OR\', \'AND\', \'IN\', NOT, \'NO\', \'EXISTS\', \'BETWEEN\', \'LIKE\', RLIKE, \'IS\', \'NULL\', \'TRUE\', \'FALSE\', \'NULLS\', \'ASC\', \'DESC\', \'FOR\', \'INTERVAL\', \'CASE\', \'WHEN\', \'THEN\', \'ELSE\', \'END\', \'JOIN\', \'CROSS\', \'OUTER\', \'INNER\', \'LEFT\', \'SEMI\', \'RIGHT\', \'FULL\', \'NATURAL\', \'ON\', \'PIVOT\', \'LATERAL\', \'WINDOW\', \'OVER\', \'PARTITION\', \'RANGE\', \'ROWS\', \'UNBOUNDED\', \'PRECEDING\', \'FOLLOWING\', \'CURRENT\', \'FIRST\', \'AFTER\', \'LAST\', \'ROW\', \'WITH\', \'VALUES\', \'CREATE\', \'TABLE\', \'DIRECTORY\', \'VIEW\', \'REPLACE\', \'INSERT\', \'DELETE\', \'INTO\', \'DESCRIBE\', \'EXPLAIN\', \'FORMAT\', \'LOGICAL\', \'CODEGEN\', \'COST\', \'CAST\', \'SHOW\', \'TABLES\', \'COLUMNS\', \'COLUMN\', \'USE\', \'PARTITIONS\', \'FUNCTIONS\', \'DROP\', \'UNION\', \'EXCEPT\', \'MINUS\', \'INTERSECT\', \'TO\', \'TABLESAMPLE\', \'STRATIFY\', \'ALTER\', \'RENAME\', \'ARRAY\', \'MAP\', \'STRUCT\', \'COMMENT\', \'SET\', \'RESET\', \'DATA\', \'START\', \'TRANSACTION\', \'COMMIT\', \'ROLLBACK\', \'MACRO\', \'IGNORE\', \'BOTH\', \'LEADING\', \'TRAILING\', \'IF\', \'POSITION\', \'EXTRACT\', \'DIV\', \'PERCENT\', \'BUCKET\', \'OUT\', \'OF\', \'SORT\', \'CLUSTER\', \'DISTRIBUTE\', \'OVERWRITE\', \'TRANSFORM\', \'REDUCE\', \'SERDE\', \'SERDEPROPERTIES\', \'RECORDREADER\', \'RECORDWRITER\', \'DELIMITED\', \'FIELDS\', \'TERMINATED\', \'COLLECTION\', \'ITEMS\', \'KEYS\', \'ESCAPED\', \'LINES\', \'SEPARATED\', \'FUNCTION\', \'EXTENDED\', \'REFRESH\', \'CLEAR\', \'CACHE\', \'UNCACHE\', \'LAZY\', \'FORMATTED\', \'GLOBAL\', TEMPORARY, \'OPTIONS\', \'UNSET\', \'TBLPROPERTIES\', \'DBPROPERTIES\', \'BUCKETS\', \'SKEWED\', \'STORED\', \'DIRECTORIES\', \'LOCATION\', \'EXCHANGE\', 
\'ARCHIVE\', \'UNARCHIVE\', \'FILEFORMAT\', \'TOUCH\', \'COMPACT\', \'CONCATENATE\', \'CHANGE\', \'CASCADE\', \'RESTRICT\', \'CLUSTERED\', \'SORTED\', \'PURGE\', \'INPUTFORMAT\', \'OUTPUTFORMAT\', DATABASE, DATABASES, \'DFS\', \'TRUNCATE\', \'ANALYZE\', \'COMPUTE\', \'LIST\', \'STATISTICS\', \'PARTITIONED\', \'EXTERNAL\', \'DEFINED\', \'REVOKE\', \'GRANT\', \'LOCK\', \'UNLOCK\', \'MSCK\', \'REPAIR\', \'RECOVER\', \'EXPORT\', \'IMPORT\', \'LOAD\', \'ROLE\', \'ROLES\', \'COMPACTIONS\', \'PRINCIPALS\', \'TRANSACTIONS\', \'INDEX\', \'INDEXES\', \'LOCKS\', \'OPTION\', \'ANTI\', \'LOCAL\', \'INPATH\', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)\n\n== SQL ==\n{"fields":[{"metadata":{},"name":"_id","nullable":true,"type":{"metadata":{},"name":"$oid","nullable":true,"type":"string"}},{"metadata":{},"name":"Id","nullable":true,"type":"double"},{"metadata":{},"name":"Timestamp","nullable":true,"type":{"metadata":{},"name":"$date","nullable":true,"type":"long"}},{"metadata":{},"name":"TTNR","nullable":true,"type":"string"},{"metadata":{},"name":"SNR","nullable":true,"type":"double"},{"metadata":{},"name":"State","nullable":true,"type":"long"},{"metadata":{},"name":"I_A1","nullable":true,"type":"string"},{"metadata":{},"name":"I_B1","nullable":true,"type":"string"},{"metadata":{},"name":"I1","nullable":true,"type":"double"},{"metadata":{},"name":"Mabs","nullable":true,"type":"double"},{"metadata":{},"name":"p_HD1","nullable":true,"type":"double"},{"metadata":{},"name":"pG","nullable":true,"type":"double"},{"metadata":{},"name":"pT","nullable":true,"type":"double"},{"metadata":{},"name":"pXZ","nullable":true,"type":"double"},{"metadata":{},"name":"T3","nullable":true,"type":"double"},{"metadata":{},"name":"nan","nullable":true,"type":"double"},{"metadata":{},"name":"Q1","nullable":true,"type":"long"},{"metadata":{},"name":"a_01X","nullable":true,"type":{"containsNull":true,"elementType":"double","type":"array"}}],"type":"struct"}\n^^^\n'
I want some help with this.
Note: if you have complex nested JSON, try using the DataType.fromJson method to convert a JSON schema to a StructType schema, and keep the JSON schema outside of the code. When the schema changes, just update the JSON schema and restart your application; it will pick up the new schema automatically.
I have converted your JSON data to a schema string; please check the code below.
scala> val jsonSchema = """{"type":"struct","fields":[{"name":"I1","type":"double","nullable":true,"metadata":{}},{"name":"I_A1","type":"string","nullable":true,"metadata":{}},{"name":"I_B1","type":"string","nullable":true,"metadata":{}},{"name":"Id","type":"long","nullable":true,"metadata":{}},{"name":"Mabs","type":"double","nullable":true,"metadata":{}},{"name":"Q1","type":"long","nullable":true,"metadata":{}},{"name":"SNR","type":"long","nullable":true,"metadata":{}},{"name":"State","type":"long","nullable":true,"metadata":{}},{"name":"T3","type":"double","nullable":true,"metadata":{}},{"name":"TTNR","type":"string","nullable":true,"metadata":{}},{"name":"Timestamp","type":{"type":"struct","fields":[{"name":"$date","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"_id","type":{"type":"struct","fields":[{"name":"$oid","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"a_01X","type":{"type":"array","elementType":"double","containsNull":true},"nullable":true,"metadata":{}},{"name":"nan","type":"double","nullable":true,"metadata":{}},{"name":"pG","type":"double","nullable":true,"metadata":{}},{"name":"pT","type":"double","nullable":true,"metadata":{}},{"name":"pXZ","type":"double","nullable":true,"metadata":{}},{"name":"p_HD1","type":"double","nullable":true,"metadata":{}}]}"""
jsonSchema: String = {"type":"struct","fields":[{"name":"I1","type":"double","nullable":true,"metadata":{}},{"name":"I_A1","type":"string","nullable":true,"metadata":{}},{"name":"I_B1","type":"string","nullable":true,"metadata":{}},{"name":"Id","type":"long","nullable":true,"metadata":{}},{"name":"Mabs","type":"double","nullable":true,"metadata":{}},{"name":"Q1","type":"long","nullable":true,"metadata":{}},{"name":"SNR","type":"long","nullable":true,"metadata":{}},{"name":"State","type":"long","nullable":true,"metadata":{}},{"name":"T3","type":"double","nullable":true,"metadata":{}},{"name":"TTNR","type":"string","nullable":true,"metadata":{}},{"name":"Timestamp","type":{"type":"struct","fields":[{"name":"$date","type":"long","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"_id","type":{"type":"struct","fields":[{"name":"$oid","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"a_01X","type":{"type":"array","elementType":"double","containsNull":true},"nullable":true,"metadata":{}},{"name":"nan","type":"double","nullable":true,"metadata":{}},{"name":"pG","type":"double","nullable":true,"metadata":{}},{"name":"pT","type":"double","nullable":true,"metadata":{}},{"name":"pXZ","type":"double","nullable":true,"metadata":{}},{"name":"p_HD1","type":"double","nullable":true,"metadata":{}}]}
scala> val schema = DataType.fromJson(jsonSchema).asInstanceOf[StructType]
schema: org.apache.spark.sql.types.StructType = StructType(StructField(I1,DoubleType,true), StructField(I_A1,StringType,true), StructField(I_B1,StringType,true), StructField(Id,LongType,true), StructField(Mabs,DoubleType,true), StructField(Q1,LongType,true), StructField(SNR,LongType,true), StructField(State,LongType,true), StructField(T3,DoubleType,true), StructField(TTNR,StringType,true), StructField(Timestamp,StructType(StructField($date,LongType,true)),true), StructField(_id,StructType(StructField($oid,StringType,true)),true), StructField(a_01X,ArrayType(DoubleType,true),true), StructField(nan,DoubleType,true), StructField(pG,DoubleType,true), StructField(pT,DoubleType,true), StructField(pXZ,DoubleType,true), StructField(p_HD1,DoubleType,true))
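As a rough way to see why the original schema was rejected, the sketch below scans a schema in Spark's JSON form for nested fields whose type object is not a struct, array, or map. In the schema JSON printed in the error message, _id and Timestamp carry a field-shaped object (with "name" and "nullable") as their type, which is what nesting a StructField directly inside another StructField would produce. The helper name find_bad_fields is hypothetical, both samples are abridged from the text above, and the check only covers the cases shown here.

```python
import json

VALID_COMPLEX = {"struct", "array", "map"}

def find_bad_fields(schema, path=""):
    """Return dotted paths of fields whose nested type is not struct/array/map."""
    bad = []
    for field in schema.get("fields", []):
        name = f"{path}.{field['name']}" if path else field["name"]
        ftype = field["type"]
        if isinstance(ftype, dict):
            if ftype.get("type") not in VALID_COMPLEX:
                bad.append(name)                      # field-shaped object used as a type
            elif ftype["type"] == "struct":
                bad.extend(find_bad_fields(ftype, name))  # check nested structs too
    return bad

# Abridged from the "== SQL ==" part of the error message.
broken = json.loads("""
{"type": "struct", "fields": [
  {"name": "_id", "nullable": true, "metadata": {},
   "type": {"metadata": {}, "name": "$oid", "nullable": true, "type": "string"}},
  {"name": "Id", "nullable": true, "metadata": {}, "type": "double"}
]}
""")

# Abridged from the schema string in this answer.
fixed = json.loads("""
{"type": "struct", "fields": [
  {"name": "_id", "nullable": true, "metadata": {},
   "type": {"type": "struct", "fields": [
     {"name": "$oid", "type": "string", "nullable": true, "metadata": {}}]}},
  {"name": "Id", "nullable": true, "metadata": {}, "type": "long"}
]}
""")

print(find_bad_fields(broken))  # ['_id']
print(find_bad_fields(fixed))   # []
```

In PySpark terms, the fix is presumably to wrap the inner field in a StructType, as the nested struct entries in the schema string above do.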
I don't know your whole code, but from what you posted it seems to me that you first need to convert your Kafka input to a String, since the value initially arrives as binary (displayed as hexadecimal), and then apply your schema to that String.
I figured it out.
The trick was to change my Kafka serializer from Avro to string format. Although Avro preserves the schema, it also introduced some preceding characters such as a newline (see below) that were hard to remove before parsing the value as JSON in my case.

|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
{"_id":{"$oid":"5e58f86d5afd84039c135405"},"Id":1,"Timestamp":{"$date":1582889068580},"TTNR":"R902170286","SNR":92177446,"State":0,"I_A1":"FALSE","I_B1":"FALSE","I1":0.0036622,"Mabs":-20.5236976,"p_HD1":30.985062,"pG":27.7779473,"pT":1.727958,"pXZ":3.4487671,"T3":25.2296518,"nan":215.3000031,"Q1":0,"a_01X":[62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925,62.7839925]}
|

Getting my input as a String introduced more fields, which were easier to remove. I had to define a bigger schema, but parsing was successful.

Spark spark.sql.session.timeZone doesn't work with JSON source

Does Spark v2.3.1 depend on the local time zone when reading from a JSON file?
My src/test/resources/data/tmp.json:
[
{
"timestamp": "1970-01-01 00:00:00.000"
}
]
and Spark code:
SparkSession.builder()
.appName("test")
.master("local")
.config("spark.sql.session.timeZone", "UTC")
.getOrCreate()
.read()
.option("multiLine", true).option("mode", "PERMISSIVE")
.schema(new StructType()
.add(new StructField("timestamp", DataTypes.TimestampType, true, Metadata.empty())))
.json("src/test/resources/data/tmp.json")
.show();
Result:
+-------------------+
| timestamp|
+-------------------+
|1969-12-31 22:00:00|
+-------------------+
How can I make Spark return 1970-01-01 00:00:00.000?
P.S. This question is not a duplicate of Spark Strutured Streaming automatically converts timestamp to local time, because the solution provided there does not work for me and is already included in my question (see .config("spark.sql.session.timeZone", "UTC")).
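The arithmetic behind the result can be sketched in plain Python. The output above is consistent with the zone-less string being interpreted in a local zone two hours ahead of UTC and then displayed in UTC. The UTC+2 offset is an assumption made for illustration (the question does not state the machine's zone), and this sketch does not show what Spark does internally.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration only: the machine's local zone is UTC+2.
local_zone = timezone(timedelta(hours=2))

# The JSON string carries no zone information.
naive = datetime.strptime("1970-01-01 00:00:00.000", "%Y-%m-%d %H:%M:%S.%f")

# Reading the wall-clock value as local time and rendering it in UTC
# shifts it back by the zone offset.
as_local = naive.replace(tzinfo=local_zone)
shown_in_utc = as_local.astimezone(timezone.utc)
print(shown_in_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 1969-12-31 22:00:00

# Reading the same string as UTC leaves it unchanged.
as_utc = naive.replace(tzinfo=timezone.utc)
print(as_utc.strftime("%Y-%m-%d %H:%M:%S"))  # 1970-01-01 00:00:00
```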

Databricks spark-xml when reading tags ending in "/>" return values are null

I'm using the latest version of spark-xml (0.4.1) with Scala 2.11. When I read XML that contains tags ending with "/>", the corresponding values are null, as in the following example:
XML:
<Clients>
<Client ID="1" name="teste1" age="10">
<Operation ID="1" name="operation1">
</Operation>
<Operation ID="2" name="operation2">
</Operation>
</Client>
<Client ID="2" name="teste2" age="20"/>
<Client ID="3" name="teste3" age="30">
<Operation ID="1" name="operation1">
</Operation>
<Operation ID="2" name="operation2">
</Operation>
</Client>
</Clients>
Dataframe:
+----+------+----+--------------------+
| _ID| _name|_age| Operation|
+----+------+----+--------------------+
| 1|teste1| 10|[[1,operation1], ...|
|null| null|null| null|
+----+------+----+--------------------+
Code:
Dataset<Row> clients = sparkSession.sqlContext().read()
    .format("com.databricks.spark.xml")
    .option("rowTag", "Client")
    .schema(getSchemaClient())
    .load(dirtorio);
clients.show(10);

public StructType getSchemaClient() {
    return new StructType(
        new StructField[] {
            new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_age", DataTypes.StringType, true, Metadata.empty()),
            new StructField("Operation", DataTypes.createArrayType(this.getSchemaOperation()), true, Metadata.empty()) });
}

public StructType getSchemaOperation() {
    return new StructType(new StructField[] {
        new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
        new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
    });
}
Version 0.5.0 was just released, which resolved issues with self-closing tags. It may resolve this issue. See https://github.com/databricks/spark-xml/pull/352
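For reference, a self-closing tag is ordinary XML for an element with attributes and no children, so the expected row for the second Client has its attributes filled in and an empty Operation list. The sketch below checks that expectation with Python's standard xml.etree.ElementTree parser on an abridged copy of the XML from the question; it is independent of spark-xml and says nothing about how either spark-xml version behaves.

```python
import xml.etree.ElementTree as ET

xml_text = """
<Clients>
  <Client ID="1" name="teste1" age="10">
    <Operation ID="1" name="operation1"></Operation>
    <Operation ID="2" name="operation2"></Operation>
  </Client>
  <Client ID="2" name="teste2" age="20"/>
  <Client ID="3" name="teste3" age="30">
    <Operation ID="1" name="operation1"></Operation>
  </Client>
</Clients>
"""

root = ET.fromstring(xml_text.strip())

# One tuple per Client: its attributes plus the list of nested Operation rows.
rows = [
    (c.get("ID"), c.get("name"), c.get("age"),
     [(o.get("ID"), o.get("name")) for o in c.findall("Operation")])
    for c in root.findall("Client")
]
for row in rows:
    print(row)
```

All three Client elements are returned here, and the self-closing one keeps ID, name, and age, which gives a baseline to compare the DataFrame against after upgrading.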

Spark-Xml: Array within an Array in Dataframe to generate XML

I need to generate XML with the structure below:
<parent>
<name>parent</name>
<childs>
<child>
<name>child1</name>
</child>
<child>
<name>child1</name>
<grandchilds>
<grandchild>
<name>grand1</name>
</grandchild>
<grandchild>
<name>grand2</name>
</grandchild>
<grandchild>
<name>grand3</name>
</grandchild>
</grandchilds>
</child>
<child>
<name>child1</name>
</child>
</childs>
</parent>
As you can see, a parent has one or more child nodes, and a child node may have grandchild nodes.
https://github.com/databricks/spark-xml#conversion-from-dataframe-to-xml
I understand from the spark-xml documentation that when we have a nested array structure, the DataFrame should look like this:
+------------------------------------+
| a|
+------------------------------------+
|[WrappedArray(aa), WrappedArray(bb)]|
+------------------------------------+
Can you please help me with a small example of how to build a flattened DataFrame for my desired XML? I am working with Spark 2.x and spark-xml 0.4.5 (latest).
My schema:
StructType categoryMapSchema = new StructType(new StructField[]{
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
    new StructField("childs", new StructType(new StructField[]{
        new StructField("child",
            DataTypes.createArrayType(new StructType(new StructField[]{
                new StructField("name", DataTypes.StringType, true, Metadata.empty()),
                new StructField("grandchilds", new StructType(new StructField[]{
                    new StructField("grandchild",
                        DataTypes.createArrayType(new StructType(new StructField[]{
                            new StructField("name", DataTypes.StringType, true,
                                Metadata.empty())
                        })), true, Metadata.empty())
                }), true, Metadata.empty())
            })), true, Metadata.empty())
    }), true, Metadata.empty()),
});
My Row RDD data (not the actual code, but something like this):
final JavaRDD<Row> rowRdd = mapAttributes
    .map(parent -> {
        return RowFactory.create(
            parent.getParentName(),
            RowFactory.create(RowFactory.create((Object) parent.getChild))
        );
    });
What I have tried so far puts a WrappedArray inside the parent WrappedArray, which does not work.
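To make the target nesting concrete, here is a small pure-Python sketch that mirrors the schema above: each struct level is a dict, each array of structs is a list of dicts, and a list is written out as repeated elements under its key. It uses the standard xml.etree.ElementTree module rather than spark-xml, the helper name to_xml and the sample names are hypothetical, and it only illustrates the shape the rows would need; it does not show how spark-xml serializes a DataFrame.

```python
import xml.etree.ElementTree as ET

def to_xml(tag, value):
    """Dicts become child elements, lists become repeated elements, scalars become text."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, list):
                for item in child:           # one element per list entry
                    elem.append(to_xml(key, item))
            else:
                elem.append(to_xml(key, child))
    else:
        elem.text = str(value)
    return elem

# Same nesting as categoryMapSchema: name, childs{child[]{name, grandchilds{grandchild[]{name}}}}
parent = {
    "name": "parent",
    "childs": {
        "child": [
            {"name": "child1"},
            {"name": "child2",
             "grandchilds": {"grandchild": [{"name": "grand1"}, {"name": "grand2"}]}},
        ]
    },
}

root = to_xml("parent", parent)
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)
```

Read against the schema, the "childs" struct holds a single array field "child", so the corresponding row would be one struct wrapping a list of child structs rather than an array nested directly in an array.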
