Extract StructType from JSON schema definition - apache-spark

I have a schema definition generated by Hackolade. It looks like this:
{
"properties": {
"data": {
"isActivated": true,
"type": "object",
"properties": {
"value01": {
"isActivated": true,
"type": "string",
"readOnly": true,
"description": "description 01",
"pattern": "^([a-z _-]*)$",
"minLength": 4,
"examples": [
"example01",
"example02"
],
"$comment": "comment 01."
},
"start_timestamp": {
"isActivated": true,
"type": "string",
"format": "date-time",
"pattern": "^(\\d{4}(?!\\d{2}\\b))((-?)((0[1-9]|1[0-2])(\\3([12]\\d|0[1-9]|3[01]))?|W([0-4]\\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\\d|[12]\\d{2}|3([0-5]\\d|6[1-6])))([T\\s]((([01]\\d|2[0-3])((:?)[0-5]\\d)?|24\\:?00)([\\.,]\\d+(?!:))?)?(\\17[0-5]\\d?)?([zZ]|([\\+-])([01]\\d|2[0-3]):?([0-5]\\d)?)?)?)?(.\\d{6})?(\\+00:00)$",
"readOnly": true,
"examples": [
"2022-08-15T12:50:25.456789+00:00"
],
"description": "timestamp description",
"maxLength": 33,
"minLength": 27
}
},
"additionalProperties": false,
"description": "data description",
"readOnly": true,
"required": [
"value01",
"start_timestamp"
]
}
} }
I want to convert this schema definition to a StructType schema to use with a PySpark DataFrame:
df_with_json = df.withColumn("col_with_schema", f.from_json(f.col(value), schema))
where the value column contains a JSON string:
'{"data":{"value01":"example01", "start_timestamp":"2021-05-12T12:42:56.236123+00:00"}}'
I tried something like schema = f.schema_of_json(f.lit(str(json_schema_definition))), but it didn't work.

The following will convert your JSON string into a StructType object. The catch is that your JSON has regexes as values, so before the conversion you must escape their backslashes with a small replace.
json_str = """
{
"properties": {
"data": {
"isActivated": true,
"type": "object",
"properties": {
"value01": {
"isActivated": true,
"type": "string",
"readOnly": true,
"description": "description 01",
"pattern": "^([a-z _-]*)$",
"minLength": 4,
"examples": [
"example01",
"example02"
],
"$comment": "comment 01."
},
"start_timestamp": {
"isActivated": true,
"type": "string",
"format": "date-time",
"pattern": "^(\\d{4}(?!\\d{2}\\b))((-?)((0[1-9]|1[0-2])(\\3([12]\\d|0[1-9]|3[01]))?|W([0-4]\\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\\d|[12]\\d{2}|3([0-5]\\d|6[1-6])))([T\\s]((([01]\\d|2[0-3])((:?)[0-5]\\d)?|24\\:?00)([\\.,]\\d+(?!:))?)?(\\17[0-5]\\d?)?([zZ]|([\\+-])([01]\\d|2[0-3]):?([0-5]\\d)?)?)?)?(.\\d{6})?(\\+00:00)$",
"readOnly": true,
"examples": [
"2022-08-15T12:50:25.456789+00:00"
],
"description": "timestamp description",
"maxLength": 33,
"minLength": 27
}
},
"additionalProperties": false,
"description": "data description",
"readOnly": true,
"required": [
"value01",
"start_timestamp"
]
}
} }
"""
schema = spark.read.json(sc.parallelize([json_str.replace('\\', '\\\\\\\\')])).schema
print(schema)
# StructType([StructField('properties', StructType([StructField('data', StructType([StructField('additionalProperties', BooleanType(), True), StructField('description', StringType(), True), StructField('isActivated', BooleanType(), True), StructField('properties', StructType([StructField('start_timestamp', StructType([StructField('description', StringType(), True), StructField('examples', ArrayType(StringType(), True), True), StructField('format', StringType(), True), StructField('isActivated', BooleanType(), True), StructField('maxLength', LongType(), True), StructField('minLength', LongType(), True), StructField('pattern', StringType(), True), StructField('readOnly', BooleanType(), True), StructField('type', StringType(), True)]), True), StructField('value01', StructType([StructField('$comment', StringType(), True), StructField('description', StringType(), True), StructField('examples', ArrayType(StringType(), True), True), StructField('isActivated', BooleanType(), True), StructField('minLength', LongType(), True), StructField('pattern', StringType(), True), StructField('readOnly', BooleanType(), True), StructField('type', StringType(), True)]), True)]), True), StructField('readOnly', BooleanType(), True), StructField('required', ArrayType(StringType(), True), True), StructField('type', StringType(), True)]), True)]), True)])
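If the goal is a schema for the payload itself rather than for the Hackolade definition document, the same trick works on a sample record. A minimal sketch, assuming a SparkSession named spark, a SparkContext named sc, pyspark.sql.functions imported as f, and that the string column is named value:
# Infer the payload schema from a sample JSON string,
# then parse the `value` column with it.
sample = '{"data":{"value01":"example01", "start_timestamp":"2021-05-12T12:42:56.236123+00:00"}}'
payload_schema = spark.read.json(sc.parallelize([sample])).schema
df_with_json = df.withColumn("col_with_schema", f.from_json(f.col("value"), payload_schema))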

Related

Pimcore: New product class not visible in e-commerce product list

Goal
Data objects of my data object class Product should be visible in the e-commerce Pimcore site.
Current Setup
Current demo and blueprint application for Pimcore
I created a new data object class called Product. The parent PHP class is set to \App\Model\Product\AbstractProduct (complete class definition export attached).
Created a new data object based on the Product class.
Result
The new product is not visible in the shop. No error shows up either.
What I also tried
Based on the Index Service documentation I manually updated the index, without any effect.
$ php bin/console ecommerce:indexservice:bootstrap --update-index
Processing 1 Product in segments of 50, batches of 50, 1 round, 1 batch in 1 process
1/1 [============================] 100% < 1 sec/< 1 sec 48.5 MiB
Processed 1 Product.
Attached complete class definition export
{
"id": "PROD",
"description": "",
"modificationDate": 1669880184,
"parentClass": "\\App\\Model\\Product\\AbstractProduct",
"implementsInterfaces": "",
"listingParentClass": "",
"useTraits": "",
"listingUseTraits": "",
"allowInherit": true,
"allowVariants": true,
"showVariants": true,
"layoutDefinitions": {
"name": "pimcore_root",
"type": null,
"region": null,
"title": null,
"width": 0,
"height": 0,
"collapsible": false,
"collapsed": false,
"bodyStyle": null,
"datatype": "layout",
"permissions": null,
"children": [
{
"name": "Layout",
"type": null,
"region": null,
"title": "",
"width": "",
"height": "",
"collapsible": false,
"collapsed": false,
"bodyStyle": "",
"datatype": "layout",
"permissions": null,
"children": [
{
"name": "Base Data",
"type": null,
"region": null,
"title": "Base Data",
"width": "",
"height": "",
"collapsible": false,
"collapsed": false,
"bodyStyle": "",
"datatype": "layout",
"permissions": null,
"children": [
{
"name": "productName",
"title": "Product Name",
"tooltip": "",
"mandatory": true,
"noteditable": false,
"index": true,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "input",
"relationType": false,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"width": "",
"defaultValue": null,
"columnLength": 190,
"regex": "",
"regexFlags": [],
"unique": true,
"showCharCount": false,
"defaultValueGenerator": ""
},
{
"name": "localizedfields",
"title": "",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": null,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "localizedfields",
"relationType": false,
"invisible": false,
"visibleGridView": true,
"visibleSearch": true,
"children": [
{
"name": "description",
"title": "Description",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "textarea",
"relationType": false,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"width": "",
"height": "",
"maxLength": null,
"showCharCount": false,
"excludeFromSearchIndex": false
},
{
"name": "packaging",
"title": "Packaging",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "input",
"relationType": false,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"width": "",
"defaultValue": null,
"columnLength": 190,
"regex": "",
"regexFlags": [],
"unique": false,
"showCharCount": false,
"defaultValueGenerator": ""
}
],
"region": null,
"layout": null,
"width": "",
"height": "",
"maxTabs": null,
"border": false,
"provideSplitView": false,
"tabPosition": null,
"hideLabelsWhenTabsReached": null,
"fieldDefinitionsCache": null,
"permissionView": null,
"permissionEdit": null,
"labelWidth": 0,
"labelAlign": "left"
},
{
"name": "image",
"title": "Image",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "image",
"relationType": false,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"width": "",
"height": "",
"uploadPath": ""
},
{
"name": "group",
"title": "Group",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "manyToOneRelation",
"relationType": true,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"classes": [
{
"classes": "ProductGroup"
}
],
"pathFormatterClass": "",
"width": "",
"assetUploadPath": "",
"objectsAllowed": true,
"assetsAllowed": false,
"assetTypes": [],
"documentsAllowed": false,
"documentTypes": []
},
{
"name": "categories",
"title": "Categories",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "manyToManyObjectRelation",
"relationType": true,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"classes": [
{
"classes": "Category"
}
],
"pathFormatterClass": "",
"width": "",
"height": "",
"maxItems": null,
"visibleFields": "id,fullpath,name",
"allowToCreateNewObject": false,
"optimizedAdminLoading": false,
"enableTextSelection": false,
"visibleFieldDefinitions": []
}
],
"locked": false,
"fieldtype": "panel",
"layout": null,
"border": false,
"icon": "",
"labelWidth": 0,
"labelAlign": "left"
},
{
"name": "Attributes",
"type": null,
"region": null,
"title": "Attributes",
"width": "",
"height": "",
"collapsible": false,
"collapsed": false,
"bodyStyle": "",
"datatype": "layout",
"permissions": null,
"children": [
{
"name": "attributes",
"title": "Attributes",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "objectbricks",
"relationType": false,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"allowedTypes": [
"EdgebandingAttributes"
],
"maxItems": null,
"border": false
},
{
"name": "saleInformation",
"title": "Sale Information",
"tooltip": "",
"mandatory": false,
"noteditable": false,
"index": false,
"locked": false,
"style": "",
"permissions": null,
"datatype": "data",
"fieldtype": "objectbricks",
"relationType": false,
"invisible": false,
"visibleGridView": false,
"visibleSearch": false,
"allowedTypes": [
"SaleInformation"
],
"maxItems": null,
"border": false
}
],
"locked": false,
"fieldtype": "panel",
"layout": null,
"border": false,
"icon": "",
"labelWidth": 0,
"labelAlign": "left"
}
],
"locked": false,
"fieldtype": "tabpanel",
"border": false,
"tabPosition": null
}
],
"locked": false,
"fieldtype": "panel",
"layout": null,
"border": false,
"icon": null,
"labelWidth": 100,
"labelAlign": "left"
},
"icon": "",
"previewUrl": "",
"group": "Product Data",
"showAppLoggerTab": false,
"linkGeneratorReference": "",
"previewGeneratorReference": "",
"compositeIndices": [],
"generateTypeDeclarations": true,
"showFieldLookup": false,
"propertyVisibility": {
"grid": {
"id": true,
"key": false,
"path": true,
"published": true,
"modificationDate": true,
"creationDate": true
},
"search": {
"id": true,
"key": false,
"path": true,
"published": true,
"modificationDate": true,
"creationDate": true
}
},
"enableGridLocking": false
}
I finally got it to work (after my last answer, which was supposed to be a comment, my bad :D).
Did you check the following?
1 Class override
https://pimcore.com/docs/pimcore/current/Development_Documentation/Extending_Pimcore/Overriding_Models.html
in /config/ecommerce/base-ecommerce.yaml
pimcore:
    models:
        class_overrides:
            Pimcore\Model\DataObject\YourClass: App\Model\Product\YourClass
Make sure you clear the cache as shown in the documentation:
./bin/console cache:clear --no-warmup && ./bin/console pimcore:cache:clear
2 Check all Car class names in Model / Controller (the demo uses Car as its product class)
For example:
src/controller/productController.php
src/Model/Adminstyle/Car --> change to your class
src/Model/Car --> change to your class
3 Make sure that saving one of your products gets it into the index
I saw that on saving the object in the backend, I got a log entry saying that the object was not indexed.
https://pimcore.com/docs/pimcore/current/Development_Documentation/E-Commerce_Framework/Index_Service/Product_Index_Configuration/Data_Architecture_and_Indexing_Process.html
I had some other issues that were definitely not meant to be fixed the way I did, so see how far you get with this.

Receiving RequestMalformed error when doing Typesense upsert

I have the following interface in TypeScript:
export interface TypesenseAtlistedProEvent {
// IDs
id: string;
proId: string;
eventId: string;
startTime: Number;
stopTime: Number;
eventRate: Number;
remainingSlots: Number;
displayName: string;
photoURL: string;
indexOptions: string;
location: Number[];
}
and the following schema in Typesense:
{
"created_at": 1665530883,
"default_sorting_field": "location",
"fields": [
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "proId",
"optional": false,
"sort": false,
"type": "string"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "eventId",
"optional": false,
"sort": false,
"type": "string"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "startTime",
"optional": false,
"sort": true,
"type": "int64"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "stopTime",
"optional": false,
"sort": true,
"type": "int64"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "eventRate",
"optional": false,
"sort": true,
"type": "float"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "remainingSlots",
"optional": false,
"sort": true,
"type": "int32"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "displayName",
"optional": false,
"sort": false,
"type": "string"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "photoURL",
"optional": false,
"sort": false,
"type": "string"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "indexOptions",
"optional": false,
"sort": false,
"type": "string"
},
{
"facet": false,
"index": true,
"infix": false,
"locale": "",
"name": "location",
"optional": false,
"sort": true,
"type": "geopoint"
}
],
"name": "atlistedProEventIndex",
"num_documents": 0,
"symbols_to_index": [],
"token_separators": []
}
I attempt the upsert as in the following:
const indexedDoc: TypesenseAtlistedProEvent = {
id: proId + eventId,
proId: proId,
eventId: eventId,
startTime: publicEvent.startTime.seconds,
stopTime: publicEvent.stopTime.seconds,
eventRate: publicEvent.eventRate,
remainingSlots: publicEvent.remainingSlots,
displayName: tpi.displayName,
photoURL: tpi.photoURL,
indexOptions: tpi.indexOptions,
location: [tpi.lat, tpi.lng],
};
return await typesenseClient
.collections("atlistedProEventIndex")
.documents()
.upsert(indexedDoc)
.then(() => {
return {success: true, exit: 0};
})
I am getting the following error when the upsert runs:
RequestMalformed: Request failed with HTTP code 400 | Server said: [json.exception.type_error.302] type must be number
I am passing location as Number[] and trying to get that to populate the geopoint field in Typesense. This is not working, so it would be useful if:
I were able to locate the logs to go through. I would particularly like the logs provided by Typesense Cloud, and I am at a loss that I cannot find them.
I could pass in the geopoint as the right type in TypeScript. Right now, as you can see above, location is of type Number[], which, from the examples I saw, I assumed was right. It may also be that another field is off and I'm just missing it. Either way, I could really use some kind of server-side logging from Typesense Cloud.
The error message is a little confusing, but the core of the issue is that default_sorting_field can only be a numeric field, while it is currently set to a geopoint field (location); that is what the error is trying to convey.
So if you create a new collection without default_sorting_field, the error should not show up.
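For example, here is a minimal sketch of recreating the collection without default_sorting_field, using the Typesense Python client (the connection details are placeholders, and the field list mirrors the schema above):
import typesense

# Placeholder connection details - substitute your own cluster and key.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "YOUR_API_KEY",
    "connection_timeout_seconds": 2,
})

# Same fields as the schema above, but with no "default_sorting_field" key.
client.collections.create({
    "name": "atlistedProEventIndex",
    "fields": [
        {"name": "proId", "type": "string"},
        {"name": "eventId", "type": "string"},
        {"name": "startTime", "type": "int64", "sort": True},
        {"name": "stopTime", "type": "int64", "sort": True},
        {"name": "eventRate", "type": "float", "sort": True},
        {"name": "remainingSlots", "type": "int32", "sort": True},
        {"name": "displayName", "type": "string"},
        {"name": "photoURL", "type": "string"},
        {"name": "indexOptions", "type": "string"},
        {"name": "location", "type": "geopoint", "sort": True},
    ],
})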
If you want to sort by geo location, you want to use the sort_by parameter: https://typesense.org/docs/0.23.1/api/geosearch.html#searching-within-a-radius
let searchParameters = {
'q' : '*',
'query_by' : 'title',
'filter_by' : 'location:(48.90615915923891, 2.3435897727061175, 5.1 km)',
'sort_by' : 'location(48.853, 2.344):asc'
}
client.collections('companies').documents().search(searchParameters)
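For completeness, the same search in Python (a sketch using the Typesense Python client, mirroring the parameters above):
search_parameters = {
    "q": "*",
    "query_by": "title",
    "filter_by": "location:(48.90615915923891, 2.3435897727061175, 5.1 km)",
    "sort_by": "location(48.853, 2.344):asc",
}
client.collections["companies"].documents.search(search_parameters)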

Pub/Sub to BigQuery subscription with Avro schema ignores messages with a value for the last column

I have a Pub/Sub topic with an Avro schema, a BigQuery subscription, and a matching BigQuery table. To test this I use a Python script to publish a number of records.
Published records that have None or no value for the last column appear in BigQuery soon after publishing. As soon as a value is specified for the last column, the record does not appear. I changed the column order, and it is always the last column that triggers the issue.
How can I fix that?
Avro schema
{
"type": "record",
"name": "incoming_telemetry",
"doc": "Telemetry message from herby device",
"fields": [
{"name": "deviceId", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "waterTableRange", "type": ["null", "float"]},
{"name": "batteryCapacity", "type": ["null", "float"]},
{"name": "batteryCurrent", "type": ["null", "float"]},
{"name": "solarVoltage", "type": ["null", "float"]},
{"name": "batteryVoltage", "type": ["null", "float"]},
{"name": "temperature", "type": ["null", "float"]},
{"name": "wifiStrength", "type": ["null", "float"]},
{"name": "flowRate", "type": ["null", "float"]}
]
}
BigQuery schema
[
{
"name": "timestamp",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "deviceId",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "waterTableRange",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "batteryCapacity",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "batteryCurrent",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "solarVoltage",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "batteryVoltage",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "temperature",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "wifiStrength",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "flowRate",
"type": "FLOAT",
"mode": "NULLABLE"
}
]
Python script
import io
from datetime import datetime, timezone
import avro.schema
import google.auth
from avro.io import DatumWriter, BinaryEncoder, BinaryDecoder, DatumReader
from google.cloud import pubsub_v1, bigquery
INCOMING_TOPIC_ID = 'incoming-telemetry-v1.0'
SCHEMA_FILE = '../../schemas/incoming_telemetry_v1.0.avsc'
DATASET_ID = 'telemetry'
TABLE_ID = 'raw-telemetry'
GCP_TIMESTAMP_FORMAT = '%Y-%m-%dT%H:%M:%S.%fZ'
timestamp = datetime.now(timezone.utc).strftime(GCP_TIMESTAMP_FORMAT)
records = [
{
'deviceId': '1',
'timestamp': timestamp,
'waterTableRange': None,
'batteryCapacity': None,
'batteryCurrent': None,
'solarVoltage': None,
'batteryVoltage': None,
'temperature': None,
'wifiStrength': None,
'flowRate': None,
},
{
'deviceId': '2',
'timestamp': timestamp,
},
{
'deviceId': '3',
'timestamp': timestamp,
'waterTableRange': 1.1,
'batteryCapacity': 2.2,
'batteryCurrent': 3.3,
'solarVoltage': 4.4,
'batteryVoltage': 5.5,
'temperature': 6.6,
'wifiStrength': 7.7,
'flowRate': None,
},
{
'deviceId': '4',
'timestamp': timestamp,
'wifiStrength': 7.7,
},
{
'deviceId': '5',
'timestamp': timestamp,
'waterTableRange': 1.1,
'batteryCapacity': 2.2,
'batteryCurrent': 3.3,
'solarVoltage': 4.4,
'batteryVoltage': 5.5,
'temperature': 6.6,
'wifiStrength': 7.7,
'flowRate': 8.8,
},
{
'deviceId': '6',
'timestamp': timestamp,
'flowRate': 8.8,
},
]
project_id = google.auth.default()[1]
publisher_client = pubsub_v1.PublisherClient()
topic_path = publisher_client.topic_path(project_id, INCOMING_TOPIC_ID)
file = open(SCHEMA_FILE, 'rb')
schema = avro.schema.parse(file.read())
writer = DatumWriter(schema)
for record in records:
    byte_stream = io.BytesIO()
    encoder = BinaryEncoder(byte_stream)
    writer.write(record, encoder)
    data = byte_stream.getvalue()
    byte_stream.flush()
    future = publisher_client.publish(topic_path, data)
    print(f"Published message ID: {future.result()}")
Records 1-4 appear, while records 5 and 6 are nowhere to be found.
[Screenshot: results in BigQuery]

How to assign values in dict1 to dict2 if the keys are the same, without a loop

I have two dicts, and one is a template (its values are empty). I want to assign the values in dict2 to dict1 if the key exists in both dicts, but without using a loop (and preferably without accessing the two dicts key by key), operating just at the dictionary level.
the two dicts are like:
dict1:
{
    "indexed": true,
    "internalType": "",
    "name": "",
    "type": "",
    "result": ""
}
dict2:
{
    "name": "_target",
    "type": "address",
    "result": "0xED3A954c0ADFC8"
}
In the end I want to get:
{
"indexed": true,
"internalType": "",
"name": "_target",
"type": "address",
"result": "0xED3A954c0ADFC8"
}
without accessing the two dicts (no loop).
In my opinion this could be done with set operations, but I can't manage to do it.
Possible solutions are the following:
dict1 = {"indexed": True, "internalType": "", "name": "", "type": "", "result": ""}
dict2 = {"name": "_target", "type": "address", "result": "0xED3A954c0ADFC8"}
Python 3.9+:
dict_3 = dict1 | dict2
Python 3.5+:
dict_3 = {**dict1, **dict2}
Prints
print(dict_3)
{'indexed': True,
'internalType': '',
'name': '_target',
'type': 'address',
'result': '0xED3A954c0ADFC8'}
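If mutating dict1 in place is acceptable, dict.update does the same merge without creating a third dict:
# Copies dict2's values into dict1, overwriting matching keys.
dict1.update(dict2)
print(dict1)
# {'indexed': True, 'internalType': '', 'name': '_target', 'type': 'address', 'result': '0xED3A954c0ADFC8'}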

How to overwrite pyspark DataFrame schema without data scan?

This question is related to https://stackoverflow.com/a/37090151/1661491. Let's assume I have a PySpark DataFrame with a certain schema, and I would like to overwrite that schema with a new schema that I know is compatible. I could do:
df: DataFrame
new_schema = ...
df.rdd.toDF(schema=new_schema)
Unfortunately this triggers computation, as described in the link above. Is there a way to do it at the metadata level (or lazily), without eagerly triggering computation or conversions?
Edit, note:
the schema can be arbitrarily complicated (nested etc)
new schema includes updates to description, nullability and additional metadata (bonus points for updates to the type)
I would like to avoid writing a custom query expression generator, unless there's one already built into Spark that can generate query based on the schema/StructType
I've ended up diving into this a bit myself, and I'm curious about your opinion on my workaround/POC. See https://github.com/ravwojdyla/spark-schema-utils. It transforms expressions, and updates attributes.
Let's say I have two schemas. The first one has no metadata; let's call it schema_wo_metadata:
{
"fields": [
{
"metadata": {},
"name": "oa",
"nullable": false,
"type": {
"containsNull": true,
"elementType": {
"fields": [
{
"metadata": {},
"name": "ia",
"nullable": false,
"type": "long"
},
{
"metadata": {},
"name": "ib",
"nullable": false,
"type": "string"
}
],
"type": "struct"
},
"type": "array"
}
},
{
"metadata": {},
"name": "ob",
"nullable": false,
"type": "double"
}
],
"type": "struct"
}
The second one has extra metadata on the inner field (ia) and the outer field (ob); let's call it schema_wi_metadata:
{
"fields": [
{
"metadata": {},
"name": "oa",
"nullable": false,
"type": {
"containsNull": true,
"elementType": {
"fields": [
{
"metadata": {
"description": "this is ia desc"
},
"name": "ia",
"nullable": false,
"type": "long"
},
{
"metadata": {},
"name": "ib",
"nullable": false,
"type": "string"
}
],
"type": "struct"
},
"type": "array"
}
},
{
"metadata": {
"description": "this is ob desc"
},
"name": "ob",
"nullable": false,
"type": "double"
}
],
"type": "struct"
}
And now let's say I have a dataset with the schema_wo_metadata schema, and want to swap the schema with schema_wi_metadata:
from pyspark.sql import SparkSession
from pyspark.sql import Row, DataFrame
from pyspark.sql.types import StructType

# I assume these get generated/specified somewhere
schema_wo_metadata: StructType = ...
schema_wi_metadata: StructType = ...

# You need my extra package
spark = SparkSession.builder \
    .config("spark.jars.packages", "io.github.ravwojdyla:spark-schema-utils_2.12:0.1.0") \
    .getOrCreate()

# Dummy data with `schema_wo_metadata` schema:
df = spark.createDataFrame(data=[Row(oa=[Row(ia=0, ib=1)], ob=3.14),
                                 Row(oa=[Row(ia=2, ib=3)], ob=42.0)],
                           schema=schema_wo_metadata)

# Swap in the schema with metadata:
_jdf = spark._sc._jvm.io.github.ravwojdyla.SchemaUtils.update(df._jdf, schema_wi_metadata.json())
new_df = DataFrame(_jdf, df.sql_ctx)
Now the new_df has the schema_wi_metadata, e.g.:
new_df.schema["oa"].dataType.elementType["ia"].metadata
# -> {'description': 'this is ia desc'}
Any opinions?
FYI, a quick update: this functionality was added to Spark via https://github.com/apache/spark/pull/37011 and will be released in version 3.4.0.
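If I read that PR right, on Spark 3.4+ the workaround above reduces to a built-in call; a minimal sketch, assuming the schemas are compatible as before:
# DataFrame.to reconciles the DataFrame to the given schema at the plan level.
new_df = df.to(schema_wi_metadata)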
