How to overwrite pyspark DataFrame schema without data scan? - apache-spark

This question is related to https://stackoverflow.com/a/37090151/1661491. Let's assume I have a pyspark DataFrame with a certain schema, and I would like to overwrite that schema with a new schema that I know is compatible. I could do:
df: DataFrame
new_schema = ...
df.rdd.toDF(schema=new_schema)
Unfortunately this triggers computation, as described in the link above. Is there a way to do that at the metadata level (or lazily), without eagerly triggering computation or conversions?
Edit, notes:
the schema can be arbitrarily complicated (nested, etc.)
the new schema includes updates to descriptions, nullability and additional metadata (bonus points for updates to the type)
I would like to avoid writing a custom query expression generator, unless there is one already built into Spark that can generate a query based on the schema/StructType (a rough sketch of that kind of select-based approach is shown below)
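For context, here is a rough sketch of the kind of select-based generator that note refers to; it is my own illustration, not something from Spark. It stays lazy, but it only rewrites top-level columns and cannot express nullability changes or nested-field metadata, which is exactly what the notes above ask for.

from pyspark.sql import functions as F

def apply_top_level_schema(df, new_schema):
    # Hypothetical helper, for illustration only: rebuild each top-level
    # column with the target type and metadata. Nested-field metadata and
    # nullability are not touched.
    return df.select([
        F.col(f.name).cast(f.dataType).alias(f.name, metadata=f.metadata)
        for f in new_schema.fields
    ])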

I've ended up diving into this a bit myself, and I'm curious about your opinion on my workaround/POC. See https://github.com/ravwojdyla/spark-schema-utils. It transforms expressions and updates attributes.
Let's say I have two schemas, the first one without any metadata; let's call it schema_wo_metadata:
{
  "fields": [
    {
      "metadata": {},
      "name": "oa",
      "nullable": false,
      "type": {
        "containsNull": true,
        "elementType": {
          "fields": [
            {
              "metadata": {},
              "name": "ia",
              "nullable": false,
              "type": "long"
            },
            {
              "metadata": {},
              "name": "ib",
              "nullable": false,
              "type": "string"
            }
          ],
          "type": "struct"
        },
        "type": "array"
      }
    },
    {
      "metadata": {},
      "name": "ob",
      "nullable": false,
      "type": "double"
    }
  ],
  "type": "struct"
}
The second one has extra metadata on the inner (ia) and outer (ob) fields; let's call it schema_wi_metadata:
{
  "fields": [
    {
      "metadata": {},
      "name": "oa",
      "nullable": false,
      "type": {
        "containsNull": true,
        "elementType": {
          "fields": [
            {
              "metadata": {
                "description": "this is ia desc"
              },
              "name": "ia",
              "nullable": false,
              "type": "long"
            },
            {
              "metadata": {},
              "name": "ib",
              "nullable": false,
              "type": "string"
            }
          ],
          "type": "struct"
        },
        "type": "array"
      }
    },
    {
      "metadata": {
        "description": "this is ob desc"
      },
      "name": "ob",
      "nullable": false,
      "type": "double"
    }
  ],
  "type": "struct"
}
And now let's say I have a dataset with the schema_wo_metadata schema and want to swap that schema for schema_wi_metadata:
from pyspark.sql import SparkSession
from pyspark.sql import Row, DataFrame
from pyspark.sql.types import StructType

# I assume these get generated/specified somewhere
schema_wo_metadata: StructType = ...
schema_wi_metadata: StructType = ...

# You need my extra package
spark = SparkSession.builder \
    .config("spark.jars.packages", "io.github.ravwojdyla:spark-schema-utils_2.12:0.1.0") \
    .getOrCreate()

# Dummy data with `schema_wo_metadata` schema:
df = spark.createDataFrame(data=[Row(oa=[Row(ia=0, ib=1)], ob=3.14),
                                 Row(oa=[Row(ia=2, ib=3)], ob=42.0)],
                           schema=schema_wo_metadata)

# Swap in the new schema at the metadata level (no data scan):
_jdf = spark._sc._jvm.io.github.ravwojdyla.SchemaUtils.update(df._jdf, schema_wi_metadata.json())
new_df = DataFrame(_jdf, df.sql_ctx)
Now new_df has the schema_wi_metadata schema, e.g.:
new_df.schema["oa"].dataType.elementType["ia"].metadata
# -> {'description': 'this is ia desc'}
Any opinions?

FYI, a quick update: this functionality was added to Spark via https://github.com/apache/spark/pull/37011 and will be released in version 3.4.0.
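For reference, a minimal sketch of the new API, assuming I recall the released name correctly (in PySpark 3.4+ it is exposed as DataFrame.to), which reconciles a DataFrame against a target schema without the RDD round-trip:

# Requires Spark >= 3.4.0. `to` reconciles column order, types and
# (as I understand it) field metadata against the given schema lazily.
new_df = df.to(schema_wi_metadata)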

Related

Azure data-factory can't load data successfully through PolyBase if the source data in the last column of the first row is null

I am trying to use Azure Data Factory to load data from Azure Blob Storage to Azure Data Warehouse.
The relevant data is as below:
source csv:
1,james,
2,john,usa
sink table:
CREATE TABLE test_null (
id int NOT NULL,
name nvarchar(128) NULL,
address nvarchar(128) NULL
)
source dataset:
{
"name": "test_null_input",
"properties": {
"linkedServiceName": {
"referenceName": "StagingBlobStorage",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "1.csv",
"folderPath": "test_null",
"container": "adf"
},
"columnDelimiter": ",",
"escapeChar": "",
"firstRowAsHeader": false,
"quoteChar": ""
},
"schema": []
}
}
sink dataset:
{
"name": "test_null_output",
"properties": {
"linkedServiceName": {
"referenceName": "StagingAzureSqlDW",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "AzureSqlDWTable",
"schema": [
{
"name": "id",
"type": "int",
"precision": 10
},
{
"name": "name",
"type": "nvarchar"
},
{
"name": "address",
"type": "nvarchar"
}
],
"typeProperties": {
"schema": "dbo",
"table": "test_null"
}
}
}
pipeline:
{
"name": "test_input",
"properties": {
"activities": [
{
"name": "Copy data1",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "0.12:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"enablePartitionDiscovery": false
},
"formatSettings": {
"type": "DelimitedTextReadSettings"
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings": {
"rejectValue": 0,
"rejectType": "value",
"useTypeDefault": false,
"treatEmptyAsNull": true
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": {
"ordinal": 1
},
"sink": {
"name": "id"
}
},
{
"source": {
"ordinal": 2
},
"sink": {
"name": "name"
}
},
{
"source": {
"ordinal": 3
},
"sink": {
"name": "address"
}
}
]
}
},
"inputs": [
{
"referenceName": "test_null_input",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "test_null_output",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
The last column of the first row is null, so when I run the pipeline it throws the error below:
ErrorCode=UserErrorInvalidColumnMappingColumnNotFound,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: '{"Prop_0":"id","Prop_1":"name","Prop_2":"address"}', Detailed message: Column 'Prop_2' defined in column mapping cannot be found in Source structure.. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'
I tried setting treatEmptyAsNull to true, but got the same error. I tried setting skipLineCount to 1 and that works, so it seems the null in the last column of the first row affects the loading of the entire file. Even stranger, it also works if I enable staging, without setting treatEmptyAsNull or skipLineCount at all. In my scenario staging should not be necessary, since the copy goes directly from Blob to the data warehouse; routing Blob to Blob and then Blob to the warehouse seems unreasonable and incurs additional data movement charges. Why doesn't setting treatEmptyAsNull work, and why does enabling staging work? This seems to make no sense.
I have reproduced the above with your pipeline JSON and got the same error.
This error occurred because, as per your JSON, this is your copy data mapping between source and sink.
As per the above mapping, you should have Prop_0, Prop_1 and Prop_2 as headers.
Here, since you didn't check First row as header in your source file, it takes Prop_0 and Prop_1 as headers. Since there is a null value in your first row, there is no Prop_2 column, and that is the reason it gives the error for that column.
To resolve it, give your CSV file a proper header row, like below.
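For example, the source file from the question with a header row added (names chosen here to match the sink columns) would look like this:

id,name,address
1,james,
2,john,usa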
Then check First row as header in the source dataset.
It will give the mapping like below when you import.
Now it will execute successfully, as it did for me.
Result:
You can see that the empty value is taken as NULL in the target table.

Convert Glue column datatype to Spark metadata

I have a Glue column whose data type in Glue is:
struct<quantity:bigint,unit:bigint>
However, when Spark infers this schema, it converts this Glue type to Spark metadata and saves it to the Glue table properties as follows:
"name": "columnName",
"type": {
"type": "struct",
"fields": [
{
"name": "quantity",
"type": "long",
"nullable": true,
"metadata": {}
},
{
"name": "unit",
"type": "long",
"nullable": true,
"metadata": {}
}
]
},
"nullable": true,
"metadata": {} }
Is there a library or any built-in function in Glue or Spark that can help me with the conversion of a Glue column type to Spark metadata in Java?
I have to convert those Glue data types to Spark metadata.
The incoming Glue column data types can also be nested structures of maps, arrays and structs.
Another example of a Glue data type:
struct<column1:string,averageHeight:double,employeeName:string,firstName:string,secondName:string,listOfBooks:bigint,price:bigint,studentId:bigint,offerPrice:struct<quantityOfBooks:bigint,class:bigint>,bookStore:string,reviewCount:bigint,author:string,title:string,studentRollNo:string>
Spark conversion:
{
"name": "studentData",
"type": {
"type": "struct",
"fields": [
{
"name": "column1",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name":"averageHeight",
"type": "double",
"nullable": true,
"metadata": {}
},
{
"name": "employeeName",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "firstName",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "secondName",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "listofBooks",
"type": "long",
"nullable": true,
"metadata": {}
},
{
"name": "price",
"type": "long",
"nullable": true,
"metadata": {}
},
{
"name": "studentId",
"type": "long",
"nullable": true,
"metadata": {}
},
{
"name": "offerPrice",
"type": {
"type": "struct",
"fields": [
{
"name": "quantityOfBooks",
"type": "long",
"nullable": true,
"metadata": {}
},
{
"name": "class",
"type": "long",
"nullable": true,
"metadata": {}
}
]
},
"nullable": true,
"metadata": {}
},
{
"name": "bookStore",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "reviewCount",
"type": "long",
"nullable": true,
"metadata": {}
},
{
"name": "author",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "title",
"type": "string",
"nullable": true,
"metadata": {}
},
{
"name": "studentRollNo",
"type": "string",
"nullable": true,
"metadata": {}
}
]
},
"nullable": true,
"metadata": {}
}
Note: I need to do this in Java.
I'm aware of DataFrames in Spark and of converting them via df.prettyJson to get the Spark metadata form of the Glue type; however, I need to do this conversion in Java code.
What is the best possible approach for this conversion?
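One possible direction, offered as a sketch rather than a tested answer: Spark can parse Hive/Glue-style DDL type strings itself, so you may be able to lean on that instead of writing your own converter. The idea is illustrated in PySpark below (an active SparkSession named spark is assumed); the same parsing should be reachable from Java through SparkSession.sql, or through DataType.fromDDL if that method is available in your Spark version.

import json

# Let Spark parse the Glue DDL-style type string and emit its own JSON form.
# CAST(NULL AS <type>) is just a trick to obtain the parsed DataType without
# reading any data.
ddl = "struct<quantity:bigint,unit:bigint>"
parsed = spark.sql(f"SELECT CAST(NULL AS {ddl}) AS col").schema["col"].dataType
print(json.dumps(parsed.jsonValue(), indent=2))  # Spark metadata JSON for the type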

How to traverse all Fields in all nested Records in an Avro file and check a certain property in their Types?

I have an Avro file which has records; in their fields (which have union types) there are other records, which also have fields with union types, and some types have a certain property connect.name which I need to check to see whether it equals io.debezium.time.NanoTimestamp. I'm doing this in Apache NiFi using an ExecuteScript processor with a Groovy script.
A shortened example of the Avro schema:
{
"type": "record",
"name": "Envelope",
"namespace": "data.none.bpm.pruitsmdb_nautilus_dbo.fast_frequency_tables.avro.test",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "Id",
"type": {
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "UNIQUEIDENTIFIER",
"__debezium.source.column.length": "36"
}
}
},
{
"name": "CreatedOn",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "DATETIME2",
"__debezium.source.column.length": "27",
"__debezium.source.column.scale": "7"
},
"connect.name": "io.debezium.time.NanoTimestamp"
}
],
"default": null
},
{
"name": "CreatedById",
"type": [
"null",
{
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "UNIQUEIDENTIFIER",
"__debezium.source.column.length": "36"
}
}
],
"default": null
}
],
"connect.name": "data.none.bpm.pruitsmdb_nautilus_dbo.fast_frequency_tables.avro.test.Value"
}
],
"default": null
},
{
"name": "after",
"type": [
"null",
"Value"
],
"default": null
},
{
"name": "source",
"type": {
"type": "record",
"name": "Source",
"namespace": "io.debezium.connector.sqlserver",
"fields": [
{
"name": "version",
"type": "string"
},
{
"name": "ts_ms",
"type": "long"
},
{
"name": "snapshot",
"type": [
{
"type": "string",
"connect.version": 1,
"connect.parameters": {
"allowed": "true,last,false"
},
"connect.default": "false",
"connect.name": "io.debezium.data.Enum"
},
"null"
],
"default": "false"
}
],
"connect.name": "io.debezium.connector.sqlserver.Source"
}
},
{
"name": "op",
"type": "string"
},
{
"name": "ts_ms",
"type": [
"null",
"long"
],
"default": null
}
],
"connect.name": "data.none.bpm.pruitsmdb_nautilus_dbo.fast_frequency_tables.avro.test.Envelope"
}
My Groovy code, which obviously seems to check only the top-level records (and I'm also not sure whether I'm checking the connect.name property correctly):
reader.forEach{ GenericRecord record ->
    record.getSchema().getFields().forEach{ Schema.Field field ->
        try {
            field.schema().getTypes().forEach{ Schema typeSchema ->
                if (typeSchema.getProp("connect.name") == "io.debezium.time.NanoTimestamp") {
                    record.put(field.name(), Long.valueOf(record.get(field.name()).toString().substring(0, 13)))
                    typeSchema.addProp("logicalType", "timestamp-millis")
                }
            }
        } catch(Exception ex) {
            println("Catching the exception")
        }
    }
    writer.append(record)
}
My question is: how do I traverse all nested records (there are top-level record fields that have a "record" type, with records inside) in the Avro file? And when traversing their fields, how do I correctly check whether one of their types (which may appear in a union) has a property connect.name == io.debezium.time.NanoTimestamp and, if so, perform a transformation on the field value and add a logicalType property to the field's type?
I think you are looking for recursion here - there should be a function that accepts a Record as a parameter. When you hit a field that is a nested record, you call this function recursively.
Jiri's suggested approach worked; a recursive function was used. Here's the full code:
import org.apache.avro.*
import org.apache.avro.file.*
import org.apache.avro.generic.*

//define input and output files
DataInputStream inputStream = new File('input.avro').newDataInputStream()
DataOutputStream outputStream = new File('output.avro').newDataOutputStream()

DataFileStream<GenericRecord> reader = new DataFileStream<>(inputStream, new GenericDatumReader<GenericRecord>())
DataFileWriter<GenericRecord> writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>())

def contentSchema = reader.schema //source Avro schema
def records = [] //list used to temporarily store the processed records

//function which traverses all records (including nested ones)
def convertAvroNanosecToMillisec(record){
    record.getSchema().getFields().forEach{ Schema.Field field ->
        if (record.get(field.name()) instanceof org.apache.avro.generic.GenericData.Record){
            convertAvroNanosecToMillisec(record.get(field.name()))
        }
        if (field.schema().getType().getName() == "union"){
            field.schema().getTypes().forEach{ Schema unionTypeSchema ->
                if (unionTypeSchema.getProp("connect.name") == "io.debezium.time.NanoTimestamp"){
                    record.put(field.name(), Long.valueOf(record.get(field.name()).toString().substring(0, 13)))
                    unionTypeSchema.addProp("logicalType", "timestamp-millis")
                }
            }
        } else {
            if (field.schema().getProp("connect.name") == "io.debezium.time.NanoTimestamp"){
                record.put(field.name(), Long.valueOf(record.get(field.name()).toString().substring(0, 13)))
                field.schema().addProp("logicalType", "timestamp-millis")
            }
        }
    }
    return record
}

//read all records from the incoming file and add them to the temporary list
reader.forEach{ GenericRecord contentRecord ->
    records.add(convertAvroNanosecToMillisec(contentRecord))
}

//create a file writer object with the adjusted schema
writer.create(contentSchema, outputStream)

//add records to the output file from the temporary list and close the writer
records.forEach{ GenericRecord contentRecord ->
    writer.append(contentRecord)
}
writer.close()

Loopback sends wrong datatype

In my MongoDB I have several arrays, but when I load those arrays from the database they are, strangely, objects instead of arrays.
This strange behaviour started a couple of days ago; before that, everything worked fine and I got arrays out of the database.
Does LoopBack have some strange flags, set automatically, that transform my arrays to objects or something like that?
Currently I have the newest versions of all my packages and have already tried older versions, but nothing changes this behaviour.
At first there was also a problem with saving the arrays: sometimes they were saved as objects, but since I removed all null objects from the database, only arrays have been saved.
The problem occurs with the sections array.
My model JSON is:
{
"name": "Track",
"plural": "Tracks",
"base": "PersistedModel",
"idInjection": true,
"options": {
"validateUpsert": true
},
"properties": {
"alreadySynced": {
"type": "boolean"
},
"approved": {
"type": "boolean"
},
"deletedByClient": {
"type": "boolean",
"default": false
},
"sections": {
"type": "object",
"required": true
},
"type": {
"type": "string"
},
"email": {
"type": "string",
"default": ""
},
"name": {
"type": "string",
"default": "Neuer Track"
},
"reason": {
"type": "string",
"default": ""
},
"date": {
"type": "date"
},
"duration": {
"type": "number",
"default": 0
},
"correctnessscore": {
"type": "number",
"default": 0
},
"evaluation": {
"type": "object"
}
},
"validations": [],
"relations": {},
"acls": [],
"methods": {}
}
I have also already tried changing the type from object to array, but without success.
Well, I am not seeing any array type in your model and I am not sure what exactly your problem is.
Does LoopBack have some strange flags, set automatically, that transform my arrays to objects or something like that?
No, LoopBack has no such flags and doesn't transform any data type unless you set it!
So if you define a property as an object and you pass in an array without validating the data type, this will change the type of your data and save it as an object instead of an array.
Let's define an array in your Track model:
"property": {
"type": "array"
}
Do you need an array of objects?
"property": {
"type": ["object"]
}
Strings?
"property": {
"type": ["string"]
}
Numbers?
"property": {
"type": ["number"]
}
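Applied to the sections property from your Track model (assuming each section is an object), the definition would look something like:
"sections": {
  "type": ["object"],
  "required": true
}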
Read more about LoopBack types in the LoopBack documentation.

How to serialize complex types using Microsoft Avro library

I am trying to serialize generic records (expressed as JSON strings) as Avro objects using the Microsoft.Hadoop.Avro library.
I've been following the tutorial for generic records HERE. However, the records I am trying to serialize are more complex than the sample code provided by Microsoft (Location), with nested properties inside the JSON.
Here is a sample of a record I want to serialize in Avro:
{
"deviceId": "UnitTestDevice01",
"serializationFormat": "avro",
"messageType": "state",
"messageVersion": "avrov2.0",
"arrayProp": [
{
"itemProp1": "arrayValue1",
"itemProp2": "arrayValue2"
},
{
"itemProp1": "arrayValue3",
"itemProp2": "arrayValue4"
}
]
}
For info, here is the Avro schema I can extract:
{
"type": "record",
"namespace": "xxx.avro",
"name": "MachineModel",
"fields": [{
"name": "deviceId",
"type": ["string", "null"]
}, {
"name": "serializationFormat",
"type": ["string", "null"]
}, {
"name": "messageType",
"type": ["string", "null"]
}, {
"name": "messageVersion",
"type": ["string", "null"]
}, {
"name": "array",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "array_record",
"fields": [{
"name": "arrayProp1",
"type": ["string", "null"]
}, {
"name": "arrayProp2",
"type": ["string", "null"]
}]
}
}
}]
}
I have managed to extract the correct schema for this object, but I can't get the code right to take the schema and create a correct Avro record.
Can someone provide some pointers on how I can use the AvroSerializer or AvroContainer classes to produce a valid Avro object from this JSON payload and this Avro schema? The samples from Microsoft are too simple to work with complex objects, and I have not been able to find relevant samples online either.
