Spark dynamic partitioning: SchemaColumnConvertNotSupportedException on read - apache-spark

Question
Is there any way to store data with different (not compatible) schemas in different partitions?
The issue
I use PySpark v2.4.5, the Parquet format, and dynamic partitioning with the following hierarchy: BASE_PATH/COUNTRY=US/TYPE=sms/YEAR=2020/MONTH=04/DAY=10/. Unfortunately, it can't be changed.
I get SchemaColumnConvertNotSupportedException on read. That happens because the schema differs between types (i.e. between sms and mms). It looks like Spark is trying to merge the two schemas on read under the hood.
To be more precise, I can read data for F.col('TYPE') == 'sms', because the mms schema can be converted to the sms one. But when I filter by F.col('TYPE') == 'mms', Spark fails.
Code
# Works, because Spark doesn't try to merge schemas
spark_session \
.read \
.option('mergeSchema', False) \
.parquet(BASE_PATH + '/COUNTRY_CODE=US/TYPE=mms/YEAR=2020/MONTH=04/DAY=07/HOUR=00') \
.show()
# Doesn't work, because Spark tries to merge the schemas for TYPE=sms and TYPE=mms. The mms data can't be converted to the merged schema.
# Types are correct: according to explain(), Spark treats the date partitions as integers
# Predicate pushdown isn't used for some reason: there is no PushedFilters entry in the explained plan
spark_session \
.read \
.option('mergeSchema', False) \
.parquet(BASE_PATH) \
.filter(F.col('COUNTRY') == 'US') \
.filter(F.col('TYPE') == 'mms') \
.filter(F.col('YEAR') == 2020) \
.filter(F.col('MONTH') == 4) \
.filter(F.col('DAY') == 10) \
.show()

In case it is useful for someone: it is possible to have different data in different partitions. To stop Spark from inferring the schema for Parquet, specify the schema explicitly:
spark_session \
.read \
.schema(some_schema) \
.option('mergeSchema', False) \
.parquet(BASE_PATH) \
.filter(F.col('COUNTRY') == 'US') \
.filter(F.col('TYPE') == 'mms') \
.filter(F.col('YEAR') == 2020) \
.filter(F.col('MONTH') == 4) \
.filter(F.col('DAY') == 10) \
.show()
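Since reading a single partition directory directly works (the first snippet in the question), another option may be to build the partition path explicitly and pass Spark's basePath read option so that the partition columns are still added to the result. This is a minimal sketch, assuming the directory layout from the question; partition_path and read_partition are hypothetical helper names, and whether this avoids the exception for a given dataset has not been verified here:

```python
def partition_path(base_path, country, msg_type, year, month, day):
    """Build BASE_PATH/COUNTRY=../TYPE=../YEAR=../MONTH=../DAY=.. (layout assumed from the question)."""
    return (
        f"{base_path.rstrip('/')}/COUNTRY={country}/TYPE={msg_type}"
        f"/YEAR={year:04d}/MONTH={month:02d}/DAY={day:02d}"
    )


def read_partition(spark, base_path, country, msg_type, year, month, day):
    """Read one partition directory only, so files of other TYPEs are not loaded."""
    path = partition_path(base_path, country, msg_type, year, month, day)
    return (
        spark.read
        .option("basePath", base_path)  # keep COUNTRY, TYPE, YEAR, ... as columns
        .option("mergeSchema", False)
        .parquet(path)
    )


print(partition_path("/data/events/", "US", "mms", 2020, 4, 10))
# /data/events/COUNTRY=US/TYPE=mms/YEAR=2020/MONTH=04/DAY=10
```

With an active session this would be called as read_partition(spark_session, BASE_PATH, 'US', 'mms', 2020, 4, 10).show().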

Related

Databricks Lakehouse Medallion Architecture Python Sample

Does anyone know of a Python sample of the medallion architecture?
Something like this SQL sample: https://www.databricks.com/notebooks/delta-lake-cdf.html
In the simplest case it's just a chain of Spark's .readStream -> some transformations -> .writeStream (it's possible to do it in a non-streaming fashion too, but then you spend more time tracking what has changed, etc.). With plain Spark + Databricks Auto Loader it will be:
# bronze
raw_df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load(input_data)
# use trigger(availableNow=True) if you want to mimic batch-like processing
raw_df.writeStream.format("delta") \
    .option("checkpointLocation", bronze_checkpoint) \
    .trigger(...) \
    .start(bronze_path)
# silver
bronze_df = spark.readStream.format("delta").load(bronze_path)
# do transformations on bronze_df
silver_df = bronze_df.filter(...)
silver_df.writeStream.format("delta") \
    .option("checkpointLocation", silver_checkpoint) \
    .trigger(...) \
    .start(silver_path)
# gold
silver_df = spark.readStream.format("delta").load(silver_path)
gold = silver_df.groupBy(...)
But really, it becomes much simpler if you use Delta Live Tables: then you concentrate just on the transformations, not on how to write the data, etc. Something like this:
import dlt
@dlt.table
def bronze():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load(input_data)
@dlt.table
def silver():
    bronze = dlt.read_stream("bronze")
    return bronze.filter(...)
@dlt.table
def gold():
    silver = dlt.read_stream("silver")
    return silver.groupBy(...)

Can I exclude the column used for partitioning when writing to parquet?

I need to create Parquet files, reading from JDBC. The table is quite big and all columns are varchars, so I created a new column with a random int to use for partitioning.
My JDBC read looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
And my write to Parquet looks something like this:
data_df.write.mode("overwrite").parquet("parquetfile.parquet").partitionBy('random_number')
The generated Parquet files also contain the 'random_number' column, but I only created that column for partitioning. Is there a way to exclude it when writing the Parquet files?
Thanks for any help, I'm new to Spark :)
I'd like to exclude the random_number column, but I don't know whether that is possible if I need the column for partitioning.
If you want to repartition in memory by a column without writing it out, call .repartition(col("random_number")) and then drop the column before writing your data:
from pyspark.sql.functions import col
data_df = sparkSession.read.format('jdbc') \
    .option('url', 'jdbc:netezza://host:port/db') \
    .option('dbtable', """(SELECT * FROM schema.table) A""") \
    .option('user', 'user') \
    .option('password', 'password') \
    .option('partitionColumn','random_number') \
    .option('lowerBound','1') \
    .option('upperBound','200') \
    .option('numPartitions','200') \
    .load() \
    .repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")

Create Spark output streams with function

I use Databricks Auto Loader to ingest files that contain data with different schemas and want to write them in corresponding delta tables using update mode.
There may be many (>15) different message types in a stream, so I'd have to write an output stream for every one of them. There is an "upsert" function for every table.
Can this be condensed using a function (example given below) to save a few keystrokes?
upload_path = '/example'
# Set up the stream to begin reading incoming files from the
# upload_path location.
df = spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'avro') \
.load(upload_path)
# filter messages and apply JSON schema
table1_df = filter_and_transform(df, json_schema1)
table2_df = filter_and_transform(df, json_schema2)
table3_df = filter_and_transform(df, json_schema3)
# each table has its own upsert function
def create_output_stream(df, table_name, upsert_function):
    # Create the stream writer and return it (the caller starts it).
    return df.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .foreachBatch(upsert_function) \
        .queryName(f"autoLoader_query_{table_name}") \
        .option("checkpointLocation", f"dbfs:/delta/somepath/{table_name}") \
        .outputMode("update")
output_stream1 = create_output_stream(table1_df, "table_name1", upsert_function1).start() # start stream in outer environment
output_stream2 = create_output_stream(table2_df, "table_name2", upsert_function2).start()
output_stream3 = create_output_stream(table3_df, "table_name3", upsert_function3).start()
Yes, of course it's possible to do it this way - it's quite a standard pattern.
But you need to take one thing into consideration: if your input data isn't partitioned by message type, you will scan the same files multiple times (once for each message type). An alternative is to perform the filtering and upsert of all message types in a single foreachBatch, like this:
df = spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'avro') \
.load(upload_path)
def do_all_upserts(df, epoch):
    # cache the micro-batch so it is not re-read for every message type
    df.cache()
    table1_df = filter_and_transform(df, json_schema1)
    table2_df = filter_and_transform(df, json_schema2)
    table3_df = filter_and_transform(df, json_schema3)
    # really you can run multiple writes using multithreading, or something like it
    do_upsert(table1_df)
    do_upsert(table2_df)
    ...
    # free resources
    df.unpersist()
# note: table_name is not defined in this snippet; the combined stream needs one checkpoint path
df.writeStream \
    .trigger(once=True) \
    .format("delta") \
    .foreachBatch(do_all_upserts) \
    .option("checkpointLocation", f"dbfs:/delta/somepath/{table_name}") \
    .start()
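The comment about running the writes with multithreading could be sketched roughly as follows. This is only an illustration: run_upserts_in_parallel is a hypothetical helper, do_upsert stands for the per-table upsert from the snippet above, and it assumes every upsert writes to a different table so that the writes do not conflict with each other.

```python
from concurrent.futures import ThreadPoolExecutor


def run_upserts_in_parallel(tasks, max_workers=4):
    """Run each (upsert_function, dataframe) pair in its own thread.

    Results come back in the order the tasks were given. If any upsert
    raises, the exception is re-raised here so the micro-batch fails.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, df) for fn, df in tasks]
        return [future.result() for future in futures]


# Inside do_all_upserts(df, epoch), after df.cache(), the sequential calls
# could become (hypothetical usage):
# run_upserts_in_parallel([(do_upsert, table1_df), (do_upsert, table2_df)])
```

Waiting for all futures before calling df.unpersist() matters here, because the cached micro-batch must stay available until every write has finished.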

Handling Duplicates in Databricks autoloader

I am new to Databricks Auto Loader. We have a requirement to process data from AWS S3 into a Delta table via Auto Loader. While testing it I ran into a duplicate issue: if I upload a file named, say, emp_09282021.csv that has the same data as emp_09272021.csv, no duplicate is detected and the rows are simply inserted. So if I had 5 rows from emp_09272021.csv, I end up with 10 rows after uploading emp_09282021.csv.
Below is the code that I tried:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.format("delta") \
.option("mergeSchema", "true") \
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start("s3://some-s3-path/spark_stream_processing/target/")
Any guidance on how to handle this, please?
Detecting duplicates is not Auto Loader's job: it gives you the ability to ingest data, but you need to handle duplicates yourself. There are several approaches:
The first is to use the built-in dropDuplicates function. It's recommended to use it with watermarking to avoid building up a huge state, but you need a column that can serve as the event time, and it should be part of the dropDuplicates column list (see the docs for more details):
streamingDf \
    .withWatermark("eventTime", "10 seconds") \
    .dropDuplicates(["col1", "eventTime"])
The second is to use Delta's merge capability: you only insert data that isn't already in the Delta table, but you need to use foreachBatch for that. Something like this (please note that the table should already exist, or you need to add handling for a non-existent table):
from delta.tables import *
def drop_duplicates(df, epoch):
    table = DeltaTable.forPath(spark,
        "s3://some-s3-path/spark_stream_processing/target/")
    dname = "destination"
    uname = "updates"
    dup_columns = ["col1", "col2"]
    merge_condition = " AND ".join([f"{dname}.{col} = {uname}.{col}"
        for col in dup_columns])
    table.alias(dname).merge(df.alias(uname), merge_condition) \
        .whenNotMatchedInsertAll().execute()
# ....
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.foreachBatch(drop_duplicates)\
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start()
In this code you need to change the dup_columns variable to specify columns that are used to detect duplicates.
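One detail worth checking with the merge approach: rows that are duplicated inside the same micro-batch all count as "not matched" and would all be inserted, so it may help to de-duplicate the batch itself first. A hedged sketch follows; build_merge_condition and insert_new_rows are hypothetical names, and the null_safe flag (Spark SQL's <=> operator, which also matches NULL with NULL) is an optional extra that is not part of the code above:

```python
def build_merge_condition(columns, target="destination", source="updates", null_safe=False):
    """Join per-column equality checks into one merge condition string."""
    op = "<=>" if null_safe else "="
    return " AND ".join(f"{target}.{c} {op} {source}.{c}" for c in columns)


def insert_new_rows(delta_table, batch_df, dup_columns):
    """Insert only rows whose dup_columns are not already in the Delta table."""
    # Remove duplicates within the micro-batch before comparing with the table.
    deduped = batch_df.dropDuplicates(dup_columns)
    (
        delta_table.alias("destination")
        .merge(deduped.alias("updates"), build_merge_condition(dup_columns))
        .whenNotMatchedInsertAll()
        .execute()
    )


print(build_merge_condition(["id", "name"]))
# destination.id = updates.id AND destination.name = updates.name
```

Inside a foreachBatch function, insert_new_rows would receive the DeltaTable obtained from DeltaTable.forPath and the micro-batch DataFrame.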

How to include partitioned column in pyspark dataframe read method

I am writing Avro files from a Parquet file. I read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST") \
    .load("/Users/rashmik/flight-time.parquet")
Writing data
I have written the file in Avro format as below:
dfParquetRePartitioned.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from datasink/avro directory. I am using the below code to read from datasink/avro
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load("datasink/avro/OP_CARRIER=AA")
It reads the data successfully, but as expected the OP_CARRIER column is not available in the dfAvro dataframe, as it is the partition column of the first job. My requirement is to include the OP_CARRIER field in the second dataframe as well, i.e. in dfAvro. Could somebody help me with this?
I have been looking through the Spark documentation, but I am not able to locate the relevant information. Any pointer will be very helpful.
You can replicate the same column value under a different alias.
from pyspark.sql.functions import col
dfParquetRePartitioned.withColumn("OP_CARRIER_1", col("OP_CARRIER")) \
.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
This gives you what you wanted, but under a different alias.
Or you can do it at read time. If the location is dynamic, you can easily append the column.
from pyspark.sql.functions import lit
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load(path).withColumn(newcol[0], lit(newcol[1]))
If the value is static, it's even easier to add it while reading the data.
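The path-splitting idea can be extended to paths with several partition levels. This is a small sketch, assuming Hive-style key=value directory names without escaped characters; partition_values and with_partition_columns are hypothetical helpers. Spark's basePath read option (for example .option("basePath", "datasink/avro")) should be another way to keep partition columns when loading a sub-directory.

```python
def partition_values(path):
    """Extract {column: value} from every key=value segment of a path."""
    values = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def with_partition_columns(df, path):
    """Add one string literal column per partition segment found in path."""
    from pyspark.sql.functions import lit  # imported lazily; requires PySpark

    for name, value in partition_values(path).items():
        df = df.withColumn(name, lit(value))
    return df


print(partition_values("datasink/avro/OP_CARRIER=AA"))
# {'OP_CARRIER': 'AA'}
```

Note that columns added this way are strings, whereas Spark's own partition discovery may infer numeric types for values that look like numbers.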
