Databricks Lakehouse Medallion Architecture Python Sample

Does anyone know of a Python sample of the medallion architecture?
A sample like this one in SQL: https://www.databricks.com/notebooks/delta-lake-cdf.html

In the simplest case it's just a series of Spark's .readStream -> some transformations -> .writeStream steps (you can also do it in a non-streaming fashion, but then you spend more time tracking what has changed, etc.). With plain Spark plus Databricks Auto Loader it would be:
# bronze
raw_df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load(input_data)
# e.g. .trigger(availableNow=True) if you want to mimic batch-like processing
raw_df.writeStream.format("delta") \
    .option("checkpointLocation", bronze_checkpoint) \
    .trigger(...) \
    .start(bronze_path)

# silver
bronze_df = spark.readStream.format("delta").load(bronze_path)
# do transformations to get silver_df
silver_df = bronze_df.filter(...)
silver_df.writeStream.format("delta") \
    .option("checkpointLocation", silver_checkpoint) \
    .trigger(...) \
    .start(silver_path)

# gold
silver_df = spark.readStream.format("delta").load(silver_path)
gold = silver_df.groupBy(...)  # followed by an aggregation such as .count() or .agg(...)
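For completeness, once the aggregation produces a DataFrame, the gold table is written out the same way; note that a streaming aggregation needs outputMode("complete") or "update". This is only a sketch - "some_key", gold_checkpoint and gold_path below are placeholders, not names from the original sample:
gold_df = silver_df.groupBy("some_key").count()  # "some_key" is a placeholder grouping column
gold_df.writeStream.format("delta") \
    .option("checkpointLocation", gold_checkpoint) \
    .outputMode("complete") \
    .trigger(availableNow=True) \
    .start(gold_path)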
But it becomes much simpler if you're using Delta Live Tables - then you concentrate only on the transformations, not on how to write the data, etc. Something like this:
import dlt

@dlt.table
def bronze():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load(input_data)

@dlt.table
def silver():
    bronze = dlt.read_stream("bronze")
    return bronze.filter(...)

@dlt.table
def gold():
    silver = dlt.read_stream("silver")
    return silver.groupBy(...)

Related

Can I exclude the column used for partitioning when writing to parquet?

I need to create parquet files, reading from JDBC. The table is quite big and all columns are varchars, so I created a new column with a random int to use for partitioning.
My JDBC read looks something like this:
data_df = sparkSession.read.format('jdbc') \
    .option('url', 'jdbc:netezza://host:port/db') \
    .option('dbtable', """(SELECT * FROM schema.table) A""") \
    .option('user', 'user') \
    .option('password', 'password') \
    .option('partitionColumn', 'random_number') \
    .option('lowerBound', '1') \
    .option('upperBound', '200') \
    .option('numPartitions', '200') \
    .load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").partitionBy('random_number').parquet("parquetfile.parquet")
The generated parquet files also contain the 'random_number' column, but I only made that column for partitioning. Is there a way to exclude that column when writing the parquet files?
Thanks for any help, I'm new to Spark :)
I expect to exclude the random_number column, but I lack the knowledge of whether this is possible when I need the column for partitioning.
So if you want to repartition in memory using a column but not write it, you can just use .repartition(col("random_number")) before writing, drop the column, and then write your data:
from pyspark.sql.functions import col

data_df = sparkSession.read.format('jdbc') \
    .option('url', 'jdbc:netezza://host:port/db') \
    .option('dbtable', """(SELECT * FROM schema.table) A""") \
    .option('user', 'user') \
    .option('password', 'password') \
    .option('partitionColumn', 'random_number') \
    .option('lowerBound', '1') \
    .option('upperBound', '200') \
    .option('numPartitions', '200') \
    .load() \
    .repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")

Create Spark output streams with a function

I use Databricks Auto Loader to ingest files that contain data with different schemas and want to write them to corresponding Delta tables using update mode.
There may be many (>15) different message types in a stream, so I'd have to write an output stream for every one of them. There is an "upsert" function for every table.
Can this be condensed using a function (example given below) that will save a few keystrokes?
upload_path = '/example'

# Set up the stream to begin reading incoming files from the
# upload_path location.
df = spark.readStream.format('cloudFiles') \
    .option('cloudFiles.format', 'avro') \
    .load(upload_path)

# filter messages and apply JSON schema
table1_df = filter_and_transform(df, json_schema1)
table2_df = filter_and_transform(df, json_schema2)
table3_df = filter_and_transform(df, json_schema3)

# each table has its own upsert function
def create_output_stream(df, table_name, upsert_function):
    # Create the stream and return it.
    return df.writeStream \
        .trigger(once=True) \
        .format("delta") \
        .foreachBatch(upsert_function) \
        .queryName(f"autoLoader_query_{table_name}") \
        .option("checkpointLocation", f"dbfs:/delta/somepath/{table_name}") \
        .outputMode("update")

output_stream1 = create_output_stream(table1_df, "table_name1", upsert_function1).start()  # start stream in outer environment
output_stream2 = create_output_stream(table2_df, "table_name2", upsert_function2).start()
output_stream3 = create_output_stream(table3_df, "table_name3", upsert_function3).start()
Yes, of course it's possible to do it this way - it's quite a standard pattern.
But you need to take one thing into consideration: if your input data isn't partitioned by message type, then you will scan the same files multiple times (once for each message type). An alternative would be the following - you perform the filtering & upserts of all message types inside a single foreachBatch, like this:
df = spark.readStream.format('cloudFiles') \
    .option('cloudFiles.format', 'avro') \
    .load(upload_path)

def do_all_upserts(df, epoch):
    df.cache()
    table1_df = filter_and_transform(df, json_schema1)
    table2_df = filter_and_transform(df, json_schema2)
    table3_df = filter_and_transform(df, json_schema3)
    # really you can run the writes in parallel using multithreading,
    # or something like it (see the sketch below)
    do_upsert(table1_df)
    do_upsert(table2_df)
    ...
    # free resources
    df.unpersist()

df.writeStream \
    .trigger(once=True) \
    .format("delta") \
    .foreachBatch(do_all_upserts) \
    .option("checkpointLocation", "dbfs:/delta/somepath/combined_checkpoint") \  # single checkpoint for the combined stream
    .start()
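The "multithreading" comment above can be made concrete with a simple thread pool; here is a minimal sketch, reusing the hypothetical filter_and_transform and do_upsert helpers from the snippets above:
from concurrent.futures import ThreadPoolExecutor

def do_all_upserts(df, epoch):
    df.cache()
    # build each per-table DataFrame once from the cached micro-batch
    tables = [
        filter_and_transform(df, json_schema1),
        filter_and_transform(df, json_schema2),
        filter_and_transform(df, json_schema3),
    ]
    # Spark jobs submitted from separate driver threads run concurrently,
    # so the per-table upserts can overlap instead of running one by one
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        futures = [pool.submit(do_upsert, t) for t in tables]
        for f in futures:
            f.result()  # propagate any exception raised in a worker thread
    df.unpersist()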

Handling Duplicates in Databricks autoloader

I am new to Databricks Autoloader. We have a requirement to process data from AWS S3 into a Delta table via Databricks Autoloader. While testing the autoloader I ran into a duplicate issue: if I upload a file named, say, emp_09282021.csv with the same data as emp_09272021.csv, no duplicates are detected - the rows are simply inserted. So if I had 5 rows from the emp_09272021.csv file, it becomes 10 rows after I upload emp_09282021.csv.
Below is the code that I tried:
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("header", True) \
    .schema("id string, name string, age string, city string") \
    .load("s3://some-s3-path/source/") \
    .writeStream.format("delta") \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
    .start("s3://some-s3-path/spark_stream_processing/target/")
Any guidance on how to handle this?
It's not the task of the autoloader to detect duplicates; it gives you the ability to ingest data, but you need to handle duplicates yourself. There are several approaches to that:
Use the built-in dropDuplicates function. It's recommended to use it with watermarking to avoid building up a huge state, but you need some column that will be used as the event time, and it should be part of the dropDuplicates list (see the docs for more details):
streamingDf \
    .withWatermark("eventTime", "10 seconds") \
    .dropDuplicates(["col1", "eventTime"])
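Plugged into the pipeline from the question, that could look roughly like this - a sketch only, since it assumes the source data carries an event-time column (here called event_time), which the CSV schema in the question does not have:
dedup_df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("header", True) \
    .schema("id string, name string, age string, city string, event_time timestamp") \
    .load("s3://some-s3-path/source/") \
    .withWatermark("event_time", "10 minutes") \
    .dropDuplicates(["id", "event_time"])

dedup_df.writeStream.format("delta") \
    .option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
    .start("s3://some-s3-path/spark_stream_processing/target/")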
Use Delta's merge capability - you just need to insert the data that isn't already in the Delta table, but you need to use foreachBatch for that. Something like this (please note that the table should already exist, or you need to add handling for a non-existent table):
from delta.tables import *

def drop_duplicates(df, epoch):
    table = DeltaTable.forPath(spark,
        "s3://some-s3-path/spark_stream_processing/target/")
    dname = "destination"
    uname = "updates"
    dup_columns = ["col1", "col2"]
    merge_condition = " AND ".join([f"{dname}.{col} = {uname}.{col}"
                                    for col in dup_columns])
    table.alias(dname).merge(df.alias(uname), merge_condition) \
        .whenNotMatchedInsertAll().execute()

# ....
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("header", True) \
    .schema("id string, name string, age string, city string") \
    .load("s3://some-s3-path/source/") \
    .writeStream.foreachBatch(drop_duplicates) \
    .option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
    .start()
In this code you need to change the dup_columns variable to specify columns that are used to detect duplicates.

Spark dynamic partitioning: SchemaColumnConvertNotSupportedException on read

Question
Is there any way to store data with different (not compatible) schemas in different partitions?
The issue
I use PySpark v2.4.5, parquet format and dynamic partitioning with the following hierarchy: BASE_PATH/COUNTRY=US/TYPE=sms/YEAR=2020/MONTH=04/DAY=10/. Unfortunately it can't be changed.
I get a SchemaColumnConvertNotSupportedException on read. That happens because the schema differs between types (i.e. between sms and mms). It looks like Spark is trying to merge the two schemas on read under the hood.
To be more precise, I can read data for F.col('TYPE') == 'sms', because the mms schema can be converted to the sms schema. But when I filter by F.col('TYPE') == 'mms', Spark fails.
Code
# Works, because Spark doesn't try to merge schemas
spark_session \
    .read \
    .option('mergeSchema', False) \
    .parquet(BASE_PATH + '/COUNTRY_CODE=US/TYPE=mms/YEAR=2020/MONTH=04/DAY=07/HOUR=00') \
    .show()

# Doesn't work, because Spark tries to merge the schemas for TYPE=sms and TYPE=mms.
# Mms data can't be converted to the merged schema.
# Types are correct; from explain, Spark treats the date partitions as integers.
# Predicate pushdown isn't used for some reason; there is no PushedFilter in the explained plan.
spark_session \
    .read \
    .option('mergeSchema', False) \
    .parquet(BASE_PATH) \
    .filter(F.col('COUNTRY') == 'US') \
    .filter(F.col('TYPE') == 'mms') \
    .filter(F.col('YEAR') == 2020) \
    .filter(F.col('MONTH') == 4) \
    .filter(F.col('DAY') == 10) \
    .show()
In case it's useful for someone: it is possible to have different data within different partitions. To stop Spark from inferring the schema for parquet, specify the schema explicitly:
spark_session \
    .read \
    .schema(some_schema) \
    .option('mergeSchema', False) \
    .parquet(BASE_PATH) \
    .filter(F.col('COUNTRY') == 'US') \
    .filter(F.col('TYPE') == 'mms') \
    .filter(F.col('YEAR') == 2020) \
    .filter(F.col('MONTH') == 4) \
    .filter(F.col('DAY') == 10) \
    .show()
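For reference, some_schema is just an explicit StructType that matches the columns you actually want to read; the field names below are placeholders, not the asker's real mms columns:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# hypothetical mms schema - replace the fields with the real mms columns
some_schema = StructType([
    StructField("message_id", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("size_bytes", IntegerType(), True),
])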

Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'

At the line
if not df.head(1).isEmpty:
I get the exception:
Exception has occurred: pyspark.sql.utils.AnalysisException 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'
I don't know how to use an if statement on streaming data.
When I use Jupyter to execute each line, the code works and I get my result, but run as a .py file it doesn't.
My purpose is this: I want to use streaming to get data from Kafka every second, then transform each batch of streaming data (one batch means the data received in one second) into a pandas DataFrame, use pandas functions to do something with the data, and finally send the result to another Kafka topic.
Please help me, and forgive my poor English. Thanks a lot.
sc = SparkContext("local[2]", "OdometryConsumer")
spark = SparkSession(sparkContext=sc) \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "data") \
    .load()
ds = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print(type(ds))

if not df.head(1).isEmpty:
    alertQuery = ds \
        .writeStream \
        .queryName("qalerts") \
        .format("memory") \
        .start()
    alerts = spark.sql("select * from qalerts")
    pdAlerts = alerts.toPandas()
    a = pdAlerts['value'].tolist()
    d = []
    for i in a:
        x = json.loads(i)
        d.append(x)
    df = pd.DataFrame(d)
    print(df)
    ds = df['jobID'].unique().tolist()
    dics = {}
    for source in ds:
        ids = df.loc[df['jobID'] == source, 'id'].tolist()
        dics[source] = ids
    print(dics)

query = ds \
    .writeStream \
    .queryName("tableName") \
    .format("console") \
    .start()
query.awaitTermination()
Remove if not df.head(1).isEmpty: and you should be fine.
The reason for the exception is simple: a streaming query is a structured query that never ends and is continually executed. It is simply not possible to look at a single element, since there is no "single element" but (possibly) thousands of elements, and it would be hard to tell when exactly you'd want to look under the covers and see just a single element.
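If the goal is to run pandas code on every micro-batch and push the result to another Kafka topic, the usual pattern is foreachBatch. Below is a minimal sketch, not the answerer's code; the output topic name and the pandas processing step are placeholders:
import json
import pandas as pd

def process_batch(batch_df, batch_id):
    # batch_df is a normal (non-streaming) DataFrame, so toPandas() is allowed here
    pdf = batch_df.selectExpr("CAST(value AS STRING)").toPandas()
    records = [json.loads(v) for v in pdf["value"]]
    result = pd.DataFrame(records)
    # ... do the pandas processing here ...
    if result.empty:
        return
    # write the result back to Kafka (output topic name is a placeholder)
    out = spark.createDataFrame(result).selectExpr("to_json(struct(*)) AS value")
    out.write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("topic", "output-topic") \
        .save()

query = ds.writeStream \
    .foreachBatch(process_batch) \
    .start()
query.awaitTermination()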
