Streaming not working in Delta Live table pipeline (Databricks)? - apache-spark

I am working on a pipeline in Databricks > Workflows > Delta Live Tables and having an issue with the streaming part.
Expectations:
One bronze table reads the JSON files with Auto Loader (cloudFiles) in streaming mode (spark.readStream)
One silver table reads and flattens the bronze table in streaming mode (dlt.read_stream)
Result:
When taking the root location as the source (load /*, several hundred files): the pipeline starts, but the number of rows/files appended is not updated in the graph until the bronze part has completed. Then the silver part starts, the number of rows/files never updates either, and the pipeline terminates with a memory error.
When taking a very small number of files (/specific_folder among hundreds): the pipeline runs fine and terminates with no error, but again, the number of rows/files appended is not updated in the graph until each part is completed.
This leads me to the conclusion that the pipeline does not seem to run in streaming mode.
Maybe I am missing something about the config or about how to run a DLT pipeline properly, and I would appreciate your help on this please.
Here is the configuration of the pipeline:
{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/iam_role_example"
      },
      "autoscale": {
        "min_workers": 1,
        "max_workers": 10,
        "mode": "LEGACY"
      }
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "CURRENT",
  "edition": "PRO",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/user_example#xxxxxx.xx/dms/bronze_job"
      }
    }
  ],
  "name": "01-landing-task-1",
  "storage": "dbfs:/pipelines/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "configuration": {
    "SCHEMA": "example_schema",
    "RAW_MOUNT_NAME": "xxxx",
    "DELTA_MOUNT_NAME": "xxxx",
    "spark.sql.parquet.enableVectorizedReader": "false"
  },
  "target": "landing"
}
Here is the code of the pipeline (the query in the silver table actually contains many more columns extracted with get_json_object, ~30 of them):
import dlt
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window
RAW_MOUNT_NAME = spark.conf.get("RAW_MOUNT_NAME")
SCHEMA = spark.conf.get("SCHEMA")
SOURCE = spark.conf.get("SOURCE")
TABLE_NAME = spark.conf.get("TABLE_NAME")
PRIMARY_KEY_PATH = spark.conf.get("PRIMARY_KEY_PATH")
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={
        "quality": "bronze"
    }
)
def bronze_job():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"
    return spark \
        .readStream \
        .format("text") \
        .option("encoding", "UTF-8") \
        .load(load_path) \
        .select("value", "_metadata") \
        .withColumnRenamed("value", "json") \
        .withColumn("id", F.expr(f"get_json_object(json, '$.{PRIMARY_KEY_PATH}')")) \
        .withColumn("_etl_timestamp", F.col("_metadata.file_modification_time")) \
        .withColumn("_metadata", F.col("_metadata").cast(T.StringType())) \
        .withColumn("_etl_operation", F.lit("U")) \
        .withColumn("_etl_to_delete", F.lit(False)) \
        .withColumn("_etl_file_name", F.input_file_name()) \
        .withColumn("_etl_job_processing_timestamp", F.current_timestamp()) \
        .withColumn("_etl_table", F.lit(f"{TABLE_NAME}")) \
        .withColumn("_etl_partition_date", F.to_date(F.col("_etl_timestamp"), "yyyy-MM-dd")) \
        .select("_etl_operation", "_etl_timestamp", "id", "json", "_etl_file_name", "_etl_job_processing_timestamp", "_etl_table", "_etl_partition_date", "_etl_to_delete", "_metadata")

@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_silver",
    table_properties={
        "quality": "silver",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def silver_job():
    df = dlt.read_stream(f"{SCHEMA}_{TABLE_NAME}_bronze").where("_etl_table == 'extraction'")
    return df.select(
        df.id.alias('medium_id'),
        F.get_json_object(df.json, '$.request').alias('request_id'))
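For reference, the Auto Loader (cloudFiles) source mentioned in the expectations would look roughly like the sketch below; the bronze_job above actually uses a plain spark.readStream.format("text") file stream instead. This is only a sketch of the cloudFiles variant, keeping the same text files and mount path (the function name is just for illustration):

# Sketch only: Auto Loader variant of the bronze source; the derived _etl_* columns
# from bronze_job above would be added back on top of this stream.
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={"quality": "bronze"}
)
def bronze_job_autoloader():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"
    return (spark.readStream
            .format("cloudFiles")                  # Auto Loader source
            .option("cloudFiles.format", "text")   # treat each line as a raw JSON string
            .load(load_path)
            .select("value", "_metadata"))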
Thank you very much for your help!

Related

How to read complex json array in pyspark df?

I have a JSON file with the structure below that I need to read in as a PySpark dataframe.
I tried reading it in using the multiLine option, but it doesn't seem to return more than the columns and datatypes. What am I possibly doing wrong, and how can I read in the structure below?
df = spark.read.format("json").option("inferSchema", "true") \
.option("multiLine", "true") \
.load("/mnt/blob/input/jsonfile.json")
Structure:
[
  [
    {
      "Key": "Val"
    },
    {
      "Key": "Val"
    }
  ],
  [
    {
      "Key": "Val"
    },
    {
      "Key": "Val"
    }
  ]
]
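A possible approach for a top-level array of arrays like this (just a sketch, assuming the whole file holds the structure above): read the file as a single text value, parse it with an explicit schema via from_json, and explode twice to get one row per inner object:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Sketch only: explicit schema for [[{"Key": "Val"}, ...], [...]]
schema = T.ArrayType(T.ArrayType(T.StructType([T.StructField("Key", T.StringType())])))

raw = spark.read.option("wholetext", "true").text("/mnt/blob/input/jsonfile.json")
df = (raw
      .select(F.from_json("value", schema).alias("data"))
      .select(F.explode("data").alias("inner"))   # one row per inner array
      .select(F.explode("inner").alias("obj"))    # one row per object
      .select("obj.*"))                           # expands to a single "Key" column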

How to change partition columns in delta live tables?

I first set up a Delta Live Table using Python as follows
@dlt.table
def transaction():
    return (
        spark
        .readStream
        .format("cloudFiles")
        .schema(transaction_schema)
        .option("cloudFiles.format", "parquet")
        .load(path)
    )
And I wrote the Delta Live Table to the target database test
{
  "id": <id>,
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5
      }
    }
  ],
  "development": true,
  "continuous": false,
  "edition": "core",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": <path>
      }
    }
  ],
  "name": "dev pipeline",
  "storage": <storage>,
  "target": "test"
}
Everything worked as expected in the first trial.
After a while, I noticed that I had forgotten to add a partition column to the table, so I dropped the table in test with DROP TABLE test.transaction and updated the notebook to
@dlt.table(
    partition_cols=["partition"],
)
def transaction():
    return (
        spark
        .readStream
        .format("cloudFiles")
        .schema(transaction_schema)
        .option("cloudFiles.format", "parquet")
        .load(path)
        .withColumn("partition", F.to_date("timestamp"))
    )
However, when I ran the pipeline again, I got an error
org.apache.spark.sql.AnalysisException: Cannot change partition columns for table transaction.
Current:
Requested: partition
Looks like I can't change the partition column by only dropping the target table.
What is the proper way to change partition columns in delta live tables?
If you have changed the partitioning schema, then instead of starting the pipeline with the Start button, you need to select the "Full refresh" option from the dropdown of the Start button.
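If you prefer to trigger this programmatically, a full refresh can also be requested through the Pipelines REST API; a minimal sketch (workspace URL, token and pipeline id are placeholders):

import requests

# Sketch only: start a pipeline update with a full refresh via the Pipelines API 2.0.
host = "https://<your-workspace>.cloud.databricks.com"
pipeline_id = "<pipeline-id>"

resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"full_refresh": True},
)
resp.raise_for_status()
print(resp.json())  # contains the update_id of the triggered run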

Spark can not process recursive avro data

I have an avsc schema like the one below:
{
  "name": "address",
  "type": [
    "null",
    {
      "type": "record",
      "name": "Address",
      "namespace": "com.data",
      "fields": [
        {
          "name": "address",
          "type": ["null", "com.data.Address"],
          "default": null
        }
      ]
    }
  ],
  "default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro")\
.option("avroSchema", jsonFormatSchema)\
.load("xxx.avro")
I got this exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute, I use spark-submit with:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is an intended feature; you can take a look at the issue:
https://issues.apache.org/jira/browse/SPARK-25718
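If you need to process such files anyway, one possible workaround (outside spark-avro, just a sketch using the third-party fastavro library) is to parse the records yourself and let Spark infer a schema that is only as deep as the data actually nests:

import json
from fastavro import reader  # third-party: pip install fastavro

def avro_to_json_lines(path):
    # Parse an Avro container file (fastavro handles recursive schemas)
    # and yield one JSON string per record.
    with open(path, "rb") as f:
        for record in reader(f):
            yield json.dumps(record, default=str)

# "/dbfs/path/to/xxx.avro" is a placeholder; for large inputs this parsing
# would need to be distributed rather than done on the driver.
records = list(avro_to_json_lines("/dbfs/path/to/xxx.avro"))
df = spark.read.json(spark.sparkContext.parallelize(records))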

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

Spark Dataframe Schema:
StructType([
    StructField("a", StringType(), False),
    StructField("b", StringType(), True),
    StructField("c", BinaryType(), False),
    StructField("d", ArrayType(StringType(), False), True),
    StructField("e", TimestampType(), True)
])
When I write the data frame to parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and a write to parquet using a Spark dataframe.
BigQuery Schema:
[
  {
    "type": "STRING",
    "name": "a",
    "mode": "REQUIRED"
  },
  {
    "type": "STRING",
    "name": "b",
    "mode": "NULLABLE"
  },
  {
    "type": "BYTES",
    "name": "c",
    "mode": "REQUIRED"
  },
  {
    "fields": [
      {
        "fields": [
          {
            "type": "STRING",
            "name": "element",
            "mode": "NULLABLE"
          }
        ],
        "type": "RECORD",
        "name": "list",
        "mode": "REPEATED"
      }
    ],
    "type": "RECORD",
    "name": "d",
    "mode": "NULLABLE"
  },
  {
    "type": "TIMESTAMP",
    "name": "e",
    "mode": "NULLABLE"
  }
]
Is this something to do with the way Spark writes parquet, or the way BigQuery reads it? Any idea how I can fix this?
This is due to the intermediate file format (parquet by default) that the spark-bigquery connector uses.
The connector first writes the data to parquet files, then loads them into BigQuery using the BigQuery Insert API.
If you check the intermediate parquet schema using parquet-tools, you will find something like this for the field d (ArrayType(StringType) in Spark):
optional group d (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
Now, if you were loading this parquet yourself into BigQuery using bq load or the BigQuery Insert API directly, you would be able to tell BQ to ignore the intermediate fields by enabling parquet_enable_list_inference.
Unfortunately, I don't see how to enable this option when using the spark-bigquery connector!
As a workaround, you can try to use orc as the intermediate format.
df
.write
.format("bigquery")
.option("intermediateFormat", "orc")

Writing to Cosmos DB Graph API from Databricks (Apache Spark)

I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame equating to 1 vertex in Cosmos.
When I write to Cosmos I can't see any properties on the vertices, just a generated id.
Get data:
data = spark.sql("select * from graph.testgraph")
Configuration:
writeConfig = {
    "Endpoint": "******",
    "Masterkey": "******",
    "Database": "graph",
    "Collection": "TestGraph",
    "Upsert": "true",
    "query_pagesize": "100000",
    "bulkimport": "true",
    "WritingBatchSize": "1000",
    "ConnectionMaxPoolSize": "100",
    "partitionkeydefinition": "/id"
}
Write to Cosmos:
data.write \
    .format("com.microsoft.azure.cosmosdb.spark") \
    .options(**writeConfig) \
    .save()
Below is working code to insert records into Cosmos DB.
Go to the site below, click on the download option and select the uber jar, then add it to your dependencies:
https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/jar
spark-shell --master yarn --executor-cores 5 --executor-memory 10g --num-executors 10 --driver-memory 10g --jars "path/to/jar/dependency/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar" --packages "com.google.guava:guava:18.0,com.google.code.gson:gson:2.3.1,com.microsoft.azure:azure-documentdb:1.16.1"
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val data = Seq(
  Row(2, "Abb"),
  Row(4, "Bcc"),
  Row(6, "Cdd")
)
val schema = List(
  StructField("partitionKey", IntegerType, true),
  StructField("name", StringType, true)
)
val DF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
val writeConfig = Map(
  "Endpoint" -> "https://*******.documents.azure.com:443/",
  "Masterkey" -> "**************",
  "Database" -> "db_name",
  "Collection" -> "collection_name",
  "Upsert" -> "true",
  "query_pagesize" -> "100000",
  "bulkimport" -> "true",
  "WritingBatchSize" -> "1000",
  "ConnectionMaxPoolSize" -> "100",
  "partitionkeydefinition" -> "/partitionKey"
)
DF.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(writeConfig).save()
