I am trying to use DLT for incremental processing where inputs are parquet files on S3 arriving daily. I was told that dlt read_stream can help. I was able to read the files incrementally, but when I perform aggregations, it recomputes the aggregation over the whole table instead of aggregating only the incremental rows. Appreciate any suggestions.
Here is the example code
import dlt
from pyspark.sql import functions as F

@dlt.table()
def tab1():
    return (spark.readStream.format("cloudFiles")
            .schema(schema)
            .option("cloudFiles.format", "parquet")
            .option("cloudFiles.includeExistingFiles", False)
            .option("cloudFiles.allowOverwrites", False)
            .option("cloudFiles.validateOptions", True)
            .load(f"{s3_prefix}/tab1/"))

@dlt.table(
    comment="Aggregate table1"
)
def tab1_agg():
    return (dlt.read_stream("tab1")
            .groupBy("col1")
            .agg(F.count(F.lit(1)).alias("cnt"),
                 F.sum("col2").alias("sum_col2"))
            .withColumn("kh_meta_canonical_timestamp", F.current_timestamp()))
I have a Delta table with about 300 billion rows. Now I am performing some operations on a column using a UDF and creating another column.
My code is something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_udf(data):
    # placeholder for the real transformation
    pass

udf_func = udf(my_udf, StringType())
data = spark.sql("""SELECT * FROM large_table """)
data = data.withColumn('new_column', udf_func(data.value))
The issue is that this takes a long time, as Spark will process all 300 billion rows and only then write the output. Is there a way to do some micro-batching and write the output of each batch to the output Delta table regularly?
The first rule is usually to avoid UDFs as much as possible - what kind of transformation do you need to perform that isn't available in Spark itself?
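For example (the actual transformation in the question is unknown, so trimming and upper-casing a string column is just a stand-in for whatever the UDF does):
from pyspark.sql import functions as F

# A built-in expression keeps the work inside the JVM and avoids
# Python row-by-row serialization entirely.
data = data.withColumn('new_column', F.upper(F.trim(F.col('value'))))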
Second rule - if you can't avoid using a UDF, at least use Pandas UDFs, which process data in batches and don't have such a big serialization/deserialization overhead - regular UDFs handle data row by row, encoding and decoding it for each row.
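For illustration, a minimal sketch of a Pandas UDF, assuming the transformation can be expressed on a whole pandas Series at a time (the function body is just a placeholder):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def my_pandas_udf(values: pd.Series) -> pd.Series:
    # receives a whole Arrow-backed batch of rows at once instead of one row at a time
    return values.astype(str)   # placeholder transformation

data = data.withColumn('new_column', my_pandas_udf(data.value))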
If your table was built over time and consists of many files, you can try to use Spark Structured Streaming with Trigger.AvailableNow (requires DBR 10.3+), something like this:
maxNumFiles = 10  # max number of parquet files processed per micro-batch

df = spark.readStream \
    .option("maxFilesPerTrigger", maxNumFiles) \
    .table("large_table")

df = df.withColumn('new_column', udf_func(df.value))

df.writeStream \
    .option("checkpointLocation", "/some/path") \
    .trigger(availableNow=True) \
    .toTable("my_destination_table")
This will read the source table chunk by chunk, apply your transformation, and write the data into the destination table.
I have ~3PB of parquet on S3. I want to read it file-by-file with spark streaming and join some metadata to it before writing out. The metadata is small enough to be broadcasted. Files in the source data are ~60mb, none are huge.
val r = spark.readStream
.option("maxFilesPerTrigger", "100")
.schema(pschema)
.parquet("s3://mybigdata/sourcedata/")
.withColumn("id", regexp_extract(col("mycol"), "someregex", 1).cast(IntegerType))
.alias("p")
.join(broadcast(idmap.alias("i")), $"p.id" === $"i.id", "inner") //idmap is a small dataframe
.drop($"i.id")
.withColumn("date", regexp_extract($"filename", "someregex", 1))
val w = r.writeStream.format("delta")
.partitionBy("date", "some_id")
.option("checkpointLocation", "s3://mybigdata/checkpoint/")
.option("path", "s3://mybigdata/destination/")
.start()
When I do this, I get MASSIVE spills to memory and disk, which of course is a disaster. How is it that I am getting these massive spills when I'm rate-limiting via maxFilesPerTrigger to 100 x 60 MB files at a time? It seems to be trying to read the entire S3 dataset and isn't streaming at all.
What is going wrong here?
I'm using databricks spark 3.x, and I am reading a very large number of streams (100+), and each stream has its own contract, and needs to be written out to its own delta/parquet/sql/whatever table. While this is a lot of streams, the activity per stream is low - some streams might see only hundreds of records a day. I do want to stream because I am aiming for a fairly low-latency approach.
Here's what I'm talking about (code abbreviated for simplicity; I'm using checkpoints, output modes, etc. correctly).
Assume a schemas variable contains the schema for each topic. I've tried this approach, where I create a ton of individual streams, but it takes a lot of compute and most of it is wasted:
def batchProcessor(topic, schema):
    def F(df, batchId):
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
          UPDATE SET *
        WHEN NOT MATCHED
          INSERT *
        '''
        df.createOrReplaceTempView(f"SOME MERGE TABLE")
        df._jdf.sparkSession().sql(sql)
    return F

for topic in topics:
    query = (spark
             .readStream
             .format("delta")
             .load(f"/my-stream-one-table-per-topic/{topic}")
             .withColumn('json', from_json(col('value'), schemas[topic]))
             .select(col('json.*'))
             .writeStream
             .format("delta")
             .foreachBatch(batchProcessor(topic, schemas[topic]))
             .start())
I also tried to create just one stream that did a ton of filtering, but performance was pretty abysmal even in a test environment where I pushed a single message to a single topic:
def batchProcessor(df, batchId):
    df.cache()
    for topic in topics:
        filteredDf = (df.filter(f"topic == '{topic}'")
                      .withColumn('json', from_json(col('value'), schemas[topic]))
                      .select(col('json.*')))
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
          UPDATE SET *
        WHEN NOT MATCHED
          INSERT *
        '''
        filteredDf.createOrReplaceTempView(f"SOME MERGE TABLE")
        filteredDf._jdf.sparkSession().sql(sql)
    df.unpersist()

query = (spark
         .readStream
         .format("delta")
         .load(f"/my-stream-all-topics-in-one-but-partitioned")
         .writeStream
         .format("delta")
         .foreachBatch(batchProcessor)
         .start())
Is there any good way to essentially demultiplex a stream like this? It's already partitioned, so I assume the query planner isn't doing too much redundant work, but it seems like there's a huge amount of overhead nonetheless.
I ran a bunch of benchmarks, and option 2 is more efficient. I don't entirely know why yet.
Ultimately, performance still wasn't what I wanted - each topic runs in order, no matter the size, so a single record on each topic would lead the FIFO scheduler to queue up a lot of very inefficient small operations. I solved that using parallelisation:
import threading

def writeTable(table, df, poolId, sc):
    sc.setLocalProperty("spark.scheduler.pool", poolId)
    df.write.mode('append').format('delta').saveAsTable(table)
    sc.setLocalProperty("spark.scheduler.pool", None)

def processBatch(df, batchId):
    df.cache()
    dfsToWrite = {}
    for row in df.select('table').distinct().collect():
        table = row.table
        filteredDf = df.filter(f"table = '{table}'")
        dfsToWrite[table] = filteredDf

    threads = []
    for table, tableDf in dfsToWrite.items():
        # one scheduler pool per table so the writes can run concurrently
        threads.append(threading.Thread(target=writeTable, args=(table, tableDf, table, spark.sparkContext)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    df.unpersist()
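For completeness, a sketch of how processBatch would be wired up to the combined stream (the source path follows the question; the checkpoint location is a placeholder). Note that the scheduler pools only take effect when the cluster is running with the fair scheduler (spark.scheduler.mode set to FAIR).
query = (spark
         .readStream
         .format("delta")
         .load("/my-stream-all-topics-in-one-but-partitioned")
         .writeStream
         .option("checkpointLocation", "/some/checkpoint/path")  # placeholder
         .foreachBatch(processBatch)
         .start())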
I have to read hundreds of avro files in Databricks from an Azure Data Lake Gen2, extract the data from the Body field inside every file, and concatenate all the extracted data into a single dataframe. The point is that all the avro files to read are stored in different subdirectories in the lake, following the pattern:
root/YYYY/MM/DD/HH/mm/ss.avro
This forces me to loop the ingestion and selection of data. I'm using this Python code, in which list_avro_files is the list of paths to all files:
from functools import reduce
from pyspark.sql import DataFrame

list_data = []
for file_avro in list_avro_files:
    df = spark.read.format('avro').load(file_avro)
    data1 = spark.read.json(df.select(df.Body.cast('string')).rdd.map(lambda x: x[0]))
    list_data.append(data1)

data = reduce(DataFrame.unionAll, list_data)
Is there any way to do this more efficiently? How can I parallelize/speed up this process?
As long as your list_avro_files can be expressed through standard wildcard syntax, you can probably use Spark's own ability to parallelize the read operation. All you'd need is to specify a base path and a filename pattern for your avro files:
scala> var df = spark.read
         .option("basePath", "/user/hive/warehouse/root")
         .format("avro")
         .load("/user/hive/warehouse/root/*/*/*/*.avro")
And, in case you find that you need to know exactly which file any given row came from, use the input_file_name() built-in function to enrich your dataframe:
scala> df = df.withColumn("source",input_file_name())
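Since the question's code is in Python, roughly the same idea in PySpark might look like the sketch below; the glob pattern mirrors the root/YYYY/MM/DD/HH/mm/ss.avro layout from the question, and "root" stands in for the actual container path:
from pyspark.sql.functions import col, input_file_name

# One distributed read over all files instead of a Python loop over paths.
df = (spark.read.format('avro')
      .load('root/*/*/*/*/*/*.avro'))   # matches root/YYYY/MM/DD/HH/mm/ss.avro

# Optionally keep track of which file each row came from.
df = df.withColumn('source', input_file_name())

# The Body extraction from the question can then run once over the whole dataframe.
data = spark.read.json(df.select(col('Body').cast('string')).rdd.map(lambda r: r[0]))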
Spark version - 2.2.1.
I've created a bucketed table with 64 buckets, and I'm executing an aggregation query: select t1.ifa, count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa. I can see 64 tasks in the Spark UI, which utilize just 4 executors (each executor has 16 cores) out of 20. Is there a way I can scale out the number of tasks, or is that how bucketed queries should run (the number of running cores equals the number of buckets)?
Here's the create table:
sql("""CREATE TABLE level_1 (
bundle string,
date_ date,
hour SMALLINT)
USING ORC
PARTITIONED BY (date_ , hour )
CLUSTERED BY (ifa)
SORTED BY (ifa)
INTO 64 BUCKETS
LOCATION 'XXX'""")
Here's the query:
sql(s"select t1.ifa,count(*) from $tblName t1 where t1.date_ = '2018-01-01' group by ifa").show
With bucketing, the number of tasks == the number of buckets, so you should be aware of the number of cores/tasks that you need/want to use and then set that as the number of buckets.
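For illustration, a minimal PySpark sketch of creating the bucketed table through the DataFrameWriter API, assuming df already holds the data with the columns from the DDL above; the bucket count below is just a placeholder chosen to match the read parallelism you want:
num_buckets = 256  # placeholder: pick this to match the parallelism you want at read time

(df.write
   .format("orc")
   .partitionBy("date_", "hour")
   .bucketBy(num_buckets, "ifa")
   .sortBy("ifa")
   .option("path", "XXX")      # same external location as in the DDL
   .saveAsTable("level_1"))    # bucketBy only works together with saveAsTable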
num of tasks == num of buckets is probably the most important and under-discussed aspect of bucketing in Spark. Buckets (by default) are historically solely useful for creating "pre-shuffled" dataframes which can optimize large joins. When you read a bucketed table, all of the files for each bucket are read by a single Spark task (30 buckets = 30 Spark tasks when reading the data), which allows the table to be joined to another table bucketed on the same columns with the same number of buckets. I find this behavior annoying and, like the user above mentioned, problematic for tables that may grow.
You might be asking yourself now: why and when in the world would I ever want to bucket, and when will my real-world data grow in exactly the same way over time? (You probably partitioned your big data by date, be honest.) In my experience you probably don't have a great use case for bucketing tables in the default Spark way. BUT ALL IS NOT LOST FOR BUCKETING!
Enter "bucket-pruning". Bucket pruning only works when you bucket ONE column but is potentially your greatest friend in Spark since the advent of SparkSQL and Dataframes. It allows Spark to determine which files in your table contain specific values based on some filter in your query, which can MASSIVELY reduce the number of files spark physically reads, resulting in hugely efficient and fast queries. (I've taken 2+hr queries down to 2 minutes and 1/100th of the Spark workers). But you probably don't care because of the # of buckets to tasks issue means your table will never "scale-up" if you have too many files per bucket, per partition.
Enter Spark 3.2.0. There is a new feature coming that will allow bucket pruning to stay active even when you disable bucket-based reading, allowing you to distribute the Spark read while still pruning buckets. I also have a trick for doing this with Spark < 3.2, as follows.
(Note: the leaf scan for files that vanilla spark.read does on S3 is added overhead, but if your table is big it doesn't matter, because your bucket-optimized table will now be read in a distributed way across all your available Spark workers and will scale.)
val table = "ex_db.ex_tbl"
val target_partition = "2021-01-01"
val bucket_target = "valuex"
val bucket_col = "bucket_col"
val partition_col = "date"
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.execution.FileSourceScanExec
import org.apache.spark.sql.execution.datasources.{FileScanRDD,FilePartition}
val df = spark.table(table).where((col(partition_col) === lit(target_partition)) && (col(bucket_col) === lit(bucket_target)))
val sparkplan = df.queryExecution.executedPlan
val scan = sparkplan.collectFirst { case exec: FileSourceScanExec => exec }.get
val rdd = scan.inputRDDs.head.asInstanceOf[FileScanRDD]
val bucket_files = for {
  FilePartition(bucketId, files) <- rdd.filePartitions
  f <- files
} yield s"$f".replaceAll("path: ", "").split(",")(0)
val format = bucket_files(0).split("\\.").last
val result_df = spark.read.option("mergeSchema", "False").format(format).load(bucket_files:_*).where(col(bucket_col) === lit(bucket_target))