Delta Live Tables pipeline running time - Databricks

New to Databricks Delta Live Tables. I set up my first pipeline to ingest a single 26 MB CSV file from an Azure Blob Storage container using the following code:
import dlt

@dlt.table(
    comment="this is a test"
)
def accounts():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/mnt/mntname/")
    )
It's been running for 24 minutes on the Advanced product edition, with a Spark cluster configuration of Databricks Runtime 10.4, 42 GB active memory, 12 cores, and 2.25 active DBU/hour.
Is this normal? It seems very slow for such a small workload.
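For reference, one variation sometimes suggested for slow Auto Loader startups is supplying an explicit schema so the initial schema-inference pass over the source files is skipped. The sketch below is illustrative only and is not part of the original pipeline; the column names are hypothetical placeholders.

# Hedged sketch: same DLT table, but with an explicit schema supplied to the
# cloudFiles reader so Auto Loader does not have to infer it.
# The column names below are hypothetical placeholders.
import dlt
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

accounts_schema = StructType([
    StructField("account_id", StringType(), True),  # hypothetical column
    StructField("balance", DoubleType(), True),     # hypothetical column
])

@dlt.table(
    comment="this is a test"
)
def accounts():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .schema(accounts_schema)
        .load("/mnt/mntname/")
    )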

Related

Azure Synapse serverless doesn't read Delta

I have around 90 Delta views on Synapse serverless; 90% of them work flawlessly, but some don't. Databricks and Hive show all results correctly, but on Synapse I get an error message and no rows when I try to read the Delta output. If I write the same view using .parquet I don't get the error and I get all the rows. When I limit my Delta output to 1,000 rows and write it, Synapse can show the rows. Any clue? The message I get is:
"Total size of data scanned is 0 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes."
Synapse Apache Spark version: 3.1
Synapse Python version: 3.8
Synapse Scala version: 2.12.10
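For illustration, the Delta-vs-Parquet comparison described above looks roughly like the following sketch. The storage paths and view name are hypothetical placeholders, not taken from the original question.

# Hedged sketch of the comparison described above: the same view written once
# as Delta (which Synapse serverless then fails to read) and once as Parquet
# (which it reads fine). Paths and the view name are hypothetical placeholders.
base_path = "abfss://container@account.dfs.core.windows.net/curated"  # placeholder
df = spark.table("my_view")  # placeholder view name

# Delta output that Synapse serverless reportedly cannot read
df.write.format("delta").mode("overwrite").save(base_path + "/delta/my_view")

# Parquet output of the same data, which Synapse serverless reads without error
df.write.format("parquet").mode("overwrite").save(base_path + "/parquet/my_view")

# Limiting the Delta output to 1,000 rows reportedly also works on Synapse
df.limit(1000).write.format("delta").mode("overwrite").save(base_path + "/delta/my_view_limited")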

Method to optimize PySpark dataframe saving time

I'm running a notebook on Azure Databricks using a multi-node cluster with 1 driver and 1-8 workers (each with 16 cores and 56 GB RAM), reading the source data, about 30K records, from Azure ADLS. The notebook consists of a few transformation steps and also uses two UDFs that are necessary for the implementation. While all of the transformation steps run within 12 minutes (which is expected), it takes more than 2 hours to save the final dataframe to the ADLS Delta table. I'm providing some code snippets here (I can't provide the entire code); please suggest ways to reduce this dataframe saving time.
# All the data reading and transformation code
# Only one display statement before saving to the Delta table.
# Up to this statement it takes 12 minutes to run.
data.display()

# Persisting the dataframe
from pyspark import StorageLevel
data.persist(StorageLevel.MEMORY_ONLY)

# Finally writing the data to the Delta table
# This part takes more than 2 hours to run
# Persist Brand Extraction Output
(
    data
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema', 'true')
    .saveAsTable('output_table')
)
Another save option I tried, without much improvement:
mount_path = "/mnt/********/"
table_name = "********"
adls_path = mount_path + table_name
(data.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(adls_path))
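For comparison, one variation that is sometimes tried in this situation (a sketch on my part, not code from the question) is to materialize the persisted dataframe before the write, so the UDF-heavy transformations are not re-executed inside the write stage, and to control the number of output partitions:

# Hedged sketch, not the original code: materialize the persisted dataframe
# once (so the UDF-heavy transformations run and are cached before the write),
# then control the number of output files. The partition count of 32 is an
# arbitrary illustrative value.
from pyspark import StorageLevel

data = data.persist(StorageLevel.MEMORY_AND_DISK)
data.count()  # forces the transformations (including the UDFs) to run and be cached

(
    data
    .repartition(32)  # illustrative partition count
    .write
    .format('delta')
    .mode('overwrite')
    .option('overwriteSchema', 'true')
    .saveAsTable('output_table')
)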

How to improve Spark job performance in Databricks

I am executing the below query on a Databricks cluster and storing the result in DBFS in Parquet format.
Here set1_interim and set2_interim are also Parquet files, each with 450 million records.
spark.sql(
  s"""
  select *
  from set1_interim
  union all
  select *
  from set2_interim
  """).write.mode("overwrite").option("header", "true").option("delimiter", "\t").parquet(s"${Interimpath}/unioned_interim")
But it's taking a long time to complete the job. As the attached screenshot showed, 229 out of 230 tasks completed quickly; the last task takes hours to complete.
The DAG for this job, the task page (where I can see that single task still running), and the Show Additional Metrics view were attached as screenshots. On the executors page all executors are alive; in the running applications list I don't know what "Databricks Shell" under Name refers to.
How can I make this job run faster? My cluster configuration is 1 TB of memory and 256 cores.
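A single straggler task after a union like this is often a sign of partition or data skew. One variation that is sometimes tried (my assumption, not something from the question, and shown here as a PySpark sketch even though the original code is Scala) is to repartition the unioned result before writing so the work is spread across many tasks:

# Hedged PySpark sketch: repartition the unioned result before writing so the
# output work is spread across many tasks instead of one straggler.
# The partition count and interim_path are illustrative placeholders.
unioned = spark.sql("""
    select * from set1_interim
    union all
    select * from set2_interim
""")

(
    unioned
    .repartition(400)  # illustrative partition count
    .write
    .mode("overwrite")
    .parquet(interim_path + "/unioned_interim")  # interim_path is a placeholder
)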

Using Spark BigQuery Connector on Dataproc and data appears to be delayed by an hour

I'm using Spark 2.4 running on Dataproc and running a batch job every 15 minutes to take some data from a BQ table, aggregate it (sum), and store it in another BQ table (overwrite) via pyspark.sql.
If I query the table in Spark, the data looks like it is behind by roughly an hour; or rather, it cuts off at roughly an hour before now. If I run the exact same query against the same table in the BQ web console instead, all the data is there and up to date. Am I doing something wrong, or is this expected behavior of the connector?
Here's essentially the code I'm using:
orders_by_hour_query = """
SELECT
  _id as app_id,
  from_utc_timestamp(DATE_TRUNC('HOUR', created_at), 'America/Los_Angeles') as ts_hour,
  SUM(total_price_usd) as gmv,
  COUNT(order_id) as orders
FROM `orders`
WHERE DATE(from_utc_timestamp(created_at, 'America/Los_Angeles')) BETWEEN "2020-11-23" AND "2020-11-27"
GROUP BY 1, 2
ORDER BY 1, 2 ASC
"""

orders_df = spark.read.format("bigquery").load(bq_dataset + ".orders")
orders_df.createOrReplaceTempView("orders")
orders_by_hour_df = spark.sql(orders_by_hour_query)
EDIT: The hourly cut-off appears to be almost arbitrary. For instance, it's currently "2020-11-25 06:31 UTC", but the max timestamp returned from BQ via the Spark connector is "2020-11-25 05:56:39 UTC".
More info on that table:
Table size: 2.65 GB
Long-term storage size: 1.05 GB
Number of rows: 4,120,280
Created: Jun 3, 2020, 4:56:11 PM
Table expiration: Never
Last modified: Nov 24, 2020, 10:07:54 PM
Data location: US
Table type: Partitioned
Partitioned by: Day
Partitioned on field: created_at
Partition filter: Not required
Streaming buffer statistics:
  Estimated size: 1.01 MB
  Estimated rows: 1,393
  Earliest entry time: Nov 24, 2020, 9:57:00 PM
Thanks in advance!
It looks like the missing data might be in the streaming buffer and has not yet reached BQ storage.
This means you can query it from BQ directly, but not with the BQ Spark connector, since the connector reads over the Storage API (https://cloud.google.com/bigquery/docs/reference/storage).
As a workaround, you can try something like the code below. Since it's only an hour of data, if that data is small enough you could also simply use the BQ API directly and convert the resulting pandas dataframe to a Spark dataframe.
from google.cloud import bigquery
from pyspark import StorageLevel

def bq2df(QUERY):
    # Run the query through the BigQuery API so the result lands in a
    # destination table, then read that table with the Spark connector.
    bq = bigquery.Client()
    query_job = bq.query(QUERY)
    query_job.result()  # wait for the query to finish

    df = (
        spark.read.format('bigquery')
        .option('dataset', query_job.destination.dataset_id)
        .load(query_job.destination.table_id)
        .persist(StorageLevel.MEMORY_AND_DISK)
    )
    return df
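For the pandas route mentioned above, a minimal sketch (assuming the query result is small enough to fit in driver memory, and that the query is valid BigQuery SQL) could look like this:

# Hedged sketch of the alternative mentioned in the answer: pull the query
# result through the BigQuery client as a pandas dataframe, then convert it
# to a Spark dataframe. Only reasonable for small results.
from google.cloud import bigquery

def bq2df_via_pandas(query):
    bq = bigquery.Client()
    pandas_df = bq.query(query).to_dataframe()  # runs the query and downloads the rows
    return spark.createDataFrame(pandas_df)

# Hypothetical usage: the query passed in would need to be BigQuery SQL,
# not Spark SQL.
# recent_orders_df = bq2df_via_pandas(some_bq_sql)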

Spark SQL working on very large data set fails with file not found exception after 30 mins

We have a Spark SQL job working on a very large data set (50 GB); the SQL does many aggregations. It fails after 30 minutes saying "file not found". Is there any way we can ensure the code is optimized to help reduce the network and CPU overhead?
We are working on an HDInsight cluster, and we have increased the timeout as well to ensure that it's not a network issue. Sample code is below. Table A is created in Parquet format.
Agg_data_op = '''
    SELECT trans_cd,
           Sum(b),
           Sum(case … c),
           <50 aggregated fields>
    FROM A
    GROUP BY trans_cd
'''

df_trans_mart = spark.sql(Agg_data_op)
df_trans_mart.write.mode("overwrite").saveAsTable("df_core_mart")
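One mitigation sometimes tried for "file not found" failures in long-running jobs over Parquet tables (my assumption, not something stated in the question) is to refresh Spark's cached metadata for the source table before running the aggregation, in case the underlying files were replaced while the job was running:

# Hedged sketch: refresh the cached file listing for table A before the
# aggregation, so a stale listing doesn't cause FileNotFoundException when
# the underlying Parquet files have been rewritten.
spark.sql("REFRESH TABLE A")

df_trans_mart = spark.sql(Agg_data_op)
df_trans_mart.write.mode("overwrite").saveAsTable("df_core_mart")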
