Azure Synapse serverless doesn't read Delta - azure

I have around 90 Delta views on Synapse serverless; about 90% of them work flawlessly, but some don't. Databricks and Hive show all results correctly, but on Synapse I get an error message and no rows when I try to read the Delta output. If I write the same view using .parquet I don't get the error and all rows come back, and when I limit the Delta write to 1000 rows Synapse can show the rows. Any clue?
The message returned is: "Total size of data scanned is 0 megabytes, total size of data moved is 0 megabytes, total size of data written is 0 megabytes."
Synapse Apache Spark version: 3.1
Synapse Python version: 3.8
Synapse Scala version: 2.12.10
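To make the comparison concrete, here is a minimal PySpark sketch of the three write paths I tried; the view name and output paths are hypothetical placeholders:
df = spark.table("my_delta_view")  # placeholder name for one of the failing views

# 1) Delta write: Synapse serverless returns the message above and no rows
df.write.format("delta").mode("overwrite").save("/mnt/out/delta_copy")

# 2) Parquet write of the same data: Synapse serverless returns all rows
df.write.format("parquet").mode("overwrite").save("/mnt/out/parquet_copy")

# 3) Delta write limited to 1000 rows: Synapse serverless shows the rows
df.limit(1000).write.format("delta").mode("overwrite").save("/mnt/out/delta_small")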

Related

Delta Live Tables pipeline running time

New to Databricks Delta Live Tables. Set up my first pipeline to ingest a single 26 MB CSV file from an Azure blob using the following code:
import dlt

@dlt.table(
    comment="this is a test"
)
def accounts():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/mnt/mntname/")
    )
It's been running for 24 minutes on product edition Advanced with a Spark cluster config of Runtime 10.4, 42 GB active memory, 12 cores, and 2.25 active DBU/hour.
Is this normal? It seems very slow for such a small workload.

Performance tuning for PySpark data load from a non-partitioned hive table to a partitioned hive table

We have a requirement to ingest data from a non-partitioned EXTERNAL Hive table work_db.customer_tbl into a partitioned EXTERNAL Hive table final_db.customer_tbl through PySpark; this was previously done through a Hive query. The final table is partitioned by the column load_date (the format of the load_date column is yyyy-MM-dd).
So we have a simple PySpark script which uses an insert query (the same as the Hive query used earlier) to ingest the data via the spark.sql() command. But we have some serious performance issues: after ingestion, the target table has around 3000 partitions, and each partition has around 4 MB of data except for the last one, which is around 4 GB. The total table size is nearly 15 GB. Also, after ingestion each partition has 217 files. The final table is a snappy-compressed Parquet table.
The source work table has a single 15 GB file, with a filename in the format customers_tbl_unload.dat.
Earlier, when we were running the Hive query over a beeline connection, it usually took around 25-30 minutes to finish. Now, using the PySpark script, it takes around 3 hours.
How can we tune Spark so the ingestion takes less time than it did through beeline?
The configuration of the YARN queue we use is:
Used Resources: <memory:5117184, vCores:627>
Demand Resources: <memory:5120000, vCores:1000>
AM Used Resources: <memory:163072, vCores:45>
AM Max Resources: <memory:2560000, vCores:500>
Num Active Applications: 45
Num Pending Applications: 45
Min Resources: <memory:0, vCores:0>
Max Resources: <memory:5120000, vCores:1000>
Reserved Resources: <memory:0, vCores:0>
Max Running Applications: 200
Steady Fair Share: <memory:5120000, vCores:474>
Instantaneous Fair Share: <memory:5120000, vCores:1000>
Preemptable: true
The parameters passed to the PySpark script are:
num-executors=50
executor-cores=5
executor-memory=10GB
PySpark code used:
insert_stmt = """INSERT INTO final_db.customers_tbl PARTITION(load_date)
SELECT col_1,col_2,...,load_date FROM work_db.customer_tbl"""
spark.sql(insert_stmt)
Even though we are using nearly 10% of the YARN queue's resources, the job still takes this long. How can we tune the job to make it more efficient?
You need to re-analyze your dataset and check whether partitioning on the date column is the right approach, or whether you should instead be partitioning on, say, year.
To understand why you end up with 200-plus files in each partition, you need to understand the difference between Spark partitions and Hive partitions.
A direct approach to try first is to read your input dataset as a dataframe, partition it by the key you plan to use as the Hive partition key, and then save it using df.write.partitionBy (see the sketch below).
Since the data also appears to be skewed on the date column, try partitioning on additional columns that have a more even distribution of data; otherwise, filter out the skewed data and process it separately.
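A rough PySpark sketch of that direct approach, using the table names from the question; the output path is a hypothetical placeholder for the external table location, and MSCK REPAIR is just one way to surface the new partitions in the metastore:
src_df = spark.table("work_db.customer_tbl")

# Repartition by the partition key so all rows for a given load_date land in
# one task, which yields one file per partition instead of ~200
out_df = src_df.repartition("load_date")

out_df.write \
    .mode("overwrite") \
    .format("parquet") \
    .option("compression", "snappy") \
    .partitionBy("load_date") \
    .save("/path/to/final_db/customer_tbl")  # hypothetical external table location

# Make the freshly written partitions visible to Hive
spark.sql("MSCK REPAIR TABLE final_db.customer_tbl")
Note that the skewed load_date (the ~4 GB one) will still come out as one large file per task, so the skew advice above still applies.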

Convert spark dataframe to Delta table on azure databricks - warning

I am saving my Spark dataframe on Azure Databricks and creating a Delta Lake table.
It works fine; however, I am getting the warning message below during execution.
Question: why am I still getting this message even though my table is already a Delta table? What is wrong with my approach? Any input is greatly appreciated.
Warning Message
This query contains a highly selective filter. To improve the performance of queries, convert the table to Delta and run the OPTIMIZE ZORDER BY command on the table
Code
dfMerged.write \
    .partitionBy("Date") \
    .mode("append") \
    .format("delta") \
    .option("overwriteSchema", "true") \
    .save("/mnt/path..")

spark.sql("CREATE TABLE DeltaUDTable USING DELTA LOCATION '/mnt/path..'")
Some more details
I've mounted Azure Storage Gen2 to the above mount location.
Databricks Runtime: 6.4 (includes Apache Spark 2.4.5, Scala 2.11)
The warning message is misleading here, since you are already writing in the Delta format; you can ignore it. Alternatively, you can register the table in one step with saveAsTable:
df.write.mode("overwrite").saveAsTable("table_loc")
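If you did want to act on the warning rather than ignore it, OPTIMIZE with ZORDER is run as plain SQL against the Delta table; the column name below is only a hypothetical stand-in for whatever column your selective filters use (it cannot be the partition column "Date"):
spark.sql("OPTIMIZE DeltaUDTable ZORDER BY (someFilterColumn)")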

Spark SQL working on very large data set fails with file not found exception after 30 mins

We have a Spark SQL job working on a very large data set (50 GB); the SQL does many aggregations. It fails after 30 minutes saying "file not found". Is there any way we can ensure the code is optimized to reduce the network and CPU overhead?
We are working on an HDInsight cluster and have increased the timeout as well, to make sure it's not a network issue. Sample code is below. Table A is created in Parquet format.
Agg_data_op = '''
SELECT trans_cd,
       Sum(b),
       Sum(case … c),
       <50 aggregated fields>
FROM A
GROUP BY trans_cd'''
df_trans_mart = spark.sql(Agg_data_op)
df_trans_mart.write.mode("overwrite").saveAsTable("df_core_mart")

downloaded bytes from s3 of spark sql is multiple times more than hive sql

I have a Hive table on AWS S3 which contains 144 CSV-formatted files (20 MB per file); the total size is 3 GB.
When I execute a query through Spark SQL, it costs 10-15 GB of downloaded bytes (not the same every time, as counted by the AWS service), which is much more than the Hive table size.
But when I execute the same SQL through the Hive client, the downloaded bytes equal the Hive table size on S3.
The SQL is as simple as 'select count(1) from #table#'.
On the Spark UI stages tab there are almost 2k+ tasks, far more than a plain Spark RDD read produces.
So is one file being accessed by multiple tasks?
Any help will be appreciated!
This is because Spark splits one file into multiple partitions (each partition corresponds to one task), even when the file size is less than the block size (64 MB or 128 MB).
So in order to decrease the number of map tasks, you can lower the 'mapreduce.job.maps' setting (default value 2; this works for CSV but not for the ORC format; it had been changed to 80 in my mapred-site.xml).
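If you want to experiment with that from the Spark side, Hadoop input settings can be forwarded with the spark.hadoop. prefix; whether they actually reduce the split count depends on the input format, so treat this as a sketch rather than a guaranteed fix (the table name is a placeholder):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.job.maps", "2")
         # raising the minimum split size is another common way to get roughly
         # one task per 20 MB file instead of many
         .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize",
                 str(64 * 1024 * 1024))
         .enableHiveSupport()
         .getOrCreate())

spark.sql("select count(1) from my_db.my_table").show()
After the change, re-run the query and compare the task count on the Spark UI stages tab with the downloaded-bytes metric from AWS.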
