Delta Lake Table Metadata Sync on Athena - apache-spark

Now that we can query Delta Lake tables from Athena without having to generate manifest files, does Athena also take care of automatically syncing the underlying partitions? I see that MSCK REPAIR is not supported for Delta tables. Or do I need to use a Glue crawler for this?
I am confused because of these two statements from the Athena documentation:
You can use Amazon Athena to read Delta Lake tables stored in Amazon S3 directly without having to generate manifest files or run the MSCK REPAIR statement.
Athena synchronizes table metadata, including schema, partition columns, and table properties, to AWS Glue if you use Athena to create your Delta Lake table. As time passes, this metadata can lose its synchronization with the underlying table metadata in the transaction log. To keep your table up to date, you can use the AWS Glue crawler for Delta Lake tables.
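For context, the second statement is about the case where the table is registered through Athena itself. A minimal sketch of what that registration can look like (the database, table name, and S3 path below are placeholders, not from the question):

CREATE EXTERNAL TABLE my_db.delta_events
LOCATION 's3://my-bucket/delta/events/'
TBLPROPERTIES ('table_type' = 'DELTA');

With a table registered this way, Athena reads the Delta metadata directly, which matches the first quoted statement: neither manifest files nor MSCK REPAIR are needed.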

Related

Databricks Serverless Compute - writeback to Delta tables

Databricks Serverless Compute - I know this is still in preview, is available by request, and is only available on AWS.
Can it be used for both reads and writes (updates) to .delta tables, or is it read-only?
And is it a good fit for running small, transactional-style queries, or is it better to use Azure SQL for that?
Performance of small queries from Azure SQL seems faster than from Databricks.
Since Databricks has to go through the Hive metastore when querying .delta tables, will this impact performance?
According to the release notes (June 17, 2021), the new Photon executor is switched on for SQL endpoints, and it also supports writes to Delta tables (and to Parquet).
If you want to run a lot of small queries on a set of data, I'd expect Azure SQL (or operations on a Spark DataFrame taken from the Delta table) to always outperform the same thing expressed in SQL running directly against a Delta Lake table, since the latter has to negotiate the versioned Parquet files and the Delta Lake transaction log on your behalf.
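As a rough illustration of the DataFrame/table option, here is a minimal Spark SQL sketch (the table name and location are assumptions, not from the question) that registers the Delta data once and caches it for repeated small queries:

-- Register a table over an existing Delta location (names are placeholders).
CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/mnt/lake/events';
-- Cache it so repeated small lookups do not re-read the versioned Parquet
-- files and the transaction log on every query.
CACHE TABLE events;
SELECT count(*) FROM events WHERE event_type = 'login';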

Synapse Analytics sql on-demand sync with spark pool is very slow to query

I have files loaded into an Azure storage account (Gen2) and am using Azure Synapse Analytics to query them. Following the documentation here: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables, I should be able to create a Spark SQL table to query the partitioned data, and then use that metadata from Spark SQL in my SQL on-demand query, given this line in the docs: "When a table is partitioned in Spark, files in storage are organized by folders. Serverless SQL pool will use partition metadata and only target relevant folders and files for your query."
My data is partitioned in ADLS gen2 as:
Running the query in a Spark notebook in Synapse Analytics returns in just over 4 seconds, as it should given the partitioning:
However, running the same query in the SQL on-demand script never completes:
This result, an extreme reduction in performance compared to the Spark pool, is completely counter to what the documentation notes. Is there something I am missing in the query to make SQL on-demand use the partitions?
The filepath() and filename() functions can be used in the WHERE clause to filter the files to be read, which achieves the partition pruning you are looking for.
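A minimal sketch of that approach in serverless SQL (the storage path and the year=/month= folder layout are assumptions, since the actual layout was not shown):

SELECT COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/mycontainer/data/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) AS rows
-- filepath(1) and filepath(2) return the values matched by the first and
-- second wildcard, so only the matching folders are scanned.
WHERE rows.filepath(1) = '2020'
  AND rows.filepath(2) = '12';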

Does Databricks cluster need to be always up for VACUUM operation of Delta Lake?

I am using Azure Databricks with the latest runtime for my clusters. I have some confusion regarding the VACUUM operation in Delta Lake. We know we can set a retention duration for deleted data; however, for the data to actually be deleted once the retention period is over, do we need to keep the cluster up for the entire duration?
In simple words: do we need to have a cluster always in a running state in order to leverage Delta Lake?
You don't need to always keep a cluster up and running. You can schedule a vacuum job to run daily (or weekly) to clean up stale data older than the threshold. Delta Lake doesn't require an always-on cluster. All the data/metadata are stored in the storage (s3/adls/abfs/hdfs), so no need to keep anything up and running.
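For example, a scheduled job could run Spark SQL along these lines (the table name and path are placeholders; 168 hours is Delta Lake's default retention threshold):

-- Remove files no longer referenced by the Delta log and older than the threshold.
VACUUM my_delta_table RETAIN 168 HOURS;
-- The same operation addressed by path instead of table name:
VACUUM delta.`/mnt/datalake/my_delta_table` RETAIN 168 HOURS;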
Apparently you do need a cluster to be up and running in order to query the data in Databricks tables.
If you have configured an external metastore for Databricks, then you can use a wrapper such as Apache Hive by pointing it at that external metastore DB and query the data through the Hive layer without using Databricks.

Azure Data Lake - HDInsight vs Data Warehouse

I'm in a position where we're reading from our Azure Data Lake using external tables in Azure Data Warehouse.
This enables us to read from the data lake, using well known SQL.
However, another option is using Data Lake Analytics, or some variation of HDInsight.
Performance-wise, I'm not seeing much difference. I assume Data Warehouse is running some form of distributed query in the background, converting to U-SQL(?), so why would we use Data Lake Analytics with the slightly different syntax of U-SQL?
With Python scripting also available in SQL, I feel I'm missing a key purpose of Data Lake Analytics, other than the cost model (pay per batch job, rather than constant uptime of a database).
If your main purpose is to query data stored in the Azure Data Warehouse (ADW), then there is no real benefit to using Azure Data Lake Analytics (ADLA). But as soon as you have other (un)structured data stored in ADLS, such as JSON documents or CSV files, the benefit of ADLA becomes clear: U-SQL allows you to join your relational data stored in ADW with the (un)structured / NoSQL data stored in ADLS.
Also, it enables you to use U-SQL to prepare this other data for direct import into ADW, so Azure Data Factory is no longer required to get the data into your data warehouse. See this blogpost for more information:
A common use case for ADLS and SQL DW is the following. Raw data is ingested into ADLS from a variety of sources. Then ADL Analytics is used to clean and process the data into a loading ready format. From there, the high value data can be imported into Azure SQL DW via PolyBase.
..
You can import data stored in ORC, RC, Parquet, or Delimited Text file formats directly into SQL DW using the Create Table As Select (CTAS) statement over an external table.
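As a rough sketch of that last step (the table, distribution column, and external table name below are placeholders, and the PolyBase external table over the prepared files in ADLS is assumed to already exist):

-- Load the prepared ADLS data into a SQL DW table in one statement.
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT *
FROM ext.SalesPrepared;  -- external table over the cooked files in ADLS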
Please note that SQL statements in SQL Data Warehouse are currently NOT generating U-SQL behind the scenes. Also, the use cases for ADLA/U-SQL and SQL DW are different.
ADLA gives you a processing engine to do batch data preparation/cooking, generating the data to build a data mart/warehouse that you can then read interactively with SQL DW. In your example above, you seem to be doing mainly the second part. Adding "Views" on top of these EXTERNAL tables to do transformations in SQL DW will quickly run into scalability limits if you are operating on big data (and not just a few 100k rows).

HDInsight Azure Blob Storage Change

On HDInsight cluster, a Hive table is created using CREATE EXTERNAL statement:
CREATE EXTERNAL TABLE HTable(t1 string, t2 string, t3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/HTable/data/';
Then some existing files are changed, and some files are added to the Azure Blob container mentioned in the CREATE statement.
Does a new Hive query consider the changes made to the blob container without loading data into the Hive table again?
Yes, your table definition is saved in the Hive metastore. You can subsequently simply query HTable and the data will be there. Normally, Hive on HDInsight follows the same rules that apply to Hive on HDFS.
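As a minimal illustration: assuming new delimited files have been copied under the table's LOCATION path, the next query simply reads them, with no reload or repair step:

-- New and changed blobs under .../HTable/data/ are read at query time.
SELECT COUNT(*) FROM HTable;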
For a more advanced discussion, you can play some tricks, but you need to know what you're doing. Because HDInsight storage can survive the cluster's lifetime, with HDInsight it is feasible to tear down the cluster, redeploy a new HDInsight cluster, and still have the Hive data. You can even keep the Hive metastore, as it is a separate database (an Azure SQL DB). With an HDFS-based cluster, recycling the cluster leads to loss of all HDFS data.
