I am trying to understand how to create an external table that supports partition elimination. I can create a view with a column derived using the filepath function, but that can't be used by Spark. I can create an external table using CREATE EXTERNAL TABLE AS SELECT, but that gives me a copy of the data. This article from Microsoft implies it can be done:
The native external tables in Synapse pools are able to ignore the
files placed in the folders that are not relevant for the queries. If
your files are stored in a folder hierarchy (for example -
/year=2020/month=03/day=16) and the values for year, month, and day
are exposed as the columns, the queries that contain filters like
year=2020 will read the files only from the subfolders placed within
the year=2020 folder. The files and folders placed in other folders
(year=2021 or year=2022) will be ignored in this query. This
elimination is known as partition elimination.
The folder partition elimination is available in the native external
tables that are synchronized from the Synapse Spark pools. If you have
partitioned data set and you would like to leverage the partition
elimination with the external tables that you create, use the
partitioned views instead of the external tables.
So, how do you expose those partition directories as columns in an external table?
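For concreteness, the filepath-based view mentioned above looks roughly like this in serverless SQL; this is only a sketch, and the storage URL, container, and column names are illustrative:
-- Illustrative partitioned view over a year=/month=/day= folder hierarchy
-- (storage account, container, and schema are hypothetical).
CREATE VIEW dbo.vw_sales AS
SELECT
    r.*,
    CAST(r.filepath(1) AS INT) AS [year],   -- matches the first * in the BULK path
    CAST(r.filepath(2) AS INT) AS [month],  -- matches the second *
    CAST(r.filepath(3) AS INT) AS [day]     -- matches the third *
FROM OPENROWSET(
        BULK 'https://mystorage.dfs.core.windows.net/container/sales/year=*/month=*/day=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r;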
Related
I am working with Azure Databricks and we are moving hundreds of gigabytes of data with Spark. We stream the data with Databricks' Auto Loader from a source storage account on Azure Data Lake Gen2, process it with Databricks notebooks, then load it into another storage account. The idea is that the end result is a replica, a copy-paste of the source, but with some transformations involved.
This means that if a record is deleted at the source, we also have to delete it. If a record is updated or added, then we do that too. For the latter, Auto Loader with a file-level listener, combined with MERGE INTO inside .foreachBatch(), is an efficient solution. But what about deletions? For technical reasons (the Dynamics 365 Azure Synapse Link export is extremely limited in configuration) we are not getting delta files; we have no data on whether a given record was updated, added or deleted. We only get the full data dump every time.
To put it simply: I want to delete records in a target dataset if the record's primary key is no longer found in a source dataset. In T-SQL, MERGE can check both ways, whether there is a match in the target or the source; in Databricks this is not possible, as MERGE INTO only checks for matches against the target dataset.
Best idea so far:
DELETE FROM a WHERE NOT EXISTS (SELECT id FROM b WHERE a.id = b.id)
Occasionally a deletion job might delete millions of rows, which we have to replicate, so performance is important. What would you suggest? Are there any best practices for this?
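For reference, here is the current approach sketched in Databricks SQL; the table names target and latest_dump and the key column id are hypothetical:
-- Step 1: upsert new and changed rows from the latest full dump.
MERGE INTO target AS t
USING latest_dump AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Step 2: delete rows whose keys no longer appear in the dump (same idea as above).
DELETE FROM target t
WHERE NOT EXISTS (SELECT 1 FROM latest_dump s WHERE s.id = t.id);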
With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is, what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? The data is actually contained in files in the data lake (the data storage location) that Delta Lake runs on top of... right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are parquet tables different to CSV tables in some way? Or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.
There are a few aspects here. Your table definition is not Delta Lake; it's Spark SQL (or Hive) syntax for defining a table. It's just metadata that lets users easily use the table without knowing where it's located, what data format it uses, etc. You can read more about databases & tables in the Databricks documentation.
The actual format for data storage is specified by the USING directive. In your case it's Parquet, so when people or code read or write data, the underlying engine first reads the table metadata, figures out the location of the data and the file format, and then uses the corresponding code.
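If it helps, you can see exactly what metadata the table definition created, for example:
-- Inspect the table metadata (illustrative; output layout varies by Spark version).
DESCRIBE TABLE EXTENDED People10M;
-- The output includes a "Location" row pointing at the path from the OPTIONS clause
-- and a "Provider" row showing the format given in the USING clause (parquet here).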
Delta is another file format (really a storage layer) that is built on top of Parquet as the data format, adding capabilities such as ACID transactions, time travel, etc. (see the docs). If you want to use Delta instead of Parquet, then you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
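For example (a sketch only; the path is the one from the question, and it is assumed to be a directory of Parquet files):
-- Option 1: create a brand-new Delta table from the existing Parquet data.
CREATE TABLE People10M_Delta
USING delta
AS SELECT * FROM parquet.`/mnt/training/dataframes/people-10m.parquet`;

-- Option 2: convert the Parquet data in place by adding a Delta transaction log
-- (the existing data files are kept; only metadata is written).
CONVERT TO DELTA parquet.`/mnt/training/dataframes/people-10m.parquet`;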
We load data from on-prem database servers to Azure Data Lake Storage Gen2 using Azure Data Factory and Databricks, and store it as Parquet files. On every run, we get only the new and modified data since the last run and UPSERT it into the existing Parquet files using the Databricks MERGE statement.
Now we are trying to move this data from the Parquet files into Azure Synapse. Ideally, I would like to do this:
Read the incremental load data into an external table (CETAS or COPY INTO).
Use the above as a staging table.
Merge the staging table with the production table.
The problem is that the MERGE statement is not available in Azure Synapse. Here is the solution Microsoft suggests for incremental loads:
CREATE TABLE dbo.[DimProduct_upsert]
WITH
( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT s.[ProductKey]
, s.[EnglishProductName]
, s.[Color]
FROM dbo.[stg_DimProduct] AS s
UNION ALL
-- Keep rows that are not being touched
SELECT p.[ProductKey]
, p.[EnglishProductName]
, p.[Color]
FROM dbo.[DimProduct] AS p
WHERE NOT EXISTS
( SELECT *
FROM [dbo].[stg_DimProduct] s
WHERE s.[ProductKey] = p.[ProductKey]
)
;
RENAME OBJECT dbo.[DimProduct] TO [DimProduct_old];
RENAME OBJECT dbo.[DimProduct_upsert] TO [DimProduct];
Basically, this drops and re-creates the production table with CTAS. It will work fine with small dimension tables, but I'm apprehensive about large fact tables with hundreds of millions of rows and indexes. Any suggestions on what would be the best way to do incremental loads for really large fact tables? Thanks!
Until SQL MERGE is officially supported, the recommended way forward to update target tables is to use T-SQL INSERT/UPDATE commands between the delta records and the target table.
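A rough sketch of that INSERT/UPDATE pattern, reusing the DimProduct example from the question (adjust the columns and key to your schema):
-- Update existing rows from the staging (delta) table.
UPDATE dbo.[DimProduct]
SET    [EnglishProductName] = s.[EnglishProductName]
     , [Color]              = s.[Color]
FROM   dbo.[stg_DimProduct] AS s
WHERE  dbo.[DimProduct].[ProductKey] = s.[ProductKey];

-- Insert rows that exist only in the staging table.
INSERT INTO dbo.[DimProduct] ([ProductKey], [EnglishProductName], [Color])
SELECT s.[ProductKey], s.[EnglishProductName], s.[Color]
FROM   dbo.[stg_DimProduct] AS s
WHERE  NOT EXISTS ( SELECT 1
                    FROM dbo.[DimProduct] AS p
                    WHERE p.[ProductKey] = s.[ProductKey] );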
Alternatively, you can also use Mapping Data Flows (in ADF) to emulate SCD transactions for dimensional/fact data load.
When I run the DROP DATABASE command, Spark deletes the database directory and all its subdirectories on HDFS. How can I avoid this?
Short answer:
Unless you set up your database so that it contains only external tables that exist outside of the database HDFS directory, there is no way to achieve this without copying all of your data to another location in HDFS.
Long answer:
From the following website:
https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first:
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
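Concretely, the two forms mentioned in that excerpt look like this (the database name is illustrative):
-- Drops the database's tables first, then the database (and its HDFS directory).
DROP DATABASE IF EXISTS financials CASCADE;

-- Default behavior: fails if the database still contains tables.
DROP DATABASE IF EXISTS financials RESTRICT;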
You can copy the data to another location before dropping the database. I know it's a pain - but that's how Hive operates.
If you were trying to just drop a table without deleting the HDFS directory of the table, there's a solution for this described here: Can I change a table from internal to external in hive?
Dropping an external table preserves the HDFS location for the data.
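The conversion described in that linked answer is usually a one-liner of this form (the table name is hypothetical; in some Hive versions the property value must be the uppercase string 'TRUE'):
-- Mark a managed (internal) table as external so DROP TABLE keeps its HDFS data.
ALTER TABLE my_table SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');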
Cascading the database drop to the tables after converting them to external will not fix this, because the database drop impacts the whole HDFS directory the database resides in. You would still need to copy the data to another location.
If you create a database from scratch, each table inside of which is external and references a location outside of the database HDFS directory, dropping this database would preserve the data. But if you have it set up so that the data is currently inside of the database HDFS directory, you will not have this functionality; it's something you would have to set up from scratch.
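A minimal sketch of that setup, with hypothetical names and paths:
-- Database whose tables all point at data outside the database's own directory.
CREATE DATABASE IF NOT EXISTS reporting;

CREATE EXTERNAL TABLE reporting.events (
    event_id   BIGINT,
    event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION '/data/external/events';   -- not under .../reporting.db

-- Dropping the database now removes only metadata; /data/external/events survives.
DROP DATABASE reporting CASCADE;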
I'm designing Azure Data Factory pipelines to load data from Azure SQL DB into Azure Data Lake.
My initial load/POC was a small subset of data, and I was able to load it from the SQL tables to Azure DL.
Now there is a huge volume of tables (some with a billion-plus rows) that I want to load from SQL DB using ADF into Azure DL.
The MS docs mention two options, i.e. watermark columns and change tracking.
Let's say I have a "cust_transaction" table that has millions of rows; if I load it to the DL, it loads as "cust_transaction.txt".
Questions.
1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.
You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. And the timestamp in the file name is the end of the incremental slice (max possible insert or update time in the data) rather than just current time wherever possible (sometimes they are relatively the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and timestamp in the file name by parameterizing the folder and file in the dataset.
Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different than mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need to have a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the RAW zone, instead using data from Standardized Raw or Curated Zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.
Change tracking is a reliable way to get changes, but I don't like their naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp on it. They have Incremental - [PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline RunID. The Trigger ID is across all the packages executed in that trigger instance (batch) whereas the pipeline RunID is specific to that pipeline.
If you can't do the change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also updated or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
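If you do end up on the watermark route, the core source query is simple. Here is a sketch against the cust_transaction table from the question; the ModifiedDate column and the dbo.WatermarkTable store are hypothetical:
-- Read the last successful watermark, pick a new one, and pull the incremental slice.
DECLARE @old_watermark DATETIME2 =
    ( SELECT LastLoadDate
      FROM dbo.WatermarkTable
      WHERE TableName = 'cust_transaction' );
DECLARE @new_watermark DATETIME2 = SYSUTCDATETIME();

SELECT *
FROM dbo.cust_transaction
WHERE ModifiedDate >  @old_watermark
  AND ModifiedDate <= @new_watermark;

-- After the copy succeeds, write @new_watermark back to dbo.WatermarkTable.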
To summarize:
Load incrementally and name your incremental files intelligently
If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated Zone.