I'm having difficulty referencing a Delta table to perform an upsert/merge on it after creating it. Doing it via PySpark with the typical dataframe.write.format("delta") approach works fine. But when I manually create a table with the Delta table builder API's create syntax,
deltaTable = (
    DeltaTable.createIfNotExists(spark)
    .location("/path/to/table")
    .tableName("table")
    .addColumn("id", dataType="STRING")
    # ... further addColumn calls ...
    .execute()
)
I can see that the folder exists in storage as expected, and I can verify that it's a Delta table using DeltaTable.isDeltaTable(spark, tablePath).
The problem comes when I run someTable = DeltaTable.forPath(spark, tablePath); I get an error indicating that
pyspark.sql.utils.AnalysisException: A partition path fragment should be the form like 'part1=foo/part2=bar'
Whether I do or don't explicitly partition the table in the create statement doesn't seem to matter. I am trying to read the whole table, not a single partition.
So the question is, how do I reference the table correctly to load and manage it?
I'm using Azure Data Lake Gen 2 blob storage, though I'm not sure that's part of the issue.
In case it's part of the issue, the full path I use for location is abfss://container_name@storage_account_name.dfs.core.windows.net/blobContainerName/delta/tables/nws, where nws has business meaning.
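For completeness, this is roughly what I'm trying to do once the table can be referenced. It's only a sketch: updates_df is a placeholder for the DataFrame holding the rows I want to upsert.
from delta.tables import DeltaTable

# tablePath is the abfss path above; updates_df is a placeholder DataFrame
someTable = DeltaTable.forPath(spark, tablePath)   # this is the call that fails

(someTable.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())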
I am working with Azure Databricks, and we are moving hundreds of gigabytes of data with Spark. We stream the data with Databricks' Auto Loader from a source storage account on Azure Data Lake Gen2, process it with Databricks notebooks, then load it into another storage account. The idea is that the end result is a replica, a copy-paste of the source, but with some transformations involved.
This means that if a record is deleted at the source, we also have to delete it. If a record is updated or added, we do that too. For the latter, Auto Loader with a file-level listener, combined with MERGE INTO inside .foreachBatch(), is an efficient solution. But what about deletions? For technical reasons (the Dynamics 365 Azure Synapse Link export is extremely limited in configuration) we are not getting delta files; we have no data on whether a given record was updated, added, or deleted. We only get a full data dump every time.
To put it simply: I want to delete records in a target dataset if the record's primary key is no longer found in the source dataset. In T-SQL, MERGE can check both ways, whether there is a match on the target or on the source; in Databricks this is not possible, as MERGE INTO only matches against the target dataset.
Best idea so far:
DELETE FROM a WHERE NOT EXISTS (SELECT id FROM b WHERE a.id = b.id)
Occasionally a deletion job might delete millions of rows, which we have to replicate, so performance is important. What would you suggest? Are there any best practices for this?
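One direction I'm considering, shown as a sketch only: fold the deletes into the MERGE we already run in .foreachBatch(). The names target_table and full_dump_df are placeholders, id is assumed to be the primary key, and the WHEN NOT MATCHED BY SOURCE clause requires a recent runtime (Delta Lake 2.3+ / Databricks Runtime 12.1+).
from delta.tables import DeltaTable

# target_table is the replica Delta table, full_dump_df is the latest full export
target = DeltaTable.forName(spark, "target_table")

(target.alias("t")
    .merge(full_dump_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .whenNotMatchedBySourceDelete()   # deletes rows whose id no longer exists in the source dump
    .execute())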
With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is: what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in the data lake (the data storage location) that Delta Lake runs on top of... right? Are tables some kind of abstraction that lets us access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are Parquet tables different from CSV tables in some way, or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.
There are a few aspects here. Your table definition is not Delta Lake; it's Spark SQL (or Hive) syntax for defining a table. It's just metadata that lets users work with the table without knowing where it's located, what the data format is, etc. You can read more about databases & tables in the Databricks documentation.
The actual format for data storage is specified by the USING directive. In your case it's Parquet, so when people or code read or write data, the underlying engine first reads the table metadata, figures out the location of the data and the file format, and then uses the corresponding reader or writer.
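For instance, both of these return the same data; the first goes through the metastore lookup just described, the second reads the files directly (a sketch using the path from the question):
# Metastore lookup: Spark resolves "People10M" to its location and format first
df_by_name = spark.table("People10M")

# Direct read: same files, but you have to know the path and format yourself
df_by_path = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")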
Delta is another file format (really a storage layer) that is built on top of Parquet as the data format, but adds capabilities such as ACID transactions, time travel, etc. (see the docs). If you want to use Delta instead of Parquet, you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
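For example (a sketch: the path comes from the question, and People10M_delta is just an illustrative name):
# Convert the existing Parquet files in place into a Delta table
spark.sql("CONVERT TO DELTA parquet.`/mnt/training/dataframes/people-10m.parquet`")

# Or create a brand-new table in Delta format
spark.sql("""
  CREATE TABLE People10M_delta
  USING delta
  AS SELECT * FROM People10M
""")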
I created a table in a storage account and I can insert records into it with no problem.
I decided to use this type of table so I could share its URI (HTTPS) for another system to consume, and since it is a NoSQL table, it gives me the flexibility to adapt what I need to store in it.
The inconvenience is that I must truncate this table every time information is processed, and the Data Factory option that sets the insert mode (Replace or Merge) does not work; it always performs an append.
I tried to do it from Databricks, but I don't know how to reference that table, since it lives outside Blob Storage and cannot be mounted as such. Any ideas?
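From what I've read, something like the following might work from a notebook, but I haven't been able to verify it. This is only a sketch: the connection string and the table name MyTable are placeholders, and it uses the azure-data-tables package.
from azure.data.tables import TableClient

# Placeholders: storage account connection string and the target table name
table_client = TableClient.from_connection_string(connection_string, table_name="MyTable")

# "Truncate" the table by deleting every entity before Data Factory writes the new batch
for entity in list(table_client.list_entities()):
    table_client.delete_entity(partition_key=entity["PartitionKey"], row_key=entity["RowKey"])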
[Screenshots: the Azure Table Storage table, the Azure Data Factory sink settings showing "Tipo de Inserción" (Insert Type) set to "Reemplazar" (Replace), the table data, and the Data Factory configuration.]
How can I configure this so that I can delete the data?
Thanks a lot.
Greetings.
I have a Databricks notebook in which I currently create a view based on several Delta tables, then update some of those same Delta tables based on this view. However, I'm getting incorrect results, because as the Delta tables change, the data in the view changes too. What I effectively need is to take a snapshot of the data at the point the notebook starts to run, which I can then use throughout the notebook, akin to a SQL temporary table. Currently I'm working around this by persisting the data into a table and dropping the table at the end of the notebook, but I wondered if there was a better solution?
The section "Pinned view of a continuously updating Delta table across multiple downstream jobs" contains the following example code:
version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY my_table)")\
    .collect()
# Will use the latest version of the table for all operations below
data = spark.table("my_table@v%s" % version[0][0])
data.where("event_type = e1").write.jdbc("table1")
data.where("event_type = e2").write.jdbc("table2")
...
data.where("event_type = e10").write.jdbc("table10")
We load data from on-prem database servers to Azure Data Lake Storage Gen2 using Azure Data Factory, and Databricks stores it as Parquet files. On every run, we get only the new and modified data since the last run and UPSERT it into the existing Parquet files using the Databricks MERGE statement.
Now we are trying to move this data from the Parquet files into Azure Synapse. Ideally, I would like to do this:
1. Read the incremental load data into an external table (CETAS or COPY INTO).
2. Use the above as a staging table.
3. Merge the staging table with the production table.
The problem is that the MERGE statement is not available in Azure Synapse. Here is the solution Microsoft suggests for incremental loads:
CREATE TABLE dbo.[DimProduct_upsert]
WITH
( DISTRIBUTION = HASH([ProductKey])
, CLUSTERED INDEX ([ProductKey])
)
AS
-- New rows and new versions of rows
SELECT s.[ProductKey]
, s.[EnglishProductName]
, s.[Color]
FROM dbo.[stg_DimProduct] AS s
UNION ALL
-- Keep rows that are not being touched
SELECT p.[ProductKey]
, p.[EnglishProductName]
, p.[Color]
FROM dbo.[DimProduct] AS p
WHERE NOT EXISTS
( SELECT *
FROM [dbo].[stg_DimProduct] s
WHERE s.[ProductKey] = p.[ProductKey]
)
;
RENAME OBJECT dbo.[DimProduct] TO [DimProduct_old];
RENAME OBJECT dbo.[DimProduct_upsert] TO [DimProduct];
Basically, this drops and re-creates the production table with CTAS. That will work fine for small dimension tables, but I'm apprehensive about large fact tables with hundreds of millions of rows and indexes. Any suggestions on the best way to do incremental loads for really large fact tables? Thanks!
Until SQL MERGE is officially supported, the recommended way forward to update target tables is to use T-SQL INSERT/UPDATE commands between the delta records and the target table.
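A minimal sketch of that pattern, run from Python with pyodbc against the dedicated SQL pool (assumptions: synapse_connection_string is a placeholder, the table and column names follow the DimProduct example above, and depending on the pool's T-SQL surface you may need to adjust the UPDATE's join syntax):
import pyodbc

conn = pyodbc.connect(synapse_connection_string)  # placeholder connection string
cursor = conn.cursor()

# 1) Update rows that already exist in the target from the staged delta records
cursor.execute("""
    UPDATE dbo.DimProduct
    SET    EnglishProductName = s.EnglishProductName,
           Color              = s.Color
    FROM   dbo.stg_DimProduct s
    WHERE  dbo.DimProduct.ProductKey = s.ProductKey
""")

# 2) Insert rows that are new
cursor.execute("""
    INSERT INTO dbo.DimProduct (ProductKey, EnglishProductName, Color)
    SELECT s.ProductKey, s.EnglishProductName, s.Color
    FROM   dbo.stg_DimProduct s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimProduct p WHERE p.ProductKey = s.ProductKey)
""")

conn.commit()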
Alternatively, you can also use Mapping Data Flows (in ADF) to emulate SCD transactions for dimensional/fact data loads.